Data cleaning and preparation (Cont'd)

Stefan Sobernig


October 27th, 2020

Unit 4


Data issues: Running example

Looking into last week's EUROSTAT dataset:

Data issues: Running example (cont'd)

Question.

What if we want a more complete set of cities?

Another dataset: UN Data

Downloaded dataset as CSV: ./data/UNdata_Export_20171106_004028652.csv

Let's look into this dataset... and think about the following question:

Data issues: Running example (cont'd)

Issues in the UN data file for instance:

Missing data (1)

Missing data (2)

Is there an underlying reason for data being missing?

The imagined "process" having produced the missing data is independent of the (other) variables in a dataset.

Exemplary "processes":

Generally: The probability of a value being missing is unrelated to the value itself or to any other variable in the dataset. The missing data is therefore said to be missing completely at random (MCAR).
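As a minimal sketch (made-up city/population values, not the EUROSTAT data; NumPy and pandas assumed available), MCAR missingness can be simulated by dropping values with a fixed probability that ignores the data entirely:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# hypothetical city-population table (illustrative values only)
df = pd.DataFrame({
    "city": list("ABCDEFGHIJ"),
    "population": rng.integers(50_000, 2_000_000, size=10).astype(float),
})

# MCAR: every cell has the same chance of going missing, regardless of
# the value itself or of any other column
mcar_mask = rng.random(len(df)) < 0.2
df.loc[mcar_mask, "population"] = np.nan
```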

Missing data (3)

The imagined "process" having produced the missing data is conditional, i.e. dependent on the (other) variables in a dataset.

Exemplary "processes":

Generally: The probability of a value being missing is unrelated to the value itself, but it is associated with other variables in the dataset. The missing data is therefore said to be missing at random (MAR).

Missing data (4)

The imagined "process" having produced the missing data is conditional on the levels of the missing values themselves.

Exemplary "processes":

Generally: The probability of a value being missing is then related to the (expected) value itself. The missing data is therefore said to be missing not at random (MNAR).
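To see why MNAR is problematic, consider this sketch (hypothetical income data, not from the running example): when high values are more likely to be missing, statistics computed on the observed values alone are biased:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# hypothetical income data (illustrative only)
income = rng.lognormal(mean=10, sigma=0.5, size=1000)
df = pd.DataFrame({"income": income})

# MNAR: high earners are more likely to withhold their income, so the
# probability of missingness depends on the (unobserved) value itself
p_missing = np.where(income > np.quantile(income, 0.8), 0.6, 0.05)
df.loc[rng.random(len(df)) < p_missing, "income"] = np.nan

# the observed mean is now biased downwards relative to the complete data
print(income.mean(), df["income"].mean())
```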

Missing data (5): Handling missings

Notice.

In the case of `urb_cpop1.csv`, this would delete all data!
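The effect can be reproduced on a toy table shaped like the Urban Audit extract (illustrative numbers): once every row contains at least one missing value, listwise deletion via pandas' `dropna` removes everything:

```python
import numpy as np
import pandas as pd

# toy cities x years table where every city has at least one missing year
df = pd.DataFrame({"2014": [100.0, np.nan, 120.0],
                   "2015": [np.nan, 95.0, 121.0],
                   "2016": [104.0, 96.0, np.nan]},
                  index=["city A", "city B", "city C"])

complete_cases = df.dropna()   # listwise deletion: keep complete rows only
print(len(complete_cases))     # -> 0: no row survives
```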

Missing data (6): Handling missings

Single imputation (a.k.a. single substitution):

Missing data (7): Handling missings

Single imputation (cont'd)
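Common single-imputation tactics can be sketched in pandas (toy series with illustrative values): replace by the observed mean, carry the last observation forward (LOCF), or interpolate between neighbours:

```python
import numpy as np
import pandas as pd

s = pd.Series([100.0, np.nan, 104.0, np.nan, 110.0])

mean_imputed = s.fillna(s.mean())   # replace by the observed mean
locf_imputed = s.ffill()            # last observation carried forward (LOCF)
interpolated = s.interpolate()      # linear interpolation between neighbours
```

Each variant produces a single "complete" dataset; none of them reflects the uncertainty about the imputed values.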

Missing data (8): Handling missings

Multiple imputation (procedure)

  1. Starting from the incomplete dataset, create multiple (>1) complete datasets.
  2. These multiple, imputed datasets are created by replacing the missing values by plausible values.
  3. Plausible values are computed for each missing value based on a pool of possible values specific for each missing value (hot decks, predictive mean matching)
  4. The subsequent analysis is then performed for each imputed dataset (e.g., descriptive or inferential statistics).
  5. The final result (e.g., combined statistic) is then computed ("pooled") from the individual results per imputed dataset.

Main benefit: The individual results for each dataset will differ from each other, and one captures these differences systematically to express the uncertainty in the imagined ("real") dataset.
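The procedure above can be sketched as follows (a deliberately simplified hot-deck-style draw from the pool of observed values; real implementations such as MICE use richer imputation models):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
s = pd.Series([100.0, np.nan, 104.0, np.nan, 110.0, 95.0, np.nan, 102.0])
observed = s.dropna().to_numpy()

m = 5                                  # number of imputed datasets
estimates = []
for _ in range(m):
    imputed = s.copy()
    # draw a plausible value for each missing entry from the observed pool
    imputed[imputed.isna()] = rng.choice(observed, size=imputed.isna().sum())
    estimates.append(imputed.mean())   # per-dataset analysis (here: the mean)

pooled = np.mean(estimates)            # pooled point estimate
between = np.var(estimates, ddof=1)    # between-imputation variance: uncertainty
```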

Missing data (9): Handling missings

Hot-deck multiple imputation (ex.):
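A minimal hot-deck sketch (hypothetical country/population values): each missing value receives a value drawn from donor records that share the same categorical predictor:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# hypothetical data: impute missing population by drawing a donor value
# from records with the same country (the categorical predictor)
df = pd.DataFrame({"country": ["AT", "AT", "AT", "DE", "DE", "DE"],
                   "population": [50.0, 55.0, np.nan, 80.0, np.nan, 85.0]})

for idx in df.index[df["population"].isna()]:
    donors = df.loc[(df["country"] == df.loc[idx, "country"])
                    & df["population"].notna(), "population"]
    df.loc[idx, "population"] = rng.choice(donors.to_numpy())
```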





Missing data (10): Handling missings

Hot-deck multiple imputation (ex., cont'd):





Limitation: Restricted to categorical variables as predictors; more generally: "predictive mean matching" (PMM)

Missing data (11)

A small demo on our running example: How can we deal with missing population values in the Urban Audit data set?

See LOCF applied in this notebook: ./notebooks/missing-dupes.ipynb
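For reference, LOCF on a city-by-year table boils down to a row-wise forward fill (illustrative values, not the actual Urban Audit numbers):

```python
import numpy as np
import pandas as pd

# cities x years; each missing year takes the most recent earlier value
pop = pd.DataFrame({"2014": [1_800_000.0, 550_000.0],
                    "2015": [np.nan, 560_000.0],
                    "2016": [1_840_000.0, np.nan]},
                   index=["Wien", "Graz"])   # made-up figures

locf = pop.ffill(axis=1)   # carry the last observation forward across columns
```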

Duplicate data (1)

Question.

What do you consider common causes for duplicate data? Have you experienced occurrences of duplicate data in your assignment-2 datasets?

Duplicate data (2)

Two kinds of causes:

Duplicate data (3)

Examples of intra-source duplicates:

Duplicate data (4)

Examples of inter-source duplicates:

Duplicate data (5)

Challenges:

Duplicate data (6a)

A (prototypical) deduplication procedure:

  1. Data preparation:
  2. Search-space reduction: How can we reduce the number of comparisons?
  3. Comparison-function selection:
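The three steps can be sketched end-to-end (toy city names; the first-letter blocking key and the similarity threshold are arbitrary illustrative choices):

```python
import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({"city": ["Vienna", "vienna ", "Wien", "Graz", "Gratz"]})

# 1. data preparation: normalise case and whitespace
df["norm"] = df["city"].str.strip().str.lower()

# 2. search-space reduction: only compare records sharing a blocking key
df["block"] = df["norm"].str[0]          # e.g. first letter as blocking key

# 3. comparison function: string similarity on the normalised names
def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a, b).ratio() >= threshold

candidates = []
for _, group in df.groupby("block"):
    rows = group.reset_index()
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if similar(rows.loc[i, "norm"], rows.loc[j, "norm"]):
                candidates.append((rows.loc[i, "city"], rows.loc[j, "city"]))
```

Note that this naive blocking key misses "Wien"/"Vienna": true duplicates with very different representations are exactly what makes inter-source deduplication hard.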

Duplicate data (6b)

A (prototypical) deduplication procedure (cont'd):

  4. (Manual) Inspection of the candidate duplicates to make a match decision: All pairs of data objects are grouped into either:
  5. Verification of the quality of duplicate detection (if not ok, start all over from step 2)

Duplicate data (7)

Systematically shrinking the search space, which otherwise comprises \( (n^2-n)/2 \) pairwise comparisons.
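A quick back-of-the-envelope computation shows why this matters (plain Python; the block count of 100 is an arbitrary illustrative choice):

```python
# naive pairwise comparison grows quadratically in the number of records n
def n_pairs(n):
    return (n**2 - n) // 2

print(n_pairs(1_000))      # 499,500 comparisons
print(n_pairs(100_000))    # ~5 * 10**9 comparisons

# partitioning the records into k equally sized blocks and comparing only
# within blocks cuts the work roughly by a factor of k
k = 100
print(k * n_pairs(100_000 // k))   # ~5 * 10**7 comparisons
```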

Duplicate data (8): Ex. inter-source duplicates

Example:

Question.

What's the number of inhabitants of 'London'? What issues can you run into answering this question?

Candidate issues:

Duplicate data (9): Ex. inter-source duplicates

Candidate tactics:

--> Example in Python - Dealing with Duplicates in city population data ./notebooks/missings-dupes.ipynb
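One such tactic, sketched with made-up population figures: map the city names from both sources to a canonical key (normalisation plus a hand-curated alias table) before merging:

```python
import pandas as pd

# hypothetical extracts from two sources describing the same cities
eurostat = pd.DataFrame({"city": ["Wien", "Graz"],
                         "pop_2016": [1_840_000, 280_000]})
un = pd.DataFrame({"city": ["VIENNA", "GRAZ"],
                   "pop_2016": [1_867_000, 283_000]})

aliases = {"wien": "vienna"}   # hand-curated alias table (illustrative)

def canonical(name):
    n = name.strip().lower()
    return aliases.get(n, n)

eurostat["key"] = eurostat["city"].map(canonical)
un["key"] = un["city"].map(canonical)
merged = eurostat.merge(un, on="key", suffixes=("_eurostat", "_un"))
```

The merged table then exposes the conflicting population values side by side, ready for a resolution rule (e.g., prefer one source, or keep both with provenance).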

Data outliers (1)

Data outliers are values that deviate markedly from the other values of the variable.

Whether a value counts as an outlier always depends on the concrete (research, analysis) setting!

Question.

What approaches could we use to detect outliers in 1-dimensional data?

Data outliers (2): 1-dimensional data

(More sophisticated outlier-detection methods and techniques, based on machine learning and models of data distribution are beyond this course's scope.)

Data outliers (3): MAD

The Median Absolute Deviation (MAD) for a univariate dataset \( X_1 \), \( X_2 \), ..., \( X_n \) is defined as the median of the absolute deviations from the data's median:

\( \mathrm{MAD} = \operatorname{median}_i\left(\,\left|X_i - \operatorname{median}_j(X_j)\right|\,\right) \)

Data outliers (4): MAD

import numpy as np

def doubleMADsfromMedian(y, thresh=3.5):
    # warning: this function does not check for NAs
    # nor does it address issues when more than 50%
    # of your data have identical values
    y = np.asarray(y, dtype=float)  # accept plain lists, not only arrays
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh

In practice, one would consider a library implementation, such as `scipy.stats.median_abs_deviation` (note that pandas' `mad()` computed the *mean* absolute deviation and has been removed in pandas 2.0).

Data outliers (5): standard deviation

Mark all values as outliers which are X times the standard deviation away from the mean.

import numpy as np

def xStddevAway(points, x=2):
    points = np.asarray(points, dtype=float)  # accept plain lists, not only arrays
    mean = np.mean(points)
    stddev = np.std(points)
    abs_dev = np.abs(points - mean)
    return abs_dev > x * stddev

Data outliers (6): IQR

The inter-quartile-range (IQR) approach marks as outliers those data points which lie more than \( x \) times the inter-quartile range below the 25% quantile and/or above the 75% quantile.

import numpy as np

def interQuantil(points, distance=3):
    points = np.asarray(points, dtype=float)  # accept plain lists, not only arrays
    q25, q75 = np.percentile(points, [25, 75])
    iq = q75 - q25
    minval = q25 - distance * iq
    maxval = q75 + distance * iq
    return (points < minval) | (points > maxval)

Some more information can be found in the NIST/SEMATECH Engineering Statistics Handbook.

Data outliers (7): percentiles

Mark all values as outliers which are outside a certain percentile range.

import numpy as np

def percentile_based_outlier(data, threshold=95):
    # marks all data points below the 2.5% quantile or above the 97.5% quantile
    # (for the default threshold of 95)
    data = np.asarray(data, dtype=float)  # accept plain lists, not only arrays
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

Data outliers (8): Examples

l = [1, 2, 3, 3, 4, 4, 4, 5, 5.5, 6, 6, 6.5, 7, 7, 7.5, 8, 9, 12, 52, 90]

>>> print(doubleMADsfromMedian(l))
[False False False False False False False False False False False False
 False False False False False False  True  True]

>>> print(interQuantil(l, distance=3))
[False False False False False False False False False False False False
 False False False False False False  True  True]

>>> print(percentile_based_outlier(l))
[ True False False False False False False False False False False False
 False False False False False False False  True]

>>> print(xStddevAway(l, x=2))
[False False False False False False False False False False False False
 False False False False False False False  True]

We see that MAD and the inter-quartile approach flag the values 52 and 90, while the `percentile_based_outlier` approach flags 1 and 90, and the two-standard-deviations-from-the-mean approach flags only 90.

Data preparation and cleaning: Take-away message





... in this lecture and in the notebooks we cannot do more than provide some "recipes" or examples... which you should learn to
