Data cleaning and preparation (Cont'd)

Stefan Sobernig


November 2nd, 2021


Unit 4


Data issues: Running example

Looking into last week's EUROSTAT dataset:

Data issues: Running example (cont'd)

Question.

What if we want a more complete set of cities?

Another dataset: UN Data

The dataset was downloaded as CSV: ./data/UNdata_Export_20171106_004028652.csv. Let's look into this dataset and think about the following question:

Data issues: Running example (cont'd)

Issues in the UN data file for instance:

Missing data (1)

Missing data (2)

Is there an underlying reason for data being missing?

The imagined "process" having produced missing data is independent from the (other) variables in a dataset.

Exemplary "processes":

Generally: The probability of a value being missing is unrelated to the value itself or to any other variable in the data set. Therefore, the missing data are said to be missing completely at random (MCAR).
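
As a minimal illustration (not from the slides), an MCAR mechanism can be simulated by giving every observation the same, constant probability of going missing, independent of any variable:

import numpy as np

rng = np.random.default_rng(42)
population = rng.integers(10_000, 5_000_000, size=100).astype(float)

# MCAR: every value has the same 10% chance of being missing,
# regardless of the value itself or of any other variable.
mcar_mask = rng.random(population.size) < 0.1
population[mcar_mask] = np.nan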

Missing data (3)

The imagined "process" that produced the missing data is conditional, i.e., it depends on (other) variables in the dataset.

Exemplary "processes":

Generally: The probability of a value being missing is unrelated to the value itself, but it is associated with other variables in the data set. Therefore, the missing data are said to be missing at random (MAR).
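
A hedged sketch of a MAR mechanism (the variables are made up for illustration): whether the population value is missing depends only on another, fully observed variable (here, the reporting year), not on the population value itself:

import numpy as np

rng = np.random.default_rng(42)
year = rng.choice([2010, 2015, 2020], size=100)
population = rng.integers(10_000, 5_000_000, size=100).astype(float)

# MAR: older reporting years are more likely to have a missing population,
# but the probability does not depend on the population value itself.
p_missing = np.where(year == 2010, 0.3, 0.05)
mar_mask = rng.random(population.size) < p_missing
population[mar_mask] = np.nan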

Missing data (4)

The imagined "process" that produced the missing data is conditional on the levels of the missing values themselves.

Exemplary "processes":

Generally: The probability of a value being missing is then related to the (expected) value itself. Therefore, the missing data are said to be missing not at random (MNAR).
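
Again a hedged sketch (values invented for illustration): under MNAR, the probability of missingness depends on the (unobserved) value itself, e.g., very small cities fail to report their population:

import numpy as np

rng = np.random.default_rng(42)
population = rng.integers(10_000, 5_000_000, size=100).astype(float)

# MNAR: the smaller the (true) population, the more likely it is missing,
# so the missingness depends on the very value that is lost.
p_missing = np.where(population < 50_000, 0.5, 0.05)
mnar_mask = rng.random(population.size) < p_missing
population[mnar_mask] = np.nan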

Missing data (5): Handling missings

Notice.

In the case of `urb_cpop1.csv`, deleting all rows that contain at least one missing value would delete all data!
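
For illustration only: complete-case ("listwise") deletion in pandas drops every row that contains at least one missing value; on a wide dataset such as urb_cpop1.csv, where (almost) every city has some missing year, this can remove all rows:

import pandas as pd

df = pd.read_csv("urb_cpop1.csv")   # adjust the path as needed

# Listwise deletion: keep only rows without any missing value.
complete_cases = df.dropna()
print(len(df), "rows before,", len(complete_cases), "rows after dropna()")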

Missing data (6): Handling missings

Single imputation (a.k.a. single substitution):
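
As one concrete example of single imputation (the column names are assumed for illustration): each missing value is replaced by a single substitute, e.g., the column mean:

import pandas as pd
import numpy as np

df = pd.DataFrame({"city": ["A", "B", "C", "D"],
                   "population": [10_000, np.nan, 250_000, np.nan]})

# Mean imputation: every missing population gets the column mean.
df["pop_mean_imputed"] = df["population"].fillna(df["population"].mean())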

Missing data (7): Handling missings

Single imputation (cont'd)

Missing data (8): Handling missings

Multiple imputation (procedure)

  1. Starting from the incomplete dataset, create multiple (>1) complete datasets.
  2. These multiple, imputed datasets are created by replacing the missing values by plausible values.
  3. Plausible values are computed for each missing value based on a pool of possible values specific for each missing value (hot decks, predictive mean matching)
  4. The subsequent analysis is then performed for each imputed dataset (e.g., descriptive or inferential statistics).
  5. The final result (e.g., combined statistic) is then computed ("pooled") from the individual results per imputed dataset.
Main benefit: The individual results for each dataset will differ from each other, and one captures these differences systematically to express the uncertainty in the imagined ("real") dataset.
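
A minimal sketch of this procedure (simple random hot-deck draws stand in for more elaborate imputation models; all values are invented): create several imputed copies, analyse each one, and pool the results:

import numpy as np

rng = np.random.default_rng(0)
population = np.array([10_000, np.nan, 250_000, 80_000, np.nan, 1_200_000])
observed = population[~np.isnan(population)]

m = 5                      # number of imputed datasets (step 1)
estimates = []
for _ in range(m):
    imputed = population.copy()
    # Steps 2./3.: replace each missing value by a plausible value drawn
    # from the pool of observed values (simple random hot deck).
    n_missing = np.isnan(imputed).sum()
    imputed[np.isnan(imputed)] = rng.choice(observed, size=n_missing)
    # Step 4: run the analysis on each completed dataset (here: the mean).
    estimates.append(imputed.mean())

# Step 5: pool the per-dataset results; their spread reflects the imputation uncertainty.
print(np.mean(estimates), np.std(estimates))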

Missing data (9): Handling missings

Hot-deck multiple imputation (ex.):





Missing data (10): Handling missings

Hot-deck multiple imputation (ex., cont'd):
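
As the original example is not reproduced here, a hedged stand-in: hot-deck imputation draws each missing value from a "deck" of donors that match the incomplete record on one or more categorical variables (the grouping variable `country` is made up):

import pandas as pd
import numpy as np

df = pd.DataFrame({"country": ["AT", "AT", "AT", "DE", "DE", "DE"],
                   "population": [180_000, np.nan, 42_000, 600_000, np.nan, 75_000]})

rng = np.random.default_rng(0)

def hot_deck(group):
    # Donors = observed values within the same country (the "deck").
    donors = group["population"].dropna().to_numpy()
    missing = group["population"].isna()
    group.loc[missing, "population"] = rng.choice(donors, size=missing.sum())
    return group

imputed = df.groupby("country", group_keys=False).apply(hot_deck)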





Limitation: restricted to categorical variables as predictors; a more general alternative is "predictive mean matching" (PMM).

Missing data (11)

A small demo on our running example: How can we deal with missing population values in the Urban Audit data set?

See LOCF (last observation carried forward) applied in this notebook: 05_What_is_the_biggest_European_city_pandas.ipynb
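
The notebook uses LOCF; as a minimal stand-alone sketch (column names assumed), LOCF forward-fills the last observed value within each city:

import pandas as pd
import numpy as np

df = pd.DataFrame({"city": ["A", "A", "A", "B", "B"],
                   "year": [2015, 2016, 2017, 2016, 2017],
                   "population": [100_000, np.nan, np.nan, 50_000, np.nan]})

# LOCF: carry the last observed population forward within each city.
df = df.sort_values(["city", "year"])
df["population"] = df.groupby("city")["population"].ffill()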

Duplicate data / Deduplication (1)

Question.

What do you consider common causes of duplicate data? Have you encountered duplicate data in your assignment-2 datasets?

Duplicate data (2)

Two kinds of causes:


Duplicate data (3)

Examples of intra-source duplicates:

Duplicate data (4)

Examples of inter-source duplicates:

Duplicate data (5)

Challenges:

--> Inspection and heuristics needed! Deduplication is often an iterative process!

Duplicate data (6a)

Steps in a (prototypical) deduplication procedure include:

  1. Data preparation:
  2. Search-space reduction: How can we reduce the number of comparisons?
  3. Comparison-function selection:

Duplicate data (6b)

Steps in a (prototypical) deduplication procedure (cont'd); a toy sketch of the procedure follows after the list:

  4. (Manual) inspection of the candidate duplicates to make a match decision: all pairs of data objects are grouped into either matches or non-matches.
  5. Verification of the quality of duplicate detection (if not ok, start all over from step 2)
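
A toy sketch of steps 1, 3, and 4 on invented data (real pipelines would use dedicated record-linkage libraries): normalise the fields, compare candidate pairs with a similarity function, and classify each pair as match or non-match via a threshold:

from difflib import SequenceMatcher
from itertools import combinations

records = [{"id": 1, "name": " London "}, {"id": 2, "name": "london"},
           {"id": 3, "name": "Paris"}]

# Step 1. Data preparation: normalise the comparison fields.
for r in records:
    r["name_norm"] = r["name"].strip().lower()

# Steps 3./4. Compare candidate pairs and classify them with a similarity threshold.
matches, non_matches = [], []
for a, b in combinations(records, 2):
    sim = SequenceMatcher(None, a["name_norm"], b["name_norm"]).ratio()
    (matches if sim > 0.9 else non_matches).append((a["id"], b["id"], sim))

print(matches)      # candidate duplicates, e.g. the pair (1, 2)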

Duplicate data (7)

How to scale deduplication? (Hint: number of pairwise comparisons: \( (n^2-n)/2 \))

Different strategies (or mixes thereof) could be deployed:
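
One common strategy is blocking: only records that share a blocking key (here, hypothetically, the first letter of the normalised name) are compared, which cuts the \( (n^2-n)/2 \) pairwise comparisons down considerably:

from collections import defaultdict
from itertools import combinations

names = ["London", "london", "Londres", "Paris", "Parigi", "Wien"]

# Blocking: group records by a cheap key and compare only within blocks.
blocks = defaultdict(list)
for name in names:
    blocks[name.strip().lower()[0]].append(name)

candidate_pairs = [pair
                   for block in blocks.values()
                   for pair in combinations(block, 2)]
print(len(candidate_pairs), "comparisons instead of", (len(names)**2 - len(names)) // 2)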

Duplicate data (8): Ex. inter-source duplicates

Example:

Question.

What's the number of inhabitants of 'London'? What issues can you run into when answering this question?

Candidate issues:

Duplicate data (9): Ex. inter-source duplicates

Candidate tactics:

Example in Python, discussing potential issues with duplicates in city population data: 05_What_is_the_biggest_European_city_pandas_continued.ipynb
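
In the same spirit as the notebook, a hedged sketch (all values invented): after normalising the city names, duplicates across the two sources can be collapsed, e.g., by keeping the most recent population figure per city:

import pandas as pd

eurostat = pd.DataFrame({"city": ["London", "Wien"],
                         "population": [8_600_000, 1_900_000],
                         "year": [2017, 2017]})
un = pd.DataFrame({"city": ["london", "Greater London"],
                   "population": [8_136_000, 8_674_000],
                   "year": [2011, 2015]})

combined = pd.concat([eurostat, un], ignore_index=True)
combined["city_norm"] = (combined["city"].str.lower()
                                          .str.replace("greater ", "", regex=False))

# Keep the most recent figure per (normalised) city name.
deduplicated = (combined.sort_values("year")
                        .drop_duplicates("city_norm", keep="last"))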

Data outliers (1)

Data outliers are values that deviate markedly from the other values of the variable. Whether a value counts as an outlier depends on the concrete (research, analysis) setting!

Question.

What approaches could we use to detect outliers in 1-dimensional data?

Data outliers (2): 1-dimensional data

(More sophisticated outlier-detection methods and techniques, based on machine learning or on models of the data distribution, are beyond this course's scope.)

Data outliers (3): X-fold standard deviation

Mark as outliers all values which are more than X times the standard deviation away from the mean.

import numpy as np

def xStddevAway(points, x=2):
    # Flag values that are more than x standard deviations away from the mean.
    points = np.asarray(points)      # also accept plain Python lists
    mean = np.mean(points)
    stddev = np.std(points)
    abs_dev = np.abs(points - mean)
    return abs_dev > x * stddev

Data outliers (4): MAD

The Median Absolute Deviation (MAD) for a univariate dataset \( X_1 \), \( X_2 \), ..., \( X_n \) is defined as the median of the absolute deviations from the data's median:

\( \mathrm{MAD} = \mathrm{median}_i\left( \left| X_i - \mathrm{median}_j(X_j) \right| \right) \)

In practice, one could consider a library implementation, such as pandas MAD.
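
A sketch analogous to the other detectors in this unit (the function name and the cut-off of 3 are my own, common but arbitrary choices): flag values whose absolute deviation from the median exceeds a multiple of the MAD:

import numpy as np

def madBasedOutlier(points, threshold=3):
    # Flag values whose distance to the median exceeds `threshold` times the MAD.
    points = np.asarray(points)
    med = np.median(points)
    mad = np.median(np.abs(points - med))
    return np.abs(points - med) > threshold * mad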

Data outliers (6): IQR

The inter-quartile-range (IQR) approach marks as outliers all data points which are more than \( x \) times the inter-quartile range below the 25% quantile or above the 75% quantile.

import numpy as np

def interQuantil(points, distance=3):
    # Flag values more than `distance` times the IQR below Q1 or above Q3.
    points = np.asarray(points)      # also accept plain Python lists
    q25, q75 = np.percentile(points, [25, 75])
    iqr = q75 - q25
    minval = q25 - distance * iqr
    maxval = q75 + distance * iqr
    return (points < minval) | (points > maxval)

Some more information can be found in the Engineering Statistics Handbook.

Data outliers (7): percentiles

Mark as outliers all values which fall outside a certain percentile range.

import numpy as np

def percentile_based_outlier(data, threshold=95):
    # Flag data points below the 2.5% percentile or above the 97.5% percentile
    # (for the default threshold of 95).
    data = np.asarray(data)          # also accept plain Python lists
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

Data outliers (8): Examples

>>> l = [1, 2, 3, 3, 4, 4, 4, 5, 5.5, 6, 6, 6.5, 7, 7, 7.5, 8, 9, 12, 52, 90]
>>> print(interQuantil(l, distance=3))
[False False False False False False False False False False False False
 False False False False False False  True  True]
>>> print(percentile_based_outlier(l))
[ True False False False False False False False False False False False
 False False False False False False False  True]
>>> print(xStddevAway(l, x=2))
[False False False False False False False False False False False False
 False False False False False False False  True]

We see that the inter-quartile (IQR) approach flags the values 52 and 90, the percentile-based approach flags 1 and 90, and the two-standard-deviations-from-the-mean approach flags only 90.

Data preparation and cleaning: Take-away message





... in this lecture and in the notebooks we cannot do more than provide some "recipes" or examples ... which you should learn to adapt and apply to your own datasets.

References