Data cleaning and preparation (Cont'd)

Prof. Dr. Stefan Sobernig
Prof. Dr. Axel Polleres


31 October 2019


Unit 4


Data cleaning and preparation (Cont'd)

Data issues: Running example

Looking into last week's EUROSTAT dataset:

Data issues: Running example (cont'd)

Question.

What if we want a more complete set of cities?

Another dataset: UN Data

Downloaded dataset as CSV: ./data/UNdata_Export_20171106_004028652.csv

Let's look into this dataset ... and think about the following question:

Data issues: Running example (cont'd)

Issues in the UN data file for instance:

How to deal with Missing Data?

Missing data (1)

Missing data (2)

Is there an underlying reason for data being missing?

The imagined "process" that produced the missing data is independent of the (other) variables in the dataset.

Exemplary "processes":

Generally: The probability of a value being missing is unrelated to the value itself or to any other variable in the data set. Therefore, the missing data is said to be missing completely at random (MCAR).

Missing data (3)

The imagined "process" that produced the missing data is conditional on, i.e. depends on, the (other) variables in the dataset.

Exemplary "processes":

Generally: The probability of a value being missing is unrelated to the value itself, but it is associated with other variables in the data set. Therefore, the missing data is said to be missing at random (MAR).

Missing data (4)

The imagined "process" that produced the missing data is conditional on the levels of the missing values themselves.

Exemplary "processes":

"official" numbers (up to certain thresholds), because they fear remedies in budget allocation, e.g. if numbers decrease. Generally: The probability of a value being missing is then related to the (expected) value itself. Therefore, the missing data is said being missing not at random (MNAR).

Missing data (5): Handling missings

Notice.

In the case of urb_cpop1.csv, this would delete all data!
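
A minimal sketch of such listwise (complete-case) deletion with pandas; the file path follows the running example, but the read options and column layout are assumptions:

import pandas as pd

# Load last week's EUROSTAT export (path from the running example; options assumed).
df = pd.read_csv("./data/urb_cpop1.csv")

# Listwise deletion: drop every row that contains at least one missing value.
# With many sparsely filled year columns, as in urb_cpop1.csv, this can remove all rows.
complete_cases = df.dropna()
print(len(df), "rows before,", len(complete_cases), "rows after dropna()")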

Missing data (6): Handling missings

Single imputation (a.k.a. single substitution):

Missing data (7): Handling missings

Single imputation (cont'd)

Missing data (8): Handling missings

Multiple imputation (procedure)

  1. Starting from the incomplete dataset, create multiple (>1) complete datasets.
  2. These multiple, imputed datasets are created by replacing the missing values by plausible values.
  3. Plausible values are drawn, for each missing value, from a pool of possible values specific to that missing value (hot decks, predictive mean matching).
  4. The subsequent analysis is then performed for each imputed dataset (e.g., descriptive or inferential statistics).
  5. The final result (e.g., combined statistic) is then computed ("pooled") from the individual results per imputed dataset.

Main benefit: The individual results for each imputed dataset will differ from each other, and one captures these differences systematically to express the uncertainty about the imagined ("real") complete dataset.
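
A minimal sketch of this procedure, using hot-deck style draws from the observed values and the mean as the analysis of interest (the function name and the simple pooling below are our own, not a particular library's API):

import numpy as np

def multiple_imputation_mean(y, m=5, seed=None):
    # y: 1-d data with np.nan marking missing values.
    # Steps 1-3: create m completed datasets, replacing each missing value
    #            with a draw from the pool of observed values (hot-deck style).
    # Step 4:    run the analysis (here: the mean) on every completed dataset.
    # Step 5:    pool the per-dataset results; their spread expresses the
    #            uncertainty introduced by the missing data.
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    observed = y[~missing]
    estimates = []
    for _ in range(m):
        completed = y.copy()
        completed[missing] = rng.choice(observed, size=missing.sum())
        estimates.append(completed.mean())
    return np.mean(estimates), np.std(estimates)

pooled_mean, between_imputation_sd = multiple_imputation_mean(
    [1.0, np.nan, 3.0, 4.0, np.nan], m=10, seed=42)

(Proper pooling of variances follows Rubin's rules; this sketch only reports the between-imputation spread.)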

Missing data (9): Handling missings

Hot-deck multiple imputation (ex.):





Missing data (10): Handling missings

Hot-deck multiple imputation (ex., cont'd):





Limitation: restricted to categorical variables as predictors; a more general alternative is "predictive mean matching" (PMM).

Missing data (11)

A small demo on our running example: How can we deal with missing population values in the Urban Audit data set?

See LOCF (last observation carried forward) applied in this notebook: ./notebooks/missing-dupes.ipynb
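
For reference, the basic LOCF step can be expressed in pandas roughly as follows; the toy values and column layout are made up, the real data lives in the notebook above:

import pandas as pd

# Toy population table: one row per city, one column per year (values invented).
pop = pd.DataFrame(
    {"2014": [1714.0, None], "2015": [None, 2206.0], "2016": [1840.0, None]},
    index=["Wien", "Paris"],
)

# LOCF: carry the last observed value forward along the year columns.
pop_locf = pop.ffill(axis=1)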

How to deal with Duplicates in Data?

Duplicate data (1)

Question.

What do you consider common causes for duplicate data? Have you experienced occurrences of duplicate data in your assignment-2 datasets?

Duplicate data (2)

Two kinds of causes:

Duplicate data (3)

Examples of intra-source duplicates:

Duplicate data (4)

Examples of inter-source duplicates and inconsistencies:

Duplicate data (5)

What does a deduplication procedure have to decide? All pairs of data objects are grouped into either:

Challenges:

Duplicate data (6a)

A (prototypical) deduplication procedure, considerations:

Duplicate data (6b)

A (prototypical) deduplication procedure, considerations (cont'd):

Duplicate data (7) - Search-space reduction

Systematically shrink the search space, i.e., avoid the naive \( (n^2-n)/2 \) pairwise comparisons:

(Note: algorithmically, this has many similarities to efficient JOIN or GROUP BY procedures in databases!)
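
A minimal sketch of one such reduction, blocking on a cheap key (here: the first letter of a normalized name; the records and the key choice are illustrative only):

from collections import defaultdict
from itertools import combinations

records = ["London", "Londres", "Wien", "wien ", "Vienna"]

# Group records by a blocking key; only records within the same block are
# compared, instead of all (n^2 - n)/2 pairs.
blocks = defaultdict(list)
for r in records:
    blocks[r.strip().lower()[0]].append(r)

candidate_pairs = [pair for block in blocks.values()
                   for pair in combinations(block, 2)]
# Note: blocking can miss true matches whose keys differ ("Wien" vs. "Vienna"),
# which is why phonetic keys or sorted-neighbourhood windows are often used instead.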

Duplicate data (8): Ex. inter-source duplicates

Example:

Question.

What is the number of inhabitants of 'London'? What issues can you run into when answering this question?

Candidate issues:

Duplicate data (9): Ex. inter-source duplicates

Candidate tactics:

--> Example in Python: Dealing with duplicates in city population data, https://datascience.ai.wu.ac.at/ss19/dataprocessing1/notebooks/missings-dupes.ipynb. You find this example in the lecture notebooks as part of the UrbanAudit notebook: lecturebookscopy/unit5/Eurostat_Cities.ipynb
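
Independent of the notebooks, the basic pandas tactic is to normalize the key columns before comparing and then drop exact duplicates; the toy table below is made up for illustration:

import pandas as pd

cities = pd.DataFrame({
    "city": ["London", "london ", "Wien", "Wien"],
    "population": [8825000, 8825000, 1840000, 1840000],
})

# Without normalization, "London" and "london " are treated as distinct values.
cities["city_norm"] = cities["city"].str.strip().str.lower()
deduped = cities.drop_duplicates(subset=["city_norm", "population"])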

How to deal with Outliers in Data?

Data outliers (1)

Data outliers are ...

Question.

What approaches could we use to detect outliers in 1-dimensional data?

Data outliers (2): 1-dimensional data

Make use of descriptive statistics! (More sophisticated outlier-detection methods and techniques, based on machine learning and models of the data distribution --> ask in course 2 :-).)

Data outliers (3): MAD

The Median Absolute Deviation (MAD) for a univariate dataset \( X_1 \), \( X_2 \), ..., \( X_n \) is defined as the median of the absolute deviations from the data's median:
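
\( \mathrm{MAD} = \operatorname{median}_i\big( |X_i - \operatorname{median}_j(X_j)| \big) \)

Outliers are then typically flagged via the modified z-score \( M_i = 0.6745\,(X_i - \tilde{X}) / \mathrm{MAD} \) (with \( \tilde{X} \) the median), marking points with \( |M_i| > 3.5 \); these are the constants 0.6745 and 3.5 used in the code below.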

Data outliers (4): MAD

import numpy as np

def doubleMADsfromMedian(y, thresh=3.5):
    # Warning: this function does not check for NAs, nor does it address
    # issues when more than 50% of the data have identical values.
    y = np.asarray(y, dtype=float)  # also accept plain Python lists
    m = np.median(y)
    abs_dev = np.abs(y - m)
    left_mad = np.median(abs_dev[y <= m])
    right_mad = np.median(abs_dev[y >= m])
    y_mad = left_mad * np.ones(len(y))
    y_mad[y > m] = right_mad
    modified_z_score = 0.6745 * abs_dev / y_mad
    modified_z_score[y == m] = 0
    return modified_z_score > thresh
In practice, one would consider a library implementation, such as scipy.stats.median_abs_deviation or statsmodels.robust.mad (note that pandas' .mad() computes the mean, not the median, absolute deviation).

Data outliers (5): standard deviation

Mark as outliers all values which are more than x times the standard deviation away from the mean.

import numpy as np

def xStddevAway(points, x=2):
    # Flag values whose absolute deviation from the mean exceeds x standard deviations.
    points = np.asarray(points, dtype=float)  # also accept plain Python lists
    mean = np.mean(points)
    stddev = np.std(points)
    abs_dev = np.abs(points - mean)
    return abs_dev > x * stddev

Data outliers (6): IQR

The inter-quartile-range (IQR) approach marks as outliers those data points which lie more than \( x \) times the inter-quartile range below the 25th percentile (Q1) or above the 75th percentile (Q3).

import numpy as np

def interQuartil(points, distance=3):
    # Flag values more than `distance` times the IQR below Q1 or above Q3.
    points = np.asarray(points, dtype=float)  # also accept plain Python lists
    q25, q75 = np.percentile(points, [25, 75])
    iqr = q75 - q25
    minval = q25 - distance * iqr
    maxval = q75 + distance * iqr
    return (points < minval) | (points > maxval)

Some more information can be found in the Engineering Statistics Handbook.

Data outliers (7): percentiles

Mark as outliers all values which lie outside a certain percentile range, e.g., "keep the central 95% of the data".

import numpy as np

def percentile_based_outlier(data, threshold=95):
    # Flag all data points below the 2.5th or above the 97.5th percentile
    # (for the default threshold of 95).
    data = np.asarray(data, dtype=float)  # also accept plain Python lists
    diff = (100 - threshold) / 2.0
    minval, maxval = np.percentile(data, [diff, 100 - diff])
    return (data < minval) | (data > maxval)

Data outliers (8): Examples

>>> l = [1, 2, 3, 3, 4, 4, 4, 5, 5.5, 6, 6, 6.5, 7, 7, 7.5, 8, 9, 12, 52, 90]
>>> print(doubleMADsfromMedian(l))
[False False False False False False False False False False False False
 False False False False False False  True  True]
>>> print(interQuartil(l, distance=3))
[False False False False False False False False False False False False
 False False False False False False  True  True]
>>> print(percentile_based_outlier(l))
[ True False False False False False False False False False False False
 False False False False False False False  True]
>>> print(xStddevAway(l, x=2))
[False False False False False False False False False False False False
 False False False False False False False  True]

We see that the MAD and inter-quartile approaches flag the values 52 and 90, while the percentile-based approach flags 1 and 90, and the 2-standard-deviations-from-the-mean approach flags only 90.

Data preparation and cleaning: Take-away message





... in this lecture and in the notebooks we cannot do more than provide some "recipes" or examples ... which you should learn to apply and adapt to your own data.

References