Data cleaning and preparation (Basics)

Dr. Stefan Sobernig
Axel Polleres


22 March 2022


Unit 3


Data cleaning and preparation:

"Bread and Butter" for data scientists...

Cleaning & preparation

Today's topic:

This is where most of the effort in data-science projects is spent, repeatedly (often more than 80% of the total time)

Cleaning & Preparation

Question.

Discuss: How would you describe the following two synthetic data sets?







Running example: EUROSTAT Urban audit

Question.

How do you inspect this dataset? How do you characterize this dataset? What do we see when we look at this dataset?

Right questions at the right time

Questions we could ask right now:

  1. Which one is the biggest city?
  2. What are the (most recent) populations per country?
  3. Which ones are the 10 biggest cities?
  4. What was the population of the city with the code AT004C1 in 2014?
  5. What was the population of the city named "Innsbruck" in 2015?
  6. ...
..., but we don't go there just yet.

Interpretation of the data structure: (1)
Dataset, Value, Variable

Consider first the following key notions:

Nominal, Ordinal, Interval, Ratio?

Interpretation of the data structure: (2)
Observations, Fixed/Measured and Derived variables

Interpretation of the data structure: (3)
Example





Running example: EUROSTAT Urban Audit





What's wrong/inconvenient about this dataset?

Running example: EUROSTAT Urban Audit

Question.

How would you describe the Urban Audit dataset "./data/urb_cpop1_2017.tsv" using these key notions? What are problems that make describing the dataset difficult? What's missing?

  1. indic_ur,cities\time -> "AT,DE1001V", "AT001C1,DE1001V"
    1. Indicators such as "population" use particular codes, e.g., DE1001V stands for "Population on the 1st of January, total"
    2. Cities use particular codes; the codes are available in another file as RDF or as CSV
    3. Countries use ISO two-letter codes, e.g., available on datahub.io
  2. missing-value notation (NA, ":")
  3. population numbers should be integers, BUT some values look strange: '964 385', '7284552 e'
  4. Moreover, the format has changed since 2017! Cf. the newer version available at ./data/urb_cpop1_10_2019.tsv

Data transformation (1): Overview

Data transformation involves:

  1. Modifying values contained in given variables and/or
  2. Merging/adding values, variables and observations (e.g., variables taken from additional datasets, resorting to values from previous observations) and/or
  3. Reshaping the dataset (i.e., its layout)
Note: Permitted (value) transformations are constrained by the types of variables.

Data transformation (2): Goals

Datasets ("in the wild"):

Some examples: When a dataset is consumed from a datasource as raw strings (typical for CSVs!):

Data transformation (3): Value types

Let us first take a look at data types and how we can handle them in Python.

Python has the following "built-in", bit-representational ("primitive") datatypes:

Other (structured) data types:

Data transformation (4): Value types

Any (planned) transformations might need introspection:

# type(x) .. returns the type of x
# e.g.
type(5)
<class 'int'>

# isinstance(x, t) .. returns True if x is of type t, else False
# e.g.
isinstance(5, int)
True

ATTENTION: Not all values in a column may be of the same type!
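A quick way to check this is to collect the distinct types found in a column; a minimal sketch (the column values below are made-up illustrations, not from the Urban Audit file):

```python
# A hypothetical column of raw values, as they might arrive from a messy CSV:
column = ["12345", ":", "964 385", 7284552, None]

# Collect the names of the distinct Python types actually present
types_seen = {type(v).__name__ for v in column}
print(types_seen)  # e.g. {'str', 'int', 'NoneType'}
```

Such a check is cheap and quickly reveals columns that need conversion before any arithmetic.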

Data transformation (5): Number conversions

  int(x)    # Return an integer object constructed from a number or string x.
  float(x)  # Return a floating point number constructed from a number or string x.

Examples:

float("   -12345\n")
-12345.0
int(2.0)
2
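On real data (recall the ":" markers and footnote flags like '7284552 e' in the Urban Audit file), such conversions raise a ValueError. A minimal defensive sketch (the helper name `to_int` and the None fallback are our own choices, not from the slides):

```python
def to_int(raw):
    """Try to parse a raw string as an integer; return None on failure."""
    try:
        return int(raw.strip())
    except ValueError:
        return None

print(to_int("  964385 "))   # 964385
print(to_int(":"))           # None (missing-value marker)
print(to_int("7284552 e"))   # None (trailing footnote flag)
```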

Data transformation (6): Truth (boolean) values

bool( x )
Return a Boolean value, i.e. one of True or False. x is converted using the standard truth testing procedure

bool(0)
False
bool(10)
True

Data transformation (7): Truth-value checks

Any object can be tested for truth value, for use in an if or while condition or as operand of the Boolean operations below. The following values are considered False:

All other values are considered True — so objects of many types are always true.
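A small sketch demonstrating the standard truth-testing rules on common values:

```python
# Values that test as False: None, False, zero of any numeric type,
# and empty sequences/collections:
falsy = [None, False, 0, 0.0, '', [], (), {}]
print([bool(v) for v in falsy])  # all False

# Non-empty containers and non-zero numbers test as True
# (note: the *string* '0' is non-empty, hence True):
print(bool('0'), bool([0]), bool(-1))  # True True True
```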

Data transformation (8): Date/ datetime values

Available datetime types:

Data transformation (9): Date/datetime values

The datetime.strptime() class method creates a datetime object from a string, given a format pattern:

from datetime import datetime
text = '2012-09-20'
datetime.strptime(text, '%Y-%m-%d')
datetime.datetime(2012, 9, 20, 0, 0)

See the online documentation for a full list of variables for the string format

Data transformation (10): Date/datetime values

The standard datetime Python module does not automatically detect and parse date/time strings; it still requires manually providing the format/pattern string.

Options with (some) auto-detection:

import dateparser
dateparser.parse('12/12/12')
datetime.datetime(2012, 12, 12, 0, 0)
from dateutil.parser import parse
parse("Today is January 1, 2047 at 8:21:00AM", fuzzy_with_tokens=True)
(datetime.datetime(2047, 1, 1, 8, 21), (u'Today is ', u' ', u'at '))

Notice.

Careful: such libraries might not detect the correct format, but they properly cover about 90% of all common cases.

Remark: there are even more powerful libraries for detecting temporal annotations in text, e.g., the Heideltime package.

Data transformation (11): String manipulation

Data transformation (12): String manipulation

Python provides several functions to manipulate strings at the per-character level:

Mind the examples in assignment 1

For more functions, please see the official documentation for str objects

Data transformation (13): String slicing

Problem.

How to "split" strings, or extract substrings? For instance, "AT,DE1001V" -> AT and DE1001V

Use [start:stop] to extract a range of characters:
word[0]          # get the first character of the word
word[0:3]        # get the first three characters
word[-3:]        # get the last three characters

Keep in mind that Python, like many other languages, starts counting from 0!

word="AT,DE1001V"
print(word[3:11])
DE1001V

Data transformation (14): String slicing

Some useful helper functions for dealing with strings and finding "index positions":

word = "Data Processing"
print(word.count('a'))      # count how many times 'a' occurs in the string
2
print(word.find("D"))       # index of the first "D"; returns -1 if not found
0
print(word.index("Data"))   # index of "Data"; raises ValueError if not found
0
print(word.index("Pro"))
5
print(len("Data"))
4

word="AT,DE1001V"
print(word[3:3+len("DE1001V")])
DE1001V
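For delimiter-separated values such as "AT,DE1001V", str.split() is usually more direct (and more robust) than index arithmetic:

```python
word = "AT,DE1001V"
country, indicator = word.split(',')  # split on the comma into two parts
print(country)    # AT
print(indicator)  # DE1001V

# maxsplit=1 keeps any further commas in the second part:
print("AT,DE1001V,extra".split(',', 1))  # ['AT', 'DE1001V,extra']
```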

Data transformation (15): Substring search/replace

str.replace(old, new[, count])

Returns a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.

Examples:

word = "Data Processing"
word.replace('Processing', 'Science')
'Data Science'

float("100,50".replace(",", "."))
100.5

# whereas this one gives an error:
float( "100,50")
ValueError: could not convert string to float: '100,50'

Data transformation (16): Testing for character classes

word = "Data Processing"
word.isalnum()         # True if all characters are alphanumeric
word.isalpha()         # True if all characters are alphabetic
word.isdigit()         # True if all characters are digits
word.istitle()         # True if the string is title-cased
word.isupper()         # True if all cased characters are upper case
word.islower()         # True if all cased characters are lower case
word.isspace()         # True if the string contains only whitespace
word.endswith('g')     # True if the string ends with 'g'
word.startswith('D')   # True if the string starts with 'D'
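These tests are handy for validating values before conversion; a minimal sketch (the raw values below are made-up examples in the spirit of the Urban Audit data):

```python
# Only convert values that consist purely of digits;
# ':' (missing) and '7284552 e' (footnote flag) are skipped.
raw_values = ["964385", ":", "7284552 e", "12345"]
clean = [int(v) for v in raw_values if v.isdigit()]
print(clean)  # [964385, 12345]
```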

Data transformation (17): Reshaping and "Tidying"


Figure 1: Unstacked


Figure 2: Stacked

Data transformation (18): Reshaping and "Tidying"

Messy datasets result from violating these three main rules in different ways, for example:

Data transformation (19): Tidied dataset





Data transformation (20): Reshaping and "Tidying"

Question.

Discuss: Is the Urban Audit dataset tidy or messy?





Data scanning (1)

Scanning involves reading-in and processing a dataset in a piecemeal manner, e.g.:

Python example ("row by row"):

with open('./data/urb_cpop1_2017.tsv', 'r') as f:
    rows = f.readlines()
    for eachRow in rows:
        # do something, e.g.:
        print(eachRow)

Python example ("row by row" + "value by value"):

with open('./data/urb_cpop1_2017.tsv', 'r') as f:
    rows = f.readlines()
    for eachRow in rows:
        r = eachRow.split('\t')
        # do something, e.g.:
        for eachValue in r:
            print(eachValue)

Data scanning (2)

For a given dimension (e.g., rows), scanning may be used to inspect:

Data sniffing

"Sniffing" involves detecting in a guided, semi-automated manner:

Rudimentary "sniffing" example using Python (for detecting which character is the delimiter):

with open('./data/urb_cpop1_2017.tsv', 'r') as f:
    rows = f.readlines()
    c=0
    N=10
    for eachRow in rows:
        # do something, e.g.:
        print('Number of ":":',eachRow.count(':'))
        print('Number of TABs:',eachRow.count('\t'))
        c+=1
        if c > N: break
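The standard library also ships a ready-made sniffer in the csv module; a minimal sketch on an in-memory sample (the sample rows are made up, not taken from the Urban Audit file):

```python
import csv

# A small tab-separated sample, two lines of made-up data:
sample = "city\t2014\t2015\nAT001C1\t1546000\t1569000\n"

# Sniffer guesses the dialect (delimiter, quoting, ...) from the sample
dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))  # '\t'
```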

Data filtering (1)

Data filtering (2): Python basics

Filtering lists:

#filter out negative values
L=[0,1,2,-1,4,-6]
Lfiltered=[]
for i in L:
  if i>=0:
    Lfiltered.append(i)

#another way to filter lists is to use list comprehension
Lfiltered=[ i for i in L if i>=0] #same as above
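A third option is the built-in filter() function, which takes a predicate and returns a lazy iterator:

```python
L = [0, 1, 2, -1, 4, -6]

# filter() yields only the items for which the predicate is True;
# wrap it in list() to materialize the result
Lfiltered = list(filter(lambda i: i >= 0, L))
print(Lfiltered)  # [0, 1, 2, 4]
```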

Data filtering (3): Python basics

Filtering nested lists:

#filter out negative values
L=[ ['a',0],['b',1],['c',2],['d',-1],['e',4],['f',-6]]
Lfiltered=[]
for i in L:
  if i[1]>=0:
    Lfiltered.append(i)

#another way to filter list is to use list comprehension
Lfiltered=[ i for i in L if i[1]>=0] #same as above

Data filtering (4): Python basics

Filtering dictionaries:

#filter out negative values
L=[ {'a':0},{'b':1},{'c':2},{'d':-1},{'e':4},{'f':-6}]
Lfiltered=[]
for d in L:
  for k,v in d.items():
    if v>=0:
      Lfiltered.append(d)

# comprehension alternative
[i for i in L for k,v in i.items() if v >= 0]

Data filtering (5): Applications

Data filtering has many applications:

  1. "search" can be seen as filtering
  2. focusing on only the relevant parts of the data
  3. eliminating unnecessary content (e.g., removing unwanted data-object types in reshaping)
  4. removing content which cannot be processed (e.g., structurally missing values)
  5. reducing the amount of data to be processed at once, per job (chunking data)

Data filtering (6): Urban-audit dataset

Modify the previous scanning example to include a (filter) condition:

with open('./data/urb_cpop1_2017.tsv', 'r') as f:
    rows = f.readlines()
    for eachRow in rows:
        # only output those lines containing 'DE1001V'
        if 'DE1001V' in eachRow:
            print(eachRow)

Question.

How could this be further improved?

Data filtering (7): Costs

Question.

How expensive is it to do filtering?

       if eachRow.split('\t')[0][-7:] == 'DE1001V' and eachRow[0:2] == 'AT':
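To compare the two approaches side by side, here is a self-contained sketch with made-up rows in the "city,indicator<TAB>values..." layout: the cheap substring test matches anywhere in the line, while the field-based test splits first and is more precise (but does more work per row):

```python
rows = [
    "AT001C1,DE1001V\t1546000\t1569000",
    "DE001C1,DE1001V\t3421829\t3469849",
    "AT001C1,DE1002V\t100\t200",
]

# cheap but imprecise: matches 'DE1001V' anywhere in the line
coarse = [r for r in rows if 'DE1001V' in r]

# precise: indicator is the last 7 chars of field 0, country the first 2 chars
fine = [r for r in rows
        if r.split('\t')[0][-7:] == 'DE1001V' and r[0:2] == 'AT']

print(len(coarse), len(fine))  # 2 1
```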

Data sorting (1)

Data sorting (2): Basics

sorted([5, 2, 3, 1, 4])
[1, 2, 3, 4, 5]
# the parameter 'reverse' can be set for descending order: 
sorted([5, 2, 3, 1, 4], reverse = True)
[5, 4, 3, 2, 1]

a = [5, 2, 3, 1, 4]
a.sort()   # sorts in place, returns None
a
[1, 2, 3, 4, 5]

Data sorting (3): List of lists

l = [[0, 1, 'f'], [4, 2, 't'], [9, 4, 'afsd']]
l.sort(key=lambda x: x[2])
print(l)
[[9, 4, 'afsd'], [0, 1, 'f'], [4, 2, 't']]

Note: lambda x: ... declares an anonymous (inline) function

# i.e., the above is essentially the same as writing...
def _f(x):
    return x[2]

l = [[0, 1, 'f'], [4, 2, 't'], [9, 4, 'afsd']]
l.sort(key=_f)
print(l)
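For simple index-based keys, operator.itemgetter is an idiomatic (and slightly faster) alternative to the lambda:

```python
from operator import itemgetter

l = [[0, 1, 'f'], [4, 2, 't'], [9, 4, 'afsd']]
l.sort(key=itemgetter(2))   # equivalent to key=lambda x: x[2]
print(l)  # [[9, 4, 'afsd'], [0, 1, 'f'], [4, 2, 't']]
```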

Data sorting (4): Dictionaries by key

orig = {2: 3, 1: 89, 4: 5, 3: 0}

from collections import OrderedDict

out = OrderedDict(sorted(orig.items(), key=lambda t: t[0]))
print(out)

Data sorting (5): Dictionaries by value

orig = {"aa": 3, "bb": 4, "cc": 2, "dd": 1}

from collections import OrderedDict

out = OrderedDict(sorted(orig.items(), key=lambda t: t[1]))
print(out)
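Since Python 3.7, plain dicts preserve insertion order, so OrderedDict is no longer strictly required here:

```python
orig = {"aa": 3, "bb": 4, "cc": 2, "dd": 1}

# a plain dict built from the sorted items keeps the sorted order
out = dict(sorted(orig.items(), key=lambda t: t[1]))
print(out)  # {'dd': 1, 'cc': 2, 'aa': 3, 'bb': 4}
```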

Data sorting (6): List of tuples

student_tuples = [
    ('john', 'A', 15),
    ('jane', 'B', 12),
    ('dave', 'B', 10),
  ]
sorted(student_tuples, key=lambda student: student[2])   # sort by age
[('dave', 'B', 10), ('jane', 'B', 12), ('john', 'A', 15)]

Data filtering and sorting: EUROSTAT Urban Audit

Question.

Discussion: Which of the following problems needs filtering? Which one needs sorting?

  1. Which one is the biggest city?
  2. What are the (most recent) populations per country?
  3. Which ones are the 10 biggest cities?
  4. What was the population of the city with the code AT004C1 in 2014?
  5. What was the population of the city named "Innsbruck" in 2015?
  6. How many cities per country does that dataset contain?
  7. Which country/ies has the most cities listed?
  8. Which city/ies grew fastest over the past 10 years?

Data aggregation (1)

Data aggregation (2): Dictionary-based grouping

data = [
['Vienna','Austria',11,12,13], ['Salzburg','Austria',12,22,23],
['Stuttgart','Germany',12,22,23], ['Berlin','Germany',12,22,23],
['Paris','France',12,22,23]
]

# a bit verbose, using a "standard" dictionary
groupby={}
for item in data:
    group=item[1]
    if group not in groupby:
        groupby[group]=[]
    groupby[group].append(item)

print(groupby)


# more compact, using defaultdict:
from collections import defaultdict
groupby = defaultdict(list)
for row in data:
   groupby[row[1]].append(row)
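itertools.groupby offers yet another alternative, but note that it only merges *consecutive* rows with equal keys, so the data must be sorted by the group key first:

```python
from itertools import groupby

data = [
    ['Vienna', 'Austria', 11, 12, 13], ['Salzburg', 'Austria', 12, 22, 23],
    ['Stuttgart', 'Germany', 12, 22, 23], ['Berlin', 'Germany', 12, 22, 23],
    ['Paris', 'France', 12, 22, 23],
]

# sort by country first, otherwise groupby would split non-adjacent groups
data.sort(key=lambda row: row[1])
grouped = {country: list(rows)
           for country, rows in groupby(data, key=lambda row: row[1])}
print(list(grouped))  # ['Austria', 'France', 'Germany']
```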

Data aggregation (3): Dictionary-based grouping

Result:

{
 'Austria':
  [['Vienna', 'Austria', 11, 12, 13], ['Salzburg', 'Austria', 12, 22, 23]],
 'Germany':
  [['Stuttgart', 'Germany', 12, 22, 23], ['Berlin', 'Germany', 12, 22, 23]],
  'France':
  [['Paris', 'France', 12, 22, 23]]
}

Data aggregation (4): Dictionary-based grouping


data = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]

# Same as before:
from collections import defaultdict
groupby = defaultdict(list)
for row in data:
   groupby[row[0]].append(row[1])

# We could use the dictionary now for groupwise printing:
print(groupby.items())
for key, values in groupby.items():
    for thing in values:
        print("A "+thing+" is a(n) "+key+".")
    print(" ")

A bear is a(n) animal.
A duck is a(n) animal.

A cactus is a(n) plant.

A speed boat is a(n) vehicle.
A school bus is a(n) vehicle.

Data aggregation (5): Computing groupwise aggregates

Data aggregation (6): Computing groupwise aggregates

quant  = [['a', 5], ['a', 1], ['b', 1], ['a', 2], ['b',3], ['b',1], ['a',4]]

from collections import defaultdict
grouped = defaultdict(list)
for row in quant:
    grouped[row[0]].append(row[1])
print(grouped.items())

# element count (group size)
{i: len(v) for i,v in grouped.items()}
# sum 
{i: sum(v) for i,v in grouped.items()}
# mean and median:
from statistics import mean
{i: mean(v) for i,v in grouped.items()}
from statistics import median
{i: median(v) for i,v in grouped.items()}

Data aggregation (7): Computing groupwise aggregates

qual  = ['a', 'c', 'a', 'c', 'b', 'e', 'a', 'c', 'b', 'e','b', 'e', 'a', 'd']

# frequency (absolute)
from collections import defaultdict
freq = defaultdict(int)
for el in qual:
   freq[el] += 1

# frequency (relative)
print({i: v/len(qual) for i,v in freq.items()})


# location (mode):
from statistics import mode
mode(qual)
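The standard library's collections.Counter computes absolute frequencies in one step and also gives the most frequent elements directly:

```python
from collections import Counter

qual = ['a', 'c', 'a', 'c', 'b', 'e', 'a', 'c', 'b', 'e', 'b', 'e', 'a', 'd']

freq = Counter(qual)        # absolute frequencies in one step
print(freq['a'])            # 4
print(freq.most_common(1))  # [('a', 4)]
```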

Data filtering, sorting and aggregation: EUROSTAT Urban Audit

Question.

Discussion: Which of the following problems needs aggregation?

  1. Which one is the biggest city?
  2. What are the (most recent) populations per country?
  3. Which ones are the 10 biggest cities?
  4. What was the population of the city with the code AT004C1 in 2014?
  5. What was the population of the city named "Innsbruck" in 2015?
  6. How many cities per country does that dataset contain?
  7. Which country/ies has the most cities listed?
  8. Which city/ies grew fastest over the past 10 years?
  9. What is the average city population per country?

Excursion: Data filtering, sorting and aggregation made easy with Pandas:

Pandas is a package/library for pragmatic, high-performance data analysis in Python, which offers very efficient and convenient handling of tabular data in so-called DataFrames. WARNING: We do not recommend relying solely on Pandas just yet, since it is more opaque than "pure" Python... but we will exemplify Pandas in a separate notebook:

lecture-materials/unit3/03_pandas_intro.ipynb
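As a teaser, the groupwise aggregation from the previous slides shrinks to one line in Pandas; a minimal sketch on a tiny in-memory table (the population numbers are made up, and this assumes pandas is installed):

```python
import pandas as pd

# Tiny made-up table, mirroring the dictionary-based grouping above:
df = pd.DataFrame(
    [['Vienna', 'Austria', 1900000],
     ['Salzburg', 'Austria', 150000],
     ['Berlin', 'Germany', 3645000]],
    columns=['city', 'country', 'population'])

# groupwise mean population per country in one line:
print(df.groupby('country')['population'].mean())
```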

References