Modifying the values contained in given variables, and/or
Adding variables (e.g., derived from existing ones), and/or
Reshaping the dataset (i.e., its layout)
Permitted (value) transformations are indicated by the types of
variables.
Data transformation (2): Goals
Datasets ("in the wild"):
values may not be eligible for the intended checks and value-based operations (e.g.,
numeric operations)
you may need to reshape the data layout to proceed with data preparation (scanning, filtering, sorting)
Some examples: When a dataset is consumed from a datasource as raw strings:
it does not allow for number operations (e.g., "5" + "5" yields "55", not 10)
it does not allow for comparison or sorting (e.g., "5" != 5; "11" < "2" lexicographically; "2016-10-11" vs. "11-10-2016")
it does not allow for splitting & combining variables
it does not allow for combining datasets (e.g., mixed letter cases as in "Wien" vs. "wien")
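These pitfalls are easy to demonstrate directly in Python (a small illustrative sketch):

```python
# Raw strings behave differently from numbers.
assert "5" + "5" == "55"          # concatenation, not addition
assert ("5" == 5) is False        # a string never equals a number
assert ("11" > "2") is False      # lexicographic: '1' sorts before '2'
assert int("5") + int("5") == 10  # converting first restores numeric behavior
assert "Wien".lower() == "wien"   # case-normalizing enables matching
```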
Data transformation (3): Value types
Let us first take a look at data types and how we can handle them in Python.
Python has the following "built-in", bit-representational
("primitive") datatypes:
Numerical types: int, float, complex
Boolean
String (i.e., sequences of Unicode characters)
(Collections: lists, tuples, dictionaries)
Other (structured) data types:
Date, Datetime
URL
Data transformation (4): Value types
Any (planned) transformations might need introspection:
type(variable)
#e.g.
>>> type(5)
<class 'int'>
isinstance(x, t) # returns True if x is of type t, else False
#e.g.
>>> isinstance(5, int)
True
ATTENTION: Not all values in a column may be of the same type!
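For example, a heterogeneous column can be partitioned by type before any numeric operation is attempted (an illustrative sketch; the column values are made up):

```python
# A column read from a messy source may mix types.
column = [5, "5", 2.0, None, "n/a"]

# Separate the values that are already numeric from the rest.
numeric = [v for v in column if isinstance(v, (int, float))]
other   = [v for v in column if not isinstance(v, (int, float))]

print(numeric)  # [5, 2.0]
print(other)    # ['5', None, 'n/a']
```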
Data transformation (5): Number conversions
int(x)   # Return an integer object constructed from a number or string x
float(x) # Return a floating point number constructed from a number or string x.
Examples
>>> float(" -12345\n")
-12345.0
>>> int(2.0)
2
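Since conversion raises a ValueError on unparseable input, messy columns are often converted through a small wrapper (an illustrative sketch; to_float is a hypothetical helper name):

```python
def to_float(s):
    """Try to convert a raw value to float; return None when it cannot be parsed."""
    try:
        return float(s)
    except (TypeError, ValueError):
        return None

print(to_float(" -12345\n"))  # -12345.0
print(to_float("n/a"))        # None
```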
Data transformation (6): Truth (boolean) values
bool(x)
Return a Boolean value, i.e., one of True or False. x is converted using the standard truth-testing procedure.
>>> bool(0)
False
>>> bool(10)
True
Data transformation (7): Truth-value checks
Any object can be tested for truth value, for use in an if or while condition or as operand of the Boolean operations below.
The following values are considered false:
None
False
zero of any numeric type, for example, 0, 0.0, 0j.
any empty sequence, for example, '', (), [].
any empty mapping, for example, {}.
instances of user-defined classes, if the class defines a __bool__() or __len__() method that returns the integer zero or the Boolean value False.
All other values are considered true — so objects of many types are always true.
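The class-based rule above can be sketched as follows (Basket is a hypothetical example class):

```python
class Basket:
    """Truth value is delegated to __len__ when no __bool__ is defined."""
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)

print(bool(Basket([])))     # False: __len__ returns 0
print(bool(Basket(['a'])))  # True: __len__ returns 1
```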
Data transformation (8): Date/datetime values
Python offers several options (modules) for working with dates and datetime information, allowing for parsing, converting, comparing, and manipulating dates and times.
The standard datetime Python module does not automatically detect and
parse date/time strings; it still requires you to provide the
format/pattern string manually.
Options with (some) auto-detection:
dateparser provides modules to easily parse localized dates in almost any string formats commonly found on web pages.
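With the standard module, the format string must match the input exactly (a minimal sketch; dateparser, by contrast, would detect such formats automatically):

```python
from datetime import datetime

# Two layouts of the same date: each needs its own format string.
d1 = datetime.strptime("2016-10-11", "%Y-%m-%d")
d2 = datetime.strptime("11-10-2016", "%d-%m-%Y")
print(d1 == d2)  # True: both parse to 11 October 2016
```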
Some useful helper functions for dealing with strings and for finding index positions
>>> word = "Data Processing"
>>> print(word.count('a'))    # count how many times 'a' occurs in the string
2
>>> print(word.find("D"))     # index of the first "D" in the string
0
>>> print(word.index("Data")) # index of the substring "Data"
0
>>> print(word.index("Pro"))
5
>>> print(len("Data"))
4
Data transformation (14): Substring search/replace
str.replace(old, new[, count])
Return a copy of the string with all occurrences of substring old replaced by new. If the optional argument count is given, only the first count occurrences are replaced.
>>> word = "Data Processing"
>>> word.replace('Processing', 'Science')
'Data Science'
>>> float("100,50".replace(",", "."))
100.5
# while
>>> float("100,50")
ValueError: could not convert string to float: '100,50'
Data transformation (15): Testing for character classes
word = "Data Processing"
word.isalnum()       # True if all characters are alphanumeric
word.isalpha()       # True if all characters are alphabetic
word.isdigit()       # True if all characters are digits
word.istitle()       # True if the string is title-cased
word.isupper()       # True if all cased characters are upper case
word.islower()       # True if all cased characters are lower case
word.isspace()       # True if all characters are whitespace
word.endswith('g')   # True if the string ends with 'g'
word.startswith('D') # True if the string starts with 'D'
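Applied to the example string, note that a single space already makes isalnum() and isalpha() return False (illustrative results):

```python
word = "Data Processing"
print(word.isalnum())      # False: the space is neither a letter nor a digit
print(word.istitle())      # True: each word starts with one upper-case letter
print(word.endswith('g'))  # True
print("42".isdigit())      # True: every character is a digit
```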
Data transformation (16): Reshaping
Reshaping can involve stacking or unstacking a dataset:
Stacking (melting): Turning columns into rows; typically for
processing and analysis.
Unstacking: Turning rows into columns; typically for presentation.
A tidy dataset is one in which the abstract interpretation of a dataset
(value, variable, observation) is reflected 1:1 by its structure.
Each variable forms a column.
Each observation forms a row.
Each type of data object (observation unit) forms a separate table.
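Stacking (melting) can be sketched in plain Python: a wide table whose year columns hold values of one variable becomes a long table with one observation per row (the city names and numbers are made up):

```python
# Wide (messy) layout: the headers '2015'/'2016' denote values of a
# "year" variable, not variable names.
header = ["city", "2015", "2016"]
wide = [["Vienna", 10, 11],
        ["Berlin", 20, 21]]

# Stack (melt): one (city, year, population) observation per row.
long = [[row[0], year, value]
        for row in wide
        for year, value in zip(header[1:], row[1:])]
print(long)
# [['Vienna', '2015', 10], ['Vienna', '2016', 11],
#  ['Berlin', '2015', 20], ['Berlin', '2016', 21]]
```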
Data transformation (17): Reshaping
Messy datasets result from violating these three main rules in
different ways, for example:
Column headers (labels) denote values, not variable names;
Multiple variables are stored in one column;
Variables are stored both in rows and columns;
Multiple types of data objects are stored in the same dataset (e.g.,
regions and cities across years);
A single observational unit is stored in multiple datasets (e.g.,
split sets by country);
Data transformation (18): Reshaping
Data transformation (19): Reshaping
Question.
Discuss: Is the Urban Audit dataset tidy or messy?
Data scanning (1)
Scanning involves reading in and processing a dataset in a piecemeal manner, e.g.:
row by row (in a messy dataset)
observation by observation (in a tidy dataset)
column by column (in a messy dataset)
variable by variable (in a tidy dataset)
value by value (per row/column, per observation/variable)
Python example ("row by row"):
with open('./data/urb_cpop1.tsv', 'r') as f:
    rows = f.readlines()
for eachRow in rows:
    print(eachRow)
Data scanning (2)
For a given dimension (e.g., rows), scanning may be used to inspect:
the "head" of a dataset
the "tail" of a dataset
a "sample" (subset, slice) of a dataset
random vs. non-random
ordered vs. unordered
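These inspection modes can be sketched with list slicing and the random module (illustrative data):

```python
import random

rows = [f"row {i}" for i in range(100)]

head   = rows[:5]                # the "head" of the dataset
tail   = rows[-5:]               # the "tail"
sliced = rows[10:15]             # an ordered, non-random sample (slice)
sample = random.sample(rows, 5)  # a random, unordered sample

print(head[0], tail[-1], len(sample))  # row 0 row 99 5
```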
Data sniffing
Sniffing involves detecting in a guided, semi-automated manner:
Details of a dataset layout, in particular:
headers
row labels
column separators
dimensions
The data types of column values, e.g.:
Are columns homogeneous or heterogeneous?
Auto-detection of datetime formats
Sniffing requires means of data scanning
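The standard library's csv.Sniffer offers exactly this kind of guided detection: it guesses the column separator and whether a header row is present (a minimal sketch on made-up data):

```python
import csv
import io

raw = "city;country;pop\nVienna;Austria;1897\nBerlin;Germany;3645\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(raw)    # detect the column separator
print(dialect.delimiter)        # ';'
print(sniffer.has_header(raw))  # heuristic header detection

rows = list(csv.reader(io.StringIO(raw), dialect))
print(rows[0])                  # ['city', 'country', 'pop']
```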
Data filtering (1)
Filtering: Removing or subsetting data objects (observations) based on a filter condition.
Filtering can be considered as a conditional scanning.
Data filtering (2): Python basics
# keep only the positive values
L = [0, 1, 2, -1, 4, -6]
Lfiltered = []
for i in L:
    if i > 0:
        Lfiltered.append(i)
# another way to filter lists is to use a list comprehension
Lfiltered = [i for i in L if i > 0]  # same as above
Data filtering (3): Python basics
# keep only the sublists with a positive second element
L = [['a', 0], ['b', 1], ['c', 2], ['d', -1], ['e', 4], ['f', -6]]
Lfiltered = []
for i in L:
    if i[1] > 0:
        Lfiltered.append(i)
# another way to filter lists is to use a list comprehension
Lfiltered = [i for i in L if i[1] > 0]  # same as above
Data filtering (4): Python basics
Dictionaries:
# keep only the dictionaries with a positive value
L = [{'a': 0}, {'b': 1}, {'c': 2}, {'d': -1}, {'e': 4}, {'f': -6}]
Lfiltered = []
for d in L:
    for k, v in d.items():
        if v > 0:
            Lfiltered.append(d)
# comprehension alternative
[i for i in L for k, v in i.items() if v > 0]
Data filtering (5): Applications
Data filtering has many applications:
"search" can be seen as filtering
focusing on only the relevant parts of the data
eliminating unnecessary content (e.g., removing unwanted data-object types in reshaping)
removing content which cannot be processed (e.g., structurally missing values)
reducing the amount of data to be processed at once, per job
(chunking data)
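Chunking, the last point, can be sketched with itertools.islice (chunks is a hypothetical helper name):

```python
from itertools import islice

def chunks(iterable, size):
    """Yield successive lists of at most `size` items from the iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Process 7 "rows" in jobs of 3 rows each.
print(list(chunks(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```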
Data filtering (6): Urban-audit dataset
Recall: We got multiple variables in the dataset.
Task: Reduce it to one measure variable (population count)
Observe: indicator identifier and 2-letter ISO country code combined in the first column
The identifier for population-count variable is DE1001V
Modify the scanning previous example to include a (filter) condition:
with open('./data/urb_cpop1.tsv', 'r') as f:
    rows = f.readlines()
for eachRow in rows:
    # only output those lines with a trailing 'DE1001V' in the first column
    colEnd = eachRow.find('\t')
    if eachRow[colEnd-7:colEnd] == 'DE1001V':
        print(eachRow)
Question.
How could this be further improved?
Observe: The indicators in the file urb_cpop1.tsv are sorted, and the DE1001V rows appear first!
It is important to inspect the data before processing it!
Data filtering (7): Costs
Question.
How expensive is it to do filtering?
For every filtering operation you need to do one scan...
... but several conjunctive filtering conditions can be combined (keyword: and) in one scan
... sometimes (e.g., when the file is sorted, or when you search only for a single result) you can stop before having to scan the whole file!
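The early-stop case can be sketched as follows: once the sorted block of matching rows has ended, the scan breaks out of the loop (the rows are made up):

```python
# Sorted input: all 'DE1001V' rows form one contiguous block.
rows = ["DE1001V,AT001C\t1800000", "DE1001V,AT002C\t280000",
        "DE1002V,AT001C\t900000", "DE1002V,AT002C\t140000"]

matches = []
in_block = False
for row in rows:
    if row.startswith("DE1001V"):
        matches.append(row)
        in_block = True
    elif in_block:
        break  # past the block: no further matches can follow

print(len(matches))  # 2
```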
Data sorting (1)
Sorting: Changing the order of data objects (observations) depending on the ordinal values of one or several of their variables (attributes).
In-place sorting: Python lists have a built-in list.sort() method that modifies the list in-place.
Out-place sorting: There is also a sorted() built-in function that builds a new sorted list from an iterable.
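Both variants, plus sorting by a chosen variable via key=, can be sketched as (illustrative data):

```python
cities = [['Vienna', 1897], ['Berlin', 3645], ['Paris', 2148]]

# Out-of-place: sorted() returns a new list; the original is untouched.
by_pop = sorted(cities, key=lambda row: row[1])
print(by_pop[0])  # ['Vienna', 1897]: smallest value of the second variable

# In-place: list.sort() reorders the list itself and returns None.
cities.sort(key=lambda row: row[0])
print(cities[0])  # ['Berlin', 3645]: alphabetically first city name
```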
data = [
['Vienna','Austria',11,12,13], ['Salzburg','Austria',12,22,23],
['Stuttgart','Germany',12,22,23], ['Berlin','Germany',12,22,23],
['Paris','France',12,22,23]
]
# verbose tactic
groupby = {}
for item in data:
    group = item[1]
    if group not in groupby:
        groupby[group] = []
    groupby[group].append(item)
print(groupby)
# more compact tactic
from collections import defaultdict
groupby = defaultdict(list)
for row in data:
    groupby[row[1]].append(row)
data = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
from collections import defaultdict
groupby = defaultdict(list)
for row in data:
    groupby[row[0]].append(row[1])
print(groupby.items())
for key, values in groupby.items():
    for thing in values:
        print("A " + thing + " is a " + key + ".")
    print(" ")
A bear is a animal.
A duck is a animal.
A cactus is a plant.
A speed boat is a vehicle.
A school bus is a vehicle.
Data aggregation (5): Computing groupwise aggregates
Typical tasks you want to do on lists or on groups: provide summary
descriptors (statistics).
The kind of summary descriptor computable depends on the variable
type (quantitative, qualitative)
Frequency: Counting the elements contained in a group (qualitative
variables; absolute/relative)
Mode: The value of the highest frequency (qualitative variables)
Location: mean and median (quantitative variables)
Spread: range and variance (quantitative variables)
Data aggregation (6): Computing groupwise aggregates
quant = [['a', 5], ['a', 1], ['b', 1], ['a', 2], ['b', 3], ['b', 1], ['a', 4]]
from collections import defaultdict
groupby = defaultdict(list)
for row in quant:
    groupby[row[0]].append(row[1])
print(groupby.items())
# element count (group size)
{i: len(v) for i, v in groupby.items()}
# sum
{i: sum(v) for i, v in groupby.items()}
# location:
from statistics import mean
{i: mean(v) for i, v in groupby.items()}
from statistics import median
{i: median(v) for i, v in groupby.items()}
Data aggregation (6): Computing groupwise aggregates
qual = ['a', 'c', 'a', 'c', 'b', 'e', 'a', 'c', 'b', 'e', 'b', 'e', 'a', 'd']
# frequency (absolute)
from collections import defaultdict
freq = defaultdict(int)
for el in qual:
    freq[el] += 1
# frequency (relative)
print({i: v/len(qual) for i, v in freq.items()})
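The mode (the highest-frequency value mentioned above) follows directly from the absolute frequencies (a small sketch):

```python
qual = ['a', 'c', 'a', 'c', 'b', 'e', 'a', 'c', 'b', 'e', 'b', 'e', 'a', 'd']

freq = {}
for el in qual:
    freq[el] = freq.get(el, 0) + 1

# Mode: the key whose frequency is largest.
mode = max(freq, key=freq.get)
print(mode, freq[mode])  # a 4
```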
References
Pang-Ning Tan, Michael Steinbach, Vipin Kumar (2006): "Introduction to Data Mining", Chapter 2: "Data", Pearson.
Hadley Wickham (2014): "Tidy data", The Journal of Statistical Software (59), DOI: 10.18637/jss.v059.i10