SBWL 1: Data Processing 1 (PI2.0)

Axel Polleres, Stefan Sobernig


October 15, 2019

Announcements

Data Science

What is Data Science?

... bottom line: there is no single definition, but there are some main recurring terms:

Datafication

Growing areas of private and social life are reflected in computerised data, to be turned into "valuable" insights.

... plus some recurring mention of common skills...

Data Scientists' Skills

                    | Data analyst                       | Data scientist
Analyt. skills      | Analytical thinking                | Excellent in math and statistics
                    | Apply established analysis methods | Visualisation, new approaches
Tech. skills        | Data modelling, databases          | Data modelling, databases
                    | Use of analysis tools              | Data mining
                    | Programming skills of advantage    | Algorithm development, method abstraction
Domain knowledge    | Detailed domain knowledge          | Background domain knowledge
                    | Project management                 | Creativity
                    | Communication skills               | Team work

Data Scientists' Skills

"3 sexy skills of data geeks" (Nathan Yau, Rise of the Data Scientist, 2009)

What problems does Data Science address?

Examples from data journalism





Dataset for published articles

What Best Practices can we observe from these Data Science examples?

Some data journalism (or, generally, data science) best practices:

provenance, simplifying assumptions) clear

The role(s) of a Data Scientist in the Enterprise

From the "goldrush" days...

https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

to a more realistic, diversified picture of the roles of a Data Scientist ("Modeling" Data Scientist vs. "Decision" Data Scientist)...

https://hbr.org/2018/11/the-kinds-of-data-scientist

... but overall, your role will be that of a "Generalist" rather than a "Specialist":

https://hbr.org/2019/03/why-data-science-teams-need-generalists-not-specialists

Data Science as a Process

What does a Data Science process look like?

Example of a "classic" data-driven process: ETL in data warehousing

See, e.g., Matteo Golfarelli, Stefano Rizzi. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, 2009.

What does a Data Science process look like?

"Classic" views are challenged by datafication:

What does a Data Science Process look like?

"Knowledge Discovery in Databases (KDD)" process (often used in the course of Data Mining)





Source: Howard Hamilton

What does a Data Science Lifecycle look like?

Towards a "Data Science workflow"





Cathy O'Neil, Rachel Schutt. Doing Data Science: Straight Talk from the Frontline (O'Reilly, 2013) (Chapter 2)

Iterative Operationalisation





Danyel Fisher & Miriah Meyer. "Making Data Visual" (O'Reilly, 2018) (Chapter 2)

Iterative Operationalisation (cont'd)

Data Science Lifecycle: Summary

Again, not a single definition, but some recurring steps (that you often/typically need to iterate over several times):

  1. find and collect all relevant data
  2. identify issues & problems within the data
  3. organise / transform / merge data
  4. systematically operationalise questions about the data: proxies
  5. select a visualisation, a statistical technique, or a machine-learning technique as an outcome of operationalisation
  6. provide interpretations and limitations of the results
  7. communicate results

Challenges in Data Science

WARNING: At each stage, things can go wrong! Any filtering/aggregation may bias the data!

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.” (New York Times)

Challenges in Data Science (cont'd)





Data Science Challenges: your own experiences?

Which difficulties have you already experienced when working with data?

  1. ... ever had problems loading/ importing a file someone sent to you because of an unknown file format?
  2. ... ever encountered something like this: "K�snudl"?
  3. ... ever encountered blanks (i.e., missing values) in your data?
  4. ... ever seen an observation (an insight, a trend) disappear when combining data from different data sets (a.k.a. Simpson's paradox)?
  5. ... more on that in the next lectures! (and also in course 2)
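Two of these difficulties can be reproduced in a few lines. The sketch below is illustrative only: the success counts are made-up numbers (not real data), and all variable names are assumptions. The first part shows how a wrongly guessed character encoding produces the garbled character from question 2; the second part shows a trend that holds within each group but reverses once the groups are combined (Simpson's paradox).

```python
# Part 1: a wrong encoding guess garbles non-ASCII characters.
word = "Käsnudl"
raw_bytes = word.encode("latin-1")                      # bytes as written by the sender
garbled = raw_bytes.decode("utf-8", errors="replace")   # reader wrongly assumes UTF-8
print(ascii(garbled))    # 'K\ufffdsnudl': U+FFFD is the replacement character
repaired = raw_bytes.decode("latin-1")                  # decoding with the right encoding
print(repaired == word)

# Part 2: Simpson's paradox with made-up (successes, trials) counts.
counts = {
    "easy cases": {"A": (18, 20), "B": (80, 100)},
    "hard cases": {"A": (30, 100), "B": (4, 20)},
}

def rate(successes, trials):
    """Share of successful trials."""
    return successes / trials

# Within each group, method A has the higher success rate ...
for group, methods in counts.items():
    print(group, {m: rate(*c) for m, c in methods.items()})

# ... but after combining the groups, method B looks better,
# because A was mostly applied to the hard cases.
overall = {}
for method in ("A", "B"):
    successes = sum(counts[g][method][0] for g in counts)
    trials = sum(counts[g][method][1] for g in counts)
    overall[method] = rate(successes, trials)
print("combined", overall)
```

The reversal comes purely from the unequal group sizes, which is why any aggregation step deserves a second look.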

Data Science Ethics

Ethics in Data Science: FACT

Plus, last, but not least:

Ethics in Data Science: FACT (cont'd)





Source http://www.responsibledatascience.org/

Data Science Challenges and Ethics: Summary

Notice.

Data preprocessing and preparation steps may take 80% of the work or more. This is the focus of our course "Data Processing 1"!

Data Science Tools

Data Science Tools: Python and R





Source: https://www.kdnuggets.com/2015/05/r-vs-python-data-science.html

And some newer links:

Python and R

Python

Pros:
  • general purpose programming
  • rapid prototyping
  • very active community, currently 200,064 packages (libraries)
  • Python Software Foundation License

Cons:
  • dynamically typed
  • not optimised for specific statistical and data analysis (but standard libraries emerging)
  • there are (arguably) more robust general purpose languages

R

Pros:
  • designed for statistical data analysis
  • visualisation
  • very active community, currently 15111 packages (libraries)
  • GNU GPL v2

Cons:
  • somewhat domain specific (numerical, statistical data analysis)
  • performance? big data APIs and libraries
  • low level data wrangling


Bottom line: both languages are actively developed; use each where it fits. Be ready to switch and combine!

Why Python and R

The Python vs R debate confines you to one programming language. You should look beyond it and embrace both tools for their respective strengths. Using more tools will only make you better as a data scientist. [TheNextWeb]

More tools

Big Data tools and Libraries are developing fast...

... no way to learn or teach all of these tools.

But: Good news! Many of those have APIs to connect to either Python or R; you will learn about some of them in course 3.

Python & Jupyter





Outline

Why Python?

Python is a dynamic general-purpose language with which one can achieve results quickly in only a few lines of code.

Companies

See also a verified list of companies using Python

Versions 2.7 vs. 3.x

Python is currently available in two versions: Version 2.x and 3.x.

We are using Python 3 in this course

Examples:

Python 2

Python 3
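As a hedged illustration (not necessarily the examples shown in the lecture), two well-known differences between the versions are sketched below in Python 3. The Python 2 form appears only in a comment, since it is not valid Python 3; the variable names are assumptions.

```python
# Python 2:  print "Data Science"      (print was a statement)
# Python 3:  print is an ordinary function and needs parentheses.
print("Data Science")

# Because print is a function, it accepts keyword arguments such as sep.
print("Data", "Science", sep="-")

# Python 3 strings (str) are Unicode text; raw bytes are a separate type.
text = "Käse"
data = text.encode("utf-8")
print(len(text), len(data))                        # characters vs. bytes
print(type(text).__name__, type(data).__name__)    # str bytes
```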

Jupyter Notebook





Brief Python3 Tutorial

Jupyter Notebook Version

The following slides are also available as Jupyter notebook python3-intro.ipynb.

A useful helper: The print function

print('test')

Basic Data Types

Basic data types are the essential building blocks for handling information in Python

Strings

Any text between two matching quotes (either single quotes ' ' or double quotes " ")

Examples

'data'
"science"
'I study at WU Vienna'

Exercise.

Create some strings and play with the different quotes

see also Chapter 3.1.2 in the Python tutorial (en , de)
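A minimal sketch of how the two quote styles interact; the variable names are illustrative. Both styles create the same kind of value, and choosing one style on the outside lets you use the other inside the text.

```python
# Single and double quotes create the same kind of value.
s1 = 'data'
s2 = "data"
print(s1 == s2)

# Use one kind of quote outside to embed the other kind inside.
apostrophe = "It's data science"
quotation = 'She said "hello"'
print(apostrophe)
print(quotation)

# Alternatively, escape the quote character with a backslash.
escaped = 'It\'s data science'
print(escaped == apostrophe)
```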

Integers

Integers are whole numbers

Terminal> python3 -c 'print( type( 1 ) )'
<class 'int'>

Some examples:

1
0
-5

Floats

Floats are numbers with a decimal point (floating-point numbers).

Terminal> python3 -c 'print( type( 2.2 ) )'
<class 'float'>

Some examples:

1.0
15.4

Numbers with leading zero

Python 3 does not allow integer literals with a leading zero

0034

Terminal> python3 -c '0034'
SyntaxError: invalid token
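Data files often contain zero-padded values such as postal codes or IDs. A minimal sketch, assuming such a value arrives as text: the built-in int() converts a zero-padded string, whereas the same digits written as a literal are rejected. The variable names are illustrative.

```python
padded = "0034"            # e.g. a value read from a text file

# Converting the string works and drops the leading zeros.
number = int(padded)
print(number)

# The literal 0034 in source code is a syntax error; compile() lets us
# observe this without stopping the script.
try:
    compile("0034", "<example>", "eval")
    literal_ok = True
except SyntaxError:
    literal_ok = False
print(literal_ok)

# To keep the zeros (e.g. for an ID), keep the value as a string
# or pad it again when printing.
print(str(number).zfill(4))
```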

Operations for Numbers: Addition

5+4

Terminal> python3 -c 'print(5+4)'
9

Operations for Numbers: Subtraction

10-34

Terminal> python3 -c 'print( 10-34 )'
-24

Operations for Numbers: Multiplication

5*4

Terminal> python3 -c 'print(5*4)'
20

2.5*3

Terminal> python3 -c 'print( 2.5*3 )'
7.5

Operations for Numbers: Division

Python 3

4/8

Terminal> python3 -c 'print(4/8)'
0.5

see also Chapter 3.1.1 in the Python tutorial (en , de)
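In Python 3 the "/" operator always performs true division and returns a float (in Python 2, dividing two integers such as 4/8 gave 0). A short sketch of the related operators, for comparison:

```python
print(4 / 8)     # true division: always a float
print(8 / 4)     # still a float, even though the result is whole
print(7 // 2)    # floor division: rounds down to a whole number
print(7 % 2)     # modulo: remainder of the division
print(-7 // 2)   # floor division rounds towards negative infinity
```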

Strings vs. Integers

Question.

The "==" operator compares if two values are equal. What happens if we execute the following command?

5=="5"

Terminal> python3 -c 'print( 5=="5" )'
False

Notice.

If a number is entered within quotes, the value is processed as a string.
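Values read from files usually arrive as strings, so an explicit conversion is needed before comparing or calculating. A minimal sketch using the built-in int() and str() functions:

```python
print(5 == "5")          # an int never equals a str
print(5 == int("5"))     # convert the string to a number first
print(str(5) == "5")     # or convert the number to a string
print("5" + "5")         # + on strings concatenates
print(int("5") + 5)      # + on numbers adds
```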

Float vs. Integers

Try the following

Question.

The "==" operator compares if two values are equal. What happens if we execute the following command?

5==5.5

Terminal> python3 -c 'print( 5==5.5 )'
False
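Integers and floats compare by numeric value, so an int can equal a float. Floats are stored with limited precision, however, which can make an exact comparison fail unexpectedly. A small sketch, using math.isclose from the standard library to compare with a tolerance:

```python
import math

print(5 == 5.0)                       # same numeric value, different types
print(type(5) == type(5.0))           # int vs. float
print(0.1 + 0.2 == 0.3)               # rounding error in binary floating point
print(math.isclose(0.1 + 0.2, 0.3))   # compare with a tolerance instead
```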

Booleans

A boolean data type has only two possible values: True or False

Terminal> python3 -c 'print( type( True ) )'
<class 'bool'>
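Booleans typically arise as the result of comparisons and can be combined with and, or and not. A minimal sketch with illustrative variable names:

```python
is_small = 3 < 5          # comparisons evaluate to a boolean
is_equal = 3 == 5
print(is_small, is_equal)

print(is_small and is_equal)   # True only if both are True
print(is_small or is_equal)    # True if at least one is True
print(not is_small)            # negation
```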

Data Containers

Python provides the following containers:

Variables

Variables are a means to store and reference data

Python does not require type declarations (unlike Java); defining a variable is thus as simple as:

 VARIABLE_NAME = VALUE

Number assignments

For instance, assigning the value 1 to the variable a:

a = 1

String assignments

For instance, assigning the value "Data Science" to the variable title:

title = "Data Science"
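Since no type is declared, a variable name simply refers to whatever value was assigned last. The following sketch shows the same name holding values of different types:

```python
a = 1
print(type(a))

a = "one"          # the same name can later refer to a value of another type
print(type(a))

title = "Data Science"
print(len(title))  # variables can be used wherever a value is expected
```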

Operations with variables

One can also combine operations with variables

x = 5
y = 10
c = x*y
print(c)

Terminal> python3 -c 'x=5;y=10;c=x*y; print(c)'
50

Operations with variables

a = 'Data'
b = 'Science'
print(a+b)

Terminal> python3 -c 'a = "Data"; b = "Science"; print( a+b )'
DataScience
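Note that "+" joins the strings without inserting a space. A short sketch of related cases; the variable names count and message are illustrative:

```python
a = 'Data'
b = 'Science'
print(a + ' ' + b)      # add a separator explicitly
print(a * 2)            # repetition

# Mixing strings and numbers with + raises a TypeError.
count = 10
try:
    message = a + count
except TypeError:
    message = a + str(count)   # convert the number first
print(message)
```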

Lists

Do you remember?



Lists

A list is an ordered collection of items

You can create a list in Python by placing the items in square brackets ([]) and separating the items with a comma.

 [ item1, item2, item3, ..., itemN ]

Terminal> python3 -c 'print( type( [] ) )'
<class 'list'>

Lists: Example

[ 'Milk', 'Eggs', 'Lettuce' ]
#or
[ 12.5, 8.0, 61.3, 87.5 ]

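Let's store the list in a variable so that we can reuse it later in the code

# "values" rather than "list", to avoid shadowing the built-in type
values = [ 12.5, 8.0, 61.3, 87.5 ]
print(values)

Terminal> python3 -c 'values = [ 12.5, 8.0, 61.3, 87.5 ]; print(values)'
[12.5, 8.0, 61.3, 87.5]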

see also Chapter 3.4 and 5 in the Python tutorial (en , de)
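Once a list is stored in a variable, its items can be accessed by position. A minimal sketch of common list operations; the variable name is illustrative:

```python
values = [12.5, 8.0, 61.3, 87.5]

print(values[0])      # first item (indices start at 0)
print(values[-1])     # last item
print(len(values))    # number of items
print(values[1:3])    # slice: items at index 1 and 2

values.append(3.2)    # add an item at the end
print(values)
```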

List Concatenation

./src/listex.py

a1=['a','b','c']
a2=['d','e']
a3=a1+a2
print(a3)

Terminal> python3 ./src/listex.py
['a', 'b', 'c', 'd', 'e']
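Beyond "+", only a few arithmetic operators are defined for lists. The sketch below shows that a list can be repeated with an integer, while subtracting two lists raises a TypeError; the comprehension at the end is one possible way to remove items, and its variable names are illustrative.

```python
a1 = ['a', 'b', 'c']
a2 = ['d', 'e']
a3 = a1 + a2

print(a2 * 2)             # list * integer repeats the list

# list - list is not defined and raises a TypeError.
try:
    a3 - a2
    subtraction_ok = True
except TypeError:
    subtraction_ok = False
print(subtraction_ok)

# One way to "subtract": keep the items of a3 that are not in a2.
remaining = [item for item in a3 if item not in a2]
print(remaining)
```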

Notice.

"-" and "/" are not defined for lists; "*" only works between a list and an integer (it repeats the list), not between two lists.

Iterating over lists

Do you remember?



Iterating over lists

./src/listex2.py

# "items" rather than "list", to avoid shadowing the built-in type
items=[11,22,33,44,55]
for item in items:
    print(item)

Terminal> python3 ./src/listex2.py
11
22
33
44
55
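A loop becomes more useful when it accumulates or filters while iterating. A minimal sketch with illustrative variable names, using the built-in enumerate() to obtain each item's position:

```python
items = [11, 22, 33, 44, 55]

# Accumulate a result while iterating.
total = 0
for item in items:
    total = total + item
print(total)

# enumerate() yields the position together with each item.
for index, item in enumerate(items):
    print(index, item)

# Collect only the items that satisfy a condition.
even_items = []
for item in items:
    if item % 2 == 0:
        even_items.append(item)
print(even_items)
```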

Dictionaries

A Python dictionary is a more complex data container than a variable or a list.

  • key: the word you look up
  • value: the result of the lookup

 { key1: value1, key2: value2 }





Source Wikimedia

Terminal> python3 -c 'print( type(  {} ) )'
<class 'dict'>

Dictionaries: Example

./src/dict.py

wordCounts={ 'Data':10, 'Science': 1, 'Course':5 }
print(wordCounts)

# access the value stored under a key
print( wordCounts['Data'] )

Terminal> python3 ./src/dict.py
{'Data': 10, 'Science': 1, 'Course': 5}
10
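Dictionaries can be changed after creation. The sketch below shows updating and adding entries, testing for a key, and iterating over all key-value pairs; the extra keys 'Python' and 'R' are made up for illustration.

```python
wordCounts = {'Data': 10, 'Science': 1, 'Course': 5}

wordCounts['Science'] = 2          # update an existing key
wordCounts['Python'] = 1           # add a new key

print('Data' in wordCounts)        # membership test on keys
print(wordCounts.get('R', 0))      # default value instead of a KeyError

for word, count in wordCounts.items():
    print(word, count)
```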

Dictionaries: Values

The values of a dictionary can themselves be of any type, for example basic data types, lists, or other dictionaries.

Dictionaries: Values

./src/dict2.py

course={ 'title': 'DataProcessing1 (WS17)',
         'authors':['A. Polleres', 'J. Umbrich'],
         'wordCounts': {'Data':10, 'Science':10}
       }

value=course['wordCounts']
print(value)
print( type(value) )

Terminal> python3 ./src/dict2.py
{'Data': 10, 'Science': 10}
<class 'dict'>
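Nested values are reached by chaining lookups: first the key in the outer dictionary, then an index or key in the inner container. A minimal sketch reusing the same example data:

```python
course = {
    'title': 'DataProcessing1 (WS17)',
    'authors': ['A. Polleres', 'J. Umbrich'],
    'wordCounts': {'Data': 10, 'Science': 10},
}

print(course['authors'][0])            # first item of the nested list
print(course['wordCounts']['Data'])    # value inside the nested dictionary

for author in course['authors']:       # iterate over the nested list
    print(author)
```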

Jupyter

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. [Jupyter.org]





Jupyter UI





Jupyter: Create a new Notebook





Jupyter: Set a title





Jupyter: Markdown Cells





Jupyter: Markdown Cells





Jupyter: Code Cells





Jupyter: Running Code





Markdown?

Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML). [Official homepage]

See a good introduction at help.github.com

Markdown Cheatsheet

Headers and text formatting

# The largest heading
## The second largest heading
###### The smallest heading

**This is bold text**
*This text is italicized*
> This is a quote

Markdown Cheatsheet

Lists

- George Washington
- John Adams
- Thomas Jefferson

1. James Madison
2. James Monroe
3. John Quincy Adams

Let's Try

Further Reading material