SBWL 1: Data Processing 1 (PI2.0)

Stefan Sobernig


October 2, 2018

Data Science

What is Data Science?

... bottomline: there is no single definition, but some main recurring terms:

... plus some recurring mention of common skills...

Datafication

A growing share of private and social life is reflected in computerised data, which is then turned into "valuable" insights.

Data Scientists' Skills

                   Data analyst                         Data scientist
Analyt. skills     Analytical thinking                  Excellent in math and statistics
                   Apply established analysis methods   Visualisation, new approaches

Tech. skills       Data modelling, databases            Data modelling, databases
                   Use of analysis tools                Data mining
                   Programming skills of advantage      Algorithm development, method abstraction

Domain knowledge   Detailed domain knowledge            Background domain knowledge

                   Project management                   Creativity
                   Communication skills                 Team work

Data Scientists' Skills

"3 sexy skills of data geeks" (Nathan Yau, Rise of the Data Scientist, 2009)

What problems does Data Science address?

Example for data journalism





Dataset for published articles

Data Science as a Process

What does a Data Science process look like?

Example of a "classic" data-driven process: ETL in data warehousing

Extract data from the source systems, transform it, and load it into a target (operational data store, data mart, or data warehouse).

See, e.g., Matteo Golfarelli, Stefano Rizzi. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, 2009.

What does a Data Science process look like?

"Classic" views are challenged by datafication:

What does a Data Science process look like?

"Knowledge Discovery in Databases (KDD)" process (often used in the course of Data Mining)





Source: Howard Hamilton

What does a Data Science Lifecycle look like?

Towards a "Data Science workflow"





Cathy O'Neil, Rachel Schutt. Doing Data Science: Straight Talk from the Frontline (O'Reilly, 2013) (Chapter 2)

Iterative Operationalisation





Danyel Fisher & Miriah Meyer. "Making Data Visual" (O'Reilly, 2018) (Chapter 2)

Iterative Operationalisation (cont'd)

Challenges in Data Science

WARNING: At each stage, things can go wrong! Any filtering/aggregation may bias the data!

“Data wrangling is a huge, and surprisingly so, part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.” (New York Times)

Challenges in Data Science (cont'd)





The Data Science Lifecycle: your own experiences?

Which difficulties have you already experienced when working with data?

  1. ... ever had problems loading/importing a file someone sent you because of an unknown file format?
  2. ... ever encountered something like this: "K�snudl"?
  3. ... ever encountered blanks in your data?
  4. ... ever seen an observation (an insight, a trend) disappear when combining different data sets (a.k.a. "Simpson's paradox")?
  5. ... more on that in the next lectures!
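
Item 4 can be made concrete with a small sketch. The counts below are illustrative assumptions for this example, not data from the course, and the helper name rate is hypothetical. Within each subgroup treatment A has the higher success rate, yet the combined figures favour B.

```python
# Illustrative (successes, trials) counts for two treatments A and B,
# split into two subgroups. The numbers are assumptions for this sketch.
a_small, a_large = (81, 87), (192, 263)
b_small, b_large = (234, 270), (55, 80)

def rate(successes, trials):
    """Return the share of successful trials."""
    return successes / trials

# Within each subgroup, A has the higher success rate ...
print("small:", round(rate(*a_small), 2), round(rate(*b_small), 2))
print("large:", round(rate(*a_large), 2), round(rate(*b_large), 2))

# ... but after combining the subgroups, B looks better.
a_all = (a_small[0] + a_large[0], a_small[1] + a_large[1])
b_all = (b_small[0] + b_large[0], b_small[1] + b_large[1])
print("combined:", round(rate(*a_all), 2), round(rate(*b_all), 2))
```

The reversal comes from the unequal subgroup sizes: A is mostly applied to the harder "large" cases, which drags its combined rate down.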

Data Science Lifecycle: Summary

Again, not a single definition, but some recurring terms:

  1. find and collect all relevant data
  2. identify issues & problems within the data
  3. organise / transform / merge data
  4. systematically operationalise questions about the data: proxies
  5. select a visualisation, a statistical technique, or a machine-learning technique as an outcome of operationalisation
  6. provide interpretations and limitations of the results
  7. communicate results

Data Science Ethics

Ethics in Data Science: FACT

Ethics in Data Science: FACT (cont'd)





Source: http://www.responsibledatascience.org/

Data Science Lifecycle: Summary

Notice.

These steps may take 80% of the work or more. This is the focus of our course "Data Processing I".

Data Science Tools

Data Science Tools: Python and R





Source: https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis

Python and R





Source: https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis

Python and R





Source: https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis

Why Python and R

The Python vs R debate confines you to one programming language. You should look beyond it and embrace both tools for their respective strengths. Using more tools will only make you better as a data scientist. [TheNextWeb]

Python & Jupyter





Outline

Why Python?

Python is a dynamic, general-purpose language with which one can achieve fast results in only a few lines of code.

Companies

See also a verified list of companies using Python

Versions 2.7 vs. 3.x

Python is currently available in two versions: Version 2.x and 3.x.

We are using Python 3 in this course

Examples:

Python 2

Python 3
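
As a hedged sketch of two well-known differences between the versions (the variable names are chosen for illustration only): print became a function, and "/" between two integers no longer truncates. The block runs under Python 3; the Python 2 forms appear only in comments.

```python
# Python 3: print is a function and needs parentheses.
print("Data Science")

# In Python 2 the same line was usually written as a statement:
#   print "Data Science"
# Under Python 3 that form raises a SyntaxError.

# Python 3: "/" always performs true division.
true_division = 4 / 8      # 0.5
# Python 2 truncated the result for two integers (4 / 8 gave 0).
# In Python 3, "//" is the explicit floor-division operator.
floor_division = 4 // 8    # 0
print(true_division, floor_division)
```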

Jupyter Notebook





Brief Python 3 Tutorial

Jupyter Notebook Version

The following slides are also available as a Jupyter notebook: python3-intro.ipynb.

A useful helper: the print function

print('test')

Basic Data Types

Basic data types are the essential building blocks for handling information in Python

Strings

Any text between two matching quotes (either single ' ' or double quote " ")

Examples

'data'
"science"
'I study at WU Vienna'

Exercise.

Create some strings and play with the different quotes
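
One possible way to approach the exercise, as a minimal sketch (the variable names are illustrative): the two quote styles are interchangeable, and choosing one lets you use the other inside the text.

```python
# Double quotes let you use an apostrophe inside the string ...
with_apostrophe = "It's data science"
# ... and single quotes let you use double quotes inside.
with_quotes = 'She said "data"'
# A backslash escapes a quote of the same kind as the delimiters.
escaped = 'It\'s data science'

print(with_apostrophe)
print(with_quotes)
print(with_apostrophe == escaped)   # both spell the same text
print('data' == "data")             # the quote style does not matter
```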

See also Chapter 3.1.2 in the Python tutorial (en, de)

Integers

Integers are whole numbers.

Terminal> python3 -c 'print( type( 1 ) )'
<class 'int'>

Some examples:

1
0
-5

Floats

Floats are decimal number types.

Terminal> python3 -c 'print( type( 2.2 ) )'
<class 'float'>

Some examples:

1.0
15.4

Numbers with leading zero

Python 3 does not allow integer literals with a leading zero (the exact error message depends on the Python version).

0034

Terminal> python3 -c '0034'
SyntaxError: invalid token
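
Leading zeros still turn up in real data, for example in codes read from a file as text. A minimal sketch of how one might handle them (the variable names are illustrative):

```python
# Text with leading zeros (e.g. read from a file) converts without problems.
from_text = int("0034")

# To display a number with leading zeros, format it as a string instead.
padded = str(34).zfill(4)
formatted = f"{34:04d}"

# A leading "0o" marks an octal literal, which is a different number.
octal = 0o34

print(from_text, padded, formatted, octal)
```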

Operations for Numbers: Addition

5+4

Terminal> python3 -c 'print(5+4)'
9

Operations for Numbers: Subtraction

10-34

Terminal> python3 -c 'print( 10-34 )'
-24

Operations for Numbers: Multiplication

5*4

Terminal> python3 -c 'print(5*4)'
20

2.5*3

Terminal> python3 -c 'print( 2.5*3 )'
7.5

Operations for Numbers: Division

Python 3

4/8

Terminal> python3 -c 'print(4/8)'
0.5
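
Besides "/", Python offers two related operators. The sketch below (illustrative variable names) shows the result type of "/" and how floor division and modulo behave.

```python
quotient = 8 / 4          # "/" always returns a float, even for whole results
floor = 7 // 2            # floor division rounds down to a whole number
remainder = 7 % 2         # modulo gives the remainder of the division
negative_floor = -7 // 2  # rounds towards minus infinity, not towards zero

print(quotient, type(quotient))
print(floor, remainder, negative_floor)
```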

See also Chapter 3.1.1 in the Python tutorial (en, de)

Strings vs. Integers

Question.

The "==" operator compares if two values are equal. What happens if we execute the following command?

5=="5"

Terminal> python3 -c 'print( 5=="5" )'
False

Notice.

If a number is entered within quotes, the value is processed as a string.
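
To compare or combine such values, convert one of them first. A minimal sketch using the built-in int() and str() functions (the variable names are illustrative):

```python
number = 5
text = "5"

print(number == text)        # different types, so not equal
print(number == int(text))   # convert the string to an integer first
print(str(number) == text)   # or convert the integer to a string

# "+" means addition for numbers but concatenation for strings.
added = number + int(text)
joined = str(number) + text
print(added, joined)
```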

Float vs. Integers

Try the following

Question.

The "==" operator compares if two values are equal. What happens if we execute the following command?

5==5.5

Terminal> python3 -c 'print( 5==5.5 )'
False
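
Two related cases are worth trying as well. The sketch below assumes nothing beyond the standard library's math module: an integer and a float with the same value compare as equal, while floats that look equal on paper may differ slightly.

```python
import math

print(5 == 5.0)             # an int and a float with the same value are equal
print(0.1 + 0.2 == 0.3)     # floats are stored in binary, so tiny errors occur
total = 0.1 + 0.2
print(total)

# Compare floats with a tolerance instead of "==".
close_enough = math.isclose(total, 0.3)
print(close_enough)
```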

Booleans

A boolean data type has only two possible values: True or False

Terminal> python3 -c 'print( type( True ) )'
<class 'bool'>
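
Booleans mostly arise as the result of comparisons. A minimal sketch (the variable temperature and its value are made up for illustration):

```python
temperature = 21.5

is_warm = temperature > 20          # a comparison evaluates to True or False
is_hot = temperature > 30

print(is_warm, is_hot)
print(is_warm and not is_hot)       # combine booleans with and, or, not

if is_warm:                         # booleans control which code runs
    print("No jacket needed")
```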

Data Containers

Python provides the following containers:

Variables

Variables are a means to store and reference data.

Unlike Java, Python does not require type declarations, so defining a variable is as simple as:

 VARIABLE_NAME = VALUE

Number assignments

For instance, assigning the value 1 to the variable a:

a = 1

String assignments

For instance, assigning the value "Data Science" to the variable title:

title = "Data Science"
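
Because no type is declared, the same name can later refer to a value of a different type. A small sketch (the name value is illustrative):

```python
value = 1
first_type = type(value)      # the type belongs to the value, not to the name
print(first_type)

value = "Data Science"        # rebinding the name to a string is allowed
second_type = type(value)
print(second_type)
```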

Operations with variables

One can also combine operations with variables

x = 5
y = 10
c = x*y
print(c)

Terminal> python3 -c 'x=5;y=10;c=x*y; print(c)'
50

Operations with variables

a = 'Data'
b = 'Science'
print(a+b)

Terminal> python3 -c 'a = "Data"; b = "Science"; print( a+b )'
DataScience
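
Note that "+" joins the strings without a space. A hedged sketch of three common ways to add a separator (the variable year and its value are illustrative):

```python
a = 'Data'
b = 'Science'

with_space = a + ' ' + b            # add the separator explicitly
joined = ' '.join([a, b])           # join a list of strings with a separator
year = 2018
label = f"{a} {b} {year}"           # f-strings also embed non-string values

print(with_space)
print(joined)
print(label)
```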

Lists

Do you remember?



Lists

A list is a group of items

You can create a list in Python by placing the items in square brackets ([]) and separating the items with a comma.

 [ item1, item2, item3, ..., itemN ]

Terminal> python3 -c 'print( type( [] ) )'
<class 'list'>

Lists: Example

[ 'Milk', 'Eggs', 'Lettuce' ]
#or
[ 12.5, 8.0, 61.3, 87.5 ]

Let's store the list in a variable so that we can reuse it later in the code

# avoid the name "list": it would shadow Python's built-in list type
numbers = [ 12.5, 8.0, 61.3, 87.5 ]
print(numbers)
[12.5, 8.0, 61.3, 87.5]
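
Once a list is stored in a variable, its items can be read and extended. A minimal sketch (the names first, last and count are illustrative):

```python
numbers = [12.5, 8.0, 61.3, 87.5]

first = numbers[0]            # indexing starts at 0
last = numbers[-1]            # negative indices count from the end
count = len(numbers)          # number of items

numbers.append(3.2)           # lists can grow after creation
print(first, last, count, numbers)
```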

See also Chapters 3.4 and 5 in the Python tutorial (en, de)

List Concatenation

./src/listex.py

a1=['a','b','c']
a2=['d','e']
a3=a1+a2
print(a3)

Terminal> python3 ./src/listex.py
['a', 'b', 'c', 'd', 'e']
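
Other arithmetic operators behave differently from "+". The sketch below shows what happens with "*" and "-", plus one possible way to emulate a difference with a list comprehension (the names repeated and difference are illustrative):

```python
a1 = ['a', 'b', 'c']
a2 = ['d', 'e']

repeated = a2 * 2                       # list * integer repeats the list
print(repeated)

try:
    a1 - a2                             # "-" is not defined for two lists
except TypeError as error:
    print("TypeError:", error)

# One way to "subtract": keep the items of a1 + a2 that are not in a2.
difference = [item for item in a1 + a2 if item not in a2]
print(difference)
```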

Notice.

"-", "*" and "/" are not defined between two lists (although "*" with an integer repeats a list).

Iterating over lists

Do you remember?



Iterating over lists

./src/listex2.py

numbers=[11,22,33,44,55]  # "list" would shadow the built-in type
for item in numbers:
    print(item)

Terminal> python3 ./src/listex2.py
11
22
33
44
55
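
A loop becomes useful once it does something with each item. A minimal sketch that sums the values and also prints each item's position with the built-in enumerate():

```python
numbers = [11, 22, 33, 44, 55]

total = 0
for item in numbers:
    total = total + item            # accumulate a running sum
print(total)

# enumerate() yields the position together with each item.
for position, item in enumerate(numbers):
    print(position, item)
```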

Dictionaries

A Python dictionary is a more complex data container than a variable or a list: it maps keys to values.

  • key: the word you look up
  • value: the result of the lookup

 { key1: value1, key2: value2 }





Source: Wikimedia

Terminal> python3 -c 'print( type(  {} ) )'
<class 'dict'>

Dictionaries: Example

./src/dict.py

wordCounts={ 'Data':10, 'Science': 1, 'Course':5 }
print(wordCounts)

# access the value stored under a key
print( wordCounts['Data'] )

Terminal> python3 ./src/dict.py
{'Data': 10, 'Science': 1, 'Course': 5}
10
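
Dictionaries can also be changed and queried after creation. A hedged sketch of common operations on the same example (the added keys 'Python' and 'Jupyter' are made up for illustration):

```python
wordCounts = {'Data': 10, 'Science': 1, 'Course': 5}

wordCounts['Python'] = 3                 # add a new key-value pair
wordCounts['Science'] = wordCounts['Science'] + 1   # update an existing value

print('Data' in wordCounts)              # test whether a key exists
print(wordCounts.get('Jupyter', 0))      # default instead of a KeyError

for word, count in wordCounts.items():   # iterate over key-value pairs
    print(word, count)
```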

Dictionaries: Values

The values of a dictionary can themselves be:

Dictionaries: Values

./src/dict2.py

course={ 'title': 'DataProcessing1 (WS17)',
                'authors':['A. Polleres', 'J. Umbrich'],
                'wordCounts': {'Data':10, 'Science':10}
                }

value=course['wordCounts']
print(value)
print( type(value) )

Terminal> python3 ./src/dict2.py
{'Data': 10, 'Science': 10}
<class 'dict'>
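
Nested values are reached by chaining lookups. A minimal sketch on the same dictionary (the names first_author and data_count are illustrative):

```python
course = {'title': 'DataProcessing1 (WS17)',
          'authors': ['A. Polleres', 'J. Umbrich'],
          'wordCounts': {'Data': 10, 'Science': 10}}

first_author = course['authors'][0]          # list inside the dictionary
data_count = course['wordCounts']['Data']    # dictionary inside the dictionary

print(first_author)
print(data_count)
print(type(course['title']), type(course['authors']))
```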

Jupyter

The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. [Jupyter.org]





Jupyter UI





Jupyter: Create a new Notebook





Jupyter: Set a title





Jupyter: Markdown Cells





Jupyter: Markdown Cells





Jupyter: Code Cells





Jupyter: Running Code





Markdown?

Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML). [Official homepage]

See a good introduction at help.github.com

Markdown Cheatsheet

Headers and text formatting

# The largest heading
## The second largest heading
###### The smallest heading

**This is bold text**
*This text is italicized*
> This is a quote

Markdown Cheatsheet

Lists

- George Washington
- John Adams
- Thomas Jefferson

1. James Madison
2. James Monroe
3. John Quincy Adams

Let's Try

Further Reading material