... bottomline: there is no single definition, but some main recurring terms:
... plus some recurring mention of common skills...
A growing share of private and social life is reflected in computerised data that can be turned into "valuable" insights.
| | Data analyst | Data scientist |
|---|---|---|
| Analyt. skills | Analytical thinking | Excellent in math and statistics |
| | Apply established analysis methods | Visualisation, new approaches |
| Tech. skills | Data modelling, databases | Data modelling, databases |
| | Use of analysis tools | Data mining |
| | Programming skills of advantage | Algorithm development, method abstraction |
| Domain knowledge | Detailed domain knowledge | Background domain knowledge |
| | Project management | Creativity |
| | Communication skills | Team work |
"3 sexy skills of data geeks" (Nathan Yau, Rise of the Data Scientist, 2009)
Example for data journalism
Example of a "classic" data-driven process: ETL in data warehousing
operational data store, data mart, or data warehouse)
See, e.g., Matteo Golfarelli, Stefano Rizzi. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, 2009.
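The three ETL stages (extract, transform, load) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV content, the `sales` table, and the helper `to_float` are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Extract: read rows from a (hypothetical) operational source, here an in-memory CSV.
raw = io.StringIO("customer,amount\nalice, 10.50\nbob,n/a\nalice,4.50\n")
rows = list(csv.DictReader(raw))

# Transform: strip whitespace and drop rows whose amount is not a number.
def to_float(text):
    try:
        return float(text.strip())
    except ValueError:
        return None

clean = [(r["customer"].strip(), to_float(r["amount"])) for r in rows]
clean = [(name, amount) for name, amount in clean if amount is not None]

# Load: write the cleaned rows into a target table (SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", clean)

# Query the loaded data, e.g. the total amount per customer.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
conn.close()
```

Note that the transform step silently drops the row for "bob"; whether that is acceptable depends on the analysis.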
"Classic" views are challenged by datafication:
"Knowledge Discovery in Databases (KDD)" process (often used in the course of Data Mining)
Source: Howard Hamilton
Towards a "Data Science workflow"
Cathy O'Neil, Rachel Schutt. Doing Data Science: Straight Talk from the Frontline (O'Reilly, 2013) (Chapter 2)
Danyel Fisher & Miriah Meyer. "Making Data Visual" (O'Reilly, 2018) (Chapter 2)
WARNING: At each stage, things can go wrong! Any filtering/aggregation may bias the data!
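A minimal sketch of how filtering can shift a result. The readings and the threshold of 50 are made-up values for illustration only.

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing measurement.
readings = [2.0, 4.0, None, 100.0, 6.0, None]

# Step 1: dropping missing values silently shrinks the sample.
present = [r for r in readings if r is not None]

# Step 2: filtering "outliers" with an arbitrary threshold changes the result further.
filtered = [r for r in present if r < 50]

mean_present = mean(present)    # (2 + 4 + 100 + 6) / 4
mean_filtered = mean(filtered)  # (2 + 4 + 6) / 3

print(len(readings), len(present), len(filtered))  # 6 4 3
print(mean_present, mean_filtered)                 # 28.0 4.0
```

Both averages are "correct", yet they describe different subsets of the data, so each filtering decision should be documented.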
“Data wrangling is a huge, and surprisingly so, part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.” (The New York Times)
Which difficulties have you already experienced when working with data?
Again, not a single definition, but some recurring terms:
NOTE:
These steps may take 80% of the work or more. This is the focus of our course "Data Processing I"!
Source https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis
The Python vs R debate confines you to one programming language. You should look beyond it and embrace both tools for their respective strengths. Using more tools will only make you better as a data scientist. [TheNextWeb]
Python is a dynamic general-purpose language with which one can achieve fast results in only a few lines of code.
Companies
See also a verified list of companies using Python
Python exists in two major versions: 2.x and 3.x. Python 2 reached its end of life in January 2020.
We are using Python 3 in this course
Examples:
Python 2
Python 3
The following slides are also available as Jupyter notebook python3-intro.ipynb.
print('test')
Basic data types are the essential building blocks for handling information in Python
A string is any text between two matching quotes (either single ' ' or double " ")
Examples
'data'
"science"
'I study at WU Vienna'
Create some strings and play with the different quotes
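A possible starting point for this exercise; the variable names and example sentences are made up for illustration.

```python
# Single and double quotes create the same kind of value.
single = 'data'
double = "data"
print(single == double)   # True

# Use the other quote type to include a quote character inside the string.
with_apostrophe = "It's a string"
with_quotes = 'She said "hello"'
print(with_apostrophe)
print(with_quotes)

print(type(single))       # <class 'str'>
```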
Integers are whole numbers
Terminal> python3 -c 'print( type( 1 ) )'
<class 'int'>
Some examples:
1
0
-5
Floats are numbers with a decimal point.
Terminal> python3 -c 'print( type( 2.2 ) )'
<class 'float'>
Some examples:
1.0
15.4
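Python 3 does not allow integer literals with a leading zero (except for zero itself)
0034
Terminal> python3 -c '0034'
SyntaxError: invalid token
(newer Python 3 versions print a more detailed SyntaxError message)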
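Leading zeros usually show up in text data (e.g. IDs or postal codes). A short sketch of how such values can be handled; the variable names are illustrative.

```python
# Converting text with leading zeros works: int() ignores them.
number = int("0034")
print(number)          # 34

# Octal literals need the explicit 0o prefix in Python 3.
octal = 0o34
print(octal)           # 28

# Leading zeros for display are a formatting concern, not part of the number.
padded = f"{number:04d}"
print(padded)          # 0034
```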
5+4
Terminal> python3 -c 'print(5+4)'
9
10-34
Terminal> python3 -c 'print( 10-34 )'
-24
5*4
Terminal> python3 -c 'print(5*4)'
20
2.5 *3
Terminal> python3 -c 'print( 2.5*3 )'
7.5
Python 3
4/8
Terminal> python3 -c 'print(4/8)'
0.5
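In Python 3 the `/` operator always performs true division, whereas in Python 2 dividing two integers (`4/8`) gave `0`. A brief sketch of the related Python 3 operators:

```python
true_div = 4 / 8      # true division, always a float
floor_div = 4 // 8    # floor division, drops the fractional part
quotient = 9 // 2
remainder = 9 % 2     # modulo: remainder of the division
negative = -9 // 2    # floor division rounds towards negative infinity

print(true_div)    # 0.5
print(floor_div)   # 0
print(quotient)    # 4
print(remainder)   # 1
print(negative)    # -5
```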
The "==" operator checks whether two values are equal. What happens if we execute the following command?
5=="5"
Terminal> python3 -c 'print( 5=="5" )'
False
If a number is entered within quotes, the value is processed as a string, and a string is never equal to an integer.
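To compare a number with its textual form, one of the two values has to be converted first. A minimal sketch using the built-in `int()` and `str()` functions:

```python
number = 5
text = "5"

same_raw = number == text            # int vs. str: never equal
same_as_int = number == int(text)    # convert the string to an integer
same_as_str = str(number) == text    # or convert the integer to a string

print(same_raw)      # False
print(same_as_int)   # True
print(same_as_str)   # True
```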
Try the following
The "==" operator checks whether two values are equal. What happens if we execute the following command?
5==5.5
Terminal> python3 -c 'print( 5==5.5 )'
False
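Integers and floats are compared by value, so the types do not need to match. Floating-point rounding can still produce surprises; a short sketch using `math.isclose` from the standard library:

```python
import math

int_vs_float = 5 == 5.0
print(int_vs_float)                 # True: same numeric value

total = 0.1 + 0.2
exact = total == 0.3
print(exact)                        # False: binary floats cannot represent 0.1 exactly

close = math.isclose(total, 0.3)
print(close)                        # True: compare with a tolerance instead
```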
A boolean data type has only two possible values: True or False
Terminal> python3 -c 'print( type( True ) )'
<class 'bool'>
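Booleans are typically the result of comparisons and can be combined with `and`, `or` and `not`. A small sketch with illustrative variable names:

```python
is_equal = (5 == 5)
is_greater = (3 > 4)

print(is_equal)     # True
print(is_greater)   # False

# Combine boolean values with logical operators.
both = is_equal and is_greater
either = is_equal or is_greater
negated = not is_greater

print(both, either, negated)   # False True True
```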
Python provides the following containers:
Variables are a means to store and reference data
Python does not require type declarations (unlike Java), so defining a variable is as simple as:
VARIABLE_NAME = VALUE
For instance, assigning the value 1 to the variable a:
a = 1
For instance, assigning the value "Data Science" to the variable title:
title = "Data Science"
One can also combine operations with variables
x = 5
y = 10
c = x*y
print(c)
Terminal> python3 -c 'x=5;y=10;c=x*y; print(c)'
50
a = 'Data'
b = 'Science'
print(a+b)
Terminal> python3 -c 'a = "Data"; b = "Science"; print( a+b )'
DataScience
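The `+` operator only concatenates strings with strings. A minimal sketch of what happens when a string and a number are mixed, and two ways around it; the value of `year` is made up for illustration.

```python
title = "Data Science"
year = 2017  # hypothetical value

# Mixing str and int with + raises a TypeError.
try:
    label = title + year
except TypeError as error:
    print("TypeError:", error)

# Convert the number explicitly ...
label = title + " " + str(year)
print(label)       # Data Science 2017

# ... or use an f-string, which converts the value for us.
formatted = f"{title} {year}"
print(formatted)   # Data Science 2017
```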
Do you remember?
A list is a group of items
You can create a list in Python by placing the items in square brackets ([]) and separating the items with a comma.
[ item1, item2, item3, ..., itemN ]
Terminal> python3 -c 'print( type( [] ) )'
<class 'list'>
[ 'Milk', 'Eggs', 'Lettuce' ]
#or
[ 12.5, 8.0, 61.3, 87.5 ]
Let's store the list in a variable so that we can reuse it later in the code
# avoid the name "list": it would shadow Python's built-in list type
values = [ 12.5, 8.0, 61.3, 87.5 ]
print(values)
[12.5, 8.0, 61.3, 87.5]
a1=['a','b','c']
a2=['d','e']
a3=a1+a2
print(a3)
Terminal> python3 ./src/listex.py
['a', 'b', 'c', 'd', 'e']
"-","*","/" are not allowed as operations for lists
Do you remember?
numbers = [11, 22, 33, 44, 55]
# the body of the loop must be indented
for item in numbers:
    print(item)
Terminal> python3 ./src/listex2.py
11
22
33
44
55
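A loop can also accumulate a result or keep track of the position of each item. A minimal sketch using the built-in `enumerate()`:

```python
numbers = [11, 22, 33, 44, 55]

# Accumulate a running total inside the loop.
total = 0
for item in numbers:
    total = total + item
print(total)   # 165

# enumerate() yields the position together with the item.
positions = []
for position, item in enumerate(numbers):
    positions.append(position)
    print(position, item)
```

For summing alone, the built-in `sum(numbers)` gives the same result.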
A Python dictionary is a more complex data container than a variable or a list.
{ key1: value1, key2: value2 }
Source: Wikimedia
Terminal> python3 -c 'print( type( {} ) )'
<class 'dict'>
wordCounts={ 'Data':10, 'Science': 1, 'Course':5 }
print(wordCounts)
# access the value stored under a key
print( wordCounts['Data'] )
Terminal> python3 ./src/dict.py
{'Data': 10, 'Science': 1, 'Course': 5}
10
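Dictionaries can be changed after creation. A short sketch of common operations; the added key 'Python' and the looked-up key 'Statistics' are made up for illustration.

```python
wordCounts = {'Data': 10, 'Science': 1, 'Course': 5}

# Add a new key-value pair.
wordCounts['Python'] = 3

# Update the value of an existing key.
wordCounts['Science'] = wordCounts['Science'] + 1

# .get() returns a default instead of raising a KeyError for unknown keys.
missing = wordCounts.get('Statistics', 0)
print(missing)   # 0

# Iterate over all key-value pairs.
for word, count in wordCounts.items():
    print(word, count)
```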
The values of a dictionary can themselves be of different types, for example strings, lists, or other dictionaries:
course = { 'title': 'DataProcessing1 (WS17)',
           'authors': ['A. Polleres', 'J. Umbrich'],
           'wordCounts': {'Data': 10, 'Science': 10}
         }
value=course['wordCounts']
print(value)
print( type(value) )
Terminal> python3 ./src/dict2.py
{'Data': 10, 'Science': 10}
<class 'dict'>
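Nested values are reached by chaining the access operators. A minimal sketch based on the `course` dictionary from the example above:

```python
course = {
    'title': 'DataProcessing1 (WS17)',
    'authors': ['A. Polleres', 'J. Umbrich'],
    'wordCounts': {'Data': 10, 'Science': 10},
}

# List inside a dictionary: key first, then the list index.
first_author = course['authors'][0]
print(first_author)   # A. Polleres

# Dictionary inside a dictionary: chain two keys.
data_count = course['wordCounts']['Data']
print(data_count)     # 10

# Aggregate over the values of the inner dictionary.
total_words = sum(course['wordCounts'].values())
print(total_words)    # 20
```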
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. [Jupyter.org]
Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML). [Official homepage]
See a good introduction at help.github.com
Headers and text formatting
# The largest heading
## The second largest heading
###### The smallest heading
**This is bold text**
*This text is italicized*
> This is a quote
Lists
- George Washington
- John Adams
- Thomas Jefferson
1. James Madison
2. James Monroe
3. John Quincy Adams