... bottom line: there is no single definition, but there are some main recurring terms:
A growing area of private and social life becomes reflected in computerised data, to be turned into "valuable" insights.
... plus some recurring mention of common skills...
| | Data analyst | Data scientist |
|---|---|---|
| Analyt. skills | Analytical thinking | Excellent in math and statistics |
| | Apply established analysis methods | Visualisation, new approaches |
| Tech. skills | Data modelling, databases | Data modelling, databases |
| | Use of analysis tools | Data mining |
| | Programming skills of advantage | Algorithm development, method abstraction |
| Domain knowledge | Detailed domain knowledge | Background domain knowledge |
| | Project management | Creativity |
| | Communication skills | Team work |
''3 sexy skills of data geeks'' (Nathan Yau, Rise of the Data Scientist, 2009)
Examples from data journalism
Some data journalism (or, generally, data science) best practices:
provenance, simplifying assumptions) clear
From the "goldrush" days...
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
to a more realistic, diversified picture of the roles of a Data Scientist ("Modeling" Data Scientist vs. "Decision" Data Scientist)...
https://hbr.org/2018/11/the-kinds-of-data-scientist
... but overall, your role will be that of a "Generalist" rather than a "Specialist":
https://hbr.org/2019/03/why-data-science-teams-need-generalists-not-specialists
Example of a "classic" data-driven process: ETL in data warehousing
See, e.g., Matteo Golfarelli, Stefano Rizzi. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, 2009.
"Classic" views are challenged by datafication:
"Knowledge Discovery in Databases (KDD)" process (often used in the course of Data Mining)
Source: Howard Hamilton
Towards a ''Data Science workflow''
Cathy O'Neil, Rachel Schutt. Doing Data Science: Straight Talk from the Frontline (O'Reilly, 2013) (Chapter 2)
Danyel Fisher, Miriah Meyer. Making Data Visual (O'Reilly, 2018) (Chapter 2)
Again, not a single definition, but some recurring steps (that you often/typically need to iterate over several times):
WARNING: At each stage, things can go wrong! Any filtering/aggregation may bias the data!
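To make the warning concrete, here is a minimal sketch with made-up numbers (the `records` list and its field names are purely hypothetical, not course data). It shows how an innocent-looking filtering step, dropping records with a missing value, can change the composition of the data when missing values are not spread evenly across groups.

```python
# Hypothetical survey records: age group and reported income (None = not reported).
records = [
    {"group": "young", "income": 20},
    {"group": "young", "income": None},
    {"group": "young", "income": None},
    {"group": "old", "income": 50},
    {"group": "old", "income": 60},
    {"group": "old", "income": 70},
]

# Filtering step: keep only the complete records.
complete = [r for r in records if r["income"] is not None]

# Share of "young" respondents before and after filtering.
share_before = sum(r["group"] == "young" for r in records) / len(records)
share_after = sum(r["group"] == "young" for r in complete) / len(complete)

print(share_before)  # 0.5
print(share_after)   # 0.25
```

Any statistic computed on `complete` now under-represents the "young" group, although no value was changed.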
“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.” (New York Times)
Which difficulties have you already experienced when working with data?
Plus, last, but not least:
NOTE:
Data preprocessing and preparation steps may take 80% of the work or more -> This is the focus of our course ''Data Processing I'' !!!
Source: https://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
And some newer links:
Source: https://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
Pros: | Cons: |
---|---|
general purpose programming | dynamically typed |
rapid prototyping | not optimised specifically for statistical and data analysis (but standard libraries are emerging) |
very active community, currently 200,064 packages (libraries) | there are (arguably) more robust general-purpose languages |
Python Software Foundation License | |
Pros: | Cons: |
---|---|
designed for statistical data analysis | somewhat domain-specific (numerical, statistical data analysis) |
visualisation | performance? big data APIs and libraries |
very active community, currently 15,111 packages (libraries) | low-level data wrangling |
GNU GPL v2 | |
Bottom line: both languages are actively developed; use each one where you need it. Be ready to switch and combine!
The Python vs R debate confines you to one programming language. You should look beyond it and embrace both tools for their respective strengths. Using more tools will only make you better as a data scientist. [TheNextWeb]
Big Data tools and Libraries are developing fast...
... no way to learn or teach all of these tools.
But: Good news - many of those have APIs to connect to either Python or R; you will learn about some of them in course 3.
Python is a dynamic general-purpose language with which one can achieve fast results in only a few lines of code.
Companies
See also a verified list of companies using Python
Python is currently available in two versions: Version 2.x and 3.x.
We are using Python 3 in this course
Examples:
Python 2
Python 3
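The two most frequently cited differences can be sketched as follows. This is only an illustration: the code runs under Python 3, and the Python 2 behaviour is described in the comments rather than executed.

```python
# Python 3: print is a function and needs parentheses.
# (In Python 2, print was a statement: print 'test')
message = 'test'
print(message)

# Python 3: "/" between two integers performs true division.
# (In Python 2, 4/8 between two integers gave 0.)
half = 4 / 8
print(half)  # 0.5
```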
The following slides are also available as Jupyter notebook python3-intro.ipynb.
print('test')
Basic data types are the essential building blocks for handling information in Python
A string is any text between two matching quotes (either single ' ' or double " ")
Examples
'data'
"science"
'I study at WU Vienna'
Create some strings and play with the different quotes
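One possible way to approach this exercise is sketched below (the variable names `s1` to `s5` and the example sentences are arbitrary choices for illustration). It shows that both quote styles produce the same string and how to place one kind of quote inside the other.

```python
# Single and double quotes create the same kind of object.
s1 = 'data'
s2 = "data"
print(s1 == s2)  # True

# Use one kind of quote outside to include the other kind inside.
s3 = "I'm studying data science"
s4 = 'She said "hello"'
print(s3)
print(s4)

# Alternatively, escape the inner quote with a backslash.
s5 = 'I\'m studying data science'
print(s3 == s5)  # True
```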
Integers are whole numbers
Terminal> python3 -c 'print( type( 1 ) )'
<class 'int'>
Some examples:
1
0
-5
Floats are decimal number types.
Terminal> python3 -c 'print( type( 2.2 ) )'
<class 'float'>
Some examples:
1.0
15.4
Python 3 does not allow integer literals with leading zeros:
0034
Terminal> python3 -c '0034'
SyntaxError: invalid token
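The restriction applies only to literals typed in the source code. As a small sketch (the variable names are illustrative), text with leading zeros can still be converted to a number, octal values use an explicit prefix, and leading zeros for display are a matter of string formatting. Note that the exact wording of the SyntaxError differs between Python 3 releases.

```python
# Converting text with leading zeros works: int() ignores them.
n = int("0034")
print(n)  # 34

# Octal literals need the explicit 0o prefix in Python 3.
o = 0o34
print(o)  # 28

# To display a number with leading zeros, format it as a string.
padded = str(34).zfill(4)
print(padded)  # 0034
```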
5+4
Terminal> python3 -c 'print(5+4)'
9
10-34
Terminal> python3 -c 'print( 10-34 )'
-24
5*4
Terminal> python3 -c 'print(5*4)'
20
2.5 *3
Terminal> python3 -c 'print( 2.5*3 )'
7.5
Python 3
4/8
Terminal> python3 -c 'print(4/8)'
0.5
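A short sketch of the related operators may help here: in Python 3 the result of "/" is always a float, while "//" and "%" give the whole-number quotient and the remainder. The variable names are chosen for illustration only.

```python
# "/" always returns a float, even if the result is a whole number.
result = 8 / 4
print(result)        # 2.0
print(type(result))  # <class 'float'>

# "//" is floor division, "%" is the remainder.
quotient = 7 // 2
remainder = 7 % 2
print(quotient, remainder)  # 3 1
```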
The "==" operator checks whether two values are equal. What happens if we execute the following command?
5=="5"
Terminal> python3 -c 'print( 5=="5" )'
False
If a number is entered within quotes, the value is processed as a string.
Try the following: what happens if we compare an integer with a float?
5==5.5
Terminal> python3 -c 'print( 5==5.5 )'
False
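The comparisons above can be summarised in a small sketch (variable names are illustrative). A number and a string are never equal, but they can be compared after an explicit conversion; an integer and a float are equal whenever they represent the same value.

```python
# A number and a string are different values, even if they look alike.
int_vs_str = (5 == "5")
print(int_vs_str)  # False

# After an explicit conversion, the comparison succeeds.
converted = (5 == int("5"))
print(converted)  # True

# Integers and floats are compared by numeric value.
same_value = (5 == 5.0)
different_value = (5 == 5.5)
print(same_value)       # True
print(different_value)  # False
```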
A boolean data type has only two possible values: True or False
Terminal> python3 -c 'print( type( True ) )'
<class 'bool'>
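As a brief sketch, booleans typically arise as the result of a comparison and can be combined with the keywords `and`, `or` and `not` (the variable names below are illustrative).

```python
# A comparison produces a boolean value.
is_equal = (5 == 5)
print(is_equal)        # True
print(type(is_equal))  # <class 'bool'>

# Booleans can be combined with and, or, not.
both = is_equal and (1 == 2)
either = is_equal or (1 == 2)
negated = not is_equal
print(both)     # False
print(either)   # True
print(negated)  # False
```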
Python provides the following containers:
Variables are a means to store and reference data
Python does not require type declarations (unlike Java); defining a variable is thus as simple as:
VARIABLE_NAME = VALUE
For instance, assigning the value 1 to the variable a:
a = 1
For instance, assigning the value "Data Science" to the variable title:
title = "Data Science"
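Because no type is declared, the type belongs to the value rather than to the variable name. The following minimal sketch illustrates this by assigning values of different types to the same name, one after the other.

```python
# The name "a" first refers to an integer ...
a = 1
first_type = type(a)
print(first_type)  # <class 'int'>

# ... and can later refer to a string without any declaration.
a = "Data Science"
second_type = type(a)
print(second_type)  # <class 'str'>
```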
One can also combine operations with variables
x = 5
y = 10
c = x*y
print(c)
Terminal> python3 -c 'x=5;y=10;c=x*y; print(c)'
50
a = 'Data'
b = 'Science'
print(a+b)
Terminal> python3 -c 'a = "Data"; b = "Science"; print( a+b )'
DataScience
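Note that "+" joins strings exactly as given, without adding a space, and that it does not mix strings with numbers. The sketch below illustrates both points; the variable `year` and its value are hypothetical and only serve the example.

```python
a = 'Data'
b = 'Science'

# Add the space explicitly if one is wanted.
title = a + ' ' + b
print(title)  # Data Science

# Hypothetical number to append to the title.
year = 2017

# Adding a string and an integer raises a TypeError ...
try:
    title + year
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)  # False

# ... so convert the number to a string first.
label = title + ' ' + str(year)
print(label)  # Data Science 2017
```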
Do you remember?
A list is a group of items
You can create a list in Python by placing the items in square brackets ([]) and separating the items with a comma.
[ item1, item2, item3, ..., itemN ]
Terminal> python3 -c 'print( type( [] ) )'
<class 'list'>
[ 'Milk', 'Eggs', 'Lettuce' ]
#or
[ 12.5, 8.0, 61.3, 87.5 ]
Let's store the list in a variable so that we can reuse it later in the code. (Avoid naming the variable list, since that would hide Python's built-in list type.)
numbers = [ 12.5, 8.0, 61.3, 87.5 ]
print(numbers)
[12.5, 8.0, 61.3, 87.5]
a1=['a','b','c']
a2=['d','e']
a3=a1+a2
print(a3)
Terminal> python3 ./src/listex.py
['a', 'b', 'c', 'd', 'e']
"-", "*" and "/" are not defined between two lists (a list can, however, be repeated by multiplying it with an integer, e.g. a2*2)
Do you remember?
numbers=[11,22,33,44,55]
for item in numbers:
    print(item)
Terminal> python3 ./src/listex2.py
11
22
33
44
55
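A loop becomes more useful once it does something with each item. As a minimal sketch, the example below adds up the items one by one and compares the result with Python's built-in `sum()`; the variable names are illustrative.

```python
numbers = [11, 22, 33, 44, 55]

# Accumulate the total step by step; the loop body must be indented.
total = 0
for item in numbers:
    total = total + item

print(total)         # 165
print(sum(numbers))  # 165
```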
A Python dictionary is a container that maps keys to values; unlike in a list, an item is looked up by its key rather than by its position.
{ key1: value1, key2: value2 }
[Figure omitted. Source: Wikimedia]
Terminal> python3 -c 'print( type( {} ) )'
<class 'dict'>
wordCounts={ 'Data':10, 'Science': 1, 'Course':5 }
print(wordCounts)
# access the value stored under a key
print( wordCounts['Data'] )
Terminal> python3 ./src/dict.py
{'Data': 10, 'Science': 1, 'Course': 5}
10
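Beyond reading a value, a dictionary can be changed and traversed. The sketch below shows a few common operations under the assumption of the same `wordCounts` data; the added key 'Python' and the looked-up key 'R' are hypothetical examples.

```python
wordCounts = {'Data': 10, 'Science': 1, 'Course': 5}

# Add a new key-value pair (hypothetical entry).
wordCounts['Python'] = 3

# Update the value of an existing key.
wordCounts['Science'] = wordCounts['Science'] + 1
print(wordCounts['Science'])  # 2

# get() returns a default instead of raising KeyError for a missing key.
missing = wordCounts.get('R', 0)
print(missing)  # 0

# Iterate over all key-value pairs.
for word, count in wordCounts.items():
    print(word, count)
```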
The values of a dictionary can themselves be of different types, for example strings, lists, or other dictionaries:
course={ 'title': 'DataProcessing1 (WS17)',
         'authors':['A. Polleres', 'J. Umbrich'],
         'wordCounts': {'Data':10, 'Science':10}
       }
value=course['wordCounts']
print(value)
print( type(value) )
Terminal> python3 ./src/dict2.py
{'Data': 10, 'Science': 10}
<class 'dict'>
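Since the looked-up value is itself a container, lookups can be chained. The following minimal sketch reuses the `course` dictionary from above to reach an entry of the nested dictionary and an element of the nested list; the variable names for the results are illustrative.

```python
course = {'title': 'DataProcessing1 (WS17)',
          'authors': ['A. Polleres', 'J. Umbrich'],
          'wordCounts': {'Data': 10, 'Science': 10}}

# First look up the nested dictionary, then a key inside it.
data_count = course['wordCounts']['Data']
print(data_count)  # 10

# First look up the list, then an element by its position (starting at 0).
first_author = course['authors'][0]
print(first_author)  # A. Polleres

# len() works on the nested list as on any other list.
print(len(course['authors']))  # 2
```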
The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more. [Jupyter.org]
Markdown is a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML). [Official homepage]
See a good introduction at help.github.com
Headers and text formatting
# The largest heading
## The second largest heading
###### The smallest heading
**This is bold text**
*This text is italicized*
> This is a quote
Lists
- George Washington
- John Adams
- Thomas Jefferson
1. James Madison
2. James Monroe
3. John Quincy Adams