SBWL 1: Data Processing 1 (PI2.0)

Stefan Sobernig


March 2nd 2021

Announcements

Data Science

What is Data Science?

... bottom line: there is no single definition, but some main terms recur:

Datafication

A growing area of private and social life is becoming reflected in computerised data, which is then turned into "valuable" insights.

... plus recurring mentions of common skills ...

Data Scientists' Skills

                  | Data analyst                       | Data scientist
Analytical skills | Analytical thinking                | Excellent in math and statistics
                  | Apply established analysis methods | Visualisation, new approaches
Technical skills  | Data modelling, databases          | Data modelling, databases
                  | Use of analysis tools              | Data mining
                  | Programming skills of advantage    | Algorithm development, method abstraction
Domain knowledge  | Detailed domain knowledge          | Background domain knowledge
                  | Project management                 | Creativity
                  | Communication skills               | Team work

Data Scientists' Skills

"3 sexy skills of data geeks" (Nathan Yau, Rise of the Data Scientist, 2009)

What problems does Data Science address?

Example of data journalism





Dataset for published articles

Data Science as a Process

What does a Data Science process look like?

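Example of a "classic" data-driven process: ETL (extract, transform, load) in data warehousing

Data is extracted from source systems, transformed, and loaded into a target store (operational data store, data mart, or data warehouse).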

See, e.g., Matteo Golfarelli, Stefano Rizzi. Data Warehouse Design: Modern Principles and Methodologies. McGraw-Hill, 2009.
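A minimal sketch of an ETL run, assuming a small CSV extract as the source and an in-memory SQLite database standing in for the data warehouse. The table name `orders`, the column names, and all values are hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from an operational system.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,20.00
3,alice,
"""

def extract(text):
    """Extract: read raw rows from the source (here: a CSV string)."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalise names, convert amounts, drop rows without an amount."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete record; a real pipeline would log or report it
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), float(amount)))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(transform(extract(RAW_CSV)), conn)

# Query the target store, as an analyst would after loading.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Note that the transform step already makes a decision (dropping the order without an amount) that affects every later analysis.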

What does a Data Science process look like?

"Classic" views are challenged by datafication:

What does a Data Science Process look like?

"Knowledge Discovery in Databases (KDD)" process (often used in the course of Data Mining)





Source: Howard Hamilton

What does a Data Science Lifecycle look like?

Towards a "Data Science workflow"





Cathy O'Neil, Rachel Schutt. Doing Data Science: Straight Talk from the Frontline (O'Reilly, 2013) (Chapter 2)

Iterative Operationalisation





Danyel Fisher & Miriah Meyer. "Making Data Visual" (O'Reilly, 2018) (Chapter 2)

Iterative Operationalisation (cont'd)

Challenges in Data Science

WARNING: At each stage, things can go wrong! Any filtering/aggregation may bias the data!

“Data wrangling is a huge, and surprisingly so, part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.” (The New York Times)
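To make the warning about filtering concrete, here is a small sketch. The step counts are hypothetical, as is the 5000-step plausibility threshold; the point is only that each filtering decision shifts the result.

```python
from statistics import mean

# Hypothetical daily step counts from a wristband; None marks days without a reading.
steps = [12000, 11000, None, 3000, None, 10000, 2500, None]

# Filter A: drop missing readings (silently assumes days are missing at random).
observed = [s for s in steps if s is not None]

# Filter B: additionally drop "implausibly low" days as suspected sensor errors.
# The threshold is an assumption made for this illustration.
plausible = [s for s in observed if s >= 5000]

mean_observed = mean(observed)
mean_plausible = mean(plausible)
print(mean_observed, mean_plausible)
```

If the low days were real (for example, days spent ill), filter B removes exactly the observations that matter and inflates the average.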

Challenges in Data Science (cont'd)





The Data Science Lifecycle: your own experiences?

Which difficulties have you already experienced when working with data?

  1. ... ever had problems loading/importing a file someone sent to you because of an unknown file format?
  2. ... ever encountered something like this: "K�snudl"?
  3. ... ever encountered blanks in your data?
  4. ... ever seen an observation (an insight, a trend) disappear when combining data from different data sets (a.k.a. "Simpson's paradox")?
  5. ... more on that in the next lectures!
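Difficulties 2 and 3 from the list above can be reproduced in a few lines. This is a sketch under assumptions: the intended word is assumed to be "Käsnudl" stored in Latin-1, and the two-row CSV table is made up for illustration.

```python
import csv
import io

# Assumption: the intended word was "Käsnudl" (\u00e4 is "ä"), written as Latin-1.
raw_bytes = "K\u00e4snudl".encode("latin-1")

# Decoding with the wrong codec yields the Unicode replacement character.
garbled = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the codec the data was written in restores the text.
restored = raw_bytes.decode("latin-1")
print(garbled, restored)

# Blanks: empty CSV cells arrive as empty strings and must be handled explicitly.
table = io.StringIO("item,count\nA,10\nB,\n")  # hypothetical data
rows = list(csv.DictReader(table))
blanks = [row["item"] for row in rows if row["count"].strip() == ""]
print(blanks)
```

In both cases the data loads without an error message, which is why such problems are easy to miss.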

Excursus: Data losses (1)





Excursus: Simpson's paradox (1)





Excursus: Simpson's paradox (2)
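A small numeric sketch of Simpson's paradox. The counts are illustrative and not taken from the lecture; they follow the widely cited kidney-stone treatment example, with successes and trials per treatment and stone size.

```python
# (successes, trials) per treatment and subgroup; illustrative counts.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, trials):
    """Success rate within one group."""
    return successes / trials

def overall(groups):
    """Success rate after combining all subgroups of one treatment."""
    total_successes = sum(s for s, _ in groups.values())
    total_trials = sum(t for _, t in groups.values())
    return rate(total_successes, total_trials)

# Within each subgroup, treatment A has the higher success rate ...
for group in ("small", "large"):
    print(group, round(rate(*data["A"][group]), 3), round(rate(*data["B"][group]), 3))

# ... but after combining the subgroups, treatment B looks better.
overall_a = overall(data["A"])
overall_b = overall(data["B"])
print("overall", round(overall_a, 3), round(overall_b, 3))
```

The reversal arises because the subgroups have very different sizes per treatment: A was mostly applied to the harder (large) cases, B mostly to the easier (small) ones.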





Data Science Lifecycle: Summary

Again, there is no single definition, but some steps recur:

  1. find and collect all relevant data
  2. identify issues & problems within the data
  3. organise / transform / merge data
  4. systematically operationalise questions about the data: proxies
  5. select a visualisation, a statistical technique, or a machine-learning technique as an outcome of operationalisation
  6. provide interpretations and limitations of the results
  7. communicate results
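The steps above can be walked through on a toy example. This is only a sketch: the survey responses, the question ("Which course is more demanding?"), and the choice of median weekly hours as a proxy are all hypothetical.

```python
from statistics import median

# 1. Collect: hypothetical survey answers (course, self-reported weekly hours).
responses = [("A", "6"), ("A", "8"), ("A", ""), ("B", "3"), ("B", "4"), ("B", "5")]

# 2. Identify issues: blank answers cannot be converted to numbers.
issues = [r for r in responses if not r[1].strip()]

# 3. Organise/transform: keep valid answers, convert them, group by course.
hours = {}
for course, value in responses:
    if value.strip():
        hours.setdefault(course, []).append(float(value))

# 4. Operationalise: "Which course is more demanding?" becomes the proxy
#    "median self-reported weekly hours".
# 5. Select a technique: a simple summary statistic per group.
proxy = {course: median(values) for course, values in hours.items()}

# 6./7. Interpret and communicate, including the limitation found in step 2.
most_demanding = max(proxy, key=proxy.get)
print(
    f"Course {most_demanding} looks more demanding (median hours: {proxy}); "
    f"{len(issues)} blank answer(s) were excluded."
)
```

Whether self-reported hours are a good proxy for "demanding" is itself part of the interpretation in step 6.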

Data Science Ethics

Ethics in Data Science: FACT (Fairness, Accuracy, Confidentiality, Transparency)

Ethics in Data Science: FACT (cont'd)





Source: http://www.responsibledatascience.org/

Data Science Lifecycle: Summary

NOTE: These steps may take 80% of the work or more. This is the focus of our course "Data Processing I"!

Data Science Tools

Data Science Tools: Python and R





Source: https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis

Why Python and R

The Python vs R debate confines you to one programming language. You should look beyond it and embrace both tools for their respective strengths. Using more tools will only make you better as a data scientist. [TheNextWeb]