SBWL 1: Data Processing 1 (PI2.0)

Winter Term 2021 Stefan Sobernig, Axel Polleres

Syllabus

Overall, students in this course shall gain fundamental knowledge of dealing with different data formats and of using methods and tools to integrate data from various sources:

Hands-on experience in processing and preparing data for data science tasks with Python.
An understanding of how to use Python's standard libraries to write programs and access various data science tools.
Working knowledge of how to solve basic data (pre-)processing tasks, including:

Finding & accessing data (e.g., tabular (CSV), tree (JSON or XML), graph-shaped (RDF) data but also databases)
Cleansing and normalizing data
Sorting, filtering and grouping data
Tools and algorithms for data transformation
Connecting to and loading data into a database system, and indexing techniques for faster access to data in a database

Schedule

Unit	Date	Room	Topic
1	Tue 05.10.2021 13:00 – 18:30	"Online"	Course introduction
2	Tue 12.10.2021 13:00 – 18:30	"Online"	Data access
3	Tue 19.10.2021 13:00 – 18:30	"Online"	Data processing (basics)
4	Tue 02.11.2021 13:00 – 18:30	"Online"	Data processing (cont'd)
5	Tue 09.11.2021 13:00 – 18:30	"Online"	Data storage & persistence
6	Tue 23.11.2021 13:00 – 18:30	"Online"	Advanced topics (pandas, visualisation)
7	Tue 07.12.2021 14:00 – 18:00	"D2.0.030"	"Feedback / review session"

Organisational

Instructor(s)

Stefan Sobernig
Axel Polleres
Mariusz Nitecki (Tutor)
Sebastian Kreimel (Tutor)

dp1-team@wi.wu.ac.at

Grading

See the authoritative details at Learn@WU 1211 and Learn@WU 1855.

Course Material

Course Learn@WU homepage 1211 / Course Learn@WU homepage 1855
Code of Conduct
Supporting text book: Data Science from Scratch (available via the WU library, EBSCO)
Student Jupyter installation at: Learn@WU 5259 / Learn@WU 5781

Unit details

Unit 1: Course Overview & Introduction

Introduction

Motivation & expected learning outcome
Course structure
Grading

What is Data Science and how does it work? (theory)
Course tools and materials (practice)
Python & Jupyter Coding Environment (practice)

Slides: This unit is also available as a single HTML Page, e.g. for easier printing

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 1 (available via the WU library, EBSCO)
van der Aalst, W.M.P., Bichler, M. & Heinzl, A. Bus Inf Syst Eng (2017) 59: 311. https://doi.org/10.1007/s12599-017-0487-z
Responsible Data Science

Unit 2: Data formats, encoding & access

Data encoding and exchange formats, standards (JSON, CSV, XML, RDF) (theory)

Data format specific parsing in Python
Encoding (conversion of encodings)

Data access and parsing

Accessing data from files or from the Web

How and where to find data?
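
The parsing and encoding topics above can be illustrated with a minimal sketch that uses only Python's standard library. The sample strings and the helper names (`parse_csv`, `parse_json`, `reencode`) are assumptions made for illustration and are not taken from the course material; in practice the text would be read from a file or fetched from the Web.

```python
import csv
import io
import json

# Hypothetical sample data; real input would come from a file or a URL.
CSV_TEXT = "city,population\nVienna,1900000\nGraz,290000\n"
JSON_TEXT = '{"city": "Linz", "population": 205000}'


def parse_csv(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text):
    """Parse a JSON document into Python objects (dict, list, str, ...)."""
    return json.loads(text)


def reencode(raw, source="latin-1", target="utf-8"):
    """Convert bytes from one character encoding to another."""
    return raw.decode(source).encode(target)


csv_rows = parse_csv(CSV_TEXT)
json_record = parse_json(JSON_TEXT)
utf8_bytes = reencode("Österreich".encode("latin-1"))

print(csv_rows[0]["city"], json_record["population"], utf8_bytes.decode("utf-8"))
```

Note that `csv.DictReader` returns every field as a string, whereas `json.loads` keeps JSON numbers as Python numbers; converting CSV fields to the intended types is a separate step.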

Slides: This unit is also available as a single HTML Page, e.g. for easier printing

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 9 (available via the WU library, EBSCO)

Unit 3: Data cleaning and preparation (Basics)

Data inspection / reshaping
Data filtering
Data sorting
Data aggregation (grouping)
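
As a rough illustration of filtering, sorting and grouping, the following sketch works on a small list of dictionaries using only the standard library. The records and the threshold are invented for this example; the course itself may use other data or tools such as pandas.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records, invented for illustration.
records = [
    {"city": "Vienna", "state": "Vienna", "population": 1900000},
    {"city": "Graz", "state": "Styria", "population": 290000},
    {"city": "Leoben", "state": "Styria", "population": 24000},
    {"city": "Linz", "state": "Upper Austria", "population": 205000},
]

# Filtering: keep only records that satisfy a condition.
large = [r for r in records if r["population"] >= 100000]

# Sorting: order records by a key, largest population first.
by_population = sorted(records, key=itemgetter("population"), reverse=True)

# Grouping: itertools.groupby only merges adjacent items,
# so the records are sorted by the grouping key first.
by_state = sorted(records, key=itemgetter("state"))
totals = {
    state: sum(r["population"] for r in group)
    for state, group in groupby(by_state, key=itemgetter("state"))
}

print(len(large), by_population[0]["city"], totals["Styria"])
```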

Slides: This unit is also available as a single HTML Page, e.g. for easier printing

Unit 4: Data cleaning and preparation (Cont'd)

Missing data
Data duplicates
Data outliers (incl. outlier exploration, removal)
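
The three topics above can be sketched on a small list of measurements. The values are invented, and the choices made here (median imputation, order-preserving de-duplication, the 1.5 × IQR rule for outliers) are common conventions assumed for illustration, not necessarily the methods taught in this unit.

```python
import statistics

# Hypothetical measurements; None marks a missing value.
raw = [12.0, 11.5, None, 12.0, 13.1, 250.0, 12.4, None, 11.9]

# Missing data: either drop missing values ...
present = [v for v in raw if v is not None]

# ... or impute them, here with the median of the observed values.
median = statistics.median(present)
imputed = [median if v is None else v for v in raw]

# Duplicates: dict.fromkeys keeps the first occurrence and preserves order.
deduplicated = list(dict.fromkeys(present))

# Outliers: flag values outside 1.5 * IQR from the quartiles.
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inliers = [v for v in present if low <= v <= high]
outliers = [v for v in present if not low <= v <= high]

print(imputed, deduplicated, outliers)
```

Whether a flagged value is an error or a genuine extreme observation usually needs to be explored before it is removed.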

Slides: This unit is also available as a single HTML Page

Unit 5: Data storage & Persistence

Why do you need persistence?
Persisting in files vs. in a database system
Python and Persistence:

Persisting objects in files: Pickle
Persisting objects in a Relational Database

Working with Relational Database Systems: SQLite

Connecting to a database system and loading data into and from it
Creating, Updating, Querying a Database
Querying data from a Relational Database
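
A minimal sketch of the two persistence approaches named above, using the standard-library modules `pickle` and `sqlite3`. The table name, column names, index name and sample rows are assumptions made for this example. An in-memory database is used so that the sketch leaves no files behind; passing a file name to `sqlite3.connect` would persist the data on disk.

```python
import pickle
import sqlite3

# Hypothetical sample data, invented for illustration.
cities = [("Vienna", 1900000), ("Graz", 290000), ("Linz", 205000)]

# Persisting objects with pickle: serialise to bytes and restore again.
# (pickle.dump / pickle.load do the same against an open binary file.)
blob = pickle.dumps(cities)
restored = pickle.loads(blob)

# Persisting in a relational database: create, load, update, query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT PRIMARY KEY, population INTEGER)")
conn.executemany("INSERT INTO city (name, population) VALUES (?, ?)", cities)
conn.execute("UPDATE city SET population = ? WHERE name = ?", (295000, "Graz"))

# An index on the queried column can speed up lookups on larger tables.
conn.execute("CREATE INDEX idx_city_population ON city (population)")
conn.commit()

large = conn.execute(
    "SELECT name, population FROM city WHERE population > ? "
    "ORDER BY population DESC",
    (250000,),
).fetchall()
conn.close()

print(restored == cities, large)
```

The `?` placeholders let the database driver handle quoting of values, which is generally safer than building SQL strings by hand.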

Slides: This unit is also available as a single HTML Page for printing.

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 23 (available via the WU library, EBSCO)

Unit 6: Advanced topics

Basic analysis of algorithms: The Big O
Visualizations for Data Science:

Picking the "right" visualization
Tooling primer: matplotlib, pandas

(Library support):

High-level libraries: pandas (cont'd)
Low-level libraries: numpy, scipy
Plotting (cont'd): seaborn, bokeh
Parsing
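
To give an idea of what the Big O topic in this unit is about, the following sketch counts the steps that a linear search, O(n), and a binary search, O(log n), need to find the last element of a sorted list. The function names and the input size are assumptions chosen for illustration.

```python
def linear_search_steps(sorted_values, target):
    """Count the elements inspected by a linear scan: O(n) in the worst case."""
    steps = 0
    for value in sorted_values:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(sorted_values, target):
    """Count the probes made by binary search: O(log n) in the worst case."""
    low, high = 0, len(sorted_values) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_values[mid] == target:
            break
        if sorted_values[mid] < target:
            low = mid + 1  # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return steps


data = list(range(1024))
linear_steps = linear_search_steps(data, 1023)
binary_steps = binary_search_steps(data, 1023)

print(linear_steps, binary_steps)
```

Doubling the input size roughly doubles the work of the linear scan but adds only about one more probe to the binary search, which is the kind of growth behaviour that Big O notation describes.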

Slides: This unit is also available in a PDF format and as a single HTML Page

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 3 (available via the WU library, EBSCO)

Jupyter Notebook

The theoretical part of the course is accompanied by practical code examples and hands-on exercises using the interactive Python environment Jupyter.

You will find yourself in your personal folder "my-notebooks"
You can go back to the overview by clicking on the small folder icon to the left of "my-notebooks"
The "share" folder can be accessed by any JupyterHub user
The course notebooks can be found in the home directory (Note: For the course you are logged into)

Supplemental Reading

Coding

Learn Python in Y minutes
Learning Python by Mark Lutz and David Ascher (O’Reilly)
Doing Data Science: Straight Talk From the Frontline by Rachel Schutt and Cathy O'Neil. O'Reilly Media, Inc., 2013. ISBN 1449358659, 9781449358655
Official Python 3 Tutorial (English)
Official Python 3 Tutorial (German)
A gallery of interesting Jupyter Notebooks
PythonDataScienceHandbook
