SBWL 1: Data Processing 1 (PI2.0)

Winter term 2019, Axel Polleres, Stefan Sobernig

Syllabus

In this course, students gain fundamental knowledge of different data formats and of methods and tools for integrating data from various sources:

Hands-on experience in processing and preparing data for data science tasks with Python.
An understanding of how to use Python's standard libraries to write programs and to access various data science tools.
Working knowledge of how to solve basic data (pre-)processing tasks, including:

Finding & accessing data (e.g., tabular (CSV), tree-shaped (JSON or XML), and graph-shaped (RDF) data, as well as databases)
Cleansing and normalizing data
Sorting, filtering and grouping data
Tools and algorithms for data transformation
Connecting to and loading data into a database system, and indexing techniques for faster access to data in a database

Schedule

Unit	Date	Room	Topic
1	Tue 15.10.2019 10:00 – 14:00	TC.4.14	Course introduction
2	Tue 22.10.2019 10:00 – 14:00	TC.4.14	Data access
3	Tue 29.10.2019 10:00 – 14:00	TC.4.14	Data processing (basics)
4	Thu 31.10.2019 10:00 – 14:00	TC.4.15	Data processing (cont'd)
5	Tue 05.11.2019 10:00 – 14:00	TC.4.14	Data storage
6	Thu 07.11.2019 10:00 – 14:00	TC.4.15	Advanced topics (pandas, visualisation)
7	Thu 14.11.2019 10:00 – 14:00	TC.4.15	buffer
8	Fri 22.11.2019 12:00 – 17:00	D3.0.233	Project presentation

Organisational

Instructor(s)

Grading

See the authoritative details at Learn@WU.

Course Material

Course Learn@WU homepage
Course Slides
Supporting textbook: Data Science from Scratch (available via the WU library, EBSCO)
Student Jupyter installation at: https://datascience.ai.wu.ac.at/sbwl/ws19/$MATRIKELNUMMER (your student ID number), e.g. https://datascience.ai.wu.ac.at/sbwl/ws19/h1234567

Unit details

Unit 1: Course Overview & Introduction

Introduction

Motivation & expected learning outcome
Course structure
Grading

What is Data Science and how does it work? (theory)
Course tools and materials (practice)
Python & Jupyter Coding Environment (practice)

Slides: This unit is also available in PDF format and as a single HTML page

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 1 (available via the WU library, EBSCO)
van der Aalst, W.M.P., Bichler, M. & Heinzl, A. (2017) Responsible Data Science, Bus Inf Syst Eng 59: 311. https://doi.org/10.1007/s12599-017-0487-z

Notebooks of Unit 1

Hello World (download notebook, external NBViewer)
Python3 Intro (download notebook, external NBViewer)

Unit 2: Data access, formats, & encoding

Data encoding and exchange formats, standards (JSON, CSV, XML, RDF)
How and where to find data?
Data access and parsing
Encoding (conversion of encodings)
Data format specific parsing in Python
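
The topics above can be illustrated with a minimal sketch that uses only Python's standard library. It is not taken from the course notebooks: the sample data, the helper name decode_bytes, and the list of candidate encodings are assumptions made for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data (made-up values), stored as Latin-1 encoded bytes.
latin1_bytes = "city,population\nSankt Pölten,100\nGraz,200\n".encode("latin-1")


def decode_bytes(data, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding used).

    Note: latin-1 can decode any byte sequence, so it only makes sense
    as the last fallback in the list.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")


# Encoding detection by trial: UTF-8 fails on the byte for "ö", Latin-1 succeeds.
text, used_encoding = decode_bytes(latin1_bytes)

# Encoding conversion: re-encode the decoded text as UTF-8.
utf8_bytes = text.encode("utf-8")

# CSV parsing: each row becomes a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(text)))

# JSON serialisation and parsing (round trip).
json_text = json.dumps(rows, ensure_ascii=False)
parsed = json.loads(json_text)

# XML parsing: read attribute values from child elements.
root = ET.fromstring('<cities><city name="Graz" population="200"/></cities>')
xml_names = [city.get("name") for city in root.findall("city")]
```

All values read from CSV arrive as strings, so numeric columns such as population still need an explicit conversion (for example with int) before any arithmetic.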

Slides: This unit is also available in PDF format and as a single HTML page

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 9 (available via the WU library, EBSCO)

Notebooks of Unit 2

Encodings and reading text files (download notebook, external NBViewer)
Dealing with CSV in Python (download notebook, external NBViewer)
Dealing with JSON in Python (download notebook, external NBViewer)
Dealing with XML in Python (download notebook, external NBViewer)
Dealing with RDF in Python (download notebook, external NBViewer)

Unit 3: Data cleaning and preparation (Basics)

Data inspection/reshaping
Data filtering
Data sorting
Data aggregation (grouping)
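
As a rough illustration of filtering, sorting, and grouping, the following sketch works on a small list of dictionaries using plain Python. The records and the threshold are made-up assumptions, not course data, and the course notebooks may use different tools for the same steps.

```python
from collections import defaultdict

# Hypothetical records (made-up values for illustration only).
records = [
    {"city": "A", "country": "AT", "population": 300},
    {"city": "B", "country": "AT", "population": 100},
    {"city": "C", "country": "DE", "population": 500},
    {"city": "D", "country": "DE", "population": 50},
]

# Filtering: keep only records with a population of at least 100.
large = [r for r in records if r["population"] >= 100]

# Sorting: order the filtered records by population, largest first.
by_population = sorted(large, key=lambda r: r["population"], reverse=True)

# Grouping and aggregation: total population per country.
totals = defaultdict(int)
for r in records:
    totals[r["country"]] += r["population"]
```

Here by_population lists the cities C, A, B in that order, and totals maps AT to 400 and DE to 550.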

Slides: This unit is also available in PDF format and as a single HTML page

Notebooks of Unit 3

Simple data transformations (download notebook, external NBViewer)
Urban Audit (download notebook, external NBViewer)

Unit 4: Data cleaning and preparation (Cont'd)

Missing data
Data duplicates
Data outliers (incl. outlier exploration, removal)
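
One possible way to handle the three issues above is sketched below with the standard library. The measurements are invented, and the choices made here (median imputation, order-preserving de-duplication, and the 1.5 × IQR rule for outliers) are assumptions for illustration rather than the methods prescribed in the unit.

```python
import statistics

# Hypothetical measurements; None marks a missing value.
values = [10.0, 12.0, None, 11.0, 12.0, 95.0, 9.0, None, 10.0]

# Missing data: either drop the missing entries ...
present = [v for v in values if v is not None]

# ... or impute them, here with the median of the observed values.
median = statistics.median(present)
imputed = [median if v is None else v for v in values]

# Duplicates: remove repeated values while preserving first-seen order.
unique = list(dict.fromkeys(present))

# Outliers: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in present if v < low or v > high]
cleaned = [v for v in present if low <= v <= high]
```

With these sample values the only flagged outlier is 95.0. Whether such a value is an error or a genuine observation needs to be explored before removing it.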

Slides: This unit is also available in PDF format and as a single HTML page

Notebooks of Unit 4

Missing data and duplicate data (download notebook, external NBViewer)

Unit 5: Data storage & Persistence

Storing/loading data to/from a file vs. connecting to a database system and loading data into and from it

Python and Persistence:

Persisting objects in files: Pickle

Relational Database Systems: SQLite

Querying data from a Relational Database
Persisting objects in a Relational Database
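
The persistence options listed above can be sketched with the standard-library modules pickle and sqlite3. The table names, column names, index name, and sample values below are hypothetical, and the database is kept in memory so that the example leaves no files behind.

```python
import pickle
import sqlite3

# Hypothetical object to persist.
profile = {"name": "example", "tags": ["a", "b"], "score": 3}

# Pickle round trip in memory; pickle.dump / pickle.load do the same
# with a file opened in binary mode ("wb" / "rb").
blob = pickle.dumps(profile)
restored = pickle.loads(blob)

# In-memory SQLite database; pass a file path instead of ":memory:"
# to keep the data on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (name TEXT, country TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [("A", "AT", 300), ("B", "AT", 100), ("C", "DE", 500)],
)

# An index on the filter column lets SQLite avoid scanning the whole table.
conn.execute("CREATE INDEX idx_cities_country ON cities (country)")

# Querying: parameters are passed separately, never formatted into the SQL string.
rows = conn.execute(
    "SELECT name, population FROM cities WHERE country = ? ORDER BY population DESC",
    ("AT",),
).fetchall()

# Persisting an object in the database: store the pickled bytes as a BLOB.
conn.execute("CREATE TABLE objects (key TEXT PRIMARY KEY, data BLOB)")
conn.execute("INSERT INTO objects VALUES (?, ?)", ("profile", blob))
(stored,) = conn.execute(
    "SELECT data FROM objects WHERE key = ?", ("profile",)
).fetchone()
from_db = pickle.loads(stored)

conn.commit()
conn.close()
```

Unpickling can execute arbitrary code, so pickle.loads should only be applied to data from a trusted source.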

Slides: This unit is also available in PDF format and as a single HTML page

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 23 (available via the WU library, EBSCO)

Notebooks of Unit 5

Storing-loading (download notebook, external NBViewer)
SQLite+Python (download notebook, external NBViewer)

Unit 6: Advanced topics

Basic analysis of algorithms: The Big O
Library support:

High-level libraries: pandas (cont'd)
Low-level libraries: numpy, scipy
Plotting (cont'd): seaborn, bokeh
Parsing

Visualization primer: matplotlib, pandas
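
To make the Big O item above more concrete, the following sketch contrasts a linear search, which takes O(n) steps, with a binary search, which takes O(log n) steps on sorted input. It is an independent illustration and not the code of the unit's binary search notebook; the function names and the test data are assumptions.

```python
def linear_search(items, target):
    """O(n): inspect the elements one by one; return the index or -1."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1


def binary_search(sorted_items, target):
    """O(log n): halve the search interval in every step.

    Requires sorted_items to be sorted in ascending order.
    """
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1


# Sorted even numbers 0, 2, ..., 998 (500 elements).
data = list(range(0, 1000, 2))

index_linear = linear_search(data, 500)
index_binary = binary_search(data, 500)
missing = binary_search(data, 501)
```

Both functions return the same index for a value that is present, but for 500 elements the binary search needs at most 9 comparisons of the middle element, whereas the linear search may need up to 500.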

Slides: This unit is also available in PDF format and as a single HTML page

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 3 (available via the WU library, EBSCO)

Notebooks of Unit 6

Intro to Pandas and Visualisation (download notebook, external NBViewer)
Binary search (download notebook, external NBViewer)

Jupyter Notebook

The theoretical part of the course is accompanied by practical code examples and hands-on exercises using the interactive Python environment Jupyter.

Supplemental Reading

Coding

Learn Python in Y minutes
Learning Python by Mark Lutz and David Ascher (O’Reilly)
Doing Data Science: Straight Talk From the Frontline by Rachel Schutt and Cathy O'Neil. O'Reilly Media, Inc., 2013. ISBN 1449358659, 9781449358655
Official Python 3 Tutorial (English)
Official Python 3 Tutorial (German)
A gallery of interesting IPython Notebooks
PythonDataScienceHandbook

Instructor(s)

Rositsa Ivanova (Tutor)