SBWL 1: Data Processing 1 (PI2.0)

Summer term 2019, Axel Polleres, Stefan Sobernig

Schedule
Organisational
Unit details
Jupyter Notebook
Supplemental Reading

Syllabus

Overall, students shall gain fundamental knowledge of dealing with different data formats and of using methods and tools to integrate data from various sources. In particular, the course aims to provide:

Hands-on experience in processing and preparing data for data science tasks with Python.
An understanding of how to use Python's standard libraries to write programs and to access various data science tools.
Working knowledge of how to solve basic data (pre-)processing tasks, including:

Finding and accessing data (e.g., tabular (CSV), tree-shaped (JSON or XML), and graph-shaped (RDF) data, as well as databases)
Cleansing and normalizing data
Sorting, filtering and grouping data
Tools and algorithms for data transformation
Connecting to and loading data into a database system, and indexing techniques for faster access to data in a database

Schedule

Unit	Date	Room	Topic
1	Tue 05.03.2019 11:00 – 15:00	TC.1.01 OeNB	Course introduction
2	Tue 12.03.2019 10:00 – 14:00	TC.0.01 ERSTE	Data access
3	Tue 19.03.2019 10:00 – 14:00	TC.0.02 Red Bull	Data processing (basics)
4	Mon 25.03.2019 10:00 – 14:00	TC2.03	Data processing (cont'd)
5	Tue 02.04.2019 10:00 – 14:00	D1.1.074	Data storage
6	Tue 09.04.2019 10:00 – 14:00	D1.1.078	Advanced topics (pandas, visualisation)
7	Tue 30.04.2019 11:00 – 16:00	EA.6.026	Project presentation

Organisational

Instructor(s)

Grading

See the authoritative details at Learn@WU.

Course Material

Course Learn@WU homepage
Course Slides
Supporting text book: Data Science from Scratch (available via the WU library, EBSCO)
Course Zip Archive
Student Jupyter installation at https://datascience.ai.wu.ac.at/sbwl/ss19/$MATRIKELNUMMER, where $MATRIKELNUMMER is your student ID (matriculation number), e.g., https://datascience.ai.wu.ac.at/sbwl/ss19/h1234567

Unit details

Unit 1: Course Overview & Introduction

Introduction

Motivation & expected learning outcome
Course structure
Grading

What is Data Science and how does it work? (theory)
Course tools and materials (practice)
Python & Jupyter coding environment (practice)

Slides: This unit is also available as a PDF and as a single HTML page.

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 1 (available via the WU library, EBSCO)
van der Aalst, W.M.P., Bichler, M. & Heinzl, A. (2017) Responsible Data Science. Bus Inf Syst Eng 59: 311. https://doi.org/10.1007/s12599-017-0487-z

Notebooks of Unit 1

Hello World (download notebook, external NBViewer)
Python3 Intro (download notebook, external NBViewer)

Unit 1: Homework

Task:

Get comfortable using the Jupyter notebook environment.
Learn the basics of Python programming.
Solve four tasks.

Details: Assignment 1 on Learn@WU

Submission: Via Assignment 1 on Learn@WU, until Mon, March 18, 2019, 23:55.

Unit 2: Data access, formats, & encoding

Data encoding and exchange formats, standards (JSON, CSV, XML, RDF)
How and where to find data?
Data access and parsing
Encoding (conversion of encodings)
Data format specific parsing in Python
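
As a rough illustration of the topics listed above, the following sketch uses only Python's standard library to convert a text between two encodings and to parse a small CSV and a small JSON sample. The sample data and all variable names are invented for this example and are not taken from the course notebooks.

```python
import csv
import io
import json

# Encoding conversion: decode bytes with the source encoding,
# then encode the resulting text with the target encoding.
text = "Größe"
latin1_bytes = text.encode("latin-1")  # one byte per character for this text
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")  # ö and ß take two bytes each

# CSV: parse tabular data into one dict per row (invented sample data).
csv_data = "city,population\nVienna,1900000\nGraz,290000\n"
rows = list(csv.DictReader(io.StringIO(csv_data)))

# JSON: parse tree-shaped data into nested Python dicts and lists.
json_data = '{"city": "Vienna", "districts": [{"name": "Leopoldstadt", "number": 2}]}'
doc = json.loads(json_data)

print(len(latin1_bytes), len(utf8_bytes))            # 5 7
print(rows[0]["city"], int(rows[1]["population"]))   # Vienna 290000
print(doc["districts"][0]["name"])                   # Leopoldstadt
```

Note that the CSV reader returns every field as a string, so numeric columns have to be converted explicitly, whereas the JSON parser already yields numbers as Python integers or floats.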

Slides: This unit is also available as a PDF and as a single HTML page.

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 9 (available via the WU library, EBSCO)

Notebooks of Unit 2

Encodings and reading text files (download notebook, external NBViewer)
Dealing with CSV in Python (download notebook, external NBViewer)
Dealing with JSON in Python (download notebook, external NBViewer)
Dealing with XML in Python (download notebook, external NBViewer)
Dealing with RDF in Python (download notebook, external NBViewer)
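
To give a flavour of what parsing tree-shaped data looks like, here is a minimal sketch that reads a small XML document with the standard library module xml.etree.ElementTree. The XML sample and the element names are invented for this example; the course notebooks may take a different approach. RDF is left out here because parsing it typically relies on a third-party library.

```python
import xml.etree.ElementTree as ET

# Invented sample document: each <city> has a name attribute and a <population> child.
xml_data = """<cities>
  <city name="Vienna"><population>1900000</population></city>
  <city name="Graz"><population>290000</population></city>
</cities>"""

root = ET.fromstring(xml_data)

# Walk over all <city> elements and collect attribute and child text.
populations = {
    city.get("name"): int(city.findtext("population"))
    for city in root.iter("city")
}

print(root.tag)       # cities
print(populations)    # {'Vienna': 1900000, 'Graz': 290000}
```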

Unit 3: Data cleaning and preparation (Basics)

Data inspection/reshaping
Data filtering
Data sorting
Data aggregation (grouping)
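
The three operations named above (filtering, sorting, grouping) can be sketched with plain Python and the standard library as follows. The records are invented sample data, and this is only one possible way to express these steps; the course notebooks may use other tools such as pandas.

```python
from itertools import groupby
from operator import itemgetter

# Invented sample records.
records = [
    {"city": "Vienna", "country": "AT", "population": 1900000},
    {"city": "Graz", "country": "AT", "population": 290000},
    {"city": "Berlin", "country": "DE", "population": 3600000},
    {"city": "Munich", "country": "DE", "population": 1500000},
]

# Filtering: keep only cities with more than one million inhabitants.
large = [r for r in records if r["population"] > 1_000_000]

# Sorting: order by population, largest first.
by_population = sorted(records, key=itemgetter("population"), reverse=True)

# Grouping: groupby() only merges adjacent items, so sort by the key first.
by_country = sorted(records, key=itemgetter("country"))
totals = {
    country: sum(r["population"] for r in group)
    for country, group in groupby(by_country, key=itemgetter("country"))
}

print([r["city"] for r in large])          # ['Vienna', 'Berlin', 'Munich']
print([r["city"] for r in by_population])  # ['Berlin', 'Vienna', 'Munich', 'Graz']
print(totals)                              # {'AT': 2190000, 'DE': 5100000}
```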

Slides: This unit is also available as a PDF and as a single HTML page.

Notebooks of Unit 3

Simple data transformations (download notebook, external NBViewer)
Urban Audit (download notebook, external NBViewer)

Homework

Find two datasets online.
Detect the file format and file size (using Python).
Validate the datasets according to the detected file format (you can use online validation services).
Find out and describe the characteristics of the datasets (depending on the file format).

Details: Assignment 2 on Learn@WU

Submission: Via Assignment 2 on Learn@WU, until Tue, March 27th, 2019, 12:00.

Unit 4: Data cleaning and preparation (Cont'd)

Missing data
Data duplicates
Data outliers (incl. outlier exploration, removal)
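
A minimal sketch of these three cleaning steps on invented sample rows is shown below. The outlier rule used here (distance from the median measured in multiples of the median absolute deviation, with an arbitrarily chosen cutoff of 5) is an assumption made for illustration; it is not necessarily the method taught in this unit.

```python
import statistics

# Invented sample rows: (sensor, date, value). None marks a missing value,
# one row is an exact duplicate, and 250.0 is an implausible reading.
rows = [
    ("s1", "2019-03-01", 21.5),
    ("s1", "2019-03-02", None),
    ("s1", "2019-03-03", 22.0),
    ("s1", "2019-03-03", 22.0),
    ("s1", "2019-03-04", 250.0),
    ("s1", "2019-03-05", 20.5),
    ("s1", "2019-03-06", 22.5),
]

# Missing data: drop incomplete rows (imputation would be an alternative).
complete = [r for r in rows if r[2] is not None]

# Duplicates: keep the first occurrence of each row, preserving order.
unique = list(dict.fromkeys(complete))

# Outliers: flag values far from the median, measured in units of the
# median absolute deviation (MAD). The cutoff of 5 is an arbitrary choice.
values = [r[2] for r in unique]
median = statistics.median(values)
mad = statistics.median(abs(v - median) for v in values)
cleaned = [r for r in unique if abs(r[2] - median) <= 5 * mad]

print(len(rows), len(complete), len(unique), len(cleaned))  # 7 6 5 4
print([r[2] for r in cleaned])                              # [21.5, 22.0, 20.5, 22.5]
```

Whether to drop, impute, or merely flag such rows depends on the dataset and the downstream task; the sketch simply drops them to keep the example short.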

Slides: This unit is also available as a PDF and as a single HTML page.

Notebooks of Unit 4

Missing data and duplicate data (download notebook, external NBViewer)

Unit 5: Data storage & Persistence

Connection to and loading data into and from a database system (vs. storing/loading from a file)

Relational Database Systems: SQLite

Querying a Database

Python and Persistence:

Persisting objects in files: Pickle
Persisting objects in a Relational Database
Querying data from a Relational Database
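
The persistence options listed above can be sketched with the standard library modules pickle and sqlite3. The table name, column names, index name, and sample rows below are invented for this example, and an in-memory database is used so that no file is created.

```python
import pickle
import sqlite3

# Invented sample rows: (name, country, population).
cities = [("Vienna", "AT", 1900000), ("Graz", "AT", 290000), ("Berlin", "DE", 3600000)]

# Pickle: serialise a Python object to bytes and restore it again.
# pickle.dump() and pickle.load() do the same with an open binary file.
restored = pickle.loads(pickle.dumps(cities))

# SQLite: create a table, load the rows, and add an index on the filter column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT, country TEXT, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?, ?)", cities)
conn.execute("CREATE INDEX idx_city_country ON city (country)")
conn.commit()

# Query with a bound parameter rather than string formatting.
result = conn.execute(
    "SELECT name FROM city WHERE country = ? ORDER BY population DESC", ("AT",)
).fetchall()
conn.close()

print(restored == cities)  # True
print(result)              # [('Vienna',), ('Graz',)]
```

Pickle is convenient for Python-only round trips, while a relational database lets other tools query the same data and supports indexes for faster lookups. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.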

Slides: This unit is also available as a PDF and as a single HTML page.

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 23 (available via the WU library, EBSCO)

Notebooks of Unit 5

storing-loading (download notebook, external NBViewer)
SQLite+Python (download notebook, external NBViewer)

Unit 6: Advanced topics

Basic analysis of algorithms: The Big O
Library support:

High-level libraries: pandas (cont'd)
Low-level libraries: numpy, scipy
Plotting (cont'd): seaborn, bokeh
Parsing

Visualization primer: matplotlib, pandas
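
To make the Big O topic more concrete, the following sketch contrasts a linear search, which needs O(n) comparisons in the worst case, with a binary search, which needs O(log n) comparisons but requires sorted input. The function names and the sample data are chosen for this example and are not taken from the course notebook on binary search.

```python
def linear_search(items, target):
    """O(n): inspect the elements one by one."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1


def binary_search(sorted_items, target):
    """O(log n): halve the search interval in each step; input must be sorted."""
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1


data = list(range(0, 2000, 2))  # 1000 sorted even numbers
print(linear_search(data, 998), binary_search(data, 998))  # 499 499
print(binary_search(data, 999))                            # -1
```

For 1000 elements the linear search may need up to 1000 comparisons, whereas the binary search needs at most about 10, since each step halves the remaining interval.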

Slides: This unit is also available as a PDF and as a single HTML page.

Readings:

Grus, J. (2015) Data Science from Scratch, O'Reilly, Chapter 3 (available via the WU library, EBSCO)

Notebooks of Unit 6

Intro to Pandas and Visualisation (download notebook, external NBViewer)
Binary search (download notebook, external NBViewer)

Jupyter Notebook

The theoretical part of the course is accompanied by practical code examples and hands-on exercises using the interactive Python environment Jupyter.

Supplemental Reading

Coding

Learn Python in Y minutes
Learning Python by Mark Lutz and David Ascher (O’Reilly)
Doing Data Science: Straight Talk From the Frontline by Rachel Schutt and Cathy O'Neil. O'Reilly Media, Inc., 2013. ISBN 1449358659, 9781449358655
Official Python 3 Tutorial (English)
Official Python 3 Tutorial (German)
A gallery of interesting IPython Notebooks
Python Data Science Handbook

Instructor(s)

Raphael Dachs (Tutor)
