SBWL 1: Data Processing 1 (PI2.0)
Summer term 2019
Axel Polleres, Stefan Sobernig
Table of contents
Schedule
Organisational
Unit details
Jupyter Notebook
Supplemental Reading
Syllabus
Overall, students in this course shall gain fundamental knowledge of dealing with different data formats and of using methods and tools to integrate data from various sources.
- Hands-on experience in processing and preparing data for data science tasks with Python.
- An understanding of how to use Python's standard libraries to write programs and access various data science tools.
- Working knowledge of how to solve basic data (pre-)processing tasks, including:
- Finding & accessing data (e.g., tabular (CSV), tree-shaped (JSON or XML), and graph-shaped (RDF) data, but also databases)
- Cleansing and normalizing data
- Sorting, filtering and grouping data
- Tools and algorithms for data transformation
- Connecting to a database system and loading data into it, plus indexing techniques for faster access to data in a database
Schedule
Organisational
Instructor(s)
axel.polleres@wu.ac.at
stefan.sobernig@wu.ac.at
Raphael Dachs (Tutor)
raphael.dachs@wu.ac.at
Grading
See the authoritative details at Learn@WU.
Course Material
Unit details
- Introduction
- Motivation & expected learning outcome
- Course structure
- Grading
- What is Data Science and how does it work? (theory)
- Course tools and materials (practice)
- Python & Jupyter Coding Environment (practice)
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
Notebook of Unit1
Unit1: Homework
Task:
- Get comfortable using the Jupyter notebook environment.
- Learn basics of Python programming.
- Solve four tasks
Details: Assignment 1 on Learn@WU
Submission: Via Assignment 1 on Learn@WU, until Mon, March 18, 2019, 23:55.
- Data encoding and exchange formats, standards (JSON, CSV, XML, RDF)
- How and where to find data?
- Data access and parsing
- Encoding (conversion of encodings)
- Data format specific parsing in Python
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
Notebook of Unit2
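As a rough illustration of the encoding conversion and format-specific parsing listed for this unit, the sketch below uses only Python's standard library. The sample data and the helper names `parse_csv` and `parse_json` are made up for illustration and are not taken from the course notebooks.

```python
import csv
import io
import json

# Hypothetical sample data, not from the course materials.
CSV_TEXT = "city,population\nVienna,1900000\nGraz,290000\n"
JSON_TEXT = (
    '[{"city": "Vienna", "population": 1900000},'
    ' {"city": "Graz", "population": 290000}]'
)


def parse_csv(text):
    """Parse CSV text into a list of dicts, converting the numeric column by hand."""
    rows = [dict(row) for row in csv.DictReader(io.StringIO(text))]
    for row in rows:
        row["population"] = int(row["population"])  # CSV values arrive as strings
    return rows


def parse_json(text):
    """Parse JSON text; numbers keep their type without manual conversion."""
    return json.loads(text)


csv_rows = parse_csv(CSV_TEXT)
json_rows = parse_json(JSON_TEXT)
print(csv_rows == json_rows)

# Encoding conversion: bytes stored as Latin-1 are decoded to text,
# then re-encoded as UTF-8.
latin1_bytes = "Österreich".encode("latin-1")
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
print(len(latin1_bytes), len(utf8_bytes))
```

The CSV reader yields every field as a string, whereas JSON carries number types, which is why only the CSV path needs an explicit `int(...)` conversion.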
- data inspection/reshaping
- data filtering
- data sorting
- data aggregation (grouping)
Slides: This unit is also available in PDF format and as a single HTML page.
Notebook of Unit3
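The filtering, sorting, and grouping steps named for this unit can be sketched in plain Python as follows. The records and the threshold are invented for illustration; the course notebooks may well use other data or a library such as pandas instead.

```python
from collections import defaultdict

# Hypothetical records, not from the course materials.
records = [
    {"city": "Vienna", "country": "AT", "population": 1900000},
    {"city": "Graz", "country": "AT", "population": 290000},
    {"city": "Berlin", "country": "DE", "population": 3600000},
    {"city": "Bonn", "country": "DE", "population": 330000},
]

# Filtering: keep only cities with at least 300,000 inhabitants.
large = [r for r in records if r["population"] >= 300000]

# Sorting: order the filtered cities by population, largest first.
by_population = sorted(large, key=lambda r: r["population"], reverse=True)

# Aggregation (grouping): sum the population per country.
totals = defaultdict(int)
for r in records:
    totals[r["country"]] += r["population"]

print([r["city"] for r in by_population])
print(dict(totals))
```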
Homework
- Find two datasets online
- Detect the file format and file size (using Python)
- Validate the datasets according to the detected file format (you can use online validation services)
- Find out and describe the characteristics of the datasets (depending on the file format)
Details: Assignment 2 on Learn@WU
Submission: Via Assignment 2 on Learn@WU, until Tue, March 27th, 2019, 12:00.
- Missing data
- Data duplicates
- Data outliers (incl. outlier exploration, removal)
Slides: This unit is also available in PDF format and as a single HTML page.
Notebooks of Unit 4
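A minimal sketch of the three cleansing topics of this unit (missing data, duplicates, outliers) might look like the following. The readings are invented, and the 1.5 × IQR rule is just one common convention for flagging outliers, assumed here for illustration rather than taken from the course materials.

```python
import statistics

# Hypothetical (sensor id, value) readings, not from the course materials.
readings = [
    ("s1", 10.0), ("s2", 11.0), ("s2", 11.0), ("s3", None), ("s4", 12.0),
    ("s5", 13.0), ("s6", 11.0), ("s7", 12.0), ("s8", 14.0), ("s9", 300.0),
]

# Missing data: drop readings without a value.
present = [r for r in readings if r[1] is not None]

# Duplicates: remove exact repeats while preserving the original order.
unique = list(dict.fromkeys(present))

# Outliers: flag values outside 1.5 * IQR around the quartiles (assumed rule).
values = [value for _, value in unique]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [r for r in unique if not lower <= r[1] <= upper]
cleaned = [r for r in unique if lower <= r[1] <= upper]

print(outliers)
print(len(cleaned))
```

Whether a flagged value should be removed or merely inspected depends on the dataset; the sketch only separates the two groups.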
Connection to and loading data into and from a database system
(vs. storing/loading from a file)
- Relational Database Systems: SQLite
- Python and Persistence:
- Persisting objects in files: Pickle
- Persisting objects in a Relational Database
- Querying data from a Relational Database
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
Notebook of Unit5
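The persistence topics of this unit can be sketched with the standard-library modules `pickle` and `sqlite3`. The table name `city`, the index name, and the sample rows are assumptions made for illustration only; an in-memory database is used so that the sketch leaves no files behind.

```python
import pickle
import sqlite3

# Hypothetical sample data, not from the course materials.
cities = [("Vienna", 1900000), ("Graz", 290000), ("Linz", 205000)]

# Persisting objects with pickle: serialize to bytes and restore again.
# (pickle.dump / pickle.load do the same against an open binary file.)
blob = pickle.dumps(cities)
restored = pickle.loads(blob)

# Persisting objects in a relational database (SQLite, in memory here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (name TEXT PRIMARY KEY, population INTEGER)")
conn.executemany("INSERT INTO city VALUES (?, ?)", cities)

# An index on the queried column can speed up lookups on larger tables.
conn.execute("CREATE INDEX idx_city_population ON city (population)")
conn.commit()

# Querying data: parameters are passed separately from the SQL text.
rows = conn.execute(
    "SELECT name FROM city WHERE population > ? ORDER BY population DESC",
    (250000,),
).fetchall()
conn.close()

print(restored == cities)
print(rows)
```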
- Basic analysis of algorithms: The Big O
- (Library support):
- High-level libraries: pandas (cont'd)
- Low-level libraries: numpy, scipy
- Plotting (cont'd): seaborn, bokeh
- Parsing
- Visualization primer: matplotlib, pandas
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
Notebooks of Unit 6
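To give a feel for what Big O notation expresses, the following sketch counts the steps a linear search (O(n)) and a binary search (O(log n)) need on the same sorted list. The function names and the list size are chosen for illustration and do not come from the course notebooks.

```python
def linear_search_steps(items, target):
    """Count the elements inspected by a left-to-right scan: O(n)."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            break
    return steps


def binary_search_steps(items, target):
    """Count the probes of a binary search on a sorted list: O(log n)."""
    lo, hi = 0, len(items) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


data = list(range(1024))  # sorted input, required by binary search
linear = linear_search_steps(data, 1023)
binary = binary_search_steps(data, 1023)
print(linear, binary)
```

Doubling the list length roughly doubles the worst-case step count of the linear scan, but adds only about one more probe to the binary search.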
Jupyter Notebook
The theoretical part of the course is accompanied by practical code examples and hands-on exercises using the interactive Python environment Jupyter.
Supplemental Reading
Coding