SBWL 1: Data Processing 1 (PI2.0)
Summer Term 2021
Stefan Sobernig, Axel Polleres
Table of contents
Schedule
Organisational
Unit details
Jupyter Notebook
Supplemental Reading
Syllabus
In this course, students gain fundamental knowledge of different data formats and of the methods and tools used to integrate data from various sources. In particular, they acquire:
- Hands-on experience in processing and preparing data for data science tasks with Python.
- An understanding of how to use Python's standard libraries to write programs and to access various data science tools.
- Working knowledge of how to solve basic data (pre-)processing tasks, including:
- Finding & accessing data (e.g., tabular (CSV), tree-shaped (JSON or XML), and graph-shaped (RDF) data, but also databases)
- Cleansing and normalizing data
- Sorting, filtering and grouping data
- Tools and algorithms for data transformation
- Connecting to and loading data into a database system, and indexing techniques for faster access to data in a database
Schedule
Organisational
Instructor(s)
dp1-team@wi.wu.ac.at
Grading
See the authoritative details at Learn@WU 5259 and Learn@WU 5781.
Course Material
Unit details
- Introduction
- Motivation & expected learning outcome
- Course structure
- Grading
- What is Data Science and how does it work? (theory)
- Course tools and materials (practice)
- Python & Jupyter Coding Environment (practice)
Slides: This unit is also available as a single HTML page, e.g. for easier printing.
Readings:
- Data encoding and exchange formats, standards (JSON, CSV, XML, RDF) (theory)
- Data format specific parsing in Python
- Encoding (conversion of encodings)
- Data access and parsing
- Accessing data from files or from the Web
- How and where to find data?
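As an illustration of the format-specific parsing covered in this unit, here is a minimal sketch using only Python's standard library. The sample data, field names, and variable names are hypothetical and inlined so that the snippet runs without any files or network access. It is not taken from the course notebooks.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, inlined so the sketch runs without external files.
CSV_TEXT = "station,temp_c\nA,21.5\nB,19.0\n"
JSON_TEXT = '{"station": "A", "sensors": [{"id": 1, "kind": "thermometer"}]}'
XML_TEXT = "<stations><station id='A'/><station id='B'/></stations>"

# Tabular data: csv.DictReader yields one dict per row, keyed by the header row.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
# CSV values are always strings; convert explicitly where numbers are needed.
temps = {row["station"]: float(row["temp_c"]) for row in rows}

# Tree-shaped data (JSON): json.loads returns nested dicts and lists.
doc = json.loads(JSON_TEXT)
first_kind = doc["sensors"][0]["kind"]

# Tree-shaped data (XML): ElementTree exposes elements and their attributes.
root = ET.fromstring(XML_TEXT)
station_ids = [station.get("id") for station in root.findall("station")]

# Encoding conversion: decode with the source encoding, re-encode as UTF-8.
latin1_bytes = "Österreich".encode("latin-1")
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")

print(temps, first_kind, station_ids, utf8_bytes.decode("utf-8"))
```

For data stored in files, the same parsers accept file objects opened with an explicit encoding argument, e.g. open(path, encoding="utf-8").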
Slides: This unit is also available as a single HTML page, e.g. for easier printing.
Readings:
- Data inspection/reshaping
- Data filtering
- Data sorting
- Data aggregation (grouping)
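The operations listed above can be sketched in plain Python as follows. The records and field names are made up for illustration. The course may well use pandas for the same tasks, so this is only one possible way to express them.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records; in practice these would come from a parsed file.
records = [
    {"dept": "sales", "amount": 30},
    {"dept": "it", "amount": 10},
    {"dept": "sales", "amount": 20},
    {"dept": "it", "amount": 50},
]

# Filtering: keep only the records that satisfy a condition.
large = [r for r in records if r["amount"] >= 20]

# Sorting: sorted() returns a new list and leaves the input unchanged.
by_amount = sorted(records, key=itemgetter("amount"), reverse=True)

# Grouping: groupby only merges adjacent items, so sort by the key first.
totals = {}
by_dept = sorted(records, key=itemgetter("dept"))
for dept, group in groupby(by_dept, key=itemgetter("dept")):
    totals[dept] = sum(r["amount"] for r in group)

print(len(large), [r["amount"] for r in by_amount], totals)
```

Sorting before grouping matters here: itertools.groupby starts a new group whenever the key changes, so unsorted input would produce several partial groups per department.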
Slides: This unit is also available as a single HTML page, e.g. for easier printing.
- Missing data
- Data duplicates
- Data outliers (incl. outlier exploration, removal)
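A small sketch of the three cleansing steps named above, using only the standard library. The measurements are invented, and the outlier rule shown (Tukey's fences at 1.5 times the interquartile range) is one common choice assumed here for illustration; the unit may use a different criterion.

```python
import statistics

# Hypothetical (sensor, value) rows with a gap, a duplicate, and an extreme value.
rows = [
    ("s1", 10.0), ("s2", 11.0), ("s3", None), ("s2", 11.0),
    ("s4", 10.5), ("s5", 95.0), ("s6", 9.5), ("s7", 10.0),
]

# Missing data: here we simply drop rows without a value.
complete = [row for row in rows if row[1] is not None]

# Duplicates: dict.fromkeys keeps the first occurrence and preserves order.
deduped = list(dict.fromkeys(complete))

# Outliers: flag values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR.
values = [value for _, value in deduped]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [row for row in deduped if not low <= row[1] <= high]
cleaned = [row for row in deduped if low <= row[1] <= high]

print(outliers, len(cleaned))
```

Dropping rows is only one way to handle missing values; imputing a replacement value is another, and which one fits depends on the data and the analysis.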
Slides: This unit is also available in PDF format and as a single HTML page.
- Why do you need persistence?
- Persisting in files vs. in a database system
- Python and Persistence:
- Persisting objects in files: Pickle
- Persisting objects in a Relational Database
- Working with Relational Database Systems: SQLite
- Connection to and loading data into and from a database system
- Creating, Updating, Querying a Database
- Querying data from a Relational Database
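The two persistence routes listed above can be sketched with the standard-library modules pickle and sqlite3. The table name, columns, and sample rows are hypothetical. An in-memory database and in-memory pickling are used so that the snippet leaves no files behind.

```python
import pickle
import sqlite3

# Persisting an object with pickle (in memory here; pickle.dump and
# pickle.load do the same with a file opened in binary mode).
record = {"id": 1, "name": "Ada", "tags": ["python", "sql"]}
blob = pickle.dumps(record)
restored = pickle.loads(blob)  # only unpickle data from a trusted source

# Persisting rows in SQLite: ":memory:" creates a temporary database;
# pass a file path instead to keep the data on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, points INTEGER)"
)

# Loading data: "?" placeholders keep values separate from the SQL text.
conn.executemany(
    "INSERT INTO students (id, name, points) VALUES (?, ?, ?)",
    [(1, "Ada", 80), (2, "Bob", 65), (3, "Cy", 90)],
)

# Updating existing rows.
conn.execute("UPDATE students SET points = points + 5 WHERE name = ?", ("Bob",))

# An index on a frequently filtered column can speed up lookups.
conn.execute("CREATE INDEX idx_students_points ON students (points)")
conn.commit()

# Querying: fetchall() returns a list of tuples.
top = conn.execute(
    "SELECT name, points FROM students WHERE points >= ? ORDER BY points DESC",
    (70,),
).fetchall()
conn.close()

print(restored == record, top)
```

Pickle is convenient for Python-only object storage, while a relational database adds querying, updating, and indexing on top of storage.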
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
- Basic analysis of algorithms: The Big O
- Visualizations for Data Science:
- Picking the "right" visualization
- Tooling primer: matplotlib, pandas
- (Library support):
- High-level libraries: pandas (cont'd)
- Low-level libraries: numpy, scipy
- Plotting (cont'd): seaborn, bokeh
- Parsing
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
Jupyter Notebook
The theoretical part of the course is accompanied by practical code examples and hands-on exercises using the interactive Python environment Jupyter.
- You will find yourself in your personal folder "my-notebooks"
- You can go back to the overview by clicking on the small folder icon to the left of "my-notebooks"
- The "share" folder can be accessed by any JupyterHub user
- The course notebooks can be found in the home directory (note: this applies to the course you are logged into)
Supplemental Reading
Coding