SBWL 1: Data Processing 1 (PI2.0)
Winter Term 2018/19
Aniko Hannak, Axel Polleres, Stefan Sobernig
Table of contents
Schedule
Organisational
Unit details
Jupyter Notebook
Supplemental Reading
Syllabus
Overall, students in this course shall gain fundamental knowledge of different data formats and of the methods and tools used to integrate data from various sources. In particular, the course provides:
- Hands-on experience in processing and preparing data for data science tasks with Python.
- An understanding of how to use Python's standard libraries to write programs and to access various data science tools.
- Working knowledge of how to solve basic data (pre-)processing tasks, including:
  - Finding & accessing data (e.g., tabular (CSV), tree-shaped (JSON or XML), and graph-shaped (RDF) data, but also databases)
  - Cleansing and normalizing data
  - Sorting, filtering and grouping data
  - Tools and algorithms for data transformation
  - Connecting to and loading data into a database system, and indexing techniques for faster access to data in a database
Schedule
Organisational
Instructor(s)
aniko.hannak@wu.ac.at
axel.polleres@wu.ac.at
stefan.sobernig@wu.ac.at
Konstantin Kueffner (Tutor)
konstantin.kueffner@wu.ac.at
Grading
See the authoritative details at Learn@WU.
Course Material
Unit details
- Introduction
- Motivation & expected learning outcome
- Course structure
- Grading
- What is Data Science and how does it work? (theory)
- Course tools and materials (practice)
- Python & Jupyter Coding Environment (practice)
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
Notebook of Unit 1
Unit 1: Homework
Task:
- Get comfortable using the Jupyter notebook environment.
- Learn basics of Python programming.
- Solve six tasks
Details: Assignment 1 on Learn@WU
Submission: Via Assignment 1 on Learn@WU, until Tue, October 16 2018, 23:59.
- Data encoding and exchange formats, standards (JSON, CSV, XML, RDF)
- How and where to find data?
- Data access and parsing
- Encoding (conversion of encodings)
- Data format specific parsing in Python
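As a minimal sketch of the encoding conversion and format-specific parsing topics listed above, the following uses only Python's standard library. The sample data, the column names, and the choice of Latin-1 as the source encoding are made up for illustration and are not taken from the course material.

```python
import csv
import io
import json

# Made-up sample data; in practice this would be read from a file or URL.
raw_bytes = "name,value\nKöln,1\nGraz,2\n".encode("latin-1")

# Encoding conversion: decode from the (assumed) source encoding,
# then re-encode as UTF-8.
text = raw_bytes.decode("latin-1")
utf8_bytes = text.encode("utf-8")

# CSV parsing: DictReader maps each row to the names in the header line.
rows = list(csv.DictReader(io.StringIO(text)))

# JSON parsing: json.loads turns a JSON document into dicts and lists.
doc = json.loads('{"items": [{"name": "Köln", "tags": ["a", "b"]}]}')

print(rows[0]["name"], rows[1]["value"])
print(doc["items"][0]["tags"])
```

Note that `csv` returns every field as a string, so numeric columns need an explicit conversion such as `int(row["value"])`.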
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
Notebook of Unit 2
- Data inspection/reshaping
- Data filtering
- Data sorting
- Data aggregation (grouping)
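The filtering, sorting and grouping steps listed above can be sketched with plain Python as follows. The records and field names are hypothetical, and the course notebooks may well use other tools (such as pandas) for the same tasks.

```python
from itertools import groupby
from operator import itemgetter

# Made-up records for illustration only.
records = [
    {"dept": "A", "name": "x", "score": 3},
    {"dept": "B", "name": "y", "score": 9},
    {"dept": "A", "name": "z", "score": 7},
    {"dept": "B", "name": "w", "score": 1},
]

# Filtering: keep records with a score of at least 3.
kept = [r for r in records if r["score"] >= 3]

# Sorting: order by department, then by descending score.
ordered = sorted(kept, key=lambda r: (r["dept"], -r["score"]))

# Grouping/aggregation: groupby only merges adjacent items,
# so the input must already be sorted by the grouping key.
totals = {
    dept: sum(r["score"] for r in group)
    for dept, group in groupby(ordered, key=itemgetter("dept"))
}

print([r["name"] for r in ordered])
print(totals)
```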
Slides: This unit is also available in PDF format and as a single HTML page.
Notebook of Unit 3
Unit 2: Homework
Task:
- Detect the file format and file size (convert to KB or MB) of each data set, clearly documenting your actions (e.g. through commented code).
- Validate the data sets according to the data format used: Are there any data-formatting issues? Hint: You may use online validators for CSV, XML, or JSON. Please clearly indicate any validators used and summarize their reports (also mind "bonus work" below).
- Access the two data sources and inspect their content (e.g., using the Python recipes presented to you in the notebooks and tutorials). Describe the characteristics of the data sets (depending on the format), for example:
* number of data items
* data columns, data rows (if applicable)
* nesting level (if applicable)
Details: Assignment 2 on Learn@WU
Submission: Via Assignment 2 on Learn@WU, until Fri, October 26 2018, 23:59.
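One possible way to approach the first task item (file format and file size) is sketched below. The helper `describe_file` and the file `sample.csv` are hypothetical, and the sketch assumes 1 KB = 1024 bytes. The file extension is only a hint at the format and may be wrong, so the content should still be inspected.

```python
import os
import tempfile

def describe_file(path):
    """Return (extension, size in KB, size in MB) for the file at path."""
    size_bytes = os.path.getsize(path)
    extension = os.path.splitext(path)[1].lower().lstrip(".")
    # Assumption: binary units (1 KB = 1024 bytes); use 1000 for decimal units.
    return extension, size_bytes / 1024, size_bytes / 1024 ** 2

# Hypothetical stand-in for a downloaded data set.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    # newline="" keeps "\n" as a single byte on every platform.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write("a,b\n" * 512)  # 4 bytes per line, 2048 bytes in total
    ext, kb, mb = describe_file(path)

print(ext, kb, mb)
```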
- Missing data
- Data duplicates
- Data outliers (incl. outlier exploration, removal)
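The three cleaning topics above can be illustrated with a small standard-library sketch. The records are made up, `None` is assumed to mark a missing value, and the 1.5 × IQR rule is just one common convention for flagging outliers, not necessarily the one used in the unit. `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics

# Made-up (id, value) records; None marks a missing value.
records = [
    ("a", 10), ("b", 12), ("c", None), ("d", 11), ("b", 12), ("e", 95),
    ("f", 10), ("g", None), ("h", 13), ("i", 11), ("j", 12),
]

# Missing data: drop incomplete records (imputation is an alternative).
complete = [r for r in records if r[1] is not None]

# Duplicates: keep the first occurrence of each identical record,
# preserving the original order.
deduped = list(dict.fromkeys(complete))

# Outliers: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
values = [value for _, value in deduped]
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in deduped if not low <= r[1] <= high]
cleaned = [r for r in deduped if low <= r[1] <= high]

print(outliers)
```

Whether a flagged value is actually removed should depend on exploring it first, since an extreme value is not necessarily an error.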
Slides: This unit is also available in PDF format and as a single HTML page.
Notebooks of Unit 4
Connecting to a database system and loading data into and from it (vs. storing/loading from a file)
- Relational Database Systems: SQLite
- Python and Persistence:
- Persisting objects in files: Pickle
- Persisting objects in a Relational Database
- Querying data from a Relational Database
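The persistence options listed above might look roughly like this with the standard-library modules `pickle` and `sqlite3`. The table name `person`, its columns, and the sample rows are hypothetical. An in-memory SQLite database is used so that the sketch leaves no file behind.

```python
import os
import pickle
import sqlite3
import tempfile

# Made-up objects to persist.
people = [("alice", 36), ("bob", 28)]

# Persisting objects in a file with pickle, then loading them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "people.pkl")
    with open(path, "wb") as f:
        pickle.dump(people, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# Persisting the same objects in a relational database (SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?)", people)
conn.commit()

# Querying: "?" placeholders keep values separate from the SQL text.
older = conn.execute(
    "SELECT name FROM person WHERE age > ? ORDER BY name", (30,)
).fetchall()
conn.close()

print(restored)
print(older)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.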
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
Notebook of Unit 5
- Basic analysis of algorithms: The Big O
- Visualization primer: matplotlib, pandas
- (Library support):
- Low-level libraries: numpy, scipy
- High-level libraries: pandas (cont'd)
- Plotting (cont'd): seaborn, bokeh
- Parsing
Slides: This unit is also available in PDF format and as a single HTML page.
Readings:
Notebooks of Unit 6
Jupyter Notebook
The theoretical part of the course is accompanied by practical code examples and hands-on exercises using the interactive Python environment Jupyter.
Supplemental Reading
Coding