Data storage & Persistence

Prof. Dr. Axel Polleres
Dr. Stefan Sobernig


5 November 2019

Unit 5: Data storage & Persistence

Storing/loading data to/from a file vs. Connection to and loading data into and from a Database System

Slides: This unit is also available in a PDF format and as a single HTML Page

Readings:

Storing/persisting data to disc

Storing/persisting data to disc

We will briefly cover the following methods:

All code snippets on the next slides are also available as notebook (lectureBooks/unit5/Storing_and_Pickling.ipynb)

Do you remember?

cityCodeFile="./data/cities.csv"
#Building the cityCode to Label map
cityCodeMap={}
with open(cityCodeFile) as f:
    csvfile = csv.reader(f)
    for i,row in enumerate(csvfile):
        cityCodeMap[row[3]]= row[1]

Storing/persisting data as CSV

Let's store the dictionary to a CSV file.

import csv
with open('cityNames.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for cityCode, cityName in cityCodeMap.items():
      writer.writerow( [ cityCode, cityName] )

The method writerow() expects a list of values. Each value in the list will be convert to its string representation and written to file

Loading the data back into a dictionary requires to parse the file as CSV and build the dictionary again (see our code before).

Storing/persisting data as JSON

Another way to persist our data structure is to store it to a JSON file.

import json
with open('data.json', 'w') as fp:
    json.dump(cityCodeMap, fp)

NOTE: Storing/persisting data as JSON

Storing and loading objects to and from JSON is normally fast and the preferable way.

HOWEVER :

default, i.e., they will not be preserved when storing and loading to/from a JSON file! This requires a custom JSONEncoder and JSONDecoder

NOTE: Storing/persisting data as PICKLE

Alternative: The pickle module implements binary protocols for serializing and de-serializing a Python object structure. That is, any Python data structure can be "pickled"

import pickle

with open('data.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(cityCodeMap, f, pickle.HIGHEST_PROTOCOL)

Different protocols are supported.

with open('data.pickle', 'rb') as f:
    data=pickle.load( f)

To JSON or to Pickle

There are fundamental differences between the pickle protocols and JSON (JavaScript Object Notation):

Question.

When should you use JSON and when Pickle as serialisation format?

see also official documentation

(Relational) Databases Systems

(Relational) Databases Systems

Question.

What is a (Relational) Database?





Why does a data scientist need databases?

Question.

What features does a DBMS support that you'd need to take care of in your code otherwise?

(Relational) Databases Systems: main features

RDBMSs shield some functionality from the user, which you'd need to take care of yourself when storing all data in files:

Plus they offer efficient and declarative access to the data via a universal, structured query language (SQL):

(Relational) Databases Systems: SQLite

SQLite: Overview

In today's lecture we use a popular Open Source database engine: SQLite

... Due to the SQL standard, working with other RDBMS (e.g. PostgreSQL) is pretty similar!

SQLite: Resources

SQLite: Creating a table

CREATE TABLE table (
column_name1 data_type(size) constraint,
column_name2 data_type(size) constraint,
column_name3 data_type(size) constraint,
....
);

Example SQLite:

CREATE TABLE `person` (
	`personID`	INTEGER PRIMARY KEY AUTOINCREMENT,
	`name`	TEXT NOT NULL,
	`PLZ`	NUMERIC,
	`city`	TEXT,
	`country`	TEXT
);

SQLite: Inserting records in a table

Note: SQLite uses simplified data types. Other RDBMS provide more precise specification.

INSERT INTO table (column1,column2,...)
VALUES (value1,value2,...);

Example SQLite:

INSERT INTO person (personID, name, PLZ, city, country)
VALUES (1, "Peter", "1220", "Vienna", "Austria"), 
       (2, "Jenny", "1220", "Vienna", "Austria");

SQLite: Updating records in a table

Note: SQLite uses simplified data types. Other RDBMS provide more precise specification.

UPDATE table
SET column_1 = new_value_1, column_2 = new_value_2 ...
WHERE search_condition;

Example SQLite:

UPDATE person
SET name="Claire", PLZ="1020"
WHERE personID=2;

SQLite: Deleting records from a table

DELETE FROM table
WHERE search_condition;

Example SQLite:

DELETE FROM person
WHERE name LIKE "C%";

SQLite: Querying Data

Examples SQLite:

SELECT column1, column4
FROM table
WHERE search_condition;

SELECT name, city
FROM person
WHERE PLZ < 1000;

SQLite: Querying Data - Merging Data

Connecting multiple tables using a relationship between two of their attributes, typically the primary key of one table and a foreign key of another.

Examples:

SELECT person.name, data.total
FROM person, data
WHERE person.personID=data.personID
	 AND data.year < 2000;

SELECT person.name, data.total
FROM person JOIN data ON person.personID=data.personID
WHERE data.year > 2000; 

SQLite: Querying Data - Sorting

Note that many things we did on Python, can be done in SQL as well:

Example:

SELECT person.name, data.total
FROM person, data
WHERE person.personID=data.personID
ORDER BY name DESC
LIMIT 10 OFFSER 31;

SQLite: Querying Data - Grouping/Aggregation

You can also do grouping (using the keyword GROUP BY) and aggregation, e.g. counting.

Example:

SELECT person.name, SUM(data.total) as TotalSum
FROM person, data
WHERE person.personID=data.personID
GROUP BY data.year;

SQL/RDB Disclaimer

We skipped a lot of stuff important for Relational Databases & SQL:

\( \Rightarrow \) Recommended courses: Database Systems (BSc) or Database Systems (IS Master)

Python and SQLite

Example: let's work it through in an example in our Notebook! notebooks/SQLite+Python.ipynb

Summary: Python and SQLite (or another DBMS)