Data storage & Persistence

Axel Polleres


9 November 2021

Unit 5: Data storage & Persistence

Slides: This unit is also available in a as a single HTML Page for printing.

Readings:

Persistence

What is Persistence?

"In computer science, persistence refers to the characteristic of state of a system that outlives (persists more than) the process that created it. This is achieved in practice by storing the state as data in computer data storage. Programs have to transfer data to and from storage devices and have to provide mappings from the native programming-language data structures to the storage device data structures."

(Wikipedia's definition)

Why do we need Persistence?

Storing/persisting data to disc

Storing/persisting program internal data structures to disc

We will briefly cover the following methods:

All code snippets on the next slides are also available as notebook

Do you remember? Example: Creating a dictionary from a CSV

import csv

cityCodeFile="./lecture-materials/unit5/data/cities.csv"
#Building the cityCode to Label map
cityCodeMap={}
with open(cityCodeFile) as f:
    csvfile = csv.reader(f)
    for i,row in enumerate(csvfile):
        cityCodeMap[row[3]]= row[1]
        
cityCodeMap

How can we persist the dictionary?

Storing/persisting data as CSV

Let's store the dictionary to a CSV file.

import csv
with open('cityNames.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for cityCode, cityName in cityCodeMap.items():
      writer.writerow( [ cityCode, cityName] )

The method writerow() expects a list of values. Each value in the list will be converted to its string representation and written to file.

Loading the data back into a dictionary requires to parse the file as CSV and build the dictionary again (see our code before).

Storing/persisting data as JSON

Another - more direct - way to persist our internal data structure is to store it to a JSON file.

import json
with open('data.json', 'w') as fp:
    json.dump(cityCodeMap, fp)

NOTE: Storing/persisting data as JSON

Storing and loading objects to and from JSON is normally fast and the preferable way.

Question.

Why?

HOWEVER :

Storing/persisting data as PICKLE

Alternative: The pickle module implements binary protocols for serializing and de-serializing a Python object structure. That is, any Python data structure can be "pickled"

import pickle

with open('data.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(cityCodeMap, f, pickle.HIGHEST_PROTOCOL)

Different protocols are supported.

with open('data.pickle', 'rb') as f:
    data=pickle.load( f)

To JSON or to Pickle?

There are fundamental differences between the pickle protocols and JSON (JavaScript Object Notation):

Question.

When should you use JSON and when Pickle as serialisation format?

see also official documentation

Example: let's work it through storing objects to files in a Notebook! 01_storing-loading-pickling-unpickling.ipynb

(Relational) Databases Systems

(Relational) Databases Systems

Question.

What is a (Relational) Database?





Why does a data scientist need databases?

Question.

What features does a DBMS support that you'd need to take care of in your code otherwise?

(Relational) Databases Systems: main features 1/2

RDBMSs shield some functionality from the user, which you'd need to take care of yourself when storing all data in files:

(Relational) Databases Systems: main features 2/2

In particular: ACID

Plus DBMS offer efficient and declarative access to the data via a universal, *structured query language* (SQL):

(Relational) Databases Systems: SQLite

SQLite: Overview

In today's lecture we use a popular Open Source database engine: SQLite

... Due to the SQL standard, working with other RDBMS (e.g. PostgreSQL) is pretty similar!

SQLite: Resources

SQLite: Creating a table

CREATE TABLE table (
column_name1 data_type(size) constraint,
column_name2 data_type(size) constraint,
column_name3 data_type(size) constraint,
....
);

Example SQLite:

CREATE TABLE 'person' (
    'personID' INTEGER PRIMARY KEY AUTOINCREMENT,
    'name TEXT NOT NULL,
    'PLZ' NUMERIC,
    'city' TEXT,
    'country' TEXT
);

SQLite: Inserting records in a table

Note: SQLite uses simplified data types. Other RDBMS provide more precise specification.

INSERT INTO table (column1,column2,...)
VALUES (value1,value2,...);

Example SQLite:

INSERT INTO person (personID, name, PLZ, city, country)
VALUES (1, "Peter", "1220", "Vienna", "Austria"), 
       (2, "Jenny", "1220", "Vienna", "Austria");

SQLite: Updating records in a table

Note: SQLite uses simplified data types. Other RDBMS provide more precise specification.

UPDATE table
SET column_1 = new_value_1, column_2 = new_value_2 ...
WHERE search_condition;

Example SQLite:

UPDATE person
SET name="Claire", PLZ="1020"
WHERE personID=2;

SQLite: Deleting records from a table

DELETE FROM table
WHERE search_condition;

Example SQLite:

DELETE FROM person
WHERE name LIKE "C%";

SQLite: Querying Data

Examples SQLite:

SELECT column1, column4
FROM table
WHERE search_condition;

SELECT name, city
FROM person
WHERE PLZ < 1000;

SQLite: Querying Data - Merging Data

Connecting multiple tables using a relationship between two of their attributes, typically the primary key of one table and a foreign key of another.

Examples:

SELECT person.name, data.total
FROM person, data
WHERE person.personID=data.personID
	 AND data.year < 2000;

SELECT person.name, data.total
FROM person JOIN data ON person.personID=data.personID
WHERE data.year > 2000; 

SQLite: Querying Data - Sorting

Note that many things we did on Python, can be done in SQL as well:

Example:

SELECT person.name, data.total
FROM person, data
WHERE person.personID=data.personID
ORDER BY name DESC
LIMIT 10 OFFSER 31;

SQLite: Querying Data - Grouping/Aggregation

You can also do grouping (using the keyword GROUP BY) and aggregation, e.g. counting.

Example:

SELECT person.name, SUM(data.total) as TotalSum
FROM person, data
WHERE person.personID=data.personID
GROUP BY data.year;

SQL/RDB Disclaimer

We skipped a lot of stuff important for Relational Databases & SQL:

\( \Rightarrow \) Recommended courses: Database Systems (BSc) or Database Systems (IS Master)

Python and SQLite

Example: let's work it through in an example in our Notebook! 02_Read+Write_in_a_Database_SQLite+Python.ipynb

Summary: Python and SQLite (or another DBMS)