Connecting to a database system and loading data into and from it (vs. storing/loading from a file)
Readings:
We will briefly cover the following methods:
import csv

cityCodeFile = "./data/cities.csv"

# Build the cityCode-to-label map
cityCodeMap = {}
with open(cityCodeFile, newline='') as f:
    csvfile = csv.reader(f)
    for row in csvfile:
        cityCodeMap[row[3]] = row[1]
Let's store the dictionary to a CSV file.
import csv
with open('cityNames.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',')
    for cityCode, cityName in cityCodeMap.items():
        writer.writerow([cityCode, cityName])
The method writerow() expects a list of values. Each value in the list is converted to its string representation and written to the file.
Loading the data back into a dictionary requires parsing the file as CSV and building the dictionary again (see our code above).
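For illustration, a minimal sketch of loading it back (assuming the cityNames.csv file written above):

import csv

# Rebuild the dictionary: each row is [cityCode, cityName]
cityCodeMap = {}
with open('cityNames.csv', newline='') as f:
    for row in csv.reader(f):
        cityCodeMap[row[0]] = row[1]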
Another way to persist our data structure is to store it to a JSON file.
import json
with open('data.json', 'w') as fp:
    json.dump(cityCodeMap, fp)
Storing and loading objects to and from JSON is usually fast and often the preferable way.
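Loading it back is a single call (a minimal sketch, assuming the data.json file written above):

import json

with open('data.json') as fp:
    cityCodeMap = json.load(fp)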
HOWEVER: JSON can only represent a limited set of types (strings, numbers, booleans, None, lists, and dictionaries with string keys); arbitrary Python objects cannot be serialized to JSON directly.
Alternative: The pickle module implements binary protocols for serializing and de-serializing a Python object structure. That is, any Python data structure can be "pickled"
import pickle
with open('data.pickle', 'wb') as f:
    # Pickle the cityCodeMap dictionary using the highest protocol available.
    pickle.dump(cityCodeMap, f, pickle.HIGHEST_PROTOCOL)
Different protocols are supported.
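For illustration, you can inspect which protocols your interpreter offers:

import pickle

print(pickle.DEFAULT_PROTOCOL)   # protocol used when none is specified
print(pickle.HIGHEST_PROTOCOL)   # newest protocol this Python version supports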
with open('data.pickle', 'rb') as f:
    data = pickle.load(f)
There are fundamental differences between the pickle protocols and JSON (JavaScript Object Notation): JSON is a human-readable text format that is interoperable across languages, whereas pickle is a binary, Python-specific format that can serialize far more Python types. Unlike JSON, however, unpickling data from an untrusted source can execute arbitrary code.
When should you use JSON and when pickle as a serialization format?
See also the official documentation.
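A small sketch illustrating one practical difference: JSON converts dictionary keys to strings, while pickle preserves Python types exactly:

import json
import pickle

d = {1: 'one', 2: 'two'}
print(json.loads(json.dumps(d)))       # {'1': 'one', '2': 'two'} - integer keys became strings
print(pickle.loads(pickle.dumps(d)))   # {1: 'one', 2: 'two'}     - keys stay integers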
What features does a DBMS support that you'd need to take care of in your code otherwise?
RDBMSs take care of functionality that you would otherwise need to implement yourself when storing all data in files, such as concurrent access by multiple users, transactions and recovery, integrity constraints, and efficient querying via indexes.
Today we use a popular Open Source database engine: SQLite
CREATE TABLE table_name (
column_name1 data_type(size) constraint,
column_name2 data_type(size) constraint,
column_name3 data_type(size) constraint,
....
);
Example SQLite:
CREATE TABLE `person` (
`personID` INTEGER PRIMARY KEY AUTOINCREMENT,
`name` TEXT NOT NULL,
`PLZ` NUMERIC,
`city` TEXT,
`country` TEXT
);
Note: SQLite uses simplified data types. Other RDBMS provide more precise specification.
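From Python, such statements can be executed with the built-in sqlite3 module. A minimal sketch (the file name example.db is an assumption):

import sqlite3

conn = sqlite3.connect('example.db')   # creates the file if it does not exist
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS person (
    personID INTEGER PRIMARY KEY AUTOINCREMENT,
    name     TEXT NOT NULL,
    PLZ      NUMERIC,
    city     TEXT,
    country  TEXT
)''')
conn.commit()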
INSERT INTO table_name (column1, column2, ...)
VALUES (value1,value2,...);
Example SQLite:
INSERT INTO person (personID, name, PLZ, city, country)
VALUES (1, 'Peter', '1220', 'Vienna', 'Austria'),
       (2, 'Jenny', '1220', 'Vienna', 'Austria');
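From Python, use ?-placeholders instead of string formatting to avoid SQL injection (a sketch, continuing the connection from above):

rows = [(1, 'Peter', '1220', 'Vienna', 'Austria'),
        (2, 'Jenny', '1220', 'Vienna', 'Austria')]
cur.executemany('INSERT INTO person (personID, name, PLZ, city, country) VALUES (?, ?, ?, ?, ?)', rows)
conn.commit()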
UPDATE table_name
SET column_1 = new_value_1, column_2 = new_value_2 ...
WHERE search_condition;
Example SQLite:
UPDATE person
SET name = 'Claire', PLZ = '1020'
WHERE personID=2;
DELETE FROM table_name
WHERE search_condition;
Example SQLite:
DELETE FROM person
WHERE name LIKE 'C%';
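UPDATE and DELETE follow the same parameterized pattern from Python (a sketch, continuing the connection from above):

cur.execute('UPDATE person SET name = ?, PLZ = ? WHERE personID = ?', ('Claire', '1020', 2))
cur.execute('DELETE FROM person WHERE name LIKE ?', ('C%',))   # note the single-element tuple
conn.commit()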
SELECT column1, column4
FROM table_name
WHERE search_condition;
SELECT name, city
FROM person
WHERE PLZ < 1000;
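Query results can be fetched from Python as a list of tuples (a sketch, continuing the connection from above):

cur.execute('SELECT name, city FROM person WHERE PLZ < ?', (1000,))
for name, city in cur.fetchall():
    print(name, city)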
Connecting multiple tables using a relationship between two of their attributes, typically the primary key of one table and a foreign key of another.
Examples:
SELECT person.name, data.total
FROM person, data
WHERE person.personID=data.personID
AND data.year < 2000;
SELECT person.name, data.total
FROM person JOIN data ON person.personID=data.personID
WHERE data.year > 2000;
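The same join can be run from Python with the year as a parameter (a sketch; the data table is assumed to exist as in the SQL examples above):

cur.execute('''SELECT person.name, data.total
               FROM person JOIN data ON person.personID = data.personID
               WHERE data.year > ?''', (2000,))
print(cur.fetchall())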
Note that many things we did in Python can be done in SQL as well:
SELECT person.name, data.total
FROM person, data
WHERE person.personID=data.personID
ORDER BY name DESC
LIMIT 10 OFFSET 31;
You can also do grouping (using the keyword GROUP BY) and aggregation, e.g. counting.
Example:
SELECT person.name, SUM(data.total) AS TotalSum
FROM person, data
WHERE person.personID = data.personID
GROUP BY person.name;
We skipped a lot of material that is important for relational databases & SQL.
Example: let's work through it in our notebook! notebooks/SQLite+Python.ipynb