Slides: This unit is also available as a PDF and as a single HTML page
Readings:
and memory consumption.
Commonly found types of time growth for some input n:
How could we sort it by a different column? e.g., how could we sort countries by population?
Let's look at the excerpts from the following notebook
haystack = [('BE', 10839905),
            ('BG', 7563710),
            ('CZ', 10532770),
            ('DE', 81802257),
            ('EE', 1365275),
            ('ES', 47021031),
            ('FR', 64611814),
            ('IT', 60340328),
            ('CY', 819100),
            ('HU', 10014324),
            ('NL', 16574989),
            ('PL', 38529866),
            ('PT', 10573479),
            ('RO', 22480599),
            ('SK', 5435273),
            ('FI', 5351427),
            ('SE', 9415570),
            ('NO', 4858199),
            ('CH', 7877571)]
haystack.sort() # by country code
haystack.sort(key=lambda x:x[1]) # by population count
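As a small sketch of the same idea, the built-in sorted() together with operator.itemgetter sorts by any column without modifying the original list. The shortened list and the variable names below are illustrative only.

```python
from operator import itemgetter

# A shortened, illustrative version of the country list above.
countries = [("BE", 10839905), ("CY", 819100), ("DE", 81802257), ("EE", 1365275)]

# sorted() returns a new list and leaves the input untouched;
# itemgetter(1) picks the second column (population) as the sort key.
by_population = sorted(countries, key=itemgetter(1))

# reverse=True puts the largest population first.
largest_first = sorted(countries, key=itemgetter(1), reverse=True)

print(by_population[0])   # ('CY', 819100)
print(largest_first[0])   # ('DE', 81802257)
```

list.sort() sorts in place and returns None, whereas sorted() is the better choice when the original order is still needed afterwards.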
Note: if you know that a file is sorted, then searching in that file becomes easier/cheaper!
Bottom line: (pre-)sorting can be costly, but it can speed up other operations. Another example: grouping!
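To illustrate why grouping benefits from sorted input, here is a minimal sketch using itertools.groupby, which only merges adjacent items with equal keys. The size_class rule and the 10m threshold are hypothetical, chosen just for this example.

```python
from itertools import groupby

countries = [("BE", 10839905), ("CY", 819100), ("DE", 81802257),
             ("EE", 1365275), ("FR", 64611814), ("SK", 5435273)]

def size_class(entry):
    """Hypothetical grouping rule: 'small' below 10m people, else 'large'."""
    return "small" if entry[1] < 10_000_000 else "large"

# groupby only merges *adjacent* items with equal keys,
# so the data is sorted by the grouping key first.
presorted = sorted(countries, key=size_class)

groups = {label: [code for code, _ in members]
          for label, members in groupby(presorted, key=size_class)}

print(groups)  # {'large': ['BE', 'DE', 'FR'], 'small': ['CY', 'EE', 'SK']}
```

Once the data is sorted, the grouping itself needs only a single pass over the list.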
# Search a list of (code, population) tuples, sorted by population,
# for the first entry whose population is >= number:
def binary_search(number, array, lo, hi):
    if hi < lo:  # range exhausted: lo is the first larger entry
        return array[lo] if lo < len(array) else None
    mid = (lo + hi) // 2  # midpoint of the current range
    if number == array[mid][1]:
        return array[mid]  # number found here
    elif number < array[mid][1]:
        # try left of here
        return binary_search(number, array, lo, mid - 1)
    else:
        # try right of here
        return binary_search(number, array, mid + 1, hi)

# Sample call: find a country with a population of at least 5m people
# (haystack must be sorted by population, as done above)
binary_search(5000000, haystack, 0, len(haystack) - 1)
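The same kind of lookup can also be sketched with the standard library's bisect module instead of a hand-written recursion. The helper name first_at_least and the shortened, pre-sorted list are assumptions made for this example.

```python
from bisect import bisect_left

# Illustrative subset, already sorted by population.
countries = [("CY", 819100), ("EE", 1365275), ("NO", 4858199),
             ("FI", 5351427), ("SK", 5435273), ("CH", 7877571)]

# bisect works on a plain sorted sequence, so extract the population column.
populations = [pop for _, pop in countries]

def first_at_least(threshold):
    """Return the first entry with population >= threshold, or None."""
    i = bisect_left(populations, threshold)
    return countries[i] if i < len(countries) else None

print(first_at_least(5_000_000))    # ('FI', 5351427)
print(first_at_least(100_000_000))  # None
```

Like the recursive version, this needs only about log2(n) comparisons per lookup, but it relies on the list being sorted beforehand.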
sort applies Timsort: \( O(n \log n) \) (worst case).
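One rough way to get a feeling for this bound is to count the comparisons the sort actually performs. The sketch below does so with functools.cmp_to_key; the helper count_comparisons is made up for this illustration, and the exact counts may differ between Python versions.

```python
import math
import random
from functools import cmp_to_key

def count_comparisons(values):
    """Sort a copy of values and return how many comparisons were made."""
    counter = 0

    def compare(a, b):
        nonlocal counter
        counter += 1
        return (a > b) - (a < b)

    sorted(values, key=cmp_to_key(compare))
    return counter

random.seed(0)
n = 1024
shuffled = random.sample(range(n), n)
already_sorted = list(range(n))

print(count_comparisons(shuffled), "comparisons for shuffled input")
print(count_comparisons(already_sorted), "comparisons for sorted input")
print(round(n * math.log2(n)), "is n * log2(n)")
```

Shuffled input should need a number of comparisons in the order of n log n, while already sorted input is handled in roughly linear time, which is one reason Timsort works well on real-world data.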
import numpy as np
NumPy is the fundamental package for scientific computing with Python. It contains, among other things:
Check out this tutorial or this one (includes also scipy and matplotlib)
Although NumPy itself does not provide high-level data analysis functionality, an understanding of NumPy arrays and array-oriented computing will help you use tools like pandas much more effectively.
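A minimal sketch of what array-oriented computing looks like in practice: operations apply to whole arrays at once instead of looping over elements. The population values are taken from the country list earlier in this unit; the variable names are illustrative.

```python
import numpy as np

populations = np.array([10839905, 7563710, 10532770, 81802257, 1365275])

# Arithmetic is applied element-wise to the whole array.
in_millions = populations / 1e6

# A boolean mask selects all entries that satisfy a condition.
large = populations[populations > 10_000_000]

# argsort returns the indices that would sort the array.
order = np.argsort(populations)

print(in_millions.round(1))
print(large)
print(order)
```

The same mask-and-select pattern reappears in pandas, which is why it pays off to get used to it here.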
SciPy is open-source software for mathematics, science, and engineering
The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization.
from scipy import linalg, optimize
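As a small sketch of the two kinds of routines mentioned above, the following integrates sin(x) over [0, pi] with scipy.integrate.quad and minimizes a simple quadratic with scipy.optimize.minimize_scalar. The functions being integrated and minimized are arbitrary examples.

```python
import math
from scipy import integrate, optimize

# Numerical integration: the exact value of this integral is 2.
area, abs_error = integrate.quad(math.sin, 0, math.pi)

# Optimization: (x - 3)^2 + 1 has its minimum value 1 at x = 3.
result = optimize.minimize_scalar(lambda x: (x - 3) ** 2 + 1)

print(area)        # close to 2.0
print(result.x)    # close to 3.0
print(result.fun)  # close to 1.0
```

Both routines return approximations, so results should be compared with a tolerance rather than with exact equality.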
Again, check out the official tutorials
Some examples:
import pandas as pd
pandas contains high-level data structures and tools designed to make data analysis fast and easy. It is built on top of NumPy, which makes it easy to use in NumPy-centric applications.
pandas is well suited for many different kinds of data:
Here are just a few of the things that pandas does well:
It takes a while to get used to pandas. The documentation is exhaustive, and there exist hundreds of tutorials and use cases.
Check out the notebook pandas.ipynb
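To give a first impression independent of that notebook, here is a minimal sketch that revisits the country data from the sorting example as a DataFrame. The column names code and population are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [("BE", 10839905), ("CY", 819100), ("DE", 81802257), ("EE", 1365275)],
    columns=["code", "population"],
)

# Sort by a named column instead of a tuple index.
by_population = df.sort_values("population", ascending=False)

# Filter rows with a boolean mask, just like with NumPy arrays.
large = df[df["population"] > 5_000_000]

# Aggregate a whole column in one call.
total = df["population"].sum()

print(by_population)
print(large["code"].tolist())  # ['BE', 'DE']
print(total)                   # 94826537
```

Compared with the list-of-tuples version, columns are addressed by name, which makes the intent of each step easier to read.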
There exist many libraries for plotting: