Advanced topics

Dr. Stefan Sobernig
Dr. Anniko Hannak


Nov 4, 2018


Unit 6: Advanced topics

Slides: This unit is also available in a PDF format and as a single HTML Page

Readings:

Analysis of algorithms (1)

Analysis of algorithms (2)

Analysis of algorithms (3)

Analysis of algorithms (4): Types of growth

Commonly found types of time growth for some input n:

Analysis of algorithms (5): Big O(rder) notation

Analysis of algorithms (6): Big O(rder) notation





Analysis of algorithms (7): Urban Audit example

Question.

How could we sort it by a different column? e.g., how could we sort countries by population?

Let's look at the excerpts from the following notebook

haystack = [('BE', 10839905),
 ('BG', 7563710),
 ('CZ', 10532770),
 ('DE', 81802257),
 ('EE', 1365275),
 ('ES', 47021031),
 ('FR', 64611814),
 ('IT', 60340328),
 ('CY', 819100),
 ('HU', 10014324),
 ('NL', 16574989),
 ('PL', 38529866),
 ('PT', 10573479),
 ('RO', 22480599),
 ('SK', 5435273),
 ('FI', 5351427),
 ('SE', 9415570),
 ('NO', 4858199),
 ('CH', 7877571)]
haystack.sort() # by country code
haystack.sort(key=lambda x:x[1]) # by population count

Analysis of algorithms (8): Urban Audit (cont'd)

Note: if you know that a file is sorted, then searching in that file becomes easier/cheaper!

Question.

Bottomline: (pre-)sorting can be costly, but might speed up other operations... another example: grouping!

Analysis of algorithms (9): Urban Audit example

# Search for first entry bigger than number in a sorted
# list of lists of length 2:
def binary_search(number, array, lo, hi):

    if hi < lo: return array[lo]       # no more numbers
    mid = (lo + hi) // 2               # midpoint in array
    if number == array[mid][0]:
        return array[mid]              # number found here
    elif number < array[mid][0]:
        # try left of here
        return binary_search(number, array, lo, mid - 1)
    else:
        # try above here
        return binary_search(number, array, mid + 1, hi)

# Sample call: Find me a country with a pop. > 5m people?
binary_search(5000000, haystack, 0, len(haystack))

Analysis of algorithms (10): Outlook

Visualization (1)

Visualization (2a): Scatterplot





Visualization (2b): Scatterplot





Visualization (3a): Lineplot





Visualization (3b): Lineplot





Visualization (4a): Barplot





Visualization (4b): Barplot





Visualization (5): Boxplot





Low-level libraries

Numpy

import numpy as np

Numpy the fundamental package for scientific computing with Python. It contains among other things:

Check out this tutorial or this one (includes also scipy and matplotlib)

NumPy does not provide high-level data analysis functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools like Pandas much more effectively.

SciPy

SciPy is open-source software for mathematics, science, and engineering

The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines , such as routines for numerical integration and optimization.

SciPy subpackages (1/2)

SciPy subpackages (2/2)

from scipy import linalg, optimize

SciPy

Again, check out the official tutorials

Some examples:

High-level libraries

Pandas

import pandas as pd

contains high-level data structures and tools designed to make data analysis fast and easy. Pandas are built on top of NumPy, and makes it easy to use in NumPy-centric applications.

pandas is well suited for many different kinds of data:

Pandas features (1/2)

Here are just a few of the things that pandas does well:

Pandas features (2/2)

Pandas: Some more words

It takes a while to get used to pandas. The documentation is exhaustive and there exists hundreds of tutorials and use cases

Some hands on

Checkout the notebook pandas.ipynb

Plotting

Plotting

There exists many libraries for plotting:

Machine learning?

Machine learning

Data Mining & NLP

Data Mining & NLP

References