Advanced topics

Prof. Dr. Stefan Sobernig


03 November 2020

Unit 5: Advanced topics

Slides: This unit is also available in a PDF format and as a single HTML Page

Readings:

Analysis of algorithms (1)

Analysis of algorithms (2)

Analysis of algorithms (3)

Question.

What would you consider "rapidly", "slowly"?

Analysis of algorithms (4): Types of growth

Commonly found types of time growth for some input n:

increases the required time by a constant amount, i.e. logarithmic (example: binary search).

Analysis of algorithms (5): Big O(rder) notation

Analysis of algorithms (6): Big O(rder) notation

Analysis of algorithms (7): Urban Audit example

Question.

How could we sort it by a different column? e.g., how could we sort countries by population?

Let's look at the excerpts from the following notebook

haystack = [('BE', 10839905),
 ('BG', 7563710),
 ('CZ', 10532770),
 ('DE', 81802257),
 ('EE', 1365275),
 ('ES', 47021031),
 ('FR', 64611814),
 ('IT', 60340328),
 ('CY', 819100),
 ('HU', 10014324),
 ('NL', 16574989),
 ('PL', 38529866),
 ('PT', 10573479),
 ('RO', 22480599),
 ('SK', 5435273),
 ('FI', 5351427),
 ('SE', 9415570),
 ('NO', 4858199),
 ('CH', 7877571)]
haystack.sort() # by country code
haystack.sort(key=lambda x:x[1]) # by population count

Analysis of algorithms (8): Urban Audit (cont'd)

Note: if you know that a file is sorted, then searching in that file becomes easier/cheaper!

Question.

Bottomline: (pre-)sorting can be costly, but might speed up other operations... another example: grouping!

Analysis of algorithms (9): Urban Audit example

# Search for first entry bigger than number in a sorted
# list of lists of length 2:
def binary_search(number, array, lo, hi):

    if hi < lo: return array[lo]       # no more numbers
    mid = (lo + hi) // 2               # midpoint in array
    if number == array[mid][0]:
        return array[mid]              # number found here
    elif number < array[mid][0]:
        # try left of here
        return binary_search(number, array, lo, mid - 1)
    else:
        # try above here
        return binary_search(number, array, mid + 1, hi)

# Sample call: Find me a country with a pop. > 5m people?
binary_search(5000000, haystack, 0, len(haystack))

Analysis of algorithms (10): Outlook

Visualization (1)

Visualization (2)

Visualization (3)

Visualization (4a): Scatterplot





Visualization (4b): Scatterplot





Visualization (5a): Lineplot





Visualization (5b): Lineplot





Visualization (6a): Barplot





Visualization (6b): Barplot





Visualization (7): Boxplot





Visualization (8): Task-based effectiveness





Visualization (9)

High-level libraries

Pandas

import pandas as pd

contains high-level data structures and tools designed to make data analysis fast and easy. Pandas are built on top of NumPy, and makes it easy to use in NumPy-centric applications.

Pandas is well suited for many different kinds of data:

Pandas features (1/2)

Here are just a few of the things that pandas does well:

Pandas features (2/2)

Pandas: Some more words

It takes a while to get used to pandas. The documentation is exhaustive and there exists hundreds of tutorials and use cases

Some hands on

Checkout the notebook pandas.ipynb

Low-level libraries

Numpy

import numpy as np

Numpy the fundamental package for scientific computing with Python. It contains among other things:

Check out this tutorial or this one (includes also scipy and matplotlib)

NumPy does not provide high-level data analysis functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools like Pandas much more effectively.

SciPy

SciPy is open-source software for mathematics, science, and engineering

The SciPy library depends on NumPy, which provides convenient and fast N-dimensional array manipulation. The SciPy library is built to work with NumPy arrays, and provides many user-friendly and efficient numerical routines , such as routines for numerical integration and optimization.

SciPy subpackages (1/2)

SciPy subpackages (2/2)

from scipy import linalg, optimize

SciPy

Again, check out the official tutorials

Some examples:

Plotting

Plotting

There exists many libraries for plotting:

Machine learning?

Machine learning

Data Mining & NLP

Data Mining & NLP

References