Data Formats, Encoding & Access

Axel Polleres
Stefan Sobernig


October 12, 2021

Data formats & encoding

Data

Question.

What is data?

Possible views on data

Data formats

Question.

What is a data format?

What data formats do you know?

What differences between data formats did you encounter?

Data (exchange) formats

Data exchange formats are representations of data used to encode and store data in a computer, and to transfer/exchange data between computers. Among them, we further distinguish:

character-encoded data ("text")

directly binary-encoded data (0s and 1s)

Character-encoded, unstructured text data

Unstructured, textual data:

Some useful Python libraries:

Character-encoded, (semi-)structured data

Question.

What is structured data? What is semi-structured data?

Character-encoded, (semi-)structured data

CSV

This is what the RFC 4180 says:

Unfortunately, these rules are not always followed "in the wild":

Johann Mitlöhner, Sebastian Neumaier, Jürgen Umbrich, and Axel Polleres. Characteristics of open data CSV files. In 2nd International Conference on Open and Big Data, August 2016.

CSV

You find a CSV version of this data here: http://www.zamg.ac.at/ogd/

"Station";"Name";"Höhe m";"Datum";"Zeit";"T °C";"TP °C";"RF %";"WR °";"WG km/h";"WSR °";"WSG km/h";"N l/m²";"LDred hPa";"LDstat hPa";"SO %"

11010;"Linz/Hörsching";298;"13-10-2016";"01:00";5,8;5,3;97;230;3,6;;5,4;0;1019,4;981,3;0

11012;"Kremsmünster";383;"13-10-2016";"01:00";5,2;4;94;226;10,8;220;13,3;0;1019,6;972,3;0

11022;"Retz";320;"13-10-2016";"01:00";7;5,3;89;323;14,8;323;28,1;0;1017,7;979;0

11035;"Wien/Hohe Warte";203;"13-10-2016";"01:00";8,1;5,4;83;294;15,1;299;33,1;0;1017,4;992,2;0

11036;"Wien/Schwechat";183;"13-10-2016";"01:00";8,2;5,2;81;300;25,9;;38,9;0;1017,3;995,1;0

CSV

Question: What's NOT conformant to RFC 4180 here?

Potential issues:

Another example: https://info.gesundheitsministerium.at/opendata/ --> Try to download timeline-bbg.csv and open it with Microsoft Excel...
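
A minimal sketch of how the ZAMG sample above could be read with Python's standard csv module. It assumes the semicolon is the field delimiter and the comma is the decimal mark; the helper name to_float is hypothetical, and only two of the data rows are embedded to keep the example short.

```python
import csv
import io

# Header and two rows copied from the ZAMG sample (semicolon-delimited, decimal commas).
sample = (
    '"Station";"Name";"Höhe m";"Datum";"Zeit";"T °C";"TP °C";"RF %";"WR °";'
    '"WG km/h";"WSR °";"WSG km/h";"N l/m²";"LDred hPa";"LDstat hPa";"SO %"\n'
    '11010;"Linz/Hörsching";298;"13-10-2016";"01:00";5,8;5,3;97;230;3,6;;5,4;0;1019,4;981,3;0\n'
    '11036;"Wien/Schwechat";183;"13-10-2016";"01:00";8,2;5,2;81;300;25,9;;38,9;0;1017,3;995,1;0\n'
)


def to_float(text):
    """Convert a decimal-comma string such as '5,8' to a float; empty fields become None."""
    return float(text.replace(",", ".")) if text else None


# The delimiter must be given explicitly: the csv module assumes a comma by default.
reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = list(reader)

temperatures = {row["Name"]: to_float(row["T °C"]) for row in rows}
missing_wsr = [row["Name"] for row in rows if row["WSR °"] == ""]

print(temperatures)
print(missing_wsr)
```

Reading the same text with the default comma delimiter would split values such as 5,8 into two fields, which is one reason why non-conformant CSV files need to be inspected before parsing.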

XML

Various "companion standards", e.g. schema languages:

XML

Example from the entry tutorial:

<pokemon id="1">
    <name>Jürgen</name>
    <type>Caterpie</type>
    <location>
        <longitude>16.4101</longitude>
        <latitude>48.2126</latitude>
    </location>
    <carries>
        <fruit>apple</fruit>
        <fruit>banana</fruit>
    </carries>
</pokemon>
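
As a rough illustration, the example can be parsed with Python's standard xml.etree.ElementTree module. The sketch below embeds the document with the id attribute value in quotes, since XML requires quoted attribute values; the dictionary layout is just one possible choice.

```python
import xml.etree.ElementTree as ET

xml_text = """<pokemon id="1">
    <name>Jürgen</name>
    <type>Caterpie</type>
    <location>
        <longitude>16.4101</longitude>
        <latitude>48.2126</latitude>
    </location>
    <carries>
        <fruit>apple</fruit>
        <fruit>banana</fruit>
    </carries>
</pokemon>"""

root = ET.fromstring(xml_text)

# Attributes are read with get(); child elements are addressed by simple paths.
pokemon = {
    "id": int(root.get("id")),
    "name": root.findtext("name"),
    "type": root.findtext("type"),
    "longitude": float(root.findtext("location/longitude")),
    "latitude": float(root.findtext("location/latitude")),
    "carries": [fruit.text for fruit in root.findall("carries/fruit")],
}
print(pokemon)

# An unquoted attribute value is not well-formed XML, so the parser rejects it.
try:
    ET.fromstring("<pokemon id=1></pokemon>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False
print(unquoted_accepted)
```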





XML

Potential issues: e.g.

JSON

JSON

Example:

{  "id": 10,
    "firstname": "Alice",
    "lastname": "Doe",
    "active": true,
    "shipping_addresses":
    [ { "street": "Wonderland 1", "zip": 4711, "city": "Vienna", "country": "Austria", "home": true },
        { "street": "Welthandelsplatz 1", "zip": 1020, "city": "Vienna", "country": "Austria" },
        { "street": "MickeyMouseStreet10", "zip": 12345, "city": "Entenhausen", "country": "Germany" } ]
}
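
A short sketch of how this document maps onto Python data types using the standard json module: objects become dictionaries, arrays become lists, and true becomes True. The variable names are chosen for illustration only.

```python
import json

json_text = """
{  "id": 10,
    "firstname": "Alice",
    "lastname": "Doe",
    "active": true,
    "shipping_addresses":
    [ { "street": "Wonderland 1", "zip": 4711, "city": "Vienna", "country": "Austria", "home": true },
        { "street": "Welthandelsplatz 1", "zip": 1020, "city": "Vienna", "country": "Austria" },
        { "street": "MickeyMouseStreet10", "zip": 12345, "city": "Entenhausen", "country": "Germany" } ]
}
"""

customer = json.loads(json_text)

# The "home" key is optional, so get() with a default avoids a KeyError.
home_streets = [
    address["street"]
    for address in customer["shipping_addresses"]
    if address.get("home", False)
]
cities = sorted({address["city"] for address in customer["shipping_addresses"]})

print(customer["firstname"], customer["lastname"], customer["active"])
print(home_streets)
print(cities)
```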

JSON

Example (vs. XML):

<customer id="10" active="true">
    <firstname>Alice</firstname>
    <lastname>Doe</lastname>

    <shipping_addresses>
      <address home="true"><street>Wonderland 1</street><zip>4711</zip><city>Vienna</city><country>Austria</country></address>
      <address><street>Welthandelsplatz 1</street><zip>1020</zip><city>Vienna</city><country>Austria</country></address>
      <address><street>MickeyMouseStreet10</street><zip>12345</zip><city>Entenhausen</city><country>Germany</country></address>
    </shipping_addresses>
</customer>

JSON

Jupyter notebooks are represented as JSON documents:

{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "## Assignment 1\n",
    "\n",
    "This assignment is due on mm-dd-YYY-hh:mm by uploading the completed notebook at Learn@WU.\n",
    "\n",
    "### Task 1\n",
    "\n"
...
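
Because a notebook is plain JSON, its cells can be inspected without Jupyter. The sketch below uses a tiny hypothetical notebook (the code cell and the shortened markdown cell are made up for illustration) and collects the markdown headings.

```python
import json

# A minimal, hypothetical notebook document; real notebooks carry more keys
# (e.g. metadata), which are omitted here.
notebook_text = json.dumps({
    "cells": [
        {"cell_type": "markdown",
         "source": ["## Assignment 1\n", "\n", "### Task 1\n"]},
        {"cell_type": "code",
         "source": ["print('hello')\n"]},
    ]
})

notebook = json.loads(notebook_text)

# Each cell stores its text as a list of lines; join them per markdown cell.
markdown_text = "".join(
    "".join(cell["source"])
    for cell in notebook["cells"]
    if cell["cell_type"] == "markdown"
)
headings = [
    line.lstrip("# ").strip()
    for line in markdown_text.splitlines()
    if line.startswith("#")
]
print(headings)
```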

Summary: Character-encoded, (semi-)structured data

Excursus/repetition: Character Encodings

Character-encoded data is data represented as text made up of characters from a character set, where each character is in turn encoded into 0s and 1s.
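
To make the step from characters to 0s and 1s concrete, the following sketch prints the bit patterns of a few characters under different encodings. The helper name to_bits is hypothetical.

```python
def to_bits(text, encoding):
    """Return the bytes of `text` under `encoding` as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


ascii_a = to_bits("A", "ascii")       # one byte
latin1_u = to_bits("ü", "latin-1")    # one byte in ISO 8859-1
utf8_u = to_bits("ü", "utf-8")        # two bytes in UTF-8

print(ascii_a)
print(latin1_u)
print(utf8_u)
```

The same character ü is stored as one byte in Latin-1 but as two bytes in UTF-8, so the reader of a file has to know which encoding the writer used.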

Question.

Why is that occurring? "J�rgen likes to eat K�rntner K�snudln"
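
One plausible cause can be reproduced in a few lines: text written in one encoding and read in another. The sketch assumes the original sentence contained ü and ä and was stored as Latin-1 but decoded as UTF-8; the reverse mix-up is shown as well.

```python
# Assumed original sentence (the umlauts are a guess based on the garbled text).
original = "Jürgen likes to eat Kärntner Käsnudln"

# Written as Latin-1, read as UTF-8: the single umlaut bytes are not valid
# UTF-8, so each one is shown as the replacement character U+FFFD.
garbled = original.encode("latin-1").decode("utf-8", errors="replace")

# Written as UTF-8, read as Latin-1: each umlaut turns into two odd characters.
reverse_garbled = original.encode("utf-8").decode("latin-1")

print(garbled)
print(reverse_garbled)
```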

Excursus/repetition: Character Encodings

Excursus: Binary encoding of structured data

Notebook for Accessing Data: the Python Way

02_Encodings+and+reading+text+files.ipynb in Jupyter's unit2 subfolder

03_Data_Formats_and_Standards.ipynb in Jupyter's unit2 subfolder

Data Access

Ways to access and get data

Question.

Which access methods can be used to retrieve/download a dataset?

Ways to access and get data

From a file on disk:

From the Web:

Downloading data

Datasets that have a URL (Web address) can generally be downloaded directly.

The underlying protocol is called Hypertext Transfer Protocol (HTTP), nowadays typically HTTP/1.1.

Let's see how HTTP works
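
As a rough sketch of what happens on the wire, an HTTP/1.1 download starts with a plain-text request and is answered by a response that begins with a status line. The code below only builds and parses such messages as strings and does not open a network connection; the function names and the sample response are hypothetical.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Build the text of a minimal HTTP/1.1 GET request for `url`."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {parts.netloc}",  # the Host header is mandatory in HTTP/1.1
        "Connection: close",
        "",                       # an empty line ends the header section
        "",
    ]
    return "\r\n".join(lines)


def parse_status_line(response_text):
    """Split the first line of a response into (version, status code, reason)."""
    status_line = response_text.split("\r\n", 1)[0]
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


request = build_get_request("http://data.wu.ac.at/robots.txt")
# Hypothetical response, shortened for illustration.
response = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nUser-agent: *\n"
status = parse_status_line(response)

print(request)
print(status)
```

In practice a library such as urllib.request takes care of these details; the sketch only shows what such a library sends and receives.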

Downloading data

Question.

Can all URLs be easily downloaded? If no, why?

Downloading data

Things to consider when downloading files.

NOTE: If you want to respect the robots.txt file, you need to first access the file (if available),
inspect the allow/disallow rules and apply them to the URL you want to download.

Robots.txt

Robots.txt: Example

http://data.wu.ac.at/robots.txt

User-agent: *
Disallow: /portalwatch/api/
Disallow: /portalwatch/portal/
Disallow: /portal/dataset/rate/
Disallow: /portal/revision/
Disallow: /portal/dataset/*/history
Disallow: /portal/api/

User-Agent: *
Crawl-Delay: 10

In this example, no robot is allowed to access the specified sub-directories, and every robot should wait 10 seconds between two requests.
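
A minimal sketch of checking such rules with Python's standard urllib.robotparser module. The rules are a shortened adaptation of the example: the two User-agent: * groups are merged into one, because parsers may treat repeated groups differently, and the wildcard rule is left out. The user agent name my-course-bot and the tested URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Shortened adaptation of the robots.txt above (single group, no wildcard rule).
robots_lines = """\
User-agent: *
Disallow: /portalwatch/api/
Disallow: /portal/api/
Crawl-delay: 10
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)  # parse() takes the file content as a list of lines


def may_download(url, user_agent="my-course-bot"):
    """Return True if the parsed rules allow `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


blocked = may_download("http://data.wu.ac.at/portalwatch/api/v1/portals")
allowed = may_download("http://data.wu.ac.at/portal/dataset")
delay = parser.crawl_delay("my-course-bot")

print(blocked, allowed, delay)
```

For a live site, set_url() followed by read() would fetch the robots.txt file instead of parsing an embedded string.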

Accessing data via API

Some data sources can only be accessed via Application Programming Interfaces (APIs).

Question.

Any reasons a data publisher would provide data access via an API rather than providing the data as files?

Accessing data via API

The reason for providing data access via an API:

Accessing data via API: Examples

Last.fm

The Last.fm API allows anyone to build their own programs using Last.fm data, whether they're on the Web, the desktop or mobile devices. Find out more about how you can start exploring the social music playground or just browse the list of methods below.

Twitter

The REST APIs provide programmatic access to read and write Twitter data. Author a new Tweet, read author profile and follower data, and more

(not entirely) Open Weatherdata

A JSON API for easy access to current weather data, offered under a freemium model (e.g., historical data is not free).

ProgrammableWeb - an API directory for over 20K Web accessible APIs

Accessing data via a Distributed System API

Accessing data via API: WU BACH API

The WU BACH API provides machine-readable data about WU's digital ecosystem, in line with many OGD [1] initiatives.

e.g., https://bach.wu.ac.at/z/BachAPI/courses/search?query=data+science

[
   [
      "19S",
      "5585",
      "Data Processing 1",
      [
         [
            6947,
            "Sobernig S."
         ],
         [
            12154,
            "Polleres A."
         ]
      ]
   ]
   /* ... */
]
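
The request URL and the response above can be handled with the standard library alone. The sketch below builds the query string and parses the sample response offline (without the /* ... */ placeholder, which is not valid JSON). The interpretation of the fields as semester, course number, title and lecturers is an assumption based on the sample, not on API documentation.

```python
import json
from urllib.parse import urlencode

base_url = "https://bach.wu.ac.at/z/BachAPI/courses/search"
# urlencode() escapes the space in the search term as "+".
url = base_url + "?" + urlencode({"query": "data science"})

# The sample response from above, embedded so that no network access is needed.
response_text = """
[
   [
      "19S",
      "5585",
      "Data Processing 1",
      [
         [6947, "Sobernig S."],
         [12154, "Polleres A."]
      ]
   ]
]
"""

courses = []
for semester, course_number, title, lecturers in json.loads(response_text):
    courses.append({
        "semester": semester,            # assumed meaning of the first field
        "course_number": course_number,  # assumed meaning of the second field
        "title": title,
        "lecturers": [name for _id, name in lecturers],
    })

print(url)
print(courses)
```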

1: Open Government Data

Scraping Web data

Web scraping is the act of taking content from a Web site with the intent of using it for purposes outside the direct control of the site owner. [source]

Typical scenarios for Web scraping: Collecting data on

Some examples:

Web scraping also requires parsing HTML files using dedicated libraries.

WARNING: The legal basis for Web scraping is often unclear, and we do not encourage or recommend scraping a Web site before checking whether the site allows it.
Legal topics around Web scraping will be covered in course III.
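
To illustrate the parsing step only, here is a small sketch that extracts links from an HTML page using Python's standard html.parser module. The page content, the base URL and the class name LinkCollector are hypothetical; dedicated third-party libraries offer more convenient interfaces.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page content; a real scraper would download it first.
page = (
    '<html><body>'
    '<a href="/data/weather.csv">Weather</a> '
    '<a href="/about.html">About</a>'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(page)

# Resolve relative links against a hypothetical base URL and keep CSV files only.
base_url = "http://example.org/"
csv_links = [urljoin(base_url, link) for link in collector.links if link.endswith(".csv")]

print(collector.links)
print(csv_links)
```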

Accessing Data: the Python Way

Notebook for Accessing Data: the Python Way

01_Accessing-Data.ipynb in Jupyter's unit2 subfolder

Let's look at the notebooks now!

Excursus: How and where to find data?

Question

Question.

How can you find interesting datasets for your project?

Possible ways to find data

Google, Bing, etc.

Many datasets can be found using a Web Search engine such as Google, Bing, etc.

Combine your keyword search with tokens such as "csv", ".csv".

Such search engines also offer more advanced search features to filter for particular data formats.

Unfortunately, file-format search does not return very good results on all search engines.

(Try out e.g.: 'WU Vienna Lectures')

Follow questions/search on Quora, Stack Overflow, Reddit

Quora and Stack Overflow are question-and-answer sites where people can pose any question and receive community answers.

Some direct links:

Notice.

The popular and useful platforms change over time. Hint: Follow Metcalfe's Law

Blogs about data science

Some general data science blogs regularly have posts about datasets.

Lists of data science blogs:

Curated lists of datasets

Many people also provide curated lists of public datasets or of APIs to datasets. These lists can typically be found via a Google/Bing/Yahoo search.

Some examples:

Open Data Portals

So-called (Open Data) portals are catalogs of datasets.

Further links:

Question

Question.

What should you consider if you use "public" datasets?

Consuming "public" datasets

Public does not necessarily mean free

Many public datasets come with certain restrictions of what one is allowed to do with the data.

The data license (if available) typically answers the following questions:

Notice.

More about licenses of datasets in SBWL3

Homework

Details: Assignment 2 will be published on Learn@WU by tomorrow at the latest.

Submission: Via Assignment 2 on Learn@WU (deadline will be exactly 2 weeks from publication).