## Accessing Data: the Python Way

In this part, we cover two data access methods:

1. Loading data from disk

2. Loading data from a Web resource (URL)

We will also learn how to *guess* the file format by inspecting the *metadata* and the *content* of the retrieved data.

## Python3: Opening and Closing data streams

The typical steps involved in consuming data are:

1. Open a stream to read the data (either from a file or via HTTP)

2. Consume the content (e.g. load the whole content or parts of it)

3. Close the stream to free up resources:

  * files: allows other processes to access the file (otherwise they may fail with an error such as *File used by another program*)

  * HTTP: closing a stream allows connections to be reused
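
The three steps can be written out explicitly with `try`/`finally`. This is a minimal sketch; the temporary file and its content are made up so that the example runs on its own:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained (hypothetical content).
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8")
tmp.write("line one\nline two\n")
tmp.close()

f = open(tmp.name, encoding="utf-8")   # 1. open the stream
try:
    content = f.read()                 # 2. consume the content
finally:
    f.close()                          # 3. close the stream, even if reading failed

print(content)
print("closed:", f.closed)

os.remove(tmp.name)                    # clean up the throwaway file
```

The `finally` branch runs whether or not reading raises an exception, so the file handle is always released.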


## Python3: Automatically Closing data streams

The **with** statement is used to wrap the execution of a block with methods defined by a context manager.
This allows common try...except...finally usage patterns to be encapsulated for convenient reuse.

Typical use case: automatically ensure that streams are closed.

Other use cases: timing functions, printing logs at the end of a call.

        with COMMAND as C:
          #work with C
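
As an illustration of the timing use case, a context manager can be written with `contextlib.contextmanager` from the standard library. The helper name `timed` and the dictionary it yields are assumptions made for this sketch, not part of any library:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Hypothetical helper: measure how long the wrapped block takes."""
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result                       # the body of the with block runs here
    finally:
        # Runs when the block ends, even if it raised an exception.
        result["seconds"] = time.perf_counter() - start
        print(f"{label} took {result['seconds']:.4f} s")

with timed("summing") as t:
    total = sum(range(100_000))

print(total)
```

The code before `yield` plays the role of the setup, and the `finally` part plays the role of the cleanup that `with` guarantees.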


## Loading files from disk

If a file is stored on the local machine, we can open it and inspect or load its content.

There are typically two ways to read the content of a file:

1. Load the whole content of the file and store it in a variable for further processing

2. Read the file line by line (e.g., if files are large)

See also Chapter 7.2 in the [Python3 tutorial](https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files)

## File location

We need the location of the file on disk to load its content.

An **absolute file path** points to the same location in a file system, regardless of the current working directory. To do that, it must include the root directory.

        Windows: C:\Users\jumbrich\data\course-syllabus.txt
        Linux/Mac: /home/jumbrich/data/course-syllabus.txt


A **relative path** points to the relative location of a file based on the given/current working directory.

        Windows:
        Linux: ~/data/course-syllabus.txt #starting from home directory (~ is expanded by the shell, not by Python's open())
        Linux: ../data/course-syllabus.txt #go one folder up, then into data
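
The difference between the two kinds of paths can be checked from Python. A small sketch using `pathlib` and `os.path` from the standard library; the file does not need to exist for these calls, and the result of the last line depends on the machine:

```python
import os
from pathlib import Path

relative = Path("data") / "course-syllabus.txt"   # interpreted relative to the working directory
print(relative.is_absolute())                     # False

absolute = relative.resolve()                     # anchored at the current working directory
print(absolute.is_absolute())                     # True

# "~" is shell shorthand; in Python it has to be expanded explicitly.
home_based = os.path.expanduser("~/data/course-syllabus.txt")
print(home_based)
```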


## Function:  **open()**

In [None]:
help( open )

## Read content of file into memory

The function **read()** reads the entire content of the file and returns it.

In [None]:
filePath="data/course-syllabus.txt"
#open file in read mode
f = open(filePath) # or open(filePath, 'r')

print("Full Output of content:")
content= f.read() # read the whole content and store it in variable content
print(content)
f.close() # do not forget to close the file

#better: the with statement closes the file automatically
with open(filePath) as f: # mind the indentation (spaces vs. tabs)
    content = f.read()
    print(content)

## Read a single line from a file

The function **readline()** reads a single line from the file; a newline character (\n) is left at the end of the string

In [None]:
filePath="data/course-syllabus.txt"
#open file in read mode
with open(filePath) as f:# or open(filePath, 'r')
    print("first line: "+f.readline())
    print("second line: "+f.readline())
    print("third line: "+f.readline())

## Read lines from a file using a loop

In [None]:
filePath="data/course-syllabus.txt"
#open file in read mode
with open(filePath)  as f:# or open(filePath, 'r')
    for line in f: # loop over every line in the file (separated by newline)
        print(line, end="") # each line already ends with a newline character
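
Looping over the file object keeps only one line in memory at a time, which is why it suits large files. A minimal sketch, assuming a UTF-8 text file; the helper `count_lines_and_words` and the sample content are made up for illustration:

```python
import os
import tempfile

def count_lines_and_words(path):
    """Count lines and whitespace-separated words without loading the whole file."""
    lines = 0
    words = 0
    with open(path, encoding="utf-8") as f:
        for line in f:                  # one line at a time
            lines += 1
            words += len(line.split())
    return lines, words

# Self-contained demo with a throwaway file (hypothetical content).
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.txt")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("Data Processing 1\nWeek 1: Accessing data\n")
    counts = count_lines_and_words(sample)

print(counts)   # (2, 7)
```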

## Resource-saving way to guess the format of a file

**Question.**

How can we guess the format of a file with as few resources as possible?



* Inspect the file extension, if available (e.g. *.txt*)

* Read the first couple of lines, print them and see if you detect any known format syntax patterns (e.g. JSON brackets, CSV delimiters)
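
Both ideas can be combined in a few lines. The following is only a rough heuristic sketched for illustration, not a reliable format detector; the function names `peek_lines` and `guess_format`, the chosen patterns and the sample file are assumptions:

```python
import os
import tempfile
from itertools import islice

def peek_lines(path, n=5):
    """Return at most the first n lines of a text file without reading the rest."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]

def guess_format(path, n=5):
    """Return (file extension, guessed format) based on the first n lines."""
    extension = os.path.splitext(path)[1].lower()
    lines = [line for line in peek_lines(path, n) if line.strip()]
    first = lines[0].lstrip() if lines else ""
    commas = lines[0].count(",") if lines else 0
    if first.startswith(("{", "[")):
        guess = "json"                      # JSON brackets
    elif first.startswith("<"):
        guess = "xml/html"                  # markup
    elif commas > 0 and all(line.count(",") == commas for line in lines):
        guess = "csv"                       # same number of delimiters in every line
    else:
        guess = "text"
    return extension, guess

# Self-contained demo with a throwaway CSV file (hypothetical content).
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "courses.csv")
    with open(sample, "w", encoding="utf-8") as f:
        f.write("id,title\n1,Data Processing\n2,Statistics\n")
    result = guess_format(sample)

print(result)   # ('.csv', 'csv')
```

Only the first `n` lines are read, so the cost stays small even for very large files.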

## Getting the file size of a local file

In [None]:
filePath="data/course-syllabus.txt"

import os
fSize = os.path.getsize(filePath)

print('File size of ' + filePath + ' is: ' + str(fSize) + ' Bytes') # typecasting of an int to str for str concatenation

## Loading data from a Web resource (URL)

There exist many libraries in Python 3 to interact with Web resources using the HTTP protocol.
* the [urllib library](https://docs.python.org/dev/library/urllib.html) is part of the Python standard library

* the **[requests library](http://docs.python-requests.org/en/master/)** needs to be installed, but is easier to use

**Some warnings ;).**

Warning: Recreational use of other HTTP libraries may result in dangerous side-effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, or even death. [[requests library index page](http://docs.python-requests.org/en/master/)]



## HTTP Protocol Operations

**The HTTP protocol is the foundation of data communication for the World Wide Web**

A widely used version of the protocol is [HTTP/1.1](https://tools.ietf.org/html/rfc2616); newer versions (HTTP/2 and HTTP/3) exist as well.

A client (browser or library) typically uses the **HTTP GET** operation to retrieve the content of, and information about, an HTTP URL.

## Loading data from a Web resource: urllib

First things first: we need to load the library to be able to use it.

In [None]:
import urllib.request

Afterwards, we need to open a connection to the HTTP server and request the content of the URL:

        urllib.request.urlopen( URL )


## Urllib: accessing a URL

In [None]:
import urllib.request
url="https://datascience.ai.wu.ac.at/ss19/dataprocessing1/data/course-syllabus.txt"
with urllib.request.urlopen(url) as f:
    print(f.read().decode("utf-8")) # read() returns bytes; decoding assumes the file is UTF-8 encoded

## Loading data from a Web resource: requests library

First things first.

We need to install and then load the library to be able to use it.

The requests library is installed by default on the course container and in any Anaconda installation.

In [None]:
import requests

Afterwards, we need to open a connection to the HTTP server and request the content of the URL:

        requests.get( URL )


## Requests: accessing a URL

In [None]:
import requests
url="https://datascience.ai.wu.ac.at/ss19/dataprocessing1/data/course-syllabus.txt"

r = requests.get( url )
content=r.text
print(content)

## Guessing the file format via the URL


* patterns in the URL

  * file extension: <http://data.wu.ac.at/.../course-syllabus.txt> (**.txt**)

  * query parameter: <http://..../api/courses>?format=csv (**format=csv**)


* HTTP Response Header

  * contains information about the file format, among other things
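
The URL patterns can be extracted with `urllib.parse` from the standard library. A minimal sketch: the helper `format_hints_from_url` and the `example.org` URLs are made up for illustration, and the parameter name `format` is taken from the example above (other APIs may use a different name):

```python
import os
from urllib.parse import urlparse, parse_qs

def format_hints_from_url(url):
    """Return (file extension, value of the 'format' query parameter)."""
    parts = urlparse(url)
    extension = os.path.splitext(parts.path)[1].lower()   # "" if the path has no extension
    query = parse_qs(parts.query)                         # e.g. {"format": ["csv"]}
    fmt = query.get("format", [None])[0]
    return extension, fmt

print(format_hints_from_url("https://example.org/data/course-syllabus.txt"))  # ('.txt', None)
print(format_hints_from_url("https://example.org/api/courses?format=csv"))    # ('', 'csv')
```

Both hints come from the URL alone, so no request has to be sent to obtain them.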


## HTTP Response Header

Every HTTP operation consists of an HTTP request and an HTTP response, each with its own header.

An HTTP response is the message an HTTP server sends back for a request.

The response message contains:
* The [HTTP status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)

* The [response header fields](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields#Response_fields)

* Empty line

* Message body/content
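
To make the four parts concrete, the sketch below splits a raw HTTP/1.1 response by hand. The response text is made up for illustration; in practice, libraries such as urllib or requests do this parsing for you:

```python
# A hand-written HTTP/1.1 response (hypothetical content); lines end with "\r\n".
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/plain\r\n"
    "Content-Length: 12\r\n"
    "\r\n"
    "Hello course"
)

# The empty line separates the header part from the message body.
head, _, body = raw_response.partition("\r\n\r\n")

# The first header line is the status line, the rest are header fields.
status_line, *header_lines = head.split("\r\n")
version, status_code, reason = status_line.split(" ", 2)

headers = {}
for line in header_lines:
    name, _, value = line.partition(":")
    headers[name.strip()] = value.strip()

print(status_code, reason)   # 200 OK
print(headers)
print(body)                  # Hello course
```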

## HTTP Response Header Examples

        HTTP/1.1 200 OK
        Date: Wed, 12 Oct 2016 12:39:12 GMT
        Server: Apache/2.4.18 (Ubuntu) mod_wsgi/4.3.0 Python/2.7.12
        Last-Modified: Wed, 12 Oct 2016 07:29:32 GMT
        ETag: "1b3-53ea5f4498d97"
        Accept-Ranges: bytes
        Content-Length: 435
        Vary: Accept-Encoding
        Content-Type: text/plain


See also [a full list of possible HTTP Response Header fields on Wikipedia](https://en.wikipedia.org/wiki/List_of_HTTP_header_fields)

Interesting header fields: **Content-Type** and **Content-Length**

**Notice.**

**Python3 strings are case-sensitive**, meaning that "Content-Type" != "content-type". HTTP header field names themselves are case-insensitive, so servers may send them in lower case or capitalized. The header objects returned by urllib and requests look up field names case-insensitively; a plain dict does not.
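
When header fields end up in a plain dict, one way to avoid case problems is to normalize all field names to lower case before looking them up. A minimal sketch with made-up header values:

```python
# Header fields as they might arrive, with inconsistent capitalization (hypothetical values).
raw_headers = {"content-type": "text/plain", "CONTENT-LENGTH": "435"}

# A plain dict compares keys exactly, so this lookup fails.
print("Content-Type" in raw_headers)        # False

# Normalize every field name to lower case once, then always look up in lower case.
normalized = {name.lower(): value for name, value in raw_headers.items()}
print(normalized["content-type"])           # text/plain
print(normalized.get("content-length"))     # 435
```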



## HTTP Response Header with Urllib

In [None]:
import urllib.request
url="http://datascience.ai.wu.ac.at/ss19/dataprocessing1/data/course-syllabus.txt"
req =  urllib.request.Request( url , method="HEAD")  # create a HTTP HEAD request
with urllib.request.urlopen(req) as resp:
    header = resp.info()
    # print the full header
    print("Header:")
    print(header)

    # print the content type
    print("Content-Type:")
    print(header['Content-Type'])

    # print the content length
    print("Content-Length in Bytes:")
    print(header['Content-Length'])

## HTTP Response Header with Requests

In [None]:
import requests
url="http://datascience.ai.wu.ac.at/ss19/dataprocessing1/data/course-syllabus.txt"

r = requests.head( url ) # would also work with an HTTP GET
headerDict=r.headers
print(headerDict)

## Inspect Requests library HTTP Response Headers

In [None]:
#print all available response header keys
print("Header")
print(headerDict)

#access content-type header, if it exists
if "Content-Type" in headerDict:
    print("Content-Type: ", headerDict['Content-Type'] )
    
#access content-length header, if it exists
if "Content-Length" in headerDict:
    print("Content-Length: ", headerDict['Content-Length'] )
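
Header values are plain strings: `Content-Type` may carry extra parameters such as a character set, and `Content-Length` has to be converted to a number before it can be compared. One possible way to split them, sketched with `email.message.Message` from the standard library; the helper `parse_content_headers`, the fallback type `application/octet-stream` and the example values are assumptions:

```python
from email.message import Message

def parse_content_headers(headers):
    """Extract media type, charset and length from a dict-like header object."""
    msg = Message()
    # Fallback media type is an assumption for headers without Content-Type.
    msg["Content-Type"] = headers.get("Content-Type", "application/octet-stream")
    length = headers.get("Content-Length")
    return {
        "media_type": msg.get_content_type(),     # e.g. "text/csv"
        "charset": msg.get_content_charset(),     # None if no charset is given
        "length": int(length) if length is not None else None,
    }

# Hypothetical header values for illustration.
example = {"Content-Type": "text/csv; charset=UTF-8", "Content-Length": "435"}
info = parse_content_headers(example)
print(info)
```

The function accepts any object with a `get` method, so the `headerDict` from the cells above could be passed in as well.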