## Python 3 and character encodings

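Python 3 [natively supports Unicode](https://docs.python.org/3/howto/unicode.html): every `str` is a sequence of Unicode code points, and UTF-8 is the default encoding for source files. This makes many things easier than in programming languages that treat strings as raw bytes!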

In [19]:
# Let's read a UTF-8 encoded Unicode file.
# Passing the encoding explicitly avoids relying on the platform default:

with open('./data/unicode_example_utf8.txt', 'r', encoding='utf-8') as f:
    s = f.read()

print(s)

Daß Jürgen heute nicht in Österreich ist, liegt daran, daß er zu einer Konferenz nach 神戸 (Kobe, Japan) geflogen ist. 



As you already know from BIS I, UTF-8 is backward compatible with ASCII, so all text files without any special symbols can be read without problems. These two encodings are probably by far the most common ones nowadays.
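
A minimal sketch of what this compatibility means in practice: a pure-ASCII byte sequence decodes to the same string under both codecs. The sample text is made up for illustration and is not one of the course data files.

```python
# Illustrative sample text, not one of the course data files.
ascii_bytes = "Some plain text, no umlauts.".encode("ascii")

as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")

# Every ASCII byte (0x00-0x7F) is also a valid one-byte UTF-8 sequence.
print(as_ascii == as_utf8)

# The reverse does not hold: a non-ASCII character needs several bytes in UTF-8.
eszett_utf8 = "ß".encode("utf-8")
print(eszett_utf8)
```
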

In [99]:
# Let's read an ASCII file as UTF-8:

with open('./data/ascii.txt', 'r', encoding='utf-8') as f:
    s = f.read()

print(s)


Some arbitrary test without special characters.


However, there are other common encodings that may produce errors when you try to read them.
You can provide the input encoding of a file as an explicit `encoding` parameter of the open() function.
Find a list of known encodings here: https://docs.python.org/3/library/codecs.html

In [100]:
# Another common encoding, especially for French or German documents, is Latin-1 (more precisely called ISO-8859-1)

# This one doesn't work...:
with open('./data/latin-1.txt', mode='r') as f:
    s = f.read()

# ...but this one would:
# with open('./data/latin-1.txt', mode='r', encoding='latin-1') as f:
#     s = f.read()

print(s)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 2: invalid continuation byte
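
To see where this message comes from, here is a small sketch that reproduces the same error in memory. It assumes the file starts with the same words as the example sentence above; the byte string is built here for illustration rather than read from `latin-1.txt`.

```python
# The first words of the example sentence, encoded as Latin-1.
raw = "Daß Jürgen".encode("latin-1")
print(raw)  # the ß is the single byte 0xdf at index 2

try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    # In UTF-8, 0xdf announces a two-byte sequence, but the next byte
    # (0x20, a space) is not a valid continuation byte.
    failed_at = err.start
    reason = err.reason
    print(failed_at, reason)

# Decoding with the encoding the bytes were written in works fine.
text = raw.decode("latin-1")
print(text)
```
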

On top of that, there are other Unicode encodings besides UTF-8, such as UTF-16 or UTF-32, although they are used more rarely. The differences between these Unicode encodings are well explained here: [(EN)](https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings) [(DE)](https://de.wikipedia.org/wiki/Unicode_Transformation_Format#UTF-8.2C_UTF-16_und_UTF-32)
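
One practical difference is how many bytes each encoding needs per character. The following sketch counts the bytes for a few sample strings; the samples are chosen for illustration only.

```python
samples = ["A", "ß", "神戸"]
encodings = ("utf-8", "utf-16-le", "utf-32-le")

# The "-le" variants are used so that no byte order mark (BOM) is
# prepended, which would otherwise add 2 (UTF-16) or 4 (UTF-32) bytes.
sizes = {
    s: {enc: len(s.encode(enc)) for enc in encodings}
    for s in samples
}

for s, per_encoding in sizes.items():
    print(s, per_encoding)
```

UTF-8 uses a variable number of bytes (one for ASCII, more for other characters), while UTF-32 always uses four bytes per code point.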

In [83]:
# Let's read a file in UTF-32:

with open('./data/unicode_example_utf32.txt', encoding='UTF-32', mode='r') as f:
    s = f.read()

print(s)

Some arbitrary test without special characters, but in a strange encoding...



In [None]:
# We can also read a file as raw bytes and then force an encoding
# with the decode() method... but with the wrong encoding that obviously produces nonsense:
with open('./data/ascii.txt', mode='rb') as f:
    s = f.read()

print(s.decode("utf-16", "replace"))
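
For comparison, a sketch of both cases in memory: forcing the wrong encoding, and an actual conversion that decodes with the correct source encoding and then encodes into the target one. The byte strings are illustrative stand-ins for file contents.

```python
ascii_bytes = b"Some text"  # illustrative stand-in for the file contents

# Wrong: pairs of ASCII bytes are interpreted as single UTF-16 code units.
garbled = ascii_bytes.decode("utf-16", "replace")
print(garbled)

# Right: decode with the encoding the bytes were written in,
# then encode the resulting str into the target encoding.
latin1_bytes = "Daß".encode("latin-1")
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
print(latin1_bytes, utf8_bytes)
```
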

It's kinda hard to guess the encoding manually... but a Python package comes to the rescue: [chardet](https://pypi.python.org/pypi/chardet) :-)

In [85]:
import chardet

with open('./data/latin-9.txt', 'rb') as f:
    rawdata = f.read()

print(chardet.detect(rawdata))

# Bad news: detection doesn't always work. For 'latin-9.txt', the guess below is ISO-8859-2.

{'encoding': 'ISO-8859-2', 'confidence': 0.8327456309366219}
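
Since detection is only a guess, one common alternative is to try a short list of candidate encodings in order. The sketch below uses only the standard library; `decode_with_fallback` is a hypothetical helper written for this example, not part of chardet or Python.

```python
def decode_with_fallback(raw, candidates=("utf-8", "latin-1")):
    """Hypothetical helper: return (text, encoding) for the first candidate that decodes."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# Latin-1 bytes are rejected by UTF-8 and picked up by the fallback.
text, used = decode_with_fallback("Österreich".encode("latin-1"))
print(text, used)

# Caveat: Latin-1 maps every possible byte, so it never raises an error.
# A Latin-9 euro sign (byte 0xa4) silently turns into a different character.
euro_text, euro_used = decode_with_fallback("€".encode("iso-8859-15"))
print(euro_text, euro_used)
```

A successful decode therefore only shows that the bytes are valid in that encoding, not that it is the encoding the author actually used. This is why Latin-1 belongs at the end of the candidate list.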
