As a C/C++ programmer, I always found handling string encodings a real pain. But in Python, it is a breeze! Here are a few short tips for handling Unicode and other encodings.
>>> #Create a unicode string
>>> s = u'Hello unicode world! Ü'
>>> s
u'Hello unicode world! \xdc'
>>>
>>> #Convert to an encoding using encode
>>> s = s.encode('utf-8')
>>> s
'Hello unicode world! \xc3\x9c'
>>>
>>> #Convert back to unicode using decode
>>> s = s.decode('utf-8')
>>> s
u'Hello unicode world! \xdc'
>>>
>>> #Convert to another encoding
>>> s = s.encode('iso-8859-1')
>>> s
'Hello unicode world! \xdc'
>>>
>>> #Convert back to unicode
>>> s = s.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdc in position 21: unexpected end of data
. . .
This time, when the codec tried to convert the string back to unicode, it expected UTF-8 but we supplied a string encoded in ISO-8859-1, hence the exception. Don't get your encodings mixed up!
Not every mismatch results in an exception, though. For example:
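If you would rather handle the failure than let the traceback escape, one option is to catch UnicodeDecodeError. The following is only a minimal sketch, and it assumes Python 3 (where byte strings and text are separate types) rather than the Python 2.6 session shown here. The helper name try_decode is made up for illustration.

```python
# Python 3 sketch: bytes and str are distinct types, and decode() lives on bytes.
text = 'Hello unicode world! \u00dc'
latin1_bytes = text.encode('iso-8859-1')  # ends with the single byte 0xdc


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None on a codec mismatch."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in the requested encoding.
        return None


wrong = try_decode(latin1_bytes, 'utf-8')       # mismatch: 0xdc starts an incomplete UTF-8 sequence
right = try_decode(latin1_bytes, 'iso-8859-1')  # matching codec: round-trips cleanly
```

Here wrong comes back as None while right equals the original text. Whether returning None is the right reaction depends on your program; re-raising or logging may suit you better.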
>>> #Create a unicode string
>>> s = u'Hello unicode world! Ü'
>>> s
u'Hello unicode world! \xdc'
>>>
>>> #Convert to an encoding using encode
>>> s = s.encode('utf-8')
>>> s
'Hello unicode world! \xc3\x9c'
>>>
>>> #Convert back to unicode using decode
>>> s = s.decode('iso-8859-1')
>>> s
u'Hello unicode world! \xc3\x9c'
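As you can see, no exception was raised, but we did not get the original unicode string back! The two UTF-8 bytes \xc3\x9c were each decoded as a separate ISO-8859-1 character.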
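The silent failure can be reproduced, and in this particular case undone, with a few lines. This is a rough sketch that assumes Python 3 rather than the Python 2.6 session shown here. The repair step only works because ISO-8859-1 maps every byte value to exactly one character, so do not expect it to work for arbitrary codec pairs.

```python
# Python 3 sketch of decoding with the wrong codec without any exception.
original = 'Hello unicode world! \u00dc'
utf8_bytes = original.encode('utf-8')      # the U-umlaut becomes two bytes: 0xc3 0x9c

# Wrong codec, but no error: every byte is a valid ISO-8859-1 character.
garbled = utf8_bytes.decode('iso-8859-1')

# The garbled text is one character longer than the original,
# because the two bytes turned into two separate characters.
extra_chars = len(garbled) - len(original)

# Undo the mistake: go back to the same bytes, then decode with the right codec.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
```

After running this, garbled differs from original, extra_chars is 1, and repaired equals original again.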