Uhaw Pa Sa Camel

d'Doc
Alabang, Muntinlupa City, Philippines
Beer-loving Gunner extraordinaire, perennial vocalist, guitarist, dog person, and wet kisser in one neat li'l package.

Python Short Hacking Tip #4: Don't mix encodings

As a C/C++ programmer, I always found handling string encodings a real pain. But in Python, it is a breeze! Here are a few short tips for handling unicode and other encodings.

>>> #Create a unicode string
>>> s = u'Hello unicode world! Ü'
>>> s
u'Hello unicode world! \xdc'
>>> 
>>> #Convert to an encoding using encode
>>> s = s.encode('utf-8')
>>> s
'Hello unicode world! \xc3\x9c'
>>> 
>>> #Convert back to unicode using decode
>>> s = s.decode('utf-8')
>>> s
u'Hello unicode world! \xdc'
>>> 
>>> #Convert to another encoding
>>> s = s.encode('iso-8859-1')
>>> s
'Hello unicode world! \xdc'
>>>
>>> #Convert back to unicode
>>> s = s.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdc in position 21: unexpected end of data
. . .

This time the codec was asked to decode the bytes as utf-8, but we supplied a string encoded in iso-8859-1, hence the exception. Don't get your encodings mixed!
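
The session above is Python 2. As a rough sketch of the same mistake in Python 3, where text is `str` and encoded data is `bytes`, the mismatched decode still raises `UnicodeDecodeError`. The helper name `roundtrip` is made up for illustration.

```python
def roundtrip(text, encode_as, decode_as):
    """Encode text with one codec and decode the bytes with another."""
    data = text.encode(encode_as)   # str -> bytes
    return data.decode(decode_as)   # bytes -> str

text = 'Hello unicode world! \u00dc'

# Matching codecs give back the original text.
same = roundtrip(text, 'utf-8', 'utf-8')

# Mismatched codecs: the lone 0xdc byte is not valid utf-8.
try:
    roundtrip(text, 'iso-8859-1', 'utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
```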

Not every mismatch results in an exception, though. For example:

>>> #Create a unicode string
>>> s = u'Hello unicode world! Ü'
>>> s
u'Hello unicode world! \xdc'
>>> 
>>> #Convert to an encoding using encode
>>> s = s.encode('utf-8')
>>> s
'Hello unicode world! \xc3\x9c'
>>> 
>>> #Convert back to unicode using decode
>>> s = s.decode('iso-8859-1')
>>> s
u'Hello unicode world! \xc3\x9c'

As you can see, we did not get the original unicode string back! The two utf-8 bytes of Ü were each decoded as a separate iso-8859-1 character.
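
When the wrong decode goes through silently like this, the damage can often be undone by reversing the wrong step, as long as the wrong codec maps every byte to a character (iso-8859-1 does). A minimal Python 3 sketch, with `undo_wrong_decode` being a hypothetical name and the default codecs assumed from the example above:

```python
def undo_wrong_decode(garbled, wrong='iso-8859-1', right='utf-8'):
    """Re-encode with the codec that was wrongly used, then decode properly."""
    return garbled.encode(wrong).decode(right)

original = 'Hello unicode world! \u00dc'

# Reproduce the silent mistake: utf-8 bytes read as iso-8859-1.
garbled = original.encode('utf-8').decode('iso-8859-1')

# Walk the mistake backwards to recover the text.
repaired = undo_wrong_decode(garbled)
```

This only works if you know (or can guess) which two encodings were mixed up.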

Posted by d'Doc @ 8:42 PM 1 comments

Python Short Hacking Tip #3: Know your encoding

It is next to impossible to determine what encoding was used just by looking at a string of bytes. The next best thing is checking whether the string decodes cleanly under one specific encoding.

def is_encoding(enc, s):
    """Return True if byte string s decodes cleanly using encoding enc."""
    try:
        s.decode(enc)
        return True
    except UnicodeDecodeError:
        return False

Sample run:

>>> is_encoding('utf-8', u'Hello World \xdc'.encode('iso-8859-1'))
False

Take note that if the characters in the byte string are all in the ascii set, is_encoding will return True even when the encodings differ, because ascii bytes are identical in utf-8 and iso-8859-1.

>>> is_encoding('utf-8', u'Hello World'.encode('iso-8859-1'))
True
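
The function above targets Python 2, where `str` has a `decode` method. A Python 3 sketch of the same check works on `bytes` instead. The `candidate_encodings` helper and its default shortlist are assumptions added for illustration, not part of the original tip.

```python
def is_encoding(enc, data):
    """Return True if the bytes decode without error under enc."""
    try:
        data.decode(enc)
        return True
    except UnicodeDecodeError:
        return False

def candidate_encodings(data, encodings=('ascii', 'utf-8', 'iso-8859-1')):
    """List the encodings from a (hypothetical) shortlist that decode data cleanly."""
    return [enc for enc in encodings if is_encoding(enc, data)]

latin1_bytes = 'Hello World \u00dc'.encode('iso-8859-1')
ascii_bytes = 'Hello World'.encode('iso-8859-1')
```

Keep in mind that a clean decode only shows the bytes are valid under that encoding, not that it was the one actually used: iso-8859-1 accepts any byte sequence at all.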

Posted by d'Doc @ 7:12 PM 0 comments

Python Short Hacking Tip #2: is_pys60 (duh!)

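# PyS60 (Python for S60) provides the e32 module, so a successful
# import tells us we are running on it.
try:
    import e32
    _has_e32 = True
except ImportError:
    _has_e32 = False

def is_pys60():
    """Return True when running under PyS60."""
    return _has_e32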

Posted by d'Doc @ 8:47 AM 0 comments

Wednesday, November 03, 2010

Python Short Hacking Tip #1: Plain Auth Base 64 Encoding

import base64

auth_string = base64.encodestring( '\x00%s\x00%s' % (username, password) )

If you don't want the newline at the end, feel free to remove it:
auth_string = base64.encodestring( '\x00%s\x00%s' % (username, password) )[0:-1]
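
As an alternative to slicing off the newline, `base64.b64encode` produces output with no line breaks at all, which also matters for long inputs since `encodestring` breaks its output into multiple lines. A minimal Python 3 sketch, assuming utf-8 credentials and the same leading `\x00` as the one-liner above (`plain_auth_string` is a name invented here):

```python
import base64

def plain_auth_string(username, password):
    """Build a base64 PLAIN auth string with no newlines in it."""
    # Same layout as above: NUL, username, NUL, password.
    raw = b'\x00' + username.encode('utf-8') + b'\x00' + password.encode('utf-8')
    # b64encode returns bytes; decode to get a plain text string.
    return base64.b64encode(raw).decode('ascii')

auth_string = plain_auth_string('user', 'pass')
```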

Reading from
http://stackoverflow.com/questions/2620975/strange-n-in-base64-encoded-string-in-ruby

I learned that:

"The reason content-free newlines are added at the encode stage is because base64 was originally devised as an encoding mechanism for sending binary content in e-mail, where the line length is limited. Feel free to replace them away if you don't need them."

Posted by d'Doc @ 3:32 PM 3 comments
