Uhaw Pa Sa Camel

 
d'Doc
Alabang, Muntinlupa City, Philippines
Beer-loving Gunner extraordinaire, perennial vocalist, guitarist, dog person, and wet kisser in one neat li'l package.


 
Thursday, November 04, 2010

Python Short Hacking Tip #4: Don't mix encodings

 
As a C/C++ programmer, I have always found handling string encodings a pain. In Python, it is a breeze! Here are a few short tips for handling Unicode and other encodings.
>>> #Create a unicode string
>>> s = u'Hello unicode world! Ü'
>>> s
u'Hello unicode world! \xdc'
>>>
>>> #Convert to an encoding using encode
>>> s = s.encode('utf-8')
>>> s
'Hello unicode world! \xc3\x9c'
>>>
>>> #Convert back to unicode using decode
>>> s = s.decode('utf-8')
>>> s
u'Hello unicode world! \xdc'
>>>
>>> #Convert to another encoding
>>> s = s.encode('iso-8859-1')
>>> s
'Hello unicode world! \xdc'
>>>
>>> #Convert back to unicode
>>> s = s.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xdc in position 21: unexpected end of data
. . .


This time, the codec expected UTF-8 bytes when converting the string back to Unicode, but we supplied a string encoded in ISO-8859-1, hence the exception. Don't get your encodings mixed!
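If you are on Python 3, where text (str) and bytes are separate types, the same round trip can be sketched roughly like this. It is not the original Python 2.6 session, and the helper name try_decode is made up for illustration.

```python
# Python 3 sketch of the session above (the post itself uses Python 2.6).
text = 'Hello unicode world! \u00dc'

utf8_bytes = text.encode('utf-8')          # U+00DC becomes two bytes: C3 9C
latin1_bytes = text.encode('iso-8859-1')   # U+00DC becomes one byte: DC


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decoding with the codec that produced the bytes round-trips cleanly.
same_codec = try_decode(utf8_bytes, 'utf-8')
# Decoding ISO-8859-1 bytes as UTF-8 fails, as in the traceback above.
mixed_codec = try_decode(latin1_bytes, 'utf-8')
```

Here same_codec equals the original text, while mixed_codec is None because the decode raised UnicodeDecodeError.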

Not every mix-up results in an exception, though. For example:

>>> #Create a unicode string
>>> s = u'Hello unicode world! Ü'
>>> s
u'Hello unicode world! \xdc'
>>>
>>> #Convert to an encoding using encode
>>> s = s.encode('utf-8')
>>> s
'Hello unicode world! \xc3\x9c'
>>>
>>> #Convert back to unicode using decode
>>> s = s.decode('iso-8859-1')
>>> s
u'Hello unicode world! \xc3\x9c'


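As you can see, we did not get the original Unicode string back! Each of the two UTF-8 bytes was decoded as a separate ISO-8859-1 character, so Ü silently became Ã followed by the control character \x9c.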
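This silent kind of corruption is the nastier one, so here is a small Python 3 sketch of it (again, not the original Python 2 session). It also shows that the damage can be undone, but only under the assumption that you know exactly which two codecs were mixed up.

```python
# Python 3 sketch: a wrong-codec decode that succeeds but corrupts the text.
original = 'Hello unicode world! \u00dc'

# Encode as UTF-8, then decode with the wrong codec. ISO-8859-1 maps every
# possible byte to a character, so this never raises.
garbled = original.encode('utf-8').decode('iso-8859-1')

# Undo the wrong decode, then apply the right one. This only works because
# we know which two codecs were involved.
repaired = garbled.encode('iso-8859-1').decode('utf-8')
```

The garbled string is one character longer than the original, since the single Ü turned into two characters.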


 
 
 

Python Short Hacking Tip #3: Know your encoding

 
It is next to impossible to determine which encoding was used just by looking at a string of bytes. The next best thing is to check whether the string decodes cleanly in one specific encoding.

def is_encoding(enc, s):
    # True if the byte string s decodes cleanly using the encoding enc
    try:
        s.decode(enc)
        return True
    except UnicodeDecodeError:
        return False


Sample run:
>>> is_encoding('utf-8', u'Hello World \xdc'.encode('iso-8859-1'))
False


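Take note that if all the characters in the byte string are in the ASCII set, is_encoding will return True even for a mismatched pair of encodings like the one above, because ASCII characters are encoded as the same bytes in both UTF-8 and ISO-8859-1.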

>>> is_encoding('utf-8', u'Hello World'.encode('iso-8859-1'))
True
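There is a second caveat worth sketching. The Python 3 version below is my adaptation of the function above (bytes in, same logic), and it shows that a passing check proves very little for single-byte encodings such as ISO-8859-1, which accept any byte sequence at all.

```python
def is_encoding(enc, data):
    """Return True if the bytes in `data` decode cleanly as `enc`."""
    try:
        data.decode(enc)
        return True
    except UnicodeDecodeError:
        return False


latin1_bytes = 'Hello World \u00dc'.encode('iso-8859-1')
utf8_bytes = 'Hello World \u00dc'.encode('utf-8')

# A failed check rules an encoding out...
latin1_is_utf8 = is_encoding('utf-8', latin1_bytes)
# ...but a passing check does not prove it: ISO-8859-1 accepts any bytes,
# including UTF-8 data and every possible byte value.
utf8_is_latin1 = is_encoding('iso-8859-1', utf8_bytes)
any_bytes_is_latin1 = is_encoding('iso-8859-1', bytes(range(256)))
```

So is_encoding is best used to rule encodings out. Checking strict encodings like UTF-8 first and falling back to ISO-8859-1 last is one reasonable ordering, though that is a design choice of mine and not something from the original tip.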


 
 
 

Python Short Hacking Tip #2: is_pys60 (duh!)

 
try:
    # e32 is a PyS60 module, so the import fails everywhere else
    import e32
    __has_e32 = True
except ImportError:
    __has_e32 = False

def is_pys60():
    return __has_e32
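The same try/except ImportError trick works for any optional module. Below is a minimal sketch of a generalized version. The helper has_module is hypothetical (my name, not part of PyS60), and unlike the snippet above it retries the import on every call instead of caching the result.

```python
def has_module(name):
    """Hypothetical helper: True if `name` can be imported, else False."""
    try:
        __import__(name)
        return True
    except ImportError:
        return False


def is_pys60():
    # e32 is the module the snippet above probes for.
    return has_module('e32')
```

Usage is then a plain branch, e.g. `if is_pys60(): ...` around any phone-specific code path.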


 
 
 
Wednesday, November 03, 2010

Python Short Hacking Tip #1: Plain Auth Base 64 Encoding

 
import base64

auth_string = base64.encodestring('\x00%s\x00%s' % (username, password))

If you don't want the newline at the end, feel free to strip it:
auth_string = base64.encodestring('\x00%s\x00%s' % (username, password))[:-1]

Reading from
http://stackoverflow.com/questions/2620975/strange-n-in-base64-encoded-string-in-ruby

I learned that:

"The reason content-free newlines are added at the encode stage is because base64 was originally devised as an encoding mechanism for sending binary content in e-mail, where the line length is limited. Feel free to replace them away if you don't need them."
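One thing to watch out for: encodestring also inserts a newline after every 76 characters of output, so slicing off the last character is not enough for long credentials. base64.b64encode never adds newlines, which sidesteps the problem. Here is a minimal Python 3 sketch. The function name plain_auth_string and the sample credentials are made up for illustration, and encoding the credentials as UTF-8 is my assumption.

```python
import base64


def plain_auth_string(username, password):
    """Hypothetical helper: base64 of NUL + username + NUL + password.

    Assumes the credentials should be encoded as UTF-8.
    """
    raw = '\x00{}\x00{}'.format(username, password).encode('utf-8')
    # b64encode emits no line breaks, so nothing needs stripping afterwards.
    return base64.b64encode(raw).decode('ascii')


token = plain_auth_string('alice', 's3cret')
long_token = plain_auth_string('u' * 100, 'p' * 100)
```

Neither token nor long_token contains a newline, no matter how long the inputs are. Note that encodestring itself is gone from recent Python 3 releases, so b64encode is the safer habit either way.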
