Thursday, April 14, 2011

Unable to decode unicode string in Python 2.4

This is in Python 2.4. Here is my situation: I pull a string from a database, and it contains an umlauted 'o' (\xf6). At this point, type(value) returns str. When I then run .decode('utf-8') on it, I get an error ('utf8' codec can't decode bytes in position 1-4).

Really my goal here is just to successfully make type(value) return unicode. I found an earlier question that had some useful information, but the example from the picked answer doesn't seem to run for me. Is there something I am doing wrong here?

Here is some code to reproduce:

Name = 'w\xc3\xb6rner'.decode('utf-8')
file.write('Name: %s - %s\n' %(Name, type(Name)))

I never actually get to the write statement, because it fails on the first statement.

Thank you for your help.

Edit:

I verified that the DB's charset is utf8. So in my code to reproduce I changed '\xf6' to '\xc3\xb6', and the failure still occurs. Is there a difference between 'utf-8' and 'utf8'?

The tip on using codecs to write to a file is handy (I'll definitely use it), but in this scenario I am only writing to a log file for debugging purposes.
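
On the 'utf-8' versus 'utf8' question, a quick check suggests the two spellings are aliases for the same codec, and that the two-byte form decodes while the lone '\xf6' byte does not. This is only a sketch: the b'' prefix is an assumption that requires Python 2.6 or later (drop it on 2.4, where plain string literals are already bytes), and the variable names are illustrative.

```python
# UTF-8 bytes for the name: the o-umlaut is the two-byte sequence C3 B6.
utf8_bytes = b'w\xc3\xb6rner'

# Both codec spellings are accepted and give the same decoded text.
via_hyphen = utf8_bytes.decode('utf-8')
via_plain = utf8_bytes.decode('utf8')
same_result = (via_hyphen == via_plain)

# The lone byte F6 is not valid UTF-8, so this decode raises.
try:
    b'w\xf6rner'.decode('utf-8')
    single_byte_failed = False
except UnicodeDecodeError:
    single_byte_failed = True
```

If same_result is true and single_byte_failed is true, the spelling of the codec name is not the problem; the bytes being decoded are.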

From stackoverflow
  • You need to use "ISO-8859-1":

    Name = 'w\xf6rner'.decode('iso-8859-1')
    file.write('Name: %s - %s\n' %(Name, type(Name)))
    

    UTF-8 uses at least 2 bytes for any character outside ASCII, but here it's just 1 byte, so ISO-8859-1 is probably correct.

  • Your string is not UTF-8 encoded. To decode a string to unicode, the string must actually be in the encoding you pass as the parameter. I tried this and it works perfectly:

    print 'w\xf6rner'.decode('cp1250')
    

    EDIT

    To write unicode strings to a file you can use the codecs module:

    import codecs
    f = codecs.open("yourfile.txt", "w", "utf8")
    f.write( ... )
    

    It is handy to specify the encoding at the input/output boundaries and use unicode strings throughout your code, without worrying about different encodings.

  • It's obviously a 1-byte encoding. 'ö' in UTF-8 is '\xc3\xb6'.

    The encoding might be:

    • ISO-8859-1
    • ISO-8859-2
    • ISO-8859-13
    • ISO-8859-15
    • Win-1250
    • Win-1252
  • "So in my code to reproduce I changed '\xf6' to '\xc3\xb6', and the failure still occurs"

    Not in the first line it doesn't:

    >>> 'w\xc3\xb6rner'.decode('utf-8')
    u'w\xf6rner'
    

    The second line will error out though:

    >>> file.write('Name: %s - %s\n' %(Name, type(Name)))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 7: ordinal not in range(128)
    

    Which is entirely what you'd expect when trying to write non-ASCII Unicode characters to a byte stream. If you use Jiri's suggestion of a codecs-wrapped stream you can write Unicode directly; otherwise you will have to re-encode the Unicode string into bytes manually.

    Better, for logging purposes, would be simply to spit out a repr() of the variable. Then you don't have to worry about Unicode characters being in there, or newlines or other unwanted characters:

    name = 'w\xc3\xb6rner'.decode('utf-8')
    file.write('Name: %r\n' % name)
    
    Name: u'w\xf6rner'
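
The manual re-encoding mentioned in the last answer could look roughly like this. It is a sketch under assumptions: io.BytesIO is a hypothetical stand-in for the asker's log file, UTF-8 is an assumed choice for the log's encoding, and the b'' prefix and the io module need Python 2.6 or later.

```python
import io

# UTF-8 bytes as they might arrive from the database.
raw = b'w\xc3\xb6rner'
name = raw.decode('utf-8')  # now text, no longer bytes

line = u'Name: %s\n' % name

# Encoding to ASCII is what fails in the thread; reproduce it explicitly.
try:
    line.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encode by hand with a codec that can represent the character,
# then write the resulting bytes to the byte stream.
log = io.BytesIO()  # hypothetical stand-in for the log file
log.write(line.encode('utf-8'))
written = log.getvalue()
```

Any encoding that can represent the characters would work in place of UTF-8; the point is only that the encode step is explicit rather than left to the default ASCII codec.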
    
