Python Unicode Gotcha
by Justin Michalicek on Dec. 4, 2012, 2:11 p.m. UTC
I was peer reviewing a bug fix for some code at work the other day and learned something about Python and encoded Unicode. It makes sense now, but when I first saw the fix my initial thought was that there's no way that's what is going wrong. It turns out that once you've encoded a Unicode string, say to UTF-8, a string slice can cut through the middle of a single character and keep only some of its bytes.
>>> u = u'A string with some random unicode \u0200 \u0202. There it is.'
>>> u
u'A string with some random unicode \u0200 \u0202. There it is.'
Now we use a string slice while it's still in pure Unicode and everything behaves as expected.
>>> u[:35]
u'A string with some random unicode \u0200'
>>> u[:34]
u'A string with some random unicode '
Encode to UTF-8 and you need to be careful.
>>> s = u.encode('utf8')
>>> s
'A string with some random unicode \xc8\x80 \xc8\x82. There it is.'
>>> s[:35]
'A string with some random unicode \xc8'
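The encode method converts a Unicode string into a sequence of bytes. From then on, indexes and slices count bytes rather than characters, so a slice can keep only some of the multiple bytes needed to represent a single character. Here s[:35] kept \xc8 but dropped \x80, the second byte of \u0200.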
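To make the consequence concrete, here is a minimal sketch of what happens when the clipped bytes are decoded again, along with one possible way to truncate by byte count without leaving half a character behind. This is not the fix from the code review mentioned above, which the post does not show. The helper name truncate_utf8 is hypothetical, and using the 'ignore' error handler assumes the bytes came from a valid encode, so that only the tail can be incomplete. The code is written to run on both Python 2 and Python 3.3+.

```python
def truncate_utf8(text, max_bytes):
    """Hypothetical helper: clip text so its UTF-8 form fits in max_bytes."""
    encoded = text.encode('utf-8')
    # Slicing the bytes may cut a multi-byte character in half.
    clipped = encoded[:max_bytes]
    # 'ignore' silently drops the incomplete trailing sequence, if any.
    return clipped.decode('utf-8', 'ignore')


u = u'A string with some random unicode \u0200 \u0202. There it is.'
s = u.encode('utf-8')

# The naive byte slice ends in a lone \xc8, which is not valid UTF-8.
try:
    s[:35].decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# The helper drops the partial character instead of failing.
safe = truncate_utf8(u, 35)
```

With a 35-byte limit the helper returns the 34 ASCII characters and leaves out \u0200 entirely, because only one of its two bytes would fit. When the limit applies to characters rather than bytes, slicing the Unicode string before encoding avoids the problem altogether.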