Python Unicode Gotcha

I was peer reviewing a bug fix at work the other day and learned something about Python and encoded Unicode. It makes sense now, but when I first saw the fix, my reaction was that there was no way that could be what was going wrong. It turns out that once you've encoded a Unicode string (to UTF-8, say), slicing the result can chop off just some of the bytes of a single character.

>>> u = u'A string with some random unicode \u0200 \u0202. There it is.'
>>> u
u'A string with some random unicode \u0200 \u0202. There it is.'

Slicing while it's still a Unicode string behaves as expected, because each index here refers to a whole character.

>>> u[:35]
u'A string with some random unicode \u0200'
>>> u[:34]
u'A string with some random unicode '

Encode to UTF-8, though, and you need to be careful: slices now count bytes, and \u0200 takes two of them (\xc8\x80).

>>> s = u.encode('utf8')
>>> s
'A string with some random unicode \xc8\x80 \xc8\x82. There it is.'
>>> s[:35]
'A string with some random unicode \xc8'

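The encode method converts a Unicode string to a series of bytes, at which point indexes and slices operate on individual bytes and can cut through the middle of a multi-byte character. Here the slice kept \xc8 but dropped the \x80 that completes \u0200, so the result is no longer valid UTF-8.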
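
To make the consequence concrete, here is a minimal sketch of what happens when the truncated bytes are decoded again, along with one possible way to cut encoded text at a byte limit without splitting a character. The helper name truncate_utf8 is hypothetical (it is not from the original fix), and the approach of decoding with the 'ignore' error handler assumes the input was valid UTF-8 to begin with, so only the cut-off tail can be invalid.

```python
def truncate_utf8(text, max_bytes):
    """Hypothetical helper: encode text as UTF-8, keeping at most max_bytes
    bytes without leaving a partial character at the end."""
    encoded = text.encode('utf-8')
    if len(encoded) <= max_bytes:
        return encoded
    # Slice the bytes, then decode with 'ignore' so an incomplete trailing
    # sequence is dropped, and re-encode the characters that survived.
    return encoded[:max_bytes].decode('utf-8', 'ignore').encode('utf-8')


u = u'A string with some random unicode \u0200 \u0202. There it is.'
s = u.encode('utf-8')

# The naive byte slice from above ends in a lone \xc8, which cannot be decoded.
try:
    s[:35].decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# The helper backs off to the last complete character instead.
safe = truncate_utf8(u, 35)
```

With a limit of 35 bytes the helper returns only the 34-byte ASCII prefix, because the two-byte \u0200 does not fit. If the limit is really meant in characters rather than bytes, slicing the Unicode string before encoding avoids the problem entirely.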