I was trying to implement something really simple in Python. I wrote a short script… and it produced errors. Well, after a little research I found that the plain len() function doesn't count the characters in a string holding Unicode text; it counts the bytes. What crap.
Example:
>>> string = 'lksfwioerx'
>>> unicode = 'łóąśćńżźłę'
>>> len(string)
10
>>> len(unicode)
20
Another funny thing is that Python is said to be an object-oriented language. What object-oriented language needs a global function len() instead of x.len()?

Python 2.4.3 (#1, Jun 11 2009, 14:09:58)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> len("hello")
5
>>> len("здравствуйте")
24
>>> len(u"здравствуйте")
12
>>>
Note the 'u' before the correct unicode version. Without it, you're just reading a stream of bytes fed into Python via the terminal, and Python keeps it as a plain byte string (str) rather than decoded text, so len() counts bytes.
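For comparison, here is a minimal sketch of the same measurement in Python 3 (an assumption on my part, since the session above is Python 2.4), where text and bytes are separate types and the difference is explicit. The variable names are only illustrative.

```python
# Python 3 sketch: the same word measured as text and as UTF-8 bytes.
word = "здравствуйте"            # str: a sequence of Unicode code points
encoded = word.encode("utf-8")   # bytes: what the terminal handed to Python 2

print(len(word))      # counts code points
print(len(encoded))   # counts bytes; each Cyrillic letter takes two in UTF-8

# Decoding the bytes gives the original text back.
assert encoded.decode("utf-8") == word
```

So the 24 in the Python 2 session is the byte count of the UTF-8 encoding, and the 12 is the number of letters.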
@Jeff
Right… I've already found that in the documentation… but it means that I need to decide whether the string is unicode or not at the time of writing the code. I don't know Python that well. Should I do something magical with variables that I read from files or get from standard input? I want to have a script that uses unicode or not, depending only on the input (or maybe some environment variables).
Py3k fixes most of that oddness as it differentiates between text & binary data pretty cleanly. Look at the codecs module:
Test file:
-------------
[root@virtapi01 platform]# cat myfile
здравствуйте!!
[root@virtapi01 platform]#
Just reading: notice it's a str type.
---------------------------------------------
>>> type(open('myfile').read())
<type 'str'>
>>>
Using codecs: notice it's the correct unicode type.
----------------------------------------------------------------------------
>>> import codecs
>>> type(codecs.open('myfile', encoding='utf8').read())
<type 'unicode'>
>>>
You could always decode it yourself (as the byte stream *is* valid utf8 data):
---------------------------
>>> open('myfile').read().decode('utf8')
u'\u0437\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435!!\n\n'
And the reason the prints work directly on the console?
---------------------------------------------------------------------------------
>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> sys.stdout.encoding
'UTF-8'
>>>
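As for the question about input read from files: one common approach is to decode once, at the point where the data enters the script, and work with text from then on. The sketch below assumes the input is UTF-8; read_text is a hypothetical helper name, and the temporary file only stands in for myfile.

```python
import io
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Hypothetical helper: return the file's contents as text, not bytes."""
    # Decode once, at the boundary where data enters the program.
    with io.open(path, encoding=encoding) as handle:
        return handle.read()

# Demo: write the greeting as UTF-8 bytes, then read it back as text.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(u"здравствуйте!!\n".encode("utf-8"))
    text = read_text(path)
    raw_size = os.path.getsize(path)
finally:
    os.remove(path)

print(len(text))   # length in characters
print(raw_size)    # size on disk in bytes
```

io.open exists in Python 2.6 and later as well as in Python 3, so the helper is not tied to one version. In Python 3, sys.stdin already yields decoded text using the encoding reported by sys.stdin.encoding.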
HTH =) I guess I know this better than I thought I did. Interesting blog, btw.
Oh, and yes, it is goofy. Not defending it =)