

Python And Unicode… What a Crap

I was trying to implement something really simple in Python. I wrote a simple script… and there were some errors. Well, after a little research I found that the plain len() function doesn’t count Unicode characters in an ordinary string; it counts the bytes of the encoded text. What a crap.

Example:

>>> string = 'lksfwioerx'
>>> unicode = 'łóąśćńżźłę'
>>> len(string)
10
>>> len(unicode)
20
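
For comparison, here is a minimal sketch of the same check in Python 3. This assumes the session above ran under Python 2, where a plain string literal holds raw bytes. The names `text` and `encoded` are only illustrative, and the NFC normalization step is an extra precaution so that accented letters are stored as single code points.

```python
import unicodedata

# Python 3: str holds Unicode code points, bytes holds encoded bytes.
# Normalizing to NFC keeps each accented letter as one code point.
text = unicodedata.normalize("NFC", "łóąśćńżźłę")
encoded = text.encode("utf-8")  # the byte form a UTF-8 terminal would send

char_count = len(text)     # counts code points
byte_count = len(encoded)  # counts bytes

print(char_count)  # 10
print(byte_count)  # 20
```

Note that len() on text counts code points, not what a reader would call characters in every case: a letter built from a base character plus a combining mark still counts as two.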

Another funny thing is that Python is said to be an object-oriented language. What object-oriented language needs a global function len() instead of x.len()?
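
To be fair, the global function is a thin wrapper: as far as I can tell, len(x) asks the object itself through its `__len__` method. A small sketch, where the `Playlist` class is hypothetical and exists only to show the protocol:

```python
class Playlist:
    """Hypothetical container, used only to demonstrate the len() protocol."""

    def __init__(self, tracks):
        self._tracks = list(tracks)

    def __len__(self):
        # len(playlist) ends up calling this method.
        return len(self._tracks)


word = "lksfwioerx"
via_builtin = len(word)       # the global function
via_method = word.__len__()   # the method it delegates to

custom_length = len(Playlist(["intro", "verse", "outro"]))

print(via_builtin, via_method, custom_length)
```

So the object does decide its own length; the spelling is just a function call rather than x.len().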


Posted in programming, wtf.



4 Responses


  1. Jeff McNeil says

    Python 2.4.3 (#1, Jun 11 2009, 14:09:58)
    [GCC 4.1.2 20080704 (Red Hat 4.1.2-44)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> len("hello")
    5
    >>> len("здравствуйте")
    24
    >>> len(u"здравствуйте")
    12
    >>>

    Note the ‘u’ prefix on the correct unicode version. Without it, you’re just reading a stream of bytes fed into Python via the terminal, which Python 2 stores as a plain byte string (str) and counts byte by byte.

  2. Simon says

    @Jeff
    Right… I’ve already found that in the documentation… but it means that I need to decide whether the string is unicode or not at the time of writing the code. I don’t know Python that well: should I do something magical with variables that I read from files or get from standard input? I want to have a script that uses unicode or not, depending only on the input (or maybe some environment variables).

  3. Jeff McNeil says

    Py3k fixes most of that oddness as it differentiates between text & binary data pretty cleanly. Look at the codecs module:

    Test file:
    ----------
    [root@virtapi01 platform]# cat myfile
    здравствуйте!!
    [root@virtapi01 platform]#

    Just reading -- notice it’s a str type.
    ---------------------------------------
    >>> type(open('myfile').read())
    <type 'str'>
    >>>

    Using codecs -- notice it’s the correct unicode type.
    -----------------------------------------------------
    >>> import codecs
    >>> type(codecs.open('myfile', encoding='utf8').read())
    <type 'unicode'>
    >>>

    You could always decode it yourself (as the byte stream *is* valid utf8 data):
    ------------------------------------------------------------------------------
    >>> open('myfile').read().decode('utf8')
    u'\u0437\u0434\u0440\u0430\u0432\u0441\u0442\u0432\u0443\u0439\u0442\u0435!!\n\n'

    And the reason the prints work direct to the console?
    -----------------------------------------------------
    >>> import sys
    >>> sys.stdin.encoding
    'UTF-8'
    >>> sys.stdout.encoding
    'UTF-8'
    >>>

    HTH =) I guess I know this better than I thought I did. Interesting blog, btw.

  4. Jeff McNeil says

    Oh, and yes, it is goofy. Not defending it =)
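
On Simon’s question about input from files, the approach in Jeff’s codecs example can be sketched in Python 3 as "decode once, at the point where the bytes come in". This is only a sketch under assumptions: the helper names `read_text` and `read_bytes` are made up for illustration, and falling back to the locale’s preferred encoding is one possible reading of "depending on environment variables", not the only one.

```python
import locale
import os
import tempfile


def read_bytes(path):
    """Return the raw bytes of a file, with no decoding."""
    with open(path, "rb") as handle:
        return handle.read()


def read_text(path, encoding=None):
    """Return the file as text, decoding as it is read.

    If no encoding is given, fall back to the locale's preferred
    encoding (an assumption about what the environment should decide).
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Demo with a throwaway file holding the same greeting as above.
sample = "здравствуйте!!\n"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    raw = read_bytes(path)
    text = read_text(path, encoding="utf-8")
finally:
    os.remove(path)

print(type(raw).__name__, len(raw))    # bytes, longer than the text
print(type(text).__name__, len(text))  # str, one count per code point
```

The rest of the script then works on text only, so nothing downstream has to know which encoding the input arrived in.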


