Why Python Unicode Sucks

Besides concrete problems with the current status quo in Python Unicode, I think there's a general philosophical problem to the way Unicode is expected to be used in Python.

The convention in other languages is that you define boundaries, and you put thought into the encoding at those boundaries (maybe using some particular metadata like an HTTP header, maybe using convention or configuration -- there's no single rule). Then inside the boundaries there's the safe All-Unicode inner world.
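A minimal sketch of that boundary discipline, written for Python 3 (where str is the text type and bytes is the byte type). The function names and the choice of a Content-Type-style header as the metadata source are assumptions made for illustration, not any particular framework's API.

```python
def charset_from_content_type(header_value, default="utf-8"):
    """Pull the charset parameter out of a Content-Type-style header value."""
    for part in header_value.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            return value.strip('"')
    return default


def decode_at_boundary(raw_bytes, content_type):
    """Inbound boundary: bytes plus metadata in, text out."""
    return raw_bytes.decode(charset_from_content_type(content_type))


def encode_at_boundary(text, encoding="utf-8"):
    """Outbound boundary: text in, bytes out."""
    return text.encode(encoding)


def shout(text):
    """Inner-world code: only ever sees decoded text."""
    return text.upper() + "!"


raw = "caf\u00e9".encode("latin-1")          # bytes arriving from outside
text = decode_at_boundary(raw, "text/plain; charset=latin-1")
out = encode_at_boundary(shout(text))        # bytes leaving again, as UTF-8
```

Everything between the two boundary functions works on text only; the encoding decision is made exactly twice, once on the way in and once on the way out.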

This is a good solution for Java. Unfortunately it just doesn't work in Python, because you can't build good boundaries in Python. There are a few reasons:

Python is not statically typed. If it were, we could use the typing to make it very clear where those boundaries were, and what parts of the code required decoded strings. Adaptation-based type declarations would probably be just as effective here.
We have two kinds of strings; Java has one.
Those two kinds of strings act almost exactly like each other. This means duck typing does not work. If the two string objects had a very different set of methods and were not interchangeable, then the boundaries would become very clear at runtime. (This is a similar string-related wart in Python.) As it is, str objects can get deep into the system before a Unicode expectation causes an exception.
Byte (non-Unicode) strings are prevalent in Python code, both in the core and in libraries. If you only use mindfully-written code that deals with Unicode properly you are okay. This is fine for, say, Zope users. Or people who do all their work as XML transformations, since XML libraries are another place where Unicode is mindfully supported. But for people who don't live in a walled city of vetted code, this doesn't work.
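The "deep into the system" failure can be simulated. The sketch below runs on Python 3, which does not coerce implicitly, so the hypothetical helper py2_style_concat imitates what happens when a byte string meets a unicode string: the bytes are decoded as ASCII on the spot. The function and variable names are assumptions for illustration.

```python
def py2_style_concat(text, maybe_bytes):
    """Imitate implicit mixing: byte strings are decoded as ASCII on demand."""
    if isinstance(maybe_bytes, bytes):
        maybe_bytes = maybe_bytes.decode("ascii")  # the implicit coercion
    return text + maybe_bytes


def render_greeting(name):
    # Far from the input boundary; nobody checked what kind of string 'name' is.
    return py2_style_concat("Hello, ", name)


ascii_name = b"Ian"                        # ASCII-only data: works
utf8_name = "Zo\u00eb".encode("utf-8")     # same code path, non-ASCII data

ok = render_greeting(ascii_name)
try:
    render_greeting(utf8_name)
    failed = False
except UnicodeDecodeError:
    failed = True
```

The same call succeeds or fails depending only on the data, which is why tests written with ASCII names give no warning.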

If we got rid of str entirely and added a bytestring type (with a very different API than strings!) then the rest of Python's system would work. Duck typing would work. You wouldn't have to learn best practices through hard-won experience, and you wouldn't have to audit every piece of outside code you use for problems. You could handle Unicode safely and confirm it through unit testing and during the development process. But that's not where we are now; and as a result Python is very prickly and unfriendly when it comes to this issue.
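For comparison, Python 3 later took roughly this route: str is the text type, and bytes is a separate type that refuses to mix with it. A minimal sketch, with illustrative variable names, showing the mistake surfacing at the first point of contact even for pure-ASCII data:

```python
payload = b"Ian"   # pure ASCII bytes, the case that would otherwise slip through

try:
    greeting = "Hello, " + payload      # text + bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False                    # rejected immediately, whatever the data

# The only way through is an explicit decode at the boundary.
greeting = "Hello, " + payload.decode("ascii")

# The two types also stop sharing the conversion methods.
text_has_decode = hasattr("", "decode")
bytes_has_encode = hasattr(b"", "encode")
```

Because the failure no longer depends on the data, a unit test with any input at all is enough to catch it.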

Most criticisms of dynamic typing apply to this very case; and those criticisms are right. This is a case where dynamic typing leads very directly to difficult-to-predict and difficult-to-detect runtime bugs. Dynamic typing only works when you adhere to certain important principles -- one of those is that if objects are not interchangeable they should use differently-named methods.

As a stop-gap I think setdefaultencoding will paper over a lot of these issues. It's not perfect by any means. It's akin to being able to add numbers to strings, and having the numbers automatically coerced into strings in the process -- it's not the sort of feature you really want to introduce; it's clearly sloppy. But until Python 3.0 it's the best option I see.

Note: I don't really think Python Unicode sucks, even if it does annoy me sometimes.

Created 02 Aug '05
Modified 12 Aug '05

Comments:

As I mentioned in a comment to the previous post: I setdefaultencoding to utf8 and it pretty much seems to work. No use though for shared hosting environments for code you're distributing for others to use.

I don't think the python situation is so terrible - I've been tempted to look at Rails and/or Borges and/or Seaside lately; one of the things holding me back (apart from the sheer weirdness of the Squeak environment) is that I get the impression unicode handling is even flakier in Ruby & Squeak. That's purely from a few minutes' googling in both cases and I may be wrong.

# Alan Little

A few more minutes strolling through comp.lang.ruby strengthens my "it's all about the language's inventor's native language" theory - Ruby's BDFL is of course Japanese, and it seems one reason Ruby is behind python on unicode support (yes, really - or at least lots of Ruby people appear to think so) is the widely-held Japanese belief that Unicode is a tiny and inadequate character set. And for all I know they may be right, although it's good enough for all the languages I think I'm ever likely to be interested in so (just like ASCII-only American programmers) I don't care.

# Alan Little

I get the impression that some Japanese people are annoyed that their text is especially space-inefficient in UTF-8. But I don't quite get it -- disk is cheap, RAM doesn't fill up due to native-language text, and gzip encoding should mitigate the bandwidth issues.

# Ian Bicking

you can get a better sense of the issue starting here: http://en.wikipedia.org/wiki/Han_unification

# anonymous

At least in Ruby my impression is that they only have bytestrings, and a module iconv to convert between encodings. And it's just a wrapper around system libraries, which seems like it's opening up a whole can o' worms of cross-platform compatibility. I'll give credit to the people who handled Python Unicode -- they stepped up and took on a whole lot of work that I expect was both boring and tedious, to give us a very complete and reliable foundation.

This seems strange to me that Ruby isn't better, since Ruby is Japanese and you'd think they'd care even more about encodings. There's even a separate module for Japanese encodings -- Python appears to have better Japanese support than Ruby! It has 13 Japanese encodings built in. I'd tear my hair out if I had to deal with 13 encodings for one language. But at least, bald or not, I'd appreciate the support.

# Ian Bicking

It looks to me like the Ruby example is a case of the perfect being the enemy of the good. Matz, being a smart Japanese programmer and therefore much more keenly aware of Unicode's faults and failings than most people, is therefore reluctant to regard it and use it as the be all and end all of string handling, a la Java or C#. If you google a few of his comp.lang.ruby postings on the subject, it appears he has big ideas, maybe even a prototype, of some kind of better-than-unicode ultimate international text handling. But it Isn't Ready Yet (will it ever be?) and meanwhile Ruby struggles along with some kind of half-working implementation of utf8.

# Alan Little

Yeah, str really needs to just die already. Python 1.4 called, it wants its broken text handling back.

# Bob Ippolito

I'm sorry Ian, but this reads to me like knee-jerk abuse of Python's Unicode triggers knee-jerk abuse of dynamic typing. I disagree with you strongly on both points. I've worked with a lot of people who had the same initial complaints, and with a little bit of discipline and experience, they simply cease to have these sorts of problems, and they do not need to resort to such nasty hacks as changing sysdefault encoding.

And things are far from rosy in the statically typed Java world, BTW. Static typing does not save any language from the complexities of Unicode. Python is well ahead of Java in some respects. And what of that complexity of Unicode? My thoughts here:

http://copia.ogbuji.net/blog/2005-08-04/alt_unicod

TANSTAAFL.

# Uche

Having dynamic types does NOT mean that you can be sloppy about what your functions should expect and can handle. And this comment is not just valid for string types, it is for all types. If you're not thinking about what types you are working on, then you will suffer from those encode/decode errors. There is no fix for this. Once you delineate clearly what objects you are handling and you are publishing that you are handling, and you write your code accordingly, the Python unicode system is just fantastic.

One trick that you might want to use: use a variant of Hungarian notation to indicate the expected type, e.g.

uname = name.decode('latin-1')

name_u = name.decode('latin-1')

Sometimes when there is a chance for confusion, I even mark the encoded strings:

buffer_utf8 = ...

Maybe you could even have another suffix for when the object you're handling is _either_ of the string types. I used to suffer the same plight, until the day I decided to sit down and really understand how Unicode works, and then I made decisions in my source code to always think about which kinds of string I'm handling where. Now I never have any troubles anymore. Dynamic typing means that it is easy for you to make mistakes. Make decisions and add assertions in your code to ensure that you're moving the correct types between functions.

Also, dealing with Unicode strings is not as efficient as simple encoded strings (e.g. data), so both data types need to remain. This problem is thus not likely to go away.
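A minimal sketch of the suffix-plus-assertion habit, written for Python 3, where the decoded type is str (under Python 2 the check would be against unicode instead). The helper require_text and the function format_label are hypothetical names used only for illustration.

```python
def require_text(value, name="value"):
    """Fail fast if a byte string reaches code that expects decoded text."""
    assert isinstance(value, str), "%s must be text, got %s" % (
        name, type(value).__name__)
    return value


def format_label(name_u):
    # The _u suffix marks a decoded (text) string, per the naming convention.
    require_text(name_u, "name_u")
    return "Name: " + name_u


name = b"Andr\xe9"                  # raw bytes, Latin-1 encoded
name_u = name.decode("latin-1")     # decoded once, at the boundary
label_u = format_label(name_u)
```

Passing the undecoded name to format_label trips the assertion at the call rather than somewhere downstream. Note that assert statements are skipped when Python runs with -O, so this is a development-time check.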

To me all this ranting is just telling that you've been sloppy about which types you are working with. The problem is not Python, the problem is this habit that we all fall into at some point to not look at the problem straight in the face and to spend some time understanding all the details (granted, I suppose that's what you're doing now, but with a lot of blog noise...).

If you just take the habit to decide, everywhere, all the time, which types of string object you're accepting (str, Unicode, or string-or-unicode), your problems will go away.

# Martin

Dynamic typing only works when you adhere to certain important principles -- one of those is that if objects are not interchangeable they should use differently-named methods.

If you really meant that, you'd have to abandon polymorphism for good and all. If (classes of) objects were truly interchangeable, the only possible difference would be in performance. So since sets and bags are not interchangeable, we'd have to use addToSet and addToBag methods instead of just add methods in each one.

You've got a hold of a serious point here, but you are holding the stick at the wrong end. To begin with, a better title (if less blogospherically trendy) would be "Why Python 'str' Sucks".

# John Cowan

Abandoning polymorphism seems extreme. I read the originating comment as "if objects are not interchangeable, they should use differently-named methods for places where that difference lies".

Both sets and bags support addition, existence testing, held object counts, and the like. 'add' is thus a perfectly fine method for both, and usage via duck typing makes sense - if an add method exists, then I can use it. Where you run into problems is when there is a behavioral difference - a set can return only a single item, thus the idea of a 'countOfItem' method in a Set is silly. In a bag, it is not. Thus, the key behavioral difference shows up in a way visible to duck typing, and it is visible for those clients that care.
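The set-and-bag case can be sketched as follows. Bag and its count_of method are hypothetical, written here only to mirror the example; Python's built-in set stands in for the set.

```python
class Bag:
    """A tiny multiset: like a set, but it remembers how often an item was added."""

    def __init__(self):
        self._counts = {}

    def add(self, item):
        self._counts[item] = self._counts.get(item, 0) + 1

    def __contains__(self, item):
        return item in self._counts

    def count_of(self, item):
        return self._counts.get(item, 0)


def add_all(container, items):
    """Duck-typed: works with anything that has an add() method."""
    for item in items:
        container.add(item)
    return container


s = add_all(set(), ["a", "a", "b"])
b = add_all(Bag(), ["a", "a", "b"])

# The behavioral difference is visible as a missing method on the set.
set_can_count = hasattr(s, "count_of")
```

Shared behavior goes through the shared name add; the one place the two types differ has a name that only one of them answers to, so a client that cares finds out immediately.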

The problem with strings is that there are two different classes that have virtually identical method signatures, so client code interacting with these classes has a hard time knowing whether it will get something that behaves well with Unicode.

Scott

# Scott Ellsworth

The point was intended to include only dynamically typed languages (i.e. Ruby with its duck typing). In other words, the risk of naming incompatible methods with the same name only exists if the language will assume these similarly named methods denote polymorphic behavior. In contrast, with a statically typed language like Java, polymorphism will only be assumed if the classes are explicitly placed in a class hierarchy using inheritance. (Unless, of course that is bypassed using reflection, as with Java Beans).

# Keith Bennett

Ian Bicking: the old part of his blog
