Ian Bicking: the old part of his blog

Re: Do I hate Unicode, or Do I Hate ASCII?

That's why you should set the default encoding to something nonexistent. Automatic conversion from str to unicode or unicode to str is a source of endless bugs. Better to have everything explicit ;)

Comment on Do I hate Unicode, or Do I Hate ASCII?


But sometimes automatic conversion really is the Right Thing. For instance, let's say I have some code that does this:

def popup(href, title):
    return '<a href="%s" onclick="window.open(%r, \'_blank\')">%s</a>' % (
        href, str(href), title)

Without a u prefix on the string literal, this will raise an exception if title is Unicode and no default encoding is given. Is that the right behavior? No! This function works fine. This is not a boundary where encoding needs to be defined; in fact, it would suck if you had to encode the title before passing it in, because then you'd have to decode the result of the function as well, so you could re-encode it at the real boundary (when you serve the page). All this because of a missing u -- and that u is missing far more often than it is included.

If you can "fix" that function, you're okay. But there's too much code out there that does this now. I simply can't update all the code out there that uses bytestrings instead of Unicode.
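For reference, the "fixed" function is roughly the following. This is only a sketch of the one-character change being discussed, with the inner quotes escaped so the literal parses; the sample href and title in the usage note are made up.

```python
def popup(href, title):
    # The u prefix makes the template a Unicode string, so the result is
    # Unicode no matter what the default encoding is set to.
    return u'<a href="%s" onclick="window.open(%r, \'_blank\')">%s</a>' % (
        href, str(href), title)
```

Called as popup("/x", u"caf\u00e9"), this returns a Unicode string with the title intact, and nothing needs to be encoded until the page is actually served.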

# Ian Bicking

But that is the problem. Unicode difficulties always come when you try to mix 8-bit strings with unicode strings. A solution (and one of the best, in my opinion) is to do all string handling in unicode.

You read a string from a file? The first thing you do is convert it to unicode. You write a string to a file? Encode it. In between, always use unicode. If there is some unruly code somewhere, then it's better to ask the author of the code to correct it than to add some workaround somewhere. That is the safest solution because there are fewer hacks involved.
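A minimal sketch of that decode-early, encode-late pattern follows. The function names shout and process are made up for illustration, and utf-8 is only an assumed encoding; the non-ASCII characters are written as escapes so the snippet does not depend on the source file's own encoding.

```python
def shout(text):
    # Pure text processing: this only ever sees decoded (Unicode) strings.
    return text.upper()

def process(raw_in, encoding="utf-8"):
    # Boundary in: 8-bit string -> Unicode.
    text = raw_in.decode(encoding)
    result = shout(text)
    # Boundary out: Unicode -> 8-bit string.
    return result.encode(encoding)

# Bytes come in, bytes go out; everything in between is Unicode.
encoded_result = process(u"fa\u00e7ade".encode("utf-8"))
```

The encoding is named in exactly two places, so there is nothing for an implicit conversion to get wrong in the middle.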

If there really is some external code you can't change, then consider encapsulating that piece of code in an automatic convert/unconvert routine with utf8 as the encoding.
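Such a convert/unconvert routine could look something like the decorator below. It is a sketch under assumptions: utf8_boundary and legacy_link are hypothetical names, legacy_link merely stands in for external code that only handles 8-bit strings, and the type names are the Python 3 ones (str for Unicode, bytes for 8-bit strings).

```python
import functools

def utf8_boundary(func):
    """Wrap a bytes-only function so callers can pass and receive Unicode."""
    @functools.wraps(func)
    def wrapper(*args):
        # Encode Unicode arguments on the way in; leave anything else alone.
        encoded = [a.encode("utf-8") if isinstance(a, str) else a for a in args]
        result = func(*encoded)
        # Decode an 8-bit result on the way out.
        return result.decode("utf-8") if isinstance(result, bytes) else result
    return wrapper

def legacy_link(href, title):
    # Hypothetical stand-in for external code that works on 8-bit strings only.
    return b'<a href="' + href + b'">' + title + b"</a>"

link = utf8_boundary(legacy_link)
```

The rest of the program then calls link with Unicode and gets Unicode back, and the workaround stays in one place instead of being scattered around every call site.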

# anonymous

I have a hard time asking a library author to "fix" their library in this way. Because when they've fixed it for Unicode-using me, they've simultaneously broken it for everyone else.

# Ian Bicking

I always set my default string encoding to utf8 if I'm working in a Python environment I have control over, and having done that I don't recall having had any difficulties with library code that uses strings. Of course that's no use if you're in a shared hosting environment, or distributing code for others to use in their environments. The utterly stupid and bizarre only-at-startup way to set this really doesn't help.

It occurs to me that Dutch is one of only two ASCII-only languages I can think of off the top of my head (the other one is Italian). This might explain a lot.

# Alan Little

Not even Italian and Dutch are absolutely ASCII. Italian makes use of accented vowels quite a lot (e.g. á) and Dutch technically needs the trema (e.g. tetraëder); the IJ (http://en.wikipedia.org/wiki/%C4%B2) is usually written as two letters when using computers, but typewriters still have this as its own ligature.

So, unless I'm missing something, English is the only pure ASCII language there is (except Latin, of course).

# Philipp von Weitershausen

You think English is a "pure ASCII" language?? Nonsense!

What about 'façade' or 'rôle', or 'résumé', all perfectly good English words!


# Francis Tyers