Ian Bicking: the old part of his blog

Re: Do I hate Unicode, or Do I Hate ASCII?

I used to have the same annoyances, but then I learned to stop worrying and love the exceptions :) Or, more prosaically, faced with a large and complex project that required use of Unicode, I sat down and worked out how Python's support for it operates. The conclusion I came to is that Python is actually one of the better (if not the best) languages at handling Unicode, as long as you work with it. That is, it presupposes a particular way of handling strings. The faults are more in the explanations and documentation.

The project of which I speak is driven by a vast MySQL database, in which [essentially] all strings are Unicode. Since Python tends to promote non-Unicode strings to Unicode as required (in string operations and the like), that means that any textual object must be considered as being Unicode. They come to Python code (via MySQLdb) as Unicode objects. To the console and to files, they are sent as UTF-8-encoded strings.
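As a rough illustration of that flow (decode on the way in, work in Unicode, encode to UTF-8 on the way out), here is a minimal sketch. It assumes Python 3, where `str` plays the role of the Unicode objects described above, and the helper names `read_text` and `write_text` are made up for the example; no database is involved.

```python
import io


def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode once, at the point where bytes enter the program."""
    return raw.decode(encoding)


def write_text(stream, text: str) -> None:
    """Encode once, at the point where text leaves the program."""
    stream.write(text.encode("utf-8"))


# Stand-in for bytes arriving from a file, socket, or database driver.
incoming = "Grüße, мир, 日本語".encode("utf-8")

text = read_text(incoming)
label = "Greeting: " + text  # all internal string work happens on Unicode text

# Stand-in for a file or console opened in binary mode.
out = io.BytesIO()
write_text(out, label)
```

Everything between the two helpers only ever sees Unicode text, which is the whole point of the discipline.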

The general rules I've followed are:

(a) Use a console environment that supports UTF-8 characters (in my case, PuTTY ssh sessions to Linux boxen). Thus one can print any string that's encoded in UTF-8 to stdout in Python code.

(b) Assume that all strings are Unicode; when using "print", convert to UTF-8. Since non-Unicode strings also have an "encode" method, this makes life easier: converting an ASCII string to UTF-8 is essentially a no-op. Only encode when writing out from code to files or the console. Only decode when reading in.

(c) When creating or reading files, know what encoding format you're handling. That should be as much an attribute of any defined file format as the line-endings or use of Ctrl-Z/EOF. I use byte-order marks (BOMs), as does Windows, via the constants in the "codecs" module to ensure that I tag files appropriately. Writing a generic file reader that spots BOMs and decodes appropriately is an easy task.

(d) Keep in mind that "the console and the world are in ASCII" is a falsehood that will bite you as much as "everyone in the world speaks English" does :)

(e) UTF-8 is your friend: it'll handle encoding of any Unicode character and, if you can't meet rule (a), its output is still more-or-less printable, though you won't see the strings as intended.

(f) The "default encoding" is your enemy. You can't rely on it: it only takes effect in some circumstances, and it may bear no relation at all to what the console can or cannot handle.
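The "generic file reader" from rule (c) might look something like the sketch below. It is written for Python 3 and uses the real BOM constants from the "codecs" module; the function names and the UTF-8 fallback for untagged data are assumptions made for the example, not a description of the original code.

```python
import codecs

# Longer marks are checked first: the UTF-32 LE mark starts with the
# same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(raw: bytes, fallback: str = "utf-8") -> str:
    """Strip a leading BOM if present and decode with the matching codec."""
    for bom, encoding in _BOMS:
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    # No mark found: fall back to an assumed encoding (hypothetical default).
    return raw.decode(fallback)


def read_tagged_file(path, fallback="utf-8"):
    """Read a whole file as bytes and decode it according to its BOM."""
    with open(path, "rb") as f:
        return decode_with_bom(f.read(), fallback)


def write_tagged_file(path, text):
    """Write text as UTF-8, tagged with a UTF-8 BOM."""
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + text.encode("utf-8"))
```

Reading the file in binary mode and decoding by hand keeps the decision about encoding explicit, rather than leaving it to whatever default the platform picks.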

Given the above, I have never had problems with rogue Unicode[De|En]codeErrors, and we handle all Western languages plus (recently) Russian and Japanese. Generally, the only times they occur are where I've found an old "print" line that spits some object out without encoding it first.
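The kind of stray error described here can be reproduced in isolation. The sketch below assumes Python 3, with an explicit "ascii" encode standing in for the implicit default-encoding conversion that an un-encoded "print" could trigger; `try_encode` is a hypothetical helper written only for the demonstration.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


russian = "Привет"

# ASCII cannot represent Cyrillic, so this is the rogue-error case.
ascii_result = try_encode(russian, "ascii")

# UTF-8 can encode any Unicode character.
utf8_result = try_encode(russian, "utf-8")

# For pure-ASCII text, UTF-8 encoding yields exactly the same bytes as ASCII.
plain_same = try_encode("hello", "utf-8") == try_encode("hello", "ascii")
```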

Comment on Do I hate Unicode, or Do I Hate ASCII?
by Ben Last

Comments:

No, I don't, but I do hate all things Microsoft (micro as in small, soft as per brain type).

All I want is this: when I open Word and it tells me it needs to convert a file, I select OEM United States. I then want this selection to be the default for all the following files until I end the Word session.

How can I do this ?

Hope one of U Gurus can help.

I have 'Googled' for the answer; no help.

Paul

# Paul Suret