Ian Bicking: the old part of his blog

The Illusive setdefaultencoding

So... thinking some more about my Unicode woes, I think UTF-8 is the Right Default Encoding For Me. I think it will solve a large number of my problems.

If you set the default encoding to UTF-8, things like str(u'\u0100') actually work (and give you the encoded version). If you concatenate the result ('\xc4\x80') to a Unicode string, the string is automatically decoded and it works perfectly. This is what I want! UTF-8, being a superset of ASCII, happens to be the encoding I'm already using in my source code. I'm perfectly happy moving as many of my external data sources to UTF-8 as possible. I'll set AddDefaultCharset in Apache, I'll fiddle with my database, whatever. In those cases where I can't, I'll just have to carefully decode the data, but I have to do that anyway. To the degree I can make my systems and communications consistently UTF-8, things will just get better. I really don't see a downside.
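To make the byte-level claim concrete, here is a small sketch of the same round trip written with explicit conversions instead of a changed default encoding. The variable names are illustrative only, and the explicit calls are a stand-in for what a UTF-8 default would do implicitly.

```python
# U+0100 (LATIN CAPITAL LETTER A WITH MACRON) takes two bytes in UTF-8.
text = u'\u0100'

# Explicit equivalent of str(u'\u0100') under a UTF-8 default encoding.
encoded = text.encode('utf-8')

# Explicit equivalent of the automatic decode that happens on concatenation.
decoded = encoded.decode('utf-8')
combined = decoded + u' and more text'

print(repr(encoded))
print(repr(combined))
```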

But why does Python make it SO DAMN HARD to change my encoding? I don't understand this at all. There is a function sys.setdefaultencoding, but site.py (which is loaded on Python startup) deletes the function. I feel like someone decided they were smarter than me, but I'm not sure I believe them.
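A quick way to see this from inside the interpreter is sketched below. It only inspects the sys module and changes nothing. On a stock Python 2 the default is normally 'ascii' and the setter is gone by the time your code runs; on Python 3 the default is 'utf-8' and the setter does not exist at all. The variable names are illustrative.

```python
import sys

# What implicit str <-> unicode conversions will use.
current_default = sys.getdefaultencoding()

# False after a normal startup, because site.py has already removed it.
setter_available = hasattr(sys, 'setdefaultencoding')

print(current_default)
print(setter_available)
```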

From what I can tell, there are three ways to fix the default encoding:

There's some discussion in the comments here. This post suggests running reload(sys) to restore setdefaultencoding, which is very clean to enable (none of this site crap) but reloading sys scares me a bit.
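For reference, the reload(sys) approach looks roughly like the sketch below. It is guarded so that it only runs on Python 2, where reload is a builtin; on Python 3 the block is skipped because the default is already UTF-8. Treat it as an illustration of the trick, not a recommendation: reloading sys can also undo replacements that a host environment has made to attributes such as sys.stdout.

```python
import sys

if sys.version_info[0] == 2:
    # Python 2 only: reloading sys restores the setdefaultencoding
    # function that site.py deleted during startup.
    reload(sys)
    sys.setdefaultencoding('utf-8')

default_encoding = sys.getdefaultencoding()
print(default_encoding)
```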

And searching about, I didn't see one justification for why doing any of this is bad, just references to it being a hack, which is not very convincing. Are people claiming that there should be no default encoding? As long as we have non-Unicode strings, I find that argument less than convincing. I think it reflects the perspective of people who take Unicode very seriously, as compared to programmers who aren't quite so concerned but just want their applications to not be broken, and the current status quo is very deeply broken.

Created 02 Aug '05

Comments:

In Python 2.1, setdefaultencoding doesn't work any later than that, because some aspect of the encoding is already nailed into sys.stdin, sys.stdout, etc. Later versions of Python incrementally improved this.

# Mark Eichin

"I don't understand this at all. There is a function sys.setdefaultencoding, but site.py deletes the function"?

# Ciezarowe

"Are people claiming that there should be no default encoding?"

No. We're claiming that there should be one fixed default encoding that's used when mixing 8-bit and Unicode strings. And that's how things are, really.

When the Unicode type was added, people disagreed on what the encoding should be (ASCII, ISO-8859-1, or UTF-8), so the setdefaultencoding hook was added so we could play with it. Unfortunately, nobody got around to removing it before the release.

(to me, arguing that it's a good thing that you can use a global setting to control what a+b does when a is an 8-bit string and b is a unicode string is about as silly as arguing that it would be a good thing to have a global setting for controlling what a+b does if a is an integer and b is a string. if you want to convert between different logical types (encoded data and text are different things), use an explicit conversion.)
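The explicit-conversion approach can be sketched as follows. The helper join_text is hypothetical, introduced only for illustration: it names the encoding at the point where bytes become text, so UTF-8 and Latin-1 data can coexist in one program, which a single global default cannot offer.

```python
def join_text(raw, text, encoding='utf-8'):
    """Decode raw bytes with a named encoding, then append text to it."""
    return raw.decode(encoding) + text

utf8_bytes = u'\u0100'.encode('utf-8')      # two bytes
latin1_bytes = u'\u00e9'.encode('latin-1')  # one byte, not valid UTF-8 alone

from_utf8 = join_text(utf8_bytes, u'!')
from_latin1 = join_text(latin1_bytes, u'!', encoding='latin-1')

# Decoding the Latin-1 byte as UTF-8 fails, whatever the default is.
try:
    latin1_bytes.decode('utf-8')
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
```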

# Fredrik

Elusive indeed. I just spent the better part of a day trying to figure out why using zipfile.writestr(string) on UTF-8 encoded strings was giving me a UnicodeDecodeError (I'm still relatively new at Python). It was actually binascii.crc32(bytes) that was complaining. Since I don't have root access, I can't edit lib/site-packages/sitecustomize.py. I tried putting sys.setdefaultencoding('utf-8') in a file in my working directory. At first, it wouldn't let me access sys.setdefaultencoding, but then I added '.' to my PYTHONPATH and that finally did it. But what happens when I'm zipping up Latin-1 encoded files? I would like to be able to set the default encoding from within my program. I wonder what would be the danger in allowing that? Right now, the only ways to do that are the three methods mentioned above. None of these sound satisfactory to me.
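One way to sidestep the default encoding in this situation, sketched under the assumption that the data starts out as a Unicode string, is to encode it explicitly and hand only bytes to binascii.crc32 and ZipFile.writestr. The archive is kept in memory here, and the names note.txt, text, and payload are illustrative. For Latin-1 content the same pattern applies with 'latin-1' as the encoding name.

```python
import binascii
import io
import zipfile

text = u'caf\u00e9'
payload = text.encode('utf-8')   # bytes, so no implicit conversion is needed

# crc32 operates on bytes; the mask gives an unsigned value on any version.
checksum = binascii.crc32(payload) & 0xffffffff

# Write the encoded bytes into an in-memory zip archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('note.txt', payload)

# Read the entry back and decode it with the same encoding.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as archive:
    restored = archive.read('note.txt').decode('utf-8')
```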

# Justin

I doubt that the quotes in the OP indicate a literal string.

# Öffentliche Aufträge aus Polen

Thank you for posting this. Very helpful. I wound up modifying site.py in python2.4 by changing encoding from "ascii" to "utf-8" in the setencoding() function. Voila! utf-8 from python command line.

    tn$ python
    Python 2.4.4 (#1, Oct 18 2006, 10:34:39)
    [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'utf-8'
    >>>

Then I needed to change the Pydev editor encodings to UTF-8 (Window->Preferences->General->Workspace->Text file encoding in Eclipse 3.2.1). Then I needed to change the run settings in Pydev (Window->Run->Common (tab)->Console Encoding) to UTF-8. Works perfectly now.

Thanks again. Not sure why that was so difficult though...

# Todd