String hash vs. Unicode hash

This is odd...

>>> d = {'test': None}
>>> d[u'test'] = 1
>>> d
{'test': 1}
>>> d = {u'test': None}
>>> d['test'] = 1
>>> d
{u'test': 1}

I guess it makes sense, but it's tricky. 'test' == u'test'; but if you feel Unicode strings are different from byte strings (str), then this is no help. But here's a problem with setdefaultencoding:

>>> import sys
>>> reload(sys)
>>> sys.setdefaultencoding('utf-8')
>>> s = u'\u0100'
>>> str(s)
'\xc4\x80'
>>> str(s) == s
True
>>> hash(str(s)), hash(s)
(1207774670, -1591639807)
>>> d = {s: None}
>>> d[str(s)] = 1
>>> d
{u'\u0100': None, '\xc4\x80': 1}

The strings are equal, but they don't hash equally, so the dictionary (being a hash table) stores both and never notices that they are equal. The equality itself is not surprising: comparison is aware of the default encoding (the byte string is decoded before it is compared with the unicode string). In fact, you get a UnicodeDecodeError if you compare a unicode string with a byte string that can't be decoded in the default encoding. (I know exactly why the exception is raised, and maybe I even see how it's a good idea, but how can you not find it disturbing that these two objects can't be safely compared when almost all other objects, no matter how different in type, can be?)
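To make that decode-then-compare step concrete, here is a minimal sketch. The helper name loose_equal and its encoding parameter are made up for illustration; it only imitates the implicit coercion described above and is not the interpreter's actual code. It is written so that it also runs on Python 3, where the coercion no longer happens implicitly.

```python
def loose_equal(byte_string, text, encoding='ascii'):
    """Hypothetical helper: decode the byte string, then compare as text.

    Raises UnicodeDecodeError when the bytes are not valid in `encoding`.
    """
    return byte_string.decode(encoding) == text


# Decodable bytes compare as expected.
print(loose_equal(b'test', u'test'))                  # True
print(loose_equal(b'\xc4\x80', u'\u0100', 'utf-8'))   # True

# Undecodable bytes make the comparison itself fail.
try:
    loose_equal(b'\xc4\x80', u'\u0100')               # 'ascii' cannot decode 0xc4
except UnicodeDecodeError as exc:
    print('comparison failed: %s' % exc.__class__.__name__)
```

The last call is the disturbing case: whether two objects can be compared at all depends on the encoding in effect, not just on the objects.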

Oh, but I was talking about hashes. Well, the hash algorithm for strings apparently isn't aware of default encodings. (Just in case this was specific to the reload(sys) hack, I also tested it with a change to site.py.) Note that hashing does work for ASCII-encodable Unicode strings (e.g., hash('foo') == hash(u'foo')).

Your first example is not odd and the behavior has hardly anything to do with str and unicode.

It is a property of Python's dictionaries that if you set a new key-value pair and the key is equal to an existing key, then the value in the dictionary is changed but the old key is not replaced with the new (equal) one. The last time I asked about this, I believe Tim Peters said it wasn't intended to be part of the interface for dict, but that it was a very reasonable implementation accident and that most things implementing the mapping interface did the same.
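The same thing shows up with any pair of equal keys, no strings involved. A small sketch using 1 and 1.0, which compare equal and hash equally:

```python
d = {1: 'old'}
d[1.0] = 'new'              # 1.0 == 1 and hash(1.0) == hash(1)

# The value was replaced, but the original key object (the int) stayed.
key = list(d.keys())[0]
print(d)                    # {1: 'new'}
print(type(key).__name__)   # int
```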

The only thing that has anything to do with strings and unicode is that "test" == u"test" -- I find this a highly desirable property. With the default encoding, hash("test") == hash(u"test") as well, as it ought to.

---

The bit about getting a UnicodeDecodeError when comparing byte strings to unicode strings doesn't bother me much. There are other incomparable things (complex numbers, for example, and classes that override __eq__), and these rarely cause problems in practice. In the real world, no one other than the garbage collector goes around comparing random things to each other, and the GC only cares about object identity. This is a little worse because strings and unicode strings often get mixed up together randomly, but when that happens it is usually because we have a collection of "string things" (not binary data), some of which managed to fit into str and some of which had characters outside the current encoding and thus wound up as unicode. In such a situation the str objects would all be convertible to unicode and the exception wouldn't occur.

If someone really needs to start doing random compares between some uncategorized objects some of which are unicode strings and others of which are str objects containing BINARY data, then this person can just catch the UnicodeDecodeError... it won't be the worst of their problems.

---

Finally, I think you're onto something when you point out that after changing the default encoding "xxx" == u"xxx" for all xxx but hash("xxx") != hash(u"xxx") for some xxx. I would say that's a bug... equality should imply equal hashes, that's part of the contract for hashing.
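To show what breaking that contract does to a dictionary, here is a minimal sketch. EncodedText is a hypothetical wrapper invented for this example (it is not how str or unicode are implemented): its equality decodes bytes as UTF-8 first, while its hash ignores the encoding, which is the same mismatch in miniature.

```python
class EncodedText(object):
    """Hypothetical wrapper: equality decodes bytes as UTF-8, hashing does not."""

    def __init__(self, raw):
        self.raw = raw  # either a byte string or a unicode string

    def _text(self):
        # Decode byte strings; unicode strings are used as they are.
        if isinstance(self.raw, bytes):
            return self.raw.decode('utf-8')
        return self.raw

    def __eq__(self, other):
        return self._text() == other._text()

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # Deliberately inconsistent with __eq__: hashes the undecoded payload.
        return hash(self.raw)


a = EncodedText(u'\u0100')
b = EncodedText(b'\xc4\x80')

d = {a: None}
d[b] = 1

print(a == b)    # True
print(len(d))    # 2: the hashes differ, so the dict never compares the keys
```

Making __hash__ return hash(self._text()) instead would restore the contract, at the cost of making the hash depend on the encoding, which is exactly the worry raised next.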

So should we fix it? I tend to think we should. The only worry I have is that this means the hash of a str or unicode object might need to depend on the string's content AND on the current default encoding. That's OK if the default encoding can't change during the lifetime of a Python interpreter, but a definite no-no if it can. And as you pointed out, it's HARD to change at runtime, but not impossible.

What to do?

-- Michael Chermside

Ian, all this talk about sys.setdefaultencoding() is really not very productive: sys.setdefaultencoding() was added to Python as part of a compromise in choosing a default encoding and originates from a time when we considered using the locale to determine the default encoding. It is not supported and was only left in the language to allow experiments (see the site.py module) - if you change the default encoding in Python, you're on your own. Expect problems with all kinds of things.

For the historical details, read up in the python-dev archives of the year 2000 when Unicode support was added to the language. A quick overview is included in a talk I gave at the EuroPython conference in 2002: http://www.egenix.com/files/python/EuroPython2002-Python-and-Unicode.pdf

On the subject: we took great care to make sure that ASCII Unicode gives the same hash value as an ASCII string. This does not extend to non-ASCII characters, regardless of the default encoding.

Ian Bicking: the old part of his blog
