Ian Bicking: the old part of his blog

Re: String hash vs. Unicode hash

Your first example is not odd and the behavior has hardly anything to do with str and unicode.

It is a property of Python's dictionaries that if you set a new key-value pair and the key is equal to an existing key, then the value in the dictionary is changed but the old key is not replaced with the new (equal) one. The last time I asked about this, I believe Tim Peters said it wasn't intended to be part of the interface for dict, but that it was a very reasonable implementation accident and that most things implementing the mapping interface did the same.
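
A minimal sketch of that dictionary behavior. It uses 1 and 1.0 as the pair of equal-but-distinct keys instead of "test" and u"test", on the assumption that the reader is running a current Python where str and unicode are no longer separate types; the mechanism should be the same.

```python
# Two keys that compare equal and hash equal, but are objects of different types.
d = {1: "first"}
d[1.0] = "second"   # 1.0 == 1 and hash(1.0) == hash(1)

# The value was replaced, but the original key object was kept.
stored_key = next(iter(d))
print(d)                  # {1: 'second'}
print(type(stored_key))   # <class 'int'>
```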

The only thing that has anything to do with strings and unicode is that "test" == u"test" -- I find this a highly desirable property. With the default encoding, hash("test") == hash(u"test") as well, as it ought to.

---

The bit about getting a UnicodeDecodeError when comparing byte strings to unicode strings doesn't bother me much. There are other incomparable things (complex numbers, for example, and classes that override __eq__) and these rarely cause problems in practice. In the real world, no one other than the garbage collector goes around comparing random things to each other, and the GC only cares about object identity. This is a little worse because strings and unicode strings often get mixed up together randomly, but when that happens it is usually because we have a collection of "string things" (not binary data), some of which managed to fit into str and some of which had characters outside the current encoding and thus wound up as unicode. In such a situation the str objects would all be convertible to unicode and the exception wouldn't occur.

If someone really needs to start doing random compares between some uncategorized objects some of which are unicode strings and others of which are str objects containing BINARY data, then this person can just catch the UnicodeDecodeError... it won't be the worst of their problems.
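
A rough sketch of what "just catch the UnicodeDecodeError" might look like. The helper name loose_equal and its encoding parameter are made up for illustration. It also assumes a current Python, where bytes and text are never decoded implicitly during comparison, so the decode step is written out explicitly.

```python
def loose_equal(a, b, encoding="ascii"):
    """Compare two values, decoding bytes only when the other side is text.

    Hypothetical helper: the explicit decode stands in for the implicit
    conversion that used to happen when str met unicode.
    """
    if isinstance(a, bytes) and isinstance(b, str):
        a, b = b, a  # normalize so that a is the text and b is the bytes
    if isinstance(a, str) and isinstance(b, bytes):
        try:
            b = b.decode(encoding)
        except UnicodeDecodeError:
            # Binary data that is not valid text cannot equal a text string.
            return False
    return a == b
```

With this, loose_equal(b"test", "test") is true, while comparing real binary data such as b"\xff\xfe" against text quietly reports "not equal" instead of raising.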

---

Finally, I think you're onto something when you point out that after changing the default encoding, "xxx" == u"xxx" for all xxx but hash("xxx") != hash(u"xxx") for some xxx. I would say that's a bug: equality should imply equal hashes; that's part of the contract for hashing.
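
To show why the contract matters, here is a small sketch with made-up classes (Text, BrokenText and FixedText are not from any library). BrokenText plays the role of strings after the default encoding has been changed: two objects compare equal but hash differently, and dictionary lookup stops finding them.

```python
class Text:
    """Hypothetical wrapper whose equality ignores case."""

    def __init__(self, s):
        self.s = s

    def __eq__(self, other):
        return isinstance(other, Text) and self.s.lower() == other.s.lower()


class BrokenText(Text):
    # Violates the contract: equal objects can hash differently,
    # because the hash looks at the raw first character.
    def __hash__(self):
        return ord(self.s[0])


class FixedText(Text):
    # Honors the contract: the hash is computed from the same
    # normalized form that __eq__ compares.
    def __hash__(self):
        return hash(self.s.lower())


equal_anyway = BrokenText("abc") == BrokenText("ABC")        # True
broken_found = BrokenText("ABC") in {BrokenText("abc"): 1}   # False
fixed_found = FixedText("ABC") in {FixedText("abc"): 1}      # True
```

The dictionary compares hashes before it ever calls __eq__, so an equal key with a different hash is simply never found.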

So should we fix it? I tend to think we should. The only worry I have is that this means that the hash of a string or unicode might need to depend on the string's content AND on the current default encoding. That's OK if the default encoding can't change during the lifetime of a Python interpreter, but a definite no-no if the default encoding can change. And as you pointed out, it's HARD to change at runtime, but not impossible.

What to do?

-- Michael Chermside

Comment on String hash vs. Unicode hash
by Michael Chermside