So... thinking some more about my Unicode woes, I think UTF-8 is the Right Default Encoding For Me. I think it will solve a large number of my problems.
If you set the default encoding to UTF-8, things like str(u'\u0100') actually work (and give you the encoded version). If you concatenate the result ('\xc4\x80') to a Unicode string, the byte string is automatically decoded and it works perfectly. This is what I want! UTF-8, being a superset of ASCII, happens to be the encoding I'm already using in my source code. I'm perfectly happy moving as many of my external data sources to UTF-8 as possible. I'll set DefaultEncoding in Apache, I'll fiddle with my database, whatever. In those cases where I can't, I'll just have to carefully decode the data, but I have to do that anyway. To the degree I can make my systems and communications consistently UTF-8, things will just get better. I really don't see a downside.
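To make the round trip above concrete, here is a minimal sketch that spells out explicitly the encode and decode steps a UTF-8 default would perform implicitly. The variable names are just for illustration, and the snippet assumes an interpreter that accepts both u'' and b'' literals.

```python
# Explicit version of what a UTF-8 default encoding would do behind the scenes.
text = u'\u0100'  # LATIN CAPITAL LETTER A WITH MACRON

# With a UTF-8 default, str(text) would hand back these two bytes.
encoded = text.encode('utf-8')

# Mixing the byte string with Unicode text needs a decode step;
# a UTF-8 default would run this decode implicitly.
combined = u'name: ' + encoded.decode('utf-8')

# With the stock ASCII default, the encode step fails instead.
try:
    text.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The ASCII failure at the end is the error that the default encoding setting is meant to make go away.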
But why does Python make it SO DAMN HARD to change my encoding? I don't understand this at all. There is a function sys.setdefaultencoding, but site.py (which is loaded on Python startup) deletes the function. I feel like someone decided they were smarter than me, but I'm not sure I believe them.
From what I can tell, there are three ways to fix the default encoding:
There's some discussion in the comments here. This post suggests running reload(sys) to restore setdefaultencoding, which is very clean to enable (none of this site crap) but reloading sys scares me a bit.
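For reference, the reload(sys) trick looks roughly like the sketch below. The helper name force_utf8_default is made up for this example, and the version guard is my own addition: reload() as a builtin and sys.setdefaultencoding only exist on Python 2, so on any other interpreter the helper just reports the current default.

```python
import sys


def force_utf8_default():
    """Hypothetical helper: make UTF-8 the process-wide default encoding."""
    if sys.version_info[0] == 2:
        # site.py deleted sys.setdefaultencoding at startup;
        # reloading the module brings the attribute back.
        reload(sys)  # builtin on Python 2 only
        sys.setdefaultencoding('utf-8')
    return sys.getdefaultencoding()


default = force_utf8_default()
```

This is a sketch of the hack, not an endorsement of it; the reload affects the whole process, which is part of why it makes me nervous.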
And searching around, I didn't find a single justification for why doing any of this is bad, just references to it being a hack, which is not very convincing. Are people claiming that there should be no default encoding? As long as we have non-Unicode strings, I find that argument less than convincing. I think it reflects the perspective of people who take Unicode very seriously, as compared to programmers who aren't quite so concerned but just want their applications to not be broken; and the status quo is very deeply broken.
In Python 2.1, setdefaultencoding doesn't work any later than that, because some aspect of the encoding is already nailed into sys.stdin, sys.stdout, etc. Later versions of Python incrementally improved this.
"I don't understand this at all. There is a function sys.setdefaultencoding, but site.py deletes the function?"
"Are people claiming that there should be no default encoding?"
No. We're claiming that there should be one fixed default encoding that's used when mixing 8-bit and Unicode strings. And that's how things are, really.
When the Unicode type was added, people disagreed on what the encoding should be (ASCII, ISO-8859-1, or UTF-8), so the setdefaultencoding hook was added so we could play with it. Unfortunately, nobody got around to removing it before the release.
(to me, arguing that it's a good thing that you can use a global setting to control what a+b does when a is an 8-bit string and b is a unicode string is about as silly as arguing that it would be a good thing to have a global setting for controlling what a+b does if a is an integer and b is a string. if you want to convert between different logical types (encoded data and text are different things), use an explicit conversion.)
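A small sketch of what "use an explicit conversion" can look like in practice. The helper join_text and the sample byte strings are invented for illustration; the point is only that the caller names the encoding at the place where bytes become text.

```python
def join_text(prefix, raw, encoding):
    """Hypothetical helper: decode raw bytes explicitly, then concatenate."""
    return prefix + raw.decode(encoding)


utf8_bytes = b'\xc4\x80'   # U+0100 encoded as UTF-8
latin1_bytes = b'\xe9'     # U+00E9 encoded as Latin-1

from_utf8 = join_text(u'got: ', utf8_bytes, 'utf-8')
from_latin1 = join_text(u'got: ', latin1_bytes, 'latin-1')

# The same bytes read under a different encoding give different text,
# which is why a single global guess cannot be right for every source.
misread = utf8_bytes.decode('latin-1')
```

Here misread comes out as two characters rather than one, with no error raised, which is the kind of silent damage an explicit decode makes visible at the call site.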
Elusive indeed. I just spent the better part of a day trying to figure out why using zipfile.writestr(string) on UTF-8 encoded strings was giving me a UnicodeDecodeError (I'm still relatively new at Python). It was actually binascii.crc32(bytes) that was complaining. Since I don't have root access, I can't edit lib/site-packages/sitecustomize.py. I tried putting sys.setdefaultencoding('utf-8') in a file in my working directory. At first, it wouldn't let me access sys.setdefaultencoding, but then I added '.' to my PYTHONPATH and that finally did it. But what happens when I'm zipping up Latin-1 encoded files? I would like to be able to set the default encoding from within my program. I wonder what the danger would be in allowing that? Right now, the only ways to do that are the three methods mentioned above. None of these sounds satisfactory to me.
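One way to sidestep the implicit conversion here, sketched under the assumption that the data starts out as Unicode text: encode it explicitly, per file, before handing it to writestr. The file names and sample text below are made up, and the archive is written to an in-memory buffer so the sketch touches no real files.

```python
import binascii
import io
import zipfile

text = u'caf\xe9'

# Choose the encoding per file instead of relying on a process-wide default.
utf8_data = text.encode('utf-8')
latin1_data = text.encode('latin-1')

buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('utf8.txt', utf8_data)      # byte strings need no implicit decode
    zf.writestr('latin1.txt', latin1_data)

with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as zf:
    stored_utf8 = zf.read('utf8.txt')
    stored_latin1 = zf.read('latin1.txt')
    stored_crc = zf.getinfo('utf8.txt').CRC

# crc32 is computed over bytes; mask to an unsigned value for comparison.
crc = binascii.crc32(utf8_data) & 0xffffffff
```

Because each writestr call receives bytes that were encoded on purpose, UTF-8 and Latin-1 files can sit in the same archive without any global setting being involved.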
I doubt that the quotes in the OP indicate a literal string.
Thank you for posting this. Very helpful. I wound up modifying site.py in python2.4 by changing encoding from "ascii" to "utf-8" in the setencoding() function. Voila! utf-8 from python command line.
    tn$ python
    Python 2.4.4 (#1, Oct 18 2006, 10:34:39)
    [GCC 4.0.1 (Apple Computer, Inc. build 5341)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'utf-8'
    >>>
Then I needed to change the Pydev editor encodings to UTF-8 (Window->Preferences->General->Workspace->Text file encoding in Eclipse 3.2.1). Then I needed to change the run settings in Pydev (Window->Run->Common (tab)->Console Encoding) to UTF-8. Works perfectly now.
Thanks again. Not sure why that was so difficult though...