Do I hate Unicode, or Do I Hate ASCII?

I was glad to hear I am not alone in feeling that (to quote) "Unicode stinks". UnicodeDecodeError is a constant pain in the ass for me.

I appreciate this advice on Unicode, but I'm not entirely sure what to do with it:

strings are fine for text data that is encoded using the default encoding

Unicode should be used for all text data that is not or cannot be encoded in the default encoding

I still have str() calls and __str__ methods all over the place, and Unicode sneaks into the most unexpected places.

Sometimes I think my life would be much, much easier if my default encoding was UTF-8 instead of ASCII. Isn't that a nice, happy encoding? Sure, a UTF-8 string isn't equivalent to a Unicode string: the lengths don't match up, and some Unicode-aware operations (e.g., operations that treat letters differently from punctuation) won't work. Most of my strings are sufficiently opaque that I don't care, though. And, doing server programming, UTF-8 is a good encoding; there's no such thing as locale for me. But even setting the default encoding has been made deliberately very difficult.
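
To make the length mismatch concrete, here is a minimal sketch. It assumes Python 3 (the post itself is about Python 2), where `str` plays the role of the Unicode type and `bytes` the role of the bytestring. The sample name is made up.

```python
# Python 3 sketch: str is the Unicode type, bytes is the bytestring.
name = "J\u00fcrgen"              # "Jürgen", six characters
encoded = name.encode("utf-8")    # the u-umlaut becomes two bytes in UTF-8

text_length = len(name)           # counts characters
byte_length = len(encoded)        # counts bytes

# Character-aware checks only make sense on the decoded text.
all_letters = name.isalpha()
round_trip = encoded.decode("utf-8") == name
```

As long as the bytes are only stored and passed along, the difference never shows; it matters as soon as code counts, slices, or classifies characters.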

I just really don't know what I should do. Should I replace all my __str__ methods with __unicode__? Should I set up a boundary where I carefully decode all strings, making sure I'm using Unicode everywhere in my app? These are rather hard things to do, because "inside" is a rather leaky place. There's all these libraries other people wrote, external inputs I am hidden from, etc.

For instance, imagine some library that writes data to a file occasionally. Maybe it's a cache; the data is opaque. It expects strings. What does it do when it gets a Unicode object? Very possibly it writes it, if it is encodable with the default encoding (typically ASCII). In fact, this works great for me because my name and everything I write is ASCII; I'm not even sure how to input anything but ASCII. How do I, the ignorant English-speaking-and-typing American, even make a test case? Well, sometimes I write u"\u0100" or something; I don't even know what that character is, but at least I know it's Real Unicode. Sucks that it takes 9 characters to give me that one Unicode character I want. And in practice I usually leave this out of my tests. Then some European comes along with an umlaut in their name, and BOOM! UnicodeDecodeError -- and I didn't even know strings were involved. It's not even my library. Nothing is safe from these blasted characters. And the problem isn't localized -- Unicode works implicitly often enough that the Unicode can leak in long before it causes a problem, and a subtle difference like between "%s" % obj and str(obj) can cascade throughout the system.
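
One way to build such a test case without knowing how to type the characters is to use escapes. The sketch below assumes Python 3, and `write_opaque` is a hypothetical stand-in for a library that only copes with ASCII; the explicit `.encode("ascii")` plays the part of the implicit conversion described above.

```python
import unicodedata

# The character from the post, looked up by code point.
sample = "\u0100"
sample_name = unicodedata.name(sample)


def write_opaque(value):
    """Hypothetical library call that assumes its data is ASCII-only."""
    return value.encode("ascii")


def fails_on(value):
    """Report whether the ASCII-only library chokes on this value."""
    try:
        write_opaque(value)
    except UnicodeEncodeError:
        return True
    return False


ascii_ok = not fails_on("Ian")
umlaut_breaks = fails_on("M\u00fcller")   # a name with an umlaut
```

Here `sample_name` comes out as `LATIN CAPITAL LETTER A WITH MACRON`, which at least answers what that character is.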

(And just try commenting on this post with anything but ASCII, I dare you!)

Update: I've written some followups here and here.

Created 01 Aug '05
Modified 02 Aug '05

Comments:

Excellent article. As a Mexican developer I wrestle with Unicode daily and it's the only thing that can take away from Python, if I ever found a language with smooth handling of Unicode

# anonymous

I suspect it would be better if we had only Unicode text and non-text bytestrings, with sufficiently different interfaces that you could tell which one you expected and would work with. Like Java I guess. You'd still have all the problems, but you'd have to deal with them up front, and everyone would deal with them consistently, and errors wouldn't propagate.

# Ian Bicking

There is no such thing as working Unicode support in any language. :-/ The main problem is that there are interfaces needed to the real world - and the real world doesn't like Unicode, it only accepts utf-8. But several libraries try to be smart. For example Django uses the email module to parse multipart POST data - that's a logical choice, as HTTP multipart POST data is actually just MIME attachments. Of course, email produces unicode strings if you pass in something that's defined as utf-8. And that produces problems in the Django code. Or take sqlite3 for example: the pysqlite2 binding always returns unicode strings. Except if you store data in the database directly with the sqlite command - then it stores what you pass in, in whatever your local encoding is. It's iso-8859-1 in my case - so I can add stuff to the sqlite database that can't be read by the pysqlite2 library. With Django this required me to write a row factory that removes the unicode string and replaces it by a bytestring in utf-8 encoding, just to make the Django code happy.

Oh, and your blog doesn't accept Umlauts in the comments - it produces a traceback with UnicodeDecodeError :-)

# hugo
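
A row factory of the kind hugo describes might look roughly like the sketch below. This is not his code: it assumes Python 3 and the standard library's `sqlite3` module rather than pysqlite2, and the table and column names are made up for illustration.

```python
import sqlite3


def utf8_row_factory(cursor, row):
    """Return each row with its text columns encoded to UTF-8 bytestrings."""
    return tuple(
        value.encode("utf-8") if isinstance(value, str) else value
        for value in row
    )


conn = sqlite3.connect(":memory:")
conn.row_factory = utf8_row_factory
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")   # hypothetical table
conn.execute("INSERT INTO people VALUES (?, ?)", (1, "J\u00fcrgen"))
row = conn.execute("SELECT id, name FROM people").fetchone()
conn.close()
```

The factory leaves non-text values such as integers alone, so only the columns that would otherwise arrive as Unicode are converted.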

Oh, and just before anybody mentions it: both the email module and the pysqlite2 library will happily deliver bytestrings in other situations, so you can't just say to hell with it and all unicode. The email module will return non-utf-8 stuff as bytestrings and the pysqlite2 library will give a registered converter not the unicode string but the raw bytestring. For example if you register a (lambda s: s) as the converter, you will get bytestrings.

# hugo

Oh, and your blog doesn't accept Umlauts in the comments - it produces a traceback with UnicodeDecodeError :-)

Haha, you took my dare and YOU LOST! I actually knew you would, it's one of those Unicode errors that I'm too cowardly to try to fix. In fact, I'm pretty sure I introduced it when I was trying to fix a related encoding problem. There's a very high likelihood of regressions when you try to fix unicode-related errors.

# Ian Bicking

well, i usually code like this:

all inputs are converted to unicode
the internals of my programs only deal with unicode
all the outputs are converted (explicitly) to byte-strings

at least this is what i'm trying ;)

there are generally 2 problems with this approach in python (it's much better in java for example):

1. probably because of historical reasons (first there were byte-strings?) python kind-of recommends byte-strings...well, not exactly recommends..but..for example if you want to write a unicode string you have to prefix it with [u]. so it's usually extra work to enter unicode strings. in java this issue does not exist, because there the strings are unicode. there are no byte-strings (only byte-arrays)
2. probably a consequence of #1: many library functions only deal with byte-strings. and what's worse, sometimes (to be sure) they start with something like "input = str(input)". and this of course completely fails when the input contains non-ascii... so you have to be careful...

# Gabor Farkas
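
The decode-on-input, encode-on-output approach in the comment above can be sketched as follows. This assumes Python 3, where `bytes` is the byte-string and `str` is Unicode; the function names are hypothetical and UTF-8 is an assumed encoding for both boundaries.

```python
def read_input(raw, encoding="utf-8"):
    """Boundary in: bytes from the outside world become text."""
    return raw.decode(encoding)


def shout(text):
    """Internal logic: only ever sees text, never bytes."""
    return text.upper()


def write_output(text, encoding="utf-8"):
    """Boundary out: text is encoded explicitly, never implicitly."""
    return text.encode(encoding)


incoming = "j\u00fcrgen".encode("utf-8")     # pretend this arrived over the wire
outgoing = write_output(shout(read_input(incoming)))
```

Because `shout` works on text, the umlaut is uppercased as a character; the same operation on the raw UTF-8 bytes would leave it untouched.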

There isn't really a lot of point in hating ASCII. ASCII is fine if you are absolutely sure your code will only ever be used by monolingual English speakers who don't give a sh*t about whether their text looks typographically half-decent (proper quotes, en- and em-dashes etc.)

If you have any interest in your code being used by the other 95% of the world, then ASCII simply isn't an option and you have to find a way to deal with unicode whether you like it or not.

Django incidentally has the same problem: http://code.djangoproject.com/ticket/170 ... which does rather beg the question, do American newspapers not even try to spell foreigners' names correctly?

# Alan Little

They obviously never bother to use pound signs or euros.

# Moof

I got your pound sign right here: # -- :-)

# Ray

Kind of off-topic, but I wonder: How did the typographical mark '#' come to be called "pound" in the US? Was it through keymap differences? Or is it just a co-incidence that Shift-3 happens to produce the "right pound for the job"?

(On a British keyboard, Shift-3 produces the GBP currency symbol. On a US keyboard, it produces the number sign -- what I would call a "hash" sign)

# Robert Hunter

The "octothorp" is often referred to as the "pound sign" because it is used to designate pounds when it follows a number in the retail trade (er, in the old days, when _I_ was a kid). E.g. "5# of flour", "3# of 10d common nails" (btw, that 'd' stands for 'penny'). See http://www.octothorp.us/octothorp.html for more...

# anonymous

I think most Americans don't think of that as a spelling error. After all, the letters are really the same, so you can just leave out the funny decorations that furriners like to write, right? :)

# Randall Randall

It seems to me like non-ASCII character sets are just not ready for prime time yet. I treat stuff like Unicode the way I treat C++--I'm still waiting for things to shake out and the bugs to be worked out. Of course, I've been waiting 15 years and C++ is still a mess.

(Yes, I'm kidding. Sort of.)

# Mike Coleman

That's why you should set the default encoding to something nonexistent. Automatic conversion from str to unicode or unicode to str is a source of endless bugs. Better to have everything explicit ;)

# anonymous

But sometimes automatic conversion really is the Right Thing. For instance, let's say I have some code that does this:
def popup(href, title):
    # The inner quotes around _blank are escaped so the literal is valid.
    return '<a href="%s" onclick="window.open(%r,\'_blank\')">%s</a>' % (
        href, str(href), title)
Without a u' this will raise an exception if title is Unicode, if no default encoding is given. Is that the right behavior? No! This function works fine. This is not a boundary where encoding needs to be defined; in fact, it would suck if you had to encode the title before passing it in, because then you'd have to decode the result of the function as well, so you could re-encode it at the real boundary (when you serve the page). All this because of a missing u -- and that u is missing far more often than it is included.

If you can "fix" that function, you're okay. But there's too much code out there that does this now. I simply can't update all the code out there that uses bytestrings instead of Unicode.

# Ian Bicking

But that is the problem. Unicode difficulties always come when you try to mix 8 bit strings with unicode strings. A solution ( and one of the best in my opinion ) is to do all string handling in unicode.

You read a string from a file, first thing you do is convert it to unicode. You write a string to a file? Encode it. In between, always use unicode. If there is some unruly code somewhere, then it's better to ask the author of the code to correct it than to add some workaround somewhere. It is the safest solution that way because there are fewer hacks involved.

If really there is some external code you can't change, then consider encapsulating the piece of code in an automatic convert/unconvert routine with utf8 as the encoding.

# anonymous
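
An automatic convert/unconvert routine of the sort suggested above could be written as a decorator. The sketch assumes Python 3, handles positional arguments only, and uses a made-up `legacy_greeting` as the external bytes-only code.

```python
import functools


def utf8_boundary(func):
    """Encode text arguments to UTF-8 on the way in and decode a bytes
    result on the way out, so callers only ever deal with text."""
    @functools.wraps(func)
    def wrapper(*args):
        encoded = [a.encode("utf-8") if isinstance(a, str) else a for a in args]
        result = func(*encoded)
        return result.decode("utf-8") if isinstance(result, bytes) else result
    return wrapper


@utf8_boundary
def legacy_greeting(name):
    """Hypothetical external function that only handles bytestrings."""
    return b"Hello, " + name


greeting = legacy_greeting("Zo\u00eb")
```

The wrapped function keeps working on bytes internally, which is the point: the external code is not modified, only fenced off.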

I have a hard time asking a library author to "fix" their library in this way. Because when they've fixed it for Unicode-using me, they've simultaneously broken it for everyone else.

# Ian Bicking

I always set my default string encoding to utf8 if I'm working in a python environment I have control over, and having done that I don't recall having had any difficulties with library code that uses strings. Of course that's no use if you're in a shared hosting environment, or distributing code for others to use in their environments. The utterly stupid and bizarre only-at-startup way to set this really doesn't help.

It occurs to me that Dutch is one of only three ASCII-only languages I can think of off the top of my head (the other one is Italian). This might explain a lot.

# Alan Little

Not even Italian and Dutch are absolutely ASCII. Italian makes use of accented vowels quite a lot (e.g. á) and Dutch technically needs the trema (e.g. tetraëder); the IJ (http://en.wikipedia.org/wiki/%C4%B2) is usually written as two letters when using computers, but typewriters still have this as its own ligature.

So, unless I'm missing something, English is the only pure ASCII language there is (except Latin, of course).

# Philipp von Weitershausen

You think English is a "pure ASCII" language?? Nonsense!

What about 'façade' or 'rôle', or 'résumé', all perfectly good English words!

~fran

# Francis Tyers

As a Ukrainian developer, I have to deal with Unicode issues regularly.

The only viable strategy to avoid unicode errors for me has been to have clear borders of "Unicode world" within a program. As you noted, this may be tricky but it is doable. Usually I don't strive to have Unicode throughout the program because Unicode is used inconsistently in Python (and, yes, you're going to figure it out the hard way) but it is relatively simple to define a Unicode area within your own code. To document boundaries I usually use asserts.

Regards, Max.

# Max Ischenko
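
Documenting the edges of a "Unicode area" with asserts, as the comment above describes, might look like the sketch below. It assumes Python 3 (`str` is the Unicode type) and both function names are hypothetical.

```python
def render_title(title):
    # Entering the Unicode area: callers must already have decoded their bytes.
    assert isinstance(title, str), "expected text, got %r" % type(title)
    return "<h1>%s</h1>" % title


def send(page, encoding="utf-8"):
    # Leaving the Unicode area: encode exactly once, explicitly.
    assert isinstance(page, str), "expected text, got %r" % type(page)
    return page.encode(encoding)


body = send(render_title("Caf\u00e9"))
```

The asserts fail at the boundary where the wrong type arrives, rather than somewhere far downstream. Note that Python skips assert statements when run with `-O`, so they document and check the boundary during development rather than enforce it in every deployment.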

I used to have the same annoyances, but then I learned to stop worrying and love the exceptions :) Or, more prosaically, faced with a large and complex project that required use of Unicode, I sat down and worked out how Python's support for it operates. The conclusion I came to is that Python is actually one of the better (if not best) languages at handling Unicode, as long as you work with it. That is, it presupposes a particular way of handling strings. The faults are more in the explanations and documentation.

The project of which I speak is driven by a vast MySQL database, in which [essentially] all strings are Unicode. Since Python tends to promote non-Unicode strings to Unicode as required (in string operations and the like), that means that any textual object must be considered as being Unicode. They come to Python code (via MySQLdb) as Unicode objects. To the console and to files, they are sent as UTF-8-encoded strings.

The general rules I've followed are:

(a) Use a console environment that supports UTF8 characters (in my case, PuTTY ssh sessions to Linux boxen). Thus one can print any string that's encoded in UTF8 to stdout in Python code.
(b) Assume that all strings are Unicode; when using "print", convert to UTF8. Since non-Unicode strings also have an "encode" method, this makes life easier. Converting an ASCII string to UTF8 is essentially a no-op. Only encode when writing out from code to files or the console. Only decode when reading in.
(c) When creating or reading files, know what encoding format you're handling. That should be as much an attribute of any defined file format as the line-endings or use of Ctrl-Z/EOF. I use BOM marks (as does Windows) via the constants in the "codecs" module to ensure that I tag files appropriately. Writing a generic file reader that spots BOM marks and decodes appropriately is an easy task.
(d) Keep in mind that "the console and the world are in ASCII" is a falsehood that will bite you as much as "everyone in the world speaks English" does :)
(e) UTF8 is your friend - it'll handle encoding of any Unicode character and if you can't meet rule (a) is still more-or-less printable, though you don't see the strings as intended.
(f) The "default encoding" is your enemy. You can't rely on it, it only takes effect in some circumstances and it may bear no relation at all to what the console can or cannot handle.

Given the above, I have never had problems with rogue Unicode[De|En]codeErrors, and we handle all Western languages plus (recently) Russian and Japanese. Generally, the only times they occur is where I've found an old "print" line that spits some object out without encoding it first.

# Ben Last
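
The generic BOM-spotting reader mentioned in the comment above could be sketched like this. It assumes Python 3, works on bytes already read from a file, and falls back to UTF-8 when no BOM is present; that fallback and the function names are assumptions, not part of the comment.

```python
import codecs

# UTF-32 marks are checked first: the UTF-32 LE mark begins with the
# same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data, fallback="utf-8"):
    """Strip a leading BOM if there is one and decode accordingly."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(fallback)


def encode_with_bom(text):
    """Tag outgoing data as UTF-8 by prefixing the UTF-8 BOM."""
    return codecs.BOM_UTF8 + text.encode("utf-8")


tagged = encode_with_bom("\u041f\u0440\u0438\u0432\u0435\u0442")   # Russian sample text
restored = decode_with_bom(tagged)
```

One known ambiguity remains: UTF-16 LE data whose first character is NUL looks like a UTF-32 LE mark, so a reader like this is a heuristic rather than a guarantee.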

No I don't, but I do hate all things Microsoft (micro as in small, soft as per brain type)

All I want to do when I call word and when it tells me it needs to convert a file I select OEM United states. I now want this selection to be the default for all the following files till I end the word session.

How can I do this ?

Hope one of U Gurus can help.

I have 'Googled' for the answer, no help

Paul

# Paul Suret

Ian Bicking: the old part of his blog
