Ian Bicking: the old part of his blog

August ChiPy (and the stdlib)

Had the ChiPy meeting on Thursday. Aaron Lav started out with a talk on Unicode and Chinese. One thing that I hadn't realized about Unicode is that it needs to be normalized: you can represent characters either composed or decomposed. E.g., é can be a single character or two characters (the e followed by a combining acute accent). Of course this has a dramatic effect on searching, string length, etc. From what Aaron said, display support for the decomposed form isn't very good, but he made use of it for constructing pronunciation guides and then converted the result to the composed form. The unicodedata module has a normalize function to handle this.
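For concreteness, a minimal sketch of the composed/decomposed distinction using the stdlib, with the same é example:

    import unicodedata

    composed = "\u00e9"      # e-acute as one precomposed code point
    decomposed = "e\u0301"   # e followed by a combining acute accent

    print(composed == decomposed)           # False: different code points
    print(len(composed), len(decomposed))   # 1 2
    # NFC composes, NFD decomposes; normalize before comparing or searching
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True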

I'd also had the vague impression that Chinese was one character per word. But of course there are too many words for that. And there are no spaces, so it's not immediately clear where the word boundaries are (at least to a computer, and certainly it is unclear to my eye). So the middleproxy application actually scans every three-character combination for possible "words". That reminded me a great deal of this presentation.
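I don't know middleproxy's internals, but the idea as I understood it is roughly a sliding-window dictionary lookup; a hypothetical sketch, with a made-up toy lexicon:

    def candidate_words(text, lexicon, max_len=3):
        # Yield (start, word) for every substring of length 1..max_len
        # that appears in the lexicon
        for start in range(len(text)):
            for length in range(1, max_len + 1):
                if start + length > len(text):
                    break
                chunk = text[start:start + length]
                if chunk in lexicon:
                    yield start, chunk

    lexicon = {"中", "中国", "国人", "人"}  # toy dictionary
    print(list(candidate_words("中国人", lexicon)))
    # [(0, '中'), (0, '中国'), (1, '国人'), (2, '人')]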

I gave a presentation on setuptools, mostly hoping to introduce all the things you can do with it, and how to distribute packages.
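A minimal setup.py, just to give the flavor (the project name and metadata here are made up):

    from setuptools import setup, find_packages

    setup(
        name="ExamplePackage",   # hypothetical project name
        version="0.1",
        packages=find_packages(),
        # dependencies easy_install can resolve automatically:
        install_requires=["docutils>=0.3"],
    )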

At the end Chris McAvoy brought up the issue of the standard library. I had kind of forgotten about it, but that's what got me thinking about versioned imports some time ago, and I think setuptools has an important place there.
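pkg_resources already gives a taste of this: you can demand a particular release before importing anything (the package name below is made up):

    import pkg_resources

    # Activate an installed distribution matching the version spec;
    # raises an error if nothing suitable is installed
    pkg_resources.require("ExamplePackage>=0.1,<0.2")
    import examplepackage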

So, the story goes like this: the standard library isn't advancing very fast. Little of the neat new stuff in Python is in the standard library, with a few small exceptions, and when neat new stuff does land in the standard library it's often not really helpful for a few years anyway, since it isn't available in older versions of Python. And the standard library is stuck in a release cycle that is really slow -- slow releases are okay for a core language (good, even!) but not for the software built on that language.

Setuptools doesn't improve the standard library. The standard library has some advantages over other libraries, but I think we need to figure out how to develop outside the standard library with those same advantages, and that's what setuptools (really the whole family: setuptools, easy_install.py, Python Eggs, and pkg_resources) gives us. Or at least it moves us in the right direction. The Cheese Shop is also important, and the PEP process can still apply to libraries outside the standard library. For instance, if the Web-SIG creates libraries on top of WSGI, I think some PEP-ish process is appropriate (giving the library some consensus and authority), even though it can't reasonably be distributed with the Python standard library.

Created 13 Aug '05

Comments:

There are all sorts of messy hacks in Unicode.

Take Devanagari, for example. Devanagari is the Indian script used to write classical Sanskrit, Hindi, Nepali, Marathi and some other languages, adding up to the mother tongues of several hundred million people. So quite important to get right, really. Devanagari is a kind-of-syllabic alphabet in which consonants are normally read as including an implicit "a" sound - so "t" is read as "ta". (Unless it's at the end of a word in Hindi, I believe.) There are supplementary characters to replace the "a" with other vowels or suppress it completely at the end of a word (except in Hindi, where it's suppressed automatically anyway at the end of a word. I think).

There are also compound consonants, e.g. the "tr" in the word "sutra". These have their own written characters, which are (mostly) recognisable as combined versions of the two root characters. These ligature characters are not conventionally regarded as letters in their own right even though they are written/printed as single characters, and they don't have their own Unicode code points. Instead the "tra" in "sutra" is written as three code points: U+0924 TA, U+094D VIRAMA (to suppress the implicit A in TA), U+0930 RA.

So how many characters is U+0924 U+094D U+0930?

It's one on the printed (rendered) page. Linguistically it's normally regarded as two, TA and RA combined. It certainly isn't three: using the VIRAMA to signal a ligature in that way is a Unicode hack, not a part of the real script. (Nor is it nine, as some idiot who didn't know they were dealing with a UTF-8 encoded version might conclude from counting bytes.)
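To make the counts concrete, a quick Python sketch (written with escapes, so it should survive any comment system):

    import unicodedata

    tra = "\u0924\u094d\u0930"  # TA + VIRAMA + RA
    print(len(tra))                   # 3 code points
    print(len(tra.encode("utf-8")))   # 9 bytes in UTF-8
    for ch in tra:
        print("U+%04X" % ord(ch), unicodedata.name(ch))
    # U+0924 DEVANAGARI LETTER TA
    # U+094D DEVANAGARI SIGN VIRAMA
    # U+0930 DEVANAGARI LETTER RA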

(Sorry for all the words and no visible examples. It's late at night, I don't know if your comments system would handle Unicode examples correctly, and even if it does, several browsers - basically all Mozilla variants - don't display Devanagari ligatures correctly anyway. I raised this as a bug nearly a year ago; no sign of progress. Presumably all Indian hackers are southerners and don't care whether Hindi-speaking northerners get to read stuff on the web or not.)

# Alan Little

Ligatures are always a little confusing. But in all honesty, I think it's not unreasonable that native speakers adapt just as computers adapt, and we meet somewhere in the middle. In some ways it seems gross that we change an entire language and tradition to comply with our technical limitations, but people have been doing that for thousands of years, and they'll do it today regardless of whether it is expected or approved. Spanish officially dropped two letters a few years ago (ch and ll) in recognition of the predominant understanding of what a "letter" is. If I remember correctly, Chinese is traditionally written top-to-bottom, but electronically left-to-right seems to be the norm. I appreciate the adaptation - not because everything should match Western norms, but because the Western norms are notable for how much they themselves have adapted over time, and I believe there's virtue in that.

If Devanagari adapts its idea of the linguistic meaning of a character, or readers come to recognize the adapted typography of that character, I think that's reasonable. But then I'm not a traditionalist, and I like the idea of a polyglot.

# Ian Bicking

I agree it's an inherently difficult problem. If you're not going to give ligatures their own code points - I assume native speakers were probably consulted and said "no" to that idea - then you have to come up with something, and the something they came up with is perfectly reasonable.

It still leaves us, though, in the situation where there are plausible arguments for the "length" of a single sequence of three code points being one, two, or three - with my personal preference being two.

# Alan Little