Ian Bicking: the old part of his blog

Re: August ChiPy (and the stdlib)

There are all sorts of messy hacks in Unicode.

Take Devanagari for example. Devanagari is the Indian script used to write classical Sanskrit, Hindi, Nepali, Marathi and some other languages, adding up to the mother tongues of several hundred million people. So quite important to get right really. Devanagari is a kind-of-syllabic alphabet in which consonants are normally read as including an implicit "a" sound, so "t" is read as "ta". (Unless it's at the end of a word in Hindi, I believe.) There are supplementary characters to replace the "a" with other vowels or suppress it completely at the end of a word (except in Hindi, where it's suppressed automatically anyway at the end of a word, I think).

There are also compound consonants, e.g. the "tr" in the word "sutra". These have their own written characters, which are (mostly) recognisable as combined versions of the two root characters. These ligature characters are not conventionally regarded as letters in their own right even though they are written/printed as single characters, and they don't have their own Unicode code points. Instead the "tra" in "sutra" is written as three code points: U+0924 TA, U+094D VIRAMA to suppress the implicit A in TA, U+0930 RA.

So how many characters is U+0924 U+094D U+0930?

It's one on the printed (rendered) page. Linguistically it's normally regarded as two, TA & RA combined. It certainly isn't three: using the VIRAMA to signal a ligature in that way is a Unicode hack, not a part of the real script. (Nor is it nine, as some idiot who didn't know they were dealing with a UTF-8 encoded version might conclude from counting bytes; each of these code points takes three bytes in UTF-8.)
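For anyone who wants to check the counts, here is a rough Python sketch. It uses escapes rather than literal Devanagari so it doesn't depend on how the text is displayed. The "two letters" count is only my own naive heuristic (skip anything Unicode classifies as a combining mark), not a real definition of a Devanagari letter; it would also skip vowel signs, for instance. Note that the byte count depends on the encoding.

```python
import unicodedata

# "tra" as in "sutra": TA, VIRAMA, RA
tra = "\u0924\u094D\u0930"

# Official Unicode names of the three code points
names = [unicodedata.name(ch) for ch in tra]
# ['DEVANAGARI LETTER TA', 'DEVANAGARI SIGN VIRAMA', 'DEVANAGARI LETTER RA']

code_points = len(tra)                      # 3: what Python's len() reports
utf8_bytes = len(tra.encode("utf-8"))       # 9: three bytes per code point
utf16_bytes = len(tra.encode("utf-16-le"))  # 6: two bytes per code point

# Naive heuristic (an assumption, not a standard algorithm):
# count only code points that are not combining marks (category M*).
base_letters = sum(
    1 for ch in tra if not unicodedata.category(ch).startswith("M")
)  # 2: TA and RA

print(names)
print(code_points, utf8_bytes, utf16_bytes, base_letters)
```

Nothing in the standard library gives you the "one" that you see on the rendered page; that is up to the font and the rendering engine.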

(Sorry for all the words and no visible examples. It's late at night, I don't know if your comments system would handle Unicode examples correctly, and even if it does, several browsers - basically all Mozilla variants - don't display Devanagari ligatures correctly anyway. I raised this as a bug nearly a year ago, no sign of progress. Presumably all Indian hackers are southerners and don't care if Hindi-speaking northerners get to read stuff on the web or not.)

Comment on August ChiPy (and the stdlib)
by Alan Little

Comments:

Ligatures are always a little confusing. But in all honesty, I think it's not unreasonable that the native speakers adapt just as computers adapt, and we meet somewhere in the middle. In some ways it seems gross that we change an entire language and tradition to comply with our technical limitations. But people have been doing that for thousands of years, and they'll do it today regardless of whether it is expected or approved. Spanish officially dropped two letters a few years ago (ch and ll) in recognition of the predominant understanding of what a "letter" is. If I remember correctly, Chinese is traditionally written top-to-bottom, but electronically it seems like left-to-right is the norm. I appreciate the adaptation. Not because everything should match the Western norms, but because the Western norms are notable for how much they themselves have adapted over time, and I believe there's virtue in that.

If Devanagari adapts its ideas of the linguistic meaning of a character, or readers recognize the adapted typography of that character, I think that's reasonable. But then I'm not a traditionalist, and I like the idea of a polyglot.

# Ian Bicking

I agree it's an inherently difficult problem. If you're not going to give ligatures their own code points - I assume native speakers were probably consulted and said "no" to that idea - then you have to come up with something. And the something they came up with is perfectly reasonable.

It still leaves us, though, in the situation where there are plausible arguments for the "length" of a single sequence of three code points being one, two -or- three, with my personal preference being two.

# Alan Little