Monday, 8 January 2007

Vanished in the Twinkling of an Eye

Here's an entry for a Chinese character meaning "to blink" or "to twinkle" in a standard Chinese-English dictionary :


Han-Ying Cidian 漢英詞典 (Shangwu Yinshuguan, 1980) p.594.


Notice anything funny about it ? Not really ? Then try typing it up using Unicode ... oops, now where is that character ? ... you know, the one with a 目 radical and shǎn 㚒 phonetic.

Look as hard as you like, but not one of the 70,229 unified ideographs that have already been encoded matches the head character in this entry. A mistake or an obscure character perhaps (really obscure not to be in CJK-B) ? But no, the same character is also found in the standard pocket dictionary of Chinese characters :


Xinhua Zidian 新華字典 (Shangwu Yinshuguan, 5th ed., 1979) p.398.


And the best one-volume dictionary of modern Chinese :


Xiandai Hanyu Cidian 現代漢語詞典 (Shangwu Yinshuguan, 2nd ed., 1983) p.998.


So if it is such a common character, how come it's not in Unicode ? Well, the answer is that officially it is in Unicode, just that it's been unified with the very similar character U+4039 䀹. Notice how the character we are interested in has a shǎn 㚒 phonetic, but U+4039 has a jiā 夾 phonetic. U+4039 is an uncommon character which does not occur in any of the three dictionaries cited above, but if you look in the big dictionaries you will find both characters (one with a shǎn 㚒 phonetic, and one with a jiā 夾 phonetic), treated as separate characters in their own entries.

Here's the entry in the great Kangxi Dictionary :


Kangxi Zidian 康熙字典 (Zhonghua Shuju, 1958) p.809.


Let's see how these two entries are handled in an electronic Kangxi dictionary (click on the 康熙字典 tab) -- hmm, the two entries are conflated, but the on-line editor has had to add the apologetic note "䀹原字从㚒,不从夾。" [this character was originally written with the shan phonetic, not the jia phonetic] to the first entry. Not very satisfactory !

And here's the entry in the Chinese answer to the OED :


Hanyu Dacidian 漢語大詞典 (Hanyu Dacidian Chubanshe, 1991) vol.7 pp.1221-1222.


Hmm, clearly two distinct characters. So why are they unified in Unicode ? Well, I don't believe they should be, because the rules for CJK unification state that two characters should not be unified if a source dictionary treats them as distinct lexical items.

By now the most tenacious of my readers will have discovered that although there isn't a unified ideograph corresponding to the character in hand, there is a compatibility ideograph that looks just like it, viz. U+FAD4. So why not use U+FAD4 for this character to distinguish it from U+4039 ? Well, compatibility ideographs are not real Chinese ideographs at all, they only sometimes look as if they are, but with a wave of the magic wand of normalization they vanish away. Or in more prosaic words, U+FAD4 is canonically equivalent to U+4039, which means that any conformant Unicode process can convert U+FAD4 to U+4039 in the twinkle of an eye, without so much as a by-your-leave. So, it is not useful to try to represent our character in permanent electronic text form using its lookalike compatibility ideograph.
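The vanishing act is easy to reproduce. Here is a minimal sketch in Python, assuming only the standard-library unicodedata module (the variable names are mine, and the results depend on the Unicode data that ships with your interpreter):

```python
import unicodedata

compat = "\uFAD4"   # compatibility ideograph that looks like the shan-phonetic character
unified = "\u4039"  # unified ideograph, shown with the jia phonetic in the code charts

# The canonical decomposition of U+FAD4 is the single code point U+4039.
decomposition = unicodedata.decomposition(compat)

# Apply each of the four normalization forms to the compatibility ideograph.
normalized = {form: unicodedata.normalize(form, compat)
              for form in ("NFC", "NFD", "NFKC", "NFKD")}

print("decomposition:", decomposition)
for form, text in normalized.items():
    print(form, "U+%04X" % ord(text))
```

Because the decomposition is a canonical singleton, even NFC (the form most people think of as "composing") replaces U+FAD4 with U+4039 and never puts it back.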

For this reason John Jenkins and I have submitted a proposal to disunify U+4039, which will hopefully see this woefully overlooked character encoded as a unified ideograph in its own right before very long.


Addendum [2007-02-05]

Kenneth Whistler has pointed out that there is one more compatibility ideograph, U+2F949, which is canonically equivalent to U+4039. Whereas the reference glyph for U+FAD4 is the same as the missing character, the reference glyph for U+2F949 is the same as U+4039. So if a new character is created for the shan-phonetic character and the jia-phonetic character remains assigned to U+4039, then the compatibility ideograph U+2F949 will have the correct canonical equivalence to U+4039, but the compatibility ideograph U+FAD4 will be left with an incorrect canonical equivalence. On the other hand, if a new character is created for the jia-phonetic character, and the shan-phonetic character is assigned to U+4039, then the compatibility ideograph U+FAD4 will have the correct canonical equivalence to U+4039, but the compatibility ideograph U+2F949 will be left with an incorrect canonical equivalence. So whichever of these two solutions is chosen (if either), one of the compatibility ideographs will be left with an incorrect canonical equivalence. We will just have to wait and see what the relevant committees decide to do about it.
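To see which code points are tied to U+4039 in this way, one can scan the whole code space for canonical singleton decompositions. This is only a rough sketch in Python using the standard-library unicodedata module; the helper name canonical_singletons is hypothetical, and what it finds depends on the Unicode data bundled with your interpreter:

```python
import unicodedata

def canonical_singletons(target):
    """Return the code points whose canonical decomposition is exactly `target`."""
    wanted = "%04X" % ord(target)
    # Compatibility (non-canonical) decompositions carry a "<tag>" prefix,
    # so they never compare equal to the bare hexadecimal code point.
    return [cp for cp in range(0x110000)
            if unicodedata.decomposition(chr(cp)) == wanted]

equivalents = canonical_singletons("\u4039")
for cp in equivalents:
    print("U+%04X" % cp, unicodedata.name(chr(cp)))
```

Both compatibility ideographs discussed above should turn up in the list, which is exactly why any disunification leaves one of them pointing at the wrong character.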



Addendum [2007-05-13]

The proposal to disunify U+4039 was subject to much discussion at the recent WG2 meeting at Frankfurt, and resulted in a decision to encode the shǎn character at the earliest opportunity. To this end the new character has been included in the additions to ISO/IEC 10646 under ballot as Amendment 4 in the basic CJK block at U+9FC3, and if all goes to plan it will be making its debut in Unicode 5.1 this time next year.

I have also revised and expanded the disunification proposal (N3196) with further examples of usage and evidence for disunification.
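The practical payoff of a separate code point is that it has no decomposition, so normalization leaves it alone. A small sketch in Python, assuming that the character did end up at U+9FC3 as balloted and that your interpreter's Unicode data is recent enough to include it (the variable names are mine):

```python
import unicodedata

shan = "\u9FC3"  # code point named in Amendment 4 for the shan-phonetic character (assumed final)
jia = "\u4039"   # the existing unified ideograph

# A unified ideograph has no decomposition mapping at all.
decomposition = unicodedata.decomposition(shan)

# So every normalization form should return the character unchanged.
stable = all(unicodedata.normalize(form, shan) == shan
             for form in ("NFC", "NFD", "NFKC", "NFKD"))

print(unicodedata.name(shan, "<not in this Unicode version>"))
print("survives normalization:", stable)
print("distinct from U+4039:", shan != jia)
```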


8 comments:

DopefishJustin said...

Kind of a nitpick, but the phonetic isn't shăn, it's shǎn. (Pinyin uses háčeks, not breves.)

John Cowan said...

The objection is not that they are distinct in source dictionaries, for that has never been a rule. (There was in the first stage of unification, the one that made the main unified ideograph sequence, a rule that ideographs that are distinct in the coded character sets of a given country cannot be unified; that rule is not being applied to later additions.)

No, the objection is that they have different abstract shapes: they are y-variants, and therefore should always have been disunified.

Tom Gewecke said...

One font I have, PMingLiU Regular, seems to have the shǎn form at 4039.

Andrew said...

You're quite right to pull me up on the breve/hacek thing ... I should have been more careful.

Andrew West said...

Tom,

As the two characters are unified I guess that it is OK for a font to use either glyph, and as the shan-phonetic character is more common than the jia-phonetic character then it could make sense to use the former glyph even though the code charts use the latter glyph.

Andrew West said...

John,

According to John Jenkins, who knows far more about these things than I do (as one of Unicode's representatives to IRG), being distinct lexical items in a source dictionary is the important point, at least from an IRG perspective.

I'm not sure what you mean by "y-variant" -- as far as I am concerned neither character is a variant (a through z) of the other; they are two distinct characters which just so happen to look very similar and to have some semantic overlap.

If the shan-character were to have a simplification to U+25174 𥅴 (mu plus simplified jia-phonetic) in the same way that the shan 陝 in Shaanxi 陝西 simplifies to 陕 (shan-phonetic simplifying to simplified jia-phonetic) then the situation would be much more complicated, but it does not, and the unnatural simplification of 陝 to 陕 is a one-off aberration.

John Cowan said...

The terms x-variant, y-variant, and z-variant were (and are) used in discussing CJK unification. Two characters are x-variants if they are different historically and semantically. If not, they may still be y-variants if they differ in abstract shape. If not, they are z-variants, differing only in the details of how the abstract shape is concretely recognized. X-variants and y-variants are not (or at least should not be) unified; z-variants should be and (mostly) are.

See the detailed discussion with examples on pp. 417-421 of TUS 5.0.

Andrew West said...

Oddly enough I have never been attracted to the x-variant, y-variant and z-variant terminology, which anyway seems to have very little currency outside of the Unicode Standard (they are certainly not terms that are used within IRG when discussing CJK unification).