Sunday, 2 December 2007

CJK-B Case Study #1 : U+272F0

The CJK Unified Ideographs Extension B [13MB] block that was added to Unicode/10646 in 2001 comprises 42,711 characters, and it is no secret that there are many problems with this huge collection of mostly quite rare characters, including hundreds of cases of unifiable characters that have been erroneously encoded separately and even a handful of completely duplicate characters. There is enough material to keep a dedicated CJK-B blogger busy for years to come, but I certainly don't want to go down that particular path. However, a recent post by Michael Kaplan, "How bad does it need to be in order to be not good enough, anyway?", about discrepancies in Han character stroke counts provided by China on the one hand and Taiwan on the other set me investigating one particular character, and its history is convoluted enough to be worth writing up as a case history.

Michael notes that the character with the greatest stroke count differential is U+272F0 𧋰, which is given 13 strokes in the PRC stroke count data but 19 in the Taiwan stroke count data. This is what the character looks like in both the Unicode Standard and in ISO/IEC 10646 :




Clearly, as all commentators to Michael's post agree, 13 must be the correct count, and there is absolutely no way to get 19 strokes out of it. In short, "19" must be a mistake. However, I did suggest in my first comment, as a rather wild guess (not actually being familiar with the character in question), that perhaps the stroke count of "19" represented a variant form of the character with two "insect" 虫 radicals at the bottom rather than the single radical shown in the Unicode and 10646 code charts. Then yesterday (which will be several days ago by the time I hit the "Publish" button), I obtained a copy of IRG N1381 (no ambiguity about the document number I trust ;-) which is a draft of the "Extension B multi-column code charts". It may come as somewhat of a surprise to my readers to know that although multi-column source glyph charts for the other CJK Unified Ideograph blocks are published in ISO/IEC 10646, there is no published multi-column chart for CJK-B, and that nearly seven years after CJK-B went live a draft multi-column chart for CJK-B (2,670 pages weighing 40MB) has only just been produced (2007-11-13). But anyhow, if you do look at this document you will find that U+272F0 has a single source glyph from Taiwan (T7-496B), and it does indeed have two insect radicals at the bottom instead of one, making it exactly 19 strokes :



So that would seem to explain where the Taiwan count of "19" came from, but the real question is what this glyph form is doing here, when forms of the same abstract Han character with double radicals are not in Unicode terms unifiable with forms of the same character with a single radical ?

As with all investigations of this sort, the first place to look for some help is the Kangxi Dictionary, where we find that this character exists in quite a few variant forms, although the character we are interested in (U+272F0 with a single radical) does not seem to be mentioned :



Note that although there are only six entries for variant forms of this character, the dictionary does actually use seven variant forms, as the fourth entry above is defined as a corruption of a variant form that does not itself have an entry (⿱𠂤⿰虫虫)—despite the fact that that form is also referred to in the sixth entry.

The character itself, in all its guises, is pronounced fù, and is the first character in the compound word fùzhōng ~螽, which is either a general name for grasshoppers, locusts, etc. or a specific kind of one such insect (also known as 蠜 fán). In early sources the character is written simply as 阜 without an insect radical, as in the 14th song of the Book of Songs (Shi Jing 詩經) :


喓喓草蟲,趯趯阜螽。
未見君子,憂心忡忡。
亦既見止,亦既覯止,我心則降。

召南・草蟲

Yao-yao went the grass-insects,
And the hoppers sprang about.
While I do not see my lord,
My sorrowful heart is agitated.
Let me have seen him,
Let me have met him,
And my heart will then be stilled.

The odes of Shao and the South (No.14)

Incidentally, the earliest Chinese dictionary (written circa 100 AD), Shuowen Jiezi 說文解字, does not have fùzhōng 阜螽, but it does define 蠜 fán as meaning fùfán 𨸏蠜, which is most probably a later transcription error for fùzhōng 𨸏螽 (𨸏 is the Shuowen way of writing 阜). This is supported by the early 11th century edition of the Yu Pian 玉篇 dictionary (originally compiled during the 6th century), where 蠜 fán is indeed defined as fùzhōng 阜螽 :



So now that we know what the variant ways of writing this character are, we can check and see which of them have been encoded in Unicode, and by my reckoning there are eight of them :


Unicode Encodings of U+86D7 and its Variants

Codepoint | Character | Ideographic Description Sequence (IDS) | Radical/Stroke Index (kRSUnicode) | Source References | Kangxi Index (kIRGKangXi)
U+4600 | 䘀 | ⿱阜⿰虫虫 | 142.14 | G_KX, T4-6441, JA-265F, K3-3154, KP1-72DF | 1100.480
U+86D7 | 蛗 | ⿱𠂤虫 | 142.6 | GE-3C2C, T2-4023, J1-5A66, K2-5B71 | 1081.220
U+272F0 | 𧋰 | ⿳𠂤一虫 | 142.7 | T7-496B | 1085.261
U+27313 | 𧌓 | ⿰虫阜 | 142.8 | G_KX, T4-4722, KP1-71CE | 1087.140
U+2731B | 𧌛 | ⿱阜虫 | 142.8 | G_KX, T6-617B | 1088.150
U+27449 | 𧑉 | ⿱自⿰虫虫 | 142.12 | G_KX, T7-4268, J4-7773 | 1097.080
U+27482 | 𧒂 | ⿱𠂤⿰虫虫 | 142.12 | G_HZ, T4-5D3C, KP1-72A1 | 1098.441
U+27499 | 𧒙 | ⿳𠂤一⿰虫虫 | 142.13 | G_KX | 1100.130
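As a rough cross-check of the residual stroke counts in the kRSUnicode column, here is a minimal Python sketch that adds up the strokes of the components in each IDS and subtracts the six strokes of radical 142 (虫). The component stroke counts in `COMPONENT_STROKES` and the helper names are my own assumptions for illustration, not values read from any Unicode data file.

```python
# Stroke counts for the components that appear in the IDS column.
# Assumed values, supplied by hand for this illustration only.
COMPONENT_STROKES = {"阜": 8, "𠂤": 6, "自": 6, "一": 1, "虫": 6}
RADICAL_142_STROKES = COMPONENT_STROKES["虫"]


def is_idc(ch):
    """True for Ideographic Description Characters (U+2FF0..U+2FFB)."""
    return 0x2FF0 <= ord(ch) <= 0x2FFB


def total_strokes(ids):
    """Sum the strokes of every component in an IDS, skipping the layout operators."""
    return sum(COMPONENT_STROKES[ch] for ch in ids if not is_idc(ch))


def residual_strokes(ids):
    """Total strokes minus the strokes of one 虫 radical."""
    return total_strokes(ids) - RADICAL_142_STROKES


# Code point -> (IDS, residual strokes given in the kRSUnicode column).
TABLE = {
    0x4600: ("⿱阜⿰虫虫", 14),
    0x86D7: ("⿱𠂤虫", 6),
    0x272F0: ("⿳𠂤一虫", 7),
    0x27313: ("⿰虫阜", 8),
    0x2731B: ("⿱阜虫", 8),
    0x27449: ("⿱自⿰虫虫", 12),
    0x27482: ("⿱𠂤⿰虫虫", 12),
    0x27499: ("⿳𠂤一⿰虫虫", 13),
}

for cp, (ids, listed) in TABLE.items():
    print(f"U+{cp:04X} {ids} total={total_strokes(ids)} "
          f"residual={residual_strokes(ids)} listed={listed}")
```

Under these assumed component counts, the single-radical form ⿳𠂤一虫 comes to 13 strokes and the double-radical form ⿳𠂤一⿰虫虫 to 19, which is exactly the six-stroke gap between the PRC and Taiwan figures.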

Of these eight encoded characters, U+86D7 蛗, U+27313 𧌓, U+2731B 𧌛, U+27449 𧑉, U+27499 𧒙 and U+4600 䘀 correspond to the Kangxi Dictionary entries shown above, and U+27482 𧒂 is the character quoted in the Kangxi entry for U+27449. Only U+272F0 𧋰 (as shown in the Unicode code charts) is not in the Kangxi Dictionary. From this it looks as if the Taiwan source glyph for U+272F0 should be the glyph for U+27499 𧒙. So at this point I think we should double-check the source glyphs for all eight characters :





Hmm, everything seems in order, except for the Taiwan source glyph for U+272F0 which does indeed look as if it should be the source glyph for U+27499, but if that is the case what is the correct source glyph for U+272F0 ?

Going back a step to look at the situation immediately prior to encoding in Unicode/10646, we find that U+272F0 (i.e. ⿳𠂤一虫) does not appear in either the Committee Draft (CD) or the Final Committee Draft (FCD) of ISO/IEC 10646-2:2001 :


ISO/IEC 10646-2:2001 CD and FCD

Final Codepoint | Character | CD Codepoint [SC2 N3393] | FCD Codepoint [SC2 N3442] | CD/FCD Source References
U+272F0 | 𧋰 | | |
U+27313 | 𧌓 | U+27456 | U+27342 | G_KX, T4-4722
U+2731B | 𧌛 | U+2745E | U+2734A | G_KX, T6-617B
U+27449 | 𧑉 | U+2758E | U+27479 | G_KX, T7-4268, JPN237/J4-7773
U+27482 | 𧒂 | U+275C7 | U+274B2 | G_HZ, T4-5D3C
U+27499 | 𧒙 | U+275DE | U+274C9 | G_KX

And nor does the Taiwan source reference T7-496B occur in either the CD or FCD, so it looks as if U+272F0 was only added after the FCD had been voted on (the FCD is the last chance to make any technical changes before the standard is published). According to the Disposition of Comments for the FCD, the IRG extensively reviewed and revised the repertoire of CJK-B, and it was this revised repertoire that was accepted for inclusion in the FDIS ballot (the final ballot, where technical changes can no longer be made). So turning to the IRG document registry, we find that the Review Summary On CJK_Extension B FCD (N744) notes that four ideographs were found to be missing from the FCD, including T7-496B, as per an Errata Report for SuperCJK 10.0 (N738) by Taiwan :



What this is saying is that T7-496B had been proposed for unification with the character that would eventually become U+27499 𧒙, but as the characters are not unifiable T7-496B should be added as a separate character (and in case anyone is wondering what IRG N699 has to say about the character, so do I, but unfortunately this document does not seem to be listed in the IRG document registry). So on the basis of this document (IRG N738) T7-496B was added to the FDIS repertoire at U+272F0, and that's how it managed to sneak into CJK-B at the very last minute. Unfortunately, presumably because T7-496B was originally intended to be unified with U+27499, Taiwan somehow got the glyph for T7-496B wrong in their reference font, with the result that the Taiwan stroke count for U+272F0 is out by six, and the Taiwan source glyph for U+272F0 in the new draft multi-column charts (IRG N1381) is the same as the PRC source glyph for U+27499 (incidentally the correct source glyph for U+272F0 is shown in Super CJK Version 10.2 [32MB] page 675 and Super CJK Version 14.0 [64MB] page 1473).

I guess that once the Taiwan source glyph is corrected and the Taiwan stroke count data is amended it should be the end of the story, but the one thing that nags at me (as is the case with so many characters which only have a single Taiwan source reference) is what is the ultimate source of this character and which texts is it used in ? So if anyone can show me an example of U+272F0 in running text (before it was actually encoded) please let me know.



Addendum 1 [2007-12-10]

In the comments to this post Eric Rasmussen has pointed out that the code charts for the CNS 11643-1992 standard which was the source for U+272F0 are available to download as Appendix G to Ken Lunde's book CJKV Information Processing. The character in question is at Plane 7 Row 41 Column 75 (subtract hex 2020 from the source reference, T7-496B, and convert to decimal to get the row/col position) :


CNS 11643 Plane 7 Row 41 Columns 71-79 (from Ken Lunde's CJKV Information Processing)
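The row/column arithmetic mentioned above can be sketched in a few lines of Python. The function name `cns_position` is hypothetical, and the sketch assumes the source reference always has the shape `T<plane>-<four hex digits>` with a decimal plane number, which is all that is needed for T7-496B.

```python
def cns_position(source_ref):
    """Convert a Taiwan source reference such as 'T7-496B' to (plane, row, column).

    Assumes the form 'T<plane>-<hex code>' with a decimal plane number.
    """
    prefix, code = source_ref.split("-")
    plane = int(prefix[1:])            # drop the leading 'T'
    value = int(code, 16) - 0x2020     # subtract hex 2020 from the two-byte code
    row, col = value >> 8, value & 0xFF  # high byte is the row, low byte the column
    return plane, row, col


print(cns_position("T7-496B"))  # (7, 41, 75)
```

Here 0x496B minus 0x2020 is 0x294B, and 0x29 and 0x4B are 41 and 75 in decimal.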


What Eric noticed was that the characters to the left of it and the characters to the right of it all have 13 residual strokes, which strongly suggests that T7-496B should also have 13 residual strokes, and not 7 residual strokes as shown here. That is to say, it is looking more and more likely that T7-496B should in fact be the Taiwan source reference for U+27499 𧒙, and that U+272F0 𧋰 is a phantom character.
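Eric's observation amounts to spotting the one cell that disagrees with its neighbours. A minimal Python sketch of that check follows; the `row_41` data is an assumption that simply encodes the description above (13 residual strokes in every cell of columns 71-79 except column 75, which shows 7) and was not transcribed from the chart itself.

```python
from collections import Counter

# Residual stroke counts for Plane 7 Row 41, columns 71-79.
# ASSUMED from the description in the text, not read off the code chart:
# 13 everywhere except column 75 (T7-496B), which shows 7.
row_41 = {71: 13, 72: 13, 73: 13, 74: 13, 75: 7, 76: 13, 77: 13, 78: 13, 79: 13}


def odd_ones_out(cells):
    """Return the cells whose count differs from the most common count in the run."""
    majority, _ = Counter(cells.values()).most_common(1)[0]
    return {col: n for col, n in cells.items() if n != majority}


print(odd_ones_out(row_41))
```

With the assumed data, the only cell flagged is column 75, the position of T7-496B.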

But it is probably too late to do anything about it now—if Taiwan were to change the source glyph and request a source reference correction for T7-496B (remapping to U+27499) that would leave U+272F0 an orphan, which is not a happy ending.



Addendum 2 [2007-12-16]

In the comments Matthew Fischer linked to a scanned copy of the official CNS 11643-1992 standard which does indeed show that the glyph for the character at Plane 7 Row 41 Col 75 (i.e. T7-496B) is identical to the glyph for U+27499 :


CNS 11643 Plane 7 Rows 32-41 Columns 72-76 (the official version)


This pretty much clinches the argument I think.



Addendum 3 [2011-05-01]

In the Unicode 6.0 code charts, the glyph for T7-496B at U+272F0 has been changed to 19 strokes with a double 虫 radical, which is essentially the same as the glyph for GKX-1100.13 at U+27499, which in my opinion is not an ideal change :


 

However, in the code charts for the Final Committee Draft of ISO/IEC 10646:2012, which show what the Unicode 6.1 code charts will look like, a new Unicode source reference has been added to both U+272F0 and U+27499, one with a single 虫 radical and one with a double 虫 radical :


 

I'm not sure that this is a better solution, and I still think that the T7-496B source reference should be moved to U+27499.