Thursday, 21 June 2007

A Brief History of CJK-C

In Memoriam Paul Thompson (2007-06-12)

騰蛇游霧,飛龍乘雲,雲罷霧霽,與蚯蚓同,則失其所乘也。



My friend Asmus Freytag (who has just retired from active participation in Unicode after many years of dedication to Unicode and WG2) recently bemoaned the total lack of interest in CJK-C on the public Unicode mailing list. Whilst it is true that there has been little overt interest in the latest addition to the already huge collection of CJKV ideographs in Unicode, behind the scenes a lot of people have been working very hard on reviewing the CJK-C repertoire and resolving issues, and it has generated (and is continuing to generate) a huge volume of email traffic. This post is rather long and, in places, somewhat detailed, reflecting the long hours that I have been occupied by CJK-C over the past few months, so unless you are really interested in CJK unification issues and obscure Han characters I suggest that you read no further, and content yourself with the knowledge that there are problems with the 4,000+ characters of CJK-C, but these will be resolved, and CJK-C will be encoded in Unicode 5.2 (released 2009-10-01).



The Ideographic Rapporteur Group (IRG)

One aspect of the encoding process that I deliberately avoided in my post on Unicode and ISO/IEC 10646 is how new CJK ideographs get added to the standards. The answer is that under WG2 there is an Ideographic Rapporteur Group (IRG) that is responsible for coordinating the encoding of Han ideographs. IRG comprises representatives from those countries and territories that use or historically have used Han ideographs (China, Hong Kong, Japan, Macau, North Korea, South Korea, Taiwan, Vietnam), as well as Unicode.

IRG is responsible for collating submissions from its various members, and producing a unified set of characters to be submitted to WG2 for inclusion in ISO/IEC 10646 (and hence Unicode). Before a set of characters can be submitted to WG2, not only does IRG need to ensure that no duplicate characters are inadvertently encoded, but also that unifiable glyph variants of the same abstract character are not encoded separately.

Although the Unicode code charts only show a single glyph form for each character, 10646 uses multi-column charts for the CJK and CJK-A blocks (but not for CJK-B) that give the source glyph provided by each IRG member for a particular character (in the chart below, under "C" for Chinese, "G" represents China and "T" represents Taiwan). This format enables font developers to design fonts that have the correct glyph form for a particular locale.


Detail of Multi-column code chart in ISO/IEC 10646


A similar multi-column layout is used for CJK-C, but with added columns for M (Macau) and U (Unicode) source glyphs.



Han Unification

Unicode and 10646 have a policy of unifying non-significant glyph variants of the same abstract character (see The Unicode Standard pp.417-421 and ISO/IEC 10646:2003 Annex S). This policy was not applied to the initial set of nearly 21,000 characters included in Unicode 1.0 (those characters in the CJK Unified Ideographs block from U+4E00 to U+9FA5 inclusively), for which the "source separation rule" applied. This rule meant that any characters separately encoded in any of the legacy standards used as the basis for the Unicode collection of unified ideographs would not be unified. Thus, the CJK Unified Ideographs block contains many examples of characters that are normally considered to be interchangeable glyph variants, such as 為 and 爲. Some 250 examples of pairs or triplets of unifiable ideographs encoded separately in Unicode 1.0 due to the source separation rule are included in ISO/IEC 10646:2003 Annex S :


Some Examples of Unifiable Characters in Annex S


The source separation rule does not apply to any of the additions after Unicode 1.0, and so in principle CJK-A and CJK-B should not include any unifiable characters. Unfortunately the quality control for the huge set of 40,000+ characters in CJK-B was not up to standard, with the result that well over a hundred unifiable glyph variants were encoded, as well as five exact duplicates :

  • U+34A8 㒨 = U+20457 𠑗
  • U+3DB7 㶷 = U+2420E 𤈎
  • U+8641 虁 = U+27144 𧅄
  • U+204F2 𠓲 = U+23515 𣔕
  • U+249BC 𤦼 = U+249E9 𤧩

Since then great efforts have been made to improve IRG's quality control process, and Ideographic Description Sequences (IDS) are now used to try to identify and eliminate duplicates and unifiables.
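
To give a feel for what an IDS is from a program's point of view, here is a minimal Python sketch that parses an IDS string into a nested structure. It assumes only the twelve Ideographic Description Characters U+2FF0..U+2FFB (later versions of Unicode may define more), of which U+2FF2 and U+2FF3 take three operands and the rest take two. The function names are my own invention for illustration, and this is not the checking tool that IRG actually uses.

```python
# Ideographic Description Characters (IDCs) assumed here: U+2FF0..U+2FFB.
# U+2FF2 and U+2FF3 take three operands; all the others take two.
TERNARY = {"\u2ff2", "\u2ff3"}


def is_idc(ch):
    """Return True if ch is one of the IDCs assumed by this sketch."""
    return "\u2ff0" <= ch <= "\u2ffb"


def parse_ids(ids):
    """Parse an IDS string into a nested tuple (operator, operand, ...).

    A plain component (anything that is not an IDC) is returned as a
    one-character string.
    """
    def parse(pos):
        if pos >= len(ids):
            raise ValueError("truncated IDS")
        ch = ids[pos]
        if not is_idc(ch):
            return ch, pos + 1
        arity = 3 if ch in TERNARY else 2
        node = [ch]
        pos += 1
        for _ in range(arity):
            child, pos = parse(pos)
            node.append(child)
        return tuple(node), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after IDS")
    return tree


print(parse_ids("⿰扌⿱合幸"))  # ('⿰', '扌', ('⿱', '合', '幸'))
```

Two characters with identical trees are candidate duplicates; spotting unifiable variants additionally requires knowing which components count as "the same", which is where the difficulties discussed later in this post begin.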



The CJK-C Repertoire

Work on the CJK-C collection started in 2002, and over 20,000 characters were submitted for inclusion by China, Hong Kong, Japan, North Korea, South Korea, Macau, Taiwan, Vietnam and Unicode. Because of the very long time it was taking to complete the work on such a large number of characters, in 2005 it was decided to reduce the size of the initial "C1" set to about 5,000 characters for encoding as CJK-C as soon as possible, with the remaining characters scheduled for encoding as CJK-D after CJK-C has been processed.

Finally, last autumn the "C1" set of 4,219 characters (representing a unification of 4,600 source characters) was submitted to WG2 for encoding as CJK-C (at code points 2A700..2B77A). This set of CJK-C characters was then added to ISO/IEC 10646:2003 Amd.4, and PDAM4 was submitted for the first round of balloting by P-members of SC2 (see Unicode and ISO/IEC 10646 if this makes no sense to you).

The CJK-C repertoire can be analysed as follows :

  • China [IRG N1227] : 1,127 characters, from the following sources :
    • Ci Hai 辭海 [Sea of Words] : 265 characters
    • Gudai Hanyu Cidian 古代漢語詞典 [Dictionary of Ancient Chinese] : 50 characters
    • Hanyu Dacidian 漢語大詞典 [Great Dictionary of Chinese Words] : 16 characters
    • Hanyu Dazidian 漢語大字典 [Great Dictionary of Chinese Characters] : 1 character
    • Hanyu Fangyan Dacidian 漢語方言大辭典 [Great Dictionary of Chinese Dialects] : 203 characters
    • Kangxi Zidian 康熙字典 [Kangxi Dictionary] : 7 characters
    • Xiandai Hanyu Cidian 現代漢語詞典 [Dictionary of Modern Chinese] : 26 characters
    • Yinzhou Jinwen Jicheng Yinde 殷周金文集成引得 [Concordance of Shang and Zhou Dynasty Bronze Inscriptions] : 367 characters
    • Zhongguo Dabaike Quanshu 中國大百科全書 [Chinese Encyclopedia] : 75 characters
    • Ideographs used by the Chinese Academy of Surveying and Mapping [中國測繪科學院用字] : 55 characters
    • Ideographs used by the Commercial Press [商務印書館用字] : 61 characters
    • Ideographs used in the Founder Press System [方正排版系統] : 1 character
  • Japan [IRG N1225 part 1, IRG N1225 part 2 and IRG N1225 part 3] : 369 characters, representing the following kinds of usage :
    • characters found in the 9th century Shinsen Jikyō 新撰字鏡 dictionary
    • characters found in various modern dictionaries
    • characters used in various literary works
    • characters used in Buddhist sutras
    • characters found in miscellaneous documents
    • characters used for animal names
    • characters used for place names
    • characters used in personal names
  • North Korea : 9 characters from KPS 10721:2000 and KPS 10721:2003
  • South Korea [IRG N1234] : 405 characters, mostly from historical sources such as 朝鮮王朝實錄
  • Macao [IRG N1228] : 16 characters from the Macao Information System Character Set (澳門資訊系統字集), comprising 15 characters used in personal names and one character used for the name of an unspecified chemical
  • Taiwan [IRG N1232 part 1, IRG N1232 part 2 and IRG N1232 part 3] : 1,812 characters, all used in personal names
  • Vietnam [IRG N1231] : 785 characters from various dictionaries, including :
    • Từ Điển Chữ Nôm 字典<⿰字宁>喃 (2006)
    • Từ Điển Chữ Nôm Tày (2003)
    • Bảng Tra Chữ Nôm Miền Nam <⿰字文>喃沔南 (1994)
  • Unicode [IRG N1235] : 77 characters from various sources, including :
    • ABC Chinese-English Comprehensive Dictionary (2000)
    • A complete checklist of species and subspecies of the Chinese birds 中国鸟类种和亚种分类名录大全 (2000)
    • A Field Guide to the Birds of China 中国鸟类野外手册 (2000)
    • Mathews Chinese-English Dictionary (1932)
    • A Pocket Dictionary of Cantonese (1975)
    • Songben Guangyun 宋本廣韻
    • Ideographs used by The Church of Jesus Christ of Latter-day Saints
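
The figures quoted in this section can be cross-checked with a few lines of Python. The numbers below are copied from the list above and from the code point range given earlier; the dictionary keys are just labels chosen for this sketch.

```python
# Per-source counts for the China submission, as listed above.
china_sources = {
    "Ci Hai": 265,
    "Gudai Hanyu Cidian": 50,
    "Hanyu Dacidian": 16,
    "Hanyu Dazidian": 1,
    "Hanyu Fangyan Dacidian": 203,
    "Kangxi Zidian": 7,
    "Xiandai Hanyu Cidian": 26,
    "Yinzhou Jinwen Jicheng Yinde": 367,
    "Zhongguo Dabaike Quanshu": 75,
    "Academy of Surveying and Mapping": 55,
    "Commercial Press": 61,
    "Founder Press System": 1,
}

# Source characters submitted by each IRG member, as listed above.
members = {
    "China": 1127,
    "Japan": 369,
    "North Korea": 9,
    "South Korea": 405,
    "Macao": 16,
    "Taiwan": 1812,
    "Vietnam": 785,
    "Unicode": 77,
}

# The China sub-totals should add up to the China figure.
assert sum(china_sources.values()) == members["China"]

total = sum(members.values())       # source characters before unification
block = 0x2B77A - 0x2A700 + 1       # code points in the range 2A700..2B77A

print(total, block)  # 4600 4219
```

The two results agree with the statement that the 4,219 encoded characters represent a unification of 4,600 source characters.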

I guess that there are three points that I would make about the repertoire.

Firstly, the quality of sources for these characters varies considerably, with some submissions (e.g. those of China and Vietnam) based on well-known dictionaries and other respectable sources, whereas other submissions are little more than lists of characters to be taken on faith. In particular, the thousands of personal name characters submitted by Taiwan are something that I really do not like at all. The Unicode Standard clearly states that it "does not encode idiosyncratic, personal, novel, or private-use characters" (TUS section 1.1), but this is precisely what they are. Now, I have no problems with encoding ideographs used in personal names that are attested in historical sources or have widespread currency because of the fame of the person bearing the name, but the thousands of characters proposed by Taiwan are one-off usages by ordinary people that will, in the vast majority of cases, never be used outside of Taiwan's ID Card system. Some doting parent no doubt thought it cute to name their baby with a character written as ⿰香寶 "fragrant precious" (U+2B648 = TE-4B54), but once the bearer of this name passes into oblivion, the character will no longer be required or used, although it will remain in Unicode for ever (what a pleasant way to achieve immortality). These are ephemeral usages required solely for Taiwan's ID Card system, and in my opinion they should be represented using the PUA. The complete unnecessity of encoding such characters was driven home just three weeks ago when Taiwan announced that following a program to issue new ID cards to everyone it was discovered that 6,545 proposed characters were no longer in use (both because the bearers of these characters had died or moved abroad, and also because Taiwan was now encouraging people to use standard characters on their ID cards) and should be withdrawn from CJK-D. 
No doubt if we put off the encoding of CJK-C and CJK-D a few more years we will be able to weed out a few thousand more dead personal name characters. For any other script than CJK, the encoding of personal use characters for a national ID system would not be countenanced, but I suppose that because there are already over 70,000 ideographs encoded the feeling is that adding a few thousand ephemeral characters won't make much difference.

Even within a single submission the quality of evidence adduced varies. For example the Japanese submission provides individual evidence of usage for about two-thirds of the submitted characters, but for many characters there is no indication of where they are used. So, for instance, U+2ABCF [JK-66953] ⿰扌⿱合幸 is given as a "character appearing in other documents", with the unusual range of readings kan, ken, sa, san, ha and uhakkyū, but no indication at all of what document refers to this character, what contexts it is used in or what it means. If it were not that I coincidentally stumbled upon this character recently I would have no idea why it is being proposed for encoding ... as it is I still have no idea what it means, so if anyone does know please tell me.

The second point to make is that the "evidence" provided by the various IRG members varies in quality, with only some members providing examples of usage for each individual proposed character. Vietnam's evidence for its proposed 785 characters comprises nothing more than images of the front covers of the dictionaries from which the characters are taken and a few sample photos of pages from some of these dictionaries (and at a resolution that makes them practically illegible). Again, it has to be admitted that characters from no other script than CJK would be admitted to Unicode on the basis of the evidence supplied by Vietnam.

The third point is that information about the proposed characters varies considerably. Japan and Taiwan provide readings for the proposed characters (although the Taiwan readings are toneless), but other IRG members (e.g. South Korea) do not provide either readings or definitions. I am glad to say that starting from CJK-D every single proposed character will need to be supplied with a reading (if known), definition (if known) and source reference. This will be very useful for populating the Unihan database.

Whilst I have not been greatly impressed by the quality of submissions for CJK-C, things do seem to be changing for the better now, as demonstrated by Taiwan's recent submission of 24 characters required for Taiwanese and Hakka (IRG N1305 and appendix) which provides an excellent model for such documents. Hopefully future submissions from all IRG members will be as good as this one.



The Problems with CJK-C

When CJK-C was presented to WG2 last August it was proudly stated that the repertoire had been through more than fifteen rounds of review by IRG members. However, it was only at this stage (as part of the PDAM4 ballot process) that a few dedicated people outside of IRG started to take a very close look at the CJK-C repertoire, resulting in a WG2 document that presented evidence that six of the submitted CJK-C characters were unifiable variants of existing characters. This document was discussed at the recent WG2 meeting in Frankfurt by WG2/IRG members, and it was agreed that two of the characters were definitely unifiable variants and should not be encoded, and that the other four were potential unifiables, which should be removed from CJK-C pending further investigation. The discovery of issues of this magnitude at this late stage of the encoding process sent shock waves through the IRG membership, and the resultant loss of confidence in the quality of CJK-C meant that there was unanimous agreement to move CJK-C out of Amd.4 and put it back to Amd.5 (which is currently under PDAM ballot).

In light of these developments other IRG member bodies started their own review of the CJK-C repertoire, and it soon became apparent that the six characters were only the tip of the iceberg, and that there were many other potentially unifiable characters in CJK-C, the vast majority of which were personal name usage characters submitted by Taiwan. The IRG met at Xi'an in China a couple of weeks ago, and the result of their deliberations was to recommend the removal of 71 characters from CJK-C, eleven removed entirely and sixty moving to CJK-D for further investigation. The final resolution of CJK-C will be made at the next WG2 meeting, to be held at Hangzhou in China in September, and a lot will depend upon the ballot comments of the various interested national bodies.

One of the major problems that has been highlighted by this exercise is the difficulty of identifying unifiable characters, even using IDS matching algorithms, especially as there is no officially published list of unifiable components. Decisions on whether two characters are unifiable or not have up until now been largely based on ISO/IEC 10646:2003 Annex S, which provides over 250 examples of pairs or triplets of unifiable characters encoded separately in Unicode 1.0 due to the source separation rule. However, these are merely examples that through historical accident came to be encoded in Unicode 1.0, and there are many examples of unifiable components that are not included within the Annex S examples, and so often there is no clear precedent for unification or not of two similar ideographs. In order to help overcome this problem the IRG intends to thoroughly revise Annex S, and to provide a more comprehensive list of unifiable and non-unifiable ideographic components. This should not only help proposers and reviewers determine the unifiability of pairs of characters, but when fed into the IDS matching algorithm help identify problematic characters at an early stage in the encoding process.



Some Examples of Problematic CJK-C Characters

To finish things off, here are some examples of characters in CJK-C that I personally find problematic, some of which have already been addressed by IRG, and some of which are still sub judice, so to speak.


U+2A988 [TC-553A]

U+2A988 :

U+2177B :

U+2A988 <⿰女⿱𡗜亐> is quite obviously a simple glyph variant of U+2177B 𡝻 <⿰女⿱𡗜亏>. U+4E90 亐 and U+4E8F 亏 are unifiable components, as indicated by Annex S where U+6C5A 汚 (U+4E90 component) and U+6C61 污 (U+4E8F component) are given as an example of two characters which would have been unified according to the unification rules but for the fact that they come under the source separation rule.

That U+2A988 should be unified with U+2177B is further evidenced by U+28706 𨜆, which has both <⿰⿱𡗜亐阝> and <⿰⿱𡗜亏阝> source glyphs (see Super CJK Version 14.0 page 1729) :

And in CJK-C the Taiwan source glyph for U+2A746 is <⿰亻⿱𡗜亐>, whereas the Vietnam source glyph for the same character is <⿰亻⿱𡗜亏> :

The fact that the unification of <⿰亻⿱𡗜亐> and <⿰亻⿱𡗜亏> as U+2A746 had been recognised, but the corresponding unification of U+2A988 <⿰女⿱𡗜亐> with U+2177B 𡝻 <⿰女⿱𡗜亏> had not been noticed is worrying, and indicative of a failure in the original IDS checking algorithm. However, we are all learning from mistakes such as this one, and it is to be expected that the IDS checking algorithm used for CJK-D will be much improved.
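
I do not know how the IRG's IDS checking algorithm is actually implemented, but the basic idea of catching a case like this can be sketched in Python as follows. The sketch assumes a small table of unifiable components (here just three pairs that this post cites from Annex S), maps each variant component to one representative, and then flags any proposed character whose normalised IDS matches that of an already encoded character. The table, the function names and the tiny sample data are all illustrative assumptions.

```python
# Unifiable component pairs cited in this post (per Annex S), each mapped
# to a single representative. Illustrative only; not an official list.
UNIFIABLE = {
    "亐": "亏",  # U+4E90 -> U+4E8F
    "夲": "本",  # U+5932 -> U+672C
    "貟": "員",  # U+8C9F -> U+54E1
}


def normalize_ids(ids):
    """Replace each unifiable component by its representative."""
    return "".join(UNIFIABLE.get(ch, ch) for ch in ids)


def find_collisions(proposed, encoded):
    """Return (proposed, encoded) code point pairs whose normalised IDS match."""
    index = {}
    for cp, ids in encoded.items():
        index.setdefault(normalize_ids(ids), []).append(cp)
    hits = []
    for cp, ids in proposed.items():
        for other in index.get(normalize_ids(ids), []):
            hits.append((cp, other))
    return hits


# Sample data taken from the examples discussed in this post.
encoded = {0x2177B: "⿰女⿱𡗜亏", 0x69D4: "⿰木⿱白夲", 0x24814: "⿰犭員"}
proposed = {0x2A988: "⿰女⿱𡗜亐", 0x2ACF5: "⿰木⿱白本", 0x2AEEF: "⿰犭貟"}

for new, old in find_collisions(proposed, encoded):
    print(f"U+{new:04X} may be unifiable with U+{old:04X}")
```

A match only means that a pair deserves a human look (the U+2AEEF case discussed below shows that dictionary evidence still has to be weighed), and the check is only as good as the table of unifiable components that it is given.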


U+2ACF5 [TD-4D43]

U+2ACF5 :

U+069D4 :

This is another example of a straightforward unification that should have been picked up long before CJK-C went to ballot. U+2ACF5 <⿰木⿱白本> differs from U+69D4 槔 <⿰木⿱白夲> only by the way in which the bottom right component is written, U+5932 夲 being a common handwritten variant of U+672C 本. Annex S gives U+5932 夲 and U+672C 本 as examples of unifiable characters, and so the IDS checking algorithm should have picked up the unification with U+69D4. But what really amazes me is that this character somehow managed to get into the Taiwan ID Card system as a separate character from U+69D4 槔 in the first place.


U+2AE77 [TD-3F3B]

U+2AE77 :

U+07296 :

This is an example of one of many Taiwan personal name characters in CJK-C that vary from an already encoded character that they share the same pronunciation with by a single stroke. In the case of U+2AE77 <⿱𤇾𠀆> (reading given as "luo" in the Taiwan evidence), the glyph differs from U+7296 犖 <⿱𤇾牛> luò by the omission of one stroke. It may be that the bearer of this character deliberately omits the stroke for some reason best known to himself (perhaps taboo avoidance if the character was also used in the name of a dead relative or perhaps just to be different) or it may simply be that the ID card on which the name was written was damaged or defaced, leading some Taiwan bureaucrat to mistakenly read 犖 as <⿱𤇾𠀆>. Whatever the reason, I personally believe that characters like U+2AE77 should not be encoded, but treated as unifiable glyph variants of the character that they are mutilations of.

In response to the unification issues relating to characters used for personal names (especially the thousands submitted by Taiwan), it has now been suggested that a separate block be allocated for personal use ideographs, and that ideographs encoded in this block should have less strict unification rules applied to them. This is something that I, and I suspect a lot of other people, would be strongly opposed to. My suggestion would be that the PUA would be the ideal place to put ephemeral personal name characters where a unifiable glyph distinction needs to be preserved.


U+2AEDF [HC100308]

U+2AEDF :

U+072AE :

At first sight U+2AEDF (犬 "dog" with an extra stroke on its right leg) does not look too much like U+72AE 犮 bá, but it does if you look at the source glyphs for U+72AE (ISO/IEC 10646:2003 p.677) :

From this it would seem that U+2AEDF has always been one of the ways of writing U+72AE, so how come it is suddenly up for encoding (an implicit disunification of the two glyph forms of U+72AE)? The answer is that Hanyu Dacidian 漢語大詞典 [Great Dictionary of Chinese Words] has a separate entry for each of the two glyphs. The entry for U+72AE 犮 says it is the same as U+2AEDF, but refers the reader to the entry là bá 剌犮 "walking in the manner of a limping dog" :

Then under the entry for U+2AEDF, we read that U+2AEDF either means the same as the character U+62D4 拔 bá "to root out" or is used in the compound word báyǐ <U+2AEDF>乙 "to write in a careless and unrestrained manner" :

From these entries in Hanyu Dacidian it would seem that there is a semantic distinction between U+2AEDF and U+72AE, the former used in the word báyǐ and the latter in the word là bá 剌犮, and thus the disunification of U+72AE into U+72AE and U+2AEDF is justified. However, when we look at the entry for U+2AEDF in the Kangxi Dictionary (there is no entry for U+72AE) we find that the same glyph (U+2AEDF) is used in the senses covered by both U+72AE and U+2AEDF in Hanyu Dacidian :

The Kangxi Dictionary entry confirms that there is no semantic distinction between U+2AEDF and U+72AE, and that the distinction shown in Hanyu Dacidian may be categorised as an editorial mistake. Thus the disunification of the two glyph forms of U+72AE, and the consequent encoding of U+2AEDF is not justified.


U+2AEEF [G_HC100898]

U+2AEEF :

U+24814 :

At first sight U+2AEEF <⿰犭貟> and U+24814 <⿰犭員> are unifiable glyph variants, as Annex S gives U+8C9F 貟 and U+54E1 員 as unifiable components (see sample image from Annex S given above). But when we look at the Kangxi Dictionary we find that they have different definitions, U+2AEEF being defined as a variant form of U+7328 猨, and U+24814 being defined as a variant form of U+733F 猿 (the "above" character) :

This would seem to suggest that the two characters are in fact non-unifiable on the principle that non-cognate characters are not unified. However, U+7328 猨 and U+733F 猿 are themselves different glyphs for the same character, meaning "ape" (in the Kangxi Dictionary U+733F 猿 is treated as a vulgar variant of U+7328 猨, but in modern Chinese U+733F 猿 is the standard character for "ape"). So if U+7328 猨 and U+733F 猿 refer to the same beast, is there any semantic difference between U+2AEEF and U+24814 (i.e. can we say, U+2AEEF == U+7328, and U+24814 == U+733F, and U+7328 == U+733F, but U+2AEEF != U+24814) ? Probably not, in which case U+2AEEF should not be encoded separately, but unified with U+24814.

The issue in this case is further complicated by the fact that there is already a compatibility ideograph, U+2F927 𤠔 (that is canonically equivalent to U+24814) that has the same glyph shape as U+2AEEF. So in effect, encoding U+2AEEF would be disunifying the two glyph forms of U+24814, but the unfortunate and inevitable result of such a disunification would be to leave U+2F927 with a canonical decomposition mapping to U+24814 when it should be mapped to the new U+2AEEF character (but Unicode stability rules mean that decomposition mappings can never be changed). If you are interested in disunification issues such as this, read N3196 which proposes the disunification of U+4039.
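
The practical consequence of that canonical mapping can be seen with Python's standard unicodedata module. This is only a sketch, and the result depends on the version of the Unicode Character Database bundled with the interpreter, but if the mapping is as described above then every normalization form should replace U+2F927 with U+24814, so that the distinctive glyph shape cannot survive normalization in plain text.

```python
import unicodedata

compat = "\U0002F927"  # CJK compatibility ideograph discussed above

# Raw decomposition field from the Unicode Character Database.
print(unicodedata.decomposition(compat))

# Compatibility ideographs have singleton canonical decompositions, so
# all four normalization forms are expected to give the same result.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, compat)
    print(form, " ".join(f"U+{ord(ch):04X}" for ch in normalized))
```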


U+2AFA7 [JK-65424]

U+2AFA7 :

The problem with this character is that the proposed glyph for U+2AFA7 𪾧 <⿸疒⿱非気> does not match the glyph used in the evidence adduced for it, where the character is actually written as <⿸疒⿱非氣> :

U+6C17 気 is the standard Japanese simplification of U+6C23 氣, but I do not think that it is permissible to offer as evidence a character written with the 氣 component and then ask for the corresponding simplified form with the 気 component to be encoded -- certainly for Chinese simplified characters this is not allowed (the simplified form has to be attested to be encoded). Annex S does not indicate that U+6C17 気 and U+6C23 氣 are unifiable components, which implies that they are not unifiable, and therefore that <⿸疒⿱非気> and <⿸疒⿱非氣> are not equivalent.

If we look at another of the Japanese CJK-C submissions (p.55), U+2B27A <⿱艹氣> we see that both the source reference and the proposed glyph are written using the 氣 component, so why does the proposed glyph for U+2AFA7 show the simplified 気 component when its source reference shows the traditional 氣 component ?

Other examples of J-source characters that show a discrepancy between the CJK-C glyph and the glyph shown in the supporting evidence include :

  • U+2A761 [JK-65028] : ⿰亻弱 vs. ⿰亻⿰苟苟 (IRG N1225 part 2 page 41)
  • U+2ACCC [JK-65156] : ⿰市来 vs. ⿰市耒 (IRG N1225 part 3 page 97)
  • U+2B057 [JK-65465] : ⿱禾工 vs. ⿱禾土 (IRG N1225 part 3 page 101)
  • U+2B318 [JK-65704] : ⿰虫集 vs. ⿱⿰虫隹木 (IRG N1225 part 2 page 50)
  • U+2B340 [JK-65723] : ⿰衤昜 vs. ⿰衤⿱𠂉昜 (IRG N1225 part 2 page 51)

We are just left to wonder whether perhaps any of the J-source characters in CJK-C that have no supporting evidence provided for them have the wrong glyph shape as well. This highlights a wider problem, namely that the correctness of the glyph shape of proposed characters can only be verified if sample images showing the characters in text use are supplied for all proposed characters. However, currently this is not being done by all IRG member bodies, and some (such as Vietnam) did not provide any textual evidence at the individual character level for their CJK-C submissions.


U+2B29E [HC101428]

U+2B29E :

U+0452D :

The first thing to note about U+2B29E <⿱艹𡩋> is that although its source reference is Hanyu Dacidian 漢語大詞典 [Great Dictionary of Chinese Words] <HC>, there is no entry for this character in this dictionary. There is only an entry for the very similar U+452D 䔭 <⿱艹甯>, which says "See under dǐng nìng 葶䔭" :

But when we look at the entry for U+8476 葶 we find that the word dǐng nìng 葶䔭 is here written with U+2B29E as its second character :

Clearly, U+2B29E and U+452D are interchangeable glyph variants, and the fact that both variants are used in Hanyu Dacidian rather than either U+2B29E or U+452D consistently would seem to be an editorial oversight.

Looking now at the already encoded pair U+27476 𧑶 <⿰虫𡩋> and U+27457 𧑗 <⿰虫甯>, which have the same relationship as U+2B29E and U+452D, we find that Hanyu Dacidian has an entry for U+27476 𧑶 (vol.8 p.974) but not for U+27457 𧑗, whereas the Kangxi Dictionary has an entry for U+27457 𧑗 (p.1098) but not for U+27476 𧑶. And significantly, U+27476 𧑶 in Hanyu Dacidian corresponds in meaning to U+27457 𧑗 in the Kangxi Dictionary, where they are both defined as a kind of cicada 蟬.

From these two examples, it is clear to me that the phonetic elements U+752F 甯 and U+21A4B 𡩋 can be used interchangeably. However, are they unifiable variants ? I believe that as the difference between U+752F 甯 and U+21A4B 𡩋 is just one of stroke overshoot (see Annex S section S.1.5 b) they are indeed unifiable variants. Note that U+5BD7 寗 is also a specialised variant of U+752F 甯, but in this case the extra stroke probably means that it is not unifiable.


U+2B497 [G_XC2019, TC-2D59]

U+2B497 :

U+090A6 :

The source reference for U+2B497 is Xiandai Hanyu Cidian 現代漢語詞典 [Dictionary of Modern Chinese] (in my opinion the best concise dictionary of Chinese around), where it is given as a variant form of U+90A6 邦 bāng :

The difference between U+2B497 and U+90A6 is one of glyph overshoot (see Annex S section S.1.5 b) and stroke rotation (see Annex S section S.1.5 a), and so according to Annex S these two characters are unifiable glyph variants. Other already encoded characters with U+2B497 as a component are U+22E0C 𢸌, U+26C25 𦰥 and U+22D69 𢵩, in all of which cases the U+2B497 component is surely interchangeable with U+90A6.


U+2B6B8 [TE-435A]

U+2B6B8 :

U+09C49 :

The bottom component of U+2B6B8 (encoded as U+29D4B 𩵋) is a common glyph variant of U+9B5A 魚 "fish" (I remember frequently seeing this variant form of the fish radical in restaurants in Japan, and it is the form of the fish radical used in the source references for U+2B6B1 [JK-66001] and U+2B6C8 [JK-65938]), as seen in these examples from a Japanese dictionary of calligraphy (書道字典) :

This example shows up the weakness of Annex S, as there is nothing in it to suggest that U+29D4B 𩵋 and U+9B5A 魚 are unifiable components, yet anyone who reads Chinese will immediately recognise that U+2B6B8 (a Taiwan personal name character) is a simple glyph variant of U+9C49 鱉. At present the only encoded character with this form of the fish radical is U+29E3A 𩸺 <⿰𩵋隶>, for which luckily there is no corresponding character <⿰魚隶>. To deny that U+29D4B 𩵋 and U+9B5A 魚 are unifiable components would open up the possibility of encoding U+29D4B variants of any or all of the 957 currently encoded characters with the 魚 "fish" radical. In my opinion, it was a mistake to encode U+29E3A, but to encode U+2B6B8 would be a crime.



What's the Solution ?

One common theme that can be seen in these examples is the desire to be able to represent unifiable glyph variants at the encoding level. I can certainly understand that if a dictionary references a glyph variant for a particular character in addition to the standard glyph form of the character, it is not very helpful to tell the dictionary editors and/or users that we won't encode the variant form they need to distinguish from the standard glyph form because it is "unifiable" with the standard form of the character.

As an example, if I wanted to make an on-line version of Xiandai Hanyu Cidian 現代漢語詞典 [Dictionary of Modern Chinese], how would I be expected to deal with the entry for U+90A6 邦 (image shown above), which shows the variant form U+2B497 in parentheses after the main character? In plain text my entry would look something like :

邦(邦) bāng 国:友~|邻~。

This, of course, makes no sense, as the character in parentheses (U+2B497) is the same as the character it refers to (U+90A6). I can think of several ways of dealing with this problem :

  • Encoding U+2B497 as a separate character
  • Representing U+2B497 with an image
  • Representing U+2B497 with an IDS sequence
  • Specifying a special font for the character in parentheses that has the U+2B497 glyph for U+90A6
  • Representing the U+2B497 glyph form of U+90A6 with a variation sequence

The first of these solutions is obviously something that I have been arguing against, and the middle three solutions are clunky and unacceptable to my mind, so that only leaves us with the final solution, or "pseudo-coding" as my friend Michael Everson would call it. I don't much like the idea of defining variation sequences in order to represent simple glyph variants, but in the case of CJK I think that this is the best solution we have, and I would recommend this approach where there is a demonstrable need to represent distinctions between glyph variants in a dictionary (e.g. for U+2B497 vs. U+90A6, U+2AEEF vs. U+24814 and U+2AEDF vs. U+72AE), but not for cases where it is just a matter of wanting to use a particular glyph variant for a particular character (e.g. U+2B29E, which is not used distinctively from U+452D in Hanyu Dacidian).
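
To make the variation sequence idea concrete, here is a small Python sketch of what the dictionary entry might look like at the code point level. It assumes, purely for illustration, that the variant glyph is written as U+90A6 followed by VARIATION SELECTOR-17 (U+E0100); as far as I know no such sequence has been defined or registered, so the choice of selector is hypothetical, as is the helper function.

```python
BASE = "\u90A6"        # 邦
VS17 = "\U000E0100"    # VARIATION SELECTOR-17, chosen for illustration only


def strip_variation_selectors(text):
    """Drop variation selectors (U+FE00..U+FE0F and U+E0100..U+E01EF)."""
    return "".join(
        ch for ch in text
        if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF)
    )


# The headword is the bare character; the variant in parentheses is the
# same character followed by the (hypothetical) variation selector.
entry = f"{BASE}({BASE}{VS17}) bāng 国:友~|邻~。"
head, variant = entry[0], entry[2:4]

print(head == variant)                             # False
print(strip_variation_selectors(variant) == head)  # True
```

The two forms are distinct in the stored text, so the entry makes sense again, yet a search or collation process that ignores variation selectors still treats them as the same character, which is exactly the behaviour one wants for unifiable glyph variants.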

For my penultimate post in the current series I am going to be continuing with these Han thingies, but will be looking even further into the future, to CJK-D and beyond. But in the meantime, having touched upon variation selectors in this post I think I shall make a quick detour to examine in greater detail the issues of variation sequences for Maths, Mongolian, Phags-pa and CJK.


18 comments:

Wayne Steele said...

I think the root problem is apparent: The unicode system of encoding individual ideographic characters is problematic, at best.
From what I've heard, the native-language users of these scripts are not happy with it either.
Hasn't there been work on "stroke-level" encoding for these things? How is it going?

Sean Burke said...

The use of variation sequences seems like it would also be a perfectly acceptable solution to the personal names problem, rather than using the PUA. It seems to me that the system of variation sequences is designed to solve problems exactly like that. It would probably even be good to register them in the IVD. On which topic, I'm very curious to see why you think that the Adobe collection misuses variation selectors. It seems to me like it does exactly what it should.

In response to Wayne's comment: stroke-level encoding would probably be problematic in many more ways. Every character to be represented would need a list of strokes (or radicals) and good positioning information. It would add a great deal of complexity to the situation and, as far as I can see, wouldn't add much in the way of benefit.

DopefishJustin said...

"So in effect, encoding U+2AEEF would be disunifying the two glyph forms of U+2AEEF, with the inevitable result that U+2F927 retains a canonical decomposition mapping to U+2AEEF."

That should be "retains a canonical decomposition mapping to U+24814", surely. (I had to read that sentence about five times trying to make sense out of it....)

Andrew West said...

Wayne,

I think the root problem is apparent: The unicode system of encoding individual ideographic characters is problematic, at best.

If at the beginning anyone could have anticipated that CJK would eventually swell to over 100,000 characters then perhaps a compositional model of CJK encoding would have been considered, but even so, because of the technology limitations at the time, a compositional model would not have replaced the encoding of unitary ideographic characters but, initially at least, co-existed with the encoding of unitary ideographs.

Perhaps, in a different universe, Unicode 1.0 would have defined a set of CJK components (radicals and phonetics) as well as 20,000 unitary ideographs with canonical decompositions to series of CJK components (much in the same way that accented and diacritical letters were dealt with). Then, no new unitary ideograph would be encoded if it could be represented using the compositional model. Such an encoding model would have curbed the unstoppable growth of CJK that we now have.

However, there are several problems with a compositional model, the most important of which is the fact that a compositional model would bloat CJK text by at least a factor of three. If we imagine that a compositional model would use an approach similar to the Ideographic Description Sequences (IDS) that are currently used to *describe* ideographic characters (as I have used in this post), then in addition to individual components there would need to be layout controls (like the current IDC characters), so that most characters would be encoded as three characters (e.g. ⿱宀子 for U+5B57 字), and more complicated characters as longer strings of layout controls and components. Back in the 90s bloating of CJK text to this extent would not have been acceptable to CJK users.
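
The size difference is easy to check. The sketch below compares the unitary character U+5B57 with the three-character description ⿱宀子 given above; the `sizes` helper is a hypothetical name introduced only for this illustration, and real IDS are descriptions rather than an encoding, so this only approximates what a compositional model might cost.

```python
unitary = "\u5B57"               # 字 as a single encoded ideograph
composed = "\u2FF1\u5B80\u5B50"  # ⿱宀子: one IDC plus two components

def sizes(text):
    """Report the length of text in code points, UTF-8 bytes and UTF-16 bytes."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-be")),
    }

print(sizes(unitary))   # {'code_points': 1, 'utf8_bytes': 3, 'utf16_bytes': 2}
print(sizes(composed))  # {'code_points': 3, 'utf8_bytes': 9, 'utf16_bytes': 6}
```

Even for this simple two-component character the text is three times as long in every encoding form, and characters with nested structure would fare worse.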

Another issue with such a model is that for complex characters there may be several different decompositional sequences, and even if the shortest sequence should be used, people may well end up using different sequences for the same character. This model also makes things more difficult for fonts and software such as editors (which would have to ensure that users had the illusion of working with unitary ideographs).
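
As an illustration of the ambiguity, here are two plausible descriptions of U+6E56 湖, one treating 胡 as a unit and one splitting it into 古 and 月. This is only a sketch: the choice of example and the `CANONICAL` lookup table are assumptions made for illustration, not anything defined by Unicode.

```python
import unicodedata

ids_a = "\u2FF0\u6C35\u80E1"        # ⿰氵胡
ids_b = "\u2FF2\u6C35\u53E4\u6708"  # ⿲氵古月

# The two sequences are different strings, and Unicode normalization
# does nothing to bring them together.
print(ids_a == ids_b)  # False
print(unicodedata.normalize("NFC", ids_a) == unicodedata.normalize("NFC", ids_b))  # False

# Hypothetical table that software would need in order to treat both
# sequences as the same character.
CANONICAL = {ids_a: "\u6E56", ids_b: "\u6E56"}
print(CANONICAL[ids_a] == CANONICAL[ids_b])  # True
```

Searching, sorting and comparison would all depend on some table or algorithm of this kind being applied consistently by every implementation.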

So, all in all, I think that even if people had realised how far CJK would grow it is unlikely that a decompositional model of CJK ideographs would have been chosen. At this stage in the game it is unrealistic to expect a revolutionary change to the encoding model, so we have to live with what we've got.

From what I've heard, the native-language users of these scripts are not happy with it either.

I don't think that this is the case, and indeed I suspect that most native CJK users would be unhappy with a decompositional model.

Hasn't there been work on "stroke-level" encoding for these things? How is it going?

Not that I know of. A set of CJK stroke characters has been encoded, but these are not intended for a stroke-level encoding.

Andrew West said...

Sean,

The use of variation sequences seems like it would also be a perfectly acceptable solution to the personal names problem, rather than using the PUA.

Yes, I agree, but it has been difficult to get the IRG to buy into ideographic variation sequences.

It seems to me that the system of variation sequences is designed to solve problems exactly like that. It would probably even be good to register them in the IVD.

I agree, personal name usage is one of the stated reasons for the IVD.

On which topic, I'm very curious to see why you think that the Adobe collection misuses variation selectors. It seems to me like it does exactly what it should.

I'll try to answer that in a future post.

Andrew West said...

That should be "retains a canonical decomposition mapping to U+24814", surely.

Yes, indeed. Thanks for pointing that out. I'll try to make the sentence a little clearer as well.

Sean Burke said...

A question about CJK-B: is it possible to get a hold of the source glyphs and associate them with characters in any reasonable manner? The absence of the charts would make comprehensive font creation a lot more difficult and it seems like it would be useful to create even a hacked-together version of this. As far as I can tell, though, there's no posting of the source glyphs anywhere accessible. That's the biggest hurdle, though if it was possible to cross-reference them in some sort of automated fashion it would be even better. Anyone have any insight on this?

minkymomo said...

Just so you know...
If you mean, for example, U+20457, you can just type:
&#x20457;

You're currently trying to use surrogates in your XHTML file,
which is illegal in XML and doesn't work for a standard-compliant browser.

(1) You MUST NOT use a surrogate as a separate character ("Char") in XML.
http://www.w3.org/TR/REC-xml/#charsets

(2) Characters referred to by character references (&#...) MUST be Char,
hence &# + a number for a surrogate is a no-no.
http://www.w3.org/TR/REC-xml/#sec-references

(3) More generally, your file is in UTF-8 and surrogates are illegal in UTF-8.
I'm pretty sure that you know that very well, and I guess maybe you're doing that on purpose,
like CESU-8, to make old MSIE happy, but that incorrect approach doesn't work for Firefox :(

How about respecting the standards at least when you're talking about the standards ? :P

Thanks always for interesting reading anyway :)

Andrew West said...

Thanks for the explanation. Actually you should be directing the comment to the people at Blogspot or Blogger (or whatever they call themselves today) rather than me. I write the post in legal UTF-8, paste it into the provided edit box and hit "Publish" -- and when it comes out at the other end it has been mangled as you have observed, with characters needlessly replaced by character references, and supra-BMP characters illegally converted into pairs of surrogate character references.

Having just confirmed that Firefox does not render the pairs of surrogate character references as a single character, I will try to see what I can do to fix the problem -- even though I must stress that I am not responsible for it.
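
For anyone wondering what the mangling looks like in practice, the following sketch computes the UTF-16 surrogate pair for U+20457 and shows why the pair has no place in a UTF-8 document. The function name `to_surrogates` is a hypothetical helper; the arithmetic is the standard UTF-16 algorithm.

```python
def to_surrogates(cp):
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

high, low = to_surrogates(0x20457)
print(f"&#x{high:X};&#x{low:X};")  # &#xD841;&#xDC57;  (the illegal form)
print("&#x20457;")                 # the single, legal reference

# The character itself encodes to four bytes of UTF-8 ...
print(len("\U00020457".encode("utf-8")))  # 4

# ... but a lone surrogate cannot be encoded as UTF-8 at all.
try:
    chr(high).encode("utf-8")
except UnicodeEncodeError:
    print("surrogates are not valid in UTF-8")
```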

Andrew West said...

Well, I tried replacing all the SIP characters in the source file with character references (single character references of course), but when I republished the post all these single character references had been converted to pairs of surrogate character references. I'm really not sure what I can do about this other than publish a copy of the post on my website in legal XHTML.
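
For a copy hosted somewhere that does not rewrite the markup, the pairs could in principle be folded back into single references after the fact. The following is only a sketch of one way to do that, not a description of what was actually done; `merge_surrogate_refs` and `_value` are hypothetical helpers, and the code assumes the references appear back to back with nothing between them.

```python
import re

# Matches one numeric character reference, decimal or hexadecimal.
REF = re.compile(r"(&#(?:x[0-9A-Fa-f]+|[0-9]+);)")

def _value(ref):
    """Return the code point named by a reference such as '&#xD841;'."""
    body = ref[2:-1]
    return int(body[1:], 16) if body.startswith("x") else int(body)

def merge_surrogate_refs(markup):
    """Replace each high+low surrogate reference pair with a single reference."""
    parts = REF.split(markup)  # odd indices hold references, even indices hold text
    out, i = [], 0
    while i < len(parts):
        # A candidate pair is two references with empty text between them.
        if i % 2 == 1 and i + 2 < len(parts) and parts[i + 1] == "":
            hi, lo = _value(parts[i]), _value(parts[i + 2])
            if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
                cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
                out.append(f"&#x{cp:X};")
                i += 3
                continue
        out.append(parts[i])
        i += 1
    return "".join(out)

print(merge_surrogate_refs("a&#x41;&#xD841;&#xDC57;b"))  # a&#x41;&#x20457;b
```

Unpaired surrogate references are left untouched, since there is no sensible character to substitute for them.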

minkymomo said...

I see. Then... I guess it can't be helped :sad:

Andrew West said...

I have mirrored this post at http://www.babelstone.co.uk/Blog/brief-history-of-cjk-c.html. Let me know if that has the same display problems or not.

minkymomo said...

I'm not sure if I could post a comment. I said the mirrored one is working fine, much better than here. And I also said it's very difficult and time-wasting to post a comment here for a firefox user (at least for me).

minkymomo said...

There, unlike here, Firefox on my Windows XP correctly shows glyphs for: The first 5 non-BMP characters including U+20457; U+215DC used to explain U+2A988; U+2177B, and U+215DC used to explain it; many similar non-BMP characters in that section and following sections.

Honestly, I love your articles and programs but I really hate to post a comment here. It's least usable, actually literally unusable (because the CAPTCHA image is not shown), for Firefox users, at least for me. I have to use IE just to post a comment. I don't hate MSIE itself, but I've never experienced something as annoying as this when posting a comment, like this tiny, inconvenient, unresizable pop-up. I know it's not your fault at all, but still...

Also, validator.w3.org reports more than 1000 errors for this page, but only 76 errors and 1 warning for the "mirrored" page. It's still invalid as XHTML, but at least it doesn't have "reference to non-SGML character" errors, so it doesn't make you look bad :) Here, people not knowing of you at all might misunderstand you as someone discussing subtle problems in Unicode when he/she can't even make a valid UTF-8 document themselves. Well, maybe not. That's an exaggeration. :)

Andrew West said...

Although I use (and prefer) IE8 as my browser, I try to be accessible to all browsers. In fact, checking my blog stats I see that 40% of recent visitors use Firefox compared with 39% for IE 6/7/8 combined. I've had complaints about the comment window before, so as an experiment I have changed the settings to use a full-size window (although the comment box is still just as small) and removed the captcha (but if I get an increase in comment spam I will put it back). Hope this helps a little.

Andrew West said...

Well, I can certainly make a valid UTF-8 document, but it is damned hard to make a valid XHTML document. I originally wrote my posts in html, not realising that they were wrapped up in an xhtml wrapper, so most of the early posts are very invalid xhtml. I now try to write in valid xhtml syntax for new posts and convert old posts when I update them. But I'm afraid I must still use some old html habits that are not valid in xhtml. I will try to reform my ways, but struggling over the details of what is or is not allowed or required in xhtml is not something that I relish.

Andrew West said...

I've fixed the two mirrored blog posts to be fully valid xhtml, and will try to eventually get round to converting all my blog posts and web pages to valid xhtml, but it may take some time.

I have also added a comment link to the bottom of the mirrored posts.

Thanks for all your help.

JAEMIN said...

So do you know what happened to encoding personal name characters separately? Is it accepted?