Thursday, 28 June 2007

The Secret Life of Variation Selectors

One of the most controversial encoding mechanisms provided by Unicode is that of variation selectors. Some people revile them as "pseudo-coding" whilst others are eager to embrace them as a solution for almost every new encoding issue that arises. Personally I think that they provide an essential mechanism for selecting contextual glyph forms in isolation or overriding the default contextual glyph selection in some complex scripts such as Mongolian and Phags-pa, but I am not keen on their use to select simple glyph variants for aesthetic or epigraphic purposes, and I definitely oppose their use as private glyph identifiers.

Recently, with more and more historic scripts being encoded in Unicode, there have been frequent suggestions that variation selectors should be used to standardize the multitude of stylistic letterforms that are often recognised by scholars of ancient scripts, usually with the rationale that epigraphers and palaeographers need to be able to distinguish variations in glyph forms at the encoding level in order to accurately represent ancient texts. As a textual scholar by training I appreciate how important distinctions at the glyph level can be to the dating and analysis of a text, but I really doubt the need to represent stylistic glyph variants at the encoding level; this is usually more usefully achieved with higher-level markup or at the font level. Time and again when discussing the encoding of some ancient script with Dr. X or Professor Y I hear the assertion that the encoded text must be an exact facsimile of the written or inscribed original. My response is that encoded text is not intended as a replacement for facsimile drawings and photographs of manuscripts and inscriptions, and that scholars of ancient texts need to work with both photographic images and electronic text, which serve very different purposes. Thus far we have managed to stave off the demand for glyph-level encoding of historic scripts using variation selectors, but I predict that before long there will be a proliferation of variation sequences for newly encoded historic scripts.

Fundamental Principles

Variation Selectors are a set of 256 characters, FE00..FE0F (VS1..VS16) and E0100..E01EF (VS17..VS256), that can be used to define specific variant glyph forms of Unicode characters. There are also three Mongolian Free Variation Selectors, 180B..180D (FVS1..FVS3), that behave the same as the generic variation selectors but are specific to the Mongolian script. See The Unicode Standard Section 16.4 for more details.

A variation selector may be used to define a variation sequence, which comprises a single base character followed by a single variation selector. The base character must not be either a decomposable character or a combining character, otherwise normalization could change the character to which the variation selector is appended (as we shall see below, this rule was not followed when mathematical variation sequences were first defined).

The most important thing to realise about variation selectors is that they are not intended to provide a generic method for defining glyph variants by all and sundry, but that only those variation sequences specifically defined by Unicode (aka standardized variants) are valid. To put it another way, no conformant Unicode process is allowed to recognise any variation sequence not defined by Unicode (i.e. a conformant Unicode process may not render the base character to which a variation selector is appended any differently to the base character by itself, if the variation sequence is not defined by Unicode).

Of course there is nothing to stop me from defining my own variation sequence, say <0041 FE0F> (A + VS16) to indicate the Barred A that I use to write the "A" of "Andrew", but I should not expect Microsoft or anyone else to support my variation sequence. Having said that, Windows Vista does support some variation sequences that are undefined by Unicode (as we shall see below), and so I hope no-one is advertising Vista as being Unicode-conformant.

At present (Unicode 5.0) Unicode defines variation sequences for various mathematical characters, as well as for the Mongolian and Phags-pa scripts. These are specified in the file StandardizedVariants.txt (also as HTML with glyph images). It is to be expected that the first Han ideographic variants will be defined in Unicode 5.1.
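For anyone who wants to work with the data programmatically, here is a minimal sketch of parsing entries in the StandardizedVariants.txt format. The two sample lines are transcribed from the data file; the semicolon-delimited field layout (sequence; description; # comment) is my reading of the file's format, so treat this as illustrative rather than authoritative.

```python
# Two sample entries in the StandardizedVariants.txt layout:
# <code points> ; <description> ; # <character name comment>
SAMPLE = """\
2229 FE00; with serifs; # INTERSECTION
A856 FE00; phags-pa letter reversed shaping small a; # PHAGS-PA LETTER SMALL A
"""

def parse_standardized_variants(text):
    """Yield (base, selector, description) tuples from data-file lines."""
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # drop the trailing comment
        if not line:
            continue
        seq_field, desc = [f.strip() for f in line.split(';')[:2]]
        base_hex, vs_hex = seq_field.split()
        yield int(base_hex, 16), int(vs_hex, 16), desc

entries = list(parse_standardized_variants(SAMPLE))
```

A conformant process could use such a table as a whitelist: any base-plus-selector pair not present in it must be rendered as if the selector were absent.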

Mathematical Variation Sequences

Unicode defines variation sequences for 15 characters in the Mathematical Operators block [2200..22FF] and 8 characters in the Supplemental Mathematical Operators block [2A00..2AFF]. In all of these cases the variation selector used is U+FE00 (VS1).

Mathematical Variation Sequences

Sequence | Description | Appearance* (base / with VS1)
<2229 FE00> | INTERSECTION with serifs | ∩ / ∩︀
<222A FE00> | UNION with serifs | ∪ / ∪︀
<2268 FE00> | LESS-THAN BUT NOT EQUAL TO with vertical stroke | ≨ / ≨︀
<2269 FE00> | GREATER-THAN BUT NOT EQUAL TO with vertical stroke | ≩ / ≩︀
<2272 FE00> | LESS-THAN OR EQUIVALENT TO following the slant of the lower leg | ≲ / ≲︀
<2273 FE00> | GREATER-THAN OR EQUIVALENT TO following the slant of the lower leg | ≳ / ≳︀
<228A FE00> | SUBSET OF WITH NOT EQUAL TO with stroke through bottom members | ⊊ / ⊊︀
<228B FE00> | SUPERSET OF WITH NOT EQUAL TO with stroke through bottom members | ⊋ / ⊋︀
<2293 FE00> | SQUARE CAP with serifs | ⊓ / ⊓︀
<2294 FE00> | SQUARE CUP with serifs | ⊔ / ⊔︀
<2295 FE00> | CIRCLED PLUS with white rim | ⊕ / ⊕︀
<2297 FE00> | CIRCLED TIMES with white rim | ⊗ / ⊗︀
<229C FE00> | CIRCLED EQUALS with equal sign touching the circle | ⊜ / ⊜︀
<22DA FE00> | LESS-THAN EQUAL TO OR GREATER-THAN with slanted equal | ⋚ / ⋚︀
<22DB FE00> | GREATER-THAN EQUAL TO OR LESS-THAN with slanted equal | ⋛ / ⋛︀
<2A3C FE00> | INTERIOR PRODUCT tall variant with narrow foot | ⨼ / ⨼︀
<2A3D FE00> | RIGHTHAND INTERIOR PRODUCT tall variant with narrow foot | ⨽ / ⨽︀
<2A9D FE00> | SIMILAR OR LESS-THAN with similar following the slant of the upper leg | ⪝ / ⪝︀
<2A9E FE00> | SIMILAR OR GREATER-THAN with similar following the slant of the upper leg | ⪞ / ⪞︀
<2AAC FE00> | SMALLER THAN OR EQUAL TO with slanted equal | ⪬ / ⪬︀
<2AAD FE00> | LARGER THAN OR EQUAL TO with slanted equal | ⪭ / ⪭︀
<2ACB FE00> | SUBSET OF ABOVE NOT EQUAL TO with stroke through bottom members | ⫋ / ⫋︀
<2ACC FE00> | SUPERSET OF ABOVE NOT EQUAL TO with stroke through bottom members | ⫌ / ⫌︀

* If you have a recent version of James Kass's Code2000 installed on your system you should see the difference in appearance between the base character with and without VS1 applied to it (at least it works for me with IE6 or IE7).

Originally, when the set of mathematical variation sequences was defined in Unicode 3.2, there were two additional variation sequences :

  • <2278 FE00> NEITHER LESS-THAN NOR GREATER-THAN with vertical stroke
  • <2279 FE00> NEITHER GREATER-THAN NOR LESS-THAN with vertical stroke

However, as U+2278 and U+2279 are both decomposable characters, if the variation sequences <2278 FE00> and <2279 FE00> are subjected to decomposition (NFD or NFKD) they will change to <2276 0338 FE00> and <2277 0338 FE00> respectively. When this happens VS1 ends up appended to U+0338 COMBINING LONG SOLIDUS OVERLAY, and <0338 FE00> is not a defined variation sequence. Therefore these two variation sequences were undefined in Unicode 4.0 (which I guess answers the question of whether a variation sequence, once defined, can later be undefined). However, due to an unfortunate oversight, the last paragraph of Section 15.4 of The Unicode Standard still suggests that VS1 can be applied to U+2278 and U+2279 (although an erratum for this has now been issued).
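The problem is easy to demonstrate with Python's standard library, since the canonical decomposition of U+2278 is stable Unicode data:

```python
import unicodedata

# U+2278 NEITHER LESS-THAN NOR GREATER-THAN decomposes canonically to
# <U+2276, U+0338>, so NFD strands the variation selector after U+0338
# rather than after its intended base character.
seq = "\u2278\uFE00"                       # <2278 FE00>, as defined in Unicode 3.2
nfd = unicodedata.normalize("NFD", seq)
print([f"U+{ord(c):04X}" for c in nfd])    # VS1 now follows U+0338, not U+2278
```

After normalization the sequence is <2276 0338 FE00>, in which <0338 FE00> is not a defined variation sequence, which is exactly why the decomposable-base rule exists.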

Turning to the general reason for defining these variation sequences in the first place, we find almost no explanation for them in The Unicode Standard (section 15.4). We are asked to "see Section 16.4, Variation Selectors, for more information on some particular variants", but turning to Section 16.4 we find no mention of mathematical variation sequences, much less any information on particular variation sequences. It has been explained to me that mathematical variation sequences have been defined because nobody is quite sure whether there is any semantic difference between the variant glyphs or not; if it were certain that there was a semantic difference between the variant glyphs then the variant forms would have been encoded as separate characters, and conversely, if it were certain that there was no semantic difference then variation sequences would not have been defined for them.

A final important point to note is that whilst the glyph form of a variation sequence is fixed, that of the base character when not part of a variation sequence is not fixed, so that the range of acceptable glyph forms for a particular base character may encompass the glyph form of its standardized variant. For example, although the glyph for <2229 FE00> "INTERSECTION with serifs" must have serifs, this does not mean that the character U+2229 must not have serifs, and depending on the font it may or may not have serifs. In fact, there is no way of selecting "INTERSECTION without serifs" at the encoding level.

Mongolian Variation Sequences

Mongolian variation sequences are formed using the special Mongolian Free Variation Selectors 180B..180D (FVS1..FVS3) rather than the generic variation selectors. Unlike mathematical variation selectors, which seem like a kludge, variation selectors are an essential aspect of the Mongolian encoding model. To understand why they are required you need to understand a little bit about the nature of the Mongolian script, in which most letters have a variety of positional, contextual and semantic glyph forms (see The Unicode Standard Section 13.2 for further details). The glyph form that a particular letter assumes depends upon various factors such as :

  • its position in a word (initial, medial, final or isolate)
  • the gender of the word that it occurs in (masculine or feminine depending upon the vowels in the word, so that, for example, completely different glyph forms of U+182D GA are found in the masculine word jarlig "order" and the feminine word chirig "soldier")
  • what letters it is adjoining to (e.g. U+1822 I is written with a single tooth after a consonant but with a double tooth after a vowel; U+1828 NA in medial position has a dot before a vowel but no dot before a consonant; U+1832 TA and U+1833 DA both take the reclining form before a vowel and the upright form before a consonant)
  • whether the word is a native word or a foreign borrowing (e.g. the glyph form of U+1832 TA and U+1833 DA in medial position in a native word depends upon whether the letter is followed by a vowel or a consonant, but in foreign words U+1832 TA is always written with the upright glyph form, whereas U+1833 DA is always written with the reclining glyph form)
  • whether traditional or modern orthographic rules are being followed (e.g. U+182D GA in the word gal "fire" is written with two dots in modern orthography but with no dots in traditional orthography)

The rendering system should select the correct positional or contextual form of a letter without any need for user intervention (i.e. variation selectors are not normally needed in running text to select glyph forms that the rendering system can predict from context), but for foreign words and words written in traditional orthography the user needs to apply the appropriate variation selector to select the correct glyph form where appropriate.

Variation selectors may also be used to select a particular contextual glyph form of a letter out of context, for example in discussions of the script, where there is a need to display a particular glyph form in isolation.

Not all Mongolian, Todo, Manchu and Sibe letters have glyph forms that need distinguishing by means of variation sequences, but variation sequences are still defined for as many as thirty-eight of the 128 letters in the Mongolian block. In addition to these variation sequences which define contextual glyph forms of letters, there are two variation sequences defined by Unicode where variation selectors are used to select stylistic variants :


With regard to the first of these, I would suggest that U+1880 by itself corresponds to a "candrabindu" (e.g. Devanagari U+0901 and Tibetan U+0F83), whereas the variation sequence <1880 180B> ᢀ᠋ corresponds to an "anusvara" (e.g. Devanagari U+0902 and Tibetan U+0F7E); thus I believe that they are semantically distinct and should have been encoded as separate characters rather than as one character plus a standardized variant. I am not sure about the two forms of the visarga (U+1881 and <1881 180B> ᢁ᠋).

As an aside, one very curious feature about the two characters U+1880 and U+1881 is their names, which both include the unexpected and (in this context) meaningless word "one". My only explanation for this is that at some early stage of the Mongolian character repertoire four characters had been proposed :


But then the "two" characters were redefined as variation sequences of the corresponding "one" character. However, the original names must have been inadvertently left unchanged, with "one" left in the name as a fossil reminder of the time when there were two such characters. But this is pure conjecture; I have not been able to find any support for this theory yet.

The problem with the system of Mongolian variation sequences is that nearly eight years after Mongolian was added to Unicode (3.0 in September 1999) the exact shaping behaviour of Mongolian remains undefined. Although Unicode defines a number of standardized variants for Mongolian, a simple list such as this is not sufficient to implement Mongolian correctly. So when Microsoft decided to support Mongolian in its Vista operating system it had to rely on information on shaping behaviour outside of the Unicode Standard, specifically unpublished draft specifications for Mongolian shaping behaviour from China which in places contradict both themselves and the Unicode Standard with regard to the use of variation selectors.

I have to sympathise with Microsoft, which is in a very difficult position in trying to support a script for which the necessary shaping behaviour specification has long been promised but never delivered, but nevertheless it is very unfortunate that Microsoft did not work with Unicode to write the promised Unicode Technical Report on Mongolian at the same time as it developed its Mongolian implementation. As it stands the Vista implementation of Mongolian is essentially an undocumented and private interpretation of Mongolian shaping behaviour. In particular the Vista implementation (Uniscribe and the Mongolian Baiti font) supports a number of variation sequences that are not defined by Unicode.

The table below lists those variation sequences supported in the Mongolian Baiti font that are undefined by Unicode but which have the same glyph appearance as another defined variation sequence. The seven undefined isolate variants are identical to another positional form of the letter, and can be selected using the appropriate combination of ZWJ and FVS; I do not believe any of them are true isolate forms requiring special variation sequences beyond those already defined for when they occur in a non-isolate position. The two undefined initial variants are identical to the medial forms of the same letter that are selected after NNBSP, and the undefined final variant is identical to the medial form of the same letter that is selected before MVS. I do not think that these are true initial or final forms, and any usage in initial or final position (e.g. when discussing a stem or suffix in isolation) can be dealt with using the existing, defined variation sequences and ZWJ where appropriate (e.g. the suffix ACA that occurs after NNBSP can be represented in isolation as <200D 1820 180C 1834 1820>, without requiring a special initial variant). In summary, not only are none of the variation sequences in the table below sanctioned by Unicode, but in my opinion none of them are required anyway.
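The suffix example above is just an ordinary Unicode string, and its make-up can be inspected with the standard library. This sketch spells out the code points of the isolated-ACA sequence <200D 1820 180C 1834 1820>:

```python
import unicodedata

# The suffix ACA displayed in isolation: the leading ZWJ forces the shaping
# that the suffix would take after NNBSP in running text, and FVS2 selects
# the appropriate variant form of the letter A.
aca = "\u200D\u1820\u180C\u1834\u1820"   # <200D 1820 180C 1834 1820>
for c in aca:
    print(f"U+{ord(c):04X}  {unicodedata.name(c)}")
```

Listing the character names like this is a convenient sanity check that an invisible selector has been placed after the letter it is meant to modify, since variation selectors and ZWJ leave no visible trace in a plain-text editor.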

Undefined Variation Sequences in Mongolian Baiti

Base | Selector | Position | Sequence | Appearance* | Notes
U+1820 | FVS2 | Isolate | <1820 180C> | ᠠ᠌ | Same as the defined second final form
U+1821 | FVS1 | Isolate | <1821 180B> | ᠡ᠋ | Same as the defined second final form
U+1822 | FVS1 | Isolate | <1822 180B> | ᠢ᠋ | Same as the defined final form
U+1824 | FVS1 | Isolate | <1824 180B> | ᠤ᠋ | Same as the defined final form
U+1826 | FVS2 | Isolate | <1826 180C> | ᠦ᠌ | Same as the defined first final form
U+182D | FVS1 | Isolate | <182D 180B> | ᠭ᠋ | Same as the defined feminine medial form
U+1835 | FVS1 | Isolate | <1835 180B> | ᠵ᠋ | Same as the defined second medial form
U+1820 | FVS1 | Initial | <1820 180B 200D> | ᠠ᠋‍ | Same as the defined second medial form (used after NNBSP)
U+1826 | FVS1 | Initial | <1826 180B 200D> | ᠦ᠋‍ | Same as the defined first medial form (used after NNBSP)
U+1828 | FVS2 | Final | <200D 1828 180C> | ‍ᠨ᠌ | Same as the defined third medial form (used before MVS)

* You will need to be running under Vista to see what I intend to be seen.

In addition to the undefined variation sequences in the above table, Mongolian Baiti supports several other undefined variation sequences which are even more problematic.

Firstly, the undefined variation sequence <1840 180B> ᡀ᠋ (Mongolian LHA plus FVS1) produces a glyph which is the same as the letter LA with a circle diacritic. This is not a variant glyph form of Mongolian LHA (in origin a ligature of the letters LA and HA) at all, but is a completely separate letter used in Manchu to transliterate Tibetan LHA (discussed in more detail here). Although this letter was inadvertently omitted from the original set of Mongolian/Todo/Manchu/Sibe letters, it is to be encoded as U+18AA MONGOLIAN LETTER MANCHU ALI GALI LHA in Unicode 5.1. All I can say is that trying to represent an unencoded letter by means of an undefined and unsanctioned variation sequence is a shameful hack that should never have been countenanced by a major vendor and founder member of the Unicode Consortium.

Then there are these four variant forms of U+1800 MONGOLIAN BIRGA :

  • <1800 180B> (FVS1) ᠀᠋ "1st variant"
  • <1800 180C> (FVS2) ᠀᠌ "2nd variant"
  • <1800 180D> (FVS3) ᠀᠍ "3rd variant"
  • <1800 200D> (ZWJ) ᠀‍ "4th variant"

And for those without Vista, these are what I am talking about (1st to 4th variants from left to right) :

Although none of these four birga variants are defined in Unicode, they are defined both in Traditional Mongolian Script in the ISO/IEC 10646 and Unicode Standards (UNU/IIST Report No. 170, August 1999) and in Mengguwen Bianma 蒙古文编码 (2000), a book on Mongolian character encoding by Professor Quejingzhabu which closely follows the UNU/IIST report.

I suspect that the main reason why Unicode did not accept these four variation sequences when it accepted all the other variation sequences defined in UNU/IIST Report No. 170 is that the fourth variation sequence uses U+200D ZERO WIDTH JOINER as a pseudo-variation selector because there are not enough Mongolian Free Variation Selectors for more than three variants of the same positional form of a letter. This abuse of ZWJ was no doubt unacceptable to Unicode, and I imagine that as they couldn't accept three of the variants and reject one of them, they rejected them all until a better solution could be found. Unfortunately, instead of working with Unicode to define an acceptable solution Microsoft uncritically implemented something Unicode had already rejected.

Let us just consider for a moment the wisdom of using ZWJ as a pseudo-variation selector in a script that already uses ZWJ to select positional forms of letters (X-ZWJ, ZWJ-X-ZWJ and ZWJ-X select the initial, medial and final forms of the letter X respectively). As the Mongolian birga is a head mark that occurs at the start of text, it is quite likely to be followed by a Mongolian letter (maybe with whitespace between them, maybe not). Is it not just possible that if a letter with positional forms occurs immediately after the fourth birga variant <1800 200D> the ZWJ will have an adverse effect on the following letter ?

Well yes, it is just possible, under Vista at least. In IE7 the ZWJ acts upon both the preceding birga (U+1800) and following letter A (U+1820), producing the 4th birga variant followed by the final form of the letter A; whereas in simpler applications such as Notepad the ZWJ only acts upon the following letter, producing the standard birga glyph followed by the final form of the letter A (Birga 4th variant plus letter A separated by space is on the left and Birga 4th variant plus letter A not separated by space is on the right) :

And in Word 2007 you get weird behaviour, as seen below where exactly the same three sequences <1800 200D 1820> may end up being rendered differently from each other :

This sort of unpredictable rendering behaviour is no doubt why Unicode rejected <1800 200D> as a variation sequence in the first place, and why Microsoft should never have implemented it. Unfortunately there is a lot more that I could say about the rendering behaviour of Mongolian Baiti, but that would be beyond the scope of this post.

Phags-pa Variation Sequences

As with the Mongolian model, variation selectors (always VS1) are used in the Phags-pa script in order to select a particular contextual glyph form. This mechanism is only actually required in order to represent the Sanskrit Buddhist texts that are engraved in Phags-pa script on the walls of the "Cloud Platform" 雲台 at Juyong Guan 居庸關 Pass at the Great Wall north-west of Beijing, in commemoration of the construction of a Buddhist edifice in 1345. On these very important inscriptions (and nowhere else in the extant Phags-pa corpus) the Sanskrit retroflex letters ṭa, ṭha, ḍa and ṇa are represented by reversed forms of the Phags-pa letters TA, THA, DA and NA (following the example of Tibetan), and as such these four reversed letters are encoded separately from their unreversed counterparts (A869..A86C : TTA, TTHA, DDA and NNA). However, as the stem on these four reversed letters is on the opposite side compared with normal, when other letters follow them they also normally take a reversed glyph form to facilitate joining along the stem. These reversed glyph forms are not phonetically or semantically any different from the corresponding unreversed glyph forms, and so are not encoded separately, but are treated as contextual glyph variants. This contextual reversing affects the following six letters :


These letters exhibit the following reversing behaviour :

  • The letter HA reverses after the letter DDA
  • The letter Subjoined YA reverses after the letter NNA
  • The letters I, U and E reverse after the letters TTA, TTHA, DDA or NNA (or after a reversed Subjoined YA or HA), although the letter I does not always reverse after the letter TTHA
  • The letter Small A normally does not reverse after the letters TTA or TTHA, presumably because a reversed Small A is identical to the letter SHA, but may sometimes be reversed after the letter TTHA

The rendering system should automatically reverse the glyph form of the letters Small A, HA, I, U, E and Subjoined YA when they occur immediately after one of the letters TTA, TTHA, DDA or NNA (or a reversed Small A, HA, I, U, E or Subjoined YA), but variation selectors are needed to display the reversed glyph forms of the letters Small A, HA, I, U, E and Subjoined YA in isolation (for example when discussing the letters of the script) and when the default reversing behaviour needs to be overridden, for example in order to represent those occurrences where the letters Small A and I do not reverse after the letters TTA or TTHA in the Juyong Guan inscriptions.

The six variation sequences defined for these purposes are different from any other variation sequence defined thus far, in that they do not define an absolute glyph form but a relative glyph form :

  • <A856 FE00> phags-pa letter reversed shaping small a
  • <A85C FE00> phags-pa letter reversed shaping ha
  • <A85E FE00> phags-pa letter reversed shaping i
  • <A85F FE00> phags-pa letter reversed shaping u
  • <A860 FE00> phags-pa letter reversed shaping e
  • <A868 FE00> phags-pa letter reversed shaping subjoined ya

By "reversed shaping" is meant that where the rendering system would normally display an unreversed form of the letter, applying VS1 will cause the glyph to be reversed; and conversely, where the rendering system would normally display a reversed form of the letter (e.g. after the letters TTA, TTHA, DDA and NNA), applying VS1 will cause the glyph to be unreversed. By this means the same variation sequence can be used to display a reversed glyph form of a letter in isolation and to inhibit glyph reversal in running text.
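The "relative" semantics described above can be sketched as a toggle over the contextual shaping decision. This is a hypothetical model of my own (the set names TTA_SERIES and REVERSIBLE and the shape function are illustrative, not any real rendering engine's API), but it captures the rule that VS1 flips whichever orientation context would otherwise select:

```python
# Hypothetical sketch of Phags-pa "reversed shaping" semantics.
TTA_SERIES = {0xA869, 0xA86A, 0xA86B, 0xA86C}                 # TTA, TTHA, DDA, NNA
REVERSIBLE = {0xA856, 0xA85C, 0xA85E, 0xA85F, 0xA860, 0xA868}  # Small A, HA, I, U, E, Subjoined YA
VS1 = 0xFE00

def shape(codepoints):
    """Return (codepoint, reversed?) pairs for a Phags-pa letter sequence."""
    out = []
    prev_reversed = False        # does the preceding letter trigger reversal?
    i = 0
    while i < len(codepoints):
        cp = codepoints[i]
        has_vs = i + 1 < len(codepoints) and codepoints[i + 1] == VS1
        if cp in REVERSIBLE:
            contextual = prev_reversed       # reverse after TTA-series / reversed letter
            rendered = contextual ^ has_vs   # VS1 toggles the contextual choice
        else:
            rendered = False
        out.append((cp, rendered))
        prev_reversed = rendered or cp in TTA_SERIES
        i += 2 if has_vs else 1
    return out
```

With this model, <A86A A85E> renders the letter I reversed, <A86A A85E FE00> renders it unreversed, and <A85E FE00> in isolation renders it reversed, matching the three uses described in the text.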

As an example, the Sanskrit word dhiṣṭhite is transliterated as DHISH TTHI TE in the Phags-pa inscriptions at Juyong Guan, but in some cases the letter I of TTHI is reversed and in some cases it is not. These two versions of the word may be represented as :

  • <A84A A85C A85E A85A 0020 A86A A85E 0020 A848 A860> ꡊꡜꡞꡚ ꡪꡞ ꡈꡠ (letter I contextually reversed by the rendering system)
  • <A84A A85C A85E A85A 0020 A86A A85E FE00 0020 A848 A860> ꡊꡜꡞꡚ ꡪꡞ︀ ꡈꡠ (VS1 inhibits contextual reversing of letter I)

Whereas in this context VS1 inhibits contextual reversing of the letter I, we can use the same variation sequence <A85E FE00> in isolation to produce the reversed glyph form of the letter I : ‍ꡞ︀ (preceded by ZWJ to get the final reversed glyph form).

[See Phags-pa Shaping Behaviour for more examples]

Han Ideographic Variation Sequences

For a long time there has been a demand from some quarters for a mechanism to allow vendors and CJK users to register glyph variants of Han ideographs, and in order to accommodate this demand Unicode has recently established an Ideographic Variation Database (IVD). Unlike variation sequences for other scripts, which are individually defined by Unicode, the IVD provides a registration mechanism so that sets of Ideographic Variation Sequences (IVS) can be registered by the "user community" on demand. As long as certain rules are followed and a fee is paid (which Unicode may waive if it so desires), Unicode (as the registration authority) will accept any set of glyph variants that anybody wants to register, without any scrutiny of the appropriateness of the proposed glyph variants -- there is a 90-day public review period, but in my opinion that's just an excuse to move responsibility away from the UTC.

The Variation Selectors Supplement, comprising 240 variation selectors (VS17..VS256), was specially encoded in anticipation of a large number of Han ideographic variants being defined, and ideographic variation sequences are intended to use only these 240 supplementary variation selectors. The door has been left open to define even more variation selectors if 240 variation sequences for a single CJK unified ideograph prove too few.

The first, and so far only, IVD registration application has come from Adobe, who have requested the registration of the entire set of kanji glyphs in their Adobe-Japan1 collection. This is a set of glyphs used by Adobe for fonts for the Japanese market, and includes 14,664 kanji glyphs. Adobe wants to be able to uniquely refer to each of these glyphs at the encoding level (don't ask me why), but as many of the glyphs are from a Unicode perspective unifiable variants it can only do so by means of variation sequences.

Seven of the glyphs in the Adobe-Japan1 collection do not correspond to encoded ideographs, and so have been fast-tracked (by-passing IRG) for encoding in Unicode 5.1 at 9FBC..9FC2. The remaining 14,657 glyphs have been analysed as mapping to a total of 13,262 encoded ideographs (one glyph, CID+19071, maps to both U+29FCE and U+29FD7 !) :

  • 12,040 characters mapped to 1 glyph
  • 1,084 characters mapped to 2 glyphs
  • 120 characters mapped to 3 glyphs
  • 14 characters mapped to 4 glyphs
  • 1 character mapped to 5 glyphs (U+97FF 響)
  • 1 character mapped to 6 glyphs (U+6168 慨)
  • 1 character mapped to 8 glyphs (U+908A 邊)
  • 1 character mapped to 15 glyphs (U+9089 邉)

From this one would have thought that variation sequences would only be required for the 1,222 ideographs that map to more than one glyph in the Adobe set, and even then perhaps only for those glyphs that differ from the standard form of the ideograph, yielding at most 2,618 variation sequences. However, for purposes of forward compatibility (in case additional Adobe glyphs are mapped to characters that currently map to only a single Adobe glyph), and in order to be able to reference every glyph in the set as a variation sequence (don't ask me why), a total of 14,658 variation sequences are being put forward for registration (i.e. a unique variation sequence for every glyph in the Adobe-Japan1 collection, other than the seven unencoded characters, although I presume redundant variation sequences for those seven characters will be added once they are encoded). For the vast majority of the 12,040 ideographs for which only a single ideographic variation sequence is specified, the glyph for the IVS has the same appearance as the standard glyph form of the character; that is, they are variation sequences that define a glyph which is not a variant of the base character and which there is no need to distinguish from any other variant glyph forms.
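The figures quoted above can be cross-checked from the character-to-glyph distribution alone. This short calculation reproduces the totals (14,658 counts the doubly-mapped glyph CID+19071 twice, hence one more than the 14,657 distinct glyphs):

```python
# Adobe-Japan1 distribution: {glyphs-per-character: number of characters}
mapping = {1: 12040, 2: 1084, 3: 120, 4: 14, 5: 1, 6: 1, 8: 1, 15: 1}

chars  = sum(mapping.values())                           # encoded ideographs covered
glyphs = sum(n * c for n, c in mapping.items())          # glyph mappings (sequences)
multi  = sum(c for n, c in mapping.items() if n > 1)     # ideographs with 2+ glyphs
multi_glyphs = sum(n * c for n, c in mapping.items() if n > 1)

print(chars, glyphs, multi, multi_glyphs)    # 13262 14658 1222 2618
```

So the registration covers 13,262 ideographs with 14,658 sequences, of which only 2,618 sequences (for 1,222 ideographs) involve any actual choice between variant glyphs.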

At this point I start seriously worrying about the implications of the Adobe approach to ideographic variation sequences. What if there is an "Adobe-Japan2" collection or an "Adobe-China" collection or an "Adobe-Korean" collection ? Would these collections also require the definition of many thousands of ideographic variation sequences that are not distinguishable from the standard glyph form of the base ideograph ? What if other vendors such as Microsoft or Apple decide to follow Adobe's lead, and define unique ideographic variation sequences for tens of thousands of font glyphs ? As the whole point of the IVD is to ensure that a given variation sequence is used in at most one collection, the same variant (or not-so-variant) glyph in multiple collections will inevitably be defined with different variation sequences, once for every collection it occurs in. It seems to me that the end result of all this will be that many thousands of ideographs will have multiple variation sequences associated with them (one per collection) and that the glyphs for each variation sequence will be practically indistinguishable from each other and from the standard glyph form of the base ideograph.

Looking at the glyphs of the Adobe-Japan1 collection it is evident that in very many cases where a single base ideograph has multiple variation sequences defined, the difference between glyphs is very slight (often just minor differences in stroke formation), and it is hard to see how there could be any practical need to distinguish them at the text level. In some cases the differences between "variant" glyphs are microscopic; for instance, can you differentiate the VS17 and VS18 forms of U+55A9 ?

On the other hand, sometimes the glyph variation is too extreme. One major problem with the collection that was identified during the review period is that the variation sequences for a single ideograph sometimes represent glyph forms that are not unifiable according to the Annex S rules, in particular there are quite a few cases where a Japanese simplified form which has not been encoded is defined by means of a variation sequence as a variant of the encoded non-simplified form. A single example from page 4 should suffice :

  • <56C0 E0100> (VS17) 囀 [4454] = ⿰口轉
  • <56C0 E0101> (VS18) 囀 [14116] = ⿰口轉
  • <56C0 E0102> (VS19) 囀 [20096] = ⿰口転

It has now been clarified that the glyph for any ideographic variation sequence must be within the range of unifiable glyph variation for the base ideograph, and glyphs that would not be unified according to the unification rules may not be treated as variants of the same base ideograph. The text of UTS 37 will be amended accordingly, and a revised list of variation sequences for the Adobe-Japan1 collection will be issued. This means that there will probably be about fifty more characters from the Adobe-Japan1 set that will need encoding, and I have no doubt that, as with the previous seven, they will be fast-tracked (bypassing IRG), and tagged on to the end of the CJK and CJK-A blocks (probably just enough room for them in the BMP).

When I first read UTS 37 I thought that the purpose of the IVD was to provide CJKV users with a mechanism to define glyph variants that, although unifiable from a character-encoding perspective, were required to be distinguished at the text level in certain circumstances, most obviously when used in personal or place names. But having reviewed the Adobe-Japan1 submission it seems that I must have been mistaken. It is evident that this collection of 14,658 ideographic variation sequences has no practical benefit for anyone other than Adobe, will never be supported by anyone other than Adobe, and will never be used in text by the general CJKV user community. In my opinion the collection is required purely to enable Adobe to uniquely identify their fonts' glyphs internally, and not for information interchange, which I personally think is an abuse of ideographic variation sequences. But more than that, this very first IVD registration is going to be seen as a model for what the IVD is intended for, and I am afraid that it will only serve to deter people from registering sensible and useful ideographic variation sequences, for example for the many thousands of Taiwan personal name usage characters, as well as dictionary usage variants. We shall just have to wait and see ...
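At the text level an ideographic variation sequence is nothing more than the base ideograph followed by a variation selector from the supplementary range (VS17..VS256 = U+E0100..U+E01EF). As a minimal sketch, using the <56C0 E0100> sequence from the Adobe-Japan1 list above :

```python
# An ideographic variation sequence (IVS) in plain text: base ideograph
# followed by a variation selector. VS17 is U+E0100, VS18 is U+E0101, etc.
base = "\u56C0"            # 囀, the base ideograph
ivs = base + "\U000E0100"  # <56C0 E0100>, i.e. base + VS17

# The sequence is two code points but should render as a single glyph;
# a VS-unaware process simply ignores the default-ignorable selector.
assert len(ivs) == 2
assert ord(ivs[1]) == 0xE0100  # VS17
```

A process that does not support the Adobe-Japan1 collection falls back to the default glyph for U+56C0, which is precisely why near-identical variant glyphs registered as distinct sequences gain nothing in interchange.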

Thursday, 21 June 2007

A Brief History of CJK-C

In Memoriam Paul Thompson (2007-06-12)


My friend Asmus Freytag (who has just retired from active participation in Unicode after many years of dedication to Unicode and WG2) recently bemoaned the total lack of interest in CJK-C on the public Unicode mailing list. Whilst it is true that there has been little overt interest in the latest addition to the already huge collection of CJKV ideographs in Unicode, behind the scenes a lot of people have been working very hard on reviewing the CJK-C repertoire and resolving issues, and it has generated (and is continuing to generate) a huge volume of email traffic. This post is rather long and, in places, somewhat detailed, reflecting the long hours that I have been occupied by CJK-C over the past few months, so unless you are really interested in CJK unification issues and obscure Han characters I suggest that you read no further, and content yourself with the knowledge that there are problems with the 4,000+ characters of CJK-C, but these will be resolved, and CJK-C will be encoded in Unicode 5.2 (released 2009-10-01).

The Ideographic Rapporteur Group (IRG)

One aspect of the encoding process that I deliberately avoided in my post on Unicode and ISO/IEC 10646 is how new CJK ideographs get added to the standards. The answer is that under WG2 there is an Ideographic Rapporteur Group (IRG) that is responsible for coordinating the encoding of Han ideographs. IRG comprises representatives from those countries and territories that use or historically have used Han ideographs (China, Hong Kong, Japan, Macau, North Korea, South Korea, Taiwan, Vietnam), as well as Unicode.

IRG is responsible for collating submissions from its various members, and producing a unified set of characters to be submitted to WG2 for inclusion in ISO/IEC 10646 (and hence Unicode). Before a set of characters can be submitted to WG2, not only does IRG need to ensure that no duplicate characters are inadvertently encoded, but also that unifiable glyph variants of the same abstract character are not encoded separately.

Although the Unicode code charts only show a single glyph form for each character, 10646 uses multi-column charts for the CJK and CJK-A blocks (but not for CJK-B) that give the source glyph provided by each IRG member for a particular character (in the chart below, under "C" for Chinese, "G" represents China and "T" represents Taiwan). This format enables font developers to design fonts that have the correct glyph form for a particular locale.

Detail of Multi-column code chart in ISO/IEC 10646

A similar multi-column layout is used for CJK-C, but with added columns for M (Macau) and U (Unicode) source glyphs.

Han Unification

Unicode and 10646 have a policy of unifying non-significant glyph variants of the same abstract character (see The Unicode Standard pp.417-421 and ISO/IEC 10646:2003 Annex S). This policy was not applied to the initial set of nearly 21,000 characters included in Unicode 1.0 (those characters in the CJK Unified Ideographs block from U+4E00 to U+9FA5 inclusive), for which the "source separation rule" applied. This rule meant that any characters separately encoded in any of the legacy standards used as the basis for the Unicode collection of unified ideographs would not be unified. Thus, the CJK Unified Ideographs block contains many examples of characters that are normally considered to be interchangeable glyph variants, such as 為 and 爲. Some 250 examples of pairs or triplets of unifiable ideographs encoded separately in Unicode 1.0 due to the source separation rule are included in ISO/IEC 10646:2003 Annex S :

Some Examples of Unifiable Characters in Annex S

The source separation rule does not apply to any of the additions after Unicode 1.0, and so in principle CJK-A and CJK-B should not include any unifiable characters. Unfortunately the quality control for the huge set of 40,000+ characters in CJK-B was not up to standard, with the result that well over a hundred unifiable glyph variants were encoded, as well as five exact duplicates :

  • U+34A8 㒨 = U+20457 𠑗
  • U+3DB7 㶷 = U+2420E 𤈎
  • U+8641 虁 = U+27144 𧅄
  • U+204F2 𠓲 = U+23515 𣔕
  • U+249BC 𤦼 = U+249E9 𤧩

Since then great efforts have been made to improve IRG's quality control process, and Ideographic Description Sequences (IDS) are now used to try to identify and eliminate duplicates and unifiables.
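The idea behind IDS-based checking can be illustrated with a toy sketch (this is not IRG's actual algorithm, just an assumption about the general approach): fold known-unifiable components to a canonical representative, then compare the normalised sequences. The component pairs below are taken from the Annex S examples discussed in this post (亐/亏 and 夲/本).

```python
# Illustrative sketch of IDS matching: normalise each Ideographic
# Description Sequence by mapping unifiable components to a single
# representative, then compare the results. Real IRG checking is far
# more sophisticated; the mapping table here is a tiny assumed sample.
UNIFIABLE = {
    "\u4E90": "\u4E8F",  # 亐 -> 亏 (cf. Annex S: 汚 vs 污)
    "\u5932": "\u672C",  # 夲 -> 本 (cf. Annex S example)
}

def normalise_ids(ids: str) -> str:
    """Fold unifiable components so that unifiable IDSes compare equal."""
    return "".join(UNIFIABLE.get(ch, ch) for ch in ids)

# The CJK-C case of U+2A746: Taiwan's source glyph vs Vietnam's.
taiwan = "⿰亻⿱𡗜亐"
vietnam = "⿰亻⿱𡗜亏"
assert normalise_ids(taiwan) == normalise_ids(vietnam)  # flagged as unifiable
```

A checker built this way is only as good as its component table, which is exactly why the lack of a comprehensive published list of unifiable components (discussed later in this post) is such a problem.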

The CJK-C Repertoire

Work on the CJK-C collection started in 2002, and over 20,000 characters were submitted for inclusion by China, Hong Kong, Japan, North Korea, South Korea, Macau, Taiwan, Vietnam and Unicode. Because of the very long time it was taking to complete the work on such a large number of characters, in 2005 it was decided to reduce the size of the initial "C1" set to about 5,000 characters for encoding as CJK-C as soon as possible, with the remaining characters scheduled for encoding as CJK-D after CJK-C has been processed.

Finally, last autumn the "C1" set of 4,219 characters (representing a unification of 4,600 source characters) was submitted to WG2 for encoding as CJK-C (at code points 2A700..2B77A). This set of CJK-C characters was then added to ISO/IEC 10646:2003 Amd.4, and PDAM4 was submitted for the first round of balloting by P-members of SC2 (see Unicode and ISO/IEC 10646 if this makes no sense to you).

The CJK-C repertoire can be analysed as follows :

  • China [IRG N1227] : 1,127 characters, from the following sources :
    • Ci Hai 辭海 [Sea of Words] : 265 characters
    • Gudai Hanyu Cidian 古代漢語詞典 [Dictionary of Ancient Chinese] : 50 characters
    • Hanyu Dacidian 漢語大詞典 [Great Dictionary of Chinese Words] : 16 characters
    • Hanyu Dazidian 漢語大字典 [Great Dictionary of Chinese Characters] : 1 character
    • Hanyu Fangyan Dacidian 漢語方言大辭典 [Great Dictionary of Chinese Dialects] : 203 characters
    • Kangxi Zidian 康熙字典 [Kangxi Dictionary] : 7 characters
    • Xiandai Hanyu Cidian 現代漢語詞典 [Dictionary of Modern Chinese] : 26 characters
    • Yinzhou Jinwen Jicheng Yinde 殷周金文集成引得 [Concordance of Shang and Zhou Dynasty Bronze Inscriptions] : 367 characters
    • Zhongguo Dabaike Quanshu 中國大百科全書 [Chinese Encyclopedia] : 75 characters
    • Ideographs used by the Chinese Academy of Surveying and Mapping [中國測繪科學院用字] : 55 characters
    • Ideographs used by the Commercial Press [商務印書館用字] : 61 characters
    • Ideographs used in the Founder Press System [方正排版系統] : 1 character
  • Japan [IRG N1225 part 1, IRG N1225 part 2 and IRG N1225 part 3] : 369 characters, representing the following kinds of usage :
    • characters found in the 9th century Shinsen Jikyō 新撰字鏡 dictionary
    • characters found in various modern dictionaries
    • characters used in various literary works
    • characters used in Buddhist sutras
    • characters found in miscellaneous documents
    • characters used for animal names
    • characters used for place names
    • characters used in personal names
  • North Korea : 9 characters from KPS 10721:2000 and KPS 10721:2003
  • South Korea [IRG N1234] : 405 characters, mostly from historical sources such as 朝鮮王朝實錄
  • Macao [IRG N1228] : 16 characters from the Macao Information System Character Set (澳門資訊系統字集), comprising 15 characters used in personal names and one character used for the name of an unspecified chemical
  • Taiwan [IRG N1232 part 1, IRG N1232 part 2 and IRG N1232 part 3] : 1,812 characters, all used in personal names
  • Vietnam [IRGN 1231] : 785 characters from various dictionaries, including :
    • Từ Điển Chữ Nôm 字典<⿰字宁>喃 (2006)
    • Từ Điển Chữ Nôm Tày (2003)
    • Bảng Tra Chữ Nôm Miền Nam <⿰字文>喃沔南 (1994)
  • Unicode [IRGN 1235] : 77 characters from various sources, including :
    • ABC Chinese-English Comprehensive Dictionary (2000)
    • A complete checklist of species and subspecies of the Chinese birds 中国鸟类种和亚种分类名录大全 (2000)
    • A Field Guide to the Birds of China 中国鸟类野外手册 (2000)
    • Mathews Chinese-English Dictionary (1932)
    • A Pocket Dictionary of Cantonese (1975)
    • Songben Guangyun 宋本廣韻
    • Ideographs used by The Church of Jesus Christ of Latterday Saints

I guess that there are three points that I would make about the repertoire.

Firstly, the quality of sources for these characters varies considerably, with some submissions (e.g. those of China and Vietnam) based on well-known dictionaries and other respectable sources, whereas other submissions are little more than lists of characters to be taken on faith. In particular, the thousands of personal name characters submitted by Taiwan are something that I really do not like at all. The Unicode Standard clearly states that it "does not encode idiosyncratic, personal, novel, or private-use characters" (TUS section 1.1), but this is precisely what they are. Now, I have no problem with encoding ideographs used in personal names that are attested in historical sources or have widespread currency because of the fame of the person bearing the name, but the thousands of characters proposed by Taiwan are one-off usages by ordinary people that will, in the vast majority of cases, never be used outside of Taiwan's ID Card system. Some doting parent no doubt thought it cute to name their baby with a character written as ⿰香寶 "fragrant precious" (U+2B648 = TE-4B54), but once the bearer of this name passes into oblivion, the character will no longer be required or used, although it will remain in Unicode for ever (what a pleasant way to achieve immortality). These are ephemeral usages required solely for Taiwan's ID Card system, and in my opinion they should be represented using the PUA. Just how unnecessary it is to encode such characters was driven home three weeks ago, when Taiwan announced that, following a program to issue new ID cards to everyone, it had been discovered that 6,545 proposed characters were no longer in use (both because the bearers of these characters had died or moved abroad, and because Taiwan was now encouraging people to use standard characters on their ID cards) and should be withdrawn from CJK-D.
No doubt if we put off the encoding of CJK-C and CJK-D a few more years we will be able to weed out a few thousand more dead personal name characters. For any other script than CJK, the encoding of personal use characters for a national ID system would not be countenanced, but I suppose that because there are already over 70,000 ideographs encoded the feeling is that adding a few thousand ephemeral characters won't make much difference.

Even within a single submission the quality of evidence adduced varies. For example the Japanese submission provides individual evidence of usage for about two-thirds of the submitted characters, but for many characters there is no indication of where they are used. So, for instance, U+2ABCF [JK-66953] ⿰扌⿱合幸 is given as a "character appearing in other documents", with the unusual range of readings kan, ken, sa, san, ha and uhakkyū, but no indication at all of what document refers to this character, what contexts it is used in or what it means. If it were not that I coincidentally stumbled upon this character recently I would have no idea why it is being proposed for encoding ... as it is I still have no idea what it means, so if anyone does know please tell me.

The second point to make is that the "evidence" provided by the various IRG members varies in quality, with only some members providing examples of usage for each individual proposed character. Vietnam's evidence for its proposed 785 characters comprises nothing more than images of the front covers of the dictionaries from which the characters are taken and a few sample photos of pages from some of these dictionaries (at a resolution that makes them practically illegible). Again, it has to be admitted that characters from no script other than CJK would be admitted to Unicode on the basis of the evidence supplied by Vietnam.

The third point is that information about the proposed characters varies considerably. Japan and Taiwan provide readings for the proposed characters (although the Taiwan readings are toneless), but other IRG members (e.g. South Korea) do not provide either readings or definitions. I am glad to say that starting from CJK-D every single proposed character will need to be supplied with a reading (if known), definition (if known) and source reference. This will be very useful for populating the Unihan database.

Whilst I have not been greatly impressed by the quality of submissions for CJK-C, things do seem to be changing for the better now, as demonstrated by Taiwan's recent submission of 24 characters required for Taiwanese and Hakka (IRG N1305 and appendix) which provides an excellent model for such documents. Hopefully future submissions from all IRG members will be as good as this one.

The Problems with CJK-C

When CJK-C was presented to WG2 last August it was proudly stated that the repertoire had been through more than fifteen rounds of review by IRG members. However, it was only at this stage (as part of the PDAM4 ballot process) that a few dedicated people outside of IRG started to take a very close look at the CJK-C repertoire, resulting in a WG2 document that presented evidence that six of the submitted CJK-C characters were unifiable variants of existing characters. This document was discussed at the recent WG2 meeting in Frankfurt by WG2/IRG members, and it was agreed that two of the characters were definitely unifiable variants and should not be encoded, and that the other four were potential unifiables, which should be removed from CJK-C pending further investigation. The discovery of issues of this magnitude at this late stage of the encoding process sent shock waves through the IRG membership, and the resultant loss of confidence in the quality of CJK-C meant that there was unanimous agreement to move CJK-C out of Amd.4, and put it back to Amd.5 (which is currently under PDAM ballot).

In light of these developments other IRG member bodies started their own review of the CJK-C repertoire, and it soon became apparent that the six characters were only the tip of the iceberg, and that there were many other potentially unifiable characters in CJK-C, the vast majority of which were personal name usage characters submitted by Taiwan. The IRG met at Xi'an in China a couple of weeks ago, and the result of their deliberations was to recommend the removal of 71 characters from CJK-C, eleven removed entirely and sixty moving to CJK-D for further investigation. The final resolution of CJK-C will be made at the next WG2 meeting, to be held at Hangzhou in China in September, and a lot will depend upon the ballot comments of the various interested national bodies.

One of the major problems that has been highlighted by this exercise is the difficulty of identifying unifiable characters, even using IDS matching algorithms, especially as there is no officially published list of unifiable components. Decisions on whether two characters are unifiable or not have up until now been largely based on ISO/IEC 10646:2003 Annex S, which provides over 250 examples of pairs or triplets of unifiable characters encoded separately in Unicode 1.0 due to the source separation rule. However, these are merely examples that through historical accident came to be encoded in Unicode 1.0, and there are many examples of unifiable components that are not included within the Annex S examples, and so often there is no clear precedent for the unification or not of two similar ideographs. In order to help overcome this problem the IRG intends to thoroughly revise Annex S, and to provide a more comprehensive list of unifiable and non-unifiable ideographic components. This should not only help proposers and reviewers determine the unifiability of pairs of characters, but also, when fed into the IDS matching algorithm, help identify problematic characters at an early stage in the encoding process.

Some Examples of Problematic CJK-C Characters

To finish things off, here are some examples of characters in CJK-C that I personally find problematic, some of which have already been addressed by IRG, and some of which are still sub judice, so to speak.

U+2A988 [TC-553A]

U+2A988 :

U+2177B :

U+2A988 <⿰女⿱𡗜亐> is quite obviously a simple glyph variant of U+2177B 𡝻 <⿰女⿱𡗜亏>. U+4E90 亐 and U+4E8F 亏 are unifiable components, as indicated by Annex S where U+6C5A 汚 (U+4E90 component) and U+6C61 污 (U+4E8F component) are given as an example of two characters which would have been unified according to the unification rules but for the fact that they come under the source separation rule.

That U+2A988 should be unified with U+2177B is further evidenced by U+28706 𨜆, which has both <⿰⿱𡗜亐阝> and <⿰⿱𡗜亏阝> source glyphs (see Super CJK Version 14.0 page 1729) :

And in CJK-C the Taiwan source glyph for U+2A746 is <⿰亻⿱𡗜亐>, whereas the Vietnam source glyph for the same character is <⿰亻⿱𡗜亏> :

The fact that the unification of <⿰亻⿱𡗜亐> and <⿰亻⿱𡗜亏> as U+2A746 had been recognised, but the corresponding unification of U+2A988 <⿰女⿱𡗜亐> with U+2177B 𡝻 <⿰女⿱𡗜亏> had not been noticed, is worrying, and indicative of a failure in the original IDS checking algorithm. However, we are all learning from mistakes such as this one, and it is to be expected that the IDS checking algorithm used for CJK-D will be much improved.

U+2ACF5 [TD-4D43]

U+2ACF5 :

U+069D4 :

This is another example of a straightforward unification that should have been picked up long before CJK-C went to ballot. U+2ACF5 <⿰木⿱白本> differs from U+69D4 槔 <⿰木⿱白夲> only by the way in which the bottom right component is written, U+5932 夲 being a common handwritten variant of U+672C 本. Annex S gives U+5932 夲 and U+672C 本 as examples of unifiable characters, and so the IDS checking algorithm should have picked up the unification with U+69D4. But what really amazes me is that this character somehow managed to get into the Taiwan ID Card system as a separate character from U+69D4 槔 in the first place.

U+2AE77 [TD-3F3B]

U+2AE77 :

U+07296 :

This is an example of one of many Taiwan personal name characters in CJK-C that differ by a single stroke from an already encoded character with which they share the same pronunciation. In the case of U+2AE77 <⿱𤇾𠀆> (reading given as "luo" in the Taiwan evidence), the glyph differs from U+7296 犖 <⿱𤇾牛> luò by the omission of one stroke. It may be that the bearer of this character deliberately omits the stroke for some reason best known to himself (perhaps taboo avoidance if the character was also used in the name of a dead relative, or perhaps just to be different), or it may simply be that the ID card on which the name was written was damaged or defaced, leading some Taiwan bureaucrat to mistakenly read 犖 as <⿱𤇾𠀆>. Whatever the reason, I personally believe that characters like U+2AE77 should not be encoded, but treated as unifiable glyph variants of the character that they are mutilations of.

In response to the unification issues relating to characters used for personal names (especially the thousands submitted by Taiwan), it has now been suggested that a separate block be allocated for personal use ideographs, and that ideographs encoded in this block should have less strict unification rules applied to them. This is something that I, and I suspect a lot of other people, would be strongly opposed to. My suggestion would be that the PUA would be the ideal place to put ephemeral personal name characters where a unifiable glyph distinction needs to be preserved.

U+2AEDF [HC100308]


U+072AE :

At first sight U+2AEDF (犬 "dog" with an extra stroke on its right leg) does not look too much like U+72AE 犮 bá, but it does if you look at the source glyphs for U+72AE (ISO/IEC 10646:2003 p.677) :

From this it would seem that U+2AEDF has always been one of the ways of writing U+72AE, so how come it is suddenly up for encoding (an implicit disunification of the two glyph forms of U+72AE) ? The answer is that Hanyu Dacidian 漢語大詞典 [Great Dictionary of Chinese Words] has two separate entries, one for each of the glyphs. The entry for U+72AE 犮 says it is the same as U+2AEDF, but refers the reader to the entry là bá 剌犮 "walking in the manner of a limping dog" :

Then under the entry for U+2AEDF, we read that U+2AEDF either means the same as the character U+62D4 拔 bá "to root out" or is used in the compound word báyǐ <U+2AEDF>乙 "to write in a careless and unrestrained manner" :

From these entries in Hanyu Dacidian it would seem that there is a semantic distinction between U+2AEDF and U+72AE, the former used in the word báyǐ and the latter in the word là bá 剌犮, and thus the disunification of U+72AE into U+72AE and U+2AEDF is justified. However, when we look at the entry for U+2AEDF in the Kangxi Dictionary (there is no entry for U+72AE) we find that the same glyph (U+2AEDF) is used in the senses covered by both U+72AE and U+2AEDF in Hanyu Dacidian :

The Kangxi Dictionary entry confirms that there is no semantic distinction between U+2AEDF and U+72AE, and that the distinction shown in Hanyu Dacidian may be categorised as an editorial mistake. Thus the disunification of the two glyph forms of U+72AE, and the consequent encoding of U+2AEDF, is not justified.

U+2AEEF [G_HC100898]


U+24814 :

At first sight U+2AEEF <⿰犭貟> and U+24814 <⿰犭員> are unifiable glyph variants, as Annex S gives U+8C9F 貟 and U+54E1 員 as unifiable components (see sample image from Annex S given above). But when we look at the Kangxi Dictionary we find that they have different definitions, U+2AEEF being defined as a variant form of U+7328 猨, and U+24814 being defined as a variant form of U+733F 猿 (the "above" character) :

This would seem to suggest that the two characters are in fact non-unifiable on the principle that non-cognate characters are not unified. However, U+7328 猨 and U+733F 猿 are themselves different glyphs for the same character, meaning "ape" (in the Kangxi Dictionary U+733F 猿 is treated as a vulgar variant of U+7328 猨, but in modern Chinese U+733F 猿 is the standard character for "ape"). So if U+7328 猨 and U+733F 猿 refer to the same beast, is there any semantic difference between U+2AEEF and U+24814 (i.e. can we say, U+2AEEF == U+7328, and U+24814 == U+733F, and U+7328 == U+733F, but U+2AEEF != U+24814) ? Probably not, in which case U+2AEEF should not be encoded separately, but unified with U+24814.

The issue in this case is further complicated by the fact that there is already a compatibility ideograph, U+2F927 𤠔 (that is canonically equivalent to U+24814) that has the same glyph shape as U+2AEEF. So in effect, encoding U+2AEEF would be disunifying the two glyph forms of U+2AEEF, but the unfortunate and inevitable result of such a disunification would be to leave U+2F927 with a canonical decomposition mapping to U+24814 when it should be mapped to the new U+2AEEF character (but Unicode stability rules mean that decomposition mappings can never be changed). If you are interested in disunification issues such as this, read N3196 which proposes the disunification of U+4039.
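The stability constraint mentioned above is easy to verify programmatically: the canonical decomposition of the compatibility ideograph U+2F927 is fixed at U+24814, and normalisation will always apply it. A quick check with Python's standard unicodedata module :

```python
import unicodedata

# U+2F927 is a CJK compatibility ideograph whose canonical decomposition
# maps (immutably, per Unicode stability policy) to U+24814 -- even though
# its glyph matches the proposed U+2AEEF rather than U+24814.
assert unicodedata.decomposition("\U0002F927") == "24814"

# Any normalisation (NFC or NFD) therefore silently replaces the
# compatibility ideograph with U+24814 in text.
assert unicodedata.normalize("NFD", "\U0002F927") == "\U00024814"
assert unicodedata.normalize("NFC", "\U0002F927") == "\U00024814"
```

This is why disunifying U+2AEEF cannot repair U+2F927's mapping: normalised text would still collapse the compatibility ideograph to U+24814, not to the new character.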

U+2AFA7 [JK-65424]

U+2AFA7 :

The problem with this character is that the proposed glyph for U+2AFA7 𪾧 <⿸疒⿱非気> does not match the glyph used in the evidence adduced for it, where the character is actually written as <⿸疒⿱非氣> :

U+6C17 気 is the standard Japanese simplification of U+6C23 氣, but I do not think that it is legitimate to adduce as evidence a character with the 氣 component and then ask for the corresponding simplified form with the 気 component to be encoded -- certainly for Chinese simplified characters this is not allowed (the simplified form has to be attested in order to be encoded). Annex S does not indicate that U+6C17 気 and U+6C23 氣 are unifiable components, which implies that they are not unifiable, and therefore that <⿸疒⿱非気> and <⿸疒⿱非氣> are not equivalent.

If we look at another of the Japanese CJK-C submissions (p.55), U+2B27A <⿱艹氣> we see that both the source reference and the proposed glyph are written using the 氣 component, so why does the proposed glyph for U+2AFA7 show the simplified 気 component when its source reference shows the traditional 氣 component ?

Other examples of J-source characters that show a discrepancy between the CJK-C glyph and the glyph shown in the supporting evidence include :

  • U+2A761 [JK-65028] : ⿰亻弱 vs. ⿰亻⿰苟苟 (IRG N1225 part 2 page 41)
  • U+2ACCC [JK-65156] : ⿰市来 vs. ⿰市耒 (IRG N1225 part 3 page 97)
  • U+2B057 [JK-65465] : ⿱禾工 vs. ⿱禾土 (IRG N1225 part 3 page 101)
  • U+2B318 [JK-65704] : ⿰虫集 vs. ⿱⿰虫隹木 (IRG N1225 part 2 page 50)
  • U+2B340 [JK-65723] : ⿰衤昜 vs. ⿰衤⿱𠂉昜 (IRG N1225 part 2 page 51)

We are left to wonder whether any of the J-source characters in CJK-C that have no supporting evidence provided for them have the wrong glyph shape as well. This highlights a wider problem : the correctness of the glyph shape of proposed characters can only be verified if sample images showing the characters in text use are supplied for every proposed character. However, currently this is not being done by all IRG member bodies, and some (such as Vietnam) did not provide any textual evidence at the individual character level for their CJK-C submissions.

U+2B29E [HC101428]

U+2B29E :

U+0452D :

The first thing to note about U+2B29E <⿱艹𡩋> is that although its source reference is Hanyu Dacidian 漢語大詞典 [Great Dictionary of Chinese Words] <HC>, there is no entry for this character in this dictionary. There is only an entry for the very similar U+452D 䔭 <⿱艹甯>, which says "See under dǐng nìng 葶䔭" :

But when we look at the entry for U+8476 葶 we find that the word dǐng nìng 葶䔭 is here written with U+2B29E as its second character :

Clearly, U+2B29E and U+452D are interchangeable glyph variants, and the fact that both variants are used in Hanyu Dacidian rather than either U+2B29E or U+452D consistently would seem to be an editorial oversight.

Looking now at the already encoded pair U+27476 𧑶 <⿰虫𡩋> and U+27457 𧑗 <⿰虫甯>, which have the same relationship as U+2B29E and U+452D, we find that Hanyu Dacidian has an entry for U+27476 𧑶 (vol.8 p.974) but not for U+27457 𧑗, whereas the Kangxi Dictionary has an entry for U+27457 𧑗 (p.1098) but not for U+27476 𧑶. And significantly, U+27476 𧑶 in Hanyu Dacidian corresponds in meaning to U+27457 𧑗 in the Kangxi Dictionary, where they are both defined as a kind of cicada 蟬.

From these two examples, it is clear to me that the phonetic elements U+752F 甯 and U+21A4B 𡩋 can be used interchangeably. However, are they unifiable variants ? I believe that as the difference between U+752F 甯 and U+21A4B 𡩋 is just one of stroke overshoot (see Annex S section S.1.5 b) they are indeed unifiable variants. Note that U+5BD7 寗 is also a specialised variant of U+752F 甯, but in this case the extra stroke probably means that it is not unifiable.

U+2B497 [G_XC2019, TC-2D59]

U+2B497 :

U+090A6 :

The source reference for U+2B497 is Xiandai Hanyu Cidian 現代漢語詞典 [Dictionary of Modern Chinese] (in my opinion the best concise dictionary of Chinese around), where it is given as a variant form of U+90A6 邦 bāng :

The difference between U+2B497 and U+90A6 is one of glyph overshoot (see Annex S section S.1.5 b) and stroke rotation (see Annex S section S.1.5 a), and so according to Annex S these two characters are unifiable glyph variants. Other already encoded characters with U+2B497 as a component are U+22E0C 𢸌, U+26C25 𦰥 and U+22D69 𢵩, in all of which cases the U+2B497 component is surely interchangeable with U+90A6.

U+2B6B8 [TE-435A]

U+2B6B8 :

U+09C49 :

The bottom component of U+2B6B8 (encoded as U+29D4B 𩵋) is a common glyph variant of U+9B5A 魚 "fish" (I remember frequently seeing this variant form of the fish radical in restaurants in Japan, and it is the form of the fish radical used in the source references for U+2B6B1 [JK-66001] and U+2B6C8 [JK-65938]), as seen in these examples from a Japanese dictionary of calligraphy (書道字典) :

This example shows up the weakness of Annex S, as there is nothing in it to suggest that U+29D4B 𩵋 and U+9B5A 魚 are unifiable components, yet anyone who reads Chinese will immediately recognise that U+2B6B8 (a Taiwan personal name character) is a simple glyph variant of U+9C49 鱉. At present the only encoded character with this form of the fish radical is U+29E3A 𩸺 <⿰𩵋隶>, for which luckily there is no corresponding character <⿰魚隶>. To deny that U+29D4B 𩵋 and U+9B5A 魚 are unifiable components would open up the possibility of encoding U+29D4B variants of any or all of the 957 currently encoded characters with the 魚 "fish" radical. In my opinion, it was a mistake to encode U+29E3A, but to encode U+2B6B8 would be a crime.

What's the Solution ?

One common theme that can be seen in these examples is the desire to be able to represent unifiable glyph variants at the encoding level. I can certainly understand that if a dictionary references a glyph variant for a particular character in addition to the standard glyph form of the character, it is not very helpful to tell the dictionary editors and/or users that we won't encode the variant form they need to distinguish from the standard glyph form because it is "unifiable" with the standard form of the character.

As an example, if I wanted to make an on-line version of Xiandai Hanyu Cidian 現代漢語詞典 [Dictionary of Modern Chinese], how would I be expected to deal with the entry for U+90A6 邦 (image shown above), which shows the variant form U+2B497 in parentheses after the main character ? In plain text my entry would look something like :

邦(邦) bāng 国:友~|邻~。

This, of course, makes no sense, as the character in parentheses (U+2B497) is the same as the character it refers to (U+90A6). I can think of several ways of dealing with this problem :

  • Encoding U+2B497 as a separate character
  • Representing U+2B497 with an image
  • Representing U+2B497 with an IDS sequence
  • Specifying a special font for the character in parentheses that has the U+2B497 glyph for U+90A6
  • Representing the U+2B497 glyph form of U+90A6 with a variation sequence

The first of these solutions is obviously something that I have been arguing against, and the middle three solutions are clunky and unacceptable to my mind, so that only leaves us with the final solution, or "pseudo-coding" as my friend Michael Everson would call it. I don't much like the idea of defining variation sequences in order to represent simple glyph variants, but in the case of CJK I think that this is the best solution we have, and I would recommend this approach where there is a demonstrable need to represent distinctions between glyph variants in a dictionary (e.g. for U+2B497 vs. U+90A6, U+2AEEF vs. U+24814 and U+2AEDF vs U+72AE), but not for cases where it is just a matter of wanting to use a particular glyph variant for a particular character (e.g. U+2B29E, which is not used distinctively from U+452D in Hanyu Dacidian).
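To make the variation-sequence option concrete: a variation sequence is simply the base character followed by a variation selector in the text stream, with no new code point involved. In the sketch below, VS17 (U+E0100, the first ideographic variation selector) is used purely as a hypothetical selector for the U+2B497-style glyph of U+90A6 — no such sequence has actually been registered.

```python
# A variation sequence adds no new code point: it is the base character
# followed immediately by a variation selector, which a conforming
# renderer may use to select a particular glyph for the base character.
# VS17 (U+E0100) is a HYPOTHETICAL selector here for the U+2B497-style
# glyph of U+90A6 -- no such sequence is registered in the IVD.
base = "\u90A6"                # 邦
variant = base + "\U000E0100"  # 邦 + VARIATION SELECTOR-17

print(len(variant))              # 2 code points in memory
print(variant.startswith(base))  # True: searches for 邦 still match
```

The practical attraction is visible in the last line: because the base character is unchanged, searching, sorting and collation continue to treat the variant as 邦, while a font that knows the sequence can display the variant glyph.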

For my penultimate post in the current series I am going to be continuing with these Han thingies, but will be looking even further into the future, to CJK-D and beyond. But in the meantime, having touched upon variation selectors in this post I think I shall make a quick detour to examine in greater detail the issues of variation sequences for Maths, Mongolian, Phags-pa and CJK.

Friday, 8 June 2007

What's new in Unicode 5.1 ?

Back in November 2005 I asked What's new in Unicode 5.0 ? in anticipation of its release in July of the following year. Now that Unicode 5.0 has been out for nearly a year I thought it would be a good time to look ahead to what is in store for Unicode 5.1. Just to be clear, Unicode 5.1 won't be released until the spring or summer of 2008, but the character repertoire is already basically fixed, and there are unlikely to be any major changes (but if there are I will update this post). Well in the end there was one major change -- see addendum at bottom of the page [2007-10-19]. See bottom of post for a list of fonts with Unicode 5.1 coverage.

The additions to Unicode 5.1 will correspond to Amendments 3 and 4 of ISO/IEC 10646:2003. A total of 1,102 new characters are added in Amd.3, although four (U+097B, U+097C, U+097E and U+097F) are already in Unicode 5.0, and a total of 526 new characters (originally 636) are expected to be added to Amd.4, so that Unicode 5.1 will have 1,624 additional characters compared with Unicode 5.0, making a grand total of 100,713 encoded characters (graphic, format and control characters) in Unicode, breaking the 100K mark for the first time (and for all those who are worried that 17 planes are just not enough, that still leaves room for another 873,817 characters).
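For the curious, the 873,817 figure can be reconstructed by subtracting the surrogates, noncharacters and private use code points from the 17-plane codespace; a quick sanity check:

```python
# Reconstruction of the "room for another 873,817 characters" figure.
codespace     = 17 * 0x10000      # 1,114,112 code points in 17 planes
surrogates    = 0x800             # 2,048: U+D800..U+DFFF
noncharacters = 32 + 2 * 17       # 66: U+FDD0..FDEF plus 2 per plane
private_use   = 6400 + 2 * 65534  # BMP PUA plus planes 15 and 16
assignable = codespace - surrogates - noncharacters - private_use

print(assignable)           # 974530 potentially assignable code points
print(assignable - 100713)  # 873817 still free after Unicode 5.1
```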

The additions for 5.1 are not as controversial as those for 5.0, and may not be as exciting as 5.2 promises to be, but it will include eleven new scripts [Lanna now postponed to Amd.5], which nearly equals 3.0 as the largest number of scripts added in a single version of Unicode. From 5.1 Unicode will cover 75 scripts (including Braille, which is classified as a script in Unicode), as shown in the table below. Regular readers of my blog will realise that there are still many more historic and less common scripts waiting to be encoded.

Scripts Encoded up to Unicode 5.1

Script Name               | ISO 15924 | Characters* (5.0) | Characters* (5.1) | Version Introduced into Unicode
Canadian Aboriginal       | Cans      | 630               | 630               | 3.0
Coptic                    | Copt      | 128               | 128               | 1.0 (disunified from Greek in 4.1)
Hangul                    | Hang      | 11,620            | 11,620            | 1.0 (relocated in 2.0)
Kayah Li                  | Kali      | 0                 | 48                | 5.1
Tai Tham (formerly Lanna) | Lana      | 0                 | 0                 | 5.2 (postponed from 5.1)
Linear B                  | Linb      | 211               | 211               | 4.0
Myanmar                   | Mymr      | 78                | 156               | 3.0
New Tai Lue               | Talu      | 80                | 80                | 4.1
Ol Chiki                  | Olck      | 0                 | 48                | 5.1
Old Italic                | Ital      | 35                | 35                | 3.1
Old Persian               | Xpeo      | 50                | 50                | 4.1
Syloti Nagri              | Sylo      | 44                | 44                | 4.1
Tai Le                    | Tale      | 35                | 35                | 4.0
Tibetan                   | Tibt      | 195               | 201               | 1.0 (removed in 1.1 and reintroduced in 2.0)

* Numbers of characters do not necessarily represent the total number of encoded characters used for the script (and are not necessarily the same as the number of characters in the same-named block), but are the number of characters that are uniquely assigned to that script by Unicode (i.e. excluding characters that have the Unicode script property of "common" or "inherited"). Some differences in the figures for particular scripts (e.g. Katakana and Latin) reflect changes in script assignment in Unicode 5.1.

For me, the highlights of Unicode 5.1 are the encoding of the symbols on the enigmatic Phaistos Disc (first proposed for encoding ten years ago, but delayed because of some opposition to encoding undeciphered symbols found on a unique artefact), and the encoding of a wide range of letters used in medieval manuscripts and early printed books, so that finally texts such as The Calixtus Bull can be represented exactly as they are written. The script that has had the biggest makeover for 5.1 is Myanmar, with changes to the encoding model to finally make it useable, as well as additions to support minority languages such as Mon, S'gaw Karen, Western Pwo Karen, Eastern Pwo Karen, Geba Karen, Kayah, Shan and Rumai Palaung (see Andrew Cunningham's The Myanmar script and Unicode for a useful overview of support for the Myanmar script). And then there are a handful of Tibetan (U+0FCE, U+0FD2..U+0FD4), Mongolian (U+18AA) and CJK (U+9FC3) characters that I am responsible for, which I am of course pleased to see make it into the standard.

Amendment 3

Amendment 3 is now at the FDAM stage of the ISO ballot process, and its repertoire is fixed, so the code points given below can be relied on. The ISO 15924 code for new scripts is given in square brackets, and the number of new characters is given in curly braces.

New Scripts

  • Sundanese [Sund] {55} at 1B80..1BBF
  • Lepcha [Lepc] {74} at 1C00..1C4F
  • Ol Chiki [Olck] {48} at 1C50..1C7F
  • Vai [Vaii] {300} at A500..A63F
  • Saurashtra [Saur] {81} at A880..A8DF
  • Kayah Li [Kali] {48} at A900..A92F
  • Rejang [Rjng] {37} at A930..A95F
  • Lycian [Lyci] {29} at 10280..1029F
  • Carian [Cari] {49} at 102A0..102DF
  • Lydian [Lydi] {27} at 10920..1093F
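The new Amd.3 blocks listed above can be summarised as a simple range table; a small sketch for classifying a code point by its new block (the ranges are those given in the list, not the full Unicode script property):

```python
# The ten new Amd.3 script blocks as (name, start, end) code point ranges.
AMD3_BLOCKS = [
    ("Sundanese",  0x1B80,  0x1BBF),
    ("Lepcha",     0x1C00,  0x1C4F),
    ("Ol Chiki",   0x1C50,  0x1C7F),
    ("Vai",        0xA500,  0xA63F),
    ("Saurashtra", 0xA880,  0xA8DF),
    ("Kayah Li",   0xA900,  0xA92F),
    ("Rejang",     0xA930,  0xA95F),
    ("Lycian",     0x10280, 0x1029F),
    ("Carian",     0x102A0, 0x102DF),
    ("Lydian",     0x10920, 0x1093F),
]

def amd3_block(cp):
    """Return the name of the new Amd.3 block containing cp, if any."""
    for name, lo, hi in AMD3_BLOCKS:
        if lo <= cp <= hi:
            return name
    return None

print(amd3_block(0x1C5A))  # Ol Chiki
print(amd3_block(0xA500))  # Vai
print(amd3_block(0x0041))  # None (not a new Amd.3 character)
```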

Other New Blocks

Additions to Existing Blocks

Amendment 4

Amendment 4 is now at the FPDAM stage of the ISO ballot process, and its repertoire is unlikely to change significantly, but there may be changes, and the code point allocations could possibly change. The ISO 15924 code for new scripts is given in square brackets, and the number of new characters is given in curly braces.

New Scripts

  • Lanna [Lana] {127} at 1A20..1AAF (now moved to Amd.5)
  • Cham [Cham] {83} at AA00..AA5F

Other New Blocks

Additions to Existing Blocks

What's Not in Unicode 5.1

Egyptian Hieroglyphs (an initial set of 1,063 characters corresponding to Gardiner's Sign List) are not in 5.1, but are in Amd.5 which is currently undergoing its first ballot, and should correspond to Unicode 5.2 (there will probably be several minor versions before Unicode 6.0 is published). Other scripts that are in Amd.5 are Meitei Mayek, Bamum (removed for further study), Tai Viet and Avestan. Amd.5 also includes two new blocks for a set of controversial Old Hangul Jamo.

Not yet ready for inclusion in Unicode 5.2 is Tangut. A first proposal has now been submitted to the UTC, but has not yet reached WG2. Because of the complexity of the Tangut repertoire and probable issues about "ownership" of the script, it may take some time to reach an agreement on encoding Tangut, and so may not be in Unicode for a few more versions yet. [Well, I was wrong about that—it has made it into Amd.6 which means that it is scheduled for inclusion in Unicode 5.2]

However, the big and unexpected hole in 5.1 (Amd.4) is CJK-C, which is the first installment of the tens of thousands of additional Han characters submitted for encoding by members of the Ideographic Rapporteur Group (IRG). This set of 4,219 CJKV ideographs was included in PDAM4, but was moved from Amd.4 to Amd.5 at the last WG2 meeting (in Frankfurt at the end of April). I will look at CJK-C in more detail in my next post.

Addendum [2007-10-19]

At the WG2 meeting in Hangzhou last month (which I had hoped to attend if it was in Ürümqi as originally planned) two important changes to the Amd.4 repertoire were made.

Firstly, 17 additional Myanmar characters (including 10 Shan digits) were added in order to complete the extensions to the Myanmar script required to support the Shan language.

Secondly, the agreement on encoding the Lanna script achieved at the Frankfurt WG2 meeting in the Spring fell apart, with China demanding significant changes to the proposal. The end result was that Lanna was removed from Amd.4, and put back to Amd.5 (this will mean that it will miss the train for Unicode 5.1 next year). In addition, the script name is to be changed to TAI THAM due to objections to the name "Lanna" by China. (There have been a lot of disputes over script names recently, with user communities objecting to traditional English script names such as Pollard and Fraser.)

So now the repertoires of Amds. 3 and 4 have been finalised, and consequently the contents of Unicode 5.1 are now fixed, and will be going beta in the Spring. However, I think that Amd.5 is going to be the interesting one, as it includes both CJK-C and Egyptian hieroglyphs (but with Bamum removed by request of the user community, and Meitei Mayek removed due to fierce differences of opinion on danda disunification within WG2).

Unicode 5.1 Fonts [2008-04-28]

Now that Unicode 5.1 has been released (April 2008) a lot of people want to be able to make use of all the new scripts and characters, but obviously can't if they don't have any fonts that support the new Unicode 5.1 characters. So here is a list of some freeware and shareware fonts that do, with the Unicode 5.1 blocks they cover given in brackets :

  • Aegean (Ancient Symbols, Carian, Lycian, Phaistos Disc)
  • Aegyptus (Lydian)
  • Code2000 (Cham, Cyrillic, Cyrillic Extended-B, Greek, Kayah Li, Latin Extended Additional, Latin Extended-C, Latin Extended-D, Myanmar, Ol Chiki, Rejang, Saurashtra, Supplemental Punctuation, Vai)
  • Code2001 (Domino Tiles, Phaistos Disc)
  • Everson Mono (Ancient Symbols, Combining Diacritical Marks Supplement, Cyrillic, Cyrillic Extended-A, Cyrillic Extended-B, Greek, Latin Extended Additional, Latin Extended-C, Latin Extended-D, Phaistos Disc, Supplemental Punctuation)
  • Padauk (Myanmar)
  • RomanCyrillic Std and CampusRoman Std (Ancient Symbols, Cyrillic Extended-A, Cyrillic Extended-B)
  • Sundanese Unicode (Sundanese)
  • Unicode Symbols (Domino Tiles, Mahjong Tiles)

On Beyond Unicode 5.1 ...

And finally, if you are interested in what will be in the next version of Unicode after 5.1, take a look at What's new in Unicode 5.2 ?.