Thursday, 28 June 2007

The Secret Life of Variation Selectors

One of the most controversial encoding mechanisms provided by Unicode is that of variation selectors. Some people revile them as "pseudo-coding" whilst others are eager to embrace them as a solution for almost every new encoding issue that arises. Personally I think that they provide an essential mechanism for selecting contextual glyph forms in isolation or overriding the default contextual glyph selection in some complex scripts such as Mongolian and Phags-pa, but I am not keen on their use to select simple glyph variants for aesthetic or epigraphic purposes, and I definitely oppose their use as private glyph identifiers.

Recently, with more and more historic scripts being encoded in Unicode, there have been frequent suggestions that variation selectors should be used to standardize the multitude of stylistic letterforms that are often recognised by scholars of ancient scripts, usually with the rationale that epigraphers and palaeographers need to be able to distinguish variations in glyph forms at the encoding level in order to accurately represent ancient texts. As a textual scholar by training I appreciate how important distinctions at the glyph level can be to the dating and analysis of a text, but I really doubt the need to represent stylistic glyph variants at the encoding level. This is usually more usefully achieved with higher level markup or at the font level. Time and again when discussing the encoding of some ancient script with Dr. X or Professor Y I hear the assertion that the encoded text must be an exact facsimile of the written or inscribed original, to which my response is that encoded text is not intended as a replacement for facsimile drawings and photographs of manuscripts and inscriptions, and that scholars of ancient texts need to work with both photographic images and electronic text, which serve very different purposes. Thus far we have managed to stave off the demand for glyph level encoding of historic scripts using variation selectors, but I predict that before long there will be a proliferation of variation sequences for newly encoded historic scripts.



Fundamental Principles

Variation Selectors are a set of 256 characters, FE00..FE0F (VS1..VS16) and E0100..E01EF (VS17..VS256), that can be used to define specific variant glyph forms of Unicode characters. There are also three Mongolian Free Variation Selectors, 180B..180D (FVS1..FVS3), that behave the same as the generic variation selectors but are specific to the Mongolian script. See The Unicode Standard Section 16.4 for more details.
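
As a rough illustration of how these ranges map onto the VS and FVS numbers, here is a minimal Python sketch. The function name `variation_selector_name` is my own invention for this example and is not part of any Unicode library.

```python
from typing import Optional


def variation_selector_name(cp: int) -> Optional[str]:
    """Return 'VSn' or 'FVSn' for a variation selector code point, else None."""
    if 0xFE00 <= cp <= 0xFE0F:        # Variation Selectors block: VS1..VS16
        return f"VS{cp - 0xFE00 + 1}"
    if 0xE0100 <= cp <= 0xE01EF:      # Variation Selectors Supplement: VS17..VS256
        return f"VS{cp - 0xE0100 + 17}"
    if 0x180B <= cp <= 0x180D:        # Mongolian Free Variation Selectors: FVS1..FVS3
        return f"FVS{cp - 0x180B + 1}"
    return None                       # not a variation selector
```

The two generic ranges together account for 16 + 240 = 256 selectors, which is where the numbering VS1..VS256 comes from.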

A variation selector may be used to define a variation sequence, which comprises a single base character followed by a single variation selector. The base character must not be either a decomposable character or a combining character otherwise normalization could change the character to which the variation selector is appended (as we shall see below this rule was not followed when mathematical variation sequences were first defined).

The most important thing to realise about variation selectors is that they are not intended to provide a generic method for defining glyph variants by all and sundry, but that only those variation sequences specifically defined by Unicode (aka standardized variants) are valid. To put it another way, no conformant Unicode process is allowed to recognise any variation sequence not defined by Unicode (i.e. a conformant Unicode process may not render the base character to which a variation selector is appended any differently to the base character by itself, if the variation sequence is not defined by Unicode).

Of course there is nothing to stop me from defining my own variation sequence, say <0041 FE0F> (A + VS16) to indicate the Barred A that I use to write the "A" of "A️ndrew", but I should not expect Microsoft or anyone else to support my variation sequence. Although, having said that, Microsoft Vista does support some variation sequences that are undefined by Unicode (as we shall see below), and so I hope no-one is advertising Vista as being Unicode-conformant.

At present (Unicode 5.0) Unicode defines variation sequences for various mathematical characters, as well as for the Mongolian and Phags-pa scripts. These are specified in the file StandardizedVariants.txt (also as HTML with glyph images). It is to be expected that the first Han ideographic variants will be defined in Unicode 5.1.
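
To make the "only standardized variants are valid" rule concrete, here is a small Python sketch that checks a base character and selector against a list of defined sequences. The two sample lines only imitate the general layout of StandardizedVariants.txt (code points, description, trailing comment); the exact field layout is an assumption on my part, and `parse_variants` and `is_standardized` are hypothetical helper names.

```python
# Illustrative excerpt only; the field layout shown here is an assumption.
SAMPLE = """\
# base selector; description; # character name
2229 FE00; with serifs; # INTERSECTION
222A FE00; with serifs; # UNION
"""


def parse_variants(text):
    """Map (base code point, selector code point) to the variant description."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        base, selector = (int(h, 16) for h in fields[0].split())
        table[(base, selector)] = fields[1]
    return table


def is_standardized(table, base, selector):
    """True only if <base, selector> is one of the defined variation sequences."""
    return (ord(base), ord(selector)) in table
```

With this table, `is_standardized(table, "\u2229", "\ufe00")` is true, whereas my private <0041 FE0F> sequence is not found, so a conformant process should render the A exactly as it would without the selector.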



Mathematical Variation Sequences

Unicode defines variation sequences for 15 characters in the Mathematical Operators block [2200..22FF] and 8 characters in the Supplemental Mathematical Operators block [2A00..2AFF]. In all of these cases the variation selector used is U+FE00 (VS1).

Mathematical Variation Sequences
Base Character Variation Selector Variation Sequence Appearance* (No VS, With VS)
U+2229 VS1 <2229 FE00> INTERSECTION with serifs ∩ ∩︀
U+222A VS1 <222A FE00> UNION with serifs ∪ ∪︀
U+2268 VS1 <2268 FE00> LESS-THAN BUT NOT EQUAL TO with vertical stroke ≨ ≨︀
U+2269 VS1 <2269 FE00> GREATER-THAN AND NOT DOUBLE EQUAL with vertical stroke ≩ ≩︀
U+2272 VS1 <2272 FE00> LESS-THAN OR EQUIVALENT TO following the slant of the lower leg ≲ ≲︀
U+2273 VS1 <2273 FE00> GREATER-THAN OR EQUIVALENT TO following the slant of the lower leg ≳ ≳︀
U+228A VS1 <228A FE00> SUBSET OF WITH NOT EQUAL TO with stroke through bottom members ⊊ ⊊︀
U+228B VS1 <228B FE00> SUPERSET OF WITH NOT EQUAL TO with stroke through bottom members ⊋ ⊋︀
U+2293 VS1 <2293 FE00> SQUARE CAP with serifs ⊓ ⊓︀
U+2294 VS1 <2294 FE00> SQUARE CUP with serifs ⊔ ⊔︀
U+2295 VS1 <2295 FE00> CIRCLED PLUS with white rim ⊕ ⊕︀
U+2297 VS1 <2297 FE00> CIRCLED TIMES with white rim ⊗ ⊗︀
U+229C VS1 <229C FE00> CIRCLED EQUALS equal sign touching the circle ⊜ ⊜︀
U+22DA VS1 <22DA FE00> LESS-THAN EQUAL TO OR GREATER-THAN with slanted equal ⋚ ⋚︀
U+22DB VS1 <22DB FE00> GREATER-THAN EQUAL TO OR LESS-THAN with slanted equal ⋛ ⋛︀
U+2A3C VS1 <2A3C FE00> INTERIOR PRODUCT tall variant with narrow foot ⨼ ⨼︀
U+2A3D VS1 <2A3D FE00> RIGHTHAND INTERIOR PRODUCT tall variant with narrow foot ⨽ ⨽︀
U+2A9D VS1 <2A9D FE00> SIMILAR OR LESS-THAN with similar following the slant of the upper leg ⪝ ⪝︀
U+2A9E VS1 <2A9E FE00> SIMILAR OR GREATER-THAN with similar following the slant of the upper leg ⪞ ⪞︀
U+2AAC VS1 <2AAC FE00> SMALLER THAN OR EQUAL TO with slanted equal ⪬ ⪬︀
U+2AAD VS1 <2AAD FE00> LARGER THAN OR EQUAL TO with slanted equal ⪭ ⪭︀
U+2ACB VS1 <2ACB FE00> SUBSET OF ABOVE NOT EQUAL TO with stroke through bottom members ⫋ ⫋︀
U+2ACC VS1 <2ACC FE00> SUPERSET OF ABOVE NOT EQUAL TO with stroke through bottom members ⫌ ⫌︀

* If you have a recent version of James Kass's Code2000 installed on your system you should see the difference in appearance between the base character with and without VS1 applied to it (at least it works for me with IE6 or IE7).


Originally, when the set of mathematical variation sequences was defined in Unicode 3.2, there were two additional variation sequences :

  • <2278 FE00> NEITHER LESS-THAN NOR GREATER-THAN with vertical stroke
  • <2279 FE00> NEITHER GREATER-THAN NOR LESS-THAN with vertical stroke

However, as U+2278 and U+2279 are both decomposable characters, if the variation sequences <2278 FE00> and <2279 FE00> are subjected to decomposition (NFD or NFKD) they will change to <2276 0338 FE00> and <2277 0338 FE00> respectively. When this happens VS1 is now appended to U+0338 COMBINING LONG SOLIDUS OVERLAY, and <0338 FE00> is not a defined variation sequence. Therefore these two variation sequences were undefined in Unicode 4.0 (which I guess answers the question of whether once defined a variation sequence can be undefined or not). However, due to an unfortunate oversight, the last paragraph of Section 15.4 of The Unicode Standard still suggests that VS1 can be applied to U+2278 and U+2279 (although an erratum for this has now been issued).
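
The effect of decomposition on these two sequences is easy to reproduce with Python's standard `unicodedata` module. The helper name `base_after_normalization` is my own, and the sketch simply reports which character ends up immediately before VS1 once the text has been normalized.

```python
import unicodedata

VS1 = "\ufe00"


def base_after_normalization(sequence, form="NFD"):
    """Return the character that VS1 follows once the text is normalized."""
    normalized = unicodedata.normalize(form, sequence)
    return normalized[normalized.index(VS1) - 1]


def code_points(text):
    """Format a string as a list of hexadecimal code points."""
    return [f"{ord(c):04X}" for c in text]


# <2278 FE00> decomposes, so VS1 ends up attached to U+0338.
decomposed = unicodedata.normalize("NFD", "\u2278" + VS1)
print(code_points(decomposed))
```

For a non-decomposable base such as U+2229 the base character is unchanged by normalization, which is exactly why the rule about decomposable base characters exists.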

Turning to the general reason for defining these variation sequences in the first place, we find almost no explanation for them in The Unicode Standard (section 15.4). We are asked to "see Section 16.4, Variation Selectors, for more information on some particular variants", but turning to Section 16.4 we find no mention of mathematical variation sequences, much less any information on particular variation sequences. It has been explained to me that mathematical variation sequences have been defined because nobody is quite sure whether there is any semantic difference between the variant glyphs or not; if it was certain that there is a semantic difference between the variant glyphs then the variant forms would have been encoded as separate characters, and conversely, if it was certain that there was no semantic difference then variation sequences would not have been defined for them.

A final important point to note is that whilst the glyph form of a variation sequence is fixed, that of the base character when not part of a variation sequence is not fixed, so that the range of acceptable glyph forms for a particular base character may encompass the glyph form of its standardized variant. For example, although the glyph for <2229 FE00> "INTERSECTION with serifs" must have serifs, this does not mean that the character U+2229 must not have serifs, and depending on the font it may or may not have serifs. In fact, there is no way of selecting "INTERSECTION without serifs" at the encoding level.



Mongolian Variation Sequences

Mongolian variation sequences are formed using the special Mongolian Free Variation Selectors 180B..180D (FVS1..FVS3) rather than the generic variation selectors. Unlike the mathematical variation sequences, which seem like a kludge, variation selectors are an essential aspect of the Mongolian encoding model. To understand why they are required you need to understand a little bit about the nature of the Mongolian script, in which most letters have a variety of positional, contextual and semantic glyph forms (see The Unicode Standard Section 13.2 for further details). The glyph form that a particular letter assumes depends upon various factors such as :

  • its position in a word (initial, medial, final or isolate)
  • the gender of the word that it occurs in (masculine or feminine depending upon the vowels in the word, so that, for example, completely different glyph forms of U+182D GA are found in the masculine word jarlig "order" and the feminine word chirig "soldier")
  • what letters it is adjoining to (e.g. U+1822 I is written with a single tooth after a consonant but with a double tooth after a vowel; U+1828 NA in medial position has a dot before a vowel but no dot before a consonant; U+1832 TA and U+1833 DA both take the reclining form before a vowel and the upright form before a consonant)
  • whether the word is a native word or a foreign borrowing (e.g. the glyph form of U+1832 TA and U+1833 DA in medial position in a native word depends upon whether the letter is followed by a vowel or a consonant, but in foreign words U+1832 TA is always written with the upright glyph form, whereas U+1833 DA is always written with the reclining glyph form)
  • whether traditional or modern orthographic rules are being followed (e.g. U+182D GA in the word gal "fire" is written with two dots in modern orthography but with no dots in traditional orthography)

The rendering system should select the correct positional or contextual form of a letter without any need for user intervention (i.e. variation selectors are not normally needed in running text to select glyph forms that the rendering system can predict from context), but for foreign words and words written in traditional orthography the user needs to apply the appropriate variation selector to select the correct glyph form where appropriate.

Variation selectors may also be used to select a particular contextual glyph form of a letter out of context, for example in discussions of the script, where there is a need to display a particular glyph form in isolation.

Not all Mongolian, Todo, Manchu and Sibe letters have glyph forms that need distinguishing by means of variation sequences, but variation sequences are still defined for as many as thirty-eight of the 128 letters in the Mongolian block. In addition to these variation sequences which define contextual glyph forms of letters, there are two variation sequences defined by Unicode where variation selectors are used to select stylistic variants :

  • <1880 180B> MONGOLIAN LETTER ALI GALI ANUSVARA ONE second form
  • <1881 180B> MONGOLIAN LETTER ALI GALI VISARGA ONE second form

With regard to the first of these, I would suggest that U+1880 by itself corresponds to a "candrabindu" (e.g. Devanagari U+0901 and Tibetan U+0F83), whereas the variation sequence <1880 180B> ᢀ᠋ corresponds to an "anusvara" (e.g. Devanagari U+0902 and Tibetan U+0F7E); thus I believe that they are semantically distinct and should have been encoded as separate characters rather than as one character plus a standardized variant. I am not sure about the two forms of the visarga (U+1881 and <1881 180B> ᢁ᠋).

As an aside, one very curious feature about the two characters U+1880 and U+1881 is their names, which both include the unexpected and (in this context) meaningless word "one". My only explanation for this is that at some early stage of the Mongolian character repertoire four characters had been proposed :

  • MONGOLIAN LETTER ALI GALI ANUSVARA ONE
  • MONGOLIAN LETTER ALI GALI ANUSVARA TWO
  • MONGOLIAN LETTER ALI GALI VISARGA ONE
  • MONGOLIAN LETTER ALI GALI VISARGA TWO

But then the "two" characters were redefined as variation sequences of the corresponding "one" character. However, the original names must have been inadvertently left unchanged, with "one" left in the name as a fossil reminder of the time when there were two such characters. But this is pure conjecture; I have not been able to find any support for this theory yet.

The problem with the system of Mongolian variation sequences is that nearly eight years after Mongolian was added to Unicode (3.0 in September 1999) the exact shaping behaviour of Mongolian remains undefined. Although Unicode defines a number of standardized variants for Mongolian, a simple list such as this is not sufficient to implement Mongolian correctly. So when Microsoft decided to support Mongolian in its Vista operating system it had to rely on information on shaping behaviour outside of the Unicode Standard, specifically unpublished draft specifications for Mongolian shaping behaviour from China which in places contradict both themselves and the Unicode Standard with regard to the use of variation selectors.

I have to sympathise with Microsoft, which is in a very difficult position in trying to support a script for which the necessary shaping behaviour specification has long been promised but never delivered, but nevertheless it is very unfortunate that Microsoft did not work with Unicode to write the promised Unicode Technical Report on Mongolian at the same time as it developed its Mongolian implementation. As it stands the Vista implementation of Mongolian is essentially an undocumented and private interpretation of Mongolian shaping behaviour. In particular the Vista implementation (Uniscribe and the Mongolian Baiti font) supports a number of variation sequences that are not defined by Unicode.

The table below lists those variation sequences supported in the Mongolian Baiti font that are undefined by Unicode but which have the same glyph appearance as another defined variation sequence. The seven undefined isolate variants are identical to another positional form of the letter, and can be selected using the appropriate combination of ZWJ and FVS; I do not believe any of them are true isolate forms which require special variation sequences other than the already defined sequences for when they occur in a non-isolate position. The two undefined initial variants are identical to the medial forms of the same letter that are selected after NNBSP, and the undefined final variant is identical to the medial form of the same letter that is selected before MVS. I do not think that these are true initial or final forms, and any usage in initial or final position (e.g. when discussing a stem or suffix in isolation) can be dealt with using the existing, defined variation sequences and ZWJ where appropriate (e.g. the suffix ACA that occurs after NNBSP can be represented in isolation as <200D 1820 180C 1834 1820>, without requiring a special initial variant). In summary, not only are none of the variation sequences in the table below sanctioned by Unicode, but in my opinion none of them are required anyway.

Undefined Variation Sequences in Mongolian Baiti
Base Character Variation Selector Position Variation Sequence Appearance* Notes
U+1820 FVS2 Isolate <1820 180C> ᠠ᠌ This undefined isolate variant is the same as the defined second final form
U+1821 FVS1 Isolate <1821 180B> ᠡ᠋ This undefined isolate variant is the same as the defined second final form
U+1822 FVS1 Isolate <1822 180B> ᠢ᠋ This undefined isolate variant is the same as the defined final form
U+1824 FVS1 Isolate <1824 180B> ᠤ᠋ This undefined isolate variant is the same as the defined final form
U+1826 FVS2 Isolate <1826 180C> ᠦ᠌ This undefined isolate variant is the same as the defined first final form
U+182D FVS1 Isolate <182D 180B> ᠭ᠋ This undefined isolate variant is the same as the defined feminine medial form
U+1835 FVS1 Isolate <1835 180B> ᠵ᠋ This undefined isolate variant is the same as the defined second medial form
U+1820 FVS1 Initial <1820 180B> ᠠ᠋‍ This undefined initial variant is the same as the defined second medial form (used after NNBSP)
U+1826 FVS1 Initial <1826 180B> ᠦ᠋‍ This undefined initial variant is the same as the defined first medial form (used after NNBSP)
U+1828 FVS2 Final <200D 1828 180C> ‍ᠨ᠌ This undefined final variant is the same as the defined third medial form (used before MVS)

* You will need to be running under Vista to see what I intend to be seen.
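
For anyone who wants to experiment with these sequences, the angle-bracket notation used in this post (e.g. <200D 1820 180C 1834 1820> for the suffix ACA in isolation) can be converted to and from actual text with a couple of small helpers. This is only a convenience sketch in Python, and the function names `from_notation` and `to_notation` are my own.

```python
def from_notation(notation):
    """Turn '<200D 1820 180C>' style notation into the string it denotes."""
    hex_values = notation.strip().strip("<>").split()
    return "".join(chr(int(h, 16)) for h in hex_values)


def to_notation(text):
    """Turn a string back into '<XXXX XXXX ...>' notation."""
    return "<" + " ".join(f"{ord(c):04X}" for c in text) + ">"


# The suffix ACA in isolation, using only defined sequences plus ZWJ.
aca_suffix = from_notation("<200D 1820 180C 1834 1820>")
```

Pasting `aca_suffix` into an application is then a quick way of seeing how a given rendering system treats the sequence.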


In addition to the undefined variation sequences in the above table, Mongolian Baiti supports several other undefined variation sequences which are even more problematic.

Firstly, the undefined variation sequence <1840 180B> ᡀ᠋ (Mongolian LHA plus FVS1) produces a glyph which is the same as the letter LA with a circle diacritic. This is not a variant glyph form of Mongolian LHA (in origins a ligature of the letters LA and HA) at all, but is a completely separate letter used in Manchu to transliterate Tibetan LHA (discussed in more detail here). Although this letter was inadvertently omitted from the original set of Mongolian/Todo/Manchu/Sibe letters, it is to be encoded as U+18AA MONGOLIAN LETTER MANCHU ALI GALI LHA in Unicode 5.1. All I can say is that trying to represent an unencoded letter by means of an undefined and unsanctioned variation sequence is a shameful hack that should never have been countenanced by a major vendor and founder member of the Unicode Consortium.

Then there are these four variant forms of U+1800 MONGOLIAN BIRGA :

  • <1800 180B> (FVS1) ᠀᠋ "1st variant"
  • <1800 180C> (FVS2) ᠀᠌ "2nd variant"
  • <1800 180D> (FVS3) ᠀᠍ "3rd variant"
  • <1800 200D> (ZWJ) ᠀‍ "4th variant"

And for those without Vista, these are what I am talking about (1st to 4th variants from left to right) :

Although none of these four birga variants are defined in Unicode, they are defined both in Traditional Mongolian Script in the ISO/IEC 10646 and Unicode Standards (UNU/IIST Report No. 170, August 1999) and in a book on Mongolian character encoding, Mengguwen Bianma 蒙古文编码 (2000) by Professor Quejingzhabu, which closely follows the UNU/IIST report.

I suspect that the main reason why Unicode did not accept these four variation sequences when it accepted all the other variation sequences defined in UNU/IIST Report No. 170 is that the fourth variation sequence uses U+200D ZERO WIDTH JOINER as a pseudo-variation selector because there are not enough Mongolian Free Variation Selectors for more than three variants of the same positional form of a letter. This abuse of ZWJ was no doubt unacceptable to Unicode, and I imagine that as they couldn't accept three of the variants and reject one of them, they rejected them all until a better solution could be found. Unfortunately, instead of working with Unicode to define an acceptable solution Microsoft uncritically implemented something Unicode had already rejected.

Let us just consider for a moment the wisdom of using ZWJ as a pseudo-variation selector in a script that already uses ZWJ to select positional forms of letters (X-ZWJ, ZWJ-X-ZWJ and ZWJ-X select the initial, medial and final forms of the letter X respectively). As the Mongolian birga is a head mark that occurs at the start of text, it is quite likely to be followed by a Mongolian letter (maybe with whitespace between them, maybe not). Is it not just possible that if a letter with positional forms occurs immediately after the fourth birga variant <1800 200D> the ZWJ will have an adverse effect on the following letter ?

Well yes, it is just possible, under Vista at least. In IE7 the ZWJ acts upon both the preceding birga (U+1800) and following letter A (U+1820), producing the 4th birga variant followed by the final form of the letter A; whereas in simpler applications such as Notepad the ZWJ only acts upon the following letter, producing the standard birga glyph followed by the final form of the letter A (Birga 4th variant plus letter A separated by space is on the left and Birga 4th variant plus letter A not separated by space is on the right) :

And in Word 2007 you get weird behaviour, as seen below where exactly the same three sequences <1800 200D 1820> may end up being rendered differently from each other :

This sort of unpredictable rendering behaviour is no doubt why Unicode rejected <1800 200D> as a variation sequence in the first place, and why Microsoft should never have implemented it. Unfortunately there is a lot more that I could say about the rendering behaviour of Mongolian Baiti, but that would be beyond the scope of this post.
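
The root of the ambiguity can be shown at the code point level with a few lines of Python. This is only a sketch of the ZWJ conventions described above (X-ZWJ, ZWJ-X-ZWJ and ZWJ-X), not of any actual rendering engine, and the helper name `positional` is my own.

```python
ZWJ = "\u200d"
BIRGA = "\u1800"
LETTER_A = "\u1820"


def positional(letter, position):
    """Wrap a letter in ZWJ to request a positional form out of context."""
    if position == "initial":
        return letter + ZWJ
    if position == "medial":
        return ZWJ + letter + ZWJ
    if position == "final":
        return ZWJ + letter
    if position == "isolate":
        return letter
    raise ValueError(f"unknown position: {position}")


# Intended reading 1: the undefined "4th birga variant" <1800 200D>, then a plain A.
reading_1 = BIRGA + ZWJ + LETTER_A
# Intended reading 2: a plain birga, then the final form of A (ZWJ-X).
reading_2 = BIRGA + positional(LETTER_A, "final")

# Both intentions produce exactly the same encoded text.
print(reading_1 == reading_2)
```

As the two strings are identical, a rendering system has no way of knowing which of the two the author meant, so it is hardly surprising that different applications guess differently.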



Phags-pa Variation Sequences

As with the Mongolian model, variation selectors (always VS1) are used in the Phags-pa script in order to select a particular contextual glyph form. This mechanism is only actually required in order to represent the Sanskrit Buddhist texts that are engraved in Phags-pa script on the walls of the "Cloud Platform" 雲台 at Juyong Guan 居庸關 Pass at the Great Wall north-west of Beijing, in commemoration of the construction of a Buddhist edifice in 1345. On these very important inscriptions (and nowhere else in the extant Phags-pa corpus) the Sanskrit retroflex letters ṭa, ṭha, ḍa and ṇa are represented by reversed forms of the Phags-pa letters TA, THA, DA and NA (following the example of Tibetan), and as such these four reversed letters are encoded separately from their unreversed counterparts (A869..A86C : TTA, TTHA, DDA and NNA). However, as the stem on these four reversed letters is on the opposite side compared with normal, when other letters follow them they also normally take a reversed glyph form to facilitate joining along the stem. These reversed glyph forms are not phonetically or semantically any different from the corresponding unreversed glyph forms, and so are not encoded separately, but are treated as contextual glyph variants. This contextual reversing affects the following six letters :

  • U+A856 PHAGS-PA LETTER SMALL A
  • U+A85C PHAGS-PA LETTER HA
  • U+A85E PHAGS-PA LETTER I
  • U+A85F PHAGS-PA LETTER U
  • U+A860 PHAGS-PA LETTER E
  • U+A868 PHAGS-PA SUBJOINED LETTER YA

These letters exhibit the following reversing behaviour :

  • The letter HA reverses after the letter DDA
  • The letter Subjoined YA reverses after the letter NNA
  • The letters I, U and E reverse after the letters TTA, TTHA, DDA or NNA (or after a reversed Subjoined YA or HA), although the letter I does not always reverse after the letter TTHA
  • The letter Small A normally does not reverse after the letters TTA or TTHA, presumably because a reversed Small A is identical to the letter SHA, but may sometimes be reversed after the letter TTHA

The rendering system should automatically reverse the glyph form of the letters Small A, HA, I, U, E and Subjoined YA when they occur immediately after one of the letters TTA, TTHA, DDA or NNA (or a reversed Small A, HA, I, U, E or Subjoined YA), but variation selectors are needed to display the reversed glyph forms of the letters Small A, HA, I, U, E and Subjoined YA in isolation (for example when discussing the letters of the script) and when the default reversing behaviour needs to be overridden, for example in order to represent those occurrences where the letters Small A and I do not reverse after the letters TTA or TTHA in the Juyong Guan inscriptions.

The six variation sequences defined for these purposes are different from any other variation sequence defined thus far, in that they do not define an absolute glyph form but a relative glyph form :

  • <A856 FE00> phags-pa letter reversed shaping small a
  • <A85C FE00> phags-pa letter reversed shaping ha
  • <A85E FE00> phags-pa letter reversed shaping i
  • <A85F FE00> phags-pa letter reversed shaping u
  • <A860 FE00> phags-pa letter reversed shaping e
  • <A868 FE00> phags-pa letter reversed shaping subjoined ya

By "reversed shaping" it means that where the rendering system would normally display an unreversed form of the letter, applying VS1 will cause the glyph to be reversed; and conversely, where the rendering system would normally display a reversed form of the letter (e.g. after the letters TTA, TTHA, DDA and NNA), applying VS1 will cause the glyph to be unreversed. By this means the same variation sequence can be used to display a reversed glyph form of a letter in isolation and to inhibit glyph reversal in running text.

As an example, the Sanskrit word dhiṣṭhite is transliterated as DHISH TTHI TE in the Phags-pa inscriptions at Juyong Guan, but in some cases the letter I of TTHI is reversed and in some cases it is not. These two versions of the word may be represented as :

  • <A84A A85C A85E A85A 0020 A86A A85E 0020 A848 A860> ꡊꡜꡞꡚ ꡪꡞ ꡈꡠ (letter I contextually reversed by the rendering system)
  • <A84A A85C A85E A85A 0020 A86A A85E FE00 0020 A848 A860> ꡊꡜꡞꡚ ꡪꡞ︀ ꡈꡠ (VS1 inhibits contextual reversing of letter I)

Whereas in this context VS1 inhibits contextual reversing of letter I, we can use the same variation sequence <A85E FE00> in isolation to produce the reversed glyph form of the letter I : ‍ꡞ︀ (preceded by ZWJ to get the final reversed glyph form).
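
The "relative" nature of these variation sequences can be modelled as a simple toggle. The following Python sketch is a deliberately simplified model of my own, not a description of any real shaping engine: it applies the general rule that a reversible letter reverses after TTA, TTHA, DDA or NNA (or after another reversed letter), treats VS1 as flipping whatever the default would have been, and ignores the letter-specific restrictions listed earlier (such as HA only reversing after DDA).

```python
TRIGGERS = {0xA869, 0xA86A, 0xA86B, 0xA86C}                 # TTA, TTHA, DDA, NNA
REVERSIBLE = {0xA856, 0xA85C, 0xA85E, 0xA85F, 0xA860, 0xA868}
VS1 = 0xFE00


def reversed_flags(text):
    """Return (code point, is_reversed) pairs for each character except VS1."""
    cps = [ord(c) for c in text]
    result = []
    context_reverses = False   # does the preceding character trigger reversal?
    i = 0
    while i < len(cps):
        cp = cps[i]
        has_vs1 = i + 1 < len(cps) and cps[i + 1] == VS1
        is_reversed = False
        if cp in REVERSIBLE:
            is_reversed = context_reverses      # default contextual behaviour
            if has_vs1:
                is_reversed = not is_reversed   # VS1 flips the default
        result.append((cp, is_reversed))
        context_reverses = cp in TRIGGERS or is_reversed
        i += 2 if has_vs1 else 1
    return result
```

In this model the letter I of TTHI comes out reversed for <A86A A85E> and unreversed for <A86A A85E FE00>, whilst <A85E FE00> with nothing before it comes out reversed, which matches the three cases discussed above.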

[See Phags-pa Shaping Behaviour for more examples]



Han Ideographic Variation Sequences

For a long time there has been a demand from some quarters for a mechanism to allow vendors and CJK users to register glyph variants of Han ideographs, and in order to accommodate this demand Unicode has recently established an Ideographic Variation Database (IVD). Unlike variation sequences for other scripts, which are individually defined by Unicode, the IVD provides a registration mechanism so that sets of Ideographic Variation Sequences (IVS) can be registered by the "user community" on demand. As long as certain rules are followed and a fee is paid (which Unicode may waive if it so desires) then Unicode (as the registration authority) will accept any set of glyph variants that anybody wants to register, without any scrutiny of the appropriateness of the proposed glyph variants -- there is a 90 day public review period, but in my opinion that's just an excuse to move responsibility away from the UTC.

The Variation Selectors Supplement, comprising 240 variation selectors (VS17-VS256), was specially encoded in anticipation of a large number of Han ideographic variants being defined, and ideographic variation sequences are intended to only use these 240 supplementary variation selectors. The door has been left open to define even more variation selectors if 240 variation sequences for a single CJK unified ideograph proves too few.

The first, and so far only, IVD registration application has come from Adobe, who have requested the registration of the entire set of kanji glyphs in their Adobe-Japan1 collection. This is a set of glyphs used by Adobe for fonts for the Japanese market, and includes 14,664 kanji glyphs. Adobe wants to be able to uniquely refer to each of these glyphs at the encoding level (don't ask me why), but as many of the glyphs are from a Unicode perspective unifiable variants it can only do so by means of variation sequences.

Seven of the glyphs in the Adobe-Japan1 collection do not correspond to encoded ideographs, and so have been fast-tracked (by-passing IRG) for encoding in Unicode 5.1 at 9FBC..9FC2. The remaining 14,657 glyphs have been analysed as mapping to a total of 13,262 encoded ideographs (one glyph, CID+19071, maps to both U+29FCE and U+29FD7 !) :

  • 12,040 characters mapped to 1 glyph
  • 1,084 characters mapped to 2 glyphs
  • 120 characters mapped to 3 glyphs
  • 14 characters mapped to 4 glyphs
  • 1 character mapped to 5 glyphs (U+97FF 響)
  • 1 character mapped to 6 glyphs (U+6168 慨)
  • 1 character mapped to 8 glyphs (U+908A 邊)
  • 1 character mapped to 15 glyphs (U+9089 邉)

From this one would have thought that variation sequences would only be required for those 1,222 ideographs that map to two or more glyphs in the Adobe set, and even then perhaps only for those glyphs that differ from the standard form of the ideograph, yielding at most 2,618 variation sequences. However, for purposes of forward compatibility (if additional Adobe glyphs are mapped to characters that currently only map to a single Adobe glyph), and in order to be able to reference all glyphs in the set as a variation sequence (don't ask me why), a total of 14,658 variation sequences are being put forward for registration (i.e. a unique variation sequence for every glyph in the Adobe-Japan1 collection, other than the seven unencoded characters, although I presume redundant variation sequences for those seven characters will be added once they are encoded). For the vast majority of the 12,040 ideographs for which only a single ideographic variation sequence is specified, the glyph for the IVS has the same appearance as the standard glyph form of the character, i.e. they are variation sequences that define a glyph that is not a variant of the base character and for which there is no need to distinguish it from any other variant glyph forms.
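
The figures quoted above all follow from the distribution of glyphs per character in the list, as this short Python check shows. The variable names are my own; the counts are taken directly from the list.

```python
# Number of characters mapped to N glyphs, keyed by N (from the list above).
characters_by_glyph_count = {1: 12040, 2: 1084, 3: 120, 4: 14, 5: 1, 6: 1, 8: 1, 15: 1}

# Total number of encoded ideographs covered by the collection.
total_characters = sum(characters_by_glyph_count.values())

# One variation sequence per (character, glyph) pair.
total_sequences = sum(n * count for n, count in characters_by_glyph_count.items())

# Characters that actually have more than one glyph to choose between.
multi_glyph_characters = total_characters - characters_by_glyph_count[1]

# Sequences belonging to those multi-glyph characters.
multi_glyph_sequences = total_sequences - characters_by_glyph_count[1]

print(total_characters, total_sequences, multi_glyph_characters, multi_glyph_sequences)
```

The total number of sequences is one more than the 14,657 glyphs because the glyph CID+19071 is mapped to two different characters.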

At this point I start seriously worrying about the implications of the Adobe approach to ideographic variation sequences. What if there is an "Adobe-Japan2" collection or an "Adobe-China" collection or an "Adobe-Korean" collection ? Would these collections also require the definition of many thousands of ideographic variation sequences that are not distinguishable from the standard glyph form of the base ideograph ? What if other vendors such as Microsoft or Apple decide to follow Adobe's lead, and define unique ideographic variation sequences for tens of thousands of font glyphs ? As the whole point of the IVD is to ensure that a given variation sequence is used in at most one collection, the same variant (or not-so-variant) glyph in multiple collections will inevitably be defined with different variation sequences, once for every collection it occurs in. It seems to me that the end result of all this will be that many thousands of ideographs will have multiple variation sequences associated with them (one per collection) and that the glyphs for each variation sequence will be practically indistinguishable from each other and from the standard glyph form of the base ideograph.

Looking at the glyphs of the Adobe-Japan1 collection it is evident that in very many cases where a single base ideograph has multiple variation sequences defined, the difference between glyphs is very slight (often just minor differences in stroke formation), and it is hard to see how there could be any practical need to distinguish them at the text level. In some cases the difference between "variant" glyphs is microscopic; for instance, can you differentiate the VS17 and VS18 forms of U+55A9 ?

On the other hand, sometimes the glyph variation is too extreme. One major problem with the collection that was identified during the review period is that the variation sequences for a single ideograph sometimes represent glyph forms that are not unifiable according to the Annex S rules; in particular, there are quite a few cases where a Japanese simplified form which has not been encoded is defined by means of a variation sequence as a variant of the encoded non-simplified form. A single example from page 4 should suffice :

  • <56C0 E0100> (VS17) 囀 [4454] = ⿰口轉
  • <56C0 E0101> (VS18) 囀 [14116] = ⿰口轉
  • <56C0 E0102> (VS19) 囀 [20096] = ⿰口転
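For readers who want to see what these sequences look like as encoded text, the following is a minimal sketch in Python. It relies only on the fact that VS17 to VS256 occupy U+E0100 to U+E01EF in the Variation Selectors Supplement block; the helper names `ivs` and `describe` are hypothetical, made up for this illustration, and are not part of any library.

```python
# VS17..VS256 live at U+E0100..U+E01EF (Variation Selectors Supplement).
VS17_CODE_POINT = 0xE0100


def ivs(base: int, vs_number: int) -> str:
    """Return the base character followed by variation selector VS<vs_number>."""
    if not 17 <= vs_number <= 256:
        raise ValueError("ideographic variation sequences use VS17..VS256")
    return chr(base) + chr(VS17_CODE_POINT + (vs_number - 17))


def describe(sequence: str) -> str:
    """Format a sequence in the <XXXX XXXXX> notation used in the list above."""
    return "<" + " ".join(f"{ord(ch):04X}" for ch in sequence) + ">"


# The three sequences for U+56C0 listed above.
for n in (17, 18, 19):
    print(describe(ivs(0x56C0, n)), f"(VS{n})")
```

Each sequence is simply two code points, the base ideograph and a selector. Nothing in the encoded text itself says which glyph a selector stands for; that association exists only in the registered collection, and whether a reader actually sees the intended glyph depends entirely on the font supporting that sequence.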

It has now been clarified that the glyph for any ideographic variation sequence must be within the range of unifiable glyph variation for the base ideograph, and glyphs that would not be unified according to the unification rules may not be treated as variants of the same base ideograph. The text of UTS 37 will be amended accordingly, and a revised list of variation sequences for the Adobe-Japan1 collection will be issued. This means that there will probably be about fifty more characters from the Adobe-Japan1 set that will need encoding, and I have no doubt that, as with the previous seven, they will be fast-tracked (bypassing IRG), and tagged on to the end of the CJK and CJK-A blocks (probably just enough room for them in the BMP).

When I first read UTS 37 I thought that the purpose of the IVD was to provide CJKV users with a mechanism to define glyph variants that, although unifiable from a character-encoding perspective, were required to be distinguished at the text level in certain circumstances, most obviously when used as personal or place names. But having reviewed the Adobe-Japan1 submission it seems that I must have been mistaken. It is evident that this collection of 14,658 ideographic variation sequences has no practical benefit for anyone other than Adobe, will never be supported by anyone other than Adobe, and will never be used in text by the general CJKV user community. In my opinion the collection is required purely to enable Adobe to uniquely identify their font glyphs internally, and not for information interchange, which I personally think is an abuse of ideographic variation sequences. But more than that, this very first IVD registration is going to be seen as a model for what the IVD is intended for, and I am afraid that it will only serve to put people off registering sensible and useful ideographic variation sequences, for example for the many thousands of Taiwan personal name usage characters, as well as dictionary usage variants. We shall just have to wait and see ...


7 comments:

Sean Burke said...

Thanks for your perspective on the Adobe submission. I hadn't considered the consequences of their use of variants for single glyphs. Is there any reasonable hope that this particular flaw could be worked out of the submission? If the unifiability issue could be dealt with, it doesn't seem to be outside the realm of possibility that this too could be solved. The unifiability issue shows that the submission isn't yet set in stone so it would seem that now is the time to take care of it.

Andrew West said...

I don't think that anyone on the Unicode Technical Committee (UTC) sees this as a problem, so I am afraid that this is one feature of the submission that won't be changing.

Chris Fynn said...

Of course if anyone wants to display these particular variants they will probably also need to purchase a font from Adobe.

Since the differences are so subtle anyone trying to duplicate them would probably be infringing on the IP of Adobe's design.

Michael Everson said...

I refer to them as pseudo-coding. Bah.

28481k said...

Well, we can implement variant selection; in fact I will suggest the following use of variation selectors (VS1-VS16) for EVERY existing Unified Han character in Unicode:

VS1: forcing zh_CN locale form
VS2: forcing zh_TW locale form
VS3: forcing ja locale form
VS4: forcing kr locale form
VS5: forcing zh_HK locale form
VS6: forcing zh_MO locale form
VS7: forcing zh_SG locale form
VS8: forcing zh_MA locale form
VS9: forcing zh_VT locale form
VS10: forcing Kangxi reference glyph
VS11-16: any other variants that are often used but not encoded, or that are encoded elsewhere but unifiable under the current unification rules.

This way, we can minimise the possible reduplicated encodings and yet show all existent forms with little qualms.

Robert Siemer said...

So, does the base character define a variant itself or not?

For math, you say it doesn’t, for CJK you seem to indicate it does.

E.g. a math base character may mean with serifs or not, but its only variant means with serifs. – In that case another variant meaning “without serifs” is missing!

And for CJK? Is it that the base character means “this or any variation” and the Adobe registration means “this and only this”?

Andrew West said...

So, does the base character define a variant itself or not?

No, any variant is acceptable for the base character, so if the code chart form of a character needs to be defined, a separate variation sequence for it needs to be defined.

For math, you say it doesn’t, for CJK you seem to indicate it does.

No, a base CJK character with no variation selector can be any acceptable glyph shape. That is why many of the Adobe variation sequences define a single variant for a CJK character that looks the same as the glyph in the code chart.

E.g. a math base character may mean with serifs or not, but its only variant means with serifs. – In that case another variant meaning “without serifs” is missing!

Yes.

And for CJK? Is it that the base character means “this or any variation” and the Adobe registration means “this and only this”?

Yes.