Wednesday, 20 September 2006

Tibetan Shorthand Contractions

When discussing Balti extensions for Tibetan recently I talked a little about the use of U+0F39 TIBETAN MARK TSA -PHRU for writing shorthand contractions, and whilst I'm still in a Tibetanish mood I thought I might discuss Tibetan shorthand contractions in some more detail, especially as I have just made available a List of Tibetan Shorthand Contractions that I have garnered from various sources.

Shorthand contractions (bskungs yig བསྐུངས་ཡིག་ "concealed writing" or bsdu yig བསྡུ་ཡིག་ "amalgamated writing") are informal contractions of words, created by conjoining two or more "syllable units" into a single unit. For example, the word bkra shis བཀྲ་ཤིས་ "auspicious" (also the common personal name Tashi) may be contracted to bkris བཀྲིས་, which looks like it ought to be an authentic Tibetan word but isn't. Although in this case the resultant shorthand contraction conforms to Tibetan spelling rules, this need not be so, and in very many cases the shorthand contractions break the normal rules of Tibetan spelling, as can be seen in the example below :



This is a Tibetan 1½ srang coin of 1937 in which the word bcu gcig བཅུ་གཅིག་ "eleven" is contracted to bcuig བཅིུག་, with the letter ca taking a 'u' vowel sign below and an 'i' vowel sign above. This is contrary to the rules of Tibetan spelling, under which a consonant can only take a single vowel sign (diphthongs are represented by putting the second vowel sign on a following letter 'a, e.g. spre'u སྤྲེའུ་ "monkey").

In the above example the vowel signs on the two syllables being combined together are above and below, so there is no typographical interaction between the vowels, and the contraction should be rendered correctly by most Tibetan fonts. However, some Tibetan fonts do not cope well with the cases where multiple vowel signs occur above the same consonant, as in this example :



This is a detail from a prayer flag in which the common formula ཀི་ཀི་སྭོ་སྭོ་ལྷ་རྒྱལ་ལོ། "All Hail, Glory be to the Gods !" has been contracted to kii soo ཀིི་སོོ་ (kii swoo ཀིི་སྭོོ་ would be the expected form), with two 'i' vowel signs over the letter ka and two 'o' vowel signs over the letter sa. None of the Tibetan fonts on my system render the vowel signs correctly, either overlaying the double vowel signs on each other or rendering the second one on a dotted circle (Jomolhari renders the double 'o' correctly, but not the double 'i'). Note that the double 'o' vowel could be represented by U+0F7D TIBETAN VOWEL SIGN OO, but I prefer to restrict U+0F7D and U+0F7B TIBETAN VOWEL SIGN EE to transliterating Sanskrit au and ai respectively.

This example also illustrates one of the techniques of shorthand contractions, that is representing syllable reduplication by doubling the vowel sign. Another example of syllable repetition is frequently seen on prayer flags, which often end with the word bskyed བསྐྱེད་ "increase, prosper" written once (e.g. here), twice (e.g. here) or even three times (e.g. here) for added emphasis. In shorthand contractions, the number of vowel signs is used to indicate the number of syllable repetitions, so in the examples below the contraction bskyeed བསྐྱེེད་ (with two 'e' vowels) represents bskyed bskyed བསྐྱེད་བསྐྱེད་ and the contraction bskyeeed བསྐྱེེེད་ (with three 'e' vowels) represents bskyed bskyed bskyed བསྐྱེད་བསྐྱེད་བསྐྱེད་ :





Rules of Contractions

I guess that it is obvious by now to any readers who may have read my posts on Long S and R Rotunda that I am obsessed with orthographic rules, so it should be no surprise that I have attempted to look for order amongst the rule-breaking of Tibetan contractions. However, my first observation is that contractions are often idiosyncratic, and the same word may be contracted differently in different sources. For example, rgya mtsho རྒྱ་མཚོ་ "ocean" is variously contracted as རྒྱོ༹་, རྒྱམོ་ or རྪོ་. In most cases it is not possible to systematically reverse engineer the uncontracted form from a contraction, and contractions may perhaps be best considered as mnemonic abbreviations. What any individual contraction should be expanded to is usually evident from its context. Nevertheless, there are a few general principles that I have gleaned from the examples in my List of Tibetan Shorthand Contractions :

1. If the final letter of the first syllable unit is the same as the first letter of the following syllable unit, then the two letters are combined :

  • mkha' 'gro མཁའ་འགྲོ་ = mkha'gro མཁའགྲོ་
  • lcags sgrog ལྕགས་སྒྲོག་ = lcag.sgrog ལྕགསྒྲོག་
  • gtum mo གཏུམ་མོ་ = gtu.mo གཏུམོ་
  • dpal ldan དཔལ་ལྡན་ = dpa.ldan དཔལྡན་
  • 'od dkar འོད་དཀར་ = 'odkar འོདཀར་
  • gnon nu གནོན་ནུ་ = gno.nu གནོནུ་
  • gzug gin 'dug གཟུག་གིན་འདུག་ = gzu.gin 'dug གཟུགིན་འདུག་
  • khyab bdag ཁྱབ་བདག་ = khyabdag ཁྱབདག་
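
Rule 1 is mechanical enough to sketch in code. The following Python is only an illustration of that one rule, not a general contraction tool: the function names are invented for this sketch, syllable units are assumed to be delimited by tsheg (U+0F0B), and only base consonants in the range <0F40..0F6C> are compared.

```python
TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG


def is_base_consonant(ch: str) -> bool:
    """True for Tibetan base letters in the range U+0F40..U+0F6C."""
    return "\u0f40" <= ch <= "\u0f6c"


def merge_shared_letter(text: str) -> str:
    """Join adjacent syllable units when the first ends with the same
    base consonant that the next one starts with (rule 1 only)."""
    units = [u for u in text.split(TSHEG) if u]
    if not units:
        return text
    merged = [units[0]]
    for unit in units[1:]:
        prev = merged[-1]
        if is_base_consonant(prev[-1]) and prev[-1] == unit[0]:
            # Keep a single copy of the shared letter.
            merged[-1] = prev[:-1] + unit
        else:
            merged.append(unit)
    # Preserve a trailing tsheg if the input had one.
    trailing = TSHEG if text.endswith(TSHEG) else ""
    return TSHEG.join(merged) + trailing


# mkha' 'gro -> mkha'gro
print(merge_shared_letter("\u0f58\u0f41\u0f60\u0f0b\u0f60\u0f42\u0fb2\u0f7c\u0f0b"))
```

Because contractions are idiosyncratic, a function like this can at best reproduce the forms listed under this rule; it says nothing about which words a scribe would actually choose to contract.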

2. An anusvara sign is used to represent a final letter ma somewhere in the uncontracted word :

  • khams gsum ཁམས་གསུམ་ = ཁམསུཾ་
  • khrums smad ཁྲུམས་སྨད་ = ཁྲུཾད་
  • mnyam nyid མཉམ་ཉིད་ = མཉིཾད་
  • mnyam bzhag མཉམ་བཞག་ = མཉཾག་
  • thams cad ཐམས་ཅད་ = ཐཾད་
  • rnam grangs རྣམ་གྲངས་ = རྣངཾས་
  • lha mtshams ལྷ་མཚམས་ = ལྷ༹ཾས་

3. A tsa 'phru sign is used to indicate a letter tsa , tsha , dza or za somewhere in the uncontracted word :

  • kun bzang ཀུན་བཟང་ = ཀུན༹ང་
  • kun rdzob ཀུན་རྫོབ་ = ཀོུབ༹་
  • skal bzang སྐལ་བཟང་ = སྐལ༹ང་
  • rgya mtsho རྒྱ་མཚོ་ = རྒྱོ༹་
  • khur tshos ཁུར་ཚོས་ = ཁོུས༹་
  • rgyal mtshan རྒྱལ་མཚན་ = རྒྱལ༹ན་
  • chu tshod ཆུ་ཚོད་ = ཆོུ༹ད་
  • rje btsun རྗེ་བཙུན་ = རྗེུན༹་
  • ting 'dzin ཏིང་འཛིན་ = ཏིངི་ན༹་
  • thugs brtse ཐུགས་བརྩེ་ = ཐེུག༹ས་
  • bdud rtsi བདུད་རྩི་ = བདིུད༹་
  • sno tshogs སྣོ་ཚོགས་ = སྣོག༹ས་
  • phyag 'tshal lo ཕྱག་འཚལ་ལོ་ = ཕྱ༹ལོ་
  • phun tshogs ཕུན་ཚོགས་ = ཕོུག༹ས་
  • bya tshogs བྱ་ཚོགས་ = བྱོ༹གས་
  • sbrang rtsi སྦྲང་རྩི་ = སྦྲིང༹་
  • lha mtshams ལྷ་མཚམས་ = ལྷ༹ཾས་

Note that when tsa 'phru occurs on a stack with a head letter (e.g. rgy^o རྒྱོ༹་), it attaches to the head letter not the base consonant. I have not seen a single example of tsa 'phru attaching to a consonant that is not the top of the stack (i.e. in Unicode terms, U+0F39 never seems to attach to a subjoined letter <0F90..0FBC>).

4. Final -gs གས is represented by a reversed letter ta  :

  • lcags ལྕགས་ = ལྕཊ་
  • chags thogs ཆགས་ཐོགས་ = ཆཊ་ཐོཊ་
  • thugs rje ཐུགས་རྗེ་ = ཐུཊེ་ (also contracted as ཐེུགས་)
  • thugs brtse ཐུགས་བརྩེ་ = ཐེུ༹ཊ་ (also contracted as ཐེུག༹ས་)
  • de bzhin gshegs pa དེ་བཞིན་གཤེགས་པ་ = དེནིཊེ་པ་
  • lhan tshogs ལྷན་ཚོགས་ = ལྷནོ༹ཊ་

5. Syllable repetition is represented by multiple vowel signs (as discussed above) :

  • bskyed bskyed བསྐྱེད་བསྐྱེད་ = bskyeed བསྐྱེེད་
  • bskyed bskyed bskyed བསྐྱེད་བསྐྱེད་བསྐྱེད་ = bskyeeed བསྐྱེེེད་
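
Unlike most contractions, rule 5 can be reversed mechanically. The sketch below is a hedged illustration: the function name is hypothetical, it handles one syllable unit at a time (without its tsheg), and it only looks at runs of the same simple vowel sign (i, u, e or o), so a form such as bcuig with two different vowel signs is left alone.

```python
import re

TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG
# A run of two or more identical simple vowel signs (i, u, e, o).
REPEATED_VOWEL = re.compile("([\u0f72\u0f74\u0f7a\u0f7c])\\1+")


def expand_repetition(unit: str) -> str:
    """Expand one contracted syllable unit whose vowel sign is doubled or
    tripled into that many copies of the plain syllable."""
    match = REPEATED_VOWEL.search(unit)
    if match is None:
        return unit + TSHEG
    count = len(match.group(0))  # number of vowel signs in the run
    plain = unit[:match.start()] + match.group(1) + unit[match.end():]
    return (plain + TSHEG) * count


# bskyeed -> bskyed bskyed
print(expand_repetition("\u0f56\u0f66\u0f90\u0fb1\u0f7a\u0f7a\u0f51"))
```

The same function turns kii into ki ki, but it cannot restore letters that the contraction dropped (such as the wa-zur missing from soo on the prayer flag above).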


Unicode Issues

Here's what the Unicode Standard has to say about Tibetan Shorthand Abbreviations :

Tibetan Shorthand Abbreviations (bskungs-yig) and Limitations of the Encoding.

A consonant functioning as a word-base (ming-gzhi) is allowed to take only one vowel sign according to Tibetan grammar. The Tibetan shorthand writing technique called bskungs-yig does allow one or more words to be constructed into a single, very unusual combination of consonants and vowels. This construction frequently entails the application of more than one vowel sign to a single consonant or stack, and the composition of the stacks themselves can break the rules of normal Tibetan grammar. For this reason, vowel signs do sometimes interact typographically, which accounts for their particular combining classes.

The Unicode Standard accounts for plain text compounds of Tibetan that contain at most one base consonant, any number of subjoined consonants, followed by any number of vowel signs. This coverage constitutes the vast majority of Tibetan text. Rarely, stacks are seen that contain more than one such consonant-vowel combination in vertical arrangement. These stacks are highly unusual and are considered beyond the scope of plain text rendering. They may be handled by higher-level mechanisms.


What the standard does not say is that the "particular combining classes" cause all sorts of problems for dealing with shorthand contractions.

Firstly, all the vowel signs that are positioned above the stack ('i', 'e', 'double e', 'o' and 'double o') have a CCC (Canonical Combining Class) of 130, whereas the only vowel sign that is positioned below the stack ('u') has a CCC of 132, which means that when normalized a 'u' vowel sign will be reordered after any other vowel sign. However, the logical Tibetan order is to write the 'u' vowel sign first before any vowel signs above, and this is the order expected by most Tibetan fonts, with the result that a word such as bcuig may not be rendered correctly in normalized form (on my computer at least, the normalized version renders incorrectly with whatever font I use, but note that paradoxically on pre-Vista systems without Uniscribe Tibetan support both sequences may render correctly) :

  • bcuig བཅིུག་ <0F56 0F45 0F74 0F72 0F42>
  • bciug བཅིུག་ <0F56 0F45 0F72 0F74 0F42> (NFC/NFD)
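
The reordering can be checked with any Unicode library. As a minimal sketch, the following uses Python's standard `unicodedata` module to print the combining class of each character in the logical-order sequence and then normalize it; the helper `code_points` is just a formatting function made up for this example.

```python
import unicodedata


def code_points(text: str) -> str:
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


logical = "\u0f56\u0f45\u0f74\u0f72\u0f42"  # bcuig: 'u' written before 'i'

for ch in logical:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch):3d} {unicodedata.name(ch)}")

normalized = unicodedata.normalize("NFC", logical)
print("logical   :", code_points(logical))
print("normalized:", code_points(normalized))
```

The 'u' vowel sign (CCC 132) sorts after the 'i' vowel sign (CCC 130), so the normalized output is the second sequence in the list above.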


As can be seen from the above screenshot, the normalized version renders incorrectly. With all of the Tibetan fonts on my system a dotted circle is inserted into the glyph sequence; I believe that this is done by Uniscribe (version 1.0606.5112.0 on my computer), presumably because it has been taught that a consonant only takes one vowel sign, and so two vowel signs must be invalid. With Jomolhari (my favourite Tibetan font), the problem is doubly bad, as the font also adds a dotted circle of its own into the glyph sequence, with the result that 'u' is assisted by two dotted circles. Personally, I dislike the Uniscribe philosophy of trying to restrict script-specific rendering logic to the rendering engine, so that OpenType logic in the font is often ignored or circumvented. I would much prefer it if rendering engines such as Uniscribe did not try to impose their interpretation of correct rendering behaviour on the font, but just let the font do what is specified in its OpenType tables (although I understand that Microsoft prefers to keep the logic in the rendering engine so that rendering is uniform across fonts). I suspect that if Uniscribe didn't insert the spurious dotted circle into the glyph sequence in the first place, then perhaps at least some of my Tibetan fonts would deal correctly with the normalized sequence of multiple vowels.

The second problem is that U+0F39 TIBETAN MARK TSA -PHRU has a CCC of 216, which means that when normalized it will be reordered after all vowel signs. However, Tibetan fonts expect the tsa 'phru to occur immediately after the consonant stack that it modifies and before any vowel signs, and thus sequences with one or more vowel signs between a consonant and a tsa 'phru will not render correctly. This is not specifically a problem with shorthand contractions, but is a problem that is most frequently encountered with shorthand contractions due to the common use of tsa 'phru in constructing contractions. For example, the contraction for nyin mtshan ཉིན་མཚན་ "day and night" is ny^in ཉི༹ན་ (^ represents tsa 'phru in EWTS transliteration). The contraction written in logical order and normalized order is shown below, and again, on my system, the normalized form does not render correctly (same caveat as above for pre-Vista systems) :

  • ny^in ཉི༹ན་ <0F49 0F39 0F72 0F53>
  • nyi^n ཉི༹ན་ <0F49 0F72 0F39 0F53> (NFC/NFD)
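
One conceivable workaround, offered here purely as an assumption and not as something any font or rendering engine is known to do, is to put the marks back into logical order just before display. The sketch below reorders each run of tsa 'phru and simple vowel signs so that tsa 'phru comes first, then the 'u' vowel sign, then the vowel signs above; the names `logical_rank` and `to_logical_order` are hypothetical.

```python
import re
import unicodedata

TSA_PHRU = "\u0f39"  # TIBETAN MARK TSA -PHRU, CCC 216
VOWEL_U = "\u0f74"   # TIBETAN VOWEL SIGN U, CCC 132
# Runs of two or more marks drawn from tsa 'phru and the vowel signs
# i, u, e, ee, o, oo.
MARK_RUN = re.compile("[\u0f39\u0f72\u0f74\u0f7a-\u0f7d]{2,}")


def logical_rank(ch: str) -> int:
    """Sort key: tsa 'phru first, then 'u' below, then vowels above."""
    if ch == TSA_PHRU:
        return 0
    if ch == VOWEL_U:
        return 1
    return 2


def to_logical_order(text: str) -> str:
    """Reorder each run of marks into the order described above.
    sorted() is stable, so multiple vowels above keep their order."""
    return MARK_RUN.sub(
        lambda m: "".join(sorted(m.group(0), key=logical_rank)), text
    )


normalized = unicodedata.normalize("NFC", "\u0f49\u0f39\u0f72\u0f53")  # ny^in
restored = to_logical_order(normalized)
print(" ".join(f"{ord(ch):04X}" for ch in restored))
```

The restored string is no longer in normalized form, so this would only be suitable as a display-side step; any process that normalizes the text again will undo it.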


Sunday, 17 September 2006

Precomposed Tibetan Part 2 : Stuck in the PUA

As discussed in Part 1, in 2002-2003 China tried and failed to get nearly a thousand precomposed Tibetan characters encoded in ISO/IEC 10646 (which is the international standard corresponding to Unicode).

Following on from this humiliating defeat, in April of 2004 Joe Zhang (Zhang Zhoucai 张轴材), formerly a contributing editor of ISO/IEC 10646, presented to a conference in China a paper that outlined a new Chinese encoding standard for Tibetan, codenamed the "Everest Scheme". This scheme utilizes the Private Use Areas (PUA) of the UCS to encode several thousand precomposed Tibetan characters, and was characterised as a "national standard within the framework of an international standard". Under this scheme Tibetan characters would be distributed as follows :

  • 0F00..0FFF : Basic Tibetan (the existing Tibetan block)
  • F500..F8FF : Tibetan Extension-A 藏文编码字符集(扩充集A)
  • 000F1000..000F3000 : Tibetan Extension-B 藏文编码字符集(扩充集B)

The paper also stated that there should be two implementation levels for Tibetan :

  1. Level 1 : Only works with non-combining and precomposed Tibetan characters
  2. Level 2 : Works with combining and precomposed characters

Level 1 would not be required to process any of the following characters :

  • 0F18 TIBETAN ASTROLOGICAL SIGN -KHYUD PA
  • 0F19 TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS
  • 0F35 TIBETAN MARK NGAS BZUNG NYI ZLA
  • 0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS
  • 0F39 TIBETAN MARK TSA -PHRU
  • 0F3E TIBETAN SIGN YAR TSHES
  • 0F3F TIBETAN SIGN MAR TSHES
  • 0F71 TIBETAN VOWEL SIGN AA
  • 0F72 TIBETAN VOWEL SIGN I
  • 0F73 TIBETAN VOWEL SIGN II
  • 0F74 TIBETAN VOWEL SIGN U
  • 0F75 TIBETAN VOWEL SIGN UU
  • 0F76 TIBETAN VOWEL SIGN VOCALIC R
  • 0F77 TIBETAN VOWEL SIGN VOCALIC RR
  • 0F78 TIBETAN VOWEL SIGN VOCALIC L
  • 0F79 TIBETAN VOWEL SIGN VOCALIC LL
  • 0F7A TIBETAN VOWEL SIGN E
  • 0F7B TIBETAN VOWEL SIGN EE
  • 0F7C TIBETAN VOWEL SIGN O
  • 0F7D TIBETAN VOWEL SIGN OO
  • 0F7E TIBETAN SIGN RJES SU NGA RO
  • 0F7F TIBETAN SIGN RNAM BCAD
  • 0F80 TIBETAN VOWEL SIGN REVERSED I
  • 0F81 TIBETAN VOWEL SIGN REVERSED II
  • 0F82 TIBETAN SIGN NYI ZLA NAA DA
  • 0F83 TIBETAN SIGN SNA LDAN
  • 0F84 TIBETAN MARK HALANTA
  • 0F86 TIBETAN MARK LCI RTAGS
  • 0F87 TIBETAN MARK YANG RTAGS
  • 0F90 TIBETAN SUBJOINED LETTER KA
  • 0F91 TIBETAN SUBJOINED LETTER KHA
  • 0F92 TIBETAN SUBJOINED LETTER GA
  • 0F93 TIBETAN SUBJOINED LETTER GHA
  • 0F94 TIBETAN SUBJOINED LETTER NGA
  • 0F95 TIBETAN SUBJOINED LETTER CA
  • 0F96 TIBETAN SUBJOINED LETTER CHA
  • 0F97 TIBETAN SUBJOINED LETTER JA
  • 0F99 TIBETAN SUBJOINED LETTER NYA
  • 0F9A TIBETAN SUBJOINED LETTER TTA
  • 0F9B TIBETAN SUBJOINED LETTER TTHA
  • 0F9C TIBETAN SUBJOINED LETTER DDA
  • 0F9D TIBETAN SUBJOINED LETTER DDHA
  • 0F9E TIBETAN SUBJOINED LETTER NNA
  • 0F9F TIBETAN SUBJOINED LETTER TA
  • 0FA0 TIBETAN SUBJOINED LETTER THA
  • 0FA1 TIBETAN SUBJOINED LETTER DA
  • 0FA2 TIBETAN SUBJOINED LETTER DHA
  • 0FA3 TIBETAN SUBJOINED LETTER NA
  • 0FA4 TIBETAN SUBJOINED LETTER PA
  • 0FA5 TIBETAN SUBJOINED LETTER PHA
  • 0FA6 TIBETAN SUBJOINED LETTER BA
  • 0FA7 TIBETAN SUBJOINED LETTER BHA
  • 0FA8 TIBETAN SUBJOINED LETTER MA
  • 0FA9 TIBETAN SUBJOINED LETTER TSA
  • 0FAA TIBETAN SUBJOINED LETTER TSHA
  • 0FAB TIBETAN SUBJOINED LETTER DZA
  • 0FAC TIBETAN SUBJOINED LETTER DZHA
  • 0FAD TIBETAN SUBJOINED LETTER WA
  • 0FAE TIBETAN SUBJOINED LETTER ZHA
  • 0FAF TIBETAN SUBJOINED LETTER ZA
  • 0FB0 TIBETAN SUBJOINED LETTER -A
  • 0FB1 TIBETAN SUBJOINED LETTER YA
  • 0FB2 TIBETAN SUBJOINED LETTER RA
  • 0FB3 TIBETAN SUBJOINED LETTER LA
  • 0FB4 TIBETAN SUBJOINED LETTER SHA
  • 0FB5 TIBETAN SUBJOINED LETTER SSA
  • 0FB6 TIBETAN SUBJOINED LETTER SA
  • 0FB7 TIBETAN SUBJOINED LETTER HA
  • 0FB8 TIBETAN SUBJOINED LETTER A
  • 0FB9 TIBETAN SUBJOINED LETTER KSSA
  • 0FBA TIBETAN SUBJOINED LETTER FIXED-FORM WA
  • 0FBB TIBETAN SUBJOINED LETTER FIXED-FORM YA
  • 0FBC TIBETAN SUBJOINED LETTER FIXED-FORM RA
  • 0FC6 TIBETAN SYMBOL PADMA GDAN

Level 2 would work with both standard Unicode Tibetan and the precomposed Tibetan extensions in the PUA blocks.
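
To make the distinction concrete, here is a small sketch that tests whether a string contains any of the characters a Level 1 implementation would not be required to process. The ranges are transcribed from the list above; the names `LEVEL1_EXCLUDED` and `needs_level2` are invented for this example and do not come from the paper.

```python
# Inclusive code point ranges covering exactly the characters listed above.
LEVEL1_EXCLUDED = [
    (0x0F18, 0x0F19),  # astrological signs
    (0x0F35, 0x0F35),
    (0x0F37, 0x0F37),
    (0x0F39, 0x0F39),  # tsa 'phru
    (0x0F3E, 0x0F3F),
    (0x0F71, 0x0F84),  # vowel signs and other combining signs
    (0x0F86, 0x0F87),
    (0x0F90, 0x0F97),  # subjoined letters (0F98 is unassigned)
    (0x0F99, 0x0FBC),
    (0x0FC6, 0x0FC6),
]


def needs_level2(text: str) -> bool:
    """True if the text uses any character outside the Level 1 repertoire."""
    return any(
        lo <= ord(ch) <= hi for ch in text for lo, hi in LEVEL1_EXCLUDED
    )


# Standard combining Tibetan: subjoined letters and a vowel sign.
print(needs_level2("\u0f56\u0f66\u0f92\u0fb2\u0f7c\u0f42\u0f66\u0f0b"))
# Base letters plus a precomposed PUA character only.
print(needs_level2("\u0f56\uf3b5\u0f42\u0f0b"))
```

In other words, Level 1 text is Tibetan in which every stack has been replaced by a single precomposed code point, leaving only non-combining characters from the existing block.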

Tibetan Extension-A (often referred to as "Set A"), covering the most common stacks, was published at the end of 2004, and comprises 1,536 precomposed characters in the PUA of the BMP at <F300..F8FF>. For the full repertoire see my mapping table between the Set A precomposed characters and standard Unicode Tibetan character sequences.

Tibetan Extension-B (often referred to as "Set B"), covering rarely occurring stacks, is slated for the Supplementary Private Use Area-A in Plane 15. I'm not sure how many characters it is supposed to cover, but 5,664 is a figure I have heard mentioned. It has not yet been published (as far as I know) and perhaps it never will be, as the success of OpenType Tibetan fonts is rapidly making the precomposed model redundant.

One might have expected that Tibetan Extension-A would be based on the set of BrdaRten characters proposed and rejected the previous year, but that does not seem to have been the case, as :

  • Tibetan Extension-A and Tibetan Extension-B cover many thousands more characters than the proposed BrdaRten characters (Tibetan Extension-A alone has over 50% more characters);
  • There is no obvious correlation between Tibetan Extension-A and the proposed BrdaRten characters in terms of code point sequence (see my mapping table between the proposed BrdaRten characters and Tibetan Extension-A);
  • 11 of the proposed BrdaRten characters aren't even included in Tibetan Extension-A (including the seven PH + H characters added in N2621 that I suspect are mistakes for the already included H + PH characters).

These points make me wonder just how mature the BrdaRten proposal was and whether the 962 proposed characters were perhaps intended as a foot in the door for thousands more. The fact that the proposed BrdaRten characters were replaced by a quite different set of precomposed characters also makes a mockery of the Chinese claim that the BrdaRten characters were required to be encoded for backwards compatibility with legacy data.

One interesting issue with Tibetan Extension-A is that it does not include a precomposed character for the character sequence ཨོཾ <0F68 0F7C 0F7E> (the "om" of the mantra Om Mani Padme Hūm ཨོཾ་མ་ཎི་པདྨེ་ཧཱུཾ།). This must be because the Tibetan block already includes the character TIBETAN SYLLABLE OM at U+0F00, and the Chinese took this to be equivalent to the character sequence <0F68 0F7C 0F7E>. However, this character has no Unicode decomposition, and under Unicode it is not equivalent to <0F68 0F7C 0F7E>, so it would have been better to encode a separate precomposed character corresponding to <0F68 0F7C 0F7E> in the PUA rather than use U+0F00 as if it were a precomposed character.



Implementation of Precomposed Tibetan

If you do want to or need to work with Tibetan text encoded according to the PRC's standard for extended Tibetan, then it is possible to do so now using freely available software. My BabelPad text editor supports the conversion (both ways) between standard Unicode character sequences and Extended Tibetan-A, and Chris Fynn's Jomolhari font supports both standard combining Tibetan and precomposed Tibetan. Let's give it a go.

1. We start up BabelPad, select the Jomolhari font, and open a Tibetan document encoded as standard combining Tibetan (Universal Declaration of Human Rights). The document renders perfectly (although it may not do so unless you are running Vista) :



2. Then we select "Unicode to Extended Tibetan-A" from the "Tibetan" submenu of the "Convert" menu of BabelPad. Hmm, no discernable change, document renders identically ... has it actually done anything ? Well yes it has. Take a look at the Status Bar; the character at the caret position was U+0F66 TIBETAN LETTER SA, but now it is U+F3B5 PRIVATE USE CHARACTER-F3B5, which according to the Set A Mapping Table corresponds to the decomposed sequence <0F66 0F94 0F7C> sngo (the first syllable of sngon brjod སྔོན་བརྗོད། "preamble").



3. Now hit the u" button on the BabelPad toolbar. This causes the text to be rendered in "Glyph Mode" (i.e. with all characters rendered as individual spacing glyphs). Note that the only difference is a slight change in the inter-glyph spacing and loss of smart line breaking. This shows that each stack is indeed a single character.



4. Finally, select "Extended Tibetan-A to Unicode" from the "Tibetan" submenu of the "Convert" menu of BabelPad, and it suddenly looks like we've accidentally switched to "Arial Unicode MS". Of course we haven't; we're still using Jomolhari, but now we're rendering each character as an individual spacing glyph so that the underlying difference between combining Tibetan and precomposed Tibetan is clear.
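
The round trip in the steps above amounts to a table-driven substitution. The following is a rough sketch of the idea only; it is not BabelPad's implementation, the function names are hypothetical, and the table holds just the single mapping mentioned in step 2, where a real table would need all 1,536 Set A entries.

```python
# One entry taken from step 2; a full table would cover the whole of Set A.
SET_A_TO_SEQUENCE = {
    "\uf3b5": "\u0f66\u0f94\u0f7c",  # sngo
}
SEQUENCE_TO_SET_A = {seq: pua for pua, seq in SET_A_TO_SEQUENCE.items()}
# Try longer sequences first so a stack is not split by a shorter match.
SEQUENCES_LONGEST_FIRST = sorted(SEQUENCE_TO_SET_A, key=len, reverse=True)


def to_precomposed(text: str) -> str:
    """Replace known combining sequences with their PUA characters."""
    out = []
    i = 0
    while i < len(text):
        for seq in SEQUENCES_LONGEST_FIRST:
            if text.startswith(seq, i):
                out.append(SEQUENCE_TO_SET_A[seq])
                i += len(seq)
                break
        else:
            # No table entry starts here; copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def to_combining(text: str) -> str:
    """Replace known PUA characters with their combining sequences."""
    return "".join(SET_A_TO_SEQUENCE.get(ch, ch) for ch in text)


sngon = "\u0f66\u0f94\u0f7c\u0f53\u0f0b"  # sngon + tsheg
packed = to_precomposed(sngon)
print(len(sngon), len(packed), to_combining(packed) == sngon)
```

A production converter would also have to decide what to do with stacks that have no Set A equivalent, which this sketch simply passes through unchanged.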



So there you are, standard combining Tibetan and precomposed Tibetan both work equally well (at least on Vista; I'm forced to admit that precomposed Tibetan will work fine on everything from Windows 95 onwards, which is not quite true for combining Tibetan). People in the PRC can use the precomposed model and everyone else can use the combining model. Everyone should be happy now, right ? Well, we'll just have to wait and see.

Meanwhile, here are two more things to consider :

1. How on earth are people supposed to enter Tibetan text consisting of thousands of precomposed characters ? You can't use a simple keyboard layout (as you can for Unicode Tibetan); a CJK style phonetic or transliteration IME (e.g. based on EWTS) would be useless for ordinary (or even most educated) Tibetans; and a "character picker" solution is totally impractical.

2. What will happen if China mandates support for its Extended Tibetan scheme as a requirement for GB18030 certification ? As I understand it, there is no such requirement at present and I have been told that there is no intention to make support for Extended Tibetan a GB18030 requirement, but things change.


Thursday, 14 September 2006

Precomposed Tibetan Part 1 : BrdaRten

This post really ought to have been Part 3 of a History of Tibetan Encoding in Unicode, but Michael Kaplan's recent posts on the proposed alternative syllabic encoding of Tamil here and here have encouraged me to take a look at the latest twist in the saga of Tibetan encoding before I visit its early history of false starts and lost opportunities.

Tibetan is not a difficult script to read or write, but it is a very complex script to deal with in terms of computer processing (as far as complexity goes I would rate it second only to the Mongolian script). The problem is that written Tibetan comprises complex syllable units (known in Tibetan as a tsheg bar ཚེག་བར) which although written horizontally may include vertical clusters of consonants and vowel signs agglutinating around a base consonant (a vertical cluster is known as a "stack"). Thus most words have a horizontal and a vertical dimension, with the result that text is not laid out in a straight line as in most scripts. For example, the word bsGrogs བསྒྲོགས་ (pronounced drok ... obviously!) may be analysed as follows :



  • b (blue) = prefix (silent)
  • s (green) = superfix (silent)
  • g (red) = base consonant
  • r (purple) = subfix
  • o (yellow) = vowel sign
  • g (turquoise) = terminal
  • s (pink) = postfix (silent)

In the Unicode Tibetan encoding model a vertical stack (sgro སྒྲོ in the above example) is treated as a composite unit comprising (in the simple case, ignoring the complexities of Sanskrit transliteration and shorthand contractions) a single consonant from the range <0F40..0F6A>, zero or more subjoined consonants from the range <0F90..0FBC> and zero or one vowel sign. Thus the word bsGrogs is represented as <0F56 0F66 0F92 0FB2 0F7C 0F42 0F66 0F0B>.
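
The encoded sequence for bsGrogs can be taken apart with a few lines of Python. This is only a sketch of the simple case described above: the `classify` function and its category labels are made up for illustration, and the vowel sign test is a deliberately rough range check.

```python
import unicodedata


def classify(ch: str) -> str:
    """Rough category of a Tibetan character under the combining model."""
    cp = ord(ch)
    if 0x0F40 <= cp <= 0x0F6A:
        return "consonant"
    if 0x0F90 <= cp <= 0x0FBC:
        return "subjoined"
    if 0x0F71 <= cp <= 0x0F7D or 0x0F80 <= cp <= 0x0F81:
        return "vowel sign"
    return "other"


word = "\u0f56\u0f66\u0f92\u0fb2\u0f7c\u0f42\u0f66\u0f0b"  # bsGrogs + tsheg

for ch in word:
    print(f"U+{ord(ch):04X} {classify(ch):10} {unicodedata.name(ch)}")
```

The stack sgro shows up as one consonant followed by two subjoined letters and a vowel sign; nothing in the code points says how those four pieces are to be drawn, which is exactly the job left to the font and the rendering engine.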

The encoded representation only specifies what the elements of a word are, not the precise relationship between the elements at the glyph level. It is up to the rendering system to put all the pieces together correctly, so that within a vertical stack all the component letters take the expected glyph shape (some superfixed and subfixed letters have special forms), are positioned correctly in relationship to each other and are joined together seamlessly. For several years after Tibetan was encoded in Unicode 2.0 (July 1996) no rendering system existed that was capable of doing all this, and using Unicode to write Tibetan remained a theoretical exercise. It was not until the early years of this decade that OpenType fonts supporting complex Tibetan stacks started to appear and Microsoft started to support Tibetan in its Uniscribe rendering engine. However, out-of-the-box support for Tibetan (including font and keyboard layouts) did not become available until the arrival of Vista, more than ten years after Tibetan was encoded. But if you are running Vista then Tibetan works pretty much perfectly, and, if you want, there are half a dozen freely available Unicode Tibetan fonts that you can use instead of the Tibetan font that ships with Vista ("Himalaya"). Try out my Tibetan Test Page to see whether Tibetan works for you or not.

The problem is that the Chinese government had never really bought into the decomposed Tibetan model. As far back as January 1994, when the encoding model for Tibetan was still under discussion, China submitted a proposal (N964) to encode Tibetan stacks as individual precomposed characters rather than as a sequence of combining characters, but this model was rejected in favour of the combining model.

Then six years after Tibetan encoding had been finalised, in December 2002, the Chinese national body submitted a proposal to encode nearly a thousand so-called "BrdaRten" (བརྡ་རྟེན, pronounced daden) precomposed stacks in the BMP at <A500..A8FF> (see N2558, revised the following year as N2621, and further elaborated in N2661). These precomposed stacks were intended to be used in conjunction with those existing Tibetan characters that were non-combining (e.g. the consonants at <0F40..0F6A> but not the subjoined consonants at <0F90..0FBC> or any of the vowel signs), so that a word such as bsGrogs བསྒྲོགས་ would be encoded as <0F56 A5BA 0F42 0F66 0F0B> instead of <0F56 0F66 0F92 0FB2 0F7C 0F42 0F66 0F0B> under the existing encoding model (five code point units instead of eight). See my BrdaRten Mapping Table for a list of the 962 proposed BrdaRten characters (originally 956 in N2558), with their mappings to standard Unicode character sequences.

The arguments put forward by China in support of the proposed BrdaRten encoding are very poorly articulated, but I think they boil down to four basic points (with my observations in brackets) :

  1. The technical difficulties of implementing a system that can dynamically compose Tibetan stacks from a sequence of multiple characters [the technical difficulties had already been overcome at this time, as can be seen from N2624 which shows that all the proposed BrdaRten stacks could already be rendered correctly under the existing encoding model using OpenType font technology];
  2. The existence of gigabytes of legacy Tibetan data encoded using the BrdaRten model [but N2661 admits that the legacy data uses different repertoires and different code points, so mapping tables are required anyway];
  3. Precomposed stacks have been treated as single units since the advent of lead typesetting [nice picture of Tibetan lead type in Fig.1 of N2661, but not really relevant];
  4. On average BrdaRten stacks occupy 23% of Tibetan text and so BrdaRten cannot be ignored [hmm, this is where they forgot to mention the substantial reductions in storage costs that encoding precomposed characters would bring].

The counter-arguments boil down to :

  1. Precomposed characters are unnecessary as complex vertical stacks can already be dealt with satisfactorily in the existing encoding model using "smart font" technology such as OpenType;
  2. Encoding precomposed characters would introduce multiple non-equivalent spellings for Tibetan words (because of the Unicode Stability Policy if precomposed BrdaRten characters were encoded they would not be canonically equivalent to the corresponding decomposed character sequence), which would have severe implications on processes such as collation and searching;
  3. Encoding precomposed characters would create two competing models for Tibetan, with the result that people inside and outside of the PRC may end up creating mutually incompatible documents, thus restricting information exchange;
  4. The 962 proposed characters do not cater for all the thousands of less common stacks used for Sanskrit, and so the combining model is still required for representing many religious texts (and China is the major source for modern editions of Tibetan religious texts);
  5. As the new BrdaRten encoding model would not displace the existing encoding model, applications would still have to support standard combining Tibetan, so the scheme only adds an extra layer of complexity for systems that need to fully support Tibetan.

Not unexpectedly there was very strong opposition to the Chinese proposal (see N2624, N2625, N2635, N2637, N2638 and especially Peter Constable's systematic refutation in N2668). When China's proposal was discussed by WG2 in October 2003 it was firmly rejected :

With reference to the revised proposal in document N2621 on Tibetan BrdaRten from China, WG 2 resolves not to encode the suggested list of characters in the standard based on the following:

  1. All of the proposed characters can already be represented as sequences of existing encoded UCS characters, as shown explicitly in document N2624.
  2. The addition of the proposed characters would thereby lead to normalization issues.
  3. The addition of the proposed characters would also amount to a change in the overall encoding model for the Tibetan script, thereby destabilizing and introducing more complexity for existing implementations conformant to the standard.

Further:

  1. WG 2 notes that the various implementation issues for BrdaRten Tibetan raised in documents N2621 and N2661 can be addressed in a variety of ways, involving dynamic conversion interfaces to existing legacy systems and other techniques suggested in document N2668.
  2. WG 2 suggests that the list of BrdaRten Tibetan stacks enumerated in document N2621 might be appropriate for processing by WG 2 as additions to a potential future annex of named entities represented by USIs, rather than be encoded as individual characters.
  3. WG 2 notes the issues on Tibetan script encoding and its implementation in document N2661, and invites WG 2 experts to work with Chinese experts to arrive at a satisfactory solution.

RESOLUTION M44.20 (Tibetan BrdaRten)

China was furious at this outcome (see N2674), and vowed to oppose the encoding of any scripts "less alive than BrdaRten" in the BMP. They promptly opposed the encoding of Syloti Nagri and Phags-pa (see Resolutions M44.3 and M44.4), which was bad news for me as I was responsible for the Phags-pa proposal, and agreement from China was essential for its success. At the time I thought that their opposition to Phags-pa must be revenge for my opposition to BrdaRten (N2624), but I have been assured that the Chinese just wanted to keep the proposed window at <A500..A8FF> open (and Syloti Nagri was put at <A800..A82F> and Phags-pa put at <A840..A87F>).

China may have lost the battle at WG2, but as we will see in Part 2, this was far from the end of the story.