As discussed in Part 1, in 2002-2003 China tried and failed to get nearly a thousand precomposed Tibetan characters encoded in ISO/IEC 10646 (which is the international standard corresponding to Unicode).
Following on from this humiliating defeat, in April of 2004 Joe Zhang (Zhang Zhoucai 张轴材), formerly a contributing editor of ISO/IEC 10646, presented to a conference in China a paper that outlined a new Chinese encoding standard for Tibetan, codenamed the "Everest Scheme". This scheme utilizes the Private Use Areas (PUA) of the UCS to encode several thousand precomposed Tibetan characters, and was characterised as a "national standard within the framework of an international standard". Under this scheme Tibetan characters would be distributed as follows :
- 0F00..0FFF : Basic Tibetan (the existing Tibetan block)
- F300..F8FF : Tibetan Extension-A 藏文编码字符集(扩充集A)
- 000F1000..000F3000 : Tibetan Extension-B 藏文编码字符集(扩充集B)
The paper also stated that there should be two implementation levels for Tibetan :
- Level 1 : Only works with non-combining and precomposed Tibetan characters
- Level 2 : Works with combining and precomposed characters
Level 1 would not be required to process any of the following characters :
- 0F18 TIBETAN ASTROLOGICAL SIGN -KHYUD PA
- 0F19 TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS
- 0F35 TIBETAN MARK NGAS BZUNG NYI ZLA
- 0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS
- 0F39 TIBETAN MARK TSA -PHRU
- 0F3E TIBETAN SIGN YAR TSHES
- 0F3F TIBETAN SIGN MAR TSHES
- 0F71 TIBETAN VOWEL SIGN AA
- 0F72 TIBETAN VOWEL SIGN I
- 0F73 TIBETAN VOWEL SIGN II
- 0F74 TIBETAN VOWEL SIGN U
- 0F75 TIBETAN VOWEL SIGN UU
- 0F76 TIBETAN VOWEL SIGN VOCALIC R
- 0F77 TIBETAN VOWEL SIGN VOCALIC RR
- 0F78 TIBETAN VOWEL SIGN VOCALIC L
- 0F79 TIBETAN VOWEL SIGN VOCALIC LL
- 0F7A TIBETAN VOWEL SIGN E
- 0F7B TIBETAN VOWEL SIGN EE
- 0F7C TIBETAN VOWEL SIGN O
- 0F7D TIBETAN VOWEL SIGN OO
- 0F7E TIBETAN SIGN RJES SU NGA RO
- 0F7F TIBETAN SIGN RNAM BCAD
- 0F80 TIBETAN VOWEL SIGN REVERSED I
- 0F81 TIBETAN VOWEL SIGN REVERSED II
- 0F82 TIBETAN SIGN NYI ZLA NAA DA
- 0F83 TIBETAN SIGN SNA LDAN
- 0F84 TIBETAN MARK HALANTA
- 0F86 TIBETAN MARK LCI RTAGS
- 0F87 TIBETAN MARK YANG RTAGS
- 0F90 TIBETAN SUBJOINED LETTER KA
- 0F91 TIBETAN SUBJOINED LETTER KHA
- 0F92 TIBETAN SUBJOINED LETTER GA
- 0F93 TIBETAN SUBJOINED LETTER GHA
- 0F94 TIBETAN SUBJOINED LETTER NGA
- 0F95 TIBETAN SUBJOINED LETTER CA
- 0F96 TIBETAN SUBJOINED LETTER CHA
- 0F97 TIBETAN SUBJOINED LETTER JA
- 0F99 TIBETAN SUBJOINED LETTER NYA
- 0F9A TIBETAN SUBJOINED LETTER TTA
- 0F9B TIBETAN SUBJOINED LETTER TTHA
- 0F9C TIBETAN SUBJOINED LETTER DDA
- 0F9D TIBETAN SUBJOINED LETTER DDHA
- 0F9E TIBETAN SUBJOINED LETTER NNA
- 0F9F TIBETAN SUBJOINED LETTER TA
- 0FA0 TIBETAN SUBJOINED LETTER THA
- 0FA1 TIBETAN SUBJOINED LETTER DA
- 0FA2 TIBETAN SUBJOINED LETTER DHA
- 0FA3 TIBETAN SUBJOINED LETTER NA
- 0FA4 TIBETAN SUBJOINED LETTER PA
- 0FA5 TIBETAN SUBJOINED LETTER PHA
- 0FA6 TIBETAN SUBJOINED LETTER BA
- 0FA7 TIBETAN SUBJOINED LETTER BHA
- 0FA8 TIBETAN SUBJOINED LETTER MA
- 0FA9 TIBETAN SUBJOINED LETTER TSA
- 0FAA TIBETAN SUBJOINED LETTER TSHA
- 0FAB TIBETAN SUBJOINED LETTER DZA
- 0FAC TIBETAN SUBJOINED LETTER DZHA
- 0FAD TIBETAN SUBJOINED LETTER WA
- 0FAE TIBETAN SUBJOINED LETTER ZHA
- 0FAF TIBETAN SUBJOINED LETTER ZA
- 0FB0 TIBETAN SUBJOINED LETTER -A
- 0FB1 TIBETAN SUBJOINED LETTER YA
- 0FB2 TIBETAN SUBJOINED LETTER RA
- 0FB3 TIBETAN SUBJOINED LETTER LA
- 0FB4 TIBETAN SUBJOINED LETTER SHA
- 0FB5 TIBETAN SUBJOINED LETTER SSA
- 0FB6 TIBETAN SUBJOINED LETTER SA
- 0FB7 TIBETAN SUBJOINED LETTER HA
- 0FB8 TIBETAN SUBJOINED LETTER A
- 0FB9 TIBETAN SUBJOINED LETTER KSSA
- 0FBA TIBETAN SUBJOINED LETTER FIXED-FORM WA
- 0FBB TIBETAN SUBJOINED LETTER FIXED-FORM YA
- 0FBC TIBETAN SUBJOINED LETTER FIXED-FORM RA
- 0FC6 TIBETAN SYMBOL PADMA GDAN
Level 2 would work with both standard Unicode Tibetan and the precomposed Tibetan extensions in the PUA blocks.
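To make the Level 1 restriction concrete, here is a minimal Python sketch that checks whether a piece of text stays within Level 1. The ranges are simply the list above collapsed into inclusive code point ranges; the function names are my own invention for illustration and do not come from the paper.

```python
# Combining characters that a Level 1 implementation would not be required
# to process, transcribed from the list above as inclusive ranges.
LEVEL1_EXCLUDED_RANGES = [
    (0x0F18, 0x0F19),  # astrological signs
    (0x0F35, 0x0F35),
    (0x0F37, 0x0F37),
    (0x0F39, 0x0F39),
    (0x0F3E, 0x0F3F),
    (0x0F71, 0x0F84),  # vowel signs and other combining signs
    (0x0F86, 0x0F87),
    (0x0F90, 0x0F97),  # subjoined letters KA..JA
    (0x0F99, 0x0FBC),  # subjoined letters NYA..FIXED-FORM RA
    (0x0FC6, 0x0FC6),
]


def is_level1_excluded(ch: str) -> bool:
    """True if ch is one of the characters listed as outside Level 1."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LEVEL1_EXCLUDED_RANGES)


def first_level1_violation(text: str):
    """Return (index, character) of the first excluded character, or None."""
    for i, ch in enumerate(text):
        if is_level1_excluded(ch):
            return i, ch
    return None
```

Text written as standard combining Tibetan fails this check as soon as it reaches a subjoined letter or vowel sign, whereas text made up only of base letters, punctuation and PUA characters passes.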
Tibetan Extension-A (often referred to as "Set A"), covering the most common stacks, was published at the end of 2004, and comprises 1,536 precomposed characters in the PUA of the BMP at <F300..F8FF>. For the full repertoire see my mapping table between the Set A precomposed characters and standard Unicode Tibetan character sequences.
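As a quick sanity check on those numbers, the following Python sketch confirms that the range F300..F8FF holds exactly 1,536 code points and tests whether a given character falls inside Set A. The constant and function names are mine, chosen only for illustration.

```python
import unicodedata

# Set A occupies the top end of the BMP Private Use Area (E000..F8FF).
SET_A_FIRST = 0xF300
SET_A_LAST = 0xF8FF


def set_a_size() -> int:
    """Number of code points in the inclusive range F300..F8FF."""
    return SET_A_LAST - SET_A_FIRST + 1


def is_set_a(ch: str) -> bool:
    """True if ch lies in the range used by Tibetan Extension-A."""
    return SET_A_FIRST <= ord(ch) <= SET_A_LAST


def is_private_use(ch: str) -> bool:
    """True if Unicode assigns ch the general category Co (private use)."""
    return unicodedata.category(ch) == "Co"
```

Note that Unicode itself knows nothing about Set A: as far as the standard is concerned every one of these code points is just an anonymous private use character.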
Tibetan Extension-B (often referred to as "Set B"), covering rarely occurring stacks, is slated for the Supplementary Private Use Area-A in Plane 15. I'm not sure how many characters it is supposed to cover, but 5,664 is a figure I have heard mentioned. It has not yet been published (as far as I know) and perhaps it never will be, as the success of OpenType Tibetan fonts is rapidly making the precomposed model redundant.
One might have expected that Tibetan Extension-A would be based on the set of BrdaRten characters proposed and rejected the previous year, but that does not seem to have been the case, as :
- Tibetan Extension-A and Tibetan Extension-B cover many thousands more characters than the proposed BrdaRten characters (Tibetan Extension-A alone has over 50% more characters);
- There is no obvious correlation between Tibetan Extension-A and the proposed BrdaRten characters in terms of code point sequence (see my mapping table between the proposed BrdaRten characters and Tibetan Extension-A);
- 11 of the proposed BrdaRten characters aren't even included in Tibetan Extension-A (including the seven PH + H characters added in N2621 that I suspect are mistakes for the already included H + PH characters).
These points make me wonder just how mature the BrdaRten proposal was and whether the 962 proposed characters were perhaps intended as a foot in the door for thousands more. The fact that the proposed BrdaRten characters were replaced by a quite different set of precomposed characters also makes a mockery of the Chinese claim that the BrdaRten characters were required to be encoded for backwards compatibility with legacy data.
One interesting issue with Tibetan Extension-A is that it does not include a precomposed character for the character sequence ཨོཾ <0F68 0F7C 0F7E> (the "om" of the mantra Om Mani Padme Hūm ཨོཾ་མ་ཎི་པདྨེ་ཧཱུཾ།). This must be because the Tibetan block already includes the character TIBETAN SYLLABLE OM ༀ at U+0F00, and the Chinese took this to be equivalent to the character sequence <0F68 0F7C 0F7E>. However, this character has no Unicode decomposition, and under Unicode it is not equivalent to <0F68 0F7C 0F7E>, so it would have been better to encode a separate precomposed character corresponding to <0F68 0F7C 0F7E> in the PUA rather than use U+0F00 as if it were a precomposed character.
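The lack of equivalence is easy to verify with Python's standard `unicodedata` module. This sketch only uses the normalization forms defined by Unicode; the helper names are my own.

```python
import unicodedata

OM_SYLLABLE = "\u0F00"              # TIBETAN SYLLABLE OM
OM_SEQUENCE = "\u0F68\u0F7C\u0F7E"  # letter A + vowel sign O + rjes su nga ro


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """Compare two strings after compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


# U+0F00 has an empty decomposition field, so normalization leaves it alone
# and it never compares equal to the three-character sequence.
print(repr(unicodedata.decomposition(OM_SYLLABLE)))
print(canonically_equivalent(OM_SYLLABLE, OM_SEQUENCE))
print(compatibility_equivalent(OM_SYLLABLE, OM_SEQUENCE))
```

So any software that compares or searches text using Unicode normalization will treat the two spellings of "om" as different strings, whatever the mapping table says.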
Implementation of Precomposed Tibetan
If you do want to or need to work with Tibetan text encoded according to the PRC's standard for extended Tibetan, then it is possible to do so now using freely available software. My BabelPad text editor supports the conversion (both ways) between standard Unicode character sequences and Extended Tibetan-A, and Chris Fynn's Jomolhari font supports both standard combining Tibetan and precomposed Tibetan. Let's give it a go.
1. We start up BabelPad, select the Jomolhari font, and open a Tibetan document encoded as standard combining Tibetan (Universal Declaration of Human Rights). The document renders perfectly (although it may not do so unless you are running Vista) :
2. Then we select "Unicode to Extended Tibetan-A" from the "Tibetan" submenu of the "Convert" menu of BabelPad. Hmm, no discernible change, the document renders identically ... has it actually done anything ? Well yes it has. Take a look at the Status Bar; the character at the caret position was U+0F66 TIBETAN LETTER SA, but now it is U+F3B5 PRIVATE USE CHARACTER-F3B5, which according to the Set A Mapping Table corresponds to the decomposed sequence <0F66 0F94 0F7C> sngo (the first syllable of sngon brjod སྔོན་བརྗོད། "preamble").
3. Now hit the "u" button on the BabelPad toolbar. This causes the text to be rendered in "Glyph Mode" (i.e. with all characters rendered as individual spacing glyphs). Note that the only difference is a slight change in the inter-glyph spacing and loss of smart line breaking. This shows that each stack is indeed a single character.
4. Finally, select "Extended Tibetan-A to Unicode" from the "Tibetan" submenu of the "Convert" menu of BabelPad, and it suddenly looks like we've accidentally switched to "Arial Unicode MS". Of course we haven't; we're still using Jomolhari, but now we're rendering each character as an individual spacing glyph so that the underlying difference between combining Tibetan and precomposed Tibetan is clear.
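For anyone curious about what the two conversions in steps 2 and 4 involve, here is a rough Python sketch of a table-driven round trip. It is not BabelPad's actual code, and the table contains only the single mapping quoted in step 2 (U+F3B5 for <0F66 0F94 0F7C>); a real converter would load all 1,536 entries from the Set A mapping table. The function and variable names are hypothetical.

```python
# Illustrative table with one entry taken from the text above.
SEQUENCE_TO_SET_A = {
    "\u0F66\u0F94\u0F7C": "\uF3B5",  # sngo
}
SET_A_TO_SEQUENCE = {pua: seq for seq, pua in SEQUENCE_TO_SET_A.items()}
MAX_SEQUENCE_LEN = max(len(seq) for seq in SEQUENCE_TO_SET_A)


def unicode_to_set_a(text: str) -> str:
    """Replace known combining sequences with their precomposed PUA character."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that a complete stack is
        # preferred over any shorter sequence that is a prefix of it.
        for n in range(min(MAX_SEQUENCE_LEN, len(text) - i), 0, -1):
            pua = SEQUENCE_TO_SET_A.get(text[i:i + n])
            if pua is not None:
                out.append(pua)
                i += n
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def set_a_to_unicode(text: str) -> str:
    """Expand each precomposed PUA character back into its sequence."""
    return "".join(SET_A_TO_SEQUENCE.get(ch, ch) for ch in text)


sngon = "\u0F66\u0F94\u0F7C\u0F53"       # four characters in standard Unicode
precomposed = unicode_to_set_a(sngon)     # two characters: U+F3B5, U+0F53
assert set_a_to_unicode(precomposed) == sngon
```

Going from Set A back to Unicode is a simple one-to-one lookup; it is the forward direction that needs longest-match logic, because one stack's sequence can be a prefix of another's.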
So there you are, standard combining Tibetan and precomposed Tibetan both work equally well (at least on Vista; I'm forced to admit that precomposed Tibetan will work fine on everything from Windows 95 onwards, which is not quite true for combining Tibetan). People in the PRC can use the precomposed model and everyone else can use the combining model. Everyone should be happy now, right ? Well, we'll just have to wait and see.
Meanwhile, here are two more things to consider :
1. How on earth are people supposed to enter Tibetan text consisting of thousands of precomposed characters ? You can't use a simple keyboard layout (as you can for Unicode Tibetan); a CJK style phonetic or transliteration IME (e.g. based on EWTS) would be useless for ordinary (or even most educated) Tibetans; and a "character picker" solution is totally impractical.
2. What will happen if China mandates support for its Extended Tibetan scheme as a requirement for GB18030 certification ? As I understand it, there is no such requirement at present and I have been told that there is no intention to make support for Extended Tibetan a GB18030 requirement, but things change.