Sunday, September 17, 2006

Precomposed Tibetan Part 2 : Stuck in the PUA

As discussed in Part 1, in 2002-2003 China tried and failed to get nearly a thousand precomposed Tibetan characters encoded in ISO/IEC 10646 (which is the international standard corresponding to Unicode).

Following on from this humiliating defeat, in April of 2004 Joe Zhang (Zhang Zhoucai 张轴材), formerly a contributing editor of ISO/IEC 10646, presented to a conference in China a paper that outlined a new Chinese encoding standard for Tibetan, codenamed the "Everest Scheme". This scheme utilizes the Private Use Areas (PUA) of the UCS to encode several thousand precomposed Tibetan characters, and was characterised as a "national standard within the framework of an international standard". Under this scheme Tibetan characters would be distributed as follows :

  • 0F00..0FFF : Basic Tibetan (the existing Tibetan block)
  • F300..F8FF : Tibetan Extension-A 藏文编码字符集(扩充集A)
  • 000F1000..000F3000 : Tibetan Extension-B 藏文编码字符集(扩充集B)

The paper also stated that there should be two implementation levels for Tibetan :

  1. Level 1 : Only works with non-combining and precomposed Tibetan characters
  2. Level 2 : Works with combining and precomposed characters

Level 1 would not be required to process any of the following characters :

  • 0F18 TIBETAN ASTROLOGICAL SIGN -KHYUD PA
  • 0F19 TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS
  • 0F35 TIBETAN MARK NGAS BZUNG NYI ZLA
  • 0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS
  • 0F39 TIBETAN MARK TSA -PHRU
  • 0F3E TIBETAN SIGN YAR TSHES
  • 0F3F TIBETAN SIGN MAR TSHES
  • 0F71 TIBETAN VOWEL SIGN AA
  • 0F72 TIBETAN VOWEL SIGN I
  • 0F73 TIBETAN VOWEL SIGN II
  • 0F74 TIBETAN VOWEL SIGN U
  • 0F75 TIBETAN VOWEL SIGN UU
  • 0F76 TIBETAN VOWEL SIGN VOCALIC R
  • 0F77 TIBETAN VOWEL SIGN VOCALIC RR
  • 0F78 TIBETAN VOWEL SIGN VOCALIC L
  • 0F79 TIBETAN VOWEL SIGN VOCALIC LL
  • 0F7A TIBETAN VOWEL SIGN E
  • 0F7B TIBETAN VOWEL SIGN EE
  • 0F7C TIBETAN VOWEL SIGN O
  • 0F7D TIBETAN VOWEL SIGN OO
  • 0F7E TIBETAN SIGN RJES SU NGA RO
  • 0F7F TIBETAN SIGN RNAM BCAD
  • 0F80 TIBETAN VOWEL SIGN REVERSED I
  • 0F81 TIBETAN VOWEL SIGN REVERSED II
  • 0F82 TIBETAN SIGN NYI ZLA NAA DA
  • 0F83 TIBETAN SIGN SNA LDAN
  • 0F84 TIBETAN MARK HALANTA
  • 0F86 TIBETAN MARK LCI RTAGS
  • 0F87 TIBETAN MARK YANG RTAGS
  • 0F90 TIBETAN SUBJOINED LETTER KA
  • 0F91 TIBETAN SUBJOINED LETTER KHA
  • 0F92 TIBETAN SUBJOINED LETTER GA
  • 0F93 TIBETAN SUBJOINED LETTER GHA
  • 0F94 TIBETAN SUBJOINED LETTER NGA
  • 0F95 TIBETAN SUBJOINED LETTER CA
  • 0F96 TIBETAN SUBJOINED LETTER CHA
  • 0F97 TIBETAN SUBJOINED LETTER JA
  • 0F99 TIBETAN SUBJOINED LETTER NYA
  • 0F9A TIBETAN SUBJOINED LETTER TTA
  • 0F9B TIBETAN SUBJOINED LETTER TTHA
  • 0F9C TIBETAN SUBJOINED LETTER DDA
  • 0F9D TIBETAN SUBJOINED LETTER DDHA
  • 0F9E TIBETAN SUBJOINED LETTER NNA
  • 0F9F TIBETAN SUBJOINED LETTER TA
  • 0FA0 TIBETAN SUBJOINED LETTER THA
  • 0FA1 TIBETAN SUBJOINED LETTER DA
  • 0FA2 TIBETAN SUBJOINED LETTER DHA
  • 0FA3 TIBETAN SUBJOINED LETTER NA
  • 0FA4 TIBETAN SUBJOINED LETTER PA
  • 0FA5 TIBETAN SUBJOINED LETTER PHA
  • 0FA6 TIBETAN SUBJOINED LETTER BA
  • 0FA7 TIBETAN SUBJOINED LETTER BHA
  • 0FA8 TIBETAN SUBJOINED LETTER MA
  • 0FA9 TIBETAN SUBJOINED LETTER TSA
  • 0FAA TIBETAN SUBJOINED LETTER TSHA
  • 0FAB TIBETAN SUBJOINED LETTER DZA
  • 0FAC TIBETAN SUBJOINED LETTER DZHA
  • 0FAD TIBETAN SUBJOINED LETTER WA
  • 0FAE TIBETAN SUBJOINED LETTER ZHA
  • 0FAF TIBETAN SUBJOINED LETTER ZA
  • 0FB0 TIBETAN SUBJOINED LETTER -A
  • 0FB1 TIBETAN SUBJOINED LETTER YA
  • 0FB2 TIBETAN SUBJOINED LETTER RA
  • 0FB3 TIBETAN SUBJOINED LETTER LA
  • 0FB4 TIBETAN SUBJOINED LETTER SHA
  • 0FB5 TIBETAN SUBJOINED LETTER SSA
  • 0FB6 TIBETAN SUBJOINED LETTER SA
  • 0FB7 TIBETAN SUBJOINED LETTER HA
  • 0FB8 TIBETAN SUBJOINED LETTER A
  • 0FB9 TIBETAN SUBJOINED LETTER KSSA
  • 0FBA TIBETAN SUBJOINED LETTER FIXED-FORM WA
  • 0FBB TIBETAN SUBJOINED LETTER FIXED-FORM YA
  • 0FBC TIBETAN SUBJOINED LETTER FIXED-FORM RA
  • 0FC6 TIBETAN SYMBOL PADMA GDAN
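
What the characters in this list have in common is that they are all combining marks. As a rough sketch (the names `LEVEL1_EXCLUDED_RANGES`, `is_combining_mark` and `needs_level2` are made up for illustration and do not come from the Everest Scheme paper), the list can be checked against the Unicode General Category using Python's standard `unicodedata` module:

```python
import unicodedata

# Code points copied from the list above, written as inclusive ranges.
LEVEL1_EXCLUDED_RANGES = [
    (0x0F18, 0x0F19), (0x0F35, 0x0F35), (0x0F37, 0x0F37), (0x0F39, 0x0F39),
    (0x0F3E, 0x0F3F), (0x0F71, 0x0F84), (0x0F86, 0x0F87),
    (0x0F90, 0x0F97), (0x0F99, 0x0FBC), (0x0FC6, 0x0FC6),
]

# Expand the ranges into a flat list of code points.
LEVEL1_EXCLUDED = [
    cp for lo, hi in LEVEL1_EXCLUDED_RANGES for cp in range(lo, hi + 1)
]
LEVEL1_EXCLUDED_SET = frozenset(LEVEL1_EXCLUDED)


def is_combining_mark(cp: int) -> bool:
    """True if the code point is a nonspacing (Mn) or spacing (Mc) mark."""
    return unicodedata.category(chr(cp)) in ("Mn", "Mc")


def needs_level2(text: str) -> bool:
    """True if the text uses any character that Level 1 need not process."""
    return any(ord(ch) in LEVEL1_EXCLUDED_SET for ch in text)


print(len(LEVEL1_EXCLUDED))                                # 74
print(all(is_combining_mark(cp) for cp in LEVEL1_EXCLUDED))  # True
```

In other words, a Level 1 implementation in this sense would only ever see spacing characters, either from the basic Tibetan block or from the PUA.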

Level 2 would work with both standard Unicode Tibetan and the precomposed Tibetan extensions in the PUA blocks.

Tibetan Extension-A (often referred to as "Set A"), covering the most common stacks, was published at the end of 2004, and comprises 1,536 precomposed characters in the PUA of the BMP at <F300..F8FF>. For the full repertoire see my mapping table between the Set A precomposed characters and standard Unicode Tibetan character sequences.
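
The figure of 1,536 characters is simply the size of the range <F300..F8FF>, which lies at the top end of the BMP Private Use Area. A minimal sketch of a range check (the names `SET_A_FIRST`, `SET_A_LAST` and `is_set_a` are my own labels for illustration, not terms from the standard):

```python
import unicodedata

SET_A_FIRST = 0xF300
SET_A_LAST = 0xF8FF


def is_set_a(ch: str) -> bool:
    """True if the character falls in the range used by Tibetan Extension-A."""
    return SET_A_FIRST <= ord(ch) <= SET_A_LAST


set_a_size = SET_A_LAST - SET_A_FIRST + 1
print(set_a_size)                              # 1536
print(unicodedata.category(chr(SET_A_FIRST)))  # Co (private use)
print(is_set_a("\uf3b5"))                      # True
```

Note that a range check like this only says that a code point is in the right part of the PUA; it cannot tell you whether a given private use character was actually intended as Set A Tibetan or as something else entirely.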

Tibetan Extension-B (often referred to as "Set B"), covering rarely occurring stacks, is slated for the Supplementary Private Use Area-A in Plane 15. I'm not sure how many characters it is supposed to cover, but 5,664 is a figure I have heard mentioned. It has not yet been published (as far as I know) and perhaps it never will be, as the success of OpenType Tibetan fonts is rapidly making the precomposed model redundant.

One might have expected that Tibetan Extension-A would be based on the set of BrdaRten characters proposed and rejected the previous year, but that does not seem to have been the case, as :

  • Tibetan Extension-A and Tibetan Extension-B cover many thousands more characters than the proposed BrdaRten characters (Tibetan Extension-A alone has over 50% more characters);
  • There is no obvious correlation between Tibetan Extension-A and the proposed BrdaRten characters in terms of code point sequence (see my mapping table between the proposed BrdaRten characters and Tibetan Extension-A);
  • 11 of the proposed BrdaRten characters aren't even included in Tibetan Extension-A (including the seven PH + H characters added in N2621 that I suspect are mistakes for the already included H + PH characters).
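
For the record, the "over 50% more" figure in the first point works out as follows, taking 962 as the number of proposed BrdaRten characters and 1,536 as the size of Tibetan Extension-A (the variable names are just for illustration):

```python
BRDARTEN_PROPOSED = 962   # characters in the rejected BrdaRten proposal
SET_A_SIZE = 1536         # characters in Tibetan Extension-A

extra = SET_A_SIZE - BRDARTEN_PROPOSED
percent_more = 100 * extra / BRDARTEN_PROPOSED

print(extra)                    # 574
print(round(percent_more, 1))   # 59.7
```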

These points make me wonder just how mature the BrdaRten proposal was and whether the 962 proposed characters were perhaps intended as a foot in the door for thousands more. The fact that the proposed BrdaRten characters were replaced by a quite different set of precomposed characters also makes a mockery of the Chinese claim that the BrdaRten characters were required to be encoded for backwards compatibility with legacy data.

One interesting issue with Tibetan Extension-A is that it does not include a precomposed character for the character sequence ཨོཾ <0F68 0F7C 0F7E> (the "om" of the mantra Om Mani Padme Hūm ཨོཾ་མ་ཎི་པདྨེ་ཧཱུཾ།). This must be because the Tibetan block already includes the character TIBETAN SYLLABLE OM at U+0F00, and the Chinese took this to be equivalent to the character sequence <0F68 0F7C 0F7E>. However, this character has no Unicode decomposition, and under Unicode it is not equivalent to <0F68 0F7C 0F7E>, so it would have been better to encode a separate precomposed character corresponding to <0F68 0F7C 0F7E> in the PUA rather than use U+0F00 as if it were a precomposed character.
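
The lack of equivalence is easy to confirm with Python's standard `unicodedata` module. This is only a small check of the Unicode data (the helper name `equivalent_under` is my own), not a statement about how any Set A converter treats U+0F00:

```python
import unicodedata

OM_SYLLABLE = "\u0f00"              # TIBETAN SYLLABLE OM
OM_SEQUENCE = "\u0f68\u0f7c\u0f7e"  # letter A + vowel sign O + rjes su nga ro


def equivalent_under(form: str, a: str, b: str) -> bool:
    """True if a and b are identical after normalization to the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# An empty decomposition field means the character has no decomposition mapping.
has_decomposition = unicodedata.decomposition(OM_SYLLABLE) != ""

any_equivalent = any(
    equivalent_under(form, OM_SYLLABLE, OM_SEQUENCE)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)

print(has_decomposition)  # False
print(any_equivalent)     # False
```

So a search for one spelling will not find the other unless the software adds its own special-case folding.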



Implementation of Precomposed Tibetan

If you do want to or need to work with Tibetan text encoded according to the PRC's standard for extended Tibetan, then it is possible to do so now using freely available software. My BabelPad text editor supports the conversion (both ways) between standard Unicode character sequences and Extended Tibetan-A, and Chris Fynn's Jomolhari font supports both standard combining Tibetan and precomposed Tibetan. Let's give it a go.

1. We start up BabelPad, select the Jomolhari font, and open a Tibetan document encoded as standard combining Tibetan (Universal Declaration of Human Rights). The document renders perfectly (although it may not do so unless you are running Vista) :

Standard Combining Tibetan


2. Then we select "Unicode to Extended Tibetan-A" from the "Tibetan" submenu of the "Convert" menu of BabelPad. Hmm, no discernible change, document renders identically ... has it actually done anything ? Well yes it has. Take a look at the Status Bar; the character at the caret position was U+0F66 TIBETAN LETTER SA, but now it is U+F3B5 PRIVATE USE CHARACTER-F3B5, which according to the Set A Mapping Table corresponds to the decomposed sequence <0F66 0F94 0F7C> sngo (the first syllable of sngon brjod སྔོན་བརྗོད། "preamble").

Precomposed Tibetan
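
To give an idea of what such a conversion involves, here is a minimal sketch of a table-driven converter using greedy longest-match replacement. It is an assumption about how one might do it, not a description of BabelPad's actual code, and the table contains only the single mapping mentioned above (a real table would have 1,536 entries); all the names are hypothetical.

```python
# Tiny excerpt of a Set A mapping table: PUA character -> Unicode sequence.
# Only this one entry is taken from the text above.
SET_A_TO_UNICODE = {
    "\uf3b5": "\u0f66\u0f94\u0f7c",  # sngo
}

# Reverse table and the length of the longest sequence it contains.
UNICODE_TO_SET_A = {seq: pua for pua, seq in SET_A_TO_UNICODE.items()}
MAX_SEQ_LEN = max(len(seq) for seq in UNICODE_TO_SET_A)


def unicode_to_set_a(text: str) -> str:
    """Replace known combining sequences with their precomposed PUA character."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so a full stack beats its prefixes.
        for length in range(min(MAX_SEQ_LEN, len(text) - i), 0, -1):
            pua = UNICODE_TO_SET_A.get(text[i:i + length])
            if pua is not None:
                out.append(pua)
                i += length
                break
        else:
            # No mapping starts here: copy the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def set_a_to_unicode(text: str) -> str:
    """Expand each known PUA character back to its Unicode sequence."""
    return "".join(SET_A_TO_UNICODE.get(ch, ch) for ch in text)


sngon = "\u0f66\u0f94\u0f7c\u0f53"   # four code points
converted = unicode_to_set_a(sngon)
print(len(sngon), len(converted))          # 4 2
print(set_a_to_unicode(converted) == sngon)  # True
```

Anything not in the table is passed through untouched, which means that stacks outside the precomposed repertoire stay as combining sequences in the output.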


3. Now hit the "u" button on the BabelPad toolbar. This causes the text to be rendered in "Glyph Mode" (i.e. with all characters rendered as individual spacing glyphs). Note that the only difference is a slight change in the inter-glyph spacing and loss of smart line breaking. This shows that each stack is indeed a single character.

Precomposed Tibetan (Glyph Mode)


4. Finally, select "Extended Tibetan-A to Unicode" from the "Tibetan" submenu of the "Convert" menu of BabelPad, and it suddenly looks like we've accidentally switched to "Arial Unicode MS". Of course we haven't; we're still using Jomolhari, but now we're rendering each character as an individual spacing glyph so that the underlying difference between combining Tibetan and precomposed Tibetan is clear.

Standard Combining Tibetan (Glyph Mode)


So there you are, standard combining Tibetan and precomposed Tibetan both work equally well (at least on Vista; I'm forced to admit that precomposed Tibetan will work fine on everything from Windows 95 onwards, which is not quite true for combining Tibetan). People in the PRC can use the precomposed model and everyone else can use the combining model. Everyone should be happy now, right ? Well, we'll just have to wait and see.

Meanwhile, here are two more things to consider :

1. How on earth are people supposed to enter Tibetan text consisting of thousands of precomposed characters ? You can't use a simple keyboard layout (as you can for Unicode Tibetan); a CJK style phonetic or transliteration IME (e.g. based on EWTS) would be useless for ordinary (or even most educated) Tibetans; and a "character picker" solution is totally impractical.

2. What will happen if China mandates support for its Extended Tibetan scheme as a requirement for GB18030 certification ? As I understand it, there is no such requirement at present and I have been told that there is no intention to make support for Extended Tibetan a GB18030 requirement, but things change.

5 comments:

28481k said...

It is possible to do what I call the Korean solution: use an alphabetic keyboard layout and form precomposed characters on the fly. Then we could, via the keyboard setting, choose between precomposed characters and individually coded characters. Of course, this would require a lot of computing power, but that has been resolved by the immensely powerful machines we have now. Korean could have been encoded as individual letters instead of precomposed syllables if the government and the industry had had the guts in the 1980s; instead, inputting Korean requires a look-up mechanism for precomposed characters. I fear that if China really pushes this, then this potential technological drawback will have to be addressed, and a really good IME will have to be invented.

Andrew West said...

Yes, that would be a possibility. The thing is that the Chinese government has pushed for a precomposed encoding model because it wants a low tech solution to Tibetan computer processing. The downside of "simplifying" the encoding is that it makes other aspects of Tibetan computer processing, such as input methods, more complicated.

Chris Fynn said...

Although I too consider the PRC BrdaRten encoding for Tibetan a retrograde step - I finally decided to support at least part-A of this Chinese national "standard" in my Jomolhari font primarily because some Tibetans in China may not have access to systems supporting Unicode and OpenType fonts. For instance, the version of Red Flag Linux localised for Tibetan apparently uses the BrdaRten encoding. Secondly, people outside of China may need a font to display Tibetan email or web-pages created by Tibetan friends in China using this encoding.

Finally China have apparently told some people that support for their encoding will in future be a requirement for software sold or distributed in China. Does this include fonts?


Although this BrdaRten encoding contains thousands of combinations it is nowhere near exhaustive. I've already encountered hundreds of additional combinations of characters in traditional Tibetan texts unsupported by the BrdaRten encoding but which can easily be supported with plain Unicode/iso10646 character encoding.

Having a smart IME to type pre-composed Tibetan just moves the required intelligence from the font / rendering engine to the input method. I believe it belongs in the font - particularly as different forms of Tibetan script have slightly different shaping rules.

One reason you get different spacing with pre-composed Tibetan and atomic Unicode Tibetan using the same font is that the OpenType glyph positioning instructions used for kerning do not get applied to PUA characters. A number of other features in the font also do not work with

Something which gives me sleepless nights is the possibility of mixed-encoding documents. Suppose someone in China creates a Tibetan document using the BrdaRten encoding and it gets edited by someone else using a Unicode based system....

Consequently, before the release of the first official version of my Jomolhari font - which is still very much in development - I may remove support for the Chinese BrdaRten encoding.

Andrew West said...

"Finally China have apparently told some people that support for their encoding will in future be a requirement for software sold or distributed in China. Does this include fonts?"

I've been told (second hand) that it won't be, but I agree that it may well become a requirement.

"Something which gives me sleepless nights is the possibility of mixed-encoding documents. Suppose someone in China creates a Tibetan document using the BrdaRten encoding and it gets edited by someone else using a Unicode based system...."

I'm afraid that "mixed encoding" documents are going to be almost inevitable, as the precomposed model has a fixed number of characters, and the possible number of Tibetan stacks is almost limitless, so there will always be situations where you need to use the Unicode combining character mechanism to deal with obscure stacks. Indeed, Zhang Zhoucai's 2004 document explicitly allows for applications to process both precomposed PUA characters and standard combining Tibetan characters.

"Consequently, before the release of the first official version of my Jomolhari font - which is still very much in development - I may remove support for the Chinese BrdaRten encoding."

I hope you will keep the Chinese PUA mappings, as I believe that your font will help people migrate from precomposed Tibetan to standard Unicode Tibetan. However, the one thing that I would encourage you to remove is the mapping of the JHA glyph to U+0F48 (a reserved codepoint).

As I think I've already mentioned in my blog, Jomolhari is my favourite Tibetan font at present (I really love the glyph for U+0F17), and I look forward to its official release. Keep up the good work !

Chris Fynn said...

"The thing is that the Chinese government has pushed for a precomposed encoding model because it wants a low tech solution to Tibetan computer processing."

There is no real excuse for a low-tech solution. Bhutan, with minuscule human and financial resources compared to its northern neighbour China, has managed to create Dzongkha Linux along with a whole host of fully localized applications. In fact, because Dzongkha and Tibetan share the same script, essentially the same collation rules, same line breaking rules and so on, all the technically difficult parts to do with OpenType rendering and so on have already been done by the Bhutanese and the FOSS community - and it's all open source.

All that China or the Tibetans really need to do is translate the strings in the GNOME desktop, and the applications they wish to use. In MS Windows they can take advantage of Uniscribe.
