Thursday, 14 September 2006

Precomposed Tibetan Part 1 : BrdaRten

This post really ought to have been Part 3 of a History of Tibetan Encoding in Unicode, but Michael Kaplan's recent posts on the proposed alternative syllabic encoding of Tamil here and here have encouraged me to take a look at the latest twist in the saga of Tibetan encoding before I visit its early history of false starts and lost opportunities.

Tibetan is not a difficult script to read or write, but it is a very complex script to deal with in terms of computer processing (as far as complexity goes I would rate it second only to the Mongolian script). The problem is that written Tibetan comprises complex syllable units (known in Tibetan as a tsheg bar ཚེག་བར) which although written horizontally may include vertical clusters of consonants and vowel signs agglutinating around a base consonant (a vertical cluster is known as a "stack"). Thus most words have a horizontal and a vertical dimension, with the result that text is not laid out in a straight line as in most scripts. For example, the word bsGrogs བསྒྲོགས་ (pronounced drok ... obviously!) may be analysed as follows :



  • b (blue) = prefix (silent)
  • s (green) = superfix (silent)
  • g (red) = base consonant
  • r (purple) = subfix
  • o (yellow) = vowel sign
  • g (turquoise) = terminal
  • s (pink) = postfix (silent)

In the Unicode Tibetan encoding model a vertical stack (sgro སྒྲོ in the above example) is treated as a composite unit comprising (in the simple case, ignoring the complexities of Sanskrit transliteration and shorthand contractions) a single consonant from the range <0F40..0F6A>, zero or more subjoined consonants from the range <0F90..0FBC>, and zero or one vowel sign. Thus the word bsGrogs is represented as <0F56 0F66 0F92 0FB2 0F7C 0F42 0F66 0F0B>.
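The simple-case stack structure described above can be sketched in a few lines of Python. This is only an illustration, not how any real rendering engine segments text: the function name `split_stacks` is made up for this example, and the vowel-sign set is an assumption limited to the four basic signs (i, u, e, o), ignoring the Sanskrit and contraction cases.

```python
# Ranges quoted in the post. The vowel-sign set is a simplifying assumption:
# only the four basic signs i, u, e, o are recognised here.
BASE_CONSONANTS = range(0x0F40, 0x0F6B)       # <0F40..0F6A>
SUBJOINED_CONSONANTS = range(0x0F90, 0x0FBD)  # <0F90..0FBC>
VOWEL_SIGNS = {0x0F72, 0x0F74, 0x0F7A, 0x0F7C}

# bsGrogs plus the trailing tsheg, as given in the post.
BSGROGS = "\u0F56\u0F66\u0F92\u0FB2\u0F7C\u0F42\u0F66\u0F0B"


def split_stacks(text):
    """Group code points into units: base consonant, then any number of
    subjoined consonants, then at most one vowel sign. Anything else
    (for example the tsheg) becomes a unit of its own."""
    units = []
    in_stack = False  # True while the last unit may still take combining characters
    for ch in text:
        cp = ord(ch)
        if in_stack and cp in SUBJOINED_CONSONANTS:
            units[-1] += ch
        elif in_stack and cp in VOWEL_SIGNS:
            units[-1] += ch
            in_stack = False  # zero or one vowel sign closes the stack
        else:
            units.append(ch)
            in_stack = cp in BASE_CONSONANTS
    return units


def as_hex(units):
    """Render each unit as space-separated code point values."""
    return [" ".join(f"{ord(ch):04X}" for ch in unit) for unit in units]


print(as_hex(split_stacks(BSGROGS)))
```

For bsGrogs this prints `['0F56', '0F66 0F92 0FB2 0F7C', '0F42', '0F66', '0F0B']`: the eight code points fall into five units, with the stack sgro as the only multi-character one.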

The encoded representation only specifies what the elements of a word are, not the precise relationship between the elements at the glyph level. It is up to the rendering system to put all the pieces together correctly, so that within a vertical stack all the component letters take the expected glyph shape (some superfixed and subfixed letters have special forms), are positioned correctly in relationship to each other and are joined together seamlessly. For several years after Tibetan was encoded in Unicode 2.0 (July 1996) no rendering system existed that was capable of doing all this, and using Unicode to write Tibetan remained a theoretical exercise. It was not until the early years of this decade that OpenType fonts supporting complex Tibetan stacks started to appear and Microsoft started to support Tibetan in its Uniscribe rendering engine. However, out-of-the-box support for Tibetan (including font and keyboard layouts) did not become available until the arrival of Vista, more than ten years after Tibetan was encoded. But if you are running Vista then Tibetan works pretty much perfectly, and, if you want, there are half a dozen freely available Unicode Tibetan fonts that you can use instead of the Tibetan font that ships with Vista ("Himalaya"). Try out my Tibetan Test Page to see whether Tibetan works for you or not.

The problem is that the Chinese government had never really bought into the decomposed Tibetan model. As far back as January 1994, when the encoding model for Tibetan was still under discussion, China submitted a proposal (N964) to encode Tibetan stacks as individual precomposed characters rather than as a sequence of combining characters, but this model was rejected in favour of the combining model.

Then six years after Tibetan encoding had been finalised, in December 2002, the Chinese national body submitted a proposal to encode nearly a thousand so-called "BrdaRten" (བརྡ་རྟེན, pronounced daden) precomposed stacks in the BMP at <A500..A8FF> (see N2558, revised the following year as N2621, and further elaborated in N2661). These precomposed stacks were intended to be used in conjunction with those existing Tibetan characters that were non-combining (e.g. the consonants at <0F40..0F6A> but not the subjoined consonants at <0F90..0FBC> or any of the vowel signs), so that a word such as bsGrogs བསྒྲོགས་ would be encoded as <0F56 A5BA 0F42 0F66 0F0B> instead of <0F56 0F66 0F92 0FB2 0F7C 0F42 0F66 0F0B> under the existing encoding model (five code point units instead of eight). See my BrdaRten Mapping Table for a list of the 962 proposed BrdaRten characters (originally 956 in N2558), with their mappings to standard Unicode character sequences.
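To make the relationship between the two representations concrete, here is a minimal Python sketch of a mapping-table conversion. The table is a hypothetical one-entry excerpt: its single pair (A5BA for the stack sgro) is taken from the bsGrogs example above, while the names `BRDARTEN_TO_STANDARD` and `to_standard` are invented for illustration. The proposed code points were never encoded for BrdaRten, so U+A5BA stands in purely as a placeholder here.

```python
# Hypothetical excerpt of a BrdaRten mapping table. The one entry is the
# pair implied by the bsGrogs example: proposed <A5BA> = <0F66 0F92 0FB2 0F7C>.
BRDARTEN_TO_STANDARD = {
    "\uA5BA": "\u0F66\u0F92\u0FB2\u0F7C",
}

# The two spellings of bsGrogs quoted in the post.
BRDARTEN_SPELLING = "\u0F56\uA5BA\u0F42\u0F66\u0F0B"
STANDARD_SPELLING = "\u0F56\u0F66\u0F92\u0FB2\u0F7C\u0F42\u0F66\u0F0B"


def to_standard(text, table=BRDARTEN_TO_STANDARD):
    """Replace each precomposed stack with its standard combining sequence;
    characters not in the table pass through unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


converted = to_standard(BRDARTEN_SPELLING)
print(len(BRDARTEN_SPELLING), len(converted), converted == STANDARD_SPELLING)
```

Converting in this direction is a simple per-character lookup. The reverse direction would need to match multi-character sequences against the table, which is not attempted here.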

The arguments put forward by China in support of the proposed BrdaRten encoding are very poorly articulated, but I think they boil down to four basic points (with my observations in brackets) :

  1. The technical difficulties of implementing a system that can dynamically compose Tibetan stacks from a sequence of multiple characters [the technical difficulties had already been overcome at this time, as can be seen from N2624 which shows that all the proposed BrdaRten stacks could already be rendered correctly under the existing encoding model using OpenType font technology];
  2. The existence of gigabytes of legacy Tibetan data encoded using the BrdaRten model [but N2661 admits that the legacy data uses different repertoires and different code points, so mapping tables are required anyway];
  3. Precomposed stacks have been treated as single units since the advent of lead typesetting [nice picture of Tibetan lead type in Fig.1 of N2661, but not really relevant];
  4. On average BrdaRten stacks occupy 23% of Tibetan text and so BrdaRten cannot be ignored [hmm, this is where they forgot to mention the substantial reductions in storage costs that encoding precomposed characters would bring].

The counter-arguments boil down to :

  1. Precomposed characters are unnecessary as complex vertical stacks can already be dealt with satisfactorily in the existing encoding model using "smart font" technology such as OpenType;
  2. Encoding precomposed characters would introduce multiple non-equivalent spellings for Tibetan words (under the Unicode Stability Policy, precomposed BrdaRten characters, if encoded, could not be made canonically equivalent to the corresponding decomposed character sequences), which would have severe implications for processes such as collation and searching;
  3. Encoding precomposed characters would create two competing models for Tibetan, with the result that people inside and outside of the PRC may end up creating mutually incompatible documents, thus restricting information exchange;
  4. The 962 proposed characters do not cater for all the thousands of less common stacks used for Sanskrit, and so the combining model is still required for representing many religious texts (and China is the major source for modern editions of Tibetan religious texts);
  5. As the new BrdaRten encoding model would not displace the existing encoding model, applications would still have to support standard combining Tibetan, so the scheme only adds an extra layer of complexity for systems that need to fully support Tibetan.
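The second counter-argument can be illustrated with Python's standard `unicodedata` module. The Latin pair below really is canonically equivalent, so normalization unifies it. The Tibetan pair only mimics the situation described: U+A5BA is the code point from the proposal, used as a placeholder for a character that was never encoded as BrdaRten, and the helper name `equivalent` is made up for this sketch.

```python
import unicodedata


def equivalent(a, b):
    """True if two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Latin: precomposed e-acute and e + combining acute are canonically
# equivalent, so normalization makes the two spellings compare equal.
print(equivalent("\u00E9", "e\u0301"))

# Tibetan: the standard spelling of bsGrogs versus the proposed BrdaRten
# spelling. U+A5BA is only a stand-in for the proposed precomposed stack;
# without canonical equivalence, normalization cannot reconcile the two.
STANDARD = "\u0F56\u0F66\u0F92\u0FB2\u0F7C\u0F42\u0F66\u0F0B"
BRDARTEN = "\u0F56\uA5BA\u0F42\u0F66\u0F0B"
print(equivalent(STANDARD, BRDARTEN))

# A plain substring search for the decomposed stack sgro misses the word
# when the text uses the precomposed spelling.
SGRO = "\u0F66\u0F92\u0FB2\u0F7C"
print(SGRO in STANDARD, SGRO in BRDARTEN)
```

Any search or collation routine would therefore need its own mapping step to treat the two spellings as the same word, which is the extra layer of complexity the counter-arguments point to.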

Not unexpectedly there was very strong opposition to the Chinese proposal (see N2624, N2625, N2635, N2637, N2638 and especially Peter Constable's systematic refutation in N2668). When China's proposal was discussed by WG2 in October 2003 it was firmly rejected :

With reference to the revised proposal in document N2621 on Tibetan BrdaRten from China, WG 2 resolves not to encode the suggested list of characters in the standard based on the following:

  1. All of the proposed characters can already be represented as sequences of existing encoded UCS characters, as shown explicitly in document N2624.
  2. The addition of the proposed characters would thereby lead to normalization issues.
  3. The addition of the proposed characters would also amount to a change in the overall encoding model for the Tibetan script, thereby destabilizing and introducing more complexity for existing implementations conformant to the standard.

Further:

  1. WG 2 notes that the various implementation issues for BrdaRten Tibetan raised in documents N2621 and N2661 can be addressed in a variety of ways, involving dynamic conversion interfaces to existing legacy systems and other techniques suggested in document N2668.
  2. WG 2 suggests that the list of BrdaRten Tibetan stacks enumerated in document N2621 might be appropriate for processing by WG 2 as additions to a potential future annex of named entities represented by USIs, rather than be encoded as individual characters.
  3. WG 2 notes the issues on Tibetan script encoding and its implementation in document N2661, and invites WG 2 experts to work with Chinese experts to arrive at a satisfactory solution.

RESOLUTION M44.20 (Tibetan BrdaRten)

China was furious at this outcome (see N2674), and vowed to oppose the encoding of any scripts "less alive than BrdaRten" in the BMP. They promptly opposed the encoding of Syloti Nagri and Phags-pa (see Resolutions M44.3 and M44.4), which was bad news for me as I was responsible for the Phags-pa proposal, and agreement from China was essential for its success. At the time I thought that their opposition to Phags-pa must be revenge for my opposition to BrdaRten (N2624), but I have been assured that the Chinese just wanted to keep the proposed window at <A500..A8FF> open (Syloti Nagri had been put at <A800..A82F> and Phags-pa at <A840..A87F>).

China may have lost the battle at WG2, but as we will see in Part 2, this was far from the end of the story.


5 comments:

UL-Tomten said...

Where does one find the Tibetan fonts mentioned in your CSS? (Are the first two, Himalaya and Ximalaya, Microsoft fonts still in beta, as a quick googling suggests?)

Andrew West said...

Himalaya ships with Vista (Ximalaya is the beta version).

You can find links to most of the others at my Tibetan Test Page.

Andrew West said...

... or check out the "Some of My Favourite Fonts" section at the bottom of the main blog page.

Rich Felker said...

The Chinese claim that Tibetan text is "too complex" to process in the standard Unicode form is rather absurd, considering that I just got through writing a text rendering implementation that supports Tibetan (along with most other scripts) in under 20kb of machine code, static linked (and only about 5k is actually text-related code). The problem is just that existing software and libraries are bad, not that there's any inherent difficulty to Tibetan. Unlike Indic scripts and bidi which are actually difficult, Tibetan is barely more complex to handle than Thai or European scripts with non-precomposed combining marks.

Anyone interested in seeing what I mean can google for "uuterm". The code should be in svn soon.

Chris Fynn said...

Rich wrote...

Unlike Indic scripts and bidi which are actually difficult, Tibetan is barely more complex to handle than Thai or European scripts with non-precomposed combining marks.


Tibetan is of course an Indic script - and inherently just as complex as other Indic scripts. The reason the rendering is easier to handle is that a different encoding model - with a separate set of explicitly combining consonants [U+0F90 to U+0FBC] - was adopted for Tibetan in the UCS.

This encoding model is much more straightforward to work with than the imo overly complex model used for encoding many other Indic scripts.

- Chris