Sunday, 17 September 2006

Precomposed Tibetan Part 2 : Stuck in the PUA

As discussed in Part 1, in 2002-2003 China tried and failed to get nearly a thousand precomposed Tibetan characters encoded in ISO/IEC 10646 (which is the international standard corresponding to Unicode).

Following on from this humiliating defeat, in April of 2004 Joe Zhang (Zhang Zhoucai 张轴材), formerly a contributing editor of ISO/IEC 10646, presented to a conference in China a paper that outlined a new Chinese encoding standard for Tibetan, codenamed the "Everest Scheme". This scheme utilizes the Private Use Areas (PUA) of the UCS to encode several thousand precomposed Tibetan characters, and was characterised as a "national standard within the framework of an international standard". Under this scheme Tibetan characters would be distributed as follows :

  • 0F00..0FFF : Basic Tibetan (the existing Tibetan block)
  • F300..F8FF : Tibetan Extension-A 藏文编码字符集(扩充集A)
  • 000F1000..000F3000 : Tibetan Extension-B 藏文编码字符集(扩充集B)

The paper also stated that there should be two implementation levels for Tibetan :

  1. Level 1 : Only works with non-combining and precomposed Tibetan characters
  2. Level 2 : Works with combining and precomposed characters

Level 1 would not be required to process any of the following characters :

  • 0F18 TIBETAN ASTROLOGICAL SIGN -KHYUD PA
  • 0F19 TIBETAN ASTROLOGICAL SIGN SDONG TSHUGS
  • 0F35 TIBETAN MARK NGAS BZUNG NYI ZLA
  • 0F37 TIBETAN MARK NGAS BZUNG SGOR RTAGS
  • 0F39 TIBETAN MARK TSA -PHRU
  • 0F3E TIBETAN SIGN YAR TSHES
  • 0F3F TIBETAN SIGN MAR TSHES
  • 0F71 TIBETAN VOWEL SIGN AA
  • 0F72 TIBETAN VOWEL SIGN I
  • 0F73 TIBETAN VOWEL SIGN II
  • 0F74 TIBETAN VOWEL SIGN U
  • 0F75 TIBETAN VOWEL SIGN UU
  • 0F76 TIBETAN VOWEL SIGN VOCALIC R
  • 0F77 TIBETAN VOWEL SIGN VOCALIC RR
  • 0F78 TIBETAN VOWEL SIGN VOCALIC L
  • 0F79 TIBETAN VOWEL SIGN VOCALIC LL
  • 0F7A TIBETAN VOWEL SIGN E
  • 0F7B TIBETAN VOWEL SIGN EE
  • 0F7C TIBETAN VOWEL SIGN O
  • 0F7D TIBETAN VOWEL SIGN OO
  • 0F7E TIBETAN SIGN RJES SU NGA RO
  • 0F7F TIBETAN SIGN RNAM BCAD
  • 0F80 TIBETAN VOWEL SIGN REVERSED I
  • 0F81 TIBETAN VOWEL SIGN REVERSED II
  • 0F82 TIBETAN SIGN NYI ZLA NAA DA
  • 0F83 TIBETAN SIGN SNA LDAN
  • 0F84 TIBETAN MARK HALANTA
  • 0F86 TIBETAN MARK LCI RTAGS
  • 0F87 TIBETAN MARK YANG RTAGS
  • 0F90 TIBETAN SUBJOINED LETTER KA
  • 0F91 TIBETAN SUBJOINED LETTER KHA
  • 0F92 TIBETAN SUBJOINED LETTER GA
  • 0F93 TIBETAN SUBJOINED LETTER GHA
  • 0F94 TIBETAN SUBJOINED LETTER NGA
  • 0F95 TIBETAN SUBJOINED LETTER CA
  • 0F96 TIBETAN SUBJOINED LETTER CHA
  • 0F97 TIBETAN SUBJOINED LETTER JA
  • 0F99 TIBETAN SUBJOINED LETTER NYA
  • 0F9A TIBETAN SUBJOINED LETTER TTA
  • 0F9B TIBETAN SUBJOINED LETTER TTHA
  • 0F9C TIBETAN SUBJOINED LETTER DDA
  • 0F9D TIBETAN SUBJOINED LETTER DDHA
  • 0F9E TIBETAN SUBJOINED LETTER NNA
  • 0F9F TIBETAN SUBJOINED LETTER TA
  • 0FA0 TIBETAN SUBJOINED LETTER THA
  • 0FA1 TIBETAN SUBJOINED LETTER DA
  • 0FA2 TIBETAN SUBJOINED LETTER DHA
  • 0FA3 TIBETAN SUBJOINED LETTER NA
  • 0FA4 TIBETAN SUBJOINED LETTER PA
  • 0FA5 TIBETAN SUBJOINED LETTER PHA
  • 0FA6 TIBETAN SUBJOINED LETTER BA
  • 0FA7 TIBETAN SUBJOINED LETTER BHA
  • 0FA8 TIBETAN SUBJOINED LETTER MA
  • 0FA9 TIBETAN SUBJOINED LETTER TSA
  • 0FAA TIBETAN SUBJOINED LETTER TSHA
  • 0FAB TIBETAN SUBJOINED LETTER DZA
  • 0FAC TIBETAN SUBJOINED LETTER DZHA
  • 0FAD TIBETAN SUBJOINED LETTER WA
  • 0FAE TIBETAN SUBJOINED LETTER ZHA
  • 0FAF TIBETAN SUBJOINED LETTER ZA
  • 0FB0 TIBETAN SUBJOINED LETTER -A
  • 0FB1 TIBETAN SUBJOINED LETTER YA
  • 0FB2 TIBETAN SUBJOINED LETTER RA
  • 0FB3 TIBETAN SUBJOINED LETTER LA
  • 0FB4 TIBETAN SUBJOINED LETTER SHA
  • 0FB5 TIBETAN SUBJOINED LETTER SSA
  • 0FB6 TIBETAN SUBJOINED LETTER SA
  • 0FB7 TIBETAN SUBJOINED LETTER HA
  • 0FB8 TIBETAN SUBJOINED LETTER A
  • 0FB9 TIBETAN SUBJOINED LETTER KSSA
  • 0FBA TIBETAN SUBJOINED LETTER FIXED-FORM WA
  • 0FBB TIBETAN SUBJOINED LETTER FIXED-FORM YA
  • 0FBC TIBETAN SUBJOINED LETTER FIXED-FORM RA
  • 0FC6 TIBETAN SYMBOL PADMA GDAN
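
The list above appears to be exactly the combining marks of the Tibetan block as it stood at the time. As a rough sketch (not something taken from the paper), the following Python checks that against the Unicode Character Database and tests whether a string stays within Level 1. The names `LEVEL1_EXCLUDED_RANGES`, `is_combining` and `fits_level1` are made up for illustration.

```python
import unicodedata

# The code points listed above as outside Level 1, written as inclusive ranges.
LEVEL1_EXCLUDED_RANGES = [
    (0x0F18, 0x0F19), (0x0F35, 0x0F35), (0x0F37, 0x0F37), (0x0F39, 0x0F39),
    (0x0F3E, 0x0F3F), (0x0F71, 0x0F84), (0x0F86, 0x0F87),
    (0x0F90, 0x0F97), (0x0F99, 0x0FBC), (0x0FC6, 0x0FC6),
]

LEVEL1_EXCLUDED = {
    cp for first, last in LEVEL1_EXCLUDED_RANGES for cp in range(first, last + 1)
}


def is_combining(cp):
    """True if the code point is a nonspacing (Mn) or spacing (Mc) mark."""
    return unicodedata.category(chr(cp)) in ("Mn", "Mc")


def fits_level1(text):
    """True if the text uses none of the characters excluded from Level 1."""
    return not any(ord(ch) in LEVEL1_EXCLUDED for ch in text)


if __name__ == "__main__":
    print(len(LEVEL1_EXCLUDED))
    print(all(is_combining(cp) for cp in LEVEL1_EXCLUDED))
    print(fits_level1("\u0f66"), fits_level1("\u0f66\u0f94\u0f7c"))
```

A bare TIBETAN LETTER SA passes the Level 1 test, whereas the combining sequence for sngo does not, which is the whole point of the precomposed extensions.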

Level 2 would work with both standard Unicode Tibetan and the precomposed Tibetan extensions in the PUA blocks.

Tibetan Extension-A (often referred to as "Set A"), covering the most common stacks, was published at the end of 2004, and comprises 1,536 precomposed characters in the PUA of the BMP at <F300..F8FF>. For the full repertoire see my mapping table between the Set A precomposed characters and standard Unicode Tibetan character sequences.
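
As a quick sanity check on those figures, this small Python sketch confirms that the range <F300..F8FF> holds exactly 1,536 code points and sits entirely within the BMP Private Use Area (E000..F8FF). The constant and function names are only illustrative.

```python
# Set A range as given above, and the BMP Private Use Area defined by Unicode.
SET_A_FIRST, SET_A_LAST = 0xF300, 0xF8FF
BMP_PUA_FIRST, BMP_PUA_LAST = 0xE000, 0xF8FF

# Number of code points in an inclusive range.
set_a_size = SET_A_LAST - SET_A_FIRST + 1

# Set A must lie wholly inside the BMP PUA.
inside_bmp_pua = BMP_PUA_FIRST <= SET_A_FIRST and SET_A_LAST <= BMP_PUA_LAST


def is_set_a(ch):
    """True if a single character falls in the Set A range."""
    return SET_A_FIRST <= ord(ch) <= SET_A_LAST


if __name__ == "__main__":
    print(set_a_size, inside_bmp_pua)
```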

Tibetan Extension-B (often referred to as "Set B"), covering rarely occurring stacks, is slated for the Supplementary Private Use Area-A in Plane 15. I'm not sure how many characters it is supposed to cover, but 5,664 is a figure I have heard mentioned. It has not yet been published (as far as I know) and perhaps it never will be, as the success of OpenType Tibetan fonts is rapidly making the precomposed model redundant.

One might have expected that Tibetan Extension-A would be based on the set of BrdaRten characters proposed and rejected the previous year, but that does not seem to have been the case, as :

  • Tibetan Extension-A and Tibetan Extension-B cover many thousands more characters than the proposed BrdaRten characters (Tibetan Extension-A alone has over 50% more characters);
  • There is no obvious correlation between Tibetan Extension-A and the proposed BrdaRten characters in terms of code point sequence (see my mapping table between the proposed BrdaRten characters and Tibetan Extension-A);
  • 11 of the proposed BrdaRten characters aren't even included in Tibetan Extension-A (including the seven PH + H characters added in N2621 that I suspect are mistakes for the already included H + PH characters).

These points make me wonder just how mature the BrdaRten proposal was and whether the 962 proposed characters were perhaps intended as a foot in the door for thousands more. The fact that the proposed BrdaRten characters were replaced by a quite different set of precomposed characters also makes a mockery of the Chinese claim that the BrdaRten characters were required to be encoded for backwards compatibility with legacy data.

One interesting issue with Tibetan Extension-A is that it does not include a precomposed character for the character sequence ཨོཾ <0F68 0F7C 0F7E> (the "om" of the mantra Om Mani Padme Hūm ཨོཾ་མ་ཎི་པདྨེ་ཧཱུཾ།). This must be because the Tibetan block already includes the character TIBETAN SYLLABLE OM at U+0F00, and the Chinese took this to be equivalent to the character sequence <0F68 0F7C 0F7E>. However, this character has no Unicode decomposition, and under Unicode it is not equivalent to <0F68 0F7C 0F7E>, so it would have been better to encode a separate precomposed character corresponding to <0F68 0F7C 0F7E> in the PUA rather than use U+0F00 as if it were a precomposed character.
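
The lack of equivalence can be checked with Python's standard `unicodedata` module. This is only a sketch: the helper name `equivalent_under_normalization` is made up, and U+0F73 is included purely as a contrasting example of a Tibetan character that does have a canonical decomposition.

```python
import unicodedata

OM_SIGN = "\u0f00"                  # TIBETAN SYLLABLE OM
OM_SEQUENCE = "\u0f68\u0f7c\u0f7e"  # A + vowel sign O + RJES SU NGA RO


def equivalent_under_normalization(a, b):
    """True if the two strings become identical under any normalization form."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in forms
    )


if __name__ == "__main__":
    # An empty string means the character has no decomposition mapping.
    print(repr(unicodedata.decomposition(OM_SIGN)))
    print(equivalent_under_normalization(OM_SIGN, OM_SEQUENCE))
    # Contrast: U+0F73 decomposes canonically to <0F71 0F72>.
    print(unicodedata.decomposition("\u0f73"))
    print(equivalent_under_normalization("\u0f73", "\u0f71\u0f72"))
```

So any software that treats U+0F00 and <0F68 0F7C 0F7E> as the same text is relying on a private convention, not on anything Unicode normalization provides.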



Implementation of Precomposed Tibetan

If you do want to or need to work with Tibetan text encoded according to the PRC's standard for extended Tibetan, then it is possible to do so now using freely available software. My BabelPad text editor supports the conversion (both ways) between standard Unicode character sequences and Extended Tibetan-A, and Chris Fynn's Jomolhari font supports both standard combining Tibetan and precomposed Tibetan. Let's give it a go.

1. We start up BabelPad, select the Jomolhari font, and open a Tibetan document encoded as standard combining Tibetan (Universal Declaration of Human Rights). The document renders perfectly (although it may not do so unless you are running Vista) :



2. Then we select "Unicode to Extended Tibetan-A" from the "Tibetan" submenu of the "Convert" menu of BabelPad. Hmm, no discernable change, document renders identically ... has it actually done anything ? Well yes it has. Take a look at the Status Bar; the character at the caret position was U+0F66 TIBETAN LETTER SA, but now it is U+F3B5 PRIVATE USE CHARACTER-F3B5, which according to the Set A Mapping Table corresponds to the decomposed sequence <0F66 0F94 0F7C> sngo (the first syllable of sngon brjod སྔོན་བརྗོད། "preamble").



3. Now hit the u" button on the BabelPad toolbar. This causes the text to be rendered in "Glyph Mode" (i.e. with all characters rendered as individual spacing glyphs). Note that the only difference is a slight change in the inter-glyph spacing and loss of smart line breaking. This shows that each stack is indeed a single character.



4. Finally, select "Extended Tibetan-A to Unicode" from the "Tibetan" submenu of the "Convert" menu of BabelPad, and it suddenly looks like we've accidentally switched to "Arial Unicode MS". Of course we haven't; we're still using Jomolhari, but now we're rendering each character as an individual spacing glyph so that the underlying difference between combining Tibetan and precomposed Tibetan is clear.
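
The two conversions used in steps 2 and 4 can be sketched in Python as a table lookup. This is not BabelPad's actual code: the table holds only the one pairing mentioned in step 2 (U+F3B5 for <0F66 0F94 0F7C>), whereas the real Set A mapping table has 1,536 entries, and the function names and the longest-match strategy are assumptions made for illustration.

```python
# One-entry excerpt of a Set A mapping table: PUA character -> Unicode sequence.
SET_A_TO_UNICODE = {
    "\uf3b5": "\u0f66\u0f94\u0f7c",  # sngo
}
UNICODE_TO_SET_A = {seq: pua for pua, seq in SET_A_TO_UNICODE.items()}
MAX_SEQ_LEN = max(len(seq) for seq in UNICODE_TO_SET_A)


def unicode_to_set_a(text):
    """Replace known combining sequences with PUA characters, longest match first."""
    out = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_SEQ_LEN, len(text) - i), 0, -1):
            pua = UNICODE_TO_SET_A.get(text[i:i + length])
            if pua is not None:
                out.append(pua)
                i += length
                break
        else:
            # No table entry starts here, so copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def set_a_to_unicode(text):
    """Expand each known PUA character back to its Unicode sequence."""
    return "".join(SET_A_TO_UNICODE.get(ch, ch) for ch in text)


if __name__ == "__main__":
    sngon = "\u0f66\u0f94\u0f7c\u0f53"  # sngon, as in sngon brjod
    converted = unicode_to_set_a(sngon)
    print([f"U+{ord(ch):04X}" for ch in converted])
    print(set_a_to_unicode(converted) == sngon)
```

Characters with no table entry pass through untouched in both directions, so the same functions also cope with text that mixes the two models.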



So there you are, standard combining Tibetan and precomposed Tibetan both work equally well (at least on Vista; I'm forced to admit that precomposed Tibetan will work fine on everything from Windows 95 onwards, which is not quite true for combining Tibetan). People in the PRC can use the precomposed model and everyone else can use the combining model. Everyone should be happy now, right ? Well, we'll just have to wait and see.

Meanwhile, here are two more things to consider :

1. How on earth are people supposed to enter Tibetan text consisting of thousands of precomposed characters ? You can't use a simple keyboard layout (as you can for Unicode Tibetan); a CJK style phonetic or transliteration IME (e.g. based on EWTS) would be useless for ordinary (or even most educated) Tibetans; and a "character picker" solution is totally impractical.

2. What will happen if China mandates support for its Extended Tibetan scheme as a requirement for GB18030 certification ? As I understand it, there is no such requirement at present and I have been told that there is no intention to make support for Extended Tibetan a GB18030 requirement, but things change.


16 comments:

28481k said...

It is possible to do what I called the Korean solution – using alphabetic input keyboard settings and forming pre-composed characters on the fly. Then we could, by the keyboard setting, choose between pre-composed characters or individually coded characters. Of course, this would require a lot of computing power, but that has been resolved by the immensely powerful machines we have now. Korean could have been individually encoded instead of pre-composed if the government and the industry had had the guts in the 1980s, so instead inputting Korean requires a look-up mechanism for pre-composed characters. I fear that if China really pushes this, then this potential technological drawback has to be addressed, and a really good IME has to be invented.

Andrew West said...

Yes, that would be a possibility. The thing is that the Chinese government has pushed for a precomposed encoding model because it wants a low tech solution to Tibetan computer processing. The downside of "simplifying" the encoding is that it makes other aspects of Tibetan computer processing, such as input methods, more complicated.

Chris Fynn said...

Although I too consider the PRC BrdaRten encoding for Tibetan a retrograde step - I finally decided to support at least part-A of this Chinese national "standard" in my Jomolhari font primarily because some Tibetans in China may not have access to systems supporting Unicode and OpenType fonts. For instance, the version of Red Flag Linux localised for Tibetan apparently uses the BrdaRten encoding. Secondly people outside of China may need a font to display Tibetan email or web-pages created by Tibetan friends in China using this encoding.

Finally China have apparently told some people that support for their encoding will in future be a requirement for software sold or distributed in China. Does this include fonts?


Although this BrdaRten encoding contains thousands of combinations it is nowhere near exhaustive. I've already encountered hundreds of additional combinations of characters in traditional Tibetan texts unsupported by the BrdaRten encoding but which can easily be supported with plain Unicode/iso10646 character encoding.

Having a smart IME to type pre-composed Tibetan just moves the required intelligence from the font / rendering engine to the input method. I believe it belongs in the font - particularly as different forms of Tibetan script have slightly different shaping rules.

One reason you get different spacing with pre-composed Tibetan and atomic Unicode Tibetan using the same font is that the OpenType glyph positioning instructions used for kerning do not get applied to PUA characters. A number of other features in the font also do not work with

Something which gives me sleepless nights is the possibility of mixed-encoding documents. Suppose someone in China creates a Tibetan document using the BrdaRten encoding and it gets edited by someone else using a Unicode based system....

Consequently, before the release of the first official version of my Jomolhari font - which is still very much in development - I may remove support for the Chinese BrdaRten encoding.

Andrew West said...

Finally China have apparently told some people that support for their encoding will in future be a requirement for software sold or distributed in China. Does this include fonts?

I've been told (second hand) that it won't be, but I agree that it may well become a requirement.

Something which gives me sleepless nights is the possibility of mixed-encoding documents. Suppose someone in China creates a Tibetan document using the BrdaRten encoding and it gets edited by someone else using a Unicode based system....

I'm afraid that "mixed encoding" documents are going to be almost inevitable, as the precomposed model has a fixed number of characters, and the possible number of Tibetan stacks is almost limitless, so there will always be situations where you need to use the Unicode combining character mechanism to deal with obscure stacks. Indeed, Zhang Zhoucai's 2004 document explicitly allows for applications to process both precomposed PUA characters and standard combining Tibetan characters.

Consequently, before the release of the first official version of my Jomolhari font - which is still very much in development - I may remove support for the Chinese BrdaRten encoding.

I hope you will keep the Chinese PUA mappings, as I believe that your font will help people migrate from precomposed Tibetan to standard Unicode Tibetan. However, the one thing that I would encourage you to remove is the mapping of the JHA glyph to U+0F48 (a reserved codepoint).

As I think I've already mentioned in my blog, Jomolhari is my favourite Tibetan font at present (I really love the glyph for U+0F17), and I look forward to its official release. Keep up the good work !

Chris Fynn said...

"The thing is that the Chinese government has pushed for a precomposed encoding model because it wants a low tech solution to Tibetan computer processing."

There is no real excuse for a low-tech solution. Bhutan, with minuscule human and financial resources compared to its northern neighbour China, has managed to create Dzongkha Linux along with a whole host of fully localized applications. In fact because Dzongkha and Tibetan share the same script, essentially the same collation rules, same line breaking rules and so on, all the technically difficult part to do with OpenType rendering and so on has already been done by the Bhutanese and the FOSS community - and it's all open source.

All that China or the Tibetans really need to do is translate the strings in the GNOME desktop, and the applications they wish to use. In MS Windows they can take advantage of Uniscribe.

YH said...

Hello, Andrew:

After reading your article Precomposed Tibetan Part 1 & 2 (http://babelstone.blogspot.com/2006/09/precomposed-tibetan-part-1-brdarten.html; http://babelstone.blogspot.com/2006/09/precomposed-tibetan-part-2-stuck-in.html) and using your powerful BabelPad, I was very happy to gain knowledge and to see experts outside China pay so much attention to Chinese work, but still, I found that almost all you said is based on the Chinese proposals written in English, so I'd like to point out (I hope that helps):

It seems to be a puzzle for you why Chinese government always tried to make precomposed Tibetan characters' schemes though failed and failed. But it's quite clear for us Chinese people, if knowing a few about the background of Tibetan characters' encoding since 1980s. And "On average BrdaRten stacks occupy 23% of Tibetan text and so BrdaRten cannot be ignored", "and a 'character picker' solution is totally impractical" and "as a requirement for GB18030 certification" showed you are totally ignorant about Chinese "country-situation" (hope not derogatorily), though it seems you're accomplished in classical and modern Chinese language, etc.

With knowing Chinese "country-situation", you'll easily see why BrdaRten changed so frequently, why Chinese government again made the Ext-A and Ext-B scheme in 2006 and 2008, how can Tibetans in China type only "p,;" to get 6 syllables in Tibetan (18 precomposed stacks, 23 Unicode) and why they think those "keyboard layout" is not convenience enough, and what's the real meaning of "All of official publications are utilizing Tibetan basic letters plus BrdaRten coding model although the code pages ... are different" - before BrdaRten, they've already "utilize"d it (it's better to say BrdaRten utilized them, instead).

However now, I need to ask you if you can read Chinese, for "I write Chinese and read English, you write English and read Chinese" should be efficiently way to communicate.

This time, what I want to tell you is:
1. Jomolhari is the first font which implement the "GB/T 20542-2006: Information technology - Tibetan coded character set - Extension A" standard;
2. Currently, both Extension i (i=1,2) and BrdaRten are never used inside China, perhaps late they'll be.

P.S. The BlogSpot seems to be blocked and seems your text're only available to Chinese people via Google Cache.

YH

Andrew West said...

However now, I need to ask you if you can read Chinese, for "I write Chinese and read English, you write English and read Chinese" should be efficiently way to communicate.

你想写中文也可以。 (You are welcome to write in Chinese if you like.)

Currently, both Extension i (i=1,2) and BrdaRten are never used inside China, perhaps late they'll be.

It seems unlikely that they will ever be widely used in the future if they are not now, as more and more people are getting used to using Unicode Tibetan with Windows Vista and 7 operating systems.

P.S. The BlogSpot seems to be blocked and seems your text're only available to Chinese people via Google Cache.

I plan to mirror all my blog posts on my website, which I don't think is blocked (yet).

乡下女人 said...

I am relatively new to Tibetan encoding, Andrew. However, I carefully read all the documents mentioned in your blogs. I have a few questions which I hope you can answer:
1. Who first proposed the current UCS Tibetan script? I cannot find the relevant documents, e.g. the documents N306, N808, N826, N835, N837, N845, N851, N986, N1095, etc. What was China's position on the first version of the Tibetan script? Did China support it (or even did China propose it?)? Can you send these early documents to me?
2. If China supported the dynamic stack initially, why on earth did they change their mind many years later? It seems to me totally meaningless and absurd. If Founder cannot render the UCS script, that's Founder's problem. China is not Founder's. Tibetan is used throughout the world. I am trying to understand the motivation behind China's proposals. To me, for China to make any sense, it must have objected to the UCS encoding scheme from the very beginning. However, I know there are lots of people in China who strongly prefer the UCS standard, including me. If you can read Chinese, you can read my recent blog: http://superfection.com/blog/
Someone must be lying.

regards
mandel

Andrew West said...

I have a few questions which I hope you can answer:

I only became involved in character encoding relatively recently, and the encoding of Tibetan occurred ten years before I became involved, and so all I know is what I can glean from the available documents. Therefore I am not a reliable source of information with regard to this issue, and am probably the wrong person to ask. Nevertheless, I will do my best to answer your questions.

Who first proposed the current UCS Tibetan script?

China and UK both made proposals in early 1994:

N 964 Proposal for encoding Tibetan script; China; 1994-01
N 986 Proposal for Tibetan Script in the BMP; Bruce Paterson, Peter Lofting, U.K.; 1994.03.24

The Chinese proposal was discussed at the April 1994 meeting of WG2 in Turkey, with the Chinese proposal presented by Prof. Nyima Trashi of Tibet University (see N1033 8.4.3).

It seems that China initially proposed a set of 705 precomposed characters for standard Tibetan, and combining characters only for use in writing Sanskrit in Tibetan.

China worked with the Unicode Consortium over the following year, by email and in person at Unicode Technical Committee meetings in the US, to agree a revised encoding for Tibetan that used combining characters rather than precomposed characters. At the Helsinki WG2 meeting in June 1995 the joint China-Unicode code chart for Tibetan was presented, and further technical discussion took place (see N1253 6.4.5).

The result of this meeting was to task China and Unicode to work together to produce the final draft of the amendment (see N1254 M28.5).

A group of experts from China, Ireland and the UK (Nyima Trashi, DaWar Tsering, Tsering Choergyal, Michael Everson, Hugh McGregor Ross, and Mao Yonggang) worked together, and subsequently produced a joint proposed disposition of comments on PDAM6 (N1378), which was presented by China at the Copenhagen WG2 meeting in June 1996 (see N1353 6.2). The China-Ireland-UK document was accepted, and Tibetan was included in Unicode 2.0 released in July 1996.

I cannot find the relevant documents, e.g. the documents N306, N808, N826, N835, N837, N845, N851, N986, N1095, etc.

Most of the early documents are not available.

What was China's position on the first version of the Tibetan script? Did China support it (or even did China propose it?)?

See above. It seems clear from the documents available that China worked with other national bodies (Ireland, UK and USA) to produce an encoding that was acceptable to all. China was closely involved in the process at all stages.

can you send these early documents to me?

No, I do not have them.

If China supported the dynamic stack initially, why on earth did they change their mind many years later?

I don't know.

It seems to me totally meaningless and absurd ... for China to make any sense, it must have objected to the UCS encoding scheme from the very beginning.

It did not. The UCS encoding model was developed by experts from China, Ireland, UK and USA, and the final code charts for Tibetan were produced by the China national body.

Someone must be lying.

I think it is not right to say anyone has been lying.

乡下女人 said...

Dear West
Thank you for your clarifying answer regarding the initial Tibetan encoding scheme in Unicode. I summarize your points as follows:

1. China initially proposed both the precomposed and combining approaches.

2. China worked actively with the other bodies to finally decide on the combining approach. China was the main driving force in the standardization of Tibetan in Unicode.

3. You don't know any reason why China proposed to add precomposed Tibetan characters to the BMP many years later. This is by far the most dazzling thing that baffled me. Question: It is possible that the publishing houses in China actually were behind the scenes and they are so powerful in the government and so mean in technology that they lobbied the government for such a mad move?

Since I am going to see the initial person responsible for the Tibetan encoding (Professor Nyima Trashi of the Tibetan University) in a few days, I think I can get his version of the initial history fairly authoritatively.

Thank you for your help in kindly answering all my questions.

Andrew West said...

It is possible that the publishing houses in China actually were behind the scenes and they are so powerful in the government and so mean in technology that they lobbied the government for such a mad move?

Yes, I think so.

Please let me know how your meeting with Prof. Nyima Trashi goes, and what his version of the story is.

jt said...

Dear Mr. West,

I'm coming into this conversation quite late and--I'm afraid--from a complete outsider's view. I am interested in the early history of the Unicode and CJK from a historian's perspective. One of the questions I am eager to find out is how China's language policies towards its ethnic groups fits in the picture. For example, someone who's interested in China's control over its ethnic minority issues might assume an oppressive plot at work here. I suspect that the real story is more complicated. Can you direct me to print or human sources--if not yourself--that can shed light on this question? Much appreciate your time and consideration. jt

Andrew West said...

Hi jt, thanks for your question. I'm afraid that I'm not really the right person to ask this question of. All I can say is that from my experience collaborating with Chinese experts on a number of encoding proposals over the last ten years there is no sinister or oppressive plot, and I believe that the Chinese government is genuinely keen to have minority and historical scripts encoded for the benefit of the user communities.

jt said...

Thanks! I figured as much, and will look more along those lines.

乡下女人 said...

Dear Mr. West,

After two years I realized that I have to answer your last question for me. Prof. Nyima Trashi confirmed your story and I have worked with him since two years ago. I have developed a new Tibetan Input method (with other professors) that's far easier to input Tibetan Unicode than any other method and will release it soon. And as I looked at the Mongolian Unicode encoding, I realized it is a total disaster. (for anyone interested, see my blog in Chinese http://bbs.superfection.com/doku.php?id=mongolian_encoding)

Andrew West said...

Dear 乡下女人, thank you for your reply. I agree that the Mongolian encoding model that was chosen is not ideal, but we have to live with it as it is, and Unicode Mongolian is gradually becoming better supported. I look forward to seeing your Tibetan Unicode input method, please post a link to it here when you release it.