Monday, 24 May 2010

Prototyping Tangut IMEs, or Why Windows 7 Sucks

Why Windows 7 No Longer Sucks [2011-03-01]

On 22nd February 2011 Windows 7 Service Pack 1 (SP1) was released, and I am very pleased to say that all the rendering issues discussed below are now solved.


Typing PUA Tangut under Windows 7 plus SP1


Internet Explorer 8 under Windows 7 plus SP1



Original Post [2010-05-24]

In many ways Windows 7 is a great improvement on Vista, but this is the sad story of why my children have the use of my shiny new Windows 7 laptop, and I am sticking to the old, not very user-friendly and not very reliable Vista laptop. I hope that one day I will be able to write a blog post extolling the virtues of Windows 7, but given the contents of the forthcoming Service Pack 1 it seems very unlikely to happen any time soon, and at the current rate of (lack of) progress, I am afraid that Microsoft will lose more and more of the few remaining loyal customers like myself who find it impossible to do cutting-edge Unicode stuff with an operating system that values gimmicks over functionality, and for every step forward takes two steps backwards.


Prototyping Tangut IMEs

In anticipation of the eventual encoding of the Tangut script in Unicode, I have been prototyping a couple of Input Methods for Tangut that use the table driven text service that is available in Windows Vista and Windows 7 (see Michael Kaplan's twelve-part series Behold the Table Driven Text Service for a tutorial).

I have created two mapping tables for Tangut :

Installing on my Windows Vista laptop I get the following results (the icons, StrokeCode.ico and Alphacode.ico, are a little degraded in the jpgs) :


Tangut Stroke Code IME under Windows Vista


Tangut Alphacode IME under Windows Vista


Hmm, the IMEs both work just fine, but the Tangut characters in the candidate list show up as little squares, which means that if two or more characters share the same alphabetic code sequence you have to guess which character to choose, and even if it is a unique alphabetic code sequence it would be nice to see what the character looks like. Unfortunately, for Vista there is no way to specify what font to use for the candidate window, but as explained by Michael Kaplan in Can't I pick the candidate list font if I don't speak fluent square box?, Windows 7 introduces new FontFaceName and FontSize parameters for the TableTextService file format. So let's install these two IMEs (with Unicode Tangut specified at 16 points) and the Unicode Tangut font on my Windows 7 laptop and see what happens.


Why Windows 7 Sucks

Tangut Stroke Code IME under Windows 7 (using BabelPad)


Tangut Alphacode IME under Windows 7 (using BabelPad)


D'oh, that's one step forward and two steps backwards. The candidate window is now using the Unicode Tangut font as specified, but both in the candidate window and in BabelPad the Tangut characters (currently reserved code points) are displayed as little square boxes, in fact two square boxes per character, which suggests that surrogate code points are being rendered separately rather than combined as a single character. But perhaps this is a problem with BabelPad; let's see what it looks like with Notepad :


Tangut Stroke Code IME under Windows 7 (using Notepad)


Hmm, that's no better. Just to show that there is nothing intrinsically wrong with the Unicode Tangut font or the table driven text service, here's a screenshot from Windows 7 of a Tangut Components IME that maps the Tangut components listed here to PUA codepoints (TableTextServiceTangutRadicalsPUA.txt and TangutRadicals.ico) :


Tangut Component PUA IME under Windows 7
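
As an aside on the "two square boxes per character" symptom seen earlier: Windows stores text as UTF-16, where every code point beyond the Basic Multilingual Plane is split into a high and a low surrogate. The following is a minimal sketch of that arithmetic. The code point U+17000 is only an assumed example of a supplementary-plane code point, not necessarily one used in the mapping tables, and the function names are hypothetical.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP use surrogate pairs")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Assumed example code point (not taken from the IME tables).
high, low = to_surrogate_pair(0x17000)
print(f"U+17000 -> {high:04X} {low:04X}")
```

A renderer that draws one box for each of these two 16-bit units, instead of recombining them first, would produce exactly two boxes per character.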


Now, at this point there will be some people who will be saying, "of course your so-called Tangut text doesn't display properly, because you are using unassigned Unicode codepoints". Ignoring the fact that it does display OK in Windows Vista, Windows XP and even Windows 2000 if the Unicode Tangut font is installed (as Tangut is not a complex script from a rendering perspective, it does not need support from Uniscribe to render correctly), let's take a look and see how Windows 7 copes with a recently-encoded script like Egyptian Hieroglyphs which does have officially assigned Unicode characters (NB Egyptian Hieroglyphs render fine under Windows Vista with a font like Aegyptus) :


Egyptian Hieroglyphs rendered in Notepad under Windows 7


Egyptian Hieroglyphs rendered in BabelPad (in Complex Rendering mode) under Windows 7


Well, that's not any good. How does BabelMap cope?


Egyptian Hieroglyphs rendered in BabelMap under Windows 7


The Egyptian hieroglyphs render OK in the character grid and in the popup window, because BabelMap does not use Uniscribe, but renders characters directly using their glyph IDs, as read from the font's cmap table. But the edit buffer is a standard Windows edit control, which uses Uniscribe, and the Egyptian characters render as square boxes. Let's try again with BabelPad, this time with "Simple Rendering" mode selected, which uses the same method as BabelMap to render characters :


Egyptian Hieroglyphs rendered in BabelPad (in Simple Rendering mode) under Windows 7
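
For readers wondering what "rendering by glyph ID" involves, here is a rough sketch of a code point to glyph ID lookup over sequential groups, the layout used by a Format 12 cmap subtable. It is not BabelMap's or BabelPad's actual code; the group data and glyph IDs are invented for illustration, and a real implementation would parse the binary table from the font file.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Group(NamedTuple):
    start_char: int   # first code point covered by the group
    end_char: int     # last code point covered by the group
    start_glyph: int  # glyph ID assigned to start_char


# Hypothetical groups, sorted by start_char (glyph IDs are made up).
GROUPS: List[Group] = [
    Group(0x0041, 0x005A, 36),       # A-Z
    Group(0x13000, 0x1342E, 1000),   # Egyptian Hieroglyphs
]


def glyph_id(groups: List[Group], code_point: int) -> int:
    """Return the glyph ID for code_point, or 0 (.notdef) if unmapped."""
    starts = [g.start_char for g in groups]
    i = bisect_right(starts, code_point) - 1
    if i < 0:
        return 0
    group = groups[i]
    if code_point > group.end_char:
        return 0
    return group.start_glyph + (code_point - group.start_char)


print(glyph_id(GROUPS, 0x13000))  # glyph ID for the first hieroglyph
```

The point of the sketch is that nothing in this lookup depends on the renderer knowing which script a code point belongs to; font coverage alone decides the result.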


That's better! Let's do the same thing for my unofficial Tangut text :


Tangut text rendered in BabelPad (in Simple Rendering mode) under Windows 7


Tangut text (one character per line) rendered in BabelPad (in Simple Rendering mode) under Windows 7


Hmm, that's weird, it only renders the first character in each line correctly. And exactly the same problem is seen in Windows Vista (screenshot omitted), so it is almost certainly a bug in BabelPad (fixed in version 5.2.0.8 released 2010-06-06). But what we have learnt is that if you use Uniscribe under Windows 7 (whether in Notepad or in an edit control or in BabelPad), then you won't see any Egyptian Hieroglyphs. The bottom line is that Windows 7 proudly supports Unicode 5.1, but is not forward compatible with later versions of Unicode, including Unicode 5.2 which was released in the same month that Windows 7 was released to the general public. Thus, for example, Phaistos Disc symbols (encoded in Unicode 5.1) render OK under Windows 7 (as evidenced by the fact that they display in the edit buffer of BabelMap) :


Phaistos Disc symbols (encoded in Unicode 5.1) rendered in BabelMap under Windows 7


All previous versions of Uniscribe have passively allowed text encoded in Unicode characters that it does not recognise to render OK as long as there is font support, but the version of Uniscribe that ships with Windows 7 appears to actively disallow Unicode text that it does not recognise ... or at least, characters in Unicode ranges that it does not recognise (post-Unicode 5.1 characters in existing Unicode blocks will be rendered OK under Windows 7 if there is font support). There is, however, one exception to this: CJK unified ideograph blocks added to the Supplementary Ideographic Plane (SIP) post Unicode 5.1 will render OK if there is font support (presumably Uniscribe treats the SIP as a single range) :


CJK Unified Ideographs Extension C (encoded in Unicode 5.2) rendered with the BabelStone Han font in Notepad under Windows 7
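
The behaviour described above can be modelled with a toy sketch. To be clear, this is only an illustration of the hypothesis (a renderer that gates on recognised ranges before consulting the font), not how Uniscribe is actually implemented; the range list is deliberately partial and the function name is hypothetical.

```python
from typing import List, Set, Tuple

# Assumed, partial list of ranges a Unicode 5.1-era renderer "recognises".
KNOWN_RANGES: List[Tuple[int, int]] = [
    (0xA840, 0xA87F),    # Phags-pa block
    (0x101D0, 0x101FF),  # Phaistos Disc block (Unicode 5.1)
    (0x20000, 0x2FFFF),  # whole SIP treated as a single range
]


def renders(code_point: int, font_coverage: Set[int]) -> bool:
    """True if the toy renderer would show a glyph rather than a box."""
    in_known_range = any(lo <= code_point <= hi for lo, hi in KNOWN_RANGES)
    return in_known_range and code_point in font_coverage


# Hypothetical font that covers one character from each test range.
font = {0x101D0, 0x13000, 0x2A700}
for cp in (0x101D0, 0x13000, 0x2A700):
    print(f"U+{cp:04X}: {'glyph' if renders(cp, font) else 'box'}")
```

Under this model U+13000 (Egyptian Hieroglyphs, Unicode 5.2) shows a box despite font support, while U+2A700 (Extension C, also Unicode 5.2) renders because it falls inside the SIP range.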


I wonder if Internet Explorer 8 does any better on Windows 7 than Notepad?


Internet Explorer 8 under Windows 7


Nope, just like in Notepad, Unicode 5.1 scripts and CJK Unified Ideographs Extension C render OK, but Egyptian Hieroglyphs and currently reserved character ranges come out as little square boxes. So there you have it, if you want to write in any of the fifteen new scripts added in Unicode 5.2 (Avestan, Bamum, Egyptian Hieroglyphs, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Javanese, Kaithi, Lisu, Meetei Mayek, Old South Arabian, Old Turkic, Samaritan, Tai Tham, and Tai Viet) or any of the various new scripts and symbol blocks that will be added in the forthcoming Unicode 6.0 (Mandaic, Batak and Brahmi scripts, and Playing Cards, Miscellaneous Pictographic Symbols, Emoticons, Transport and Map symbols, and Alchemical Symbols), then my recommendation is to avoid Windows 7.


Phags-pa Rendering on Windows 7

Whilst we are on the subject of Windows 7, let's have a quick look at the rendering of the Phags-pa script in Windows Vista and Windows 7.

Phags-pa is a complex script in rendering terms, and Windows Vista does not actively support the script. Nevertheless, under Windows Vista, Unicode Phags-pa text renders correctly in all respects (joining, contextual shaping and variation sequences) in BabelPad and Notepad using my BabelStone Phags-pa Book font :


Phags-pa text rendered in BabelPad with the BabelStone Phags-pa Book font under Windows Vista


However, under Windows 7, the font is next to useless, as no joining or shaping behaviour is applied :


Phags-pa text rendered in BabelPad with the BabelStone Phags-pa Book font under Windows 7


On the other hand, the same Phags-pa text does render correctly using the Microsoft PhagsPa font that ships with Windows 7 :


Phags-pa text rendered in BabelPad with the Microsoft PhagsPa font under Windows 7


Now, the Microsoft PhagsPa font is in many respects (and not coincidentally) very similar to my BabelStone Phags-pa Book font, but the one crucial difference between the two fonts is the set of OpenType features that are used to control the joining and shaping behaviour of characters. The BabelStone Phags-pa Book font uses the Contextual Ligatures <clig> and Glyph Composition Decomposition <ccmp> features to enable it to do all the joining and shaping stuff, including variation sequences, internally without any need for assistance from Uniscribe. On the other hand, the Microsoft PhagsPa font uses the Initial Forms <init>, Medial Forms <medi> and Terminal Forms <fina> features to do the joining behaviour, and these features rely on Uniscribe. For this reason, the Microsoft PhagsPa font won't work correctly under Windows Vista (no Uniscribe support for Phags-pa), and conversely, the BabelStone Phags-pa Book font won't work correctly under Windows 7 (too much Uniscribe support for Phags-pa). I can't really complain about this, as Microsoft support for Phags-pa would almost inevitably mean making Uniscribe instrumental in the rendering process and using a different set of OpenType features than I used (of necessity) in my font. What I will do, when and if I ever get some free time from Tangut, is release new versions of my Phags-pa fonts that use the same OpenType features as the Microsoft PhagsPa font does.
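
To illustrate why the <init>, <medi> and <fina> features depend on the shaping engine, here is a minimal sketch of the decision the engine has to make before the font's substitutions can be applied: which positional form each letter in a joined run takes. It assumes, as a simplification, that every letter in the run joins on both sides, and the function name is hypothetical; real shaping rules are more involved.

```python
from typing import List


def assign_forms(word: str) -> List[str]:
    """Assign a positional feature tag to each letter of a joined run.

    Simplifying assumption: every letter joins to both neighbours.
    """
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return ["isol"]  # a lone letter takes its isolated form
    return ["init"] + ["medi"] * (n - 2) + ["fina"]


# Two Phags-pa letters forming one joined run.
word = "\uA86A\uA85E"
for ch, form in zip(word, assign_forms(word)):
    print(f"U+{ord(ch):04X}: {form}")
```

A font built on <clig> and <ccmp> encodes this positional logic in its own contextual substitutions, which is why it needs no help from the engine, whereas a font built on <init>, <medi> and <fina> waits for the engine to tell it which feature applies to which glyph.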

But there is one added complication. Starting with Windows 7, Microsoft now use the newly defined Format 14 cmap subtable (Unicode Variation Sequences) to process variation sequences, thus bypassing OpenType features entirely. In Windows Vista and earlier, variation sequences would work without any special support from Uniscribe by defining glyph substitutions in the font under the Glyph Composition Decomposition <ccmp> OpenType feature. Thus, under Windows Vista it is possible to correctly render Mathematical Variation Sequences by using James Kass' Code2000 font, or Phags-pa variation sequences using my Phags-pa fonts. But under Windows 7, variation sequences no longer render correctly using these fonts. Instead, under Windows 7, Microsoft's Cambria Math font supports Mathematical Variation Sequences, and Microsoft PhagsPa supports Phags-pa variation sequences, by including variation sequence mappings in an additional Format 14 cmap subtable which is accessed by Uniscribe. In my opinion, the use of a cmap subtable to apply variation sequences rather than simple OpenType features is a very bad idea, as it overcomplicates what is essentially a very simple task, and makes variation sequence support not backwards compatible with versions of Windows prior to Windows 7. Moreover (and from my perspective, more importantly), there is not yet widespread support for the new Format 14 cmap subtable, and the font editor that I use has no short-term plans to add support for this subtable, which makes it difficult for amateur font developers like myself to create fonts that use the Windows 7 model for variation sequences.
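
Conceptually, a Format 14 lookup maps a (base character, variation selector) pair to a glyph. The sketch below models this with plain dictionaries. It is a simplified illustration only: the real subtable is a binary structure with range records, and all glyph IDs and table contents here are invented.

```python
from typing import Dict, Optional, Set

# Hypothetical ordinary cmap: code point -> glyph ID.
CMAP: Dict[int, int] = {0xA856: 10, 0xA85E: 11, 0xA86A: 12}

# Non-default sequences: selector -> {base -> dedicated variant glyph}.
NON_DEFAULT_UVS: Dict[int, Dict[int, int]] = {0xFE00: {0xA85E: 50}}

# Default sequences: selector -> bases whose ordinary cmap glyph is used.
DEFAULT_UVS: Dict[int, Set[int]] = {0xFE00: {0xA856}}


def glyph_for_sequence(base: int, selector: int) -> Optional[int]:
    """Return the glyph for <base, selector>, or None if unsupported."""
    variants = NON_DEFAULT_UVS.get(selector, {})
    if base in variants:
        return variants[base]          # font supplies a variant glyph
    if base in DEFAULT_UVS.get(selector, set()):
        return CMAP.get(base, 0)       # fall back to the ordinary glyph
    return None                        # sequence not listed in the font


print(glyph_for_sequence(0xA85E, 0xFE00))
```

The lookup itself is simple enough; the practical obstacle is that both the font tool and the rendering engine have to support the subtable before a font can rely on it.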

Finally, the screenshot above shows a variation sequence <U+A86A U+A85E U+FE00> (ꡪꡞ︀) rendered correctly with the Microsoft PhagsPa font on BabelPad (NB this only works on BabelPad version 5.2.0.0 or later, as applications need to set an undocumented flag in Uniscribe [SCRIPT_CONTROL.fMergeNeutralItems = TRUE] for the Format 14 cmap substitutions to work), but take a look at what happens when we display the same text on Internet Explorer 8 under Windows 7 :

Phags-pa text rendered in Internet Explorer 8 with the Microsoft PhagsPa font under Windows 7


... the variation sequence (highlighted) is rendered incorrectly as two disconnected glyphs. Looks like Internet Explorer 8 does not yet support the new Format 14 cmap subtable for variation sequences; yet one more example of Microsoft's disconnected thinking across different development teams, and the appalling lack of testing that seems to be par for the course with Microsoft.