Saturday, 26 November 2005

What's new in Unicode 5.0 ?

Suzanne asked in a comment to How many Unicode characters are there? whether Phoenician is included in the repertoire for Unicode 5.0, which will be coming out next year. The answer is yes. Here are some pages on the Unicode site where you can find out more about what is new for Unicode 5.0 :

You can also see charts for all the characters that are currently under final ballot for inclusion in Amendment 2 of ISO/IEC 10646:2003 in WG2 document N2991. Normally Unicode and ISO/IEC 10646 are synchronised, so that the repertoire for any given version of Unicode corresponds exactly to the repertoire of a given version of the ISO standard (plus a given amendment if appropriate). However, the 1,369 new characters in Unicode 5.0 correspond to the 1,365 characters in FDAM2 (Final Draft Amendment 2) plus four characters from PDAM3 (Proposed Draft Amendment 3). The four fast-tracked characters (U+097B DEVANAGARI LETTER GGA, U+097C DEVANAGARI LETTER JJA, U+097E DEVANAGARI LETTER DDDA and U+097F DEVANAGARI LETTER BBA) are extended Devanagari letters used for writing the Sindhi language.

In summary, Unicode 5.0 will include 1,369 new characters and 9 new blocks :

  • NKo [07C0..07FF] : 59 characters
  • Balinese [1B00..1B7F] : 121 characters
  • Latin Extended-C [2C60..2C7F] : 17 characters
  • Latin Extended-D [A720..A7FF] : 2 characters
  • Phags-pa [A840..A87F] : 56 characters
  • Phoenician [10900..1091F] : 27 characters
  • Cuneiform [12000..123FF] : 879 characters
  • Cuneiform Numbers and Punctuation [12400..1247F] : 103 characters
  • Counting Rod Numerals [1D360..1D37F] : 18 characters

The new blocks cover five new scripts (ISO 15924 script codes in parentheses) :

  • NKo (Nkoo)
  • Balinese (Bali)
  • Phags-pa (Phag)
  • Phoenician (Phnx)
  • Cuneiform (Xsux)

Amendment 2 to ISO/IEC 10646:2003 has proved to be one of the most contentious episodes in the history of the standard, with controversy surrounding the encoding of NKo, Phags-pa and Phoenician.

There has been incessant bickering between the Irish and Canadian national bodies over many issues relating to the encoding of NKo, including whether certain old-form letters should be encoded as separate characters or treated as simple glyph variants, what certain characters should be named, and whether NKo-specific diacritic marks should be used rather than the script-neutral generic diacritic marks.

Whilst there have been no real disagreements about the repertoire for Phags-pa, two completely different encoding models were proposed for the script : one proposed by myself, and an alternative model proposed by Professor Choijinzhab of the Inner Mongolia University and supported by the Chinese and Mongolian national bodies. It took a great deal of discussion between the various parties to finally come to a mutually acceptable agreement. In many ways the disagreement over Phags-pa, whilst invisible to casual observers of Unicode and ISO/IEC 10646 (i.e. most of the participants of the Unicode public mailing list), was far more important than the disagreements relating to NKo or even Phoenician, as the arguments involved went to the very heart of the Unicode encoding philosophy and the meaning of what a character is. But more on this another time.

The Phoenician "debate" on the Unicode public mailing list will be remembered with disgust for many years to come. For months and months the list was swamped with thousands of postings from people who held diametrically opposed (and immovable) positions, and who constantly bartered the same old arguments over and over again with extreme vitriol. Finally, and I think this was an unwarranted act of censorship, the subject was banned from the mailing list, and even now mentioning the "P" word on the list may lay you open to moderation. Personally I thought it was all a storm in a teacup ... and for the record (even though I did not take any part in the debate), I was in favour of encoding Phoenician.

In the first round of balloting of Amendment 2, Canada, China, Germany, Ireland and Japan all voted against approving the amendment (see N2876), although only Canada, China and Germany were seriously opposed to anything in the amendment (NKo, Phags-pa and Phoenician respectively). In the end, China agreed to the encoding of Phags-pa, largely as originally proposed, and changed its vote to Yes in the second round of balloting. However, the concerns of Canada and Germany, relating to NKo and Phoenician respectively, were not met, and they both voted No in the second round of balloting (see N2959 and N2990). Whilst WG2 (the working group for ISO/IEC 10646) always tries to reach a consensus on the actions it takes, at the end of the day a ballot is a ballot, and sometimes a majority decision is taken; and in this case there was a 14-to-2 majority in favour of accepting the amendment (with 16 abstentions or non-votes).



Addendum

Ken Whistler has reminded me that the Unicode standard and ISO/IEC 10646 have been temporarily out of sync on one previous occasion. When Unicode 2.1 was released in 1998, it added only two characters from 10646-1:1993 Amendment 18 (U+20AC EURO SIGN and U+FFFC OBJECT REPLACEMENT CHARACTER). As with the four Sindhi characters that have been fast-tracked for Unicode 5.0, there was an urgent implementation requirement by the software industry (aka Microsoft), and the Unicode Consortium needed to ensure that the standard did not lag behind implementations (i.e. that Microsoft did not support the Euro before it was officially encoded). Still, Unicode 5.0 will be the first time in eight years that a version of the Unicode standard does not correspond exactly to a particular publication/amendment of the ISO standard.


Thursday, 24 November 2005

How many Unicode characters are there ?

[Short answer: 113,021]


Otto Stolz asked on the Unicode List how many Unicode characters there were, classified as control characters, format characters, graphic characters, private use characters, noncharacters, surrogate code points, etc. Now I love Unicode facts, figures and trivia, so I can't resist trying to answer this question.

The "Unicode Version History" utility of BabelMap provides precisely the information requested by Otto for all versions of Unicode from 1.0.0 up to the current version (4.1 when I first wrote this post, but now updated to 7.0). This information is tabulated below :




BabelMap : Unicode Version History

Version | Date | Scripts | Blocks | Total code points | Assigned code points | Unassigned code points | Encoded characters | Private use characters | Noncharacters | Surrogate code points | Graphic characters | Format characters | Control characters
------- | ---- | ------- | ------ | ----------------- | -------------------- | ---------------------- | ------------------ | ---------------------- | ------------- | --------------------- | ------------------ | ----------------- | ------------------
1.0.0 | October 1991 | 24 | 57 | 65,536 | 12,795 | 52,741 | 7,161 | 5,632 | 2 | 0 | 7,085 | 2 | 74
1.0.1 | June 1992 | 25 | 59 | 65,536 | 34,505 | 31,031 | 28,359 | 6,144 | 2 | 0 | 28,283 | 2 | 74
1.1 | June 1993 | 24 | 63 | 65,536 | 40,635 | 24,901 | 34,233 | 6,400 | 2 | 0 | 34,151 | 2 | 80
2.0 | July 1996 | 25 | 67 | 1,114,112 | 178,500 | 935,612 | 38,950 | 137,468 | 34 | 2,048 | 38,867 | 18 | 65
2.1 | May 1998 | 25 | 67 | 1,114,112 | 178,502 | 935,610 | 38,952 | 137,468 | 34 | 2,048 | 38,869 | 18 | 65
3.0 | September 1999 | 38 | 86 | 1,114,112 | 188,809 | 925,303 | 49,259 | 137,468 | 34 | 2,048 | 49,168 | 26 | 65
3.1 | March 2001 | 41 | 95 | 1,114,112 | 233,787 | 880,325 | 94,205 | 137,468 | 66 | 2,048 | 94,009 | 131 | 65
3.2 | March 2002 | 45 | 107 | 1,114,112 | 234,803 | 879,309 | 95,221 | 137,468 | 66 | 2,048 | 95,023 | 133 | 65
4.0 | April 2003 | 52 | 122 | 1,114,112 | 236,029 | 878,083 | 96,447 | 137,468 | 66 | 2,048 | 96,243 | 139 | 65
4.1 | March 2005 | 59 | 142 | 1,114,112 | 237,302 | 876,810 | 97,720 | 137,468 | 66 | 2,048 | 97,515 | 140 | 65
5.0 | July 2006 | 64 | 151 | 1,114,112 | 238,671 | 875,441 | 99,089 | 137,468 | 66 | 2,048 | 98,884 | 140 | 65
5.1 | April 2008 | 75 | 168 | 1,114,112 | 240,295 | 873,817 | 100,713 | 137,468 | 66 | 2,048 | 100,507 | 141 | 65
5.2 | October 2009 | 90 | 194 | 1,114,112 | 246,943 | 867,169 | 107,361 | 137,468 | 66 | 2,048 | 107,154 | 142 | 65
6.0 | October 2010 | 93 | 206 | 1,114,112 | 249,031 | 865,081 | 109,449 | 137,468 | 66 | 2,048 | 109,242 | 142 | 65
6.1 | January 2012 | 100 | 217 | 1,114,112 | 249,763 | 864,349 | 110,181 | 137,468 | 66 | 2,048 | 109,975 | 141 | 65
6.2 | September 2012 | 100 | 217 | 1,114,112 | 249,764 | 864,348 | 110,182 | 137,468 | 66 | 2,048 | 109,976 | 141 | 65
6.3 | September 2013 | 100 | 217 | 1,114,112 | 249,769 | 864,343 | 110,187 | 137,468 | 66 | 2,048 | 109,975 | 147 | 65
7.0 | June 2014 | 123 | 249 | 1,114,112 | 252,603 | 861,509 | 113,021 | 137,468 | 66 | 2,048 | 112,804 | 152 | 65

For historic versions of Unicode the statistics are based on the General Category of the characters at the time of encoding, and do not take into account any subsequent changes in General Category. Thus the fact that 4.0 has 139 format characters and 4.1 has 140 format characters is not because a new format character was added in 4.1, but because the General Category of U+200B ZERO WIDTH SPACE was changed from Zs to Cf in Unicode 4.0.1. Note that the statistics for 1.0.0 and 1.0.1 are based upon Ken Whistler's reconstructed Unicode Character Data.

To help understand what we're talking about, here are definitions of some of the terms used in the table (see Section 2.4 of the Unicode Standard for further information).

  • Control characters are those characters with a General Category of 'Cc'. These are invisible characters whose functions are defined by protocols or standards other than Unicode (they are inherited from pre-existing 8-bit standards). They include familiar characters such as Tab, Carriage Return and Line Feed that are essential to writing Unicode text (U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR never took off as Unicode replacements for Carriage Return and/or Line Feed), as well as many characters that you should never see in plain text (the full range of control characters is 0000..001F and 007F..009F).
  • Format characters are those characters with a General Category of 'Cf', 'Zl' or 'Zp'. These are invisible characters defined by Unicode for a particular function. These include things like U+200D ZERO WIDTH JOINER, U+202D LEFT-TO-RIGHT OVERRIDE, interlinear annotation characters (FFF9..FFFB) and the infamous set of Tag characters (E0001 and E0020..E007F). They work behind the scenes to do useful things like bidirectional control and character shaping.
  • Graphic characters are those characters with a General Category other than Cc, Cn, Co, Cs, Cf, Zl and Zp, that is to say ordinary visible characters.
  • Noncharacters are code points that are permanently reserved, and are guaranteed never to be assigned as characters. They are the thirty-four code points ending in FFFE or FFFF (that is, the last two code points of each of the seventeen planes), as well as the thirty-two code points in the range FDD0..FDEF (see the sketch after this list for a quick computational test).
  • Encoded characters = Graphic characters + Format characters + Control characters.
  • Assigned code points = Encoded characters + Private Use characters + Noncharacters + Surrogate code points.
  • Total code points = Assigned code points + Unassigned code points.
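
These definitions make the counts easy to reproduce for yourself, at least for whichever version of Unicode your software supports (the historic per-version figures in the table require the historic data files). Below is a minimal Python sketch using the standard unicodedata module; note that unicodedata reports both unassigned code points and noncharacters as 'Cn', so the noncharacters have to be picked out by the rule given above.

    import unicodedata
    from collections import Counter

    def is_noncharacter(cp):
        # 32 noncharacters at FDD0..FDEF, plus the last two code points of each plane
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    counts = Counter()
    for cp in range(0x110000):                    # all 1,114,112 code points
        counts[unicodedata.category(chr(cp))] += 1

    control = counts["Cc"]
    fmt     = counts["Cf"] + counts["Zl"] + counts["Zp"]
    graphic = sum(n for cat, n in counts.items()
                  if cat not in ("Cc", "Cn", "Co", "Cs", "Cf", "Zl", "Zp"))

    print("Unicode version :", unicodedata.unidata_version)
    print("Control characters :", control)
    print("Format characters :", fmt)
    print("Graphic characters :", graphic)
    print("Encoded characters :", graphic + fmt + control)
    print("Noncharacters :", sum(is_noncharacter(cp) for cp in range(0x110000)))

If your Python's unidata_version matches one of the versions in the table above, the totals should agree with that column (modulo the General Category caveat mentioned earlier).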


Charts

[Charts illustrating the above statistics]

Wednesday, 16 November 2005

BabelMap Version 1.11.4

I released a new version of BabelMap yesterday, the first for several months. It has two important new features.

Firstly, I no longer use Uniscribe to render the characters in the main character grid. Most people will probably not notice the difference, but it has removed one constant source of annoyance to me : unwanted font substitution. Now the Uniscribe API allows you to turn off font substitution (which I do in both BabelPad and BabelMap), but in some cases it will still do it anyway. For example, in previous versions of BabelMap if you selected the Tahoma or Microsoft Sans Serif font and displayed the CJK Unified Ideographs block, then the ideographs would be displayed even though these fonts do not have any glyphs for CJK ideographs (the MingLiU font appears to be substituted). Similarly, for Tahoma and Microsoft Sans Serif, the control characters U+0000 through U+001F are displayed using another font (Batang, Dotum or Gulim). As far as I can tell this unexpected font substitution only happens for Tahoma and Microsoft Sans Serif. Some versions of Uniscribe also do weird font substitution when running under versions of Windows other than XP (i.e. 9X, ME, NT4, 2K), and for certain complex scripts if the selected font does not have the appropriate OpenType tables then Uniscribe will substitute the missing character glyph for the font's glyphs (this problem seems to have been fixed in the latest versions of Uniscribe). Anyhow, in the latest version of BabelMap none of this unexpected and unwanted font substitution occurs, and what you see is exactly what is in the font.

The second major change is the implementation of the number one feature request : to be able to open fonts from file. By default BabelMap shows all installed fonts on the PC, but now you can also open one or more uninstalled fonts from file by selecting "File : Add New Fonts..." from the menu. When you do this the fonts will be available for selection and display in exactly the same way that installed fonts are. However, the fonts will only be visible for the current instance of BabelMap, and will no longer be available once BabelMap is closed unless you actually install the fonts. This feature will be very useful for previewing fonts, in order to install only those fonts with the coverage you are interested in.

Finally, a sneak preview of what I am working on next, which for me is a very exciting feature : a utility that analyses the OpenType tables of the currently selected font. The utility displays in a tree control the scripts, languages, features and lookup tables defined for the GSUB and GPOS tables in the font :
[Screen shot : the OpenType analysis utility, showing the tree of scripts, languages, features and lookups]


When you click on one of the lookup tables, then full details of the lookup table will be displayed in another area, for example for a GSUB lookup you will be able to select an input glyph or glyph sequence from a dropdown list (and a context if appropriate) and see what output glyph or glyph sequence the lookup table maps it to. I am also planning to put in an OpenType test bed, similar to the proofing tool in Microsoft VOLT, which will allow the user to enter text and see the OpenType substitutions take place. I still have a lot of work to do on this, and I'm quite busy on other things right now (lots of fun stuff to do in preparation for Unicode 5.0, including writing up the block introduction for Phags-pa), so this feature will probably not be ready until the spring of next year.
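
The finished utility will of course be part of BabelMap itself, but for anyone who wants to poke at the same data programmatically in the meantime, the scripts, languages, features and lookup indices of a font's GSUB and GPOS tables can be listed with a few lines of Python using the fontTools library (the font file name below is just a placeholder) :

    from fontTools.ttLib import TTFont

    font = TTFont("SomeFont.ttf")             # placeholder : any OpenType font file
    for tag in ("GSUB", "GPOS"):
        if tag not in font:
            continue                          # font has no such table
        table = font[tag].table
        print(tag)
        for rec in table.ScriptList.ScriptRecord:
            langs = [ls.LangSysTag for ls in rec.Script.LangSysRecord] or ["default"]
            print("  script", rec.ScriptTag, ":", ", ".join(langs))
        for feat in table.FeatureList.FeatureRecord:
            print("  feature", feat.FeatureTag, ": lookups", feat.Feature.LookupListIndex)

This only walks the top of the tree (scripts, languages, features); dumping the contents of the individual lookups, as the BabelMap utility will do, requires descending into table.LookupList.Lookup as well.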


BabelPad Version 1.9.3

The main new feature of the latest version of BabelPad that I released last month is a "glyph mode" rendering option that causes complex scripts to be rendered as individual spacing glyphs, with no shaping or ligature behaviour. BabelPad has always utilised Microsoft's Uniscribe rendering engine (usp10.dll) to render text, and on the whole, if you have a recent version of Uniscribe installed on your system, this works very well for most scripts. However, I have had a number of requests to be able to disable complex script rendering, and this has now been implemented as an option that can be turned on and off at the click of the mouse (by means of the u" button on the main toolbar). I think this is a really great feature, as it allows you to visually decompose text into its constituent elements without altering the actual textual content.

The screen shot below shows some words in various complex scripts that have been decomposed into Normalization Form NFD ("Convert : Normalization Form : To NFD" from the BabelPad menu). By default the text is rendered using Uniscribe (in this case version 1.473.4067.15; but note that earlier versions may not render Tibetan and/or Mongolian correctly), and so various forms of glyph substitution and positioning are applied, depending upon the requirements of the script and the capabilities of the font (in this case XTashi for Tibetan and Code2000 for everything else).
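
(Incidentally, the NFD decomposition applied here is easy to reproduce in any Unicode-aware environment. A minimal Python illustration, using the precomposed character U+0F43 TIBETAN LETTER GHA :)

    import unicodedata

    nfd = unicodedata.normalize("NFD", "\u0F43")      # TIBETAN LETTER GHA
    print([f"U+{ord(c):04X}" for c in nfd])           # ['U+0F42', 'U+0FB7']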

[Screen shot : complex script text in NFD, rendered normally with Uniscribe]

When "glyph mode" is enabled the display immediately changes to that shown below, with each character displayed as a single, spacing glyph. Note how RTL text such as Arabic is now displayed in logical order from left to right.

[Screen shot : the same text rendered in "glyph mode"]

This mode entirely bypasses Uniscribe, and so no shaping, positioning or glyph substitution is carried out. However, there is an option ("Options : Rendering Options : Use Uniscribe" from the menu) to render the individual glyphs using Uniscribe. When this option is applied the result is as below.

[Screen shot : "glyph mode" with the "Use Uniscribe" option enabled]

There is still no shaping behaviour, but now Uniscribe does make some cosmetic changes to the output glyphs, principally the addition of a dotted circle to some characters that normally only occur in combining sequences. The other change, which is not illustrated here, is that Uniscribe does not render glyphs for control characters such as U+206A [INHIBIT SYMMETRIC SWAPPING]. Some fonts (e.g. Code2000) provide glyphs for such characters, and although they are usually filtered out by Uniscribe, they will be displayed by BabelPad when in "glyph mode" and Uniscribe is not enabled (they will also be rendered when in normal mode and "Options : Rendering Options : Display Glyphs for Control Characters" is selected).


Saturday, 12 November 2005

Tibetan Extensions 1 : Astrological Pebble Symbols

MichKa recently asked whether it is good luck or bad luck to encode the bad luck character, in relation to proposed new Tibetan characters under consideration by the Unicode Technical Committee (UTC). As I am the proposer of the character in question (see N3011 for the formal proposal document), I guess it might be an auspicious start to my blogging career to talk a little about this ill-omened character and the other related Tibetan astrological characters in Unicode.

The Tibetan block of Unicode contains more than just the letters and punctuation marks that are needed to write the Tibetan script; it also includes quite a few religious, musical and astrological symbols. Among these are a group of characters representing various combinations of black and white pebbles that are used for divination. White pebbles (Tibetan rdel dkar རྡེལ་དཀར) represent good fortune and black pebbles (Tibetan rdel nag རྡེལ་ནག) represent bad fortune, with one, two or three pebbles indicating the degree of good or bad fortune, so that one white pebble is good, two white pebbles is very good, and three white pebbles is extremely good. Neutral fortune is represented by a white plus a black pebble. In Tibetan astrological texts black pebbles are represented by an X-shaped cross, whilst white pebbles are represented by an O-shaped circle, as can be seen in this page from an astrological almanac for the Wood Ox year (1985), published in Lhasa by the Tibetan People's Publishing House :

[Page from a Tibetan almanac for the Wood Ox year (1985) showing pebble divination symbols]

The upper rows of this chart show single, double and triple white pebble symbols, single and double black pebble symbols, and the white plus black pebble symbol. When Tibetan was re-encoded in 1996 for Unicode 2.0 (more about that another time), these six pebble symbols were included in the original repertoire :

  • U+0F1A TIBETAN SIGN RDEL DKAR GCIG རྡེལ་དཀར་གཅིག "one white pebble"
  • U+0F1B TIBETAN SIGN RDEL DKAR GNYIS རྡེལ་དཀར་གཉིས "two white pebbles"
  • U+0F1C TIBETAN SIGN RDEL DKAR GSUM རྡེལ་དཀར་གསུམ "three white pebbles"
  • U+0F1D TIBETAN SIGN RDEL NAG GCIG རྡེལ་ནག་གཅིག "one black pebble"
  • U+0F1E TIBETAN SIGN RDEL NAG GNYIS རྡེལ་ནག་གཉིས "two black pebbles"
  • U+0F1F TIBETAN SIGN RDEL DKAR RDEL NAG རྡེལ་དཀར་རྡེལ་ནག "white pebble, black pebble"
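
(These six characters and their names can be confirmed in any environment with access to the Unicode Character Database; for example, a quick Python loop over the range, assuming a Python build whose character data is recent enough to include them :)

    import unicodedata

    for cp in range(0x0F1A, 0x0F20):
        print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
    # U+0F1A TIBETAN SIGN RDEL DKAR GCIG
    # ...
    # U+0F1F TIBETAN SIGN RDEL DKAR RDEL NAG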

This leaves an obvious gap in the pattern, with no symbol for the triple black pebble combination. The reason for this is that the triple black pebble represents extreme bad fortune, and is apparently avoided by astrologers, who always substitute the less ominous double black pebble symbol when writing out a fortune. Thus, whilst three black pebbles may be cast when divining someone's fortune, the three black pebble symbol is almost never written down. Nevertheless, the triple black pebble was added to the Unicode 3.0 repertoire in 1999 :

  • U+0FCF TIBETAN SIGN RDEL NAG GSUM རྡེལ་ནག་གསུམ "three black pebbles"

Now you may think that this would mark the end of the story, but far from it : it marked the start of much acrimonious debate on the Unicode TIBEX (Tibetan Extensions) mailing list about whether other combinations of pebble symbols needed to be encoded or not. The TIBEX mailing list was intended to be a forum for experts on Tibetan to agree upon the set of characters needed to complete the encoding of Tibetan, but due to irreconcilable differences between the participants it has proved impossible to reach any sort of consensus on almost any issue, and the result is that rather than producing a single proposal covering a set of commonly agreed characters for encoding, new Tibetan characters are being proposed in dribs and drabs by individual sponsors. The subject of this posting is the symbol for the black plus white two pebble combination, for which I am the sponsor : TIBETAN SIGN RDEL NAG RDEL DKAR རྡེལ་ནག་རྡེལ་དཀར "black pebble, white pebble".

As I said earlier, neutral fortune is represented by the "OX" (white plus black pebble) combination, so what then does the proposed "XO" (black plus white pebble) combination represent? Well, the latter combination never occurs in the ordinary divinations, but only in a set of special divinations known as the "upper and lower nine burial pebbles" (steng 'og dgu dur rdel སྟེང་འོག་དགུ་དུར་རྡེལ). In these divinations the "OX" combination indicates good luck earlier and bad luck later, whereas the "XO" combination indicates bad luck earlier and good luck later. An example of the "upper and lower nine burial pebbles" can be seen in the bottom two rows of the 1985 almanac shown above. These divinations are divided into two rows, with a single black or white pebble in the top row, and one or two pebbles in the bottom row. The two pebbles are always either "OX" or "XO"; the same colour "XX" and "OO" combinations do not occur in this context.

This almanac provides clear and unambiguous evidence of the usage of the "XO" pebble symbol, and convinced most of the TIBEX members of the validity of encoding the "XO" character. However, the situation is complicated by the fact that in some almanacs the line between the upper and lower rows is dispensed with, so that superficially the divinations appear to comprise a single row of vertical two pebble combinations and mixed colour three pebble combinations, as can be seen in the bottom row of this almanac printed in India in 2000 :

[Bottom row of a Tibetan almanac printed in India in 2000]

Nevertheless, it is clear from the fact that some of the single pebbles have writing underneath them in place of a single or double pebble that the single pebble on top and the single or double pebble at the bottom are separate entities, as shown in the 1985 almanac. Notwithstanding, there was fierce argument on the TIBEX list about whether the "XO" symbol was a valid entity or not. It has been claimed by one particularly vociferous member of TIBEX that the 1985 almanac is an example of incorrect typographic and astrological practice, simply on the basis that it was published in Chinese-controlled Tibet. Whether that is the case or not (and I am convinced that there is absolutely no truth in this) is largely irrelevant, as Unicode encodes characters on the basis of attested usage, not on the basis of what is "correct" or not.

It has also been claimed that certain astrological calculations, known as the "upper and lower doors" (sgo steng 'og སྒོ་སྟེང་འོག), include all possible permutations of three black and white pebbles, and thus the mixed-colour three pebble combinations "OXO", "OOX", "OXX", "XXO", "XOX" and "XOO" also need to be encoded. However, despite repeated requests, the textual evidence for these pebble symbols has never been made available, and so no proposal for these characters has yet been made.
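
(The arithmetic behind that claim is simple enough : there are 2 × 2 × 2 = 8 possible sequences of three black or white pebbles, of which the all-white and all-black triples are already encoded, leaving the six mixed-colour combinations. A throwaway Python check :)

    from itertools import product

    mixed = ["".join(p) for p in product("OX", repeat=3) if len(set(p)) > 1]
    print(mixed)    # ['OOX', 'OXO', 'OXX', 'XOO', 'XOX', 'XXO']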

Anyway, back to the character that I have proposed for encoding, TIBETAN SIGN RDEL NAG RDEL DKAR. This character has now been accepted at the recent meeting of the UTC (see Proposed New Characters), and is provisionally allocated for encoding at U+0FCE, right next to its old friend the triple black pebble symbol [TIBETAN SIGN RDEL NAG GSUM]. The next stage on the road to characterhood will be in April 2006, when the proposal is considered by WG2, the working group for ISO/IEC 10646, the ISO standard corresponding to Unicode (see Appendix C of the Unicode Standard for details of the relationship between the two standards). As the character repertoires of Unicode and ISO/IEC 10646 are fully synchronised, any addition to Unicode also needs to be approved by WG2, and so character proposals need to undergo a convoluted and slow process of balloting by national bodies, often lasting several years, before they can be included in the relevant standards.