Sunday, 27 April 2008

What's new in Unicode 5.2 ?

Previously discussed :

[2009-10-01 : Unicode 5.2 has now been released (Unicode Code Charts, BabelMap)]

As most of us are still trying to get to grips with Unicode 5.1, which was only released three weeks ago, it may seem a little premature to start talking about Unicode 5.2, but I'm blogging about it early this time because 5.2 promises to be a very important release of Unicode, with 6,648 new characters and a record 15 new scripts, including the long-awaited CJK Extension-C (4,149 characters) and major historical scripts such as Egyptian Hieroglyphs (1,071 characters). Tangut (5,910 characters) and Nüshu, the famous women's writing of southern China, were originally in Amd.6, but have since been removed for further study, and will not now be encoded until Unicode 6.0 at the earliest.

[This blog post has been updated several times since first published on 2008-04-27. The most recent update on 2009-08-10 reflects the final repertoires of ISO/IEC 10646:2003 Amendments 5 and 6, which will be identical to the contents of Unicode 5.2 (Unicode 5.2 Code Charts).]

Unicode 5.2 will correspond to Amendments 5 and 6 of ISO/IEC 10646: 2003 (see Unicode Liaison Report for WG 2 meeting 52). Both these amendments have now completed their two rounds of technical balloting, and so no more changes will be made to their character repertoire. It is anticipated that Unicode 5.2 will be released at the end of September 2009 (which incidentally will be the first autumnal release of a new Unicode version since 3.0 in September 1999).

Amendment 5 (5,611 characters)

Amendment 5 has now been published (December 2008), and can be downloaded for free from the ISO Publicly Available Standards site.

New Scripts

Other New Blocks

Additions to Existing Blocks

Glyph Changes

Amendment 5 will also introduce changes to the representative glyph shape used in the code charts for the following characters (the new glyphs are given in N3465) :


Amendment 6 (1,037 characters)

Amendment 6 has now completed its two rounds of technical balloting (PDAM and FPDAM ballots), and after it has completed its final FDAM ballot it will be published. No more technical changes can now be made to the character repertoire, and so the character names and code points in the Amd.6 Code Charts can be relied on.

New Scripts

  • Bamum @ A6A0..A6FF (88 characters) [originally in Amd.5, but removed for further study, and now added back to Amd.6]
  • Imperial Aramaic @ 10840..1085F (31 characters)
  • Inscriptional Pahlavi @ 10B60..10B7F (27 characters)
  • Inscriptional Parthian @ 10B40..10B5F (30 characters)
  • Javanese @ A980..A9DF (91 characters)
  • Kaithi @ 11080..110CF (66 characters, including two section marks)
  • Lisu [aka Fraser alphabet] @ A4D0..A4FF (48 characters)
  • Meetei Mayek @ ABC0..ABFF (56 characters) [23 historical characters have been removed for further study following objections from India to the encoding of historical characters for this script]
  • Nushu [nüshu 女書 "women's script"] (389 characters) [removed for further study in light of concerns expressed by the UK]
  • Old South Arabian [aka Sabaean] @ 10A60..10A7F (32 characters)
  • Old Turkic [aka Orkhon-Yenisey] @ 10C00..10C4F (73 characters)
  • Samaritan @ 0800..083F (61 characters)
  • Tangut (5,910 characters) [removed to Amd. 7 in light of concerns by UK, Ireland and Germany, as well as various Tangut experts; and now removed from Amd.7 for further study]
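
As a rough sanity check on the list above, the block ranges and character counts can be compared to see how many code points each block leaves unassigned. The sketch below only uses the ranges and counts quoted in the list (Nushu and Tangut are omitted because they were removed); the names `AMD6_BLOCKS`, `block_capacity` and `unassigned` are made up for this illustration.

```python
# Ranges and character counts copied from the Amd.6 "New Scripts" list above.
AMD6_BLOCKS = {
    "Bamum": ("A6A0..A6FF", 88),
    "Imperial Aramaic": ("10840..1085F", 31),
    "Inscriptional Pahlavi": ("10B60..10B7F", 27),
    "Inscriptional Parthian": ("10B40..10B5F", 30),
    "Javanese": ("A980..A9DF", 91),
    "Kaithi": ("11080..110CF", 66),
    "Lisu": ("A4D0..A4FF", 48),
    "Meetei Mayek": ("ABC0..ABFF", 56),
    "Old South Arabian": ("10A60..10A7F", 32),
    "Old Turkic": ("10C00..10C4F", 73),
    "Samaritan": ("0800..083F", 61),
}


def block_capacity(block_range):
    """Number of code points in a 'XXXX..YYYY' hexadecimal range (inclusive)."""
    first, last = (int(part, 16) for part in block_range.split(".."))
    return last - first + 1


def unassigned(blocks):
    """Map each block name to the number of code points left unassigned."""
    return {
        name: block_capacity(block_range) - count
        for name, (block_range, count) in blocks.items()
    }


if __name__ == "__main__":
    for name, free in sorted(unassigned(AMD6_BLOCKS).items()):
        print(f"{name}: {free} unassigned code points")
```

A block with zero unassigned code points (Lisu and Old South Arabian here) has no room left for future additions within its range.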

Other New Blocks

Additions to Existing Blocks

Unicode 5.2 Fonts

The following are some free or shareware fonts that include some of the characters added in Unicode 5.2:

  • Aboriginal Serif / Aboriginal Sans Serif (covers all the new Unified Canadian Aboriginal Syllabics characters)
  • Aegyptus (includes the 1,071 characters in the new Egyptian Hieroglyphs block [13000..1342F], as well as many as yet unencoded hieroglyphs and other characters in the Supplementary Private Use Area-A) [NB Under Windows 7 Egyptian hieroglyphs and all the other Unicode 5.2 characters in the Supplementary Multilingual Plane render as two .notdef glyphs in Notepad and most other Windows applications; this is due to a problem with the version of Uniscribe that ships with Windows 7, which supports Unicode 5.1 but is not forward compatible with Unicode 5.2]
  • HanaMin (includes the eight new characters in the main CJK Unified Ideographs block [9FC4..9FCB], all 4,149 characters in the CJK-C block, the three new characters in the CJK Compatibility Ideographs block [FA6B..FA6D], most of the characters in the Enclosed Ideographic Supplement block, and the four new characters in the Enclosed CJK Letters and Months block)
  • New Athena Unicode (includes the seven new Coptic characters in the range 2CEB..2CF1)
  • LisuTzimu (covers the new Lisu block)
  • Padauk (covers Myanmar Extended-A)
  • Quivira (includes various new Latin, Cyrillic and Coptic characters, as well as some of the new currency signs, fraction signs and symbols)
  • Tai Heritage Pro (covers Tai Viet)
  • Tibetan Machine Uni (includes the four svasti signs at 0FD5..0FD8)
  • UnBatang (includes the new characters in the Jamo block, and all the characters in the new Hangul Jamo Extended-A and Hangul Jamo Extended-B blocks)
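
The "two .notdef glyphs" symptom mentioned in the Aegyptus note is consistent with the fact that every Supplementary Multilingual Plane character occupies two UTF-16 code units (a surrogate pair). The sketch below only demonstrates that encoding fact for U+13000; it makes no claim about how Uniscribe works internally, and the helper name `utf16_code_units` is invented for the example.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+13000 is the first code point of the Egyptian Hieroglyphs block.
hieroglyph = "\U00013000"
units = utf16_code_units(hieroglyph)

# One character, but two code units: a high surrogate followed by a low surrogate.
print(len(hieroglyph), [f"{unit:04X}" for unit in units])
```

A rendering engine that does not know the character may end up drawing one missing-glyph box per code unit rather than one per character.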

On Beyond Unicode 5.2

Sunday, 6 April 2008

BabelMap Version

To coincide with Friday's release of Unicode 5.1.0 I am releasing an updated version of BabelMap which supports all 100,713 characters encoded in Unicode 5.1 (1,624 new characters and 11 new scripts).

In addition to the support for Unicode 5.1, this version also has the following improvements (most of which I only added in the last week, which is why it was released two days late). However, I am still working on a major new version for release later in the year which will solve (what I consider to be) the main problem with BabelMap: the fact that the "edit buffer" only supports a single font, and so text in multiple scripts may display badly or as boxes.

1. A new "Font Info" dialog box has been added (available from the Tools menu or as a button in the Font Analysis utility). This gives detailed information about the currently selected font, currently all the information from the font's NAME table (for all platforms, encodings and languages supported by the font) and a list of all CMAP subtables in the font. This is my first experiment in providing information directly from the font tables, and in the future I might include more information from other tables if there is a demand. You can find out some very interesting things about your fonts from this dialog; for example I was very surprised to see just how many fonts there are that have a Unicode 1.0 or 1.1 semantics CMAP subtable, even though I very much doubt that the subtable mappings really do accord with Unicode 1.0 or 1.1 (i.e. Hangul symbols are mapped to where CJK-A now is).
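
For readers curious where such a list of CMAP subtables comes from, here is a minimal sketch that reads the encoding records straight from the bytes of a single TrueType/OpenType font, following the table layout in the OpenType specification. It is not BabelMap's code: the function name `list_cmap_subtables` is invented, font collections (.ttc) are not handled, and no validation is done. In the Unicode platform (platform ID 0), encoding IDs 0 and 1 are the "Unicode 1.0 semantics" and "Unicode 1.1 semantics" values mentioned above.

```python
import struct

# Encoding IDs for platform 0 (Unicode) that declare pre-2.0 semantics.
OLD_UNICODE_SEMANTICS = {0: "Unicode 1.0 semantics", 1: "Unicode 1.1 semantics"}


def list_cmap_subtables(font_bytes):
    """Return the (platformID, encodingID) pair of every cmap subtable."""
    # Font header: 4-byte version tag, then a 16-bit table count at offset 4.
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    for i in range(num_tables):
        # Each 16-byte table record: tag, checksum, offset, length.
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i
        )
        if tag != b"cmap":
            continue
        # cmap header: 16-bit version, 16-bit number of encoding records.
        _version, num_subtables = struct.unpack_from(">HH", font_bytes, offset)
        records = []
        for j in range(num_subtables):
            # Each 8-byte encoding record: platformID, encodingID, subtable offset.
            platform_id, encoding_id, _subtable_offset = struct.unpack_from(
                ">HHI", font_bytes, offset + 4 + 8 * j
            )
            records.append((platform_id, encoding_id))
        return records
    return []


def has_old_unicode_semantics(records):
    """True if any subtable claims Unicode 1.0 or 1.1 semantics."""
    return any(p == 0 and e in OLD_UNICODE_SEMANTICS for p, e in records)
```

Typical usage would be `list_cmap_subtables(open(path, "rb").read())`, where `path` points at a .ttf or .otf file on your system.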

2. The Composite Font configuration dialog ("Configure" button next to the "Composite Font" radio button) has been improved and simplified (largely in response to suggestions by John Cowan). There is now a simple correspondence between a single Unicode block selected and a list of fonts that are available for mapping to that block. This makes the configuration tool much easier to use, although it does mean that it is no longer possible to map a single font to multiple Unicode blocks in a single operation. The list of fonts covering a particular Unicode block is now also sortable by name or by number of characters that they cover, which should make it easier to find the font with the best coverage for any block.

I have also added an "Auto" button that will attempt to automatically configure the best composite font by mapping the font with the best coverage for each block, whilst at the same time using as few different fonts as possible. The results produced may not always be brilliant because the number of characters in a font's CMAP table is not necessarily the best indicator that the font has good coverage and support for a particular Unicode subset, especially for complex scripts. Another problem is that some fonts distort their actual coverage by including explicit blank or not defined glyphs for characters that they don't cover, which may make them seem as if they have good coverage, when in fact they don't (for example "Ming(for ISO10646)" has mappings for all 6,582 CJK-A characters but only a handful of them are non-blank). To avoid running the risk of getting every block mapped to a last resort font, I have explicitly excluded from the auto-configuration process any font which includes the string "LastResort" or "fallback" in its name.

And as a final touch I have added coverage statistics for the current configuration—a prize to the first person to achieve 100% coverage!
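
The auto-configuration idea described above can be sketched as a simple greedy pass. This is only an illustration under stated assumptions, not the algorithm BabelMap actually uses: coverage is taken to be a plain count of mapped characters per block, ties are broken in favour of a font that has already been chosen (to keep the number of fonts down), the name check is case-insensitive, and the font names in the usage example are hypothetical.

```python
# Fonts whose names contain these strings are skipped (assumed case-insensitive).
EXCLUDED_SUBSTRINGS = ("lastresort", "fallback")


def auto_configure(coverage):
    """Pick one font per block.

    coverage: {font_name: {block_name: number_of_characters_covered}}
    returns:  {block_name: font_name}
    """
    candidates = {
        name: blocks
        for name, blocks in coverage.items()
        if not any(s in name.lower() for s in EXCLUDED_SUBSTRINGS)
    }
    all_blocks = sorted({block for blocks in candidates.values() for block in blocks})

    chosen = {}
    used = set()
    for block in all_blocks:
        best_name, best_key = None, None
        for name in sorted(candidates):
            count = candidates[name].get(block, 0)
            if count == 0:
                continue
            # Higher coverage wins; on a tie, prefer a font already in use.
            key = (count, name in used)
            if best_key is None or key > best_key:
                best_name, best_key = name, key
        if best_name is not None:
            chosen[block] = best_name
            used.add(best_name)
    return chosen


if __name__ == "__main__":
    fonts = {
        "Example Sans": {"Javanese": 91, "Lisu": 48},
        "Example Lisu": {"Lisu": 48},
        "LastResort": {"Javanese": 91, "Kaithi": 66, "Lisu": 48},
    }
    print(auto_configure(fonts))
```

Because the sketch trusts raw character counts, it inherits the weakness noted above: a font that maps blank glyphs for a whole block would still win that block.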

3. Related to the changes in the way the Font Configuration dialog works, I have also improved the way that the default font mappings are assigned the first time that the application is run. This means that there may be a delay of several seconds the first time BabelMap is run (and also the first time it is run after upgrading from a version of BabelMap that supports a prior version of Unicode). This time is used to auto-configure the composite font and determine which font on your system has the greatest coverage, so that it can be set as the initial single font.

4. The Character Properties dialog (the "?" button or F9) has been extended to include the following additional information about characters :

  • XID_Start and XID_Continue have been added to the list of binary properties for each character.
  • Joining Type and Joining Group (for Arabic and related scripts) have been added.
  • All ideographic variation sequences (IVS) that are defined in the Ideographic Variation Database (currently only the Adobe-Japan1 collection) are listed under both the relevant base character [a CJK unified ideograph] and the relevant variation selector [VS17 through VS31] (e.g. <9089 E010E> [Adobe-Japan1 CID+20233] is listed under both U+9089 and U+E010E).
  • All currently defined named sequences and provisional named sequences are listed under the first character in the sequence (e.g. <0045 0329> LATIN CAPITAL LETTER E WITH VERTICAL LINE BELOW is listed under U+0045 LATIN CAPITAL LETTER E).
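
To make the ideographic variation sequence example concrete, the sketch below splits a two-character sequence such as <9089 E010E> into its base character and variation selector number. The supplementary variation selectors VS17 to VS256 occupy E0100..E01EF, so U+E010E works out as VS31. The function name `parse_ivs` is invented for this illustration, and the code does not check that the sequence is actually registered in the Ideographic Variation Database.

```python
VS17 = 0xE0100   # first supplementary variation selector
VS256 = 0xE01EF  # last supplementary variation selector


def parse_ivs(sequence):
    """Split 'base + variation selector' into (base code point, VS number)."""
    if len(sequence) != 2:
        raise ValueError("expected exactly two code points")
    base, selector = ord(sequence[0]), ord(sequence[1])
    if not VS17 <= selector <= VS256:
        raise ValueError("second code point is not in E0100..E01EF")
    return base, selector - VS17 + 17


if __name__ == "__main__":
    base, vs_number = parse_ivs("\u9089\U000E010E")
    print(f"U+{base:04X} followed by VS{vs_number}")
```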

5. The character grid font size can now be adjusted from the new "Fonts" menu. Generally speaking, most glyphs for most fonts fit in their cell comfortably at the default font size, but some fonts have glyphs that are smaller or larger than typical at the default font size, and may not display well. This new feature allows you to adjust the font size used for the character grid display if you are having display problems.

Latest Version [2008-06-12]

BabelMap Version incorporates a number of minor bug fixes and improvements to the user interface, as well as workarounds for fonts with invalid data (e.g. Caslon Roman/Italic, Matisse ITC 1.00). The most important change is to fix a bug that causes BabelMap to become unresponsive if you have installed Apple's Last Resort Font. If you are considering installing the Last Resort Font you should first upgrade to the latest versions of BabelMap and BabelPad.