Sunday, 27 April 2008

What's new in Unicode 5.2 ?

Previously discussed :

[2009-10-01 : Unicode 5.2 has now been released (Unicode Code Charts, BabelMap)]

As most of us are still trying to get to grips with Unicode 5.1, which was only released three weeks ago, it may seem a little premature to start talking about Unicode 5.2, but I'm blogging about it early this time because 5.2 promises to be a very important release of Unicode, with 6,648 new characters and a record 15 new scripts, including the long awaited CJK Extension-C (4,149 characters) and major historical scripts such as Egyptian Hieroglyphs (1,071 characters). Tangut (5,910 characters) and Nüshu, the famous women's writing of southern China, were originally in Amd.6, but have since been removed for further study, and will not now be encoded until Unicode 6.0 at the earliest.

[This blog post has been updated several times since first published on 2008-04-27. The most recent update on 2009-08-10 reflects the final repertoires of ISO/IEC 10646:2003 Amendments 5 and 6, which will be identical to the contents of Unicode 5.2 (Unicode 5.2 Code Charts).]

Unicode 5.2 will correspond to Amendments 5 and 6 of ISO/IEC 10646:2003 (see Unicode Liaison Report for WG 2 meeting 52). Both these amendments have now completed their two rounds of technical balloting, and so no more changes will be made to their character repertoire. It is anticipated that Unicode 5.2 will be released at the end of September 2009 (which incidentally will be the first autumnal release of a new Unicode version since 3.0 in September 1999).



Amendment 5 (5,611 characters)

Amendment 5 has now been published (December 2008), and can be downloaded for free from the ISO Publicly Available Standards site.


New Scripts


Other New Blocks


Additions to Existing Blocks


Glyph Changes

Amendment 5 will also introduce changes to the representative glyph shape used in the code charts for the following characters (the new glyphs are given in N3465) :

  • 04A8 CYRILLIC CAPITAL LETTER ABKHASIAN HA
  • 04A9 CYRILLIC SMALL LETTER ABKHASIAN HA
  • 04BE CYRILLIC CAPITAL LETTER ABKHASIAN CHE WITH DESCENDER
  • 04BF CYRILLIC SMALL LETTER ABKHASIAN CHE WITH DESCENDER
  • 11EC HANGUL JONGSEONG IEUNG-KIYEOK
  • 11ED HANGUL JONGSEONG IEUNG-SSANGKIYEOK
  • 11EE HANGUL JONGSEONG SSANGIEUNG
  • 11EF HANGUL JONGSEONG IEUNG-KHIEUKH
  • 1680 OGHAM SPACE MARK
  • 19D1 NEW TAI LUE DIGIT ONE


Amendment 6 (1,037 characters)

Amendment 6 has now completed its two rounds of technical balloting (PDAM and FPDAM ballots), and after it has completed its final FDAM ballot it will be published. No more technical changes can now be made to the character repertoire, and so the character names and code points in the Amd.6 Code Charts can be relied on.


New Scripts

  • Bamum @ A6A0..A6FF (88 characters) [originally in Amd.5, but removed for further study, and now added back to Amd.6]
  • Imperial Aramaic @ 10840..1085F (31 characters)
  • Inscriptional Pahlavi @ 10B60..10B7F (27 characters)
  • Inscriptional Parthian @ 10B40..10B5F (30 characters)
  • Javanese @ A980..A9DF (91 characters)
  • Kaithi @ 11080..110CF (66 characters, including two section marks)
  • Lisu [aka Fraser alphabet] @ A4D0..A4FF (48 characters)
  • Meetei Mayek @ ABC0..ABFF (56 characters) [23 historical characters have been removed for further study following objections from India to the encoding of historical characters for this script]
  • Nushu [nüshu 女書 "women's script"] (389 characters) [removed for further study in light of concerns expressed by the UK]
  • Old South Arabian [aka Sabaean] @ 10A60..10A7F (32 characters)
  • Old Turkic [aka Orkhon-Yenisey] @ 10C00..10C4F (73 characters)
  • Samaritan @ 0800..083F (61 characters)
  • Tangut (5,910 characters) [removed to Amd. 7 in light of concerns by UK, Ireland and Germany, as well as various Tangut experts; and now removed from Amd.7 for further study]
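
As a rough illustration of how the block ranges listed above can be used, here is a minimal Python sketch that looks up which of the new Amd.6 script blocks a code point falls into. The table is copied from the list above; the names `AMD6_NEW_SCRIPT_BLOCKS` and `amd6_block_of` are hypothetical, made up for this example, and the table deliberately leaves out Nushu and Tangut, which have no block assigned here.

```python
# Block ranges copied from the Amendment 6 "New Scripts" list in this post.
# Nushu and Tangut are omitted because no code points are given for them.
AMD6_NEW_SCRIPT_BLOCKS = [
    ("Samaritan", 0x0800, 0x083F),
    ("Lisu", 0xA4D0, 0xA4FF),
    ("Bamum", 0xA6A0, 0xA6FF),
    ("Javanese", 0xA980, 0xA9DF),
    ("Meetei Mayek", 0xABC0, 0xABFF),
    ("Imperial Aramaic", 0x10840, 0x1085F),
    ("Old South Arabian", 0x10A60, 0x10A7F),
    ("Inscriptional Parthian", 0x10B40, 0x10B5F),
    ("Inscriptional Pahlavi", 0x10B60, 0x10B7F),
    ("Old Turkic", 0x10C00, 0x10C4F),
    ("Kaithi", 0x11080, 0x110CF),
]


def amd6_block_of(code_point):
    """Return the name of the new Amd.6 script block containing code_point,
    or None if the code point is outside all of the blocks listed above."""
    for name, first, last in AMD6_NEW_SCRIPT_BLOCKS:
        if first <= code_point <= last:
            return name
    return None


if __name__ == "__main__":
    for cp in (0xA4D0, 0x10840, 0x0041):
        print(f"U+{cp:04X}: {amd6_block_of(cp)}")
```

Note that a block range is only the space reserved for a script: the Lisu block, for example, has exactly 48 code points for its 48 characters, whereas the Bamum block reserves 96 code points for 88 characters.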

Other New Blocks


Additions to Existing Blocks



Unicode 5.2 Fonts

The following are some free or shareware fonts that include some of the characters added in Unicode 5.2:

  • Aboriginal Serif / Aboriginal Sans Serif (covers all the new Unified Canadian Aboriginal Syllabics characters)
  • Aegyptus (includes the 1,071 characters in the new Egyptian Hieroglyphs block [13000..1342F], as well as many as yet unencoded hieroglyphs and other characters in the Supplementary Private Use Area-A) [NB Under Windows 7 Egyptian hieroglyphs and all the other Unicode 5.2 characters in the Supplementary Multilingual Plane render as two .notdef glyphs in Notepad and most other Windows applications; this is due to a problem with the version of Uniscribe that ships with Windows 7, which supports Unicode 5.1 but is not forward compatible with Unicode 5.2]
  • HanaMin (includes the eight new characters in the main CJK Unified Ideographs block [9FC4..9FCB], all 4,149 characters in the CJK-C block, the three new characters in the CJK Compatibility Ideographs block [FA6B..FA6D], most of the characters in the Enclosed Ideographic Supplement block, and the four new characters in the Enclosed CJK Letters and Months block)
  • New Athena Unicode (includes the seven new Coptic characters in the range 2CEB..2CF1)
  • LisuTzimu (covers the new Lisu block)
  • Padauk (covers Myanmar Extended-A)
  • Quivira (includes various new Latin, Cyrillic and Coptic characters, as well as some of the new currency signs, fraction signs and symbols)
  • Tai Heritage Pro (covers Tai Viet)
  • Tibetan Machine Uni (includes the four svasti signs at 0FD5..0FD8)
  • UnBatang (includes the new characters in the Jamo block, and all the characters in the new Hangul Jamo Extended-A and Hangul Jamo Extended-B blocks)
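
The note on Aegyptus above mentions that Supplementary Multilingual Plane characters can show up as two .notdef glyphs. A plausible reason for the number two (this is an assumption, not something the note itself confirms) is that any character above U+FFFF is stored in UTF-16 as a pair of surrogate code units, so software that does not recognise the character may end up displaying one missing glyph per code unit. The Python sketch below shows the standard UTF-16 surrogate calculation; the function name `utf16_code_units` is made up for this example.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as a list of ints) for one code point."""
    if code_point < 0x10000:
        # Characters in the Basic Multilingual Plane need a single code unit.
        return [code_point]
    # Supplementary characters are split into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


if __name__ == "__main__":
    # U+13000 is the first code point of the Egyptian Hieroglyphs block,
    # U+0FD5 is one of the Tibetan svasti signs in the BMP.
    for cp in (0x13000, 0x0FD5):
        units = " ".join(f"{u:04X}" for u in utf16_code_units(cp))
        print(f"U+{cp:04X} -> {units}")
```

The result can be cross-checked against Python's own encoder with `chr(0x13000).encode("utf-16-be")`, which yields the same two code units as four bytes.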


On Beyond Unicode 5.2


13 comments:

tty01 said...

Need Yezidi :(
(http://en.wikipedia.org/wiki/Yazidi)

Michael Everson said...

I'd love to do Yezidi. But we have no serious information about it—not even a reliable transliteration.

jedi787plus said...

Why doesn't N3465 show the three CJK Unified Ideographs' additions from ARIB-B24? It only shows the three CJK Compatibility Ideographs' ARIB additions. What were the codepoints assigned to such three new ideographs? Source document N3318 doesn't mention exact codepoints (only U+XXXX instead).

Andrew West said...

Someone on the Unicore mailing list recently asked the same question about the five new HKSCS characters recently added to Amd.6, but which are not shown in the PDAM6.2 repertoire (N3546). Ken Whistler's reply was:

Font limitations. Generally, Michael [Everson] doesn't do CJK additions in Unibook, particularly under meeting deadlines. That is a known issue and has been the case for several of these repertoire documents from WG2 meetings.

Michel [Suignard] knows about this and separately tracks any CJK additions. These onesey-twosey CJK additions for the URO do get correctly into the ballot documents.

The new CJK unified ideographs are shown on page 5 of fpdam5-all.pdf in the FPDAM5 ballot documents.

NB My blog post was out of date when it stated that three CJK unified ideographs and three CJK compatibility ideographs were being added to Amd.5 -- in fact one of the proposed unified ideographs has been encoded as a compatibility ideograph, so there are actually two new unified ideographs (9FC4 and 9FC5) and four new compatibility ideographs (FA6B..FA6E). I have now corrected the post.

alifshinobi said...

Awesomeness! I still don't get it. Where can I find the actual Tai Tham unicode font?

Andrew West said...

Remember that Unicode 5.2 won't be released until about October of this year (and it won't even be going beta until later this month), so you can't officially use any of the new Unicode 5.2 characters yet.

Note also that Unicode does not provide a font for the characters it encodes, and it may take months or years for vendors and font developers to provide support for the new scripts. Fonts with extensive Unicode coverage, such as Code2000/Code2001 and Everson Mono, will probably be updated to include some of the new Unicode 5.2 characters soon after its release, but some new scripts may remain fontless for several years if no-one is interested in creating a working Unicode font for a particular script.

I will append a list of fonts with Unicode 5.2 coverage to this post (as I did with my Unicode 5.1 post) when any such fonts come to my attention.

Edchick said...

When is Unicode 6.0 scheduled?
I am waiting for Tangut....

And when will Unicode end? Doesn't it pretty much cover 95% of all scripts on earth?

Andrew West said...

I will be discussing Unicode 6.0 at the end of October (after the end of the next WG2 meeting in Tokyo).

I am also keen to see Tangut encoded as soon as possible, but because of technical disagreements on how Tangut should be encoded, it will not be in Unicode until version 6.1 at the earliest.

There are many people who want Unicode to be frozen and stable, with no more additions, but I do not think that will happen for a few more years yet. In particular, China is currently undertaking to get all minority and historical scripts that are and have been used in China encoded (including major historical scripts such as Tangut, as already mentioned). Whilst China is involved in this work and also in the work to encode more Han characters then I believe that Unicode will continue to grow. But when China decides that it has everything it needs, then I think that the end will have come.

Moonchild said...

Now that 5.2 has been released...

The svastikas at U+0FD5..U+0FD8 (࿕ ࿖ ࿗ ࿘) that were added to the Tibetan range have all been available in the GPL font ‘Tibetan Machine Uni’ for quite some time now:

* Font page at THL

3155ffGd said...

Some other Unicode 5.2 fonts which come to my mind:

* UnFonts include Hangul Jamo Ext-A and Ext-B
* HanaZono font includes all of CJK-C, plus some of Enclosed Ideographic Supplement
* New Athena Unicode includes the Unicode 5.2 additions to Coptic...
* Padauk from SIL includes Myanmar Ext-A
* Tai Heritage Pro from SIL supports Tai Viet script.

Andrew West said...

Thanks for your font suggestions, I have added them to the list.

Tavulteguy said...

Thanks for your hard work Andrew.

There's a slight error in your Hangul Jamo listing (under Amendment 5). You write:
    "Hangul Jamo : Old Hangul (16 characters : U+115A..U+11FF)."

This should read:
    (16 characters : 115A..115E, 11A3..11A7, 11FA..11FF)

Andrew West said...

Thanks, I have fixed that now.