Monday, 13 June 2011

What's new in Unicode 6.1 ?


[2012-02-01 Update: Unicode 6.1.0 was released on 31 January 2012.]


Unicode 6.1 is scheduled for release in Spring 2012, and will be synchronized to the 3rd edition of ISO/IEC 10646 (see Unicode Liaison Report to WG2). Confusingly, the 3rd edition is actually the 5th iteration of the ISO/IEC 10646 standard, but it is the 3rd edition of the combined one-part standard first published in 2003 that superseded the original two-part standard (Part 1: Architecture and Basic Multilingual Plane; Part 2: Supplementary Planes) first published in 1993 (see Unicode and ISO/IEC 10646 for more details on the relationship between the Unicode and ISO/IEC 10646 standards). The first combined edition published in 2003 (corresponding to Unicode 4.0) underwent eight amendments in as many years, adding 41 new scripts, 84 new blocks, and 13,002 new characters (see How many Unicode characters are there ?), before a second edition (corresponding to Unicode 6.0) was published earlier this year. Due to technical issues with the CJK-B fonts, the CJK-B code chart was printed in single-column format rather than the multi-column format used for the other CJK blocks, and in order to rectify this deficiency a third edition will be published straight away (instead of first publishing a series of amendments to the second edition).

The 3rd edition of ISO/IEC 10646 has already completed two rounds of balloting, and will undergo one final (FDIS) ballot later this year, before being published sometime next year. The character repertoire, code points and character names are now stable, and highly unlikely to change before publication. Unicode 6.1 will correspond to the repertoire of this 3rd edition of ISO/IEC 10646.

The 3rd edition of ISO/IEC 10646 has 733 new characters compared with the 2nd edition, but as one of these characters was fast-tracked into Unicode 6.0 (U+20B9 ₹ Indian Rupee Sign), Unicode 6.1 will include a total of 732 new characters, including seven new scripts, as detailed below. This will mean that Unicode 6.1 comprises a total of 110,116 graphic and format characters.

The final 3rd edition code charts are not yet ready, but an earlier version of the code charts showing the new additions (with some characters that have since been removed) is available.



New Scripts

Unicode 6.1 includes the following seven scripts, which are all encoded in the Supplementary Multilingual Plane (SMP). The Basic Multilingual Plane (BMP) is now almost full, and it is unlikely that any new scripts will be encoded in the BMP.

  • Meroitic Hieroglyphs {10980..1099F} : 32 characters for the 'monumental' form of the Meroitic script that was derived from Egyptian hieroglyphs [N3665]
  • Meroitic Cursive {109A0..109FF} : 26 characters for the 'cursive' form of the Meroitic script that was derived from Egyptian Demotic (40 fraction and number characters have been removed from the proposed repertoire pending further study) [N3665]
  • Sora Sompeng {110D0..110FF} : 35 characters for the Sora Sompeng script used in India [N3647]
  • Chakma {11100..1114F} : 67 characters for the Chakma script used in Bangladesh and India [N3645]
  • Sharada {11180..111DF} : 83 characters for the Śāradā script which was the principal inscriptional and literary script of Kashmir from the 8th through 20th centuries, but which is now virtually obsolete [N3595]
  • Takri {11680..116CF} : 66 characters for the Takri script that was used for writing the Dogri language of Kashmir until the 1940s [N3758]
  • Miao {16F00..16F9F} : 133 characters for the Old Miao script that was devised by Samuel Pollard during the early 20th century [N3761, N3789, N3877]
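As a quick check of the claim that all seven scripts sit in the SMP, the plane of a code point can be read off by shifting away its low 16 bits. The following is only a minimal sketch: the block ranges are copied from the list above, and the names NEW_SCRIPT_BLOCKS, plane and block_planes are hypothetical helpers made up for this illustration.

```python
# Block ranges copied from the list of new scripts above (inclusive).
NEW_SCRIPT_BLOCKS = {
    "Meroitic Hieroglyphs": (0x10980, 0x1099F),
    "Meroitic Cursive": (0x109A0, 0x109FF),
    "Sora Sompeng": (0x110D0, 0x110FF),
    "Chakma": (0x11100, 0x1114F),
    "Sharada": (0x11180, 0x111DF),
    "Takri": (0x11680, 0x116CF),
    "Miao": (0x16F00, 0x16F9F),
}


def plane(code_point: int) -> int:
    """Return the plane number of a code point (0 = BMP, 1 = SMP)."""
    return code_point >> 16


def block_planes(blocks: dict) -> dict:
    """Map each block name to the set of planes its start and end fall in."""
    return {
        name: {plane(start), plane(end)}
        for name, (start, end) in blocks.items()
    }


if __name__ == "__main__":
    for name, planes in block_planes(NEW_SCRIPT_BLOCKS).items():
        print(f"{name}: plane(s) {sorted(planes)}")
```

Every block should report plane 1 only, whereas a BMP code point such as U+9FCC gives plane 0.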

Funerary stele with Meroitic Hieroglyphic inscription [CC-BY-SA-3.0 by Piero d'Houin dit Triboulet]


New Blocks

Unicode 6.1 also includes four new blocks for extensions to existing scripts and for symbols:

  • Arabic Extended-A {08A0..08FF} : 39 characters (9 letters for African languages, 15 characters for Rohingya, 4 Koranic annotation signs, 11 vowel signs for African and Philippine languages) [N3791, N3816, N3882]
  • Sundanese Supplement {1CC0..1CCF} : 8 punctuation marks used in old Sundanese manuscripts [N3666]
  • Meetei Mayek Extensions {AAE0..AAFF} : 23 characters used in historical orthographies of Meetei Mayek, and which are not defined for modern use by the Manipuri Government [N3206, N3470, N3478]
  • Arabic Mathematical Alphabetic Symbols {1EE00..1EEFF} : 143 characters used in Arabic mathematical expressions [N3799]

Additions to Existing Blocks

  • Armenian {0530..058F} : 1 character (U+058F Armenian Dram Sign) [N3771]
  • Arabic {0600..06FF} : 1 character (U+0604 Arabic Sign Samvat) [N3734]
  • Gujarati {0A80..0AFF} : 1 character (U+0AF0 Gujarati Abbreviation Sign) [N3764]
  • Lao {0E80..0EFF} : 2 letters for Khmu [N3893]
  • Georgian {10A0..10FF} : 5 letters for Ossetian and Abkhaz [N3775]
  • Sundanese {1B80..1BBF} : 9 characters for historic usage [N3666]
  • Vedic Extensions {1CD0..1CFF} : 4 characters [N3844, N3861, N3881]
  • Miscellaneous Mathematical Symbols-A {27C0..27EF} : 2 diagonal bar symbols [N3763]
  • Coptic {2C80..2CFF} : 2 letters for the Bohairic dialect [N3873]
  • Georgian Supplement {2D00..2D2F} : 2 letters for Ossetian [N3775]
  • Tifinagh {2D30..2D7F} : 2 letters for Tuareg [N3870]
  • Supplemental Punctuation {2E00..2E7F} : 10 characters (8 historic punctuation marks, and 2 em dashes) [N3664, N3740, N3770]
  • CJK Unified Ideographs {4E00..9FFF} : 1 character (U+9FCC = Adobe-Japan1-6 CID+20156, a variant of U+6DBC 涼) [N3885]
  • Cyrillic Extended-B {A640..A69F} : 9 characters for medieval Church Slavonic manuscripts [N3748]
  • Latin Extended-D {A720..A7FF} : 5 letters (including the Cambrian symbol (U+A792), but excluding middle dot letter, which was again removed at the request of the US) [N3840, N3846]
  • CJK Compatibility Ideographs {F900..FAFF} : 2 characters (U+FA2E and U+FA2F) [N3747]
  • Enclosed Alphanumeric Supplement {1F100..1F1FF} : 2 characters (marque de commerce and marque déposée signs used in Canada) [N3860]
  • Miscellaneous Symbols and Pictographs {1F300..1F5FF} : 4 Orthodox typikon symbols [N3772]
  • Emoticons {1F600..1F64F} : 13 more emoticons (Grinning Face, Expressionless Face, Confused Face, Kissing Face, Kissing Face with Smiling Eyes, Face with Stuck-Out Tongue, Worried Face, Frowning Face with Open Mouth, Anguished Face, Grimacing Face, Face with Open Mouth, Hushed Face, Sleeping Face) [N3790]
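The per-block counts in the three lists above can be tallied to confirm that they add up to the 732 new characters quoted earlier. This is just a sketch of the arithmetic: the numbers are transcribed by hand from the lists, and the dictionary names are made up for this illustration.

```python
# Character counts transcribed from the three lists above.
NEW_SCRIPTS = {
    "Meroitic Hieroglyphs": 32, "Meroitic Cursive": 26, "Sora Sompeng": 35,
    "Chakma": 67, "Sharada": 83, "Takri": 66, "Miao": 133,
}
NEW_BLOCKS = {
    "Arabic Extended-A": 39, "Sundanese Supplement": 8,
    "Meetei Mayek Extensions": 23,
    "Arabic Mathematical Alphabetic Symbols": 143,
}
ADDITIONS = {
    "Armenian": 1, "Arabic": 1, "Gujarati": 1, "Lao": 2, "Georgian": 5,
    "Sundanese": 9, "Vedic Extensions": 4,
    "Miscellaneous Mathematical Symbols-A": 2, "Coptic": 2,
    "Georgian Supplement": 2, "Tifinagh": 2, "Supplemental Punctuation": 10,
    "CJK Unified Ideographs": 1, "Cyrillic Extended-B": 9,
    "Latin Extended-D": 5, "CJK Compatibility Ideographs": 2,
    "Enclosed Alphanumeric Supplement": 2,
    "Miscellaneous Symbols and Pictographs": 4, "Emoticons": 13,
}

scripts_total = sum(NEW_SCRIPTS.values())    # 442
blocks_total = sum(NEW_BLOCKS.values())      # 213
additions_total = sum(ADDITIONS.values())    # 77
grand_total = scripts_total + blocks_total + additions_total

if __name__ == "__main__":
    print(scripts_total, blocks_total, additions_total, grand_total)
    # prints: 442 213 77 732
```

Adding back U+20B9, which was fast-tracked into Unicode 6.0, gives the 733 characters that are new in the 3rd edition of ISO/IEC 10646.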

Other Changes

Formal aliases will be defined for the following three characters: U+2118, and two Magnetic Ink Character Recognition (MICR) symbols used on cheques by banks that were inadvertently given each other's names when encoded twenty years ago:

  • U+2118 ℘ SCRIPT CAPITAL P will be given the formal alias WEIERSTRASS ELLIPTIC FUNCTION
  • U+2448 ⑈ OCR DASH will be given the formal alias MICR ON US SYMBOL
  • U+2449 ⑉ OCR CUSTOMER ACCOUNT NUMBER will be given the formal alias MICR DASH SYMBOL

Once assigned, character names may not be changed, so formal aliases are a mechanism for ameliorating problems caused by woefully misnamed characters, and processes are encouraged to use formal aliases in place of the official character names in user interfaces. Only a handful of characters have been assigned formal aliases, and the above are the first new formal aliases to be defined since formal aliases were introduced in Unicode 5.0 (July 2006). Formal aliases are only assigned in rare cases where there is a typographical error in the name (e.g. "bracket" misspelled as "brakcet") or where the name is confusingly wrong (e.g. "Yi Syllable Wu" is a syllable iteration mark, not the syllable wu), and are not assigned in cases where a character name is merely suboptimal or where there is academic dispute about the transliteration or naming conventions used. See Unicode Character Names Part 3 for more details about formal aliases.
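For anyone who wants to see how a formal alias behaves alongside the immutable official name, here is a minimal sketch using Python's standard unicodedata module. It assumes Python 3.3 or later (where unicodedata.lookup accepts name aliases) with a bundled Unicode database of version 6.1 or later; the ALIASES table and the resolve helper are made up for this illustration.

```python
import unicodedata

# The three formal aliases listed above, with the code points they name.
ALIASES = {
    "WEIERSTRASS ELLIPTIC FUNCTION": 0x2118,
    "MICR ON US SYMBOL": 0x2448,
    "MICR DASH SYMBOL": 0x2449,
}


def resolve(alias: str) -> int:
    """Return the code point that a character name or formal alias refers to."""
    return ord(unicodedata.lookup(alias))


if __name__ == "__main__":
    for alias, expected in ALIASES.items():
        cp = resolve(alias)
        # unicodedata.name() still reports the original (unchangeable) name.
        official = unicodedata.name(chr(cp))
        status = "ok" if cp == expected else "MISMATCH"
        print(f"U+{cp:04X} alias={alias!r} official={official!r} [{status}]")
```

The lookup by alias and the lookup by official name both resolve to the same code point, which is what lets a user interface display the alias without breaking anything that relies on the stable name.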



Unicode 6.1 Fonts

The following are some free or shareware fonts that already (prematurely) include some of the characters that will be added in Unicode 6.1:

  • BabelStone Han (covers the one new CJK unified ideograph and the two new CJK compatibility ideographs)
  • Everson Mono (covers various 6.1 additions for Armenian, Georgian, Georgian Supplement, Tifinagh, Supplemental Punctuation, Cyrillic Extended-B, and Latin Extended-D)
  • Symbola (covers the additions for Miscellaneous Mathematical Symbols-A, Supplemental Punctuation, Miscellaneous Symbols and Pictographs, and Emoticons)


BabelMap for Unicode 6.1

A test version of BabelMap Online supporting Unicode 6.1 is now available:

BabelMap Online for Unicode 6.1 Beta


23 comments:

Alex said...

Another font is Everson Mono, which covers some other characters in Unicode 6.1.

Andrew West said...

Thanks, I've added Everson Mono to the list now.

Alex said...

When will Tangut be encoded?

Andrew West said...

Hard to say. It's already taken four years, and we still haven't reached a consensus. If it gets on a ballot next spring then it might possibly get into Unicode 6.2, but I was saying the other day that it will be Unicode 7.2 or 8.3 before Tangut, Khitan and Jurchen are all encoded.

JAEMIN said...

When will CJK Ext E be encoded?

Andrew West said...

I don't know when CJK-E will be encoded ... not for another few years I think.

ievlampiev said...

When will Old Permic be encoded?

Andrew West said...

Old Permic seems to have stalled, and no progress has been made since 1999. I think that the main problem is lack of input and review from the user community, so if you are an expert in Old Permic I suggest you contact Michael Everson, and see if you can help revitalise the proposal.

ievlampiev said...

I think I can. First, we have prepared several proposals (but for existing scripts) regarding the historical alphabets used in Russia. Second, there is interest from Komi language activists, and even the Komi Wikipedia uses this script in its logo.

Alex said...

The US wants to cancel the encoding of U+A78F Latin Letter Middle Dot, which you have proposed. http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4164.pdf

Andrew West said...

Yes I know; it was to be expected I'm afraid. You can see the US ballot comment requesting its removal, together with the UK and Ireland ballot comments requesting its retention, in the document you linked to. My original proposal document is N3567, and I provided further rationales for encoding it in N3694. The US arguments against encoding the middle dot letter are encompassed in N3678.

The character has yo-yoed in and out of ballots more than any other character I can think of. It was first put on the ballot for ISO/IEC 10646:2003 Amd. 8, then taken off because of objections from the US; then added to the ballot for ISO/IEC 10646:2012, but again removed due to US objections; and has now been added to the ballot for ISO/IEC 10646:2012 Amd.1; the US forgot to object to it in the first PDAM ballot, but has now remembered, so given their intransigence it is likely to be eventually removed from this amendment as well, which is very annoying for scholars of Phags-pa and Tangut who actually have an urgent need for the character. I completely fail to understand the US objections, given the precedent of encoding other letters which are punctuation look-alikes (e.g. U+01C3 Latin Letter Retroflex Click which looks like "!" and U+02BC Modifier Letter Apostrophe which looks like an apostrophe) and many other dot-like characters. No doubt this is not the end of the story.

Alex said...

http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1947.pdf See this for a more recent Old Permic proposal. One coming will be at http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4177.pdf when it is done.

Alex said...

New Old Permic proposal at http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4177.pdf

Alex said...

Good news! Latin Letter Middle Dot will stay in Amd. 1. http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4161.pdf

Andrew West said...

Unfortunately it is not that simple. Middle Dot stays for now, but the US will request its removal at the next stage (at the WG2 meeting in California in two weeks' time). The US behaviour is bizarre -- whilst they fight against this particular dot, "raised dot" has been happily accepted; and they illogically insist that middle dot should be represented by a character with a different glyph shape, different character properties and different semantics. The fight will go on and on I fear.

Alex said...

Here are Unicode 6.2 additions now that Unicode 6.1 is out: see http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4244.pdf

Alex said...

However, the proposed Latin Letter Middle Dot is not on this list. It was pushed back again for the third time!

Alex said...

Tangut isn't on the list either, for that matter. All of this was expected.

Andrew West said...

As you say, all of this was expected. Tangut is not yet ready, but I hope that a revised final proposal will be accepted at the next WG2 meeting in October.

Илюха said...

Thanks to this blog and to Mike, there has been excellent progress on Old Permic!

My next question: when will the historical Russian currency signs from http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4208.pdf be encoded?

ievlampiev.

Alex said...

Old Permic is under Ballot!!!
See http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4275.doc

Alex said...

CJK Ext E under ballot. See http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4358.pdf and the Unicode roadmap.

Andrew West said...

CJK-E is included in the Committee Draft (CD) ballot for ISO/IEC 10646 4th edition (current SC2 ballots are listed on this page), but this is just the start of a long process, and CJK-E will not be included in Unicode until version 7.1 or later.