Sunday, 1 November 2009

What's new in Unicode 6.0 ?

[2010-10-11 : Unicode 6.0 was released on the 11th October 2010.]

[2010-08-30 : The Indian Rupee Sign (see N3862) has now been accepted for fast-tracking into Unicode 6 at U+20B9 by the Unicode Technical Committee, although it is not in either of the corresponding amendments of ISO/IEC 10646, which will cause a temporary desynchronization between the two standards until Unicode 6.1.]

[2010-06-02 : Unicode 6.0 is now in Beta, and is scheduled for release on or about the 11th October 2010.]

[2010-04-24 : The character repertoire, code points and character names for Unicode 6.0 are now fixed.]

Now that Unicode 5.2 has been out for a month, I think that it would be a good idea to look forward to Unicode 6.0, which is scheduled for release in late 2010. Unicode 6.0 will correspond to a new (2nd) edition of ISO/IEC 10646 (ISO/IEC 10646:2010), which itself corresponds to ISO/IEC 10646:2003 plus Amendments 1 through 8. Of these, Amendments 7 and 8 include 2,087 new characters that are not in Unicode 5.2 (if this is confusing, it might be helpful to try reading my post on the relationship between Unicode and ISO/IEC 10646). Unicode 6.0 also includes the Indian Rupee Sign (U+20B9), which is not yet included in ISO/IEC 10646. In summary, Unicode 6.0 will have a total of 109,449 characters in 206 blocks covering 93 scripts.
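
The character counts quoted in this post can be cross-checked with a little arithmetic. The Python sketch below uses only the per-amendment figures given in the headings further down (225 for Amendment 7 and 1,862 for Amendment 8) plus the fast-tracked Indian Rupee Sign. The Unicode 5.2 total it prints is derived by subtraction and is not a figure quoted in this post; the variable names are purely illustrative.

```python
# Figures taken from this post; the Unicode 5.2 total is derived, not quoted.
AMENDMENT_7 = 225            # characters added by Amendment 7
AMENDMENT_8 = 1_862          # characters added by Amendment 8
RUPEE_SIGN = 1               # U+20B9, fast-tracked outside ISO/IEC 10646
UNICODE_6_0_TOTAL = 109_449  # total characters in Unicode 6.0

new_in_amendments = AMENDMENT_7 + AMENDMENT_8
new_in_unicode_6_0 = new_in_amendments + RUPEE_SIGN
implied_unicode_5_2_total = UNICODE_6_0_TOTAL - new_in_unicode_6_0

print("New in Amendments 7 and 8:", new_in_amendments)
print("New in Unicode 6.0:", new_in_unicode_6_0)
print("Implied Unicode 5.2 total:", implied_unicode_5_2_total)
```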

Because of problems with the fonts for the CJK-B block, the 2nd edition of ISO/IEC 10646 will have a multi-column format for the CJK, CJK-A, CJK-C and CJK-D blocks, but the large CJK-B block (42,711 characters) will be presented in a single column format with a single font. In order to rectify this failing at the earliest opportunity, it has been decided to immediately start work on yet another new edition of the standard (the 3rd edition) instead of publishing a series of amendments as is normally the case. A summary of the additions which will be made to the 3rd edition (which will correspond to the version of Unicode after 6.0) is available here.

Whereas Unicode 5.2 saw the encoding of fifteen new scripts and a total of 6,648 new characters, Unicode 6.0 only has three new scripts (Mandaic, Batak and Brahmi) and a total of 2,087 new characters. Nevertheless, Unicode 6.0 includes some of the most controversial additions to the standard for a long time. In particular, the addition of a large set of characters corresponding to Japanese Emoji 絵文字 used on mobile phones has been the cause of much heated debate (original proposal documents N3582 and N3583). Google and Apple have pushed hard for the encoding of emoji in Unicode in order to solve interoperability issues between the various vendors, who currently use different variants of emoji at different private-use code points. Two groups of emoji in particular have caused a lot of contention.

Firstly, a group of five characters representing specific cultural icons (Mount Fuji, Tokyo Tower, Statue of Liberty, Silhouette of Japan and Statue of Moyai) have been vigorously opposed because they give the appearance of setting a precedent for encoding hundreds of other characters representing cultural or nationalistic icons, such as the Great Wall of China, the Pyramids of Giza, the Eiffel Tower, Tower Bridge, Mount Kilimanjaro, etc. Some of us would have preferred to encode generic versions of these characters (e.g. Snow-Capped Mountain instead of Mount Fuji), but Google insisted that these characters had specific semantics that generic versions of the characters would not be able to represent, so in the end they were accepted as is. Note, however, that they are not precedents for encoding other characters representing cultural icons, as they were not encoded because of the importance of the objects these characters represent, but for interoperability reasons (cross-mapping to existing emoji codes). Of course, if mobile phone vendors start adding emoji for the Great Wall of China, etc. then ....

Secondly, a group of ten characters representing the flags of ten specific countries (People's Republic of China, Germany, Spain, France, the UK, Italy, Japan, Korea, Russia and the US) caused a great deal of consternation, as it seemed unreasonable to encode flag symbols for a few select countries and not for others. Two solutions were put forward to solve the problem. The US proposed encoding them as ten characters named EMOJI COMPATIBILITY SYMBOL-n with a glyph shape comprising EC-n in a dashed box (i.e. completely hide the fact that these characters map to emoji map symbols). On the other hand, Ireland and Germany proposed encoding 256 characters representing all currently assigned ISO 3166 two-letter country codes (see N3680). Neither of these proposals was acceptable to the other parties, and in the end a compromise solution to encode twenty-six "regional indicator symbols" (see N3727) was accepted. These characters may be combined into two-character sequences corresponding to ISO 3166 two-letter country codes, and applications may then render such sequences with the corresponding country flag. Of course, this does not provide a solution for the representation of flags for countries and regions that do not have an ISO 3166 two-letter code. For example, mobile phone vendors may want to display the Welsh flag in order to indicate Welsh language (GB-WLS) options, but could not do so using the currently defined "regional indicator symbols" mechanism.
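
As a rough illustration of how the "regional indicator symbols" mechanism works, the following Python sketch converts a two-letter code into the corresponding two-character sequence. It assumes that the twenty-six symbols are encoded in alphabetical order starting at U+1F1E6 (REGIONAL INDICATOR SYMBOL LETTER A), as shown in the published code charts. The function name regional_indicator_pair is made up for this example, and whether the resulting sequence is actually displayed as a flag depends entirely on the application and font.

```python
# Sketch only: maps an ISO 3166 two-letter code to a pair of regional
# indicator symbols. The function name is hypothetical.

REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def regional_indicator_pair(code: str) -> str:
    """Return the two-character regional indicator sequence for a two-letter code."""
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"expected two letters A-Z, got {code!r}")
    code = code.upper()
    # Each letter A..Z is offset from the first regional indicator symbol.
    return "".join(chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code)


if __name__ == "__main__":
    for country in ("JP", "GB", "US"):
        sequence = regional_indicator_pair(country)
        print(country, " ".join(f"U+{ord(ch):04X}" for ch in sequence))
```

A subdivision code such as GB-WLS is rejected by this function, which mirrors the limitation noted above: the mechanism as defined only covers two-letter codes.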

The encoding of emoji has opened up the standard to the encoding of other related symbols that were traditionally considered outside the scope of character encoding (e.g. transport and map symbols, and symbols for playing cards), so in addition to characters deriving from emoji usage you will find in Unicode 6.0 many other symbols that have been proposed for encoding (see the expanded emoji proposal by Ireland and Germany).



Amendment 7 [225 characters]

Amendment 7 has now completed its two rounds of technical balloting, and so its repertoire (including code points and character names) is stable. Code charts for Amendment 7 are available here.


New Scripts

  • Mandaic {0840..085F} : 29 characters [N3485]
  • Batak {1BC0..1BFF} : 56 characters [N3320]
  • Brahmi {11000..1107F} : 108 characters [N3490, N3491]

New Blocks

  • Kana Supplement {1B000..1B0FF} : one historic katakana letter and one historic hiragana letter [N3388]

Additions to Existing Blocks

  • Cyrillic Supplement {0500..052F} : two letters for Azerbaijani [N3481]
  • Oriya {0B00..0B7F} : six fraction characters [N3471]
  • Malayalam {0D00..0D7F} : two letters for scholarly orthography [N3494]
  • Tifinagh {2D30..2D7F} : one separator mark and one consonant joiner format character [N3482]
  • Latin Extended-D {A720..A7FF} : one orthographic letter and one phonetic letter [N3481]
  • Arabic Presentation Forms-A {FB50..FDFF} : 16 pedagogical symbols (spacing, non-combining symbols corresponding to diacritic marks on Arabic letters) [N3460 and N3460-A]
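
The per-block counts listed above can be summed to confirm the figure of 225 characters given in the Amendment 7 heading. This is only a quick sanity check in Python; the dictionary simply restates the counts from the lists above, with "one X and one Y" entries counted as two characters.

```python
# Character counts per block for Amendment 7, restated from the lists above.
amendment_7 = {
    "Mandaic": 29,
    "Batak": 56,
    "Brahmi": 108,
    "Kana Supplement": 2,               # one katakana + one hiragana letter
    "Cyrillic Supplement": 2,
    "Oriya": 6,
    "Malayalam": 2,
    "Tifinagh": 2,                      # separator mark + consonant joiner
    "Latin Extended-D": 2,              # orthographic + phonetic letter
    "Arabic Presentation Forms-A": 16,
}

total = sum(amendment_7.values())
print("Amendment 7 total:", total)
```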


Amendment 8 [1,862 characters]

Amendment 8 has now completed its two rounds of technical balloting, and so its repertoire (including code points and character names) is stable. Code charts for Amendment 8 are available here.

Please note that the original emoji proposal (N3582/N3583) does not show the final distribution of the proposed characters amongst various existing and new blocks, and underwent extensive changes. If you wish to follow the paper trail from original proposal to final allocation then you should peruse the following documents:

  • N3582 : "Proposal for Encoding Emoji Symbols" (2009-02-06) by Markus Scherer, Mark Davis, Kat Momoi, Darick Tong (Google Inc.), and Yasuo Kida, Peter Edberg (Apple Inc.)
  • N3583 : "Emoji Symbols Proposed for New Encoding" (2009-02-06) by Markus Scherer, Mark Davis, Kat Momoi, Darick Tong (Google Inc.), and Yasuo Kida, Peter Edberg (Apple Inc.)
  • N3585 : "Emoji sources" (2009-02-06) by Markus Scherer
  • N3607 : "Towards an encoding of symbol characters used as emoji" (2009-04-06) by Irish and German National Bodies
  • N3614 : "Response to Concerns Raised in N3607 About Encoding Emoji Characters" (2009-04-09) by Mark Davis, Markus Scherer, Kat Momoi, Darick Tong, Yasuo Kida, Peter Edberg
  • N3619 : "Support Statements from KDDI/AU, SoftBank, and NTT docomo to Google/Apple Emoji Proposal" (2009-04-17) by Kat Momoi (Google Inc.)
  • N3620 : "Japanese translation of Document N3614" (2009-04-17) by Katsuhiko Momoi
  • N3621 : "Japanese translation of Document N3582" (2009-04-17) by Katsuhiko Momoi
  • N3636 : "Emoji Ad-Hoc Meeting Report" (2009-04-22) by Emoji Ad-hoc committee
  • N3671 : "Proposal to encode additional enclosed Latin alphabetic characters to the UCS" (2009-09-16) by Irish and German National Bodies
  • N3680 : "Proposal to encode Symbols for ISO 3166 Two-letter Codes in the UCS" (2009-09-18) by Irish and German National Bodies
  • N3681 : "Background data for Proposal for Encoding Emoji Symbols" (2009-09-17) by Markus Scherer (Google Inc.)
  • N3687 : "Proposal to encode two additional Mailbox Symbols complementing the Emoji set" (2009-09-21) by German National Body
  • N3711 : "A Proposal to Revise a Part of Emoticons in PDAM 8" (2009-10-22) by Katsuhiro Ogata, Koichi Kamichi, Shigeki Moro, Taichi Kawabata, Yasushi Naoi
  • N3712 : "Emoji sources" (2009-10-21) by Markus Scherer
  • N3713 : "Comment on 'A proposal to Revise a Part of Emoticons in PDAM 8'" (2009-10-22) by Karl Pentzlin
  • N3722 : "Disposition of comments on SC2 N 4078 (PDAM text for Amendment 8 to ISO/IEC 10646:2003)" (2009-10-26) by Michel Suignard (project editor)
  • N3726 : "Emoji Ad-Hoc Meeting Report" (2009-10-27) by Emoji Ad-hoc committee
  • N3727 : "Proposal to encode Regional Indicator Symbols in the UCS" (2009-10-28) by Michael Everson and Ken Whistler
  • N3728 : "Emoji sources" (2009-10-28) by Markus Scherer
  • N3769 : "Proposal to encode an emoticon "Neutral Face" in the UCS" (2010-01-26) by Karl Pentzlin
  • N3776 : "DoCoMo Input on Emoji" (2010-03-08) by Japanese National Body
  • N3777 : "KDDI Input on Emoji" (2010-03-08) by Japanese National Body
  • N3778 : "Updated Proposal to Change Some Glyphs and Names of Emoticons" (2010-03-03) by Japanese National Body
  • N3783 : "Willcom Input on Emoji" (2010-03-08) by Japanese National Body
  • N3826 : "Emoticons for FDIS 8" (2010-04-22) by Michael Everson
  • N3828 : "Disposition of comments on SC2 N 4123 (FPDAM text for Amendment 8 to ISO/IEC 10646:2003)" (2010-04-22) by Michel Suignard (project editor)
  • N3829 : "Emoji Ad-Hoc Meeting Report" (2010-04-21) by Emoji Ad-hoc committee
  • N3835 : "Emoji sources" (pending) by Markus Scherer

New Scripts

  • No new scripts

New Blocks

  • Ethiopic Extended-A {AB00-AB2F} : 32 syllables for Gamo-Gofa-Dawro, Basketo and Gumuz [N3572]
  • Bamum Supplement {16800-16A3F} : 569 historical letters [N3597]
  • Playing Cards {1F0A0-1F0FF} : 59 symbols for standard playing cards [N3607]
  • Miscellaneous Pictographic Symbols {1F300-1F5FF} : 529 characters (covering everything from the Statue of Liberty to a pile of poo) [N3583]
  • Emoticons {1F600-1F64F} : 63 symbols for human and cat faces showing all sorts of emotions [N3583, N3607, N3769]
  • Transport and Map symbols {1F680-1F6FF} : 70 characters [N3583, N3607]
  • Alchemical Symbols {1F700-1F77F} : 116 alchemical symbols [N3584]
  • CJK Unified Ideographs Extension D {2B740-2B81F} : 222 characters (originally 223 characters, but the original U+2B779 has now been removed, and the following characters moved up by one) [N3560, China Evidence, Japan Evidence, Unicode Evidence, Taiwan Evidence]
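
For anyone who wants to check which of these new blocks a given code point falls into, here is a minimal Python sketch using the ranges listed above. The block names are spelled as in this post (which may differ from the final names in the standard), the function name new_block_of is hypothetical, and a code point being inside a block range does not mean that a character is actually assigned there.

```python
from bisect import bisect_right

# (start, end, name) for the new Amendment 8 blocks listed above, sorted by start.
NEW_BLOCKS = [
    (0xAB00, 0xAB2F, "Ethiopic Extended-A"),
    (0x16800, 0x16A3F, "Bamum Supplement"),
    (0x1F0A0, 0x1F0FF, "Playing Cards"),
    (0x1F300, 0x1F5FF, "Miscellaneous Pictographic Symbols"),
    (0x1F600, 0x1F64F, "Emoticons"),
    (0x1F680, 0x1F6FF, "Transport and Map symbols"),
    (0x1F700, 0x1F77F, "Alchemical Symbols"),
    (0x2B740, 0x2B81F, "CJK Unified Ideographs Extension D"),
]
STARTS = [start for start, _, _ in NEW_BLOCKS]


def new_block_of(code_point: int):
    """Return the name of the new block containing code_point, or None."""
    # Find the last block whose start is <= code_point, then check its end.
    index = bisect_right(STARTS, code_point) - 1
    if index >= 0:
        _, end, name = NEW_BLOCKS[index]
        if code_point <= end:
            return name
    return None


if __name__ == "__main__":
    for cp in (0x1F4A9, 0x1F600, 0x2B740, 0x0041):
        print(f"U+{cp:04X}", new_block_of(cp))
```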

Additions to Existing Blocks

  • Arabic {0600-06FF} : two characters for Kashmiri [N3673]
  • Devanagari {0900-097F} : ten vowel letters and vowel signs for Kashmiri [N3480, N3710, N3731]
  • Malayalam {0D00-0D7F} : one historic letter [N3676]
  • Tibetan {0F00-0FFF} : four Kalacakra letters [N3568] and two annotation marks [N3569]
  • Ethiopic {1200-137F} : two vowel length marks [N3572]
  • Batak {1BC0-1BFF} : two symbols [N3320] (removed to the next edition)
  • Combining Diacritical Marks Supplement {1DC0-1DFF} : one double combining mark for the Uralic Phonetic Alphabet [N3571]
  • Superscripts and Subscripts {2070-209F} : eight subscript letters for Uralic Phonetic Alphabet [N3571]
  • Miscellaneous Technical {2300-23FF} : eleven user interface symbols and time symbols [N3583]
  • Miscellaneous Symbols {2600-26FF} : four pentagram symbols [N3674], one astronomical symbol [N3672] and one zodiacal symbol [N3583]
  • Dingbats {2700-27BF} : two heavy low quotes [N3565] and fourteen miscellaneous symbols [N3583, N3607]
  • Miscellaneous Mathematical Symbols-A {27C0-27EF} : two operator symbols [N3677]
  • Bopomofo Extended {31A0-31BF} : three letters for Hmu and Ge [N3570]
  • Cyrillic Extended-B {A640-A69F} : two letters for Birch-Bark writing [N3563]
  • Latin Extended-D {A720-A7FF} : one letter for the Uralic Phonetic Alphabet [N3571], two letters for the Janalif alphabet [N3581], ten old Latvian letters [N3587], and one middle dot letter [N3567] (removed to the next edition)
  • Enclosed Alphanumeric Supplement {1F100-1F1FF} : 106 enclosed letters and letter sequences [N3583], including 26 "regional indicator symbols" [N3727]
  • Enclosed Ideographic Supplement {1F200-1F2FF} : 13 enclosed ideographs [N3583]


Unicode 6.0 Fonts

The following are some free or shareware fonts that include some of the characters added in Unicode 6.0:

  • BabelStone Han version 1.05 (covers CJK Unified Ideographs Extension D, Kana Supplement, Enclosed Ideographic Supplement, and Bopomofo Extended)
  • HanaMin version 2010-10-13 (covers CJK Unified Ideographs Extension D and Kana Supplement)
  • Symbola version 6.01 (covers Alchemical Symbols, Emoticons, Miscellaneous Symbols and Pictographs, Playing Cards, Transport and Map Symbols and other symbol characters introduced in Unicode 6.0)

In addition, the following fonts include the newly-invented Indian Rupee Sign U+20B9 ₹:

  • DejaVu version 2.32
  • Ubuntu

And if you have the fonts and want to look through all the 109,384 characters in Unicode 6.0, check out my Unicode Slide Show.


15 comments:

Christoph said...

I believe some of your links are broken:
- 'CJK Unified Ideographs Extension D' does say N3560 but links to N3584,
- 'Enclosed Ideographic Supplement', I couldn't find any info for this search term with the linked documents

Andrew West said...

'CJK Unified Ideographs Extension D' does say N3560 but links to N3584

Fixed.

'Enclosed Ideographic Supplement', I couldn't find any info for this search term with the linked documents

The original emoji proposal (N3582/N3583) does not show the final distribution of the proposed characters amongst various existing and new blocks by the committees involved. So, although N3583 includes nine squared ideographs and two circled ideographs, it does not include the term "Enclosed Ideographic Supplement".

For anyone who is morbidly interested in the details of the emoji proposal, I have now added a list of all relevant documents.

jedi787plus said...

Amendment 8
• Dingbats {2700-27BF} : sixteen miscellaneous symbols [N3583, N3607]


I think you are forgetting one more source: N3565 regarding the two heavy low quotes for German. Those quotes are not mentioned at all in N3583, and in N3607 they appear as already-encoded characters (rather than new yellow-colored proposed characters).

Andrew West said...

Thanks, fixed.

fgrosshans said...

For Arabic pedagogical symbols, I think a link to N3460-A ( http://std.dkuug.dk/JTC1/SC2/wg2/docs/n3460-A.pdf ) would be more appropriate, since that's where the detailed proposal is.

Alex said...

The Unicode Pipeline says the Indian Rupee Sign will be in Unicode 6.0 at U+20B9

Andrew West said...

Yes, thanks for pointing that out. The Indian Rupee Sign (as proposed in document N3862) has been accepted for fast-tracking into Unicode 6.0, which means that Unicode and ISO/IEC 10646 will be out of sync until the next (3rd) edition of ISO/IEC 10646 is published as it is too late to add it to the current edition of ISO/IEC 10646. I will update this post accordingly.

Emmanuel Vallois said...

Some other fonts that partially support Unicode 6.0:
- Symbola, by George Douros, available from Unicode Fonts for Ancient Scripts, which supports additions to the Superscripts and Subscripts, Miscellaneous Technical, Miscellaneous Symbols, Dingbats, Miscellaneous Mathematical Symbols-A blocks, and the newly added Playing Cards block
- DejaVu 2.32 supports the new Indian Rupee sign, as well as U+A78D Latin capital letter turned H, and the latest snapshot contains the Playing Cards block and U+26E2 Astronomical symbol for Uranus

Alex said...

Emmanuel Vallois has left a new comment on the post "What's new in Unicode 6.0 ?":

Some other fonts that partially support Unicode 6.0:
- Symbola, by George Douros, available from Unicode Fonts for Ancient Scripts, which supports additions to the Superscripts and Subscripts, Miscellaneous Technical, Miscellaneous Symbols, Dingbats, Miscellaneous Mathematical Symbols-A blocks, and the newly added Playing Cards block
- DejaVu 2.32 supports the new Indian Rupee sign, as well as U+A78D Latin capital letter turned H, and the latest snapshot contains the Playing Cards block and U+26E2 Astronomical symbol for Uranus

Frédéric Grosshans said...

The newly released ubuntu font also supports the Rupee sign.

Alex said...

http://dejavu.sourceforge.net/snapshots/ is the link to the déjà vu font snapshots

Andrew West said...

Thanks for the suggestions -- I've added links to the DejaVu and Ubuntu fonts in the blog post.

Frédéric Grosshans said...

Do you have any plan for a "What's new in Unicode 6.1 ?" post ?

Andrew West said...

I plan to discuss Unicode 6.1 on June 12th.

Andrew West said...

One day late -- What's new in Unicode 6.1 ?