Back in November 2005 I asked What's new in Unicode 5.0 ? in anticipation of its release in July of the following year. Now that Unicode 5.0 has been out for nearly a year I thought it would be a good time to look ahead to what is in store for Unicode 5.1. Just to be clear, Unicode 5.1 won't be released until the spring or summer of 2008, but the character repertoire is already basically fixed, and there are unlikely to be any major changes (but if there are I will update this post). Well, in the end there was one major change -- see the addendum at the bottom of the page [2007-10-19]. See the bottom of the post for a list of fonts with Unicode 5.1 coverage.
The additions to Unicode 5.1 will correspond to Amendments 3 and 4 of ISO/IEC 10646:2003. A total of 1,102 new characters are added in Amd.3, although four (U+097B, U+097C, U+097E and U+097F) are already in Unicode 5.0, and a total of 526 new characters are expected to be added in Amd.4, so that Unicode 5.1 will have 1,624 additional characters compared with Unicode 5.0, making a grand total of 100,713 encoded characters (graphic, format and control characters) in Unicode, breaking the 100K mark for the first time (and for all those who are worried that 17 planes are just not enough, that still leaves room for another 873,817 characters).
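The totals in the paragraph above can be checked with a few lines of Python. This is only a sketch: the amendment figures come from this post, while the breakdown of the code space (surrogates, noncharacters and private use code points treated as unavailable) is an assumption about how the "room for another" figure was derived, although it does reproduce the number quoted.

```python
# Figures taken from the post.
AMD3_NEW = 1102            # new characters in Amd.3
AMD3_ALREADY_IN_5_0 = 4    # U+097B, U+097C, U+097E, U+097F
AMD4_NEW = 526             # new characters in Amd.4
TOTAL_5_1 = 100713         # graphic, format and control characters

added_in_5_1 = AMD3_NEW - AMD3_ALREADY_IN_5_0 + AMD4_NEW
total_5_0 = TOTAL_5_1 - added_in_5_1   # derived, not quoted in the post

# Assumed breakdown of the code space (not stated in the post).
CODE_POINTS = 17 * 0x10000             # 17 planes of 65,536 code points
SURROGATES = 2048                      # U+D800..U+DFFF
NONCHARACTERS = 66                     # U+FDD0..U+FDEF plus 2 per plane
PRIVATE_USE = 6400 + 2 * 65534         # BMP private use area plus planes 15 and 16

room_left = (CODE_POINTS - SURROGATES - NONCHARACTERS
             - PRIVATE_USE - TOTAL_5_1)

print(f"added in 5.1: {added_in_5_1:,}")
print(f"implied 5.0 total: {total_5_0:,}")
print(f"room left: {room_left:,}")
```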
The additions for 5.1 are not as controversial as those for 5.0, and may not be as exciting as 5.2 promises to be, but they will include eleven new scripts [Lanna now postponed to Amd.5], which nearly equals 3.0 as being the largest number of scripts added in a single version of Unicode. From 5.1 Unicode will cover 75 scripts (including Braille, which is classified as a script in Unicode), as shown in the table below. Regular readers of my blog will realise that there are still many more historic and less common scripts waiting to be encoded.
| Script Name | ISO 15924 | Characters* in 5.0 | Characters* in 5.1 | Version Introduced into Unicode |
|---|---|---|---|---|
| Arabic | Arab | 966 | 999 | 1.0 |
| Armenian | Armn | 90 | 90 | 1.0 |
| Balinese | Bali | 121 | 121 | 5.0 |
| Bengali | Beng | 91 | 91 | 1.0 |
| Bopomofo | Bopo | 64 | 65 | 1.0 |
| Braille | Brai | 256 | 256 | 3.0 |
| Buginese | Bugi | 30 | 30 | 4.1 |
| Buhid | Buhd | 20 | 20 | 3.2 |
| Canadian Aboriginal | Cans | 630 | 630 | 3.0 |
| Carian | Cari | 0 | 49 | 5.1 |
| Cham | Cham | 0 | 83 | 5.1 |
| Cherokee | Cher | 85 | 85 | 3.0 |
| Coptic | Copt | 128 | 128 | 1.0 (disunified from Greek in 4.1) |
| Cuneiform | Xsux | 982 | 982 | 5.0 |
| Cypriot | Cprt | 55 | 55 | 4.0 |
| Cyrillic | Cyrl | 277 | 404 | 1.0 |
| Deseret | Dsrt | 80 | 80 | 3.1 |
| Devanagari | Deva | 107 | 107 | 1.0 |
| Ethiopic | Ethi | 461 | 461 | 3.0 |
| Georgian | Geor | 120 | 120 | 1.0 |
| Glagolitic | Glag | 94 | 94 | 4.1 |
| Gothic | Goth | 27 | 27 | 3.1 |
| Greek | Grek | 506 | 511 | 1.0 |
| Gujarati | Gujr | 83 | 83 | 1.0 |
| Gurmukhi | Guru | 77 | 79 | 1.0 |
| Han | Hani | 71,570 | 71,578 | 1.0 |
| Hangul | Hang | 11,620 | 11,620 | 1.0 (relocated in 2.0) |
| Hanunoo | Hano | 21 | 21 | 3.2 |
| Hebrew | Hebr | 133 | 133 | 1.0 |
| Hiragana | Hira | 89 | 89 | 1.0 |
| Kannada | Knda | 86 | 84 | 1.0 |
| Katakana | Kana | 164 | 299 | 1.0 |
| Kayah Li | Kali | 0 | 48 | 5.1 |
| Kharoshthi | Khar | 65 | 65 | 4.1 |
| Khmer | Khmr | 146 | 146 | 3.0 |
| Lao | Laoo | 65 | 65 | 1.0 |
| Latin | Latn | 1,070 | 1,241 | 1.0 |
| Lepcha | Lepc | 0 | 74 | 5.1 |
| Limbu | Limb | 66 | 66 | 4.0 |
| Linear B | Linb | 211 | 211 | 4.0 |
| Lycian | Lyci | 0 | 29 | 5.1 |
| Lydian | Lydi | 0 | 27 | 5.1 |
| Malayalam | Mlym | 78 | 95 | 1.0 |
| Mongolian | Mong | 152 | 153 | 3.0 |
| Myanmar | Mymr | 78 | 156 | 3.0 |
| N’Ko | Nkoo | 59 | 59 | 5.0 |
| New Tai Lue | Talu | 80 | 80 | 4.1 |
| Ogham | Ogam | 29 | 29 | 3.0 |
| Ol Chiki | Olck | 0 | 48 | 5.1 |
| Old Italic | Ital | 35 | 35 | 3.1 |
| Old Persian | Xpeo | 50 | 50 | 4.1 |
| Oriya | Orya | 81 | 84 | 1.0 |
| Osmanya | Osma | 40 | 40 | 4.0 |
| Phags-pa | Phag | 56 | 56 | 5.0 |
| Phoenician | Phnx | 27 | 27 | 5.0 |
| Rejang | Rjng | 0 | 37 | 5.1 |
| Runic | Runr | 78 | 78 | 3.0 |
| Saurashtra | Saur | 0 | 81 | 5.1 |
| Shavian | Shaw | 48 | 48 | 4.0 |
| Sinhala | Sinh | 80 | 80 | 3.0 |
| Sundanese | Sund | 0 | 55 | 5.1 |
| Syloti Nagri | Sylo | 44 | 44 | 4.1 |
| Syriac | Syrc | 77 | 77 | 3.0 |
| Tagalog | Tglg | 20 | 20 | 3.2 |
| Tagbanwa | Tagb | 18 | 18 | 3.2 |
| Tai Le | Tale | 35 | 35 | 4.0 |
| Tamil | Taml | 71 | 72 | 1.0 |
| Telugu | Telu | 80 | 93 | 1.0 |
| Thaana | Thaa | 50 | 50 | 3.0 |
| Thai | Thai | 86 | 86 | 1.0 |
| Tibetan | Tibt | 195 | 201 | 1.0 (removed in 1.1 and reintroduced in 2.0) |
| Tifinagh | Tfng | 55 | 55 | 4.1 |
| Ugaritic | Ugar | 31 | 31 | 4.0 |
| Vai | Vaii | 0 | 300 | 5.1 |
| Yi | Yiii | 1,220 | 1,220 | 3.0 |
* Numbers of characters do not necessarily represent the total number of encoded characters used for the script (and are not necessarily the same as the number of characters in the same-named block), but are the number of characters that are uniquely assigned to that script by Unicode (i.e. excluding characters that have the Unicode script property of "common" or "inherited"). Some differences in the figures for particular scripts (e.g. Katakana and Latin) reflect changes in script assignment in Unicode 5.1.
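For anyone who wants to reproduce per-script counts of this kind, the sketch below tallies code points by script from text laid out like the Unicode Character Database file Scripts.txt, skipping "Common" and "Inherited". The sample lines are an assumption: they were typed in for illustration in the style of that file rather than copied from a released version, and the function name `count_by_script` is hypothetical.

```python
import re
from collections import Counter

# Illustrative sample in the layout of Scripts.txt (assumed, not copied
# from the released file): "first..last ; Script" or "codepoint ; Script".
SAMPLE = """\
# comment lines and blank lines are ignored
0020          ; Common # space
10280..1029C  ; Lycian # letters
102A0..102D0  ; Carian # letters
10920..10939  ; Lydian # letters
1093F         ; Lydian # punctuation mark
"""

LINE_RE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")
EXCLUDED = {"Common", "Inherited"}

def count_by_script(text):
    """Count code points uniquely assigned to each script."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # comment or blank line
        script = match.group(3)
        if script in EXCLUDED:
            continue
        first = int(match.group(1), 16)
        # A single code point has no "..last" part.
        last = int(match.group(2) or match.group(1), 16)
        counts[script] += last - first + 1
    return counts

script_counts = count_by_script(SAMPLE)
print(dict(script_counts))
```

With this sample the tallies for Lycian, Carian and Lydian agree with the 5.1 column of the table.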
For me, the highlights of Unicode 5.1 are the encoding of the symbols on the enigmatic Phaistos Disc (first proposed for encoding ten years ago, but delayed because of some opposition to encoding undeciphered symbols found on a unique artefact), and the encoding of a wide range of letters used in medieval manuscripts and early printed books, so that finally texts such as The Calixtus Bull can be represented exactly as they are written. The script that has had the biggest makeover for 5.1 is Myanmar, with changes to the encoding model to finally make it useable, as well as additions to support minority languages such as Mon, S'gaw Karen, Western Pwo Karen, Eastern Pwo Karen, Geba Karen, Kayah, Shan and Rumai Palaung (see Andrew Cunningham's The Myanmar script and Unicode for a useful overview of support for the Myanmar script). And then there are a handful of Tibetan (U+0FCE, U+0FD2..U+0FD4), Mongolian (U+18AA) and CJK (U+9FC3) characters that I am responsible for, which I am of course pleased to see make it into the standard.
Amendment 3
Amendment 3 is now at the FDAM stage of the ISO ballot process, and its repertoire is fixed, so the code points given below can be relied on. The ISO 15924 code for new scripts is given in square brackets, and the number of new characters is given in curly braces.
New Scripts
- Sundanese [Sund] {55} at 1B80..1BBF
- Lepcha [Lepc] {74} at 1C00..1C4F
- Ol Chiki [Olck] {48} at 1C50..1C7F
- Vai [Vaii] {300} at A500..A63F
- Saurashtra [Saur] {81} at A880..A8DF
- Kayah Li [Kali] {48} at A900..A92F
- Rejang [Rjng] {37} at A930..A95F
- Lycian [Lyci] {29} at 10280..1029F
- Carian [Cari] {49} at 102A0..102DF
- Lydian [Lydi] {27} at 10920..1093F
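The ranges in the list above also show how much room each new script block leaves for future additions. The sketch below uses only the figures from that list; the variable names are illustrative.

```python
# (script, first code point, last code point, characters assigned),
# copied from the Amendment 3 list of new scripts.
NEW_SCRIPT_BLOCKS = [
    ("Sundanese",  0x1B80,  0x1BBF,  55),
    ("Lepcha",     0x1C00,  0x1C4F,  74),
    ("Ol Chiki",   0x1C50,  0x1C7F,  48),
    ("Vai",        0xA500,  0xA63F,  300),
    ("Saurashtra", 0xA880,  0xA8DF,  81),
    ("Kayah Li",   0xA900,  0xA92F,  48),
    ("Rejang",     0xA930,  0xA95F,  37),
    ("Lycian",     0x10280, 0x1029F, 29),
    ("Carian",     0x102A0, 0x102DF, 49),
    ("Lydian",     0x10920, 0x1093F, 27),
]

def capacity(first, last):
    """Number of code points in an inclusive range."""
    return last - first + 1

for name, first, last, assigned in NEW_SCRIPT_BLOCKS:
    size = capacity(first, last)
    print(f"{name:<11} {first:05X}..{last:05X} "
          f"{assigned:>3} of {size:>3} used, {size - assigned:>2} free")

total_assigned = sum(assigned for *_, assigned in NEW_SCRIPT_BLOCKS)
full_blocks = [name for name, first, last, assigned in NEW_SCRIPT_BLOCKS
               if assigned == capacity(first, last)]
print("total new script characters:", total_assigned)
print("blocks with no free code points:", full_blocks)
```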
Other New Blocks
- Phaistos Disc {46} at 101D0..101FF
Additions to Existing Blocks
- Greek and Coptic [0370..03FF] {7} : three epigraphical letters (Heta, Archaic Sampi and Pamphylian Digamma); and capital Kai symbol
- Arabic [0600..06FF] {5} : five mathematical symbols
- Arabic Supplement [0750..077F] {16} : additional letters for Khowar, Torwali, and Burushaski
- Devanagari [0900..097F] {6} : four characters for Sindhi (already in Unicode 5.0), high spacing dot mark, Candra A
- Gurmukhi [0A00..0A7F] {2} : Udaat and Yakash signs
- Tamil [0B80..0BFF] {1} : Om symbol
- Telugu [0C00..0C7F] {13} : various letters, signs and fraction digits
- Malayalam [0D00..0D7F] {10} : numbers and fraction symbols; and letters for Sanskrit and date mark
- Tibetan [0F00..0FFF] {6} : two reversed letters used for Balti (discussed in Tibetan Extensions 2 : Balti); one astrological pebble symbol (discussed in Tibetan Extensions 1 : Astrological Pebble Symbols); a double tsheg mark; and a pair of archaic form head marks
- Myanmar [1000..109F] {22} : seven disunified characters (added in order to solve various issues with the Unicode Myanmar model that have up to now prevented widespread adoption of Unicode for representing the Myanmar script); and additions for Mon and S'gaw Karen (plus one overlooked vowel sign for Mon)
- Mongolian [1800..18FF] {1} : additional letter for Manchu transcription of Tibetan (discussed in Manchu Letter LHA)
- Combining Diacritical Marks Supplement [1DC0..1DFF] {28} : superscript letters and combining marks for representing usage in medieval manuscripts and early printed books; and breve-macron and macron-breve for use in Lithuanian dialect notation
- Latin Extended Additional [1E00..1EFF] {9} : various letters for medieval Welsh and Portuguese
- Miscellaneous Symbols [2600..26FF] {11} : ten symbols used in Western Astrology (including symbols for Ceres, Pallas, Juno, Vesta, Chiron and Black Moon Lilith); and Outlined White Star (for Arabic mathematical use)
- Miscellaneous Mathematical Symbols-A [27C0..27EF] {2} : mathematical symbols for Arabic use
- Miscellaneous Symbols and Arrows [2B00..2BFF] {27} : mathematical symbols and arrows for Arabic use; and reversed forms of mirroring arrows
- Latin Extended-C [2C60..2C7F] {12} : phonetic and orthographic letters; phonetic letters used in a dictionary of Swedish dialects in Finland; and additional letters for the Uralic Phonetic Alphabet
- Supplemental Punctuation [2E00..2E7F] {1} : Inverted Interrobang (also known as a gnaborretni)
- CJK Strokes [31C0..31EF] {20} : additional CJK stroke characters (see this page for an explanation of the abstruse naming convention for these characters)
- Modifier Tone Letters [A700..A71F] {5} : modifier letters for phonetic use
- Latin Extended-D [A720..A7FF] {103} : Egyptological letters alef and ain; Mayanist letters (including Tresillo and Cuatrillo, encoded as casing pairs after some extremely bitter arguments over whether they were casing letters or not); a wide range of Medievalist characters, including Insular letterforms (which I was originally opposed to the encoding of), letters used as abbreviations in manuscripts and early printed books (such as thorn with stroke and rum rotunda), and casing forms of the letter R rotunda (discussed in R Rotunda Part 2); and a low circumflex accent (used for Lahu and Akha)
- Musical Symbols [1D100..1D1FF] {1} : Musical Symbol Multiple Measure Rest (added as the glyph associated with the existing U+1D13A MUSICAL SYMBOL MULTI REST is in fact a "breve rest" or "double whole rest")
Amendment 4
Amendment 4 is now at the FPDAM stage of the ISO ballot process, and its repertoire is unlikely to change significantly, but there may be changes, and the code point allocations could possibly change. The ISO 15924 code for new scripts is given in square brackets, and the number of new characters is given in curly braces.
New Scripts
- Cham [Cham] {83} at AA00..AA5F
Other New Blocks
- Cyrillic Extended-A {32} at 2DE0..2DFF (combining characters and marks for Early Slavic)
- Cyrillic Extended-B {78} at A640..A69F (additional letters for Abkhaz)
- Ancient Symbols {12} at 10190..101CF (Roman weights and monetary signs)
- Mahjong Tiles {44} at 1F000..1F02F
- Domino Tiles {100} at 1F030..1F09F
Additions to Existing Blocks
- Cyrillic [0400..04FF] {1} : combining Pokrytie
- Cyrillic Supplementary [0500..052F] {16} : additional letters for Mordvin, Kurdish, Aleut and Chuvash
- Arabic [0600..06FF] {10} : five letters for early Persian and one for Azerbaijani; and four characters for Qur'anic Arabic
- Arabic Supplement [0750..077F] {2} : two more letters for early Persian
- Oriya [0B00..0B7F] {3} : characters needed to complete the set of vocalic liquids
- Malayalam [0D00..0D7F] {7} : six Chillu letters; and one character needed to complete the set of vocalic liquids
- Myanmar [1000..109F] {56} : additions for Karen, Kayah, Shan and Palaung (now including a further 17 for Shan)
- Latin Extended Additional [1E00..1EFF] {1} : Capital Sharp S (somewhat controversial!)
- General Punctuation [2000..206F] {1} : Invisible Plus
- Combining Diacritical Marks for Symbols [20D0..20FF] {1} : Combining Asterisk Above
- Letterlike Symbols [2100..214F] {1} : Samaritan text symbol
- Number Forms [2150..218F] {4} : Roman numerals
- Miscellaneous Symbols [2600..26FF] {4} : Draughts pieces (Checkers pieces)
- Miscellaneous Mathematical Symbols-A [27C0..27EF] {3} : mathematical symbols
- Miscellaneous Symbols and Arrows [2B00..2BFF] {24} : symbols and arrows
- Supplemental Punctuation [2E00..2E7F] {22} : Palm Branch mark (also known as a ramulus); Medievalist punctuation marks; and Vertical Tilde (for Early Slavic)
- Bopomofo [3100..312F] {1} : a little-used letter (used to represent the inherent vowel in ㄓ zhi, ㄔ chi, ㄕ shi and ㄖ ri)
- CJK Unified Ideographs [4E00..9FFF] {8} : seven new characters (which I will discuss further in my next post); and a character created by the disunification of U+4039 (discussed in Vanished in the Twinkling of an Eye)
- Latin Extended-D [A720..A7FF] {9} : five Roman epigraphic letters; two modifier letters and casing forms of the letter Saltillo (Saltillo is an apostrophe-like letter used to represent a glottal stop in Mixtec and many other languages)
- Combining Half Marks [FE20..FE2F] {3} : combining macron marks (for use primarily in Coptic)
What's Not in Unicode 5.1
Egyptian Hieroglyphs (an initial set of 1,063 characters corresponding to Gardiner's Sign List) are not in 5.1, but are in Amd.5 which is currently undergoing its first ballot, and should correspond to Unicode 5.2 (there will probably be several minor versions before Unicode 6.0 is published). Other scripts that are in Amd.5 are Meitei Mayek, Bamum (removed for further study), Tai Viet and Avestan. Amd.5 also includes two new blocks for a set of controversial Old Hangul Jamo.
Not yet ready for inclusion in Unicode 5.2 is Tangut. A first proposal has now been submitted to the UTC, but has not yet reached WG2. Because of the complexity of the Tangut repertoire and probable issues about "ownership" of the script, it may take some time to reach an agreement on encoding Tangut, and so may not be in Unicode for a few more versions yet. [Well, I was wrong about that—it has made it into Amd.6 which means that it is scheduled for inclusion in Unicode 5.2]
However, the big and unexpected hole in 5.1 (Amd.4) is CJK-C, which is the first installment of the tens of thousands of additional Han characters submitted for encoding by members of the Ideographic characters Rapporteur Group (IRG). This set of 4,219 CJKV ideographs was included in PDAM4, but was moved from Amd.4 to Amd.5 at the last WG2 meeting (in Frankfurt at the end of April). I will look at CJK-C in more detail in my next post.
Addendum [2007-10-19]
At the WG2 meeting in Hangzhou last month (which I had hoped to attend if it was in Ürümqi as originally planned) two important changes to the Amd.4 repertoire were made.
Firstly, 17 additional Myanmar characters (including 10 Shan digits) were added in order to complete the extensions to the Myanmar script required to support the Shan language.
Secondly, the agreement on encoding the Lanna script achieved at the Frankfurt WG2 meeting in the Spring fell apart, with China demanding significant changes to the proposal. The end result was that Lanna was removed from Amd.4, and put back to Amd.5 (this will mean that it will miss the train for Unicode 5.1 next year). In addition, the script name is to be changed to TAI THAM due to objections to the name "Lanna" by China. (There have been a lot of disputes over script names recently, with user communities objecting to traditional English script names such as Pollard and Fraser.)
So now the repertoires of Amds. 3 and 4 have been finalised, and consequently the contents of Unicode 5.1 are now fixed, and will be going beta in the Spring. However, I think that Amd.5 is going to be the interesting one, as it includes both CJK-C and Egyptian hieroglyphs (but with Bamum removed by request of the user community, and Meitei Mayek removed due to fierce differences of opinion on danda disunification within WG2).
Unicode 5.1 Fonts [2008-04-28]
Now that Unicode 5.1 has been released (April 2008) a lot of people want to be able to make use of all the new scripts and characters, but obviously can't if they don't have any fonts that support the new Unicode 5.1 characters. So here is a list of some freeware and shareware fonts that do have Unicode 5.1 coverage (Unicode 5.1 coverage in brackets):
- Aegean (Ancient Symbols, Carian, Lycian, Phaistos Disc)
- Aegyptus (Lydian)
- Code2000 (Cham, Cyrillic, Cyrillic Extended-B, Greek, Kayah Li, Latin Extended Additional, Latin Extended-C, Latin Extended-D, Myanmar, Ol Chiki, Rejang, Saurashtra, Supplemental Punctuation, Vai)
- Code2001 (Domino Tiles, Phaistos Disc)
- Everson Mono (Ancient Symbols, Combining Diacritical Marks Supplement, Cyrillic, Cyrillic Extended-A, Cyrillic Extended-B, Greek, Latin Extended Additional, Latin Extended-C, Latin Extended-D, Phaistos Disc, Supplemental Punctuation)
- Padauk (Myanmar)
- RomanCyrillic Std and CampusRoman Std (Ancient Symbols, Cyrillic Extended-A, Cyrillic Extended-B)
- Sundanese Unicode (Sundanese)
- Unicode Symbols (Domino Tiles, Mahjong Tiles)
On Beyond Unicode 5.1 ...
And finally, if you are interested in what will be in the next version of Unicode after 5.1, take a look at What's new in Unicode 5.2 ?.

7 comments:
I still say the damned thing is a board game, and when we find a cache of several hundred more all identical, the encoding universe is going to look pretty stoopid. But what the (@#$* do I know?
Yeah, I agree it could well be a board game, but in my opinion the symbols on it need encoding in order that people are able to discuss it -- whether in futile attempts to decipher it or in order to describe the rules of the game. If it does turn out to be a game board then encoding the signs on it is little different from encoding Mahjong and Domino tiles. In any case Unicode will not define Phaistos Disc symbols as a script -- the characters are just there for people to use for whatever reason they want. So I don't think it makes the standard look stoopid.
You're welcome. :-)
Well, whatever; I guess once we go down the road of encoding game symbols, we might as well do the Disk too.
I have this board game called "Ur" somewhere -- the rules and pieces are modern inventions, but the board game is authentically Sumerian. Googling for "Ur" "board game" will find a sufficiency of information and pictures.
I guess we all have to be angry about something in Unicode ;)
Can you tell me which is the font for Saurashtra script A880.
Code2000 is the only font I know that supports Saurashtra.