Otto Stolz asked on the Unicode List how many Unicode characters there were, classified as control characters, format characters, graphic characters, private use characters, noncharacters, surrogate code points, etc. Now I love Unicode facts, figures and trivia, so I can't resist trying to answer this question, even if I only end up confusing everybody.
The "Unicode Version History" utility of BabelMap provides precisely the information requested by Otto for all versions of Unicode from 1.0.0 up to the current version (4.1 when I first wrote this post, but now updated to 5.1). This information is tabulated below:
| Version | 1.0.0 | 1.0.1 | 1.1 | 2.0 | 2.1 | 3.0 | 3.1 | 3.2 | 4.0 | 4.1 | 5.0 | 5.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | October 1991 | June 1992 | June 1993 | July 1996 | May 1998 | September 1999 | March 2001 | March 2002 | April 2003 | March 2005 | July 2006 | April 2008 |
| Scripts | 24 | 25 | 24 | 25 | 25 | 38 | 41 | 45 | 52 | 59 | 64 | 75 |
| Blocks | 57 | 59 | 63 | 67 | 67 | 86 | 95 | 107 | 122 | 142 | 151 | 168 |
| Total Code Points | 65,536 | 65,536 | 65,536 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 |
| Assigned Code Points | 12,795 | 34,505 | 40,635 | 178,500 | 178,502 | 188,809 | 233,787 | 234,803 | 236,029 | 237,302 | 238,671 | 240,295 |
| Unassigned Code Points | 52,741 | 31,031 | 24,901 | 935,612 | 935,610 | 925,303 | 880,325 | 879,309 | 878,083 | 876,810 | 875,441 | 873,817 |
| Encoded Characters | 7,161 | 28,359 | 34,233 | 38,950 | 38,952 | 49,259 | 94,205 | 95,221 | 96,447 | 97,720 | 99,089 | 100,713 |
| Private Use Characters | 5,632 | 6,144 | 6,400 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 |
| Noncharacters | 2 | 2 | 2 | 34 | 34 | 34 | 66 | 66 | 66 | 66 | 66 | 66 |
| Surrogate Code Points | 0 | 0 | 0 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 |
| Graphic Characters | 7,085 | 28,283 | 34,151 | 38,867 | 38,869 | 49,168 | 94,009 | 95,023 | 96,243 | 97,515 | 98,884 | 100,507 |
| Format Characters | 2 | 2 | 2 | 18 | 18 | 26 | 131 | 133 | 139 | 140 | 140 | 141 |
| Control Characters | 74 | 74 | 80 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 |
Note that for historic versions of Unicode the statistics are based on the General Category of the characters at the time of encoding, and do not take into account any subsequent changes in General Category. Thus the fact that 4.0 has 139 format characters and 4.1 has 140 format characters is not due to a new format character having been added in 4.1, but rather due to the General Category of U+200B ZERO WIDTH SPACE having been changed from Zs to Cf in Unicode 4.0.1. Note that the statistics for 1.0.0 and 1.0.1 are based upon Ken Whistler's reconstructed Unicode Character Data.
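The General Category change described above can be looked at from Python, whose standard `unicodedata` module ships the current character database alongside a frozen Unicode 3.2.0 snapshot (`unicodedata.ucd_3_2_0`). This is only a minimal sketch: the variable names are made up for illustration, and the Unicode version reported depends on the Python build being used.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# General Category according to the database bundled with this Python build.
current_version = unicodedata.unidata_version
current_category = unicodedata.category(ZWSP)

# General Category according to the frozen Unicode 3.2.0 snapshot.
# No expected value is hard-coded here; compare the printed result
# with the account given above.
old_version = unicodedata.ucd_3_2_0.unidata_version
old_category = unicodedata.ucd_3_2_0.category(ZWSP)

print(f"Unicode {old_version}: U+200B is {old_category}")
print(f"Unicode {current_version}: U+200B is {current_category}")
```

If the two printed categories differ, a per-version count based on "the category at the time of encoding" will differ from one based on today's data, which is exactly the effect noted for the 4.0 and 4.1 columns.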
To help understand what we're talking about, here are some definitions of some of the terms used in the table (see Section 2.4 of the Unicode Standard for further information).
- Control characters are those characters with a General Category of Cc. These are invisible characters that perform a certain function that is defined by a protocol or standard other than Unicode (they are inherited from pre-existing 8-bit standards). They include familiar characters such as Tab, Carriage Return and Line Feed that are essential to writing Unicode text (U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR never took off as Unicode replacements for Carriage Return and/or Line Feed), as well as many characters that you should never see in plain text (the full range of control characters is 0000..001F and 007F..009F).
- Format characters are those characters with a General Category of Cf, Zl or Zp. These are invisible characters defined by Unicode for a particular function. These include things like U+200D ZERO WIDTH JOINER, U+202D LEFT-TO-RIGHT OVERRIDE, interlinear annotation characters (FFF9..FFFB) and the infamous set of Tag characters (E0001 and E0020..E007F). They work behind the scenes to do useful things like bidirectional control and character shaping.
- Graphic characters are those characters with a General Category other than Cc, Cn, Co, Cs, Cf, Zl and Zp, that is to say ordinary visible characters.
- Noncharacters are code points that are permanently reserved, and are guaranteed never to be assigned as characters. They are the thirty-four code points of the form nFFFE and nFFFF (the last two code points of each of the seventeen planes), as well as the thirty-two code points in the range FDD0..FDEF.
- Encoded characters = Graphic characters + Format characters + Control characters.
- Assigned code points = Encoded characters + Private Use characters + Noncharacters + Surrogate code points.
- Total code points = Assigned code points + Unassigned code points.
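
The definitions above can be sketched in Python as a way of reproducing this kind of tally. This is not how BabelMap computes its figures; it is an illustrative sketch using the standard `unicodedata` module, so the graphic, format and unassigned counts reflect whatever Unicode version the Python build bundles rather than any column of the table. The helper names `is_noncharacter` and `classify` are invented for this example. The last few lines re-check the three identities against the 5.1 column of the table.

```python
import unicodedata
from collections import Counter


def is_noncharacter(cp):
    """True for FDD0..FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp):
    """Map a code point to one of the classes used in the table."""
    if is_noncharacter(cp):  # checked first: noncharacters are reported as Cn
        return "noncharacter"
    gc = unicodedata.category(chr(cp))
    if gc == "Cc":
        return "control"
    if gc in ("Cf", "Zl", "Zp"):
        return "format"
    if gc == "Co":
        return "private use"
    if gc == "Cs":
        return "surrogate"
    if gc == "Cn":
        return "unassigned"
    return "graphic"


# Walk the whole code space, 0000..10FFFF (takes a second or two).
counts = Counter(classify(cp) for cp in range(0x110000))

encoded = counts["graphic"] + counts["format"] + counts["control"]
assigned = (encoded + counts["private use"]
            + counts["noncharacter"] + counts["surrogate"])
total = assigned + counts["unassigned"]

print("Unicode version:", unicodedata.unidata_version)
for name, n in sorted(counts.items()):
    print(f"{name:>13}: {n:,}")
print(f"{'encoded':>13}: {encoded:,}")
print(f"{'assigned':>13}: {assigned:,}")
print(f"{'total':>13}: {total:,}")

# Re-check the identities with the figures from the 5.1 column of the table.
graphic_51, format_51, control_51 = 100_507, 141, 65
encoded_51 = graphic_51 + format_51 + control_51   # Encoded Characters row
assigned_51 = encoded_51 + 137_468 + 66 + 2_048    # Assigned Code Points row
total_51 = assigned_51 + 873_817                   # Total Code Points row
```

The control, private use, noncharacter and surrogate counts are the same in every column from 3.1 onwards, so those four should match the table whichever recent Unicode version the interpreter carries.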

2 comments:
What about Phoenician? Is it in or out?
Suzanne
In. Please see today's blog "What's new in Unicode 5.0".