Thursday, 24 November 2005

How many Unicode characters are there?

[Short answer: 113,021]


Otto Stolz asked on the Unicode List how many Unicode characters there were, classified as control characters, format characters, graphic characters, private use characters, noncharacters, surrogate code points, etc. Now I love Unicode facts, figures and trivia, so I can't resist trying to answer this question.

The "Unicode Version History" utility of BabelMap provides precisely the information requested by Otto for all versions of Unicode from 1.0.0 up to the current version (4.1 when I first wrote this post, but now updated to 7.0). This information is tabulated below:


[Unicode Slide Show : 112,956 characters, one at a time]

[BabelMap Online : Javascript Unicode Character Map]


BabelMap: Unicode Version History

| Version | 1.0.0 | 1.0.1 | 1.1 | 2.0 | 2.1 | 3.0 | 3.1 | 3.2 | 4.0 | 4.1 | 5.0 | 5.1 | 5.2 | 6.0 | 6.1 | 6.2 | 6.3 | 7.0 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Date | October 1991 | June 1992 | June 1993 | July 1996 | May 1998 | September 1999 | March 2001 | March 2002 | April 2003 | March 2005 | July 2006 | April 2008 | October 2009 | October 2010 | January 2012 | September 2012 | September 2013 | June 2014 |
| Scripts | 24 | 25 | 24 | 25 | 25 | 38 | 41 | 45 | 52 | 59 | 64 | 75 | 90 | 93 | 100 | 100 | 100 | 123 |
| Blocks | 57 | 59 | 63 | 67 | 67 | 86 | 95 | 107 | 122 | 142 | 151 | 168 | 194 | 206 | 217 | 217 | 217 | 249 |
| Total Code Points | 65,536 | 65,536 | 65,536 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 | 1,114,112 |
| Assigned Code Points | 12,795 | 34,505 | 40,635 | 178,500 | 178,502 | 188,809 | 233,787 | 234,803 | 236,029 | 237,302 | 238,671 | 240,295 | 246,943 | 249,031 | 249,763 | 249,764 | 249,769 | 252,603 |
| Unassigned Code Points | 52,741 | 31,031 | 24,901 | 935,612 | 935,610 | 925,303 | 880,325 | 879,309 | 878,083 | 876,810 | 875,441 | 873,817 | 867,169 | 865,081 | 864,349 | 864,348 | 864,343 | 861,509 |
| Encoded Characters | 7,161 | 28,359 | 34,233 | 38,950 | 38,952 | 49,259 | 94,205 | 95,221 | 96,447 | 97,720 | 99,089 | 100,713 | 107,361 | 109,449 | 110,181 | 110,182 | 110,187 | 113,021 |
| Private Use Characters | 5,632 | 6,144 | 6,400 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 | 137,468 |
| Noncharacters | 2 | 2 | 2 | 34 | 34 | 34 | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 | 66 |
| Surrogate Code Points | 0 | 0 | 0 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 | 2,048 |
| Graphic Characters | 7,085 | 28,283 | 34,151 | 38,867 | 38,869 | 49,168 | 94,009 | 95,023 | 96,243 | 97,515 | 98,884 | 100,507 | 107,154 | 109,242 | 109,975 | 109,976 | 109,975 | 112,804 |
| Format Characters | 2 | 2 | 2 | 18 | 18 | 26 | 131 | 133 | 139 | 140 | 140 | 141 | 142 | 142 | 141 | 141 | 147 | 152 |
| Control Characters | 74 | 74 | 80 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 | 65 |
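
As a quick sanity check, the rows of the table should add up according to the relationships listed in the definitions further down (encoded = graphic + format + control, and so on). The following is only a minimal Python sketch: the figures for three sample versions are copied by hand from the table, and the `TABLE` layout and `check` helper are illustrative names, not anything provided by BabelMap.

```python
# Figures copied by hand from the table for three sample versions.
# Tuple order: graphic, format, control, private use, noncharacters,
# surrogates, encoded, assigned, unassigned, total.
TABLE = {
    "1.0.0": (7085, 2, 74, 5632, 2, 0, 7161, 12795, 52741, 65536),
    "4.1": (97515, 140, 65, 137468, 66, 2048, 97720, 237302, 876810, 1114112),
    "7.0": (112804, 152, 65, 137468, 66, 2048, 113021, 252603, 861509, 1114112),
}


def check(row):
    """Return True if a row satisfies the three sums used in this post."""
    (graphic, fmt, control, private, nonchar,
     surrogate, encoded, assigned, unassigned, total) = row
    return (
        encoded == graphic + fmt + control
        and assigned == encoded + private + nonchar + surrogate
        and total == assigned + unassigned
    )


for version, row in TABLE.items():
    print(version, "consistent" if check(row) else "INCONSISTENT")
```

The same check can be extended to the other fifteen columns by adding their figures to `TABLE`.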

For historic versions of Unicode the statistics are based on the General Category of the characters at the time of encoding, and do not take into account any subsequent changes in General Category. Thus the fact that 4.0 has 139 format characters and 4.1 has 140 format characters is not due to a new format character having been added in 4.1, but rather due to the General Category of U+200B ZERO WIDTH SPACE having been changed from Zs to Cf in Unicode 4.0.1. Note that the statistics for 1.0.0 and 1.0.1 are based upon Ken Whistler's reconstructed Unicode Character Data.
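
To see the General Category of an individual character, one option is Python's standard `unicodedata` module. This is only a rough illustration: the module reports the category from whichever Unicode version is bundled with your Python build (`unicodedata.unidata_version`), not the category at the time of encoding, so U+200B shows up with its later category of Cf rather than Zs. The choice of sample code points is arbitrary.

```python
import unicodedata

# The results reflect the Unicode data bundled with this Python build,
# not the historic category at the time each character was encoded.
print("Unicode data version:", unicodedata.unidata_version)

for cp in (0x0009, 0x200B, 0x200D, 0x2028, 0x2029, 0x0041):
    ch = chr(cp)
    # Control characters have no character name, so supply a default.
    name = unicodedata.name(ch, "<no name>")
    print(f"U+{cp:04X} {unicodedata.category(ch)} {name}")
```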

To help understand what we're talking about, here are some definitions of some of the terms used in the table (see Section 2.4 of the Unicode Standard for further information).

  • Control characters are those characters with a General Category of 'Cc'. These are invisible characters that perform a certain function that is defined by a protocol or standard other than Unicode (they are inherited from pre-existing 8-bit standards). They include familiar characters such as Tab, Carriage Return and Line Feed that are essential to writing Unicode (U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR never took off as Unicode replacements for Carriage Return and/or Line Feed), as well as many characters that you should never see in plain text (the full range of control characters is 0000..001F and 007F..009F).
  • Format characters are those characters with a General Category of 'Cf', 'Zl' or 'Zp'. These are invisible characters defined by Unicode for a particular function. These include things like U+200D ZERO WIDTH JOINER, U+202D LEFT-TO-RIGHT OVERRIDE, interlinear annotation characters (FFF9..FFFB) and the infamous set of Tag characters (E0001 and E0020..E007F). They work behind the scenes to do useful things like bidirectional control and character shaping.
  • Graphic characters are those characters with a General Category other than Cc, Cn, Co, Cs, Cf, Zl and Zp, that is to say ordinary visible characters.
  • Noncharacters are code points that are permanently reserved, and are guaranteed never to be assigned as characters. They are the thirty-four code points whose last four hexadecimal digits are FFFE or FFFF (the last two code points of each of the seventeen planes), as well as the thirty-two code points in the range FDD0..FDEF.
  • Encoded characters = Graphic characters + Format characters + Control characters.
  • Assigned code points = Encoded characters + Private Use characters + Noncharacters + Surrogate code points.
  • Total code points = Assigned code points + Unassigned code points.
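
The definitions above can be turned into a rough counting script. The sketch below uses Python's standard `unicodedata` module and walks every code point from 0 to 10FFFF; the class labels and the `classify` helper are illustrative names only. It assumes the Unicode data bundled with your Python build, which is normally later than 7.0, and it uses current General Category values, so the graphic, format and unassigned counts will not match the 7.0 column of the table. The control, private use, noncharacter and surrogate counts should match, since those sets are fixed.

```python
import unicodedata
from collections import Counter


def classify(cp):
    """Map a code point to one of the classes defined in this post."""
    # Noncharacters: last two code points of each plane, plus FDD0..FDEF.
    # They must be tested first because their General Category is Cn,
    # the same as for unassigned code points.
    if (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF:
        return "noncharacter"
    cat = unicodedata.category(chr(cp))
    if cat == "Cc":
        return "control"
    if cat in ("Cf", "Zl", "Zp"):
        return "format"
    if cat == "Co":
        return "private use"
    if cat == "Cs":
        return "surrogate"
    if cat == "Cn":
        return "unassigned"
    return "graphic"


counts = Counter(classify(cp) for cp in range(0x110000))

# The sums from the definitions list.
encoded = counts["graphic"] + counts["format"] + counts["control"]
assigned = (encoded + counts["private use"]
            + counts["noncharacter"] + counts["surrogate"])

print("Unicode data version:", unicodedata.unidata_version)
for label, n in sorted(counts.items()):
    print(f"{label:13} {n:>9,}")
print(f"{'encoded':13} {encoded:>9,}")
print(f"{'assigned':13} {assigned:>9,}")
```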



3 comments:

Suz said...

What about Phoenician? Is it in or out?

Suzanne

Andrew said...

In. Please see today's blog "What's new in Unicode 5.0".

crasshopper said...

Awesome. I was just thinking about ways to come up with easy to imagine large numbers. Lots of people know what Unicode is, now I can say there are approx 120,000 Unicode symbols so the number of possible permutations of the Unicode "alphabet" would be 120,000!, a number with 557,389 digits.

http://www.wolframalpha.com/input/?i=120000!

Goodbye, protein example!