Thursday, 24 November 2005

How many Unicode characters are there ?

[Short answer: 120,737]


Otto Stolz asked on the Unicode List how many Unicode characters there were, classified as control characters, format characters, graphic characters, private use characters, noncharacters, surrogate code points, etc. Now I love Unicode facts, figures and trivia, so I can't resist trying to answer this question.

The "Unicode Version History" utility of BabelMap provides precisely the information requested by Otto for all versions of Unicode from 1.0.0 up to the current version (4.1 when I first wrote this post, but now updated to 8.0). This information is tabulated below :


[Unicode Slide Show : 120,672 characters, one at a time]

[BabelMap Online : Javascript Unicode Character Map]


Unicode Version History

| Version | Date | Scripts | Blocks | Total Code Points | Assigned Code Points | Unassigned Code Points | Encoded Characters | Private Use Characters | Noncharacters | Surrogate Code Points | Graphic Characters | Format Characters | Control Characters |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.0.0 | October 1991 | 24 | 57 | 65,536 | 12,795 | 52,741 | 7,161 | 5,632 | 2 | 0 | 7,085 | 2 | 74 |
| 1.0.1 | June 1992 | 25 | 59 | 65,536 | 34,505 | 31,031 | 28,359 | 6,144 | 2 | 0 | 28,283 | 2 | 74 |
| 1.1 | June 1993 | 24 | 63 | 65,536 | 40,635 | 24,901 | 34,233 | 6,400 | 2 | 0 | 34,151 | 2 | 80 |
| 2.0 | July 1996 | 25 | 67 | 1,114,112 | 178,500 | 935,612 | 38,950 | 137,468 | 34 | 2,048 | 38,867 | 18 | 65 |
| 2.1 | May 1998 | 25 | 67 | 1,114,112 | 178,502 | 935,610 | 38,952 | 137,468 | 34 | 2,048 | 38,869 | 18 | 65 |
| 3.0 | September 1999 | 38 | 86 | 1,114,112 | 188,809 | 925,303 | 49,259 | 137,468 | 34 | 2,048 | 49,168 | 26 | 65 |
| 3.1 | March 2001 | 41 | 95 | 1,114,112 | 233,787 | 880,325 | 94,205 | 137,468 | 66 | 2,048 | 94,009 | 131 | 65 |
| 3.2 | March 2002 | 45 | 107 | 1,114,112 | 234,803 | 879,309 | 95,221 | 137,468 | 66 | 2,048 | 95,023 | 133 | 65 |
| 4.0 | April 2003 | 52 | 122 | 1,114,112 | 236,029 | 878,083 | 96,447 | 137,468 | 66 | 2,048 | 96,243 | 139 | 65 |
| 4.1 | March 2005 | 59 | 142 | 1,114,112 | 237,302 | 876,810 | 97,720 | 137,468 | 66 | 2,048 | 97,515 | 140 | 65 |
| 5.0 | July 2006 | 64 | 151 | 1,114,112 | 238,671 | 875,441 | 99,089 | 137,468 | 66 | 2,048 | 98,884 | 140 | 65 |
| 5.1 | April 2008 | 75 | 168 | 1,114,112 | 240,295 | 873,817 | 100,713 | 137,468 | 66 | 2,048 | 100,507 | 141 | 65 |
| 5.2 | October 2009 | 90 | 194 | 1,114,112 | 246,943 | 867,169 | 107,361 | 137,468 | 66 | 2,048 | 107,154 | 142 | 65 |
| 6.0 | October 2010 | 93 | 206 | 1,114,112 | 249,031 | 865,081 | 109,449 | 137,468 | 66 | 2,048 | 109,242 | 142 | 65 |
| 6.1 | January 2012 | 100 | 217 | 1,114,112 | 249,763 | 864,349 | 110,181 | 137,468 | 66 | 2,048 | 109,975 | 141 | 65 |
| 6.2 | September 2012 | 100 | 217 | 1,114,112 | 249,764 | 864,348 | 110,182 | 137,468 | 66 | 2,048 | 109,976 | 141 | 65 |
| 6.3 | September 2013 | 100 | 217 | 1,114,112 | 249,769 | 864,343 | 110,187 | 137,468 | 66 | 2,048 | 109,975 | 147 | 65 |
| 7.0 | June 2014 | 123 | 249 | 1,114,112 | 252,603 | 861,509 | 113,021 | 137,468 | 66 | 2,048 | 112,804 | 152 | 65 |
| 8.0 | June 2015 | 129 | 259 | 1,114,112 | 260,319 | 853,793 | 120,737 | 137,468 | 66 | 2,048 | 120,520 | 152 | 65 |

For historical versions of Unicode the statistics are based on the General Category that each character had in that version, and do not take into account any subsequent changes in General Category. Thus the fact that 4.0 has 139 format characters and 4.1 has 140 format characters is not due to a new format character having been added in 4.1, but rather due to the General Category of U+200B ZERO WIDTH SPACE having been changed from Zs to Cf in Unicode 4.0.1. Note that the statistics for 1.0.0 and 1.0.1 are based upon Ken Whistler's reconstructed Unicode Character Data.
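The U+200B case can be observed directly from Python. The following is only a minimal sketch: it assumes a CPython build whose standard `unicodedata` module also exposes the frozen Unicode 3.2 database as `unicodedata.ucd_3_2_0`, and the variable names are purely illustrative. Since Unicode 3.2 predates 4.0.1, the older lookup should report the pre-change category.

```python
import unicodedata

ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE

# General Category in the Unicode version bundled with this Python build.
current = unicodedata.category(ZWSP)

# General Category in the frozen Unicode 3.2 data that CPython also ships
# (assumption: the ucd_3_2_0 object is available in this build).
old = unicodedata.ucd_3_2_0.category(ZWSP)

print(f"Unicode {unicodedata.unidata_version}: {current}")
print(f"Unicode {unicodedata.ucd_3_2_0.unidata_version}: {old}")
```

A count of format characters made against the older data would therefore not include U+200B, which is the kind of per-version difference the table reflects.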

To help understand what we're talking about, here are definitions of the terms used in the table (see Section 2.4 of the Unicode Standard for further information).

  • Control characters are those characters with a General Category of 'Cc'. These are invisible characters that perform a function defined by a protocol or standard other than Unicode (they are inherited from pre-existing 8-bit standards). They include familiar characters such as Tab, Carriage Return and Line Feed that are essential to writing Unicode text (U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR never took off as Unicode replacements for Carriage Return and/or Line Feed), as well as many characters that you should never see in plain text. The full range of control characters is 0000..001F and 007F..009F.
  • Format characters are those characters with a General Category of 'Cf', 'Zl' or 'Zp'. These are invisible characters defined by Unicode for a particular function. These include things like U+200D ZERO WIDTH JOINER, U+202D LEFT-TO-RIGHT OVERRIDE, interlinear annotation characters (FFF9..FFFB) and the infamous set of Tag characters (E0001 and E0020..E007F). They work behind the scenes to do useful things like bidirectional control and character shaping.
  • Graphic characters are those characters with a General Category other than Cc, Cn, Co, Cs, Cf, Zl and Zp, that is to say ordinary visible characters.
  • Noncharacters are code points that are permanently reserved, and are guaranteed never to be assigned as characters. They are the thirty-four code points of the form xxFFFE and xxFFFF (the last two code points of each of the seventeen planes), as well as the thirty-two code points in the range FDD0..FDEF.
  • Encoded characters = Graphic characters + Format characters + Control characters.
  • Assigned code points = Encoded characters + Private Use characters + Noncharacters + Surrogate code points.
  • Total code points = Assigned code points + Unassigned code points.
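The definitions above translate fairly directly into code. The sketch below is one way to reproduce the classification with Python's standard `unicodedata` module; it is not how BabelMap computes its figures, and the function and label names (`classify`, `count_classes`, "private use" and so on) are made up for illustration. The graphic, format and unassigned counts depend on the Unicode version bundled with your Python build (`unicodedata.unidata_version`), so they will only match a row of the table if the versions coincide.

```python
import unicodedata
from collections import Counter


def classify(cp: int) -> str:
    """Assign a code point to one of the classes used in the table."""
    # Noncharacters have General Category Cn, so test for them first:
    # FDD0..FDEF plus the last two code points of each of the 17 planes.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    gc = unicodedata.category(chr(cp))
    if gc == "Cn":
        return "unassigned"
    if gc == "Cs":
        return "surrogate"
    if gc == "Co":
        return "private use"
    if gc == "Cc":
        return "control"
    if gc in ("Cf", "Zl", "Zp"):
        return "format"
    return "graphic"  # every other General Category


def count_classes() -> Counter:
    """Count all 1,114,112 code points by class."""
    return Counter(classify(cp) for cp in range(0x110000))


counts = count_classes()

# The three identities from the list above.
encoded = counts["graphic"] + counts["format"] + counts["control"]
assigned = (encoded + counts["private use"]
            + counts["noncharacter"] + counts["surrogate"])
total = assigned + counts["unassigned"]

print("Unicode version:", unicodedata.unidata_version)
print("Encoded characters:", encoded)
print("Assigned code points:", assigned)
print("Total code points:", total)
```

Whatever version your Python ships, the control, private use, noncharacter and surrogate counts should agree with the later rows of the table, since those classes have not changed since Unicode 3.1.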


[Last updated : 2015-06-17]


9 comments:

Suz said...

What about Phoenician? Is it in or out?

Suzanne

Andrew said...

In. Please see today's blog "What's new in Unicode 5.0".

crasshopper said...

Awesome. I was just thinking about ways to come up with easy to imagine large numbers. Lots of people know what Unicode is, now I can say there are approx 120,000 Unicode symbols so the number of possible permutations of the Unicode "alphabet" would be 120,000!, a number with 557,389 digits.

http://www.wolframalpha.com/input/?i=120000!

Goodbye, protein example!

Taro said...

According to http://unicode.org/versions/Unicode5.2.0/#Character_Additions , there are 107,156 graphic characters in Unicode 5.2.

In your article, the number is 107,154.

Do you know the reason of this difference?

Andrew West said...

Hi Taro,

I have checked the figures, and I can confirm that my table is correct.

My table gives 107,154 graphic characters and 142 format characters for Unicode 5.2, whereas the Unicode page gives 107,156 graphic characters and 140 format characters, which is the same total but a different distribution between graphic and format characters. The Unicode web site's figure of 140 format characters only takes into account the 140 characters with general category = Cf, whereas the Unicode Standard ch. 2 Table 2-3 defines format characters as those characters with gc=Cf|Zl|Zp. My figure of 142 for format characters is calculated as 140 Cf + 1 Zl + 1 Zp, and my figure for graphic characters excludes the two Zl/Zp characters.

I will report the issue with the statistics to Unicode.

Andrew West said...

Just to clarify, the Unicode 5.2 Table 2-3 also defines format characters as Cf + Zl + Zp.

Interestingly, the Unicode 6.0 page gives the correct figures for format characters, and notes that they are derived from Cf + Zl + Zp.

Taro said...

I understood very well.
Thank you so much.

Tomi Adewole said...

Awesome...but is there a page where one can actually see (and thus copy-and-paste) these unicode images? Thanks in advance...

Andrew West said...

PDF code charts covering all Unicode characters are available from the Unicode Consortium website (here).