Sunday, 1 April 2007

Numbers that Don't Add Up : U+5EFE

One of the properties defined by Unicode that I am least fond of is the numeric value of a character. It's obvious what the use of this property is for scripts which have special numeric digits such as "0, 1, 2, 3, 4, 5, 6, 7, 8 and 9". However, in many scripts ordinary letters can double up as numbers, so there is no clear differentiation between a number and a letter, and the numeric definition of a letter can become problematic.

In the 16-rune futhark, for example, runic letters were used to represent the nineteen golden numbers, but as there are obviously only sixteen runic letters in a 16-rune futhark three additional runes had to be devised solely for the purpose of representing the last three golden numbers. These three runes, encoded as U+16EE RUNIC ARLAUG SYMBOL ᛮ, U+16EF RUNIC TVIMADUR SYMBOL ᛯ and U+16F0 RUNIC BELGTHOR SYMBOL ᛰ, have the numeric values of "17", "18" and "19" respectively, whereas none of the other runic characters that correspond to the golden numbers "1" through "16" are assigned a numeric value by Unicode. It's not that the other sixteen runic letters are less valid as golden numbers than the three special runic symbols, it's just that under the Unicode concept of a number, only characters that are used solely as numbers are assigned a numeric value (as we shall see below, that is not entirely true).
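
The asymmetry is easy to see for yourself. The following is only a minimal sketch, assuming Python's standard `unicodedata` module (which reflects whatever version of the Unicode data your Python was built with); the helper name `numeric_or_none` is my own invention for illustration. It asks for the numeric value of the three golden number symbols and, for contrast, of one ordinary rune.

```python
import unicodedata

# The three golden-number symbols named above, plus one ordinary rune
# (U+16A0 RUNIC LETTER FEHU FEOH FE F) for contrast.
samples = ["\u16EE", "\u16EF", "\u16F0", "\u16A0"]

def numeric_or_none(ch):
    """Return the Unicode numeric value of ch, or None if it has none."""
    return unicodedata.numeric(ch, None)

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {numeric_or_none(ch)}")
```

The three special symbols should report 17.0, 18.0 and 19.0, whereas the ordinary rune reports no value at all.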

Likewise, all Gothic letters have numeric values (see here for a table showing the numeric value of each Gothic letter) :

𐌾𐌰𐌷 𐌱𐌰𐍂 𐌰𐌹𐌽 .𐌻. 𐌾𐌰𐌷 𐌰𐌹𐌽 .𐌾. 𐌾𐌰𐌷 𐌰𐌹𐌽 .𐍂.

... and one bore 30, one 60, and one 100

[Mark 4.8]

But as there were not enough letters to cover the numbers from 1-9, 10-90 and 100-900, two special letters were devised to represent the numbers "90" and "900". In Unicode only the two special letters that have no phonetic value, encoded as U+10341 GOTHIC LETTER NINETY 𐍁 and U+1034A GOTHIC LETTER NINE HUNDRED 𐍊, are assigned a numeric value. It's not that any of the other twenty-five Gothic letters are less valid as numbers than these two letters, it's just that they are used as letters and numbers whereas the other two are used solely as numbers.
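
To illustrate how little the Unicode data gives you here, this is a rough sketch that derives the numeric value of a Gothic letter purely from its position in the Gothic block, and compares it with what Python's `unicodedata` module reports. The positional rule is my own assumption (that the 27 code points from U+10330 to U+1034A run in numeral order, 1-9, then 10-90, then 100-900), not anything defined by Unicode, and `gothic_value` is a hypothetical helper. It does at least agree with the values quoted above (30, 60, 100, 90 and 900).

```python
import unicodedata

GOTHIC_FIRST = 0x10330  # GOTHIC LETTER AHSA
GOTHIC_LAST = 0x1034A   # GOTHIC LETTER NINE HUNDRED

def gothic_value(ch):
    """Alphabetic-numeral value of a Gothic letter, derived from its position.

    Assumption: the 27 code points run in numeral order
    (units 1-9, tens 10-90, hundreds 100-900).
    """
    cp = ord(ch)
    if not GOTHIC_FIRST <= cp <= GOTHIC_LAST:
        raise ValueError(f"not a Gothic letter: U+{cp:04X}")
    index = cp - GOTHIC_FIRST
    return (index % 9 + 1) * 10 ** (index // 9)

# The three letters from the Mark quotation, then the two special letters.
for ch in "\U0001033B\U0001033E\U00010342\U00010341\U0001034A":
    print(f"U+{ord(ch):05X}: positional {gothic_value(ch)}, "
          f"Unicode {unicodedata.numeric(ch, None)}")
```

Only the last two letters should come back with a Unicode numeric value; for the letters actually used in the quotation the Unicode data has nothing to say.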

It seems to me that giving three Runic letters and two Gothic letters Unicode numeric values, but not any of the other Runic and Gothic letters that are used as numbers, is next to useless to anybody. How on earth can it be of any use to anyone to learn from the official Unicode data that U+16EE, U+16EF and U+16F0 represent the numbers "17", "18" and "19", but not be able to determine from the same Unicode data which Runic letters represent "1" through "16" ?

But enough ranting for today. What I really wanted to discuss was the numeric value of one particular Chinese character.

The numeric values of Chinese characters are defined in the Unihan database, and are categorised as one of :

  • Primary Numeric : "the value of the character when used in the writing of numbers in the standard fashion"
  • Accounting Numeric : "the value of the character when used in the writing of accounting numerals" (complex variants of numeric characters used on bank notes, etc. to help prevent fraud)
  • Other Numeric : "the numeric value for the character in certain unusual, specialized contexts" (most of the characters with the Other Numeric property are rare or archaic variant forms of numeric characters)
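
In the Unihan data files these three categories correspond, as far as I am aware, to the fields kPrimaryNumeric, kAccountingNumeric and kOtherNumeric, with one tab-separated line per character. The sketch below parses a few lines in that layout. The sample lines are hand-written for illustration (the values are those discussed in this post) rather than read from a real Unihan file, and `parse_numeric_values` is a hypothetical helper that assumes a single integer value per line.

```python
# Hand-written sample in the assumed tab-separated layout:
# code point, field name, value.
SAMPLE = """\
# sample data, not a real Unihan file
U+4E5D\tkPrimaryNumeric\t9
U+7396\tkAccountingNumeric\t9
U+5EFE\tkOtherNumeric\t9
U+5EFF\tkOtherNumeric\t20
"""

NUMERIC_FIELDS = {"kPrimaryNumeric", "kAccountingNumeric", "kOtherNumeric"}

def parse_numeric_values(text):
    """Map each character to (field, value) for the three numeric fields.

    Assumes exactly one integer value per line.
    """
    result = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t")
        if field in NUMERIC_FIELDS:
            result[chr(int(code[2:], 16))] = (field, int(value))
    return result

values = parse_numeric_values(SAMPLE)
for ch, (field, value) in values.items():
    print(f"U+{ord(ch):04X} {ch} {field} {value}")
```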

In Unicode 5.0 the following CJK ideographs are defined as having a numeric value.

Code Point Character Numeric Value Numeric Type
U+3405 㐅 5 Other Numeric
U+3483 㒃 2 Other Numeric
U+382A 㠪 5 Other Numeric
U+3B4D 㭍 7 Other Numeric
U+4E00 一 1 Primary Numeric
U+4E03 七 7 Primary Numeric
U+4E07 万 10000 Primary Numeric
U+4E09 三 3 Primary Numeric
U+4E5D 九 9 Primary Numeric
U+4E8C 二 2 Primary Numeric
U+4E94 五 5 Primary Numeric
U+4E96 亖 4 Other Numeric
U+4EBF 亿 100000000 Primary Numeric
U+4EC0 什 10 Other Numeric
U+4EDF 仟 1000 Accounting Numeric
U+4EE8 仨 3 Other Numeric
U+4F0D 伍 5 Accounting Numeric
U+4F70 佰 100 Accounting Numeric
U+5104 億 100000000 Primary Numeric
U+5146 兆 1000000000000 Primary Numeric
U+5169 兩 2 Other Numeric
U+516B 八 8 Primary Numeric
U+516D 六 6 Primary Numeric
U+5341 十 10 Primary Numeric
U+5343 千 1000 Primary Numeric
U+5344 卄 20 Other Numeric
U+5345 卅 30 Other Numeric
U+534C 卌 40 Other Numeric
U+53C1 叁 3 Accounting Numeric
U+53C2 参 3 Accounting Numeric
U+53C3 參 3 Accounting Numeric
U+53C4 叄 3 Other Numeric
U+56DB 四 4 Primary Numeric
U+58F1 壱 1 Accounting Numeric
U+58F9 壹 1 Accounting Numeric
U+5E7A 幺 1 Other Numeric
U+5EFE 廾 9 Other Numeric
U+5EFF 廿 20 Other Numeric
U+5F0C 弌 1 Accounting Numeric
U+5F0D 弍 2 Accounting Numeric
U+5F0E 弎 3 Accounting Numeric
U+5F10 弐 2 Accounting Numeric
U+62FE 拾 10 Accounting Numeric
U+634C 捌 8 Accounting Numeric
U+67D2 柒 7 Accounting Numeric
U+7396 玖 9 Accounting Numeric
U+767E 百 100 Primary Numeric
U+8086 肆 4 Accounting Numeric
U+842C 萬 10000 Accounting Numeric
U+8CAE 貮 2 Accounting Numeric
U+8CB3 貳 2 Accounting Numeric
U+8D30 贰 2 Accounting Numeric
U+9646 陆 6 Accounting Numeric
U+964C 陌 100 Accounting Numeric
U+9678 陸 6 Accounting Numeric
U+96F6 零 0 Primary Numeric
U+20001 𠀁 7 Other Numeric
U+20064 𠁤 4 Other Numeric
U+200E2 𠃢 4 Other Numeric
U+20121 𠄡 5 Other Numeric
U+2092A 𠤪 1 Other Numeric
U+20983 𠦃 30 Other Numeric
U+2098C 𠦌 40 Other Numeric
U+2099C 𠦜 40 Other Numeric
U+20AEA 𠫪 6 Other Numeric
U+20AFD 𠫽 3 Other Numeric
U+20B19 𠬙 3 Other Numeric
U+22390 𢎐 2 Other Numeric
U+22998 𢦘 3 Other Numeric
U+23B1B 𣬛 3 Other Numeric
U+2626D 𦉭 4 Other Numeric
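
If you want to see what your own system thinks, a list of this kind can be regenerated with a few lines of code. This is only a sketch, assuming Python's standard `unicodedata` module: the results depend on the Unicode version bundled with your Python, so they will not necessarily match the Unicode 5.0 table above, and the module does not say whether a value is Primary, Accounting or Other. The three code point ranges and the name `cjk_numerics` are my own choices for illustration.

```python
import unicodedata

# Ranges assumed for this sketch: Extension A, the main CJK Unified
# Ideographs block, and Extension B.
RANGES = [(0x3400, 0x4DBF), (0x4E00, 0x9FFF), (0x20000, 0x2A6DF)]

def cjk_numerics():
    """Return {code point: numeric value} for ideographs with a numeric value."""
    found = {}
    for first, last in RANGES:
        for cp in range(first, last + 1):
            value = unicodedata.numeric(chr(cp), None)
            if value is not None:
                found[cp] = value
    return found

table = cjk_numerics()
print("Unicode data version:", unicodedata.unidata_version)
print(len(table), "ideographs with a numeric value")
print("U+5EFE ->", table.get(0x5EFE))
```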

There are quite a few things about this table that disturb me. In particular, to my mind, it is of very dubious value to give a numeric value to those characters which are extremely rare archaic variant forms of ordinary characters, hardly ever used outside of dictionaries (e.g. U+3405 㐅 and U+20121 𠄡, which are both archaic forms of U+4E94 五, only encoded because the Kangxi Dictionary cross-references them to 五).

Then there are characters like U+382A 㠪 and U+23B1B 𣬛, which really make me shudder. The former is, according to the Kangxi Dictionary, an archaic form of the character 巨 "huge", and indeed the Unihan database seems to agree with that derivation by giving it a Mandarin reading of JU4 and Cantonese reading of geoi6. However, the Unihan database also gives it the definition "ancient form of 五 'five'", which I think is wrong. True, it does look somewhat like the character 五 "five", but U+22011 𢀑 (another archaic form of 巨) looks even more like 五, and yet is not given a numeric value of "5".

As to U+23B1B 𣬛, according to the Kangxi Dictionary it is not used as a number per se, but only occurs as a variant way of writing the character 三 in the name of the god Sanlang (𣬛郎 = 三郎), or in a couple of placenames (𣬛趙村, 𣬛陽鎮).

But I'm straying from the subject yet again. The character that I really want to talk about today is U+5EFE 廾, which is given a numeric value of "9" in the Unihan database. The first thing to note about this character is that it is not a number by definition. According to the earliest dictionary of Chinese, the Shuowen 說文 dictionary (compiled by Xu Shen 許慎 in about 100 A.D.), 廾 (originally written as 𠬞), means 竦手 "to lift [something] up in two hands", and the character is generally given the reading gǒng and glossed as being equivalent to the character gǒng 拱. In the Kangxi Dictionary and most modern dictionaries the character 廾 only has the Shuowen meaning, and there is no mention of it having any numeric value.

Why then does Unihan give the character a numeric value of "9" ? The answer is that the Hanyu Dazidian 漢語大字典 quotes a supplement to the late Ming dictionary Zi Hui 字彙 as defining the character as an ancient form of the character jiǔ 九 "nine" that is used in [some editions of] the Yi Jing "Book of Changes" (廾,古文九字,見《周易全書》). The character 九 isn't written like that in any of my editions of the Yi Jing, so unfortunately I can't give any examples showing the character 廾 used to mean "nine". Nor have I been able to find any other dictionary, ancient or modern, that defines 廾 as being equivalent to 九. I can only assume that in at least one edition of the Yi Jing the character 九 is written with an extended horizontal stroke and no final bend and hook, so that it resembles the character 廾. In this case, in my opinion, it is simply a glyph variant of U+4E5D 九 that looks like 廾, and not actually the character U+5EFE 廾.

However, the issue with U+5EFE 廾 is complicated by the fact that several dictionaries also give this character as an alternate form for 廿 niàn (U+5EFF) "twenty" (in fact there is only a very subtle difference between U+5EFE 廾 and U+5344 卄, a variant form of U+5EFF 廿 which is assigned a numeric value of "20"). The earliest dictionary with this definition is Daguang Yihui Yupian 大廣益會玉篇 (Yupian for short), which was compiled in the Song dynasty (early 11th century), and was the most important dictionary of the period :

And for a reference from a modern Chinese dictionary, this is the definition of 廿 niàn (U+5EFF) from the Ci Yuan 辭源 dictionary (Shangwu Yinshuguan, 1988) :

二十 。也作"廾"。唐石經二十皆作"廾"。

Twenty. Also written as 廾. In the classics engraved on stone during the Tang dynasty the character is always written this way.

Thus according to these two dictionaries the character 廾 should have a numeric value of "20". However, on the same principle by which 廾 meaning "nine" should be treated at the encoding level as a glyph variant of U+4E5D 九, 廾 meaning "twenty" should be treated at the encoding level as a glyph variant of U+5EFF 廿.

In conclusion, in my opinion, U+5EFE should either be assigned a numeric value of both "9" and "20" or have no numeric value assigned to it at all. Given that Unicode does not allow for a character to have more than a single numeric value, the former option is out of the question, so removing the numeric value property from U+5EFE would be the best option. Unfortunately, I don't think that this will happen any time soon, so the moral of the story is, don't always believe what Unicode data tells you about the numeric value of a character.
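
In practical terms, that moral might translate into something like the following sketch, which wraps the lookup in Python's standard `unicodedata` module with a list of code points whose published value you have decided not to trust. The name `cautious_numeric` and the one-entry distrust list are hypothetical and reflect nothing more than my own opinion about U+5EFE.

```python
import unicodedata

# Code points whose published numeric value the caller chooses not to trust.
# The single entry reflects the argument of this post; it is not a standard list.
DISTRUSTED = {0x5EFE}

def cautious_numeric(ch, distrusted=DISTRUSTED):
    """Numeric value of ch, or None if it has none or is on the distrust list."""
    if ord(ch) in distrusted:
        return None
    return unicodedata.numeric(ch, None)

print(cautious_numeric("\u4E5D"))  # U+4E5D, the ordinary character for nine
print(cautious_numeric("\u5EFE"))  # suppressed regardless of what the data says
```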
