Saturday, 7 April 2007

Numbers that Don't Add Up : Tibetan Half Digits

One set of numbers that have caused no end of discussion since they were first encoded in Unicode 2.0 (July 1996) are the ten Tibetan half digits, which are forms of the Tibetan digits zero through nine [U+0F20..U+0F29] with a hooked slash through them :


The problem with these characters is that it is very hard to get hold of any examples of their usage, and even respected Tibetan experts (and most Tibetans) are not familiar with them other than from the Unicode character charts. Tibetan reference books are universally silent on these characters, and as no other Indic scripts have similar half digits it is not immediately evident what they are meant to represent.

As I understand it, the original Tibetan encoding proposal from 1995 included a combining slash character that could be used in combination with one or more of the Tibetan digit characters, but this was rejected by the UTC in favour of the ten precomposed half digits. This was in line with the position of the Chinese national body that only these ten slashed digits were required.

In Unicode 2.0 and 2.1 none of these ten characters had a numeric value assigned to them in the Unicode Data files; then exactly ten years ago to the day Tim Greenwood asked on the Unicore mailing list why the Tibetan half digits did not have a numeric property. The response from Lee Collins, who was one of the key players in the reintroduction of Tibetan into Unicode after it had been banished following the merger with ISO/IEC 10646, was that they don't have a fixed value, but are used to represent fractions. For the next couple of years the Unicode Data files remained the same with respect to these characters, but then in August 2000 the ten half digits finally got assigned numeric values (in square brackets in the above list) in Unicode 3.0.1. There is nothing in the UTC minutes or the Unicode document register to indicate where the impetus to add these numeric values came from or what the evidence for the values assigned to them was -- the values simply appeared in the final beta of 3.0.1, and have remained unchanged ever since.

But every now and then on the Tibetan mailing lists that I am subscribed to there is heated and inconclusive discussion as to what these characters represent and whether we really need them or not.

The values assigned to the half digit characters in the Unicode Data file are one half less than the corresponding whole digit, but there are some Tibetan experts who believe that these values are wrong, and should be corrected or removed. There are two different theories of what slashed digits represent numerically.
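The "one half less" pattern can be read back from any library that exposes the Unicode numeric property. What follows is only a minimal sketch using Python's standard unicodedata module; the helper name half_digit_values is invented for illustration, and the values reported depend on the Unicode version bundled with the Python build in use.

```python
import unicodedata
from fractions import Fraction

def half_digit_values():
    """Map each Tibetan half digit (U+0F2A..U+0F33) to its Unicode numeric value."""
    values = {}
    for cp in range(0x0F2A, 0x0F34):
        ch = chr(cp)
        # numeric() returns a float; multiples of 1/2 convert to Fraction exactly
        values[unicodedata.name(ch)] = Fraction(unicodedata.numeric(ch))
    return values

for name, value in half_digit_values().items():
    print(f"{name}: {value}")
```

Note that TIBETAN DIGIT HALF ZERO follows the same rule as the others, so its value comes out negative (zero less one half).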

Firstly, there is some supporting evidence for the Unicode Data position that the digits represent one half less than the unslashed digit. When I say some, I mean the solitary example of a single Tibetan postage stamp :

This stamp is one of a set of five stamps first issued in 1933 with the following values (1 tranka ཊཾ = 1 ½ zho ཞོ = 15 skar སྐར) :

  • 7 ½ skar = ½ tranka
  • 1 zho = 10 skar = ⅔ tranka
  • 1 tranka
  • 2 tranka
  • 4 tranka
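The conversions in this list are easy to check with exact fractions. This is just a small sketch based on the rates quoted above (1 tranka = 1 ½ zho = 15 skar); the constant and function names are made up for the example.

```python
from fractions import Fraction

SKAR_PER_ZHO = 10      # 1 zho = 10 skar
SKAR_PER_TRANKA = 15   # 1 tranka = 15 skar

def skar_to_tranka(skar):
    """Convert a value in skar to tranka, keeping the result exact."""
    return Fraction(skar) / SKAR_PER_TRANKA

# Face values of the five stamps, expressed in skar
stamps_in_skar = [Fraction(15, 2), 10, 15, 30, 60]
for skar in stamps_in_skar:
    print(f"{skar} skar = {skar_to_tranka(skar)} tranka")
```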

The value of the 7 ½ skar stamp is given on the stamp as ༱ (on the right panel) and སྐར (on the left panel), using U+0F31 TIBETAN DIGIT HALF EIGHT to represent the value of 7 ½. The slashed digit is unfortunately incomplete on the example given above from my own collection, but in this page from a stamp catalogue that I photocopied many years ago the figure (labelled "1/2 t.") is shown quite clearly to be identical to the representative glyph for U+0F31 in the Unicode charts :

Unfortunately this is the total extent of evidence that has thus far been adduced in favour of the "half less" usage of slashed digits. There is a 7 ½ skar Tibetan coin that was minted 1918-1925, but the value is given in words as skar phyed brgyad སྐར་ཕྱེད་བརྒྱད "seven and a half skar". Note that the written Tibetan for "seven and a half" (as with other half values) is "half [less than] eight", so it is easy to understand why slashing a number would be used to indicate a value of half less than the unslashed number.

The competing theory is that slashed digits are used in Tibetan art, when drawing thangkas, to mark the proportions of Buddha figures and chörtens (stupas), with the slash indicating a dimension half that of the unslashed number. It is claimed that in this usage the slash may be applied to a single digit or to a group of digits representing a larger number, although one knowledgeable informant has stated that he has only ever seen a slashed digit one in these contexts, and of course the value of a slashed digit one is a half whichever theory you subscribe to.
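The point about the slashed digit one can be checked with a few lines of arithmetic. The sketch below simply encodes the two readings as functions (the names half_less and half_of are invented here) and looks for single digits on which they agree.

```python
from fractions import Fraction

def half_less(n):
    """Reading 1: a slashed n means n minus one half (the Unicode Data values)."""
    return Fraction(n) - Fraction(1, 2)

def half_of(n):
    """Reading 2: a slashed n means half of n (the claimed thangka usage)."""
    return Fraction(n, 2)

# Digits 0..9 for which both readings give the same value
agree = [n for n in range(10) if half_less(n) == half_of(n)]
print(agree)

# The stamp's slashed eight under each reading
print(half_less(8), half_of(8))
```

Solving n - 1/2 = n/2 gives n = 1, so a slashed one is the only digit that cannot distinguish between the two theories, whereas the slashed eight on the stamp can.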

I would love to be able to give an example of this usage of slashed digits, but am unable to do so at present. If any of my readers have some pictures of thangkas showing half digits that I can use please let me know.

Clearly the difficulty in finding examples of their usage indicates that the Tibetan Half Digit characters are not particularly needed by anyone other than philatelists and possibly scholars of thangkas, and it is of very little consequence whether the numeric value assigned to these characters by Unicode is correct or not. Indeed for the handful of users who will ever want to use a Tibetan half digit in real life it is probably totally irrelevant what value the character has according to Unicode -- all that matters is that the character is rendered correctly. So, all in all, given that the "half less than" usage is attested by the 7 ½ skar stamp, I personally think that the Unicode Data values for these characters are perfectly OK, and should not be changed.

Sunday, 1 April 2007

Numbers that Don't Add Up : U+5EFE

One of the properties defined by Unicode that I am least fond of is the numeric value of a character. It's obvious what the use of this property is for scripts which have special numeric digits such as "0, 1, 2, 3, 4, 5, 6, 7, 8 and 9". However, in many scripts ordinary letters can double up as numbers, so there is no clear differentiation between a number and a letter, and the numeric definition of a letter can become problematic.

In the 16-rune futhark, for example, runic letters were used to represent the nineteen golden numbers, but as there are obviously only sixteen runic letters in a 16-rune futhark three additional runes had to be devised solely for the purpose of representing the last three golden numbers. These three runes, encoded as U+16EE RUNIC ARLAUG SYMBOL ᛮ, U+16EF RUNIC TVIMADUR SYMBOL ᛯ and U+16F0 RUNIC BELGTHOR SYMBOL ᛰ, have the numeric values of "17", "18" and "19" respectively, whereas none of the other runic characters that correspond to the golden numbers "1" through "16" are assigned a numeric value by Unicode. It's not that the other sixteen runic letters are less valid as golden numbers than the three special runic symbols, it's just that under the Unicode concept of a number, only characters that are used solely as numbers are assigned a numeric value (as we shall see below, that is not entirely true).

Likewise, all Gothic letters have numeric values (see here for a table showing the numeric value of each Gothic letter) :

𐌾𐌰𐌷 𐌱𐌰𐍂 𐌰𐌹𐌽 .𐌻. 𐌾𐌰𐌷 𐌰𐌹𐌽 .𐌾. 𐌾𐌰𐌷 𐌰𐌹𐌽 .𐍂.

... and one bore 30, one 60, and one 100

[Mark 4.8]

But as there were not enough letters to cover the numbers from 1-9, 10-90 and 100-900, two special letters were devised to represent the numbers "90" and "900". In Unicode only the two special letters that have no phonetic value, encoded as U+10341 GOTHIC LETTER NINETY 𐍁 and U+1034A GOTHIC LETTER NINE HUNDRED 𐍊, are assigned a numeric value. It's not that any of the other twenty-five Gothic letters are less valid as numbers than these two letters, it's just that they are used as letters and numbers whereas the other two are used solely as numbers.

It seems to me that giving three Runic letters and two Gothic letters Unicode numeric values, but not any of the other Runic and Gothic letters that are used as numbers, is next to useless to anybody. How on earth can it be of any use to anyone to learn from the official Unicode data that U+16EE, U+16EF and U+16F0 represent the numbers "17", "18" and "19", but not be able to determine from the same Unicode data which Runic letters represent "1" through "16" ?
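The asymmetry is easy to see by querying the numeric property directly. Here is a minimal sketch using Python's unicodedata module; the helper numeric_or_none and the particular sample characters are chosen only for illustration.

```python
import unicodedata

def numeric_or_none(ch):
    """Return the Unicode numeric value of ch, or None if it has none."""
    return unicodedata.numeric(ch, None)

golden_number_symbols = "\u16EE\u16EF\u16F0"  # ARLAUG, TVIMADUR, BELGTHOR
ordinary_runes = "\u16A0\u16A2\u16A6"         # three ordinary runic letters
gothic = "\U00010330\U00010341\U0001034A"     # AHSA, NINETY, NINE HUNDRED

for ch in golden_number_symbols + ordinary_runes + gothic:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {numeric_or_none(ch)}")
```

Only the characters that are used solely as numbers report a value; for the ordinary letters the lookup comes back empty, even though they serve as numbers just as well.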

But enough ranting for today. What I really wanted to discuss was the numeric value of one particular Chinese character.

The numeric values of Chinese characters are defined in the Unihan database, and are categorised as one of :

  • Primary Numeric : "the value of the character when used in the writing of numbers in the standard fashion"
  • Accounting Numeric : "the value of the character when used in the writing of accounting numerals" (complex variants of numeric characters used on bank notes, etc. to help prevent fraud)
  • Other Numeric : "the numeric value for the character in certain unusual, specialized contexts" (most of the characters with the Other Numeric property are rare or archaic variant forms of numeric characters)

In Unicode 5.0 the following CJK ideographs are defined as having a numeric value.

Code Point Character Numeric Value Numeric Type
U+3405 㐅 5 Other Numeric
U+3483 㒃 2 Other Numeric
U+382A 㠪 5 Other Numeric
U+3B4D 㭍 7 Other Numeric
U+4E00 一 1 Primary Numeric
U+4E03 七 7 Primary Numeric
U+4E07 万 10000 Primary Numeric
U+4E09 三 3 Primary Numeric
U+4E5D 九 9 Primary Numeric
U+4E8C 二 2 Primary Numeric
U+4E94 五 5 Primary Numeric
U+4E96 亖 4 Other Numeric
U+4EBF 亿 100000000 Primary Numeric
U+4EC0 什 10 Other Numeric
U+4EDF 仟 1000 Accounting Numeric
U+4EE8 仨 3 Other Numeric
U+4F0D 伍 5 Accounting Numeric
U+4F70 佰 100 Accounting Numeric
U+5104 億 100000000 Primary Numeric
U+5146 兆 1000000000000 Primary Numeric
U+5169 兩 2 Other Numeric
U+516B 八 8 Primary Numeric
U+516D 六 6 Primary Numeric
U+5341 十 10 Primary Numeric
U+5343 千 1000 Primary Numeric
U+5344 卄 20 Other Numeric
U+5345 卅 30 Other Numeric
U+534C 卌 40 Other Numeric
U+53C1 叁 3 Accounting Numeric
U+53C2 参 3 Accounting Numeric
U+53C3 參 3 Accounting Numeric
U+53C4 叄 3 Other Numeric
U+56DB 四 4 Primary Numeric
U+58F1 壱 1 Accounting Numeric
U+58F9 壹 1 Accounting Numeric
U+5E7A 幺 1 Other Numeric
U+5EFE 廾 9 Other Numeric
U+5EFF 廿 20 Other Numeric
U+5F0C 弌 1 Accounting Numeric
U+5F0D 弍 2 Accounting Numeric
U+5F0E 弎 3 Accounting Numeric
U+5F10 弐 2 Accounting Numeric
U+62FE 拾 10 Accounting Numeric
U+634C 捌 8 Accounting Numeric
U+67D2 柒 7 Accounting Numeric
U+7396 玖 9 Accounting Numeric
U+767E 百 100 Primary Numeric
U+8086 肆 4 Accounting Numeric
U+842C 萬 10000 Accounting Numeric
U+8CAE 貮 2 Accounting Numeric
U+8CB3 貳 2 Accounting Numeric
U+8D30 贰 2 Accounting Numeric
U+9646 陆 6 Accounting Numeric
U+964C 陌 100 Accounting Numeric
U+9678 陸 6 Accounting Numeric
U+96F6 零 0 Primary Numeric
U+20001 𠀁 7 Other Numeric
U+20064 𠁤 4 Other Numeric
U+200E2 𠃢 4 Other Numeric
U+20121 𠄡 5 Other Numeric
U+2092A 𠤪 1 Other Numeric
U+20983 𠦃 30 Other Numeric
U+2098C 𠦌 40 Other Numeric
U+2099C 𠦜 40 Other Numeric
U+20AEA 𠫪 6 Other Numeric
U+20AFD 𠫽 3 Other Numeric
U+20B19 𠬙 3 Other Numeric
U+22390 𢎐 2 Other Numeric
U+22998 𢦘 3 Other Numeric
U+23B1B 𣬛 3 Other Numeric
U+2626D 𦉭 4 Other Numeric

There are quite a few things about this table that disturb me. In particular, to my mind, it is of very dubious value to give a numeric value to those characters which are extremely rare archaic variant forms of ordinary characters, hardly ever used outside of dictionaries (e.g. U+3405 㐅 and U+20121 𠄡, which are both archaic forms of U+4E94 五, only encoded because the Kangxi Dictionary cross-references them to 五).

Then there are characters like U+382A 㠪 and U+23B1B 𣬛, which really make me shudder. The former is, according to the Kangxi Dictionary, an archaic form of the character 巨 "huge", and indeed the Unihan database seems to agree with that derivation by giving it a Mandarin reading of JU4 and Cantonese reading of geoi6. However, the Unihan database also gives it the definition "ancient form of 五 'five'", which I think is wrong. True, it does look somewhat like the character 五 "five", but U+22011 𢀑 (another archaic form of 巨) looks even more like 五, and yet is not given a numeric value of "5".

As to U+23B1B 𣬛, according to the Kangxi Dictionary it is not used as a number per se, but only occurs as a variant way of writing the character 三 in the name of the god Sanlang (𣬛郎 = 三郎), or in a couple of placenames (𣬛趙村, 𣬛陽鎮).
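For what it is worth, a typical lookup API flattens the three Unihan categories into a single number. The sketch below uses Python's unicodedata module, which as far as I know reports a bare value with no indication of whether it came from the Primary, Accounting or Other field; the helper name cjk_numeric is invented, and the results for the rarer characters depend on the Unicode version that the Python build ships with.

```python
import unicodedata

def cjk_numeric(ch):
    """Return (code point label, value); value is None if no numeric value is defined."""
    return f"U+{ord(ch):04X}", unicodedata.numeric(ch, None)

# 五 (primary), 伍 (accounting), 㐅 (archaic variant of 五), 廿 (other)
for ch in "五伍㐅廿":
    print(ch, *cjk_numeric(ch))
```

A caller therefore has no way of telling an everyday numeral from a dictionary curiosity without going back to the Unihan data itself.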

But I'm straying from the subject yet again. The character that I really want to talk about today is U+5EFE 廾, which is given a numeric value of "9" in the Unihan database. The first thing to note about this character is that it is not a number by definition. According to the earliest dictionary of Chinese, the Shuowen 說文 dictionary (compiled by Xu Shen 許慎 in about 100 A.D.), 廾 (originally written as 𠬞), means 竦手 "to lift [something] up in two hands", and the character is generally given the reading gǒng and glossed as being equivalent to the character gǒng 拱. In the Kangxi Dictionary and most modern dictionaries the character 廾 only has the Shuowen meaning, and there is no mention of it having any numeric value.

Why then does Unihan give the character a numeric value of "9" ? The answer is that the Hanyu Dazidian 漢語大字典 quotes a supplement to the late Ming dictionary Zi Hui 字彙 as defining the character as an ancient form of the character jiǔ 九 "nine" that is used in [some editions of] the Yi Jing "Book of Changes" : 廾,古文九字,見《周易全書》. The character 九 isn't written like that in any of my editions of the Yi Jing, so unfortunately I can't give any examples showing the character 廾 used to mean "nine". Nor have I been able to find any other dictionary, ancient or modern, that defines 廾 as being equivalent to 九. I can only assume that in at least one edition of the Yi Jing the character 九 is written with an extended horizontal stroke and no final bend and hook, so that it resembles the character 廾. In this case, in my opinion, it is simply a glyph variant of U+4E5D 九 that looks like 廾, and not actually the character U+5EFE 廾.

However, the issue with U+5EFE 廾 is complicated by the fact that several dictionaries also give this character as an alternate form for 廿 niàn (U+5EFF) "twenty" (in fact there is only a very subtle difference between U+5EFE 廾 and U+5344 卄, a variant form of U+5EFF 廿 which is assigned a numeric value of "20"). The earliest dictionary with this definition is Daguang Yihui Yupian 大廣益會玉篇 (Yupian for short), which was compiled in the Song dynasty (early 11th century), and was the most important dictionary of the period :

And for a reference from a modern Chinese dictionary, this is the definition of 廿 niàn (U+5EFF) from the Ci Yuan 辭源 dictionary (Shangwu Yinshuguan, 1988) :

二十 。也作"廾"。唐石經二十皆作"廾"。

Twenty. Also written as 廾. In the classics engraved on stone during the Tang dynasty the character is always written this way.

Thus according to these two dictionaries the character 廾 should have a numeric value of "20". However, by the same principle that 廾 meaning "nine" should be treated at the encoding level as a glyph variant of U+4E5D 九, 廾 meaning "twenty" should be treated at the encoding level as a glyph variant of U+5EFF 廿.

In conclusion, in my opinion, U+5EFE should either be assigned a numeric value of both "9" and "20" or have no numeric value assigned to it at all. Given that Unicode does not allow for a character to have more than a single numeric value, the former option is out of the question, so removing the numeric value property from U+5EFE would be the best option. Unfortunately, I don't think that this will happen any time soon, so the moral of the story is, don't always believe what Unicode data tells you about the numeric value of a character.
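For anyone who does need to handle both readings, one possible workaround is an application-level table layered on top of the Unicode lookup. The following is only a sketch: ATTESTED_READINGS and possible_values are hypothetical names, and the table is not part of any Unicode data file, it merely records the dictionary evidence discussed above.

```python
import unicodedata

# Hypothetical application-level table of dictionary-attested readings.
# This is an assumption for illustration, not Unicode or Unihan data.
ATTESTED_READINGS = {
    "\u5EFE": (9, 20),  # 廾: "nine" (Zi Hui supplement), "twenty" (Yupian, Ci Yuan)
}

def possible_values(ch):
    """Return every numeric reading known for ch as a tuple (possibly empty)."""
    if ch in ATTESTED_READINGS:
        return ATTESTED_READINGS[ch]
    # Fall back to the single value that Unicode allows, if there is one
    value = unicodedata.numeric(ch, None)
    return () if value is None else (value,)

print(possible_values("\u5EFE"))
print(possible_values("九"))
```

Which of the readings applies in a given text still has to be decided from context; the table only makes the ambiguity explicit rather than hiding it behind a single value.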