Sunday, 26 March 2006

Unicode Character Names Part 3 : A Name by any other Name

As discussed in Part 1 there are some unfortunately misnamed characters, and as discussed in Part 2 Unicode character names once assigned can never be changed, and so misnamed characters are stuck with their names whether they like it or not. Whilst the characters themselves may or may not mind what they are called, characters with wrong names cause untold anguish to some people. Until now there has not been very much that can be done about the problem, and if anyone complains about a particular character name all that Unicode can advise them is that character names are intended as unique mnemonic identifiers and should not be relied on for identification of a character's function or meaning, which is unfortunate as most character names can be relied on for this purpose.

However, as from Unicode 5.0 (due for release in May) some of the most badly misnamed characters will be provided with a formal alias which implementers will be encouraged to use in user interfaces in place of the character's official name. At present the following eleven characters will be given formal aliases (see NameAliases.txt for the current list of formal aliases) :


Code Point | Character | Character Name | Formal Alias | Assigned
01A2 | Ƣ | LATIN CAPITAL LETTER OI | LATIN CAPITAL LETTER GHA | Unicode 5.0
01A3 | ƣ | LATIN SMALL LETTER OI | LATIN SMALL LETTER GHA | Unicode 5.0
0CDE | | KANNADA LETTER FA | KANNADA LETTER LLLA | Unicode 5.0
0E9D | | LAO LETTER FO TAM | LAO LETTER FO FON | Unicode 5.0
0E9F | | LAO LETTER FO SUNG | LAO LETTER FO FAY | Unicode 5.0
0EA3 | | LAO LETTER LO LING | LAO LETTER RO | Unicode 5.0
0EA5 | | LAO LETTER LO LOOT | LAO LETTER LO | Unicode 5.0
0FD0 | | TIBETAN MARK BSKA- SHOG GI MGO RGYAN | TIBETAN MARK BKA- SHOG GI MGO RGYAN | Unicode 5.0
2448 | | OCR DASH | MICR ON US SYMBOL | Unicode 6.1
2449 | | OCR CUSTOMER ACCOUNT NUMBER | MICR DASH SYMBOL | Unicode 6.1
A015 | | YI SYLLABLE WU | YI SYLLABLE ITERATION MARK | Unicode 5.0
FE18 | | PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET | PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET | Unicode 5.0
1D0C5 | 𝃅 | BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS | BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS | Unicode 5.0

There are two important points to be made about these formal aliases.

Firstly, formal aliases will only be given to the most perniciously misnamed characters, and not to every character which has a sub-optimal name, or for which there is academic dispute about the transliteration or naming convention to use.

Secondly, formal aliases are completely different to the aliases already provided in the Unicode code charts. The code chart aliases are alternative names by which a character may be known, and are provided for information only. On the other hand, the formal aliases conform to the Character-naming guidelines (and must be unique within the scope of both character names and formal aliases), and so look just like real character names, and are intended to be used in place of the real character names in applications' user interfaces.
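As a rough illustration of that intended use, a user interface might check for a formal alias first and fall back to the official name. The sketch below is in Python and hard-codes a few rows from the table above; the `FORMAL_ALIASES` dictionary and the `display_name` helper are names invented for this example, and a real application would presumably load the complete list from NameAliases.txt instead.

```python
import unicodedata

# A few formal aliases copied from the table above (code point -> alias).
# This is an illustrative subset, not the full NameAliases.txt list.
FORMAL_ALIASES = {
    0x01A2: "LATIN CAPITAL LETTER GHA",
    0x01A3: "LATIN SMALL LETTER GHA",
    0x0CDE: "KANNADA LETTER LLLA",
    0xA015: "YI SYLLABLE ITERATION MARK",
    0xFE18: "PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET",
}


def display_name(ch: str) -> str:
    """Return the name to show to a user: the formal alias if one is
    known, otherwise the official character name."""
    alias = FORMAL_ALIASES.get(ord(ch))
    if alias is not None:
        return alias
    # The second argument is returned for characters that have no name.
    return unicodedata.name(ch, "<unnamed character>")


if __name__ == "__main__":
    for ch in ("\u01A2", "\uA015", "A"):
        print(f"U+{ord(ch):04X}  official: {unicodedata.name(ch)}  shown: {display_name(ch)}")
```

The official name is left untouched and only the displayed label changes, which matches the idea that aliases supplement the immutable names rather than replace them.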

Here is some of the history behind these characters.

LATIN CAPITAL/SMALL LETTER OI [U+01A2/01A3]

The character names reflect the glyph shape of this letter (Ƣ), and do not represent its phonetic value. The letter was devised to represent the sound /gh/, and was used in the Kirghiz Latin alphabet between 1928 and 1940.

KANNADA LETTER FA [U+0CDE]

I've no idea how it came to be that a letter that is used to represent a Dravidian /l/ sound should be named KANNADA LETTER FA. The alias is spelled with three L's in accordance with the unofficial convention of distinguishing flavours of the same letter by reduplicating the ASCII letter used to represent it. KANNADA LETTER LA [U+0CB2] and KANNADA LETTER LLA [U+0CB3] are already taken, so the alias for U+0CDE is KANNADA LETTER LLLA; and if ever a new Kannada letter representing a different flavour of /l/ were to be invented, it would no doubt be named KANNADA LETTER LLLLA !

Lao Letters FO TAM/FO SUNG and LO LING/LO LOOT

The Lao script was part of the original Unicode 1.0 repertoire encoded in 1991, but these two pairs of swapped character names did not come to light until October last year, when a user of the French version of BabelMap queried the names of LAO LETTER FO TAM and LAO LETTER FO SUNG with me. I raised the matter on the Unicode mailing list, and got confirmation from a number of experts on the Lao script that there were indeed some mistakes with Lao character names. The result was that, although I know very little about Lao, I ended up writing a document summarizing the issues with Lao character names, and recommending possible solutions. One of my recommendations was that the misnamed letters be assigned the formal aliases that are now to be assigned.

Most of the consonants in the Lao block are named from the syllabic sound of the letter, plus the word SUNG "high" or TAM "low" to indicate a high tone class or low tone class if two letters share the same syllabic sound. U+0E9D and U+0E9F both share the same syllabic sound, FO, but the former character is a high tone class letter and should have been named LAO LETTER FO SUNG, whilst the latter character is a low tone class letter and should have been named LAO LETTER FO TAM. Unfortunately, the names of this pair of characters were assigned the wrong way round. As formal aliases must be unique and cannot duplicate an existing character name, it is not possible to assign what should have been their correct names as the formal aliases, and so the formal aliases for these two characters needed to be based on a different naming system. Luckily, the Lao people do not normally name their letters using SUNG and TAM, but use mnemonic names similar in form to English "A is for Apple", "B is for Ball", etc., and so the formal aliases could be based on the characters' mnemonic names. There are no official Lao mnemonic names, and the mnemonics may vary from one source to another, but for these two characters the most common mnemonic names are FO FON "fo as in the word fon [rain]" and FO FAY "fo as in the word fay [fire]". Thus the formal aliases for U+0E9D and U+0E9F are LAO LETTER FO FON and LAO LETTER FO FAY respectively. The most common mnemonic names for the other consonants are being added to the code charts as informal aliases.

The names for U+0EA3 and U+0EA5 are different from all the other Lao consonants, as they are the only two letters with character names that are based on mnemonic names, LO LING "lo as in the word ling [monkey]" and LO LOOT "lo as in the word loot [motor car]". However, the mnemonic names are the wrong way round, with U+0EA3 named LAO LETTER LO LING when it should have been named LAO LETTER LO LOOT, and U+0EA5 named LAO LETTER LO LOOT when it should have been named LAO LETTER LO LING. The reason for the different naming system for these two letters was presumably due to the fact that they are both low tone class letters, and they could not both be named LAO LETTER LO TAM. Actually U+0EA3, which has been deprecated by the Lao government since 1975, is used to represent [r] in foreign words, and so the two letters could have been differentiated by simply naming them LAO LETTER RO and LAO LETTER LO — which are the names used for the formal aliases.

TIBETAN MARK BSKA- SHOG GI MGO RGYAN [U+0FD0]

This character is full of woe. To start with its proper name was misappropriated by U+0F0A. U+0F0A, which was encoded in Unicode 2.0, is a Bhutanese mark used in formal documents to indicate an inferior addressing a superior, and should have been named something like TIBETAN MARK ZHU YIG GI MGO RGYAN, corresponding to the Tibetan zhu yig gi mgo rgyan ཞུ་ཡིག་གི་མགོ་རྒྱན "starting flourish for making a petition"; but somehow it got assigned the name TIBETAN MARK BKA- SHOG YIG MGO, corresponding to the Tibetan bka' shog yig mgo བཀའ་ཤོག་ཡིག་མགོ "starting flourish for giving a command".

So when the character that actually indicates a superior addressing an inferior was encoded in Unicode 4.1 at U+0FD0 it had to be given a slightly different but synonymous name, which should have been TIBETAN MARK BKA- SHOG GI MGO RGYAN, corresponding to the Tibetan bka' shog gi mgo rgyan བཀའ་ཤོག་གི་མགོ་རྒྱན. Unfortunately, BKA- became miswritten as BSKA- (a syllable that does not naturally occur in Tibetan) in the proposal (N2694).

Mistakes like this can be hard to spot if you don't know Tibetan. Luckily I do know Tibetan, and pointed out the mistake on the Unicode mailing list when the proposal was first announced; but unfortunately I didn't check to see that the mistake had been corrected as the character progressed towards standardization, so I feel somewhat responsible for this one.

YI SYLLABLE WU [U+A015]

When the Liangshan Yi script was originally proposed for encoding in 1995 (see N1187) there was some confusion over a character ꀕ that appeared in some Yi sources but not in others. This was given the name YI SYLLABLE WU and positioned between the syllables with vowel initials and the syllables with consonant initials.

In fact the character named YI SYLLABLE WU does not represent the syllable /wu/ or any specific syllable, but is a special syllable iteration mark that is used to indicate that the preceding syllable is repeated. Syllable reduplication is most commonly found in adjectives or verbs, where reduplication of the final syllable indicates the interrogative (note that the reduplicated syllable is pronounced in the mid level tone after a syllable in the secondary high tone, and that final "p", "t" and "x" are tone markers) :

  • vat ꃪ "OK"; and vat vat ꃪꀕ (for ꃪꃪ) "OK ?"
  • bbo ꁧ "go"; and bbox bbo ꁦꀕ (for ꁦꁧ) "shall we go ?"
  • zzyr muo ꋬꂻ "fine and well"; and zzyr muox muo ꋬꂺꀕ (for ꋬꂺꂻ) "Are you fine and well ?"

In the Yi phonetic alphabet, syllable iteration is represented by the letter "w", so that ꃪꀕ is represented as vatw (for vat vat), ꁦꀕ is represented as bboxw (for bbox bbo), and ꋬꂺꀕ is represented as zzyr muoxw (for zzyr muox muo). This is presumably where the mistaken notion that the character ꀕ represents the sound /wu/ came from. In fact the reason why the iteration mark ꀕ is represented by the letter "w" is simply because it looks like a letter "w" with two vertical strokes beneath it (i.e. the "w" approximates the glyph shape of the character, and not its phonetic value).
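The behaviour of the iteration mark can be sketched in code. The following Python sketch expands each ꀕ back into the syllable it stands for, applying the tone rule noted above (a repeated syllable is written in the mid level tone when it follows a syllable in the secondary high tone). It assumes that Yi syllable character names are simply YI SYLLABLE followed by the romanization in capitals, so that dropping a final X from the name gives the mid level counterpart; the function names are invented for this illustration, and the rule is only as complete as the description in this post.

```python
import unicodedata

ITERATION_MARK = "\uA015"  # U+A015, officially named YI SYLLABLE WU


def mid_level_form(syllable: str) -> str:
    """Return the mid level tone counterpart of a secondary high tone
    syllable (romanized with a final "x"). Any other character is
    returned unchanged."""
    name = unicodedata.name(syllable, "")
    if name.startswith("YI SYLLABLE ") and name.endswith("X"):
        try:
            # Assumption: the same name without the final X is the
            # mid level tone syllable.
            return unicodedata.lookup(name[:-1])
        except KeyError:
            pass  # no such character encoded; repeat the syllable as written
    return syllable


def expand_iteration(text: str) -> str:
    """Replace each iteration mark with the syllable it stands for."""
    out = []
    for ch in text:
        if ch == ITERATION_MARK and out:
            out.append(mid_level_form(out[-1]))
        else:
            out.append(ch)
    return "".join(out)
```

With the examples given above, ꃪꀕ should expand to ꃪꃪ, and ꁦꀕ to ꁦꁧ.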

More details on this character and other Yi issues can be found in a document that I wrote for the UTC in 2004.

PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET [U+FE18]

A really embarrassing mistake, that was first pointed out by Alan Wood shortly after the character was released into the wild as part of the Unicode 4.1 repertoire in March 2005. At that time the corresponding Amendment 1 of ISO/IEC 10646:2003 was only at the FDAM stage, and it wouldn't be published until November of that year, but it was too late to change it as the ISO and Unicode character names cannot differ.

BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS [U+1D0C5]

Twenty Byzantine musical symbols with the word FTHORA in their names were introduced in Unicode 3.1, but somewhere along the line one of them got mistyped as FHTORA. An easy mistake to make, but a difficult mistake to spot, especially if you don't read Greek — which I don't (and for anyone else who doesn't, according to an internet dictionary the word φθορά means "abrasion, decay, deterioration, vitiation, waste, wear, weathering").


Saturday, 25 March 2006

Unicode Character Names Part 2 : A Name is for Life

As discussed in my post on Good, Bad and Ugly Character Names, there are some Unicode characters with wrong or misleading names. Some people get very worked up about bad character names (or names that they perceive to be bad), and insist that Unicode must change the name. However, for reasons of stability with other standards, which may refer to Unicode characters by name rather than code point, character names once assigned cannot under any circumstances be changed.

Nevertheless, the names for 1,944 characters introduced in Unicode 1.0 are different from their current names (in the vast majority of cases the changes are very minor), but these name changes were required by the merger between the developing Unicode and ISO/IEC 10646 standards in 1993. One of the most noticeable differences between the 1.0 names (pre-merger) and the 1.1 names (post-merger) is that the 1.0 names reflect American English (because Unicode is in origin a consortium of American companies), whereas the 1.1 names have a more British English flavour (because, or so I am told, this was insisted upon by Bruce Paterson, who is British and was the editor of ISO/IEC 10646 until 2000).


Differences in Names between Unicode 1.0 and 1.1
Code Point | Unicode 1.0 Name | Unicode 1.1 Name
002E | PERIOD | FULL STOP
002F | SLASH | SOLIDUS
005C | BACKSLASH | REVERSE SOLIDUS
00B6 | PARAGRAPH SIGN | PILCROW SIGN
02D2 | MODIFIER LETTER CENTERED RIGHT HALF RING | MODIFIER LETTER CENTRED RIGHT HALF RING
02D3 | MODIFIER LETTER CENTERED LEFT HALF RING | MODIFIER LETTER CENTRED LEFT HALF RING
271B | OPEN CENTER CROSS | OPEN CENTRE CROSS
271C | HEAVY OPEN CENTER CROSS | HEAVY OPEN CENTRE CROSS
272B | OPEN CENTER BLACK STAR | OPEN CENTRE BLACK STAR
272C | BLACK CENTER WHITE STAR | BLACK CENTRE WHITE STAR
2732 | OPEN CENTER ASTERISK | OPEN CENTRE ASTERISK
273C | OPEN CENTER TEARDROP-SPOKED ASTERISK | OPEN CENTRE TEARDROP-SPOKED ASTERISK
2742 | CIRCLED OPEN CENTER EIGHT POINTED STAR | CIRCLED OPEN CENTRE EIGHT POINTED STAR
32A5 | CIRCLED IDEOGRAPH CENTER | CIRCLED IDEOGRAPH CENTRE
FE4A | SPACING CENTERLINE OVERSCORE | CENTRELINE OVERLINE
FE4E | SPACING CENTERLINE UNDERSCORE | CENTRELINE LOW LINE

Nevertheless, Unicode 1.1 did preserve a couple of American English spellings from Unicode 1.0 :

  • U+3238 PARENTHESIZED IDEOGRAPH LABOR
  • U+3298 CIRCLED IDEOGRAPH LABOR

Since Unicode 1.1 the character names have remained predominantly British English, with U+1D355 TETRAGRAM FOR LABOURING and a further seven characters with CENTRE in their name. However, two American spellings did slip in with Unicode 3.0 :

  • U+2F7E KANGXI RADICAL PLOW
  • U+2F8A KANGXI RADICAL COLOR

Since the merger between Unicode and ISO/IEC 10646 only two characters have ever changed their name, namely U+00C6 and U+00E6, which were originally called LATIN CAPITAL LETTER A E and LATIN SMALL LETTER A E in Unicode 1.0, then changed to LATIN CAPITAL LIGATURE AE and LATIN SMALL LIGATURE AE in Unicode 1.1 after the merger with ISO/IEC 10646, and finally changed to their current names LATIN CAPITAL LETTER AE and LATIN SMALL LETTER AE in Unicode 2.0. The latter change was due to representations by the Danish Standards Association who considered these two characters to be letters rather than ligatures; but this caused so much trouble and acrimony that the respective committees of Unicode and ISO/IEC 10646 resolved never again to make any name changes, regardless of the severity of the mistake or the triviality of the change required (see the Unicode Standard Stability Policy).


Unicode Character Names Part 1 : the Good the Bad and the Ugly

The one thing about Unicode that really seems to bug people more than anything else is that the character names are not always perfect, are sometimes misleading, and in a few cases are just plain wrong.

All Unicode characters have an official name which is used to uniquely identify them (but see Note 1 below the table). The 71,226 CJK ideographs have algorithmically derived names based on their code point (e.g. CJK UNIFIED IDEOGRAPH-4E00 for U+4E00), and the 11,172 Hangul syllables have algorithmically derived names based on their phonetic composition (e.g. HANGUL SYLLABLE GAH for U+AC1B, which is composed of the three jamo letters G, A and H). The remaining 15,257 characters have hand-crafted names, and it is perhaps not surprising that a few mistakes have crept in from time to time. These are some of the sorts of problems that may be found in Unicode character names :

  • Misuse of technical terms, such as ligature ("a character or type formed by two or more letters joined together"), digraph ("a group of two letters representing one sound") and ideograph ("a character symbolizing the idea of a thing without expressing the sequence of sounds in its name").
  • Misinterpretation of a character's glyph shape (e.g. U+2118 ℘ SCRIPT CAPITAL P, which is actually a calligraphic lowercase p).
  • Misunderstanding of a character's meaning or function (e.g. U+A015 ꀕ YI SYLLABLE WU, which is not a syllable pronounced "wu" but a syllable iteration mark).
  • Confusion of one character with another (for example the names of U+0EA3 LAO LETTER LO LING and U+0EA5 LAO LETTER LO LOOT are the wrong way round).
  • Simple typographic errors, such as U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET.
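The two algorithmic naming schemes mentioned before this list are simple enough to sketch in Python. The jamo short-name tables below follow the usual Hangul decomposition order (19 leading consonants, 21 vowels and 28 trailing positions, the first of which is empty); the function names are invented for this example, and the CJK helper only builds the name string without checking that the code point really is a unified ideograph.

```python
import unicodedata

# Jamo short names in their standard order.
LEADS = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S", "SS", "",
         "J", "JJ", "C", "K", "T", "P", "H"]
VOWELS = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
TRAILS = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]

S_BASE = 0xAC00                                      # first Hangul syllable
S_COUNT = len(LEADS) * len(VOWELS) * len(TRAILS)     # 19 * 21 * 28 = 11,172


def hangul_syllable_name(cp: int) -> str:
    """Derive the name of a precomposed Hangul syllable from its jamo."""
    index = cp - S_BASE
    if not 0 <= index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a precomposed Hangul syllable")
    lead, rest = divmod(index, len(VOWELS) * len(TRAILS))
    vowel, trail = divmod(rest, len(TRAILS))
    return "HANGUL SYLLABLE " + LEADS[lead] + VOWELS[vowel] + TRAILS[trail]


def cjk_ideograph_name(cp: int) -> str:
    """Build a CJK unified ideograph name from the code point alone."""
    return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"


if __name__ == "__main__":
    print(hangul_syllable_name(0xAC1B))   # HANGUL SYLLABLE GAH
    print(cjk_ideograph_name(0x4E00))     # CJK UNIFIED IDEOGRAPH-4E00
    # Cross-check against the names shipped with Python's unicodedata module.
    print(hangul_syllable_name(0xAC1B) == unicodedata.name("\uAC1B"))
```

Because these names are generated mechanically, they are not exposed to the kind of hand-made slips listed above.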

In addition to these sorts of problems, there are also many character names that are technically "correct", but which some people still object to, for example because the name represents the pronunciation of the character in one language but is pronounced differently in their language, or because the Unicode name is based on one system of transliteration, but they prefer a different system of transliteration (character names are constrained to the letters "A" through "Z", the digits "0" through "9", space and hyphen, so often there is no choice but to resort to awkward names such as DEVANAGARI LETTER LLLA). In cases such as these the alternative pronunciation or transliteration may be annotated in the Unicode code charts.
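The restricted repertoire for character names is easy to express as a pattern. The Python sketch below checks only the set of permitted characters described above (capital letters, digits, space and hyphen); the helper name is invented for this example, and the actual naming guidelines contain further rules that are not modelled here.

```python
import re

# Capital letters A-Z, digits 0-9, space and hyphen, as described above.
NAME_PATTERN = re.compile(r"[A-Z0-9 -]+")


def uses_permitted_characters(name: str) -> bool:
    """Return True if the name uses only the permitted characters."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("DEVANAGARI LETTER LLLA",
                      "CJK UNIFIED IDEOGRAPH-4E00",
                      "MODIFIER LETTER HÁČEK"):
        print(candidate, uses_permitted_characters(candidate))
```

A name containing diacritics fails the check, which is why transliterations in character names have to make do with unaccented letters.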

One of the things that really annoys some people is that Han characters (漢字 hànzì / kanji / hanja) are named as "CJK [Unified/Compatibility] Ideographs", when technically they are not ideographs ("a character symbolizing the idea of a thing without expressing the sequence of sounds in its name" according to the SOED). Nor are they limited to Chinese, Japanese and Korean (CJK) usage, but have also been used for Vietnamese (ideographs used to write Vietnamese are called chữ nôm 字喃 / 𡦂喃 / 𡨸喃) and Zhuang (ideographs used to write Zhuang are called sawndip). Thus on two counts two-thirds of Unicode characters could be considered to be wrongly named. As Confucius put it :


名不正,則言不順;言不順,則事不成;事不成,則禮樂不興;禮樂不興,則刑罰不中;刑罰不中,則民無所措手足。

When names are not correct, what is said will not sound reasonable; when what is said does not sound reasonable, affairs will not culminate in success; when affairs do not culminate in success, rites and music will not flourish; when rites and music do not flourish, punishments will not fit the crime; when punishments do not fit the crime, the common people will not know where to put hand and foot.

Lun Yu 論語 [The Analects] 13.3 (D.C.Lau trans.)


But, hey, I'm not a Confucianist, so I don't mind too much about wrong or misleading character names (except for U+A856 of course, which will irk me to the grave), and I have no problems referring to 漢字 as ideographs -- to me it's just a convenient label.

Anyway here is my list of characters which either deliberately or accidentally have sub-optimal names. This is by no means an exhaustive list, and other people will no doubt have their own suggestions to add.


Wrong or Misleading Character Names
Code Point Character Character Name Comments
0132
0133
IJ
ij
LATIN CAPITAL LIGATURE IJ
LATIN SMALL LIGATURE IJ
These are not ligatures as the "i" and "j" are not joined together.
01A2
01A3
Ƣ
ƣ
LATIN CAPITAL LETTER OI
LATIN SMALL LETTER OI
These characters represent the letter "gha" used in the Kirghiz Latin alphabet between 1928 and 1940, and have nothing to do with either "o" or "i".
01BE ƾ LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE Whilst this character superficially looks like an inverted glottal stop, it is in fact derived from a ligature of the letters "t" and "s", which explains its use as an archaic phonetic representation of [ts] as an affricate (e.g. for the sound of the "z" in German Zimmer "room").
0238
0239
ȸ
ȹ
LATIN SMALL LETTER DB DIGRAPH
LATIN SMALL LETTER QP DIGRAPH
These characters are ligatures of "db" and "qp" respectively, and not digraphs.
02C7
030C
032C
ˇ
̌
̬
CARON
COMBINING CARON
COMBINING CARON BELOW
These and 42 other precomposed characters such as U+010D LATIN SMALL LETTER C WITH CARON č use the word "caron" to signify what is normally called a háček ("little hook" in Czech). Indeed, in Unicode 1.0 the names of these letters all used the term HACEK (e.g. U+02C7 MODIFIER LETTER HACEK), but all instances of "hacek" were changed to "caron" when Unicode merged with ISO/IEC 10646.
Nobody knows what the etymology of the term "caron" is, or where and when it was coined, but the earliest known use of the term is in the 1967 edition of the United States Government Printing Office Style Manual, whence it was introduced into ISO character encoding standards (see Antedating the Caron for details).
034F   COMBINING GRAPHEME JOINER This character does not combine graphemes, but rather indicates that adjacent characters should be treated as a graphemic unit.
047C
047D
Ѽ
ѽ
CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
CYRILLIC SMALL LETTER OMEGA WITH TITLO
The diacritic on these characters is not actually a "titlo" (although everyone agrees that it is not a titlo, it is not clear exactly what the origin of the diacritic mark is), which explains why they do not decompose to U+0460/U+0461 CYRILLIC CAPITAL/SMALL LETTER OMEGA and U+0483 COMBINING CYRILLIC TITLO. The character is used to represent the exclamations "о!" and "оле!", and is known in Russian as "beautiful omega" красивая омега or "wide omega" широкая омега.
0598 ֘ HEBREW ACCENT ZARQA This character is not actually a "zarqa" at all (which is U+05AE), but is intended to represent the sign called "tsinorit" that is used in the three poetic books (Job, Proverbs, Psalms), and that is centred above a base letter.
05AE ֮ HEBREW ACCENT ZINOR This character is intended to represent the sign called "zarqa" that is used in the twenty-one books of the Old Testament, as well as to represent the sign called "tsinor" (sometimes transliterated "zinor") that is used in the three poetic books (Job, Proverbs, Psalms). Both these signs share the same glyph form and are placed above and to the left of a base letter.
0670 ٰ ARABIC LETTER SUPERSCRIPT ALEF This is actually a vowel sign, not a letter.
0B83 TAMIL SIGN VISARGA Although this sign derives from a special type of visarga, it is not called a visarga in Tamil, but is known as an "āytham" (which is a Tamilized form of the Sanskrit word "āśrita", being a class of visarga).
0CDE KANNADA LETTER FA This letter has nothing to do with the sound /f/, but actually represents a Dravidian /l/, and should rightly have been called KANNADA LETTER LLLA, in line with the corresponding letters in other Indic scripts, such as U+0934 DEVANAGARI LETTER LLLA, U+0BB4 TAMIL LETTER LLLA and U+0D34 MALAYALAM LETTER LLLA.
0E9D
0E9F

LAO LETTER FO TAM
LAO LETTER FO SUNG
The character names for U+0E9D and U+0E9F are swapped. U+0E9D is a high tone class letter, and should have been named LAO LETTER FO SUNG (SUNG meaning "high"); whereas U+0E9F is a low tone class letter, and should have been named LAO LETTER FO TAM (TAM meaning "low").
0EA3
0EA5

LAO LETTER LO LING
LAO LETTER LO LOOT
The character names for U+0EA3 and U+0EA5 are swapped. LO LING is the mnemonic name for U+0EA5 ("lo as in ling [monkey]"); whereas LO LOOT is the badly transliterated mnemonic name for U+0EA3 ("lo as in loot" for "ro as in rot [motor car]").
0F0A TIBETAN MARK BKA- SHOG YIG MGO This character is meant to represent the sign that is used in formal documents in Bhutan to indicate an inferior addressing a superior (the "petition honorific"), but the Tibetan name BKA- SHOG YIG MGO actually indicates a superior addressing an inferior ("starting flourish for giving a command"). When the character that really indicates a superior addressing an inferior was later encoded at U+0FD0, it had to be assigned a slightly different but synonymous name, TIBETAN MARK BSKA- SHOG GI MGO RGYAN ("starting flourish for giving a command").
0F0B TIBETAN MARK INTERSYLLABIC TSHEG The tsheg mark is not restricted to intersyllabic usage, and may occur at the end of a terminal syllable or multiple times as "justifying tshegs" at the end of a line.
0F0C TIBETAN MARK DELIMITER TSHEG BSTAR This character is simply a non-breaking version of the "tsheg" mark (U+0F0B) that is used exclusively between the letter NGA (U+0F44) and the "shad" mark (U+0F0D).
0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN Mistake for TIBETAN MARK BKA- SHOG GI MGO RGYAN (the syllable BSKA- does not naturally occur in Tibetan).
156F CANADIAN SYLLABICS TTH This character looks like an asterisk, and it probably is an asterisk. The imaginary letter TTH was accidentally encoded when someone mistook an asterisk denoting a proper noun as a letter in the Canadian aboriginal script.
1880
1881

MONGOLIAN LETTER ALI GALI ANUSVARA ONE
MONGOLIAN LETTER ALI GALI VISARGA ONE
The ONE in the names of these two characters is spurious. Each of these two characters has two different glyph forms, which are distinguished by the application or not of U+180B MONGOLIAN FREE VARIATION SELECTOR ONE (FVS-1) :
<1880> ᢀ and <1880 180B> ᢀ᠋ (actually, the former is technically a CANDRABINDU and the latter an ANUSVARA, and even though CANDRABINDU and ANUSVARA are used interchangeably in Mongolian contexts, I would have thought that they should have been encoded separately, as is the case with Tibetan and other Brahmic scripts);
<1881> ᢁ and <1881 180B> ᢁ᠋.
My theory is that in an early draft for the Mongolian block each variant form of these two characters was assigned a separate code point, with names differentiated by ONE and TWO :
MONGOLIAN LETTER ALI GALI ANUSVARA ONE
MONGOLIAN LETTER ALI GALI ANUSVARA TWO
MONGOLIAN LETTER ALI GALI VISARGA ONE
MONGOLIAN LETTER ALI GALI VISARGA TWO
When a decision was later made to unify the variant forms of the two characters and distinguish their variant forms by means of variation selectors, MONGOLIAN LETTER ALI GALI ANUSVARA TWO and MONGOLIAN LETTER ALI GALI VISARGA TWO were deleted, leaving MONGOLIAN LETTER ALI GALI ANUSVARA ONE and MONGOLIAN LETTER ALI GALI VISARGA ONE unchanged.
200B ZERO WIDTH SPACE Being zero-width, it is not actually a "space".
2118 SCRIPT CAPITAL P Actually a lowercase calligraphic "p".
262B FARSI SYMBOL This is not a symbol of Farsi (the modern Persian language), but is in fact the official emblem of the government of the Islamic Republic of Iran. In Unicode 1.0 this character was properly named SYMBOL OF IRAN, but the name was changed on merger with ISO/IEC 10646.
309F
30FF

HIRAGANA DIGRAPH YORI
KATAKANA DIGRAPH KOTO
These characters are ligatures, not digraphs.
A015 YI SYLLABLE WU This is neither a syllable nor pronounced "wu", but is actually a syllable iteration mark, similar in function to the ideographic iteration marks such as U+3005 々 IDEOGRAPHIC ITERATION MARK.
FA0E
FA0F
FA11
FA13
FA14
FA1F
FA21
FA23
FA24
FA27
FA28
FA29
CJK COMPATIBILITY IDEOGRAPH-FA0E
CJK COMPATIBILITY IDEOGRAPH-FA0F
CJK COMPATIBILITY IDEOGRAPH-FA11
CJK COMPATIBILITY IDEOGRAPH-FA13
CJK COMPATIBILITY IDEOGRAPH-FA14
CJK COMPATIBILITY IDEOGRAPH-FA1F
CJK COMPATIBILITY IDEOGRAPH-FA21
CJK COMPATIBILITY IDEOGRAPH-FA23
CJK COMPATIBILITY IDEOGRAPH-FA24
CJK COMPATIBILITY IDEOGRAPH-FA27
CJK COMPATIBILITY IDEOGRAPH-FA28
CJK COMPATIBILITY IDEOGRAPH-FA29
These are all unified ideographs in their own right, not compatibility ideographs (which are duplicate ideographs encoded for roundtrip mapping to legacy character sets where the same character is encoded more than once, either as pronunciation variants or as minor glyph variants).
FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET Mistake for PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET.
1D0C5 𝃅 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS Mistake for BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS.
1D13A 𝄺 MUSICAL SYMBOL MULTI REST The glyph is actually a "breve rest" or "double whole rest". A new character named MUSICAL SYMBOL MULTIPLE MEASURE REST is introduced in Unicode 5.1 at U+1D129 to represent a rest of arbitrary length (sometimes called an H-bar rest).
1D300
1D301
1D302
1D303
1D304
1D305
𝌀
𝌁
𝌂
𝌃
𝌄
𝌅
MONOGRAM FOR EARTH
DIGRAM FOR HEAVENLY EARTH
DIGRAM FOR HUMAN EARTH
DIGRAM FOR EARTHLY HEAVEN
DIGRAM FOR EARTHLY HUMAN
DIGRAM FOR EARTH
TaiXuan Jing symbols are made up of a combination of three different elements, an unbroken line that represents heaven (Chinese tian 天), a single broken line that represents earth (Chinese di 地) and a double broken line that represents human (Chinese ren 人). The monograms and digrams are named using the terms HEAVEN, EARTH and HUMAN, but they map the single broken line to HUMAN and the double broken line to EARTH, which is not the normal association.
The correct mappings for these characters are :
MONOGRAM FOR EARTH = ren (human)
DIGRAM FOR HEAVENLY EARTH = tian ren (heaven/human)
DIGRAM FOR HUMAN EARTH = di ren (earth/human)
DIGRAM FOR EARTHLY HEAVEN = ren tian (human/heaven)
DIGRAM FOR EARTHLY HUMAN = ren di (human/earth)
DIGRAM FOR EARTH = ren ren (human/human)

Note 1. The 65 control characters at <0000..001F>, <007F> and <0080..009F> do not have formal names in Unicode or ISO/IEC 10646, and they are generally referred to by their designations in ISO/IEC 6429. However, there is a move afoot to formally define names for these characters (see N3046 "Improving formal definition for control characters").
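The point made in Note 1 can be checked with Python's `unicodedata` module, which reports no official name for these code points. This is a small sketch rather than anything definitive; the `official_name` helper and the `CONTROL_CODE_POINTS` list are invented for this example.

```python
import unicodedata
from typing import Optional

# The three ranges given in Note 1.
CONTROL_CODE_POINTS = list(range(0x00, 0x20)) + [0x7F] + list(range(0x80, 0xA0))


def official_name(cp: int) -> Optional[str]:
    """Return the official character name, or None if there is none."""
    # An empty string is used as the fallback for unnamed characters.
    name = unicodedata.name(chr(cp), "")
    return name or None


if __name__ == "__main__":
    unnamed = [cp for cp in CONTROL_CODE_POINTS if official_name(cp) is None]
    print(len(CONTROL_CODE_POINTS), "control characters,", len(unnamed), "without a name")
```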



Addendum [2006-05-14]

Unicode has now issued their own list of anomalous character names as Unicode Technical Note 27 : Known Anomalies in Unicode Character Names.


Saturday, 4 March 2006

The Origins of Go

As noted in my post on Tibetan Go, one of the features that sets Tibetan Go apart from the game played elsewhere in East Asia is that Tibetan Go is played on a board with a 17 × 17 grid of 289 points, whereas in China, Japan and Korea the game is played on a board with a 19 × 19 grid of 361 points. Tibetan sources suggest that Go has been played in Tibet since at least the time of the first great Tibetan king, Srong-brtsan-sgam-po (reigned c.627-650), and this was confirmed by archaeological evidence in 2000, when a crude stone Go board was unearthed at the site of the palace where King Srongtsan Gampo was born :


Stone Go Board from Tibet


Although the game of Go outside of Tibet is now played on a 19 × 19 grid, historical records and archaeological evidence indicate that this was not always so, and that the earliest form of the game in China was actually played on a 17 × 17 grid, as in Tibet. The earliest complete Go board to have been discovered is this stone board with a 17 × 17 grid (and the five primary star points elaborately marked) that was found in 1952 in a tomb at Wangdu 望都 in Hebei province dating to the late Eastern Han period (25-220 CE) :


Eastern Han Stone Go Board (from the Wangdu Han Tomb)

Source : Wangdu Hanmu Bihua 望都漢墓壁畫 (Beijing, 1955)



In the 1990s this fragment of a crude pottery board was excavated from the site of the southern gate to the mausoleum of Emperor Jing Di 景帝 (reigned 156-141 BCE) and his consort at Yangling 陽陵 near Xianyang in Shaanxi province. Although the board is not associated with the main burial, it must date to the Western Han period (206 BCE - 25 CE), and so is the earliest physical evidence of the game of Go. Unfortunately it is not possible to tell whether this was a 17 × 17 board or 19 × 19 board (it has one star point marked at the lower left corner, but no star points on the sides or in the centre).


Fragment of Western Han Pottery Go Board

28.5 × 19.7 cm at the widest points

Source : 围棋考古 (2007-03-27)

The correspondence between Tibetan Go boards and early Chinese Go boards has led some game historians to conclude that Go originated in Tibet, and then spread to China, where the board later developed to a 19 × 19 grid. However, there is a much longer documented history of Go playing in China than there is in Tibet, and it seems more likely to me that Go originated in China, and spread to Tibet at a time when the 17 × 17 board was still in use; then due to the isolation of Tibet, Tibetan Go preserved the archaic 17 × 17 board when it was superseded by the 19 × 19 board in China and elsewhere.

As to when the game of Go first came into being, most histories of the game will tell you that it has been around for three or four thousand years, whereas in fact there is no concrete evidence for the existence of Go before about two thousand years ago. The only evidence for the game in sources dating from the 1st millennium BCE is the occasional use of the word 弈 "to play Go", as in this famous saying by Confucius :


子曰:“飽食終日,無所用心,難矣哉!不有博弈者乎?為之,猶賢乎已。”

Confucius said : "If you eat your fill all day long, and have nothing to apply your mind to, that is disastrous ! Are there not games of chance and skill ? Would it not be better to play such games than do nothing ?"

Lun Yu 論語 "The Analects of Confucius" (Zhonghua Shuju, 1980) 17.22.


Here 博 is usually taken to refer to the game of Liubo 六博 (an early board game of chance in which players throw six sticks), and 弈 is usually taken to refer to the game of Go (a game of pure skill). Although there is not enough detail in pre-Han sources such as the Analects that mention 弈 to be certain that they are actually referring to the game of Go rather than some other game, Han dynasty lexical sources do confirm that 弈 is another name for wéiqí 圍棋. The first dictionary of Chinese characters, the Shuo Wen 說文, written by Xu Shen 許慎 in about the year 100 CE, defines 弈 as wéiqí 圍棊, and the Fang Yan 方言, a comparative lexicon of Chinese words used in different parts of China, compiled by Yang Xiong 揚雄 (53 BCE - 18 CE), states that "Weiqi is called yi; east of the Hangu Pass, in the area of Qi and Lu, everyone calls it yi" 圍棊謂之弈。自關而東齊魯之間皆謂之弈。

Nevertheless, I am still dubious as to whether early references to 弈 do actually refer to the game of Go because there is a complete absence of any supporting archaeological evidence prior to the Han dynasty. This would not be significant in itself were it not for the fact that the game of Liubo is abundantly attested in the archaeological record. Not only have dozens of Liubo boards been unearthed from Han and pre-Han tombs, but scores of Han and pre-Han murals and engravings showing people playing Liubo have been found, as well as statuettes of Liubo players.

This example of an elaborately carved stone Liubo board dates to the 4th century BCE (Warring States period) :


Liubo Board from the Warring States Period

Source: Mysteries of Ancient China (British Museum, 1996) page 76.


And this is a set of wooden funerary statuettes playing Liubo (note the distinctive "TLV" pattern on the board), found in a late Western Han tomb from Gansu :


Funerary Statuettes of Liubo Players


And these are a couple of Han dynasty pictorial representations of people playing Liubo :


Han Dynasty Pictorial Brick from Sichuan

Note that the Liubo board is usually depicted to the side of the mat where the six sticks are laid out.


Han Dynasty Pictorial Stone from Jiangsu

Note the typical posture of the left player, with one arm raised in readiness and the other arm stretched out towards the six sticks.


In stark contrast to the wealth of early material for Liubo, only two Han dynasty Go boards are known, and the earliest certain image of Go-playing does not occur until post-Han. The picture below is the left side panel of an illustrated stone monument depicting the life of Buddha that dates to the Northern Qi (550-577). The board shown is probably a Go board; it has only an 11 × 11 grid, but this might be simply because there was not enough room to engrave more lines.


Northern Qi Stone Monument depicting the Life of Buddha

北齊周榮祖造像碑

Source : Zhongguo Meishu Quanji 中國美術全集 (繪畫編) vol.19 plate 26


To my mind, this suggests that Go may not have been invented until the Han dynasty. At any rate, once Go became popular it quickly supplanted Liubo. There is virtually no evidence of Liubo being played post-Han, and it seems to have become extinct within a few generations of the rise to prominence of Go. The precise rules of Liubo are now long since lost.

The earliest unambiguous references to the game of Go are not found until the Eastern Han dynasty, and the first complete description of the game is in the work Yi Jing 藝經 "Classic of Arts" by Handan Chun 邯鄲淳, who lived during the third century (Three Kingdoms period) :


棊局縱橫各十七道,合二百八十九道,白黑棊子各一百五十枚。

The Go board has seventeen vertical and seventeen horizontal lines, making two hundred and eighty-nine points; the black and white stones are each one hundred and fifty in number.

Shuo Fu Sanzhong 說郛三種 (Shanghai Guji Chubanshe, 1988) p.4693.


Sometime before the Tang dynasty (618-907) the 17 × 17 Go board was replaced by the 19 × 19 board, as is evidenced by this ceramic Go board dating to the Sui dynasty (581-618) :


Sui Dynasty Ceramic Go Board

Source : Survey of Chinese Ceramics (Taibei, 1991) vol.2 page 171


Nevertheless, there is evidence that the 17 × 17 board remained in use to some extent into the Tang dynasty, especially in peripheral regions of the Chinese empire. For example, in this 7th century silk picture from Xinjiang, the board is shown as a 17 × 16 grid, which is probably a badly-drawn representation of a 17 × 17 board (the artist seems to have run out of space on the right-hand side, so that the 16th vertical line is just squeezed in, but there is no room for the 17th) :


Tang Dynasty Silk Painting of a Go Player


Non-standard boards (e.g. 13 × 13) continued to be used in addition to the standard 19 × 19 board for many more centuries, as I discuss in my post on Playing Go on a Chinese Chess Board.



See Also


Tibetan Go

When discussing the French translations of the Tibetan astrological pebble symbols recently, an interesting question was asked : what sort of "pebble" do these symbols represent ? Well, the Tibetan word rdel རྡེལ, although customarily translated into English as "pebble", is a general word for any small, hard, roundish object, including both natural stones and human artifacts. My theory is that the black and white pebbles used in divination are the same as the small, round black and white pieces used to play the classic board game of Go (go 碁 in Japanese or wéi qí 圍棋 in Chinese).

For anyone not familiar with Go, it is a board game of enormous popularity in China, Korea and Japan that is played by placing black and white "stones" (rdel in Tibetan) on a 19 x 19 grid of 361 points in order to encompass territory. The rules of Go are simple, but due to the large grid size on which it is played, the possibilities for different positions are almost endless, and so it has a far greater degree of complexity than chess. Along with painting, calligraphy and playing the zither (Chinese qín 琴), mastery of the game of Go was traditionally considered to be one of the four accomplishments of a Chinese lady or gentleman.

The game of Go is also played in Tibet, where it is known as mig mangs མིག་མངས (pronounced ming mang or mi mang), meaning "many eyes". The game is not as widely played in Tibet as it is elsewhere in East Asia, being traditionally restricted to aristocratic families and some communities of monks. The Tibetan form of Go has a number of idiosyncratic rules, but the most significant difference from the game played elsewhere is that the game is played on a board with a 17 x 17 grid of 289 points. In addition to the standard form of the game, there is also a uniquely Tibetan game (somewhat like Othello or Reversi) in which black and white stones are lined up along adjacent edges of the board and moved in straight lines in order to pincer the opponent's stones. If you want to know more about Tibetan Go, then read The Game of Go in Ancient and Modern Tibet by Peter Shotwell.

Anyway, what has the game of Go got to do with divination ? Games, and board games in particular, have often had a ritualistic or magical element, and in a country so imbued with ritual and religion as Tibet is, it is easy to see how the tools of game-playing would almost inevitably be adopted for the purposes of divination. It is a short step from playing the black and white stones as a game, to casting the stones onto the board, and reading good or bad fortune from the resulting patterns. Although I don't really have any evidence to back up my theory, this quote from Peter Shotwell's article on Tibetan Go does suggest a close affinity between Go stones and astrological pebbles :

I showed some villagers a [Go] game diagram and they became fearful. They didn’t want to talk about it, saying it was ‘Black Bon’ and, ‘This is what they used, when they would tell you things like how long you had to live.’



Addendum [2006-11-04] : Not Tibetan Go

Yesterday I was looking through some photos that I took in Lhasa more than twenty years ago (!!), and which had been hidden away unseen for almost as many years, when I came across this one (which I have no recollection of taking). At first glance it looked like it was a picture of a game of Tibetan Go, but on closer examination I was disappointed to see that it could not be (no grid lines, very large black and white pieces, and pieces not laid out as they would be in a game of Go). It was still an interesting picture, and would have been a good one but for the wire, so I thought I would post it here anyway.



I think it must be a game akin to Shove Ha'penny, where the players take turns to knock the other player's pieces into the holes at the four corners of the board, but I would love to hear from anyone who can identify the game, and tell me whether this is a traditional Tibetan game or not.