Saturday, 25 March 2006

Unicode Character Names Part 1 : the Good the Bad and the Ugly

The one thing about Unicode that really seems to bug people more than anything else is that the character names are not always perfect, are sometimes misleading, and in a few cases are just plain wrong.

All Unicode characters have an official name which is used to uniquely identify them (but see Note 1 below the table). The 71,226 CJK ideographs have algorithmically derived names based on their code point (e.g. CJK UNIFIED IDEOGRAPH-4E00 for U+4E00), and the 11,172 Hangul syllables have algorithmically derived names based on their phonetic composition (e.g. HANGUL SYLLABLE GAH for U+AC1B, which is composed of the three jamo letters G, A and H). The remaining 15,257 characters have hand-crafted names, and it is perhaps not surprising that a few mistakes have crept in from time to time. These are some of the sorts of problems that may be found in Unicode character names :

  • Misuse of technical terms, such as ligature ("a character or type formed by two or more letters joined together"), digraph ("a group of two letters representing one sound") and ideograph ("a character symbolizing the idea of a thing without expressing the sequence of sounds in its name").
  • Misinterpretation of a character's glyph shape (e.g. U+2118 ℘ SCRIPT CAPITAL P, which is actually a calligraphic lowercase p).
  • Misunderstanding of a character's meaning or function (e.g. U+A015 ꀕ YI SYLLABLE WU, which is not a syllable pronounced "wu" but a syllable iteration mark).
  • Confusion of one character with another (for example the names of U+0EA3 LAO LETTER LO LING and U+0EA5 LAO LETTER LO LOOT are the wrong way round).
  • Simple typographic errors, such as U+FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET.
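
As an aside on the algorithmically derived names mentioned above, here is a minimal Python sketch of both schemes. The function names are made up for illustration, the jamo short names are those of the Hangul syllable name algorithm in the Unicode Standard as I understand it, and the CJK function assumes the caller really is passing a CJK unified ideograph (it does not check the range).

```python
# Sketch only: function names are hypothetical, not part of any library.

S_BASE = 0xAC00          # first Hangul syllable
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT   # 11,172 syllables

# Short names of the leading consonants, vowels and trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def hangul_syllable_name(cp: int) -> str:
    """Derive the name of a Hangul syllable from its jamo composition."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a Hangul syllable")
    l_index, rest = divmod(s_index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return "HANGUL SYLLABLE " + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]


def cjk_unified_ideograph_name(cp: int) -> str:
    """Derive the name of a CJK unified ideograph (range NOT checked)."""
    return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"


print(hangul_syllable_name(0xAC1B))        # HANGUL SYLLABLE GAH
print(cjk_unified_ideograph_name(0x4E00))  # CJK UNIFIED IDEOGRAPH-4E00
```

For U+AC1B the syllable index is 27, which gives leading consonant G, vowel A and trailing consonant H, matching the example given above.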

In addition to these sorts of problems, there are also many character names that are technically "correct", but which some people still object to, for example because the name represents the pronunciation of the character in one language but is pronounced differently in their language, or because the Unicode name is based on one system of transliteration, but they prefer a different system of transliteration (character names are constrained to the letters "A" through "Z", the digits "0" through "9", space and hyphen, so often there is no choice but to resort to awkward names such as DEVANAGARI LETTER LLLA). In cases such as these the alternative pronunciation or transliteration may be annotated in the Unicode code charts.
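
To make the constraint on the name repertoire concrete, here is a small Python sketch that checks whether a proposed name sticks to the permitted characters. The function name is hypothetical, and the check covers only the repertoire described above (A to Z, 0 to 9, space and hyphen); any further rules the standard may impose on names are not checked.

```python
import re

# Assumption: only the repertoire listed in the text is tested
# (capital letters A-Z, digits 0-9, space and hyphen).
NAME_REPERTOIRE = re.compile(r"[A-Z0-9 -]+")


def uses_only_name_characters(name: str) -> bool:
    """Return True if every character of `name` is in the permitted repertoire."""
    return NAME_REPERTOIRE.fullmatch(name) is not None


print(uses_only_name_characters("DEVANAGARI LETTER LLLA"))  # True
print(uses_only_name_characters("DEVANAGARI LETTER ḶA"))    # False
```

A transliteration with diacritics fails the check, which is why names such as LLLA end up being used instead.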

One of the things that really annoys some people is that Han characters (漢字 hànzì / kanji / hanja) are named as "CJK [Unified/Compatibility] Ideographs", when technically they are not ideographs ("a character symbolizing the idea of a thing without expressing the sequence of sounds in its name" according to the SOED). Nor are they limited to Chinese, Japanese and Korean (CJK) usage, but have also been used for Vietnamese (ideographs used to write Vietnamese are called chữ nôm 字喃 / 𡦂喃 / 𡨸喃) and Zhuang (ideographs used to write Zhuang are called sawndip). Thus on two counts two-thirds of Unicode characters could be considered to be wrongly named. As Confucius put it :


名不正,則言不順;言不順,則事不成;事不成,則禮樂不興;禮樂不興,則刑罰不中;刑罰不中,則民無所措手足。

When names are not correct, what is said will not sound reasonable; when what is said does not sound reasonable, affairs will not culminate in success; when affairs do not culminate in success, rites and music will not flourish; when rites and music do not flourish, punishments will not fit the crime; when punishments do not fit the crime, the common people will not know where to put hand and foot.

Lun Yu 論語 [The Analects] 13.3 (D.C.Lau trans.)


But, hey, I'm not a Confucianist, so I don't mind too much about wrong or misleading character names (except for U+A856 of course, which will irk me to the grave), and I have no problems referring to 漢字 as ideographs -- to me it's just a convenient label.

Anyway here is my list of characters which either deliberately or accidentally have sub-optimal names. This is by no means an exhaustive list, and other people will no doubt have their own suggestions to add.


Wrong or Misleading Character Names
Code Point Character Character Name Comments
0132 IJ LATIN CAPITAL LIGATURE IJ
0133 ij LATIN SMALL LIGATURE IJ
These are not ligatures as the "i" and "j" are not joined together.
01A2 Ƣ LATIN CAPITAL LETTER OI
01A3 ƣ LATIN SMALL LETTER OI
These characters represent the letter "gha" used in the Kirghiz Latin alphabet between 1928 and 1940, and have nothing to do with either "o" or "i".
01BE ƾ LATIN LETTER INVERTED GLOTTAL STOP WITH STROKE Whilst this character superficially looks like an inverted glottal stop, it is in fact derived from a ligature of the letters "t" and "s", which explains its use as an archaic phonetic representation of [ts] as an affricate (e.g. for the sound of the "z" in German Zimmer "room").
0238 ȸ LATIN SMALL LETTER DB DIGRAPH
0239 ȹ LATIN SMALL LETTER QP DIGRAPH
These characters are ligatures of "db" and "qp" respectively, and not digraphs.
02C7 ˇ CARON
030C ̌ COMBINING CARON
032C ̬ COMBINING CARON BELOW
These and 42 other precomposed characters such as U+010D LATIN SMALL LETTER C WITH CARON č use the word "caron" to signify what is normally called a háček ("little hook" in Czech). Indeed, in Unicode 1.0 the names of these letters all used the term HACEK (e.g. U+02C7 MODIFIER LETTER HACEK), but all instances of "hacek" were changed to "caron" when Unicode merged with ISO/IEC 10646.
Nobody knows what the etymology of the term "caron" is, or where and when it was coined, but the earliest known use of the term is in the 1967 edition of the United States Government Printing Office Style Manual, whence it was introduced into ISO character encoding standards (see Antedating the Caron for details).
034F   COMBINING GRAPHEME JOINER This character does not combine graphemes, but rather indicates that adjacent characters should be treated as a graphemic unit.
047C Ѽ CYRILLIC CAPITAL LETTER OMEGA WITH TITLO
047D ѽ CYRILLIC SMALL LETTER OMEGA WITH TITLO
The diacritic on these characters is not actually a "titlo" (although everyone agrees that it is not a titlo, it is not clear exactly what the origin of the diacritic mark is), which explains why they do not decompose to U+0460/U+0461 CYRILLIC CAPITAL/SMALL LETTER OMEGA and U+0483 COMBINING CYRILLIC TITLO. The character is used to represent the exclamations "о!" and "оле!", and is known in Russian as "beautiful omega" красивая омега or "wide omega" широкая омега.
0598 ֘ HEBREW ACCENT ZARQA This character is not actually a "zarqa" at all (which is U+05AE), but is intended to represent the sign called "tsinorit" that is used in the three poetic books (Job, Proverbs, Psalms), and that is centred above a base letter.
05AE ֮ HEBREW ACCENT ZINOR This character is intended to represent the sign called "zarqa" that is used in the twenty-one books of the Old Testament, as well as to represent the sign called "tsinor" (sometimes transliterated "zinor") that is used in the three poetic books (Job, Proverbs, Psalms). Both these signs share the same glyph form and are placed above and to the left of a base letter.
0670 ٰ ARABIC LETTER SUPERSCRIPT ALEF This is actually a vowel sign, not a letter.
0B83 TAMIL SIGN VISARGA Although this sign derives from a special type of visarga, it is not called a visarga in Tamil, but is known as an "āytham" (which is a Tamilized form of the Sanskrit word "āśrita", being a class of visarga).
0CDE KANNADA LETTER FA This letter has nothing to do with the sound /f/, but actually represents a Dravidian /l/, and should rightly have been called KANNADA LETTER LLLA, in line with the corresponding letters in other Indic scripts, such as U+0934 DEVANAGARI LETTER LLLA, U+0BB4 TAMIL LETTER LLLA and U+0D34 MALAYALAM LETTER LLLA.
0E9D LAO LETTER FO TAM
0E9F LAO LETTER FO SUNG
The character names for U+0E9D and U+0E9F are swapped. U+0E9D is a high tone class letter, and should have been named LAO LETTER FO SUNG (SUNG meaning "high"); whereas U+0E9F is a low tone class letter, and should have been named LAO LETTER FO TAM (TAM meaning "low").
0EA3 LAO LETTER LO LING
0EA5 LAO LETTER LO LOOT
The character names for U+0EA3 and U+0EA5 are swapped. LO LING is the mnemonic name for U+0EA5 ("lo as in ling [monkey]"); whereas LO LOOT is the badly transliterated mnemonic name for U+0EA3 ("lo as in loot", for "ro as in rot [motor car]").
0F0A TIBETAN MARK BKA- SHOG YIG MGO This character is meant to represent the sign that is used in formal documents in Bhutan to indicate an inferior addressing a superior (the "petition honorific"), but the Tibetan name BKA- SHOG YIG MGO actually indicates a superior addressing an inferior ("starting flourish for giving a command"). When the character that really indicates a superior addressing an inferior was later encoded at U+0FD0, it had to be assigned a slightly different but synonymous name, TIBETAN MARK BSKA- SHOG GI MGO RGYAN ("starting flourish for giving a command").
0F0B TIBETAN MARK INTERSYLLABIC TSHEG The tsheg mark is not restricted to intersyllabic usage, and may occur at the end of a terminal syllable or multiple times as "justifying tshegs" at the end of a line.
0F0C TIBETAN MARK DELIMITER TSHEG BSTAR This character is simply a non-breaking version of the "tsheg" mark (U+0F0B) that is used exclusively between the letter NGA (U+0F44) and the "shad" mark (U+0F0D).
0FD0 TIBETAN MARK BSKA- SHOG GI MGO RGYAN Mistake for TIBETAN MARK BKA- SHOG GI MGO RGYAN (the syllable BSKA- does not naturally occur in Tibetan).
156F CANADIAN SYLLABICS TTH This character looks like an asterisk, and it probably is an asterisk. The imaginary letter TTH was accidentally encoded when someone mistook an asterisk denoting a proper noun as a letter in the Canadian aboriginal script.
1880 MONGOLIAN LETTER ALI GALI ANUSVARA ONE
1881 MONGOLIAN LETTER ALI GALI VISARGA ONE
The ONE in the names of these two characters is spurious. Each of these two characters has two different glyph forms, which are distinguished by the application or not of U+180B MONGOLIAN FREE VARIATION SELECTOR ONE (FVS-1) :
<1880> ᢀ and <1880 180B> ᢀ᠋ (actually, the former is technically a CANDRABINDU and the latter an ANUSVARA, and even though CANDRABINDU and ANUSVARA are used interchangeably in Mongolian contexts, I would have thought that they should have been encoded separately, as is the case with Tibetan and other Brahmic scripts);
<1881> ᢁ and <1881 180B> ᢁ᠋.
My theory is that in an early draft for the Mongolian block each variant form of these two characters was assigned a separate code point, with names differentiated by ONE and TWO :
MONGOLIAN LETTER ALI GALI ANUSVARA ONE
MONGOLIAN LETTER ALI GALI ANUSVARA TWO
MONGOLIAN LETTER ALI GALI VISARGA ONE
MONGOLIAN LETTER ALI GALI VISARGA TWO
When a decision was later made to unify the variant forms of the two characters and distinguish their variant forms by means of variation selectors, MONGOLIAN LETTER ALI GALI ANUSVARA TWO and MONGOLIAN LETTER ALI GALI VISARGA TWO were deleted, leaving MONGOLIAN LETTER ALI GALI ANUSVARA ONE and MONGOLIAN LETTER ALI GALI VISARGA ONE unchanged.
200B ZERO WIDTH SPACE Being zero-width, it is not actually a "space".
2118 SCRIPT CAPITAL P Actually a lowercase calligraphic "p".
262B FARSI SYMBOL This is not a symbol of Farsi (the modern Persian language), but is in fact the official emblem of the government of the Islamic Republic of Iran. In Unicode 1.0 this character was properly named SYMBOL OF IRAN, but the name was changed on merger with ISO/IEC 10646.
309F HIRAGANA DIGRAPH YORI
30FF KATAKANA DIGRAPH KOTO
These characters are ligatures, not digraphs.
A015 YI SYLLABLE WU This is neither a syllable nor pronounced "wu", but is actually a syllable iteration mark, similar in function to the ideographic iteration marks such as U+3005 々 IDEOGRAPHIC ITERATION MARK.
FA0E CJK COMPATIBILITY IDEOGRAPH-FA0E
FA0F CJK COMPATIBILITY IDEOGRAPH-FA0F
FA11 CJK COMPATIBILITY IDEOGRAPH-FA11
FA13 CJK COMPATIBILITY IDEOGRAPH-FA13
FA14 CJK COMPATIBILITY IDEOGRAPH-FA14
FA1F CJK COMPATIBILITY IDEOGRAPH-FA1F
FA21 CJK COMPATIBILITY IDEOGRAPH-FA21
FA23 CJK COMPATIBILITY IDEOGRAPH-FA23
FA24 CJK COMPATIBILITY IDEOGRAPH-FA24
FA27 CJK COMPATIBILITY IDEOGRAPH-FA27
FA28 CJK COMPATIBILITY IDEOGRAPH-FA28
FA29 CJK COMPATIBILITY IDEOGRAPH-FA29
These are all unified ideographs in their own right, not compatibility ideographs (which are duplicate ideographs encoded for roundtrip mapping to legacy character sets where the same character is encoded more than once, either as pronunciation variants or as minor glyph variants).
FE18 PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET Mistake for PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET.
1D0C5 𝃅 BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS Mistake for BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS.
1D13A 𝄺 MUSICAL SYMBOL MULTI REST The glyph is actually a "breve rest" or "double whole rest". A new character named MUSICAL SYMBOL MULTIPLE MEASURE REST is introduced in Unicode 5.1 at U+1D129 to represent a rest of arbitrary length (sometimes called an H-bar rest).
1D300 𝌀 MONOGRAM FOR EARTH
1D301 𝌁 DIGRAM FOR HEAVENLY EARTH
1D302 𝌂 DIGRAM FOR HUMAN EARTH
1D303 𝌃 DIGRAM FOR EARTHLY HEAVEN
1D304 𝌄 DIGRAM FOR EARTHLY HUMAN
1D305 𝌅 DIGRAM FOR EARTH
TaiXuan Jing symbols are made up of a combination of three different elements, an unbroken line that represents heaven (Chinese tian 天), a single broken line that represents earth (Chinese di 地) and a double broken line that represents human (Chinese ren 人). The monograms and digrams are named using the terms HEAVEN, EARTH and HUMAN, but they map the single broken line to HUMAN and the double broken line to EARTH, which is not the normal association.
The correct mappings for these characters are :
MONOGRAM FOR EARTH = ren (human)
DIGRAM FOR HEAVENLY EARTH = tian ren (heaven/human)
DIGRAM FOR HUMAN EARTH = di ren (earth/human)
DIGRAM FOR EARTHLY HEAVEN = ren tian (human/heaven)
DIGRAM FOR EARTHLY HUMAN = ren di (human/earth)
DIGRAM FOR EARTH = ren ren (human/human)
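
For anyone who wants to check some of the names in the table for themselves, here is a minimal Python sketch. It assumes that the character database shipped with Python's unicodedata module carries the same formal names as the table; the helper name and the choice of sample code points are mine.

```python
import unicodedata

# A handful of code points taken from the table above.
SAMPLE = [0x0132, 0x01A2, 0x0EA3, 0x0EA5, 0x2118, 0xA015, 0xFE18, 0x1D0C5]


def formal_name(cp: int) -> str:
    """Look up the name of a code point, with a placeholder if none is found."""
    return unicodedata.name(chr(cp), "<no name in this database>")


for cp in SAMPLE:
    print(f"U+{cp:04X} {formal_name(cp)}")
```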

Note 1. The 65 control characters at <0000..001F>, <007F> and <0080..009F> do not have formal names in Unicode or ISO/IEC 10646, and they are generally referred to by their designations in ISO/IEC 6429. However, there is a move afoot to formally define names for these characters (see N3046 "Improving formal definition for control characters").
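
The count of 65 in Note 1 can be verified with a few lines of Python. This is only a sketch: the helper names are hypothetical, and it relies on the unicodedata module reporting no name for these code points, which is how it behaves in the Python versions I am aware of.

```python
import unicodedata

# The three ranges given in Note 1 (inclusive).
CONTROL_RANGES = [(0x0000, 0x001F), (0x007F, 0x007F), (0x0080, 0x009F)]


def control_code_points() -> list[int]:
    """Expand the inclusive ranges into a flat list of code points."""
    return [cp for lo, hi in CONTROL_RANGES for cp in range(lo, hi + 1)]


def has_formal_name(cp: int) -> bool:
    """True if the database holds a name for this code point."""
    return unicodedata.name(chr(cp), None) is not None


controls = control_code_points()
print(len(controls))                                # 65
print(any(has_formal_name(cp) for cp in controls))  # False
```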



Addendum [2006-05-14]

Unicode has now issued their own list of anomalous character names as Unicode Technical Note 27 : Known Anomalies in Unicode Character Names.


5 comments:

orcmid said...

U+A856? What?

Andrew West said...

U+A856 won't be hitting the street until May, when Unicode 5.0 is released; but here's a brief introduction to it.

U+A856 PHAGS-PA LETTER SMALL A is the Phags-pa letter that corresponds to the Tibetan letter 'a འ. Now this letter has two functions in life: firstly it can act as a base consonant (although a null consonant in modern Tibetan); and secondly it can act as a vowel lengthener. In the Tibetan script, when it functions as a vowel lengthener it is written as a small-sized letter subjoined to a base consonant, and this small-sized letter is called 'a chung འ་ཆུང "small 'a" in Tibetan. In Unicode these two functions are encoded separately, as U+0F60 TIBETAN LETTER -A (i.e. the base consonant 'a) and U+0F71 TIBETAN VOWEL SIGN AA (i.e. the vowel lengthener).

In the Phags-pa script this letter is written in its ordinary full-sized form for use both as a base consonant and as a vowel lengthener, and so there is no need to encode two separate characters for the two functions. It was originally proposed for encoding with the name PHAGS-PA LETTER -A, corresponding to the equivalent Tibetan character name, but the Chinese insisted that it should be changed to PHAGS-PA LETTER SMALL A, despite the fact that only one of its functions corresponds to the Tibetan letter "small a", and that it is not in fact small. We accepted this name change as a necessary compromise, but it still irks me (美中不足, as one would say in Chinese).

Sascha Leib said...

There is a good reason to name code point U+0132 "LATIN CAPITAL LIGATURE IJ", because even if in most fonts the characters are not joint, it actually IS a joint character (in the Dutch language) and there are fonts which merge them to a ligature.

Paul Clapham said...

I notice that U+264F is named SCORPIUS. But that's the name the astronomers use for the constellation. Seems to me that the name that the astrologers use should apply, and that's SCORPIO.

Andrew West said...

In the Unicode code charts, U+264F is included under the subheading "Zodiacal symbols", and as Scorpius is the name of one of the constellations of the zodiac, I don't think the name can be considered to be wrong. But I would have expected "scorpio" to be given as an alias for U+264F in the code charts (and perhaps it will be when Unicode 5.1 is published).