Sunday, 26 March 2006

Unicode Character Names Part 3 : A Name by any other Name

As discussed in Part 1 there are some unfortunately misnamed characters, and as discussed in Part 2 Unicode character names once assigned can never be changed, and so misnamed characters are stuck with their names whether they like it or not. Whilst the characters themselves may or may not mind what they are called, characters with wrong names cause untold anguish to some people. Until now there has not been very much that can be done about the problem, and if anyone complains about a particular character name all that Unicode can advise them is that character names are intended as unique mnemonic identifiers and should not be relied on for identification of a character's function or meaning, which is unfortunate as most character names can be relied on for this purpose.

However, as from Unicode 5.0 (due for release in May) some of the most badly misnamed characters will be provided with a formal alias which implementers will be encouraged to use in user interfaces in place of the character's official name. At present the following eleven characters will be given formal aliases (see NameAliases.txt for the current list of formal aliases) :


Code Point | Character | Character Name | Formal Alias | Assigned
01A2 | Ƣ | LATIN CAPITAL LETTER OI | LATIN CAPITAL LETTER GHA | Unicode 5.0
01A3 | ƣ | LATIN SMALL LETTER OI | LATIN SMALL LETTER GHA | Unicode 5.0
0CDE |  | KANNADA LETTER FA | KANNADA LETTER LLLA | Unicode 5.0
0E9D |  | LAO LETTER FO TAM | LAO LETTER FO FON | Unicode 5.0
0E9F |  | LAO LETTER FO SUNG | LAO LETTER FO FAY | Unicode 5.0
0EA3 |  | LAO LETTER LO LING | LAO LETTER RO | Unicode 5.0
0EA5 |  | LAO LETTER LO LOOT | LAO LETTER LO | Unicode 5.0
0FD0 |  | TIBETAN MARK BSKA- SHOG GI MGO RGYAN | TIBETAN MARK BKA- SHOG GI MGO RGYAN | Unicode 5.0
2448 |  | OCR DASH | MICR ON US SYMBOL | Unicode 6.1
2449 |  | OCR CUSTOMER ACCOUNT NUMBER | MICR DASH SYMBOL | Unicode 6.1
A015 |  | YI SYLLABLE WU | YI SYLLABLE ITERATION MARK | Unicode 5.0
FE18 |  | PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET | PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET | Unicode 5.0
1D0C5 | 𝃅 | BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS | BYZANTINE MUSICAL SYMBOL FTHORA SKLIRON CHROMA VASIS | Unicode 5.0

There are two important points to be made about these formal aliases.

Firstly, formal aliases will only be given to the most perniciously misnamed characters, and not to every character which has a sub-optimal name, or for which there is academic dispute about the transliteration or naming convention to use.

Secondly, formal aliases are completely different to the aliases already provided in the Unicode code charts. The code chart aliases are alternative names by which a character may be known, and are provided for information only. On the other hand, the formal aliases conform to the Character-naming guidelines (and must be unique within the scope of both character names and formal aliases), and so look just like real character names, and are intended to be used in place of the real character names in applications' user interfaces.
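As a rough illustration of what "used in place of the real character names" might look like in an application, here is a minimal Python sketch. The `FORMAL_ALIASES` table and the `display_name` helper are hypothetical names invented for this example, and the table holds only a few of the aliases listed in this post; a real application would presumably load the full list from NameAliases.txt. The standard `unicodedata` module supplies the official names, and (since Python 3.3) its `lookup` function also accepts name aliases.

```python
import unicodedata

# A few formal aliases copied from the list in this post (code point -> alias).
# Hypothetical table for illustration; not the complete list.
FORMAL_ALIASES = {
    0x01A2: "LATIN CAPITAL LETTER GHA",
    0x01A3: "LATIN SMALL LETTER GHA",
    0x0CDE: "KANNADA LETTER LLLA",
    0x0E9D: "LAO LETTER FO FON",
    0x0E9F: "LAO LETTER FO FAY",
    0x0EA3: "LAO LETTER RO",
    0x0EA5: "LAO LETTER LO",
    0xA015: "YI SYLLABLE ITERATION MARK",
}


def display_name(char: str) -> str:
    """Hypothetical helper: name to show in a user interface.

    Prefers the formal alias when one is known, and otherwise falls back
    to the official character name (or a U+XXXX label if there is none).
    """
    code_point = ord(char)
    alias = FORMAL_ALIASES.get(code_point)
    if alias is not None:
        return alias
    return unicodedata.name(char, f"U+{code_point:04X}")


# The official name is unchanged; only the displayed name differs.
official = unicodedata.name("\u01a2")   # official, immutable name
shown = display_name("\u01a2")          # formal alias for the UI
```

The official name remains the stable identifier; the alias only changes what the user sees.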

Here is some of the history behind these characters.

LATIN CAPITAL/SMALL LETTER OI [U+01A2/01A3]

The character names reflect the glyph shape of this letter (Ƣ), and do not represent its phonetic value. The letter was devised to represent the sound /gh/, and was used in the Kirghiz Latin alphabet between 1928 and 1940.

KANNADA LETTER FA [U+0CDE]

I've no idea how it came to be that a letter that is used to represent a Dravidian /l/ sound should be named KANNADA LETTER FA. The alias is spelled with three L's in accordance with the unofficial convention of distinguishing flavours of the same letter by reduplicating the ASCII letter used to represent it. KANNADA LETTER LA [U+0CB2] and KANNADA LETTER LLA [U+0CB3] are already taken, so the alias for U+0CDE is KANNADA LETTER LLLA; and if ever a new Kannada letter representing a different flavour of /l/ were to be invented, it would no doubt be named KANNADA LETTER LLLLA !

Lao Letters FO TAM/FO SUNG and LO LING/LO LOOT

The Lao script was part of the original Unicode 1.0 repertoire encoded in 1991, but these two pairs of swapped character names did not come to light until October last year, when a user of the French version of BabelMap queried the names of LAO LETTER FO TAM and LAO LETTER FO SUNG with me. I raised the matter on the Unicode mailing list, and got confirmation from a number of experts on the Lao script that there were indeed some mistakes with Lao character names. The result was that, although I know very little about Lao, I ended up writing a document summarizing the issues with Lao character names, and recommending possible solutions. One of my recommendations was that the misnamed letters be assigned the formal aliases that are now to be assigned.

Most of the consonants in the Lao block are named from the syllabic sound of the letter, plus the word SUNG "high" or TAM "low" to indicate a high tone class or low tone class if two letters share the same syllabic sound. U+0E9D and U+0E9F both share the same syllabic sound, FO, but the former character is a high tone class letter and should have been named LAO LETTER FO SUNG, whilst the latter character is a low tone class letter and should have been named LAO LETTER FO TAM. Unfortunately, the names of this pair of characters were assigned the wrong way round. As formal aliases must be unique and cannot duplicate an existing character name, it is not possible to assign what should have been their correct names as the formal aliases, and so the formal aliases for these two characters needed to be based on a different naming system. Luckily, the Lao people do not normally name their letters using SUNG and TAM, but use mnemonic names similar in form to English "A is for Apple", "B is for Ball", etc., and so the formal aliases could be based on the characters' mnemonic names. There are no official Lao mnemonic names, and the mnemonics may vary from one source to another, but for these two characters the most common mnemonic names are FO FON "fo as in the word fon [rain]" and FO FAY "fo as in the word fay [fire]". Thus the formal aliases for U+0E9D and U+0E9F are LAO LETTER FO FON and LAO LETTER FO FAY respectively. The most common mnemonic names for the other consonants are being added to the code charts as informal aliases.
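The uniqueness constraint that rules out simply swapping the two names can be checked mechanically. The following is only a sketch, assuming Python 3.3 or later, where `unicodedata.lookup` resolves both official character names and name aliases; the helper name `is_name_taken` is made up for this example.

```python
import unicodedata


def is_name_taken(candidate: str) -> bool:
    """Hypothetical helper: True if `candidate` already identifies a character.

    unicodedata.lookup raises KeyError for unknown names, so a successful
    lookup means the name (or alias) is already in use.
    """
    try:
        unicodedata.lookup(candidate)
    except KeyError:
        return False
    return True


# The "correct" name for U+0E9D is already the official name of U+0E9F,
# so it cannot be reused as a formal alias.
owner = unicodedata.lookup("LAO LETTER FO SUNG")
collision = is_name_taken("LAO LETTER FO SUNG")

# A made-up name that exists nowhere in the standard is still free.
free = not is_name_taken("LAO LETTER HYPOTHETICAL NAME")
```

Because the lookup covers names and aliases together, the same check matches the rule that an alias must be unique within the scope of both.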

The names for U+0EA3 and U+0EA5 are different from all the other Lao consonants, as they are the only two letters with character names that are based on mnemonic names, LO LING "lo as in the word ling [monkey]" and LO LOOT "lo as in the word loot [motor car]". However, the mnemonic names are the wrong way round, with U+0EA3 named LAO LETTER LO LING when it should have been named LAO LETTER LO LOOT, and U+0EA5 named LAO LETTER LO LOOT when it should have been named LAO LETTER LO LING. The reason for the different naming system for these two letters was presumably due to the fact that they are both low tone class letters, and they could not both be named LAO LETTER LO TAM. Actually U+0EA3, which has been deprecated by the Lao government since 1975, is used to represent [r] in foreign words, and so the two letters could have been differentiated by simply naming them LAO LETTER RO and LAO LETTER LO — which are the names used for the formal aliases.

TIBETAN MARK BSKA- SHOG GI MGO RGYAN [U+0FD0]

This character is full of woe. To start with its proper name was misappropriated by U+0F0A. U+0F0A, which was encoded in Unicode 2.0, is a Bhutanese mark used in formal documents to indicate an inferior addressing a superior, and should have been named something like TIBETAN MARK ZHU YIG GI MGO RGYAN, corresponding to the Tibetan zhu yig gi mgo rgyan ཞུ་ཡིག་གི་མགོ་རྒྱན "starting flourish for making a petition"; but somehow it got assigned the name TIBETAN MARK BKA- SHOG YIG MGO, corresponding to the Tibetan bka' shog yig mgo བཀའ་ཤོག་ཡིག་མགོ "starting flourish for giving a command".

So when the character that actually indicates a superior addressing an inferior was encoded in Unicode 4.1 at U+0FD0 it had to be given a slightly different but synonymous name, which should have been TIBETAN MARK BKA- SHOG GI MGO RGYAN, corresponding to the Tibetan bka' shog gi mgo rgyan བཀའ་ཤོག་གི་མགོ་རྒྱན. Unfortunately, BKA- became miswritten as BSKA- (a syllable that does not naturally occur in Tibetan) in the proposal (N2694).

Mistakes like this can be hard to spot if you don't know Tibetan. Luckily I do know Tibetan, and pointed out the mistake on the Unicode mailing list when the proposal was first announced; but unfortunately I didn't check to see that the mistake had been corrected as the character progressed towards standardization, so I feel somewhat responsible for this one.

YI SYLLABLE WU [U+A015]

When the Liangshan Yi script was originally proposed for encoding in 1995 (see N1187) there was some confusion over a character ꀕ that appeared in some Yi sources but not in others. This was given the name YI SYLLABLE WU and positioned between the syllables with vowel initials and the syllables with consonant initials.

In fact the character named YI SYLLABLE WU does not represent the syllable /wu/ or any specific syllable, but is a special syllable iteration mark that is used to indicate that the preceding syllable is repeated. Syllable reduplication is most commonly found in adjectives or verbs, where reduplication of the final syllable indicates the interrogative (note that the reduplicated syllable is pronounced in the mid level tone after a syllable in the secondary high tone, and that final "p", "t" and "x" are tone markers) :

  • vat ꃪ "OK"; and vat vat ꃪꀕ (for ꃪꃪ) "OK ?"
  • bbo ꁧ "go"; and bbox bbo ꁦꀕ (for ꁦꁧ) "shall we go ?"
  • zzyr muo ꋬꂻ "fine and well"; and zzyr muox muo ꋬꂺꀕ (for ꋬꂺꂻ) "Are you fine and well ?"

In the Yi phonetic alphabet, syllable iteration is represented by the letter "w", so that ꃪꀕ is represented as vatw (for vat vat), ꁦꀕ is represented as bboxw (for bbox bbo), and ꋬꂺꀕ is represented as zzyr muoxw (for zzyr muox muo). This is presumably where the mistaken notion that the character ꀕ represents the sound /wu/ came from. In fact the reason why the iteration mark ꀕ is represented by the letter "w" is simply because it looks like a letter "w" with two vertical strokes beneath it (i.e. the "w" approximates the glyph shape of the character, and not its phonetic value).
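The "w" convention can be expanded back into full syllables with a few lines of Python. This is only a sketch built from the three examples above, under two assumptions of mine: that a romanized syllable ending in "w" always stands for the iteration mark, and that the only tone change is the one noted earlier (after a secondary high tone syllable ending in "x", the repeated syllable is mid level tone, written with no tone letter). The function name `expand_iteration` is invented for this example.

```python
def expand_iteration(romanized: str) -> str:
    """Expand the trailing "w" iteration marker in space-separated Yi romanization.

    Assumption: a syllable ending in "w" is never an ordinary syllable.
    """
    expanded = []
    for syllable in romanized.split():
        if len(syllable) > 1 and syllable.endswith("w"):
            base = syllable[:-1]
            # After a secondary high tone syllable (final "x"), the repeated
            # syllable takes the mid level tone, which has no tone letter.
            repeat = base[:-1] if base.endswith("x") else base
            expanded.extend([base, repeat])
        else:
            expanded.append(syllable)
    return " ".join(expanded)


examples = [expand_iteration(s) for s in ("vatw", "bboxw", "zzyr muoxw")]
```

For the three examples in this post, the function reproduces the expanded readings given above.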

More details on this character and other Yi issues can be found in a document that I wrote for the UTC in 2004.

PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET [U+FE18]

A really embarrassing mistake, that was first pointed out by Alan Wood shortly after the character was released into the wild as part of the Unicode 4.1 repertoire in March 2005. At that time the corresponding Amendment 1 of ISO/IEC 10646:2003 was only at the FDAM stage, and it wouldn't be published until November of that year, but it was too late to change it as the ISO and Unicode character names cannot differ.

BYZANTINE MUSICAL SYMBOL FHTORA SKLIRON CHROMA VASIS [U+1D0C5]

Twenty Byzantine musical symbols with the word FTHORA in their names were introduced in Unicode 3.1, but somewhere along the line one of them got mistyped as FHTORA. An easy mistake to make, but a difficult mistake to spot, especially if you don't read Greek — which I don't (and for anyone else who doesn't, according to an internet dictionary the word φθορά means "abrasion, decay, deterioration, vitiation, waste, wear, weathering").


2 comments:

Anonymous said...

Hi Mr. West,
I've found the following translation for "phtora"
φθορά
n. wastage, corruption, waste, decay, wear, spoilage, wear and tear, attrition, spoiling.

Source: Greek Dictionary

shreevatsa said...

My arbitrary guess for the "KANNADA LETTER FA" (I speak and read and write Kannada) is that some (then) leading phonetic transliteration software mapped 'fa', a sound which otherwise doesn't exist in Kannada (usually represented by 'pha') to this letter, which is obsolete and rarely used. Then when the standard was proposed, this software's convention was followed.

(Or more simply, the proposer of the standard picked up an English letter that was free and gave it to this Kannada letter thinking no one would care.)