Thursday, 31 August 2006

Tibetan Extensions 2 : Balti

N2985 is a proposal by Michael Everson to encode four additional Tibetan letters that are used for transcribing Urdu or Arabic words when writing the Balti language (nothing to do with curry) :

  • TIBETAN LETTER KKA = Reversed letter KA (U+0F40), transcribing [q]
  • TIBETAN LETTER RRA = Reversed letter RA (U+0F62), transcribing [ɽ]
  • TIBETAN LETTER KHHA = Letter KHA (U+0F41) with an "ear" (U+0F39), transcribing [x]
  • TIBETAN LETTER GHHA = Letter GA (U+0F42) with an "ear" (U+0F39), transcribing [ɣ]

The first two of the proposed letters are uncontroversial as encoding a new character is the only way to deal with letters that are created by reversal of existing letters. However, the other two letters are problematic, as there is already an established mechanism for dealing with Tibetan letters that are created by the addition of an "ear" (Tibetan tsa 'phru ཙ་འཕྲུ་ or sbrang gsad སྦྲང་གསད་). This is explained in the Unicode Standard 4.0 section 9.11 (my highlighting) :

The sign U+0F39 TIBETAN MARK TSA -PHRU (tsa-’phru, which is a lenition mark) is the ornamental flaglike mark that is an integral part of the three consonants U+0F59 TIBETAN LETTER TSA, U+0F5A TIBETAN LETTER TSHA, and U+0F5B TIBETAN LETTER DZA. Although those consonants are not decomposable, this mark has been abstracted and may by itself be applied to “pha” and other consonants to make new letters for use in transliteration and transcription of other languages. For example, in modern literary Tibetan, it is one of the ways used to transcribe the Chinese “fa” and “va” sounds not represented by the normal Tibetan consonants. Tsa-’phru is also used to represent tsa, tsha, or dza in abbreviations.

Thus, according to the Unicode Standard it is quite clear that the Balti letters KHA and GA with a flag should be represented as a sequence of KHA plus TSA -PHRU <0F41 0F39> and GA plus TSA -PHRU <0F42 0F39> respectively. Why then are these two letters being proposed for encoding as distinct characters ?

Well, the main reason is that U+0F39 has a defect which means that Tibetan text using this character may not render correctly once it has been normalized, and because of the infamous Stability Policy this defect cannot be fixed. The details are quite simple : U+0F39 has been assigned an immutable canonical combining class of 216, whereas Tibetan vowel signs have canonical combining classes of 129, 130 and 132. When text is normalized, combining marks are sorted into ascending order of combining class, so U+0F39 is reordered after any vowel signs. This is wrong, as U+0F39 should be more closely attached to the base consonant than the vowel signs are.
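The reordering can be reproduced with a few lines of Python. This is only a minimal sketch using the standard unicodedata module; the variable and function names are made up for illustration, and it says nothing about how any particular rendering engine handles the result.

```python
import unicodedata

KHA = "\u0f41"       # TIBETAN LETTER KHA
TSA_PHRU = "\u0f39"  # TIBETAN MARK TSA -PHRU
VOWEL_I = "\u0f72"   # TIBETAN VOWEL SIGN I


def code_points(text):
    """Return the text as a list of four-digit uppercase hex code points."""
    return [f"{ord(ch):04X}" for ch in text]


# Canonical combining classes of the two combining marks.
ccc_tsa_phru = unicodedata.combining(TSA_PHRU)
ccc_vowel_i = unicodedata.combining(VOWEL_I)

# Logical order: base consonant, tsa -phru, vowel sign.
typed = KHA + TSA_PHRU + VOWEL_I

# Normalization sorts the marks by ascending combining class,
# which moves the vowel sign in front of tsa -phru.
normalized = unicodedata.normalize("NFC", typed)

print("ccc of U+0F39:", ccc_tsa_phru)
print("ccc of U+0F72:", ccc_vowel_i)
print(code_points(typed), "->", code_points(normalized))
```

Neither mark takes part in canonical composition, so NFC and NFD give the same reordered sequence here.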

The end result is that under Windows Vista a base consonant followed directly by TSA -PHRU renders correctly, but when normalized so that a vowel sign falls between a base consonant and TSA -PHRU it renders incorrectly (images taken using Uniscribe version 1.0606.5112.0 and the Tibetan Machine Uni font) :

  • <0F41, 0F39, 0F72> = [image : renders correctly]
  • <0F41, 0F72, 0F39> = [image : renders incorrectly]

Note that this is only a problem if TSA -PHRU is followed by a vowel, and only if the text has been normalized. Personally I do not think that the defect in the canonical combining class of U+0F39 is a legitimate reason to encode these two letters as precomposed characters. The Microsoft rendering engine (Uniscribe) could and should be changed so that both normalized and unnormalized sequences render correctly.

The real problem with accepting precomposed characters for these two cases is that there are other letters that are formed by the addition of a flag, and if it is deemed necessary to encode KHA plus TSA -PHRU and GA plus TSA -PHRU separately, then all the other transliteration letters formed with TSA -PHRU should also be encoded separately. There are quite a few of these :

  • <0F55, 0F39> ཕ༹, PHA plus TSA -PHRU = [f]
  • <0F56, 0F39> བ༹, BA plus TSA -PHRU = [v]
  • <0F50, 0F39> ཐ༹, THA plus TSA -PHRU = [θ]
  • <0F51, 0F39> ད༹, DA plus TSA -PHRU = [ð]
  • <0F5E, 0F39> ཞ༹, ZHA plus TSA -PHRU (PRC Tibetan Set A U+F56B, see also N2621)
  • <0F5F, 0F39> ཟ༹, ZA plus TSA -PHRU (PRC Tibetan Set A U+F57C, see also N2621)
  • <0F60, 0F39> འ༹, -A plus TSA -PHRU [Old Tibetan usage] (PRC Tibetan Set A U+F590, see also N2621)

Incidentally, [f] and [v] are nowadays more commonly represented using HA plus SUBJOINED PHA <0F67 0FA5> ཧྥ and HA plus SUBJOINED BA <0F67 0FA6> ཧྦ.

In addition to these seven atomic letters that are composed using TSA -PHRU, TSA -PHRU is also attached to a variety of letters when writing informal shorthand contractions (bskungs-yig བསྐུངས་ཡིག་ "concealed writing" or bsdu-yig བསྡུ་ཡིག་ "conglomerated writing"), where it is used to represent the letters U+0F59 TSA, U+0F5A TSHA, U+0F5B DZA or U+0F5F ZA. The letters it attaches to include U+0F42 GA, U+0F44 NGA, U+0F46 CHA (!), U+0F51 DA, U+0F53 NA, U+0F55 PHA, U+0F56 BA, U+0F62 RA (on the ra mgo in the combination RGYA) and U+0F63 LA, as can be seen in the following examples :

  • ཀུན༹ང་ = ཀུན་བཟང་
  • ཀོུབ༹་ = ཀུན་རྫོབ་
  • སྐལ༹ང་ = སྐལ་བཟང་
  • རྒྱོ༹་ = རྒྱ་མཚོ་
  • ཁོུས༹་ = ཁུར་ཚོས་
  • རྒྱལ༹ན་ = རྒྱལ་མཚན་
  • ཆོུ༹ད་ = ཆུ་ཚོད་
  • རྗེུན༹་ = རྗེ་བཙུན་
  • ཏིངི་ན༹་ = ཏིང་འཛིན་
  • ཐེུག༹ས་ = ཐུགས་བརྩེ་
  • བདིུད༹་ = བདུད་རྩི་
  • སྣོག༹ས་ = སྣོ་ཚོགས་
  • ཕྱ༹ལོ་ = ཕྱག་འཚལ་ལོ་
  • ཕོུག༹ས་ = ཕུན་ཚོགས་
  • བྱོ༹གས་ = བྱ་ཚོགས་
  • སྦྲིང༹་ = སྦྲང་རྩི་
  • ལྷ༹ཾས་ = ལྷ་མཚམས་

Thus, if the precomposed letters KHHA (KHA plus TSA -PHRU) and GHHA (GA plus TSA -PHRU) are accepted for encoding, then precomposed forms of at least seven other letters plus TSA -PHRU also need to be encoded, and given its common use in shorthand contractions attached to almost any letter, it may be prudent to simply encode TSA -PHRU versions of all thirty plus Tibetan consonants. Not only would this be a major change to the Tibetan encoding model, but due to the constraints of the Stability Policy none of the precomposed letters with TSA -PHRU would be canonically equivalent to the decomposed forms that are currently in use. This would introduce all sorts of problems with legacy data. Moreover, people would still be able to compose transcription letters using TSA -PHRU rather than using the new precomposed letters if they wanted (TSA -PHRU could be deprecated but once encoded a character can never be removed from the standard), thus resulting in multiple non-equivalent spellings. All in all, I believe that such a change would be disastrous, causing chaos for years to come.

So, in conclusion, I hope to see TIBETAN LETTER KKA and TIBETAN LETTER RRA encoded soon (preferably in Unicode 5.1), and Tibetan rendering engines and/or fonts modified so that they render Consonant plus TSA -PHRU sequences correctly with or without intervening vowel signs.

Addendum [2006-10-08]

As I hoped, at the September 2006 Tokyo meeting of WG2 it was agreed to encode TIBETAN LETTER KKA and TIBETAN LETTER RRA at U+0F6B and U+0F6C respectively, and not to encode the two precomposed letters with TSA -PHRU. The two accepted letters are under final ballot for inclusion in ISO/IEC 10646:2003 Amd.3, which will correspond to Unicode 5.1.

Sunday, 27 August 2006

BabelMap : Unicode in Colour

This new release of BabelMap fixes certain problems that may be encountered by users of non-English locales when they select the option to display all styles of fonts. Many thanks to Nocturnal Dreamer for helping me get to the bottom of it all. John Cowan also had some good suggestions for improving the Composite Font Mappings dialog, but I'm afraid that I haven't been able to do anything about this yet.

On Friday someone somewhere mentioned that colour coding characters according to a selected property would be quite useful. I thought so too, so I have quickly added such a feature into the latest release. You can now select any of the following properties from the "Colour Coding" option of the "Options" menu, and each character in the BabelMap grid will then be highlighted in a particular colour according to the value of that property for the character. Even if you select no colour coding then you still get some colour coding now, as assigned characters will be displayed on a white background and unassigned code points will be rendered in grey.

  • General Category
  • Canonical Combining Class
  • Bidi Class
  • Bidi Mirrored
  • Decomposition Type
  • Numeric Type
  • Script
  • Unicode Version (when the character was encoded)

This, for example, is what the Basic Latin block looks like if you select colour coding by General Category :

The tables below show what colours are used for each value of the various properties other than Script. I'll leave it as an exercise to the reader to work out what colour I've used for each of the sixty-four scripts defined in Unicode 5.0.

Colour Coding by General Category
Colour General Category
red Lu (Letter, Uppercase)
yellow Ll (Letter, Lowercase)
orange red Lt (Letter, Titlecase)
salmon Lm (Letter, Modifier)
crimson Lo (Letter, Other)
pink Mn (Mark, Nonspacing)
hot pink Mc (Mark, Spacing Combining)
deep pink Me (Mark, Enclosing)
violet Nd (Number, Decimal Digit)
fuchsia Nl (Number, Letter)
purple No (Number, Other)
green yellow Pc (Punctuation, Connector)
lime Pd (Punctuation, Dash)
light green Ps (Punctuation, Open)
spring green Pe (Punctuation, Close)
sea green Pi (Punctuation, Initial Quote)
forest green Pf (Punctuation, Final Quote)
olive drab Po (Punctuation, Other)
teal Sm (Symbol, Math)
cyan Sc (Symbol, Currency)
steel blue Sk (Symbol, Modifier)
royal blue So (Symbol, Other)
white Zs (Separator, Space)
khaki Zl (Separator, Line)
dark khaki Zp (Separator, Paragraph)
brown Cc (Other, Control)
peru Cf (Other, Format)
chocolate Co (Other, Private Use)
dark slate gray Cs (Other, Surrogate)
gray Cn (Other, Not Assigned)
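
The lookup behind this table can be sketched in a few lines of Python. This is not BabelMap's actual code, just an illustration under the assumption that the colour depends only on the General Category value; the dictionary and function names are hypothetical, and the category values come from whatever version of Unicode the Python unicodedata module supports.

```python
import unicodedata

# Colours copied from the General Category table above.
GENERAL_CATEGORY_COLOURS = {
    "Lu": "red", "Ll": "yellow", "Lt": "orange red", "Lm": "salmon",
    "Lo": "crimson",
    "Mn": "pink", "Mc": "hot pink", "Me": "deep pink",
    "Nd": "violet", "Nl": "fuchsia", "No": "purple",
    "Pc": "green yellow", "Pd": "lime", "Ps": "light green",
    "Pe": "spring green", "Pi": "sea green", "Pf": "forest green",
    "Po": "olive drab",
    "Sm": "teal", "Sc": "cyan", "Sk": "steel blue", "So": "royal blue",
    "Zs": "white", "Zl": "khaki", "Zp": "dark khaki",
    "Cc": "brown", "Cf": "peru", "Co": "chocolate",
    "Cs": "dark slate gray", "Cn": "gray",
}


def category_colour(ch):
    """Return the table colour for a character's General Category."""
    return GENERAL_CATEGORY_COLOURS[unicodedata.category(ch)]


# A few sample characters: capital A, small a, digit one, space, plus sign.
for ch in "Aa1 +":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), category_colour(ch))
```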

Colour Coding by Canonical Combining Class
Colour Canonical Combining Class
white 0
violet 1 (Overlays and interior)
indigo 7 (Nuktas)
coral 8 (Hiragana/Katakana voicing marks)
yellow 9 (Viramas)
red 10-199 (fixed position classes)
green yellow 200 (Below left attached)
lime 202 (Below attached)
spring green 204 (Below right attached)
olive drab 208 (Left attached)
teal 210 (Right attached)
aqua 212 (Above left attached)
sky blue 214 (Above attached)
steel blue 216 (Above right attached)
slate blue 218 (Below left)
royal blue 220 (Below)
pink 222 (Below right)
hot pink 224 (Left)
deep pink 226 (Right)
sandy brown 228 (Above left)
peru 230 (Above)
chocolate 232 (Above right)
sienna 233 (Double below)
brown 234 (Double above)
gold 240 (Below - iota subscript)
gray Unassigned

Colour Coding by Bidi Class
Colour Bidi Class
white L (Left-to-Right)
fuchsia LRE (Left-to-Right Embedding)
purple LRO (Left-to-Right Override)
green R (Right-to-Left)
lime AL (Right-to-Left Arabic)
cyan RLE (Right-to-Left Embedding)
teal RLO (Right-to-Left Override)
orange red PDF (Pop Directional Format)
sandy brown EN (European Number)
chocolate ES (European Number Separator)
brown ET (European Number Terminator)
steel blue AN (Arabic Number)
salmon CS (Common Number Separator)
crimson NSM (Non-Spacing Mark)
pink BN (Boundary Neutral)
khaki B (Paragraph Separator)
dark khaki S (Segment Separator)
yellow WS (Whitespace)
royal blue ON (Other Neutrals)

Colour Coding by Bidi Mirrored
Colour Bidi Mirrored ?
white No
red Yes
gray Unassigned

Colour Coding by Decomposition Type
Colour Decomposition Type
white None
red Canonical
sky blue <font>
medium blue <noBreak>
pale violet red <initial>
fuchsia <medial>
slate blue <final>
purple <isolated>
aqua <circle>
green yellow <super>
lime green <sub>
gold <vertical>
orange red <wide>
dark khaki <narrow>
peru <small>
olive drab <square>
steel blue <fraction>
green <compat>
gray Unassigned
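
Decomposition Type is not exposed directly by Python's unicodedata module, but it can be derived from the decomposition mapping string, which starts with a tag such as <compat> for compatibility mappings and has no tag for canonical ones. The following is a rough sketch along those lines, not BabelMap's implementation; the names are hypothetical, and code points that only decompose algorithmically (Hangul syllables) may be reported as having no decomposition.

```python
import unicodedata

# Colours copied from the Decomposition Type table above.
DECOMPOSITION_COLOURS = {
    "None": "white", "Canonical": "red",
    "<font>": "sky blue", "<noBreak>": "medium blue",
    "<initial>": "pale violet red", "<medial>": "fuchsia",
    "<final>": "slate blue", "<isolated>": "purple",
    "<circle>": "aqua", "<super>": "green yellow", "<sub>": "lime green",
    "<vertical>": "gold", "<wide>": "orange red", "<narrow>": "dark khaki",
    "<small>": "peru", "<square>": "olive drab",
    "<fraction>": "steel blue", "<compat>": "green",
    "Unassigned": "gray",
}


def decomposition_type(ch):
    """Classify a character by the tag of its decomposition mapping."""
    if unicodedata.category(ch) == "Cn":
        return "Unassigned"
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "None"
    first = mapping.split()[0]
    # Compatibility mappings start with a tag in angle brackets;
    # a mapping with no tag is a canonical decomposition.
    return first if first.startswith("<") else "Canonical"


def decomposition_colour(ch):
    """Return the table colour for a character's Decomposition Type."""
    return DECOMPOSITION_COLOURS[decomposition_type(ch)]


for ch in "A\u00c0\u00a0\u00b2\ufb01":
    print(f"U+{ord(ch):04X}", decomposition_type(ch), decomposition_colour(ch))
```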

Colour Coding by Numeric Type
Colour Numeric Type
white None
red Decimal Digit
green Digit
blue Numeric
gray Unassigned

Colour Coding by Unicode Version
Colour Unicode Version
white 1.0.0 (4,716 characters)
pink 1.0.1 (21,204 characters)
deep pink 1.1 (1,657 characters)
green yellow 2.0 (11,373 characters)
lime green 2.1 (2 characters)
salmon 3.0 (10,307 characters)
crimson 3.1 (44,946 characters)
red 3.2 (1,016 characters)
sandy brown 4.0 (1,226 characters)
chocolate 4.1 (1,273 characters)
sky blue 5.0 (1,369 characters)
gray Unassigned

Saturday, 19 August 2006

Welsh Double L

Having last week discussed Welsh double d, this week I'm going to briefly touch upon Welsh "double l" /ɬ/.

In the earliest Welsh texts the /ɬ/ sound was normally written as "l" initially and as "ll" medially and finally, but in most medieval manuscripts /ɬ/ is written as "ll" in all positions, as in modern Welsh. As the /l/ sound was also sometimes written as "ll", in words that in modern Welsh are spelled with a single "l" (e.g. callon, modern calon "heart"), there was scope for confusion between the two sounds. In order to differentiate the two phonetic values of "ll", in some manuscripts, as well as in some modern editions of middle Welsh texts, the /ɬ/ sound is represented by a ligatured "double l" (you can see some examples in Jones's 1941 transcription of Brut y Tywysogyon given in last week's double d post). This ligatured double l has been proposed for encoding as LATIN CAPITAL LETTER MIDDLE-WELSH LL and LATIN SMALL LETTER MIDDLE-WELSH LL (see N3027), and will be available for use from Unicode 5.1.

In early printed books, "lh" is often used instead of "ll", as for example in Yny lhyvyr hwnn (1546) and Y Drych Cristianogawl (1585). What interests me is that in the latter book "lh" is sometimes replaced by what I think is a form of blackletter "l" with a loop (it looks a bit like a "P"), in both blackletter and roman text, as can be seen on this page :

Y Drych Cristianogawl (1585) page 72r

darllen (line 7), llyfr (line 8), holl (line 15, cf. holh on lines 24 & 25), deall (line 27), etc.

This looped form of blackletter letter "l" is not uncommon, especially in 16th century French books (it is also used in Caxton's Canterbury Tales), but it is the first time that I have seen it used contrastively with the ordinary form of the letter "l" to indicate a particular phonetic value. It is even more remarkable to see it used in roman text as well as blackletter text. I haven't seen this looped blackletter l used elsewhere to represent Welsh /ɬ/, but then again I only have access to a very few early Welsh books.

The fact that the blackletter looped l is used in this important text (the first Welsh book printed in Wales) contrastively with "l" to represent /ɬ/ means that it is a very good candidate for encoding in my opinion. The only question is whether it should be encoded as a lowercase letter only or as an uppercase and lowercase pair. In Y Drych Cristianogawl it only appears to occur in lowercase form, and I suspect that it would always be replaced by "LH" or "Lh" in capitals, in which case it could be argued that, like long s and r rotunda, it is a lowercase only letter, and a corresponding capital form is not required. On the other hand, one can easily envisage a blackletter capital looped "L", and for ease of casing it may be simplest to encode a hypothetical capital form.

Sunday, 13 August 2006

Welsh Double D

I'm afraid that I'm still obsessing over the finer points of medieval orthography, and so over the next couple of weeks I'm going to discuss a couple of Welsh oddities that I have come across recently.

As everyone knows, Welsh "dd" is pronounced /ð/, but what may not be so well known is that it was not always written this way. In medieval Welsh manuscripts, words which in modern Welsh are spelled with double d are almost always spelled with a single "d" (or sometimes "t"). Thus early Welsh texts do not differentiate modern Welsh /d/ and /ð/, presumably because /ð/ is a later phonetic development. The following example is from one of the earliest Welsh manuscripts, written during the late 13th century :

The Black Book of Carmarthen (NLW MS Peniarth 1 folio 4r)

The start of the prophetic poem Breuddwyd a Welwn Neithiwr "The dream I saw last night", written Breuduid a uelun neithir.

As far as I can tell, up until the 15th century the /ð/ sound is not normally represented in written Welsh. However, there is one manuscript that does distinguish /ð/ from /d/ in some parts, that is MS Peniarth 20 (mostly comprising the text of Brut y Tywysogion "The Chronicle of the Princes"), which was written about 1330.

In the latter section of Brut y Tywysogion /ð/ is not differentiated, but in the earlier parts /d/ and /ð/ do seem to be distinguished, as can be seen from the words blwydyn wedy (modern Welsh blwyddyn wedi "the next year") shown below. In the example from page 264 modern Welsh "d" and "dd" both correspond to an ordinary letter "d", whereas in the example from page 73 modern Welsh "d" corresponds to an ordinary letter "d" but modern Welsh "dd" corresponds to what appears to be the common Latin abbreviation for "que", that is the letter "q" followed by the "3" shaped letter "et" (which is proposed for encoding in N3027). I'm not entirely sure whether this is a novel reuse of the Latin abbreviation for "que" or whether there is some other explanation for this peculiar letter, but as I have been unable to find any other examples of it in other Welsh manuscripts, it may be an idiosyncrasy of the particular scribe who wrote this section of Brut y Tywysogion. Whatever the origin of the letter, I think that it must be one of the earliest indications of the phonetic shift of /d/ to /ð/.

Peniarth MS 20 p.73 : /d/ and /ð/ differentiated
Peniarth MS 20 p.264 : /d/ and /ð/ not differentiated

The following is a short extract from MS Peniarth 20 that shows the use of this letter, followed by the corresponding transcription in Thomas Jones's 1941 scholarly edition of Brut y Tywysogyon (taken from N3027 Fig.1). What seems particularly perverse to me is that Jones transcribes this letter with an "insular" letter d, but transcribes what in the manuscript is an "insular" letter d as an ordinary "d".

The Chronicle of the Princes (NLW MS Peniarth 20 page 73)

Despite this one manuscript, it was not until the late 15th century that /ð/ came to be regularly represented in written Welsh. During the 15th and 16th centuries it is variously represented by "dd", "dh" or (in blackletter) d with a dot above. In early printed texts, d with a dot above seems most common in blackletter type and "dh" in roman type. However, when representing a /ð/ sound blackletter capital "D" is always followed by an "h" even when lowercase d is dotted (see second example below).

Yny lhyvyr hwnn (London, 1546) page 22

This book uses d with dot above, e.g. Blwydyn (blwyddyn "year") on lines 1, 6 and 12.

Y Drych Cristianogawl (1585) page xii

This book uses "Dh" and dotted d, e.g. Dhyd (Ddydd "Day"), Dhuw (Dduw "God"), drwg (ddrwg "evil"), digon (ddigon "enough").

When the Welsh Bible was published in 1588 it used "dd" for /ð/, and due to its overwhelming influence as the standard for written Welsh, "dd" has been used for this sound ever since.