Thursday, 23 February 2006

Stacking Diacritics and Complex Tibetan Stacks

Michael Kaplan thinks that stacking diacritics up to the ceiling and down to the basement is really cool. I think so too, and was disappointed to find that it doesn't work with the current release of BabelPad. Well, with a couple of tweaks (allowing large line spacing values and centring the output vertically within the space between the previous and next lines), I've got stacking diacritics to display correctly in BabelPad, as you can see from the following screenshots of the letter a with 72 combining diacritics (33 above and 39 below). In both cases the font is Doulos SIL at 24 points, but the first screenshot shows what you get if you turn off Uniscribe and render everything as spacing characters, whilst the second one shows monumental stacking when you turn on Uniscribe (version 1.420.2600.2180 on my computer) and set BabelPad's line spacing to 12.0.
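For anyone who wants to reproduce a test string like this, here is a minimal Python sketch. It assumes the marks are drawn from the Combining Diacritical Marks block (U+0300..U+036F) and simply cycles through them; the actual marks used in the screenshots are not listed in this post, and the function name `stacked` is made up for the illustration.

```python
import unicodedata
from itertools import cycle, islice

# Marks from the Combining Diacritical Marks block (U+0300..U+036F),
# split by canonical combining class: 230 = above, 220 = below.
BLOCK = [chr(cp) for cp in range(0x0300, 0x0370)]
ABOVE = [c for c in BLOCK if unicodedata.combining(c) == 230]
BELOW = [c for c in BLOCK if unicodedata.combining(c) == 220]

def stacked(base: str, n_above: int, n_below: int) -> str:
    """Return base followed by n_above marks above and n_below marks below.

    Marks are reused cyclically if more are requested than the block holds.
    """
    above = islice(cycle(ABOVE), n_above)
    below = islice(cycle(BELOW), n_below)
    return base + "".join(above) + "".join(below)

text = stacked("a", 33, 39)
print(len(text))  # 73 code points: the base letter plus 72 marks
```

Pasting the resulting string into an editor shows whether the rendering engine stacks the marks or lays them out as spacing characters.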


Screenshot 1 : Combining diacritics laid out horizontally (Uniscribe off)


Screenshot 2 : Combining diacritics stacked vertically above and below (Uniscribe on)


Unfortunately the new improved version of BabelPad used for these screenshots won't be coming out until May. It was scheduled for release at the end of March, as soon as Unicode 5.0 is released, but we now hear that the release of Unicode 5.0 is being delayed until May. As my working versions of BabelPad and BabelMap have already been upgraded internally to support 5.0 I can't release them until after 5.0 is out. On the one hand, this is rather annoying, as there are lots of bug fixes and improvements that I want to release as soon as possible; on the other hand, it gives me some desperately needed time to get everything else ready for 5.0, including my suite of Phags-pa fonts.

Anyhow, back to stacking diacritics. The letter a with 72 diacritics is certainly impressive, but not very useful in the real world. However, there is one script that I can think of that does occasionally require multiple stacking characters. This is Tibetan. Tibetan is normally written horizontally with consonant clusters stacking vertically (implemented as one full-sized consonant from the range 0F40..0F69 and zero or more subjoined consonants from the range 0F90..0FB9). Ordinary Tibetan text only has limited vertical stacking (usually just one subjoined consonant, but sometimes two), and can be rendered correctly using Uniscribe version 1.453.3665.0 or later and a competent OpenType Tibetan font (of which there are now several freely available). However, occasionally, in esoteric texts, consonants are piled up (or rather down) like some crazy Yertle stack. With the first version of Uniscribe to support Tibetan OpenType features, Tibetan stacks with many subjoined consonants do not render correctly with any Tibetan OpenType font (I'm using Microsoft's not-yet-released Ximalaya font for these examples).
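The encoding model for a stack can be sketched in a few lines of Python. The sketch relies on the fact that each subjoined consonant in 0F90..0FB9 sits exactly 0x50 above its full-sized counterpart in 0F40..0F69; the helper name `make_stack` is hypothetical, and the range check does not exclude the unassigned code point U+0F48.

```python
# Distance between a full-sized consonant and its subjoined form,
# e.g. U+0F40 KA -> U+0F90 SUBJOINED KA.
SUBJOINED_OFFSET = 0x0F90 - 0x0F40  # 0x50

def make_stack(consonants: str) -> str:
    """Turn a run of full-sized consonants into one vertical stack.

    The first consonant is kept as is; every following consonant is
    replaced by its subjoined form. There is no limit on the depth.
    """
    if not consonants:
        raise ValueError("a stack needs at least one consonant")
    for ch in consonants:
        if not 0x0F40 <= ord(ch) <= 0x0F69:
            raise ValueError(f"U+{ord(ch):04X} is not a full-sized consonant")
    head, rest = consonants[0], consonants[1:]
    return head + "".join(chr(ord(ch) + SUBJOINED_OFFSET) for ch in rest)

# SA followed by NGA gives SA with subjoined NGA (U+0F66 U+0F94).
stack = make_stack("\u0F66\u0F44")
print(" ".join(f"U+{ord(ch):04X}" for ch in stack))  # U+0F66 U+0F94
```

Whether a deep stack built this way actually displays as a single column is then entirely down to the rendering engine and the font.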


Screenshot 3 : Complex non-standard Tibetan stacks (Uniscribe version 1.453.3665.0)


But with the latest versions of Uniscribe (version 1.468.4011.0 or later), in conjunction with the Ximalaya font, it is possible to correctly render Tibetan stacks with many subjoined consonants. It's not terribly pretty, but I think it is pretty amazing.


Screenshot 4 : Complex non-standard Tibetan stacks (Uniscribe version 1.468.4011.0)


The above are all real examples of complex stacking, taken from sngags kyi klog thabs shes rab mig 'byed སྔགས་ཀྱི་ཀློག་ཐབས་ཤེས་རབ་མིག་འབྱེད. However, there are still some complex stacks that cannot yet be rendered in plain text. For example, in some of the complex stacks in this text there is also a horizontal element, where one or more of the subjoined letters is followed horizontally by the letter NGA to make a subjoined syllable such as yang ཡང (the YA is vertically in line with the stack, but the NGA protrudes forward). At present there is no way of indicating horizontal progression at the subjoined level (and there probably never will be).

Also, with the Ximalaya font (or at least my version of it; maybe the version shipping with Vista, called Microsoft Himalaya, will have been improved) non-standard multiple vowel signs do not work well. For example, two i vowel signs will overlay each other, when they should each occupy their own space. As double i vowel signs are found in some abbreviations (which do not use abnormal stacking), the failure to render multiple vowel signs correctly is a little disappointing.


Wednesday, 15 February 2006

When is a Swastika not a Swastika ?

When it's encoded in Unicode, when it is a CJK Unified Ideograph ... or two CJK ideographs to be precise:

  • U+534D CJK UNIFIED IDEOGRAPH-534D 卍 (left-facing or anticlockwise swastika)
  • U+5350 CJK UNIFIED IDEOGRAPH-5350 卐 (right-facing or clockwise swastika)

This comes as a surprise to most people, who do not naturally associate the swastika with the Chinese script. Of course, the swastika is not a Chinese invention, but was originally an ancient Indian religious symbol. It was introduced into China along with Buddhism, as the swastika was supposed to be one of the thirty-two marks of a Buddha. In the year 693 Empress Wu decreed that the swastika should henceforth be regarded as a Chinese character, to be pronounced the same as the character 萬 wàn "ten thousand".

The swastika thus entered the vast corpus of Chinese characters. The left-facing form is most common in Chinese usage, but both forms are found, as there was some disagreement amongst Chinese authorities as to which form was correct. The swastika, in either or both forms, is duly recorded in most large modern dictionaries (although only the left-facing form is found in the Kangxi Dictionary 康熙字典, where it has a very meagre entry). The two swastika characters were included in early Chinese encodings such as CNS 11643-1986, and so also included in the earliest version of Unicode as part of the CJK unified ideograph repertoire derived from the various legacy encodings.

The swastika character in Chinese does not have any meaning other than its own shape as an auspicious symbol, and so it is usually only used in the compound word wàn zì 卍字 (also often written as 萬字) "swastika character" to describe the swastika motif in the decorative arts. The following excerpt from the great 18th century novel Honglou Meng 紅樓夢 "A Dream of Red Mansions" illustrates the use of the swastika character in running text (the novel also includes a maid with the name of Wan'er 卍兒) :

昨兒我開庫房,看見大板箱裏還有好些匹銀紅蟬翼穿花紗,也有各樣折枝花樣的,也有流雲卍福花樣的,也有百蝶穿花花樣的,顏色又鮮,紗又輕軟,我竟沒見過這樣的。

Yesterday when I opened the storeroom I saw quite a few rolls of vermilion cicada-wing patterned gauze in some big chests. There were all sorts of designs with sprigs of flowers, as well as designs with floating clouds and patterns of swastika and good fortune characters, and designs with butterflies fluttering amongst the flowers. The colours are bright, and the gauze is soft and light, the like of which I have never seen before.

Honglou Meng 紅樓夢 (Beijing: Renmin Wenxue Chubanshe, 1982) ch.40 p.547.

N.B. In some editions the word biānfú 蝙蝠 "bat" is found in place of wànfú 卍福 "swastika and good fortune", the bat also being an auspicious emblem in Chinese. The name of the maid Wan'er 卍兒 is also written 萬兒 in some editions.

The swastika is also an important symbol in other cultures, particularly in Tibet, where the right-facing swastika 卐 is a symbol of changelessness and eternity for Buddhists, and the left-facing swastika 卍 is the main emblem of the native Bön བོན religion. The most common name for the swastika symbol in Tibetan is g.yung drung གཡུང་དྲུང་ (silent initial g), which is a word of uncertain etymology. By themselves, g.yung གཡུང་ means a cross between a cow and a yak, and drung དྲུང་ means "near to, in front of or beside", so literally the word g.yung drung would mean something like "in front of the cow-yak", which obviously makes no sense. However, in the ancient Zhang Zhung language that is partially preserved in the Bön tradition, the word for the swastika is drung mu, which obviously has some relationship to Tibetan g.yung drung, although the etymology of the Zhang Zhung word is equally obscure (mu means "sky, heaven" in Zhang Zhung, but the root meaning of drung is not clear).

As the swastika is not confined to Han usage, but is a symbol used by many other cultures, some would argue that the swastika signs should be encoded in Unicode as symbols for general usage, in the same way that U+262F YIN YANG ☯, U+262A STAR AND CRESCENT ☪, U+262D HAMMER AND SICKLE ☭, U+2629 CROSS OF JERUSALEM ☩ and many other such religious or political symbols are, and that U+534D and U+5350 should then be restricted to Han usage. This is unlikely to happen due to sensitivities over the misuse of the swastika symbol by one particular culture. Nevertheless, there are several problems that I see with only encoding the swastika ideographs and not encoding swastika symbols in their own right.

Firstly, the swastika ideographs are given a Unicode script property of "Han", which indicates that they are only intended for use in a Han ideographic context. However, other scripts have a legitimate claim to the use of the swastika, and the Unicode Standard explicitly states that the Tibetan script uses U+534D and U+5350 (TUS 4.0 p.257). This suggests to me that, out of the 70,000+ CJK ideographs currently encoded, U+534D and U+5350 alone should perhaps be given a script property of "Common". Michael Kaplan has suggested that it is a deficiency in the Unicode script property that characters must either belong to a single script only or else belong to all scripts, and thus it is not possible to specify that a character belongs to a particular subset of scripts, such as "Han and Tibetan" in the case of U+534D and U+5350. I guess that for many characters it is difficult to define the boundaries of script usage, and it is a lot simpler to just use "Common" rather than a potentially controversial or changing list of scripts.

Secondly, the glyphs for the ideographic swastikas are often drawn in an ideographic style which may not be suitable for non-Han usage.

Thirdly, because U+534D and U+5350 are hidden amongst the thousands of anonymous CJK ideographs, it is not easy for users to find them if they do not already know where to look. For example, searching for "swastika" in either Windows Character Map or BabelMap will not produce any results (though this will change in the next version of BabelMap), which would probably lead most people to suppose that there are no swastika symbols encoded in Unicode ... and perhaps they would be half right.
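The problem is easy to demonstrate with a naive name search. The following Python sketch uses the standard `unicodedata` module; the helper `find_by_name` is made up for this illustration and is not how Character Map or BabelMap implement their searches.

```python
import unicodedata

def find_by_name(keyword: str, first: int, last: int) -> list[int]:
    """Return the code points in first..last whose name contains keyword."""
    keyword = keyword.upper()
    hits = []
    for cp in range(first, last + 1):
        # Unassigned code points have no name; treat them as empty strings.
        name = unicodedata.name(chr(cp), "")
        if keyword in name:
            hits.append(cp)
    return hits

# The ideograph's name is derived from its code point, not from its meaning.
print(unicodedata.name("\u534D"))  # CJK UNIFIED IDEOGRAPH-534D

# So a keyword search over the CJK Unified Ideographs block finds nothing.
print(find_by_name("swastika", 0x4E00, 0x9FFF))  # []
```

A tool can only do better by searching some additional data, such as definitions or annotations, rather than the character names alone.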

[This blog follows on from Michael Kaplan's recent post Every character has a story #17: U+534d and U+5350]



Addendum I

Looking through some old files I have just rediscovered some images of bon head marks, which are formed from the left-facing swastika. [2007-05-27 : these headmarks are actually used in the sMar-chen script, as discussed in my Zhang Zhung Scripts post.]



These marks are the equivalent of the head mark character U+0F04 TIBETAN MARK INITIAL YIG MGO MDUN MA, or perhaps more accurately the recently proposed archaic-style head mark character, pencilled in as U+0FD3 TIBETAN MARK INITIAL BRDA RNYING YIG MGO MDUN MA, with the curl styled into a swastika, and are used in bon religious texts. (I think that Tibetan head marks should perhaps be the topic for my next Tibetan Extensions blog.)



Addendum II: An Unauthorised History [2014-07-07]

When I wrote this blog post eight years ago I thought it unlikely that generic swastika signs would ever be encoded in Unicode "due to sensitivities over the misuse of the swastika symbol by one particular culture". But only a little more than a year later four generic swastika signs (two plain swastikas and two dotted swastikas) were proposed for encoding by Michael Everson as part of a larger proposal for Vedic Sanskrit. The swastikas have been entirely removed from the revised version of this proposal dated 2007-04-26 that is in the WG2 document registry (N3235), but the original proposal dated 2007-04-13 is still preserved on Everson's site:


Page 8 of the original version of N3235 (2007-04-13)

Discussion of Right-facing Swastika, Left-facing Swastika, Right-facing Dotted Swastika, and Left-facing Dotted Swastika


This proposal was discussed at WG2 meeting 50 at Frankfurt, Germany during the week of 23–27 April 2007, although all such discussion and mention of the word "swastika" have been redacted from the minutes of the meeting. Luckily, I was present in person at the meeting, and remember much of the discussion. There was general consensus amongst the experts at the meeting that the four proposed characters should be encoded in principle, but there was concern that German anti-Nazi laws could land the ISO/IEC 10646 and Unicode Standards in legal trouble if the characters were encoded. However, the German delegates and meeting hosts assured the meeting that encoding these signs would not cause any problems as they are religious symbols, and use of religious swastika symbols was not illegal in Germany. Nevertheless, in order to ensure that users of the standard clearly understand that these swastika signs are intended for use as religious symbols, it was agreed that they should be encoded in the Tibetan block with Tibetan names. The following month, in response to the discussion at this meeting, Michael Everson, Chris Fynn, Peter Scharf and I submitted a proposal to encode four swastika signs (N3268):


Page 1 of N3268 (2007-05-09)


In this proposal the swastika signs are given a new identity as Tibetan symbols:

  • TIBETAN SYMBOL GYUNG DRUNG NANG -KHOR གཡུང་དྲུང་ནང་འཁོར (yung-drung nang khor)
  • TIBETAN SYMBOL GYUNG DRUNG PHYI -KHOR གཡུང་དྲུང་ཕྱི་འཁོར (yung-drung chi khor)
  • TIBETAN SYMBOL GYUNG DRUNG NANG -KHOR BZHI MIG CAN གཡུང་དྲུང་ནང་འཁོར་བཞི་མིག་ཅན (yung-drung nang khor shi mik chen)
  • TIBETAN SYMBOL GYUNG DRUNG PHYI -KHOR BZHI MIG CAN གཡུང་དྲུང་ཕྱི་འཁོར་བཞི་མིག་ཅན (yung-drung chi khor shi mik chen)

At the next WG2 meeting, in Hangzhou, China in September 2007, it was agreed to add these four characters to Amendment 6 of ISO/IEC 10646 for balloting by ISO national bodies (see N3353 §9.17). In their ballot comments the Indian national body requested that the characters be moved out of the Tibetan block, and placed in either a Devanagari block or a general symbols block, and renamed to use the name "swastika" (see N3476). The following year, Amendment 6 underwent a second Proposed Draft Amendment (PDAM) ballot, and in response to the concerns of the Indian national body the USA and Irish national bodies requested that the characters be renamed to use the terms "left svasti" and "right svasti" (deliberately avoiding the stigmatized "swastika") instead of the Tibetan names (see N3516). In the end, the four characters were kept in the Tibetan block, but with non-Tibetan names (although the Tibetan names were given as aliases), when Amendment 6 was published in October 2009, following two further rounds of ISO balloting:


Detail of Page 23 of ISO/IEC 10646:2003 Amendment 6 (2009-10-15)


The contents of Amendment 6, including these four svasti signs, were included in Unicode version 5.2, which was released on 1st October 2009.

  • U+0FD5 RIGHT-FACING SVASTI SIGN (alias "gyung drung nang -khor")
  • U+0FD6 LEFT-FACING SVASTI SIGN (alias "gyung drung phyi -khor")
  • U+0FD7 RIGHT-FACING SVASTI SIGN WITH DOTS (alias "gyung drung nang -khor bzhi mig can")
  • U+0FD8 LEFT-FACING SVASTI SIGN WITH DOTS (alias "gyung drung phyi -khor bzhi mig can")
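The formal names of these four characters can be checked with a short Python sketch using the standard `unicodedata` module (this assumes a Python build whose Unicode data is version 5.2 or later, which is true of any Python 3). As far as I know the module does not expose the informative Tibetan aliases, so only the formal names are printed.

```python
import unicodedata

# The four svasti signs occupy U+0FD5..U+0FD8 in the Tibetan block.
svasti_names = {cp: unicodedata.name(chr(cp)) for cp in range(0x0FD5, 0x0FD9)}

for cp, name in svasti_names.items():
    print(f"U+{cp:04X} {name}")
```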

However, five years after entering Unicode, these four characters are still very poorly represented in fonts, with a perhaps understandable reluctance of large companies to support characters which may be misconstrued as Nazi symbols. No fonts that ship with Windows XP, 7 or 8 include these characters, and on my computer only the following freeware fonts cover the four svasti characters:


Swastika characters rendered using BabelStone Han at 48 points

U+534D & U+5350 (CJK swastikas)

U+0FD5 & U+0FD6 (undotted swastikas)

U+0FD7 & U+0FD8 (dotted swastikas)


... and because the four svasti characters are very poorly supported in fonts, they are rarely used in text, with most internet users continuing to use the CJK swastika characters.



Addendum III: Examples of Swastikas in Tangut Texts [2014-07-08]

Fragment of a Tangut manuscript from Kharakhoto [Or.12380/2621]

A left-facing swastika and two side-by-side left-facing swastikas


Woodblock edition of the Wonderful Dharma Lotus Flower Sutra 妙法蓮華經 (Saddharma Puṇḍarīka)

Right-facing 45° rotated swastika (currently not representable in plain text with Unicode)



Addendum IV: Swastikas in Other Scripts [2014-07-09]

Marchen Script

The Marchen script is one of several scripts within the Tibetan Bön tradition that were supposedly used to write the Zhang Zhung language of the Zhang Zhung culture that flourished in the western and northern parts of Tibet before the introduction of Buddhism into the country during the 7th century. The left-facing swastika is the paramount symbol of the Bön religion, and is present in two characters in the script: the head mark (corresponding to in the Tibetan script and in the Phags-pa script); and the letter nya. The Marchen script has been included in Amendment 2 of ISO/IEC 10646:2014, which is currently undergoing its first round of ISO balloting (the PDAM ballot). If it successfully passes the ISO balloting process, the Marchen script will be included in a future version of the Unicode Standard.


Part of the Marchen code chart from ISO/IEC 10646:2014 PDAM 2 Ballot document (SC2 N4340)

U+11C70 "MARCHEN HEAD MARK" and U+11C79 "MARCHEN LETTER NYA" are both based on the left-facing swastika


Nüshu Script

The Nüshu script 女书 is a script used exclusively by women in Jiangyong county of Hunan province to write the local dialect of Chinese. Many Nüshu characters are derived directly from Chinese characters by skewing the shape of the character, and the Nüshu character representing uoɯ³³ 万 (wàn), uoɯ⁴⁴ 弯 or 湾 (wān), va³³ 位 (wèi), and iu 约 (yuē) is based on the right-facing ideographic swastika character (wàn). One other Nüshu character looks as if it is based on a dotted right-facing ideographic swastika character, but it is actually derived from the Chinese character 断 (duàn), and is used to represent taŋ¹³ 断 (duàn), ta³³ 地 (dì), taŋ³³ 段 or 缎 (duàn), laŋ³⁵ 短 (duǎn), laŋ⁴⁴ 端 (duān), tuoɯ¹³ 但 (dàn), tai¹³ 动 (dòng), and ŋu¹³ 午 (wǔ). The Nüshu script (named "Nushu" for technical reasons) has also been included in Amendment 2 of ISO/IEC 10646:2014, and should be included in a future version of the Unicode Standard.


Part of the Nushu code chart from ISO/IEC 10646:2014 PDAM 2 Ballot document (SC2 N4340)

U+1B195 "NUSHU CHARACTER-1B195"* is based on the right-facing ideographic swastika


Part of the Nushu code chart from ISO/IEC 10646:2014 PDAM 2 Ballot document (SC2 N4340)

U+1B258 "NUSHU CHARACTER-1B258"* looks like it is based on a dotted swastika, but is actually derived from the character 断 (duàn)


* Nushu was originally included in ISO/IEC 10646:2014 Amendment 1, with character names based on the phonetic reading of the most frequent meaning of the character, so U+1B195 was named "NUSHU CHARACTER UOW33", and U+1B258 was named "NUSHU CHARACTER TANG13" (see N4484). However, in response to feedback from Japanese experts (see N4513) and ballot comments from the Japanese national body (see N4520), and after much heated debate, it was decided to use algorithmic names based on the hexadecimal code point of the character for all East Asian ideographic scripts, including Nüshu, Tangut, Jurchen and Khitan. Consequently, Nushu was moved back to Amendment 2 so that it can be balloted with the new algorithmic names.
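An algorithmic name of this kind is just a fixed prefix, a hyphen, and the code point in uppercase hexadecimal. A minimal Python sketch follows; the function name `algorithmic_name` is hypothetical, and the CJK comparison uses the standard `unicodedata` module.

```python
import unicodedata

def algorithmic_name(prefix: str, code_point: int) -> str:
    """Build a name from a script prefix and the hexadecimal code point."""
    # At least four hex digits, uppercase, as in "CJK UNIFIED IDEOGRAPH-534D".
    return f"{prefix}-{code_point:04X}"

print(algorithmic_name("NUSHU CHARACTER", 0x1B195))  # NUSHU CHARACTER-1B195
print(algorithmic_name("NUSHU CHARACTER", 0x1B258))  # NUSHU CHARACTER-1B258

# The CJK ideographic swastikas are named by the same pattern.
assert algorithmic_name("CJK UNIFIED IDEOGRAPH", 0x534D) == unicodedata.name("\u534D")
```

The advantage of this scheme is that no per-character name data needs to be agreed or maintained; the cost is that the name says nothing about the reading or meaning of the character.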