Last year my friend Nathan Hill kindly invited me to give a talk on Tangut at my alma mater. I accepted with some trepidation because I am still very much at the start of a long and steep learning curve with regard to Tangut, but I hoped that by the time the talk was due to be given in May this year I would have something interesting and exciting to talk about. Unfortunately I got tied up with other stuff (Tangut, ironically), so in the end my talk turned out to be more of a general introduction to the structure of the Tangut script and to some of the issues that I have faced over the last year or so in preparing an encoding proposal for Tangut. But anyway, the talk didn't go too badly, and so I thought that I would convert my PowerPoint slides into a four-part series of blog posts.
Notes for an introductory talk on the Tangut script given at SOAS on 21st May 2009
- Part 1 : How Complex is Tangut?
- Part 2 : Untangling the Web of Characters
- Part 3 : Tangut Homographs
1.1 The Age of New Scripts
During the 10th to 13th centuries a number of new scripts were devised by peoples who had come into contact with (and conflict with) China, and who wanted to assert their national identity and cultural superiority by means of their own, unique and distinct writing systems:
- Khitan Large Characters (c.920) ⇐ Chinese
- Khitan Small Characters (c.924) ⇐ Chinese
- Tangut (c.1036) ⇐ ?
- Jurchen (c.1120) ⇐ Chinese and Khitan
- Mongolian (c.1204) ⇐ Old Uyghur
- ʼPhags-pa (c.1269) ⇐ Tibetan
[See Documents relating to the encoding of the Tangut, Jurchen and Khitan scripts for Unicode encoding proposals]
Three of these scripts, Large Khitan, Jurchen and Tangut, are structurally similar to Chinese, and I will look at their similarities and differences, both amongst themselves and in relation to Chinese, below.
1.2 Khitan Large Script
- Closely modelled on Chinese
- Many characters borrowed directly from Chinese
- Some with the same meaning (e.g. 皇帝 in the text below)
- Some as phonetic borrowings
- Many other characters derived from Chinese characters by adding or removing strokes (e.g. 東 with two extra strokes on the 6th line from the right in the text below)
- Few or no characters composed of multiple elements with large numbers of strokes (i.e. no characters like Chinese 雙)
- Uses exactly the same stroke types as Chinese
- Largely undeciphered
Transcription of a Khitan Memorial Stone
Source: Mínzú Yǔwén 民族语文 2005 no.4 page 54
1.3 Jurchen
- Very similar to Khitan Large Script
- Many characters derived from Khitan and/or Chinese
- Relatively few direct borrowings from Chinese compared with Khitan
- No characters with large numbers of strokes or composed from multiple complex elements
- Uses exactly the same stroke types as Chinese
- Largely deciphered
Drawing of a "Medallion" with a Jurchen inscription
Source: S. W. Bushell, "Inscriptions in the Juchen and Allied Scripts" in Actes du Onzième Congrès International des Orientalistes (1897) 2nd section page 21
(originally from Fāngshì Mòpǔ 方氏墨譜 [Mr. Fang's Catalogue of Ink Cakes] (1588) vol. 1 folio 33)
Table of Chinese, Khitan and Jurchen Numerals
Source: Daniel Kane, The Sino-Jurchen Vocabulary of the Bureau of Interpreters (1989) page 21
1.4 Tangut
- Only superficially similar to Chinese
- Characters are not obviously derived directly from Chinese or Khitan characters, although they are clearly influenced by Chinese
- Discrete elements arranged into a square character
- Appears crowded compared with Chinese, with few non-complex characters
- Most characters composed of two or three distinct components, and only a few characters are themselves elemental components
- Mostly written using the same stroke types as used for writing Chinese, but some stroke types and stroke constructions are unique to Tangut
- Higher proportion of diagonal and oblique strokes than in Chinese
- No closed elements (i.e. no box elements like Chinese 口 and 囗)
Chrysographic Edition of the Lotus Sutra
Source: 中国少数民族文字字符总集
Fragment of a Memorial Stone from the Western Xia Royal Tombs
Source: 大夏寻踪——西夏文物特展 (Vanished Exhibition on Western Xia artefacts at the National Museum of China)
[Can you spot the characters meaning "one" and "three"?]
1.5 Stroke Complexity
Tangut is renowned as being very complex in terms of the structure of its individual characters, but I wanted to try to determine exactly how complex Tangut is, and how it compares with Chinese, Khitan and Jurchen, so I produced the following graphs to show the distribution of characters by stroke count in these various scripts.
Distribution of Tangut Characters by Stroke Count
Data derived from Proposal for a revised Tangut character set for encoding in the SMP of the UCS (SC2/WG2/N3577) Appendix A.
Distribution of Traditional CJK Characters by Stroke Count
Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5), excluding simplified characters (mostly those characters with a kTraditionalVariant field).
Distribution of Simplified CJK Characters by Stroke Count
Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5) that have the kXHC1983 field but do not have the kSimplifiedVariant field (i.e. most simplified characters in the 1983 edition of Xiàndài Hànyǔ Cídiǎn 现代汉语词典).
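The two selection rules described above can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not the script actually used to produce the graphs: it assumes the Unihan data is available as tab-separated lines of the form `U+XXXX<TAB>field<TAB>value`, and the sample lines below (including their stroke counts and the kXHC1983 values) are made up for demonstration. The function names are hypothetical.

```python
from collections import Counter

FIRST, LAST = 0x4E00, 0x9FA5  # the Unicode 1.0 range named in the text


def parse_unihan(lines):
    """Collect {code point: {field: value}} from tab-separated Unihan lines."""
    data = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, field, value = line.split("\t", 2)
        data.setdefault(int(code[2:], 16), {})[field] = value
    return data


def stroke_histogram(data, keep):
    """Count characters per stroke count for entries accepted by keep()."""
    hist = Counter()
    for cp, fields in data.items():
        if FIRST <= cp <= LAST and "kTotalStrokes" in fields and keep(fields):
            # Assumption: if several values are listed, use the first one.
            hist[int(fields["kTotalStrokes"].split()[0])] += 1
    return hist


def is_traditional(fields):
    """Exclude simplified characters (those with a kTraditionalVariant)."""
    return "kTraditionalVariant" not in fields


def is_simplified(fields):
    """Keep characters with kXHC1983 but without kSimplifiedVariant."""
    return "kXHC1983" in fields and "kSimplifiedVariant" not in fields


# Made-up sample lines in Unihan format; the values are illustrative only.
sample = [
    "# illustrative data",
    "U+4E00\tkTotalStrokes\t1",
    "U+4E00\tkXHC1983\t0000.000:yi",
    "U+4E07\tkTotalStrokes\t3",
    "U+4E07\tkXHC1983\t0000.000:wan",
    "U+4E07\tkTraditionalVariant\tU+842C",
    "U+842C\tkTotalStrokes\t12",
    "U+842C\tkSimplifiedVariant\tU+4E07",
]

data = parse_unihan(sample)
traditional = stroke_histogram(data, is_traditional)
simplified = stroke_histogram(data, is_simplified)
print(dict(traditional), dict(simplified))
```

Passing the filter in as a function keeps the tallying code identical for both graphs, so only the selection rule differs between the traditional and simplified data sets.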
Distribution of Large Khitan Characters by Stroke Count
Data derived from the transcription of a Khitan memorial stone given in Mínzú Yǔwén 民族语文 2005 no.4 page 54 and page 55.
Distribution of Jurchen Characters by Stroke Count
Data derived from Jin Qizong 金啓孮, Nüzhenwen Cidian 女真文辞典 [Dictionary of Jurchen Characters] (Beijing: Wenwu Chubanshe, 1984).
Stroke Count Data for Traditional CJK, Simplified CJK, Tangut, Jurchen and Khitan
Strokes | CJK Traditional | CJK Simplified | Tangut | Jurchen | Khitan |
---|---|---|---|---|---|
1 | 10 | 2 | 0 | 3 | 0 |
2 | 37 | 22 | 0 | 6 | 6 |
3 | 80 | 60 | 0 | 25 | 28 |
4 | 157 | 143 | 3 | 165 | 52 |
5 | 240 | 215 | 32 | 287 | 60 |
6 | 386 | 351 | 65 | 401 | 41 |
7 | 664 | 568 | 160 | 293 | 34 |
8 | 957 | 759 | 310 | 147 | 18 |
9 | 1,125 | 851 | 524 | 37 | 10 |
10 | 1,369 | 923 | 773 | 13 | 4 |
11 | 1,555 | 901 | 847 | 0 | 2 |
12 | 1,636 | 870 | 885 | 0 | 0 |
13 | 1,546 | 761 | 782 | 0 | 0 |
14 | 1,446 | 594 | 640 | 0 | 0 |
15 | 1,502 | 534 | 473 | 0 | 0 |
16 | 1,251 | 409 | 336 | 0 | 0 |
17 | 1,020 | 311 | 173 | 0 | 0 |
18 | 793 | 175 | 106 | 0 | 0 |
19 | 716 | 168 | 60 | 0 | 0 |
20 | 519 | 105 | 29 | 0 | 0 |
21 | 394 | 79 | 15 | 0 | 0 |
22 | 304 | 47 | 6 | 0 | 0 |
23 | 240 | 40 | 1 | 0 | 0 |
24 | 149 | 21 | 1 | 0 | 0 |
25 | 107 | 22 | 0 | 0 | 0 |
26 | 54 | 6 | 0 | 0 | 0 |
27 | 52 | 1 | 0 | 0 | 0 |
28 | 26 | 1 | 0 | 0 | 0 |
29 | 13 | 1 | 0 | 0 | 0 |
30 | 8 | 0 | 0 | 0 | 0 |
31 | 5 | 0 | 0 | 0 | 0 |
32 | 3 | 1 | 0 | 0 | 0 |
33 | 4 | 1 | 0 | 0 | 0 |
34 | 0 | 0 | 0 | 0 | 0 |
35 | 1 | 0 | 0 | 0 | 0 |
36 | 1 | 1 | 0 | 0 | 0 |
37 | 0 | 0 | 0 | 0 | 0 |
38 | 0 | 0 | 0 | 0 | 0 |
39 | 1 | 0 | 0 | 0 | 0 |
40 | 0 | 0 | 0 | 0 | 0 |
41 | 0 | 0 | 0 | 0 | 0 |
42 | 0 | 0 | 0 | 0 | 0 |
43 | 0 | 0 | 0 | 0 | 0 |
44 | 0 | 0 | 0 | 0 | 0 |
45 | 0 | 0 | 0 | 0 | 0 |
46 | 0 | 0 | 0 | 0 | 0 |
47 | 0 | 0 | 0 | 0 | 0 |
48 | 1 | 0 | 0 | 0 | 0 |
Total | 18,373 | 8,943 | 6,221 | 1,377 | 255 |
Mean | 13.46 | 11.49 | 12.09 | 6.01 | 5.43 |
Mode | 12 | 10 | 12 | 6 | 5 |
Comparison of CJK, Tangut, Jurchen and Khitan Stroke Counts
Jurchen and Large Khitan are the two scripts that appear to be most similar to Chinese, yet actually they are the most different when it comes to stroke count, both averaging less than half as many strokes as traditional CJK characters (6.01 and 5.43 strokes respectively, against 13.46). This difference is probably due to the fact that Large Khitan and Jurchen characters do not have any high stroke count radicals such as 言 "speech" (7 strokes), 金 "gold" (8 strokes), 馬 "horse" (10 strokes) and 鳥 "bird" (11 strokes) that are very common in Chinese characters.
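The Total, Mean and Mode rows of the table can be recomputed directly from the per-stroke counts, which also gives a quick check on the comparison with traditional CJK. The following is a minimal sketch using the Jurchen and Khitan columns copied from the table above; the function name `summarize` is just a label chosen for this example.

```python
def summarize(histogram):
    """Return (total, mean, mode) for a {stroke count: characters} map."""
    total = sum(histogram.values())
    mean = sum(strokes * n for strokes, n in histogram.items()) / total
    mode = max(histogram, key=histogram.get)  # stroke count with most characters
    return total, round(mean, 2), mode


# Non-zero counts copied from the Jurchen and Khitan columns of the table.
jurchen = {1: 3, 2: 6, 3: 25, 4: 165, 5: 287, 6: 401,
           7: 293, 8: 147, 9: 37, 10: 13}
khitan = {2: 6, 3: 28, 4: 52, 5: 60, 6: 41,
          7: 34, 8: 18, 9: 10, 10: 4, 11: 2}

TRADITIONAL_MEAN = 13.46  # taken from the table, not recomputed here

for name, hist in (("Jurchen", jurchen), ("Khitan", khitan)):
    total, mean, mode = summarize(hist)
    ratio = mean / TRADITIONAL_MEAN
    print(f"{name}: total={total} mean={mean} mode={mode} ratio={ratio:.2f}")
```

Both ratios come out below one half, which is consistent with the comparison made above.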
On the other hand, it was a surprise (to me at least) to see how closely the contour of Tangut matches that of traditional Chinese, as I had always assumed that Tangut characters must, on average, be much more complex than Chinese characters. Tangut does differ from Chinese in having no characters with very few strokes (fewer than 4) or very many strokes (more than 24), but if you ignore the lower and upper ends of the graph the distribution of stroke counts for Tangut is very close to that of traditional Chinese. Why then does Tangut text look so much more complex and more crowded than Chinese? That question could be answered by another graph that takes into account each character's frequency of occurrence. A large proportion of high frequency Chinese characters have very few strokes (e.g. 一二三人女山火水大小中), and conversely Chinese characters with very many strokes tend to occur less frequently, with the result that normal Chinese text always has a large proportion of characters with few strokes. In contrast, there does not appear to be any relationship between frequency and stroke count for Tangut characters, so that normal Tangut text is uniformly composed of characters with 12±6 strokes, with the result that it appears denser and more crowded than Chinese.
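The difference between counting each character once (as in the table) and counting it as often as it occurs in running text can be made concrete with a small sketch. The stroke counts below are real, but the frequencies are invented purely for illustration, and no Tangut frequency data is implied; the function names are hypothetical.

```python
def dictionary_mean_strokes(strokes):
    """Mean stroke count when every character counts once, as in the table."""
    return sum(strokes.values()) / len(strokes)


def weighted_mean_strokes(strokes, frequency):
    """Mean stroke count per character of running text.

    strokes:   {character: stroke count}
    frequency: {character: number of occurrences in a corpus}
    """
    total = sum(frequency.values())
    return sum(strokes[ch] * n for ch, n in frequency.items()) / total


# Toy data: real stroke counts, invented frequencies.
strokes = {"一": 1, "人": 2, "大": 3, "馬": 10, "鳥": 11}
frequency = {"一": 50, "人": 30, "大": 15, "馬": 3, "鳥": 2}

print(dictionary_mean_strokes(strokes))
print(weighted_mean_strokes(strokes, frequency))
```

When simple characters dominate the text, the weighted mean falls well below the dictionary mean; if frequency were unrelated to stroke count, as suggested for Tangut, the two figures would be expected to stay close together.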
1.6 Structure of Tangut Characters
- Individual Tangut characters not obviously derived directly from Chinese or Khitan characters
- Limited set of component elements
- Elements are themselves built from simpler elements by the addition of 1 or 2 strokes
- Most characters constructed from 2 or 3 component elements
- Very few basic elements are also characters in their own right
Series of components are constructed from a basic element, on the one hand by the addition of strokes to the basic element to make other simple components (vertical progression in the diagrams below), and on the other hand by combining these simple components with other components to make complex components (horizontal progression in the diagrams below).
Series of Tangut Components (Example A)
Series of Tangut Components (Example B)
Due to this incremental process many character components are very similar to each other, and when two or three such similar components (coloured red in the diagram below) are combined together in different combinations to make different characters (coloured blue in the diagram below), the resulting characters are very easy to confuse with one another.
Eleven Characters composed from different combinations of Five Components
1.7 Tangut Radicals
- Not true radicals (determinatives)
- But simply aids to character lookup
- Chinese dictionaries select leftmost or topmost character element as the radical
- Most Russian dictionaries base the radical on the character element at the bottom right corner of the character
In the example below, the same radical is used in both Li Fanwen's dictionary and Kychanov's dictionary, but in the former it is a lefthand radical, and in the latter it is a bottom right radical. This shows how most horizontally aligned components can occur equally on the left side or on the right side of a character, and it is largely an arbitrary decision of dictionary compilers as to whether it is treated as a lefthand side radical or a righthand side radical.
The same radical as indexed in Li Fanwen 2008 (left) and Kychanov 2006 (right)
The proposed Unicode character ordering is based on 527 left-based radicals (including some top, bottom and enclosing radicals where there is no lefthand component). The advantage of this system of ordering is that it is consistent and allows for deterministic lookup of characters, but the disadvantage is that there are some high stroke-count radicals with very few members.
1.8 Structural Analysis
- Because Tangut characters are composed of a limited set of component elements arranged in different configurations they are very amenable to structural analysis
- Nishida’s 1966 dictionary gives a structural analysis of each character
Table of Tangut Component Configurations identified by Nishida
Source: Nishida Tatsuo 西田龍雄, Seikago no kenkyū 西夏語の研究 (1964) page 246
Entry in Nishida's 1966 Tangut Dictionary
Source: Nishida Tatsuo 西田龍雄, Seikabun Shōjiten 西夏文小字典 (1966) no. 10-103
The Unicode proposal gives an Ideographic Description Sequence (IDS) for each proposed character. This borrows a character description syntax that was designed for CJK characters (although from Unicode 6.0 onwards the syntax will no longer be restricted to CJK characters).
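To show how an IDS encodes structure, here is a minimal sketch of a parser for the prefix notation. It assumes only the twelve Ideographic Description Characters U+2FF0 through U+2FFB, of which U+2FF2 and U+2FF3 take three operands and the rest take two. The example uses a Chinese character (想, described as 相 above 心, with 相 itself being 木 beside 目) rather than a Tangut one, and the function names are chosen for this illustration only.

```python
# Operators that take two operands: U+2FF0, U+2FF1 and U+2FF4..U+2FFB.
BINARY = {chr(c) for c in (0x2FF0, 0x2FF1, *range(0x2FF4, 0x2FFC))}
# Operators that take three operands: U+2FF2 and U+2FF3.
TERNARY = {chr(0x2FF2), chr(0x2FF3)}


def _parse(ids, pos):
    """Parse one component starting at pos; return (tree, next position)."""
    if pos >= len(ids):
        raise ValueError("sequence ends before all operands are supplied")
    ch = ids[pos]
    arity = 2 if ch in BINARY else 3 if ch in TERNARY else 0
    if arity == 0:
        return ch, pos + 1  # a plain component is a leaf
    children = []
    pos += 1
    for _ in range(arity):
        child, pos = _parse(ids, pos)
        children.append(child)
    return (ch, *children), pos


def parse_ids(ids):
    """Parse an IDS string into nested tuples (operator, child, child, ...)."""
    tree, end = _parse(ids, 0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete sequence")
    return tree


def components(tree):
    """Flatten a parsed IDS into its leaf components, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in components(child)]


tree = parse_ids("⿱⿰木目心")  # 想: (木 beside 目) above 心
print(tree)
print(components(tree))
```

Because each operator has a fixed number of operands, the sequence needs no brackets, and the flattened component list is the kind of key that a component-based search could match against.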
6 comments:
I love Ideographic Description Sequence! I want to order by that, and use regular expressions to search it. Can we do it already in your database?
"Jurchen and Large Khitan are the two scripts that appear to be most similar to Chinese, yet actually they are the most different when it comes to stroke count"
That's not how I see the graphs. What is most different between the scripts is the number of characters (the area under the graph). If you look at the left side, they ascend in a very similar manner, and then taper off as the need for characters is exhausted. Looked at this way, Tangut is clearly the exceptional script - it is right-shifted compared to the others.
You can look at this in terms of information density. The total number of possible characters for each stroke number is an exponential curve. For low values, the non-Tangut scripts are very information dense - they closely approximate the total number of possible characters. The reason that the right side of the curve drops off slowly (rather than using up the total possible number of characters and then dropping off suddenly) is internal morphology. What the right-shifted Tangut curve illustrates is the high degree of internal morphology of Tangut characters.
Another way to look at it: I think IDS would work very well as an input method for Tangut. It wouldn't work as well for Chinese because there are too many basic elements.
David,
You're quite right that the problem with an IDS-based input method for Chinese is that there are too many basic elements, making it somewhat impractical to display all the possible elements. In addition there are quite a few elements that are not encoded as characters (for example, the lefthand side of 师, the righthand side of 铅, the righthand side of 拣, the top of 览). On the other hand, Tangut has relatively few basic elements, all of which will be encoded, so an IDS-based input method would be more practical, and is something I want to work on (next year maybe).
But the Ideographic Description Characters make me afraid: will people code Tangut script in a dynamic-composition way? That sounds terrible! I hope they use precomposed Tangut like Chinese: ... And now someone wants to code Tangut in that way ...
You misunderstand. I use Ideographic Description Characters to help analyse the structure of Tangut characters, and in the future to create an input method for Tangut (the user can enter the component elements of a Tangut character, but the output is a single Tangut character). But no-one is proposing to encode Tangut as decomposed character components that need to be dynamically composed by the user. We are proposing to:
A) Encode 6,221 individual Tangut characters. This includes all characters found in modern dictionaries, including variant characters.
B) Encode a set of Tangut radicals and components for use in dictionary indexing and discussion of Tangut character structure by scholars.
I hope this clarifies the situation.