Sunday, 8 July 2007

Old Hanzi

As I have intimated on more than one occasion, one of the challenges facing Unicode and WG2 is how to successfully encode historic scripts which mostly do not have a standard, well-defined repertoire and which frequently exhibit great variation in character repertoire and glyph forms geographically and/or chronologically. The problems are often exacerbated by the fact that different scholars may have very different opinions on how to encode the script and what names to use for the characters (people often get very hung up on names), and it can be exceedingly difficult to reconcile these differences.

When Unicode was first devised it was intended to accommodate all the scripts of the world in common modern usage, but as can be seen from Joe Becker's 1988 outline of the proposed Unicode standard, it was not envisaged that "obsolete or rare" scripts would be allowed into the Unicode repertoire :

Unicode gives higher priority to ensuring utility for the future than to preserving past antiquities. Unicode aims in the first instance at the characters published in modern text (e.g. in the union of all newspapers and magazines printed in the world in 1988), whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally-useful Unicodes.

Joe Becker, Proposal for the Unicode Standard (29th August 1988) page 5

Ten years later, when Unicode had been around for nearly six years, there was still an antipathy in some quarters towards the encoding of rare and historic scripts, as can be seen from this position statement to SC2 by the Netherlands National Body (I just love the line about standardization bodies subsidizing academic research !) :

Market-relevance should guide selection of projects. This does not mean that academic preferences should be ignored, only that standards institutes, depending on industry contributions, cannot be expected to subsidize academic research. If Learned Societies want to raise their agreed conventions to the status of an International Standard, they should take the way of a Fast Track procedure, after having done the development themselves.

SC2 N2881 "Position of the Netherlands National Body (NNI) Regarding Further Development in JTC 1/SC 2" [1997-06-02]

Since the opening up of the supplementary planes this sort of attitude has thankfully become less prevalent, and most people involved in Unicode and 10646 have come to appreciate the importance to the scholarly community of being able to represent historical scripts (or even enigmatic script-like symbols) in electronic form. In many cases the encoding of an historic script is an important step towards greater understanding of the corpus of texts or even the decipherment of the script. As of Unicode 5.1 the following primarily historic scripts will have been encoded, in large part due to the single-handed dedication and hard work of Michael Everson :

And under consideration for encoding are a number of other historic scripts, including :

Scripts that were devised by a single person at a single point in time, such as Gothic and Phags-pa, generally have a clearcut character repertoire, but it is often difficult to define the character repertoire of scripts that evolved over a long period of time, especially when they developed geographically distinct variants as in the case of Runic. In many cases it is difficult to even clearly define the limits of the script, and there may be arguments amongst experts as to whether different assemblages of inscriptions represent the same or different scripts, or whether a script that evolves over a long period of time should be treated as a single script or a number of distinct scripts in the same lineage. When this is the case, reaching a consensus on how best to encode a script (or even whether a script should be encoded separately) can be quite difficult. Matters are only made more difficult when a proposed script is an historic form of a living script, and users of the living script insist that the characters of the proposed script should be treated as glyph variants of the corresponding characters in the modern script. This was the case when Phoenician was proposed for encoding, and subscribers to the Unicode public mailing list will remember the endless vitriolic arguments between pro-encoders (Phoenician is a separate script in its own right and should be encoded separately from Hebrew) and anti-encoders (Phoenician is just an historical variant of Hebrew that should be dealt with at the font level not the character encoding level).

Which brings me in a roundabout way to "Old Hanzi" 古漢字 (hànzì 漢字 being the Chinese word for a Chinese character or "ideograph", equivalent to the Japanese word kanji). Like other long-lived scripts, the Chinese script is best viewed as a script continuum which evolved by stages to the modern form. Up until a few years ago I think that it was generally assumed within Unicode circles that ancient forms of the Chinese script should be dealt with at the font level rather than at the encoding level, but there was pressure from within China to encode at least the most important early forms of the Chinese script, resulting in an agreement in 2003 to initially encode three important nodes in the Chinese script continuum (the links are to encoding samples for each script prepared by the Chinese National Body) :



Oracle Bone Script

No-one knows for sure when or where the Chinese script was devised, although a number of neolithic sites dating from as early as about 6600 BC up to about 2000 BC have yielded examples of individual symbols carved in isolation on tortoise shells or pottery shards that may or may not be early forms of Chinese characters (personally, I am quite sceptical that any of these marks are directly related to the Chinese script). However, the earliest undisputed stage in the Chinese script continuum that we have evidence for is the Oracle Bone Script (jiǎgǔwén 甲骨文), which was used for divination inscriptions in the royal court of the Shang 商 dynasty at the capital Yin 殷 (near modern Anyang in Henan province) during the period 1300-1050 BC (a few examples of inscribed oracle bones dating to the early Western Zhou period have also been found at a number of other sites).

A question, or more frequently a series of parallel questions, is asked by a specialist diviner, and the answer divined by applying intense localised heat to the shell or bone and observing the pattern of the resultant cracks and/or the sound that the cracks make (the character *pŏk 卜 "to divine" both graphically represents a crack, and onomatopoeically represents the sound of a crack being made). The question (usually prefixed by the cyclic day on which the divination took place and the name of the diviner) as well as the resultant prognostication are then inscribed on the shell or bone, and the object archived, so that thousands of years later archaeologists can unearth them and learn all about the daily ritual of court life in the Shang dynasty. Many thousands of inscribed oracle bones from the ancient capital of the Shang dynasty have been preserved, and they indicate that every aspect of royal life, from toothache to warfare, was governed by a complex cycle of divination and ritual.


An Oracle Bone Inscription on an Ox Scapula

Historical Relics Unearthed in New China 新中國出土文物 (Foreign Languages Press, 1972) plate 37.


The above oracle bone was discovered in 1955 southeast of the site of the ancient capital of Yin, and dates to the third of five periods that oracle bone inscriptions can be classified as belonging to. The inscription itself comprises a compound question inscribed in a single column :

丁卯、癸亥卜:王其入商,叀(惟)乙丑王弗每(悔)?

On the cyclic days ding mao [Day 4] and gui hai [Day 60] it was divined: "Should the King enter the city of Shang, and on the cyclic day yi chou [Day 2] should the king not perform the hui rite ?"

The resultant prognostication, 弘吉 "very auspicious", is incised by the crack marks to the left of the question.
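The day numbers given in square brackets in the translation follow from the sexagenary cycle, in which the ten heavenly stems and the twelve earthly branches advance in step. As a rough sketch (the function name cyclic_day is my own invention for illustration, and the modern forms of the stems and branches are assumed, as in the transcription), the position of a stem-branch pair within the sixty-day cycle can be worked out like this :

```python
STEMS = "甲乙丙丁戊己庚辛壬癸"          # the ten heavenly stems
BRANCHES = "子丑寅卯辰巳午未申酉戌亥"    # the twelve earthly branches (modern forms)

def cyclic_day(pair: str) -> int:
    """Return the position (1-60) of a stem-branch pair in the sexagenary cycle.

    Hypothetical helper, not taken from any library.
    """
    stem, branch = pair
    s = STEMS.index(stem)
    b = BRANCHES.index(branch)
    # Stems and branches advance together, so only pairs of equal parity occur.
    if (s - b) % 2:
        raise ValueError(f"{pair} is not a valid stem-branch pair")
    # Find the unique day n (counting from 0) with n % 10 == s and n % 12 == b.
    n = next(n for n in range(60) if n % 10 == s and n % 12 == b)
    return n + 1

for pair in ("丁卯", "癸亥", "乙丑"):
    print(pair, cyclic_day(pair))   # 4, 60 and 2 respectively
```

The parity check reflects the fact that only sixty of the 120 possible stem-branch combinations actually occur in the cycle.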



Bronze Inscription Script

The next stage in the history of the Chinese script is the Bronze Inscription Script (jīnwén 金文), which is a form of the Chinese script that was used for inscriptions on bronze bells and vessels. A few very short inscriptions on Shang dynasty bronze vessels (mostly little more than the name of the vessel's owner) have been found, but the vast majority of bronze inscriptions date to the succeeding Zhou dynasty (circa 1050 to 256 BC). Because of the long period during which these bronze inscriptions were made there is quite a large variation in the style of characters used. The characters found on the earliest bronze inscriptions from the Shang and early Zhou dynasties are very similar in form to those found on oracle bones (although as would be expected, oracle bone characters are generally more angular and often simpler than the corresponding bronze inscription characters due to the difficulty of inscribing characters on a hard medium such as bone and shell). Bronze inscriptions from the later period are much less closely related to the oracle bone script and are more closely related to the Small Seal script.


The Xing Hou gui 邢侯簋 ...

Chinese Bronzes: Art and Ritual (British Museum, 1987) plate 25.


... and its Inscription

Chinese Bronzes: Art and Ritual (British Museum, 1987) rubbing 10.

This is a very famous example of a ritual vessel for offering food known as a gui 簋 that was unearthed at Luoyang 洛陽 in 1921, and is now at the British Museum. The vessel dates to the early or middle Western Zhou period, and has a quite long and rather difficult to read inscription that seems to record the grant of men to the Marquis of Xing (Xing Hou 邢侯), and is dedicated to his famous ancestor, the Duke of Zhou (Zhou Gong 周公), brother of the first ruler of the Zhou dynasty ("〇" represents an undeciphered or unencoded character) :

隹(惟)三月,王令榮〇內史曰〇井(邢)侯服,易(錫)臣三品,州人、重人、庸人。拜稽首,魯天子造厥瀕福,克奔走上下帝,無冬(終)令(命)于右(有)周追考(孝?),對不敢〇,邵(昭)朕福盟,朕臣天子,用冊王令(命),乍(作)周公彞。



Small Seal Script

The Small Seal Script (xiǎo zhuàn 小篆) was adopted by the First Emperor (Qin Shi Huang 秦始皇) as the standard script of the Qin dynasty (221-206 BC). It developed from the characters used for inscriptions during the latter part of the Zhou dynasty, and so many late Zhou bronze inscriptions are written with characters that are much closer in style to the small seal script than to the early Zhou bronze inscription script. By the time that the small seal script had developed the Chinese writing system had adopted the radical/phonetic method of character composition, and so the vast majority of small seal characters correspond directly to a modern character.

The main source for the Small Seal script repertoire will be editions of the Shuowen 說文 dictionary that was compiled by Xu Shen 許慎 in about the year 100. The illustration below shows a page from the table of 540 radicals at the beginning of a modern edition of Xu Shen's dictionary :


Table of Radicals in the Shuowen Dictionary

Shuowen Jiezi 說文解字 (Zhonghua Shuju, 1963) page 3.



Discussion

You might have thought that the decision to encode these three historic script forms of Chinese would have led to the same level of complex debate and bitter argument that we saw for Phoenician, especially as the result of this decision will be to add many thousands more characters to Unicode, but there hasn't been a squeak. So here are my thoughts about some of the issues involved.

The first thing to realise is that the oracle bone script is quite different from the modern Chinese script in several respects, and that a large percentage of oracle bone characters remain undeciphered or do not correspond directly to any modern character. One of the reasons for this is that the method of composing characters by combining radical and phonetic elements, which is used for the majority of modern Chinese characters, is little used in the oracle bone script, with the result that a character that in the later script is written as a radical/phonetic compound may have been written in the earlier script as a completely different unitary character, which is unrecognisable to modern eyes.

The oracle bone script also makes use of compound characters, in which two separate characters are combined into a single glyph. For example the character jiǎ 甲 in oracle bone script is written as a cross (like 十), but the titles of the royal ancestors Shang Jia 上甲 "Upper Jia" and Xiao Jia 小甲 "Little Jia" are not written as a sequence of two characters shàng plus jiǎ and xiǎo plus jiǎ respectively, as would be expected according to the principles of the modern Chinese script, but Shang Jia is written as a cross (= 甲) in a square box, and Xiao Jia is written as a cross (= 甲) with a dot in each of the four corners (however Da Jia 大甲 "Big Jia" is written as a sequence of the two characters dà plus jiǎ). Likewise, the titles of the royal ancestors Bao Yi 報乙, Bao Bing 報丙 and Bao Ding 報丁 are each written as a sideways bowl shape (similar to a reversed "C") representing the character bào 報 with the second character of the title (yǐ 乙, bǐng 丙 or dīng 丁) enclosed within. All these compound characters are shown in the oracle bone inscription below, which is also a good example of how it is currently impossible to represent many oracle bone inscriptions accurately, and why most authors working with oracle bone texts write the characters out by hand (those characters that are currently unencoded are represented by "〇", although a couple of them are in the pipeline for CJK-D, including ⿰酉彡 for the character which looks like but isn't 酒) :

乙未,〇(⿰酉彡)〇(系)品〇(上甲)十,〇(报乙)三,〇(报丙)三,〇(报丁)三,示壬三,示癸三,大乙十,大丁十,大甲十,大庚十,〇(小甲)三,□三,且乙(祖乙)十。

Complex numbers may also be written as compound characters, so that for example the numbers "50", "60", "70", "80" and "90" are represented by the characters for "5" (五), "6" (六), "7" (七), "8" (八) and "9" (九) with the character for "10" (十, written as a vertical line in oracle bone script) joined from above.

In a few cases there is a complete disjuncture between the character used in the oracle bone script and the corresponding modern Chinese character. For example in the oracle bone script the character used to represent the first of the twelve earthly branches does not correspond to the modern character for the first earthly branch (zǐ 子) but is written with a completely unrelated glyph of unknown meaning; whereas the oracle bone character used to represent the sixth earthly branch (which is 巳 in modern Chinese) is actually written with the character for 子 "son". Thus 子 is the first earthly branch in the modern Chinese script but the sixth earthly branch in the oracle bone script.

These sorts of issues are the reason why I think that it is not practical to treat the oracle bone script simply as a stylistic variant of the modern Han script. The fact that a majority of oracle bone characters either have no known counterpart in the modern Chinese script or are significantly different from the corresponding modern Chinese character with respect to their glyph composition also makes it very difficult to represent oracle bone script text using CJK Unified Ideographs and a suitable oracle bone style font that maps oracle bone glyphs to the corresponding modern characters (in many cases the mapping just does not exist). However, it has to be said that many artificial modernised versions of oracle bone and bronze inscription characters have been encoded already or are proposed for encoding (there are 367 characters in CJK-C and 1,481 characters in CJK-D that are derived from Yinzhou Jinwen Jicheng Yinde 殷周金文集成引得 [Concordance of Shang and Zhou Dynasty Bronze Inscriptions]). And it could be argued that if the encoding of artificial modernised forms of ancient characters is extended so that all ancient characters can be mapped to an encoded CJK Unified Ideograph then there would be no need to encode the oracle bone script separately. But the counterargument is that artificial modernised forms of ancient characters can only be encoded if they are attested, and not all oracle bone script or bronze inscription script characters have been or probably ever will be represented with artificial modernised forms (and often it is almost impossible to devise a modern form of an oracle bone script character). Another argument against this approach is that different scholars may modernise a character differently, so that there may be multiple artificial modern forms for the same oracle bone character.

A further problem that scholars of ancient Chinese inscriptions face is that most oracle bone and bronze inscription characters occur in a variety of different glyph forms, often composed using different combinations of component elements, and scholars want to be able to represent these significant glyph differences at the encoding level. Just picking at random the character for "spring" (chūn 春), it occurs in at least five distinct glyph forms :

Each of these five forms of the character is written with a different set of components, and they are thus not unifiable according to the rules of CJK unification. It is to be expected that when the oracle bone script repertoire is eventually submitted for encoding it will contain separate characters for each of these forms (and probably also for other less common forms of the character).

I personally think that encoding the oracle bone script separately from the ordinary Han script is the only way for scholars to be able to work with oracle bone script texts, and I am looking forward to seeing it encoded as soon as possible. The same arguments that I have used to support the encoding of the oracle bone script may also be used for the bronze inscription script, although it could be argued that due to the similarity between the characters on early bronze inscriptions and oracle bone inscriptions it would have been better (or at least more economical) to combine the two scripts at the encoding level so as to avoid encoding duplicate versions of characters that are used in both oracle bone and bronze inscriptions.

I haven't said much about the small seal script, mainly because the issues of character identity that affect the oracle bone and bronze inscription scripts largely do not apply to the seal script. There is a high level of correspondence between small seal characters and modern Chinese characters, so I think that it is quite possible to deal with small seal script satisfactorily at the font level. Nevertheless, I don't have any strong objections to seeing the small seal script encoded as a separate script if that is what the user community wants.


Monday, 2 July 2007

CJK Unified Ideographs : To Infinity and Beyond

It has been remarked now and then that Unicode basically consists of an innumerable number of Han thingies to which assorted non-Han detritus has attached itself. And this does seem to be borne out from the figures :


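Percentage of Han Characters within the Unicode Repertoire

| Unicode Version | Total Characters | CJK Unified Ideographs | CJK Compatibility Ideographs | CJK Radicals | Other Han | Han Total | Everything Else | Percentage of Han Characters |
|---|---|---|---|---|---|---|---|---|
| 1.0 | 28,359 | 20,914 | 290 | 0 | 11 | 21,215 | 7,144 | 74.81% |
| 1.1 | 34,233 | 20,914 | 290 | 0 | 11 | 21,215 | 13,018 | 61.97% |
| 2.0 | 38,950 | 20,914 | 290 | 0 | 11 | 21,215 | 17,735 | 54.47% |
| 2.1 | 38,952 | 20,914 | 290 | 0 | 11 | 21,215 | 17,737 | 54.46% |
| 3.0 | 49,259 | 27,496 | 290 | 329 | 14 | 28,129 | 21,130 | 57.10% |
| 3.1 | 94,205 | 70,207 | 832 | 329 | 14 | 71,382 | 22,823 | 75.77% |
| 3.2 | 95,221 | 70,207 | 891 | 329 | 15 | 71,442 | 23,779 | 75.03% |
| 4.0 | 96,447 | 70,207 | 891 | 329 | 15 | 71,442 | 25,005 | 74.07% |
| 4.1 | 97,720 | 70,229 | 997 | 329 | 15 | 71,570 | 26,150 | 73.24% |
| 5.0 | 99,089 | 70,229 | 997 | 329 | 15 | 71,570 | 27,519 | 72.23% |
| 5.1 | 100,713 | 70,237 | 997 | 329 | 15 | 71,578 | 29,135 | 71.07% |
| 5.2 | 107,361 | 74,394 | 1,000 | 329 | 15 | 75,738 | 31,623 | 70.55% |
| 6.0 | 109,449 | 74,616 | 1,000 | 329 | 15 | 75,960 | 33,489 | 69.40% |
| 6.1 | 110,181 | 74,617 | 1,002 | 329 | 15 | 75,963 | 34,218 | 68.94% |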
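For anyone who wants to check the arithmetic, the derived columns are simply the sum of the four Han categories, the remainder of the repertoire, and the ratio of the two. Here is a small sketch using three rows copied from the table (the function name han_share is just something I made up for the purpose) :

```python
# (version, total characters, unified, compatibility, radicals, other Han)
ROWS = [
    ("1.0", 28359, 20914, 290, 0, 11),
    ("3.0", 49259, 27496, 290, 329, 14),
    ("5.1", 100713, 70237, 997, 329, 15),
]

def han_share(total, unified, compat, radicals, other):
    """Return (Han total, everything else, Han percentage to two places)."""
    han = unified + compat + radicals + other
    return han, total - han, round(100 * han / total, 2)

for version, total, *parts in ROWS:
    han, rest, pct = han_share(total, *parts)
    print(f"{version}: Han {han:,}  everything else {rest:,}  {pct:.2f}%")
```

The same function can be run over the remaining rows to confirm that each line of the table is internally consistent.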

Looking 10 years or so into the future, after the encoding of CJK-C, CJK-D, CJK-E and CJK-F, as well as Old Hanzi, even after taking into account large non-Han scripts such as Egyptian Hieroglyphs (~1,000), Tangut (~6,000) and Jurchen (~1,000), it is likely that the Han percentage will still be around 75% of the entire Unicode repertoire (this is assuming that Old Hanzi are classified as belonging to the Han script, which is not entirely certain).

It could also be said that Han ideographs are the driving force behind Unicode. Without them it is unlikely that there would have been the impetus to develop a 16-bit universal character set in the first place, and now that all the major modern scripts have been encoded the unfinished work on CJKV is the main reason why Unicode and 10646 are still continuing to expand. Once China and the other countries that use Han ideographs have encoded all the characters they need, then I expect that WG2 will cease to function and the ISO/IEC 10646 and Unicode standards will stabilize. This means that there is a limited window of opportunity to get as many as possible of the remaining unencoded scripts encoded.



The Han Script

In Unicode terms the Han script comprises unified ideographs, compatibility ideographs (duplicate versions of unified ideographs encoded for round-tripping compatibility with pre-existing standards) and radicals (Kangxi Radicals and CJK Radicals Supplement), as well as Suzhou numbers ("Hangzhou numbers" as they are called in Unicode), ideographic iteration marks and the ideographic zero (all in the CJK Symbols and Punctuation block).

Not included within the Han script are CJK Strokes and Ideographic Description Characters, which are both classified as "common" by Unicode. This makes sense as other (not yet encoded) scripts such as Tangut, Jurchen and Greater Khitan can all be analysed using ideographic description sequences. The characters of these scripts are also composed from the same or similar stroke elements as Han ideographs, and so "CJK" strokes may be used for these scripts when they are encoded (e.g. character indexes for Tangut and Jurchen dictionaries are often subdivided by stroke type). Indeed, I don't see any reason why those strokes that are peculiar to Tangut characters may not be encoded in the "CJK Strokes" block.
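To give an idea of how an ideographic description sequence is analysed, here is a minimal sketch of a parser that turns a sequence such as ⿰冫玉 into a tree. It assumes only the twelve description characters at U+2FF0..U+2FFB, and the function name parse_ids is my own; it is not meant as a complete or authoritative implementation :

```python
# Ideographic description characters and the number of components each takes.
# Only U+2FF0..U+2FFB are handled in this sketch.
TERNARY = {chr(0x2FF2), chr(0x2FF3)}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY

def parse_ids(ids: str):
    """Parse an ideographic description sequence into a nested tuple."""
    def node(i):
        ch = ids[i]                      # IndexError here means a truncated sequence
        arity = 2 if ch in BINARY else 3 if ch in TERNARY else 0
        if arity == 0:
            return ch, i + 1             # an ordinary component character
        parts = []
        i += 1
        for _ in range(arity):
            part, i = node(i)
            parts.append(part)
        return (ch, *parts), i

    tree, end = node(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete sequence")
    return tree

print(parse_ids("⿰冫玉"))        # ('⿰', '冫', '玉')
print(parse_ids("⿱艹⿰日月"))    # ('⿱', '艹', ('⿰', '日', '月'))
```

Nothing in the parser cares which script the component characters belong to, which is the practical reason why it makes sense for the description characters themselves to be script-neutral.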


Breakdown of the Han Script by Block (as of Unicode 6.1)

| Block Name | Range | Han Characters | Unicode Versions |
|---|---|---|---|
| CJK Unified Ideographs | 4E00..9FFF | 20,941 | 1.0, 4.1, 5.1, 5.2, 6.1 |
| CJK Unified Ideographs Extension A | 3400..4DBF | 6,582 | 3.0 |
| CJK Unified Ideographs Extension B | 20000..2A6DF | 42,711 | 3.1 |
| CJK Compatibility Ideographs | F900..FAFF | 472 | 1.0, 3.2, 4.1, 5.2, 6.1 |
| CJK Compatibility Ideographs Supplement | 2F800..2FA1F | 542 | 3.1 |
| Kangxi Radicals | 2F00..2FDF | 214 | 3.0 |
| CJK Radicals Supplement | 2E80..2EFF | 115 | 3.0 |
| CJK Symbols and Punctuation | 3000..303F | 15 | 1.0, 3.0, 3.2 |

Note that the total number of Unified Ideographs (70,237) is twelve more than the sum of the three CJK Unified Ideograph blocks, as twelve characters in the CJK Compatibility Ideographs block are actually unified ideographs.
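The twelve strays can be picked out mechanically, because a true compatibility ideograph has a canonical decomposition to its unified counterpart, whereas these twelve have none. The following sketch relies on Python's standard unicodedata module and assumes that the interpreter's Unicode database is no older than Unicode 5.1 (the function name is hypothetical) :

```python
import unicodedata

def unified_in_compat_block():
    """List code points in F900..FAFF that are assigned but have no decomposition."""
    found = []
    for cp in range(0xF900, 0xFB00):
        ch = chr(cp)
        if unicodedata.category(ch) == "Cn":    # unassigned in this Unicode database
            continue
        if not unicodedata.decomposition(ch):   # no mapping to a unified ideograph
            found.append(cp)
    return found

twelve = unified_in_compat_block()
print(len(twelve), " ".join(f"U+{cp:04X}" for cp in twelve))
# 12 U+FA0E U+FA0F U+FA11 U+FA13 U+FA14 U+FA1F U+FA21 U+FA23 U+FA24 U+FA27 U+FA28 U+FA29
```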

There seems to be no end to the growth in numbers of unified ideographs, and perhaps if anyone could have imagined when Unicode was first instigated that eventually over 100,000 Chinese, Japanese, Korean, Vietnamese and Zhuang ideographs would be encoded, then maybe a compositional model of Han ideograph encoding would have been considered; as it is we are stuck, for better or for worse, with a unitary ideograph encoding model (see the Comments to A Brief History of CJK-C for some discussion of this issue), so the only way to represent unencoded Han characters is to add yet more and more unified ideographs to the standard.

But, however many ideographs are encoded, it always seems possible to find yet more to encode. And if you have much dealing with modern, informal Chinese usages such as letter-writing and sign-writing, you will doubtless have encountered a whole class of Han characters which are largely unencoded, that is Second Stage Simplifications :



In the above Chinese postage stamp from 1978 you can see (with a strong magnifying glass!) the word "lacquerware" qīqì 漆器 written with ultrasimplified characters (㲺 for 漆, and a rectangle with a vertical stroke for 器). The ultrasimplified form of 器 (a rectangle with a vertical stroke) is scheduled for encoding in CJK-D [what was going to be CJK-D when this post was originally written, but which is now rescheduled as CJK-E because CJK-D has been taken by a couple of hundred "urgent need characters"], together with some other ultrasimplified forms (e.g. hollow 面 and the right-hand side of 能), but no systematic proposal to encode all of the second stage simplifications has yet been made.



CJK-D

CJK-D was originally intended to comprise some 16,000+ ideographs that had not made it into CJK-C (see pages 1-100, 101-200, 201-300 and 301-396). However, just a month ago Taiwan withdrew 6,545 personal name usage characters from CJK-D that were no longer in use (see IRG N1306), so CJK-D has now been reduced in size to about 10,000 characters, plus about fifty more that will be taken out of CJK-C.

The proposed CJK-D collection includes a few characters that I have been patiently waiting to be encoded for many years now, including this one that I had to hack a glyph for when I was compiling and typesetting the Catalogue of the Morrison Collection nearly ten years ago (spot the deliberate error !) :



The character in question (⿰冫玉) is identifiable from context as being a variant form of jué 珏, where the "two dots of water" act as a component iteration mark (i.e. jade doubled), as they also do in U+3560 㕠 (a variant form of shuāng 雙). My great delight in seeing this old friend encoded at last is only matched by my utter dejection when I realise that it is one of the withdrawn Taiwan characters, and with no other source reference it will not be in the proposed CJK-D set after all.



CJK-E

The CJK-D collection is now closed for business, and new submissions (such as 1,277 Vietnamese characters, 24 Taiwan characters for Minnan and Hakka usage and 2 PRC placename characters) are queuing for inclusion in CJK-E. Work on CJK-E has not yet officially started, so I'm not going to guess at how many characters it may comprise eventually.



Zhuang Usage Ideographs

One very large set of ideographs that remains largely unencoded is the "Zhuang square characters" fangkuai Zhuangzi 方塊壯字 (known as sawndip in the Zhuang language) that have (mostly in the past) been used to write the Zhuang language. These Zhuang ideographs comprise a mixture of existing Chinese ideographs borrowed for their meaning or pronunciation, together with many idiosyncratic creations modelled on Chinese ideographs (mostly on the same principles of radical and phonetic that are used for Chinese, but with some more interesting methods of forming characters as well). As Zhuang usage of Chinese and Chinese-style ideographs was never standardized the actual choice of character used to represent any particular syllable varies from manuscript to manuscript, and as can be seen from the first page of the Gu Zhuangzi Zidian 古壯字字典 [Dictionary of Old Zhuang Characters] (Guangxi Minzu Chubanshe, 1989) there are usually multiple ways of writing any given syllable :


[Image courtesy of John Knightley]


Work on a comprehensive encoding proposal for Zhuang usage ideographs has just started at Guangxi University, but there is a huge amount of material to cover, and it will probably take 3-5 years before the complete set of unencoded ideographs has been identified and analysed. The end result may be another 5,000-10,000 characters to be encoded after CJK-E.

Later in the year (or more probably next year) I want to analyse in detail an actual example of a Zhuang poetic text written in sawndip characters, but for my final post of the current blogging season I will be taking a look at Old Hanzi.