Monday, 2 July 2007

CJK Unified Ideographs : To Infinity and Beyond

It has been remarked now and then that Unicode basically consists of an innumerable number of Han thingies to which assorted non-Han detritus has attached itself. And this does seem to be borne out by the figures:


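Percentage of Han Characters within the Unicode Repertoire

| Unicode Version | Total Characters | CJK Unified Ideographs | CJK Compatibility Ideographs | CJK Radicals | Other Han | Han Total | Everything Else | Percentage of Han Characters |
|---|---|---|---|---|---|---|---|---|
| 1.0 | 28,359 | 20,914 | 290 | 0 | 11 | 21,215 | 7,144 | 74.81% |
| 1.1 | 34,233 | 20,914 | 290 | 0 | 11 | 21,215 | 13,018 | 61.97% |
| 2.0 | 38,950 | 20,914 | 290 | 0 | 11 | 21,215 | 17,735 | 54.47% |
| 2.1 | 38,952 | 20,914 | 290 | 0 | 11 | 21,215 | 17,737 | 54.46% |
| 3.0 | 49,259 | 27,496 | 290 | 329 | 14 | 28,129 | 21,130 | 57.10% |
| 3.1 | 94,205 | 70,207 | 832 | 329 | 14 | 71,382 | 22,823 | 75.77% |
| 3.2 | 95,221 | 70,207 | 891 | 329 | 15 | 71,442 | 23,779 | 75.03% |
| 4.0 | 96,447 | 70,207 | 891 | 329 | 15 | 71,442 | 25,005 | 74.07% |
| 4.1 | 97,720 | 70,229 | 997 | 329 | 15 | 71,570 | 26,150 | 73.24% |
| 5.0 | 99,089 | 70,229 | 997 | 329 | 15 | 71,570 | 27,519 | 72.23% |
| 5.1 | 100,713 | 70,237 | 997 | 329 | 15 | 71,578 | 29,135 | 71.07% |
| 5.2 | 107,361 | 74,394 | 1,000 | 329 | 15 | 75,738 | 31,623 | 70.55% |
| 6.0 | 109,449 | 74,616 | 1,000 | 329 | 15 | 75,960 | 33,489 | 69.40% |
| 6.1 | 110,181 | 74,617 | 1,002 | 329 | 15 | 75,963 | 34,218 | 68.94% |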
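The derived columns follow from simple arithmetic on the others: the Han total is the sum of the four Han columns, "everything else" is what remains of the total, and the percentage is the Han total over the total. The sketch below recomputes them for three rows copied from the table; the function name `han_share` and the choice of rows are purely illustrative.

```python
# Rows copied from the table above:
# (version, total characters, unified, compatibility, radicals, other Han)
ROWS = [
    ("1.0", 28359, 20914, 290, 0, 11),
    ("3.0", 49259, 27496, 290, 329, 14),
    ("5.1", 100713, 70237, 997, 329, 15),
]


def han_share(total, unified, compat, radicals, other):
    """Return (han_total, everything_else, han_percentage)."""
    han_total = unified + compat + radicals + other
    return han_total, total - han_total, 100.0 * han_total / total


for version, total, *han_parts in ROWS:
    han_total, rest, pct = han_share(total, *han_parts)
    print(f"{version}: Han {han_total:,}  everything else {rest:,}  {pct:.2f}%")
```

Running the same function over every row is a quick way to catch a mistyped figure in any of the derived columns.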

Looking 10 years or so into the future, after the encoding of CJK-C, CJK-D, CJK-E and CJK-F, as well as Old Hanzi, even after taking into account large non-Han scripts such as Egyptian Hieroglyphs (~1,000), Tangut (~6,000) and Jurchen (~1,000), it is likely that the Han percentage will still be around 75% of the entire Unicode repertoire (this is assuming that Old Hanzi are classified as belonging to the Han script, which is not entirely certain).

It could also be said that Han ideographs are the driving force behind Unicode. Without them it is unlikely that there would have been the impetus to develop a 16-bit universal character set in the first place, and now that all the major modern scripts have been encoded, the unfinished work on CJKV is the main reason why Unicode and 10646 are still continuing to expand. Once China and the other countries that use Han ideographs have encoded all the characters they need, I expect that WG2 will cease to function and the ISO/IEC 10646 and Unicode standards will stabilize. This means that there is a limited window of opportunity to get as many as possible of the remaining unencoded scripts encoded.



The Han Script

In Unicode terms the Han script comprises unified ideographs, compatibility ideographs (duplicate versions of unified ideographs encoded for round-tripping compatibility with pre-existing standards) and radicals (Kangxi Radicals and CJK Radicals Supplement), as well as Suzhou numbers ("Hangzhou numbers" as they are called in Unicode), ideographic iteration marks and the ideographic zero (all in the CJK Symbols and Punctuation block).

Not included within the Han script are CJK Strokes and Ideographic Description Characters, which are both classified as "common" by Unicode. This makes sense as other (not yet encoded) scripts such as Tangut, Jurchen and Greater Khitan can all be analysed using ideographic description sequences. The characters of these scripts are also composed from the same or similar stroke elements as Han ideographs, and so "CJK" strokes may be used for these scripts when they are encoded (e.g. character indexes for Tangut and Jurchen dictionaries are often subdivided by stroke type). Indeed, I don't see any reason why those strokes that are peculiar to Tangut characters may not be encoded in the "CJK Strokes" block.


Breakdown of the Han Script by Block (as of Unicode 6.1)

| Block Name | Range | Han Characters | Unicode Versions |
|---|---|---|---|
| CJK Unified Ideographs | 4E00..9FFF | 20,941 | 1.0, 4.1, 5.1, 5.2, 6.1 |
| CJK Unified Ideographs Extension A | 3400..4DBF | 6,582 | 3.0 |
| CJK Unified Ideographs Extension B | 20000..2A6DF | 42,711 | 3.1 |
| CJK Compatibility Ideographs | F900..FAFF | 472 | 1.0, 3.2, 4.1, 5.2, 6.1 |
| CJK Compatibility Ideographs Supplement | 2F800..2FA1F | 542 | 3.1 |
| Kangxi Radicals | 2F00..2FDF | 214 | 3.0 |
| CJK Radicals Supplement | 2E80..2EFF | 115 | 3.0 |
| CJK Symbols and Punctuation | 3000..303F | 15 | 1.0, 3.0, 3.2 |

Note that the total number of Unified Ideographs (70,237) is twelve more than the sum of the three CJK Unified Ideograph blocks, as twelve characters in the CJK Compatibility Ideographs block are actually unified ideographs.
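The twelve strays can be listed with Python's standard `unicodedata` module. This is only a sketch: it assumes that "has no decomposition mapping" is a workable proxy for "is really a unified ideograph" within the CJK Compatibility Ideographs block, and the helper name `assigned` is made up for the example. Python bundles a later version of the Unicode Character Database than 6.1, so counts for blocks that have grown since then will not match the table.

```python
import unicodedata


def assigned(first, last):
    """Code points in [first, last] that are assigned in Python's bundled UCD."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) != "Cn"]


# CJK Compatibility Ideographs block, range taken from the table above.
compat_block = assigned(0xF900, 0xFAFF)

# True compatibility ideographs decompose to a unified ideograph;
# the unified ideographs living in this block have no decomposition.
unified_in_compat = [cp for cp in compat_block
                     if not unicodedata.decomposition(chr(cp))]

print(len(compat_block), "assigned,", len(unified_in_compat), "without decomposition")
print(" ".join(f"U+{cp:04X}" for cp in unified_in_compat))
```

The same `assigned` helper can be pointed at the Kangxi Radicals or CJK Radicals Supplement ranges to check the counts given for those blocks.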

There seems to be no end to the growth in the number of unified ideographs. Perhaps if anyone could have imagined, when Unicode was first instigated, that over 100,000 Chinese, Japanese, Korean, Vietnamese and Zhuang ideographs would eventually be encoded, then a compositional model of Han ideograph encoding might have been considered. As it is we are stuck, for better or for worse, with a unitary ideograph encoding model (see the Comments to A Brief History of CJK-C for some discussion of this issue), so the only way to represent unencoded Han characters is to add more and more unified ideographs to the standard.

But however many ideographs are encoded, it always seems possible to find yet more to encode. And if you have much to do with modern, informal Chinese usages such as letter-writing and sign-writing, you will doubtless have encountered a whole class of Han characters which are largely unencoded, namely Second Stage Simplifications:



In the above Chinese postage stamp from 1978 you can see (with a strong magnifying glass!) the word "lacquerware" qīqì 漆器 written with ultrasimplified characters (㲺 for 漆, and a rectangle with a vertical stroke for 器). The ultrasimplified form of 器 (a rectangle with a vertical stroke) is scheduled for encoding in CJK-D [what was going to be CJK-D when this post was originally written, but which is now rescheduled as CJK-E because CJK-D has been taken by a couple of hundred "urgent need characters"], together with some other ultrasimplified forms (e.g. hollow 面 and the right-hand side of 能), but no systematic proposal to encode all of the second stage simplifications has yet been made.



CJK-D

CJK-D was originally intended to comprise some 16,000+ ideographs that had not made it into CJK-C (see pages 1-100, 101-200, 201-300 and 301-396). However, just a month ago Taiwan withdrew 6,545 personal name usage characters from CJK-D that were no longer in use (see IRG N1306), so CJK-D has now been reduced in size to about 10,000 characters, plus about fifty more that will be taken out of CJK-C.

The proposed CJK-D collection includes a few characters that I have been patiently waiting to be encoded for many years now, including this one that I had to hack a glyph for when I was compiling and typesetting the Catalogue of the Morrison Collection nearly ten years ago (spot the deliberate error !) :



The character in question (⿰冫玉) is identifiable from context as being a variant form of jué 珏, where the "two dots of water" act as a component iteration mark (i.e. jade doubled), as they also do in U+3560 㕠 (a variant form of shuāng 雙). My great delight in seeing this old friend encoded at last is only matched by my utter dejection when I realise that it is one of the withdrawn Taiwan characters, and with no other source reference it will not be in the proposed CJK-D set after all.
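An ideographic description sequence such as ⿰冫玉 is a prefix expression: each Ideographic Description Character is followed by the components it combines. As a rough sketch of how such a sequence can be checked for well-formedness, the parser below assumes only the twelve description characters U+2FF0..U+2FFB (later versions of Unicode define more), treats U+2FF2 and U+2FF3 as taking three operands and the rest two, and accepts any other character as a component. The names `parse_ids` and `is_valid_ids` are invented for this example.

```python
# Ideographic Description Characters U+2FF0..U+2FFB (assumed set, see lead-in).
# U+2FF2 and U+2FF3 take three operands; the others take two.
TERNARY = {"\u2FF2", "\u2FF3"}
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY


def parse_ids(seq, pos=0):
    """Parse one description sequence starting at seq[pos].

    Returns (tree, next_pos), where tree is either a single component
    character or a tuple (operator, operand, operand[, operand]).
    """
    if pos >= len(seq):
        raise ValueError("sequence ends before all operands are supplied")
    ch = seq[pos]
    if ch in BINARY or ch in TERNARY:
        arity = 3 if ch in TERNARY else 2
        pos += 1
        operands = []
        for _ in range(arity):
            node, pos = parse_ids(seq, pos)
            operands.append(node)
        return (ch, *operands), pos
    # Anything else is treated as a leaf component.
    return ch, pos + 1


def is_valid_ids(seq):
    """True if seq is exactly one complete description sequence."""
    try:
        _, end = parse_ids(seq)
    except ValueError:
        return False
    return end == len(seq)


# U+2FF0 (left to right) + U+51AB (two dots of water) + U+7389 (jade)
tree, _ = parse_ids("\u2FF0\u51AB\u7389")
print(tree, is_valid_ids("\u2FF0\u51AB\u7389"))
```

Because the sequence only describes a shape, a well-formed sequence says nothing about whether the described character is actually encoded.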



CJK-E

The CJK-D collection is now closed for business, and new submissions (such as 1,277 Vietnamese characters, 24 Taiwan characters for Minnan and Hakka usage and 2 PRC placename characters) are queuing for inclusion in CJK-E. Work on CJK-E has not yet officially started, so I'm not going to guess at how many characters it may comprise eventually.



Zhuang Usage Ideographs

One very large set of ideographs that remains largely unencoded is the "Zhuang square characters" fangkuai Zhuangzi 方塊壯字 (known as saw ndip in the Zhuang language) that have (mostly in the past) been used to write the Zhuang language. These Zhuang ideographs comprise a mixture of existing Chinese ideographs borrowed for their meaning or pronunciation, together with many idiosyncratic creations modelled on Chinese ideographs (mostly on the same principles of radical and phonetic that are used for Chinese, but with some more interesting methods of forming characters as well). As Zhuang usage of Chinese and Chinese-style ideographs was never standardized, the actual choice of character used to represent any particular syllable varies from manuscript to manuscript, and as can be seen from the first page of the Gu Zhuangzi Zidian 古壯字字典 [Dictionary of Old Zhuang Characters] (Guangxi Minzu Chubanshe, 1989) there are usually multiple ways of writing any given syllable:


[Image courtesy of John Knightley]


Work on a comprehensive encoding proposal for Zhuang usage ideographs has just started at Guangxi University, but there is a huge amount of material to cover, and it will probably take 3-5 years before the complete set of unencoded ideographs has been identified and analysed. The end result may be another 5,000-10,000 characters to be encoded after CJK-E.

Later in the year (or more probably next year) I want to analyse in detail an actual example of a Zhuang poetic text written in sawndip characters, but for my final post of the current blogging season I will be taking a look at Old Hanzi.


4 comments:

Anonymous said...

𣎵 U+233B5 can be found in Unihan 5.1

Andrew West said...

Excellent, thanks for that -- I just never thought of looking under the 木 radical. And of course it is in the Kangxi Dictionary, so after all it is not an unencoded character. Oh well, sorry about that!

JAEMIN said...

It seems like CJK D is now known as CJK E and these "urgent need characters" are now known as CJK D.

Andrew West said...

Yes, that's right -- looks like I'll have to update all my posts that mention CJK-D.