Saturday, 10 October 2009

A Modest Proposal to Encode Ultra-Simplified Chinese Characters


A lot of people hate simplified Chinese characters, but I personally think they are great, and that the only things better than simplified Chinese characters are simplified simplified Chinese characters. But for some reason most of the second stage simplified characters introduced in 1977 (and abandoned less than a year later) remain unencoded in Unicode, so it is difficult for ultra-simplificationists like myself to communicate electronically in our preferred form of writing.

As it does not look as if China is in any hurry to propose them for encoding, I have put together a modest proposal to encode the 257 outstanding second stage simplifications from 1977, as well as 23 unencoded Singapore simplifications from 1969, and 55 unencoded "first batch" simplifications from 1935. This document is available as a 112-page, 10 MB PDF from the WG2 document register (N3695) or in its original HTML format from my web site. NB: if you are going to view the HTML source file then you should first install my BabelStone Han font (details here), otherwise the document will make little sense.

Unfortunately, WG2 delegates CJK encoding matters to its Ideographic Rapporteur Group (IRG), which I understand does not accept submissions from individuals, so unless China or Unicode adopts my proposal it is not likely to get anywhere very fast.


22 comments:

Michael Everson said...

I support their encoding.

Astara said...

you just need to become your own organization.

I think that the 'BabelStone Organization', needs to be formed! (Or just register a ".org" domain and start putting up documents) Get linked in, maybe with w3.org.... get some/build some cred....then they might listen?... Get babelmap linked to from unicode.org (if it's not already) (I found you through the Fonts & Encoding 'bible' (book))....

Use babelmap (though not pad) as much as possible to learn alternate to MS's poopy charmap.

Hey -- also , re: babelmap -- have you thought about a linux version? You could sublicence the source to allow an open source version for linux, with restriction that code not be used for a windows version, though the windows version could use some enhancement ....just some thoughts...linda

Philip said...

I would consider it a good idea for Second Round simplified characters to be encoded, too.

Rasmus said...

The 23 Singapore and especially the 257 Cultural-Revolution simplifications are not problematic -- they allow the user to create an alternate simplified-Chinese character set for the times and places in question. In effect, they are distinct standards that supplement the GB 2312 set.

But I'm less sure about the 1935 set. It seems like a different animal. I presume it was an official educational standard, even if only for a few months, and I understand your desire to underline the significance of this early effort. But it was never completed, it doesn't mesh with the later simplifications (i.e. it's a supplement to traditional Chinese, not simplified Chinese, and so the argument for encoding it is completely different from the other two), and encoding it seems like more of an academic exercise than anything else.

I'm just sort of thinking aloud, perhaps mostly as Devil's advocate, but I think you should consider revising the proposal to not include the 1935 set -- without that, it would be an absolute no-brainer. If you feel strongly about it, then put it in a separate proposal. It feels like you're mixing apples and oranges here.

Plus, it's a slippery slope. If a variant being used in a book is enough for it's being encoded, then that opens a door to all of the myriad variants used by publishers over the centuries. Anyone who spends time comparing old editions of texts sees these on a regular basis. Where do you draw the line?

For example, what about the Siku quanshu? It is an Imperial, state-sanctioned production with infinitely more historical value and academic importance than the aborted 1935 simplification. Certainly it is one of the largest and most significant editorial projects in human history. However, if you've worked with the Digital Heritage database, you know it contains a number of variant forms that can't be represented in Unicode -- they use a special font for them. Maybe it's just my luck, but I seem to encounter a lot of them. Is there a case for encoding them? I would say yes. But is that because the SKQS is unique, like the Taisho canon (which I also think should be encoded entirely if it hasn't been already via Morohashi, and probably also the Daoist canon which I don't know much about), or because every published text that contains a non-unifiable variant should be represented in Unicode?

Rasmus said...

Ugh -- I hate it when people use "it's" when they mean "its" -- see the second instance in my comment. I revised it several times but still didn't catch that typo, I'm afraid...

Rasmus said...

Err, third instance, not second. Going a bit nuts here now. Must be all those chocolate-covered coffee beans... !

電腦小天地 said...

Your character #17 is already in U+2F800. Right?

Andrew West said...

Yes, but compatibility ideographs don't count. 2F800 looks like the character in question, but any compliant Unicode process is at liberty to turn it into 4E3D whenever it wants; so although on the surface it looks like the same character, in reality it is just 4E3D in disguise. See the discussion of the newly encoded 9FC3 鿃 in Vanished in the Twinkling of an Eye.
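The normalization behaviour described here can be checked with a minimal Python sketch. It assumes only the standard library's unicodedata module (whose data tables depend on the Python version in use); the variable names are illustrative and not part of any proposal.

```python
import unicodedata

compat = "\U0002F800"   # CJK COMPATIBILITY IDEOGRAPH-2F800
unified = "\u4E3D"      # the unified ideograph it canonically decomposes to

# Apply each of the four normalization forms to the compatibility ideograph.
forms = ("NFC", "NFD", "NFKC", "NFKD")
normalized = {form: unicodedata.normalize(form, compat) for form in forms}

for form, text in normalized.items():
    # Each result is a single character; print its code point.
    print(form, f"U+{ord(text):04X}")
```

Because the mapping is canonical rather than merely a compatibility mapping, even NFC and NFD replace the character, which is why text containing U+2F800 cannot be relied on to survive interchange unchanged.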

電腦小天地 said...

The characters in “CJK Compatibility Ideographs Supplement” (except 2F840, 2F894, 2F8CD, 2F994, 2F9B2, 2F9D4, which were from HKSCS) came from the CNS11643 standard of Taiwan. But Taiwan itself does not consider these characters to be “compatibility characters”. All of them have different shapes or glyphs. This differs from all but 12 of the characters in CJK Compatibility Ideographs (F900–FAFF), which we know were intentionally encoded twice in the legacy standards.

p.s. please check this: U+4695 ←→ CNS 5-7A38 ←→ U+2F9CB according to the current standard. I am very confused by it.

電腦小天地 said...

AFAIK CNS11643 has only one codepoint, 5-7A38, for 麗+見. It is not justifiable for U+2F9CB to have appeared in Unicode since U+4695 has no kIRG_TSource, and finally I'd suggest that the characters in “CJK Compatibility Ideographs Supplement” (except 2F840, 2F894, 2F8CD, 2F994, 2F9B2, 2F9D4) be treated as NOT-COMPATIBILITY IDEOGRAPHs.

flow said...

do you have any documentation on your BabelStone Han font? i googled for it but couldn't find anything but the font itself (which i downloaded and installed, and it works).

beyond that, i want to say that i think that the whole unicode compatibility character / variant selector thing has gone utterly wrong. i know no simple solution to the problem that e.g. the grass radical and a host of other components can be written in several ways and so produce thousands of character shapes that differ ever so slightly. but on the other end of the scale it just feels wrong to state on the one hand that character A looks distinct from character B and then proceed to ponder whether they are subsumable under a common codepoint. cmon there are lots of codepoints in unicode and their very reason to exist is to make characters / glyphs / elements of writing identifiable.

what bugs me most is that there does not seem to be any public and authoritative documentation of the character variants à la http://en.wikipedia.org/wiki/Han_unification#Examples_of_language_dependent_characters that could tell me what to expect of a font and how to design my own.

add to this that i am expected to put information about the shape of a character into some extra channel (an html language annotation or such) and i cringe. this feels so wrong. i know i know i know there is an analogy with the two shapes of latin a and g here but language-dependent variants cannot be the solution.

i would be very grateful if this horrible idea got at least documented; the only thing i have is the wikipedia article and some remarks like "yeah things may look different at some codepoints"-like statements by the unicode consortium. they never seem to want to tell me exactly which variants got stuffed together at what codepoints. it is a mess.

Andrew West said...

do you have any documentation on your BabelStone Han font?

It's still under development, and has not been officially released yet. I was hoping to at least complete CJK-C before releasing it, but at my current rate of progress that may take another five years!

what bugs me most is that there does not seem to be any public and authoritative documentation of the character variants à la http://en.wikipedia.org/wiki/Han_unification#Examples_of_language_dependent_characters that could tell me what to expect of a font and how to design my own.

The Unicode 5.2 Code Charts now give the CJK/-A/-B/-C charts in multi-column format, showing the representative glyph forms for different sources (China, Taiwan, Japan, Korea, etc.), which should help you.

add to this that i am expected to put information about the shape of a character into some extra channel (an html language annotation or such) and i cringe. this feels so wrong.

I think that the expectation is that a font would use the appropriate OpenType features to indicate which language the glyphs are intended for.

Andrew West said...

@電腦小天地

finally I'd suggest that the characters in “CJK Compatibility Ideographs Supplement” (except 2F840, 2F894, 2F8CD, 2F994, 2F9B2, 2F9D4) be treated as NOT-COMPATIBILITY IDEOGRAPHs.

I would agree that compatibility ideographs should not have a different glyph shape; if they have a different glyph shape then they should either be encoded separately as unified ideographs or else encoded as an IVS (Ideographic Variation Sequence). In the past, and even now, glyph variants of the same magnitude as U+2F9CB versus U+4695 have been encoded as separate unified ideographs according to ISO/IEC 10646:2003 Annex S "Procedure for the unification and arrangement of CJK Ideographs", so I do wonder why all these glyph variants ever got encoded as compatibility ideographs (it was before I got involved in Unicode). Unfortunately, the fact that all the characters in the CJK Compatibility Ideographs Supplement have canonical decomposition mappings to unified ideographs means that they definitely are compatibility ideographs, and cannot be treated as if they are not. Personally, I would say that there is a case for re-encoding the glyph variant compatibility ideographs as unified ideographs (this idea will never fly) or as ideographic variation sequences (a more likely possibility).
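The claim about canonical decomposition mappings can be verified with a short Python sketch. It assumes the standard library's unicodedata module and takes the assigned range of the CJK Compatibility Ideographs Supplement to be U+2F800 to U+2FA1D; the helper name canonical_target is hypothetical and exists only for this illustration.

```python
import unicodedata

# Assumed assigned range of the CJK Compatibility Ideographs Supplement.
FIRST, LAST = 0x2F800, 0x2FA1D

def canonical_target(cp):
    """Return the single code point that cp canonically decomposes to, or None."""
    mapping = unicodedata.decomposition(chr(cp))
    # An empty string means no mapping; a leading "<tag>" marks a
    # compatibility (non-canonical) mapping.
    if not mapping or mapping.startswith("<"):
        return None
    parts = mapping.split()
    return int(parts[0], 16) if len(parts) == 1 else None

targets = {cp: canonical_target(cp) for cp in range(FIRST, LAST + 1)}
missing = [cp for cp, target in targets.items() if target is None]

print(len(targets), "characters checked,", len(missing), "without a canonical singleton mapping")
```

An empty missing list is what makes these characters compatibility ideographs in the formal sense, whatever their glyphs look like in the source standards.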

電腦小天地 said...

An important reason for a CJK character being encoded twice is its sources. For example, 兀 appears twice in Big-5; but that is not the case for 麗+見, as no legacy charset contains it twice.

On the other hand, the characters in “CJK Compatibility Ideographs Supplement” are not duplicates in Taiwan standards.

My conclusions are that U+2F8xx are not compatibility characters at all, and that U+2F9CB is a UNICODE DUPLICATE and should be deprecated ASAP.

電腦小天地 said...

Can the “canonical decomposition mappings” for characters in the CJK Compatibility Ideographs Supplement be deleted in a later version of Unicode?

Andrew West said...

Can the “canonical decomposition mappings” for characters in the CJK Compatibility Ideographs Supplement be deleted in a later version of Unicode?

No.

電腦小天地 said...

If I need to map/use those characters in Unicode, what can I do? Ask the authority to encode those characters (twice) elsewhere (and deprecate the 2Fxxx)? Or?

Anyway, no one wants to waste encoding space.

沛睿 said...

I find it kind of sad that you feel that way about the 2nd round of simplified characters. They have lost all of their beauty and almost all of their semantic and phonetic value. They are empty, and the only thing left to do with them is complete the simplification to a Latin alphabet or something similar to the Japanese syllabary... That was the idea at the time...

If you don't like the traditional characters, then perhaps you shouldn't be learning (or doing anything) with Chinese characters. Just my opinion as an American learning Mandarin (and traditional characters at that).

水水 said...

SJ-29 is already encoded at U+4E3D (T-column).

Philip Newton said...

The link to the proposal does not work (any longer?) due to case-sensitivity. Please change it from http://std.dkuug.dk/jtc1/sc2/wg2/docs/N3695.pdf to http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3695.pdf (capital N to small n before the number 3695). Thank you!

Andrew West said...

Thanks for pointing that out -- please try again now.

Philip Newton said...

Thank you for the quick fix!

Incidentally, in the linked PDF itself, I wonder whether entry EJ-2.1.011 on page 38 is the wrong way round: it seems to say that 斗 is simplified as 陡, which seems counterintuitive (and contradicts the example on page 26 in table 2, subtable 1).

Alas, it's probably not possible for you to correct the PDF on dkuug.dk, but if you ever update that proposal, that's an erratum to keep in mind.