Saturday, 10 October 2009

A Modest Proposal to Encode Ultra-Simplified Chinese Characters

A lot of people hate simplified Chinese characters, but I personally think they are great, and that the only things better than simplified Chinese characters are simplified simplified Chinese characters. But for some reason most of the second stage simplified characters introduced in 1977 (and abandoned less than a year later) remain unencoded in Unicode, so it is difficult for ultra-simplificationists like myself to communicate electronically in our preferred form of writing. As it does not look as if China is in any hurry to propose them for encoding, I have put together a modest proposal to encode the 257 outstanding second stage simplifications from 1977, as well as 23 unencoded Singapore simplifications from 1969, and 55 unencoded "first batch" simplifications from 1935.

This 112-page document, entitled "Proposal to Encode Obsolete Simplified Chinese Characters", is available as document N3695 in the WG2 document register or as document L2/09-260 in the Unicode document register.

Unfortunately, WG2 delegates CJK encoding matters to its Ideographic Rapporteur Group (IRG), which I understand does not accept submissions from individuals, so unless China or Unicode adopts my proposal it is not likely to get anywhere very fast.

Update [2016-01-01]

Just over six years after I submitted the original proposal (N3695), 102 "Table 1" second stage simplified characters were included in the Unicode Consortium's submission for IRG Working Set 2015 (IRGN2091). The complete set of documents is available on the IRG website under "IRGN2091". IRG Working Set 2015 should eventually become CJK Unified Ideographs Extension G (in several years' time, probably in time for inclusion in Unicode 12.0 in 2019).

A UK submission of 1,640 characters for IRG Working Set 2015 is also available on the IRG website under "IRGN2107" (or on my website). Draft documents for a possible future IRG submission are available here.

Thursday, 1 October 2009

BabelMap Version

Coinciding with the release of Unicode 5.2 today (code charts), I am releasing a new version of BabelMap that supports Unicode 5.2 (download BabelMap now).

Unicode 5.2 adds 6,648 new characters and 15 new scripts (Table of Unicode scripts), including 1,071 basic Egyptian Hieroglyphs and 4,149 additional Han "ideographs" (taking the total number of "CJK Unified Ideographs" to 74,394, with another 220+ coming in the next version of Unicode). Unfortunately you will not be able to actually see any of the new characters until the appropriate Unicode fonts are designed and released (and in many cases, not even then if you are using Windows 7). I am maintaining a list of free fonts with Unicode 5.2 coverage at the bottom of my What's new in Unicode 5.2 ? post.

Creative Commons License
All screenshots of BabelMap on this page are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License (CC-BY-SA-3.0) by Andrew West.

The new version of BabelMap has a number of bug fixes and several new features :

  • The Han Radical lookup utility now covers all 74,394 unified ideographs;
  • A Cantonese phonetic lookup tool based on the Jyutping 粵拼 system has been added;
  • The Pinyin and Jyutping lookup utilities now enable you to filter results on Traditional characters or Simplified characters or both (this is not as straightforward as it may seem, and I still need to do some more work on the Simplified filter, which currently lets through many traditional form characters that have no corresponding simplified form);
  • The "ISO Comments" field, which is no longer populated, has been replaced by the "Formal Alias" field in the Character Properties dialog and the Advanced Search tool;
  • The character cells in the character grid now use the configured system colours by default, but the colours used can be manually configured from the new "Customize Colours..." menu (making use of some of the new controls in the MFC Feature Pack for Visual C++ 2008 ... but at the expense of an additional 1MB on the size of the executable);
  • A solid or outline image of the currently selected character can be copied to the clipboard by pressing F6 or Shift-F6 respectively (the colours and font size of the glyph image can be configured from the "Customize Colours..." menu);
  • Any or all of the glyphs in any font (selected by Unicode code points or by glyph IDs) can be exported to file in BMP, GIF, JPG or PNG format (font size and glyph colours are customizable) from the new "Export Font Glyphs..." menu item;
  • A screenshot of either the entire BabelMap window or just the contents of the edit buffer can be copied to the clipboard from the Edit menu (at present there is a problem with the full window screenshot under Vista, although there are no problems under XP, or with BabelPad under Vista).

Note: Under Windows 7 none of the new Unicode 5.2 characters in the Supplementary Multilingual Plane (Avestan, Egyptian Hieroglyphs, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Kaithi, Old South Arabian, Old Turkic, Enclosed Alphanumeric Supplement, Enclosed Ideographic Supplement and Rumi Numeral Symbols) display correctly in Notepad and most other Windows applications (they are all rendered as two .notdef glyphs). However, they do display correctly in Word 2007, and will also display correctly in the BabelMap character grid, but not in the BabelMap edit buffer. This is an issue with the version of Uniscribe that ships with Windows 7, which appears to support Unicode 5.1, but is not forward compatible with Unicode 5.2.
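
The "two .notdef glyphs" symptom is at least consistent with each supplementary-plane character being handled as its two separate UTF-16 code units rather than as a single character, although that is an assumption on my part and not something I have confirmed about Uniscribe. As a minimal sketch of how a code point beyond the BMP splits into a surrogate pair (the function name is purely illustrative):

```python
def utf16_code_units(code_point: int) -> list[int]:
    """Return the UTF-16 code unit(s) that encode one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # BMP characters fit in a single 16-bit code unit.
        return [code_point]
    # Supplementary characters are split into a high and a low surrogate.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


# U+13000 EGYPTIAN HIEROGLYPH A001, one of the new Unicode 5.2 characters
print([f"{unit:04X}" for unit in utf16_code_units(0x13000)])  # ['D80C', 'DC00']
```

Software that does not pair the two surrogates up again has nothing to look up in the font for either half, which would account for two missing-glyph boxes per character.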

Unfortunately I have not yet had the time to get round to fixing the number one feature request for BabelMap—an up-to-date Help System—but I am hoping to work on it soon, and will be able to release the new help file by the end of the year.

The new version of BabelPad supporting Unicode 5.2 is currently undergoing testing and bug fixing, and unless I get distracted by something else (which seems highly probable), it should be ready by the end of November (I haven't been distracted too much by other things, but unfortunately I have had to delay the release of BabelPad because of problems with displaying Unicode variation sequences using the version of Uniscribe that ships with Windows 7 [2009-11-27]). Anyway, when it is released I will announce it here.

BabelMap Version [2009-10-31]

This update incorporates the following additional features and bug fixes :

  • Fixes a bug whereby BabelMap crashed when sorting "Fonts Covering Selected Block" in the Configuration dialog if the list of fonts contained more than one font with exactly the same name (this can happen if you have the same font installed both as an independent font [e.g. batang.ttf] and also as part of a TrueType Collection [e.g. batang.ttc])
  • Fixes a bug whereby some of the radicals in the Han Radical Lookup Utility were displayed as the wrong character
  • Adds a new "Font Coverage" utility that lists all fonts that cover a particular character, or all the characters in a given piece of text, or all the characters in the BabelMap edit buffer

The Font Coverage utility lets you determine which fonts on your system cover a particular character or all the characters in a given piece of text
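
The idea behind such a coverage check can be sketched in a few lines of Python. This is only an illustration, not BabelMap's actual code: the function name, the font names and the coverage sets below are all made up, and a real tool would build the sets from each installed font's CMAP table.

```python
def fonts_covering(text: str, font_coverage: dict[str, set[int]]) -> list[str]:
    """List the fonts whose coverage includes every character in the text."""
    needed = {ord(ch) for ch in text}
    # A font qualifies only if the needed code points are a subset of its coverage.
    return sorted(name for name, covered in font_coverage.items() if needed <= covered)


# Hypothetical coverage data for three imaginary fonts
coverage = {
    "Font A": set(range(0x20, 0x7F)),
    "Font B": set(range(0x20, 0x7F)) | set(range(0x4E00, 0xA000)),
    "Font C": set(range(0x4E00, 0xA000)),
}

print(fonts_covering("abc", coverage))  # ['Font A', 'Font B']
print(fonts_covering("a中", coverage))   # ['Font B']
```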

BabelMap Version [2009-11-30]

This update incorporates the following additional features and bug fixes :

  • The Font Information dialog now includes character counts for all CMAP subtables listed for the font (subtable formats 2, 8, 13 and 14 have been newly implemented)
  • The Font Information dialog now includes a Copy CMAP Subtable button, which will copy to the clipboard a list of the character-glyph mappings given in the selected CMAP subtable (covers all platforms, encodings and subtable formats)
  • Now supports fonts that have Unicode type 3 or 4 CMAP encodings (Unicode 2.0+ BMP or Full Repertoire) but do not have any Windows CMAP encodings (Note that by design BabelMap does not support fonts with a Windows type 3 "PRC", 4 "Big5", 5 "Wansung" or 6 "Johab" CMAP encoding but no Unicode CMAP encoding)
  • Now supports Last Resort fonts that use the new Format 13 subtable format with a Unicode or Windows CMAP encoding (not that I have ever seen such a font)
  • A new VS button has been added to the character grid: this button will only be enabled if the currently selected font has a Format 14 CMAP subtable that covers the character with focus; pressing the VS button will open a dialog window that shows the variation sequences in the selected font for the selected character (Note that this feature will work on all supported versions of Windows (i.e. NT 4.0 and later), but only for fonts that include a Format 14 CMAP subtable for Unicode Variation Sequences, such as the Cambria Math and Microsoft PhagsPa fonts that ship with Windows 7; this feature will not work for fonts such as Code2000 and BabelStone Phags-pa Book that implement variation sequences using OpenType features)
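
For anyone curious about what the CMAP platform, encoding and subtable format numbers above refer to, here is a minimal Python sketch that lists the subtables declared in a raw 'cmap' table, following the OpenType layout (a header followed by encoding records that point at subtables, each starting with a format number). The function names and the synthetic sample table are my own for illustration, and the sketch only checks whether a Format 14 subtable is present, not whether it covers a particular character.

```python
import struct


def list_cmap_subtables(cmap: bytes) -> list[tuple[int, int, int]]:
    """Return (platformID, encodingID, format) for each encoding record."""
    _version, num_tables = struct.unpack_from(">HH", cmap, 0)
    subtables = []
    for i in range(num_tables):
        # Each 8-byte encoding record: platformID, encodingID, offset to subtable.
        platform_id, encoding_id, offset = struct.unpack_from(">HHI", cmap, 4 + 8 * i)
        # Every subtable begins with a 16-bit format number.
        (fmt,) = struct.unpack_from(">H", cmap, offset)
        subtables.append((platform_id, encoding_id, fmt))
    return subtables


def has_format_14(cmap: bytes) -> bool:
    """True if the table declares a Format 14 (variation sequences) subtable."""
    return any(fmt == 14 for _, _, fmt in list_cmap_subtables(cmap))


# Synthetic table for illustration: two records whose subtables are truncated
# to just their format fields (Format 14 and Format 4).
sample = (
    struct.pack(">HH", 0, 2)
    + struct.pack(">HHI", 0, 5, 20)   # Unicode platform, variation sequences
    + struct.pack(">HHI", 3, 1, 22)   # Windows platform, Unicode BMP
    + struct.pack(">H", 14)
    + struct.pack(">H", 4)
)

print(list_cmap_subtables(sample))  # [(0, 5, 14), (3, 1, 4)]
print(has_format_14(sample))        # True
```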

The Unicode Variation Sequences for U+2269 GREATER-THAN BUT NOT EQUAL TO in the Windows 7 Cambria Math font

BabelMap Version [2009-12-23]

This update fixes a bug that causes BabelMap to get an incorrect glyph index for certain characters in certain fonts, which affects the "Copy CMAP Subtable" function.

BabelMap Version [2009-12-31]

This update fixes a bug that causes BabelMap to crash if the tab key is pressed multiple times.

BabelMap Version [2010-01-02]

This update improves character search, and displays the ISO/IEC 6429 names for control characters in the character description (these used to be displayed, but were inadvertently dropped somewhere along the line).

BabelMap Version [2010-06-06]

This update adds the following features :

  • Supports up to 32 character bookmarks, accessible from the new 'Bookmarks' menu. Press the 'Insert' key to add a bookmark for the currently highlighted character, and press Ctrl+Del to delete any bookmark for the currently highlighted character.
  • Adds a new 'HTML' character mode for the edit buffer, which displays non-Basic Latin characters as HTML entities or NCR references.
  • Changes the behaviour of the UCN and NCR character modes to only display escape codes for non-Basic Latin characters.
  • Adds a new 'Alternative Representation' edit box next to the Character Name edit box, which displays the currently highlighted character as a decimal value, UCN escape code, NCR hexadecimal escape code, NCR decimal escape code, HTML entity (or NCR hexadecimal code), UTF-8 hexadecimal byte code, or UTF-8 octal byte code, as specified in the 'Options' menu.
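
To give a feel for what these representations look like, here is a rough Python sketch. The function names and the exact output formatting are assumptions for illustration and will not necessarily match what BabelMap displays.

```python
from html.entities import codepoint2name


def alternative_representations(ch: str) -> dict[str, str]:
    """Return several textual representations of a single character."""
    cp = ord(ch)
    utf8 = ch.encode("utf-8")
    ncr_hex = f"&#x{cp:X};"
    # Use a named HTML entity where one exists, else fall back to the hex NCR.
    entity = f"&{codepoint2name[cp]};" if cp in codepoint2name else ncr_hex
    return {
        "decimal": str(cp),
        "ucn": f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}",
        "ncr_hex": ncr_hex,
        "ncr_decimal": f"&#{cp};",
        "html": entity,
        "utf8_hex": " ".join(f"{b:02X}" for b in utf8),
        "utf8_octal": " ".join(f"{b:03o}" for b in utf8),
    }


def to_html_mode(text: str) -> str:
    """Escape only the characters outside Basic Latin (U+0000..U+007F)."""
    return "".join(
        ch if ord(ch) < 0x80 else alternative_representations(ch)["html"]
        for ch in text
    )


for label, value in alternative_representations("é").items():
    print(f"{label}: {value}")
print(to_html_mode("café"))  # caf&eacute;
```

For "é" (U+00E9) this gives 233, \u00E9, &#xE9;, &#233;, &eacute;, C3 A9 and 303 251 respectively.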

BabelMap Version [2010-06-09]

Refixes a bug whereby some of the radicals in the Han Radical Lookup Utility were displayed as the wrong character.

BabelMap Version [2010-06-18]

Fixes a bug with the display of text labels in the Options menu that affects some locales, and a bug with the block coverage statistics in the Composite Font Mappings dialog.

BabelMap Version BETA [2010-07-04]

Beta releases of BabelMap and BabelPad supporting Unicode 6.0 are now available for download.

Caveat: The Unicode properties in BabelMap/BabelPad are based on the latest versions of the Unicode 6.0 Beta data files, and although the data is unlikely to change substantially before the release of Unicode 6.0 in late September, some properties may be subject to change, and should not be relied on. However, character names and code points are fixed, and may be relied on.

As usual, please send any bug reports or feature requests to me (see my profile for my email address).