Saturday, 26 May 2012

What's new in Unicode 6.2 ?

The answer to the question "What's new in Unicode 6.2 ?" is rather short :

  • U+20BA ₺ TURKISH LIRA SIGN


Yep, that's it, just a single new character. The Unicode Technical Committee (UTC) decided earlier this month to fast track the encoding of the recently announced currency symbol, as it had previously done with the newly invented Indian Rupee Sign ₹ (U+20B9, added to Unicode 6.0 in 2010) and the Euro Sign € (U+20AC, one of only two characters added to Unicode 2.1 in 1998 [kudos to anyone who knows what the other character was, and a special prize to anyone who has ever had cause to use it]). However, whereas the Indian Rupee Sign was fast tracked into an already scheduled release, the Turkish Lira Sign has the dubious honour of being the first ever character to be given an entirely new version of Unicode all to itself, Unicode 6.2, which will probably be released in late September or early October 2012. This also means that 2012 will be the first ever year during which more than one major or minor version of Unicode has been released.

Unicode releases are normally coordinated with publications of new editions or amendments to the corresponding international standard, ISO/IEC 10646 (see Unicode and ISO/IEC 10646 for details of the relationship between these two standards), but the next amendment to ISO/IEC 10646:2012 (i.e. Amendment 1, covering Linear A, Palmyrene, Manichaean, Khojki, Khudawadi, Bassa Vah, Duployan, and additional Wingdings symbols) isn't scheduled to start its final ballot until the end of this year, so a version of Unicode corresponding to Amendment 1 could not be released until spring 2013. In order to meet expected demand to use the newly devised currency sign as soon as possible, the UTC therefore decided not to wait until the next anticipated version of Unicode next year, but instead to release a new version especially for the Turkish Lira Sign, on the assumption that the character is uncontroversial and will be accepted into ISO/IEC 10646 anyway. Of course this puts the ISO committee (WG2) in a slightly awkward position, as the ISO/IEC 10646 and Unicode repertoires need to be identical (and preferably synchronised), but Unicode 6.2 will probably be published before the committee even has a chance to discuss the proposal for the first time at its next meeting in October, and so faced with a fait accompli by the UTC it will have to accept the Turkish Lira Sign into ISO/IEC 10646 at the earliest opportunity regardless of what individual national body members of the committee may think of the new currency symbol.
And as the UTC is looking into ways of making quicker releases of Unicode in response to industry demand to encode urgent-use characters, perhaps we will see more intercalary releases of Unicode with only one or two character additions in the future (there are probably some people who are looking forward to an accelerated release of Unicode 6.3 to meet the demand for the New Greek Drachma Sign, but that might be more controversial given the existence of the unused and unloved Drachma Sign ₯ at U+20AF [not to be confused with the ancient Greek Drachma Sign 𐅻 at U+1017B]).

The broader Unicode community did not all agree with the assessment that this was an uncontroversial addition, and a tsunami of emails has engulfed the Unicode mailing list since the initial announcement on 15 May. I don't want to be drawn into this futile argument, but if you want to start using the Turkish Lira Sign today, you can, as it is already included in Michael Everson's free Rupakara font. And if you are eager to take a closer look at Unicode 6.2, then I have just released beta versions of BabelPad and BabelMap that support Unicode 6.2 (NB the Unicode 6.2 data incorporated into BabelMap and BabelPad is provisional and subject to change before Unicode 6.2 is officially released, and so should not yet be relied on).

What Else ?

What else can we say about Unicode 6.2 ? Well, U+0709 ܉ SYRIAC SUBLINEAR COLON SKEWED RIGHT is getting a new formal alias: SYRIAC SUBLINEAR COLON SKEWED LEFT; U+1240F 𒐏 CUNEIFORM NUMERIC SIGN FOUR U through U+12414 𒐔 CUNEIFORM NUMERIC SIGN NINE U are having their numeric values changed from '4' through '9' to '40' through '90'; and U+065F ARABIC WAVY HAMZA BELOW is moving from inherited script to Arabic script. On a more practical point, the Unicode 6.2 code charts will for the first time show variation sequences, which are now growing in number at a startling rate.
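
For anyone who wants to check property data like this on their own machine, the following is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` is purely illustrative, and the values reported depend on the version of the Unicode Character Database bundled with the Python build (shown by `unicodedata.unidata_version`), so no expected output is given for the cuneiform signs.

```python
import unicodedata

def describe(ch):
    """Return (code point, name, numeric value or None) for one character."""
    name = unicodedata.name(ch, "<unnamed>")
    numeric = unicodedata.numeric(ch, None)
    return "U+%04X" % ord(ch), name, numeric

print("Unicode data version:", unicodedata.unidata_version)

# U+1240F..U+12414: the six cuneiform numeric signs mentioned above.
for cp in range(0x1240F, 0x12415):
    print(describe(chr(cp)))

# Formal aliases can be used for lookup in Python 3.3 and later; a KeyError
# means the bundled data predates the alias.
try:
    print(describe(unicodedata.lookup("SYRIAC SUBLINEAR COLON SKEWED LEFT")))
except KeyError:
    print("alias not known to this Python build")
```

Note that the primary name returned for U+0709 stays SKEWED RIGHT, because Unicode character names never change once published; the alias only adds a second way to refer to the character.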

On Beyond 6.2

The main side effect of this special release of Unicode 6.2 will be to push back the release date of the version of Unicode synchronised with ISO/IEC 10646:2012 Amendment 1, which was originally anticipated for next spring. It is now probable that the next version of Unicode (shall we call it Unicode 7.0?) will be synchronised with ISO/IEC 10646:2012 Amendments 1 and 2, and will not be released until early 2014. I will blog about the contents of Unicode 7.0 in October this year.

In the meantime, it is probable that an "update version" of Unicode (i.e. Unicode 6.2.1), which includes any required changes to character properties and updates to the standard annexes, but which does not include any changes to the character repertoire, will be released in spring 2013. Unicode 6.2.1 will include the addition of 1,002 standardized variants for CJK Unified Ideographs, corresponding to CJK Compatibility Ideographs, as an alternative, round-trippable mechanism for representing compatibility ideographs. I suspect that this will confuse the hell out of implementations that assumed that variation sequences for CJK Unified Ideographs only ever used Variation Selectors 17 through 256, and that VS1 through VS16 were only used for variation sequences that did not feature Han ideographs.
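
To make the distinction between the two ranges of variation selectors concrete, here is a rough sketch in Python. VS1 through VS16 occupy U+FE00..U+FE0F and VS17 through VS256 occupy U+E0100..U+E01EF; the function names are hypothetical, and testing for a unified ideograph by its algorithmic name prefix is an assumption that is good enough for illustration only.

```python
import unicodedata

def vs_number(ch):
    """Return n for VARIATION SELECTOR-n (1..256), or None for other characters."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:        # VS1..VS16 (BMP)
        return cp - 0xFE00 + 1
    if 0xE0100 <= cp <= 0xE01EF:      # VS17..VS256 (Plane 14)
        return cp - 0xE0100 + 17
    return None

def is_unified_ideograph(ch):
    # Assumption: the algorithmic name prefix identifies unified ideographs.
    return unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH-")

def han_variation_sequences(text):
    """Yield (base, n) for each unified ideograph directly followed by VSn."""
    for base, nxt in zip(text, text[1:]):
        n = vs_number(nxt)
        if n is not None and is_unified_ideograph(base):
            yield base, n

sample = "\u4FAE\uFE00 \u845B\U000E0100"
for base, n in han_variation_sequences(sample):
    kind = "VS1-VS16" if n <= 16 else "VS17-VS256"
    print("U+%04X + VS%d (%s)" % (ord(base), n, kind))
# prints:
# U+4FAE + VS1 (VS1-VS16)
# U+845B + VS17 (VS17-VS256)
```

Code that only checks the VS17 through VS256 range when it sees a Han base character would miss the first sequence in the sample.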

[Update (2013-10-01): In fact Unicode 6.2.1 turned into Unicode 6.3, which was released on the last day of September 2013; and Unicode 7.0 is probably delayed until the second half of 2014.]

Wednesday, 23 May 2012

BabelPad and BabelMap Version 6.1.0

After a delay of nearly four months, I have finally released new versions of BabelPad and BabelMap that support Unicode 6.1. Many thanks to everyone who has suggested new features, reported bugs, and helped test the beta releases of BabelPad and BabelMap.

  • BabelPad (simply unzip the file BabelPad.exe and run it from wherever you like)
  • BabelMap (simply unzip the file BabelMap.exe and run it from wherever you like)

This screenshot of BabelMap is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License (CC-BY-SA-3.0) by Andrew West.

There have been a large number of bug fixes and enhancements to both BabelPad and BabelMap. In particular, BabelMap has been significantly improved by the implementation of a custom edit buffer that supports the user-defined virtual composite font, drag-and-drop editing and rich text (RTF) copy. This edit buffer, which is also used in the CJK/Yi character input utilities, is in effect a lightweight version of BabelPad, and allows text from multiple Unicode blocks to be rendered using different fonts at the same time. This should be a very big improvement on previous versions of BabelMap, which only applied a single font to the edit buffer, frequently resulting in little white boxes when the edit buffer contained text from multiple Unicode blocks. Another useful new feature in BabelMap is the "Locate Character" function. If you come across a character you do not recognise on the internet or in a document, simply paste it into the BabelMap edit buffer and hit F2, and the character will be immediately located in the character grid, showing you the code point and Unicode name of the unknown character in the status bar beneath the character grid (as well as any readings if it is a CJK character). For strings of text, put the caret in front of the character to locate before hitting F2.
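
The idea behind "Locate Character" (take the character after the caret and report its code point and name) can be approximated in a few lines of Python. This is only an illustrative sketch using the standard `unicodedata` module, not a description of how BabelMap does it; the function name `locate` and its `caret` parameter are hypothetical.

```python
import unicodedata

def locate(text, caret=0):
    """Describe the character immediately after the caret position."""
    if not 0 <= caret < len(text):
        raise IndexError("caret must sit in front of a character")
    ch = text[caret]
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<no name>"))

print(locate("\u20B9"))            # U+20B9 INDIAN RUPEE SIGN
print(locate("a\u20ACb", caret=1)) # U+20AC EURO SIGN
```

The name reported depends on the Unicode data bundled with the Python build, so a character newer than that data will show up as `<no name>`.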

Various minor bug fixes and enhancements have been made to BabelPad, including improvements to CJK conversion between simplified and traditional characters, and CJK conversion to Mandarin pinyin readings (now accessible as Ctrl+Shift+P), as well as the addition of a new function to convert between old style and new style Japanese kanji.

BabelPad Enhancements

  • Unicode Support: Updated to support Unicode 6.1 (110,116 graphic and format characters in 217 blocks covering 100 scripts).
  • Rich Text Copy: New Rich Text Format (RTF) copy function (Edit: Cut and Paste: Rich Text Copy from the menu, or Ctrl+Shift+K) that allows you to copy font-linked text (based on either the currently selected single font or the user-defined composite font) from BabelPad and paste it into applications such as Word and WordPad.
  • Synchronize Settings: New function to synchronize the settings (font configuration, layout options, etc.) on all currently running instances of BabelPad (Window: Synchronize Settings from the menu).
  • List Undefined Glyphs: New function to copy a list of all characters in the current document which have undefined glyphs in the currently selected font (Tools: List Undefined Glyphs from the menu). This function is only available when BabelPad is configured to use a single font.
  • Character Encodings: Now supports opening files encoded using the North Korean standard, KPS 9566-2003.
  • File Open: Improved auto-detect of file encodings.
  • CJK Mandarin Readings: Mandarin pinyin readings have been completely replaced by a new set of readings based on Xiàndài Hànyǔ Cídiǎn 现代汉语词典, supplemented by readings from Hànyǔ Dà Zìdiǎn 汉语大字典 (there are too many obscure readings in HDZ, so it is only used for characters not in XHC), with readings approximately ordered by frequency.
  • CJK Pinyin Conversion: Improvements to the CJK pinyin conversion routine (Ctrl+Shift+P) to support contextual conversion of characters with multiple readings, and only output a single reading for each character converted (a new function has also been added to convert each character to a list of all readings in frequency order, similar to the original implementation). There is still much more work to be done on this feature, and some common characters with multiple readings are not yet dealt with contextually, but it should be a big improvement on the original pinyin conversion function.
  • CJK Traditional-Simplified Conversion: Improvements to the CJK traditional-to/from-simplified conversion routines, in particular contextual conversion of characters which do not have a simple one-to-one mapping between simplified and traditional forms.
  • CJK Old-New Form Kanji Conversion: Addition of functions to convert between old form Kanji (kyūjitai 舊字體) and new form Kanji (shinjitai 新字体).
  • Normalization: New shortcuts for converting texts to normalization forms NFC (Ctrl+Shift+C) and NFD (Ctrl+Shift+D).
  • Status Bar: Status bar panels have been reordered so that the code point and character name panel is on the left, in order that code point and character name will still be visible on reduced size windows. Informal character aliases are now appended to the character name where available to help with the identification of characters.
  • Find and Replace Dialog: Unicode escape sequence preference is now remembered within and between sessions; the "not found" dialog can now be cancelled by hitting escape.
  • Auto Number Dialog: Auto number configuration is remembered within sessions.
  • Font Options: Font options and settings are now in a separate Fonts sub-menu.
  • Printing: Now fixed (was broken in version 6.0.0.n).
  • Font Quality: New "Font Quality" option under the Fonts sub-menu, allowing the user to select the font rendering quality (previous versions of BabelPad used the default quality, which is still recommended for normal use) :
    • Default (this uses the font smoothing setting that has been set at the system level in the control panel, and depending upon that setting will result in non-antialiased, antialiased or ClearType quality)
    • Draft (on my test machines this results in the same quality as default)
    • Proof (on my test machines this results in the same quality as default)
    • Non-antialiased (disables any font-smoothing set at the system level; i.e. the same as default quality if font smoothing is disabled at the system level; this setting almost always results in the poorest quality)
    • Antialiased (enables standard font-smoothing; i.e. the same as default quality if standard font smoothing is enabled at the system level; at large font sizes [approximately greater than 24 points] this may result in better quality than ClearType font smoothing, but at huge font sizes [approximately greater than 120 points] antialiasing will be automatically disabled, and the result will be the same as for non-antialiased)
    • ClearType (enables ClearType font-smoothing for LCD screens; i.e. the same as default quality if ClearType font smoothing is enabled at the system level)
    • ClearType natural (same as ClearType but may result in slightly different character widths)
  • Bold and Italic: Bold and/or Italic style can now be applied to the current document (from the Fonts menu, or the B and I buttons on the main toolbar). This feature is not available when the "List all Styles of Fonts" option is selected from the Fonts menu.
  • Long S: Casing rules for long ſ (Options: Casing Rules: Long S from the menu) have been improved.
  • Tables: For dialogs with copyable tables, the columns can now be drag-reordered by the user, and when the table is copied to the clipboard the layout of the data will reflect the user-defined column order.
  • Layout: The vertical scroll bar is now displayed on the left side of the window when in RTL layout.
  • Rendering: A number of miscellaneous tweaks and improvements to text rendering have been implemented, including reducing screen flicker when performing conversion operations on selected text, changing the cursor to a pointer when it moves over selected text, and preserving selection direction when copying or moving text.
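
As a reminder of what the NFC and NFD conversions in the Normalization item above actually do to a text, here is a small sketch using Python's standard `unicodedata.normalize`. The wrapper names `to_nfc` and `to_nfd` are hypothetical, and this says nothing about how BabelPad implements the feature.

```python
import unicodedata

def to_nfc(text):
    """Canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", text)

def to_nfd(text):
    """Canonical decomposition only."""
    return unicodedata.normalize("NFD", text)

composed = "\u00E9"       # e-acute as a single precomposed code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

print([hex(ord(c)) for c in to_nfc(decomposed)])  # ['0xe9']
print([hex(ord(c)) for c in to_nfd(composed)])    # ['0x65', '0x301']
```

Both strings look identical on screen, which is why an explicit conversion is useful before comparing or searching text.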

BabelMap Enhancements

  • Unicode Support: Updated to support Unicode 6.1 (110,116 graphic and format characters in 217 blocks covering 100 scripts).
  • Ideographic Variation Database: Updated to support version 2012-03-02 of the Ideographic Variation Database (8,882 additional Adobe-Japan1 and Hanyo-Denshi sequences, making a total of 27,724 ideographic variation sequences).
  • Edit Buffer: All new edit buffer that supports display using either a single font or the user-defined composite font. The edit buffer supports drag-and-drop editing, and provides the ability to convert alphabetic text between upper, lower and title case (Case from the right-click menu, or Ctrl+U, Ctrl+L and Ctrl+T); the ability to convert text between normalization forms (Normalization from the right-click menu); and the ability to change the font size of the edit buffer using the mouse wheel whilst pressing the Ctrl key. Note that the edit buffer is limited to displaying 64 display lines of text, and if you paste in text longer than the limit, only the first 64 display lines will be displayed (the actual amount of text displayed will vary depending upon the selected font size). The edit buffer is not intended for editing large amounts of text, and if you do paste large documents into the buffer you may notice a significant deterioration in performance.
  • Character Locate: Paste a character into the edit buffer and select "Locate Character" from the "Find" menu (or simply hit F2) to locate its position in the character grid. If the edit buffer contains more than a single character, put the caret in front of the character you wish to locate.
  • Han and Yi Lookup Utilities: Same new edit buffer as described above, but with ability to convert between traditional and simplified Chinese characters in the Han lookup utilities (from the right-click menu, or Ctrl+F and Ctrl+J).
  • Rich Text Copy: Text copied from the main edit buffer or any of the lookup utility edit buffers will be put on the clipboard in both plain text and Rich Text Format (RTF), so font linking will be preserved when pasted into applications such as Word and WordPad.
  • Character Name Search: You can now use the special characters ^ and $ to match the start and/or end of character names in the "Search for Character by Name" box (e.g. "^BUS$" will only match U+1F68C BUS; "^BUS" will match U+1F68F BUS STOP and all character names starting with BUS; "BUS$" will match U+1F68D ONCOMING BUS and all character names ending in BUS; and "BUS" will match all character names containing BUS anywhere).
  • Unicode Summary Utility: The "Unicode Summary" tool has been replaced by separate "Unicode Scripts", "Unicode Planes", and "Unicode Blocks" tools.
  • Name Aliases: Character name display and search now attempt to support the large and confusing set of formal name aliases newly defined in Unicode 6.1.
  • Font Quality: New "Font Quality" option under the Fonts sub-menu, allowing the user to select the font rendering quality (previous versions of BabelMap used the default quality, which is still recommended for normal use). See under BabelPad, above, for details of the font quality options.
  • Navigation Shortcuts: The following navigation shortcuts have been implemented: Ctrl+E sets the focus on the edit buffer; Ctrl+F sets the focus on the search (find) box; Ctrl+G sets the focus on the go to box; and Ctrl+M sets the focus on the character map grid.
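
The anchored name search described under Character Name Search above behaves much like a regular expression search over character names. The sketch below reproduces the BUS examples in Python with the standard `re` and `unicodedata` modules; note that it accepts full regular expression syntax, which is an assumption that goes beyond the two special characters BabelMap documents, and the function name `search_names` is hypothetical.

```python
import re
import sys
import unicodedata

def search_names(pattern):
    """Return (code point, name) pairs whose character name matches pattern."""
    rx = re.compile(pattern)
    hits = []
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), None)   # None for unnamed code points
        if name is not None and rx.search(name):
            hits.append((cp, name))
    return hits

for cp, name in search_names("^BUS$"):
    print("U+%04X %s" % (cp, name))   # U+1F68C BUS
```

Scanning every code point is slow but simple; a real tool would presumably index the names once rather than rescanning on every search.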

Future Enhancements to BabelPad

The following are some features that I would like to implement in future versions of BabelPad, roughly in order of priority:

  • User-defined macros (record and play sequences of key strokes and commands)
  • Option for normalization-folding in find/replace (matching canonical equivalent or compatibility equivalent text strings)
  • CJK character composition from IDS sequences
  • Support for tab-separated tables (allowing the ability to select, copy, move, delete and sort tab-separated columns of text)
  • Mojibake repair function
  • Support for Unicode regular expressions in find/replace
  • Unicode collation
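
To illustrate what normalization-folding in find/replace would mean, here is a very rough sketch in Python: both the search string and the text are normalized to the same form before matching. The function name `folded_find` is hypothetical, and this is only one naive way to approach the problem, not a plan for how BabelPad would implement it.

```python
import unicodedata

def folded_find(haystack, needle, form="NFD"):
    """Return True if needle occurs in haystack once both are normalized.

    NFD or NFC matches canonically equivalent strings; NFKD or NFKC also
    matches compatibility equivalent strings.
    """
    def norm(s):
        return unicodedata.normalize(form, s)
    return norm(needle) in norm(haystack)

print(folded_find("caf\u00E9", "cafe\u0301"))          # True
print(folded_find("\uFB01ne", "fine"))                 # False
print(folded_find("\uFB01ne", "fine", form="NFKD"))    # True
```

The sketch returns only a yes/no answer because character offsets shift after normalization, and a plain substring test on decomposed text will also match a bare "e" inside "é"; a real find/replace would need to map matches back to the original text and respect combining sequence boundaries.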

BabelStone Forum

I have recently set up a Google Groups forum at!forum/babelstone where you are welcome to suggest new features, report bugs and ask questions about BabelPad and BabelMap, as well as other BabelStone software and fonts.