Saturday, 5 October 2013

BabelPad Version 6.3.0

Unicode 6.3 was released at the beginning of this week, and so I have released updated versions of BabelPad and BabelMap. There are no significant changes to BabelMap this time (although I am planning a makeover for BabelMap for version 7.0 next year), but I have spent a considerable amount of time working on a number of significant enhancements to BabelPad. As yet again I have not had time to implement the most requested feature (a working help system), I thought it best to describe these new features in a blog post. If you have any comments, questions or suggestions about BabelPad you may either comment on this post or post a question to the BabelStone forum.

Open Lines

You can now open part of a file by selecting the "Open Lines..." command from the File menu.

Open Lines...

When you select "Open Lines..." the standard file dialog will be opened, but after you choose the file to open a new dialog will be opened that allows you to specify which lines of the file to open. For very large files it may take a few seconds for this dialog to appear as it has to first parse the file to determine how many lines long the document is.

Open Lines Dialog
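
The two steps described above (count the lines first, then load only the requested range) can be sketched in Python. This is only an illustration of the idea, not BabelPad's code; the function names `count_lines` and `read_lines` are made up for the example.

```python
from itertools import islice

def count_lines(path):
    """Count lines by scanning the whole file once (the slow step for big files)."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)

def read_lines(path, first, last, encoding="utf-8"):
    """Return lines first..last (1-based, inclusive) without keeping the rest."""
    with open(path, "r", encoding=encoding, newline="") as f:
        # islice stops reading as soon as line `last` has been consumed
        return list(islice(f, first - 1, last))
```

The full scan in `count_lines` is why a very large file needs a moment before the range can be offered; reading the range itself stops as soon as the last requested line is reached.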

Manipulation of Tabular Columns

I work a lot with tabular data, and I frequently have to swap my data out of BabelPad and into Excel in order to sort or reorder the columns, and then swap it back into BabelPad to do other editing. In order to reduce my reliance on Excel I have now implemented support in BabelPad for manipulating tabular columns of text, delimited by tabs, commas or any user-specified character or string. For all operations described below, you need to first select one or more whole lines of text (i.e. the start and end points of the selection both have to be at the start of a line), then select the appropriate operation from the "Columns" submenu of the Edit menu. BabelPad will automatically detect if your columns are tab-delimited or comma-delimited (based on the first line of the selected block of text), but if you want to override the detected delimiter or specify a different column delimiter (either a single character or a text string) you may do so by checking the appropriate radio button (if you change the custom delimiter you must check the "other" radio button again for the delimiter to be applied).

Columns Submenu on the Edit Menu
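
As a rough illustration of first-line delimiter detection, here is a small Python sketch. It is not BabelPad's implementation: the names are hypothetical, and the rule that a tab wins over a comma when both occur is an assumption, since the post does not say how that case is resolved.

```python
def detect_delimiter(first_line, custom=None):
    """Guess the column delimiter from the first selected line.

    Assumption: a tab takes priority over a comma when both are present.
    """
    if custom:                 # user override (a character or a string)
        return custom
    if "\t" in first_line:
        return "\t"
    if "," in first_line:
        return ","
    return None                # no recognisable delimiter

def split_rows(text, delimiter):
    """Split a block of whole lines into rows of cells."""
    return [line.split(delimiter) for line in text.splitlines()]
```

For example, `split_rows("a,b\nc,d\n", detect_delimiter("a,b"))` gives `[["a", "b"], ["c", "d"]]`.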

Ordering Columns

This operation enables you to order any number of columns in the selected block of text, for example changing the order of columns A, B, C, D, E to D, A, E, C, B. Simply use the up and down buttons to order a selected column or a contiguous range of selected columns. When you are happy with the new order, press the "Order" button; or else press the "Cancel" button to cancel the operation.

Order Columns Dialog
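
In terms of data, reordering is just a permutation applied to every row. A minimal Python sketch of the A, B, C, D, E to D, A, E, C, B example (the function name is invented for illustration, and rows are assumed to have the same number of cells):

```python
def reorder_columns(rows, order):
    """Rearrange each row so that new column k holds old column order[k]."""
    return [[row[i] for i in order] for row in rows]

rows = [["A", "B", "C", "D", "E"],
        ["1", "2", "3", "4", "5"]]

# D, A, E, C, B expressed as 0-based indexes into the old order
reordered = reorder_columns(rows, [3, 0, 4, 2, 1])
```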

Cutting, Copying and Deleting Columns

These operations enable you to cut, copy or delete any number of columns in the selected block of text. Simply select one or more columns (they do not need to be contiguous), then press the "Cut", "Copy" or "Delete" button as appropriate. When cutting or copying multiple columns, each column will be separated by the delimiter character or string (if you are cutting or copying discontiguous columns there will only be a single delimiter character or string between the columns, regardless of how many columns apart they are).

Delete Columns Dialog
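
The behaviour for discontiguous columns can be illustrated with a short Python sketch (hypothetical helper names, not BabelPad's code). Copying joins only the selected cells, so exactly one delimiter appears between them however far apart the columns were; cutting is then a copy followed by a delete.

```python
def copy_columns(rows, selected, delimiter="\t"):
    """Return the selected columns as text, one line per row."""
    cols = sorted(selected)
    return "\n".join(
        # short rows are treated as having empty cells (an assumption)
        delimiter.join(row[i] if i < len(row) else "" for i in cols)
        for row in rows
    )

def delete_columns(rows, selected):
    """Return the rows with the selected columns removed."""
    drop = set(selected)
    return [[cell for i, cell in enumerate(row) if i not in drop] for row in rows]
```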

Pasting Columns

This operation enables you to insert any number of columns into the selected block of text at a particular column position. The text to be inserted does not need to have been cut or copied from another table, but may be any block of multiple lines of text. If the columns to be inserted are shorter (have fewer lines) than the selected block of text, the remaining lines will be filled with empty cells. You can choose to insert the column or columns before, over (i.e. replacing) or after any particular column.

Paste Columns Dialog
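
A sketch of the three paste modes and the padding rule, in Python. The function and parameter names are hypothetical, and what happens when the pasted text has more lines than the selection is not stated in the post; here the surplus lines are simply ignored.

```python
def paste_columns(rows, new_rows, position, mode="before"):
    """Insert the cells of new_rows at column `position` (0-based).

    mode is "before", "over" (replace that column) or "after".
    Missing lines and short lines in new_rows are padded with empty cells.
    """
    width = max((len(r) for r in new_rows), default=0)
    start = position + 1 if mode == "after" else position
    end = position + 1 if mode == "over" else start
    result = []
    for n, row in enumerate(rows):
        cells = list(new_rows[n]) if n < len(new_rows) else []
        cells += [""] * (width - len(cells))   # pad to the full pasted width
        result.append(row[:start] + cells + row[end:])
    return result
```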

Sorting Columns

This operation enables you to sort the selected block of text by the values of one or more columns. See below for the various types of sort that are currently supported by BabelPad. You may specify any number of sort levels, with each sort level using any type of sort (however, be aware that using multiple sort levels may significantly slow down the sort, depending on the data and types of sort involved). To specify the columns to sort on, move the column or columns from the left box to the right box by double-clicking or by clicking on the ">" button. To change the default sort type and/or select sort options double-click on the column in the right box. If the selected block of text includes column headers, check the "Do not sort first line" box.

Sort Columns Dialog
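
One common way to get multi-level sorting is to rely on a stable sort and apply the levels from least to most significant. The Python sketch below shows that approach; it is not necessarily how BabelPad does it, and the `levels` tuple format is invented for the example.

```python
def sort_rows(rows, levels, has_header=False):
    """Sort rows by several columns.

    levels: list of (column_index, key_function, descending) tuples,
    most significant level first.
    """
    header, body = (rows[:1], rows[1:]) if has_header else ([], rows)
    body = list(body)
    # list.sort is stable, so sorting by the last level first and the
    # first level last yields a correct multi-level order
    for col, key, descending in reversed(levels):
        body.sort(key=lambda r: key(r[col]), reverse=descending)
    return header + body

table = [["name", "n"], ["b", "2"], ["a", "10"], ["a", "9"]]
result = sort_rows(table, [(0, str, False), (1, int, False)], has_header=True)
```

Each extra level costs another full pass over the data, which is consistent with the warning that multiple sort levels can slow the sort down.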

Contextual Conversion

I have modified the Contextual Conversion dialog ("Contextual Conversion..." from the Edit menu; or Ctrl+Shift+X) to allow you to restrict the scope of any conversion operation to a specific column (at present only a single tab-delimited column can be selected). I have also added a Find and Replace conversion so that you can now, for example, replace all occurrences of "pig" with "cow" in column 3 of the selected block of text.

Contextual Conversion Dialog
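
The "pig" to "cow" example amounts to a find and replace that only touches one cell per row. A minimal Python sketch, using a hypothetical helper name and a 0-based column index (so column 3 is index 2):

```python
def replace_in_column(rows, column, old, new):
    """Replace `old` with `new` only in the given column of each row."""
    return [
        [cell.replace(old, new) if i == column else cell
         for i, cell in enumerate(row)]
        for row in rows
    ]

rows = [["pig", "x", "pig"], ["pig", "y", "big pig"]]
converted = replace_in_column(rows, 2, "pig", "cow")
```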


I have long resisted requests to add sorting functionality to BabelPad, even after one kind correspondent offered me the free use of their implementation of the Unicode Collation Algorithm (UCA). However, support for tabular columns would not be complete without sorting, so reluctantly (given the amount of time and effort required) I have now added the ability to sort lines and to sort tabular columns. I have also added commands to randomize lines and remove duplicate lines.

Sort Lines...

To sort whole lines of text, select the lines to sort (these must be a whole number of lines) and hit the "Sort Lines..." command from the Edit menu, which will open the "Sort Options" dialog. If you want to sort by column, select the lines to sort, open the "Sort Columns" dialog (see above), select the column or columns to sort by, and double-click on a selected column to open the Sort Options dialog.

Sort Options Dialog

The "Sort Options" dialog allows you to specify what type of sort you want to use. I have implemented various types of sort, and may add other types of sort in the future (e.g. CJK radical/stroke sort, and sorting CJK characters by pinyin reading):

  • Unicode Collation Algorithm: Implements the Unicode Collation Algorithm (UCA). The UCA collation is based on the Default Unicode Collation Element Table (DUCET), and in BabelPad you can either use the untailored DUCET for language-neutral collation or use the DUCET tailored for certain languages (as discussed below).
  • CLDR Collation Algorithm: Implements the CLDR Collation Algorithm, which is an extension of the Unicode Collation Algorithm. The CLDR collation is based on the CLDR Root Collation, which is a modification of the DUCET that puts script-common characters (whitespace, punctuation, general symbols, some numbers, currency symbols) before script-specific characters. In BabelPad you can either use the untailored root collation for language-neutral collation or use the root collation tailored for certain languages (as discussed below).
  • Windows Default Collation: Simply calls the MFC/ATL method CStringT::Collate (or CStringT::CollateNoCase if a case-insensitive sort is requested) for each sort comparison.
  • Unicode Code Point Sort: Sorts by the scalar value of the Unicode characters in the sort string.
  • Hexadecimal: Sorts by the hexadecimal value of the sort string.
  • Numeric: Sorts by the numeric value of the sort string. This should work well for decimal numbers in any script, and for Chinese ideographic numbers, but may not yet work correctly for complex non-decimal numbers (such as used in Cuneiform). And, of course, if the string to be sorted is not a number or combines numbers and text you may get unexpected results.
  • Length: Sorts by the length of the sort string in characters (that is characters, not bytes or code units).
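
The simpler sort types in the list above can be expressed as sort keys. The Python sketch below is only an approximation of the descriptions: the key function names are invented, and the numeric key handles decimal digits of any script but not Chinese ideographic or other non-decimal numbers.

```python
def code_point_key(s):
    """Order by Unicode scalar values, character by character."""
    return [ord(ch) for ch in s]

def hex_key(s):
    """Order by the value of the string read as a hexadecimal number."""
    return int(s, 16)

def numeric_key(s):
    """Order by numeric value; int() accepts decimal digits of any script."""
    return int(s)

def length_key(s):
    """Order by length in characters (code points), not bytes or code units."""
    return len(s)

by_code_point = sorted(["b", "\U00010082", "a"], key=code_point_key)
by_hex = sorted(["ff", "10", "9"], key=hex_key)
by_number = sorted(["10", "9", "\u0661\u0662"], key=numeric_key)  # last is Arabic-Indic 12
```

Note that U+10082 counts as one character for the length sort even though it takes two UTF-16 code units or four UTF-8 bytes.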

The Unicode Collation Algorithm and CLDR Collation Algorithm both have a default root collation table that is language-neutral, but their root collation tables can be tailored to support the specific collation requirements of any particular locale and/or language. At present BabelPad only supports tailored collation for a few languages, but I will consider supporting other language tailorings on request:

  • Old English (Runes)
  • Old English (Latin)
  • Spanish
  • Tibetan (Tibetan collation is quite complex, and any feedback on my implementation of Tibetan sorting is very welcome)
  • Welsh

The "Sort Options" section of the dialog shows the options that are available for the selected sort type. Most of the options are only applicable to the Unicode and CLDR collation algorithms, and some of them are rather esoteric and may not be comprehensible unless you have studied the specifications for the collation algorithms. One additional option that I have added is to limit the sort string to a specified number of characters ("Maximum number of characters to compare"). If you are sorting long lines of text you may only need to check the first few characters of each line to sort correctly, so limiting the number of characters to compare may improve the speed of the sort in some cases.

The check box at the bottom left corner of the "Sort Options" dialog allows you to define the currently selected sort options as the default for text sorts (not applicable to numeric, hexadecimal and text length sorts).

Conversion to/from UTF Code Units

I have added new functions to convert between Unicode characters and ASCII representations of UTF-8, UTF-16 and UTF-32 code units (e.g. convert U+10082 𐂂 to or from "F0 90 82 82" (UTF-8), "D800 DC82" (UTF-16) or "10082" (UTF-32)). When converting from code units to characters, any characters in curly braces will not be converted (and the braces dropped), so, for example, converting "D800 DC82 { = U+10082}" from UTF-16 code units will result in "𐂂 = U+10082". These conversions are also available from the right-click menu under the "Convert" submenu.

UTF Code Units Submenu on the Edit Menu
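
To make the conversions concrete, here is a Python sketch that reproduces the examples above. It is an illustration rather than BabelPad's code: the function names are hypothetical, and padding UTF-16 and UTF-32 units to at least four hex digits is an assumption based on the sample output.

```python
import re

WIDTH = {"utf-8": 1, "utf-16": 2, "utf-32": 4}            # bytes per code unit
CODEC = {"utf-8": "utf-8", "utf-16": "utf-16-be", "utf-32": "utf-32-be"}

def to_code_units(text, form):
    """Render text as space-separated hexadecimal code units."""
    data = text.encode(CODEC[form])
    w = WIDTH[form]
    units = [int.from_bytes(data[i:i + w], "big") for i in range(0, len(data), w)]
    digits = 2 if form == "utf-8" else 4
    return " ".join(f"{u:0{digits}X}" for u in units)

def from_code_units(text, form):
    """Convert hex code units back to characters; {braced} text is kept as-is."""
    w = WIDTH[form]
    out = []
    for literal, hexrun in re.findall(r"\{([^}]*)\}|([^{]+)", text):
        if hexrun:
            data = b"".join(int(tok, 16).to_bytes(w, "big") for tok in hexrun.split())
            out.append(data.decode(CODEC[form]))
        else:
            out.append(literal)          # braces dropped, content untouched
    return "".join(out)
```

With these definitions, `to_code_units("𐂂", "utf-16")` gives `"D800 DC82"` and `from_code_units("D800 DC82 { = U+10082}", "utf-16")` gives `"𐂂 = U+10082"`. Ill-formed input (for example a lone surrogate) raises an exception in this sketch.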

Insertion of new Bidirectional Control Characters

BabelPad supports Unicode 6.3 by allowing you to easily insert any of the five bidirectional control characters newly encoded in 6.3. From the Insert menu, click on the "Bidirectional Control Characters" submenu, and the five new characters are listed at the bottom of the submenu.

Bidirectional Control Characters Submenu on the Insert Menu
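
The post does not name the five characters; to my knowledge they are U+061C and U+2066 through U+2069. A small Python sketch that prints their names and Bidi_Class values, assuming a Python build whose `unicodedata` module covers Unicode 6.3 or later:

```python
import unicodedata

# The five characters added in Unicode 6.3: U+061C and U+2066..U+2069
NEW_BIDI_CONTROLS = ["\u061C", "\u2066", "\u2067", "\u2068", "\u2069"]

def describe(ch):
    """Format a character as code point, name and Bidi_Class."""
    return (f"U+{ord(ch):04X} {unicodedata.name(ch)} "
            f"(Bidi_Class={unicodedata.bidirectional(ch)})")

for ch in NEW_BIDI_CONTROLS:
    print(describe(ch))
```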

I have also improved the Variation Selectors submenu to allow insertion of all currently-used variation selectors for Ideographic Variation Sequences (VS17 through VS47).

Test Utilities

For this release of BabelPad I have done a lot of work on improving the testability of BabelPad's Unicode functionality and data. Prior to releasing a new version of BabelPad I run various tests, but in the past these have been mostly carried out manually, and can be quite tiresome to perform. I have now automated several key tests, and although they are intended for my personal use it is not inconceivable that some users might find them helpful, so I have exposed them publicly under the "Test Utilities" submenu in the Tools menu.

Test Utilities Submenu on the Tools Menu

[The Generate UCD Data and Generate Full UCD XML Data functions are also available in BabelMap.]

Generate Core UCD Data

The utility to generate the core UCD data produced for each version of Unicode (UnicodeData.txt) has been available in BabelMap for some years, but I have now added it to BabelPad, under the Test Utilities submenu. It generates an on-screen listing of all rows of the core UCD data for any given version of Unicode.

Generate Core UCD Data Utility

I used to run this for each major version of Unicode, and copy the on-screen listing to BabelPad (copying automatically inserts semi-colon field separators), then save to file (with LF instead of CR/LF) and use WinDiff to compare it with the actual UnicodeData.txt file for that version of Unicode. That was rather tiresome, so I have now added the ability to automatically generate the data for all versions of Unicode and save as individual files in a specified directory. Now all I need to do is press the "Save All..." button, go away and make a cup of tea, then come back and use WinDiff to do a directory compare between the directory where the generated files have been saved to and the directory where the original UnicodeData.txt files are stored.

Comparison of Original and Generated Unicode Data

As you can see from the above screenshot, version 1.1.5 fails the comparison, but this is expected as the original file includes some prefatory blurb before the data rows, has an unexpected blank line after U+FD74, and has some unexpected spaces in the decomposition description of five characters.
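
The directory comparison step could also be scripted. The Python sketch below uses the standard `filecmp` module as a stand-in for WinDiff; the function name and the assumption that generated and original files share the same file names are mine, not something the post specifies.

```python
import filecmp
import os

def compare_generated(generated_dir, original_dir):
    """Byte-compare every file in original_dir with its namesake in generated_dir."""
    names = sorted(os.listdir(original_dir))
    same, different, missing = filecmp.cmpfiles(
        generated_dir, original_dir, names, shallow=False)
    # "missing" holds names that could not be compared, e.g. not generated
    return {"same": same, "different": different, "missing": missing}
```

A known exception such as version 1.1.5 would show up under "different" and would still need to be inspected by hand.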

Generate Full UCD XML Data

The Generate Core UCD Data tool allows me to validate BabelPad's current and historical core Unicode character properties, but it does not cover many of the other character properties that are used in BabelPad and BabelMap. To ensure that all Unicode data used in BabelPad is correct, I have added a new tool to generate full Unicode data (excluding Unihan properties) in XML format, exactly matching the data provided in the non-Unihan, flat-format XML version of UCD data (ucd.nounihan.flat.xml, available as a zip). The official XML/UCD data includes 100 properties for each character, many of which are not currently needed for BabelPad, and so writing the tool required the addition of quite a few new functions to produce all the properties, which in the end took a lot longer than I anticipated.

UCD XML Data Generated by BabelPad

Running WinDiff between BabelPad's generated XML/UCD data and the official XML/UCD file shows that the only difference between the two files is the comment that BabelPad adds to the top of the generated document. Originally, after I had completed the utility and ironed out all of the bugs in my code, there were still a number of discrepancies between my generated file and the 6.3 beta version of ucd.nounihan.flat.xml, so I reported the various unexpected idiosyncrasies and apparently incorrect data, and I am pleased to say that the XML/UCD data files were quickly updated to fix the reported defects before 6.3 was released.

Run Normalization Tests

BabelPad supports conversion of text into any of the four standard normalization forms (NFD, NFC, NFKD, and NFKC), and before each new release of BabelPad I use the normalization test file produced by Unicode (NormalizationTest.txt) to validate BabelPad's normalization functionality. This used to be quite time-consuming and troublesome as I would have to manually extract the five columns of data from the test file, run each of the four normalization functions on each of the five columns of data, then run WinDiff on the twenty output files. I finally became fed up with this approach, and have now added a single function that will read the input file, then perform the required normalizations and comparisons on each line of the file, and report the result at the end. Much better!

Normalization Test Output
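
The per-line check can be sketched in Python with the standard `unicodedata` module standing in for BabelPad's normalization functions. The twenty comparisons are the conformance invariants stated in the header of NormalizationTest.txt for columns c1 to c5; the function names are invented for the example.

```python
import unicodedata

def parse_line(line):
    """Return the five columns of a test line as strings, or None for
    comments, blank lines and @Part headers."""
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("@"):
        return None
    fields = line.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]

def check_line(cols):
    """Return a list of (form, source) pairs that failed for one test line."""
    c1, c2, c3, c4, c5 = cols
    checks = {
        "NFC":  [(c2, [c1, c2, c3]), (c4, [c4, c5])],
        "NFD":  [(c3, [c1, c2, c3]), (c5, [c4, c5])],
        "NFKC": [(c4, cols)],
        "NFKD": [(c5, cols)],
    }
    failures = []
    for form, groups in checks.items():
        for expected, sources in groups:
            failures += [(form, s) for s in sources
                         if unicodedata.normalize(form, s) != expected]
    return failures
```

Running `check_line` over every parsed line of the file and counting the failures gives a single pass/fail report. Because `unicodedata` follows the Unicode version bundled with the Python build, a test file from a different Unicode version may report spurious failures.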

Run Unicode Collation Algorithm Tests

As I discuss elsewhere in this post, I have now implemented the Unicode Collation Algorithm and the CLDR Collation Algorithm for sorting in BabelPad. As part of my implementation I added a utility that runs either the UCA test files or the CLDR test files (available under common\uca), using either the "non-ignorable" or "shifted" test file in both cases, and reports the results.

Unicode Collation Algorithm Test Outputs for DUCET

Unicode Collation Algorithm Test Outputs for CLDR

As can be seen, the CLDR tests pass, but the DUCET test for shifted fails for two lines (in fact the DUCET test for non-ignorable only accidentally passes, as BabelPad produces different sort keys than the test expects in the places that the shifted test fails). I believe that the DUCET test files are faulty, and give incorrect sort keys for eight lines of the test relating to Tibetan characters (I reported this before 6.3 was released).


mythoman said...

I love what BabelPad offers me right now, but I have no utility that performs correct collation for Hànyǔ Pīnyīn, so this is what I would like to submit for consideration.

The algorithm should handle the following cases in the way indicated:

shī < shí < shǐ < shì < shi
lù < lǘ < lǚ < lǜ < luán
xiàn < Xī’ān < xiāng

m̄ < ḿ < m̀ < mā
ń < ň < ǹ < nā
è < e < ê̄ < ế < ề < ēi

’ = ' (U+2019 = U+0027)
ɑ = a (U+0251 = U+0061)
ɡ = g (U+0261 = U+0067)

ŋ́ = ńg
ĉ = ch
ŝ = sh
Ẑ = Zh

Andrew West said...

Thanks for the suggestion. I agree that pinyin collation would be very useful to many people, and I will implement it for the next release (some time next year).

mythoman said...

I wrote to the Unicode discussion list:

“Pinyin orthography changed last year. Rule 7.3 of GB/T 16159-2012 stipulates that a preceding dot (probably U+00B7 or U+2022) be used to indicate neutral tone in dictionaries, as had been common practice among many dictionary makers anyway. When there is alternation between neutral and another tone two tone marks may be used simultaneously, as in zhī·dào (知道).”

As Eric Muller mentions, dot above was also used by some to indicate neutral tone, and let me add that others used ring above for this purpose.

For the purpose of collation it is probably best to ignore neutral tone marks altogether.

Yuri da Silva said...

Dear West,

I want to port to Linux, could you provide the source code? (BabelMap is CC, like GNU GPL).

Andrew West said...

Hi Yuri, thank you for your request. However, BabelPad and BabelMap are not open source, and neither is licensed under a CC license. I have no plans to release the source code for BabelMap/BabelPad, and do not believe it would be possible to port them to Linux anyway.