Sunday, 30 July 2006

What's That ?

N3027 ("Proposal to add medievalist characters to the UCS") proposes to encode a wide range of abbreviation letters used in medieval manuscripts and early printed books. At present many early texts cannot be transcribed into Unicode because special abbreviation letters are so common in them, so I am very pleased to see that these letters are finally being encoded. However, there is one proposed letter that I have a little quibble with. According to N3027 :

LATIN LETTER THORN WITH STROKE is used for Old Norse þat, þess, þor-, þæt (Figures 29, 32, 33, 40, 73, 79).

Which is true enough as far as it goes, but I suspect that most of my readers will be more familiar with the letter thorn with a stroke through the ascender in the context of Old English, where it is the ubiquitous abbreviation for þæt (and unlike Old Norse, only þæt).

It is an odd thing about the proposal that Latin, Old Norse, Irish, Welsh and even Cornish are frequently cited as languages using a particular proposed character, but Old English is only cited for a single character (COMBINING DOUBLE CIRCUMFLEX ABOVE, which is an editorial mark used in some editions of Old English poetry), and there are only two other mentions of Old English in the entire 51 pages of the document. Yet quite a few of the proposed characters are applicable to Old English (three of them primarily used for OE), and six of the examples provided are actually of Old English text (figs. 29, 30, 31, 37, 39, 40). In fact two of the six examples cited for LATIN LETTER THORN WITH STROKE are Old English, contrary to what the casual reader might assume.

Not only does the proposal not mention Old English in relation to the proposed LATIN LETTER THORN WITH STROKE, but it omits the crucial piece of information that the glyph forms of Old Norse and Old English letter thorn with stroke are quite different from each other. The Old Norse form has a short horizontal stroke through the ascender, whereas the Old English form has a longer diagonal stroke through the ascender. This difference can be seen in the examples given in N3027, where the Old Norse examples (figs. 32, 33, 73 and 79) all use the former letterform and the Old English examples (figs. 29 and 40) both use the latter letterform. Although the examples show that Old Norse and Old English use distinct glyph forms, the text of the proposal never mentions that this character occurs in two distinct glyph forms, which I think is an important detail that should have been made explicit.

The following is an example of an early 12th century Old Norse manuscript :

Elucidarius (AM 674a folio 17r)

Thorn with stroke on lines 2 and 7

And this is an example of an Old English manuscript dated to about the year 1000 :

The Cædmon Manuscript [part of the Old English verse rendition of Genesis] (Bodleian Junius MS 11 folio 14)

Thorn with stroke on lines 3 and 11

These manuscripts exemplify the differences between the Old Norse and Old English forms of thorn with stroke. This difference is preserved in most modern typeset editions, with editions of Old Norse texts normally using a short horizontal stroke, and editions of Old English texts normally using a longer diagonal stroke. The following are a few examples that show the Old English form of the letter (see N3027 figs. 32, 33, 73 and 79 for some ON examples) :

Plummer and Earle, Two of the Saxon Chronicles Parallel (Oxford: Oxford University Press, 1889) p.69

A. Campbell, An Old English Grammar (Oxford: Oxford University Press, 1959) p.12

C.L. Wrenn (ed.), Beowulf (London: Harrap, 1973) p.210

The question then arises, should the Old Norse and Old English forms be encoded separately (LATIN LETTER THORN WITH STROKE and LATIN LETTER THORN WITH DIAGONAL STROKE) or should they be considered to be glyph variants of the same abstract character ? According to N3027 it would seem that they should be encoded as a single character, although the latest version of the MUFI character recommendation treats the two glyph forms as separate characters :

MUFI Character Recommendation Version 2.0 f (12 January 2006)

My inclination is to agree with MUFI on this one, although I suspect that I am wrong. According to Unicode encoding principles, language-specific glyph variations should be dealt with at the font level (i.e. in a font designed for Old Norse the glyph for LATIN LETTER THORN WITH STROKE would have a horizontal stroke, whereas a font designed for Old English would have a glyph with a diagonal stroke). However, there are plenty of precedents for encoding language-specific letterforms as separate characters.

My feeling is that in N3027, the proposed LATIN LETTER VEND is really only an Old Norse glyph variant of the already encoded LATIN LETTER WYNN used in Old English, and so if Vend and Wynn should be distinguished at the character level, why not Thorn with a horizontal stroke and Thorn with a diagonal stroke ?

An example from further afield that has been discussed recently is the proposed MYANMAR LETTER MON JHA (N3044), which is acknowledged to be a glyph variant of the already encoded MYANMAR LETTER JHA used for writing Mon, but which Michael Everson is proposing to encode as a distinct character because there is a requirement for a single "plain-text monofont" that covers all of the Myanmar-script languages of the Union of Myanmar, and so language-specific glyph variants must be dealt with at the character level rather than the font level.

This has a bearing on the encoding of Thorn with a stroke, as fonts that are intended for use by medievalists (Alphabetum, Andron Scriptor, Cardo, Junicode, Leeds Uni) are general fonts that cover the characters required for all languages. Thus, users will generally not be using fonts specifically designed for Old Norse or Old English, but will be using a single "medievalist" font with a single glyph for LATIN LETTER THORN WITH STROKE, which will cater either for Old Norse or for Old English, but not for both. I think that this is a pretty good argument for saying that as Old English Thorn with a diagonal stroke "is a language-specific variant which differs significantly from the 'default' letter" (ME's justification for MYANMAR LETTER MON JHA), it should be encoded separately from LATIN LETTER THORN WITH STROKE.

Anyhow, those are just my thoughts. It would be interesting to hear what other people think on this issue.


John Cowan said...

I think this is in the same class as the Polish kreska being much steeper and stubbier than modern versions of the acute in Western European languages. Either language-specific fonts are needed, or there must be compromises.

Graham Asher said...

On the subject of Everson et al. largely ignoring Old English, I suspect (though I would love this not to be true) that he suffers from an unconscious anti-English bias, being a very keen Celticist, and what is more, originally an American. Unicode also fails to acknowledge Old English in its remark on the Tironian Et (U+204A), the note on which is "Irish Gaelic, ..." - it seems a little mean-spirited not to say "Irish Gaelic, Old English, ...".

The document contains at least one error: it glosses Old Norse "tottogo" as "thirtieth". It means "twenty".

Andrew West said...

At this point I perhaps ought to own up to being a friend of Michael Everson. Despite my occasional disagreement with him on technical issues (more of which will come to light in next week's posting on Tibetan Extensions), I very much admire all the hard work that he has put into the expansion of the Unicode character repertoire, and I am probably more frequently in agreement with him than disagreement. Indeed we have been working together in recent months to repair and improve the Myanmar encoding model (see N3043 which is a joint proposal by Ireland and the UK).

I have personally never perceived any anti-English sentiments in my discussions with Michael, and although he is a keen Celticist he has also been reading Old English since the age of thirteen and has a lifelong interest in Tolkien's fantasy scripts and languages ... which is one of the reasons why I was so surprised by the lack of any mention of Old English in N3027.

As to Tironian Et (U+204A), you will be pleased to hear that in the code charts for Unicode 5.0 this character is now annotated as "Irish Gaelic, Old English, ..." and cross-referenced to Ampersand (U+0026) ... both at my suggestion. The final 5.0 code charts are not yet available, but the text of the code charts is available in the Names List.

Graham Asher said...

I obviously did Michael Everson a grave disservice - and I'm pleased I was wrong, because I also admire his work and find everything he writes fascinating (except when he writes in languages I can't understand, like Irish). I am a heavy user of the Unicode standard - I have implemented support for it for two operating systems (Symbian & RIM OS) - so I benefit from his dedication.

Michael Everson said...

Anti-English? Moi?

Certainly not. I prefer Oxford spelling, indeed.

Why no Old English? Because the proposal in question didn't have input from people interested in adding more letters specific to the Old English tradition.

I look forward to working on more medievalist proposals, of course. Currently I am working on Old Cyrillic.

Andrew West said...

Anti-English? Moi? Certainly not. I prefer Oxford spelling, indeed.

Indeed, and at the recent Tokyo WG2 meeting we were fighting together to retain the word "draughts" in Michael's Mahjong, Domino, and Draughts proposal against American attempts to change it to "checkers".