Saturday, 16 April 2011

Unicode 6.0 — One character at a time

A recent YouTube video by Jörg Piringer that scrolls through "all" 49,571 Unicode characters in 33 minutes and 16 seconds (about 25 characters a second) has been doing the rounds, but I'm afraid that I was not impressed. The 49,571 characters in the video only cover the BMP (Basic Multilingual Plane), and even then it is 5,000 characters short: it misses most of the characters that have been added to Unicode over the past ten years, and it misses entirely some scripts that have been in Unicode since Year Zero.

Unicode version 6.0 (released October 2010) actually defines 109,384 characters (109,244 graphic and 140 format characters). How many of them you are able to see depends upon your operating system, your browser, and whether you have additional fonts installed covering obscure and recently encoded scripts and characters (and whether your browser will actually apply those fonts). On my Windows 7 SP1 machine, with no additional fonts installed, I can see 95,372 of these 109,384 characters. That is 87.1% of the total number of characters, but only 66 of the 203 blocks are fully covered, and 85 blocks have no coverage at all.

Anyway, I've made my own attempt at a JavaScript-based "video" that goes through all 109,384 characters in Unicode 6.0 at this page, which you can launch in a new window by clicking on the image below. Then if you have 3 hours 2 minutes and 18 seconds to spare (at the default 10 characters per second) just hit the "Start" button, and see how well your system does. By default it lets your browser choose what fonts to use. If you have additional fonts installed covering obscure scripts and recent Unicode additions, but your browser is not applying them, then try ticking the "Use Custom Fonts" option. It will apply a custom list of pan-Unicode and script-specific fonts for each block, which can give up to 99% coverage of the 109,384 characters if you have the appropriate fonts installed.
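
The running times quoted in this post are just the character count divided by the display rate. Here is a minimal Python sketch of that arithmetic; the function name is hypothetical and nothing here is taken from the slide show's own script:

```python
def slideshow_duration(total_chars, chars_per_second):
    """Return (hours, minutes, seconds) needed to show every character once.

    Any fractional second left over is dropped.
    """
    total_seconds = int(total_chars / chars_per_second)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


# All of Unicode 6.0 at the default rate of 10 characters per second.
print(slideshow_duration(109_384, 10))  # (3, 2, 18)

# Rate of the YouTube video: 49,571 characters in 33 minutes 16 seconds.
video_rate = 49_571 / (33 * 60 + 16)
print(round(video_rate, 1))  # 24.8
```

Doubling the rate to 20 characters per second would roughly halve the running time to just over an hour and a half.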

[2011-04-17 Update : to scroll through random characters from random blocks check the "Random Characters" checkbox before hitting "Start"; to view a single random character hit the "Random Character" button; to view a specific Unicode character enter its hexadecimal code point value in the text box and hit "Go To".]
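
For anyone curious how a "Go To" box of this kind might turn a hexadecimal value into a character, here is a small Python sketch. It is only an illustration under my own assumptions (the function name, and accepting an optional "U+" prefix, are hypothetical) and is not the slide show's actual code:

```python
import unicodedata


def go_to(hex_text):
    """Convert a hexadecimal code point such as '0D28' or 'U+0D28' to a character."""
    text = hex_text.strip()
    if text.upper().startswith("U+"):  # assumption: tolerate a "U+" prefix
        text = text[2:]
    code_point = int(text, 16)  # raises ValueError if the text is not valid hex
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range: " + hex_text)
    return chr(code_point)


char = go_to("0D28")
# The character name comes from Python's bundled Unicode database, which is
# newer than Unicode 6.0.
print(char, unicodedata.name(char))
```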

[2011-04-19 Update : to find a character by whole or partial name (regular expressions not currently supported) enter the name in the text box and hit "Search For"; if not searching by exact name, keep hitting "Search For" to find the next matching character.]
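
The "keep hitting Search For" behaviour amounts to a substring search that resumes just after the previous hit. The Python sketch below shows the idea; it is not the page's implementation, the function name is hypothetical, and it relies on Python's `unicodedata` module, which has no names for control characters and reflects a later Unicode version than 6.0:

```python
import unicodedata


def search_for(fragment, start=0):
    """Return the first code point at or after `start` whose name contains `fragment`.

    Returns None when there is no further match.
    """
    needle = fragment.upper()
    for code_point in range(start, 0x110000):
        # The second argument is the default returned for unnamed code points.
        name = unicodedata.name(chr(code_point), "")
        if needle in name:
            return code_point
    return None


first = search_for("latin small letter a")
following = search_for("latin small letter a", first + 1)  # "find next"
print(hex(first), hex(following))
```

Passing the previous result plus one as `start` is what makes repeated searches step through the matches in code point order.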

[2011-04-21 Update : fixed skip blocks bug; added formal aliases; slide show page now accepts parameters to initially show a particular character (?char=A1B2), show a random character (?char=random), search for a given character name or part of a name (?find=string), or find an exact character name (?name=character name).]
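
If you want to generate links that use these parameters, the query string can be built mechanically. In the Python sketch below only the parameter names (`char`, `find`, `name`) come from the update above; the base URL is a placeholder, the helper name is hypothetical, and encoding spaces as `%20` is my assumption about what the page accepts:

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

# Placeholder address: substitute the real location of the slide show page.
BASE_URL = "https://example.org/slideshow.html"


def slideshow_url(**params):
    """Build a slide show link such as ...?char=A1B2 or ...?find=string."""
    # quote_via=quote encodes spaces as %20 instead of "+".
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


print(slideshow_url(char="A1B2"))
print(slideshow_url(char="random"))
print(slideshow_url(name="MALAYALAM LETTER NA"))

# Decoding the query string recovers the original parameter value.
query = urlsplit(slideshow_url(name="MALAYALAM LETTER NA")).query
print(parse_qs(query))
```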


John Yunker said...

Andrew -- I love the slide show! I added a link to it here:


jmdesp said...

I'd love to see, independently, the custom list of pan-Unicode and script-specific fonts that gives 99% coverage.
Preferably organised by coverage, showing how many fonts you need to get an already really good coverage.

Also, I thought this would very quickly become an exercise in watching a lot of Chinese characters go by, but I was surprised to see that there are more than 10,000 non-Chinese characters before you get to that huge block of Hanzi. It's fun, but I'm missing the Burmese font, and would have loved to see what those characters look like.

Andrew West said...

I've planned for a long time to write a blog post entitled "How to get 100% Unicode coverage" which lists what fonts you need to display what Unicode scripts/blocks, but every time I get close to achieving 100% a new version of Unicode is released with thousands more new characters, and so I've never yet managed to achieve 100% myself and I've never written the post. In particular, I do not know of any freely available font that covers these blocks: Mandaic, Balinese, Batak, Lepcha, Vedic Extensions, Bamum, Common Indic Number Forms, Devanagari Extended, Rumi Numeral Symbols, Brahmi, Kaithi, Bamum Supplement, and Variation Selectors Supplement. Because of comment length restrictions, I cannot provide the custom font list here and now, but I will give a detailed answer in the post this evening.

Andrew West said...

I have now inserted in the post a link to the custom font list used and download links for the freeware and shareware fonts included.

Denilson Sá said...

Suggestion: stop using the "b" element, or reset the font-weight back to normal.

Why? Some characters are rendered incorrectly because the browser (or the operating system, I don't know which) tries to "emulate" a bold font by increasing the stroke thickness. This happened for me in both Opera and Chrome under Linux; I don't know about other systems.

Things get very bad for "picture"-style characters. For instance, the ones from "Block 191 : Playing Cards".

Andrew West said...

Thanks for the suggestion -- I have removed the bolding now.

partha said...

I am new to this study of Unicode glyphs, fonts etc. I am interested in the Malayalam language and want to see how I can make a distinction between two forms of the letter 'na'. Alan Wood's Resources character list for Malayalam gives: decimal 3368, hex 0D28, MALAYALAM LETTER NA ന. One form is pronounced as this NA itself and the other is a phonetic variation of it, a dental one. I have to design a new symbol for this and make all the glyphs necessary to show how it combines with other letters. Can you suggest where to start reading?

Andrew West said...

Hi Partha, I'm afraid that I am not really the best person to answer such a question. I suggest that you subscribe to the Unicode public email list where there are many experts who should be able to answer your questions.