Writing internationalised software
By John-Mark Bell. Published: 13th Apr 2005, 11:57:01
John-Mark Bell dives into the nitty-gritty of Unicode and writing RISC OS software for users worldwide
Much has been said in the past few years about internationalisation under RISC OS. Theo Markettos outlined many of the issues that developers should consider when developing their applications. Recently, there has also been some discussion of Japanese support in the OS. As yet, no-one has nailed down what RISC OS 5 actually offers in this area. This article attempts to address this and offers some handy resources for software developers to use.
Internationalisation can be broken down into three main areas:
- The display of international text
- Processing of international text
- Entering international text
Display of International Text
This is handled, in the main, by the UCS Font Manager, which supports text encoded in a number of encodings. For our purposes, the Unicode family of encodings is the significant advance over and above what was present in previous versions of the OS. It is now possible to give the Font Manager UTF-8/16/32 encoded text and have it rendered correctly.
The advantages of Unicode have been explained before in Theo's in-depth article. Put simply, it defines a single representation of all the characters you may ever wish to make use of, plus a huge amount of room for expansion. The disadvantage of this is that a character is no longer guaranteed to be 1 byte long - depending upon the representation used, it could be anything up to 6 bytes in length, although typically it's between 1 and 4 bytes. Historically, the large majority of software available has assumed that characters are always 1 byte each.
UTF-8/16/32 are 'transformation formats' of the Unicode codepoints for characters. A codepoint is a number (conventionally held in 32 bits) that represents a single entity in Unicode. Usually, a single codepoint will represent a single symbol, but Unicode also allows two or more codepoints to be composed to produce a 'composite symbol'. For example, you may wish to compose the codepoints for upper case 'A' and an acute accent into a discrete entity - the Aacute symbol, in this case. In this article, a character is the representation of a codepoint as a string of bytes. In a given transformation format, there's a 1:1 mapping between characters and their codepoints.
UTF-32 specifies that each character is 32 bits (4 bytes) long. This is a fixed length and every Unicode codepoint can be represented directly in UTF-32.
UTF-16 similarly specifies that each character is 16 bits (2 bytes) long. Again this is a fixed length, and the vast majority of commonly used Unicode codepoints can be represented directly as a single UTF-16 character. As there are some codepoints (those above 0xFFFF) which cannot be mapped directly to a UTF-16 character, a number of UTF-16 character values have been set aside as 'surrogates'. By making use of these, it is possible to access those codepoints which cannot be mapped directly. For these codepoints, two UTF-16 characters are required.
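To make the surrogate mechanism concrete, here is a minimal sketch in Python (chosen purely because it is compact; nothing here is RISC OS specific). The function names are invented for this example. It splits a codepoint above 0xFFFF into a pair of 16 bit values and joins such a pair back together:

```python
def to_utf16_units(codepoint):
    """Return the list of 16 bit UTF-16 values that encode one codepoint."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate values are not codepoints in their own right")
    if codepoint < 0x10000:
        return [codepoint]              # maps directly to one 16 bit value
    offset = codepoint - 0x10000        # leaves a 20 bit number
    high = 0xD800 | (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 | (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return [high, low]


def from_surrogates(high, low):
    """Recombine a high/low surrogate pair into the original codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For instance, the musical G clef symbol (codepoint 0x1D11E) needs the pair 0xD834, 0xDD1E, whereas upper case 'A' (0x41) is a single value.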
UTF-8 is somewhat more complicated than UTF-16 or UTF-32. It is a byte-oriented encoding of the Unicode codepoints which is backwards compatible with ASCII. Essentially, the length of a UTF-8 character is not fixed, as it can be anywhere between 1 and 4 bytes long, according to RFC 3629. This makes processing UTF-8 slightly more complex than dealing with UTF-16 or UTF-32, but its backwards compatibility and compactness make up for this. It is commonly used in network-based applications.
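The variable length is easier to see in code. The following is a rough Python sketch of a UTF-8 encoder for a single codepoint, following the 1 to 4 byte layout; the function name is made up for this example and it is not taken from any RISC OS component:

```python
def utf8_encode(codepoint):
    """Encode one codepoint as a UTF-8 character (1 to 4 bytes)."""
    if codepoint < 0 or codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not an encodable codepoint")
    if codepoint < 0x80:
        # 7 bits: identical to ASCII
        return bytes([codepoint])
    if codepoint < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])
```

Any byte below 0x80 stands for itself, which is where the ASCII compatibility comes from; every byte of a longer character has its top bit set.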
When asking the Font Manager to open a font for use, it is possible to specify the encoding used to map the glyphs present in the font to the characters in the text you wish to display. For example, if you wished to display some text as Greek, you would open the font with the Greek encoding specified, thus:
SYS "XFont_FindFont",,"Homerton.Medium\EGreek",160,160,0,0 TO fh%
With the Unicode Font Manager, you can open the font with a UTF-8 encoding, thus allowing the text to be processed as if it were UTF-8 encoded:
SYS "XFont_FindFont",,"Homerton.Medium\EUTF8",160,160,0,0 TO fh%
If you don't specify an encoding, the Font Manager will check the current alphabet, retrieve its name via Service_International and then use that as the encoding. Thus, if the current alphabet is "Hebrew", any font opened without specifying an encoding will be opened with the Hebrew encoding. Note that this behaviour differs from previous versions of the Font Manager, which simply asked the Territory Manager for the default alphabet of the currently selected territory.
What this means is that if you change the system alphabet to UTF-8, then any font opened without an encoding specified will use UTF-8 as the encoding. The advantage of this is that any text encoded as UTF-8 will be displayed correctly, whether or not the application has any support for UTF-8.
What of UTF-16/32, then? Earlier, I mentioned that the Font Manager is capable of handling these encodings as well. This is true, although it won't work seamlessly with existing applications. The reason for this is that bits 12 and 13 of the Font_Paint flags word have been appropriated in order to define the encoding of the text being passed to the Font Manager. Existing applications will set these bits to 0, thus all text passed from existing applications will be assumed to be eight bit encoded.
Setting bit 12 of the flags word indicates that the text is UTF-16 encoded and setting bit 13 of the flags word indicates that the text is UTF-32 encoded. These are mutually exclusive (as text can't be UTF-16 and UTF-32 encoded at the same time).
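As a small illustration of how these two bits partition the flags word, here is a Python sketch. Only the bit numbers come from the description above; the constant and function names are invented for this example and are not part of any RISC OS header:

```python
# Hypothetical names; only the bit positions are taken from the text above.
FONT_PAINT_UTF16 = 1 << 12   # bit 12 set: text is UTF-16 encoded
FONT_PAINT_UTF32 = 1 << 13   # bit 13 set: text is UTF-32 encoded


def text_encoding_from_flags(flags):
    """Classify the text encoding implied by a Font_Paint style flags word."""
    utf16 = bool(flags & FONT_PAINT_UTF16)
    utf32 = bool(flags & FONT_PAINT_UTF32)
    if utf16 and utf32:
        raise ValueError("bits 12 and 13 are mutually exclusive")
    if utf16:
        return "UTF-16"
    if utf32:
        return "UTF-32"
    # Neither bit set: eight bit text, interpreted using the font's encoding
    return "8 bit"
```

An existing application that leaves both bits clear therefore falls through to the eight bit case, which is why its text is never treated as UTF-16 or UTF-32.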
Processing of International Text
This is something which is really down to applications. Ideally they would represent all text internally in some form of Unicode, thus being able to handle anything thrown at them.
Parts of the OS are reasonably capable in this regard already. For example, much of the Wimp and Filer have been updated to support display and editing of UTF-8 encoded text, including correct behaviour when moving the caret between characters. This does rely on the system alphabet being set to UTF-8, however.
Additionally, a common interface to much of the information contained within the Unicode Character Database wouldn't go amiss. For example, conversion of text from one case to another is not as simple as replacing one character with another, as it may be the case that a single character in one case is represented by a number of characters in another case. Equally, if manipulating UTF-8 encoded text, the length in bytes of the output string may be longer than the length of the input string. This concept can be extended to include things such as sorting and comparison of Unicode text and other such functionality.
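To see why case conversion cannot be done one character at a time, consider this short Python sketch. It relies on Python's built-in str.upper(), which applies the Unicode full case mappings; the helper name is made up for the example:

```python
def case_report(text):
    """Return (codepoints before, codepoints after,
    UTF-8 bytes before, UTF-8 bytes after) for upper-casing text."""
    upper = text.upper()
    return (len(text), len(upper),
            len(text.encode("utf-8")), len(upper.encode("utf-8")))


# German sharp s (U+00DF) upper-cases to the two letters "SS"
sharp_s = case_report("\u00df")

# Greek small iota with dialytika and tonos (U+0390) upper-cases to a
# base letter followed by two combining marks
iota = case_report("\u0390")
```

In the first case one codepoint becomes two; in the second, the UTF-8 output occupies more bytes than the input, so a buffer sized for the original string would overflow.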
Entering International Text
This is probably the most complex area of internationalisation and the one where RISC OS is the weakest. The InternationalKeyboard module (which provides the keyboard drivers for the OS) knows about UTF-8 and, if the system alphabet is set to UTF-8, all keyboard drivers insert UTF-8 characters into the keyboard buffer, regardless of the driver's default alphabet.
Most of the provided drivers are variations on the UK keyboard, thus using some form of Latin encoding. Therefore, they aren't of great interest to us, as they don't really push the bounds of what's possible. However, there are a small number of drivers for Far-Eastern keyboards - namely for Japan, Korea and Taiwan. Additionally, there are drivers for Russian, Greek and Israeli keyboards. As the Japanese keyboard driver is the best documented, that is the one we'll focus on here. The others work similarly.
The Japanese keyboard driver will output a number of different sets of characters dependent upon the state of the Kana/Romaji lock. This lock works in the same way as Caps Lock and its state is determined by bit 5 of the keyboard status byte (readable with
As with other keyboard drivers, pressing Alt together with a key will output a different character. The characters produced by the Japanese keyboard driver are as follows:
* - Note that the Japanese support specification says that this is actually Hiragana - experimentation suggests otherwise. It is impossible, however, to enter Kanji with the current Japanese keyboard driver. This isn't overly surprising, given that it's not possible on other systems, either.
The Japanese keyboard has a number of extra keys beyond those provided on a standard UK keyboard. Most of these are only of use with an Input Method Engine (IME). It appears to be impossible to toggle Kana/Romaji lock if using the Japanese driver with a UK keyboard, due to the lack of a suitable key on the UK keyboard. The Russian keyboard driver uses bit 5 of the keyboard status byte to indicate Cyrillic input mode and allows it to be toggled by pressing Alt-Shift. Under Microsoft Windows, Alt-Shift is used to select the current keyboard driver, whereas here it is simply toggling the input mode of the currently selected driver. Therefore, it should be possible to switch to the Russian keyboard driver, toggle the lock and then switch back to the Japanese driver.
Unfortunately, this fails to work as, once you've toggled the lock with the Russian driver, it is impossible to switch to any other driver. This is because it is entering Cyrillic text into the keyboard buffer and pressing Ctrl-Alt-F1 to switch to the UK driver (or Ctrl-Alt-F12 Alt-<international dialing code>) fails to do anything under the Russian driver - it appears to ignore Alt completely, as Ctrl-Alt-F12 produces the same result as Ctrl-F12 and opens a TaskWindow. In order to avoid the pain of the above, you can download and install this module which provides the Alt-Shift functionality for all keyboard drivers.
It is worth noting that throughout the documentation for the Japanese support within the OS, something referred to as "the Japanese IME" is mentioned. This suggests that there is an IME for Japanese language input at least, presumably internal to Castle and Tematic.
The OS also provides a service call, Service_International 8, which returns a table that maps between the 8 bit alphabets and Unicode. This is useful for applications which use some variant of Unicode as their internal text representation and wish to make use of the top-bit-set characters in the various 8 bit alphabets. By way of explanation, this is the method that NetSurf uses:
- Read the currently selected alphabet
- Acquire a pointer to the UCS conversion table for this alphabet:
  - Try using Service_International 8 to get the table
  - If that fails, use our internal table (needed for pre-RISC OS 5)
- If the alphabet is not UTF8 and the conversion table exists:
  - Lookup UCS code in the conversion table
  - If the code is -1 (i.e. undefined):
    - Use codepoint 0xFFFD instead (defined as the missing glyph)
- If the alphabet is UTF8, we must buffer input, thus:
  - If the keycode is < 0x80 (i.e. ASCII):
    - Handle the keycode directly
  - If the keycode is a UTF8 sequence start:
    - Initialise the buffer appropriately
  - Otherwise:
    - Insert relevant bits from keycode into buffer
    - If we've received an entire UTF8 character:
      - Handle the resulting codepoint
- Otherwise:
  - Simply handle the keycode directly, as there's no easy way of performing the mapping from keycode -> UCS4 codepoint.
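The steps above can be sketched roughly as follows. This is an illustrative Python outline only, not NetSurf's actual code (which is written in C): the names are invented, the 256-entry table stands in for whatever Service_International 8 returns, and the decoder makes no attempt to reject overlong or out-of-range sequences.

```python
MISSING_GLYPH = 0xFFFD


def lookup_8bit(keycode, table):
    """Map a keycode in an 8 bit alphabet to a codepoint via a
    256-entry conversion table, where -1 marks an undefined entry."""
    code = table[keycode]
    return MISSING_GLYPH if code == -1 else code


class Utf8KeyBuffer:
    """Accumulates UTF-8 bytes arriving one keycode at a time."""

    def __init__(self):
        self.codepoint = 0
        self.remaining = 0      # continuation bytes still expected

    def feed(self, keycode):
        """Return a codepoint once a whole character has arrived, else None."""
        if keycode < 0x80:                       # ASCII: complete already
            self.remaining = 0
            return keycode
        if keycode >= 0xC0:                      # start of a sequence
            if keycode < 0xE0:
                self.codepoint, self.remaining = keycode & 0x1F, 1
            elif keycode < 0xF0:
                self.codepoint, self.remaining = keycode & 0x0F, 2
            elif keycode < 0xF8:
                self.codepoint, self.remaining = keycode & 0x07, 3
            else:                                # not a valid start byte
                self.remaining = 0
                return MISSING_GLYPH
            return None
        if self.remaining == 0:                  # stray continuation byte
            return MISSING_GLYPH
        # Continuation byte: shift in its low six bits
        self.codepoint = (self.codepoint << 6) | (keycode & 0x3F)
        self.remaining -= 1
        return self.codepoint if self.remaining == 0 else None
```

Feeding the three bytes 0xE2, 0x82, 0xAC into a Utf8KeyBuffer yields nothing for the first two calls and the Euro sign's codepoint, 0x20AC, on the third.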
There are a number of areas which have rough edges and could do with tidying up. Note that this is not an exhaustive list and it is likely that some fairly obvious deficiencies have been missed.
Changing the system alphabet changes the assumed encoding of Messages files and other such resources. This is particularly noticeable when changing to the UTF-8 alphabet, as text in menus becomes corrupt, as it is assumed to be UTF-8 encoded. There are a number of possible solutions here, but I would suggest the best option would be for MessageTrans to be updated such that it returns text encoded in the current system alphabet from MessageTrans_Lookup and other SWIs. Of course, this doesn't cater for text embedded in Templates and Resources files and that's another area that would need due consideration.
The Font Manager performs no glyph borrowing or other similar techniques. This is noticeable if you try to enter a character into a save dialogue box which isn't present in the currently selected desktop font. Changing the desktop font to one which contains the relevant glyphs displays the character perfectly. More importantly, this forces the user to install fonts which cover a large proportion of the Unicode range if they wish to view non-Latin text. Equally, it prevents developers from simply passing the Font Manager the text they want rendered and forgetting about it. That said, glyph borrowing isn't the answer to everything but, if done properly, is probably the most sensible solution to the problem.
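For readers unfamiliar with the term, the general idea of glyph borrowing can be sketched as below. This is a Python illustration of the technique in the abstract, not a description of how the Font Manager or any existing library does it; the function name, the second font name and the coverage sets are all hypothetical.

```python
def split_into_font_runs(text, preferred, fonts):
    """Split text into (font name, substring) runs.

    fonts maps a font name to the set of codepoints it has glyphs for
    and must include the preferred font. Each character uses the
    preferred font if it can, otherwise the first other font that has
    the glyph, otherwise None.
    """
    order = [preferred] + [name for name in fonts if name != preferred]
    runs = []
    for ch in text:
        font = next((n for n in order if ord(ch) in fonts[n]), None)
        if runs and runs[-1][0] == font:
            # Same font as the previous character: extend the current run
            runs[-1] = (font, runs[-1][1] + ch)
        else:
            runs.append((font, ch))
    return runs


# Made-up coverage data for the example
fonts = {
    "Homerton.Medium": set(range(0x20, 0x7F)),
    "HypotheticalGreek": set(range(0x370, 0x400)),
}
runs = split_into_font_runs("ab\u03b1\u03b2c", "Homerton.Medium", fonts)
```

Each run can then be painted with its own font, so a character missing from the chosen font is still displayed as long as some installed font contains it.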
The printing system is also woefully behind the rest of the OS with respect to Unicode support. Firstly, it is impossible to print to PostScript when using fonts with the UTF-8 font encoding. The font encoding information output to the PostScript printer is entirely invalid PostScript. It appears that the PostScript printer driver simply copies the contents of the relevant encoding file direct into the PostScript output stream. This is fine for the other encodings, as they are valid PostScript, but the UTF-8 encoding file is an exceptional case.
Secondly, attempting to call Font_Paint with UTF-16/32 encoded text during a print job will fail. The reason for this is that the call is passed on to the relevant printer driver module and none of these have any support for non-8bit encoded text.
As a consequence of the development of NetSurf, a number of standalone components have been made available for use by other developers.
Firstly, there is the RUfl library, which provides a higher-level API than that provided by the Font Manager for dealing with text. It operates on UTF-8 encoded text and provides functionality for painting, splitting and width measurement of text. It performs glyph borrowing, such that if a glyph is present on the system, it will be used. Additionally, it works on all RISC OS versions, not just RISC OS 5. Note that RUfl calls the Font Manager with UTF-16 encoded text so, if printing is needed under RISC OS 5, consider the caveats in the previous section.
Secondly, there is the Iconv module which is able to convert text from one character set to another.
- UCS Font Manager documentation
- RISC OS Japanese Support Functional Specification
- Unicode FAQ for more details
John-Mark Bell is a NetSurf developer. Theo Markettos kindly contributed to this article with further thoughts and clarifications