Internationalising RISC OS
By Theo Markettos. Published: 10th Jul 2003, 23:03:01
Unicode, i18n and more explained
The advent of RISC OS 5 saw the introduction of Unicode, but what is it and how is it used? RISC OS is used all over the world, so how do RISC OS and RISC OS software cope with the many languages and alphabets present in our world? What is internationalisation? Theo Markettos explains all in his informative, in-depth article for both users and developers.
Until recently, the vast majority of RISC OS users were English-speaking and based in Western Europe, with little need to deal with languages from outside that region. This however is beginning to change in the wider world, and has implications even for those who never use anything other than English. Internationalisation (or i18n to its friends, since there are 18 letters between the I and the N) will have an increasing impact on the insular RISC OS world, and there is much that programmers and users alike need to be aware of.
Firstly, there are two sides to i18n. There's the ability to enter another language (for example, a Biblical Hebrew quote in an essay) and then there's the ability to have the operating system work in that language, with all messages, menus and so on translated. This is an important distinction because whilst I may know enough Hebrew to write the essay, I might not be able to operate the computer using Hebrew. Often technical words are difficult for the non-native speaker to understand when out of their everyday context (consider the non-technical uses of icon, mouse and desktop to get an idea).
Using international text
Let's consider the first of these, the ability to enter other languages. This is in fact a slightly narrow definition - I may wish to read webpages in Chinese - so we'll expand this to using them instead.
Now the first thing to consider is that many languages do not use a Latin alphabet. It would be tempting to just replace the Latin characters with others, as the RISC OS font Sidney does. However this doesn't help the situation since often we need to use Latin characters as well - think about BASIC programming where the program itself must be in Latin characters but the screen messages in Greek.
RISC OS has already considered this, and provides different alphabets. Here the lower 128 characters of the character set remain unaccented Latin characters, while the top 128 (or top bit set) characters change according to the alphabet. These are (roughly) in line with international standard alphabets, meaning each character can be found in the same place across any platform implementing the standard.
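To make this concrete, here is a small Python sketch. It uses Python's own ISO-8859 codecs as a stand-in, on the assumption that they are close enough to the RISC OS alphabets for illustration (the RISC OS versions are only roughly in line with the standards). The function name decode_in_each is made up for this example.

```python
# Decode the same bytes under three ISO-8859 alphabets.
# Assumption: Python's ISO codecs stand in for the RISC OS alphabets here.
ENCODINGS = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]  # Latin1, Cyrillic, Greek


def decode_in_each(raw):
    """Return a mapping of encoding name to the text those bytes represent."""
    return {name: raw.decode(name) for name in ENCODINGS}


# A byte in the lower half means the same thing in every alphabet...
low = decode_in_each(b"A")

# ...whereas a top-bit-set byte changes meaning with the alphabet.
high = decode_in_each(bytes([0xE1]))

for name in ENCODINGS:
    print(name, repr(low[name]), repr(high[name]))
```

The byte for 'A' decodes identically every time, while byte 0xE1 comes out as an accented Latin letter, a Cyrillic letter or a Greek letter depending on which alphabet is selected.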
*Alphabets will give you a list. In ROM, RISC OS provides a set of definitions for the system font. To try them, load the !Chars application and set it to the system font. Then call *Alphabet <name>, where <name> is one of those listed by *Alphabets, and see the changes this brings to !Chars.
I said 'international standard', though inevitably things aren't that simple. There is a standard, namely ISO-8859, which defines character positions (encodings) in different alphabets (for example, Cyrillic is ISO-8859-5 whilst Greek is ISO-8859-7). However other platforms predate it and have gone their own way, meaning that there also exist alphabets of the form CPxxx (from DOS), Windows-xxx and MacRoman, MacGreek etc. Acorn was not immune from this since its alphabets also contain characters not present in the standards. Whilst it is sometimes necessary to be wary of these, some are roughly compatible with ISO-8859 so we shall ignore them for the moment.
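The divergence between platforms is easy to see with a single byte. The sketch below again leans on Python's built-in codecs (cp437 for DOS, cp1252 for Windows, mac_roman for the Mac); the Acorn alphabets are not among Python's codecs, so they are left out, and the variable names are just for illustration.

```python
# One top-bit-set byte, interpreted by three platform-specific alphabets.
# Assumption: Python's codec tables are used as the reference for each platform.
PLATFORM_ENCODINGS = ["cp437", "cp1252", "mac_roman"]

raw = bytes([0x80])
results = {name: raw.decode(name) for name in PLATFORM_ENCODINGS}

for name, char in results.items():
    # Show the character along with its Unicode code point for clarity.
    print(f"{name:10} {char!r} U+{ord(char):04X}")
```

Each platform gives a different character for the same byte, which is why text moved between systems needs translating rather than just copying.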
Also of note is the fact that there are multiple Latin alphabets. These just have different sets of accented characters in the top bit set range to cope with different European languages - typically Latin1 is for Western Europe, whilst others cater for Central and Eastern Europe.
So far, so good. RISC OS will let you represent different alphabets using different character sets, and the programmer can add their own. But that's where it stops. To see the problem, try loading !Chars and !Edit (using the system font; not Zap or StrongEd). Do *Alphabet Cyrillic and enter some 'Russian' using Chars. Now do *Alphabet Latin1. See that your Russian has magically been 'translated' into English. OK, so your Russian is maybe the same as mine (nil) but even I know that the result is gobbledegook. The problem is that RISC OS has forgotten which encoding you wrote the text in.
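The same effect can be reproduced away from RISC OS. This is only a sketch, assuming Python's iso-8859-5 and iso-8859-1 codecs as stand-ins for the Cyrillic and Latin1 alphabets; the sample word is my own choice.

```python
# Text is stored as bare bytes; nothing records which alphabet produced them.
russian = "Привет"

# "Typing" the word under the Cyrillic alphabet leaves these bytes behind.
stored = russian.encode("iso-8859-5")

# Switching alphabet does not change the bytes, only how they are displayed.
misread = stored.decode("iso-8859-1")

# Only by remembering the original encoding can the text be recovered.
recovered = stored.decode("iso-8859-5")

print(stored.hex(" "))
print(misread)
print(recovered)
```

The bytes never change. What changes is the table used to turn them into characters, and since that table is not stored with the text, the wrong one is used silently.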
This pervades the whole OS. The OS has no concept of associating an encoding with a piece of text - this goes for *Commands, filenames, the contents of text files, Obey files and any other place where text is used. As a result software authors (mostly from Western Europe) have assumed that any text will be in Latin1, because that's the default. So, for example, MacFS will try to translate characters in Mac filenames to Latin1, even if you happen to be using Hebrew at the time.
The other problem is that it's impossible to have characters from two encodings on screen at once. I am currently editing some webpages in Greek and Albanian. Albanian uses Latin characters with accents - Latin1. It is impossible to have both Greek and Albanian characters in the same page, because Latin1 is ISO-8859-1 and Greek is ISO-8859-7, and these are mutually incompatible.
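A quick check shows why no choice of 8-bit alphabet rescues such a page. The sketch below assumes Python's codecs as stand-ins once more; the helper encodable and the sample words are invented for the example. UTF-8 is included only to show that a Unicode encoding can hold both.

```python
def encodable(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


albanian = "Shqipëria"   # needs the accented Latin letter ë
greek = "Ελλάδα"         # needs Greek letters
mixed = albanian + " / " + greek

for encoding in ["iso-8859-1", "iso-8859-7", "utf-8"]:
    print(encoding, encodable(albanian, encoding),
          encodable(greek, encoding), encodable(mixed, encoding))
```

Each language fits in its own alphabet, but the mixed line fits in neither, because the top 128 positions can only hold one set of characters at a time.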