Drobe :: The archives

Internationalising RISC OS

By Theo Markettos. Published: 10th Jul 2003, 23:03:01

Unicode, i18n and more explained

The advent of RISC OS 5 saw the introduction of Unicode, but what is it and how is it used? RISC OS is used all over the world, so how do RISC OS and RISC OS software cope with the many languages and alphabets present in our world? What is internationalisation? Theo Markettos explains all in his informative, in-depth article for both users and developers.


Until recently, the vast majority of RISC OS users were English-speaking and based in Western Europe, with little need to deal with languages from outside that region. This, however, is beginning to change in the wider world, and it has implications even for those who never use anything other than English. Internationalisation (or i18n to its friends, since there are 18 letters between the I and the N) will have an increasing impact on the insular RISC OS world, and there is much that programmers and users alike need to be aware of.

Firstly there are two sides to i18n. There's the ability to enter another language (for example, a Biblical Hebrew quote in an essay) and then there's the ability to have the operating system work in that language with all messages, menus and so on translated. This is an important distinction because whilst I may know enough Hebrew to write the essay, I might not be able to operate the computer using Hebrew. Often technical words are difficult for the non-native speaker to understand when out of their everyday context (consider the non-technical uses of icon, mouse and desktop to get an idea).

Using international text
Let's consider the first of these, the ability to enter other languages. This is in fact a slightly narrow definition - I may wish to read webpages in Chinese - so we'll expand it to using other languages rather than just entering them.

Character sets
Now the first thing to consider is that many languages do not use a Latin alphabet. It would be tempting to just replace the Latin characters with others, as the RISC OS font Sidney does. However this doesn't help the situation since often we need to use Latin characters as well - think about BASIC programming where the program itself must be in Latin characters but the screen messages in Greek.

RISC OS has already considered this, and provides different alphabets. Here the lower 128 characters of the character set remain unaccented Latin characters, while the top 128 (or top bit set) characters change according to the alphabet. These are (roughly) in line with international standard alphabets, meaning each character can be found in the same place across any platform implementing the standard.
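To make the split between the two halves of the character set concrete, here is a minimal sketch in Python. It uses Python's own ISO-8859 codecs as stand-ins for the RISC OS alphabets (an assumption: the RISC OS alphabets are only roughly in line with these standards), and the function name decode_in_each is purely illustrative.

```python
# Python codec names standing in for the Latin1, Cyrillic and Greek alphabets.
# These are the ISO-8859 standards, not the RISC OS alphabet definitions.
ALPHABETS = ["iso8859_1", "iso8859_5", "iso8859_7"]


def decode_in_each(byte_value):
    """Decode a single byte under each alphabet and return the characters."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in ALPHABETS}


# A byte in the lower half (top bit clear) means the same thing everywhere.
low = decode_in_each(0x41)

# A byte in the upper half (top bit set) depends on the alphabet in use.
high = decode_in_each(0xE1)

for name in ALPHABETS:
    print(name, repr(low[name]), repr(high[name]))
```

The same byte 0xE1 comes out as an accented Latin letter, a Cyrillic letter or a Greek letter depending solely on which alphabet is selected.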

The command *Alphabets will give you a list. In ROM, RISC OS provides a set of definitions for the system font. To try them, load the !Chars application and set it to the system font. Then call *Alphabet <name>, where <name> is one listed by *Alphabets, and see the changes this brings to !Chars.

I said 'international standard', though inevitably things aren't that simple. There is a standard, namely ISO-8859, which defines character positions (encodings) in different alphabets (for example, Cyrillic is ISO-8859-5 whilst Greek is ISO-8859-7). However other platforms predate it and have gone their own way, meaning that there also exist alphabets of the form CPxxx (from DOS), Windows-xxx and MacRoman, MacGreek etc. Acorn was not immune from this since its alphabets also contain characters not present in the standards. Whilst it is sometimes necessary to be wary of these, some are roughly compatible with ISO-8859 so we shall ignore them for the moment.
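As a rough illustration of platforms going their own way, the sketch below asks Python where one Cyrillic letter lives in several encodings. The codec names are Python's (iso8859_5, cp866 for DOS, cp1251 for Windows, mac_cyrillic for the Mac); the Acorn alphabets are not among Python's codecs, so they are not shown, and the helper name position_of is hypothetical.

```python
# One character, several platform encodings (Python codec names).
ENCODINGS = ["iso8859_5", "cp866", "cp1251", "mac_cyrillic"]


def position_of(char, encoding):
    """Return the byte value used for a character in a single-byte encoding."""
    encoded = char.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(f"{encoding} does not use a single byte for {char!r}")
    return encoded[0]


capital_a = "\u0410"  # CYRILLIC CAPITAL LETTER A
positions = {enc: position_of(capital_a, enc) for enc in ENCODINGS}

for enc, value in positions.items():
    print(f"{enc:14s} 0x{value:02X}")
```

A file written on one platform and read on another without conversion will therefore show the wrong letters, even though both sides believe they are handling "Cyrillic".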

Also of note is the fact that there are multiple Latin alphabets. These just have different sets of accented characters in the top bit set range to cope with different European languages - typically Latin1 is for Western Europe, whilst others cater for Central and Eastern Europe.

So far, so good. RISC OS will let you represent different alphabets using different character sets, and the programmer can add their own. But that's where it stops. To see the problem, try loading !Chars and !Edit (using the system font; not Zap or StrongEd). Do *Alphabet Cyrillic and enter some 'Russian' using Chars. Now do *Alphabet Latin1. See that your Russian has magically been 'translated' into English. OK, so your Russian is maybe the same as mine (nil) but even I know that the result is gobbledegook. The problem is that RISC OS has forgotten which encoding you wrote the text in.
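The same effect can be reproduced away from RISC OS. The following sketch, again using Python's ISO-8859 codecs as an approximation of the Cyrillic and Latin1 alphabets, stores some Russian as bytes and then reads those bytes back under the wrong alphabet. The function name reinterpret and the sample word are illustrative only.

```python
def reinterpret(text, written_in, read_as):
    """Encode text in one alphabet, then decode the same bytes in another.

    This mimics what happens when the stored bytes carry no record of the
    alphabet that was active when they were typed.
    """
    return text.encode(written_in).decode(read_as)


russian = "Привет"

# Typed with the Cyrillic alphabet selected, displayed with Latin1 selected.
garbled = reinterpret(russian, "iso8859_5", "iso8859_1")
print(garbled)

# Nothing is lost: switching back to the right alphabet restores the text.
restored = reinterpret(garbled, "iso8859_1", "iso8859_5")
print(restored)
```

The bytes never change; only the interpretation does, which is why the text reappears once the original alphabet is selected again.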

This pervades the whole OS. The OS has no concept of associating an encoding with a piece of text - this goes for text files as well as *Commands, filenames, Obey files and any other place where text is used. As a result software authors (mostly from Western Europe) have assumed that any text will be in Latin1, because that's the default. So, for example, MacFS will try to translate characters in Mac filenames to Latin1, even if you happen to be using Hebrew at the time.

The other problem is that it's impossible to have characters from two encodings on screen at once. I am currently editing some webpages in Greek and Albanian. Albanian uses Latin characters with accents - Latin1. It is impossible to have both Greek and Albanian characters in the same page, because Latin1 is ISO-8859-1 and Greek is ISO-8859-7, and these are mutually incompatible.
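A small sketch of why the two cannot share one 8-bit page: neither ISO-8859-1 nor ISO-8859-7 has room for both sets of letters, whereas a Unicode encoding such as UTF-8 (Unicode being the subject raised in the introduction) can carry both. The sample words and the helper can_encode are illustrative assumptions, not taken from the article.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


albanian = "Shqipëri"  # contains e-diaeresis, present in Latin1
greek = "Ελλάδα"       # Greek letters, present in ISO-8859-7
mixed = albanian + " / " + greek

for encoding in ("iso8859_1", "iso8859_7", "utf-8"):
    print(encoding, can_encode(albanian, encoding),
          can_encode(greek, encoding), can_encode(mixed, encoding))
```

Each word fits its own alphabet, but the mixed line fits neither 8-bit alphabet; only the Unicode encoding accepts it.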



Discussion


Okay, so RO5 has Unicode fonts, but what about support at the other levels? Is it UTF-8 that's used, or is there a separate set of Unicode text functions (i.e., like Windows) available? And does RISC OS -understand- Unicode internally? Also, does RISC OS have the extra necessary APIs for this (i.e., functions for getting a string's length in characters rather than bytes, returning a pointer to the next character in an array, and so on), and does BBC BASIC on RISC OS 5 support Unicode?

mckinlay on 11/7/03 8:37AM

"This pervades the whole OS. The OS has no concept of associating an encoding with a piece of text"

And operating systems shouldn't have to, either - Unicode means that all pieces of text throughout the system have the same encoding.. they just might use different parts of it (something I think important to clarify for those who aren't familiar with Unicode).

mckinlay on 11/7/03 8:41AM

Koreans have fewer characters than Germans. If they read and write Korean, that is. Definitely fewer than thousands. Great article, though, Theo.

maus on 11/7/03 9:28AM

What use is Korean? There are plenty of German RISC OS Users and German-speaking RISC OS Users but not many other languages. Surely the focus should be on helping foreigners to read the UK-based RISC OS (which many foreign RISC OS users do brilliantly), not on helping English speakers to understand Mandarin or whatever.

A.W. on 11/7/03 11:00AM

@Maus: In North Korea, yes, there is only hangul. South Koreans do, however, tend to use Chinese characters as well, albeit at a slowing rate. But you will still find Chinese characters in S. Korean media, hence Korean character sets (preceding Unicode) also tend to be 16-bit, and it is lumped together with Japanese and Chinese in the "difficult-to-handle" CJK group.

idrougge on 11/7/03 12:58PM

A.W.:

You're completely missing the point with Unicode and OS translations. People want to be able to use a computer that has an OS (and preferably applications) in their own language. "helping foreigners to read the UK-based RISC OS" is a very bad business practice, let alone condescending. Out of approx. 450 million Europeans, how many have English as their first language? Maybe 60-65 million. Having the possibility of using the OS in different languages is essential to continuing sales of RISC OS computers. Already Windows, MacOS and KDE (linux) are available in many languages, KDE in over 80 was the last I heard.

RISC OS has lagged way behind in this. My first language is for example Icelandic, although I prefer to use an OS in English I still want to be able to have the computer know the Icelandic alphabet and sort correctly according to it. Today I can only do it using a module that was created by the now defunct Acorn dealer, whether it works with RO 4 or 5 I have no idea. Quite a lot of RO programs don't do alphabetical sort correctly even though I have this territory module running because most RO programmers assume that everyone is English speaking and don't use other alphabets. Through the years I used to contact several software writers about this but only a handful seemed to care and even fewer actually corrected this in their applications. I even contacted Acorn in order to get them to make developers more aware of other languages but the only reply I got was "It's in the reference manuals on page sixhundred and something". Quite willing to help, don't you think? I don't recall if I contacted RISCOS Ltd. about it but at least I have never seen anything from them urging developers to take non-English speakers into account when developing applications, that's one of the reasons why I never upgraded to RO4 and probably will not.

Windows, for example, comes with an extensive amount of territorial information, alphabets, comma instead of a point as a decimal separator (I think most non-English languages use "," as a decimal separator), point as a thousands separator and more and more. All by default. I'm quite certain that this information doesn't take a whole lot of space but is invaluable to non-English speakers.

"What use is Korean?" Hmm, for Koreans it's got quite a use. Do you know how many South Koreans there are? Approximately 49 million. Without the ability to use Korean there's absolutely no point in trying to sell RO there, with it there would at least be a chance. Selling 500 computers there would have a serious upswing effect on RISC OS!

"There are plenty of German RISC OS Users and German-speaking RISC OS Users but not many other languages."

Maybe there's a reason for this? I currently live in Denmark and I've been helping a few people with their Windows computers and I have yet to encounter anyone using a version in any other language than Danish! If the OS were easily translatable and translations were for example distributed with the OS (like KDE) maybe some of these Norwegians that used to use RISC OS would still be using it?

Unicode support is an essential part of allowing non-English speakers to use the OS in the language of their choice; without Unicode support RISC OS will not get anywhere outside the UK (and Germany).

-- Gunnlaugur Jonsson, Copenhagen, Denmark

Gulli on 11/7/03 1:57PM

While I do agree RISC OS lags far behind in i18n support, it *does* have information in the Territory module about thousands/decimal separators and so on: [link] Don't know how much it's used, though. :)

matthew on 11/7/03 2:13PM

Well said Gulli, I also believe some of Acorn's success in Germany was due to the fact they went to the effort of translating the OS (3.19 and 3.7).

flibble on 11/7/03 2:18PM

Gulli: Thanks - I was going to respond with something similar, but couldn't for the life of me think how to put it - your comments were spot on :)

mckinlay on 11/7/03 2:39PM

I've used the Unicode font manager (in a Bush box project) - very nice. Does anyone know if using it on the desktop is a practical proposition? i.e. all printer drivers work, Unicode fonts are available...

DavidPilling on 11/7/03 2:49PM

Actually, there was no real success of Acorn in Germany, ever. The number of boxes shifted were always low, even compared to most other Continental Europe markets.

There are many reasons for this, and I think lack of internationalization (we had a territory, but comparably few software packages used German) was not in the Top 10.

Anyway, there is even a German Territory available for both the latest RISC OS Select and RISC OS 5. But this is only due to the heroic efforts of Detlef Thielsch@a4com.

hubersn on 11/7/03 5:27PM

I do not know how many computers have been sold in Germany compared to France or the Netherlands or wherever. But IIRC the number of units sold "exploded" with the introduction of RISC OS 3.19 (the German version of 3.1). Furthermore I seem to remember that the A5k with RISC OS 3.19 was the most successful Acorn in Germany until the RPC was shipped. And since those days there has always been a German RISC OS version, except for 3.6 I think.

Sincerely Hauke

VLIW on 11/7/03 5:56PM

matthew:

The link you provided is actually the information Acorn pointed me to when I contacted them. The problem is that this SWI is only to read out the information, but a special module is required for each territory in order for this to work. These modules are not supplied with the OS and someone must create them in every country. If the OS actually did supply this information without the need for a third party module this would be useful. Unfortunately this is (was?) not the case and therefore this SWI is of little use.

Switching territories requires loading a new module every time, so if you are using say the Icelandic territory and need to switch to Danish you will have to get your hands on the Danish territory module that may or may not be available, and may or may not cost you something. If you can't get in touch with someone that does have a working module for the territory you need - you're on your own. Not exactly a tempting situation!

-- Gunnlaugur Jonsson, Copenhagen, Denmark

Gulli on 11/7/03 7:11PM

Thanks a lot Theo for this article. I think it is a very important thing for the future of RO. I'm happy that at last with RO4 I could have a French keyboard driver in ROM. I read that EasyWriter copes with Hebrew (a right-to-left written language), but my interest is in Arabic. I translated an ATF1 font for RO, but I can only use it to display things from BASIC: there is no keyboard driver, and the shape of each character changes depending on its position in the word. IIRC I read in the PRM that writing a keyboard driver was "beyond the scope of this manual". What can I do?

a310 on 11/7/03 8:46PM

It is true that there is a lot of i18n work still to be done in RISC OS. The fact there is a Unicode-aware FontManager is one tiny step and a lot more is needed:

- Routines (like iconv) to convert from one encoding to another. Someone volunteering to create such a module based on e.g. the GNU iconv routines found in glibc?
- An easy way to create large character fonts. So OpenType2Font and/or CID2Font (like we have T1ToFont) is needed.
- All text handling parts of RISC OS need to be aware that one char/glyph is not necessarily equal to 1 byte.
- And last but not least: input methods and char/glyph processors.

Visualising Unicode text is one thing, parsing is another but entering Unicode is *really* challenging.

And indeed apart from that, supplying Territories and translations is a must for entering new markets.

I'm more than interested to hear from anyone having projects on these subjects.

John.

joty on 12/7/03 12:53AM

I agree that there is a lot of i18n work still to be done in RISC OS. The fact there is a Unicode-aware FontManager is one tiny step and a lot more is needed.

-to be able to add an Encoding/Alphabet to the one in use would already be of help.

-having a !Chars (or an !XChars like the latest from Martin Wuerthner) displaying all the currently available characters would be handy. A bit like the display of !FontEd with character names and numbers. Of course, with an option to at least copy/paste to a document.

-or to be able to 'extract' e.g a subset as a separate font from a Base0 font with an appropriate name e.g. Trinity.Latin2.Medium This could also be useful for PostScript printing, for the correct Encoding is in the font folder.

-the current !T1toFont can create or convert only PC Type1s. But, when a Type1 with all characters as mentioned in Base0 Encoding in RO5 is constructed, it is easy to add an Encoding to the current !T12Font to get all the mentioned characters in one font. Of course, an 'OpenType'toFont or a 'TrueType'toFont application would be better.

-having additional Encodings for the Win and Mac etc. Latin1 subsets, which also can be chosen like the current ones.

-having a more 'flexible' way to recognise font names to map fonts. In many instances the difference is only the full stop ...

I also agree: Visualising Unicode text is one thing, parsing is another but entering Unicode is *really* challenging.

Tonnie

tonnie on 12/7/03 10:33AM

Sorry, Theo, but the article is both wrong and misleading. Basically, what you have shown is that Edit is not suitable for writing international text, which is not surprising because Edit edits text files and plain text files have no encoding information.

What do you mean by "impossible to have characters from two encodings on screen at once"? This is complete nonsense. In HolyBible, for example, I can have Greek, Hebrew, Latin, Russian, Welsh all on my screen at the same time. There is absolutely nothing in the OS to stop you from doing so. The RISC OS FontManager has supported encodings since RISC OS 3.1, and you can have fonts with different encodings on screen at the same time.

The fact that Edit, Zap and StrongEd are limited in that they have no idea of encodings is a different matter. This is purely an application problem, not an OS problem.

It is true that RISC OS misses important things required for i18n but it is important to point these things out and not confuse everybody by claiming that nothing works at all.

Martin

wuerthne on 12/7/03 1:08PM

By on screen at once I suspect he meant in the same editor. However as you say, that's down to the application to support it, and the font set to have the relevant characters.

piemmm on 12/7/03 7:20PM

But you CAN have text in different encodings in the same editor without any problems, even in the same window, even in the same word if you want.

You just need an editor that supports that. In any case, it is NOT an operating system problem, which is what Theo's article implied.

wuerthne on 12/7/03 8:22PM

Err, you have read the other pages to this article by clicking on the 'next' link at the bottom of the article page?

piemmm on 12/7/03 8:46PM

If your application supports multiple fonts it's easy to support multiple encodings?

But it's not possible to have multiple encodings in text files etc, because they have no concept of encodings, they're just plain 8bit text. And all RISC OS commands, etc, have this problem. Also you can't have more than one encoding in WIMP icons.

A unicode aware Text Editor would be needed for plain unicode text I assume.

mavhc on 12/7/03 8:49PM

Plain text files (and any files in fact) can say whether they're UTF-8 or Unicode (little or big endian) by starting with the various byte order marks that exist. Unfortunately, I don't think anything on RISC OS recognises these...?

matthew on 13/7/03 12:06PM

Gulli:

I was thinking about priorities that's all and whether it was best to concentrate on those people that make most use of RISC OS - British (including for the purposes of language Australian and New Zealand and Ireland), Germans and Dutch afaik. I'm not very convinced that attempting to forge new markets outside of these places is a good idea given the limited resources and relatively small overall size of the RISC OS userbase and company resources.

A.W. on 14/7/03 1:18PM

Hmm, I seem to remember a company that actually tried that, think it was called Acorn.

I do understand that point of yours though but I still personally feel that it's wrong.

-- Gunnlaugur Jonsson, Copenhagen, Denmark

Gulli on 14/7/03 5:54PM

gulli:

indeed.

but from what has been said the translation seems to be difficult.

provision for unicode is perhaps best regarded as a start.

perhaps you and the other danskRISC OSers could persuade kayak computer to help / provide backing?

epistaxsis_RISC OS on 14/7/03 7:23PM

Hmm, Kayak Computer. Had totally forgotten about them. I'm not Danish, I'm Icelandic and just moved to Denmark this year so I'll use that as an excuse :D

I've actually spoken to them once, at Acorn World 1995, and if I recall correctly this point actually came up - not Unicode but i18n. I was at the time thinking of getting a licence to sell Acorn computers in Iceland - too bad Acorn and their middlemen (company name slips my mind) weren't exactly helpful; I never received any information from them although they seemed quite enthusiastic at a meeting I had with them at that show.

Kayak's website doesn't really look as if the company is still operating. (www.kayak.dk)

-- Gunnlaugur Jonsson, Copenhagen, Denmark

Gulli on 14/7/03 9:31PM

Gulli: middlemen = Lindis International ?

joty on 15/7/03 1:55AM

gulli:

Ooops - should have realised given your name...

mmm must admit kayak may no longer exist...

epistaxsis@work on 15/7/03 10:14AM

joty: Exactly, Lindis was it.

epistaxsis: I would have been really surprised if people were to recognize my nationality by my name, most just get baffled and wrap their tongues around their heads trying to pronounce it - most don't bother trying to guess where I'm from although some actually ask :D

-- Gunnlaugur Jonsson, Copenhagen, Denmark

Gulli on 15/7/03 7:08PM

Martin: Perhaps what I said about multiple encodings in the same program was slightly misleading - I meant in applications that don't support fonts (so use the system font/standard WIMP font) and icons. If you're already invoking the Font Manager, fine you can do what you want, but many applications leave the WIMP to do the rendering so can at best specify one encoding per icon. And very few applications store encoding information in their files - though this is their fault.

Part of the point of writing the article was to make this issue more prominent than 'page 1234 of the PRM' - because I face it every day.

In terms of priorities, Unicode is IMHO the most important. Forget encodings, everything can be handled internally as Unicode and converted as necessary.

Something that users can use to generate Territory modules would also be nice. (though there will always be complicated ones)

IKHG for writing keyboard drivers is currently in a bit of a state - mail me for more details as to its status.

In terms of effort, Unicode is an increasing necessity to talk to the outside world. Many webpages are going Unicode, files are going Unicode and even domain names are hinted that way. The rest is relatively easy. I'm not suggesting translating RO into Swahili just for that market, but to make such a translation painless enough for a dealer to do.

Theo Larisa, Greece

caliston2 on 16/7/03 8:57PM

© 1999-2009 The Drobe Team. Some rights reserved.