And now for something completely different…

In this day of multiple languages, with computer users from all over the world (sometimes in the same office building), it is worthwhile to set up systems with internationalization in mind. Basically, this means that the system can handle eight-bit characters and the input and output of text in languages other than English.

On a UNIX or Linux system, this is handled by a number of variables that all expand on the LANG environment variable. The current settings can be seen quickly by using the locale command. This is the output from my Fedora 7 system:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Each of these can be set separately. LC_ALL, when set, overrides every individual LC_* variable; the individual LC_* variables in turn override LANG, which supplies the default for any category that is not set explicitly. There is an excellent and full-depth description of these variables available.
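The precedence can be checked directly, because locale reports the effective value of each category. What follows is a minimal sketch, assuming the glibc locale utility, which builds this listing from the environment, so the locales named here should not need to be installed. The helper name effective is made up for illustration.

```shell
# effective CATEGORY VAR=value...
# Run `locale` in an otherwise empty environment and print the value it
# reports for CATEGORY. The helper name is hypothetical.
effective() {
    category=$1
    shift
    env -i PATH="$PATH" "$@" locale 2>/dev/null |
        sed -n "s/^${category}=//p" | tr -d '"'
}

# Only LANG is set: the category falls back to it.
effective LC_TIME LANG=en_US.UTF-8

# An individual LC_* variable overrides LANG for its own category.
effective LC_TIME LANG=en_US.UTF-8 LC_TIME=ru_RU.UTF-8

# LC_ALL overrides both.
effective LC_TIME LANG=en_US.UTF-8 LC_TIME=ru_RU.UTF-8 LC_ALL=C
```

On a glibc system the three calls should print en_US.UTF-8, ru_RU.UTF-8 and C, in that order.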

All of the possible values for these variables can be seen with the locale -a command. Here are a few selected entries from the long list (654 locale settings on Fedora 7!) that erupts from this command:

$ locale -a

ca_FR
ca_FR.iso885915
ca_FR.utf8

en_US
en_US.iso88591
en_US.iso885915
en_US.utf8

ru_RU
ru_RU.iso88595
ru_RU.koi8r
ru_RU.utf8
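With a list this long, it is usually easier to ask whether one particular locale is present than to read the whole thing. The sketch below is one way to do that. The helper names canon and has_locale are hypothetical, and the comparison ignores case and hyphens on the assumption that locale -a may spell the encoding as utf8 where the variables say UTF-8.

```shell
# canon NAME: lower-case NAME and drop hyphens, so that en_US.UTF-8 and
# en_US.utf8 compare equal. Hypothetical helper.
canon() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/-//g'
}

# has_locale NAME: succeed if NAME is listed by `locale -a`.
has_locale() {
    locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | sed 's/-//g' |
        grep -qxF -- "$(canon "$1")"
}

# Report on a few of the names from the listing.
for name in en_US.UTF-8 ru_RU.koi8r ca_FR.utf8; do
    if has_locale "$name"; then
        echo "$name: installed"
    else
        echo "$name: not installed"
    fi
done
```

The result depends entirely on which locales the system has generated, so the output will differ from machine to machine.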

In the locale -a excerpt above, the first group is Catalan as used in France (Canadian French would be fr_CA); the second, American English; the third, Russian from Russia (as compared to Russian from Ukraine or Tatar from Russia). Each name is built from the ISO 639-1 code for the language (the first two characters), the ISO 3166-1 code for the country (the second pair of characters), and then the name of the specific encoding to use. In the example, one can see ISO 8859-1 (Latin 1), ISO 8859-15 (Latin 9), ISO 8859-5 (Cyrillic), UTF-8 (Unicode), and KOI8-R (Cyrillic). The traditional “C” locale is represented by the name C or its equivalent, POSIX; both refer to the traditional 7-bit ASCII representation. The best choice is a utf8 locale (such as en_US.utf8), since the working group responsible for ISO 8859 has disbanded and the standard is no longer maintained.
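The naming scheme is regular enough to take apart with plain shell parameter expansion. This is a small sketch rather than a full parser: the function name split_locale is hypothetical, and any @modifier suffix that a name might carry is simply dropped.

```shell
# split_locale NAME
# Break language_TERRITORY.codeset into its parts using POSIX parameter
# expansion. An @modifier suffix, if present, is discarded (a simplification).
split_locale() {
    name=${1%%@*}
    codeset=
    territory=
    case $name in
        *.*) codeset=${name#*.}; name=${name%%.*} ;;
    esac
    case $name in
        *_*) territory=${name#*_}; name=${name%%_*} ;;
    esac
    printf 'language=%s territory=%s codeset=%s\n' "$name" "$territory" "$codeset"
}

split_locale ru_RU.koi8r      # language=ru territory=RU codeset=koi8r
split_locale ca_FR.iso885915  # language=ca territory=FR codeset=iso885915
split_locale C                # language=C territory= codeset=
```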

It is also necessary to make sure that the connection between your display and the system is what is called “eight-bit clean” – that is, all eight bits from the source system to your display are preserved and are intact. More specifically, the entire path from keyboard to display must be eight-bit clean in order for things to work properly.

These variables set the main character sets to use; however, programs must still be translated into other languages and must be prepared to handle the language in question. If a program is not translated into Russian, using ru_RU.utf8 will not make a difference in the output (which most likely will be English). Some programs may even have to be configured for a different language.
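A quick way to see whether a given program has been translated is to provoke an error message under two different locales and compare. The sketch below uses ls for this. The helper name show_error and the path are hypothetical, and it assumes that the Russian locale may or may not be installed.

```shell
# show_error LOCALE
# Ask ls for a path that does not exist and print its complaint under the
# given locale. The helper name and the path are hypothetical.
show_error() {
    LC_ALL=$1 ls /no/such/path 2>&1
    return 0    # ls fails by design; only its message is of interest
}

show_error C            # untranslated message
show_error ru_RU.utf8   # translated only if the locale and catalog exist
```

If the second message comes back in English, either the ru_RU.utf8 locale is not installed or ls has no Russian message catalog on that system, which is exactly the situation described above.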

There is also the keyboard mapping, which brings its own set of challenges and configurations to handle. Linux has the xmodmap command (for X) and the loadkeys command (for the console). The console keymap programs are included in the kbd RPM (or package).
