Guidelines for character mnemonics in a minimal character set.

By Keld Simonsen, Danish UNIX User Group (DKUUG)
Representative to SC22 WG on Character Set Usage
for Danish Standards Association (DS), Denmark.

Draft January 1991.

Aim of Character Mnemonics 

The aim of the mnemonics is to be able to represent all characters
in all standard coded character sets in any standard coded
character set. Thus all standard coded character sets will be
related, and a conversion can take place.

The usage of the character mnemonics is primarily intended
within computer operating systems, programming languages and
applications, and this work with character mnemonics is the current
state of work which has been presented to the ISO working group
responsible for these computer related issues, namely the
ISO/IEC JTC1/SC22 special working group on character set usage.

Covered Coded Character Sets

Almost all characters in the standard coded character sets have been
given a mnemonic name in the minimal character set.
The minimal character set is defined as the basic character set
of ISO 646, where 12 positions are left undefined. 
The standard coded character sets are taken as the sum of
all ISO defined or ISO registered character sets. 

The most significant ISO coded character set is the 10646 coded character
set, whose aim is to code in 32 bits all characters in the world.
These guidelines can be seen as assigning mnemonic attributes 
to most characters in 10646, currently at DIS stage.

Other ISO coded character sets covered include all parts of
ISO 8859, ISO 6937-2 and all ISO 646 conforming coded character
sets in the ISO character set registry managed by ECMA
according to ISO 4873.
Some non-ISO character sets are also covered for convenience.

The Character Mnemonics Classes

The character mnemonics are classified into two groups:

1. A group with two-character mnemonics
   - Primarily intended for alphabetic scripts like Latin, Greek,
     Cyrillian, Hebrew and Arabic, and special characters.
2. A group with variable-length mnemonics
   - primarily intended for non-alphabetic scripts like Japanese
     and Chinese. 

All mnemonics are given a long descriptive name, written in the
reference character set and taken from ISO 10646, if possible.


The Two-Character mnemonics
     
The two-character mnemonics include various accented Latin letters,
Greek, Cyrillic, Hebrew, Arabic, Hiragana, Katakana and Bopomofo.
Also quite some special characters are included.
Almost all ISO or ISO registered 7- and 8-bit coded
character sets are covered with these two-character mnemonics.
Thus conversions between these character sets can be done via a
two-character conversion table.

The two characters are chosen so the graphical appearence in the
reference set resembles as much as possible (within the posibilities
available) the graphical appearance of the character. The basic character
set of ISO 646 is used as the reference set, as mentioned above.

The characters in the reference character set are chosen to represent
themselves. You may consider them as two-character mnemonics where
the second char is a space.

Control characters mnemonics are chosen according to ISO 2047 and ISO 6429 .

Letters, including Greek, Cyrillic, Arabic and Hebrew, are represented
with the base letter as the first letter, and the second letter
represents an accent or relation to a non-Latin script.
Non-Latin letters are translitterated to Latin letters,
following translitteration standards as closely as possible.

After a letter, the second character signifies the following:

  Exclamation mark           ! Grave
  Apostrophe                 ' Acute accent
  Greater-Than sign          > Circumflex accent
  Question Mark              ? tilde
  Hyphen-Minus               - Macron
  Left parenthesis           ( Breve
  Full Stop                  . Dot Above/Ring above
  Colon                      : Diaeresis
  Comma                      , Cedilla
  Underline                  _ Underline
  Solidus                    / Stroke
  Quotation mark             " Double acute accent
  Semicolon                  ; Ogonek
  Less-Than sign             < Caron
        
  Equals                     = Cyrillian
  Asterisk                   * Greek
  Percent sign               % Greek/Cyrillian special
  Plus                       + smalls: Arabic, capitals: Hebrew
  Four                       4 Bopomofo
  Five                       5 Hiragana
  Six                        6 Katakana

The ampersand & is reserved as an intro character, indicating that the
following string is in the mnemonic character set. This character
could also be another character, e.g. in the control character set.
One common choice in the control character set is decimal 29,
which seems to have no effect on almost all current equipment.
The intro character can be negotiated between the communicating parties,
but the default is the ampersand "&". Two intro characters in a row
signifies the intro character itself.

The underscore is reserved for the variable-length mnemonics.
This use does not eliminate usage as an accent or language identifier.
The right-pointing parenthesis ")" is not in use at the moment
for accent or language identifying.
This is also the case for some digits.

Special characters are encoded with some mnemonic value.
These are not systematic thruout, but most mnemonics start
with a special character of the reference set.
Special chars with some sort of reference to the reference
character set normally have this character as the first character
in the mnemonic.


The Variable-length Character Mnemonics

The Variable-length Character Mnemonics are primarily meant for the
ideographic characters in larger Asian character sets.
To have the mnemonics as short as possible, which both saves storage
and is easier to type in, a quite short name is preferred.
Considering the Chinese standard GB 2312-1980 and the Japanese standards
JIS X0208 and JIS X0212, they are all given by row and  column
numbers between 1 and 99. So two positions for row and column and
a character set identifier of one character would be almost as short
as possible. The following character set identifiers are defined:

         c   GB 2312-1980
         j   JIS X0208-1990
         J   JIS X0212-1990
         k   KS C 5601-1987

The first idea was to have a name in Latin describing the pronunciation
but that is not possible according to Asian sources.

One prominent character in the reference character set is reserved
for identifying variable-length mnemonics, namely the underscore "_". This character
is intended as a delimiter both in the front and in the end
of the mnemonic. An example of its use would be: (&=intro):

          &_j3210_ &_j4436_&_j6530_

The Variable-Length Character Mnemonics can also be used for less-used
Latin letters with more than one accent or other less-used special characters.
