- In internationalization terminology, a string of text is divided up
-into "characters", which are the printable units that make up the text.
-A single character is (for example) a capital `A', the number `2', a
-Katakana character, a Hangul character, a Kanji ideograph (an
-"ideograph" is a "picture" character, such as those used in Japanese
-Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
-of such ideographs in each language), etc. The basic property of a
-character is that it is the smallest unit of text with semantic
-significance in text processing.
-
- Human beings normally process text visually, so to a first
-approximation a character may be identified with its shape. Note that
-the same character may be drawn by two different people (or in two
-different fonts) in slightly different ways, although the "basic shape"
-will be the same. But consider the works of Scott Kim; human beings
-can recognize hugely variant shapes as the "same" character.
-Sometimes, especially where characters are extremely complicated to
-write, completely different shapes may be defined as the "same"
-character in national standards. The Taiwanese variant of Hanzi is
-generally the most complicated; over the centuries, the Japanese,
-Koreans, and the People's Republic of China have adopted
-simplifications of the shape, but the line of descent from the original
-shape is recorded, and the meanings and pronunciation of different
-forms of the same character are considered to be identical within each
-language. (Of course, it may take a specialist to recognize the
-related form; the point is that the relations are standardized, despite
-the differing shapes.)
-
- In some cases, the differences will be significant enough that it is
-actually possible to identify two or more distinct shapes that both
-represent the same character. For example, the lowercase letters `a'
-and `g' each have two distinct possible shapes--the `a' can optionally
-have a curved tail projecting off the top, and the `g' can be formed
-either of two loops, or of one loop and a tail hanging off the bottom.
-Such distinct possible shapes of a character are called "glyphs". The
-important characteristic of two glyphs making up the same character is
-that the choice between one or the other is purely stylistic and has no
-linguistic effect on a word (this is the reason why a capital `A' and
-lowercase `a' are different characters rather than different
-glyphs--e.g. `Aspen' is a city while `aspen' is a kind of tree).
-
- Note that "character" and "glyph" are used differently here than
-elsewhere in XEmacs.
-
- A "character set" is essentially a set of related characters. ASCII,
-for example, is a set of 94 characters (or 128, if you count
-non-printing characters). Other character sets are ISO8859-1 (ASCII
-plus various accented characters and other international symbols), JIS
-X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208
-(Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji),
-GB2312 (Mainland Chinese Hanzi), etc.
-
- The definition of a character set will implicitly or explicitly give
-it an "ordering", a way of assigning a number to each character in the
-set. For many character sets, there is a natural ordering, for example
-the "ABC" ordering of the Roman letters. But it is not clear whether
-digits should come before or after the letters, and in fact different
-European languages treat the ordering of accented characters
-differently. It is useful to use the natural order where available, of
-course. The number assigned to any particular character is called the
-character's "code point". (Within a given character set, each
-character has a unique code point. Thus the word "set" is ill-chosen;
-different orderings of the same characters are different character sets.
-Identifying characters is simple enough for alphabetic character sets,
-but the difference in ordering can cause great headaches when the same
-thousands of characters are used by different cultures as in the Hanzi.)
-
- A code point may be broken into a number of "position codes". The
-number of position codes required to index a particular character in a
-character set is called the "dimension" of the character set. For
-practical purposes, a position code may be thought of as a byte-sized
-index. The set of printing ASCII characters, being relatively small,
-is of dimension one, and each character in the set is
-indexed using a single position code, in the range 1 through 94. Use of
-this unusual range, rather than the familiar 33 through 126, is an
-intentional abstraction; to understand the programming issues you must
-break the equation between character sets and encodings.
-
- JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
-of dimension two - every character is indexed by two position codes,
-each in the range 1 through 94. (This number "94" is not a
-coincidence; we shall see that the JIS position codes were chosen so
-that JIS kanji could be encoded without using codes that in ASCII are
-associated with device control functions.) Note that the choice of the
-range here is somewhat arbitrary. You could just as easily index the
-printing characters in ASCII using numbers in the range 0 through 93, 2
-through 95, 3 through 96, etc. In fact, the standardized _encoding_
-for the ASCII _character set_ uses the range 33 through 126.
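
The relationship between code points, position codes, and dimension can be sketched in Python. This is only an illustration under stated assumptions: the names `ascii_position_code` and `split_code_point` are hypothetical, and numbering code points from 0 in row-major order is a convenience chosen here, not something any standard prescribes.

```python
POSITIONS = 94  # each position code ranges over 1 through 94


def ascii_position_code(ch):
    """Return the position code (1..94) of a printing ASCII character."""
    code = ord(ch) - 32  # the standard encoding uses 33..126
    if not 1 <= code <= POSITIONS:
        raise ValueError("not a printing ASCII character")
    return code


def split_code_point(index, dimension):
    """Break a 0-based code point into `dimension` position codes.

    Assumes row-major numbering; this is an illustrative choice only.
    """
    if not 0 <= index < POSITIONS ** dimension:
        raise ValueError("code point out of range for this dimension")
    codes = []
    for _ in range(dimension):
        index, remainder = divmod(index, POSITIONS)
        codes.append(remainder + 1)  # position codes start at 1
    return tuple(reversed(codes))
```

A dimension-two set such as JIS X 0208 therefore has room for at most 94 * 94 = 8836 characters.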
-
- An "encoding" is a way of numerically representing characters from
-one or more character sets as a stream of like-sized numerical values
-called "words"; typically these are 8-bit, 16-bit, or 32-bit
-quantities. If an encoding encompasses only one character set, then the
-position codes for the characters in that character set could be used
-directly. (This is the case with the trivial cipher used by children,
-assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII,
-other considerations intrude. For example, why are the upper- and
-lowercase alphabets separated by 6 characters, so that `A' and `a'
-differ by exactly 32? Why do the digits start with `0' being assigned
-the code 48? In both cases, the reason is that semantically
-interesting operations (case conversion and numerical value extraction)
-become convenient masking operations. Other artificial aspects (the
-control characters being assigned to codes 0-31 and 127) are historical
-accidents. (The use of 127 for `DEL' is an artifact of the "punch
-once" nature of paper tape, for example.)
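
The masking operations mentioned above can be shown in a few lines of Python. This is a minimal sketch; the helper names are hypothetical, and the functions are meaningful only for ASCII letters and digits respectively (no input checking is done).

```python
CASE_BIT = 0x20  # ord('a') - ord('A') == 32, a single bit


def to_upper(ch):
    """Clear the case bit of an ASCII letter."""
    return chr(ord(ch) & ~CASE_BIT)


def to_lower(ch):
    """Set the case bit of an ASCII letter."""
    return chr(ord(ch) | CASE_BIT)


def digit_value(ch):
    """Keep the low four bits of an ASCII digit ('0' is code 48 == 0x30)."""
    return ord(ch) & 0x0F
```

Because `0' is assigned 0x30, the low four bits of each digit's code are the digit's numerical value.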
-
- Naive use of the position code is not possible, however, if more than
-one character set is to be used in the encoding. For example, printed
-Japanese text typically requires characters from multiple character sets
-- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is
-indexed using one or more position codes in the range 1 through 94, so
-the position codes could not be used directly or there would be no way
-to tell which character was meant. Different Japanese encodings handle
-this differently - JIS uses special escape characters to denote
-different character sets; EUC sets the high bit of the position codes
-for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
-JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings
-you will encounter in files are 7-bit or 8-bit encodings. There is one
-common 16-bit encoding, which is Unicode; this strives to represent all
-the world's characters in a single large character set. 32-bit
-encodings are often used internally in programs, such as XEmacs with
-MULE support, to simplify the code that manipulates them; however, they
-are not used externally because they are not very space-efficient.)
-
- A general method of handling text using multiple character sets
-(whether for multilingual text, or simply text in an extremely
-complicated single language like Japanese) is defined in the
-international standard ISO 2022. ISO 2022 will be discussed in more
-detail later (*note ISO 2022::), but for now suffice it to say that text
-needs control functions (at least spacing), and if escape sequences are
-to be used, an escape sequence introducer. It was decided to make all
-text streams compatible with ASCII in the sense that the codes 0-31
-(and 128-159) would always be control codes, never graphic characters,
-and where defined by the character set the `SPC' character would be
-assigned code 32, and `DEL' would be assigned 127. Thus there are 94
-code points remaining if 7 bits are used. This is the reason that most
-character sets are defined using position codes in the range 1 through
-94. Then ISO 2022 compatible encodings are produced by shifting the
-position codes 1 to 94 into character codes 33 to 126, or (if 8 bit
-codes are available) into character codes 161 to 254.
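
The shift from position codes to character codes is simple arithmetic. The following Python sketch uses hypothetical function names (`to_7bit`, `to_8bit`) chosen for this illustration.

```python
def _check(position_code):
    if not 1 <= position_code <= 94:
        raise ValueError("position codes run from 1 through 94")


def to_7bit(position_code):
    """Shift a position code into the 7-bit range 33..126."""
    _check(position_code)
    return position_code + 32


def to_8bit(position_code):
    """Shift a position code into the 8-bit range 161..254."""
    _check(position_code)
    return position_code + 160  # the 7-bit code with the high bit set
```

The two results differ only in the high bit: `to_8bit(p) == to_7bit(p) | 0x80` for every valid position code.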
-
- Encodings are classified as either "modal" or "non-modal". In a
-"modal encoding", there are multiple states that the encoding can be
-in, and the interpretation of the values in the stream depends on the
-current global state of the encoding. Special values in the encoding,
-called "escape sequences", are used to change the global state. JIS,
-for example, is a modal encoding. The bytes `ESC $ B' indicate that,
-from then on, bytes are to be interpreted as position codes for JIS X
-0208, rather than as ASCII. This effect is cancelled using the bytes
-`ESC ( B', which mean "switch from whatever the current state is to
-ASCII". To switch to JIS X 0212, the escape sequence `ESC $ ( D' is
-used.
-(Note that here, as is common, the escape sequences do in fact begin
-with `ESC'. This is not necessarily the case, however. Some encodings
-use control characters called "locking shifts" (effect persists until
-cancelled) to switch character sets.)
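
A decoder for a modal encoding has to carry state from one byte to the next. The Python sketch below handles only the three escape sequences named above and assumes the stream starts in ASCII. It is a toy, not a complete JIS decoder: the name `decode_modal` is hypothetical, and control characters and locking shifts are ignored.

```python
ESC = 0x1B

# Escape sequences from the text, mapped to (character set, dimension).
ESCAPES = {
    b"\x1b$B": ("JIS X 0208", 2),
    b"\x1b(B": ("ASCII", 1),
    b"\x1b$(D": ("JIS X 0212", 2),
}


def decode_modal(data):
    """Return a list of (character set, position codes) pairs."""
    state = ("ASCII", 1)  # assumed initial state
    result = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            for sequence, new_state in ESCAPES.items():
                if data.startswith(sequence, i):
                    state = new_state  # the global state changes here
                    i += len(sequence)
                    break
            else:
                raise ValueError("unknown escape sequence")
            continue
        charset, dimension = state
        chunk = data[i:i + dimension]
        if len(chunk) < dimension:
            raise ValueError("truncated character")
        # Character codes 33..126 map back to position codes 1..94.
        result.append((charset, tuple(byte - 32 for byte in chunk)))
        i += dimension
    return result
```

For example, `decode_modal(b"A\x1b$BF|\x1b(BA")` reports an ASCII character, one JIS X 0208 character with position codes (38, 92), and another ASCII character.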
-
- A "non-modal encoding" has no global state that extends past the
-character currently being interpreted. EUC, for example, is a
-non-modal encoding. Characters in JIS X 0208 are encoded by setting
-the high bit of the position codes, and characters in JIS X 0212 are
-encoded by doing the same but also prefixing the character with the
-byte 0x8F.
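
The EUC rules just described can be written down directly. This Python sketch is an illustration only: the name `euc_encode` is hypothetical, character sets are identified by plain strings, and other parts of EUC (such as half-width Katakana) are left out.

```python
def euc_encode(charset, position_codes):
    """Encode one character, given its character set and position codes."""
    if charset == "ASCII":
        (code,) = position_codes
        return bytes([code + 32])  # plain 7-bit character code
    # Shift each position code to 33..126, then set the high bit.
    high = bytes((code + 32) | 0x80 for code in position_codes)
    if charset == "JIS X 0208":
        return high
    if charset == "JIS X 0212":
        return b"\x8f" + high  # prefix byte marks JIS X 0212
    raise ValueError("character set not handled by this sketch")
```

Each character carries everything needed to interpret it, so no state survives from one character to the next.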
-
- The advantage of a modal encoding is that it is generally more
-space-efficient, and is easily extensible because an essentially
-arbitrary number of escape sequences can be created. The
-disadvantage, however, is that it is much more difficult to work with
-if it is not being processed in a sequential manner. In the non-modal
-EUC encoding, for example, the byte 0x41 always refers to the letter
-`A'; whereas in JIS, it could either be the letter `A', or one of the
-two position codes in a JIS X 0208 character, or one of the two
-position codes in a JIS X 0212 character. Determining exactly which
-one is meant could be difficult and time-consuming if the previous
-bytes in the string have not already been processed, or impossible if
-they are drawn from an external stream that cannot be rewound.
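
The ambiguity of the byte 0x41 can be observed with the codecs that ship with Python. This sketch assumes that Python's `iso2022_jp` and `euc_jp` codecs are acceptable stand-ins for the "JIS" and "EUC" encodings discussed here; the variable names are arbitrary.

```python
# In the initial (ASCII) state, the byte 0x41 is the letter A.
in_ascii = b"\x41".decode("iso2022_jp")

# After ESC $ B, the same byte is the second position code of a
# two-byte JIS X 0208 character (here the byte pair 0x30 0x41).
in_kanji = b"\x1b$B\x30\x41\x1b(B".decode("iso2022_jp")

# In EUC the multi-byte characters have the high bit set, so a byte
# 0x41 in the stream can only be the letter A.
euc_bytes = "A\u65e5".encode("euc_jp")
```

Here `in_ascii` is the letter `A', while `in_kanji` is a single Kanji character, even though both byte strings contain 0x41.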
-
- Non-modal encodings are further divided into "fixed-width" and
-"variable-width" formats. A fixed-width encoding always uses the same
-number of words per character, whereas a variable-width encoding does
-not. EUC is a good example of a variable-width encoding: one to three
-bytes are used per character, depending on the character set. 16-bit
-and 32-bit encodings are nearly always fixed-width, and this is in fact
-one of the main reasons for using an encoding with a larger word size.
-The advantage of fixed-width encodings is that any character can be
-located by simple index arithmetic, without scanning the text. The
-advantages of variable-width encodings are that they are generally more
-space-efficient and allow for compatibility with existing 8-bit
-encodings such as ASCII. (For example, in Unicode, ASCII characters are
-simply promoted to a 16-bit representation. That means that every
-promoted ASCII character contains a `NUL' byte; the standard C string
-manipulation functions, which treat `NUL' as a terminator, will
-therefore fail in a fixed-width Unicode environment.)
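
The difference in width can be measured with Python's built-in codecs. This is a sketch under two assumptions: that the `euc_jp` and `utf-16-be` codecs represent the encodings discussed here, and that U+4E02 belongs to JIS X 0212 (it is used only as a sample character).

```python
samples = {
    "ASCII": "A",
    "JIS X 0208": "\u65e5",
    "JIS X 0212": "\u4e02",  # assumed to be a JIS X 0212 character
}

# EUC is variable-width: the byte count depends on the character set.
euc_widths = {name: len(ch.encode("euc_jp")) for name, ch in samples.items()}

# A 16-bit representation promotes ASCII by adding a zero byte.
utf16_a = "A".encode("utf-16-be")
has_nul = 0 in utf16_a
```

With these samples, `euc_widths` shows one, two, and three bytes respectively, and `utf16_a` contains a `NUL' byte.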
-
- The bytes in an 8-bit encoding are often referred to as "octets"
-rather than simply as bytes. This terminology dates back to the days
-before 8-bit bytes were universal, when some computers had 9-bit bytes,
-others had 10-bit bytes, etc.