-File: lispref.info, Node: Coding System Properties, Next: Basic Coding System Functions, Prev: EOL Conversion, Up: Coding Systems
-
-Coding System Properties
-------------------------
-
-`mnemonic'
- String to be displayed in the modeline when this coding system is
- active.
-
-`eol-type'
- End-of-line conversion to be used. It should be one of the types
- listed in *Note EOL Conversion::.
-
-`eol-lf'
- The coding system which is the same as this one, except that it
- uses the Unix line-breaking convention.
-
-`eol-crlf'
- The coding system which is the same as this one, except that it
- uses the DOS line-breaking convention.
-
-`eol-cr'
- The coding system which is the same as this one, except that it
- uses the Macintosh line-breaking convention.
-
-`post-read-conversion'
- Function called after a file has been read in, to perform the
- decoding. Called with two arguments, BEG and END, denoting a
- region of the current buffer to be decoded.
-
-`pre-write-conversion'
- Function called before a file is written out, to perform the
- encoding. Called with two arguments, BEG and END, denoting a
- region of the current buffer to be encoded.
-
- The following additional properties are recognized if TYPE is
-`iso2022':
-
-`charset-g0'
-`charset-g1'
-`charset-g2'
-`charset-g3'
- The character set initially designated to the G0 - G3 registers.
- The value should be one of
-
- * A charset object (designate that character set)
-
- * `nil' (do not ever use this register)
-
- * `t' (no character set is initially designated to the
- register, but may be later on; this automatically sets the
- corresponding `force-g*-on-output' property)
-
-`force-g0-on-output'
-`force-g1-on-output'
-`force-g2-on-output'
-`force-g3-on-output'
- If non-`nil', send an explicit designation sequence on output
- before using the specified register.
-
-`short'
- If non-`nil', use the short forms `ESC $ @', `ESC $ A', and `ESC $
- B' on output in place of the full designation sequences `ESC $ (
- @', `ESC $ ( A', and `ESC $ ( B'.
-
-`no-ascii-eol'
- If non-`nil', don't designate ASCII to G0 at each end of line on
- output. Setting this to non-`nil' also suppresses other
- state-resetting that normally happens at the end of a line.
-
-`no-ascii-cntl'
- If non-`nil', don't designate ASCII to G0 before control chars on
- output.
-
-`seven'
- If non-`nil', use 7-bit environment on output. Otherwise, use
- 8-bit environment.
-
-`lock-shift'
- If non-`nil', use locking-shift (SO/SI) instead of single-shift or
- designation by escape sequence.
-
-`no-iso6429'
- If non-`nil', don't use ISO6429's direction specification.
-
-`escape-quoted'
- If non-nil, literal control characters that are the same as the
- beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
- particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3
- (0x8F), and CSI (0x9B)) are "quoted" with an escape character so
- that they can be properly distinguished from an escape sequence.
- (Note that doing this results in a non-portable encoding.) This
- encoding flag is used for byte-compiled files. Note that ESC is a
- good choice for a quoting character because there are no escape
- sequences whose second byte is a character from the Control-0 or
- Control-1 character sets; this is explicitly disallowed by the ISO
- 2022 standard.
-
-`input-charset-conversion'
- A list of conversion specifications, specifying conversion of
- characters in one charset to another when decoding is performed.
- Each specification is a list of two elements: the source charset,
- and the destination charset.
-
-`output-charset-conversion'
- A list of conversion specifications, specifying conversion of
- characters in one charset to another when encoding is performed.
- The form of each specification is the same as for
- `input-charset-conversion'.
-
- The following additional properties are recognized (and required) if
-TYPE is `ccl':
-
-`decode'
- CCL program used for decoding (converting to internal format).
-
-`encode'
- CCL program used for encoding (converting to external format).
-
- The following properties are used internally: EOL-CR, EOL-CRLF,
-EOL-LF, and BASE.
+File: lispref.info, Node: Internationalization Terminology, Next: Charsets, Up: MULE
+
+Internationalization Terminology
+================================
+
+ In internationalization terminology, a string of text is divided up
+into "characters", which are the printable units that make up the text.
+A single character is (for example) a capital `A', the number `2', a
+Katakana character, a Hangul character, a Kanji ideograph (an
+"ideograph" is a "picture" character, such as is used in Japanese
+Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
+of such ideographs in each language), etc. The basic property of a
+character is that it is the smallest unit of text with semantic
+significance in text processing.
+
+ Human beings normally process text visually, so to a first
+approximation a character may be identified with its shape. Note that
+the same character may be drawn by two different people (or in two
+different fonts) in slightly different ways, although the "basic shape"
+will be the same. But consider the works of Scott Kim; human beings
+can recognize hugely variant shapes as the "same" character.
+Sometimes, especially where characters are extremely complicated to
+write, completely different shapes may be defined as the "same"
+character in national standards. The Taiwanese variant of Hanzi is
+generally the most complicated; over the centuries, the Japanese,
+Koreans, and the People's Republic of China have adopted
+simplifications of the shape, but the line of descent from the original
+shape is recorded, and the meanings and pronunciation of different
+forms of the same character are considered to be identical within each
+language. (Of course, it may take a specialist to recognize the
+related form; the point is that the relations are standardized, despite
+the differing shapes.)
+
+ In some cases, the differences will be significant enough that it is
+actually possible to identify two or more distinct shapes that both
+represent the same character. For example, the lowercase letters `a'
+and `g' each have two distinct possible shapes--the `a' can optionally
+have a curved tail projecting off the top, and the `g' can be formed
+either of two loops, or of one loop and a tail hanging off the bottom.
+Such distinct possible shapes of a character are called "glyphs". The
+important characteristic of two glyphs making up the same character is
+that the choice between one or the other is purely stylistic and has no
+linguistic effect on a word (this is the reason why a capital `A' and
+lowercase `a' are different characters rather than different
+glyphs--e.g. `Aspen' is a city while `aspen' is a kind of tree).
+
+ Note that "character" and "glyph" are used differently here than
+elsewhere in XEmacs.
+
+ A "character set" is essentially a set of related characters. ASCII,
+for example, is a set of 94 characters (or 128, if you count
+non-printing characters). Other character sets are ISO8859-1 (ASCII
+plus various accented characters and other international symbols), JIS
+X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208
+(Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji),
+GB2312 (Mainland Chinese Hanzi), etc.
+
+ The definition of a character set will implicitly or explicitly give
+it an "ordering", a way of assigning a number to each character in the
+set. For many character sets, there is a natural ordering, for example
+the "ABC" ordering of the Roman letters. But it is not clear whether
+digits should come before or after the letters, and in fact different
+European languages treat the ordering of accented characters
+differently. It is useful to use the natural order where available, of
+course. The number assigned to any particular character is called the
+character's "code point". (Within a given character set, each
+character has a unique code point. Thus the word "set" is ill-chosen;
+different orderings of the same characters are different character sets.
+Identifying characters is simple enough for alphabetic character sets,
+but the difference in ordering can cause great headaches when the same
+thousands of characters are used by different cultures as in the Hanzi.)
+
+ A code point may be broken into a number of "position codes". The
+number of position codes required to index a particular character in a
+character set is called the "dimension" of the character set. For
+practical purposes, a position code may be thought of as a byte-sized
+index. The printing characters of ASCII, being a relatively small
+character set, form a set of dimension one, and each character in the
+set is indexed using a single position code, in the range 1 through
+94. Use of
+this unusual range, rather than the familiar 33 through 126, is an
+intentional abstraction; to understand the programming issues you must
+break the equation between character sets and encodings.
+
+ JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
+of dimension two--every character is indexed by two position codes,
+each in the range 1 through 94. (This number "94" is not a
+coincidence; we shall see that the JIS position codes were chosen so
+that JIS kanji could be encoded without using codes that in ASCII are
+associated with device control functions.) Note that the choice of the
+range here is somewhat arbitrary. You could just as easily index the
+printing characters in ASCII using numbers in the range 0 through 93, 2
+through 95, 3 through 96, etc. In fact, the standardized _encoding_
+for the ASCII _character set_ uses the range 33 through 126.
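+
+ The notion of "dimension" can be made concrete with a small sketch
+(in Python; the packing helper below is hypothetical, just one obvious
+way to number the cells of a dimension-two set):

```python
# A dimension-two character set indexes each character by two position
# codes, each in the range 1 through 94, giving 94 * 94 = 8836 possible
# cells (JIS X 0208 fills only part of this space).

def linear_index(row, cell):
    """Pack two position codes (1-94 each) into one number (0-8835)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("position codes must be in 1..94")
    return (row - 1) * 94 + (cell - 1)
```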
+
+ An "encoding" is a way of numerically representing characters from
+one or more character sets into a stream of like-sized numerical values
+called "words"; typically these are 8-bit, 16-bit, or 32-bit
+quantities. If an encoding encompasses only one character set, then the
+position codes for the characters in that character set could be used
+directly. (This is the case with the trivial cipher used by children,
+assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII,
+other considerations intrude. For example, why are the upper- and
+lowercase alphabets separated by 32 code points (`A' is 65, `a' is
+97)? Why do the digits start with `0' being assigned the code 48? In
+both cases because semantically interesting operations (case
+conversion and numerical value extraction) become convenient masking
+operations. Other artificial aspects (the
+control characters being assigned to codes 0-31 and 127) are historical
+accidents. (The use of 127 for `DEL' is an artifact of the "punch
+once" nature of paper tape, for example.)
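+
+ The masking operations alluded to above are easy to demonstrate (a
+sketch in Python; the same bit tricks work in any language):

```python
# ASCII's layout makes two semantically interesting operations cheap.

# Upper- and lowercase letters differ only in bit 5 (value 32),
# so case conversion is a single mask.
def to_upper(ch):
    return chr(ord(ch) & ~0x20) if 'a' <= ch <= 'z' else ch

# The digits start at code 48 (0x30), so a digit's numeric value
# is just its low four bits.
def digit_value(ch):
    assert '0' <= ch <= '9'
    return ord(ch) & 0x0F
```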
+
+ Naive use of the position code is not possible, however, if more than
+one character set is to be used in the encoding. For example, printed
+Japanese text typically requires characters from multiple character
+sets--ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these
+is indexed using one or more position codes in the range 1 through 94,
+so the position codes could not be used directly or there would be no
+way to tell which character was meant. Different Japanese encodings
+handle this differently--JIS uses special escape characters to denote
+different character sets; EUC sets the high bit of the position codes
+for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
+JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings
+you will encounter in files are 7-bit or 8-bit encodings. There is one
+common 16-bit encoding, which is Unicode; this strives to represent all
+the world's characters in a single large character set. 32-bit
+encodings are often used internally in programs, such as XEmacs with
+MULE support, to simplify the code that manipulates them; however, they
+are not used externally because they are not very space-efficient.)
+
+ A general method of handling text using multiple character sets
+(whether for multilingual text, or simply text in an extremely
+complicated single language like Japanese) is defined in the
+international standard ISO 2022. ISO 2022 will be discussed in more
+detail later (*note ISO 2022::), but for now suffice it to say that text
+needs control functions (at least spacing), and if escape sequences are
+to be used, an escape sequence introducer. It was decided to make all
+text streams compatible with ASCII in the sense that the codes 0-31
+(and 128-159) would always be control codes, never graphic characters,
+and where defined by the character set the `SPC' character would be
+assigned code 32, and `DEL' would be assigned 127. Thus there are 94
+code points remaining if 7 bits are used. This is the reason that most
+character sets are defined using position codes in the range 1 through
+94. Then ISO 2022 compatible encodings are produced by shifting the
+position codes 1 to 94 into character codes 33 to 126, or (if 8 bit
+codes are available) into character codes 161 to 254.
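+
+ The shift just described is trivial arithmetic; a sketch in Python
+(the helper names are made up for illustration):

```python
# ISO 2022 compatible encodings move position codes 1-94 into the
# 7-bit graphic range 33-126, or, when 8-bit codes are available,
# into the range 161-254.

def to_7bit(pos):
    """Encode a position code (1-94) as a 7-bit graphic code."""
    assert 1 <= pos <= 94
    return pos + 32          # 1 -> 33 (`!'), 94 -> 126 (`~')

def to_8bit(pos):
    """Encode a position code (1-94) as an 8-bit graphic code."""
    assert 1 <= pos <= 94
    return pos + 160         # 1 -> 161, 94 -> 254
```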
+
+ Encodings are classified as either "modal" or "non-modal". In a
+"modal encoding", there are multiple states that the encoding can be
+in, and the interpretation of the values in the stream depends on the
+current global state of the encoding. Special values in the encoding,
+called "escape sequences", are used to change the global state. JIS,
+for example, is a modal encoding. The bytes `ESC $ B' indicate that,
+from then on, bytes are to be interpreted as position codes for JIS X
+0208, rather than as ASCII. This effect is cancelled using the bytes
+`ESC ( B', which mean "switch from whatever the current state is to
+ASCII". To switch to JIS X 0212, the escape sequence `ESC $ ( D' is
+used.
+(Note that here, as is common, the escape sequences do in fact begin
+with `ESC'. This is not necessarily the case, however. Some encodings
+use control characters called "locking shifts" (effect persists until
+cancelled) to switch character sets.)
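+
+ The modal behavior can be sketched as a tiny state machine (a Python
+sketch, handling only the three escape sequences mentioned above--real
+ISO 2022 decoders recognize many more):

```python
# A modal decoder: a global state selects how subsequent bytes are
# interpreted, and escape sequences switch that state.

DESIGNATIONS = {
    b'\x1b(B': 'ascii',       # ESC ( B: switch to ASCII
    b'\x1b$B': 'jisx0208',    # ESC $ B: switch to JIS X 0208
    b'\x1b$(D': 'jisx0212',   # ESC $ ( D: switch to JIS X 0212
}

def scan_states(data):
    """Yield (current-charset, byte) pairs for a JIS-style stream."""
    state, i = 'ascii', 0
    while i < len(data):
        for seq, new_state in DESIGNATIONS.items():
            if data.startswith(seq, i):
                state, i = new_state, i + len(seq)
                break
        else:
            yield state, data[i]
            i += 1
```

Note how the same byte 0x42 (`B') is interpreted differently depending
on the state in force when it is reached.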
+
+ A "non-modal encoding" has no global state that extends past the
+character currently being interpreted. EUC, for example, is a
+non-modal encoding. Characters in JIS X 0208 are encoded by setting
+the high bit of the position codes, and characters in JIS X 0212 are
+encoded by doing the same but also prefixing the character with the
+byte 0x8F.
+
+ The advantage of a modal encoding is that it is generally more
+space-efficient, and is easily extendible because there are essentially
+an arbitrary number of escape sequences that can be created. The
+disadvantage, however, is that it is much more difficult to work with
+if it is not being processed in a sequential manner. In the non-modal
+EUC encoding, for example, the byte 0x41 always refers to the letter
+`A'; whereas in JIS, it could either be the letter `A', or one of the
+two position codes in a JIS X 0208 character, or one of the two
+position codes in a JIS X 0212 character. Determining exactly which
+one is meant could be difficult and time-consuming if the previous
+bytes in the string have not already been processed, or impossible if
+they are drawn from an external stream that cannot be rewound.
+
+ Non-modal encodings are further divided into "fixed-width" and
+"variable-width" formats. A fixed-width encoding always uses the same
+number of words per character, whereas a variable-width encoding does
+not. EUC is a good example of a variable-width encoding: one to three
+bytes are used per character, depending on the character set. 16-bit
+and 32-bit encodings are nearly always fixed-width, and this is in fact
+one of the main reasons for using an encoding with a larger word size.
+The advantages of fixed-width encodings should be obvious. The
+advantages of variable-width encodings are that they are generally more
+space-efficient and allow for compatibility with existing 8-bit
+encodings such as ASCII. (For example, in Unicode ASCII characters are
+simply promoted to a 16-bit representation. That means that every
+ASCII character contains a `NUL' byte, so all of the standard
+null-terminated string manipulation functions will lose badly in a
+fixed-width Unicode environment.)
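+
+ The `NUL'-byte observation is easy to verify (a Python sketch, using
+UTF-16 as the fixed-width 16-bit representation):

```python
# Every ASCII character, widened to 16 bits, contains a zero byte,
# which is fatal for null-terminated string handling.

encoded = 'A'.encode('utf-16-be')      # big-endian, no byte-order mark
assert encoded == b'\x00A'             # the high byte of `A' is NUL
assert all(0 in c.encode('utf-16-be') for c in 'ASCII text')
```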
+
+ The bytes in an 8-bit encoding are often referred to as "octets"
+rather than simply as bytes. This terminology dates back to the days
+before 8-bit bytes were universal, when some computers had 9-bit bytes,
+others had 10-bit bytes, etc.