+File: lispref.info, Node: Coding System Types, Next: ISO 2022, Up: Coding Systems
+
+Coding System Types
+-------------------
+
+ The coding system type determines the basic algorithm XEmacs will
+use to decode or encode a data stream. Character encodings will be
+converted to the MULE encoding, escape sequences processed, and newline
+sequences converted to XEmacs's internal representation. There are
+three basic classes of coding system type: no-conversion, ISO-2022, and
+special.
+
+ No conversion allows you to look at the file's internal
+representation. Since XEmacs is basically a text editor, "no
+conversion" does convert newline conventions by default. (Use the
+'binary coding-system if this is not desired.)
+
+ ISO 2022 (*note ISO 2022::) is the basic international standard
+regulating use of "coded character sets for the exchange of data", ie,
+text streams. ISO 2022 contains functions that make it possible to
+encode text streams to comply with restrictions of the Internet mail
+system and de facto restrictions of most file systems (eg, use of the
+separator character in file names). Coding systems which are not ISO
+2022 conformant can be difficult to handle. Perhaps more important,
+they are not adaptable to multilingual information interchange, with
+the obvious exception of ISO 10646 (Unicode). (Unicode is partially
+supported by XEmacs with the addition of the Lisp package ucs-conv.)
+
+ The special class of coding systems includes automatic detection,
+CCL (a "little language" embedded as an interpreter, useful for
+translating between variants of a single character set),
+non-ISO-2022-conformant encodings like Unicode, Shift JIS, and Big5,
+and MULE internal coding. (NB: this list is based on XEmacs 21.2.
+Terminology may vary slightly for other versions of XEmacs and for GNU
+Emacs 20.)
+
+`no-conversion'
+ No conversion, for binary files, and a few special cases of
+ non-ISO-2022 coding systems where conversion is done by hook
+ functions (usually implemented in CCL). On output, graphic
+ characters that are not in ASCII or Latin-1 will be replaced by a
+ `?'. (For a no-conversion-encoded buffer, these characters will
+ only be present if you explicitly insert them.)
+
+`iso2022'
+ Any ISO-2022-compliant encoding. Among others, this includes JIS
+ (the Japanese encoding commonly used for e-mail), national
+ variants of EUC (the standard Unix encoding for Japanese and other
+ languages), and Compound Text (an encoding used in X11). You can
+ specify more specific information about the conversion with the
+ FLAGS argument.
+
+`ucs-4'
+ ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of
+ Unicode.
+
+`utf-8'
+ ISO 10646 UTF-8 encoding. A "file system safe" transformation
+ format that can be used with both UCS-4 and Unicode.
+
+`undecided'
+ Automatic conversion. XEmacs attempts to detect the coding system
+ used in the file.
+
+`shift-jis'
+ Shift-JIS (a Japanese encoding commonly used in PC operating
+ systems).
+
+`big5'
+ Big5 (the encoding commonly used for Taiwanese).
+
+`ccl'
+ The conversion is performed using a user-written pseudo-code
+ program. CCL (Code Conversion Language) is the name of this
+ pseudo-code. For example, CCL is used to map KOI8-R characters
+ (an encoding for Russian Cyrillic) to ISO8859-5 (the form used
+ internally by MULE).
+
+`internal'
+ Write out or read in the raw contents of the memory representing
+ the buffer's text. This is primarily useful for debugging
+ purposes, and is only enabled when XEmacs has been compiled with
+ `DEBUG_XEMACS' set (the `--debug' configure option). *Warning*:
+ Reading in a file using `internal' conversion can result in an
+ internal inconsistency in the memory representing a buffer's text,
+ which will produce unpredictable results and may cause XEmacs to
+ crash. Under normal circumstances you should never use `internal'
+ conversion.
+
+\1f
+File: lispref.info, Node: ISO 2022, Next: EOL Conversion, Prev: Coding System Types, Up: Coding Systems
+
+ISO 2022
+========
+
+ This section briefly describes the ISO 2022 encoding standard. A
+more thorough treatment is available in the original document of ISO
+2022 as well as various national standards (such as JIS X 0202).
+
+ Character sets ("charsets") are classified into the following four
+categories, according to the number of characters in the charset:
+94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means
+that although an ISO 2022 coding system may have variable width
+characters, each charset used is fixed-width (in contrast to the MULE
+character set and UTF-8, for example).
+
+ ISO 2022 provides for switching between character sets via escape
+sequences. This switching is somewhat complicated, because ISO 2022
+provides for both legacy applications like Internet mail that accept
+only 7 significant bits in some contexts (RFC 822 headers, for example),
+and more modern "8-bit clean" applications. It also provides for
+compact and transparent representation of languages like Japanese which
+mix ASCII and a national script (even outside of computer programs).
+
+ First, ISO 2022 codified prevailing practice by dividing the code
+space into "control" and "graphic" regions. The code points 0x00-0x1F
+and 0x80-0x9F are reserved for "control characters", while "graphic
+characters" must be assigned to code points in the regions 0x20-0x7F and
+0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some
+circumstances must be assigned the graphic character "ASCII SPACE" and
+the control character "ASCII DEL" respectively.
+
+ The various regions are given the name C0 (0x00-0x1F), GL
+(0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for
+"graphic left" and "graphic right", respectively, because of the
+standard method of displaying graphic character sets in tables with the
+high byte indexing columns and the low byte indexing rows. I don't
+find it very intuitive, but these are called "registers".
+
+ An ISO 2022-conformant encoding for a graphic character set must use
+a fixed number of bytes per character, and the values must fit into a
+single register; that is, each byte must range over either 0x20-0x7F, or
+0xA0-0xFF. It is not allowed to extend the range of the repertoire of a
+character set by using both ranges at the same. This is why a standard
+character set such as ISO 8859-1 is actually considered by ISO 2022 to
+be an aggregation of two character sets, ASCII and LATIN-1, and why it
+is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a
+single character's bytes must all be drawn from the same register; this
+is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO
+2022-compatible encodings.
+
+ The reason for this restriction becomes clear when you attempt to
+define an efficient, robust encoding for a language like Japanese.
+Like ISO 8859, Japanese encodings are aggregations of several character
+sets. In practice, the vast majority of characters are drawn from the
+"JIS Roman" character set (a derivative of ASCII; it won't hurt to
+think of it as ASCII) and the JIS X 0208 standard "basic Japanese"
+character set including not only ideographic characters ("kanji") but
+syllabic Japanese characters ("kana"), a wide variety of symbols, and
+many alphabetic characters (Roman, Greek, and Cyrillic) as well.
+Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code
+it is not suited to programming; thus the inclusion of ASCII in the
+standard Japanese encodings.
+
+ For normal Japanese text such as in newspapers, a broad repertoire of
+approximately 3000 characters is used. Evidently this won't fit into
+one byte; two must be used. But much of the text processed by Japanese
+computers is computer source code, nearly all of which is ASCII. A not
+insignificant portion of ordinary text is English (as such or as
+borrowed Japanese vocabulary) or other languages which can represented
+at least approximately in ASCII, as well. It seems reasonable then to
+represent ASCII in one byte, and JIS X 0208 in two. And this is exactly
+what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is
+invoked to the GL register, and JIS X 0208 is invoked to the GR
+register. Thus, each byte can be tested for its character set by
+looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
+Furthermore, since control characters like newline can never be part of
+a graphic character, even in the case of corruption in transmission the
+stream will be resynchronized at every line break, on the order of 60-80
+bytes. This coding system requires no escape sequences or special
+control codes to represent 99.9% of all Japanese text.
+
+ Note carefully the distinction between the character sets (ASCII and
+JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022).
+The JIS X 0208 character set is used in three different encodings for
+Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
+always clear), in EUC-JP it is invoked into GR (setting the high bit in
+the process), and in Shift JIS the high bit may be set or reset, and the
+significant bits are shifted within the 16-bit character so that the two
+main character sets can coexist with a third (the "halfwidth katakana"
+of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a
+version of the ISO-2022 coding system.
+
+ In order to systematically treat subsidiary character sets (like the
+"halfwidth katakana" already mentioned, and the "supplementary kanji" of
+JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
+Unlike GL and GR, they are not logically distinguished by internal
+format. Instead, the process of "invocation" mentioned earlier is
+broken into two steps: first, a character set is "designated" to one of
+the registers G0-G3 by use of an "escape sequence" of the form:
+
+ ESC [I] I F
+
+ where I is an intermediate character or characters in the range 0x20
+- 0x3F, and F, from the range 0x30-0x7Fm is the final character
+identifying this charset. (Final characters in the range 0x30-0x3F are
+reserved for private use and will never have a publically registered
+meaning.)
+
+ Then that register is "invoked" to either GL or GR, either
+automatically (designations to G0 normally involve invocation to GL as
+well), or by use of shifting (affecting only the following character in
+the data stream) or locking (effective until the next designation or
+locking) control sequences. An encoding conformant to ISO 2022 is
+typically defined by designating the initial contents of the G0-G3
+registers, specifying an 7 or 8 bit environment, and specifying whether
+further designations will be recognized.
+
+ Some examples of character sets and the registered final characters
+F used to designate them:
+
+94-charset
+ ASCII (B), left (J) and right (I) half of JIS X 0201, ...
+
+96-charset
+ Latin-1 (A), Latin-2 (B), Latin-3 (C), ...
+
+94x94-charset
+ GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...
+
+96x96-charset
+ none for the moment
+
+ The meanings of the various characters in these sequences, where not
+specified by the ISO 2022 standard (such as the ESC character), are
+assigned by "ECMA", the European Computer Manufacturers Association.
+
+ The meaning of intermediate characters are:
+
+ $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
+ ( [0x28]: designate to G0 a 94-charset whose final byte is F.
+ ) [0x29]: designate to G1 a 94-charset whose final byte is F.
+ * [0x2A]: designate to G2 a 94-charset whose final byte is F.
+ + [0x2B]: designate to G3 a 94-charset whose final byte is F.
+ , [0x2C]: designate to G0 a 96-charset whose final byte is F.
+ - [0x2D]: designate to G1 a 96-charset whose final byte is F.
+ . [0x2E]: designate to G2 a 96-charset whose final byte is F.
+ / [0x2F]: designate to G3 a 96-charset whose final byte is F.
+
+ The comma may be used in files read and written only by MULE, as a
+MULE extension, but this is illegal in ISO 2022. (The reason is that
+in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned
+the value SPACE, and 0x7F assigned the value DEL.)
+
+ Here are examples of designations:
+
+ ESC ( B : designate to G0 ASCII
+ ESC - A : designate to G1 Latin-1
+ ESC $ ( A or ESC $ A : designate to G0 GB2312
+ ESC $ ( B or ESC $ B : designate to G0 JISX0208
+ ESC $ ) C : designate to G1 KSC5601
+
+ (The short forms used to designate GB2312 and JIS X 0208 are for
+backwards compatibility; the long forms are preferred.)
+
+ To use a charset designated to G2 or G3, and to use a charset
+designated to G1 in a 7-bit environment, you must explicitly invoke G1,
+G2, or G3 into GL. There are two types of invocation, Locking Shift
+(forever) and Single Shift (one character only).
+
+ Locking Shift is done as follows:
+
+ LS0 or SI (0x0F): invoke G0 into GL
+ LS1 or SO (0x0E): invoke G1 into GL
+ LS2: invoke G2 into GL
+ LS3: invoke G3 into GL
+ LS1R: invoke G1 into GR
+ LS2R: invoke G2 into GR
+ LS3R: invoke G3 into GR
+
+ Single Shift is done as follows:
+
+ SS2 or ESC N: invoke G2 into GL
+ SS3 or ESC O: invoke G3 into GL
+
+ The shift functions (such as LS1R and SS3) are represented by control
+characters (from C1) in 8 bit environments and by escape sequences in 7
+bit environments.
+
+ (#### Ben says: I think the above is slightly incorrect. It appears
+that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N
+and ESC O behave as indicated. The above definitions will not parse
+EUC-encoded text correctly, and it looks like the code in mule-coding.c
+has similar problems.)
+
+ Evidently there are a lot of ISO-2022-compliant ways of encoding
+multilingual text. Now, in the world, there exist many coding systems
+such as X11's Compound Text, Japanese JUNET code, and so-called EUC
+(Extended UNIX Code); all of these are variants of ISO 2022.
+
+ In MULE, we characterize a version of ISO 2022 by the following
+attributes:
+
+ 1. The character sets initially designated to G0 thru G3.
+
+ 2. Whether short form designations are allowed for Japanese and
+ Chinese.
+
+ 3. Whether ASCII should be designated to G0 before control characters.
+
+ 4. Whether ASCII should be designated to G0 at the end of line.
+
+ 5. 7-bit environment or 8-bit environment.
+
+ 6. Whether Locking Shifts are used or not.
+
+ 7. Whether to use ASCII or the variant JIS X 0201-1976-Roman.
+
+ 8. Whether to use JIS X 0208-1983 or the older version JIS X
+ 0208-1976.
+
+ (The last two are only for Japanese.)
+
+ By specifying these attributes, you can create any variant of ISO
+2022.
+
+ Here are several examples:
+
+ ISO-2022-JP -- Coding system used in Japanese email (RFC 1463 #### check).
+ 1. G0 <- ASCII, G1..3 <- never used
+ 2. Yes.
+ 3. Yes.
+ 4. Yes.
+ 5. 7-bit environment
+ 6. No.
+ 7. Use ASCII
+ 8. Use JIS X 0208-1983
+
+ ctext -- X11 Compound Text
+ 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
+ 2. No.
+ 3. No.
+ 4. Yes.
+ 5. 8-bit environment.
+ 6. No.
+ 7. Use ASCII.
+ 8. Use JIS X 0208-1983.
+
+ euc-china -- Chinese EUC. Often called the "GB encoding", but that is
+ technically incorrect.
+ 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
+ 2. No.
+ 3. Yes.
+ 4. Yes.
+ 5. 8-bit environment.
+ 6. No.
+ 7. Use ASCII.
+ 8. Use JIS X 0208-1983.
+
+ ISO-2022-KR -- Coding system used in Korean email.
+ 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
+ 2. No.
+ 3. Yes.
+ 4. Yes.
+ 5. 7-bit environment.
+ 6. Yes.
+ 7. Use ASCII.
+ 8. Use JIS X 0208-1983.
+
+ MULE creates all of these coding systems by default.
+
+\1f
+File: lispref.info, Node: EOL Conversion, Next: Coding System Properties, Prev: ISO 2022, Up: Coding Systems
+
+EOL Conversion
+--------------
+
+`nil'
+ Automatically detect the end-of-line type (LF, CRLF, or CR). Also
+ generate subsidiary coding systems named `NAME-unix', `NAME-dos',
+ and `NAME-mac', that are identical to this coding system but have
+ an EOL-TYPE value of `lf', `crlf', and `cr', respectively.
+
+`lf'
+ The end of a line is marked externally using ASCII LF. Since this
+ is also the way that XEmacs represents an end-of-line internally,
+ specifying this option results in no end-of-line conversion. This
+ is the standard format for Unix text files.
+
+`crlf'
+ The end of a line is marked externally using ASCII CRLF. This is
+ the standard format for MS-DOS text files.
+
+`cr'
+ The end of a line is marked externally using ASCII CR. This is the
+ standard format for Macintosh text files.
+
+`t'
+ Automatically detect the end-of-line type but do not generate
+ subsidiary coding systems. (This value is converted to `nil' when
+ stored internally, and `coding-system-property' will return `nil'.)
+
+\1f