X-Git-Url: http://git.chise.org/gitweb/?a=blobdiff_plain;f=info%2Flispref.info-44;h=fc9d763e4a37632684f0b2d67af1f3ea28e007d6;hb=25c73cdbede669f1ef7e9d1f6f8a612ecb463b9c;hp=ace6584a35f1c2e2b53a33ff398498ad951640a5;hpb=f52a96980ed9280f8f906a20d4b899dc0b027644;p=chise%2Fxemacs-chise.git diff --git a/info/lispref.info-44 b/info/lispref.info-44 index ace6584..fc9d763 100644 --- a/info/lispref.info-44 +++ b/info/lispref.info-44 @@ -1,4 +1,4 @@ -This is ../info/lispref.info, produced by makeinfo version 4.0 from +This is ../info/lispref.info, produced by makeinfo version 4.0b from lispref/lispref.texi. INFO-DIR-SECTION XEmacs Editor @@ -50,1034 +50,985 @@ may be included in a translation approved by the Free Software Foundation instead of in the original English.  -File: lispref.info, Node: Coding System Properties, Next: Basic Coding System Functions, Prev: EOL Conversion, Up: Coding Systems - -Coding System Properties ------------------------- - -`mnemonic' - String to be displayed in the modeline when this coding system is - active. - -`eol-type' - End-of-line conversion to be used. It should be one of the types - listed in *Note EOL Conversion::. - -`eol-lf' - The coding system which is the same as this one, except that it - uses the Unix line-breaking convention. - -`eol-crlf' - The coding system which is the same as this one, except that it - uses the DOS line-breaking convention. - -`eol-cr' - The coding system which is the same as this one, except that it - uses the Macintosh line-breaking convention. - -`post-read-conversion' - Function called after a file has been read in, to perform the - decoding. Called with two arguments, BEG and END, denoting a - region of the current buffer to be decoded. - -`pre-write-conversion' - Function called before a file is written out, to perform the - encoding. Called with two arguments, BEG and END, denoting a - region of the current buffer to be encoded. - - The following additional properties are recognized if TYPE is -`iso2022': - -`charset-g0' -`charset-g1' -`charset-g2' -`charset-g3' - The character set initially designated to the G0 - G3 registers. - The value should be one of - - * A charset object (designate that character set) - - * `nil' (do not ever use this register) - - * `t' (no character set is initially designated to the - register, but may be later on; this automatically sets the - corresponding `force-g*-on-output' property) - -`force-g0-on-output' -`force-g1-on-output' -`force-g2-on-output' -`force-g3-on-output' - If non-`nil', send an explicit designation sequence on output - before using the specified register. - -`short' - If non-`nil', use the short forms `ESC $ @', `ESC $ A', and `ESC $ - B' on output in place of the full designation sequences `ESC $ ( - @', `ESC $ ( A', and `ESC $ ( B'. - -`no-ascii-eol' - If non-`nil', don't designate ASCII to G0 at each end of line on - output. Setting this to non-`nil' also suppresses other - state-resetting that normally happens at the end of a line. - -`no-ascii-cntl' - If non-`nil', don't designate ASCII to G0 before control chars on - output. - -`seven' - If non-`nil', use 7-bit environment on output. Otherwise, use - 8-bit environment. - -`lock-shift' - If non-`nil', use locking-shift (SO/SI) instead of single-shift or - designation by escape sequence. - -`no-iso6429' - If non-`nil', don't use ISO6429's direction specification. - -`escape-quoted' - If non-nil, literal control characters that are the same as the - beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in - particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 - (0x8F), and CSI (0x9B)) are "quoted" with an escape character so - that they can be properly distinguished from an escape sequence. - (Note that doing this results in a non-portable encoding.) This - encoding flag is used for byte-compiled files. Note that ESC is a - good choice for a quoting character because there are no escape - sequences whose second byte is a character from the Control-0 or - Control-1 character sets; this is explicitly disallowed by the ISO - 2022 standard. - -`input-charset-conversion' - A list of conversion specifications, specifying conversion of - characters in one charset to another when decoding is performed. - Each specification is a list of two elements: the source charset, - and the destination charset. - -`output-charset-conversion' - A list of conversion specifications, specifying conversion of - characters in one charset to another when encoding is performed. - The form of each specification is the same as for - `input-charset-conversion'. - - The following additional properties are recognized (and required) if -TYPE is `ccl': - -`decode' - CCL program used for decoding (converting to internal format). - -`encode' - CCL program used for encoding (converting to external format). - - The following properties are used internally: EOL-CR, EOL-CRLF, -EOL-LF, and BASE. +File: lispref.info, Node: Internationalization Terminology, Next: Charsets, Up: MULE + +Internationalization Terminology +================================ + + In internationalization terminology, a string of text is divided up +into "characters", which are the printable units that make up the text. +A single character is (for example) a capital `A', the number `2', a +Katakana character, a Hangul character, a Kanji ideograph (an +"ideograph" is a "picture" character, such as is used in Japanese +Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands +of such ideographs in each language), etc. The basic property of a +character is that it is the smallest unit of text with semantic +significance in text processing. + + Human beings normally process text visually, so to a first +approximation a character may be identified with its shape. Note that +the same character may be drawn by two different people (or in two +different fonts) in slightly different ways, although the "basic shape" +will be the same. But consider the works of Scott Kim; human beings +can recognize hugely variant shapes as the "same" character. +Sometimes, especially where characters are extremely complicated to +write, completely different shapes may be defined as the "same" +character in national standards. The Taiwanese variant of Hanzi is +generally the most complicated; over the centuries, the Japanese, +Koreans, and the People's Republic of China have adopted +simplifications of the shape, but the line of descent from the original +shape is recorded, and the meanings and pronunciation of different +forms of the same character are considered to be identical within each +language. (Of course, it may take a specialist to recognize the +related form; the point is that the relations are standardized, despite +the differing shapes.) + + In some cases, the differences will be significant enough that it is +actually possible to identify two or more distinct shapes that both +represent the same character. For example, the lowercase letters `a' +and `g' each have two distinct possible shapes--the `a' can optionally +have a curved tail projecting off the top, and the `g' can be formed +either of two loops, or of one loop and a tail hanging off the bottom. +Such distinct possible shapes of a character are called "glyphs". The +important characteristic of two glyphs making up the same character is +that the choice between one or the other is purely stylistic and has no +linguistic effect on a word (this is the reason why a capital `A' and +lowercase `a' are different characters rather than different +glyphs--e.g. `Aspen' is a city while `aspen' is a kind of tree). + + Note that "character" and "glyph" are used differently here than +elsewhere in XEmacs. + + A "character set" is essentially a set of related characters. ASCII, +for example, is a set of 94 characters (or 128, if you count +non-printing characters). Other character sets are ISO8859-1 (ASCII +plus various accented characters and other international symbols), JIS +X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208 +(Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji), +GB2312 (Mainland Chinese Hanzi), etc. + + The definition of a character set will implicitly or explicitly give +it an "ordering", a way of assigning a number to each character in the +set. For many character sets, there is a natural ordering, for example +the "ABC" ordering of the Roman letters. But it is not clear whether +digits should come before or after the letters, and in fact different +European languages treat the ordering of accented characters +differently. It is useful to use the natural order where available, of +course. The number assigned to any particular character is called the +character's "code point". (Within a given character set, each +character has a unique code point. Thus the word "set" is ill-chosen; +different orderings of the same characters are different character sets. +Identifying characters is simple enough for alphabetic character sets, +but the difference in ordering can cause great headaches when the same +thousands of characters are used by different cultures as in the Hanzi.) + + A code point may be broken into a number of "position codes". The +number of position codes required to index a particular character in a +character set is called the "dimension" of the character set. For +practical purposes, a position code may be thought of as a byte-sized +index. The printing characters of ASCII, being a relatively small +character set, is of dimension one, and each character in the set is +indexed using a single position code, in the range 1 through 94. Use of +this unusual range, rather than the familiar 33 through 126, is an +intentional abstraction; to understand the programming issues you must +break the equation between character sets and encodings. + + JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is +of dimension two - every character is indexed by two position codes, +each in the range 1 through 94. (This number "94" is not a +coincidence; we shall see that the JIS position codes were chosen so +that JIS kanji could be encoded without using codes that in ASCII are +associated with device control functions.) Note that the choice of the +range here is somewhat arbitrary. You could just as easily index the +printing characters in ASCII using numbers in the range 0 through 93, 2 +through 95, 3 through 96, etc. In fact, the standardized _encoding_ +for the ASCII _character set_ uses the range 33 through 126. + + An "encoding" is a way of numerically representing characters from +one or more character sets into a stream of like-sized numerical values +called "words"; typically these are 8-bit, 16-bit, or 32-bit +quantities. If an encoding encompasses only one character set, then the +position codes for the characters in that character set could be used +directly. (This is the case with the trivial cipher used by children, +assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII, +other considerations intrude. For example, why are the upper- and +lowercase alphabets separated by 8 characters? Why do the digits start +with `0' being assigned the code 48? In both cases because semantically +interesting operations (case conversion and numerical value extraction) +become convenient masking operations. Other artificial aspects (the +control characters being assigned to codes 0-31 and 127) are historical +accidents. (The use of 127 for `DEL' is an artifact of the "punch +once" nature of paper tape, for example.) + + Naive use of the position code is not possible, however, if more than +one character set is to be used in the encoding. For example, printed +Japanese text typically requires characters from multiple character sets +- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is +indexed using one or more position codes in the range 1 through 94, so +the position codes could not be used directly or there would be no way +to tell which character was meant. Different Japanese encodings handle +this differently - JIS uses special escape characters to denote +different character sets; EUC sets the high bit of the position codes +for JIS X 0208 and JIS X 0212, and puts a special extra byte before each +JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings +you will encounter in files are 7-bit or 8-bit encodings. There is one +common 16-bit encoding, which is Unicode; this strives to represent all +the world's characters in a single large character set. 32-bit +encodings are often used internally in programs, such as XEmacs with +MULE support, to simplify the code that manipulates them; however, they +are not used externally because they are not very space-efficient.) + + A general method of handling text using multiple character sets +(whether for multilingual text, or simply text in an extremely +complicated single language like Japanese) is defined in the +international standard ISO 2022. ISO 2022 will be discussed in more +detail later (*note ISO 2022::), but for now suffice it to say that text +needs control functions (at least spacing), and if escape sequences are +to be used, an escape sequence introducer. It was decided to make all +text streams compatible with ASCII in the sense that the codes 0-31 +(and 128-159) would always be control codes, never graphic characters, +and where defined by the character set the `SPC' character would be +assigned code 32, and `DEL' would be assigned 127. Thus there are 94 +code points remaining if 7 bits are used. This is the reason that most +character sets are defined using position codes in the range 1 through +94. Then ISO 2022 compatible encodings are produced by shifting the +position codes 1 to 94 into character codes 33 to 126, or (if 8 bit +codes are available) into character codes 161 to 254. + + Encodings are classified as either "modal" or "non-modal". In a +"modal encoding", there are multiple states that the encoding can be +in, and the interpretation of the values in the stream depends on the +current global state of the encoding. Special values in the encoding, +called "escape sequences", are used to change the global state. JIS, +for example, is a modal encoding. The bytes `ESC $ B' indicate that, +from then on, bytes are to be interpreted as position codes for JIS X +0208, rather than as ASCII. This effect is cancelled using the bytes +`ESC ( B', which mean "switch from whatever the current state is to +ASCII". To switch to JIS X 0212, the escape sequence `ESC $ ( D'. +(Note that here, as is common, the escape sequences do in fact begin +with `ESC'. This is not necessarily the case, however. Some encodings +use control characters called "locking shifts" (effect persists until +cancelled) to switch character sets.) + + A "non-modal encoding" has no global state that extends past the +character currently being interpreted. EUC, for example, is a +non-modal encoding. Characters in JIS X 0208 are encoded by setting +the high bit of the position codes, and characters in JIS X 0212 are +encoded by doing the same but also prefixing the character with the +byte 0x8F. + + The advantage of a modal encoding is that it is generally more +space-efficient, and is easily extendible because there are essentially +an arbitrary number of escape sequences that can be created. The +disadvantage, however, is that it is much more difficult to work with +if it is not being processed in a sequential manner. In the non-modal +EUC encoding, for example, the byte 0x41 always refers to the letter +`A'; whereas in JIS, it could either be the letter `A', or one of the +two position codes in a JIS X 0208 character, or one of the two +position codes in a JIS X 0212 character. Determining exactly which +one is meant could be difficult and time-consuming if the previous +bytes in the string have not already been processed, or impossible if +they are drawn from an external stream that cannot be rewound. + + Non-modal encodings are further divided into "fixed-width" and +"variable-width" formats. A fixed-width encoding always uses the same +number of words per character, whereas a variable-width encoding does +not. EUC is a good example of a variable-width encoding: one to three +bytes are used per character, depending on the character set. 16-bit +and 32-bit encodings are nearly always fixed-width, and this is in fact +one of the main reasons for using an encoding with a larger word size. +The advantages of fixed-width encodings should be obvious. The +advantages of variable-width encodings are that they are generally more +space-efficient and allow for compatibility with existing 8-bit +encodings such as ASCII. (For example, in Unicode ASCII characters are +simply promoted to a 16-bit representation. That means that every +ASCII character contains a `NUL' byte; evidently all of the standard +string manipulation functions will lose badly in a fixed-width Unicode +environment.) + + The bytes in an 8-bit encoding are often referred to as "octets" +rather than simply as bytes. This terminology dates back to the days +before 8-bit bytes were universal, when some computers had 9-bit bytes, +others had 10-bit bytes, etc.  -File: lispref.info, Node: Basic Coding System Functions, Next: Coding System Property Functions, Prev: Coding System Properties, Up: Coding Systems +File: lispref.info, Node: Charsets, Next: MULE Characters, Prev: Internationalization Terminology, Up: MULE -Basic Coding System Functions ------------------------------ +Charsets +======== - - Function: find-coding-system coding-system-or-name - This function retrieves the coding system of the given name. + A "charset" in MULE is an object that encapsulates a particular +character set as well as an ordering of those characters. Charsets are +permanent objects and are named using symbols, like faces. - If CODING-SYSTEM-OR-NAME is a coding-system object, it is simply - returned. Otherwise, CODING-SYSTEM-OR-NAME should be a symbol. - If there is no such coding system, `nil' is returned. Otherwise - the associated coding system object is returned. + - Function: charsetp object + This function returns non-`nil' if OBJECT is a charset. - - Function: get-coding-system name - This function retrieves the coding system of the given name. Same - as `find-coding-system' except an error is signalled if there is no - such coding system instead of returning `nil'. - - - Function: coding-system-list - This function returns a list of the names of all defined coding - systems. - - - Function: coding-system-name coding-system - This function returns the name of the given coding system. - - - Function: coding-system-base coding-system - Returns the base coding system (undecided EOL convention) coding - system. - - - Function: make-coding-system name type &optional doc-string props - This function registers symbol NAME as a coding system. - - TYPE describes the conversion method used and should be one of the - types listed in *Note Coding System Types::. - - DOC-STRING is a string describing the coding system. - - PROPS is a property list, describing the specific nature of the - character set. Recognized properties are as in *Note Coding - System Properties::. - - - Function: copy-coding-system old-coding-system new-name - This function copies OLD-CODING-SYSTEM to NEW-NAME. If NEW-NAME - does not name an existing coding system, a new one will be created. +* Menu: - - Function: subsidiary-coding-system coding-system eol-type - This function returns the subsidiary coding system of - CODING-SYSTEM with eol type EOL-TYPE. +* Charset Properties:: Properties of a charset. +* Basic Charset Functions:: Functions for working with charsets. +* Charset Property Functions:: Functions for accessing charset properties. +* Predefined Charsets:: Predefined charset objects.  -File: lispref.info, Node: Coding System Property Functions, Next: Encoding and Decoding Text, Prev: Basic Coding System Functions, Up: Coding Systems - -Coding System Property Functions --------------------------------- - - - Function: coding-system-doc-string coding-system - This function returns the doc string for CODING-SYSTEM. +File: lispref.info, Node: Charset Properties, Next: Basic Charset Functions, Up: Charsets + +Charset Properties +------------------ + + Charsets have the following properties: + +`name' + A symbol naming the charset. Every charset must have a different + name; this allows a charset to be referred to using its name + rather than the actual charset object. + +`doc-string' + A documentation string describing the charset. + +`registry' + A regular expression matching the font registry field for this + character set. For example, both the `ascii' and `latin-iso8859-1' + charsets use the registry `"ISO8859-1"'. This field is used to + choose an appropriate font when the user gives a general font + specification such as `-*-courier-medium-r-*-140-*', i.e. a + 14-point upright medium-weight Courier font. + +`dimension' + Number of position codes used to index a character in the + character set. XEmacs/MULE can only handle character sets of + dimension 1 or 2. This property defaults to 1. + +`chars' + Number of characters in each dimension. In XEmacs/MULE, the only + allowed values are 94 or 96. (There are a couple of pre-defined + character sets, such as ASCII, that do not follow this, but you + cannot define new ones like this.) Defaults to 94. Note that if + the dimension is 2, the character set thus described is 94x94 or + 96x96. + +`columns' + Number of columns used to display a character in this charset. + Only used in TTY mode. (Under X, the actual width of a character + can be derived from the font used to display the characters.) If + unspecified, defaults to the dimension. (This is almost always the + correct value, because character sets with dimension 2 are usually + ideograph character sets, which need two columns to display the + intricate ideographs.) + +`direction' + A symbol, either `l2r' (left-to-right) or `r2l' (right-to-left). + Defaults to `l2r'. This specifies the direction that the text + should be displayed in, and will be left-to-right for most + charsets but right-to-left for Hebrew and Arabic. (Right-to-left + display is not currently implemented.) + +`final' + Final byte of the standard ISO 2022 escape sequence designating + this charset. Must be supplied. Each combination of (DIMENSION, + CHARS) defines a separate namespace for final bytes, and each + charset within a particular namespace must have a different final + byte. Note that ISO 2022 restricts the final byte to the range + 0x30 - 0x7E if dimension == 1, and 0x30 - 0x5F if dimension == 2. + Note also that final bytes in the range 0x30 - 0x3F are reserved + for user-defined (not official) character sets. For more + information on ISO 2022, see *Note Coding Systems::. + +`graphic' + 0 (use left half of font on output) or 1 (use right half of font on + output). Defaults to 0. This specifies how to convert the + position codes that index a character in a character set into an + index into the font used to display the character set. With + `graphic' set to 0, position codes 33 through 126 map to font + indices 33 through 126; with it set to 1, position codes 33 + through 126 map to font indices 161 through 254 (i.e. the same + number but with the high bit set). For example, for a font whose + registry is ISO8859-1, the left half of the font (octets 0x20 - + 0x7F) is the `ascii' charset, while the right half (octets 0xA0 - + 0xFF) is the `latin-iso8859-1' charset. + +`ccl-program' + A compiled CCL program used to convert a character in this charset + into an index into the font. This is in addition to the `graphic' + property. If a CCL program is defined, the position codes of a + character will first be processed according to `graphic' and then + passed through the CCL program, with the resulting values used to + index the font. + + This is used, for example, in the Big5 character set (used in + Taiwan). This character set is not ISO-2022-compliant, and its + size (94x157) does not fit within the maximum 96x96 size of + ISO-2022-compliant character sets. As a result, XEmacs/MULE + splits it (in a rather complex fashion, so as to group the most + commonly used characters together) into two charset objects + (`big5-1' and `big5-2'), each of size 94x94, and each charset + object uses a CCL program to convert the modified position codes + back into standard Big5 indices to retrieve a character from a + Big5 font. + + Most of the above properties can only be set when the charset is +initialized, and cannot be changed later. *Note Charset Property +Functions::. - - Function: coding-system-type coding-system - This function returns the type of CODING-SYSTEM. - - - Function: coding-system-property coding-system prop - This function returns the PROP property of CODING-SYSTEM. + +File: lispref.info, Node: Basic Charset Functions, Next: Charset Property Functions, Prev: Charset Properties, Up: Charsets + +Basic Charset Functions +----------------------- + + - Function: find-charset charset-or-name + This function retrieves the charset of the given name. If + CHARSET-OR-NAME is a charset object, it is simply returned. + Otherwise, CHARSET-OR-NAME should be a symbol. If there is no + such charset, `nil' is returned. Otherwise the associated charset + object is returned. + + - Function: get-charset name + This function retrieves the charset of the given name. Same as + `find-charset' except an error is signalled if there is no such + charset instead of returning `nil'. + + - Function: charset-list + This function returns a list of the names of all defined charsets. + + - Function: make-charset name doc-string props + This function defines a new character set. This function is for + use with MULE support. NAME is a symbol, the name by which the + character set is normally referred. DOC-STRING is a string + describing the character set. PROPS is a property list, + describing the specific nature of the character set. The + recognized properties are `registry', `dimension', `columns', + `chars', `final', `graphic', `direction', and `ccl-program', as + previously described. + + - Function: make-reverse-direction-charset charset new-name + This function makes a charset equivalent to CHARSET but which goes + in the opposite direction. NEW-NAME is the name of the new + charset. The new charset is returned. + + - Function: charset-from-attributes dimension chars final &optional + direction + This function returns a charset with the given DIMENSION, CHARS, + FINAL, and DIRECTION. If DIRECTION is omitted, both directions + will be checked (left-to-right will be returned if character sets + exist for both directions). + + - Function: charset-reverse-direction-charset charset + This function returns the charset (if any) with the same dimension, + number of characters, and final byte as CHARSET, but which is + displayed in the opposite direction.  -File: lispref.info, Node: Encoding and Decoding Text, Next: Detection of Textual Encoding, Prev: Coding System Property Functions, Up: Coding Systems +File: lispref.info, Node: Charset Property Functions, Next: Predefined Charsets, Prev: Basic Charset Functions, Up: Charsets -Encoding and Decoding Text +Charset Property Functions -------------------------- - - Function: decode-coding-region start end coding-system &optional - buffer - This function decodes the text between START and END which is - encoded in CODING-SYSTEM. This is useful if you've read in - encoded text from a file without decoding it (e.g. you read in a - JIS-formatted file but used the `binary' or `no-conversion' coding - system, so that it shows up as `^[$B!> | <8 | >8 | // - | < | > | == | <= | >= | != | de-sjis | en-sjis -ASSIGNMENT_OPERATOR := - += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>= -ARRAY := '[' integer ... ']' +File: lispref.info, Node: Composite Characters, Next: Coding Systems, Prev: MULE Characters, Up: MULE + +Composite Characters +==================== + + Composite characters are not yet completely implemented. + + - Function: make-composite-char string + This function converts a string into a single composite character. + The character is the result of overstriking all the characters in + the string. + + - Function: composite-char-string character + This function returns a string of the characters comprising a + composite character. + + - Function: compose-region start end &optional buffer + This function composes the characters in the region from START to + END in BUFFER into one composite character. The composite + character replaces the composed characters. BUFFER defaults to + the current buffer if omitted. + + - Function: decompose-region start end &optional buffer + This function decomposes any composite characters in the region + from START to END in BUFFER. This converts each composite + character into one or more characters, the individual characters + out of which the composite character was formed. Non-composite + characters are left as-is. BUFFER defaults to the current buffer + if omitted.  -File: lispref.info, Node: CCL Statements, Next: CCL Expressions, Prev: CCL Syntax, Up: CCL - -CCL Statements --------------- - - The Emacs Code Conversion Language provides the following statement -types: "set", "if", "branch", "loop", "repeat", "break", "read", -"write", "call", and "end". +File: lispref.info, Node: Coding Systems, Next: CCL, Prev: Composite Characters, Up: MULE -Set statement: +Coding Systems ============== - The "set" statement has three variants with the syntaxes `(REG = -EXPRESSION)', `(REG ASSIGNMENT_OPERATOR EXPRESSION)', and `INTEGER'. -The assignment operator variation of the "set" statement works the same -way as the corresponding C expression statement does. The assignment -operators are `+=', `-=', `*=', `/=', `%=', `&=', `|=', `^=', `<<=', -and `>>=', and they have the same meanings as in C. A "naked integer" -INTEGER is equivalent to a SET statement of the form `(r0 = INTEGER)'. + A coding system is an object that defines how text containing +multiple character sets is encoded into a stream of (typically 8-bit) +bytes. The coding system is used to decode the stream into a series of +characters (which may be from multiple charsets) when the text is read +from a file or process, and is used to encode the text back into the +same format when it is written out to a file or process. -I/O statements: -=============== + For example, many ISO-2022-compliant coding systems (such as Compound +Text, which is used for inter-client data under the X Window System) use +escape sequences to switch between different charsets - Japanese Kanji, +for example, is invoked with `ESC $ ( B'; ASCII is invoked with `ESC ( +B'; and Cyrillic is invoked with `ESC - L'. See `make-coding-system' +for more information. - The "read" statement takes one or more registers as arguments. It -reads one byte (a C char) from the input into each register in turn. - - The "write" takes several forms. In the form `(write REG ...)' it -takes one or more registers as arguments and writes each in turn to the -output. The integer in a register (interpreted as an Emchar) is -encoded to multibyte form (ie, Bufbytes) and written to the current -output buffer. If it is less than 256, it is written as is. The forms -`(write EXPRESSION)' and `(write INTEGER)' are treated analogously. -The form `(write STRING)' writes the constant string to the output. A -"naked string" `STRING' is equivalent to the statement `(write -STRING)'. The form `(write REG ARRAY)' writes the REGth element of the -ARRAY to the output. - -Conditional statements: -======================= - - The "if" statement takes an EXPRESSION, a CCL BLOCK, and an optional -SECOND CCL BLOCK as arguments. If the EXPRESSION evaluates to -non-zero, the first CCL BLOCK is executed. Otherwise, if there is a -SECOND CCL BLOCK, it is executed. - - The "read-if" variant of the "if" statement takes an EXPRESSION, a -CCL BLOCK, and an optional SECOND CCL BLOCK as arguments. The -EXPRESSION must have the form `(REG OPERATOR OPERAND)' (where OPERAND is -a register or an integer). The `read-if' statement first reads from -the input into the first register operand in the EXPRESSION, then -conditionally executes a CCL block just as the `if' statement does. - - The "branch" statement takes an EXPRESSION and one or more CCL -blocks as arguments. The CCL blocks are treated as a zero-indexed -array, and the `branch' statement uses the EXPRESSION as the index of -the CCL block to execute. Null CCL blocks may be used as no-ops, -continuing execution with the statement following the `branch' -statement in the containing CCL block. Out-of-range values for the -EXPRESSION are also treated as no-ops. - - The "read-branch" variant of the "branch" statement takes an -REGISTER, a CCL BLOCK, and an optional SECOND CCL BLOCK as arguments. -The `read-branch' statement first reads from the input into the -REGISTER, then conditionally executes a CCL block just as the `branch' -statement does. - -Loop control statements: -======================== - - The "loop" statement creates a block with an implied jump from the -end of the block back to its head. The loop is exited on a `break' -statement, and continued without executing the tail by a `repeat' -statement. - - The "break" statement, written `(break)', terminates the current -loop and continues with the next statement in the current block. - - The "repeat" statement has three variants, `repeat', `write-repeat', -and `write-read-repeat'. Each continues the current loop from its -head, possibly after performing I/O. `repeat' takes no arguments and -does no I/O before jumping. `write-repeat' takes a single argument (a -register, an integer, or a string), writes it to the output, then jumps. -`write-read-repeat' takes one or two arguments. The first must be a -register. The second may be an integer or an array; if absent, it is -implicitly set to the first (register) argument. `write-read-repeat' -writes its second argument to the output, then reads from the input -into the register, and finally jumps. See the `write' and `read' -statements for the semantics of the I/O operations for each type of -argument. - -Other control statements: -========================= - - The "call" statement, written `(call CCL-PROGRAM-NAME)', executes a -CCL program as a subroutine. It does not return a value to the caller, -but can modify the register status. - - The "end" statement, written `(end)', terminates the CCL program -successfully, and returns to caller (which may be a CCL program). It -does not alter the status of the registers. + Coding systems are normally identified using a symbol, and the +symbol is accepted in place of the actual coding system object whenever +a coding system is called for. (This is similar to how faces and +charsets work.) - -File: lispref.info, Node: CCL Expressions, Next: Calling CCL, Prev: CCL Statements, Up: CCL - -CCL Expressions ---------------- - - CCL, unlike Lisp, uses infix expressions. The simplest CCL -expressions consist of a single OPERAND, either a register (one of `r0', -..., `r0') or an integer. Complex expressions are lists of the form `( -EXPRESSION OPERATOR OPERAND )'. Unlike C, assignments are not -expressions. - - In the following table, X is the target resister for a "set". In -subexpressions, this is implicitly `r7'. This means that `>8', `//', -`de-sjis', and `en-sjis' cannot be used freely in subexpressions, since -they return parts of their values in `r7'. Y may be an expression, -register, or integer, while Z must be a register or an integer. - -Name Operator Code C-like Description -CCL_PLUS `+' 0x00 X = Y + Z -CCL_MINUS `-' 0x01 X = Y - Z -CCL_MUL `*' 0x02 X = Y * Z -CCL_DIV `/' 0x03 X = Y / Z -CCL_MOD `%' 0x04 X = Y % Z -CCL_AND `&' 0x05 X = Y & Z -CCL_OR `|' 0x06 X = Y | Z -CCL_XOR `^' 0x07 X = Y ^ Z -CCL_LSH `<<' 0x08 X = Y << Z -CCL_RSH `>>' 0x09 X = Y >> Z -CCL_LSH8 `<8' 0x0A X = (Y << 8) | Z -CCL_RSH8 `>8' 0x0B X = Y >> 8, r[7] = Y & 0xFF -CCL_DIVMOD `//' 0x0C X = Y / Z, r[7] = Y % Z -CCL_LS `<' 0x10 X = (X < Y) -CCL_GT `>' 0x11 X = (X > Y) -CCL_EQ `==' 0x12 X = (X == Y) -CCL_LE `<=' 0x13 X = (X <= Y) -CCL_GE `>=' 0x14 X = (X >= Y) -CCL_NE `!=' 0x15 X = (X != Y) -CCL_ENCODE_SJIS `en-sjis' 0x16 X = HIGHER_BYTE (SJIS (Y, Z)) - r[7] = LOWER_BYTE (SJIS (Y, Z) -CCL_DECODE_SJIS `de-sjis' 0x17 X = HIGHER_BYTE (DE-SJIS (Y, Z)) - r[7] = LOWER_BYTE (DE-SJIS (Y, Z)) - - The CCL operators are as in C, with the addition of CCL_LSH8, -CCL_RSH8, CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The -CCL_ENCODE_SJIS and CCL_DECODE_SJIS treat their first and second bytes -as the high and low bytes of a two-byte character code. (SJIS stands -for Shift JIS, an encoding of Japanese characters used by Microsoft. -CCL_ENCODE_SJIS is a complicated transformation of the Japanese -standard JIS encoding to Shift JIS. CCL_DECODE_SJIS is its inverse.) -It is somewhat odd to represent the SJIS operations in infix form. - - -File: lispref.info, Node: Calling CCL, Next: CCL Examples, Prev: CCL Expressions, Up: CCL - -Calling CCL ------------ - - CCL programs are called automatically during Emacs buffer I/O when -the external representation has a coding system type of `shift-jis', -`big5', or `ccl'. The program is specified by the coding system (*note -Coding Systems::). You can also call CCL programs from other CCL -programs, and from Lisp using these functions: - - - Function: ccl-execute ccl-program status - Execute CCL-PROGRAM with registers initialized by STATUS. - CCL-PROGRAM is a vector of compiled CCL code created by - `ccl-compile'. It is an error for the program to try to execute a - CCL I/O command. STATUS must be a vector of nine values, - specifying the initial value for the R0, R1 .. R7 registers and - for the instruction counter IC. A `nil' value for a register - initializer causes the register to be set to 0. A `nil' value for - the IC initializer causes execution to start at the beginning of - the program. When the program is done, STATUS is modified (by - side-effect) to contain the ending values for the corresponding - registers and IC. - - - Function: ccl-execute-on-string ccl-program status str &optional - continue - Execute CCL-PROGRAM with initial STATUS on STRING. CCL-PROGRAM is - a vector of compiled CCL code created by `ccl-compile'. STATUS - must be a vector of nine values, specifying the initial value for - the R0, R1 .. R7 registers and for the instruction counter IC. A - `nil' value for a register initializer causes the register to be - set to 0. A `nil' value for the IC initializer causes execution - to start at the beginning of the program. An optional fourth - argument CONTINUE, if non-nil, causes the IC to remain on the - unsatisfied read operation if the program terminates due to - exhaustion of the input buffer. Otherwise the IC is set to the end - of the program. When the program is done, STATUS is modified (by - side-effect) to contain the ending values for the corresponding - registers and IC. Returns the resulting string. - - To call a CCL program from another CCL program, it must first be -registered: - - - Function: register-ccl-program name ccl-program - Register NAME for CCL program PROGRAM in `ccl-program-table'. - PROGRAM should be the compiled form of a CCL program, or nil. - Return index number of the registered CCL program. - - Information about the processor time used by the CCL interpreter can -be obtained using these functions: - - - Function: ccl-elapsed-time - Returns the elapsed processor time of the CCL interpreter as cons - of user and system time, as floating point numbers measured in - seconds. If only one overall value can be determined, the return - value will be a cons of that value and 0. - - - Function: ccl-reset-elapsed-time - Resets the CCL interpreter's internal elapsed time registers. - - -File: lispref.info, Node: CCL Examples, Prev: Calling CCL, Up: CCL + - Function: coding-system-p object + This function returns non-`nil' if OBJECT is a coding system. -CCL Examples ------------- +* Menu: - This section is not yet written. +* Coding System Types:: Classifying coding systems. +* ISO 2022:: An international standard for + charsets and encodings. +* EOL Conversion:: Dealing with different ways of denoting + the end of a line. +* Coding System Properties:: Properties of a coding system. +* Basic Coding System Functions:: Working with coding systems. +* Coding System Property Functions:: Retrieving a coding system's properties. +* Encoding and Decoding Text:: Encoding and decoding text. +* Detection of Textual Encoding:: Determining how text is encoded. +* Big5 and Shift-JIS Functions:: Special functions for these non-standard + encodings. +* Predefined Coding Systems:: Coding systems implemented by MULE.  -File: lispref.info, Node: Category Tables, Prev: CCL, Up: MULE +File: lispref.info, Node: Coding System Types, Next: ISO 2022, Up: Coding Systems + +Coding System Types +------------------- + + The coding system type determines the basic algorithm XEmacs will +use to decode or encode a data stream. Character encodings will be +converted to the MULE encoding, escape sequences processed, and newline +sequences converted to XEmacs's internal representation. There are +three basic classes of coding system type: no-conversion, ISO-2022, and +special. + + No conversion allows you to look at the file's internal +representation. Since XEmacs is basically a text editor, "no +conversion" does convert newline conventions by default. (Use the +'binary coding-system if this is not desired.) + + ISO 2022 (*note ISO 2022::) is the basic international standard +regulating use of "coded character sets for the exchange of data", ie, +text streams. ISO 2022 contains functions that make it possible to +encode text streams to comply with restrictions of the Internet mail +system and de facto restrictions of most file systems (eg, use of the +separator character in file names). Coding systems which are not ISO +2022 conformant can be difficult to handle. Perhaps more important, +they are not adaptable to multilingual information interchange, with +the obvious exception of ISO 10646 (Unicode). (Unicode is partially +supported by XEmacs with the addition of the Lisp package ucs-conv.) + + The special class of coding systems includes automatic detection, +CCL (a "little language" embedded as an interpreter, useful for +translating between variants of a single character set), +non-ISO-2022-conformant encodings like Unicode, Shift JIS, and Big5, +and MULE internal coding. (NB: this list is based on XEmacs 21.2. +Terminology may vary slightly for other versions of XEmacs and for GNU +Emacs 20.) -Category Tables -=============== +`no-conversion' + No conversion, for binary files, and a few special cases of + non-ISO-2022 coding systems where conversion is done by hook + functions (usually implemented in CCL). On output, graphic + characters that are not in ASCII or Latin-1 will be replaced by a + `?'. (For a no-conversion-encoded buffer, these characters will + only be present if you explicitly insert them.) + +`iso2022' + Any ISO-2022-compliant encoding. Among others, this includes JIS + (the Japanese encoding commonly used for e-mail), national + variants of EUC (the standard Unix encoding for Japanese and other + languages), and Compound Text (an encoding used in X11). You can + specify more specific information about the conversion with the + FLAGS argument. + +`ucs-4' + ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of + Unicode. + +`utf-8' + ISO 10646 UTF-8 encoding. A "file system safe" transformation + format that can be used with both UCS-4 and Unicode. - A category table is a type of char table used for keeping track of -categories. Categories are used for classifying characters for use in -regexps--you can refer to a category rather than having to use a -complicated [] expression (and category lookups are significantly -faster). - - There are 95 different categories available, one for each printable -character (including space) in the ASCII charset. Each category is -designated by one such character, called a "category designator". They -are specified in a regexp using the syntax `\cX', where X is a category -designator. (This is not yet implemented.) - - A category table specifies, for each character, the categories that -the character is in. Note that a character can be in more than one -category. More specifically, a category table maps from a character to -either the value `nil' (meaning the character is in no categories) or a -95-element bit vector, specifying for each of the 95 categories whether -the character is in that category. - - Special Lisp functions are provided that abstract this, so you do not -have to directly manipulate bit vectors. - - - Function: category-table-p obj - This function returns `t' if ARG is a category table. - - - Function: category-table &optional buffer - This function returns the current category table. This is the one - specified by the current buffer, or by BUFFER if it is non-`nil'. - - - Function: standard-category-table - This function returns the standard category table. This is the - one used for new buffers. - - - Function: copy-category-table &optional table - This function constructs a new category table and return it. It - is a copy of the TABLE, which defaults to the standard category - table. - - - Function: set-category-table table &optional buffer - This function selects a new category table for BUFFER. One - argument, a category table. BUFFER defaults to the current buffer - if omitted. +`undecided' + Automatic conversion. XEmacs attempts to detect the coding system + used in the file. - - Function: category-designator-p obj - This function returns `t' if ARG is a category designator (a char - in the range `' '' to `'~''). +`shift-jis' + Shift-JIS (a Japanese encoding commonly used in PC operating + systems). - - Function: category-table-value-p obj - This function returns `t' if ARG is a category table value. Valid - values are `nil' or a bit vector of size 95. +`big5' + Big5 (the encoding commonly used for Taiwanese). + +`ccl' + The conversion is performed using a user-written pseudo-code + program. CCL (Code Conversion Language) is the name of this + pseudo-code. For example, CCL is used to map KOI8-R characters + (an encoding for Russian Cyrillic) to ISO8859-5 (the form used + internally by MULE). + +`internal' + Write out or read in the raw contents of the memory representing + the buffer's text. This is primarily useful for debugging + purposes, and is only enabled when XEmacs has been compiled with + `DEBUG_XEMACS' set (the `--debug' configure option). *Warning*: + Reading in a file using `internal' conversion can result in an + internal inconsistency in the memory representing a buffer's text, + which will produce unpredictable results and may cause XEmacs to + crash. Under normal circumstances you should never use `internal' + conversion.  -File: lispref.info, Node: Tips, Next: Building XEmacs and Object Allocation, Prev: MULE, Up: Top - -Tips and Standards -****************** +File: lispref.info, Node: ISO 2022, Next: EOL Conversion, Prev: Coding System Types, Up: Coding Systems + +ISO 2022 +======== + + This section briefly describes the ISO 2022 encoding standard. A +more thorough treatment is available in the original document of ISO +2022 as well as various national standards (such as JIS X 0202). + + Character sets ("charsets") are classified into the following four +categories, according to the number of characters in the charset: +94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means +that although an ISO 2022 coding system may have variable width +characters, each charset used is fixed-width (in contrast to the MULE +character set and UTF-8, for example). + + ISO 2022 provides for switching between character sets via escape +sequences. This switching is somewhat complicated, because ISO 2022 +provides for both legacy applications like Internet mail that accept +only 7 significant bits in some contexts (RFC 822 headers, for example), +and more modern "8-bit clean" applications. It also provides for +compact and transparent representation of languages like Japanese which +mix ASCII and a national script (even outside of computer programs). + + First, ISO 2022 codified prevailing practice by dividing the code +space into "control" and "graphic" regions. The code points 0x00-0x1F +and 0x80-0x9F are reserved for "control characters", while "graphic +characters" must be assigned to code points in the regions 0x20-0x7F and +0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some +circumstances must be assigned the graphic character "ASCII SPACE" and +the control character "ASCII DEL" respectively. + + The various regions are given the name C0 (0x00-0x1F), GL +(0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for +"graphic left" and "graphic right", respectively, because of the +standard method of displaying graphic character sets in tables with the +high byte indexing columns and the low byte indexing rows. I don't +find it very intuitive, but these are called "registers". + + An ISO 2022-conformant encoding for a graphic character set must use +a fixed number of bytes per character, and the values must fit into a +single register; that is, each byte must range over either 0x20-0x7F, or +0xA0-0xFF. It is not allowed to extend the range of the repertoire of a +character set by using both ranges at the same. This is why a standard +character set such as ISO 8859-1 is actually considered by ISO 2022 to +be an aggregation of two character sets, ASCII and LATIN-1, and why it +is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a +single character's bytes must all be drawn from the same register; this +is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO +2022-compatible encodings. + + The reason for this restriction becomes clear when you attempt to +define an efficient, robust encoding for a language like Japanese. +Like ISO 8859, Japanese encodings are aggregations of several character +sets. In practice, the vast majority of characters are drawn from the +"JIS Roman" character set (a derivative of ASCII; it won't hurt to +think of it as ASCII) and the JIS X 0208 standard "basic Japanese" +character set including not only ideographic characters ("kanji") but +syllabic Japanese characters ("kana"), a wide variety of symbols, and +many alphabetic characters (Roman, Greek, and Cyrillic) as well. +Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code +it is not suited to programming; thus the inclusion of ASCII in the +standard Japanese encodings. + + For normal Japanese text such as in newspapers, a broad repertoire of +approximately 3000 characters is used. Evidently this won't fit into +one byte; two must be used. But much of the text processed by Japanese +computers is computer source code, nearly all of which is ASCII. A not +insignificant portion of ordinary text is English (as such or as +borrowed Japanese vocabulary) or other languages which can represented +at least approximately in ASCII, as well. It seems reasonable then to +represent ASCII in one byte, and JIS X 0208 in two. And this is exactly +what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is +invoked to the GL register, and JIS X 0208 is invoked to the GR +register. Thus, each byte can be tested for its character set by +looking at the high bit; if set, it is Japanese, if clear, it is ASCII. +Furthermore, since control characters like newline can never be part of +a graphic character, even in the case of corruption in transmission the +stream will be resynchronized at every line break, on the order of 60-80 +bytes. This coding system requires no escape sequences or special +control codes to represent 99.9% of all Japanese text. + + Note carefully the distinction between the character sets (ASCII and +JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). +The JIS X 0208 character set is used in three different encodings for +Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is +always clear), in EUC-JP it is invoked into GR (setting the high bit in +the process), and in Shift JIS the high bit may be set or reset, and the +significant bits are shifted within the 16-bit character so that the two +main character sets can coexist with a third (the "halfwidth katakana" +of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a +version of the ISO-2022 coding system. + + In order to systematically treat subsidiary character sets (like the +"halfwidth katakana" already mentioned, and the "supplementary kanji" of +JIS X 0212), four further registers are defined: G0, G1, G2, and G3. +Unlike GL and GR, they are not logically distinguished by internal +format. Instead, the process of "invocation" mentioned earlier is +broken into two steps: first, a character set is "designated" to one of +the registers G0-G3 by use of an "escape sequence" of the form: + + ESC [I] I F + + where I is an intermediate character or characters in the range 0x20 +- 0x3F, and F, from the range 0x30-0x7Fm is the final character +identifying this charset. (Final characters in the range 0x30-0x3F are +reserved for private use and will never have a publicly registered +meaning.) + + Then that register is "invoked" to either GL or GR, either +automatically (designations to G0 normally involve invocation to GL as +well), or by use of shifting (affecting only the following character in +the data stream) or locking (effective until the next designation or +locking) control sequences. An encoding conformant to ISO 2022 is +typically defined by designating the initial contents of the G0-G3 +registers, specifying an 7 or 8 bit environment, and specifying whether +further designations will be recognized. + + Some examples of character sets and the registered final characters +F used to designate them: + +94-charset + ASCII (B), left (J) and right (I) half of JIS X 0201, ... + +96-charset + Latin-1 (A), Latin-2 (B), Latin-3 (C), ... + +94x94-charset + GB2312 (A), JIS X 0208 (B), KSC5601 (C), ... + +96x96-charset + none for the moment + + The meanings of the various characters in these sequences, where not +specified by the ISO 2022 standard (such as the ESC character), are +assigned by "ECMA", the European Computer Manufacturers Association. + + The meaning of intermediate characters are: + + $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). + ( [0x28]: designate to G0 a 94-charset whose final byte is F. + ) [0x29]: designate to G1 a 94-charset whose final byte is F. + * [0x2A]: designate to G2 a 94-charset whose final byte is F. + + [0x2B]: designate to G3 a 94-charset whose final byte is F. + , [0x2C]: designate to G0 a 96-charset whose final byte is F. + - [0x2D]: designate to G1 a 96-charset whose final byte is F. + . [0x2E]: designate to G2 a 96-charset whose final byte is F. + / [0x2F]: designate to G3 a 96-charset whose final byte is F. + + The comma may be used in files read and written only by MULE, as a +MULE extension, but this is illegal in ISO 2022. (The reason is that +in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned +the value SPACE, and 0x7F assigned the value DEL.) + + Here are examples of designations: + + ESC ( B : designate to G0 ASCII + ESC - A : designate to G1 Latin-1 + ESC $ ( A or ESC $ A : designate to G0 GB2312 + ESC $ ( B or ESC $ B : designate to G0 JISX0208 + ESC $ ) C : designate to G1 KSC5601 + + (The short forms used to designate GB2312 and JIS X 0208 are for +backwards compatibility; the long forms are preferred.) + + To use a charset designated to G2 or G3, and to use a charset +designated to G1 in a 7-bit environment, you must explicitly invoke G1, +G2, or G3 into GL. There are two types of invocation, Locking Shift +(forever) and Single Shift (one character only). + + Locking Shift is done as follows: + + LS0 or SI (0x0F): invoke G0 into GL + LS1 or SO (0x0E): invoke G1 into GL + LS2: invoke G2 into GL + LS3: invoke G3 into GL + LS1R: invoke G1 into GR + LS2R: invoke G2 into GR + LS3R: invoke G3 into GR + + Single Shift is done as follows: + + SS2 or ESC N: invoke G2 into GL + SS3 or ESC O: invoke G3 into GL + + The shift functions (such as LS1R and SS3) are represented by control +characters (from C1) in 8 bit environments and by escape sequences in 7 +bit environments. + + (#### Ben says: I think the above is slightly incorrect. It appears +that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N +and ESC O behave as indicated. The above definitions will not parse +EUC-encoded text correctly, and it looks like the code in mule-coding.c +has similar problems.) + + Evidently there are a lot of ISO-2022-compliant ways of encoding +multilingual text. Now, in the world, there exist many coding systems +such as X11's Compound Text, Japanese JUNET code, and so-called EUC +(Extended UNIX Code); all of these are variants of ISO 2022. + + In MULE, we characterize a version of ISO 2022 by the following +attributes: + + 1. The character sets initially designated to G0 thru G3. + + 2. Whether short form designations are allowed for Japanese and + Chinese. + + 3. Whether ASCII should be designated to G0 before control characters. + + 4. Whether ASCII should be designated to G0 at the end of line. + + 5. 7-bit environment or 8-bit environment. + + 6. Whether Locking Shifts are used or not. + + 7. Whether to use ASCII or the variant JIS X 0201-1976-Roman. + + 8. Whether to use JIS X 0208-1983 or the older version JIS X + 0208-1976. + + (The last two are only for Japanese.) + + By specifying these attributes, you can create any variant of ISO +2022. + + Here are several examples: + + ISO-2022-JP -- Coding system used in Japanese email (RFC 1463 #### check). + 1. G0 <- ASCII, G1..3 <- never used + 2. Yes. + 3. Yes. + 4. Yes. + 5. 7-bit environment + 6. No. + 7. Use ASCII + 8. Use JIS X 0208-1983 + + ctext -- X11 Compound Text + 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used. + 2. No. + 3. No. + 4. Yes. + 5. 8-bit environment. + 6. No. + 7. Use ASCII. + 8. Use JIS X 0208-1983. + + euc-china -- Chinese EUC. Often called the "GB encoding", but that is + technically incorrect. + 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used. + 2. No. + 3. Yes. + 4. Yes. + 5. 8-bit environment. + 6. No. + 7. Use ASCII. + 8. Use JIS X 0208-1983. + + ISO-2022-KR -- Coding system used in Korean email. + 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used. + 2. No. + 3. Yes. + 4. Yes. + 5. 7-bit environment. + 6. Yes. + 7. Use ASCII. + 8. Use JIS X 0208-1983. + + MULE creates all of these coding systems by default. - This chapter describes no additional features of XEmacs Lisp. -Instead it gives advice on making effective use of the features -described in the previous chapters. + +File: lispref.info, Node: EOL Conversion, Next: Coding System Properties, Prev: ISO 2022, Up: Coding Systems -* Menu: +EOL Conversion +-------------- -* Style Tips:: Writing clean and robust programs. -* Compilation Tips:: Making compiled code run fast. -* Documentation Tips:: Writing readable documentation strings. -* Comment Tips:: Conventions for writing comments. -* Library Headers:: Standard headers for library packages. +`nil' + Automatically detect the end-of-line type (LF, CRLF, or CR). Also + generate subsidiary coding systems named `NAME-unix', `NAME-dos', + and `NAME-mac', that are identical to this coding system but have + an EOL-TYPE value of `lf', `crlf', and `cr', respectively. + +`lf' + The end of a line is marked externally using ASCII LF. Since this + is also the way that XEmacs represents an end-of-line internally, + specifying this option results in no end-of-line conversion. This + is the standard format for Unix text files. + +`crlf' + The end of a line is marked externally using ASCII CRLF. This is + the standard format for MS-DOS text files. + +`cr' + The end of a line is marked externally using ASCII CR. This is the + standard format for Macintosh text files. + +`t' + Automatically detect the end-of-line type but do not generate + subsidiary coding systems. (This value is converted to `nil' when + stored internally, and `coding-system-property' will return `nil'.)