X-Git-Url: http://git.chise.org/gitweb/?a=blobdiff_plain;f=info%2Flispref.info-44;h=d5399624fc2326729a1446529d82d148b7602f12;hb=811555575ffb2579c51fea46c607d531d4d57990;hp=ace6584a35f1c2e2b53a33ff398498ad951640a5;hpb=f52a96980ed9280f8f906a20d4b899dc0b027644;p=chise%2Fxemacs-chise.git- diff --git a/info/lispref.info-44 b/info/lispref.info-44 index ace6584..d539962 100644 --- a/info/lispref.info-44 +++ b/info/lispref.info-44 @@ -50,6 +50,399 @@ may be included in a translation approved by the Free Software Foundation instead of in the original English.  +File: lispref.info, Node: Coding System Types, Next: ISO 2022, Up: Coding Systems + +Coding System Types +------------------- + + The coding system type determines the basic algorithm XEmacs will +use to decode or encode a data stream. Character encodings will be +converted to the MULE encoding, escape sequences processed, and newline +sequences converted to XEmacs's internal representation. There are +three basic classes of coding system type: no-conversion, ISO-2022, and +special. + + No conversion allows you to look at the file's internal +representation. Since XEmacs is basically a text editor, "no +conversion" does convert newline conventions by default. (Use the +'binary coding-system if this is not desired.) + + ISO 2022 (*note ISO 2022::) is the basic international standard +regulating use of "coded character sets for the exchange of data", ie, +text streams. ISO 2022 contains functions that make it possible to +encode text streams to comply with restrictions of the Internet mail +system and de facto restrictions of most file systems (eg, use of the +separator character in file names). Coding systems which are not ISO +2022 conformant can be difficult to handle. Perhaps more important, +they are not adaptable to multilingual information interchange, with +the obvious exception of ISO 10646 (Unicode). (Unicode is partially +supported by XEmacs with the addition of the Lisp package ucs-conv.) 
+ + The special class of coding systems includes automatic detection, +CCL (a "little language" embedded as an interpreter, useful for +translating between variants of a single character set), +non-ISO-2022-conformant encodings like Unicode, Shift JIS, and Big5, +and MULE internal coding. (NB: this list is based on XEmacs 21.2. +Terminology may vary slightly for other versions of XEmacs and for GNU +Emacs 20.) + +`no-conversion' + No conversion, for binary files, and a few special cases of + non-ISO-2022 coding systems where conversion is done by hook + functions (usually implemented in CCL). On output, graphic + characters that are not in ASCII or Latin-1 will be replaced by a + `?'. (For a no-conversion-encoded buffer, these characters will + only be present if you explicitly insert them.) + +`iso2022' + Any ISO-2022-compliant encoding. Among others, this includes JIS + (the Japanese encoding commonly used for e-mail), national + variants of EUC (the standard Unix encoding for Japanese and other + languages), and Compound Text (an encoding used in X11). You can + specify more specific information about the conversion with the + FLAGS argument. + +`ucs-4' + ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of + Unicode. + +`utf-8' + ISO 10646 UTF-8 encoding. A "file system safe" transformation + format that can be used with both UCS-4 and Unicode. + +`undecided' + Automatic conversion. XEmacs attempts to detect the coding system + used in the file. + +`shift-jis' + Shift-JIS (a Japanese encoding commonly used in PC operating + systems). + +`big5' + Big5 (the encoding commonly used for Taiwanese). + +`ccl' + The conversion is performed using a user-written pseudo-code + program. CCL (Code Conversion Language) is the name of this + pseudo-code. For example, CCL is used to map KOI8-R characters + (an encoding for Russian Cyrillic) to ISO8859-5 (the form used + internally by MULE). 
+ +`internal' + Write out or read in the raw contents of the memory representing + the buffer's text. This is primarily useful for debugging + purposes, and is only enabled when XEmacs has been compiled with + `DEBUG_XEMACS' set (the `--debug' configure option). *Warning*: + Reading in a file using `internal' conversion can result in an + internal inconsistency in the memory representing a buffer's text, + which will produce unpredictable results and may cause XEmacs to + crash. Under normal circumstances you should never use `internal' + conversion. + + +File: lispref.info, Node: ISO 2022, Next: EOL Conversion, Prev: Coding System Types, Up: Coding Systems + +ISO 2022 +======== + + This section briefly describes the ISO 2022 encoding standard. A +more thorough treatment is available in the original document of ISO +2022 as well as various national standards (such as JIS X 0202). + + Character sets ("charsets") are classified into the following four +categories, according to the number of characters in the charset: +94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means +that although an ISO 2022 coding system may have variable width +characters, each charset used is fixed-width (in contrast to the MULE +character set and UTF-8, for example). + + ISO 2022 provides for switching between character sets via escape +sequences. This switching is somewhat complicated, because ISO 2022 +provides for both legacy applications like Internet mail that accept +only 7 significant bits in some contexts (RFC 822 headers, for example), +and more modern "8-bit clean" applications. It also provides for +compact and transparent representation of languages like Japanese which +mix ASCII and a national script (even outside of computer programs). + + First, ISO 2022 codified prevailing practice by dividing the code +space into "control" and "graphic" regions. 
The code points 0x00-0x1F +and 0x80-0x9F are reserved for "control characters", while "graphic +characters" must be assigned to code points in the regions 0x20-0x7F and +0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some +circumstances must be assigned the graphic character "ASCII SPACE" and +the control character "ASCII DEL" respectively. + + The various regions are given the names C0 (0x00-0x1F), GL +(0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for +"graphic left" and "graphic right", respectively, because of the +standard method of displaying graphic character sets in tables with the +high nibble indexing columns and the low nibble indexing rows. I don't +find it very intuitive, but these are called "registers". + + An ISO 2022-conformant encoding for a graphic character set must use +a fixed number of bytes per character, and the values must fit into a +single register; that is, each byte must range over either 0x20-0x7F, or +0xA0-0xFF. It is not allowed to extend the range of the repertoire of a +character set by using both ranges at the same time. This is why a standard +character set such as ISO 8859-1 is actually considered by ISO 2022 to +be an aggregation of two character sets, ASCII and LATIN-1, and why it +is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a +single character's bytes must all be drawn from the same register; this +is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO +2022-compatible encodings. + + The reason for this restriction becomes clear when you attempt to +define an efficient, robust encoding for a language like Japanese. +Like ISO 8859, Japanese encodings are aggregations of several character +sets. 
In practice, the vast majority of characters are drawn from the +"JIS Roman" character set (a derivative of ASCII; it won't hurt to +think of it as ASCII) and the JIS X 0208 standard "basic Japanese" +character set including not only ideographic characters ("kanji") but +syllabic Japanese characters ("kana"), a wide variety of symbols, and +many alphabetic characters (Roman, Greek, and Cyrillic) as well. +Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code +it is not suited to programming; thus the inclusion of ASCII in the +standard Japanese encodings. + + For normal Japanese text such as in newspapers, a broad repertoire of +approximately 3000 characters is used. Evidently this won't fit into +one byte; two must be used. But much of the text processed by Japanese +computers is computer source code, nearly all of which is ASCII. A not +insignificant portion of ordinary text is English (as such or as +borrowed Japanese vocabulary) or other languages which can be represented +at least approximately in ASCII, as well. It seems reasonable then to +represent ASCII in one byte, and JIS X 0208 in two. And this is exactly +what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is +invoked to the GL register, and JIS X 0208 is invoked to the GR +register. Thus, each byte can be tested for its character set by +looking at the high bit; if set, it is Japanese, if clear, it is ASCII. +Furthermore, since control characters like newline can never be part of +a graphic character, even in the case of corruption in transmission the +stream will be resynchronized at every line break, on the order of 60-80 +bytes. This coding system requires no escape sequences or special +control codes to represent 99.9% of all Japanese text. + + Note carefully the distinction between the character sets (ASCII and +JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). 
+The JIS X 0208 character set is used in three different encodings for +Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is +always clear), in EUC-JP it is invoked into GR (setting the high bit in +the process), and in Shift JIS the high bit may be set or reset, and the +significant bits are shifted within the 16-bit character so that the two +main character sets can coexist with a third (the "halfwidth katakana" +of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a +version of the ISO-2022 coding system. + + In order to systematically treat subsidiary character sets (like the +"halfwidth katakana" already mentioned, and the "supplementary kanji" of +JIS X 0212), four further registers are defined: G0, G1, G2, and G3. +Unlike GL and GR, they are not logically distinguished by internal +format. Instead, the process of "invocation" mentioned earlier is +broken into two steps: first, a character set is "designated" to one of +the registers G0-G3 by use of an "escape sequence" of the form: + + ESC [I] I F + + where I is an intermediate character or characters in the range 0x20 +- 0x3F, and F, from the range 0x30-0x7F, is the final character +identifying this charset. (Final characters in the range 0x30-0x3F are +reserved for private use and will never have a publicly registered +meaning.) + + Then that register is "invoked" to either GL or GR, either +automatically (designations to G0 normally involve invocation to GL as +well), or by use of shifting (affecting only the following character in +the data stream) or locking (effective until the next designation or +locking) control sequences. An encoding conformant to ISO 2022 is +typically defined by designating the initial contents of the G0-G3 +registers, specifying a 7- or 8-bit environment, and specifying whether +further designations will be recognized. 
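The high-bit test described above for EUC-JP can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not XEmacs code: the function name `split_euc_jp` and the token labels are invented here, only ASCII (in GL) and JIS X 0208 (in GR) are recognized, and the single-shift bytes that real EUC-JP uses for halfwidth katakana and JIS X 0212 are simply reported as "other".

```python
def split_euc_jp(data: bytes):
    """Split EUC-JP-style bytes into (label, bytes) tokens by the high bit.

    Simplified sketch: handles ASCII/C0 (high bit clear) and two-byte
    JIS X 0208 characters in GR (both bytes 0xA1-0xFE). Anything else,
    including the SS2/SS3 prefixes of real EUC-JP, is labeled "other".
    """
    tokens = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # High bit clear: a single ASCII (or C0 control) byte.
            tokens.append(("ascii", data[i:i + 1]))
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # High bit set on both bytes: one two-byte GR character.
            tokens.append(("jisx0208", data[i:i + 2]))
            i += 2
        else:
            # C1 controls, truncated characters, and so on.
            tokens.append(("other", data[i:i + 1]))
            i += 1
    return tokens


if __name__ == "__main__":
    # One ASCII letter, one two-byte character, another letter, a newline.
    for label, raw in split_euc_jp(b"a\xc6\xfcb\n"):
        print(label, raw.hex())
```

Because a newline always has its high bit clear, a scanner like this falls back into step at the next line break even if a byte is lost mid-character, which is the resynchronization property noted above.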
+ + Some examples of character sets and the registered final characters +F used to designate them: + +94-charset + ASCII (B), left (J) and right (I) half of JIS X 0201, ... + +96-charset + Latin-1 (A), Latin-2 (B), Latin-3 (C), ... + +94x94-charset + GB2312 (A), JIS X 0208 (B), KSC5601 (C), ... + +96x96-charset + none for the moment + + The meanings of the various characters in these sequences, where not +specified by the ISO 2022 standard (such as the ESC character), are +assigned by "ECMA", the European Computer Manufacturers Association. + + The meanings of the intermediate characters are: + + $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96). + ( [0x28]: designate to G0 a 94-charset whose final byte is F. + ) [0x29]: designate to G1 a 94-charset whose final byte is F. + * [0x2A]: designate to G2 a 94-charset whose final byte is F. + + [0x2B]: designate to G3 a 94-charset whose final byte is F. + , [0x2C]: designate to G0 a 96-charset whose final byte is F. + - [0x2D]: designate to G1 a 96-charset whose final byte is F. + . [0x2E]: designate to G2 a 96-charset whose final byte is F. + / [0x2F]: designate to G3 a 96-charset whose final byte is F. + + The comma may be used in files read and written only by MULE, as a +MULE extension, but this is illegal in ISO 2022. (The reason is that +in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned +the value SPACE, and 0x7F assigned the value DEL.) + + Here are examples of designations: + + ESC ( B : designate to G0 ASCII + ESC - A : designate to G1 Latin-1 + ESC $ ( A or ESC $ A : designate to G0 GB2312 + ESC $ ( B or ESC $ B : designate to G0 JISX0208 + ESC $ ) C : designate to G1 KSC5601 + + (The short forms used to designate GB2312 and JIS X 0208 are for +backwards compatibility; the long forms are preferred.) + + To use a charset designated to G2 or G3, and to use a charset +designated to G1 in a 7-bit environment, you must explicitly invoke G1, +G2, or G3 into GL. 
There are two types of invocation, Locking Shift +(forever) and Single Shift (one character only). + + Locking Shift is done as follows: + + LS0 or SI (0x0F): invoke G0 into GL + LS1 or SO (0x0E): invoke G1 into GL + LS2: invoke G2 into GL + LS3: invoke G3 into GL + LS1R: invoke G1 into GR + LS2R: invoke G2 into GR + LS3R: invoke G3 into GR + + Single Shift is done as follows: + + SS2 or ESC N: invoke G2 into GL + SS3 or ESC O: invoke G3 into GL + + The shift functions (such as LS1R and SS3) are represented by control +characters (from C1) in 8 bit environments and by escape sequences in 7 +bit environments. + + (#### Ben says: I think the above is slightly incorrect. It appears +that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N +and ESC O behave as indicated. The above definitions will not parse +EUC-encoded text correctly, and it looks like the code in mule-coding.c +has similar problems.) + + Evidently there are a lot of ISO-2022-compliant ways of encoding +multilingual text. Now, in the world, there exist many coding systems +such as X11's Compound Text, Japanese JUNET code, and so-called EUC +(Extended UNIX Code); all of these are variants of ISO 2022. + + In MULE, we characterize a version of ISO 2022 by the following +attributes: + + 1. The character sets initially designated to G0 thru G3. + + 2. Whether short form designations are allowed for Japanese and + Chinese. + + 3. Whether ASCII should be designated to G0 before control characters. + + 4. Whether ASCII should be designated to G0 at the end of line. + + 5. 7-bit environment or 8-bit environment. + + 6. Whether Locking Shifts are used or not. + + 7. Whether to use ASCII or the variant JIS X 0201-1976-Roman. + + 8. Whether to use JIS X 0208-1983 or the older version JIS X + 0208-1976. + + (The last two are only for Japanese.) + + By specifying these attributes, you can create any variant of ISO +2022. 
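Returning briefly to the designation sequences tabulated earlier in this node: they are regular enough to decode mechanically. The following Python sketch is an illustration only, not the parser in mule-coding.c. The function `parse_designation` and its tuple result are hypothetical names, and the accepted final-byte range (0x30-0x7E) is an assumption of this sketch.

```python
ESC = 0x1B

# Intermediate byte -> register, following the table of intermediate characters.
REGISTER_94 = {0x28: "G0", 0x29: "G1", 0x2A: "G2", 0x2B: "G3"}
# 0x2C (comma) designating a 96-charset to G0 is the MULE-only extension.
REGISTER_96 = {0x2C: "G0", 0x2D: "G1", 0x2E: "G2", 0x2F: "G3"}


def _is_final(byte: int) -> bool:
    # Assumption of this sketch: final bytes lie in 0x30-0x7E.
    return 0x30 <= byte <= 0x7E


def parse_designation(seq: bytes):
    """Decode one designation escape sequence.

    Returns (register, charset_size, dimension, final_char),
    or None if SEQ is not a designation this sketch understands.
    """
    if len(seq) < 3 or seq[0] != ESC:
        return None
    body = seq[1:]
    dimension = 1
    if body[0] == 0x24:  # '$' marks a charset of dimension 2
        dimension = 2
        body = body[1:]
        if len(body) == 1 and _is_final(body[0]):
            # Short form (ESC $ F): designates a 94x94 charset to G0.
            return ("G0", 94, 2, chr(body[0]))
    if len(body) != 2:
        return None
    inter, final = body
    if not _is_final(final):
        return None
    if inter in REGISTER_94:
        return (REGISTER_94[inter], 94, dimension, chr(final))
    if inter in REGISTER_96:
        return (REGISTER_96[inter], 96, dimension, chr(final))
    return None


if __name__ == "__main__":
    for raw in (b"\x1b(B", b"\x1b-A", b"\x1b$(A", b"\x1b$B", b"\x1b$)C"):
        print(raw, parse_designation(raw))
```

The result says only which register receives which kind of charset; mapping the final character to a charset name (for example B to ASCII or JIS X 0208, depending on size and dimension) would need a registry table, which is deliberately left out here.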
+ + Here are several examples: + + ISO-2022-JP -- Coding system used in Japanese email (RFC 1463 #### check). + 1. G0 <- ASCII, G1..3 <- never used + 2. Yes. + 3. Yes. + 4. Yes. + 5. 7-bit environment + 6. No. + 7. Use ASCII + 8. Use JIS X 0208-1983 + + ctext -- X11 Compound Text + 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used. + 2. No. + 3. No. + 4. Yes. + 5. 8-bit environment. + 6. No. + 7. Use ASCII. + 8. Use JIS X 0208-1983. + + euc-china -- Chinese EUC. Often called the "GB encoding", but that is + technically incorrect. + 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used. + 2. No. + 3. Yes. + 4. Yes. + 5. 8-bit environment. + 6. No. + 7. Use ASCII. + 8. Use JIS X 0208-1983. + + ISO-2022-KR -- Coding system used in Korean email. + 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used. + 2. No. + 3. Yes. + 4. Yes. + 5. 7-bit environment. + 6. Yes. + 7. Use ASCII. + 8. Use JIS X 0208-1983. + + MULE creates all of these coding systems by default. + + +File: lispref.info, Node: EOL Conversion, Next: Coding System Properties, Prev: ISO 2022, Up: Coding Systems + +EOL Conversion +-------------- + +`nil' + Automatically detect the end-of-line type (LF, CRLF, or CR). Also + generate subsidiary coding systems named `NAME-unix', `NAME-dos', + and `NAME-mac', that are identical to this coding system but have + an EOL-TYPE value of `lf', `crlf', and `cr', respectively. + +`lf' + The end of a line is marked externally using ASCII LF. Since this + is also the way that XEmacs represents an end-of-line internally, + specifying this option results in no end-of-line conversion. This + is the standard format for Unix text files. + +`crlf' + The end of a line is marked externally using ASCII CRLF. This is + the standard format for MS-DOS text files. + +`cr' + The end of a line is marked externally using ASCII CR. This is the + standard format for Macintosh text files. + +`t' + Automatically detect the end-of-line type but do not generate + subsidiary coding systems. 
(This value is converted to `nil' when + stored internally, and `coding-system-property' will return `nil'.) + + File: lispref.info, Node: Coding System Properties, Next: Basic Coding System Functions, Prev: EOL Conversion, Up: Coding Systems Coding System Properties @@ -885,199 +1278,3 @@ but can modify the register status. successfully, and returns to caller (which may be a CCL program). It does not alter the status of the registers. - -File: lispref.info, Node: CCL Expressions, Next: Calling CCL, Prev: CCL Statements, Up: CCL - -CCL Expressions ---------------- - - CCL, unlike Lisp, uses infix expressions. The simplest CCL -expressions consist of a single OPERAND, either a register (one of `r0', -..., `r7') or an integer. Complex expressions are lists of the form `( -EXPRESSION OPERATOR OPERAND )'. Unlike C, assignments are not -expressions. - - In the following table, X is the target register for a "set". In -subexpressions, this is implicitly `r7'. This means that `>8', `//', -`de-sjis', and `en-sjis' cannot be used freely in subexpressions, since -they return parts of their values in `r7'. Y may be an expression, -register, or integer, while Z must be a register or an integer. 
- -Name Operator Code C-like Description -CCL_PLUS `+' 0x00 X = Y + Z -CCL_MINUS `-' 0x01 X = Y - Z -CCL_MUL `*' 0x02 X = Y * Z -CCL_DIV `/' 0x03 X = Y / Z -CCL_MOD `%' 0x04 X = Y % Z -CCL_AND `&' 0x05 X = Y & Z -CCL_OR `|' 0x06 X = Y | Z -CCL_XOR `^' 0x07 X = Y ^ Z -CCL_LSH `<<' 0x08 X = Y << Z -CCL_RSH `>>' 0x09 X = Y >> Z -CCL_LSH8 `<8' 0x0A X = (Y << 8) | Z -CCL_RSH8 `>8' 0x0B X = Y >> 8, r[7] = Y & 0xFF -CCL_DIVMOD `//' 0x0C X = Y / Z, r[7] = Y % Z -CCL_LS `<' 0x10 X = (X < Y) -CCL_GT `>' 0x11 X = (X > Y) -CCL_EQ `==' 0x12 X = (X == Y) -CCL_LE `<=' 0x13 X = (X <= Y) -CCL_GE `>=' 0x14 X = (X >= Y) -CCL_NE `!=' 0x15 X = (X != Y) -CCL_ENCODE_SJIS `en-sjis' 0x16 X = HIGHER_BYTE (SJIS (Y, Z)) - r[7] = LOWER_BYTE (SJIS (Y, Z)) -CCL_DECODE_SJIS `de-sjis' 0x17 X = HIGHER_BYTE (DE-SJIS (Y, Z)) - r[7] = LOWER_BYTE (DE-SJIS (Y, Z)) - - The CCL operators are as in C, with the addition of CCL_LSH8, -CCL_RSH8, CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The -CCL_ENCODE_SJIS and CCL_DECODE_SJIS treat their first and second bytes -as the high and low bytes of a two-byte character code. (SJIS stands -for Shift JIS, an encoding of Japanese characters used by Microsoft. -CCL_ENCODE_SJIS is a complicated transformation of the Japanese -standard JIS encoding to Shift JIS. CCL_DECODE_SJIS is its inverse.) -It is somewhat odd to represent the SJIS operations in infix form. - - -File: lispref.info, Node: Calling CCL, Next: CCL Examples, Prev: CCL Expressions, Up: CCL - -Calling CCL ------------ - - CCL programs are called automatically during Emacs buffer I/O when -the external representation has a coding system type of `shift-jis', -`big5', or `ccl'. The program is specified by the coding system (*note -Coding Systems::). You can also call CCL programs from other CCL -programs, and from Lisp using these functions: - - - Function: ccl-execute ccl-program status - Execute CCL-PROGRAM with registers initialized by STATUS. 
- CCL-PROGRAM is a vector of compiled CCL code created by - `ccl-compile'. It is an error for the program to try to execute a - CCL I/O command. STATUS must be a vector of nine values, - specifying the initial value for the R0, R1 .. R7 registers and - for the instruction counter IC. A `nil' value for a register - initializer causes the register to be set to 0. A `nil' value for - the IC initializer causes execution to start at the beginning of - the program. When the program is done, STATUS is modified (by - side-effect) to contain the ending values for the corresponding - registers and IC. - - - Function: ccl-execute-on-string ccl-program status str &optional - continue - Execute CCL-PROGRAM with initial STATUS on STRING. CCL-PROGRAM is - a vector of compiled CCL code created by `ccl-compile'. STATUS - must be a vector of nine values, specifying the initial value for - the R0, R1 .. R7 registers and for the instruction counter IC. A - `nil' value for a register initializer causes the register to be - set to 0. A `nil' value for the IC initializer causes execution - to start at the beginning of the program. An optional fourth - argument CONTINUE, if non-nil, causes the IC to remain on the - unsatisfied read operation if the program terminates due to - exhaustion of the input buffer. Otherwise the IC is set to the end - of the program. When the program is done, STATUS is modified (by - side-effect) to contain the ending values for the corresponding - registers and IC. Returns the resulting string. - - To call a CCL program from another CCL program, it must first be -registered: - - - Function: register-ccl-program name ccl-program - Register NAME for CCL program PROGRAM in `ccl-program-table'. - PROGRAM should be the compiled form of a CCL program, or nil. - Return index number of the registered CCL program. 
- - Information about the processor time used by the CCL interpreter can -be obtained using these functions: - - - Function: ccl-elapsed-time - Returns the elapsed processor time of the CCL interpreter as cons - of user and system time, as floating point numbers measured in - seconds. If only one overall value can be determined, the return - value will be a cons of that value and 0. - - - Function: ccl-reset-elapsed-time - Resets the CCL interpreter's internal elapsed time registers. - - -File: lispref.info, Node: CCL Examples, Prev: Calling CCL, Up: CCL - -CCL Examples ------------- - - This section is not yet written. - - -File: lispref.info, Node: Category Tables, Prev: CCL, Up: MULE - -Category Tables -=============== - - A category table is a type of char table used for keeping track of -categories. Categories are used for classifying characters for use in -regexps--you can refer to a category rather than having to use a -complicated [] expression (and category lookups are significantly -faster). - - There are 95 different categories available, one for each printable -character (including space) in the ASCII charset. Each category is -designated by one such character, called a "category designator". They -are specified in a regexp using the syntax `\cX', where X is a category -designator. (This is not yet implemented.) - - A category table specifies, for each character, the categories that -the character is in. Note that a character can be in more than one -category. More specifically, a category table maps from a character to -either the value `nil' (meaning the character is in no categories) or a -95-element bit vector, specifying for each of the 95 categories whether -the character is in that category. - - Special Lisp functions are provided that abstract this, so you do not -have to directly manipulate bit vectors. - - - Function: category-table-p obj - This function returns `t' if ARG is a category table. 
- - - Function: category-table &optional buffer - This function returns the current category table. This is the one - specified by the current buffer, or by BUFFER if it is non-`nil'. - - - Function: standard-category-table - This function returns the standard category table. This is the - one used for new buffers. - - - Function: copy-category-table &optional table - This function constructs a new category table and returns it. It - is a copy of the TABLE, which defaults to the standard category - table. - - - Function: set-category-table table &optional buffer - This function selects a new category table for BUFFER. One - argument, a category table. BUFFER defaults to the current buffer - if omitted. - - - Function: category-designator-p obj - This function returns `t' if ARG is a category designator (a char - in the range `' '' to `'~''). - - - Function: category-table-value-p obj - This function returns `t' if ARG is a category table value. Valid - values are `nil' or a bit vector of size 95. - - -File: lispref.info, Node: Tips, Next: Building XEmacs and Object Allocation, Prev: MULE, Up: Top - -Tips and Standards -****************** - - This chapter describes no additional features of XEmacs Lisp. -Instead it gives advice on making effective use of the features -described in the previous chapters. - -* Menu: - -* Style Tips:: Writing clean and robust programs. -* Compilation Tips:: Making compiled code run fast. -* Documentation Tips:: Writing readable documentation strings. -* Comment Tips:: Conventions for writing comments. -* Library Headers:: Standard headers for library packages. -