1 This is ../info/lispref.info, produced by makeinfo version 4.0 from
4 INFO-DIR-SECTION XEmacs Editor
6 * Lispref: (lispref). XEmacs Lisp Reference Manual.
11 GNU Emacs Lisp Reference Manual Second Edition (v2.01), May 1993 GNU
12 Emacs Lisp Reference Manual Further Revised (v2.02), August 1993 Lucid
13 Emacs Lisp Reference Manual (for 19.10) First Edition, March 1994
14 XEmacs Lisp Programmer's Manual (for 19.12) Second Edition, April 1995
15 GNU Emacs Lisp Reference Manual v2.4, June 1995 XEmacs Lisp
16 Programmer's Manual (for 19.13) Third Edition, July 1995 XEmacs Lisp
17 Reference Manual (for 19.14 and 20.0) v3.1, March 1996 XEmacs Lisp
18 Reference Manual (for 19.15 and 20.1, 20.2, 20.3) v3.2, April, May,
19 November 1997 XEmacs Lisp Reference Manual (for 21.0) v3.3, April 1998
21 Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995 Free Software
22 Foundation, Inc. Copyright (C) 1994, 1995 Sun Microsystems, Inc.
23 Copyright (C) 1995, 1996 Ben Wing.
25 Permission is granted to make and distribute verbatim copies of this
26 manual provided the copyright notice and this permission notice are
27 preserved on all copies.
29 Permission is granted to copy and distribute modified versions of
30 this manual under the conditions for verbatim copying, provided that the
31 entire resulting derived work is distributed under the terms of a
32 permission notice identical to this one.
34 Permission is granted to copy and distribute translations of this
35 manual into another language, under the above conditions for modified
36 versions, except that this permission notice may be stated in a
37 translation approved by the Foundation.
39 Permission is granted to copy and distribute modified versions of
40 this manual under the conditions for verbatim copying, provided also
41 that the section entitled "GNU General Public License" is included
42 exactly as in the original, and provided that the entire resulting
43 derived work is distributed under the terms of a permission notice
44 identical to this one.
46 Permission is granted to copy and distribute translations of this
47 manual into another language, under the above conditions for modified
48 versions, except that the section entitled "GNU General Public License"
49 may be included in a translation approved by the Free Software
50 Foundation instead of in the original English.
53 File: lispref.info, Node: Coding System Types, Next: ISO 2022, Up: Coding Systems
58 The coding system type determines the basic algorithm XEmacs will
59 use to decode or encode a data stream. Character encodings will be
60 converted to the MULE encoding, escape sequences processed, and newline
61 sequences converted to XEmacs's internal representation. There are
three basic classes of coding system type: no-conversion, ISO-2022, and
special.
65 No conversion allows you to look at the file's internal
66 representation. Since XEmacs is basically a text editor, "no
67 conversion" does convert newline conventions by default. (Use the
68 'binary coding-system if this is not desired.)
70 ISO 2022 (*note ISO 2022::) is the basic international standard
regulating use of "coded character sets for the exchange of data", i.e.,
text streams. ISO 2022 contains functions that make it possible to
encode text streams to comply with restrictions of the Internet mail
system and de facto restrictions of most file systems (e.g., use of the
75 separator character in file names). Coding systems which are not ISO
76 2022 conformant can be difficult to handle. Perhaps more important,
77 they are not adaptable to multilingual information interchange, with
78 the obvious exception of ISO 10646 (Unicode). (Unicode is partially
79 supported by XEmacs with the addition of the Lisp package ucs-conv.)
81 The special class of coding systems includes automatic detection,
82 CCL (a "little language" embedded as an interpreter, useful for
83 translating between variants of a single character set),
84 non-ISO-2022-conformant encodings like Unicode, Shift JIS, and Big5,
85 and MULE internal coding. (NB: this list is based on XEmacs 21.2.
86 Terminology may vary slightly for other versions of XEmacs and for GNU
90 No conversion, for binary files, and a few special cases of
91 non-ISO-2022 coding systems where conversion is done by hook
92 functions (usually implemented in CCL). On output, graphic
93 characters that are not in ASCII or Latin-1 will be replaced by a
94 `?'. (For a no-conversion-encoded buffer, these characters will
95 only be present if you explicitly insert them.)
98 Any ISO-2022-compliant encoding. Among others, this includes JIS
99 (the Japanese encoding commonly used for e-mail), national
100 variants of EUC (the standard Unix encoding for Japanese and other
101 languages), and Compound Text (an encoding used in X11). You can
102 specify more specific information about the conversion with the
106 ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of
110 ISO 10646 UTF-8 encoding. A "file system safe" transformation
111 format that can be used with both UCS-4 and Unicode.
114 Automatic conversion. XEmacs attempts to detect the coding system
118 Shift-JIS (a Japanese encoding commonly used in PC operating
122 Big5 (the encoding commonly used for Taiwanese).
125 The conversion is performed using a user-written pseudo-code
126 program. CCL (Code Conversion Language) is the name of this
127 pseudo-code. For example, CCL is used to map KOI8-R characters
128 (an encoding for Russian Cyrillic) to ISO8859-5 (the form used
132 Write out or read in the raw contents of the memory representing
133 the buffer's text. This is primarily useful for debugging
134 purposes, and is only enabled when XEmacs has been compiled with
135 `DEBUG_XEMACS' set (the `--debug' configure option). *Warning*:
136 Reading in a file using `internal' conversion can result in an
137 internal inconsistency in the memory representing a buffer's text,
138 which will produce unpredictable results and may cause XEmacs to
139 crash. Under normal circumstances you should never use `internal'
143 File: lispref.info, Node: ISO 2022, Next: EOL Conversion, Prev: Coding System Types, Up: Coding Systems
148 This section briefly describes the ISO 2022 encoding standard. A
149 more thorough treatment is available in the original document of ISO
150 2022 as well as various national standards (such as JIS X 0202).
152 Character sets ("charsets") are classified into the following four
153 categories, according to the number of characters in the charset:
154 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means
155 that although an ISO 2022 coding system may have variable width
156 characters, each charset used is fixed-width (in contrast to the MULE
157 character set and UTF-8, for example).
159 ISO 2022 provides for switching between character sets via escape
160 sequences. This switching is somewhat complicated, because ISO 2022
161 provides for both legacy applications like Internet mail that accept
162 only 7 significant bits in some contexts (RFC 822 headers, for example),
163 and more modern "8-bit clean" applications. It also provides for
164 compact and transparent representation of languages like Japanese which
165 mix ASCII and a national script (even outside of computer programs).
167 First, ISO 2022 codified prevailing practice by dividing the code
168 space into "control" and "graphic" regions. The code points 0x00-0x1F
169 and 0x80-0x9F are reserved for "control characters", while "graphic
170 characters" must be assigned to code points in the regions 0x20-0x7F and
171 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some
172 circumstances must be assigned the graphic character "ASCII SPACE" and
173 the control character "ASCII DEL" respectively.
175 The various regions are given the name C0 (0x00-0x1F), GL
176 (0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for
177 "graphic left" and "graphic right", respectively, because of the
178 standard method of displaying graphic character sets in tables with the
179 high byte indexing columns and the low byte indexing rows. I don't
180 find it very intuitive, but these are called "registers".
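The division of the code space can be sketched in a few lines. The following is a minimal illustration in Python (not part of XEmacs); the function name `iso2022_region' is hypothetical.

```python
def iso2022_region(byte: int) -> str:
    """Return the ISO 2022 code-space region (C0, GL, C1, or GR) of a byte."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a single byte: %r" % (byte,))
    if byte <= 0x1F:
        return "C0"   # control characters, 0x00-0x1F
    if byte <= 0x7F:
        return "GL"   # graphic left, 0x20-0x7F (0x20 and 0x7F are special)
    if byte <= 0x9F:
        return "C1"   # control characters, 0x80-0x9F
    return "GR"       # graphic right, 0xA0-0xFF


# ESC is a C0 control, "A" lies in GL, 0x8E in C1, and 0xA4 in GR.
regions = [iso2022_region(b) for b in b"\x1bA\x8e\xa4"]
print(regions)   # prints ['C0', 'GL', 'C1', 'GR']
```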
182 An ISO 2022-conformant encoding for a graphic character set must use
183 a fixed number of bytes per character, and the values must fit into a
184 single register; that is, each byte must range over either 0x20-0x7F, or
0xA0-0xFF. It is not allowed to extend the range of the repertoire of a
character set by using both ranges at the same time. This is why a standard
187 character set such as ISO 8859-1 is actually considered by ISO 2022 to
188 be an aggregation of two character sets, ASCII and LATIN-1, and why it
189 is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a
190 single character's bytes must all be drawn from the same register; this
191 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO
192 2022-compatible encodings.
194 The reason for this restriction becomes clear when you attempt to
195 define an efficient, robust encoding for a language like Japanese.
196 Like ISO 8859, Japanese encodings are aggregations of several character
197 sets. In practice, the vast majority of characters are drawn from the
198 "JIS Roman" character set (a derivative of ASCII; it won't hurt to
199 think of it as ASCII) and the JIS X 0208 standard "basic Japanese"
200 character set including not only ideographic characters ("kanji") but
201 syllabic Japanese characters ("kana"), a wide variety of symbols, and
202 many alphabetic characters (Roman, Greek, and Cyrillic) as well.
203 Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code
204 it is not suited to programming; thus the inclusion of ASCII in the
205 standard Japanese encodings.
207 For normal Japanese text such as in newspapers, a broad repertoire of
208 approximately 3000 characters is used. Evidently this won't fit into
209 one byte; two must be used. But much of the text processed by Japanese
210 computers is computer source code, nearly all of which is ASCII. A not
211 insignificant portion of ordinary text is English (as such or as
borrowed Japanese vocabulary) or other languages which can be represented
213 at least approximately in ASCII, as well. It seems reasonable then to
214 represent ASCII in one byte, and JIS X 0208 in two. And this is exactly
215 what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is
216 invoked to the GL register, and JIS X 0208 is invoked to the GR
217 register. Thus, each byte can be tested for its character set by
218 looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
219 Furthermore, since control characters like newline can never be part of
220 a graphic character, even in the case of corruption in transmission the
221 stream will be resynchronized at every line break, on the order of 60-80
222 bytes. This coding system requires no escape sequences or special
223 control codes to represent 99.9% of all Japanese text.
225 Note carefully the distinction between the character sets (ASCII and
226 JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022).
227 The JIS X 0208 character set is used in three different encodings for
228 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
229 always clear), in EUC-JP it is invoked into GR (setting the high bit in
230 the process), and in Shift JIS the high bit may be set or reset, and the
231 significant bits are shifted within the 16-bit character so that the two
232 main character sets can coexist with a third (the "halfwidth katakana"
233 of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a
234 version of the ISO-2022 coding system.
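To make the distinction concrete, the following sketch encodes one JIS X 0208 character in all three encodings, using Python's standard codecs rather than XEmacs itself. The slice `[3:5]' assumes the codec emits the three-byte designation `ESC $ B' before the character, which the first assertion checks.

```python
char = "あ"   # HIRAGANA LETTER A, a JIS X 0208 character

iso2022_jp = char.encode("iso2022_jp")
euc_jp = char.encode("euc_jp")
shift_jis = char.encode("shift_jis")

# ISO-2022-JP: a designation escape sequence, then two 7-bit bytes in GL.
assert iso2022_jp.startswith(b"\x1b$B")
jis_bytes = iso2022_jp[3:5]

# EUC-JP: the same two bytes invoked into GR, i.e. with the high bit set.
assert euc_jp == bytes(b | 0x80 for b in jis_bytes)

# Shift JIS: the bits are rearranged, so neither relation holds.
print(jis_bytes.hex(), euc_jp.hex(), shift_jis.hex())
```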
236 In order to systematically treat subsidiary character sets (like the
237 "halfwidth katakana" already mentioned, and the "supplementary kanji" of
238 JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
239 Unlike GL and GR, they are not logically distinguished by internal
240 format. Instead, the process of "invocation" mentioned earlier is
241 broken into two steps: first, a character set is "designated" to one of
242 the registers G0-G3 by use of an "escape sequence" of the form:

     ESC [I] F

where I is an intermediate character or characters in the range
0x20-0x2F, and F, from the range 0x30-0x7E, is the final character
identifying this charset. (Final characters in the range 0x30-0x3F are
reserved for private use and will never have a publicly registered
252 Then that register is "invoked" to either GL or GR, either
253 automatically (designations to G0 normally involve invocation to GL as
254 well), or by use of shifting (affecting only the following character in
255 the data stream) or locking (effective until the next designation or
256 locking) control sequences. An encoding conformant to ISO 2022 is
257 typically defined by designating the initial contents of the G0-G3
registers, specifying a 7-bit or 8-bit environment, and specifying whether
259 further designations will be recognized.
261 Some examples of character sets and the registered final characters
262 F used to designate them:
94-charset
     ASCII (B), left (J) and right (I) half of JIS X 0201, ...

96-charset
     Latin-1 (A), Latin-2 (B), Latin-3 (C), ...

94x94-charset
     GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...
276 The meanings of the various characters in these sequences, where not
277 specified by the ISO 2022 standard (such as the ESC character), are
278 assigned by "ECMA", the European Computer Manufacturers Association.
The meanings of the intermediate characters are:
282 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
283 ( [0x28]: designate to G0 a 94-charset whose final byte is F.
284 ) [0x29]: designate to G1 a 94-charset whose final byte is F.
285 * [0x2A]: designate to G2 a 94-charset whose final byte is F.
286 + [0x2B]: designate to G3 a 94-charset whose final byte is F.
287 , [0x2C]: designate to G0 a 96-charset whose final byte is F.
288 - [0x2D]: designate to G1 a 96-charset whose final byte is F.
289 . [0x2E]: designate to G2 a 96-charset whose final byte is F.
290 / [0x2F]: designate to G3 a 96-charset whose final byte is F.
292 The comma may be used in files read and written only by MULE, as a
293 MULE extension, but this is illegal in ISO 2022. (The reason is that
294 in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned
295 the value SPACE, and 0x7F assigned the value DEL.)
297 Here are examples of designations:
299 ESC ( B : designate to G0 ASCII
300 ESC - A : designate to G1 Latin-1
301 ESC $ ( A or ESC $ A : designate to G0 GB2312
302 ESC $ ( B or ESC $ B : designate to G0 JISX0208
303 ESC $ ) C : designate to G1 KSC5601
305 (The short forms used to designate GB2312 and JIS X 0208 are for
306 backwards compatibility; the long forms are preferred.)
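The table of intermediate characters and the designation examples can be combined into a small parser. The following Python sketch is illustrative only; the name `parse_designation' and the shape of its return value are assumptions, and no check is made that the final character is actually registered.

```python
ESC = 0x1B

# Intermediate byte -> (register, charset size), from the table above.
# "," (a 96-charset to G0) is included as the MULE extension it is.
INTERMEDIATES = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2C: ("G0", 96), 0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}


def parse_designation(seq: bytes):
    """Parse one designation escape sequence.

    Returns (register, size, dimension, final); raises ValueError otherwise.
    """
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation escape sequence")
    body = seq[1:]
    dimension = 1
    if body[0] == 0x24:                       # "$": charset of dimension 2
        dimension = 2
        body = body[1:]
        if body in (b"@", b"A", b"B"):        # short form, e.g. ESC $ B
            body = b"(" + body                # same as the long form ESC $ ( F
    if len(body) != 2 or body[0] not in INTERMEDIATES:
        raise ValueError("unsupported designation")
    register, size = INTERMEDIATES[body[0]]
    return register, size, dimension, chr(body[1])


print(parse_designation(b"\x1b(B"))    # ASCII to G0
print(parse_designation(b"\x1b$)C"))   # KSC5601 to G1
```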
308 To use a charset designated to G2 or G3, and to use a charset
309 designated to G1 in a 7-bit environment, you must explicitly invoke G1,
310 G2, or G3 into GL. There are two types of invocation, Locking Shift
311 (forever) and Single Shift (one character only).
313 Locking Shift is done as follows:
315 LS0 or SI (0x0F): invoke G0 into GL
316 LS1 or SO (0x0E): invoke G1 into GL
317 LS2: invoke G2 into GL
318 LS3: invoke G3 into GL
319 LS1R: invoke G1 into GR
320 LS2R: invoke G2 into GR
321 LS3R: invoke G3 into GR
323 Single Shift is done as follows:
325 SS2 or ESC N: invoke G2 into GL
326 SS3 or ESC O: invoke G3 into GL
The shift functions (such as LS1R and SS3) are represented by control
characters (from C1) in 8-bit environments and by escape sequences in
7-bit environments.
332 (#### Ben says: I think the above is slightly incorrect. It appears
333 that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N
334 and ESC O behave as indicated. The above definitions will not parse
335 EUC-encoded text correctly, and it looks like the code in mule-coding.c
336 has similar problems.)
Evidently there are many ISO-2022-compliant ways of encoding
multilingual text. Many coding systems in actual use, such as X11's
Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX
Code), are variants of ISO 2022.
In MULE, we characterize a version of ISO 2022 by the following
attributes:

  1. The character sets initially designated to G0 through G3.
348 2. Whether short form designations are allowed for Japanese and
351 3. Whether ASCII should be designated to G0 before control characters.
353 4. Whether ASCII should be designated to G0 at the end of line.
355 5. 7-bit environment or 8-bit environment.
357 6. Whether Locking Shifts are used or not.
359 7. Whether to use ASCII or the variant JIS X 0201-1976-Roman.
361 8. Whether to use JIS X 0208-1983 or the older version JIS X
364 (The last two are only for Japanese.)
By specifying these attributes, you can create any variant of ISO
2022.
369 Here are several examples:
ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
372 1. G0 <- ASCII, G1..3 <- never used
379 8. Use JIS X 0208-1983
381 ctext -- X11 Compound Text
382 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
386 5. 8-bit environment.
389 8. Use JIS X 0208-1983.
391 euc-china -- Chinese EUC. Often called the "GB encoding", but that is
392 technically incorrect.
393 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
397 5. 8-bit environment.
400 8. Use JIS X 0208-1983.
402 ISO-2022-KR -- Coding system used in Korean email.
403 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
407 5. 7-bit environment.
410 8. Use JIS X 0208-1983.
412 MULE creates all of these coding systems by default.
415 File: lispref.info, Node: EOL Conversion, Next: Coding System Properties, Prev: ISO 2022, Up: Coding Systems
`nil'
     Automatically detect the end-of-line type (LF, CRLF, or CR). Also
     generate subsidiary coding systems named `NAME-unix', `NAME-dos',
     and `NAME-mac', that are identical to this coding system but have
     an EOL-TYPE value of `lf', `crlf', and `cr', respectively.

`lf'
     The end of a line is marked externally using ASCII LF. Since this
     is also the way that XEmacs represents an end-of-line internally,
     specifying this option results in no end-of-line conversion. This
     is the standard format for Unix text files.

`crlf'
     The end of a line is marked externally using ASCII CRLF. This is
     the standard format for MS-DOS text files.

`cr'
     The end of a line is marked externally using ASCII CR. This is the
     standard format for Macintosh text files.

`t'
     Automatically detect the end-of-line type but do not generate
     subsidiary coding systems. (This value is converted to `nil' when
     stored internally, and `coding-system-property' will return `nil'.)
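The effect of the EOL types can be sketched as follows. This Python illustration is not the algorithm XEmacs uses; the function names and the detection heuristic (CRLF is checked first, then a bare CR) are assumptions, and passing `None' stands in for automatic detection.

```python
def detect_eol(data: bytes) -> str:
    """Guess the end-of-line type of DATA: 'crlf', 'cr', or 'lf'."""
    if b"\r\n" in data:
        return "crlf"
    if b"\r" in data:
        return "cr"
    return "lf"


def eol_to_internal(data: bytes, eol_type=None) -> bytes:
    """Convert external line breaks to the internal LF convention."""
    if eol_type is None:                 # autodetect, as for a base coding system
        eol_type = detect_eol(data)
    if eol_type == "crlf":
        return data.replace(b"\r\n", b"\n")
    if eol_type == "cr":
        return data.replace(b"\r", b"\n")
    return data                          # 'lf': no conversion needed


print(eol_to_internal(b"one\r\ntwo\r\n"))
print(eol_to_internal(b"one\rtwo\r"))
```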
446 File: lispref.info, Node: Coding System Properties, Next: Basic Coding System Functions, Prev: EOL Conversion, Up: Coding Systems
448 Coding System Properties
449 ------------------------
452 String to be displayed in the modeline when this coding system is
456 End-of-line conversion to be used. It should be one of the types
457 listed in *Note EOL Conversion::.
460 The coding system which is the same as this one, except that it
461 uses the Unix line-breaking convention.
464 The coding system which is the same as this one, except that it
465 uses the DOS line-breaking convention.
468 The coding system which is the same as this one, except that it
469 uses the Macintosh line-breaking convention.
471 `post-read-conversion'
472 Function called after a file has been read in, to perform the
473 decoding. Called with two arguments, BEG and END, denoting a
474 region of the current buffer to be decoded.
476 `pre-write-conversion'
477 Function called before a file is written out, to perform the
478 encoding. Called with two arguments, BEG and END, denoting a
479 region of the current buffer to be encoded.
481 The following additional properties are recognized if TYPE is
488 The character set initially designated to the G0 - G3 registers.
489 The value should be one of
491 * A charset object (designate that character set)
493 * `nil' (do not ever use this register)
495 * `t' (no character set is initially designated to the
496 register, but may be later on; this automatically sets the
497 corresponding `force-g*-on-output' property)
503 If non-`nil', send an explicit designation sequence on output
504 before using the specified register.
507 If non-`nil', use the short forms `ESC $ @', `ESC $ A', and `ESC $
508 B' on output in place of the full designation sequences `ESC $ (
509 @', `ESC $ ( A', and `ESC $ ( B'.
512 If non-`nil', don't designate ASCII to G0 at each end of line on
513 output. Setting this to non-`nil' also suppresses other
514 state-resetting that normally happens at the end of a line.
517 If non-`nil', don't designate ASCII to G0 before control chars on
521 If non-`nil', use 7-bit environment on output. Otherwise, use
525 If non-`nil', use locking-shift (SO/SI) instead of single-shift or
526 designation by escape sequence.
529 If non-`nil', don't use ISO6429's direction specification.
If non-`nil', literal control characters that are the same as the
533 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
534 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3
535 (0x8F), and CSI (0x9B)) are "quoted" with an escape character so
536 that they can be properly distinguished from an escape sequence.
537 (Note that doing this results in a non-portable encoding.) This
538 encoding flag is used for byte-compiled files. Note that ESC is a
539 good choice for a quoting character because there are no escape
540 sequences whose second byte is a character from the Control-0 or
541 Control-1 character sets; this is explicitly disallowed by the ISO
544 `input-charset-conversion'
545 A list of conversion specifications, specifying conversion of
546 characters in one charset to another when decoding is performed.
547 Each specification is a list of two elements: the source charset,
548 and the destination charset.
550 `output-charset-conversion'
551 A list of conversion specifications, specifying conversion of
552 characters in one charset to another when encoding is performed.
553 The form of each specification is the same as for
554 `input-charset-conversion'.
556 The following additional properties are recognized (and required) if
560 CCL program used for decoding (converting to internal format).
563 CCL program used for encoding (converting to external format).
565 The following properties are used internally: EOL-CR, EOL-CRLF,
569 File: lispref.info, Node: Basic Coding System Functions, Next: Coding System Property Functions, Prev: Coding System Properties, Up: Coding Systems
571 Basic Coding System Functions
572 -----------------------------
574 - Function: find-coding-system coding-system-or-name
575 This function retrieves the coding system of the given name.
577 If CODING-SYSTEM-OR-NAME is a coding-system object, it is simply
578 returned. Otherwise, CODING-SYSTEM-OR-NAME should be a symbol.
579 If there is no such coding system, `nil' is returned. Otherwise
580 the associated coding system object is returned.
582 - Function: get-coding-system name
583 This function retrieves the coding system of the given name. Same
584 as `find-coding-system' except an error is signalled if there is no
585 such coding system instead of returning `nil'.
587 - Function: coding-system-list
588 This function returns a list of the names of all defined coding
591 - Function: coding-system-name coding-system
592 This function returns the name of the given coding system.
594 - Function: coding-system-base coding-system
This function returns the base coding system (the one with an
undecided EOL convention) of CODING-SYSTEM.
598 - Function: make-coding-system name type &optional doc-string props
599 This function registers symbol NAME as a coding system.
601 TYPE describes the conversion method used and should be one of the
602 types listed in *Note Coding System Types::.
604 DOC-STRING is a string describing the coding system.
PROPS is a property list, describing the specific nature of the
coding system. Recognized properties are as in *Note Coding
610 - Function: copy-coding-system old-coding-system new-name
611 This function copies OLD-CODING-SYSTEM to NEW-NAME. If NEW-NAME
612 does not name an existing coding system, a new one will be created.
614 - Function: subsidiary-coding-system coding-system eol-type
615 This function returns the subsidiary coding system of
616 CODING-SYSTEM with eol type EOL-TYPE.
619 File: lispref.info, Node: Coding System Property Functions, Next: Encoding and Decoding Text, Prev: Basic Coding System Functions, Up: Coding Systems
621 Coding System Property Functions
622 --------------------------------
624 - Function: coding-system-doc-string coding-system
625 This function returns the doc string for CODING-SYSTEM.
627 - Function: coding-system-type coding-system
628 This function returns the type of CODING-SYSTEM.
630 - Function: coding-system-property coding-system prop
631 This function returns the PROP property of CODING-SYSTEM.
634 File: lispref.info, Node: Encoding and Decoding Text, Next: Detection of Textual Encoding, Prev: Coding System Property Functions, Up: Coding Systems
636 Encoding and Decoding Text
637 --------------------------
- Function: decode-coding-region start end coding-system &optional buffer
641 This function decodes the text between START and END which is
642 encoded in CODING-SYSTEM. This is useful if you've read in
643 encoded text from a file without decoding it (e.g. you read in a
644 JIS-formatted file but used the `binary' or `no-conversion' coding
645 system, so that it shows up as `^[$B!<!+^[(B'). The length of the
646 encoded text is returned. BUFFER defaults to the current buffer
- Function: encode-coding-region start end coding-system &optional buffer
This function encodes the text between START and END using
CODING-SYSTEM. This will, for example, convert Japanese
characters into byte sequences such as `^[$B!<!+^[(B' if you use the JIS
654 encoding. The length of the encoded text is returned. BUFFER
655 defaults to the current buffer if unspecified.
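The byte sequence shown in these descriptions can be examined outside XEmacs. The sketch below uses Python's standard `iso2022_jp' codec as a stand-in for the JIS coding system; it only illustrates what decoding and encoding do to the data, not how the XEmacs functions operate on a buffer.

```python
# The raw bytes that a `binary' read displays as ^[$B!<!+^[(B
raw = b"\x1b$B!<!+\x1b(B"

decoded = raw.decode("iso2022_jp")          # undecoded bytes -> characters
reencoded = decoded.encode("iso2022_jp")    # characters -> JIS bytes again

# Ten bytes of escape sequences and 7-bit codes carry two characters.
print(len(raw), len(decoded))
```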
658 File: lispref.info, Node: Detection of Textual Encoding, Next: Big5 and Shift-JIS Functions, Prev: Encoding and Decoding Text, Up: Coding Systems
660 Detection of Textual Encoding
661 -----------------------------
663 - Function: coding-category-list
664 This function returns a list of all recognized coding categories.
666 - Function: set-coding-priority-list list
667 This function changes the priority order of the coding categories.
668 LIST should be a list of coding categories, in descending order of
669 priority. Unspecified coding categories will be lower in priority
670 than all specified ones, in the same relative order they were in
673 - Function: coding-priority-list
674 This function returns a list of coding categories in descending
677 - Function: set-coding-category-system coding-category coding-system
678 This function changes the coding system associated with a coding
681 - Function: coding-category-system coding-category
682 This function returns the coding system associated with a coding
685 - Function: detect-coding-region start end &optional buffer
This function detects the coding system of the text in the region
between START and END. The returned value is a list of possible coding
688 systems ordered by priority. If only ASCII characters are found,
689 it returns `autodetect' or one of its subsidiary coding systems
690 according to a detected end-of-line type. Optional arg BUFFER
691 defaults to the current buffer.
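The role of the priority list can be illustrated with a deliberately naive detector: try each candidate in priority order and keep those that decode without error. This Python sketch is not the detection algorithm XEmacs uses, and `detect_codings' is a hypothetical name; it merely shows why the order matters when more than one encoding fits the data.

```python
def detect_codings(data: bytes, priority):
    """Return the codecs in PRIORITY order that decode DATA without error."""
    possible = []
    for name in priority:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue                 # this encoding cannot represent DATA
        possible.append(name)
    return possible


sample = "あ".encode("euc_jp")        # the bytes 0xA4 0xA2

# The same bytes are also valid Shift JIS (as halfwidth katakana), so the
# first element of the result depends entirely on the priority order.
print(detect_codings(sample, ["ascii", "euc_jp", "shift_jis"]))
print(detect_codings(sample, ["ascii", "shift_jis", "euc_jp"]))
```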
694 File: lispref.info, Node: Big5 and Shift-JIS Functions, Next: Predefined Coding Systems, Prev: Detection of Textual Encoding, Up: Coding Systems
696 Big5 and Shift-JIS Functions
697 ----------------------------
699 These are special functions for working with the non-standard
700 Shift-JIS and Big5 encodings.
702 - Function: decode-shift-jis-char code
This function decodes a JIS X 0208 character in the Shift-JIS
coding system. CODE is the character code in Shift-JIS as a cons
of two bytes. The corresponding character is returned.
707 - Function: encode-shift-jis-char ch
708 This function encodes a JIS X 0208 character CH to SHIFT-JIS
709 coding-system. The corresponding character code in SHIFT-JIS is
710 returned as a cons of two bytes.
712 - Function: decode-big5-char code
713 This function decodes a Big5 character CODE of BIG5 coding-system.
CODE is the character code in BIG5. The corresponding character
is returned.
717 - Function: encode-big5-char ch
This function encodes the Big5 character CH to BIG5
coding-system. The corresponding character code in Big5 is
returned.
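The arithmetic behind the Shift-JIS functions is the well-known mapping between a two-byte JIS X 0208 code and its Shift-JIS form. The following Python sketch shows that mapping only; the function names are hypothetical, tuples stand in for the cons of two bytes, and no claim is made that XEmacs implements it this way.

```python
def jis_to_sjis(j1: int, j2: int):
    """Convert a JIS X 0208 code (two bytes in 0x21-0x7E) to Shift-JIS."""
    s1 = (j1 + 1) // 2 + (0x70 if j1 <= 0x5E else 0xB0)
    if j1 % 2:                    # odd row: trail byte 0x40-0x9E, skipping 0x7F
        s2 = j2 + (0x1F if j2 <= 0x5F else 0x20)
    else:                         # even row: trail byte 0x9F-0xFC
        s2 = j2 + 0x7E
    return s1, s2


def sjis_to_jis(s1: int, s2: int):
    """Convert a two-byte Shift-JIS code back to JIS X 0208."""
    row = (s1 - (0x70 if s1 <= 0x9F else 0xB0)) * 2
    if s2 >= 0x9F:                # trail byte came from an even row
        return row, s2 - 0x7E
    return row - 1, s2 - (0x1F if s2 <= 0x7E else 0x20)


# HIRAGANA LETTER A is 0x2422 in JIS X 0208.
s1, s2 = jis_to_sjis(0x24, 0x22)
print(hex(s1), hex(s2))
print(sjis_to_jis(s1, s2) == (0x24, 0x22))
```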
723 File: lispref.info, Node: Predefined Coding Systems, Prev: Big5 and Shift-JIS Functions, Up: Coding Systems
725 Coding Systems Implemented
726 --------------------------
728 MULE initializes most of the commonly used coding systems at XEmacs's
729 startup. A few others are initialized only when the relevant language
730 environment is selected and support libraries are loaded. (NB: The
731 following list is based on XEmacs 21.2.19, the development branch at the
732 time of writing. The list may be somewhat different for other
733 versions. Recent versions of GNU Emacs 20 implement a few more rare
734 coding systems; work is being done to port these to XEmacs.)
736 Unfortunately, there is not a consistent naming convention for
737 character sets, and for practical purposes coding systems often take
738 their name from their principal character sets (ASCII, KOI8-R, Shift
739 JIS). Others take their names from the coding system (ISO-2022-JP,
740 EUC-KR), and a few from their non-text usages (internal, binary). To
741 provide for this, and for the fact that many coding systems have
742 several common names, an aliasing system is provided. Finally, some
743 effort has been made to use names that are registered as MIME charsets
744 (this is why the name 'shift_jis contains that un-Lisp-y underscore).
746 There is a systematic naming convention regarding end-of-line (EOL)
conventions for different systems. A coding system whose name ends in
"-unix" forces the assumption that lines are broken by newlines (0x0A).
A coding system whose name ends in "-mac" forces the assumption that
lines are broken by ASCII CRs (0x0D). A coding system whose name ends
in "-dos" forces the assumption that lines are broken by CRLF sequences
752 (0x0D 0x0A). These subsidiary coding systems are automatically derived
753 from a base coding system. Use of the base coding system implies
754 autodetection of the text file convention. (The fact that the -unix,
755 -mac, and -dos are derived from a base system results in them showing up
756 as "aliases" in `list-coding-systems'.) These subsidiaries have a
757 consistent modeline indicator as well. "-dos" coding systems have ":T"
appended to their modeline indicator, while "-mac" coding systems have
":t" appended (e.g., "ISO8:t" for iso-2022-8-mac).
761 In the following table, each coding system is given with its mode
762 line indicator in parentheses. Non-textual coding systems are listed
763 first, followed by textual coding systems and their aliases. (The
764 coding system subsidiary modeline indicators ":T" and ":t" will be
765 omitted from the table of coding systems.)
767 ### SJT 1999-08-23 Maybe should order these by language? Definitely
768 need language usage for the ISO-8859 family.
770 Note that although true coding system aliases have been implemented
771 for XEmacs 21.2, the coding system initialization has not yet been
772 converted as of 21.2.19. So coding systems described as aliases have
773 the same properties as the aliased coding system, but will not be equal
776 `automatic-conversion'
781 Modeline indicator: `Auto'. A type `undecided' coding system.
782 Attempts to determine an appropriate coding system from file
783 contents or the environment.
793 Modeline indicator: `Raw'. A type `no-conversion' coding system,
794 which converts only line-break-codes. An implementation quirk
795 means that this coding system is also used for ISO8859-1.
798 Modeline indicator: `Binary'. A type `no-conversion' coding
799 system which does no character coding or EOL conversions. An
800 alias for `raw-text-unix'.
806 Modeline indicator: `Cy.Alt'. A type `ccl' coding system used for
807 Alternativnyj, an encoding of the Cyrillic alphabet.
813 Modeline indicator: `Zh/Big5'. A type `big5' coding system used
814 for BIG5, the most common encoding of traditional Chinese as used
821 Modeline indicator: `Zh-GB/EUC'. A type `iso2022' coding system
822 used for simplified Chinese (as used in the People's Republic of
823 China), with the `ascii' (G0), `chinese-gb2312' (G1), and `sisheng'
824 (G2) character sets initially designated. Chinese EUC (Extended Unix Code).
831 Modeline indicator: `CText/Hbrw'. A type `iso2022' coding system
832 with the `ascii' (G0) and `hebrew-iso8859-8' (G1) character sets
833 initially designated for Hebrew.
839 Modeline indicator: `CText'. A type `iso2022' 8-bit coding system
840 with the `ascii' (G0) and `latin-iso8859-1' (G1) character sets
841 initially designated. X11 Compound Text Encoding. Often
842 mistakenly recognized instead of EUC encodings; usual cause is
843 inappropriate setting of `coding-priority-list'.
846 Modeline indicator: `ESC/Quot'. A type `iso2022' 8-bit coding
847 system with the `ascii' (G0) and `latin-iso8859-1' (G1) character
848 sets initially designated and escape quoting. Unix EOL conversion
849 (ie, no conversion). It is used for .ELC files.
855 Modeline indicator: `Ja/EUC'. A type `iso2022' 8-bit coding system
856 with `ascii' (G0), `japanese-jisx0208' (G1), `katakana-jisx0201'
857 (G2), and `japanese-jisx0212' (G3) initially designated. Japanese
858 EUC (Extended Unix Code).
864 Modeline indicator: `ko/EUC'. A type `iso2022' 8-bit coding system
865 with `ascii' (G0) and `korean-ksc5601' (G1) initially designated.
866 Korean EUC (Extended Unix Code).
869 Modeline indicator: `Zh-GB/Hz'. A type `no-conversion' coding
870 system with Unix EOL convention (ie, no conversion) using
871 post-read-decode and pre-write-encode functions to translate the
872 Hz/ZW coding system used for Chinese.
879 Modeline indicator: `ISO7'. A type `iso2022' 7-bit coding system
880 with `ascii' (G0) initially designated. Other character sets must
881 be explicitly designated to be used.
884 `iso-2022-7bit-ss2-dos'
885 `iso-2022-7bit-ss2-mac'
886 `iso-2022-7bit-ss2-unix'
887 Modeline indicator: `ISO7/SS'. A type `iso2022' 7-bit coding
888 system with `ascii' (G0) initially designated. Other character
889 sets must be explicitly designated to be used. SS2 is used to
890 invoke a 96-charset, one character at a time.
896 Modeline indicator: `ISO8'. A type `iso2022' 8-bit coding system
897 with `ascii' (G0) and `latin-iso8859-1' (G1) initially designated.
898 Other character sets must be explicitly designated to be used.
899 No single-shift or locking-shift.
902 `iso-2022-8bit-ss2-dos'
903 `iso-2022-8bit-ss2-mac'
904 `iso-2022-8bit-ss2-unix'
905 Modeline indicator: `ISO8/SS'. A type `iso2022' 8-bit coding
906 system with `ascii' (G0) and `latin-iso8859-1' (G1) initially
907 designated. Other character sets must be explicitly designated to
908 be used. SS2 is used to invoke a 96-charset, one character at a time.
914 `iso-2022-int-1-unix'
915 Modeline indicator: `INT-1'. A type `iso2022' 7-bit coding system
916 with `ascii' (G0) and `korean-ksc5601' (G1) initially designated.
919 `iso-2022-jp-1978-irv'
920 `iso-2022-jp-1978-irv-dos'
921 `iso-2022-jp-1978-irv-mac'
922 `iso-2022-jp-1978-irv-unix'
923 Modeline indicator: `Ja-78/7bit'. A type `iso2022' 7-bit coding
924 system. For compatibility with old Japanese terminals; if you
925 need to know, look at the source.
928 `iso-2022-jp-2 (ISO7/SS)'
935 Modeline indicator: `MULE/7bit'. A type `iso2022' 7-bit coding
936 system with `ascii' (G0) initially designated, and complex
937 specifications to ensure backward compatibility with old Japanese
938 systems. Used for communication with mail and news in Japan. The
939 "-2" versions also use SS2 to invoke a 96-charset one character at a time.
943 Modeline indicator: `Ko/7bit'. A type `iso2022' 7-bit coding
944 system with `ascii' (G0) and `korean-ksc5601' (G1) initially
945 designated. Used for e-mail in Korea.
951 Modeline indicator: `ISO7/Lock'. A type `iso2022' 7-bit coding
952 system with `ascii' (G0) initially designated, using Locking-Shift
953 to invoke a 96-charset.
959 Due to implementation, this is not a type `iso2022' coding system,
960 but rather an alias for the `raw-text' coding system.
966 Modeline indicator: `MIME/Ltn-2'. A type `iso2022' coding system
967 with `ascii' (G0) and `latin-iso8859-2' (G1) initially invoked.
973 Modeline indicator: `MIME/Ltn-3'. A type `iso2022' coding system
974 with `ascii' (G0) and `latin-iso8859-3' (G1) initially invoked.
980 Modeline indicator: `MIME/Ltn-4'. A type `iso2022' coding system
981 with `ascii' (G0) and `latin-iso8859-4' (G1) initially invoked.
987 Modeline indicator: `ISO8/Cyr'. A type `iso2022' coding system
988 with `ascii' (G0) and `cyrillic-iso8859-5' (G1) initially invoked.
994 Modeline indicator: `Grk'. A type `iso2022' coding system with
995 `ascii' (G0) and `greek-iso8859-7' (G1) initially invoked.
1001 Modeline indicator: `MIME/Hbrw'. A type `iso2022' coding system
1002 with `ascii' (G0) and `hebrew-iso8859-8' (G1) initially invoked.
1008 Modeline indicator: `MIME/Ltn-5'. A type `iso2022' coding system
1009 with `ascii' (G0) and `latin-iso8859-9' (G1) initially invoked.
1015 Modeline indicator: `KOI8'. A type `ccl' coding system used for
1016 KOI8-R, an encoding of the Cyrillic alphabet.
1022 Modeline indicator: `Ja/SJIS'. A type `shift-jis' coding system
1023 implementing the Shift-JIS encoding for Japanese. The underscore
1024 is to conform to the MIME charset implementing this encoding.
1030 Modeline indicator: `TIS620'. A type `ccl' encoding for Thai. The
1031 external encoding is defined by TIS620, the internal encoding is
1032 peculiar to MULE, and called `thai-xtis'.
1035 Modeline indicator: `VIQR'. A type `no-conversion' coding system
1036 with Unix EOL convention (ie, no conversion) using
1037 post-read-decode and pre-write-encode functions to translate the
1038 VIQR coding system for Vietnamese.
1044 Modeline indicator: `VISCII'. A type `ccl' coding system used for
1045 VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is
1046 given priority by XEmacs.
1052 Modeline indicator: `VSCII'. A type `ccl' coding system used for
1053 VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is
1054 given priority by XEmacs. Use `(prefer-coding-system
1055 'vietnamese-vscii)' to give priority to VSCII.
1058 File: lispref.info, Node: CCL, Next: Category Tables, Prev: Coding Systems, Up: MULE
1063 CCL (Code Conversion Language) is a simple structured programming
1064 language designed for character coding conversions. A CCL program is
1065 compiled to CCL code (represented by a vector of integers) and executed
1066 by the CCL interpreter embedded in Emacs. The CCL interpreter
1067 implements a virtual machine with 8 registers called `r0', ..., `r7', a
1068 number of control structures, and some I/O operators. Take care when
1069 using registers `r0' (used in implicit "set" statements) and especially
1070 `r7' (used internally by several statements and operations, especially
1071 for multiple return values and I/O operations).
1073 CCL is used for code conversion during process I/O and file I/O for
1074 non-ISO2022 coding systems. (It is the only way for a user to specify a
1075 code conversion function.) It is also used for calculating the code
1076 point of an X11 font from a character code. However, since CCL is
1077 designed as a powerful programming language, it can be used for more
1078 generic calculation where efficiency is demanded. A combination of
1079 three or more arithmetic operations can be calculated faster by CCL than by Emacs Lisp.
1082 *Warning:* The code in `src/mule-ccl.c' and
1083 `$packages/lisp/mule-base/mule-ccl.el' is the definitive description of
1084 CCL's semantics. The previous version of this section contained
1085 several typos and obsolete names left from earlier versions of MULE,
1086 and many may remain. (I am not an experienced CCL programmer; the few
1087 who know CCL well find writing English painful.)
1089 A CCL program transforms an input data stream into an output data
1090 stream. The input stream, held in a buffer of constant bytes, is left
1091 unchanged. The buffer may be filled by an external input operation,
1092 taken from an Emacs buffer, or taken from a Lisp string. The output
1093 buffer is a dynamic array of bytes, which can be written by an external
1094 output operation, inserted into an Emacs buffer, or returned as a Lisp string.
1097 A CCL program is a (Lisp) list containing two or three members. The
1098 first member is the "buffer magnification", which indicates the
1099 required minimum size of the output buffer as a multiple of the input
1100 buffer. It is followed by the "main block" which executes while there
1101 is input remaining, and an optional "EOF block" which is executed when
1102 the input is exhausted. Both the main block and the EOF block are CCL blocks.
1105 A "CCL block" is either a CCL statement or list of CCL statements.
1106 A "CCL statement" is either a "set statement" (either an integer or an
1107 "assignment", which is a list of a register to receive the assignment,
1108 an assignment operator, and an expression) or a "control statement" (a
1109 list starting with a keyword, whose allowable syntax depends on the keyword).
1114 * CCL Syntax:: CCL program syntax in BNF notation.
1115 * CCL Statements:: Semantics of CCL statements.
1116 * CCL Expressions:: Operators and expressions in CCL.
1117 * Calling CCL:: Running CCL programs.
1118 * CCL Examples:: The encoding functions for Big5 and KOI-8.
1121 File: lispref.info, Node: CCL Syntax, Next: CCL Statements, Up: CCL
1126 The full syntax of a CCL program in BNF notation:
1129 (BUFFER_MAGNIFICATION
1133 BUFFER_MAGNIFICATION := integer
1134 CCL_MAIN_BLOCK := CCL_BLOCK
1135 CCL_EOF_BLOCK := CCL_BLOCK
1138 STATEMENT | (STATEMENT [STATEMENT ...])
1140 SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE
1145 | (REG ASSIGNMENT_OPERATOR EXPRESSION)
1148 EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)
1150 IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
1151 BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
1152 LOOP := (loop STATEMENT [STATEMENT ...])
1156 | (write-repeat [REG | integer | string])
1157 | (write-read-repeat REG [integer | ARRAY])
1160 | (read-if (REG OPERATOR ARG) CCL_BLOCK CCL_BLOCK)
1161 | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
1164 | (write EXPRESSION)
1165 | (write integer) | (write string) | (write REG ARRAY)
1167 CALL := (call ccl-program-name)
1170 REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
1171 ARG := REG | integer
1173 + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
1174 | < | > | == | <= | >= | != | de-sjis | en-sjis
1175 ASSIGNMENT_OPERATOR :=
1176 += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
1177 ARRAY := '[' integer ... ']'
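
As an illustration of the grammar, here is a small program annotated
with the nonterminals it instantiates. This is only a sketch:
`my-ccl-copy' is a hypothetical name, and `define-ccl-program' is
assumed to be available as defined in `mule-ccl.el'. The source code
remains the authority on what the compiler accepts.

```elisp
;; Hypothetical example program; the comments name the grammar rules.
(define-ccl-program my-ccl-copy
  '(1                             ; BUFFER_MAGNIFICATION
    ((read r0)                    ; CCL_MAIN_BLOCK: a list of statements
     (loop                        ; LOOP
      (write-read-repeat r0)))    ; REPEAT variant with a single REG
    (write 10))                   ; CCL_EOF_BLOCK: a single statement
  "Copy input to output; the EOF block writes a newline (0x0A).")
```

`define-ccl-program' compiles the list and binds the symbol to the
resulting vector of CCL code.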
1180 File: lispref.info, Node: CCL Statements, Next: CCL Expressions, Prev: CCL Syntax, Up: CCL
1185 The Emacs Code Conversion Language provides the following statement
1186 types: "set", "if", "branch", "loop", "repeat", "break", "read",
1187 "write", "call", and "end".
1192 The "set" statement has three variants with the syntaxes `(REG =
1193 EXPRESSION)', `(REG ASSIGNMENT_OPERATOR EXPRESSION)', and `INTEGER'.
1194 The assignment operator variation of the "set" statement works the same
1195 way as the corresponding C expression statement does. The assignment
1196 operators are `+=', `-=', `*=', `/=', `%=', `&=', `|=', `^=', `<<=',
1197 and `>>=', and they have the same meanings as in C. A "naked integer"
1198 INTEGER is equivalent to a SET statement of the form `(r0 = INTEGER)'.
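
A minimal sketch of the three variants follows. The program name
`my-ccl-set-demo' is hypothetical, and `ccl-execute-with-args' is
assumed to behave as defined in `mule-ccl.el' (it loads its integer
arguments into `r0', `r1', ... and returns the vector of registers).

```elisp
;; Hypothetical example: the three forms of the "set" statement.
(define-ccl-program my-ccl-set-demo
  '(0                    ; no output is produced
    ((r1 = (r0 * 3))     ; (REG = EXPRESSION)
     (r1 += 4)           ; (REG ASSIGNMENT_OPERATOR EXPRESSION)
     7))                 ; naked integer, same as (r0 = 7)
  "Demonstrate the variants of the CCL set statement.")

;; Start with r0 = 5; afterwards r1 = 5*3 + 4 = 19 and r0 = 7.
(ccl-execute-with-args my-ccl-set-demo 5)
```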
1203 The "read" statement takes one or more registers as arguments. It
1204 reads one byte (a C char) from the input into each register in turn.
1206 The "write" statement takes several forms. In the form `(write REG ...)' it
1207 takes one or more registers as arguments and writes each in turn to the
1208 output. The integer in a register (interpreted as an Emchar) is
1209 encoded to multibyte form (ie, Bufbytes) and written to the current
1210 output buffer. If it is less than 256, it is written as is. The forms
1211 `(write EXPRESSION)' and `(write INTEGER)' are treated analogously.
1212 The form `(write STRING)' writes the constant string to the output. A
1213 "naked string" `STRING' is equivalent to the statement `(write
1214 STRING)'. The form `(write REG ARRAY)' writes the REGth element of the
1215 ARRAY to the output.
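
The following sketch combines "read" and "write" to upcase ASCII
letters. The name `my-ccl-upcase' is hypothetical, and the call
assumes that `ccl-execute-on-string' accepts compiled CCL code, a
status vector of nine elements, and a string. Integer literals are
used instead of character syntax so that the program does not depend
on how characters are read.

```elisp
;; Hypothetical example: byte-by-byte filter using read and write.
(define-ccl-program my-ccl-upcase
  '(1                        ; output is never larger than the input
    (loop
     (read r0)               ; one input byte into r0
     (if (r0 >= 97)          ; 97 is `a'
         (if (r0 <= 122)     ; 122 is `z'
             (r0 -= 32)))    ; shift to the upper-case letter
     (write r0)              ; values below 256 are written as is
     (repeat)))
  "Upcase the ASCII lower-case letters of the input.")

(ccl-execute-on-string my-ccl-upcase (make-vector 9 nil) "hello")
;; => "HELLO"
```

The loop ends when `read' finds the input exhausted.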
1217 Conditional statements:
1218 =======================
1220 The "if" statement takes an EXPRESSION, a CCL BLOCK, and an optional
1221 SECOND CCL BLOCK as arguments. If the EXPRESSION evaluates to
1222 non-zero, the first CCL BLOCK is executed. Otherwise, if there is a
1223 SECOND CCL BLOCK, it is executed.
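
For example, the following sketch stores the larger of `r0' and `r1'
in `r2'. The name `my-ccl-max' is hypothetical.

```elisp
;; Hypothetical example: "if" with both blocks present.
(define-ccl-program my-ccl-max
  '(0
    (if (r0 > r1)
        (r2 = r0)      ; executed when the expression is non-zero
      (r2 = r1)))      ; executed otherwise
  "Set r2 to the maximum of r0 and r1.")

(aref (ccl-execute-with-args my-ccl-max 3 7) 2)   ; => 7
(aref (ccl-execute-with-args my-ccl-max 9 2) 2)   ; => 9
```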
1225 The "read-if" variant of the "if" statement takes an EXPRESSION, a
1226 CCL BLOCK, and an optional SECOND CCL BLOCK as arguments. The
1227 EXPRESSION must have the form `(REG OPERATOR OPERAND)' (where OPERAND is
1228 a register or an integer). The `read-if' statement first reads from
1229 the input into the first register operand in the EXPRESSION, then
1230 conditionally executes a CCL block just as the `if' statement does.
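
As a sketch, `read-if' can drive a simple conversion from the Unix
EOL convention to the DOS one. The name `my-ccl-lf-to-crlf' is
hypothetical, and the buffer magnification of 2 reflects the worst
case in which every input byte is a newline.

```elisp
;; Hypothetical example: read a byte and test it in one statement.
(define-ccl-program my-ccl-lf-to-crlf
  '(2
    (loop
     (read-if (r0 == 10)         ; read into r0, then compare with LF
         ((write 13)             ; LF seen: write a CR first
          (write-repeat r0))     ; then the LF itself, and loop again
       (write-repeat r0))))      ; any other byte is copied unchanged
  "Convert LF line breaks to CRLF.")

(ccl-execute-on-string my-ccl-lf-to-crlf (make-vector 9 nil) "a\nb")
;; => "a\r\nb"
```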
1232 The "branch" statement takes an EXPRESSION and one or more CCL
1233 blocks as arguments. The CCL blocks are treated as a zero-indexed
1234 array, and the `branch' statement uses the EXPRESSION as the index of
1235 the CCL block to execute. Null CCL blocks may be used as no-ops,
1236 continuing execution with the statement following the `branch'
1237 statement in the containing CCL block. Out-of-range values for the
1238 EXPRESSION are also treated as no-ops.
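
A short sketch of the indexing behavior, using the hypothetical
program `my-ccl-branch-demo':

```elisp
;; Hypothetical example: r0 selects one of three blocks.
(define-ccl-program my-ccl-branch-demo
  '(0
    (branch r0
            (r1 = 10)      ; index 0
            (r1 = 20)      ; index 1
            (r1 = 30)))    ; index 2
  "Set r1 according to the zero-based index in r0.")

;; The arguments initialize r0 and r1, in that order.
(aref (ccl-execute-with-args my-ccl-branch-demo 1 99) 1)  ; => 20
(aref (ccl-execute-with-args my-ccl-branch-demo 5 99) 1)  ; => 99 (no-op)
```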
1240 The "read-branch" variant of the "branch" statement takes a
1241 REGISTER and one or more CCL blocks as arguments.
1242 The `read-branch' statement first reads from the input into the
1243 REGISTER, then conditionally executes a CCL block just as the `branch' statement does.
1246 Loop control statements:
1247 ========================
1249 The "loop" statement creates a block with an implied jump from the
1250 end of the block back to its head. The loop is exited on a `break'
1251 statement, and continued without executing the tail by a `repeat' statement.
1254 The "break" statement, written `(break)', terminates the current
1255 loop and continues with the next statement in the current block.
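
The interplay of `loop', `break', and `repeat' can be sketched with a
program that sums the integers from 1 to the initial value of `r0'.
The name `my-ccl-sum' is hypothetical.

```elisp
;; Hypothetical example: a counted loop.
(define-ccl-program my-ccl-sum
  '(0
    ((r1 = 0)
     (loop
      (if (r0 <= 0) (break))   ; leave the loop when the counter hits 0
      (r1 += r0)
      (r0 -= 1)
      (repeat))))              ; jump back to the head of the loop
  "Leave 1 + 2 + ... + r0 in r1.")

(aref (ccl-execute-with-args my-ccl-sum 4) 1)   ; => 10
```

Without the `repeat', control would fall out of the end of the loop
body after a single pass.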
1257 The "repeat" statement has three variants, `repeat', `write-repeat',
1258 and `write-read-repeat'. Each continues the current loop from its
1259 head, possibly after performing I/O. `repeat' takes no arguments and
1260 does no I/O before jumping. `write-repeat' takes a single argument (a
1261 register, an integer, or a string), writes it to the output, then jumps.
1262 `write-read-repeat' takes one or two arguments. The first must be a
1263 register. The second may be an integer or an array; if absent, it is
1264 implicitly set to the first (register) argument. `write-read-repeat'
1265 writes its second argument to the output, then reads from the input
1266 into the register, and finally jumps. See the `write' and `read'
1267 statements for the semantics of the I/O operations for each type of argument.
1270 Other control statements:
1271 =========================
1273 The "call" statement, written `(call CCL-PROGRAM-NAME)', executes a
1274 CCL program as a subroutine. It does not return a value to the caller,
1275 but can modify the register status.
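
A sketch of a subroutine call follows. Both program names are
hypothetical, and the callee is assumed to have been defined with
`define-ccl-program' before the caller is compiled.

```elisp
;; Hypothetical example: the callee communicates through registers.
(define-ccl-program my-ccl-double
  '(0 (r0 *= 2))
  "Double the value in r0.")

(define-ccl-program my-ccl-double-plus-one
  '(0
    ((call my-ccl-double)    ; r0 is modified by the subroutine
     (r0 += 1)))
  "Compute 2 * r0 + 1 in r0.")

(aref (ccl-execute-with-args my-ccl-double-plus-one 5) 0)   ; => 11
```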
1277 The "end" statement, written `(end)', terminates the CCL program
1278 successfully, and returns to the caller (which may be a CCL program). It
1279 does not alter the status of the registers.
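
For instance, in the following sketch (with the hypothetical name
`my-ccl-end-demo') the final assignment is skipped when `r0' is zero:

```elisp
;; Hypothetical example: early termination with (end).
(define-ccl-program my-ccl-end-demo
  '(0
    ((r1 = 1)
     (if (r0 == 0) (end))    ; stop here; registers keep their values
     (r1 = 2)))
  "Set r1 to 1 if r0 is zero, and to 2 otherwise.")

(aref (ccl-execute-with-args my-ccl-end-demo 0) 1)   ; => 1
(aref (ccl-execute-with-args my-ccl-end-demo 5) 1)   ; => 2
```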