This is ../info/lispref.info, produced by makeinfo version 4.0 from
lispref/lispref.texi.

INFO-DIR-SECTION XEmacs Editor
START-INFO-DIR-ENTRY
* Lispref: (lispref).		XEmacs Lisp Reference Manual.
END-INFO-DIR-ENTRY

   Edition History:

   GNU Emacs Lisp Reference Manual Second Edition (v2.01), May 1993
   GNU Emacs Lisp Reference Manual Further Revised (v2.02), August 1993
   Lucid Emacs Lisp Reference Manual (for 19.10) First Edition, March 1994
   XEmacs Lisp Programmer's Manual (for 19.12) Second Edition, April 1995
   GNU Emacs Lisp Reference Manual v2.4, June 1995
   XEmacs Lisp Programmer's Manual (for 19.13) Third Edition, July 1995
   XEmacs Lisp Reference Manual (for 19.14 and 20.0) v3.1, March 1996
   XEmacs Lisp Reference Manual (for 19.15 and 20.1, 20.2, 20.3) v3.2,
   April, May, November 1997
   XEmacs Lisp Reference Manual (for 21.0) v3.3, April 1998

   Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995 Free Software
Foundation, Inc.  Copyright (C) 1994, 1995 Sun Microsystems, Inc.
Copyright (C) 1995, 1996 Ben Wing.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that the
entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the section entitled "GNU General Public License" is included
exactly as in the original, and provided that the entire resulting
derived work is distributed under the terms of a permission notice
identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that the section entitled "GNU General Public License"
may be included in a translation approved by the Free Software
Foundation instead of in the original English.


File: lispref.info,  Node: Coding System Types,  Next: ISO 2022,  Up: Coding Systems

Coding System Types
-------------------

   The coding system type determines the basic algorithm XEmacs will
use to decode or encode a data stream.  Character encodings will be
converted to the MULE encoding, escape sequences processed, and newline
sequences converted to XEmacs's internal representation.  There are
three basic classes of coding system type: no-conversion, ISO-2022, and
special.

   No conversion allows you to look at the file's internal
representation.  Since XEmacs is basically a text editor, "no
conversion" does convert newline conventions by default.  (Use the
'binary coding-system if this is not desired.)

   ISO 2022 (*note ISO 2022::) is the basic international standard
regulating use of "coded character sets for the exchange of data", i.e.,
text streams.  ISO 2022 contains functions that make it possible to
encode text streams to comply with restrictions of the Internet mail
system and de facto restrictions of most file systems (e.g., use of the
separator character in file names).  Coding systems which are not ISO
2022 conformant can be difficult to handle.  Perhaps more important,
they are not adaptable to multilingual information interchange, with
the obvious exception of ISO 10646 (Unicode).  (Unicode is partially
supported by XEmacs with the addition of the Lisp package ucs-conv.)

   The special class of coding systems includes automatic detection,
CCL (a "little language" embedded as an interpreter, useful for
translating between variants of a single character set),
non-ISO-2022-conformant encodings like Unicode, Shift JIS, and Big5,
and MULE internal coding.  (NB: this list is based on XEmacs 21.2.
Terminology may vary slightly for other versions of XEmacs and for GNU
Emacs 20.)

`no-conversion'
     No conversion, for binary files, and a few special cases of
     non-ISO-2022 coding systems where conversion is done by hook
     functions (usually implemented in CCL).  On output, graphic
     characters that are not in ASCII or Latin-1 will be replaced by a
     `?'. (For a no-conversion-encoded buffer, these characters will
     only be present if you explicitly insert them.)

`iso2022'
     Any ISO-2022-compliant encoding.  Among others, this includes JIS
     (the Japanese encoding commonly used for e-mail), national
     variants of EUC (the standard Unix encoding for Japanese and other
     languages), and Compound Text (an encoding used in X11).  You can
     specify more specific information about the conversion with the
     FLAGS argument.

`ucs-4'
     ISO 10646 UCS-4 encoding.  A 31-bit fixed-width superset of
     Unicode.

`utf-8'
     ISO 10646 UTF-8 encoding.  A "file system safe" transformation
     format that can be used with both UCS-4 and Unicode.

`undecided'
     Automatic conversion.  XEmacs attempts to detect the coding system
     used in the file.

`shift-jis'
     Shift-JIS (a Japanese encoding commonly used in PC operating
     systems).

`big5'
     Big5 (the encoding commonly used for Chinese text in Taiwan).

`ccl'
     The conversion is performed using a user-written pseudo-code
     program.  CCL (Code Conversion Language) is the name of this
     pseudo-code.  For example, CCL is used to map KOI8-R characters
     (an encoding for Russian Cyrillic) to ISO8859-5 (the form used
     internally by MULE).

`internal'
     Write out or read in the raw contents of the memory representing
     the buffer's text.  This is primarily useful for debugging
     purposes, and is only enabled when XEmacs has been compiled with
     `DEBUG_XEMACS' set (the `--debug' configure option).  *Warning*:
     Reading in a file using `internal' conversion can result in an
     internal inconsistency in the memory representing a buffer's text,
     which will produce unpredictable results and may cause XEmacs to
     crash.  Under normal circumstances you should never use `internal'
     conversion.


File: lispref.info,  Node: ISO 2022,  Next: EOL Conversion,  Prev: Coding System Types,  Up: Coding Systems

ISO 2022
========

   This section briefly describes the ISO 2022 encoding standard.  A
more thorough treatment is available in the original document of ISO
2022 as well as various national standards (such as JIS X 0202).

   Character sets ("charsets") are classified into the following four
categories, according to the number of characters in the charset:
94-charset, 96-charset, 94x94-charset, and 96x96-charset.  This means
that although an ISO 2022 coding system may have variable width
characters, each charset used is fixed-width (in contrast to the MULE
character set and UTF-8, for example).

   ISO 2022 provides for switching between character sets via escape
sequences.  This switching is somewhat complicated, because ISO 2022
provides for both legacy applications like Internet mail that accept
only 7 significant bits in some contexts (RFC 822 headers, for example),
and more modern "8-bit clean" applications.  It also provides for
compact and transparent representation of languages like Japanese which
mix ASCII and a national script (even outside of computer programs).

   First, ISO 2022 codified prevailing practice by dividing the code
space into "control" and "graphic" regions.  The code points 0x00-0x1F
and 0x80-0x9F are reserved for "control characters", while "graphic
characters" must be assigned to code points in the regions 0x20-0x7F and
0xA0-0xFF.  The positions 0x20 and 0x7F are special, and under some
circumstances must be assigned the graphic character "ASCII SPACE" and
the control character "ASCII DEL" respectively.

   The various regions are given the name C0 (0x00-0x1F), GL
(0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF).  GL and GR stand for
"graphic left" and "graphic right", respectively, because of the
standard method of displaying graphic character sets in tables with the
high nibble indexing columns and the low nibble indexing rows.  I don't
find it very intuitive, but these are called "registers".

   An ISO 2022-conformant encoding for a graphic character set must use
a fixed number of bytes per character, and the values must fit into a
single register; that is, each byte must range over either 0x20-0x7F, or
0xA0-0xFF.  It is not allowed to extend the range of the repertoire of a
character set by using both ranges at the same time.  This is why a standard
character set such as ISO 8859-1 is actually considered by ISO 2022 to
be an aggregation of two character sets, ASCII and LATIN-1, and why it
is technically incorrect to refer to ISO 8859-1 as "Latin 1".  Also, a
single character's bytes must all be drawn from the same register; this
is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO
2022-compatible encodings.
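
   As a rough illustration of the single-register rule, the following
Python sketch (not part of XEmacs; the helper names `region' and
`same_register' are invented for this example) classifies byte values
into the four regions and checks whether all bytes of one encoded
character fall into the same register.  It assumes Python's standard
`euc_jp' and `shift_jis' codecs.

```python
def region(byte):
    """Return the ISO 2022 region name for a single byte value."""
    if byte <= 0x1F:
        return "C0"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "C1"
    return "GR"


def same_register(encoded_char):
    """True if every byte of one encoded character lies in GL, or every
    byte lies in GR, as ISO 2022 requires of a graphic character."""
    regions = {region(b) for b in encoded_char}
    return regions in ({"GL"}, {"GR"})


# The kanji U+672C in two encodings of JIS X 0208.
euc = "\u672c".encode("euc_jp")       # both bytes fall in GR
sjis = "\u672c".encode("shift_jis")   # first byte in C1, second in GL
print([region(b) for b in euc], same_register(euc))
print([region(b) for b in sjis], same_register(sjis))
```

   The Shift JIS form fails the check because its two bytes come from
different regions, which is the property the paragraph above describes.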

   The reason for this restriction becomes clear when you attempt to
define an efficient, robust encoding for a language like Japanese.
Like ISO 8859, Japanese encodings are aggregations of several character
sets.  In practice, the vast majority of characters are drawn from the
"JIS Roman" character set (a derivative of ASCII; it won't hurt to
think of it as ASCII) and the JIS X 0208 standard "basic Japanese"
character set including not only ideographic characters ("kanji") but
syllabic Japanese characters ("kana"), a wide variety of symbols, and
many alphabetic characters (Roman, Greek, and Cyrillic) as well.
Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code
it is not suited to programming; thus the inclusion of ASCII in the
standard Japanese encodings.

   For normal Japanese text such as in newspapers, a broad repertoire of
approximately 3000 characters is used.  Evidently this won't fit into
one byte; two must be used.  But much of the text processed by Japanese
computers is computer source code, nearly all of which is ASCII.  A not
insignificant portion of ordinary text is English (as such or as
borrowed Japanese vocabulary) or other languages which can be represented
at least approximately in ASCII, as well.  It seems reasonable then to
represent ASCII in one byte, and JIS X 0208 in two.  And this is exactly
what the Extended Unix Code for Japanese (EUC-JP) does.  ASCII is
invoked to the GL register, and JIS X 0208 is invoked to the GR
register.  Thus, each byte can be tested for its character set by
looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
Furthermore, since control characters like newline can never be part of
a graphic character, even in the case of corruption in transmission the
stream will be resynchronized at every line break, on the order of 60-80
bytes.  This coding system requires no escape sequences or special
control codes to represent 99.9% of all Japanese text.
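
   The high-bit test can be sketched in a few lines of Python.  This is
only an illustration, assuming Python's standard `euc_jp' codec; the
function name `charset_per_byte' is hypothetical.  It deliberately
ignores the rarer parts of EUC-JP (characters introduced by the
single-shift bytes 0x8E and 0x8F), and it labels control characters
such as newline as "ascii" simply because their high bit is clear.

```python
def charset_per_byte(data):
    """Label each byte of simple EUC-JP text by its high bit alone."""
    return ["jisx0208" if b & 0x80 else "ascii" for b in data]


# "A", one kanji (U+65E5), and a newline.
sample = "A\u65e5\n".encode("euc_jp")
labels = charset_per_byte(sample)
print(labels)
```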

   Note carefully the distinction between the character sets (ASCII and
JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022).
The JIS X 0208 character set is used in three different encodings for
Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
always clear), in EUC-JP it is invoked into GR (setting the high bit in
the process), and in Shift JIS the high bit may be set or reset, and the
significant bits are shifted within the 16-bit character so that the two
main character sets can coexist with a third (the "halfwidth katakana"
of JIS X 0201).  As the name implies, the ISO-2022-JP encoding is also a
version of the ISO-2022 coding system.
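
   The relationship between the three encodings can be observed with
Python's standard codecs (`iso2022_jp', `euc_jp', and `shift_jis').
The sketch below is illustrative only; the helper `jis_bytes' is a
made-up name, and it assumes the codec brackets a lone kanji with the
designations `ESC $ B' and `ESC ( B'.

```python
ESC_TO_JISX0208 = b"\x1b$B"   # designate JIS X 0208 to G0
ESC_TO_ASCII = b"\x1b(B"      # designate ASCII to G0


def jis_bytes(char):
    """Return the two 7-bit JIS X 0208 bytes of CHAR, taken from its
    ISO-2022-JP encoding with the surrounding escape sequences removed."""
    raw = char.encode("iso2022_jp")
    if not (raw.startswith(ESC_TO_JISX0208) and raw.endswith(ESC_TO_ASCII)):
        raise ValueError("unexpected escape sequences in %r" % raw)
    return raw[len(ESC_TO_JISX0208):-len(ESC_TO_ASCII)]


kanji = "\u65e5"
in_gl = jis_bytes(kanji)              # ISO-2022-JP: high bits clear
in_gr = kanji.encode("euc_jp")        # EUC-JP: same bytes, high bits set
shifted = kanji.encode("shift_jis")   # Shift JIS: rearranged bits
print(in_gl.hex(), in_gr.hex(), shifted.hex())
```

   The EUC-JP bytes are the ISO-2022-JP bytes with the high bit set,
while the Shift JIS bytes are not related to either by a simple mask.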

   In order to systematically treat subsidiary character sets (like the
"halfwidth katakana" already mentioned, and the "supplementary kanji" of
JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
Unlike GL and GR, they are not logically distinguished by internal
format.  Instead, the process of "invocation" mentioned earlier is
broken into two steps: first, a character set is "designated" to one of
the registers G0-G3 by use of an "escape sequence" of the form:

             ESC [I] I F

   where I is an intermediate character or characters in the range
0x20-0x2F, and F, from the range 0x30-0x7E, is the final character
identifying this charset.  (Final characters in the range 0x30-0x3F are
reserved for private use and will never have a publicly registered
meaning.)

   Then that register is "invoked" to either GL or GR, either
automatically (designations to G0 normally involve invocation to GL as
well), or by use of shifting (affecting only the following character in
the data stream) or locking (effective until the next designation or
locking) control sequences.  An encoding conformant to ISO 2022 is
typically defined by designating the initial contents of the G0-G3
registers, specifying a 7- or 8-bit environment, and specifying whether
further designations will be recognized.

   Some examples of character sets and the registered final characters
F used to designate them:

94-charset
     ASCII (B), left (J) and right (I) half of JIS X 0201, ...

96-charset
     Latin-1 (A), Latin-2 (B), Latin-3 (C), ...

94x94-charset
     GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...

96x96-charset
     none for the moment

   The meanings of the various characters in these sequences, where not
specified by the ISO 2022 standard (such as the ESC character), are
assigned by "ECMA", the European Computer Manufacturers Association.

   The meanings of the intermediate characters are:

             $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
             ( [0x28]: designate to G0 a 94-charset whose final byte is F.
             ) [0x29]: designate to G1 a 94-charset whose final byte is F.
             * [0x2A]: designate to G2 a 94-charset whose final byte is F.
             + [0x2B]: designate to G3 a 94-charset whose final byte is F.
             , [0x2C]: designate to G0 a 96-charset whose final byte is F.
             - [0x2D]: designate to G1 a 96-charset whose final byte is F.
             . [0x2E]: designate to G2 a 96-charset whose final byte is F.
             / [0x2F]: designate to G3 a 96-charset whose final byte is F.

   The comma may be used in files read and written only by MULE, as a
MULE extension, but this is illegal in ISO 2022.  (The reason is that
in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned
the value SPACE, and 0x7F assigned the value DEL.)

   Here are examples of designations:

             ESC ( B :              designate to G0 ASCII
             ESC - A :              designate to G1 Latin-1
             ESC $ ( A or ESC $ A : designate to G0 GB2312
             ESC $ ( B or ESC $ B : designate to G0 JISX0208
             ESC $ ) C :            designate to G1 KSC5601

   (The short forms used to designate GB2312 and JIS X 0208 are for
backwards compatibility; the long forms are preferred.)
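
   The table of intermediate characters can be turned into a small
builder for long-form designation sequences.  The Python function
below is a sketch written for this section, not an XEmacs API; its name
and parameters (`designate', `size', `dimension', `mule_extension')
are hypothetical.

```python
ESC = b"\x1b"
INTERMEDIATE_94 = b"()*+"   # designate a 94-charset to G0..G3
INTERMEDIATE_96 = b",-./"   # designate a 96-charset to G0..G3


def designate(register, final, size=94, dimension=1, mule_extension=False):
    """Build the long-form escape sequence designating a charset.

    REGISTER is 0..3 (G0..G3), FINAL is the final character, SIZE is 94
    or 96, and DIMENSION is 1 or 2 (2 adds the `$' intermediate).
    """
    if register not in (0, 1, 2, 3):
        raise ValueError("register must be 0..3")
    if size not in (94, 96) or dimension not in (1, 2):
        raise ValueError("size must be 94 or 96, dimension 1 or 2")
    if size == 96 and register == 0 and not mule_extension:
        # ISO 2022 itself forbids a 96-charset in G0; MULE allows it.
        raise ValueError("96-charset cannot be designated to G0")
    table = INTERMEDIATE_94 if size == 94 else INTERMEDIATE_96
    seq = ESC
    if dimension == 2:
        seq += b"$"
    return seq + table[register:register + 1] + final.encode("ascii")


print(designate(0, "B"))                 # ASCII to G0
print(designate(1, "A", size=96))        # Latin-1 to G1
print(designate(0, "B", dimension=2))    # JIS X 0208 to G0
print(designate(1, "C", dimension=2))    # KSC5601 to G1
```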

   To use a charset designated to G2 or G3, and to use a charset
designated to G1 in a 7-bit environment, you must explicitly invoke G1,
G2, or G3 into GL.  There are two types of invocation, Locking Shift
(forever) and Single Shift (one character only).

   Locking Shift is done as follows:

             LS0 or SI (0x0F): invoke G0 into GL
             LS1 or SO (0x0E): invoke G1 into GL
             LS2:  invoke G2 into GL
             LS3:  invoke G3 into GL
             LS1R: invoke G1 into GR
             LS2R: invoke G2 into GR
             LS3R: invoke G3 into GR

   Single Shift is done as follows:

             SS2 or ESC N: invoke G2 into GL
             SS3 or ESC O: invoke G3 into GL

   The shift functions (such as LS1R and SS3) are represented by control
characters (from C1) in 8 bit environments and by escape sequences in 7
bit environments.
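
   The difference between locking and single shifts can be seen in a
toy state machine.  The Python sketch below is an illustration under
strong simplifying assumptions: a 7-bit environment, single-byte
charsets only, no designation sequences in the input, and only the
shifts SI, SO, `ESC N', and `ESC O'.  The function name `tag_gl_bytes'
is invented for this example.

```python
SI, SO, ESC = 0x0F, 0x0E, 0x1B


def tag_gl_bytes(data):
    """Pair each graphic byte with the number of the G register that is
    invoked into GL when that byte is read."""
    locked = 0        # G0 is invoked into GL initially
    single = None     # register selected by a pending single shift
    tagged = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SI:
            locked = 0                     # LS0: lock G0 into GL
        elif b == SO:
            locked = 1                     # LS1: lock G1 into GL
        elif b == ESC and i + 1 < len(data) and data[i + 1] in b"NO":
            single = 2 if data[i + 1] == ord("N") else 3
            i += 1                         # skip the N or O
        elif 0x20 <= b <= 0x7F:
            register = locked if single is None else single
            single = None                  # a single shift covers one character
            tagged.append((register, b))
        i += 1
    return tagged


print(tag_gl_bytes(b"a\x0eb\x0fc\x1bNde"))
```

   In the sample input, `b' is read from G1 because of the locking
shift, `d' is read from G2 because of the single shift, and `e' falls
back to G0 without any further control code.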

   (#### Ben says: I think the above is slightly incorrect.  It appears
that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N
and ESC O behave as indicated.  The above definitions will not parse
EUC-encoded text correctly, and it looks like the code in mule-coding.c
has similar problems.)

   Evidently there are a lot of ISO-2022-compliant ways of encoding
multilingual text.  Many coding systems in actual use, such as X11's
Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX
Code), are variants of ISO 2022.

   In MULE, we characterize a version of ISO 2022 by the following
attributes:

  1. The character sets initially designated to G0 through G3.

  2. Whether short form designations are allowed for Japanese and
     Chinese.

  3. Whether ASCII should be designated to G0 before control characters.

  4. Whether ASCII should be designated to G0 at the end of line.

  5. 7-bit environment or 8-bit environment.

  6. Whether Locking Shifts are used or not.

  7. Whether to use ASCII or the variant JIS X 0201-1976-Roman.

  8. Whether to use JIS X 0208-1983 or the older version JIS X
     0208-1976.

   (The last two are only for Japanese.)

   By specifying these attributes, you can create any variant of ISO
2022.

   Here are several examples:

     ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
             1. G0 <- ASCII, G1..3 <- never used
             2. Yes.
             3. Yes.
             4. Yes.
             5. 7-bit environment
             6. No.
             7. Use ASCII
             8. Use JIS X 0208-1983
     
     ctext -- X11 Compound Text
             1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
             2. No.
             3. No.
             4. Yes.
             5. 8-bit environment.
             6. No.
             7. Use ASCII.
             8. Use JIS X 0208-1983.
     
     euc-china -- Chinese EUC.  Often called the "GB encoding", but that is
     technically incorrect.
             1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
             2. No.
             3. Yes.
             4. Yes.
             5. 8-bit environment.
             6. No.
             7. Use ASCII.
             8. Use JIS X 0208-1983.
     
     ISO-2022-KR -- Coding system used in Korean email.
             1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
             2. No.
             3. Yes.
             4. Yes.
             5. 7-bit environment.
             6. Yes.
             7. Use ASCII.
             8. Use JIS X 0208-1983.

   MULE creates all of these coding systems by default.
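
   One way to picture the eight attributes is as a record with one
field per attribute.  The Python sketch below is a model for
illustration only; the class and field names are hypothetical and do
not correspond to XEmacs data structures.  The three instances restate
values from the examples above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Iso2022Variant:
    initial: Tuple[Optional[str], ...]   # 1. charsets in G0..G3 (None: unused)
    short_forms: bool                    # 2. short designations allowed
    ascii_before_control: bool           # 3. ASCII to G0 before control chars
    ascii_at_eol: bool                   # 4. ASCII to G0 at end of line
    seven_bit: bool                      # 5. 7-bit (True) or 8-bit (False)
    locking_shift: bool                  # 6. locking shifts used
    jis_roman: bool = False              # 7. False means plain ASCII
    old_jisx0208: bool = False           # 8. False means JIS X 0208-1983


ISO_2022_JP = Iso2022Variant(
    initial=("ascii", None, None, None), short_forms=True,
    ascii_before_control=True, ascii_at_eol=True,
    seven_bit=True, locking_shift=False)

CTEXT = Iso2022Variant(
    initial=("ascii", "latin-1", None, None), short_forms=False,
    ascii_before_control=False, ascii_at_eol=True,
    seven_bit=False, locking_shift=False)

ISO_2022_KR = Iso2022Variant(
    initial=("ascii", "ksc5601", None, None), short_forms=False,
    ascii_before_control=True, ascii_at_eol=True,
    seven_bit=True, locking_shift=True)

print(ISO_2022_JP)
```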


File: lispref.info,  Node: EOL Conversion,  Next: Coding System Properties,  Prev: ISO 2022,  Up: Coding Systems

EOL Conversion
--------------

`nil'
     Automatically detect the end-of-line type (LF, CRLF, or CR).  Also
     generate subsidiary coding systems named `NAME-unix', `NAME-dos',
     and `NAME-mac', that are identical to this coding system but have
     an EOL-TYPE value of `lf', `crlf', and `cr', respectively.

`lf'
     The end of a line is marked externally using ASCII LF.  Since this
     is also the way that XEmacs represents an end-of-line internally,
     specifying this option results in no end-of-line conversion.  This
     is the standard format for Unix text files.

`crlf'
     The end of a line is marked externally using ASCII CRLF.  This is
     the standard format for MS-DOS text files.

`cr'
     The end of a line is marked externally using ASCII CR.  This is the
     standard format for Macintosh text files.

`t'
     Automatically detect the end-of-line type but do not generate
     subsidiary coding systems.  (This value is converted to `nil' when
     stored internally, and `coding-system-property' will return `nil'.)
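
   The effect of the EOL types on decoding can be sketched in Python.
This is a simplified illustration, not the XEmacs algorithm: the
function names are hypothetical, and detection here looks only at the
first line break, which may differ from what XEmacs actually does.

```python
def detect_eol(data):
    """Guess the EOL type of DATA: 'crlf', 'cr', 'lf', or None if no
    line break is present.  Only the first line break is examined."""
    for i, b in enumerate(data):
        if b == 0x0A:
            return "lf"
        if b == 0x0D:
            return "crlf" if data[i + 1:i + 2] == b"\n" else "cr"
    return None


def decode_eol(data, eol_type=None):
    """Convert external line breaks to the internal LF convention.
    An EOL_TYPE of None requests automatic detection."""
    if eol_type is None:
        eol_type = detect_eol(data)
    if eol_type == "crlf":
        return data.replace(b"\r\n", b"\n")
    if eol_type == "cr":
        return data.replace(b"\r", b"\n")
    return data   # 'lf' or no line breaks: nothing to convert


print(detect_eol(b"one\r\ntwo\r\n"), decode_eol(b"one\r\ntwo\r\n"))
```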


File: lispref.info,  Node: Coding System Properties,  Next: Basic Coding System Functions,  Prev: EOL Conversion,  Up: Coding Systems

Coding System Properties
------------------------

`mnemonic'
     String to be displayed in the modeline when this coding system is
     active.

`eol-type'
     End-of-line conversion to be used.  It should be one of the types
     listed in *Note EOL Conversion::.

`eol-lf'
     The coding system which is the same as this one, except that it
     uses the Unix line-breaking convention.

`eol-crlf'
     The coding system which is the same as this one, except that it
     uses the DOS line-breaking convention.

`eol-cr'
     The coding system which is the same as this one, except that it
     uses the Macintosh line-breaking convention.

`post-read-conversion'
     Function called after a file has been read in, to perform the
     decoding.  Called with two arguments, BEG and END, denoting a
     region of the current buffer to be decoded.

`pre-write-conversion'
     Function called before a file is written out, to perform the
     encoding.  Called with two arguments, BEG and END, denoting a
     region of the current buffer to be encoded.

   The following additional properties are recognized if TYPE is
`iso2022':

`charset-g0'
`charset-g1'
`charset-g2'
`charset-g3'
     The character set initially designated to the G0 - G3 registers.
     The value should be one of

        * A charset object (designate that character set)

        * `nil' (do not ever use this register)

        * `t' (no character set is initially designated to the
          register, but may be later on; this automatically sets the
          corresponding `force-g*-on-output' property)

`force-g0-on-output'
`force-g1-on-output'
`force-g2-on-output'
`force-g3-on-output'
     If non-`nil', send an explicit designation sequence on output
     before using the specified register.

`short'
     If non-`nil', use the short forms `ESC $ @', `ESC $ A', and `ESC $
     B' on output in place of the full designation sequences `ESC $ (
     @', `ESC $ ( A', and `ESC $ ( B'.

`no-ascii-eol'
     If non-`nil', don't designate ASCII to G0 at each end of line on
     output.  Setting this to non-`nil' also suppresses other
     state-resetting that normally happens at the end of a line.

`no-ascii-cntl'
     If non-`nil', don't designate ASCII to G0 before control chars on
     output.

`seven'
     If non-`nil', use 7-bit environment on output.  Otherwise, use
     8-bit environment.

`lock-shift'
     If non-`nil', use locking-shift (SO/SI) instead of single-shift or
     designation by escape sequence.

`no-iso6429'
     If non-`nil', don't use ISO6429's direction specification.

`escape-quoted'
     If non-`nil', literal control characters that are the same as the
     beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
     particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3
     (0x8F), and CSI (0x9B)) are "quoted" with an escape character so
     that they can be properly distinguished from an escape sequence.
     (Note that doing this results in a non-portable encoding.) This
     encoding flag is used for byte-compiled files.  Note that ESC is a
     good choice for a quoting character because there are no escape
     sequences whose second byte is a character from the Control-0 or
     Control-1 character sets; this is explicitly disallowed by the ISO
     2022 standard.

`input-charset-conversion'
     A list of conversion specifications, specifying conversion of
     characters in one charset to another when decoding is performed.
     Each specification is a list of two elements: the source charset,
     and the destination charset.

`output-charset-conversion'
     A list of conversion specifications, specifying conversion of
     characters in one charset to another when encoding is performed.
     The form of each specification is the same as for
     `input-charset-conversion'.

   The following additional properties are recognized (and required) if
TYPE is `ccl':

`decode'
     CCL program used for decoding (converting to internal format).

`encode'
     CCL program used for encoding (converting to external format).

   The following properties are used internally:  EOL-CR, EOL-CRLF,
EOL-LF, and BASE.


File: lispref.info,  Node: Basic Coding System Functions,  Next: Coding System Property Functions,  Prev: Coding System Properties,  Up: Coding Systems

Basic Coding System Functions
-----------------------------

 - Function: find-coding-system coding-system-or-name
     This function retrieves the coding system of the given name.

     If CODING-SYSTEM-OR-NAME is a coding-system object, it is simply
     returned.  Otherwise, CODING-SYSTEM-OR-NAME should be a symbol.
     If there is no such coding system, `nil' is returned.  Otherwise
     the associated coding system object is returned.

 - Function: get-coding-system name
     This function retrieves the coding system of the given name.  Same
     as `find-coding-system' except an error is signalled if there is no
     such coding system instead of returning `nil'.

 - Function: coding-system-list
     This function returns a list of the names of all defined coding
     systems.

 - Function: coding-system-name coding-system
     This function returns the name of the given coding system.

 - Function: coding-system-base coding-system
     This function returns the base coding system of CODING-SYSTEM,
     that is, the variant whose EOL convention is undecided.

 - Function: make-coding-system name type &optional doc-string props
     This function registers symbol NAME as a coding system.

     TYPE describes the conversion method used and should be one of the
     types listed in *Note Coding System Types::.

     DOC-STRING is a string describing the coding system.

     PROPS is a property list, describing the specific nature of the
     character set.  Recognized properties are as in *Note Coding
     System Properties::.

 - Function: copy-coding-system old-coding-system new-name
     This function copies OLD-CODING-SYSTEM to NEW-NAME.  If NEW-NAME
     does not name an existing coding system, a new one will be created.

 - Function: subsidiary-coding-system coding-system eol-type
     This function returns the subsidiary coding system of
     CODING-SYSTEM with eol type EOL-TYPE.


File: lispref.info,  Node: Coding System Property Functions,  Next: Encoding and Decoding Text,  Prev: Basic Coding System Functions,  Up: Coding Systems

Coding System Property Functions
--------------------------------

 - Function: coding-system-doc-string coding-system
     This function returns the doc string for CODING-SYSTEM.

 - Function: coding-system-type coding-system
     This function returns the type of CODING-SYSTEM.

 - Function: coding-system-property coding-system prop
     This function returns the PROP property of CODING-SYSTEM.


File: lispref.info,  Node: Encoding and Decoding Text,  Next: Detection of Textual Encoding,  Prev: Coding System Property Functions,  Up: Coding Systems

Encoding and Decoding Text
--------------------------

 - Function: decode-coding-region start end coding-system &optional
          buffer
     This function decodes the text between START and END which is
     encoded in CODING-SYSTEM.  This is useful if you've read in
     encoded text from a file without decoding it (e.g. you read in a
     JIS-formatted file but used the `binary' or `no-conversion' coding
     system, so that it shows up as `^[$B!<!+^[(B').  The length of the
     encoded text is returned.  BUFFER defaults to the current buffer
     if unspecified.

 - Function: encode-coding-region start end coding-system &optional
          buffer
     This function encodes the text between START and END using
     CODING-SYSTEM.  This will, for example, convert Japanese
     characters into stuff such as `^[$B!<!+^[(B' if you use the JIS
     encoding.  The length of the encoded text is returned.  BUFFER
     defaults to the current buffer if unspecified.
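
   For readers without a MULE-enabled XEmacs at hand, the sample byte
sequence shown above can be examined with Python's standard
`iso2022_jp' codec.  This is only an analogue of decoding and encoding
a region, not the XEmacs functions themselves; in the byte string, `^['
is written as the ESC character.

```python
# The sample from the text: ESC $ B, two JIS X 0208 characters, ESC ( B.
raw = b"\x1b$B!<!+\x1b(B"

# Decoding interprets the escape sequences and yields two characters.
text = raw.decode("iso2022_jp")

# Encoding the characters again reproduces the escape-laden byte form.
round_trip = text.encode("iso2022_jp")

print(len(raw), len(text), round_trip == raw)
```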


File: lispref.info,  Node: Detection of Textual Encoding,  Next: Big5 and Shift-JIS Functions,  Prev: Encoding and Decoding Text,  Up: Coding Systems

Detection of Textual Encoding
-----------------------------

 - Function: coding-category-list
     This function returns a list of all recognized coding categories.

 - Function: set-coding-priority-list list
     This function changes the priority order of the coding categories.
     LIST should be a list of coding categories, in descending order of
     priority.  Unspecified coding categories will be lower in priority
     than all specified ones, in the same relative order they were in
     previously.

 - Function: coding-priority-list
     This function returns a list of coding categories in descending
     order of priority.

 - Function: set-coding-category-system coding-category coding-system
     This function changes the coding system associated with a coding
     category.

 - Function: coding-category-system coding-category
     This function returns the coding system associated with a coding
     category.

 - Function: detect-coding-region start end &optional buffer
     This function detects the coding system of the text in the region
     between START and END.  The returned value is a list of possible coding
     systems ordered by priority.  If only ASCII characters are found,
     it returns `autodetect' or one of its subsidiary coding systems
     according to a detected end-of-line type.  Optional arg BUFFER
     defaults to the current buffer.
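
   The priority rules and the idea of returning a prioritized list of
candidates can be sketched in Python.  This is an illustration, not the
XEmacs detector: it substitutes trial decoding with Python codecs for
the real detection logic, lets codec names stand in for coding
categories, and uses hypothetical function names.

```python
def reorder_priorities(current, preferred):
    """Put the categories in PREFERRED first; the unspecified ones
    follow in the same relative order they had in CURRENT."""
    rest = [c for c in current if c not in preferred]
    return list(preferred) + rest


def detect(data, priority):
    """Return every codec in PRIORITY that decodes DATA without error,
    in priority order."""
    possible = []
    for codec in priority:
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        possible.append(codec)
    return possible


priority = reorder_priorities(["ascii", "utf_8", "euc_jp"], ["euc_jp"])
japanese = "\u65e5\u672c".encode("euc_jp")
print(priority)
print(detect(japanese, priority))
print(detect(b"hello", priority))
```

   Pure ASCII input is accepted by every candidate, which is why a real
detector needs a special answer for that case, as described above.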


File: lispref.info,  Node: Big5 and Shift-JIS Functions,  Next: Predefined Coding Systems,  Prev: Detection of Textual Encoding,  Up: Coding Systems

Big5 and Shift-JIS Functions
----------------------------

   These are special functions for working with the non-standard
Shift-JIS and Big5 encodings.

 - Function: decode-shift-jis-char code
     This function decodes a JIS X 0208 character of Shift-JIS
     coding-system.  CODE is the character code in Shift-JIS as a cons
     of two bytes.  The corresponding character is returned.

 - Function: encode-shift-jis-char ch
     This function encodes a JIS X 0208 character CH to SHIFT-JIS
     coding-system.  The corresponding character code in SHIFT-JIS is
     returned as a cons of two bytes.

 - Function: decode-big5-char code
     This function decodes a Big5 character CODE of BIG5 coding-system.
     CODE is the character code in BIG5.  The corresponding character
     is returned.

 - Function: encode-big5-char ch
     This function encodes the Big5 character CH to BIG5
     coding-system.  The corresponding character code in Big5 is
     returned.
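
   The byte arithmetic that relates a JIS X 0208 code point to its
Shift JIS form can be sketched in Python.  This is an illustration of
the well-known conversion formula, not the XEmacs implementation; the
function names are hypothetical, and both bytes of the JIS code are
assumed to lie in 0x21-0x7E.

```python
def jis_to_sjis(j1, j2):
    """Convert the two 7-bit JIS X 0208 bytes to a Shift JIS byte pair."""
    s1 = ((j1 - 0x21) >> 1) + 0x81
    if s1 > 0x9F:
        s1 += 0x40                  # skip over 0xA0-0xDF
    if j1 & 1:                      # odd row: first half of the trail range
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1                 # 0x7F is never used as a trail byte
    else:                           # even row: second half
        s2 = j2 + 0x7E
    return s1, s2


def sjis_to_jis(s1, s2):
    """Inverse of jis_to_sjis."""
    if s1 >= 0xE0:
        s1 -= 0x40
    row = (s1 - 0x81) * 2
    if s2 >= 0x9F:
        return row + 0x22, s2 - 0x7E
    return row + 0x21, s2 - 0x1F - (1 if s2 >= 0x80 else 0)


# U+65E5 is 0x46 0x7C in JIS X 0208.
print([hex(b) for b in jis_to_sjis(0x46, 0x7C)])
```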


File: lispref.info,  Node: Predefined Coding Systems,  Prev: Big5 and Shift-JIS Functions,  Up: Coding Systems

Coding Systems Implemented
--------------------------

   MULE initializes most of the commonly used coding systems at XEmacs's
startup.  A few others are initialized only when the relevant language
environment is selected and support libraries are loaded.  (NB: The
following list is based on XEmacs 21.2.19, the development branch at the
time of writing.  The list may be somewhat different for other
versions.  Recent versions of GNU Emacs 20 implement a few more rare
coding systems; work is being done to port these to XEmacs.)

   Unfortunately, there is not a consistent naming convention for
character sets, and for practical purposes coding systems often take
their name from their principal character sets (ASCII, KOI8-R, Shift
JIS).  Others take their names from the coding system (ISO-2022-JP,
EUC-KR), and a few from their non-text usages (internal, binary).  To
provide for this, and for the fact that many coding systems have
several common names, an aliasing system is provided.  Finally, some
effort has been made to use names that are registered as MIME charsets
(this is why the name 'shift_jis contains that un-Lisp-y underscore).

   There is a systematic naming convention regarding end-of-line (EOL)
conventions for different systems.  A coding system whose name ends in
"-unix" forces the assumption that lines are broken by newlines (0x0A).
A coding system whose name ends in "-mac" forces the assumption that
lines are broken by ASCII CRs (0x0D).  A coding system whose name ends
in "-dos" forces the assumption that lines are broken by CRLF sequences
(0x0D 0x0A).  These subsidiary coding systems are automatically derived
from a base coding system.  Use of the base coding system implies
autodetection of the text file convention.  (The fact that the -unix,
-mac, and -dos are derived from a base system results in them showing up
as "aliases" in `list-coding-systems'.)  These subsidiaries have a
consistent modeline indicator as well.  "-dos" coding systems have ":T"
appended to their modeline indicator, while "-mac" coding systems have
":t" appended (eg, "ISO8:t" for iso-2022-8-mac).

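   As an illustration of the naming convention only, the following
Emacs Lisp sketch builds the three subsidiary names from a base name.
The helper `my-eol-subsidiary-names' is hypothetical and not part of
XEmacs; it only constructs symbols and does not check that the coding
systems exist.

```elisp
;; Hypothetical helper (an assumption for illustration, not an XEmacs
;; function): derive the names of the EOL subsidiaries of a base
;; coding system by appending the conventional suffixes.
(defun my-eol-subsidiary-names (base)
  "Return a list of the -unix, -mac and -dos names derived from BASE.
BASE is a symbol such as `iso-2022-8'.  Only the names are built;
the coding systems themselves are not looked up."
  (mapcar (lambda (suffix)
            (intern (concat (symbol-name base) suffix)))
          '("-unix" "-mac" "-dos")))

(my-eol-subsidiary-names 'iso-2022-8)
;; => (iso-2022-8-unix iso-2022-8-mac iso-2022-8-dos)
```
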
   In the following table, each coding system is given with its mode
line indicator in parentheses.  Non-textual coding systems are listed
first, followed by textual coding systems and their aliases. (The
coding system subsidiary modeline indicators ":T" and ":t" will be
omitted from the table of coding systems.)

   ### SJT 1999-08-23 Maybe should order these by language?  Definitely
need language usage for the ISO-8859 family.

   Note that although true coding system aliases have been implemented
for XEmacs 21.2, the coding system initialization has not yet been
converted as of 21.2.19.  So coding systems described as aliases have
the same properties as the aliased coding system, but will not be equal
as Lisp objects.

`automatic-conversion'
`undecided'
`undecided-dos'
`undecided-mac'
`undecided-unix'
     Modeline indicator: `Auto'.  A type `undecided' coding system.
     Attempts to determine an appropriate coding system from file
     contents or the environment.

`raw-text'
`no-conversion'
`raw-text-dos'
`raw-text-mac'
`raw-text-unix'
`no-conversion-dos'
`no-conversion-mac'
`no-conversion-unix'
     Modeline indicator: `Raw'.  A type `no-conversion' coding system,
     which converts only line-break-codes.  An implementation quirk
     means that this coding system is also used for ISO8859-1.

`binary'
     Modeline indicator: `Binary'.  A type `no-conversion' coding
     system which does no character coding or EOL conversions.  An
     alias for `raw-text-unix'.

`alternativnyj'
`alternativnyj-dos'
`alternativnyj-mac'
`alternativnyj-unix'
     Modeline indicator: `Cy.Alt'.  A type `ccl' coding system used for
     Alternativnyj, an encoding of the Cyrillic alphabet.

`big5'
`big5-dos'
`big5-mac'
`big5-unix'
     Modeline indicator: `Zh/Big5'.  A type `big5' coding system used
     for BIG5, the most common encoding of traditional Chinese as used
     in Taiwan.

`cn-gb-2312'
`cn-gb-2312-dos'
`cn-gb-2312-mac'
`cn-gb-2312-unix'
     Modeline indicator: `Zh-GB/EUC'.  A type `iso2022' coding system
     used for simplified Chinese (as used in the People's Republic of
     China), with the `ascii' (G0), `chinese-gb2312' (G1), and `sisheng'
     (G2) character sets initially designated.  Chinese EUC (Extended
     Unix Code).

`ctext-hebrew'
`ctext-hebrew-dos'
`ctext-hebrew-mac'
`ctext-hebrew-unix'
     Modeline indicator: `CText/Hbrw'.  A type `iso2022' coding system
     with the `ascii' (G0) and `hebrew-iso8859-8' (G1) character sets
     initially designated for Hebrew.

`ctext'
`ctext-dos'
`ctext-mac'
`ctext-unix'
     Modeline indicator: `CText'.  A type `iso2022' 8-bit coding system
     with the `ascii' (G0) and `latin-iso8859-1' (G1) character sets
     initially designated.  X11 Compound Text Encoding.  Often
     mistakenly recognized instead of EUC encodings; usual cause is
     inappropriate setting of `coding-priority-list'.

`escape-quoted'
     Modeline indicator: `ESC/Quot'.  A type `iso2022' 8-bit coding
     system with the `ascii' (G0) and `latin-iso8859-1' (G1) character
     sets initially designated and escape quoting.  Unix EOL conversion
     (ie, no conversion).  It is used for .ELC files.

`euc-jp'
`euc-jp-dos'
`euc-jp-mac'
`euc-jp-unix'
     Modeline indicator: `Ja/EUC'.  A type `iso2022' 8-bit coding system
     with `ascii' (G0), `japanese-jisx0208' (G1), `katakana-jisx0201'
     (G2), and `japanese-jisx0212' (G3) initially designated.  Japanese
     EUC (Extended Unix Code).

`euc-kr'
`euc-kr-dos'
`euc-kr-mac'
`euc-kr-unix'
     Modeline indicator: `ko/EUC'.  A type `iso2022' 8-bit coding system
     with `ascii' (G0) and `korean-ksc5601' (G1) initially designated.
     Korean EUC (Extended Unix Code).

`hz-gb-2312'
     Modeline indicator: `Zh-GB/Hz'.  A type `no-conversion' coding
     system with Unix EOL convention (ie, no conversion) using
     post-read-decode and pre-write-encode functions to translate the
     Hz/ZW coding system used for Chinese.

`iso-2022-7bit'
`iso-2022-7bit-unix'
`iso-2022-7bit-dos'
`iso-2022-7bit-mac'
`iso-2022-7'
     Modeline indicator: `ISO7'.  A type `iso2022' 7-bit coding system
     with `ascii' (G0) initially designated.  Other character sets must
     be explicitly designated to be used.

`iso-2022-7bit-ss2'
`iso-2022-7bit-ss2-dos'
`iso-2022-7bit-ss2-mac'
`iso-2022-7bit-ss2-unix'
     Modeline indicator: `ISO7/SS'.  A type `iso2022' 7-bit coding
     system with `ascii' (G0) initially designated.  Other character
     sets must be explicitly designated to be used.  SS2 is used to
     invoke a 96-charset, one character at a time.

`iso-2022-8'
`iso-2022-8-dos'
`iso-2022-8-mac'
`iso-2022-8-unix'
     Modeline indicator: `ISO8'.  A type `iso2022' 8-bit coding system
     with `ascii' (G0) and `latin-iso8859-1' (G1) initially designated.
     Other character sets must be explicitly designated to be used.
     No single-shift or locking-shift.

`iso-2022-8bit-ss2'
`iso-2022-8bit-ss2-dos'
`iso-2022-8bit-ss2-mac'
`iso-2022-8bit-ss2-unix'
     Modeline indicator: `ISO8/SS'.  A type `iso2022' 8-bit coding
     system with `ascii' (G0) and `latin-iso8859-1' (G1) initially
     designated.  Other character sets must be explicitly designated to
     be used.  SS2 is used to invoke a 96-charset, one character at a
     time.

`iso-2022-int-1'
`iso-2022-int-1-dos'
`iso-2022-int-1-mac'
`iso-2022-int-1-unix'
     Modeline indicator: `INT-1'.  A type `iso2022' 7-bit coding system
     with `ascii' (G0) and `korean-ksc5601' (G1) initially designated.
     ISO-2022-INT-1.

`iso-2022-jp-1978-irv'
`iso-2022-jp-1978-irv-dos'
`iso-2022-jp-1978-irv-mac'
`iso-2022-jp-1978-irv-unix'
     Modeline indicator: `Ja-78/7bit'.  A type `iso2022' 7-bit coding
     system.  For compatibility with old Japanese terminals; if you
     need to know, look at the source.

`iso-2022-jp'
`iso-2022-jp-2 (ISO7/SS)'
`iso-2022-jp-dos'
`iso-2022-jp-mac'
`iso-2022-jp-unix'
`iso-2022-jp-2-dos'
`iso-2022-jp-2-mac'
`iso-2022-jp-2-unix'
     Modeline indicator: `MULE/7bit'.  A type `iso2022' 7-bit coding
     system with `ascii' (G0) initially designated, and complex
     specifications to ensure backward compatibility with old Japanese
     systems.  Used for communication with mail and news in Japan.  The
     "-2" versions also use SS2 to invoke a 96-charset one character at
     a time.

`iso-2022-kr'
     Modeline indicator: `Ko/7bit'.  A type `iso2022' 7-bit coding
     system with `ascii' (G0) and `korean-ksc5601' (G1) initially
     designated.  Used for e-mail in Korea.

`iso-2022-lock'
`iso-2022-lock-dos'
`iso-2022-lock-mac'
`iso-2022-lock-unix'
     Modeline indicator: `ISO7/Lock'.  A type `iso2022' 7-bit coding
     system with `ascii' (G0) initially designated, using Locking-Shift
     to invoke a 96-charset.

`iso-8859-1'
`iso-8859-1-dos'
`iso-8859-1-mac'
`iso-8859-1-unix'
     Due to implementation, this is not a type `iso2022' coding system,
     but rather an alias for the `raw-text' coding system.

`iso-8859-2'
`iso-8859-2-dos'
`iso-8859-2-mac'
`iso-8859-2-unix'
     Modeline indicator: `MIME/Ltn-2'.  A type `iso2022' coding system
     with `ascii' (G0) and `latin-iso8859-2' (G1) initially invoked.

`iso-8859-3'
`iso-8859-3-dos'
`iso-8859-3-mac'
`iso-8859-3-unix'
     Modeline indicator: `MIME/Ltn-3'.  A type `iso2022' coding system
     with `ascii' (G0) and `latin-iso8859-3' (G1) initially invoked.

`iso-8859-4'
`iso-8859-4-dos'
`iso-8859-4-mac'
`iso-8859-4-unix'
     Modeline indicator: `MIME/Ltn-4'.  A type `iso2022' coding system
     with `ascii' (G0) and `latin-iso8859-4' (G1) initially invoked.

`iso-8859-5'
`iso-8859-5-dos'
`iso-8859-5-mac'
`iso-8859-5-unix'
     Modeline indicator: `ISO8/Cyr'.  A type `iso2022' coding system
     with `ascii' (G0) and `cyrillic-iso8859-5' (G1) initially invoked.

`iso-8859-7'
`iso-8859-7-dos'
`iso-8859-7-mac'
`iso-8859-7-unix'
     Modeline indicator: `Grk'.  A type `iso2022' coding system with
     `ascii' (G0) and `greek-iso8859-7' (G1) initially invoked.

`iso-8859-8'
`iso-8859-8-dos'
`iso-8859-8-mac'
`iso-8859-8-unix'
     Modeline indicator: `MIME/Hbrw'.  A type `iso2022' coding system
     with `ascii' (G0) and `hebrew-iso8859-8' (G1) initially invoked.

`iso-8859-9'
`iso-8859-9-dos'
`iso-8859-9-mac'
`iso-8859-9-unix'
     Modeline indicator: `MIME/Ltn-5'.  A type `iso2022' coding system
     with `ascii' (G0) and `latin-iso8859-9' (G1) initially invoked.

`koi8-r'
`koi8-r-dos'
`koi8-r-mac'
`koi8-r-unix'
     Modeline indicator: `KOI8'.  A type `ccl' coding-system used for
     KOI8-R, an encoding of the Cyrillic alphabet.

`shift_jis'
`shift_jis-dos'
`shift_jis-mac'
`shift_jis-unix'
     Modeline indicator: `Ja/SJIS'.  A type `shift-jis' coding-system
     implementing the Shift-JIS encoding for Japanese.  The underscore
     is to conform to the MIME charset implementing this encoding.

`tis-620'
`tis-620-dos'
`tis-620-mac'
`tis-620-unix'
     Modeline indicator: `TIS620'.  A type `ccl' encoding for Thai.  The
     external encoding is defined by TIS620, the internal encoding is
     peculiar to MULE, and called `thai-xtis'.

`viqr'
     Modeline indicator: `VIQR'.  A type `no-conversion' coding system
     with Unix EOL convention (ie, no conversion) using
     post-read-decode and pre-write-encode functions to translate the
     VIQR coding system for Vietnamese.

`viscii'
`viscii-dos'
`viscii-mac'
`viscii-unix'
     Modeline indicator: `VISCII'.  A type `ccl' coding-system used for
     VISCII 1.1 for Vietnamese.  Differs slightly from VSCII; VISCII is
     given priority by XEmacs.

`vscii'
`vscii-dos'
`vscii-mac'
`vscii-unix'
     Modeline indicator: `VSCII'.  A type `ccl' coding-system used for
     VSCII 1.1 for Vietnamese.  Differs slightly from VISCII, which is
     given priority by XEmacs.  Use `(prefer-coding-system
     'vietnamese-vscii)' to give priority to VSCII.


File: lispref.info,  Node: CCL,  Next: Category Tables,  Prev: Coding Systems,  Up: MULE

CCL
===

   CCL (Code Conversion Language) is a simple structured programming
language designed for character coding conversions.  A CCL program is
compiled to CCL code (represented by a vector of integers) and executed
by the CCL interpreter embedded in Emacs.  The CCL interpreter
implements a virtual machine with 8 registers called `r0', ..., `r7', a
number of control structures, and some I/O operators.  Take care when
using registers `r0' (used in implicit "set" statements) and especially
`r7' (used internally by several statements and operations, especially
for multiple return values and I/O operations).

   CCL is used for code conversion during process I/O and file I/O for
non-ISO2022 coding systems.  (It is the only way for a user to specify a
code conversion function.)  It is also used for calculating the code
point of an X11 font from a character code.  However, since CCL is
designed as a powerful programming language, it can be used for more
generic calculation where efficiency is demanded.  A combination of
three or more arithmetic operations can be calculated faster by CCL than
by Emacs Lisp.

   *Warning:*  The code in `src/mule-ccl.c' and
`$packages/lisp/mule-base/mule-ccl.el' is the definitive description of
CCL's semantics.  The previous version of this section contained
several typos and obsolete names left from earlier versions of MULE,
and many may remain.  (I am not an experienced CCL programmer; the few
who know CCL well find writing English painful.)

   A CCL program transforms an input data stream into an output data
stream.  The input stream, held in a buffer of constant bytes, is left
unchanged.  The buffer may be filled by an external input operation,
taken from an Emacs buffer, or taken from a Lisp string.  The output
buffer is a dynamic array of bytes, which can be written by an external
output operation, inserted into an Emacs buffer, or returned as a Lisp
string.

   A CCL program is a (Lisp) list containing two or three members.  The
first member is the "buffer magnification", which indicates the
required minimum size of the output buffer as a multiple of the input
buffer.  It is followed by the "main block" which executes while there
is input remaining, and an optional "EOF block" which is executed when
the input is exhausted.  Both the main block and the EOF block are CCL
blocks.

   A "CCL block" is either a CCL statement or a list of CCL statements.
A "CCL statement" is either a "set statement" (either an integer or an
"assignment", which is a list of a register to receive the assignment,
an assignment operator, and an expression) or a "control statement" (a
list starting with a keyword, whose allowable syntax depends on the
keyword).
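
   A minimal sketch of this layout, written as plain Lisp data and
using only the syntax described in this chapter.  The variable name
`my-ccl-copy-program' is an assumption made for illustration; the list
is uncompiled and is meant to show the shape of a program, not a
tested conversion.

```elisp
;; Hypothetical example: the variable name is not part of XEmacs.
(defconst my-ccl-copy-program
  '(1                 ; buffer magnification: output at least 1x input
    ;; Main block, here a list holding one `loop' statement.
    ((loop
      (read r0)       ; read one byte from the input into r0
      (write r0)      ; write the value of r0 to the output
      (repeat)))      ; continue the loop from its head
    ;; Optional EOF block, here a single statement.
    (end))
  "Uncompiled CCL program (a Lisp list) that copies input to output.")
```

   The three members are, in order, the buffer magnification, the main
block, and the EOF block.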

* Menu:

* CCL Syntax::          CCL program syntax in BNF notation.
* CCL Statements::      Semantics of CCL statements.
* CCL Expressions::     Operators and expressions in CCL.
* Calling CCL::         Running CCL programs.
* CCL Examples::        The encoding functions for Big5 and KOI-8.


File: lispref.info,  Node: CCL Syntax,  Next: CCL Statements,  Up: CCL

CCL Syntax
----------

   The full syntax of a CCL program in BNF notation:

CCL_PROGRAM :=
        (BUFFER_MAGNIFICATION
         CCL_MAIN_BLOCK
         [ CCL_EOF_BLOCK ])

BUFFER_MAGNIFICATION := integer
CCL_MAIN_BLOCK := CCL_BLOCK
CCL_EOF_BLOCK := CCL_BLOCK

CCL_BLOCK :=
        STATEMENT | (STATEMENT [STATEMENT ...])
STATEMENT :=
        SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE
        | CALL | END

SET :=
        (REG = EXPRESSION)
        | (REG ASSIGNMENT_OPERATOR EXPRESSION)
        | integer

EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)

IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
LOOP := (loop STATEMENT [STATEMENT ...])
BREAK := (break)
REPEAT :=
        (repeat)
        | (write-repeat [REG | integer | string])
        | (write-read-repeat REG [integer | ARRAY])
READ :=
        (read REG ...)
        | (read-if (REG OPERATOR ARG) CCL_BLOCK CCL_BLOCK)
        | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
WRITE :=
        (write REG ...)
        | (write EXPRESSION)
        | (write integer) | (write string) | (write REG ARRAY)
        | string
CALL := (call ccl-program-name)
END := (end)

REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
ARG := REG | integer
OPERATOR :=
        + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
        | < | > | == | <= | >= | != | de-sjis | en-sjis
ASSIGNMENT_OPERATOR :=
        += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
ARRAY := '[' integer ... ']'


File: lispref.info,  Node: CCL Statements,  Next: CCL Expressions,  Prev: CCL Syntax,  Up: CCL

CCL Statements
--------------

   The Emacs Code Conversion Language provides the following statement
types: "set", "if", "branch", "loop", "repeat", "break", "read",
"write", "call", and "end".

Set statement:
==============

   The "set" statement has three variants with the syntaxes `(REG =
EXPRESSION)', `(REG ASSIGNMENT_OPERATOR EXPRESSION)', and `INTEGER'.
The assignment operator variation of the "set" statement works the same
way as the corresponding C expression statement does.  The assignment
operators are `+=', `-=', `*=', `/=', `%=', `&=', `|=', `^=', `<<=',
and `>>=', and they have the same meanings as in C.  A "naked integer"
INTEGER is equivalent to a SET statement of the form `(r0 = INTEGER)'.
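
   The three variants can be sketched as a CCL block written as Lisp
data.  The variable name `my-ccl-set-block' is an assumption for
illustration, and the values in the comments follow from ordinary
arithmetic on the statements shown.

```elisp
;; Hypothetical example: a CCL block (a list of statements) showing
;; each variant of the "set" statement.
(defconst my-ccl-set-block
  '((r1 = 65)          ; (REG = EXPRESSION): r1 becomes 65
    (r1 += 1)          ; assignment operator, as in C: r1 becomes 66
    (r2 = (r1 * 2))    ; the expression is itself a list: r2 becomes 132
    (r2 -= 4)          ; r2 becomes 128
    0)                 ; naked integer: equivalent to (r0 = 0)
  "Uncompiled CCL block illustrating the set statement variants.")
```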

I/O statements:
===============

   The "read" statement takes one or more registers as arguments.  It
reads one byte (a C char) from the input into each register in turn.

   The "write" statement takes several forms.  In the form `(write REG
...)' it takes one or more registers as arguments and writes each in
turn to the output.  The integer in a register (interpreted as an
Emchar) is encoded to multibyte form (ie, Bufbytes) and written to the
current output buffer.  If it is less than 256, it is written as is.
The forms `(write EXPRESSION)' and `(write INTEGER)' are treated
analogously.
The form `(write STRING)' writes the constant string to the output.  A
"naked string" `STRING' is equivalent to the statement `(write
STRING)'.  The form `(write REG ARRAY)' writes the REGth element of the
ARRAY to the output.
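
   The forms of "read" and "write" can be put side by side in a CCL
block written as Lisp data.  The variable name `my-ccl-io-block' is an
assumption for illustration.  Array elements are given as integers, as
the grammar requires.

```elisp
;; Hypothetical example: one statement per I/O form described above.
(defconst my-ccl-io-block
  '((read r1 r2)               ; read one byte into r1, then one into r2
    (write r1 r2)              ; write r1, then r2
    (write (r1 + 1))           ; write the value of an expression
    (write 10)                 ; write a constant integer
    (write "abc")              ; write a constant string
    "abc"                      ; naked string: same as (write "abc")
    (r0 = 1)
    (write r0 [97 98 99]))     ; write the array element selected by r0
  "Uncompiled CCL block illustrating the read and write forms.")
```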

Conditional statements:
=======================

   The "if" statement takes an EXPRESSION, a CCL BLOCK, and an optional
SECOND CCL BLOCK as arguments.  If the EXPRESSION evaluates to
non-zero, the first CCL BLOCK is executed.  Otherwise, if there is a
SECOND CCL BLOCK, it is executed.

   The "read-if" variant of the "if" statement takes an EXPRESSION, a
CCL BLOCK, and an optional SECOND CCL BLOCK as arguments.  The
EXPRESSION must have the form `(REG OPERATOR OPERAND)' (where OPERAND is
a register or an integer).  The `read-if' statement first reads from
the input into the first register operand in the EXPRESSION, then
conditionally executes a CCL block just as the `if' statement does.

   The "branch" statement takes an EXPRESSION and one or more CCL
blocks as arguments.  The CCL blocks are treated as a zero-indexed
array, and the `branch' statement uses the EXPRESSION as the index of
the CCL block to execute.  Null CCL blocks may be used as no-ops,
continuing execution with the statement following the `branch'
statement in the containing CCL block.  Out-of-range values for the
EXPRESSION are also treated as no-ops.

   The "read-branch" variant of the "branch" statement takes a
REGISTER and one or more CCL blocks as arguments.  The `read-branch'
statement first reads from the input into the REGISTER, then uses the
value read as the index of the CCL block to execute, just as the
`branch' statement does.
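
   A sketch of the conditional statements as a CCL block written as
Lisp data.  The variable name `my-ccl-conditional-block' and the
strings written are assumptions for illustration only.

```elisp
;; Hypothetical example: if, branch and read-if in one block.
(defconst my-ccl-conditional-block
  '((read r0)
    ;; if: the first block runs when the expression is non-zero,
    ;; otherwise the optional second block runs.
    (if (r0 < 128)
        (write r0)
      (write 63))
    ;; branch: the expression selects a block by zero-based index.
    (r1 = (r0 % 3))
    (branch r1
      (write "zero")
      (write "one")
      (write "two"))
    ;; read-if: read into r2 first, then test it like `if'.
    (read-if (r2 == 10)
      (write "LF")
      (write "other")))
  "Uncompiled CCL block illustrating the conditional statements.")
```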

Loop control statements:
========================

   The "loop" statement creates a block with an implied jump from the
end of the block back to its head.  The loop is exited on a `break'
statement, and continued without executing the tail by a `repeat'
statement.

   The "break" statement, written `(break)', terminates the current
loop and continues with the next statement in the current block.

   The "repeat" statement has three variants, `repeat', `write-repeat',
and `write-read-repeat'.  Each continues the current loop from its
head, possibly after performing I/O.  `repeat' takes no arguments and
does no I/O before jumping.  `write-repeat' takes a single argument (a
register, an integer, or a string), writes it to the output, then jumps.
`write-read-repeat' takes one or two arguments.  The first must be a
register.  The second may be an integer or an array; if absent, it is
implicitly set to the first (register) argument.  `write-read-repeat'
writes its second argument to the output, then reads from the input
into the register, and finally jumps.  See the `write' and `read'
statements for the semantics of the I/O operations for each type of
argument.
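
   A sketch of the loop control statements as a complete program
written as Lisp data.  The variable name `my-ccl-loop-program' is an
assumption for illustration.  Going by the description above, the
program would copy bytes until it reads a zero byte or the input runs
out, but it is uncompiled and untested here.

```elisp
;; Hypothetical example: loop, break and write-read-repeat together.
(defconst my-ccl-loop-program
  '(1
    ((read r0)                   ; prime r0 with the first byte
     (loop
      (if (r0 == 0)
          (break))               ; leave the loop on a zero byte
      ;; With one argument: write r0, read the next byte into r0,
      ;; then jump back to the head of the loop.
      (write-read-repeat r0))))
  "Uncompiled CCL program illustrating loop control statements.")
```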

Other control statements:
=========================

   The "call" statement, written `(call CCL-PROGRAM-NAME)', executes a
CCL program as a subroutine.  It does not return a value to the caller,
but can modify the register status.

   The "end" statement, written `(end)', terminates the CCL program
successfully, and returns to the caller (which may be a CCL program).
It does not alter the status of the registers.