This is ../info/lispref.info, produced by makeinfo version 4.0b from
lispref/lispref.texi.

INFO-DIR-SECTION XEmacs Editor
START-INFO-DIR-ENTRY
* Lispref: (lispref).		XEmacs Lisp Reference Manual.
END-INFO-DIR-ENTRY

   Edition History:

   GNU Emacs Lisp Reference Manual Second Edition (v2.01), May 1993
   GNU Emacs Lisp Reference Manual Further Revised (v2.02), August 1993
   Lucid Emacs Lisp Reference Manual (for 19.10) First Edition, March 1994
   XEmacs Lisp Programmer's Manual (for 19.12) Second Edition, April 1995
   GNU Emacs Lisp Reference Manual v2.4, June 1995
   XEmacs Lisp Programmer's Manual (for 19.13) Third Edition, July 1995
   XEmacs Lisp Reference Manual (for 19.14 and 20.0) v3.1, March 1996
   XEmacs Lisp Reference Manual (for 19.15 and 20.1, 20.2, 20.3) v3.2,
      April, May, November 1997
   XEmacs Lisp Reference Manual (for 21.0) v3.3, April 1998

   Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995 Free Software
Foundation, Inc.  Copyright (C) 1994, 1995 Sun Microsystems, Inc.
Copyright (C) 1995, 1996 Ben Wing.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that the
entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the section entitled "GNU General Public License" is included
exactly as in the original, and provided that the entire resulting
derived work is distributed under the terms of a permission notice
identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that the section entitled "GNU General Public License"
may be included in a translation approved by the Free Software
Foundation instead of in the original English.


File: lispref.info,  Node: Internationalization Terminology,  Next: Charsets,  Up: MULE

Internationalization Terminology
================================

   In internationalization terminology, a string of text is divided up
into "characters", which are the printable units that make up the text.
A single character is (for example) a capital `A', the number `2', a
Katakana character, a Hangul character, a Kanji ideograph (an
"ideograph" is a "picture" character, such as is used in Japanese
Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
of such ideographs in each language), etc.  The basic property of a
character is that it is the smallest unit of text with semantic
significance in text processing.

   Human beings normally process text visually, so to a first
approximation a character may be identified with its shape.  Note that
the same character may be drawn by two different people (or in two
different fonts) in slightly different ways, although the "basic shape"
will be the same.  But consider the works of Scott Kim; human beings
can recognize hugely variant shapes as the "same" character.
Sometimes, especially where characters are extremely complicated to
write, completely different shapes may be defined as the "same"
character in national standards.  The Taiwanese variant of Hanzi is
generally the most complicated; over the centuries, the Japanese,
Koreans, and the People's Republic of China have adopted
simplifications of the shape, but the line of descent from the original
shape is recorded, and the meanings and pronunciation of different
forms of the same character are considered to be identical within each
language.  (Of course, it may take a specialist to recognize the
related form; the point is that the relations are standardized, despite
the differing shapes.)

   In some cases, the differences will be significant enough that it is
actually possible to identify two or more distinct shapes that all
represent the same character.  For example, the lowercase letters `a'
and `g' each have two distinct possible shapes--the `a' can optionally
have a curved tail projecting off the top, and the `g' can be formed
either of two loops, or of one loop and a tail hanging off the bottom.
Such distinct possible shapes of a character are called "glyphs".  The
important characteristic of two glyphs making up the same character is
that the choice between one or the other is purely stylistic and has no
linguistic effect on a word (this is the reason why a capital `A' and
lowercase `a' are different characters rather than different
glyphs--e.g.  `Aspen' is a city while `aspen' is a kind of tree).

   Note that "character" and "glyph" are used differently here than
elsewhere in XEmacs.

   A "character set" is essentially a set of related characters.  ASCII,
for example, is a set of 94 characters (or 128, if you count
non-printing characters).  Other character sets are ISO8859-1 (ASCII
plus various accented characters and other international symbols), JIS
X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208
(Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji),
GB2312 (Mainland Chinese Hanzi), etc.

   The definition of a character set will implicitly or explicitly give
it an "ordering", a way of assigning a number to each character in the
set.  For many character sets, there is a natural ordering, for example
the "ABC" ordering of the Roman letters.  But it is not clear whether
digits should come before or after the letters, and in fact different
European languages treat the ordering of accented characters
differently.  It is useful to use the natural order where available, of
course.  The number assigned to any particular character is called the
character's "code point".  (Within a given character set, each
character has a unique code point.  Thus the word "set" is ill-chosen;
different orderings of the same characters are different character sets.
Identifying characters is simple enough for alphabetic character sets,
but the difference in ordering can cause great headaches when the same
thousands of characters are used by different cultures as in the Hanzi.)

   A code point may be broken into a number of "position codes".  The
number of position codes required to index a particular character in a
character set is called the "dimension" of the character set.  For
practical purposes, a position code may be thought of as a byte-sized
index.  The set of printing ASCII characters, being relatively small,
is of dimension one, and each character in the set is indexed using a
single position code, in the range 1 through 94.  Use of this unusual
range, rather than the familiar 33 through 126, is an intentional
abstraction; to understand the programming issues you must break the
equation between character sets and encodings.

   JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
of dimension two - every character is indexed by two position codes,
each in the range 1 through 94.  (This number "94" is not a
coincidence; we shall see that the JIS position codes were chosen so
that JIS kanji could be encoded without using codes that in ASCII are
associated with device control functions.)  Note that the choice of the
range here is somewhat arbitrary.  You could just as easily index the
printing characters in ASCII using numbers in the range 0 through 93, 2
through 95, 3 through 96, etc.  In fact, the standardized _encoding_
for the ASCII _character set_ uses the range 33 through 126.

   An "encoding" is a way of numerically representing characters from
one or more character sets as a stream of like-sized numerical values
called "words"; typically these are 8-bit, 16-bit, or 32-bit
quantities.  If an encoding encompasses only one character set, then the
position codes for the characters in that character set could be used
directly.  (This is the case with the trivial cipher used by children,
assigning 1 to `A', 2 to `B', and so on.)  However, even with ASCII,
other considerations intrude.  For example, why are the upper- and
lowercase alphabets separated by 6 characters, so that the codes of
corresponding letters differ by exactly 32?  Why do the digits start
with `0' being assigned the code 48?  In both cases the reason is that
semantically interesting operations (case conversion and numerical
value extraction) become convenient masking operations.  Other
artificial aspects (the control characters being assigned to codes 0-31
and 127) are historical accidents.  (The use of 127 for `DEL' is an
artifact of the "punch once" nature of paper tape, for example.)

   Naive use of the position code is not possible, however, if more than
one character set is to be used in the encoding.  For example, printed
Japanese text typically requires characters from multiple character sets
- ASCII, JIS X 0208, and JIS X 0212, to be specific.  Each of these is
indexed using one or more position codes in the range 1 through 94, so
the position codes could not be used directly or there would be no way
to tell which character was meant.  Different Japanese encodings handle
this differently - JIS uses special escape characters to denote
different character sets; EUC sets the high bit of the position codes
for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
JIS X 0212 character; etc.  (JIS, EUC, and most of the other encodings
you will encounter in files are 7-bit or 8-bit encodings.  There is one
common 16-bit encoding, which is Unicode; this strives to represent all
the world's characters in a single large character set.  32-bit
encodings are often used internally in programs, such as XEmacs with
MULE support, to simplify the code that manipulates them; however, they
are not used externally because they are not very space-efficient.)

   A general method of handling text using multiple character sets
(whether for multilingual text, or simply text in an extremely
complicated single language like Japanese) is defined in the
international standard ISO 2022.  ISO 2022 will be discussed in more
detail later (*note ISO 2022::), but for now suffice it to say that text
needs control functions (at least spacing), and if escape sequences are
to be used, an escape sequence introducer.  It was decided to make all
text streams compatible with ASCII in the sense that the codes 0-31
(and 128-159) would always be control codes, never graphic characters,
and where defined by the character set the `SPC' character would be
assigned code 32, and `DEL' would be assigned 127.  Thus there are 94
code points remaining if 7 bits are used.  This is the reason that most
character sets are defined using position codes in the range 1 through
94.  Then ISO 2022 compatible encodings are produced by shifting the
position codes 1 to 94 into character codes 33 to 126, or (if 8 bit
codes are available) into character codes 161 to 254.
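
   The shift from position codes to character codes described above is
plain addition.  The following sketch uses a hypothetical function name
and is meant only to make the arithmetic concrete.

```elisp
(defun example-position-to-code (position &optional eight-bit)
  "Map POSITION (1 through 94) to an ISO 2022 style character code.
The result lies in 33-126, or in 161-254 if EIGHT-BIT is non-nil."
  (if (or (< position 1) (> position 94))
      (error "Position code out of range: %d" position))
  ;; 32 skips the control codes and SPC; 160 is 32 with the high bit set.
  (+ position (if eight-bit 160 32)))

(example-position-to-code 1)       ; 33
(example-position-to-code 94)      ; 126
(example-position-to-code 1 t)     ; 161
(example-position-to-code 94 t)    ; 254
```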

   Encodings are classified as either "modal" or "non-modal".  In a
"modal encoding", there are multiple states that the encoding can be
in, and the interpretation of the values in the stream depends on the
current global state of the encoding.  Special values in the encoding,
called "escape sequences", are used to change the global state.  JIS,
for example, is a modal encoding.  The bytes `ESC $ B' indicate that,
from then on, bytes are to be interpreted as position codes for JIS X
0208, rather than as ASCII.  This effect is cancelled using the bytes
`ESC ( B', which mean "switch from whatever the current state is to
ASCII".  To switch to JIS X 0212, the escape sequence `ESC $ ( D' is
used.  (Note that here, as is common, the escape sequences do in fact
begin with `ESC'.  This is not necessarily the case, however.  Some
encodings use control characters called "locking shifts" (effect
persists until cancelled) to switch character sets.)
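
   To make the notion of global state concrete, here is a minimal
sketch of a decoder for such a modal stream.  It is not how XEmacs
decodes JIS; the function name and the result format are invented for
this illustration, only the three escape sequences named above are
recognized, and malformed input is not diagnosed.  Bytes are given as
integers: `ESC' is 27, `$' is 36, `(' is 40, `B' is 66 and `D' is 68.

```elisp
(defun example-jis-decode (bytes)
  "Decode BYTES, a list of integers in a JIS-style modal encoding.
Return a list with one entry per character, each of the form
\(CHARSET CODE...), where CHARSET is `ascii', `jisx0208' or `jisx0212'."
  (let ((state 'ascii)                  ; the global state of the encoding
        (result nil))
    (while bytes
      (cond
       ;; ESC $ ( D : switch to JIS X 0212.
       ((and (= (car bytes) 27) (equal (nth 1 bytes) 36)
             (equal (nth 2 bytes) 40) (equal (nth 3 bytes) 68))
        (setq state 'jisx0212
              bytes (nthcdr 4 bytes)))
       ;; ESC $ B : switch to JIS X 0208.
       ((and (= (car bytes) 27) (equal (nth 1 bytes) 36)
             (equal (nth 2 bytes) 66))
        (setq state 'jisx0208
              bytes (nthcdr 3 bytes)))
       ;; ESC ( B : switch back to ASCII.
       ((and (= (car bytes) 27) (equal (nth 1 bytes) 40)
             (equal (nth 2 bytes) 66))
        (setq state 'ascii
              bytes (cdr (cdr (cdr bytes)))))
       ;; In the ASCII state every byte is one character.
       ((eq state 'ascii)
        (setq result (cons (list 'ascii (car bytes)) result)
              bytes (cdr bytes)))
       ;; In either Kanji state two bytes make up one character.
       (t
        (setq result (cons (list state (car bytes) (nth 1 bytes)) result)
              bytes (nthcdr 2 bytes)))))
    (nreverse result)))

;; `A', then one JIS X 0208 character, then `A' again:
(example-jis-decode '(65 27 36 66 36 34 27 40 66 65))
;; => ((ascii 65) (jisx0208 36 34) (ascii 65))

;; The same byte 65 means something else once the state has changed:
(example-jis-decode '(27 36 66 65 65))
;; => ((jisx0208 65 65))
```

   The second call shows why a modal stream cannot be interpreted from
an arbitrary starting point: the meaning of the byte 65 depends on an
escape sequence that may lie far back in the stream.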

   A "non-modal encoding" has no global state that extends past the
character currently being interpreted.  EUC, for example, is a
non-modal encoding.  Characters in JIS X 0208 are encoded by setting
the high bit of the position codes, and characters in JIS X 0212 are
encoded by doing the same but also prefixing the character with the
byte 0x8F.
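
   The EUC rule just described can be sketched as follows.  The function
name and the charset symbols are hypothetical.  Position codes are taken
in the range 1 through 94, so adding 160 both shifts them into the range
33 through 126 and sets the high bit.

```elisp
(defun example-euc-encode (charset p1 &optional p2)
  "Return the list of EUC bytes for one character.
CHARSET is `ascii', `jisx0208' or `jisx0212'.  For `ascii', P1 is the
ASCII code itself; otherwise P1 and P2 are position codes in 1-94."
  (cond ((eq charset 'ascii)
         (list p1))
        ((eq charset 'jisx0208)
         (list (+ p1 160) (+ p2 160)))
        ((eq charset 'jisx0212)
         (list 143 (+ p1 160) (+ p2 160)))    ; 143 is 0x8F
        (t
         (error "Unknown charset: %s" charset))))

(example-euc-encode 'ascii 65)         ; (65)
(example-euc-encode 'jisx0208 4 2)     ; (164 162), i.e. 0xA4 0xA2
(example-euc-encode 'jisx0212 16 1)    ; (143 176 161)
```

   Because every byte of a Kanji character has its high bit set, a byte
below 128 can always be read as ASCII without looking at its
neighbours.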

   The advantage of a modal encoding is that it is generally more
space-efficient, and is easily extensible because an essentially
arbitrary number of escape sequences can be created.  The
disadvantage, however, is that it is much more difficult to work with
if it is not being processed in a sequential manner.  In the non-modal
EUC encoding, for example, the byte 0x41 always refers to the letter
`A'; whereas in JIS, it could either be the letter `A', or one of the
two position codes in a JIS X 0208 character, or one of the two
position codes in a JIS X 0212 character.  Determining exactly which
one is meant could be difficult and time-consuming if the previous
bytes in the string have not already been processed, or impossible if
they are drawn from an external stream that cannot be rewound.

   Non-modal encodings are further divided into "fixed-width" and
"variable-width" formats.  A fixed-width encoding always uses the same
number of words per character, whereas a variable-width encoding does
not.  EUC is a good example of a variable-width encoding: one to three
bytes are used per character, depending on the character set.  16-bit
and 32-bit encodings are nearly always fixed-width, and this is in fact
one of the main reasons for using an encoding with a larger word size.
The advantages of fixed-width encodings should be obvious.  The
advantages of variable-width encodings are that they are generally more
space-efficient and allow for compatibility with existing 8-bit
encodings such as ASCII.  (By contrast, in fixed-width 16-bit Unicode,
ASCII characters are simply promoted to a 16-bit representation.  That
means that every ASCII character contains a `NUL' byte, so the standard
C string manipulation functions, which treat `NUL' as a terminator,
will lose badly in a fixed-width Unicode environment.)
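
   The effect of promoting ASCII to a 16-bit representation can be seen
in a small sketch.  The function name is hypothetical, and the high byte
is written first, which is an assumption made for this example only.

```elisp
(defun example-ascii-to-16-bit (codes)
  "Promote each ASCII code in the list CODES to two bytes, high byte first."
  (apply 'append
         (mapcar (lambda (code) (list 0 code)) codes)))

;; `H' is 72 and `i' is 105.
(example-ascii-to-16-bit '(72 105))    ; (0 72 0 105)
```

   Every second byte of the result is zero, which is exactly what a
routine expecting `NUL'-terminated 8-bit strings cannot cope with.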

   The bytes in an 8-bit encoding are often referred to as "octets"
rather than simply as bytes.  This terminology dates back to the days
before 8-bit bytes were universal, when some computers had 9-bit bytes,
others had 10-bit bytes, etc.


File: lispref.info,  Node: Charsets,  Next: MULE Characters,  Prev: Internationalization Terminology,  Up: MULE

Charsets
========

   A "charset" in MULE is an object that encapsulates a particular
character set as well as an ordering of those characters.  Charsets are
permanent objects and are named using symbols, like faces.

 - Function: charsetp object
     This function returns non-`nil' if OBJECT is a charset.

* Menu:

* Charset Properties::          Properties of a charset.
* Basic Charset Functions::     Functions for working with charsets.
* Charset Property Functions::  Functions for accessing charset properties.
* Predefined Charsets::         Predefined charset objects.


File: lispref.info,  Node: Charset Properties,  Next: Basic Charset Functions,  Up: Charsets

Charset Properties
------------------

   Charsets have the following properties:

`name'
     A symbol naming the charset.  Every charset must have a different
     name; this allows a charset to be referred to using its name
     rather than the actual charset object.

`doc-string'
     A documentation string describing the charset.

`registry'
     A regular expression matching the font registry field for this
     character set.  For example, both the `ascii' and `latin-iso8859-1'
     charsets use the registry `"ISO8859-1"'.  This field is used to
     choose an appropriate font when the user gives a general font
     specification such as `-*-courier-medium-r-*-140-*', i.e. a
     14-point upright medium-weight Courier font.

`dimension'
     Number of position codes used to index a character in the
     character set.  XEmacs/MULE can only handle character sets of
     dimension 1 or 2.  This property defaults to 1.

`chars'
     Number of characters in each dimension.  In XEmacs/MULE, the only
     allowed values are 94 or 96. (There are a couple of pre-defined
     character sets, such as ASCII, that do not follow this, but you
     cannot define new ones like this.) Defaults to 94.  Note that if
     the dimension is 2, the character set thus described is 94x94 or
     96x96.

`columns'
     Number of columns used to display a character in this charset.
     Only used in TTY mode. (Under X, the actual width of a character
     can be derived from the font used to display the characters.)  If
     unspecified, defaults to the dimension. (This is almost always the
     correct value, because character sets with dimension 2 are usually
     ideograph character sets, which need two columns to display the
     intricate ideographs.)

`direction'
     A symbol, either `l2r' (left-to-right) or `r2l' (right-to-left).
     Defaults to `l2r'.  This specifies the direction that the text
     should be displayed in, and will be left-to-right for most
     charsets but right-to-left for Hebrew and Arabic. (Right-to-left
     display is not currently implemented.)

`final'
     Final byte of the standard ISO 2022 escape sequence designating
     this charset.  Must be supplied.  Each combination of (DIMENSION,
     CHARS) defines a separate namespace for final bytes, and each
     charset within a particular namespace must have a different final
     byte.  Note that ISO 2022 restricts the final byte to the range
     0x30 - 0x7E if dimension == 1, and 0x30 - 0x5F if dimension == 2.
     Note also that final bytes in the range 0x30 - 0x3F are reserved
     for user-defined (not official) character sets.  For more
     information on ISO 2022, see *Note Coding Systems::.

`graphic'
     0 (use left half of font on output) or 1 (use right half of font on
     output).  Defaults to 0.  This specifies how to convert the
     position codes that index a character in a character set into an
     index into the font used to display the character set.  With
     `graphic' set to 0, position codes 33 through 126 map to font
     indices 33 through 126; with it set to 1, position codes 33
     through 126 map to font indices 161 through 254 (i.e. the same
     number but with the high bit set).  For example, for a font whose
     registry is ISO8859-1, the left half of the font (octets 0x20 -
     0x7F) is the `ascii' charset, while the right half (octets 0xA0 -
     0xFF) is the `latin-iso8859-1' charset.

`ccl-program'
     A compiled CCL program used to convert a character in this charset
     into an index into the font.  This is in addition to the `graphic'
     property.  If a CCL program is defined, the position codes of a
     character will first be processed according to `graphic' and then
     passed through the CCL program, with the resulting values used to
     index the font.

     This is used, for example, in the Big5 character set (used in
     Taiwan).  This character set is not ISO-2022-compliant, and its
     size (94x157) does not fit within the maximum 96x96 size of
     ISO-2022-compliant character sets.  As a result, XEmacs/MULE
     splits it (in a rather complex fashion, so as to group the most
     commonly used characters together) into two charset objects
     (`big5-1' and `big5-2'), each of size 94x94, and each charset
     object uses a CCL program to convert the modified position codes
     back into standard Big5 indices to retrieve a character from a
     Big5 font.

   Most of the above properties can only be set when the charset is
initialized, and cannot be changed later.  *Note Charset Property
Functions::.
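
   Two of the rules above, the `graphic' mapping and the permitted
range of `final', are simple enough to sketch directly.  The function
names are hypothetical; XEmacs performs these steps internally, and the
sketch only restates the ranges given in the property descriptions.

```elisp
(defun example-font-index (octet graphic)
  "Map OCTET (33 through 126) to a font index according to GRAPHIC.
With GRAPHIC 0 the octet is used as is; with 1 its high bit is set."
  (if (= graphic 1)
      (logior octet 128)
    octet))

(defun example-final-byte-ok-p (final dimension)
  "Return t if FINAL is in the range quoted above for DIMENSION."
  (and (>= final 48)                             ; 0x30
       (<= final (if (= dimension 1) 126 95))    ; 0x7E or 0x5F
       t))

(example-font-index 65 0)         ; 65
(example-font-index 65 1)         ; 193
(example-final-byte-ok-p 66 2)    ; t; 66 is `B'
(example-final-byte-ok-p 96 2)    ; nil
```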


File: lispref.info,  Node: Basic Charset Functions,  Next: Charset Property Functions,  Prev: Charset Properties,  Up: Charsets

Basic Charset Functions
-----------------------

 - Function: find-charset charset-or-name
     This function retrieves the charset of the given name.  If
     CHARSET-OR-NAME is a charset object, it is simply returned.
     Otherwise, CHARSET-OR-NAME should be a symbol.  If there is no
     such charset, `nil' is returned.  Otherwise the associated charset
     object is returned.

 - Function: get-charset name
     This function retrieves the charset of the given name.  Same as
     `find-charset' except an error is signalled if there is no such
     charset instead of returning `nil'.

 - Function: charset-list
     This function returns a list of the names of all defined charsets.

 - Function: make-charset name doc-string props
     This function defines a new character set.  This function is for
     use with MULE support.  NAME is a symbol, the name by which the
     character set is normally referred.  DOC-STRING is a string
     describing the character set.  PROPS is a property list,
     describing the specific nature of the character set.  The
     recognized properties are `registry', `dimension', `columns',
     `chars', `final', `graphic', `direction', and `ccl-program', as
     previously described.

 - Function: make-reverse-direction-charset charset new-name
     This function makes a charset equivalent to CHARSET but which goes
     in the opposite direction.  NEW-NAME is the name of the new
     charset.  The new charset is returned.

 - Function: charset-from-attributes dimension chars final &optional
          direction
     This function returns a charset with the given DIMENSION, CHARS,
     FINAL, and DIRECTION.  If DIRECTION is omitted, both directions
     will be checked (left-to-right will be returned if character sets
     exist for both directions).

 - Function: charset-reverse-direction-charset charset
     This function returns the charset (if any) with the same dimension,
     number of characters, and final byte as CHARSET, but which is
     displayed in the opposite direction.
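
   The following sketch shows how these functions might be used
together.  It assumes an XEmacs built with MULE support; the calls are
wrapped in functions so that merely loading the code has no effect.
The wrapper names, the charset name `example-private-94', and its
registry string are invented for illustration, and the final byte `9'
is assumed not to be in use already by another 94-character charset of
dimension 1.

```elisp
;; Requires XEmacs with MULE support when called.

(defun example-charset-lookups ()
  "Return a list of the results of some illustrative charset lookups."
  (list
   ;; A predefined charset: the charset object is returned.
   (find-charset 'japanese-jisx0208)
   ;; An unknown name: `find-charset' returns nil.
   (find-charset 'no-such-charset)
   ;; `get-charset' signals an error instead, so it is guarded here.
   (condition-case nil
       (get-charset 'no-such-charset)
     (error 'not-found))
   ;; Lookup by attributes: dimension 2, 94 characters, final byte `B'.
   (charset-from-attributes 2 94 ?B)))

(defun example-define-private-charset ()
  "Define and return a hypothetical private 94-character charset."
  (make-charset 'example-private-94
                "Hypothetical private character set, for illustration."
                '(registry "Example-0"
                  dimension 1
                  chars 94
                  final ?9
                  graphic 0)))
```

   Since charsets are permanent objects, `example-define-private-charset'
should be tried only in a scratch session.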


File: lispref.info,  Node: Charset Property Functions,  Next: Predefined Charsets,  Prev: Basic Charset Functions,  Up: Charsets

Charset Property Functions
--------------------------

   All of these functions accept either a charset name or charset
object.

 - Function: charset-property charset prop
     This function returns property PROP of CHARSET.  *Note Charset
     Properties::.

   Convenience functions are also provided for retrieving individual
properties of a charset.

 - Function: charset-name charset
     This function returns the name of CHARSET.  This will be a symbol.

 - Function: charset-description charset
     This function returns the documentation string of CHARSET.

 - Function: charset-registry charset
     This function returns the registry of CHARSET.

 - Function: charset-dimension charset
     This function returns the dimension of CHARSET.

 - Function: charset-chars charset
     This function returns the number of characters per dimension of
     CHARSET.

 - Function: charset-width charset
     This function returns the number of display columns per character
     (in TTY mode) of CHARSET.

 - Function: charset-direction charset
     This function returns the display direction of CHARSET--either
     `l2r' or `r2l'.

 - Function: charset-iso-final-char charset
     This function returns the final byte of the ISO 2022 escape
     sequence designating CHARSET.

 - Function: charset-iso-graphic-plane charset
     This function returns either 0 or 1, depending on whether the
     position codes of characters in CHARSET map to the left or right
     half of their font, respectively.

 - Function: charset-ccl-program charset
     This function returns the CCL program, if any, for converting
     position codes of characters in CHARSET into font indices.

   The only property of a charset that can currently be set after the
charset has been created is the CCL program.

 - Function: set-charset-ccl-program charset ccl-program
     This function sets the `ccl-program' property of CHARSET to
     CCL-PROGRAM.
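
   As a sketch of how the accessors combine, the following hypothetical
helper collects several properties of a charset into a property list.
It assumes an XEmacs built with MULE support when it is called.

```elisp
;; Requires XEmacs with MULE support when called.

(defun example-describe-charset (charset)
  "Return a property list summarizing CHARSET, a charset name or object."
  (list 'name      (charset-name charset)
        'dimension (charset-dimension charset)
        'chars     (charset-chars charset)
        'width     (charset-width charset)
        'direction (charset-direction charset)
        'final     (charset-iso-final-char charset)
        'graphic   (charset-iso-graphic-plane charset)))

;; For example:
;;   (example-describe-charset 'japanese-jisx0208)
```

   For a predefined charset, the values returned should agree with the
corresponding row of the table in *Note Predefined Charsets::.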


File: lispref.info,  Node: Predefined Charsets,  Prev: Charset Property Functions,  Up: Charsets

Predefined Charsets
-------------------

   The following charsets are predefined in the C code.

     Name                     Type  Fi Gr Dir Registry
     --------------------------------------------------------------
     ascii                    94    B  0  l2r ISO8859-1
     control-1                94       0  l2r ---
     latin-iso8859-1          96    A  1  l2r ISO8859-1
     latin-iso8859-2          96    B  1  l2r ISO8859-2
     latin-iso8859-3          96    C  1  l2r ISO8859-3
     latin-iso8859-4          96    D  1  l2r ISO8859-4
     cyrillic-iso8859-5       96    L  1  l2r ISO8859-5
     arabic-iso8859-6         96    G  1  r2l ISO8859-6
     greek-iso8859-7          96    F  1  l2r ISO8859-7
     hebrew-iso8859-8         96    H  1  r2l ISO8859-8
     latin-iso8859-9          96    M  1  l2r ISO8859-9
     thai-tis620              96    T  1  l2r TIS620
     katakana-jisx0201        94    I  1  l2r JISX0201.1976
     latin-jisx0201           94    J  0  l2r JISX0201.1976
     japanese-jisx0208-1978   94x94 @  0  l2r JISX0208.1978
     japanese-jisx0208        94x94 B  0  l2r JISX0208.19(83|90)
     japanese-jisx0212        94x94 D  0  l2r JISX0212
     chinese-gb2312           94x94 A  0  l2r GB2312
     chinese-cns11643-1       94x94 G  0  l2r CNS11643.1
     chinese-cns11643-2       94x94 H  0  l2r CNS11643.2
     chinese-big5-1           94x94 0  0  l2r Big5
     chinese-big5-2           94x94 1  0  l2r Big5
     korean-ksc5601           94x94 C  0  l2r KSC5601
     composite                96x96    0  l2r ---

   The following charsets are predefined in the Lisp code.

     Name                     Type  Fi Gr Dir Registry
     --------------------------------------------------------------
     arabic-digit             94    2  0  l2r MuleArabic-0
     arabic-1-column          94    3  0  r2l MuleArabic-1
     arabic-2-column          94    4  0  r2l MuleArabic-2
     sisheng                  94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
     chinese-cns11643-3       94x94 I  0  l2r CNS11643.1
     chinese-cns11643-4       94x94 J  0  l2r CNS11643.1
     chinese-cns11643-5       94x94 K  0  l2r CNS11643.1
     chinese-cns11643-6       94x94 L  0  l2r CNS11643.1
     chinese-cns11643-7       94x94 M  0  l2r CNS11643.1
     ethiopic                 94x94 2  0  l2r Ethio
     ascii-r2l                94    B  0  r2l ISO8859-1
     ipa                      96    0  1  l2r MuleIPA
     vietnamese-lower         96    1  1  l2r VISCII1.1
     vietnamese-upper         96    2  1  l2r VISCII1.1

   For all of the above charsets, the dimension and number of columns
are the same.

   Note that ASCII, Control-1, and Composite are handled specially.
This is why some of the fields are blank; and some of the filled-in
fields (e.g. the type) are not really accurate.


File: lispref.info,  Node: MULE Characters,  Next: Composite Characters,  Prev: Charsets,  Up: MULE

MULE Characters
===============

 - Function: make-char charset arg1 &optional arg2
     This function makes a multi-byte character from CHARSET and octets
     ARG1 and ARG2.

 - Function: char-charset character
     This function returns the character set of char CHARACTER.

 - Function: char-octet character &optional n
     This function returns the octet (i.e. position code) numbered N
     (should be 0 or 1) of char CHARACTER.  N defaults to 0 if omitted.

 - Function: find-charset-region start end &optional buffer
     This function returns a list of the charsets in the region between
     START and END.  BUFFER defaults to the current buffer if omitted.

 - Function: find-charset-string string
     This function returns a list of the charsets in STRING.
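
   A short sketch of a round trip through these functions follows.  It
assumes an XEmacs built with MULE support when called; the helper name
is hypothetical, and the octets 36 and 34 are an arbitrary pair within
the range 33 through 126.

```elisp
;; Requires XEmacs with MULE support when called.

(defun example-char-round-trip ()
  "Build a JIS X 0208 character and take it apart again.
Return a list of the charset name and the two octets."
  (let ((ch (make-char 'japanese-jisx0208 36 34)))
    (list (charset-name (char-charset ch))
          (char-octet ch 0)
          (char-octet ch 1))))

;; One would expect (example-char-round-trip) to give back the
;; charset name `japanese-jisx0208' and the two octets supplied.
```

   `charset-name' is applied to the result of `char-charset' so that
the sketch works whether that result is a charset name or a charset
object.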


File: lispref.info,  Node: Composite Characters,  Next: Coding Systems,  Prev: MULE Characters,  Up: MULE

Composite Characters
====================

   Composite characters are not yet completely implemented.

 - Function: make-composite-char string
     This function converts a string into a single composite character.
     The character is the result of overstriking all the characters in
     the string.

 - Function: composite-char-string character
     This function returns a string of the characters comprising a
     composite character.

 - Function: compose-region start end &optional buffer
     This function composes the characters in the region from START to
     END in BUFFER into one composite character.  The composite
     character replaces the composed characters.  BUFFER defaults to
     the current buffer if omitted.

 - Function: decompose-region start end &optional buffer
     This function decomposes any composite characters in the region
     from START to END in BUFFER.  This converts each composite
     character into one or more characters, the individual characters
     out of which the composite character was formed.  Non-composite
     characters are left as-is.  BUFFER defaults to the current buffer
     if omitted.


File: lispref.info,  Node: Coding Systems,  Next: CCL,  Prev: Composite Characters,  Up: MULE

Coding Systems
==============

   A coding system is an object that defines how text containing
multiple character sets is encoded into a stream of (typically 8-bit)
bytes.  The coding system is used to decode the stream into a series of
characters (which may be from multiple charsets) when the text is read
from a file or process, and is used to encode the text back into the
same format when it is written out to a file or process.

   For example, many ISO-2022-compliant coding systems (such as Compound
Text, which is used for inter-client data under the X Window System) use
escape sequences to switch between different charsets - Japanese Kanji,
for example, is invoked with `ESC $ ( B'; ASCII is invoked with `ESC (
B'; and Cyrillic is invoked with `ESC - L'.  See `make-coding-system'
for more information.

   Coding systems are normally identified using a symbol, and the
symbol is accepted in place of the actual coding system object whenever
a coding system is called for. (This is similar to how faces and
charsets work.)

 - Function: coding-system-p object
     This function returns non-`nil' if OBJECT is a coding system.

* Menu:

* Coding System Types::               Classifying coding systems.
* ISO 2022::                          An international standard for
                                        charsets and encodings.
* EOL Conversion::                    Dealing with different ways of denoting
                                        the end of a line.
* Coding System Properties::          Properties of a coding system.
* Basic Coding System Functions::     Working with coding systems.
* Coding System Property Functions::  Retrieving a coding system's properties.
* Encoding and Decoding Text::        Encoding and decoding text.
* Detection of Textual Encoding::     Determining how text is encoded.
* Big5 and Shift-JIS Functions::      Special functions for these non-standard
                                        encodings.
* Predefined Coding Systems::         Coding systems implemented by MULE.


File: lispref.info,  Node: Coding System Types,  Next: ISO 2022,  Up: Coding Systems

Coding System Types
-------------------

   The coding system type determines the basic algorithm XEmacs will
use to decode or encode a data stream.  Character encodings will be
converted to the MULE encoding, escape sequences processed, and newline
sequences converted to XEmacs's internal representation.  There are
three basic classes of coding system type: no-conversion, ISO-2022, and
special.

   No conversion allows you to look at the file's internal
representation.  Since XEmacs is basically a text editor, "no
conversion" does convert newline conventions by default.  (Use the
`binary' coding system if this is not desired.)

   ISO 2022 (*note ISO 2022::) is the basic international standard
regulating use of "coded character sets for the exchange of data", i.e.,
text streams.  ISO 2022 contains functions that make it possible to
encode text streams to comply with restrictions of the Internet mail
system and de facto restrictions of most file systems (e.g., use of the
separator character in file names).  Coding systems which are not ISO
2022 conformant can be difficult to handle.  Perhaps more important,
they are not adaptable to multilingual information interchange, with
the obvious exception of ISO 10646 (Unicode).  (Unicode is partially
supported by XEmacs with the addition of the Lisp package ucs-conv.)

   The special class of coding systems includes automatic detection,
CCL (a "little language" embedded as an interpreter, useful for
translating between variants of a single character set),
non-ISO-2022-conformant encodings like Unicode, Shift JIS, and Big5,
and MULE internal coding.  (NB: this list is based on XEmacs 21.2.
Terminology may vary slightly for other versions of XEmacs and for GNU
Emacs 20.)

`no-conversion'
     No conversion, for binary files, and a few special cases of
     non-ISO-2022 coding systems where conversion is done by hook
     functions (usually implemented in CCL).  On output, graphic
     characters that are not in ASCII or Latin-1 will be replaced by a
     `?'. (For a no-conversion-encoded buffer, these characters will
     only be present if you explicitly insert them.)

`iso2022'
     Any ISO-2022-compliant encoding.  Among others, this includes JIS
     (the Japanese encoding commonly used for e-mail), national
     variants of EUC (the standard Unix encoding for Japanese and other
     languages), and Compound Text (an encoding used in X11).  You can
     specify more specific information about the conversion with the
     FLAGS argument.

`ucs-4'
     ISO 10646 UCS-4 encoding.  A 31-bit fixed-width superset of
     Unicode.

`utf-8'
     ISO 10646 UTF-8 encoding.  A "file system safe" transformation
     format that can be used with both UCS-4 and Unicode.

`undecided'
     Automatic conversion.  XEmacs attempts to detect the coding system
     used in the file.

`shift-jis'
     Shift-JIS (a Japanese encoding commonly used in PC operating
     systems).

`big5'
     Big5 (the encoding commonly used for Chinese as written in Taiwan).

`ccl'
     The conversion is performed using a user-written pseudo-code
     program.  CCL (Code Conversion Language) is the name of this
     pseudo-code.  For example, CCL is used to map KOI8-R characters
     (an encoding for Russian Cyrillic) to ISO8859-5 (the form used
     internally by MULE).

`internal'
     Write out or read in the raw contents of the memory representing
     the buffer's text.  This is primarily useful for debugging
     purposes, and is only enabled when XEmacs has been compiled with
     `DEBUG_XEMACS' set (the `--debug' configure option).  *Warning*:
     Reading in a file using `internal' conversion can result in an
     internal inconsistency in the memory representing a buffer's text,
     which will produce unpredictable results and may cause XEmacs to
     crash.  Under normal circumstances you should never use `internal'
     conversion.


File: lispref.info,  Node: ISO 2022,  Next: EOL Conversion,  Prev: Coding System Types,  Up: Coding Systems

ISO 2022
--------

   This section briefly describes the ISO 2022 encoding standard.  A
more thorough treatment is available in the original document of ISO
2022 as well as various national standards (such as JIS X 0202).

   Character sets ("charsets") are classified into the following four
categories, according to the number of characters in the charset:
94-charset, 96-charset, 94x94-charset, and 96x96-charset.  This means
that although an ISO 2022 coding system may have variable width
characters, each charset used is fixed-width (in contrast to the MULE
character set and UTF-8, for example).

   ISO 2022 provides for switching between character sets via escape
sequences.  This switching is somewhat complicated, because ISO 2022
provides for both legacy applications like Internet mail that accept
only 7 significant bits in some contexts (RFC 822 headers, for example),
and more modern "8-bit clean" applications.  It also provides for
compact and transparent representation of languages like Japanese which
mix ASCII and a national script (even outside of computer programs).

   First, ISO 2022 codified prevailing practice by dividing the code
space into "control" and "graphic" regions.  The code points 0x00-0x1F
and 0x80-0x9F are reserved for "control characters", while "graphic
characters" must be assigned to code points in the regions 0x20-0x7F and
0xA0-0xFF.  The positions 0x20 and 0x7F are special, and under some
circumstances must be assigned the graphic character "ASCII SPACE" and
the control character "ASCII DEL" respectively.

   The various regions are given the names C0 (0x00-0x1F), GL
(0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF).  GL and GR stand for
"graphic left" and "graphic right", respectively, because of the
standard method of displaying graphic character sets in tables with the
high four bits indexing columns and the low four bits indexing rows.  I
don't find it very intuitive, but these are called "registers".
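
   As a minimal sketch of this division of the code space, the
following function names the region that contains a given byte value.
The function `example-iso2022-region' is hypothetical and is not part
of XEmacs; it only restates the ranges given above.

```elisp
;; Hypothetical helper, not part of XEmacs: return the name of the
;; ISO 2022 region that contains BYTE (an integer from 0 to 255).
(defun example-iso2022-region (byte)
  (cond ((and (>= byte #x00) (<= byte #x1F)) 'c0)
        ((and (>= byte #x20) (<= byte #x7F)) 'gl)
        ((and (>= byte #x80) (<= byte #x9F)) 'c1)
        ((and (>= byte #xA0) (<= byte #xFF)) 'gr)
        (t (error "Not a byte value: %S" byte))))

(example-iso2022-region #x41)   ; => gl  (ASCII `A')
(example-iso2022-region #x0A)   ; => c0  (newline)
(example-iso2022-region #xB0)   ; => gr
```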

   An ISO 2022-conformant encoding for a graphic character set must use
a fixed number of bytes per character, and the values must fit into a
single register; that is, each byte must range over either 0x20-0x7F, or
0xA0-0xFF.  It is not allowed to extend the range of the repertoire of a
character set by using both ranges at the same time.  This is why a
standard character set such as ISO 8859-1 is actually considered by ISO
2022 to be an aggregation of two character sets, ASCII and LATIN-1, and
why it is technically incorrect to refer to ISO 8859-1 as "Latin 1".
Also, a single character's bytes must all be drawn from the same
register; this is why Shift JIS (for Japanese) and Big 5 (for Chinese)
are not ISO 2022-compatible encodings.

   The reason for this restriction becomes clear when you attempt to
define an efficient, robust encoding for a language like Japanese.
Like ISO 8859, Japanese encodings are aggregations of several character
sets.  In practice, the vast majority of characters are drawn from the
"JIS Roman" character set (a derivative of ASCII; it won't hurt to
think of it as ASCII) and the JIS X 0208 standard "basic Japanese"
character set including not only ideographic characters ("kanji") but
syllabic Japanese characters ("kana"), a wide variety of symbols, and
many alphabetic characters (Roman, Greek, and Cyrillic) as well.
Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code
it is not suited to programming; thus the inclusion of ASCII in the
standard Japanese encodings.

   For normal Japanese text such as in newspapers, a broad repertoire of
approximately 3000 characters is used.  Evidently this won't fit into
one byte; two must be used.  But much of the text processed by Japanese
computers is computer source code, nearly all of which is ASCII.  A not
insignificant portion of ordinary text is English (as such or as
borrowed Japanese vocabulary) or other languages which can be represented
at least approximately in ASCII, as well.  It seems reasonable then to
represent ASCII in one byte, and JIS X 0208 in two.  And this is exactly
what the Extended Unix Code for Japanese (EUC-JP) does.  ASCII is
invoked to the GL register, and JIS X 0208 is invoked to the GR
register.  Thus, each byte can be tested for its character set by
looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
Furthermore, since control characters like newline can never be part of
a graphic character, even in the case of corruption in transmission the
stream will be resynchronized at every line break, on the order of 60-80
bytes.  This coding system requires no escape sequences or special
control codes to represent 99.9% of all Japanese text.
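
   The high-bit test described above can be sketched as follows.  The
function `example-euc-jp-split' is hypothetical and is not part of
XEmacs.  It assumes the input contains only ASCII and JIS X 0208
characters, takes the bytes as a list of integers, and does not verify
that the second byte of a two-byte character also has its high bit set.

```elisp
;; Hypothetical sketch: split BYTES, a list of integers taken from an
;; EUC-JP stream, into characters using only the high-bit test.
;; Only ASCII and JIS X 0208 are handled.
(defun example-euc-jp-split (bytes)
  (let ((result nil))
    (while bytes
      (if (= 0 (logand (car bytes) #x80))
          ;; High bit clear: a one-byte ASCII character.
          (progn
            (setq result (cons (list 'ascii (car bytes)) result))
            (setq bytes (cdr bytes)))
        ;; High bit set: first byte of a two-byte character.  Clearing
        ;; the high bit of each byte gives the JIS X 0208 code.
        (if (null (cdr bytes))
            (error "Truncated two-byte character"))
        (setq result (cons (list 'jisx0208
                                 (logand (car bytes) #x7F)
                                 (logand (car (cdr bytes)) #x7F))
                           result))
        (setq bytes (cdr (cdr bytes)))))
    (nreverse result)))

;; `A' followed by HIRAGANA LETTER A (0xA4 0xA2 in EUC-JP):
(example-euc-jp-split '(#x41 #xA4 #xA2))
;; => ((ascii 65) (jisx0208 36 34))
```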

   Note carefully the distinction between the character sets (ASCII and
JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022).
The JIS X 0208 character set is used in three different encodings for
Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
always clear), in EUC-JP it is invoked into GR (setting the high bit in
the process), and in Shift JIS the high bit may be set or reset, and the
significant bits are shifted within the 16-bit character so that the two
main character sets can coexist with a third (the "halfwidth katakana"
of JIS X 0201).  As the name implies, the ISO-2022-JP encoding is also a
version of the ISO-2022 coding system.
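
   To make the distinction concrete, here is a sketch that produces the
bytes for one JIS X 0208 character in each of the three encodings.  All
three function names are hypothetical and are not part of XEmacs.  The
Shift JIS arithmetic is the commonly cited conversion and is given only
as an illustration; it assumes both input bytes are in the range
0x21-0x7E.

```elisp
;; Hypothetical helpers.  J1 and J2 are the two bytes of a JIS X 0208
;; code, each assumed to be in the range 0x21-0x7E.

(defun example-jisx0208-as-gl (j1 j2)
  ;; ISO-2022-JP: the bytes are used as they are, high bit clear.
  (list j1 j2))

(defun example-jisx0208-as-gr (j1 j2)
  ;; EUC-JP: set the high bit of both bytes.
  (list (logior j1 #x80) (logior j2 #x80)))

(defun example-jisx0208-as-sjis (j1 j2)
  ;; Shift JIS: two rows of JIS X 0208 are packed into each lead byte,
  ;; and the second byte is offset according to the parity of the row.
  (let ((s1 (+ (/ (- j1 1) 2) (if (<= j1 #x5E) #x71 #xB1)))
        (s2 (if (= 1 (% j1 2))
                (+ j2 (if (< j2 #x60) #x1F #x20))
              (+ j2 #x7E))))
    (list s1 s2)))

;; HIRAGANA LETTER A is 0x24 0x22 in JIS X 0208:
(example-jisx0208-as-gl   #x24 #x22)   ; => (36 34)    0x24 0x22
(example-jisx0208-as-gr   #x24 #x22)   ; => (164 162)  0xA4 0xA2
(example-jisx0208-as-sjis #x24 #x22)   ; => (130 160)  0x82 0xA0
```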

   In order to systematically treat subsidiary character sets (like the
"halfwidth katakana" already mentioned, and the "supplementary kanji" of
JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
Unlike GL and GR, they are not logically distinguished by internal
format.  Instead, the process of "invocation" mentioned earlier is
broken into two steps: first, a character set is "designated" to one of
the registers G0-G3 by use of an "escape sequence" of the form:

             ESC [I] I F

   where I is an intermediate character or characters in the range
0x20-0x2F, and F, from the range 0x30-0x7E, is the final character
identifying this charset.  (Final characters in the range 0x30-0x3F are
reserved for private use and will never have a publicly registered
meaning.)

   Then that register is "invoked" to either GL or GR, either
automatically (designations to G0 normally involve invocation to GL as
well), or by use of shifting (affecting only the following character in
the data stream) or locking (effective until the next designation or
locking) control sequences.  An encoding conformant to ISO 2022 is
typically defined by designating the initial contents of the G0-G3
registers, specifying a 7- or 8-bit environment, and specifying whether
further designations will be recognized.

   Some examples of character sets and the registered final characters
F used to designate them:

94-charset
     ASCII (B), left (J) and right (I) half of JIS X 0201, ...

96-charset
     Latin-1 (A), Latin-2 (B), Latin-3 (C), ...

94x94-charset
     GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...

96x96-charset
     none for the moment

   The meanings of the various characters in these sequences, where not
specified by the ISO 2022 standard (such as the ESC character), are
assigned by "ECMA", the European Computer Manufacturers Association.

   The meanings of the intermediate characters are:

             $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
             ( [0x28]: designate to G0 a 94-charset whose final byte is F.
             ) [0x29]: designate to G1 a 94-charset whose final byte is F.
             * [0x2A]: designate to G2 a 94-charset whose final byte is F.
             + [0x2B]: designate to G3 a 94-charset whose final byte is F.
             , [0x2C]: designate to G0 a 96-charset whose final byte is F.
             - [0x2D]: designate to G1 a 96-charset whose final byte is F.
             . [0x2E]: designate to G2 a 96-charset whose final byte is F.
             / [0x2F]: designate to G3 a 96-charset whose final byte is F.

   The comma may be used in files read and written only by MULE, as a
MULE extension, but this is illegal in ISO 2022.  (The reason is that
in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned
the value SPACE, and 0x7F assigned the value DEL.)

   Here are examples of designations:

             ESC ( B :              designate to G0 ASCII
             ESC - A :              designate to G1 Latin-1
             ESC $ ( A or ESC $ A : designate to G0 GB2312
             ESC $ ( B or ESC $ B : designate to G0 JISX0208
             ESC $ ) C :            designate to G1 KSC5601

   (The short forms used to designate GB2312 and JIS X 0208 are for
backwards compatibility; the long forms are preferred.)
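
   The table of intermediate characters can be turned into a small
function that builds a long-form designation sequence.  This is only a
sketch: `example-iso2022-designation' is a hypothetical name, not an
XEmacs function, and it rejects the comma form because that form is a
MULE extension rather than part of ISO 2022.

```elisp
;; Hypothetical helper: build the escape sequence that designates a
;; charset to REGISTER (0-3).  SIZE is 94 or 96, DIMENSION is 1 or 2,
;; and FINAL is the final character as a one-character string.
(defun example-iso2022-designation (register size dimension final)
  (if (and (= register 0) (= size 96))
      (error "ISO 2022 does not allow a 96-charset in G0"))
  (let ((intermediate
         (nth register (if (= size 94)
                           '("(" ")" "*" "+")
                         '("," "-" "." "/")))))
    (concat "\e"
            (if (= dimension 2) "$" "")
            intermediate
            final)))

(example-iso2022-designation 0 94 1 "B")   ; ESC ( B    ASCII to G0
(example-iso2022-designation 1 96 1 "A")   ; ESC - A    Latin-1 to G1
(example-iso2022-designation 0 94 2 "B")   ; ESC $ ( B  JIS X 0208 to G0
(example-iso2022-designation 1 94 2 "C")   ; ESC $ ) C  KSC5601 to G1
```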

   To use a charset designated to G2 or G3, and to use a charset
designated to G1 in a 7-bit environment, you must explicitly invoke G1,
G2, or G3 into GL.  There are two types of invocation, Locking Shift
(in effect until the next locking shift) and Single Shift (one
character only).

   Locking Shift is done as follows:

             LS0 or SI (0x0F): invoke G0 into GL
             LS1 or SO (0x0E): invoke G1 into GL
             LS2:  invoke G2 into GL
             LS3:  invoke G3 into GL
             LS1R: invoke G1 into GR
             LS2R: invoke G2 into GR
             LS3R: invoke G3 into GR

   Single Shift is done as follows:

             SS2 or ESC N: invoke G2 into GL
             SS3 or ESC O: invoke G3 into GL

   The shift functions (such as LS1R and SS3) are represented by control
characters (from C1) in 8-bit environments and by escape sequences in
7-bit environments.

   (#### Ben says: I think the above is slightly incorrect.  It appears
that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N
and ESC O behave as indicated.  The above definitions will not parse
EUC-encoded text correctly, and it looks like the code in mule-coding.c
has similar problems.)
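
   As an illustration of locking shifts in a 7-bit environment, the
following sketch tracks which register is invoked into GL while reading
a stream.  The function `example-track-locking-shift' is hypothetical
and is not part of XEmacs.  It handles only SI (LS0) and SO (LS1), and
it assumes that G0 is invoked into GL at the start of the stream.

```elisp
;; Hypothetical sketch: walk BYTES (a list of integers from a 7-bit
;; stream) and pair every byte other than SI and SO with the register,
;; `g0' or `g1', that is invoked into GL when the byte is read.
(defun example-track-locking-shift (bytes)
  (let ((gl 'g0)
        (result nil))
    (while bytes
      (cond ((= (car bytes) #x0F) (setq gl 'g0))   ; SI: invoke G0 into GL
            ((= (car bytes) #x0E) (setq gl 'g1))   ; SO: invoke G1 into GL
            (t (setq result (cons (cons gl (car bytes)) result))))
      (setq bytes (cdr bytes)))
    (nreverse result)))

(example-track-locking-shift '(#x41 #x0E #x30 #x21 #x0F #x42))
;; => ((g0 . 65) (g1 . 48) (g1 . 33) (g0 . 66))
```

   The shift stays in effect for both bytes between SO and SI, which is
what distinguishes a locking shift from a single shift.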

   Evidently there are a lot of ISO-2022-compliant ways of encoding
multilingual text.  Many coding systems in actual use, such as X11's
Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX
Code), are variants of ISO 2022.

   In MULE, we characterize a version of ISO 2022 by the following
attributes:

  1. The character sets initially designated to G0 through G3.

  2. Whether short form designations are allowed for Japanese and
     Chinese.

  3. Whether ASCII should be designated to G0 before control characters.

  4. Whether ASCII should be designated to G0 at the end of line.

  5. 7-bit environment or 8-bit environment.

  6. Whether Locking Shifts are used or not.

  7. Whether to use ASCII or the variant JIS X 0201-1976-Roman.

  8. Whether to use JIS X 0208-1983 or the older version JIS X
     0208-1976.

   (The last two are only for Japanese.)

   By specifying these attributes, you can create any variant of ISO
2022.

   Here are several examples:

     ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
             1. G0 <- ASCII, G1..3 <- never used
             2. Yes.
             3. Yes.
             4. Yes.
             5. 7-bit environment
             6. No.
             7. Use ASCII
             8. Use JIS X 0208-1983
     
     ctext -- X11 Compound Text
             1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
             2. No.
             3. No.
             4. Yes.
             5. 8-bit environment.
             6. No.
             7. Use ASCII.
             8. Use JIS X 0208-1983.
     
     euc-china -- Chinese EUC.  Often called the "GB encoding", but that is
     technically incorrect.
             1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
             2. No.
             3. Yes.
             4. Yes.
             5. 8-bit environment.
             6. No.
             7. Use ASCII.
             8. Use JIS X 0208-1983.
     
     ISO-2022-KR -- Coding system used in Korean email.
             1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
             2. No.
             3. Yes.
             4. Yes.
             5. 7-bit environment.
             6. Yes.
             7. Use ASCII.
             8. Use JIS X 0208-1983.

   MULE creates all of these coding systems by default.


File: lispref.info,  Node: EOL Conversion,  Next: Coding System Properties,  Prev: ISO 2022,  Up: Coding Systems

EOL Conversion
--------------

`nil'
     Automatically detect the end-of-line type (LF, CRLF, or CR).  Also
     generate subsidiary coding systems named `NAME-unix', `NAME-dos',
     and `NAME-mac', that are identical to this coding system but have
     an EOL-TYPE value of `lf', `crlf', and `cr', respectively.

`lf'
     The end of a line is marked externally using ASCII LF.  Since this
     is also the way that XEmacs represents an end-of-line internally,
     specifying this option results in no end-of-line conversion.  This
     is the standard format for Unix text files.

`crlf'
     The end of a line is marked externally using ASCII CRLF.  This is
     the standard format for MS-DOS text files.

`cr'
     The end of a line is marked externally using ASCII CR.  This is the
     standard format for Macintosh text files.

`t'
     Automatically detect the end-of-line type but do not generate
     subsidiary coding systems.  (This value is converted to `nil' when
     stored internally, and `coding-system-property' will return `nil'.)
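
   The three explicit end-of-line types can be told apart with a very
simple test, sketched below.  The function `example-guess-eol-type' is
hypothetical and does not describe how XEmacs itself performs
detection; it looks only at the first convention found and returns the
corresponding EOL-TYPE symbol, or `nil' if the text contains no line
break.

```elisp
;; Hypothetical sketch, not the XEmacs detection algorithm: guess the
;; end-of-line convention used in STRING.
(defun example-guess-eol-type (string)
  (cond ((string-match "\r\n" string) 'crlf)   ; MS-DOS
        ((string-match "\r" string) 'cr)       ; Macintosh
        ((string-match "\n" string) 'lf)       ; Unix
        (t nil)))

(example-guess-eol-type "one\r\ntwo\r\n")   ; => crlf
(example-guess-eol-type "one\rtwo\r")       ; => cr
(example-guess-eol-type "one\ntwo\n")       ; => lf
```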