   1 @c -*-texinfo-*-
   2 @c This is part of the XEmacs Lisp Reference Manual.
   3 @c Copyright (C) 1996 Ben Wing.
   4 @c See the file lispref.texi for copying conditions.
   5 @setfilename ../../info/internationalization.info
   6 @node MULE, Tips, Internationalization, top
   7 @chapter MULE
   8
   9   @dfn{MULE} is the name originally given to the version of GNU Emacs
  10 extended for multi-lingual (and in particular Asian-language) support.
  11 ``MULE'' is short for ``MUlti-Lingual Emacs''.  It is an extension and
  12 complete rewrite of Nemacs (``Nihon Emacs'' where ``Nihon'' is the
  13 Japanese word for ``Japan''), which only provided support for Japanese.
  14 XEmacs refers to its multi-lingual support as @dfn{MULE support} since
  15 it is based on @dfn{MULE}.
  16
  17 @menu
  18 * Internationalization Terminology::
  19                         Definition of various internationalization terms.
  20 * Charsets::            Sets of related characters.
  21 * MULE Characters::     Working with characters in XEmacs/MULE.
  22 * Composite Characters:: Making new characters by overstriking other ones.
  23 * Coding Systems::      Ways of representing a string of chars using integers.
  24 * CCL::                 A special language for writing fast converters.
  25 * Category Tables::     Subdividing charsets into groups.
  26 @end menu
  27
  28 @node Internationalization Terminology, Charsets, , MULE
  29 @section Internationalization Terminology
  30
  31   In internationalization terminology, a string of text is divided up
  32 into @dfn{characters}, which are the printable units that make up the
  33 text.  A single character is (for example) a capital @samp{A}, the
  34 number @samp{2}, a Katakana character, a Hangul character, a Kanji
  35 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is
  36 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there
  37 are thousands of such ideographs in each language), etc.  The basic
  38 property of a character is that it is the smallest unit of text with
  39 semantic significance in text processing.
  40
  41   Human beings normally process text visually, so to a first approximation
  42 a character may be identified with its shape.  Note that the same
  43 character may be drawn by two different people (or in two different
fonts) in slightly different ways, although the ``basic shape'' will be the
same.  But consider the works of Scott Kim; human beings can recognize
hugely variant shapes as the ``same'' character.  Sometimes, especially
where characters are extremely complicated to write, completely
different shapes may be defined as the ``same'' character in national
  49 standards.  The Taiwanese variant of Hanzi is generally the most
  50 complicated; over the centuries, the Japanese, Koreans, and the People's
  51 Republic of China have adopted simplifications of the shape, but the
  52 line of descent from the original shape is recorded, and the meanings
  53 and pronunciation of different forms of the same character are
  54 considered to be identical within each language.  (Of course, it may
  55 take a specialist to recognize the related form; the point is that the
  56 relations are standardized, despite the differing shapes.)
  57
  58   In some cases, the differences will be significant enough that it is
  59 actually possible to identify two or more distinct shapes that both
  60 represent the same character.  For example, the lowercase letters
  61 @samp{a} and @samp{g} each have two distinct possible shapes---the
  62 @samp{a} can optionally have a curved tail projecting off the top, and
  63 the @samp{g} can be formed either of two loops, or of one loop and a
  64 tail hanging off the bottom.  Such distinct possible shapes of a
  65 character are called @dfn{glyphs}.  The important characteristic of two
  66 glyphs making up the same character is that the choice between one or
  67 the other is purely stylistic and has no linguistic effect on a word
  68 (this is the reason why a capital @samp{A} and lowercase @samp{a}
  69 are different characters rather than different glyphs---e.g.
  70 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
  71
  72   Note that @dfn{character} and @dfn{glyph} are used differently
  73 here than elsewhere in XEmacs.
  74
  75   A @dfn{character set} is essentially a set of related characters.  ASCII,
  76 for example, is a set of 94 characters (or 128, if you count
  77 non-printing characters).  Other character sets are ISO8859-1 (ASCII
  78 plus various accented characters and other international symbols),
  79 JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208
  80 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji),
  81 GB2312 (Mainland Chinese Hanzi), etc.
  82
  83   The definition of a character set will implicitly or explicitly give
  84 it an @dfn{ordering}, a way of assigning a number to each character in
  85 the set.  For many character sets, there is a natural ordering, for
  86 example the ``ABC'' ordering of the Roman letters.  But it is not clear
  87 whether digits should come before or after the letters, and in fact
  88 different European languages treat the ordering of accented characters
  89 differently.  It is useful to use the natural order where available, of
  90 course.  The number assigned to any particular character is called the
  91 character's @dfn{code point}.  (Within a given character set, each
character has a unique code point.  Thus the word ``set'' is ill-chosen;
  93 different orderings of the same characters are different character sets.
  94 Identifying characters is simple enough for alphabetic character sets,
  95 but the difference in ordering can cause great headaches when the same
  96 thousands of characters are used by different cultures as in the Hanzi.)
  97
  98   A code point may be broken into a number of @dfn{position codes}.  The
  99 number of position codes required to index a particular character in a
 100 character set is called the @dfn{dimension} of the character set.  For
 101 practical purposes, a position code may be thought of as a byte-sized
index.  The set of printing characters of ASCII, being relatively small,
is of dimension one, and each character in the set is
indexed using a single position code, in the range 1 through 94.  Use of
 105 this unusual range, rather than the familiar 33 through 126, is an
 106 intentional abstraction; to understand the programming issues you must
 107 break the equation between character sets and encodings.
 108
 109   JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
 110 of dimension two -- every character is indexed by two position codes,
 111 each in the range 1 through 94.  (This number ``94'' is not a
 112 coincidence; we shall see that the JIS position codes were chosen so
 113 that JIS kanji could be encoded without using codes that in ASCII are
 114 associated with device control functions.)  Note that the choice of the
 115 range here is somewhat arbitrary.  You could just as easily index the
 116 printing characters in ASCII using numbers in the range 0 through 93, 2
 117 through 95, 3 through 96, etc.  In fact, the standardized
 118 @emph{encoding} for the ASCII @emph{character set} uses the range 33
 119 through 126.
 120
 121   An @dfn{encoding} is a way of numerically representing characters from
 122 one or more character sets into a stream of like-sized numerical values
 123 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
 124 quantities.  If an encoding encompasses only one character set, then the
 125 position codes for the characters in that character set could be used
directly.  (This is the case with the trivial cipher used by children,
assigning 1 to @samp{A}, 2 to @samp{B}, and so on.)  However, even with ASCII,
other considerations intrude.  For example, why are the upper- and
lowercase alphabets separated by 6 characters, so that corresponding
letters are exactly 32 codes apart?  Why do the digits start
with @samp{0} being assigned the code 48?  In both cases the reason is that
semantically interesting operations (case conversion and numerical value
extraction) become convenient masking operations.  Other artificial aspects
(the control characters being assigned to codes 0--31 and 127) are historical
accidents.  (The use of 127 for @samp{DEL} is an artifact of the ``punch
once'' nature of paper tape, for example.)
 136
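  As a rough illustration of the masking operations mentioned above, the
following sketch works on plain integer codes.  The function names
@code{ascii-toggle-case} and @code{ascii-digit-value} are hypothetical
helpers invented for this example; they are not part of XEmacs.

@example
;; Hypothetical helpers, for illustration only.
(defun ascii-toggle-case (code)
  "Toggle the case of CODE, the integer code of an ASCII letter."
  ;; Upper- and lowercase letters differ only in the bit with value 32.
  (logxor code 32))

(defun ascii-digit-value (code)
  "Return the numeric value of CODE, the integer code of an ASCII digit."
  ;; The low four bits of a digit's code are its numeric value.
  (logand code 15))

(ascii-toggle-case 65)          ; 65 is `A'; @result{} 97, i.e. `a'
(ascii-digit-value 55)          ; 55 is `7'; @result{} 7
@end example
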
 137   Naive use of the position code is not possible, however, if more than
 138 one character set is to be used in the encoding.  For example, printed
 139 Japanese text typically requires characters from multiple character sets
 140 -- ASCII, JIS X 0208, and JIS X 0212, to be specific.  Each of these is
 141 indexed using one or more position codes in the range 1 through 94, so
 142 the position codes could not be used directly or there would be no way
 143 to tell which character was meant.  Different Japanese encodings handle
 144 this differently -- JIS uses special escape characters to denote
 145 different character sets; EUC sets the high bit of the position codes
 146 for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
 147 JIS X 0212 character; etc.  (JIS, EUC, and most of the other encodings
 148 you will encounter in files are 7-bit or 8-bit encodings.  There is one
 149 common 16-bit encoding, which is Unicode; this strives to represent all
 150 the world's characters in a single large character set.  32-bit
 151 encodings are often used internally in programs, such as XEmacs with
 152 MULE support, to simplify the code that manipulates them; however, they
 153 are not used externally because they are not very space-efficient.)
 154
 155   A general method of handling text using multiple character sets
 156 (whether for multilingual text, or simply text in an extremely
 157 complicated single language like Japanese) is defined in the
 158 international standard ISO 2022.  ISO 2022 will be discussed in more
 159 detail later (@pxref{ISO 2022}), but for now suffice it to say that text
 160 needs control functions (at least spacing), and if escape sequences are
 161 to be used, an escape sequence introducer.  It was decided to make all
 162 text streams compatible with ASCII in the sense that the codes 0--31
(and 128--159) would always be control codes, never graphic characters,
 164 and where defined by the character set the @samp{SPC} character would be
 165 assigned code 32, and @samp{DEL} would be assigned 127.  Thus there are
 166 94 code points remaining if 7 bits are used.  This is the reason that
 167 most character sets are defined using position codes in the range 1
 168 through 94.  Then ISO 2022 compatible encodings are produced by shifting
 169 the position codes 1 to 94 into character codes 33 to 126, or (if 8 bit
 170 codes are available) into character codes 161 to 254.
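
  The shift described above is simple arithmetic.  The following sketch
assumes position codes are given as a list of integers in the range 1
through 94; @code{position-codes-to-gl} and @code{position-codes-to-gr}
are hypothetical names used only for this illustration.

@example
;; Hypothetical helpers, for illustration only.
(defun position-codes-to-gl (position-codes)
  "Shift POSITION-CODES (each 1..94) into the 7-bit range 33..126."
  (mapcar (lambda (p) (+ p 32)) position-codes))

(defun position-codes-to-gr (position-codes)
  "Shift POSITION-CODES (each 1..94) into the 8-bit range 161..254."
  (mapcar (lambda (p) (+ p 160)) position-codes))

(position-codes-to-gl '(16 1))  ; @result{} (48 33)
(position-codes-to-gr '(16 1))  ; @result{} (176 161)
(position-codes-to-gl '(94))    ; @result{} (126)
@end example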
 171
 172   Encodings are classified as either @dfn{modal} or @dfn{non-modal}.  In
 173 a @dfn{modal encoding}, there are multiple states that the encoding can
 174 be in, and the interpretation of the values in the stream depends on the
 175 current global state of the encoding.  Special values in the encoding,
 176 called @dfn{escape sequences}, are used to change the global state.
 177 JIS, for example, is a modal encoding.  The bytes @samp{ESC $ B}
 178 indicate that, from then on, bytes are to be interpreted as position
 179 codes for JIS X 0208, rather than as ASCII.  This effect is cancelled
 180 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
current state is to ASCII''.  To switch to JIS X 0212, the escape
sequence @samp{ESC $ ( D} is used.  (Note that here, as is common, the escape
sequences do in fact begin with @samp{ESC}.  This is not necessarily the
case, however.  Some encodings use control characters called ``locking
shifts'' (whose effect persists until cancelled) to switch character sets.)
 186
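  A minimal sketch of the modal behavior just described: the bytes of a
JIS X 0208 run are bracketed by the two escape sequences.  Bytes are
represented as a list of integers, and @code{jis-wrap-jisx0208} is a
hypothetical helper, not an XEmacs function.

@example
;; Hypothetical helper, for illustration only.
(defun jis-wrap-jisx0208 (bytes)
  "Bracket BYTES, a list of 7-bit JIS X 0208 codes, with JIS escapes."
  (append '(27 36 66)           ; ESC $ B: switch to JIS X 0208
          bytes
          '(27 40 66)))         ; ESC ( B: switch back to ASCII

(jis-wrap-jisx0208 '(48 33))
;; @result{} (27 36 66 48 33 27 40 66)
@end example

Note that the two bytes 48 and 33 would read as the ASCII characters
@samp{0} and @samp{!} if the preceding escape sequence were lost, which
is exactly the dependence on global state that makes the encoding modal.
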
 187   A @dfn{non-modal encoding} has no global state that extends past the
 188 character currently being interpreted.  EUC, for example, is a
 189 non-modal encoding.  Characters in JIS X 0208 are encoded by setting
 190 the high bit of the position codes, and characters in JIS X 0212 are
 191 encoded by doing the same but also prefixing the character with the
 192 byte 0x8F.
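
  The EUC rules above can be sketched as follows.  The arguments are
assumed to be the 7-bit codes (33 through 126) of a two-byte character,
and the names @code{euc-encode-jisx0208} and @code{euc-encode-jisx0212}
are hypothetical.

@example
;; Hypothetical helpers, for illustration only.
(defun euc-encode-jisx0208 (c1 c2)
  "Return the EUC bytes for the JIS X 0208 character with codes C1 C2."
  (list (logior c1 128) (logior c2 128)))      ; set the high bit

(defun euc-encode-jisx0212 (c1 c2)
  "Return the EUC bytes for the JIS X 0212 character with codes C1 C2."
  (list 143                                    ; the prefix byte 0x8F
        (logior c1 128) (logior c2 128)))

(euc-encode-jisx0208 48 33)     ; @result{} (176 161)
(euc-encode-jisx0212 48 33)     ; @result{} (143 176 161)
@end example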
 193
  The advantage of a modal encoding is that it is generally more
space-efficient, and is easily extensible because an essentially
arbitrary number of escape sequences can be created.  The
 197 disadvantage, however, is that it is much more difficult to work with
 198 if it is not being processed in a sequential manner.  In the non-modal
 199 EUC encoding, for example, the byte 0x41 always refers to the letter
 200 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
 201 one of the two position codes in a JIS X 0208 character, or one of the
 202 two position codes in a JIS X 0212 character.  Determining exactly which
 203 one is meant could be difficult and time-consuming if the previous
 204 bytes in the string have not already been processed, or impossible if
 205 they are drawn from an external stream that cannot be rewound.
 206
 207   Non-modal encodings are further divided into @dfn{fixed-width} and
 208 @dfn{variable-width} formats.  A fixed-width encoding always uses
 209 the same number of words per character, whereas a variable-width
 210 encoding does not.  EUC is a good example of a variable-width
 211 encoding: one to three bytes are used per character, depending on
 212 the character set.  16-bit and 32-bit encodings are nearly always
 213 fixed-width, and this is in fact one of the main reasons for using
 214 an encoding with a larger word size.  The advantages of fixed-width
 215 encodings should be obvious.  The advantages of variable-width
 216 encodings are that they are generally more space-efficient and allow
 217 for compatibility with existing 8-bit encodings such as ASCII.  (For
 218 example, in Unicode ASCII characters are simply promoted to a 16-bit
 219 representation.  That means that every ASCII character contains a
 220 @samp{NUL} byte; evidently all of the standard string manipulation
 221 functions will lose badly in a fixed-width Unicode environment.)
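
  To make the variable-width property of EUC concrete, the following
sketch determines the length of a character from its first byte alone.
It covers only the cases described in this section;
@code{euc-char-length} is a hypothetical name.

@example
;; Hypothetical helper, for illustration only.
(defun euc-char-length (lead)
  "Return the number of bytes in the EUC character starting with LEAD."
  (cond ((< lead 128) 1)        ; high bit clear: ASCII
        ((= lead 143) 3)        ; 0x8F prefix: JIS X 0212
        (t 2)))                 ; otherwise a two-byte character

(euc-char-length 65)            ; @result{} 1
(euc-char-length 176)           ; @result{} 2
(euc-char-length 143)           ; @result{} 3
@end example

Because the length follows from the first byte, no state needs to be
carried from one character to the next.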
 222
 223   The bytes in an 8-bit encoding are often referred to as @dfn{octets}
 224 rather than simply as bytes.  This terminology dates back to the days
 225 before 8-bit bytes were universal, when some computers had 9-bit bytes,
 226 others had 10-bit bytes, etc.
 227
 228 @node Charsets, MULE Characters, Internationalization Terminology, MULE
 229 @section Charsets
 230
 231   A @dfn{charset} in MULE is an object that encapsulates a
 232 particular character set as well as an ordering of those characters.
 233 Charsets are permanent objects and are named using symbols, like
 234 faces.
 235
 236 @defun charsetp object
 237 This function returns non-@code{nil} if @var{object} is a charset.
 238 @end defun
 239
 240 @menu
 241 * Charset Properties::          Properties of a charset.
 242 * Basic Charset Functions::     Functions for working with charsets.
 243 * Charset Property Functions::  Functions for accessing charset properties.
 244 * Predefined Charsets::         Predefined charset objects.
 245 @end menu
 246
 247 @node Charset Properties, Basic Charset Functions, , Charsets
 248 @subsection Charset Properties
 249
 250   Charsets have the following properties:
 251
 252 @table @code
 253 @item name
 254 A symbol naming the charset.  Every charset must have a different name;
 255 this allows a charset to be referred to using its name rather than
 256 the actual charset object.
 257 @item doc-string
 258 A documentation string describing the charset.
 259 @item registry
 260 A regular expression matching the font registry field for this character
 261 set.  For example, both the @code{ascii} and @code{latin-iso8859-1}
 262 charsets use the registry @code{"ISO8859-1"}.  This field is used to
 263 choose an appropriate font when the user gives a general font
 264 specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a
 265 14-point upright medium-weight Courier font.
 266 @item dimension
 267 Number of position codes used to index a character in the character set.
 268 XEmacs/MULE can only handle character sets of dimension 1 or 2.
 269 This property defaults to 1.
 270 @item chars
 271 Number of characters in each dimension.  In XEmacs/MULE, the only
 272 allowed values are 94 or 96. (There are a couple of pre-defined
 273 character sets, such as ASCII, that do not follow this, but you cannot
 274 define new ones like this.) Defaults to 94.  Note that if the dimension
 275 is 2, the character set thus described is 94x94 or 96x96.
 276 @item columns
 277 Number of columns used to display a character in this charset.
 278 Only used in TTY mode. (Under X, the actual width of a character
 279 can be derived from the font used to display the characters.)
 280 If unspecified, defaults to the dimension. (This is almost
 281 always the correct value, because character sets with dimension 2
 282 are usually ideograph character sets, which need two columns to
 283 display the intricate ideographs.)
 284 @item direction
 285 A symbol, either @code{l2r} (left-to-right) or @code{r2l}
 286 (right-to-left).  Defaults to @code{l2r}.  This specifies the
 287 direction that the text should be displayed in, and will be
 288 left-to-right for most charsets but right-to-left for Hebrew
 289 and Arabic. (Right-to-left display is not currently implemented.)
 290 @item final
 291 Final byte of the standard ISO 2022 escape sequence designating this
 292 charset.  Must be supplied.  Each combination of (@var{dimension},
 293 @var{chars}) defines a separate namespace for final bytes, and each
 294 charset within a particular namespace must have a different final byte.
 295 Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
 296 dimension == 1, and 0x30 - 0x5F if dimension == 2.  Note also that final
 297 bytes in the range 0x30 - 0x3F are reserved for user-defined (not
 298 official) character sets.  For more information on ISO 2022, see @ref{Coding
 299 Systems}.
 300 @item graphic
 301 0 (use left half of font on output) or 1 (use right half of font on
 302 output).  Defaults to 0.  This specifies how to convert the position
 303 codes that index a character in a character set into an index into the
 304 font used to display the character set.  With @code{graphic} set to 0,
 305 position codes 33 through 126 map to font indices 33 through 126; with
 306 it set to 1, position codes 33 through 126 map to font indices 161
 307 through 254 (i.e. the same number but with the high bit set).  For
 308 example, for a font whose registry is ISO8859-1, the left half of the
 309 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right
 310 half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset.
 311 @item ccl-program
 312 A compiled CCL program used to convert a character in this charset into
 313 an index into the font.  This is in addition to the @code{graphic}
 314 property.  If a CCL program is defined, the position codes of a
 315 character will first be processed according to @code{graphic} and
 316 then passed through the CCL program, with the resulting values used
 317 to index the font.
 318
 319   This is used, for example, in the Big5 character set (used in Taiwan).
 320 This character set is not ISO-2022-compliant, and its size (94x157) does
 321 not fit within the maximum 96x96 size of ISO-2022-compliant character
 322 sets.  As a result, XEmacs/MULE splits it (in a rather complex fashion,
 323 so as to group the most commonly used characters together) into two
 324 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94,
 325 and each charset object uses a CCL program to convert the modified
 326 position codes back into standard Big5 indices to retrieve a character
 327 from a Big5 font.
 328 @end table
 329
 330   Most of the above properties can only be set when the charset is
 331 initialized, and cannot be changed later.
 332 @xref{Charset Property Functions}.
 333
 334 @node Basic Charset Functions, Charset Property Functions, Charset Properties, Charsets
 335 @subsection Basic Charset Functions
 336
 337 @defun find-charset charset-or-name
 338 This function retrieves the charset of the given name.  If
 339 @var{charset-or-name} is a charset object, it is simply returned.
 340 Otherwise, @var{charset-or-name} should be a symbol.  If there is no
 341 such charset, @code{nil} is returned.  Otherwise the associated charset
 342 object is returned.
 343 @end defun
 344
 345 @defun get-charset name
 346 This function retrieves the charset of the given name.  Same as
 347 @code{find-charset} except an error is signalled if there is no such
 348 charset instead of returning @code{nil}.
 349 @end defun
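
  The difference between the two lookup functions can be sketched as
follows.  The example assumes an XEmacs built with MULE support;
@code{no-such-charset} is a made-up name that is assumed not to be
defined.

@example
(find-charset 'ascii)               ; returns the charset object
(find-charset 'no-such-charset)     ; @result{} nil

;; get-charset signals an error instead of returning nil.
(condition-case nil
    (get-charset 'no-such-charset)
  (error 'not-found))               ; @result{} not-found
@end example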
 350
 351 @defun charset-list
 352 This function returns a list of the names of all defined charsets.
 353 @end defun
 354
 355 @defun make-charset name doc-string props
 356 This function defines a new character set.  This function is for use
 357 with MULE support.  @var{name} is a symbol, the name by which the
 358 character set is normally referred.  @var{doc-string} is a string
 359 describing the character set.  @var{props} is a property list,
 360 describing the specific nature of the character set.  The recognized
 361 properties are @code{registry}, @code{dimension}, @code{columns},
 362 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and
 363 @code{ccl-program}, as previously described.
 364 @end defun
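
  A minimal sketch of defining a charset, assuming an XEmacs built with
MULE support.  The charset name, doc string and registry below are
invented for illustration, and the final byte @samp{9} is assumed to be
unused in the (dimension 1, 94 characters) namespace of your session.

@example
;; `my-private-94' and "MyPrivate-0" are hypothetical.  The final
;; byte ?9 lies in the 0x30 - 0x3F range reserved for user-defined
;; character sets.
(make-charset 'my-private-94
              "A hypothetical private 94-character set."
              '(registry "MyPrivate-0"
                dimension 1
                chars 94
                final ?9
                graphic 0
                direction l2r))

(charsetp (find-charset 'my-private-94))  ; should be non-nil
@end example

The @code{columns} property is omitted here, so it defaults to the
dimension.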
 365
 366 @defun make-reverse-direction-charset charset new-name
 367 This function makes a charset equivalent to @var{charset} but which goes
 368 in the opposite direction.  @var{new-name} is the name of the new
 369 charset.  The new charset is returned.
 370 @end defun
 371
 372 @defun charset-from-attributes dimension chars final &optional direction
 373 This function returns a charset with the given @var{dimension},
 374 @var{chars}, @var{final}, and @var{direction}.  If @var{direction} is
 375 omitted, both directions will be checked (left-to-right will be returned
 376 if character sets exist for both directions).
 377 @end defun
 378
 379 @defun charset-reverse-direction-charset charset
 380 This function returns the charset (if any) with the same dimension,
 381 number of characters, and final byte as @var{charset}, but which is
 382 displayed in the opposite direction.
 383 @end defun
 384
 385 @node Charset Property Functions, Predefined Charsets, Basic Charset Functions, Charsets
 386 @subsection Charset Property Functions
 387
 388   All of these functions accept either a charset name or charset object.
 389
 390 @defun charset-property charset prop
 391 This function returns property @var{prop} of @var{charset}.
 392 @xref{Charset Properties}.
 393 @end defun
 394
 395   Convenience functions are also provided for retrieving individual
 396 properties of a charset.
 397
 398 @defun charset-name charset
 399 This function returns the name of @var{charset}.  This will be a symbol.
 400 @end defun
 401
 402 @defun charset-description charset
 403 This function returns the documentation string of @var{charset}.
 404 @end defun
 405
 406 @defun charset-registry charset
 407 This function returns the registry of @var{charset}.
 408 @end defun
 409
 410 @defun charset-dimension charset
 411 This function returns the dimension of @var{charset}.
 412 @end defun
 413
 414 @defun charset-chars charset
 415 This function returns the number of characters per dimension of
 416 @var{charset}.
 417 @end defun
 418
 419 @defun charset-width charset
 420 This function returns the number of display columns per character (in
 421 TTY mode) of @var{charset}.
 422 @end defun
 423
 424 @defun charset-direction charset
 425 This function returns the display direction of @var{charset}---either
 426 @code{l2r} or @code{r2l}.
 427 @end defun
 428
 429 @defun charset-iso-final-char charset
 430 This function returns the final byte of the ISO 2022 escape sequence
 431 designating @var{charset}.
 432 @end defun
 433
 434 @defun charset-iso-graphic-plane charset
 435 This function returns either 0 or 1, depending on whether the position
 436 codes of characters in @var{charset} map to the left or right half
 437 of their font, respectively.
 438 @end defun
 439
 440 @defun charset-ccl-program charset
 441 This function returns the CCL program, if any, for converting
 442 position codes of characters in @var{charset} into font indices.
 443 @end defun
 444
 445   The two properties of a charset that can currently be set after the
 446 charset has been created are the CCL program and the font registry.
 447
 448 @defun set-charset-ccl-program charset ccl-program
 449 This function sets the @code{ccl-program} property of @var{charset} to
 450 @var{ccl-program}.
 451 @end defun
 452
 453 @defun set-charset-registry charset registry
 454 This function sets the @code{registry} property of @var{charset} to
 455 @var{registry}.
 456 @end defun
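
  As an illustration, the following queries some properties of the
predefined @code{japanese-jisx0208} charset.  This is a sketch assuming
an XEmacs built with MULE support; the values in the comments are what
the table of predefined charsets leads one to expect, not captured
output.

@example
(charset-dimension 'japanese-jisx0208)          ; expected: 2
(charset-chars 'japanese-jisx0208)              ; expected: 94
(charset-direction 'japanese-jisx0208)          ; expected: l2r
(charset-iso-graphic-plane 'japanese-jisx0208)  ; expected: 0

;; The generic accessor should agree with the convenience function.
(eq (charset-property 'japanese-jisx0208 'dimension)
    (charset-dimension 'japanese-jisx0208))
@end example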
 457
 458 @node Predefined Charsets, , Charset Property Functions, Charsets
 459 @subsection Predefined Charsets
 460
 461   The following charsets are predefined in the C code.
 462
 463 @example
Name                     Type  Fi Gr Dir Registry
 465 --------------------------------------------------------------
 466 ascii                    94    B  0  l2r ISO8859-1
 467 control-1                94       0  l2r ---
latin-iso8859-1          96    A  1  l2r ISO8859-1
 469 latin-iso8859-2          96    B  1  l2r ISO8859-2
 470 latin-iso8859-3          96    C  1  l2r ISO8859-3
 471 latin-iso8859-4          96    D  1  l2r ISO8859-4
 472 cyrillic-iso8859-5       96    L  1  l2r ISO8859-5
 473 arabic-iso8859-6         96    G  1  r2l ISO8859-6
 474 greek-iso8859-7          96    F  1  l2r ISO8859-7
 475 hebrew-iso8859-8         96    H  1  r2l ISO8859-8
 476 latin-iso8859-9          96    M  1  l2r ISO8859-9
 477 thai-tis620              96    T  1  l2r TIS620
 478 katakana-jisx0201        94    I  1  l2r JISX0201.1976
 479 latin-jisx0201           94    J  0  l2r JISX0201.1976
 480 japanese-jisx0208-1978   94x94 @@  0  l2r JISX0208.1978
 481 japanese-jisx0208        94x94 B  0  l2r JISX0208.19(83|90)
 482 japanese-jisx0212        94x94 D  0  l2r JISX0212
 483 chinese-gb2312           94x94 A  0  l2r GB2312
 484 chinese-cns11643-1       94x94 G  0  l2r CNS11643.1
 485 chinese-cns11643-2       94x94 H  0  l2r CNS11643.2
 486 chinese-big5-1           94x94 0  0  l2r Big5
 487 chinese-big5-2           94x94 1  0  l2r Big5
 488 korean-ksc5601           94x94 C  0  l2r KSC5601
 489 composite                96x96    0  l2r ---
 490 @end example
 491
 492   The following charsets are predefined in the Lisp code.
 493
 494 @example
 495 Name                     Type  Fi Gr Dir Registry
 496 --------------------------------------------------------------
 497 arabic-digit             94    2  0  l2r MuleArabic-0
 498 arabic-1-column          94    3  0  r2l MuleArabic-1
 499 arabic-2-column          94    4  0  r2l MuleArabic-2
 500 sisheng                  94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
 501 chinese-cns11643-3       94x94 I  0  l2r CNS11643.1
 502 chinese-cns11643-4       94x94 J  0  l2r CNS11643.1
 503 chinese-cns11643-5       94x94 K  0  l2r CNS11643.1
 504 chinese-cns11643-6       94x94 L  0  l2r CNS11643.1
 505 chinese-cns11643-7       94x94 M  0  l2r CNS11643.1
 506 ethiopic                 94x94 2  0  l2r Ethio
 507 ascii-r2l                94    B  0  r2l ISO8859-1
 508 ipa                      96    0  1  l2r MuleIPA
 509 vietnamese-viscii-lower  96    1  1  l2r VISCII1.1
 510 vietnamese-viscii-upper  96    2  1  l2r VISCII1.1
 511 @end example
 512
 513 For all of the above charsets, the dimension and number of columns are
 514 the same.
 515
 516   Note that ASCII, Control-1, and Composite are handled specially.
 517 This is why some of the fields are blank; and some of the filled-in
 518 fields (e.g. the type) are not really accurate.
 519
 520 @node MULE Characters, Composite Characters, Charsets, MULE
 521 @section MULE Characters
 522
 523 @defun make-char charset arg1 &optional arg2
This function makes a multi-byte character from @var{charset} and octets
@var{arg1} and @var{arg2}.  @var{arg2} is required only for charsets of
dimension two.
 526 @end defun
 527
 528 @defun char-charset character
 529 This function returns the character set of char @var{character}.
 530 @end defun
 531
 532 @defun char-octet character &optional n
 533 This function returns the octet (i.e. position code) numbered @var{n}
 534 (should be 0 or 1) of char @var{character}.  @var{n} defaults to 0 if omitted.
 535 @end defun
 536
 537 @defun find-charset-region start end &optional buffer
 538 This function returns a list of the charsets in the region between
 539 @var{start} and @var{end}.  @var{buffer} defaults to the current buffer
 540 if omitted.
 541 @end defun
 542
 543 @defun find-charset-string string
 544 This function returns a list of the charsets in @var{string}.
 545 @end defun
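
  A short sketch of these functions used together, assuming an XEmacs
built with MULE support.  The octets 48 and 33 are chosen only as an
example of a two-octet JIS X 0208 character; no output is shown because
the printed form of the results depends on the build.

@example
(let ((ch (make-char 'japanese-jisx0208 48 33)))
  (list (charset-name (char-charset ch))    ; name of the charset of CH
        (char-octet ch 0)                   ; first octet of CH
        (char-octet ch 1)))                 ; second octet of CH

;; List the charsets used by a string containing only ASCII.
(find-charset-string "abc")
@end example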
 546
 547 @node Composite Characters, Coding Systems, MULE Characters, MULE
 548 @section Composite Characters
 549
 550   Composite characters are not yet completely implemented.
 551
 552 @defun make-composite-char string
 553 This function converts a string into a single composite character.  The
 554 character is the result of overstriking all the characters in the
 555 string.
 556 @end defun
 557
 558 @defun composite-char-string character
 559 This function returns a string of the characters comprising a composite
 560 character.
 561 @end defun
 562
 563 @defun compose-region start end &optional buffer
 564 This function composes the characters in the region from @var{start} to
 565 @var{end} in @var{buffer} into one composite character.  The composite
 566 character replaces the composed characters.  @var{buffer} defaults to
 567 the current buffer if omitted.
 568 @end defun
 569
 570 @defun decompose-region start end &optional buffer
 571 This function decomposes any composite characters in the region from
 572 @var{start} to @var{end} in @var{buffer}.  This converts each composite
 573 character into one or more characters, the individual characters out of
 574 which the composite character was formed.  Non-composite characters are
 575 left as-is.  @var{buffer} defaults to the current buffer if omitted.
 576 @end defun
 577
 578 @node Coding Systems, CCL, Composite Characters, MULE
 579 @section Coding Systems
 580
 581   A coding system is an object that defines how text containing multiple
 582 character sets is encoded into a stream of (typically 8-bit) bytes.  The
 583 coding system is used to decode the stream into a series of characters
 584 (which may be from multiple charsets) when the text is read from a file
 585 or process, and is used to encode the text back into the same format
 586 when it is written out to a file or process.
 587
 588   For example, many ISO-2022-compliant coding systems (such as Compound
 589 Text, which is used for inter-client data under the X Window System) use
 590 escape sequences to switch between different charsets -- Japanese Kanji,
 591 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
 592 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}.  See
 593 @code{make-coding-system} for more information.
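
  The three escape sequences above follow a common pattern, which the
sketch below reproduces from a charset's dimension, number of
characters and final byte.  It assumes that 94-character sets are
designated to G0 and 96-character sets to G1 (@pxref{ISO 2022}), and
returns the sequence as a list of character codes.
@code{iso2022-designation} is a hypothetical name, not an XEmacs
function.

@example
;; Hypothetical helper, for illustration only.
(defun iso2022-designation (dimension chars final)
  "Return the codes of the escape sequence designating a charset.
DIMENSION is 1 or 2, CHARS is 94 or 96, FINAL is the final byte."
  (append '(27)                             ; ESC
          (if (= dimension 2) '(36))        ; `$' marks a multi-byte set
          (list (if (= chars 94) 40 45))    ; `(' for G0, `-' for G1
          (list final)))

(iso2022-designation 2 94 66)   ; @result{} (27 36 40 66), ESC $ ( B
(iso2022-designation 1 94 66)   ; @result{} (27 40 66),    ESC ( B
(iso2022-designation 1 96 76)   ; @result{} (27 45 76),    ESC - L
@end example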
 594
 595   Coding systems are normally identified using a symbol, and the symbol is
 596 accepted in place of the actual coding system object whenever a coding
 597 system is called for. (This is similar to how faces and charsets work.)
 598
 599 @defun coding-system-p object
 600 This function returns non-@code{nil} if @var{object} is a coding system.
 601 @end defun
 602
 603 @menu
 604 * Coding System Types::               Classifying coding systems.
 605 * ISO 2022::                          An international standard for
 606                                         charsets and encodings.
 607 * EOL Conversion::                    Dealing with different ways of denoting
 608                                         the end of a line.
 609 * Coding System Properties::          Properties of a coding system.
 610 * Basic Coding System Functions::     Working with coding systems.
 611 * Coding System Property Functions::  Retrieving a coding system's properties.
 612 * Encoding and Decoding Text::        Encoding and decoding text.
 613 * Detection of Textual Encoding::     Determining how text is encoded.
 614 * Big5 and Shift-JIS Functions::      Special functions for these non-standard
 615                                         encodings.
 616 * Predefined Coding Systems::         Coding systems implemented by MULE.
 617 @end menu
 618
 619 @node Coding System Types, ISO 2022, , Coding Systems
 620 @subsection Coding System Types
 621
 622   The coding system type determines the basic algorithm XEmacs will use to
 623 decode or encode a data stream.  Character encodings will be converted
 624 to the MULE encoding, escape sequences processed, and newline sequences
 625 converted to XEmacs's internal representation.  There are three basic
 626 classes of coding system type: no-conversion, ISO-2022, and special.
 627
 628   No conversion allows you to look at the file's internal representation.
 629 Since XEmacs is basically a text editor, "no conversion" does convert
 630 newline conventions by default.  (Use the @code{binary} coding system if
 631 this is not desired.)
 632
 633   ISO 2022 (@pxref{ISO 2022}) is the basic international standard regulating
 634 use of "coded character sets for the exchange of data", i.e., text
 635 streams.  ISO 2022 contains functions that make it possible to encode
 636 text streams to comply with restrictions of the Internet mail system and
 637 de facto restrictions of most file systems (e.g., use of the separator
 638 character in file names).  Coding systems which are not ISO 2022
 639 conformant can be difficult to handle.  Perhaps more important, they are
 640 not adaptable to multilingual information interchange, with the obvious
 641 exception of ISO 10646 (Unicode).  (Unicode is partially supported by
 642 XEmacs with the addition of the Lisp package ucs-conv.)
 643
 644   The special class of coding systems includes automatic detection, CCL (a
 645 "little language" embedded as an interpreter, useful for translating
 646 between variants of a single character set), non-ISO-2022-conformant
 647 encodings like Unicode, Shift JIS, and Big5, and MULE internal coding.
 648 (NB: this list is based on XEmacs 21.2.  Terminology may vary slightly
 649 for other versions of XEmacs and for GNU Emacs 20.)
 650
 651 @table @code
 652 @item no-conversion
 653 No conversion, for binary files, and a few special cases of non-ISO-2022
 654 coding systems where conversion is done by hook functions (usually
 655 implemented in CCL).  On output, graphic characters that are not in
 656 ASCII or Latin-1 will be replaced by a @samp{?}. (For a
 657 no-conversion-encoded buffer, these characters will only be present if
 658 you explicitly insert them.)
 659 @item iso2022
 660 Any ISO-2022-compliant encoding.  Among others, this includes JIS (the
 661 Japanese encoding commonly used for e-mail), national variants of EUC
 662 (the standard Unix encoding for Japanese and other languages), and
 663 Compound Text (an encoding used in X11).  You can specify more specific
 664 information about the conversion with the @var{flags} argument.
 665 @item ucs-4
 666 ISO 10646 UCS-4 encoding.  A 31-bit fixed-width superset of Unicode.
 667 @item utf-8
 668 ISO 10646 UTF-8 encoding.  A ``file system safe'' transformation format
 669 that can be used with both UCS-4 and Unicode.
 670 @item undecided
 671 Automatic conversion.  XEmacs attempts to detect the coding system used
 672 in the file.
 673 @item shift-jis
 674 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
 675 @item big5
 676 Big5 (the encoding commonly used for Traditional Chinese in Taiwan).
 677 @item ccl
 678 The conversion is performed using a user-written pseudo-code program.
 679 CCL (Code Conversion Language) is the name of this pseudo-code.  For
 680 example, CCL is used to map KOI8-R characters (an encoding for Russian
 681 Cyrillic) to ISO8859-5 (the form used internally by MULE).
 682 @item internal
 683 Write out or read in the raw contents of the memory representing the
 684 buffer's text.  This is primarily useful for debugging purposes, and is
 685 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
 686 (the @samp{--debug} configure option).  @strong{Warning}: Reading in a
 687 file using @code{internal} conversion can result in an internal
 688 inconsistency in the memory representing a buffer's text, which will
 689 produce unpredictable results and may cause XEmacs to crash.  Under
 690 normal circumstances you should never use @code{internal} conversion.
 691 @end table
 692
 693 @node ISO 2022, EOL Conversion, Coding System Types, Coding Systems
 694 @subsection ISO 2022
 695
 696   This section briefly describes the ISO 2022 encoding standard.  A more
 697 thorough treatment is available in the original document of ISO
 698 2022 as well as various national standards (such as JIS X 0202).
 699
 700   Character sets (@dfn{charsets}) are classified into the following four
 701 categories, according to the number of characters in the charset:
 702 94-charset, 96-charset, 94x94-charset, and 96x96-charset.  This means
 703 that although an ISO 2022 coding system may have variable width
 704 characters, each charset used is fixed-width (in contrast to the MULE
 705 character set and UTF-8, for example).
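
  The sizes of the four categories follow directly from the byte
positions available to each.  The Python sketch below is only an
illustration; it assumes the GL byte positions, where a 94-charset
leaves out 0x20 and 0x7F and a 96-charset uses both.

```python
# A 94-charset leaves out 0x20 (SPACE) and 0x7F (DEL); a 96-charset uses both.
POSITIONS_94 = range(0x21, 0x7F)   # 0x21..0x7E
POSITIONS_96 = range(0x20, 0x80)   # 0x20..0x7F

sizes = [
    ("94-charset", len(POSITIONS_94)),
    ("96-charset", len(POSITIONS_96)),
    ("94x94-charset", len(POSITIONS_94) ** 2),
    ("96x96-charset", len(POSITIONS_96) ** 2),
]
for name, count in sizes:
    print(name, count)
```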
 706
 707   ISO 2022 provides for switching between character sets via escape
 708 sequences.  This switching is somewhat complicated, because ISO 2022
 709 provides for both legacy applications like Internet mail that accept
 710 only 7 significant bits in some contexts (RFC 822 headers, for example),
 711 and more modern "8-bit clean" applications.  It also provides for
 712 compact and transparent representation of languages like Japanese which
 713 mix ASCII and a national script (even outside of computer programs).
 714
 715   First, ISO 2022 codified prevailing practice by dividing the code space
 716 into "control" and "graphic" regions.  The code points 0x00-0x1F and
 717 0x80-0x9F are reserved for "control characters", while "graphic
 718 characters" must be assigned to code points in the regions 0x20-0x7F and
 719 0xA0-0xFF.  The positions 0x20 and 0x7F are special, and under some
 720 circumstances must be assigned the graphic character "ASCII SPACE" and
 721 the control character "ASCII DEL" respectively.
 722
 723   The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F),
 724 C1 (0x80-0x9F), and GR (0xA0-0xFF).  GL and GR stand for "graphic left"
 725 and "graphic right", respectively, because of the standard method of
 726 displaying graphic character sets in tables with the high four bits
 727 indexing columns and the low four bits indexing rows.  I don't find it
 728 very intuitive, but these are called "registers".
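
  The division of the code space can be written down as a small
classifier.  This Python sketch is illustrative only; the function name
@code{region} is hypothetical.

```python
def region(byte):
    """Name the ISO 2022 code-space region that a byte value falls in."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("not a byte value")
    if byte <= 0x1F:
        return "C0"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "C1"
    return "GR"

for sample in (0x0A, 0x41, 0x8E, 0xC6):
    print(hex(sample), region(sample))
```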
 729
 730   An ISO 2022-conformant encoding for a graphic character set must use a
 731 fixed number of bytes per character, and the values must fit into a
 732 single register; that is, each byte must range over either 0x20-0x7F, or
 733 0xA0-0xFF.  It is not allowed to extend the range of the repertoire of a
 734 character set by using both ranges at the same time.  This is why a standard
 735 character set such as ISO 8859-1 is actually considered by ISO 2022 to
 736 be an aggregation of two character sets, ASCII and LATIN-1, and why it
 737 is technically incorrect to refer to ISO 8859-1 as "Latin 1".  Also, a
 738 single character's bytes must all be drawn from the same register; this
 739 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO
 740 2022-compatible encodings.
 741
 742   The reason for this restriction becomes clear when you attempt to define
 743 an efficient, robust encoding for a language like Japanese.  Like ISO
 744 8859, Japanese encodings are aggregations of several character sets.  In
 745 practice, the vast majority of characters are drawn from the "JIS Roman"
 746 character set (a derivative of ASCII; it won't hurt to think of it as
 747 ASCII) and the JIS X 0208 standard "basic Japanese" character set
 748 including not only ideographic characters ("kanji") but syllabic
 749 Japanese characters ("kana"), a wide variety of symbols, and many
 750 alphabetic characters (Roman, Greek, and Cyrillic) as well.  Although
 751 JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not
 752 suited to programming; thus the inclusion of ASCII in the standard
 753 Japanese encodings.
 754
 755   For normal Japanese text such as in newspapers, a broad repertoire of
 756 approximately 3000 characters is used.  Evidently this won't fit into
 757 one byte; two must be used.  But much of the text processed by Japanese
 758 computers is computer source code, nearly all of which is ASCII.  A not
 759 insignificant portion of ordinary text is English (as such or as
 760 borrowed Japanese vocabulary) or other languages which can be represented
 761 at least approximately in ASCII, as well.  It seems reasonable then to
 762 represent ASCII in one byte, and JIS X 0208 in two.  And this is exactly
 763 what the Extended Unix Code for Japanese (EUC-JP) does.  ASCII is
 764 invoked to the GL register, and JIS X 0208 is invoked to the GR
 765 register.  Thus, each byte can be tested for its character set by
 766 looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
 767 Furthermore, since control characters like newline can never be part of
 768 a graphic character, even in the case of corruption in transmission the
 769 stream will be resynchronized at every line break, on the order of 60-80
 770 bytes.  This coding system requires no escape sequences or special
 771 control codes to represent 99.9% of all Japanese text.
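
  The high-bit test described above can be tried outside XEmacs.  The
following Python sketch assumes the @samp{euc_jp} codec from the Python
standard library; it tags each byte of a short mixed string by
register.

```python
text = "A\u65e5\u672cB"               # "A", two kanji, "B"
encoded = text.encode("euc_jp")

# Tag each byte by its high bit: set means JIS X 0208 (GR), clear means ASCII (GL).
tags = ["GR" if byte & 0x80 else "GL" for byte in encoded]

print(encoded.hex())
print(tags)
```

  No escape sequence appears anywhere in the encoded bytes.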
 772
 773   Note carefully the distinction between the character sets (ASCII and JIS
 774 X 0208), the encoding (EUC-JP), and the coding system (ISO 2022).  The
 775 JIS X 0208 character set is used in three different encodings for
 776 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
 777 always clear), in EUC-JP it is invoked into GR (setting the high bit in
 778 the process), and in Shift JIS the high bit may be set or reset, and the
 779 significant bits are shifted within the 16-bit character so that the two
 780 main character sets can coexist with a third (the "halfwidth katakana"
 781 of JIS X 0201).  As the name implies, the ISO-2022-JP encoding is also a
 782 version of the ISO-2022 coding system.
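
  To make the comparison concrete, this Python sketch encodes a single
JIS X 0208 character three ways.  It assumes the @samp{iso2022_jp},
@samp{euc_jp}, and @samp{shift_jis} codecs from the Python standard
library, and it is an illustration rather than a description of how
MULE performs the conversion.

```python
ch = "\u65e5"                          # a kanji with JIS X 0208 code 0x467C

jis = ch.encode("iso2022_jp")[3:5]     # drop the leading and trailing escapes
euc = ch.encode("euc_jp")
sjis = ch.encode("shift_jis")

print("ISO-2022-JP (GL):", jis.hex())  # high bits clear
print("EUC-JP (GR):     ", euc.hex())  # same code with high bits set
print("Shift JIS:       ", sjis.hex()) # bits rearranged
```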
 783
 784   In order to systematically treat subsidiary character sets (like the
 785 "halfwidth katakana" already mentioned, and the "supplementary kanji" of
 786 JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
 787 Unlike GL and GR, they are not logically distinguished by internal
 788 format.  Instead, the process of "invocation" mentioned earlier is
 789 broken into two steps: first, a character set is @dfn{designated} to one
 790 of the registers G0-G3 by use of an @dfn{escape sequence} of the form:
 791
 792 @example
 793         ESC [@var{I}] @var{I} @var{F}
 794 @end example
 795
 796 where @var{I} is an intermediate character or characters in the range
 797 0x20-0x2F, and @var{F}, from the range 0x30-0x7E, is the final
 798 character identifying this charset.  (Final characters in the range
 799 0x30-0x3F are reserved for private use and will never have a publicly
 800 registered meaning.)
 801
 802   Then that register is @dfn{invoked} to either GL or GR, either
 803 automatically (designations to G0 normally involve invocation to GL as
 804 well), or by use of shifting (affecting only the following character in
 805 the data stream) or locking (effective until the next designation or
 806 locking) control sequences.  An encoding conformant to ISO 2022 is
 807 typically defined by designating the initial contents of the G0-G3
 808 registers, specifying a 7 or 8 bit environment, and specifying whether
 809 further designations will be recognized.
 810
 811   Some examples of character sets and the registered final characters
 812 @var{F} used to designate them:
 813
 814 @need 1000
 815 @table @asis
 816 @item 94-charset
 817  ASCII (B), left (J) and right (I) half of JIS X 0201, ...
 818 @item 96-charset
 819  Latin-1 (A), Latin-2 (B), Latin-3 (C), ...
 820 @item 94x94-charset
 821  GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...
 822 @item 96x96-charset
 823  none for the moment
 824 @end table
 825
 826   The meanings of the various characters in these sequences, where not
 827 specified by the ISO 2022 standard (such as the ESC character), are
 828 assigned by @dfn{ECMA}, the European Computer Manufacturers Association.
 829
 830   The meanings of the intermediate characters are:
 831
 832 @example
 833 @group
 834         $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
 835         ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
 836         ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
 837         * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
 838         + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
 839         , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
 840         - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
 841         . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
 842         / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
 843 @end group
 844 @end example
 845
 846   The comma may be used in files read and written only by MULE, as a MULE
 847 extension, but this is illegal in ISO 2022.  (The reason is that in ISO
 848 2022 G0 must be a 94-member character set, with 0x20 assigned the value
 849 SPACE, and 0x7F assigned the value DEL.)
 850
 851   Here are examples of designations:
 852
 853 @example
 854 @group
 855         ESC ( B :              designate to G0 ASCII
 856         ESC - A :              designate to G1 Latin-1
 857         ESC $ ( A or ESC $ A : designate to G0 GB2312
 858         ESC $ ( B or ESC $ B : designate to G0 JISX0208
 859         ESC $ ) C :            designate to G1 KSC5601
 860 @end group
 861 @end example
 862
 863 (The short forms used to designate GB2312 and JIS X 0208 are for
 864 backwards compatibility; the long forms are preferred.)
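
  The designation sequences above are regular enough to parse
mechanically.  The following Python sketch is one possible reading of
the table of intermediate characters; the function name
@code{parse_designation} and its return format are hypothetical, and
the sketch does not check that the final character is registered.

```python
ESC = 0x1B
# Same order as the table of intermediate characters: 94-charsets to G0..G3,
# then 96-charsets to G0..G3.
DESIGNATORS = b"()*+,-./"

def parse_designation(seq):
    """Return (register, positions, dimension, final) for a designation sequence."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation sequence")
    body = seq[1:]
    dimension = 1
    if body[0] == ord("$"):            # '$' marks a charset of dimension 2
        dimension = 2
        body = body[1:]
    if len(body) == 1:                 # short form, e.g. ESC $ B: G0 and 94x94 implied
        return ("G0", 94, dimension, chr(body[0]))
    if len(body) != 2 or body[0] not in DESIGNATORS:
        raise ValueError("unrecognized designation")
    index = DESIGNATORS.index(body[0])
    positions = 94 if index < 4 else 96
    return ("G%d" % (index % 4), positions, dimension, chr(body[1]))

print(parse_designation(b"\x1b$)C"))
```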
 865
 866   To use a charset designated to G2 or G3, and to use a charset designated
 867 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
 868 into GL.  There are two types of invocation, Locking Shift (in effect
 869 until the next locking shift) and Single Shift (one character only).
 870
 871   Locking Shift is done as follows:
 872
 873 @example
 874         LS0 or SI (0x0F): invoke G0 into GL
 875         LS1 or SO (0x0E): invoke G1 into GL
 876         LS2:  invoke G2 into GL
 877         LS3:  invoke G3 into GL
 878         LS1R: invoke G1 into GR
 879         LS2R: invoke G2 into GR
 880         LS3R: invoke G3 into GR
 881 @end example
 882
 883   Single Shift is done as follows:
 884
 885 @example
 886 @group
 887         SS2 or ESC N: invoke G2 into GL
 888         SS3 or ESC O: invoke G3 into GL
 889 @end group
 890 @end example
 891
 892   The shift functions (such as LS1R and SS3) are represented by control
 893 characters (from C1) in 8 bit environments and by escape sequences in 7
 894 bit environments.
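
  Locking shifts in a 7-bit environment can be observed with a Korean
example.  This Python sketch assumes the @samp{iso2022_kr} codec from
the Python standard library, which designates KSC 5601 to G1 and then
uses SO and SI to switch GL between G1 and G0.  The exact position of
the designation within the output is a detail of that codec.

```python
SO, SI = b"\x0e", b"\x0f"              # LS1 and LS0

text = "A\ud55cB"                      # "A", one Hangul syllable, "B"
encoded = text.encode("iso2022_kr")

designate = encoded.find(b"\x1b$)C")   # designate KSC 5601 to G1
shift_out = encoded.find(SO)           # LS1: invoke G1 into GL
shift_in = encoded.find(SI)            # LS0: invoke G0 (ASCII) back into GL

print(encoded)
print(designate, shift_out, shift_in)
```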
 895
 896 (#### Ben says: I think the above is slightly incorrect.  It appears that
 897 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
 898 ESC O behave as indicated.  The above definitions will not parse
 899 EUC-encoded text correctly, and it looks like the code in mule-coding.c
 900 has similar problems.)
 901
 902   Evidently there are a lot of ISO-2022-compliant ways of encoding
 903 multilingual text.  Many coding systems in actual use, such as X11's
 904 Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX
 905 Code), are variants of ISO 2022.
 906
 907   In MULE, we characterize a version of ISO 2022 by the following
 908 attributes:
 909
 910 @enumerate
 911 @item
 912 The character sets initially designated to G0 through G3.
 913 @item
 914 Whether short form designations are allowed for Japanese and Chinese.
 915 @item
 916 Whether ASCII should be designated to G0 before control characters.
 917 @item
 918 Whether ASCII should be designated to G0 at the end of line.
 919 @item
 920 7-bit environment or 8-bit environment.
 921 @item
 922 Whether Locking Shifts are used or not.
 923 @item
 924 Whether to use ASCII or the variant JIS X 0201-1976-Roman.
 925 @item
 926 Whether to use JIS X 0208-1983 or the older version JIS X 0208-1976.
 927 @end enumerate
 928
 929 (The last two are only for Japanese.)
 930
 931   By specifying these attributes, you can create any variant
 932 of ISO 2022.
 933
 934   Here are several examples:
 935
 936 @example
 937 @group
 938 ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
 939         1. G0 <- ASCII, G1..3 <- never used
 940         2. Yes.
 941         3. Yes.
 942         4. Yes.
 943         5. 7-bit environment
 944         6. No.
 945         7. Use ASCII
 946         8. Use JIS X 0208-1983
 947 @end group
 948
 949 @group
 950 ctext -- X11 Compound Text
 951         1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
 952         2. No.
 953         3. No.
 954         4. Yes.
 955         5. 8-bit environment.
 956         6. No.
 957         7. Use ASCII.
 958         8. Use JIS X 0208-1983.
 959 @end group
 960
 961 @group
 962 euc-china -- Chinese EUC.  Often called the "GB encoding", but that is
 963 technically incorrect.
 964         1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
 965         2. No.
 966         3. Yes.
 967         4. Yes.
 968         5. 8-bit environment.
 969         6. No.
 970         7. Use ASCII.
 971         8. Use JIS X 0208-1983.
 972 @end group
 973
 974 @group
 975 ISO-2022-KR -- Coding system used in Korean email.
 976         1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
 977         2. No.
 978         3. Yes.
 979         4. Yes.
 980         5. 7-bit environment.
 981         6. Yes.
 982         7. Use ASCII.
 983         8. Use JIS X 0208-1983.
 984 @end group
 985 @end example
 986
 987 MULE creates all of these coding systems by default.
 988
 989 @node EOL Conversion, Coding System Properties, ISO 2022, Coding Systems
 990 @subsection EOL Conversion
 991
 992 @table @code
 993 @item nil
 994 Automatically detect the end-of-line type (LF, CRLF, or CR).  Also
 995 generate subsidiary coding systems named @code{@var{name}-unix},
 996 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
 997 this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf},
 998 and @code{cr}, respectively.
 999 @item lf
1000 The end of a line is marked externally using ASCII LF.  Since this is
1001 also the way that XEmacs represents an end-of-line internally,
1002 specifying this option results in no end-of-line conversion.  This is
1003 the standard format for Unix text files.
1004 @item crlf
1005 The end of a line is marked externally using ASCII CRLF.  This is the
1006 standard format for MS-DOS text files.
1007 @item cr
1008 The end of a line is marked externally using ASCII CR.  This is the
1009 standard format for Macintosh text files.
1010 @item t
1011 Automatically detect the end-of-line type but do not generate subsidiary
1012 coding systems.  (This value is converted to @code{nil} when stored
1013 internally, and @code{coding-system-property} will return @code{nil}.)
1014 @end table
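
  The effect of these values on decoding can be sketched in a few
lines of Python.  This is an illustration only, not the XEmacs
implementation; the function names @code{detect_eol} and
@code{to_internal} are hypothetical, and the detection rule (look for
CRLF first, then CR, then LF) is an assumption of the sketch.

```python
def detect_eol(data):
    """Guess the end-of-line type of raw bytes: 'crlf', 'cr', 'lf', or None."""
    if b"\r\n" in data:
        return "crlf"
    if b"\r" in data:
        return "cr"
    if b"\n" in data:
        return "lf"
    return None

def to_internal(data, eol_type=None):
    """Convert external line endings to LF, detecting the type when eol_type is None."""
    if eol_type is None:
        eol_type = detect_eol(data)
    if eol_type == "crlf":
        return data.replace(b"\r\n", b"\n")
    if eol_type == "cr":
        return data.replace(b"\r", b"\n")
    return data                        # 'lf' or no line breaks: nothing to convert

print(to_internal(b"one\r\ntwo\r\n"))
```

  Passing an explicit @code{eol_type} of @code{"lf"} leaves the data
untouched, mirroring the statement above that @code{lf} results in no
end-of-line conversion.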
1015
1016 @node Coding System Properties, Basic Coding System Functions, EOL Conversion, Coding Systems
1017 @subsection Coding System Properties
1018
1019 @table @code
1020 @item mnemonic
1021 String to be displayed in the modeline when this coding system is
1022 active.
1023
1024 @item eol-type
1025 End-of-line conversion to be used.  It should be one of the types
1026 listed in @ref{EOL Conversion}.
1027
1028 @item eol-lf
1029 The coding system which is the same as this one, except that it uses the
1030 Unix line-breaking convention.
1031
1032 @item eol-crlf
1033 The coding system which is the same as this one, except that it uses the
1034 DOS line-breaking convention.
1035
1036 @item eol-cr
1037 The coding system which is the same as this one, except that it uses the
1038 Macintosh line-breaking convention.
1039
1040 @item post-read-conversion
1041 Function called after a file has been read in, to perform the decoding.
1042 Called with two arguments, @var{start} and @var{end}, denoting a region of
1043 the current buffer to be decoded.
1044
1045 @item pre-write-conversion
1046 Function called before a file is written out, to perform the encoding.
1047 Called with two arguments, @var{start} and @var{end}, denoting a region of
1048 the current buffer to be encoded.
1049 @end table
1050
1051   The following additional properties are recognized if @var{type} is
1052 @code{iso2022}:
1053
1054 @table @code
1055 @item charset-g0
1056 @itemx charset-g1
1057 @itemx charset-g2
1058 @itemx charset-g3
1059 The character set initially designated to the G0 - G3 registers.
1060 The value should be one of
1061
1062 @itemize @bullet
1063 @item
1064 A charset object (designate that character set)
1065 @item
1066 @code{nil} (do not ever use this register)
1067 @item
1068 @code{t} (no character set is initially designated to the register, but
1069 may be later on; this automatically sets the corresponding
1070 @code{force-g*-on-output} property)
1071 @end itemize
1072
1073 @item force-g0-on-output
1074 @itemx force-g1-on-output
1075 @itemx force-g2-on-output
1076 @itemx force-g3-on-output
1077 If non-@code{nil}, send an explicit designation sequence on output
1078 before using the specified register.
1079
1080 @item short
1081 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
1082 and @samp{ESC $ B} on output in place of the full designation sequences
1083 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
1084
1085 @item no-ascii-eol
1086 If non-@code{nil}, don't designate ASCII to G0 at each end of line on
1087 output.  Setting this to non-@code{nil} also suppresses other
1088 state-resetting that normally happens at the end of a line.
1089
1090 @item no-ascii-cntl
1091 If non-@code{nil}, don't designate ASCII to G0 before control chars on
1092 output.
1093
1094 @item seven
1095 If non-@code{nil}, use 7-bit environment on output.  Otherwise, use 8-bit
1096 environment.
1097
1098 @item lock-shift
1099 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
1100 designation by escape sequence.
1101
1102 @item no-iso6429
1103 If non-@code{nil}, don't use ISO6429's direction specification.
1104
1105 @item escape-quoted
1106 If non-@code{nil}, literal control characters that are the same as the
1107 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
1108 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
1109 and CSI (0x9B)) are ``quoted'' with an escape character so that they can
1110 be properly distinguished from an escape sequence.  (Note that doing
1111 this results in a non-portable encoding.) This encoding flag is used for
1112 byte-compiled files.  Note that ESC is a good choice for a quoting
1113 character because there are no escape sequences whose second byte is a
1114 character from the Control-0 or Control-1 character sets; this is
1115 explicitly disallowed by the ISO 2022 standard.
1116
1117 @item input-charset-conversion
1118 A list of conversion specifications, specifying conversion of characters
1119 in one charset to another when decoding is performed.  Each
1120 specification is a list of two elements: the source charset, and the
1121 destination charset.
1122
1123 @item output-charset-conversion
1124 A list of conversion specifications, specifying conversion of characters
1125 in one charset to another when encoding is performed.  The form of each
1126 specification is the same as for @code{input-charset-conversion}.
1127 @end table
1128
1129   The following additional properties are recognized (and required) if
1130 @var{type} is @code{ccl}:
1131
1132 @table @code
1133 @item decode
1134 CCL program used for decoding (converting to internal format).
1135
1136 @item encode
1137 CCL program used for encoding (converting to external format).
1138 @end table
1139
1140   The following properties are used internally:  @code{eol-cr},
1141 @code{eol-crlf}, @code{eol-lf}, and @code{base}.
1142
1143 @node Basic Coding System Functions, Coding System Property Functions, Coding System Properties, Coding Systems
1144 @subsection Basic Coding System Functions
1145
1146 @defun find-coding-system coding-system-or-name
1147 This function retrieves the coding system of the given name.
1148
1149   If @var{coding-system-or-name} is a coding-system object, it is simply
1150 returned.  Otherwise, @var{coding-system-or-name} should be a symbol.
1151 If there is no such coding system, @code{nil} is returned.  Otherwise
1152 the associated coding system object is returned.
1153 @end defun
1154
1155 @defun get-coding-system name
1156 This function retrieves the coding system of the given name.  Same as
1157 @code{find-coding-system} except an error is signalled if there is no
1158 such coding system instead of returning @code{nil}.
1159 @end defun
1160
1161 @defun coding-system-list
1162 This function returns a list of the names of all defined coding systems.
1163 @end defun
1164
1165 @defun coding-system-name coding-system
1166 This function returns the name of the given coding system.
1167 @end defun
1168
1169 @defun coding-system-base coding-system
1170 This function returns the base coding system (the one with undecided
1171 EOL convention) of @var{coding-system}.
1172 @end defun
1173
1174 @defun make-coding-system name type &optional doc-string props
1175 This function registers symbol @var{name} as a coding system.
1176
1177 @var{type} describes the conversion method used and should be one of
1178 the types listed in @ref{Coding System Types}.
1179
1180 @var{doc-string} is a string describing the coding system.
1181
1182 @var{props} is a property list, describing the specific nature of the
1183 coding system.  Recognized properties are as in @ref{Coding System
1184 Properties}.
1185 @end defun
1186
1187 @defun copy-coding-system old-coding-system new-name
1188 This function copies @var{old-coding-system} to @var{new-name}.  If
1189 @var{new-name} does not name an existing coding system, a new one will
1190 be created.
1191 @end defun
1192
1193 @defun subsidiary-coding-system coding-system eol-type
1194 This function returns the subsidiary coding system of
1195 @var{coding-system} with eol type @var{eol-type}.
1196 @end defun
1197
1198 @node Coding System Property Functions, Encoding and Decoding Text, Basic Coding System Functions, Coding Systems
1199 @subsection Coding System Property Functions
1200
1201 @defun coding-system-doc-string coding-system
1202 This function returns the doc string for @var{coding-system}.
1203 @end defun
1204
1205 @defun coding-system-type coding-system
1206 This function returns the type of @var{coding-system}.
1207 @end defun
1208
1209 @defun coding-system-property coding-system prop
1210 This function returns the @var{prop} property of @var{coding-system}.
1211 @end defun
1212
1213 @node Encoding and Decoding Text, Detection of Textual Encoding, Coding System Property Functions, Coding Systems
1214 @subsection Encoding and Decoding Text
1215
1216 @defun decode-coding-region start end coding-system &optional buffer
1217 This function decodes the text between @var{start} and @var{end} which
1218 is encoded in @var{coding-system}.  This is useful if you've read in
1219 encoded text from a file without decoding it (e.g. you read in a
1220 JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
1221 system, so that it shows up as @samp{^[$B!<!+^[(B}).  The length of the
1222 decoded text is returned.  @var{buffer} defaults to the current buffer
1223 if unspecified.
1224 @end defun
1225
1226 @defun encode-coding-region start end coding-system &optional buffer
1227 This function encodes the text between @var{start} and @var{end} using
1228 @var{coding-system}.  This will, for example, convert Japanese
1229 characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use
1230 the JIS encoding.  The length of the encoded text is returned.
1231 @var{buffer} defaults to the current buffer if unspecified.
1232 @end defun
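
  The sample byte sequence used in the two descriptions above can be
decoded and re-encoded outside XEmacs.  This Python sketch is only an
analogy to @code{decode-coding-region} and @code{encode-coding-region}.
It assumes the @samp{iso2022_jp} codec from the Python standard library
and works on a byte string rather than a buffer region.

```python
# Bytes as they would appear if a JIS file were read with no conversion.
raw = b"\x1b$B!<!+\x1b(B"

decoded = raw.decode("iso2022_jp")     # the escapes vanish; two characters remain
reencoded = decoded.encode("iso2022_jp")

print(len(raw), "bytes ->", len(decoded), "characters")
print(reencoded == raw)
```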
1233
1234 @node Detection of Textual Encoding, Big5 and Shift-JIS Functions, Encoding and Decoding Text, Coding Systems
1235 @subsection Detection of Textual Encoding
1236
1237 @defun coding-category-list
1238 This function returns a list of all recognized coding categories.
1239 @end defun
1240
1241 @defun set-coding-priority-list list
1242 This function changes the priority order of the coding categories.
1243 @var{list} should be a list of coding categories, in descending order of
1244 priority.  Unspecified coding categories will be lower in priority than
1245 all specified ones, in the same relative order they were in previously.
1246 @end defun
1247
1248 @defun coding-priority-list
1249 This function returns a list of coding categories in descending order of
1250 priority.
1251 @end defun
1252
1253 @defun set-coding-category-system coding-category coding-system
1254 This function changes the coding system associated with a coding category.
1255 @end defun
1256
1257 @defun coding-category-system coding-category
1258 This function returns the coding system associated with a coding category.
1259 @end defun
1260
1261 @defun detect-coding-region start end &optional buffer
1262 This function detects the coding system of the text in the region
1263 between @var{start} and @var{end}.  The returned value is a list of
1264 possible coding systems ordered by priority.  If only ASCII characters
1265 are found, it returns @code{autodetect} or one of its subsidiary coding
1266 systems according to the detected end-of-line type.  The optional
1267 argument @var{buffer} defaults to the current buffer.
1268 @end defun
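
  The idea of a priority-ordered list of candidates can be sketched in
Python.  This is not the detection algorithm XEmacs uses: it simply
tries each codec in turn and keeps those that decode without error.
The codec names come from the Python standard library, and the
function name @code{possible_codecs} is hypothetical.

```python
PRIORITY = ("iso2022_jp", "euc_jp", "shift_jis")   # highest priority first

def possible_codecs(data, priority=PRIORITY):
    """Return the codecs, in priority order, that decode data without error."""
    matches = []
    for name in priority:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches

print(possible_codecs(b"plain ASCII"))
print(possible_codecs("\u65e5\u672c".encode("shift_jis")))
```

  Pure ASCII input is accepted by every candidate, so the priority order
alone decides the result in that case.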
1269
1270 @node Big5 and Shift-JIS Functions, Predefined Coding Systems, Detection of Textual Encoding, Coding Systems
1271 @subsection Big5 and Shift-JIS Functions
1272
1273   These are special functions for working with the non-standard
1274 Shift-JIS and Big5 encodings.
1275
1276 @defun decode-shift-jis-char code
1277 This function decodes a JIS X 0208 character encoded in Shift-JIS.
1278 @var{code} is the character code in Shift-JIS as a cons of two bytes.
1279 The corresponding character is returned.
1280 @end defun
1281
1282 @defun encode-shift-jis-char character
1283 This function encodes the JIS X 0208 character @var{character} in
1284 Shift-JIS.  The corresponding character code in Shift-JIS is returned
1285 as a cons of two bytes.
1286 @end defun
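
  The arithmetic behind the Shift-JIS byte pair can be sketched as
follows.  This Python function is the commonly used conversion from a
JIS X 0208 code to Shift-JIS; it is shown for illustration and is not
taken from the XEmacs sources.  The name @code{jis_to_sjis} is
hypothetical, and a tuple stands in for the cons of two bytes.

```python
def jis_to_sjis(j1, j2):
    """Convert a JIS X 0208 code (two bytes in 0x21..0x7E) to a Shift-JIS byte pair."""
    # Two JIS rows share one lead byte; lead bytes skip the 0xA0..0xDF range.
    s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
    if j1 & 1:
        # Odd row: trail bytes start at 0x40 and skip over 0x7F.
        s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)
    else:
        # Even row: trail bytes start at 0x9F.
        s2 = j2 + 0x7E
    return (s1, s2)

print([hex(b) for b in jis_to_sjis(0x46, 0x7C)])
```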
1287
1288 @defun decode-big5-char code
1289 This function decodes a character encoded in Big5.  @var{code} is the
1290 character code in Big5.  The corresponding character
1291 is returned.
1292 @end defun
1293
1294 @defun encode-big5-char character
1295 This function encodes the Big5 character @var{character} in the Big5
1296 encoding.  The corresponding character code in Big5 is returned.
1297 @end defun
1298
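  The following sketch shows a round trip through the two Shift-JIS
functions.  The byte pair 0x82 0xA0 is the Shift-JIS code for the
hiragana letter ``a''; the variable names are only for illustration.

@example
(let* ((sjis-code '(#x82 . #xA0))
       (char (decode-shift-jis-char sjis-code)))
  ;; Encoding the decoded character again should give back a cons
  ;; holding the same two bytes we started with.
  (encode-shift-jis-char char))
@end example
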
1299 @node Predefined Coding Systems, , Big5 and Shift-JIS Functions, Coding Systems
1300 @subsection Coding Systems Implemented
1301
1302   MULE initializes most of the commonly used coding systems at XEmacs's
1303 startup.  A few others are initialized only when the relevant language
1304 environment is selected and support libraries are loaded.  (NB: The
1305 following list is based on XEmacs 21.2.19, the development branch at the
1306 time of writing.  The list may be somewhat different for other
1307 versions.  Recent versions of GNU Emacs 20 implement a few more rare
1308 coding systems; work is being done to port these to XEmacs.)
1309
1310   Unfortunately, there is not a consistent naming convention for character
1311 sets, and for practical purposes coding systems often take their name
1312 from their principal character sets (ASCII, KOI8-R, Shift JIS).  Others
1313 take their names from the coding system (ISO-2022-JP, EUC-KR), and a few
1314 from their non-text usages (internal, binary).  To provide for this, and
1315 for the fact that many coding systems have several common names, an
1316 aliasing system is provided.  Finally, some effort has been made to use
1317 names that are registered as MIME charsets (this is why the name
1318 'shift_jis contains that un-Lisp-y underscore).
1319
1320   There is a systematic naming convention regarding end-of-line (EOL)
conventions for different systems.  A coding system whose name ends in
"-unix" forces the assumption that lines are broken by newlines (0x0A).
A coding system whose name ends in "-mac" forces the assumption that
lines are broken by ASCII CRs (0x0D).  A coding system whose name ends
in "-dos" forces the assumption that lines are broken by CRLF sequences
1326 (0x0D 0x0A).  These subsidiary coding systems are automatically derived
1327 from a base coding system.  Use of the base coding system implies
1328 autodetection of the text file convention.  (The fact that the -unix,
1329 -mac, and -dos are derived from a base system results in them showing up
1330 as "aliases" in `list-coding-systems'.)  These subsidiaries have a
1331 consistent modeline indicator as well.  "-dos" coding systems have ":T"
1332 appended to their modeline indicator, while "-mac" coding systems have
1333 ":t" appended (eg, "ISO8:t" for iso-2022-8-mac).
1334
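  Because the subsidiary names are formed mechanically, they can be
computed from the base name.  The helper below is hypothetical (it is
not part of XEmacs) and only builds the symbols; it does not check that
the coding systems exist.

@example
(defun my-eol-subsidiaries (base)
  "Return the names of the three EOL subsidiaries of BASE as symbols."
  (mapcar (lambda (suffix)
            (intern (concat (symbol-name base) suffix)))
          '("-unix" "-dos" "-mac")))

(my-eol-subsidiaries 'iso-2022-8)
     @result{} (iso-2022-8-unix iso-2022-8-dos iso-2022-8-mac)
@end example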
1335   In the following table, each coding system is given with its mode line
1336 indicator in parentheses.  Non-textual coding systems are listed first,
1337 followed by textual coding systems and their aliases. (The coding system
1338 subsidiary modeline indicators ":T" and ":t" will be omitted from the
1339 table of coding systems.)
1340
1341   ### SJT 1999-08-23 Maybe should order these by language?  Definitely
1342 need language usage for the ISO-8859 family.
1343
1344   Note that although true coding system aliases have been implemented for
1345 XEmacs 21.2, the coding system initialization has not yet been converted
1346 as of 21.2.19.  So coding systems described as aliases have the same
1347 properties as the aliased coding system, but will not be equal as Lisp
1348 objects.
1349
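  The practical consequence of the note on aliases can be checked as
follows.  This is only a sketch: it assumes that
@code{find-coding-system} returns the coding system object for a name,
and it uses @code{binary}, which the table describes as an alias for
@code{raw-text-unix}.

@example
(let ((alias  (find-coding-system 'binary))
      (target (find-coding-system 'raw-text-unix)))
  ;; With the 21.2.19 initialization described above, the two objects
  ;; have the same properties but are not expected to be `eq'.
  (eq alias target))
@end example
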
1350 @table @code
1351
1352 @item automatic-conversion
1353 @itemx undecided
1354 @itemx undecided-dos
1355 @itemx undecided-mac
1356 @itemx undecided-unix
1357
1358 Modeline indicator: @code{Auto}.  A type @code{undecided} coding system.
1359 Attempts to determine an appropriate coding system from file contents or
1360 the environment.
1361
1362 @item raw-text
1363 @itemx no-conversion
1364 @itemx raw-text-dos
1365 @itemx raw-text-mac
1366 @itemx raw-text-unix
1367 @itemx no-conversion-dos
1368 @itemx no-conversion-mac
1369 @itemx no-conversion-unix
1370
1371 Modeline indicator: @code{Raw}.  A type @code{no-conversion} coding system,
1372 which converts only line-break-codes.  An implementation quirk means
1373 that this coding system is also used for ISO8859-1.
1374
1375 @item binary
1376 Modeline indicator: @code{Binary}.  A type @code{no-conversion} coding
1377 system which does no character coding or EOL conversions.  An alias for
1378 @code{raw-text-unix}.
1379
1380 @item alternativnyj
1381 @itemx alternativnyj-dos
1382 @itemx alternativnyj-mac
1383 @itemx alternativnyj-unix
1384
1385 Modeline indicator: @code{Cy.Alt}.  A type @code{ccl} coding system used for
1386 Alternativnyj, an encoding of the Cyrillic alphabet.
1387
1388 @item big5
1389 @itemx big5-dos
1390 @itemx big5-mac
1391 @itemx big5-unix
1392
1393 Modeline indicator: @code{Zh/Big5}.  A type @code{big5} coding system used for
1394 BIG5, the most common encoding of traditional Chinese as used in Taiwan.
1395
1396 @item cn-gb-2312
1397 @itemx cn-gb-2312-dos
1398 @itemx cn-gb-2312-mac
1399 @itemx cn-gb-2312-unix
1400
1401 Modeline indicator: @code{Zh-GB/EUC}.  A type @code{iso2022} coding system used
1402 for simplified Chinese (as used in the People's Republic of China), with
1403 the @code{ascii} (G0), @code{chinese-gb2312} (G1), and @code{sisheng}
1404 (G2) character sets initially designated.  Chinese EUC (Extended Unix
1405 Code).
1406
1407 @item ctext-hebrew
1408 @itemx ctext-hebrew-dos
1409 @itemx ctext-hebrew-mac
1410 @itemx ctext-hebrew-unix
1411
1412 Modeline indicator: @code{CText/Hbrw}.  A type @code{iso2022} coding system
1413 with the @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) character
1414 sets initially designated for Hebrew.
1415
1416 @item ctext
1417 @itemx ctext-dos
1418 @itemx ctext-mac
1419 @itemx ctext-unix
1420
1421 Modeline indicator: @code{CText}.  A type @code{iso2022} 8-bit coding system
1422 with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) character
sets initially designated.  X11 Compound Text Encoding.  Often
mistakenly recognized instead of EUC encodings; the usual cause is an
inappropriate setting of @code{coding-priority-list}.
1426
1427 @item escape-quoted
1428
1429 Modeline indicator: @code{ESC/Quot}.  A type @code{iso2022} 8-bit coding
1430 system with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1)
1431 character sets initially designated and escape quoting.  Unix EOL
1432 conversion (ie, no conversion).  It is used for .ELC files.
1433
1434 @item euc-jp
1435 @itemx euc-jp-dos
1436 @itemx euc-jp-mac
1437 @itemx euc-jp-unix
1438
1439 Modeline indicator: @code{Ja/EUC}.  A type @code{iso2022} 8-bit coding system
1440 with @code{ascii} (G0), @code{japanese-jisx0208} (G1),
1441 @code{katakana-jisx0201} (G2), and @code{japanese-jisx0212} (G3)
1442 initially designated.  Japanese EUC (Extended Unix Code).
1443
1444 @item euc-kr
1445 @itemx euc-kr-dos
1446 @itemx euc-kr-mac
1447 @itemx euc-kr-unix
1448
1449 Modeline indicator: @code{ko/EUC}.  A type @code{iso2022} 8-bit coding system
1450 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1451 designated.  Korean EUC (Extended Unix Code).
1452
1453 @item hz-gb-2312
1454 Modeline indicator: @code{Zh-GB/Hz}.  A type @code{no-conversion} coding
1455 system with Unix EOL convention (ie, no conversion) using
1456 post-read-decode and pre-write-encode functions to translate the Hz/ZW
1457 coding system used for Chinese.
1458
1459 @item iso-2022-7bit
1460 @itemx iso-2022-7bit-unix
1461 @itemx iso-2022-7bit-dos
1462 @itemx iso-2022-7bit-mac
1463 @itemx iso-2022-7
1464
1465 Modeline indicator: @code{ISO7}.  A type @code{iso2022} 7-bit coding system
1466 with @code{ascii} (G0) initially designated.  Other character sets must
1467 be explicitly designated to be used.
1468
1469 @item iso-2022-7bit-ss2
1470 @itemx iso-2022-7bit-ss2-dos
1471 @itemx iso-2022-7bit-ss2-mac
1472 @itemx iso-2022-7bit-ss2-unix
1473
1474 Modeline indicator: @code{ISO7/SS}.  A type @code{iso2022} 7-bit coding system
1475 with @code{ascii} (G0) initially designated.  Other character sets must
1476 be explicitly designated to be used.  SS2 is used to invoke a
1477 96-charset, one character at a time.
1478
1479 @item iso-2022-8
1480 @itemx iso-2022-8-dos
1481 @itemx iso-2022-8-mac
1482 @itemx iso-2022-8-unix
1483
1484 Modeline indicator: @code{ISO8}.  A type @code{iso2022} 8-bit coding system
1485 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
1486 designated.  Other character sets must be explicitly designated to be
1487 used.  No single-shift or locking-shift.
1488
1489 @item iso-2022-8bit-ss2
1490 @itemx iso-2022-8bit-ss2-dos
1491 @itemx iso-2022-8bit-ss2-mac
1492 @itemx iso-2022-8bit-ss2-unix
1493
1494 Modeline indicator: @code{ISO8/SS}.  A type @code{iso2022} 8-bit coding system
1495 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
1496 designated.  Other character sets must be explicitly designated to be
1497 used.  SS2 is used to invoke a 96-charset, one character at a time.
1498
1499 @item iso-2022-int-1
1500 @itemx iso-2022-int-1-dos
1501 @itemx iso-2022-int-1-mac
1502 @itemx iso-2022-int-1-unix
1503
1504 Modeline indicator: @code{INT-1}.  A type @code{iso2022} 7-bit coding system
1505 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1506 designated.  ISO-2022-INT-1.
1507
1508 @item iso-2022-jp-1978-irv
1509 @itemx iso-2022-jp-1978-irv-dos
1510 @itemx iso-2022-jp-1978-irv-mac
1511 @itemx iso-2022-jp-1978-irv-unix
1512
1513 Modeline indicator: @code{Ja-78/7bit}.  A type @code{iso2022} 7-bit coding
1514 system.  For compatibility with old Japanese terminals; if you need to
1515 know, look at the source.
1516
1517 @item iso-2022-jp
1518 @itemx iso-2022-jp-2 (ISO7/SS)
1519 @itemx iso-2022-jp-dos
1520 @itemx iso-2022-jp-mac
1521 @itemx iso-2022-jp-unix
1522 @itemx iso-2022-jp-2-dos
1523 @itemx iso-2022-jp-2-mac
1524 @itemx iso-2022-jp-2-unix
1525
1526 Modeline indicator: @code{MULE/7bit}.  A type @code{iso2022} 7-bit coding
1527 system with @code{ascii} (G0) initially designated, and complex
specifications to ensure backward compatibility with old Japanese
1529 systems.  Used for communication with mail and news in Japan.  The "-2"
1530 versions also use SS2 to invoke a 96-charset one character at a time.
1531
1532 @item iso-2022-kr
Modeline indicator: @code{Ko/7bit}.  A type @code{iso2022} 7-bit coding
1534 system with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1535 designated.  Used for e-mail in Korea.
1536
1537 @item iso-2022-lock
1538 @itemx iso-2022-lock-dos
1539 @itemx iso-2022-lock-mac
1540 @itemx iso-2022-lock-unix
1541
1542 Modeline indicator: @code{ISO7/Lock}.  A type @code{iso2022} 7-bit coding
1543 system with @code{ascii} (G0) initially designated, using Locking-Shift
1544 to invoke a 96-charset.
1545
1546 @item iso-8859-1
1547 @itemx iso-8859-1-dos
1548 @itemx iso-8859-1-mac
1549 @itemx iso-8859-1-unix
1550
1551 Due to implementation, this is not a type @code{iso2022} coding system,
1552 but rather an alias for the @code{raw-text} coding system.
1553
1554 @item iso-8859-2
1555 @itemx iso-8859-2-dos
1556 @itemx iso-8859-2-mac
1557 @itemx iso-8859-2-unix
1558
1559 Modeline indicator: @code{MIME/Ltn-2}.  A type @code{iso2022} coding
1560 system with @code{ascii} (G0) and @code{latin-iso8859-2} (G1) initially
1561 invoked.
1562
1563 @item iso-8859-3
1564 @itemx iso-8859-3-dos
1565 @itemx iso-8859-3-mac
1566 @itemx iso-8859-3-unix
1567
1568 Modeline indicator: @code{MIME/Ltn-3}.  A type @code{iso2022} coding system
1569 with @code{ascii} (G0) and @code{latin-iso8859-3} (G1) initially
1570 invoked.
1571
1572 @item iso-8859-4
1573 @itemx iso-8859-4-dos
1574 @itemx iso-8859-4-mac
1575 @itemx iso-8859-4-unix
1576
1577 Modeline indicator: @code{MIME/Ltn-4}.  A type @code{iso2022} coding system
1578 with @code{ascii} (G0) and @code{latin-iso8859-4} (G1) initially
1579 invoked.
1580
1581 @item iso-8859-5
1582 @itemx iso-8859-5-dos
1583 @itemx iso-8859-5-mac
1584 @itemx iso-8859-5-unix
1585
1586 Modeline indicator: @code{ISO8/Cyr}.  A type @code{iso2022} coding system with
1587 @code{ascii} (G0) and @code{cyrillic-iso8859-5} (G1) initially invoked.
1588
1589 @item iso-8859-7
1590 @itemx iso-8859-7-dos
1591 @itemx iso-8859-7-mac
1592 @itemx iso-8859-7-unix
1593
1594 Modeline indicator: @code{Grk}.  A type @code{iso2022} coding system with
1595 @code{ascii} (G0) and @code{greek-iso8859-7} (G1) initially invoked.
1596
1597 @item iso-8859-8
1598 @itemx iso-8859-8-dos
1599 @itemx iso-8859-8-mac
1600 @itemx iso-8859-8-unix
1601
1602 Modeline indicator: @code{MIME/Hbrw}.  A type @code{iso2022} coding system with
1603 @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) initially invoked.
1604
1605 @item iso-8859-9
1606 @itemx iso-8859-9-dos
1607 @itemx iso-8859-9-mac
1608 @itemx iso-8859-9-unix
1609
1610 Modeline indicator: @code{MIME/Ltn-5}.  A type @code{iso2022} coding system
1611 with @code{ascii} (G0) and @code{latin-iso8859-9} (G1) initially
1612 invoked.
1613
1614 @item koi8-r
1615 @itemx koi8-r-dos
1616 @itemx koi8-r-mac
1617 @itemx koi8-r-unix
1618
1619 Modeline indicator: @code{KOI8}.  A type @code{ccl} coding-system used for
1620 KOI8-R, an encoding of the Cyrillic alphabet.
1621
1622 @item shift_jis
1623 @itemx shift_jis-dos
1624 @itemx shift_jis-mac
1625 @itemx shift_jis-unix
1626
1627 Modeline indicator: @code{Ja/SJIS}.  A type @code{shift-jis} coding-system
1628 implementing the Shift-JIS encoding for Japanese.  The underscore is to
1629 conform to the MIME charset implementing this encoding.
1630
1631 @item tis-620
1632 @itemx tis-620-dos
1633 @itemx tis-620-mac
1634 @itemx tis-620-unix
1635
1636 Modeline indicator: @code{TIS620}.  A type @code{ccl} encoding for Thai.  The
1637 external encoding is defined by TIS620, the internal encoding is
1638 peculiar to MULE, and called @code{thai-xtis}.
1639
1640 @item viqr
1641
1642 Modeline indicator: @code{VIQR}.  A type @code{no-conversion} coding
1643 system with Unix EOL convention (ie, no conversion) using
1644 post-read-decode and pre-write-encode functions to translate the VIQR
1645 coding system for Vietnamese.
1646
1647 @item viscii
1648 @itemx viscii-dos
1649 @itemx viscii-mac
1650 @itemx viscii-unix
1651
1652 Modeline indicator: @code{VISCII}.  A type @code{ccl} coding-system used
1653 for VISCII 1.1 for Vietnamese.  Differs slightly from VSCII; VISCII is
1654 given priority by XEmacs.
1655
1656 @item vscii
1657 @itemx vscii-dos
1658 @itemx vscii-mac
1659 @itemx vscii-unix
1660
1661 Modeline indicator: @code{VSCII}.  A type @code{ccl} coding-system used
1662 for VSCII 1.1 for Vietnamese.  Differs slightly from VISCII, which is
1663 given priority by XEmacs.  Use
1664 @code{(prefer-coding-system 'vietnamese-vscii)} to give priority to VSCII.
1665
1666 @end table
1667
1668 @node CCL, Category Tables, Coding Systems, MULE
1669 @section CCL
1670
1671   CCL (Code Conversion Language) is a simple structured programming
1672 language designed for character coding conversions.  A CCL program is
1673 compiled to CCL code (represented by a vector of integers) and executed
1674 by the CCL interpreter embedded in Emacs.  The CCL interpreter
1675 implements a virtual machine with 8 registers called @code{r0}, ...,
1676 @code{r7}, a number of control structures, and some I/O operators.  Take
1677 care when using registers @code{r0} (used in implicit @dfn{set}
1678 statements) and especially @code{r7} (used internally by several
1679 statements and operations, especially for multiple return values and I/O
1680 operations).
1681
1682   CCL is used for code conversion during process I/O and file I/O for
1683 non-ISO2022 coding systems.  (It is the only way for a user to specify a
1684 code conversion function.)  It is also used for calculating the code
1685 point of an X11 font from a character code.  However, since CCL is
1686 designed as a powerful programming language, it can be used for more
1687 generic calculation where efficiency is demanded.  A combination of
1688 three or more arithmetic operations can be calculated faster by CCL than
1689 by Emacs Lisp.
1690
1691   @strong{Warning:}  The code in @file{src/mule-ccl.c} and
1692 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive
1693 description of CCL's semantics.  The previous version of this section
1694 contained several typos and obsolete names left from earlier versions of
1695 MULE, and many may remain.  (I am not an experienced CCL programmer; the
1696 few who know CCL well find writing English painful.)
1697
1698   A CCL program transforms an input data stream into an output data
1699 stream.  The input stream, held in a buffer of constant bytes, is left
1700 unchanged.  The buffer may be filled by an external input operation,
1701 taken from an Emacs buffer, or taken from a Lisp string.  The output
1702 buffer is a dynamic array of bytes, which can be written by an external
1703 output operation, inserted into an Emacs buffer, or returned as a Lisp
1704 string.
1705
1706   A CCL program is a (Lisp) list containing two or three members.  The
1707 first member is the @dfn{buffer magnification}, which indicates the
1708 required minimum size of the output buffer as a multiple of the input
1709 buffer.  It is followed by the @dfn{main block} which executes while
1710 there is input remaining, and an optional @dfn{EOF block} which is
1711 executed when the input is exhausted.  Both the main block and the EOF
1712 block are CCL blocks.
1713
1714   A @dfn{CCL block} is either a CCL statement or list of CCL statements.
1715 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer
1716 or an @dfn{assignment}, which is a list of a register to receive the
1717 assignment, an assignment operator, and an expression) or a @dfn{control
1718 statement} (a list starting with a keyword, whose allowable syntax
1719 depends on the keyword).
1720
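  To make the structure concrete, here is a small uncompiled CCL
program written as Lisp data.  The variable name is hypothetical and
the program is only a sketch: it converts lower-case ASCII letters to
upper case and passes every other byte through unchanged.

@example
(defvar my-ccl-upcase-source
  '(1                          ; buffer magnification
    (loop                      ; main block: a single loop statement
      (read r0)                ;   read one byte into r0
      (if (r0 >= 97)           ;   97 is ASCII `a'
          (if (r0 <= 122)      ;   122 is ASCII `z'
              (r0 -= 32)))     ;   shift to the upper-case letter
      (write-repeat r0))       ;   write r0 and jump to the loop head
    (write 10))                ; EOF block: write a final newline
  "Uncompiled CCL program that upcases ASCII letters (illustration).")
@end example
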
1721 @menu
1722 * CCL Syntax::          CCL program syntax in BNF notation.
1723 * CCL Statements::      Semantics of CCL statements.
1724 * CCL Expressions::     Operators and expressions in CCL.
1725 * Calling CCL::         Running CCL programs.
1726 * CCL Example::         A trivial program to transform the Web's URL encoding.
1727 @end menu
1728
1729 @node    CCL Syntax, CCL Statements, , CCL
1730 @comment Node,       Next,           Previous,  Up
1731 @subsection CCL Syntax
1732
1733   The full syntax of a CCL program in BNF notation:
1734
1735 @format
1736 CCL_PROGRAM :=
1737         (BUFFER_MAGNIFICATION
1738          CCL_MAIN_BLOCK
1739          [ CCL_EOF_BLOCK ])
1740
1741 BUFFER_MAGNIFICATION := integer
1742 CCL_MAIN_BLOCK := CCL_BLOCK
1743 CCL_EOF_BLOCK := CCL_BLOCK
1744
1745 CCL_BLOCK :=
1746         STATEMENT | (STATEMENT [STATEMENT ...])
1747 STATEMENT :=
1748         SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE
1749         | CALL | END
1750
1751 SET :=
1752         (REG = EXPRESSION)
1753         | (REG ASSIGNMENT_OPERATOR EXPRESSION)
1754         | integer
1755
1756 EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)
1757
1758 IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
1759 BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
1760 LOOP := (loop STATEMENT [STATEMENT ...])
1761 BREAK := (break)
1762 REPEAT :=
1763         (repeat)
1764         | (write-repeat [REG | integer | string])
1765         | (write-read-repeat REG [integer | ARRAY])
1766 READ :=
1767         (read REG ...)
1768         | (read-if (REG OPERATOR ARG) CCL_BLOCK CCL_BLOCK)
1769         | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
1770 WRITE :=
1771         (write REG ...)
1772         | (write EXPRESSION)
1773         | (write integer) | (write string) | (write REG ARRAY)
1774         | string
1775 CALL := (call ccl-program-name)
1776 END := (end)
1777
1778 REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
1779 ARG := REG | integer
1780 OPERATOR :=
1781         + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
1782         | < | > | == | <= | >= | != | de-sjis | en-sjis
1783 ASSIGNMENT_OPERATOR :=
1784         += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
1785 ARRAY := '[' integer ... ']'
1786 @end format
1787
1788 @node    CCL Statements, CCL Expressions, CCL Syntax, CCL
1789 @comment Node,           Next,            Previous,   Up
1790 @subsection CCL Statements
1791
1792   The Emacs Code Conversion Language provides the following statement
1793 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
1794 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
1795
1796 @heading Set statement:
1797
1798   The @dfn{set} statement has three variants with the syntaxes
1799 @samp{(@var{reg} = @var{expression})},
1800 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
1801 @samp{@var{integer}}.  The assignment operator variation of the
1802 @dfn{set} statement works the same way as the corresponding C expression
1803 statement does.  The assignment operators are @code{+=}, @code{-=},
1804 @code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=},
@code{<<=}, and @code{>>=}, and they have the same meanings as in C.  A
"naked integer" @var{integer} is equivalent to a @dfn{set} statement of
the form @code{(r0 = @var{integer})}.
1808
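  The three variants can be seen side by side in the following CCL
block.  It is a fragment for illustration rather than a complete
program; the values in the comments follow from the C semantics
described above.

@example
((r1 = 10)          ; plain assignment: r1 is 10
 (r1 += 5)          ; assignment operator: r1 is now 15
 (r2 = (r1 << 2))   ; expression on the right-hand side: r2 is 60
 0)                 ; naked integer, same as (r0 = 0)
@end example
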
1809 @heading I/O statements:
1810
1811   The @dfn{read} statement takes one or more registers as arguments.  It
1812 reads one byte (a C char) from the input into each register in turn.
1813
  The @dfn{write} statement takes several forms.  In the form @samp{(write @var{reg}
1815 ...)} it takes one or more registers as arguments and writes each in
1816 turn to the output.  The integer in a register (interpreted as an
1817 Emchar) is encoded to multibyte form (ie, Bufbytes) and written to the
1818 current output buffer.  If it is less than 256, it is written as is.
1819 The forms @samp{(write @var{expression})} and @samp{(write
1820 @var{integer})} are treated analogously.  The form @samp{(write
1821 @var{string})} writes the constant string to the output.  A
1822 "naked string" @samp{@var{string}} is equivalent to the statement @samp{(write
1823 @var{string})}.  The form @samp{(write @var{reg} @var{array})} writes
1824 the @var{reg}th element of the @var{array} to the output.
1825
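  The following CCL block sketches the I/O forms described above.  It
assumes that array elements are counted from zero, as the blocks of a
@code{branch} statement are; check the source if the exact indexing
matters to you.

@example
((r2 = 1)
 (write r2 [88 89 90])   ; write the element at index r2 of the array
 "done"                  ; naked string, same as (write "done")
 (loop
   (read r0 r1)          ; read two bytes, one into each register
   (write r1 r0)         ; write them back in the opposite order
   (repeat)))            ; continue until the input is exhausted
@end example
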
1826 @heading Conditional statements:
1827
1828   The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and
1829 an optional @var{second CCL block} as arguments.  If the
1830 @var{expression} evaluates to non-zero, the first @var{CCL block} is
1831 executed.  Otherwise, if there is a @var{second CCL block}, it is
1832 executed.
1833
1834   The @dfn{read-if} variant of the @dfn{if} statement takes an
1835 @var{expression}, a @var{CCL block}, and an optional @var{second CCL
1836 block} as arguments.  The @var{expression} must have the form
1837 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is
1838 a register or an integer).  The @code{read-if} statement first reads
1839 from the input into the first register operand in the @var{expression},
1840 then conditionally executes a CCL block just as the @code{if} statement
1841 does.
1842
1843   The @dfn{branch} statement takes an @var{expression} and one or more CCL
1844 blocks as arguments.  The CCL blocks are treated as a zero-indexed
1845 array, and the @code{branch} statement uses the @var{expression} as the
1846 index of the CCL block to execute.  Null CCL blocks may be used as
1847 no-ops, continuing execution with the statement following the
1848 @code{branch} statement in the containing CCL block.  Out-of-range
1849 values for the @var{expression} are also treated as no-ops.
1850
  The @dfn{read-branch} variant of the @dfn{branch} statement takes a
@var{register} and one or more CCL blocks as arguments.  The
@code{read-branch} statement first reads from the input into the
@var{register}, then uses the value read as the index of the CCL block
to execute, just as the @code{branch} statement does.
1856
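  As an illustration, the two CCL statements below show an @code{if}
with both blocks and a @code{branch} that includes a null block.  The
null block is written here as @code{nil}, which is an assumption about
how the compiler expects it to be spelled.

@example
(if (r0 == 0)
    (write "zero")        ; executed when r0 is zero
  (write "nonzero"))      ; executed otherwise

(branch r1
  (write "zero")          ; executed when r1 is 0
  (write "one")           ; executed when r1 is 1
  nil                     ; r1 is 2: null block, nothing is executed
  (write "three"))        ; executed when r1 is 3
                          ; any other value of r1 is also a no-op
@end example
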
1857 @heading Loop control statements:
1858
1859   The @dfn{loop} statement creates a block with an implied jump from the
1860 end of the block back to its head.  The loop is exited on a @code{break}
1861 statement, and continued without executing the tail by a @code{repeat}
1862 statement.
1863
1864   The @dfn{break} statement, written @samp{(break)}, terminates the
1865 current loop and continues with the next statement in the current
1866 block.
1867
1868   The @dfn{repeat} statement has three variants, @code{repeat},
1869 @code{write-repeat}, and @code{write-read-repeat}.  Each continues the
1870 current loop from its head, possibly after performing I/O.
1871 @code{repeat} takes no arguments and does no I/O before jumping.
1872 @code{write-repeat} takes a single argument (a register, an
1873 integer, or a string), writes it to the output, then jumps.
1874 @code{write-read-repeat} takes one or two arguments.  The first must
1875 be a register.  The second may be an integer or an array; if absent, it
1876 is implicitly set to the first (register) argument.
1877 @code{write-read-repeat} writes its second argument to the output, then
1878 reads from the input into the register, and finally jumps.  See the
1879 @code{write} and @code{read} statements for the semantics of the I/O
1880 operations for each type of argument.
1881
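  Two common loop shapes are sketched below.  The first is a complete
program that copies its input to its output using
@code{write-read-repeat}; the second is a fragment that stops copying
at the first zero byte.

@example
;; Copy input to output.
(1
 ((read r0)                     ; prime r0 with the first input byte
  (loop
    (write-read-repeat r0))))   ; write r0, read the next byte, repeat

;; Copy until a zero byte is seen.
(loop
  (read r0)
  (if (r0 == 0)
      (break))                  ; leave the loop
  (write-repeat r0))            ; write r0 and jump to the loop head
@end example
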
1882 @heading Other control statements:
1883
1884   The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
1885 executes a CCL program as a subroutine.  It does not return a value to
1886 the caller, but can modify the register status.
1887
1888   The @dfn{end} statement, written @samp{(end)}, terminates the CCL
1889 program successfully, and returns to caller (which may be a CCL
1890 program).  It does not alter the status of the registers.
1891
1892 @node    CCL Expressions, Calling CCL, CCL Statements, CCL
1893 @comment Node,            Next,        Previous,       Up
1894 @subsection CCL Expressions
1895
1896   CCL, unlike Lisp, uses infix expressions.  The simplest CCL expressions
1897 consist of a single @var{operand}, either a register (one of @code{r0},
..., @code{r7}) or an integer.  Complex expressions are lists of the
1899 form @code{( @var{expression} @var{operator} @var{operand} )}.  Unlike
1900 C, assignments are not expressions.
1901
  In the following table, @var{X} is the target register for a @dfn{set}.
1903 In subexpressions, this is implicitly @code{r7}.  This means that
1904 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
1905 freely in subexpressions, since they return parts of their values in
1906 @code{r7}.  @var{Y} may be an expression, register, or integer, while
1907 @var{Z} must be a register or an integer.
1908
1909 @multitable @columnfractions .22 .14 .09 .55
1910 @item Name @tab Operator @tab Code @tab C-like Description
1911 @item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z
1912 @item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z
1913 @item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z
1914 @item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z
1915 @item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z
1916 @item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z
1917 @item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z
1918 @item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z
1919 @item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z
1920 @item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z
1921 @item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z
1922 @item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF
1923 @item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z
@item CCL_LS @tab @code{<} @tab 0x10 @tab X = (Y < Z)
@item CCL_GT @tab @code{>} @tab 0x11 @tab X = (Y > Z)
@item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (Y == Z)
@item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (Y <= Z)
@item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (Y >= Z)
@item CCL_NE @tab @code{!=} @tab 0x15 @tab X = (Y != Z)
1930 @item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z))
@item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
1932 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
1933 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
1934 @end multitable
1935
  The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS.  CCL_ENCODE_SJIS
and CCL_DECODE_SJIS treat their first and second operands as the high and
1939 low bytes of a two-byte character code.  (SJIS stands for Shift JIS, an
1940 encoding of Japanese characters used by Microsoft.  CCL_ENCODE_SJIS is a
1941 complicated transformation of the Japanese standard JIS encoding to
1942 Shift JIS.  CCL_DECODE_SJIS is its inverse.)  It is somewhat odd to
1943 represent the SJIS operations in infix form.
1944
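  The CCL block below works through a few of the operators in the
table.  The values in the comments follow from the C-like descriptions
given there; the last line illustrates why @code{r7} should not hold
anything important when subexpressions are used.

@example
((r0 = #x12)
 (r1 = (r0 <8 #x34))         ; r1 = (#x12 << 8) | #x34 = #x1234
 (r2 = (r1 // 256))          ; r2 = #x1234 / 256 = #x12,
                             ; r7 = #x1234 % 256 = #x34
 (r3 = ((r1 & #xFF) + 1)))   ; subexpression goes through r7,
                             ; then r3 = #x34 + 1 = #x35
@end example
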
1945 @node    Calling CCL, CCL Example, CCL Expressions, CCL
1946 @comment Node,        Next,          Previous,        Up
1947 @subsection Calling CCL
1948
1949   CCL programs are called automatically during Emacs buffer I/O when the
1950 external representation has a coding system type of @code{shift-jis},
1951 @code{big5}, or @code{ccl}.  The program is specified by the coding
1952 system (@pxref{Coding Systems}).  You can also call CCL programs from
1953 other CCL programs, and from Lisp using these functions:
1954
1955 @defun ccl-execute ccl-program status
1956 Execute @var{ccl-program} with registers initialized by
1957 @var{status}.  @var{ccl-program} is a vector of compiled CCL code
1958 created by @code{ccl-compile}.  It is an error for the program to try to
1959 execute a CCL I/O command.  @var{status} must be a vector of nine
1960 values, specifying the initial value for the R0, R1 .. R7 registers and
1961 for the instruction counter IC.  A @code{nil} value for a register
1962 initializer causes the register to be set to 0.  A @code{nil} value for
1963 the IC initializer causes execution to start at the beginning of the
1964 program.  When the program is done, @var{status} is modified (by
1965 side-effect) to contain the ending values for the corresponding
1966 registers and IC.
1967 @end defun
1968
1969 @defun ccl-execute-on-string ccl-program status string &optional continue
1970 Execute @var{ccl-program} with initial @var{status} on
1971 @var{string}.  @var{ccl-program} is a vector of compiled CCL code
1972 created by @code{ccl-compile}.  @var{status} must be a vector of nine
1973 values, specifying the initial value for the R0, R1 .. R7 registers and
1974 for the instruction counter IC.  A @code{nil} value for a register
1975 initializer causes the register to be set to 0.  A @code{nil} value for
1976 the IC initializer causes execution to start at the beginning of the
program.  An optional fourth argument @var{continue}, if
non-@code{nil}, causes the IC to remain on the unsatisfied read
operation if the program terminates due to exhaustion of the input
buffer.  Otherwise the IC is set to the end
1981 of the program.  When the program is done, @var{status} is modified (by
1982 side-effect) to contain the ending values for the corresponding
1983 registers and IC.  Returns the resulting string.
1984 @end defun
1985
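  A minimal sketch of running a CCL program on a string follows.  It
assumes that @code{ccl-compile} takes the uncompiled program as its
only argument; the program itself simply copies input to output.

@example
(let ((program (ccl-compile
                '(1 ((read r0)
                     (loop (write-read-repeat r0))))))
      ;; Nine slots: r0 through r7 and the IC, all left at nil so
      ;; that the defaults described above apply.
      (status (make-vector 9 nil)))
  ;; Returns the converted string; STATUS is updated in place.
  (ccl-execute-on-string program status "hello"))
@end example
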
1986   To call a CCL program from another CCL program, it must first be
1987 registered:
1988
1989 @defun register-ccl-program name ccl-program
1990 Register @var{name} for CCL program @var{ccl-program} in
1991 @code{ccl-program-table}.  @var{ccl-program} should be the compiled form of
a CCL program, or @code{nil}.  Return the index number of the registered
CCL program.
1994 @end defun
1995
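  The sketch below registers a tiny subroutine and calls it from a
second program.  The name @code{my-ccl-increment} is hypothetical, and
the call to @code{ccl-compile} with the uncompiled program as its only
argument is an assumption.

@example
;; Register a subroutine that adds one to r0.
(register-ccl-program 'my-ccl-increment
                      (ccl-compile '(0 (r0 += 1))))

;; Run a program that calls the subroutine twice, starting with
;; r0 set to 5 and everything else left at its default.
(let ((status (vector 5 nil nil nil nil nil nil nil nil)))
  (ccl-execute (ccl-compile '(0 ((call my-ccl-increment)
                                 (call my-ccl-increment))))
               status)
  ;; The final value of r0, as left in STATUS by the interpreter.
  (aref status 0))
@end example
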
1996   Information about the processor time used by the CCL interpreter can be
1997 obtained using these functions:
1998
1999 @defun ccl-elapsed-time
Returns the elapsed processor time of the CCL interpreter as a cons of
user and system time, as floating point numbers measured in seconds.
If only one overall value can be determined, the return value will be a
cons of that value and 0.
2005 @end defun
2006
2007 @defun ccl-reset-elapsed-time
2008 Resets the CCL interpreter's internal elapsed time registers.
2009 @end defun
2010
2011 @node    CCL Example, ,  Calling CCL, CCL
2012 @comment Node,         Next, Previous,    Up
2013 @subsection CCL Example
2014
2015   In this section, we describe the implementation of a trivial coding
2016 system to transform from the Web's URL encoding to XEmacs' internal
2017 coding.  Many people will have been first exposed to URL encoding when
2018 they saw ``%20'' where they expected a space in a file's name on their
2019 local hard disk; this can happen when a browser saves a file from the
2020 web and doesn't encode the name, as passed from the server, properly.
2021
2022   URL encoding itself is underspecified with regard to encodings beyond
2023 ASCII.  The relevant document, RFC 1738, explicitly doesn't give any
2024 information on how to encode non-ASCII characters, and the ``obvious''
2025 way---use the %xx values for the octets of the eight bit MIME character
2026 set in which the page was served---breaks when a user types a character
2027 outside that character set.  Best practice for web development is to
2028 serve all pages as UTF-8 and treat incoming form data as using that
2029 coding system.  (Oh, and gamble that your clients won't ever want to
2030 type anything outside Unicode.  But that's not so much of a gamble with
2031 today's client operating systems.)  We don't treat non-ASCII in this
2032 example, as dealing with @samp{(read-multibyte-character ...)} and
2033 errors therewith would make it much harder to understand.
2034
2035   Since CCL isn't a very rich language, we move much of the logic that
2036 would ordinarily be computed from operations like @code{(member ..)},
2037 @code{(and ...)} and @code{(or ...)} into tables, from which register
2038 values are read and written, and on which @code{if} statements are
2039 predicated.  Much more of the implementation of this coding system is
2040 occupied with constructing these tables---in normal Emacs Lisp---than it
2041 is with actual CCL code.
2042
2043   All the @code{defvar} statements we deal with in the next few sections
2044 are surrounded by a @code{(eval-and-compile ...)}, which means that the
2045 logic which initializes these variables executes at compile time, and if
2046 XEmacs loads the compiled version of the file, these variables are
2047 initialized as constants.
2048
2049 @menu
2050 * Four bits to ASCII::  Two tables used for getting hex digits from ASCII.
2051 * URI Encoding constants::  Useful predefined characters.
2052 * Numeric to ASCII-hexadecimal conversion:: Trivial in Lisp, not so in CCL.
2053 * Characters to be preserved:: No transformation needed for these characters.
2054 * The program to decode to internal format:: .
2055 * The program to encode from internal format:: .
2056 * The actual coding system:: .
2057 @end menu
2058
2059 @node Four bits to ASCII, URI Encoding constants, , CCL Example
2060 @subsubsection Four bits to ASCII
2061
2062   The first @code{defvar} is for
2063 @code{url-coding-high-order-nybble-as-ascii}, a 256-entry table that
2064 maps from an octet's value to the ASCII encoding for the hex value of
2065 its most significant four bits.  That might sound complex, but it isn't;
2066 for decimal 65, hex value @samp{#x41}, the entry in the table is the
2067 ASCII encoding of `4'.  For decimal 122, ASCII `z', hex value
2068 @code{#x7a}, @code{(elt url-coding-high-order-nybble-as-ascii #x7a)}
2069 after this file is loaded gives the ASCII encoding of 7.
2070
@example
(defvar url-coding-high-order-nybble-as-ascii
  (let ((val (make-vector 256 0))
        (i 0))
    (while (< i (length val))
      (aset val i (char-to-int (aref (format "%02X" i) 0)))
      (setq i (1+ i)))
    val)
  "Table to find an ASCII version of an octet's most significant 4 bits.")
@end example
2081
  The next table, @code{url-coding-low-order-nybble-as-ascii}, is almost
the same thing, but this time it has a map for the hex encoding of the
low-order four bits.  So the entry at offset @samp{#x41} is the ASCII
encoding of `1', and the entry at offset @samp{#x7a} is the ASCII
encoding of `A'.
2087
@example
(defvar url-coding-low-order-nybble-as-ascii
  (let ((val (make-vector 256 0))
        (i 0))
    (while (< i (length val))
      (aset val i (char-to-int (aref (format "%02X" i) 1)))
      (setq i (1+ i)))
    val)
  "Table to find an ASCII version of an octet's least significant 4 bits.")
@end example
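
  The two tables simply precompute, for every octet, what would
otherwise be a shift, a mask, and a digit lookup.  As a small sketch in
ordinary Lisp, here is the same split for the octet @samp{#x7a}; the
helper name @code{my-octet-nybbles} is hypothetical and is not used by
the coding system itself.

@example
(defun my-octet-nybbles (octet)
  "Return a list (HIGH LOW) of the two four-bit halves of OCTET."
  (list (ash octet -4)          ; most significant four bits
        (logand octet #x0f)))   ; least significant four bits

(my-octet-nybbles #x7a)         ; => (7 10)
(format "%02X" #x7a)            ; => "7A"
@end example

The tables store the ASCII codes of the two characters of that
formatted string, so that CCL can fetch each with a single array
lookup.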
2098
2099 @node URI Encoding constants, Numeric to ASCII-hexadecimal conversion, Four bits to ASCII, CCL Example
2100 @subsubsection URI Encoding constants
2101
  Next, we have a couple of variables that make the CCL code more
readable.  The first is the ASCII encoding of the percentage sign; this
character is used as an escape code, to start the encoding of a
character that is not preserved as-is.  For historical reasons, URL
encoding allows the space character to be encoded as a plus sign (it
does make typing URLs like
@samp{http://google.com/search?q=XEmacs+home+page} easier), and as such,
we have to check when decoding for this value, and map it to the space
character.  When doing this in CCL, we use the
@code{url-coding-escaped-space-code} variable.
2111
@example
(defvar url-coding-escape-character-code (char-to-int ?%)
  "The code point for the percentage sign, in ASCII.")

(defvar url-coding-escaped-space-code (char-to-int ?+)
  "The URL-encoded value of the space character, that is, +.")
@end example
2119
2120 @node Numeric to ASCII-hexadecimal conversion, Characters to be preserved, URI Encoding constants, CCL Example
2121 @subsubsection Numeric to ASCII-hexadecimal conversion
2122
  Now, we have a couple of utility tables that wouldn't be necessary in
a more expressive programming language than CCL.  The first is sixteen
in length, and maps a hexadecimal number to the ASCII encoding of that
number; so zero maps to ASCII `0', ten maps to ASCII `A'.  The second
does the reverse; that is, it maps an ASCII character to its value when
interpreted as a hexadecimal digit.  (`A' => 10, `c' => 12, `2' => 2, as
a few examples.)
2130
@example
(defvar url-coding-hex-digit-table
  (let ((i 0)
        (val (make-vector 16 0)))
    (while (< i 16)
      (aset val i (char-to-int (aref (format "%X" i) 0)))
      (setq i (1+ i)))
    val)
  "A map from a hexadecimal digit's numeric value to its encoding in ASCII.")

(defvar url-coding-latin-1-as-hex-table
  (let ((val (make-vector 256 0))
        (i 0))
    (while (< i (length val))
      ;; Get a hex val for this ASCII character.
      (aset val i (string-to-int (format "%c" i) 16))
      (setq i (1+ i)))
    val)
  "A map from Latin 1 code points to their values as hexadecimal digits.")
@end example
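
  For illustration, the individual lookups these tables replace can be
written directly in Lisp.  This sketch uses @code{string-to-number} with
an explicit base rather than the @code{string-to-int} call above.  It
also assumes, as is usual for such conversions, that a string that does
not parse as a number gives 0, which would explain why the second
table's entries for non-hexadecimal characters are zero.

@example
;; Numeric value to ASCII hex digit, as in url-coding-hex-digit-table.
(format "%X" 10)                ; => "A"

;; ASCII hex digit to numeric value, as in
;; url-coding-latin-1-as-hex-table.
(string-to-number "A" 16)       ; => 10
(string-to-number "c" 16)       ; => 12
(string-to-number "2" 16)       ; => 2
@end example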
2151
2152 @node Characters to be preserved, The program to decode to internal format, Numeric to ASCII-hexadecimal conversion, CCL Example
2153 @subsubsection Characters to be preserved
2154
2155   And finally, the last of these tables.  URL encoding says that
2156 alphanumeric characters, the underscore, hyphen and the full stop
2157 @footnote{That's what the standards call it, though my North American
2158 readers will be more familiar with it as the period character.} retain
2159 their ASCII encoding, and don't undergo transformation.
@code{url-coding-should-preserve-table} is an array in which the entries
are one if the corresponding ASCII character should be left as-is, and
zero if they should be transformed.  So the entries for all the control
and most of the punctuation characters are zero.  Lisp programmers will
observe that this initialization is particularly inefficient, but
they'll also be aware that this is a long way from an inner loop where
every nanosecond counts.
2167
@example
(defvar url-coding-should-preserve-table
  (let ((preserve
         (list ?- ?_ ?. ?a ?b ?c ?d ?e ?f ?g ?h ?i ?j ?k ?l ?m ?n ?o
               ?p ?q ?r ?s ?t ?u ?v ?w ?x ?y ?z ?A ?B ?C ?D ?E ?F ?G
               ?H ?I ?J ?K ?L ?M ?N ?O ?P ?Q ?R ?S ?T ?U ?V ?W ?X ?Y
               ?Z ?0 ?1 ?2 ?3 ?4 ?5 ?6 ?7 ?8 ?9))
        (i 0)
        (res (make-vector 256 0)))
    (while (< i 256)
      (when (member (int-char i) preserve)
        (aset res i 1))
      (setq i (1+ i)))
    res)
  "A 256-entry array of flags, indicating whether or not to preserve an
octet as its ASCII encoding.")
@end example
2185
2186 @node The program to decode to internal format, The program to encode from internal format, Characters to be preserved, CCL Example
2187 @subsubsection The program to decode to internal format
2188
  After the almost interminable tables, we get to the CCL.  The first
CCL program, @code{ccl-decode-urlcoding}, decodes from the URL coding to
our internal format; since this version of CCL doesn't have support for
error checking on the input, we don't do any verification on it.

The buffer magnification (the approximate ratio of the size of the
output buffer to the size of the input buffer) is declared as one,
because fractional values aren't allowed.  (Since all those %20's will
map to ` ', the length of the output text will be less than that of the
input text.)
2199
So, first we read an octet from the input buffer into register
@samp{r0}, to set up the loop.  Next, we start the loop, with a
@code{(loop ...)} statement, and we check if the value in @samp{r0} is a
percentage sign.  (Note the comma before
@code{url-coding-escape-character-code}; since the CCL program is
written as a backquoted Lisp form, the comma causes the expression after
it to be evaluated when the form is built, and as such,
``@code{,url-coding-escape-character-code}'' will be replaced by a
literal `37'.)
2208
If it is a percentage sign, we read the next two octets into @samp{r2}
and @samp{r3}, and convert them into their hexadecimal numeric values,
using the @code{url-coding-latin-1-as-hex-table} array declared above.
(But again, it'll be interpreted as a literal array.)  We then left
shift the first by four bits, combine the two with a bitwise OR, and
write the result to the output buffer.
2215
If it isn't a percentage sign, and it is a `+' sign, we write a space
(hexadecimal 20) to the output buffer.
2218
2219 If none of those things are true, we pass the octet to the output buffer
2220 untransformed.  (This could be a place to put error checking, in a more
2221 expressive language.)  We then read one more octet from the input
2222 buffer, and move to the next iteration of the loop.
2223
@example
(define-ccl-program ccl-decode-urlcoding
  `(1
    ((read r0)
     (loop
       (if (r0 == ,url-coding-escape-character-code)
           ((read r2 r3)
            ;; Replace the ASCII hex digits in r2 and r3 with their
            ;; numeric values from url-coding-latin-1-as-hex-table.
            (r2 = r2 ,url-coding-latin-1-as-hex-table)
            (r3 = r3 ,url-coding-latin-1-as-hex-table)
            (r2 <<= 4)
            (r3 |= r2)
            (write r3))
         (if (r0 == ,url-coding-escaped-space-code)
             (write #x20)
           (write r0)))
       (read r0)
       (repeat))))
  "CCL program to take URI-encoded ASCII text and transform it to our
internal encoding.")
@end example
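
  To make the control flow easier to follow, here is a rough plain-Lisp
counterpart of the same logic.  It is only a sketch for comparison: the
function name @code{my-url-decode-sketch} is hypothetical, it handles
ASCII input only, and like the CCL program it does not validate the two
characters after a percentage sign.

@example
(defun my-url-decode-sketch (string)
  "Decode URL-encoded ASCII STRING; hypothetical, for illustration."
  (let ((i 0)
        (len (length string))
        (pieces nil))
    (while (< i len)
      (let ((ch (aref string i)))
        (cond
         ;; "%XY": turn the two hex digits into (X << 4) | Y.
         ((and (eq ch ?%) (< (+ i 2) len))
          (let ((hi (string-to-number
                     (substring string (+ i 1) (+ i 2)) 16))
                (lo (string-to-number
                     (substring string (+ i 2) (+ i 3)) 16)))
            (setq pieces (cons (format "%c" (logior (ash hi 4) lo))
                               pieces)
                  i (+ i 2))))
         ;; "+" stands for a space.
         ((eq ch ?+)
          (setq pieces (cons " " pieces)))
         ;; Anything else passes through untransformed.
         (t
          (setq pieces (cons (char-to-string ch) pieces)))))
      (setq i (1+ i)))
    (apply 'concat (nreverse pieces))))

(my-url-decode-sketch "XEmacs+home%20page")   ; => "XEmacs home page"
@end example

One difference from the CCL version: a percentage sign with fewer than
two characters after it is passed through literally here, whereas the
CCL program simply runs out of input at its @code{read}.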
2246
2247 @node The program to encode from internal format, The actual coding system, The program to decode to internal format, CCL Example
2248 @subsubsection The program to encode from internal format
2249
2250   Next, we see the CCL program to encode ASCII text as URL coded text.
2251 Here, the buffer magnification is specified as three, to account for ` '
2252 mapping to %20, etc.  As before, we read an octet from the input into
2253 @samp{r0}, and move into the body of the loop.  Next, we check if we
2254 should preserve the value of this octet, by reading from offset
2255 @samp{r0} in the @code{url-coding-should-preserve-table} into @samp{r1}.
2256 Then we have an @samp{if} statement predicated on the value in
2257 @samp{r1}; for the true branch, we write the input octet directly.  For
2258 the false branch, we write a percentage sign, the ASCII encoding of the
2259 high four bits in hex, and then the ASCII encoding of the low four bits
2260 in hex.
2261
2262 We then read an octet from the input into @samp{r0}, and repeat the loop.
2263
@example
(define-ccl-program ccl-encode-urlcoding
  `(3
    ((read r0)
     (loop
       (r1 = r0 ,url-coding-should-preserve-table)
       ;; If we should preserve the value, just write the octet directly.
       (if r1
           (write r0)
         ;; else, write a percentage sign, and the hex value of the octet, in
         ;; an ASCII-friendly format.
         ((write ,url-coding-escape-character-code)
          (write r0 ,url-coding-high-order-nybble-as-ascii)
          (write r0 ,url-coding-low-order-nybble-as-ascii)))
       (read r0)
       (repeat))))
  "CCL program to encode octets (almost) according to RFC 1738")
@end example
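
  For comparison, the same encoding rule can be sketched in plain Lisp.
The function name @code{my-url-encode-sketch} is hypothetical, and the
sketch assumes ASCII input.  It uses a regular expression where the CCL
program uses @code{url-coding-should-preserve-table}, and
@code{format} where the CCL program uses the two nybble tables.

@example
(defun my-url-encode-sketch (string)
  "Percent-encode ASCII STRING; hypothetical, for illustration."
  (mapconcat
   (lambda (ch)
     (let ((s (char-to-string ch)))
       (if (string-match "[-_.a-zA-Z0-9]" s)
           ;; Preserved characters are written as-is.
           s
         ;; Everything else becomes "%" plus two uppercase hex digits.
         ;; char-to-int is XEmacs-specific; where characters are
         ;; already integers, CH is used directly.
         (format "%%%02X"
                 (if (fboundp 'char-to-int) (char-to-int ch) ch)))))
   string ""))

(my-url-encode-sketch "home page")   ; => "home%20page"
@end example

Note that, like the CCL program, this writes a space as @samp{%20}
rather than as a plus sign.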
2282
2283 @node The actual coding system, , The program to encode from internal format, CCL Example
2284 @subsubsection The actual coding system
2285
To actually create the coding system, we call
@samp{make-coding-system}.  The first argument is the symbol that is to
be the name of the coding system, in our case @samp{url-coding}.  The
second specifies that the coding system is to be of type @samp{ccl};
there are several other coding system types available, and the
documentation for @samp{make-coding-system} gives the full list.  Then
there's a documentation string describing the wherefore and caveats of
the coding system, and the final argument is a property list giving
information about the CCL programs and the coding system's mnemonic.
2296
@example
(make-coding-system
 'url-coding 'ccl
 "The coding used by application/x-www-form-urlencoded HTTP applications.
This coding form doesn't specify anything about non-ASCII characters, so
make sure you've transformed to a seven-bit coding system first."
 '(decode ccl-decode-urlcoding
   encode ccl-encode-urlcoding
   mnemonic "URLenc"))
@end example
2307
If you're lucky, the @samp{url-coding} coding system described here
should be available in the XEmacs package system.  Otherwise, downloading
it from @samp{http://www.parhasard.net/url-coding.el} should work for
the foreseeable future.
2312
2313 @node Category Tables, , CCL, MULE
2314 @section Category Tables
2315
2316   A category table is a type of char table used for keeping track of
2317 categories.  Categories are used for classifying characters for use in
2318 regexps---you can refer to a category rather than having to use a
2319 complicated [] expression (and category lookups are significantly
2320 faster).
2321
2322   There are 95 different categories available, one for each printable
2323 character (including space) in the ASCII charset.  Each category is
2324 designated by one such character, called a @dfn{category designator}.
2325 They are specified in a regexp using the syntax @samp{\cX}, where X is a
2326 category designator. (This is not yet implemented.)
2327
2328   A category table specifies, for each character, the categories that
2329 the character is in.  Note that a character can be in more than one
2330 category.  More specifically, a category table maps from a character to
2331 either the value @code{nil} (meaning the character is in no categories)
2332 or a 95-element bit vector, specifying for each of the 95 categories
2333 whether the character is in that category.
2334
2335   Special Lisp functions are provided that abstract this, so you do not
2336 have to directly manipulate bit vectors.
2337
2338 @defun category-table-p object
2339 This function returns @code{t} if @var{object} is a category table.
2340 @end defun
2341
2342 @defun category-table &optional buffer
2343 This function returns the current category table.  This is the one
2344 specified by the current buffer, or by @var{buffer} if it is
2345 non-@code{nil}.
2346 @end defun
2347
2348 @defun standard-category-table
2349 This function returns the standard category table.  This is the one used
2350 for new buffers.
2351 @end defun
2352
2353 @defun copy-category-table &optional category-table
2354 This function returns a new category table which is a copy of
2355 @var{category-table}, which defaults to the standard category table.
2356 @end defun
2357
2358 @defun set-category-table category-table &optional buffer
2359 This function selects @var{category-table} as the new category table for
2360 @var{buffer}.  @var{buffer} defaults to the current buffer if omitted.
2361 @end defun
2362
2363 @defun category-designator-p object
2364 This function returns @code{t} if @var{object} is a category designator (a
2365 char in the range @samp{' '} to @samp{'~'}).
2366 @end defun
2367
2368 @defun category-table-value-p object
2369 This function returns @code{t} if @var{object} is a category table value.
2370 Valid values are @code{nil} or a bit vector of size 95.
2371 @end defun
2372