man/lispref/mule.texi

   1 @c -*-texinfo-*-
   2 @c This is part of the XEmacs Lisp Reference Manual.
   3 @c Copyright (C) 1996 Ben Wing.
   4 @c See the file lispref.texi for copying conditions.
   5 @setfilename ../../info/internationalization.info
   6 @node MULE, Tips, Internationalization, top
   7 @chapter MULE
   8
   9 @dfn{MULE} is the name originally given to the version of GNU Emacs
  10 extended for multi-lingual (and in particular Asian-language) support.
  11 ``MULE'' is short for ``MUlti-Lingual Emacs''.  It was originally called
  12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for
  13 ``Japan''), when it only provided support for Japanese.  XEmacs
  14 refers to its multi-lingual support as @dfn{MULE support} since it
  15 is based on @dfn{MULE}.
  16
  17 @menu
  18 * Internationalization Terminology::
  19                         Definition of various internationalization terms.
  20 * Charsets::            Sets of related characters.
  21 * MULE Characters::     Working with characters in XEmacs/MULE.
  22 * Composite Characters:: Making new characters by overstriking other ones.
  23 * ISO 2022::            An international standard for charsets and encodings.
  24 * Coding Systems::      Ways of representing a string of chars using integers.
  25 * CCL::                 A special language for writing fast converters.
  26 * Category Tables::     Subdividing charsets into groups.
  27 @end menu
  28
  29 @node Internationalization Terminology
  30 @section Internationalization Terminology
  31
  32    In internationalization terminology, a string of text is divided up
  33 into @dfn{characters}, which are the printable units that make up the
  34 text.  A single character is (for example) a capital @samp{A}, the
  35 number @samp{2}, a Katakana character, a Kanji ideograph (an
  36 @dfn{ideograph} is a ``picture'' character, such as is used in Japanese
  37 Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
  38 of such ideographs in each language), etc.  The basic property of a
  39 character is its shape.  Note that the same character may be drawn by
  40 two different people (or in two different fonts) in slightly different
  41 ways, although the basic shape will be the same.
  42
  43   In some cases, the differences will be significant enough that it is
  44 actually possible to identify two or more distinct shapes that both
  45 represent the same character.  For example, the lowercase letters
  46 @samp{a} and @samp{g} each have two distinct possible shapes -- the
  47 @samp{a} can optionally have a curved tail projecting off the top, and
  48 the @samp{g} can be formed either of two loops, or of one loop and a
  49 tail hanging off the bottom.  Such distinct possible shapes of a
  50 character are called @dfn{glyphs}.  The important characteristic of two
  51 glyphs making up the same character is that the choice between one or
  52 the other is purely stylistic and has no linguistic effect on a word
  53 (this is the reason why a capital @samp{A} and lowercase @samp{a}
  54 are different characters rather than different glyphs -- e.g.
  55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
  56
  57   Note that @dfn{character} and @dfn{glyph} are used differently
  58 here than elsewhere in XEmacs.
  59
  60   A @dfn{character set} is simply a set of related characters.  ASCII,
  61 for example, is a set of 94 characters (or 128, if you count
  62 non-printing characters).  Other character sets are ISO8859-1 (ASCII
  63 plus various accented characters and other international symbols),
  64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208
  65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji),
  66 GB2312 (Mainland Chinese Hanzi), etc.
  67
  68   Every character set has one or more @dfn{orderings}, which can be
  69 viewed as a way of assigning a number (or set of numbers) to each
  70 character in the set.  For most character sets, there is a standard
  71 ordering, and in fact all of the character sets mentioned above define a
  72 particular ordering.  ASCII, for example, places letters in their
  73 ``natural'' order, puts uppercase letters before lowercase letters,
  74 numbers before letters, etc.  Note that for many of the Asian character
  75 sets, there is no natural ordering of the characters.  The actual
  76 orderings are based on one or more salient characteristics, of which
  77 there are many to choose from -- e.g. number of strokes, common
  78 radicals, phonetic ordering, etc.
  79
  80   The numbers assigned to any particular character are called
  81 the character's @dfn{position codes}.  The number of position codes
  82 required to index a particular character in a character set is called
  83 the @dfn{dimension} of the character set.  ASCII, being a relatively
  84 small character set, is of dimension one, and each character in the
  85 set is indexed using a single position code, in the range 0 through
  86 127 (if non-printing characters are included) or 33 through 126
  87 (if only the printing characters are considered).  JISX0208, i.e.
  88 Japanese Kanji, has thousands of characters, and is of dimension two --
  89 every character is indexed by two position codes, each in the range
  90 33 through 126. (Note that the choice of the range here is somewhat
  91 arbitrary.  Although a character set such as JISX0208 defines an
  92 @emph{ordering} of all its characters, it does not define the actual
  93 mapping between numbers and characters.  You could just as easily
  94 index the characters in JISX0208 using numbers in the range 0 through
  95 93, 1 through 94, 2 through 95, etc.  The reason for the actual range
  96 chosen is so that the position codes match up with the actual values
  97 used in the common encodings.)
  98
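  The relationship between the alternative index ranges mentioned above
is plain arithmetic.  As a minimal sketch (the function name
@code{my-kuten-to-position-codes} is hypothetical and not part of
XEmacs), the following converts a row and cell numbered 1 through 94
into position codes in the range 33 through 126:

@example
;; Hypothetical helper, for illustration only.
;; ROW and CELL are indices in the range 1 through 94.
(defun my-kuten-to-position-codes (row cell)
  (list (+ row 32) (+ cell 32)))

(my-kuten-to-position-codes 4 2)
     @result{} (36 34)
@end example
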
  99   An @dfn{encoding} is a way of numerically representing characters from
 100 one or more character sets as a stream of like-sized numerical values
 101 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
 102 quantities.  If an encoding encompasses only one character set, then the
 103 position codes for the characters in that character set could be used
 104 directly. (This is the case with ASCII, and as a result, most people do
 105 not understand the difference between a character set and an encoding.)
 106 This is not possible, however, if more than one character set is to be
 107 used in the encoding.  For example, printed Japanese text typically
 108 requires characters from multiple character sets -- ASCII, JISX0208, and
 109 JISX0212, to be specific.  Each of these is indexed using one or more
 110 position codes in the range 33 through 126, so the position codes could
 111 not be used directly or there would be no way to tell which character
 112 was meant.  Different Japanese encodings handle this differently -- JIS
 113 uses special escape characters to denote different character sets; EUC
 114 sets the high bit of the position codes for JISX0208 and JISX0212, and
 115 puts a special extra byte before each JISX0212 character; etc. (JIS,
 116 EUC, and most of the other encodings you will encounter are 7-bit or
 117 8-bit encodings.  There is one common 16-bit encoding, which is Unicode;
 118 this strives to represent all the world's characters in a single large
 119 character set.  32-bit encodings are generally used internally in
 120 programs to simplify the code that manipulates them; however, they are
 121 not much used externally because they are not very space-efficient.)
 122
 123   Encodings are classified as either @dfn{modal} or @dfn{non-modal}.  In
 124 a @dfn{modal encoding}, there are multiple states that the encoding can be in,
 125 and the interpretation of the values in the stream depends on the
 126 current global state of the encoding.  Special values in the encoding,
 127 called @dfn{escape sequences}, are used to change the global state.
 128 JIS, for example, is a modal encoding.  The bytes @samp{ESC $ B}
 129 indicate that, from then on, bytes are to be interpreted as position
 130 codes for JISX0208, rather than as ASCII.  This effect is cancelled
 131 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
 132 current state is to ASCII''.  To switch to JISX0212, the escape sequence
 133 @samp{ESC $ ( D} is used. (Note that here, as is common, the escape sequences do
 134 in fact begin with @samp{ESC}.  This is not necessarily the case,
 135 however.)
 136
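  As a rough illustration of the escape sequences just described, the
following sketch wraps a list of JISX0208 position codes in the
sequences @samp{ESC $ B} and @samp{ESC ( B}, giving the bytes of a
JIS-encoded stream as a list of integers.  The function name is
hypothetical; in practice this conversion is the job of a coding
system, not of code like this.

@example
;; Hypothetical helper, for illustration only.
;; CODES is a flat list of JISX0208 position codes (33 through 126).
(defun my-jis-wrap-jisx0208 (codes)
  (append '(27 36 66)           ; ESC $ B: switch to JISX0208
          codes
          '(27 40 66)))         ; ESC ( B: switch back to ASCII

(my-jis-wrap-jisx0208 '(36 34))
     @result{} (27 36 66 36 34 27 40 66)
@end example
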
 137 A @dfn{non-modal encoding} has no global state that extends past the
 138 character currently being interpreted.  EUC, for example, is a
 139 non-modal encoding.  Characters in JISX0208 are encoded by setting
 140 the high bit of the position codes, and characters in JISX0212 are
 141 encoded by doing the same but also prefixing the character with the
 142 byte 0x8F.
 143
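  The EUC rule described above can be sketched in a few lines of Lisp.
The function name @code{my-euc-encode} and its optional argument are
hypothetical, chosen only for this illustration; the sketch handles
only JISX0208 and JISX0212 position codes.

@example
;; Hypothetical helper, for illustration only.
;; CODES is a list of two position codes (33 through 126).
;; If JISX0212 is non-nil, prefix the result with the byte 0x8F (143).
(defun my-euc-encode (codes &optional jisx0212)
  (let ((bytes (mapcar (lambda (code) (logior code 128)) codes)))
    (if jisx0212
        (cons 143 bytes)
      bytes)))

(my-euc-encode '(36 34))
     @result{} (164 162)
(my-euc-encode '(36 34) t)
     @result{} (143 164 162)
@end example

  Because every byte of a multi-byte character has its high bit set, a
byte below 128 can only be ASCII, whatever precedes it.
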
 144   The advantage of a modal encoding is that it is generally more
 145 space-efficient, and is easily extendable because there are essentially
 146 an arbitrary number of escape sequences that can be created.  The
 147 disadvantage, however, is that it is much more difficult to work with
 148 if it is not being processed in a sequential manner.  In the non-modal
 149 EUC encoding, for example, the byte 0x41 always refers to the letter
 150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
 151 one of the two position codes in a JISX0208 character, or one of the
 152 two position codes in a JISX0212 character.  Determining exactly which
 153 one is meant could be difficult and time-consuming if the previous
 154 bytes in the string have not already been processed.
 155
 156   Non-modal encodings are further divided into @dfn{fixed-width} and
 157 @dfn{variable-width} formats.  A fixed-width encoding always uses
 158 the same number of words per character, whereas a variable-width
 159 encoding does not.  EUC is a good example of a variable-width
 160 encoding: one to three bytes are used per character, depending on
 161 the character set.  16-bit and 32-bit encodings are nearly always
 162 fixed-width, and this is in fact one of the main reasons for using
 163 an encoding with a larger word size.  The advantages of fixed-width
 164 encodings should be obvious.  The advantages of variable-width
 165 encodings are that they are generally more space-efficient and allow
 166 for compatibility with existing 8-bit encodings such as ASCII.
 167
 168   Note that the bytes in an 8-bit encoding are often referred to
 169 as @dfn{octets} rather than simply as bytes.  This terminology
 170 dates back to the days before 8-bit bytes were universal, when
 171 some computers had 9-bit bytes, others had 10-bit bytes, etc.
 172
 173 @node Charsets
 174 @section Charsets
 175
 176   A @dfn{charset} in MULE is an object that encapsulates a
 177 particular character set as well as an ordering of those characters.
 178 Charsets are permanent objects and are named using symbols, like
 179 faces.
 180
 181 @defun charsetp object
 182 This function returns non-@code{nil} if @var{object} is a charset.
 183 @end defun
 184
 185 @menu
 186 * Charset Properties::          Properties of a charset.
 187 * Basic Charset Functions::     Functions for working with charsets.
 188 * Charset Property Functions::  Functions for accessing charset properties.
 189 * Predefined Charsets::         Predefined charset objects.
 190 @end menu
 191
 192 @node Charset Properties
 193 @subsection Charset Properties
 194
 195   Charsets have the following properties:
 196
 197 @table @code
 198 @item name
 199 A symbol naming the charset.  Every charset must have a different name;
 200 this allows a charset to be referred to using its name rather than
 201 the actual charset object.
 202 @item doc-string
 203 A documentation string describing the charset.
 204 @item registry
 205 A regular expression matching the font registry field for this character
 206 set.  For example, both the @code{ascii} and @code{latin-iso8859-1}
 207 charsets use the registry @code{"ISO8859-1"}.  This field is used to
 208 choose an appropriate font when the user gives a general font
 209 specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a
 210 14-point upright medium-weight Courier font.
 211 @item dimension
 212 Number of position codes used to index a character in the character set.
 213 XEmacs/MULE can only handle character sets of dimension 1 or 2.
 214 This property defaults to 1.
 215 @item chars
 216 Number of characters in each dimension.  In XEmacs/MULE, the only
 217 allowed values are 94 or 96. (There are a couple of pre-defined
 218 character sets, such as ASCII, that do not follow this, but you cannot
 219 define new ones like this.) Defaults to 94.  Note that if the dimension
 220 is 2, the character set thus described is 94x94 or 96x96.
 221 @item columns
 222 Number of columns used to display a character in this charset.
 223 Only used in TTY mode. (Under X, the actual width of a character
 224 can be derived from the font used to display the characters.)
 225 If unspecified, defaults to the dimension. (This is almost
 226 always the correct value, because character sets with dimension 2
 227 are usually ideograph character sets, which need two columns to
 228 display the intricate ideographs.)
 229 @item direction
 230 A symbol, either @code{l2r} (left-to-right) or @code{r2l}
 231 (right-to-left).  Defaults to @code{l2r}.  This specifies the
 232 direction that the text should be displayed in, and will be
 233 left-to-right for most charsets but right-to-left for Hebrew
 234 and Arabic. (Right-to-left display is not currently implemented.)
 235 @item final
 236 Final byte of the standard ISO 2022 escape sequence designating this
 237 charset.  Must be supplied.  Each combination of (@var{dimension},
 238 @var{chars}) defines a separate namespace for final bytes, and each
 239 charset within a particular namespace must have a different final byte.
 240 Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
 241 dimension == 1, and 0x30 - 0x5F if dimension == 2.  Note also that final
 242 bytes in the range 0x30 - 0x3F are reserved for user-defined (not
 243 official) character sets.  For more information on ISO 2022, see
 244 @ref{ISO 2022}.
 245 @item graphic
 246 0 (use left half of font on output) or 1 (use right half of font on
 247 output).  Defaults to 0.  This specifies how to convert the position
 248 codes that index a character in a character set into an index into the
 249 font used to display the character set.  With @code{graphic} set to 0,
 250 position codes 33 through 126 map to font indices 33 through 126; with
 251 it set to 1, position codes 33 through 126 map to font indices 161
 252 through 254 (i.e. the same number but with the high bit set).  For
 253 example, for a font whose registry is ISO8859-1, the left half of the
 254 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right
 255 half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset.
 256 @item ccl-program
 257 A compiled CCL program used to convert a character in this charset into
 258 an index into the font.  This is in addition to the @code{graphic}
 259 property.  If a CCL program is defined, the position codes of a
 260 character will first be processed according to @code{graphic} and
 261 then passed through the CCL program, with the resulting values used
 262 to index the font.
 263
 264 This is used, for example, in the Big5 character set (used in Taiwan).
 265 This character set is not ISO-2022-compliant, and its size (94x157) does
 266 not fit within the maximum 96x96 size of ISO-2022-compliant character
 267 sets.  As a result, XEmacs/MULE splits it (in a rather complex fashion,
 268 so as to group the most commonly used characters together) into two
 269 charset objects (@code{chinese-big5-1} and @code{chinese-big5-2}), each of size 94x94,
 270 and each charset object uses a CCL program to convert the modified
 271 position codes back into standard Big5 indices to retrieve a character
 272 from a Big5 font.
 273 @end table
 274
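The effect of the @code{graphic} property on font indexing, as described
in the table above, amounts to optionally setting the high bit.  The
following is a minimal sketch under that description; the function name
is hypothetical, and it ignores any @code{ccl-program} that would be
applied afterwards.

@example
;; Hypothetical helper, for illustration only.
;; CODE is a position code; GRAPHIC is 0 (left half) or 1 (right half).
(defun my-font-index (code graphic)
  (+ code (* 128 graphic)))

(my-font-index 33 0)
     @result{} 33
(my-font-index 33 1)
     @result{} 161
(my-font-index 126 1)
     @result{} 254
@end example
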
 275 Most of the above properties can only be changed when the charset
 276 is created.  @xref{Charset Property Functions}.
 277
 278 @node Basic Charset Functions
 279 @subsection Basic Charset Functions
 280
 281 @defun find-charset charset-or-name
 282 This function retrieves the charset of the given name.  If
 283 @var{charset-or-name} is a charset object, it is simply returned.
 284 Otherwise, @var{charset-or-name} should be a symbol.  If there is no
 285 such charset, @code{nil} is returned.  Otherwise the associated charset
 286 object is returned.
 287 @end defun
 288
 289 @defun get-charset name
 290 This function retrieves the charset of the given name.  Same as
 291 @code{find-charset} except an error is signalled if there is no such
 292 charset instead of returning @code{nil}.
 293 @end defun
 294
 295 @defun charset-list
 296 This function returns a list of the names of all defined charsets.
 297 @end defun
 298
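The following sketch contrasts @code{find-charset} with
@code{get-charset}.  It assumes that no charset named
@code{no-such-charset} has been defined.

@example
;; Returns the charset object, or nil if there is no such charset.
(find-charset 'japanese-jisx0208)
(find-charset 'no-such-charset)
     @result{} nil

;; get-charset signals an error instead, so guard the call if the
;; name might not be defined.
(condition-case nil
    (get-charset 'no-such-charset)
  (error nil))
     @result{} nil

;; Non-nil if a charset named `ascii' is among the defined charsets.
(memq 'ascii (charset-list))
@end example
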
 299 @defun make-charset name doc-string props
 300 This function defines a new character set.  This function is for use
 301 with Mule support.  @var{name} is a symbol, the name by which the
 302 character set is normally referred.  @var{doc-string} is a string
 303 describing the character set.  @var{props} is a property list,
 304 describing the specific nature of the character set.  The recognized
 305 properties are @code{registry}, @code{dimension}, @code{columns},
 306 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and
 307 @code{ccl-program}, as previously described.
 308 @end defun
 309
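As a hedged sketch of a call to @code{make-charset}, the following
defines a private 94-character charset of dimension 1.  The charset
name, doc string, and registry are invented for this example.  The
final byte @samp{9} (0x39) is chosen from the range reserved for
user-defined charsets and is assumed not to be in use already by
another 94-character charset of dimension 1.

@example
;; All names and values here are illustrative assumptions.
(make-charset 'my-private-charset
              "A private 94-character charset, for illustration only."
              '(registry "MyRegistry-0"
                dimension 1
                chars 94
                columns 1
                final ?9
                graphic 0
                direction l2r))

(charset-dimension 'my-private-charset)
     @result{} 1
@end example
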
 310 @defun make-reverse-direction-charset charset new-name
 311 This function makes a charset equivalent to @var{charset} but which goes
 312 in the opposite direction.  @var{new-name} is the name of the new
 313 charset.  The new charset is returned.
 314 @end defun
 315
 316 @defun charset-from-attributes dimension chars final &optional direction
 317 This function returns a charset with the given @var{dimension},
 318 @var{chars}, @var{final}, and @var{direction}.  If @var{direction} is
 319 omitted, both directions will be checked (left-to-right will be returned
 320 if character sets exist for both directions).
 321 @end defun
 322
 323 @defun charset-reverse-direction-charset charset
 324 This function returns the charset (if any) with the same dimension,
 325 number of characters, and final byte as @var{charset}, but which is
 326 displayed in the opposite direction.
 327 @end defun
 328
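For example, going by the table of predefined charsets
(@pxref{Predefined Charsets}), the charset of dimension 2 with 94
characters per dimension and final byte @samp{B} is
@code{japanese-jisx0208}.  A lookup along these lines should find it:

@example
;; Dimension 2, 94 characters per dimension, final byte `B'.
(charset-name (charset-from-attributes 2 94 ?B))
     @result{} japanese-jisx0208

;; Restrict the search to right-to-left charsets.
(charset-from-attributes 1 94 ?B 'r2l)
@end example
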
 329 @node Charset Property Functions
 330 @subsection Charset Property Functions
 331
 332 All of these functions accept either a charset name or charset object.
 333
 334 @defun charset-property charset prop
 335 This function returns property @var{prop} of @var{charset}.
 336 @xref{Charset Properties}.
 337 @end defun
 338
 339 Convenience functions are also provided for retrieving individual
 340 properties of a charset.
 341
 342 @defun charset-name charset
 343 This function returns the name of @var{charset}.  This will be a symbol.
 344 @end defun
 345
 346 @defun charset-doc-string charset
 347 This function returns the doc string of @var{charset}.
 348 @end defun
 349
 350 @defun charset-registry charset
 351 This function returns the registry of @var{charset}.
 352 @end defun
 353
 354 @defun charset-dimension charset
 355 This function returns the dimension of @var{charset}.
 356 @end defun
 357
 358 @defun charset-chars charset
 359 This function returns the number of characters per dimension of
 360 @var{charset}.
 361 @end defun
 362
 363 @defun charset-columns charset
 364 This function returns the number of display columns per character (in
 365 TTY mode) of @var{charset}.
 366 @end defun
 367
 368 @defun charset-direction charset
 369 This function returns the display direction of @var{charset} -- either
 370 @code{l2r} or @code{r2l}.
 371 @end defun
 372
 373 @defun charset-final charset
 374 This function returns the final byte of the ISO 2022 escape sequence
 375 designating @var{charset}.
 376 @end defun
 377
 378 @defun charset-graphic charset
 379 This function returns either 0 or 1, depending on whether the position
 380 codes of characters in @var{charset} map to the left or right half
 381 of their font, respectively.
 382 @end defun
 383
 384 @defun charset-ccl-program charset
 385 This function returns the CCL program, if any, for converting
 386 position codes of characters in @var{charset} into font indices.
 387 @end defun
 388
 389 The only property of a charset that can currently be set after
 390 the charset has been created is the CCL program.
 391
 392 @defun set-charset-ccl-program charset ccl-program
 393 This function sets the @code{ccl-program} property of @var{charset} to
 394 @var{ccl-program}.
 395 @end defun
 396
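As an illustration, the following calls query a predefined charset by
name.  The values shown are those listed for @code{japanese-jisx0208}
in the table of predefined charsets (@pxref{Predefined Charsets}).

@example
;; The general accessor and the convenience function agree.
(charset-property 'japanese-jisx0208 'dimension)
     @result{} 2
(charset-dimension 'japanese-jisx0208)
     @result{} 2

(charset-chars 'japanese-jisx0208)
     @result{} 94
(charset-final 'japanese-jisx0208)
     @result{} ?B
(charset-graphic 'japanese-jisx0208)
     @result{} 0
(charset-direction 'japanese-jisx0208)
     @result{} l2r
@end example
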
 397 @node Predefined Charsets
 398 @subsection Predefined Charsets
 399
 400 The following charsets are predefined in the C code.
 401
 402 @example
 403 Name                     Type  Fi Gr Dir Registry
 404 --------------------------------------------------------------
 405 ascii                    94    B  0  l2r ISO8859-1
 406 control-1                94       0  l2r ---
 407 latin-iso8859-1          96    A  1  l2r ISO8859-1
 408 latin-iso8859-2          96    B  1  l2r ISO8859-2
 409 latin-iso8859-3          96    C  1  l2r ISO8859-3
 410 latin-iso8859-4          96    D  1  l2r ISO8859-4
 411 cyrillic-iso8859-5       96    L  1  l2r ISO8859-5
 412 arabic-iso8859-6         96    G  1  r2l ISO8859-6
 413 greek-iso8859-7          96    F  1  l2r ISO8859-7
 414 hebrew-iso8859-8         96    H  1  r2l ISO8859-8
 415 latin-iso8859-9          96    M  1  l2r ISO8859-9
 416 thai-tis620              96    T  1  l2r TIS620
 417 katakana-jisx0201        94    I  1  l2r JISX0201.1976
 418 latin-jisx0201           94    J  0  l2r JISX0201.1976
 419 japanese-jisx0208-1978   94x94 @@  0  l2r JISX0208.1978
 420 japanese-jisx0208        94x94 B  0  l2r JISX0208.19(83|90)
 421 japanese-jisx0212        94x94 D  0  l2r JISX0212
 422 chinese-gb2312           94x94 A  0  l2r GB2312
 423 chinese-cns11643-1       94x94 G  0  l2r CNS11643.1
 424 chinese-cns11643-2       94x94 H  0  l2r CNS11643.2
 425 chinese-big5-1           94x94 0  0  l2r Big5
 426 chinese-big5-2           94x94 1  0  l2r Big5
 427 korean-ksc5601           94x94 C  0  l2r KSC5601
 428 composite                96x96    0  l2r ---
 429 @end example
 430
 431 The following charsets are predefined in the Lisp code.
 432
 433 @example
 434 Name                     Type  Fi Gr Dir Registry
 435 --------------------------------------------------------------
 436 arabic-digit             94    2  0  l2r MuleArabic-0
 437 arabic-1-column          94    3  0  r2l MuleArabic-1
 438 arabic-2-column          94    4  0  r2l MuleArabic-2
 439 sisheng                  94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
 440 chinese-cns11643-3       94x94 I  0  l2r CNS11643.1
 441 chinese-cns11643-4       94x94 J  0  l2r CNS11643.1
 442 chinese-cns11643-5       94x94 K  0  l2r CNS11643.1
 443 chinese-cns11643-6       94x94 L  0  l2r CNS11643.1
 444 chinese-cns11643-7       94x94 M  0  l2r CNS11643.1
 445 ethiopic                 94x94 2  0  l2r Ethio
 446 ascii-r2l                94    B  0  r2l ISO8859-1
 447 ipa                      96    0  1  l2r MuleIPA
 448 vietnamese-lower         96    1  1  l2r VISCII1.1
 449 vietnamese-upper         96    2  1  l2r VISCII1.1
 450 @end example
 451
 452 For all of the above charsets, the dimension and number of columns are
 453 the same.
 454
 455 Note that ASCII, Control-1, and Composite are handled specially.
 456 This is why some of the fields are blank; and some of the filled-in
 457 fields (e.g. the type) are not really accurate.
 458
 459 @node MULE Characters
 460 @section MULE Characters
 461
 462 @defun make-char charset arg1 &optional arg2
 463 This function makes a multi-byte character from @var{charset} and octets
 464 @var{arg1} and @var{arg2}.
 465 @end defun
 466
 467 @defun char-charset ch
 468 This function returns the character set of char @var{ch}.
 469 @end defun
 470
 471 @defun char-octet ch &optional n
 472 This function returns the octet (i.e. position code) numbered @var{n}
 473 (should be 0 or 1) of char @var{ch}.  @var{n} defaults to 0 if omitted.
 474 @end defun
 475
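The following sketch builds a character from a charset and two position
codes and then takes it apart again.  The position codes 36 and 34
(0x24 and 0x22) index the Hiragana letter A in JISX0208.  Going by the
descriptions above, the result should be as shown.

@example
(let ((ch (make-char 'japanese-jisx0208 36 34)))
  (list (charset-name (char-charset ch))
        (char-octet ch 0)
        (char-octet ch 1)))
     @result{} (japanese-jisx0208 36 34)
@end example
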
 476 @defun find-charset-region start end &optional buffer
 477 This function returns a list of the charsets in the region between
 478 @var{start} and @var{end}.  @var{buffer} defaults to the current buffer
 479 if omitted.
 480 @end defun
 481
 482 @defun find-charset-string string
 483 This function returns a list of the charsets in @var{string}.
 484 @end defun
 485
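A brief usage sketch of these two functions follows.  No results are
shown, because they depend on the text examined.

@example
;; Charsets used by the characters of a string.
(find-charset-string "hello")

;; Charsets used anywhere in the accessible part of the current buffer.
(find-charset-region (point-min) (point-max))
@end example
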
 486 @node Composite Characters
 487 @section Composite Characters
 488
 489 Composite characters are not yet completely implemented.
 490
 491 @defun make-composite-char string
 492 This function converts a string into a single composite character.  The
 493 character is the result of overstriking all the characters in the
 494 string.
 495 @end defun
 496
 497 @defun composite-char-string ch
 498 This function returns a string of the characters comprising a composite
 499 character.
 500 @end defun
 501
 502 @defun compose-region start end &optional buffer
 503 This function composes the characters in the region from @var{start} to
 504 @var{end} in @var{buffer} into one composite character.  The composite
 505 character replaces the composed characters.  @var{buffer} defaults to
 506 the current buffer if omitted.
 507 @end defun
 508
 509 @defun decompose-region start end &optional buffer
 510 This function decomposes any composite characters in the region from
 511 @var{start} to @var{end} in @var{buffer}.  This converts each composite
 512 character into one or more characters, the individual characters out of
 513 which the composite character was formed.  Non-composite characters are
 514 left as-is.  @var{buffer} defaults to the current buffer if omitted.
 515 @end defun
 516
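The following is only a sketch of how these functions are meant to fit
together.  Since composite characters are not yet completely
implemented, the actual behavior may differ.

@example
;; Overstrike two characters, then recover the component string.
(let ((ch (make-composite-char "a~")))
  (composite-char-string ch))

;; Compose the two characters after point into one composite
;; character, then split it back into its components.
(let ((start (point)))
  (compose-region start (+ start 2))
  (decompose-region start (1+ start)))
@end example
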
 517 @node ISO 2022
 518 @section ISO 2022
 519
 520 This section briefly describes the ISO 2022 encoding standard.  For a
 521 more thorough understanding, please refer to the text of the ISO 2022
 522 standard itself.
 523
 524 Character sets (@dfn{charsets}) are classified into the following four
 525 categories, according to the number of characters in the charset:
 526 94-charset, 96-charset, 94x94-charset, and 96x96-charset.
 527
 528 @need 1000
 529 @table @asis
 530 @item 94-charset
 531  ASCII(B), left(J) and right(I) half of JISX0201, ...
 532 @item 96-charset
 533  Latin-1(A), Latin-2(B), Latin-3(C), ...
 534 @item 94x94-charset
 535  GB2312(A), JISX0208(B), KSC5601(C), ...
 536 @item 96x96-charset
 537  none for the moment
 538 @end table
 539
 540 The character in parentheses after the name of each charset
 541 is the @dfn{final character} @var{F}, which can be regarded as
 542 the identifier of the charset.  ECMA allocates @var{F} to each
 543 charset.  @var{F} is in the range of 0x30..0x7E, but 0x30..0x3F
 544 are only for private use.
 545
 546 Note: @dfn{ECMA} = European Computer Manufacturers Association
 547
 548 There are four @dfn{registers of charsets}, called G0 through G3.
 549 You can designate (or assign) any charset to one of these
 550 registers.
 551
 552 The code space contained within one octet (of size 256) is divided into
 553 4 areas: C0, GL, C1, and GR.  GL and GR are the areas into which a
 554 charset register can be invoked.
 555
 556 @example
 557 @group
 558         C0: 0x00 - 0x1F
 559         GL: 0x20 - 0x7F
 560         C1: 0x80 - 0x9F
 561         GR: 0xA0 - 0xFF
 562 @end group
 563 @end example
 564
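The division of the code space shown above can be expressed as a small
classification function.  This is a sketch only; the function name is
hypothetical, and @var{octet} is assumed to be an integer in the range
0 through 255.

@example
;; Hypothetical helper, for illustration only.
(defun my-iso2022-area (octet)
  (cond ((< octet 32)  'C0)     ; 0x00 - 0x1F
        ((< octet 128) 'GL)     ; 0x20 - 0x7F
        ((< octet 160) 'C1)     ; 0x80 - 0x9F
        (t             'GR)))   ; 0xA0 - 0xFF

(mapcar 'my-iso2022-area '(27 65 143 164))
     @result{} (C0 GL C1 GR)
@end example
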
 565 Usually, in the initial state, G0 is invoked into GL, and G1
 566 is invoked into GR.
 567
 568 ISO 2022 distinguishes 7-bit environments and 8-bit environments.  In
 569 7-bit environments, only C0 and GL are used.
 570
 571 Charset designation is done by escape sequences of the form:
 572
 573 @example
 574         ESC [@var{I}] @var{I} @var{F}
 575 @end example
 576
 577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
 578 @var{F} is the final character identifying this charset.
 579
 580 The meanings of the intermediate characters are:
 581
 582 @example
 583 @group
 584         $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
 585         ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
 586         ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
 587         * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
 588         + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
 589         - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
 590         . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
 591         / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
 592 @end group
 593 @end example
 594
 595 The following rule is not allowed in ISO 2022 but can be used in Mule.
 596
 597 @example
 598         , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
 599 @end example
 600
 601 Here are examples of designations:
 602
 603 @example
 604 @group
 605         ESC ( B :              designate to G0 ASCII
 606         ESC - A :              designate to G1 Latin-1
 607         ESC $ ( A or ESC $ A : designate to G0 GB2312
 608         ESC $ ( B or ESC $ B : designate to G0 JISX0208
 609         ESC $ ) C :            designate to G1 KSC5601
 610 @end group
 611 @end example
 612
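The designation rules above are regular enough to be generated
mechanically.  The following sketch (the function name and argument
order are hypothetical) returns the bytes of a designation sequence as
a list of integers.  It always produces the long form for
two-dimensional charsets, never the short forms such as
@samp{ESC $ B}.  For a 96-charset designated to G0 it produces the
Mule-only intermediate @samp{,}.

@example
;; Hypothetical helper, for illustration only.
;; REGISTER is 0 through 3 (G0 through G3), CHARS is 94 or 96,
;; DIMENSION is 1 or 2, and FINAL is the final byte as an integer.
(defun my-iso2022-designation (register chars dimension final)
  (let ((intermediate (+ register (if (= chars 94) 40 44))))
    (append '(27)                           ; ESC
            (if (= dimension 2) '(36))      ; $ for dimension 2
            (list intermediate final))))

(my-iso2022-designation 0 94 1 66)          ; ESC ( B
     @result{} (27 40 66)
(my-iso2022-designation 1 96 1 65)          ; ESC - A
     @result{} (27 45 65)
(my-iso2022-designation 1 94 2 67)          ; ESC $ ) C
     @result{} (27 36 41 67)
@end example
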
 613 To use a charset designated to G2 or G3, and to use a charset designated
 614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
 615 into GL.  There are two types of invocation, Locking Shift (forever) and
 616 Single Shift (one character only).
 617
 618 Locking Shift is done as follows:
 619
 620 @example
 621         LS0 or SI (0x0F): invoke G0 into GL
 622         LS1 or SO (0x0E): invoke G1 into GL
 623         LS2:  invoke G2 into GL
 624         LS3:  invoke G3 into GL
 625         LS1R: invoke G1 into GR
 626         LS2R: invoke G2 into GR
 627         LS3R: invoke G3 into GR
 628 @end example
 629
 630 Single Shift is done as follows:
 631
 632 @example
 633 @group
 634         SS2 or ESC N: invoke G2 into GL
 635         SS3 or ESC O: invoke G3 into GL
 636 @end group
 637 @end example
 638
 639 (#### Ben says: I think the above is slightly incorrect.  It appears that
 640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
 641 ESC O behave as indicated.  The above definitions will not parse
 642 EUC-encoded text correctly, and it looks like the code in mule-coding.c
 643 has similar problems.)
 644
 645 As you may have realized, there are many ISO-2022-compliant ways of
 646 encoding multilingual text.  Many coding systems in actual use, such as
 647 X11's Compound Text, Japanese JUNET code, and so-called EUC (Extended
 648 UNIX Code), are variants of ISO 2022.
 649
 650 In Mule, we characterize ISO 2022 by the following attributes:
 651
 652 @enumerate
 653 @item
 654 Initial designation to G0 through G3.
 655 @item
 656 Allow the short form of designation for Japanese and Chinese.
 657 @item
 658 Should we designate ASCII to G0 before control characters?
 659 @item
 660 Should we designate ASCII to G0 at the end of line?
 661 @item
 662 7-bit environment or 8-bit environment.
 663 @item
 664 Use Locking Shift or not.
 665 @item
 666 Use ASCII or JISX0201-1976-Roman.
 667 @item
 668 Use JISX0208-1983 or JISX0208-1978.
 669 @end enumerate
 670
 671 (The last two are only for Japanese.)
 672
 673 By specifying these attributes, you can create any variant
 674 of ISO 2022.
 675
 676 Here are several examples:
 677
 678 @example
 679 @group
 680 junet -- Coding system used in JUNET.
 681         1. G0 <- ASCII, G1..3 <- never used
 682         2. Yes.
 683         3. Yes.
 684         4. Yes.
 685         5. 7-bit environment
 686         6. No.
 687         7. Use ASCII
 688         8. Use JISX0208-1983
 689 @end group
 690
 691 @group
 692 ctext -- Compound Text
 693         1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
 694         2. No.
 695         3. No.
 696         4. Yes.
 697         5. 8-bit environment
 698         6. No.
 699         7. Use ASCII
 700         8. Use JISX0208-1983
 701 @end group
 702
 703 @group
euc-china -- Chinese EUC.  Many people call this
"GB encoding", but that name may cause misunderstanding.
 706         1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
 707         2. No.
 708         3. Yes.
 709         4. Yes.
 710         5. 8-bit environment
 711         6. No.
 712         7. Use ASCII
 713         8. Use JISX0208-1983
 714 @end group
 715
 716 @group
 717 korean-mail -- Coding system used in Korean network.
 718         1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
 719         2. No.
 720         3. Yes.
 721         4. Yes.
 722         5. 7-bit environment
 723         6. Yes.
 724         7. No.
 725         8. No.
 726 @end group
 727 @end example
 728
 729 Mule creates all these coding systems by default.
 730
 731 @node Coding Systems
 732 @section Coding Systems
 733
 734 A coding system is an object that defines how text containing multiple
 735 character sets is encoded into a stream of (typically 8-bit) bytes.  The
 736 coding system is used to decode the stream into a series of characters
 737 (which may be from multiple charsets) when the text is read from a file
 738 or process, and is used to encode the text back into the same format
 739 when it is written out to a file or process.
 740
 741 For example, many ISO-2022-compliant coding systems (such as Compound
 742 Text, which is used for inter-client data under the X Window System) use
 743 escape sequences to switch between different charsets -- Japanese Kanji,
 744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
 745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}.  See
 746 @code{make-coding-system} for more information.
 747
 748 Coding systems are normally identified using a symbol, and the symbol is
 749 accepted in place of the actual coding system object whenever a coding
 750 system is called for. (This is similar to how faces and charsets work.)
 751
 752 @defun coding-system-p object
 753 This function returns non-@code{nil} if @var{object} is a coding system.
 754 @end defun
 755
 756 @menu
 757 * Coding System Types::               Classifying coding systems.
 758 * EOL Conversion::                    Dealing with different ways of denoting
 759                                         the end of a line.
 760 * Coding System Properties::          Properties of a coding system.
 761 * Basic Coding System Functions::     Working with coding systems.
 762 * Coding System Property Functions::  Retrieving a coding system's properties.
 763 * Encoding and Decoding Text::        Encoding and decoding text.
 764 * Detection of Textual Encoding::     Determining how text is encoded.
 765 * Big5 and Shift-JIS Functions::      Special functions for these non-standard
 766                                         encodings.
 767 @end menu
 768
 769 @node Coding System Types
 770 @subsection Coding System Types
 771
 772 @table @code
 773 @item nil
 774 @itemx autodetect
 775 Automatic conversion.  XEmacs attempts to detect the coding system used
 776 in the file.
 777 @item no-conversion
 778 No conversion.  Use this for binary files and such.  On output, graphic
 779 characters that are not in ASCII or Latin-1 will be replaced by a
 780 @samp{?}. (For a no-conversion-encoded buffer, these characters will
 781 only be present if you explicitly insert them.)
 782 @item shift-jis
 783 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
 784 @item iso2022
 785 Any ISO-2022-compliant encoding.  Among other things, this includes JIS
 786 (the Japanese encoding commonly used for e-mail), national variants of
 787 EUC (the standard Unix encoding for Japanese and other languages), and
 788 Compound Text (an encoding used in X11).  You can specify more specific
 789 information about the conversion with the @var{flags} argument.
 790 @item big5
Big5 (the encoding commonly used for Traditional Chinese, for example in
Taiwan).
 792 @item ccl
 793 The conversion is performed using a user-written pseudo-code program.
 794 CCL (Code Conversion Language) is the name of this pseudo-code.
 795 @item internal
 796 Write out or read in the raw contents of the memory representing the
 797 buffer's text.  This is primarily useful for debugging purposes, and is
 798 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
 799 (the @samp{--debug} configure option).  @strong{Warning}: Reading in a
 800 file using @code{internal} conversion can result in an internal
 801 inconsistency in the memory representing a buffer's text, which will
 802 produce unpredictable results and may cause XEmacs to crash.  Under
 803 normal circumstances you should never use @code{internal} conversion.
 804 @end table
 805
 806 @node EOL Conversion
 807 @subsection EOL Conversion
 808
 809 @table @code
 810 @item nil
 811 Automatically detect the end-of-line type (LF, CRLF, or CR).  Also
 812 generate subsidiary coding systems named @code{@var{name}-unix},
 813 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
 814 this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf},
 815 and @code{cr}, respectively.
 816 @item lf
 817 The end of a line is marked externally using ASCII LF.  Since this is
 818 also the way that XEmacs represents an end-of-line internally,
 819 specifying this option results in no end-of-line conversion.  This is
 820 the standard format for Unix text files.
 821 @item crlf
 822 The end of a line is marked externally using ASCII CRLF.  This is the
 823 standard format for MS-DOS text files.
 824 @item cr
 825 The end of a line is marked externally using ASCII CR.  This is the
 826 standard format for Macintosh text files.
 827 @item t
 828 Automatically detect the end-of-line type but do not generate subsidiary
 829 coding systems.  (This value is converted to @code{nil} when stored
 830 internally, and @code{coding-system-property} will return @code{nil}.)
 831 @end table
 832
 833 @node Coding System Properties
 834 @subsection Coding System Properties
 835
 836 @table @code
 837 @item mnemonic
 838 String to be displayed in the modeline when this coding system is
 839 active.
 840
 841 @item eol-type
 842 End-of-line conversion to be used.  It should be one of the types
 843 listed in @ref{EOL Conversion}.
 844
 845 @item post-read-conversion
 846 Function called after a file has been read in, to perform the decoding.
 847 Called with two arguments, @var{beg} and @var{end}, denoting a region of
 848 the current buffer to be decoded.
 849
 850 @item pre-write-conversion
 851 Function called before a file is written out, to perform the encoding.
 852 Called with two arguments, @var{beg} and @var{end}, denoting a region of
 853 the current buffer to be encoded.
 854 @end table
 855
 856 The following additional properties are recognized if @var{type} is
 857 @code{iso2022}:
 858
 859 @table @code
 860 @item charset-g0
 861 @itemx charset-g1
 862 @itemx charset-g2
 863 @itemx charset-g3
 864 The character set initially designated to the G0 - G3 registers.
 865 The value should be one of
 866
 867 @itemize @bullet
 868 @item
 869 A charset object (designate that character set)
 870 @item
 871 @code{nil} (do not ever use this register)
 872 @item
 873 @code{t} (no character set is initially designated to the register, but
 874 may be later on; this automatically sets the corresponding
 875 @code{force-g*-on-output} property)
 876 @end itemize
 877
 878 @item force-g0-on-output
 879 @itemx force-g1-on-output
 880 @itemx force-g2-on-output
 881 @itemx force-g3-on-output
 882 If non-@code{nil}, send an explicit designation sequence on output
 883 before using the specified register.
 884
 885 @item short
 886 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
 887 and @samp{ESC $ B} on output in place of the full designation sequences
 888 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
 889
 890 @item no-ascii-eol
 891 If non-@code{nil}, don't designate ASCII to G0 at each end of line on
 892 output.  Setting this to non-@code{nil} also suppresses other
 893 state-resetting that normally happens at the end of a line.
 894
 895 @item no-ascii-cntl
 896 If non-@code{nil}, don't designate ASCII to G0 before control chars on
 897 output.
 898
 899 @item seven
 900 If non-@code{nil}, use 7-bit environment on output.  Otherwise, use 8-bit
 901 environment.
 902
 903 @item lock-shift
 904 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
 905 designation by escape sequence.
 906
 907 @item no-iso6429
 908 If non-@code{nil}, don't use ISO6429's direction specification.
 909
 910 @item escape-quoted
If non-@code{nil}, literal control characters that are the same as the
 912 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
 913 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
 914 and CSI (0x9B)) are ``quoted'' with an escape character so that they can
 915 be properly distinguished from an escape sequence.  (Note that doing
 916 this results in a non-portable encoding.) This encoding flag is used for
 917 byte-compiled files.  Note that ESC is a good choice for a quoting
 918 character because there are no escape sequences whose second byte is a
 919 character from the Control-0 or Control-1 character sets; this is
 920 explicitly disallowed by the ISO 2022 standard.
 921
 922 @item input-charset-conversion
 923 A list of conversion specifications, specifying conversion of characters
 924 in one charset to another when decoding is performed.  Each
 925 specification is a list of two elements: the source charset, and the
 926 destination charset.
 927
 928 @item output-charset-conversion
 929 A list of conversion specifications, specifying conversion of characters
 930 in one charset to another when encoding is performed.  The form of each
 931 specification is the same as for @code{input-charset-conversion}.
 932 @end table
 933
 934 The following additional properties are recognized (and required) if
 935 @var{type} is @code{ccl}:
 936
 937 @table @code
 938 @item decode
 939 CCL program used for decoding (converting to internal format).
 940
 941 @item encode
 942 CCL program used for encoding (converting to external format).
 943 @end table
 944
 945 @node Basic Coding System Functions
 946 @subsection Basic Coding System Functions
 947
 948 @defun find-coding-system coding-system-or-name
 949 This function retrieves the coding system of the given name.
 950
 951 If @var{coding-system-or-name} is a coding-system object, it is simply
 952 returned.  Otherwise, @var{coding-system-or-name} should be a symbol.
 953 If there is no such coding system, @code{nil} is returned.  Otherwise
 954 the associated coding system object is returned.
 955 @end defun
 956
 957 @defun get-coding-system name
 958 This function retrieves the coding system of the given name.  Same as
 959 @code{find-coding-system} except an error is signalled if there is no
 960 such coding system instead of returning @code{nil}.
 961 @end defun
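
The difference between the two lookup functions can be sketched as
follows.  The symbol @code{my-missing-coding-system} is a hypothetical
name, chosen on the assumption that no coding system of that name is
defined.

@example
@group
;; Returns nil rather than signalling an error (per the text above).
(find-coding-system 'my-missing-coding-system)

;; Signals an error for the same name; the handler turns the error
;; into the symbol `not-found'.
(condition-case nil
    (get-coding-system 'my-missing-coding-system)
  (error 'not-found))
@end group
@end example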
 962
 963 @defun coding-system-list
 964 This function returns a list of the names of all defined coding systems.
 965 @end defun
 966
 967 @defun coding-system-name coding-system
 968 This function returns the name of the given coding system.
 969 @end defun
 970
 971 @defun make-coding-system name type &optional doc-string props
 972 This function registers symbol @var{name} as a coding system.
 973
 974 @var{type} describes the conversion method used and should be one of
 975 the types listed in @ref{Coding System Types}.
 976
 977 @var{doc-string} is a string describing the coding system.
 978
@var{props} is a property list, describing the specific nature of the
coding system.  Recognized properties are as in @ref{Coding System
Properties}.
 982 @end defun
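
As an illustration, a 7-bit ISO 2022 variant resembling the
@code{junet} attributes listed earlier might be defined roughly as
follows.  This is only a sketch: the name @code{my-junet-like}, the doc
string, and the mnemonic are hypothetical, and it assumes that the
symbol @code{ascii} names the ASCII charset and is accepted where a
charset object is expected.

@example
@group
(make-coding-system
 'my-junet-like 'iso2022
 "Hypothetical 7-bit ISO 2022 coding system, for illustration only."
 '(charset-g0 ascii   ; G0 <- ASCII initially; G1..G3 left unspecified
   short t            ; use the short designation forms on output
   seven t            ; 7-bit environment on output
   mnemonic "MyJ"))   ; string displayed in the modeline
@end group
@end example

Properties that are not mentioned, such as @code{lock-shift}, are left
unspecified here, so no locking shift is requested.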
 983
 984 @defun copy-coding-system old-coding-system new-name
 985 This function copies @var{old-coding-system} to @var{new-name}.  If
 986 @var{new-name} does not name an existing coding system, a new one will
 987 be created.
 988 @end defun
 989
 990 @defun subsidiary-coding-system coding-system eol-type
 991 This function returns the subsidiary coding system of
 992 @var{coding-system} with eol type @var{eol-type}.
 993 @end defun
 994
 995 @node Coding System Property Functions
 996 @subsection Coding System Property Functions
 997
 998 @defun coding-system-doc-string coding-system
 999 This function returns the doc string for @var{coding-system}.
1000 @end defun
1001
1002 @defun coding-system-type coding-system
1003 This function returns the type of @var{coding-system}.
1004 @end defun
1005
1006 @defun coding-system-property coding-system prop
1007 This function returns the @var{prop} property of @var{coding-system}.
1008 @end defun
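
A short sketch of how these accessors relate to the arguments of
@code{make-coding-system}.  The coding system @code{my-raw-example} is
hypothetical and is defined here only so that the example is
self-contained.

@example
@group
(make-coding-system
 'my-raw-example 'no-conversion
 "Hypothetical coding system used to illustrate property lookup."
 '(mnemonic "Raw" eol-type lf))

;; Each call retrieves one piece of the definition above.
(coding-system-type 'my-raw-example)                ; the TYPE argument
(coding-system-doc-string 'my-raw-example)          ; the DOC-STRING
(coding-system-property 'my-raw-example 'mnemonic)  ; a PROPS entry
(coding-system-property 'my-raw-example 'eol-type)  ; a PROPS entry
@end group
@end example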
1009
1010 @node Encoding and Decoding Text
1011 @subsection Encoding and Decoding Text
1012
1013 @defun decode-coding-region start end coding-system &optional buffer
1014 This function decodes the text between @var{start} and @var{end} which
1015 is encoded in @var{coding-system}.  This is useful if you've read in
1016 encoded text from a file without decoding it (e.g. you read in a
1017 JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
system, so that it shows up as @samp{^[$B!<!+^[(B}).  The length of the
decoded text is returned.  @var{buffer} defaults to the current buffer
1020 if unspecified.
1021 @end defun
1022
1023 @defun encode-coding-region start end coding-system &optional buffer
1024 This function encodes the text between @var{start} and @var{end} using
1025 @var{coding-system}.  This will, for example, convert Japanese
1026 characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS
1027 encoding.  The length of the encoded text is returned.  @var{buffer}
1028 defaults to the current buffer if unspecified.
1029 @end defun
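
The two functions are inverses of each other, which the following
sketch tries to show.  The helper @code{my-round-trip-string} is
hypothetical, and the usage line assumes that a coding system named
@code{junet} is defined and that @code{with-temp-buffer} is available.

@example
@group
(defun my-round-trip-string (string coding-system)
  "Encode STRING with CODING-SYSTEM, decode it again, return the result."
  (with-temp-buffer
    (insert string)
    ;; Replace the buffer text by its external (encoded) form ...
    (encode-coding-region (point-min) (point-max) coding-system)
    ;; ... and convert that external form back into characters.
    (decode-coding-region (point-min) (point-max) coding-system)
    (buffer-string)))

(my-round-trip-string "plain ASCII text" 'junet)
@end group
@end example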
1030
1031 @node Detection of Textual Encoding
1032 @subsection Detection of Textual Encoding
1033
1034 @defun coding-category-list
1035 This function returns a list of all recognized coding categories.
1036 @end defun
1037
1038 @defun set-coding-priority-list list
1039 This function changes the priority order of the coding categories.
1040 @var{list} should be a list of coding categories, in descending order of
1041 priority.  Unspecified coding categories will be lower in priority than
1042 all specified ones, in the same relative order they were in previously.
1043 @end defun
1044
1045 @defun coding-priority-list
1046 This function returns a list of coding categories in descending order of
1047 priority.
1048 @end defun
1049
1050 @defun set-coding-category-system coding-category coding-system
1051 This function changes the coding system associated with a coding category.
1052 @end defun
1053
1054 @defun coding-category-system coding-category
1055 This function returns the coding system associated with a coding category.
1056 @end defun
1057
1058 @defun detect-coding-region start end &optional buffer
This function detects the coding system of the text in the region
between @var{start} and @var{end}.  The return value is a list of
possible coding systems, ordered by priority.  If only ASCII characters
are found, it returns @code{autodetect} or one of its subsidiary coding
systems, according to the detected end-of-line type.  The optional
argument @var{buffer} defaults to the current buffer.
1065 @end defun
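
A minimal sketch of querying the detection machinery.  It assumes that
@code{with-temp-buffer} is available; the exact return values depend on
the coding systems and priorities configured in the running XEmacs, so
none are shown.

@example
@group
;; Coding categories, highest priority first.
(coding-priority-list)

;; Candidates for a buffer that contains only ASCII text.
(with-temp-buffer
  (insert "plain ASCII text\n")
  (detect-coding-region (point-min) (point-max)))
@end group
@end example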
1066
1067 @node Big5 and Shift-JIS Functions
1068 @subsection Big5 and Shift-JIS Functions
1069
1070 These are special functions for working with the non-standard
1071 Shift-JIS and Big5 encodings.
1072
1073 @defun decode-shift-jis-char code
This function decodes a JISX0208 character encoded in the Shift-JIS
coding system.  @var{code} is the character code in Shift-JIS, as a cons
of two bytes.  The corresponding character is returned.
1077 @end defun
1078
1079 @defun encode-shift-jis-char ch
This function encodes the JISX0208 character @var{ch} in the Shift-JIS
coding system.  The corresponding character code in Shift-JIS is
returned as a cons of two bytes.
1083 @end defun
1084
1085 @defun decode-big5-char code
This function decodes a character encoded in the Big5 coding system.
@var{code} is the character code in Big5.  The corresponding character
is returned.
1089 @end defun
1090
1091 @defun encode-big5-char ch
This function encodes the Big5 character @var{ch} in the Big5 coding
system.  The corresponding character code in Big5 is returned.
1094 @end defun
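
The Shift-JIS pair can be exercised with a round trip such as the
following sketch.  It assumes that @code{make-char} and a charset named
@code{japanese-jisx0208} are available, as in an XEmacs built with MULE;
the code point 0x30 0x21 is just an arbitrary JISX0208 position.

@example
@group
(let* ((ch (make-char 'japanese-jisx0208 #x30 #x21))
       ;; CODE is a cons of two bytes, as described above.
       (code (encode-shift-jis-char ch)))
  ;; Decoding the encoded form is expected to yield CH again.
  (eq ch (decode-shift-jis-char code)))
@end group
@end example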
1095
1096 @node CCL, Category Tables, Coding Systems, MULE
1097 @section CCL
1098
1099 CCL (Code Conversion Language) is a simple structured programming
1100 language designed for character coding conversions.  A CCL program is
1101 compiled to CCL code (represented by a vector of integers) and executed
1102 by the CCL interpreter embedded in Emacs.  The CCL interpreter
1103 implements a virtual machine with 8 registers called @code{r0}, ...,
1104 @code{r7}, a number of control structures, and some I/O operators.  Take
1105 care when using registers @code{r0} (used in implicit @dfn{set}
1106 statements) and especially @code{r7} (used internally by several
1107 statements and operations, especially for multiple return values and I/O
1108 operations).
1109
1110 CCL is used for code conversion during process I/O and file I/O for
1111 non-ISO2022 coding systems.  (It is the only way for a user to specify a
1112 code conversion function.)  It is also used for calculating the code
1113 point of an X11 font from a character code.  However, since CCL is
1114 designed as a powerful programming language, it can be used for more
1115 generic calculation where efficiency is demanded.  A combination of
1116 three or more arithmetic operations can be calculated faster by CCL than
1117 by Emacs Lisp.
1118
1119 @strong{Warning:}  The code in @file{src/mule-ccl.c} and
1120 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive
1121 description of CCL's semantics.  The previous version of this section
1122 contained several typos and obsolete names left from earlier versions of
1123 MULE, and many may remain.  (I am not an experienced CCL programmer; the
1124 few who know CCL well find writing English painful.)
1125
1126 A CCL program transforms an input data stream into an output data
1127 stream.  The input stream, held in a buffer of constant bytes, is left
1128 unchanged.  The buffer may be filled by an external input operation,
1129 taken from an Emacs buffer, or taken from a Lisp string.  The output
1130 buffer is a dynamic array of bytes, which can be written by an external
1131 output operation, inserted into an Emacs buffer, or returned as a Lisp
1132 string.
1133
1134 A CCL program is a (Lisp) list containing two or three members.  The
1135 first member is the @dfn{buffer magnification}, which indicates the
1136 required minimum size of the output buffer as a multiple of the input
1137 buffer.  It is followed by the @dfn{main block} which executes while
1138 there is input remaining, and an optional @dfn{EOF block} which is
1139 executed when the input is exhausted.  Both the main block and the EOF
1140 block are CCL blocks.
1141
1142 A @dfn{CCL block} is either a CCL statement or list of CCL statements.
1143 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer
1144 or an @dfn{assignment}, which is a list of a register to receive the
1145 assignment, an assignment operator, and an expression) or a @dfn{control
1146 statement} (a list starting with a keyword, whose allowable syntax
1147 depends on the keyword).
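
To make the structure concrete, here is a small sketch of a CCL program
with all three members.  The name @code{my-ccl-copy-program} is
hypothetical, and the buffer magnification of 2 is simply a generous
guess that leaves room for the extra byte written by the EOF block.

@example
@group
(defconst my-ccl-copy-program
  '(2                       ; buffer magnification
    (loop                   ; main block: a single `loop' statement
     (read r0)              ; read one byte into r0
     (write-repeat r0))     ; write r0, jump back to the loop head
    (write 10))             ; EOF block: append a newline (LF)
  "Hypothetical CCL program: copy the input, then append a newline.")
@end group
@end example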
1148
1149 @menu
1150 * CCL Syntax::          CCL program syntax in BNF notation.
1151 * CCL Statements::      Semantics of CCL statements.
1152 * CCL Expressions::     Operators and expressions in CCL.
1153 * Calling CCL::         Running CCL programs.
1154 * CCL Examples::        The encoding functions for Big5 and KOI-8.
1155 @end menu
1156
1157 @node    CCL Syntax, CCL Statements, CCL,       CCL
1158 @comment Node,       Next,           Previous,  Up
1159 @subsection CCL Syntax
1160
1161 The full syntax of a CCL program in BNF notation:
1162
1163 @format
1164 CCL_PROGRAM :=
1165         (BUFFER_MAGNIFICATION
1166          CCL_MAIN_BLOCK
1167          [ CCL_EOF_BLOCK ])
1168
1169 BUFFER_MAGNIFICATION := integer
1170 CCL_MAIN_BLOCK := CCL_BLOCK
1171 CCL_EOF_BLOCK := CCL_BLOCK
1172
1173 CCL_BLOCK :=
1174         STATEMENT | (STATEMENT [STATEMENT ...])
1175 STATEMENT :=
1176         SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE
1177         | CALL | END
1178
1179 SET :=
1180         (REG = EXPRESSION)
1181         | (REG ASSIGNMENT_OPERATOR EXPRESSION)
1182         | integer
1183
1184 EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)
1185
1186 IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
1187 BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
1188 LOOP := (loop STATEMENT [STATEMENT ...])
1189 BREAK := (break)
1190 REPEAT :=
1191         (repeat)
1192         | (write-repeat [REG | integer | string])
1193         | (write-read-repeat REG [integer | ARRAY])
1194 READ :=
1195         (read REG ...)
1196         | (read-if (REG OPERATOR ARG) CCL_BLOCK CCL_BLOCK)
1197         | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
1198 WRITE :=
1199         (write REG ...)
1200         | (write EXPRESSION)
1201         | (write integer) | (write string) | (write REG ARRAY)
1202         | string
1203 CALL := (call ccl-program-name)
1204 END := (end)
1205
1206 REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
1207 ARG := REG | integer
1208 OPERATOR :=
1209         + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
1210         | < | > | == | <= | >= | != | de-sjis | en-sjis
1211 ASSIGNMENT_OPERATOR :=
1212         += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
1213 ARRAY := '[' integer ... ']'
1214 @end format
1215
1216 @node    CCL Statements, CCL Expressions, CCL Syntax, CCL
1217 @comment Node,           Next,            Previous,   Up
1218 @subsection CCL Statements
1219
1220 The Emacs Code Conversion Language provides the following statement
1221 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
1222 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
1223
1224 @heading Set statement:
1225
1226 The @dfn{set} statement has three variants with the syntaxes
1227 @samp{(@var{reg} = @var{expression})},
1228 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
1229 @samp{@var{integer}}.  The assignment operator variation of the
1230 @dfn{set} statement works the same way as the corresponding C expression
1231 statement does.  The assignment operators are @code{+=}, @code{-=},
1232 @code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=},
@code{<<=}, and @code{>>=}, and they have the same meanings as in C.  A
``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of
the form @code{(r0 = @var{integer})}.
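
The three variants might look like this inside a CCL block.  The
variable name @code{my-ccl-set-examples} is hypothetical; the list is an
uncompiled CCL block, not a complete program.

@example
@group
(defconst my-ccl-set-examples
  '((r0 = 65)           ; plain assignment
    (r1 = (r0 + 1))     ; assignment of an infix expression
    (r1 <<= 2)          ; assignment operator: r1 = r1 << 2, as in C
    7)                  ; naked integer: same as (r0 = 7)
  "Hypothetical CCL block showing the forms of the set statement.")
@end group
@end example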
1236
1237 @heading I/O statements:
1238
1239 The @dfn{read} statement takes one or more registers as arguments.  It
1240 reads one byte (a C char) from the input into each register in turn.
1241
The @dfn{write} statement takes several forms.  In the form
@samp{(write @var{reg} ...)} it takes one or more registers as arguments
and writes each in turn to the output.  The integer in a register
(interpreted as an Emchar) is encoded to multibyte form (i.e., Bufbytes)
and written to the current output buffer.  If it is less than 256, it is
written as is.  The forms @samp{(write @var{expression})} and
@samp{(write @var{integer})} are treated analogously.  The form
@samp{(write @var{string})} writes the constant string to the output.  A
``naked string'' @samp{@var{string}} is equivalent to the statement
@samp{(write @var{string})}.  The form @samp{(write @var{reg}
@var{array})} writes the @var{reg}th element of the @var{array} to the
output.
1253
1254 @heading Conditional statements:
1255
1256 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and
1257 an optional @var{second CCL block} as arguments.  If the
1258 @var{expression} evaluates to non-zero, the first @var{CCL block} is
1259 executed.  Otherwise, if there is a @var{second CCL block}, it is
1260 executed.
1261
1262 The @dfn{read-if} variant of the @dfn{if} statement takes an
1263 @var{expression}, a @var{CCL block}, and an optional @var{second CCL
1264 block} as arguments.  The @var{expression} must have the form
1265 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is
1266 a register or an integer).  The @code{read-if} statement first reads
1267 from the input into the first register operand in the @var{expression},
1268 then conditionally executes a CCL block just as the @code{if} statement
1269 does.
1270
1271 The @dfn{branch} statement takes an @var{expression} and one or more CCL
1272 blocks as arguments.  The CCL blocks are treated as a zero-indexed
1273 array, and the @code{branch} statement uses the @var{expression} as the
1274 index of the CCL block to execute.  Null CCL blocks may be used as
1275 no-ops, continuing execution with the statement following the
@code{branch} statement in the containing CCL block.  Out-of-range
values for the @var{expression} are also treated as no-ops.
1278
The @dfn{read-branch} variant of the @dfn{branch} statement takes a
@var{register} and one or more CCL blocks as arguments.  The
@code{read-branch} statement first reads from the input into the
@var{register}, then uses the value read as the index of the CCL block
to execute, just as the @code{branch} statement does.
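
The following sketch shows @code{branch} used as a zero-indexed
dispatch.  The name @code{my-ccl-branch-program} is hypothetical, and
the buffer magnification of 5 is chosen because each input byte can
produce up to five output bytes.

@example
@group
(defconst my-ccl-branch-program
  '(5
    (loop
     (read r0)
     (r0 -= 48)            ; map the ASCII digit "0" to index 0
     (branch r0
       (write "zero ")     ; index 0
       (write "one ")      ; index 1
       (write "two "))     ; index 2; other values are no-ops
     (repeat)))
  "Hypothetical CCL program: spell out the digits 0, 1 and 2.")
@end group
@end example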
1284
1285 @heading Loop control statements:
1286
1287 The @dfn{loop} statement creates a block with an implied jump from the
1288 end of the block back to its head.  The loop is exited on a @code{break}
1289 statement, and continued without executing the tail by a @code{repeat}
1290 statement.
1291
1292 The @dfn{break} statement, written @samp{(break)}, terminates the
1293 current loop and continues with the next statement in the current
1294 block.
1295
1296 The @dfn{repeat} statement has three variants, @code{repeat},
1297 @code{write-repeat}, and @code{write-read-repeat}.  Each continues the
1298 current loop from its head, possibly after performing I/O.
1299 @code{repeat} takes no arguments and does no I/O before jumping.
1300 @code{write-repeat} takes a single argument (a register, an
1301 integer, or a string), writes it to the output, then jumps.
1302 @code{write-read-repeat} takes one or two arguments.  The first must
1303 be a register.  The second may be an integer or an array; if absent, it
1304 is implicitly set to the first (register) argument.
1305 @code{write-read-repeat} writes its second argument to the output, then
1306 reads from the input into the register, and finally jumps.  See the
1307 @code{write} and @code{read} statements for the semantics of the I/O
1308 operations for each type of argument.
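
A sketch combining @code{loop}, @code{break}, and @code{write-repeat}.
The name @code{my-ccl-copy-until-nul} is hypothetical.

@example
@group
(defconst my-ccl-copy-until-nul
  '(1
    (loop
     (read r0)
     (if (r0 == 0)
         (break))          ; leave the loop at the first NUL byte
     (write-repeat r0)))   ; otherwise write r0 and continue the loop
  "Hypothetical CCL program: copy input up to the first NUL byte.")
@end group
@end example

Because @code{write-repeat} jumps back to the head of the loop, no
separate @code{repeat} statement is needed at the end of the loop body.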
1309
1310 @heading Other control statements:
1311
1312 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
1313 executes a CCL program as a subroutine.  It does not return a value to
1314 the caller, but can modify the register status.
1315
1316 The @dfn{end} statement, written @samp{(end)}, terminates the CCL
1317 program successfully, and returns to caller (which may be a CCL
1318 program).  It does not alter the status of the registers.
1319
1320 @node    CCL Expressions, Calling CCL, CCL Statements, CCL
1321 @comment Node,            Next,        Previous,       Up
1322 @subsection CCL Expressions
1323
1324 CCL, unlike Lisp, uses infix expressions.  The simplest CCL expressions
consist of a single @var{operand}, either a register (one of @code{r0},
..., @code{r7}) or an integer.  Complex expressions are lists of the
1327 form @code{( @var{expression} @var{operator} @var{operand} )}.  Unlike
1328 C, assignments are not expressions.
1329
In the following table, @var{X} is the target register for a @dfn{set}.
1331 In subexpressions, this is implicitly @code{r7}.  This means that
1332 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
1333 freely in subexpressions, since they return parts of their values in
1334 @code{r7}.  @var{Y} may be an expression, register, or integer, while
1335 @var{Z} must be a register or an integer.
1336
1337 @multitable @columnfractions .22 .14 .09 .55
1338 @item Name @tab Operator @tab Code @tab C-like Description
1339 @item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z
1340 @item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z
1341 @item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z
1342 @item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z
1343 @item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z
1344 @item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z
1345 @item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z
1346 @item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z
1347 @item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z
1348 @item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z
1349 @item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z
1350 @item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF
1351 @item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z
1352 @item CCL_LS @tab @code{<} @tab 0x10 @tab X = (X < Y)
1353 @item CCL_GT @tab @code{>} @tab 0x11 @tab X = (X > Y)
1354 @item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (X == Y)
1355 @item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (X <= Y)
1356 @item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (X >= Y)
1357 @item CCL_NE @tab @code{!=} @tab 0x15 @tab X = (X != Y)
1358 @item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z))
@item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
1360 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
1361 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
1362 @end multitable
1363
1364 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS.  The CCL_ENCODE_SJIS
and CCL_DECODE_SJIS operators treat their first and second operands as
the high and low bytes of a two-byte character code.  (SJIS stands for
Shift JIS, an
1368 encoding of Japanese characters used by Microsoft.  CCL_ENCODE_SJIS is a
1369 complicated transformation of the Japanese standard JIS encoding to
1370 Shift JIS.  CCL_DECODE_SJIS is its inverse.)  It is somewhat odd to
1371 represent the SJIS operations in infix form.
1372
1373 @node    Calling CCL, CCL Examples,  CCL Expressions, CCL
1374 @comment Node,        Next,          Previous,        Up
1375 @subsection Calling CCL
1376
1377 CCL programs are called automatically during Emacs buffer I/O when the
1378 external representation has a coding system type of @code{shift-jis},
1379 @code{big5}, or @code{ccl}.  The program is specified by the coding
1380 system (@pxref{Coding Systems}).  You can also call CCL programs from
1381 other CCL programs, and from Lisp using these functions:
1382
1383 @defun ccl-execute ccl-program status
1384 Execute @var{ccl-program} with registers initialized by
1385 @var{status}.  @var{ccl-program} is a vector of compiled CCL code
1386 created by @code{ccl-compile}.  It is an error for the program to try to
1387 execute a CCL I/O command.  @var{status} must be a vector of nine
1388 values, specifying the initial value for the R0, R1 .. R7 registers and
1389 for the instruction counter IC.  A @code{nil} value for a register
1390 initializer causes the register to be set to 0.  A @code{nil} value for
1391 the IC initializer causes execution to start at the beginning of the
1392 program.  When the program is done, @var{status} is modified (by
1393 side-effect) to contain the ending values for the corresponding
1394 registers and IC.
1395 @end defun
1396
1397 @defun ccl-execute-on-string ccl-program status str &optional continue
Execute @var{ccl-program} with initial @var{status} on
@var{str}.  @var{ccl-program} is a vector of compiled CCL code
created by @code{ccl-compile}.  @var{status} must be a vector of nine
values, specifying the initial value for the R0, R1 .. R7 registers and
for the instruction counter IC.  A @code{nil} value for a register
initializer causes the register to be set to 0.  A @code{nil} value for
the IC initializer causes execution to start at the beginning of the
program.  An optional fourth argument @var{continue}, if
non-@code{nil}, causes the IC to remain on the unsatisfied read
operation if the program terminates due to exhaustion of the input
buffer.  Otherwise the IC is set to the end of the program.  When the
program is done, @var{status} is modified (by side-effect) to contain
the ending values for the corresponding registers and IC.  Returns the
resulting string.
1412 @end defun
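
As a minimal sketch of the call described above, the following defines a
hypothetical helper, @code{my-ccl-shift-bytes}, that compiles a small
CCL program and runs it over a string.  The helper name and the program
itself are illustrative assumptions and not part of XEmacs.  The program
reads each input byte into R0, adds one, and writes the result.  All
nine @var{status} slots are @code{nil}, so every register starts at 0
and execution starts at the beginning of the program.

@example
;; Hypothetical helper, for illustration only.
(defun my-ccl-shift-bytes (str)
  (let ((program (ccl-compile
                  ;; Buffer magnification 1; loop until input runs out.
                  '(1 ((loop (read r0)
                             (r0 += 1)
                             (write-repeat r0))))))
        ;; Nine slots: R0 .. R7 and IC, all defaulted via nil.
        (status (make-vector 9 nil)))
    (ccl-execute-on-string program status str)))

(my-ccl-shift-bytes "abc")
@end example

Because @var{status} is modified by side-effect, a caller that wants the
final register values would keep a reference to the vector and inspect
it after the call.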
1413
1414 To call a CCL program from another CCL program, it must first be
1415 registered:
1416
1417 @defun register-ccl-program name ccl-program
Register @var{name} for CCL program @var{ccl-program} in
@code{ccl-program-table}.  @var{ccl-program} should be the compiled form
of a CCL program, or @code{nil}.  Returns the index number of the
registered CCL program.
1422 @end defun
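
The following sketch shows one way registration and the CCL @code{call}
statement might fit together.  The names @code{my-ccl-add-one} and
@code{my-ccl-shift-via-call} are hypothetical, chosen only for this
illustration.  A one-statement subroutine is registered under a symbol,
and a second program then invokes it by name for every byte it reads.

@example
;; Hypothetical names throughout.  Register a tiny subroutine that
;; adds one to R0.
(register-ccl-program 'my-ccl-add-one
                      (ccl-compile '(1 ((r0 += 1)))))

;; Call the registered subroutine by name from a second program.
(defun my-ccl-shift-via-call (str)
  (ccl-execute-on-string
   (ccl-compile '(1 ((loop (read r0)
                           (call my-ccl-add-one)
                           (write-repeat r0)))))
   (make-vector 9 nil)          ; R0 .. R7 and IC all defaulted
   str))

(my-ccl-shift-via-call "abc")
@end example

The subroutine must be registered before the calling program is
compiled, since the compiler has to resolve the name given to
@code{call}.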
1423
1424 Information about the processor time used by the CCL interpreter can be
1425 obtained using these functions:
1426
1427 @defun ccl-elapsed-time
Returns the elapsed processor time of the CCL interpreter as a cons of
user and system time, as floating point numbers measured in seconds.  If
only one overall value can be determined, the return value will be a
cons of that value and 0.
1433 @end defun
1434
1435 @defun ccl-reset-elapsed-time
1436 Resets the CCL interpreter's internal elapsed time registers.
1437 @end defun
1438
1439 @node    CCL Examples, ,     Calling CCL, CCL
1440 @comment Node,         Next, Previous,    Up
1441 @subsection CCL Examples
1442
1443 This section is not yet written.
1444
1445 @node Category Tables, , CCL, MULE
1446 @section Category Tables
1447
1448   A category table is a type of char table used for keeping track of
1449 categories.  Categories are used for classifying characters for use in
1450 regexps -- you can refer to a category rather than having to use a
1451 complicated [] expression (and category lookups are significantly
1452 faster).
1453
1454   There are 95 different categories available, one for each printable
1455 character (including space) in the ASCII charset.  Each category is
1456 designated by one such character, called a @dfn{category designator}.
They are specified in a regexp using the syntax @samp{\c@var{X}}, where
@var{X} is a category designator.  (This is not yet implemented.)
1459
1460   A category table specifies, for each character, the categories that
1461 the character is in.  Note that a character can be in more than one
1462 category.  More specifically, a category table maps from a character to
1463 either the value @code{nil} (meaning the character is in no categories)
1464 or a 95-element bit vector, specifying for each of the 95 categories
1465 whether the character is in that category.
1466
1467   Special Lisp functions are provided that abstract this, so you do not
1468 have to directly manipulate bit vectors.
1469
1470 @defun category-table-p obj
This function returns @code{t} if @var{obj} is a category table.
1472 @end defun
1473
1474 @defun category-table &optional buffer
1475 This function returns the current category table.  This is the one
1476 specified by the current buffer, or by @var{buffer} if it is
1477 non-@code{nil}.
1478 @end defun
1479
1480 @defun standard-category-table
1481 This function returns the standard category table.  This is the one used
1482 for new buffers.
1483 @end defun
1484
1485 @defun copy-category-table &optional table
This function constructs a new category table and returns it.  It is a
copy of @var{table}, which defaults to the standard category table.
1488 @end defun
1489
1490 @defun set-category-table table &optional buffer
This function selects @var{table}, which must be a category table, as
the new category table for @var{buffer}.  @var{buffer} defaults to the
current buffer if omitted.
1494 @end defun
1495
1496 @defun category-designator-p obj
This function returns @code{t} if @var{obj} is a category designator (a
char in the range @samp{' '} to @samp{'~'}).
1499 @end defun
1500
1501 @defun category-table-value-p obj
This function returns @code{t} if @var{obj} is a category table value.
1503 Valid values are @code{nil} or a bit vector of size 95.
1504 @end defun
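
The two predicates above can be exercised with a few literal values.
This is only a sketch: the helper name
@code{my-category-predicates-demo} is hypothetical, and the comments
state what the descriptions above lead one to expect, not verified
output.

@example
;; Hypothetical helper, for illustration only.
(defun my-category-predicates-demo ()
  (list
   ;; A printable ASCII character: expected to be a designator.
   (category-designator-p ?a)
   ;; A control character: expected not to be a designator.
   (category-designator-p ?\n)
   ;; nil (no categories) and a 95-element bit vector: expected valid.
   (category-table-value-p nil)
   (category-table-value-p (make-bit-vector 95 0))
   ;; A bit vector of the wrong size: expected invalid.
   (category-table-value-p (make-bit-vector 10 0))))

(my-category-predicates-demo)
@end example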
1505