   1 @c -*-texinfo-*-
   2 @c This is part of the XEmacs Lisp Reference Manual.
   3 @c Copyright (C) 1996 Ben Wing.
   4 @c See the file lispref.texi for copying conditions.
   5 @setfilename ../../info/internationalization.info
   6 @node MULE, Tips, Internationalization, top
   7 @chapter MULE
   8
   9 @dfn{MULE} is the name originally given to the version of GNU Emacs
  10 extended for multi-lingual (and in particular Asian-language) support.
  11 ``MULE'' is short for ``MUlti-Lingual Emacs''.  It was originally called
  12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for
  13 ``Japan''), when it only provided support for Japanese.  XEmacs
  14 refers to its multi-lingual support as @dfn{MULE support} since it
  15 is based on @dfn{MULE}.
  16
  17 @menu
  18 * Internationalization Terminology::
  19                         Definition of various internationalization terms.
  20 * Charsets::            Sets of related characters.
  21 * MULE Characters::     Working with characters in XEmacs/MULE.
  22 * Composite Characters:: Making new characters by overstriking other ones.
  23 * ISO 2022::            An international standard for charsets and encodings.
  24 * Coding Systems::      Ways of representing a string of chars using integers.
  25 * CCL::                 A special language for writing fast converters.
  26 * Category Tables::     Subdividing charsets into groups.
  27 @end menu
  28
  29 @node Internationalization Terminology
  30 @section Internationalization Terminology
  31
   In internationalization terminology, a string of text is divided up
into @dfn{characters}, which are the printable units that make up the
text.  A single character is (for example) a capital @samp{A}, the
number @samp{2}, a Katakana character, a Kanji ideograph (an
@dfn{ideograph} is a ``picture'' character, such as is used in Japanese
Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
of such ideographs in each language), etc.  The basic property of a
character is its shape.  Note that the same character may be drawn by
two different people (or in two different fonts) in slightly different
ways, although the basic shape will be the same.
  42
  43   In some cases, the differences will be significant enough that it is
  44 actually possible to identify two or more distinct shapes that both
  45 represent the same character.  For example, the lowercase letters
  46 @samp{a} and @samp{g} each have two distinct possible shapes -- the
  47 @samp{a} can optionally have a curved tail projecting off the top, and
  48 the @samp{g} can be formed either of two loops, or of one loop and a
  49 tail hanging off the bottom.  Such distinct possible shapes of a
  50 character are called @dfn{glyphs}.  The important characteristic of two
  51 glyphs making up the same character is that the choice between one or
  52 the other is purely stylistic and has no linguistic effect on a word
  53 (this is the reason why a capital @samp{A} and lowercase @samp{a}
  54 are different characters rather than different glyphs -- e.g.
  55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
  56
  57   Note that @dfn{character} and @dfn{glyph} are used differently
  58 here than elsewhere in XEmacs.
  59
  60   A @dfn{character set} is simply a set of related characters.  ASCII,
  61 for example, is a set of 94 characters (or 128, if you count
  62 non-printing characters).  Other character sets are ISO8859-1 (ASCII
  63 plus various accented characters and other international symbols),
  64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208
  65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji),
  66 GB2312 (Mainland Chinese Hanzi), etc.
  67
  Every character set has one or more @dfn{orderings}, which can be
viewed as a way of assigning a number (or set of numbers) to each
character in the set.  For most character sets, there is a standard
ordering, and in fact all of the character sets mentioned above define a
particular ordering.  ASCII, for example, places letters in their
``natural'' order, puts uppercase letters before lowercase letters,
numbers before letters, etc.  Note that for many of the Asian character
sets, there is no natural ordering of the characters.  The actual
orderings are based on one or more salient characteristics, of which
there are many to choose from -- e.g. number of strokes, common
radicals, phonetic ordering, etc.
  79
  The numbers assigned to any particular character are called
the character's @dfn{position codes}.  The number of position codes
required to index a particular character in a character set is called
the @dfn{dimension} of the character set.  ASCII, being a relatively
small character set, is of dimension one, and each character in the
set is indexed using a single position code, in the range 0 through
127 (if non-printing characters are included) or 33 through 126
(if only the printing characters are considered).  JISX0208, i.e.
Japanese Kanji, has thousands of characters, and is of dimension two --
every character is indexed by two position codes, each in the range
33 through 126.  (Note that the choice of the range here is somewhat
arbitrary.  Although a character set such as JISX0208 defines an
@emph{ordering} of all its characters, it does not define the actual
mapping between numbers and characters.  You could just as easily
index the characters in JISX0208 using numbers in the range 0 through
93, 1 through 94, 2 through 95, etc.  The reason for the actual range
chosen is so that the position codes match up with the actual values
used in the common encodings.)
  98
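  As a minimal sketch of how two position codes index a 94x94 character
set, the following Lisp converts between a pair of position codes in the
range 33 through 126 and a zero-based linear index.  The function names
are hypothetical and are not part of XEmacs; only ordinary integer
arithmetic is used.

@example
@group
;; Hypothetical helpers for illustration; not part of XEmacs.
(defun sketch-codes-to-index (c1 c2)
  "Return the zero-based index of position codes C1 and C2."
  (+ (* (- c1 33) 94) (- c2 33)))

(defun sketch-index-to-codes (index)
  "Return the list (C1 C2) of position codes for INDEX."
  (list (+ 33 (/ index 94)) (+ 33 (% index 94))))
@end group

@group
(sketch-codes-to-index 33 33)
     @result{} 0
(sketch-codes-to-index 126 126)
     @result{} 8835
(sketch-index-to-codes 8835)
     @result{} (126 126)
@end group
@end example
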
  99   An @dfn{encoding} is a way of numerically representing characters from
 100 one or more character sets into a stream of like-sized numerical values
 101 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
 102 quantities.  If an encoding encompasses only one character set, then the
 103 position codes for the characters in that character set could be used
 104 directly. (This is the case with ASCII, and as a result, most people do
 105 not understand the difference between a character set and an encoding.)
 106 This is not possible, however, if more than one character set is to be
 107 used in the encoding.  For example, printed Japanese text typically
 108 requires characters from multiple character sets -- ASCII, JISX0208, and
 109 JISX0212, to be specific.  Each of these is indexed using one or more
 110 position codes in the range 33 through 126, so the position codes could
 111 not be used directly or there would be no way to tell which character
 112 was meant.  Different Japanese encodings handle this differently -- JIS
 113 uses special escape characters to denote different character sets; EUC
 114 sets the high bit of the position codes for JISX0208 and JISX0212, and
 115 puts a special extra byte before each JISX0212 character; etc. (JIS,
 116 EUC, and most of the other encodings you will encounter are 7-bit or
 117 8-bit encodings.  There is one common 16-bit encoding, which is Unicode;
 118 this strives to represent all the world's characters in a single large
 119 character set.  32-bit encodings are generally used internally in
 120 programs to simplify the code that manipulates them; however, they are
 121 not much used externally because they are not very space-efficient.)
 122
  Encodings are classified as either @dfn{modal} or @dfn{non-modal}.  In
a @dfn{modal encoding}, there are multiple states that the encoding can be in,
and the interpretation of the values in the stream depends on the
current global state of the encoding.  Special values in the encoding,
called @dfn{escape sequences}, are used to change the global state.
JIS, for example, is a modal encoding.  The bytes @samp{ESC $ B}
indicate that, from then on, bytes are to be interpreted as position
codes for JISX0208, rather than as ASCII.  This effect is cancelled
using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
current state is to ASCII''.  To switch to JISX0212, the escape sequence
@samp{ESC $ ( D} is used.  (Note that here, as is common, the escape
sequences do in fact begin with @samp{ESC}.  This is not necessarily the
case, however.)
 136
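  The state-dependent interpretation described above can be sketched as
a small loop that carries the current state along while it reads bytes.
This is only an illustration under simplifying assumptions: the function
name is hypothetical, the input is a list of integers rather than a real
byte stream, and only the two escape sequences @samp{ESC $ B} and
@samp{ESC ( B} are recognized.

@example
@group
;; Hypothetical sketch; not the decoder used by XEmacs.
;; 27 = ESC, 36 = $, 40 = (, 66 = B.
(defun sketch-jis-decode (bytes)
  "Split BYTES into (CHARSET POSITION-CODE...) lists."
  (let ((state 'ascii)
        (result nil))
    (while bytes
      (cond
       ;; ESC $ B: following bytes are pairs of JISX0208 codes.
       ((and (= (car bytes) 27)
             (equal (nth 1 bytes) 36)
             (equal (nth 2 bytes) 66))
        (setq state 'jisx0208
              bytes (nthcdr 3 bytes)))
       ;; ESC ( B: following bytes are ASCII again.
       ((and (= (car bytes) 27)
             (equal (nth 1 bytes) 40)
             (equal (nth 2 bytes) 66))
        (setq state 'ascii
              bytes (nthcdr 3 bytes)))
       ;; In the JISX0208 state, consume two position codes.
       ((eq state 'jisx0208)
        (setq result (cons (list 'jisx0208 (car bytes) (nth 1 bytes))
                           result)
              bytes (nthcdr 2 bytes)))
       ;; In the ASCII state, consume one byte.
       (t
        (setq result (cons (list 'ascii (car bytes)) result)
              bytes (cdr bytes)))))
    (nreverse result)))
@end group

@group
(sketch-jis-decode '(65 27 36 66 36 34 27 40 66 65))
     @result{} ((ascii 65) (jisx0208 36 34) (ascii 65))
@end group
@end example
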
 137 A @dfn{non-modal encoding} has no global state that extends past the
 138 character currently being interpreted.  EUC, for example, is a
 139 non-modal encoding.  Characters in JISX0208 are encoded by setting
 140 the high bit of the position codes, and characters in JISX0212 are
 141 encoded by doing the same but also prefixing the character with the
 142 byte 0x8F.
 143
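  The EUC rules just described can be written down directly.  The
following is a minimal sketch, assuming the character is given as a
charset symbol plus its position codes; the function name and the three
symbols it accepts are hypothetical and are not XEmacs charset names.

@example
@group
;; Hypothetical sketch; not part of XEmacs.
;; 128 = 0x80 (the high bit), 143 = 0x8F (the JISX0212 prefix).
(defun sketch-euc-encode (charset c1 &optional c2)
  "Return the list of EUC bytes for position codes C1 and C2."
  (cond ((eq charset 'ascii)
         (list c1))
        ((eq charset 'jisx0208)
         (list (logior c1 128) (logior c2 128)))
        ((eq charset 'jisx0212)
         (list 143 (logior c1 128) (logior c2 128)))))
@end group

@group
(sketch-euc-encode 'ascii 65)
     @result{} (65)
(sketch-euc-encode 'jisx0208 36 34)
     @result{} (164 162)
(sketch-euc-encode 'jisx0212 34 47)
     @result{} (143 162 175)
@end group
@end example
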
 144   The advantage of a modal encoding is that it is generally more
 145 space-efficient, and is easily extendable because there are essentially
 146 an arbitrary number of escape sequences that can be created.  The
 147 disadvantage, however, is that it is much more difficult to work with
 148 if it is not being processed in a sequential manner.  In the non-modal
 149 EUC encoding, for example, the byte 0x41 always refers to the letter
 150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
 151 one of the two position codes in a JISX0208 character, or one of the
 152 two position codes in a JISX0212 character.  Determining exactly which
 153 one is meant could be difficult and time-consuming if the previous
 154 bytes in the string have not already been processed.
 155
  Non-modal encodings are further divided into @dfn{fixed-width} and
@dfn{variable-width} formats.  A fixed-width encoding always uses
the same number of words per character, whereas a variable-width
encoding does not.  EUC is a good example of a variable-width
encoding: one to three bytes are used per character, depending on
the character set.  16-bit and 32-bit encodings are nearly always
fixed-width, and this is in fact one of the main reasons for using
an encoding with a larger word size.  The advantage of a fixed-width
encoding is that any given character can be located by simple
arithmetic, without scanning the text that precedes it.  The advantages
of variable-width encodings are that they are generally more
space-efficient and allow for compatibility with existing encodings
such as ASCII.
 167
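  To illustrate what variable width means in practice, the sketch below
returns the number of bytes occupied by an EUC character, judging only
by its first byte.  The function name is hypothetical, and the sketch
assumes that the byte it is given really is the first byte of a
well-formed character.

@example
@group
;; Hypothetical sketch; not part of XEmacs.
(defun sketch-euc-char-length (first-byte)
  "Return the byte length of the EUC character starting with FIRST-BYTE."
  (cond ((< first-byte 128) 1)    ; high bit clear: ASCII
        ((= first-byte 143) 3)    ; 0x8F prefix: JISX0212
        (t 2)))                   ; high bit set: two bytes
@end group

@group
(sketch-euc-char-length 65)
     @result{} 1
(sketch-euc-char-length 164)
     @result{} 2
(sketch-euc-char-length 143)
     @result{} 3
@end group
@end example
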
 168   Note that the bytes in an 8-bit encoding are often referred to
 169 as @dfn{octets} rather than simply as bytes.  This terminology
 170 dates back to the days before 8-bit bytes were universal, when
 171 some computers had 9-bit bytes, others had 10-bit bytes, etc.
 172
 173 @node Charsets
 174 @section Charsets
 175
 176   A @dfn{charset} in MULE is an object that encapsulates a
 177 particular character set as well as an ordering of those characters.
 178 Charsets are permanent objects and are named using symbols, like
 179 faces.
 180
 181 @defun charsetp object
 182 This function returns non-@code{nil} if @var{object} is a charset.
 183 @end defun
 184
 185 @menu
 186 * Charset Properties::          Properties of a charset.
 187 * Basic Charset Functions::     Functions for working with charsets.
 188 * Charset Property Functions::  Functions for accessing charset properties.
 189 * Predefined Charsets::         Predefined charset objects.
 190 @end menu
 191
 192 @node Charset Properties
 193 @subsection Charset Properties
 194
 195   Charsets have the following properties:
 196
 197 @table @code
 198 @item name
 199 A symbol naming the charset.  Every charset must have a different name;
 200 this allows a charset to be referred to using its name rather than
 201 the actual charset object.
 202 @item doc-string
 203 A documentation string describing the charset.
 204 @item registry
 205 A regular expression matching the font registry field for this character
 206 set.  For example, both the @code{ascii} and @code{latin-iso8859-1}
 207 charsets use the registry @code{"ISO8859-1"}.  This field is used to
 208 choose an appropriate font when the user gives a general font
 209 specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a
 210 14-point upright medium-weight Courier font.
 211 @item dimension
 212 Number of position codes used to index a character in the character set.
 213 XEmacs/MULE can only handle character sets of dimension 1 or 2.
 214 This property defaults to 1.
 215 @item chars
 216 Number of characters in each dimension.  In XEmacs/MULE, the only
 217 allowed values are 94 or 96. (There are a couple of pre-defined
 218 character sets, such as ASCII, that do not follow this, but you cannot
 219 define new ones like this.) Defaults to 94.  Note that if the dimension
 220 is 2, the character set thus described is 94x94 or 96x96.
 221 @item columns
 222 Number of columns used to display a character in this charset.
 223 Only used in TTY mode. (Under X, the actual width of a character
 224 can be derived from the font used to display the characters.)
 225 If unspecified, defaults to the dimension. (This is almost
 226 always the correct value, because character sets with dimension 2
 227 are usually ideograph character sets, which need two columns to
 228 display the intricate ideographs.)
 229 @item direction
 230 A symbol, either @code{l2r} (left-to-right) or @code{r2l}
 231 (right-to-left).  Defaults to @code{l2r}.  This specifies the
 232 direction that the text should be displayed in, and will be
 233 left-to-right for most charsets but right-to-left for Hebrew
 234 and Arabic. (Right-to-left display is not currently implemented.)
@item final
Final byte of the standard ISO 2022 escape sequence designating this
charset.  Must be supplied.  Each combination of (@var{dimension},
@var{chars}) defines a separate namespace for final bytes, and each
charset within a particular namespace must have a different final byte.
Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
dimension == 1, and 0x30 - 0x5F if dimension == 2.  Note also that final
bytes in the range 0x30 - 0x3F are reserved for user-defined (not
official) character sets.  For more information on ISO 2022, see
@ref{ISO 2022}.
 245 @item graphic
 246 0 (use left half of font on output) or 1 (use right half of font on
 247 output).  Defaults to 0.  This specifies how to convert the position
 248 codes that index a character in a character set into an index into the
 249 font used to display the character set.  With @code{graphic} set to 0,
 250 position codes 33 through 126 map to font indices 33 through 126; with
 251 it set to 1, position codes 33 through 126 map to font indices 161
 252 through 254 (i.e. the same number but with the high bit set).  For
 253 example, for a font whose registry is ISO8859-1, the left half of the
 254 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right
 255 half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset.
 256 @item ccl-program
 257 A compiled CCL program used to convert a character in this charset into
 258 an index into the font.  This is in addition to the @code{graphic}
 259 property.  If a CCL program is defined, the position codes of a
 260 character will first be processed according to @code{graphic} and
 261 then passed through the CCL program, with the resulting values used
 262 to index the font.
 263
 264 This is used, for example, in the Big5 character set (used in Taiwan).
 265 This character set is not ISO-2022-compliant, and its size (94x157) does
 266 not fit within the maximum 96x96 size of ISO-2022-compliant character
 267 sets.  As a result, XEmacs/MULE splits it (in a rather complex fashion,
 268 so as to group the most commonly used characters together) into two
 269 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94,
 270 and each charset object uses a CCL program to convert the modified
 271 position codes back into standard Big5 indices to retrieve a character
 272 from a Big5 font.
 273 @end table
 274
Most of the above properties can only be set when the charset
is created.  @xref{Charset Property Functions}.
 277
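The effect of the @code{graphic} property on font indexing can be
sketched with a few lines of arithmetic.  The function name is
hypothetical, and the sketch deliberately ignores any
@code{ccl-program}, which would be applied afterwards.

@example
@group
;; Hypothetical sketch; not part of XEmacs.
(defun sketch-font-index (position-code graphic)
  "Return the font index for POSITION-CODE given GRAPHIC (0 or 1)."
  (if (= graphic 1)
      (logior position-code 128)   ; right half: set the high bit
    position-code))                ; left half: use the code as is
@end group

@group
(sketch-font-index 33 0)
     @result{} 33
(sketch-font-index 33 1)
     @result{} 161
(sketch-font-index 126 1)
     @result{} 254
@end group
@end example
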
 278 @node Basic Charset Functions
 279 @subsection Basic Charset Functions
 280
 281 @defun find-charset charset-or-name
 282 This function retrieves the charset of the given name.  If
 283 @var{charset-or-name} is a charset object, it is simply returned.
 284 Otherwise, @var{charset-or-name} should be a symbol.  If there is no
 285 such charset, @code{nil} is returned.  Otherwise the associated charset
 286 object is returned.
 287 @end defun
 288
 289 @defun get-charset name
 290 This function retrieves the charset of the given name.  Same as
 291 @code{find-charset} except an error is signalled if there is no such
 292 charset instead of returning @code{nil}.
 293 @end defun
 294
 295 @defun charset-list
 296 This function returns a list of the names of all defined charsets.
 297 @end defun
 298
@defun make-charset name doc-string props
This function defines a new character set.  This function is for use
with Mule support.  @var{name} is a symbol, the name by which the
character set is normally referred to.  @var{doc-string} is a string
describing the character set.  @var{props} is a property list
describing the specific nature of the character set.  The recognized
properties are @code{registry}, @code{dimension}, @code{columns},
@code{chars}, @code{final}, @code{graphic}, @code{direction}, and
@code{ccl-program}, as previously described.
@end defun
 309
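A call to @code{make-charset} might look like the following sketch.
The charset name, doc string, and registry are invented for
illustration, and the final byte @code{?9} is only assumed to be unused
among the 94-character charsets of dimension 1 in your XEmacs; it lies
in the range 0x30 - 0x3F reserved for user-defined character sets.

@example
@group
;; Hypothetical charset; every value here is an assumption.
(make-charset 'sketch-private-94
              "A private 94-character set, for illustration only."
              '(registry "SketchPrivate-0"
                dimension 1
                chars 94
                columns 1
                final ?9
                graphic 0
                direction l2r))
@end group
@end example
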
 310 @defun make-reverse-direction-charset charset new-name
 311 This function makes a charset equivalent to @var{charset} but which goes
 312 in the opposite direction.  @var{new-name} is the name of the new
 313 charset.  The new charset is returned.
 314 @end defun
 315
 316 @defun charset-from-attributes dimension chars final &optional direction
 317 This function returns a charset with the given @var{dimension},
 318 @var{chars}, @var{final}, and @var{direction}.  If @var{direction} is
 319 omitted, both directions will be checked (left-to-right will be returned
 320 if character sets exist for both directions).
 321 @end defun
 322
 323 @defun charset-reverse-direction-charset charset
 324 This function returns the charset (if any) with the same dimension,
 325 number of characters, and final byte as @var{charset}, but which is
 326 displayed in the opposite direction.
 327 @end defun
 328
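The following sketch shows how the lookup functions relate to one
another.  It assumes that no charset named @code{no-such-charset} has
been defined, and it relies on the dimension, size, and final byte
listed for @code{japanese-jisx0208} among the predefined charsets.
@code{charset-name} is used only to make the results printable.

@example
@group
(find-charset 'no-such-charset)
     @result{} nil
(charset-name (find-charset 'ascii))
     @result{} ascii
(charset-name (get-charset 'japanese-jisx0208))
     @result{} japanese-jisx0208
@end group

@group
;; Dimension 2, 94 characters per dimension, final byte B.
(charset-name (charset-from-attributes 2 94 ?B))
     @result{} japanese-jisx0208
@end group
@end example
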
 329 @node Charset Property Functions
 330 @subsection Charset Property Functions
 331
 332 All of these functions accept either a charset name or charset object.
 333
 334 @defun charset-property charset prop
 335 This function returns property @var{prop} of @var{charset}.
 336 @xref{Charset Properties}.
 337 @end defun
 338
 339 Convenience functions are also provided for retrieving individual
 340 properties of a charset.
 341
 342 @defun charset-name charset
 343 This function returns the name of @var{charset}.  This will be a symbol.
 344 @end defun
 345
 346 @defun charset-doc-string charset
 347 This function returns the doc string of @var{charset}.
 348 @end defun
 349
 350 @defun charset-registry charset
 351 This function returns the registry of @var{charset}.
 352 @end defun
 353
 354 @defun charset-dimension charset
 355 This function returns the dimension of @var{charset}.
 356 @end defun
 357
 358 @defun charset-chars charset
 359 This function returns the number of characters per dimension of
 360 @var{charset}.
 361 @end defun
 362
 363 @defun charset-columns charset
 364 This function returns the number of display columns per character (in
 365 TTY mode) of @var{charset}.
 366 @end defun
 367
 368 @defun charset-direction charset
 369 This function returns the display direction of @var{charset} -- either
 370 @code{l2r} or @code{r2l}.
 371 @end defun
 372
 373 @defun charset-final charset
 374 This function returns the final byte of the ISO 2022 escape sequence
 375 designating @var{charset}.
 376 @end defun
 377
 378 @defun charset-graphic charset
 379 This function returns either 0 or 1, depending on whether the position
 380 codes of characters in @var{charset} map to the left or right half
 381 of their font, respectively.
 382 @end defun
 383
 384 @defun charset-ccl-program charset
 385 This function returns the CCL program, if any, for converting
 386 position codes of characters in @var{charset} into font indices.
 387 @end defun
 388
 389 The only property of a charset that can currently be set after
 390 the charset has been created is the CCL program.
 391
 392 @defun set-charset-ccl-program charset ccl-program
 393 This function sets the @code{ccl-program} property of @var{charset} to
 394 @var{ccl-program}.
 395 @end defun
 396
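As an illustration, here are a few of these functions applied to
predefined charsets.  The results shown are the values given in the
table of predefined charsets (@pxref{Predefined Charsets}), so they
assume that the table matches your XEmacs.

@example
@group
(charset-dimension 'japanese-jisx0208)
     @result{} 2
(charset-chars 'japanese-jisx0208)
     @result{} 94
(charset-final 'japanese-jisx0208)
     @result{} ?B
@end group

@group
(charset-graphic 'latin-iso8859-2)
     @result{} 1
(charset-direction 'hebrew-iso8859-8)
     @result{} r2l
(charset-property 'japanese-jisx0208 'dimension)
     @result{} 2
@end group
@end example
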
 397 @node Predefined Charsets
 398 @subsection Predefined Charsets
 399
 400 The following charsets are predefined in the C code.
 401
@example
Name                     Type  Fi Gr Dir Registry
--------------------------------------------------------------
ascii                    94    B  0  l2r ISO8859-1
control-1                94       0  l2r ---
latin-iso8859-1          96    A  1  l2r ISO8859-1
latin-iso8859-2          96    B  1  l2r ISO8859-2
latin-iso8859-3          96    C  1  l2r ISO8859-3
latin-iso8859-4          96    D  1  l2r ISO8859-4
cyrillic-iso8859-5       96    L  1  l2r ISO8859-5
arabic-iso8859-6         96    G  1  r2l ISO8859-6
greek-iso8859-7          96    F  1  l2r ISO8859-7
hebrew-iso8859-8         96    H  1  r2l ISO8859-8
latin-iso8859-9          96    M  1  l2r ISO8859-9
thai-tis620              96    T  1  l2r TIS620
katakana-jisx0201        94    I  1  l2r JISX0201.1976
latin-jisx0201           94    J  0  l2r JISX0201.1976
japanese-jisx0208-1978   94x94 @@  0  l2r JISX0208.1978
japanese-jisx0208        94x94 B  0  l2r JISX0208.19(83|90)
japanese-jisx0212        94x94 D  0  l2r JISX0212
chinese-gb2312           94x94 A  0  l2r GB2312
chinese-cns11643-1       94x94 G  0  l2r CNS11643.1
chinese-cns11643-2       94x94 H  0  l2r CNS11643.2
chinese-big5-1           94x94 0  0  l2r Big5
chinese-big5-2           94x94 1  0  l2r Big5
korean-ksc5601           94x94 C  0  l2r KSC5601
composite                96x96    0  l2r ---
@end example
 430
 431 The following charsets are predefined in the Lisp code.
 432
 433 @example
 434 Name                     Type  Fi Gr Dir Registry
 435 --------------------------------------------------------------
 436 arabic-digit             94    2  0  l2r MuleArabic-0
 437 arabic-1-column          94    3  0  r2l MuleArabic-1
 438 arabic-2-column          94    4  0  r2l MuleArabic-2
 439 sisheng                  94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
 440 chinese-cns11643-3       94x94 I  0  l2r CNS11643.1
 441 chinese-cns11643-4       94x94 J  0  l2r CNS11643.1
 442 chinese-cns11643-5       94x94 K  0  l2r CNS11643.1
 443 chinese-cns11643-6       94x94 L  0  l2r CNS11643.1
 444 chinese-cns11643-7       94x94 M  0  l2r CNS11643.1
 445 ethiopic                 94x94 2  0  l2r Ethio
 446 ascii-r2l                94    B  0  r2l ISO8859-1
 447 ipa                      96    0  1  l2r MuleIPA
 448 vietnamese-lower         96    1  1  l2r VISCII1.1
 449 vietnamese-upper         96    2  1  l2r VISCII1.1
 450 @end example
 451
 452 For all of the above charsets, the dimension and number of columns are
 453 the same.
 454
 455 Note that ASCII, Control-1, and Composite are handled specially.
 456 This is why some of the fields are blank; and some of the filled-in
 457 fields (e.g. the type) are not really accurate.
 458
 459 @node MULE Characters
 460 @section MULE Characters
 461
 462 @defun make-char charset arg1 &optional arg2
 463 This function makes a multi-byte character from @var{charset} and octets
 464 @var{arg1} and @var{arg2}.
 465 @end defun
 466
 467 @defun char-charset ch
 468 This function returns the character set of char @var{ch}.
 469 @end defun
 470
 471 @defun char-octet ch &optional n
 472 This function returns the octet (i.e. position code) numbered @var{n}
 473 (should be 0 or 1) of char @var{ch}.  @var{n} defaults to 0 if omitted.
 474 @end defun
 475
 476 @defun find-charset-region start end &optional buffer
 477 This function returns a list of the charsets in the region between
 478 @var{start} and @var{end}.  @var{buffer} defaults to the current buffer
 479 if omitted.
 480 @end defun
 481
 482 @defun find-charset-string string
 483 This function returns a list of the charsets in @var{string}.
 484 @end defun
 485
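The sketch below builds a JISX0208 character from two position codes
and then takes it apart again.  The position codes 36 and 34 are chosen
only as an example.  The result shown assumes that @code{char-octet}
returns the position codes unchanged, as its description says;
@code{find-charset} and @code{charset-name} are applied to the value of
@code{char-charset} so that the example does not depend on whether that
value is a charset object or a charset name.

@example
@group
(let ((ch (make-char 'japanese-jisx0208 36 34)))
  (list (char-octet ch 0)
        (char-octet ch 1)
        (charset-name (find-charset (char-charset ch)))))
     @result{} (36 34 japanese-jisx0208)
@end group
@end example
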
 486 @node Composite Characters
 487 @section Composite Characters
 488
 489 Composite characters are not yet completely implemented.
 490
 491 @defun make-composite-char string
 492 This function converts a string into a single composite character.  The
 493 character is the result of overstriking all the characters in the
 494 string.
 495 @end defun
 496
 497 @defun composite-char-string ch
 498 This function returns a string of the characters comprising a composite
 499 character.
 500 @end defun
 501
 502 @defun compose-region start end &optional buffer
 503 This function composes the characters in the region from @var{start} to
 504 @var{end} in @var{buffer} into one composite character.  The composite
 505 character replaces the composed characters.  @var{buffer} defaults to
 506 the current buffer if omitted.
 507 @end defun
 508
 509 @defun decompose-region start end &optional buffer
 510 This function decomposes any composite characters in the region from
 511 @var{start} to @var{end} in @var{buffer}.  This converts each composite
 512 character into one or more characters, the individual characters out of
 513 which the composite character was formed.  Non-composite characters are
 514 left as-is.  @var{buffer} defaults to the current buffer if omitted.
 515 @end defun
 516
 517 @node ISO 2022
 518 @section ISO 2022
 519
 520 This section briefly describes the ISO 2022 encoding standard.  For more
 521 thorough understanding, please refer to the original document of ISO
 522 2022.
 523
Character sets (@dfn{charsets}) are classified into the following four
categories, according to the number of characters in the charset:
94-charset, 96-charset, 94x94-charset, and 96x96-charset.

@need 1000
@table @asis
@item 94-charset
 ASCII(B), left(J) and right(I) halves of JISX0201, ...
@item 96-charset
 Latin-1(A), Latin-2(B), Latin-3(C), ...
@item 94x94-charset
 GB2312(A), JISX0208(B), KSC5601(C), ...
@item 96x96-charset
 none for the moment
@end table
 539
The character in parentheses after the name of each charset
is the @dfn{final character} @var{F}, which can be regarded as
the identifier of the charset.  ECMA allocates @var{F} to each
charset.  @var{F} is in the range 0x30..0x7E, but 0x30..0x3F
are only for private use.
 545
 546 Note: @dfn{ECMA} = European Computer Manufacturers Association
 547
There are four @dfn{registers of charsets}, called G0 through G3.
You can designate (or assign) any charset to one of these
registers.

The code space contained within one octet (of size 256) is divided into
4 areas: C0, GL, C1, and GR.  GL and GR are the areas into which a
register of charsets can be invoked.
 555
 556 @example
 557 @group
 558         C0: 0x00 - 0x1F
 559         GL: 0x20 - 0x7F
 560         C1: 0x80 - 0x9F
 561         GR: 0xA0 - 0xFF
 562 @end group
 563 @end example
 564
 565 Usually, in the initial state, G0 is invoked into GL, and G1
 566 is invoked into GR.
 567
 568 ISO 2022 distinguishes 7-bit environments and 8-bit environments.  In
 569 7-bit environments, only C0 and GL are used.
 570
 571 Charset designation is done by escape sequences of the form:
 572
 573 @example
 574         ESC [@var{I}] @var{I} @var{F}
 575 @end example
 576
 577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
 578 @var{F} is the final character identifying this charset.
 579
The meanings of the intermediate characters are:
 581
 582 @example
 583 @group
 584         $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
 585         ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
 586         ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
 587         * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
 588         + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
 589         - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
 590         . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
 591         / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
 592 @end group
 593 @end example
 594
 595 The following rule is not allowed in ISO 2022 but can be used in Mule.
 596
 597 @example
 598         , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
 599 @end example
 600
 601 Here are examples of designations:
 602
 603 @example
 604 @group
 605         ESC ( B :              designate to G0 ASCII
 606         ESC - A :              designate to G1 Latin-1
 607         ESC $ ( A or ESC $ A : designate to G0 GB2312
 608         ESC $ ( B or ESC $ B : designate to G0 JISX0208
 609         ESC $ ) C :            designate to G1 KSC5601
 610 @end group
 611 @end example
 612
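The designation rules above can be summarized in a short sketch that
builds the escape sequence for a charset of a given dimension, size,
register, and final byte.  The function name is hypothetical, the result
is a list of integers rather than a string, and the sketch always emits
the long form (it never produces the short forms such as
@samp{ESC $ B}).  Note that a 96-charset in register 0 yields the
@samp{,} intermediate, which is the Mule extension mentioned above.

@example
@group
;; Hypothetical sketch; not part of XEmacs.
;; 27 = ESC, 36 = $, 40 = ( and 44 = , are the intermediates for G0.
(defun sketch-designation (dimension chars register final)
  "Return the bytes designating a charset to REGISTER (0 to 3)."
  (let ((intermediate (+ (if (= chars 94) 40 44) register)))
    (append (list 27)
            (if (= dimension 2) (list 36))
            (list intermediate final))))
@end group

@group
(sketch-designation 1 94 0 66)    ; ESC ( B, ASCII to G0
     @result{} (27 40 66)
(sketch-designation 1 96 1 65)    ; ESC - A, Latin-1 to G1
     @result{} (27 45 65)
(sketch-designation 2 94 1 67)    ; ESC $ ) C, KSC5601 to G1
     @result{} (27 36 41 67)
@end group
@end example
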
 613 To use a charset designated to G2 or G3, and to use a charset designated
 614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
 615 into GL.  There are two types of invocation, Locking Shift (forever) and
 616 Single Shift (one character only).
 617
 618 Locking Shift is done as follows:
 619
 620 @example
 621         LS0 or SI (0x0F): invoke G0 into GL
 622         LS1 or SO (0x0E): invoke G1 into GL
 623         LS2:  invoke G2 into GL
 624         LS3:  invoke G3 into GL
 625         LS1R: invoke G1 into GR
 626         LS2R: invoke G2 into GR
 627         LS3R: invoke G3 into GR
 628 @end example
 629
 630 Single Shift is done as follows:
 631
 632 @example
 633 @group
 634         SS2 or ESC N: invoke G2 into GL
 635         SS3 or ESC O: invoke G3 into GL
 636 @end group
 637 @end example
 638
 639 (#### Ben says: I think the above is slightly incorrect.  It appears that
 640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
 641 ESC O behave as indicated.  The above definitions will not parse
 642 EUC-encoded text correctly, and it looks like the code in mule-coding.c
 643 has similar problems.)
 644
As you may have realized, there are many ISO-2022-compliant ways of
encoding multilingual text.  Many coding systems in actual use, such as
X11's Compound Text, Japanese JUNET code, and so-called EUC (Extended
UNIX Code), are variants of ISO 2022.
 649
 650 In Mule, we characterize ISO 2022 by the following attributes:
 651
@enumerate
@item
Initial designation to G0 through G3.
@item
Allow the short form of designation (such as @samp{ESC $ B}) for
Japanese and Chinese.
@item
Should we designate ASCII to G0 before control characters?
@item
Should we designate ASCII to G0 at the end of line?
@item
7-bit environment or 8-bit environment.
@item
Use Locking Shift or not.
@item
Use ASCII or JISX0201-1976-Roman.
@item
Use JISX0208-1983 or JISX0208-1978.
@end enumerate
 670
 671 (The last two are only for Japanese.)
 672
 673 By specifying these attributes, you can create any variant
 674 of ISO 2022.
 675
 676 Here are several examples:
 677
 678 @example
 679 @group
 680 junet -- Coding system used in JUNET.
 681         1. G0 <- ASCII, G1..3 <- never used
 682         2. Yes.
 683         3. Yes.
 684         4. Yes.
 685         5. 7-bit environment
 686         6. No.
 687         7. Use ASCII
 688         8. Use JISX0208-1983
 689 @end group
 690
 691 @group
 692 ctext -- Compound Text
 693         1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
 694         2. No.
 695         3. No.
 696         4. Yes.
 697         5. 8-bit environment
 698         6. No.
 699         7. Use ASCII
 700         8. Use JISX0208-1983
 701 @end group
 702
 703 @group
euc-china -- Chinese EUC.  Many people call this
"GB encoding", but that name may cause misunderstanding.
 706         1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
 707         2. No.
 708         3. Yes.
 709         4. Yes.
 710         5. 8-bit environment
 711         6. No.
 712         7. Use ASCII
 713         8. Use JISX0208-1983
 714 @end group
 715
 716 @group
 717 korean-mail -- Coding system used in Korean network.
 718         1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
 719         2. No.
 720         3. Yes.
 721         4. Yes.
 722         5. 7-bit environment
 723         6. Yes.
 724         7. No.
 725         8. No.
 726 @end group
 727 @end example
 728
 729 Mule creates all these coding systems by default.
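
As an illustration of the @code{junet} attributes listed above, the
following breaks down the byte sequence @samp{^[$B!<!+^[(B}, where
@samp{^[} stands for ESC.  This is a reading based on the designation
sequences described in this chapter, not output captured from XEmacs.

@example
@group
ESC $ B    designate JISX0208-1983 to G0 (short form of ESC $ ( B)
! <        first two-byte character, sent as 7-bit bytes 0x21 0x3C
! +        second two-byte character, sent as 7-bit bytes 0x21 0x2B
ESC ( B    designate ASCII to G0 again
@end group
@end example

Everything is designated to G0 here, which is consistent with
attribute 6 (no Locking Shift) and attribute 5 (7-bit environment) in
the @code{junet} description.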
 730
 731 @node Coding Systems
 732 @section Coding Systems
 733
 734 A coding system is an object that defines how text containing multiple
 735 character sets is encoded into a stream of (typically 8-bit) bytes.  The
 736 coding system is used to decode the stream into a series of characters
 737 (which may be from multiple charsets) when the text is read from a file
 738 or process, and is used to encode the text back into the same format
 739 when it is written out to a file or process.
 740
 741 For example, many ISO-2022-compliant coding systems (such as Compound
 742 Text, which is used for inter-client data under the X Window System) use
 743 escape sequences to switch between different charsets -- Japanese Kanji,
 744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
 745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}.  See
 746 @code{make-coding-system} for more information.
 747
 748 Coding systems are normally identified using a symbol, and the symbol is
 749 accepted in place of the actual coding system object whenever a coding
 750 system is called for. (This is similar to how faces and charsets work.)
 751
 752 @defun coding-system-p object
 753 This function returns non-@code{nil} if @var{object} is a coding system.
 754 @end defun
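
A minimal sketch of how a symbol relates to the coding system object it
names.  It assumes that the @code{ctext} coding system described earlier
in this chapter is defined in your XEmacs; substitute any name that
exists on your system.

@example
@group
;; Look up the coding system object named by a symbol.
;; `find-coding-system' returns nil when no such coding system exists,
;; so the `and' form guards the predicate call.
(let ((cs (find-coding-system 'ctext)))
  (and cs (coding-system-p cs)))
@end group
@end example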
 755
 756 @menu
 757 * Coding System Types::               Classifying coding systems.
 758 * EOL Conversion::                    Dealing with different ways of denoting
 759                                         the end of a line.
 760 * Coding System Properties::          Properties of a coding system.
 761 * Basic Coding System Functions::     Working with coding systems.
 762 * Coding System Property Functions::  Retrieving a coding system's properties.
 763 * Encoding and Decoding Text::        Encoding and decoding text.
 764 * Detection of Textual Encoding::     Determining how text is encoded.
 765 * Big5 and Shift-JIS Functions::      Special functions for these non-standard
 766                                         encodings.
 767 @end menu
 768
 769 @node Coding System Types
 770 @subsection Coding System Types
 771
 772 @table @code
 773 @item nil
 774 @itemx autodetect
 775 Automatic conversion.  XEmacs attempts to detect the coding system used
 776 in the file.
 777 @item no-conversion
 778 No conversion.  Use this for binary files and such.  On output, graphic
 779 characters that are not in ASCII or Latin-1 will be replaced by a
 780 @samp{?}. (For a no-conversion-encoded buffer, these characters will
 781 only be present if you explicitly insert them.)
 782 @item shift-jis
 783 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
 784 @item iso2022
 785 Any ISO-2022-compliant encoding.  Among other things, this includes JIS
 786 (the Japanese encoding commonly used for e-mail), national variants of
 787 EUC (the standard Unix encoding for Japanese and other languages), and
Compound Text (an encoding used in X11).  You can give more specific
information about the conversion with the @var{props} argument of
@code{make-coding-system} (@pxref{Coding System Properties}).
 790 @item big5
Big5 (the encoding commonly used for Traditional Chinese, for example in
Taiwan).
 792 @item ccl
 793 The conversion is performed using a user-written pseudo-code program.
 794 CCL (Code Conversion Language) is the name of this pseudo-code.
 795 @item internal
 796 Write out or read in the raw contents of the memory representing the
 797 buffer's text.  This is primarily useful for debugging purposes, and is
 798 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
 799 (the @samp{--debug} configure option).  @strong{Warning}: Reading in a
 800 file using @code{internal} conversion can result in an internal
 801 inconsistency in the memory representing a buffer's text, which will
 802 produce unpredictable results and may cause XEmacs to crash.  Under
 803 normal circumstances you should never use @code{internal} conversion.
 804 @end table
 805
 806 @node EOL Conversion
 807 @subsection EOL Conversion
 808
 809 @table @code
 810 @item nil
 811 Automatically detect the end-of-line type (LF, CRLF, or CR).  Also
 812 generate subsidiary coding systems named @code{@var{name}-unix},
 813 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
 814 this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf},
 815 and @code{cr}, respectively.
 816 @item lf
 817 The end of a line is marked externally using ASCII LF.  Since this is
 818 also the way that XEmacs represents an end-of-line internally,
 819 specifying this option results in no end-of-line conversion.  This is
 820 the standard format for Unix text files.
 821 @item crlf
 822 The end of a line is marked externally using ASCII CRLF.  This is the
 823 standard format for MS-DOS text files.
 824 @item cr
 825 The end of a line is marked externally using ASCII CR.  This is the
 826 standard format for Macintosh text files.
 827 @item t
 828 Automatically detect the end-of-line type but do not generate subsidiary
 829 coding systems.  (This value is converted to @code{nil} when stored
 830 internally, and @code{coding-system-property} will return @code{nil}.)
 831 @end table
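
The subsidiary naming scheme can be sketched as follows.  This assumes
that @code{ctext} exists and was created with an EOL type of @code{nil},
so that its subsidiary coding systems were generated; neither is
guaranteed on every build.

@example
@group
;; Ask for the CRLF variant of a coding system.
(subsidiary-coding-system 'ctext 'crlf)

;; Look the same variant up by its generated name instead.
(find-coding-system 'ctext-dos)
@end group
@end example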
 832
 833 @node Coding System Properties
 834 @subsection Coding System Properties
 835
 836 @table @code
 837 @item mnemonic
 838 String to be displayed in the modeline when this coding system is
 839 active.
 840
 841 @item eol-type
 842 End-of-line conversion to be used.  It should be one of the types
 843 listed in @ref{EOL Conversion}.
 844
 845 @item post-read-conversion
 846 Function called after a file has been read in, to perform the decoding.
 847 Called with two arguments, @var{beg} and @var{end}, denoting a region of
 848 the current buffer to be decoded.
 849
 850 @item pre-write-conversion
 851 Function called before a file is written out, to perform the encoding.
 852 Called with two arguments, @var{beg} and @var{end}, denoting a region of
 853 the current buffer to be encoded.
 854 @end table
 855
 856 The following additional properties are recognized if @var{type} is
 857 @code{iso2022}:
 858
 859 @table @code
 860 @item charset-g0
 861 @itemx charset-g1
 862 @itemx charset-g2
 863 @itemx charset-g3
 864 The character set initially designated to the G0 - G3 registers.
 865 The value should be one of
 866
 867 @itemize @bullet
 868 @item
 869 A charset object (designate that character set)
 870 @item
 871 @code{nil} (do not ever use this register)
 872 @item
 873 @code{t} (no character set is initially designated to the register, but
 874 may be later on; this automatically sets the corresponding
 875 @code{force-g*-on-output} property)
 876 @end itemize
 877
 878 @item force-g0-on-output
 879 @itemx force-g1-on-output
 880 @itemx force-g2-on-output
 881 @itemx force-g3-on-output
 882 If non-@code{nil}, send an explicit designation sequence on output
 883 before using the specified register.
 884
 885 @item short
 886 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
 887 and @samp{ESC $ B} on output in place of the full designation sequences
 888 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
 889
 890 @item no-ascii-eol
 891 If non-@code{nil}, don't designate ASCII to G0 at each end of line on
 892 output.  Setting this to non-@code{nil} also suppresses other
 893 state-resetting that normally happens at the end of a line.
 894
 895 @item no-ascii-cntl
 896 If non-@code{nil}, don't designate ASCII to G0 before control chars on
 897 output.
 898
 899 @item seven
 900 If non-@code{nil}, use 7-bit environment on output.  Otherwise, use 8-bit
 901 environment.
 902
 903 @item lock-shift
 904 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
 905 designation by escape sequence.
 906
 907 @item no-iso6429
 908 If non-@code{nil}, don't use ISO6429's direction specification.
 909
 910 @item escape-quoted
If non-@code{nil}, literal control characters that are the same as the
 912 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
 913 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
 914 and CSI (0x9B)) are ``quoted'' with an escape character so that they can
 915 be properly distinguished from an escape sequence.  (Note that doing
 916 this results in a non-portable encoding.) This encoding flag is used for
 917 byte-compiled files.  Note that ESC is a good choice for a quoting
 918 character because there are no escape sequences whose second byte is a
 919 character from the Control-0 or Control-1 character sets; this is
 920 explicitly disallowed by the ISO 2022 standard.
 921
 922 @item input-charset-conversion
 923 A list of conversion specifications, specifying conversion of characters
 924 in one charset to another when decoding is performed.  Each
 925 specification is a list of two elements: the source charset, and the
 926 destination charset.
 927
 928 @item output-charset-conversion
 929 A list of conversion specifications, specifying conversion of characters
 930 in one charset to another when encoding is performed.  The form of each
 931 specification is the same as for @code{input-charset-conversion}.
 932 @end table
 933
 934 The following additional properties are recognized (and required) if
 935 @var{type} is @code{ccl}:
 936
 937 @table @code
 938 @item decode
 939 CCL program used for decoding (converting to internal format).
 940
 941 @item encode
 942 CCL program used for encoding (converting to external format).
 943 @end table
 944
 945 @node Basic Coding System Functions
 946 @subsection Basic Coding System Functions
 947
 948 @defun find-coding-system coding-system-or-name
 949 This function retrieves the coding system of the given name.
 950
 951 If @var{coding-system-or-name} is a coding-system object, it is simply
 952 returned.  Otherwise, @var{coding-system-or-name} should be a symbol.
 953 If there is no such coding system, @code{nil} is returned.  Otherwise
 954 the associated coding system object is returned.
 955 @end defun
 956
 957 @defun get-coding-system name
 958 This function retrieves the coding system of the given name.  Same as
 959 @code{find-coding-system} except an error is signalled if there is no
 960 such coding system instead of returning @code{nil}.
 961 @end defun
 962
 963 @defun coding-system-list
 964 This function returns a list of the names of all defined coding systems.
 965 @end defun
 966
 967 @defun coding-system-name coding-system
 968 This function returns the name of the given coding system.
 969 @end defun
 970
 971 @defun make-coding-system name type &optional doc-string props
 972 This function registers symbol @var{name} as a coding system.
 973
 974 @var{type} describes the conversion method used and should be one of
 975 the types listed in @ref{Coding System Types}.
 976
 977 @var{doc-string} is a string describing the coding system.
 978
@var{props} is a property list, describing the specific nature of the
coding system.  Recognized properties are as in @ref{Coding System
Properties}.
 982 @end defun
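
The following sketch shows how a property list might be passed to
@code{make-coding-system}.  The name @code{my-7bit-jis} is hypothetical,
and the example assumes that the symbol @code{ascii} names the ASCII
charset; treat it as an illustration of the argument layout rather than
a definition taken from XEmacs itself.

@example
@group
;; `my-7bit-jis' is a hypothetical name chosen for this example.
(make-coding-system
 'my-7bit-jis 'iso2022
 "Hypothetical 7-bit ISO 2022 coding system, similar in spirit to junet."
 '(charset-g0 ascii   ; assumes `ascii' names the ASCII charset
   short t            ; use short designation forms such as ESC $ B
   seven t            ; 7-bit environment on output
   eol-type lf        ; no end-of-line conversion
   mnemonic "My7"))   ; string shown in the modeline
@end group
@end example

Properties that are left out, such as @code{lock-shift}, are simply not
set in this sketch.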
 983
 984 @defun copy-coding-system old-coding-system new-name
 985 This function copies @var{old-coding-system} to @var{new-name}.  If
 986 @var{new-name} does not name an existing coding system, a new one will
 987 be created.
 988 @end defun
 989
 990 @defun subsidiary-coding-system coding-system eol-type
 991 This function returns the subsidiary coding system of
 992 @var{coding-system} with eol type @var{eol-type}.
 993 @end defun
 994
 995 @node Coding System Property Functions
 996 @subsection Coding System Property Functions
 997
 998 @defun coding-system-doc-string coding-system
 999 This function returns the doc string for @var{coding-system}.
1000 @end defun
1001
1002 @defun coding-system-type coding-system
1003 This function returns the type of @var{coding-system}.
1004 @end defun
1005
1006 @defun coding-system-property coding-system prop
1007 This function returns the @var{prop} property of @var{coding-system}.
1008 @end defun
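
For example, the properties of an existing coding system could be
inspected as follows.  The sketch assumes that @code{ctext} is defined;
the values returned depend on how it was created, so none are shown.

@example
@group
(coding-system-type 'ctext)
(coding-system-doc-string 'ctext)
;; Any property from the tables above can be queried by name.
(coding-system-property 'ctext 'eol-type)
(coding-system-property 'ctext 'charset-g1)
@end group
@end example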
1009
1010 @node Encoding and Decoding Text
1011 @subsection Encoding and Decoding Text
1012
1013 @defun decode-coding-region start end coding-system &optional buffer
1014 This function decodes the text between @var{start} and @var{end} which
1015 is encoded in @var{coding-system}.  This is useful if you've read in
1016 encoded text from a file without decoding it (e.g. you read in a
1017 JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
system, so that it shows up as @samp{^[$B!<!+^[(B}).  The length of the
decoded text is returned.  @var{buffer} defaults to the current buffer
1020 if unspecified.
1021 @end defun
1022
1023 @defun encode-coding-region start end coding-system &optional buffer
1024 This function encodes the text between @var{start} and @var{end} using
@var{coding-system}.  This will, for example, convert Japanese
characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use the JIS
1027 encoding.  The length of the encoded text is returned.  @var{buffer}
1028 defaults to the current buffer if unspecified.
1029 @end defun
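
A minimal sketch of both functions applied to the whole current buffer.
It assumes that a coding system named @code{junet} is defined, as
described in the ISO 2022 section; use whatever name your XEmacs
provides for the JIS encoding.

@example
@group
;; Encode the whole current buffer.  The buffer text is replaced
;; in place by its encoded form.
(encode-coding-region (point-min) (point-max) 'junet)

;; Decode it again.  `point-max' is re-evaluated because the
;; encoded text usually has a different length.
(decode-coding-region (point-min) (point-max) 'junet)
@end group
@end example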
1030
1031 @node Detection of Textual Encoding
1032 @subsection Detection of Textual Encoding
1033
1034 @defun coding-category-list
1035 This function returns a list of all recognized coding categories.
1036 @end defun
1037
1038 @defun set-coding-priority-list list
1039 This function changes the priority order of the coding categories.
1040 @var{list} should be a list of coding categories, in descending order of
1041 priority.  Unspecified coding categories will be lower in priority than
1042 all specified ones, in the same relative order they were in previously.
1043 @end defun
1044
1045 @defun coding-priority-list
1046 This function returns a list of coding categories in descending order of
1047 priority.
1048 @end defun
1049
1050 @defun set-coding-category-system coding-category coding-system
1051 This function changes the coding system associated with a coding category.
1052 @end defun
1053
1054 @defun coding-category-system coding-category
1055 This function returns the coding system associated with a coding category.
1056 @end defun
1057
1058 @defun detect-coding-region start end &optional buffer
This function detects the coding system of the text in the region between
@var{start} and @var{end}.  The returned value is a list of possible coding
systems ordered by priority.  If only ASCII characters are found, it
returns @code{autodetect} or one of its subsidiary coding systems
according to the detected end-of-line type.  The optional argument
@var{buffer} defaults to the current buffer.
1065 @end defun
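
The detection functions might be combined as in the following sketch.
The category to promote is taken from the existing priority list, so no
particular category name is assumed.

@example
@group
;; List the candidate coding systems for the whole current buffer.
(detect-coding-region (point-min) (point-max))

;; Promote the currently lowest-priority category to the top, then
;; report the coding system associated with it.
(let ((category (car (reverse (coding-priority-list)))))
  (set-coding-priority-list (list category))
  (coding-category-system category))
@end group
@end example

Because unspecified categories keep their previous relative order,
passing a one-element list is enough to move a single category.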
1066
1067 @node Big5 and Shift-JIS Functions
1068 @subsection Big5 and Shift-JIS Functions
1069
1070 These are special functions for working with the non-standard
1071 Shift-JIS and Big5 encodings.
1072
1073 @defun decode-shift-jis-char code
This function decodes a JISX0208 character encoded in the Shift-JIS
coding system.  @var{code} is the character code in Shift-JIS, given as
a cons of two bytes.  The corresponding character is returned.
1077 @end defun
1078
1079 @defun encode-shift-jis-char ch
This function encodes the JISX0208 character @var{ch} in the Shift-JIS
coding system.  The corresponding character code in Shift-JIS is
returned as a cons of two bytes.
1083 @end defun
1084
1085 @defun decode-big5-char code
This function decodes a character encoded in the Big5 coding system.
@var{code} is the character code in Big5.  The corresponding character
is returned.
1089 @end defun
1090
1091 @defun encode-big5-char ch
This function encodes the Big5 character @var{ch} in the Big5
coding system.  The corresponding character code in Big5 is returned.
1094 @end defun
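
As a sketch, a Shift-JIS code can be decoded and encoded again to check
that the two functions are inverses for that character.  The bytes 130
and 160 (0x82 and 0xA0) are the Shift-JIS code for the Hiragana letter
A; the result of the comparison is not shown because it has not been
run against this version of XEmacs.

@example
@group
(let* ((code (cons 130 160))            ; Shift-JIS bytes 0x82 0xA0
       (ch (decode-shift-jis-char code)))
  ;; Encode the character again and compare with the original cons.
  (equal (encode-shift-jis-char ch) code))
@end group
@end example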
1095
1096 @node CCL
1097 @section CCL
1098
1099 @defun execute-ccl-program ccl-program status
1100 This function executes @var{ccl-program} with registers initialized by
1101 @var{status}.  @var{ccl-program} is a vector of compiled CCL code
1102 created by @code{ccl-compile}.  @var{status} must be a vector of nine
1103 values, specifying the initial value for the R0, R1 .. R7 registers and
1104 for the instruction counter IC.  A @code{nil} value for a register
1105 initializer causes the register to be set to 0.  A @code{nil} value for
1106 the IC initializer causes execution to start at the beginning of the
1107 program.  When the program is done, @var{status} is modified (by
1108 side-effect) to contain the ending values for the corresponding
1109 registers and IC.
1110 @end defun
1111
1112 @defun execute-ccl-program-string ccl-program status str
This function executes @var{ccl-program} with initial @var{status} on
@var{str}.  @var{ccl-program} is a vector of compiled CCL code
1115 created by @code{ccl-compile}.  @var{status} must be a vector of nine
1116 values, specifying the initial value for the R0, R1 .. R7 registers and
1117 for the instruction counter IC.  A @code{nil} value for a register
1118 initializer causes the register to be set to 0.  A @code{nil} value for
1119 the IC initializer causes execution to start at the beginning of the
1120 program.  When the program is done, @var{status} is modified (by
1121 side-effect) to contain the ending values for the corresponding
1122 registers and IC.  Returns the resulting string.
1123 @end defun
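
The handling of the @var{status} vector can be sketched as follows.
The variable @code{my-ccl-program} is hypothetical; it stands for a
vector of compiled CCL code that you have produced with
@code{ccl-compile}.

@example
@group
;; `my-ccl-program' is a hypothetical variable, not part of XEmacs.
(let ((status (make-vector 9 nil)))   ; R0..R7 and IC all default
  (aset status 0 65)                  ; start with R0 = 65
  (execute-ccl-program my-ccl-program status)
  (aref status 0))                    ; final value of R0
@end group
@end example

Elements left as @code{nil} start the registers at 0 and the program at
its beginning, as described above.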
1124
1125 @defun ccl-reset-elapsed-time
This function resets the internal counter that holds the time consumed
by the CCL interpreter.
1128 @end defun
1129
1130 @defun ccl-elapsed-time
This function returns the time consumed by the CCL interpreter as a cons
of user and system time.  This measures processor time, not real time.
1133 Both values are floating point numbers measured in seconds.  If only one
1134 overall value can be determined, the return value will be a cons of that
1135 value and 0.
1136 @end defun
1137
1138 @node Category Tables
1139 @section Category Tables
1140
1141   A category table is a type of char table used for keeping track of
1142 categories.  Categories are used for classifying characters for use in
1143 regexps -- you can refer to a category rather than having to use a
1144 complicated [] expression (and category lookups are significantly
1145 faster).
1146
1147   There are 95 different categories available, one for each printable
1148 character (including space) in the ASCII charset.  Each category is
1149 designated by one such character, called a @dfn{category designator}.
1150 They are specified in a regexp using the syntax @samp{\cX}, where X is a
1151 category designator. (This is not yet implemented.)
1152
1153   A category table specifies, for each character, the categories that
1154 the character is in.  Note that a character can be in more than one
1155 category.  More specifically, a category table maps from a character to
1156 either the value @code{nil} (meaning the character is in no categories)
1157 or a 95-element bit vector, specifying for each of the 95 categories
1158 whether the character is in that category.
1159
1160   Special Lisp functions are provided that abstract this, so you do not
1161 have to directly manipulate bit vectors.
1162
1163 @defun category-table-p obj
This function returns @code{t} if @var{obj} is a category table.
1165 @end defun
1166
1167 @defun category-table &optional buffer
1168 This function returns the current category table.  This is the one
1169 specified by the current buffer, or by @var{buffer} if it is
1170 non-@code{nil}.
1171 @end defun
1172
1173 @defun standard-category-table
1174 This function returns the standard category table.  This is the one used
1175 for new buffers.
1176 @end defun
1177
1178 @defun copy-category-table &optional table
This function constructs a new category table and returns it.  It is a
copy of @var{table}, which defaults to the standard category table.
1181 @end defun
1182
1183 @defun set-category-table table &optional buffer
This function selects a new category table for @var{buffer}.
@var{table} must be a category table.  @var{buffer} defaults to the
current buffer if omitted.
1187 @end defun
1188
1189 @defun category-designator-p obj
This function returns @code{t} if @var{obj} is a category designator (a
1191 char in the range @samp{' '} to @samp{'~'}).
1192 @end defun
1193
1194 @defun category-table-value-p obj
This function returns @code{t} if @var{obj} is a category table value.
1196 Valid values are @code{nil} or a bit vector of size 95.
1197 @end defun
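
A minimal sketch that uses only the functions documented above.  It
gives the current buffer a private copy of the standard category table
and then applies the predicates; return values are not shown because
they depend on the XEmacs build.

@example
@group
;; Give the current buffer its own copy of the standard category table.
(let ((table (copy-category-table)))
  (set-category-table table)
  (category-table-p (category-table)))

;; Category designators are the printable ASCII characters.
(category-designator-p ?a)

;; nil is a valid category table value (no categories).
(category-table-value-p nil)
@end group
@end example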
1198