   1 This is ../info/lispref.info, produced by makeinfo version 4.0 from
   2 lispref/lispref.texi.
   3
   4 INFO-DIR-SECTION XEmacs Editor
   5 START-INFO-DIR-ENTRY
   6 * Lispref: (lispref).           XEmacs Lisp Reference Manual.
   7 END-INFO-DIR-ENTRY
   8
   9    Edition History:
  10
  11    GNU Emacs Lisp Reference Manual Second Edition (v2.01), May 1993 GNU
  12 Emacs Lisp Reference Manual Further Revised (v2.02), August 1993 Lucid
  13 Emacs Lisp Reference Manual (for 19.10) First Edition, March 1994
  14 XEmacs Lisp Programmer's Manual (for 19.12) Second Edition, April 1995
  15 GNU Emacs Lisp Reference Manual v2.4, June 1995 XEmacs Lisp
  16 Programmer's Manual (for 19.13) Third Edition, July 1995 XEmacs Lisp
  17 Reference Manual (for 19.14 and 20.0) v3.1, March 1996 XEmacs Lisp
  18 Reference Manual (for 19.15 and 20.1, 20.2, 20.3) v3.2, April, May,
  19 November 1997 XEmacs Lisp Reference Manual (for 21.0) v3.3, April 1998
  20
  21    Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995 Free Software
  22 Foundation, Inc.  Copyright (C) 1994, 1995 Sun Microsystems, Inc.
  23 Copyright (C) 1995, 1996 Ben Wing.
  24
  25    Permission is granted to make and distribute verbatim copies of this
  26 manual provided the copyright notice and this permission notice are
  27 preserved on all copies.
  28
  29    Permission is granted to copy and distribute modified versions of
  30 this manual under the conditions for verbatim copying, provided that the
  31 entire resulting derived work is distributed under the terms of a
  32 permission notice identical to this one.
  33
  34    Permission is granted to copy and distribute translations of this
  35 manual into another language, under the above conditions for modified
  36 versions, except that this permission notice may be stated in a
  37 translation approved by the Foundation.
  38
  39    Permission is granted to copy and distribute modified versions of
  40 this manual under the conditions for verbatim copying, provided also
  41 that the section entitled "GNU General Public License" is included
  42 exactly as in the original, and provided that the entire resulting
  43 derived work is distributed under the terms of a permission notice
  44 identical to this one.
  45
  46    Permission is granted to copy and distribute translations of this
  47 manual into another language, under the above conditions for modified
  48 versions, except that the section entitled "GNU General Public License"
  49 may be included in a translation approved by the Free Software
  50 Foundation instead of in the original English.
  51
  52 \1f
  53 File: lispref.info,  Node: Internationalization Terminology,  Next: Charsets,  Up: MULE
  54
  55 Internationalization Terminology
  56 ================================
  57
  58    In internationalization terminology, a string of text is divided up
  59 into "characters", which are the printable units that make up the text.
  60 A single character is (for example) a capital `A', the number `2', a
  61 Katakana character, a Hangul character, a Kanji ideograph (an
  62 "ideograph" is a "picture" character, such as is used in Japanese
  63 Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
  64 of such ideographs in each language), etc.  The basic property of a
  65 character is that it is the smallest unit of text with semantic
  66 significance in text processing.
  67
  68    Human beings normally process text visually, so to a first
  69 approximation a character may be identified with its shape.  Note that
  70 the same character may be drawn by two different people (or in two
  71 different fonts) in slightly different ways, although the "basic shape"
  72 will be the same.  But consider the works of Scott Kim; human beings
  73 can recognize hugely variant shapes as the "same" character.
  74 Sometimes, especially where characters are extremely complicated to
  75 write, completely different shapes may be defined as the "same"
  76 character in national standards.  The Taiwanese variant of Hanzi is
  77 generally the most complicated; over the centuries, the Japanese,
  78 Koreans, and the People's Republic of China have adopted
  79 simplifications of the shape, but the line of descent from the original
  80 shape is recorded, and the meanings and pronunciation of different
  81 forms of the same character are considered to be identical within each
  82 language.  (Of course, it may take a specialist to recognize the
  83 related form; the point is that the relations are standardized, despite
  84 the differing shapes.)
  85
  86    In some cases, the differences will be significant enough that it is
  87 actually possible to identify two or more distinct shapes that both
  88 represent the same character.  For example, the lowercase letters `a'
  89 and `g' each have two distinct possible shapes--the `a' can optionally
  90 have a curved tail projecting off the top, and the `g' can be formed
  91 either of two loops, or of one loop and a tail hanging off the bottom.
  92 Such distinct possible shapes of a character are called "glyphs".  The
  93 important characteristic of two glyphs making up the same character is
  94 that the choice between one or the other is purely stylistic and has no
  95 linguistic effect on a word (this is the reason why a capital `A' and
  96 lowercase `a' are different characters rather than different
  97 glyphs--e.g.  `Aspen' is a city while `aspen' is a kind of tree).
  98
  99    Note that "character" and "glyph" are used differently here than
 100 elsewhere in XEmacs.
 101
 102    A "character set" is essentially a set of related characters.  ASCII,
 103 for example, is a set of 94 characters (or 128, if you count
 104 non-printing characters).  Other character sets are ISO8859-1 (ASCII
 105 plus various accented characters and other international symbols), JIS
 106 X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208
 107 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji),
 108 GB2312 (Mainland Chinese Hanzi), etc.
 109
 110    The definition of a character set will implicitly or explicitly give
 111 it an "ordering", a way of assigning a number to each character in the
 112 set.  For many character sets, there is a natural ordering, for example
 113 the "ABC" ordering of the Roman letters.  But it is not clear whether
 114 digits should come before or after the letters, and in fact different
 115 European languages treat the ordering of accented characters
 116 differently.  It is useful to use the natural order where available, of
 117 course.  The number assigned to any particular character is called the
 118 character's "code point".  (Within a given character set, each
 119 character has a unique code point.  Thus the word "set" is ill-chosen;
 120 different orderings of the same characters are different character sets.
 121 Identifying characters is simple enough for alphabetic character sets,
 122 but the difference in ordering can cause great headaches when the same
 123 thousands of characters are used by different cultures as in the Hanzi.)
 124
 125    A code point may be broken into a number of "position codes".  The
 126 number of position codes required to index a particular character in a
 127 character set is called the "dimension" of the character set.  For
 128 practical purposes, a position code may be thought of as a byte-sized
 129 index.  The printing characters of ASCII, being a relatively small
  130 character set, form a set of dimension one, and each character in it is
 131 indexed using a single position code, in the range 1 through 94.  Use of
 132 this unusual range, rather than the familiar 33 through 126, is an
 133 intentional abstraction; to understand the programming issues you must
 134 break the equation between character sets and encodings.
 135
 136    JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
  137 of dimension two--every character is indexed by two position codes,
 138 each in the range 1 through 94.  (This number "94" is not a
 139 coincidence; we shall see that the JIS position codes were chosen so
 140 that JIS kanji could be encoded without using codes that in ASCII are
 141 associated with device control functions.)  Note that the choice of the
 142 range here is somewhat arbitrary.  You could just as easily index the
 143 printing characters in ASCII using numbers in the range 0 through 93, 2
 144 through 95, 3 through 96, etc.  In fact, the standardized _encoding_
 145 for the ASCII _character set_ uses the range 33 through 126.
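
   As a rough illustration of dimension and position codes, the
following Python sketch splits a zero-based character index into
position codes in the range 1 through 94 and combines them again.
Python is used only because it runs outside XEmacs; the function names
and the row-major ordering are assumptions made for this example, not
part of any standard.

```python
def position_codes(index, dimension, chars=94):
    """Split a zero-based character index into `dimension` position codes.

    Each position code lies in the range 1..chars.  Row-major ordering
    is an assumption of this sketch.
    """
    if not 0 <= index < chars ** dimension:
        raise ValueError("index out of range for this character set")
    codes = []
    for _ in range(dimension):
        index, remainder = divmod(index, chars)
        codes.append(remainder + 1)      # position codes start at 1
    return tuple(reversed(codes))


def character_index(codes, chars=94):
    """Inverse of position_codes: combine position codes into an index."""
    index = 0
    for code in codes:
        if not 1 <= code <= chars:
            raise ValueError("position code out of range")
        index = index * chars + (code - 1)
    return index
```

   A set of dimension one needs a single code per character, while a
94x94 set such as JIS X 0208 needs two.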
 146
 147    An "encoding" is a way of numerically representing characters from
  148 one or more character sets as a stream of like-sized numerical values
 149 called "words"; typically these are 8-bit, 16-bit, or 32-bit
 150 quantities.  If an encoding encompasses only one character set, then the
 151 position codes for the characters in that character set could be used
 152 directly.  (This is the case with the trivial cipher used by children,
 153 assigning 1 to `A', 2 to `B', and so on.)  However, even with ASCII,
 154 other considerations intrude.  For example, why are the upper- and
  155 lowercase alphabets separated by 6 characters?  Why do the digits start
  156 with `0' being assigned the code 48?  In both cases, the reason is that
  157 semantically interesting operations (case conversion and numerical
  158 value extraction) become convenient masking operations.  Other
  159 artificial aspects (the control characters being assigned to codes
  160 0-31 and 127) are historical accidents.  (The use of 127 for `DEL' is
  161 an artifact of the "punch once" nature of paper tape, for example.)
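
   The masking operations mentioned above can be sketched in Python as
follows.  The helper names are hypothetical.  The sketch relies only on
the ASCII layout: corresponding upper- and lowercase letters differ in
the single bit 0x20, and the low four bits of a digit's code are its
value.

```python
def toggle_case(ch):
    """Flip the case of one ASCII letter by toggling bit 0x20."""
    if len(ch) != 1 or not (ch.isascii() and ch.isalpha()):
        raise ValueError("expected a single ASCII letter")
    return chr(ord(ch) ^ 0x20)


def digit_value(ch):
    """Numeric value of one ASCII digit: the low four bits of its code."""
    if len(ch) != 1 or not "0" <= ch <= "9":
        raise ValueError("expected a single ASCII digit")
    return ord(ch) & 0x0F
```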
 162
 163    Naive use of the position code is not possible, however, if more than
 164 one character set is to be used in the encoding.  For example, printed
  165 Japanese text typically requires characters from multiple character
  166 sets--ASCII, JIS X 0208, and JIS X 0212, to be specific.  Each of these is
 167 indexed using one or more position codes in the range 1 through 94, so
 168 the position codes could not be used directly or there would be no way
 169 to tell which character was meant.  Different Japanese encodings handle
  170 this differently--JIS uses special escape characters to denote
 171 different character sets; EUC sets the high bit of the position codes
 172 for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
 173 JIS X 0212 character; etc.  (JIS, EUC, and most of the other encodings
 174 you will encounter in files are 7-bit or 8-bit encodings.  There is one
 175 common 16-bit encoding, which is Unicode; this strives to represent all
 176 the world's characters in a single large character set.  32-bit
 177 encodings are often used internally in programs, such as XEmacs with
 178 MULE support, to simplify the code that manipulates them; however, they
 179 are not used externally because they are not very space-efficient.)
 180
 181    A general method of handling text using multiple character sets
 182 (whether for multilingual text, or simply text in an extremely
 183 complicated single language like Japanese) is defined in the
 184 international standard ISO 2022.  ISO 2022 will be discussed in more
 185 detail later (*note ISO 2022::), but for now suffice it to say that text
 186 needs control functions (at least spacing), and if escape sequences are
 187 to be used, an escape sequence introducer.  It was decided to make all
 188 text streams compatible with ASCII in the sense that the codes 0-31
 189 (and 128-159) would always be control codes, never graphic characters,
 190 and where defined by the character set the `SPC' character would be
 191 assigned code 32, and `DEL' would be assigned 127.  Thus there are 94
 192 code points remaining if 7 bits are used.  This is the reason that most
 193 character sets are defined using position codes in the range 1 through
 194 94.  Then ISO 2022 compatible encodings are produced by shifting the
 195 position codes 1 to 94 into character codes 33 to 126, or (if 8 bit
 196 codes are available) into character codes 161 to 254.
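
   The shift described above amounts to adding a fixed offset to each
position code.  A minimal Python sketch follows; the names
`GL_OFFSET', `GR_OFFSET', and `to_character_codes' are made up for
this example.

```python
GL_OFFSET = 0x20  # 32: position codes 1..94 become codes 33..126
GR_OFFSET = 0xA0  # 160: position codes 1..94 become codes 161..254


def to_character_codes(position_codes, eight_bit=False):
    """Shift position codes (1..94) into 7-bit or 8-bit character codes."""
    offset = GR_OFFSET if eight_bit else GL_OFFSET
    for code in position_codes:
        if not 1 <= code <= 94:
            raise ValueError("position code out of range 1..94")
    return bytes(code + offset for code in position_codes)
```

   For instance, position code 33 in the 7-bit form is the byte 65,
which is ASCII `A'.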
 197
 198    Encodings are classified as either "modal" or "non-modal".  In a
 199 "modal encoding", there are multiple states that the encoding can be
 200 in, and the interpretation of the values in the stream depends on the
 201 current global state of the encoding.  Special values in the encoding,
 202 called "escape sequences", are used to change the global state.  JIS,
 203 for example, is a modal encoding.  The bytes `ESC $ B' indicate that,
 204 from then on, bytes are to be interpreted as position codes for JIS X
 205 0208, rather than as ASCII.  This effect is cancelled using the bytes
 206 `ESC ( B', which mean "switch from whatever the current state is to
  207 ASCII".  To switch to JIS X 0212, the escape sequence is `ESC $ ( D'.
 208 (Note that here, as is common, the escape sequences do in fact begin
 209 with `ESC'.  This is not necessarily the case, however.  Some encodings
 210 use control characters called "locking shifts" (effect persists until
 211 cancelled) to switch character sets.)
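
   The following Python sketch shows why a modal stream has to be read
sequentially.  It is a toy decoder, not the XEmacs implementation: it
knows only the three escape sequences named above, handles only graphic
bytes in the range 33 through 126, and returns position codes rather
than characters.  All names are hypothetical.

```python
ESC = 0x1B

# Escape sequences named in the text, mapped to (charset, dimension).
ESCAPES = {
    b"\x1b(B": ("ascii", 1),
    b"\x1b$B": ("jisx0208", 2),
    b"\x1b$(D": ("jisx0212", 2),
}


def decode_modal(data):
    """Return a list of (charset, position_codes) pairs for a JIS-like stream."""
    charset, dimension = "ascii", 1       # assumed initial state
    result = []
    i = 0
    while i < len(data):
        if data[i] == ESC:
            for sequence, state in ESCAPES.items():
                if data.startswith(sequence, i):
                    charset, dimension = state   # change the global state
                    i += len(sequence)
                    break
            else:
                raise ValueError("unknown escape sequence at offset %d" % i)
            continue
        chunk = data[i:i + dimension]
        if len(chunk) < dimension:
            raise ValueError("truncated character at offset %d" % i)
        if any(not 0x21 <= b <= 0x7E for b in chunk):
            raise ValueError("only graphic bytes 33..126 are handled")
        result.append((charset, tuple(b - 0x20 for b in chunk)))
        i += dimension
    return result
```

   Decoding `b"A\x1b$B0A\x1b(BA"' with this sketch reports the byte
0x41 once as an ASCII character and once as the second position code
of a JIS X 0208 character; only the preceding escape sequences tell
the two apart.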
 212
 213    A "non-modal encoding" has no global state that extends past the
 214 character currently being interpreted.  EUC, for example, is a
 215 non-modal encoding.  Characters in JIS X 0208 are encoded by setting
 216 the high bit of the position codes, and characters in JIS X 0212 are
 217 encoded by doing the same but also prefixing the character with the
 218 byte 0x8F.
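
   As a sketch of the EUC scheme just described, the Python functions
below encode a character from its charset and position codes, and
identify the charset from the first byte alone.  The charset labels and
function names are assumptions for illustration, and only the three
charsets discussed here are covered.

```python
def euc_encode(charset, position_codes):
    """Encode one character, given as position codes in 1..94, in EUC."""
    if any(not 1 <= code <= 94 for code in position_codes):
        raise ValueError("position code out of range 1..94")
    if charset == "ascii":
        (code,) = position_codes
        return bytes([code + 0x20])
    if charset in ("jisx0208", "jisx0212"):
        first, second = position_codes
        # Adding 0xA0 is the 7-bit shift (0x20) plus the high bit (0x80).
        body = bytes([first + 0xA0, second + 0xA0])
        return b"\x8f" + body if charset == "jisx0212" else body
    raise ValueError("unsupported charset: %r" % (charset,))


def euc_charset_of(lead_byte):
    """Identify the charset from the first byte of a character alone."""
    if lead_byte < 0x80:
        return "ascii"
    if lead_byte == 0x8F:
        return "jisx0212"
    if 0xA1 <= lead_byte <= 0xFE:
        return "jisx0208"
    raise ValueError("byte not handled by this sketch: 0x%02X" % lead_byte)
```

   No state is carried from one character to the next: the lead byte
is enough to decide how the following bytes are to be read.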
 219
 220    The advantage of a modal encoding is that it is generally more
 221 space-efficient, and is easily extendable because there are essentially
 222 an arbitrary number of escape sequences that can be created.  The
 223 disadvantage, however, is that it is much more difficult to work with
 224 if it is not being processed in a sequential manner.  In the non-modal
 225 EUC encoding, for example, the byte 0x41 always refers to the letter
 226 `A'; whereas in JIS, it could either be the letter `A', or one of the
 227 two position codes in a JIS X 0208 character, or one of the two
 228 position codes in a JIS X 0212 character.  Determining exactly which
 229 one is meant could be difficult and time-consuming if the previous
 230 bytes in the string have not already been processed, or impossible if
 231 they are drawn from an external stream that cannot be rewound.
 232
 233    Non-modal encodings are further divided into "fixed-width" and
 234 "variable-width" formats.  A fixed-width encoding always uses the same
 235 number of words per character, whereas a variable-width encoding does
 236 not.  EUC is a good example of a variable-width encoding: one to three
 237 bytes are used per character, depending on the character set.  16-bit
 238 and 32-bit encodings are nearly always fixed-width, and this is in fact
 239 one of the main reasons for using an encoding with a larger word size.
  240 With a fixed-width encoding, the Nth character can be located by
  241 simple arithmetic.  Variable-width encodings, for their part, are
  242 generally more space-efficient and compatible with existing 8-bit
 243 encodings such as ASCII.  (For example, in Unicode ASCII characters are
 244 simply promoted to a 16-bit representation.  That means that every
 245 ASCII character contains a `NUL' byte; evidently all of the standard
 246 string manipulation functions will lose badly in a fixed-width Unicode
 247 environment.)
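
   Both points can be tried out in Python.  In this sketch, Python's
`utf-16-le' codec stands in for the 16-bit Unicode representation
mentioned above (an assumption; the text does not name a specific byte
order), and `c_strlen' is a hypothetical helper that mimics a
NUL-terminated string function.

```python
def c_strlen(data):
    """Length as a NUL-terminated string routine would see it."""
    end = data.find(b"\x00")
    return len(data) if end < 0 else end


def nth_char_fixed_width(data, n, width=2):
    """Fetch the Nth character of a fixed-width stream by arithmetic alone."""
    return data[n * width:(n + 1) * width]


narrow = "ABC".encode("ascii")       # one byte per character
wide = "ABC".encode("utf-16-le")     # two bytes per character

# Every ASCII character now carries a NUL byte, so a byte-oriented
# string routine stops after the first character.
narrow_length = c_strlen(narrow)
wide_length = c_strlen(wide)

# Fixed width makes random access trivial.
second = nth_char_fixed_width(wide, 1).decode("utf-16-le")
```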
 248
 249    The bytes in an 8-bit encoding are often referred to as "octets"
 250 rather than simply as bytes.  This terminology dates back to the days
 251 before 8-bit bytes were universal, when some computers had 9-bit bytes,
 252 others had 10-bit bytes, etc.
 253
 254 \1f
 255 File: lispref.info,  Node: Charsets,  Next: MULE Characters,  Prev: Internationalization Terminology,  Up: MULE
 256
 257 Charsets
 258 ========
 259
 260    A "charset" in MULE is an object that encapsulates a particular
 261 character set as well as an ordering of those characters.  Charsets are
 262 permanent objects and are named using symbols, like faces.
 263
 264  - Function: charsetp object
 265      This function returns non-`nil' if OBJECT is a charset.
 266
 267 * Menu:
 268
 269 * Charset Properties::          Properties of a charset.
 270 * Basic Charset Functions::     Functions for working with charsets.
 271 * Charset Property Functions::  Functions for accessing charset properties.
 272 * Predefined Charsets::         Predefined charset objects.
 273
 274 \1f
 275 File: lispref.info,  Node: Charset Properties,  Next: Basic Charset Functions,  Up: Charsets
 276
 277 Charset Properties
 278 ------------------
 279
 280    Charsets have the following properties:
 281
 282 `name'
 283      A symbol naming the charset.  Every charset must have a different
 284      name; this allows a charset to be referred to using its name
 285      rather than the actual charset object.
 286
 287 `doc-string'
 288      A documentation string describing the charset.
 289
 290 `registry'
 291      A regular expression matching the font registry field for this
 292      character set.  For example, both the `ascii' and `latin-iso8859-1'
 293      charsets use the registry `"ISO8859-1"'.  This field is used to
 294      choose an appropriate font when the user gives a general font
 295      specification such as `-*-courier-medium-r-*-140-*', i.e. a
 296      14-point upright medium-weight Courier font.
 297
 298 `dimension'
 299      Number of position codes used to index a character in the
 300      character set.  XEmacs/MULE can only handle character sets of
 301      dimension 1 or 2.  This property defaults to 1.
 302
 303 `chars'
 304      Number of characters in each dimension.  In XEmacs/MULE, the only
 305      allowed values are 94 or 96. (There are a couple of pre-defined
 306      character sets, such as ASCII, that do not follow this, but you
 307      cannot define new ones like this.) Defaults to 94.  Note that if
 308      the dimension is 2, the character set thus described is 94x94 or
 309      96x96.
 310
 311 `columns'
 312      Number of columns used to display a character in this charset.
 313      Only used in TTY mode. (Under X, the actual width of a character
 314      can be derived from the font used to display the characters.)  If
 315      unspecified, defaults to the dimension. (This is almost always the
 316      correct value, because character sets with dimension 2 are usually
 317      ideograph character sets, which need two columns to display the
 318      intricate ideographs.)
 319
 320 `direction'
 321      A symbol, either `l2r' (left-to-right) or `r2l' (right-to-left).
 322      Defaults to `l2r'.  This specifies the direction that the text
 323      should be displayed in, and will be left-to-right for most
 324      charsets but right-to-left for Hebrew and Arabic. (Right-to-left
 325      display is not currently implemented.)
 326
 327 `final'
 328      Final byte of the standard ISO 2022 escape sequence designating
 329      this charset.  Must be supplied.  Each combination of (DIMENSION,
 330      CHARS) defines a separate namespace for final bytes, and each
 331      charset within a particular namespace must have a different final
 332      byte.  Note that ISO 2022 restricts the final byte to the range
 333      0x30 - 0x7E if dimension == 1, and 0x30 - 0x5F if dimension == 2.
 334      Note also that final bytes in the range 0x30 - 0x3F are reserved
 335      for user-defined (not official) character sets.  For more
 336      information on ISO 2022, see *Note Coding Systems::.
 337
 338 `graphic'
 339      0 (use left half of font on output) or 1 (use right half of font on
 340      output).  Defaults to 0.  This specifies how to convert the
 341      position codes that index a character in a character set into an
 342      index into the font used to display the character set.  With
 343      `graphic' set to 0, position codes 33 through 126 map to font
 344      indices 33 through 126; with it set to 1, position codes 33
 345      through 126 map to font indices 161 through 254 (i.e. the same
 346      number but with the high bit set).  For example, for a font whose
 347      registry is ISO8859-1, the left half of the font (octets 0x20 -
 348      0x7F) is the `ascii' charset, while the right half (octets 0xA0 -
 349      0xFF) is the `latin-iso8859-1' charset.
 350
 351 `ccl-program'
 352      A compiled CCL program used to convert a character in this charset
 353      into an index into the font.  This is in addition to the `graphic'
 354      property.  If a CCL program is defined, the position codes of a
 355      character will first be processed according to `graphic' and then
 356      passed through the CCL program, with the resulting values used to
 357      index the font.
 358
 359      This is used, for example, in the Big5 character set (used in
 360      Taiwan).  This character set is not ISO-2022-compliant, and its
 361      size (94x157) does not fit within the maximum 96x96 size of
 362      ISO-2022-compliant character sets.  As a result, XEmacs/MULE
 363      splits it (in a rather complex fashion, so as to group the most
 364      commonly used characters together) into two charset objects
  365      (`chinese-big5-1' and `chinese-big5-2'), each of size 94x94, and
  366      each charset object uses a CCL program to convert the modified
  367      position codes back into standard Big5 indices to retrieve a
  368      character from a Big5 font.
 369
 370    Most of the above properties can only be set when the charset is
 371 initialized, and cannot be changed later.  *Note Charset Property
 372 Functions::.
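
   To show how the `dimension', `chars', and `final' properties relate
to an ISO 2022 escape sequence, here is a small Python sketch.  It
covers only the designation forms quoted elsewhere in this chapter
(`ESC ( F' for a 94-character set, `ESC - F' for a 96-character set,
with `$' inserted for dimension 2); other forms permitted by ISO 2022
are deliberately left out, and the function name is hypothetical.

```python
ESC = b"\x1b"


def designation_sequence(dimension, chars, final):
    """Build an escape sequence designating a charset with these properties."""
    if dimension not in (1, 2):
        raise ValueError("dimension must be 1 or 2")
    if chars not in (94, 96):
        raise ValueError("chars must be 94 or 96")
    if len(final) != 1 or not 0x30 <= ord(final) <= 0x7E:
        raise ValueError("final byte must lie in 0x30..0x7E")
    sequence = ESC
    if dimension == 2:
        sequence += b"$"                          # multi-byte charset
    sequence += b"(" if chars == 94 else b"-"     # 94-set vs. 96-set
    return sequence + final.encode("ascii")
```

   With the properties of `ascii' (dimension 1, 94 characters, final
byte `B') this yields `ESC ( B'; changing only the dimension to 2
yields `ESC $ ( B', the sequence for `japanese-jisx0208'.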
 373
 374 \1f
 375 File: lispref.info,  Node: Basic Charset Functions,  Next: Charset Property Functions,  Prev: Charset Properties,  Up: Charsets
 376
 377 Basic Charset Functions
 378 -----------------------
 379
 380  - Function: find-charset charset-or-name
 381      This function retrieves the charset of the given name.  If
 382      CHARSET-OR-NAME is a charset object, it is simply returned.
 383      Otherwise, CHARSET-OR-NAME should be a symbol.  If there is no
 384      such charset, `nil' is returned.  Otherwise the associated charset
 385      object is returned.
 386
 387  - Function: get-charset name
 388      This function retrieves the charset of the given name.  Same as
 389      `find-charset' except an error is signalled if there is no such
 390      charset instead of returning `nil'.
 391
 392  - Function: charset-list
 393      This function returns a list of the names of all defined charsets.
 394
 395  - Function: make-charset name doc-string props
 396      This function defines a new character set.  This function is for
 397      use with MULE support.  NAME is a symbol, the name by which the
 398      character set is normally referred.  DOC-STRING is a string
 399      describing the character set.  PROPS is a property list,
 400      describing the specific nature of the character set.  The
 401      recognized properties are `registry', `dimension', `columns',
 402      `chars', `final', `graphic', `direction', and `ccl-program', as
 403      previously described.
 404
 405  - Function: make-reverse-direction-charset charset new-name
 406      This function makes a charset equivalent to CHARSET but which goes
 407      in the opposite direction.  NEW-NAME is the name of the new
 408      charset.  The new charset is returned.
 409
 410  - Function: charset-from-attributes dimension chars final &optional
 411           direction
 412      This function returns a charset with the given DIMENSION, CHARS,
 413      FINAL, and DIRECTION.  If DIRECTION is omitted, both directions
 414      will be checked (left-to-right will be returned if character sets
 415      exist for both directions).
 416
 417  - Function: charset-reverse-direction-charset charset
 418      This function returns the charset (if any) with the same dimension,
 419      number of characters, and final byte as CHARSET, but which is
 420      displayed in the opposite direction.
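
   The lookup performed by `charset-from-attributes' and
`charset-reverse-direction-charset' can be modelled outside XEmacs
with an ordinary table.  The Python sketch below is such a model, not
the actual implementation: it holds a handful of rows taken from the
predefined charsets, and returning `None' when nothing matches is an
assumption.

```python
# (dimension, chars, final, direction) -> charset name
CHARSETS = {
    (1, 94, "B", "l2r"): "ascii",
    (1, 94, "B", "r2l"): "ascii-r2l",
    (1, 96, "L", "l2r"): "cyrillic-iso8859-5",
    (1, 96, "H", "r2l"): "hebrew-iso8859-8",
    (2, 94, "B", "l2r"): "japanese-jisx0208",
}


def charset_from_attributes(dimension, chars, final, direction=None):
    """Look up a charset; with no direction, prefer left-to-right."""
    directions = ("l2r", "r2l") if direction is None else (direction,)
    for candidate in directions:
        name = CHARSETS.get((dimension, chars, final, candidate))
        if name is not None:
            return name
    return None


def reverse_direction_charset(name):
    """Find the charset with the same attributes but opposite direction."""
    for (dimension, chars, final, direction), known in CHARSETS.items():
        if known == name:
            other = "r2l" if direction == "l2r" else "l2r"
            return CHARSETS.get((dimension, chars, final, other))
    return None
```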
 421
 422 \1f
 423 File: lispref.info,  Node: Charset Property Functions,  Next: Predefined Charsets,  Prev: Basic Charset Functions,  Up: Charsets
 424
 425 Charset Property Functions
 426 --------------------------
 427
 428    All of these functions accept either a charset name or charset
 429 object.
 430
 431  - Function: charset-property charset prop
 432      This function returns property PROP of CHARSET.  *Note Charset
 433      Properties::.
 434
 435    Convenience functions are also provided for retrieving individual
 436 properties of a charset.
 437
 438  - Function: charset-name charset
 439      This function returns the name of CHARSET.  This will be a symbol.
 440
 441  - Function: charset-doc-string charset
 442      This function returns the doc string of CHARSET.
 443
 444  - Function: charset-registry charset
 445      This function returns the registry of CHARSET.
 446
 447  - Function: charset-dimension charset
 448      This function returns the dimension of CHARSET.
 449
 450  - Function: charset-chars charset
 451      This function returns the number of characters per dimension of
 452      CHARSET.
 453
 454  - Function: charset-columns charset
 455      This function returns the number of display columns per character
 456      (in TTY mode) of CHARSET.
 457
 458  - Function: charset-direction charset
 459      This function returns the display direction of CHARSET--either
 460      `l2r' or `r2l'.
 461
 462  - Function: charset-final charset
 463      This function returns the final byte of the ISO 2022 escape
 464      sequence designating CHARSET.
 465
 466  - Function: charset-graphic charset
 467      This function returns either 0 or 1, depending on whether the
 468      position codes of characters in CHARSET map to the left or right
 469      half of their font, respectively.
 470
 471  - Function: charset-ccl-program charset
 472      This function returns the CCL program, if any, for converting
 473      position codes of characters in CHARSET into font indices.
 474
 475    The only property of a charset that can currently be set after the
 476 charset has been created is the CCL program.
 477
 478  - Function: set-charset-ccl-program charset ccl-program
 479      This function sets the `ccl-program' property of CHARSET to
 480      CCL-PROGRAM.
 481
 482 \1f
 483 File: lispref.info,  Node: Predefined Charsets,  Prev: Charset Property Functions,  Up: Charsets
 484
 485 Predefined Charsets
 486 -------------------
 487
 488    The following charsets are predefined in the C code.
 489
  490      Name                     Type  Fi Gr Dir Registry
 491      --------------------------------------------------------------
 492      ascii                    94    B  0  l2r ISO8859-1
 493      control-1                94       0  l2r ---
  494      latin-iso8859-1          96    A  1  l2r ISO8859-1
 495      latin-iso8859-2          96    B  1  l2r ISO8859-2
 496      latin-iso8859-3          96    C  1  l2r ISO8859-3
 497      latin-iso8859-4          96    D  1  l2r ISO8859-4
 498      cyrillic-iso8859-5       96    L  1  l2r ISO8859-5
 499      arabic-iso8859-6         96    G  1  r2l ISO8859-6
 500      greek-iso8859-7          96    F  1  l2r ISO8859-7
 501      hebrew-iso8859-8         96    H  1  r2l ISO8859-8
 502      latin-iso8859-9          96    M  1  l2r ISO8859-9
 503      thai-tis620              96    T  1  l2r TIS620
 504      katakana-jisx0201        94    I  1  l2r JISX0201.1976
 505      latin-jisx0201           94    J  0  l2r JISX0201.1976
 506      japanese-jisx0208-1978   94x94 @  0  l2r JISX0208.1978
 507      japanese-jisx0208        94x94 B  0  l2r JISX0208.19(83|90)
 508      japanese-jisx0212        94x94 D  0  l2r JISX0212
 509      chinese-gb2312           94x94 A  0  l2r GB2312
 510      chinese-cns11643-1       94x94 G  0  l2r CNS11643.1
 511      chinese-cns11643-2       94x94 H  0  l2r CNS11643.2
 512      chinese-big5-1           94x94 0  0  l2r Big5
 513      chinese-big5-2           94x94 1  0  l2r Big5
 514      korean-ksc5601           94x94 C  0  l2r KSC5601
 515      composite                96x96    0  l2r ---
 516
 517    The following charsets are predefined in the Lisp code.
 518
 519      Name                     Type  Fi Gr Dir Registry
 520      --------------------------------------------------------------
 521      arabic-digit             94    2  0  l2r MuleArabic-0
 522      arabic-1-column          94    3  0  r2l MuleArabic-1
 523      arabic-2-column          94    4  0  r2l MuleArabic-2
 524      sisheng                  94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
 525      chinese-cns11643-3       94x94 I  0  l2r CNS11643.1
 526      chinese-cns11643-4       94x94 J  0  l2r CNS11643.1
 527      chinese-cns11643-5       94x94 K  0  l2r CNS11643.1
 528      chinese-cns11643-6       94x94 L  0  l2r CNS11643.1
 529      chinese-cns11643-7       94x94 M  0  l2r CNS11643.1
 530      ethiopic                 94x94 2  0  l2r Ethio
 531      ascii-r2l                94    B  0  r2l ISO8859-1
 532      ipa                      96    0  1  l2r MuleIPA
 533      vietnamese-lower         96    1  1  l2r VISCII1.1
 534      vietnamese-upper         96    2  1  l2r VISCII1.1
 535
 536    For all of the above charsets, the dimension and number of columns
 537 are the same.
 538
 539    Note that ASCII, Control-1, and Composite are handled specially.
 540 This is why some of the fields are blank; and some of the filled-in
 541 fields (e.g. the type) are not really accurate.
 542
 543 \1f
 544 File: lispref.info,  Node: MULE Characters,  Next: Composite Characters,  Prev: Charsets,  Up: MULE
 545
 546 MULE Characters
 547 ===============
 548
 549  - Function: make-char charset arg1 &optional arg2
 550      This function makes a multi-byte character from CHARSET and octets
 551      ARG1 and ARG2.
 552
 553  - Function: char-charset ch
 554      This function returns the character set of char CH.
 555
 556  - Function: char-octet ch &optional n
 557      This function returns the octet (i.e. position code) numbered N
 558      (should be 0 or 1) of char CH.  N defaults to 0 if omitted.
 559
 560  - Function: find-charset-region start end &optional buffer
 561      This function returns a list of the charsets in the region between
 562      START and END.  BUFFER defaults to the current buffer if omitted.
 563
 564  - Function: find-charset-string string
 565      This function returns a list of the charsets in STRING.
 566
 567 \1f
 568 File: lispref.info,  Node: Composite Characters,  Next: Coding Systems,  Prev: MULE Characters,  Up: MULE
 569
 570 Composite Characters
 571 ====================
 572
 573    Composite characters are not yet completely implemented.
 574
 575  - Function: make-composite-char string
 576      This function converts a string into a single composite character.
 577      The character is the result of overstriking all the characters in
 578      the string.
 579
 580  - Function: composite-char-string ch
 581      This function returns a string of the characters comprising a
 582      composite character.
 583
 584  - Function: compose-region start end &optional buffer
 585      This function composes the characters in the region from START to
 586      END in BUFFER into one composite character.  The composite
 587      character replaces the composed characters.  BUFFER defaults to
 588      the current buffer if omitted.
 589
 590  - Function: decompose-region start end &optional buffer
 591      This function decomposes any composite characters in the region
 592      from START to END in BUFFER.  This converts each composite
 593      character into one or more characters, the individual characters
 594      out of which the composite character was formed.  Non-composite
 595      characters are left as-is.  BUFFER defaults to the current buffer
 596      if omitted.
 597
 598 \1f
 599 File: lispref.info,  Node: Coding Systems,  Next: CCL,  Prev: Composite Characters,  Up: MULE
 600
 601 Coding Systems
 602 ==============
 603
 604    A coding system is an object that defines how text containing
 605 multiple character sets is encoded into a stream of (typically 8-bit)
 606 bytes.  The coding system is used to decode the stream into a series of
 607 characters (which may be from multiple charsets) when the text is read
 608 from a file or process, and is used to encode the text back into the
 609 same format when it is written out to a file or process.
 610
 611    For example, many ISO-2022-compliant coding systems (such as Compound
 612 Text, which is used for inter-client data under the X Window System) use
 613 escape sequences to switch between different charsets - Japanese Kanji,
 614 for example, is invoked with `ESC $ ( B'; ASCII is invoked with `ESC (
 615 B'; and Cyrillic is invoked with `ESC - L'.  See `make-coding-system'
 616 for more information.
 617
 618    Coding systems are normally identified using a symbol, and the
 619 symbol is accepted in place of the actual coding system object whenever
 620 a coding system is called for. (This is similar to how faces and
 621 charsets work.)
 622
 623  - Function: coding-system-p object
 624      This function returns non-`nil' if OBJECT is a coding system.
 625
 626 * Menu:
 627
 628 * Coding System Types::               Classifying coding systems.
 629 * ISO 2022::                          An international standard for
 630                                         charsets and encodings.
 631 * EOL Conversion::                    Dealing with different ways of denoting
 632                                         the end of a line.
 633 * Coding System Properties::          Properties of a coding system.
 634 * Basic Coding System Functions::     Working with coding systems.
 635 * Coding System Property Functions::  Retrieving a coding system's properties.
 636 * Encoding and Decoding Text::        Encoding and decoding text.
 637 * Detection of Textual Encoding::     Determining how text is encoded.
 638 * Big5 and Shift-JIS Functions::      Special functions for these non-standard
 639                                         encodings.
 640 * Predefined Coding Systems::         Coding systems implemented by MULE.
 641
 642 \1f
 643 File: lispref.info,  Node: Coding System Types,  Next: ISO 2022,  Up: Coding Systems
 644
 645 Coding System Types
 646 -------------------
 647
 648    The coding system type determines the basic algorithm XEmacs will
 649 use to decode or encode a data stream.  Character encodings will be
 650 converted to the MULE encoding, escape sequences processed, and newline
 651 sequences converted to XEmacs's internal representation.  There are
 652 three basic classes of coding system type: no-conversion, ISO-2022, and
 653 special.
 654
 655    No conversion allows you to look at the file's internal
 656 representation.  Since XEmacs is basically a text editor, "no
 657 conversion" does convert newline conventions by default.  (Use the
 658 `binary' coding system if this is not desired.)
 659
 660    ISO 2022 (*note ISO 2022::) is the basic international standard
 661 regulating use of "coded character sets for the exchange of data", i.e.,
 662 text streams.  ISO 2022 contains functions that make it possible to
 663 encode text streams to comply with restrictions of the Internet mail
 664 system and de facto restrictions of most file systems (e.g., use of the
 665 separator character in file names).  Coding systems which are not ISO
 666 2022 conformant can be difficult to handle.  Perhaps more important,
 667 they are not adaptable to multilingual information interchange, with
 668 the obvious exception of ISO 10646 (Unicode).  (Unicode is partially
 669 supported by XEmacs with the addition of the Lisp package ucs-conv.)
 670
 671    The special class of coding systems includes automatic detection,
 672 CCL (a "little language" embedded as an interpreter, useful for
 673 translating between variants of a single character set),
 674 non-ISO-2022-conformant encodings like Unicode, Shift JIS, and Big5,
 675 and MULE internal coding.  (NB: this list is based on XEmacs 21.2.
 676 Terminology may vary slightly for other versions of XEmacs and for GNU
 677 Emacs 20.)
 678
 679 `no-conversion'
 680      No conversion, for binary files, and a few special cases of
 681      non-ISO-2022 coding systems where conversion is done by hook
 682      functions (usually implemented in CCL).  On output, graphic
 683      characters that are not in ASCII or Latin-1 will be replaced by a
 684      `?'. (For a no-conversion-encoded buffer, these characters will
 685      only be present if you explicitly insert them.)
 686
 687 `iso2022'
 688      Any ISO-2022-compliant encoding.  Among others, this includes JIS
 689      (the Japanese encoding commonly used for e-mail), national
 690      variants of EUC (the standard Unix encoding for Japanese and other
 691      languages), and Compound Text (an encoding used in X11).  You can
 692      specify more specific information about the conversion with the
 693      FLAGS argument.
 694
 695 `ucs-4'
 696      ISO 10646 UCS-4 encoding.  A 31-bit fixed-width superset of
 697      Unicode.
 698
 699 `utf-8'
 700      ISO 10646 UTF-8 encoding.  A "file system safe" transformation
 701      format that can be used with both UCS-4 and Unicode.
 702
 703 `undecided'
 704      Automatic conversion.  XEmacs attempts to detect the coding system
 705      used in the file.
 706
 707 `shift-jis'
 708      Shift-JIS (a Japanese encoding commonly used in PC operating
 709      systems).
 710
 711 `big5'
 712      Big5 (an encoding commonly used for Traditional Chinese).
 713
 714 `ccl'
 715      The conversion is performed using a user-written pseudo-code
 716      program.  CCL (Code Conversion Language) is the name of this
 717      pseudo-code.  For example, CCL is used to map KOI8-R characters
 718      (an encoding for Russian Cyrillic) to ISO8859-5 (the form used
 719      internally by MULE).
 720
 721 `internal'
 722      Write out or read in the raw contents of the memory representing
 723      the buffer's text.  This is primarily useful for debugging
 724      purposes, and is only enabled when XEmacs has been compiled with
 725      `DEBUG_XEMACS' set (the `--debug' configure option).  *Warning*:
 726      Reading in a file using `internal' conversion can result in an
 727      internal inconsistency in the memory representing a buffer's text,
 728      which will produce unpredictable results and may cause XEmacs to
 729      crash.  Under normal circumstances you should never use `internal'
 730      conversion.
 731
 732 \1f
 733 File: lispref.info,  Node: ISO 2022,  Next: EOL Conversion,  Prev: Coding System Types,  Up: Coding Systems
 734
 735 ISO 2022
 736 --------
 737
 738    This section briefly describes the ISO 2022 encoding standard.  A
 739 more thorough treatment is available in the original document of ISO
 740 2022 as well as various national standards (such as JIS X 0202).
 741
 742    Character sets ("charsets") are classified into the following four
 743 categories, according to the number of characters in the charset:
 744 94-charset, 96-charset, 94x94-charset, and 96x96-charset.  This means
 745 that although an ISO 2022 coding system may have variable width
 746 characters, each charset used is fixed-width (in contrast to the MULE
 747 character set and UTF-8, for example).
 748
 749    ISO 2022 provides for switching between character sets via escape
 750 sequences.  This switching is somewhat complicated, because ISO 2022
 751 provides for both legacy applications like Internet mail that accept
 752 only 7 significant bits in some contexts (RFC 822 headers, for example),
 753 and more modern "8-bit clean" applications.  It also provides for
 754 compact and transparent representation of languages like Japanese which
 755 mix ASCII and a national script (even outside of computer programs).
 756
 757    First, ISO 2022 codified prevailing practice by dividing the code
 758 space into "control" and "graphic" regions.  The code points 0x00-0x1F
 759 and 0x80-0x9F are reserved for "control characters", while "graphic
 760 characters" must be assigned to code points in the regions 0x20-0x7F and
 761 0xA0-0xFF.  The positions 0x20 and 0x7F are special, and under some
 762 circumstances must be assigned the graphic character "ASCII SPACE" and
 763 the control character "ASCII DEL" respectively.
 764
 765    The various regions are given the name C0 (0x00-0x1F), GL
 766 (0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF).  GL and GR stand for
 767 "graphic left" and "graphic right", respectively, because of the
 768 standard method of displaying graphic character sets in tables with the
 769 high nibble indexing columns and the low nibble indexing rows.  I don't
 770 find it very intuitive, but these are called "registers".
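
   The four regions can be summarized in a few lines of code.  This is
a minimal sketch in Python; the function name `classify_byte' is
hypothetical and exists only for this illustration.

```python
def classify_byte(b: int) -> str:
    """Return the ISO 2022 region (C0, GL, C1 or GR) of one byte value."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x1F:
        return "C0"   # control characters, 0x00-0x1F
    if b <= 0x7F:
        return "GL"   # graphic left, 0x20-0x7F
    if b <= 0x9F:
        return "C1"   # control characters, 0x80-0x9F
    return "GR"       # graphic right, 0xA0-0xFF

for value in (0x1B, 0x41, 0x8E, 0xC6):
    print(hex(value), classify_byte(value))
```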
 771
 772    An ISO 2022-conformant encoding for a graphic character set must use
 773 a fixed number of bytes per character, and the values must fit into a
 774 single register; that is, each byte must range over either 0x20-0x7F, or
 775 0xA0-0xFF.  It is not allowed to extend the range of the repertoire of a
 776 character set by using both ranges at the same time.  This is why a standard
 777 character set such as ISO 8859-1 is actually considered by ISO 2022 to
 778 be an aggregation of two character sets, ASCII and LATIN-1, and why it
 779 is technically incorrect to refer to ISO 8859-1 as "Latin 1".  Also, a
 780 single character's bytes must all be drawn from the same register; this
 781 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO
 782 2022-compatible encodings.
 783
 784    The reason for this restriction becomes clear when you attempt to
 785 define an efficient, robust encoding for a language like Japanese.
 786 Like ISO 8859, Japanese encodings are aggregations of several character
 787 sets.  In practice, the vast majority of characters are drawn from the
 788 "JIS Roman" character set (a derivative of ASCII; it won't hurt to
 789 think of it as ASCII) and the JIS X 0208 standard "basic Japanese"
 790 character set including not only ideographic characters ("kanji") but
 791 syllabic Japanese characters ("kana"), a wide variety of symbols, and
 792 many alphabetic characters (Roman, Greek, and Cyrillic) as well.
 793 Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code
 794 it is not suited to programming; thus the inclusion of ASCII in the
 795 standard Japanese encodings.
 796
 797    For normal Japanese text such as in newspapers, a broad repertoire of
 798 approximately 3000 characters is used.  Evidently this won't fit into
 799 one byte; two must be used.  But much of the text processed by Japanese
 800 computers is computer source code, nearly all of which is ASCII.  A not
 801 insignificant portion of ordinary text is English (as such or as
 802 borrowed Japanese vocabulary) or other languages which can be represented
 803 at least approximately in ASCII, as well.  It seems reasonable then to
 804 represent ASCII in one byte, and JIS X 0208 in two.  And this is exactly
 805 what the Extended Unix Code for Japanese (EUC-JP) does.  ASCII is
 806 invoked to the GL register, and JIS X 0208 is invoked to the GR
 807 register.  Thus, each byte can be tested for its character set by
 808 looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
 809 Furthermore, since control characters like newline can never be part of
 810 a graphic character, even in the case of corruption in transmission the
 811 stream will be resynchronized at every line break, on the order of 60-80
 812 bytes.  This coding system requires no escape sequences or special
 813 control codes to represent 99.9% of all Japanese text.
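
   The high-bit test described above can be sketched as follows.  This
is a simplified Python illustration, not the XEmacs decoder: the
function name `split_euc_jp' is hypothetical, and the sketch assumes
the input contains only ASCII and JIS X 0208 characters (it does not
handle the single-shift codes used for other character sets).

```python
def split_euc_jp(data: bytes):
    """Split EUC-JP bytes into (charset, bytes) pairs by the high bit."""
    pieces = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:
            # High bit set: first byte of a two-byte JIS X 0208 character.
            pieces.append(("japanese", data[i:i + 2]))
            i += 2
        else:
            # High bit clear: a one-byte ASCII character.
            pieces.append(("ascii", data[i:i + 1]))
            i += 1
    return pieces

sample = "A日本".encode("euc_jp")
print(split_euc_jp(sample))
```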
 814
 815    Note carefully the distinction between the character sets (ASCII and
 816 JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022).
 817 The JIS X 0208 character set is used in three different encodings for
 818 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
 819 always clear), in EUC-JP it is invoked into GR (setting the high bit in
 820 the process), and in Shift JIS the high bit may be set or reset, and the
 821 significant bits are shifted within the 16-bit character so that the two
 822 main character sets can coexist with a third (the "halfwidth katakana"
 823 of JIS X 0201).  As the name implies, the ISO-2022-JP encoding is also a
 824 version of the ISO-2022 coding system.
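
   To make the distinction concrete, the sketch below encodes one JIS X
0208 character with three of Python's standard codecs (used here only
as a convenient way to obtain the bytes).  The slice offsets assume the
ISO-2022-JP output has the shape `ESC $ B', two character bytes, `ESC (
B'.

```python
ch = "日"

iso = ch.encode("iso2022_jp")
jis = iso[3:5]                  # the two bytes between the escape sequences
euc = ch.encode("euc_jp")
sjis = ch.encode("shift_jis")

# In ISO-2022-JP the character sits in GL, so both high bits are clear.
print(jis.hex())
# In EUC-JP the same two bytes sit in GR: each one has the high bit set.
print(euc.hex())
# In Shift JIS the bits are rearranged, so no simple mask relates them.
print(sjis.hex())

euc_from_jis = bytes(b | 0x80 for b in jis)
```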
 825
 826    In order to systematically treat subsidiary character sets (like the
 827 "halfwidth katakana" already mentioned, and the "supplementary kanji" of
 828 JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
 829 Unlike GL and GR, they are not logically distinguished by internal
 830 format.  Instead, the process of "invocation" mentioned earlier is
 831 broken into two steps: first, a character set is "designated" to one of
 832 the registers G0-G3 by use of an "escape sequence" of the form:
 833
 834              ESC [I] I F
 835
 836    where I is an intermediate character or characters in the range 0x20
 837 - 0x3F, and F, from the range 0x30-0x7F, is the final character
 838 identifying this charset.  (Final characters in the range 0x30-0x3F are
 839 reserved for private use and will never have a publicly registered
 840 meaning.)
 841
 842    Then that register is "invoked" to either GL or GR, either
 843 automatically (designations to G0 normally involve invocation to GL as
 844 well), or by use of shifting (affecting only the following character in
 845 the data stream) or locking (effective until the next designation or
 846 locking) control sequences.  An encoding conformant to ISO 2022 is
 847 typically defined by designating the initial contents of the G0-G3
 848 registers, specifying a 7- or 8-bit environment, and specifying whether
 849 further designations will be recognized.
 850
 851    Some examples of character sets and the registered final characters
 852 F used to designate them:
 853
 854 94-charset
 855      ASCII (B), left (J) and right (I) half of JIS X 0201, ...
 856
 857 96-charset
 858      Latin-1 (A), Latin-2 (B), Latin-3 (C), ...
 859
 860 94x94-charset
 861      GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...
 862
 863 96x96-charset
 864      none for the moment
 865
 866    The meanings of the various characters in these sequences, where not
 867 specified by the ISO 2022 standard (such as the ESC character), are
 868 assigned by "ECMA", the European Computer Manufacturers Association.
 869
 870    The meanings of the intermediate characters are:
 871
 872              $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
 873              ( [0x28]: designate to G0 a 94-charset whose final byte is F.
 874              ) [0x29]: designate to G1 a 94-charset whose final byte is F.
 875              * [0x2A]: designate to G2 a 94-charset whose final byte is F.
 876              + [0x2B]: designate to G3 a 94-charset whose final byte is F.
 877              , [0x2C]: designate to G0 a 96-charset whose final byte is F.
 878              - [0x2D]: designate to G1 a 96-charset whose final byte is F.
 879              . [0x2E]: designate to G2 a 96-charset whose final byte is F.
 880              / [0x2F]: designate to G3 a 96-charset whose final byte is F.
 881
 882    The comma may be used in files read and written only by MULE, as a
 883 MULE extension, but this is illegal in ISO 2022.  (The reason is that
 884 in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned
 885 the value SPACE, and 0x7F assigned the value DEL.)
 886
 887    Here are examples of designations:
 888
 889              ESC ( B :              designate to G0 ASCII
 890              ESC - A :              designate to G1 Latin-1
 891              ESC $ ( A or ESC $ A : designate to G0 GB2312
 892              ESC $ ( B or ESC $ B : designate to G0 JISX0208
 893              ESC $ ) C :            designate to G1 KSC5601
 894
 895    (The short forms used to designate GB2312 and JIS X 0208 are for
 896 backwards compatibility; the long forms are preferred.)
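
   The table of intermediate characters and the designation examples
above can be turned into a small parser.  This is a sketch in Python
under the assumptions stated in this section; the names `INTERMEDIATES'
and `parse_designation' are hypothetical, and only the sequences
described here (including the two-character short forms) are handled.

```python
# Intermediate byte -> (register, charset size), as listed above.
INTERMEDIATES = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2C: ("G0", 96), 0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}

def parse_designation(seq: bytes):
    """Return (register, charset type, final character) for a designation."""
    if len(seq) < 3 or seq[0] != 0x1B:
        raise ValueError("not a designation sequence")
    body = seq[1:]
    two_dim = body[0] == 0x24          # '$' marks a dimension-2 charset
    if two_dim:
        body = body[1:]
    if two_dim and len(body) == 1:
        # Short form such as ESC $ B: G0 is implied.
        return ("G0", "94x94", chr(body[0]))
    if len(body) != 2 or body[0] not in INTERMEDIATES:
        raise ValueError("unrecognized designation")
    register, size = INTERMEDIATES[body[0]]
    kind = f"{size}x{size}" if two_dim else str(size)
    return (register, kind, chr(body[1]))

print(parse_designation(b"\x1b$)C"))
```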
 897
 898    To use a charset designated to G2 or G3, and to use a charset
 899 designated to G1 in a 7-bit environment, you must explicitly invoke G1,
 900 G2, or G3 into GL.  There are two types of invocation, Locking Shift
 901 (forever) and Single Shift (one character only).
 902
 903    Locking Shift is done as follows:
 904
 905              LS0 or SI (0x0F): invoke G0 into GL
 906              LS1 or SO (0x0E): invoke G1 into GL
 907              LS2:  invoke G2 into GL
 908              LS3:  invoke G3 into GL
 909              LS1R: invoke G1 into GR
 910              LS2R: invoke G2 into GR
 911              LS3R: invoke G3 into GR
 912
 913    Single Shift is done as follows:
 914
 915              SS2 or ESC N: invoke G2 into GL
 916              SS3 or ESC O: invoke G3 into GL
 917
 918    The shift functions (such as LS1R and SS3) are represented by control
 919 characters (from C1) in 8 bit environments and by escape sequences in 7
 920 bit environments.
 921
 922    (#### Ben says: I think the above is slightly incorrect.  It appears
 923 that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N
 924 and ESC O behave as indicated.  The above definitions will not parse
 925 EUC-encoded text correctly, and it looks like the code in mule-coding.c
 926 has similar problems.)
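
   The effect of the locking shifts SI and SO in a 7-bit environment
can be sketched as a tiny state machine.  This Python illustration
covers only LS0 (SI) and LS1 (SO); the function name `tag_gl_bytes' is
hypothetical, and single shifts and the other locking shifts are
deliberately left out.

```python
SO = 0x0E   # LS1: invoke G1 into GL
SI = 0x0F   # LS0: invoke G0 into GL

def tag_gl_bytes(data: bytes):
    """Pair each graphic byte with the register currently invoked into GL."""
    current = "G0"
    tagged = []
    for b in data:
        if b == SO:
            current = "G1"          # stays in effect until the next shift
        elif b == SI:
            current = "G0"
        elif 0x20 <= b <= 0x7F:
            tagged.append((current, chr(b)))
        # Other control characters are ignored in this sketch.
    return tagged

print(tag_gl_bytes(b"A\x0eGQ\x0fB"))
```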
 927
 928    Evidently there are many ISO-2022-compliant ways of encoding
 929 multilingual text.  Many coding systems in actual use, such as X11's
 930 Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX
 931 Code), are variants of ISO 2022.
 932
 933    In MULE, we characterize a version of ISO 2022 by the following
 934 attributes:
 935
 936   1. The character sets initially designated to G0 through G3.
 937
 938   2. Whether short form designations are allowed for Japanese and
 939      Chinese.
 940
 941   3. Whether ASCII should be designated to G0 before control characters.
 942
 943   4. Whether ASCII should be designated to G0 at the end of line.
 944
 945   5. 7-bit environment or 8-bit environment.
 946
 947   6. Whether Locking Shifts are used or not.
 948
 949   7. Whether to use ASCII or the variant JIS X 0201-1976-Roman.
 950
 951   8. Whether to use JIS X 0208-1983 or the older version JIS X
 952      0208-1976.
 953
 954    (The last two are only for Japanese.)
 955
 956    By specifying these attributes, you can create any variant of ISO
 957 2022.
 958
 959    Here are several examples:
 960
 961      ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
 962              1. G0 <- ASCII, G1..3 <- never used
 963              2. Yes.
 964              3. Yes.
 965              4. Yes.
 966              5. 7-bit environment
 967              6. No.
 968              7. Use ASCII
 969              8. Use JIS X 0208-1983
 970
 971      ctext -- X11 Compound Text
 972              1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
 973              2. No.
 974              3. No.
 975              4. Yes.
 976              5. 8-bit environment.
 977              6. No.
 978              7. Use ASCII.
 979              8. Use JIS X 0208-1983.
 980
 981      euc-china -- Chinese EUC.  Often called the "GB encoding", but that is
 982      technically incorrect.
 983              1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
 984              2. No.
 985              3. Yes.
 986              4. Yes.
 987              5. 8-bit environment.
 988              6. No.
 989              7. Use ASCII.
 990              8. Use JIS X 0208-1983.
 991
 992      ISO-2022-KR -- Coding system used in Korean email.
 993              1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
 994              2. No.
 995              3. Yes.
 996              4. Yes.
 997              5. 7-bit environment.
 998              6. Yes.
 999              7. Use ASCII.
1000              8. Use JIS X 0208-1983.
1001
1002    MULE creates all of these coding systems by default.
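
   The eight attributes and the four examples above can be written
down as plain data.  The following Python sketch is only a way of
tabulating them; the class `Iso2022Variant', its field names, and the
`VARIANTS' table are hypothetical and do not correspond to any XEmacs
interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Iso2022Variant:
    initial: tuple              # 1. charsets in G0..G3 (None = never used)
    short_forms: bool           # 2. short form designations allowed
    ascii_before_control: bool  # 3. designate ASCII to G0 before controls
    ascii_at_eol: bool          # 4. designate ASCII to G0 at end of line
    bits: int                   # 5. 7-bit or 8-bit environment
    locking_shift: bool         # 6. locking shifts used
    jis_roman: bool = False     # 7. JIS X 0201-1976-Roman instead of ASCII
    old_jisx0208: bool = False  # 8. JIS X 0208-1976 instead of -1983

VARIANTS = {
    "iso-2022-jp": Iso2022Variant(("ascii", None, None, None),
                                  True, True, True, 7, False),
    "ctext": Iso2022Variant(("ascii", "latin-1", None, None),
                            False, False, True, 8, False),
    "euc-china": Iso2022Variant(("ascii", "gb2312", None, None),
                                False, True, True, 8, False),
    "iso-2022-kr": Iso2022Variant(("ascii", "ksc5601", None, None),
                                  False, True, True, 7, True),
}

# Example query: which variants are restricted to a 7-bit environment?
seven_bit = sorted(name for name, v in VARIANTS.items() if v.bits == 7)
print(seven_bit)
```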
1003
1004 \1f
1005 File: lispref.info,  Node: EOL Conversion,  Next: Coding System Properties,  Prev: ISO 2022,  Up: Coding Systems
1006
1007 EOL Conversion
1008 --------------
1009
1010 `nil'
1011      Automatically detect the end-of-line type (LF, CRLF, or CR).  Also
1012      generate subsidiary coding systems named `NAME-unix', `NAME-dos',
1013      and `NAME-mac', that are identical to this coding system but have
1014      an EOL-TYPE value of `lf', `crlf', and `cr', respectively.
1015
1016 `lf'
1017      The end of a line is marked externally using ASCII LF.  Since this
1018      is also the way that XEmacs represents an end-of-line internally,
1019      specifying this option results in no end-of-line conversion.  This
1020      is the standard format for Unix text files.
1021
1022 `crlf'
1023      The end of a line is marked externally using ASCII CRLF.  This is
1024      the standard format for MS-DOS text files.
1025
1026 `cr'
1027      The end of a line is marked externally using ASCII CR.  This is the
1028      standard format for Macintosh text files.
1029
1030 `t'
1031      Automatically detect the end-of-line type but do not generate
1032      subsidiary coding systems.  (This value is converted to `nil' when
1033      stored internally, and `coding-system-property' will return `nil'.)
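
   The conversions named above can be illustrated with a short sketch.
This is Python, not the XEmacs implementation: the function names are
hypothetical, and the detection rule (look for CRLF first, then CR,
then LF) is a deliberately naive assumption made for the example.

```python
def detect_eol(data: bytes):
    """Guess the end-of-line type of DATA: 'crlf', 'cr', 'lf' or None."""
    if b"\r\n" in data:
        return "crlf"
    if b"\r" in data:
        return "cr"
    if b"\n" in data:
        return "lf"
    return None

def to_internal(data: bytes, eol_type=None):
    """Convert external line endings to LF, the internal representation."""
    kind = eol_type or detect_eol(data)   # None means: detect automatically
    if kind == "crlf":
        return data.replace(b"\r\n", b"\n")
    if kind == "cr":
        return data.replace(b"\r", b"\n")
    return data                           # 'lf' needs no conversion

print(to_internal(b"one\r\ntwo\r\n"))
```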
1034