2 @c This is part of the XEmacs Lisp Reference Manual.
3 @c Copyright (C) 1996 Ben Wing.
4 @c See the file lispref.texi for copying conditions.
5 @setfilename ../../info/internationalization.info
6 @node MULE, Tips, Internationalization, top
@chapter MULE
9 @dfn{MULE} is the name originally given to the version of GNU Emacs
10 extended for multi-lingual (and in particular Asian-language) support.
11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It is an extension and
12 complete rewrite of Nemacs (``Nihon Emacs'' where ``Nihon'' is the
13 Japanese word for ``Japan''), which only provided support for Japanese.
14 XEmacs refers to its multi-lingual support as @dfn{MULE support} since
15 it is based on @dfn{MULE}.
@menu
* Internationalization Terminology::
                        Definition of various internationalization terms.
* Charsets::            Sets of related characters.
* MULE Characters::     Working with characters in XEmacs/MULE.
* Composite Characters:: Making new characters by overstriking other ones.
* Coding Systems::      Ways of representing a string of chars using integers.
* CCL::                 A special language for writing fast converters.
* Category Tables::     Subdividing charsets into groups.
@end menu
28 @node Internationalization Terminology, Charsets, , MULE
29 @section Internationalization Terminology
31 In internationalization terminology, a string of text is divided up
32 into @dfn{characters}, which are the printable units that make up the
33 text. A single character is (for example) a capital @samp{A}, the
34 number @samp{2}, a Katakana character, a Hangul character, a Kanji
35 ideograph (an @dfn{ideograph} is a ``picture'' character, such as is
36 used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there
37 are thousands of such ideographs in each language), etc. The basic
38 property of a character is that it is the smallest unit of text with
39 semantic significance in text processing.
41 Human beings normally process text visually, so to a first approximation
42 a character may be identified with its shape. Note that the same
43 character may be drawn by two different people (or in two different
fonts) in slightly different ways, although the ``basic shape'' will be the
same. But consider the works of Scott Kim; human beings can recognize
hugely variant shapes as the ``same'' character. Sometimes, especially
where characters are extremely complicated to write, completely
different shapes may be defined as the ``same'' character in national
49 standards. The Taiwanese variant of Hanzi is generally the most
50 complicated; over the centuries, the Japanese, Koreans, and the People's
51 Republic of China have adopted simplifications of the shape, but the
52 line of descent from the original shape is recorded, and the meanings
53 and pronunciation of different forms of the same character are
54 considered to be identical within each language. (Of course, it may
55 take a specialist to recognize the related form; the point is that the
56 relations are standardized, despite the differing shapes.)
58 In some cases, the differences will be significant enough that it is
59 actually possible to identify two or more distinct shapes that both
60 represent the same character. For example, the lowercase letters
61 @samp{a} and @samp{g} each have two distinct possible shapes---the
62 @samp{a} can optionally have a curved tail projecting off the top, and
63 the @samp{g} can be formed either of two loops, or of one loop and a
64 tail hanging off the bottom. Such distinct possible shapes of a
65 character are called @dfn{glyphs}. The important characteristic of two
66 glyphs making up the same character is that the choice between one or
67 the other is purely stylistic and has no linguistic effect on a word
68 (this is the reason why a capital @samp{A} and lowercase @samp{a}
69 are different characters rather than different glyphs---e.g.
70 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
72 Note that @dfn{character} and @dfn{glyph} are used differently
73 here than elsewhere in XEmacs.
75 A @dfn{character set} is essentially a set of related characters. ASCII,
76 for example, is a set of 94 characters (or 128, if you count
77 non-printing characters). Other character sets are ISO8859-1 (ASCII
78 plus various accented characters and other international symbols),
79 JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208
80 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji),
81 GB2312 (Mainland Chinese Hanzi), etc.
83 The definition of a character set will implicitly or explicitly give
84 it an @dfn{ordering}, a way of assigning a number to each character in
85 the set. For many character sets, there is a natural ordering, for
86 example the ``ABC'' ordering of the Roman letters. But it is not clear
87 whether digits should come before or after the letters, and in fact
88 different European languages treat the ordering of accented characters
89 differently. It is useful to use the natural order where available, of
90 course. The number assigned to any particular character is called the
91 character's @dfn{code point}. (Within a given character set, each
character has a unique code point. Thus the word ``set'' is ill-chosen;
93 different orderings of the same characters are different character sets.
94 Identifying characters is simple enough for alphabetic character sets,
95 but the difference in ordering can cause great headaches when the same
96 thousands of characters are used by different cultures as in the Hanzi.)
98 A code point may be broken into a number of @dfn{position codes}. The
99 number of position codes required to index a particular character in a
100 character set is called the @dfn{dimension} of the character set. For
101 practical purposes, a position code may be thought of as a byte-sized
index. The set of printing ASCII characters, being relatively small,
is of dimension one, and each character in the set is
indexed using a single position code, in the range 1 through 94. Use of
this unusual range, rather than the familiar 33 through 126, is an
intentional abstraction; to understand the programming issues you must
stop equating character sets with encodings.
109 JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
110 of dimension two -- every character is indexed by two position codes,
111 each in the range 1 through 94. (This number ``94'' is not a
112 coincidence; we shall see that the JIS position codes were chosen so
113 that JIS kanji could be encoded without using codes that in ASCII are
114 associated with device control functions.) Note that the choice of the
115 range here is somewhat arbitrary. You could just as easily index the
116 printing characters in ASCII using numbers in the range 0 through 93, 2
through 95, 3 through 96, etc. In fact, the standardized
@emph{encoding} for the ASCII @emph{character set} uses the range 33
through 126.
121 An @dfn{encoding} is a way of numerically representing characters from
122 one or more character sets into a stream of like-sized numerical values
123 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
124 quantities. If an encoding encompasses only one character set, then the
125 position codes for the characters in that character set could be used
directly. (This is the case with the trivial cipher used by children,
assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII,
other considerations intrude. For example, why are the upper- and
lowercase alphabets separated by 6 other characters, so that
corresponding letters differ by exactly 32? Why do the digits start
with `0' being assigned the code 48? In both cases the reason is that
semantically interesting operations (case conversion and numerical
value extraction) become convenient masking operations. Other
artificial aspects (the control characters being assigned to codes
0--31 and 127) are historical accidents. (The use of 127 for
@samp{DEL} is an artifact of the ``punch once'' nature of paper tape,
for example.)
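
The masking operations mentioned above can be illustrated with a short
Python sketch. It is not part of XEmacs; the function names are
hypothetical, and it assumes plain ASCII input.

```python
def toggle_case(ch):
    """Flip the case of an ASCII letter by toggling the bit worth 32."""
    if not (ch.isascii() and ch.isalpha()):
        raise ValueError("expected a single ASCII letter")
    # 'A' is 65 and 'a' is 97: the two codes differ only in bit 0x20.
    return chr(ord(ch) ^ 0x20)


def digit_value(ch):
    """Extract the numeric value of an ASCII digit by masking."""
    code = ord(ch)
    if not 48 <= code <= 57:
        raise ValueError("expected a single ASCII digit")
    # '0' is 48 (0x30), so the low four bits are the digit's value.
    return code & 0x0F
```

Because the digits start at 48 and the two alphabets are 32 apart, neither
function needs a lookup table.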
137 Naive use of the position code is not possible, however, if more than
138 one character set is to be used in the encoding. For example, printed
139 Japanese text typically requires characters from multiple character sets
140 -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is
141 indexed using one or more position codes in the range 1 through 94, so
142 the position codes could not be used directly or there would be no way
143 to tell which character was meant. Different Japanese encodings handle
144 this differently -- JIS uses special escape characters to denote
145 different character sets; EUC sets the high bit of the position codes
146 for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
147 JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings
148 you will encounter in files are 7-bit or 8-bit encodings. There is one
149 common 16-bit encoding, which is Unicode; this strives to represent all
150 the world's characters in a single large character set. 32-bit
151 encodings are often used internally in programs, such as XEmacs with
152 MULE support, to simplify the code that manipulates them; however, they
153 are not used externally because they are not very space-efficient.)
155 A general method of handling text using multiple character sets
156 (whether for multilingual text, or simply text in an extremely
157 complicated single language like Japanese) is defined in the
158 international standard ISO 2022. ISO 2022 will be discussed in more
159 detail later (@pxref{ISO 2022}), but for now suffice it to say that text
160 needs control functions (at least spacing), and if escape sequences are
161 to be used, an escape sequence introducer. It was decided to make all
162 text streams compatible with ASCII in the sense that the codes 0--31
(and 128--159) would always be control codes, never graphic characters,
164 and where defined by the character set the @samp{SPC} character would be
165 assigned code 32, and @samp{DEL} would be assigned 127. Thus there are
166 94 code points remaining if 7 bits are used. This is the reason that
167 most character sets are defined using position codes in the range 1
168 through 94. Then ISO 2022 compatible encodings are produced by shifting
169 the position codes 1 to 94 into character codes 33 to 126, or (if 8 bit
170 codes are available) into character codes 161 to 254.
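
The shift described above is plain arithmetic. The following Python
sketch shows it under the assumption that a position code is an integer
from 1 to 94; the function names are made up for this illustration.

```python
def to_seven_bit(position_code):
    """Map a position code (1..94) to a character code in 33..126."""
    if not 1 <= position_code <= 94:
        raise ValueError("position code must be in 1..94")
    return position_code + 32


def to_eight_bit(position_code):
    """Map a position code (1..94) to a character code in 161..254."""
    # Same shift as the 7-bit form, with the high bit set as well.
    return to_seven_bit(position_code) | 0x80


def to_position_code(character_code):
    """Recover the position code from either shifted form."""
    return (character_code & 0x7F) - 32
```

For example, position code 1 becomes 33 in the 7-bit form and 161 in the
8-bit form, and both map back to 1.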
172 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In
173 a @dfn{modal encoding}, there are multiple states that the encoding can
174 be in, and the interpretation of the values in the stream depends on the
175 current global state of the encoding. Special values in the encoding,
176 called @dfn{escape sequences}, are used to change the global state.
177 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B}
178 indicate that, from then on, bytes are to be interpreted as position
179 codes for JIS X 0208, rather than as ASCII. This effect is cancelled
180 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
current state is to ASCII''. To switch to JIS X 0212, the escape
sequence @samp{ESC $ ( D} is used. (Note that here, as is common, the escape
sequences do in fact begin with @samp{ESC}. This is not necessarily the
case, however. Some encodings use control characters called ``locking
shifts'' (whose effect persists until cancelled) to switch character sets.)
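
To make the idea of a global state concrete, here is a minimal Python
sketch of a reader for a JIS byte stream. It is a simplification, not a
full ISO 2022 decoder: it handles only the two escape sequences
@samp{ESC $ B} and @samp{ESC ( B}, and the function name and the shape of
its result are assumptions made for this example.

```python
ESC_TO_JISX0208 = b"\x1b$B"   # ESC $ B
ESC_TO_ASCII = b"\x1b(B"      # ESC ( B


def split_jis(data):
    """Split JIS-encoded bytes into (charset, codes) pairs."""
    state = "ascii"           # the global state of the modal encoding
    result = []
    i = 0
    while i < len(data):
        if data.startswith(ESC_TO_JISX0208, i):
            state = "jisx0208"
            i += 3
        elif data.startswith(ESC_TO_ASCII, i):
            state = "ascii"
            i += 3
        elif state == "ascii":
            # In the ASCII state each byte is one character code.
            result.append(("ascii", (data[i],)))
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            # In the JIS X 0208 state two bytes form one character;
            # subtracting 32 recovers the position codes (1..94).
            result.append(("jisx0208", (data[i] - 32, data[i + 1] - 32)))
            i += 2
    return result
```

The same byte value 0x41 is reported as the ASCII letter @samp{A} in one
state and as half of a Kanji or Kana character in the other, which is
exactly what makes the encoding modal.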
187 A @dfn{non-modal encoding} has no global state that extends past the
188 character currently being interpreted. EUC, for example, is a
189 non-modal encoding. Characters in JIS X 0208 are encoded by setting
the high bit of the position codes, and characters in JIS X 0212 are
encoded by doing the same but also prefixing the character with the
byte 0x8F.
194 The advantage of a modal encoding is that it is generally more
195 space-efficient, and is easily extendible because there are essentially
196 an arbitrary number of escape sequences that can be created. The
197 disadvantage, however, is that it is much more difficult to work with
198 if it is not being processed in a sequential manner. In the non-modal
199 EUC encoding, for example, the byte 0x41 always refers to the letter
200 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
201 one of the two position codes in a JIS X 0208 character, or one of the
202 two position codes in a JIS X 0212 character. Determining exactly which
203 one is meant could be difficult and time-consuming if the previous
204 bytes in the string have not already been processed, or impossible if
205 they are drawn from an external stream that cannot be rewound.
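
The ambiguity of the byte 0x41 can be observed with Python's standard
codecs, which serve here only as a convenient stand-in for the JIS and
EUC encodings discussed in the text. The helper name is hypothetical.

```python
def ascii_range_bytes(data):
    """Return the bytes of DATA whose high bit is clear."""
    return bytes(b for b in data if b < 0x80)


# U+3061 is a Hiragana character at JIS X 0208 position codes (4, 33).
kana = "\u3061"

jis = kana.encode("iso2022_jp")   # modal: escape, two 7-bit bytes, escape
euc = kana.encode("euc_jp")       # non-modal: two bytes with high bit set

# In the JIS form the second position code appears as 0x41, the same
# value as the ASCII letter A; in the EUC form no byte can be mistaken
# for ASCII.
jis_has_0x41 = 0x41 in jis
euc_has_0x41 = 0x41 in euc
```

A program that jumps into the middle of the JIS byte string cannot tell
what the 0x41 means without scanning backwards for the last escape
sequence; in the EUC byte string the question never arises.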
207 Non-modal encodings are further divided into @dfn{fixed-width} and
208 @dfn{variable-width} formats. A fixed-width encoding always uses
209 the same number of words per character, whereas a variable-width
210 encoding does not. EUC is a good example of a variable-width
211 encoding: one to three bytes are used per character, depending on
212 the character set. 16-bit and 32-bit encodings are nearly always
213 fixed-width, and this is in fact one of the main reasons for using
214 an encoding with a larger word size. The advantages of fixed-width
215 encodings should be obvious. The advantages of variable-width
216 encodings are that they are generally more space-efficient and allow
217 for compatibility with existing 8-bit encodings such as ASCII. (For
218 example, in Unicode ASCII characters are simply promoted to a 16-bit
219 representation. That means that every ASCII character contains a
220 @samp{NUL} byte; evidently all of the standard string manipulation
221 functions will lose badly in a fixed-width Unicode environment.)
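
The problem with @samp{NUL} bytes can be demonstrated in a few lines of
Python. The sketch uses the big-endian UTF-16 codec as a stand-in for the
fixed-width 16-bit representation described above, and
@code{c_style_length} is a hypothetical helper that imitates how a C
string function finds the end of a string.

```python
def c_style_length(data):
    """Count bytes before the first NUL, as a C string function would."""
    position = data.find(b"\x00")
    return len(data) if position < 0 else position


narrow = "ABC".encode("ascii")       # one byte per character
wide = "ABC".encode("utf-16-be")     # two bytes per character

# Every ASCII character gains a leading zero byte in the 16-bit form,
# so a NUL-terminated scan stops before the first character.
narrow_length = c_style_length(narrow)
wide_length = c_style_length(wide)
```

Here the 8-bit string is seen as three characters long, while the 16-bit
string appears to be empty.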
223 The bytes in an 8-bit encoding are often referred to as @dfn{octets}
224 rather than simply as bytes. This terminology dates back to the days
225 before 8-bit bytes were universal, when some computers had 9-bit bytes,
226 others had 10-bit bytes, etc.
228 @node Charsets, MULE Characters, Internationalization Terminology, MULE
@section Charsets
231 A @dfn{charset} in MULE is an object that encapsulates a
232 particular character set as well as an ordering of those characters.
Charsets are permanent objects and are named using symbols, like
faces.

@defun charsetp object
This function returns non-@code{nil} if @var{object} is a charset.
@end defun
@menu
* Charset Properties::          Properties of a charset.
* Basic Charset Functions::     Functions for working with charsets.
* Charset Property Functions::  Functions for accessing charset properties.
* Predefined Charsets::         Predefined charset objects.
@end menu
247 @node Charset Properties, Basic Charset Functions, , Charsets
248 @subsection Charset Properties
Charsets have the following properties:

@table @code
@item name
A symbol naming the charset. Every charset must have a different name;
this allows a charset to be referred to using its name rather than
the actual charset object.
@item doc-string
A documentation string describing the charset.
@item registry
A regular expression matching the font registry field for this character
set. For example, both the @code{ascii} and @code{latin-iso8859-1}
charsets use the registry @code{"ISO8859-1"}. This field is used to
choose an appropriate font when the user gives a general font
specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a
14-point upright medium-weight Courier font.
@item dimension
Number of position codes used to index a character in the character set.
XEmacs/MULE can only handle character sets of dimension 1 or 2.
This property defaults to 1.
@item chars
Number of characters in each dimension. In XEmacs/MULE, the only
allowed values are 94 or 96. (There are a couple of pre-defined
character sets, such as ASCII, that do not follow this, but you cannot
define new ones like this.) Defaults to 94. Note that if the dimension
is 2, the character set thus described is 94x94 or 96x96.
@item columns
Number of columns used to display a character in this charset.
Only used in TTY mode. (Under X, the actual width of a character
can be derived from the font used to display the characters.)
If unspecified, defaults to the dimension. (This is almost
always the correct value, because character sets with dimension 2
are usually ideograph character sets, which need two columns to
display the intricate ideographs.)
@item direction
A symbol, either @code{l2r} (left-to-right) or @code{r2l}
(right-to-left). Defaults to @code{l2r}. This specifies the
direction that the text should be displayed in, and will be
left-to-right for most charsets but right-to-left for Hebrew
and Arabic. (Right-to-left display is not currently implemented.)
@item final
Final byte of the standard ISO 2022 escape sequence designating this
charset. Must be supplied. Each combination of (@var{dimension},
@var{chars}) defines a separate namespace for final bytes, and each
charset within a particular namespace must have a different final byte.
Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final
bytes in the range 0x30 - 0x3F are reserved for user-defined (not
official) character sets. For more information on ISO 2022, see @ref{Coding
Systems}.
@item graphic
0 (use left half of font on output) or 1 (use right half of font on
output). Defaults to 0. This specifies how to convert the position
codes that index a character in a character set into an index into the
font used to display the character set. With @code{graphic} set to 0,
position codes 33 through 126 map to font indices 33 through 126; with
it set to 1, position codes 33 through 126 map to font indices 161
through 254 (i.e. the same number but with the high bit set). For
example, for a font whose registry is ISO8859-1, the left half of the
font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right
half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset.
@item ccl-program
A compiled CCL program used to convert a character in this charset into
an index into the font. This is in addition to the @code{graphic}
property. If a CCL program is defined, the position codes of a
character will first be processed according to @code{graphic} and
then passed through the CCL program, with the resulting values used
to index the font.

This is used, for example, in the Big5 character set (used in Taiwan).
This character set is not ISO-2022-compliant, and its size (94x157) does
not fit within the maximum 96x96 size of ISO-2022-compliant character
sets. As a result, XEmacs/MULE splits it (in a rather complex fashion,
so as to group the most commonly used characters together) into two
charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94,
and each charset object uses a CCL program to convert the modified
position codes back into standard Big5 indices to retrieve a character
from a Big5 font.
@end table

330 Most of the above properties can only be set when the charset is
331 initialized, and cannot be changed later.
332 @xref{Charset Property Functions}.
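
The constraints on these properties can be summarized in a small Python
sketch. It is only an illustration of the rules stated above, not the
validation code used by XEmacs; the function names, argument names, and
messages are assumptions.

```python
def check_charset_props(dimension=1, chars=94, final=None, graphic=0):
    """Return a list of problems found in a proposed set of properties."""
    problems = []
    if dimension not in (1, 2):
        problems.append("dimension must be 1 or 2")
    if chars not in (94, 96):
        problems.append("chars must be 94 or 96")
    if final is None:
        problems.append("final must be supplied")
    else:
        # The upper limit of the final byte depends on the dimension.
        upper = 0x7E if dimension == 1 else 0x5F
        if not 0x30 <= ord(final) <= upper:
            problems.append("final byte out of range")
    if graphic not in (0, 1):
        problems.append("graphic must be 0 or 1")
    return problems


def font_index(code, graphic):
    """Map a code in 33..126 to a font index per the graphic property."""
    # graphic 0 selects the left half of the font, graphic 1 the right
    # half, which is the same number with the high bit set.
    return code | 0x80 if graphic else code
```

For instance, a two-dimensional charset with final byte @samp{B} passes
the check, while one with final byte @samp{a} does not, because 0x61 lies
above 0x5F.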
334 @node Basic Charset Functions, Charset Property Functions, Charset Properties, Charsets
335 @subsection Basic Charset Functions
@defun find-charset charset-or-name
This function retrieves the charset of the given name. If
@var{charset-or-name} is a charset object, it is simply returned.
Otherwise, @var{charset-or-name} should be a symbol. If there is no
such charset, @code{nil} is returned. Otherwise the associated charset
object is returned.
@end defun

@defun get-charset name
This function retrieves the charset of the given name. Same as
@code{find-charset} except an error is signalled if there is no such
charset instead of returning @code{nil}.
@end defun

352 This function returns a list of the names of all defined charsets.
@defun make-charset name doc-string props
This function defines a new character set. This function is for use
with MULE support. @var{name} is a symbol, the name by which the
character set is normally referred. @var{doc-string} is a string
describing the character set. @var{props} is a property list,
describing the specific nature of the character set. The recognized
properties are @code{registry}, @code{dimension}, @code{columns},
@code{chars}, @code{final}, @code{graphic}, @code{direction}, and
@code{ccl-program}, as previously described.
@end defun

@defun make-reverse-direction-charset charset new-name
This function makes a charset equivalent to @var{charset} but which goes
in the opposite direction. @var{new-name} is the name of the new
charset. The new charset is returned.
@end defun

@defun charset-from-attributes dimension chars final &optional direction
This function returns a charset with the given @var{dimension},
@var{chars}, @var{final}, and @var{direction}. If @var{direction} is
omitted, both directions will be checked (left-to-right will be returned
if character sets exist for both directions).
@end defun

@defun charset-reverse-direction-charset charset
This function returns the charset (if any) with the same dimension,
number of characters, and final byte as @var{charset}, but which is
displayed in the opposite direction.
@end defun

385 @node Charset Property Functions, Predefined Charsets, Basic Charset Functions, Charsets
386 @subsection Charset Property Functions
388 All of these functions accept either a charset name or charset object.
@defun charset-property charset prop
This function returns property @var{prop} of @var{charset}.
@xref{Charset Properties}.
@end defun

Convenience functions are also provided for retrieving individual
properties of a charset.

@defun charset-name charset
This function returns the name of @var{charset}. This will be a symbol.
@end defun

@defun charset-description charset
This function returns the documentation string of @var{charset}.
@end defun

@defun charset-registry charset
This function returns the registry of @var{charset}.
@end defun

@defun charset-dimension charset
This function returns the dimension of @var{charset}.
@end defun

@defun charset-chars charset
This function returns the number of characters per dimension of
@var{charset}.
@end defun

@defun charset-width charset
This function returns the number of display columns per character (in
TTY mode) of @var{charset}.
@end defun

@defun charset-direction charset
This function returns the display direction of @var{charset}---either
@code{l2r} or @code{r2l}.
@end defun

@defun charset-iso-final-char charset
This function returns the final byte of the ISO 2022 escape sequence
designating @var{charset}.
@end defun

@defun charset-iso-graphic-plane charset
This function returns either 0 or 1, depending on whether the position
codes of characters in @var{charset} map to the left or right half
of their font, respectively.
@end defun

@defun charset-ccl-program charset
This function returns the CCL program, if any, for converting
position codes of characters in @var{charset} into font indices.
@end defun

The only property of a charset that can currently be set after
the charset has been created is the CCL program.

@defun set-charset-ccl-program charset ccl-program
This function sets the @code{ccl-program} property of @var{charset} to
@var{ccl-program}.
@end defun

453 @node Predefined Charsets, , Charset Property Functions, Charsets
454 @subsection Predefined Charsets
456 The following charsets are predefined in the C code.
@example
Name                     Type   Fi  Gr  Dir  Registry
--------------------------------------------------------------
ascii                    94     B   0   l2r  ISO8859-1
control-1                94         0   l2r  ---
latin-iso8859-1          96     A   1   l2r  ISO8859-1
latin-iso8859-2          96     B   1   l2r  ISO8859-2
latin-iso8859-3          96     C   1   l2r  ISO8859-3
latin-iso8859-4          96     D   1   l2r  ISO8859-4
cyrillic-iso8859-5       96     L   1   l2r  ISO8859-5
arabic-iso8859-6         96     G   1   r2l  ISO8859-6
greek-iso8859-7          96     F   1   l2r  ISO8859-7
hebrew-iso8859-8         96     H   1   r2l  ISO8859-8
latin-iso8859-9          96     M   1   l2r  ISO8859-9
thai-tis620              96     T   1   l2r  TIS620
katakana-jisx0201        94     I   1   l2r  JISX0201.1976
latin-jisx0201           94     J   0   l2r  JISX0201.1976
japanese-jisx0208-1978   94x94  @@   0   l2r  JISX0208.1978
japanese-jisx0208        94x94  B   0   l2r  JISX0208.19(83|90)
japanese-jisx0212        94x94  D   0   l2r  JISX0212
chinese-gb2312           94x94  A   0   l2r  GB2312
chinese-cns11643-1       94x94  G   0   l2r  CNS11643.1
chinese-cns11643-2       94x94  H   0   l2r  CNS11643.2
chinese-big5-1           94x94  0   0   l2r  Big5
chinese-big5-2           94x94  1   0   l2r  Big5
korean-ksc5601           94x94  C   0   l2r  KSC5601
composite                96x96      0   l2r  ---
@end example

487 The following charsets are predefined in the Lisp code.
@example
Name                     Type   Fi  Gr  Dir  Registry
--------------------------------------------------------------
arabic-digit             94     2   0   l2r  MuleArabic-0
arabic-1-column          94     3   0   r2l  MuleArabic-1
arabic-2-column          94     4   0   r2l  MuleArabic-2
sisheng                  94     0   0   l2r  sisheng_cwnn\|OMRON_UDC_ZH
chinese-cns11643-3       94x94  I   0   l2r  CNS11643.1
chinese-cns11643-4       94x94  J   0   l2r  CNS11643.1
chinese-cns11643-5       94x94  K   0   l2r  CNS11643.1
chinese-cns11643-6       94x94  L   0   l2r  CNS11643.1
chinese-cns11643-7       94x94  M   0   l2r  CNS11643.1
ethiopic                 94x94  2   0   l2r  Ethio
ascii-r2l                94     B   0   r2l  ISO8859-1
ipa                      96     0   1   l2r  MuleIPA
vietnamese-lower         96     1   1   l2r  VISCII1.1
vietnamese-upper         96     2   1   l2r  VISCII1.1
@end example

For all of the above charsets, the dimension and number of columns are
the same.

511 Note that ASCII, Control-1, and Composite are handled specially.
512 This is why some of the fields are blank; and some of the filled-in
513 fields (e.g. the type) are not really accurate.
515 @node MULE Characters, Composite Characters, Charsets, MULE
516 @section MULE Characters
@defun make-char charset arg1 &optional arg2
This function makes a multi-byte character from @var{charset} and octets
@var{arg1} and @var{arg2}.
@end defun

@defun char-charset character
This function returns the character set of char @var{character}.
@end defun

@defun char-octet character &optional n
This function returns the octet (i.e. position code) numbered @var{n}
(should be 0 or 1) of char @var{character}. @var{n} defaults to 0 if omitted.
@end defun

@defun find-charset-region start end &optional buffer
This function returns a list of the charsets in the region between
@var{start} and @var{end}. @var{buffer} defaults to the current buffer
if omitted.
@end defun

@defun find-charset-string string
This function returns a list of the charsets in @var{string}.
@end defun

542 @node Composite Characters, Coding Systems, MULE Characters, MULE
543 @section Composite Characters
545 Composite characters are not yet completely implemented.
@defun make-composite-char string
This function converts a string into a single composite character. The
character is the result of overstriking all the characters in the
string.
@end defun

@defun composite-char-string character
This function returns a string of the characters comprising a composite
character.
@end defun

@defun compose-region start end &optional buffer
This function composes the characters in the region from @var{start} to
@var{end} in @var{buffer} into one composite character. The composite
character replaces the composed characters. @var{buffer} defaults to
the current buffer if omitted.
@end defun

@defun decompose-region start end &optional buffer
This function decomposes any composite characters in the region from
@var{start} to @var{end} in @var{buffer}. This converts each composite
character into one or more characters, the individual characters out of
which the composite character was formed. Non-composite characters are
left as-is. @var{buffer} defaults to the current buffer if omitted.
@end defun

573 @node Coding Systems, CCL, Composite Characters, MULE
574 @section Coding Systems
576 A coding system is an object that defines how text containing multiple
577 character sets is encoded into a stream of (typically 8-bit) bytes. The
578 coding system is used to decode the stream into a series of characters
579 (which may be from multiple charsets) when the text is read from a file
580 or process, and is used to encode the text back into the same format
581 when it is written out to a file or process.
583 For example, many ISO-2022-compliant coding systems (such as Compound
584 Text, which is used for inter-client data under the X Window System) use
585 escape sequences to switch between different charsets -- Japanese Kanji,
586 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
587 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See
588 @code{make-coding-system} for more information.
590 Coding systems are normally identified using a symbol, and the symbol is
591 accepted in place of the actual coding system object whenever a coding
592 system is called for. (This is similar to how faces and charsets work.)
@defun coding-system-p object
This function returns non-@code{nil} if @var{object} is a coding system.
@end defun

@menu
* Coding System Types::               Classifying coding systems.
* ISO 2022::                          An international standard for
                                         charsets and encodings.
* EOL Conversion::                    Dealing with different ways of denoting
                                         the end of a line.
* Coding System Properties::          Properties of a coding system.
* Basic Coding System Functions::     Working with coding systems.
* Coding System Property Functions::  Retrieving a coding system's properties.
* Encoding and Decoding Text::        Encoding and decoding text.
* Detection of Textual Encoding::     Determining how text is encoded.
* Big5 and Shift-JIS Functions::      Special functions for these non-standard
                                         encodings.
* Predefined Coding Systems::         Coding systems implemented by MULE.
@end menu

614 @node Coding System Types, ISO 2022, , Coding Systems
615 @subsection Coding System Types
617 The coding system type determines the basic algorithm XEmacs will use to
618 decode or encode a data stream. Character encodings will be converted
619 to the MULE encoding, escape sequences processed, and newline sequences
620 converted to XEmacs's internal representation. There are three basic
621 classes of coding system type: no-conversion, ISO-2022, and special.
No conversion allows you to look at the file's internal representation.
Since XEmacs is basically a text editor, ``no conversion'' does convert
newline conventions by default. (Use the @code{binary} coding system if
this is not desired.)

628 ISO 2022 (@pxref{ISO 2022}) is the basic international standard regulating
use of ``coded character sets for the exchange of data'', i.e., text
streams. ISO 2022 contains functions that make it possible to encode
text streams to comply with restrictions of the Internet mail system and
de facto restrictions of most file systems (e.g., use of the separator
633 character in file names). Coding systems which are not ISO 2022
634 conformant can be difficult to handle. Perhaps more important, they are
635 not adaptable to multilingual information interchange, with the obvious
636 exception of ISO 10646 (Unicode). (Unicode is partially supported by
637 XEmacs with the addition of the Lisp package ucs-conv.)
The special class of coding systems includes automatic detection, CCL (a
``little language'' embedded as an interpreter, useful for translating
641 between variants of a single character set), non-ISO-2022-conformant
642 encodings like Unicode, Shift JIS, and Big5, and MULE internal coding.
643 (NB: this list is based on XEmacs 21.2. Terminology may vary slightly
644 for other versions of XEmacs and for GNU Emacs 20.)
648 No conversion, for binary files, and a few special cases of non-ISO-2022
649 coding systems where conversion is done by hook functions (usually
650 implemented in CCL). On output, graphic characters that are not in
651 ASCII or Latin-1 will be replaced by a @samp{?}. (For a
652 no-conversion-encoded buffer, these characters will only be present if
653 you explicitly insert them.)
655 Any ISO-2022-compliant encoding. Among others, this includes JIS (the
656 Japanese encoding commonly used for e-mail), national variants of EUC
657 (the standard Unix encoding for Japanese and other languages), and
658 Compound Text (an encoding used in X11). You can specify more specific
659 information about the conversion with the @var{flags} argument.
661 ISO 10646 UCS-4 encoding. A 31-bit fixed-width superset of Unicode.
663 ISO 10646 UTF-8 encoding. A ``file system safe'' transformation format
664 that can be used with both UCS-4 and Unicode.
666 Automatic conversion. XEmacs attempts to detect the coding system used
669 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
671 Big5 (the encoding commonly used for traditional Chinese in Taiwan).
673 The conversion is performed using a user-written pseudo-code program.
674 CCL (Code Conversion Language) is the name of this pseudo-code. For
675 example, CCL is used to map KOI8-R characters (an encoding for Russian
676 Cyrillic) to ISO8859-5 (the form used internally by MULE).
678 Write out or read in the raw contents of the memory representing the
679 buffer's text. This is primarily useful for debugging purposes, and is
680 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
681 (the @samp{--debug} configure option). @strong{Warning}: Reading in a
682 file using @code{internal} conversion can result in an internal
683 inconsistency in the memory representing a buffer's text, which will
684 produce unpredictable results and may cause XEmacs to crash. Under
685 normal circumstances you should never use @code{internal} conversion.
688 @node ISO 2022, EOL Conversion, Coding System Types, Coding Systems
691 This section briefly describes the ISO 2022 encoding standard. A more
692 thorough treatment is available in the original document of ISO
693 2022 as well as various national standards (such as JIS X 0202).
695 Character sets (@dfn{charsets}) are classified into the following four
696 categories, according to the number of characters in the charset:
697 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means
698 that although an ISO 2022 coding system may have variable width
699 characters, each charset used is fixed-width (in contrast to the MULE
700 character set and UTF-8, for example).
702 ISO 2022 provides for switching between character sets via escape
703 sequences. This switching is somewhat complicated, because ISO 2022
704 provides for both legacy applications like Internet mail that accept
705 only 7 significant bits in some contexts (RFC 822 headers, for example),
706 and more modern "8-bit clean" applications. It also provides for
707 compact and transparent representation of languages like Japanese which
708 mix ASCII and a national script (even outside of computer programs).
710 First, ISO 2022 codified prevailing practice by dividing the code space
711 into "control" and "graphic" regions. The code points 0x00-0x1F and
712 0x80-0x9F are reserved for "control characters", while "graphic
713 characters" must be assigned to code points in the regions 0x20-0x7F and
714 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some
715 circumstances must be assigned the graphic character "ASCII SPACE" and
716 the control character "ASCII DEL" respectively.
718 The various regions are given the name C0 (0x00-0x1F), GL (0x20-0x7F),
719 C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left"
720 and "graphic right", respectively, because of the standard method of
721 displaying graphic character sets in tables with the high nibble indexing
722 columns and the low nibble indexing rows. I don't find it very intuitive,
723 but these are called "registers".
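
As a rough illustration of the four regions just named, the following
sketch maps a byte value to its region. The function name
@code{iso2022-byte-region} is invented for this example and is not part
of XEmacs.

@example
(defun iso2022-byte-region (byte)
  "Return the ISO 2022 code-space region of BYTE, an integer from 0 to 255.
The value is one of the symbols c0, gl, c1 or gr."
  (cond ((< byte #x20) 'c0)     ; 0x00-0x1F: control characters
        ((< byte #x80) 'gl)     ; 0x20-0x7F: graphic left
        ((< byte #xA0) 'c1)     ; 0x80-0x9F: control characters
        (t 'gr)))               ; 0xA0-0xFF: graphic right

(iso2022-byte-region #x0A)
     @result{} c0
(iso2022-byte-region #x41)
     @result{} gl
(iso2022-byte-region #x8E)
     @result{} c1
(iso2022-byte-region #xC6)
     @result{} gr
@end example

The sketch ignores the special status of 0x20 and 0x7F mentioned above.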
725 An ISO 2022-conformant encoding for a graphic character set must use a
726 fixed number of bytes per character, and the values must fit into a
727 single register; that is, each byte must range over either 0x20-0x7F, or
728 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a
729 character set by using both ranges at the same time. This is why a standard
730 character set such as ISO 8859-1 is actually considered by ISO 2022 to
731 be an aggregation of two character sets, ASCII and LATIN-1, and why it
732 is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a
733 single character's bytes must all be drawn from the same register; this
734 is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO
735 2022-compatible encodings.
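
The single-region rule can be checked mechanically. The sketch below
uses two invented helper functions (neither is an XEmacs function) and
takes the bytes of one character as a list of integers. The sample byte
values for Shift JIS and Big5 are given for illustration only.

@example
(defun iso2022-graphic-region (byte)
  "Return gl or gr if BYTE lies in a graphic region, otherwise nil."
  (cond ((and (>= byte #x20) (<= byte #x7F)) 'gl)
        ((and (>= byte #xA0) (<= byte #xFF)) 'gr)
        (t nil)))

(defun iso2022-same-region-p (bytes)
  "Return t if every byte in BYTES is graphic and all share one region."
  (let ((region (iso2022-graphic-region (car bytes)))
        (ok t))
    (while (and ok bytes)
      (setq ok (and region
                    (eq region (iso2022-graphic-region (car bytes))))
            bytes (cdr bytes)))
    ok))

(iso2022-same-region-p '(#xC6 #xFC))   ; two GR bytes, as in EUC-JP
     @result{} t
(iso2022-same-region-p '(#x96 #x7B))   ; a Shift JIS pair: C1 byte, GL byte
     @result{} nil
(iso2022-same-region-p '(#xA4 #x40))   ; a Big5 pair: GR byte, GL byte
     @result{} nil
@end example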
737 The reason for this restriction becomes clear when you attempt to define
738 an efficient, robust encoding for a language like Japanese. Like ISO
739 8859, Japanese encodings are aggregations of several character sets. In
740 practice, the vast majority of characters are drawn from the "JIS Roman"
741 character set (a derivative of ASCII; it won't hurt to think of it as
742 ASCII) and the JIS X 0208 standard "basic Japanese" character set
743 including not only ideographic characters ("kanji") but syllabic
744 Japanese characters ("kana"), a wide variety of symbols, and many
745 alphabetic characters (Roman, Greek, and Cyrillic) as well. Although
746 JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not
747 suited to programming; thus the inclusion of ASCII in the standard
750 For normal Japanese text such as in newspapers, a broad repertoire of
751 approximately 3000 characters is used. Evidently this won't fit into
752 one byte; two must be used. But much of the text processed by Japanese
753 computers is computer source code, nearly all of which is ASCII. A not
754 insignificant portion of ordinary text is English (as such or as
755 borrowed Japanese vocabulary) or other languages which can be represented
756 at least approximately in ASCII, as well. It seems reasonable then to
757 represent ASCII in one byte, and JIS X 0208 in two. And this is exactly
758 what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is
759 invoked to the GL register, and JIS X 0208 is invoked to the GR
760 register. Thus, each byte can be tested for its character set by
761 looking at the high bit; if set, it is Japanese, if clear, it is ASCII.
762 Furthermore, since control characters like newline can never be part of
763 a graphic character, even in the case of corruption in transmission the
764 stream will be resynchronized at every line break, on the order of 60-80
765 bytes. This coding system requires no escape sequences or special
766 control codes to represent 99.9% of all Japanese text.
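
The high-bit test described above can be sketched as follows. The
function name is invented for this example. It handles only the two
main character sets and ignores the single shifts that a full EUC-JP
decoder would also have to recognize.

@example
(defun euc-jp-classify-bytes (bytes)
  "Return a list naming the charset of each character in BYTES.
BYTES is a list of integers.  A byte with the high bit clear is taken
as one ASCII (or control) character; a byte with the high bit set is
taken, together with the byte after it, as one JIS X 0208 character."
  (let ((result nil))
    (while bytes
      (if (zerop (logand (car bytes) #x80))
          (setq result (cons 'ascii result)
                bytes (cdr bytes))
        (setq result (cons 'japanese-jisx0208 result)
              bytes (cdr (cdr bytes)))))
    (nreverse result)))

;; An "A", one two-byte JIS X 0208 character, and a newline.
(euc-jp-classify-bytes '(#x41 #xC6 #xFC #x0A))
     @result{} (ascii japanese-jisx0208 ascii)
@end example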
768 Note carefully the distinction between the character sets (ASCII and JIS
769 X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The
770 JIS X 0208 character set is used in three different encodings for
771 Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is
772 always clear), in EUC-JP it is invoked into GR (setting the high bit in
773 the process), and in Shift JIS the high bit may be set or reset, and the
774 significant bits are shifted within the 16-bit character so that the two
775 main character sets can coexist with a third (the "halfwidth katakana"
776 of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a
777 version of the ISO-2022 coding system.
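
To make the difference concrete, the following sketch derives the
EUC-JP and Shift JIS byte pairs for a single JIS X 0208 code point. In
ISO-2022-JP the same two bytes appear unchanged in GL, after the
designation @samp{ESC $ B}. Both function names are invented for this
example, and the Shift JIS arithmetic is the commonly published
conversion, not code taken from XEmacs.

@example
(defun jisx0208-to-euc-jp (j1 j2)
  "Return the EUC-JP bytes for the JIS X 0208 code point J1 J2.
Invoking the charset into GR amounts to setting the high bit."
  (list (logior j1 #x80) (logior j2 #x80)))

(defun jisx0208-to-shift-jis (j1 j2)
  "Return the Shift JIS bytes for the JIS X 0208 code point J1 J2.
J1 and J2 must each be in the range 0x21-0x7E."
  (let ((s1 (+ (/ (- j1 #x21) 2) #x81))
        (s2 (if (= (% j1 2) 1)
                (+ j2 #x1F)       ; odd row: lower half of second bytes
              (+ j2 #x7E))))      ; even row: upper half of second bytes
    (if (> s1 #x9F)
        (setq s1 (+ s1 #x40)))    ; skip 0xA0-0xDF (halfwidth katakana)
    (if (and (= (% j1 2) 1) (>= s2 #x7F))
        (setq s2 (1+ s2)))        ; skip 0x7F (DEL)
    (list s1 s2)))

;; The code point 0x46 0x7C in each encoding.
(equal (jisx0208-to-euc-jp #x46 #x7C) '(#xC6 #xFC))
     @result{} t
(equal (jisx0208-to-shift-jis #x46 #x7C) '(#x93 #xFA))
     @result{} t
@end example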
779 In order to systematically treat subsidiary character sets (like the
780 "halfwidth katakana" already mentioned, and the "supplementary kanji" of
781 JIS X 0212), four further registers are defined: G0, G1, G2, and G3.
782 Unlike GL and GR, they are not logically distinguished by internal
783 format. Instead, the process of "invocation" mentioned earlier is
784 broken into two steps: first, a character set is @dfn{designated} to one
785 of the registers G0-G3 by use of an @dfn{escape sequence} of the form:
788 ESC [@var{I}] @var{I} @var{F}
791 where @var{I} is an intermediate character or characters in the range
792 0x20-0x2F, and @var{F}, from the range 0x30-0x7E, is the final
793 character identifying this charset. (Final characters in the range
794 0x30-0x3F are reserved for private use and will never have a publicly
797 Then that register is @dfn{invoked} to either GL or GR, either
798 automatically (designations to G0 normally involve invocation to GL as
799 well), or by use of shifting (affecting only the following character in
800 the data stream) or locking (effective until the next designation or
801 locking) control sequences. An encoding conformant to ISO 2022 is
802 typically defined by designating the initial contents of the G0-G3
803 registers, specifying a 7- or 8-bit environment, and specifying whether
804 further designations will be recognized.
806 Some examples of character sets and the registered final characters
807 @var{F} used to designate them:
812 ASCII (B), left (J) and right (I) half of JIS X 0201, ...
814 Latin-1 (A), Latin-2 (B), Latin-3 (C), ...
816 GB2312 (A), JIS X 0208 (B), KSC5601 (C), ...
821 The meanings of the various characters in these sequences, where not
822 specified by the ISO 2022 standard (such as the ESC character), are
823 assigned by @dfn{ECMA}, the European Computer Manufacturers Association.
825 The meanings of the intermediate characters are:
829 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
830 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
831 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
832 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
833 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
834 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
835 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
836 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
837 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
841 The comma may be used in files read and written only by MULE, as a MULE
842 extension, but this is illegal in ISO 2022. (The reason is that in ISO
843 2022 G0 must be a 94-member character set, with 0x20 assigned the value
844 SPACE, and 0x7F assigned the value DEL.)
846 Here are examples of designations:
850 ESC ( B : designate to G0 ASCII
851 ESC - A : designate to G1 Latin-1
852 ESC $ ( A or ESC $ A : designate to G0 GB2312
853 ESC $ ( B or ESC $ B : designate to G0 JISX0208
854 ESC $ ) C : designate to G1 KSC5601
858 (The short forms used to designate GB2312 and JIS X 0208 are for
859 backwards compatibility; the long forms are preferred.)
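
The long-form designations listed above follow a regular pattern, which
the following sketch reproduces. The function
@code{iso2022-designation} is invented for this example. It does not
generate the short forms, and it will build the MULE-only comma form
for a 96-charset in G0 if asked to.

@example
(defun iso2022-designation (register size dimension final)
  "Return the long-form escape sequence designating a charset.
REGISTER is 0-3 (for G0-G3), SIZE is 94 or 96, DIMENSION is 1 or 2,
and FINAL is the final character, given as a one-character string."
  (concat "\e"
          (if (= dimension 2) "$" "")
          ;; Pick the intermediate character for this register.
          (substring (if (= size 94) "()*+" ",-./")
                     register (1+ register))
          final))

(string= (iso2022-designation 0 94 1 "B") "\e(B")    ; ASCII to G0
     @result{} t
(string= (iso2022-designation 1 96 1 "A") "\e-A")    ; Latin-1 to G1
     @result{} t
(string= (iso2022-designation 0 94 2 "B") "\e$(B")   ; JIS X 0208 to G0
     @result{} t
(string= (iso2022-designation 1 94 2 "C") "\e$)C")   ; KSC5601 to G1
     @result{} t
@end example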
861 To use a charset designated to G2 or G3, and to use a charset designated
862 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
863 into GL. There are two types of invocation: Locking Shift (in effect until the
864 next designation or locking shift) and Single Shift (one character only).
866 Locking Shift is done as follows:
869 LS0 or SI (0x0F): invoke G0 into GL
870 LS1 or SO (0x0E): invoke G1 into GL
871 LS2: invoke G2 into GL
872 LS3: invoke G3 into GL
873 LS1R: invoke G1 into GR
874 LS2R: invoke G2 into GR
875 LS3R: invoke G3 into GR
878 Single Shift is done as follows:
882 SS2 or ESC N: invoke G2 into GL
883 SS3 or ESC O: invoke G3 into GL
887 The shift functions (such as LS1R and SS3) are represented by control
888 characters (from C1) in 8 bit environments and by escape sequences in 7
891 (#### Ben says: I think the above is slightly incorrect. It appears that
892 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
893 ESC O behave as indicated. The above definitions will not parse
894 EUC-encoded text correctly, and it looks like the code in mule-coding.c
895 has similar problems.)
897 Evidently there are a lot of ISO-2022-compliant ways of encoding
898 multilingual text. Many coding systems in actual use, such as X11's
899 Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX
900 Code), are variants of ISO 2022.
902 In MULE, we characterize a version of ISO 2022 by the following
907 The character sets initially designated to G0 through G3.
909 Whether short form designations are allowed for Japanese and Chinese.
911 Whether ASCII should be designated to G0 before control characters.
913 Whether ASCII should be designated to G0 at the end of line.
915 7-bit environment or 8-bit environment.
917 Whether Locking Shifts are used or not.
919 Whether to use ASCII or the variant JIS X 0201-1976-Roman.
921 Whether to use JIS X 0208-1983 or the older version JIS X 0208-1976.
924 (The last two are only for Japanese.)
926 By specifying these attributes, you can create any variant
929 Here are several examples:
933 ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
934 1. G0 <- ASCII, G1..3 <- never used
941 8. Use JIS X 0208-1983
945 ctext -- X11 Compound Text
946 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
950 5. 8-bit environment.
953 8. Use JIS X 0208-1983.
957 euc-china -- Chinese EUC. Often called the "GB encoding", but that is
958 technically incorrect.
959 1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
963 5. 8-bit environment.
966 8. Use JIS X 0208-1983.
970 ISO-2022-KR -- Coding system used in Korean email.
971 1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
975 5. 7-bit environment.
978 8. Use JIS X 0208-1983.
982 MULE creates all of these coding systems by default.
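
A variant of this kind might be declared with
@code{make-coding-system} roughly as follows. This is only a sketch:
the coding system name @code{my-euc-jp-sketch} is invented, and the
property names @code{charset-g0} through @code{charset-g3} and
@code{mnemonic} are assumptions that should be checked against
@ref{Coding System Properties}, before use.

@example
;; An 8-bit Japanese EUC-like variant.  The charset names are the
;; ones given for the predefined Japanese EUC coding system.
(make-coding-system
 'my-euc-jp-sketch 'iso2022
 "Sketch of a Japanese EUC-like coding system."
 '(charset-g0 ascii
   charset-g1 japanese-jisx0208
   charset-g2 katakana-jisx0201
   charset-g3 japanese-jisx0212
   mnemonic "MyEUC"))
@end example

Attributes that are not mentioned in the property list are left at
their defaults.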
984 @node EOL Conversion, Coding System Properties, ISO 2022, Coding Systems
985 @subsection EOL Conversion
989 Automatically detect the end-of-line type (LF, CRLF, or CR). Also
990 generate subsidiary coding systems named @code{@var{name}-unix},
991 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
992 this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf},
993 and @code{cr}, respectively.
995 The end of a line is marked externally using ASCII LF. Since this is
996 also the way that XEmacs represents an end-of-line internally,
997 specifying this option results in no end-of-line conversion. This is
998 the standard format for Unix text files.
1000 The end of a line is marked externally using ASCII CRLF. This is the
1001 standard format for MS-DOS text files.
1003 The end of a line is marked externally using ASCII CR. This is the
1004 standard format for Macintosh text files.
1006 Automatically detect the end-of-line type but do not generate subsidiary
1007 coding systems. (This value is converted to @code{nil} when stored
1008 internally, and @code{coding-system-property} will return @code{nil}.)
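
The three conventions can be told apart by looking at the first line
break in a piece of text. The following sketch does so for a string.
The function @code{guess-eol-type} is invented for this example and is
not the detection code used by XEmacs.

@example
(defun guess-eol-type (string)
  "Return crlf, cr or lf according to the first line break in STRING.
Return nil if STRING contains no line break."
  (if (string-match "\r\n\\|\r\\|\n" string)
      (let ((break (match-string 0 string)))
        (cond ((string= break "\r\n") 'crlf)
              ((string= break "\r") 'cr)
              (t 'lf)))
    nil))

(guess-eol-type "one\r\ntwo\r\n")
     @result{} crlf
(guess-eol-type "one\rtwo")
     @result{} cr
(guess-eol-type "one\ntwo")
     @result{} lf
(guess-eol-type "no line break")
     @result{} nil
@end example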
1011 @node Coding System Properties, Basic Coding System Functions, EOL Conversion, Coding Systems
1012 @subsection Coding System Properties
1016 String to be displayed in the modeline when this coding system is
1020 End-of-line conversion to be used. It should be one of the types
1021 listed in @ref{EOL Conversion}.
1024 The coding system which is the same as this one, except that it uses the
1025 Unix line-breaking convention.
1028 The coding system which is the same as this one, except that it uses the
1029 DOS line-breaking convention.
1032 The coding system which is the same as this one, except that it uses the
1033 Macintosh line-breaking convention.
1035 @item post-read-conversion
1036 Function called after a file has been read in, to perform the decoding.
1037 Called with two arguments, @var{start} and @var{end}, denoting a region of
1038 the current buffer to be decoded.
1040 @item pre-write-conversion
1041 Function called before a file is written out, to perform the encoding.
1042 Called with two arguments, @var{start} and @var{end}, denoting a region of
1043 the current buffer to be encoded.
1046 The following additional properties are recognized if @var{type} is
1054 The character set initially designated to the G0 - G3 registers.
1055 The value should be one of
1059 A charset object (designate that character set)
1061 @code{nil} (do not ever use this register)
1063 @code{t} (no character set is initially designated to the register, but
1064 may be later on; this automatically sets the corresponding
1065 @code{force-g*-on-output} property)
1068 @item force-g0-on-output
1069 @itemx force-g1-on-output
1070 @itemx force-g2-on-output
1071 @itemx force-g3-on-output
1072 If non-@code{nil}, send an explicit designation sequence on output
1073 before using the specified register.
1076 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
1077 and @samp{ESC $ B} on output in place of the full designation sequences
1078 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
1081 If non-@code{nil}, don't designate ASCII to G0 at each end of line on
1082 output. Setting this to non-@code{nil} also suppresses other
1083 state-resetting that normally happens at the end of a line.
1086 If non-@code{nil}, don't designate ASCII to G0 before control chars on
1090 If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit
1094 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
1095 designation by escape sequence.
1098 If non-@code{nil}, don't use ISO6429's direction specification.
1101 If non-@code{nil}, literal control characters that are the same as the
1102 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
1103 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
1104 and CSI (0x9B)) are ``quoted'' with an escape character so that they can
1105 be properly distinguished from an escape sequence. (Note that doing
1106 this results in a non-portable encoding.) This encoding flag is used for
1107 byte-compiled files. Note that ESC is a good choice for a quoting
1108 character because there are no escape sequences whose second byte is a
1109 character from the Control-0 or Control-1 character sets; this is
1110 explicitly disallowed by the ISO 2022 standard.
1112 @item input-charset-conversion
1113 A list of conversion specifications, specifying conversion of characters
1114 in one charset to another when decoding is performed. Each
1115 specification is a list of two elements: the source charset, and the
1116 destination charset.
1118 @item output-charset-conversion
1119 A list of conversion specifications, specifying conversion of characters
1120 in one charset to another when encoding is performed. The form of each
1121 specification is the same as for @code{input-charset-conversion}.
1124 The following additional properties are recognized (and required) if
1125 @var{type} is @code{ccl}:
1129 CCL program used for decoding (converting to internal format).
1132 CCL program used for encoding (converting to external format).
1135 The following properties are used internally: @code{eol-cr},
1136 @code{eol-crlf}, @code{eol-lf}, and @code{base}.
1138 @node Basic Coding System Functions, Coding System Property Functions, Coding System Properties, Coding Systems
1139 @subsection Basic Coding System Functions
1141 @defun find-coding-system coding-system-or-name
1142 This function retrieves the coding system of the given name.
1144 If @var{coding-system-or-name} is a coding-system object, it is simply
1145 returned. Otherwise, @var{coding-system-or-name} should be a symbol.
1146 If there is no such coding system, @code{nil} is returned. Otherwise
1147 the associated coding system object is returned.
1150 @defun get-coding-system name
1151 This function retrieves the coding system of the given name. Same as
1152 @code{find-coding-system} except an error is signalled if there is no
1153 such coding system instead of returning @code{nil}.
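
Going by the descriptions above, the two functions differ only in how
they treat an unknown name. The symbol @code{no-such-coding-system} is
assumed here not to name any coding system.

@example
;; find-coding-system reports failure by returning nil.
(find-coding-system 'no-such-coding-system)
     @result{} nil

;; get-coding-system signals an error instead, which can be caught.
(condition-case nil
    (get-coding-system 'no-such-coding-system)
  (error 'not-defined))
     @result{} not-defined
@end example

@code{find-coding-system} is therefore the natural choice for testing
whether a name is defined.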
1156 @defun coding-system-list
1157 This function returns a list of the names of all defined coding systems.
1160 @defun coding-system-name coding-system
1161 This function returns the name of the given coding system.
1164 @defun coding-system-base coding-system
1165 Returns the base coding system (undecided EOL convention)
1169 @defun make-coding-system name type &optional doc-string props
1170 This function registers symbol @var{name} as a coding system.
1172 @var{type} describes the conversion method used and should be one of
1173 the types listed in @ref{Coding System Types}.
1175 @var{doc-string} is a string describing the coding system.
1177 @var{props} is a property list, describing the specific nature of the
1178 coding system. Recognized properties are as in @ref{Coding System
1182 @defun copy-coding-system old-coding-system new-name
1183 This function copies @var{old-coding-system} to @var{new-name}. If
1184 @var{new-name} does not name an existing coding system, a new one will
1188 @defun subsidiary-coding-system coding-system eol-type
1189 This function returns the subsidiary coding system of
1190 @var{coding-system} with eol type @var{eol-type}.
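
A minimal sketch of its use follows. The wrapper function is invented
for this example, and the @var{eol-type} value @code{crlf} is assumed
to be one of the symbols listed in @ref{EOL Conversion}.

@example
(defun coding-system-dos-variant (coding-system)
  "Return the subsidiary of CODING-SYSTEM that uses CRLF line breaks."
  (subsidiary-coding-system (get-coding-system coding-system) 'crlf))

;; By the naming convention for subsidiaries, the name returned here
;; would be expected to end in "-dos".
(coding-system-name (coding-system-dos-variant 'euc-jp))
@end example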
1193 @node Coding System Property Functions, Encoding and Decoding Text, Basic Coding System Functions, Coding Systems
1194 @subsection Coding System Property Functions
1196 @defun coding-system-doc-string coding-system
1197 This function returns the doc string for @var{coding-system}.
1200 @defun coding-system-type coding-system
1201 This function returns the type of @var{coding-system}.
1204 @defun coding-system-property coding-system prop
1205 This function returns the @var{prop} property of @var{coding-system}.
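
The three accessors can be combined to summarize a coding system, as in
this sketch. The property name @code{mnemonic} is an assumption. The
values returned depend on how the coding system was defined, so none
are shown.

@example
(let ((cs (get-coding-system 'euc-jp)))
  (list (coding-system-type cs)                  ; the conversion type
        (coding-system-property cs 'mnemonic)    ; the modeline string
        (coding-system-doc-string cs)))          ; the description
@end example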
1208 @node Encoding and Decoding Text, Detection of Textual Encoding, Coding System Property Functions, Coding Systems
1209 @subsection Encoding and Decoding Text
1211 @defun decode-coding-region start end coding-system &optional buffer
1212 This function decodes the text between @var{start} and @var{end} which
1213 is encoded in @var{coding-system}. This is useful if you've read in
1214 encoded text from a file without decoding it (e.g. you read in a
1215 JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
1216 system, so that it shows up as @samp{^[$B!<!+^[(B}). The length of the
1217 encoded text is returned. @var{buffer} defaults to the current buffer
1221 @defun encode-coding-region start end coding-system &optional buffer
1222 This function encodes the text between @var{start} and @var{end} using
1223 @var{coding-system}. This will, for example, convert Japanese
1224 characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use the JIS
1225 encoding. The length of the encoded text is returned. @var{buffer}
1226 defaults to the current buffer if unspecified.
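
For instance, a buffer that was read in without conversion could be
decoded after the fact with a small wrapper such as the following. The
function name is invented for this example.

@example
(defun decode-whole-buffer (coding-system)
  "Decode the accessible portion of the current buffer as CODING-SYSTEM.
The buffer is assumed to hold text that was read in without conversion."
  (decode-coding-region (point-min) (point-max)
                        (get-coding-system coding-system)))

;; Typical use, in a buffer showing raw ISO-2022-JP escape sequences:
;; (decode-whole-buffer 'iso-2022-jp)
@end example

Since @code{get-coding-system} is used, an unknown coding system name
signals an error before the buffer is touched.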
1229 @node Detection of Textual Encoding, Big5 and Shift-JIS Functions, Encoding and Decoding Text, Coding Systems
1230 @subsection Detection of Textual Encoding
1232 @defun coding-category-list
1233 This function returns a list of all recognized coding categories.
1236 @defun set-coding-priority-list list
1237 This function changes the priority order of the coding categories.
1238 @var{list} should be a list of coding categories, in descending order of
1239 priority. Unspecified coding categories will be lower in priority than
1240 all specified ones, in the same relative order they were in previously.
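
Because unspecified categories keep their relative order, passing a
one-element list is enough to move a single category to the front. The
following sketch relies on that. The function name is invented for this
example.

@example
(defun prefer-coding-category (category)
  "Give CATEGORY the highest priority for coding system detection.
Signal an error if CATEGORY is not a recognized coding category."
  (if (memq category (coding-category-list))
      (set-coding-priority-list (list category))
    (error "Unknown coding category: %s" category)))
@end example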
1243 @defun coding-priority-list
1244 This function returns a list of coding categories in descending order of
1248 @defun set-coding-category-system coding-category coding-system
1249 This function changes the coding system associated with a coding category.
1252 @defun coding-category-system coding-category
1253 This function returns the coding system associated with a coding category.
1256 @defun detect-coding-region start end &optional buffer
1257 This function detects the coding system of the text in the region between
1258 @var{start} and @var{end}. The returned value is a list of possible coding
1259 systems ordered by priority. If only ASCII characters are found, it
1260 returns @code{autodetect} or one of its subsidiary coding systems
1261 according to a detected end-of-line type. Optional arg @var{buffer}
1262 defaults to the current buffer.
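
Since the result may be either a list or a single coding system, a
caller that wants only the best guess has to handle both cases. A
minimal sketch, using an invented function name:

@example
(defun best-guess-coding-system ()
  "Return the highest-priority coding system detected for the buffer.
Only the accessible portion of the current buffer is examined."
  (let ((guess (detect-coding-region (point-min) (point-max))))
    (if (consp guess)
        (car guess)      ; a list ordered by priority: take the first
      guess)))           ; a single value, as in the ASCII-only case
@end example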
1265 @node Big5 and Shift-JIS Functions, Predefined Coding Systems, Detection of Textual Encoding, Coding Systems
1266 @subsection Big5 and Shift-JIS Functions
1268 These are special functions for working with the non-standard
1269 Shift-JIS and Big5 encodings.
1271 @defun decode-shift-jis-char code
1272 This function decodes a JIS X 0208 character encoded in the Shift-JIS coding system.
1273 @var{code} is the character code in Shift-JIS as a cons of two bytes.
1274 The corresponding character is returned.
1277 @defun encode-shift-jis-char character
1278 This function encodes a JIS X 0208 character @var{character} in the
1279 Shift-JIS coding system. The corresponding character code in Shift-JIS
1280 is returned as a cons of two bytes.
1283 @defun decode-big5-char code
1284 This function decodes a character encoded in the Big5 coding system.
1285 @var{code} is the character code in Big5. The corresponding character
1289 @defun encode-big5-char character
1290 This function encodes the Big5 character @var{character} in the Big5
1291 coding system. The corresponding character code in Big5 is returned.
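
The encoding and decoding functions are inverses of each other, so a
round trip should return the original character. The following sketch
shows this for Shift-JIS. The function name is invented, and
@var{character} is assumed to belong to the JIS X 0208 character set.

@example
(defun shift-jis-round-trip-p (character)
  "Return t if CHARACTER survives encoding to and decoding from Shift-JIS."
  (let ((code (encode-shift-jis-char character)))   ; a cons of two bytes
    (equal character (decode-shift-jis-char code))))
@end example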
1294 @node Predefined Coding Systems, , Big5 and Shift-JIS Functions, Coding Systems
1295 @subsection Coding Systems Implemented
1297 MULE initializes most of the commonly used coding systems at XEmacs's
1298 startup. A few others are initialized only when the relevant language
1299 environment is selected and support libraries are loaded. (NB: The
1300 following list is based on XEmacs 21.2.19, the development branch at the
1301 time of writing. The list may be somewhat different for other
1302 versions. Recent versions of GNU Emacs 20 implement a few more rare
1303 coding systems; work is being done to port these to XEmacs.)
1305 Unfortunately, there is not a consistent naming convention for character
1306 sets, and for practical purposes coding systems often take their name
1307 from their principal character sets (ASCII, KOI8-R, Shift JIS). Others
1308 take their names from the coding system (ISO-2022-JP, EUC-KR), and a few
1309 from their non-text usages (internal, binary). To provide for this, and
1310 for the fact that many coding systems have several common names, an
1311 aliasing system is provided. Finally, some effort has been made to use
1312 names that are registered as MIME charsets (this is why the name
1313 'shift_jis contains that un-Lisp-y underscore).
1315 There is a systematic naming convention regarding end-of-line (EOL)
1316 conventions for different systems. A coding system whose name ends in
1317 "-unix" forces the assumption that lines are broken by newlines (0x0A).
1318 A coding system whose name ends in "-mac" forces the assumption that
1319 lines are broken by ASCII CRs (0x0D). A coding system whose name ends
1320 in "-dos" forces the assumption that lines are broken by CRLF sequences
1321 (0x0D 0x0A). These subsidiary coding systems are automatically derived
1322 from a base coding system. Use of the base coding system implies
1323 autodetection of the text file convention. (The fact that the -unix,
1324 -mac, and -dos are derived from a base system results in them showing up
1325 as "aliases" in `list-coding-systems'.) These subsidiaries have a
1326 consistent modeline indicator as well. "-dos" coding systems have ":T"
1327 appended to their modeline indicator, while "-mac" coding systems have
1328 ":t" appended (e.g., "ISO8:t" for iso-2022-8-mac).
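
Because the convention is purely a matter of suffixes, the subsidiary
names can be derived from the base name. The following sketch does
this with an invented function. It only builds the symbols and does not
check that the coding systems exist.

@example
(defun eol-subsidiary-names (base)
  "Return the names of the EOL subsidiaries of the coding system BASE.
BASE is a symbol naming the base coding system."
  (mapcar (lambda (suffix)
            (intern (concat (symbol-name base) suffix)))
          '("-unix" "-dos" "-mac")))

(eol-subsidiary-names 'iso-2022-8)
     @result{} (iso-2022-8-unix iso-2022-8-dos iso-2022-8-mac)
@end example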
1330 In the following table, each coding system is given with its mode line
1331 indicator in parentheses. Non-textual coding systems are listed first,
1332 followed by textual coding systems and their aliases. (The coding system
1333 subsidiary modeline indicators ":T" and ":t" will be omitted from the
1334 table of coding systems.)
1336 ### SJT 1999-08-23 Maybe should order these by language? Definitely
1337 need language usage for the ISO-8859 family.
1339 Note that although true coding system aliases have been implemented for
1340 XEmacs 21.2, the coding system initialization has not yet been converted
1341 as of 21.2.19. So coding systems described as aliases have the same
1342 properties as the aliased coding system, but will not be equal as Lisp
1347 @item automatic-conversion
1349 @itemx undecided-dos
1350 @itemx undecided-mac
1351 @itemx undecided-unix
1353 Modeline indicator: @code{Auto}. A type @code{undecided} coding system.
1354 Attempts to determine an appropriate coding system from file contents or
1358 @itemx no-conversion
1361 @itemx raw-text-unix
1362 @itemx no-conversion-dos
1363 @itemx no-conversion-mac
1364 @itemx no-conversion-unix
1366 Modeline indicator: @code{Raw}. A type @code{no-conversion} coding system,
1367 which converts only line-break-codes. An implementation quirk means
1368 that this coding system is also used for ISO8859-1.
1371 Modeline indicator: @code{Binary}. A type @code{no-conversion} coding
1372 system which does no character coding or EOL conversions. An alias for
1373 @code{raw-text-unix}.
1376 @itemx alternativnyj-dos
1377 @itemx alternativnyj-mac
1378 @itemx alternativnyj-unix
1380 Modeline indicator: @code{Cy.Alt}. A type @code{ccl} coding system used for
1381 Alternativnyj, an encoding of the Cyrillic alphabet.
1388 Modeline indicator: @code{Zh/Big5}. A type @code{big5} coding system used for
1389 BIG5, the most common encoding of traditional Chinese as used in Taiwan.
1392 @itemx cn-gb-2312-dos
1393 @itemx cn-gb-2312-mac
1394 @itemx cn-gb-2312-unix
1396 Modeline indicator: @code{Zh-GB/EUC}. A type @code{iso2022} coding system used
1397 for simplified Chinese (as used in the People's Republic of China), with
1398 the @code{ascii} (G0), @code{chinese-gb2312} (G1), and @code{sisheng}
1399 (G2) character sets initially designated. Chinese EUC (Extended Unix
1403 @itemx ctext-hebrew-dos
1404 @itemx ctext-hebrew-mac
1405 @itemx ctext-hebrew-unix
1407 Modeline indicator: @code{CText/Hbrw}. A type @code{iso2022} coding system
1408 with the @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) character
1409 sets initially designated for Hebrew.
1416 Modeline indicator: @code{CText}. A type @code{iso2022} 8-bit coding system
1417 with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1) character
1418 sets initially designated. X11 Compound Text Encoding. Often
1419 mistakenly recognized instead of EUC encodings; the usual cause is an
1420 inappropriate setting of @code{coding-priority-list}.
1424 Modeline indicator: @code{ESC/Quot}. A type @code{iso2022} 8-bit coding
1425 system with the @code{ascii} (G0) and @code{latin-iso8859-1} (G1)
1426 character sets initially designated and escape quoting. Unix EOL
1427 conversion (i.e., no conversion). It is used for .ELC files.
1434 Modeline indicator: @code{Ja/EUC}. A type @code{iso2022} 8-bit coding system
1435 with @code{ascii} (G0), @code{japanese-jisx0208} (G1),
1436 @code{katakana-jisx0201} (G2), and @code{japanese-jisx0212} (G3)
1437 initially designated. Japanese EUC (Extended Unix Code).
1444 Modeline indicator: @code{ko/EUC}. A type @code{iso2022} 8-bit coding system
1445 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1446 designated. Korean EUC (Extended Unix Code).
1449 Modeline indicator: @code{Zh-GB/Hz}. A type @code{no-conversion} coding
1450 system with Unix EOL convention (i.e., no conversion) using
1451 post-read-decode and pre-write-encode functions to translate the Hz/ZW
1452 coding system used for Chinese.
1455 @itemx iso-2022-7bit-unix
1456 @itemx iso-2022-7bit-dos
1457 @itemx iso-2022-7bit-mac
1460 Modeline indicator: @code{ISO7}. A type @code{iso2022} 7-bit coding system
1461 with @code{ascii} (G0) initially designated. Other character sets must
1462 be explicitly designated to be used.
1464 @item iso-2022-7bit-ss2
1465 @itemx iso-2022-7bit-ss2-dos
1466 @itemx iso-2022-7bit-ss2-mac
1467 @itemx iso-2022-7bit-ss2-unix
1469 Modeline indicator: @code{ISO7/SS}. A type @code{iso2022} 7-bit coding system
1470 with @code{ascii} (G0) initially designated. Other character sets must
1471 be explicitly designated to be used. SS2 is used to invoke a
1472 96-charset, one character at a time.
1475 @itemx iso-2022-8-dos
1476 @itemx iso-2022-8-mac
1477 @itemx iso-2022-8-unix
1479 Modeline indicator: @code{ISO8}. A type @code{iso2022} 8-bit coding system
1480 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
1481 designated. Other character sets must be explicitly designated to be
1482 used. No single-shift or locking-shift.
1484 @item iso-2022-8bit-ss2
1485 @itemx iso-2022-8bit-ss2-dos
1486 @itemx iso-2022-8bit-ss2-mac
1487 @itemx iso-2022-8bit-ss2-unix
1489 Modeline indicator: @code{ISO8/SS}. A type @code{iso2022} 8-bit coding system
1490 with @code{ascii} (G0) and @code{latin-iso8859-1} (G1) initially
1491 designated. Other character sets must be explicitly designated to be
1492 used. SS2 is used to invoke a 96-charset, one character at a time.
1494 @item iso-2022-int-1
1495 @itemx iso-2022-int-1-dos
1496 @itemx iso-2022-int-1-mac
1497 @itemx iso-2022-int-1-unix
1499 Modeline indicator: @code{INT-1}. A type @code{iso2022} 7-bit coding system
1500 with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1501 designated. ISO-2022-INT-1.
1503 @item iso-2022-jp-1978-irv
1504 @itemx iso-2022-jp-1978-irv-dos
1505 @itemx iso-2022-jp-1978-irv-mac
1506 @itemx iso-2022-jp-1978-irv-unix
1508 Modeline indicator: @code{Ja-78/7bit}. A type @code{iso2022} 7-bit coding
1509 system. For compatibility with old Japanese terminals; if you need to
1510 know, look at the source.
1513 @itemx iso-2022-jp-2 (ISO7/SS)
1514 @itemx iso-2022-jp-dos
1515 @itemx iso-2022-jp-mac
1516 @itemx iso-2022-jp-unix
1517 @itemx iso-2022-jp-2-dos
1518 @itemx iso-2022-jp-2-mac
1519 @itemx iso-2022-jp-2-unix
1521 Modeline indicator: @code{MULE/7bit}. A type @code{iso2022} 7-bit coding
1522 system with @code{ascii} (G0) initially designated, and complex
1523 specifications to ensure backward compatibility with old Japanese
1524 systems. Used for communication with mail and news in Japan. The @samp{-2}
1525 versions also use SS2 to invoke a 96-charset one character at a time.
1528 Modeline indicator: @code{Ko/7bit}. A type @code{iso2022} 7-bit coding
1529 system with @code{ascii} (G0) and @code{korean-ksc5601} (G1) initially
1530 designated. Used for e-mail in Korea.
1533 @itemx iso-2022-lock-dos
1534 @itemx iso-2022-lock-mac
1535 @itemx iso-2022-lock-unix
1537 Modeline indicator: @code{ISO7/Lock}. A type @code{iso2022} 7-bit coding
1538 system with @code{ascii} (G0) initially designated, using Locking-Shift
1539 to invoke a 96-charset.
1542 @itemx iso-8859-1-dos
1543 @itemx iso-8859-1-mac
1544 @itemx iso-8859-1-unix
1546 Due to implementation, this is not a type @code{iso2022} coding system,
1547 but rather an alias for the @code{raw-text} coding system.
1550 @itemx iso-8859-2-dos
1551 @itemx iso-8859-2-mac
1552 @itemx iso-8859-2-unix
1554 Modeline indicator: @code{MIME/Ltn-2}. A type @code{iso2022} coding
1555 system with @code{ascii} (G0) and @code{latin-iso8859-2} (G1) initially
1559 @itemx iso-8859-3-dos
1560 @itemx iso-8859-3-mac
1561 @itemx iso-8859-3-unix
1563 Modeline indicator: @code{MIME/Ltn-3}. A type @code{iso2022} coding system
1564 with @code{ascii} (G0) and @code{latin-iso8859-3} (G1) initially
1568 @itemx iso-8859-4-dos
1569 @itemx iso-8859-4-mac
1570 @itemx iso-8859-4-unix
1572 Modeline indicator: @code{MIME/Ltn-4}. A type @code{iso2022} coding system
1573 with @code{ascii} (G0) and @code{latin-iso8859-4} (G1) initially
1577 @itemx iso-8859-5-dos
1578 @itemx iso-8859-5-mac
1579 @itemx iso-8859-5-unix
1581 Modeline indicator: @code{ISO8/Cyr}. A type @code{iso2022} coding system with
1582 @code{ascii} (G0) and @code{cyrillic-iso8859-5} (G1) initially invoked.
1585 @itemx iso-8859-7-dos
1586 @itemx iso-8859-7-mac
1587 @itemx iso-8859-7-unix
1589 Modeline indicator: @code{Grk}. A type @code{iso2022} coding system with
1590 @code{ascii} (G0) and @code{greek-iso8859-7} (G1) initially invoked.
1593 @itemx iso-8859-8-dos
1594 @itemx iso-8859-8-mac
1595 @itemx iso-8859-8-unix
1597 Modeline indicator: @code{MIME/Hbrw}. A type @code{iso2022} coding system with
1598 @code{ascii} (G0) and @code{hebrew-iso8859-8} (G1) initially invoked.
1601 @itemx iso-8859-9-dos
1602 @itemx iso-8859-9-mac
1603 @itemx iso-8859-9-unix
1605 Modeline indicator: @code{MIME/Ltn-5}. A type @code{iso2022} coding system
1606 with @code{ascii} (G0) and @code{latin-iso8859-9} (G1) initially
1614 Modeline indicator: @code{KOI8}. A type @code{ccl} coding-system used for
1615 KOI8-R, an encoding of the Cyrillic alphabet.
1618 @itemx shift_jis-dos
1619 @itemx shift_jis-mac
1620 @itemx shift_jis-unix
1622 Modeline indicator: @code{Ja/SJIS}. A type @code{shift-jis} coding-system
1623 implementing the Shift-JIS encoding for Japanese. The underscore is to
1624 conform to the MIME charset implementing this encoding.
1631 Modeline indicator: @code{TIS620}. A type @code{ccl} encoding for Thai. The
1632 external encoding is defined by TIS620, the internal encoding is
1633 peculiar to MULE, and called @code{thai-xtis}.
1637 Modeline indicator: @code{VIQR}. A type @code{no-conversion} coding
1638 system with Unix EOL convention (i.e., no conversion) using
1639 post-read-decode and pre-write-encode functions to translate the VIQR
1640 coding system for Vietnamese.
1647 Modeline indicator: @code{VISCII}. A type @code{ccl} coding-system used
1648 for VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is
1649 given priority by XEmacs.
1656 Modeline indicator: @code{VSCII}. A type @code{ccl} coding-system used
1657 for VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is
1658 given priority by XEmacs. Use
1659 @code{(prefer-coding-system 'vietnamese-vscii)} to give priority to VSCII.
1663 @node CCL, Category Tables, Coding Systems, MULE
1666 CCL (Code Conversion Language) is a simple structured programming
1667 language designed for character coding conversions. A CCL program is
1668 compiled to CCL code (represented by a vector of integers) and executed
1669 by the CCL interpreter embedded in Emacs. The CCL interpreter
1670 implements a virtual machine with 8 registers called @code{r0}, ...,
1671 @code{r7}, a number of control structures, and some I/O operators. Take
1672 care when using registers @code{r0} (used in implicit @dfn{set}
1673 statements) and especially @code{r7} (used internally by several
1674 statements and operations, especially for multiple return values and I/O operations).
1677 CCL is used for code conversion during process I/O and file I/O for
1678 non-ISO2022 coding systems. (It is the only way for a user to specify a
1679 code conversion function.) It is also used for calculating the code
1680 point of an X11 font from a character code. However, since CCL is
1681 designed as a powerful programming language, it can be used for more
1682 generic calculation where efficiency is demanded. A combination of
1683 three or more arithmetic operations can be calculated faster by CCL than by Emacs Lisp.
1686 @strong{Warning:} The code in @file{src/mule-ccl.c} and
1687 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive
1688 description of CCL's semantics. The previous version of this section
1689 contained several typos and obsolete names left from earlier versions of
1690 MULE, and many may remain. (I am not an experienced CCL programmer; the
1691 few who know CCL well find writing English painful.)
1693 A CCL program transforms an input data stream into an output data
1694 stream. The input stream, held in a buffer of constant bytes, is left
1695 unchanged. The buffer may be filled by an external input operation,
1696 taken from an Emacs buffer, or taken from a Lisp string. The output
1697 buffer is a dynamic array of bytes, which can be written by an external
1698 output operation, inserted into an Emacs buffer, or returned as a Lisp string.
1701 A CCL program is a (Lisp) list containing two or three members. The
1702 first member is the @dfn{buffer magnification}, which indicates the
1703 required minimum size of the output buffer as a multiple of the input
1704 buffer. It is followed by the @dfn{main block} which executes while
1705 there is input remaining, and an optional @dfn{EOF block} which is
1706 executed when the input is exhausted. Both the main block and the EOF
1707 block are CCL blocks.
1709 A @dfn{CCL block} is either a CCL statement or a list of CCL statements.
1710 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer
1711 or an @dfn{assignment}, which is a list of a register to receive the
1712 assignment, an assignment operator, and an expression) or a @dfn{control
1713 statement} (a list starting with a keyword, whose allowable syntax
1714 depends on the keyword).
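As a rough illustration of this structure, the following sketch shows a
hypothetical three-member CCL program written as a Lisp list.  It is not
taken from the XEmacs sources; the magnification value and the EOF block
are assumptions chosen only to show where each member goes.

@example
;; Hypothetical CCL program: copy the input, then append a newline.
(2                           ; buffer magnification (illustrative value)
 ((read r0)                  ; main block: a list of two statements
  (loop
   (write-read-repeat r0)))  ; write r0, read the next byte, repeat
 ((write 10)))               ; EOF block: runs when input is exhausted
@end example

The main block and the EOF block are both CCL blocks, so either could
also be written as a single statement instead of a list of statements.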
1717 * CCL Syntax:: CCL program syntax in BNF notation.
1718 * CCL Statements:: Semantics of CCL statements.
1719 * CCL Expressions:: Operators and expressions in CCL.
1720 * Calling CCL:: Running CCL programs.
1721 * CCL Examples:: The encoding functions for Big5 and KOI-8.
1724 @node CCL Syntax, CCL Statements, , CCL
1725 @comment Node, Next, Previous, Up
1726 @subsection CCL Syntax
1728 The full syntax of a CCL program in BNF notation:
@example
CCL_PROGRAM :=
        (BUFFER_MAGNIFICATION
         CCL_MAIN_BLOCK
         [ CCL_EOF_BLOCK ])

BUFFER_MAGNIFICATION := integer
CCL_MAIN_BLOCK := CCL_BLOCK
CCL_EOF_BLOCK := CCL_BLOCK

CCL_BLOCK :=
        STATEMENT | (STATEMENT [STATEMENT ...])
STATEMENT :=
        SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE
        | CALL | END

SET :=
        (REG = EXPRESSION)
        | (REG ASSIGNMENT_OPERATOR EXPRESSION)
        | integer

EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)

IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
LOOP := (loop STATEMENT [STATEMENT ...])
BREAK := (break)
REPEAT :=
        (repeat)
        | (write-repeat [REG | integer | string])
        | (write-read-repeat REG [integer | ARRAY])
READ :=
        (read REG ...)
        | (read-if (REG OPERATOR ARG) CCL_BLOCK CCL_BLOCK)
        | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
WRITE :=
        (write REG ...)
        | (write EXPRESSION)
        | (write integer) | (write string) | (write REG ARRAY)
        | string
CALL := (call ccl-program-name)
END := (end)

REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
ARG := REG | integer
OPERATOR :=
        + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
        | < | > | == | <= | >= | != | de-sjis | en-sjis
ASSIGNMENT_OPERATOR :=
        += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
ARRAY := '[' integer ... ']'
@end example
1783 @node CCL Statements, CCL Expressions, CCL Syntax, CCL
1784 @comment Node, Next, Previous, Up
1785 @subsection CCL Statements
1787 The Emacs Code Conversion Language provides the following statement
1788 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
1789 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
1791 @heading Set statement:
1793 The @dfn{set} statement has three variants with the syntaxes
1794 @samp{(@var{reg} = @var{expression})},
1795 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
1796 @samp{@var{integer}}. The assignment operator variation of the
1797 @dfn{set} statement works the same way as the corresponding C expression
1798 statement does. The assignment operators are @code{+=}, @code{-=},
1799 @code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=},
1800 @code{<<=}, and @code{>>=}, and they have the same meanings as in C. A
1801 ``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of
1802 the form @code{(r0 = @var{integer})}.
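The three variants can be seen side by side in a small CCL block.  This
is only a sketch; the register choices and values are arbitrary.

@example
;; A CCL block made only of set statements (values are illustrative).
((r1 = 10)            ; plain assignment
 (r1 += 5)            ; as in C: r1 is now 15
 (r2 = (r1 << 2))     ; infix expression: r2 = 15 << 2 = 60
 65)                  ; naked integer, equivalent to (r0 = 65)
@end example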
1804 @heading I/O statements:
1806 The @dfn{read} statement takes one or more registers as arguments. It
1807 reads one byte (a C char) from the input into each register in turn.
1809 The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg}
1810 ...)} it takes one or more registers as arguments and writes each in
1811 turn to the output. The integer in a register (interpreted as an
1812 Emchar) is encoded to multibyte form (i.e., Bufbytes) and written to the
1813 current output buffer. If it is less than 256, it is written as is.
1814 The forms @samp{(write @var{expression})} and @samp{(write
1815 @var{integer})} are treated analogously. The form @samp{(write
1816 @var{string})} writes the constant string to the output. A
1817 ``naked string'' @samp{@var{string}} is equivalent to the statement @samp{(write
1818 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes
1819 the @var{reg}th element of the @var{array} to the output.
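The forms of the @code{write} statement described above might be
combined as in the following sketch.  The values are arbitrary, and the
exact indexing convention for the array form is not stated here, so the
comment on the last line avoids naming the element written.

@example
((r0 = 66)
 (r1 = 2)
 (write r0)               ; register: writes 66, the byte for `B'
 (write (r0 + 1))         ; expression: writes 67
 (write 10)               ; integer: writes a newline byte
 (write "done")           ; constant string
 "done"                   ; naked string, same as the line above
 (write r1 [65 66 67]))   ; array: writes the element selected by r1
@end example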
1821 @heading Conditional statements:
1823 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and
1824 an optional @var{second CCL block} as arguments. If the
1825 @var{expression} evaluates to non-zero, the first @var{CCL block} is
1826 executed. Otherwise, if there is a @var{second CCL block}, it is executed.
1829 The @dfn{read-if} variant of the @dfn{if} statement takes an
1830 @var{expression}, a @var{CCL block}, and an optional @var{second CCL
1831 block} as arguments. The @var{expression} must have the form
1832 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is
1833 a register or an integer). The @code{read-if} statement first reads
1834 from the input into the first register operand in the @var{expression},
1835 then conditionally executes a CCL block just as the @code{if} statement does.
1838 The @dfn{branch} statement takes an @var{expression} and one or more CCL
1839 blocks as arguments. The CCL blocks are treated as a zero-indexed
1840 array, and the @code{branch} statement uses the @var{expression} as the
1841 index of the CCL block to execute. Null CCL blocks may be used as
1842 no-ops, continuing execution with the statement following the
1843 @code{branch} statement in the containing CCL block. Out-of-range
1844 values for the @var{expression} are also treated as no-ops.
1846 The @dfn{read-branch} variant of the @dfn{branch} statement takes a
1847 @var{register} and one or more CCL blocks as arguments. The
1848 @code{read-branch} statement first reads from the input into the
1849 @var{register}, then uses its value as the index of the CCL block to
1850 execute, just as the @code{branch} statement does.
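A short sketch of the conditional statements follows.  The byte values
and output strings are invented for illustration only.

@example
((read-if (r0 == 27)          ; read a byte into r0, then test it
     (write "ESC")            ; executed when r0 equals 27
   (write r0))                ; otherwise copy the byte unchanged
 (read r1)
 (branch (r1 & 3)             ; the index is in the range 0..3
   (write "zero")             ; block 0
   (write "one")              ; block 1
   (write "two"))             ; block 2; index 3 is out of range: no-op
 (read-branch r2              ; read into r2, then dispatch on its value
   (write "first")            ; block 0
   (write "second")))         ; block 1
@end example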
1852 @heading Loop control statements:
1854 The @dfn{loop} statement creates a block with an implied jump from the
1855 end of the block back to its head. The loop is exited on a @code{break}
1856 statement, and continued without executing the tail by a @code{repeat} statement.
1859 The @dfn{break} statement, written @samp{(break)}, terminates the
1860 current loop and continues with the next statement in the current block.
1863 The @dfn{repeat} statement has three variants, @code{repeat},
1864 @code{write-repeat}, and @code{write-read-repeat}. Each continues the
1865 current loop from its head, possibly after performing I/O.
1866 @code{repeat} takes no arguments and does no I/O before jumping.
1867 @code{write-repeat} takes a single argument (a register, an
1868 integer, or a string), writes it to the output, then jumps.
1869 @code{write-read-repeat} takes one or two arguments. The first must
1870 be a register. The second may be an integer or an array; if absent, it
1871 is implicitly set to the first (register) argument.
1872 @code{write-read-repeat} writes its second argument to the output, then
1873 reads from the input into the register, and finally jumps. See the
1874 @code{write} and @code{read} statements for the semantics of the I/O
1875 operations for each type of argument.
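The loop statements are typically combined with I/O, as in this sketch
of a hypothetical converter that upcases ASCII letters.  The program is
an illustration, not one shipped with XEmacs; with a single argument,
@code{write-read-repeat} writes the register itself.

@example
;; Hypothetical converter: upcase ASCII letters, pass other bytes through.
(1
 ((read r0)
  (loop
   (if (r0 >= 97)                ; 97 is `a'
       (if (r0 <= 122)           ; 122 is `z'
           (r0 -= 32)))
   (write-read-repeat r0))))     ; write r0, read next byte, jump to head
@end example

The loop has no explicit @code{break}; this sketch relies on the read
failing once the input is exhausted to end the program.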
1877 @heading Other control statements:
1879 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
1880 executes a CCL program as a subroutine. It does not return a value to
1881 the caller, but can modify the register status.
1883 The @dfn{end} statement, written @samp{(end)}, terminates the CCL
1884 program successfully, and returns to caller (which may be a CCL
1885 program). It does not alter the status of the registers.
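The following fragment sketches how @code{call} and @code{end} might be
used together.  The name @code{my-strip-cr} is hypothetical and is
assumed to have been registered beforehand with
@code{register-ccl-program}; the test on @code{r0} is likewise only an
example of inspecting register state left by the subroutine.

@example
;; Assumes a program registered under the hypothetical name my-strip-cr.
((call my-strip-cr)     ; run the registered program as a subroutine
 (if (r0 == 0)          ; registers may have been changed by the call
     (end))             ; terminate successfully, registers untouched
 (write r0))
@end example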
1887 @node CCL Expressions, Calling CCL, CCL Statements, CCL
1888 @comment Node, Next, Previous, Up
1889 @subsection CCL Expressions
1891 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions
1892 consist of a single @var{operand}, either a register (one of @code{r0},
1893 ..., @code{r7}) or an integer. Complex expressions are lists of the
1894 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike
1895 C, assignments are not expressions.
1897 In the following table, @var{X} is the target register for a @dfn{set}.
1898 In subexpressions, this is implicitly @code{r7}. This means that
1899 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
1900 freely in subexpressions, since they return parts of their values in
1901 @code{r7}. @var{Y} may be an expression, register, or integer, while
1902 @var{Z} must be a register or an integer.
1904 @multitable @columnfractions .22 .14 .09 .55
1905 @item Name @tab Operator @tab Code @tab C-like Description
1906 @item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z
1907 @item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z
1908 @item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z
1909 @item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z
1910 @item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z
1911 @item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z
1912 @item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z
1913 @item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z
1914 @item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z
1915 @item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z
1916 @item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z
1917 @item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF
1918 @item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z
1919 @item CCL_LS @tab @code{<} @tab 0x10 @tab X = (Y < Z)
1920 @item CCL_GT @tab @code{>} @tab 0x11 @tab X = (Y > Z)
1921 @item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (Y == Z)
1922 @item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (Y <= Z)
1923 @item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (Y >= Z)
1924 @item CCL_NE @tab @code{!=} @tab 0x15 @tab X = (Y != Z)
1925 @item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z))
1926 @item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
1927 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
1928 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
1931 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
1932 CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. CCL_ENCODE_SJIS
1933 and CCL_DECODE_SJIS treat their first and second operands as the high and
1934 low bytes of a two-byte character code. (SJIS stands for Shift JIS, an
1935 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a
1936 complicated transformation of the Japanese standard JIS encoding to
1937 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to
1938 represent the SJIS operations in infix form.
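The CCL-specific operators can be illustrated with a small block of set
statements.  This is a sketch based on the table above; the input values
are arbitrary.

@example
((r0 = 1234)
 (r1 = (r0 // 100))       ; r1 = 12, and r7 = 34 as a side effect
 (r2 = 18)
 (r3 = 52)
 (r4 = (r2 <8 r3))        ; r4 = (18 << 8) | 52 = 4660 (0x1234)
 (r5 = ((r0 + 1) * 2)))   ; nested: the inner sum is held in r7
@end example

Because the subexpression in the last statement is computed in
@code{r7}, the remainder left there by @code{//} is overwritten.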
1940 @node Calling CCL, CCL Examples, CCL Expressions, CCL
1941 @comment Node, Next, Previous, Up
1942 @subsection Calling CCL
1944 CCL programs are called automatically during Emacs buffer I/O when the
1945 external representation has a coding system type of @code{shift-jis},
1946 @code{big5}, or @code{ccl}. The program is specified by the coding
1947 system (@pxref{Coding Systems}). You can also call CCL programs from
1948 other CCL programs, and from Lisp using these functions:
1950 @defun ccl-execute ccl-program status
1951 Execute @var{ccl-program} with registers initialized by
1952 @var{status}. @var{ccl-program} is a vector of compiled CCL code
1953 created by @code{ccl-compile}. It is an error for the program to try to
1954 execute a CCL I/O command. @var{status} must be a vector of nine
1955 values, specifying the initial value for the R0, R1 .. R7 registers and
1956 for the instruction counter IC. A @code{nil} value for a register
1957 initializer causes the register to be set to 0. A @code{nil} value for
1958 the IC initializer causes execution to start at the beginning of the
1959 program. When the program is done, @var{status} is modified (by
1960 side-effect) to contain the ending values for the corresponding registers and IC.
1964 @defun ccl-execute-on-string ccl-program status string &optional continue
1965 Execute @var{ccl-program} with initial @var{status} on
1966 @var{string}. @var{ccl-program} is a vector of compiled CCL code
1967 created by @code{ccl-compile}. @var{status} must be a vector of nine
1968 values, specifying the initial value for the R0, R1 .. R7 registers and
1969 for the instruction counter IC. A @code{nil} value for a register
1970 initializer causes the register to be set to 0. A @code{nil} value for
1971 the IC initializer causes execution to start at the beginning of the
1972 program. An optional fourth argument @var{continue}, if non-@code{nil}, causes the IC to
1974 remain on the unsatisfied read operation if the program terminates due
1975 to exhaustion of the input buffer. Otherwise the IC is set to the end
1976 of the program. When the program is done, @var{status} is modified (by
1977 side-effect) to contain the ending values for the corresponding
1978 registers and IC. Returns the resulting string.
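The two functions might be used from Lisp as in the following sketches.
Both programs are hypothetical, and the variable names are chosen only
for this example; the status vectors follow the nine-slot layout
described above.

@example
;; Sketch 1: pure computation with ccl-execute (no I/O statements).
(let ((multiply (ccl-compile '(1 ((r0 = (r1 * r2))))))
      ;; Slots 0-7 initialize r0..r7; slot 8 is the IC.  nil means 0
      ;; for a register and "start at the beginning" for the IC.
      (status (vector nil 6 7 nil nil nil nil nil nil)))
  (ccl-execute multiply status)
  (aref status 0))           ; final value of r0, read back from status

;; Sketch 2: run a byte-copying program over a Lisp string.
(let ((copy (ccl-compile '(1 ((read r0)
                              (loop (write-read-repeat r0))))))
      (status (make-vector 9 nil)))
  (ccl-execute-on-string copy status "some text"))
@end example

A fresh status vector is used for each call because both functions
overwrite it with the final register values.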
1981 To call a CCL program from another CCL program, it must first be registered:
1984 @defun register-ccl-program name ccl-program
1985 Register @var{name} for CCL program @var{ccl-program} in
1986 @code{ccl-program-table}. @var{ccl-program} should be the compiled form of
1987 a CCL program, or @code{nil}. Return index number of the registered CCL program.
1991 Information about the processor time used by the CCL interpreter can be
1992 obtained using these functions:
1994 @defun ccl-elapsed-time
1995 Returns the elapsed processor time of the CCL interpreter as a cons of
1996 user and system time, as
1997 floating point numbers measured in seconds. If only one
1998 overall value can be determined, the return value will be a cons of that value and 0.
2002 @defun ccl-reset-elapsed-time
2003 Resets the CCL interpreter's internal elapsed time registers.
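A timing run might look like the following sketch.  The byte-copying
program and the input size are arbitrary choices made for illustration.

@example
(let ((copy (ccl-compile '(1 ((read r0)
                              (loop (write-read-repeat r0))))))
      (status (make-vector 9 nil)))
  (ccl-reset-elapsed-time)                 ; start from zero
  (ccl-execute-on-string copy status (make-string 10000 ?x))
  (let ((times (ccl-elapsed-time)))        ; (USER . SYSTEM), in seconds
    (message "CCL time: user %s, system %s"
             (car times) (cdr times))))
@end example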
2006 @node CCL Examples, , Calling CCL, CCL
2007 @comment Node, Next, Previous, Up
2008 @subsection CCL Examples
2010 This section is not yet written.
2012 @node Category Tables, , CCL, MULE
2013 @section Category Tables
2015 A category table is a type of char table used for keeping track of
2016 categories. Categories are used for classifying characters for use in
2017 regexps---you can refer to a category rather than having to use a
2018 complicated [] expression (and category lookups are significantly faster).
2021 There are 95 different categories available, one for each printable
2022 character (including space) in the ASCII charset. Each category is
2023 designated by one such character, called a @dfn{category designator}.
2024 They are specified in a regexp using the syntax @samp{\cX}, where X is a
2025 category designator. (This is not yet implemented.)
2027 A category table specifies, for each character, the categories that
2028 the character is in. Note that a character can be in more than one
2029 category. More specifically, a category table maps from a character to
2030 either the value @code{nil} (meaning the character is in no categories)
2031 or a 95-element bit vector, specifying for each of the 95 categories
2032 whether the character is in that category.
2034 Special Lisp functions are provided that abstract this, so you do not
2035 have to directly manipulate bit vectors.
2037 @defun category-table-p object
2038 This function returns @code{t} if @var{object} is a category table.
2041 @defun category-table &optional buffer
2042 This function returns the current category table. This is the one
2043 specified by the current buffer, or by @var{buffer} if it is non-@code{nil}.
2047 @defun standard-category-table
2048 This function returns the standard category table. This is the one used for new buffers.
2052 @defun copy-category-table &optional category-table
2053 This function returns a new category table which is a copy of
2054 @var{category-table}, which defaults to the standard category table.
2057 @defun set-category-table category-table &optional buffer
2058 This function selects @var{category-table} as the new category table for
2059 @var{buffer}. @var{buffer} defaults to the current buffer if omitted.
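For instance, a buffer could be given a private copy of the standard
table along these lines.  This is a sketch that uses only the functions
described above.

@example
;; Give the current buffer a private copy of the standard table.
(let ((table (copy-category-table)))
  (set-category-table table)
  (list (category-table-p table)          ; the copy is a category table
        (eq table (category-table))))     ; and is now the current one
@end example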
2062 @defun category-designator-p object
2063 This function returns @code{t} if @var{object} is a category designator (a
2064 char in the range @samp{' '} to @samp{'~'}).
2067 @defun category-table-value-p object
2068 This function returns @code{t} if @var{object} is a category table value.
2069 Valid values are @code{nil} or a bit vector of size 95.
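The predicates can be tried on hand-built values, as in this sketch.
The mapping from designators to bit positions is an assumption made for
the example (bit N for the designator with ASCII code N + 32); in normal
use the functions above make it unnecessary to build such vectors
directly.

@example
(let ((value (make-bit-vector 95 0)))
  ;; Assumption: bit N stands for the designator with ASCII code N + 32,
  ;; so the designator `a' (code 97) would be bit 65.
  (aset value (- 97 32) 1)
  (list (category-designator-p ?a)       ; a printable ASCII char
        (category-table-value-p value)   ; bit vector of size 95
        (category-table-value-p nil)))   ; nil is also a valid value
@end example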