2 @c This is part of the XEmacs Lisp Reference Manual.
3 @c Copyright (C) 1996 Ben Wing.
4 @c See the file lispref.texi for copying conditions.
5 @setfilename ../../info/internationalization.info
6 @node MULE, Tips, Internationalization, top
9 @dfn{MULE} is the name originally given to the version of GNU Emacs
10 extended for multi-lingual (and in particular Asian-language) support.
11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It was originally called
12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for
13 ``Japan''), when it only provided support for Japanese. XEmacs
14 refers to its multi-lingual support as @dfn{MULE support} since it
15 is based on @dfn{MULE}.
18 * Internationalization Terminology::
19 Definition of various internationalization terms.
20 * Charsets:: Sets of related characters.
21 * MULE Characters:: Working with characters in XEmacs/MULE.
22 * Composite Characters:: Making new characters by overstriking other ones.
23 * ISO 2022:: An international standard for charsets and encodings.
24 * Coding Systems:: Ways of representing a string of chars using integers.
25 * CCL:: A special language for writing fast converters.
26 * Category Tables:: Subdividing charsets into groups.
29 @node Internationalization Terminology
30 @section Internationalization Terminology
32 In internationalization terminology, a string of text is divided up
33 into @dfn{characters}, which are the printable units that make up the
34 text. A single character is (for example) a capital @samp{A}, the
35 number @samp{2}, a Katakana character, a Kanji ideograph (an
36 @dfn{ideograph} is a ``picture'' character, such as is used in Japanese
Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
38 of such ideographs in each language), etc. The basic property of a
39 character is its shape. Note that the same character may be drawn by
40 two different people (or in two different fonts) in slightly different
41 ways, although the basic shape will be the same.
43 In some cases, the differences will be significant enough that it is
44 actually possible to identify two or more distinct shapes that both
45 represent the same character. For example, the lowercase letters
46 @samp{a} and @samp{g} each have two distinct possible shapes -- the
47 @samp{a} can optionally have a curved tail projecting off the top, and
48 the @samp{g} can be formed either of two loops, or of one loop and a
49 tail hanging off the bottom. Such distinct possible shapes of a
50 character are called @dfn{glyphs}. The important characteristic of two
51 glyphs making up the same character is that the choice between one or
52 the other is purely stylistic and has no linguistic effect on a word
53 (this is the reason why a capital @samp{A} and lowercase @samp{a}
54 are different characters rather than different glyphs -- e.g.
55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
57 Note that @dfn{character} and @dfn{glyph} are used differently
58 here than elsewhere in XEmacs.
60 A @dfn{character set} is simply a set of related characters. ASCII,
61 for example, is a set of 94 characters (or 128, if you count
62 non-printing characters). Other character sets are ISO8859-1 (ASCII
63 plus various accented characters and other international symbols),
64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208
65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji),
66 GB2312 (Mainland Chinese Hanzi), etc.
68 Every character set has one or more @dfn{orderings}, which can be
69 viewed as a way of assigning a number (or set of numbers) to each
70 character in the set. For most character sets, there is a standard
71 ordering, and in fact all of the character sets mentioned above define a
72 particular ordering. ASCII, for example, places letters in their
73 ``natural'' order, puts uppercase letters before lowercase letters,
74 numbers before letters, etc. Note that for many of the Asian character
75 sets, there is no natural ordering of the characters. The actual
orderings are based on one or more salient characteristics, of which
77 there are many to choose from -- e.g. number of strokes, common
78 radicals, phonetic ordering, etc.
The numbers assigned to any particular character are called
81 the character's @dfn{position codes}. The number of position codes
82 required to index a particular character in a character set is called
83 the @dfn{dimension} of the character set. ASCII, being a relatively
84 small character set, is of dimension one, and each character in the
85 set is indexed using a single position code, in the range 0 through
86 127 (if non-printing characters are included) or 33 through 126
87 (if only the printing characters are considered). JISX0208, i.e.
88 Japanese Kanji, has thousands of characters, and is of dimension two --
89 every character is indexed by two position codes, each in the range
90 33 through 126. (Note that the choice of the range here is somewhat
91 arbitrary. Although a character set such as JISX0208 defines an
92 @emph{ordering} of all its characters, it does not define the actual
93 mapping between numbers and characters. You could just as easily
94 index the characters in JISX0208 using numbers in the range 0 through
95 93, 1 through 94, 2 through 95, etc. The reason for the actual range
96 chosen is so that the position codes match up with the actual values
97 used in the common encodings.)
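As a rough illustration of position codes, here is a sketch in Python rather than XEmacs Lisp. It assumes Python's standard @samp{euc_jp} codec, and the helper name @code{position_codes} is invented for this example. The two position codes of a JISX0208 character are recovered from its EUC bytes by clearing the high bit; subtracting 32 then gives the same ordering indexed from 1 through 94, which shows that the range is only a convention.

```python
def position_codes(ch):
    """Return the two JISX0208 position codes (each 33..126) of one character.

    Assumes ch belongs to JISX0208, so that its EUC-JP form is exactly
    two bytes, each being a position code with the high bit set.
    """
    euc = ch.encode("euc_jp")
    if len(euc) != 2 or not all(0xA1 <= b <= 0xFE for b in euc):
        raise ValueError("not a JISX0208 character: %r" % ch)
    return tuple(b & 0x7F for b in euc)


codes = position_codes("\u3042")          # HIRAGANA LETTER A
row_cell = tuple(c - 32 for c in codes)   # same ordering, indexed 1..94
print(codes, row_cell)                    # (36, 34) (4, 2)
```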
An @dfn{encoding} is a way of numerically representing characters from
one or more character sets as a stream of like-sized numerical values
called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
102 quantities. If an encoding encompasses only one character set, then the
103 position codes for the characters in that character set could be used
104 directly. (This is the case with ASCII, and as a result, most people do
105 not understand the difference between a character set and an encoding.)
106 This is not possible, however, if more than one character set is to be
107 used in the encoding. For example, printed Japanese text typically
108 requires characters from multiple character sets -- ASCII, JISX0208, and
109 JISX0212, to be specific. Each of these is indexed using one or more
110 position codes in the range 33 through 126, so the position codes could
111 not be used directly or there would be no way to tell which character
112 was meant. Different Japanese encodings handle this differently -- JIS
113 uses special escape characters to denote different character sets; EUC
114 sets the high bit of the position codes for JISX0208 and JISX0212, and
115 puts a special extra byte before each JISX0212 character; etc. (JIS,
116 EUC, and most of the other encodings you will encounter are 7-bit or
117 8-bit encodings. There is one common 16-bit encoding, which is Unicode;
118 this strives to represent all the world's characters in a single large
119 character set. 32-bit encodings are generally used internally in
120 programs to simplify the code that manipulates them; however, they are
121 not much used externally because they are not very space-efficient.)
123 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In
124 a @dfn{modal encoding}, there are multiple states that the encoding can be in,
125 and the interpretation of the values in the stream depends on the
126 current global state of the encoding. Special values in the encoding,
127 called @dfn{escape sequences}, are used to change the global state.
128 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B}
129 indicate that, from then on, bytes are to be interpreted as position
130 codes for JISX0208, rather than as ASCII. This effect is cancelled
131 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
current state is to ASCII''. To switch to JISX0212, the escape sequence
@samp{ESC $ ( D} is used. (Note that here, as is common, the escape
sequences do in fact begin with @samp{ESC}. This is not necessarily the
case, however.)
137 A @dfn{non-modal encoding} has no global state that extends past the
138 character currently being interpreted. EUC, for example, is a
139 non-modal encoding. Characters in JISX0208 are encoded by setting
140 the high bit of the position codes, and characters in JISX0212 are
141 encoded by doing the same but also prefixing the character with the
The advantage of a modal encoding is that it is generally more
space-efficient, and is easily extensible because essentially an
arbitrary number of escape sequences can be created. The
147 disadvantage, however, is that it is much more difficult to work with
148 if it is not being processed in a sequential manner. In the non-modal
149 EUC encoding, for example, the byte 0x41 always refers to the letter
150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
151 one of the two position codes in a JISX0208 character, or one of the
152 two position codes in a JISX0212 character. Determining exactly which
153 one is meant could be difficult and time-consuming if the previous
154 bytes in the string have not already been processed.
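The ambiguity of the byte 0x41 can be seen in a small Python sketch. It assumes Python's standard @samp{iso2022_jp} and @samp{euc_jp} codecs as stand-ins for the JIS and EUC encodings discussed here; nothing in it is XEmacs code.

```python
ESC = b"\x1b"
TO_JISX0208 = ESC + b"$B"   # the JIS escape sequence ESC $ B
TO_ASCII = ESC + b"(B"      # the JIS escape sequence ESC ( B

chi = "\u3061"              # HIRAGANA LETTER TI; position codes 0x24 0x41

jis = chi.encode("iso2022_jp")
euc = chi.encode("euc_jp")

# In JIS the second position code is the byte 0x41, the same value as "A".
assert jis == TO_JISX0208 + b"$A" + TO_ASCII

# Without the preceding escape sequence, the same two bytes read as ASCII.
assert b"$A".decode("iso2022_jp") == "$A"
assert jis.decode("iso2022_jp") == chi

# In EUC the high bit is set on both bytes, so 0x41 can only mean "A".
assert euc == b"\xa4\xc1"
assert 0x41 not in euc
```

A reader that starts in the middle of the JIS stream sees only @samp{$A} and cannot tell which interpretation applies without scanning backwards for the last escape sequence.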
156 Non-modal encodings are further divided into @dfn{fixed-width} and
157 @dfn{variable-width} formats. A fixed-width encoding always uses
158 the same number of words per character, whereas a variable-width
159 encoding does not. EUC is a good example of a variable-width
160 encoding: one to three bytes are used per character, depending on
161 the character set. 16-bit and 32-bit encodings are nearly always
162 fixed-width, and this is in fact one of the main reasons for using
an encoding with a larger word size. The main advantage of a
fixed-width encoding is that the position of any character can be
computed directly, without scanning the text that precedes it. The
advantages of variable-width encodings are that they are generally more
space-efficient and allow for compatibility with existing 8-bit
encodings such as ASCII.
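The difference in width can be checked with a short Python sketch. It assumes Python's standard @samp{euc_jp} codec, and it uses @samp{utf-32-be} merely as a convenient example of a fixed-width 32-bit encoding; the sample characters are chosen for illustration only.

```python
samples = {
    "ascii": "A",             # ASCII
    "jisx0208": "\u3042",     # HIRAGANA LETTER A
    "jisx0212": "\u4e02",     # a Kanji in JISX0212 but not in JISX0208
}

euc_widths = {name: len(ch.encode("euc_jp")) for name, ch in samples.items()}
utf32_widths = {name: len(ch.encode("utf-32-be")) for name, ch in samples.items()}

print(euc_widths)    # variable-width: 1, 2 and 3 bytes respectively
print(utf32_widths)  # fixed-width: 4 bytes (one 32-bit word) each
```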
168 Note that the bytes in an 8-bit encoding are often referred to
169 as @dfn{octets} rather than simply as bytes. This terminology
170 dates back to the days before 8-bit bytes were universal, when
171 some computers had 9-bit bytes, others had 10-bit bytes, etc.
176 A @dfn{charset} in MULE is an object that encapsulates a
177 particular character set as well as an ordering of those characters.
178 Charsets are permanent objects and are named using symbols, like
181 @defun charsetp object
182 This function returns non-@code{nil} if @var{object} is a charset.
186 * Charset Properties:: Properties of a charset.
187 * Basic Charset Functions:: Functions for working with charsets.
188 * Charset Property Functions:: Functions for accessing charset properties.
189 * Predefined Charsets:: Predefined charset objects.
192 @node Charset Properties
193 @subsection Charset Properties
195 Charsets have the following properties:
199 A symbol naming the charset. Every charset must have a different name;
200 this allows a charset to be referred to using its name rather than
201 the actual charset object.
203 A documentation string describing the charset.
205 A regular expression matching the font registry field for this character
206 set. For example, both the @code{ascii} and @code{latin-iso8859-1}
207 charsets use the registry @code{"ISO8859-1"}. This field is used to
208 choose an appropriate font when the user gives a general font
209 specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a
210 14-point upright medium-weight Courier font.
212 Number of position codes used to index a character in the character set.
213 XEmacs/MULE can only handle character sets of dimension 1 or 2.
214 This property defaults to 1.
216 Number of characters in each dimension. In XEmacs/MULE, the only
217 allowed values are 94 or 96. (There are a couple of pre-defined
218 character sets, such as ASCII, that do not follow this, but you cannot
219 define new ones like this.) Defaults to 94. Note that if the dimension
220 is 2, the character set thus described is 94x94 or 96x96.
222 Number of columns used to display a character in this charset.
223 Only used in TTY mode. (Under X, the actual width of a character
224 can be derived from the font used to display the characters.)
225 If unspecified, defaults to the dimension. (This is almost
226 always the correct value, because character sets with dimension 2
227 are usually ideograph character sets, which need two columns to
228 display the intricate ideographs.)
230 A symbol, either @code{l2r} (left-to-right) or @code{r2l}
231 (right-to-left). Defaults to @code{l2r}. This specifies the
232 direction that the text should be displayed in, and will be
233 left-to-right for most charsets but right-to-left for Hebrew
234 and Arabic. (Right-to-left display is not currently implemented.)
236 Final byte of the standard ISO 2022 escape sequence designating this
237 charset. Must be supplied. Each combination of (@var{dimension},
238 @var{chars}) defines a separate namespace for final bytes, and each
239 charset within a particular namespace must have a different final byte.
240 Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
241 dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final
242 bytes in the range 0x30 - 0x3F are reserved for user-defined (not
243 official) character sets. For more information on ISO 2022, see @ref{Coding
246 0 (use left half of font on output) or 1 (use right half of font on
247 output). Defaults to 0. This specifies how to convert the position
248 codes that index a character in a character set into an index into the
249 font used to display the character set. With @code{graphic} set to 0,
250 position codes 33 through 126 map to font indices 33 through 126; with
251 it set to 1, position codes 33 through 126 map to font indices 161
252 through 254 (i.e. the same number but with the high bit set). For
253 example, for a font whose registry is ISO8859-1, the left half of the
254 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right
255 half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset.
257 A compiled CCL program used to convert a character in this charset into
258 an index into the font. This is in addition to the @code{graphic}
259 property. If a CCL program is defined, the position codes of a
260 character will first be processed according to @code{graphic} and
261 then passed through the CCL program, with the resulting values used
264 This is used, for example, in the Big5 character set (used in Taiwan).
265 This character set is not ISO-2022-compliant, and its size (94x157) does
266 not fit within the maximum 96x96 size of ISO-2022-compliant character
267 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion,
268 so as to group the most commonly used characters together) into two
269 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94,
270 and each charset object uses a CCL program to convert the modified
271 position codes back into standard Big5 indices to retrieve a character
275 Most of the above properties can only be changed when the charset
276 is created. @xref{Charset Property Functions}.
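The effect of the @code{graphic} property described above amounts to setting or not setting the high bit of each position code. A minimal Python sketch follows; the function name @code{font_index} is hypothetical, and the optional CCL step is not modeled.

```python
def font_index(position_code, graphic):
    """Map a position code (33..126) to a font index.

    graphic 0 selects the left half of the font (index unchanged);
    graphic 1 selects the right half (high bit set).
    """
    if graphic not in (0, 1):
        raise ValueError("graphic must be 0 or 1")
    return position_code | 0x80 if graphic else position_code


print(font_index(33, 0), font_index(126, 0))   # 33 126
print(font_index(33, 1), font_index(126, 1))   # 161 254
```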
278 @node Basic Charset Functions
279 @subsection Basic Charset Functions
281 @defun find-charset charset-or-name
282 This function retrieves the charset of the given name. If
283 @var{charset-or-name} is a charset object, it is simply returned.
284 Otherwise, @var{charset-or-name} should be a symbol. If there is no
285 such charset, @code{nil} is returned. Otherwise the associated charset
289 @defun get-charset name
290 This function retrieves the charset of the given name. Same as
291 @code{find-charset} except an error is signalled if there is no such
292 charset instead of returning @code{nil}.
296 This function returns a list of the names of all defined charsets.
299 @defun make-charset name doc-string props
300 This function defines a new character set. This function is for use
301 with Mule support. @var{name} is a symbol, the name by which the
302 character set is normally referred. @var{doc-string} is a string
303 describing the character set. @var{props} is a property list,
304 describing the specific nature of the character set. The recognized
305 properties are @code{registry}, @code{dimension}, @code{columns},
306 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and
307 @code{ccl-program}, as previously described.
310 @defun make-reverse-direction-charset charset new-name
311 This function makes a charset equivalent to @var{charset} but which goes
312 in the opposite direction. @var{new-name} is the name of the new
313 charset. The new charset is returned.
316 @defun charset-from-attributes dimension chars final &optional direction
317 This function returns a charset with the given @var{dimension},
318 @var{chars}, @var{final}, and @var{direction}. If @var{direction} is
319 omitted, both directions will be checked (left-to-right will be returned
320 if character sets exist for both directions).
323 @defun charset-reverse-direction-charset charset
324 This function returns the charset (if any) with the same dimension,
325 number of characters, and final byte as @var{charset}, but which is
326 displayed in the opposite direction.
329 @node Charset Property Functions
330 @subsection Charset Property Functions
332 All of these functions accept either a charset name or charset object.
334 @defun charset-property charset prop
335 This function returns property @var{prop} of @var{charset}.
336 @xref{Charset Properties}.
339 Convenience functions are also provided for retrieving individual
340 properties of a charset.
342 @defun charset-name charset
343 This function returns the name of @var{charset}. This will be a symbol.
346 @defun charset-doc-string charset
347 This function returns the doc string of @var{charset}.
350 @defun charset-registry charset
351 This function returns the registry of @var{charset}.
354 @defun charset-dimension charset
355 This function returns the dimension of @var{charset}.
358 @defun charset-chars charset
359 This function returns the number of characters per dimension of
363 @defun charset-columns charset
364 This function returns the number of display columns per character (in
365 TTY mode) of @var{charset}.
368 @defun charset-direction charset
369 This function returns the display direction of @var{charset} -- either
370 @code{l2r} or @code{r2l}.
373 @defun charset-final charset
374 This function returns the final byte of the ISO 2022 escape sequence
375 designating @var{charset}.
378 @defun charset-graphic charset
379 This function returns either 0 or 1, depending on whether the position
380 codes of characters in @var{charset} map to the left or right half
381 of their font, respectively.
384 @defun charset-ccl-program charset
385 This function returns the CCL program, if any, for converting
386 position codes of characters in @var{charset} into font indices.
389 The only property of a charset that can currently be set after
390 the charset has been created is the CCL program.
392 @defun set-charset-ccl-program charset ccl-program
393 This function sets the @code{ccl-program} property of @var{charset} to
397 @node Predefined Charsets
398 @subsection Predefined Charsets
400 The following charsets are predefined in the C code.
@example
Name                    Type  Fi Gr Dir Registry
--------------------------------------------------------------
ascii                   94    B  0  l2r ISO8859-1
control-1               94       0  l2r ---
latin-iso8859-1         96    A  1  l2r ISO8859-1
latin-iso8859-2         96    B  1  l2r ISO8859-2
latin-iso8859-3         96    C  1  l2r ISO8859-3
latin-iso8859-4         96    D  1  l2r ISO8859-4
cyrillic-iso8859-5      96    L  1  l2r ISO8859-5
arabic-iso8859-6        96    G  1  r2l ISO8859-6
greek-iso8859-7         96    F  1  l2r ISO8859-7
hebrew-iso8859-8        96    H  1  r2l ISO8859-8
latin-iso8859-9         96    M  1  l2r ISO8859-9
thai-tis620             96    T  1  l2r TIS620
katakana-jisx0201       94    I  1  l2r JISX0201.1976
latin-jisx0201          94    J  0  l2r JISX0201.1976
japanese-jisx0208-1978  94x94 @@ 0  l2r JISX0208.1978
japanese-jisx0208       94x94 B  0  l2r JISX0208.19(83|90)
japanese-jisx0212       94x94 D  0  l2r JISX0212
chinese-gb2312          94x94 A  0  l2r GB2312
chinese-cns11643-1      94x94 G  0  l2r CNS11643.1
chinese-cns11643-2      94x94 H  0  l2r CNS11643.2
chinese-big5-1          94x94 0  0  l2r Big5
chinese-big5-2          94x94 1  0  l2r Big5
korean-ksc5601          94x94 C  0  l2r KSC5601
composite               96x96    0  l2r ---
@end example
431 The following charsets are predefined in the Lisp code.
@example
Name                    Type  Fi Gr Dir Registry
--------------------------------------------------------------
arabic-digit            94    2  0  l2r MuleArabic-0
arabic-1-column         94    3  0  r2l MuleArabic-1
arabic-2-column         94    4  0  r2l MuleArabic-2
sisheng                 94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
chinese-cns11643-3      94x94 I  0  l2r CNS11643.1
chinese-cns11643-4      94x94 J  0  l2r CNS11643.1
chinese-cns11643-5      94x94 K  0  l2r CNS11643.1
chinese-cns11643-6      94x94 L  0  l2r CNS11643.1
chinese-cns11643-7      94x94 M  0  l2r CNS11643.1
ethiopic                94x94 2  0  l2r Ethio
ascii-r2l               94    B  0  r2l ISO8859-1
ipa                     96    0  1  l2r MuleIPA
vietnamese-lower        96    1  1  l2r VISCII1.1
vietnamese-upper        96    2  1  l2r VISCII1.1
@end example
452 For all of the above charsets, the dimension and number of columns are
455 Note that ASCII, Control-1, and Composite are handled specially.
456 This is why some of the fields are blank; and some of the filled-in
457 fields (e.g. the type) are not really accurate.
459 @node MULE Characters
460 @section MULE Characters
462 @defun make-char charset arg1 &optional arg2
463 This function makes a multi-byte character from @var{charset} and octets
464 @var{arg1} and @var{arg2}.
467 @defun char-charset ch
468 This function returns the character set of char @var{ch}.
471 @defun char-octet ch &optional n
472 This function returns the octet (i.e. position code) numbered @var{n}
473 (should be 0 or 1) of char @var{ch}. @var{n} defaults to 0 if omitted.
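The relationship between a character, its charset, and its octets can be imitated in Python. This is only an analogy: the names @code{make_char_sketch}, @code{char_octet_sketch}, and @code{CHARSETS} are invented here, only three charsets are covered, and Python codecs stand in for the MULE charset objects. Unlike @code{char-octet}, the second helper must be told the charset, since a Python string does not record one.

```python
# Hypothetical table: charset name -> (Python codec, dimension, high bit).
CHARSETS = {
    "ascii": ("ascii", 1, 0x00),
    "latin-iso8859-1": ("latin-1", 1, 0x80),
    "japanese-jisx0208": ("euc_jp", 2, 0x80),
}


def make_char_sketch(charset, arg1, arg2=None):
    """Build a one-character string from a charset name and position codes."""
    codec, dimension, high_bit = CHARSETS[charset]
    octets = [arg1] if dimension == 1 else [arg1, arg2]
    if any(o is None or not 32 <= o <= 127 for o in octets):
        raise ValueError("bad position codes for %s" % charset)
    return bytes(o | high_bit for o in octets).decode(codec)


def char_octet_sketch(ch, charset, n=0):
    """Return position code number n (0 or 1) of ch within charset."""
    codec, _dimension, _high_bit = CHARSETS[charset]
    return ch.encode(codec)[n] & 0x7F


hiragana_a = make_char_sketch("japanese-jisx0208", 36, 34)
print(hiragana_a == "\u3042")                                   # True
print(char_octet_sketch(hiragana_a, "japanese-jisx0208", 1))    # 34
```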
476 @defun find-charset-region start end &optional buffer
477 This function returns a list of the charsets in the region between
478 @var{start} and @var{end}. @var{buffer} defaults to the current buffer
482 @defun find-charset-string string
483 This function returns a list of the charsets in @var{string}.
486 @node Composite Characters
487 @section Composite Characters
489 Composite characters are not yet completely implemented.
491 @defun make-composite-char string
492 This function converts a string into a single composite character. The
493 character is the result of overstriking all the characters in the
497 @defun composite-char-string ch
498 This function returns a string of the characters comprising a composite
502 @defun compose-region start end &optional buffer
503 This function composes the characters in the region from @var{start} to
504 @var{end} in @var{buffer} into one composite character. The composite
505 character replaces the composed characters. @var{buffer} defaults to
506 the current buffer if omitted.
509 @defun decompose-region start end &optional buffer
510 This function decomposes any composite characters in the region from
511 @var{start} to @var{end} in @var{buffer}. This converts each composite
512 character into one or more characters, the individual characters out of
513 which the composite character was formed. Non-composite characters are
514 left as-is. @var{buffer} defaults to the current buffer if omitted.
520 This section briefly describes the ISO 2022 encoding standard. For more
521 thorough understanding, please refer to the original document of ISO
Character sets (@dfn{charsets}) are classified into the following four
categories, according to the number of characters in the charset:
94-charset, 96-charset, 94x94-charset, and 96x96-charset.
531 ASCII(B), left(J) and right(I) half of JISX0201, ...
533 Latin-1(A), Latin-2(B), Latin-3(C), ...
535 GB2312(A), JISX0208(B), KSC5601(C), ...
540 The character in parentheses after the name of each charset
541 is the @dfn{final character} @var{F}, which can be regarded as
542 the identifier of the charset. ECMA allocates @var{F} to each
charset. @var{F} is in the range of 0x30..0x7E, but 0x30..0x3F
544 are only for private use.
546 Note: @dfn{ECMA} = European Computer Manufacturers Association
There are four @dfn{registers of charsets}, called G0 through G3.
549 You can designate (or assign) any charset to one of these
The code space contained within one octet (of size 256) is divided into
4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a
register of charsets can be invoked.
565 Usually, in the initial state, G0 is invoked into GL, and G1
568 ISO 2022 distinguishes 7-bit environments and 8-bit environments. In
569 7-bit environments, only C0 and GL are used.
571 Charset designation is done by escape sequences of the form:
574 ESC [@var{I}] @var{I} @var{F}
577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
578 @var{F} is the final character identifying this charset.
The meanings of the intermediate characters are:
584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
595 The following rule is not allowed in ISO 2022 but can be used in Mule.
598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
601 Here are examples of designations:
605 ESC ( B : designate to G0 ASCII
606 ESC - A : designate to G1 Latin-1
607 ESC $ ( A or ESC $ A : designate to G0 GB2312
608 ESC $ ( B or ESC $ B : designate to G0 JISX0208
609 ESC $ ) C : designate to G1 KSC5601
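The designation rules can be collected into a small Python sketch that builds the escape sequence from a register, a charset size, a dimension, and a final byte. The function name @code{designate} and its parameters are hypothetical. The Mule-only @samp{,} intermediate is left out, and the short form is offered for any two-dimensional 94-charset designated to G0, which is more permissive than actual practice.

```python
ESC = b"\x1b"

# Intermediate characters from the list above, keyed by
# (number of characters per dimension, register index 0..3).
INTERMEDIATES = {
    (94, 0): b"(", (94, 1): b")", (94, 2): b"*", (94, 3): b"+",
    (96, 1): b"-", (96, 2): b".", (96, 3): b"/",
}


def designate(register, chars, dimension, final, short_form=False):
    """Build the escape sequence designating a charset to G0..G3."""
    inter = INTERMEDIATES.get((chars, register))
    if inter is None:
        raise ValueError("no ISO 2022 designation for this combination")
    if dimension == 2:
        # The short form ESC $ F omits the "(" and applies to G0 only.
        if short_form and register == 0 and chars == 94:
            return ESC + b"$" + final
        return ESC + b"$" + inter + final
    return ESC + inter + final


print(designate(0, 94, 1, b"B"))                    # ESC ( B    ASCII to G0
print(designate(1, 96, 1, b"A"))                    # ESC - A    Latin-1 to G1
print(designate(0, 94, 2, b"B"))                    # ESC $ ( B  JISX0208 to G0
print(designate(0, 94, 2, b"B", short_form=True))   # ESC $ B
print(designate(1, 94, 2, b"C"))                    # ESC $ ) C  KSC5601 to G1
```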
To use a charset designated to G2 or G3, and to use a charset designated
to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
into GL. There are two types of invocation: Locking Shift (which stays
in effect until another shift replaces it) and Single Shift (which
affects the next character only).
618 Locking Shift is done as follows:
621 LS0 or SI (0x0F): invoke G0 into GL
622 LS1 or SO (0x0E): invoke G1 into GL
623 LS2: invoke G2 into GL
624 LS3: invoke G3 into GL
625 LS1R: invoke G1 into GR
626 LS2R: invoke G2 into GR
627 LS3R: invoke G3 into GR
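A locking shift in a 7-bit environment can be observed with Python's standard @samp{iso2022_kr} codec, used here purely as an illustration of SO and SI (it is not part of XEmacs). The sketch extracts the bytes sent while G1 is invoked into GL and checks that they are the KSC5601 position codes, that is, the EUC-KR bytes with the high bit cleared.

```python
SO = b"\x0e"   # LS1: invoke G1 into GL
SI = b"\x0f"   # LS0: invoke G0 into GL

text = "A\uac00B"                      # "A", HANGUL SYLLABLE GA, "B"
encoded = text.encode("iso2022_kr")

start = encoded.index(SO) + 1
end = encoded.index(SI, start)
shifted = encoded[start:end]           # bytes read while G1 is in GL

# The shifted bytes are the EUC-KR bytes with the high bit cleared.
euc = "\uac00".encode("euc_kr")
assert shifted == bytes(b & 0x7F for b in euc)
assert encoded.decode("iso2022_kr") == text
```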
630 Single Shift is done as follows:
634 SS2 or ESC N: invoke G2 into GL
635 SS3 or ESC O: invoke G3 into GL
639 (#### Ben says: I think the above is slightly incorrect. It appears that
640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
641 ESC O behave as indicated. The above definitions will not parse
642 EUC-encoded text correctly, and it looks like the code in mule-coding.c
643 has similar problems.)
As you may have realized, there are many ISO-2022-compliant ways of
encoding multilingual text. Many coding systems in actual use, such as
X11's Compound Text, Japanese JUNET code, and so-called EUC (Extended
UNIX Code), are variants of ISO 2022.
650 In Mule, we characterize ISO 2022 by the following attributes:
Initial designation to G0 through G3.
656 Allow designation of short form for Japanese and Chinese.
658 Should we designate ASCII to G0 before control characters?
660 Should we designate ASCII to G0 at the end of line?
662 7-bit environment or 8-bit environment.
664 Use Locking Shift or not.
Use ASCII or JISX0201-1976-Roman.
Use JISX0208-1983 or JISX0208-1978.
671 (The last two are only for Japanese.)
673 By specifying these attributes, you can create any variant
676 Here are several examples:
680 junet -- Coding system used in JUNET.
681 1. G0 <- ASCII, G1..3 <- never used
692 ctext -- Compound Text
693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
euc-china -- Chinese EUC. Although many people call this
"GB encoding", the name may cause misunderstanding.
706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
717 korean-mail -- Coding system used in Korean network.
718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
729 Mule creates all these coding systems by default.
732 @section Coding Systems
734 A coding system is an object that defines how text containing multiple
735 character sets is encoded into a stream of (typically 8-bit) bytes. The
736 coding system is used to decode the stream into a series of characters
737 (which may be from multiple charsets) when the text is read from a file
738 or process, and is used to encode the text back into the same format
739 when it is written out to a file or process.
741 For example, many ISO-2022-compliant coding systems (such as Compound
742 Text, which is used for inter-client data under the X Window System) use
743 escape sequences to switch between different charsets -- Japanese Kanji,
744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See
746 @code{make-coding-system} for more information.
748 Coding systems are normally identified using a symbol, and the symbol is
749 accepted in place of the actual coding system object whenever a coding
750 system is called for. (This is similar to how faces and charsets work.)
752 @defun coding-system-p object
753 This function returns non-@code{nil} if @var{object} is a coding system.
757 * Coding System Types:: Classifying coding systems.
758 * EOL Conversion:: Dealing with different ways of denoting
760 * Coding System Properties:: Properties of a coding system.
761 * Basic Coding System Functions:: Working with coding systems.
762 * Coding System Property Functions:: Retrieving a coding system's properties.
763 * Encoding and Decoding Text:: Encoding and decoding text.
764 * Detection of Textual Encoding:: Determining how text is encoded.
765 * Big5 and Shift-JIS Functions:: Special functions for these non-standard
769 @node Coding System Types
770 @subsection Coding System Types
775 Automatic conversion. XEmacs attempts to detect the coding system used
778 No conversion. Use this for binary files and such. On output, graphic
779 characters that are not in ASCII or Latin-1 will be replaced by a
780 @samp{?}. (For a no-conversion-encoded buffer, these characters will
781 only be present if you explicitly insert them.)
783 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
785 Any ISO-2022-compliant encoding. Among other things, this includes JIS
786 (the Japanese encoding commonly used for e-mail), national variants of
787 EUC (the standard Unix encoding for Japanese and other languages), and
788 Compound Text (an encoding used in X11). You can specify more specific
789 information about the conversion with the @var{flags} argument.
Big5 (the encoding commonly used in Taiwan for Chinese text).
793 The conversion is performed using a user-written pseudo-code program.
794 CCL (Code Conversion Language) is the name of this pseudo-code.
796 Write out or read in the raw contents of the memory representing the
797 buffer's text. This is primarily useful for debugging purposes, and is
798 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
799 (the @samp{--debug} configure option). @strong{Warning}: Reading in a
800 file using @code{internal} conversion can result in an internal
801 inconsistency in the memory representing a buffer's text, which will
802 produce unpredictable results and may cause XEmacs to crash. Under
803 normal circumstances you should never use @code{internal} conversion.
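The output behavior described for the no-conversion type (characters outside ASCII and Latin-1 become @samp{?}) resembles what Python's @samp{latin-1} codec does with the @samp{replace} error handler. The sketch below is an analogy only, with a hypothetical function name; it says nothing about how XEmacs implements the conversion.

```python
def write_unconverted(text):
    """Encode text one byte per character, substituting "?" where impossible.

    Python's "latin-1" codec stands in for the no-conversion coding
    system; this is an analogy, not the XEmacs implementation.
    """
    return text.encode("latin-1", errors="replace")


print(write_unconverted("A\u00e9\u3042"))   # b'A\xe9?'
```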
807 @subsection EOL Conversion
811 Automatically detect the end-of-line type (LF, CRLF, or CR). Also
812 generate subsidiary coding systems named @code{@var{name}-unix},
813 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
814 this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf},
815 and @code{cr}, respectively.
817 The end of a line is marked externally using ASCII LF. Since this is
818 also the way that XEmacs represents an end-of-line internally,
819 specifying this option results in no end-of-line conversion. This is
820 the standard format for Unix text files.
822 The end of a line is marked externally using ASCII CRLF. This is the
823 standard format for MS-DOS text files.
825 The end of a line is marked externally using ASCII CR. This is the
826 standard format for Macintosh text files.
828 Automatically detect the end-of-line type but do not generate subsidiary
829 coding systems. (This value is converted to @code{nil} when stored
830 internally, and @code{coding-system-property} will return @code{nil}.)
833 @node Coding System Properties
834 @subsection Coding System Properties
838 String to be displayed in the modeline when this coding system is
842 End-of-line conversion to be used. It should be one of the types
843 listed in @ref{EOL Conversion}.
845 @item post-read-conversion
846 Function called after a file has been read in, to perform the decoding.
847 Called with two arguments, @var{beg} and @var{end}, denoting a region of
848 the current buffer to be decoded.
850 @item pre-write-conversion
851 Function called before a file is written out, to perform the encoding.
852 Called with two arguments, @var{beg} and @var{end}, denoting a region of
853 the current buffer to be encoded.
856 The following additional properties are recognized if @var{type} is
864 The character set initially designated to the G0 - G3 registers.
865 The value should be one of
869 A charset object (designate that character set)
871 @code{nil} (do not ever use this register)
873 @code{t} (no character set is initially designated to the register, but
874 may be later on; this automatically sets the corresponding
875 @code{force-g*-on-output} property)
878 @item force-g0-on-output
879 @itemx force-g1-on-output
880 @itemx force-g2-on-output
881 @itemx force-g3-on-output
882 If non-@code{nil}, send an explicit designation sequence on output
883 before using the specified register.
886 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
887 and @samp{ESC $ B} on output in place of the full designation sequences
888 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
891 If non-@code{nil}, don't designate ASCII to G0 at each end of line on
892 output. Setting this to non-@code{nil} also suppresses other
893 state-resetting that normally happens at the end of a line.
896 If non-@code{nil}, don't designate ASCII to G0 before control chars on
900 If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit
904 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
905 designation by escape sequence.
908 If non-@code{nil}, don't use ISO6429's direction specification.
If non-@code{nil}, literal control characters that are the same as the
912 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
913 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
914 and CSI (0x9B)) are ``quoted'' with an escape character so that they can
915 be properly distinguished from an escape sequence. (Note that doing
916 this results in a non-portable encoding.) This encoding flag is used for
917 byte-compiled files. Note that ESC is a good choice for a quoting
918 character because there are no escape sequences whose second byte is a
919 character from the Control-0 or Control-1 character sets; this is
920 explicitly disallowed by the ISO 2022 standard.
922 @item input-charset-conversion
923 A list of conversion specifications, specifying conversion of characters
924 in one charset to another when decoding is performed. Each
925 specification is a list of two elements: the source charset, and the
928 @item output-charset-conversion
929 A list of conversion specifications, specifying conversion of characters
930 in one charset to another when encoding is performed. The form of each
931 specification is the same as for @code{input-charset-conversion}.
934 The following additional properties are recognized (and required) if
935 @var{type} is @code{ccl}:
939 CCL program used for decoding (converting to internal format).
942 CCL program used for encoding (converting to external format).
945 @node Basic Coding System Functions
946 @subsection Basic Coding System Functions
948 @defun find-coding-system coding-system-or-name
949 This function retrieves the coding system of the given name.
951 If @var{coding-system-or-name} is a coding-system object, it is simply
952 returned. Otherwise, @var{coding-system-or-name} should be a symbol.
953 If there is no such coding system, @code{nil} is returned. Otherwise
954 the associated coding system object is returned.
957 @defun get-coding-system name
958 This function retrieves the coding system of the given name. Same as
959 @code{find-coding-system} except an error is signalled if there is no
960 such coding system instead of returning @code{nil}.
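The difference between the two lookup functions can be sketched as
follows. The coding system name @code{no-such-coding-system} is
hypothetical and is assumed not to be defined.

@example
;; Returns nil without signalling an error.
(find-coding-system 'no-such-coding-system)

;; Signals an error instead; here the error is caught and mapped
;; to the symbol `not-found'.
(condition-case nil
    (get-coding-system 'no-such-coding-system)
  (error 'not-found))
@end example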
963 @defun coding-system-list
964 This function returns a list of the names of all defined coding systems.
967 @defun coding-system-name coding-system
968 This function returns the name of the given coding system.
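As a small illustration, the following sketch looks up every defined
coding system by name and asks each object for its name again. The
result depends on how XEmacs was built, so no particular output is
shown.

@example
;; Map each name to its coding system object and back to a name.
(mapcar (lambda (name)
          (coding-system-name (get-coding-system name)))
        (coding-system-list))
@end example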
971 @defun make-coding-system name type &optional doc-string props
972 This function registers symbol @var{name} as a coding system.
974 @var{type} describes the conversion method used and should be one of
975 the types listed in @ref{Coding System Types}.
977 @var{doc-string} is a string describing the coding system.
979 @var{props} is a property list, describing the specific nature of the
coding system. Recognized properties are as in @ref{Coding System
Properties}.
984 @defun copy-coding-system old-coding-system new-name
985 This function copies @var{old-coding-system} to @var{new-name}. If
@var{new-name} does not name an existing coding system, a new one will
be created.
990 @defun subsidiary-coding-system coding-system eol-type
991 This function returns the subsidiary coding system of
992 @var{coding-system} with eol type @var{eol-type}.
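A minimal sketch of asking for a subsidiary coding system. It assumes
that a coding system named @code{iso-2022-jp} exists in your build and
that its EOL type is detected automatically, so that subsidiaries were
generated; neither assumption is guaranteed.

@example
(let ((cs (find-coding-system 'iso-2022-jp)))  ; assumed to exist
  (if cs
      ;; Ask for the variant that uses CRLF line endings and
      ;; report its name.
      (coding-system-name (subsidiary-coding-system cs 'crlf))))
@end example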
995 @node Coding System Property Functions
996 @subsection Coding System Property Functions
998 @defun coding-system-doc-string coding-system
999 This function returns the doc string for @var{coding-system}.
1002 @defun coding-system-type coding-system
1003 This function returns the type of @var{coding-system}.
1006 @defun coding-system-property coding-system prop
1007 This function returns the @var{prop} property of @var{coding-system}.
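The three accessors can be combined as in the sketch below. The coding
system name @code{binary} and the property name @code{eol-type} are
assumptions made for the example; substitute names that exist in your
XEmacs.

@example
(let ((cs (find-coding-system 'binary)))       ; assumed to exist
  (if cs
      (list (coding-system-type cs)
            (coding-system-doc-string cs)
            ;; `eol-type' is assumed to be the property name.
            (coding-system-property cs 'eol-type))))
@end example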
1010 @node Encoding and Decoding Text
1011 @subsection Encoding and Decoding Text
1013 @defun decode-coding-region start end coding-system &optional buffer
1014 This function decodes the text between @var{start} and @var{end} which
1015 is encoded in @var{coding-system}. This is useful if you've read in
1016 encoded text from a file without decoding it (e.g. you read in a
1017 JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
1018 system, so that it shows up as @samp{^[$B!<!+^[(B}). The length of the
encoded text is returned. @var{buffer} defaults to the current buffer
if unspecified.
1023 @defun encode-coding-region start end coding-system &optional buffer
1024 This function encodes the text between @var{start} and @var{end} using
1025 @var{coding-system}. This will, for example, convert Japanese
1026 characters into stuff such as @samp{^[$B!<!+^[(B} if you use the JIS
1027 encoding. The length of the encoded text is returned. @var{buffer}
1028 defaults to the current buffer if unspecified.
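A round trip through both functions might look like the following
sketch. It assumes that @code{with-temp-buffer} is available and that a
coding system named @code{iso-2022-jp} is defined; the text is plain
ASCII so that the example does not depend on any particular font or
input method.

@example
(with-temp-buffer
  (insert "some text")
  ;; Convert the buffer contents to the external representation ...
  (encode-coding-region (point-min) (point-max) 'iso-2022-jp)
  ;; ... and back to the internal one.
  (decode-coding-region (point-min) (point-max) 'iso-2022-jp)
  (buffer-string))
@end example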
1031 @node Detection of Textual Encoding
1032 @subsection Detection of Textual Encoding
1034 @defun coding-category-list
1035 This function returns a list of all recognized coding categories.
1038 @defun set-coding-priority-list list
1039 This function changes the priority order of the coding categories.
1040 @var{list} should be a list of coding categories, in descending order of
1041 priority. Unspecified coding categories will be lower in priority than
1042 all specified ones, in the same relative order they were in previously.
1045 @defun coding-priority-list
This function returns a list of coding categories in descending order of
priority.
1050 @defun set-coding-category-system coding-category coding-system
1051 This function changes the coding system associated with a coding category.
1054 @defun coding-category-system coding-category
1055 This function returns the coding system associated with a coding category.
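The priority functions can be exercised as in this sketch, which
temporarily promotes the lowest-priority category, records the coding
system associated with each category, and then restores the original
order. It changes global state for the duration of the form, so treat
it as an illustration only.

@example
(let ((saved (coding-priority-list)))
  ;; `last' returns a one-element list holding the category that
  ;; currently has the lowest priority.
  (set-coding-priority-list (last saved))
  (prog1
      (mapcar (lambda (category)
                (cons category (coding-category-system category)))
              (coding-priority-list))
    ;; Restore the previous order.
    (set-coding-priority-list saved)))
@end example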
1058 @defun detect-coding-region start end &optional buffer
This function detects the coding system of the text in the region between
@var{start} and @var{end}. The returned value is a list of possible coding
systems ordered by priority. If only ASCII characters are found, it
returns @code{autodetect} or one of its subsidiary coding systems
according to the detected end-of-line type. The optional argument
@var{buffer} defaults to the current buffer.
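For example, detection on a region that contains only ASCII text can be
tried as follows; the sketch assumes that @code{with-temp-buffer} is
available.

@example
(with-temp-buffer
  (insert "plain ASCII text\n")
  ;; Only ASCII is present, so the result should be `autodetect'
  ;; or one of its subsidiary coding systems.
  (detect-coding-region (point-min) (point-max)))
@end example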
1067 @node Big5 and Shift-JIS Functions
1068 @subsection Big5 and Shift-JIS Functions
1070 These are special functions for working with the non-standard
1071 Shift-JIS and Big5 encodings.
1073 @defun decode-shift-jis-char code
This function decodes a JISX0208 character encoded in the Shift-JIS
coding system. @var{code} is the character code in Shift-JIS, as a cons
of two bytes.
1076 The corresponding character is returned.
1079 @defun encode-shift-jis-char ch
This function encodes the JISX0208 character @var{ch} in the Shift-JIS
coding system. The corresponding character code in Shift-JIS is
returned as a cons of two bytes.
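The two Shift-JIS functions are inverses of each other, which the
following sketch checks. It assumes that @code{make-char} accepts the
charset name @code{japanese-jisx0208} followed by two position codes;
the codes 36 and 34 are chosen only for illustration.

@example
(let* ((ch   (make-char 'japanese-jisx0208 36 34)) ; assumed call
       (code (encode-shift-jis-char ch)))          ; cons of two bytes
  ;; Decoding the Shift-JIS code should give back the same character.
  (list code (eq ch (decode-shift-jis-char code))))
@end example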
1085 @defun decode-big5-char code
This function decodes the Big5 character code @var{code}.
@var{code} is the character code in Big5. The corresponding character
is returned.
This function encodes the Big5 character @var{ch} in the Big5
coding system. The corresponding character code in Big5 is returned.
1096 @node CCL, Category Tables, Coding Systems, MULE
1099 CCL (Code Conversion Language) is a simple structured programming
1100 language designed for character coding conversions. A CCL program is
1101 compiled to CCL code (represented by a vector of integers) and executed
1102 by the CCL interpreter embedded in Emacs. The CCL interpreter
1103 implements a virtual machine with 8 registers called @code{r0}, ...,
1104 @code{r7}, a number of control structures, and some I/O operators. Take
1105 care when using registers @code{r0} (used in implicit @dfn{set}
1106 statements) and especially @code{r7} (used internally by several
statements and operations, especially for multiple return values and I/O
operations).
1110 CCL is used for code conversion during process I/O and file I/O for
1111 non-ISO2022 coding systems. (It is the only way for a user to specify a
1112 code conversion function.) It is also used for calculating the code
1113 point of an X11 font from a character code. However, since CCL is
1114 designed as a powerful programming language, it can be used for more
1115 generic calculation where efficiency is demanded. A combination of
three or more arithmetic operations can be calculated faster by CCL than
by Lisp.
1119 @strong{Warning:} The code in @file{src/mule-ccl.c} and
1120 @file{$packages/lisp/mule-base/mule-ccl.el} is the definitive
1121 description of CCL's semantics. The previous version of this section
1122 contained several typos and obsolete names left from earlier versions of
1123 MULE, and many may remain. (I am not an experienced CCL programmer; the
1124 few who know CCL well find writing English painful.)
1126 A CCL program transforms an input data stream into an output data
1127 stream. The input stream, held in a buffer of constant bytes, is left
1128 unchanged. The buffer may be filled by an external input operation,
1129 taken from an Emacs buffer, or taken from a Lisp string. The output
1130 buffer is a dynamic array of bytes, which can be written by an external
output operation, inserted into an Emacs buffer, or returned as a Lisp
string.
1134 A CCL program is a (Lisp) list containing two or three members. The
1135 first member is the @dfn{buffer magnification}, which indicates the
1136 required minimum size of the output buffer as a multiple of the input
1137 buffer. It is followed by the @dfn{main block} which executes while
1138 there is input remaining, and an optional @dfn{EOF block} which is
1139 executed when the input is exhausted. Both the main block and the EOF
1140 block are CCL blocks.
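The overall shape of a CCL program can be sketched as follows. The
variable name @code{my-ccl-copy} is hypothetical; the program copies
its input to its output and writes a newline when the input is
exhausted.

@example
(defconst my-ccl-copy
  '(1                              ; buffer magnification
    ((read r0)                     ; main block: read the first byte
     (loop
       ;; Write r0, read the next byte, jump to the loop head.
       (write-read-repeat r0)))
    ((write "\n"))))               ; EOF block
@end example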
1142 A @dfn{CCL block} is either a CCL statement or list of CCL statements.
1143 A @dfn{CCL statement} is either a @dfn{set statement} (either an integer
1144 or an @dfn{assignment}, which is a list of a register to receive the
1145 assignment, an assignment operator, and an expression) or a @dfn{control
1146 statement} (a list starting with a keyword, whose allowable syntax
1147 depends on the keyword).
1150 * CCL Syntax:: CCL program syntax in BNF notation.
1151 * CCL Statements:: Semantics of CCL statements.
1152 * CCL Expressions:: Operators and expressions in CCL.
1153 * Calling CCL:: Running CCL programs.
1154 * CCL Examples:: The encoding functions for Big5 and KOI-8.
1157 @node CCL Syntax, CCL Statements, CCL, CCL
1158 @comment Node, Next, Previous, Up
1159 @subsection CCL Syntax
1161 The full syntax of a CCL program in BNF notation:
1165 (BUFFER_MAGNIFICATION
1169 BUFFER_MAGNIFICATION := integer
1170 CCL_MAIN_BLOCK := CCL_BLOCK
1171 CCL_EOF_BLOCK := CCL_BLOCK
1174 STATEMENT | (STATEMENT [STATEMENT ...])
1176 SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE
1181 | (REG ASSIGNMENT_OPERATOR EXPRESSION)
1184 EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)
1186 IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
1187 BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
1188 LOOP := (loop STATEMENT [STATEMENT ...])
1192 | (write-repeat [REG | integer | string])
1193 | (write-read-repeat REG [integer | ARRAY])
1196 | (read-if (REG OPERATOR ARG) CCL_BLOCK CCL_BLOCK)
1197 | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
1200 | (write EXPRESSION)
1201 | (write integer) | (write string) | (write REG ARRAY)
1203 CALL := (call ccl-program-name)
1206 REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
1207 ARG := REG | integer
1209 + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
1210 | < | > | == | <= | >= | != | de-sjis | en-sjis
1211 ASSIGNMENT_OPERATOR :=
1212 += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
1213 ARRAY := '[' integer ... ']'
1216 @node CCL Statements, CCL Expressions, CCL Syntax, CCL
1217 @comment Node, Next, Previous, Up
1218 @subsection CCL Statements
1220 The Emacs Code Conversion Language provides the following statement
1221 types: @dfn{set}, @dfn{if}, @dfn{branch}, @dfn{loop}, @dfn{repeat},
1222 @dfn{break}, @dfn{read}, @dfn{write}, @dfn{call}, and @dfn{end}.
1224 @heading Set statement:
1226 The @dfn{set} statement has three variants with the syntaxes
1227 @samp{(@var{reg} = @var{expression})},
1228 @samp{(@var{reg} @var{assignment_operator} @var{expression})}, and
1229 @samp{@var{integer}}. The assignment operator variation of the
1230 @dfn{set} statement works the same way as the corresponding C expression
1231 statement does. The assignment operators are @code{+=}, @code{-=},
1232 @code{*=}, @code{/=}, @code{%=}, @code{&=}, @code{|=}, @code{^=},
1233 @code{<<=}, and @code{>>=}, and they have the same meanings as in C. A
``naked integer'' @var{integer} is equivalent to a @dfn{set} statement of
the form @code{(r0 = @var{integer})}.
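The three variants of the set statement look like this in practice. The
fragment is only an illustration; it performs no I/O.

@example
'(1
  ((r1 = 65)     ; plain assignment: r1 is now 65
   (r1 += 1)     ; as in C: r1 is now 66
   (r1 <<= 2)    ; as in C: r1 is now 264
   7))           ; naked integer: equivalent to (r0 = 7)
@end example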
1237 @heading I/O statements:
1239 The @dfn{read} statement takes one or more registers as arguments. It
1240 reads one byte (a C char) from the input into each register in turn.
The @dfn{write} statement takes several forms. In the form @samp{(write @var{reg}
...)} it takes one or more registers as arguments and writes each in
turn to the output. The integer in a register (interpreted as an
Emchar) is encoded to multibyte form (i.e., Bufbytes) and written to the
current output buffer. If it is less than 256, it is written as is.
1247 The forms @samp{(write @var{expression})} and @samp{(write
1248 @var{integer})} are treated analogously. The form @samp{(write
1249 @var{string})} writes the constant string to the output. A
``naked string'' @samp{@var{string}} is equivalent to the statement @samp{(write
1251 @var{string})}. The form @samp{(write @var{reg} @var{array})} writes
1252 the @var{reg}th element of the @var{array} to the output.
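The forms of @code{read} and @code{write} described above can be
combined as in the following illustrative fragment.

@example
'(1
  ((read r0)                  ; one byte of input into r0
   (write r0)                 ; contents of a register
   (write (r0 + 1))           ; value of an expression
   (write 10)                 ; an integer constant
   (write "ok")               ; a constant string
   "ok"                       ; naked string: same as (write "ok")
   (r1 = 2)
   (write r1 [65 66 67])))    ; array element selected by r1
@end example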
1254 @heading Conditional statements:
1256 The @dfn{if} statement takes an @var{expression}, a @var{CCL block}, and
1257 an optional @var{second CCL block} as arguments. If the
1258 @var{expression} evaluates to non-zero, the first @var{CCL block} is
executed. Otherwise, if there is a @var{second CCL block}, it is
executed.
1262 The @dfn{read-if} variant of the @dfn{if} statement takes an
1263 @var{expression}, a @var{CCL block}, and an optional @var{second CCL
1264 block} as arguments. The @var{expression} must have the form
1265 @code{(@var{reg} @var{operator} @var{operand})} (where @var{operand} is
1266 a register or an integer). The @code{read-if} statement first reads
1267 from the input into the first register operand in the @var{expression},
then conditionally executes a CCL block just as the @code{if} statement
does.
1271 The @dfn{branch} statement takes an @var{expression} and one or more CCL
1272 blocks as arguments. The CCL blocks are treated as a zero-indexed
1273 array, and the @code{branch} statement uses the @var{expression} as the
1274 index of the CCL block to execute. Null CCL blocks may be used as
1275 no-ops, continuing execution with the statement following the
1276 @code{branch} statement in the containing CCL block. Out-of-range
values for the @var{expression} are also treated as no-ops.
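A @code{branch} statement with three blocks might be written as in this
sketch. The index is masked to the range 0 to 3, so the value 3 shows
the out-of-range case.

@example
'(1
  ((read r0)
   (r1 = (r0 & 3))          ; index in the range 0..3
   (branch r1
     (write "zero")         ; executed when r1 is 0
     (write "one")          ; executed when r1 is 1
     (write "two"))         ; executed when r1 is 2
   ;; When r1 is 3 the index is out of range and the branch is a
   ;; no-op; execution continues here in every case.
   (write "\n")))
@end example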
The @dfn{read-branch} variant of the @dfn{branch} statement takes a
@var{register} and one or more CCL blocks as arguments. The
@code{read-branch} statement first reads from the input into the
@var{register}, then uses the value read as the index of the CCL block
to execute, just as the @code{branch} statement does.
1285 @heading Loop control statements:
1287 The @dfn{loop} statement creates a block with an implied jump from the
1288 end of the block back to its head. The loop is exited on a @code{break}
1289 statement, and continued without executing the tail by a @code{repeat}
1292 The @dfn{break} statement, written @samp{(break)}, terminates the
current loop and continues with the next statement in the current
block.
1296 The @dfn{repeat} statement has three variants, @code{repeat},
1297 @code{write-repeat}, and @code{write-read-repeat}. Each continues the
1298 current loop from its head, possibly after performing I/O.
1299 @code{repeat} takes no arguments and does no I/O before jumping.
1300 @code{write-repeat} takes a single argument (a register, an
1301 integer, or a string), writes it to the output, then jumps.
1302 @code{write-read-repeat} takes one or two arguments. The first must
1303 be a register. The second may be an integer or an array; if absent, it
1304 is implicitly set to the first (register) argument.
1305 @code{write-read-repeat} writes its second argument to the output, then
1306 reads from the input into the register, and finally jumps. See the
1307 @code{write} and @code{read} statements for the semantics of the I/O
1308 operations for each type of argument.
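As a sketch of a complete loop, the following program converts
lower-case ASCII letters to upper case and passes every other byte
through unchanged. The name @code{my-ccl-upcase} is hypothetical.

@example
(defconst my-ccl-upcase
  '(1
    ((read r0)
     (loop
       (if (r0 >= 97)              ; at or above the code of `a'
           (if (r0 <= 122)         ; at or below the code of `z'
               (r0 -= 32)))        ; shift to the upper-case letter
       ;; Write r0, read the next byte into r0, jump to the head.
       (write-read-repeat r0)))))
@end example

Because @code{write-read-repeat} performs the write, the read, and the
jump in a single statement, the loop body needs no separate
@code{repeat}.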
1310 @heading Other control statements:
1312 The @dfn{call} statement, written @samp{(call @var{ccl-program-name})},
1313 executes a CCL program as a subroutine. It does not return a value to
1314 the caller, but can modify the register status.
1316 The @dfn{end} statement, written @samp{(end)}, terminates the CCL
1317 program successfully, and returns to caller (which may be a CCL
1318 program). It does not alter the status of the registers.
1320 @node CCL Expressions, Calling CCL, CCL Statements, CCL
1321 @comment Node, Next, Previous, Up
1322 @subsection CCL Expressions
1324 CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions
1325 consist of a single @var{operand}, either a register (one of @code{r0},
..., @code{r7}) or an integer. Complex expressions are lists of the
1327 form @code{( @var{expression} @var{operator} @var{operand} )}. Unlike
1328 C, assignments are not expressions.
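Since the right-hand operand of an operator must be a register or an
integer, complex expressions nest only on the left. The following
fragment illustrates this.

@example
'(1
  ((r1 = 3)
   (r2 = 4)
   (r3 = ((r1 * 94) + r2))   ; C: r3 = r1 * 94 + r2, so r3 is 286
   ;; Not allowed by the grammar: the right-hand operand may not
   ;; be a parenthesized expression.
   ;; (r3 = (r1 * (94 + r2)))
   ))
@end example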
In the following table, @var{X} is the target register for a @dfn{set}.
1331 In subexpressions, this is implicitly @code{r7}. This means that
1332 @code{>8}, @code{//}, @code{de-sjis}, and @code{en-sjis} cannot be used
1333 freely in subexpressions, since they return parts of their values in
1334 @code{r7}. @var{Y} may be an expression, register, or integer, while
1335 @var{Z} must be a register or an integer.
1337 @multitable @columnfractions .22 .14 .09 .55
1338 @item Name @tab Operator @tab Code @tab C-like Description
1339 @item CCL_PLUS @tab @code{+} @tab 0x00 @tab X = Y + Z
1340 @item CCL_MINUS @tab @code{-} @tab 0x01 @tab X = Y - Z
1341 @item CCL_MUL @tab @code{*} @tab 0x02 @tab X = Y * Z
1342 @item CCL_DIV @tab @code{/} @tab 0x03 @tab X = Y / Z
1343 @item CCL_MOD @tab @code{%} @tab 0x04 @tab X = Y % Z
1344 @item CCL_AND @tab @code{&} @tab 0x05 @tab X = Y & Z
1345 @item CCL_OR @tab @code{|} @tab 0x06 @tab X = Y | Z
1346 @item CCL_XOR @tab @code{^} @tab 0x07 @tab X = Y ^ Z
1347 @item CCL_LSH @tab @code{<<} @tab 0x08 @tab X = Y << Z
1348 @item CCL_RSH @tab @code{>>} @tab 0x09 @tab X = Y >> Z
1349 @item CCL_LSH8 @tab @code{<8} @tab 0x0A @tab X = (Y << 8) | Z
1350 @item CCL_RSH8 @tab @code{>8} @tab 0x0B @tab X = Y >> 8, r[7] = Y & 0xFF
1351 @item CCL_DIVMOD @tab @code{//} @tab 0x0C @tab X = Y / Z, r[7] = Y % Z
@item CCL_LS @tab @code{<} @tab 0x10 @tab X = (Y < Z)
@item CCL_GT @tab @code{>} @tab 0x11 @tab X = (Y > Z)
@item CCL_EQ @tab @code{==} @tab 0x12 @tab X = (Y == Z)
@item CCL_LE @tab @code{<=} @tab 0x13 @tab X = (Y <= Z)
@item CCL_GE @tab @code{>=} @tab 0x14 @tab X = (Y >= Z)
@item CCL_NE @tab @code{!=} @tab 0x15 @tab X = (Y != Z)
1358 @item CCL_ENCODE_SJIS @tab @code{en-sjis} @tab 0x16 @tab X = HIGHER_BYTE (SJIS (Y, Z))
@item @tab @tab @tab r[7] = LOWER_BYTE (SJIS (Y, Z))
1360 @item CCL_DECODE_SJIS @tab @code{de-sjis} @tab 0x17 @tab X = HIGHER_BYTE (DE-SJIS (Y, Z))
1361 @item @tab @tab @tab r[7] = LOWER_BYTE (DE-SJIS (Y, Z))
1364 The CCL operators are as in C, with the addition of CCL_LSH8, CCL_RSH8,
CCL_DIVMOD, CCL_ENCODE_SJIS, and CCL_DECODE_SJIS. The CCL_ENCODE_SJIS
and CCL_DECODE_SJIS operators treat their first and second operands as
the high and low bytes of a two-byte character code. (SJIS stands for
Shift JIS, an
1368 encoding of Japanese characters used by Microsoft. CCL_ENCODE_SJIS is a
1369 complicated transformation of the Japanese standard JIS encoding to
1370 Shift JIS. CCL_DECODE_SJIS is its inverse.) It is somewhat odd to
1371 represent the SJIS operations in infix form.
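The operators that have no direct C equivalent can be tried out with a
fragment such as the following, which uses the table above to work out
the values by hand.

@example
'(1
  ((r1 = 18)
   (r2 = 52)
   (r0 = (r1 <8 r2))      ; r0 = (18 << 8) | 52 = 4660
   (r3 = (r0 // 94))))    ; r3 = 4660 / 94 = 49, r7 = 4660 % 94 = 54
@end example

Note that the remainder arrives in @code{r7}, which is why such
operators should not be buried inside larger expressions.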
1373 @node Calling CCL, CCL Examples, CCL Expressions, CCL
1374 @comment Node, Next, Previous, Up
1375 @subsection Calling CCL
1377 CCL programs are called automatically during Emacs buffer I/O when the
1378 external representation has a coding system type of @code{shift-jis},
1379 @code{big5}, or @code{ccl}. The program is specified by the coding
1380 system (@pxref{Coding Systems}). You can also call CCL programs from
1381 other CCL programs, and from Lisp using these functions:
1383 @defun ccl-execute ccl-program status
1384 Execute @var{ccl-program} with registers initialized by
1385 @var{status}. @var{ccl-program} is a vector of compiled CCL code
1386 created by @code{ccl-compile}. It is an error for the program to try to
1387 execute a CCL I/O command. @var{status} must be a vector of nine
1388 values, specifying the initial value for the R0, R1 .. R7 registers and
1389 for the instruction counter IC. A @code{nil} value for a register
1390 initializer causes the register to be set to 0. A @code{nil} value for
1391 the IC initializer causes execution to start at the beginning of the
1392 program. When the program is done, @var{status} is modified (by
side-effect) to contain the ending values for the corresponding
registers and IC.
1397 @defun ccl-execute-on-string ccl-program status str &optional continue
1398 Execute @var{ccl-program} with initial @var{status} on
@var{str}. @var{ccl-program} is a vector of compiled CCL code
1400 created by @code{ccl-compile}. @var{status} must be a vector of nine
1401 values, specifying the initial value for the R0, R1 .. R7 registers and
1402 for the instruction counter IC. A @code{nil} value for a register
1403 initializer causes the register to be set to 0. A @code{nil} value for
1404 the IC initializer causes execution to start at the beginning of the
program. An optional fourth argument @var{continue}, if non-@code{nil}, causes
1407 remain on the unsatisfied read operation if the program terminates due
1408 to exhaustion of the input buffer. Otherwise the IC is set to the end
1409 of the program. When the program is done, @var{status} is modified (by
1410 side-effect) to contain the ending values for the corresponding
1411 registers and IC. Returns the resulting string.
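Running a program on a string from Lisp might look like the following
sketch. It assumes that @code{ccl-compile} is available in your XEmacs
(it may need to be loaded first); the program simply copies its input.

@example
(let ((program '(1 ((read r0)
                    (loop (write-read-repeat r0)))))
      ;; Nine slots: r0 through r7, then the instruction counter.
      ;; nil means 0 for a register and "start of program" for IC.
      (status (make-vector 9 nil)))
  (ccl-execute-on-string (ccl-compile program) status "hello"))
@end example

After the call, @code{status} holds the final register values, so it
can be inspected to see what the program left behind.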
1414 To call a CCL program from another CCL program, it must first be
1417 @defun register-ccl-program name ccl-program
Register @var{name} for CCL program @var{ccl-program} in
@code{ccl-program-table}. @var{ccl-program} should be the compiled form of
a CCL program, or @code{nil}. Returns the index number of the registered CCL
program.
1424 Information about the processor time used by the CCL interpreter can be
1425 obtained using these functions:
1427 @defun ccl-elapsed-time
Returns the elapsed processor time of the CCL interpreter as a cons of
user and system time, as floating point numbers measured in seconds.
If only one
1431 overall value can be determined, the return value will be a cons of that
1435 @defun ccl-reset-elapsed-time
1436 Resets the CCL interpreter's internal elapsed time registers.
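A typical measurement resets the counters, runs the conversion of
interest, and then reads the counters back, as in this sketch.

@example
(ccl-reset-elapsed-time)
;; ... run the CCL program or the I/O operation to be measured ...
(let ((times (ccl-elapsed-time)))
  (message "CCL user time: %s, system time: %s"
           (car times) (cdr times)))
@end example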
1439 @node CCL Examples, , Calling CCL, CCL
1440 @comment Node, Next, Previous, Up
1441 @subsection CCL Examples
1443 This section is not yet written.
1445 @node Category Tables, , CCL, MULE
1446 @section Category Tables
1448 A category table is a type of char table used for keeping track of
1449 categories. Categories are used for classifying characters for use in
1450 regexps -- you can refer to a category rather than having to use a
complicated [] expression (and category lookups are significantly
faster).
1454 There are 95 different categories available, one for each printable
1455 character (including space) in the ASCII charset. Each category is
1456 designated by one such character, called a @dfn{category designator}.
1457 They are specified in a regexp using the syntax @samp{\cX}, where X is a
1458 category designator. (This is not yet implemented.)
1460 A category table specifies, for each character, the categories that
1461 the character is in. Note that a character can be in more than one
1462 category. More specifically, a category table maps from a character to
1463 either the value @code{nil} (meaning the character is in no categories)
1464 or a 95-element bit vector, specifying for each of the 95 categories
1465 whether the character is in that category.
1467 Special Lisp functions are provided that abstract this, so you do not
1468 have to directly manipulate bit vectors.
1470 @defun category-table-p obj
This function returns @code{t} if @var{obj} is a category table.
1474 @defun category-table &optional buffer
1475 This function returns the current category table. This is the one
specified by the current buffer, or by @var{buffer} if it is
non-@code{nil}.
1480 @defun standard-category-table
1481 This function returns the standard category table. This is the one used
1485 @defun copy-category-table &optional table
This function constructs a new category table and returns it. It is a
1487 copy of the @var{table}, which defaults to the standard category table.
1490 @defun set-category-table table &optional buffer
This function selects a new category table for @var{buffer}.
@var{table} should be a category table. @var{buffer} defaults to the
current buffer if omitted.
1496 @defun category-designator-p obj
This function returns @code{t} if @var{obj} is a category designator (a
1498 char in the range @samp{' '} to @samp{'~'}).
1501 @defun category-table-value-p obj
This function returns @code{t} if @var{obj} is a category table value.
1503 Valid values are @code{nil} or a bit vector of size 95.
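The predicates and constructors above can be exercised together, as in
the following sketch. It only calls functions documented in this
section.

@example
(let ((table (copy-category-table)))     ; copy of the standard table
  (list (category-table-p table)
        (category-table-p (standard-category-table))
        (category-designator-p ?a)       ; between space and tilde
        (category-designator-p ?\n)      ; newline is outside the range
        (category-table-value-p nil)))   ; nil is a valid value
@end example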