2 @c This is part of the XEmacs Lisp Reference Manual.
3 @c Copyright (C) 1996 Ben Wing.
4 @c See the file lispref.texi for copying conditions.
5 @setfilename ../../info/internationalization.info
6 @node MULE, Tips, Internationalization, top
9 @dfn{MULE} is the name originally given to the version of GNU Emacs
10 extended for multi-lingual (and in particular Asian-language) support.
11 ``MULE'' is short for ``MUlti-Lingual Emacs''. It was originally called
12 Nemacs (``Nihon Emacs'' where ``Nihon'' is the Japanese word for
13 ``Japan''), when it only provided support for Japanese. XEmacs
14 refers to its multi-lingual support as @dfn{MULE support} since it
15 is based on @dfn{MULE}.
18 * Internationalization Terminology::
19 Definition of various internationalization terms.
20 * Charsets:: Sets of related characters.
21 * MULE Characters:: Working with characters in XEmacs/MULE.
22 * Composite Characters:: Making new characters by overstriking other ones.
23 * ISO 2022:: An international standard for charsets and encodings.
24 * Coding Systems:: Ways of representing a string of chars using integers.
25 * CCL:: A special language for writing fast converters.
26 * Category Tables:: Subdividing charsets into groups.
29 @node Internationalization Terminology
30 @section Internationalization Terminology
32 In internationalization terminology, a string of text is divided up
33 into @dfn{characters}, which are the printable units that make up the
34 text. A single character is (for example) a capital @samp{A}, the
35 number @samp{2}, a Katakana character, a Kanji ideograph (an
@dfn{ideograph} is a ``picture'' character, such as is used in Japanese
Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
38 of such ideographs in each language), etc. The basic property of a
39 character is its shape. Note that the same character may be drawn by
40 two different people (or in two different fonts) in slightly different
41 ways, although the basic shape will be the same.
43 In some cases, the differences will be significant enough that it is
44 actually possible to identify two or more distinct shapes that both
45 represent the same character. For example, the lowercase letters
46 @samp{a} and @samp{g} each have two distinct possible shapes -- the
47 @samp{a} can optionally have a curved tail projecting off the top, and
48 the @samp{g} can be formed either of two loops, or of one loop and a
49 tail hanging off the bottom. Such distinct possible shapes of a
50 character are called @dfn{glyphs}. The important characteristic of two
51 glyphs making up the same character is that the choice between one or
52 the other is purely stylistic and has no linguistic effect on a word
53 (this is the reason why a capital @samp{A} and lowercase @samp{a}
54 are different characters rather than different glyphs -- e.g.
55 @samp{Aspen} is a city while @samp{aspen} is a kind of tree).
57 Note that @dfn{character} and @dfn{glyph} are used differently
58 here than elsewhere in XEmacs.
60 A @dfn{character set} is simply a set of related characters. ASCII,
61 for example, is a set of 94 characters (or 128, if you count
62 non-printing characters). Other character sets are ISO8859-1 (ASCII
63 plus various accented characters and other international symbols),
64 JISX0201 (ASCII, more or less, plus half-width Katakana), JISX0208
65 (Japanese Kanji), JISX0212 (a second set of less-used Japanese Kanji),
66 GB2312 (Mainland Chinese Hanzi), etc.
68 Every character set has one or more @dfn{orderings}, which can be
69 viewed as a way of assigning a number (or set of numbers) to each
70 character in the set. For most character sets, there is a standard
71 ordering, and in fact all of the character sets mentioned above define a
72 particular ordering. ASCII, for example, places letters in their
73 ``natural'' order, puts uppercase letters before lowercase letters,
74 numbers before letters, etc. Note that for many of the Asian character
75 sets, there is no natural ordering of the characters. The actual
orderings are based on one or more salient characteristics, of which
77 there are many to choose from -- e.g. number of strokes, common
78 radicals, phonetic ordering, etc.
The numbers assigned to any particular character are called
81 the character's @dfn{position codes}. The number of position codes
82 required to index a particular character in a character set is called
83 the @dfn{dimension} of the character set. ASCII, being a relatively
84 small character set, is of dimension one, and each character in the
85 set is indexed using a single position code, in the range 0 through
86 127 (if non-printing characters are included) or 33 through 126
87 (if only the printing characters are considered). JISX0208, i.e.
88 Japanese Kanji, has thousands of characters, and is of dimension two --
89 every character is indexed by two position codes, each in the range
90 33 through 126. (Note that the choice of the range here is somewhat
91 arbitrary. Although a character set such as JISX0208 defines an
92 @emph{ordering} of all its characters, it does not define the actual
93 mapping between numbers and characters. You could just as easily
94 index the characters in JISX0208 using numbers in the range 0 through
95 93, 1 through 94, 2 through 95, etc. The reason for the actual range
96 chosen is so that the position codes match up with the actual values
97 used in the common encodings.)
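The relationship between position codes and the bytes of a common
encoding can be checked outside XEmacs. The following Python sketch is
only an illustration and is not part of the XEmacs API: it assumes
Python's standard @samp{euc_jp} codec, and the helper name
@code{position_codes} is made up for this example. It recovers the two
JISX0208 position codes of a character by clearing the high bit of each
EUC byte.

```python
def position_codes(char):
    """Return the JISX0208 position codes of CHAR as a tuple of ints.

    Assumes CHAR is in JISX0208, so EUC-JP encodes it as two bytes,
    each of which is a position code with the high bit set.
    """
    encoded = char.encode("euc_jp")
    if len(encoded) != 2 or encoded[0] < 0xA1:
        raise ValueError("not a two-byte JISX0208 character")
    return tuple(byte & 0x7F for byte in encoded)


# Hiragana letter A (U+3042) sits at position codes 0x24, 0x22.
codes = position_codes("\u3042")
print(codes)  # (36, 34)
# Both codes fall in the range 33 through 126 described above.
print(all(33 <= code <= 126 for code in codes))  # True
```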
An @dfn{encoding} is a way of numerically representing characters from
one or more character sets as a stream of like-sized numerical values
101 called @dfn{words}; typically these are 8-bit, 16-bit, or 32-bit
102 quantities. If an encoding encompasses only one character set, then the
103 position codes for the characters in that character set could be used
104 directly. (This is the case with ASCII, and as a result, most people do
105 not understand the difference between a character set and an encoding.)
106 This is not possible, however, if more than one character set is to be
107 used in the encoding. For example, printed Japanese text typically
108 requires characters from multiple character sets -- ASCII, JISX0208, and
109 JISX0212, to be specific. Each of these is indexed using one or more
110 position codes in the range 33 through 126, so the position codes could
111 not be used directly or there would be no way to tell which character
112 was meant. Different Japanese encodings handle this differently -- JIS
113 uses special escape characters to denote different character sets; EUC
114 sets the high bit of the position codes for JISX0208 and JISX0212, and
115 puts a special extra byte before each JISX0212 character; etc. (JIS,
116 EUC, and most of the other encodings you will encounter are 7-bit or
117 8-bit encodings. There is one common 16-bit encoding, which is Unicode;
118 this strives to represent all the world's characters in a single large
119 character set. 32-bit encodings are generally used internally in
120 programs to simplify the code that manipulates them; however, they are
121 not much used externally because they are not very space-efficient.)
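To make the difference between the JIS and EUC approaches concrete, the
following Python sketch encodes the same two characters both ways. It
is an illustration only and assumes Python's standard
@samp{iso2022_jp} and @samp{euc_jp} codecs, which are not part of
XEmacs.

```python
text = "A\u3042"  # ASCII letter A followed by Hiragana letter A

# JIS (ISO-2022-JP): escape sequences switch between character sets.
jis = text.encode("iso2022_jp")
print(jis)  # b'A\x1b$B$"\x1b(B'

# EUC-JP: JISX0208 position codes are sent with the high bit set.
euc = text.encode("euc_jp")
print(euc)  # b'A\xa4\xa2'

# Both byte streams decode back to the same two characters.
print(jis.decode("iso2022_jp") == euc.decode("euc_jp") == text)  # True
```

In the JIS output, the position codes 0x24 0x22 appear unchanged
between the two escape sequences; in the EUC output, the same codes
appear as 0xA4 0xA2.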
123 Encodings are classified as either @dfn{modal} or @dfn{non-modal}. In
124 a @dfn{modal encoding}, there are multiple states that the encoding can be in,
125 and the interpretation of the values in the stream depends on the
126 current global state of the encoding. Special values in the encoding,
127 called @dfn{escape sequences}, are used to change the global state.
128 JIS, for example, is a modal encoding. The bytes @samp{ESC $ B}
129 indicate that, from then on, bytes are to be interpreted as position
130 codes for JISX0208, rather than as ASCII. This effect is cancelled
131 using the bytes @samp{ESC ( B}, which mean ``switch from whatever the
current state is to ASCII''. To switch to JISX0212, the escape sequence
@samp{ESC $ ( D} is used. (Note that here, as is common, the escape sequences do
in fact begin with @samp{ESC}. This is not necessarily the case,
however.)
137 A @dfn{non-modal encoding} has no global state that extends past the
138 character currently being interpreted. EUC, for example, is a
139 non-modal encoding. Characters in JISX0208 are encoded by setting
140 the high bit of the position codes, and characters in JISX0212 are
encoded by doing the same but also prefixing the character with the
byte 0x8F.
144 The advantage of a modal encoding is that it is generally more
145 space-efficient, and is easily extendable because there are essentially
146 an arbitrary number of escape sequences that can be created. The
147 disadvantage, however, is that it is much more difficult to work with
148 if it is not being processed in a sequential manner. In the non-modal
149 EUC encoding, for example, the byte 0x41 always refers to the letter
150 @samp{A}; whereas in JIS, it could either be the letter @samp{A}, or
151 one of the two position codes in a JISX0208 character, or one of the
152 two position codes in a JISX0212 character. Determining exactly which
153 one is meant could be difficult and time-consuming if the previous
154 bytes in the string have not already been processed.
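The ambiguity of the byte 0x41 in a modal encoding can be demonstrated
with a short Python sketch. This is an illustration under the
assumption that Python's standard @samp{iso2022_jp} and @samp{euc_jp}
codecs behave like the JIS and EUC encodings described here; it does
not use any XEmacs function.

```python
ESC = b"\x1b"

# In JIS, the byte 0x41 is the letter A while ASCII is designated...
ascii_run = b"A"
# ...but after ESC $ B the same byte is half of a JISX0208 character.
kanji_run = ESC + b"$B" + b"%A" + ESC + b"(B"

print(ascii_run.decode("iso2022_jp"))        # A
katakana = kanji_run.decode("iso2022_jp")
print(len(katakana), hex(ord(katakana)))     # 1 0x30c1

# In EUC-JP the same character never contains the byte 0x41.
print(0x41 in katakana.encode("euc_jp"))     # False
```

A program that jumps into the middle of @code{kanji_run} and sees 0x41
cannot tell which meaning applies without scanning back to the last
escape sequence.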
156 Non-modal encodings are further divided into @dfn{fixed-width} and
157 @dfn{variable-width} formats. A fixed-width encoding always uses
158 the same number of words per character, whereas a variable-width
159 encoding does not. EUC is a good example of a variable-width
160 encoding: one to three bytes are used per character, depending on
161 the character set. 16-bit and 32-bit encodings are nearly always
162 fixed-width, and this is in fact one of the main reasons for using
163 an encoding with a larger word size. The advantages of fixed-width
164 encodings should be obvious. The advantages of variable-width
165 encodings are that they are generally more space-efficient and allow
166 for compatibility with existing 8-bit encodings such as ASCII.
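The variable width of EUC can be observed directly. The following
Python sketch is an illustration only; it assumes that Python's
standard @samp{euc_jp} codec covers JISX0201 Katakana, JISX0208, and
JISX0212, and the sample characters were chosen for this example.

```python
samples = [
    ("ASCII letter", "A"),
    ("half-width Katakana, JISX0201", "\uff71"),
    ("Hiragana, JISX0208", "\u3042"),
    ("Kanji, JISX0212", "\u4e02"),
]

lengths = []
for label, char in samples:
    encoded = char.encode("euc_jp")
    lengths.append(len(encoded))
    print(label, len(encoded), encoded.hex())

print(lengths)  # [1, 2, 2, 3]
```

The three-byte case is the JISX0212 character, whose two position codes
are preceded by an extra prefix byte.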
168 Note that the bytes in an 8-bit encoding are often referred to
169 as @dfn{octets} rather than simply as bytes. This terminology
170 dates back to the days before 8-bit bytes were universal, when
171 some computers had 9-bit bytes, others had 10-bit bytes, etc.
176 A @dfn{charset} in MULE is an object that encapsulates a
177 particular character set as well as an ordering of those characters.
Charsets are permanent objects and are named using symbols, like
faces.
181 @defun charsetp object
182 This function returns non-@code{nil} if @var{object} is a charset.
186 * Charset Properties:: Properties of a charset.
187 * Basic Charset Functions:: Functions for working with charsets.
188 * Charset Property Functions:: Functions for accessing charset properties.
189 * Predefined Charsets:: Predefined charset objects.
192 @node Charset Properties
193 @subsection Charset Properties
195 Charsets have the following properties:
199 A symbol naming the charset. Every charset must have a different name;
200 this allows a charset to be referred to using its name rather than
201 the actual charset object.
203 A documentation string describing the charset.
205 A regular expression matching the font registry field for this character
206 set. For example, both the @code{ascii} and @code{latin-iso8859-1}
207 charsets use the registry @code{"ISO8859-1"}. This field is used to
208 choose an appropriate font when the user gives a general font
209 specification such as @samp{-*-courier-medium-r-*-140-*}, i.e. a
210 14-point upright medium-weight Courier font.
212 Number of position codes used to index a character in the character set.
213 XEmacs/MULE can only handle character sets of dimension 1 or 2.
214 This property defaults to 1.
216 Number of characters in each dimension. In XEmacs/MULE, the only
217 allowed values are 94 or 96. (There are a couple of pre-defined
218 character sets, such as ASCII, that do not follow this, but you cannot
219 define new ones like this.) Defaults to 94. Note that if the dimension
220 is 2, the character set thus described is 94x94 or 96x96.
222 Number of columns used to display a character in this charset.
223 Only used in TTY mode. (Under X, the actual width of a character
224 can be derived from the font used to display the characters.)
225 If unspecified, defaults to the dimension. (This is almost
226 always the correct value, because character sets with dimension 2
227 are usually ideograph character sets, which need two columns to
228 display the intricate ideographs.)
230 A symbol, either @code{l2r} (left-to-right) or @code{r2l}
231 (right-to-left). Defaults to @code{l2r}. This specifies the
232 direction that the text should be displayed in, and will be
233 left-to-right for most charsets but right-to-left for Hebrew
234 and Arabic. (Right-to-left display is not currently implemented.)
236 Final byte of the standard ISO 2022 escape sequence designating this
237 charset. Must be supplied. Each combination of (@var{dimension},
238 @var{chars}) defines a separate namespace for final bytes, and each
239 charset within a particular namespace must have a different final byte.
240 Note that ISO 2022 restricts the final byte to the range 0x30 - 0x7E if
241 dimension == 1, and 0x30 - 0x5F if dimension == 2. Note also that final
242 bytes in the range 0x30 - 0x3F are reserved for user-defined (not
official) character sets. For more information on ISO 2022, see
@ref{Coding Systems}.
246 0 (use left half of font on output) or 1 (use right half of font on
247 output). Defaults to 0. This specifies how to convert the position
248 codes that index a character in a character set into an index into the
249 font used to display the character set. With @code{graphic} set to 0,
250 position codes 33 through 126 map to font indices 33 through 126; with
251 it set to 1, position codes 33 through 126 map to font indices 161
252 through 254 (i.e. the same number but with the high bit set). For
253 example, for a font whose registry is ISO8859-1, the left half of the
254 font (octets 0x20 - 0x7F) is the @code{ascii} charset, while the right
255 half (octets 0xA0 - 0xFF) is the @code{latin-iso8859-1} charset.
257 A compiled CCL program used to convert a character in this charset into
258 an index into the font. This is in addition to the @code{graphic}
259 property. If a CCL program is defined, the position codes of a
260 character will first be processed according to @code{graphic} and
then passed through the CCL program, with the resulting values used
to index the font.
264 This is used, for example, in the Big5 character set (used in Taiwan).
265 This character set is not ISO-2022-compliant, and its size (94x157) does
266 not fit within the maximum 96x96 size of ISO-2022-compliant character
267 sets. As a result, XEmacs/MULE splits it (in a rather complex fashion,
268 so as to group the most commonly used characters together) into two
269 charset objects (@code{big5-1} and @code{big5-2}), each of size 94x94,
270 and each charset object uses a CCL program to convert the modified
position codes back into standard Big5 indices to retrieve a character
from a Big5 font.
275 Most of the above properties can only be changed when the charset
276 is created. @xref{Charset Property Functions}.
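The effect of the @code{graphic} property described above amounts to
setting or leaving clear the high bit of each position code. The
following Python sketch restates that rule; the function name
@code{font_index} is hypothetical and does not correspond to any XEmacs
function.

```python
def font_index(position_code, graphic):
    """Map a position code to a font index per the graphic property.

    graphic 0 selects the left half of the font (index unchanged);
    graphic 1 selects the right half (high bit set).
    """
    if graphic not in (0, 1):
        raise ValueError("graphic must be 0 or 1")
    return position_code | 0x80 if graphic else position_code


print(font_index(33, 0), font_index(126, 0))  # 33 126
print(font_index(33, 1), font_index(126, 1))  # 161 254
```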
278 @node Basic Charset Functions
279 @subsection Basic Charset Functions
281 @defun find-charset charset-or-name
282 This function retrieves the charset of the given name. If
283 @var{charset-or-name} is a charset object, it is simply returned.
284 Otherwise, @var{charset-or-name} should be a symbol. If there is no
such charset, @code{nil} is returned. Otherwise the associated charset
object is returned.
289 @defun get-charset name
290 This function retrieves the charset of the given name. Same as
291 @code{find-charset} except an error is signalled if there is no such
292 charset instead of returning @code{nil}.
@defun charset-list
This function returns a list of the names of all defined charsets.
299 @defun make-charset name doc-string props
300 This function defines a new character set. This function is for use
301 with Mule support. @var{name} is a symbol, the name by which the
302 character set is normally referred. @var{doc-string} is a string
303 describing the character set. @var{props} is a property list,
304 describing the specific nature of the character set. The recognized
305 properties are @code{registry}, @code{dimension}, @code{columns},
306 @code{chars}, @code{final}, @code{graphic}, @code{direction}, and
307 @code{ccl-program}, as previously described.
310 @defun make-reverse-direction-charset charset new-name
311 This function makes a charset equivalent to @var{charset} but which goes
312 in the opposite direction. @var{new-name} is the name of the new
313 charset. The new charset is returned.
316 @defun charset-from-attributes dimension chars final &optional direction
317 This function returns a charset with the given @var{dimension},
318 @var{chars}, @var{final}, and @var{direction}. If @var{direction} is
319 omitted, both directions will be checked (left-to-right will be returned
320 if character sets exist for both directions).
323 @defun charset-reverse-direction-charset charset
324 This function returns the charset (if any) with the same dimension,
325 number of characters, and final byte as @var{charset}, but which is
326 displayed in the opposite direction.
329 @node Charset Property Functions
330 @subsection Charset Property Functions
332 All of these functions accept either a charset name or charset object.
334 @defun charset-property charset prop
335 This function returns property @var{prop} of @var{charset}.
336 @xref{Charset Properties}.
339 Convenience functions are also provided for retrieving individual
340 properties of a charset.
342 @defun charset-name charset
343 This function returns the name of @var{charset}. This will be a symbol.
346 @defun charset-doc-string charset
347 This function returns the doc string of @var{charset}.
350 @defun charset-registry charset
351 This function returns the registry of @var{charset}.
354 @defun charset-dimension charset
355 This function returns the dimension of @var{charset}.
358 @defun charset-chars charset
This function returns the number of characters per dimension of
@var{charset}.
363 @defun charset-columns charset
364 This function returns the number of display columns per character (in
365 TTY mode) of @var{charset}.
368 @defun charset-direction charset
369 This function returns the display direction of @var{charset} -- either
370 @code{l2r} or @code{r2l}.
373 @defun charset-final charset
374 This function returns the final byte of the ISO 2022 escape sequence
375 designating @var{charset}.
378 @defun charset-graphic charset
379 This function returns either 0 or 1, depending on whether the position
380 codes of characters in @var{charset} map to the left or right half
381 of their font, respectively.
384 @defun charset-ccl-program charset
385 This function returns the CCL program, if any, for converting
386 position codes of characters in @var{charset} into font indices.
389 The only property of a charset that can currently be set after
390 the charset has been created is the CCL program.
392 @defun set-charset-ccl-program charset ccl-program
This function sets the @code{ccl-program} property of @var{charset} to
@var{ccl-program}.
397 @node Predefined Charsets
398 @subsection Predefined Charsets
400 The following charsets are predefined in the C code.
@example
Name                    Type  Fi Gr Dir Registry
--------------------------------------------------------------
ascii                   94    B  0  l2r ISO8859-1
control-1               94       0  l2r ---
latin-iso8859-1         96    A  1  l2r ISO8859-1
latin-iso8859-2         96    B  1  l2r ISO8859-2
latin-iso8859-3         96    C  1  l2r ISO8859-3
latin-iso8859-4         96    D  1  l2r ISO8859-4
cyrillic-iso8859-5      96    L  1  l2r ISO8859-5
arabic-iso8859-6        96    G  1  r2l ISO8859-6
greek-iso8859-7         96    F  1  l2r ISO8859-7
hebrew-iso8859-8        96    H  1  r2l ISO8859-8
latin-iso8859-9         96    M  1  l2r ISO8859-9
thai-tis620             96    T  1  l2r TIS620
katakana-jisx0201       94    I  1  l2r JISX0201.1976
latin-jisx0201          94    J  0  l2r JISX0201.1976
japanese-jisx0208-1978  94x94 @@  0  l2r JISX0208.1978
japanese-jisx0208       94x94 B  0  l2r JISX0208.19(83|90)
japanese-jisx0212       94x94 D  0  l2r JISX0212
chinese-gb2312          94x94 A  0  l2r GB2312
chinese-cns11643-1      94x94 G  0  l2r CNS11643.1
chinese-cns11643-2      94x94 H  0  l2r CNS11643.2
chinese-big5-1          94x94 0  0  l2r Big5
chinese-big5-2          94x94 1  0  l2r Big5
korean-ksc5601          94x94 C  0  l2r KSC5601
composite               96x96    0  l2r ---
@end example
431 The following charsets are predefined in the Lisp code.
@example
Name                    Type  Fi Gr Dir Registry
--------------------------------------------------------------
arabic-digit            94    2  0  l2r MuleArabic-0
arabic-1-column         94    3  0  r2l MuleArabic-1
arabic-2-column         94    4  0  r2l MuleArabic-2
sisheng                 94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
chinese-cns11643-3      94x94 I  0  l2r CNS11643.1
chinese-cns11643-4      94x94 J  0  l2r CNS11643.1
chinese-cns11643-5      94x94 K  0  l2r CNS11643.1
chinese-cns11643-6      94x94 L  0  l2r CNS11643.1
chinese-cns11643-7      94x94 M  0  l2r CNS11643.1
ethiopic                94x94 2  0  l2r Ethio
ascii-r2l               94    B  0  r2l ISO8859-1
ipa                     96    0  1  l2r MuleIPA
vietnamese-lower        96    1  1  l2r VISCII1.1
vietnamese-upper        96    2  1  l2r VISCII1.1
@end example
For all of the above charsets, the dimension and number of columns are
the same.
455 Note that ASCII, Control-1, and Composite are handled specially.
456 This is why some of the fields are blank; and some of the filled-in
457 fields (e.g. the type) are not really accurate.
459 @node MULE Characters
460 @section MULE Characters
462 @defun make-char charset arg1 &optional arg2
463 This function makes a multi-byte character from @var{charset} and octets
464 @var{arg1} and @var{arg2}.
467 @defun char-charset ch
468 This function returns the character set of char @var{ch}.
471 @defun char-octet ch &optional n
472 This function returns the octet (i.e. position code) numbered @var{n}
473 (should be 0 or 1) of char @var{ch}. @var{n} defaults to 0 if omitted.
476 @defun find-charset-region start end &optional buffer
477 This function returns a list of the charsets in the region between
@var{start} and @var{end}. @var{buffer} defaults to the current buffer
if omitted.
482 @defun find-charset-string string
483 This function returns a list of the charsets in @var{string}.
486 @node Composite Characters
487 @section Composite Characters
489 Composite characters are not yet completely implemented.
491 @defun make-composite-char string
492 This function converts a string into a single composite character. The
character is the result of overstriking all the characters in the
string.
497 @defun composite-char-string ch
This function returns a string of the characters comprising a composite
character.
502 @defun compose-region start end &optional buffer
503 This function composes the characters in the region from @var{start} to
504 @var{end} in @var{buffer} into one composite character. The composite
505 character replaces the composed characters. @var{buffer} defaults to
506 the current buffer if omitted.
509 @defun decompose-region start end &optional buffer
510 This function decomposes any composite characters in the region from
511 @var{start} to @var{end} in @var{buffer}. This converts each composite
512 character into one or more characters, the individual characters out of
513 which the composite character was formed. Non-composite characters are
514 left as-is. @var{buffer} defaults to the current buffer if omitted.
This section briefly describes the ISO 2022 encoding standard. For a more
thorough understanding, please refer to the original ISO 2022
document.
524 Character sets (@dfn{charsets}) are classified into the following four
525 categories, according to the number of characters of charset:
526 94-charset, 96-charset, 94x94-charset, and 96x96-charset.
@table @asis
@item 94-charset
ASCII(B), left(J) and right(I) half of JISX0201, ...
@item 96-charset
Latin-1(A), Latin-2(B), Latin-3(C), ...
@item 94x94-charset
GB2312(A), JISX0208(B), KSC5601(C), ...
@end table
540 The character in parentheses after the name of each charset
541 is the @dfn{final character} @var{F}, which can be regarded as
542 the identifier of the charset. ECMA allocates @var{F} to each
charset. @var{F} is in the range 0x30..0x7E, but 0x30..0x3F
544 are only for private use.
546 Note: @dfn{ECMA} = European Computer Manufacturers Association
There are four @dfn{registers of charsets}, called G0 through G3.
You can designate (or assign) any charset to one of these
registers.
552 The code space contained within one octet (of size 256) is divided into
553 4 areas: C0, GL, C1, and GR. GL and GR are the areas into which a
register of charsets can be invoked.
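The four areas can be told apart from the value of the octet alone.
The following Python sketch is an illustration only; the boundaries it
uses (0x20, 0x80, and 0xA0) are the usual ISO 2022 values rather than
something stated in the surrounding text, and the function name
@code{code_area} is made up for this example.

```python
def code_area(octet):
    """Return the ISO 2022 code space area containing OCTET."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("octet out of range")
    if octet < 0x20:
        return "C0"
    if octet < 0x80:
        return "GL"
    if octet < 0xA0:
        return "C1"
    return "GR"


print([code_area(b) for b in (0x1B, 0x41, 0x8E, 0xC1)])
# ['C0', 'GL', 'C1', 'GR']
```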
Usually, in the initial state, G0 is invoked into GL, and G1
is invoked into GR.
568 ISO 2022 distinguishes 7-bit environments and 8-bit environments. In
569 7-bit environments, only C0 and GL are used.
571 Charset designation is done by escape sequences of the form:
574 ESC [@var{I}] @var{I} @var{F}
577 where @var{I} is an intermediate character in the range 0x20 - 0x2F, and
578 @var{F} is the final character identifying this charset.
The meanings of the intermediate characters are:
584 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
585 ( [0x28]: designate to G0 a 94-charset whose final byte is @var{F}.
586 ) [0x29]: designate to G1 a 94-charset whose final byte is @var{F}.
587 * [0x2A]: designate to G2 a 94-charset whose final byte is @var{F}.
588 + [0x2B]: designate to G3 a 94-charset whose final byte is @var{F}.
589 - [0x2D]: designate to G1 a 96-charset whose final byte is @var{F}.
590 . [0x2E]: designate to G2 a 96-charset whose final byte is @var{F}.
591 / [0x2F]: designate to G3 a 96-charset whose final byte is @var{F}.
595 The following rule is not allowed in ISO 2022 but can be used in Mule.
598 , [0x2C]: designate to G0 a 96-charset whose final byte is @var{F}.
601 Here are examples of designations:
605 ESC ( B : designate to G0 ASCII
606 ESC - A : designate to G1 Latin-1
607 ESC $ ( A or ESC $ A : designate to G0 GB2312
608 ESC $ ( B or ESC $ B : designate to G0 JISX0208
609 ESC $ ) C : designate to G1 KSC5601
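The designation rules above can be summarized in a few lines of code.
The following Python sketch builds the long-form escape sequence from a
register number, charset size, dimension, and final byte. It is an
illustration only: the names @code{designation},
@code{INTERMEDIATE_94}, and @code{INTERMEDIATE_96} are hypothetical,
the short forms such as @samp{ESC $ B} are not generated, and
designating a 96-charset to G0 relies on the Mule-only extension.

```python
ESC = b"\x1b"

# Intermediate characters from the rules above, indexed by register.
INTERMEDIATE_94 = b"()*+"   # G0, G1, G2, G3
INTERMEDIATE_96 = b",-./"   # G0 (Mule extension), G1, G2, G3


def designation(register, chars, dimension, final):
    """Build the escape sequence designating a charset to REGISTER.

    REGISTER is 0 through 3, CHARS is 94 or 96, DIMENSION is 1 or 2,
    and FINAL is the one-character final byte, e.g. "B".
    """
    if register not in range(4) or dimension not in (1, 2):
        raise ValueError("bad register or dimension")
    if chars not in (94, 96):
        raise ValueError("chars must be 94 or 96")
    table = INTERMEDIATE_94 if chars == 94 else INTERMEDIATE_96
    prefix = b"$" if dimension == 2 else b""
    return ESC + prefix + table[register:register + 1] + final.encode("ascii")


print(designation(0, 94, 1, "B"))  # b'\x1b(B'    ASCII to G0
print(designation(1, 96, 1, "A"))  # b'\x1b-A'    Latin-1 to G1
print(designation(0, 94, 2, "B"))  # b'\x1b$(B'   JISX0208 to G0
print(designation(1, 94, 2, "C"))  # b'\x1b$)C'   KSC5601 to G1
```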
613 To use a charset designated to G2 or G3, and to use a charset designated
614 to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3
615 into GL. There are two types of invocation, Locking Shift (forever) and
616 Single Shift (one character only).
618 Locking Shift is done as follows:
621 LS0 or SI (0x0F): invoke G0 into GL
622 LS1 or SO (0x0E): invoke G1 into GL
623 LS2: invoke G2 into GL
624 LS3: invoke G3 into GL
625 LS1R: invoke G1 into GR
626 LS2R: invoke G2 into GR
627 LS3R: invoke G3 into GR
630 Single Shift is done as follows:
634 SS2 or ESC N: invoke G2 into GL
635 SS3 or ESC O: invoke G3 into GL
639 (#### Ben says: I think the above is slightly incorrect. It appears that
640 SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and
641 ESC O behave as indicated. The above definitions will not parse
642 EUC-encoded text correctly, and it looks like the code in mule-coding.c
643 has similar problems.)
As you can see, there are many ISO-2022-compliant ways of
encoding multilingual text. Many coding systems in actual use, such as
X11's Compound Text, Japanese JUNET code, and so-called
EUC (Extended UNIX Code), are variants of ISO 2022.
650 In Mule, we characterize ISO 2022 by the following attributes:
654 Initial designation to G0 thru G3.
656 Allow designation of short form for Japanese and Chinese.
658 Should we designate ASCII to G0 before control characters?
660 Should we designate ASCII to G0 at the end of line?
662 7-bit environment or 8-bit environment.
664 Use Locking Shift or not.
Use ASCII or JISX0201-1976-Roman.
Use JISX0208-1983 or JISX0208-1978.
671 (The last two are only for Japanese.)
By specifying these attributes, you can create any variant
of ISO 2022.
676 Here are several examples:
680 junet -- Coding system used in JUNET.
681 1. G0 <- ASCII, G1..3 <- never used
692 ctext -- Compound Text
693 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
euc-china -- Chinese EUC. Although many people call this
"GB encoding", the name may cause misunderstanding.
706 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
717 korean-mail -- Coding system used in Korean network.
718 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
729 Mule creates all these coding systems by default.
732 @section Coding Systems
734 A coding system is an object that defines how text containing multiple
735 character sets is encoded into a stream of (typically 8-bit) bytes. The
736 coding system is used to decode the stream into a series of characters
737 (which may be from multiple charsets) when the text is read from a file
738 or process, and is used to encode the text back into the same format
739 when it is written out to a file or process.
741 For example, many ISO-2022-compliant coding systems (such as Compound
742 Text, which is used for inter-client data under the X Window System) use
743 escape sequences to switch between different charsets -- Japanese Kanji,
744 for example, is invoked with @samp{ESC $ ( B}; ASCII is invoked with
745 @samp{ESC ( B}; and Cyrillic is invoked with @samp{ESC - L}. See
746 @code{make-coding-system} for more information.
748 Coding systems are normally identified using a symbol, and the symbol is
749 accepted in place of the actual coding system object whenever a coding
750 system is called for. (This is similar to how faces and charsets work.)
752 @defun coding-system-p object
753 This function returns non-@code{nil} if @var{object} is a coding system.
757 * Coding System Types:: Classifying coding systems.
758 * EOL Conversion:: Dealing with different ways of denoting
760 * Coding System Properties:: Properties of a coding system.
761 * Basic Coding System Functions:: Working with coding systems.
762 * Coding System Property Functions:: Retrieving a coding system's properties.
763 * Encoding and Decoding Text:: Encoding and decoding text.
764 * Detection of Textual Encoding:: Determining how text is encoded.
765 * Big5 and Shift-JIS Functions:: Special functions for these non-standard
769 @node Coding System Types
770 @subsection Coding System Types
Automatic conversion. XEmacs attempts to detect the coding system used
in the file.
778 No conversion. Use this for binary files and such. On output, graphic
779 characters that are not in ASCII or Latin-1 will be replaced by a
780 @samp{?}. (For a no-conversion-encoded buffer, these characters will
781 only be present if you explicitly insert them.)
783 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
785 Any ISO-2022-compliant encoding. Among other things, this includes JIS
786 (the Japanese encoding commonly used for e-mail), national variants of
787 EUC (the standard Unix encoding for Japanese and other languages), and
788 Compound Text (an encoding used in X11). You can specify more specific
789 information about the conversion with the @var{flags} argument.
Big5 (the encoding commonly used in Taiwan for Chinese text).
793 The conversion is performed using a user-written pseudo-code program.
794 CCL (Code Conversion Language) is the name of this pseudo-code.
796 Write out or read in the raw contents of the memory representing the
797 buffer's text. This is primarily useful for debugging purposes, and is
798 only enabled when XEmacs has been compiled with @code{DEBUG_XEMACS} set
799 (the @samp{--debug} configure option). @strong{Warning}: Reading in a
800 file using @code{internal} conversion can result in an internal
801 inconsistency in the memory representing a buffer's text, which will
802 produce unpredictable results and may cause XEmacs to crash. Under
803 normal circumstances you should never use @code{internal} conversion.
807 @subsection EOL Conversion
811 Automatically detect the end-of-line type (LF, CRLF, or CR). Also
812 generate subsidiary coding systems named @code{@var{name}-unix},
813 @code{@var{name}-dos}, and @code{@var{name}-mac}, that are identical to
814 this coding system but have an EOL-TYPE value of @code{lf}, @code{crlf},
815 and @code{cr}, respectively.
817 The end of a line is marked externally using ASCII LF. Since this is
818 also the way that XEmacs represents an end-of-line internally,
819 specifying this option results in no end-of-line conversion. This is
820 the standard format for Unix text files.
822 The end of a line is marked externally using ASCII CRLF. This is the
823 standard format for MS-DOS text files.
825 The end of a line is marked externally using ASCII CR. This is the
826 standard format for Macintosh text files.
828 Automatically detect the end-of-line type but do not generate subsidiary
829 coding systems. (This value is converted to @code{nil} when stored
830 internally, and @code{coding-system-property} will return @code{nil}.)
833 @node Coding System Properties
834 @subsection Coding System Properties
838 String to be displayed in the modeline when this coding system is
842 End-of-line conversion to be used. It should be one of the types
843 listed in @ref{EOL Conversion}.
845 @item post-read-conversion
846 Function called after a file has been read in, to perform the decoding.
847 Called with two arguments, @var{beg} and @var{end}, denoting a region of
848 the current buffer to be decoded.
850 @item pre-write-conversion
851 Function called before a file is written out, to perform the encoding.
852 Called with two arguments, @var{beg} and @var{end}, denoting a region of
853 the current buffer to be encoded.
856 The following additional properties are recognized if @var{type} is
864 The character set initially designated to the G0 - G3 registers.
865 The value should be one of
869 A charset object (designate that character set)
871 @code{nil} (do not ever use this register)
873 @code{t} (no character set is initially designated to the register, but
874 may be later on; this automatically sets the corresponding
875 @code{force-g*-on-output} property)
878 @item force-g0-on-output
879 @itemx force-g1-on-output
880 @itemx force-g2-on-output
881 @itemx force-g3-on-output
882 If non-@code{nil}, send an explicit designation sequence on output
883 before using the specified register.
886 If non-@code{nil}, use the short forms @samp{ESC $ @@}, @samp{ESC $ A},
887 and @samp{ESC $ B} on output in place of the full designation sequences
888 @samp{ESC $ ( @@}, @samp{ESC $ ( A}, and @samp{ESC $ ( B}.
891 If non-@code{nil}, don't designate ASCII to G0 at each end of line on
892 output. Setting this to non-@code{nil} also suppresses other
893 state-resetting that normally happens at the end of a line.
896 If non-@code{nil}, don't designate ASCII to G0 before control chars on
900 If non-@code{nil}, use 7-bit environment on output. Otherwise, use 8-bit
904 If non-@code{nil}, use locking-shift (SO/SI) instead of single-shift or
905 designation by escape sequence.
908 If non-@code{nil}, don't use ISO6429's direction specification.
911 If non-@code{nil}, literal control characters that are the same as the
912 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
913 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
914 and CSI (0x9B)) are ``quoted'' with an escape character so that they can
915 be properly distinguished from an escape sequence. (Note that doing
916 this results in a non-portable encoding.) This encoding flag is used for
917 byte-compiled files. Note that ESC is a good choice for a quoting
918 character because there are no escape sequences whose second byte is a
919 character from the Control-0 or Control-1 character sets; this is
920 explicitly disallowed by the ISO 2022 standard.
922 @item input-charset-conversion
923 A list of conversion specifications, specifying conversion of characters
924 in one charset to another when decoding is performed. Each
925 specification is a list of two elements: the source charset, and the
928 @item output-charset-conversion
929 A list of conversion specifications, specifying conversion of characters
930 in one charset to another when encoding is performed. The form of each
931 specification is the same as for @code{input-charset-conversion}.
934 The following additional properties are recognized (and required) if
935 @var{type} is @code{ccl}:
939 CCL program used for decoding (converting to internal format).
942 CCL program used for encoding (converting to external format).
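
The properties described above are passed to @code{make-coding-system}
as a property list. The following is a minimal sketch rather than a
definitive recipe: the name @code{my-iso-7bit} is hypothetical, the
property names @code{mnemonic}, @code{short}, and @code{seven} are
assumed to be the ones the descriptions above refer to, and
@code{get-charset} is assumed to return the charset object for
@code{ascii}.

@example
;; Hypothetical 7-bit ISO 2022 coding system that only ever uses G0.
(make-coding-system
 'my-iso-7bit 'iso2022
 "Hypothetical 7-bit ISO 2022 coding system that uses only G0."
 (list 'mnemonic "My7"                    ; modeline string (assumed name)
       'eol-type 'lf                      ; no end-of-line conversion
       'charset-g0 (get-charset 'ascii)   ; initial G0 designation
       'short t                           ; short designation forms
       'seven t))                         ; 7-bit environment on output
@end example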
945 @node Basic Coding System Functions
946 @subsection Basic Coding System Functions
948 @defun find-coding-system coding-system-or-name
949 This function retrieves the coding system of the given name.
951 If @var{coding-system-or-name} is a coding-system object, it is simply
952 returned. Otherwise, @var{coding-system-or-name} should be a symbol.
953 If there is no such coding system, @code{nil} is returned. Otherwise
954 the associated coding system object is returned.
957 @defun get-coding-system name
958 This function retrieves the coding system of the given name. Same as
959 @code{find-coding-system} except an error is signalled if there is no
960 such coding system instead of returning @code{nil}.
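
The difference between the two lookup functions can be sketched as
follows. The name @code{no-such-coding-system} is a hypothetical symbol
assumed not to name any coding system.

@example
;; `find-coding-system' reports a missing coding system by returning nil.
(find-coding-system 'no-such-coding-system)
     @result{} nil

;; `get-coding-system' signals an error instead, so trap it when the
;; name is not known to be valid.
(condition-case nil
    (get-coding-system 'no-such-coding-system)
  (error nil))
     @result{} nil
@end example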
963 @defun coding-system-list
964 This function returns a list of the names of all defined coding systems.
967 @defun coding-system-name coding-system
968 This function returns the name of the given coding system.
971 @defun make-coding-system name type &optional doc-string props
972 This function registers symbol @var{name} as a coding system.
974 @var{type} describes the conversion method used and should be one of
975 the types listed in @ref{Coding System Types}.
977 @var{doc-string} is a string describing the coding system.
979 @var{props} is a property list, describing the specific nature of the
980 coding system. Recognized properties are as in @ref{Coding System
984 @defun copy-coding-system old-coding-system new-name
985 This function copies @var{old-coding-system} to @var{new-name}. If
986 @var{new-name} does not name an existing coding system, a new one will
990 @defun subsidiary-coding-system coding-system eol-type
991 This function returns the subsidiary coding system of
992 @var{coding-system} with eol type @var{eol-type}.
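
As an illustration, the sketch below looks up the CRLF variant of a
coding system. It assumes that a coding system named
@code{iso-2022-jp} is defined and was created with automatic
end-of-line detection, so that subsidiaries were generated for it; in
that case the resulting name would be expected to follow the
@code{@var{name}-dos} pattern described in @ref{EOL Conversion}.

@example
(let ((base (find-coding-system 'iso-2022-jp)))   ; assumed to exist
  (if base
      ;; Name of the subsidiary that uses CRLF line endings.
      (coding-system-name (subsidiary-coding-system base 'crlf))))
@end example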
995 @node Coding System Property Functions
996 @subsection Coding System Property Functions
998 @defun coding-system-doc-string coding-system
999 This function returns the doc string for @var{coding-system}.
1002 @defun coding-system-type coding-system
1003 This function returns the type of @var{coding-system}.
1006 @defun coding-system-property coding-system prop
1007 This function returns the @var{prop} property of @var{coding-system}.
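
A short sketch of these accessors used together follows. The coding
system name @code{iso-2022-jp} is an assumption, and @code{mnemonic} is
assumed to be the name of the modeline-string property; the values
returned depend on how the coding system was defined.

@example
(let ((cs (find-coding-system 'iso-2022-jp)))   ; assumed to exist
  (if cs
      (list (coding-system-type cs)
            (coding-system-doc-string cs)
            (coding-system-property cs 'eol-type)
            (coding-system-property cs 'mnemonic))))
@end example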
1010 @node Encoding and Decoding Text
1011 @subsection Encoding and Decoding Text
1013 @defun decode-coding-region start end coding-system &optional buffer
1014 This function decodes the text between @var{start} and @var{end} which
1015 is encoded in @var{coding-system}. This is useful if you've read in
1016 encoded text from a file without decoding it (e.g. you read in a
1017 JIS-formatted file but used the @code{binary} or @code{no-conversion} coding
1018 system, so that it shows up as @samp{^[$B!<!+^[(B}). The length of the
1019 decoded text is returned. @var{buffer} defaults to the current buffer
1023 @defun encode-coding-region start end coding-system &optional buffer
1024 This function encodes the text between @var{start} and @var{end} using
1025 @var{coding-system}. This will, for example, convert Japanese
1026 characters into byte sequences such as @samp{^[$B!<!+^[(B} if you use the JIS
1027 encoding. The length of the encoded text is returned. @var{buffer}
1028 defaults to the current buffer if unspecified.
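
The two functions are inverses of each other, which the following
sketch illustrates by encoding the whole current buffer and then
decoding it again. The coding system name @code{iso-2022-jp} is an
assumption; any defined coding system can be substituted.

@example
(let ((cs (find-coding-system 'iso-2022-jp)))   ; assumed to exist
  (if cs
      (progn
        ;; The buffer text is replaced by its external representation.
        (encode-coding-region (point-min) (point-max) cs)
        ;; Decoding the same region converts it back to characters.
        (decode-coding-region (point-min) (point-max) cs))))
@end example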
1031 @node Detection of Textual Encoding
1032 @subsection Detection of Textual Encoding
1034 @defun coding-category-list
1035 This function returns a list of all recognized coding categories.
1038 @defun set-coding-priority-list list
1039 This function changes the priority order of the coding categories.
1040 @var{list} should be a list of coding categories, in descending order of
1041 priority. Unspecified coding categories will be lower in priority than
1042 all specified ones, in the same relative order they were in previously.
1045 @defun coding-priority-list
1046 This function returns a list of coding categories in descending order of
1050 @defun set-coding-category-system coding-category coding-system
1051 This function changes the coding system associated with a coding category.
1054 @defun coding-category-system coding-category
1055 This function returns the coding system associated with a coding category.
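
The functions above can be combined to inspect and adjust the
detection order. In this sketch the category name @code{shift-jis} is
an assumption, so it is checked against @code{coding-category-list}
before use.

@example
;; Pair each category, in priority order, with its coding system.
(mapcar (lambda (category)
          (cons category (coding-category-system category)))
        (coding-priority-list))

;; Move one category to the front.  The unspecified categories keep
;; their previous relative order.
(let ((category 'shift-jis))          ; assumed category name
  (if (memq category (coding-category-list))
      (set-coding-priority-list (list category))))
@end example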
1058 @defun detect-coding-region start end &optional buffer
1059 This function detects the coding system of the text in the region between
1060 @var{start} and @var{end}. The returned value is a list of possible coding
1061 systems ordered by priority. If only ASCII characters are found, it
1062 returns @code{autodetect} or one of its subsidiary coding systems
1063 according to the detected end-of-line type. The optional argument @var{buffer}
1064 defaults to the current buffer.
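
Because the result may be either a list or a single coding system, a
caller that wants only the most likely candidate has to handle both
cases. A minimal sketch, applied to the whole current buffer:

@example
(let ((guess (detect-coding-region (point-min) (point-max))))
  ;; A list is ordered by priority, so its first element is the best
  ;; guess; otherwise the value is already a single coding system.
  (if (consp guess)
      (car guess)
    guess))
@end example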
1067 @node Big5 and Shift-JIS Functions
1068 @subsection Big5 and Shift-JIS Functions
1070 These are special functions for working with the non-standard
1071 Shift-JIS and Big5 encodings.
1073 @defun decode-shift-jis-char code
1074 This function decodes a JISX0208 character encoded in Shift-JIS.
1075 @var{code} is the character code in Shift-JIS as a cons of two bytes.
1076 The corresponding character is returned.
1079 @defun encode-shift-jis-char ch
1080 This function encodes the JISX0208 character @var{ch} in the
1081 Shift-JIS encoding. The corresponding character code in Shift-JIS is
1082 returned as a cons of two bytes.
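
The two Shift-JIS functions should round-trip. The sketch below assumes
an XEmacs built with MULE support and uses the Shift-JIS code
@code{0x82A0} (Hiragana letter A) as sample input.

@example
(let* ((code '(#x82 . #xA0))               ; two Shift-JIS bytes
       (ch (decode-shift-jis-char code)))  ; the JISX0208 character
  ;; Encoding the character again should give back the same bytes.
  (equal (encode-shift-jis-char ch) code))
@end example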
1085 @defun decode-big5-char code
1086 This function decodes a character encoded in the Big5 encoding.
1087 @var{code} is the character code in BIG5. The corresponding character
1091 @defun encode-big5-char ch
1092 This function encodes the Big5 character @var{ch} in the Big5
1093 encoding. The corresponding character code in Big5 is returned.
1099 @defun execute-ccl-program ccl-program status
1100 This function executes @var{ccl-program} with registers initialized by
1101 @var{status}. @var{ccl-program} is a vector of compiled CCL code
1102 created by @code{ccl-compile}. @var{status} must be a vector of nine
1103 values, specifying the initial value for the R0, R1 .. R7 registers and
1104 for the instruction counter IC. A @code{nil} value for a register
1105 initializer causes the register to be set to 0. A @code{nil} value for
1106 the IC initializer causes execution to start at the beginning of the
1107 program. When the program is done, @var{status} is modified (by
1108 side-effect) to contain the ending values for the corresponding
1112 @defun execute-ccl-program-string ccl-program status str
1113 This function executes @var{ccl-program} with initial @var{status} on
1114 @var{str}. @var{ccl-program} is a vector of compiled CCL code
1115 created by @code{ccl-compile}. @var{status} must be a vector of nine
1116 values, specifying the initial value for the R0, R1 .. R7 registers and
1117 for the instruction counter IC. A @code{nil} value for a register
1118 initializer causes the register to be set to 0. A @code{nil} value for
1119 the IC initializer causes execution to start at the beginning of the
1120 program. When the program is done, @var{status} is modified (by
1121 side-effect) to contain the ending values for the corresponding
1122 registers and IC. Returns the resulting string.
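
The handling of @var{status} can be sketched as follows. The variable
@code{my-compiled-ccl} is hypothetical and is assumed to hold a vector
produced by @code{ccl-compile}; what the program does with its input
depends entirely on the CCL code.

@example
(let ((status (make-vector 9 nil)))   ; R0 .. R7 and IC, all defaulted
  (aset status 0 65)                  ; start with R0 = 65
  ;; Returns the converted string together with the final register
  ;; values, which are stored into STATUS by side effect.
  (list (execute-ccl-program-string my-compiled-ccl status "abc")
        status))
@end example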
1125 @defun ccl-reset-elapsed-time
1126 This function resets the internal value which holds the time elapsed by
1130 @defun ccl-elapsed-time
1131 This function returns the time consumed by the CCL interpreter as a cons of
1132 user and system time. This measures processor time, not real time.
1133 Both values are floating point numbers measured in seconds. If only one
1134 overall value can be determined, the return value will be a cons of that
1138 @node Category Tables
1139 @section Category Tables
1141 A category table is a type of char table used for keeping track of
1142 categories. Categories are used for classifying characters for use in
1143 regexps -- you can refer to a category rather than having to use a
1144 complicated [] expression (and category lookups are significantly
1147 There are 95 different categories available, one for each printable
1148 character (including space) in the ASCII charset. Each category is
1149 designated by one such character, called a @dfn{category designator}.
1150 They are specified in a regexp using the syntax @samp{\cX}, where X is a
1151 category designator. (This is not yet implemented.)
1153 A category table specifies, for each character, the categories that
1154 the character is in. Note that a character can be in more than one
1155 category. More specifically, a category table maps from a character to
1156 either the value @code{nil} (meaning the character is in no categories)
1157 or a 95-element bit vector, specifying for each of the 95 categories
1158 whether the character is in that category.
1160 Special Lisp functions are provided that abstract this, so you do not
1161 have to directly manipulate bit vectors.
1163 @defun category-table-p obj
1164 This function returns @code{t} if @var{obj} is a category table.
1167 @defun category-table &optional buffer
1168 This function returns the current category table. This is the one
1169 specified by the current buffer, or by @var{buffer} if it is
1173 @defun standard-category-table
1174 This function returns the standard category table. This is the one used
1178 @defun copy-category-table &optional table
1179 This function constructs a new category table and returns it. It is a
1180 copy of @var{table}, which defaults to the standard category table.
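
For instance, the following sketch copies the standard category table
and checks that the result is a distinct category table, so that
changes to the copy would not affect the standard table.

@example
(let ((table (copy-category-table)))   ; copy of the standard table
  (list (category-table-p table)
        (eq table (standard-category-table))))
     @result{} (t nil)
@end example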
1183 @defun set-category-table table &optional buffer
1184 This function selects @var{table}, which must be a category table, as
1185 the category table for @var{buffer}. @var{buffer} defaults to the current buffer
1189 @defun category-designator-p obj
1190 This function returns @code{t} if @var{obj} is a category designator (a
1191 char in the range @samp{' '} to @samp{'~'}).
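
A brief sketch of the predicate follows, together with the arithmetic
behind the figure of 95 categories: the printable ASCII characters run
from code 32 (space) to code 126 (@samp{~}).

@example
(category-designator-p ?a)     ; printable, so a valid designator
     @result{} t
(category-designator-p ?\n)    ; newline lies outside the range
     @result{} nil
(1+ (- 126 32))                ; number of available designators
     @result{} 95
@end example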
1194 @defun category-table-value-p obj
1195 This function returns @code{t} if @var{obj} is a category table value.
1196 Valid values are @code{nil} or a bit vector of size 95.
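
The rule can be tried out directly, as in the sketch below. It assumes
that @code{make-bit-vector} is available to build the sample bit
vectors.

@example
(category-table-value-p nil)                      ; no categories
(category-table-value-p (make-bit-vector 95 0))   ; one bit per category
(category-table-value-p (make-bit-vector 8 0))    ; wrong size
@end example

Going by the description above, the first two calls should return
@code{t} and the last one @code{nil}.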