1 This is Info file ../../info/lispref.info, produced by Makeinfo version
2 1.68 from the input file lispref.texi.
4 INFO-DIR-SECTION XEmacs Editor
6 * Lispref: (lispref). XEmacs Lisp Reference Manual.
11 GNU Emacs Lisp Reference Manual Second Edition (v2.01), May 1993 GNU
12 Emacs Lisp Reference Manual Further Revised (v2.02), August 1993 Lucid
13 Emacs Lisp Reference Manual (for 19.10) First Edition, March 1994
14 XEmacs Lisp Programmer's Manual (for 19.12) Second Edition, April 1995
15 GNU Emacs Lisp Reference Manual v2.4, June 1995 XEmacs Lisp
16 Programmer's Manual (for 19.13) Third Edition, July 1995 XEmacs Lisp
17 Reference Manual (for 19.14 and 20.0) v3.1, March 1996 XEmacs Lisp
18 Reference Manual (for 19.15 and 20.1, 20.2, 20.3) v3.2, April, May,
19 November 1997 XEmacs Lisp Reference Manual (for 21.0) v3.3, April 1998
21 Copyright (C) 1990, 1991, 1992, 1993, 1994, 1995 Free Software
22 Foundation, Inc. Copyright (C) 1994, 1995 Sun Microsystems, Inc.
23 Copyright (C) 1995, 1996 Ben Wing.
25 Permission is granted to make and distribute verbatim copies of this
26 manual provided the copyright notice and this permission notice are
27 preserved on all copies.
29 Permission is granted to copy and distribute modified versions of
30 this manual under the conditions for verbatim copying, provided that the
31 entire resulting derived work is distributed under the terms of a
32 permission notice identical to this one.
34 Permission is granted to copy and distribute translations of this
35 manual into another language, under the above conditions for modified
36 versions, except that this permission notice may be stated in a
37 translation approved by the Foundation.
39 Permission is granted to copy and distribute modified versions of
40 this manual under the conditions for verbatim copying, provided also
41 that the section entitled "GNU General Public License" is included
42 exactly as in the original, and provided that the entire resulting
43 derived work is distributed under the terms of a permission notice
44 identical to this one.
46 Permission is granted to copy and distribute translations of this
47 manual into another language, under the above conditions for modified
48 versions, except that the section entitled "GNU General Public License"
49 may be included in a translation approved by the Free Software
50 Foundation instead of in the original English.
53 File: lispref.info, Node: Level 3 Primitives, Next: Dynamic Messaging, Prev: Level 3 Basics, Up: I18N Level 3
58 - Function: gettext STRING
59 This function looks up STRING in the default message domain and
60 returns its translation. If `I18N3' was not enabled when XEmacs
61 was compiled, it just returns STRING.
63 - Function: dgettext DOMAIN STRING
64 This function looks up STRING in the specified message domain and
65 returns its translation. If `I18N3' was not enabled when XEmacs
66 was compiled, it just returns STRING.
 - Function: bind-text-domain DOMAIN PATHNAME
     This function associates a pathname with a message domain.  Here's
     how the path to the message file is constructed under SunOS 5.x:

          `{pathname}/{LANG}/LC_MESSAGES/{domain}.mo'

     If `I18N3' was not enabled when XEmacs was compiled, this function
     does nothing.
 - Special Form: domain STRING
     This special form specifies the text domain used for translating
     documentation strings and interactive prompts of a function.  For
     example, write

          (defun foo (arg) "Doc string" (domain "emacs-foo") ...)

     to specify `emacs-foo' as the text domain of the function `foo'.
     The "call" to `domain' is actually a declaration rather than a
     function call; when actually called, `domain' just returns `nil'.
88 - Function: domain-of FUNCTION
89 This function returns the text domain of FUNCTION; it returns
90 `nil' if it is the default domain. If `I18N3' was not enabled
91 when XEmacs was compiled, it always returns `nil'.
94 File: lispref.info, Node: Dynamic Messaging, Next: Domain Specification, Prev: Level 3 Primitives, Up: I18N Level 3
99 The `format' function has been extended to permit you to change the
100 order of parameter insertion. For example, the conversion format
101 `%1$s' inserts parameter one as a string, while `%2$s' inserts
102 parameter two. This is useful when creating translations which require
103 you to change the word order.
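
   As a minimal sketch of the reordering described above, the two calls
below fill the same pair of parameters into two templates that differ
only in word order.  The function name `my-report' and the templates are
hypothetical; in a real package each template would come from a
translated message catalog rather than being written inline.

```elisp
(defun my-report (template count file)
  "Fill TEMPLATE with COUNT (parameter one) and FILE (parameter two)."
  (format template count file))

;; English-style word order: count first, then file name.
(my-report "%1$s errors in %2$s" "3" "foo.el")   ; => "3 errors in foo.el"

;; A translation that needs the file name first only changes the template.
(my-report "%2$s: %1$s errors" "3" "foo.el")     ; => "foo.el: 3 errors"
```

   The calling code never changes; only the template string differs
between translations.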
106 File: lispref.info, Node: Domain Specification, Next: Documentation String Extraction, Prev: Dynamic Messaging, Up: I18N Level 3
111 The default message domain of XEmacs is `emacs'. For add-on
112 packages, it is best to use a different domain. For example, let us
113 say we want to convert the "gorilla" package to use the domain
114 `emacs-gorilla'. To translate the message "What gorilla?", use
115 `dgettext' as follows:
     (dgettext "emacs-gorilla" "What gorilla?")
119 A function (or macro) which has a documentation string or an
120 interactive prompt needs to be associated with the domain in order for
121 the documentation or prompt to be translated. This is done with the
122 `domain' special form as follows:
     (defun scratch (location)
       "Scratch the specified location."
       (domain "emacs-gorilla")
       (interactive "sScratch: ")
       ...)
130 It is most efficient to specify the domain in the first line of the
131 function body, before the `interactive' form.
133 For variables and constants which have documentation strings,
134 specify the domain after the documentation.
 - Special Form: defvar SYMBOL [VALUE [DOC-STRING [DOMAIN]]]
          (defvar weight 250 "Weight of gorilla, in pounds." "emacs-gorilla")

 - Special Form: defconst SYMBOL [VALUE [DOC-STRING [DOMAIN]]]
          (defconst limbs 4 "Number of limbs" "emacs-gorilla")
   Autoloaded functions which are specified in `loaddefs.el' do not need
to have a domain specification, because their documentation strings are
extracted into the main message base.  However, for autoloaded functions
which are specified in a separate package, use the following syntax:

 - Function: autoload SYMBOL FILENAME &optional DOCSTRING INTERACTIVE
          MACRO DOMAIN
          (autoload 'explore "jungle" "Explore the jungle." nil nil "emacs-gorilla")
155 File: lispref.info, Node: Documentation String Extraction, Prev: Domain Specification, Up: I18N Level 3
157 Documentation String Extraction
158 -------------------------------
160 The utility `etc/make-po' scans the file `DOC' to extract
161 documentation strings and creates a message file `doc.po'. This file
162 may then be inserted within `emacs.po'.
   Currently, `make-po' is hard-coded to read from `DOC' and write to
`doc.po'.  In order to extract documentation strings from an add-on
package, first run `make-docfile' on the package to produce the `DOC'
file.  Then run `make-po' with the `-p' argument to indicate that we
are extracting documentation for an add-on package.
170 (The `-p' argument is a kludge to make up for a subtle difference
171 between pre-loaded documentation and add-on documentation: For add-on
172 packages, the final carriage returns in the strings produced by
173 `make-docfile' must be ignored.)
176 File: lispref.info, Node: I18N Level 4, Prev: I18N Level 3, Up: Internationalization
181 The Asian-language support in XEmacs is called "MULE". *Note MULE::.
184 File: lispref.info, Node: MULE, Next: Tips, Prev: Internationalization, Up: Top
189 "MULE" is the name originally given to the version of GNU Emacs
190 extended for multi-lingual (and in particular Asian-language) support.
191 "MULE" is short for "MUlti-Lingual Emacs". It was originally called
192 Nemacs ("Nihon Emacs" where "Nihon" is the Japanese word for "Japan"),
193 when it only provided support for Japanese. XEmacs refers to its
194 multi-lingual support as "MULE support" since it is based on "MULE".
198 * Internationalization Terminology::
199 Definition of various internationalization terms.
200 * Charsets:: Sets of related characters.
201 * MULE Characters:: Working with characters in XEmacs/MULE.
202 * Composite Characters:: Making new characters by overstriking other ones.
203 * ISO 2022:: An international standard for charsets and encodings.
204 * Coding Systems:: Ways of representing a string of chars using integers.
205 * CCL:: A special language for writing fast converters.
206 * Category Tables:: Subdividing charsets into groups.
209 File: lispref.info, Node: Internationalization Terminology, Next: Charsets, Up: MULE
211 Internationalization Terminology
212 ================================
   In internationalization terminology, a string of text is divided up
into "characters", which are the printable units that make up the text.
A single character is (for example) a capital `A', the number `2', a
Katakana character, a Kanji ideograph (an "ideograph" is a "picture"
character, such as is used in Japanese Kanji, Chinese Hanzi, and Korean
Hanja; typically there are thousands of such ideographs in each
language), etc.  The basic property of a character is its shape.  Note
that the same character may be drawn by two different people (or in two
different fonts) in slightly different ways, although the basic shape
remains the same.
225 In some cases, the differences will be significant enough that it is
226 actually possible to identify two or more distinct shapes that both
227 represent the same character. For example, the lowercase letters `a'
228 and `g' each have two distinct possible shapes - the `a' can optionally
229 have a curved tail projecting off the top, and the `g' can be formed
230 either of two loops, or of one loop and a tail hanging off the bottom.
231 Such distinct possible shapes of a character are called "glyphs". The
232 important characteristic of two glyphs making up the same character is
233 that the choice between one or the other is purely stylistic and has no
234 linguistic effect on a word (this is the reason why a capital `A' and
235 lowercase `a' are different characters rather than different glyphs -
236 e.g. `Aspen' is a city while `aspen' is a kind of tree).
   Note that "character" and "glyph" are used differently here than
elsewhere in XEmacs.
   A "character set" is simply a set of related characters.  ASCII, for
example, is a set of 94 characters (or 128, if you count non-printing
characters).  Other character sets are ISO8859-1 (ASCII plus various
accented characters and other international symbols), JISX0201 (ASCII,
more or less, plus half-width Katakana), JISX0208 (Japanese Kanji),
JISX0212 (a second set of less-used Japanese Kanji), GB2312 (Mainland
Chinese Hanzi), etc.
   Every character set has one or more "orderings", which can be viewed
as a way of assigning a number (or set of numbers) to each character in
the set.  For most character sets, there is a standard ordering, and in
fact all of the character sets mentioned above define a particular
ordering.  ASCII, for example, places letters in their "natural" order,
puts uppercase letters before lowercase letters, numbers before
letters, etc.  Note that for many of the Asian character sets, there is
no natural ordering of the characters.  The actual orderings are based
on one or more salient characteristics, of which there are many to
choose from - e.g. number of strokes, common radicals, phonetic
ordering, etc.
   The numbers assigned to any particular character are called the
character's "position codes".  The number of position codes required to
index a particular character in a character set is called the
"dimension" of the character set.  ASCII, being a relatively small
character set, is of dimension one, and each character in the set is
indexed using a single position code, in the range 0 through 127 (if
non-printing characters are included) or 33 through 126 (if only the
printing characters are considered).  JISX0208, i.e. Japanese Kanji,
has thousands of characters, and is of dimension two - every character
is indexed by two position codes, each in the range 33 through 126.
(Note that the choice of the range here is somewhat arbitrary.
Although a character set such as JISX0208 defines an *ordering* of all
its characters, it does not define the actual mapping between numbers
and characters.  You could just as easily index the characters in
JISX0208 using numbers in the range 0 through 93, 1 through 94, 2
through 95, etc.  The reason for the actual range chosen is so that the
position codes match up with the actual values used in the common
encodings.)
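
   To make the arbitrariness of the range concrete, here is a small
sketch that converts between two of the numberings mentioned above: an
index in the range 1 through 94 and a position code in the range 33
through 126.  The function names are hypothetical and are not part of
XEmacs; the sketch uses only arithmetic and assumes nothing beyond the
fixed offset of 32 between the two ranges.

```elisp
(defun my-index-to-position-code (index)
  "Convert INDEX (1 through 94) to a position code (33 through 126)."
  (+ index 32))

(defun my-position-codes (row cell)
  "Return the two position codes for the character at ROW and CELL.
ROW and CELL are indices in the range 1 through 94, as might be used
to number the characters of a dimension-two character set."
  (list (my-index-to-position-code row)
        (my-index-to-position-code cell)))

(my-position-codes 1 1)     ; => (33 33)
(my-position-codes 94 94)   ; => (126 126)
```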
   An "encoding" is a way of numerically representing characters from
one or more character sets as a stream of like-sized numerical values
called "words"; typically these are 8-bit, 16-bit, or 32-bit
quantities.  If an encoding encompasses only one character set, then the
284 position codes for the characters in that character set could be used
285 directly. (This is the case with ASCII, and as a result, most people do
286 not understand the difference between a character set and an encoding.)
287 This is not possible, however, if more than one character set is to be
288 used in the encoding. For example, printed Japanese text typically
289 requires characters from multiple character sets - ASCII, JISX0208, and
290 JISX0212, to be specific. Each of these is indexed using one or more
291 position codes in the range 33 through 126, so the position codes could
292 not be used directly or there would be no way to tell which character
293 was meant. Different Japanese encodings handle this differently - JIS
294 uses special escape characters to denote different character sets; EUC
295 sets the high bit of the position codes for JISX0208 and JISX0212, and
296 puts a special extra byte before each JISX0212 character; etc. (JIS,
297 EUC, and most of the other encodings you will encounter are 7-bit or
298 8-bit encodings. There is one common 16-bit encoding, which is Unicode;
299 this strives to represent all the world's characters in a single large
300 character set. 32-bit encodings are generally used internally in
301 programs to simplify the code that manipulates them; however, they are
302 not much used externally because they are not very space-efficient.)
304 Encodings are classified as either "modal" or "non-modal". In a
305 "modal encoding", there are multiple states that the encoding can be in,
306 and the interpretation of the values in the stream depends on the
307 current global state of the encoding. Special values in the encoding,
308 called "escape sequences", are used to change the global state. JIS,
309 for example, is a modal encoding. The bytes `ESC $ B' indicate that,
310 from then on, bytes are to be interpreted as position codes for
311 JISX0208, rather than as ASCII. This effect is cancelled using the
312 bytes `ESC ( B', which mean "switch from whatever the current state is
to ASCII".  To switch to JISX0212, the escape sequence `ESC $ ( D' is used.
314 (Note that here, as is common, the escape sequences do in fact begin
315 with `ESC'. This is not necessarily the case, however.)
317 A "non-modal encoding" has no global state that extends past the
318 character currently being interpreted. EUC, for example, is a
319 non-modal encoding. Characters in JISX0208 are encoded by setting the
320 high bit of the position codes, and characters in JISX0212 are encoded
321 by doing the same but also prefixing the character with the byte 0x8F.
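
   The EUC rule just described can be sketched in a few lines.  This is
an illustration of the rule as stated in the text, not the converter
used by XEmacs; the function names and the symbols `jisx0208' and
`jisx0212' are local to the sketch.

```elisp
(defun my-set-high-bit (code)
  "Return position CODE (33 through 126) with the high bit set."
  (logior code 128))                    ; 128 = 0x80

(defun my-euc-bytes (charset code1 code2)
  "Return the list of EUC byte values for one two-dimensional character.
CHARSET is the symbol `jisx0208' or `jisx0212'; CODE1 and CODE2 are
the character's two position codes."
  (let ((bytes (list (my-set-high-bit code1)
                     (my-set-high-bit code2))))
    (if (eq charset 'jisx0212)
        (cons 143 bytes)                ; 143 = 0x8F, the JISX0212 prefix
      bytes)))

(my-euc-bytes 'jisx0208 70 124)   ; => (198 252), i.e. 0xC6 0xFC
(my-euc-bytes 'jisx0212 70 124)   ; => (143 198 252)
```

   Because every byte of a non-ASCII character has its high bit set, a
reader can classify any byte without knowing what came before it.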
323 The advantage of a modal encoding is that it is generally more
324 space-efficient, and is easily extendable because there are essentially
325 an arbitrary number of escape sequences that can be created. The
326 disadvantage, however, is that it is much more difficult to work with
327 if it is not being processed in a sequential manner. In the non-modal
328 EUC encoding, for example, the byte 0x41 always refers to the letter
329 `A'; whereas in JIS, it could either be the letter `A', or one of the
330 two position codes in a JISX0208 character, or one of the two position
331 codes in a JISX0212 character. Determining exactly which one is meant
332 could be difficult and time-consuming if the previous bytes in the
333 string have not already been processed.
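
   The following sketch shows why a modal encoding has to be read from
the start.  It scans a list of byte values and reports which character
set is in effect at a given index, recognizing only the two escape
sequences `ESC $ B' and `ESC ( B' mentioned above.  The function name is
hypothetical, and a real JIS decoder handles many more sequences.

```elisp
(defun my-jis-state-at (bytes index)
  "Return the character set in effect for the byte at INDEX in BYTES.
BYTES is a list of byte values.  The value is the symbol `ascii' or
`jisx0208'.  Only ESC $ B and ESC ( B are recognized."
  (let ((state 'ascii)
        (pos 0)
        (rest bytes))
    ;; Every byte before INDEX must be examined, because any of them
    ;; may start an escape sequence that changes the state.
    (while (and rest (< pos index))
      (if (and (eq (car rest) 27)               ; ESC
               (eq (nth 2 rest) 66))            ; B
          (cond ((eq (nth 1 rest) 36)           ; $ : ESC $ B
                 (setq state 'jisx0208))
                ((eq (nth 1 rest) 40)           ; ( : ESC ( B
                 (setq state 'ascii))))
      (setq rest (cdr rest)
            pos (1+ pos)))
    state))

;; "A", ESC $ B, two bytes 0x41 0x41, ESC ( B, "A"
(setq my-sample '(65 27 36 66 65 65 27 40 66 65))

(my-jis-state-at my-sample 0)   ; => ascii     (65 is the letter A)
(my-jis-state-at my-sample 4)   ; => jisx0208  (65 is half of a character)
(my-jis-state-at my-sample 9)   ; => ascii
```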
335 Non-modal encodings are further divided into "fixed-width" and
336 "variable-width" formats. A fixed-width encoding always uses the same
337 number of words per character, whereas a variable-width encoding does
338 not. EUC is a good example of a variable-width encoding: one to three
339 bytes are used per character, depending on the character set. 16-bit
340 and 32-bit encodings are nearly always fixed-width, and this is in fact
341 one of the main reasons for using an encoding with a larger word size.
342 The advantages of fixed-width encodings should be obvious. The
343 advantages of variable-width encodings are that they are generally more
344 space-efficient and allow for compatibility with existing 8-bit
345 encodings such as ASCII.
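
   As a sketch of what variable width means in practice, the code below
counts the characters in a list of EUC byte values.  It covers only the
three cases described in this section (ASCII, JISX0208, and JISX0212
with its 0x8F prefix) and treats every other high-bit byte as the start
of a two-byte character; that simplification, and the function names,
are assumptions of the sketch.

```elisp
(defun my-euc-char-length (first-byte)
  "Return the number of bytes in the EUC character starting with FIRST-BYTE."
  (cond ((< first-byte 128) 1)      ; high bit clear: ASCII
        ((= first-byte 143) 3)      ; 0x8F prefix: JISX0212
        (t 2)))                     ; high bit set: treated as JISX0208

(defun my-euc-count-chars (bytes)
  "Count the characters in BYTES, a list of EUC byte values."
  (let ((count 0))
    (while bytes
      (setq bytes (nthcdr (my-euc-char-length (car bytes)) bytes)
            count (1+ count)))
    count))

;; One ASCII byte, one two-byte character, one three-byte character.
(my-euc-count-chars '(65 198 252 143 198 252))   ; => 3
```

   With a fixed-width encoding the same count is simply the number of
words, with no need to inspect them.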
347 Note that the bytes in an 8-bit encoding are often referred to as
348 "octets" rather than simply as bytes. This terminology dates back to
349 the days before 8-bit bytes were universal, when some computers had
350 9-bit bytes, others had 10-bit bytes, etc.
353 File: lispref.info, Node: Charsets, Next: MULE Characters, Prev: Internationalization Terminology, Up: MULE
358 A "charset" in MULE is an object that encapsulates a particular
359 character set as well as an ordering of those characters. Charsets are
360 permanent objects and are named using symbols, like faces.
362 - Function: charsetp OBJECT
363 This function returns non-`nil' if OBJECT is a charset.
367 * Charset Properties:: Properties of a charset.
368 * Basic Charset Functions:: Functions for working with charsets.
369 * Charset Property Functions:: Functions for accessing charset properties.
370 * Predefined Charsets:: Predefined charset objects.
373 File: lispref.info, Node: Charset Properties, Next: Basic Charset Functions, Up: Charsets
   Charsets have the following properties:

`name'
     A symbol naming the charset.  Every charset must have a different
     name; this allows a charset to be referred to using its name
     rather than the actual charset object.

`doc-string'
     A documentation string describing the charset.

`registry'
     A regular expression matching the font registry field for this
     character set.  For example, both the `ascii' and `latin-iso8859-1'
     charsets use the registry `"ISO8859-1"'.  This field is used to
     choose an appropriate font when the user gives a general font
     specification such as `-*-courier-medium-r-*-140-*', i.e. a
     14-point upright medium-weight Courier font.

`dimension'
     Number of position codes used to index a character in the
     character set.  XEmacs/MULE can only handle character sets of
     dimension 1 or 2.  This property defaults to 1.

`chars'
     Number of characters in each dimension.  In XEmacs/MULE, the only
     allowed values are 94 or 96.  (There are a couple of pre-defined
     character sets, such as ASCII, that do not follow this, but you
     cannot define new ones like this.)  Defaults to 94.  Note that if
     the dimension is 2, the character set thus described is 94x94 or
     96x96.

`columns'
     Number of columns used to display a character in this charset.
     Only used in TTY mode.  (Under X, the actual width of a character
     can be derived from the font used to display the characters.)  If
     unspecified, defaults to the dimension.  (This is almost always the
     correct value, because character sets with dimension 2 are usually
     ideograph character sets, which need two columns to display the
     intricate ideographs.)

`direction'
     A symbol, either `l2r' (left-to-right) or `r2l' (right-to-left).
     Defaults to `l2r'.  This specifies the direction that the text
     should be displayed in, and will be left-to-right for most
     charsets but right-to-left for Hebrew and Arabic.  (Right-to-left
     display is not currently implemented.)

`final'
     Final byte of the standard ISO 2022 escape sequence designating
     this charset.  Must be supplied.  Each combination of (DIMENSION,
     CHARS) defines a separate namespace for final bytes, and each
     charset within a particular namespace must have a different final
     byte.  Note that ISO 2022 restricts the final byte to the range
     0x30 - 0x7E if dimension == 1, and 0x30 - 0x5F if dimension == 2.
     Note also that final bytes in the range 0x30 - 0x3F are reserved
     for user-defined (not official) character sets.  For more
     information on ISO 2022, see *Note Coding Systems::.

`graphic'
     0 (use left half of font on output) or 1 (use right half of font on
     output).  Defaults to 0.  This specifies how to convert the
     position codes that index a character in a character set into an
     index into the font used to display the character set.  With
     `graphic' set to 0, position codes 33 through 126 map to font
     indices 33 through 126; with it set to 1, position codes 33
     through 126 map to font indices 161 through 254 (i.e. the same
     number but with the high bit set).  For example, for a font whose
     registry is ISO8859-1, the left half of the font (octets 0x20 -
     0x7F) is the `ascii' charset, while the right half (octets 0xA0 -
     0xFF) is the `latin-iso8859-1' charset.

`ccl-program'
     A compiled CCL program used to convert a character in this charset
     into an index into the font.  This is in addition to the `graphic'
     property.  If a CCL program is defined, the position codes of a
     character will first be processed according to `graphic' and then
     passed through the CCL program, with the resulting values used to
     index the font.

     This is used, for example, in the Big5 character set (used in
     Taiwan).  This character set is not ISO-2022-compliant, and its
     size (94x157) does not fit within the maximum 96x96 size of
     ISO-2022-compliant character sets.  As a result, XEmacs/MULE
     splits it (in a rather complex fashion, so as to group the most
     commonly used characters together) into two charset objects
     (`big5-1' and `big5-2'), each of size 94x94, and each charset
     object uses a CCL program to convert the modified position codes
     back into standard Big5 indices to retrieve a character from a
     Big5 font.
468 Most of the above properties can only be changed when the charset is
469 created. *Note Charset Property Functions::.
472 File: lispref.info, Node: Basic Charset Functions, Next: Charset Property Functions, Prev: Charset Properties, Up: Charsets
474 Basic Charset Functions
475 -----------------------
 - Function: find-charset CHARSET-OR-NAME
     This function retrieves the charset of the given name.  If
     CHARSET-OR-NAME is a charset object, it is simply returned.
     Otherwise, CHARSET-OR-NAME should be a symbol.  If there is no
     such charset, `nil' is returned.  Otherwise the associated charset
     object is returned.
484 - Function: get-charset NAME
485 This function retrieves the charset of the given name. Same as
486 `find-charset' except an error is signalled if there is no such
487 charset instead of returning `nil'.
489 - Function: charset-list
490 This function returns a list of the names of all defined charsets.
 - Function: make-charset NAME DOC-STRING PROPS
     This function defines a new character set.  This function is for
     use with Mule support.  NAME is a symbol, the name by which the
     character set is normally referred to.  DOC-STRING is a string
     describing the character set.  PROPS is a property list
     describing the specific nature of the character set.  The
     recognized properties are `registry', `dimension', `columns',
     `chars', `final', `graphic', `direction', and `ccl-program', as
     previously described.
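
   A call to `make-charset' might look like the sketch below.  It
requires XEmacs built with MULE support.  The charset name, doc string,
and registry are hypothetical, and the exact form expected for each
property value (in particular, writing `final' as a character) is an
assumption based on the property descriptions, not something this
section states.  The final byte `9' (0x39) is chosen from the range
reserved for user-defined character sets; it must not already be in use
by another one-dimensional 94-character charset.

```elisp
;; Requires XEmacs with MULE; all names below are hypothetical.
(make-charset
 'my-symbols
 "Hypothetical 94-character symbol set."
 '(registry "MySymbols-0"      ; regexp matched against font registries
   dimension 1                 ; one position code per character
   chars 94                    ; 94 characters in that dimension
   columns 1                   ; one display column in TTY mode
   direction l2r               ; displayed left-to-right
   final ?9                    ; 0x39, in the user-defined range
   graphic 0))                 ; use the left half of the font
```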
502 - Function: make-reverse-direction-charset CHARSET NEW-NAME
503 This function makes a charset equivalent to CHARSET but which goes
504 in the opposite direction. NEW-NAME is the name of the new
505 charset. The new charset is returned.
 - Function: charset-from-attributes DIMENSION CHARS FINAL &optional
          DIRECTION
     This function returns a charset with the given DIMENSION, CHARS,
     FINAL, and DIRECTION.  If DIRECTION is omitted, both directions
     will be checked (left-to-right will be returned if character sets
     exist for both directions).
514 - Function: charset-reverse-direction-charset CHARSET
515 This function returns the charset (if any) with the same dimension,
516 number of characters, and final byte as CHARSET, but which is
517 displayed in the opposite direction.
520 File: lispref.info, Node: Charset Property Functions, Next: Predefined Charsets, Prev: Basic Charset Functions, Up: Charsets
522 Charset Property Functions
523 --------------------------
   All of these functions accept either a charset name or charset
object.

 - Function: charset-property CHARSET PROP
     This function returns property PROP of CHARSET.  *Note Charset
     Properties::.
532 Convenience functions are also provided for retrieving individual
533 properties of a charset.
535 - Function: charset-name CHARSET
536 This function returns the name of CHARSET. This will be a symbol.
538 - Function: charset-doc-string CHARSET
539 This function returns the doc string of CHARSET.
541 - Function: charset-registry CHARSET
542 This function returns the registry of CHARSET.
544 - Function: charset-dimension CHARSET
545 This function returns the dimension of CHARSET.
 - Function: charset-chars CHARSET
     This function returns the number of characters per dimension of
     CHARSET.
551 - Function: charset-columns CHARSET
552 This function returns the number of display columns per character
553 (in TTY mode) of CHARSET.
 - Function: charset-direction CHARSET
     This function returns the display direction of CHARSET - either
     `l2r' or `r2l'.
559 - Function: charset-final CHARSET
560 This function returns the final byte of the ISO 2022 escape
561 sequence designating CHARSET.
563 - Function: charset-graphic CHARSET
564 This function returns either 0 or 1, depending on whether the
565 position codes of characters in CHARSET map to the left or right
566 half of their font, respectively.
568 - Function: charset-ccl-program CHARSET
569 This function returns the CCL program, if any, for converting
570 position codes of characters in CHARSET into font indices.
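
   The accessor functions above can be combined into a small summary
helper, sketched below under the assumption of an XEmacs built with
MULE support.  The helper name `my-describe-charset' is hypothetical;
every function it calls is one of those documented in this node.

```elisp
;; Requires XEmacs with MULE.
(defun my-describe-charset (charset)
  "Return a property list summarizing CHARSET.
CHARSET may be a charset name or a charset object."
  (list 'name      (charset-name charset)
        'dimension (charset-dimension charset)
        'chars     (charset-chars charset)
        'columns   (charset-columns charset)
        'direction (charset-direction charset)
        'final     (charset-final charset)
        'graphic   (charset-graphic charset)))

;; Summarize one of the predefined charsets.
(my-describe-charset 'japanese-jisx0208)
```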
572 The only property of a charset that can currently be set after the
573 charset has been created is the CCL program.
 - Function: set-charset-ccl-program CHARSET CCL-PROGRAM
     This function sets the `ccl-program' property of CHARSET to
     CCL-PROGRAM.
580 File: lispref.info, Node: Predefined Charsets, Prev: Charset Property Functions, Up: Charsets
585 The following charsets are predefined in the C code.
     Name                    Type  Fi Gr Dir Registry
     --------------------------------------------------------------
     ascii                   94    B  0  l2r ISO8859-1
     control-1               94       0  l2r ---
     latin-iso8859-1         96    A  1  l2r ISO8859-1
     latin-iso8859-2         96    B  1  l2r ISO8859-2
     latin-iso8859-3         96    C  1  l2r ISO8859-3
     latin-iso8859-4         96    D  1  l2r ISO8859-4
     cyrillic-iso8859-5      96    L  1  l2r ISO8859-5
     arabic-iso8859-6        96    G  1  r2l ISO8859-6
     greek-iso8859-7         96    F  1  l2r ISO8859-7
     hebrew-iso8859-8        96    H  1  r2l ISO8859-8
     latin-iso8859-9         96    M  1  l2r ISO8859-9
     thai-tis620             96    T  1  l2r TIS620
     katakana-jisx0201       94    I  1  l2r JISX0201.1976
     latin-jisx0201          94    J  0  l2r JISX0201.1976
     japanese-jisx0208-1978  94x94 @  0  l2r JISX0208.1978
     japanese-jisx0208       94x94 B  0  l2r JISX0208.19(83|90)
     japanese-jisx0212       94x94 D  0  l2r JISX0212
     chinese-gb2312          94x94 A  0  l2r GB2312
     chinese-cns11643-1      94x94 G  0  l2r CNS11643.1
     chinese-cns11643-2      94x94 H  0  l2r CNS11643.2
     chinese-big5-1          94x94 0  0  l2r Big5
     chinese-big5-2          94x94 1  0  l2r Big5
     korean-ksc5601          94x94 C  0  l2r KSC5601
     composite               96x96    0  l2r ---
614 The following charsets are predefined in the Lisp code.
     Name                    Type  Fi Gr Dir Registry
     --------------------------------------------------------------
     arabic-digit            94    2  0  l2r MuleArabic-0
     arabic-1-column         94    3  0  r2l MuleArabic-1
     arabic-2-column         94    4  0  r2l MuleArabic-2
     sisheng                 94    0  0  l2r sisheng_cwnn\|OMRON_UDC_ZH
     chinese-cns11643-3      94x94 I  0  l2r CNS11643.1
     chinese-cns11643-4      94x94 J  0  l2r CNS11643.1
     chinese-cns11643-5      94x94 K  0  l2r CNS11643.1
     chinese-cns11643-6      94x94 L  0  l2r CNS11643.1
     chinese-cns11643-7      94x94 M  0  l2r CNS11643.1
     ethiopic                94x94 2  0  l2r Ethio
     ascii-r2l               94    B  0  r2l ISO8859-1
     ipa                     96    0  1  l2r MuleIPA
     vietnamese-lower        96    1  1  l2r VISCII1.1
     vietnamese-upper        96    2  1  l2r VISCII1.1
   For all of the above charsets, the dimension and number of columns
are the same.
636 Note that ASCII, Control-1, and Composite are handled specially.
637 This is why some of the fields are blank; and some of the filled-in
638 fields (e.g. the type) are not really accurate.
641 File: lispref.info, Node: MULE Characters, Next: Composite Characters, Prev: Charsets, Up: MULE
 - Function: make-char CHARSET ARG1 &optional ARG2
     This function makes a multi-byte character from CHARSET and octets
     ARG1 and ARG2.
650 - Function: char-charset CH
651 This function returns the character set of char CH.
653 - Function: char-octet CH &optional N
654 This function returns the octet (i.e. position code) numbered N
655 (should be 0 or 1) of char CH. N defaults to 0 if omitted.
657 - Function: find-charset-region START END &optional BUFFER
658 This function returns a list of the charsets in the region between
659 START and END. BUFFER defaults to the current buffer if omitted.
661 - Function: find-charset-string STRING
662 This function returns a list of the charsets in STRING.
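
   The sketch below builds a character with `make-char' and then takes
it apart again with `char-charset' and `char-octet'.  It requires
XEmacs built with MULE support, and it assumes that ARG1 and ARG2 are
given as position codes in the range 33 through 126; this section does
not say so explicitly.  The position codes 70 and 124 are arbitrary.

```elisp
;; Requires XEmacs with MULE.
(let* ((ch (make-char 'japanese-jisx0208 70 124))
       (charset (char-charset ch)))
  (list (charset-name charset)   ; name of the charset CH belongs to
        (char-octet ch 0)        ; first position code
        (char-octet ch 1)))      ; second position code
```

   `charset-name' is applied to the result of `char-charset' so that
the sketch yields a symbol whether a charset name or a charset object
is returned.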
665 File: lispref.info, Node: Composite Characters, Next: ISO 2022, Prev: MULE Characters, Up: MULE
670 Composite characters are not yet completely implemented.
 - Function: make-composite-char STRING
     This function converts a string into a single composite character.
     The character is the result of overstriking all the characters in
     the string.

 - Function: composite-char-string CH
     This function returns a string of the characters comprising a
     composite character.
681 - Function: compose-region START END &optional BUFFER
682 This function composes the characters in the region from START to
683 END in BUFFER into one composite character. The composite
684 character replaces the composed characters. BUFFER defaults to
685 the current buffer if omitted.
687 - Function: decompose-region START END &optional BUFFER
688 This function decomposes any composite characters in the region
689 from START to END in BUFFER. This converts each composite
690 character into one or more characters, the individual characters
     out of which the composite character was formed.  Non-composite
     characters are left as-is.  BUFFER defaults to the current buffer
     if omitted.
696 File: lispref.info, Node: ISO 2022, Next: Coding Systems, Prev: Composite Characters, Up: MULE
   This section briefly describes the ISO 2022 encoding standard.  For
a more thorough understanding, please refer to the original document of
ISO 2022.
   Character sets ("charsets") are classified into the following four
categories, according to the number of characters in the charset:
94-charset, 96-charset, 94x94-charset, and 96x96-charset.

94-charset
     ASCII(B), left(J) and right(I) half of JISX0201, ...

96-charset
     Latin-1(A), Latin-2(B), Latin-3(C), ...

94x94-charset
     GB2312(A), JISX0208(B), KSC5601(C), ...
   The character in parentheses after the name of each charset is the
"final character" F, which can be regarded as the identifier of the
charset.  ECMA allocates F to each charset.  F is in the range of
0x30..0x7E, but 0x30..0x3F are only for private use.
726 Note: "ECMA" = European Computer Manufacturers Association
   There are four "registers of charsets", called G0 through G3.  You
can designate (or assign) any charset to one of these registers.

   The code space contained within one octet (of size 256) is divided
into 4 areas: C0, GL, C1, and GR.  GL and GR are the areas into which a
register of charset can be invoked.

   Usually, in the initial state, G0 is invoked into GL, and G1 is
invoked into GR.
743 ISO 2022 distinguishes 7-bit environments and 8-bit environments. In
744 7-bit environments, only C0 and GL are used.
746 Charset designation is done by escape sequences of the form:
750 where I is an intermediate character in the range 0x20 - 0x2F, and F
751 is the final character identifying this charset.
753 The meanings of the intermediate characters are:
755 $ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
756 ( [0x28]: designate to G0 a 94-charset whose final byte is F.
757 ) [0x29]: designate to G1 a 94-charset whose final byte is F.
758 * [0x2A]: designate to G2 a 94-charset whose final byte is F.
759 + [0x2B]: designate to G3 a 94-charset whose final byte is F.
760 - [0x2D]: designate to G1 a 96-charset whose final byte is F.
761 . [0x2E]: designate to G2 a 96-charset whose final byte is F.
762 / [0x2F]: designate to G3 a 96-charset whose final byte is F.
764 The following rule is not allowed in ISO 2022 but can be used in Mule.
767 , [0x2C]: designate to G0 a 96-charset whose final byte is F.
769 Here are examples of designations:
771 ESC ( B : designate to G0 ASCII
772 ESC - A : designate to G1 Latin-1
773 ESC $ ( A or ESC $ A : designate to G0 GB2312
774 ESC $ ( B or ESC $ B : designate to G0 JISX0208
775 ESC $ ) C : designate to G1 KSC5601
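The designation rules above can be sketched as a small parser. This is a minimal Python illustration, not the code Mule uses: the names `INTERMEDIATES' and `parse_designation' are hypothetical, the Mule-only `,' [0x2C] extension is left out, and the short form (`ESC $ F') is assumed to designate a 94x94 charset to G0, as in the examples.

```python
ESC = 0x1B

# Intermediate byte -> (register, characters per dimension), per the table above.
INTERMEDIATES = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}

def parse_designation(seq: bytes):
    """Return (register, size, dimension, final) for one designation sequence."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation sequence")
    body = seq[1:]
    dimension = 1
    if body[0] == 0x24:                 # '$' marks a dimension-2 charset
        dimension = 2
        body = body[1:]
    if len(body) == 2 and body[0] in INTERMEDIATES:
        register, size = INTERMEDIATES[body[0]]
        final = body[1]
    elif len(body) == 1 and dimension == 2:
        # Short form such as ESC $ A: no register byte, G0 is implied.
        register, size = "G0", 94
        final = body[0]
    else:
        raise ValueError("unrecognized designation sequence")
    if final < 0x30:
        raise ValueError("final character out of range")
    return register, size, dimension, chr(final)
```

With these definitions, `parse_designation(b"\x1b$)C")' reports register G1, size 94, dimension 2, and final character C, which matches the KSC5601 example.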
777 To use a charset designated to G2 or G3, and to use a charset
778 designated to G1 in a 7-bit environment, you must explicitly invoke G1,
779 G2, or G3 into GL. There are two types of invocation, Locking Shift
780 (in effect until another shift) and Single Shift (one character only).
782 Locking Shift is done as follows:
784 LS0 or SI (0x0F): invoke G0 into GL
785 LS1 or SO (0x0E): invoke G1 into GL
786 LS2: invoke G2 into GL
787 LS3: invoke G3 into GL
788 LS1R: invoke G1 into GR
789 LS2R: invoke G2 into GR
790 LS3R: invoke G3 into GR
792 Single Shift is done as follows:
794 SS2 or ESC N: invoke G2 into GL
795 SS3 or ESC O: invoke G3 into GL
797 (#### Ben says: I think the above is slightly incorrect. It appears
798 that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N
799 and ESC O behave as indicated. The above definitions will not parse
800 EUC-encoded text correctly, and it looks like the code in mule-coding.c
801 has similar problems.)
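The difference between the two kinds of shift can be sketched as a small state machine. The Python below is an illustration under simplifying assumptions, not the decoder in mule-coding.c: it handles a 7-bit stream only, treats every other byte as a one-byte character, and recognizes only the SI, SO, `ESC N', and `ESC O' forms listed above. The function name `tag_registers' is hypothetical.

```python
SI, SO, ESC = 0x0F, 0x0E, 0x1B

def tag_registers(data: bytes):
    """Return a list of (register, byte) pairs for a 7-bit stream."""
    gl = "G0"        # initial state: G0 is invoked into GL
    single = None    # register that applies to the next character only
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b == SO:                       # locking shift: G1 into GL
            gl = "G1"
        elif b == SI:                     # locking shift: G0 into GL
            gl = "G0"
        elif b == ESC and i + 1 < len(data) and data[i + 1] in (0x4E, 0x4F):
            # ESC N / ESC O: single shift to G2 / G3
            single = "G2" if data[i + 1] == 0x4E else "G3"
            i += 1
        else:
            out.append((single or gl, b))
            single = None                 # a single shift lasts one character
        i += 1
    return out
```

A locking shift stays in effect until the next SI or SO, whereas the register selected by `ESC N' applies to exactly one character before GL reverts to its previous register.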
803 As you can see, there are many ISO-2022-compliant ways of encoding
804 multilingual text. Many coding systems in actual use, such as X11's
805 Compound Text, the Japanese JUNET code, and the so-called EUC
806 (Extended UNIX Code), are variants of ISO 2022.
808 In Mule, we characterize ISO 2022 by the following attributes:
810 1. Initial designation to G0 thru G3.
812 2. Allow designation of short form for Japanese and Chinese.
814 3. Should we designate ASCII to G0 before control characters?
816 4. Should we designate ASCII to G0 at the end of line?
818 5. 7-bit environment or 8-bit environment.
820 6. Use Locking Shift or not.
822 7. Use ASCII or JISX0201-1976-Roman.
824 8. Use JISX0208-1983 or JISX0208-1976.
826 (The last two are only for Japanese.)
828 By specifying these attributes, you can create any variant of ISO 2022.
831 Here are several examples:
833 junet -- Coding system used in JUNET.
834 1. G0 <- ASCII, G1..3 <- never used
843 ctext -- Compound Text
844 1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used
853 euc-china -- Chinese EUC. Although many people call this
854 "GB encoding", the name may cause misunderstanding.
855 1. G0 <- ASCII, G1 <- GB2312, G2,3 <- never used
864 korean-mail -- Coding system used in Korean network.
865 1. G0 <- ASCII, G1 <- KSC5601, G2,3 <- never used
874 Mule creates all these coding systems by default.
877 File: lispref.info, Node: Coding Systems, Next: CCL, Prev: ISO 2022, Up: MULE
882 A coding system is an object that defines how text containing
883 multiple character sets is encoded into a stream of (typically 8-bit)
884 bytes. The coding system is used to decode the stream into a series of
885 characters (which may be from multiple charsets) when the text is read
886 from a file or process, and is used to encode the text back into the
887 same format when it is written out to a file or process.
889 For example, many ISO-2022-compliant coding systems (such as Compound
890 Text, which is used for inter-client data under the X Window System) use
891 escape sequences to switch between different charsets - Japanese Kanji,
892 for example, is invoked with `ESC $ ( B'; ASCII is invoked with `ESC (
893 B'; and Cyrillic is invoked with `ESC - L'. See `make-coding-system'
894 for more information.
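To make the charset switching concrete, the following Python sketch encodes a mixed ASCII/Kanji string and lists the escape sequences that appear in the result. It uses Python's own `iso2022_jp' codec as a stand-in for an ISO-2022-compliant coding system; this is not an XEmacs coding system, and it happens to emit the short form `ESC $ B' rather than `ESC $ ( B'. The helper name `designations' is hypothetical.

```python
import re

# ESC, optional '$', optional intermediate byte (0x28-0x2F), final byte.
DESIGNATION = re.compile(rb"\x1b\$?[\x28-\x2f]?[\x30-\x7e]")

def designations(data: bytes):
    """Return the designation escape sequences found in DATA, in order."""
    return DESIGNATION.findall(data)

text = "abc\u6f22\u5b57xyz"            # "abc", two Kanji, "xyz"
encoded = text.encode("iso2022_jp")
switches = designations(encoded)       # one switch to Kanji, one back to ASCII
```

Decoding `encoded' with the same codec gives back the original string, which is the round trip a coding system is meant to provide.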
896 Coding systems are normally identified using a symbol, and the
897 symbol is accepted in place of the actual coding system object whenever
898 a coding system is called for. (This is similar to how faces and
901 - Function: coding-system-p OBJECT
902 This function returns non-`nil' if OBJECT is a coding system.
906 * Coding System Types:: Classifying coding systems.
907 * EOL Conversion:: Dealing with different ways of denoting the end of a line.
909 * Coding System Properties:: Properties of a coding system.
910 * Basic Coding System Functions:: Working with coding systems.
911 * Coding System Property Functions:: Retrieving a coding system's properties.
912 * Encoding and Decoding Text:: Encoding and decoding text.
913 * Detection of Textual Encoding:: Determining how text is encoded.
914 * Big5 and Shift-JIS Functions:: Special functions for these non-standard encodings.
918 File: lispref.info, Node: Coding System Types, Next: EOL Conversion, Up: Coding Systems
925 Automatic conversion. XEmacs attempts to detect the coding system used in the file.
929 No conversion. Use this for binary files and such. On output,
930 graphic characters that are not in ASCII or Latin-1 will be
931 replaced by a `?'. (For a no-conversion-encoded buffer, these
932 characters will only be present if you explicitly insert them.)
935 Shift-JIS (a Japanese encoding commonly used in PC operating systems).
939 Any ISO-2022-compliant encoding. Among other things, this
940 includes JIS (the Japanese encoding commonly used for e-mail),
941 national variants of EUC (the standard Unix encoding for Japanese
942 and other languages), and Compound Text (an encoding used in X11).
943 You can specify more specific information about the conversion
944 with the FLAGS argument.
947 Big5 (the encoding commonly used for Taiwanese).
950 The conversion is performed using a user-written pseudo-code
951 program. CCL (Code Conversion Language) is the name of this pseudo-code.
955 Write out or read in the raw contents of the memory representing
956 the buffer's text. This is primarily useful for debugging
957 purposes, and is only enabled when XEmacs has been compiled with
958 `DEBUG_XEMACS' set (the `--debug' configure option). *Warning*:
959 Reading in a file using `internal' conversion can result in an
960 internal inconsistency in the memory representing a buffer's text,
961 which will produce unpredictable results and may cause XEmacs to
962 crash. Under normal circumstances you should never use `internal' conversion.
966 File: lispref.info, Node: EOL Conversion, Next: Coding System Properties, Prev: Coding System Types, Up: Coding Systems
972 Automatically detect the end-of-line type (LF, CRLF, or CR). Also
973 generate subsidiary coding systems named `NAME-unix', `NAME-dos',
974 and `NAME-mac', that are identical to this coding system but have
975 an EOL-TYPE value of `lf', `crlf', and `cr', respectively.
978 The end of a line is marked externally using ASCII LF. Since this
979 is also the way that XEmacs represents an end-of-line internally,
980 specifying this option results in no end-of-line conversion. This
981 is the standard format for Unix text files.
984 The end of a line is marked externally using ASCII CRLF. This is
985 the standard format for MS-DOS text files.
988 The end of a line is marked externally using ASCII CR. This is the
989 standard format for Macintosh text files.
992 Automatically detect the end-of-line type but do not generate
993 subsidiary coding systems. (This value is converted to `nil' when
994 stored internally, and `coding-system-property' will return `nil'.)
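The three explicit end-of-line types amount to a byte-level substitution around the internal LF convention. The Python sketch below illustrates this; the detection heuristic (look for CRLF first, then CR, then LF) is an assumption made for the example and is not claimed to be the algorithm XEmacs uses. All function names are hypothetical.

```python
EOL_BYTES = {"lf": b"\n", "crlf": b"\r\n", "cr": b"\r"}

def detect_eol(data: bytes):
    """Guess the end-of-line type of DATA, or None if it has no line ends."""
    if b"\r\n" in data:
        return "crlf"
    if b"\r" in data:
        return "cr"
    if b"\n" in data:
        return "lf"
    return None

def decode_eol(data: bytes, eol_type=None) -> bytes:
    """Convert external line ends to the internal LF convention."""
    eol_type = eol_type or detect_eol(data)
    if eol_type in (None, "lf"):
        return data                      # already in internal form
    return data.replace(EOL_BYTES[eol_type], b"\n")

def encode_eol(data: bytes, eol_type: str) -> bytes:
    """Convert internal LF line ends to the external convention."""
    return data.replace(b"\n", EOL_BYTES[eol_type])
```

Note that with an EOL-TYPE of `lf' both directions are the identity, which is why that type results in no end-of-line conversion.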
997 File: lispref.info, Node: Coding System Properties, Next: Basic Coding System Functions, Prev: EOL Conversion, Up: Coding Systems
999 Coding System Properties
1000 ------------------------
1003 String to be displayed in the modeline when this coding system is active.
1007 End-of-line conversion to be used. It should be one of the types
1008 listed in *Note EOL Conversion::.
1010 `post-read-conversion'
1011 Function called after a file has been read in, to perform the
1012 decoding. Called with two arguments, BEG and END, denoting a
1013 region of the current buffer to be decoded.
1015 `pre-write-conversion'
1016 Function called before a file is written out, to perform the
1017 encoding. Called with two arguments, BEG and END, denoting a
1018 region of the current buffer to be encoded.
1020 The following additional properties are recognized if TYPE is `iso2022':
1027 The character set initially designated to the G0 - G3 registers.
1028 The value should be one of
1030 * A charset object (designate that character set)
1032 * `nil' (do not ever use this register)
1034 * `t' (no character set is initially designated to the
1035 register, but may be later on; this automatically sets the
1036 corresponding `force-g*-on-output' property)
1038 `force-g0-on-output'
1039 `force-g1-on-output'
1040 `force-g2-on-output'
1041 `force-g3-on-output'
1042 If non-`nil', send an explicit designation sequence on output
1043 before using the specified register.
1046 If non-`nil', use the short forms `ESC $ @', `ESC $ A', and `ESC $
1047 B' on output in place of the full designation sequences `ESC $ (
1048 @', `ESC $ ( A', and `ESC $ ( B'.
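The relation between the full and short designation sequences can be written as a simple substitution table. The Python sketch below is only an illustration of the three pairs named above; the names `SHORT_FORMS' and `use_short_forms' are hypothetical. Plain byte replacement is safe here because ESC (0x1B) is a control character and so cannot occur inside graphic character data.

```python
ESC = b"\x1b"

# Full designation sequence -> short form, for the three charsets named above.
SHORT_FORMS = {
    ESC + b"$(@": ESC + b"$@",
    ESC + b"$(A": ESC + b"$A",
    ESC + b"$(B": ESC + b"$B",
}

def use_short_forms(data: bytes) -> bytes:
    """Rewrite full G0 designations in DATA to their short forms."""
    for full, short in SHORT_FORMS.items():
        data = data.replace(full, short)
    return data
```

Designations to other registers, such as `ESC $ ) C', have no short form and pass through unchanged.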
1051 If non-`nil', don't designate ASCII to G0 at each end of line on
1052 output. Setting this to non-`nil' also suppresses other
1053 state-resetting that normally happens at the end of a line.
1056 If non-`nil', don't designate ASCII to G0 before control chars on output.
1060 If non-`nil', use 7-bit environment on output. Otherwise, use 8-bit environment.
1064 If non-`nil', use locking-shift (SO/SI) instead of single-shift or
1065 designation by escape sequence.
1068 If non-`nil', don't use ISO6429's direction specification.
1071 If non-`nil', literal control characters that are the same as the
1072 beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
1073 particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3
1074 (0x8F), and CSI (0x9B)) are "quoted" with an escape character so
1075 that they can be properly distinguished from an escape sequence.
1076 (Note that doing this results in a non-portable encoding.) This
1077 encoding flag is used for byte-compiled files. Note that ESC is a
1078 good choice for a quoting character because there are no escape
1079 sequences whose second byte is a character from the Control-0 or
1080 Control-1 character sets; this is explicitly disallowed by the ISO 2022 standard.
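The quoting scheme described above can be sketched as follows. This Python illustration assumes that the quoting character is ESC and that the input consists only of literal data (no genuine escape sequences); it is a model of the idea, not the XEmacs encoder. The names `QUOTED', `quote_controls', and `unquote_controls' are hypothetical.

```python
ESC = 0x1B

# ESC, SO, SI, SS2, SS3, and CSI, as listed above.
QUOTED = frozenset({0x1B, 0x0E, 0x0F, 0x8E, 0x8F, 0x9B})

def quote_controls(data: bytes) -> bytes:
    """Prefix each literal control character in QUOTED with ESC."""
    out = bytearray()
    for b in data:
        if b in QUOTED:
            out.append(ESC)
        out.append(b)
    return bytes(out)

def unquote_controls(data: bytes) -> bytes:
    """Undo quote_controls: drop an ESC that precedes a quoted control."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC and i + 1 < len(data) and data[i + 1] in QUOTED:
            i += 1            # skip the quoting ESC, keep the literal byte
        out.append(data[i])
        i += 1
    return bytes(out)
```

Because no real escape sequence has a control character as its second byte, a reader can always tell a quoted literal from the start of an escape sequence.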
1083 `input-charset-conversion'
1084 A list of conversion specifications, specifying conversion of
1085 characters in one charset to another when decoding is performed.
1086 Each specification is a list of two elements: the source charset,
1087 and the destination charset.
1089 `output-charset-conversion'
1090 A list of conversion specifications, specifying conversion of
1091 characters in one charset to another when encoding is performed.
1092 The form of each specification is the same as for
1093 `input-charset-conversion'.
1095 The following additional properties are recognized (and required) if TYPE is `ccl':
1099 CCL program used for decoding (converting to internal format).
1102 CCL program used for encoding (converting to external format).
1105 File: lispref.info, Node: Basic Coding System Functions, Next: Coding System Property Functions, Prev: Coding System Properties, Up: Coding Systems
1107 Basic Coding System Functions
1108 -----------------------------
1110 - Function: find-coding-system CODING-SYSTEM-OR-NAME
1111 This function retrieves the coding system of the given name.
1113 If CODING-SYSTEM-OR-NAME is a coding-system object, it is simply
1114 returned. Otherwise, CODING-SYSTEM-OR-NAME should be a symbol.
1115 If there is no such coding system, `nil' is returned. Otherwise
1116 the associated coding system object is returned.
1118 - Function: get-coding-system NAME
1119 This function retrieves the coding system of the given name. Same
1120 as `find-coding-system' except an error is signalled if there is no
1121 such coding system instead of returning `nil'.
1123 - Function: coding-system-list
1124 This function returns a list of the names of all defined coding systems.
1127 - Function: coding-system-name CODING-SYSTEM
1128 This function returns the name of the given coding system.
1130 - Function: make-coding-system NAME TYPE &optional DOC-STRING PROPS
1131 This function registers symbol NAME as a coding system.
1133 TYPE describes the conversion method used and should be one of the
1134 types listed in *Note Coding System Types::.
1136 DOC-STRING is a string describing the coding system.
1138 PROPS is a property list, describing the specific nature of the
1139 coding system. Recognized properties are as in *Note Coding
1140 System Properties::.
1142 - Function: copy-coding-system OLD-CODING-SYSTEM NEW-NAME
1143 This function copies OLD-CODING-SYSTEM to NEW-NAME. If NEW-NAME
1144 does not name an existing coding system, a new one will be created.
1146 - Function: subsidiary-coding-system CODING-SYSTEM EOL-TYPE
1147 This function returns the subsidiary coding system of
1148 CODING-SYSTEM with eol type EOL-TYPE.
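The lookup behavior described in this node (return `nil' versus signal an error) can be modelled with a toy registry. The Python below is purely an illustration of those semantics: it is not XEmacs code, the dictionary representation of a coding system is an assumption, and the Python function names merely mirror the Lisp ones for readability.

```python
_registry = {}   # toy stand-in for the table of defined coding systems

def make_coding_system(name, type_, doc_string="", props=None):
    """Register NAME as a coding system and return the new object."""
    cs = {"name": name, "type": type_, "doc": doc_string,
          "props": dict(props or {})}
    _registry[name] = cs
    return cs

def find_coding_system(coding_system_or_name):
    """Return the coding system, or None if no such system exists."""
    if isinstance(coding_system_or_name, dict):
        return coding_system_or_name      # already an object: return it as-is
    return _registry.get(coding_system_or_name)

def get_coding_system(name):
    """Like find_coding_system, but raise an error if nothing is found."""
    cs = find_coding_system(name)
    if cs is None:
        raise LookupError("no such coding system: %s" % name)
    return cs

def coding_system_list():
    """Return the names of all defined coding systems."""
    return list(_registry)
```

Use the `find' variant when a missing coding system is an expected case to be tested for, and the `get' variant when it indicates a programming error.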
1151 File: lispref.info, Node: Coding System Property Functions, Next: Encoding and Decoding Text, Prev: Basic Coding System Functions, Up: Coding Systems
1153 Coding System Property Functions
1154 --------------------------------
1156 - Function: coding-system-doc-string CODING-SYSTEM
1157 This function returns the doc string for CODING-SYSTEM.
1159 - Function: coding-system-type CODING-SYSTEM
1160 This function returns the type of CODING-SYSTEM.
1162 - Function: coding-system-property CODING-SYSTEM PROP
1163 This function returns the PROP property of CODING-SYSTEM.
1166 File: lispref.info, Node: Encoding and Decoding Text, Next: Detection of Textual Encoding, Prev: Coding System Property Functions, Up: Coding Systems
1168 Encoding and Decoding Text
1169 --------------------------
1171 - Function: decode-coding-region START END CODING-SYSTEM &optional BUFFER
1173 This function decodes the text between START and END which is
1174 encoded in CODING-SYSTEM. This is useful if you've read in
1175 encoded text from a file without decoding it (e.g. you read in a
1176 JIS-formatted file but used the `binary' or `no-conversion' coding
1177 system, so that it shows up as `^[$B!<!+^[(B'). The length of the
1178 decoded text is returned. BUFFER defaults to the current buffer if unspecified.
1181 - Function: encode-coding-region START END CODING-SYSTEM &optional BUFFER
1183 This function encodes the text between START and END using
1184 CODING-SYSTEM. This will, for example, convert Japanese
1185 characters into byte sequences such as `^[$B!<!+^[(B' if you use the JIS
1186 encoding. The length of the encoded text is returned. BUFFER
1187 defaults to the current buffer if unspecified.
1190 File: lispref.info, Node: Detection of Textual Encoding, Next: Big5 and Shift-JIS Functions, Prev: Encoding and Decoding Text, Up: Coding Systems
1192 Detection of Textual Encoding
1193 -----------------------------
1195 - Function: coding-category-list
1196 This function returns a list of all recognized coding categories.
1198 - Function: set-coding-priority-list LIST
1199 This function changes the priority order of the coding categories.
1200 LIST should be a list of coding categories, in descending order of
1201 priority. Unspecified coding categories will be lower in priority
1202 than all specified ones, in the same relative order they were in previously.
1205 - Function: coding-priority-list
1206 This function returns a list of coding categories in descending order of priority.
1209 - Function: set-coding-category-system CODING-CATEGORY CODING-SYSTEM
1210 This function changes the coding system associated with a coding category.
1213 - Function: coding-category-system CODING-CATEGORY
1214 This function returns the coding system associated with a coding category.
1217 - Function: detect-coding-region START END &optional BUFFER
1218 This function detects the coding system of the text in the region
1219 between START and END. The returned value is a list of possible coding
1220 systems ordered by priority. If only ASCII characters are found,
1221 it returns `autodetect' or one of its subsidiary coding systems
1222 according to the detected end-of-line type. Optional arg BUFFER
1223 defaults to the current buffer.
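The idea of returning every plausible coding system in priority order can be sketched with a trial-decoding loop. The Python below is a simplified model, not the detector XEmacs implements: it treats "the bytes decode without error" as the only test, and it uses Python codec names in place of coding categories. The function name `detect_candidates' is hypothetical.

```python
def detect_candidates(data: bytes, priority):
    """Return the codecs from PRIORITY that can decode DATA, in that order."""
    matches = []
    for codec in priority:
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue                 # this codec cannot explain the bytes
        matches.append(codec)
    return matches

# Two Kanji encoded as EUC-JP: every byte has the high bit set, so a
# 7-bit encoding such as ISO-2022-JP is ruled out.
sample = "\u6f22\u5b57".encode("euc_jp")
candidates = detect_candidates(
    sample, ["iso2022_jp", "euc_jp", "shift_jis", "big5"])
```

Pure ASCII input is accepted by every candidate, so the result is simply the priority list itself; this is the ambiguous case for which the function above falls back on `autodetect'.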