-\1f
-File: lispref.info, Node: Internationalization Terminology, Next: Charsets, Up: MULE
-
-Internationalization Terminology
-================================
-
- In internationalization terminology, a string of text is divided up
-into "characters", which are the printable units that make up the text.
-A single character is (for example) a capital `A', the number `2', a
-Katakana character, a Hangul character, a Kanji ideograph (an
-"ideograph" is a "picture" character, such as is used in Japanese
-Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands
-of such ideographs in each language), etc. The basic property of a
-character is that it is the smallest unit of text with semantic
-significance in text processing.
-
- Human beings normally process text visually, so to a first
-approximation a character may be identified with its shape. Note that
-the same character may be drawn by two different people (or in two
-different fonts) in slightly different ways, although the "basic shape"
-will be the same. But consider the works of Scott Kim; human beings
-can recognize hugely variant shapes as the "same" character.
-Sometimes, especially where characters are extremely complicated to
-write, completely different shapes may be defined as the "same"
-character in national standards. The Taiwanese variant of Hanzi is
-generally the most complicated; over the centuries, the Japanese,
-Koreans, and the People's Republic of China have adopted
-simplifications of the shape, but the line of descent from the original
-shape is recorded, and the meanings and pronunciation of different
-forms of the same character are considered to be identical within each
-language. (Of course, it may take a specialist to recognize the
-related form; the point is that the relations are standardized, despite
-the differing shapes.)
-
- In some cases, the differences will be significant enough that it is
-actually possible to identify two or more distinct shapes that both
-represent the same character. For example, the lowercase letters `a'
-and `g' each have two distinct possible shapes--the `a' can optionally
-have a curved tail projecting off the top, and the `g' can be formed
-either of two loops, or of one loop and a tail hanging off the bottom.
-Such distinct possible shapes of a character are called "glyphs". The
-important characteristic of two glyphs making up the same character is
-that the choice between one or the other is purely stylistic and has no
-linguistic effect on a word (this is the reason why a capital `A' and
-lowercase `a' are different characters rather than different
-glyphs--e.g. `Aspen' is a city while `aspen' is a kind of tree).
-
- Note that "character" and "glyph" are used differently here than
-elsewhere in XEmacs.
-
- A "character set" is essentially a set of related characters. ASCII,
-for example, is a set of 94 characters (or 128, if you count
-non-printing characters). Other character sets are ISO8859-1 (ASCII
-plus various accented characters and other international symbols), JIS
-X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208
-(Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji),
-GB2312 (Mainland Chinese Hanzi), etc.
-
- The definition of a character set will implicitly or explicitly give
-it an "ordering", a way of assigning a number to each character in the
-set. For many character sets, there is a natural ordering, for example
-the "ABC" ordering of the Roman letters. But it is not clear whether
-digits should come before or after the letters, and in fact different
-European languages treat the ordering of accented characters
-differently. It is useful to use the natural order where available, of
-course. The number assigned to any particular character is called the
-character's "code point". (Within a given character set, each
-character has a unique code point. Thus the word "set" is ill-chosen;
-different orderings of the same characters are different character sets.
-Identifying characters is simple enough for alphabetic character sets,
-but the difference in ordering can cause great headaches when the same
-thousands of characters are used by different cultures as in the Hanzi.)
-
- A code point may be broken into a number of "position codes". The
-number of position codes required to index a particular character in a
-character set is called the "dimension" of the character set. For
-practical purposes, a position code may be thought of as a byte-sized
-index. The printing characters of ASCII, being a relatively small
-character set, is of dimension one, and each character in the set is
-indexed using a single position code, in the range 1 through 94. Use of
-this unusual range, rather than the familiar 33 through 126, is an
-intentional abstraction; to understand the programming issues you must
-break the equation between character sets and encodings.
-
- JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is
-of dimension two - every character is indexed by two position codes,
-each in the range 1 through 94. (This number "94" is not a
-coincidence; we shall see that the JIS position codes were chosen so
-that JIS kanji could be encoded without using codes that in ASCII are
-associated with device control functions.) Note that the choice of the
-range here is somewhat arbitrary. You could just as easily index the
-printing characters in ASCII using numbers in the range 0 through 93, 2
-through 95, 3 through 96, etc. In fact, the standardized _encoding_
-for the ASCII _character set_ uses the range 33 through 126.
-
- An "encoding" is a way of numerically representing characters from
-one or more character sets into a stream of like-sized numerical values
-called "words"; typically these are 8-bit, 16-bit, or 32-bit
-quantities. If an encoding encompasses only one character set, then the
-position codes for the characters in that character set could be used
-directly. (This is the case with the trivial cipher used by children,
-assigning 1 to `A', 2 to `B', and so on.) However, even with ASCII,
-other considerations intrude. For example, why are the upper- and
-lowercase alphabets separated by 8 characters? Why do the digits start
-with `0' being assigned the code 48? In both cases because semantically
-interesting operations (case conversion and numerical value extraction)
-become convenient masking operations. Other artificial aspects (the
-control characters being assigned to codes 0-31 and 127) are historical
-accidents. (The use of 127 for `DEL' is an artifact of the "punch
-once" nature of paper tape, for example.)
-
- Naive use of the position code is not possible, however, if more than
-one character set is to be used in the encoding. For example, printed
-Japanese text typically requires characters from multiple character sets
-- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is
-indexed using one or more position codes in the range 1 through 94, so
-the position codes could not be used directly or there would be no way
-to tell which character was meant. Different Japanese encodings handle
-this differently - JIS uses special escape characters to denote
-different character sets; EUC sets the high bit of the position codes
-for JIS X 0208 and JIS X 0212, and puts a special extra byte before each
-JIS X 0212 character; etc. (JIS, EUC, and most of the other encodings
-you will encounter in files are 7-bit or 8-bit encodings. There is one
-common 16-bit encoding, which is Unicode; this strives to represent all
-the world's characters in a single large character set. 32-bit
-encodings are often used internally in programs, such as XEmacs with
-MULE support, to simplify the code that manipulates them; however, they
-are not used externally because they are not very space-efficient.)
-
- A general method of handling text using multiple character sets
-(whether for multilingual text, or simply text in an extremely
-complicated single language like Japanese) is defined in the
-international standard ISO 2022. ISO 2022 will be discussed in more
-detail later (*note ISO 2022::), but for now suffice it to say that text
-needs control functions (at least spacing), and if escape sequences are
-to be used, an escape sequence introducer. It was decided to make all
-text streams compatible with ASCII in the sense that the codes 0-31
-(and 128-159) would always be control codes, never graphic characters,
-and where defined by the character set the `SPC' character would be
-assigned code 32, and `DEL' would be assigned 127. Thus there are 94
-code points remaining if 7 bits are used. This is the reason that most
-character sets are defined using position codes in the range 1 through
-94. Then ISO 2022 compatible encodings are produced by shifting the
-position codes 1 to 94 into character codes 33 to 126, or (if 8 bit
-codes are available) into character codes 161 to 254.
-
- Encodings are classified as either "modal" or "non-modal". In a
-"modal encoding", there are multiple states that the encoding can be
-in, and the interpretation of the values in the stream depends on the
-current global state of the encoding. Special values in the encoding,
-called "escape sequences", are used to change the global state. JIS,
-for example, is a modal encoding. The bytes `ESC $ B' indicate that,
-from then on, bytes are to be interpreted as position codes for JIS X
-0208, rather than as ASCII. This effect is cancelled using the bytes
-`ESC ( B', which mean "switch from whatever the current state is to
-ASCII". To switch to JIS X 0212, the escape sequence `ESC $ ( D'.
-(Note that here, as is common, the escape sequences do in fact begin
-with `ESC'. This is not necessarily the case, however. Some encodings
-use control characters called "locking shifts" (effect persists until
-cancelled) to switch character sets.)
-
- A "non-modal encoding" has no global state that extends past the
-character currently being interpreted. EUC, for example, is a
-non-modal encoding. Characters in JIS X 0208 are encoded by setting
-the high bit of the position codes, and characters in JIS X 0212 are
-encoded by doing the same but also prefixing the character with the
-byte 0x8F.
-
- The advantage of a modal encoding is that it is generally more
-space-efficient, and is easily extendable because there are essentially
-an arbitrary number of escape sequences that can be created. The
-disadvantage, however, is that it is much more difficult to work with
-if it is not being processed in a sequential manner. In the non-modal
-EUC encoding, for example, the byte 0x41 always refers to the letter
-`A'; whereas in JIS, it could either be the letter `A', or one of the
-two position codes in a JIS X 0208 character, or one of the two
-position codes in a JIS X 0212 character. Determining exactly which
-one is meant could be difficult and time-consuming if the previous
-bytes in the string have not already been processed, or impossible if
-they are drawn from an external stream that cannot be rewound.
-
- Non-modal encodings are further divided into "fixed-width" and
-"variable-width" formats. A fixed-width encoding always uses the same
-number of words per character, whereas a variable-width encoding does
-not. EUC is a good example of a variable-width encoding: one to three
-bytes are used per character, depending on the character set. 16-bit
-and 32-bit encodings are nearly always fixed-width, and this is in fact
-one of the main reasons for using an encoding with a larger word size.
-The advantages of fixed-width encodings should be obvious. The
-advantages of variable-width encodings are that they are generally more
-space-efficient and allow for compatibility with existing 8-bit
-encodings such as ASCII. (For example, in Unicode ASCII characters are
-simply promoted to a 16-bit representation. That means that every
-ASCII character contains a `NUL' byte; evidently all of the standard
-string manipulation functions will lose badly in a fixed-width Unicode
-environment.)
-
- The bytes in an 8-bit encoding are often referred to as "octets"
-rather than simply as bytes. This terminology dates back to the days
-before 8-bit bytes were universal, when some computers had 9-bit bytes,
-others had 10-bit bytes, etc.
-
-\1f
-File: lispref.info, Node: Charsets, Next: MULE Characters, Prev: Internationalization Terminology, Up: MULE
-
-Charsets
-========
-
- A "charset" in MULE is an object that encapsulates a particular
-character set as well as an ordering of those characters. Charsets are
-permanent objects and are named using symbols, like faces.
-
- - Function: charsetp object
- This function returns non-`nil' if OBJECT is a charset.
-
-* Menu:
-
-* Charset Properties:: Properties of a charset.
-* Basic Charset Functions:: Functions for working with charsets.
-* Charset Property Functions:: Functions for accessing charset properties.
-* Predefined Charsets:: Predefined charset objects.
-
-\1f
-File: lispref.info, Node: Charset Properties, Next: Basic Charset Functions, Up: Charsets
-
-Charset Properties
-------------------
-
- Charsets have the following properties:
-
-`name'
- A symbol naming the charset. Every charset must have a different
- name; this allows a charset to be referred to using its name
- rather than the actual charset object.
-
-`doc-string'
- A documentation string describing the charset.
-
-`registry'
- A regular expression matching the font registry field for this
- character set. For example, both the `ascii' and `latin-iso8859-1'
- charsets use the registry `"ISO8859-1"'. This field is used to
- choose an appropriate font when the user gives a general font
- specification such as `-*-courier-medium-r-*-140-*', i.e. a
- 14-point upright medium-weight Courier font.
-
-`dimension'
- Number of position codes used to index a character in the
- character set. XEmacs/MULE can only handle character sets of
- dimension 1 or 2. This property defaults to 1.
-
-`chars'
- Number of characters in each dimension. In XEmacs/MULE, the only
- allowed values are 94 or 96. (There are a couple of pre-defined
- character sets, such as ASCII, that do not follow this, but you
- cannot define new ones like this.) Defaults to 94. Note that if
- the dimension is 2, the character set thus described is 94x94 or
- 96x96.
-
-`columns'
- Number of columns used to display a character in this charset.
- Only used in TTY mode. (Under X, the actual width of a character
- can be derived from the font used to display the characters.) If
- unspecified, defaults to the dimension. (This is almost always the
- correct value, because character sets with dimension 2 are usually
- ideograph character sets, which need two columns to display the
- intricate ideographs.)
-
-`direction'
- A symbol, either `l2r' (left-to-right) or `r2l' (right-to-left).
- Defaults to `l2r'. This specifies the direction that the text
- should be displayed in, and will be left-to-right for most
- charsets but right-to-left for Hebrew and Arabic. (Right-to-left
- display is not currently implemented.)
-
-`final'
- Final byte of the standard ISO 2022 escape sequence designating
- this charset. Must be supplied. Each combination of (DIMENSION,
- CHARS) defines a separate namespace for final bytes, and each
- charset within a particular namespace must have a different final
- byte. Note that ISO 2022 restricts the final byte to the range
- 0x30 - 0x7E if dimension == 1, and 0x30 - 0x5F if dimension == 2.
- Note also that final bytes in the range 0x30 - 0x3F are reserved
- for user-defined (not official) character sets. For more
- information on ISO 2022, see *Note Coding Systems::.
-
-`graphic'
- 0 (use left half of font on output) or 1 (use right half of font on
- output). Defaults to 0. This specifies how to convert the
- position codes that index a character in a character set into an
- index into the font used to display the character set. With
- `graphic' set to 0, position codes 33 through 126 map to font
- indices 33 through 126; with it set to 1, position codes 33
- through 126 map to font indices 161 through 254 (i.e. the same
- number but with the high bit set). For example, for a font whose
- registry is ISO8859-1, the left half of the font (octets 0x20 -
- 0x7F) is the `ascii' charset, while the right half (octets 0xA0 -
- 0xFF) is the `latin-iso8859-1' charset.
-
-`ccl-program'
- A compiled CCL program used to convert a character in this charset
- into an index into the font. This is in addition to the `graphic'
- property. If a CCL program is defined, the position codes of a
- character will first be processed according to `graphic' and then
- passed through the CCL program, with the resulting values used to
- index the font.
-
- This is used, for example, in the Big5 character set (used in
- Taiwan). This character set is not ISO-2022-compliant, and its
- size (94x157) does not fit within the maximum 96x96 size of
- ISO-2022-compliant character sets. As a result, XEmacs/MULE
- splits it (in a rather complex fashion, so as to group the most
- commonly used characters together) into two charset objects
- (`big5-1' and `big5-2'), each of size 94x94, and each charset
- object uses a CCL program to convert the modified position codes
- back into standard Big5 indices to retrieve a character from a
- Big5 font.
-
- Most of the above properties can only be set when the charset is
-initialized, and cannot be changed later. *Note Charset Property
-Functions::.
-
-\1f
-File: lispref.info, Node: Basic Charset Functions, Next: Charset Property Functions, Prev: Charset Properties, Up: Charsets
-
-Basic Charset Functions
------------------------
-
- - Function: find-charset charset-or-name
- This function retrieves the charset of the given name. If
- CHARSET-OR-NAME is a charset object, it is simply returned.
- Otherwise, CHARSET-OR-NAME should be a symbol. If there is no
- such charset, `nil' is returned. Otherwise the associated charset
- object is returned.
-
- - Function: get-charset name
- This function retrieves the charset of the given name. Same as
- `find-charset' except an error is signalled if there is no such
- charset instead of returning `nil'.
-
- - Function: charset-list
- This function returns a list of the names of all defined charsets.
-
- - Function: make-charset name doc-string props
- This function defines a new character set. This function is for
- use with MULE support. NAME is a symbol, the name by which the
- character set is normally referred. DOC-STRING is a string
- describing the character set. PROPS is a property list,
- describing the specific nature of the character set. The
- recognized properties are `registry', `dimension', `columns',
- `chars', `final', `graphic', `direction', and `ccl-program', as
- previously described.
-
- - Function: make-reverse-direction-charset charset new-name
- This function makes a charset equivalent to CHARSET but which goes
- in the opposite direction. NEW-NAME is the name of the new
- charset. The new charset is returned.
-
- - Function: charset-from-attributes dimension chars final &optional
- direction
- This function returns a charset with the given DIMENSION, CHARS,
- FINAL, and DIRECTION. If DIRECTION is omitted, both directions
- will be checked (left-to-right will be returned if character sets
- exist for both directions).
-
- - Function: charset-reverse-direction-charset charset
- This function returns the charset (if any) with the same dimension,
- number of characters, and final byte as CHARSET, but which is
- displayed in the opposite direction.
-
-\1f
-File: lispref.info, Node: Charset Property Functions, Next: Predefined Charsets, Prev: Basic Charset Functions, Up: Charsets
-
-Charset Property Functions
---------------------------
-
- All of these functions accept either a charset name or charset
-object.
-
- - Function: charset-property charset prop
- This function returns property PROP of CHARSET. *Note Charset
- Properties::.
-
- Convenience functions are also provided for retrieving individual
-properties of a charset.
-
- - Function: charset-name charset
- This function returns the name of CHARSET. This will be a symbol.
-
- - Function: charset-doc-string charset
- This function returns the doc string of CHARSET.
-
- - Function: charset-registry charset
- This function returns the registry of CHARSET.
-
- - Function: charset-dimension charset
- This function returns the dimension of CHARSET.
-
- - Function: charset-chars charset
- This function returns the number of characters per dimension of
- CHARSET.
-
- - Function: charset-columns charset
- This function returns the number of display columns per character
- (in TTY mode) of CHARSET.
-
- - Function: charset-direction charset
- This function returns the display direction of CHARSET--either
- `l2r' or `r2l'.
-
- - Function: charset-final charset
- This function returns the final byte of the ISO 2022 escape
- sequence designating CHARSET.
-
- - Function: charset-graphic charset
- This function returns either 0 or 1, depending on whether the
- position codes of characters in CHARSET map to the left or right
- half of their font, respectively.
-
- - Function: charset-ccl-program charset
- This function returns the CCL program, if any, for converting
- position codes of characters in CHARSET into font indices.
-
- The only property of a charset that can currently be set after the
-charset has been created is the CCL program.
-
- - Function: set-charset-ccl-program charset ccl-program
- This function sets the `ccl-program' property of CHARSET to
- CCL-PROGRAM.
-
-\1f
-File: lispref.info, Node: Predefined Charsets, Prev: Charset Property Functions, Up: Charsets
-
-Predefined Charsets
--------------------
-
- The following charsets are predefined in the C code.
-
- Name Type Fi Gr Dir Registry
- --------------------------------------------------------------
- ascii 94 B 0 l2r ISO8859-1
- control-1 94 0 l2r ---
- latin-iso8859-1 94 A 1 l2r ISO8859-1
- latin-iso8859-2 96 B 1 l2r ISO8859-2
- latin-iso8859-3 96 C 1 l2r ISO8859-3
- latin-iso8859-4 96 D 1 l2r ISO8859-4
- cyrillic-iso8859-5 96 L 1 l2r ISO8859-5
- arabic-iso8859-6 96 G 1 r2l ISO8859-6
- greek-iso8859-7 96 F 1 l2r ISO8859-7
- hebrew-iso8859-8 96 H 1 r2l ISO8859-8
- latin-iso8859-9 96 M 1 l2r ISO8859-9
- thai-tis620 96 T 1 l2r TIS620
- katakana-jisx0201 94 I 1 l2r JISX0201.1976
- latin-jisx0201 94 J 0 l2r JISX0201.1976
- japanese-jisx0208-1978 94x94 @ 0 l2r JISX0208.1978
- japanese-jisx0208 94x94 B 0 l2r JISX0208.19(83|90)
- japanese-jisx0212 94x94 D 0 l2r JISX0212
- chinese-gb2312 94x94 A 0 l2r GB2312
- chinese-cns11643-1 94x94 G 0 l2r CNS11643.1
- chinese-cns11643-2 94x94 H 0 l2r CNS11643.2
- chinese-big5-1 94x94 0 0 l2r Big5
- chinese-big5-2 94x94 1 0 l2r Big5
- korean-ksc5601 94x94 C 0 l2r KSC5601
- composite 96x96 0 l2r ---
-
- The following charsets are predefined in the Lisp code.
-
- Name Type Fi Gr Dir Registry
- --------------------------------------------------------------
- arabic-digit 94 2 0 l2r MuleArabic-0
- arabic-1-column 94 3 0 r2l MuleArabic-1
- arabic-2-column 94 4 0 r2l MuleArabic-2
- sisheng 94 0 0 l2r sisheng_cwnn\|OMRON_UDC_ZH
- chinese-cns11643-3 94x94 I 0 l2r CNS11643.1
- chinese-cns11643-4 94x94 J 0 l2r CNS11643.1
- chinese-cns11643-5 94x94 K 0 l2r CNS11643.1
- chinese-cns11643-6 94x94 L 0 l2r CNS11643.1
- chinese-cns11643-7 94x94 M 0 l2r CNS11643.1
- ethiopic 94x94 2 0 l2r Ethio
- ascii-r2l 94 B 0 r2l ISO8859-1
- ipa 96 0 1 l2r MuleIPA
- vietnamese-lower 96 1 1 l2r VISCII1.1
- vietnamese-upper 96 2 1 l2r VISCII1.1
-
- For all of the above charsets, the dimension and number of columns
-are the same.
-
- Note that ASCII, Control-1, and Composite are handled specially.
-This is why some of the fields are blank; and some of the filled-in
-fields (e.g. the type) are not really accurate.
-
-\1f
-File: lispref.info, Node: MULE Characters, Next: Composite Characters, Prev: Charsets, Up: MULE
-
-MULE Characters
-===============
-
- - Function: make-char charset arg1 &optional arg2
- This function makes a multi-byte character from CHARSET and octets
- ARG1 and ARG2.
-
- - Function: char-charset ch
- This function returns the character set of char CH.
-
- - Function: char-octet ch &optional n
- This function returns the octet (i.e. position code) numbered N
- (should be 0 or 1) of char CH. N defaults to 0 if omitted.
-
- - Function: find-charset-region start end &optional buffer
- This function returns a list of the charsets in the region between
- START and END. BUFFER defaults to the current buffer if omitted.
-
- - Function: find-charset-string string
- This function returns a list of the charsets in STRING.
-
-\1f
-File: lispref.info, Node: Composite Characters, Next: Coding Systems, Prev: MULE Characters, Up: MULE
-
-Composite Characters
-====================
-
- Composite characters are not yet completely implemented.
-
- - Function: make-composite-char string
- This function converts a string into a single composite character.
- The character is the result of overstriking all the characters in
- the string.
-
- - Function: composite-char-string ch
- This function returns a string of the characters comprising a
- composite character.
-
- - Function: compose-region start end &optional buffer
- This function composes the characters in the region from START to
- END in BUFFER into one composite character. The composite
- character replaces the composed characters. BUFFER defaults to
- the current buffer if omitted.
-
- - Function: decompose-region start end &optional buffer
- This function decomposes any composite characters in the region
- from START to END in BUFFER. This converts each composite
- character into one or more characters, the individual characters
- out of which the composite character was formed. Non-composite
- characters are left as-is. BUFFER defaults to the current buffer
- if omitted.
-
-\1f
-File: lispref.info, Node: Coding Systems, Next: CCL, Prev: Composite Characters, Up: MULE
-
-Coding Systems
-==============
-
- A coding system is an object that defines how text containing
-multiple character sets is encoded into a stream of (typically 8-bit)
-bytes. The coding system is used to decode the stream into a series of
-characters (which may be from multiple charsets) when the text is read
-from a file or process, and is used to encode the text back into the
-same format when it is written out to a file or process.
-
- For example, many ISO-2022-compliant coding systems (such as Compound
-Text, which is used for inter-client data under the X Window System) use
-escape sequences to switch between different charsets - Japanese Kanji,
-for example, is invoked with `ESC $ ( B'; ASCII is invoked with `ESC (
-B'; and Cyrillic is invoked with `ESC - L'. See `make-coding-system'
-for more information.
-
- Coding systems are normally identified using a symbol, and the
-symbol is accepted in place of the actual coding system object whenever
-a coding system is called for. (This is similar to how faces and
-charsets work.)
-
- - Function: coding-system-p object
- This function returns non-`nil' if OBJECT is a coding system.
-
-* Menu:
-
-* Coding System Types:: Classifying coding systems.
-* ISO 2022:: An international standard for
- charsets and encodings.
-* EOL Conversion:: Dealing with different ways of denoting
- the end of a line.
-* Coding System Properties:: Properties of a coding system.
-* Basic Coding System Functions:: Working with coding systems.
-* Coding System Property Functions:: Retrieving a coding system's properties.
-* Encoding and Decoding Text:: Encoding and decoding text.
-* Detection of Textual Encoding:: Determining how text is encoded.
-* Big5 and Shift-JIS Functions:: Special functions for these non-standard
- encodings.
-* Predefined Coding Systems:: Coding systems implemented by MULE.
-