1 This is ../info/internals.info, produced by makeinfo version 4.0 from
2 internals/internals.texi.
4 INFO-DIR-SECTION XEmacs Editor
6 * Internals: (internals). XEmacs Internals Manual.
9 Copyright (C) 1992 - 1996 Ben Wing. Copyright (C) 1996, 1997 Sun
10 Microsystems. Copyright (C) 1994 - 1998 Free Software Foundation.
11 Copyright (C) 1994, 1995 Board of Trustees, University of Illinois.
13 Permission is granted to make and distribute verbatim copies of this
14 manual provided the copyright notice and this permission notice are
15 preserved on all copies.
17 Permission is granted to copy and distribute modified versions of
18 this manual under the conditions for verbatim copying, provided that the
19 entire resulting derived work is distributed under the terms of a
20 permission notice identical to this one.
22 Permission is granted to copy and distribute translations of this
23 manual into another language, under the above conditions for modified
24 versions, except that this permission notice may be stated in a
25 translation approved by the Foundation.
27 Permission is granted to copy and distribute modified versions of
28 this manual under the conditions for verbatim copying, provided also
29 that the section entitled "GNU General Public License" is included
30 exactly as in the original, and provided that the entire resulting
31 derived work is distributed under the terms of a permission notice
32 identical to this one.
34 Permission is granted to copy and distribute translations of this
35 manual into another language, under the above conditions for modified
36 versions, except that the section entitled "GNU General Public License"
37 may be included in a translation approved by the Free Software
38 Foundation instead of in the original English.
41 File: internals.info, Node: Obarrays, Next: Symbol Values, Prev: Introduction to Symbols, Up: Symbols and Variables
46 The identity of symbols with their names is accomplished through a
47 structure called an obarray, which is just a poorly-implemented hash
48 table mapping from strings to symbols whose name is that string. (I say
49 "poorly implemented" because an obarray appears in Lisp as a vector
50 with some hidden fields rather than as its own opaque type. This is an
51 Emacs Lisp artifact that should be fixed.)
53 Obarrays are implemented as a vector of some fixed size (which should
54 be a prime for best results), where each "bucket" of the vector
55 contains one or more symbols, threaded through a hidden `next' field in
56 the symbol. Lookup of a symbol in an obarray, and adding a symbol to
57 an obarray, is accomplished through standard hash-table techniques.
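The bucket-and-chain layout just described can be sketched in C roughly as follows. This is only an illustration, not the XEmacs implementation: the names `struct obarray', `obarray_lookup' and `obarray_intern', the table size, and the hash function are all assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define OBARRAY_SIZE 509            /* hypothetical size; a prime, as advised */

struct symbol {
    char *name;
    struct symbol *next;            /* hidden link to the next symbol in the bucket */
};

struct obarray {
    struct symbol *buckets[OBARRAY_SIZE];
};

/* A simple string hash (djb2), standing in for whatever hash is really used. */
static unsigned long hash_name(const char *name)
{
    unsigned long h = 5381;
    while (*name)
        h = h * 33 + (unsigned char) *name++;
    return h;
}

/* Return the symbol named NAME, or NULL if the obarray has none. */
struct symbol *obarray_lookup(struct obarray *ob, const char *name)
{
    struct symbol *s = ob->buckets[hash_name(name) % OBARRAY_SIZE];
    for (; s != NULL; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;
    return NULL;
}

/* Return the symbol named NAME, creating and adding it if it is missing. */
struct symbol *obarray_intern(struct obarray *ob, const char *name)
{
    struct symbol *s = obarray_lookup(ob, name);
    if (s != NULL)
        return s;
    s = malloc(sizeof *s);
    assert(s != NULL);
    s->name = malloc(strlen(name) + 1);
    assert(s->name != NULL);
    strcpy(s->name, name);
    unsigned long b = hash_name(name) % OBARRAY_SIZE;
    s->next = ob->buckets[b];       /* thread the new symbol onto its bucket */
    ob->buckets[b] = s;
    return s;
}
```

Because lookup always precedes creation, two calls with the same name on the same zero-initialized `struct obarray' return the same pointer, which is the property that ties a symbol's identity to its name.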
The standard Lisp function for working with symbols and obarrays is
`intern'. This looks up a symbol in an obarray given its name; if it's
not found, a new symbol is automatically created with the specified
name, added to the obarray, and returned. This is what happens when the
Lisp reader encounters a symbol (or more precisely, encounters the name
of a symbol) in some text that it is reading. There is a standard
obarray called `obarray' that is used for this purpose, although the
Lisp programmer is free to create his own obarrays and `intern' symbols
in them.
Note that, once a symbol is in an obarray, it stays there until
it is explicitly removed, and the standard obarray `obarray' always
stays around, so once you use any particular variable name, a
corresponding symbol will stay around in `obarray' until you exit
XEmacs.
75 Note that `obarray' itself is a variable, and as such there is a
76 symbol in `obarray' whose name is `"obarray"' and which contains
77 `obarray' as its value.
79 Note also that this call to `intern' occurs only when in the Lisp
80 reader, not when the code is executed (at which point the symbol is
81 already around, stored as such in the definition of the function).
You can create your own obarray using `make-vector' (this is
horrible but is an artifact) and intern symbols into that obarray.
Doing that can result in two or more symbols with the same name.
However, at most one of these symbols is in the standard `obarray': you
cannot have two symbols of the same name in any particular obarray.
Note that you cannot add a symbol to an obarray in any fashion other
than using `intern': i.e. you can't take an existing symbol and put it
in an existing obarray. Nor can you change the name of an existing
symbol. (Since obarrays are vectors, you can violate the consistency of
things by storing directly into the vector, but let's ignore that
possibility.)
95 Usually symbols are created by `intern', but if you really want, you
96 can explicitly create a symbol using `make-symbol', giving it some
97 name. The resulting symbol is not in any obarray (i.e. it is
98 "uninterned"), and you can't add it to any obarray. Therefore its
99 primary purpose is as a symbol to use in macros to avoid namespace
100 pollution. It can also be used as a carrier of information, but cons
101 cells could probably be used just as well.
You can also use `intern-soft' to look up a symbol but not create a
new one, and `unintern' to remove a symbol from an obarray. `unintern'
returns the removed symbol. (Remember: you can't put the symbol back
into any obarray.) Finally, `mapatoms' maps over all of the symbols in
an obarray.
110 File: internals.info, Node: Symbol Values, Prev: Obarrays, Up: Symbols and Variables
115 The value field of a symbol normally contains a Lisp object.
116 However, a symbol can be "unbound", meaning that it logically has no
117 value. This is internally indicated by storing a special Lisp object,
118 called "the unbound marker" and stored in the global variable
119 `Qunbound'. The unbound marker is of a special Lisp object type called
120 "symbol-value-magic". It is impossible for the Lisp programmer to
121 directly create or access any object of this type.
*You must not let any "symbol-value-magic" object escape to the Lisp
level.* Printing any of these objects will cause the message `INTERNAL
EMACS BUG' to appear as part of the print representation. (You may see
this normally when you call `debug_print()' from the debugger on a Lisp
object.) If you let one of these objects escape to the Lisp level, you
will violate a number of assumptions contained in the C code and cause
the unbound marker to malfunction.
When a symbol is created, its value and function fields are both
set to `Qunbound'. The Lisp programmer can restore these conditions
later using `makunbound' or `fmakunbound', and can query whether
the value or function fields are "bound" (i.e. have a value other than
`Qunbound') using `boundp' and `fboundp'. The fields are set to a
normal Lisp object using `set' (or `setq') and `fset'.
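The idea of a unique sentinel object marking "no value" can be sketched in C as below. This is a simplified illustration only: `Obj', `QUNBOUND' and the `sym_*' functions are hypothetical stand-ins, and the real `Qunbound' is a Lisp object of type symbol-value-magic rather than the address of a static variable.

```c
#include <assert.h>
#include <stddef.h>

typedef const void *Obj;                 /* stand-in for a Lisp object */

static const char unbound_marker = 0;    /* its address serves as the sentinel */
#define QUNBOUND ((Obj) &unbound_marker)

struct symbol {
    Obj value;                           /* value cell */
    Obj function;                        /* function cell */
};

/* A fresh symbol has both cells unbound. */
void sym_init(struct symbol *s)
{
    s->value = QUNBOUND;
    s->function = QUNBOUND;
}

/* Analogues of `boundp' and `fboundp': compare against the sentinel. */
int sym_boundp(const struct symbol *s)  { return s->value != QUNBOUND; }
int sym_fboundp(const struct symbol *s) { return s->function != QUNBOUND; }

/* Analogue of `set': the sentinel itself must never be stored as a value. */
void sym_set(struct symbol *s, Obj v)
{
    assert(v != QUNBOUND);
    s->value = v;
}

/* Analogue of `makunbound': restore the initial condition. */
void sym_makunbound(struct symbol *s)
{
    s->value = QUNBOUND;
}
```

The assertion in `sym_set' mirrors the rule that the marker must not escape to code that treats it as an ordinary value.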
138 Other symbol-value-magic objects are used as special markers to
139 indicate variables that have non-normal properties. This includes any
140 variables that are tied into C variables (setting the variable magically
141 sets some global variable in the C code, and likewise for retrieving the
142 variable's value), variables that magically tie into slots in the
143 current buffer, variables that are buffer-local, etc. The
144 symbol-value-magic object is stored in the value cell in place of a
145 normal object, and the code to retrieve a symbol's value (i.e.
`symbol-value') knows how to do special things with them. This means
that you should not just fetch the value cell directly if you want a
symbol's value.
150 The exact workings of this are rather complex and involved and are
151 well-documented in comments in `buffer.c', `symbols.c', and `lisp.h'.
154 File: internals.info, Node: Buffers and Textual Representation, Next: MULE Character Sets and Encodings, Prev: Symbols and Variables, Up: Top
Buffers and Textual Representation
**********************************
* Introduction to Buffers::     A buffer holds a block of text such as a file.
* The Text in a Buffer::        Representation of the text in a buffer.
* Buffer Lists::                Keeping track of all buffers.
* Markers and Extents::         Tagging locations within a buffer.
* Bufbytes and Emchars::        Representation of individual characters.
* The Buffer Object::           The Lisp object corresponding to a buffer.
169 File: internals.info, Node: Introduction to Buffers, Next: The Text in a Buffer, Up: Buffers and Textual Representation
Introduction to Buffers
=======================
174 A buffer is logically just a Lisp object that holds some text. In
175 this, it is like a string, but a buffer is optimized for frequent
176 insertion and deletion, while a string is not. Furthermore:
1. Buffers are "permanent" objects, i.e. once you create them, they
   remain around, and need to be explicitly deleted before they go
   away.
2. Each buffer has a unique name, which is a string. Buffers are
   normally referred to by name. In this respect, they are like
   symbols.
186 3. Buffers have a default insertion position, called "point".
187 Inserting text (unless you explicitly give a position) goes at
188 point, and moves point forward past the text. This is what is
189 going on when you type text into Emacs.
191 4. Buffers have lots of extra properties associated with them.
193 5. Buffers can be "displayed". What this means is that there exist a
194 number of "windows", which are objects that correspond to some
195 visible section of your display, and each window has an associated
196 buffer, and the current contents of the buffer are shown in that
197 section of the display. The redisplay mechanism (which takes care
198 of doing this) knows how to look at the text of a buffer and come
199 up with some reasonable way of displaying this. Many of the
200 properties of a buffer control how the buffer's text is displayed.
202 6. One buffer is distinguished and called the "current buffer". It is
203 stored in the variable `current_buffer'. Buffer operations operate
204 on this buffer by default. When you are typing text into a
205 buffer, the buffer you are typing into is always `current_buffer'.
206 Switching to a different window changes the current buffer. Note
207 that Lisp code can temporarily change the current buffer using
208 `set-buffer' (often enclosed in a `save-excursion' so that the
209 former current buffer gets restored when the code is finished).
210 However, calling `set-buffer' will NOT cause a permanent change in
211 the current buffer. The reason for this is that the top-level
212 event loop sets `current_buffer' to the buffer of the selected
213 window, each time it finishes executing a user command.
215 Make sure you understand the distinction between "current buffer"
216 and "buffer of the selected window", and the distinction between
217 "point" of the current buffer and "window-point" of the selected
218 window. (This latter distinction is explained in detail in the section
222 File: internals.info, Node: The Text in a Buffer, Next: Buffer Lists, Prev: Introduction to Buffers, Up: Buffers and Textual Representation
227 The text in a buffer consists of a sequence of zero or more
228 characters. A "character" is an integer that logically represents a
229 letter, number, space, or other unit of text. Most of the characters
230 that you will typically encounter belong to the ASCII set of characters,
231 but there are also characters for various sorts of accented letters,
232 special symbols, Chinese and Japanese ideograms (i.e. Kanji, Katakana,
233 etc.), Cyrillic and Greek letters, etc. The actual number of possible
234 characters is quite large.
236 For now, we can view a character as some non-negative integer that
237 has some shape that defines how it typically appears (e.g. as an
238 uppercase A). (The exact way in which a character appears depends on the
239 font used to display the character.) The internal type of characters in
240 the C code is an `Emchar'; this is just an `int', but using a symbolic
241 type makes the code clearer.
243 Between every character in a buffer is a "buffer position" or
244 "character position". We can speak of the character before or after a
245 particular buffer position, and when you insert a character at a
246 particular position, all characters after that position end up at new
247 positions. When we speak of the character "at" a position, we really
248 mean the character after the position. (This schizophrenia between a
249 buffer position being "between" a character and "on" a character is
252 Buffer positions are numbered starting at 1. This means that
253 position 1 is before the first character, and position 0 is not valid.
254 If there are N characters in a buffer, then buffer position N+1 is
255 after the last one, and position N+2 is not valid.
257 The internal makeup of the Emchar integer varies depending on whether
258 we have compiled with MULE support. If not, the Emchar integer is an
259 8-bit integer with possible values from 0 - 255. 0 - 127 are the
260 standard ASCII characters, while 128 - 255 are the characters from the
261 ISO-8859-1 character set. If we have compiled with MULE support, an
262 Emchar is a 19-bit integer, with the various bits having meanings
263 according to a complex scheme that will be detailed later. The
264 characters numbered 0 - 255 still have the same meanings as for the
265 non-MULE case, though.
267 Internally, the text in a buffer is represented in a fairly simple
268 fashion: as a contiguous array of bytes, with a "gap" of some size in
269 the middle. Although the gap is of some substantial size in bytes,
270 there is no text contained within it: From the perspective of the text
271 in the buffer, it does not exist. The gap logically sits at some buffer
272 position, between two characters (or possibly at the beginning or end of
273 the buffer). Insertion of text in a buffer at a particular position is
274 always accomplished by first moving the gap to that position (i.e.
275 through some block moving of text), then writing the text into the
276 beginning of the gap, thereby shrinking the gap. If the gap shrinks
277 down to nothing, a new gap is created. (What actually happens is that a
278 new gap is "created" at the end of the buffer's text, which requires
279 nothing more than changing a couple of indices; then the gap is "moved"
280 to the position where the insertion needs to take place by moving up in
281 memory all the text after that position.) Similarly, deletion occurs
282 by moving the gap to the place where the text is to be deleted, and
283 then simply expanding the gap to include the deleted text.
("Expanding" and "shrinking" the gap as just described simply means that
the internal indices that keep track of where the gap is located are
adjusted.)
Note that the total amount of memory allocated for a buffer's text
never decreases while the buffer is live. Therefore, if you load up a
20-megabyte file and then delete all but one character, there will be a
20-megabyte gap, which won't get any smaller (except by inserting
characters back again). Once the buffer is killed, the memory allocated
for the buffer text will be freed, but it will still be sitting on the
heap, taking up virtual memory, and will not be released back to the
operating system. (If you have compiled XEmacs with rel-alloc, the
situation is different: the space _will_ be released back to the
operating system, although this tends to result in a noticeable speed
penalty.)
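The gap mechanism described above can be sketched in C as follows. This is a minimal illustration, not the code in `insdel.c': `struct gapbuf' and the `gb_*' functions are hypothetical names, offsets here are 0-based for brevity (XEmacs positions start at 1), and the amount of slack added when the gap is regrown is arbitrary.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct gapbuf {
    unsigned char *mem;   /* text before the gap, the gap, text after the gap */
    size_t gap_start;     /* offset in mem of the first gap byte */
    size_t gap_size;      /* number of bytes in the gap */
    size_t text_len;      /* bytes of real text, not counting the gap */
};

void gb_init(struct gapbuf *b, size_t initial_gap)
{
    b->mem = malloc(initial_gap ? initial_gap : 1);
    assert(b->mem != NULL);
    b->gap_start = 0;
    b->gap_size = initial_gap;
    b->text_len = 0;
}

/* Move the gap so that it sits at text offset POS, by block-moving text. */
static void gb_move_gap(struct gapbuf *b, size_t pos)
{
    assert(pos <= b->text_len);
    if (pos < b->gap_start)        /* slide text between POS and the gap upward */
        memmove(b->mem + pos + b->gap_size, b->mem + pos, b->gap_start - pos);
    else if (pos > b->gap_start)   /* slide text just after the gap downward */
        memmove(b->mem + b->gap_start,
                b->mem + b->gap_start + b->gap_size, pos - b->gap_start);
    b->gap_start = pos;
}

/* Make sure the gap can hold NEEDED bytes, enlarging the allocation if not. */
static void gb_grow_gap(struct gapbuf *b, size_t needed)
{
    if (b->gap_size >= needed)
        return;
    size_t extra = needed - b->gap_size + 64;
    b->mem = realloc(b->mem, b->text_len + b->gap_size + extra);
    assert(b->mem != NULL);
    /* The new space appears at the end; move the tail up to merge it into the gap. */
    memmove(b->mem + b->gap_start + b->gap_size + extra,
            b->mem + b->gap_start + b->gap_size, b->text_len - b->gap_start);
    b->gap_size += extra;
}

/* Insert LEN bytes at POS: move the gap there, then fill its beginning. */
void gb_insert(struct gapbuf *b, size_t pos, const void *text, size_t len)
{
    gb_move_gap(b, pos);
    gb_grow_gap(b, len);
    memcpy(b->mem + b->gap_start, text, len);
    b->gap_start += len;
    b->gap_size -= len;
    b->text_len += len;
}

/* Delete LEN bytes at POS: move the gap there, then let it swallow them. */
void gb_delete(struct gapbuf *b, size_t pos, size_t len)
{
    assert(pos + len <= b->text_len);
    gb_move_gap(b, pos);
    b->gap_size += len;
    b->text_len -= len;
}

/* Fetch the byte at text offset POS, skipping over the gap. */
unsigned char gb_byte_at(const struct gapbuf *b, size_t pos)
{
    assert(pos < b->text_len);
    return b->mem[pos < b->gap_start ? pos : pos + b->gap_size];
}
```

Note that `gb_delete' never shrinks the allocation, which matches the observation above that a buffer's memory does not decrease while the buffer is live.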
300 Astute readers may notice that the text in a buffer is represented as
301 an array of _bytes_, while (at least in the MULE case) an Emchar is a
302 19-bit integer, which clearly cannot fit in a byte. This means (of
303 course) that the text in a buffer uses a different representation from
304 an Emchar: specifically, the 19-bit Emchar becomes a series of one to
305 four bytes. The conversion between these two representations is complex
306 and will be described later.
308 In the non-MULE case, everything is very simple: An Emchar is an
309 8-bit value, which fits neatly into one byte.
311 If we are given a buffer position and want to retrieve the character
312 at that position, we need to follow these steps:
1. Pretend there's no gap, and convert the buffer position into a
   "byte index" that indexes to the appropriate byte in the buffer's
   stream of textual bytes. By convention, byte indices begin at 1,
   just like buffer positions. In the non-MULE case, byte indices
   and buffer positions are identical, since one character equals one
   byte.
2. Convert the byte index into a "memory index", which takes the gap
   into account. The memory index is a direct index into the block of
   memory that stores the text of a buffer. This basically just
   involves checking to see if the byte index is past the gap, and if
   so, adding the size of the gap to it. By convention, memory
   indices begin at 1, just like buffer positions and byte indices,
   and when referring to the position that is "at" the gap, we always
   use the memory position at the _beginning_, not at the end, of the
   gap.
3. Fetch the appropriate bytes at the determined memory position.
4. Convert these bytes into an Emchar.
In the non-MULE case, (3) and (4) boil down to a simple one-byte
memory access.
338 Note that we have defined three types of positions in a buffer:
340 1. "buffer positions" or "character positions", typedef `Bufpos'
342 2. "byte indices", typedef `Bytind'
344 3. "memory indices", typedef `Memind'
346 All three typedefs are just `int's, but defining them this way makes
347 things a lot clearer.
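For the non-MULE case, the chain from buffer position to character can be sketched in C as below. The typedef names come from the text; everything else (`struct buffer_text', its fields, and the function names) is hypothetical and chosen only for this example. One subtlety the sketch makes explicit: a position exactly at the gap converts to the memory index at the beginning of the gap, but the character "at" that position is the one after it, so the fetch must read from just past the gap.

```c
#include <assert.h>

typedef int Bufpos;              /* character position, starts at 1 */
typedef int Bytind;              /* byte index, starts at 1 */
typedef int Memind;              /* memory index, starts at 1 */
typedef int Emchar;
typedef unsigned char Bufbyte;

struct buffer_text {
    const Bufbyte *mem;          /* mem[0] holds memory index 1 */
    Bytind gap_start;            /* byte index at which the gap sits */
    int gap_size;                /* size of the gap in bytes */
    int text_bytes;              /* bytes of text, not counting the gap */
};

/* Step 1, non-MULE only: one character is one byte. */
Bytind bufpos_to_bytind(Bufpos pos)
{
    return (Bytind) pos;
}

/* Step 2: add the gap size only if the index lies past the gap. */
Memind bytind_to_memind(const struct buffer_text *t, Bytind bi)
{
    assert(bi >= 1 && bi <= t->text_bytes + 1);
    return bi > t->gap_start ? bi + t->gap_size : bi;
}

/* Steps 3 and 4, non-MULE only: a single byte fetch yields the Emchar. */
Emchar char_at(const struct buffer_text *t, Bufpos pos)
{
    Bytind bi = bufpos_to_bytind(pos);
    assert(bi >= 1 && bi <= t->text_bytes);
    /* The byte after position BI lives past the gap when BI is at the gap. */
    Memind mi = bi >= t->gap_start ? bi + t->gap_size : bi;
    return (Emchar) t->mem[mi - 1];
}
```

With MULE support, step 1 would need a real character-to-byte conversion and step 4 would decode one to four bytes; neither is attempted here.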
349 Most code works with buffer positions. In particular, all Lisp code
350 that refers to text in a buffer uses buffer positions. Lisp code does
351 not know that byte indices or memory indices exist.
353 Finally, we have a typedef for the bytes in a buffer. This is a
354 `Bufbyte', which is an unsigned char. Referring to them as Bufbytes
355 underscores the fact that we are working with a string of bytes in the
356 internal Emacs buffer representation rather than in one of a number of
357 possible alternative representations (e.g. EUC-encoded text, etc.).
360 File: internals.info, Node: Buffer Lists, Next: Markers and Extents, Prev: The Text in a Buffer, Up: Buffers and Textual Representation
Recall that buffers are "permanent" objects, i.e. that they
remain around until explicitly deleted. This requires keeping a
list of all the buffers in existence. This list is actually an
assoc-list (mapping from the buffer's name to the buffer) and is stored
in the global variable `Vbuffer_alist'.
371 The order of the buffers in the list is important: the buffers are
372 ordered approximately from most-recently-used to least-recently-used.
373 Switching to a buffer using `switch-to-buffer', `pop-to-buffer', etc.
374 and switching windows using `other-window', etc. usually brings the
375 new current buffer to the front of the list. `switch-to-buffer',
376 `other-buffer', etc. look at the beginning of the list to find an
377 alternative buffer to suggest. You can also explicitly move a buffer
378 to the end of the list using `bury-buffer'.
In addition to the global ordering in `Vbuffer_alist', each frame
has its own ordering of the list. These lists always contain the same
elements as in `Vbuffer_alist' although possibly in a different order.
`buffer-list' normally returns the list for the selected frame. This
allows you to work in separate frames without things interfering with
each other.
387 The standard way to look up a buffer given a name is `get-buffer',
388 and the standard way to create a new buffer is `get-buffer-create',
389 which looks up a buffer with a given name, creating a new one if
390 necessary. These operations correspond exactly with the symbol
391 operations `intern-soft' and `intern', respectively. You can also
392 force a new buffer to be created using `generate-new-buffer', which
393 takes a name and (if necessary) makes a unique name from this by
394 appending a number, and then creates the buffer. This is basically
395 like the symbol operation `gensym'.
398 File: internals.info, Node: Markers and Extents, Next: Bufbytes and Emchars, Prev: Buffer Lists, Up: Buffers and Textual Representation
Some of the objects associated with a buffer are logically attached
to particular buffer positions. Such objects can be used to
keep track of a buffer position when text is inserted and deleted, so
that it remains at the same spot relative to the text around it; to
assign properties to particular sections of text; etc. There are two
such objects that are useful in this regard: "markers" and
"extents".
411 A "marker" is simply a flag placed at a particular buffer position,
412 which is moved around as text is inserted and deleted. Markers are
413 used for all sorts of purposes, such as the `mark' that is the other
414 end of textual regions to be cut, copied, etc.
416 An "extent" is similar to two markers plus some associated
417 properties, and is used to keep track of regions in a buffer as text is
418 inserted and deleted, and to add properties (e.g. fonts) to particular
419 regions of text. The external interface of extents is explained
422 The important thing here is that markers and extents simply contain
423 buffer positions in them as integers, and every time text is inserted or
424 deleted, these positions must be updated. In order to minimize the
425 amount of shuffling that needs to be done, the positions in markers and
426 extents (there's one per marker, two per extent) are stored in Meminds.
427 This means that they only need to be moved when the text is physically
428 moved in memory; since the gap structure tries to minimize this, it also
429 minimizes the number of marker and extent indices that need to be
430 adjusted. Look in `insdel.c' for the details of how this works.
432 One other important distinction is that markers are "temporary"
433 while extents are "permanent". This means that markers disappear as
434 soon as there are no more pointers to them, and correspondingly, there
435 is no way to determine what markers are in a buffer if you are just
436 given the buffer. Extents remain in a buffer until they are detached
437 (which could happen as a result of text being deleted) or the buffer is
438 deleted, and primitives do exist to enumerate the extents in a buffer.
441 File: internals.info, Node: Bufbytes and Emchars, Next: The Buffer Object, Prev: Markers and Extents, Up: Buffers and Textual Representation
449 File: internals.info, Node: The Buffer Object, Prev: Bufbytes and Emchars, Up: Buffers and Textual Representation
Buffers contain fields not directly accessible by the Lisp
programmer. We describe them here, naming them by the names used in
the C code. Many are accessible indirectly in Lisp programs via Lisp
primitives.
The buffer name is a string that names the buffer. It is
guaranteed to be unique. *Note Buffer Names: (lispref)Buffer
Names.
465 This field contains the time when the buffer was last saved, as an
466 integer. *Note Buffer Modification: (lispref)Buffer Modification.
This field contains the modification time of the visited file. It
is set when the file is written or read. Every time the buffer is
written to the file, this field is compared to the modification
time of the file. *Note Buffer Modification: (lispref)Buffer
Modification.
476 This field contains the time when the buffer was last auto-saved.
479 This field contains the `window-start' position in the buffer as of
480 the last time the buffer was displayed in a window.
This field points to the buffer's undo list. *Note Undo:
(lispref)Undo.
487 This field contains the syntax table for the buffer. *Note Syntax
488 Tables: (lispref)Syntax Tables.
491 This field contains the conversion table for converting text to
492 lower case. *Note Case Tables: (lispref)Case Tables.
495 This field contains the conversion table for converting text to
496 upper case. *Note Case Tables: (lispref)Case Tables.
499 This field contains the conversion table for canonicalizing text
500 for case-folding search. *Note Case Tables: (lispref)Case Tables.
503 This field contains the equivalence table for case-folding search.
504 *Note Case Tables: (lispref)Case Tables.
507 This field contains the buffer's display table, or `nil' if it
508 doesn't have one. *Note Display Tables: (lispref)Display Tables.
511 This field contains the chain of all markers that currently point
512 into the buffer. Deletion of text in the buffer, and motion of
513 the buffer's gap, must check each of these markers and perhaps
514 update it. *Note Markers: (lispref)Markers.
517 This field is a flag that tells whether a backup file has been
518 made for the visited file of this buffer.
This field contains the mark for the buffer. The mark is a marker,
hence it is also included on the list `markers'. *Note The Mark:
(lispref)The Mark.
526 This field is non-`nil' if the buffer's mark is active.
529 This field contains the association list describing the variables
530 local in this buffer, and their values, with the exception of
531 local variables that have special slots in the buffer object.
532 (Those slots are omitted from this table.) *Note Buffer-Local
533 Variables: (lispref)Buffer-Local Variables.
536 This field contains a Lisp object which controls how to display
537 the mode line for this buffer. *Note Modeline Format:
538 (lispref)Modeline Format.
This field holds the buffer's base buffer (if it is an indirect
buffer).
545 File: internals.info, Node: MULE Character Sets and Encodings, Next: The Lisp Reader and Compiler, Prev: Buffers and Textual Representation, Up: Top
MULE Character Sets and Encodings
*********************************
550 Recall that there are two primary ways that text is represented in
551 XEmacs. The "buffer" representation sees the text as a series of bytes
552 (Bufbytes), with a variable number of bytes used per character. The
553 "character" representation sees the text as a series of integers
554 (Emchars), one per character. The character representation is a cleaner
555 representation from a theoretical standpoint, and is thus used in many
556 cases when lots of manipulations on a string need to be done. However,
557 the buffer representation is the standard representation used in both
558 Lisp strings and buffers, and because of this, it is the "default"
559 representation that text comes in. The reason for using this
560 representation is that it's compact and is compatible with ASCII.
566 * Internal Mule Encodings::
570 File: internals.info, Node: Character Sets, Next: Encodings, Up: MULE Character Sets and Encodings
575 A character set (or "charset") is an ordered set of characters. A
576 particular character in a charset is indexed using one or more
577 "position codes", which are non-negative integers. The number of
578 position codes needed to identify a particular character in a charset is
579 called the "dimension" of the charset. In XEmacs/Mule, all charsets
580 have dimension 1 or 2, and the size of all charsets (except for a few
581 special cases) is either 94, 96, 94 by 94, or 96 by 96. The range of
582 position codes used to index characters from any of these types of
583 character sets is as follows:
Charset type            Position code 1         Position code 2
------------------------------------------------------------
94                      33 - 126                N/A
96                      32 - 127                N/A
94x94                   33 - 126                33 - 126
96x96                   32 - 127                32 - 127
Note that in the above cases position codes do not start at an
expected value such as 0 or 1. The reason for this will become clear
later.
596 For example, Latin-1 is a 96-character charset, and JISX0208 (the
597 Japanese national character set) is a 94x94-character charset.
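As a small illustration of dimensions and position-code ranges, the following C sketch checks whether a set of position codes is valid for a given charset type. The enum and function names are hypothetical, and the sketch assumes that a dimension-1 charset of size 94 or 96 uses the same range for its single position code as the corresponding two-dimensional type does for each of its codes.

```c
#include <assert.h>

enum charset_type { CS_94, CS_96, CS_94x94, CS_96x96 };

/* Number of position codes needed to identify a character. */
int charset_dimension(enum charset_type t)
{
    return (t == CS_94 || t == CS_96) ? 1 : 2;
}

/* Is PC inside the range allowed for one position code of type T? */
static int position_code_in_range(enum charset_type t, int pc)
{
    assert(t >= CS_94 && t <= CS_96x96);
    if (t == CS_94 || t == CS_94x94)
        return pc >= 33 && pc <= 126;    /* 94 values */
    return pc >= 32 && pc <= 127;        /* 96 values */
}

/* Validate position codes; PC2 is ignored for dimension-1 charsets. */
int position_codes_valid(enum charset_type t, int pc1, int pc2)
{
    if (!position_code_in_range(t, pc1))
        return 0;
    return charset_dimension(t) == 1 || position_code_in_range(t, pc2);
}
```

For example, Latin-1 would be checked as `CS_96' and JISX0208 as `CS_94x94'. A code passing this check may still name an empty slot in a particular charset.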
[Note that, although the ranges above define the _valid_ position
codes for a charset, some of the slots in a particular charset may in
fact be empty. This is the case for JISX0208, for example, where (e.g.)
all the slots whose first position code is in the range 118 - 127 are
empty.]
605 There are three charsets that do not follow the above rules. All of
606 them have one dimension, and have ranges of position codes as follows:
Charset name            Position code 1
------------------------------------
ASCII                   0 - 127
Control-1               0 - 31
Composite               0 - some large number
614 (The upper bound of the position code for composite characters has
615 not yet been determined, but it will probably be at least 16,383).
ASCII is the union of two subsidiary character sets: Printing-ASCII
(the printing ASCII character set, consisting of position codes 33 -
126, like for a standard 94-character charset) and Control-ASCII (the
non-printing characters that would appear in a binary file with codes 0
- 32 and 127).
623 Control-1 contains the non-printing characters that would appear in a
624 binary file with codes 128 - 159.
626 Composite contains characters that are generated by overstriking one
627 or more characters from other charsets.
Note that some characters in ASCII, and all characters in Control-1,
are "control" (non-printing) characters. These have no printed
representation but instead control some other function of the printing
(e.g. TAB, code 9, moves the current character position to the next tab
stop). All other characters in all charsets are "graphic" (printing)
characters.
636 When a binary file is read in, the bytes in the file are assigned to
637 character sets as follows:
Bytes             Character set           Range
--------------------------------------------------
0 - 127           ASCII                   0 - 127
128 - 159         Control-1               0 - 31
160 - 255         Latin-1                 32 - 127
645 This is a bit ad-hoc but gets the job done.
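The byte-to-charset assignment in the table above amounts to a range test plus an offset. A minimal C sketch, using hypothetical names (`classify_byte', `struct charset_char'), might look like this:

```c
#include <assert.h>

enum file_charset { FC_ASCII, FC_CONTROL_1, FC_LATIN_1 };

struct charset_char {
    enum file_charset charset;
    int position_code;
};

/* Map one byte of a binary file to a charset and a position code. */
struct charset_char classify_byte(int byte)
{
    struct charset_char c;
    assert(byte >= 0 && byte <= 255);
    if (byte <= 127) {
        c.charset = FC_ASCII;           /* bytes 0 - 127 map to 0 - 127 */
        c.position_code = byte;
    } else if (byte <= 159) {
        c.charset = FC_CONTROL_1;       /* bytes 128 - 159 map to 0 - 31 */
        c.position_code = byte - 128;
    } else {
        c.charset = FC_LATIN_1;         /* bytes 160 - 255 map to 32 - 127 */
        c.position_code = byte - 128;
    }
    return c;
}
```

Subtracting 128 serves both upper ranges, which is why Latin-1 position codes start at 32 rather than 0.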
648 File: internals.info, Node: Encodings, Next: Internal Mule Encodings, Prev: Character Sets, Up: MULE Character Sets and Encodings
653 An "encoding" is a way of numerically representing characters from
654 one or more character sets. If an encoding only encompasses one
655 character set, then the position codes for the characters in that
656 character set could be used directly. This is not possible, however, if
657 more than one character set is to be used in the encoding.
For example, the conversion detailed above between bytes in a binary
file and characters is effectively an encoding that encompasses the
three character sets ASCII, Control-1, and Latin-1 in a stream of 8-bit
bytes.
664 Thus, an encoding can be viewed as a way of encoding characters from
665 a specified group of character sets using a stream of bytes, each of
666 which contains a fixed number of bits (but not necessarily 8, as in the
667 common usage of "byte").
669 Here are descriptions of a couple of common encodings:
673 * Japanese EUC (Extended Unix Code)::
677 File: internals.info, Node: Japanese EUC (Extended Unix Code), Next: JIS7, Up: Encodings
Japanese EUC (Extended Unix Code)
---------------------------------
This encompasses the character sets Printing-ASCII,
Japanese-JISX0208, and Japanese-JISX0201-Kana (half-width katakana, the
right half of JISX0201). It uses 8-bit bytes.
686 Note that Printing-ASCII and Japanese-JISX0201-Kana are 94-character
687 charsets, while Japanese-JISX0208 is a 94x94-character charset.
689 The encoding is as follows:
Character set             Representation (PC=position-code)
-------------             --------------
Printing-ASCII            PC1
Japanese-JISX0201-Kana    0x8E       | PC1 + 0x80
Japanese-JISX0208         PC1 + 0x80 | PC2 + 0x80
Japanese-JISX0212         0x8F       | PC1 + 0x80 | PC2 + 0x80
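A C sketch of an encoder for this scheme follows. The names (`euc_encode', the enum constants) are hypothetical. Printing-ASCII is emitted as its bare position code, and Japanese-JISX0212 is given the 0x8F prefix that standard EUC-JP uses for it, so that its bytes cannot be confused with those of Japanese-JISX0208; treat both points as assumptions of the example.

```c
#include <assert.h>
#include <stddef.h>

enum euc_charset { EUC_ASCII, EUC_JISX0201_KANA, EUC_JISX0208, EUC_JISX0212 };

/* Encode one character into OUT (room for 3 bytes needed).
   Returns the number of bytes written.  PC2 is ignored for the
   dimension-1 charsets. */
size_t euc_encode(enum euc_charset cs, int pc1, int pc2, unsigned char *out)
{
    size_t n = 0;
    assert(pc1 >= 33 && pc1 <= 126);
    switch (cs) {
    case EUC_ASCII:
        out[n++] = (unsigned char) pc1;
        break;
    case EUC_JISX0201_KANA:
        out[n++] = 0x8E;                         /* single-shift prefix */
        out[n++] = (unsigned char) (pc1 + 0x80);
        break;
    case EUC_JISX0208:
        assert(pc2 >= 33 && pc2 <= 126);
        out[n++] = (unsigned char) (pc1 + 0x80);
        out[n++] = (unsigned char) (pc2 + 0x80);
        break;
    case EUC_JISX0212:
        assert(pc2 >= 33 && pc2 <= 126);
        out[n++] = 0x8F;                         /* prefix as in standard EUC-JP */
        out[n++] = (unsigned char) (pc1 + 0x80);
        out[n++] = (unsigned char) (pc2 + 0x80);
        break;
    }
    return n;
}
```

Because every non-ASCII byte has its high bit set, a decoder can tell from each byte alone whether it belongs to an ASCII character; no state needs to be carried between characters.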
699 File: internals.info, Node: JIS7, Prev: Japanese EUC (Extended Unix Code), Up: Encodings
704 This encompasses the character sets Printing-ASCII,
705 Japanese-JISX0201-Roman (the left half of JISX0201; this character set
706 is very similar to Printing-ASCII and is a 94-character charset),
707 Japanese-JISX0208, and Japanese-JISX0201-Kana. It uses 7-bit bytes.
709 Unlike Japanese EUC, this is a "modal" encoding, which means that
710 there are multiple states that the encoding can be in, which affect how
711 the bytes are to be interpreted. Special sequences of bytes (called
712 "escape sequences") are used to change states.
714 The encoding is as follows:
716 Character set             Representation (PC=position-code)
717 -------------             --------------
718 Printing-ASCII            PC1
719 Japanese-JISX0201-Roman   PC1
720 Japanese-JISX0201-Kana    PC1
721 Japanese-JISX0208         PC1 PC2
724 Escape sequence   ASCII equivalent   Meaning
725 ---------------   ----------------   -------
726 0x1B 0x28 0x4A    ESC ( J            invoke Japanese-JISX0201-Roman
727 0x1B 0x28 0x49    ESC ( I            invoke Japanese-JISX0201-Kana
728 0x1B 0x24 0x42    ESC $ B            invoke Japanese-JISX0208
729 0x1B 0x28 0x42    ESC ( B            invoke Printing-ASCII
731 Initially, Printing-ASCII is invoked.
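The modal behavior can be sketched in Python as a small encoder that remembers which character set is currently invoked and emits an escape sequence only when the set changes. The names `ESCAPES` and `jis7_encode` are hypothetical, and switching back to Printing-ASCII at the end of the text is an assumption of this sketch rather than something stated above.

```python
# Escape sequences from the table above, keyed by charset name.
ESCAPES = {
    "Japanese-JISX0201-Roman": b"\x1b(J",
    "Japanese-JISX0201-Kana": b"\x1b(I",
    "Japanese-JISX0208": b"\x1b$B",
    "Printing-ASCII": b"\x1b(B",
}


def jis7_encode(chars):
    """Encode a sequence of (charset, position_codes) pairs as JIS7 bytes."""
    out = bytearray()
    state = "Printing-ASCII"          # initially, Printing-ASCII is invoked
    for charset, pcs in chars:
        if charset != state:          # change state only when necessary
            out += ESCAPES[charset]
            state = charset
        out += bytes(pcs)             # position codes are written unchanged
    if state != "Printing-ASCII":     # assumption: leave the stream in ASCII
        out += ESCAPES["Printing-ASCII"]
    return bytes(out)


text = [
    ("Printing-ASCII", (0x41,)),            # "A"
    ("Japanese-JISX0208", (0x24, 0x22)),    # hiragana "a"
    ("Printing-ASCII", (0x42,)),            # "B"
]
print(jis7_encode(text))   # b'A\x1b$B$"\x1b(BB'
```

Note that the two bytes `$"` in the output are ordinary 7-bit values; only the preceding escape sequence tells a reader to interpret them as one Japanese-JISX0208 character.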
734 File: internals.info, Node: Internal Mule Encodings, Next: CCL, Prev: Encodings, Up: MULE Character Sets and Encodings
736 Internal Mule Encodings
737 =======================
739 In XEmacs/Mule, each character set is assigned a unique number,
740 called a "leading byte". This is used in the encodings of a character.
741 Leading bytes are in the range 0x80 - 0xFF (except for ASCII, which has
742 a leading byte of 0), although some leading bytes are reserved.
744 Charsets whose leading byte is in the range 0x80 - 0x9F are called
745 "official" and are used for built-in charsets. Other charsets are
746 called "private" and have leading bytes in the range 0xA0 - 0xFF; these
747 are user-defined charsets.
751 Character set           Leading byte
752 -------------           ------------
755 Dimension-1 Official    0x81 - 0x8D
758 Dimension-2 Official    0x90 - 0x99
759                           (0x9A - 0x9D are free;
760                            0x9E and 0x9F are reserved)
761 Dimension-1 Private     0xA0 - 0xEF
762 Dimension-2 Private     0xF0 - 0xFF
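A minimal Python sketch of how a leading byte could be classified using only the ranges listed above. The function name `classify_leading_byte` and the returned strings are hypothetical; values that the table does not cover are reported as such rather than guessed.

```python
def classify_leading_byte(lb):
    """Return a description of leading byte LB based on the ranges above."""
    if lb == 0:
        return "ASCII"
    if 0x81 <= lb <= 0x8D:
        return "dimension-1 official"
    if 0x90 <= lb <= 0x99:
        return "dimension-2 official"
    if 0x9A <= lb <= 0x9D:
        return "free"
    if lb in (0x9E, 0x9F):
        return "reserved"
    if 0xA0 <= lb <= 0xEF:
        return "dimension-1 private"
    if 0xF0 <= lb <= 0xFF:
        return "dimension-2 private"
    return "not covered by the ranges above"


print(classify_leading_byte(0x92))   # dimension-2 official
print(classify_leading_byte(0xF3))   # dimension-2 private
```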
764 There are two internal encodings for characters in XEmacs/Mule. One
765 is called "string encoding" and is an 8-bit encoding that is used for
766 representing characters in a buffer or string. It uses 1 to 4 bytes per
767 character. The other is called "character encoding" and is a 19-bit
768 encoding that is used for representing characters individually in a variable.
771 (In the following descriptions, we'll ignore composite characters for
772 the moment. We also give a general (structural) overview first,
773 followed later by the exact details.)
777 * Internal String Encoding::
778 * Internal Character Encoding::
781 File: internals.info, Node: Internal String Encoding, Next: Internal Character Encoding, Up: Internal Mule Encodings
783 Internal String Encoding
784 ------------------------
786 ASCII characters are encoded using their position code directly.
787 Other characters are encoded using their leading byte followed by their
788 position code(s) with the high bit set. Characters in private character
789 sets have their leading byte prefixed with a "leading byte prefix",
790 which is either 0x9E or 0x9F. (No character sets are ever assigned these
791 leading bytes.) Specifically:
793 Character set           Encoding (PC=position-code, LB=leading-byte)
794 -------------           --------
795 ASCII                   PC1
796 Control-1               LB   | PC1 + 0xA0
797 Dimension-1 official    LB   | PC1 + 0x80
798 Dimension-1 private     0x9E | LB         | PC1 + 0x80
799 Dimension-2 official    LB   | PC1 + 0x80 | PC2 + 0x80
800 Dimension-2 private     0x9F | LB         | PC1 + 0x80 | PC2 + 0x80
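The table can be read as a simple recipe. The Python sketch below follows it row by row; the function name `string_encode` and the `kind` strings are hypothetical, and the leading bytes used in the examples are arbitrary values taken from the appropriate ranges, not the leading bytes of any particular charset.

```python
def string_encode(kind, lb, pcs):
    """Encode one character in the internal string encoding.

    kind -- which row of the table applies (hypothetical string labels)
    lb   -- leading byte of the charset (ignored for ASCII)
    pcs  -- tuple of position codes
    """
    high = [pc + 0x80 for pc in pcs]           # position codes, high bit set
    if kind == "ascii":
        return bytes(pcs)                      # position code used directly
    if kind == "control-1":
        return bytes([lb, pcs[0] + 0xA0])
    if kind in ("dimension-1 official", "dimension-2 official"):
        return bytes([lb] + high)
    if kind == "dimension-1 private":
        return bytes([0x9E, lb] + high)        # leading byte prefix 0x9E
    if kind == "dimension-2 private":
        return bytes([0x9F, lb] + high)        # leading byte prefix 0x9F
    raise ValueError("unknown kind: %r" % (kind,))


print(string_encode("dimension-2 official", 0x91, (0x24, 0x22)).hex())  # 91a4a2
print(string_encode("dimension-1 private", 0xA0, (0x41,)).hex())        # 9ea0c1
```

In every case the first byte is below 0xA0 and all later bytes are 0xA0 or above, which is the property that the rest of this node relies on.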
802 The basic characteristic of this encoding is that the first byte of
803 all characters is in the range 0x00 - 0x9F, and the second and
804 following bytes of all characters are in the range 0xA0 - 0xFF. This
805 means that it is impossible to get out of sync, or more specifically:
807 1. Given any byte position, the beginning of the character it is
808 within can be determined in constant time.
810 2. Given any byte position at the beginning of a character, the
811 beginning of the next character can be determined in constant time.
813 3. Given any byte position at the beginning of a character, the
814 beginning of the previous character can be determined in constant time.
817 4. Textual searches can simply treat encoded strings as if they were
818 encoded in a one-byte-per-character fashion rather than the actual
818 multi-byte encoding.
821 None of the standard non-modal encodings meet all of these
822 conditions. For example, EUC satisfies only (2) and (3), while
823 Shift-JIS and Big5 (not yet described) satisfy only (2). (All non-modal
824 encodings must satisfy (2), in order to be unambiguous.)
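The first three properties follow directly from the byte ranges: a byte below 0xA0 always starts a character, and a byte of 0xA0 or above never does. A Python sketch under that assumption, with hypothetical helper names `char_start`, `next_char`, and `prev_char`:

```python
def char_start(buf, pos):
    """Return the index of the first byte of the character containing POS."""
    while buf[pos] >= 0xA0:       # non-initial bytes are 0xA0 - 0xFF
        pos -= 1
    return pos


def next_char(buf, pos):
    """POS is at the start of a character; return the start of the next."""
    pos += 1
    while pos < len(buf) and buf[pos] >= 0xA0:
        pos += 1
    return pos


def prev_char(buf, pos):
    """POS is at the start of a character; return the start of the previous."""
    return char_start(buf, pos - 1)


# "A", a 3-byte character, a 3-byte private character, "B".
buf = bytes([0x41, 0x91, 0xA4, 0xA2, 0x9E, 0xA0, 0xC1, 0x42])
print(char_start(buf, 3))   # 1
print(next_char(buf, 1))    # 4
print(prev_char(buf, 4))    # 1
```

Each loop runs at most three times because a character occupies at most four bytes, so the cost is bounded by a constant. For the same reason, a plain byte search such as `buf.find(...)` for an encoded character can only match at a character boundary.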
827 File: internals.info, Node: Internal Character Encoding, Prev: Internal String Encoding, Up: Internal Mule Encodings
829 Internal Character Encoding
830 ---------------------------
832 One 19-bit word represents a single character. The word is
833 separated into three fields:
835 Bit number:   18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
836               <------------> <------------------> <------------------>
837 Field:              1                  2                    3
839 Note that fields 2 and 3 hold 7 bits each, while field 1 holds 5 bits.
842 Character set           Field 1         Field 2         Field 3
843 -------------           -------         -------         -------
848 Dimension-1 official    0               LB - 0x80       PC1
849                range:                   (01 - 0D)       (20 - 7F)
850 Dimension-1 private     0               LB - 0x80       PC1
851                range:                   (20 - 6F)       (20 - 7F)
852 Dimension-2 official    LB - 0x8F       PC1             PC2
853                range:   (01 - 0A)       (20 - 7F)       (20 - 7F)
854 Dimension-2 private     LB - 0xE1       PC1             PC2
855                range:   (0F - 1E)       (20 - 7F)       (20 - 7F)
858 Note that character codes 0 - 255 are the same as the "binary
859 encoding" described above.
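The field layout can be checked with a little bit arithmetic. The Python sketch below packs the three fields into one 19-bit integer and unpacks them again; the function names `make_char`, `split_char`, and `encode_char` are hypothetical, and the leading bytes in the examples are arbitrary values from the official ranges.

```python
def make_char(f1, f2, f3):
    """Pack field 1 (5 bits), field 2 (7 bits) and field 3 (7 bits)."""
    return (f1 << 14) | (f2 << 7) | f3


def split_char(ch):
    """Return (field 1, field 2, field 3) of a 19-bit character code."""
    return (ch >> 14) & 0x1F, (ch >> 7) & 0x7F, ch & 0x7F


def encode_char(lb, pcs):
    """Build a character code from a leading byte and position codes."""
    if len(pcs) == 1:                      # dimension-1, official or private
        return make_char(0, lb - 0x80, pcs[0])
    pc1, pc2 = pcs                         # dimension-2
    f1 = lb - 0x8F if lb < 0xA0 else lb - 0xE1   # official vs. private
    return make_char(f1, pc1, pc2)


print(hex(encode_char(0x91, (0x24, 0x22))))   # 0x9222
print(split_char(0x9222))                     # (2, 36, 34)
print(hex(encode_char(0x81, (0x41,))))        # 0xc1
```

The last example shows why dimension-1 official characters with small leading bytes land in the low range of character codes: field 1 is zero, so the code is just field 2 and field 3 side by side.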
862 File: internals.info, Node: CCL, Prev: Internal Mule Encodings, Up: MULE Character Sets and Encodings
868 CCL_PROGRAM := (CCL_MAIN_BLOCK
871 CCL_MAIN_BLOCK := CCL_BLOCK
872 CCL_EOF_BLOCK := CCL_BLOCK
874 CCL_BLOCK := STATEMENT | (STATEMENT [STATEMENT ...])
876 SET | IF | BRANCH | LOOP | REPEAT | BREAK
879 SET := (REG = EXPRESSION) | (REG SELF_OP EXPRESSION)
882 EXPRESSION := ARG | (EXPRESSION OP ARG)
884 IF := (if EXPRESSION CCL_BLOCK CCL_BLOCK)
885 BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
886 LOOP := (loop STATEMENT [STATEMENT ...])
889 | (write-repeat [REG | INT-OR-CHAR | string])
890 | (write-read-repeat REG [INT-OR-CHAR | string | ARRAY]?)
891 READ := (read REG) | (read REG REG)
892 | (read-if REG ARITH_OP ARG CCL_BLOCK CCL_BLOCK)
893 | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
894 WRITE := (write REG) | (write REG REG)
895 | (write INT-OR-CHAR) | (write STRING) | STRING
899 REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
900 ARG := REG | INT-OR-CHAR
901 OP := + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
902 | < | > | == | <= | >= | !=
904 += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
905 ARRAY := '[' INT-OR-CHAR ... ']'
906 INT-OR-CHAR := INT | CHAR
910 The machine code consists of a vector of 32-bit words.
911 The first such word specifies the start of the EOF section of the code;
912 this is the code executed to handle any stuff that needs to be done
913 (e.g. designating back to ASCII and left-to-right mode) after all
914 other encoded/decoded data has been written out. This is not used for
915 charset CCL programs.
917 REGISTER: 0..7 -- referred by RRR or rrr
919 OPERATOR BIT FIELD (23-bit): XXXXXXXXXXXXXXX RRR TTTTT
920     TTTTT (5-bit): operator type
921     RRR (3-bit): register number
922     XXXXXXXXXXXXXXX (15-bit):
923         CCCCCCCCCCCCCCC: constant or address
924         000000000000rrr: register number
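To make the field layout concrete, here is a Python sketch that packs and unpacks an operator word. It assumes that the diagram is written with the most significant bits on the left, so that TTTTT occupies the lowest 5 bits, RRR the next 3, and the 15-bit X field sits above them, and it treats the X field as unsigned. The function names `encode_ccl_word` and `decode_ccl_word` are hypothetical.

```python
def encode_ccl_word(op, reg, arg):
    """Pack operator type (5 bits), register (3 bits) and X field (15 bits)."""
    return ((arg & 0x7FFF) << 8) | ((reg & 0x7) << 5) | (op & 0x1F)


def decode_ccl_word(word):
    """Return (operator type, register number, X field) of an operator word."""
    op = word & 0x1F               # TTTTT
    reg = (word >> 5) & 0x7        # RRR
    arg = (word >> 8) & 0x7FFF     # constant, address, or second register
    return op, reg, arg


# SetR has operator type 00010; with RRR = 3 and rrr = 5 it means r3 = r5.
word = encode_ccl_word(0b00010, 3, 5)
print(decode_ccl_word(word))   # (2, 3, 5)
```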
951 OPERATORS: TTTTT RRR XX..
953 SetCS: 00000 RRR C...C RRR = C...C
954 SetCL: 00001 RRR ..... RRR = c...c
956 SetR: 00010 RRR ..rrr RRR = rrr
957 SetA: 00011 RRR ..rrr RRR = array[rrr]
958 C.............C size of array = C...C
959 c.............c contents = c...c
961 Jump: 00100 000 c...c jump to c...c
962 JumpCond: 00101 RRR c...c if (!RRR) jump to c...c
963 WriteJump: 00110 RRR c...c Write1 RRR, jump to c...c
964 WriteReadJump: 00111 RRR c...c Write1, Read1 RRR, jump to c...c
965 WriteCJump: 01000 000 c...c Write1 C...C, jump to c...c
967 WriteCReadJump: 01001 RRR c...c Write1 C...C, Read1 RRR,
968 C.............C and jump to c...c
969 WriteSJump: 01010 000 c...c WriteS, jump to c...c
973 WriteSReadJump: 01011 RRR c...c WriteS, Read1 RRR, jump to c...c
977 WriteAReadJump: 01100 RRR c...c WriteA, Read1 RRR, jump to c...c
978 C.............C size of array = C...C
979 c.............c contents = c...c
981 Branch: 01101 RRR C...C if (RRR >= 0 && RRR < C..)
982 c.............c branch to (RRR+1)th address
983 Read1: 01110 RRR ... read 1-byte to RRR
984 Read2: 01111 RRR ..rrr read 2-byte to RRR and rrr
985 ReadBranch: 10000 RRR C...C Read1 and Branch
988 Write1: 10001 RRR ..... write 1-byte RRR
989 Write2: 10010 RRR ..rrr write 2-byte RRR and rrr
990 WriteC: 10011 000 ..... write 1-char C...CC
992 WriteS: 10100 000 ..... write C..-byte of string
996 WriteA: 10101 RRR ..... write array[RRR]
997 C.............C size of array = C...C
998 c.............c contents = c...c
1000 End: 10110 000 ..... terminate the execution
1002 SetSelfCS: 10111 RRR C...C RRR AAAAA= C...C
1004 SetSelfCL: 11000 RRR ..... RRR AAAAA= c...c
1007 SetSelfR: 11001 RRR ..Rrr RRR AAAAA= rrr
1009 SetExprCL: 11010 RRR ..Rrr RRR = rrr AAAAA c...c
1012 SetExprR: 11011 RRR ..rrr RRR = rrr AAAAA Rrr
1015 JumpCondC: 11100 RRR c...c if !(RRR AAAAA C..) jump to c...c
1018 JumpCondR: 11101 RRR c...c if !(RRR AAAAA rrr) jump to c...c
1021 ReadJumpCondC: 11110 RRR c...c Read1 and JumpCondC
1024 ReadJumpCondR: 11111 RRR c...c Read1 and JumpCondR
1029 File: internals.info, Node: The Lisp Reader and Compiler, Next: Lstreams, Prev: MULE Character Sets and Encodings, Up: Top
1031 The Lisp Reader and Compiler
1032 ****************************
1037 File: internals.info, Node: Lstreams, Next: Consoles; Devices; Frames; Windows, Prev: The Lisp Reader and Compiler, Up: Top
1042 An "lstream" is an internal Lisp object that provides a generic
1043 buffering stream implementation. Conceptually, you send data to the
1044 stream or read data from the stream, not caring what's on the other end
1045 of the stream. The other end could be another stream, a file
1046 descriptor, a stdio stream, a fixed block of memory, a reallocating
1047 block of memory, etc. The main purpose of the stream is to provide a
1048 standard interface and to do buffering. Macros are defined to read or
1049 write characters, so the calling functions do not have to worry about
1050 blocking data together in order to achieve efficiency.
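The buffering idea can be sketched in Python as follows. This is not the lstream API: the class name `BufferedSource` and its `getc` method are hypothetical, and the "other end" is modeled as any callable that returns a chunk of bytes (empty at end of data). The caller reads one byte at a time, while the back end is only consulted when the buffer runs dry.

```python
import io


class BufferedSource:
    """Read single bytes efficiently from an arbitrary chunk-based back end."""

    def __init__(self, read_chunk, bufsize=4096):
        self._read_chunk = read_chunk   # callable(n) -> bytes, b"" at the end
        self._bufsize = bufsize
        self._buf = b""
        self._pos = 0

    def getc(self):
        """Return the next byte as an integer, or -1 at end of data."""
        if self._pos >= len(self._buf):          # buffer exhausted: refill
            self._buf = self._read_chunk(self._bufsize)
            self._pos = 0
            if not self._buf:
                return -1
        byte = self._buf[self._pos]
        self._pos += 1
        return byte


# The back end here is an in-memory block; a file's read method works too.
src = BufferedSource(io.BytesIO(b"abc").read, bufsize=2)
print([src.getc() for _ in range(4)])   # [97, 98, 99, -1]
```

The calling code never sees the chunk boundaries, which is the sense in which callers do not have to block data together themselves.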
1054 * Creating an Lstream:: Creating an lstream object.
1055 * Lstream Types:: Different sorts of things that are streamed.
1056 * Lstream Functions:: Functions for working with lstreams.
1057 * Lstream Methods:: Creating new lstream types.
1060 File: internals.info, Node: Creating an Lstream, Next: Lstream Types, Up: Lstreams
1065 Lstreams come in different types, depending on what is being
1066 interfaced to. Although the primitive for creating new lstreams is
1067 `Lstream_new()', generally you do not call this directly. Instead, you
1068 call some type-specific creation function, which creates the lstream
1069 and initializes it as appropriate for the particular type.
1071 All lstream creation functions take a MODE argument, specifying what
1072 mode the lstream should be opened as. This controls whether the
1073 lstream is for input or output, and optionally whether data should be
1074 blocked up in units of MULE characters. Note that some types of
1075 lstreams can only be opened for input; others only for output; and
1076 others can be opened either way. #### Richard Mlynarik thinks that
1077 there should be a strict separation between input and output streams,
1078 and he's probably right.
1080 MODE is a string, one of
1089 Open for reading, but "read" never returns partial MULE characters.
1092 Open for writing, but never writes partial MULE characters.
1095 File: internals.info, Node: Lstream Types, Next: Lstream Functions, Prev: Creating an Lstream, Up: Lstreams