   1 This is ../info/internals.info, produced by makeinfo version 4.0 from
   2 internals/internals.texi.
   3
   4 INFO-DIR-SECTION XEmacs Editor
   5 START-INFO-DIR-ENTRY
   6 * Internals: (internals).       XEmacs Internals Manual.
   7 END-INFO-DIR-ENTRY
   8
   9    Copyright (C) 1992 - 1996 Ben Wing.  Copyright (C) 1996, 1997 Sun
  10 Microsystems.  Copyright (C) 1994 - 1998 Free Software Foundation.
  11 Copyright (C) 1994, 1995 Board of Trustees, University of Illinois.
  12
  13    Permission is granted to make and distribute verbatim copies of this
  14 manual provided the copyright notice and this permission notice are
  15 preserved on all copies.
  16
  17    Permission is granted to copy and distribute modified versions of
  18 this manual under the conditions for verbatim copying, provided that the
  19 entire resulting derived work is distributed under the terms of a
  20 permission notice identical to this one.
  21
  22    Permission is granted to copy and distribute translations of this
  23 manual into another language, under the above conditions for modified
  24 versions, except that this permission notice may be stated in a
  25 translation approved by the Foundation.
  26
  27    Permission is granted to copy and distribute modified versions of
  28 this manual under the conditions for verbatim copying, provided also
  29 that the section entitled "GNU General Public License" is included
  30 exactly as in the original, and provided that the entire resulting
  31 derived work is distributed under the terms of a permission notice
  32 identical to this one.
  33
  34    Permission is granted to copy and distribute translations of this
  35 manual into another language, under the above conditions for modified
  36 versions, except that the section entitled "GNU General Public License"
  37 may be included in a translation approved by the Free Software
  38 Foundation instead of in the original English.
  39
  40 \1f
  41 File: internals.info,  Node: Obarrays,  Next: Symbol Values,  Prev: Introduction to Symbols,  Up: Symbols and Variables
  42
  43 Obarrays
  44 ========
  45
  46    The identity of symbols with their names is accomplished through a
  47 structure called an obarray, which is just a poorly-implemented hash
  48 table mapping from strings to symbols whose name is that string. (I say
  49 "poorly implemented" because an obarray appears in Lisp as a vector
  50 with some hidden fields rather than as its own opaque type.  This is an
  51 Emacs Lisp artifact that should be fixed.)
  52
  53    An obarray is implemented as a vector of some fixed size (which
  54 should be a prime for best results), where each "bucket" of the vector
  55 contains zero or more symbols, threaded through a hidden `next' field
  56 in the symbol.  Looking up a symbol in an obarray and adding a symbol
  57 to an obarray are accomplished through standard hash-table techniques.
  58
  59    The standard Lisp function for working with symbols and obarrays is
  60 `intern'.  This looks up a symbol in an obarray given its name; if it's
  61 not found, a new symbol is automatically created with the specified
  62 name, added to the obarray, and returned.  This is what happens when the
  63 Lisp reader encounters a symbol (or more precisely, encounters the name
  64 of a symbol) in some text that it is reading.  There is a standard
  65 obarray called `obarray' that is used for this purpose, although the
  66 Lisp programmer is free to create his own obarrays and `intern' symbols
  67 in them.
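
   The bucket-and-chain layout and the behavior of `intern' and
`intern-soft' can be sketched roughly as follows.  This is only an
illustration under simplifying assumptions: every name beginning with
`sketch_' is hypothetical, the hash function is an arbitrary one chosen
for the example, and symbols are reduced to a name plus the chaining
field.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_OBARRAY_SIZE 31   /* a prime, as the text recommends */

struct sketch_symbol {
    char *name;
    struct sketch_symbol *next;  /* hidden field chaining one bucket */
};

struct sketch_obarray {
    struct sketch_symbol *buckets[SKETCH_OBARRAY_SIZE];
};

/* Illustrative string hash; not the hash XEmacs actually uses. */
static unsigned sketch_hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h = h * 33u + (unsigned char) *s++;
    return h % SKETCH_OBARRAY_SIZE;
}

/* Like `intern-soft': return the symbol named NAME, or NULL if absent. */
static struct sketch_symbol *
sketch_intern_soft(struct sketch_obarray *ob, const char *name)
{
    struct sketch_symbol *sym;
    assert(ob != NULL && name != NULL);
    for (sym = ob->buckets[sketch_hash(name)]; sym != NULL; sym = sym->next)
        if (strcmp(sym->name, name) == 0)
            return sym;
    return NULL;
}

/* Like `intern': look NAME up, creating and adding a symbol if needed. */
static struct sketch_symbol *
sketch_intern(struct sketch_obarray *ob, const char *name)
{
    unsigned h = sketch_hash(name);
    struct sketch_symbol *sym = sketch_intern_soft(ob, name);
    if (sym != NULL)
        return sym;

    sym = malloc(sizeof *sym);
    if (sym == NULL)
        abort();
    sym->name = malloc(strlen(name) + 1);
    if (sym->name == NULL)
        abort();
    strcpy(sym->name, name);

    /* Push the new symbol onto the front of its bucket's chain. */
    sym->next = ob->buckets[h];
    ob->buckets[h] = sym;
    return sym;
}
```

   Interning the same name twice in one obarray yields the same
pointer, which is the sense in which a symbol is identified with its
name; two separate obarrays would each get their own symbol.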
  68
  69    Note that, once a symbol is in an obarray, it stays there until it
  70 is explicitly removed (e.g. with `unintern'), and the standard obarray
  71 `obarray' always stays around, so once you use any particular variable
  72 name, a corresponding symbol will normally stay around in `obarray'
  73 until you exit XEmacs.
  74
  75    Note that `obarray' itself is a variable, and as such there is a
  76 symbol in `obarray' whose name is `"obarray"' and which contains
  77 `obarray' as its value.
  78
  79    Note also that this call to `intern' occurs only when in the Lisp
  80 reader, not when the code is executed (at which point the symbol is
  81 already around, stored as such in the definition of the function).
  82
  83    You can create your own obarray using `make-vector' (this is
  84 horrible but is an artifact) and intern symbols into that obarray.
  85 Doing that will result in two or more symbols with the same name.
  86 However, at most one of these symbols is in the standard `obarray': You
  87 cannot have two symbols of the same name in any particular obarray.
  88 Note that you cannot add a symbol to an obarray in any fashion other
  89 than using `intern': i.e. you can't take an existing symbol and put it
  90 in an existing obarray.  Nor can you change the name of an existing
  91 symbol. (Since obarrays are vectors, you can violate the consistency of
  92 things by storing directly into the vector, but let's ignore that
  93 possibility.)
  94
  95    Usually symbols are created by `intern', but if you really want, you
  96 can explicitly create a symbol using `make-symbol', giving it some
  97 name.  The resulting symbol is not in any obarray (i.e. it is
  98 "uninterned"), and you can't add it to any obarray.  Therefore its
  99 primary purpose is as a symbol to use in macros to avoid namespace
 100 pollution.  It can also be used as a carrier of information, but cons
 101 cells could probably be used just as well.
 102
 103    You can also use `intern-soft' to look up a symbol but not create a
 104 new one, and `unintern' to remove a symbol from an obarray.  This
 105 returns the removed symbol. (Remember: You can't put the symbol back
 106 into any obarray.) Finally, `mapatoms' maps over all of the symbols in
 107 an obarray.
 108
 109 \1f
 110 File: internals.info,  Node: Symbol Values,  Prev: Obarrays,  Up: Symbols and Variables
 111
 112 Symbol Values
 113 =============
 114
 115    The value field of a symbol normally contains a Lisp object.
 116 However, a symbol can be "unbound", meaning that it logically has no
 117 value.  This is internally indicated by storing a special Lisp object,
 118 called "the unbound marker" and stored in the global variable
 119 `Qunbound'.  The unbound marker is of a special Lisp object type called
 120 "symbol-value-magic".  It is impossible for the Lisp programmer to
 121 directly create or access any object of this type.
 122
 123    *You must not let any "symbol-value-magic" object escape to the Lisp
 124 level.*  Printing any of these objects will cause the message `INTERNAL
 125 EMACS BUG' to appear as part of the print representation.  (You may see
 126 this normally when you call `debug_print()' from the debugger on a Lisp
 127 object.) If you let one of these objects escape to the Lisp level, you
 128 will violate a number of assumptions contained in the C code and
 129 prevent the unbound marker from working correctly.
 130
 131    When a symbol is created, its value field (and function field) are
 132 set to `Qunbound'.  The Lisp programmer can restore these conditions
 133 later using `makunbound' or `fmakunbound', and can query to see whether
 134 the value or function fields are "bound" (i.e. have a value other than
 135 `Qunbound') using `boundp' and `fboundp'.  The fields are set to a
 136 normal Lisp object using `set' (or `setq') and `fset'.
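
   The idea of a distinguished unbound marker can be illustrated with
the following sketch.  It assumes, purely for the example, that a Lisp
object is an opaque pointer and that the marker is the address of a
private static object; the `sketch_' names are hypothetical and do not
correspond to the real definitions in the C code.

```c
#include <assert.h>

/* Stand-in for a Lisp object; real XEmacs objects are tagged values. */
typedef const void *sketch_obj;

/* The address of this private object plays the role of `Qunbound':
   no ordinary value handed to sketch_set() can be equal to it. */
static const char sketch_unbound_storage = 0;
#define SKETCH_QUNBOUND ((sketch_obj) &sketch_unbound_storage)

struct sketch_sym {
    sketch_obj value;      /* value field */
    sketch_obj function;   /* function field */
};

/* A freshly created symbol has both fields unbound. */
static void sketch_init_symbol(struct sketch_sym *s)
{
    s->value = SKETCH_QUNBOUND;
    s->function = SKETCH_QUNBOUND;
}

/* Like `boundp' and `fboundp'. */
static int sketch_boundp(const struct sketch_sym *s)
{
    return s->value != SKETCH_QUNBOUND;
}

static int sketch_fboundp(const struct sketch_sym *s)
{
    return s->function != SKETCH_QUNBOUND;
}

/* Like `set': the marker itself must never be stored as a value. */
static void sketch_set(struct sketch_sym *s, sketch_obj v)
{
    assert(v != SKETCH_QUNBOUND);
    s->value = v;
}

/* Like `makunbound': restore the initial condition. */
static void sketch_makunbound(struct sketch_sym *s)
{
    s->value = SKETCH_QUNBOUND;
}
```

   The assertion in `sketch_set' is the sketch's counterpart of the
rule that the unbound marker must not escape to the Lisp level.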
 137
 138    Other symbol-value-magic objects are used as special markers to
 139 indicate variables that have non-normal properties.  This includes any
 140 variables that are tied into C variables (setting the variable magically
 141 sets some global variable in the C code, and likewise for retrieving the
 142 variable's value), variables that magically tie into slots in the
 143 current buffer, variables that are buffer-local, etc.  The
 144 symbol-value-magic object is stored in the value cell in place of a
 145 normal object, and the code to retrieve a symbol's value (i.e.
 146 `symbol-value') knows how to do special things with them.  This means
 147 that you should not just fetch the value cell directly if you want a
 148 symbol's value.
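
   As a rough illustration of why the value cell must not be read
directly, consider a cell that is either an ordinary value or a
forwarding record tied to a C variable.  The layout below is an
assumption made for the example (a tagged union with hypothetical
`sketch_' names); the real representation is described in the source
comments and is more elaborate.

```c
#include <assert.h>

enum sketch_cell_kind { SKETCH_NORMAL, SKETCH_FORWARD_INT };

struct sketch_cell {
    enum sketch_cell_kind kind;
    union {
        long value;        /* ordinary value stored in the cell */
        int *c_variable;   /* C variable the Lisp variable is tied to */
    } u;
};

/* Counterpart of `symbol-value': follows the forwarding if present. */
static long sketch_symbol_value(const struct sketch_cell *c)
{
    assert(c != 0);
    if (c->kind == SKETCH_FORWARD_INT)
        return *c->u.c_variable;
    return c->u.value;
}

/* Counterpart of `set': writes through to the C variable if forwarded. */
static void sketch_set_value(struct sketch_cell *c, long v)
{
    assert(c != 0);
    if (c->kind == SKETCH_FORWARD_INT)
        *c->u.c_variable = (int) v;
    else
        c->u.value = v;
}
```

   Code that read `u.value' directly from a forwarding cell would get
meaningless data, which is why all access goes through the accessor.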
 149
 150    The exact workings of this are rather complex and involved and are
 151 well-documented in comments in `buffer.c', `symbols.c', and `lisp.h'.
 152
 153 \1f
 154 File: internals.info,  Node: Buffers and Textual Representation,  Next: MULE Character Sets and Encodings,  Prev: Symbols and Variables,  Up: Top
 155
 156 Buffers and Textual Representation
 157 **********************************
 158
 159 * Menu:
 160
 161 * Introduction to Buffers::     A buffer holds a block of text such as a file.
 162 * The Text in a Buffer::        Representation of the text in a buffer.
 163 * Buffer Lists::                Keeping track of all buffers.
 164 * Markers and Extents::         Tagging locations within a buffer.
 165 * Bufbytes and Emchars::        Representation of individual characters.
 166 * The Buffer Object::           The Lisp object corresponding to a buffer.
 167
 168 \1f
 169 File: internals.info,  Node: Introduction to Buffers,  Next: The Text in a Buffer,  Up: Buffers and Textual Representation
 170
 171 Introduction to Buffers
 172 =======================
 173
 174    A buffer is logically just a Lisp object that holds some text.  In
 175 this, it is like a string, but a buffer is optimized for frequent
 176 insertion and deletion, while a string is not.  Furthermore:
 177
 178   1. Buffers are "permanent" objects, i.e. once you create them, they
 179      remain around, and need to be explicitly deleted before they go
 180      away.
 181
 182   2. Each buffer has a unique name, which is a string.  Buffers are
 183      normally referred to by name.  In this respect, they are like
 184      symbols.
 185
 186   3. Buffers have a default insertion position, called "point".
 187      Inserting text (unless you explicitly give a position) goes at
 188      point, and moves point forward past the text.  This is what is
 189      going on when you type text into Emacs.
 190
 191   4. Buffers have lots of extra properties associated with them.
 192
 193   5. Buffers can be "displayed".  What this means is that there exist a
 194      number of "windows", which are objects that correspond to some
 195      visible section of your display, and each window has an associated
 196      buffer, and the current contents of the buffer are shown in that
 197      section of the display.  The redisplay mechanism (which takes care
 198      of doing this) knows how to look at the text of a buffer and come
 199      up with some reasonable way of displaying this.  Many of the
 200      properties of a buffer control how the buffer's text is displayed.
 201
 202   6. One buffer is distinguished and called the "current buffer".  It is
 203      stored in the variable `current_buffer'.  Buffer operations operate
 204      on this buffer by default.  When you are typing text into a
 205      buffer, the buffer you are typing into is always `current_buffer'.
 206      Switching to a different window changes the current buffer.  Note
 207      that Lisp code can temporarily change the current buffer using
 208      `set-buffer' (often enclosed in a `save-excursion' so that the
 209      former current buffer gets restored when the code is finished).
 210      However, calling `set-buffer' will NOT cause a permanent change in
 211      the current buffer.  The reason for this is that the top-level
 212      event loop sets `current_buffer' to the buffer of the selected
 213      window, each time it finishes executing a user command.
 214
 215    Make sure you understand the distinction between "current buffer"
 216 and "buffer of the selected window", and the distinction between
 217 "point" of the current buffer and "window-point" of the selected
 218 window. (This latter distinction is explained in detail in the section
 219 on windows.)
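
   The interaction between `set-buffer' and the top-level event loop
can be pictured with the sketch below.  It is a deliberately tiny
model: the structures, the `sketch_' names, and the single example
command are all hypothetical, and the real command loop does far more.

```c
#include <assert.h>

struct sketch_buffer { const char *name; };
struct sketch_window { struct sketch_buffer *buffer; };

static struct sketch_buffer *sketch_current_buffer;
static struct sketch_window *sketch_selected_window;

static struct sketch_buffer sketch_scratch = { "*scratch*" };
static struct sketch_buffer sketch_other = { "other" };

/* Like `set-buffer': changes the current buffer and nothing else. */
static void sketch_set_buffer(struct sketch_buffer *b)
{
    assert(b != 0);
    sketch_current_buffer = b;
}

/* An example command that leaves a different buffer current. */
static void sketch_example_command(void)
{
    sketch_set_buffer(&sketch_other);
}

/* One iteration of a top-level loop: run COMMAND, then set the current
   buffer back to the buffer of the selected window. */
static void sketch_command_loop_once(void (*command)(void))
{
    command();
    sketch_current_buffer = sketch_selected_window->buffer;
}
```

   After `sketch_command_loop_once' returns, the current buffer is
again the buffer of the selected window, no matter what the command
did with `sketch_set_buffer'.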
 220
 221 \1f
 222 File: internals.info,  Node: The Text in a Buffer,  Next: Buffer Lists,  Prev: Introduction to Buffers,  Up: Buffers and Textual Representation
 223
 224 The Text in a Buffer
 225 ====================
 226
 227    The text in a buffer consists of a sequence of zero or more
 228 characters.  A "character" is an integer that logically represents a
 229 letter, number, space, or other unit of text.  Most of the characters
 230 that you will typically encounter belong to the ASCII set of characters,
 231 but there are also characters for various sorts of accented letters,
 232 special symbols, Chinese and Japanese characters (e.g. Kanji, Katakana,
 233 etc.), Cyrillic and Greek letters, etc.  The actual number of possible
 234 characters is quite large.
 235
 236    For now, we can view a character as some non-negative integer that
 237 has some shape that defines how it typically appears (e.g. as an
 238 uppercase A). (The exact way in which a character appears depends on the
 239 font used to display the character.) The internal type of characters in
 240 the C code is an `Emchar'; this is just an `int', but using a symbolic
 241 type makes the code clearer.
 242
 243    Between every character in a buffer is a "buffer position" or
 244 "character position".  We can speak of the character before or after a
 245 particular buffer position, and when you insert a character at a
 246 particular position, all characters after that position end up at new
 247 positions.  When we speak of the character "at" a position, we really
 248 mean the character after the position.  (This ambiguity between a
 249 buffer position being "between" two characters and "on" a character
 250 is rampant in Emacs.)
 251
 252    Buffer positions are numbered starting at 1.  This means that
 253 position 1 is before the first character, and position 0 is not valid.
 254 If there are N characters in a buffer, then buffer position N+1 is
 255 after the last one, and position N+2 is not valid.
 256
 257    The internal makeup of the Emchar integer varies depending on whether
 258 we have compiled with MULE support.  If not, the Emchar integer is an
 259 8-bit integer with possible values from 0 - 255.  0 - 127 are the
 260 standard ASCII characters, while 128 - 255 are the characters from the
 261 ISO-8859-1 character set.  If we have compiled with MULE support, an
 262 Emchar is a 19-bit integer, with the various bits having meanings
 263 according to a complex scheme that will be detailed later.  The
 264 characters numbered 0 - 255 still have the same meanings as for the
 265 non-MULE case, though.
 266
 267    Internally, the text in a buffer is represented in a fairly simple
 268 fashion: as a contiguous array of bytes, with a "gap" of some size in
 269 the middle.  Although the gap is of some substantial size in bytes,
 270 there is no text contained within it: From the perspective of the text
 271 in the buffer, it does not exist.  The gap logically sits at some buffer
 272 position, between two characters (or possibly at the beginning or end of
 273 the buffer).  Insertion of text in a buffer at a particular position is
 274 always accomplished by first moving the gap to that position (i.e.
 275 through some block moving of text), then writing the text into the
 276 beginning of the gap, thereby shrinking the gap.  If the gap shrinks
 277 down to nothing, a new gap is created. (What actually happens is that a
 278 new gap is "created" at the end of the buffer's text, which requires
 279 nothing more than changing a couple of indices; then the gap is "moved"
 280 to the position where the insertion needs to take place by moving up in
 281 memory all the text after that position.)  Similarly, deletion occurs
 282 by moving the gap to the place where the text is to be deleted, and
 283 then simply expanding the gap to include the deleted text.
 284 ("Expanding" and "shrinking" the gap as just described means just that
 285 the internal indices that keep track of where the gap is located are
 286 changed.)
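
   The gap mechanism just described can be sketched in a few lines of
C.  This is a minimal illustration, not the code in XEmacs: it uses a
fixed-size block of memory that never grows, 0-based offsets instead of
1-based buffer positions, one byte per character, and hypothetical
`sketch_' names throughout.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_CAPACITY 64   /* fixed storage; a real buffer would grow */

struct sketch_gapbuf {
    unsigned char text[SKETCH_CAPACITY];
    int gap_start;   /* 0-based index of the first byte of the gap */
    int gap_end;     /* 0-based index one past the last byte of the gap */
};

static void sketch_gb_init(struct sketch_gapbuf *b)
{
    b->gap_start = 0;
    b->gap_end = SKETCH_CAPACITY;   /* empty buffer: everything is gap */
}

/* Number of bytes of text, not counting the gap. */
static int sketch_gb_length(const struct sketch_gapbuf *b)
{
    return SKETCH_CAPACITY - (b->gap_end - b->gap_start);
}

/* Move the gap so that it starts at text offset POS, by block-moving
   the text that lies between the old and new gap positions. */
static void sketch_gb_move_gap(struct sketch_gapbuf *b, int pos)
{
    assert(pos >= 0 && pos <= sketch_gb_length(b));
    if (pos < b->gap_start) {
        int n = b->gap_start - pos;
        memmove(b->text + b->gap_end - n, b->text + pos, n);
        b->gap_start -= n;
        b->gap_end -= n;
    } else if (pos > b->gap_start) {
        int n = pos - b->gap_start;
        memmove(b->text + b->gap_start, b->text + b->gap_end, n);
        b->gap_start += n;
        b->gap_end += n;
    }
}

/* Insert LEN bytes at POS: move the gap there, then write the text
   into the beginning of the gap, thereby shrinking it. */
static void sketch_gb_insert(struct sketch_gapbuf *b, int pos,
                             const char *s, int len)
{
    assert(len >= 0 && len <= b->gap_end - b->gap_start);
    sketch_gb_move_gap(b, pos);
    memcpy(b->text + b->gap_start, s, len);
    b->gap_start += len;
}

/* Delete LEN bytes starting at POS: move the gap there, then expand
   the gap over the deleted text.  No bytes are copied for this part. */
static void sketch_gb_delete(struct sketch_gapbuf *b, int pos, int len)
{
    assert(len >= 0 && pos + len <= sketch_gb_length(b));
    sketch_gb_move_gap(b, pos);
    b->gap_end += len;
}

/* Copy the text, skipping the gap, into OUT as a NUL-terminated string.
   OUT must have room for SKETCH_CAPACITY + 1 bytes. */
static void sketch_gb_contents(const struct sketch_gapbuf *b, char *out)
{
    memcpy(out, b->text, b->gap_start);
    memcpy(out + b->gap_start, b->text + b->gap_end,
           SKETCH_CAPACITY - b->gap_end);
    out[sketch_gb_length(b)] = '\0';
}
```

   Note that repeated insertions at the same spot never call `memmove'
with a nonzero length after the first one, which is the property that
makes the representation well suited to ordinary typing.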
 287
 288    Note that the total amount of memory allocated for a buffer text
 289 never decreases while the buffer is live.  Therefore, if you load up a
 290 20-megabyte file and then delete all but one character, there will be a
 291 20-megabyte gap, which won't get any smaller (except by inserting
 292 characters back again).  Once the buffer is killed, the memory allocated
 293 for the buffer text will be freed, but it will still be sitting on the
 294 heap, taking up virtual memory, and will not be released back to the
 295 operating system. (However, if you have compiled XEmacs with rel-alloc,
 296 the situation is different.  In this case, the space _will_ be released
 297 back to the operating system.  However, this tends to result in a
 298 noticeable speed penalty.)
 299
 300    Astute readers may notice that the text in a buffer is represented as
 301 an array of _bytes_, while (at least in the MULE case) an Emchar is a
 302 19-bit integer, which clearly cannot fit in a byte.  This means (of
 303 course) that the text in a buffer uses a different representation from
 304 an Emchar: specifically, the 19-bit Emchar becomes a series of one to
 305 four bytes.  The conversion between these two representations is complex
 306 and will be described later.
 307
 308    In the non-MULE case, everything is very simple: An Emchar is an
 309 8-bit value, which fits neatly into one byte.
 310
 311    If we are given a buffer position and want to retrieve the character
 312 at that position, we need to follow these steps:
 313
 314   1. Pretend there's no gap, and convert the buffer position into a
 315      "byte index" that indexes to the appropriate byte in the buffer's
 316      stream of textual bytes.  By convention, byte indices begin at 1,
 317      just like buffer positions.  In the non-MULE case, byte indices
 318      and buffer positions are identical, since one character equals one
 319      byte.
 320
 321   2. Convert the byte index into a "memory index", which takes the gap
 322      into account.  The memory index is a direct index into the block of
 323      memory that stores the text of a buffer.  This basically just
 324      involves checking to see if the byte index is past the gap, and if
 325      so, adding the size of the gap to it.  By convention, memory
 326      indices begin at 1, just like buffer positions and byte indices,
 327      and when referring to the position that is "at" the gap, we always
 328      use the memory position at the _beginning_, not at the end, of the
 329      gap.
 330
 331   3. Fetch the appropriate bytes at the determined memory position.
 332
 333   4. Convert these bytes into an Emchar.
 334
 335    In the non-MULE case, (3) and (4) boil down to a simple one-byte
 336 memory access.
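
   For the non-MULE case, the four steps might look like the sketch
below.  The typedef names are the ones used in this manual, but the
structure and the `sketch_' functions are hypothetical and only model
the description above.  In particular, the sketch assumes that a byte
index equal to the gap position maps to the beginning of the gap when
treated as a position, but must read past the gap when the character
"at" that position is fetched.

```c
#include <assert.h>

typedef int Bufpos;   /* buffer (character) position, starts at 1 */
typedef int Bytind;   /* byte index, starts at 1 */
typedef int Memind;   /* memory index, starts at 1 */

struct sketch_text {
    const unsigned char *mem;  /* block of memory, gap included */
    Bytind gap_pos;            /* byte index at which the gap sits */
    int gap_size;              /* size of the gap in bytes */
    int nbytes;                /* bytes of text, gap excluded */
};

/* Step 1.  Without MULE, one character is one byte. */
static Bytind sketch_bufpos_to_bytind(Bufpos pos)
{
    return pos;
}

/* Step 2.  Positions past the gap are shifted by the gap size; the
   position "at" the gap maps to the beginning of the gap. */
static Memind sketch_bytind_to_memind(const struct sketch_text *t, Bytind bi)
{
    return bi > t->gap_pos ? bi + t->gap_size : bi;
}

/* Steps 3 and 4.  The character "at" POS is the one after it, so a
   position at the gap reads the byte just past the end of the gap. */
static int sketch_char_at(const struct sketch_text *t, Bufpos pos)
{
    Bytind bi = sketch_bufpos_to_bytind(pos);
    Memind mi;
    assert(pos >= 1 && pos <= t->nbytes);  /* N+1 has no character after it */
    mi = bi >= t->gap_pos ? bi + t->gap_size : bi;
    return t->mem[mi - 1];                 /* memory indices start at 1 */
}
```

   With the text "abcd" stored as `ab', a three-byte gap, then `cd',
the gap sits at byte index 3; position 3 maps to memory index 3 (the
beginning of the gap), while the character at position 3 is `c'.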
 337
 338    Note that we have defined three types of positions in a buffer:
 339
 340   1. "buffer positions" or "character positions", typedef `Bufpos'
 341
 342   2. "byte indices", typedef `Bytind'
 343
 344   3. "memory indices", typedef `Memind'
 345
 346    All three typedefs are just `int's, but defining them this way makes
 347 things a lot clearer.
 348
 349    Most code works with buffer positions.  In particular, all Lisp code
 350 that refers to text in a buffer uses buffer positions.  Lisp code does
 351 not know that byte indices or memory indices exist.
 352
 353    Finally, we have a typedef for the bytes in a buffer.  This is a
 354 `Bufbyte', which is an unsigned char.  Referring to them as Bufbytes
 355 underscores the fact that we are working with a string of bytes in the
 356 internal Emacs buffer representation rather than in one of a number of
 357 possible alternative representations (e.g. EUC-encoded text, etc.).
 358
 359 \1f
 360 File: internals.info,  Node: Buffer Lists,  Next: Markers and Extents,  Prev: The Text in a Buffer,  Up: Buffers and Textual Representation
 361
 362 Buffer Lists
 363 ============
 364
 365    Recall from earlier that buffers are "permanent" objects, i.e. they
 366 remain around until explicitly deleted.  This entails that there is a
 367 list of all the buffers in existence.  This list is actually an
 368 assoc-list (mapping from the buffer's name to the buffer) and is stored
 369 in the global variable `Vbuffer_alist'.
 370
 371    The order of the buffers in the list is important: the buffers are
 372 ordered approximately from most-recently-used to least-recently-used.
 373 Switching to a buffer using `switch-to-buffer', `pop-to-buffer', etc.
 374 and switching windows using `other-window', etc.  usually brings the
 375 new current buffer to the front of the list.  `switch-to-buffer',
 376 `other-buffer', etc. look at the beginning of the list to find an
 377 alternative buffer to suggest.  You can also explicitly move a buffer
 378 to the end of the list using `bury-buffer'.
 379
 380    In addition to the global ordering in `Vbuffer_alist', each frame
 381 has its own ordering of the list.  These lists always contain the same
 382 elements as in `Vbuffer_alist' although possibly in a different order.
 383 `buffer-list' normally returns the list for the selected frame.  This
 384 allows you to work in separate frames without things interfering with
 385 each other.
 386
 387    The standard way to look up a buffer given a name is `get-buffer',
 388 and the standard way to create a new buffer is `get-buffer-create',
 389 which looks up a buffer with a given name, creating a new one if
 390 necessary.  These operations correspond exactly with the symbol
 391 operations `intern-soft' and `intern', respectively.  You can also
 392 force a new buffer to be created using `generate-new-buffer', which
 393 takes a name and (if necessary) makes a unique name from this by
 394 appending a number, and then creates the buffer.  This is basically
 395 like the symbol operation `gensym'.
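
   The relationship between the three lookup and creation operations
can be sketched as follows.  The sketch keeps only buffer names in a
small fixed array and ignores the most-recently-used ordering; the
`sketch_' names are hypothetical, and the `<N>' suffix used to make a
name unique is an assumption of the example rather than something
stated above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_BUFFERS 16
#define SKETCH_NAME_LEN 64

struct sketch_buffer_list {
    char names[SKETCH_MAX_BUFFERS][SKETCH_NAME_LEN];
    int count;
};

/* Like `get-buffer': index of the buffer named NAME, or -1 if none. */
static int sketch_get_buffer(const struct sketch_buffer_list *l,
                             const char *name)
{
    int i;
    for (i = 0; i < l->count; i++)
        if (strcmp(l->names[i], name) == 0)
            return i;
    return -1;
}

/* Like `get-buffer-create': look NAME up, creating a buffer if needed. */
static int sketch_get_buffer_create(struct sketch_buffer_list *l,
                                    const char *name)
{
    int i = sketch_get_buffer(l, name);
    if (i >= 0)
        return i;
    assert(l->count < SKETCH_MAX_BUFFERS);
    assert(strlen(name) < SKETCH_NAME_LEN);
    strcpy(l->names[l->count], name);
    return l->count++;
}

/* Like `generate-new-buffer': always create a buffer, appending a
   number to NAME if that name is already taken. */
static int sketch_generate_new_buffer(struct sketch_buffer_list *l,
                                      const char *name)
{
    char candidate[SKETCH_NAME_LEN];
    int n;
    snprintf(candidate, sizeof candidate, "%s", name);
    for (n = 2; sketch_get_buffer(l, candidate) >= 0; n++)
        snprintf(candidate, sizeof candidate, "%s<%d>", name, n);
    return sketch_get_buffer_create(l, candidate);
}
```

   As with symbols in an obarray, names are unique within the list:
asking for an existing name returns the existing entry, and only the
uniquifying operation ever produces a second buffer.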
 396
 397 \1f
 398 File: internals.info,  Node: Markers and Extents,  Next: Bufbytes and Emchars,  Prev: Buffer Lists,  Up: Buffers and Textual Representation
 399
 400 Markers and Extents
 401 ===================
 402
 403    Among the things associated with a buffer are objects that are
 404 logically attached to certain buffer positions.  These can be used to
 405 keep track of a buffer position when text is inserted and deleted, so
 406 that it remains at the same spot relative to the text around it; to
 407 assign properties to particular sections of text; etc.  There are two
 408 such objects that are useful in this regard: they are "markers" and
 409 "extents".
 410
 411    A "marker" is simply a flag placed at a particular buffer position,
 412 which is moved around as text is inserted and deleted.  Markers are
 413 used for all sorts of purposes, such as the `mark' that is the other
 414 end of textual regions to be cut, copied, etc.
 415
 416    An "extent" is similar to two markers plus some associated
 417 properties, and is used to keep track of regions in a buffer as text is
 418 inserted and deleted, and to add properties (e.g. fonts) to particular
 419 regions of text.  The external interface of extents is explained
 420 elsewhere.
 421
 422    The important thing here is that markers and extents simply contain
 423 buffer positions in them as integers, and every time text is inserted or
 424 deleted, these positions must be updated.  In order to minimize the
 425 amount of shuffling that needs to be done, the positions in markers and
 426 extents (there's one per marker, two per extent) are stored as Meminds.
 427 This means that they only need to be moved when the text is physically
 428 moved in memory; since the gap structure tries to minimize this, it also
 429 minimizes the number of marker and extent indices that need to be
 430 adjusted.  Look in `insdel.c' for the details of how this works.
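
   The benefit of storing memory indices can be seen in a small
sketch.  It assumes one byte per character and the convention that a
position at the gap maps to the beginning of the gap; the function
names are hypothetical, and the real logic in `insdel.c' is more
involved.

```c
#include <assert.h>

typedef int Bytind;   /* byte index, starts at 1 */
typedef int Memind;   /* memory index, starts at 1 */

static Memind sketch_to_memind(Bytind bi, Bytind gap_pos, int gap_size)
{
    return bi > gap_pos ? bi + gap_size : bi;
}

static Bytind sketch_to_bytind(Memind mi, Bytind gap_pos, int gap_size)
{
    return mi > gap_pos ? mi - gap_size : mi;
}

/* Re-express MARKERS (stored as memory indices) after the gap moves
   from OLD_GAP to NEW_GAP.  Returns how many markers actually had to
   be rewritten: only those in the text that was physically moved. */
static int sketch_adjust_markers(Memind *markers, int n,
                                 Bytind old_gap, Bytind new_gap,
                                 int gap_size)
{
    int i, changed = 0;
    assert(markers != 0 && gap_size >= 0);
    for (i = 0; i < n; i++) {
        Bytind bi = sketch_to_bytind(markers[i], old_gap, gap_size);
        Memind mi = sketch_to_memind(bi, new_gap, gap_size);
        if (mi != markers[i]) {
            markers[i] = mi;
            changed++;
        }
    }
    return changed;
}
```

   When text is inserted at the gap, the gap position advances and the
gap size shrinks by the same amount, so the memory index of every
marker after the gap stays the same and nothing needs to be rewritten.
Had the markers held byte indices instead, each of them would need to
be incremented on every insertion.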
 431
 432    One other important distinction is that markers are "temporary"
 433 while extents are "permanent".  This means that markers disappear as
 434 soon as there are no more pointers to them, and correspondingly, there
 435 is no way to determine what markers are in a buffer if you are just
 436 given the buffer.  Extents remain in a buffer until they are detached
 437 (which could happen as a result of text being deleted) or the buffer is
 438 deleted, and primitives do exist to enumerate the extents in a buffer.
 439
 440 \1f
 441 File: internals.info,  Node: Bufbytes and Emchars,  Next: The Buffer Object,  Prev: Markers and Extents,  Up: Buffers and Textual Representation
 442
 443 Bufbytes and Emchars
 444 ====================
 445
 446    Not yet documented.
 447
 448 \1f
 449 File: internals.info,  Node: The Buffer Object,  Prev: Bufbytes and Emchars,  Up: Buffers and Textual Representation
 450
 451 The Buffer Object
 452 =================
 453
 454    Buffers contain fields not directly accessible by the Lisp
 455 programmer.  We describe them here, naming them by the names used in
 456 the C code.  Many are accessible indirectly in Lisp programs via Lisp
 457 primitives.
 458
 459 `name'
 460      The buffer name is a string that names the buffer.  It is
 461      guaranteed to be unique.  *Note Buffer Names: (lispref)Buffer
 462      Names.
 463
 464 `save_modified'
 465      This field contains the time when the buffer was last saved, as an
 466      integer.  *Note Buffer Modification: (lispref)Buffer Modification.
 467
 468 `modtime'
 469      This field contains the modification time of the visited file.  It
 470      is set when the file is written or read.  Every time the buffer is
 471      written to the file, this field is compared to the modification
 472      time of the file.  *Note Buffer Modification: (lispref)Buffer
 473      Modification.
 474
 475 `auto_save_modified'
 476      This field contains the time when the buffer was last auto-saved.
 477
 478 `last_window_start'
 479      This field contains the `window-start' position in the buffer as of
 480      the last time the buffer was displayed in a window.
 481
 482 `undo_list'
 483      This field points to the buffer's undo list.  *Note Undo:
 484      (lispref)Undo.
 485
 486 `syntax_table_v'
 487      This field contains the syntax table for the buffer.  *Note Syntax
 488      Tables: (lispref)Syntax Tables.
 489
 490 `downcase_table'
 491      This field contains the conversion table for converting text to
 492      lower case.  *Note Case Tables: (lispref)Case Tables.
 493
 494 `upcase_table'
 495      This field contains the conversion table for converting text to
 496      upper case.  *Note Case Tables: (lispref)Case Tables.
 497
 498 `case_canon_table'
 499      This field contains the conversion table for canonicalizing text
 500      for case-folding search.  *Note Case Tables: (lispref)Case Tables.
 501
 502 `case_eqv_table'
 503      This field contains the equivalence table for case-folding search.
 504      *Note Case Tables: (lispref)Case Tables.
 505
 506 `display_table'
 507      This field contains the buffer's display table, or `nil' if it
 508      doesn't have one.  *Note Display Tables: (lispref)Display Tables.
 509
 510 `markers'
 511      This field contains the chain of all markers that currently point
 512      into the buffer.  Deletion of text in the buffer, and motion of
 513      the buffer's gap, must check each of these markers and perhaps
 514      update it.  *Note Markers: (lispref)Markers.
 515
 516 `backed_up'
 517      This field is a flag that tells whether a backup file has been
 518      made for the visited file of this buffer.
 519
 520 `mark'
 521      This field contains the mark for the buffer.  The mark is a marker,
 522      hence it is also included on the list `markers'.  *Note The Mark:
 523      (lispref)The Mark.
 524
 525 `mark_active'
 526      This field is non-`nil' if the buffer's mark is active.
 527
 528 `local_var_alist'
 529      This field contains the association list describing the variables
 530      local in this buffer, and their values, with the exception of
 531      local variables that have special slots in the buffer object.
 532      (Those slots are omitted from this table.)  *Note Buffer-Local
 533      Variables: (lispref)Buffer-Local Variables.
 534
 535 `modeline_format'
 536      This field contains a Lisp object which controls how to display
 537      the mode line for this buffer.  *Note Modeline Format:
 538      (lispref)Modeline Format.
 539
 540 `base_buffer'
 541      This field holds the buffer's base buffer (if it is an indirect
 542      buffer), or `nil'.
 543
 544 \1f
 545 File: internals.info,  Node: MULE Character Sets and Encodings,  Next: The Lisp Reader and Compiler,  Prev: Buffers and Textual Representation,  Up: Top
 546
 547 MULE Character Sets and Encodings
 548 *********************************
 549
 550    Recall that there are two primary ways that text is represented in
 551 XEmacs.  The "buffer" representation sees the text as a series of bytes
 552 (Bufbytes), with a variable number of bytes used per character.  The
 553 "character" representation sees the text as a series of integers
 554 (Emchars), one per character.  The character representation is a cleaner
 555 representation from a theoretical standpoint, and is thus used in many
 556 cases when lots of manipulations on a string need to be done.  However,
 557 the buffer representation is the standard representation used in both
 558 Lisp strings and buffers, and because of this, it is the "default"
 559 representation that text comes in.  The reason for using this
 560 representation is that it's compact and is compatible with ASCII.
 561
 562 * Menu:
 563
 564 * Character Sets::
 565 * Encodings::
 566 * Internal Mule Encodings::
 567 * CCL::
 568
 569 \1f
 570 File: internals.info,  Node: Character Sets,  Next: Encodings,  Up: MULE Character Sets and Encodings
 571
 572 Character Sets
 573 ==============
 574
 575    A character set (or "charset") is an ordered set of characters.  A
 576 particular character in a charset is indexed using one or more
 577 "position codes", which are non-negative integers.  The number of
 578 position codes needed to identify a particular character in a charset is
 579 called the "dimension" of the charset.  In XEmacs/Mule, all charsets
 580 have dimension 1 or 2, and the size of all charsets (except for a few
 581 special cases) is either 94, 96, 94 by 94, or 96 by 96.  The range of
 582 position codes used to index characters from any of these types of
 583 character sets is as follows:
 584
 585      Charset type            Position code 1         Position code 2
 586      ------------------------------------------------------------
 587      94                      33 - 126                N/A
 588      96                      32 - 127                N/A
 589      94x94                   33 - 126                33 - 126
 590      96x96                   32 - 127                32 - 127
 591
 592    Note that in the above cases position codes do not start at an
 593 expected value such as 0 or 1.  The reason for this will become clear
 594 later.
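
   The table can be restated as a small validity check.  The enum and
function names below are hypothetical and exist only for this example;
the ranges are the ones given in the table.

```c
#include <assert.h>

enum sketch_charset_type {
    SKETCH_94, SKETCH_96, SKETCH_94x94, SKETCH_96x96
};

/* Number of position codes needed to index a character. */
static int sketch_dimension(enum sketch_charset_type t)
{
    return (t == SKETCH_94x94 || t == SKETCH_96x96) ? 2 : 1;
}

/* Lowest and highest valid position code for charsets of type T. */
static void sketch_code_range(enum sketch_charset_type t, int *lo, int *hi)
{
    int is96 = (t == SKETCH_96 || t == SKETCH_96x96);
    assert(lo && hi);
    *lo = is96 ? 32 : 33;
    *hi = is96 ? 127 : 126;
}

/* Nonzero if (C1, C2) indexes a slot in a charset of type T.  C2 is
   ignored for one-dimensional charsets. */
static int sketch_valid_position(enum sketch_charset_type t, int c1, int c2)
{
    int lo, hi;
    sketch_code_range(t, &lo, &hi);
    if (c1 < lo || c1 > hi)
        return 0;
    if (sketch_dimension(t) == 2 && (c2 < lo || c2 > hi))
        return 0;
    return 1;
}
```

   A slot that passes this check is merely valid for the charset type;
as noted below, a particular charset may still leave it empty.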
 595
 596    For example, Latin-1 is a 96-character charset, and JISX0208 (the
 597 Japanese national character set) is a 94x94-character charset.
 598
 599    [Note that, although the ranges above define the _valid_ position
 600 codes for a charset, some of the slots in a particular charset may in
 601 fact be empty.  This is the case for JISX0208, for example, where (e.g.)
 602 all the slots whose first position code is in the range 118 - 126 are
 603 empty.]
 604
 605    There are three charsets that do not follow the above rules.  All of
 606 them have one dimension, and have ranges of position codes as follows:
 607
 608      Charset name            Position code 1
 609      ------------------------------------
 610      ASCII                   0 - 127
 611      Control-1               0 - 31
 612      Composite               0 - some large number
 613
 614    (The upper bound of the position code for composite characters has
 615 not yet been determined, but it will probably be at least 16,383).
 616
 617    ASCII is the union of two subsidiary character sets: Printing-ASCII
 618 (the printing ASCII character set, consisting of position codes 33 -
 619 126, as in a standard 94-character charset) and Control-ASCII (the
 620 non-printing characters that would appear in a binary file with codes 0
 621 - 32 and 127).
 622
 623    Control-1 contains the non-printing characters that would appear in a
 624 binary file with codes 128 - 159.
 625
 626    Composite contains characters that are generated by overstriking one
 627 or more characters from other charsets.
 628
 629    Note that some characters in ASCII, and all characters in Control-1,
 630 are "control" (non-printing) characters.  These have no printed
 631 representation but instead control some other aspect of printing
 632 (e.g. TAB, code 9, moves the current character position to the next tab
 633 stop).  All other characters in all charsets are "graphic" (printing)
 634 characters.
 635
 636    When a binary file is read in, the bytes in the file are assigned to
 637 character sets as follows:
 638
 639      Bytes           Character set           Range
 640      --------------------------------------------------
 641      0 - 127         ASCII                   0 - 127
 642      128 - 159       Control-1               0 - 31
 643      160 - 255       Latin-1                 32 - 127
 644
 645    This is a bit ad-hoc but gets the job done.
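
A minimal sketch of this byte-to-charset assignment, assuming nothing
beyond the table above.  The type and function names are hypothetical
and do not come from the XEmacs sources.

```c
#include <assert.h>

/* Hypothetical tags for the three charsets used for binary data. */
enum binary_charset { BIN_ASCII, BIN_CONTROL_1, BIN_LATIN_1 };

struct binary_char {
    enum binary_charset charset;
    int pc;                      /* position code within that charset */
};

/* Map one byte of a binary file to a charset and position code. */
struct binary_char classify_binary_byte(int byte)
{
    struct binary_char c;

    assert(byte >= 0 && byte <= 255);
    if (byte <= 127) {
        c.charset = BIN_ASCII;       /* 0 - 127   -> 0 - 127  */
        c.pc = byte;
    } else if (byte <= 159) {
        c.charset = BIN_CONTROL_1;   /* 128 - 159 -> 0 - 31   */
        c.pc = byte - 128;
    } else {
        c.charset = BIN_LATIN_1;     /* 160 - 255 -> 32 - 127 */
        c.pc = byte - 128;
    }
    return c;
}
```

Both non-ASCII cases subtract 128; only the charset tag differs.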
 646
 647 \1f
 648 File: internals.info,  Node: Encodings,  Next: Internal Mule Encodings,  Prev: Character Sets,  Up: MULE Character Sets and Encodings
 649
 650 Encodings
 651 =========
 652
 653    An "encoding" is a way of numerically representing characters from
 654 one or more character sets.  If an encoding only encompasses one
 655 character set, then the position codes for the characters in that
 656 character set could be used directly.  This is not possible, however, if
 657 more than one character set is to be used in the encoding.
 658
 659    For example, the conversion detailed above between bytes in a binary
 660 file and characters is effectively an encoding that encompasses the
 661 three character sets ASCII, Control-1, and Latin-1 in a stream of 8-bit
 662 bytes.
 663
 664    Thus, an encoding can be viewed as a way of encoding characters from
 665 a specified group of character sets using a stream of bytes, each of
 666 which contains a fixed number of bits (but not necessarily 8, as in the
 667 common usage of "byte").
 668
 669    Here are descriptions of a couple of common encodings:
 670
 671 * Menu:
 672
 673 * Japanese EUC (Extended Unix Code)::
 674 * JIS7::
 675
 676 \1f
 677 File: internals.info,  Node: Japanese EUC (Extended Unix Code),  Next: JIS7,  Up: Encodings
 678
 679 Japanese EUC (Extended Unix Code)
 680 ---------------------------------
 681
 682    This encompasses the character sets Printing-ASCII,
 683 Japanese-JISX0201-Kana (half-width katakana, the right half of
 684 JISX0201), Japanese-JISX0208, and Japanese-JISX0212.  It uses 8-bit bytes.
 685
 686    Printing-ASCII and Japanese-JISX0201-Kana are 94-character charsets;
 687 Japanese-JISX0208 and Japanese-JISX0212 are 94x94-character charsets.
 688
 689    The encoding is as follows:
 690
 691      Character set            Representation (PC=position-code)
 692      -------------            --------------
 693      Printing-ASCII           PC1
 694      Japanese-JISX0201-Kana   0x8E       | PC1 + 0x80
 695      Japanese-JISX0208        PC1 + 0x80 | PC2 + 0x80
 696      Japanese-JISX0212        0x8F       | PC1 + 0x80 | PC2 + 0x80
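
As an illustration, an encoder for a single character might look like
the sketch below.  The names are hypothetical.  JISX0212 characters are
written with the standard EUC-JP prefix byte 0x8F, which is what lets a
decoder tell them apart from JISX0208 characters.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags for the charsets covered by Japanese EUC. */
enum euc_charset {
    EUC_ASCII, EUC_JISX0201_KANA, EUC_JISX0208, EUC_JISX0212
};

/* Encode one character into out[]; return the number of bytes written
   (1 to 3).  pc2 is ignored for the dimension-1 charsets. */
size_t euc_jp_encode(enum euc_charset cs, int pc1, int pc2,
                     unsigned char out[3])
{
    size_t n = 0;

    assert(pc1 >= 33 && pc1 <= 126);
    if (cs == EUC_ASCII) {
        out[n++] = (unsigned char) pc1;
        return n;
    }
    if (cs == EUC_JISX0201_KANA) {
        out[n++] = 0x8E;
        out[n++] = (unsigned char) (pc1 + 0x80);
        return n;
    }
    assert(pc2 >= 33 && pc2 <= 126);
    if (cs == EUC_JISX0212)
        out[n++] = 0x8F;
    out[n++] = (unsigned char) (pc1 + 0x80);
    out[n++] = (unsigned char) (pc2 + 0x80);
    return n;
}
```

Because every non-ASCII byte has its high bit set, a decoder can
classify a character from its first byte alone, without keeping state.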
 697
 698 \1f
 699 File: internals.info,  Node: JIS7,  Prev: Japanese EUC (Extended Unix Code),  Up: Encodings
 700
 701 JIS7
 702 ----
 703
 704    This encompasses the character sets Printing-ASCII,
 705 Japanese-JISX0201-Roman (the left half of JISX0201; this character set
 706 is very similar to Printing-ASCII and is a 94-character charset),
 707 Japanese-JISX0208, and Japanese-JISX0201-Kana.  It uses 7-bit bytes.
 708
 709    Unlike Japanese EUC, this is a "modal" encoding, which means that
 710 there are multiple states that the encoding can be in, which affect how
 711 the bytes are to be interpreted.  Special sequences of bytes (called
 712 "escape sequences") are used to change states.
 713
 714    The encoding is as follows:
 715
 716      Character set              Representation (PC=position-code)
 717      -------------              --------------
 718      Printing-ASCII             PC1
 719      Japanese-JISX0201-Roman    PC1
 720      Japanese-JISX0201-Kana     PC1
 721      Japanese-JISX0208          PC1 PC2
 722
 723
 724      Escape sequence   ASCII equivalent   Meaning
 725      ---------------   ----------------   -------
 726      0x1B 0x28 0x4A    ESC ( J            invoke Japanese-JISX0201-Roman
 727      0x1B 0x28 0x49    ESC ( I            invoke Japanese-JISX0201-Kana
 728      0x1B 0x24 0x42    ESC $ B            invoke Japanese-JISX0208
 729      0x1B 0x28 0x42    ESC ( B            invoke Printing-ASCII
 730
 731    Initially, Printing-ASCII is invoked.
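
The modal behaviour can be sketched as a small state machine.  This is
an illustration only, with hypothetical names: it recognizes just the
four escape sequences listed above and does not validate position-code
ranges.

```c
#include <assert.h>
#include <stddef.h>

enum jis7_charset { JIS7_ASCII, JIS7_ROMAN, JIS7_KANA, JIS7_JISX0208 };

struct jis7_char {
    enum jis7_charset charset;
    int pc1;
    int pc2;                     /* 0 for dimension-1 charsets */
};

/* Decode len bytes of JIS7 into out[] (capacity cap).  Return the
   number of characters decoded, or -1 on an unknown escape sequence,
   truncated input, or output overflow. */
int jis7_decode(const unsigned char *in, size_t len,
                struct jis7_char *out, size_t cap)
{
    enum jis7_charset state = JIS7_ASCII;  /* initially Printing-ASCII */
    size_t i = 0, n = 0;

    assert(in && out);
    while (i < len) {
        if (in[i] == 0x1B) {               /* ESC: change state */
            if (len - i < 3)
                return -1;
            if (in[i + 1] == 0x28 && in[i + 2] == 0x4A)
                state = JIS7_ROMAN;
            else if (in[i + 1] == 0x28 && in[i + 2] == 0x49)
                state = JIS7_KANA;
            else if (in[i + 1] == 0x24 && in[i + 2] == 0x42)
                state = JIS7_JISX0208;
            else if (in[i + 1] == 0x28 && in[i + 2] == 0x42)
                state = JIS7_ASCII;
            else
                return -1;
            i += 3;
            continue;
        }
        if (n == cap)
            return -1;
        out[n].charset = state;
        out[n].pc1 = in[i++];
        out[n].pc2 = 0;
        if (state == JIS7_JISX0208) {      /* second position code */
            if (i == len)
                return -1;
            out[n].pc2 = in[i++];
        }
        n++;
    }
    return (int) n;
}
```

The meaning of a byte depends on the escape sequences that preceded
it, so the decoder cannot start in the middle of a stream.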
 732
 733 \1f
 734 File: internals.info,  Node: Internal Mule Encodings,  Next: CCL,  Prev: Encodings,  Up: MULE Character Sets and Encodings
 735
 736 Internal Mule Encodings
 737 =======================
 738
 739    In XEmacs/Mule, each character set is assigned a unique number,
 740 called a "leading byte".  This is used in the encodings of a character.
 741 Leading bytes are in the range 0x80 - 0xFF (except for ASCII, which has
 742 a leading byte of 0), although some leading bytes are reserved.
 743
 744    Charsets whose leading byte is in the range 0x80 - 0x9F are called
 745 "official" and are used for built-in charsets.  Other charsets are
 746 called "private" and have leading bytes in the range 0xA0 - 0xFF; these
 747 are user-defined charsets.
 748
 749    More specifically:
 750
 751      Character set           Leading byte
 752      -------------           ------------
 753      ASCII                   0
 754      Composite               0x80
 755      Dimension-1 Official    0x81 - 0x8D
 756                                (0x8E is free)
 757      Control-1               0x8F
 758      Dimension-2 Official    0x90 - 0x99
 759                                (0x9A - 0x9D are free;
 760                                 0x9E and 0x9F are reserved)
 761      Dimension-1 Private     0xA0 - 0xEF
 762      Dimension-2 Private     0xF0 - 0xFF
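
The leading-byte table can be written as a classification function.
The sketch below uses hypothetical names and simply transcribes the
ranges above; free and reserved values are reported as unassigned.

```c
#include <assert.h>

/* Hypothetical tags for the rows of the leading-byte table. */
enum lb_class {
    LB_ASCII, LB_COMPOSITE, LB_DIM1_OFFICIAL, LB_CONTROL_1,
    LB_DIM2_OFFICIAL, LB_DIM1_PRIVATE, LB_DIM2_PRIVATE, LB_UNASSIGNED
};

/* Classify a leading byte according to the table above. */
enum lb_class classify_leading_byte(int lb)
{
    assert(lb >= 0x00 && lb <= 0xFF);
    if (lb == 0x00)               return LB_ASCII;
    if (lb == 0x80)               return LB_COMPOSITE;
    if (lb >= 0x81 && lb <= 0x8D) return LB_DIM1_OFFICIAL;
    if (lb == 0x8F)               return LB_CONTROL_1;
    if (lb >= 0x90 && lb <= 0x99) return LB_DIM2_OFFICIAL;
    if (lb >= 0xA0 && lb <= 0xEF) return LB_DIM1_PRIVATE;
    if (lb >= 0xF0)               return LB_DIM2_PRIVATE;
    /* 0x01 - 0x7F, the free values 0x8E and 0x9A - 0x9D, and the
       reserved values 0x9E and 0x9F. */
    return LB_UNASSIGNED;
}
```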
 763
 764    There are two internal encodings for characters in XEmacs/Mule.  One
 765 is called "string encoding" and is an 8-bit encoding that is used for
 766 representing characters in a buffer or string.  It uses 1 to 4 bytes per
 767 character.  The other is called "character encoding" and is a 19-bit
 768 encoding that is used for representing characters individually in a
 769 variable.
 770
 771    (In the following descriptions, we'll ignore composite characters for
 772 the moment.  We also give a general (structural) overview first,
 773 followed later by the exact details.)
 774
 775 * Menu:
 776
 777 * Internal String Encoding::
 778 * Internal Character Encoding::
 779
 780 \1f
 781 File: internals.info,  Node: Internal String Encoding,  Next: Internal Character Encoding,  Up: Internal Mule Encodings
 782
 783 Internal String Encoding
 784 ------------------------
 785
 786    ASCII characters are encoded using their position code directly.
 787 Other characters are encoded using their leading byte followed by their
 788 position code(s) with the high bit set.  Characters in private character
 789 sets have their leading byte prefixed with a "leading byte prefix",
 790 which is either 0x9E or 0x9F. (No character sets are ever assigned these
 791 leading bytes.) Specifically:
 792
 793      Character set           Encoding (PC=position-code, LB=leading-byte)
 794      -------------           --------
 795      ASCII                   PC1  |
 796      Control-1               LB   |  PC1 + 0xA0 |
 797      Dimension-1 official    LB   |  PC1 + 0x80 |
 798      Dimension-1 private     0x9E |  LB         | PC1 + 0x80 |
 799      Dimension-2 official    LB   |  PC1 + 0x80 | PC2 + 0x80 |
 800      Dimension-2 private     0x9F |  LB         | PC1 + 0x80 | PC2 + 0x80
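
A sketch of an encoder for this table follows.  It is not the XEmacs
implementation: the function name is hypothetical, the charset is
identified only by its leading byte, and composite characters are
ignored, as in the rest of this section.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one character in string encoding; return the byte count
   (1 to 4).  pc2 is ignored for dimension-1 charsets. */
size_t mule_string_encode(int lb, int pc1, int pc2, unsigned char out[4])
{
    size_t n = 0;
    int dim2;

    if (lb == 0x00) {                    /* ASCII: PC1 */
        assert(pc1 >= 0x00 && pc1 <= 0x7F);
        out[n++] = (unsigned char) pc1;
        return n;
    }
    if (lb == 0x8F) {                    /* Control-1: LB | PC1 + 0xA0 */
        assert(pc1 >= 0x00 && pc1 <= 0x1F);
        out[n++] = (unsigned char) lb;
        out[n++] = (unsigned char) (pc1 + 0xA0);
        return n;
    }
    /* The remaining charsets must carry an assignable leading byte. */
    assert((lb >= 0x81 && lb <= 0x8D) || (lb >= 0x90 && lb <= 0x99)
           || (lb >= 0xA0 && lb <= 0xFF));
    assert(pc1 >= 0x20 && pc1 <= 0x7F);
    dim2 = (lb >= 0x90 && lb <= 0x99) || lb >= 0xF0;
    if (lb >= 0xA0)                      /* private: prefix byte first */
        out[n++] = dim2 ? 0x9F : 0x9E;
    out[n++] = (unsigned char) lb;
    out[n++] = (unsigned char) (pc1 + 0x80);
    if (dim2) {
        assert(pc2 >= 0x20 && pc2 <= 0x7F);
        out[n++] = (unsigned char) (pc2 + 0x80);
    }
    return n;
}
```

Every byte after the first is at least 0xA0: position codes of 0x20
and above gain 0x80, Control-1 position codes gain 0xA0, and private
leading bytes are themselves 0xA0 or above.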
 801
 802    The basic characteristic of this encoding is that the first byte of
 803 all characters is in the range 0x00 - 0x9F, and the second and
 804 following bytes of all characters are in the range 0xA0 - 0xFF.  This
 805 means that it is impossible to get out of sync, or more specifically:
 806
 807   1. Given any byte position, the beginning of the character it is
 808      within can be determined in constant time.
 809
 810   2. Given any byte position at the beginning of a character, the
 811      beginning of the next character can be determined in constant time.
 812
 813   3. Given any byte position at the beginning of a character, the
 814      beginning of the previous character can be determined in constant
 815      time.
 816
 817   4. Textual searches can simply treat encoded strings as if they were
 818      encoded in a one-byte-per-character fashion rather than the actual
 819      multi-byte encoding.
 820
 821    None of the standard non-modal encodings meet all of these
 822 conditions.  For example, EUC satisfies only (2) and (3), while
 823 Shift-JIS and Big5 (not yet described) satisfy only (2). (All non-modal
 824 encodings must satisfy (2), in order to be unambiguous.)
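
Properties (1) to (3) follow directly from the byte ranges.  The
sketch below, with hypothetical function names, shows one way to
write them.  Each loop runs at most three times on well-formed text,
because a character occupies at most four bytes.

```c
#include <assert.h>
#include <stddef.h>

/* A byte below 0xA0 always begins a character. */
int is_first_byte(unsigned char b)
{
    return b < 0xA0;
}

/* (1) Beginning of the character that contains byte position pos. */
size_t char_start(const unsigned char *s, size_t pos)
{
    while (pos > 0 && !is_first_byte(s[pos]))
        pos--;
    return pos;
}

/* (2) Beginning of the next character; pos must start a character.
   Returns len when pos is the last character. */
size_t next_char(const unsigned char *s, size_t len, size_t pos)
{
    assert(pos < len && is_first_byte(s[pos]));
    do
        pos++;
    while (pos < len && !is_first_byte(s[pos]));
    return pos;
}

/* (3) Beginning of the previous character; pos must be nonzero. */
size_t prev_char(const unsigned char *s, size_t pos)
{
    assert(pos > 0);
    return char_start(s, pos - 1);
}
```

None of these functions needs to know the charset or the length of
the character it is skipping over.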
 825
 826 \1f
 827 File: internals.info,  Node: Internal Character Encoding,  Prev: Internal String Encoding,  Up: Internal Mule Encodings
 828
 829 Internal Character Encoding
 830 ---------------------------
 831
 832    One 19-bit word represents a single character.  The word is
 833 separated into three fields:
 834
 835      Bit number:     18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
 836                      <------------> <------------------> <------------------>
 837      Field:                1                  2                    3
 838
 839    Note that fields 2 and 3 hold 7 bits each, while field 1 holds 5
 840 bits.
 841
 842      Character set           Field 1         Field 2         Field 3
 843      -------------           -------         -------         -------
 844      ASCII                      0               0              PC1
 845         range:                                                   (00 - 7F)
 846      Control-1                  0               1              PC1
 847         range:                                                   (00 - 1F)
 848      Dimension-1 official       0            LB - 0x80         PC1
 849         range:                                    (01 - 0D)      (20 - 7F)
 850      Dimension-1 private        0            LB - 0x80         PC1
 851         range:                                    (20 - 6F)      (20 - 7F)
 852      Dimension-2 official    LB - 0x8F         PC1             PC2
 853         range:                    (01 - 0A)       (20 - 7F)      (20 - 7F)
 854      Dimension-2 private     LB - 0xE1         PC1             PC2
 855         range:                    (0F - 1E)       (20 - 7F)      (20 - 7F)
 856      Composite                 0x1F             ?               ?
 857
 858    Note that character codes 0 - 255 are the same as the "binary
 859 encoding" described above.
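
The field layout can be checked with a short sketch.  The function
name is hypothetical, the charset is identified by its leading byte,
and composite characters are left out because their fields are not
specified above.

```c
#include <assert.h>

/* Pack a character into the 19-bit character encoding:
   field 1 = bits 18-14, field 2 = bits 13-7, field 3 = bits 6-0. */
unsigned long make_char_code(int lb, int pc1, int pc2)
{
    unsigned long f1, f2, f3;

    assert(pc1 >= 0 && pc1 <= 0x7F && pc2 >= 0 && pc2 <= 0x7F);
    if (lb == 0x00) {                            /* ASCII */
        f1 = 0;  f2 = 0;  f3 = (unsigned long) pc1;
    } else if (lb == 0x8F) {                     /* Control-1 */
        assert(pc1 <= 0x1F);
        f1 = 0;  f2 = 1;  f3 = (unsigned long) pc1;
    } else if ((lb >= 0x81 && lb <= 0x8D)        /* dim-1 official */
               || (lb >= 0xA0 && lb <= 0xEF)) {  /* dim-1 private  */
        f1 = 0;
        f2 = (unsigned long) (lb - 0x80);
        f3 = (unsigned long) pc1;
    } else if (lb >= 0x90 && lb <= 0x99) {       /* dim-2 official */
        f1 = (unsigned long) (lb - 0x8F);
        f2 = (unsigned long) pc1;
        f3 = (unsigned long) pc2;
    } else {                                     /* dim-2 private */
        assert(lb >= 0xF0 && lb <= 0xFF);
        f1 = (unsigned long) (lb - 0xE1);
        f2 = (unsigned long) pc1;
        f3 = (unsigned long) pc2;
    }
    return (f1 << 14) | (f2 << 7) | f3;
}
```

With leading byte 0x81, position codes 0x20 - 0x7F give codes 160 -
255, and Control-1 gives codes 128 - 159, which is how character codes
0 - 255 come to match the binary encoding.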
 860
 861 \1f
 862 File: internals.info,  Node: CCL,  Prev: Internal Mule Encodings,  Up: MULE Character Sets and Encodings
 863
 864 CCL
 865 ===
 866
 867      CCL PROGRAM SYNTAX:
 868           CCL_PROGRAM := (CCL_MAIN_BLOCK
 869                           [ CCL_EOF_BLOCK ])
 870
 871           CCL_MAIN_BLOCK := CCL_BLOCK
 872           CCL_EOF_BLOCK := CCL_BLOCK
 873
 874           CCL_BLOCK := STATEMENT | (STATEMENT [STATEMENT ...])
 875           STATEMENT :=
 876                   SET | IF | BRANCH | LOOP | REPEAT | BREAK
 877                   | READ | WRITE
 878
 879           SET := (REG = EXPRESSION) | (REG SELF_OP EXPRESSION)
 880                  | INT-OR-CHAR
 881
 882           EXPRESSION := ARG | (EXPRESSION OP ARG)
 883
 884           IF := (if EXPRESSION CCL_BLOCK CCL_BLOCK)
 885           BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
 886           LOOP := (loop STATEMENT [STATEMENT ...])
 887           BREAK := (break)
 888           REPEAT := (repeat)
 889                   | (write-repeat [REG | INT-OR-CHAR | string])
 890                   | (write-read-repeat REG [INT-OR-CHAR | string | ARRAY]?)
 891           READ := (read REG) | (read REG REG)
 892                   | (read-if REG ARITH_OP ARG CCL_BLOCK CCL_BLOCK)
 893                   | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
 894           WRITE := (write REG) | (write REG REG)
 895                   | (write INT-OR-CHAR) | (write STRING) | STRING
 896                   | (write REG ARRAY)
 897           END := (end)
 898
 899           REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
 900           ARG := REG | INT-OR-CHAR
 901           OP :=   + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
 902                   | < | > | == | <= | >= | !=
 903           SELF_OP :=
 904                   += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
 905           ARRAY := '[' INT-OR-CHAR ... ']'
 906           INT-OR-CHAR := INT | CHAR
 907
 908      MACHINE CODE:
 909
 910      The machine code consists of a vector of 32-bit words.
 911      The first such word specifies the start of the EOF section of the code;
 912      this is the code executed to handle any stuff that needs to be done
 913      (e.g. designating back to ASCII and left-to-right mode) after all
 914      other encoded/decoded data has been written out.  This is not used for
 915      charset CCL programs.
 916
 917      REGISTER: 0..7  -- referred to as RRR or rrr
 918
 919      OPERATOR BIT FIELD (27-bit): XXXXXXXXXXXXXXX RRR TTTTT
 920              TTTTT (5-bit): operator type
 921              RRR (3-bit): register number
 922              XXXXXXXXXXXXXXX (15-bit):
 923                      CCCCCCCCCCCCCCC: constant or address
 924                      000000000000rrr: register number
 925
 926      AAAAA:  00000 +
 927              00001 -
 928              00010 *
 929              00011 /
 930              00100 %
 931              00101 &
 932              00110 |
 933              00111 ~
 934
 935              01000 <<
 936              01001 >>
 937              01010 <8
 938              01011 >8
 939              01100 //
 940              01101 not used
 941              01110 not used
 942              01111 not used
 943
 944              10000 <
 945              10001 >
 946              10010 ==
 947              10011 <=
 948              10100 >=
 949              10101 !=
 950
 951      OPERATORS:      TTTTT RRR XX..
 952
 953      SetCS:          00000 RRR C...C      RRR = C...C
 954      SetCL:          00001 RRR .....      RRR = c...c
 955                      c.............c
 956      SetR:           00010 RRR ..rrr      RRR = rrr
 957      SetA:           00011 RRR ..rrr      RRR = array[rrr]
 958                      C.............C      size of array = C...C
 959                      c.............c      contents = c...c
 960
 961      Jump:           00100 000 c...c      jump to c...c
 962      JumpCond:       00101 RRR c...c      if (!RRR) jump to c...c
 963      WriteJump:      00110 RRR c...c      Write1 RRR, jump to c...c
 964      WriteReadJump:  00111 RRR c...c      Write1, Read1 RRR, jump to c...c
 965      WriteCJump:     01000 000 c...c      Write1 C...C, jump to c...c
 966                      C...C
 967      WriteCReadJump: 01001 RRR c...c      Write1 C...C, Read1 RRR,
 968                      C.............C      and jump to c...c
 969      WriteSJump:     01010 000 c...c      WriteS, jump to c...c
 970                      C.............C
 971                      S.............S
 972                      ...
 973      WriteSReadJump: 01011 RRR c...c      WriteS, Read1 RRR, jump to c...c
 974                      C.............C
 975                      S.............S
 976                      ...
 977      WriteAReadJump: 01100 RRR c...c      WriteA, Read1 RRR, jump to c...c
 978                      C.............C      size of array = C...C
 979                      c.............c      contents = c...c
 980                      ...
 981      Branch:         01101 RRR C...C      if (RRR >= 0 && RRR < C..)
 982                      c.............c      branch to (RRR+1)th address
 983      Read1:          01110 RRR ...        read 1-byte to RRR
 984      Read2:          01111 RRR ..rrr      read 2-byte to RRR and rrr
 985      ReadBranch:     10000 RRR C...C      Read1 and Branch
 986                      c.............c
 987                      ...
 988      Write1:         10001 RRR .....      write 1-byte RRR
 989      Write2:         10010 RRR ..rrr      write 2-byte RRR and rrr
 990      WriteC:         10011 000 .....      write 1-char C...C
 991                      C.............C
 992      WriteS:         10100 000 .....      write C..-byte of string
 993                      C.............C
 994                      S.............S
 995                      ...
 996      WriteA:         10101 RRR .....      write array[RRR]
 997                      C.............C      size of array = C...C
 998                      c.............c      contents = c...c
 999                      ...
1000      End:            10110 000 .....      terminate the execution
1001
1002      SetSelfCS:      10111 RRR C...C      RRR AAAAA= C...C
1003                      ..........AAAAA
1004      SetSelfCL:      11000 RRR .....      RRR AAAAA= c...c
1005                      c.............c
1006                      ..........AAAAA
1007      SetSelfR:       11001 RRR ..rrr      RRR AAAAA= rrr
1008                      ..........AAAAA
1009      SetExprCL:      11010 RRR ..rrr      RRR = rrr AAAAA c...c
1010                      c.............c
1011                      ..........AAAAA
1012      SetExprR:       11011 RRR ..rrr      RRR = rrr AAAAA Rrr
1013                      ............Rrr
1014                      ..........AAAAA
1015      JumpCondC:      11100 RRR c...c      if !(RRR AAAAA C..) jump to c...c
1016                      C.............C
1017                      ..........AAAAA
1018      JumpCondR:      11101 RRR c...c      if !(RRR AAAAA rrr) jump to c...c
1019                      ............rrr
1020                      ..........AAAAA
1021      ReadJumpCondC:  11110 RRR c...c      Read1 and JumpCondC
1022                      C.............C
1023                      ..........AAAAA
1024      ReadJumpCondR:  11111 RRR c...c      Read1 and JumpCondR
1025                      ............rrr
1026                      ..........AAAAA
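
As a hedged sketch of how an operator word might be taken apart, the
code below assumes the bit field is read with TTTTT in the five least
significant bits, RRR in the next three, and the X field in the bits
above them.  The struct and function names are hypothetical, and the
sketch makes no assumption about the total width of the X field.

```c
#include <assert.h>

/* Hypothetical decoded form of one CCL operator word. */
struct ccl_op {
    unsigned type;          /* TTTTT: operator type        */
    unsigned reg;           /* RRR: register number        */
    unsigned long arg;      /* X field: constant, address, */
                            /* or register number rrr      */
};

/* Split an operator word into its three fields. */
struct ccl_op ccl_decode_word(unsigned long word)
{
    struct ccl_op op;

    op.type = (unsigned) (word & 0x1F);
    op.reg  = (unsigned) ((word >> 5) & 0x07);
    op.arg  = word >> 8;
    return op;
}

/* Build an operator word from its three fields. */
unsigned long ccl_encode_word(unsigned type, unsigned reg,
                              unsigned long arg)
{
    assert(type < 32 && reg < 8);
    return (arg << 8) | ((unsigned long) reg << 5) | type;
}
```

Under these assumptions, SetCS (type 00000) with register 3 and
constant 5 is the word `(5 << 8) | (3 << 5)', that is 1376.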
1027
1028 \1f
1029 File: internals.info,  Node: The Lisp Reader and Compiler,  Next: Lstreams,  Prev: MULE Character Sets and Encodings,  Up: Top
1030
1031 The Lisp Reader and Compiler
1032 ****************************
1033
1034    Not yet documented.
1035
1036 \1f
1037 File: internals.info,  Node: Lstreams,  Next: Consoles; Devices; Frames; Windows,  Prev: The Lisp Reader and Compiler,  Up: Top
1038
1039 Lstreams
1040 ********
1041
1042    An "lstream" is an internal Lisp object that provides a generic
1043 buffering stream implementation.  Conceptually, you send data to the
1044 stream or read data from the stream, not caring what's on the other end
1045 of the stream.  The other end could be another stream, a file
1046 descriptor, a stdio stream, a fixed block of memory, a reallocating
1047 block of memory, etc.  The main purpose of the stream is to provide a
1048 standard interface and to do buffering.  Macros are defined to read or
1049 write characters, so the calling functions do not have to worry about
1050 blocking data together in order to achieve efficiency.
1051
1052 * Menu:
1053
1054 * Creating an Lstream::         Creating an lstream object.
1055 * Lstream Types::               Different sorts of things that are streamed.
1056 * Lstream Functions::           Functions for working with lstreams.
1057 * Lstream Methods::             Creating new lstream types.
1058
1059 \1f
1060 File: internals.info,  Node: Creating an Lstream,  Next: Lstream Types,  Up: Lstreams
1061
1062 Creating an Lstream
1063 ===================
1064
1065    Lstreams come in different types, depending on what is being
1066 interfaced to.  Although the primitive for creating new lstreams is
1067 `Lstream_new()', generally you do not call this directly.  Instead, you
1068 call some type-specific creation function, which creates the lstream
1069 and initializes it as appropriate for the particular type.
1070
1071    All lstream creation functions take a MODE argument, specifying what
1072 mode the lstream should be opened as.  This controls whether the
1073 lstream is for input or output, and optionally whether data should be
1074 blocked up in units of MULE characters.  Note that some types of
1075 lstreams can only be opened for input; others only for output; and
1076 others can be opened either way.  #### Richard Mlynarik thinks that
1077 there should be a strict separation between input and output streams,
1078 and he's probably right.
1079
1080    MODE is a string, one of:
1081
1082 `"r"'
1083      Open for reading.
1084
1085 `"w"'
1086      Open for writing.
1087
1088 `"rc"'
1089      Open for reading, but "read" never returns partial MULE characters.
1090
1091 `"wc"'
1092      Open for writing, but never writes partial MULE characters.
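
The four mode strings can be interpreted as in the sketch below.  This
is not the lstream implementation: `struct stream_mode' and
`parse_stream_mode' are hypothetical names used only to illustrate the
list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical decoded form of a MODE string. */
struct stream_mode {
    bool readable;
    bool writable;
    bool whole_chars;   /* never split a MULE character */
};

/* Parse one of "r", "w", "rc", "wc"; return false for anything else. */
bool parse_stream_mode(const char *mode, struct stream_mode *out)
{
    assert(mode && out);
    out->readable = out->writable = out->whole_chars = false;
    if (strcmp(mode, "r") == 0) {
        out->readable = true;
    } else if (strcmp(mode, "w") == 0) {
        out->writable = true;
    } else if (strcmp(mode, "rc") == 0) {
        out->readable = true;
        out->whole_chars = true;
    } else if (strcmp(mode, "wc") == 0) {
        out->writable = true;
        out->whole_chars = true;
    } else {
        return false;
    }
    return true;
}
```

A string such as "rw" is rejected, since none of the listed modes
opens a stream for both reading and writing.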
1093
1094 \1f
1095 File: internals.info,  Node: Lstream Types,  Next: Lstream Functions,  Prev: Creating an Lstream,  Up: Lstreams
1096
1097 Lstream Types
1098 =============
1099
1100 stdio
1101
1102 filedesc
1103
1104 lisp-string
1105
1106 fixed-buffer
1107
1108 resizing-buffer
1109
1110 dynarr
1111
1112 lisp-buffer
1113
1114 print
1115
1116 decoding
1117
1118 encoding