This is ../info/internals.info, produced by makeinfo version 4.0 from
internals/internals.texi.

INFO-DIR-SECTION XEmacs Editor
START-INFO-DIR-ENTRY
* Internals: (internals).       XEmacs Internals Manual.
END-INFO-DIR-ENTRY

   Copyright (C) 1992 - 1996 Ben Wing.  Copyright (C) 1996, 1997 Sun
Microsystems.  Copyright (C) 1994 - 1998 Free Software Foundation.
Copyright (C) 1994, 1995 Board of Trustees, University of Illinois.

   Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that the
entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.

   Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided also
that the section entitled "GNU General Public License" is included
exactly as in the original, and provided that the entire resulting
derived work is distributed under the terms of a permission notice
identical to this one.

   Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that the section entitled "GNU General Public License"
may be included in a translation approved by the Free Software
Foundation instead of in the original English.


File: internals.info,  Node: The XEmacs Object System (Abstractly Speaking),  Next: How Lisp Objects Are Represented in C,  Prev: XEmacs From the Inside,  Up: Top

The XEmacs Object System (Abstractly Speaking)
**********************************************

   At the heart of the Lisp interpreter is its management of objects.
XEmacs Lisp contains many built-in objects, some of which are simple
and others of which can be very complex; and some of which are very
common, and others of which are rarely used or are only used
internally. (Since the Lisp allocation system, with its automatic
reclamation of unused storage, is so much more convenient than
`malloc()' and `free()', the C code makes extensive use of it in its
internal operations.)

   The basic Lisp objects are

`integer'
     28 or 31 bits of precision, or 60 or 63 bits on 64-bit machines;
     the reason for this is described below when the internal Lisp
     object representation is described.

`float'
     Same precision as a double in C.

`cons'
     A simple container for two Lisp objects, used to implement lists
     and most other data structures in Lisp.

`char'
     An object representing a single character of text; chars behave
     like integers in many ways but are logically considered text
     rather than numbers and have a different read syntax. (The read
     syntax for a char contains the char itself or some textual
     encoding of it--for example, a Japanese Kanji character might be
     encoded as `^[$(B#&^[(B' using the ISO-2022 encoding
     standard--rather than the numerical representation of the char;
     this way, if the mapping between chars and integers changes, which
     is quite possible for Kanji characters and other extended
     characters, the same character will still be created.  Note that
     some primitives confuse chars and integers.  The worst culprit is
     `eq', which makes a special exception and considers a char to be
     `eq' to its integer equivalent, even though in no other case are
     objects of two different types `eq'.  The reason for this
     monstrosity is compatibility with existing code; the separation of
     char from integer came fairly recently.)

`symbol'
     An object that contains Lisp objects and is referred to by name;
     symbols are used to implement variables and named functions and to
     provide the equivalent of preprocessor constants in C.

`vector'
     A one-dimensional array of Lisp objects providing constant-time
     access to any of the objects; access to an arbitrary object in a
     vector is faster than for lists, but the operations that can be
     done on a vector are more limited.

`string'
     Self-explanatory; behaves much like a vector of chars but has a
     different read syntax and is stored and manipulated more compactly.

`bit-vector'
     A vector of bits; similar to a string in spirit.

`compiled-function'
     An object containing compiled Lisp code, known as "byte code".

`subr'
     A Lisp primitive, i.e. a Lisp-callable function implemented in C.

   Note that there is no basic "function" type, as in more powerful
versions of Lisp (where it's called a "closure").  XEmacs Lisp does not
provide the closure semantics implemented by Common Lisp and Scheme.
The guts of a function in XEmacs Lisp are represented in one of four
ways: a symbol specifying another function (when one function is an
alias for another), a list (whose first element must be the symbol
`lambda') containing the function's source code, a compiled-function
object, or a subr object. (In other words, given a symbol specifying
the name of a function, calling `symbol-function' to retrieve the
contents of the symbol's function cell will return one of these types
of objects.)

   XEmacs Lisp also contains numerous specialized objects used to
implement the editor:

`buffer'
     Stores text like a string, but is optimized for insertion and
     deletion and has certain other properties that can be set.

`frame'
     An object with various properties whose displayable representation
     is a "window" in window-system parlance.

`window'
     A section of a frame that displays the contents of a buffer; often
     called a "pane" in window-system parlance.

`window-configuration'
     An object that represents a saved configuration of windows in a
     frame.

`device'
     An object representing a screen on which frames can be displayed;
     equivalent to a "display" in the X Window System and a "TTY" in
     character mode.

`face'
     An object specifying the appearance of text or graphics; it has
     properties such as font, foreground color, and background color.

`marker'
     An object that refers to a particular position in a buffer and
     moves around as text is inserted and deleted to stay in the same
     relative position to the text around it.

`extent'
     Similar to a marker but covers a range of text in a buffer; can
     also specify properties of the text, such as a face in which the
     text is to be displayed, whether the text is invisible or
     unmodifiable, etc.

`event'
     Generated by calling `next-event' and contains information
     describing a particular event happening in the system, such as the
     user pressing a key or a process terminating.

`keymap'
     An object that maps from events (described using lists, vectors,
     and symbols rather than with an event object because the mapping
     is for classes of events, rather than individual events) to
     functions to execute or other events to recursively look up; the
     functions are described by name, using a symbol, or using lists to
     specify the function's code.

`glyph'
     An object that describes the appearance of an image (e.g. a pixmap)
     on the screen; glyphs can be attached to the beginning or end of
     extents and in some future version of XEmacs will be able to be
     inserted directly into a buffer.

`process'
     An object that describes a connection to an externally-running
     process.

   There are some other, less-commonly-encountered general objects:

`hash-table'
     An object that maps from an arbitrary Lisp object to another
     arbitrary Lisp object, using hashing for fast lookup.

`obarray'
     A limited form of hash-table that maps from strings to symbols;
     obarrays are used to look up a symbol given its name and are not
     actually their own object type but are kludgily represented using
     vectors with hidden fields (this representation derives from GNU
     Emacs).

`specifier'
     A complex object used to specify the value of a display property; a
     default value is given and different values can be specified for
     particular frames, buffers, windows, devices, or classes of device.

`char-table'
     An object that maps from chars or classes of chars to arbitrary
     Lisp objects; internally char tables use a complex nested-vector
     representation that is optimized to the way characters are
     represented as integers.

`range-table'
     An object that maps from ranges of integers to arbitrary Lisp
     objects.

   And some strange special-purpose objects:

`charset'
`coding-system'
     Objects used when MULE, or multi-lingual/Asian-language, support is
     enabled.

`color-instance'
`font-instance'
`image-instance'
     Objects that encapsulate a window-system resource; instances are
     mostly used internally but are exposed on the Lisp level for
     cleanness of the specifier model and because it's occasionally
     useful for a Lisp program to create or query the properties of
     instances.

`subwindow'
     An object that encapsulates a "subwindow" resource, i.e. a
     window-system child window that is drawn into by an external
     process; this object should be integrated into the glyph system
     but isn't yet, and may change form when this is done.

`tooltalk-message'
`tooltalk-pattern'
     Objects that represent resources used in the ToolTalk interprocess
     communication protocol.

`toolbar-button'
     An object used in conjunction with the toolbar.

   And objects that are only used internally:

`opaque'
     A generic object for encapsulating arbitrary memory; this allows
     you the generality of `malloc()' and the convenience of the Lisp
     object system.

`lstream'
     A buffering I/O stream, used to provide a unified interface to
     anything that can accept output or provide input, such as a file
     descriptor, a stdio stream, a chunk of memory, a Lisp buffer, a
     Lisp string, etc.; it's a Lisp object to make its memory
     management more convenient.

`char-table-entry'
     Subsidiary objects in the internal char-table representation.

`extent-auxiliary'
`menubar-data'
`toolbar-data'
     Various special-purpose objects that are basically just used to
     encapsulate memory for particular subsystems, similar to the more
     general "opaque" object.

`symbol-value-forward'
`symbol-value-buffer-local'
`symbol-value-varalias'
`symbol-value-lisp-magic'
     Special internal-only objects that are placed in the value cell of
     a symbol to indicate that there is something special with this
     variable--e.g. it has no value, it mirrors another variable, or
     it mirrors some C variable; there is really only one kind of
     object, called a "symbol-value-magic", but it is sort-of halfway
     kludged into semi-different object types.

   Some types of objects are "permanent", meaning that once created,
they do not disappear until explicitly destroyed, using a function such
as `delete-buffer', `delete-window', `delete-frame', etc.  Others will
disappear once they are no longer used, through the garbage collection
mechanism.  Buffers, frames, windows, devices, and processes are among
the objects that are permanent.  Note that some objects can go both
ways: Faces can be created either way; extents are normally permanent,
but detached extents (extents not referring to any text, as happens to
some extents when the text they are referring to is deleted) are
temporary.  Note that some permanent objects, such as faces and coding
systems, cannot be deleted.  Note also that windows are unique in that
they can be _undeleted_ after having previously been deleted. (This
happens as a result of restoring a window configuration.)

   Note that many types of objects have a "read syntax", i.e. a way of
specifying an object of that type in Lisp code.  When you load a Lisp
file, or type in code to be evaluated, what really happens is that the
function `read' is called, which reads some text and creates an object
based on the syntax of that text; then `eval' is called, which possibly
does something special; then this loop repeats until there's no more
text to read. (`eval' only actually does something special with
symbols, which causes the symbol's value to be returned, similar to
referencing a variable; and with conses [i.e. lists], which cause a
function invocation.  All other values are returned unchanged.)

   The read syntax

     17297

   converts to an integer whose value is 17297.

     1.983e-4

   converts to a float whose value is 1.983e-4, or .0001983.

     ?b

   converts to a char that represents the lowercase letter b.

     ?^[$(B#&^[(B

   (where `^[' actually is an `ESC' character) converts to a particular
Kanji character when using an ISO-2022-based coding system for input.
(To decode this goo: `ESC' begins an escape sequence; `ESC $ (' is a
class of escape sequences meaning "switch to a 94x94 character set";
`ESC $ ( B' means "switch to Japanese Kanji"; `#' and `&' collectively
index into a 94-by-94 array of characters [subtract 33 from the ASCII
value of each character to get the corresponding index]; `ESC (' is a
class of escape sequences meaning "switch to a 94 character set"; `ESC
(B' means "switch to US ASCII".  It is a coincidence that the letter
`B' is used to denote both Japanese Kanji and US ASCII.  If the first
`B' were replaced with an `A', you'd be requesting a Chinese Hanzi
character from the GB2312 character set.)
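
   The index arithmetic in the parenthetical above can be checked with
a few lines of C.  This is only an illustrative sketch: `index_94' is a
hypothetical helper, not an XEmacs function, and it assumes an
ASCII-compatible execution character set.

```c
/* Hypothetical helper, not part of XEmacs: convert one byte of a
   94x94 character code into its 0-based row or column index by
   subtracting 33, as described in the text.  */
static int
index_94 (unsigned char byte)
{
  return byte - 33;
}
```

   With ASCII, `#' is 35 and `&' is 38, so `index_94 ('#')' is 2 and
`index_94 ('&')' is 5.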

     "foobar"

   converts to a string.

     foobar

   converts to a symbol whose name is `"foobar"'.  This is done by
looking up the string equivalent in the global variable `obarray',
whose contents should be an obarray.  If no symbol is found, a new
symbol with the name `"foobar"' is automatically created and added to
`obarray'; this process is called "interning" the symbol.
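
   The idea of interning can be sketched in C with a small chained hash
table.  This is not how XEmacs represents obarrays (they are vectors
with hidden fields, as noted earlier); the names `struct symbol',
`sketch_obarray', `hash_name' and `intern_sketch' are all hypothetical,
and the bucket count is arbitrary.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_OBARRAY_SIZE 31  /* arbitrary bucket count */

struct symbol
{
  char *name;
  struct symbol *next;          /* next symbol in the same bucket */
};

/* Declared without an initializer; static storage starts out zeroed.  */
static struct symbol *sketch_obarray[SKETCH_OBARRAY_SIZE];

static unsigned int
hash_name (const char *name)
{
  unsigned int h = 0;
  while (*name)
    h = h * 31 + (unsigned char) *name++;
  return h % SKETCH_OBARRAY_SIZE;
}

/* Return the symbol named NAME.  If no such symbol exists yet, create
   one and add it to the table, so that later lookups of the same name
   return the same object.  */
static struct symbol *
intern_sketch (const char *name)
{
  unsigned int bucket = hash_name (name);
  struct symbol *sym;

  for (sym = sketch_obarray[bucket]; sym; sym = sym->next)
    if (strcmp (sym->name, name) == 0)
      return sym;               /* already interned */

  sym = (struct symbol *) malloc (sizeof (struct symbol));
  if (!sym)
    abort ();
  sym->name = (char *) malloc (strlen (name) + 1);
  if (!sym->name)
    abort ();
  strcpy (sym->name, name);
  sym->next = sketch_obarray[bucket];
  sketch_obarray[bucket] = sym;
  return sym;
}
```

   The property that matters is identity: two calls with the same name
return the same pointer, which is why symbols read from different
places in a program can be compared cheaply.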

     (foo . bar)

   converts to a cons cell containing the symbols `foo' and `bar'.

     (1 a 2.5)

   converts to a three-element list containing the specified objects
(note that a list is actually a set of nested conses; see the XEmacs
Lisp Reference).

     [1 a 2.5]

   converts to a three-element vector containing the specified objects.

     #[... ... ... ...]

   converts to a compiled-function object (the actual contents are not
shown since they are not relevant here; look at a file that ends with
`.elc' for examples).

     #*01110110

   converts to a bit-vector.

     #s(hash-table ... ...)

   converts to a hash table (the actual contents are not shown).

     #s(range-table ... ...)

   converts to a range table (the actual contents are not shown).

     #s(char-table ... ...)

   converts to a char table (the actual contents are not shown).

   Note that the `#s()' syntax is the general syntax for structures,
which are not really implemented in XEmacs Lisp but should be.

   When an object is printed out (using `print' or a related function),
the read syntax is used, so that the same object can be read in again.

   The other objects do not have read syntaxes, usually because it does
not really make sense to create them in this fashion (e.g. processes,
where it doesn't make sense to have a subprocess created as a side
effect of reading some Lisp code), or because they can't be created at
all (e.g. subrs).  Permanent objects, as a rule, do not have a read
syntax; nor do most complex objects, which contain too much state to be
easily initialized through a read syntax.


File: internals.info,  Node: How Lisp Objects Are Represented in C,  Next: Rules When Writing New C Code,  Prev: The XEmacs Object System (Abstractly Speaking),  Up: Top

How Lisp Objects Are Represented in C
*************************************

   Lisp objects are represented in C using a 32-bit or 64-bit machine
word (depending on the processor; e.g. DEC Alphas use 64-bit Lisp
objects and most other processors use 32-bit Lisp objects).  The
representation stuffs a pointer together with a tag, as follows:

      [ 3 3 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 ]
      [ 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 ]
     
        <---------------------------------------------------------> <->
                 a pointer to a structure, or an integer            tag

   A tag of 00 is used for all pointer object types, a tag of 10 is used
for characters, and the other two tags 01 and 11 are joined together to
form the integer object type.  This representation gives us 31-bit
integers and 30-bit characters, while pointers are represented directly
without any bit masking or shifting.  This representation, though,
assumes that pointers to structs are always aligned to multiples of 4,
so the lower 2 bits are always zero.
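
   As a rough illustration of this layout, the following C sketch
classifies a word by its low bits.  All names here (`obj_word',
`is_pointer', `tag_char' and so on) are hypothetical and are not the
XEmacs macros; the sketch also assumes that pointers convert to and
from `uintptr_t' without loss.

```c
#include <stdint.h>

typedef uintptr_t obj_word;     /* stand-in for a Lisp object word */

/* Tag 00: pointer.  Works only if the pointed-to structure is aligned
   to a multiple of 4, so that the low two bits are already zero.  */
static int is_pointer (obj_word w) { return (w & 3) == 0; }

/* Tag 10: character.  */
static int is_char (obj_word w) { return (w & 3) == 2; }

/* Tags 01 and 11: integer, so only the lowest bit needs testing.  */
static int is_int (obj_word w) { return (w & 1) == 1; }

/* Pointers are stored directly, without masking or shifting.  */
static obj_word tag_pointer (void *p) { return (obj_word) p; }
static void *untag_pointer (obj_word w) { return (void *) w; }

/* Characters are shifted past the two tag bits.  */
static obj_word tag_char (uint32_t c) { return ((obj_word) c << 2) | 2; }
static uint32_t untag_char (obj_word w) { return (uint32_t) (w >> 2); }
```

   Because integers need only the lowest bit as a tag, they keep one
more bit of payload than characters do, which is where the 31-bit and
30-bit figures come from on a 32-bit word.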

   Lisp objects use the typedef `Lisp_Object', but the actual C type
used for the Lisp object can vary.  It can be either a simple type
(`long' on the DEC Alpha, `int' on other machines) or a structure whose
fields are bit fields that line up properly (actually, a union of
structures is used).  Generally the simple integral type is preferable
because it ensures that the compiler will actually use a machine word
to represent the object (some compilers will use more general and less
efficient code for unions and structs even if they can fit in a machine
word).  The union type, however, has the advantage of stricter type
checking.  If you accidentally pass an integer where a Lisp object is
desired, you get a compile error.  The choice of which type to use is
determined by the preprocessor constant `USE_UNION_TYPE' which is
defined via the `--use-union-type' option to `configure'.
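
   The type-checking benefit can be seen in a much smaller setting.  The
sketch below wraps a machine word in a single-member struct, which is
simpler than the union of structures XEmacs actually uses; the names
`checked_object', `wrap', `unwrap' and `low_bit' are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for the structure-based representation.  */
typedef struct { intptr_t bits; } checked_object;

static checked_object
wrap (intptr_t bits)
{
  checked_object o;
  o.bits = bits;
  return o;
}

static intptr_t
unwrap (checked_object o)
{
  return o.bits;
}

/* Accepts only a checked_object.  A call such as low_bit (42) is
   rejected at compile time, whereas with
   `typedef intptr_t checked_object;' it would compile silently.  */
static int
low_bit (checked_object o)
{
  return (int) (unwrap (o) & 1);
}
```

   The cost is the one the text mentions: the compiler is free to pass
and return the struct less efficiently than a bare integer.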

   Various macros are used to convert between Lisp_Objects and the
corresponding C type.  Macros of the form `XINT()', `XCHAR()',
`XSTRING()', and `XSYMBOL()' do any required bit shifting and/or
masking and cast the result to the appropriate type.  `XINT()' needs
to be a bit tricky
so that negative numbers are properly sign-extended.  Since integers
are stored left-shifted, if the right-shift operator does an arithmetic
shift (i.e. it leaves the most-significant bit as-is rather than
shifting in a zero, so that it mimics a divide-by-two even for negative
numbers) the shift to remove the tag bit is enough.  This is the case
on all the systems we support.
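
   A minimal sketch of the shift-based conversion follows.  The names
`encode_int' and `decode_int' are hypothetical stand-ins, not the real
macros.  The sketch relies on the same assumption the text states: that
`>>' applied to a negative signed value is an arithmetic shift, which
the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Store N shifted left by one, with a 1 in the lowest (tag) bit.  The
   shift is done on an unsigned value because left-shifting a negative
   signed value is undefined behavior in C.  */
static intptr_t
encode_int (intptr_t n)
{
  return (intptr_t) (((uintptr_t) n << 1) | 1);
}

/* Remove the tag bit.  With an arithmetic right shift the sign bit is
   replicated, so negative numbers come back out negative.  */
static intptr_t
decode_int (intptr_t word)
{
  return word >> 1;
}
```

   If the shift were logical instead, a zero would be shifted into the
top bit and every negative integer would decode as a large positive
one, which is the problem the text alludes to.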

   Note that when `ERROR_CHECK_TYPECHECK' is defined, the converter
macros become more complicated--they check the tag bits and/or the type
field in the first four bytes of a record type to ensure that the
object is really of the correct type.  This is great for catching places
where an incorrect type is being dereferenced--this typically results
in a pointer being dereferenced as the wrong type of structure, with
unpredictable (and sometimes not easily traceable) results.

   There are similar `XSETTYPE()' macros that construct a Lisp object.
These macros are of the form `XSETTYPE (LVALUE, RESULT)', i.e. they
have to be a statement rather than just used in an expression.  The
reason for this is that standard C doesn't let you "construct" a
structure (but GCC does).  Granted, this sometimes isn't too
convenient; for the case of integers, at least, you can use the
function `make_int()', which constructs and _returns_ an integer Lisp
object.  Note that the `XSETTYPE()' macros are also affected by
`ERROR_CHECK_TYPECHECK' and make sure that the structure is of the
right type in the case of record types, where the type is contained in
the structure.
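
   The difference between the statement form and the function form can
be sketched as follows.  The names `sketch_obj', `SET_INT_OBJ' and
`make_int_obj' are hypothetical and stand in for the real type and
macros; the encoding used is the illustrative one-tag-bit integer
layout, not a copy of the XEmacs definitions.

```c
#include <stdint.h>

typedef struct { intptr_t bits; } sketch_obj;   /* hypothetical type */

/* Statement form: fills in an existing lvalue.  It cannot be used
   inside an expression.  */
#define SET_INT_OBJ(lvalue, n) do {                             \
  (lvalue).bits = (intptr_t) (((uintptr_t) (n) << 1) | 1);      \
} while (0)

/* Function form: builds the object in a local variable and returns
   it, so a call can appear wherever an expression is allowed.  */
static sketch_obj
make_int_obj (intptr_t n)
{
  sketch_obj result;
  SET_INT_OBJ (result, n);
  return result;
}
```

   A caller that needs the object as a function argument can write
`f (make_int_obj (21))', whereas the macro form requires a temporary
variable and a separate statement.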

   The C programmer is responsible for *guaranteeing* that a
Lisp_Object is the correct type before using the `XTYPE' macros.  This
is especially important in the case of lists.  Use `XCAR' and `XCDR' if
a Lisp_Object is certainly a cons cell, else use `Fcar()' and `Fcdr()'.
Trust other C code, but not Lisp code.  On the other hand, if XEmacs
has an internal logic error, it's better to crash immediately, so
sprinkle `assert()'s and "unreachable" `abort()'s liberally about the
source code.  Where performance is an issue, use `type_checking_assert',
`bufpos_checking_assert', and `gc_checking_assert', which do nothing
unless the corresponding configure error checking flag was specified.
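
   One plausible way to build an assertion that costs nothing unless a
checking flag is set is shown below.  This is a sketch of the general
technique only: `SKETCH_CHECK_TYPES', `sketch_type_checking_assert',
`struct cons_cell' and `fast_car' are hypothetical names, and the real
XEmacs definitions may differ.

```c
#include <assert.h>

/* Hypothetical flag: define it on the compiler command line to turn
   the check on.  */
#ifdef SKETCH_CHECK_TYPES
# define sketch_type_checking_assert(cond) assert (cond)
#else
# define sketch_type_checking_assert(cond) ((void) 0)
#endif

struct cons_cell { int car; struct cons_cell *cdr; };

/* Fast accessor for callers that already know CELL is valid.  The
   check disappears entirely unless SKETCH_CHECK_TYPES is defined.  */
static int
fast_car (struct cons_cell *cell)
{
  sketch_type_checking_assert (cell != 0);
  return cell->car;
}
```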


File: internals.info,  Node: Rules When Writing New C Code,  Next: A Summary of the Various XEmacs Modules,  Prev: How Lisp Objects Are Represented in C,  Up: Top

Rules When Writing New C Code
*****************************

   The XEmacs C Code is extremely complex and intricate, and there are
many rules that are more or less consistently followed throughout the
code.  Many of these rules are not obvious, so they are explained here.
It is of the utmost importance that you follow them.  If you don't,
you may get something that appears to work, but which will crash in odd
situations, often in code far away from where the actual breakage is.

* Menu:

* General Coding Rules::
* Writing Lisp Primitives::
* Adding Global Lisp Variables::
* Coding for Mule::
* Techniques for XEmacs Developers::


File: internals.info,  Node: General Coding Rules,  Next: Writing Lisp Primitives,  Prev: Rules When Writing New C Code,  Up: Rules When Writing New C Code

General Coding Rules
====================

   The C code is actually written in a dialect of C called "Clean C",
meaning that it can be compiled, mostly warning-free, with either a C or
C++ compiler.  Coding in Clean C has several advantages over plain C.
C++ compilers are more nit-picking, and a number of coding errors have
been found by compiling with C++.  The ability to use both C and C++
tools means that a greater variety of development tools are available to
the developer.

   Almost every module contains a `syms_of_*()' function and a
`vars_of_*()' function.  The former declares any Lisp primitives you
have defined and defines any symbols you will be using.  The latter
declares any global Lisp variables you have added and initializes global
C variables in the module.  For each such function, declare it in
`symsinit.h' and make sure it's called in the appropriate place in
`emacs.c'.  *Important*: There are stringent requirements on exactly
what can go into these functions.  See the comment in `emacs.c'.  The
reason for this is to avoid obscure unwanted interactions during
initialization.  If you don't follow these rules, you'll be sorry!  If
you want to do anything that isn't allowed, create a
`complex_vars_of_*()' function for it.  Doing this is tricky, though:
You have to make sure your function is called at the right time so that
all the initialization dependencies work out.

   Every module includes `<config.h>' (angle brackets so that
`--srcdir' works correctly; `config.h' may or may not be in the same
directory as the C sources) and `lisp.h'.  `config.h' must always be
included before any other header files (including system header files)
to ensure that certain tricks played by various `s/' and `m/' files
work out correctly.

   When including header files, always use angle brackets, not double
quotes, except when the file to be included is always in the same
directory as the including file.  If either file is a generated file,
then that is not likely to be the case.  In order to understand why we
have this rule, imagine what happens when you do a build in the source
directory using `./configure' and another build in another directory
using `../work/configure'.  There will be two different `config.h'
files.  Which one will be used if you `#include "config.h"'?

   *All global and static variables that are to be modifiable must be
declared uninitialized.*  This means that you may not use the "declare
with initializer" form for these variables, such as `int some_variable
= 0;'.  The reason for this has to do with some kludges done during the
dumping process: If possible, the initialized data segment is re-mapped
so that it becomes part of the (unmodifiable) code segment in the
dumped executable.  This allows this memory to be shared among multiple
running XEmacs processes.  XEmacs is careful to place as much constant
data as possible into initialized variables during the `temacs' phase.
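
   In practice the rule looks like this.  The module name and the
variables are hypothetical; the point is only that the declaration
carries no initializer and the value is assigned from an initialization
function in the style of `vars_of_*()'.

```c
/* Modifiable variables are declared without an initializer...  */
static int some_variable;
static const char *some_name;

/* ...and receive their starting values from an initialization
   function instead (a hypothetical vars_of_*()-style function).  */
static void
vars_of_sketch (void)
{
  some_variable = 0;
  some_name = "sketch";
}
```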

   *Please note:* This kludge only works on a few systems nowadays, and
is rapidly becoming irrelevant because most modern operating systems
provide "copy-on-write" semantics.  All data is initially shared
between processes, and a private copy is automatically made (on a
page-by-page basis) when a process first attempts to write to a page of
memory.

   Formerly, there was a requirement that static variables not be
declared inside of functions.  This had to do with another hack in
the same vein as what was just described: old USG systems put
statically-declared variables in the initialized data space, so those
header files had a `#define static' declaration. (That way, the
data-segment remapping described above could still work.) This fails
badly on static variables inside of functions, which suddenly become
automatic variables; therefore, you weren't supposed to have any of
them.  This awful kludge has been removed in XEmacs because

  1. almost all of the systems that used this kludge ended up having to
     disable the data-segment remapping anyway;

  2. the only systems that didn't were extremely outdated ones;

  3. this hack completely messed up inline functions.

   The C source code makes heavy use of C preprocessor macros.  One
popular macro style is:

     #define FOO(var, value) do {            \
       Lisp_Object FOO_value = (value);      \
       ... /* compute using FOO_value */     \
       (var) = bar;                          \
     } while (0)

   The `do {...} while (0)' is a standard trick to allow FOO to have
statement semantics, so that it can safely be used within an `if'
statement in C, for example.  Multiple evaluation is prevented by
copying a supplied argument into a local variable, so that
`FOO(var,fun(1))' only calls `fun' once.
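
   Both properties can be demonstrated with a pair of toy macros.  The
names `SET_DOUBLE', `SET_DOUBLE_UNSAFE', `next_value' and `call_count'
are invented for this sketch and do not appear in the XEmacs sources.

```c
static int call_count;          /* counts calls to next_value () */

static int
next_value (void)
{
  return ++call_count;
}

/* Unsafe: VALUE appears twice in the expansion, so any side effects
   in the argument happen twice.  */
#define SET_DOUBLE_UNSAFE(var, value) ((var) = (value) + (value))

/* Safe: VALUE is copied into a local exactly once, and the
   do {...} while (0) wrapper makes the expansion a single statement.  */
#define SET_DOUBLE(var, value) do {     \
  int SET_DOUBLE_value = (value);       \
  (var) = SET_DOUBLE_value * 2;         \
} while (0)
```

   Because the safe macro expands to one statement, it can be followed
by a semicolon and an `else' without changing how the `if' parses:

     if (condition)
       SET_DOUBLE (x, next_value ());
     else
       x = -1;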

   Lisp lists are popular data structures in the C code as well as in
Elisp.  There are two sets of macros that iterate over lists.
`EXTERNAL_LIST_LOOP_N' should be used when the list has been supplied
by the user, and cannot be trusted to be acyclic and nil-terminated.  A
`malformed-list' or `circular-list' error will be generated if the list
being iterated over is not entirely kosher.  `LIST_LOOP_N', on the
other hand, is faster and less safe, and can be used only on trusted
lists.
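
   The trade-off between the two kinds of loop can be illustrated on an
ordinary C linked list.  The sketch below is not the XEmacs
implementation: it uses Floyd's tortoise-and-hare technique to detect
circularity, returns -1 rather than signalling a Lisp error, and does
not model the malformed-list case.  The names `struct cell',
`checked_length' and `trusted_length' are hypothetical.

```c
#include <stddef.h>

struct cell { int value; struct cell *next; };

/* Return the length of LIST, or -1 if it is circular.  SLOW advances
   one cell for every two cells that FAST advances, so FAST can meet
   SLOW again only if the list loops back on itself.  */
static long
checked_length (struct cell *list)
{
  struct cell *slow = list, *fast = list;
  long len = 0;

  while (fast)
    {
      fast = fast->next;
      len++;
      if (!fast)
        break;
      fast = fast->next;
      len++;
      slow = slow->next;
      if (fast && fast == slow)
        return -1;
    }
  return len;
}

/* Trusted variant: no checks at all.  It is faster, but it never
   terminates if handed a circular list.  */
static long
trusted_length (struct cell *list)
{
  long len = 0;
  for (; list; list = list->next)
    len++;
  return len;
}
```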

   Related macros are `GET_EXTERNAL_LIST_LENGTH' and `GET_LIST_LENGTH',
which calculate the length of a list; `GET_EXTERNAL_LIST_LENGTH' also
validates that the list is proper.  The macros
`EXTERNAL_LIST_LOOP_DELETE_IF' and `LIST_LOOP_DELETE_IF' delete from a
Lisp list the elements that satisfy some predicate.


File: internals.info,  Node: Writing Lisp Primitives,  Next: Adding Global Lisp Variables,  Prev: General Coding Rules,  Up: Rules When Writing New C Code

Writing Lisp Primitives
=======================

   Lisp primitives are Lisp functions implemented in C.  The details of
interfacing the C function so that Lisp can call it are handled by a few
C macros.  The only way to really understand how to write new C code is
to read the source, but we can explain some things here.

   An example of a special form is the definition of `prog1', from
`eval.c'.  (An ordinary function would have the same general
appearance.)

     DEFUN ("prog1", Fprog1, 1, UNEVALLED, 0, /*
     Similar to `progn', but the value of the first form is returned.
     \(prog1 FIRST BODY...): All the arguments are evaluated sequentially.
     The value of FIRST is saved during evaluation of the remaining args,
     whose values are discarded.
     */
            (args))
     {
       /* This function can GC */
       REGISTER Lisp_Object val, form, tail;
       struct gcpro gcpro1;
     
       val = Feval (XCAR (args));
     
       GCPRO1 (val);
     
       LIST_LOOP_3 (form, XCDR (args), tail)
         Feval (form);
     
       UNGCPRO;
       return val;
     }

   Let's start with a precise explanation of the arguments to the
`DEFUN' macro.  Here is a template for them:

     DEFUN (LNAME, FNAME, MIN_ARGS, MAX_ARGS, INTERACTIVE, /*
     DOCSTRING
     */
        (ARGLIST))

LNAME
     This string is the name of the Lisp symbol to define as the
     function name; in the example above, it is `"prog1"'.

FNAME
     This is the C function name for this function.  This is the name
     that is used in C code for calling the function.  The name is, by
     convention, `F' prepended to the Lisp name, with all dashes (`-')
     in the Lisp name changed to underscores.  Thus, to call this
     function from C code, call `Fprog1'.  Remember that the arguments
     are of type `Lisp_Object'; various macros and functions for
     creating values of type `Lisp_Object' are declared in the file
     `lisp.h'.

     Primitives whose names are special characters (e.g. `+' or `<')
     are named by spelling out, in some fashion, the special character:
     e.g. `Fplus()' or `Flss()'.  Primitives whose names begin with
     normal alphanumeric characters but also contain special characters
     are spelled out in some creative way, e.g. `let*' becomes
     `FletX()'.

     Each function also has an associated structure that holds the data
     for the subr object that represents the function in Lisp.  This
     structure conveys the Lisp symbol name to the initialization
     routine that will create the symbol and store the subr object as
     its definition.  The C variable name of this structure is always
     `S' prepended to the FNAME.  You hardly ever need to be aware of
     the existence of this structure, since `DEFUN' plus `DEFSUBR'
     takes care of all the details.

MIN_ARGS
     This is the minimum number of arguments that the function
     requires.  The function `prog1' allows a minimum of one argument.

MAX_ARGS
     This is the maximum number of arguments that the function accepts,
     if there is a fixed maximum.  Alternatively, it can be `UNEVALLED',
     indicating a special form that receives unevaluated arguments, or
     `MANY', indicating an unlimited number of evaluated arguments (the
     C equivalent of `&rest').  Both `UNEVALLED' and `MANY' are macros.
     If MAX_ARGS is a number, it may not be less than MIN_ARGS and it
     may not be greater than 8. (If you need to add a function with
     more than 8 arguments, use the `MANY' form.  Resist the urge to
     edit the definition of `DEFUN' in `lisp.h'.  If you do it anyway,
     make sure to also add another clause to the switch statement in
     `primitive_funcall().')

INTERACTIVE
     This is an interactive specification, a string such as might be
     used as the argument of `interactive' in a Lisp function.  In the
     case of `prog1', it is 0 (a null pointer), indicating that `prog1'
     cannot be called interactively.  A value of `""' indicates a
     function that should receive no arguments when called
     interactively.

DOCSTRING
     This is the documentation string.  It is written just like a
     documentation string for a function defined in Lisp; in
     particular, the first line should be a single sentence.  Note how
     the documentation string is enclosed in a comment, none of the
     documentation is placed on the same lines as the comment-start and
     comment-end characters, and the comment-start characters are on
     the same line as the interactive specification.  `make-docfile',
     which scans the C files for documentation strings, is very
     particular about what it looks for, and will not properly extract
     the doc string if it's not in this exact format.

     In order to make both `etags' and `make-docfile' happy, make sure
     that the `DEFUN' line contains the LNAME and FNAME, and that the
     comment-start characters for the doc string are on the same line
     as the interactive specification, and put a newline directly after
     them (and before the comment-end characters).

ARGLIST
     This is the comma-separated list of arguments to the C function.
     For a function with a fixed maximum number of arguments, provide a
     C argument for each Lisp argument.  In this case, unlike regular C
     functions, the types of the arguments are not declared; they are
     simply always of type `Lisp_Object'.

     The names of the C arguments will be used as the names of the
     arguments to the Lisp primitive as displayed in its documentation,
     modulo the same concerns described above for `F...' names (in
     particular, underscores in the C arguments become dashes in the
     Lisp arguments).

     There is one additional kludge: A trailing `_' on the C argument is
     discarded when forming the Lisp argument.  This allows C language
     reserved words (like `default') or global symbols (like `dirname')
     to be used as argument names without compiler warnings or errors.

     A Lisp function with MAX_ARGS = `UNEVALLED' is a "special form";
     its arguments are not evaluated.  Instead it receives one argument
     of type `Lisp_Object', a (Lisp) list of the unevaluated arguments,
     conventionally named `(args)'.

     When a Lisp function has no upper limit on the number of arguments,
     specify MAX_ARGS = `MANY'.  In this case its implementation in C
     actually receives exactly two arguments: the number of Lisp
     arguments (an `int') and the address of a block containing their
     values (a `Lisp_Object *').  In this case only are the C types
     specified in the ARGLIST: `(int nargs, Lisp_Object *args)'.
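
   The argument-naming rules given under ARGLIST (underscores become
dashes, and a trailing `_' is dropped) can be sketched as a small
standalone C function.  This is only an illustration:
`lisp_arg_name' is a hypothetical helper, not a function found in
XEmacs or in `make-docfile', and it assumes that exactly one trailing
`_' is discarded.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of XEmacs: derive the Lisp-visible
   argument name from a C argument name.  One trailing `_' is dropped
   and every remaining `_' becomes `-'.  OUT must have room for
   strlen (c_name) + 1 bytes.  */
static void
lisp_arg_name (const char *c_name, char *out)
{
  size_t len = strlen (c_name);
  size_t i;

  if (len > 0 && c_name[len - 1] == '_')
    len--;                      /* discard the trailing `_' kludge */

  for (i = 0; i < len; i++)
    out[i] = (c_name[i] == '_') ? '-' : c_name[i];
  out[len] = '\0';
}
```

   Under these assumptions a C argument named `default_' would be
documented as `default', and `buffer_or_name' as `buffer-or-name'.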

   Within the function `Fprog1' itself, note the use of the macros
`GCPRO1' and `UNGCPRO'.  `GCPRO1' is used to "protect" a variable from
garbage collection--to inform the garbage collector that it must look
in that variable and regard the object pointed at by its contents as an
accessible object.  This is necessary whenever you call `Feval' or
anything that can directly or indirectly call `Feval' (this includes
the `QUIT' macro!).  At such a time, any Lisp object that you intend to
refer to again must be protected somehow.  `UNGCPRO' cancels the
protection of the variables that are protected in the current function.
It is necessary to do this explicitly.

   The macro `GCPRO1' protects just one local variable.  If you want to
protect two, use `GCPRO2' instead; repeating `GCPRO1' will not work.
Macros `GCPRO3' and `GCPRO4' also exist.

   These macros implicitly use local variables such as `gcpro1'; you
must declare these explicitly, with type `struct gcpro'.  Thus, if you
use `GCPRO2', you must declare `gcpro1' and `gcpro2'.
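
   To make the role of the `gcpro1' and `gcpro2' locals more concrete,
here is a simplified, standalone model of such a protection chain.
Every name beginning with `toy_' or `TOY_' is invented for this
sketch, and the layout of `struct toy_gcpro' is an assumption; consult
`lisp.h' for the real definitions.

```c
#include <stddef.h>

typedef long Toy_Object;        /* stand-in for Lisp_Object */

/* One record per protected variable.  */
struct toy_gcpro
{
  struct toy_gcpro *next;       /* previously active record */
  Toy_Object *var;              /* address of the protected variable */
  int nvars;                    /* number of variables covered */
};

/* Head of the chain an (imaginary) collector would walk.  */
static struct toy_gcpro *toy_gcprolist = NULL;

/* Like GCPRO1: relies on a local named `gcpro1' being in scope.  */
#define TOY_GCPRO1(v)                                   \
  (gcpro1.next = toy_gcprolist, gcpro1.var = &(v),      \
   gcpro1.nvars = 1, toy_gcprolist = &gcpro1)

/* Like GCPRO2: needs both `gcpro1' and `gcpro2'.  */
#define TOY_GCPRO2(v1, v2)                              \
  (gcpro1.next = toy_gcprolist, gcpro1.var = &(v1),     \
   gcpro1.nvars = 1,                                    \
   gcpro2.next = &gcpro1, gcpro2.var = &(v2),           \
   gcpro2.nvars = 1, toy_gcprolist = &gcpro2)

/* Like UNGCPRO: pop everything the current function pushed.  */
#define TOY_UNGCPRO (toy_gcprolist = gcpro1.next)

/* Count the variables currently reachable from the chain.  */
static int
toy_count_protected (void)
{
  int n = 0;
  struct toy_gcpro *p;

  for (p = toy_gcprolist; p != NULL; p = p->next)
    n += p->nvars;
  return n;
}

/* Protect two locals, report how many are protected, then unprotect.  */
static int
toy_protect_two (void)
{
  Toy_Object a = 1, b = 2;
  struct toy_gcpro gcpro1, gcpro2;
  int seen;

  TOY_GCPRO2 (a, b);
  seen = toy_count_protected ();
  TOY_UNGCPRO;
  return seen;
}
```

   In this model, using `TOY_GCPRO1' twice in one function would
overwrite the single `gcpro1' record, so the first variable would drop
off the chain and the record would end up pointing at itself.  That is
one plausible reading of why repeating `GCPRO1' does not work.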

   Note also that the general rule is "caller-protects"; i.e. you are
only responsible for protecting those Lisp objects that you create.  Any
objects passed to you as arguments should have been protected by whoever
created them, so you don't in general have to protect them.

   In particular, the arguments to any Lisp primitive are always
automatically `GCPRO'ed, when called "normally" from Lisp code or
bytecode.  So only a few Lisp primitives that are called frequently from
C code, such as `Fprogn', protect their arguments as a service to their
caller.  You don't need to protect your arguments when writing a new
`DEFUN'.

   `GCPRO'ing is perhaps the trickiest and most error-prone part of
XEmacs coding.  It is *extremely* important that you get this right and
use a great deal of discipline when writing this code.  *Note
`GCPRO'ing: GCPROing, for full details on how to do this.

   What `DEFUN' actually does is declare a global structure of type
`Lisp_Subr' whose name begins with capital `SF' and which contains
information about the primitive (e.g. a pointer to the function, its
minimum and maximum allowed arguments, a string describing its Lisp
name); `DEFUN' then begins a normal C function declaration using the
`F...' name.  The Lisp subr object that is the function definition of a
primitive (i.e. the object in the function slot of the symbol that
names the primitive) actually points to this `SF' structure; when
`Feval' encounters a subr, it looks in the structure to find out how to
call the C function.

   Defining the C function is not enough to make a Lisp primitive
available; you must also create the Lisp symbol for the primitive (the
symbol is "interned"; *note Obarrays::) and store a suitable subr
object in its function cell. (If you don't do this, the primitive won't
be seen by Lisp code.) The code looks like this:

     DEFSUBR (FNAME);

Here FNAME is the same name you used as the second argument to `DEFUN'.

   This call to `DEFSUBR' should go in the `syms_of_*()' function at
the end of the module.  If no such function exists, create it and make
sure to also declare it in `symsinit.h' and call it from the
appropriate spot in `main()'.  *Note General Coding Rules::.

   Note that C code cannot call functions by name unless they are
defined in C.  The way to call a function written in Lisp from C is to
use `Ffuncall', which embodies the Lisp function `funcall'.  Since the
Lisp function `funcall' accepts an unlimited number of arguments, in C
it takes two: the number of Lisp-level arguments, and a one-dimensional
array containing their values.  The first Lisp-level argument is the
Lisp function to call, and the rest are the arguments to pass to it.
Since `Ffuncall' can call the evaluator, you must protect pointers from
garbage collection around the call to `Ffuncall'. (However, `Ffuncall'
explicitly protects all of its parameters, so you don't have to protect
any pointers passed as parameters to it.)

   The C functions `call0', `call1', `call2', and so on, provide a
convenient way to call a Lisp function with a fixed number of
arguments.  They work by calling `Ffuncall'.
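
   The calling convention described above (a count plus an array whose
first element is the function to call) can be modelled in a few lines
of standalone C.  This is a sketch, not XEmacs code: the `toy_' names
are invented, and a small function table stands in for real Lisp
function objects.

```c
typedef long Toy_Object;        /* stand-in for Lisp_Object */

/* A callable receives only its own arguments.  */
typedef Toy_Object (*toy_subr) (int nargs, Toy_Object *args);

static Toy_Object
toy_add (int nargs, Toy_Object *args)
{
  Toy_Object sum = 0;
  int i;

  for (i = 0; i < nargs; i++)
    sum += args[i];
  return sum;
}

static Toy_Object
toy_mul (int nargs, Toy_Object *args)
{
  Toy_Object product = 1;
  int i;

  for (i = 0; i < nargs; i++)
    product *= args[i];
  return product;
}

/* A "function object" is simply an index into this table.  */
enum { TOY_ADD = 0, TOY_MUL = 1 };
static const toy_subr toy_function_table[] = { toy_add, toy_mul };

/* Like Ffuncall: args[0] is the function, args[1..] its arguments.  */
static Toy_Object
toy_funcall (int nargs, Toy_Object *args)
{
  return toy_function_table[args[0]] (nargs - 1, args + 1);
}

/* Like call2: pack a fixed number of arguments into an array.  */
static Toy_Object
toy_call2 (Toy_Object fn, Toy_Object arg1, Toy_Object arg2)
{
  Toy_Object args[3];

  args[0] = fn;
  args[1] = arg1;
  args[2] = arg2;
  return toy_funcall (3, args);
}
```

   With these definitions, `toy_call2 (TOY_ADD, 2, 3)' goes through
`toy_funcall' with a three-element array, just as the text describes
for the `call' family and `Ffuncall'.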

   `eval.c' is a very good file to look through for examples; `lisp.h'
contains the definitions for important macros and functions.


File: internals.info,  Node: Adding Global Lisp Variables,  Next: Coding for Mule,  Prev: Writing Lisp Primitives,  Up: Rules When Writing New C Code

Adding Global Lisp Variables
============================

   Global variables whose names begin with `Q' are constants whose
value is a symbol of a particular name.  The name of the variable should
be derived from the name of the symbol using the same rules as for Lisp
primitives.  These variables are initialized using a call to
`defsymbol()' in the `syms_of_*()' function. (This call interns a
symbol, sets the C variable to the resulting Lisp object, and calls
`staticpro()' on the C variable to tell the garbage-collection
mechanism about this variable.  What `staticpro()' does is add a
pointer to the variable to a large global array; when
garbage-collection happens, all pointers listed in the array are used
as starting points for marking Lisp objects.  This is important because
it's quite possible that the only current reference to the object is
the C variable.  In the case of symbols, the `staticpro()' doesn't
matter all that much because the symbol is contained in `obarray',
which is itself `staticpro()'ed.  However, it's possible that a naughty
user could do something like uninterning the symbol out of `obarray' or
even setting `obarray' to a different value [although this is likely to
make XEmacs crash!].)
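
   The bookkeeping that `staticpro()' is described as doing can be
sketched as follows.  This is a toy model under stated assumptions:
the table size, the `toy_' names, and the "marking" step (which merely
sums the current values) are all invented for illustration.

```c
#include <stddef.h>

typedef long Toy_Object;        /* stand-in for Lisp_Object */

#define TOY_MAX_STATICPRO 16

/* Addresses of the C variables registered as roots.  */
static Toy_Object *toy_staticvec[TOY_MAX_STATICPRO];
static int toy_staticidx = 0;

/* Like staticpro(): remember the ADDRESS of a C variable, not its
   current value.  Returns 0 on success, -1 if the table is full.  */
static int
toy_staticpro (Toy_Object *varaddress)
{
  if (toy_staticidx >= TOY_MAX_STATICPRO)
    return -1;
  toy_staticvec[toy_staticidx++] = varaddress;
  return 0;
}

/* Stand-in for the start of a mark phase: visit whatever each
   registered variable holds right now.  Here "visiting" just adds the
   values together so that the result can be inspected.  */
static long
toy_mark_roots (void)
{
  long sum = 0;
  int i;

  for (i = 0; i < toy_staticidx; i++)
    sum += *toy_staticvec[i];
  return sum;
}
```

   Because the table stores addresses, a later assignment to a
registered variable is seen by the next call to `toy_mark_roots'
without registering the variable again.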

   *Please note:* It is potentially deadly if you declare a `Q...'
variable in two different modules.  The two calls to `defsymbol()' are
no problem, but some linkers will complain about multiply-defined
symbols.  The most insidious aspect of this is that often the link will
succeed anyway, but then the resulting executable will sometimes crash
in obscure ways during certain operations!  To avoid this problem,
declare any symbols with common names (such as `text') that are not
obviously associated with this particular module in the module
`general.c'.

   Global variables whose names begin with `V' are variables that
contain Lisp objects.  The convention here is that all global variables
of type `Lisp_Object' begin with `V', and all others don't (including
integer and boolean variables that have Lisp equivalents).  Most of the
time, these variables have equivalents in Lisp, but some don't.  Those
that do are declared this way by a call to `DEFVAR_LISP()' in the
`vars_of_*()' initializer for the module.  What this does is create a
special "symbol-value-forward" Lisp object that contains a pointer to
the C variable, intern a symbol whose name is as specified in the call
to `DEFVAR_LISP()', and set its value to the symbol-value-forward Lisp
object; it also calls `staticpro()' on the C variable to tell the
garbage-collection mechanism about the variable.  When `eval' (or
actually `symbol-value') encounters this special object in the process
of retrieving a variable's value, it follows the indirection to the C
variable and gets its value.  `setq' does similar things so that the C
variable gets changed.
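
   The indirection through a forwarding object can be pictured with
the following standalone sketch.  It is a deliberately small model:
`struct toy_symbol' and the `toy_' functions are hypothetical, and the
real symbol-value-forward object is a Lisp object rather than a flag
inside the symbol.

```c
#include <stddef.h>

typedef long Toy_Object;        /* stand-in for Lisp_Object */

enum toy_kind { TOY_PLAIN, TOY_FORWARD };

struct toy_symbol
{
  enum toy_kind kind;
  Toy_Object plain_value;       /* used when kind == TOY_PLAIN */
  Toy_Object *forward;          /* used when kind == TOY_FORWARD */
};

/* Like DEFVAR_LISP: make SYM forward to the C variable C_VAR.  */
static void
toy_defvar_lisp (struct toy_symbol *sym, Toy_Object *c_var)
{
  sym->kind = TOY_FORWARD;
  sym->plain_value = 0;
  sym->forward = c_var;
}

/* Like symbol-value: follow the indirection if there is one.  */
static Toy_Object
toy_symbol_value (const struct toy_symbol *sym)
{
  return sym->kind == TOY_FORWARD ? *sym->forward : sym->plain_value;
}

/* Like setq: write through to the C variable if SYM forwards.  */
static void
toy_set (struct toy_symbol *sym, Toy_Object value)
{
  if (sym->kind == TOY_FORWARD)
    *sym->forward = value;
  else
    sym->plain_value = value;
}
```

   In this model an assignment made from "Lisp" through `toy_set' is
immediately visible in the C variable, and an assignment made in C is
immediately visible through `toy_symbol_value'.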

   Whether or not you `DEFVAR_LISP()' a variable, you need to
initialize it in the `vars_of_*()' function; otherwise it will end up
as all zeroes, which is the integer 0 (_not_ `nil'), and this is
probably not what you want.  Also, if the variable is not
`DEFVAR_LISP()'ed, *you must call* `staticpro()' on the C variable in
the `vars_of_*()' function.  Otherwise, the garbage-collection
mechanism won't know that the object in this variable is in use, and
will happily collect it and reuse its storage for another Lisp object,
and you will be the one who's unhappy when you can't figure out how
your variable got overwritten.


File: internals.info,  Node: Coding for Mule,  Next: Techniques for XEmacs Developers,  Prev: Adding Global Lisp Variables,  Up: Rules When Writing New C Code

Coding for Mule
===============

   Although Mule support is not compiled by default in XEmacs, many
people are using it, and we consider it crucial that new code works
correctly with multibyte characters.  This is not hard; it is only a
matter of following several simple coding guidelines.  Even if
you never compile with Mule, with a little practice you will find it
quite easy to code Mule-correctly.

   Note that these guidelines are not necessarily tied to the current
Mule implementation; they are also a good idea to follow on the grounds
of code generalization for future I18N work.

* Menu:

* Character-Related Data Types::
* Working With Character and Byte Positions::
* Conversion to and from External Data::
* General Guidelines for Writing Mule-Aware Code::
* An Example of Mule-Aware Code::


File: internals.info,  Node: Character-Related Data Types,  Next: Working With Character and Byte Positions,  Prev: Coding for Mule,  Up: Coding for Mule

Character-Related Data Types
----------------------------

   First, let's review the basic character-related datatypes used by
XEmacs.  Note that the separate `typedef's are not mandatory in the
current implementation (all of them boil down to `unsigned char' or
`int'), but they improve clarity of code a great deal, because one
glance at the declaration can tell the intended use of the variable.

`Emchar'
     An `Emchar' holds a single Emacs character.

     Obviously, the equality between characters and bytes is lost in
     the Mule world.  Characters can be represented by one or more
     bytes in the buffer, and `Emchar' is the C type large enough to
     hold any character.

     Without Mule support, an `Emchar' is equivalent to an `unsigned
     char'.

`Bufbyte'
     The data representing the text in a buffer or string is logically
     a set of `Bufbyte's.

     XEmacs does not work with the same character formats all the time;
     when reading characters from the outside, it decodes them to an
     internal format, and likewise encodes them when writing.
     `Bufbyte' (in fact `unsigned char') is the basic unit of the
     internal format of XEmacs buffers and strings.  A `Bufbyte *' is
     the type that points at text encoded in the variable-width
     internal encoding.

     One character can correspond to one or more `Bufbyte's.  In the
     current Mule implementation, an ASCII character is represented by
     a single `Bufbyte' with the same value, and other characters are
     represented by a sequence of two or more `Bufbyte's.

     Without Mule support, there are exactly 256 characters, implicitly
     Latin-1, and each character is represented using one `Bufbyte', and
     there is a one-to-one correspondence between `Bufbyte's and
     `Emchar's.

`Bufpos'
`Charcount'
     A `Bufpos' represents a character position in a buffer or string.
     A `Charcount' represents a number (count) of characters.
     Logically, subtracting two `Bufpos' values yields a `Charcount'
     value.  Although all of these are `typedef'ed to `EMACS_INT', we
     use them in preference to `EMACS_INT' to make it clear what sort
     of position is being used.

     `Bufpos' and `Charcount' values are the only ones that are ever
     visible to Lisp.

`Bytind'
`Bytecount'
     A `Bytind' represents a byte position in a buffer or string.  A
     `Bytecount' represents the distance between two positions, in
     bytes.  The relationship between `Bytind' and `Bytecount' is the
     same as the relationship between `Bufpos' and `Charcount'.

`Extbyte'
`Extcount'
     When dealing with the outside world, XEmacs works with `Extbyte's,
     which are equivalent to `unsigned char'.  Obviously, an `Extcount'
     is the distance between two `Extbyte's.  Extbytes and Extcounts
     are not all that frequent in XEmacs code.


File: internals.info,  Node: Working With Character and Byte Positions,  Next: Conversion to and from External Data,  Prev: Character-Related Data Types,  Up: Coding for Mule

Working With Character and Byte Positions
-----------------------------------------

   Now that we have defined the basic character-related types, we can
look at the macros and functions designed for working with them and for
converting between them.  Most of these macros are defined in
`buffer.h', and we don't discuss all of them here, but only the most
important ones.  Examining the existing code is the best way to learn
about them.

`MAX_EMCHAR_LEN'
     This preprocessor constant is the maximum number of buffer bytes
     needed to represent an Emacs character in the variable-width
     internal encoding.  It is useful when allocating temporary strings
     to hold a known number of characters.  For instance:

          {
            Charcount cclen;
            ...
            {
              /* Allocate space for CCLEN characters. */
              Bufbyte *buf = (Bufbyte *)alloca (cclen * MAX_EMCHAR_LEN);
          ...

     If you followed the previous section, you can guess that,
     logically, multiplying a `Charcount' value by `MAX_EMCHAR_LEN'
     produces a `Bytecount' value.

     In the current Mule implementation, `MAX_EMCHAR_LEN' equals 4.
     Without Mule, it is 1.

`charptr_emchar'
`set_charptr_emchar'
     The `charptr_emchar' macro takes a `Bufbyte' pointer and returns
     the `Emchar' stored at that position.  If it were a function, its
     prototype would be:

          Emchar charptr_emchar (Bufbyte *p);

     `set_charptr_emchar' stores an `Emchar' to the specified byte
     position.  It returns the number of bytes stored:

          Bytecount set_charptr_emchar (Bufbyte *p, Emchar c);

     It is important to note that `set_charptr_emchar' is safe only for
     appending a character at the end of a buffer, not for overwriting a
     character in the middle.  This is because the width of characters
     varies, and `set_charptr_emchar' cannot resize the string if it
     writes, say, a two-byte character where a single-byte character
     used to reside.

     A typical use of `set_charptr_emchar' can be demonstrated by this
     example, which copies characters from buffer BUF to a temporary
     string of Bufbytes.

          {
            Bufpos pos;
            for (pos = beg; pos < end; pos++)
              {
                Emchar c = BUF_FETCH_CHAR (buf, pos);
                p += set_charptr_emchar (p, c);
              }
          }

     Note how `set_charptr_emchar' is used to store the `Emchar' and
     advance the pointer at the same time.
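
     The append-only idiom can be tried outside XEmacs with a
     stand-in encoder.  The sketch below is an assumption-laden
     illustration: it uses UTF-8 in place of the Mule internal
     encoding (the two are different encodings), and every `toy_' or
     `Toy_' name is invented.  What it shares with the real code is
     the shape of the loop and the worst-case sizing of the
     destination.

```c
#include <stddef.h>

typedef unsigned char Toy_Bufbyte;      /* stand-in for Bufbyte */
typedef unsigned int Toy_Emchar;        /* stand-in for Emchar */

/* Worst-case bytes per character in the stand-in encoding.  */
#define TOY_MAX_EMCHAR_LEN 4

/* Store C at P as UTF-8 and return the number of bytes written.
   Assumes C is at most 0x10FFFF.  */
static size_t
toy_set_charptr_emchar (Toy_Bufbyte *p, Toy_Emchar c)
{
  if (c < 0x80)
    {
      p[0] = (Toy_Bufbyte) c;
      return 1;
    }
  if (c < 0x800)
    {
      p[0] = (Toy_Bufbyte) (0xC0 | (c >> 6));
      p[1] = (Toy_Bufbyte) (0x80 | (c & 0x3F));
      return 2;
    }
  if (c < 0x10000)
    {
      p[0] = (Toy_Bufbyte) (0xE0 | (c >> 12));
      p[1] = (Toy_Bufbyte) (0x80 | ((c >> 6) & 0x3F));
      p[2] = (Toy_Bufbyte) (0x80 | (c & 0x3F));
      return 3;
    }
  p[0] = (Toy_Bufbyte) (0xF0 | (c >> 18));
  p[1] = (Toy_Bufbyte) (0x80 | ((c >> 12) & 0x3F));
  p[2] = (Toy_Bufbyte) (0x80 | ((c >> 6) & 0x3F));
  p[3] = (Toy_Bufbyte) (0x80 | (c & 0x3F));
  return 4;
}

/* Append COUNT characters from SRC to DST and return the number of
   bytes written.  DST must have room for
   COUNT * TOY_MAX_EMCHAR_LEN bytes.  */
static size_t
toy_copy_chars (Toy_Bufbyte *dst, const Toy_Emchar *src, size_t count)
{
  Toy_Bufbyte *p = dst;
  size_t i;

  for (i = 0; i < count; i++)
    p += toy_set_charptr_emchar (p, src[i]);  /* store and advance */
  return (size_t) (p - dst);
}
```

     The destination is sized for the worst case up front, and each
     character is written at the current end of the data, so no
     character already stored is ever overwritten.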

`INC_CHARPTR'
`DEC_CHARPTR'
     These two macros increment and decrement a `Bufbyte' pointer,
     respectively.  They will adjust the pointer by the appropriate
     number of bytes according to the byte length of the character
     stored there.  Both macros assume that the memory address is
     located at the beginning of a valid character.

     Without Mule support, `INC_CHARPTR (p)' and `DEC_CHARPTR (p)'
     simply expand to `p++' and `p--', respectively.

`bytecount_to_charcount'
     Given a pointer to a text string and a length in bytes, return the
     equivalent length in characters.

          Charcount bytecount_to_charcount (Bufbyte *p, Bytecount bc);

`charcount_to_bytecount'
     Given a pointer to a text string and a length in characters,
     return the equivalent length in bytes.

          Bytecount charcount_to_bytecount (Bufbyte *p, Charcount cc);

`charptr_n_addr'
     Return a pointer to the beginning of the character offset CC (in
     characters) from P.

          Bufbyte *charptr_n_addr (Bufbyte *p, Charcount cc);
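
     The relationship between these three conversions can be seen in
     a standalone model.  As a loudly stated assumption, the sketch
     uses UTF-8 lead bytes to find character lengths, which is not the
     Mule internal encoding; the `toy_' functions are hypothetical
     and only mirror the prototypes shown above.

```c
#include <stddef.h>

typedef unsigned char Toy_Bufbyte;      /* stand-in for Bufbyte */

/* Byte length of the character starting at P, judged from its lead
   byte.  Assumes valid UTF-8 and that P is on a character boundary.  */
static size_t
toy_char_len (const Toy_Bufbyte *p)
{
  if (*p < 0x80)
    return 1;
  if (*p < 0xE0)
    return 2;
  if (*p < 0xF0)
    return 3;
  return 4;
}

/* Like charptr_n_addr: address of the character CC characters
   after P.  */
static const Toy_Bufbyte *
toy_charptr_n_addr (const Toy_Bufbyte *p, size_t cc)
{
  while (cc-- > 0)
    p += toy_char_len (p);
  return p;
}

/* Like charcount_to_bytecount: bytes occupied by the first CC
   characters at P.  */
static size_t
toy_charcount_to_bytecount (const Toy_Bufbyte *p, size_t cc)
{
  return (size_t) (toy_charptr_n_addr (p, cc) - p);
}

/* Like bytecount_to_charcount: characters in the first BC bytes at
   P.  BC must end on a character boundary.  */
static size_t
toy_bytecount_to_charcount (const Toy_Bufbyte *p, size_t bc)
{
  const Toy_Bufbyte *end = p + bc;
  size_t cc = 0;

  while (p < end)
    {
      p += toy_char_len (p);
      cc++;
    }
  return cc;
}
```

     In this model all three operations walk the text one character
     at a time, so their cost grows with the length of the span being
     converted.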