XEmacs 21.4.17 "Jumbo Shrimp".

[chise/xemacs-chise.git.1] / man / internals / internals.texi
diff --git a/man/internals/internals.texi b/man/internals/internals.texi

index 19563d4..d7df3da 100644 (file)
--- a/man/internals/internals.texi
+++ b/man/internals/internals.texi
@@ -12,7 +12,7 @@
  
  Copyright @copyright{} 1992 - 1996 Ben Wing.
  Copyright @copyright{} 1996, 1997 Sun Microsystems.
-Copyright @copyright{} 1994 - 1998 Free Software Foundation.
+Copyright @copyright{} 1994 - 1998, 2002, 2003 Free Software Foundation.
  Copyright @copyright{} 1994, 1995 Board of Trustees, University of Illinois.
  
  
@@ -1841,16 +1841,43 @@ Lisp objects use the typedef @code{Lisp_Object}, but the actual C type
  used for the Lisp object can vary.  It can be either a simple type
  (@code{long} on the DEC Alpha, @code{int} on other machines) or a
  structure whose fields are bit fields that line up properly (actually, a
-union of structures is used).  Generally the simple integral type is
-preferable because it ensures that the compiler will actually use a
-machine word to represent the object (some compilers will use more
-general and less efficient code for unions and structs even if they can
-fit in a machine word).  The union type, however, has the advantage of
-stricter type checking.  If you accidentally pass an integer where a Lisp
-object is desired, you get a compile error.  The choice of which type
-to use is determined by the preprocessor constant @code{USE_UNION_TYPE}
-which is defined via the @code{--use-union-type} option to
-@code{configure}.
+union of structures is used).  The choice of which type to use is
+determined by the preprocessor constant @code{USE_UNION_TYPE} which is
+defined via the @code{--use-union-type} option to @code{configure}.
+
+Generally the simple integral type is preferable because it ensures that
+the compiler will actually use a machine word to represent the object
+(some compilers will use more general and less efficient code for unions
+and structs even if they can fit in a machine word).  The union type,
+however, has the advantage of stricter @emph{static} type checking.
+Places where a @code{Lisp_Object} is mistakenly passed to a routine
+expecting an @code{int} (or vice-versa), or a check is written @samp{if
+(foo)} (instead of @samp{if (!NILP (foo))}, will be flagged as errors.
+None of these lead to the expected results!  @code{Qnil} is not
+represented as 0 (so @samp{if (foo)} will *ALWAYS* be true for a
+@code{Lisp_Object}), and the representation of an integer as a
+@code{Lisp_Object} is not just the integer's numeric value, but usually
+2x the integer +/- 1.)
+
+There used to be a claim that the union type simplified debugging.
+There may have been a grain of truth to this pre-19.8, when there was no
+@samp{lrecord} type and all objects had a separate type appearing in the
+tag.  Nowadays, however, there is no debugging gain, and in fact
+frequent debugging *@emph{loss}*, since many debuggers don't handle
+unions very well, and usually there is no way to directly specify a
+union from a debugging prompt.
+
+Furthermore, release builds should *@emph{not}* be done with union type
+because (a) you may get less efficiency, with compilers that can't
+figure out how to optimize the union into a machine word; (b) even
+worse, the union type often triggers miscompilation, especially when
+combined with Mule and error-checking.  This has been the case at
+various times when using GCC and MS VC, at least with @samp{--pdump}.
+Therefore, be warned!
+
+As of 2002 4Q, miscompilation is known to happen with current versions
+of @strong{Microsoft VC++} and @strong{GCC in combination with Mule,
+pdump, and KKCC} (no error checking).
  
  Various macros are used to convert between Lisp_Objects and the
  corresponding C type.  Macros of the form @code{XINT()}, @code{XCHAR()},
@@ -1911,6 +1938,7 @@ get something that appears to work, but which will crash in odd
  situations, often in code far away from where the actual breakage is.
  
  @menu
+* A Reader's Guide to XEmacs Coding Conventions::
  * General Coding Rules::
  * Writing Lisp Primitives::
  * Writing Good Comments::
@@ -1920,6 +1948,99 @@ situations, often in code far away from where the actual breakage is.
  * Techniques for XEmacs Developers::
  @end menu
  
+@node A Reader's Guide to XEmacs Coding Conventions
+@section A Reader's Guide to XEmacs Coding Conventions
+@cindex coding conventions
+@cindex reader's guide
+@cindex coding rules, naming
+
+Of course the low-level implementation language of XEmacs is C, but much
+of that uses the Lisp engine to do its work.  However, because the code
+is ``inside'' of the protective containment shell around the ``reactor
+core,'' you'll see lots of complex ``plumbing'' needed to do the work
+and ``safety mechanisms,'' whose failure results in a meltdown.  This
+section provides a quick overview (or review) of the various components
+of the implementation of Lisp objects.
+
+  Two typographic conventions help to identify C objects that implement
+Lisp objects.  The first is that capitalized identifiers, especially
+beginning with the letters @samp{Q}, @samp{V}, @samp{F}, and @samp{S},
+for C variables and functions, and C macros with beginning with the
+letter @samp{X}, are used to implement Lisp.  The second is that where
+Lisp uses the hyphen @samp{-} in symbol names, the corresponding C
+identifiers use the underscore @samp{_}.  Of course, since XEmacs Lisp
+contains interfaces to many external libraries, those external names
+will follow the coding conventions their authors chose, and may overlap
+the ``XEmacs name space.''  However these cases are usually pretty
+obvious.
+
+  All Lisp objects are handled indirectly.  The @code{Lisp_Object}
+type is usually a pointer to a structure, except for a very small number
+of types with immediate representations (currently characters and
+integers).  However, these types cannot be directly operated on in C
+code, either, so they can also be considered indirect.  Types that do
+not have an immediate representation always have a C typedef
+@code{Lisp_@var{type}} for a corresponding structure.
+@c #### mention l(c)records here?
+
+  In older code, it was common practice to pass around pointers to
+@code{Lisp_@var{type}}, but this is now deprecated in favor of using
+@code{Lisp_Object} for all function arguments and return values that are
+Lisp objects.  The @code{X@var{type}} macro is used to extract the
+pointer and cast it to @code{(Lisp_@var{type} *)} for the desired type.
+
+  @strong{Convention}: macros whose names begin with @samp{X} operate on
+@code{Lisp_Object}s and do no type-checking.  Many such macros are type
+extractors, but others implement Lisp operations in C (@emph{e.g.},
+@code{XCAR} implements the Lisp @code{car} function).  These are unsafe,
+and must only be used where types of all data have already been checked.
+Such macros are only applied to @code{Lisp_Object}s.  In internal
+implementations where the pointer has already been converted, the
+structure is operated on directly using the C @code{->} member access
+operator.
+
+  The @code{@var{type}P}, @code{CHECK_@var{type}}, and
+@code{CONCHECK_@var{type}} macros are used to test types.  The first
+returns a Boolean value, and the latter signal errors.  (The
+@samp{CONCHECK} variety allows execution to be CONtinued under some
+circumstances, thus the name.)  Functions which expect to be passed user
+data invariably call @samp{CHECK} macros on arguments.
+
+  There are many types of specialized Lisp objects implemented in C, but
+the most pervasive type is the @dfn{symbol}.  Symbols are used as
+identifiers, variables, and functions.
+
+  @strong{Convention}: Global variables whose names begin with @samp{Q}
+are constants whose value is a symbol.  The name of the variable should
+be derived from the name of the symbol using the same rules as for Lisp
+primitives.  Such variables allow the C code to check whether a
+particular @code{Lisp_Object} is equal to a given symbol.  Symbols are
+Lisp objects, so these variables may be passed to Lisp primitives.  (An
+alternative to the use of @samp{Q...} variables is to call the
+@code{intern} function at initialization in the
+@code{vars_of_@var{module}} function, which is hardly less efficient.)
+
+  @strong{Convention}: Global variables whose names begin with @samp{V}
+are variables that contain Lisp objects.  The convention here is that
+all global variables of type @code{Lisp_Object} begin with @samp{V}, and
+no others do (not even integer and boolean variables that have Lisp
+equivalents). Most of the time, these variables have equivalents in
+Lisp, which are defined via the @samp{DEFVAR} family of macros, but some
+don't.  Since the variable's value is a @code{Lisp_Object}, it can be
+passed to Lisp primitives.
+
+  The implementation of Lisp primitives is more complex.
+@strong{Convention}: Global variables with names beginning with @samp{S}
+contain a structure that allows the Lisp engine to identify and call a C
+function.  In modern versions of XEmacs, these identifiers are almost
+always completely hidden in the @code{DEFUN} and @code{SUBR} macros, but
+you will encounter them if you look at very old versions of XEmacs or at
+GNU Emacs.  @strong{Convention}: Functions with names beginning with
+@samp{F} implement Lisp primitives.  Of course all their arguments and
+their return values must be Lisp_Objects.  (This is hidden in the
+@code{DEFUN} macro.)
+
+
  @node General Coding Rules
  @section General Coding Rules
  @cindex coding rules, general
@@ -2379,10 +2500,15 @@ XEmacs crash!].)
  @code{defsymbol()} are no problem, but some linkers will complain about
  multiply-defined symbols.  The most insidious aspect of this is that
  often the link will succeed anyway, but then the resulting executable
-will sometimes crash in obscure ways during certain operations!  To
-avoid this problem, declare any symbols with common names (such as
+will sometimes crash in obscure ways during certain operations!
+
+To avoid this problem, declare any symbols with common names (such as
  @code{text}) that are not obviously associated with this particular
-module in the module @file{general.c}.
+module in the file @file{general-slots.h}.  The ``-slots'' suffix
+indicates that this is a file that is included multiple times in
+@file{general.c}.  Redefinition of preprocessor macros allows the
+effects to be different in each context, so this is actually more
+convenient and less error-prone than doing it in your module.
  
    Global variables whose names begin with @samp{V} are variables that
  contain Lisp objects.  The convention here is that all global variables
@@ -2983,7 +3109,7 @@ Speed up redisplay.
  @item
  Speed up syntax highlighting.  It was suggested that ``maybe moving some
  of the syntax highlighting capabilities into C would make a
-difference.''  Wrong idea, I think.  When processing one large file a
+difference.''  Wrong idea, I think.  When processing one 400kB file a
  particular low-level routine was being called 40 @emph{million} times
  simply for @emph{one} call to @code{newline-and-indent}.  Syntax
  highlighting needs to be rewritten to use a reliable, fast parser, then
@@ -3122,21 +3248,24 @@ All @file{.c} files should @code{#include <config.h>} first.  Almost all
  @file{.c} files should @code{#include "lisp.h"} second.
  
  @item
-Generated header files should be included using the @code{#include <...>} syntax,
-not the @code{#include "..."} syntax.  The generated headers are:
+Generated header files should be included using the @samp{#include <...>}
+syntax, not the @samp{#include "..."} syntax.  The generated headers are:
  
  @file{config.h sheap-adjust.h paths.h Emacs.ad.h}
  
-The basic rule is that you should assume builds using @code{--srcdir}
-and the @code{#include <...>} syntax needs to be used when the
+The basic rule is that you should assume builds using @samp{--srcdir}
+and the @samp{#include <...>} syntax needs to be used when the
  to-be-included generated file is in a potentially different directory
-@emph{at compile time}.  The non-obvious C rule is that @code{#include "..."}
-means to search for the included file in the same directory as the
-including file, @emph{not} in the current directory.
+@emph{at compile time}.  The non-obvious C rule is that
+@samp{#include "..."} means to search for the included file in the same
+directory as the including file, @emph{not} in the current directory.
+Normally this is not a problem but when building with @samp{--srcdir},
+@file{make} will search the @samp{VPATH} for you, while the C compiler
+knows nothing about it.
  
  @item
-Header files should @emph{not} include @code{<config.h>} and
-@code{"lisp.h"}.  It is the responsibility of the @file{.c} files that
+Header files should @emph{not} include @samp{<config.h>} and
+@samp{"lisp.h"}.  It is the responsibility of the @file{.c} files that
  use it to do so.
  
  @end itemize
@@ -4145,6 +4274,11 @@ highlighting different syntactic constructs of a source file in
  different colors, for easy reading.  The C support is provided so that
  this is fast.
  
+As of 21.4.10, bugs introduced at the very end of the 21.2 series in the
+``syntax properties'' code were fixed, and highlighting is acceptably
+quick again.  However, presumably more improvements are possible, and
+the places to look are probably here, in the defun-traversing code, and
+in @file{syntax.c}, in the comment-traversing code.
  
  
  @example
@@ -4438,6 +4572,37 @@ delimiter.  The @samp{newline} character is given (single-character)
  sequence'' @emph{flag}.  The ``comment-end'' class allows the scanner to
  determine that no second character is needed to terminate the comment.
  
+There used to be a syntax class @samp{Sextword}.  A character of
+@samp{Sextword} class is a word-constituent but a word boundary may
+exist between two such characters.  Ken'ichi HANDA <handa@@etl.go.jp>
+explains the purpose of the Sextword syntax category:
+
+@quotation
+Japanese words are not separated by spaces, which makes finding word
+boundaries very difficult.  Theoretically it's impossible without
+using natural language processing techniques.  But, by defining
+pseudo-words as below (much simplified for letting you understand it
+easily) for Japanese, we can have a convenient forward-word function
+for Japanese.
+
+@display
+A Japanese word is a sequence of characters that consists of
+zero or more Kanji characters followed by zero or more
+Hiragana characters.
+@end display
+
+Then, the problem is that now we can't say that a sequence of
+word-constituents makes up a word.  For instance, both Hiragana "A"
+and Kanji "KAN" are word-constituents but the sequence of these two
+letters can't be a single word.
+
+So, we introduced Sextword for Japanese letters.
+@end quotation
+
+There seems to have been some controversy about this category, as it has
+been removed, readded, and removed again.  Currently neither GNU Emacs
+(21.3.99) nor XEmacs (21.5.17) seems to use it.
+
  
  @example
  casefiddle.c
@@ -4954,13 +5119,13 @@ respectively.  This is currently in beta.
  
  @file{mule-mcpath.c} provides some functions to allow for pathnames
  containing extended characters.  This code is fragmentary, obsolete, and
-completely non-working.  Instead, @var{pathname-coding-system} is used
+completely non-working.  Instead, @code{pathname-coding-system} is used
  to specify conversions of names of files and directories.  The standard
  C I/O functions like @samp{open()} are wrapped so that conversion occurs
  automatically.
  
-@file{mule.c} provides a few miscellaneous things that should probably
-be elsewhere.
+@file{mule.c} contains a few miscellaneous things.  It currently seems
+to be unused and probably should be removed.
  
  
  
@@ -5005,6 +5170,7 @@ mule-tests.el
  regexp-tests.el
  symbol-tests.el
  syntax-tests.el
+tag-tests.el
  @end example
  
  @file{test-harness.el} defines the macros @code{Assert},
@@ -5283,7 +5449,7 @@ different section of code.
  
  If you don't understand whether to @code{GCPRO} in a particular
  instance, ask on the mailing lists.  A general hint is that @code{prog1}
-is the canonical example
+is the canonical example.
  
  @cindex garbage collection, conservative
  @cindex conservative garbage collection
@@ -9419,6 +9585,15 @@ elements, i.e. they hold all the information necessary to produce an
  image on-screen but the image need not exist at this stage, and multiple
  screen images can be instantiated from a single glyph.
  
+@c #### find a place for this discussion
+@c The decision to make image specifiers a separate type is debatable.
+@c In fact, the design decision to create a separate image specifier
+@c type, rather than make glyphs themselves be specifiers, is
+@c debatable---the other properties of glyphs are rarely used and could
+@c conceivably have been incorporated into the glyph's instantiator.
+@c The rarely used glyph types (buffer, pointer, icon) could also have
+@c been incorporated into the instantiator.
+
  Glyphs are lazily instantiated by calling one of the glyph
  functions. This usually occurs within redisplay when
  @code{Fglyph_height} is called. Instantiation causes an image-instance