Object representation specification for Gambit
----------------------------------------------

This document describes the object representation scheme that all Gambit
back ends must support.


1. Abstract object representation
---------------------------------

Objects are encoded by (signed) integers.  It is required that the
size of the encoding be the same as that for a C `long' (so that
Scheme and C can interface easily).  A certain number of bits are
allocated for each of 2 fields in encodings: the `data' and `type
tag'.  The type tag determines which class the object belongs to out
of 7 possible classes.  Thus, the type tag field must be at least 3
bits wide.  Some of the objects are scalars (the encoding of a scalar
is always the same), the others are known as memory allocated objects
(a memory allocated object might have a different encoding after a GC
because its encoding depends on its address).  The actual mapping of
type tags to object classes is totally left to the implementor and can
thus be different from machine to machine.

The 7 object classes are defined as follows:

Class        Objects represented

FIXNUM       small signed integers
SPECIAL      characters, #t, #f, () and other special objects
PAIR         cons cells
WEAK_PAIR    'weak' cons cells (car is replaced by #f if not referenced)
PLACEHOLDER  placeholders (returned by `future' special form)
SUBTYPED     vectors, strings, symbols, flonums, ports, etc.
PROCEDURE    procedures (includes true closures and return addresses)

The FIXNUM and SPECIAL classes are scalars.  The PAIR, WEAK_PAIR,
SUBTYPED, PROCEDURE and PLACEHOLDER classes are memory allocated objects.

There are some restrictions to the representation of scalars.  The
encoding of the fixnum `n' is required to have `n' in the data field
(in 2s-complement representation).  This should make the
implementation of fixnum arithmetic simple on most machines and also
simplifies the interface to the object representation.  Additionally,
the encoding of a character is required to have `n' in the data field,
where `n' is the machine representation of the character and 0<=n<=255
(i.e. the lower 8 bits of the data field are the machine
representation of the character and the other data field bits are 0).

Also, in order to permit the creation of new scalars by the runtime or the
user (e.g.  the introduction of an `end-of-file' object), there should
be a reserved range of encodings in the SPECIAL class.  This range is
back end dependent.

Some memory allocated objects have a header.  Headers are used to
store length and subtype information.  PAIRS, WEAK_PAIRS and PLACEHOLDERS
don't have a header because their length is fixed.  PAIRS and WEAK_PAIRS
have two fields (the CAR and CDR).  PLACEHOLDERS have 4 fields (the VALUE,
LOCK, THUNK and QUEUE).  The VALUE slot of placeholders must come first.

SUBTYPED objects have a header.  Headers are the same
size as a C `long' and are always at the very beginning of the object.
Headers contain two fields, the `length' and `subtype'.  The length
corresponds to the length of the object in bytes (excluding the
header).  The following is a list of the required subtypes:

VECTOR, SYMBOL, PORT, RATNUM, COMPNUM, STRING, BIGNUM, FLONUM

SUBTYPED objects are all vector-like.  They are subdivided into 2
classes: object-vectors (those whose elements are objects) and
byte-vectors (those that contain binary information).  The key
difference between these is that the first type has to be scanned by
the GC and the second doesn't.  There must be reserved ranges of
subtypes for the creation of both new object-vector and byte-vector
object types.  These ranges are back end dependent.

Closures (i.e. special PROCEDURE objects) have two parts.  The first
part contains binary data that is not scanned by the GC.  The second
part holds the closed variables of the closure and is thus scanned by
the GC.  The length of the first part is constant and is back end
dependent.


2. Notes on implementation
--------------------------

The abstract representation leaves a certain amount of freedom to the
implementor so that representation decisions can be tailored to each
target machine.  What follows is a sample implementation for the M68000.

First of all, we can choose encodings to be 32 bits wide (the length of
a C `long'), using the lower 3 bits for the type tag and the upper
29 for the data field.

We can then elect to allocate memory allocated objects at addresses that
are multiples of 8.  This should not waste too much memory as pairs
neatly fit in 8 bytes and 2 bytes are wasted on the average per
object-vector.  The address of a memory allocated object thus has the 3
bits `000' in the type field.  The encoding of memory allocated objects
can thus be the address of the start of the object plus the type tag
in the lower 3 bits.  This has the advantage that no addressing range
is lost and the address of a slot in an object can be computed simply
by adding an offset to the object encoding.

We can also be clever in the selection of the type tags.  For example,
we could choose the following:

000 FIXNUM        001 WEAK_PAIR     010 PROCEDURE     011 SUBTYPED
100 PAIR          101 PLACEHOLDER   110 not used      111 SPECIAL

The advantages of this mapping are:

  1) Fixnum arithmetic is very simple.  For addition and substraction,
     just add or substract the encodings themselves.  Multiplication
     and division require a single extra shift by 3.

  2) Pairs can be represented as: CDR followed by CAR.  This means that
     the encoding of a pair is the address of the CAR of the pair.
     Thus, `address register indirect' addressing can be used to access
     the CAR and `address register indirect with predecrement' addressing
     can be used to access the CDR.  Both of these are efficient
     addressing modes on the M68000.

  3) Type tags can be tested in a single instruction.  This is done with
     a `btst' (bit-test) instruction and a couple of data registers that
     contain masks.  For example, if register d1 must be tested for
     `pair?' and register d7 always contains the `pair mask' value
     11101111111011111110111111101111, then the instruction sequence

        btst d1,d7
        beq  xxx

     will jump to `xxx' if register d1 is a pair.  The mask for
     placeholders is 11011111110111111101111111011111.  Note that these
     two values are in fact valid encodings of SPECIAL objects and we
     could choose:

        pair mask        = encoding of #f
        placeholder mask = encoding of ()

     Tests for `falseness' and `null?' would then be very efficient if
     these two masks are in fact kept in registers.

  4) If the entry point of procedures is at an offset of 2 off of the
     start of the procedure object then jumps to procedures can be
     performed by a direct jump to the encoding of the procedure.  This
     respects the M68000 restriction that only even addresses can be jumped
     to.  Note that this will require careful alignment of the procedure
     entry points (and return addresses which are also `procedures') to
     addresses having 010 in the lower 3 bits.

  5) The format of SUBTYPED objects can be defined as:

                               upper 24 bits       lower 8 bits
        start of     +---------+---------+---------+---------+
        object   --> |   length of data in bytes   | subtype |  header
                     +---------+---------+---------+---------+
                     |  object 0    (or first 4 bytes)       | \
                     +---------------------------------------+ |
                     |  object 1    (or next 4 bytes)        | | data
                     +---------------------------------------+ |
                     | ...                                   | /
                     +---------------------------------------+
                      <-------------- 32 bits -------------->

     where `subtype' = n*8 with `n' in the range 0..15 for object-vectors
     and in the range 16..31 for byte-vectors.  It is then fairly easy to get
     the subtype and the length information.  The encoding of a SUBTYPED
     object is the address of the subtype information so `address register
     indirect' addressing can be used to get the subtype.  The length of an
     object-vector (in fixnum format) is the header shifted right by 7.
     For a byte vector, a shift of 5 followed by a resetting of the lower
     3 bits to 0 is needed.

Similar implementation tricks can be used for other machines to take
advantage of their architecture and instruction set.


3. Interface to the object representation
-----------------------------------------

The interface to the object representation is defined through global
variable bindings that all back ends must have.  Some of these
bindings are to procedures (which we will call primitives).
The primitives described are not expected to check for error
conditions (e.g. type and bounds checks).  The names of the global
variables used are prefixed by `##' so that they reside in a different
name space than the identifiers available to the user.


Miscellaneous

  (##NOT x)                 #t if x is false, else #f

  (##EQ? x y)               #t if encoding of x and y are the same, else #f


Type tags

  (##TYPE obj)              Returns the fixnum equal to type tag of `obj'.
                            E.g. (##type "") --> SUBTYPED type tag.

  (##TYPE-CAST obj tag)     Returns the `data' field of `obj' with `tag'
                            in the `type tag' field.
                            E.g. (##type-cast x (##type 0)) --> fixnum
                            equal to data field part of encoding of x.
                            Notice that it is safe to convert a memory
                            allocated object into a scalar type, but
                            generally not the reverse.


Subtype tags

  (##SUBTYPE obj)           Returns fixnum equal to subtype tag of SUBTYPED
                            object `obj'.
                            E.g. (##subtype "hi") --> subtype of strings.

  (##SUBTYPE-SET! obj tag)  Changes the subtype of the SUBTYPED object `obj'
                            to `tag' and returns `obj'.


Pairs and weak pairs

  (##CONS x y)              Same as (cons x y).
  (##SET-CAR! x val)        Same as (set-car! x val) but returns `x'.
  (##SET-CDR! x val)        Same as (set-cdr! x val) but returns `x'.
  (##CAR x)                 Same as (car x).
  (##CDR x)                 Same as (cdr x).

  (##WEAK-CONS x y)         Similar but for weak pairs
  (##WEAK-SET-CAR! x val)   
  (##WEAK-SET-CDR! x val)   
  (##WEAK-CAR x)            
  (##WEAK-CDR x)            


Object vectors

  (##MAKE-VECTOR n val)     Same as (make-vector n val).
  (##VECTOR-LENGTH x)       Same as (vector-length x).
  (##VECTOR-REF x i)        Same as (vector-ref x i).
  (##VECTOR-SET! x i val)   Same as (vector-set! x i val) but returns `x'.
  (##VECTOR-SHRINK! x n)    Shrinks length of object-vector `x' to `n'
                            and returns `x'.


Byte vectors

  (##MAKE-STRING n val)     Same as (make-string n val).
  (##STRING-LENGTH x)       Same as (string-length x).
  (##STRING-REF x i)        Same as (string-ref x i).
  (##STRING-SET! x i val)   Same as (string-set! x i val) but returns `x'.
  (##STRING-SHRINK! x n)    Shrinks length of byte-vector `x' to `n'
                            and returns `x'.


Fixnums

(numeric arguments and results are fixnums)

  (##FIXNUM.+ x...)
  (##FIXNUM.* x...)
  (##FIXNUM.- x y...)
  (##FIXNUM.QUOTIENT x y)   

  (##FIXNUM.= x...)
  (##FIXNUM.< x...)


Flonums

(numeric arguments and results are flonums)

  (##FLONUM.+ x...)
  (##FLONUM.* x...)
  (##FLONUM.- x y...)
  (##FLONUM./ x y...)

  (##FLONUM.= x...)
  (##FLONUM.< x...)


Procedures

  (##APPLY p args)
  (##CALL-WITH-CURRENT-CONTINUATION p)


Placeholders

  (##TOUCH x)  Return `x' if it is not a placeholder, otherwise
               wait until value of `x' is known and return it.


Interpreter related

  (##GLOBAL-VAR name)            Return index of global var. `name'
  (##GLOBAL-VAR-REF index)       Reference a global var. using its index
  (##GLOBAL-VAR-SET! index val)  Assign value to a global var. using its index