Object representation specification for Gambit ---------------------------------------------- This document describes the object representation scheme that all Gambit back ends must support. 1. Abstract object representation --------------------------------- Objects are encoded by (signed) integers. It is required that the size of the encoding be the same as that for a C `long' (so that Scheme and C can interface easily). A certain number of bits are allocated for each of 2 fields in encodings: the `data' and `type tag'. The type tag determines which class the object belongs to out of 7 possible classes. Thus, the type tag field must be at least 3 bits wide. Some of the objects are scalars (the encoding of a scalar is always the same), the others are known as memory allocated objects (a memory allocated object might have a different encoding after a GC because its encoding depends on its address). The actual mapping of type tags to object classes is totally left to the implementor and can thus be different from machine to machine. The 7 object classes are defined as follows: Class Objects represented FIXNUM small signed integers SPECIAL characters, #t, #f, () and other special objects PAIR cons cells WEAK_PAIR 'weak' cons cells (car is replaced by #f if not referenced) PLACEHOLDER placeholders (returned by `future' special form) SUBTYPED vectors, strings, symbols, flonums, ports, etc. PROCEDURE procedures (includes true closures and return addresses) The FIXNUM and SPECIAL classes are scalars. The PAIR, WEAK_PAIR, SUBTYPED, PROCEDURE and PLACEHOLDER classes are memory allocated objects. There are some restrictions to the representation of scalars. The encoding of the fixnum `n' is required to have `n' in the data field (in 2s-complement representation). This should make the implementation of fixnum arithmetic simple on most machines and also simplifies the interface to the object representation. Additionally, the encoding of a character is required to have `n' in the data field, where `n' is the machine representation of the character and 0<=n<=255 (i.e. the lower 8 bits of the data field are the machine representation of the character and the other data field bits are 0). Also, in order to permit the creation of new scalars by the runtime or the user (e.g. the introduction of an `end-of-file' object), there should be a reserved range of encodings in the SPECIAL class. This range is back end dependent. Some memory allocated objects have a header. Headers are used to store length and subtype information. PAIRS, WEAK_PAIRS and PLACEHOLDERS don't have a header because their length is fixed. PAIRS and WEAK_PAIRS have two fields (the CAR and CDR). PLACEHOLDERS have 4 fields (the VALUE, LOCK, THUNK and QUEUE). The VALUE slot of placeholders must come first. SUBTYPED objects have a header. Headers are the same size as a C `long' and are always at the very beginning of the object. Headers contain two fields, the `length' and `subtype'. The length corresponds to the length of the object in bytes (excluding the header). The following is a list of the required subtypes: VECTOR, SYMBOL, PORT, RATNUM, COMPNUM, STRING, BIGNUM, FLONUM SUBTYPED objects are all vector-like. They are subdivided into 2 classes: object-vectors (those whose elements are objects) and byte-vectors (those that contain binary information). The key difference between these is that the first type has to be scanned by the GC and the second doesn't. There must be reserved ranges of subtypes for the creation of both new object-vector and byte-vector object types. These ranges are back end dependent. Closures (i.e. special PROCEDURE objects) have two parts. The first part contains binary data that is not scanned by the GC. The second part holds the closed variables of the closure and is thus scanned by the GC. The length of the first part is constant and is back end dependent. 2. Notes on implementation -------------------------- The abstract representation leaves a certain amount of freedom to the implementor so that representation decisions can be tailored to each target machine. What follows is a sample implementation for the M68000. First of all, we can choose encodings to be 32 bits wide (the length of a C `long'), using the lower 3 bits for the type tag and the upper 29 for the data field. We can then elect to allocate memory allocated objects at addresses that are multiples of 8. This should not waste too much memory as pairs neatly fit in 8 bytes and 2 bytes are wasted on the average per object-vector. The address of a memory allocated object thus has the 3 bits `000' in the type field. The encoding of memory allocated objects can thus be the address of the start of the object plus the type tag in the lower 3 bits. This has the advantage that no addressing range is lost and the address of a slot in an object can be computed simply by adding an offset to the object encoding. We can also be clever in the selection of the type tags. For example, we could choose the following: 000 FIXNUM 001 WEAK_PAIR 010 PROCEDURE 011 SUBTYPED 100 PAIR 101 PLACEHOLDER 110 not used 111 SPECIAL The advantages of this mapping are: 1) Fixnum arithmetic is very simple. For addition and substraction, just add or substract the encodings themselves. Multiplication and division require a single extra shift by 3. 2) Pairs can be represented as: CDR followed by CAR. This means that the encoding of a pair is the address of the CAR of the pair. Thus, `address register indirect' addressing can be used to access the CAR and `address register indirect with predecrement' addressing can be used to access the CDR. Both of these are efficient addressing modes on the M68000. 3) Type tags can be tested in a single instruction. This is done with a `btst' (bit-test) instruction and a couple of data registers that contain masks. For example, if register d1 must be tested for `pair?' and register d7 always contains the `pair mask' value 11101111111011111110111111101111, then the instruction sequence btst d1,d7 beq xxx will jump to `xxx' if register d1 is a pair. The mask for placeholders is 11011111110111111101111111011111. Note that these two values are in fact valid encodings of SPECIAL objects and we could choose: pair mask = encoding of #f placeholder mask = encoding of () Tests for `falseness' and `null?' would then be very efficient if these two masks are in fact kept in registers. 4) If the entry point of procedures is at an offset of 2 off of the start of the procedure object then jumps to procedures can be performed by a direct jump to the encoding of the procedure. This respects the M68000 restriction that only even addresses can be jumped to. Note that this will require careful alignment of the procedure entry points (and return addresses which are also `procedures') to addresses having 010 in the lower 3 bits. 5) The format of SUBTYPED objects can be defined as: upper 24 bits lower 8 bits start of +---------+---------+---------+---------+ object --> | length of data in bytes | subtype | header +---------+---------+---------+---------+ | object 0 (or first 4 bytes) | \ +---------------------------------------+ | | object 1 (or next 4 bytes) | | data +---------------------------------------+ | | ... | / +---------------------------------------+ <-------------- 32 bits --------------> where `subtype' = n*8 with `n' in the range 0..15 for object-vectors and in the range 16..31 for byte-vectors. It is then fairly easy to get the subtype and the length information. The encoding of a SUBTYPED object is the address of the subtype information so `address register indirect' addressing can be used to get the subtype. The length of an object-vector (in fixnum format) is the header shifted right by 7. For a byte vector, a shift of 5 followed by a resetting of the lower 3 bits to 0 is needed. Similar implementation tricks can be used for other machines to take advantage of their architecture and instruction set. 3. Interface to the object representation ----------------------------------------- The interface to the object representation is defined through global variable bindings that all back ends must have. Some of these bindings are to procedures (which we will call primitives). The primitives described are not expected to check for error conditions (e.g. type and bounds checks). The names of the global variables used are prefixed by `##' so that they reside in a different name space than the identifiers available to the user. Miscellaneous (##NOT x) #t if x is false, else #f (##EQ? x y) #t if encoding of x and y are the same, else #f Type tags (##TYPE obj) Returns the fixnum equal to type tag of `obj'. E.g. (##type "") --> SUBTYPED type tag. (##TYPE-CAST obj tag) Returns the `data' field of `obj' with `tag' in the `type tag' field. E.g. (##type-cast x (##type 0)) --> fixnum equal to data field part of encoding of x. Notice that it is safe to convert a memory allocated object into a scalar type, but generally not the reverse. Subtype tags (##SUBTYPE obj) Returns fixnum equal to subtype tag of SUBTYPED object `obj'. E.g. (##subtype "hi") --> subtype of strings. (##SUBTYPE-SET! obj tag) Changes the subtype of the SUBTYPED object `obj' to `tag' and returns `obj'. Pairs and weak pairs (##CONS x y) Same as (cons x y). (##SET-CAR! x val) Same as (set-car! x val) but returns `x'. (##SET-CDR! x val) Same as (set-cdr! x val) but returns `x'. (##CAR x) Same as (car x). (##CDR x) Same as (cdr x). (##WEAK-CONS x y) Similar but for weak pairs (##WEAK-SET-CAR! x val) (##WEAK-SET-CDR! x val) (##WEAK-CAR x) (##WEAK-CDR x) Object vectors (##MAKE-VECTOR n val) Same as (make-vector n val). (##VECTOR-LENGTH x) Same as (vector-length x). (##VECTOR-REF x i) Same as (vector-ref x i). (##VECTOR-SET! x i val) Same as (vector-set! x i val) but returns `x'. (##VECTOR-SHRINK! x n) Shrinks length of object-vector `x' to `n' and returns `x'. Byte vectors (##MAKE-STRING n val) Same as (make-string n val). (##STRING-LENGTH x) Same as (string-length x). (##STRING-REF x i) Same as (string-ref x i). (##STRING-SET! x i val) Same as (string-set! x i val) but returns `x'. (##STRING-SHRINK! x n) Shrinks length of byte-vector `x' to `n' and returns `x'. Fixnums (numeric arguments and results are fixnums) (##FIXNUM.+ x...) (##FIXNUM.* x...) (##FIXNUM.- x y...) (##FIXNUM.QUOTIENT x y) (##FIXNUM.= x...) (##FIXNUM.< x...) Flonums (numeric arguments and results are flonums) (##FLONUM.+ x...) (##FLONUM.* x...) (##FLONUM.- x y...) (##FLONUM./ x y...) (##FLONUM.= x...) (##FLONUM.< x...) Procedures (##APPLY p args) (##CALL-WITH-CURRENT-CONTINUATION p) Placeholders (##TOUCH x) Return `x' if it is not a placeholder, otherwise wait until value of `x' is known and return it. Interpreter related (##GLOBAL-VAR name) Return index of global var. `name' (##GLOBAL-VAR-REF index) Reference a global var. using its index (##GLOBAL-VAR-SET! index val) Assign value to a global var. using its index