
------  pcpccode.txt ----------- Copyright (C) 1991  by Georg Post


This file documents some internals of the PCPC Pascal-C code converter.
It is a complement to the user's guide PCPCDOC.TXT .

Whenever I refer to some object (type,procedure,variable...) in the code,
I'll use the notation [UnitId.ObjectId] in the following discussion.

Topics included here are:
  1. Coding discipline
  2. Overview of the code
  3. Data structures of the Pascal to C translation program
  4. The theory of AuxParams
  5. The Parser
  6. The Scanner
  7. Standard library
  8. Organization of the grammar specification file Grammar5
  9. Syntax debug utility Chekgram.pas
 10. Grammar Bugs
 11. Hints on customizing PCPC
 12. Work in progress on future version of PCPC


1. Coding discipline

    PCPC is a conservatively written program supplied in FULL SOURCE CODE.
It compiles under Turbo Pascal Version 4.0 (no other versions tested).
It does not use any of the following hardware specific, "unsafe" or
"outright dangerous" features:
  no  Units Dos,Crt;
  no  sins against Structured Program Flow  such as label,goto,halt,exit;
  no  inline,interrupt,absolute,external;
  no  mem,port, move,fillchar, addr,@,ptr,ofs,seg, cseg,dseg,sptr,sseg;
  no  typeless parameters, typeless files, generic pointers;
  no  typecasts or sizeofs;
  no  conditional compilation.
(The sole exception to these rules is the [Pascannr.rangeCheck] routine,
included for debugging purposes. That one would disappear if MS-DOS were a
true operating system, with memory protection.)
   Among the "sound" features of Turbo Pascal, I did not use: typed constants,
automatic unit initialisation code, files other than Text, variant records,
qualified identifiers.

    At the expense of speed, array bounds and stack overflow checks are
ON {$R+,S+}, and almost every pointer which isn't NEW undergoes a lengthy
test before anything is written to the place it points to. All that means
you can be confident of getting a slow but virus-free binary - if your copy
of the TP compiler isn't infected, of course. Because of the complicated data
structures on the heap, however, I won't guarantee that my translator is free
from pointer bugs, or that it will never crash the system when some ugly
source code is fed into it one day.

    Hopefully, this program will never, as it evolves, indulge in hardware
dependencies or safety loopholes. One single @-operator in Turbo Pascal
code might suffice to create the most ugly self-modifying program,
overwriting one of its procedures with some bomb extracted from
innocent-looking ASCII strings... even if all the rest spoke only the purest
Jensen-Wirth dialect. While Turbo Pascal has pitfalls related to most of the
above-mentioned "extensions" which I have banished from my code, I do not
even dare to confess my true opinion on the C language.


2. Overview of the code

Program files .PAS (after each ":", the list of files upon which it depends):

#(1)          for PCPC.EXE first pass of translator:
pcpcdata:
pascannr: pcpcdata
semanti6: pascannr
cdeclara: pascannr
cnesting: cdeclara
cbulk   : semanti6 cnesting
pcpcpars: cbulk
getunits:
pcpc    : pcpcpars getunits

#(2)           the REORD2.EXE postprocessor
killansi:
reord2  : getunits killansi

#(3)           the CHEKGRAM.EXE grammar debugger
gramtool:
chekgram: semanti6 getunits gramtool


Essential parts of the PCPC program (pass 1):

- The Startup Code: units [pcpcdata] and [pascannr]
               makes the Grammar Table and other global and heap data
- The Scanner: extracts elementary items from the source text
               (numbers, strings, reserved and user-defined symbols)
               and updates the Symbol Table (a series of binary trees).
               It strips comments and converts numbers to binary format.
- The Parser:  [pcpcpars] is steered recursively by the Grammar Table
               [pcpcdata.Gt] and the incoming symbols from the Scanner.
- The Semantics Part: [semanti6] does housekeeping of types, partial
               expressions etc.
               All the information contained in declaration parts is stored
               here in internal heap data.
               Feedback interaction with the symbol table handler for the
               management of identifier scopes (block nesting,
               record field scope nesting).
- The Code Generator: 3 units [cdeclara], [cnesting], [cbulk]
               write program code in the destination language.
               In my design, the toughest part of it (Unit [Cnesting])
               and the second pass (postprocessor [Reord2]) do the
               hard work of "peeling" nested procedures.

The main loop of PCPC acts on any of the PAS files in the list to be converted.
It first calls three INIT routines for: Pascannr Semantic Cbulk.
  (allocation of heap memory, global variables start values).
Then it translates the file, preceded by all the interface parts of used units.
Finally, it calls the TERM routines for: Cbulk Semantic Pascannr.
  (de-allocation of used heap, garbage collection)
Remark:
  The most efficient (but non-portable to C) heap management would have been
a single Mark/Release pair. I chose to implement explicit, slow Dispose loops
in each one of the modules. Any pointed objects are inserted into trees and
linked lists so that one can reach them at heap cleanup time.
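
The cleanup idea can be illustrated in C. This is a minimal sketch, not
PCPC's code: the Node type and the function names are hypothetical, and
only the link names lf, rg and chain are borrowed from the symbol table
described in section 3.

```c
#include <stdlib.h>

/* Hypothetical node: lf/rg are tree links, chain is a list link. */
typedef struct Node {
    struct Node *lf, *rg;
    struct Node *chain;
} Node;

/* Post-order walk: free both subtrees, then the node itself.
   Returns the number of nodes freed. */
unsigned long dispose_tree(Node *t)
{
    unsigned long n;
    if (t == NULL) return 0;
    n = dispose_tree(t->lf) + dispose_tree(t->rg);
    free(t);
    return n + 1;
}

/* Walk a chain, saving the next link before each free.
   Returns the number of nodes freed. */
unsigned long dispose_list(Node *p)
{
    unsigned long n = 0;
    while (p != NULL) {
        Node *next = p->chain;
        free(p);
        p = next;
        n++;
    }
    return n;
}

/* Helper for experiments: one zeroed node, or NULL if out of memory. */
Node *new_node(void)
{
    return calloc(1, sizeof(Node));
}
```

Because every object is reachable from a tree root or a list head, one
call per root is enough to return the whole structure to the heap.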


3. Data structures of the Pascal to C translation program

 The program keeps the following data on the heap:
- a binary tree  of identifiers (the Symbol Table), of type [Pcpcdata.Ide].
  Identifiers that come in lists (record fields, enumerated consts,
  proc/funct parameters, var declarations) have a .chain link.
  The identifier forest has one tree for each nested block level ("scope").
- a collection of all the Type Elements [Pcpcdata.Tpel] used by the source text
  There are pre-defined standard type elements for Boolean, Integer ...
  Each Identifier is linked to some type element via .typof
  Each Named type element is linked back to an identifier via .tName
  We look up the type hierarchy for (assignment) compatibility checks.
     Unit [Semantic] has local data:
- a stack of type element pointers, accompanying the target data stack
- a stack of identifier pointers
- a stack of Labels for control statement translation

Type elements have a "class" character .cl, two pointers p,q to other type
elements, as well as two integers l,m. Their meaning varies.
Data for type Elements (with their letter codes .cl):

Y Array     : p=first index type,
              q="rest array" type or base type,
              l=total length,
              m="rest array" length for index offset calculations.
R Record    : p=first field type,
              q="rest record" type or Nil,
              l=total length,
              m=first field length.
U Function  : p=result type,
              q=rest A/L type,
              l=total length of the parameter zone,
              m=number of optional parameters
D Procedure : p=nil,
              q=rest A/L type,
              l and m: like U
A var-param : p=parameter type,
              q=rest A/L, (vAr/vaL parameters)
              m=1 for weak parameter type, else 0
L val-param : p=parameter type,
              q=rest A/L,
              m= as for A.
P Pointer   : p=pointed type, l=2
S Subrange  : p=parent type,
              l,m= ord of mini and maxi
E Enumerated: p,q=Nil,
              l=ord of maxi,
              m=discriminating number.
F typed File: p= record type (if Nil, untyped file)
e Set       : p= base type

Pre-defined type elements:

b Boolean
c Char
y Byte
h Short
i Integer
w Word
l LongInt
r Real
d Double
s String
t Text
p (Generic) Pointer
* typeless (parameter)
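
To make the Array and Subrange entries concrete, here is a minimal sketch
in C. The record layout, the helper names and the element length of 6
used in the usage note are assumptions for illustration; only the field
names cl, p, q, l, m and their meanings are taken from the list above.

```c
#include <stddef.h>

/* Assumed layout; the field names follow the description above. */
typedef struct Tpel {
    char cl;              /* class letter: 'Y' array, 'S' subrange, ... */
    struct Tpel *p, *q;
    long l, m;
} Tpel;

/* S Subrange: p = parent type, l and m = ord of lower and upper bound. */
Tpel make_subrange(Tpel *parent, long lo, long hi)
{
    Tpel t = { 'S', NULL, NULL, 0, 0 };
    t.p = parent;
    t.l = lo;
    t.m = hi;
    return t;
}

/* Y Array: p = first index type, q = "rest array" or base type,
   l = total length, m = length of the rest, used for index offsets. */
Tpel make_array(Tpel *index, Tpel *rest, long rest_len)
{
    Tpel t = { 'Y', NULL, NULL, 0, 0 };
    t.p = index;
    t.q = rest;
    t.m = rest_len;
    t.l = (index->m - index->l + 1) * rest_len;
    return t;
}

/* Offset of element i along the first index of an array type. */
long index_offset(const Tpel *arr, long i)
{
    return (i - arr->p->l) * arr->m;
}
```

For array[1..4] of array[1..10] of real with an element length of 6, the
inner array gets l=60, m=6 and the outer one l=240, m=60, so the row with
index 4 starts at offset (4-1)*60 = 180.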


        All const, var, type, procedure, function identifiers have a
        pointer to a Type entry. All field id's point to their
        origin Record type.

Conventions for entries in the symbol table (type is [Pcpcdata.ide] ):

 name: is the identifier , truncated to 15 characters or padded with #0
       First character converted to upper case, else spelled as encountered
       for the first time
 class: one of the classification hints (enumerated type PASCANN4.idClass)
 typof: pointer to a type element related to the identifier. Should never be
        undefined or NIL, during the useful lifetime of the symbol.
 rg,lf: right-left pointers for the binary tree structure of the
        symbol table. No effort has been made to balance that tree...
 chain :another link used by chains of parameter or variable declaration
        lists, and by field identifiers in Records.
 x:     Number of various meaning, defined in the Semantics part:
        For variables, procedure parameters, and record fields, x is the
        address offset that a compiler would use. For procedures and
        function Ids, x is the label number of the entry point.
        For constants, x is the integer value.
 y:     Is the ordinal number of chained elements (parameter lists, record
        fields, VAR identifiers). For Type and Const identifiers, y is the
        scope level: useful for local types to be made global in C.
        For procedure and function Ids, y classifies normal,forward,external
        and interface-part procedures.
 rScope:record scope number, is positive for record field Id's.
 reUse: flag for Pascal-legal, C-illegal, reused global identifier.
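
The name and rg/lf conventions can be sketched in C as follows. This is
an illustration under assumptions, not the real [Pcpcdata.ide]: the
record is cut down to three fields, and annex_id and cmp_name are
hypothetical names.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 15

/* Reduced, assumed version of a symbol table entry. */
typedef struct Ide {
    char name[NAME_LEN + 1];   /* truncated to 15 chars, padded with 0 */
    struct Ide *lf, *rg;       /* left and right subtree               */
} Ide;

/* Compare two names ignoring case, over at most NAME_LEN characters. */
static int cmp_name(const char *a, const char *b)
{
    int i;
    for (i = 0; i < NAME_LEN; i++) {
        int ca = toupper((unsigned char)a[i]);
        int cb = toupper((unsigned char)b[i]);
        if (ca != cb) return ca - cb;
        if (ca == 0) break;    /* both names end here */
    }
    return 0;
}

/* Find id in the (unbalanced) tree, or insert it if it is absent.
   Returns the entry, or NULL if memory runs out. */
Ide *annex_id(Ide **root, const char *id)
{
    while (*root != NULL) {
        int c = cmp_name(id, (*root)->name);
        if (c == 0) return *root;
        root = (c < 0) ? &(*root)->lf : &(*root)->rg;
    }
    *root = calloc(1, sizeof **root);
    if (*root == NULL) return NULL;
    strncpy((*root)->name, id, NAME_LEN);   /* spelling of the first use */
    (*root)->name[0] = (char)toupper((unsigned char)id[0]);
    return *root;
}
```

With these rules "counter" and "COUNTER" share one entry stored as
"Counter", and two long names that agree in their first 15 characters
are treated as the same identifier.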


Procedures/Functions:
At declaration time, there are 2 chains:
The type element of the proceDure/fUnction, defined at _blockEntry
[Semantic.defFunct] is linked via q to the chain of vAr/vaL parameters.
Each Parameter type element has a pointer p to the base type.
The Function's pointer p is the result type.
All these tpel are linked to their names (of type Ide) via typof, backward
link with tname.
The Id chain has its own (redundant) link via the Chain field.
HOWEVER, only the proc/function Name is externally visible at scope level L.
The parameter names have level L+1 and will be scratched at the exit point
of the proc/funct definition. Their Chain, and the Tname pointers, will
lose any meaning at Call Time of a proc/funct. The q links survive intact.


4. The theory of AuxParams:

Nested procedures are not allowed in C. Such Pascal procedures can escape from
their local context by acquiring auxiliary parameters (abbrev. as AuxParam).
A "semilocal" symbol which can occur as a new auxiliary parameter will have
6  places of use:

1.  Origin,      some procedure declares a VAR or a var/val parameter
2.  Use,         in the procedure where it originates
3.  Use,         in the body of an INNER procedure (scope_of_use > decl.scope)
4.  Declaration, in the header of the inner procedure to be made global
5.  Call INNER   param. passed from the proc. declaring the symbol (at place 1)
6.  Call INNER   by another proc. inside the scope of symbol's origin (place 4)

Cases 3,4 and 6 refer to AuxParams: the identifier gets a _p prefix there
to make it unique. Cases 5,6 are handled in the Reorder postprocessor.

Three classes of variables should be considered: simple types, arrays, records.
Typed Const and File are treated like declared Variables, simple Constants
become global.

Example types for the translation list:

type Stri=string[23];              -->   typedef char Stri[24];
type Compx=record r,i:real end;    -->   typedef struct {double r,i;} Compx;
Stri is an array type, Compx is a record type.

Here is what the C program will contain in the places 1 to 6:

  Pascal origin           generated C  program code

A.  Variable        1         2       3       4            5       6

var i:integer;     Int I;     I       *_pI    int *_pI     &I      _pI
const pi=3.14;     Float Pi=  Pi      _pPi    float _pPi   Pi      _pPi
var s:Stri;        Stri S;    S       _pS     char *_pS    S       _pS
var c:Compx;       Compx C;   C       *_pC    Compx *_pC   &C      _pC

B. Value Parameter  (local copies _g for arrays & records )

pi: integer        int pi     pi      *pi     int *pi      &pi     pi
ps: Stri           Stri ps    _gps    ps      Stri ps      _gps    ps
pc: Compx          Compx *pc  _gpc    *pc     Compx *pc    &_gpc   pc

C. Var Parameter    (the easiest job )

var vi:integer;    int *vi    *vi     *vi     int *vi      vi      vi
var vs:Stri;       Stri vs    vs      vs      Stri vs      vs      vs
var vc:Compx;      Compx *vc  *vc     *vc     Compx *vc    vc      vc

{ --------   Example code for nested procedures and AuxParams ------- }

program nestdemo;

procedure global;
type arr= array[1..10] of real;
     rec= record re,im: real end;
var  x,y,z: real;
     a:arr;  r:rec;   {x,y,z,a,r are semilocal variables}

procedure local1;
var i:byte;
begin
  with r do begin re:=x; im:=y; end;
  for i:=1 to 10 do a[i]:=z;
end;

procedure local2;
begin
  x:=1.0;y:=2.0; z:=10.0;
  local1;
end;

procedure local3(a1:arr; var a2:arr; r1:rec; var r2:rec);
begin
  a1:=a2; a2:=a1; {array & record assignments, Var or Value parameters}
  r1:=r2; r2:=r1;
end;

begin local2 end; {global}

begin global end. {nestdemo}
---------------------------------------------
translation with -a option, commented extract
---------------------------------------------
  typedef Real _tArr[10];     /* types become global, with _t prefix  */
  typedef struct _gRec {      /* struct get tags with -g, seldom used */
      Real Re;
      Real Im;
    } _tRec;

void _fLocal1 (        /* prefix _f for globalised functions */
_pR,_pX,_pY,_pA,_pZ)   /* AuxParams get _p prefix */
                       /* all but array (implicit pointer) are pointers */
_tRec *_pR;Real *_pX;Real *_pY;_tArr _pA;Real *_pZ;
{
  Byte I;
  {                 /* dummy {} from With block in Pascal */
    (_pR)->Re = *_pX;
    (_pR)->Im = *_pY;
  }
  for(I= 1;I<= 10;I++) _pA[I-1] = *_pZ; /* array index corrected by offset */
}

void _fLocal2 (       /* AuxParams come in a different order of use */
_pX,_pY,_pZ,_pR,_pA)

Real *_pX;Real *_pY;Real *_pZ;_tRec *_pR;_tArr _pA;
{
  *_pX = 1.0;
  *_pY = 2.0;
  *_pZ = 10.0;
  _fLocal1(_pR,_pX,_pY,_pA,_pZ); /* call of 1 local by another one */
}

void _fLocal3 (A1,A2,R1,R2  /* demonstrates structured type parameters */
)
_tArr A1;_tArr A2;_tRec *R1;_tRec *R2  /* ALL params are pointers */
;
{
  _tArr _gA1;  /*  local copies required for A1, R1 */
  _tRec _gR1;
  _mY(_gA1,A1);
  _mR(_gR1,R1);
  _mA(_gA1,A2,sizeof(_tArr));  /* for  array Parameters: sizeof(Type)  */
  _mA(A2,_gA1,sizeof(_tArr));  /* sizeof(A2) would be wrong ! */
  _mR(_gR1,(*R2));
  _mR((*R2),_gR1);
}

void Global (){
  Real X,Y,Z;  /* outer level context: variables have their Pascal Id's */
  _tArr A;
  _tRec R;
  _fLocal2(&X,&Y,&Z,&R,A); /* call of the local by global procedure */
}                          /* note the missing & for arrays */
----------------------------------------------------------------------
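
For readers who want to compile the idea with a present-day compiler,
here is a condensed sketch of the same AuxParam scheme with prototype-style
headers. It is not PCPC output. Assumptions: Real is taken to be double,
the _t and _f prefixes are dropped from file-scope names (ISO C reserves
leading underscores there), and Global returns a value, which the Pascal
original does not, only so that the effect can be observed.

```c
typedef double Real;              /* assumption: Real maps to double */
typedef Real tArr[10];
typedef struct { Real Re, Im; } tRec;

/* Inner procedure made global: every semilocal variable it touches
   arrives as an AuxParam. Scalars and records come as pointers, the
   array as itself (it decays to a pointer anyway). */
static void fLocal1(tRec *_pR, Real *_pX, Real *_pY, tArr _pA, Real *_pZ)
{
    int i;
    _pR->Re = *_pX;
    _pR->Im = *_pY;
    for (i = 1; i <= 10; i++) _pA[i - 1] = *_pZ;   /* index offset -1 */
}

/* Place 6: one former local calls another and forwards its own
   AuxParams unchanged. */
static void fLocal2(Real *_pX, Real *_pY, Real *_pZ, tRec *_pR, tArr _pA)
{
    *_pX = 1.0;
    *_pY = 2.0;
    *_pZ = 10.0;
    fLocal1(_pR, _pX, _pY, _pA, _pZ);
}

/* Place 5: the declaring procedure passes addresses, except for the
   array. The return value is a hypothetical addition for checking. */
Real Global(void)
{
    Real X, Y, Z;
    tArr A;
    tRec R;
    fLocal2(&X, &Y, &Z, &R, A);
    return R.Re + R.Im + A[9];
}
```

Calling Global() yields 1.0 + 2.0 + 10.0 = 13.0, which shows that the
writes made through the AuxParams reach the variables of the outer
procedure.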


5. The Parser.

The parser relies on a Scanner and the syntax definition file GRAMMAR5.TXT.
The syntax is defined in a readable text file, in a more-Prolog-than-
Backus-Naur like notation. That grammar file is read at the start of the
program [Pcpcdata.ReadGram], and is squeezed into a grammar table,
[Pcpcdata.GT] in a very compact format. The syntax productions in GRAMMAR5
are definitions of Nonterminal Symbols, in terms of sequences of
other symbols (terminal or nonterminal) = the AND clauses, separated by OR
bars. The AND/OR formulae may be nested with {  } brackets, and a special
#-operator signals 0 or more repetitions of an element. The parser moves along
in this table somewhat like a primitive Prolog interpreter,
BUT without backtracking.

The grammar must be LL(1): the recognition of the leading symbol
of an AND chain must uniquely qualify the rest of the chain as the only valid
production. As "side effects", the grammar rules contain references to
"semantic actions" marked with the _-symbol. These side effects tell the
parser to call the Action part of the program which manages things like the
type checking database, and which hands information over to the Code Generator
parts (for C).
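
The following C sketch shows the flavour of such a rule-driven parser
without backtracking. It is a toy, not the PCPC kernel, and its
conventions are assumptions: rules are plain strings, tokens are single
characters, an upper case letter names an explicit nonterminal, _x fires
action x, {a|b} is a list of alternatives, and # repeats the {..} list
that follows it. To keep it short, every alternative must start with a
terminal (after optional actions) or be empty.

```c
#include <ctype.h>
#include <stddef.h>

static const char *rules[26];   /* explicit nonterminals 'A'..'Z'     */
static const char *in;          /* current position in the token text */
static char *act;               /* actions, in the order they fired   */
static size_t nact, actcap;
static int failed;

/* Step over one item: optional #, then _x, a {...} list, or one char. */
static const char *skip_item(const char *r)
{
    int depth = 0;
    if (*r == '#') r++;
    if (*r == '_') return r + 2;
    if (*r != '{') return r + 1;
    do {
        if (*r == '{') depth++;
        else if (*r == '}') depth--;
        r++;
    } while (depth > 0);
    return r;
}

/* In list g (at its '{'), return the FIRST alternative that is empty or
   whose leading terminal equals the lookahead; NULL if none fits. */
static const char *choose(const char *g)
{
    const char *r = g + 1;
    for (;;) {
        const char *s = r;
        while (*s == '_') s += 2;          /* actions consume no input */
        if (*s == '|' || *s == '}' || *s == *in) return r;
        while (*r != '|' && *r != '}') r = skip_item(r);
        if (*r == '}') return NULL;
        r++;
    }
}

static void run_seq(const char *r);

static void run_item(const char *r)
{
    const char *alt;
    if (*r == '#') {                       /* 0 or more repetitions */
        while (!failed && (alt = choose(r + 1)) != NULL) run_seq(alt);
    } else if (*r == '_') {                /* semantic action */
        if (nact + 1 < actcap) act[nact++] = r[1];
        else failed = 1;
    } else if (*r == '{') {                /* implicit nonterminal */
        alt = choose(r);
        if (alt != NULL) run_seq(alt);
        else failed = 1;
    } else if (isupper((unsigned char)*r)) {
        run_seq(rules[*r - 'A']);          /* explicit nonterminal */
    } else if (*r == *in) {
        in++;                              /* terminal matched */
    } else {
        failed = 1;
    }
}

/* Run the items of one AND chain up to its end ('|', '}' or NUL). */
static void run_seq(const char *r)
{
    while (!failed && *r != '\0' && *r != '|' && *r != '}') {
        run_item(r);
        r = skip_item(r);
    }
}

void set_rule(char name, const char *body) { rules[name - 'A'] = body; }

/* Returns 1 if tokens match the start sequence; actions gets the log. */
int parse(const char *start, const char *tokens, char *actions, size_t cap)
{
    in = tokens; act = actions; nact = 0; actcap = cap; failed = 0;
    run_seq(start);
    if (cap > 0) actions[nact] = '\0';
    return !failed && *in == '\0';
}
```

With the rules E = "T#{+T_a|-T_s}" and T = "{%_n|(E)}", the start
sequence "E$" and the token text "%+(%-%)$", the actions fire in the
order n n n s a, that is, in postfix order. A repeated list must not
contain an empty alternative, otherwise the # loop would never end.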


6. The Scanner.

Calling the Scanner (procedure [Pascann.Scanner]) returns the following
information inside the record [Pcpcdata.ScanStatus],
after advancing by 1 item in the source text:
 cc = a char symbol, ii = an integer, pp = a pointer to an [.Ide] entry.
 optionally, a character string (for strings, real numbers ...)
Meanings of the symbol cc:
  cc> chr(127)  represents a Reserved multicharacter symbol of Pascal
  cc= '%'  : an unsigned decimal integer, ii=its value
  cc= '''  : a string constant, ii=its length
  cc= '?'  : an identifier, pp  points to its data in the symbol table
  cc= '$'  : end of source file is reached.
  cc= other character: single-character symbol of Pascal.
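
A minimal C sketch of these token codes follows. It is not the PCPC
scanner: the function name scan and the two-field record are assumptions,
and reserved words, two-character symbols, comments, real numbers and the
pp pointer are all left out. Only the codes '%', ''', '?' and '$' and
the fields cc and ii follow the list above.

```c
#include <ctype.h>

/* Reduced, assumed version of the scanner status record. */
typedef struct {
    char cc;    /* token code                                */
    long ii;    /* integer value, or length of a string item */
} ScanStatus;

/* Scan one item starting at p; returns the position after it. */
const char *scan(const char *p, ScanStatus *st)
{
    while (*p == ' ' || *p == '\t' || *p == '\n') p++;
    st->ii = 0;
    if (*p == '\0') {                         /* end of source */
        st->cc = '$';
        return p;
    }
    if (isdigit((unsigned char)*p)) {         /* unsigned decimal integer */
        st->cc = '%';
        while (isdigit((unsigned char)*p)) {
            st->ii = st->ii * 10 + (*p - '0');
            p++;
        }
        return p;
    }
    if (isalpha((unsigned char)*p)) {         /* identifier */
        st->cc = '?';
        while (isalnum((unsigned char)*p) || *p == '_') p++;
        return p;
    }
    if (*p == '\'') {                         /* string constant */
        st->cc = '\'';
        p++;
        while (*p != '\0') {
            if (*p == '\'') {
                if (p[1] != '\'') { p++; break; }   /* closing quote */
                p++;                  /* doubled quote counts once */
            }
            st->ii++;
            p++;
        }
        return p;
    }
    st->cc = *p;                              /* single-character symbol */
    return p + 1;
}
```

Scanning the text  ab 42 'it''s' ;  item by item gives the codes
'?', '%' with ii=42, ''' with ii=4, ';' and finally '$'.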

The Grammar Tables contain more "reserved keywords" than Pascal officially
allows. Those standard words have their first character only in upper case,
in the alphabetical list preceding the grammar rules. For example, the
read/write operations show up in the grammar tables since their flexible
argument syntax doesn't conform to that of standard Pascal procedures. The
standard procedure New for heap space allocation will translate into some
macro call in the C code requiring the pointer's precise data type; that's why
it figures as a terminal grammar element, too. When the scanner detects one of
those standard identifiers, flagged as "not really reserved" in an
accompanying data field, it looks at the user-defined symbol list and gives
priority to the user's identifier if it finds one: [Pascannr.scankeywd].
[pascannr.treatSymbol] does a context check to guess if a standard-but-reusable
Symbol is wanted as a newly defined item: in that case it enters by force
into the user's Symbol Table.

Search rules for the symbol table manager [Pascannr.SearchId] and [.AnnexId]:

There are two status numbers in the [Pcpcdata.Scope] record:
actual: is the depth of the procedure block nesting, at any moment  (0,1,2...).
         Any active level has its own symbol tree rooted at pstart[level].
recIndex: is the number identifying the record structure we're in.
          It is 0 at any time when no field identifiers are expected.

A given identifier Id will match the one at PP in the symbol list, IF:
1. The identifiers (converted to upper case, truncated to size 15) are the same
2. After a Unit-dot prefix, the unit qualifier Id^.qualif must match
3. If a Record Dot operation is pending, the record number Id^.rscope is Ok
4. Inside With clauses, the record number may be one of the With records
   With record field Ids are searched before non-record Ids.

Inside a record context, PCPC screens out anything but that record's field
identifiers, const and type identifiers. Elsewhere, searching starts at the
innermost block level, down to level 0, for any non-field identifier
(rscope=0). The rule applies in the declaration parts to reject confusing
objects as field names, and in the statement parts to identify field names
versus synonymous non-field identifiers.
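
The search order can be sketched in C. This is a simplified illustration,
not [Pascannr.SearchId]: each level holds a linked list instead of a
tree, names are compared exactly, and unit qualifiers, With clauses and
the const/type exception inside records are ignored. The names declare
and search_id and the record layouts are assumptions; actual, recIndex,
pstart and rscope follow the description above.

```c
#include <string.h>

#define MAX_LEVEL 8

typedef struct Ide {
    const char *name;
    int rscope;            /* 0 for non-field ids, else record number */
    struct Ide *next;      /* assumed: a list stands in for the tree  */
} Ide;

typedef struct {
    int actual;                /* current block nesting depth         */
    int recIndex;              /* record we are in, 0 if none         */
    Ide *pstart[MAX_LEVEL];    /* one identifier collection per level */
} Scope;

/* Enter id at the given block level. */
void declare(Scope *s, int level, Ide *id)
{
    id->next = s->pstart[level];
    s->pstart[level] = id;
}

/* Search from the innermost level down to level 0. Outside records
   (recIndex = 0) only non-field ids match; inside record number k only
   the fields of that record match. */
Ide *search_id(const Scope *s, const char *name)
{
    int level;
    for (level = s->actual; level >= 0; level--) {
        Ide *p;
        for (p = s->pstart[level]; p != NULL; p = p->next)
            if (p->rscope == s->recIndex && strcmp(p->name, name) == 0)
                return p;
    }
    return NULL;
}
```

With a global X, a local X at level 1 and a field X of record 3, the
lookup returns the local one in statement context, the field while
recIndex is 3, and the global one after the inner block is left.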

The Parser kernel compares the source symbols with the LL(1)  productions
of the syntax data in GT. It generates error messages for any unexpected
non-leading symbol and for undefined, multiply defined or misused
identifiers.
For any error, the errorCounter is incremented. Above a fixed limit,
the tooManyErrors flag is set.
For identifier errors, the symbol table is modified to fit the last use
of the offending identifier and parsing goes on. Any grammar rule that cannot
be satisfied triggers a fatal error and the translating program finishes
without any panic Exit or Goto, by leaving all loops and returning from any
depth of recursion.


7. Standard library.

The System unit interface code, as seen by PCPC, is the last part of the
grammar5.txt file. Those procedures and functions which are not (yet)
imitated in my C library have the mark "@" at the end of their declaration.
   The System procedures and functions enjoy some freedom which the language
denies to ordinary functions, as declared by us disciplined Pascal users:
 *  they may have a LIST of an arbitrary number of parameters (Write, Concat).
      Such functions are not declared but absorbed in the syntax rules.
 *  they may have OPTIONAL parameters. To account for this, an exceptional
      declaration syntax is tolerated ONLY in the System interface code, e.g:
          procedure Halt [exitcode:word] ;
          procedure SetTextBuf (var f:text; var buf [; size:word] );
 *  their parameter's TYPE may be weakly defined. Weak typing is denoted by
      using the "::" symbol in the declaration. For example
          procedure Inc(var x::integer [;n:integer]);
      means that any "weak integer" (finite, scalar type) may be incremented
      but that the optional count (1 by default) must be assignment compatible
      to a true integer.

All these "Sloppy Pascal" features ( [] @ :: ) are vigorously rejected by
PCPC in the application code. Who wants to extend Turbo Pascal even more?
Procedures with an optional parameter have two C versions, for example
Dec(x,y) and DeC(x). [Cnesting.MissingParam] makes the distinction.
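
As an illustration only, the two C versions could look like the macros
below. The names Dec and DeC come from the text above; the macro bodies
are an assumption, since the definitions in the C library are not shown
here.

```c
/* Assumed definitions, for illustration. */
#define Dec(x, y) ((x) -= (y))   /* optional count given by the caller */
#define DeC(x)    ((x) -= 1)     /* count omitted: default of 1        */
```

The translator then has to pick DeC whenever the Pascal call leaves out
the optional argument, since C (at the time) has no default parameters.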


8.  Organization of the grammar specification file Grammar5.txt.

 That file has 3 main parts:
 - declaration of 4 lists of symbols used in the syntax rules
 - the syntax rules of the language, written in a meta-language
 - the Interface declaration for the Turbo Pascal SYSTEM unit.

The four lists of symbols are:

- The Semantic Actions which are interspersed in the syntax rules with a
  leading underbar. This list MUST have a verbatim copy in the Pascal file
  pcpcdata.pas, where it serves as an enumeration type declaration. The last
  symbol MUST be "forbidden" since it stops some loop in [pcpcdata.]
  In fact, the action identifiers act as CASE labels at 2 places of
  the translator:
  in Semanti6 they trigger the semantic "understanding" of the source code;
  in Cbulk they direct the subsequent target code generation.

- The explicit nonterminal symbols of the grammar. These are re-declared later
  between double quotes at the heading of each set of syntax expansions. Note
  that the grammar has many more (implicit) nonterminal symbols.
  Pascannr needs this list, in the precise order of declarations, to detect
  the position of the entry symbols of the Turbo Pascal grammar which are
  "goodFile" and "intrFace".

- The list of Pascal identifier classes. We introduce 13 kinds of identifiers
  in Pascal (e.g. Constant Identifiers, Function Identifiers, ...) which may
  occur in the syntax rules with 2 styles of reference:
  ?suchId    means : here Pascal wants a NEW identifier of class "suchId"
  &suchId    means:  here we should find a KNOWN ident. of class "suchId".

- The ORDERED list of all special symbols of Turbo Pascal, separated by "|" and
  terminated by "||". This includes all two-character reserved symbols, all
  Reserved Words of Pascal (in upper case) and all Special Features of Pascal
  (first char in upper case, then lower case). Special Features are standard
  type, procedure, function etc. identifiers which are not reserved: any
  Pascal program may redeclare them as something else. We include here those
  procedures which have a flexible parameter scheme (like Writeln) so that it
  is easier to describe their calls together with the other syntax rules, than
  to define them with a header in the System interface.

 The meta-rules for the syntax rules:

  The notation of syntax rules makes use of symbols which do not otherwise
  occur in Pascal (or which the scanner preprocessor strips away).
  A nonterminal symbol which may be expanded into an alternative set of
  symbol sequences (a List) is written:
    "nonterm" { sequ1 | sequ2 | sequ3 | .......  }
  If the List stops with |} , this means that the symbol may expand to Null.
  Inside a List {||| },  each item is a sequence of:
   - Pascal terminal symbols (one-char, two-char, all-uppercase-words)
   - Semantic action triggers (symbols preceded by the underbar)
   - Identifier symbols (preceded by & or ? )
   - Explicit nonterminals (must have their expansion elsewhere)
   - Implicit nonterminals which are in fact Lists {||| } again.
  Optionally, an explicit or implicit nonterminal has a prefix, the special
  character # which means: Repeat this construct 0 or more times. Obviously,
  such a nonterminal must not be null-reducible.
  The special symbol % denotes a number, ' denotes a string constant.
  The character ! means that the rest of the line is a comment.
  The syntax of string and numeric constants is "hard-wired" in the scanner.
     Pascannr.ReadGram reads the rules and packs them in a condensed form
  into the global table Gt. Consistency checks on that material are the
  aim of the CHEKGRAM utility.

 The SYSTEM interface part:

  Here we have CONST TYPE VAR PROCEDURE and FUNCTION declarations for standard
  Turbo Pascal features, in "almost" regular Pascal syntax. As the System
  procedures and functions may have varying numbers of parameters and relaxed
  type checking on those parameters, we introduce "Sloppy Pascal extensions".
  These are valid ONLY in system-related declarations, NOT in user code:
  - A Sloppy Pascal procedure may have trailing optional parameters which are
    declared after the regular parameters, between []. If all the parameters
    may be missing, the parameter part is enclosed in [] instead of ().
  - A Sloppy Pascal parameter may have a weak type indicated by the "::"
    symbol (for example var i::integer;). A weak type is a set of the regular
    types (see   .pas) to which an actual parameter may conform in that place.
  - The symbol "@" may follow a Sloppy Pascal procedure/function declaration.
    This simply means that the library code is not (yet) available in C and
    that warning messages occur at each call.


9.   Syntax debug utility  Chekgram.pas:

   The program CHEKGRAM.PAS should be run any time the grammar file GRAMMAR5
is modified. There are checks for trivial errors (braces that do not match,
redundant terminal or nonterminal symbols) and a systematic investigation
of the LL(1) conflicts and of misplaced semantic action markers. CHEKGRAM
builds a transition matrix holding for each pair (terminal, nonterminal) the
next grammar rule to be observed. If abundant memory space were available, that
table would control a fast non-recursive parser kernel, instead of the
recursive one driven by the original syntax rules. Anyway, the grammar is
too big to allow for an elegant handcrafted ("hard-wired") recursive descent
parser; and by principle I would dislike "compiler-compiler"-generated Pascal
source code: I prefer to keep the superior nonprocedural data.
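
A minimal C sketch of such a non-recursive kernel, for a toy grammar
rather than Grammar5. Everything here is an assumption for illustration:
the function predict stands in for the transition matrix, upper case
letters are nonterminals, lower case letters are semantic actions, and
any other character is a terminal ('%' for a number, as in the grammar
notation). The toy grammar is  S -> % n R,  R -> + % a R,  R -> empty.

```c
#include <ctype.h>
#include <string.h>

/* Stand-in for the transition matrix: the production to expand for a
   pair (nonterminal, lookahead), or NULL if the pair is an error. */
static const char *predict(char nonterm, char look)
{
    if (nonterm == 'S' && look == '%') return "%nR";
    if (nonterm == 'R' && look == '+') return "+%aR";
    if (nonterm == 'R' && look == '$') return "";
    return NULL;
}

/* Parse tokens (which must end in '$') with an explicit stack.
   Returns 1 on success; actions then lists the actions in firing order. */
int ll1_parse(const char *tokens, char *actions, size_t cap)
{
    char stack[64];
    size_t top = 0, n = 0;
    if (cap == 0) return 0;
    stack[top++] = '$';
    stack[top++] = 'S';
    while (top > 0) {
        char sym = stack[--top];
        if (isupper((unsigned char)sym)) {          /* expand a rule */
            const char *rhs = predict(sym, *tokens);
            size_t len, i;
            if (rhs == NULL) return 0;
            len = strlen(rhs);
            if (top + len > sizeof stack) return 0;
            for (i = len; i > 0; i--) stack[top++] = rhs[i - 1];
        } else if (islower((unsigned char)sym)) {   /* semantic action */
            if (n + 1 >= cap) return 0;
            actions[n++] = sym;
        } else {                                    /* match a terminal */
            if (*tokens != sym) return 0;
            tokens++;
        }
    }
    actions[n] = '\0';
    return 1;
}
```

For the token text "%+%+%$" the kernel accepts and fires the actions
n a a. The loop needs no recursion; the price is the table, which for a
grammar of realistic size has one entry per (nonterminal, terminal) pair.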

If you start Chekgram without command-line parameters, you get a menu. Else
it goes through an automatic sequence of tests on the grammar5.txt file and
then takes the first argument as a Pascal file to be parsed.
Chekgram's current output is too confusing; neat formatting is still on my
to-do list.
To catch all of Chekgram's remarks in a file, simply type at MS-DOS:
  chekgram blabla >chekgram.out

 Actions of Chekgram (unit [gramtool] ):

- Import the grammar table Gt of "condensed" syntax specs (repetition clauses,
  implicit nonterminals, nested braces ): [pascannr.readGram]
- Expansion into simple grammar rules
  (procedure productionList). Repetitions are transformed into right (i.e.
  tail) recursions.  Left recursion is strictly forbidden in LL(1) grammars.
  Terminals, Nonterminals and Semantic Actions are part of the productions.
- For any nonterminal, make the First and the Follow sets.
- Then check that the Continuation sets (director symbols) of each pair of
  expansion rules for the same nonterminal are  disjoint.
  Ambiguous rule pairs and director symbols are shown (procedure testLL1).
- Standard heuristics handle ambiguities such as the "dangling else problem":
  Take the FIRST rule that works while tracking the syntax. For example,
  specify : "IF x THEN s {ELSE s|}" never: "IF x THEN s {|ELSE s}" .
- CheckActions: Check that the semantic actions for a given rule occur only
  after it has been accepted (a nontrivial First symbol has been seen).
  Only trivial actions (e.g. output a statement terminator) that don't alter
  the semantic state database may occur in epsilon-reducible rules.
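
The First/Follow computation and the disjointness test can be sketched in
C on a toy grammar. This is the textbook fixpoint method, offered under
assumptions and not taken from [gramtool]: productions are strings, upper
case letters are nonterminals, the terminals are the four characters in
TERMS, and the first production belongs to the start symbol.

```c
#include <ctype.h>
#include <string.h>

typedef struct { char lhs; const char *rhs; } Prod;

static const char TERMS[] = "ixe$";     /* the toy terminal alphabet */
static unsigned first[26], follow[26];  /* bit sets over TERMS       */
static int nullable[26];

static unsigned bit(char t)
{
    return 1u << (unsigned)(strchr(TERMS, t) - TERMS);
}

/* First set of symbol string s; *eps is set if s can derive nothing. */
static unsigned first_of(const char *s, int *eps)
{
    unsigned set = 0;
    for (; *s != '\0'; s++) {
        if (!isupper((unsigned char)*s)) { *eps = 0; return set | bit(*s); }
        set |= first[*s - 'A'];
        if (!nullable[*s - 'A']) { *eps = 0; return set; }
    }
    *eps = 1;
    return set;
}

/* Iterate until First, Follow and nullable stop changing. */
static void analyse(const Prod *g, int n)
{
    int changed = 1, i;
    memset(first, 0, sizeof first);
    memset(follow, 0, sizeof follow);
    memset(nullable, 0, sizeof nullable);
    follow[g[0].lhs - 'A'] = bit('$');
    while (changed) {
        changed = 0;
        for (i = 0; i < n; i++) {
            int a = g[i].lhs - 'A', eps;
            const char *s;
            unsigned f = first_of(g[i].rhs, &eps);
            if ((first[a] | f) != first[a]) { first[a] |= f; changed = 1; }
            if (eps && !nullable[a]) { nullable[a] = 1; changed = 1; }
            for (s = g[i].rhs; *s != '\0'; s++) {
                if (isupper((unsigned char)*s)) {
                    int b = *s - 'A';
                    unsigned add = first_of(s + 1, &eps);
                    if (eps) add |= follow[a];
                    if ((follow[b] | add) != follow[b]) {
                        follow[b] |= add;
                        changed = 1;
                    }
                }
            }
        }
    }
}

/* Director symbols of one production. */
static unsigned director(const Prod *p)
{
    int eps;
    unsigned d = first_of(p->rhs, &eps);
    return eps ? (d | follow[p->lhs - 'A']) : d;
}

/* Number of rule pairs for the same nonterminal whose director sets
   overlap; 0 means the grammar is LL(1). */
int ll1_conflicts(const Prod *g, int n)
{
    int i, j, count = 0;
    analyse(g, n);
    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++)
            if (g[i].lhs == g[j].lhs && (director(&g[i]) & director(&g[j])))
                count++;
    return count;
}
```

With i standing for "IF x THEN", e for ELSE and x for any other
statement, the grammar  S -> i S E | x,  E -> e S | empty  yields exactly
one conflict: both rules for E have e among their director symbols. That
is the dangling else.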


Samples of Chekgram output, with comments:

Too many "}" in intrFace        Ignore this, it's a bug biting the last one.

    constant       :- FALSE _falsSymb $         this is a primary rule
    posiConst      :- 301 _endConst $           number 301 points to sub-rule
301.posiConst      :- &constId _constRef $      sub-rules expanded
301.posiConst      :- % _intConst $             $ marks end of rule
    paramType      :- BOOLEAN _typeBool $

------- Checking LL(1) properties
Ambiguous rules: &typeId BOOLEAN BYTE CHAR COMP
  DOUBLE EXTENDED FILE INTEGER LONGINT
  POINTER REAL SHORTINT STRING TEXT WORD
363.factor         :- paramType _typeSize $
363.factor         :- expression _getSize $
Ambiguous rules: &fileId
378.inOut          :- &fileId _fileRef , $
378.inOut          :- $
Ambiguous rules: &fileId
382.inOut          :- &fileId _fileRef $
382.inOut          :- varRefer _rdVar $
Ambiguous rules: &fileId
385.inOut          :- &fileId _fileRef , $
385.inOut          :- $
Ambiguous rules: &fileId
389.inOut          :- &fileId _fileRef $
389.inOut          :- format _wrFmt $
Ambiguous rules: ELSE
400.statemt        :- ELSE _elseDo statement $
400.statemt        :- $

   CHEKGRAM finds three kinds of ambiguities in the Turbo Pascal grammar file.
All of them are resolved automatically by the parser, in favour of the first
matching production. Here they are (bottom up):
 * the dangling ELSE, ambiguous for nested if-then-else
 * the optional leading file Id in the Read/Write statements which has
   priority over other Ids in that context (4 cases)
 * type identifiers are ambiguous after SIZEOF: either the expression
   Sizeof(TypeId) is intended - the preferred interpretation, or the TypeId
   is just a typecast operator of another expression: rejected after Sizeof.

------- Checking for suspect triples:
    oneCase        :- constant interval _caseFirst 392
                        : _caseLast statement _caseEnd $ at 5
    statemt        :- BEGIN _beginSymb statement 398 END _endSymb $ at 5
    statemt        :- REPEAT _repLoop statement 401
                        UNTIL expression _endRep $ at 5
406.statemt        :- 407 _caseOther statement 408 END $ at 5
    block          :- BEGIN _blockBegin statement 447 END _blockEnd $ at 5
    externals      :- ?unitId _unitFile ; INTERFACE useList
                        451 IMPLEMENTATION $ at 7
(List of all productions which contain sequences of the form A B c, where
 A and B are null-reducible nonterminals and c is a terminal symbol.
 This was a first step to detect the last Grammar Bug explained below.
 Apparently, there is nothing wrong with these productions and the critical
 ones must be more subtle).

------- Checking for anticipated Actions
Unsafe Action _terminator      in     statement      :- statemt _terminator $
(The _terminator action is triggered even for a void "statemt"!)


 Usefulness of the grammar check:

 Early versions of Grammar5 had a nonobvious LL(1) bug which made Chekgram
 report that the "term+term" construct in "simpExpr" was ambiguous, precisely
 at the + sign!? After playing around with the grammar by leaving out things,
 I found that the bug came from the CASE construct expanding into sequences
 like
     statement { ;| } next-case-label  .
 The signed constant of case labels could be confused with the +term tail of
 an expression of some assignment statement.
 Repair: enforce the semicolon, even before the ELSE section of CASE.
 That was a deviation from Turbo Pascal, however. Finally, the expansion
     oneCase #{; oneCase}
 was adopted, where oneCase is reducible to Null. This allows parasitic ;;;;
 between case clauses: never mind!
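
 The mechanism of that bug can be reproduced on a toy grammar. The sketch
 below (Python, not PCPC's Pascal) computes FIRST and FOLLOW sets and reports
 every (nonterminal, lookahead) pair selected by more than one alternative.
 The grammar in case_grammar is a drastically simplified stand-in invented
 for this example, not an excerpt of Grammar5, and all function names are
 assumptions.

```python
EPS = ""   # marker for the empty string
END = "$"  # end-of-input marker


def first_of_sequence(symbols, first, grammar):
    """FIRST set of a symbol sequence; holds EPS if the sequence can vanish."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in grammar else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def first_sets(grammar):
    first = {head: set() for head in grammar}
    changed = True
    while changed:
        changed = False
        for head, alts in grammar.items():
            for alt in alts:
                new = first_of_sequence(alt, first, grammar)
                if not new <= first[head]:
                    first[head] |= new
                    changed = True
    return first


def follow_sets(grammar, first, start):
    follow = {head: set() for head in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for head, alts in grammar.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in grammar:
                        continue  # terminals have no FOLLOW set
                    rest = first_of_sequence(alt[i + 1:], first, grammar)
                    new = rest - {EPS}
                    if EPS in rest:
                        new |= follow[head]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


def ll1_conflicts(grammar, start):
    """Sorted (nonterminal, lookahead) pairs chosen by two alternatives."""
    first = first_sets(grammar)
    follow = follow_sets(grammar, first, start)
    conflicts = set()
    for head, alts in grammar.items():
        seen = set()
        for alt in alts:
            f = first_of_sequence(alt, first, grammar)
            select = f - {EPS}
            if EPS in f:
                select |= follow[head]
            conflicts |= {(head, t) for t in select & seen}
            seen |= select
    return sorted(conflicts)


def case_grammar(semicolon_optional):
    """Toy CASE grammar: signed labels, assignments with a '+ term' tail."""
    return {
        "caseList": [["oneCase", "caseList"], []],
        "oneCase": [["label", ":", "stmt", "optSemi"]],
        "optSemi": [[";"], []] if semicolon_optional else [[";"]],
        "label": [["NUM"], ["+", "NUM"], ["-", "NUM"]],
        "stmt": [["ID", ":=", "expr"]],
        "expr": [["term", "tail"]],
        "tail": [["+", "term", "tail"], []],
        "term": [["NUM"]],
    }


print(ll1_conflicts(case_grammar(True), "caseList"))   # [('tail', '+')]
print(ll1_conflicts(case_grammar(False), "caseList"))  # []
```

 With the optional semicolon, the "+" of a signed case label lands in the
 FOLLOW set of the expression tail, so the conflict is reported at the + sign
 although its cause sits in the CASE rule. Enforcing the semicolon removes it.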

 There are places where my Grammar5 still is in slight contradiction with the
 documented railroad diagrams of Turbo Pascal. Chekgram cannot know that.
 Currently, Chekgram does report 6 internal inconsistencies in Grammar5.
 These ambiguities comprise Pascal's (=ALGOL's) classic "dangling else" bug,
 and the optional fileId arguments in READ/WRITE operations. Here I tried
 to mix too much of the semantics into the syntax. Maybe there is no precise
 frontier between the two?  Anyway, as long as the first-come-first-served
 disambiguating principle works, take Chekgram complaints as "warnings",
 not "errors".


10. Grammar Bugs:

 Some trouble may arise from the old (now undocumented!) syntax of control
 character constants in Turbo Pascal. Here the border line between the lexical
 analyser and the parser fades away. Consider the Pascal text:
   TYPE ctrl= ^C..^Z;
   TYPE ptrC= ^C;
 The first type is a subrange of characters, the second one is a pointer to
 some other type called C, to be defined later. A parser must keep two
 symbols in mind, the "^" and the "C", before deciding at ".." or ";" what
 "^C" should mean!  The scanner is too dumb to see when it is allowed to
 contract ^C into one token, and the parser, on principle, should not backtrack
 or rely on a two-token lookahead. What to do: introduce an ugly exceptional
 lookahead kludge in the scanner? Write messy syntax rules for the parser?
 No idea. PCPC accepts the historical control character syntax but excludes
 one-letter type names (and thus strains the Pascal syntax a bit!).
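
 One way to read that compromise is as a purely lexical rule: "^" followed by
 exactly one letter becomes a control character, anything else leaves a lone
 "^". The Python sketch below illustrates such a rule. It is an assumption
 about the behaviour, not a transcription of the Pascannr unit; the function
 scan_caret and its token names are hypothetical, and only the letters A..Z
 are handled.

```python
def scan_caret(text, pos):
    """Classify the token starting at text[pos], which must be '^'.

    Returns (kind, value, next_pos). Assumed rule: '^' plus exactly one
    ASCII letter that is not followed by another identifier character is a
    control character constant; otherwise '^' is a lone pointer symbol.
    """
    assert text[pos] == "^"
    letter = text[pos + 1:pos + 2]   # "" at end of input
    after = text[pos + 2:pos + 3]    # "" at end of input
    is_letter = letter.isascii() and letter.isalpha()
    if is_letter and not (after.isalnum() or after == "_"):
        # ^A is chr(1), ^C is chr(3), ..., ^Z is chr(26).
        return ("ctrlChar", chr(ord(letter.upper()) - 64), pos + 2)
    return ("caret", "^", pos + 1)


print(scan_caret("^C..^Z;", 0))  # ('ctrlChar', '\x03', 2)
print(scan_caret("^Cell;", 0))   # ('caret', '^', 1)
print(scan_caret("^C;", 0))      # ('ctrlChar', '\x03', 2)
```

 The last call shows the price of the rule: in "TYPE ptrC= ^C;" the pointer to
 a type named C is scanned as a control character, which is why one-letter
 type names have to be excluded.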

 The Scanner always finds #12 for new identifiers, but the parser tables only
 have pre-selected (typed) new Identifiers. Line 12 of the parsing table is
 artificially filled with the non-null entry from any other new-Id line,
 in procedure [gramtool.modifParsTab].
 Other ad-hoc fill-ins apply to known Ids.
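
 The fill-in can be pictured as a column-by-column copy. The Python sketch
 below is only a guess at the idea, not the code of [gramtool.modifParsTab]:
 the table layout (one list of entries per line, 0 meaning "null"), the line
 numbers 13 and 14 for the typed new-Id lines and the name fill_generic_row
 are all assumptions made for the example.

```python
def fill_generic_row(table, generic_row, typed_rows, null=0):
    """Fill empty entries of the generic new-Id line from the typed lines.

    For each column where the generic line is still null, copy the first
    non-null entry found among the typed new-Id lines, if there is one.
    """
    for col in range(len(table[generic_row])):
        if table[generic_row][col] != null:
            continue
        for row in typed_rows:
            if table[row][col] != null:
                table[generic_row][col] = table[row][col]
                break
    return table


# Hypothetical excerpt: line 12 is the scanner's generic new identifier,
# lines 13 and 14 stand for two typed new-Id lines.
table = {
    12: [0, 0, 0],
    13: [0, 5, 0],
    14: [7, 0, 0],
}
fill_generic_row(table, 12, [13, 14])
print(table[12])  # [7, 5, 0]
```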

 Turbo Pascal constructs PointerFunctionCall(....)^ and typecast(...) are
 allowed to the left of assignments. Top-down parsing implies clumsy
 production rules here. I needed 3 kinds of variable references, to separate
 the  "f:="           function return syntax from the
      "f(x,y)^.z := " -like construction.

 Here are two grammar constructs where ChekGram detects nothing wrong but
 my parsers only work with the first one.
   Good: ( subList #{;subList} {)| [ #{;subList} ]  ) }
   Bad : ( subList #{;subList} {   [ #{;subList} ] |  } )
 Sequences ... #B C d  ( C null-reducible nonterminal, d terminal, here =')')
 seem to generate a parsing error: When the parser exits from B, and if C is
 absent, it does not find d in the input. Where was it swallowed?
 Exercise for the reader. I have not yet found the bug's origin.

----------------------------------------------------------------------------

11. Hints on customizing PCPC:

Any maintenance should be done on the Pascal source code, not on the C version
of PCPC and its utilities. By all means, take care of the self-translation
property of the software and the compatibility with classic non-DOS platforms
and traditional C compilers of the Eighties.

If you add your own code for missing features to the supplied .C library,
do not forget to delete the flag "@" at the end of the standard procedure
declaration in one of the system files. That will deactivate the warning
message for unimplemented functions.

   Some obvious speed improvements could be made if you are willing to give up
failsafe operation under DOS and portability: throw out the {$R+,S+} compiler
option and the pointer bounds check routine, replace the Dispose loops with a
Mark/Release, use big buffers for text file read/write or even
blockread/blockwrite.
I'm too conservative to do that.


12. Work in progress on future version of PCPC:

   In the Cbulk unit, a series of bugs is mentioned in comments. They will be
corrected as soon as possible. Unit Semanti6 needs some improvement on type
compatibility checking.

   Version 1.0 has serious shortcomings with regard to efficiency. After each
translation of a file, it throws away all the accumulated information on the
grammar, the System procedures, and the system units Dos, Crt etc.
For the next translation, it must read and interpret grammar5.txt a second
time, as well as the system units.
It uses a slow recursive parser, a price to pay for the small size: it needs
no tables in memory except for the initial grammar table Gt.

   Version 1.1 will repair some inefficiencies. Grammar and system information
will stay in RAM between translation runs.  Gt will disappear, making room for
the condensed tables of the "medium size" parser which is tested in ChekGram.
The tables will be read in from ready-made files, instead of being
reconstructed over and over from scratch, as still happens in ChekGram.
In summary, the "parser generator" and the "parser" will become two separate
programs. Of course, the package will contain the source code of everything,
and nothing but source code.

   The C code from PCPC's self-translation, as well as the tables produced
by the parser generator, shall always be plain 7-bit ASCII and compatible
with basic Unix-style systems. I'll consider the machine-made C and data files
as public-domain OBJECT code, by contrast to my (Pascal and English) sources.
What I want to achieve is a free Turbo-Pascal compatible "compiler" (i.e.
preprocessor for K&R C "assemblers"), running on Unix, Dos and their clones.
This translator will never compete with the original Turbo Pascal compiler,
produced with aggressive assembler programming to get all that unbeatable
performance out of a PC. My modest goal is to contribute to a wider
portability of the de-facto standard dialect of Pascal.

