                                        DRAFT

                                  GEDCOM PROCESSOR
                                    Version 0.52

                                    Dec. 12, 1990

General Information.
 
This document describes the GEDCOM processor, a package of utility functions in the
GEDCOM library.  The GEDCOM processor is designed to simplify the processing of
GEDCOM files.  Readers should have programming experience in C plus a basic understanding
of GEDCOM and the GEDCOM library. 
 
WARNING: This release is experimental.  Future versions may not be compatible with
code written using this release.
 
This processor includes a library of semantic routines that check the integrity of GEDCOM
data, e.g., by verifying that an individual's birth date precedes the death date. Feedback,
suggestions, and additions are needed for this library to fulfill its purpose.  Please contact: 
 
      Bill Harten 
      GEDCOM Developers Group 
      50 East North Temple Street 
      Salt Lake City, UT 84150 
 
      Phone (801) 240-5225 

Features of the GEDCOM processor.

     Checks for invalid tags, including tags that are not in the right context.

     As a GEDCOM file is read, the GEDCOM processor will call user routines to process
      the data.  The routines called can easily be changed by changing a grammar file.  This
      can eliminate recompilation when changing tags.

Using the functions of the GEDCOM processor.

You must include the GEDCOM library and processor's header files at the top of your
program:

      #include "gedcom.h"  /* Must be included first */
      #include "gedproc.h"

The basic command line compiler/linker commands to build your file containing the GEDCOM
processor and the other GEDCOM library functions are as follows:

      Microsoft C:
            
            cl /AL /Gcs myfile.c gedproc.lib gedcom.lib

      CYBER C:

            cv2 i=:NVE.fhdlam.process.myfile_c b=myfile_obj da=dt ol=debug
            list=myfile_lst..
                  dm=('CYBER')
            creol
            addm :NVE.fhdlam.process.gedcom_lib
            addm :NVE.fhdlam.process.gedproc_obj
            addm :NVE.fhdlam.process.myfile_obj
            genl l=myfile
            quit
            attf myfile sm=execute

Portability

This code is intended to be portable.  Testing has so far been performed on an IBM PC
compatible, compiled with Microsoft C 6.0 and on a CYBER 992 under NOS/VE.

To compile on a CYBER, define CYBER when compiling.  If you have a compiler that does not
support the ANSI standard, define NON_ANSI when compiling.

Overview

The purpose of the GEDCOM processor is to visit each record and line in a GEDCOM data
file, matching tags in a context-sensitive fashion and calling corresponding user-defined
functions to process the information.

Input tags are matched with tags in a grammar file.  The grammar file is a GEDCOM file
containing a specialized GEDCOM grammar record for each type of input record that is to be
processed.  The GEDCOM grammar record is fully populated with all expected tags, all placed
in their proper context as they would appear in the data, as shown below in the example
grammar.

      Example Grammar File - Two Types of Records (indentation added).

      0 INDI
        1 - preFunc
        1 + postFunc 1 3
        1 NAME
        1 BIRT
          2 DATE Comments are allowed in the value field.
            3 + func2
            3 - func1
            3 + func3
        1 DEAT
          2 DATE
          2 PLAC  
      0 FAM
        1 HUSB
        1 WIFE
        1 CHIL

The above grammar instructs gedproc to call two user-defined functions each time the tag
"INDI" is scanned in the GEDCOM data. The two instruction lines that call these functions
must immediately follow the "INDI" tag and must be one level below it (level 1 under the
level 0 INDI line).  Notice the tag and value fields of these two instruction lines.  The value
field begins with the name of the function that is called. The plus (+) and minus (-) signs in
the tag field determine when the function is called. The minus sign instructs gedproc to call
the function immediately after scanning "INDI".  The plus sign tells gedproc not to call the
function until after all the children of "INDI" have been scanned. This pre- and
post-processing ability allows initialization before processing subordinate lines and analysis
afterward.
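
As a rough illustration of the instruction-line convention described above, the following
sketch splits a grammar line into its level, its kind, and the function name.  It is not gedproc
source.  The names NOT_INSTRUCTION, CALL_PRE, CALL_POST, and classify_instruction are
hypothetical, and the sketch assumes fields are separated by white space.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical names for illustration; not part of gedproc. */
#define NOT_INSTRUCTION 0
#define CALL_PRE        1   /* "-": call before subordinate lines */
#define CALL_POST       2   /* "+": call after subordinate lines  */

/* Splits a grammar line such as "1 + postFunc 1 3" into its level,
 * its kind, and the function name.  name must have room for at
 * least 32 bytes; it is left empty for ordinary tag lines. */
int classify_instruction(const char *line, int *level, char *name)
{
    char tag[32];
    int fields;

    *level = 0;
    name[0] = '\0';
    fields = sscanf(line, "%d %31s %31s", level, tag, name);
    if (fields < 3) {
        name[0] = '\0';
        return NOT_INSTRUCTION;      /* e.g. "1 NAME" */
    }
    if (strcmp(tag, "-") == 0)
        return CALL_PRE;
    if (strcmp(tag, "+") == 0)
        return CALL_POST;
    name[0] = '\0';
    return NOT_INSTRUCTION;          /* e.g. "2 DATE Comments ..." */
}
```

For the line "1 + postFunc 1 3" the sketch returns CALL_POST, sets the level to 1, and copies
"postFunc" into name.  The remaining text ("1 3") is left for the user function to parse.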

Two parameters are passed to your functions: a NODE pointer and a byte pointer whose
contents are user-defined.  The second instruction line illustrates the byte pointer.  Here, a
pointer to the beginning of the text "postFunc 1 3" is passed to the user-defined function,
which may then parse the text as required.  The NODE pointer points to the INDI tag in the
data tree.  For the instruction lines under the DATE tag, shown above, the NODE pointer
points to the DATE tag in the data tree.

Context sensitivity means that tags must match within the context of the superior tag under
which they occur, all the way out to the root (level zero) of the record.  In this hypothetical
example, a data record containing 

      0 INDI
        .
        .
        1 BIRT
          .
          .
          2 PLAC

(path INDI BIRT PLAC) would not match this grammar, because PLAC is not a valid tag in
the INDI BIRT context of the grammar.  The path INDI DEAT PLAC would match.  The
occurrence of an unexpected tag is not necessarily an error, but it is a condition to be detected
and reported.
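
The context check described above can be sketched as a walk down a small tree.  This is an
illustration only: GNODE and path_matches are hypothetical names, the real NODE type is
defined in gedcom.h, and the tree below copies only the INDI record of the example grammar
without its instruction lines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical grammar node; the real NODE type is in gedcom.h. */
typedef struct gnode {
    const char *tag;
    struct gnode *child;    /* first subordinate tag */
    struct gnode *sibling;  /* next tag at the same level */
} GNODE;

/* Returns 1 if the tags in path[0..n-1] can be followed downward
 * from the given list of level-zero grammar records, 0 otherwise. */
int path_matches(const GNODE *level, const char *const path[], int n)
{
    int i;

    for (i = 0; i < n; i++) {
        while (level != NULL && strcmp(level->tag, path[i]) != 0)
            level = level->sibling;
        if (level == NULL)
            return 0;           /* tag not valid in this context */
        level = level->child;   /* descend into the matched context */
    }
    return 1;
}

/* The INDI record of the example grammar. */
static GNODE g_birt_date = { "DATE", NULL, NULL };
static GNODE g_deat_plac = { "PLAC", NULL, NULL };
static GNODE g_deat_date = { "DATE", NULL, &g_deat_plac };
static GNODE g_deat = { "DEAT", &g_deat_date, NULL };
static GNODE g_birt = { "BIRT", &g_birt_date, &g_deat };
static GNODE g_name = { "NAME", NULL, &g_birt };
static GNODE g_indi = { "INDI", &g_name, NULL };
```

With this tree, the path INDI BIRT PLAC is rejected because the only tag under INDI BIRT is
DATE, while INDI DEAT PLAC is accepted.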

Each tag combination appears only once in a grammar record.  For example, the combination
FAM CHIL appears only once in the grammar, though multiple children may appear in the
data.  By contrast, the same tag may appear more than once in different contexts, such as INDI
BIRT DATE and INDI DEAT DATE.  These are different kinds of DATEs, and need to be
processed differently.

The processing approach is as follows.  At initialization, the grammar and data files are opened.
The grammar records are then read into memory using ged_read_tree and combined into a
single grammar tree in internal form.  Then, in the outermost processing loop, the
ged_read_tree function is used to read and convert each data record (beginning at level zero)
into an internal tree.  The system then traverses the data tree, one node (line) at a time.  It
traverses in parallel the grammar tree, matching the tag in the data to the tag in the grammar
within the corresponding context.

As each tag is matched, gedproc calls the user-defined functions named by the grammar's
instruction lines immediately following the tag, as explained in the above example.
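
The order in which the minus (-) and plus (+) functions run during the traversal can be
sketched as follows.  This shows only the call order, not the parallel matching against the
grammar tree.  DNODE, record, walk, and trace_example are hypothetical names, and the real
NODE type is defined in gedcom.h.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical data node; the real NODE type is in gedcom.h. */
typedef struct dnode {
    const char *tag;
    struct dnode *child;    /* first subordinate line */
    struct dnode *sibling;  /* next line at the same level */
} DNODE;

/* Appends a sign and the node's tag to out, e.g. "-INDI ". */
static void record(char *out, const char *sign, const DNODE *node)
{
    strcat(out, sign);
    strcat(out, node->tag);
    strcat(out, " ");
}

/* Visits node: the "-" action runs first, then every subordinate
 * line is visited, then the "+" action runs. */
void walk(const DNODE *node, char *out)
{
    const DNODE *c;

    record(out, "-", node);
    for (c = node->child; c != NULL; c = c->sibling)
        walk(c, out);
    record(out, "+", node);
}

/* Builds INDI with subordinate BIRT (containing DATE) and DEAT,
 * then walks it.  out must be large enough for the trace. */
void trace_example(char *out)
{
    DNODE date = { "DATE", NULL, NULL };
    DNODE deat = { "DEAT", NULL, NULL };
    DNODE birt = { "BIRT", NULL, NULL };
    DNODE indi = { "INDI", NULL, NULL };

    birt.child = &date;
    birt.sibling = &deat;
    indi.child = &birt;
    out[0] = '\0';
    walk(&indi, out);
}
```

The trace produced by trace_example is "-INDI -BIRT -DATE +DATE +BIRT -DEAT +DEAT +INDI ",
so the plus action for INDI runs only after every subordinate line has been visited.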

Errors are reported to another user-defined function, which must be named ged_error and
declared as int ged_error(char *error_msg, NODE *data_context, size_t status).

Note
      An attempt to modify the GEDCOM data tree will result in unpredictable behavior.
      Copy the data tree with ged_copy_tree and modify the copy.

      Some GEDCOM library routines use the memory pool that was last opened by
      GED_ALLOC_POOL.  To avoid loss of data, user-defined functions called from
      ged_process should use ged_get_pool to save the pointer to the pool that is used by
      ged_process. ged_set_pool should then be called to cause subsequent memory allocations
      to be drawn from the user's pool.

                          The GEDCOM Processor Functions

ged_process                                                              - gedproc.c


Summary

#include "gedproc.h"

int ged_process(data, grammar)
      char *data, *grammar;


Description

The ged_process function accepts two parameters, a pointer to the grammar file and a pointer
to the GEDCOM data file.  ged_process loads the grammar file into memory, then reads each
data tree in turn and calls the traverse function to process it.

Return Value

If a function called through the grammar returns an integer from 100 through 199,
ged_process immediately stops processing the current tree and begins processing the next data
tree. If it returns an integer greater than 199, ged_process immediately stops processing and
returns the value of that integer.  If only values less than 100 are returned, ged_process
returns END_OF_FILE as described under ged_read_tree in the GEDCOM library documentation.
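
The three ranges above can be summarized in a small helper.  This is a sketch only: the enum
and the function classify_return are hypothetical names, and only the numeric ranges come
from the text.

```c
/* Hypothetical names; only the numeric ranges come from the text. */
enum action { CONTINUE_TREE, SKIP_TREE, STOP_PROCESSING };

/* Maps a user function's return value to what ged_process does next. */
enum action classify_return(int code)
{
    if (code > 199)
        return STOP_PROCESSING;   /* ged_process returns this value */
    if (code >= 100)
        return SKIP_TREE;         /* go on to the next data tree */
    return CONTINUE_TREE;         /* keep processing the current tree */
}
```

A function that returns DFLT_ERR (99), as the appendix functions do, falls in the last range
and lets processing continue.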

Example

See the appendix for an example.

load_grammar                                                             - gedproc.c


Summary

#include "gedproc.h"

NODE *load_grammar(fg)
    FILE *fg;

Description

The load_grammar function accepts as a parameter a pointer to your grammar file.  You can
load as many grammars into memory as you wish.  load_grammar is normally accessed through
the ged_process function.

Return Value

The load_grammar function returns a NODE pointer to the top of the grammar tree.  A parent
node containing the tag "grammar" is always added to the top of this tree.  This extra node
is created to facilitate the traverse function but is transparent to you.  If an error occurs while
loading the grammar, error status 4 is sent to ged_error.

Example

See the appendix for an example.

traverse                                                                 - gedproc.c


Summary

#include "gedproc.h"

int traverse(dataCtxt, grammarCtxt)
    NODE *dataCtxt, *grammarCtxt;

Description

The traverse function traverses the grammar and data trees, checking for valid tags and calling
their respective functions.
 
Return Value

If a user-defined function (including ged_error) returns an integer greater than 99,
traverse immediately stops processing and returns that value; otherwise END_OF_FILE is
returned as described under ged_read_tree in the GEDCOM library documentation.
 
Example

NODE *load_tree();  /* defined below */

int my_ged_process(fg, ft)
    FILE *fg, *ft;
    {
    POOL *dataPool, *grammarPool;
    NODE *grammar, *data;
    short status;
    int errStat = DFLT_ERR; /* Equals 99 */

    /* Create a pool for the grammar and load the grammar into it */
    if((grammarPool = ged_set_pool(ged_create_pool(2500,
           sizeof(NODE)))) == (POOL *)NULL)
        ged_error("Could not create a grammar pool.\n", (NODE *)NULL, 1);
    grammar = load_grammar(fg); /* you may load more than one grammar */

    /* Create a new pool for the GEDCOM data */
    if((dataPool = ged_set_pool(ged_create_pool(60000, sizeof(NODE))))
               == (POOL *)NULL)
        ged_error("Could not create a data pool.\n", (NODE *)NULL, 2);

    /* Read and traverse one data tree per iteration */
    for(;;)
        {
        if(((data = load_tree(ft, &status, 60000)) == (NODE *)NULL)
                || (status == END_OF_FILE))
            break;
        if((errStat = traverse(data, grammar)) >= 200)
            break;              /* a user function asked to stop */
        GED_RESET_POOL();       /* reuse the data pool for the next tree */
        if(status == LAST_RECORD)
            break;              /* status is tested only after it is set */
        }
    GED_DESTROY_POOL();         /* destroy the data pool */
    ged_set_pool(grammarPool);
    GED_DESTROY_POOL();         /* destroy the grammar pool */

    /* Per the Return Value text: a stop code is returned as is */
    return((errStat > 199) ? errStat : (int)status);
    }

NODE *load_tree(fg, status, limit) /* allocates space and loads a tree */
    FILE *fg;
    short *status;
    unsigned long limit;
    {
    static char *dataBuf;
    static NODE *node;

    /* Allocate the buffer once, then read one tree into it */
    if((!(dataBuf = GED_ALLOC_POOL(limit))) || ((node =
        ged_read_tree(fg, dataBuf, limit, status)) == (NODE *)NULL) ||
        (*status == REC_TOO_LONG))
        {
        ged_error("Cannot read node from file.\n", (NODE *)NULL, 4);
        return((NODE *)NULL);
        }
    return(node);
    }

This function duplicates the ged_process function.  It loads the grammar, calls load_tree to load
in a tree from the GEDCOM data, and then calls traverse.  If ged_process is not flexible
enough, customize the above code to fit your needs.

ged_define_semantic                                                      - gedproc.c


Summary

#include "gedproc.h"

int ged_define_semantic(funcName, funcPointer)
    byte *funcName;
    int (*funcPointer)(NODE *, byte *);
   
Description

The ged_define_semantic function registers a user-defined semantic function under the name
by which the grammar file's instruction lines refer to it.
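
One way such a registration table could work is sketched below.  This is an assumption for
illustration, not the gedproc implementation: SEMANTIC, define_semantic, find_semantic,
sample_pre, and MAX_SEMANTICS are hypothetical names, and void * stands in for NODE *.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SEMANTICS 64    /* arbitrary limit for this sketch */

/* void * stands in for NODE *, which is defined in gedcom.h. */
typedef int (*SEMANTIC)(void *node, char *line);

static struct {
    const char *name;       /* pointer is stored, not copied */
    SEMANTIC func;
} sem_table[MAX_SEMANTICS];
static int sem_count = 0;

/* Registers func under name.  Returns 0 on success, -1 if full. */
int define_semantic(const char *name, SEMANTIC func)
{
    if (sem_count >= MAX_SEMANTICS)
        return -1;
    sem_table[sem_count].name = name;
    sem_table[sem_count].func = func;
    sem_count++;
    return 0;
}

/* Returns the function registered under name, or NULL if none. */
SEMANTIC find_semantic(const char *name)
{
    int i;

    for (i = 0; i < sem_count; i++)
        if (strcmp(sem_table[i].name, name) == 0)
            return sem_table[i].func;
    return NULL;
}

/* Sample semantic function that only returns 99. */
int sample_pre(void *node, char *line)
{
    (void)node;
    (void)line;
    return 99;
}
```

After define_semantic("pre", sample_pre), a grammar line whose value field begins with "pre"
could be resolved with find_semantic("pre").  A name that was never registered yields NULL.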
 
Return Value

An integer corresponding to an error code, to be determined in a later version.

Example

See the appendix for an example.

                                      Appendix


The following example will process a GEDCOM file called BIG.DAT, using a grammar called
GRAMMAR.DAT:

#ifdef CYBER
#define CDECL
#include "gedcom_h"  /* Must be included first */
#include "gedproc_h"
void error_path();
int CDECL main();
int pre();
int post();
void setupFunc();
#else
#include "gedcom.h"  /* Must be included first */
#include "gedproc.h"
#define CDECL cdecl  /* Microsoft keyword used to cancel pascal calling convention */
void error_path(NODE *node);
int CDECL main(void);
int pre(NODE *, byte *);
int post(NODE *, byte *);
void setupFunc(void);
#endif

/*********************************************************************
 *
 * NAME: pre
 *
 * DESCRIPTION:  Example of a user-defined function that is called in
 *               connection with the following sample grammar:
 *
 * SAMPLE GRAMMAR:
 *                0 HEAD
 *                1 - pre 
 *                1 + post
 *                0 TRLR
 *
 ******************************************************************/
int pre(node, line)
NODE *node;
byte *line;
{
      printf("pre\n");
      return(DFLT_ERR); /* returns 99 */
}

int post(node, line) /* Example of a user-defined function */
NODE *node;
byte *line;
{
      printf("post\n");
      return(DFLT_ERR); /* returns 99 */
}
/*********************************************************************
 *
 * NAME: ged_error
 *
 * DESCRIPTION:  prints out an error message.
 *
 * STATUS CODES:  The status parameter is set by GEDPROC as follows:
 *
 *                1 Could not create a pool for the grammar.
 *                2 Could not create a pool for the GEDCOM data.
 *                3 Tag not found.  The node of this tag is sent.  You
 *                  can use error_path to display the path to this bad
 *                  tag.
 *                4 Could not read a tree from a file--either a
 *                  grammar file or a GEDCOM file.
 *
 *                This example passes status 0 for its own errors.
 *
 ******************************************************************/
int ged_error(msg, dataCtxt, status)
char *msg;
NODE *dataCtxt;
unsigned short status;
{
  if(msg)
       printf("%s", msg);      /* message supplied by the caller */
  if(dataCtxt)
      {
        error_path(dataCtxt);  /* show the path to the offending tag */
        printf("\n");
      }
  return(DFLT_ERR); /* default value */
}
/*********************************************************************
 * NAME: error_path
 *
 * DESCRIPTION: Prints the tags leading from the tree top to the error
 *              tag.  node is the line with the bad tag. This function
 *              is optional.
 ********************************************************************/
void error_path(node)
    NODE *node;
   {
    if (node->parent)
          error_path(node->parent);
    else
        {
          printf("%s",ged_get_line(node));
          return;
        }
    printf("-->%s",ged_get_line(node));
    }
/*********************************************************************
 *
 * NAME:  setupFunc
 * 
 * DESCRIPTION:  Registers the user-defined functions
 *
 *******************************************************************/
 
void setupFunc()
{
      ged_define_semantic("pre",pre);
      ged_define_semantic("post",post);
}

int CDECL main()
{
    FILE *fg, *ft;

    /* Open the grammar file */
#ifdef CYBER
    if((fg = fopen("grammar_dat", "r")) == (FILE *)NULL)
#else
    if((fg = fopen("grammar.dat", "r")) == (FILE *)NULL)
#endif
        {
        ged_error("Could not open the grammar file for reading.\n",
              (NODE *)NULL, 0);
        return(1);           /* do not continue without a grammar */
        }

    /* Open the data file */
#ifdef CYBER
    if((ft = fopen("big_dat", "r")) == (FILE *)NULL)
#else
    if((ft = fopen("big.dat", "r")) == (FILE *)NULL)
#endif
        {
        ged_error("Could not open the data file for reading.\n",
              (NODE *)NULL, 0);
        return(1);           /* do not continue without data */
        }

    setupFunc();         /* Register user functions */
    ged_process(fg, ft); /* Process GEDCOM data */
    printf("\nend of program.\n");
    return(0);
}
