
A wordlist data base contains the following records:

0. header record
1. unmap table for converting pseudochars back to chars
2. decoder offset table, for selecting huffman decoders
3. huffman decoder tables
4. index table giving first char of first and last words in each chunk
5 -> end. compressed dictionary chunks

/**********************************************************************/

data types

typedef struct header_rec
{
  short n_chunks;             /* total chunks in dictionary */
  long  n_words;              /* total words in dictionary */
};

unsigned char unmap_table [n_uniq_chars];

short decoder_offset_table [n_uniq_chars];  

unsigned char huffman_decoder_tables [total_size_of_tables];

unsigned char index_table [2*n_chunks]; 

typedef struct chunk_rec
{
  short chunk_len; /* number of bytes in stripped but uncompressed chunk */
  short chunk_words; /* number of words in chunk */
  unsigned char chunk_contents [compressed_chunk_len];  
};


/**********************************************************************/

Of course, the above declarations contain almost no information about
what goes into the various buffers and tables. Here is a procedural
description of their contents.

At a very high level, here is how you turn a dictionary into a
wordlist database.

1. Sort the list of words.
2. Strip word prefixes.
3. Copy stripped words into chunk buffers.
4. Compress these buffers using multiple huffman codes.

Simple enough.  Unfortunately, there are many details. 

Before starting, here are some important restrictions.  Step 2
introduces characters 0-15 and 25, so the initial words must not
contain them.  The huffman codes of step 4 can only represent 85
different characters, so the initial wordlist can only contain about
68 different characters.  The program is 8-bit clean, though, so crazy
accent characters are just fine.

Step 1. Sort the list of words.

   Nothing to say here except that step 2 won't achieve any 
   compression unless the list really is sorted.  Intermingling 
   upper and lower case would break the sorted order.
   Just use the unix sort program.

step 2. Strip common prefixes from beginnings of words.

Starting at the beginning of each word, count how many characters
match the corresponding character in the previous word.  Strip off the
common characters and replace them with the character whose ascii code
is the match count.  The maximum allowed count is 15.  There is one
other rule.  Because of sorted order, when a character doesn't match
it is often the succeeding character. So, if the first character after
the final match is equal to 1 plus the corresponding character in the
previous word, we replace it with character 25.

Here is an example.  

aftermath
afternoon -> 5 25 o o n

Step 3. Divide the dictionary into chunks.

It is time to copy contiguous subsets of the dictionary into "stripped
chunk buffers".  We have stripped and unstripped versions of each
word.  For the first word in each chunk we use the unstripped version,
preceded by an ascii 0. For the rest we use the stripped version.  The
ascii 0-15 characters delimit the words, so there is no need for
additional delimiters except at the very end of the buffer where we
add an ascii 0.  The stripped chunk buffers should be as close as
possible to 5k bytes without exceeding that limit.

We can now create record 4 of the pdb file.  It is an array of
unsigned chars.  Entry 2*i is the first character of the first word in
chunk i, entry 2*i+1 is the first character of the last word in chunk i.

Step 4. Compress the stripped chunk buffers using multiple huffman codes.

First of all, our representation for huffman decoding tables imposes a
limit of 85 on the number of different characters that can be encoded.
Before doing the huffman coding, we map all characters to
"pseudochars" in the range 0-84. (In the decompression program we will
map them back to their correct values using the table which is record
1 of the pdb file.)

The key idea of our scheme is to use a different huffman code for each
context, where a context is defined by the character (actually
pseudochar) occuring in the previous position.  While this is good
because different contexts have different statistics, it is bad
because of the extra decoding tables.  Therefore, we create a globally
applicable huffman code, and only use a context-specific code when it
saves enough bits to pay for its decoding table.

During decompression, we use the table which is record 2 of the pdb
file to find the decoding table for each context.  In some cases this
table will be the global table, and in other cases it will be a
context-specific table.  Note that pseudochar values are used as
indices into this table.

A huffman decoding table is actually a tree that is represented by an
array of unsigned chars.  Each node is a position in the array.  The
root is position 0.  If a node's value is >= 170, it is a leaf that
can be converted to a pseudochar by subtracting 170.  Otherwise it is
an interior node whose left child (selected by a 0) is the next
position in the array and whose right child (selected by a 1) is
indicated by the node's value.  Record 3 of the pdb file is a simple
concatentation of all the huffman decoding tables for the dictionary.

Note that all of the statistics and huffman codes apply to the entire
dictionary, not just to a single chunk.


----------------------------------------------------------------
