The simplicity of the Lempel-Ziv-Walsh (LZW) compression algorithm is only surpassed by its power and versatility. Made popular by programs such as the shareware programs like ARC (by System Enhancement Associates) and by PKARC (by Phil Katz), it is easily implemented as well. The LZW algorithm is a table based one. Each entry in the table contains two fields; the first field is a pointer to some other entry in the table, the second field is a character. The first 256 entries in the table is initially populated with each character field set to the ASCII representation of that entries offset in the table: the first entry is set to zero, the next to 1, and so on. The 65th entry in the table, for example, would have the character field set to 'A', the ASCII representation of which is 65. As will be seen later, the contents of the pointer field for the first 256 entries in the table need not be initialized at all, although it is typically set to some illegal pointer value. For naming convenience, think of the pointer field being called the "code" field, and the character field being called the "suffix" field. If a file contains the sequence "ABC" repeated ad infinitum, an examination of the resulting operation shows how the LZW algorithm works. A couple of rules, first: every unique sequence of characters (a sequence can be defined as consisting of at least two characters) read from the file will result in the generation of a new entry in the table. And, each sequence of characters will generate a single code to be output to the resulting compressed file. The first byte read is a special case. Read the first byte from the file, the initial 'A'. It exists in the table (the 65th entry), so output the 12-bit representation of that entry's offset. Not very good compression so far: 12 bits of output for 8 bits of input. It gets better soon, though. Read the next byte in from the file (the 'B'). The sequence of characters 'AB' are now looked up in the table, and is found not to exist. It gets added to the table as the 256th entry (the table starts at 0). The character sequence 'AB' can be considered as the code sequence for the 'A' (entry number 65) followed by the suffix character 'B': the table entry is a 65B. Output the table entry for the suffix character, a 12-bit representation of the 66th entry in the table. Two 12 bit entries have been output to represent 16 bits of input so far. Continue the process for the next character, entering the representation for 'BC' (66C) into the table as the 257th entry, and output the 12-bit entry for the 'C'. 36-bits of output for 24 bits of input. Next character read from the file gives thee 258th entry in the table as 67A and outputs the 12-bit 'C' entry, number 67. Read the next byte, and get the sequence 'AB'. When we look it up on the table, we find it as the 256 entry. Output the 12-bit code for that entry, read the next character in and generate a new table entry if the 256C sequence does not already exist in the table (it doesn't): entry #259 is set to 256C. And the first benefit of the compression is realized: a single 12-bit code was output for a 16-bit sequence. Read the next byte and the sequence stands at 'CA', found as the 258th entry. Generate the 12-bit code for output, read the next character in and create a table entry as 258B. Again, 12-bits of output represent 16-bits of input. Reading in the next character gives the 'BC' sequence (entry #257), reading another byte gives the '257A' sequence. 257 gets output, table entry 261 is set to 257A. Read a byte, get 'AB', entry 256. Read a byte, get '256C', entry 259. Read a byte, get 259A. It doesn't exist in the table, so it gets added as entry 262, and the 12-bit code of 259 if output. But, 259 represents the sequence 'ABC', and we've only output 12-bits to represent a 24 bit entry. Reviewing the output so far, here's what's been output for the processed string of 'ABCABCABC' (9 bytes * 8 bits = 72 bits): 65 66 67 256 258 257 259 = (7 codes * 12 bits) = 84 bits One more run through gets breakeven, though: The last character read was a 'C'. Read a byte, get 'CA', entry 258. Read a byte get 258B, entry 260. Read a byte get 260C, a new entry on the table (number 262) and output a 260, representing the next three bytes in the file. On the input side, 72 + 24 bits. On the output side, 84 + 12. Parity has been reached, at 96 bits input and output. Compression at last: read a byte, sequence 'CA' exists as entry #258. Read a byte, and look for sequence 258B. It exists as entry 260, so read a byte and look for sequence 260C. It doesn't exist, gets added to the table, and the 12-bit 260 is generated for the 24 bit input sequence of 'CAB'. Now we're cooking, with (96 + 24) on the input side of the equation and (96 + 12) on the output side of the equation. 256 - 65 B 257 - 66 C 258 - 67 A 259 - 256 C 260 - 258 B 261 - 257 A This process continues until the input file is exhausted of its data. With a repetitive file such as the one above, compression ratios of more than 70% are not uncommon. So, then, how does the decompressor work? Like the compressor, the initial table is set up and the first byte is read. In the above instance, this would be a 65. With only one exception (discussed below) every code read from the input file will already be on the table. Like in the compressor, a special case is made for the first code received: it is remembered as part of the next entry in the table. Output the character value for the 65th entry on the table: the letter 'A gets output. When you get a code, here's what happens: first, look the code up on the table. If found, then recursively read the table, saving each suffix character on a stack. When a code is less than 256, save the suffix on the stack, then output the stack in reverse (last in, first out) order. So, the "hold" code is a 65, the next code read is a 66. It is found on the table, the suffix character is saved on the stack, then (because the code is less than 256) the recursion loop is terminated and the stack contents are out: the letter 'B' is output. The combination 65B is now added to the table of the decompressor as the 256th entry. The last code (66) is remembered and the next code read in. A search is made on the table for the 66th entry. It is found, and output of 'B' is made, and the 66C entry is made to the table. Notice how the table generated in the decompressor is exactly the same as the table generated in the compressor. Yet, the only thing output by the compressor ad read by in the decompressor, the only thing they have in common is the compressed code! That, and the initial table, of course. Reading in the next code, gives the sequence of 67A, which outputs the 'C' and makes the next entry in the table equal to 67A. The next code read is 256which is found on the table and recursively generates the 'AB' sequence. And so on. Each code is read, is looked up on the table and then recursively generates a stack of characters, which is then output in the reverse order.