
Disclaimer
----------

Although PKWARE will attempt to supply current and accurate information
relating to its file formats, algorithms, and the subject programs, the
possibility of error cannot be eliminated.  PKWARE therefore expressly
disclaims any warranty that the information contained in the associated
materials relating to the subject programs and/or the format of the files 
created or accessed by the subject programs and/or the algorithms used by 
the subject programs, or any other matter, is current, correct or accurate 
as delivered.  Any risk of damage due to any possible inaccurate 
information is assumed by the user of the information.  Furthermore, the 
information relating to the subject programs and/or the file formats 
created or accessed by the subject programs and/or the algorithms used by 
the subject programs is subject to change without notice.


General Format of a ZIP file
----------------------------

  Files stored in arbitrary order.  Large zipfiles can span multiple
  diskette media.

  Overall zipfile format:

    [local file header+file data] . . .
    [central directory] end of central directory record


  A.  Local file header:
  
	local file header signature	4 bytes  (0x04034b50)
	version needed to extract	2 bytes
	general purpose bit flag	2 bytes
	compression method		2 bytes
	last mod file time 		2 bytes
	last mod file date		2 bytes
	crc-32   			4 bytes
	compressed size			4 bytes
	uncompressed size		4 bytes
	filename length			2 bytes
	extra field length		2 bytes

	filename (variable size)
	extra field (variable size)
      

  B.  Central directory structure:

      [file header] . . .  end of central dir record

      File header:

	central file header signature	4 bytes  (0x02014b50)
	version made by			2 bytes
	version needed to extract	2 bytes
	general purpose bit flag	2 bytes
	compression method		2 bytes
	last mod file time 		2 bytes
	last mod file date		2 bytes
	crc-32   			4 bytes
	compressed size			4 bytes
	uncompressed size		4 bytes
	filename length			2 bytes
	extra field length		2 bytes
	file comment length		2 bytes
	disk number start		2 bytes
	internal file attributes	2 bytes
	external file attributes	4 bytes
	relative offset of local header	4 bytes

	filename (variable size)
	extra field (variable size)
	file comment (variable size)

      End of central dir record:

	end of central dir signature	4 bytes  (0x06054b50)
	number of this disk		2 bytes
	number of the disk with the
	start of the central directory	2 bytes
	total number of entries in
	the central dir on this disk	2 bytes
	total number of entries in
	the central dir			2 bytes
	size of the central directory   4 bytes
	offset of start of central
	directory with respect to
	the starting disk number	4 bytes
	zipfile comment length		2 bytes
	zipfile comment (variable size)
      



  C.  Explanation of fields:

      version made by
      
	  The upper byte indicates the host system (OS) for the
	  file.  Software can use this information to determine
	  the line record format for text files etc.  The current
	  mappings are:
	  
	  0 - IBM (MS-DOS)	1 - Amiga	2 - VMS
	  3 - *nix		4 thru 255 - unused
	  
	  The lower byte indicates the version number of the 
	  software used to encode the file.  The value/10 
	  indicates the major version number, and the value 
	  mod 10 is the minor version number.

      version needed to extract
      
	  The minimum software version needed to extract the 
	  file, mapped as above.

      general purpose bit flag:

	  The lowest bit, if set, indicates that the file is 
	  encrypted.  The upper three bits are reserved and 
	  used internally by the software when processing the 
	  zipfile.  The remaining bits are unused in version 
	  1.0.

      compression method:
      
	  (see accompanying documentation for algorithm 
	  descriptions)
      
	  0 - The file is stored (no compression)
	  1 - The file is Shrunk
	  2 - The file is Reduced with compression factor 1
	  3 - The file is Reduced with compression factor 2
	  4 - The file is Reduced with compression factor 3
	  5 - The file is Reduced with compression factor 4

      date and time fields:

	  The date and time are encoded in standard MS-DOS 
	  format.

      CRC-32:
      
	  The CRC-32 algorithm was generously contributed by 
	  David Schwaderer and can be found in his excellent 
	  book "C Programmers Guide to NetBIOS" published by
	  Howard W. Sams & Co. Inc.  The 'magic number' for 
	  the CRC is 0xdebb20e3.  The proper CRC pre and post 
	  conditioning is used, meaning that the CRC register 
	  is pre-conditioned with all ones (a starting value 
	  of 0xffffffff) and the value is post-conditioned by 
	  taking the one's complement of the CRC residual.
	  
      compressed size:
      uncompressed size:

	  The size of the file compressed and uncompressed, 
	  respectively.
      
      filename length:
      extra field length:
      file comment length:

	  The length of the filename, extra field, and comment 
	  fields respectively.  The combined length of any
	  directory record and these three fields should not
	  generally exceed 65,535 bytes.

      disk number start:

	  The number of the disk on which this file begins.

      internal file attributes:

	  The lowest bit of this field indicates, if set, that 
	  the file is apparently an ASCII or text file.  If not
	  set, that the file apparently contains binary data.
	  The remaining bits are unused in version 1.0.

      external file attributes:

	  The mapping of the external attributes is 
	  host-system dependent (see 'version made by').  For 
	  MS-DOS, the low order byte is the MS-DOS directory 
	  attribute byte.

      relative offset of local header:

	  This is the offset from the start of the first disk on
	  which this file appears, to where the local header should
	  be found.

      filename:

	  The name of the file, with optional relative path.  
	  The path stored should not contain a drive or 
	  device letter, or a leading slash.  All slashes 
	  should be forward slashes '/' as opposed to 
	  backwards slashes '\' for compatibility with Amiga
	  and Unix file systems etc.

      extra field:

	  This is for future expansion.  If additional information
	  needs to be stored in the future, it should be stored
	  here.  Earlier versions of the software can then safely
	  skip this file, and find the next file or header.  This
	  field will be 0 length in version 1.0.

      file comment:

	  The comment for this file.


      number of this disk:
      
	  The number of this disk, which contains central 
	  directory end record.
    
      number of the disk with the start of the central directory:
      
	  The number of the disk on which the central 
	  directory starts.

      total number of entries in the central dir on this disk:
      
	  The number of central directory entries on this disk.
	  
      total number of entries in the central dir:
      
	  The total number of files in the zipfile.
      

      size of the central directory:
      
	  The size (in bytes) of the entire central directory.

      offset of start of central directory with respect to
      the starting disk number:
      
	  Offset of the start of the central direcory on the 
	  disk on which the central directory starts.
      
      zipfile comment length:
      
	  The length of the comment for this zipfile.
      
      zipfile comment:
      
	  The comment for this zipfile.


  D.  General notes:

      1)  All fields unless otherwise noted are unsigned and stored
	  in Intel low-byte:high-byte, low-word:high-word order.

      2)  String fields are not null terminated, since the
	  length is given explicitly.

      3)  Local headers should not span disk boundries.  Also, even
	  though the central directory can span disk boundries, no
	  single record in the central directory should be split
	  across disks.

      4)  The entries in the central directory may not necessarily
	  be in the same order that files appear in the zipfile.


UnShrinking
-----------

Shrinking is a Dynamic Ziv-Lempel-Welch compression algorithm 
with partial clearing.  The initial code size is 9 bits, and 
the maximum code size is 13 bits.  Shrinking differs from 
conventional Dynamic Ziv-lempel-Welch implementations in several 
respects:

1)  The code size is controlled by the compressor, and is not 
    automatically increased when codes larger than the current 
    code size are created (but not necessarily used).  When 
    the decompressor encounters the code sequence 256 
    (decimal) followed by 1, it should increase the code size 
    read from the input stream to the next bit size.  No 
    blocking of the codes is performed, so the next code at 
    the increased size should be read from the input stream 
    immediately after where the previous code at the smaller 
    bit size was read.  Again, the decompressor should not 
    increase the code size used until the sequence 256,1 is 
    encountered.

2)  When the table becomes full, total clearing is not 
    performed.  Rather, when the compresser emits the code 
    sequence 256,2 (decimal), the decompressor should clear 
    all leaf nodes from the Ziv-Lempel tree, and continue to 
    use the current code size.  The nodes that are cleared 
    from the Ziv-Lempel tree are then re-used, with the lowest 
    code value re-used first, and the highest code value 
    re-used last.  The compressor can emit the sequence 256,2
    at any time.



Expanding
---------

The Reducing algorithm is actually a combination of two 
distinct algorithms.  The first algorithm compresses repeated 
byte sequences, and the second algorithm takes the compressed
stream from the first algorithm and applies a probabilistic 
compression method.  

The probabilistic compression stores an array of 'follower 
sets' S(j), for j=0 to 255, corresponding to each possible 
ASCII character.  Each set contains between 0 and 32 
characters, to be denoted as S(j)[0],...,S(j)[m], where m<32.  
The sets are stored at the beginning of the data area for a 
Reduced file, in reverse order, with S(255) first, and S(0) 
last.  

The sets are encoded as { N(j), S(j)[0],...,S(j)[N(j)-1] }, 
where N(j) is the size of set S(j).  N(j) can be 0, in which 
case the follower set for S(j) is empty.  Each N(j) value is 
encoded in 6 bits, followed by N(j) eight bit character values 
corresponding to S(j)[0] to S(j)[N(j)-1] respectively.  If 
N(j) is 0, then no values for S(j) are stored, and the value 
for N(j-1) immediately follows.

Immediately after the follower sets, is the compressed data 
stream.  The compressed data stream can be interpreted for the 
probabilistic decompression as follows:


let Last-Character <- 0.
loop until done
    if the follower set S(Last-Character) is empty then
	read 8 bits from the input stream, and copy this
	value to the output stream.
    otherwise if the follower set S(Last-Character) is non-empty then
	read 1 bit from the input stream.
	if this bit is not zero then
	    read 8 bits from the input stream, and copy this
	    value to the output stream.
	otherwise if this bit is zero then
	    read B(N(Last-Character)) bits from the input 
	    stream, and assign this value to I.
	    Copy the value of S(Last-Character)[I] to the 
	    output stream.
	
    assign the last value placed on the output stream to 
    Last-Character.
end loop


B(N(j)) is defined as the minimal number of bits required to 
encode the value N(j)-1.


The decompressed stream from above can then be expanded to 
re-create the original file as follows:


let State <- 0.

loop until done
    read 8 bits from the input stream into C.
    case State of
	0:  if C is not equal to DLE (144 decimal) then
		copy C to the output stream.
	    otherwise if C is equal to DLE then
		let State <- 1.

	1:  if C is non-zero then
		let V <- C.
		let Len <- L(V)
		let State <- F(Len).
	    otherwise if C is zero then
		copy the value 144 (decimal) to the output stream.
		let State <- 0

	2:  let Len <- Len + C
	    let State <- 3.
    
	3:  move backwards D(V,C) bytes in the output stream 
	    (if this position is before the start of the output 
	    stream, then assume that all the data before the 
	    start of the output stream is filled with zeros).
	    copy Len+3 bytes from this position to the output stream.
	    let State <- 0.
    end case
end loop


The functions F,L, and D are dependent on the 'compression 
factor' (see FORMAT.DOC), 1 through 4, and are defined as follows:

For compression factor 1:
    L(X) equals the lower 7 bits of X.
    F(X) equals 2 if X equals 127 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 1 bit of X) * 256 + Y + 1.
For compression factor 2:
    L(X) equals the lower 6 bits of X.
    F(X) equals 2 if X equals 63 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 2 bits of X) * 256 + Y + 1.
For compression factor 3:
    L(X) equals the lower 5 bits of X.
    F(X) equals 2 if X equals 31 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 3 bits of X) * 256 + Y + 1.
For compression factor 4:
    L(X) equals the lower 4 bits of X.
    F(X) equals 2 if X equals 15 otherwise F(X) equals 3.
    D(X,Y) equals the (upper 4 bits of X) * 256 + Y + 1.


