

Old Tools and New Applications:

PKZIP for the AS/400 PART 2: Compression Techniques

 

By Thibault Dambrine

In the last issue of the TUG Magazine we presented the first part of this article, entitled "PKZIP, the Product", which was an introduction to PKZIP for the AS/400, how we came to use it in a concrete situation, and a list of pros and cons.

In this issue we present Part 2, covering several data compression algorithms: how they work and how they are used by PKZIP.

Part 2: Compression Algorithms & Technical Highlights

If you read the previous article, you discovered that data compression is not for the faint of heart, but it certainly is an eye-opener as to the variety of techniques used. The use of probabilities, statistical models and the unsuspected potential of binary trees were most interesting to me. The information you will find in these technical highlights is more general software culture than directly applicable knowledge; but then, what fun would this article be if it did not peek under the covers?

Following this paragraph, you will find definitions of the probabilistic compression methods, the Lempel-Ziv-Welch compression algorithm, the Shannon-Fano coding method and the Huffman coding method. Note that this is not a complete list of available compression methods, simply the ones I found most interesting to read about. Beyond these definitions, you will find a write-up on how PKZIP uses these algorithms when it sends the familiar "Reducing", "Imploding" and "Deflating" messages. Also, as a little trivia: did you know that the PK in PKZIP stands for "Phil Katz", the founder of PKWARE?

 

Data Compression Techniques

As an introduction to the subject of data compression, we can say there are two basic forms of data compression: "lossy" and "lossless". Lossy compression algorithms give up bit-for-bit accuracy for higher compression ratios. Lossy compression is often used to compress video images and sound recordings, where the loss of small amounts of data does not affect the overall quality of the output. A lossless data compression algorithm ensures that no information is lost in the compression/decompression process. Naturally, lossy compression would be completely unacceptable for data processing applications, where every byte is important and losing even a bit could completely alter the meaning of the data. For relevance to this article, it is important to note that PKZIP is a lossless compression product.

1. Probabilistic compression methods

Probabilistic compression works on the principle that it is possible to compress a file by using varying numbers of bits to represent different characters (or character patterns). Frequently occurring characters are assigned the fewest number of bits, and characters that appear infrequently are assigned the greatest number of bits.

For example, suppose the letter A appears 2,000 times in this document but the letter Q only appears 40 times. Stored as 8-bit EBCDIC or ASCII text, the letters A and Q would occupy 2,040 bytes. Assume that the letter A could be represented in 4 bits, while the letter Q must be represented with 12. Now A and Q would account for only 1,060 bytes: 1,000 bytes for the letter A (2,000 characters at 4 bits per character) and 60 bytes for the letter Q (40 characters at 12 bits per character).

This is obviously a more efficient way to store the data and would result in compression ratios in the region of 50%. This is a simplified example of how probabilistic compression works. It does not take into account the fact that a probability table must be attached to the compressed data, partially offsetting any gains achieved.
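The arithmetic above can be checked with a short sketch (Python used here purely for illustration; the 4-bit and 12-bit code lengths are the assumed values from the example, not figures taken from any real encoder):

```python
# Storage cost of the A/Q example: fixed 8-bit characters versus the
# assumed variable-length code (A = 4 bits, Q = 12 bits).
counts = {"A": 2000, "Q": 40}        # occurrences from the example
fixed_bits = {"A": 8, "Q": 8}        # 8-bit EBCDIC or ASCII
variable_bits = {"A": 4, "Q": 12}    # assumed probabilistic code lengths

def total_bytes(bits_per_char):
    """Total storage, in bytes, given a bits-per-character table."""
    return sum(counts[c] * bits_per_char[c] for c in counts) / 8

print(total_bytes(fixed_bits), total_bytes(variable_bits))  # 2040.0 1060.0
```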

2. Lempel-Ziv-Welch Compression Algorithm

The Lempel-Ziv-Welch (LZW) algorithm adapts to a file's actual content, which allows it to produce higher compression ratios. LZW works by building a dictionary of phrases (groups of one or more bytes) from the file. When a new phrase is found, the compression mechanism checks to see if that phrase is already in the dictionary. If not, the phrase is added to the dictionary and a token that identifies the phrase's position in the dictionary is output. If the phrase is already in the dictionary, the compression mechanism simply outputs the token for the existing phrase.
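The dictionary-building idea can be sketched as follows (an illustrative Python fragment, not PKZIP's actual implementation; it assumes single-byte initial phrases and lets the dictionary grow without bound):

```python
def lzw_compress(data):
    """Minimal LZW sketch: emit a dictionary token for the longest known
    phrase, then record (phrase + next character) as a new dictionary entry."""
    dictionary = {chr(i): i for i in range(256)}  # start with 1-byte phrases
    next_code = 256
    phrase = ""
    output = []
    for ch in data:
        candidate = phrase + ch
        if candidate in dictionary:
            phrase = candidate                 # keep extending a known phrase
        else:
            output.append(dictionary[phrase])  # token for the known phrase
            dictionary[candidate] = next_code  # learn the new phrase
            next_code += 1
            phrase = ch
    if phrase:
        output.append(dictionary[phrase])      # flush the final phrase
    return output

tokens = lzw_compress("ABABABAB")
print(tokens)  # [65, 66, 256, 258, 66] -- 5 tokens for 8 characters
```

Notice how the repeating "AB" pattern quickly collapses into dictionary tokens (256, 258); the longer the repetition runs, the fewer tokens per input byte.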

The sliding dictionary LZ method is a derivative of LZW. It works by writing tokens that identify where repeating phrases have previously occurred in the file. Instead of checking the entire file for matching phrases, it uses only a part of the file. The term "sliding dictionary" is used because the algorithm uses fixed-size blocks whose addresses are repeatedly incremented as the file is read.

While the LZW and the Probabilistic compression methods are fairly easy to understand with a simple explanation, the Shannon-Fano and Huffman coding methods do stretch the math cells in your brain a bit more.

3. Shannon-Fano Coding

For a given information source, the best compression rate that we can achieve is the source entropy.

Entropy is a measure of the uncertainty of an information source. In communications, entropy is the statistical measure of the predictable accuracy of a system in transmitting information.
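As an illustration, entropy is computed as H = -sum(p * log2(p)) over the symbol probabilities. A small sketch (Python, for illustration; the five probabilities are those of the Huffman example later in this article):

```python
import math

def entropy(probabilities):
    """Source entropy in bits per symbol: H = -sum(p * log2(p))."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Five-symbol source with probabilities 0.4, 0.2, 0.2, 0.1, 0.1:
h = entropy([0.4, 0.2, 0.2, 0.1, 0.1])
print(round(h, 3))  # 2.122 -- no lossless code can average fewer bits per symbol
```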

The basic idea behind Shannon-Fano coding is using a variable length of bits to encode the source symbols according to their probabilities.

I did not find a definition of what "source symbols" really means in this formulation of Shannon-Fano coding. I suspect it is simply a regular character. In Huffman coding, the "source symbols" are clearly the characters found in the file to compress. The Huffman example, found later in this article, also illustrates very well how the Shannon-Fano tree works.

The Shannon-Fano algorithm is as follows:

  1. Sort the source symbols with their probabilities in a decreasing probability order.
  2. Divide the full set of symbols into two parts such that each part has an equal or approximately equal probability.
  3. Code the symbols in the first part with the bit zero and the symbols in the second part with the bit one.
  4. Go back to Step 2, continue the process recursively for each of two halves until each subdivision contains only one symbol.
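The four steps above can be sketched as follows (an illustrative Python fragment; integer weights stand in for the probabilities to keep the halving comparison exact, and ties are split at the first best cut point — real implementations may break ties differently and so produce different, equally valid codes):

```python
def shannon_fano(symbols):
    """Assign bit strings per the four steps above.
    `symbols` is a list of (symbol, weight) pairs; returns {symbol: codeword}."""
    codes = {}

    def split(group, prefix):
        # Step 4's stopping condition: a lone symbol keeps its accumulated bits.
        if len(group) == 1:
            codes[group[0][0]] = prefix or "0"
            return
        # Step 2: find the cut where the two parts' weights are closest.
        total = sum(w for _, w in group)
        running, cut, best_diff = 0, 1, float("inf")
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(total - 2 * running)
            if diff < best_diff:
                best_diff, cut = diff, i
        # Step 3: the first part gets bit 0, the second part bit 1.
        split(group[:cut], prefix + "0")
        split(group[cut:], prefix + "1")

    # Step 1: sort by decreasing weight.
    split(sorted(symbols, key=lambda sw: -sw[1]), "")
    return codes

# Integer weights proportional to the probabilities 0.4, 0.2, 0.2, 0.1, 0.1.
codes = shannon_fano([("A", 4), ("B", 2), ("C", 2), ("D", 1), ("E", 1)])
print(codes)  # {'A': '0', 'B': '10', 'C': '110', 'D': '1110', 'E': '1111'}
```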

4. Huffman Coding

What are Huffman Codes?

Huffman codes are a subset of a larger family of uniquely decodable variable length codes. Not all variable length codes can be uniquely decoded. Here is a simple example:

Symbols      A    B    C     D

Codewords    0    1    10    11

It is obvious that using this code would produce ambiguous results. For instance, 0110 could be interpreted as ABBA, ADA or ABC.

But, for a code to be useful, it is not sufficient to be uniquely decodable. Let us consider the following example:

Symbols      A    B     C      D

Codewords    0    01    011    111

This code is uniquely decodable, but it is not suitable for any practical use. The reason is that, while (for instance) 001111101111 is uniquely decodable to ACDBD, it cannot be decoded before the message as a whole has been received. If we only had, say, the first five bits of this message, we could have mistaken them for AAD... This type of coding would place a great burden on memory and, as theory proves, would not yield anything in terms of compression efficiency.

This brings us to the fact that for any practical use, a variable length code must be constructed so that a codeword can be recognized by examining only its own bits (the bits used to encode it). This type of code is then termed an instantaneous code. An example of this is the "comma" code:

Symbols      A    B     C      D

Codewords    0    10    110    111

A received message 010111000 can now be decoded (ABDAAA) without previously memorizing it as a whole. Huffman codes are instantaneous, as well as some of the other codes used in data compression (e.g. Shannon-Fano).
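The decoding just described can be sketched as follows (illustrative Python; the codebook is the comma code from the table above):

```python
def decode_prefix_code(bits, codebook):
    """Decode bit by bit: with an instantaneous (prefix) code, a codeword
    is recognized as soon as its final bit arrives -- no lookahead needed."""
    reverse = {code: sym for sym, code in codebook.items()}
    message, buffer = "", ""
    for bit in bits:
        buffer += bit
        if buffer in reverse:          # a complete codeword has arrived
            message += reverse[buffer]
            buffer = ""
    return message

comma_code = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(decode_prefix_code("010111000", comma_code))  # ABDAAA
```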

How do Huffman Codes Work?

A Huffman code is constructed by examining the probabilities of source symbols and assigning appropriate codewords, so that the resulting code has a minimized average code length. It has been proven that Huffman codes have the minimum average code length among the uniquely decodable variable length codes.

The algorithm for creating Huffman codes is as follows:

1. Line up the source symbols according to their probabilities, from highest to lowest.

2. Combine the two least probable symbols into a composite symbol whose probability is equal to the sum of their probabilities.

3. Repeat step 2 until there is a single (composite) symbol with a probability of 1. This is commonly called the root and the structure obtained is called the Huffman tree.

4. Now trace the route from the root of the tree to the leaf corresponding to each symbol, writing down a 0 for left (up) and a 1 for right (down) branch. This will in the end give you a valid Huffman code.
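The four steps can be sketched with a priority queue, which repeatedly surfaces the two least probable entries (illustrative Python; because ties between equal probabilities can be broken in different ways, the codewords produced may differ from those in Table 2, but the average code length, 2.2 bits per symbol, is the same):

```python
import heapq

def huffman_codes(probabilities):
    """Build a Huffman code per steps 1-4: repeatedly combine the two
    least probable (possibly composite) symbols until one root remains."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)   # least probable entry
        p2, _, codes2 = heapq.heappop(heap)   # second least probable
        # Step 4's branch labels: one side prefixed with 0, the other with 1.
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"A": 0.4, "B": 0.2, "C": 0.2, "D": 0.1, "E": 0.1}
codes = huffman_codes(probs)
avg_len = sum(probs[s] * len(codes[s]) for s in probs)  # 2.2 bits per symbol
```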

Constructing a Huffman Code

Have a look at Table 1. For best understanding of this chart, read it from left to right starting from Column 1, and make your way from the leaves to the root of the tree. In (Column 1, Rows 4 & 5), D and E have the two lowest probabilities. They are combined and carried to the next column, yielding DE in (Column 2, Row 4). In (Column 2, Rows 3 & 4), the two lowest leaves of the branch (both in terms of position and probability) are combined to give CDE. CDE now has a probability of 0.4, and it beats the letter B at 0.2 probability. CDE goes up one level in the next column (Column 3, Row 2), and B goes down to (Column 3, Row 3).

In (Column 3, Rows 2 & 3), the two lowest leaves of the branch (both in terms of position and probability) are combined to give BCDE, which is carried to (Column 4, Row 2). At this point, we only have two leaves and we are one step away from the root, so no more combining is necessary. Note that in Column 4, A does not move down a notch because it is still the single character with the highest probability, whereas BCDE's probability is a combined one.

 

Table 1. Building the Huffman tree

         Column 1     Column 2     Column 3       Column 4        Column 5

Row 1    A 0.4        A 0.4        A 0.4          A 0.4 (1)       root 1

Row 2    B 0.2        B 0.2        CDE 0.4 (0)    BCDE 0.6 (0)

Row 3    C 0.2        C 0.2 (0)    B 0.2 (1)

Row 4    D 0.1 (0)    DE 0.2 (1)

Row 5    E 0.1 (1)

 

Table 2. The resulting Huffman codewords for each symbol

Symbols      A    B     C      D       E

Codewords    1    01    000    0010    0011

To figure out the resulting value of each Huffman code (it took me a while), start from the root and follow the path to each letter. A appears only once, with (1). B appears in Column 4 with (0) and in Column 3 with (1), and thus gets the value 01. C appears in Column 4 with (0), Column 3 with (0) and Column 2 with (0), and thus gets the value 000. You can figure out the rest.

This is where the compressing ability comes from: in this sample data, it was determined that there would be more "A" letters than any other character. The tree algorithm therefore gave it a "1" codeword, the smallest possible space available, at one bit. The letter deemed least likely to appear, "E", gets the largest codeword: "0011". The net result is as follows: if, for example, the text sample contained a large proportion of the letter "A", each occurrence would occupy only one bit. Less frequently occurring characters like "E" occupy more space per occurrence, but since they occur less often, they still add little to the total.
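This gain can be quantified by weighting each codeword's length by its symbol's probability (a quick check, with the probabilities and codewords taken from Tables 1 and 2):

```python
# Average code length of the Table 2 code, weighting each codeword's
# length by its symbol's probability from Table 1.
probabilities = {"A": 0.4, "B": 0.2, "C": 0.2, "D": 0.1, "E": 0.1}
codewords = {"A": "1", "B": "01", "C": "000", "D": "0010", "E": "0011"}

avg_bits = sum(probabilities[s] * len(codewords[s]) for s in probabilities)
print(round(avg_bits, 2))  # 2.2 bits per symbol, versus 3 for a fixed-length code
```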

Please note at this point that it is possible to create various (different) Huffman codes from one set of source symbols and their probabilities. Each one of these codes will have the same average code length. If we need a unique way to construct a Huffman code, we can define a canonical Huffman code.

Extending the Huffman Concept

There are several extensions to the basic Huffman coding explained above. Adaptive Huffman coding enables dynamically changing the code as the probabilities of input symbols change - this is much more flexible than static (standard) Huffman coding.

Another improvement is extended Huffman code. In this coding scheme, we encode sequences of input symbols rather than the symbols themselves, to obtain better coding efficiency, i.e. a shorter average code length.

PKZIP's usage of Probabilistic, Lempel-Ziv-Welch, Shannon-Fano and Huffman Coding Methods:

The following excerpt comes from Technical Support Magazine. It shows which methods are used behind the scenes when PKZIP issues messages such as "Reducing", "Imploding" and "Deflating" as it compresses a file. It is apparent that combining different compression methods produces better results.

The Reducing Algorithm:

The Reducing algorithm is actually a combination of two distinct algorithms. The first algorithm compresses repeated byte sequences. The second algorithm takes the compressed stream from the first and applies a probabilistic compression method.

The Imploding algorithm:

The Imploding algorithm is also a combination of two distinct algorithms. The first algorithm compresses repeated byte sequences using a sliding dictionary (as explained in the LZW section earlier). The second algorithm is used to compress the encoding of the sliding dictionary output, using multiple Shannon-Fano trees.

The Deflating Algorithm:

The Deflate algorithm is similar to the Implode algorithm, using a sliding dictionary of up to 32 KB with secondary compression from Huffman/Shannon-Fano codes. The compressed data is stored in blocks, with a header describing the block and the Huffman codes used in the data block.

In Conclusion

As a closing word, the algorithms shown here barely scratch the surface of data compression technology. Other methods used are "Arithmetic Coding", "Pulse Code Modulation", "Differential Pulse Code Modulation", "Run Length Coding" and more. This is a huge subject for anyone who is interested and there are a number of web sites dedicated to this field.

Data compression is an obscure science that does not get a lot of visibility in our everyday programming lives. While only a few of us will ever apply these techniques in our own code, this whole area is nonetheless a fascinating one to explore.

Demand for data compression on digital voice and data networks is exploding. With data mining technology, plain old data has suddenly become much more valuable. Increasingly large quantities of data are stored for mining, trend analysis and decision support purposes. In this context, I suspect data compression will be used more and more for all types of applications from storage to data transmission. While most of us will probably not invent the next Huffman Coding rule, most of us did not write the OS/400 operating system either. My point here is, it may be an advantage to understand how things work and what our options are for data compression. We may have to use it eventually.

Credits, Disclaimers and Other Definitions:

The explanations of the Shannon-Fano and Huffman coding algorithms have been taken from the World Wide Web at http://www.rasip.fer.hr/research/compress/index.html .

ASCENT SOLUTIONS Inc. can be found on the World Wide Web at http://www.asizip.com , where I have also found valuable information used in this article.

Technical Support Magazine can be reached at (414) 768-8000.

Other articles on FTP and other AS/400-related subjects can be found at http://www.tug.on.ca , the TUG web site. Neither the author nor his employer has any links or partnerships of any kind with ASCENT SOLUTIONS Inc.

The author thanks the individuals who took time to proofread this paper before it went to press.
