Sunday, June 05, 2011

Dissecting the GZIP format

April 24, 2011
In this article I describe the DEFLATE algorithm that GZIP implements and depends on. The DEFLATE algorithm uses a combination of LZ77, Huffman codes and run-length-encoding; this article describes each in detail by walking through an example and developing source code to implement the algorithm. My aim is to implement readable rather than efficient or extensible code. I'll focus here on unzipping, rather than zipping, but by the end of the article, the zipping process should be clear.
Material intended for human consumption tends to be highly redundant, from an information processing perspective. Because of the mysterious human mind, we need to have the same thing repeated to us in different ways in order for it to be processed correctly. Natural language, for example, is inefficient - the English language phrase "I am going to the store" could easily be abbreviated "I go store". If the subject of conversation has already been established as "me", it could be further abbreviated unambiguously as "go store". This is how children communicate until they are taught proper grammar - but as we mature, we like/want/need the redundancy.

Computers are no better when it comes to demanding redundancy. Computer programs repeat the same instructions over and over again, and program source code is verbose, even when the programmer isn't concerned about using readable variable names (you know who you are). Computer storage, on the other hand, is a scarce resource. It becomes less and less scarce every day, but every time capacity increases, we seem to find a way to run out of it. I'm sure that someday, my children will be complaining that their thumbnail computer devices can "only store four holodeck simulations!"

Computer scientists have been studying ways to use computer storage more efficiently since the 1960s. In 1977, Abraham Lempel and Jacob Ziv published "A Universal Algorithm for Sequential Data Compression " [1] which described what is now referred to as the "LZ77" compression algorithm. Conceptually, LZ77 is pretty straightforward - read the source document, and, for each sequence of bytes encountered, search backward through the document to see if the same sequence occurs previously. If so, rather than outputting the sequence, output a back pointer to the first occurrence of the sequence.

To get a sense of just how redundant English text is, take a look at figure 1 and figure 2. Figure 1 contains the first few verses of the Bible's book of Genesis and figure 2 shows all of the places where text from a prior point in the document is repeated. LZ77 does a very good job of exploiting that redundancy and making efficient use of available storage.