The basic idea is borrowed from an older and slightly less efficient method called [Shannon-Fano coding]?.
The text to be compressed is considered as a string of symbols. Symbols that are likely to be frequent are represented by short sequences of bits, and symbols that are likely to be rare are represented by longer sequences of bits.
Huffman coding uses a specific method for choosing the representation of each symbol. It has been proven that Huffman coding is the most effective compression method of this type: no other mapping of individual source symbols to strings of bits will produce a smaller output when the actual symbol frequencies agree with those used to create the code. Its output reaches the theoretical entropy limit exactly when the probabilities of the input symbols are negative powers of two (1/2, 1/4, 1/8, and so on). Arithmetic coding produces slight gains over Huffman coding, but in practice these gains have not been large enough to offset arithmetic coding's higher computational complexity and patent royalties (as of November 2001, IBM owns patents on the core concepts of arithmetic coding in several jurisdictions).
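As a small illustration of the negative-powers-of-two case, consider a hypothetical alphabet of four symbols with probabilities 1/2, 1/4, 1/8 and 1/8; the symbols, code lengths and Python below are illustrative assumptions, not taken from any particular implementation:

 from math import log2

 # Four symbols whose probabilities are negative powers of two,
 # together with the code lengths a Huffman tree assigns to them.
 probs   = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
 lengths = {"a": 1,   "b": 2,    "c": 3,     "d": 3}    # e.g. codes 0, 10, 110, 111

 entropy    = sum(-p * log2(p) for p in probs.values())   # 1.75 bits per symbol
 avg_length = sum(probs[s] * lengths[s] for s in probs)   # 1.75 bits per symbol

 print(entropy, avg_length)   # equal: the code meets the entropy bound exactly

When the probabilities are not negative powers of two, the average code length exceeds the entropy slightly, which is where arithmetic coding can do better.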
Huffman coding works by creating a binary tree of symbols:

1. Start with as many single-node trees as there are symbols, each weighted by its symbol's frequency.
2. While more than one tree remains, remove the two trees of lowest total frequency and join them as the left and right children of a new node whose frequency is the sum of the two.
3. When a single tree remains, read off each symbol's code by walking from the root to that symbol's leaf, emitting a 0 for each left branch and a 1 for each right branch.
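The following is a minimal sketch of this procedure in Python; the function name huffman_code, the heapq-based priority queue and the example frequencies are illustrative choices, not part of the definition:

 import heapq
 from itertools import count

 def huffman_code(freqs):
     """Build a Huffman code from a {symbol: frequency} mapping."""
     tiebreak = count()  # unique counter so heap entries never compare nodes
     # Step 1: one single-node tree per symbol, weighted by its frequency.
     heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
     heapq.heapify(heap)
     # Step 2: repeatedly merge the two least-frequent trees until one remains.
     while len(heap) > 1:
         f1, _, left  = heapq.heappop(heap)
         f2, _, right = heapq.heappop(heap)
         heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
     # Step 3: walk the finished tree; left branch = '0', right branch = '1'.
     codes = {}
     def walk(node, prefix):
         if isinstance(node, tuple):          # internal node
             walk(node[0], prefix + "0")
             walk(node[1], prefix + "1")
         else:                                # leaf: a symbol
             codes[node] = prefix or "0"      # single-symbol edge case
         return codes
     return walk(heap[0][2], "")

 # Frequent symbols receive short codes, rare symbols receive long ones.
 print(huffman_code({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))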
There are variations. The frequencies used can be generic ones for the application domain, based on average experience, or they can be the actual frequencies found in the text being compressed. (The latter variation requires that a [frequency table]? or other hint as to the encoding be stored with the compressed text; implementations employ various tricks to store these tables efficiently.) A variation called "adaptive Huffman coding" calculates the frequencies dynamically, based on recent actual frequencies in the source string.
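A minimal sketch of the actual-frequency variation, reusing the hypothetical huffman_code function from the sketch above; the compress function and the JSON header format are assumptions made purely for illustration:

 import json
 from collections import Counter

 def compress(text):
     freqs = Counter(text)                      # actual frequencies of this text
     codes = huffman_code(freqs)                # from the sketch above
     bits = "".join(codes[ch] for ch in text)   # the encoded bit string
     # The frequency table must travel with the compressed bits so the
     # decoder can rebuild the identical tree.
     header = json.dumps(freqs)
     return header, bits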
Huffman coding today is often used as a "back-end" to some other compression method: deflation? (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 use a front-end model and quantization stage followed by Huffman coding.