You are reading The Rietta Blog, a publication about the web since 2005.

A Little Huffman Coding With Java Tricks

♦
Written by Frank Rietta

A Huffman code is a way to utilize a binary tree to construct a minimal-length encoding for messages where certain characters or groups of characters have known frequencies. The tree used for such an operation called a Huffman tree. Huffman codes are the most efficient compression method for random data and are often found as steps in other compression algorithms such as JPEG and Deflate (ZIP). Building such trees is a very common exercise for Computer Science and Math classes so I can skip the details.

To construct a Huffman tree, first initialize a series of nodes, which are simply data structures holding the represented character and its frequency within the message. The series is sorted by frequency and the two lowest frequency nodes are picked from the list and combined into a composite node, which is given the sum of its child node frequencies as its frequency. The composite node is added back to the sorted list and the process is repeated until there is only one node in the list, which is the head of the Huffman tree.

In Java, the treeset data structure maintains a sorted list of comparable objects and can be used for an implementation of the Huffman code algorithm in the last paragraph. One caveat is that a treeset is a set, which means that duplicate nodes will simply disappear. The work around is to rig the compareTo method, which is implemented by all comparable objects, so that equality is never returned. In this implementation getCount returns the frequency of the node.

12345678

publicintcompareTo(Objecta){HuffmanNodeha=(HuffmanNode)a;if(this.getCount()==ha.getCount()){return-1;}else{returnthis.getCount()-ha.getCount();}}// end compareTo

The treeset is an ideal data structure for helping build the Huffman coding tree because it maintains a sorted list with a “guaranteed log(n) time cost for the basic operations (add, remove and contains).” My construction method for the Huffman tree uses two removes and one add each time nodes are combined so the worst running time is log^{3}(n), which is better than sorting algorithms that take n*log(n) or even n^{2} time.

12345678910111213141516171819

publicintbuildTree(){tsNodes.clear();tsNodes.addAll(htNodes.values());while(tsNodes.size()>1){Iteratorit=tsNodes.iterator();HuffmanNodea=(HuffmanNode)it.next();it.remove();HuffmanNodeb=(HuffmanNode)it.next();it.remove();// Combine node a and node b into a composite nodeHuffmanNodecombinedNode=newHuffmanNode(a,b);tsNodes.add(combinedNode);}// end while tree > 1huffmanTreeRoot=(HuffmanNode)tsNodes.first();// ...}// end buildTree

The compression rates achieved by this simple algorithm are pretty good even though they are not nearly as good as more specific ones in common use, such as JPEG, gzip, b2zip, etc.

A little searching revealed that there are a number of resources about the bzip and bzip2 algorithms.

A Apache-license Java implementation is at: http://www.kohsuke.org/bzip2/

The main project homepage is at: http://www.bzip.org/

There is some good summary information at: http://en.wikipedia.org/wiki/Bzip2

From Wikipedia: “bzip2 uses the Burrows-Wheeler transform to convert frequently recurring character sequences into strings of identical letters, and then applies a move-to-front transform and finally Huffman coding. In bzip2 the blocks are all the same size in plain text…”

Jesse

Thanks man for you good explanation of huffman encoding. I am finding it tricky to do it in java

About Frank Rietta

Frank Rietta is a web application
security architect, author, and speaker. He is a
computer scientist with a Masters in Information Security from the
College of Computing at the Georgia Institute of Technology. He speaks
about security topics and was a contributor to the security chapter of the
7th edition of the "Fundamentals of Database Systems" textbook
published by Addison-Wesley.

A little searching revealed that there are a number of resources about the bzip and bzip2 algorithms.

A Apache-license Java implementation is at:

http://www.kohsuke.org/bzip2/

The main project homepage is at:

http://www.bzip.org/

There is some good summary information at:

http://en.wikipedia.org/wiki/Bzip2

From Wikipedia: “bzip2 uses the Burrows-Wheeler transform to convert frequently recurring character sequences into strings of identical letters, and then applies a move-to-front transform and finally Huffman coding. In bzip2 the blocks are all the same size in plain text…”