CMSC 341 Project 3

Assigned

Wednesday, April 10, 2013

Due

11:59 pm on Sunday, April 28, 2013

Updates

 


Objectives

Background

In today's world, many files are transmitted across the internet including text files, image files, video files, and music files. To send them efficiently and to store them efficiently requires that we compress those files before we send them and then uncompress them when they are received. Tools such as Winzip and gzip do this for us. Music players and video players know how to uncompress their respective files. We are going to investigate one method used for compressing text files.

Many of the files we deal with on a daily basis are text files which contain ASCII characters. Each ASCII character is represented by one byte (8 bits). When we wish to archive many old text files or to send a text file across the internet, we prefer to "compress" the text file. Compressing a text file means encoding the ASCII characters in the file so that not all characters require 8 bits of storage. As a result, the size of the compressed file is smaller than the original text file. Of course, if we compress a text file, we must also be able to uncompress the file to retrieve the original file.

Description

In this project you will be using a priority queue and a binary tree of your design to implement a file compression/uncompression algorithm called "Huffman Coding". Huffman coding is named for David Huffman who developed this algorithm as a PhD student at MIT in 1952. Before you begin, you should read and understand this explanation of Huffman Coding.
Your program will read a text file and compress it using your implementation of the Huffman coding algorithm found in the explanation. The compressed text will be written to a file. That compressed file will be then be read back by your program and uncompressed. The uncompressed text will then be written to a third file. The uncompressed text file should of course match the original text file character for charcter.

Your project will accept the three command line arguments listed below, in the order listed:

Summary of Processing

Read the specified text file and count the frequency of all characters in the file.
Create the Huffman coding tree using a PQ based on the frequencies.
Create the table of encodings for each character from the Huffman coding tree.
Encode the text file and output the encoded/compressed file.
Read the encoded/compressed file you just created, decode it and output the decoded file.

Project Notes

  1. As the Huffman coding document suggests, one way to select the trees with the smallest weights is to use a Priority Queue. For this purpose you can use the java.util.PriorityQueue class.
  2. The compressed file is a stream of bits. You can use the java.io.FileInputStream and java.io.FileOutputStream classes to read and write byte streams, but you'll need to add some logic on top of them to handle reading and writing bits.
  3. You may find it helpful to use a Java class that implements the java.util.Map interface for tracking characters and their frequency.
  4. The Huffman coding tree is based on a binary tree. You may design and implement your own "Huffman Tree" class from scratch, but the Binary Node and Binary Tree classes available from the textbook's author might make a good starting point. You may change the author's BinaryNode/BinaryTree to be specific to this project. I.e. the "Huffman Tree" does not have to be generic. Also note that in a Huffman Tree internal nodes and external nodes (leaves) are different (yet your Huffman Tree can have just one kind of node).
  5. You may assume that the text file does not contain the "^" (carat - above the '6' on your keyboard) character, so this character may be used as the "pseudo-eof" character.
  6. Note that two passes of the original text file are required. The first pass counts the frequency of each character (which are then used to build the Huffman Tree and create the encoding table). The second pass is made to do the actual encoding. However, rather than actually reading the file twice you may find it easier to read the file just once to create a String that holds the entire text file. You may find the Java java.lang.StringBuffer class helpful for this purpose.
  7. Since we will be compressing and uncompressing the file within the same program, you need not write a file header in the compressed file.
  8. The Unix command xxd can be used to display a file in various formats including hex and binary. Check the man pages for more details.
  9. The Unix command diff can be used to compare the contents of text files. Your decoded text file should be an exact copy of the original text file.
  10. It's always a good idea to draw a picture of your data structure(s) and walk through a small data set (e.g. a single word) with paper and pencil before trying to run your program. Test your program with small files and first, then add more complexity to your files as you work out the early bugs and become more confident that your code works. Check out the project functionality testing section for ideas on the type of files to use for your own debugging.
  11. Your program must provide the following required output. See the sample output below.
    • Echo the names of the files from the command line.
    • The table of characters and their frequency counts (the number of times they appear in the text file), sorted alphabetically (i.e. by ASCII character values).
    • The total number of characters in the text file.
    • The table of characters and their bit encoding sorted alphabetically (i.e. by ASCII character values).


Sample Output

This sample output is provided only as a format example. It was not produced by any program and the data in the tables may not be consistent nor even valid.

     linux3[7]% java Project3 file1.txt file2.huff file3.txt
     Compressing "file1.txt" into "file2.huff"
     Uncompressing "file2.huff" into "file3.txt"
     Character Frequencies
     ---------------------
     Space    3
     4        1
     ;        2
     A        5
     C       10
     D        3
     a        7
     c        1
     m        1
     n        2
     
     Total Characters: 35
     
     Character Encodings
     -------------------
     Space   10
     4       11
     ;     0100
     A     0101
     C     0110
     D     0111
     a      000
     c      001
     m     1000
     n     1001


Project Grading

Project grading is described in the Project Policy handout. Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will result in a late penalty.

Your project's functionality will be tested with a variety of text files such as those described below. This list describes only some types of text files that may be used to grade your project and should not be considered all-inclusive.

Submission

You must use CVS to check out your module in the Proj3 repository and to check in your code and your build.xml. Grading scripts will execute the commands ant compile and ant run to compile and execute your project. You shpould be able to re-use a build.xml file from an earlier project.

Check your submittals. Make sure they work. Do this before the due date. As in previous projects, use the CVS utilities to verify that your code was submitted properly and will build and execute properly using the commands ant compile and ant run. See the projects page for more information.