. CMSC 341 Fall 2008 - Project 5

CMSC 341 Fall 2008
Project 5

Assigned Monday, Nov 25, 2008
Due 11:59pm, Tuesday Dec 9, 2008
Updates

Objectives

Background

In today's world, many files are transmitted across the internet including text files, image files, video files, and music files. To send them efficiently and to store them efficiently requires that we compress those files before we send them and then uncompress them when they are received. Tools such as Winzip and gzip do this for us. Music players and video players know how to uncompress their respective files. We are going to investigate one method used for compressing text files.

Many of the files we deal with on a daily basis are text files which contain ASCII characters. Each ASCII character is represented by one byte (8 bits). When we wish to archive many old text files or to send a text file across the internet, we prefer to "compress" the text file. Compressing a text file means encoding the ASCII characters in the file so that not all characters require 8 bits of storage. As a result, the size of the compressed file is smaller than the original text file. Of course, if we compress a text file, we must also be able to uncompress the file to retrieve the original file.


Description

In this project you will be using a priority queue and a binary tree of your design to implement a file compression/uncompression algorithm called "Huffman Coding". Huffman coding is named for David Huffman who developed this algorithm as a PhD student at MIT in 1952.

Before you begin, you should read and understand the explanation of Huffman Coding.

Your program will read a text file and compress it using your implementation of the Huffman coding algorithm found in the explanation. The compressed text will be written to a file. That compressed file will be then be read back by your program and uncompressed. The uncompressed text will then be written to a third file. The uncompressed text file should of course match the original text file character for charcter.

Your project will accept the three command line arguments listed below, in the order listed:

  1. The name of the text file to be compressed.
  2. The name of the compressed file (created from the compression of the text file). If this file already exists, it should be deleted.
  3. The name of the uncompressed text file (created from uncompressing the compressed file). If this file already exists, it should be deleted.

Project Notes, Hints, and Requirements

  1. As the Huffman coding document suggests, one way to select the trees with the smallest weights is to use a Priority Queue. For this purpose you can use the java.util.PriorityQueue class.
  2. The compressed file is a stream of bits. You can use the java.io.FileInputStream and java.io.FileOutputStream classes to read and write byte streams, but you'll need to add some logic on top of them to handle reading and writing bits.
  3. The Huffman coding tree is based on a binary tree. You may design and implement your own "Huffman Tree" class from scratch, but the Binary Node and Binary Tree classes available from the textbook's author might make a good starting point. You may change the author's BinaryNode/BinaryTree to be specific to this project. A "Huffman Tree" does not have to be generic. Also note that in a Huffman Tree internal nodes and external nodes (leaves) are different (yet your Huffman Tree can have just one kind of node).
  4. You may assume that the text file does not contain the "^" (carat - above the '6' on your keyboard) character, so this character may be used as the "pseudo-eof" character.
  5. Note that two passes of the text file are required. The first pass counts the frequency of each character (which are then used to build the Huffman Tree and create encoding table). The second pass is made to do the actual encoding.
  6. Since we will be compressing and uncompressing the file within the same program, you need not write a file header in the compressed file.
  7. The Unix command od can be used to display a file in octal, hex, etc. Check the man pages for more details.
  8. The Unix command diff can be used to compare the contents of text files.
  9. Your program must provide the following required output. See the sample output below.
    1. Echo the names of the files from the command line
    2. The table of characters and their frequency counts (the number of times they appear in the text file), sorted alphabetically (i.e. by ASCII character values)
    3. The total number of characters in the text file.
    4. The table of characters and their bit encoding sorted alphabetically (i.e. by ASCII character values)

Sample Output

This sample output is provided only as a format example. It is not produced by any program and the data in the tables may not be consistent nor even valid. linux3[7]%Proj4 file1.txt file2.huff file3.txt Compressing "file1.txt" into "file2.huff" Uncompressing "file2.huff" into "file3.txt" Character Frequencies --------------------- Space 3 4 1 ; 2 A 5 C 10 D 3 a 7 c 1 m 1 n 2 Total Characters: 35 Character Encodings ------------------- Space 10 4 11 ; 0100 A 0101 C 0110 D 0111 a 000 c 001 m 1000 n 1001

Submission

You must use CVS to check out your module in the Proj5 repository and to check in your code. That must include an ant build.xml file and javadoc. The grading scripts will issue commands like the following, so be sure that your build.xml supports them:
  ant
  ant doc
  ant -Dargs="arguments go here" run

See the projects page for more information on all of these topics.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.


Project grading is described in the Project Policy handout.
Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will not be accepted. Do not submit any files after that time.