CMSC 341 Fall 2005
Project 4

Assigned Wednesday, Nov 9, 2005
Due 11:59pm, Sunday Nov 27, 2005

This project was originally scheduled for completion on Wednesday, Nov 23rd. In anticipation of your request, and to give you as much time as possible to complete this challenging project, we have moved the due date to Sunday Nov 27th (the Sunday after Thanksgiving). Note that while help may be available via email and on Blackboard after Nov 23rd, no instructor or TA will be on campus to assist you. We recommend that you complete the project by the 23rd so that you can enjoy your holiday.

Updates

Objectives

Background

In today's world, many files are transmitted across the internet including text files, image files, video files, and music files. To send them efficiently and to store them efficiently requires that we compress those files before we send them and then uncompress them when they are received. Tools such as Winzip and gzip do this for us. Music players and video players know how to uncompress their respective files. We are going to investigate one method used for compressing text files.

Many of the files we deal with on a daily basis are text files which contain ASCII characters. Each ASCII character is represented by one byte (8 bits). When we wish to archive many old text files or to send a text file across the internet, we prefer to "compress" the text file. Compressing a text file means encoding the ASCII characters in the file so that not all characters require 8 bits of storage. As a result, the size of the compressed file is smaller than the original text file. Of course, if we compress a text file, we must also be able to uncompress the file to retrieve the original file.


Description

In this project you will be using a priority queue and a binary tree of your design to implement a file compression/uncompression algorithm called "Huffman Coding". Huffman coding is named for David Huffman who developed this algorithm as a PhD student at MIT in 1952.

Before you begin, you should read and understand the explanation of Huffman Coding.

Your program will read a text file and compress it using your implementation of the Huffman coding algorithm found in the explanation. The compressed text will be written to a file. That compressed file will be then be read back by your program and uncompressed. The uncompressed text will then be written to a third file. The uncompressed text file should of course match the original text file character for charcter.

Mr. Frey's public directory for this project is

/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/Proj4

Your project will accept the three command line arguments listed below, in the order listed

  1. The name of the text file to be compressed.
  2. The name of the compressed file (created from the compression of the text file). If this file already exists, it should be deleted.
  3. The name of the uncompressed text file (created from uncompressing the compressed file). If this file already exists, it should be deleted.

Project Notes, Hints, and Requirements

  1. As the Huffman coding document suggests, one way to select the trees with the smallest weights is to use a Priority Queue. Your program must use the author's Binary Heap implementation of the Priority Queue for this part of the project. The author's files BinaryHeap.C and BinaryHeap.H are found in Mr. Frey's public directory. DO NOT copy them into your directory. As in earlier projects, your makefile should reference them.
  2. The compressed file is a stream of bits. Unfortunately C++ does not supply a "bit stream" class. We are therefore providing you with functions which can read/write bits to a file. These functions are found in the files, BitIO.cpp and BitIO.h are also found in Mr. Frey's public directory. Be sure to read the documentation in BitIO.h that explains how to use these functions. Note that these functions perform "buffered" input/output. In particular, be sure you call FlushBits( ) to force any buffered bits to actually be written to the file. Once again, your makefile should reference these files. There is no need to copy them to your directory.
  3. The Huffman coding tree is based on a binary tree. You may design and implement your own "Huffman Tree" class from scratch, but the author's Binary Node class and Binary Tree class from the class notes might make a good starting point. You may change the author's BinaryNode/BinaryTree to be specific to this project. A "Huffman Tree" does not have to be a template. Also note that in a Huffman Tree internal nodes and external nodes (leaves) are different (yet your Huffman Tree can have just one kind of node).
  4. You may assume that the text file does not contain the "^" (carat - above the '6' on your keyboard) character, so this character may be used as the "pseudo-eof" character.
  5. Note that two passes of the text file are required. The first pass counts the frequency of each character (which are then used to build the Huffman Tree and create encoding table). The second pass is made to do the actual encoding.
  6. Since we will be compressing and uncompressing the file within the same program, you need not write a file header in the compressed file.
  7. The Unix command od can be used to display a file in octal, hex, etc. Check the man pages for more details.
  8. The Unix command diff can be used to compare the contents of text files.
  9. Your program must provide the following required output. See the sample output below.
    1. Echo the names of the files from the command line
    2. The table of characters and their frequency counts (the number of times they appear in the text file), sorted alphabetically (i.e. by ASCII character values)
    3. The total number of characters in the text file.
    4. The table of characters and their bit encoding sorted alphabetically (i.e. by ASCII character values)

Sample Output

This sample output is provided only as a format example. It is not produced by any program and the data in the tables may not be consistent nor even valid. linux3[7]%Proj4 file1.txt file2.huff file3.txt Compressing "file1.txt" into "file2.huff" Uncompressing "file2.huff" into "file3.txt" Character Frequencies --------------------- Space 3 4 1 ; 2 A 5 C 10 D 3 a 7 c 1 m 1 n 2 Total Characters: 35 Character Encodings ------------------- Space 10 4 11 ; 0100 A 0101 C 0110 D 0111 a 000 c 001 m 1000 n 1001

Files To Be Submitted

Submit any files which you have written or modified -- source code and makefile.
DO NOT submit any unchanged code provided by the author (e.g. the author's BinaryHeap code). The grading scripts will remove these files from your submittal directory.

Be sure to submit the answers to the project questions (341-Fall05-p4_questions.txt found in Mr. Frey's public directory) in plain text format.

Submit the files using the procedure given to you for your section of the course.
If your makefile is set up correctly, you should be able to execute the command make submit.

Submission Tools
There are a number of tools available to you to check on your submittal. It is your responsibility to ensure that the submittal is correct and will result in a successful compilation of your project. Do not wait till the last minute to submit your files. Give yourself enough time to check that your submittal is correct.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.

Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/ . One of the tools provided by the submit program is
submitls. It lists the names of the files you have submitted.

Additionally, there are two programs for use only by CMSC-341 students (not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/ and are named submitmake and submitrun. You can use these programs to make or run your submitted projects.

The syntax is similar to that for submit:

submitmake <class> <project>

Example:  submitmake cs341 Proj4 <parameter list>

Submission Tools
There are a number of tools available to you to check on your submittal. It is your responsibility to ensure that the submittal is correct and will result in a successful compilation of your project. Do not wait till the last minute to submit your files. Give yourself enough time to check that your submittal is correct.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.

Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/ . One of the tools provided by the submit program is
submitls. It lists the names of the files you have submitted.

Additionally, there are two programs for use only by CMSC-341 students (not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/ and are named submitmake and submitrun. You can use these programs to make or run your submitted projects.

The syntax is similar to that for submit:

submitmake <class> <project>

Example:  submitmake cs341 Proj4 <parameter list>

Submission Tools
There are a number of tools available to you to check on your submittal. It is your responsibility to ensure that the submittal is correct and will result in a successful compilation of your project. Do not wait till the last minute to submit your files. Give yourself enough time to check that your submittal is correct.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.

Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/ . One of the tools provided by the submit program is
submitls. It lists the names of the files you have submitted.

Additionally, there are two programs for use only by CMSC-341 students (not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/ and are named submitmake and submitrun. You can use these programs to make or run your submitted projects.

The syntax is similar to that for submit:

submitmake <class> <project>

Example:  submitmake cs341 Proj4

This makes the project, and shows you the report from the make utility. It cleans up the directory after making the project (removes .o and ii_files), but leaves the
executable in place.

submitrun <class> <project>

Example:  submitrun cs341 Proj4 <parameter list>

This runs the project, assuming there is an executable (i.e. submitmake was run successfully).


Project grading is described in the Project Policy handout.
Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will not be accepted. Do not submit any files after that time.