Assigned |
Wednesday, April 10, 2013 |
Due |
11:59 pm on Sunday, April 28, 2013
|
Updates |
|
In today's world, many files are transmitted across the internet including text files, image files, video files, and music files. To send them efficiently and to store them efficiently requires that we compress those files before we send them and then uncompress them when they are received. Tools such as Winzip and gzip do this for us. Music players and video players know how to uncompress their respective files. We are going to investigate one method used for compressing text files.
Many of the files we deal with on a daily basis are text files which contain ASCII characters. Each ASCII character is represented by one byte (8 bits). When we wish to archive many old text files or to send a text file across the internet, we prefer to "compress" the text file. Compressing a text file means encoding the ASCII characters in the file so that not all characters require 8 bits of storage. As a result, the size of the compressed file is smaller than the original text file. Of course, if we compress a text file, we must also be able to uncompress the file to retrieve the original file.
In this project you will be using a priority queue and a binary tree of your design to implement a file compression/uncompression algorithm called "Huffman Coding". Huffman coding is named for David Huffman who developed this algorithm as a PhD student at MIT in 1952. Before you begin, you should read and understand this explanation of Huffman Coding.
Your program will read a text file and compress it using your implementation of the Huffman coding algorithm found in the explanation. The compressed text will be written to a file. That compressed file will be then be read back by your program and uncompressed. The uncompressed text will then be written to a third file. The uncompressed text file should of course match the original text file character for charcter.
Your project will accept the three command line arguments listed below, in the order listed:
Read the specified text file and count the frequency of all characters in the file.
Create the Huffman coding tree using a PQ based on the frequencies.
Create the table of encodings for each character from the Huffman coding tree.
Encode the text file and output the encoded/compressed file.
Read the encoded/compressed file you just created, decode it and output the decoded file.
This sample output is provided only as a format example. It was not produced by any program and the data in the tables may not be consistent nor even valid.
linux3[7]% java Project3 file1.txt file2.huff file3.txt Compressing "file1.txt" into "file2.huff" Uncompressing "file2.huff" into "file3.txt" Character Frequencies --------------------- Space 3 4 1 ; 2 A 5 C 10 D 3 a 7 c 1 m 1 n 2 Total Characters: 35 Character Encodings ------------------- Space 10 4 11 ; 0100 A 0101 C 0110 D 0111 a 000 c 001 m 1000 n 1001
Project grading is described in the Project Policy handout. Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.
Remember, the due date is firm. Submittals made after midnight of the due date will result in a late penalty.
Your project's functionality will be tested with a variety of text files such as those described below. This list describes only some types of text files that may be used to grade your project and should not be considered all-inclusive.
You must use CVS to check out your module in the Proj3 repository and to check in your code and your build.xml. Grading scripts will execute the commands ant compile and ant run to compile and execute your project. You shpould be able to re-use a build.xml file from an earlier project.
Check your submittals. Make sure they work. Do this before the due date. As in previous projects, use the CVS utilities to verify that your code was submitted properly and will build and execute properly using the commands ant compile and ant run. See the projects page for more information.