CMSC 341 Spring 2005
Project 4

Assigned	Wednesday, April 13, 2005
Due	11:59 pm on Tuesday April 26, 2005
Correction, April 20	In section Statistics to be collected, the modulo operator "%" should be division "/".
April 20	Due date of Project 4 is extended to 11:59 pm, Sunday, May 1, 2005
Corrections, April 27	1. Formula for computing average word frequency (# of distinct words in HT1 / N_words) is incorrect. It should be 1/# of distinct words in HT1 2. Formula for theoretical estimate of average number of probings in project questions is incorrect. It should be (1/lambda)*ln(1/(1 - lambda)) where ln is natural logarithm.

Objectives

The purpose of this project is to give you some exposure to Hash Tables (HT) through a simple application that determines the frequencies of words appearing in a collection of text paragraphs. Time performance of HT will also be explored.

Description

Word frequency analysis has many applications. For example, it can provide guidance for learning English (learn the most frequent words first), and help to determine if two text documents were written by the same person (compare the word frequency distributions of the two documents). In this project, you are asked to compute the frequencies of words appearing in a collection of text paragraphs using hash table technique. These paragraphs, read from a text file, referred to as input file in this description, will first undergo some pre-process in which punctuations and other special symbols will be removed. Since some words, such as articles, prepositions, pronouns, occur frequently but are not of great interest in many applications, they will be excluded from the word frequency calculation in this project. The list of words that are to be excluded is given in another text file referred to as exclusion file. Words, after pre-processing, will first be checked against exclusion file. Those words that are not excluded be inserted into a HT, and if a word is already in the table, its occurrence count will be increment by 1. HT is selected as the data structure because find and insert operations of HT are of O(1) time complexity, as long as the load factor is low. When the load factor for the current HT becomes too high, a new HT of approximately doubled size will be created, and all words already processed will be re-hashed into the new table together with their occurrence counts. This is called rehash in author's code. The process continues until all words in the input file are processed. You will also collect some statistics related to time performance.

Below are the specifics.

Pre-processing
A text paragraph, such as a new story from wire services or texts from web, often contains punctuations (e.g., , . ? ! "), special symbols (e.g, @ $ % -), and numerals (0, ..., 9). This pre-process is to remove all such symbols from the text read from input file. Specifically, all symbol other than space or those in {A ,..., Z, a, ..., z} shall be removed. In addition, all capital cases are converted to lower cases. After the process, a text paragraph becomes a collection of sequences of lower case characters "a" through "z", separated by space(s).We call such a sequence of characters a word.

Exclusion file

This file contains a list of words that are to be excluded from frequency analysis. They include articles, prepositions, pronouns and some others. Words in this file are also sequences of lower case characters, separated by space(s). The location of this file is given as the first command line argument. Part of the exclusion file may look like this:
a an the not and she her hers that this are arent will wont did didnt

An example exclusion file can be found here.

Word frequency

The frequency of a word in the collection of texts is computed as the ratio of the number of occurrences of that word in the collection and the total word occurrences in the collection. Here the collection of text should be understood as the input file after the pre-processing and having all excluded words removed.

Hash tables

Since every word read and pre-processed will be checked against the list of words in exclusion file before inserting into HT, it is more efficient if we store those excluded words in a hash table as well. Therefore, you will build two hash tables, one, HT0 to hold the words from the exclusion file, and another, HT1, for the rest of words.

Both HT0 and HT1 are open address tables, using the division method for hashing and quadratic probing for resolving collisions as specified below:

     h'(K) = K mod M

     h(K, I) = (h'(K) + I²) mod M

Here

M, a prime number, is the table size.
I is the number of probings. A probing is defined as a call to function h'. Since each call to function h includes a call to h', the number of probings for a given key equals the number of collisions + 1.
K is an integer key. The keys are converted from words using Horner's rule as discussed in the class.

HT0 will have a fixed size M0. Since fewer than 50 words will be included in the exclusion file, we choose M0 to be the smallest prime greater than 100. HT1 will start with table size M1 = M0. You need to rehash HT1 to double its size when
1) the load factor lambda becomes greater or equal to 0.8; or

2) the quadratic probing cannot find a empty cell.
Note: Quadratic probing guarantees to find an empty slot if 1) the table size is prime and 2) lambda is less than 0.5. Since in this project lambda is allowed to be as large as 0.8, it is possible, although not very likely, that you may not be able to find an empty slot. This condition can be detected when the current probing falls into a slot that has been probed earlier while attempting to insert K (i.e.,h'(K) + I² = h'(K) + J² where I !=J).

Statistics to be collected
The following statistics are to be collected and reported:

N_total: total number of words read in from the input file (counting multiple occurrences).
N_excl : total number of words excluded from the input file (counting multiple occurrences).
N_words :total number of words included (counting multiple occurrences). It is clear that N_total = N_excl + N_words.
Average number of probings per attempted insertion into HT1 at the following time points

when the total number of distinct words stored in the HT = M / 4 (lambda approx. 0.25);
when the total number of distinct words stored in the HT = M / 2 (lambda approx. 0.5);
when the table is to be rehashed.

The averages are non-cumulative. For example, the second average should be with the insertion of the M/4 +1^st distinct word, not with the first word inserted into the current HT1.
Probing count should be reset to zero after each rehashing. Probings for rehashing keys from the current HT1 to the new one should be counted. The keys in HT1 should be rehashed in the ascending order of their indexes in HT1.

The average word frequency of all words in HT1. ( 1 / # of distinct words in HT1 )
The 20 most frequently occurring words in HT1 and their respective frequencies and number of occurrences, in descending order of their frequencies.

Your Tasks

Implement a Hash Table class by modifying the author's code at afs/umbc.edu/users/y/p/ypeng/pub/CMSC341/Proj4.
Write your main( ) that drives and controls the experiment.
Write any other classes or auxiliary code you feel are necessary for this project
Write a makefile for this project.
Answer the questions found in the file

/afs/umbc.edu/users/y/p/ypeng/pub/CMSC341/Proj4/341-Spring05-p4_questions.txt

These questions are worth 10% of the project grade.

Command Line

Project 4 will be invoked with a command line that consists of two arguments:

The first argument is the location of the exclusion file. (Note: the list is by no means complete, just including those we can think of at the time).
The second argument is the location of the input file (warning: text paragraphs in this file may be very large).

An example command line looks like (full paths are omitted)
Proj4 excludedWords.txt AssociatedPress.txt

Sample Output

The following example output is for format purpose only, it is NOT generated from execution of an actual program.

Project 4 excludedWords.txt AssociatedPress.txt

Average number of probings per attempted insertion
    lambda = 0.25     lambda = 0.5     before rehashing     current table size     # distinct words before rehashing
1.55              3.67             11.45                101                     81
      1.12              3.06             10.89                203                    163

The total number of word occurrences in the text file is:    257
The total number of excluded word occurrences is:            78
The total number of included word occurrences is:            179
The total number of distinct words is:                       103
The average word frequency is:                               0.0056
The 20 most frequently occurring words are
     xxxxx (0.051)
     xxxxx (0.049)
     ...
     ...
     xxxxx (0.012)

Grading and Academic Integrity

Your project will be tested using a variety of command lines, some of which will be invalid.
Your project will also be tested using a variety of command files which will test various conditions which your code should handle.

Project grading is described in the Project Policy handout.
Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will not be accepted. Do not submit any files after that time.