CMSC 341 Spring 2005
Project 4

Assigned

Wednesday, April 13, 2005

Due

11:59 pm on Tuesday, April 26, 2005

Correction, April 20
    In the section "Statistics to be collected", the modulo operator "%" should be division "/".
April 20
    The due date of Project 4 is extended to 11:59 pm, Sunday, May 1, 2005.
Corrections, April 27
    1. The formula for computing the average word frequency (# of distinct words in HT1 / N_words) is incorrect.
       It should be 1 / (# of distinct words in HT1).
    2. The formula for the theoretical estimate of the average number of probings in the project questions is incorrect.
       It should be (1/lambda) * ln(1/(1 - lambda)), where ln is the natural logarithm.

Objectives

The purpose of this project is to give you some exposure to Hash Tables (HT) through a simple application that determines the frequencies of words appearing in a collection of text paragraphs. The time performance of the HT will also be explored.


Description

Word frequency analysis has many applications. For example, it can provide guidance for learning English (learn the most frequent words first) and help determine whether two text documents were written by the same person (compare the word frequency distributions of the two documents). In this project, you are asked to compute the frequencies of words appearing in a collection of text paragraphs using a hash table. These paragraphs, read from a text file referred to as the input file in this description, will first undergo pre-processing in which punctuation marks and other special symbols are removed. Since some words, such as articles, prepositions, and pronouns, occur frequently but are not of great interest in many applications, they will be excluded from the word frequency calculation in this project. The list of words to be excluded is given in another text file, referred to as the exclusion file.

Each word, after pre-processing, will first be checked against the exclusion file. Words that are not excluded will be inserted into a HT; if a word is already in the table, its occurrence count will be incremented by 1. A HT is selected as the data structure because its find and insert operations have O(1) time complexity as long as the load factor is low. When the load factor of the current HT becomes too high, a new HT of approximately double the size will be created, and all words already processed will be re-hashed into the new table together with their occurrence counts. This is called rehashing in the author's code. The process continues until all words in the input file have been processed. You will also collect some statistics related to time performance.
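To make the flow concrete, here is a minimal C++ sketch of the main processing loop. It assumes the text has already been pre-processed (see Pre-processing below) and that a HashTable class offering contains and insert operations exists (a sketch of such a class appears in the Hash tables section); all names here are illustrative, not a required interface.

    #include <sstream>
    #include <string>

    // Count the non-excluded words of a pre-processed text.  HT0 holds
    // the excluded words; HT1 accumulates the occurrence counts.
    void countWords(const std::string &cleaned,
                    const HashTable &HT0, HashTable &HT1)
    {
        std::istringstream in(cleaned);
        std::string word;
        while (in >> word)               // words are separated by spaces
            if (!HT0.contains(word))     // skip excluded words
                HT1.insert(word);        // new word, or increment its count
    }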


Below are the specifics.

Pre-processing
A text paragraph, such as a news story from the wire services or text from the web, often contains punctuation marks (e.g., , . ? ! "), special symbols (e.g., @ $ % -), and numerals (0, ..., 9). The pre-processing step removes all such symbols from the text read from the input file. Specifically, all symbols other than spaces and the characters {A, ..., Z, a, ..., z} shall be removed. In addition, all uppercase letters are converted to lowercase. After this step, a text paragraph becomes a collection of sequences of lowercase characters "a" through "z", separated by space(s). We call such a sequence of characters a word.
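As an illustration of this step, here is a minimal C++ sketch (the function name preprocess is our own choice, not part of any required interface):

    #include <cctype>
    #include <string>

    // Keep only letters, converted to lowercase; every other character
    // becomes a space, so that word boundaries are preserved.
    std::string preprocess(const std::string &text)
    {
        std::string out;
        out.reserve(text.size());
        for (std::string::size_type i = 0; i < text.size(); ++i) {
            unsigned char c = text[i];
            if (std::isalpha(c))
                out += static_cast<char>(std::tolower(c));
            else
                out += ' ';   // punctuation, digits, etc. become spaces
        }
        return out;
    }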


Exclusion file

This file contains a list of words that are to be excluded from the frequency analysis. They include articles, prepositions, pronouns, and some others. Words in this file are also sequences of lowercase characters, separated by space(s). The location of this file is given as the first command line argument. Part of the exclusion file may look like this:
     a an the not and she her hers that this are arent will wont did didnt

An example exclusion file can be found here.
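For illustration, loading these words into the exclusion table could look like the following C++ sketch, reusing the assumed HashTable class and the name HT0 from the other sketches:

    #include <istream>
    #include <string>

    // Build the exclusion table HT0 from the exclusion file.
    void loadExclusions(std::istream &excl, HashTable &HT0)
    {
        std::string word;
        while (excl >> word)    // words are separated by space(s)
            HT0.insert(word);
    }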

Word frequency

The frequency of a word in the collection of texts is computed as the ratio of the number of occurrences of that word to the total number of word occurrences in the collection. Here the collection of texts should be understood as the input file after pre-processing and after all excluded words have been removed.
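For example, using the hypothetical numbers from the Sample Output below: if a particular word occurs 9 times and the collection contains 179 included word occurrences in total, the frequency of that word is 9/179, or about 0.05.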

Hash tables

Since every word read and pre-processed will be checked against the list of words in the exclusion file before being inserted into a HT, it is more efficient to store the excluded words in a hash table as well. Therefore, you will build two hash tables: one, HT0, to hold the words from the exclusion file, and another, HT1, for the rest of the words.


Both HT0 and HT1 are open addressing tables, using the division method for hashing and quadratic probing to resolve collisions, as specified below:

     h'(K)   = K mod M
     h(K, I) = (h'(K) + I^2) mod M

Here
     K is the integer key derived from the word to be hashed,
     I = 0, 1, 2, ... is the probe number, and
     M is the table size.

HT0 will have a fixed size M0. Since fewer than 50 words will be included in the exclusion file, we choose M0 to be the smallest prime greater than 100. HT1 will start with table size M1 = M0. You need to rehash HT1, doubling its size, when
    1) the load factor lambda becomes greater than or equal to 0.8; or
    2) quadratic probing cannot find an empty cell.
Note: Quadratic probing is guaranteed to find an empty slot if 1) the table size is prime and 2) lambda is less than 0.5. Since in this project lambda is allowed to be as large as 0.8, it is possible, although not very likely, that an empty slot cannot be found. This condition can be detected when the current probe falls into a slot that has been probed earlier while attempting to insert K (i.e., (h'(K) + I^2) mod M = (h'(K) + J^2) mod M with I != J).
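To make these rules concrete, here is a hedged C++ sketch of an open addressing table with quadratic probing, both rehash triggers, and the occurrence counts described earlier. The class design, member names, the string-to-integer key function, and the load-factor bookkeeping are all assumptions of this sketch, not a required design; counting probes for the statistics section is omitted.

    #include <string>
    #include <vector>

    struct Entry {
        std::string word;
        int  count;          // occurrence count of this word
        bool occupied;
        Entry() : count(0), occupied(false) {}
    };

    class HashTable {
    public:
        explicit HashTable(int size) : table(size), nStored(0) {}

        // True if word is stored in the table.
        bool contains(const std::string &word) const
        {
            int pos = findSlot(word);
            return pos >= 0 && table[pos].occupied;
        }

        // Insert a word (or increment its count), rehashing when needed.
        void insert(const std::string &word)
        {
            int pos = findSlot(word);
            if (pos < 0) {                         // condition 2: probing failed
                rehash();
                insert(word);
                return;
            }
            if (table[pos].occupied) {
                ++table[pos].count;                // word already in the table
                return;
            }
            table[pos].word = word;
            table[pos].count = 1;
            table[pos].occupied = true;
            if (++nStored >= 0.8 * table.size())   // condition 1: lambda >= 0.8
                rehash();
        }

    private:
        std::vector<Entry> table;
        int nStored;

        // Quadratic probing, h(K, I) = (h'(K) + I^2) mod M.  Probing all
        // I in [0, M) visits every slot this scheme can ever reach, so a
        // failure here means no empty cell can be found (a repeated-slot
        // check, as described in the note above, could stop even earlier).
        int findSlot(const std::string &word) const
        {
            int M = (int) table.size();
            unsigned long home = key(word) % M;
            for (int i = 0; i < M; ++i) {
                int pos = (int) ((home + (unsigned long) i * i) % M);
                if (!table[pos].occupied || table[pos].word == word)
                    return pos;
            }
            return -1;
        }

        // Convert a word to an integer key K for the division method.
        static unsigned long key(const std::string &w)
        {
            unsigned long k = 0;
            for (std::string::size_type i = 0; i < w.size(); ++i)
                k = 37 * k + (unsigned char) w[i];
            return k;
        }

        // Smallest prime >= n; trial division is fine at this scale.
        static int nextPrime(int n)
        {
            for (;; ++n) {
                bool prime = n > 1;
                for (int d = 2; d * d <= n && prime; ++d)
                    if (n % d == 0)
                        prime = false;
                if (prime)
                    return n;
            }
        }

        // Double the table size (to the next prime) and re-hash every
        // stored word together with its occurrence count.
        void rehash()
        {
            std::vector<Entry> old;
            old.swap(table);
            table.assign(nextPrime(2 * (int) old.size()), Entry());
            for (std::vector<Entry>::size_type i = 0; i < old.size(); ++i)
                if (old[i].occupied)
                    table[findSlot(old[i].word)] = old[i];
        }
    };

Note that after doubling, the new table's load factor is below 0.5 and its size is prime, so the re-insertions during rehash are guaranteed to succeed.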


Statistics to be collected
The following statistics are to be collected and reported (all of them are illustrated in the Sample Output section below):
    1) the average number of probings per attempted insertion into HT1 (the total number of probes divided by the number of attempted insertions), recorded when lambda reaches 0.25, when lambda reaches 0.5, and just before each rehash, together with the current table size and the number of distinct words at that point;
    2) the total number of word occurrences in the input file;
    3) the total number of excluded word occurrences and the total number of included word occurrences;
    4) the total number of distinct words;
    5) the average word frequency; and
    6) the 20 most frequently occurring words with their frequencies.

Your Tasks

In addition to implementing the program described above, you must answer the questions in

/afs/umbc.edu/users/y/p/ypeng/pub/CMSC341/Proj4/341-Spring05-p4_questions.txt

These questions are worth 10% of the project grade.


Command Line

Project 4 will be invoked with a command line that consists of two arguments:
  1. The first argument is the location of the exclusion file. (Note: the list is by no means complete; it simply includes those words we could think of at the time.)
  2. The second argument is the location of the input file. (Warning: text paragraphs in this file may be very large.)

An example command line looks like this (full paths are omitted):
     Proj4 excludedWords.txt AssociatedPress.txt
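Since the project will also be tested with invalid command lines (see Grading below), a minimal C++ sketch of the argument checking might look like this (the error messages and overall structure are assumptions of this sketch):

    #include <fstream>
    #include <iostream>

    int main(int argc, char *argv[])
    {
        if (argc != 3) {                       // program name + 2 arguments
            std::cerr << "Usage: " << argv[0]
                      << " <exclusionFile> <inputFile>" << std::endl;
            return 1;
        }
        std::ifstream exclusionFile(argv[1]);
        std::ifstream inputFile(argv[2]);
        if (!exclusionFile || !inputFile) {
            std::cerr << "Error: cannot open input file(s)" << std::endl;
            return 1;
        }
        // ... build HT0 from exclusionFile, process inputFile into HT1,
        //     and print the statistics described above ...
        return 0;
    }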


Sample Output

The following example output is for formatting purposes only; it was NOT generated by an actual execution of the program.

Project 4  excludedWords.txt AssociatedPress.txt

Average number of probings per attempted insertion
    lambda = 0.25     lambda = 0.5     before rehashing     current table size     # distinct words before rehashing
      1.55              3.67             11.45                101                     81
      1.12              3.06             10.89                203                    163

The total number of word occurrences in the text file is:    257
The total number of excluded word occurrences is:             78
The total number of included word occurrences is:            179
The total number of distinct words is:                       103

The average word frequency is:                               0.0056
The 20 most frequently occurring words are
     xxxxx (0.051)
     xxxxx (0.049)
     ...
     ...
     xxxxx (0.012)


Grading and Academic Integrity

Your project will be tested using a variety of command lines, some of which will be invalid.
It will also be tested using a variety of input files that exercise various conditions your code should handle.

Project grading is described in the Project Policy handout.
Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after the deadline will not be accepted. Do not submit any files after that time.