CMSC 341 Spring 2011
Project 4

Assigned

Tuesday, April 26, 2011

Due

11:59 pm on Wednesday, May 11, 2011

Update

Wed. May 3, 2011

1.  The format of the sample output is a suggestion; you do NOT have to follow it literally. If you choose to use your own format, please be sure to 1) report all required content, and 2) print it in a way that is readable by OTHERS.

2.  See the insertions (in red) in the "Statistics to be collected" section for clarification.

 

Objectives

The purpose of this project is to give you some exposure to Hash Tables (HT) through a simple application that determines the frequencies of words appearing in a collection of text paragraphs. Time performance of HT operations will also be examined.


Description and Tasks

Word frequency analysis has many applications. For example, it can provide guidance for learning English (learn the 4,000 most frequent words first), and it can help determine whether two text documents were written by the same person (compare the word frequency distributions of the two documents). In this project, you are asked to compute the frequencies of words appearing in a collection of text paragraphs using hash tables.

These paragraphs, read from a text file referred to as the input file in this description, will first undergo pre-processing in which punctuation and other special symbols are removed and all capital letters are converted to lower case. Some words, called stop words, such as articles, prepositions, and pronouns, occur frequently but are of little interest in many applications, so they will be excluded from the word frequency calculation in this project. The list of words to be excluded is given in another text file, referred to as the exclusion file. Each word, after pre-processing, is first checked against the exclusion file; those that are not excluded are inserted into a HT, and if a word is already in the table, its occurrence count is incremented by 1. A HT is chosen as the data structure because its find and insert operations have O(1) time complexity as long as the load factor is low. When the load factor of the current HT becomes too high, a rehash shall occur and the size of the hash table is doubled. The process continues until all words in the input file have been processed. You will also collect some statistics related to time performance.
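In outline, the per-word processing just described might look like the following (a minimal sketch only; the names words, excludedTable, wordTable, and their methods are hypothetical, not part of any required design):

    // Sketch of the main processing loop (all names hypothetical).
    // 'words' holds the pre-processed words from the input file;
    // excludedTable is HT0, already loaded from the exclusion file;
    // wordTable is HT1, mapping each included word to its occurrence count.
    for (String word : words) {
        if (word.isEmpty())                  // e.g., "62" vanishes entirely
            continue;
        if (excludedTable.contains(word))    // stop word: excluded from the count
            continue;
        wordTable.insert(word);              // new word, or increment its count by 1;
    }                                        // wordTable rehashes itself when lambda > 0.5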


Below are the specifics.

Pre-processing
A text paragraph, such as a news story from the wire services or text from the web, often contains punctuation (e.g., , ; . ? ! "), special symbols (e.g., @ $ % -), and numerals (0, ..., 9). This pre-processing step removes all such symbols from the text read from the input file. Specifically, all symbols other than space and those in {A, ..., Z, a, ..., z} shall be removed. In addition, all capital letters are converted to lower case. After this process, a text paragraph becomes a collection of sequences of lower-case characters "a" through "z". We call such a sequence of characters a word. For example, the following paragraph

RALEIGH, N.C. – Rescue crews searched for survivors in wind-blasted landscapes Sunday in North Carolina. The spring storm, North Carolina's deadliest in two decades, spun off 62 tornadoes in that state alone Saturday night.

becomes

raleigh nc rescue crews searched for survivors in windblasted landscapes sunday in north carolina the spring storm north carolinas deadliest in two decades spun off tornadoes in that state alone saturday night

after this pre-processing. Note that the string “62” is completely removed, and that the strings “carolina” and “carolinas” are treated as two distinct words in this project (a more sophisticated pre-processor could strip the trailing “s” from “carolinas”, making the two the same).
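One plausible way to carry out this pre-processing in Java (a sketch only, not required code; the method name preprocess is made up):

    // Pre-processing sketch: keep letters and whitespace, drop every other
    // symbol, convert to lower case, and split on whitespace to get the words.
    static String[] preprocess(String text) {
        String cleaned = text.replaceAll("[^A-Za-z\\s]", "").toLowerCase();
        return cleaned.trim().split("\\s+");  // "62" disappears; runs of spaces collapse
    }

Note that removing the digits of “62” leaves two adjacent spaces behind, which is why the split is on runs of whitespace rather than on single spaces.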

 

Exclusion file

This file contains a list of words that are to be excluded from the frequency analysis. It includes articles, prepositions, pronouns, and some others. Words in this file are also sequences of lower-case characters a, ..., z, separated by one or more spaces.

 

Word frequency

The frequency of a word in the collection of texts is computed as the ratio of the number of occurrences of that word in the collection to the total number of occurrences of all words in the collection. Here the collection of texts should be understood as the input file after pre-processing and with all excluded words removed.
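To make the definition concrete with the numbers from the sample output below: there, the total number of included word occurrences is 879, so a word occurring 7 times would have frequency 7/879 ≈ 0.00796; and since the frequencies of all 451 distinct words sum to 1, the average word frequency is 1/451 ≈ 0.00222.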

Hash tables

Since every word read and pre-processed will be checked against the list of words in the exclusion file before being inserted into a HT, it is more efficient to store the excluded words in a hash table as well. Therefore, you will build two hash tables: one, HT0, to hold the words from the exclusion file, and another, HT1, for the remaining words read from the text.


Both HT0 and HT1 are open-addressing tables that use the division method for hashing and quadratic probing to resolve collisions, as specified below:

     h'(K) = K mod M
     h(K, I) = (h'(K) + I^2) mod M

Here K is the non-negative integer key computed from a word (e.g., from its hash code), I = 0, 1, 2, ... is the probe number, and M is the table size.

The sizes of HT0 and HT1 are denoted M0 and M1, respectively. Both HT0 and HT1 have the same initial size of 101. You need to rehash, doubling the table's size, whenever the number of distinct words in the table exceeds M/2 (i.e., the load factor lambda becomes greater than 0.5). When rehashing occurs, the keys in the current table should be rehashed in ascending order of their indices (see the authors' code).

 

You can base your implementation on the authors' code for a quadratic probing hash table. You are NOT allowed to use the Hashtable class in the Java library.
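To make the probing and rehashing rules concrete, here is a rough sketch of what HT1 could look like (a hypothetical class, not the required design; the spec above says the size is doubled, while the sample output's 101 -> 211 matches the authors' next-prime-after-doubling, which is what this sketch uses):

    // Sketch of an open-address table with quadratic probing:
    //   h'(K) = K mod M,  h(K, I) = (h'(K) + I^2) mod M
    class WordTable {
        private static class Entry {
            String word; int count;
            Entry(String w, int c) { word = w; count = c; }
        }

        private Entry[] table = new Entry[101];   // initial size M = 101
        private int distinctWords = 0;

        // Probe until we find the word or an empty slot.
        private int findPos(String word) {
            int m = table.length;
            int h = (word.hashCode() & 0x7fffffff) % m;   // h'(K)
            int pos = h;
            for (int i = 1; table[pos] != null
                            && !table[pos].word.equals(word); i++)
                pos = (h + i * i) % m;                    // h(K, I)
            return pos;
        }

        public void insert(String word) {
            int pos = findPos(word);
            if (table[pos] != null) {        // word already present:
                table[pos].count++;          //   just bump its count
                return;
            }
            table[pos] = new Entry(word, 1);
            if (++distinctWords > table.length / 2)   // lambda > 0.5
                rehash();
        }

        private void rehash() {
            Entry[] old = table;
            table = new Entry[nextPrime(2 * old.length)]; // e.g., 101 -> 211
            distinctWords = 0;
            for (Entry e : old)              // ascending order of indices
                if (e != null) {
                    table[findPos(e.word)] = e;   // re-insert, keeping the count
                    distinctWords++;
                }
        }

        // Smallest prime >= n (trial division is fine at these sizes).
        private static int nextPrime(int n) {
            for (;; n++) {
                boolean prime = n > 1;
                for (int d = 2; d * d <= n; d++)
                    if (n % d == 0) { prime = false; break; }
                if (prime) return n;
            }
        }
    }

Counting probes per attempted insertion (for the statistics below) can be done by incrementing a counter inside findPos's loop.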

 

Statistics to be collected

The following statistics are to be collected:

1. The average number of probings per attempted insertion into HT1, at the point the load factor lambda first exceeds 0.25 and at the point it first exceeds 0.5 (i.e., just before a rehashing).

These averages, together with the current table size (M1) and the total number of distinct words in HT1, shall be reported before each rehashing and at the end of the execution.

The following statistics are to be reported at the end of the execution:

1. the total number of word occurrences;
2. the total number of excluded word occurrences;
3. the total number of included word occurrences (N_incl);
4. the total number of distinct words (i.e., the number of words stored in HT1);
5. the average word frequency;
6. the 20 most frequently occurring words, with their frequencies.

Note: the frequency of word w = (# occurrences of w) / N_incl.
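As an illustration of the end-of-run report, sorting the table's entries by count yields the top-20 list (a sketch; Entry, entries, and nIncl are hypothetical names for HT1's stored (word, count) pairs and the total N_incl):

    import java.util.Arrays;
    import java.util.Comparator;

    // Sketch: print the 20 most frequent words and their frequencies.
    static void reportTop20(Entry[] entries, int nIncl) {
        Arrays.sort(entries,
                    Comparator.comparingInt((Entry e) -> e.count).reversed());
        System.out.println("The 20 most frequently occurring words are");
        for (int i = 0; i < Math.min(20, entries.length); i++)
            System.out.printf("     %s (%.5f)%n",
                              entries[i].word,
                              (double) entries[i].count / nIncl);
    }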


Command Line

Project 4 will be invoked with a command line containing two arguments, which are the full paths of the exclusion file and the input file. The following is an example of such a command line:

 
ant -Dargs="/afs/umbc.edu/users/y/p/ypeng/pub/cs341s11/Proj4/excludedWords.txt /afs/umbc.edu/users/y/p/ypeng/pub/cs341s11/Proj4/APstory1.txt" run
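Assuming your build.xml passes ${args} through to the program, as in previous projects, the two paths arrive as ordinary command-line arguments; a minimal sketch of Proj4's argument handling:

    // Sketch of argument handling in Proj4 (file-reading details omitted).
    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: Proj4 <exclusion file> <input file>");
            System.exit(1);
        }
        String exclusionPath = args[0];   // full path of the exclusion file
        String inputPath     = args[1];   // full path of the input file
        // load HT0 from exclusionPath, then process inputPath as described above
    }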

A small example input file can be found here; a small example exclusion file can be found here.


Sample Output

The following example output is provided for format purposes only; it was NOT generated by executing an actual program.

    Average number of probings
    lambda > 0.25     lambda > 0.5     current table size     # distinct words before rehashing
      1.55              3.67             101                     51
      1.12              3.06             211                    106
  ...
  ...

For the input file /afs/umbc.edu/users/y/p/ypeng/pub/cs341s11/Proj4/APstory1.txt
The total number of word occurrences is:                      1,196
The total number of excluded word occurrences is:               317
The total number of included word occurrences is:               879
The total number of distinct words is:                          451
The average word frequency is:                                0.00222
The 20 most frequently occurring words are
     xxxxx (0.00796)
     xxxxx (0.00796)
     ...
     ...
     xxxxx (0.00455)


Files to Be Submitted

Submit the following files:

  1. build.xml (you should modify the build file you used for previous projects so that it works for Proj4)
  2. all your source code (including Proj4.java); your source code must be organized in a way that is consistent with your build file, otherwise it will not compile and run
  3. README (optional)

You must use CVS to submit your project to the Proj4 repository (there is no file to check out for this project). The repository location for this project is:

 
      /afs/umbc.edu/users/y/p/ypeng/pub/cs341s11/Proj4