CMSC-341 Fall 2003

Project 4

Assigned 5 Nov 2003
Due 24 Nov 2003


Background

Despite the fact that Red-Black trees offer guaranteed O(lgN) performance per operation, there are many applications where the O(1) performance of hash tables is essential. One such application is query processing by web search engines like Google and Yahoo. In this project you will use hash tables to implement a simple, yet efficient, query engine.


Description

Web search engines have response times on the order of a second for arbitrary queries because they index the entire web. They do this by maintaining for each word a list of the pages that contain the word. Then, to figure out what pages contain all of the words in a multiword query you just intersect the lists. For example, if the query "project four" is issued, the index is consulted to find out that "project" occurs in, say, pages P1, P4, P5, and P9; the index is consulted again to find out that "four" occurs in, say, pages P4, P7, P9, and P12; and finally the intersection of these two lists of pages is computed yielding P4 and P9 as the only ones that contain both "project" and "four".

You will implement a system just like the one described above using a hash table to store the index.

Here are your tasks:

  1. Obtain these files: from this location: /afs/umbc.edu/users/o/a/oates/pub/CMSC341/Proj4/. The file SeparateChaining.C contains Weiss' implementation of a hash table with separate chaining collision resolution.
  2. Create a class named Occurrence that stores information about a word and the pages in which it occurs.
  3. Write Proj4.C. It will validate command line arguments, read commands from a command file, and print the results of executing commands.
  4. Answer the questions posed in 341-Fall03-proj4_questions.txt. Copy the file
  5. /afs/umbc.edu/users/o/a/oates/pub/CMSC341/Proj4/341-Fall03-p4_questions.txt

    to your own directory and edit it to provide your answers to the questions. Don't forget to submit the edited file; it's 10% of this project's grade.


Below is a detailed discussion of how to implement this project.

The Command Line

Project 4 will be invoked with a command line that consists of one argument - the name of a command file. Your program should validate the command line, ensure that the specified file can be opened, then read and process each of the commands in the file.

The Command File

The command file format is as follows: There can be multiple LOAD and QUERY commands in a single command file in any order.

Implementation Details

Your index will be implemented using a hash table. For each word that occurs in a file that's loaded with a LOAD command, you must record the fact that the word occurred in the file. You will do that by storing objects of type Occurrence in the hash table. That is, your program will declare a hash table that contains Occurrences:

HashTable < Occurrence > hash(ITEM_NOT_FOUND);

Objects of type Occurrence have two private data members:

You will need to write the following member methods for this class: You will also need to write the following non-member method:

To process a LOAD command you simply open the file and for each word contained in the file either (1) insert a new Occurrence if the word is not in the hash table or (2) add the file to an existing Occurrence if the word is in the hash table. Note that method addFile must modify an Occurrence, so you might need to modify the hash table method find to allow this.

To process a QUERY command, you read each of the query terms, look for them in the hash table, and get the lists of files in which they occur. The intersection of these lists is a list of files in which all of the query terms occur. You will probably want to write the following List method:


Files To Be Submitted

You should submit all files needed to build your program and the file containing your answers to the questions.
Submit Tools
There are a number of tools available to you to check on your submittal. It is your responsibility to ensure that the submittal is correct and will result in a successful
compilation of your project. Do not wait till the last minute to submit your files. Give yourself enough time to check that your submittal is correct.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your
submittals. Make sure they work. Do this before the due date.

Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/. One of the tools provided by the submit program is
submitls. It lists the names of the files you have submitted.

Additionally, there are two programs for use only by CMSC-341 students (not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/o/a/oates/pub/CMSC341/ and are named submitmake and submitrun. You can use these programs to
make or run your submitted projects.

The syntax is similar to that for submit:

submitmake <class> <project>

Example:  submitmake cs341 Proj4

This makes the project, and shows you the report from the make utility. It cleans up the directory after making the project (removes .o and ii_files), but leaves the
executable in place.

submitrun <class> <project> [command-line args]

Example:  submitrun cs341Proj4 checkers checkfile.dat

This runs the project, assuming there is an executable (i.e. submitmake was run successfully).


Grading and Academic Integrity

Your project will be tested using a variety of command lines, some of which will be invalid.
Your project will also be tested using a variety of command files which will test various conditions which your code should handle.

Project grading is described in the Project Policy handout.

Your answers to 341-Fall03-proj4_questions.txt are worth 10% of your project grade.

Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will not be accepted. Do not submit any files after that time.