CMSC-341 Fall 2006

Project 3

Assigned Wed Oct 11, 2006
Due Tues Oct 24, 2006 by 11:53PM
Updates 17 OctHints on parsing the text file have been added.

12 Oct
The file of excluded words is just a list of words (lower-case, possibly including hyphen or apostrophe) separated by whitespace. No other punctuation will be in the file. The file will not contain comments.


Background

This project will give you experience working with Lists, Binary Search Trees in the context of implementing a cross-reference (an "X-Ref") ADT. To build an X-Ref your program will read the words in text data file, store them in a BST, and record the page number(s) and line number(s) within each page for each occurrence of a word. Overly common words such as "a", "an", and "or" will be excluded from the X-Ref. Your program will read commands from a file to query the Xref.


Description

An X-Ref is built by reading a text file and storing page and line number information about each word. Words which are to excluded from the X-Ref are found in a separate file. Your project will open a command file (a command line paramter) and execute the commands found within it to produce the required output.

For this project, the page and line number should be recorded for every non-excluded word read from the text file.

The text file

In the text file, a "word" is a sequence of characters. Words may contain letters, numbers, the "-" symbol (hyphen, as in "vice-versa") and "'" (apostrophe as in "can't"). All other characters should be considered delimiters. Lines containing only delimiters (and no valid characters) should be considered as empty lines. Empty lines are counted, but do not add entries to the X-Ref. All words should be converted to lower-case. Page and line numbering start at 1. In the following quote Tomorrow, and tomorrow, and tomorrow Creeps in this petty pace from day to day To the last syllable of recorded time; the word "tomorrow" is included in the X-Ref only once, even though one occurrence is capitalized and the word appears multiple times on the same line. The word "to" appears twice -- once for each line on which it occurs.

The special word PAGE-BREAK indicates the end of the current page and the start of a new page. PAGE-BREAK should NOT be included in the X-Ref.

Hints for parsing the textfile
"Parsing" means examing a string/line/set of characters and splitting it apart into separate "words". In the text file in this project all characters except alphanumeric characters (AB..YZab...yz012...9), hyphen, and apostrophe are delimiters. Delimiters denote the end of a word. To parse the text file, we look for the delimiters and create separate words for the Xref. One easy C++ way to do this is to use getline() to read an entire line from the file, replace all delimeters with spaces, then use operator>> with an istringstream to extract the words. The pseudo-code below should be helpful. The istring stream notes from CMSC 202 may also prove helpful. 1. use getline( ) to read an entire line from the texfile into a C++ string or array of chars 2. examine each character in the string/array and replace every delimiter with a space 3. create/initialize an istringstream object with the string/array of chars 4a. use the istringstream extraction operator>> to "read" the istringstream one word at a time. (The extraction operator skips the spaces.) 4b. Do something with each word Also, to make your job easier, we promise that PAGE-BREAK will appear on its own line.

Your primary tasks are

  1. Write main( ) in Proj3.cpp (and helper functions in an auxiliary file) that instantiates the X-Ref class, then reads the and executes each command. Some commands (LOAD and EXCLUDE) direct you to read text files. All other commands invoke an appropriate X-Ref method to produce the required output.

  2. Copy the author's binary search tree (BST) code from the directory /afs/umbc.edu/users/d/e/dennis/pub/CMSC341/Proj3. You may augment the BST class by adding generic methods which are helpful for this project, but the BST may have no code specific to an Xref. The BST must remain a class template.
    Other potentially helpful code is also available in the same directory.

  3. Implement an X-Ref class that will be instantiated in main( ). The X-Ref class should use a binary search tree to store its data. As a "user" of the BST, the X-Ref is limited to using only the public methods of the BST. The X-Ref is the only project-specific object that main( ) instantiates.

  4. Implement other classes related to the X-Ref that good OO design dictates for data hiding and encapsulation.

  5. Implement a makefile for this project.

  6. Answer the questions posed in 341-fall06-proj3_questions.txt. Copy the file /afs/umbc.edu/users/d/e/dennis/pub/CMSC341/Proj3/341-fall06-p3_questions.txt to your own directory and edit it to provide your answers to the questions. Don't forget to submit the edited file; it is 10% of this project's grade.

Suggestions for this project

This project isn't trivial, but it's pretty straightforward if attacked in the appropriate way. Although this shouldn't be new to you, it seems prudent to offer some advice. Try implementing the project in this order.
  1. Think about how best to store the required data (word, pages, line numbers) for each word in the Xref. There are several options, some more time and/or space efficient than others.
  2. Implement the basics of the Xref class (and its nested classes/structs) such as constructing, LOADing, and printing.
  3. Write a small program with hard-coded data to test the basics of Xref.
  4. Continue to add functionality to the Xref (to support the commands in the command file) and to enhance your Xref test program to test them, still using hard-coded data.

  5. Implement a skeleton for main( ) including an instantiation of an Xref object.
  6. Implement and test your command file processor, one function at a time.
    Continue to implement and test the remaining commands.
  7. Determine how/where to store the excluded words for quick lookup, then implement code to support the EXCLUDED command.


The Command File

Commands in the file specify operations to be performed with X-Refs. Each line in the file represents one command. Blank lines may appear anywhere in the file and should be ignored. Lines beginning with '#' are comments and should be ignored. Otherwise, you can assume that any line containing a command is syntactically well-formed. We make this assumption so you don't have to spend lots of time making the command file parser bullet proof.

The command file format follows:


Sample Output

Sample output is available for your study in the file /afs/umbc.edu/users/d/e/dennis/pub/CMSC341/Proj3/341-fall06-p3-sample_output.txt Xref entries printed in the output are of the form
to 1/2, 2/1
indicating that the word "to" appears on page 1/ line 2 and also on page 2 / line 1

The datafile loaded in the sample output contains the following text:

Tomorrow, and tomorrow, and tomorrow
Creeps in this petty pace from day to day
PAGE-BREAK
To the last syllable of recorded time

None of these words were in the file of excluded words.


10 Points Extra Credit

To implement the PRINTPAGE command in the Xref and not write project-specific code within the BST (which is not permitted), the Xref method that implements PRINTPAGE must be able to perform an in-order traversal of the BST. To accomplish this, you must implement a BST InOrderIterator as discussed in class.

For 10 points extra credit

  1. Enhance the BST by adding a nested public InOrderIterator class.
    The InOrderIterator supports only operator++, operator*, operator==, and operator!= in addition to constructors, destructors, etc.
  2. Add a method to the BST which creates an InOrderIterator that is positioned at the beginning of an in-order traversal.
  3. Add a method to the BST which creates an InOrderIterator that is positioned at the end of an in-order traversal.
  4. Implement the PRINTPAGE command using InOrderIterators.
  5. Submit a file named README.txt to alert the graders that you have implemented the extra credit code.

Project Submission

Because you are submitting your own makefile, you are free to name your files whatever you like, except that the name of your executable MUST BE Proj3 (upper-case 'P'). The submitrun program and the scripts that we use to grade your files assume that you follow this convention for naming your executable.

DO NOT submit any .h file provided for you unless you have modified the file. If you have not modified the file, then your makefile should reference the file(s) as in previous projects.

Submit Tools

There are a number of tools available to you to check on your submittal. It is your responsibility to ensure that the submittal is correct and will result in a successful compilation of your project. Do not wait till the last minute to submit your files. Give yourself enough time to check that your submittal is correct.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.

Documentation for the submit program is on the web at http://www.gl.umbc.edu/submit/. One of the tools provided by the submit program is submitls. It lists the names of the files you have submitted.

Additionally, there are two programs for use only by CMSC-341 students (not part of the UCS submit program). They are in the directory
/afs/umbc.edu/users/d/e/dennis/pub/CMSC341/ and are named submitmake and submitrun. You can use these programs to make or run your submitted projects.

The syntax is similar to that for submit:

submitmake <class> <project>

Example:  submitmake cs341 Proj2

This makes the project, and shows you the report from the make utility. It cleans up the directory after making the project (removes .o files), but leaves the executable in place.

submitrun <class> <project> [command-line args]

Example:  submitrun cs341 Proj2 test.dat

This runs the project, assuming there is an executable (i.e. submitmake was run successfully).


Grading and Academic Integrity

Your project will be tested using a variety of command lines, some of which will be invalid.
Your project will also be tested using a variety of command files which will test various conditions which your code should handle.

Project grading is described in the Project Policy handout.

Your answers to 341-fall06-proj3_questions.txt are worth 10% of your project grade.

Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after the due date will not be accepted. Do not submit any files after that time.