UMBC CMSC 202

CMSC 202 Spring 2003
Project 1

Using Objects

Assigned Feb 03, 2003
Design Due Feb 09, 2003, 11:59pm
Program Due Feb 16, 2003, 11:59pm
Updates None

Objectives


Project Description

In this project, you will read words of text from a file and produce various outputs that tell us interesting information about the text file.

Your program will be invoked with one command line argument, which is the name of the text file to process (e.g. Proj1 myfile.txt). A "word" in the file is defined to be any sequence of characters surrounded by whitespace. Your program may assume that all words in the file are lowercase. The number of words in the file is unknown. Your program should NOT assume anything about the number of words in the file (except that they may be counted using an 'int').

Required Output

Your program will read words from the file and process them in order to produce the following output. You must output the following data about the text file in the order specified below.
  1. Print the name of the file being processed.
  2. Print the total number of words in the file.
  3. Print the total number of distinct words in the file.
  4. For each letter of the alphabet that begins one or more words, print the number of words, the number of distinct words, and a list of the words with the number of times each occurred (5 per line). The list of words should be in the order of their first appearance in the file. Print all words (if any) that begin with 'a' first, then with 'b', then with 'c', etc.
  5. Print the word that occurs most frequently in the file. If more than one word is most frequent, print any of the "most frequent" words.
  6. Print the longest word in the file and the number of characters in the word. If more than one word is longest, print any of the "longest" words.
  7. Print the letter of the alphabet which begins the most words and the number of words that begin with that letter. If more than one letter begins the most words, list the letter which comes first alphabetically.
  8. Print the letter of the alphabet which begins the most distinct words and the number of distinct words that begin with that letter. If more than one letter begins the most distinct words, list the letter which comes first alphabetically.
  9. Print a list of the letters which do not begin any words.
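Items 7 through 9 all reduce to questions about a table of 26 counts. The sketch below shows one possible way to answer them, including the "first alphabetically" tie rule. The function names are hypothetical, the code assumes a character set in which 'a' through 'z' are contiguous (as in ASCII), and it skips words that do not begin with a lowercase letter, which is an assumption the project text does not address.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Builds a table of 26 counts: entry 0 is the number of words that
// begin with 'a', entry 25 the number that begin with 'z'.
std::vector<int> tallyFirstLetters(const std::vector<std::string>& words)
{
    std::vector<int> counts(26, 0);
    for (std::size_t i = 0; i < words.size(); ++i)
    {
        if (!words[i].empty() && words[i][0] >= 'a' && words[i][0] <= 'z')
        {
            ++counts[words[i][0] - 'a'];
        }
    }
    return counts;
}

// Returns the index of the largest count; counts must not be empty.
// The strict '>' keeps the earliest index when several counts tie,
// so the letter that comes first alphabetically wins.
int indexOfMax(const std::vector<int>& counts)
{
    int best = 0;
    for (int i = 1; i < static_cast<int>(counts.size()); ++i)
    {
        if (counts[i] > counts[best])
        {
            best = i;
        }
    }
    return best;
}

// Returns, in alphabetical order, the letters whose count is zero.
std::string unusedLetters(const std::vector<int>& counts)
{
    std::string result;
    for (std::size_t i = 0; i < counts.size(); ++i)
    {
        if (counts[i] == 0)
        {
            result += static_cast<char>('a' + i);
        }
    }
    return result;
}
```

Passing a table of distinct-word counts instead of total-word counts to indexOfMax answers item 8 with the same code.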

The WordCount class

Whenever two or more pieces of data are closely related (like the words and the number of times they occur in the file for this project), it is a good idea to associate them together in a single entity. In C, the entity we used was a struct. In C++, these entities are classes. For your convenience, a class named "WordCount" is being provided for your use. This class contains a string to store a word from the file, an integer to store its count, constructors to create WordCount objects, and methods to access the word and count.
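To illustrate the general idea of bundling a word with its count, here is a small class of the same flavor. It is NOT the provided WordCount class: the name WordTally and every member shown are hypothetical, chosen only for this example. The real interface is the one in WordCount.H.

```cpp
#include <string>

// Hypothetical illustration only. The provided WordCount class may use
// different constructors and method names; consult WordCount.H.
class WordTally
{
public:
    // Creates a tally for word with the given starting count.
    WordTally(const std::string& word, int count = 1)
        : m_word(word), m_count(count)
    {
    }

    // Accessors return the stored data without allowing changes to it.
    const std::string& GetWord() const { return m_word; }
    int GetCount() const { return m_count; }

    // Records one more occurrence of the word.
    void Increment() { ++m_count; }

private:
    std::string m_word;  // the word as read from the file
    int m_count;         // number of times the word has been seen
};
```

Unlike a C struct with public fields, the data members here are private, so the count can only change through the class's own methods.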

The WordCount class is fully described in its header file, WordCount.H. You should #include this header file in your project (#include "WordCount.H") where appropriate. The project Makefile provided for you will find the header file when you compile and link your program. See the Project Makefile section for details.

The implementation of the WordCount class, WordCount.C, has been written for you. The makefile will reference the source file WordCount.C, compile it if necessary and link it with proj1.o.

There is no need to copy either WordCount.H or WordCount.C to your local directory since your makefile will find them when needed. MAKE NO CHANGES TO and DO NOT SUBMIT WordCount.H or WordCount.C.

Project Restrictions

  1. You must use the C++ string class -- no char arrays except argv are permitted.
  2. You must use the C++ vector class -- no arrays are permitted except as described in the Free Advice section.
  3. You must use the WordCount class
  4. You must use C++ (not C) output (i.e. use cout and not printf)
  5. You must use C++ file input.
  6. DO NOT prompt the user for any input. Graders will be using scripts to test your code and no user input will be provided.

Sample Output

This sample output is provided to show you a reasonable output format which satisfies the project requirements. It is not necessary that you follow this format exactly, but whatever format you choose must provide all required data and meet the project output requirements above.

The file p1.dat contains the following words

a very angry and pesky antelope ate all of my angelhair pasta i really really really dislike a very pesky antelope

The following sample output was created manually using the data from "p1.dat".

linux3[33]% Proj1 p1.dat
Processing file "p1.dat"
The file contains 21 words.
The file contains 15 distinct words.

Words that begin with 'a':
a:2 angry:1 and:1 antelope:2 ate:1
all:1 angelhair:1
Total words: 9, Distinct words: 7

Words that begin with 'd':
dislike:1
Total words: 1, Distinct words: 1

Words that begin with 'i':
i:1
Total words: 1, Distinct words: 1

Words that begin with 'm':
my:1
Total words: 1, Distinct words: 1

Words that begin with 'o':
of:1
Total words: 1, Distinct words: 1

Words that begin with 'p':
pesky:2 pasta:1
Total words: 3, Distinct words: 2

Words that begin with 'r':
really:3
Total words: 3, Distinct words: 1

Words that begin with 'v':
very:2
Total words: 2, Distinct words: 1

The most frequent word is "really" which occurred 3 times.
The longest word is "angelhair" which contains 9 characters.
The letter 'a' had the most words - 9
The letter 'a' had the most distinct words - 7
No words begin with the letters b, c, e, f, g, h, j, k, l, n, q, s, t, u, w, x, y, z

Free Advice

  1. Each distinct word from the text file and the number of times it occurs in the text file are stored in a WordCount object.
    Use one of the following ways to organize your WordCount objects.
    1. Store all WordCount objects in a single vector. This makes the coding easier, but a little less elegant and less efficient. Loop through this vector as necessary to produce the required output.
    2. Create an array of 26 vectors (one for each letter of the alphabet), essentially creating a 2-dimensional array in which each row can have a different number of columns. Store all words that begin with 'a' in the first vector, all words that begin with 'b' in the second vector, etc. The coding's a bit more complex, but it's more elegant and more efficient since the words are already separated by starting letter and you can use some services of the vector class.
    Other vector(s) may also be useful.
  2. Use incremental development.
  3. Create your own small test files. You can manually count the number of words, distinct words, etc., and verify your output. As you feel more confident, create larger files. Share your test files with your study group and see that everyone's program gets the same results.
  4. Attend the TA office hours in ECS 104A.
  5. The Unix command wc can be used to verify the total number of words in a file. Check the man pages for the wc command (man wc).
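The second organization in item 1 above (an array of 26 vectors) might look roughly like this. The struct Entry is a hypothetical stand-in for the provided WordCount class, whose real interface is in WordCount.H and is not reproduced here. The function name addWord is also an assumption, as is the choice to reject words that do not start with a lowercase letter.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for WordCount; use the provided class in your project.
struct Entry
{
    std::string word;
    int count;
};

const int NUM_LETTERS = 26;

// Records one occurrence of word in the bucket for its first letter.
// buckets must be an array of NUM_LETTERS vectors.
// Returns false, recording nothing, if the word is empty or does not
// begin with a lowercase letter. Assumes 'a'..'z' are contiguous (ASCII).
bool addWord(std::vector<Entry> buckets[], const std::string& word)
{
    if (word.empty() || word[0] < 'a' || word[0] > 'z')
    {
        return false;
    }

    std::vector<Entry>& bucket = buckets[word[0] - 'a'];

    // A linear search keeps the words in order of first appearance.
    for (std::size_t i = 0; i < bucket.size(); ++i)
    {
        if (bucket[i].word == word)
        {
            ++bucket[i].count;
            return true;
        }
    }

    // First time this word has been seen: append a new entry.
    Entry fresh;
    fresh.word = word;
    fresh.count = 1;
    bucket.push_back(fresh);
    return true;
}
```

With this layout, buckets[i].size() gives the number of distinct words for a letter directly, which is one of the vector services mentioned above. The caller would declare the array as std::vector<Entry> buckets[NUM_LETTERS];.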

Error Handling

Your program, and ALL programs in this class, must handle the following errors:


Project Design Assignment

Your project design document for project 1 must be named design1.txt. Be sure to read the design specification carefully. Submit your design in the usual way:

submit cs202 Proj1 design1.txt

Project Makefile

The "make" utility is used to help control projects with large numbers of files. It consists of targets, rules, and dependencies. You will be learning about make files in discussion. For this project, the makefile will be provided for you. You will be responsible for providing make files for all future projects. Copy the file

/afs/umbc.edu/users/d/e/dennis/pub/CMSC202/p1/Makefile to your directory.

When you want to compile and link your program, simply type the command make or make Proj1 at the Linux prompt. This will compile proj1.C and create the executable named Proj1.

In addition to creating your project executable, make can be used for maintaining your directory. Typing make clean will remove any extraneous files in your directory, such as .o files and core files. Typing make cleanest will remove all .o files, core, Proj1, and backup files created by the editor. More information about these commands can be found at the bottom of the makefile.


Grading

The grade for this project will be broken down as follows. A more detailed breakdown will be provided in the grade form you receive with your project grade.

85% - Correctness

15% - Coding Standards

Your code adheres to the CMSC 202 coding standards as discussed and reviewed in class.

Extra Credit

For 5 points of extra credit, do both of the following.
  1. If more than one word is "most frequent", print a list (5 per line) of all the most frequent words.
  2. If more than one word is "longest", print a list (5 per line) of all the longest words.

Project Submission

The only file required for this project is proj1.C that contains the function main().

If you write a few (1 - 3) other functions, they may be part of proj1.C.
If you write many other functions, they must be written in a separate file named proj1Aux.C and their prototypes must be found in proj1Aux.H. In this case, you will be responsible for modifying your makefile to compile and link proj1Aux.C.

To submit your project, type the command

submit cs202 Proj1 proj1.C Makefile

The order in which the files are listed doesn't matter. However, you must make sure that all files necessary to compile your project (using the make file) are listed. DO NOT SUBMIT WordCount.H or WordCount.C.

You can check to see what files you have submitted by typing

submitls cs202 Proj1

More complete documentation for submit and related commands can be found here.

Remember -- if you make any change to your program, no matter how insignificant it may seem, you should recompile and retest your program before submitting it. Even the smallest typo can cause compiler errors and a reduction in your grade.


Last Modified: Monday, 10-Feb-2003 09:06:50 EST