CMSC 341 Project 2

Assigned	Oct 10, 2012
Due	11:59pm Oct 28, 2012
Points	100

Background

There are many computational tasks where it is important to know how frequently sequences of words or characters occur in text. For example, in speech recognition (automatically turning spoken utterances into text) the sentence "sodium nitrate is a preservative" can be difficult because "nitrate" sounds a lot like "night rate". But if you have a large corpus of text and know that "night" almost never follows "sodium", you can rule out "night rate" and go with "nitrate". This example may seem silly - who would ever confuse "nitrate" with "night rate"? Sadly, this is a very real problem with all modern speech recognition systems.

Another application near and dear to the hearts of Computer Science teachers and students is figuring out whether students have copied code from each other. For example, if you see two project submissions that both have this code, you can be pretty sure they are cheating:

  for (int abc123 = 0; abc123 < 100 - 10; abc123 = abc123 + 2 - 1) 
    System.out.println(abc123);

Why? Because those two are probably the only projects that contain the strings "abc123" and "< 100 - 10" and "+ 2 - 1".

In this project, you will implement a program that checks for cheating using a data structure called a trie.

Tries

A trie (pronounced "try") is a tree useful for representing sets of strings. Each node in a trie represents a string whose length is equal to the depth of the node. You will build depth-limited tries for text files, and compare the tries to compute the similarity of the text files. Here's how you will do that.

Consider the following input text:

  'an ant sat and ate'

Suppose the depth bound is 3. To build the trie you'll take a window of width 3 and pass it over the input text file, yielding the following substrings:

  'an '
  'n a'
  ' an'
  'ant'
  'nt '
  't s'
  ' sa'
  'sat'
  'at '
  't a'
  ' an'
  'and'
  'nd '
  'd a'
  ' at'
  'ate'

You then build a tree such that there is a path from the root for each of these substrings, and keep counts for each node of the number of substrings that have passed though it. The final trie looks like this:

Each trie node contains a character and a count. There are 16 substrings of length 3, so the root was visited 16 times. The root node is the only node for which the character data member is ignored. Of the 16 substrings, 5 started with the letter 'a', so the (leftmost) child of the root labeled 'a' has a count of 5. The 5 substrings that started with 'a' are:

  'an '
  'ant'
  'at '
  'and'
  'ate'

The initial 'a' is followed by either an 'n' (in 'an ', 'ant', and 'and') or a 't' (in 'at ' and 'ate'), so that node has two children labeled 'n' and 't'. Note that substrings that share a common prefix (as 'and' and 'ant' share the two-character prefix 'an') share tree structure.

Input and Output

Your program will be run with four command line arguments. The first will be a positive integer, D, which is the maximum depth of the trie. The second will be a positive integer, N, which is the number of substrings to print. The third and fourth will be names of files containing text. Build one trie for each file. Do not modify the characters you read in any way. Do not remove white space or convert characters to upper or lower case. Then, for the N most frequent strings in the first trie at depth D, print the string, the number of times it occurs in the first text file, and the number of times it occurs in the second text file. This latter number must be obtained by searching the second trie.

Suppose your program is run with this command line:

  ant -Dargs="3 2 file1.txt file2.txt" run

That is, D = 3 and N = 2. Suppose the 2 most frequent 3-letter sequences in file1.txt are "and" and "the", and that they occurred 57 and 45 times, respectively. Forthermore, suppose those sequences occurred in file2.txt 100 and 94 times, respectively. Your output might look like this:

Sequence     # in file1.txt     # in file2.txt
--------     --------------     --------------
and          57                 100
the          45                 94

The list of strings should be sorted in descending order according to their frequencies in the first text file. If two strings to be printed have the same frequency, then print them in any order. If there are strings that have the same frequency as the Nth (i.e., the lst) string in the list, print all of them along with their frequencies in both files. In the example above, if the string "was" occurred 45 times in file1.txt, but the next most frequent string occurred 39 times, the output should look like this (assuming "was" occurred 52 times in file2.txt:

Sequence     # in file1.txt     # in file2.txt
--------     --------------     --------------
and          57                 100
the          45                 94
was          45                 52

Implementation

You have lots of lattitude in how you implement this project. A few things are required, which are listed below. After that are some hints and suggestions that you are free to use or ignore.

Requirements

You are NOT allowed to use any JAVA TREE CLASSES. Your class must be created from scratch.
Your Trie class must be generic. That is, rather than storing data items of type char in each node, it must store data items of any type where the type is specified as a compile-time parameter. For example:
```
  public class Trie< T > { ... }
```
Note that it must be possible to test objects of type T for equality (e.g., via the equals method).
You must implement a TrieNode class that is a private inner class of the Trie class. Note that TrieNode will need to be generic as well to store data items of the type specified for the enclosing trie.

Hints

You can store counts with TrieNode objects, along with the data item, and increment the count when a new sequence is added to the trie that passes through the node.
You will need some data structure inside the TrieNode, such as a vector or list, to store the children of that node.
You might find it useful to implement a Trie method such as public int getCount(Vector< T > sequence) that returns the count associated with a particular sequence.
In general, implement whatever methods you see fit as long as there is a good need for them and they adhere to good OOP principles.

Submission

You must use CVS to check out your module in the Proj2 repository and to check in your code. That must include an ant build.xml file and your java source code. The grading scripts will issue commands like the following, so be sure that your build.xml supports them:

  ant
  ant doc
  ant -Dargs="3 2 file1.txt file2.txt" run

See the projects page for more information on all of these topics.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.

Project grading is described in the Project Policy handout.
Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.