CMSC 341 Project 2

Assigned Oct 10, 2012
Due 11:59pm Oct 28, 2012
Points 100


Background

There are many computational tasks where it is important to know how frequently sequences of words or characters occur in text. For example, in speech recognition (automatically turning spoken utterances into text) the sentence "sodium nitrate is a preservative" can be difficult because "nitrate" sounds a lot like "night rate". But if you have a large corpus of text and know that "night" almost never follows "sodium", you can rule out "night rate" and go with "nitrate". This example may seem silly - who would ever confuse "nitrate" with "night rate"? Sadly, this is a very real problem with all modern speech recognition systems.

Another application near and dear to the hearts of Computer Science teachers and students is figuring out whether students have copied code from each other. For example, if you see two project submissions that both have this code, you can be pretty sure they are cheating:

  for (int abc123 = 0; abc123 < 100 - 10; abc123 = abc123 + 2 - 1) 
    System.out.println(abc123);
Why? Because those two are probably the only projects that contain the strings "abc123" and "< 100 - 10" and "+ 2 - 1".

In this project, you will implement a program that checks for cheating using a data structure called a trie.


Tries

A trie (pronounced "try") is a tree useful for representing sets of strings. Each node in a trie represents a string whose length is equal to the depth of the node. You will build depth-limited tries for text files, and compare the tries to compute the similarity of the text files. Here's how you will do that.

Consider the following input text:

  'an ant sat and ate'
Suppose the depth bound is 3. To build the trie you'll take a window of width 3 and pass it over the input text file, yielding the following substrings:
  'an '
  'n a'
  ' an'
  'ant'
  'nt '
  't s'
  ' sa'
  'sat'
  'at '
  't a'
  ' an'
  'and'
  'nd '
  'd a'
  ' at'
  'ate'

You then build a tree such that there is a path from the root for each of these substrings, and keep counts for each node of the number of substrings that have passed though it. The final trie looks like this:

Each trie node contains a character and a count. There are 16 substrings of length 3, so the root was visited 16 times. The root node is the only node for which the character data member is ignored. Of the 16 substrings, 5 started with the letter 'a', so the (leftmost) child of the root labeled 'a' has a count of 5. The 5 substrings that started with 'a' are:

  'an '
  'ant'
  'at '
  'and'
  'ate'
The initial 'a' is followed by either an 'n' (in 'an ', 'ant', and 'and') or a 't' (in 'at ' and 'ate'), so that node has two children labeled 'n' and 't'. Note that substrings that share a common prefix (as 'and' and 'ant' share the two-character prefix 'an') share tree structure.


Input and Output

Your program will be run with four command line arguments. The first will be a positive integer, D, which is the maximum depth of the trie. The second will be a positive integer, N, which is the number of substrings to print. The third and fourth will be names of files containing text. Build one trie for each file. Do not modify the characters you read in any way. Do not remove white space or convert characters to upper or lower case. Then, for the N most frequent strings in the first trie at depth D, print the string, the number of times it occurs in the first text file, and the number of times it occurs in the second text file. This latter number must be obtained by searching the second trie.

Suppose your program is run with this command line:

  ant -Dargs="3 2 file1.txt file2.txt" run
That is, D = 3 and N = 2. Suppose the 2 most frequent 3-letter sequences in file1.txt are "and" and "the", and that they occurred 57 and 45 times, respectively. Forthermore, suppose those sequences occurred in file2.txt 100 and 94 times, respectively. Your output might look like this:
Sequence     # in file1.txt     # in file2.txt
--------     --------------     --------------
and          57                 100
the          45                 94

The list of strings should be sorted in descending order according to their frequencies in the first text file. If two strings to be printed have the same frequency, then print them in any order. If there are strings that have the same frequency as the Nth (i.e., the lst) string in the list, print all of them along with their frequencies in both files. In the example above, if the string "was" occurred 45 times in file1.txt, but the next most frequent string occurred 39 times, the output should look like this (assuming "was" occurred 52 times in file2.txt:

Sequence     # in file1.txt     # in file2.txt
--------     --------------     --------------
and          57                 100
the          45                 94
was          45                 52


Implementation

You have lots of lattitude in how you implement this project. A few things are required, which are listed below. After that are some hints and suggestions that you are free to use or ignore.

Requirements

Hints


Submission

You must use CVS to check out your module in the Proj2 repository and to check in your code. That must include an ant build.xml file and your java source code. The grading scripts will issue commands like the following, so be sure that your build.xml supports them:
  ant
  ant doc
  ant -Dargs="3 2 file1.txt file2.txt" run

See the projects page for more information on all of these topics.

If you don't submit a project correctly, you will not get credit for it. Why throw away all that hard work you did to write the project? Check your submittals. Make sure they work. Do this before the due date.


Project grading is described in the Project Policy handout.
Cheating in any form will not be tolerated. Please re-read the Project Policy handout for further details on honesty in doing projects for this course.

Remember, the due date is firm. Submittals made after midnight of the due date will not be accepted. Do not submit any files after that time.