UMBC CMSC201, Computer Science I, Spring '98

Project 3: Document Statistics

Due date: Tuesday, April 14, 1998

Statistics about documents are used for many purposes by computer scientists. A tally of specific word occurrences within documents can be useful for determining the similarity of documents. This is helpful for document retrieval and is used by modern search engines for the Internet. Analysis of the words used within a document can even help determine authorship. The number of occurrences of each letter can also be useful, and has been well studied in the area of cryptology. The percentages of individual letters that occur within a document are language specific, which can help determine the language of an encoded message.

This project will give you practice with strings, file handling, malloc, arrays, pointers and sorting.

Description of the Program

There will be several text files made available to you for this project. They are named sample.txt, English.txt, French.txt, Spanish.txt and German.txt. I suggest starting with sample.txt, since I have provided sample output from my program that is an analysis of sample.txt. Instructions are given below for copying it into your account.

I have written docstat.c for you. You will need to copy it into your account. Instructions are given below. You must use it without modification. You must write charfreq.h and charfreq.c. Here is docstat.c:

/****************************************************************************\
* Filename:      docstat.c                                                   *
* Author:        Sue Bogar                                                   *
* Date written:  7/24/95                                                     *
* Modified:      3/24/98 For 201S98 Project 3                                *
* Description:   This program reads in a string, determines its length and   *
*                prints the string, its length, the number of whitespace     *
*                characters, punctuation marks and digits found in the       *
*                string, and the number of words in the string. The number   *
*                of occurrences of each letter is found, using an array of   *
*                counters and a report is generated that shows the letters   *
*                and the number of times they occurred in descending order   *
*                by occurrences.                                             *
*                                                                            *
*                This program is to be run separately against 4 text files,  *
*                having the same content, but written in different languages.*
*                The user can inspect the output produced from each of the   *
*                four runs to see differences in character occurrences from  *
*                each of the four languages.                                 *
\****************************************************************************/

#include <stdlib.h>
#include "charfreq.h"

#define SIZE 26

main ()
{
   char *string, letters [SIZE];
   int i, length, freq [SIZE];
   int word = 0, space = 0, punct = 0, digit = 0;

   string = ReadString (&length);
   CountCharTypes (string, length, &space, &punct, &digit);
   word = WordCount (string);
   PrintCharReport (string, length, space, punct, digit, word);

   InitArrays (letters, freq, SIZE);
   CountLetters (string, length, freq, SIZE);
   free(string);
   SortByFrequency (letters, freq, SIZE);
   PrintFreqReport (letters, freq, SIZE);
}

charfreq.c should contain the 8 functions that are called from main() and may contain other functions as well.

Notice that I am using two arrays. The array called letters should hold the letters a, b, c, etc. The array called freq will hold zeros at first, but will eventually hold the number of times the associated character occurred in the string.
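To make the idea of the two parallel arrays concrete, here is a minimal sketch of how they might be set up. The parameter names and the void return type are assumptions inferred from the call InitArrays (letters, freq, SIZE) in docstat.c, and the loop assumes the letters 'a' through 'z' have consecutive character codes, as they do in ASCII.

```c
/* Sketch only: the signature is inferred from the call in docstat.c. */
void InitArrays(char letters[], int freq[], int size)
{
    int i;

    for (i = 0; i < size; i++) {
        letters[i] = (char) ('a' + i);  /* assumes a-z are contiguous (ASCII) */
        freq[i] = 0;                    /* no occurrences counted yet */
    }
}
```

The position is what ties the two arrays together: freq[i] always describes letters[i], so any later operation that moves one entry must move the matching entry in the other array as well.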

In the function ReadString(), you are to read the entire contents of a file specified by the user into a string. You must malloc the space to hold this string and return the address of the string to main(). This function must also modify the variable, length, so that it will contain the number of characters read (the length of the string).
The function CountCharTypes() is to make use of isdigit() and other macros in ctype.h to count the number of digits, white space characters, and punctuation marks found in the string.
The function WordCount() is to return the number of words in the string.
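
One possible shape for these three pieces is sketched below. It is not the required solution. ReadFileIntoString() is a hypothetical helper that ReadString() could call after prompting for the file name; it counts the characters in a first pass so that a single malloc can be sized exactly. The other two signatures are inferred from the calls in docstat.c, and a "word" is assumed to be any run of characters that are not whitespace, which is consistent with the sample file's remark that integers count as words.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'ed, '\0'-terminated copy of the
   file's contents and stores the character count in *length.
   Returns NULL if the file cannot be opened or memory runs out. */
char *ReadFileIntoString(const char *filename, int *length)
{
    FILE *fp = fopen(filename, "r");
    char *buffer;
    int count = 0, i, ch;

    if (fp == NULL) {
        return NULL;
    }
    while (getc(fp) != EOF) {           /* pass 1: count the characters */
        count++;
    }
    rewind(fp);
    buffer = malloc((size_t) count + 1);
    if (buffer == NULL) {
        fclose(fp);
        return NULL;
    }
    for (i = 0; i < count && (ch = getc(fp)) != EOF; i++) {
        buffer[i] = (char) ch;          /* pass 2: copy them */
    }
    buffer[i] = '\0';
    fclose(fp);
    *length = i;
    return buffer;
}

/* Adds to the three counters, which docstat.c initializes to zero. */
void CountCharTypes(char *string, int length,
                    int *space, int *punct, int *digit)
{
    int i;

    for (i = 0; i < length; i++) {
        unsigned char ch = (unsigned char) string[i];

        if (isspace(ch)) {
            (*space)++;
        } else if (ispunct(ch)) {
            (*punct)++;
        } else if (isdigit(ch)) {
            (*digit)++;
        }
    }
}

/* Counts runs of non-whitespace characters (assumed word definition). */
int WordCount(char *string)
{
    int count = 0, inWord = 0;

    for (; *string != '\0'; string++) {
        if (isspace((unsigned char) *string)) {
            inWord = 0;
        } else if (!inWord) {
            inWord = 1;                 /* first character of a new word */
            count++;
        }
    }
    return count;
}
```

The casts to unsigned char matter because the ctype.h macros are only defined for values in the range of unsigned char (and EOF), and the French, Spanish and German files may contain characters outside plain ASCII.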

The function PrintCharReport() should produce output similar to the following example:

The string is :
This is just a little sample file.  I need to find out
whether my code will handle newlines and multiple sentences
properly.  There are 4 sentences and 38 words in this file.
The integers count as words too.

It consists of 208 characters in all.
There are 40 space(s), 4 punctuation mark(s), 3 digit(s), 
and 38 word(s).
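
A report in this layout could be produced with a few printf-style calls, as in the rough sketch below. WriteCharReport() is a hypothetical helper, introduced only so that the same code can write to any stream; the PrintCharReport() signature is inferred from the call in docstat.c. The sketch assumes the string already ends with a newline, so that one extra newline yields the blank line seen in the example.

```c
#include <stdio.h>

/* Hypothetical helper: writes the character report to the given stream. */
void WriteCharReport(FILE *out, char *string, int length,
                     int space, int punct, int digit, int word)
{
    /* Assumes the string ends with '\n'; the extra '\n' gives a blank line. */
    fprintf(out, "The string is :\n%s\n", string);
    fprintf(out, "It consists of %d characters in all.\n", length);
    fprintf(out, "There are %d space(s), %d punctuation mark(s), %d digit(s),\n",
            space, punct, digit);
    fprintf(out, "and %d word(s).\n", word);
}

/* Signature inferred from the call in docstat.c. */
void PrintCharReport(char *string, int length,
                     int space, int punct, int digit, int word)
{
    WriteCharReport(stdout, string, length, space, punct, digit, word);
}
```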

InitArrays() should initialize the array, letters, to hold the characters 'a' through 'z' and the array, freq, to all zeros.
After CountLetters() has executed, letters[0] should hold the character 'a', and freq[0] should hold the number of times the character 'a' or 'A' occurred in the string.
Next we want to sort these two arrays, so that if 'e' or 'E' was the most frequently occurring letter, then 'e' should be in letters[0], and the number of times it occurred should be in freq[0]. The alphabetic characters should be in the array letters[] in descending order by their frequency. You should write SortByFrequency() by modifying one of the sorting functions that you wrote for project 2. Please use the more efficient one of the two sorting algorithms.
PrintFreqReport() simply prints out the contents of the two sorted arrays.
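
The counting, sorting and printing steps might look something like the sketch below. The signatures are inferred from the calls in docstat.c. The sort shown is an insertion sort, chosen purely for illustration: the assignment asks you to adapt the more efficient of your own project 2 sorts, and this sketch does not know which algorithms those were. The %4d field width is a guess based on the sample output, and the two header lines of the report are omitted.

```c
#include <ctype.h>
#include <stdio.h>

/* Adds one to freq[0] for each 'a' or 'A', freq[1] for 'b' or 'B', etc. */
void CountLetters(char *string, int length, int freq[], int size)
{
    int i;

    for (i = 0; i < length; i++) {
        unsigned char ch = (unsigned char) string[i];

        if (isalpha(ch)) {
            int index = tolower(ch) - 'a';   /* assumes contiguous a-z */

            if (index >= 0 && index < size) {
                freq[index]++;
            }
        }
    }
}

/* Illustrative insertion sort into descending order of frequency.
   Every move is applied to both arrays so the pairs stay together. */
void SortByFrequency(char letters[], int freq[], int size)
{
    int i, j;

    for (i = 1; i < size; i++) {
        char letter = letters[i];
        int count = freq[i];

        for (j = i - 1; j >= 0 && freq[j] < count; j--) {
            letters[j + 1] = letters[j];
            freq[j + 1] = freq[j];
        }
        letters[j + 1] = letter;
        freq[j + 1] = count;
    }
}

/* Prints one line per letter; the format is inferred from the sample. */
void PrintFreqReport(char letters[], int freq[], int size)
{
    int i;

    for (i = 0; i < size; i++) {
        printf("%c occurred %4d times\n", letters[i], freq[i]);
    }
}
```

Letters with equal counts may come out in a different order than in the sample output, depending on which sorting algorithm is used; this insertion sort leaves ties in alphabetical order.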

The final output for the whole program should look similar to this:

retriever[102] a.out
Enter the name of the text file to be examined: sample.txt

The string is :
This is just a little sample file.  I need to find out
whether my code will handle newlines and multiple sentences
properly.  There are 4 sentences and 38 words in this file.
The integers count as words too.

It consists of 208 characters in all.
There are 40 space(s), 4 punctuation mark(s), 3 digit(s), 
and 38 word(s).

The letters in descending order by their occurrences 
are shown below:

e occurred   26 times
t occurred   16 times
n occurred   14 times
s occurred   14 times
i occurred   13 times
l occurred   12 times
o occurred    9 times
r occurred    8 times
d occurred    8 times
h occurred    7 times
a occurred    7 times
w occurred    5 times
c occurred    4 times
p occurred    4 times
u occurred    4 times
m occurred    3 times
f occurred    3 times
y occurred    2 times
g occurred    1 times
j occurred    1 times
k occurred    0 times
v occurred    0 times
q occurred    0 times
x occurred    0 times
b occurred    0 times
z occurred    0 times

retriever[103]

More details

You will be graded on your design and on the efficiency of the program, as well as the correctness of the program, documentation and style.

Copying the files

The documents to use for this project are called sample.txt, English.txt, French.txt, Spanish.txt and German.txt. These files along with the source file, docstat.c, are found in my 201 directory. You should copy these files into your own directory. The executable and the data files need to be in the same directory. Here's how to copy the files:
Change to the directory where you will write your code and keep the executable, then type the following commands at the Unix prompt.

     cp ~sbogar1/201/sample.txt .
     cp ~sbogar1/201/docstat.c .

After you have your project running properly on the sample.txt file, check out the other test files and compare the occurrences of letters in different languages.

What to Turn In

You must use separate compilation for this project. You must be able to compile the docstat.c file that I have provided for you with a file called charfreq.c that you have written. You must also write charfreq.h. The files charfreq.c and charfreq.h contain the functions related to the frequency of characters and the prototypes for those functions, respectively. You may, of course, have other .c and .h files, as you see fit.
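
As a rough illustration of what the header might contain, the sketch below lists one prototype for each function that docstat.c calls. The parameter types are inferred from the arguments passed in main(); the void return types are assumptions, made because main() ignores any value those functions return. The guard name CHARFREQ_H is likewise just a conventional choice.

```c
/* charfreq.h (sketch): prototypes inferred from the calls in docstat.c */
#ifndef CHARFREQ_H
#define CHARFREQ_H

char *ReadString(int *length);
void CountCharTypes(char *string, int length,
                    int *space, int *punct, int *digit);
int WordCount(char *string);
void PrintCharReport(char *string, int length,
                     int space, int punct, int digit, int word);
void InitArrays(char letters[], int freq[], int size);
void CountLetters(char *string, int length, int freq[], int size);
void SortByFrequency(char letters[], int freq[], int size);
void PrintFreqReport(char letters[], int freq[], int size);

#endif
```

With separate compilation, docstat.c and charfreq.c are each compiled to an object file and the two object files are then linked into one executable. Both source files include charfreq.h, so the compiler can check every call against the same prototypes.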

Submit as follows:

submit cs201 proj3 charfreq.c charfreq.h

Please note: You do not have to submit docstat.c, because your charfreq.c and charfreq.h files must compile with my docstat.c.