CMSC 313 Project 2

Tag Cloud Revisited

Assigned	Monday Feb 22, 2010
Program Due	11:59pm Tuesday Mar 9, 2010
Points	65
Updates

The Objective

The objective of this assignment is to become familiar using pointers and dynamic memory in a C application. By comparing their solution of this project with their solution to project 1, students will see and experience the close relationship between arrays and pointers in C. Students will also gain experience with command line arguments.

The Task

A "tag cloud" is a visual representation of the words which occur most frequently in a document. High frequency words are displayed with larger font sizes and brighter colors. Lower frequency words are displayed with smaller font sizes. The websites www.wordle.net and www.tagcrowd.com generates tag clounds from text that you provide. The website http://chir.ag/phernalia/preztags shows a chronological tag cloud timeline for US Presidential speeches. You assignment is to write a program that tokenizes a text file (reads one "word" at a time) and generates the information required to create a tag cloud.

This project has the same basic functionality as project 1.

How does your program work?

Your program accepts one command line parameter which is the name of the text file to process.
Your program reads the text file one "word" at a time and counts the frequency of each word. You should ignore words that are less than 4 characters long and words that contain non-alphabetic characters.
Find the 50 words with the highest frequency counts. If there are fewer than 50 different words, use them all.
Print these 50 words in alphabetical order along with their frequency count. Print each word and its frequency count on a separate line.

Hints, Notes and Requirements

Requirements designated as (R

) are new for project 2.
Requirements designated as (R) are repeated from project 1.

(N) A "word" is any sequence of non-whitespace characters, e.g. "Bob", "WAIT!!", "!??!". (the more technical term for a sequence of non-whitespace characters is "token").
(N) You may assume that there are no more than 500 UNIQUE words in the file.
(N) You may assume that no word is more than 30 characters long.
(N) You may assume that the name of the input text file is no more than 100 characters.
(N) You can read one word at a time using %s with fscanf. This technique will skip whitespace, but will not separate punctuation from adjoining lettters, therefore "Bobby!" will be read as a word, but will not be counted because of the '!' character. There are obviously better ways to deal with punctuation, but we're trying to keep this relatively easy.

(R) If the text file specified by the user cannot be opened for reading, an appropriate error message should be printed and your program should terminate.
(R) Distinctions in case should be ignored when counting occurrences of words, e.g. "Hello", "hello", and "HELLO" are all the same word.
(R) Words should be output in lower-case.
(R) This is an individual project. Do your own work.

) Dynamic memory allocation is required for all structs and arrays

All arrays of primitive types must be dynamically allocated.

This is ok: int *pScores = (int *)calloc( 42, sizeof(int) );
This is not ok: int scores[42];

All char arrays must be the minimum size necessary for the string that it contains.

All structs must be dynamically allocated.

This is ok: struct bob *pBob = (struct bob *)malloc( sizeof(struct bob) ):
This is not ok: stuct bob aBob;

Arrays of structs are not allowed. You must use dynamically allocated arrays of pointers to structs instead and then dynamicall allocate each struct as noted above.

This is ok : Code that dynamically allocates an array of 55 pointers to struct bobs
This is not ok: struct bob arrayOfBob[ 55 ];
This is not correct: struct bob *arrayOfBob = (struct bob *)malloc( 55 * sizeof(struct bob));

(R) All dynamically allocated memory must be free'd so that no memory leaks are created.
Use valgrind (see next item) to confirm that your code is correct.

) Your program must run cleanly under the Unix valgrind utility. Execute your program under control of valgrind with the command

 linxu2[2]% valgrind --leak-check=full project2 textfile

The result should look something like this

==3290== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 15 from 1)
==3290== malloc/free: in use at exit: 0 bytes in 0 blocks.
==3290== malloc/free: 355 allocs, 355 frees, 5,329 bytes allocated.
==3290== For counts of detected errors, rerun with: -v
==3290== All heap blocks were freed -- no leaks are possible.

See the course resources page for a link to a valgrind tutorial and / or see the on-line Unix manual for help with valgrind.

(R) Your code must be separated into multiple .c files as follows. Appropriate .h files are also required. You are free to choose any names for your files.
1. main( ) should be in its own .c file.
2. Helper functions specific to this project should be grouped into another .c file.
3. Helper functions that are generic enough to be used in a different project should be in yet another .c file.
(R) You must submit a makefile that creates an executable named project2 just by typing make at the Unix promt. A skeleton makefile is available in Mr. Frey's public directory /afs/umbc.edu/users/f/r/frey/pub/313/proj2/
(H) Use a struct to associate a word and its frequency.
(H) calloc( ) initializes dynamically allocated memory to zero, malloc( ) does not.
(H) C does NOT have a library function to convert a word to lower-case, but it does have a library function to convert a character to lower case.
(H) The best approach to this assignment requires sorting the words and their associated counts. You are free to use any sorting algorithm you choose. Bubblesort, insertion sort and selection sort are among the simplest, but also the slowest. Check the CMSC 201 website or search the internet for these algorithms. Mergesort and quicksort are much faster but are more complex and use recursion. Efficiency is not a grading criteria for this project

Project Grading

The expected point breakdown for this project will be something like this. Since the functionality of the program is the same as project 1, the focus of this project's grading is on coding.

Functionality
Note that your functionality score will be zero if your code does compile to create an executable.
1. Basic cases (5 points) - This might be a text file with a few unique words or a text file containing a few words repeated a few times.
2. More complex cases (10 points) - This would test word discarding rules, sorting, etc. using more complex and longer text files.
3. Stress Cases (10 points) - Large files that include all situations such as mixed case, discarded words, etc.
Code

Design (10 points) - we expect that your code is separated into multiple .c and .h files as required.
Makefile (5 points) - we expect to be able to type "make" at the Unix prompt and have an executable named project2 be created.
Coding requirements (20 points) - we expect that all you will adhere to all coding requirements listed above, especially those related to dynamic memory allocation
valgrind (5 points) - we expect that your project runs under valgrindwithout warnings or errors as described above.

Extra Credit (5 points)

The most common multi-purpose sorting algorithm used in programming applications is known as "quicksort". Quicksort is generally the fastest sorting algorithm for most applications. It uses a technique known as "partitioning". The quicksort algorithm is available in the standard C library under the name qsort( ). qsort( ) is discussed briefly in K & R (section 5.1 and appendix B5) and in the Unix on-line manual (man qsort). A quick look on the internet using Google will bring up many pages devoted to quicksort. qsort can be used to sort an array of any type so its prototype needs some explanation.

The prototype for qsort( ) in the C library (in <stdlib.h>) is


	void qsort( void *base, size_t n, size_t size, int (*cmp)(const void *p1, const void *p2));

where
	void *base is a pointer to (i.e. the name of) the array of elements to be sorted
	size_t n is the number of elements in the array
	size_t size is the size (in bytes) of each element in the array
	int (*cmp)(const void *p1, const void *p2) is a pointer to (i.e. the name of)
    	a user-defined function whose parameters are pointers to elements in the array to be compared.
		To sort the array in ascending order, the function must return
    		< 0 if the element pointed to by p1 is "less than" the element pointed to by p2
        	> 0 if the element pointed to by p1 is "greater than" the element pointed to by p2
        	0 if the elements pointed to by p1 and p2 are "equal"

For 5 points of extra credit, have your code call qsort( ) from the C library to do your sorting. You will of course have to write the necessary comparison functions for qsort to use. Note that qsort calls your function with pointers to the elements (p1 and p2) to compare. If your elements are themselves pointers, then the parameters passed to your comparision functions are really pointers to pointers to elements to be compared. Fun stuff !!

Submitting the Program

You can submit your project using the submit command.

submit cs313 Proj2 <list of files>

You may submit your project as often as you wish -- only the last submital will be graded. See this page for a description of other project submission related commands. To verify that your project was submitted, you can execute the following command at the Unix prompt. It will show all files that you submitted in a format similar to the Unix 'ls' command.

submitls cs313 Proj2