CMSC 313 Project 2

Tag Cloud

Assigned Wednesday, Feb 15, 2012
Program Due 11:59pm Sunday Feb 26, 2012
Points 65
Updates Feb 16, 2012
Look for the sign

The Objective

The objective of this assignment is to become familiar with good design practices, writing in C in a Unix environment, and using arrays, structs, functions and strings.

The Task

A "tag cloud" is a visual representation of the words which occur most frequently in a document. High frequency words are displayed with larger font sizes and brighter colors. Lower frequency words are displayed with smaller font sizes. The websites www.wordle.net and www.tagcrowd.com generates tag clounds from text that you provide. The website http://chir.ag/phernalia/preztags shows a chronological tag cloud timeline for US Presidential speeches. You assignment is to write a program that tokenizes a text file (reads one "word" at a time) and generates the information required to create a tag cloud.

How does your program work?

  1. Your program prompts the user for the name of the text file to be processed.
  2. Your program reads the text file one "word" at a time and counts the frequency of each word. You should ignore words that are less than 4 characters long and words that contain any non-alphabetic character.
  3. Find the 20 words with the highest frequency counts. If there are fewer than 20 different words, use them all.
  4. Print these 20 words in alphabetical order along with their frequency count. Print each word and its frequency count on a separate line.
  5. If there are multiple words "tied" as the 20th most frequent word, you may arbitrarily decide which words are not printed.

Hints, Notes and Requirements

  1. (N) A "word" is any sequence of non-whitespace characters, e.g. "Bob", "WAIT!!", "!??!". (the more technical term for a sequence of non-whitespace characters is "token").
  2. (N) You may assume that there are no more than 500 UNIQUE words in the file.
  3. (N) You may assume that no word is more than 30 characters long.
  4. (N) You may assume that the name of the input text file is no more than 100 characters.
  5. (N) You can read one word at a time using %s with fscanf. This technique will skip whitespace, but will not separate punctuation from adjoining lettters, therefore "Bobby!" will be read as a word, but will not be counted because of the '!' character. There are obviously better ways to deal with punctuation, but we're trying to keep this relatively easy.

  6. (R) If the text file specified by the user cannot be opened for reading, an appropriate error message should be printed and your program should terminate.
  7. (R) Distinctions in case should be ignored when counting occurrences of words, e.g. "Hello", "hello", and "HELLO" are all the same word.
  8. (R) Words should be output in lower-case.
  9. (R) Your code must make appropriate use of functions.
  10. (R) This is an individual project. Do your own work.
  11. (R) All your code should be placed into one .c file named project2.c.

  12. (H) Use a struct to associate a word and its frequency.
  13. (H) C does NOT have a library function to convert a word to lower-case, but it does have a library function to convert a character to lower case.
  14. (H) The best approach to this assignment requires sorting the words and their associated counts. You are free to use any sorting algorithm you choose. Bubblesort, insertion sort and selection sort are among the simplest, but also the slowest. Check the CMSC 201 website or search the internet for these algorithms. Mergesort and quicksort are much faster but are more complex and use recursion. Efficiency is not a grading criteria for this project.

Project Grading

The expected point breakdown for this project will be something like this.
  1. Functionality
    Note that your functionality score will be zero if your code does compile to create an executable.
    1. Basic cases (5 points) - This might be a text file with a few unique words or a text file containing a few words repeated a few times.
    2. More complex cases (25 points) - This would test word discarding rules, sorting, etc. using more complex and longer text files.
    3. Atypical cases (5 points) - This might be an empty file, a file that cannot be opened, a file containg words which are all discarded. We will not violate our promises regarding word length and number of words.
    4. Stress Cases (15 points) - Large files that include all situations such as mixed case, discarded words, etc.
  2. Code
    1. Design (5 points) - we expect that your code shows sufficient decompostion into appropriate functions
    2. Style (10 points) - we expect that your code adheres to the course coding standards, particularly with respect to function and file comments and to naming conventions.

Extra Credit (10 points -- Don't do this just for the points!)

For fun and a little extra credit, use your results to create an HTML page to produce the visual tag cloud. Create and output a file with a .html extension, output the HTML header, then output a span element for each word using a font size proportional to its frequency. Then output the HTML footer. Load the HTML page into your browser to view the cloud. Please submit a file named README to alert the graders.

Note: if you need any of the functions from the math library, be sure to #include <math.h> and link the math library with your file by using -lm at the end of the gcc command line.

Submitting the Program

You can submit your project using the submit command.

submit cs313 Proj2 project2.c

See this page for a description of other project submission related commands. To verify that your project was submitted, you can execute the following command at the Unix prompt. It will show all files that you submitted in a format similar to the Unix 'ls' command.

submitls cs313 Proj2