Homework 1

(Solutions)

Assignment: To write a program that performs simple lexical analysis of its input. To use that output to compile some summary statistics of a small text collection.

Goal: To get some basic exposure to the text operations needed for Phase I of the project.

Due Date: Tuesday, September 10, 2002.

Description

You will write a program that reads a document collection, tokenizes the text into words, and generates some basic statistics about the collection. In particular, your program will output the following:

The number of unique words in the collection
The number of documents in the collection
The document ID of the longest and shortest documents, along with the number of words they contain.
A list of the 10 most frequent words
A list of the 10 least frequent words occurring at least 10 times.
The number of words occurring only once in the collection

Additionally, you will create a graph showing the word frequencies in descending order. Put rank (nth most frequent) on the X axis, and number of occurrences on the Y. Make both axes logarithmic.

As we discussed in class, there are a number of ways to extract index terms from the document text. Accordingly, in order to make sense of the statistics above, you must say what you indexed and how. In particular:

What defines a word?
Which sections of the document (see below) did you index? Did you index the SGML tags?
How did you treat numbers and punctuation?

This assignment will use the Reuters-21578 collection (sometimes just called "Reuters" or "Reuters '87"). This collection is a standard tool for measuring text classification systems: programs that learn to classify documents into a fixed set of categories. The documents in the collection are news stories from the Reuters newswire from 1987. This collection is about 25MB in size, making it easy to work with on modern hardware. You'll find the collection on the CS systems at:

/data/nicholas2/ian/Reuters-21578

In this directory you will find a number of files. Please read the README.txt file; it contains vital information for understanding the file formats and contents. The documents themselves are contained in the "reut2-xxx.sgm" files.

The collection is marked up using an SGML format which is described more fully in the README.txt file. Briefly,

Each document starts with a <REUTERS> tag and ends with a </REUTERS> tag.
The REUTERS tag has several fields, of which NEWID gives a unique, chronologically-ordered document ID.
TOPICS, PLACES, PEOPLE, ORGS, EXCHANGES, and COMPANIES tags contain category assignments.
The textual content of a story is inside the TEXT tags. Inside the text you might find AUTHOR, DATELINE, TITLE, and BODY tags.

What to turn in

You will turn in a listing of your program, the answers to the nine questions listed above, and your graph. If you write more than one program, hand in them all. If you use a shell pipeline, put it in a shell script with some comments. Please make sure your name is on every page and that everything is stapled securely together.

Hints

Read some of the articles to get a feel for the collection.

Examine the terms you are finding closely. You will likely find some surprising exceptions to your rules!

What you write for this homework will be very useful as you begin the main semester project. Think about making your code reusable, documented, and modular. A filter-style of programming (where a program takes plain text on the standard input and produces (modified) text on the standard output) is useful here.

You can do a lot of this assignment with the UNIX programs we discussed in lecture 1. If you do, document your effort! I can't count the number of UNIX pipelines I've dreamt up, lost, and then had to recreate later on.