UMBC CMSC 491/691-I Fall 2002 Home  |  News  |  Syllabus  |  Project   ]
Last updated: 19 September 2002

Homework 2

(Solutions)

Assignment: Write a program that measures the rate at which a vocabulary grows as documents are read from a collection. Write a second program that estimates storage requirements for an inverted index of the collection.

Goal: To further explore corpus statistics and their implications for Phase I of the project.

Due Date: Tuesday, September 17, 2002.

Description

This assignment has two parts. You can accomplish both using the program(s) you wrote for HW1 as a starting point.

Part 1: Vocabulary Growth

First, you will measure how vocabulary grows as documents are read from the Reuters-21578 collection. To do this, make your program output the number of bytes read and the number of unique words found so far after each document is read. For example:

          $ ./growth.pl
          1       3054    260        (Doc, #bytes so far, #words so far)
          2       3684    305
          3       4207    335
          4       7288    520
          5       8661    584
          6       10220   651
          7       10850   681
          8       11607   717
	  9       12151   739
	  10      13644   808
          ...
Use this output to make two graphs showing vocabulary growth as a function of the number of bytes and documents read.

Part 2: Index size

Second, you will estimate the size of an in-memory index generated for the Reuters-21578 collection. Imagine that the lexicon is represented by an array of "lexicon-entry" structures, each pointing to a linked list of postings. Assume the following parameters:

For the moment, we will assume the lexicon array has no wasted space (as it would were it a hash table), and ignore storage for the mapping of document identifiers to document locations.

Write a program to count unique words in the collection, as well as the number of documents each word occurs in. These parameters will allow you to estimate the in-memory indexing requirements under the assumptions above. Does linuxserver1 (or your home PC) have enough memory to index Reuters-21578? What about a larger collection?

What to turn in

You will turn in a listing of your program(s), your graphs, and your estimate of the memory needed to index Reuters-21578. Everything should be handed in as hard-copy. Please make sure your name is on every page and that everything is stapled securely together.

Homework is due at the beginning of class. No late homework will be accepted.

Feel free to write one large program, several small programs, and to extend the programs in your previous homework.