UMBC CMSC 491/691-I Fall 2002
Last updated: 24 September 2002

Homework 3

(Solutions)

Assignment: Write a program (possibly based on your solution to Homework 2) that computes the size of the uncompressed and compressed indices for the umbc-crawl collection.

Goal: To understand the mechanics and implications of index compression for Phase I of the project.

Due Date: Tuesday, October 1, 2002.

Description

First, adapt your Homework 2 solution to compute the size of an uncompressed index of the umbc-crawl collection. Use the same space assumptions that you made in Homework 2. This will require re-examining many of the assumptions that Reuters-21578 allowed you to make about the quality and layout of the collection:

Because of these differences, your program will need to report the number of documents and unique words.
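
As a rough illustration of the counts to report, here is a minimal sketch. The tokenizer (lowercased runs of letters and digits), the function names, and the `bytes_per_posting` parameter are all assumptions made for this example; substitute your own Homework 2 tokenization and space assumptions.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed tokenizer: lowercase runs of ASCII letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")


def collection_stats(texts: Iterable[str]) -> Tuple[int, int, int]:
    """Return (number of documents, number of unique words, number of postings).

    A posting here is one (word, document) pair, so each word is counted
    at most once per document.
    """
    doc_freq: Counter = Counter()
    num_docs = 0
    for text in texts:
        num_docs += 1
        doc_freq.update(set(TOKEN.findall(text.lower())))
    return num_docs, len(doc_freq), sum(doc_freq.values())


def uncompressed_inverted_file_bytes(num_postings: int, bytes_per_posting: int) -> int:
    """Size of a fixed-width inverted file; bytes_per_posting is your own assumption."""
    return num_postings * bytes_per_posting
```

For the two toy documents "the cat sat" and "The cat, the hat", this reports 2 documents, 4 unique words, and 6 postings.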

There are several files at the top of /data/nicholas2/ian/umbc-crawl which you may find helpful:

After you have done the above, modify your program to compute the size of the index as if it were compressed. Do this for two index compression schemes:

  1. D-gaps: delta code; counts: gamma code; (offsets: delta code)
  2. D-gaps, counts, (and offsets): variable-byte encoding.
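
For sizing purposes only the length of each code matters, not the bits themselves. Below is a minimal sketch of the per-value code lengths. It assumes that "gamma" and "delta" mean the Elias gamma and delta codes, and that variable-byte encoding stores 7 payload bits per byte with the eighth bit used as a continuation flag. The function names are hypothetical.

```python
def gamma_bits(x: int) -> int:
    """Length in bits of the Elias gamma code for an integer x >= 1."""
    if x < 1:
        raise ValueError("gamma codes are defined only for positive integers")
    n = x.bit_length() - 1  # floor(log2 x)
    # n bits of unary length prefix, 1 separator bit, n low-order bits of x.
    return 2 * n + 1


def delta_bits(x: int) -> int:
    """Length in bits of the Elias delta code for an integer x >= 1."""
    if x < 1:
        raise ValueError("delta codes are defined only for positive integers")
    n = x.bit_length() - 1  # floor(log2 x)
    # The value n + 1 is gamma-coded, followed by the n low-order bits of x.
    return gamma_bits(n + 1) + n


def vbyte_bits(x: int) -> int:
    """Length in bits of a variable-byte code (7 payload bits per byte) for x >= 0."""
    if x < 0:
        raise ValueError("variable-byte codes are defined only for non-negative integers")
    num_bytes = max(1, (x.bit_length() + 6) // 7)  # ceil(bit_length / 7), at least 1 byte
    return 8 * num_bytes
```

For example, a d-gap of 9 costs 7 bits under gamma, 8 bits under delta, and 8 bits (one byte) under variable-byte encoding. Gamma and delta codes cannot represent 0, so gaps and counts must be at least 1.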

Counting offsets is optional, but recommended if you're planning on storing word offsets in your project!

Note that the size of the lexicon will not change, only the size of the inverted file.
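
Since only the inverted file changes, the compressed size can be estimated by converting each postings list to d-gaps and summing a per-value cost. The sketch below assumes that document numbers start at 1 and are sorted within each list, and that the index is held as a dictionary from word to a list of (document number, count) pairs. These names and this layout are illustrative, not a required design. The cost functions are passed in, so the same loop serves either compression scheme, and it ignores any padding of lists to byte boundaries.

```python
from typing import Callable, Dict, List, Tuple

Posting = Tuple[int, int]  # (document number, within-document count)


def d_gaps(doc_ids: List[int]) -> List[int]:
    """Convert strictly increasing document numbers (starting at 1) to gaps."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        if doc_id <= previous:
            raise ValueError("document numbers must be positive and strictly increasing")
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def inverted_file_bits(
    index: Dict[str, List[Posting]],
    gap_cost: Callable[[int], int],
    count_cost: Callable[[int], int],
) -> int:
    """Total bits needed for all d-gaps and counts under the given cost functions."""
    total = 0
    for postings in index.values():
        gaps = d_gaps([doc_id for doc_id, _ in postings])
        total += sum(gap_cost(g) for g in gaps)
        total += sum(count_cost(count) for _, count in postings)
    return total
```

Passing a constant cost such as `lambda _: 32` for both arguments gives a fixed-width baseline to compare the compressed figures against.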

What to turn in

You will turn in a HARD-COPY listing of your program(s) and your output giving the estimate of the space needed for these umbc-crawl indices. Don't forget to report the number of unique words and documents, since these may vary from person to person depending on which files you index. Please make sure your name is on every page and that everything is stapled securely together.

Homework is due at the beginning of class. No late homework will be accepted.