UMBC CMSC 491/691-I Fall 2002
Last updated: 2 October 2002

Homework 3 Solutions

Your results will likely differ somewhat from these, depending on which documents in the crawl you attempt to index, how you select terms from those documents, and how you process those terms. umbc-crawl is much messier than Reuters, so I expect to see greater variation in the choices you make.

For this assignment, I broke my HW2 solution into two parts. The first part, which I call dump-tuples.pl, performs document segmentation, parsing, and feature processing. It outputs a list of tuples to stdout:

	$ ~/ircourse/f02/dump-tuples.pl crawl-files 
	talks 1 1 188
	pdf 1 1 173
	presentation 1 2 101 177
	austin 1 2 28 125
	anupam 1 1 222
	survival 1 1 89
	are 1 2 71 81
	the 1 15 21 50 53 57 61 64 76 92 103 118 133 139 145 168 176
	washing 1 2 22 93
	printable 1 1 44
	conference 1 1 110
	...
Each line has a word, a document number (1 here), a count of occurrences in that document, and a list of offsets for the term within that document. Word offsets are numbered from 1. This script is very handy because (a) it separates my document processing from any later processing (counting or indexing), and (b) the output is appropriate for sort-based inversion techniques.
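
To make the format completely concrete, here is a minimal Perl sketch of how a downstream script could read one of these tuple lines. This is not code from my scripts, just a restatement of the format above:

    while (my $line = <STDIN>) {
        chomp $line;
        # word, document number, in-document count, then one offset per occurrence
        my ($word, $doc, $count, @offsets) = split ' ', $line;
        warn "malformed tuple: $line\n" unless $count == @offsets;
        print "term=$word doc=$doc tf=$count first-offset=$offsets[0]\n";
    }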

dump-tuples.pl reads a list of files to index and treats each one as a separate document. It checks the type of each file using the UNIX file(1) command: if the file is binary data, it is skipped; if it is HTML, it is preprocessed with nsgmls(1); if it is plain text, features are extracted from it directly, with no preprocessor.
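
A rough sketch of that dispatching follows. The details are assumptions on my part (in particular, the exact file(1) patterns matched and the way the nsgmls output is handled differ in my real script), but the overall shape is the same:

    chomp(my @files = <>);                     # e.g. the crawl-files list, one path per line
    for my $file (@files) {
        my $type = `file $file`;               # let UNIX file(1) guess the type
        next if $type =~ /\bdata\b/;           # skip files reported as binary data
        my $text;
        if ($type =~ /HTML/) {
            # preprocess HTML with nsgmls(1); data lines in its ESIS output start with "-"
            $text = join "", grep { s/^-// } `nsgmls $file`;
        } else {
            open my $fh, '<', $file or next;   # plain text: read it directly
            $text = do { local $/; <$fh> };
        }
        # ... tokenize $text and emit one tuple per distinct term, as above ...
    }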

Text processing and feature selection are as simplistic here as in my earlier solutions, with a little extra complexity to handle non-SGML documents. I did nothing special for this collection at all; naive, but a good first approximation.

Another script, count-tuples.pl, reads the tuples and accumulates the sizes for the uncompressed and compressed indices. This script depends only on the tuple format, so I can use it with any collection. When I have to redo this homework for the new Reuters corpus, all I have to change is dump-tuples.pl.
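
The accounting it has to do is easy to sketch. The sizes reported below come from totals like these; note that the 4-byte field width is my assumption for the uncompressed layout described in Part 1 (the assignment didn't fix a width), and the real script also runs every d-gap, count, and offset through the compression codes to get the compressed sizes:

    my (%listlen, $postings, $occurrences);
    while (<>) {                        # one tuple line per (word, document) pair
        my ($word, $doc, $count, @offsets) = split;
        $listlen{$word}++;              # one more posting on this word's inverted list
        $postings++;
        $occurrences += $count;         # one stored offset per occurrence
    }
    my $words = keys %listlen;

    # Uncompressed estimate, assuming 4-byte integers: each list is a postings
    # count followed by (doc, freq) pairs, optionally followed by the offsets.
    my $lists        = 4 * ($words + 2 * $postings);
    my $with_offsets = $lists + 4 * $occurrences;
    printf "Number of unique words: %d\n", $words;
    printf "Size of inverted lists (in bytes): %d\n", $lists;
    printf "   (with word offsets) (in bytes): %d\n", $with_offsets;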

Part 1: Uncompressed umbc-crawl index

I started from my HW2 assumptions for an uncompressed index, with two changes.

Since I am now interested in an on-disk inverted file, I didn't store a pointer for each posting. (I know this wasn't specified; keeping the pointer space increases the uncompressed inverted file size by 50%.) I am also now storing a count of the number of postings at the beginning of each on-disk inverted list.
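
Concretely, writing one such list might look like the sketch below. The pack "N" format (a 4-byte big-endian integer) is my assumption about the field width, not something the assignment specified, and OUT is assumed to be a filehandle already opened on the inverted file:

    # @list holds the postings for one term, e.g. { doc => 1, freq => 2, offsets => [22, 93] }
    print OUT pack("N", scalar @list);                  # leading count of postings
    for my $p (@list) {
        print OUT pack("N N", $p->{doc}, $p->{freq});   # the posting itself, no pointer
        print OUT pack("N*", @{ $p->{offsets} })        # word offsets, only if we keep them
            if $store_offsets;
    }

With that layout, I get the following counts: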

$ ~/ircourse/f02/dump-tuples.pl crawl-files | ~/ircourse/f02/count-tuples.pl
........................................................................................
Number of files: 44495
Number of unique words: 111034
Size of lexicon (in bytes):        1,806,256  (1,764 kb)
Uncompressed:
Size of inverted lists (in bytes): 41,165,832  (40,202 kb)
   (with word offsets) (in bytes): 89,176,408  (87,087 kb)

Notice that the cost of adding word offsets is now much smaller than my HW2 estimates suggested. This is because my HW2 word-position estimates were so pessimistic as to be nearly bogus; now that I have a script that actually gathers the counts and offsets, I can be much more exact. So, for umbc-crawl, storing word-position information more than doubles the index size. It's also nice to note that I could store an ordinary, uncompressed index of umbc-crawl in under 50MB.

Part 2: Compressed index

I required you to compute the compressed index sizes according to two schemes:

  1. D-gaps: delta; Counts: gamma; (Offsets: delta) (DelD-GamC-DelO, in Scholer's notation)
  2. D-gaps, counts, (and offsets) with variable-byte encoding. (VbyD-VbyC-VbyO)
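
For reference, here is a small sketch of the three codes; it is not my count-tuples.pl, and it only builds strings of "0"/"1" characters so that codeword lengths are easy to total. All three assume the integer being coded is at least 1, which holds for d-gaps, counts, and word offsets alike:

    sub gamma_code {                      # Elias gamma: zeros for the length, then the binary form
        my $x = shift;
        my $bin = sprintf "%b", $x;
        return ("0" x (length($bin) - 1)) . $bin;
    }

    sub delta_code {                      # Elias delta: gamma-code the length of the binary form,
        my $x = shift;                    # then the binary form with its leading 1 dropped
        my $bin = sprintf "%b", $x;
        return gamma_code(length $bin) . substr($bin, 1);
    }

    sub vbyte_code {                      # variable-byte: 7 data bits per byte, low-order bytes
        my $x = shift;                    # first, stop bit set on the last byte (conventions vary)
        my @bytes;
        while (1) {
            my $b = $x % 128;
            $x = int($x / 128);
            if ($x == 0) { push @bytes, $b + 128; last; }
            push @bytes, $b;
        }
        return join "", map { sprintf "%08b", $_ } @bytes;
    }

Note that gamma codes a d-gap of 1 in a single bit, while variable-byte always spends at least a whole byte; that alone goes a long way toward explaining why the bitwise scheme wins so clearly when word offsets are left out.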

Obviously, pointers between postings don't make much sense here. As for my inverted-list-length field (one extra number per inverted list, so not much expense), I used Delta coding for the first scheme and variable-byte for the second. Here are my results:

DelD-GamC-DelO:
Size of inverted lists (in bytes): 4,885,744  (4,772 kb)
   (with word offsets) (in bytes): 23,903,481  (23,344 kb)

VbyD-VbyC-VbyO:
Size of inverted lists (in bytes): 10,999,809  (10,743 kb)
   (with word offsets) (in bytes): 29,761,374  (29,064 kb)

If I don't store word-position information, I get better than an 8/1 compression ratio with the bitwise scheme and nearly 4/1 with the bytewise scheme. With word positions, the ratios are nearly 4/1 and 3/1. So, counting the 1.8MB lexicon, a complete umbc-crawl index can be stored in roughly 7-31MB, depending on the compression scheme you use and whether you store word positions. It's also interesting that variable-byte coding holds up much better on the word offsets than on the d-gaps and counts: the offset portion is actually slightly smaller under variable-byte (about 18.8MB) than under Delta coding (about 19.0MB).