UMBC CMSC 491/691-I Fall 2002
Last updated: 17 September 2002

Homework 2 Solutions

As with HW1, the answers here depend to some extent on your particular approach to document segmentation, word segmentation, term selection, and processing; overall, though, I think you'll find that your results are very similar to mine. My solutions started with the Perl script from HW1, question 3. To this I added a filter to remove numeric tokens, and then the code to count what I needed.
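For the numeric filter, something along these lines works. This is a minimal sketch (the script name is made up): it reads one token per line and drops purely numeric tokens.

#!/usr/bin/perl -w
# filter-numeric.pl (sketch): drop tokens that are purely numeric,
# i.e. digits possibly mixed with signs, commas, or periods.
use strict;

while (my $token = <STDIN>) {
    chomp $token;
    next if $token =~ /^[-+]?[\d.,]*\d[\d.,]*$/;   # skip numeric tokens
    print "$token\n";
}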

Part 1: Vocabulary Growth

I ended up using two separate scripts, one to count bytes and one to count words, since I wanted to count two kinds of bytes: the raw bytes between <REUTERS>...</REUTERS> tags, and the bytes in the selected terms themselves. Here is a sample of the cumulative counts I got; a sketch of the counting loop follows the sample:

$ head bwcount2 
docid   bytes   wbytes  words
1       3375    2613    205
2       4319    3139    244
3       5148    3550    268
4       8560    6393    444
5       10272   7126    487
6       12206   8075    527
7       13138   8583    551
8       14206   9226    585
9       15067   9663    604
$ tail bwcount2 
21569   27358156        17209775        45661
21570   27359258        17210384        45661
21571   27360993        17211617        45663
21572   27365917        17215869        45667
21573   27369178        17218451        45670
21574   27370209        17219033        45671
21575   27372835        17221063        45671
21576   27373826        17221628        45672
21577   27375531        17222862        45673
21578   27377038        17223862        45679
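The counting loop itself was straightforward. What follows is a sketch rather than my exact script: the tokenize() routine is a stand-in for whatever your HW1 tokenizer does, and the raw-byte count is approximate (it counts each <REUTERS>...</REUTERS> chunk whole).

#!/usr/bin/perl -w
# Sketch of the cumulative byte/word counting loop (not the exact script).
use strict;

# Stand-in for the HW1 tokenizer: crude tag removal, then lowercase
# alphabetic tokens only. Substitute your own routine here.
sub tokenize {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return grep { /[a-z]/ } split /[^a-z]+/, lc $text;
}

my ($docid, $bytes, $wbytes) = (0, 0, 0);
my %vocab;                            # unique words seen so far

print "docid\tbytes\twbytes\twords\n";
$/ = "</REUTERS>";                    # read one document per loop iteration
while (my $doc = <>) {
    next unless $doc =~ /<REUTERS/;
    $docid++;
    $bytes += length($doc);           # raw bytes for this document's chunk
    foreach my $w (tokenize($doc)) {
        $wbytes += length($w);        # bytes in the selected terms themselves
        $vocab{$w} = 1;
    }
    printf "%d\t%d\t%d\t%d\n", $docid, $bytes, $wbytes, scalar(keys %vocab);
}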
And my graph:

[Graph: cumulative collection size in bytes (raw and selected-word) and number of unique words, by document.]

The key thing to notice is that while the collection size in bytes grows linearly, no matter which bytes you count (raw or selected-word), the number of unique words grows at a much lower rate: in fact O(n^k) with 0 < k < 1, the sublinear growth described by Heaps' law.
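As a rough sanity check on the exponent, a two-point fit using the first and last rows of bwcount2 above (205 unique words at 3,375 raw bytes; 45,679 unique words at 27,377,038 raw bytes) gives

    k = ln(45679/205) / ln(27377038/3375) ≈ 5.41 / 9.00 ≈ 0.60

which sits comfortably inside (0, 1). This is just a crude estimate, not a proper regression over all the points.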

Part 2: Index size

Under the assumptions given, I estimate the following index space usage for a single "bundle" (reut2-000.sgm) and for the whole collection:

$ ./index-size.pl ~/work/corpora/Reuters-21578/reut2-000.sgm
Number of unique words: 10101
Size of lexicon (in bytes):        161,086  (158 kb)
Size of inverted lists (in bytes): 989,328  (967 kb)
   (with word offsets) (in bytes): 63,088,328  (61,610 kb)

$ ./index-size.pl ~/work/corpora/Reuters-21578/reut2-*.sgm
Number of unique words: 45679
Size of lexicon (in bytes):        733,876  (717 kb)
Size of inverted lists (in bytes): 20,626,524  (20,144 kb)
   (with word offsets) (in bytes): 26,480,114,920  (25,859,488 kb)
The estimate for including word offsets in the index is pessimistic; it assumes that each word occurs once in every document.
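For reference, here is the shape of an index-size estimator. This is a sketch, not my actual index-size.pl: the byte costs are illustrative assumptions (not necessarily the ones behind the numbers above), and unlike my pessimistic estimate it counts actual word occurrences for the offsets. The tokenize() routine is the same stand-in as in the Part 1 sketch.

#!/usr/bin/perl -w
# Sketch of an index-size estimator (not the actual index-size.pl).
# $PTR, $POSTING, and $OFFSET are assumed byte costs; plug in the
# values from the assignment's assumptions.
use strict;

my $PTR     = 4;   # bytes per lexicon pointer (assumed)
my $POSTING = 4;   # bytes per document number in an inverted list (assumed)
my $OFFSET  = 4;   # bytes per word offset (assumed)

# Stand-in tokenizer, as in the Part 1 sketch; substitute your own.
sub tokenize {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return grep { /[a-z]/ } split /[^a-z]+/, lc $text;
}

my (%df, %tf);     # document frequency, total occurrences per word

$/ = "</REUTERS>";
while (my $doc = <>) {
    next unless $doc =~ /<REUTERS/;
    my %seen;
    foreach my $w (tokenize($doc)) {
        $tf{$w}++;                    # every occurrence
        $df{$w}++ unless $seen{$w}++; # first occurrence in this document
    }
}

my ($lexicon, $lists, $offsets) = (0, 0, 0);
foreach my $w (keys %df) {
    $lexicon += length($w) + $PTR;    # the word itself plus a pointer
    $lists   += $df{$w} * $POSTING;   # one posting per document containing it
    $offsets += $tf{$w} * $OFFSET;    # one offset per actual occurrence
}

printf "Number of unique words: %d\n", scalar(keys %df);
printf "Size of lexicon (in bytes):        %d\n", $lexicon;
printf "Size of inverted lists (in bytes): %d\n", $lists;
printf "   (with word offsets) (in bytes): %d\n", $lists + $offsets;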