UMBC CMSC 491/691-I Fall 2002
Last updated: 11 September 2002

Homework 1 Solutions

The answers to these questions depend on how you segment the text into words. I preprocessed the text using nsgmls, a simple SGML parsing tool, to output text and tags in a normalized, easy-to-parse format. After this, I converted non-alphanumeric characters to whitespace and normalized the text to lower case. For some questions (it should be clear which ones from the code) I also excluded numbers from the counts.
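For reference, nsgmls emits ESIS format: the first character of each output line says what it is. 'A' lines carry an attribute (and precede the start of their element), '(' and ')' lines open and close an element, and '-' lines carry character data, with newlines escaped as '\n'. A rough, illustrative sketch (not actual collection content) of what one document looks like:

      ANEWID CDATA 12345
      (REUTERS
      (TEXT
      -Some article text, with embedded newlines escaped as \n
      )TEXT
      )REUTERS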

  1. The number of unique words in the collection.

    This can be answered with the following pipeline:

      nsgmls -E0 reut2-*sgm |         # Use NSGMLS to parse SGML out
      grep '^-' |                     # and take only PCDATA lines
      perl -ne 's/^-//;               # cut leading - from NSGMLS
                s/\\n/ /g;            # fix NSGMLS escaped newlines
                s/&[a-z]+;/ /g;       # remove SGML entities
                s/[^[:alnum:]]/ /g;   # convert punctuation to spaces
                tr/A-Z/a-z/;          # normalize to lower case
                print join("\n", split), "\n";' |  # print 1 word/line
      sort |                          # sort the words
      uniq |                          # remove duplicate lines
      wc -l                           # count the number of lines
    Try out the pipe, adding one stage at a time, to understand how it works. The perl options '-ne' mean to run the quoted commands on each input line. The first three Perl substitutions deal with the NSGMLS output format. According to this pipe, there are 55,377 unique words in Reuters-21578.
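    For example, one sanity check before adding the Perl stage is to peek at the first few raw PCDATA lines (just a quick look, not part of the final count):

      nsgmls -E0 reut2-*sgm | grep '^-' | head -5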

    Note that if you modify the pipeline above by inserting the following before the 'sort':

      grep -v '^[0-9][0-9]*$' |

    to remove numbers, there are only 45,741 unique words.
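    A quick check on synthetic input shows what this filter keeps and drops: all-digit tokens go, mixed tokens stay.

      printf '1987\nmln\n3rd\n' | grep -v '^[0-9][0-9]*$'   # prints "mln" and "3rd"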

  2. The number of documents in the collection.

    This can be answered by counting the REUTERS tags:

                  cat reut2-*sgm | grep '<REUTERS' | wc -l
                
    This question could have also been answered by reading the documentation, or even by making a clever guess based on the collection title: there are 21,578 documents. (The collection is so named to distinguish it from earlier versions that had a different number of documents.)
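    Incidentally, 'grep -c' can do the counting itself; given multiple files it prints one filename:count line per file, which a short awk sums (a variant sketch):

      grep -c '<REUTERS' reut2-*sgm | awk -F: '{sum += $2} END {print sum}'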

  3. The document ID of the longest and shortest documents, along with the number of words they contain.

    There are a lot of ways to answer this one. With a smaller collection, I might have used a Perl incantation to dump each document into a separate file:

      cat reut2-*sgm | \
      perl -ne 'if (/<REUTERS.*NEWID=\"([0-9]+)\"/) { # set output to
                   close OUT; open OUT, ">$1";        #   NEWID content
                }
                # ...
                # rest of Perl normalization mishmash
                # ...
                print OUT join("\n", split), "\n";'   # print 1 word/line
    Then I could use 'wc -l' to count the number of lines in each file (in our case, each line is one word), and use 'sort -n', 'head', and 'tail' to get the lists. But that would leave 21,578 files in my directory, which might get a little unwieldy. Instead, I used the following Perl script:
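    For the record, a sketch of that per-file approach, assuming the current directory holds only the dumped documents (the grep drops the trailing 'total' line that wc prints):

      wc -l * | grep -v ' total$' | sort -n > lengths
      head -10 lengths     # the 10 shortest documents
      tail -10 lengths     # the 10 longest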

    NOTE: This script had a bug when I showed it initially in class: it counted characters in each document, not words. Reasonable as a length measure, but not when the question asked for the number of words!

    
    #!/usr/bin/perl

    #
    # Usage: ./reutcount.pl reut2-*.sgm
    #

    use strict;

    my $docid = -1;
    my @doc;

    # Filter each .sgm argument through nsgmls before reading it
    @ARGV = map { /\.sgm$/ ? "nsgmls -E0 $_ 2>/dev/null |" : $_ } @ARGV;

    LINE: while (<>) {
        if (/^ANEWID/) {            # new document (and ID)
            chomp;
            my (undef, undef, $newid) = split;  # get NEWID value
            $docid = $newid;
            @doc = ();
            next LINE;
        }

        if (/^-/) {                 # text in a doc
            s/^-//;                 # remove leading hyphen
            s/\\n/ /g;              # fix NSGMLS escaped newlines
            s/&[a-z]+;//g;          # remove SGML entities
            s/[^[:alnum:]]/ /g;     # convert non-alphanumerics to spaces
            my $line = lc;          # normalize to lower case

            push @doc, split ' ', $line;
            next LINE;
        }

        if (/^\)REUTERS/) {         # end of a doc... do the counting
            print "$docid\t" . scalar(@doc) . "\n";
            @doc = ();
            next LINE;
        }
    }

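    To pull the two lists out of the script's output (one docid and word count per line), sort on the count column; a sketch:

      ./reutcount.pl reut2-*sgm > lengths
      sort -k2,2n lengths | head -10    # 10 shortest documents
      sort -k2,2n lengths | tail -10    # 10 longest documents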
    With the following results (length in words, with document ID):

    Top 10                Bottom 10
    ----------------      ----------------
    2429 (doc 15875)      32 (doc 10033)
    2299 (doc 15871)      32 (doc 12234)
    1391 (doc 11224)      32 (doc 13746)
    1120 (doc 6657)       32 (doc 19354)
    1089 (doc 5214)       32 (doc 1969)
    1085 (doc 17396)      32 (doc 636)
    1079 (doc 5985)       32 (doc 7025)
    1052 (doc 17474)      31 (doc 12455)
    1051 (doc 17953)      31 (doc 18639)
    1051 (doc 7135)        9 (doc 12793)

  4. A list of the 10 most frequent words.

    Modify the pipeline from question 1 (excluding numbers) by changing 'uniq' to 'uniq -c', which prints the number of occurrences with each line, and replacing 'wc -l', which simply counts the number of lines, with 'sort -n | tail -10', which sorts the lines numerically and prints the last 10.
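    Concretely, the tail of the pipeline becomes (the front end, through the perl stage and the number filter, is unchanged from question 1):

      ... | sort | uniq -c | sort -n | tail -10

    which yields: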

      26979 mln
      27621 for
      35275 s
      53153 a
      53276 said
      55098 and
      55635 in
      74262 of
      74937 to
     144803 the
             

  5. A list of the 10 least frequent words occurring at least 10 times.

    Another couple of pipe tweaks:

      nsgmls -E0 reut2-*sgm |
      grep '^-' |
      perl -ne 's/^-//;
                s/\\n/ /g;
                s/&[a-z]+;/ /g;
                s/[^[:alnum:]]/ /g;
                tr/A-Z/a-z/;
                print join("\n", split), "\n";' |
      grep -v '^[0-9][a-z0-9]*$' |  # Reuters has some strange numbers
      sort |
      uniq -c |                     # print counts of duplicate lines
      sort -n |                     # sort numerically
      awk '$1 >= 10 {print}' |      # keep lines whose count (field 1) is at least 10
      head -10                      # print the first 10 lines
    It turns out that there are a lot of words that occur 10 times. (How many?) My pipe above lists them alphabetically, so I get:
         10 abnormal
         10 abrupt
         10 accomplish
         10 achievements
         10 acknowledging
         10 acp
         10 acqu
         10 acquir
         10 admitting
         10 adv
            
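    As for the parenthetical question, one more tweak answers it: replace the 'awk ... | head' tail with a test for exactly 10 (a sketch, same front end as above):

      ... | sort | uniq -c | awk '$1 == 10' | wc -l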

  6. The number of words occurring only once in the collection.

    You get the idea... I count 15,010. Here's my pipeline:

      nsgmls -E0 reut2-*sgm |
      grep '^-' |
      perl -ne 's/^-//;
                s/\\n/ /g;
                s/&[a-z]+;/ /g;
                s/[^[:alnum:]]/ /g;
                tr/A-Z/a-z/;
                print join("\n", split), "\n";' |
      grep -v '^[0-9][a-z0-9]*$' |
      sort |
      uniq -c |                     # print counts of duplicate lines
      awk '$1 == 1 {print}' |       # keep lines whose count (field 1) is exactly 1
      wc -l                         # count the number of lines

For the graph, I used a pipeline similar to question 4 to output a set of graph coordinates:

  nsgmls -E0 reut2-*sgm |
  grep '^-' |
  perl -ne 's/^-//;
            s/\\n/ /g;
            s/&[a-z]+;/ /g;
            s/[^[:alnum:]]/ /g;
            tr/A-Z/a-z/;
            print join("\n", split), "\n";' |
  grep -v '^[0-9][a-z0-9]*$' |
  sort |
  uniq -c |                    # print counts of duplicate lines
  sort -nr |                   # sort numerically, in reverse order
  awk '{print $1}' |           # print the count only
  nl                           # preface each line with a line number,
                               #   which will be the X coordinate
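The first few coordinate pairs, using the counts from question 4, should look roughly like this (nl prints a right-justified rank, then the frequency):

       1  144803
       2   74937
       3   74262
       4   55635
       5   55098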

The graph [Postscript] was generated using R, a free statistical package similar to S-PLUS.