CMSC 341 Spring 2008
Project 4

Assigned

Wednesday, April 16, 2008

Due

11:59 pm on Wednesday April 30, 2008

Revision

(4/18/2008)

Some revisions and clarifications have been made in the Project Requirements and Notes section on the specifics of the two hash functions and on the requirements for output.

 

Objectives and background

In this project you will use Hash Tables (HT) and other data structures to implement a simple “toy” web search engine.

 

To build a Google-like search engine, one needs to do the following:

  1. Collecting webpages and extracting the words that appear on each page.
  2. Indexing the pages. Webpages are indexed by the words appearing on them so that, given a word x, one can quickly obtain the URLs of the webpages that contain x.
  3. Building procedures to answer search queries. The search engine must find the answer (a list of webpage URLs that satisfy the query) quickly and with high relevancy.

 

Speed and relevancy are the two key factors that make Google stand out among all the search engines we have seen in the last two decades or so. The high relevancy is attributed in great part to Google’s PageRank algorithm. Due to time limits, this project will not involve ranking the search results. For the same reason, you are not asked to collect webpages from the internet. Instead, the project focuses on the data structures and algorithms for efficient indexing and query processing. In particular, you will implement

  1. Cuckoo hashing, a hashing variant with O(1) worst-case time performance for the find/contains operation; and
  2. Efficient set intersection with O(N) time performance, where N is the size of the largest set in the operation.

 


Description

Webpages

Instead of crawling the web to collect pages, you will be given a text file, called the page file, which contains a collection of pre-processed webpages. Each entry in the page file contains the URL of a page and a sequence of words. These words were extracted from the page during pre-processing; they are strings over a – z (all lower case, no special characters, no punctuation, no numerals), and common stop words and HTML tags have been removed. Please note that 1) the same word may appear more than once in a webpage; and 2) two words are treated as the same if and only if the two strings are identical; for example, “car” and “cars” are treated as two different words.
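The precise layout of the page file is determined by the files provided with the project. Purely as an illustration, the following Java sketch assumes each entry occupies one line consisting of the URL followed by the page's words, all separated by whitespace; the names PageReader and Page are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader: assumes one page per line, "URL word word ...".
public class PageReader {
    public static List<Page> readPages(String fileName) throws IOException {
        List<Page> pages = new ArrayList<Page>();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String line;
        int seq = 1;                                   // sequence numbers start at 1
        while ((line = in.readLine()) != null) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length == 0 || tokens[0].length() == 0) continue;
            Page p = new Page(seq++, tokens[0]);       // first token is the URL
            for (int i = 1; i < tokens.length; i++)
                p.words.add(tokens[i]);                // remaining tokens are words
            pages.add(p);
        }
        in.close();
        return pages;
    }
}

// A pre-processed webpage: sequence number, URL, and its words.
class Page {
    final int seqNum;
    final String url;
    final List<String> words = new ArrayList<String>();
    Page(int seqNum, String url) { this.seqNum = seqNum; this.url = url; }
}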

 

Page Indexing

Webpages are read in from the page file and then indexed by the words they contain. A hash table (HT) is chosen for indexing because of its fast access speed. Each element in HT consists of the following:

(1)  The word (the key for the hash functions), and

(2)  A reference to its page list, which holds all pages that contain this word. Each element of the page list consists of the page’s URL and its sequence number (a number you assign according to the order in which pages are read in from the page file).

 

HT is constructed by reading and attempting to insert words from the page file. Specifically, when you attempt to insert word w read from page p into HT: if w is not yet in HT, insert w together with a new page list containing p; if w is already in HT, append p to w’s page list unless p is already there (recall that the same word may appear more than once in a page).

 

As an example, suppose the following two pages are the first two read in from the page file:

 

Sequence number   1             2
URL               www.foo.com   www.bar.edu
Words             x y z         y z y

 

Then pages www.foo.com and www.bar.edu are assigned sequence numbers of 1 and 2, and the HT contains three elements: x with page list (1), y with page list (1, 2), and z with page list (1, 2) (URLs are not shown here).
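As a concrete, purely illustrative shape for a hash-table element, the following Java sketch stores a word together with its page list; the class names WordEntry and PageRef are hypothetical, not required:

import java.util.ArrayList;
import java.util.List;

// One element of the hash table: a word plus the list of pages containing it.
class WordEntry {
    final String word;                      // the key for the hash functions
    final List<PageRef> pageList = new ArrayList<PageRef>();

    WordEntry(String word) { this.word = word; }

    // Add page p, skipping duplicates: a word may occur several times in one
    // page, and pages arrive in sequence order, so only the last entry can repeat.
    void addPage(PageRef p) {
        if (pageList.isEmpty()
                || pageList.get(pageList.size() - 1).seqNum != p.seqNum)
            pageList.add(p);
    }
}

// A page reference held in a page list: sequence number plus URL.
class PageRef {
    final int seqNum;
    final String url;
    PageRef(int seqNum, String url) { this.seqNum = seqNum; this.url = url; }
}

Appending pages in read order keeps each page list sorted by sequence number, which the walking-together algorithm described below relies on.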

 

Cuckoo hashing

Cuckoo hashing was invented by Pagh and Rodler [1,2] in 2001. The basic idea of cuckoo hashing can be seen from the following definition, taken from http://www.nist.gov/dads/HTML/cuckooHashing.html:

 

“… A dictionary implemented with two hash tables, T1 and T2, and two different hash functions, h1 and h2. Each key, k, is either in T1[h1(k)] or T2[h2(k)]. A new key, k, is stored in T1[h1(k)]. If that location is already occupied by another key, l, the other key is moved to T2[h2(l)]. Keys are moved back and forth until a key moves to an empty location or a limit is reached. If the limit is reached, new hash functions are chosen, and the tables are rehashed.”

 

Following is an outline, in pseudocode, of the cuckoo hashing insert algorithm, where “x <-> y” denotes swapping the values of x and y:

procedure insert(x)
  if T1[h1(x)] = x or T2[h2(x)] = x then return       /* x is already there */
  loop jumpLimit times
  {
    if T1[h1(x)] = NULL then { T1[h1(x)] <- x; return }
                 /* if T1[h1(x)] is free, insert x there */
    x <-> T1[h1(x)]
                 /* else kick out what was in T1[h1(x)] and put x there */
    if T2[h2(x)] = NULL then { T2[h2(x)] <- x; return }
                 /* attempt to insert the element kicked out of T1 into T2 */
    x <-> T2[h2(x)]
  }
  rehash()
  insert(x)
end

Because two different hash functions are used in cuckoo hashing, two keys that collide in one table are very unlikely to also collide in the other table. This is why cuckoo hashing has better time performance than linear or even quadratic probing. On rare occasions insert may run into an infinite loop; the procedure is then forced to stop for rehashing when the maximum number of times it is allowed to jump from one table to the other is reached. It is also clear that to find x one needs to look at only two places, T1[h1(x)] and T2[h2(x)], so the time performance of find is always O(1).
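As a concrete illustration, here is a minimal Java sketch of the insert procedure above for String keys. The hash functions shown are placeholders only (the required h1 and h2 are specified under Project Requirements and Notes), and rehash() is left as a stub:

public class CuckooTable {
    private String[] T1, T2;
    private int jumpLimit;

    public CuckooTable(int m1, int m2) {
        T1 = new String[m1];
        T2 = new String[m2];
        jumpLimit = (m1 + 1) / 2;      // ceiling(current table size / 2)
    }

    // Placeholder hash functions, NOT the ones required by the project.
    private int h1(String x) { return (x.hashCode() & 0x7fffffff) % T1.length; }
    private int h2(String x) { return ((x.hashCode() / 31) & 0x7fffffff) % T2.length; }

    // O(1) find: a key can only live in one of two slots.
    public boolean contains(String x) {
        return x.equals(T1[h1(x)]) || x.equals(T2[h2(x)]);
    }

    public void insert(String x) {
        if (contains(x)) return;                               // already there
        for (int jumps = 0; jumps < jumpLimit; jumps++) {
            if (T1[h1(x)] == null) { T1[h1(x)] = x; return; }  // free slot in T1
            String tmp = T1[h1(x)]; T1[h1(x)] = x; x = tmp;    // evict occupant of T1
            if (T2[h2(x)] == null) { T2[h2(x)] = x; return; }  // try T2
            tmp = T2[h2(x)]; T2[h2(x)] = x; x = tmp;           // evict occupant of T2
        }
        rehash();      // jump limit reached: choose new sizes and rebuild
        insert(x);     // then retry the displaced key
    }

    private void rehash() {
        // Left as an exercise: rebuild both tables with new parameters.
        throw new UnsupportedOperationException("rehash not sketched here");
    }
}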

 

Queries

In this project you will need to handle only one type of search query: the conjunctive query. A conjunctive query consists of one or more words called query keys; the answer to the query is the list of ALL webpages that contain ALL keys in the query. 

 

Efficient implementation of set intersection

A conjunctive query is answered by intersecting the page lists of all its query keys. Straightforward implementations of set intersection often have poor time performance. For example, if we literally check whether every element of S1 is also an element of S2, the intersection operation has time performance O(|S1|*|S2|), which is far too slow for large sets such as the page lists of commonly used words.

 

Google suggests another algorithm whose time performance is linear in the size of the sets involved. This algorithm, called walking together, is best illustrated by an example taken from http://www.google.com/librariancenter/articles/0512_01.html. Suppose the word "civil" appears in pages 3, 8, 22, 56, 68, and 92, and the word "war" appears in pages 2, 8, 15, 22, 68, and 77. The intersection is (8, 22, 68).

 

civil        3    8         22    56    68    92
war          2    8    15   22          68    77
Both words        8         22          68

 

 

In this algorithm you walk along the two lists as follows: at each step, compare the current numbers from the two lists; if they are equal, that page belongs to the intersection and both lists advance, otherwise advance the list with the smaller number one step. In this example, one first moves along the second list from 2 to 8, then along the first list from 3 to 8, where a match is found. The process continues until one list reaches its end.

 

You are required to implement the walking-together algorithm for set intersection with multiple sets; a sketch follows below.
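As a purely illustrative starting point, the following Java sketch intersects any number of page lists by walking them together. It assumes each list is an array of sequence numbers sorted in increasing order; the class and method names are hypothetical:

import java.util.ArrayList;
import java.util.List;

public class WalkTogether {
    // Intersect k sorted arrays by walking all of them in step.
    // Each array must be sorted in increasing order.
    public static List<Integer> intersect(int[][] lists) {
        List<Integer> result = new ArrayList<Integer>();
        if (lists.length == 0) return result;
        int[] pos = new int[lists.length];              // one cursor per list
        outer:
        while (pos[0] < lists[0].length) {
            int candidate = lists[0][pos[0]];           // value to match
            for (int i = 1; i < lists.length; i++) {
                // advance list i until it reaches or passes the candidate
                while (pos[i] < lists[i].length && lists[i][pos[i]] < candidate)
                    pos[i]++;
                if (pos[i] >= lists[i].length) break outer;  // a list ran out
                if (lists[i][pos[i]] > candidate) {          // candidate missed:
                    pos[0]++;                                // advance first list
                    continue outer;                          // and try again
                }
            }
            result.add(candidate);                      // present in every list
            for (int i = 0; i < lists.length; i++) pos[i]++;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] civil = {3, 8, 22, 56, 68, 92};
        int[] war   = {2, 8, 15, 22, 68, 77};
        System.out.println(intersect(new int[][]{civil, war}));  // [8, 22, 68]
    }
}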

 


Project Requirements and Notes

1.     Project 4 will be invoked with a command line that consists of two arguments: the names of the page file and the query file.

 

2.     Specifics of the cuckoo hashing implementation:

 

/* convert word to integer key k */

k = 0;

for (i = 0; i < word.length(); i++)

  k = 37 * k + word.charAt(i);

return k;

 

/* h1 for T1 with integer key k */

hashValue1 = k % m1;

if (hashValue1 < 0) hashValue1 += m1;

return hashValue1;

 

/* h2 for T2 with integer key k */

hashValue2 = floor(m2 * (k*A - floor(k*A)));

if (hashValue2 < 0) hashValue2 += m2;

return hashValue2;

 

Here m1 and m2 are the table sizes of T1 and T2, and A = 2/(1+sqrt(5)) is the inverse of the golden ratio.
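Put together, a Java sketch of the word-to-key conversion and the two hash functions above might look as follows (method names are illustrative; the integer key may overflow and become negative, which is why the < 0 adjustments are needed):

public class HashFunctions {
    static final double A = 2.0 / (1.0 + Math.sqrt(5.0));  // inverse golden ratio

    // Convert a word to an integer key; may overflow to a negative value.
    static int key(String word) {
        int k = 0;
        for (int i = 0; i < word.length(); i++)
            k = 37 * k + word.charAt(i);
        return k;
    }

    // h1 for T1: division method with table size m1.
    static int h1(String word, int m1) {
        int h = key(word) % m1;
        if (h < 0) h += m1;            // Java's % can return a negative value
        return h;
    }

    // h2 for T2: multiplication method, using the fractional part of k*A.
    static int h2(String word, int m2) {
        int k = key(word);
        int h = (int) Math.floor(m2 * (k * A - Math.floor(k * A)));
        if (h < 0) h += m2;            // kept for safety, as in the handout
        return h;
    }
}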

 

Use jumpLimit = ceiling(current table size / 2).

afs/umbc.edu/users/y/p/ypeng/pub/CMSC341/Proj4.

 

3.     You are free to choose your own data structure and implementation for the page list.

 

4.     Output. For each query, print the query itself, the number of pages found, and the list of pages found, in the format of the following example:

 

 

Query: y z

# of pages found: 2

List of pages found:

     (1) www.foo.com

     (2) www.bar.edu


References

[1] Rasmus Pagh and Flemming Friche Rodler, Cuckoo Hashing, Proceedings of ESA 2001, Lecture Notes in Computer Science, vol. 2161, pages 121-133, 2001.

 

[2] Rasmus Pagh, Cuckoo Hashing for Undergraduates. (from http://www.it-c.dk/people/pagh/papers/cuckoo-undergrad.pdf)