CMSC 341 Spring 2008
Project 4

Assigned

Wednesday, April 16, 2008

Due

11:59 pm on Wednesday April 30, 2008

Revision

(4/18/2008)

Some revisions and clarifications have been made in the Project Requirements and Notes section on the specifics of the two hash functions and on the requirements for output.

 

Objectives and background

In this project you will use Hash Tables (HT) and other data structures to implement a simple “toy” web search engine.

 

To build a Google-like search engine, one needs to do the following:

  1. Collecting webpages and extracting the words that appear on each page.
  2. Indexing the pages. Webpages are indexed by the words appearing on them so that, given a word x, one can quickly obtain the URLs of the webpages that contain x.
  3. Building procedures to answer search queries. The search engine must find the answer (a list of webpage URLs that satisfy the query) quickly and with high relevancy.

 

Speed and relevancy are the two key factors that make Google stand out among all the search engines we have seen in the last two decades or so. The high relevancy is attributed in great part to Google’s PageRank algorithm. Due to time limits, this project will not involve ranking the search results. For the same reason, you are not asked to collect webpages from the internet. Instead, the project focuses on the data structures and algorithms for efficient indexing and query processing. In particular, you will implement

  1. Cuckoo hashing, a hashing variant with O(1) worst-case time performance for the find/contains operation; and
  2. Efficient set intersection with O(N) time performance, where N is the size of the largest set in the operation.

 


Description

Webpages

Instead of crawling the web to collect pages, you will be given a text file, called the page file, which contains a collection of pre-processed webpages. Each entry in the page file contains the URL of a page and a sequence of words. These words were extracted from the page during pre-processing; they are strings over a – z (all lower case, no special characters, no punctuation, no numerals), and common stop words and HTML tags have been removed. Please note that 1) the same word may appear more than once in a webpage; and 2) two words are treated as the same if and only if the two strings are identical; for example, “car” and “cars” are treated as two different words.
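The precise layout of the page file is determined by the files provided with the project. Purely as an illustration, the following Java sketch assumes each entry occupies one line consisting of the URL followed by the page's words, all separated by whitespace; the names PageReader and Page are hypothetical:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader: assumes one page per line, "URL word word ...".
public class PageReader {
    public static List<Page> readPages(String fileName) throws IOException {
        List<Page> pages = new ArrayList<Page>();
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        String line;
        int seq = 1;                                   // sequence numbers start at 1
        while ((line = in.readLine()) != null) {
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length == 0 || tokens[0].length() == 0) continue;
            Page p = new Page(seq++, tokens[0]);       // first token is the URL
            for (int i = 1; i < tokens.length; i++)
                p.words.add(tokens[i]);                // remaining tokens are words
            pages.add(p);
        }
        in.close();
        return pages;
    }
}

// A pre-processed webpage: sequence number, URL, and its words.
class Page {
    final int seqNum;
    final String url;
    final List<String> words = new ArrayList<String>();
    Page(int seqNum, String url) { this.seqNum = seqNum; this.url = url; }
}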

 

Page Indexing

Webpages are read in from the page file and then indexed by the words they contain. A hash table (HT) is chosen for indexing because of its fast access speed. Each element in HT consists of the following:

(1)  The word (the key for the hash functions), and

(2)  A reference to its page list, which holds all pages that contain this word. Each element of the page list consists of the page’s URL and its sequence number (a number you assign according to the order in which pages are read in from the page file).

 

HT is constructed by reading and attempting to insert words from the page file. Specifically, when you attempt to insert word w read from page p into HT: if w is not yet in HT, insert w together with a new page list containing p; if w is already in HT, append p to w’s page list unless p is already there (recall that the same word may appear more than once in a page).

 

As an example, suppose the following two pages are the first two read in from the page file:

 

Sequence number   1             2
URL               www.foo.com   www.bar.edu
Words             x y z         y z y

 

Then pages www.foo.com and www.bar.edu are assigned sequence numbers of 1 and 2, and the HT contains three elements: x with page list (1), y with page list (1, 2), and z with page list (1, 2) (URLs are not shown here).
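As a concrete, purely illustrative shape for a hash-table element, the following Java sketch stores a word together with its page list; the class names WordEntry and PageRef are hypothetical, not required:

import java.util.ArrayList;
import java.util.List;

// One element of the hash table: a word plus the list of pages containing it.
class WordEntry {
    final String word;                      // the key for the hash functions
    final List<PageRef> pageList = new ArrayList<PageRef>();

    WordEntry(String word) { this.word = word; }

    // Add page p, skipping duplicates: a word may occur several times in one
    // page, and pages arrive in sequence order, so only the last entry can repeat.
    void addPage(PageRef p) {
        if (pageList.isEmpty()
                || pageList.get(pageList.size() - 1).seqNum != p.seqNum)
            pageList.add(p);
    }
}

// A page reference held in a page list: sequence number plus URL.
class PageRef {
    final int seqNum;
    final String url;
    PageRef(int seqNum, String url) { this.seqNum = seqNum; this.url = url; }
}

Appending pages in read order keeps each page list sorted by sequence number, which the walking-together algorithm described below relies on.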

 

Cuckoo hashing

Cuckoo hashing was invented by Pagh and Rodler [1,2] in 2001. The basic idea of cuckoo hashing can be seen from the following definition, taken from http://www.nist.gov/dads/HTML/cuckooHashing.html:

 

“… A dictionary implemented with two hash tables, T1 and T2, and two different hash functions, h1 and h2. Each key, k, is either in T1[h1(k)] or T2[h2(k)]. A new key, k, is stored in T1[h1(k)]. If that location is already occupied by another key, l, the other key is moved to T2[h2(l)]. Keys are moved back and forth until a key moves to an empty location or a limit is reached. If the limit is reached, new hash functions are chosen, and the tables are rehashed.”

 

Following is an outline, in pseudocode, of the cuckoo hashing insert algorithm, where “x <-> y” denotes swapping the values of x and y:

procedure insert(x)
  if T1[h1(x)] = x or T2[h2(x)] = x then return       /* x is already there */
  loop jumpLimit times
  {
    if T1[h1(x)] = NULL then { T1[h1(x)] <- x; return }
                 /* if T1[h1(x)] is free, insert x there */
    x <-> T1[h1(x)]
                 /* else kick out what was in T1[h1(x)] and put x there */
    if T2[h2(x)] = NULL then { T2[h2(x)] <- x; return }
                 /* attempt to insert the element kicked out of T1 into T2 */
    x <-> T2[h2(x)]
  }
  rehash()
  insert(x)
end

Because two different hash functions are used in cuckoo hashing, two keys that collide in one table are very unlikely to also collide in the other table. This is why cuckoo hashing has better time performance than linear or even quadratic probing. On rare occasions insert may run into an infinite loop; the procedure is then forced to stop for rehashing when the maximum number of times it is allowed to jump from one table to the other is reached. It is also clear that to find x one needs to look at only two places, T1[h1(x)] and T2[h2(x)], so the time performance of find is always O(1).
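As a concrete illustration, here is a minimal Java sketch of the insert procedure above for String keys. The hash functions shown are placeholders only (the required h1 and h2 are specified under Project Requirements and Notes), and rehash() is left as a stub:

public class CuckooTable {
    private String[] T1, T2;
    private int jumpLimit;

    public CuckooTable(int m1, int m2) {
        T1 = new String[m1];
        T2 = new String[m2];
        jumpLimit = (m1 + 1) / 2;      // ceiling(current table size / 2)
    }

    // Placeholder hash functions, NOT the ones required by the project.
    private int h1(String x) { return (x.hashCode() & 0x7fffffff) % T1.length; }
    private int h2(String x) { return ((x.hashCode() / 31) & 0x7fffffff) % T2.length; }

    // O(1) find: a key can only live in one of two slots.
    public boolean contains(String x) {
        return x.equals(T1[h1(x)]) || x.equals(T2[h2(x)]);
    }

    public void insert(String x) {
        if (contains(x)) return;                               // already there
        for (int jumps = 0; jumps < jumpLimit; jumps++) {
            if (T1[h1(x)] == null) { T1[h1(x)] = x; return; }  // free slot in T1
            String tmp = T1[h1(x)]; T1[h1(x)] = x; x = tmp;    // evict occupant of T1
            if (T2[h2(x)] == null) { T2[h2(x)] = x; return; }  // try T2
            tmp = T2[h2(x)]; T2[h2(x)] = x; x = tmp;           // evict occupant of T2
        }
        rehash();      // jump limit reached: choose new sizes and rebuild
        insert(x);     // then retry the displaced key
    }

    private void rehash() {
        // Left as an exercise: rebuild both tables with new parameters.
        throw new UnsupportedOperationException("rehash not sketched here");
    }
}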

 

Queries

In this project you will need to handle only one type of search query: the conjunctive query. A conjunctive query consists of one or more words called query keys; the answer to the query is the list of ALL webpages that contain ALL keys in the query. 

 

Efficient implementation of set intersection

A conjunctive query is answered by intersecting the page lists of all its query keys. Straightforward implementations of set intersection often have poor time performance. For example, if we literally check whether every element of S1 is also an element of S2, the intersection operation has time performance O(|S1|*|S2|), which is far too slow for large sets such as the page lists of commonly used words.

 

Google suggests another algorithm whose time performance is linear in the size of the sets involved. This algorithm, called walking together, is best illustrated by an example taken from http://www.google.com/librariancenter/articles/0512_01.html. Suppose the word "civil" appears in pages 3, 8, 22, 56, 68, and 92, and the word "war" appears in pages 2, 8, 15, 22, 68, and 77. The intersection is (8, 22, 68).

 

civil        3    8         22    56    68    92
war          2    8    15   22          68    77
Both words        8         22          68

 

 

In this algorithm you walk along the two lists as follows: at each step, compare the current numbers from the two lists; if they are equal, that page belongs to the intersection and both lists advance, otherwise advance the list with the smaller number one step. In this example, one first moves along the second list from 2 to 8, then along the first list from 3 to 8, where a match is found. The process continues until one list reaches its end.

 

You are required to implement the walking-together algorithm for set intersection with multiple sets; a sketch follows below.
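As a purely illustrative starting point, the following Java sketch intersects any number of page lists by walking them together. It assumes each list is an array of sequence numbers sorted in increasing order; the class and method names are hypothetical:

import java.util.ArrayList;
import java.util.List;

public class WalkTogether {
    // Intersect k sorted arrays by walking all of them in step.
    // Each array must be sorted in increasing order.
    public static List<Integer> intersect(int[][] lists) {
        List<Integer> result = new ArrayList<Integer>();
        if (lists.length == 0) return result;
        int[] pos = new int[lists.length];              // one cursor per list
        outer:
        while (pos[0] < lists[0].length) {
            int candidate = lists[0][pos[0]];           // value to match
            for (int i = 1; i < lists.length; i++) {
                // advance list i until it reaches or passes the candidate
                while (pos[i] < lists[i].length && lists[i][pos[i]] < candidate)
                    pos[i]++;
                if (pos[i] >= lists[i].length) break outer;  // a list ran out
                if (lists[i][pos[i]] > candidate) {          // candidate missed:
                    pos[0]++;                                // advance first list
                    continue outer;                          // and try again
                }
            }
            result.add(candidate);                      // present in every list
            for (int i = 0; i < lists.length; i++) pos[i]++;
        }
        return result;
    }

    public static void main(String[] args) {
        int[] civil = {3, 8, 22, 56, 68, 92};
        int[] war   = {2, 8, 15, 22, 68, 77};
        System.out.println(intersect(new int[][]{civil, war}));  // [8, 22, 68]
    }
}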

 


Project Requirements and Notes

1.     Project 4 will be invoked with a command line that consists of two arguments: the names of the page file and the query file.

 

2.     Specifics of the cuckoo hashing implementation:

 

/* convert word to integer key k */

k = 0;

for (i = 0; i < word.length(); i++)

  k = 37 * k + word.charAt(i);

return k;

 

/* h1 for T1 with integer key k */

hashValue1 = k % m1;

if (hashValue1 < 0) hashValue1 += m1;

return hashValue1;

 

/* h2 for T2 with integer key k */

hashValue2 = floor(m2 * (k*A - floor(k*A)));

if (hashValue2 < 0) hashValue2 += m2;

return hashValue2;

 

Here m1 and m2 are the table sizes of T1 and T2, and A = 2/(1+sqrt(5)) is the inverse of the golden ratio.
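Put together, a Java sketch of the word-to-key conversion and the two hash functions above might look as follows (method names are illustrative; the integer key may overflow and become negative, which is why the < 0 adjustments are needed):

public class HashFunctions {
    static final double A = 2.0 / (1.0 + Math.sqrt(5.0));  // inverse golden ratio

    // Convert a word to an integer key; may overflow to a negative value.
    static int key(String word) {
        int k = 0;
        for (int i = 0; i < word.length(); i++)
            k = 37 * k + word.charAt(i);
        return k;
    }

    // h1 for T1: division method with table size m1.
    static int h1(String word, int m1) {
        int h = key(word) % m1;
        if (h < 0) h += m1;            // Java's % can return a negative value
        return h;
    }

    // h2 for T2: multiplication method, using the fractional part of k*A.
    static int h2(String word, int m2) {
        int k = key(word);
        int h = (int) Math.floor(m2 * (k * A - Math.floor(k * A)));
        if (h < 0) h += m2;            // kept for safety, as in the handout
        return h;
    }
}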

 

Use jumpLimit = ceiling(current table size / 2).

afs/umbc.edu/users/y/p/ypeng/pub/CMSC341/Proj4.

 

3.     You are free to choose your own data structure and implementation for the page list.

 

4.     Output. For each query, print the query itself, the number of pages found, and the list of pages found, in the format of the following example:

 

 

Query: y z

# of pages found: 2

List of pages found:

     (1) www.foo.com

     (2) www.bar.edu


References

[1] Rasmus Pagh and Flemming Friche Rodler, Cuckoo Hashing, Proceedings of ESA 2001, Lecture Notes in Computer Science, vol. 2161, pages 121-133, 2001.

 

[2] Rasmus Pagh, Cuckoo Hashing for Undergraduates. (from http://www.it-c.dk/people/pagh/papers/cuckoo-undergrad.pdf)