Assigned 
Wednesday, April 16, 2008 
Due 
11:59 pm on Wednesday April 30, 2008 
Revision (4/18/2009) 
Some revisions and clarifications have been made in the Program Requirements and Notes section concerning the specifics of the two hash functions and the requirements for output.
In this project you will use Hash Tables (HT) and other data structures to implement a simple “toy” web search engine.
To build a Google-like search engine, one needs to do the following:
Speed and relevance are the two key factors that make Google stand out among all the search engines we have seen in the last two decades or so. The high relevance is attributed in great part to Google's PageRank algorithm. Due to time constraints, this project will not involve ranking the search results. For the same reason, you are not asked to collect webpages from the internet. Instead, the project will focus on the data structures and algorithms for efficient indexing and query processing. In particular, you will implement the data structures and algorithms described in the sections below.
Webpages
Instead of crawling the web to collect all pages, you will be given a text file, called the page file, which contains a collection of preprocessed webpages. Each entry in the page file contains the URL of that page and a sequence of words. These words are taken from that page during preprocessing; they are strings of a-z (all lower case, no special characters, no punctuation, no numerals), and common stop words and html tags are removed. Please note that 1) the same word may appear more than once in a webpage; and 2) two words are treated the same if and only if the two strings are identical; therefore, for example, "car" and "cars" are treated as two different words.
Page Indexing
Webpages will be read in from the page file and then indexed by the words they contain. A hash table (HT) is chosen for indexing because of its fast access speed. Each element in the HT consists of the following:
(1) The word (the key for hash functions),
(2) The reference to its page list, which holds all pages that contain this word. Each element of the page list consists of the page's URL and sequence number (a number you need to assign according to the order in which pages are read in from the page file).
The HT is constructed by reading words from the page file and attempting to insert them. Specifically, when you attempt to insert a word w read from page p into the HT:
For an example, suppose the following two pages are the first two read in from the page file:
Sequence number   URL           words
1                 www.foo.com   x y z
2                 www.bar.edu   y z y
Then pages www.foo.com and www.bar.edu are assigned sequence numbers of 1 and 2, and the HT contains three elements: x with page list (1), y with page list (1, 2), and z with page list (1, 2) (URLs are not shown here).
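The indexing pass above can be sketched as follows. This is only an illustration of the indexing logic: it uses `java.util.HashMap` as a stand-in for the cuckoo hash table the project actually requires, and the class and method names (`PageIndex`, `addPage`, `pagesFor`) are invented for this example.

```java
import java.util.*;

// Sketch of the indexing pass, using java.util.HashMap as a stand-in for
// the project's cuckoo hash table. Names here are illustrative only.
class PageIndex {
    // word -> ordered list of sequence numbers of pages containing it
    private final Map<String, List<Integer>> table = new HashMap<>();

    // Called once per page, in the order pages are read from the page file.
    void addPage(int seqNum, String[] words) {
        for (String w : words) {
            List<Integer> pages = table.computeIfAbsent(w, k -> new ArrayList<>());
            // The same word may repeat within a page; record the page once.
            if (pages.isEmpty() || pages.get(pages.size() - 1) != seqNum) {
                pages.add(seqNum);
            }
        }
    }

    List<Integer> pagesFor(String word) {
        return table.getOrDefault(word, Collections.emptyList());
    }
}
```

With the two example pages above, `pagesFor("y")` yields the page list (1, 2), matching the HT contents described in the text.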
Cuckoo hashing
Cuckoo hashing was invented by Pagh and Rodler [1,2] in 2001. The basic idea of cuckoo hashing can be seen from the following definition, taken from http://www.nist.gov/dads/HTML/cuckooHashing.html:
“… A dictionary implemented with two hash tables, T_{1} and T_{2}, and two different hash functions, h_{1} and h_{2}. Each key, k, is either in T_{1}[h_{1}(k)] or T_{2}[h_{2}(k)]. A new key, k, is stored in T_{1}[h_{1}(k)]. If that location is already occupied by another key, l, the other key is moved to T_{2}[h_{2}(l)]. Keys are moved back and forth until a key moves to an empty location or a limit is reached. If the limit is reached, new hash functions are chosen, and the tables are rehashed.”
Following is an outline, in pseudocode, of the cuckoo hashing algorithm, where "x <> y" denotes swapping the values of x and y:

procedure insert(x)
    if T1[h1(x)] = x or T2[h2(x)] = x then return         /* x is already there */
    loop jumpLimit times {
        if T1[h1(x)] = NULL then { T1[h1(x)] <- x; return }  /* if T1[h1(x)] is free, insert x there */
        x <> T1[h1(x)]                                       /* else kick out what was in T1[h1(x)] and put x there */
        if T2[h2(x)] = NULL then { T2[h2(x)] <- x; return }  /* attempt to insert the kicked-out element into T2 */
        x <> T2[h2(x)]
    }
    rehash(); insert(x)
end
Because two different hash functions are used in cuckoo hashing, two keys that collide in one table are very unlikely to also collide in the other. This is why cuckoo hashing has better time performance than linear or even quadratic probing. On rare occasions, insert may run into an infinite loop; the procedure is then forced to stop and rehash when the maximum number of jumps allowed between the two tables is reached. It is also clear that to find x one needs to look in only two places, T1[h1(x)] and T2[h2(x)], so lookup time is always O(1).
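The pseudocode above might translate into Java along these lines. This is a minimal sketch for int keys: the table sizes, jump limit, and the two simple hash functions here are placeholders, not the functions the project specifies, and the growth policy in `rehash` is one possible choice.

```java
// Minimal cuckoo hash table for int keys, following the pseudocode above.
// The hash functions, sizes, and jumpLimit are illustrative placeholders.
class Cuckoo {
    private Integer[] t1 = new Integer[11], t2 = new Integer[11];
    private int jumpLimit = 6; // the project uses ceiling(table size / 2)

    private int h1(int k) { return Math.floorMod(k, t1.length); }
    private int h2(int k) { return Math.floorMod(k / t1.length, t2.length); }

    boolean contains(int x) {
        return (t1[h1(x)] != null && t1[h1(x)] == x)
            || (t2[h2(x)] != null && t2[h2(x)] == x);
    }

    void insert(int x) {
        if (contains(x)) return;                  // x is already there
        for (int i = 0; i < jumpLimit; i++) {
            if (t1[h1(x)] == null) { t1[h1(x)] = x; return; }
            int y = t1[h1(x)]; t1[h1(x)] = x; x = y;  // kick out occupant of T1
            if (t2[h2(x)] == null) { t2[h2(x)] = x; return; }
            y = t2[h2(x)]; t2[h2(x)] = x; x = y;      // kick out occupant of T2
        }
        rehash(); insert(x);                      // jump limit hit: rehash, retry
    }

    private void rehash() {
        Integer[] old1 = t1, old2 = t2;
        t1 = new Integer[old1.length * 2 + 1];    // grow so the retry terminates
        t2 = new Integer[old2.length * 2 + 1];
        jumpLimit = (t1.length + 1) / 2;
        for (Integer k : old1) if (k != null) insert(k);
        for (Integer k : old2) if (k != null) insert(k);
    }
}
```

Note that `contains` inspects exactly two slots, which is the source of the O(1) lookup claimed above.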
Queries
In this project you will need to handle only one type of search query: the conjunctive query. A conjunctive query consists of one or more words called query keys; the answer to the query is the list of ALL webpages that contain ALL keys in the query.
Efficient Implementation for Set Intersection
A conjunctive query is answered by intersecting the page lists of all its query keys. Straightforward implementations of set intersection often have poor time performance. For example, if we literally check whether every element of S1 is also an element of S2, the intersection operation takes O(|S1|*|S2|) time, which is prohibitively slow for large sets such as the page lists of commonly used words.
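For contrast, the naive approach just described might look like this (a sketch only; the names are invented, and the page lists are sorted lists of sequence numbers):

```java
import java.util.*;

// Naive intersection: for every element of s1, scan all of s2.
// Runs in O(|s1| * |s2|) time, as noted above.
class NaiveIntersect {
    static List<Integer> intersect(List<Integer> s1, List<Integer> s2) {
        List<Integer> result = new ArrayList<>();
        for (int a : s1) {
            if (s2.contains(a)) result.add(a); // contains() is a linear scan
        }
        return result;
    }
}
```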
Google suggests another algorithm whose time performance is linear in the total size of the sets involved. This algorithm, called walking together, is best illustrated by an example taken from http://www.google.com/librariancenter/articles/0512_01.html. Suppose the word "civil" appears in pages 3, 8, 22, 56, 68, and 92, and the word "war" appears in pages 2, 8, 15, 22, 68, and 77. The intersection is (8, 22, 68).
civil        3    8        22   56   68   92
war          2    8   15   22        68   77
Both words        8        22        68
In this algorithm, you walk along the two lists as follows: at any point, if the two current numbers are equal, that page belongs to the intersection; otherwise, move the leg with the smaller number one step forward. In this example, one first moves along the second list from 2 to 8, then along the first list from 3 to 8, where a match is found. The process then continues until one list reaches its end.
You are required to implement the walking-together algorithm for set intersection over multiple sets.
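The walking-together step described above can be sketched as follows; the extension to multiple sets simply folds the two-list version over all page lists. The class and method names are invented for this illustration.

```java
import java.util.*;

// "Walking together" intersection of two sorted page lists: compare the
// current elements, advance the smaller one, and record matches.
// Runs in time linear in the total size of the lists.
class WalkTogether {
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y)     { result.add(x); i++; j++; }
            else if (x < y) i++;   // move the leg with the smaller number
            else            j++;
        }
        return result;
    }

    // Multiple sets: intersect pairwise; the running result only shrinks.
    static List<Integer> intersectAll(List<List<Integer>> lists) {
        List<Integer> result = lists.get(0);
        for (int k = 1; k < lists.size(); k++)
            result = intersect(result, lists.get(k));
        return result;
    }
}
```

On the civil/war example above, `intersect` produces (8, 22, 68).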
1. Project 4 will be invoked with a command line that consists of two arguments: the names of the page file and the query file.
2. Specifics about the cuckoo hashing implementation.
/* convert word to integer key k */
k = 0;
for (i = 0; i < word.length(); i++)
    k = 37 * k + word.charAt(i);
return k;

/* h1 for T1 with integer key k */
hashValue1 = k % m1;
if (hashValue1 < 0) hashValue1 += m1;
return hashValue1;

/* h2 for T2 with integer key k */
hashValue2 = floor(m2 * (k*A - floor(k*A)));
if (hashValue2 < 0) hashValue2 += m2;
return hashValue2;
Here m1 and m2 are the table sizes of T1 and T2, and A = 2/(1+sqrt(5)), the inverse of the golden ratio. Use jumpLimit = ceiling(current table size / 2).
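Written out in Java, the key computation and the two hash functions above might look like this (the class name `Hashes` is invented; m1 and m2 are passed as parameters):

```java
// The key computation and two hash functions from the spec, in Java.
// m1 and m2 are the table sizes of T1 and T2; A is the inverse golden ratio.
class Hashes {
    static final double A = 2.0 / (1.0 + Math.sqrt(5.0)); // about 0.618

    static int key(String word) {
        int k = 0;
        for (int i = 0; i < word.length(); i++)
            k = 37 * k + word.charAt(i);   // may overflow and go negative
        return k;
    }

    static int h1(int k, int m1) {
        int h = k % m1;
        if (h < 0) h += m1;                // correct for a negative key
        return h;
    }

    static int h2(int k, int m2) {
        int h = (int) Math.floor(m2 * (k * A - Math.floor(k * A)));
        if (h < 0) h += m2;
        return h;
    }
}
```

The sign correction in h1 matters because the 37-based polynomial key can overflow a Java int and become negative, making `k % m1` negative.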
afs/umbc.edu/users/y/p/ypeng/pub/CMSC341/Proj4.
1. You are free to choose your own data structure and implementation for the page list.
2.
Output.
Query: y z
# of pages found: 2
List of pages found:
(1) www.foo.com
(2) www.bar.edu
[1] Rasmus Pagh and Flemming Friche Rodler, Cuckoo Hashing, Proceedings of ESA 2001, Lecture Notes in Computer Science, vol. 2161, pages 121-133, 2001.
[2] Rasmus Pagh, Cuckoo Hashing for Undergraduates. (from http://www.itc.dk/people/pagh/papers/cuckooundergrad.pdf)