Project 4, Perfect Hashing

Due: Tuesday, November 26, 11:59pm

Links: [Project Submission] [Late Submissions] [Project Grading] [Grading Guidelines] [Academic Conduct]

Objectives

The objective of this programming assignment is to increase your understanding of hashing by implementing a perfect hashing algorithm and by collecting statistics on its performance.

Introduction

In this project, you will implement the perfect hashing scheme described in Section 5.7.1 of the textbook using universal hash functions described in Section 5.8. We summarize the main ideas here. You should consult the textbook for details and proofs.

First, perfect hashing is hashing without collisions. Section 5.7.1 of the textbook shows that if you hash n items into a table of size n2 with a randomly chosen hash function, then the probability of not having any collisions is greater than 1/2. What happens if you are unlucky and you get a collision? Then, pick another hash function and try again. With independent trials, the expected number of attempts you have to make in order to achieve zero collisions is less than 2.

Now, a table of size n2 is really big and a huge waste of memory. So, in our perfect hashing scheme, we don't directly hash to a table of size n2. First, we hash into a primary hash table of size n. There will be some collisions. To resolve the collisions at a slot of the primary hash table, we create a secondary hash table. If t items collide at a certain slot of the primary hash table, then we create a secondary hash table of size t2 and use perfect hashing to store the t items. The textbook shows that the expected number of slots used by all of the secondary hash tables is less than 2n. If you are thinking that this hashing scheme is just separate chaining with the linked lists replaced by hash tables, that is pretty close — just remember that the linked lists are replaced by collision-free hash tables.

How do you search for an item in this hash table? You have to hash twice. First, you hash the item to find its slot in the primary hash table. If that slot is not empty, then you find its slot in the secondary hash table. If the slot in the secondary hash table is also non-empty, then you compare the item against the item stored in the secondary hash table. If there's a match, you found the item. Otherwise, the item is not in the hash table.

Note that each secondary hash table has its own hash function, since it might have been necessary to try a few hash functions before you found one that did not result in any collisions. So, the hash function would have to be stored in the secondary hash table.

The perfect hashing scheme described above requires the ability to "randomly pick a hash function." In particular, we have to be able to randomly pick a different hash function if the one we just tried doesn't work because it resulted in a collision. How do we do that? This accomplished by "universal hashing". (Never mind the word "universal". It is a bit of a misnomer. It should really be called "randomized hashing", but most people think hashing is already random, so "randomized random" doesn't make much sense either.)

The universal hash functions presented in Section 5.8 of our textbook provide a method for generating random hash functions. First, we need a prime number p that is larger than any key that will be hashed. Then, we select two random integers a and b, such that 1 ≤ ap − 1 and 0 ≤ bp − 1. Then, we can define a hash function ha, b( ) using these two random integers:

      ha, b( x ) = ( ( a x + b ) mod p ) mod m
where m is the table size (which does not need to be prime in this scheme). Thus, for every pair of a and b, we get a hash function. The proof in the textbook shows that random hash functions chosen this way satisfies the definition of "universal" and has provably good performance.

To hash strings we first convert the string into a number, then we use the hash function above to guarantee "universality". In one scheme, we pick a random constant c such that 1 ≤ ap − 1. Then, we interpret each character of the string as a number (think ASCII), so a string str becomes a sequence of numbers d[0] d[1] d[2] d[3] ... d[t]. Now we can convert the string into a number:

      gc( str ) = ( d[0] ct + d[1] ct − 1 + d[2] ct − 2 ... + d[t] ) mod p
In a program, we should calculate this value using Horner's rule. Also, we would have to make sure that the arithmetic does not result in any overflows. (This can be accomplished by "modding out" by p at every step.)

The last remaining thing to point out is that this perfect hashing scheme only works if we know all the keys in advance. (Otherwise, we cannot tell how many items hash into the same slot of the primary hash table.) There are several applications where we would know the keys in advance. One example is when we burn files onto a CD or DVD. Once the disc is finalized, no additional files can be added to the disc. We can construct a perfect hash table of the these filenames (perhaps to tell us the location of the file on the disc) and burn the hash table along with the files onto the disc. Another example is place names for a GPS device. Names of cities and towns will not change very often. We can build a perfect hash table for place names. When the GPS device is updated, a new hash table will have to be constructed, but updates are not frequent events. This last example is the basis of your programming project.


Assignment

Your assignment is to apply the perfect hashing scheme described above to a file containing approximately 16,000 city names in the United States. The data comes from geonames.org. In addition to the names of all the cities with population above 1000, the file also has the latitude and longitude of each of these cities. Here are a few lines from the file:

Abington, MA 42.10482 -70.94532 Abita Springs, LA 30.47853 -90.03758 Abram, TX 26.1998 -98.41113 Absarokee, MT 45.5205 -109.44294

The data for each city is stored in two lines of text. The first line is the name of the city followed by the state. The second line of text has the latitude and the longitude of the city (in that order). You should treat the city name and state as a single entity to be hashed — i.e., hash the string "Abington, MA" (comma included) rather than "Abington". You may assume that the city names with the state designation included is a unique string, even though this is not entirely true in real life. (For example, there are 3 separate cities in Indiana named "Georgetown". Two of these have been disappeared from our file.)

The complete file is here: US_Cities_LL.txt.

Since implementing universal hash functions can be a little bit tricky, a Java class that implements universal hash functions is provided: Hash24.java. The 24 in Hash24 indicates that the methods in the class work with values as large as 224 which is approximately 16 million. This more than large enough for the purposes of this project. To use the Hash24 class, simply create a Hash24 object and then use it to invoke the universal hash function:

Hash24 h1 = new Hash24() ; ... index = h1.hash("Abington, MA") ;

The Hash24 object h1 "remembers" the randomly chosen a, b and c used in the universal hash function. So, you can store h1 and retrieve it later and it can be used to compute the same hash function.

You will implement two separate programs. The first program takes the file US_Cities_LL.txt, creates a hash table using perfect hashing and stores the hash table in a file. While creating the hash table, the first program must also print out some statistics about the hash table (see below). The second program reads in the hash table from a file and lets the user query the hash table. If the city is found, the second program prints out the latitude and longitude as well as a URL that can be pasted into a web browser to locate the city in Google Maps.


Details

CityTable

Your hash table class must be called CityTable and live in a package called hash341 (so that it will be compatible with the Hash24 class). In addition, your CityTable class must implement these methods:

Other Classes

You will probably find it necessary to implement other classes (hint, hint). You should make sure that your collection of programs plus Proj4Test.java compile on GL using an Ant build file.

First Program

The first program that you implement should read in the text file US_Cities_LL.txt, construct a hash table using perfect hashing, print out the statistics described above and write out the hash table to a file named US_Cities_LL.ser.

Second Program

The second program that you implement should read in the hash table from the file named US_Cities_LL.ser and prompt the user to type in a city name (including the state). If the city is in the hash table, your program should print out the city's name and coordinates and also a URL that can be cut-and-pasted into a web browser.

Running your second program should look somewhat like this: Proj4Typescript.txt.

If you wish you can optionally include code that asks the system to open the Google Maps URL in the default browser. (See sample code below.) Make sure that the program still compiles and runs on GL using a terminal emulator and prints out the Google Maps URL to standard output in all situations. Otherwise the graders would not be able to check your program.

import java.awt.Desktop ; import java.net.URI ; ... static void openURL (String url) { // cribbed from: // http://java-hamster.blogspot.com/2007/06/troubles-with-javaawtdesktop-browse.html try { if (Desktop.isDesktopSupported()) { Desktop desktop = Desktop.getDesktop(); if (desktop.isSupported(Desktop.Action.BROWSE)) { desktop.browse(new URI(url)); } else { System.out.println("No browser. URL = " + url) ; } } else { System.out.println("No desktop. URL = " + url) ; } } catch (Exception e) { e.printStackTrace(); } }


What to Submit

Read the course project submission procedures. Submission closes by script immediately after midnight. Submit well before the 11:59pm deadline, because 12:00:01 might already be late (the script takes only a few seconds to run).

You should copy over all of your Java source code and have your .java files in their own directories which are in turn under the src directory. You must also supply an Ant build file.

Make sure that your code is in the ~/cs341proj/proj4/ directory and not in a subdirectory of ~/cs341proj/proj4/. In particular, the following Unix commands should work. (Check this.)

cd ~/cs341proj/proj4 ant compile ant run ant clean