UMBC CMSC202, Computer Science II, Fall 1998,
Sections 0101, 0102, 0103, 0104
25. Hash Tables
Thursday December 03, 1998
Assigned Reading: 10.7-10.9
Topics Covered:
- We revisit some issues about the new and delete
operators.
- First, despite what some manuals might say, when we use the
new operator with the SGI CC compiler, the operator returns NULL
if it cannot allocate the requested amount of memory. (See test program and sample run.)
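The NULL-on-failure behavior above is what we observed with the SGI CC compiler. In standard C++, a plain new expression reports failure by throwing std::bad_alloc, while the std::nothrow form returns a null pointer. A minimal sketch of the nothrow form follows; the helper name tryAllocate is hypothetical and is not taken from the lecture's test program.

```cpp
#include <cstddef>
#include <new>

// Hypothetical helper (not from the lecture's test program):
// allocate an array of n ints and return a null pointer on failure
// instead of throwing std::bad_alloc.
int* tryAllocate(std::size_t n) {
    return new (std::nothrow) int[n];
}
```

A caller would check the result against the null pointer before using it, and release the array with delete [] when done.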
- We can define a new operator in a class. We discussed overloading
the new operator in Lecture 22.
When new is used to dynamically allocate space for an object, the new
operator in that class is used if one is defined. You can also define a new[]
operator, which is used when an array of objects is dynamically allocated.
We can also define analogous delete and delete[] operators.
- We looked at a simple program that uses a Test class that has
new, new[], delete and delete[] defined. (See program and sample run.)
Some noteworthy points:
- The Test constructors and Test destructors are still called as before.
The class-defined new and delete operators simply allocate and deallocate
memory and do not have to deal with constructors and destructors. (This
is good.)
- If you mistakenly use delete instead of delete [] to destroy an array
of objects, the behavior is undefined in C++. With this compiler, the
generated code only destroys the first element of the array.
- Note that the array of 10 Test items takes 56 bytes and not 40 bytes.
Since each Test object only uses 4 bytes, the extra 16 bytes must be used
to store the number of objects in the array and the size of each object.
- We update our BString class and our GenList template class to have new
and delete operators.
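The class-defined operators discussed above can be sketched as follows. This is not the lecture's Test program: the class body, the use of malloc and free, and the static counters (newCalls, constructed, destroyed, lastArrayBytes) are assumptions added only to make the calls observable.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for the lecture's Test class.
class Test {
public:
    Test() : value(0) { ++constructed; }
    ~Test() { ++destroyed; }

    // Class-specific allocation for a single object.
    // Only raw memory is handled here; the compiler still calls
    // the constructor after this function returns.
    static void* operator new(std::size_t bytes) {
        ++newCalls;
        void* p = std::malloc(bytes);
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) { std::free(p); }

    // Class-specific allocation for arrays. The requested size may be
    // larger than (element count * sizeof(Test)) because the compiler
    // may ask for extra bookkeeping space.
    static void* operator new[](std::size_t bytes) {
        lastArrayBytes = bytes;
        void* p = std::malloc(bytes);
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }
    static void operator delete[](void* p) { std::free(p); }

    // Counters (assumptions for illustration only).
    static int newCalls;
    static int constructed;
    static int destroyed;
    static std::size_t lastArrayBytes;

private:
    int value;
};

int Test::newCalls = 0;
int Test::constructed = 0;
int Test::destroyed = 0;
std::size_t Test::lastArrayBytes = 0;
```

After new Test[10], comparing lastArrayBytes with 10 * sizeof(Test) shows how much extra space, if any, a particular compiler requests for the array.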
- Hash Tables: we have looked at a few data structures: arrays,
linked lists and binary search trees. Each of these has advantages and
disadvantages. In a sorted array, we can use binary search to find an item
in O(log n) time. However, inserting into a sorted array takes O(n) time
(linear time). In an unsorted linked list, search takes linear time, but
insertion can be done in constant time (just insert at the front or the
back). Using a sorted linked list increases the time it takes to insert an
item without making a big improvement in the search time, since both
operations now take linear time. A balanced binary search tree allows you to
insert, delete and search in O(log n) time. A hash table allows you to
insert, delete and search in constant time on average. So, if the only
operations you need to support are insert, delete and search, a hash table
offers many advantages.
- An example: suppose that you are the UMBC registrar and you want to
store and retrieve student records based upon the student's social security
number (ssn). There is an easy way to do this quickly: simply create a huge
array of records indexed from 0 to 999,999,999. To retrieve a student's
record simply use his/her social security number as the index. The only
disadvantage of this method is that it uses too much memory. As an
alternative, we can use just the last 4 digits of a student's social
security number. Then we would only need 10,000 entries. The disadvantage
here is that there are more than 10,000 students at UMBC, so many students
would have to use the same index. To solve this problem, we keep a linked
list at each entry. For example, if two students have social security
numbers that end in 6666, then the 6666 entry of the table is a linked list
with the two students' records.
- We have here the main ideas of a hash table. The hash table is an
array of linked lists. The key used for hashing is the student's ssn. The
hash function takes the key and transforms it into a legal index value for
the hash table. In this example, the hash function simply takes the ssn
and removes the first 5 digits. Ideally, a hash function would evenly
distribute the keys in the hash table. That way, each linked list in the
hash table would be relatively short. When two keys hash to the same index
value, the situation is called a collision. With 12,000 students
and an ideal hash function, each linked list in the hash table would only
have 1 or 2 elements. Thus, searching, inserting and deleting from this
hash table would take constant time.
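The registrar example can be sketched in a few lines. This is only an illustration under stated assumptions: Record, lastFourDigits and LastFourTable are hypothetical names, and std::vector and std::list stand in for the array and the linked lists.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical record type; a real student record would have more fields.
struct Record {
    long ssn;
};

// Hash function from the example: keep only the last 4 digits of the ssn.
std::size_t lastFourDigits(long ssn) {
    return static_cast<std::size_t>(ssn % 10000);
}

// A table of 10,000 linked lists, one per possible 4-digit ending.
class LastFourTable {
public:
    LastFourTable() : chains(10000) {}

    // Records whose ssn's collide simply share a list.
    void insert(const Record& r) {
        chains[lastFourDigits(r.ssn)].push_back(r);
    }

    // Number of records stored at a given index.
    std::size_t chainLength(std::size_t index) const {
        return chains[index].size();
    }

private:
    std::vector<std::list<Record> > chains;
};
```

Inserting two records whose ssn's both end in 6666 leaves chainLength(6666) equal to 2, which is the collision described above.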
- So, is taking the last 4 digits of the ssn a good hash function? It is
theoretically possible that next year every entering freshman has the same
last 4 digits in their ssn. Then, our hash table would simply be an unsorted
linked list and the performance of search would be poor. However, our
experience with ssn's tells us that the chance of this happening is
small. The design of a good hash table depends on having a good hash
function. There are schemes for picking provably good hash functions, which
would be discussed in an algorithms class, not here.
- One disadvantage of using the last 4 digits of an ssn as a hash
function is that we are not able to control the size of our hash table very
well. If UMBC's enrollment increased to 20,000, our only choice would be to
use 5 digits of the ssn and have a table of size 100,000. Another hash
function we can use is to take the remainder of the ssn modulo some prime
number N. That leaves us with a value between 0 and N-1. If we have
a hash table with N entries, then this value can be used directly as the
index into the hash table.
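The remainder-based hash function is a one-liner. In this sketch the name hashSsn and the choice to pass the table size as a parameter are assumptions for illustration; the ssn is assumed to be non-negative.

```cpp
#include <cstddef>

// Hypothetical free function: map an ssn to an index in 0 .. tableSize-1.
// tableSize is assumed to be a prime number chosen by the caller, and
// ssn is assumed to be non-negative.
std::size_t hashSsn(long ssn, std::size_t tableSize) {
    return static_cast<std::size_t>(ssn) % tableSize;
}
```

Unlike the last-4-digits scheme, the table size here can be any prime N, so the table can grow in small steps as enrollment grows.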
- We implement a hash table as an array of linked lists. (See the header
file hash.h and implementation.) Each linked list is a list
of StudentRecord using the latest version of the templated GenList class.
The StudentRecord class is straightforward (header file and implementation).
- The implementation of the HashTable
class is relatively simple. The most complicated function is the
constructor. Here we first look for a prime number greater than or equal
to the parameter size. Recall that our hash function simply takes
a student's ssn, divides it by the size of the table and uses the remainder
as the index for the table. Choosing a prime number for the table size tends
to reduce the number of collisions. A little number theory (Bertrand's
Postulate) tells us that there is always a prime number between size
and 2 * size, so the search never has to look beyond 2 * size. The
HashTable class does not have a default constructor. Each time you create
a HashTable you must specify the size of the table.
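The prime search in the constructor might look like the sketch below. It is not the code from the lecture's implementation file: isPrime and nextPrime are hypothetical names, and simple trial division is assumed to be fast enough for table sizes of this magnitude.

```cpp
// Trial-division primality test (an assumption; the lecture's
// implementation may test for primes differently).
bool isPrime(unsigned long n) {
    if (n < 2) return false;
    // Only divisors up to the square root of n need to be tried.
    for (unsigned long d = 2; d <= n / d; ++d) {
        if (n % d == 0) return false;
    }
    return true;
}

// Smallest prime greater than or equal to size.
unsigned long nextPrime(unsigned long size) {
    unsigned long candidate = (size < 2) ? 2 : size;
    while (!isPrime(candidate)) {
        ++candidate;
    }
    return candidate;
}
```

A constructor would then allocate nextPrime(size) entries rather than size entries.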
- Otherwise, the HashTable member functions are straightforward. In the
insert function, for example, we simply compute the hash table index and
append the item to the list at that index.
void HashTable::insert(const StudentRecord& x) {
    int index = hash(x.ssn) ;   // Hash by ssn
    count++ ;                   // One more record in the table
    table[index].append(x) ;    // Add to the list at that index
}
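Searching follows the same pattern as insert: hash the key, then walk the one list it could be in. The sketch below is a self-contained approximation, not the lecture's HashTable. MiniHashTable and its find function are hypothetical, StudentRecord is reduced to a single field, and std::list replaces the GenList template.

```cpp
#include <cstddef>
#include <list>
#include <vector>

// Simplified stand-in for the lecture's StudentRecord.
struct StudentRecord {
    long ssn;
};

// Hypothetical miniature of the HashTable class.
class MiniHashTable {
public:
    // primeSize is assumed to already be a prime number.
    explicit MiniHashTable(std::size_t primeSize)
        : table(primeSize), count(0) {}

    void insert(const StudentRecord& x) {
        table[hash(x.ssn)].push_back(x);
        ++count;
    }

    // Returns a pointer to the stored record, or a null pointer if
    // no record with this ssn is in the table.
    const StudentRecord* find(long ssn) const {
        const std::list<StudentRecord>& chain = table[hash(ssn)];
        for (std::list<StudentRecord>::const_iterator it = chain.begin();
             it != chain.end(); ++it) {
            if (it->ssn == ssn) return &*it;
        }
        return nullptr;
    }

    std::size_t size() const { return count; }

private:
    // Remainder of the ssn modulo the table size.
    std::size_t hash(long ssn) const {
        return static_cast<std::size_t>(ssn) % table.size();
    }

    std::vector<std::list<StudentRecord> > table;
    std::size_t count;
};
```

Only the records that hash to the same index are examined, which is why search stays fast as long as the lists stay short.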
- We test the HashTable class with two main programs. The first main
program is a trivial test of the HashTable member functions. (See sample
run.) The second program inserts random ssn's into the hash table to test
the number of collisions. The sample run shows that the average number of
collisions is fairly predictable. With a good hash function, we can
control the average number of collisions by adjusting the size of the
table.
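An experiment in the spirit of the second test program can be sketched as below. This is not the lecture's program: the function name averageChainLength, the use of std::mt19937, and the statistic it reports (keys per non-empty table entry) are all assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical experiment: hash numKeys random 9-digit keys into
// tableSize entries using (key % tableSize), and return the average
// number of keys per non-empty entry.
double averageChainLength(std::size_t numKeys, std::size_t tableSize,
                          unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<long> randomSsn(0L, 999999999L);

    // Only the list lengths matter here, so store counts, not records.
    std::vector<std::size_t> chain(tableSize, 0);
    std::size_t nonEmpty = 0;

    for (std::size_t i = 0; i < numKeys; ++i) {
        std::size_t index =
            static_cast<std::size_t>(randomSsn(gen)) % tableSize;
        if (chain[index]++ == 0) {
            ++nonEmpty;  // first key to land in this entry
        }
    }
    if (nonEmpty == 0) return 0.0;
    return static_cast<double>(numKeys) / static_cast<double>(nonEmpty);
}
```

Running it with the same number of keys and a larger table size should give a smaller average, which is the sense in which the table size controls the number of collisions.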
Last Modified:
5 Dec 1998 00:04:16 EST
by
Richard Chang