Project 2, Lazy Binary Search Trees

Due: Tuesday, October 22, 11:59pm

Links: [Project Submission] [Late Submissions] [Project Grading] [Grading Guidelines] [Academic Conduct]

Objectives

The objective of this programming assignment is to have you practice using recursion in your programs and to familiarize you with the binary search tree data structure.

Introduction

In real life, laziness if often disdained. (See Wikipedia article on sloth.) In computer science, however, laziness is sometimes a viable strategy. Why do today what you can put off until tomorrow? especially if there is a chance that you won't actually have to do it tomorrow either?

In contrast to a red-black tree, where we diligently maintain a balance condition to guarantee that the tree has O( log n) height, in a lazy BST, we don't worry about the balancing until things get really out of whack. Insert and delete proceed in the same manner as an unbalanced binary search tree until we notice that at some node of the BST, the left subtree is twice as large as the right subtree (or vice versa). When this happens we rebalance the subtree of the lazy BST rooted at this node.

When a subtree of a lazy BST is rebalanced, we convert the entire subtree into a sorted array. Then we convert the array back into a perfectly balanced BST. Rebuilding is easy because the array is sorted. We can find the middle element of the array in constant time and make it the root of the new subtree. Then, we recursively build the left subtree and the right subtree using, respectively, the portion of the array that has keys smaller than the root and the portion of the array that has keys larger than the root. The result is a binary search tree that is as balanced as possible. (See Project 2 Examples.) The rebalance procedure takes O( t ) on a BST subtree with t elements. However, we don't have to rebalance very often — amortized analysis shows that the insert and delete procedures take O( log n ) amortized time on a lazy BST with n elements.

Since rebalancing is expensive, we add another provision: we won't rebalance a subtree that has height ≤ 3. An unbalanced subtree that has height 3 will not add very much to the height of the overall tree and hence will not contribute very much to the running time of the BST procedures. (We adopt the convention used in the textbook and define the height of a leaf to be 0.) By ignoring, small unbalanced subtrees, we can avoid excessive rebalancing.

One note about the rebalance procedure: it is possible for a lazy BST to have two nodes x and y where rebalancing is needed where x is an ancestor of y. In this situation, we want to do the rebalancing at x since rebalancing the subtree rooted at x will also rebalance the subtree rooted at y. If we rebalanced at y first, the time spent rebalancing at y is completely wasted since all that work is undone when we rebalance at x. (See Project 2 Examples.)

Assignment

Note: Running time is one of the most important considerations in the implementation of a data structure. Programs that produce the desired output but exceed the required running times are considered wrong implementations and will receive substantial deductions during grading.

Your assignment is to implement a lazy BST. You may start with the BST class from the textbook or design your own. Each option has advantages and disadvantages. A primary objective of this programming assignment is to have you use recursion. So, one component of grading will evaluate how elegantly you employ recursion to implement this data structure. (Yes, you are being graded on aesthetics!)

In order to implement a lazy BST efficiently, your data structure must be able to determine the size and height of a subtree in constant time. You must have data members for the height and size of a subtree in the class representing the root of a subtree of a lazy BST. The height and size data members must be updated whenever the height or size of that subtree changes. The update must not affect the asymptotic running time of insert, delete and search. These must still run in time proportional to the height of the tree.

Furthermore, your lazy BST implementation must be generic. In particular, it must work with any class that implements the Comparable interface. The declaration of your LazyBST class must be:

package lazybst; public class LazyBST < T extends Comparable<? super T> > { ... }

This declaration is similar to the one used in the textbook. You must not change the package name because doing so will make your program incompatible with the test programs used for grading and will result in deductions.

Here are the methods you must implement in your LazyBST class. (You will need to implement others for your own coding needs.)

An insert() method that adds an item to the LazyBST that has the following signature:
```
   public void insert (T x) ;
```
The insert() method must run in time proportional to the height of the lazy BST. Your LazyBST must not allow duplicates. If the insert() method is invoked with a key value that already appears in the lazy BST, your insert() method should do nothing.
A remove method that finds and removes an item with the given key value. The remove method should return a reference to the item removed. If no such item is in the lazy BST, your remove method should return null. (Do not throw an exception.) The remove method must have the following signature:
```
   public T remove (T x) ;
```
For full credit, your remove() method must run in time proportional to the height of the tree.

Note that the textbook's implementation of the remove method is quite inefficient because in the case where the node to be removed has two children, it first finds the value of the smallest item in the right subtree and then, in a separate step, removes that smallest item from the right subtree. This is inefficient because it uses two traversals of the BST. Your implementation must find and remove the smallest item in the right subtree without traversing the BST a second time.
A find() method that returns the item that has the given key. The signature of the find() method should be:
```
   public T find (T x) ;
```
If no such item exists, the find() method should return null. For full credit, your find() method must run in time proportional to the height of the tree.
A method named span() that returns the number of items in the lazy BST with keys within a specified range. I.e., span(x,y) is the number of items with key ≥ x and ≤ y. The signature of the span() method should be:
```
   public int span (T x, T y) ;
```
For full credit, your span() method must run in time proportional to the height of the tree. Try to make good use of recursion in the computation of span(x,y).
A method named rebalance() that rebalances a subtree of the lazy BST as described above. The running time of rebalance() must be proportional to the subtree being rebalanced. Note that a proper implementation would require you the keep track of the size and height of the subtree. Read the description above.
A dump() method that prints out each item in the LazyBST in increasing order. Next to each item, the dump() method must also print out the item's height and the size of the subtree rooted at the item.
```
   public void dump () ;
```

Finally, your lazy BST class must be called LazyBST and must be in a package called lazybst ( NOT proj2 or anything else). Your LazyBST class must be accessible to a main program in a different package. You can check that your code compiles correctly with this sample main program: Proj2Test.java. This test program must be placed in a separate directory named driver (since it belongs to the driver package). The output of Proj2Test.java should look something like this: Proj2Test.txt. Your code must compile with Proj2Test.java without alteration.

Note: "without alteration" means you cannot change even one character of the file — so DO NOT change the package name and DO NOT change the import directives. If your program does not compile with Proj2Test.java on GL using ANT, you must change your program to make it compile. This is the whole point of providing a sample test program.

Addendum: Here are two simple test programs to check that your LazyBST class works with different data types.

Proj2StringTest.java uses a lazy BST of String objects. (Sample output.)
Proj2GroceryTest.java uses a lazy BST of GroceryItem objects. Each GroceryItem has an itemName, a category and a quantity. The binary search tree is sorted by the itemName field. This example also demonstrates the utilty of the find() method. You can use the itemName field to find the object in the LazyBST and then use that reference to access the other data members of the object (namely, category and quantity). (Sample output.)

Java 6: Here are versions of the test programs for people still using Java 6.

Implementation Notes

Here we list some recommendations and point out some traps and pitfalls.

Recall that it is rather complicated to create an array of generic type inside a generic method. In particular, this does not work:
```
   T [] myArray = new T[someSize];
```
For rebalance(), you should use ArrayList from the Java Collections API to create an array (because ArrayList knows how to create an array of generic type). Make sure that you specify the size of the ArrayList when you invoke the ArrayList constructor. This ensures that the ArrayList does not have to be expanded during the rebalance procedure.
Remember that we are defining the height of a leaf node to be 0. (The leaf node here is a node that contains actual data, not the null references at the bottom of a BST.)
There are many places where the height and size of a node needs to be updated including, for example, in the rebalance procedure.
When you insert a key that is already in the binary search tree, you are supposed to do nothing. (This is one of the standard alternatives and the one chosen by the textbook.) This means you have to be careful about how you update the sizes of the subtrees, because when you insert a duplicate, the size does not change! (and you won't find out that it is a duplicate until you've found its 'clone').
When should we check if we need to rebalance? One place to consider is after we modify the lazy BST in the insert() and remove() procedures. However, we want to do the rebalancing as high up the tree as possible. (See note above.) So, checking for rebalancing after insert() and remove() would require another traversal of the lazy BST from the root.

Instead, it is much more convenient to check for rebalancing before we insert or remove an item (since we are traversing the BST top down). This may seem counter-intuitive since insert() and remove() will mess up our nicely balanced BST right after we cleaned it up. However, even if we check for rebalancing after these operations, the next insert() or remove() will mess up the tree anyway.

Another temptation is to insert() or remove() the item during the rebalance() procedure. (Hey, we are taking this subtree apart anyway, surely we can toss in or remove an item while we are at it.) This is possible, but not elegant. Let's concentrate on elegant uses of recursion in this project.
When the dump() procedure prints out an item, it should rely on the toString() method of the item. (Note that System.out.println() does this already.) The lazy BST might hold references to complex objects that have many data members. It is up to the object to supply a reasonable toString() method. For example, if the object is a student record with the student's complete registration history, the toString() method might simply print out the student's name.
Here is a copy of the textbook's binary search tree implementation: BinarySearchTree.java. It has been modified to throw the standard NoSuchElement exception when findMin() or findMax() is invoked with an empty BST. There are no other changes. Think about how you want to re-implement remove() before deciding whether you want to modify the textbook implementation or design your own class. You definitely do not want to use the height() method from this implementation.

What to Submit

Read the course project submission procedures. Submission closes by script immediately after midnight. Submit well before the 11:59pm deadline, because 12:00:01 might already be late (the script takes only a few seconds to run).

You should copy over all of your Java source code and have your .java files in their own directories which are in turn under the src directory. You must also supply an ANT build file.

Make sure that your code is in the ~/cs341proj/proj2/ directory and not in a subdirectory of ~/cs341proj/proj2/. In particular, the following Unix commands should work. (Check this.)

cd ~/cs341proj/proj2
ant compile
ant run
ant clean