Hash Tables
At this point we have seen several mutable data structures using refs and arrays. Today we will look at hash tables.
We've seen various implementations for building set or map abstractions. First we had simple lists, which have O(n) access time. We can also implement sets (or maps) using binary search trees, by storing values (or key-value pairs) in the nodes of the tree. If the tree is balanced, then access operations in tree implementations take O(lg n) time. Our best results so far are these:
                  | linked list, no duplicates | balanced trees (e.g., AVL)
add (insert)      | O(n)                       | O(lg n)
delete (remove)   | O(n)                       | O(lg n)
member (contains) | O(n)                       | O(lg n)
What if we could do even better? It turns out that we can implement mutable sets and maps more efficiently than the immutable (functional) sets and maps we've been looking at so far. In fact, we can perform these operations in expected O(1) time with hash tables, implemented using arrays. The idea is to exploit the power of arrays to update an arbitrary element in O(1) time.
In a hash table we store entries using an array of lists of elements. We ensure that each list is small (O(1) elements, to be precise), so simple functional lists without duplicates work fine. This data structure (the hash table) is a big array of O(n) elements, called buckets. Each bucket is a functional (immutable) set containing O(1) elements, and the elements of the set as a whole are partitioned among all the buckets. We cannot really guarantee that the sets are O(1) in size, so hashing is expected-case O(1) time, but in the worst case can be O(n), if a large fraction of the elements all hash to the same bucket. Thus it is important to ensure that the mapping of elements to buckets "spreads out" the elements across the buckets.
There is one key piece missing: in which bucket should a set element be stored? We provide a hash function h(e) that, given a set element e, returns the index of the bucket that element should be stored into. The hash table works well if each element is equally and independently likely to be hashed into any particular bucket; this condition is the simple uniform hashing assumption. Suppose we have n elements in the set and the bucket array has length m. Then we expect a = n/m elements per bucket. The quantity a is called the load factor of the hash table. If the set implementation used for the buckets has linear performance, then we expect to take O(1+a) time to do add, remove, and member. If the number of buckets is proportional to the number of elements, the load factor a is O(1), so all the operations are also O(1) on average. Notice that the worst-case performance of a hash table is O(n), however, because in the worst case all of the elements hash to the same bucket. If the hash function is chosen well, this will be extremely unlikely.

A Hash Table Signature
Hash tables are generally implemented as maps, mapping from keys to values. It is easy to implement a set as a map, by simply using the presence of a key to indicate that an element is in the set.
Here is a simple signature for hash tables.
signature HASH_TABLE = sig
  type key
  type value
  type hashtable

  (* new(n) creates a hash table with n buckets *)
  val new : int -> hashtable

  (* insert(ht,key,value) updates a hash table to contain the
   * mapping from key to value. If key already is in the hash
   * table then this new value is stored in place of the old one. *)
  val insert : hashtable * key * value -> unit

  (* lookup(ht,key) returns the corresponding SOME value if key is
   * found in the hash table, otherwise NONE *)
  val lookup : hashtable * key -> value option

  (* delete(ht,key) removes the specified key and its value if present,
   * otherwise has no effect *)
  val delete : hashtable * key -> unit

  (* The size of a hash table *)
  val size : hashtable -> int
end
Hash functions
A key issue that affects our implementation is our choice of the hash function (and the number of buckets, which is of course determined by the hash function). Clearly, a bad hash function can destroy our attempts at a constant running time, since in the worst case we have to search a bucket containing O(n) elements. If we're mapping names to phone numbers, then hashing each name to its length would be a very poor function, as would a hash function that used only the first name, or only the last name. We want our hash function to use all of the information in the key.
With modular hashing, the hash function is simply h(k) = k mod m for some m (typically the number of buckets). This is easy to compute quickly when we consider the bit-level representation of the key k as representing a number. Certain values of m produce poor results, though; in particular, if m is a power of two (i.e., m = 2^p), then h(k) is just the p lowest-order bits of k. Generally we prefer a hash function that uses all the bits of the key, so that any change in the key is likely to change the bucket it maps to. In practice, primes not too close to powers of 2 work well.
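As a tiny illustration (not part of the implementation later in these notes, and with made-up key values), here is modular hashing on an integer key in SML; the point is only that a power-of-two m looks at nothing but the low-order bits:

(* Modular hashing sketch: with m = 8 = 2^3, only the 3 lowest-order bits of
 * the key determine the bucket, so keys differing only in higher bits collide. *)
fun modularHash (k: int, m: int): int = Int.mod (k, m)

val b1 = modularHash (0x1234, 8)     (* = 4: low bits are 100 *)
val b2 = modularHash (0x5674, 8)     (* = 4: same low bits, same bucket *)
val b3 = modularHash (0x1234, 1009)  (* a prime such as 1009 uses more of the key *)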
Another alternative is multiplicative hashing, in which we compute (ka / 2^q) mod m for appropriately chosen values of a, m, and q. This works well for the same reason that linear congruential multipliers generate apparently random numbers. The multiplier a should be large and its binary representation should be a "random" mix of 1's and 0's; q is chosen so that all of the high bits of the product are retained before computing the modulus. Multiplicative hashing is cheaper than modular hashing and it works well with a bucket array of size m = 2^p, which is convenient.
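A minimal sketch of multiplicative hashing in SML, assuming a word-sized key, m = 2^p buckets, and q = wordSize - p; the multiplier below is an arbitrary illustrative constant, not one prescribed by these notes:

(* Multiplicative hashing sketch: k*a is computed modulo 2^wordSize by word
 * arithmetic; shifting right by (wordSize - p) keeps the top p bits of the
 * product, giving a bucket index in [0, 2^p). *)
fun multHash (k: word, p: word): word =
  let
    val a = 0wx61c88647   (* assumed multiplier with a mixed pattern of bits *)
  in
    Word.>> (k * a, Word.fromInt Word.wordSize - p)
  end

val bucket = multHash (0w12345, 0w10)   (* an index into 2^10 = 1024 buckets *)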
Ideally you should test your hash function to make sure it behaves well with real data. With any hash function, it is possible to generate data that cause it to behave poorly, but a good hash function will make this unlikely. A good way to determine whether your hash function is working well is to measure the clustering of elements into buckets. If bucket i contains x_i elements, then the clustering is (Σ_i x_i^2)/n - n/m. A uniform hash function produces clustering near 1.0 with high probability. A clustering factor of c means that the performance of the hash table is slowed down by a factor of c relative to its performance with a uniform hash function and the same array size. If clustering is less than 1.0, the hash function is doing better than a uniform random hash function ought to: this is rare. Note that clustering is independent of the load factor.
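The clustering measure is easy to compute directly from the bucket-array representation used later in these notes. The helper below is a sketch (it is not part of the hash table interface) and assumes a non-empty table:

(* Clustering sketch: buckets is an array of lists, x_i is the length of
 * bucket i, n the total number of entries, m the number of buckets.
 * Returns (sum_i x_i^2)/n - n/m; values near 1.0 indicate uniform hashing. *)
fun clustering (buckets: 'a list array): real =
  let
    val m = Array.length buckets
    val n = Array.foldl (fn (b, acc) => length b + acc) 0 buckets
    val sumSquares =
      Array.foldl (fn (b, acc) => acc + real (length b * length b)) 0.0 buckets
  in
    sumSquares / real n - real n / real m
  end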
Open Addressing
An alternative to hashing with buckets is open addressing. Instead of storing a set at every array index, a single element is stored there. If an element is inserted in the hash table and collides with an element already stored at that index, a second possible location for it is computed. If that is full, the process repeats. There are various strategies for generating a sequence of hash values for a given element: linear probing, quadratic probing, double hashing. We have chosen not to talk about open addressing in detail because in practice it is slower than an array of buckets. The performance of open addressing becomes very bad when the load factor approaches 1, because a long sequence of array indices may need to be tried for any given element -- possibly every element in the array! Therefore it is important to resize the array when the load factor exceeds 2/3 or so. The bucket approach, by contrast, suffers gradually declining performance as the load factor grows, and there is no fixed point beyond which resizing is absolutely needed. With buckets, a sophisticated application can defer the O(n) cost of resizing its hash tables to a point in time when it is convenient to incur it: for example, when the user is idle.
A Hash Table Implementation
Here we develop an implementation of hash tables where the keys and values are both strings. First we extend the hash table signature to specify that the keys and values are strings, and then define an implementation of the resulting abstraction.
Note: what would happen if we tried to define StringHT as an opaque implementation of HASH_TABLE rather than STRING_HT?
This is a basic implementation that simply stores a list without "duplicates" in each bucket. Each element of the list is a (key, value) pair such that the key hashes to the index of that bucket in the array. No two elements in the list have the same key (i.e., duplication is determined by keys, not by pairs).
signature STRING_HT = HASH_TABLE
  where type key = string and type value = string

structure StringHT :> STRING_HT = struct
  type key = string
  type value = string
  type hashtable = ((key * value) list) array

  (* helper function to hash a string to an int. Hashes to a word using
   * byte shift operations and then converts to an int using the algorithm
   * from SML's hashString structure *)
  fun hash (k: key): int =
    let
      fun charToWord c = Word.fromInt (Char.ord c)
      fun hashChar (c, h) = Word.<<(h, 0w5) + h + 0w720 + (charToWord c)
    in
      Int.abs (Word.toIntX (CharVector.foldl hashChar 0w0 k))
    end

  fun new (n: int): hashtable = Array.array (n, nil)

  (* Helper function to remove an element from a list *)
  fun remove (k: key, l: (key * value) list) =
    List.filter (fn (k', _) => k' <> k) l

  fun insert (t: hashtable, k: key, v: value): unit =
    let
      val bucket = Int.mod (hash k, Array.length t)
      val entries = Array.sub (t, bucket)
    in
      if List.exists (fn (k', _) => k = k') entries
      then Array.update (t, bucket, (k, v) :: remove (k, entries))
      else Array.update (t, bucket, (k, v) :: entries)
    end

  fun lookup (t: hashtable, k: key): value option =
    let
      val bucket = Int.mod (hash k, Array.length t)
      val entries = Array.sub (t, bucket)
    in
      case List.find (fn (k', _) => k = k') entries of
          SOME (_, v) => SOME v
        | NONE => NONE
    end

  fun delete (t: hashtable, k: key): unit =
    let
      val bucket = Int.mod (hash k, Array.length t)
      val entries = Array.sub (t, bucket)
    in
      if List.exists (fn (k', _) => k = k') entries
      then Array.update (t, bucket, remove (k, entries))
      else ()
    end

  fun size (t: hashtable): int =
    Array.foldl (fn (x, y) => length x + y) 0 t
end
Specifying hash functions
Hash tables are one of the most useful data structures ever invented. Unfortunately, they are also one of the most misused. Code built using hash tables often does not get anywhere near the possible performance, because of badly designed hash functions. The reason for this comes down to a common failure to adequately specify the requirements on the hash function.
Above we described hashing as a function that maps a key to a bucket index. Recall that hash tables work well when the hash function satisfies the simple uniform hashing assumption -- that the hash function should look random. If it is to look random, this means that any change to a key, even a small one, should change the bucket index in an apparently random way. If we imagine writing the bucket index as a binary number, a small change to the key should randomly flip the bits in the bucket index. This is a kind of information diffusion: a one-bit change to the key should randomly affect every bit in the index.
What happens if we try inserting "successive" strings into our hash table, such as "a", "b", "c", "d", "e"? How many buckets are filled, say, for a table with 1000 buckets?
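One way to answer this (a hypothetical experiment, reusing the hash function from the implementation above outside the abstraction boundary) is to compute the bucket index for each key directly:

local
  (* same character-folding hash as in StringHT above *)
  fun charToWord c = Word.fromInt (Char.ord c)
  fun hashChar (c, h) = Word.<<(h, 0w5) + h + 0w720 + (charToWord c)
  fun hash k = Int.abs (Word.toIntX (CharVector.foldl hashChar 0w0 k))
in
  (* bucket indices for successive one-character keys in a 1000-bucket table *)
  val buckets = List.map (fn k => Int.mod (hash k, 1000)) ["a", "b", "c", "d", "e"]
end

With this particular hash function, successive one-character keys hash to successive integers, so each key lands in its own bucket, but in consecutive buckets: the keys are spread out, yet in a very regular, non-random way.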
Resizable hash tables and amortized analysis
The claim that hash tables give O(1) performance is based on the assumption that m = O(n). If a hash table has many elements inserted into it, n may become much larger than m and violate this assumption. The effect will be that the bucket sets will become large enough that their bad asymptotic performance will show through. The solution to this problem is relatively simple: when the load factor exceeds some constant factor, the array must be increased in size and all the elements rehashed into the new buckets using an appropriate hash function. Each resizing operation therefore takes O(n) time, where n is the size of the hash table being resized. Therefore the O(1) performance of the hash table operations no longer holds in the case of add: its worst-case performance is O(n).
This isn't really as much of a problem as it might sound. If the bucket array is doubled in size every time it is needed, then the insertion of n elements in a row into an empty array takes only O(n) time, perhaps surprisingly. We say that add has O(1) amortized run time because the time required to insert an element is O(1) on the average even though some elements trigger a lengthy rehashing of all the elements of the hash table.
To see why this is, suppose we insert n elements into a hash table while doubling the number of buckets when the load factor crosses some threshold. A given element may be rehashed many times, but the total time to insert the n elements is still O(n). Consider inserting n = 2^k elements, and suppose that we hit the worst case, where the resizing occurs on the very last element. Since the bucket array is being doubled at each rehashing, the rehashes must all occur at powers of two. The final rehash rehashes all n elements, the previous one rehashes n/2 elements, the one previous to that n/4 elements, and so on. So the total number of hashes computed is n hashes for the actual insertions of the elements, plus n + n/2 + n/4 + n/8 + ... = n(1 + 1/2 + 1/4 + 1/8 + ...) = 2n hashes, for a total of 3n hashing operations.
No matter how many elements we add to the hash table, there will be at most three hashing operations performed per element added. Therefore, add takes amortized O(1) time even if we start out with a bucket array of one element!
Another way to think about this is that the true cost of performing an add is about triple the cost observed on a typical call to add. The remaining 2/3 of the cost is paid as the array is resized later. It is useful to think about this in monetary terms. Suppose that a hashing operation costs $1 (that is, 1 unit of time). Then a call to add costs $3, but only $1 is required up front for the initial hash. The remaining $2 is placed into the hash table element just added and used to pay for future rehashing. Assume that each time the array is resized, all of the remaining money gets used up. At the next resizing, there are n elements and n/2 of them have $2 on them; this is exactly enough to pay for the resizing. This is really an argument by induction, so we'd better examine the base case: when the array is resized from one bucket to two, there is $2 available, which is $1 more than needed to pay for the resizing. That extra $1 will stick around indefinitely, so inserting n elements starting from a 1-element array takes at most 3n-1 element hashes, which is O(n) time. This kind of analysis, in which we precharge an operation for some time that will be taken later, typifies amortized analysis of run time.
Notice that it was crucial that the array size grows geometrically (doubling). It is tempting to grow the array by a fixed increment (e.g., 100 elements at a time), but this causes n elements to be rehashed O(n) times on average, resulting in O(n^2) asymptotic insertion time!
Any fixed threshold load factor is equally good from the standpoint of asymptotic run time, but a good rule of thumb is that rehashing should take place at a=3. One might think that a=1 is the right place to rehash, but in fact the best performance is seen (for buckets implemented as linked lists) when load factors are in the 1-2 range. When a<1, the bucket array contains many empty entries, resulting in suboptimal performance of the computer's memory system. There are many other tricks that are important for getting the very best performance out of hash tables.
How would one extend the above implementation to do resizing? Remember, we don't want to do anything expensive on every insertion; the whole goal is to have O(1) insert time. How about explicitly storing the number of entries in the table? Then grow the table (doubling the number of buckets) and rehash all the elements when this count gets to be larger than some multiple of the number of buckets.
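Here is one possible sketch of that plan. It assumes a modified representation that carries an explicit entry count, say type hashtable = {buckets: ((key * value) list) array ref, count: int ref}, and that hash is the function defined above; none of these names are part of the implementation given earlier.

(* Resizing sketch: called after each insertion. When the load factor exceeds
 * 3, allocate a bucket array twice as large and rehash every entry into it. *)
fun maybeResize ({buckets, count}: {buckets: ((key * value) list) array ref,
                                    count: int ref}): unit =
  if !count <= 3 * Array.length (!buckets) then ()
  else
    let
      val old = !buckets
      val fresh = Array.array (2 * Array.length old, nil)
      fun reinsert (k, v) =
        let val b = Int.mod (hash k, Array.length fresh)
        in Array.update (fresh, b, (k, v) :: Array.sub (fresh, b)) end
    in
      Array.app (List.app reinsert) old;
      buckets := fresh
    end

The amortized analysis above then applies: each individual insert stays O(1) on average even though an occasional insert triggers an O(n) rehash.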
How much of the above code would be the same if we defined a hash table where the keys and values are integers? How would you abstract out the commonalities?
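Most of the code would be identical; only the hash function and the key equality test change. One way to abstract out the commonalities (a sketch with hypothetical names, not something defined in these notes) is a functor parameterized by the key operations:

(* Hypothetical interface for hashable keys *)
signature HASHABLE = sig
  type key
  val hash  : key -> int
  val equal : key * key -> bool
end

(* The functor body would contain essentially the StringHT code, with Key.hash
 * in place of hash and Key.equal in place of = on keys; only insert is shown. *)
functor HashTableFn (structure Key : HASHABLE
                     structure Val : sig type value end) = struct
  type key = Key.key
  type value = Val.value
  type hashtable = ((key * value) list) array

  fun new (n: int): hashtable = Array.array (n, nil)

  fun insert (t: hashtable, k: key, v: value): unit =
    let
      val bucket = Int.mod (Key.hash k, Array.length t)
      val entries = List.filter (fn (k', _) => not (Key.equal (k', k)))
                                (Array.sub (t, bucket))
    in
      Array.update (t, bucket, (k, v) :: entries)
    end
end

An integer-keyed table could then be obtained by applying the functor to a structure providing the int key type, an identity-style hash, and = as equality.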
Using references
ML allows you to allocate and manipulate mutable storage cells. Here are a few simple examples:

val r = ref 0      (* create an int cell initialized to 0 *)
val s = ref 0      (* create an int cell initialized to 0 *)
val _ = r := 3     (* update value of cell pointed to by r to 3 *)
val x = !s + !r    (* adds values in cells pointed to by s and r *)
val t = r          (* t and r now point to the same cell *)
val _ = t := 5     (* update cell pointed to by t (and r) *)
val y = !s + !t

In the presence of references, the aliasing problem arises: two different "names" for the same storage cell can be created. Aliasing makes it difficult to reason about programs, as shown in the following example:
fun f (a: int ref, b: int ref) = (a := !a + 1; !b)
val s = ref 0
val t = s
f(s, t)    (* will return 1 *)
Implementing counters using references
Suppose we want to be able to count how many times a function is invoked in the program.

First attempt: wrap the function to produce another function that increments the counter and then calls the function:
val counter = ref 0;
fun count f = (counter := !counter + 1; f)

val add' = count add;
add' 5 6
!counter    (* will return 1 *)
add' 6 7
!counter    (* will return 1 *)

Problem: counter is only incremented when we wrap the function. We want it to be incremented whenever we invoke the function.

Second attempt:
fun count f = fn x => (counter := !counter + 1; f x);
val add' = count add;
add' 4 5;

Problem: this works correctly, but we have only one counter shared by every function we want to wrap.

Third attempt:
fun count f =
  let val counter = ref 0 in
    fn x => (counter := !counter + 1; f x)
  end

Problem: counter is not visible to the rest of the program, so we have no way to read the counter value when we are done invoking the function!

Fourth attempt:
fun count f =
  let val counter = ref 0 in
    (fn x => (counter := !counter + 1; f x),
     fn () => !counter)
  end

fun add x y = x + y
val (add', read_counter) = count add
read_counter()
add' 3 4
read_counter()
...
Implementing recursive functions with references
From our previous discussions we know that we can only declare recursive functions using fun or val rec declarations. We also know that we can use val declarations to define simple (non-recursive) functions. That is, a val declaration of a function doesn't allow us to recursively call (or refer to) the function in its own declaration. We show the power of references by implementing recursive functions using plain val declarations.

We illustrate our method using the canonical example of a recursive function, the factorial:
fun fact(n: int): int = if n = 0 then 1 else n * (fact(n - 1))

If we were to use a val statement to declare fact, the declaration would fail because of the recursive call fact(n - 1). To avoid this problem, we replace the recursive call with a call to a function specified using a reference:
val x: (int -> int) ref = ref (fn _: int => 1)

val fact: int -> int =
  (fn (n: int) => if n = 0 then 1 else n * (!x)(n - 1))

Function fact does not really implement the factorial yet:
fact(n) = 1    if n = 0
fact(n) = n    otherwise

We can "convince" fact to implement the factorial by resetting reference x to fact itself:
val () = x := fact

With this transformation, we now have a correct implementation of the factorial function:
- fact 0;
val it = 1 : int
- fact 1;
val it = 1 : int
- fact 3;
val it = 6 : int
- fact 10;
val it = 3628800 : int
Bad use of imperative language features
References can be used to implement imperative-style mutable variables, by creating one ref cell per variable. However, writing imperative-style programs in a functional language such as ML represents bad use of the language. The imperative-style factorial function below illustrates the misuse of references:

fun imp_fact (n: int) =
  let
    val result = ref 1
    val i = ref 0
    fun loop () =
      if !i = n then ()
      else (i := !i + 1; result := !result * !i; loop())
  in
    (loop(); !result)
  end

Point: factorial is really a pure function that does not need storage or assignment.