Sunday, November 11, 2012

Perfect Hashing - Collision Free HashMap

The HashMap is a reliable Key-Value store in Java which works on the principles of Hashing. The structure of the HashMap class is as below -

Buckets are array of Linked Lists.
Initial size of Buckets is 11.

Each key/value pair is stored in the linked list
Key's hashCode is used to find the Bucket.

Due to Hash-Collision multiple Key-Value
pairs hash to same bucket.

For Instance consider the below code
map.put( new Integer(33), "ThirtyThree");
map.put( new Integer(11), "Eleven");
map.put( new Integer(25), "TwentyFive");

The Keys here are Integers and the default implementation of hashCode for Integer class is the intValue.
So for the first put for 33, the hashCode of the key is 33. Now this value is converted to Bucket Index-
index = hashCode % capacity
index = 33 % 11 = 0
So the key/value pair is inserted at bucket 0.
Next, for 11 as well the bucket index is 0. This is a hash-Collision. This increases the length of the Linked-List. More the conflicts, worse the performance would be. As the linked list has to be traversed.

How do we ensure that there are no conflicts at all??
The answer is Perfect Hashing. There are various techniques available. We will look at "Cuckoo Hashing". This is a type of hashing which ensures zero-conflicts.

Image Source Wikipedia

In this hashing we maintain 2 set of buckets(or tables). We use 2 hashing functions h(k) and h'(k).

h(k) = key % 11 
h'(k) =(key/11) % 11

Here 11 is the capacity and its changed at run-time based on the capacity of the table.

So to populate key 20, first h(k) is calculated.
20%11 = 9

The first preference is table1. The algorithm checks if the index 9 of table 1 is free. If it is, store the key ( key-value pair infact) at the index 9. Hence you don't need to calculate h'(k).
Next key is 50. For this h(k) is 6. Again table1 index 6 is free and the key is inserted.
Next key is 53. For this h(k) is 9. But at that index we already have key 20. So calculate h'(k). It is (53/11)%11 = 4%11 = 4. Check if the index 4 of table2 is free. It is, so insert it in table2.

When inserting 67, h(k) = 1 and h'(k) = 6. And you notice that table1 index 1 is not empty and so also table2 index 6. Now apply greedy-method.Steal the index of table1 index 1 and store 67 at that position. So the old-key in that position, 100 is pushed out. Now 100 has to find a new place. Since it originated from table1, now it seeks its position in table2 using h'(k). For 100, h'(k) is 9 and table2 index 9 is free. Hence it is stored there.
Similarly, follow the path for 39. For 39, h(k) is 6 and h'(k) is 3. Both are occupied. So force out 105 from table1 index 6. For 105, h'(k) is 9. Store 105 at table2 position 9 by forcing out 100. For 100 h(k) is 1. Store it in table1 index 1 by forcing out 67. For 67, h'(k) is 6. So store it at table2 position 6 by forcing out 75. For 75, h(k) is 9 hence force out 53. and so on...

If we proceed like this, will we not end up with infinite-loop? YES.
So this algorithm works by ensuring that anytime the table's loadfactor is not exceeding 0.5. That means the number of buckets occupied should not be more than 50%. For people who are not aware of load-factor it is (size/capacity). This way we can ensure that the possibility of an infinite-loop is remote. And if that happens we can apply a different hashing.

The code below implements Cuckoo Hashing. To simulate the infinite loop, test by adding 6 (comment out the portion mentioned in the code)

Finally, why the name Cuckoo?? Its a technique used by Cuckoo bird :)


No comments:

Post a Comment