We saw in the previous post how a simple hash table can be constructed without much effort. But, and that is a big But, that table was prone to collisions and could result in corrupted data. Of course we don't want a data structure which can corrupt our data even if it is faster. So to avoid data corruption we will study some collision resolution techniques and will see how to evenly distribute a key over the table to minimize collision. Lets continue the journey.
Collision Resolution
First of all, what is collision ? Collision is a phenomenon which happens when 2 or more distinct keys are mapped to the same hash table entry. But we don't want that so using some technique, we relocate that key to some other empty location. The technique used to do that is called collision resolution technique. There are 2 types of techniques in use. One is called Chaining and Other is Open Addressing. Open Addressing has further more types but the concept is basically the same. Lets start with chaining first and see how it can be useful.
Chaining
As the name suggests, In chaining technique , we create a list of colliding keys for every hash entry location. So our hash table is basically an array of pointers which are pointing to the corresponding list. Whenever 2 distinct keys are mapped to same location, we append them to that location's list. The below picture will make it more clear.
Now the question comes about searching for a key in chained table. As you can see in the above picture, we cannot find a key in one go. We'll first have to find the entry location in hash table and then will have to traverse the whole list associated with that location. Each list node has the original key and thus we can compare our key with node's key to find the match.
Chaining may seem to solve the problem of collision and in reality it does but there is an overhead of memory. Each new list node must have a pointer to the next node. Also you may have observed in the above picture that a bad hashing technique may try to place all the keys in one location, thus making search a linear operation. Thus we can get amortized O(1) time for get, set and search operations if our hashing technique evenly distributes the keys
Open Addressing
After seeing how the chaining works, its time to explore another collision resolution technique which utilizes the hash table to its full potential. As we saw in chaining, a lot of keys can be placed in one location and then many other locations remain empty and not fully utilized. Open addressing solves this problem. There are many types of open addressing techniques but in this post we will look at Linear Open Addressing
In linear open addressing, we first find a key location inside hash table where it should sit using hash function. Then we check whether that location is empty or not. If its empty, place the key there otherwise, find the next empty location linearly in circular motion, i.e., go to the start to the table if there is no empty location till the end of the table. After finding an empty entry, place the key there. The following picture will help you understand this better.
Now lets to come to search. Search operation is somewhat similar to insert operation. When we want to search a key in table, we first find its location using hash function. After that, we linearly go through all the keys and match with our key until we find it. If we encounter an empty entry during our search or if we are back to where we started, it implies that the key is not present. As you can imagine, linear probing may result in clustures if our hash function is not uniformly distributive enough. The other type of open addressing which tries to avoid clustures is Quadratic Probing. In this we don't go in linear fashion instead we go in steps of power of 2. Like we first go 1 step to find next empty bucket, 2 steps if not found, 4 steps if no empty location found, 8 steps and so on.
As we saw in this post Open addressing solves memory problem but chaining is faster for high load factors [load factor = (total number of keys)/(number of entries in table)]. In real world cases, people prefer open addressing as Open addressing is usually faster than chained hashing when the load factor is low because you don't have to follow pointers between list nodes. If the load factor approaches 1, open addressing becomes very very slower. But usually, the table size is kept double the number of total keys in the table and thus open addressing achieves good performance. This was it for this post, in the next post, we'll see how python manages its dictionaries and how we ensure that table size is always double the number of entries. Stay tuned. Also share this article to your friends if this helped you in any way.
Comments
Post a Comment
Comment on articles for more info.