Hash Tables: Scratching the Surface

Have you ever worked with key-value data stores, where you attach a value to a key and store it so that lookups are easy? If so, you have already seen hash tables in action. Hash tables are so common these days that you may not even realise when you are using one, yet not many people are familiar with their internals or with how hashing is used in real-world problems such as cryptography. This is a series of articles all about hashing. So sit tight and enjoy the journey.

What Problems Does Hashing Solve?

Hash tables are among the most popular data structures in computer science. We rarely worry about the time complexity of finding data in a hash table, because a lookup takes constant time on average. But that alone doesn't show how hashing solves real-world problems. Here are some problems that are solved with hashing:

A computer cache works internally like a hash table, with the tag bits serving as keys (refer to the cache post).
Python dictionaries are implemented using a hash table.
Associative arrays in many programming languages, such as PHP, are maintained using hashing.
Cryptographic hashing uses hash functions to digest a variable-length message into a fixed-length string.

Apart from the above scenarios, you can use hash tables wherever you find them appropriate. However, because hash tables store data in no particular order, they are not suited to applications that need sorted data.
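To make the ordering point concrete, here is a minimal sketch using std::unordered_map, the hash table from the C++ standard library. Lookup by key is direct, but getting the keys in ascending order takes an explicit sort. The helper name sorted_keys is made up for this illustration.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: returns the keys of a hash table in ascending order.
// The table itself iterates in an unspecified order, so we copy the keys
// out and sort them ourselves.
std::vector<int> sorted_keys(const std::unordered_map<int, std::string>& table) {
    std::vector<int> keys;
    keys.reserve(table.size());
    for (const auto& entry : table) {
        keys.push_back(entry.first);
    }
    std::sort(keys.begin(), keys.end());
    return keys;
}
```

For a table holding the keys 30, 10 and 20, table.at(20) finds its value without scanning, while sorted_keys(table) has to do the extra sorting work to return 10, 20, 30. When sorted order is needed all the time, an ordered container such as std::map is usually the better fit.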

Simple Implementation of Hash Table

We can think of a hash table as an array in which each entry holds the original key and value pair. If the key itself is the value, there is no need to store the value separately. For any hash table to work, you need some way to map keys to array indices. This mapping is done by a hash function, which can be any function you like as long as it returns a valid index. Desirable properties of a hash function are:

Uniform distribution of keys over the hash table. We'll see shortly why that is needed.
The mapping of a key to its hash table entry must not change over time.
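The first property is easier to appreciate with numbers. The sketch below counts how many keys land in each slot when a simple modulo hash is used. The function names bucket_counts and make_keys are assumptions made for this illustration, and the keys are assumed to be non-negative.

```cpp
#include <cassert>
#include <vector>

// Counts how many keys fall into each slot when hashing with key % capacity.
// Assumes non-negative keys and a positive capacity.
std::vector<int> bucket_counts(const std::vector<int>& keys, int capacity) {
    assert(capacity > 0);
    std::vector<int> counts(capacity, 0);
    for (int key : keys) {
        assert(key >= 0);
        counts[key % capacity]++;
    }
    return counts;
}

// Builds the keys 0, step, 2*step, ... (count keys in total).
std::vector<int> make_keys(int count, int step) {
    std::vector<int> keys;
    for (int i = 0; i < count; ++i) {
        keys.push_back(i * step);
    }
    return keys;
}
```

With the keys 0 to 99 and a capacity of 10, bucket_counts(make_keys(100, 1), 10) reports 10 keys in every slot, which is a uniform spread. With 100 keys that are all multiples of 10, bucket_counts(make_keys(100, 10), 10) puts every key in slot 0 and leaves the other nine slots empty. The second property holds here as well: the same key always produces the same slot.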

With the above properties in mind, we can build a hash function using simple arithmetic. One option is the modulo operator. Assume we are managing integer keys, each with an associated integer value, and we need constant-time lookup by key. This can be achieved with an integer array. Our hash function applies the modulo operation to the key and returns the array index where the key's value should be placed. The following is an implementation of such a hash table in C++.

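#include <iostream>

class HashTable {
public:
    static const int CAPACITY = 10;

private:
    int* entries;     // value stored in each slot
    bool* valid_map;  // true if the slot currently holds a value
    int size;         // number of occupied slots

public:
    HashTable() {
        entries = new int[CAPACITY];
        valid_map = new bool[CAPACITY]();  // value-initialized: every slot starts empty
        size = 0;
    }
    ~HashTable() {
        delete[] entries;
        delete[] valid_map;
    }
    // The table owns raw arrays, so copying is disabled to avoid double frees.
    HashTable(const HashTable&) = delete;
    HashTable& operator=(const HashTable&) = delete;

    // No collision handling: a key that hashes to an occupied slot overwrites it.
    void insert(int key, int value) {
        int index = hash(key);
        if (!valid_map[index]) {
            size++;  // count the slot only once
        }
        entries[index] = value;
        valid_map[index] = true;
    }
    // Called remove because delete is a reserved word in C++.
    void remove(int key) {
        int index = hash(key);
        if (valid_map[index]) {
            valid_map[index] = false;
            size--;
        }
    }
    bool exists(int key) {
        return valid_map[hash(key)];
    }
    // Returns the stored value, or -1 if the slot is empty.
    int find(int key) {
        int index = hash(key);
        if (valid_map[index]) {
            return entries[index];
        }
        return -1;
    }

private:
    // Maps a key to a slot in [0, CAPACITY); the extra step keeps negative keys in range.
    int hash(int key) {
        return ((key % CAPACITY) + CAPACITY) % CAPACITY;
    }
};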

As you can see, our hash table can store up to 10 values. But there is one major problem with this approach: there is no collision handling. If two keys map to the same location, the second one overwrites the first. We'll see how to tackle collisions in the next post, and look at how Python handles hashing of objects.
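The overwrite problem can be reproduced in a few lines. This is a stripped-down sketch rather than the class from the post: it keeps only a bare array and the same modulo hash, and the names slot_for and value_after_two_inserts are made up for the illustration. Keys are assumed to be non-negative.

```cpp
#include <cassert>

const int CAPACITY = 10;  // same table size as in the post

// Same modulo hash as in the post; assumes non-negative keys.
int slot_for(int key) {
    assert(key >= 0);
    return key % CAPACITY;
}

// Stores two values in a bare array using slot_for, then returns whatever
// is left in the slot that belongs to the first key.
int value_after_two_inserts(int key1, int value1, int key2, int value2) {
    int entries[CAPACITY] = {0};
    entries[slot_for(key1)] = value1;
    entries[slot_for(key2)] = value2;  // overwrites value1 if the slots match
    return entries[slot_for(key1)];
}
```

Keys 12 and 22 both hash to slot 2, so value_after_two_inserts(12, 100, 22, 200) returns 200: the value stored under key 12 is gone. With keys 12 and 23 the slots differ (2 and 3), and the same call returns 100.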
