15 minute read

This article assumes the reader is already familiar with the C programming language and some basic concepts concerning hash functions. The target audience (like most of the time on my blog) are undergrad CS students or seasoned developers who, just like me, haven’t learned about Bloom Filters during their University years.

Introduction

If you just want to jump directly into the code, checkout this repository.

As Wikipedia states, Bloom Filters are space-efficient, probabilistic data structures, conceived by Burton Howard Bloom in 1970, used to test whether an element is a member of a set or not. What I find peculiar is that the real Mr. Howard Burton Bloom doesn’t have a wiki page, while the imaginary Mr. Leopold Bloom has one.

png

To make it short, a Bloom Filter is a data structure where we “add” elements. But after the addition, we cannot recover them further. They are chopped and hashed into pieces, and only a tiny footprint of what they once were remains. Afterward, we can ask the filter a delicate question:

[RandomDeveloper]: Is the element E in the set U or not?

The two possible answers a Bloom Filter can give us are:

[BloomFilter]: I am 100% sure the element E is not in U;

Or:

[BloomFilter]: I am almost sure the element E is in U, but I cannot guarantee you that…

For most of the non-critical scenarios you can think of, even the second answer is satisfactory in the light of how little space a Bloom Filter occupies. For example, you can check Prof. Michael Mitzenmacher ‘s presentation “Bloom Filters, Cuckoo Hashing, Cuckoo Filters, Adaptive Cuckoo Filters, and Learned Bloom Filters”, where he describes the ancient use of Bloom Filters: spellcheckers.

png

So, once upon a time, when memory was scarce, one of the first spellcheckers was based on a Bloom Filter who occupied 25KB to determine if 210KB of English words had possible spelling mistakes. Even if things have evolved since then, and spellcheckers don’t (necessarily or exclusively) use Bloom Filters anymore, the numbers stand impressive. 25Kb for a killer word editor feature in the early 90s is not that bad.

But, Bloom Filters were not part of the undergrad curricula I’ve taken while studying at the University, and I’ve rarely seen them (directly) used in practice. Even the more experienced engineers tend to ignore their properties and advantages and wrongly (or rather unjustly) replace them with Hash Tables or variously Set implementations. Again, most programming languages don’t have them implemented in their standard libraries.

On the brighter note, they still fit well in various back-end architectures where we need to implement blacklists, or as parts of complex caching systems. For a production implementation check RedisBloom.

In any case, what is particularly interesting about Bloom Filters is that even the most straightforward “book implementation” works decently. Compared, for example, to the extreme arts of implementing an efficient Hash Table, writing yourself an efficient Bloom Filter is more approachable. The knowledge of writing one can fit inside our heads without checking a reference book on Data Structures and Algorithms.

How a Bloom Filter actually works

The best way to understand how Bloom Filters are working is to “visualize” them:

png

\(E_{1},\ E_{2},\ E_{3},\ \text{...}\) and so on are the elements we want to add to the Bloom Filter. They can be anything, and as many of them are necessary, given there’s enough memory available.

\(h_{1}(x),\ h_{2}(x),\ h_{3}(x),\ h_{4}(x),\ \text{...}\) and so on are (non-cryptographic) hash functions. A Bloom Filter can have as many hash functions associated with it as you want, but usually, in practice, software developers pick a number between 4 and 8. In the section called Interesting ideas to explore, you will see why it’s not necessarily needed to have separate hash functions. One or two are enough to generate the others.

For practical reasons, let’s suppose our functions return uint32_t values, which are natural numbers between \(0\) and \(2^{32}-1\).

Internally a Bloom Filter has an array-like memory zone associated where elements can have only two values: 0 and 1. It’s up to us how we organize this area, but the classical implementation uses a Bit Vector (described later in the article).

Given an element \(E\), to insert it in the Bloom Filter we perform the following actions:

  1. We compute the hash values of \(E\) by applying each of the hash functions: \(h_{1}(E),\ h_{2}(E),\ h_{3}(E),\ h_{4}(E)\).
  2. The hash values are uint32_t numbers, so we apply % bloom_filter_size (modulo) to find the actual cells’ positions inside the array. We set those positions to 1.

To check if an element \(E\) is not in a Bloom Filter, we compute the hash values, and we test the bits again if they are all 1s or not.

So, if our Bloom Filter uses 4 Hash Functions, then for our element E, 4 bits in the array will be 1. If an element is already set to 1 we don’t perform any change.

Of course, hash collisions can happen - when two distinct elements \(E_{1}\) and \(E_{2}\) can have the same hash value.

If the Bit Vector size is too small, the values 1 can also start to overlap with each other, creating what we call false positives.

If you are curious about the math behind it and how to calculate the probability of false positives, the Wikipedia article is quite good:

png

\(\varepsilon\) is the false positive rate, and to keep it under control, we can fine-tune the actual values of k, n, and m.

The implementation

The full code of this article can be found in this github repo:

git clone git@github.com:nomemory/bloomfilters-c.git

Bit Vector

We assume that for our platform 1 byte is always 8 bits.

Before jumping directly into implementing a Bloom Filter we need to implement first an auxiliary data structure called a Bit Vector (also known as a Bit Array).

A Bit Vector is an array data structure that compactly stores bits, meaning that all the elements from this particular array are either 0 and 1. But in the C programming language, there is no data type as small as 1 bit. The smallest you can get is char which is 8 bits, or uint8_t (and its signed counterpart int8_t) which is also 8 bits. It will be a shame to use that much space to store either 1 or 0 in 8 bits, when you actually need only one.

Initially, I wanted to use a xtrapbits.h, a bulletproof macro implementation that has worked since 1987, simply put, a piece of code that’s about my age. Eventually, I’ve opted to write my straightforward implementation, which you can find in bitutil.c and bitutil.h, stripping down functionality to exactly what I need in the context of Bloom Filters.

Unfortunately, C doesn’t provide support at the language level for Bit Vectors, but fortunately, through the use of pointers and bitwise operations, we can implement them ourselves.

Our Bit Vector will be composed of small uint32_t chunks of memory, but each of those chunks will contain 32 possible values of 1s and 0s.

typedef struct bit_vect_s {
    uint32_t *mem; 
    size_t size;        // The number of bits
} bit_vect;

From a memory perspective things will look like this:

png

So in n chunks of uint32_t we will be able to store n * sizeof(uint32_t) * 8 elements (1s or 0s).

If n=2, we would be able to store 2 * 4 * 8 = 64 bits.

If n=3, we would be able to store 3 * 4 * 8 = 96 bits.

… and so on

Allocating memory for a new Bit Vector

The method to allocate memory for a new Bit Vector has the following signature, where num_bits represents the (exact) number of bits our Bit Vector will contain:

bit_vect *bit_vect_new(size_t num_bits) ...

Implementing it is relatively straightforward; the only “math” involved is to find out how many uint32_t chunks we need to allocate, as the input parameter num_bits is not necessarily a multiple of a sizeof(uint32_t)*8.

For example, if num_bits=71, we need at least 3 uint32_t chunks: 2 of them will be fully utilized, and from the third one, we will use only 7 bits out of the 32 available, a compromise in wasting resources we can live with.

To put this into code, we will start by declaring the following macros:

#define BITS_IN_BYTE 8
#define BITS_IN_TYPE(type) (BITS_IN_BYTE * (sizeof(type)))
  • BITS_IN_BYTE On most modern systems, we expect the number to be 8, but it’s better to isolate this in a constant than to have magic numbers in our code.
  • BITS_IN_TYPE is a function macro that computes the number of bits associated with a type. If we call BITS_IN_TYPE(uint32_t), the result will probably be 32 on most modern systems.

Later Edit, user nimaje on gitbub mentioned that in limits.h there’s already a constant CHAR_BIT that holds the number of bits of a char. Instead of defining your own BITS_IN_BYTE constant, it’s a better idea to use this one.

Then we write the actual memory allocator, bit_vect_new:

bit_vect *bit_vect_new(size_t num_bits) {
    bit_vect *vect = malloc(sizeof(*vect));
    if (NULL==vect) {
        fprintf(stderr, "Out of memory.\n");
        exit(EXIT_FAILURE);
    }
    size_t mem_size = num_bits / BITS_IN_TYPE(uint32_t);
    // If num_bits is not a multiplier of BITS_IN_TYPE(uint32_t)
    // We add one more chunk that will be partial occupied
    if (!(num_bits%BITS_IN_TYPE(u_int32_t))) {
        mem_size++;
    }
    vect->mem = calloc(mem_size, sizeof(*(vect->mem)));
    if (NULL==vect->mem) {
        fprintf(stderr, "Out of memory.\n");
        exit(EXIT_FAILURE);
    }
    vect->size = num_bits;
    return vect;
}

Freeing the memory for a bit_vect is straightforward and is done in two steps, in this exact order:

  • Free the memory associated with bit_vect->mem first;
  • Free the memory associated with bit_vect second;
void bit_vect_free(bit_vect *vect){
    free(vect->mem);
    free(vect);
}

Getting the value of the nth bit

The function that gets the value of the nth bit from our Bit Vector looks like this:

bool bit_vect_get(bit_vect *vect, size_t bit_idx) {
    if (bit_idx>=vect->size) {
        fprintf(stderr, "Out of bounds bit_idx=%zu, vect->size=%zu\n", 
                            bit_idx, vect->size);
        exit(EXIT_FAILURE);                            
    }
    size_t chunk_offset = bit_idx / BITS_IN_TYPE(uint32_t);
    size_t bit_offset = bit_idx & (BITS_IN_TYPE(uint32_t)-1);
    uint32_t byte = vect->mem[chunk_offset];
    return (byte>>bit_offset) & 1;
}

Firstly, it performs some sanity checks on the input (bit_idx) to see if nth bit we are looking is in between our memory bounds.

Secondly, we need to compute two offsets to determine where the memory location (the bit) we are looking for is situated:

png

  • chunk_offset is the uint32_t frame that holds the bit.
  • bit_offset is the actual position of the bit inside the uint32_t chunk

For example, if we are searching for bit_idx=60, then: chunk_offset = 60 / 32 = 1, and bit_offset = 60 % 32 = 28.

This line bit_idx & (BITS_IN_TYPE(uint32_t)-1) can be a little confusing because you would typically expect the % operation to get the remainder, instead of &(32-1). This simple bitwise trick makes the codes more efficient by reducing the number of operations on the CPU. It only works if the divisor is a power of two (lucky us). I’ve already explained it here, using this diagram.

The last line, (byte>>bit_offset) & 1, extracts the actual value of the bit (by shifting the uint32_t with the bit_offset).

Setting the value of the nth bit

The corresponding code for this operation is the following:

void bit_vect_set(bit_vect *vect, size_t bit_idx, bool val) {
    if (bit_idx>=vect->size) {
        fprintf(stderr, "Out of bounds bit_idx=%zu, vect->size=%zu\n", 
                            bit_idx, vect->size);
        exit(EXIT_FAILURE);                            
    }
    size_t chunk_offset = bit_idx /  BITS_IN_TYPE(uint32_t);
    size_t bit_offset = bit_idx & (BITS_IN_TYPE(uint32_t)-1);
    uint32_t *byte = &(vect->mem[chunk_offset]);
    if (val) {
        // Sets the the `bit_idx` to 1 (true)
        *byte |= ((uint32_t)1) << bit_offset;
    }else {
        // Sets the bit `bit_idx` to 0 (false)
        *byte &= ~(1 << bit_offset);
    }
}

Yet again, we need to compute we compute the chunk_offset and bit_offset, and then we branch off:

  • *byte |= ((uint32_t)1) << bit_offset sets the corresponding bit to 1;
  • *byte &= ~(1 << bit_offset) sets the correponding bit to 0;

The hash functions

Just like Hash Tables (previously explained in this article, Bloom Filters make heavy use of Hash Functions. It’s not in the purpose of this article to explain how Hash Functions work, as I’ve already did my best here.

My rule of thumb is to use the sdmm and djb2 functions in toy implementations, and something more advanced like MurmurHash, FNV or spookyhash for serious stuff. Recently I’ve also played with chunky64, and the results good, still I am not sure how popular is this in the real world.

But as I said, the two hash functions I will gonna be using in this article are sdbm and djb2. They are extremely simple, and they work decent enough for our purpose:

uint32_t djb2(const void *buff, size_t length) {
    uint32_t hash = DJB2_INIT;
    const uint8_t *data = buff;
    for(size_t i = 0; i < length; i++) {
         hash = ((hash << 5) + hash) + data[i]; 
    }
    return hash;
}
uint32_t sdbm(const void *buff, size_t length) {
    uint32_t hash = 0;
    const uint8_t *data = buff;
    for(size_t i = 0; i < length; i++) {
        hash = data[i] + (hash << 6) + (hash << 16) - hash;
    }
    return hash;
}

The only improvement I would make on these two would be to increase the data reading frame size from uint8_t to something bigger, reducing the overall number of >> operations.

Next, we need to make sure is that our hash functions share the same signature, so we can typedef them for further use:

typedef uint32_t (*hash32_func)(const void *data, size_t length);

The main idea is that our Bloom Filter will receive several hash functions in the initialization phase, so doing an alias on the type (by using typedef) will make our code more readable when we pass the function pointers.

The bloom filter

The “interface”

The Bloom’s Filter interface, looks like this:

typedef uint32_t (*hash32_func)(const void *data, size_t length);
typedef struct bloom_filter_s {
    bit_vect *vect;
    hash32_func *hash_functions;
    size_t num_functions;
    size_t num_items;
} bloom_filter;
bloom_filter *bloom_filter_new(size_t size, size_t num_functions, ...);
bloom_filter *bloom_filter_new_default(size_t size);
void bloom_filter_free(bloom_filter *filter);
void bloom_filter_put(bloom_filter *filter, const void *data, size_t length);
void bloom_filter_put_str(bloom_filter *filter, const char *str);
bool bloom_filter_test(bloom_filter *filter, const void *data, size_t lentgth);
bool bloom_filter_test_str(bloom_filter *filter, const char *str);

It’s nothing fancy, just a simple API:

  • struct bloom_filter_s defines the internal data-structure:
    • A bit_vect where we keep our 1s and 0s;
    • num_items the actual number of bits the Bit Vector has;
    • hash_functions and num_functions for keeping track of the internal hash functions;

Allocating and de-allocating memory

The code for allocating dynamic memory for our newly defined struct (bloom_filter) is the following:

bloom_filter *bloom_filter_new(size_t size, size_t num_functions, ...) {
    va_list argp;
    bloom_filter *filter = malloc(sizeof(*filter));
    if (NULL==filter) {
        fprintf(stderr, "Out of memory.\n");
        exit(EXIT_FAILURE);
    }
    filter->num_items = 0;
    filter->vect = bit_vect_new(size);
    filter->num_functions = num_functions;
    filter->hash_functions = malloc(sizeof(hash32_func)*num_functions);
    if (NULL==filter->hash_functions) {
        fprintf(stderr, "Out of memory.\n");
        exit(EXIT_FAILURE);    
    }
    va_start(argp, num_functions);
    for(int i = 0; i < num_functions; i++) {
        filter->hash_functions[i] = va_arg(argp, hash32_func);
    }
    va_end(argp);
    return filter;
}
bloom_filter *bloom_filter_new_default(size_t size) {
    return bloom_filter_new(size, 2, djb2, sdbm);
} 
void bloom_filter_free(bloom_filter *filter) {
    bit_vect_free(filter->vect);
    free(filter->hash_functions);
    free(filter);
}

For an inexperienced eye, the only thing that might look confusing is the ... notation from the bloom_filter_new method signature. bloom_filter_new is basically a variadic function, that accepts an arbitrary number of params num_functions. Those params are actual hash functions (hash32_func).

Adding an element

The code that is adding an element to the Bloom Filter is the following:

  • We iterate through all the hash_functions;
  • We compute the hash values, and we set to 1 the correct values
void bloom_filter_put(bloom_filter *filter, const void *data, size_t length){
    for(int i = 0; i < filter->num_functions; i++) {
        uint32_t cur_hash = filter->hash_functions[i](data, length);
        bit_vect_set1(filter->vect, cur_hash % filter->vect->size);
    }
    // We've just added a new item, we incremenet the value
    filter->num_items++;
}

C doesn’t support polymorphism, so to make it simple for adding a string (char*) to the filter, we simply write a helper function:

void bloom_filter_put_str(bloom_filter *filter, const char *str) {
    bloom_filter_put(filter, str, strlen(str));
}

Testing if an element exists

The code for Testing if an element exists in the Bloom Filter is the following:

  • We iterate through all the hash_functions;
  • We compute the hash values we check the corresponding values in the bit_vect.
bool bloom_filter_test(bloom_filter *filter, const void *data, size_t lentgth) {
    for(int i = 0; i < filter->num_functions; i++) {
        uint32_t cur_hash = filter->hash_functions[i](data, lentgth);
        if (!bit_vect_get(filter->vect, cur_hash % filter->vect->size)) {
            return false;
        }
    }
    return true;
}
bool bloom_filter_test_str(bloom_filter *filter, const char *str) {
    return bloom_filter_test(filter, str, strlen(str));
}

Using the Bloom Filter

#include <stdio.h>
#include <stdlib.h>

#include "bloom.h"

int main(int argc, char *argv[]) {
    
    bloom_filter *filter = bloom_filter_new_default(1024);
    bloom_filter_put_str(filter, "abc");
    printf("%d\n", bloom_filter_test_str(filter, "abc"));
    printf("%d\n", bloom_filter_test_str(filter, "bcd"));
    printf("%d\n", bloom_filter_test_str(filter, "0"));
    printf("%d\n", bloom_filter_test_str(filter, "1"));
    bloom_filter_put_str(filter, "2");
    printf("%d\n", bloom_filter_test_str(filter, "2"));
    
    return 0;
}

Output:

1
0
0
0
1

Interesting ideas to explore

Re-use hash functions

Following a discussion on reddit, Chris Wellons suggested the fact that Bloom Filters don’t need actual unique k functions. We can only have one that generates a hash value, and from that value, through permutations we can generate as many new hash values as we want:

uint64_t hash64(void *buf, size_t len);
uint64_t permute64(uint64_t, uint64_t key);

uint64_t bufhash = hash64(buf, len);
uint32_t hashes[K];
for (int i = 0; i < K; i++) {
    hashes[i] = permute64(bufhash, i);
}

Where permute64 can look like:

uint64_t permute64(uint64_t x, uint64_t key)
{
    x += key;
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9U;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebU;
    x ^= x >> 31;
}

Another interesting idea for avoiding to use multiple (separate) hash functions comes from Building a Better Bloom Filte. The authors suggest that two hash functions \(h_{1}(x)\), and \(h_{2}(x)\) are enough to generate others in the form \(g_{i}(x)\):

png

Articles, C implementations, and References

If you want to read more about how to implement Bloom Filters in the C language, you can check this article: “How to write a better Bloom Filter” by Drew DeVault.

Another interesting C implementation is bloomd which is a Network Daemon for Bloom Filters written by Armon Dadgar, and even if the project doesn’t seem to be maintained anymore, it is quite an exciting piece of software to look at.

Other references:

Going further

A proposed efficient alternative to Bloom Filters the Cuckoo Filters, but before speaking, I need to do my homework first.

Updated:

Comments