Bloom Filters, Explained Properly

Most explanations throw a wall of theory at you and leave out the one thing that actually makes it click. This post fixes that.

The one question a bloom filter answers

Is this item definitely not in the set, or possibly in it?

It can tell you "no" with total certainty, but only ever says "maybe" for yes. It never gives a false negative; it can give a false positive. That asymmetry is the whole point — everything falls out of it. In exchange for the fuzziness, you get tiny memory and blazing speed.

What it's made of

Two things: a bit array (a row of light switches, all starting at 0) and a few hash functions (machines that take an item and spit out a position in the array). No stored words. No list of items. Just switches and hash functions.

How many bits does one item take?

This is the question that trips everyone up. When you add cat, how much room does it take?

Exactly k bits — and k is a number you pick in advance, not something the word decides.

A word has no natural size. You decide up front, "every item sets exactly 3 bits." That's k, the same for everything. cat sets 3 bits, dog sets 3 bits, elephant sets 3 bits. Word length is irrelevant — elephant takes no more room than cat. Which bits? That's what the hash functions decide.

Building the array

Tiny array of 10 slots, k = 3. Add cat:

hashA("cat") -> 4
hashB("cat") -> 4    <- two hashes can land on the same spot
hashC("cat") -> 1

Flip those bits on:

index:  0  1  2  3  4  5  6  7  8  9
        .  ■  .  .  ■  .  .  .  .  .
           ^cat        ^cat

cat lit only 2 distinct bits even though k = 3, because two hashes both said 4. Now add dog (hashes to 7, 1, 9):

index:  0  1  2  3  4  5  6  7  8  9
        .  ■  .  .  ■  .  .  ■  .  ■
           ^            ^  ^     ^
        cat+dog        cat dog  dog

Bit 1 was already lit by cat; dog setting it again does nothing.

The insight: you can't read the array

Here's the part that sticks. Look at that final array:

        .  ■  .  .  ■  .  .  ■  .  ■
        0  1  2  3  4  5  6  7  8  9

If I handed you this array cold, you could not tell me what's in it. You couldn't say "cat and dog are here." All you could say is "switches 1, 4, 7, 9 are on." The array keeps no record of who lit each bit. A lit bit just means somebody hashed there — it doesn't remember who, or how many.

That forgetfulness is the source of everything. It's why the filter is tiny (it's not storing your words, just their smeared fingerprints), and it's why "maybe" can never become "yes."

Querying: same hashes, then an AND

To check cat, hash it again — same outputs, 4, 4, 1 — and check whether those bits are all 1.

Are bits 4 AND 1 both 1?  -> yes -> "possibly present"

It's an AND across the item's own target bits. One 0 anywhere is an instant "definitely not" — it doesn't even check the rest.

Why "No" is certain and "Maybe" isn't

It comes from one fact: bits only ever turn on, never off.

a 0 bit -> "this slot was never touched" -> CERTAIN, nothing erases
a 1 bit -> "someone touched this slot"   -> AMBIGUOUS, the array forgot who

A 0 is proof of absence: if a position is still 0, nothing ever hashed there — so any item needing that bit was definitely never added. A 1 is only circumstantial: it could be your item, or other items that happened to share that slot. When all an item's bits are 1, you can't tell those apart, so the only honest answer is "maybe."

This is also why false negatives are impossible — a real member always finds its own bits lit, because it set them and nothing erases.

The false positive

Query wolf, never added, but it happens to hash to bits 4 and 1 — both already lit by cat:

wolf -> bits 4, 1 -> both 1 -> "maybe present"  -> WRONG

The filter isn't lying. Its real claim is "every bit wolf needs is lit" — which is true. It never promised wolf is present, only that it can't rule it out. And it genuinely can't: "wolf was added" and "cat was added and wolf collides" produce the identical pattern. The array forgot who lit those bits, remember? So it cannot distinguish them.

Why it's useful anyway

A trustworthy "no" lets you skip expensive work:

query the filter
  -> "definitely not"  -> skip the expensive lookup (the common case)
  -> "maybe"           -> do the real lookup to confirm (the rare case)

A false positive costs you one wasted lookup, never a wrong final answer — you always verify a "maybe" against the real source. Real uses: LSM-tree databases (Cassandra, RocksDB) put one in front of each on-disk table to skip reads for keys that definitely aren't there; browser safe-browsing keeps a local filter of bad URLs and only phones home on a "maybe."

The tradeoff, briefly

More bits and more hash functions lower the false-positive rate, at the cost of memory. You pick how many items you'll store and what error rate you'll tolerate; the array size and number of hash functions fall out of that (the sweet spot lands when the array is about half-full of 1s). And you can't delete from a basic bloom filter — clearing a bit might break other items sharing it.

Going deeper. The exact sizing math — deriving the false-positive rate (1 − e^(−kn/m))^k, the optimal m = −(n ln p)/(ln 2)², the optimal k = (m/n) ln 2, and why ln 2 keeps appearing — is laid out clearly in:

Wikipedia: Bloom filter — the standard reference, full derivations

Bloom Filters by Example — interactive, lets you watch the array fill

The original 1970 paper: Burton H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors

The one-paragraph version

A bloom filter is a bit array plus a few hash functions. Adding an item flips the k bits it hashes to; checking looks at those same bits. Any 0 means definitely-not; all 1s mean probably-yes. The key: you can't read the array — a lit bit doesn't remember who lit it, so a 0 proves absence but a 1 is only circumstantial. That one-way guarantee makes it a perfect cheap gate in front of an expensive lookup: trust the "no," verify the "maybe."