advanced10 min readUpdated 2026-06-08

Bloom Filters

Bloom filters save expensive disk/network lookups. Before querying a database or cache, check the Bloom filter.

Bloom Filters

The Core Idea

A Bloom filter is a space-efficient probabilistic data structure that tests whether an element is a member of a set. It can tell you 'definitely not in the set' or 'probably in the set.' False positives are possible; false negatives are not.

Step-by-Step Walkthrough

System architecture diagram for Bloom Filters showing how services, databases, and caches connect — System architecture for Bloom Filters

Cassandra uses Bloom filters for each SSTable (sorted file on disk). When a read query arrives: check the Bloom filter for each SSTable. If the filter says 'not present,' skip reading that file entirely. This avoids expensive disk I/O for most queries.

With 10 SSTables and Bloom filters, instead of reading 10 files, you might only read 1-2 (the ones where the filter says 'might be present').

Why This Approach Wins

Bit array + hash functions: A Bloom filter is a bit array of m bits with k hash functions. To add an element, hash it k times and set those bit positions to 1.
Lookup: Hash the query k times. If ALL positions are 1, the element is probably in the set. If ANY position is 0, the element is definitely not in the set.
False positive rate: Depends on m (array size), k (hash functions), and n (elements). Typical: 1% false positive rate with 10 bits per element.
No deletion: Standard Bloom filters do not support deletion (clearing a bit might affect other elements). Counting Bloom filters add this capability.
Space efficiency: A Bloom filter with 1% false positive rate uses ~10 bits per element, regardless of element size.

Step-by-step diagram showing how Bloom Filters processes a request from start to finish — How Bloom Filters works step by step

In Production

Google Bigtable uses Bloom filters to avoid reading SSTables that do not contain the requested row.

Cassandra uses per-SSTable Bloom filters, reducing disk reads by 90%+ for missing keys.

Akamai CDN uses Bloom filters to track which content is cached at each edge server.

Comparison table for Bloom Filters contrasting approaches, tradeoffs, and when to use each — Comparing key aspects of Bloom Filters

Tradeoffs and Limitations

Space vs Accuracy: More bits = lower false positive rate, but more memory.
No deletion: Standard Bloom filters are append-only. Counting Bloom filters support deletion but use more space.
Probabilistic: Bloom filters are not 100% accurate — you must handle false positives in your application logic.

Production Gotchas

Using Bloom filters for small sets where a hash set is better
Not sizing the filter correctly — too small leads to unacceptable false positive rates
Forgetting that Bloom filters cannot return the actual data — only membership

Data flow diagram for Bloom Filters showing how requests and responses move through the system — Data flow through Bloom Filters

The Interview Angle

What is a Bloom filter and how does it work?
What are false positives and false negatives in a Bloom filter?
Where are Bloom filters used in practice?
How do you choose the size and number of hash functions?

Next Up

Component diagram for Bloom Filters showing each building block and its responsibility — Key components of Bloom Filters

The Real-World Incident That Made This Famous

Google's Bigtable system (published 2006) made Bloom filters a household name in systems engineering. Bigtable stores data in SSTables (Sorted String Tables) on disk. When a read request comes in, Bigtable might need to check multiple SSTables to find the requested row. Without Bloom filters, each check requires a disk read (or at least a disk seek), and with hundreds of SSTables, a single read could require hundreds of disk operations.

By adding a Bloom filter for each SSTable, Bigtable can check in-memory whether a key might exist in that SSTable before doing any disk I/O. If the Bloom filter says "no," the key is definitely not in that SSTable (zero false negatives). If the Bloom filter says "maybe," then Bigtable reads the SSTable from disk. With a properly sized Bloom filter (1% false positive rate), 99% of unnecessary disk reads are eliminated. This reduced Bigtable's read latency dramatically, especially for keys that do not exist.

Interview preparation checklist for Bloom Filters with key points to mention and mistakes to avoid — Interview tips for Bloom Filters

Medium also uses Bloom filters for a user-facing feature: avoiding showing users articles they have already read. With millions of users and millions of articles, storing every user-article pair in a database would be expensive to query. Instead, Medium maintains a Bloom filter per user containing the IDs of articles they have viewed. When generating recommendations, they check the Bloom filter to skip already-read articles. The occasional false positive (skipping an unread article) is far less harmful than the alternative (repeatedly showing articles the user has already seen). The space savings are enormous: a Bloom filter for 10,000 articles at 1% false positive rate requires only about 12 KB, compared to storing 10,000 article IDs explicitly.

How Senior Engineers Think About This

A Bloom filter is a probabilistic data structure that can tell you "definitely not in the set" or "probably in the set." It never has false negatives (if it says no, the element is definitely not present) but can have false positives (if it says yes, the element might not actually be present). This one-sided error property makes it perfect as a cheap pre-filter before an expensive lookup.

The math is elegant: a Bloom filter uses m bits and k hash functions. To add an element, hash it with each of the k functions and set those k bits to 1. To query, hash the element and check if all k bits are set. If any bit is 0, the element is definitely not in the set. If all bits are 1, the element is probably in the set (but some other elements might have set those same bits).

Decision guide for when to choose Bloom Filters and when alternative approaches are better — When to use Bloom Filters

The optimal parameters: for n elements with a desired false positive rate p, the optimal bit count is m = -(n * ln(p)) / (ln(2))^2, and the optimal number of hash functions is k = (m/n) * ln(2). For 1 million elements at 1% false positive rate, you need about 9.6 million bits (1.2 MB) and 7 hash functions. For 0.1% false positive rate, you need 14.4 million bits (1.8 MB) and 10 hash functions.

Senior engineers think about where Bloom filters fit in the system architecture. They are perfect for: avoiding unnecessary database queries (check if user exists before querying the users table), eliminating redundant network requests (check if you have already fetched this URL), deduplicating events in stream processing (check if you have already processed this event ID), and implementing "already seen" features (recommendations, news feeds).

Common Interview Mistakes

Mistake 1: Forgetting that you cannot delete elements. A standard Bloom filter does not support deletion because clearing a bit might affect other elements. If you need deletion, use a Counting Bloom filter (uses counters instead of bits).

Tradeoff analysis for Bloom Filters listing advantages, disadvantages, and real-world considerations — Advantages and disadvantages of Bloom Filters

Mistake 2: Not knowing the math. You should be able to estimate the space requirement. For 1M elements at 1% false positive rate, it is about 1.2 MB with 7 hash functions.

Mistake 3: Using Bloom filters where exact membership is required. If a false positive has serious consequences (granting access to unauthorized users), a Bloom filter is the wrong choice.

Mistake 4: Not mentioning real use cases. Bigtable, LevelDB, Cassandra, and Chrome (for malicious URL checking) all use Bloom filters. Concrete examples show practical knowledge.

Mistake 5: Not discussing alternatives. Cuckoo filters support deletion and have better space efficiency for low false positive rates. Quotient filters support deletion and merging. Know when to consider alternatives.

Production deployment examples of Bloom Filters at companies like Netflix, Google, and Amazon — Real-world examples of Bloom Filters

Production Checklist

Size your Bloom filter based on expected cardinality and acceptable false positive rate: use m = -(n * ln(p)) / (ln(2))^2
Choose the number of hash functions optimally: k = (m/n) * ln(2), typically 7-10 for 1% false positive rate
Use double hashing (gi(x) = h1(x) + i * h2(x)) to generate k hash values from just two hash functions
Monitor the actual false positive rate in production — if it exceeds your target, the filter is overloaded and needs to be rebuilt with more bits
Implement a rebuild strategy: periodically create a new filter from the current set to reset the false positive rate
Serialize the Bloom filter to disk for persistence — a Bloom filter is useless if it is lost on restart
For growing datasets, consider Scalable Bloom Filters that add additional filter layers as elements are added
Use Bloom filters as a pre-filter only — always verify positive results against the authoritative data source
Test with production-scale data: false positive rates can differ from theoretical predictions with skewed or correlated data
Monitor memory usage: a Bloom filter for 100M elements at 1% FP rate requires about 120 MB of RAM

Read the original source | Content from System-Design-Overview

Practical Implementation for .NET Developers

In a .NET application, you would typically implement this pattern using the following approach:

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

text

Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);

This gives you searchable, structured logs in Azure Monitor or Seq.

External Resources

Original Sourcearticle