HyperLogLog
HyperLogLog is a probabilistic data structure that estimates the cardinality (count of distinct elements) of massive datasets using only a few kilobytes.
HyperLogLog is a probabilistic algorithm that estimates the number of distinct elements in a dataset of practically any size. In its common configuration it uses about 12KB of memory and has a standard error of approximately 0.81%. Instead of storing every unique element (which would require memory proportional to the number of distinct elements), it hashes each element and observes patterns in the binary representation of the hashes to estimate cardinality. Redis, BigQuery, Presto, and many other analytics platforms use HyperLogLog or a variant of it to power features like unique visitor counts and distinct value estimation at massive scale.
| Aspect | Details |
|---|---|
| What it is | A probabilistic algorithm that estimates the count of distinct elements in a set using a fixed, tiny amount of memory regardless of set size |
| When to use | Counting unique visitors, distinct IP addresses, unique search queries, or any cardinality estimation where exact counts aren't required and datasets are massive |
| When NOT to use | When you need exact counts (financial auditing, inventory), when the dataset is small enough to count exactly in memory, or when you need to enumerate the actual unique elements |
| Real-world example | Redis uses HyperLogLog via PFADD/PFCOUNT commands to count unique page visitors; each counter uses at most about 12KB whether it tracks 10 or 10 billion unique visitors |
| Interview tip | Explain the core intuition: the longest run of leading zeros in hashed values statistically correlates with the number of distinct elements, like estimating coin flip count from the longest heads streak |
| Common mistake | Trying to use HyperLogLog to list or retrieve the actual unique elements — it only estimates the count, it cannot tell you what those elements are |
| Key tradeoff | You get O(1) memory for cardinality estimation of arbitrarily large sets, but the answer is an approximation with ~0.81% standard error, not an exact count |
Why This Matters
Counting distinct elements exactly requires storing every unique element, which for billion-element datasets means gigabytes of memory. HyperLogLog replaces that with a fixed-size sketch of about 12KB regardless of cardinality. Against, say, 8GB for a billion 8-byte identifiers, that is a reduction of more than 100,000x. This matters in real systems because counting unique visitors, distinct queries, or unique devices, and estimating cardinality for query planning, are everyday operations that would be prohibitively expensive without probabilistic approaches. HyperLogLog is also mergeable: you can combine counters from different time periods or servers, which makes it well suited to distributed analytics.
The Building Blocks
- Hash Function: Every element is hashed to a uniformly distributed binary string. The hash quality ensures that patterns in the binary representation reflect true cardinality, not data distribution.
- Register Array: The hash is split into a prefix (to select one of m registers) and a suffix (to count leading zeros). Each register stores the maximum leading zeros observed, serving as a local cardinality estimator.
- Harmonic Mean: The cardinality estimate combines all registers using a harmonic mean, which reduces the impact of outlier registers that observed unusually long or short zero runs.
- Bias Correction: For small cardinalities, HyperLogLog applies bias correction factors and falls back to linear counting to maintain accuracy across the full range from zero to billions.
- Mergeability: Two HyperLogLog structures can be merged by taking the max of each corresponding register — enabling distributed counting where each server maintains a local sketch and a coordinator merges them.
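The first two building blocks can be sketched in a few lines. This is a minimal illustration rather than any library's implementation: the helper names (`hash64`, `index_and_rank`, `add`) are hypothetical, and truncated SHA-256 is used only because it is deterministic and available in the standard library; production systems typically use a faster non-cryptographic hash.

```python
import hashlib


def hash64(item: str) -> int:
    """Map an item to a 64-bit integer (truncated SHA-256; an illustrative choice)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def index_and_rank(h: int, p: int) -> tuple[int, int]:
    """Split a 64-bit hash into a register index and a rank.

    The top p bits select the register. The rank is the position of the
    first 1-bit in the remaining 64 - p bits (leading zeros plus one).
    """
    rest_bits = 64 - p
    index = h >> rest_bits
    rest = h & ((1 << rest_bits) - 1)
    rank = rest_bits - rest.bit_length() + 1
    return index, rank


def add(registers: list[int], item: str, p: int) -> None:
    """Record an item: each register keeps the largest rank it has seen."""
    index, rank = index_and_rank(hash64(item), p)
    registers[index] = max(registers[index], rank)


registers = [0] * (1 << 4)          # p = 4 gives 16 registers (tiny, for illustration)
for visitor in ["alice", "bob", "alice"]:
    add(registers, visitor, p=4)    # the repeated "alice" changes nothing
```

Because a register only ever keeps a maximum, adding the same element again is a no-op, which is why duplicates do not inflate the estimate.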
Under the Hood
The core insight of HyperLogLog is statistical: if you flip a fair coin repeatedly, the probability of seeing k consecutive heads decreases exponentially (1/2^k). So observing a long run of heads implies many flips. HyperLogLog applies this logic to hashed values — a hash with many leading zeros is like a long heads streak, implying many distinct elements were hashed.
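The coin-flip argument can be written out as a short calculation. The symbols are introduced here for illustration: k is the number of leading zeros and n is the number of distinct hashed elements, with each hash bit assumed to be an independent fair coin.

```latex
\begin{align*}
P(\text{a given hash starts with } k \text{ zeros}) &= 2^{-k} \\
P(\text{at least one of } n \text{ hashes starts with } k \text{ zeros}) &= 1 - \left(1 - 2^{-k}\right)^{n} \\
\text{for } n = 2^{k}: \quad 1 - \left(1 - \tfrac{1}{n}\right)^{n} &\approx 1 - e^{-1} \approx 0.63
\end{align*}
```

So a run of k leading zeros becomes likely only once n is on the order of 2^k, which is why the longest run observed is a usable, if noisy, signal of cardinality.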
Specifically, each element is hashed to a binary string. The first p bits select one of m = 2^p registers (typically p = 14, giving 16,384 registers). The remaining bits are examined for the position of the first 1-bit (equivalently, the count of leading zeros plus one). Each register stores the maximum such value it has ever seen. The raw estimate combines all registers using a harmonic mean: E = alpha_m * m^2 / sum(2^(-M[j])), where M[j] is the value in register j and alpha_m is a bias correction constant.
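The raw estimate formula above translates almost directly into code. The sketch below assumes the registers are held in a plain Python list; the function names are hypothetical, and the alpha_m constants are the ones given in the original HyperLogLog paper.

```python
def alpha(m: int) -> float:
    """Bias-correction constant alpha_m for m registers."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    return 0.7213 / (1 + 1.079 / m)  # used for m >= 128


def raw_estimate(registers: list[int]) -> float:
    """E = alpha_m * m^2 / sum(2^(-M[j])) over all registers M[j]."""
    m = len(registers)
    harmonic_sum = sum(2.0 ** -value for value in registers)
    return alpha(m) * m * m / harmonic_sum


# Sixteen registers that have each seen a maximum rank of 1:
# the sum is 16 * 2^-1 = 8, so E = 0.673 * 256 / 8.
print(raw_estimate([1] * 16))
```

The harmonic form means a single register with an unusually large value contributes a very small term to the sum, so it moves the estimate far less than it would in an arithmetic mean.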
For small cardinalities (a raw estimate below 5/2 * m), many registers are still zero and the raw estimate is biased. HyperLogLog detects this and switches to linear counting, which estimates cardinality from the number of empty registers and is more accurate in this range. With 32-bit hashes, a further correction for hash collisions is applied for very large cardinalities approaching 2^32. The standard HyperLogLog with 16,384 registers (12KB total at 6 bits per register) achieves a standard error of 1.04/sqrt(m) ≈ 0.81%. Google's HyperLogLog++ improves the small-range bias correction and uses 64-bit hashes, which removes the need for the large-range correction.
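The small-range switch can be sketched as follows. This follows the rule from the original HyperLogLog algorithm (use linear counting when the raw estimate is at most 5/2 * m and at least one register is empty); it does not include the large-range correction or the empirical bias tables of HyperLogLog++, and the function names are hypothetical.

```python
import math


def linear_counting(m: int, empty_registers: int) -> float:
    """Estimate cardinality from the number of registers that are still zero."""
    return m * math.log(m / empty_registers)


def corrected_estimate(raw: float, registers: list[int]) -> float:
    """Apply the small-range correction to a raw HyperLogLog estimate."""
    m = len(registers)
    empty = registers.count(0)
    if raw <= 2.5 * m and empty > 0:
        return linear_counting(m, empty)
    return raw


# Half of 16 registers are still empty, so linear counting takes over.
print(corrected_estimate(10.0, [0] * 8 + [1] * 8))
```

The `empty > 0` guard matters: once every register has been touched, linear counting has no information left and the raw estimate is used as is.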
How Companies Actually Do This
Redis: The PFADD and PFCOUNT commands implement HyperLogLog for use cases like counting unique page visitors. Each key uses at most about 12KB of memory whether tracking 10 or 10 billion distinct values (small sets use a more compact sparse encoding), and PFMERGE combines counters across time windows.
Google BigQuery: Uses HyperLogLog++ (Google's enhanced version) for the APPROX_COUNT_DISTINCT function, enabling approximate unique counts across petabyte-scale tables far faster than exact COUNT(DISTINCT).
Amazon Redshift: Implements HyperLogLog for approximate distinct counting in analytics queries, allowing data analysts to get fast cardinality estimates on massive tables without the memory cost of exact counting.
Common Pitfalls
- Expecting exact counts: HyperLogLog gives an estimate with ~0.81% standard error. For a true cardinality of 1 million, roughly two-thirds of results fall between 991,900 and 1,008,100 (one standard error), and occasional results land further out
- Not merging correctly: HyperLogLog sketches must be merged by taking the register-wise maximum, not by adding counts. Adding per-sketch counts counts every element seen by more than one sketch multiple times
- Using too few registers: the error scales as 1.04/sqrt(m), so dropping to 1,024 registers saves about 11KB but raises the standard error to ~3.25%; the standard 16,384 registers (12KB) is almost always the right choice
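The numbers behind the first and third pitfalls come straight from the 1.04/sqrt(m) formula and can be checked with a short script. The function names are hypothetical, and the 6 bits per register is an assumption chosen to match the 12KB figure for 16,384 registers.

```python
import math


def standard_error(m: int) -> float:
    """Relative standard error of a HyperLogLog with m registers."""
    return 1.04 / math.sqrt(m)


def one_sigma_range(true_count: int, m: int) -> tuple[int, int]:
    """Range covering one standard error either side of the true count."""
    spread = true_count * standard_error(m)
    return round(true_count - spread), round(true_count + spread)


def dense_size_bytes(m: int, bits_per_register: int = 6) -> int:
    """Memory for m registers stored densely (6 bits each is an assumption)."""
    return m * bits_per_register // 8


for m in (1024, 4096, 16384):
    low, high = one_sigma_range(1_000_000, m)
    print(f"m={m}: error={standard_error(m):.4f}, "
          f"size={dense_size_bytes(m)} bytes, range={low}..{high}")
```

Quadrupling the register count only halves the error, so each further gain in precision costs four times the memory.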
Interview Questions Worth Practicing
- Explain the intuition behind HyperLogLog — why does observing leading zeros in hash values help estimate cardinality?
- How would you count unique daily visitors across 100 web servers using HyperLogLog without centralized logging?
- Compare HyperLogLog with exact counting and Bloom filters — what are the tradeoffs and when would you use each?
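For the second question, one possible shape of an answer is that each server keeps its own sketch and a coordinator merges them with a register-wise maximum. The sketch below simulates three servers with overlapping visitors; all names are hypothetical, and truncated SHA-256 with 4,096 registers is used only to keep the example deterministic and small. Every server must use the same hash function and register count, or the sketches cannot be merged.

```python
import hashlib

P = 12          # 2**12 = 4,096 registers per sketch (illustrative choice)
M = 1 << P


def new_sketch() -> list[int]:
    return [0] * M


def add(sketch: list[int], item: str) -> None:
    """Update the register selected by the top P bits of the item's hash."""
    h = int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")
    index = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    rank = (64 - P) - rest.bit_length() + 1
    sketch[index] = max(sketch[index], rank)


def merge(sketches: list[list[int]]) -> list[int]:
    """Combine sketches by taking the maximum of each corresponding register."""
    return [max(values) for values in zip(*sketches)]


servers = [new_sketch() for _ in range(3)]
union = new_sketch()                      # what one central sketch would have seen
for n in range(3000):
    visitor = f"user-{n}"
    add(servers[n % 3], visitor)
    if n < 1000:
        add(servers[0], visitor)          # some visitors also hit server 0
    add(union, visitor)

merged = merge(servers)
print(merged == union)  # True
```

The merged sketch is identical to the one a single central counter would have built, so overlapping visitors are counted once. Summing the three per-server estimates instead would count the shared visitors more than once.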
The Tradeoffs
- Accuracy vs Memory: HyperLogLog uses only 12KB regardless of cardinality but accepts ~0.81% standard error — exact counting uses memory proportional to cardinality but is perfectly accurate
- Speed vs Precision: HyperLogLog inserts and counts in O(1) time, but cannot be made more precise without starting over with more registers
- Mergeability vs Information: HyperLogLog can be merged across distributed systems but only answers 'how many distinct' — it cannot list the elements or check membership like a Bloom filter
How to Explain This in an Interview
Here is how I would explain HyperLogLog in a system design interview:
Start with the problem: counting exact distinct elements in a billion-element stream requires gigabytes of memory. HyperLogLog solves this in 12KB. Explain the intuition with a coin flip analogy: if someone tells you the longest streak of heads they saw was 20, you'd estimate they flipped the coin about 2^20 (a million) times — long streaks are unlikely without many flips. HyperLogLog applies this to hash values, using the longest run of leading zeros as a cardinality signal. To reduce variance, it splits elements into 16,384 registers using the hash prefix, each tracking its own maximum zero run, then combines estimates using a harmonic mean. The result has ~0.81% standard error. Mention that HyperLogLog counters are mergeable — each distributed server can maintain its own sketch and a central system takes the register-wise max, making it perfect for distributed analytics. Redis's PFADD/PFCOUNT is the most common implementation.
Where This Goes Wrong in Production
HyperLogLog is not typically associated with a single famous outage. It spread because exact distinct counts became too expensive at scale, and the problems it causes in production are usually quiet ones: numbers that are wrong rather than systems that are down.
The practical lesson is that HyperLogLog is easy to misuse in ways that nothing alerts on. Typical examples are the pitfalls listed earlier: adding per-server estimates instead of merging sketches, treating an estimate with ~0.81% standard error as an exact figure, or shrinking the register count until the error is too large for the decision the number feeds.
How Senior Engineers Think About This
Senior engineers approach HyperLogLog differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does HyperLogLog solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.
When evaluating HyperLogLog in a system design context, experienced engineers consider the failure modes first. What happens when this component goes down? How does the system degrade? Is the degradation graceful or catastrophic? These questions reveal more about your understanding than any textbook definition.
The key difference between junior and senior engineers when it comes to HyperLogLog: juniors focus on the happy path, while seniors design for what happens when things go wrong. They consider operational cost, team expertise, monitoring requirements, and how the decision will look six months from now when traffic has grown 10x.
Common Interview Mistakes
Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect HyperLogLog to real systems and real problems. Instead of reciting definitions, explain when and why you would use HyperLogLog in the system you are designing.
Mistake 2: Not discussing trade-offs. Every design decision involving HyperLogLog has trade-offs. Discuss what you gain and what you give up. Acknowledge the downsides and explain why the benefits outweigh them for your specific use case.
Mistake 3: Overcomplicating the solution. Start with the simplest approach to HyperLogLog that meets the requirements, then add complexity only when justified. Many candidates jump to complex implementations when a simpler solution would work perfectly.
Production Checklist
- Define clear metrics for measuring the effectiveness of your HyperLogLog implementation
- Set up monitoring and alerting that specifically tracks HyperLogLog-related failures
- Document your HyperLogLog design decisions in Architecture Decision Records (ADRs)
- Test failure scenarios related to HyperLogLog in staging before production deployment
- Review and update your HyperLogLog implementation quarterly as system requirements evolve
- Train new team members on the specific HyperLogLog patterns used in your system
- Establish runbooks for common HyperLogLog-related incidents and recovery procedures
Practical Implementation for .NET Developers
In .NET, you can use HyperLogLog via the StackExchange.Redis client: call db.HyperLogLogAdd(key, value) and db.HyperLogLogLength(key) for cardinality estimation backed by Redis, and db.HyperLogLogMerge to combine keys. For in-process use, the CardinalityEstimation NuGet package provides a pure C# estimator based on HyperLogLog. On the database side, SQL Server 2019 and later and Azure SQL Database expose an APPROX_COUNT_DISTINCT function for approximate distinct counts. For custom implementations, use a stable, well-distributed hash such as xxHash (for example via Standart.Hash.xxHash). Avoid System.HashCode for sketches that are persisted or merged across processes, because its output is randomized per process.
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing {Operation} for {ResourceId}", operation, resourceId);
This gives you searchable, structured logs in Azure Monitor or Seq.