Consensus Algorithms
Without consensus, distributed systems cannot reliably replicate data, elect leaders, or coordinate actions.
The Problem Consensus Algorithms Solves
Without consensus, distributed systems cannot reliably replicate data, elect leaders, or coordinate actions. Consensus is what makes distributed databases (CockroachDB, etcd) and coordination services (ZooKeeper) work correctly.
How It Works Under the Hood
Consensus algorithms enable a group of distributed nodes to agree on a single value or decision, even when some nodes fail. They are the foundation of replicated state machines, distributed databases, and leader election. Key algorithms: Paxos, Raft, and ZAB.
In Raft:
- Leader election: Nodes start as followers. If a follower does not hear from a leader within a timeout, it becomes a candidate and requests votes. The node with a majority of votes becomes the leader.
- Log replication: The leader receives client requests, appends them to its log, and replicates to followers. When a majority of followers confirm, the entry is committed.
- Safety: Raft guarantees that committed entries are never lost, even if the leader crashes. A new leader's log is always at least as up-to-date as any committed entry.
The Mental Model
- The problem: Multiple nodes must agree on a value. Some nodes may crash, messages may be delayed, but the system must still reach agreement.
- Raft: The most understandable consensus algorithm. Leader election + log replication. Used by etcd, CockroachDB, Consul.
- Paxos: The original consensus algorithm by Leslie Lamport. Correct but notoriously difficult to understand and implement.
- ZAB (ZooKeeper Atomic Broadcast): Used by ZooKeeper. Similar to Raft but optimized for primary-backup replication.
- Quorum: Consensus typically requires a majority (N/2 + 1) of nodes to agree. With 5 nodes, 3 must agree (tolerates 2 failures).
Real Systems That Depend on This
etcd (Kubernetes' brain) uses Raft consensus to replicate cluster state across 3-5 nodes.
CockroachDB uses Raft for per-range replication, with each data range having its own Raft group.
Apache Kafka uses ZAB-like protocol for controller election and in-sync replica management.
Where This Shows Up in Interviews
- What is consensus in distributed systems?
- How does Raft work at a high level?
- Why do you need an odd number of nodes?
- What happens when the leader fails in Raft?
Tradeoffs
- Safety vs Liveness: Consensus guarantees safety (never commits wrong value) but may sacrifice liveness during network partitions.
- Latency: Every write requires a round trip to a majority of nodes. This adds write latency.
- Cluster size: More nodes = higher fault tolerance but slower consensus (more round trips).
Watch Out For
- Using an even number of nodes — provides no extra fault tolerance over N-1 nodes
- Deploying all nodes in the same data center — a single failure domain defeats the purpose
- Not understanding that consensus adds write latency — every write waits for quorum
Go Deeper
- distributed-locking — start here if this is new to you
- cap-theorem
- data-replication
- paxos
- zookeeper
The Real-World Incident That Made This Famous
In 2013, a major cloud provider experienced a multi-hour outage caused by a split-brain scenario in their configuration management system. The system used a Paxos-based consensus protocol, but a network partition combined with an implementation bug caused two groups of nodes to each elect their own leader. Both leaders accepted writes, creating conflicting configuration data. When the partition healed, the conflicting data caused cascading failures across dependent services.
This incident motivated the development and adoption of Raft, a consensus algorithm designed specifically to be understandable. Diego Ongaro and John Ousterhout published the Raft paper in 2014, explicitly stating their goal: "We designed Raft so that it would be easier to understand than Paxos." The result was an algorithm that students could learn in a day, compared to Paxos which had a reputation for taking months to understand and years to implement correctly.
Raft became the consensus algorithm behind etcd, which became the backbone of Kubernetes. Every Kubernetes cluster uses etcd (and therefore Raft) to store its cluster state: pod definitions, service configurations, secrets, and scheduling decisions. When your Kubernetes cluster has a control plane outage, it is often because the etcd Raft cluster lost quorum. Understanding Raft directly helps you understand why your Kubernetes cluster behaves the way it does during network issues.
How Senior Engineers Think About This
The mental model: consensus algorithms solve the problem of getting multiple machines to agree on a value, even when some machines fail or messages are lost. This is the foundation of all distributed systems that require coordination: leader election, distributed configuration, and strongly consistent replication.
Senior engineers think of Raft in three sub-problems. Leader election: nodes start as followers, time out if they do not hear from a leader, become candidates, and request votes. The candidate with a majority of votes becomes leader. Log replication: the leader receives client requests, appends them to its log, and replicates the log entry to followers. Once a majority of followers acknowledge, the entry is committed. Safety: committed entries are never lost, and all nodes eventually apply the same entries in the same order.
The key insight: consensus requires a majority (quorum) of N/2 + 1 nodes. A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2 failures. A 7-node cluster tolerates 3 failures. But adding nodes also increases the number of acknowledgments needed for each write, which increases latency. Most production systems use 3 or 5 nodes as the sweet spot between fault tolerance and performance.
Paxos is theoretically more general but notoriously hard to implement correctly. The joke in the distributed systems community is that "there are no correct Paxos implementations, only Paxos implementations that haven't been proven incorrect yet." This is why most modern systems (etcd, CockroachDB, TiKV) chose Raft. Multi-Paxos and EPaxos offer theoretical advantages but are rarely used outside of Google.
Common Interview Mistakes
Mistake 1: Not being able to explain leader election. Walk through the steps: timeout, candidacy, vote request, majority wins. This is the most asked part of consensus in interviews.
Mistake 2: Confusing Raft with Paxos. Know the key difference: Raft enforces a strong leader (all writes go through the leader), making it simpler to understand and implement. Paxos is more flexible but more complex.
Mistake 3: Not knowing the quorum math. 3 nodes = tolerates 1 failure, 5 nodes = tolerates 2 failures. A 4-node cluster still only tolerates 1 failure (needs 3 for quorum), so even numbers provide no benefit.
Mistake 4: Forgetting about brain-split prevention. The whole point of consensus is preventing split-brain. Explain how majority quorum prevents two leaders from existing simultaneously.
Mistake 5: Not connecting consensus to real systems. etcd uses Raft. ZooKeeper uses Zab (similar to Paxos). CockroachDB uses Raft. Knowing which systems use which algorithm shows practical knowledge.
Production Checklist
- Deploy consensus clusters with odd numbers of nodes (3 or 5) — even numbers waste resources without improving fault tolerance
- Monitor leader election frequency: frequent elections indicate network instability or clock skew
- Set election timeouts appropriately: too short causes unnecessary elections, too long delays failure detection
- Monitor log replication lag between leader and followers — high lag means writes are slow to commit
- Implement client-side retry with leader discovery: when the leader changes, clients need to find the new leader
- Keep consensus clusters in the same region to minimize replication latency (cross-region consensus adds 50-100ms per write)
- Monitor disk I/O latency on consensus nodes — slow disk causes write stalls and can trigger unnecessary elections
- Implement snapshot compaction to prevent the Raft log from growing unbounded
- Test failure scenarios: kill the leader node and verify the cluster elects a new leader and continues serving
- Do not use the consensus cluster for high-throughput data storage — it is designed for metadata and configuration, not bulk data
Read the original source | Content from System-Design-Overview
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.