Skip to main content
SDMastery
advanced9 min readUpdated 2026-06-03

Consensus Algorithms

Without consensus, distributed systems cannot reliably replicate data, elect leaders, or coordinate actions.

Consensus Algorithms system design overview showing key components and metrics
High-level overview of Consensus Algorithms
Consensus Algorithms

The Problem Consensus Algorithms Solves

Without consensus, distributed systems cannot reliably replicate data, elect leaders, or coordinate actions. Consensus is what makes distributed databases (CockroachDB, etcd) and coordination services (ZooKeeper) work correctly.

How It Works Under the Hood

Consensus Algorithms system architecture with service components and data flow
System architecture for Consensus Algorithms

Consensus algorithms enable a group of distributed nodes to agree on a single value or decision, even when some nodes fail. They are the foundation of replicated state machines, distributed databases, and leader election. Key algorithms: Paxos, Raft, and ZAB.

In Raft:

  1. Leader election: Nodes start as followers. If a follower does not hear from a leader within a timeout, it becomes a candidate and requests votes. The node with a majority of votes becomes the leader.
  2. Log replication: The leader receives client requests, appends them to its log, and replicates to followers. When a majority of followers confirm, the entry is committed.
  3. Safety: Raft guarantees that committed entries are never lost, even if the leader crashes. A new leader's log is always at least as up-to-date as any committed entry.

The Mental Model

  • The problem: Multiple nodes must agree on a value. Some nodes may crash, messages may be delayed, but the system must still reach agreement.
  • Raft: The most understandable consensus algorithm. Leader election + log replication. Used by etcd, CockroachDB, Consul.
  • Paxos: The original consensus algorithm by Leslie Lamport. Correct but notoriously difficult to understand and implement.
  • ZAB (ZooKeeper Atomic Broadcast): Used by ZooKeeper. Similar to Raft but optimized for primary-backup replication.
  • Quorum: Consensus typically requires a majority (N/2 + 1) of nodes to agree. With 5 nodes, 3 must agree (tolerates 2 failures).
Step-by-step diagram showing how Consensus Algorithms works in practice
How Consensus Algorithms works step by step

Real Systems That Depend on This

etcd (Kubernetes' brain) uses Raft consensus to replicate cluster state across 3-5 nodes.

CockroachDB uses Raft for per-range replication, with each data range having its own Raft group.

Apache Kafka uses ZAB-like protocol for controller election and in-sync replica management.

Comparison table for Consensus Algorithms showing key metrics and tradeoffs
Comparing key aspects of Consensus Algorithms

Where This Shows Up in Interviews

  1. What is consensus in distributed systems?
  2. How does Raft work at a high level?
  3. Why do you need an odd number of nodes?
  4. What happens when the leader fails in Raft?

Tradeoffs

  • Safety vs Liveness: Consensus guarantees safety (never commits wrong value) but may sacrifice liveness during network partitions.
  • Latency: Every write requires a round trip to a majority of nodes. This adds write latency.
  • Cluster size: More nodes = higher fault tolerance but slower consensus (more round trips).
Data flow diagram for Consensus Algorithms showing request and response paths
Data flow through Consensus Algorithms

Watch Out For

  1. Using an even number of nodes — provides no extra fault tolerance over N-1 nodes
  2. Deploying all nodes in the same data center — a single failure domain defeats the purpose
  3. Not understanding that consensus adds write latency — every write waits for quorum

Go Deeper

Key components of Consensus Algorithms with roles and responsibilities
Key components of Consensus Algorithms

The Real-World Incident That Made This Famous

In 2013, a major cloud provider experienced a multi-hour outage caused by a split-brain scenario in their configuration management system. The system used a Paxos-based consensus protocol, but a network partition combined with an implementation bug caused two groups of nodes to each elect their own leader. Both leaders accepted writes, creating conflicting configuration data. When the partition healed, the conflicting data caused cascading failures across dependent services.

This incident motivated the development and adoption of Raft, a consensus algorithm designed specifically to be understandable. Diego Ongaro and John Ousterhout published the Raft paper in 2014, explicitly stating their goal: "We designed Raft so that it would be easier to understand than Paxos." The result was an algorithm that students could learn in a day, compared to Paxos which had a reputation for taking months to understand and years to implement correctly.

Interview tips for Consensus Algorithms system design questions
Interview tips for Consensus Algorithms

Raft became the consensus algorithm behind etcd, which became the backbone of Kubernetes. Every Kubernetes cluster uses etcd (and therefore Raft) to store its cluster state: pod definitions, service configurations, secrets, and scheduling decisions. When your Kubernetes cluster has a control plane outage, it is often because the etcd Raft cluster lost quorum. Understanding Raft directly helps you understand why your Kubernetes cluster behaves the way it does during network issues.

How Senior Engineers Think About This

The mental model: consensus algorithms solve the problem of getting multiple machines to agree on a value, even when some machines fail or messages are lost. This is the foundation of all distributed systems that require coordination: leader election, distributed configuration, and strongly consistent replication.

Senior engineers think of Raft in three sub-problems. Leader election: nodes start as followers, time out if they do not hear from a leader, become candidates, and request votes. The candidate with a majority of votes becomes leader. Log replication: the leader receives client requests, appends them to its log, and replicates the log entry to followers. Once a majority of followers acknowledge, the entry is committed. Safety: committed entries are never lost, and all nodes eventually apply the same entries in the same order.

Decision guide showing when to use Consensus Algorithms and when to avoid
When to use Consensus Algorithms

The key insight: consensus requires a majority (quorum) of N/2 + 1 nodes. A 3-node cluster tolerates 1 failure. A 5-node cluster tolerates 2 failures. A 7-node cluster tolerates 3 failures. But adding nodes also increases the number of acknowledgments needed for each write, which increases latency. Most production systems use 3 or 5 nodes as the sweet spot between fault tolerance and performance.

Paxos is theoretically more general but notoriously hard to implement correctly. The joke in the distributed systems community is that "there are no correct Paxos implementations, only Paxos implementations that haven't been proven incorrect yet." This is why most modern systems (etcd, CockroachDB, TiKV) chose Raft. Multi-Paxos and EPaxos offer theoretical advantages but are rarely used outside of Google.

Common Interview Mistakes

Mistake 1: Not being able to explain leader election. Walk through the steps: timeout, candidacy, vote request, majority wins. This is the most asked part of consensus in interviews.

Pros and cons analysis of Consensus Algorithms for system design decisions
Advantages and disadvantages of Consensus Algorithms

Mistake 2: Confusing Raft with Paxos. Know the key difference: Raft enforces a strong leader (all writes go through the leader), making it simpler to understand and implement. Paxos is more flexible but more complex.

Mistake 3: Not knowing the quorum math. 3 nodes = tolerates 1 failure, 5 nodes = tolerates 2 failures. A 4-node cluster still only tolerates 1 failure (needs 3 for quorum), so even numbers provide no benefit.

Mistake 4: Forgetting about brain-split prevention. The whole point of consensus is preventing split-brain. Explain how majority quorum prevents two leaders from existing simultaneously.

Mistake 5: Not connecting consensus to real systems. etcd uses Raft. ZooKeeper uses Zab (similar to Paxos). CockroachDB uses Raft. Knowing which systems use which algorithm shows practical knowledge.

Real-world companies using Consensus Algorithms in production systems
Real-world examples of Consensus Algorithms

Production Checklist

  • Deploy consensus clusters with odd numbers of nodes (3 or 5) — even numbers waste resources without improving fault tolerance
  • Monitor leader election frequency: frequent elections indicate network instability or clock skew
  • Set election timeouts appropriately: too short causes unnecessary elections, too long delays failure detection
  • Monitor log replication lag between leader and followers — high lag means writes are slow to commit
  • Implement client-side retry with leader discovery: when the leader changes, clients need to find the new leader
  • Keep consensus clusters in the same region to minimize replication latency (cross-region consensus adds 50-100ms per write)
  • Monitor disk I/O latency on consensus nodes — slow disk causes write stalls and can trigger unnecessary elections
  • Implement snapshot compaction to prevent the Raft log from growing unbounded
  • Test failure scenarios: kill the leader node and verify the cluster elects a new leader and continues serving
  • Do not use the consensus cluster for high-throughput data storage — it is designed for metadata and configuration, not bulk data

Read the original source | Content from System-Design-Overview

Practical Implementation for .NET Developers

In a .NET application, you would typically implement this pattern using the following approach:

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

text
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);

This gives you searchable, structured logs in Azure Monitor or Seq.

External Resources

Original Sourcearticle