How Distributed Locks Fail

2025-01-15 · 8 min read

A distributed lock seems simple: acquire a lock, do work, release the lock. In a single-process application, mutexes provide this guarantee reliably. In a distributed system, almost every assumption that makes local locks work is violated.

Failure Mode 1: Clock Skew

Many distributed lock implementations use time-based expiration: "this lock expires in 30 seconds." This assumes all machines agree on what time it is. They do not.

NTP typically keeps clocks within tens of milliseconds of each other, but network delays, NTP server problems, or misconfigured machines can let clocks drift apart by seconds or more. Suppose machine A acquires a lock with a 30-second TTL and machine B's clock runs 5 seconds ahead. If each machine judges expiry against its own clock, B treats the lock as expired 5 seconds before A does, and during that window B may acquire a lock that A still believes it holds.
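
A minimal sketch of this scenario, assuming for illustration that the lock stores an absolute expiry timestamp and that each machine compares it against its own wall clock. The `Lock` class, the `is_expired` helper, and the clock values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lock:
    owner: str
    expires_at: float  # absolute wall-clock time, computed by the acquirer


def is_expired(lock, local_now):
    """Each machine answers this question using its own clock."""
    return local_now >= lock.expires_at


TRUE_TIME = 1000.0
clock_a = TRUE_TIME         # A's clock is accurate
clock_b = TRUE_TIME + 5.0   # B's clock runs 5 seconds ahead

# A acquires the lock with a 30-second TTL, measured on A's clock.
lock = Lock(owner="A", expires_at=clock_a + 30.0)

# 26 seconds of real time pass on both machines.
elapsed = 26.0
a_sees_expired = is_expired(lock, clock_a + elapsed)
b_sees_expired = is_expired(lock, clock_b + elapsed)

print(a_sees_expired, b_sees_expired)
```

At the same real instant, A still counts 4 seconds of lease while B already considers the lock free.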

The Redis Redlock controversy: Martin Kleppmann criticized Redis's Redlock algorithm because its safety depends on timing assumptions, including bounded clock drift across the Redis nodes, bounded process pauses, and bounded network delay. If any of these is violated, the safety guarantee breaks. antirez (Salvatore Sanfilippo, the creator of Redis) disagreed, arguing that bounded clock drift is a reasonable assumption in practice. The debate was never settled, which illustrates how hard the underlying problem is.

Failure Mode 2: GC Pauses

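Garbage-collected runtimes (the JVM, Go, .NET) can pause all application threads during a GC cycle. Such pauses usually last milliseconds to hundreds of milliseconds, but with large heaps or poor tuning they can stretch to seconds. During a "stop-the-world" pause, a process holding a lock cannot renew it, cannot do useful work, and cannot notice that time has passed.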

Scenario: Process A acquires a lock with a 10-second TTL. Process A enters a GC pause lasting 15 seconds. The lock expires while A is paused. Process B acquires the now-available lock and begins its critical section. Process A wakes up from GC, believes it still holds the lock, and enters the critical section. Both processes are now in the critical section simultaneously.
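
The scenario can be replayed deterministically with a simulated clock. This is a sketch under stated assumptions: `FakeClock` and `TtlLockService` are hypothetical stand-ins for real time and a real lock service, and the pause is modeled by advancing the clock while A does nothing.

```python
class FakeClock:
    """Manually advanced clock, so the pause is reproducible."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class TtlLockService:
    """Single lock that becomes free again once its TTL has elapsed."""

    def __init__(self, clock):
        self.clock = clock
        self.owner = None
        self.expires_at = 0.0

    def try_acquire(self, client, ttl):
        if self.owner is None or self.clock.now >= self.expires_at:
            self.owner = client
            self.expires_at = self.clock.now + ttl
            return True
        return False


clock = FakeClock()
service = TtlLockService(clock)
in_critical_section = []

# A acquires the lock and remembers that it did.
a_believes_it_holds_lock = service.try_acquire("A", ttl=10)

# A is frozen for 15 seconds; it neither renews nor observes the expiry.
clock.advance(15)

# B finds the lock expired and takes it.
if service.try_acquire("B", ttl=10):
    in_critical_section.append("B")

# A resumes with its stale belief and proceeds without re-checking.
if a_believes_it_holds_lock:
    in_critical_section.append("A")

print(in_critical_section, "service says owner is", service.owner)
```

The lock service's view (B is the owner) and A's view (A is the owner) diverge, and nothing in the TTL mechanism itself stops A from acting on its stale view.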

This is not hypothetical; it happens in production on JVM-based services.

Failure Mode 3: Split Brain

In a network partition, the nodes of the lock service may themselves disagree about who holds the lock.

If you use a single Redis node for locking and that node fails, the lock disappears entirely. If you use Redis replication, which is asynchronous, the lock might not have reached the replica before it is promoted to primary, so a second client can acquire a lock that the first client believes it still holds.
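
The failover case can be modeled in a few lines. This is an illustrative sketch only: `Node` is a hypothetical in-memory stand-in for a Redis instance, and replication lag is modeled simply by never copying the key to the replica before the failover.

```python
class Node:
    """In-memory stand-in for one Redis instance."""

    def __init__(self):
        self.locks = {}

    def set_if_absent(self, key, owner):
        if key in self.locks:
            return False
        self.locks[key] = owner
        return True


primary = Node()
replica = Node()

# Client A acquires the lock on the primary.
a_acquired = primary.set_if_absent("lock:job", "A")

# The primary crashes before asynchronous replication copies the key.
# The replica is promoted with an empty lock table.
new_primary = replica

# Client B asks the new primary and also succeeds.
b_acquired = new_primary.set_if_absent("lock:job", "B")

print(a_acquired, b_acquired)
```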

If you use a consensus-based system (ZooKeeper, etcd), network partitions can prevent the lock holder from renewing its session, causing the lock to be revoked while the holder is still running and still believes it holds the lock.

Failure Mode 4: Process Pauses Beyond GC

Operating system scheduling, VM live migration, container rescheduling, disk I/O stalls, and memory pressure can all cause a process to pause for significant durations. Any pause longer than the lock's TTL creates the same danger as a GC pause: the lock expires while the holder is still running.
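
A common reaction is to check the remaining lease time just before writing. The sketch below, built on a hypothetical `FakeClock` and `guarded_write` helper, suggests why that check cannot close the gap: a stall can land between the check and the write.

```python
class FakeClock:
    """Manually advanced clock, so the stall is deterministic."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


def guarded_write(clock, lease_expires_at, stall_seconds, log):
    """Check the lease, then write. The stall models a pause between the two."""
    if clock.now >= lease_expires_at:
        return "skipped: lease already expired"
    # OS descheduling, VM live migration, an I/O stall, and so on.
    clock.advance(stall_seconds)
    log.append(("write", clock.now))
    if clock.now >= lease_expires_at:
        return "wrote after expiry"
    return "wrote within lease"


clock = FakeClock()
log = []

clock.advance(9.0)  # 9 s into a 10 s lease, so the check passes
result = guarded_write(clock, lease_expires_at=10.0, stall_seconds=3.0, log=log)

print(result, log)
```

The check passed at 9 seconds, yet the write landed at 12 seconds, 2 seconds after the lease ended.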

The Solution: Fencing Tokens

Fencing tokens solve the safety problem regardless of clocks, GC pauses, or process delays. When a client acquires a lock, the lock service issues a monotonically increasing token (a number that goes up with every lock acquisition). The client includes this token with every operation on the shared resource. The resource (e.g., a database) rejects any operation with a token lower than the highest token it has already seen.

Example: Client A acquires the lock with fencing token 33. Client A enters a GC pause. The lock expires. Client B acquires the lock with fencing token 34. Client B writes to the database with token 34. Client A wakes up and tries to write with token 33. The database rejects the write because it has already seen token 34.
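
The example translates into a short sketch. The class names (`LockService`, `FencedStore`, `StaleTokenError`) are hypothetical, lease expiry is not modeled, and the starting token of 33 is chosen only to match the example.

```python
import itertools


class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Hands out a strictly increasing token with every acquisition."""

    def __init__(self, first_token=33):
        self._tokens = itertools.count(first_token)

    def acquire(self):
        return next(self._tokens)


class FencedStore:
    """The protected resource: it remembers the highest token it has seen."""

    def __init__(self):
        self.highest_token_seen = 0
        self.data = {}

    def write(self, token, key, value):
        # Equal tokens are allowed so one holder can write more than once.
        if token < self.highest_token_seen:
            raise StaleTokenError(f"token {token} < {self.highest_token_seen}")
        self.highest_token_seen = token
        self.data[key] = value


locks = LockService()
store = FencedStore()

token_a = locks.acquire()   # 33; A then stalls and its lease expires
token_b = locks.acquire()   # 34; B takes over
store.write(token_b, "balance", 120)

try:
    store.write(token_a, "balance", 90)   # A wakes up with the stale token
    a_write_accepted = True
except StaleTokenError:
    a_write_accepted = False

print(a_write_accepted, store.data)
```

In a real store the comparison and the update would need to happen atomically, for example as a conditional update inside one transaction; the in-memory version sidesteps that by being single-threaded.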

Fencing tokens shift the safety guarantee from the lock service to the resource being protected. Even if the lock service fails to prevent two clients from thinking they hold the lock, the fencing token ensures only one client's operations are accepted.

When You Actually Need Distributed Locks

Efficiency locks: Preventing duplicate work (e.g., two cron jobs running the same task). If both occasionally run, the only cost is wasted computation. A simple Redis lock with TTL is sufficient.

Correctness locks: Preventing data corruption (e.g., two processes updating the same bank account). If both run, data is lost. You need fencing tokens or a consensus-based system with fencing.

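For efficiency, use a single Redis SET command with the NX and PX (or EX) options, which creates the key and sets its TTL atomically; a SETNX followed by a separate EXPIRE can leave a lock with no TTL if the client dies between the two calls. For correctness, use ZooKeeper or etcd with fencing tokens, or redesign to avoid locks entirely (use database transactions, CAS operations, or idempotent operations).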
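
A sketch of an efficiency lock along these lines. It assumes a client object that exposes `set(name, value, nx=..., px=...)` and `eval(script, numkeys, *keys_and_args)` the way the redis-py `Redis` client does. The function names `try_acquire` and `release` are hypothetical. The random owner token and compare-and-delete release are not from the text above; they follow the pattern commonly recommended for Redis locks, so that a client whose lock already expired cannot delete someone else's.

```python
import uuid

# Lua script: delete the key only if it still holds our token.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""


def try_acquire(client, key, ttl_ms):
    """Return a random owner token on success, or None if the lock is held."""
    token = uuid.uuid4().hex
    # SET key token NX PX ttl_ms: create-if-absent and expiry in one command.
    if client.set(key, token, nx=True, px=ttl_ms):
        return token
    return None


def release(client, key, token):
    """Delete the lock only if we still own it; True if the key was deleted."""
    return client.eval(RELEASE_SCRIPT, 1, key, token) == 1
```

With redis-py installed and a server running, `client` would be a `redis.Redis()` instance. This remains an efficiency lock: nothing here protects against the pause scenarios described earlier.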

Summary

Distributed locks fail due to clock skew, GC pauses, process stalls, and network partitions. The common thread is that a lock holder can be delayed or disconnected while another process acquires the same lock. For correctness-critical operations, a lock is only safe when paired with fencing tokens: even if two clients believe they hold the lock, the resource rejects stale tokens.

Practical Implementation for .NET Developers

In a .NET application, the locking and fencing logic described above typically sits inside the following setup:

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Azure integration: If deploying to Azure, use managed services such as Azure Cache for Redis, Azure SQL, Azure Service Bus, and Azure Cosmos DB. These reduce operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

```csharp
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
```

This gives you searchable, structured logs in Azure Monitor or Seq.

What Most Articles Get Wrong

Most articles present Redlock as the standard for distributed locking and ignore Martin Kleppmann's critique. The fundamental issue: Redlock relies on time-based lock expiry, and any algorithm that relies on time in a distributed system is fragile. A long garbage collection pause, a VM migration, or a network delay can cause a client to hold an expired lock while believing it still holds it. Another client then acquires the lock, and you have two clients in the critical section.

The deeper misconception is that you need a distributed lock at all. Many problems that seem to require locking can be solved with idempotent operations (retry safely without locks), optimistic concurrency control (detect conflicts at commit time, not at read time), or single-writer architectures (route all writes for a resource to the same process). Locks are the last resort, not the first tool.
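
As one illustration of the lock-free alternatives, here is a minimal sketch of optimistic concurrency control with a version number per row. `VersionedStore` and `write_if_unchanged` are hypothetical names; a real database would express the same idea as a conditional update on a version column.

```python
class VersionConflict(Exception):
    """Raised when the row changed since it was read."""


class VersionedStore:
    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        return self._rows.get(key, (0, None))

    def write_if_unchanged(self, key, expected_version, value):
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(key)
        self._rows[key] = (current_version + 1, value)


store = VersionedStore()
store.write_if_unchanged("account:1", 0, 100)   # initial balance, now version 1

# Two clients read the same version without taking any lock.
version_a, balance_a = store.read("account:1")
version_b, balance_b = store.read("account:1")

# A commits first; the version moves to 2.
store.write_if_unchanged("account:1", version_a, balance_a - 30)

# B's commit is based on version 1 and is detected as a conflict.
try:
    store.write_if_unchanged("account:1", version_b, balance_b + 50)
    b_committed = True
except VersionConflict:
    b_committed = False   # B must re-read and retry

print(b_committed, store.read("account:1"))
```

The conflict is caught at commit time, so a paused or slow client cannot silently overwrite newer data; it simply fails and retries.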

The Numbers That Matter

Redis SET with NX and PX (atomic, unlike SETNX followed by a separate EXPIRE): roughly 0.1ms latency per lock operation, sufficient for efficiency locks
ZooKeeper lock: roughly 2-5ms latency per lock acquisition (consensus protocol round trip), necessary for correctness locks
Redlock: requires N independent Redis instances (typically 5), acquires on a majority (at least 3 of 5), lock validity = TTL minus acquisition time, less a small clock-drift allowance
Fencing token overhead: one additional field in every protected write; trivial storage cost, significant correctness improvement
GC pause risk: Java GC pauses of 100ms-10s are documented in production. Any lock with a TTL shorter than the maximum GC pause is at risk.
etcd lock: uses Raft consensus, provides linearizable operations, roughly 5-15ms latency depending on cluster size
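
The Redlock arithmetic in the list above can be worked through in a few lines. This is a sketch with hypothetical helper names. The `drift_allowance_ms` parameter is an assumption that defaults to zero, so the basic formula is exactly TTL minus acquisition time.

```python
def quorum(n_instances):
    """Smallest majority of n instances."""
    return n_instances // 2 + 1


def remaining_validity_ms(ttl_ms, acquisition_ms, drift_allowance_ms=0):
    """Time left on the lock once acquisition has finished."""
    return ttl_ms - acquisition_ms - drift_allowance_ms


def lock_is_usable(n_instances, successes, ttl_ms, acquisition_ms):
    """Usable only with a majority and some validity time remaining."""
    return (successes >= quorum(n_instances)
            and remaining_validity_ms(ttl_ms, acquisition_ms) > 0)


needed = quorum(5)
validity = remaining_validity_ms(ttl_ms=10_000, acquisition_ms=250)
usable = lock_is_usable(5, successes=3, ttl_ms=10_000, acquisition_ms=250)
too_slow = lock_is_usable(5, successes=3, ttl_ms=10_000, acquisition_ms=10_500)

print(needed, validity, usable, too_slow)
```

With 5 instances, a 10-second TTL, and 250ms spent acquiring, the client has 9,750ms of validity left; if acquisition itself takes longer than the TTL, the lock is unusable even with a majority.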
