How Distributed Locks Fail
A distributed lock seems simple: acquire a lock, do work, release the lock. In a single-process application, mutexes provide this guarantee reliably. In a distributed system, almost every assumption that makes local locks work is violated.
Failure Mode 1: Clock Skew
Many distributed lock implementations use time-based expiration: "this lock expires in 30 seconds." This assumes all machines agree on what time it is. They do not.
NTP typically keeps clocks within tens of milliseconds of each other, but network delays, unreachable NTP servers, or misconfigured machines can let clocks drift apart by seconds or more. If machine A acquires a lock with a 30-second TTL, but machine B's clock is 5 seconds ahead, machine B may treat the lock as expired 5 seconds before machine A does.
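The effect of skew comes down to simple arithmetic. The sketch below assumes the lock stores an absolute wall-clock expiry timestamp and that each machine compares it against its own clock. `Machine` and `sees_expired` are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """A machine whose wall clock is offset from true time by skew_s seconds."""
    name: str
    skew_s: float

    def now(self, true_time_s: float) -> float:
        # What this machine's clock reads at a given true time.
        return true_time_s + self.skew_s


def sees_expired(machine: Machine, expires_at_s: float, true_time_s: float) -> bool:
    """True if the machine, judging by its own clock, considers the lock expired."""
    return machine.now(true_time_s) >= expires_at_s


a = Machine("A", skew_s=0.0)
b = Machine("B", skew_s=5.0)  # B's clock runs 5 seconds ahead

# A acquires at true time 0 with a 30-second TTL, stamped with A's clock.
expires_at = a.now(0.0) + 30.0

# At true time 26, A still has 4 seconds left by its own clock,
# but B's clock already reads 31 and treats the lock as free.
print(sees_expired(a, expires_at, true_time_s=26.0))  # False
print(sees_expired(b, expires_at, true_time_s=26.0))  # True
```

For the last 5 seconds of the TTL, the two machines disagree about whether the lock exists.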
The Redlock controversy: Martin Kleppmann criticized Redis's Redlock algorithm because its safety depends on timing assumptions: bounded clock drift across the Redis nodes, bounded process pauses, and bounded network delay. If any of those assumptions fails, the safety guarantee breaks. Salvatore Sanfilippo (antirez, the creator of Redis) disagreed, arguing that bounded clock drift is a reasonable assumption in practice. The debate was never settled and illustrates the fundamental difficulty.
Failure Mode 2: GC Pauses
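Garbage-collected runtimes (the JVM, the .NET CLR, Go) can pause all application threads during a collection. These pauses usually last from milliseconds to hundreds of milliseconds, but with large heaps or memory pressure they can stretch to seconds. During a "stop-the-world" pause, a process holding a lock cannot renew it or do useful work.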
Scenario: Process A acquires a lock with a 10-second TTL. Process A enters a GC pause lasting 15 seconds. The lock expires while A is paused. Process B acquires the now-available lock and begins its critical section. Process A wakes up from GC, believes it still holds the lock, and enters the critical section. Both processes are now in the critical section simultaneously.
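The scenario can be replayed deterministically with a simulated clock. This is a minimal sketch: `ManualClock` and `LeaseLock` are hypothetical in-memory stand-ins for a real lock service, and the pause is modeled by advancing the clock while A's code does not run.

```python
class ManualClock:
    """A clock that only moves when the simulation advances it."""

    def __init__(self):
        self.now_s = 0.0

    def advance(self, seconds):
        self.now_s += seconds


class LeaseLock:
    """In-memory lock whose grant lapses after ttl_s seconds."""

    def __init__(self, clock, ttl_s):
        self.clock = clock
        self.ttl_s = ttl_s
        self.holder = None
        self.expires_at_s = 0.0

    def try_acquire(self, client):
        lease_live = self.holder is not None and self.clock.now_s < self.expires_at_s
        if lease_live:
            return False
        self.holder = client
        self.expires_at_s = self.clock.now_s + self.ttl_s
        return True


clock = ManualClock()
lock = LeaseLock(clock, ttl_s=10.0)
in_critical_section = []

a_acquired = lock.try_acquire("A")  # t=0: A holds the lock until t=10
clock.advance(15.0)                 # A is frozen in a 15-second pause

b_acquired = lock.try_acquire("B")  # t=15: the lease lapsed at t=10
if b_acquired:
    in_critical_section.append("B")

# A resumes. Its last check said it owned the lock, and nothing tells it otherwise.
if a_acquired:
    in_critical_section.append("A")

print(in_critical_section)  # ['B', 'A']
```

Nothing in A's own state changes during the pause, which is why A cannot detect the problem by itself.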
This is not hypothetical: long stop-the-world pauses have been observed in production JVM-based services.
Failure Mode 3: Split Brain
During a network partition or failover, the nodes that make up the lock service may themselves disagree about who holds the lock.
If you use a single Redis node for locking, and that node fails, the lock disappears entirely. If you use Redis replication, which is asynchronous, the lock might not have been replicated to the replica before it is promoted to primary, so a second client can acquire a lock that the first client believes it still holds.
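A toy model of the replication case, assuming writes are acknowledged before they reach the replica. `RedisNode` is a hypothetical in-memory stand-in, not a real Redis client.

```python
class RedisNode:
    """Minimal stand-in for a Redis node that stores lock keys."""

    def __init__(self):
        self.keys = {}

    def set_if_absent(self, key, owner):
        if key in self.keys:
            return False
        self.keys[key] = owner
        return True


primary = RedisNode()
replica = RedisNode()

# Client 1 acquires on the primary. The write is acknowledged immediately
# and queued for replication, but has not reached the replica yet.
client1_ok = primary.set_if_absent("lock:job", "client-1")
pending_replication = [("lock:job", "client-1")]

# The primary crashes before the queue is flushed; the replica is promoted.
pending_replication.clear()
new_primary = replica

# Client 2 asks the new primary, which has never seen the lock.
client2_ok = new_primary.set_if_absent("lock:job", "client-2")

print(client1_ok, client2_ok)  # True True
```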
If you use a consensus-based system (ZooKeeper, etcd), network partitions can prevent the lock holder from renewing its session, causing the lock to be revoked while the holder is still running and still believes it holds the lock.
Failure Mode 4: Process Pauses Beyond GC
Operating system scheduling, VM live migration, container rescheduling, disk I/O stalls, and memory pressure can all cause a process to pause for significant durations. Any pause longer than the lock's TTL creates the same danger as a GC pause: the lock expires while the holder is still running.
The Solution: Fencing Tokens
Fencing tokens solve the safety problem regardless of clocks, GC pauses, or process delays. When a client acquires a lock, the lock service issues a monotonically increasing token (a number that goes up with every lock acquisition). The client includes this token with every operation on the shared resource. The resource (e.g., a database) rejects any operation with a token lower than the highest token it has already seen.
Example: Client A acquires the lock with fencing token 33. Client A enters a GC pause. The lock expires. Client B acquires the lock with fencing token 34. Client B writes to the database with token 34. Client A wakes up and tries to write with token 33. The database rejects the write because it has already seen token 34.
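The same example as a minimal in-memory sketch. `LockService` and `FencedStore` are hypothetical names. Lock expiry is left out so the token check is easy to see, and a real resource would persist the highest token it has seen.

```python
import itertools
import threading


class LockService:
    """Hands out a strictly increasing token with every grant (expiry omitted)."""

    def __init__(self, first_token=1):
        self._counter = itertools.count(first_token)

    def acquire(self):
        return next(self._counter)


class StaleTokenError(Exception):
    pass


class FencedStore:
    """A resource that rejects writes carrying a token older than one already seen."""

    def __init__(self):
        self._highest_token = 0
        self._data = {}
        self._mutex = threading.Lock()

    def write(self, token, key, value):
        with self._mutex:
            if token < self._highest_token:
                raise StaleTokenError(
                    f"token {token} is older than {self._highest_token}"
                )
            self._highest_token = token
            self._data[key] = value

    def read(self, key):
        return self._data.get(key)


service = LockService(first_token=33)
store = FencedStore()

token_a = service.acquire()  # 33; client A then stalls and its lock expires
token_b = service.acquire()  # 34; client B takes over

store.write(token_b, "balance", 100)

try:
    store.write(token_a, "balance", 50)  # A wakes up with its stale token
    rejected = False
except StaleTokenError:
    rejected = True

print(rejected, store.read("balance"))  # True 100
```

The check uses `<` rather than `<=` so that the current holder can issue several writes under the same token.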
Fencing tokens shift the safety guarantee from the lock service to the resource being protected. Even if the lock service fails to prevent two clients from thinking they hold the lock, the fencing token ensures only one client's operations are accepted.
When You Actually Need Distributed Locks
Efficiency locks: Preventing duplicate work (e.g., two cron jobs running the same task). If both occasionally run, the only cost is wasted computation. A simple Redis lock with TTL is sufficient.
Correctness locks: Preventing data corruption (e.g., two processes updating the same bank account). If both run, data is lost. You need fencing tokens or a consensus-based system with fencing.
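For efficiency, use Redis SET with the NX and PX options, which sets the lock and its TTL in one atomic command. A separate SETNX followed by EXPIRE can leave a lock with no expiry if the client crashes between the two calls. For correctness, use ZooKeeper/etcd with fencing tokens, or redesign to avoid locks entirely (use database transactions, CAS operations, or idempotent operations).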
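A sketch of an efficiency lock, assuming `client` is a redis-py `redis.Redis` instance (or anything with compatible `set` and `eval` methods). The helper names `acquire` and `release` and the key name in the usage comment are illustrative. The lock is taken with a single `SET ... NX PX` so the value and TTL are written atomically, and released through a small Lua script so a client cannot delete a lock that has since passed to someone else.

```python
import uuid

# Delete the key only if it still holds our value. Without this check, a client
# whose lock already expired could delete a lock now held by another client.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end
"""


def acquire(client, key, ttl_ms):
    """Try once to take the lock. Returns an owner value, or None if it is held."""
    owner = uuid.uuid4().hex
    # Equivalent to: SET key owner NX PX ttl_ms
    if client.set(key, owner, nx=True, px=ttl_ms):
        return owner
    return None


def release(client, key, owner):
    """Release the lock only if we still own it. Returns True if it was deleted."""
    return client.eval(RELEASE_SCRIPT, 1, key, owner) == 1


# Usage (requires a reachable Redis server):
#   import redis
#   client = redis.Redis()
#   owner = acquire(client, "lock:nightly-report", ttl_ms=30_000)
#   if owner is not None:
#       try:
#           ...  # do the work
#       finally:
#           release(client, "lock:nightly-report", owner)
```

This remains an efficiency lock: it carries no fencing token, so it does not protect against the pause scenarios described earlier.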
Summary
Distributed locks fail due to clock skew, GC pauses, process stalls, and network partitions. The common thread is that a lock holder can be delayed or disconnected while another process acquires the same lock. For correctness-critical operations, a lock alone is not enough: the protected resource must enforce safety itself, for example by rejecting stale fencing tokens, so that even if two clients believe they hold the lock, only one client's writes are accepted.
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.
What Most Articles Get Wrong
Most articles present Redlock as the standard for distributed locking and ignore Martin Kleppmann's critique. The fundamental issue: Redlock relies on time-based lock expiry, and any algorithm that relies on time in a distributed system is fragile. A long garbage collection pause, a VM migration, or a network delay can cause a client to hold an expired lock while believing it still holds it. Another client then acquires the lock, and you have two clients in the critical section.
The deeper misconception is that you need a distributed lock at all. Many problems that seem to require locking can be solved with idempotent operations (retry safely without locks), optimistic concurrency control (detect conflicts at commit time, not at read time), or single-writer architectures (route all writes for a resource to the same process). Locks are the last resort, not the first tool.
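As one illustration of the lock-free options, here is a minimal sketch of optimistic concurrency control using Python's built-in `sqlite3` module. The `accounts` table, its `version` column, and the helper names are assumptions made for the example. Each writer reads a version number and commits only if that version is unchanged.

```python
import sqlite3


def read_account(conn, account_id):
    """Return (balance, version) for the account."""
    return conn.execute(
        "SELECT balance, version FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()


def commit_balance(conn, account_id, new_balance, expected_version):
    """Write new_balance only if nobody has updated the row since we read it."""
    cur = conn.execute(
        "UPDATE accounts SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows touched means a conflicting update won


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100, 0)")
conn.commit()

# Two writers read the same state without taking any lock.
balance_a, version_a = read_account(conn, 1)
balance_b, version_b = read_account(conn, 1)

a_ok = commit_balance(conn, 1, balance_a - 30, version_a)  # commits, version -> 1
b_ok = commit_balance(conn, 1, balance_b - 50, version_b)  # conflict detected

print(a_ok, b_ok, read_account(conn, 1))  # True False (70, 1)
```

The losing writer re-reads the row and retries instead of overwriting the first update.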
The Numbers That Matter
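- Redis SET with NX and PX (the atomic form of SETNX + EXPIRE): typically sub-millisecond per lock operation on a local network, on the order of 0.1ms; sufficient for efficiency locks
- ZooKeeper lock: roughly 2-5ms latency per lock acquisition (consensus protocol round trip); a suitable basis for correctness locks when combined with fencing tokens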
- Redlock: requires N Redis instances (typically 5), acquires on majority (3+), lock validity = TTL minus acquisition time
- Fencing token overhead: one additional field in every protected write, trivial storage but significant correctness improvement
- GC pause risk: Java GC pauses of 100ms-10s are documented in production. Any lock with a TTL shorter than the maximum GC pause is at risk.
- etcd lock: uses Raft consensus, provides linearizable operations, roughly 5-15ms latency depending on cluster size and network
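The Redlock figures in the list above reduce to two small calculations, sketched here with hypothetical helper names. The `drift_allowance_ms` parameter is an assumption added for illustration: it defaults to 0 so the result matches "TTL minus acquisition time" as stated above, and a deployment that wants a safety margin for clock drift could subtract one.

```python
def quorum(n_instances):
    """Smallest majority of n_instances."""
    return n_instances // 2 + 1


def lock_validity_ms(ttl_ms, acquisition_ms, drift_allowance_ms=0.0):
    """Time the lock can still be trusted once acquisition has finished."""
    return ttl_ms - acquisition_ms - drift_allowance_ms


def lock_acquired(successes, n_instances, validity_ms):
    """The lock counts as held only with a majority and some validity left."""
    return successes >= quorum(n_instances) and validity_ms > 0


print(quorum(5))  # 3

# 10-second TTL, 150 ms spent contacting the instances.
validity = lock_validity_ms(ttl_ms=10_000, acquisition_ms=150)
print(validity)                       # 9850.0
print(lock_acquired(3, 5, validity))  # True
print(lock_acquired(2, 5, validity))  # False

# A slow acquisition can use up the whole TTL even with a full majority.
print(lock_acquired(5, 5, lock_validity_ms(10_000, 10_500)))  # False
```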