Distributed Locking
Without distributed locks, concurrent processes can cause data corruption, double-spending, overselling inventory, or duplicate processing.
The Core Idea
A distributed lock is a mechanism that ensures only one process across a distributed system can access a shared resource at a time. Unlike local locks (mutex), distributed locks must work across multiple servers that communicate over a network.
Step-by-Step Walkthrough
Using Redis: SET lock_key unique_value NX PX 30000 (set only if not exists, expire in 30 seconds). If SET succeeds, the client holds the lock. When done, the client deletes the key (only if the value matches — prevents releasing someone else's lock). If the client crashes, the lock expires after 30 seconds.
Using ZooKeeper: Create an ephemeral sequential znode. The client with the lowest sequence number holds the lock. When the client disconnects (crashes), the ephemeral node is automatically deleted.
Why This Approach Wins
- Mutual exclusion: Only one client can hold the lock at a time.
- Deadlock freedom: If a client crashes while holding a lock, the lock must eventually be released (TTL).
- Fault tolerance: The lock system must survive node failures (use consensus-based stores like etcd, ZooKeeper).
- Redlock (Redis): Acquires the lock on N/2+1 Redis instances. Controversial — see Martin Kleppmann's critique.
- Fencing tokens: A monotonically increasing token issued with each lock. Storage systems reject writes with older tokens, preventing split-brain scenarios.
In Production
Stripe uses distributed locks to prevent double-charging customers during payment processing.
Google uses Chubby (distributed lock service) for leader election, metadata storage, and configuration management.
Kubernetes uses etcd leases for leader election among controller manager replicas.
Tradeoffs and Limitations
- Correctness vs Performance: Consensus-based locks (etcd) are correct but slower. Redis locks are faster but can have correctness issues.
- TTL: Too short = lock expires while client is still working. Too long = long wait if client crashes.
- Availability: The lock service must be highly available; if it goes down, all locked operations block.
Production Gotchas
- Using a single Redis instance for distributed locks — no fault tolerance
- Not using fencing tokens — stale lock holders can corrupt data
- Setting lock TTL too short — long GC pauses can cause the lock to expire while the client is still processing
The Interview Angle
- How do you implement a distributed lock?
- What is the Redlock algorithm and why is it controversial?
- What are fencing tokens and why are they needed?
- How does ZooKeeper implement distributed locking?
Next Up
The Real-World Incident That Made This Famous
The most famous debate in distributed locking history is the Redlock controversy between Salvatore Sanfilippo (creator of Redis) and Martin Kleppmann (author of "Designing Data-Intensive Applications"). In 2016, Sanfilippo published the Redlock algorithm: acquire locks on N Redis instances (typically 5), consider the lock acquired if you get it on a majority (3+), and use clock-based expiry to prevent deadlocks.
Kleppmann published a devastating critique titled "How to do distributed locking." He argued that Redlock is fundamentally broken because it relies on timing assumptions. If a client acquires the lock, then experiences a long garbage collection pause (or network delay), the lock might expire before the client finishes its work. Another client acquires the lock, and now two clients believe they hold it simultaneously. The fix, Kleppmann argued, is fencing tokens: the lock service assigns a monotonically increasing token with each lock acquisition, and the resource being protected rejects operations with old tokens.
Sanfilippo responded that Kleppmann's critique, while theoretically correct, overstates the practical risk. He argued that GC pauses long enough to cause problems are rare and detectable. The debate was never fully resolved, and it illustrates a fundamental truth about distributed systems: there is no perfect distributed lock. Every implementation makes tradeoffs between safety (never allowing two holders), liveness (eventually granting the lock), and performance.
In practice, Google's Chubby lock service (and its open-source equivalent, ZooKeeper) uses a consensus-based approach with lease-based locks and fencing tokens, which is considered more correct than Redlock but requires more infrastructure.
How Senior Engineers Think About This
The first question a senior engineer asks about distributed locking is: "Do I actually need a distributed lock?" Distributed locks are expensive (network round trips), fragile (what if the lock service is down?), and complex (all the edge cases around expiry and failure). Many problems that seem to require locks can be solved with other approaches: idempotent operations, optimistic concurrency control (compare-and-swap), or single-writer designs.
When you do need a lock, understand the two levels of correctness. Efficiency locks prevent duplicate work — if two workers process the same task, the system wastes resources but nothing breaks. For these, a simple Redis SETNX with a TTL is sufficient. Correctness locks prevent data corruption — if two processes update the same bank balance, money is lost. For these, you need a consensus-based lock with fencing tokens (ZooKeeper, etcd).
The mental model for lock expiry: every distributed lock must have a TTL (time-to-live). Without it, a client that crashes while holding the lock will hold it forever (deadlock). But the TTL creates a window where two clients can hold the lock simultaneously (if the first client's work takes longer than the TTL). Fencing tokens solve this: each lock acquisition gets a monotonically increasing number, and the protected resource rejects operations with numbers lower than the highest it has seen.
Senior engineers also think about lock granularity. A global lock serializes all operations (safe but slow). Fine-grained locks per resource allow parallelism (fast but complex). The sweet spot depends on your contention level: if only 1% of operations compete, fine-grained locks give 100x throughput improvement.
Common Interview Mistakes
Mistake 1: Proposing a single Redis instance for distributed locking. A single Redis is a single point of failure. If Redis goes down, all locks are lost. Discuss either Redlock (multiple Redis instances) or consensus-based locks (ZooKeeper, etcd).
Mistake 2: Forgetting about lock expiry. Every distributed lock needs a TTL. Without it, a crashed client creates a permanent deadlock. With it, you need to handle the case where the lock expires before the client finishes.
Mistake 3: Not knowing fencing tokens. This is the solution to the "lock expired while I was working" problem. Always mention fencing tokens when discussing correctness-critical locks.
Mistake 4: Using locks for efficiency when idempotency would suffice. If the worst case of duplicate processing is wasted compute (not data corruption), make the operation idempotent instead of adding a distributed lock.
Mistake 5: Not discussing the CAP implications. A CP lock service (ZooKeeper) is unavailable during network partitions. An AP approach (Redis) might grant the lock to multiple clients during partitions. Know which tradeoff your system needs.
Production Checklist
- Decide upfront whether you need an efficiency lock (Redis SETNX is fine) or a correctness lock (use ZooKeeper/etcd with fencing tokens)
- Always set a TTL on locks — calculate it as 3x the expected operation duration to handle variance
- Implement lock renewal (heartbeat) for long-running operations that might exceed the TTL
- Use fencing tokens for any lock that protects data integrity
- Monitor lock wait time and lock hold duration — long waits indicate contention, long holds indicate slow operations
- Implement lock acquisition timeouts so clients do not block forever waiting for a lock
- Log all lock acquisitions and releases with timestamps and client identifiers for debugging deadlocks
- Test failure scenarios: what happens when the lock service is unavailable? Does your application queue work or fail fast?
- Avoid nested locks (acquiring lock B while holding lock A) — this creates deadlock potential
- Use consistent lock ordering if multiple locks must be acquired: always acquire them in the same order across all clients
Read the original source | Content from System-Design-Overview
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.