Skip to main content
SDMastery
hard8 min readUpdated 2026-06-03

Design a Distributed Lock Service

Design a distributed lock service with consensus-based locking, fencing tokens, TTLs, and leader election. Covers Chubby/ZooKeeper architecture patterns.

Design a Distributed Lock Service system design overview showing key components and metrics
High-level overview of Design a Distributed Lock Service

Problem Statement

Design a distributed lock service (like Google Chubby or Apache ZooKeeper) that provides mutual exclusion across distributed processes. The service must ensure that at most one client holds a lock at any time, survive node failures without losing lock state, support lock TTLs to prevent deadlocks from crashed clients, and issue fencing tokens to prevent split-brain scenarios.

Requirements

Design a Distributed Lock Service system architecture with service components and data flow
System architecture for Design a Distributed Lock Service

Functional

  • Acquire a named lock with a TTL (e.g., lock "payment-processor" for 30 seconds)
  • Release a lock explicitly or automatically on TTL expiration
  • Issue a monotonically increasing fencing token with each lock acquisition
  • Support lock waiting: client can block until the lock becomes available (with timeout)

Non-Functional

  • Safety: At most one client holds a given lock at any time (mutual exclusion)
  • Liveness: Locks are eventually released even if the holder crashes (TTL)
  • Availability: Service continues operating if a minority of nodes fail (Raft quorum)
  • Latency: Lock acquire/release in <10ms within the same data center

Core Architecture

Step-by-step diagram showing how Design a Distributed Lock Service works in practice
How Design a Distributed Lock Service works step by step
  1. Consensus Layer (Raft) -- The lock service runs on a cluster of 5 nodes using the Raft consensus protocol. Lock state is replicated to a majority (3/5) before an acquire is acknowledged. The Raft leader handles all lock operations; followers redirect clients to the leader. This ensures that lock state survives any 2 node failures.

  2. Lock Manager -- Maintains a hash map of lock_name -> (owner_client_id, fencing_token, expiry_time). Acquire: if lock is free or expired, assign to requester with a new fencing token (monotonically increasing counter) and set expiry = now + TTL. If lock is held, either reject immediately or add the client to a wait queue. Release: clear the lock entry and notify the first waiter.

  3. Fencing Token Issuer -- Each lock acquisition returns a fencing token (e.g., 42, 43, 44...). Clients include this token in all operations on the protected resource. The resource server rejects operations with a token lower than the highest it has seen. This prevents a scenario where a client holds a lock, pauses (GC, network), the lock expires, another client acquires it, and the original client resumes and corrupts data.

Data flow diagram for Design a Distributed Lock Service showing request and response paths
Data flow through Design a Distributed Lock Service
  1. Session and Heartbeat Manager -- Each client maintains a session with the lock service via periodic heartbeats (every 5 seconds). If the server receives no heartbeat for 3 intervals (15 seconds), the session is expired and all locks held by that client are released. This prevents deadlocks from silently crashed clients beyond the TTL mechanism.

Database Choice

No external database -- the lock state is the Raft replicated state machine itself. The Raft log is persisted to local SSD on each node for durability across restarts. The state machine (lock hash map) is kept in-memory for fast access. Periodic snapshots of the state machine are written to disk to bound Raft log size. Prometheus for operational metrics (lock contention, acquire latency, session count).

Interview tips for Design a Distributed Lock Service system design questions
Interview tips for Design a Distributed Lock Service

Key API Endpoints

text
POST /api/v1/locks/\{lock_name\}/acquire
  -> Body: \{ client_id: "worker-7", ttl_ms: 30000, wait_timeout_ms: 5000 \}
  -> Returns: \{ acquired: true, fencing_token: 42, expires_at: "..." \}

POST /api/v1/locks/\{lock_name\}/release
  -> Body: \{ client_id: "worker-7", fencing_token: 42 \}
  -> Returns: \{ released: true \}

POST /api/v1/locks/\{lock_name\}/renew
  -> Body: \{ client_id: "worker-7", fencing_token: 42, ttl_ms: 30000 \}
  -> Returns: \{ renewed: true, new_expires_at: "..." \}

Scaling Insight

Fencing tokens are the solution to the distributed lock's biggest problem: the "client pause" scenario. A client acquires a lock (token=42), then pauses for a full GC. The lock TTL expires, another client acquires the lock (token=43) and begins work. The first client resumes, believing it still holds the lock. Without fencing tokens, both clients corrupt the shared resource. With fencing tokens, every write to the resource includes the token, and the resource server rejects token=42 because it has already seen token=43. This makes the system safe even when clocks are unreliable and clients are unpredictable.

Decision guide showing when to use Design a Distributed Lock Service and when to avoid
When to use Design a Distributed Lock Service

Key Tradeoffs

DecisionOption AOption BChosen
ConsensusPaxos (theoretically optimal)Raft (understandable)Raft -- easier to implement correctly, same safety guarantees, widely adopted
Lock expiryClient-managed renewal onlyServer-side TTL + client renewalServer-side TTL -- guarantees liveness even if client crashes without releasing
Cluster size3 nodes (tolerates 1 failure)5 nodes (tolerates 2 failures)5 nodes -- production systems need to survive 1 planned + 1 unplanned failure simultaneously

Practical Implementation for .NET Developers

In a .NET application, you would typically implement this pattern using the following approach:

Pros and cons analysis of Design a Distributed Lock Service for system design decisions
Advantages and disadvantages of Design a Distributed Lock Service

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Real-world companies using Design a Distributed Lock Service in production systems
Real-world examples of Design a Distributed Lock Service

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

text
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);

This gives you searchable, structured logs in Azure Monitor or Seq.

Comparison table for Design a Distributed Lock Service showing key metrics and tradeoffs
Comparing key aspects of Design a Distributed Lock Service

System-Specific Clarifying Questions

Before designing Distributed Lock Service, ask questions specific to THIS system:

Key components of Design a Distributed Lock Service with roles and responsibilities
Key components of Design a Distributed Lock Service
  1. Who are the primary users? Understanding the user base shapes every technical decision — consumer apps have different requirements than enterprise B2B systems.
  2. What is the read-to-write ratio? This determines whether you optimize for fast reads (caching, denormalization) or fast writes (write-ahead logs, async processing).
  3. What is the geographic distribution? Users in one country vs. global users fundamentally changes your data replication and CDN strategy.
  4. What is the acceptable latency? Some features need sub-100ms responses, others can tolerate seconds. This determines your caching and architecture strategy.
  5. What is the consistency requirement? Some data (payments, inventory) needs strong consistency. Other data (social feeds, recommendations) can be eventually consistent.

Architecture Deep Dive

The architecture for Distributed Lock Service should be designed around the specific access patterns of the system. Do not apply generic templates — every system has unique hotspots, bottlenecks, and scaling challenges.

Write Path: How does data enter the system? Is it bursty (event-driven, flash sales) or steady (sensor data, logs)? Bursty writes need queuing and backpressure. Steady writes can go directly to the database.

Read Path: How is data consumed? Is it fan-out (one write, many reads like social feeds) or point lookups (one read for specific data like user profiles)? Fan-out reads benefit from pre-computation and caching. Point lookups benefit from efficient indexing.

Hot Spots: Where are the bottlenecks? For Distributed Lock Service, identify the component that will fail first under load and design mitigation strategies: caching, sharding, rate limiting, or async processing.

Sources