Design Rate Limiter
System design interview solution for Design Rate Limiter. Includes requirements, API design, data model, architecture, scaling strategy, and tradeoffs.
Problem Statement
Design a system similar to Rate Limiter. The system should handle millions of users and provide a reliable, scalable experience.
Step 1: Clarifying Questions
Before diving into the design, ask these clarifying questions:
- What is the expected scale (users, requests per second)?
- What are the most critical features to support?
- What are the latency requirements?
- Do we need to support real-time features?
- What consistency guarantees are needed?
Step 2: Functional Requirements
- Core feature set for Rate Limiter
- User-facing APIs and interactions
- Data storage and retrieval
- Search and discovery (if applicable)
- Notifications (if applicable)
Step 3: Non-Functional Requirements
- Scalability: Handle millions of concurrent users
- Availability: 99.99% uptime (four nines)
- Latency: Sub-200ms for read operations
- Consistency: Eventually consistent where acceptable, strongly consistent for critical paths
- Durability: No data loss
Step 4: Back-of-the-Envelope Estimation
| Metric | Estimate |
|---|---|
| Daily Active Users | 10M |
| Read:Write Ratio | 10:1 |
| Average Request Size | 1 KB |
| Storage per year | ~10 TB |
| Peak QPS | 100K |
Step 5: API Design
POST /api/v1/resource
GET /api/v1/resource/{id}
PUT /api/v1/resource/{id}
DELETE /api/v1/resource/{id}
Step 6: Data Model
Define the core entities and their relationships. Consider the access patterns when choosing between SQL and NoSQL.
Step 7: High-Level Architecture
The system consists of these major components:
- Client Layer — Web/mobile clients
- API Gateway — Rate limiting, authentication, routing
- Application Servers — Business logic
- Database Layer — Primary storage
- Cache Layer — Redis/Memcached for hot data
- Message Queue — Async processing
Step 8: Detailed Component Design
Write Path
How data flows from client to persistent storage.
Read Path
How data is retrieved, including cache interactions.
Step 9: Scaling Strategy
- Horizontal scaling of application servers behind a load balancer
- Database sharding by user ID or geographic region
- Read replicas for read-heavy workloads
- CDN for static content delivery
- Auto-scaling based on traffic patterns
Step 10: Reliability and Fault Tolerance
- Data replication across availability zones
- Circuit breakers for dependent services
- Graceful degradation under high load
- Health checks and automated failover
Step 11: Monitoring and Observability
- Request latency (p50, p95, p99)
- Error rates by endpoint
- Database query performance
- Cache hit/miss ratios
- Queue depth and processing lag
Key Tradeoffs
| Decision | Option A | Option B | Chosen |
|---|---|---|---|
| Database | SQL | NoSQL | Depends on access patterns |
| Consistency | Strong | Eventual | Eventual for most reads |
| Communication | Sync | Async | Async for non-critical paths |
How to Present This in an Interview
- Start with clarifying questions (2 min)
- Define requirements (3 min)
- Do estimation (2 min)
- Design API and data model (5 min)
- Draw high-level architecture (10 min)
- Deep dive into critical components (10 min)
- Discuss tradeoffs and bottlenecks (5 min)
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.
Deep-Dive: Clarifying Questions for Rate Limiter Design
- Where in the stack? Client-side, API gateway level, application level, or all three? Each layer has different tradeoffs.
- What counting granularity? Per user (API key), per IP, per endpoint, or a combination? Corporate NAT can put thousands of users behind one IP.
- Distributed or single-node? If you have 50 API servers, rate limits must be coordinated. A user hitting different servers should still be rate limited globally.
- What happens when rate limited? Drop the request? Queue it? Return HTTP 429 with Retry-After header?
- What algorithms should we support? Token bucket (allows bursts), sliding window (precise), fixed window (simple), leaky bucket (smooth rate)?
- Soft or hard limits? Should limits be strict (exactly 100 requests/minute) or approximate (roughly 100, with brief bursts allowed)?
Specific Functional Requirements
- Configurable Rate Rules: Define rate limits per API key, per endpoint, per IP, or any combination (e.g., "100 requests/minute for /api/search per API key")
- Multiple Algorithms: Support token bucket (burst-friendly), sliding window (precise), and fixed window (simple)
- Distributed Counting: Rate limits are consistent across all API servers using centralized counting
- Informative Responses: Return HTTP 429 with X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After headers
- Low Latency: Rate limit check must add under 1ms of latency to each request
- Tiered Limits: Different rate limits for free, basic, pro, and enterprise API tiers
- Monitoring: Dashboard showing rate limit hit rates per rule, per API key, per endpoint
Specific Internal API (Rate Limiter as a Service)
POST /ratelimit/check
Body: { "key": "apikey_123:/api/search", "limit": 100, "window_seconds": 60, "algorithm": "sliding_window" }
Response: { "allowed": true, "remaining": 72, "reset_at": 1234567890, "retry_after": null }
POST /ratelimit/check (rate limited response)
Response: { "allowed": false, "remaining": 0, "reset_at": 1234567920, "retry_after": 28 }
GET /ratelimit/rules
Response: { "rules": [{ "pattern": "/api/search", "limit": 100, "window": 60, "tier": "free" }, ...] }
PUT /ratelimit/rules/:rule_id
Body: { "limit": 200, "window": 60 }
Response: { "rule_id": "r_123", "updated": true }
Redis implementation (sliding window counter):
-- Lua script for atomic sliding window rate limiting
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)
local count = redis.call('ZCARD', key)
if count < limit then
redis.call('ZADD', key, now, now .. math.random())
redis.call('EXPIRE', key, window)
return {1, limit - count - 1}
end
return {0, 0}
Specific Data Model
Rate Limit Rules (PostgreSQL)
| Column | Type | Notes |
|---|---|---|
| rule_id | UUID | Primary key |
| endpoint_pattern | VARCHAR | e.g., "/api/v1/search", "/api/v1/*" |
| tier | ENUM | free, basic, pro, enterprise |
| limit_count | INT | Max requests allowed |
| window_seconds | INT | Time window |
| algorithm | ENUM | token_bucket, sliding_window, fixed_window |
Rate Limit Counters (Redis)
- Token Bucket: Key = "tb:API_KEY:ENDPOINT", Value = (tokens: N, last_refill: timestamp)
- Sliding Window: Key = "sw:API_KEY:ENDPOINT", Value = Sorted Set of request timestamps
- Fixed Window: Key = "fw:API_KEY:ENDPOINT:WINDOW_START", Value = counter (integer)
Rate Limit Metrics (Time-series DB / InfluxDB)
| Field | Type | Notes |
|---|---|---|
| api_key | TAG | For per-key analysis |
| endpoint | TAG | For per-endpoint analysis |
| allowed_count | FIELD | Requests that passed |
| rejected_count | FIELD | Requests that were rate limited |
| timestamp | TIME | 1-minute granularity |
Specific Back-of-the-Envelope Numbers
Traffic:
- API receives 100,000 requests/second at peak
- Each request requires one rate limit check = 100K Redis operations/second
- Redis single instance handles ~100K operations/second — need sharding or clustering for higher throughput
Memory (Redis):
- Token bucket: 1 key per (API key, endpoint) combination. 1M API keys * 10 endpoints = 10M keys * 100 bytes = 1 GB
- Sliding window: stores each request timestamp. 1M keys * 100 timestamps * 16 bytes = 1.6 GB. More memory-intensive but more precise.
- Fixed window: 1 key per (API key, endpoint, window). Same as token bucket memory but keys auto-expire.
Latency:
- Redis lookup: 0.1-0.5ms (local network)
- Lua script execution: 0.05ms
- Total rate limit overhead: under 1ms per request (target: under 0.5ms)
Availability:
- Rate limiter must be as available as the API itself
- If Redis is down: fail OPEN (allow all requests) for availability-focused systems, fail CLOSED (reject all) for security-focused systems
- Use Redis Sentinel or Redis Cluster for high availability