Rate Limiting Algorithms

Java and Python implementations of 5 rate limiting algorithms: Token Bucket, Leaky Bucket, Fixed Window, Sliding Window Log, and Sliding Window Counter.

Rate Limiting Algorithms — Code Lab

Rate limiting controls how many requests a client can make within a time window. Here are 5 common algorithms with working implementations in Java and Python.

Algorithm Comparison

Algorithm	Precision	Memory	Burst Handling	Recommended For
Token Bucket	Good	O(1)	Allows bursts	General-purpose APIs
Leaky Bucket	Perfect	O(queue)	No bursts (smooth)	Traffic shaping
Fixed Window	Low	O(1)	Edge burst issue	Simple use cases
Sliding Window Log	Perfect	O(N)	No edge bursts	Low-traffic precision
Sliding Window Counter	Good	O(1)	Minimal edge bursts	Production default

Token Bucket

A bucket holds tokens up to a maximum capacity. Tokens are added at a fixed rate. Each request consumes one token. If the bucket is empty, the request is rejected. Allows bursts up to the bucket capacity.

System architecture diagram for Rate Limiting Algorithms showing how services, databases, and caches connect — System architecture for Rate Limiting Algorithms

Time Complexity: O(1)
Space Complexity: O(1) per client
Best For: APIs that allow occasional bursts (most common in production)

Java

java

package implementations.java.rate_limiting;

import java.time.Instant;

public class TokenBucket {
    private final long capacity;        // Maximum number of tokens the bucket can hold
    private final double fillRate;      // Rate at which tokens are added to the bucket (tokens per second)
    private double tokens;              // Current number of tokens in the bucket
    private Instant lastRefillTimestamp; // Last time we refilled the bucket

    public TokenBucket(long capacity, double fillRate) {
        this.capacity = capacity;
        this.fillRate = fillRate;
        this.tokens = capacity;  // Start with a full bucket
        this.lastRefillTimestamp = Instant.now();
    }

    public synchronized boolean allowRequest(int tokens) {
        refill();  // First, add any new tokens based on elapsed time

        if (this.tokens < tokens) {
            return false;  // Not enough tokens, deny the request
        }

        this.tokens -= tokens;  // Consume the tokens
        return true;  // Allow the request
    }

    private void refill() {
        Instant now = Instant.now();
        // Calculate how many tokens to add based on the time elapsed
        double tokensToAdd = (now.toEpochMilli() - lastRefillTimestamp.toEpochMilli()) * fillRate / 1000.0;
        this.tokens = Math.min(capacity, this.tokens + tokensToAdd);  // Add tokens, but don't exceed capacity
        this.lastRefillTimestamp = now;
    }
}

Python

python

import time

class TokenBucket:
    def __init__(self, capacity, fill_rate):
        self.capacity = capacity  # Maximum number of tokens the bucket can hold
        self.fill_rate = fill_rate  # Rate at which tokens are added (tokens/second)
        self.tokens = capacity  # Current token count, start with a full bucket
        self.last_time = time.time()  # Last time we checked the token count

    def allow_request(self, tokens=1):
        now = time.time()
        # Calculate how many tokens have been added since the last check
        time_passed = now - self.last_time
        self.tokens = min(self.capacity, self.tokens + time_passed * self.fill_rate)
        self.last_time = now

        # Check if we have enough tokens for this request
        if self.tokens >= tokens:
            self.tokens -= tokens
            return True
        return False

# Usage example
limiter = TokenBucket(capacity=10, fill_rate=1)  # 10 tokens, refill 1 token per second

for _ in range(15):
    print(limiter.allow_request())  # Will print True for the first 10 requests, then False
    time.sleep(0.1)  # Wait a bit between requests

time.sleep(5)  # Wait for bucket to refill
print(limiter.allow_request())  # True

Leaky Bucket

Step-by-step diagram showing how Rate Limiting Algorithms processes a request from start to finish — How Rate Limiting Algorithms works step by step

Requests enter a queue (bucket). They are processed at a fixed rate, regardless of arrival rate. If the queue is full, new requests are dropped. Produces a perfectly smooth output rate.

Time Complexity: O(1)
Space Complexity: O(bucket size)
Best For: Traffic shaping where a smooth output rate is required

Java

java

package implementations.java.rate_limiting;

import java.time.Instant;
import java.util.LinkedList;
import java.util.Queue;

public class LeakyBucket {
    private final long capacity;        // Maximum number of requests the bucket can hold
    private final double leakRate;      // Rate at which requests leak out of the bucket (requests per second)
    private final Queue<Instant> bucket; // Queue to hold timestamps of requests
    private Instant lastLeakTimestamp;   // Last time we leaked from the bucket

    public LeakyBucket(long capacity, double leakRate) {
        this.capacity = capacity;
        this.leakRate = leakRate;
        this.bucket = new LinkedList<>();
        this.lastLeakTimestamp = Instant.now();
    }

    public synchronized boolean allowRequest() {
        leak();  // First, leak out any requests based on elapsed time

        if (bucket.size() < capacity) {
            bucket.offer(Instant.now());  // Add the new request to the bucket
            return true;  // Allow the request
        }
        return false;  // Bucket is full, deny the request
    }

    private void leak() {
        Instant now = Instant.now();
        long elapsedMillis = now.toEpochMilli() - lastLeakTimestamp.toEpochMilli();
        int leakedItems = (int) (elapsedMillis * leakRate / 1000.0);  // Calculate how many items should have leaked

        // Remove the leaked items from the bucket
        for (int i = 0; i < leakedItems && !bucket.isEmpty(); i++) {
            bucket.poll();
        }

        lastLeakTimestamp = now;
    }
}

Python

python

from collections import deque
import time

class LeakyBucket:
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity  # Maximum number of requests in the bucket
        self.leak_rate = leak_rate  # Rate at which requests leak (requests/second)
        self.bucket = deque()  # Queue to hold request timestamps
        self.last_leak = time.time()  # Last time we leaked from the bucket

    def allow_request(self):
        now = time.time()
        # Simulate leaking from the bucket
        leak_time = now - self.last_leak
        leaked = int(leak_time * self.leak_rate)
        if leaked > 0:
            # Remove the leaked requests from the bucket
            for _ in range(min(leaked, len(self.bucket))):
                self.bucket.popleft()
            self.last_leak = now

        # Check if there's capacity and add the new request
        if len(self.bucket) < self.capacity:
            self.bucket.append(now)
            return True
        return False

# Usage example
limiter = LeakyBucket(capacity=5, leak_rate=1)  # 5 requests, leak 1 per second

for _ in range(10):
    print(limiter.allow_request())  # Will print True for the first 5 requests, then False
    time.sleep(0.1)  # Wait a bit between requests

time.sleep(1)  # Wait for bucket to leak
print(limiter.allow_request())  # True

Comparison table for Rate Limiting Algorithms contrasting approaches, tradeoffs, and when to use each — Comparing key aspects of Rate Limiting Algorithms

Fixed Window Counter

Counts requests in fixed time intervals (e.g., 1-minute windows). Resets counter at window boundaries. Simple but vulnerable to burst at window edges — 2x the limit can pass if bursts straddle two windows.

Time Complexity: O(1)
Space Complexity: O(1) per client
Best For: Simple rate limiting where edge bursts are acceptable

Java

java

package implementations.java.rate_limiting;

import java.time.Instant;

public class FixedWindowCounter {
    private final long windowSizeInSeconds;  // Size of each window in seconds
    private final long maxRequestsPerWindow; // Maximum number of requests allowed per window
    private long currentWindowStart;         // Start time of the current window
    private long requestCount;               // Number of requests in the current window

    public FixedWindowCounter(long windowSizeInSeconds, long maxRequestsPerWindow) {
        this.windowSizeInSeconds = windowSizeInSeconds;
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.currentWindowStart = Instant.now().getEpochSecond();
        this.requestCount = 0;
    }

    public synchronized boolean allowRequest() {
        long now = Instant.now().getEpochSecond();
        
        // Check if we've moved to a new window
        if (now - currentWindowStart >= windowSizeInSeconds) {
            currentWindowStart = now;  // Start a new window
            requestCount = 0;          // Reset the count for the new window
        }

        if (requestCount < maxRequestsPerWindow) {
            requestCount++;  // Increment the count for this window
            return true;     // Allow the request
        }
        return false;  // We've exceeded the limit for this window, deny the request
    }    
}

Python

python

import time

class FixedWindowCounter:
    def __init__(self, window_size, max_requests):
        self.window_size = window_size  # Size of the window in seconds
        self.max_requests = max_requests  # Maximum number of requests per window
        self.current_window = time.time() // window_size
        self.request_count = 0

    def allow_request(self):
        current_time = time.time()
        window = current_time // self.window_size

        # If we've moved to a new window, reset the counter
        if window != self.current_window:
            self.current_window = window
            self.request_count = 0

        # Check if we're still within the limit for this window
        if self.request_count < self.max_requests:
            self.request_count += 1
            return True
        return False

# Usage example
limiter = FixedWindowCounter(window_size=60, max_requests=5)  # 5 requests per minute

for _ in range(10):
    print(limiter.allow_request())  # Will print True for the first 5 requests, then False
    time.sleep(0.1)  # Wait a bit between requests

time.sleep(60)  # Wait for the window to reset
print(limiter.allow_request())  # True

Data flow diagram for Rate Limiting Algorithms showing how requests and responses move through the system — Data flow through Rate Limiting Algorithms

Sliding Window Log

Stores the timestamp of every request. Counts requests in the last N seconds by scanning the log. Precise but memory-intensive for high-traffic endpoints.

Time Complexity: O(N) per check
Space Complexity: O(N) per client
Best For: When precision is critical and traffic is moderate

Java

java

package implementations.java.rate_limiting;

import java.time.Instant;
import java.util.LinkedList;
import java.util.Queue;

public class SlidingWindowLog {
    private final long windowSizeInSeconds;   // Size of the sliding window in seconds
    private final long maxRequestsPerWindow;  // Maximum number of requests allowed in the window
    private final Queue<Long> requestLog;     // Log of request timestamps

    public SlidingWindowLog(long windowSizeInSeconds, long maxRequestsPerWindow) {
        this.windowSizeInSeconds = windowSizeInSeconds;
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.requestLog = new LinkedList<>();
    }

    public synchronized boolean allowRequest() {
        long now = Instant.now().getEpochSecond();
        long windowStart = now - windowSizeInSeconds;

        // Remove timestamps that are outside of the current window
        while (!requestLog.isEmpty() && requestLog.peek() <= windowStart) {
            requestLog.poll();
        }

        if (requestLog.size() < maxRequestsPerWindow) {
            requestLog.offer(now);  // Log this request
            return true;            // Allow the request
        }
        return false;  // We've exceeded the limit for this window, deny the request
    }    
}

Component diagram for Rate Limiting Algorithms showing each building block and its responsibility — Key components of Rate Limiting Algorithms

Python

python

import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, window_size, max_requests):
        self.window_size = window_size  # Size of the sliding window in seconds
        self.max_requests = max_requests  # Maximum number of requests per window
        self.request_log = deque()  # Log to keep track of request timestamps

    def allow_request(self):
        now = time.time()
        
        # Remove timestamps that are outside the current window
        while self.request_log and now - self.request_log[0] >= self.window_size:
            self.request_log.popleft()

        # Check if we're still within the limit
        if len(self.request_log) < self.max_requests:
            self.request_log.append(now)
            return True
        return False

# Usage example
limiter = SlidingWindowLog(window_size=60, max_requests=5)  # 5 requests per minute

for _ in range(10):
    print(limiter.allow_request())  # Will print True for the first 5 requests, then False
    time.sleep(0.1)  # Wait a bit between requests

time.sleep(60)  # Wait for the window to slide
print(limiter.allow_request())  # True

Sliding Window Counter

Combines fixed window and sliding window. Estimates the count using a weighted average of the current and previous window. Better precision than fixed window, less memory than sliding log.

Time Complexity: O(1)
Space Complexity: O(1) per client
Best For: Best balance of precision and efficiency (recommended default)

Interview preparation checklist for Rate Limiting Algorithms with key points to mention and mistakes to avoid — Interview tips for Rate Limiting Algorithms

Java

java

package implementations.java.rate_limiting;

import java.time.Instant;

public class SlidingWindowCounter {
    private final long windowSizeInSeconds;   // Size of the sliding window in seconds
    private final long maxRequestsPerWindow;  // Maximum number of requests allowed in the window
    private long currentWindowStart;          // Start time of the current window
    private long previousWindowCount;         // Number of requests in the previous window
    private long currentWindowCount;          // Number of requests in the current window

    public SlidingWindowCounter(long windowSizeInSeconds, long maxRequestsPerWindow) {
        this.windowSizeInSeconds = windowSizeInSeconds;
        this.maxRequestsPerWindow = maxRequestsPerWindow;
        this.currentWindowStart = Instant.now().getEpochSecond();
        this.previousWindowCount = 0;
        this.currentWindowCount = 0;
    }

    public synchronized boolean allowRequest() {
        long now = Instant.now().getEpochSecond();
        long timePassedInWindow = now - currentWindowStart;

        // Check if we've moved to a new window
        if (timePassedInWindow >= windowSizeInSeconds) {
            previousWindowCount = currentWindowCount;
            currentWindowCount = 0;
            currentWindowStart = now;
            timePassedInWindow = 0;
        }

        // Calculate the weighted count of requests
        double weightedCount = previousWindowCount * ((windowSizeInSeconds - timePassedInWindow) / (double) windowSizeInSeconds)
                + currentWindowCount;

        if (weightedCount < maxRequestsPerWindow) {
            currentWindowCount++;  // Increment the count for this window
            return true;           // Allow the request
        }
        return false;  // We've exceeded the limit, deny the request
    }    
}

Python

python

import time

class SlidingWindowCounter:
    def __init__(self, window_size, max_requests):
        self.window_size = window_size  # Size of the sliding window in seconds
        self.max_requests = max_requests  # Maximum number of requests per window
        self.current_window = time.time() // window_size
        self.request_count = 0
        self.previous_count = 0

    def allow_request(self):
        now = time.time()
        window = now // self.window_size

        # If we've moved to a new window, update the counts
        if window != self.current_window:
            self.previous_count = self.request_count
            self.request_count = 0
            self.current_window = window

        # Calculate the weighted request count
        window_elapsed = (now % self.window_size) / self.window_size
        threshold = self.previous_count * (1 - window_elapsed) + self.request_count

        # Check if we're within the limit
        if threshold < self.max_requests:
            self.request_count += 1
            return True
        return False

# Usage example
limiter = SlidingWindowCounter(window_size=60, max_requests=5)  # 5 requests per minute

for _ in range(10):
    print(limiter.allow_request())  # Will print True for the first 5 requests, then gradually become False
    time.sleep(0.1)  # Wait a bit between requests

time.sleep(30)  # Wait for half the window to pass
print(limiter.allow_request())  # Might be True or False depending on the exact timing

Production Considerations

Distributed Rate Limiting: These implementations work for a single server. For distributed systems, use a centralized store like Redis. The token bucket maps naturally to Redis INCR with TTL.
Client Identification: Rate limit by API key (for authenticated clients) or IP address (for anonymous). Be careful with shared IPs (NAT).
Response Headers: Return rate limit headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After.
Graceful Handling: Return HTTP 429 (Too Many Requests) with a clear error message and Retry-After header.
Per-Endpoint Limits: Different endpoints may need different limits — a search endpoint might allow 100 req/min while a write endpoint allows 10 req/min.

Decision guide for when to choose Rate Limiting Algorithms and when alternative approaches are better — When to use Rate Limiting Algorithms

Interview Tip

In system design interviews, Token Bucket and Sliding Window Counter are the two most recommended algorithms. Token Bucket is the industry standard (used by Stripe, AWS); Sliding Window Counter offers the best precision/efficiency tradeoff.

Source

Code from System-Design-Overview by Ashish Pratap Singh. Explanations are original.

Practical Implementation for .NET Developers

Tradeoff analysis for Rate Limiting Algorithms listing advantages, disadvantages, and real-world considerations — Advantages and disadvantages of Rate Limiting Algorithms

In a .NET application, you would typically implement this pattern using the following approach:

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Production deployment examples of Rate Limiting Algorithms at companies like Netflix, Google, and Amazon — Real-world examples of Rate Limiting Algorithms

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

text

Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);

This gives you searchable, structured logs in Azure Monitor or Seq.

Key Takeaways for Interviews

Understand the core problem this resource addresses and be able to explain it in 2-3 sentences without jargon
Know the key trade-offs: what does this approach optimize for, and what does it sacrifice?
Be ready to compare this with alternative approaches and explain when each is appropriate
Connect the concepts to real-world systems you have worked with or studied
Demonstrate depth by discussing failure modes and how they are handled

How This Applies to Modern .NET Systems

The concepts from this resource translate to .NET through several established libraries and patterns:

Azure managed services often abstract away the underlying distributed systems complexity, but understanding the fundamentals helps you configure them correctly, debug issues, and make informed architectural decisions.

NuGet packages in the .NET ecosystem provide production-ready implementations of many patterns described in this resource. Before building custom solutions, check if a well-maintained package already exists.

ASP.NET Core middleware pipeline is where many of these patterns are implemented in practice: caching, rate limiting, health checks, and circuit breaking all fit naturally into the middleware model.

Sources

Source Code on GitHubcode