Advanced | 11 min read | Updated 2026-06-08

Load Shedding

Learn load shedding in distributed systems: intentionally dropping excess requests to protect system stability and maintain quality of service for accepted traffic.

Load shedding is the deliberate practice of dropping excess requests when a system approaches its capacity limits, so that accepted requests are served with acceptable latency instead of every request degrading. Unlike rate limiting, which enforces policy, load shedding is a self-preservation mechanism triggered by actual resource saturation. By rejecting work early, before it consumes CPU, memory, or threads, the system protects itself from cascading failure and maintains quality of service for the traffic it does handle.

Aspect	Details
What it is	Intentionally rejecting excess requests during overload to maintain quality of service for accepted traffic
When to use	When traffic exceeds system capacity and serving all requests would degrade quality for everyone
When NOT to use	When the system is within capacity or when every request must be processed regardless of latency (batch pipelines with no SLO)
Real-world example	Google's frontend servers shed load based on CPU utilization thresholds, returning 503s early before backends become unresponsive
Interview tip	Distinguish load shedding from rate limiting — rate limiting enforces policy, load shedding protects against actual overload
Common mistake	Shedding load too aggressively during brief spikes — some headroom should exist before triggering rejection
Key tradeoff	Availability vs. correctness — shedding keeps the system alive but rejected requests need proper handling by callers

Why This Matters

Every system has a capacity ceiling. When load exceeds that ceiling, latency increases nonlinearly: a server at 95% CPU might have 10x the latency of one at 70%. Without load shedding, all users experience degradation: pages load slowly, API calls time out, and downstream dependencies accumulate queued requests. Load shedding accepts that serving 80% of traffic well is better than serving 100% poorly. By rejecting excess requests with fast 503 responses, the system preserves resources for accepted work. This is especially important during flash crowds, DDoS attacks, or cascading failures, when autoscaling cannot react quickly enough. Google's Site Reliability Engineering book treats load shedding as a core technique for handling overload.
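
To make the nonlinear latency claim concrete, here is a rough sketch that uses the textbook M/M/1 queueing model. The model is an assumption chosen only for illustration (real servers are not single-queue exponential systems), and the function name and the 10 ms service time are hypothetical.

```python
def mm1_response_time(service_time_s: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: S / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


if __name__ == "__main__":
    service_time = 0.010  # assumed: 10 ms of work per request
    for rho in (0.50, 0.70, 0.90, 0.95, 0.99):
        latency_ms = mm1_response_time(service_time, rho) * 1000
        print(f"utilization {rho:.0%}: mean latency {latency_ms:.1f} ms")
```

In this idealized model, going from 70% to 95% utilization multiplies mean latency by about 6, and the curve keeps steepening as utilization approaches 100%. Measured systems may degrade more or less sharply than this.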

[Figure: System architecture for Load Shedding]

The Building Blocks

Admission Control: The decision layer that evaluates incoming requests against current system load and decides whether to accept or reject each one
Load Signals: Real-time metrics like CPU utilization, queue depth, in-flight request count, or latency percentiles used to detect overload conditions
Priority Classification: Categorizing requests by importance so that load shedding drops low-priority traffic first while preserving critical operations
Fast Rejection: Returning 503 Service Unavailable or 429 responses immediately without performing any downstream work to minimize resource consumption
Feedback Loop: Continuously adjusting the shedding threshold based on observed system behavior to avoid over-shedding or under-shedding
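
As one possible shape for the feedback loop, the sketch below adjusts a concurrency limit from observed latency using additive increase and multiplicative decrease (AIMD). AIMD is a technique picked here for illustration, not one the article prescribes, and the class name, target latency, and step sizes are all hypothetical.

```python
class AdaptiveLimit:
    """Adjusts a concurrency limit from observed latency (AIMD sketch)."""

    def __init__(self, initial=100, minimum=10, maximum=1000,
                 target_latency_s=0.100, backoff=0.9):
        self.limit = initial
        self.minimum = minimum
        self.maximum = maximum
        self.target_latency_s = target_latency_s
        self.backoff = backoff

    def observe(self, latency_s: float) -> int:
        """Feed one latency sample (or a window average); return the new limit."""
        if latency_s > self.target_latency_s:
            # Overload signal: shrink multiplicatively so shedding starts sooner.
            self.limit = max(self.minimum, int(self.limit * self.backoff))
        else:
            # Healthy sample: probe for more capacity one slot at a time.
            self.limit = min(self.maximum, self.limit + 1)
        return self.limit
```

The admission layer would compare in-flight requests against `limit`. Shrinking fast and growing slowly is a deliberate bias toward protecting the service over reclaiming every last unit of capacity.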

Under the Hood

Load shedding operates as an early-reject mechanism at the ingress point of a service. The simplest implementation tracks in-flight requests using an atomic counter. When the counter exceeds a configured threshold (derived from load testing), new requests are immediately rejected with a 503 status. More sophisticated systems apply Little's Law (average concurrency = throughput × average latency): if in-flight requests climb while throughput stays flat, latency must be rising, which signals that the system is overloaded.
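
A minimal sketch of the in-flight counter, assuming a threaded Python service where a lock stands in for the atomic counter. The helper `concurrency_limit` derives a starting threshold from Little's Law. All names, the 20% headroom, and the example numbers are hypothetical, and a real threshold should still come from load testing.

```python
import threading


def concurrency_limit(throughput_rps: float, latency_s: float,
                      headroom: float = 1.2) -> int:
    """Little's Law: in-flight = throughput x latency, plus some headroom."""
    return max(1, round(throughput_rps * latency_s * headroom))


class InFlightShedder:
    """Rejects new work once the in-flight count reaches the limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self._in_flight = 0
        self._lock = threading.Lock()  # stands in for an atomic counter

    def try_acquire(self) -> bool:
        with self._lock:
            if self._in_flight >= self.limit:
                return False  # caller should answer 503 right away
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle(shedder: InFlightShedder, work):
    """Wraps a request handler; returns (status_code, body)."""
    if not shedder.try_acquire():
        return 503, "overloaded"
    try:
        return 200, work()
    finally:
        shedder.release()
```

For example, a service that sustains 1,000 requests per second at 50 ms average latency holds about 50 requests in flight, so `concurrency_limit(1000, 0.05)` suggests a limit of 60 with the assumed headroom.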

[Figure: How Load Shedding works step by step]

Adaptive approaches avoid a fixed threshold. Google's Doorman (a cooperative client-side rate limiter) and CoDel-inspired designs are examples. CoDel (Controlled Delay) monitors the time requests spend waiting in queues. If queue sojourn time stays above a target (say 5 ms) for a sustained interval, the system begins dropping requests. This approach self-tunes: during brief bursts the queue drains normally, but sustained overload triggers progressive shedding.
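
The sojourn-time idea can be sketched as below. This is a deliberately simplified, CoDel-inspired check and not the full CoDel algorithm (it omits CoDel's control law for spacing successive drops). The class name and the 100 ms interval are assumptions, and time is passed in explicitly so the logic is easy to test; a real service would likely pass `time.monotonic()`.

```python
class SojournShedder:
    """Sheds when queue wait time stays above target for a full interval."""

    def __init__(self, target_s=0.005, interval_s=0.100):
        self.target_s = target_s
        self.interval_s = interval_s
        self._above_since = None  # time the wait first exceeded the target

    def should_shed(self, sojourn_s: float, now_s: float) -> bool:
        """sojourn_s: how long the dequeued request waited; now_s: current time."""
        if sojourn_s <= self.target_s:
            self._above_since = None  # queue drained: a burst, not overload
            return False
        if self._above_since is None:
            self._above_since = now_s  # start timing the excursion
            return False
        return now_s - self._above_since >= self.interval_s
```

A single slow dequeue does not trigger shedding; only an excursion above the target that lasts the whole interval does, and one fast dequeue resets the timer.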

Priority-based shedding is essential for production systems. Not all requests are equal — a payment confirmation matters more than a recommendation refresh. Services classify requests into priority tiers (critical, normal, best-effort) and shed lowest priority first. Only if overload persists do higher-priority tiers get affected. Load shedding works best when combined with client-side retry with backoff, ensuring shed requests are retried after a delay rather than immediately re-queuing and worsening congestion.
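
Tiered shedding can be sketched with one utilization ceiling per priority tier. The tier names follow the paragraph above; the ceilings (70%, 85%, 95%) and the `admit` function are hypothetical, and `utilization` could be any normalized load signal such as CPU or in-flight requests divided by the limit.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    NORMAL = 1
    BEST_EFFORT = 2


# Hypothetical ceilings: each tier is shed once load passes its ceiling.
SHED_ABOVE = {
    Priority.BEST_EFFORT: 0.70,
    Priority.NORMAL: 0.85,
    Priority.CRITICAL: 0.95,
}


def admit(priority: Priority, utilization: float) -> bool:
    """Admit the request unless load is above the ceiling for its tier."""
    return utilization <= SHED_ABOVE[priority]
```

At 75% utilization this sketch rejects a recommendation refresh tagged `BEST_EFFORT` while still admitting `NORMAL` and `CRITICAL` requests such as a payment confirmation.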

How Companies Actually Do This

Google: Uses CPU-based admission control in its frontend servers, progressively shedding lower-priority requests as CPU utilization crosses thresholds, an approach discussed in the Site Reliability Engineering book

[Figure: Comparing key aspects of Load Shedding]

Netflix: Implements priority-based load shedding in the Zuul API gateway, dropping non-critical API calls during traffic surges to preserve streaming playback functionality

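Amazon: Throttles requests to hot DynamoDB partitions once a partition's throughput limit is exceeded, so that hot keys are rejected early without affecting other keys on the same storage node; Amazon also describes shuffle sharding as a separate technique for isolating overload between customers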

Common Pitfalls

Shedding based on raw request count instead of actual resource utilization: a burst of lightweight requests may be fine while fewer heavy requests cause overload
Not differentiating request priority: shedding critical checkout requests alongside analytics pings sacrifices core business functionality that could have been preserved
Failing to coordinate shedding across replicas: if load balancers reroute shed requests to other instances that are already at capacity, the overload cascades across the entire fleet
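
One way to avoid the first pitfall is to admit requests against a budget of estimated cost rather than a request count. The sketch below assumes each endpoint has a rough cost estimate in abstract units; the class name, the unit scale, and the budget are hypothetical, and the class is not thread-safe as written.

```python
class CostBudgetShedder:
    """Admits requests against a budget of estimated cost units, not a count."""

    def __init__(self, budget_units: float):
        self.budget_units = budget_units
        self.used_units = 0.0

    def try_admit(self, cost_units: float) -> bool:
        """Reserve cost_units if they fit in the remaining budget."""
        if self.used_units + cost_units > self.budget_units:
            return False  # shed: this request would exceed the budget
        self.used_units += cost_units
        return True

    def finish(self, cost_units: float) -> None:
        """Return the reserved units when the request completes."""
        self.used_units = max(0.0, self.used_units - cost_units)
```

With a budget of 100 units, fifty lightweight requests costing 1 unit each are all admitted, while a single report query estimated at 60 units is shed until enough of the in-progress work finishes.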

[Figure: Data flow through Load Shedding]

Interview Questions Worth Practicing

How does load shedding differ from rate limiting and when would you use each?
How would you implement priority-based load shedding for a system with mixed critical and non-critical traffic?
What signals would you use to detect that a system needs to start shedding load?

The Tradeoffs

Availability vs. Throughput: Shedding preserves responsiveness for accepted requests but reduces overall throughput by intentionally rejecting work
Aggressiveness vs. Waste: Shedding too early wastes capacity that could serve requests; shedding too late allows latency degradation before protection kicks in
Simplicity vs. Fairness: Simple threshold-based shedding is easy but unfair; priority-aware shedding requires request classification infrastructure

[Figure: Key components of Load Shedding]

How to Explain This in an Interview

Here is how I would explain Load Shedding in a system design interview:

Load shedding is the practice of intentionally rejecting excess requests when a system is near capacity to maintain quality of service for accepted traffic. Unlike rate limiting, which enforces business policy, load shedding is a self-preservation mechanism triggered by actual resource saturation. I would implement it using in-flight request counting, with the limit derived from Little's Law: sustainable concurrency is roughly peak throughput times healthy latency, and once in-flight requests exceed that, the service starts rejecting. Priority classification is essential: shed analytics requests before payment requests. The system returns fast 503 responses without doing any downstream work, preserving resources. I would pair this with client-side exponential backoff so rejected requests retry gracefully.
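
The client-side exponential backoff mentioned in this answer might look like the sketch below. It randomizes each delay between zero and an exponentially growing ceiling (often called "full jitter"), which is a choice made here for illustration. The function names, base delay, and cap are hypothetical, and `send` and `sleep` are injected so the logic can be exercised without a network.

```python
import random


def backoff_delays(max_retries=5, base_s=0.1, cap_s=10.0, rng=random.random):
    """Yield one randomized delay per retry: rng() * min(cap, base * 2**n)."""
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng() * ceiling


def call_with_retry(send, sleep, max_retries=5):
    """send() returns an HTTP status; retry only on 429/503, with backoff."""
    status = send()
    for delay in backoff_delays(max_retries):
        if status not in (429, 503):
            break  # success or a non-retryable error
        sleep(delay)
        status = send()
    return status
```

Randomizing the delay spreads retries out, so clients that were shed at the same moment do not all return at the same moment and recreate the spike.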

[Figure: Interview tips for Load Shedding]

The Real-World Incident That Made This Famous

Load Shedding gained attention after high-profile production incidents at major tech companies. When systems handle millions of users, a misjudged overload policy can let one saturated component trigger cascading failures that cost revenue and erode user trust. Companies like Netflix, Google, Amazon, and Meta have invested heavily in Load Shedding for this reason.

The key lesson from these incidents: Load Shedding is not just a theoretical concept. It is a practical skill that separates engineers who build resilient systems from those who build fragile ones. Overload-related outage reports often trace back to a Load Shedding decision that was implemented incorrectly or overlooked during the initial architecture review.

[Figure: When to use Load Shedding]

How Senior Engineers Think About This

Senior engineers approach Load Shedding differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does Load Shedding solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.

When evaluating Load Shedding in a system design context, experienced engineers consider the failure modes first. What happens when this component goes down? How does the system degrade? Is the degradation graceful or catastrophic? These questions reveal more about your understanding than any textbook definition.

The key difference between junior and senior engineers when it comes to Load Shedding: juniors focus on the happy path, while seniors design for what happens when things go wrong. They consider operational cost, team expertise, monitoring requirements, and how the decision will look six months from now when traffic has grown 10x.

[Figure: Advantages and disadvantages of Load Shedding]

Common Interview Mistakes

Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect Load Shedding to real systems and real problems. Instead of reciting definitions, explain when and why you would use Load Shedding in the system you are designing.

Mistake 2: Not discussing trade-offs. Every design decision involving Load Shedding has trade-offs. Discuss what you gain and what you give up. Acknowledge the downsides and explain why the benefits outweigh them for your specific use case.

Mistake 3: Overcomplicating the solution. Start with the simplest approach to Load Shedding that meets the requirements, then add complexity only when justified. Many candidates jump to complex implementations when a simpler solution would work perfectly.

[Figure: Real-world examples of Load Shedding]

Production Checklist

Define clear metrics for measuring the effectiveness of your Load Shedding implementation
Set up monitoring and alerting that specifically tracks Load Shedding-related failures
Document your Load Shedding design decisions in Architecture Decision Records (ADRs)
Test failure scenarios related to Load Shedding in staging before production deployment
Review and update your Load Shedding implementation quarterly as system requirements evolve
Train new team members on the specific Load Shedding patterns used in your system
Establish runbooks for common Load Shedding-related incidents and recovery procedures

Practical Implementation for .NET Developers

In .NET, load shedding can be implemented using ASP.NET Core's built-in rate limiting middleware (the Microsoft.AspNetCore.RateLimiting namespace, available since .NET 7) with a concurrency limiter, which by default rejects requests with 503 when the in-flight count exceeds the configured limit. For priority-based shedding, custom middleware can inspect request headers and routes to assign a priority. The System.Threading.RateLimiting namespace provides the ConcurrencyLimiter and TokenBucketRateLimiter primitives. For queue-based shedding, System.Threading.Channels with BoundedChannelOptions.FullMode set to DropOldest or DropWrite discards items when the channel is full, whereas the default Wait mode applies backpressure to producers.

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

Log.Information("Processing {Operation} for {ResourceId}", operation, resourceId);

This gives you searchable, structured logs in Azure Monitor or Seq.