Circuit Breaker Pattern
Without circuit breakers, a failing downstream service can cascade failures throughout your system.
The Core Idea
The circuit breaker pattern prevents a service from repeatedly calling a failing dependency. Like an electrical circuit breaker, it 'trips' open after detecting failures, rejecting requests immediately instead of wasting time and resources on calls that will fail.
Step-by-Step Walkthrough
Service A calls Service B through a circuit breaker. Initially, the circuit is Closed — all requests pass through. If 5 out of 10 requests to Service B fail, the circuit Opens. All subsequent requests to B fail immediately (no network call). After 30 seconds, the circuit moves to Half-Open — the next request passes through. If it succeeds, the circuit Closes. If it fails, it Opens again.
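The walkthrough above can be sketched as a small state machine. This is a minimal, single-threaded illustration, not a production implementation: the class name CircuitBreaker, the CircuitOpenError exception, and the injectable clock parameter are all assumptions made for the example, and the thresholds simply mirror the numbers in the walkthrough (5 failures in the last 10 calls, 30 seconds open).

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: opens on 5 failures within the last 10 calls."""

    def __init__(self, failure_threshold=5, window_size=10,
                 open_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock              # injectable so tests can control time
        self.state = "closed"
        self.results = deque(maxlen=window_size)  # True means the call failed
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit is open; failing fast")
            self.state = "half_open"    # timeout elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        if self.state == "half_open":
            self.state = "closed"       # trial call succeeded: resume traffic
            self.results.clear()
        else:
            self.results.append(False)

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()                # trial call failed: open again
            return
        self.results.append(True)
        if sum(self.results) >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
        self.results.clear()
```

Service A would route every call to Service B through one shared instance, for example breaker.call(fetch_from_b, request). A real implementation would also need locking so that concurrent callers cannot all slip through during the half-open state.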
Why This Approach Wins
- Three states: Closed (normal — requests pass through), Open (tripped — requests fail immediately), Half-Open (testing — a few requests pass to check if the dependency recovered).
- Failure threshold: The circuit opens after N consecutive failures or a failure rate exceeding X% within a time window.
- Timeout: Open circuits automatically transition to Half-Open after a timeout (e.g., 30 seconds), allowing test requests.
- Fallback: When the circuit is open, return a fallback response (cached data, a default value, or a clear user-facing error message) instead of letting a raw failure reach the caller.
- Per-dependency: Each downstream service should have its own circuit breaker. A failing payment service should not affect the search service.
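The per-dependency point can be illustrated with a small registry that hands out one breaker per downstream service name. This is a sketch under stated assumptions: SimpleBreaker and BreakerRegistry are hypothetical names, and the breaker here is deliberately reduced to a consecutive-failure counter with no recovery logic so that the isolation between dependencies is easy to see.

```python
class SimpleBreaker:
    """Deliberately tiny breaker: opens after N consecutive failures."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.max_consecutive_failures

    def record(self, success):
        # A success resets the streak; a failure extends it.
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1


class BreakerRegistry:
    """Creates and caches one breaker per dependency name."""

    def __init__(self, factory=SimpleBreaker):
        self._factory = factory
        self._breakers = {}

    def for_dependency(self, name):
        if name not in self._breakers:
            self._breakers[name] = self._factory()
        return self._breakers[name]


registry = BreakerRegistry()
for _ in range(3):
    registry.for_dependency("payment").record(success=False)

# The payment breaker has tripped, but search is unaffected.
print(registry.for_dependency("payment").is_open)
print(registry.for_dependency("search").is_open)
```

Keying breakers by dependency name keeps a failing payment service from rejecting calls to a healthy search service.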
In Production
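Netflix Hystrix (now in maintenance mode) popularized circuit breakers in microservices, protecting against cascading failures across hundreds of services. The pattern itself was described earlier in Michael Nygard's book Release It!.
Resilience4j is a lightweight Java fault-tolerance library with a circuit breaker module, commonly used in Spring Boot applications.
Envoy proxy implements circuit breaking at the proxy layer of a service mesh: it caps connections, pending requests, and retries per upstream cluster, and its outlier detection can eject hosts that keep failing, so services are protected without code changes.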
Tradeoffs and Limitations
- Protection vs Availability: Opening the circuit protects the system but makes the dependency completely unavailable (even for requests that might succeed).
- Threshold sensitivity: Too sensitive = false trips on transient errors. Too lenient = slow to protect against real failures.
- Fallback quality: A good fallback (cached data) maintains user experience. A bad fallback (empty response) confuses users.
Production Gotchas
- Not implementing circuit breakers at all — cascading failures bring down the entire system
- Using a single circuit breaker for all dependencies — one failing service trips the breaker for healthy ones
- Not providing a useful fallback — the circuit opens and users see cryptic errors
The Interview Angle
- What is the circuit breaker pattern?
- What are the three states of a circuit breaker?
- How does a circuit breaker prevent cascading failures?
- What fallback strategies can you use when the circuit is open?
The Real-World Incident That Made This Famous
Netflix's creation of Hystrix in 2011 was born from a production nightmare. During a holiday traffic spike, Netflix's recommendation service experienced elevated latency (from 50ms to 5 seconds). Every microservice that called the recommendation service had threads waiting for responses. Those threads were tied up for seconds instead of milliseconds, and the thread pools quickly exhausted. Since the same servers handled other requests too, the latency cascaded: the browsing service slowed down, then the search service, then the homepage service. Within minutes, the entire Netflix platform was degraded — not because the recommendation service was down, but because it was slow.
This is the insidious nature of cascading failures: a slow dependency is often worse than a dead one. A dead service fails fast (connection refused, an immediate error). A slow service ties up resources (threads, connections, memory) while callers wait. It is like a traffic accident that does not block the road completely but reduces it to one lane: traffic backs up for miles.
Netflix's solution was Hystrix, a circuit breaker library. When the recommendation service error rate exceeded 50% over a 10-second window, Hystrix "opened" the circuit. (By default, Hystrix also requires a minimum of 20 requests in the window before it will trip, so a handful of failures at low traffic does not open the circuit.) All subsequent calls to the recommendation service were immediately rejected (fail fast) without even trying. Instead, a fallback was served: generic recommendations based on overall popularity instead of personalized ones. After a sleep window (30 seconds in this example; Hystrix's default is 5 seconds), Hystrix would allow one test request through ("half-open" state). If it succeeded, the circuit closed and normal traffic resumed. If it failed, the circuit stayed open for another sleep window.
Hystrix was successful enough that Netflix open-sourced it, and it became a de facto standard circuit breaker implementation for JVM services. Although Hystrix itself is now in maintenance mode (Netflix recommends Resilience4j for new projects), the pattern it popularized is built into most service meshes and many modern microservices frameworks.
How Senior Engineers Think About This
Think of a circuit breaker like an electrical circuit breaker in your house. When there is a power surge, the breaker trips to protect your appliances. Without it, the surge would fry everything. In software, the "surge" is a failing downstream dependency, and the "appliances" are the threads, connections, and resources in your service.
The three states are simple: Closed (normal operation, requests pass through), Open (dependency is failing, requests are immediately rejected with a fallback), and Half-Open (testing if the dependency has recovered by allowing a small number of requests through).
Senior engineers configure three key parameters. Failure threshold: what percentage of failures triggers the circuit to open (typically 50%). Window size: over what time period you measure failures (typically 10-30 seconds). Recovery timeout: how long the circuit stays open before testing with a half-open request (typically 30-60 seconds). Getting these right requires tuning in production — too sensitive and the circuit opens on normal latency spikes, too insensitive and cascading failures spread before the circuit trips.
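To make the failure threshold and window size concrete, here is one possible sketch of a time-based sliding window that decides when a circuit should open. The class name FailureRateWindow and its parameters are hypothetical. The min_calls guard is an added assumption that is not among the three parameters above: it stops a tiny sample (say, one failure out of one call) from tripping the circuit.

```python
import time
from collections import deque


class FailureRateWindow:
    """Tracks call outcomes over a rolling time window (illustrative only)."""

    def __init__(self, window_seconds=10.0, failure_rate_threshold=0.5,
                 min_calls=20, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.failure_rate_threshold = failure_rate_threshold
        self.min_calls = min_calls      # assumption: minimum sample size
        self.clock = clock              # injectable so tests can control time
        self._events = deque()          # (timestamp, failed) pairs, oldest first

    def record(self, failed):
        self._events.append((self.clock(), failed))
        self._evict()

    def _evict(self):
        # Drop outcomes that have aged out of the window.
        cutoff = self.clock() - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()

    def failure_rate(self):
        self._evict()
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_open(self):
        self._evict()
        enough_data = len(self._events) >= self.min_calls
        # This sketch trips at or above the threshold; some libraries use ">".
        return enough_data and self.failure_rate() >= self.failure_rate_threshold
```

A breaker would call record after every request and check should_open before deciding to trip. Tuning then means adjusting window_seconds, failure_rate_threshold, and min_calls against observed traffic.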
The most important design decision is the fallback strategy. When the circuit is open, what do you return? Options include: cached data (serve the last known good response), default values (show generic recommendations), a degraded experience (show the page without the failing component), or an error message. The best fallback is invisible to the user — they get a slightly less personalized experience but do not see an error page.
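The fallback options above can be layered: try the live call, fall back to the last known good response, and only then to a default. The sketch below assumes a plain dictionary as the cache and a hypothetical helper named fetch_with_fallback. It catches every exception for brevity, where production code would normally catch only the breaker's rejection and known dependency errors.

```python
def fetch_with_fallback(fetch, cache, key, default):
    """Return (value, source), where source is 'live', 'cached', or 'default'."""
    try:
        value = fetch(key)
    except Exception:
        # The dependency failed or the breaker rejected the call.
        if key in cache:
            return cache[key], "cached"     # last known good response
        return default, "default"           # generic, non-personalized value
    cache[key] = value                      # remember the good response
    return value, "live"


def recommendations_down(user_id):
    # Stand-in for a call rejected by an open circuit.
    raise RuntimeError("circuit open")


cache = {}
print(fetch_with_fallback(lambda user_id: ["personalized"], cache, "u1", ["popular"]))
print(fetch_with_fallback(recommendations_down, cache, "u1", ["popular"]))
print(fetch_with_fallback(recommendations_down, cache, "u2", ["popular"]))
```

Returning the source alongside the value lets the caller record how often users are being served degraded responses.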
Common Interview Mistakes
Mistake 1: Not explaining the three states. Always describe Closed, Open, and Half-Open. Many candidates just say "it stops calling the failing service" without explaining the recovery mechanism.
Mistake 2: Confusing circuit breaker with retry. Retries try again after a failure. Circuit breakers stop trying when failures are systemic. They complement each other: retry for transient failures, circuit breaker for sustained failures.
Mistake 3: Not discussing fallback strategies. Opening the circuit is only half the solution. The other half is what you serve instead. Always have a plan for degraded operation.
Mistake 4: Forgetting about cascading circuit breakers. If Service A calls Service B calls Service C, and C fails, both A and B need circuit breakers. Discuss how to propagate failure information up the call chain.
Mistake 5: Not mentioning bulkheads. Circuit breakers are often paired with the bulkhead pattern: isolating dependencies into separate thread pools so that a slow dependency cannot exhaust all threads. Mentioning bulkheads shows depth.
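As a rough illustration of the bulkhead idea from Mistake 5, the sketch below caps how many calls to one dependency may be in flight at once. Note the swap: the text describes isolated thread pools, while this example uses a semaphore-based variant because it is shorter to show. The names Bulkhead and BulkheadFullError are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the dependency already has its maximum concurrent calls."""


class Bulkhead:
    """Limits concurrent calls to a single dependency (semaphore variant)."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so callers never pile up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("too many in-flight calls")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

With one Bulkhead per dependency, a slow recommendation service can occupy at most its own slots, leaving capacity for calls to other services. The breaker and the bulkhead are complementary: the bulkhead limits damage while the dependency is slow, and the breaker stops calls once failures are sustained.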
Production Checklist
- Implement circuit breakers on every outbound call to external services or databases
- Configure failure thresholds based on observed error rates — start with 50% failures over a 10-second window
- Define meaningful fallback responses for every circuit: cached data, default values, or graceful degradation
- Monitor circuit breaker state changes and alert on open circuits — an open circuit means a dependency is unhealthy
- Pair circuit breakers with bulkheads (isolated thread pools) to prevent one slow dependency from consuming all resources
- Implement circuit breakers at the client library level so all callers benefit, not just one endpoint
- Use exponential backoff for the recovery timeout so you do not hammer a recovering service
- Log all circuit state transitions with the failure reason for post-incident analysis
- Test circuit breaker behavior with chaos engineering: inject latency or errors into a dependency and verify the circuit opens
- Set the half-open request count to 1-3 so recovery testing does not overwhelm a healing service
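The checklist item on exponential backoff for the recovery timeout could look something like the following. This is a sketch with assumed values: the class name BackoffRecoveryTimeout, the 30-second base, the doubling multiplier, and the 300-second cap are illustrative choices, not recommendations from the source.

```python
class BackoffRecoveryTimeout:
    """Computes how long the circuit stays open, growing on repeated trips."""

    def __init__(self, base_seconds=30.0, multiplier=2.0, max_seconds=300.0):
        self.base_seconds = base_seconds
        self.multiplier = multiplier
        self.max_seconds = max_seconds
        self.consecutive_trips = 0

    def on_trip(self):
        """Call each time the circuit opens; returns the next open duration."""
        timeout = min(self.base_seconds * self.multiplier ** self.consecutive_trips,
                      self.max_seconds)
        self.consecutive_trips += 1
        return timeout

    def on_close(self):
        """Call when a half-open trial succeeds; resets the backoff."""
        self.consecutive_trips = 0


policy = BackoffRecoveryTimeout()
print([policy.on_trip() for _ in range(5)])  # grows until capped at max_seconds
policy.on_close()
print(policy.on_trip())                      # back to the base timeout
```

Each failed half-open trial lengthens the next open period, so a dependency that is still struggling receives progressively fewer probe requests.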
Content from System-Design-Overview
Practical Implementation for .NET Developers
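In a .NET application, the pattern typically fits into the following setup:
ASP.NET Core setup: Create a service class that wraps each outbound call in a circuit breaker, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management; make sure the breaker itself has a singleton lifetime so its state is shared across requests. A library such as Polly provides a ready-made circuit breaker, so you rarely need to write the state machine by hand.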
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.