Failover
Without failover, any single component failure can bring down your entire system. Failover is how you achieve high availability in practice.
The Core Idea
Failover is the process of automatically switching to a redundant or standby system when the primary system fails. It ensures continuity of service by detecting failures and redirecting traffic or operations to a healthy backup.
Step-by-Step Walkthrough
A typical failover setup has two database servers: the primary handles all writes, and the secondary replicates its data in near real time. A health checker pings the primary every 5 seconds. If 3 consecutive checks fail (about 15 seconds), the system promotes the secondary to primary, updates the connection string, and routes traffic to the new primary. When the old primary recovers, it rejoins as the new secondary and catches up on the writes it missed.
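The detection and promotion logic in this walkthrough can be sketched in a few lines. This is a minimal illustration, not a production implementation: `FailoverMonitor`, the `check` probe, and the node names `db-1` and `db-2` are all hypothetical, and the 5-second scheduling is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverMonitor:
    """Promotes the secondary after `threshold` consecutive failed checks."""
    primary: str
    secondary: str
    check: Callable[[str], bool]      # hypothetical probe: True if the node is healthy
    threshold: int = 3                # 3 failed checks, as in the walkthrough
    failures: int = 0
    events: List[str] = field(default_factory=list)

    def tick(self) -> None:
        """Run one health check; a scheduler would call this every 5 seconds."""
        if self.check(self.primary):
            self.failures = 0         # any success resets the failure streak
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.promote()

    def promote(self) -> None:
        """Swap roles so the old primary can rejoin later as the secondary."""
        self.events.append(f"promote {self.secondary}")
        self.primary, self.secondary = self.secondary, self.primary
        self.failures = 0


def demo() -> FailoverMonitor:
    # Scripted probe results for db-1: one blip, a recovery, then a real outage.
    results = iter([False, True, False, False, False])

    def probe(node: str) -> bool:
        return next(results) if node == "db-1" else True

    monitor = FailoverMonitor("db-1", "db-2", probe)
    for _ in range(5):
        monitor.tick()
    return monitor
```

In `demo()`, the single failed check at the start does not trigger a promotion because the next check succeeds and resets the counter. Only the run of three failures at the end promotes `db-2`.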
Why This Approach Wins
- Active-passive failover: A standby server sits idle until the primary fails. Simple but wastes resources. Used for databases where only one node should accept writes.
- Active-active failover: Multiple servers handle traffic simultaneously. If one fails, others absorb the load. More efficient but requires load balancing and state synchronization.
- Failure detection: Health checks, heartbeats, or consensus protocols detect when a primary has failed. Detection must be fast without triggering false positives.
- DNS failover: Update DNS records to point to a healthy server. Simple but slow, because resolvers and clients keep using the cached record until its TTL expires, which can take minutes.
- Automatic vs. manual: Automated failover is faster but riskier (can trigger on false alarms). Manual failover is slower but safer for critical systems.
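To make the active-active case concrete, here is a small sketch of a pool that spreads requests across backends and skips any that are marked unhealthy, so the remaining nodes absorb the load. `ActiveActivePool` and the backend names are assumptions for illustration; a real load balancer would also run its own health checks and handle state synchronization.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class ActiveActivePool:
    """Round-robin over backends, skipping any marked unhealthy."""

    def __init__(self, backends: List[str]) -> None:
        self.health: Dict[str, bool] = {b: True for b in backends}
        self._ring: Iterator[str] = cycle(backends)
        self._size = len(backends)

    def mark(self, backend: str, healthy: bool) -> None:
        """Record the latest health-check result for a backend."""
        self.health[backend] = healthy

    def pick(self) -> str:
        """Return the next healthy backend, trying each one at most once."""
        for _ in range(self._size):
            candidate = next(self._ring)
            if self.health[candidate]:
                return candidate
        raise RuntimeError("no healthy backends")
```

With backends `a`, `b`, and `c`, marking `b` unhealthy makes later calls to `pick()` alternate between `a` and `c` until `b` is marked healthy again.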
In Production
AWS RDS provides automatic failover for multi-AZ deployments: if the primary database fails, RDS promotes the standby replica in a different availability zone, typically within 60-120 seconds.
Redis Sentinel monitors Redis master instances and automatically promotes a replica to master when the primary fails.
Cloudflare uses BGP Anycast failover — if an entire data center goes offline, BGP routing automatically redirects traffic to the next closest data center.
Tradeoffs and Limitations
- Speed vs. Safety: Faster failover detection means quicker recovery but higher risk of false positives.
- Active-passive vs. Active-active: Passive wastes resources; active-active requires complex state management.
- Data loss risk: If replication is asynchronous, some writes may be lost during failover.
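The data loss risk can be seen in a toy model of asynchronous replication. In this sketch the primary acknowledges each write before the replica has received it, so a failover discards whatever was still in flight. `AsyncReplicatedLog` and its methods are hypothetical names, and real replication works on a log stream rather than Python lists.

```python
from typing import List


class AsyncReplicatedLog:
    """Primary acknowledges writes immediately; the replica applies them later."""

    def __init__(self) -> None:
        self.primary: List[str] = []
        self.replica: List[str] = []

    def write(self, record: str) -> None:
        self.primary.append(record)          # acknowledged before replication

    def replicate(self, max_records: int) -> None:
        """Ship up to `max_records` pending writes to the replica."""
        start = len(self.replica)
        self.replica.extend(self.primary[start:start + max_records])

    def fail_over(self) -> List[str]:
        """Promote the replica and return acknowledged writes it never received."""
        lost = self.primary[len(self.replica):]
        self.primary = list(self.replica)    # the replica's state is the new truth
        return lost


def demo() -> List[str]:
    log = AsyncReplicatedLog()
    for i in range(1, 6):
        log.write(f"w{i}")
    log.replicate(3)                         # replica lags two writes behind
    return log.fail_over()
```

Here `demo()` returns the two writes that were acknowledged to clients but never reached the replica. Synchronous replication avoids this by delaying the acknowledgement until the replica confirms, at the cost of higher write latency.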
Production Gotchas
- Not testing failover regularly — an untested failover mechanism is not reliable
- Ignoring split-brain scenarios where both nodes think they are the primary
- Setting health check intervals too aggressively, causing flapping between primary and secondary
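One common guard against split-brain is a fencing token: each promotion hands the new primary a higher epoch number, and shared storage rejects writes tagged with an older epoch. The sketch below assumes a hypothetical `FencedStore` and that epochs are issued by some coordinator outside the snippet. It is one technique among several (quorum-based election is another), not a complete solution.

```python
from typing import Dict


class FencedStore:
    """Shared storage that rejects writes carrying a stale epoch."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.data: Dict[str, str] = {}

    def write(self, epoch: int, key: str, value: str) -> bool:
        """Apply the write only if the writer's epoch is current."""
        if epoch < self.highest_epoch:
            return False                  # stale primary: fenced off
        self.highest_epoch = epoch
        self.data[key] = value
        return True


def demo() -> FencedStore:
    store = FencedStore()
    store.write(1, "owner", "old-primary")    # epoch 1: original primary
    store.write(2, "owner", "new-primary")    # epoch 2: issued at promotion
    store.write(1, "owner", "old-primary")    # old primary wakes up; rejected
    return store
```

After `demo()` runs, the stored value is still the one written by the new primary, even though the old primary believed it was still in charge.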
The Interview Angle
- What is the difference between active-passive and active-active failover?
- How do you handle data consistency during failover?
- What is split-brain and how do you prevent it?
- How fast should failover detection be?
The Real-World Incident That Made This Famous
Understanding Failover became critical after multiple high-profile production incidents at major tech companies. When systems handle millions of users, even small misunderstandings about Failover can lead to cascading failures that cost millions in lost revenue and erode user trust. Companies like Netflix, Google, Amazon, and Meta have all invested heavily in mastering Failover because they learned the hard way that ignoring it leads to outages.
The key lesson from these incidents: Failover is not just a theoretical concept — it is a practical skill that separates engineers who build resilient systems from those who build fragile ones.
How Senior Engineers Think About This
Senior engineers approach Failover differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does Failover solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.
When evaluating Failover in a system design context, experienced engineers consider the failure modes first. What happens when this component goes down? How does the system degrade? Is the degradation graceful or catastrophic? These questions reveal more about your understanding than any textbook definition.
Common Interview Mistakes
Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect Failover to real systems and real problems.
Mistake 2: Not discussing trade-offs. Every design decision involving Failover has trade-offs. Discuss what you gain and what you give up.
Mistake 3: Overcomplicating the solution. Start with the simplest approach to Failover that meets the requirements, then add complexity only when justified.
Production Checklist
- Define clear metrics for measuring the effectiveness of your Failover implementation
- Set up monitoring and alerting that specifically tracks Failover-related failures
- Document your Failover design decisions in Architecture Decision Records (ADRs)
- Test failure scenarios related to Failover in staging before production deployment
- Review and update your Failover implementation quarterly as system requirements evolve
- Train new team members on the specific Failover patterns used in your system
Content from System-Design-Overview
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.