Disaster Recovery
The Problem Disaster Recovery Solves
Disasters happen: AWS us-east-1 has had multi-hour outages, entire data centers have lost power, and ransomware attacks have encrypted production databases. Without DR, these events cause extended downtime and permanent data loss.
How It Works Under the Hood
Disaster recovery (DR) is the set of policies, tools, and procedures to recover from catastrophic failures — data center outages, region failures, natural disasters, or large-scale cyberattacks. DR ensures business continuity by maintaining backup systems in geographically separate locations.
A typical DR setup: Primary region (US-East) handles all traffic. Data is continuously replicated to a secondary region (US-West). Monitoring detects a primary region failure. DNS or global load balancer switches traffic to the secondary region. Users experience a brief disruption but service continues.
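The detect-and-switch step of this setup can be sketched as a small controller. This is a minimal illustration, not a production design: the FailoverController class, the is_healthy probe, and the three-failure threshold are all hypothetical, and assigning to active stands in for whatever DNS or global load balancer update a real system would make.

```python
from typing import Callable


class FailoverController:
    """Keeps traffic on the primary until it fails several checks in a row."""

    def __init__(self, primary: str, secondary: str,
                 is_healthy: Callable[[str], bool],
                 failure_threshold: int = 3) -> None:
        self.primary = primary
        self.secondary = secondary
        self.is_healthy = is_healthy            # hypothetical health probe
        self.failure_threshold = failure_threshold
        self.active = primary                   # region currently receiving traffic
        self._failures = 0                      # consecutive failed primary checks

    def tick(self) -> str:
        """Run one monitoring cycle and return the region that should serve traffic."""
        if self.active != self.primary:
            # Already failed over. Failback is left as a manual step in this sketch.
            return self.active
        if self.is_healthy(self.primary):
            self._failures = 0
            return self.active
        self._failures += 1
        threshold_reached = self._failures >= self.failure_threshold
        if threshold_reached and self.is_healthy(self.secondary):
            # Stand-in for the DNS or global load balancer update.
            self.active = self.secondary
        return self.active


if __name__ == "__main__":
    status = {"us-east": True, "us-west": True}
    controller = FailoverController("us-east", "us-west", lambda r: status[r])
    status["us-east"] = False                   # simulate a primary region outage
    for _ in range(3):
        print(controller.tick())
```

Requiring several consecutive failures before switching is a common way to avoid failing over on a single dropped health check, at the cost of slower detection.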
For active-active: both regions handle traffic simultaneously. A region failure simply means remaining regions handle all traffic (at higher utilization).
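The "higher utilization" point can be made concrete with a little arithmetic. The sketch below assumes equally sized regions and load that redistributes evenly across the survivors; both function names are illustrative.

```python
def max_safe_utilization(regions: int, tolerated_failures: int = 1) -> float:
    """Highest per-region utilization that still leaves room to absorb failures."""
    if regions <= tolerated_failures:
        raise ValueError("need more regions than tolerated failures")
    return (regions - tolerated_failures) / regions


def utilization_after_failure(current: float, regions: int, failed: int = 1) -> float:
    """Per-region utilization once the failed regions' load moves to the survivors."""
    survivors = regions - failed
    if survivors <= 0:
        raise ValueError("no surviving regions")
    return current * regions / survivors


if __name__ == "__main__":
    print(max_safe_utilization(3))            # headroom needed with three regions
    print(utilization_after_failure(0.6, 3))  # three regions at 60%, one lost
    print(utilization_after_failure(0.6, 2))  # two regions at 60%, one lost
```

Under these assumptions, three regions can each run at up to about 67% and still survive losing one, while two regions running at 60% would be pushed past 100% by a single failure.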
The Mental Model
- RPO (Recovery Point Objective): Maximum acceptable data loss measured in time. RPO of 1 hour means you can lose at most 1 hour of data.
- RTO (Recovery Time Objective): Maximum acceptable downtime. RTO of 4 hours means the system must be back online within 4 hours.
- Active-passive DR: A standby environment in a different region receives replicated data. On disaster, traffic is switched to the standby.
- Active-active DR: Both regions serve traffic simultaneously. A disaster in one region simply reduces capacity.
- Standby tiers: Hot (fully provisioned and running, seconds to failover), Warm (running at reduced capacity and not serving traffic, minutes to failover), Cold (data only, hours to restore because infrastructure must be provisioned first).
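RPO and RTO are easiest to reason about against an incident timeline. The sketch below is one way to check a recovery against both objectives; the RecoveryReport type, the evaluate_recovery function, and the timestamps are made up for illustration, while the 1-hour RPO and 4-hour RTO come from the definitions above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryReport:
    data_loss: timedelta   # window of writes lost (compare against RPO)
    downtime: timedelta    # time the service was unavailable (compare against RTO)
    rpo_met: bool
    rto_met: bool


def evaluate_recovery(last_recovery_point: datetime, failed_at: datetime,
                      restored_at: datetime, rpo: timedelta,
                      rto: timedelta) -> RecoveryReport:
    """Compare an incident timeline against the RPO and RTO targets."""
    data_loss = failed_at - last_recovery_point
    downtime = restored_at - failed_at
    return RecoveryReport(data_loss, downtime, data_loss <= rpo, downtime <= rto)


if __name__ == "__main__":
    report = evaluate_recovery(
        last_recovery_point=datetime(2024, 1, 1, 9, 15),  # last replicated state
        failed_at=datetime(2024, 1, 1, 10, 0),
        restored_at=datetime(2024, 1, 1, 13, 30),
        rpo=timedelta(hours=1),
        rto=timedelta(hours=4),
    )
    print(report)
```

In this example the data loss window is 45 minutes and the downtime is 3.5 hours, so both objectives are met.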
Real Systems That Depend on This
Netflix runs active-active across 3 AWS regions. Traffic is continuously routed to all regions. A region failure is absorbed by the other two.
GitHub maintains a warm standby in a separate data center with continuously replicated data.
Google Spanner provides active-active multi-region replication with strong consistency using TrueTime.
Where This Shows Up in Interviews
- What is RPO and RTO?
- How do you design a system for disaster recovery?
- What is the difference between active-passive and active-active DR?
- How do you test disaster recovery?
Tradeoffs
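- Cost vs Recovery speed: Active-active recovers fastest but is expensive (roughly double the infrastructure for a two-region setup). Cold standby is cheap but slow to restore.
- Consistency vs Availability: Synchronous cross-region replication gives an RPO of zero for acknowledged writes, but every write waits on a cross-region round trip, and writes stall if the remote region is unreachable. Asynchronous replication keeps writes fast but can lose whatever has not yet been replicated.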
- Complexity: Multi-region architectures are significantly more complex to build, test, and operate.
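The consistency tradeoff can be modeled with two small functions. This is a deliberately simplified sketch: it ignores quorum protocols, batching, and retries, and the 5 ms local commit, 70 ms round trip, and 2 s replication lag are hypothetical numbers chosen only to show the shape of the tradeoff.

```python
def write_latency_ms(local_commit_ms: float, cross_region_rtt_ms: float,
                     synchronous: bool) -> float:
    """Client-visible write latency under a simple one-replica model."""
    # Synchronous writes wait for the remote region to acknowledge.
    return local_commit_ms + (cross_region_rtt_ms if synchronous else 0.0)


def worst_case_data_loss_s(replication_lag_s: float, synchronous: bool) -> float:
    """Upper bound on lost writes (in seconds of data) if the primary fails now."""
    # Asynchronous replication can lose everything still in flight.
    return 0.0 if synchronous else replication_lag_s


if __name__ == "__main__":
    for mode, sync in (("synchronous", True), ("asynchronous", False)):
        latency = write_latency_ms(5.0, 70.0, sync)
        loss = worst_case_data_loss_s(2.0, sync)
        print(f"{mode}: write latency {latency} ms, worst-case loss {loss} s")
```

With these numbers, synchronous replication pays 75 ms per write for zero data loss, while asynchronous replication keeps writes at 5 ms but risks up to 2 seconds of data.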
Watch Out For
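- Not testing DR procedures: untested runbooks and standby environments tend to fail exactly when you need them
- RPO/RTO not aligned with business requirements: over-engineering wastes money, under-engineering causes data loss and extended downtime
- Not accounting for DNS propagation time during failover: resolvers and clients keep using cached records until the TTL expires, so a long TTL adds directly to your effective RTO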
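A rough budget for DNS-based failover can be estimated by adding detection time, record update time, and the record TTL. The function below is a back-of-the-envelope sketch with hypothetical parameter names; it assumes resolvers honor the TTL, which some do not, so real-world times can be longer.

```python
def worst_case_failover_seconds(check_interval_s: float, failure_threshold: int,
                                dns_ttl_s: float, record_update_s: float = 0.0) -> float:
    """Estimate the time from primary failure until clients reach the secondary."""
    # Time for monitoring to see enough consecutive failed health checks.
    detection_s = check_interval_s * failure_threshold
    # After the record changes, cached answers can live for up to one full TTL.
    return detection_s + record_update_s + dns_ttl_s


if __name__ == "__main__":
    print(worst_case_failover_seconds(30, 3, dns_ttl_s=60))
    print(worst_case_failover_seconds(30, 3, dns_ttl_s=3600))
```

With 30-second checks and a three-failure threshold, a 60-second TTL gives an estimate of about 150 seconds, while a one-hour TTL pushes it past an hour, which is why failover records are usually given short TTLs.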
Go Deeper
- availability — start here if this is new to you
- fault-tolerance
- failover
- data-replication
The Real-World Incident That Made This Famous
Disaster recovery became a critical concern after multiple high-profile production incidents at major tech companies. When a system serves millions of users, a gap in the DR plan can turn a single failure into a cascading outage that costs millions in lost revenue and erodes user trust. Netflix, Google, Amazon, and Meta have all invested heavily in disaster recovery for this reason.
The key lesson from these incidents: disaster recovery is a practical skill, not just a theoretical concept, and it separates engineers who build resilient systems from those who build fragile ones.
How Senior Engineers Think About This
Senior engineers approach Disaster Recovery differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does Disaster Recovery solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.
When evaluating Disaster Recovery in a system design context, experienced engineers consider the failure modes first. What happens when this component goes down? How does the system degrade? Is the degradation graceful or catastrophic? These questions reveal more about your understanding than any textbook definition.
Common Interview Mistakes
Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect Disaster Recovery to real systems and real problems.
Mistake 2: Not discussing trade-offs. Every design decision involving Disaster Recovery has trade-offs. Discuss what you gain and what you give up.
Mistake 3: Overcomplicating the solution. Start with the simplest approach to Disaster Recovery that meets the requirements, then add complexity only when justified.
Production Checklist
- Define clear metrics for measuring the effectiveness of your Disaster Recovery implementation
- Set up monitoring and alerting that specifically tracks Disaster Recovery-related failures
- Document your Disaster Recovery design decisions in Architecture Decision Records (ADRs)
- Test failure scenarios related to Disaster Recovery in staging before production deployment
- Review and update your Disaster Recovery implementation quarterly as system requirements evolve
- Train new team members on the specific Disaster Recovery patterns used in your system
Content from System-Design-Overview
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.