SDMastery
Intermediate | 9 min read | Updated 2026-06-03

Data Replication

Nearly every production database uses replication. Without it, a single server failure can mean data loss and downtime.

Figure: High-level overview of Data Replication

When You Need Data Replication

You need replication whenever a single server failure must not cause data loss or downtime, which in practice covers almost every production database. In interviews, you must explain how replication affects consistency, availability, and performance.

What It Is

Data replication is the process of copying data from one database server (primary/master) to one or more other servers (replicas/slaves). It provides redundancy (survive server failures), performance (distribute reads across replicas), and geographic distribution (replicas near users).

Figure: System architecture for Data Replication

How It Works

In leader-follower replication:

  1. All writes go to the leader.
  2. The leader writes to its write-ahead log (WAL).
  3. The WAL is streamed to followers, which replay the changes.
  4. Reads can go to any follower.
  5. If the leader fails, a follower is promoted (failover).

Replication lag is the delay between a write on the leader and that write being visible on a follower. During lag, reading from a follower may return stale data.
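
The log-shipping flow and the stale read it allows can be sketched in a few lines of Python. This is a minimal in-memory model, not how any particular database implements its WAL: the `Leader`, `Follower`, and `lag_in_entries` names are hypothetical, and lag is counted in log entries rather than seconds.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    """Read replica that replays the leader's log in order."""
    data: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries replayed so far

    def replay(self, log, max_entries):
        """Apply up to max_entries pending entries; return how many were applied."""
        batch = log[self.applied:self.applied + max_entries]
        for key, value in batch:
            self.data[key] = value
        self.applied += len(batch)
        return len(batch)


@dataclass
class Leader:
    """Accepts all writes and records each one in an append-only log (a stand-in for the WAL)."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, key, value):
        self.log.append((key, value))  # log first, then apply locally
        self.data[key] = value


def lag_in_entries(leader, follower):
    """Replication lag measured in log entries the follower has not replayed yet."""
    return len(leader.log) - follower.applied


leader, follower = Leader(), Follower()
leader.write("user:1:email", "old@example.com")
follower.replay(leader.log, max_entries=10)      # follower is caught up
leader.write("user:1:email", "new@example.com")  # written, but not replayed yet

stale = follower.data["user:1:email"]            # read served during the lag window
lag_before = lag_in_entries(leader, follower)
follower.replay(leader.log, max_entries=10)      # follower catches up
fresh = follower.data["user:1:email"]
```

In this run `stale` holds the old address and `lag_before` is 1. Once the follower replays the pending entry, `fresh` holds the new address.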

The Decision Framework

  • Synchronous replication: The primary waits for replicas to confirm before committing. Guarantees consistency but adds latency.
  • Asynchronous replication: The primary commits immediately and sends changes to replicas in the background. Fast but replicas may lag (eventual consistency).
  • Semi-synchronous: The primary waits for at least one replica to confirm. Balances consistency and performance.
  • Leader-follower (primary-replica): One leader handles writes; followers handle reads. Most common pattern.
  • Multi-leader: Multiple nodes accept writes. Used for multi-region deployments. Requires conflict resolution.
  • Leaderless (quorum): Any node can accept reads/writes. Uses quorum (R + W > N) for consistency. Used by Dynamo, Cassandra.

Figure: How Data Replication works step by step
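
The quorum condition R + W > N from the leaderless option above can be checked by brute force. The sketch below is illustrative only: the function names are hypothetical, and each replica is modeled as a `(version, value)` pair, a simplification of the versioning schemes real systems use.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """True when every read set of size r must share a node with every write set of size w."""
    return r + w > n


def worst_case_overlap(n, r, w):
    """Brute-force the smallest intersection between any read set and any write set."""
    nodes = range(n)
    return min(
        len(set(read_set) & set(write_set))
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


def quorum_read(replicas, read_set):
    """Return the value carrying the highest version among the contacted replicas."""
    version, value = max((replicas[i] for i in read_set), key=lambda pair: pair[0])
    return value


# N = 3, W = 2: the latest write reached nodes 0 and 1, node 2 still has the old version.
replicas = [(2, "new"), (2, "new"), (1, "old")]
value_seen = quorum_read(replicas, read_set=(1, 2))  # R = 2 includes the stale node
```

With N = 3, R = 2, W = 2 every read set overlaps every write set in at least one node, so `value_seen` is the new value even though a stale replica was contacted. With R = 1, W = 1 the worst-case overlap is zero and a read can miss the latest write entirely.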

What the Industry Uses

PostgreSQL streams WAL records to replicas for leader-follower replication. Hot standby replicas serve read queries.

MySQL offers both asynchronous and semi-synchronous replication. Group Replication runs in single-primary mode by default and also supports a multi-primary (multi-leader) mode with conflict detection.

Amazon Aurora stores six copies of data across three Availability Zones. Because Aurora replicas read from that shared storage layer, replica lag is usually well under 100 ms.

Performance and Tradeoffs

Figure: Comparing key aspects of Data Replication

  • Sync vs Async: Sync guarantees consistency but adds write latency. Async is faster but allows stale reads.
  • Single-leader vs Multi-leader: Single-leader is simpler; multi-leader enables multi-region writes but requires conflict resolution.
  • Read replicas vs Sharding: Replicas scale reads; sharding scales both reads and writes. Use both for maximum scale.

Mistakes Engineers Make

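  1. Reading from a replica immediately after writing: the replica may not have applied the write yet, so the read returns stale data (a read-after-write consistency violation).
  2. Not monitoring replication lag: a lagging replica can serve very stale data without raising any error.
  3. Using synchronous replication across regions: every write waits for at least one cross-region round trip, which can add 100 ms or more.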

Practice These Interview Questions

  1. What are the types of replication strategies?
  2. What is replication lag and how does it affect the system?
  3. When would you use synchronous vs asynchronous replication?
  4. How does quorum-based replication work?

Figure: Data flow through Data Replication

Practical Implementation for .NET Developers

In a .NET application, you would typically implement this pattern using the following approach:

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Figure: Key components of Data Replication

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

```csharp
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
```

This gives you searchable, structured logs in Azure Monitor or Seq.

Further Reading

Figure: Interview tips for Data Replication

The Real-World Incident That Made This Famous

GitHub experienced a major incident on October 21, 2018. Routine network maintenance briefly cut connectivity between its US East Coast network hub and its primary US East Coast data center, which hosted the primary MySQL clusters. The partition lasted only 43 seconds, but the service stayed degraded for about 24 hours. During the partition, Orchestrator (GitHub's MySQL high-availability manager) promoted replicas in the US West Coast data center to primary. When connectivity returned, each site held writes that the other had never received: a split-brain scenario.

The recovery was painful. Because each site held writes the other lacked, GitHub could not simply fail back to the original primary. According to GitHub's post-incident analysis, the team restored from backups, resynchronized replicas, and then reconciled the writes that had never been replicated. Webhook delivery and GitHub Pages builds were paused, and the site ran in a degraded state for roughly 24 hours while the team worked through the data inconsistencies.

This incident shows why replication lag and failover are among the hardest problems in database operations. MySQL's default asynchronous replication means a replica can be seconds behind the primary (or minutes during peak load). If you promote a lagging replica to primary, the writes it had not yet received are lost or must be reconciled by hand. Semi-synchronous replication (where at least one replica must acknowledge before the primary commits) reduces this risk but adds latency to every write. GitHub's follow-up actions included reconfiguring Orchestrator so that it would not promote a primary across regional boundaries.

Figure: When to use Data Replication

How Senior Engineers Think About This

The mental model: replication copies data from one database server to others. The fundamental question is when the copy happens relative to the write being acknowledged to the client. Synchronous replication copies before acknowledging (strong consistency, higher latency). Asynchronous replication acknowledges immediately, copies later (lower latency, risk of data loss). Semi-synchronous is the middle ground: acknowledge after at least one replica confirms.
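
The timing difference between the three modes can be made concrete with a small latency model. This is a sketch under stated assumptions, not a measurement: replicas are contacted in parallel, `local_ms` is the primary's own commit cost, `replica_rtts_ms` holds hypothetical round-trip times to each replica, and the function names are invented for illustration.

```python
def commit_latency_ms(mode, local_ms, replica_rtts_ms):
    """Latency the client observes before its write is acknowledged."""
    if mode == "async":
        return local_ms                          # do not wait for any replica
    if mode == "semi_sync":
        return local_ms + min(replica_rtts_ms)   # wait for the fastest replica only
    if mode == "sync":
        return local_ms + max(replica_rtts_ms)   # wait for every replica
    raise ValueError(f"unknown mode: {mode}")


def copies_at_ack(mode, replica_count):
    """Minimum number of servers guaranteed to hold the write when the client is told 'committed'."""
    if mode == "async":
        return 1                  # only the primary
    if mode == "semi_sync":
        return 2                  # the primary plus at least one replica
    if mode == "sync":
        return 1 + replica_count  # the primary plus all replicas
    raise ValueError(f"unknown mode: {mode}")


# One replica in the same zone (2 ms away) and one in another region (80 ms away).
rtts = [2, 80]
latencies = {mode: commit_latency_ms(mode, local_ms=1, replica_rtts_ms=rtts)
             for mode in ("async", "semi_sync", "sync")}
```

With these assumed numbers, `latencies` is 1 ms for async, 3 ms for semi-synchronous, and 81 ms for synchronous. The slowest replica sets the price of full synchronous replication, which is why placing a synchronous replica in a distant region is costly.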

Senior engineers think about replication in terms of failure modes, not normal operation. During normal operation, all replication modes work fine. The question is: what happens when things go wrong? With async replication, a primary failure loses all un-replicated writes. How many seconds of data are you willing to lose? This is your Recovery Point Objective (RPO). With sync replication, a replica failure blocks writes on the primary. How long can your writes be blocked? This is your availability tolerance.
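
The RPO question can be turned into arithmetic at failover time. The sketch below assumes each server exposes a monotonically increasing log position (the names `choose_promotion_candidate`, `replica-a`, and `replica-b` are hypothetical) and that writes arrive at a roughly constant rate.

```python
def choose_promotion_candidate(leader_position, follower_positions):
    """Pick the most caught-up follower; return (name, writes lost by promoting it)."""
    name = max(follower_positions, key=follower_positions.get)
    return name, leader_position - follower_positions[name]


def seconds_of_data_lost(lost_writes, writes_per_second):
    """Convert a count of un-replicated writes into seconds of lost data."""
    return lost_writes / writes_per_second


def meets_rpo(lost_writes, writes_per_second, rpo_seconds):
    """True when promoting this follower stays within the Recovery Point Objective."""
    return seconds_of_data_lost(lost_writes, writes_per_second) <= rpo_seconds


# The failed primary had acknowledged 1000 writes; the followers had applied fewer.
positions = {"replica-a": 998, "replica-b": 940}
candidate, lost = choose_promotion_candidate(1000, positions)
within_rpo = meets_rpo(lost, writes_per_second=100, rpo_seconds=0.5)
```

Here `replica-a` is chosen and 2 acknowledged writes are lost, about 0.02 seconds of data at 100 writes per second, which fits a 0.5 second RPO. Promoting `replica-b` instead would lose 60 writes (0.6 seconds) and violate it.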

The three replication topologies to know: single-leader (one primary accepts writes and replicas are read-only; the most common setup), multi-leader (multiple primaries accept writes and changes are replicated bidirectionally; used for geo-distributed systems), and leaderless (all nodes accept writes and quorum determines consistency; used by Cassandra and the original Amazon Dynamo design).

Multi-leader replication is the most dangerous. When two leaders accept conflicting writes (User A changes their email on the US server while User B changes the same email on the EU server), you need a conflict resolution strategy. Options include: last-writer-wins (simple but loses data), merge (combine both writes if possible), or application-level resolution (let the user choose). There is no universally correct answer — it depends on your data semantics.
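
Two of the resolution strategies named above can be sketched as follows. The `Write` record, the region names, and the example values are hypothetical. The sketch also assumes timestamps from different leaders are comparable, which clock skew can break in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: object
    timestamp: float  # wall-clock time at the leader that accepted the write
    origin: str       # which leader accepted it


def last_writer_wins(a, b):
    """Keep the later write; ties are broken by origin name so every node picks the same winner."""
    return max(a, b, key=lambda w: (w.timestamp, w.origin))


def merge_sets(a, b):
    """Merge strategy for set-valued fields: keep the union of both writes."""
    return set(a.value) | set(b.value)


# Conflicting scalar writes: one of them has to lose.
us_email = Write("alice@us.example", timestamp=100.0, origin="us")
eu_email = Write("alice@eu.example", timestamp=100.2, origin="eu")
winner = last_writer_wins(us_email, eu_email)

# Conflicting set-valued writes: both can be kept.
us_roles = Write(frozenset({"admin"}), timestamp=100.0, origin="us")
eu_roles = Write(frozenset({"billing"}), timestamp=100.2, origin="eu")
merged = merge_sets(us_roles, eu_roles)
```

Last-writer-wins keeps the EU email and silently discards the US one, which is the data loss mentioned above. The union merge keeps both roles, but a plain union cannot express a removal, so it only suits fields where adding is the only operation.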

Figure: Advantages and disadvantages of Data Replication

Common Interview Mistakes

Mistake 1: Only knowing about single-leader replication. You should be able to discuss single-leader, multi-leader, and leaderless replication with tradeoffs for each.

Mistake 2: Not discussing replication lag consequences. Async replication means a user might write data and then immediately read from a replica that has not received the write yet. Discuss read-your-own-writes consistency and how to implement it.
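
One common way to get read-your-own-writes is to pin a user's reads to the primary for a short window after that user writes. The sketch below is one possible shape for that routing logic. `ReadRouter`, the 5 second window, and the in-memory dictionary are assumptions for illustration, and a real deployment would need shared state across application servers.

```python
import time


class ReadRouter:
    """Route a user's reads to the primary for a short window after that user writes."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock       # injectable so the behavior can be tested without sleeping
        self._last_write = {}    # user_id -> time of that user's most recent write

    def record_write(self, user_id):
        """Call this whenever the user performs a write."""
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        """Return 'primary' while the user is inside the pin window, otherwise 'replica'."""
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"


# Drive the router with a fake clock so the example is deterministic.
now = [0.0]
router = ReadRouter(pin_seconds=5.0, clock=lambda: now[0])
router.record_write("user-1")
right_after_write = router.target_for_read("user-1")
other_user = router.target_for_read("user-2")
now[0] = 6.0
after_window = router.target_for_read("user-1")
```

The window should be longer than the replication lag you expect. If lag can exceed it, the user can still see stale data, so the window length belongs next to your lag alerts.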

Mistake 3: Ignoring conflict resolution for multi-leader. If you propose multi-leader replication for geo-distribution, the interviewer will ask how you handle conflicts. Have an answer ready.

Mistake 4: Confusing replication with backup. Replication provides high availability (automatic failover), but it also copies mistakes: an accidental DELETE reaches every replica within moments. Backups provide disaster recovery (restore from a point in time). You need both.

Figure: Real-world examples of Data Replication

Mistake 5: Not knowing the CAP implications. Sync replication sacrifices availability during a partition (CP). Async replication sacrifices consistency during a partition (AP). Relate your replication choice to the CAP theorem.

Production Checklist

  • Choose your replication mode based on RPO requirements: synchronous for zero data loss, semi-synchronous for minimal loss, asynchronous for performance
  • Monitor replication lag continuously and alert if it exceeds your staleness tolerance
  • Implement read-your-own-writes consistency: route a user's reads to the primary for a few seconds after they write
  • Test failover procedures monthly: promote a replica to primary and verify the process completes correctly
  • For multi-leader replication: implement and test conflict resolution logic before you need it in production
  • Keep replicas in different availability zones or regions for disaster recovery
  • Monitor replica health independently: a replica that is connected but not applying writes is worse than one that is disconnected
  • Size replicas identically to the primary so they can handle the full write load during failover
  • Implement automated failover with manual override — automation handles the common case, humans handle the edge cases
  • Maintain a runbook for split-brain recovery: the steps to reconcile conflicting data when two nodes were both accepting writes
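
The lag and replica-health items in the checklist can be combined into one classification step. This is a sketch under stated assumptions: the `Sample` fields, the status labels, and the idea of comparing two consecutive samples are hypothetical, and real metrics would come from your database's own replication status views.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    connected: bool        # is the replication connection up?
    applied_position: int  # last log position the replica has applied
    lag_seconds: float     # reported delay behind the primary


def classify(previous, current, leader_advanced, max_lag_seconds):
    """Classify a replica from two consecutive samples."""
    if not current.connected:
        return "disconnected"
    if leader_advanced and current.applied_position == previous.applied_position:
        return "stalled"   # connected but not applying writes
    if current.lag_seconds > max_lag_seconds:
        return "lagging"   # applying writes, but too far behind
    return "healthy"


before = Sample(connected=True, applied_position=500, lag_seconds=0.2)
stuck = Sample(connected=True, applied_position=500, lag_seconds=0.2)
status = classify(before, stuck, leader_advanced=True, max_lag_seconds=5.0)
```

In this example `status` is `"stalled"`. The replica looks connected and its last reported lag is small, yet it has applied nothing while the primary moved on, which is the case the checklist calls worse than a clean disconnect.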

Content from System-Design-Overview
