Design a CDN
Design a Content Delivery Network with edge caching, cache invalidation, origin shield, and global traffic routing. Covers push vs pull CDN models.
Problem Statement
Design a Content Delivery Network (CDN) that caches and delivers static and dynamic content from edge servers distributed globally. The system must minimize latency by serving content from the nearest edge, handle cache invalidation efficiently, and protect origin servers from traffic spikes.
Requirements
Functional
- Cache static assets (images, JS, CSS, video) at edge locations closest to users
- Support cache invalidation: purge by URL, wildcard path, or cache tag
- Origin shield to consolidate cache misses and protect origin servers
- Support both push (pre-populate) and pull (lazy-fill) caching strategies
Non-Functional
- Latency: <50ms for cache hits, <200ms for cache misses (with origin shield)
- Cache hit ratio: Target 95%+ for static content
- Availability: 99.99% -- edge failures must be transparent to users
- Scale: 200+ global PoPs, 50 Tbps aggregate bandwidth
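The targets above can be sanity-checked with a few lines of arithmetic. The sketch below treats the 50 ms and 200 ms figures as upper bounds and assumes traffic is spread evenly across PoPs, which real deployments will not match; the function names are illustrative only.

```python
# Back-of-the-envelope numbers derived from the stated targets.
# Assumption: traffic is spread evenly across PoPs (real traffic is skewed).

HIT_RATIO = 0.95        # target cache hit ratio
HIT_LATENCY_MS = 50     # upper bound for a cache hit
MISS_LATENCY_MS = 200   # upper bound for a miss served via the origin shield
POPS = 200              # number of PoPs
AGGREGATE_TBPS = 50     # aggregate edge bandwidth


def expected_latency_ms(hit_ratio, hit_ms, miss_ms):
    """Blended latency bound across hits and misses."""
    return hit_ratio * hit_ms + (1 - hit_ratio) * miss_ms


def per_pop_gbps(aggregate_tbps, pops):
    """Average bandwidth each PoP must serve, in Gbps."""
    return aggregate_tbps * 1000 / pops


def miss_traffic_tbps(aggregate_tbps, hit_ratio):
    """Traffic that falls through the edge to the shield tier, in Tbps."""
    return aggregate_tbps * (1 - hit_ratio)


if __name__ == "__main__":
    print(round(expected_latency_ms(HIT_RATIO, HIT_LATENCY_MS, MISS_LATENCY_MS), 1))
    print(per_pop_gbps(AGGREGATE_TBPS, POPS))
    print(round(miss_traffic_tbps(AGGREGATE_TBPS, HIT_RATIO), 2))
```

Under these assumptions the blended latency bound is about 57.5 ms, each PoP averages 250 Gbps, and roughly 2.5 Tbps of miss traffic reaches the shield tier, which is why the hit ratio target matters as much as the latency targets.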
Core Architecture
- Edge Server (PoP) -- Each Point of Presence runs a reverse proxy (e.g., NGINX/Varnish) with SSD-backed cache. Handles TLS termination, Brotli/gzip compression, and HTTP/2. Content is keyed by URL plus the values of the request headers named in the origin's Vary header.
- Origin Shield -- A mid-tier cache layer between edge and origin. All edge cache misses in a region route through a single shield node, collapsing duplicate requests. Reduces origin load by 90% during cache invalidation storms.
- DNS-based Global Traffic Manager -- Resolves CDN domains to the nearest healthy PoP using GeoDNS and latency-based routing. Health checks run every 10s; unhealthy PoPs are removed from DNS within 30s.
- Purge Propagation Service -- Receives purge requests and fans them out to all 200+ PoPs via a pub/sub system (Kafka). Supports instant purge (delete from cache), soft purge (mark stale, serve while revalidating), and tag-based purge.
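To make the edge behaviour concrete, here is a minimal in-memory sketch of a cache keyed by URL plus Vary headers, with instant, soft, and tag-based purge. The `EdgeCache` class and its method names are hypothetical, a plain dict stands in for the SSD-backed store, and the sketch ignores size limits and concurrency.

```python
import time
from dataclasses import dataclass


@dataclass
class Entry:
    body: bytes
    expires_at: float
    tags: frozenset = frozenset()
    stale: bool = False  # set by a soft purge


class EdgeCache:
    """Hypothetical single-node edge cache; not a real proxy API."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (url, vary values) -> Entry
        self._vary = {}     # url -> header names listed in the origin's Vary

    def _key(self, url, request_headers):
        headers = {k.lower(): v for k, v in request_headers.items()}
        names = self._vary.get(url, ())
        return (url, tuple((n, headers.get(n, "")) for n in names))

    def put(self, url, request_headers, body, ttl, vary=(), tags=()):
        self._vary[url] = tuple(sorted(h.lower() for h in vary))
        key = self._key(url, request_headers)
        self._entries[key] = Entry(body, self._clock() + ttl, frozenset(tags))

    def get(self, url, request_headers):
        """Return (body, state); state is 'hit', 'stale', or 'miss'."""
        entry = self._entries.get(self._key(url, request_headers))
        if entry is None:
            return None, "miss"
        if entry.stale or entry.expires_at <= self._clock():
            return entry.body, "stale"  # caller may serve while revalidating
        return entry.body, "hit"

    def purge_url(self, url, soft=False):
        return self._purge(lambda key, entry: key[0] == url, soft)

    def purge_tag(self, tag, soft=False):
        return self._purge(lambda key, entry: tag in entry.tags, soft)

    def _purge(self, matches, soft):
        keys = [k for k, e in self._entries.items() if matches(k, e)]
        for k in keys:
            if soft:
                self._entries[k].stale = True  # keep the body, mark it stale
            else:
                del self._entries[k]           # instant purge
        return len(keys)
```

A soft purge keeps the body so the edge can keep answering while it revalidates, whereas an instant purge forces the next request to go to the shield. Each purge call returns the number of variants affected, which a propagation service could report back per PoP.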
Database Choice
- Edge cache storage: Local SSD + in-memory LRU for hot objects. No centralized database for content.
- Configuration and routing rules: Stored in etcd with replication to every PoP.
- Access logs and analytics: Streamed to Kafka, aggregated in ClickHouse for real-time dashboards (bandwidth, cache hit ratio, error rates by PoP).
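The "local SSD plus in-memory LRU" arrangement can be illustrated with a small two-tier store. This is a sketch only: `TieredStore` is a hypothetical name, a dict stands in for the SSD tier, and capacity is counted in objects rather than bytes.

```python
from collections import OrderedDict


class TieredStore:
    """Hypothetical two-tier store: a small LRU in memory over a larger SSD tier."""

    def __init__(self, memory_capacity):
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()  # hot objects, most recently used last
        self.disk = {}               # stand-in for the SSD tier

    def put(self, key, value):
        self.disk[key] = value       # the SSD tier holds every cached object
        self._promote(key, value)

    def get(self, key):
        """Return (value, tier); tier is 'memory', 'ssd', or 'miss'."""
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key], "memory"
        if key in self.disk:
            value = self.disk[key]
            self._promote(key, value)  # a disk hit makes the object hot again
            return value, "ssd"
        return None, "miss"

    def _promote(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_capacity:
            self.memory.popitem(last=False)  # evict the least recently used
```

Eviction from memory does not lose the object, since the SSD tier still holds it; only a purge or TTL expiry would remove it from both tiers.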
Key API Endpoints
POST /api/v1/purge
-> Body: { type: "url" | "wildcard" | "tag", value: "/images/*", soft: false }
GET /api/v1/analytics/{domain}?range=1h
-> Returns: { requests: 1.2M, cache_hit_ratio: 0.96, bandwidth_gb: 450 }
PUT /api/v1/origins/{domain}
-> Body: { origin: "origin.example.com", shield_region: "us-east", ttl_default: 86400 }
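As an illustration of how the purge body might be validated and applied, the sketch below parses the three purge types and decides whether a cached object matches. `PurgeRequest`, `parse_purge`, and `matches` are hypothetical helpers, and wildcard matching uses Python's `fnmatch` rules as a stand-in for whatever glob syntax the real API defines.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase

VALID_TYPES = {"url", "wildcard", "tag"}


@dataclass(frozen=True)
class PurgeRequest:
    type: str
    value: str
    soft: bool = False


def parse_purge(body):
    """Validate a decoded JSON body for POST /api/v1/purge."""
    kind = body.get("type")
    value = body.get("value")
    if kind not in VALID_TYPES:
        raise ValueError(f"unsupported purge type: {kind!r}")
    if not isinstance(value, str) or not value:
        raise ValueError("value must be a non-empty string")
    return PurgeRequest(kind, value, bool(body.get("soft", False)))


def matches(request, path, tags=()):
    """Decide whether a cached object (path plus cache tags) is covered."""
    if request.type == "url":
        return path == request.value
    if request.type == "wildcard":
        # Assumption: fnmatch semantics, where '*' also matches '/'.
        return fnmatchcase(path, request.value)
    return request.value in tags
```

For example, `parse_purge({"type": "wildcard", "value": "/images/*"})` yields a request that matches `/images/2024/logo.png` but not `/css/site.css`. Note that `fnmatch` also gives `?` and `[` special meaning, so a production API would need to define its own escaping rules.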
Scaling Insight
The origin shield pattern is the most critical scaling lever. Without it, a cache purge of a popular asset causes N simultaneous requests to the origin (one per PoP). With an origin shield per region, this collapses to 3-5 requests (one per shield). Combined with request coalescing (holding duplicate in-flight requests and serving all from one origin fetch), even viral content spikes generate minimal origin load.
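Request coalescing as described above can be sketched with `asyncio`: the first miss for a key starts the origin fetch, and every concurrent request for the same key awaits that one fetch. The `Coalescer` class and the simulated `fetch_from_origin` function are illustrative assumptions, not part of any particular proxy.

```python
import asyncio


class Coalescer:
    """Hypothetical single-flight helper: one origin fetch per key at a time."""

    def __init__(self, fetch):
        self._fetch = fetch   # async callable: key -> response body
        self._inflight = {}   # key -> asyncio.Task for the in-flight fetch

    async def get(self, key):
        task = self._inflight.get(key)
        if task is None:
            task = asyncio.create_task(self._fetch(key))
            self._inflight[key] = task
            # Forget the task once it finishes so later misses fetch again.
            task.add_done_callback(lambda _: self._inflight.pop(key, None))
        # shield() stops one cancelled waiter from cancelling the shared fetch.
        return await asyncio.shield(task)


async def demo(n_requests=100):
    calls = 0

    async def fetch_from_origin(key):
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # simulated origin round trip
        return f"body of {key}"

    coalescer = Coalescer(fetch_from_origin)
    bodies = await asyncio.gather(
        *(coalescer.get("/video.mp4") for _ in range(n_requests))
    )
    return calls, set(bodies)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this demo, 100 concurrent requests for the same key result in a single simulated origin fetch. If the fetch fails, every waiter sees the same exception, so a real shield would likely pair this with retries or a short negative-cache TTL.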
Key Tradeoffs
| Decision | Option A | Option B | Chosen |
|---|---|---|---|
| Caching model | Push (pre-populate) | Pull (lazy-fill on miss) | Pull default with push for known-hot assets -- simpler, wastes less edge storage |
| Invalidation | TTL-based expiry | Explicit purge | Both -- TTL as safety net, explicit purge for immediate updates |
| Routing | DNS-based (GeoDNS) | Anycast IP | DNS for flexibility -- allows weighted routing and A/B testing per PoP |
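The routing row can be illustrated with a simplified resolver that picks the nearest healthy PoP by great-circle distance. This is an assumption-laden sketch: real GeoDNS works from resolver IP geolocation and measured latency rather than exact coordinates, and the `PoP` fields, the `weight` drain flag, and the sample locations are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PoP:
    name: str
    lat: float
    lon: float
    healthy: bool = True
    weight: int = 1  # hypothetical knob: 0 drains a PoP without failing it


def distance_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in kilometres."""
    radius_km = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def resolve(client_lat, client_lon, pops):
    """Return the nearest PoP that is healthy and not drained."""
    candidates = [p for p in pops if p.healthy and p.weight > 0]
    if not candidates:
        raise LookupError("no healthy PoP available")
    return min(
        candidates,
        key=lambda p: distance_km(client_lat, client_lon, p.lat, p.lon),
    )
```

With PoPs in Frankfurt, Ashburn, and Singapore, a client in Paris resolves to Frankfurt; if Frankfurt fails its health check or is drained with `weight=0`, the same client falls back to Ashburn. This per-PoP control is the flexibility the table attributes to DNS-based routing.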
Practical Implementation for .NET Developers
In a .NET application, the control-plane services for this design (purge API, origin configuration, analytics) would typically be built along these lines:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, use managed services such as Azure Cache for Redis, Azure SQL, Azure Service Bus, and Azure Cosmos DB. These reduce operational overhead and integrate with Application Insights for monitoring.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.
System-Specific Clarifying Questions
Before designing a CDN, ask questions specific to this system:
- Who are the primary users? Understanding the user base shapes every technical decision; consumer apps have different requirements than enterprise B2B systems.
- What is the read-to-write ratio? This determines whether you optimize for fast reads (caching, denormalization) or fast writes (write-ahead logs, async processing).
- What is the geographic distribution? Users in one country vs. global users fundamentally changes your data replication and CDN strategy.
- What is the acceptable latency? Some features need sub-100ms responses, others can tolerate seconds. This determines your caching and architecture strategy.
- What is the consistency requirement? Some data (payments, inventory) needs strong consistency. Other data (social feeds, recommendations) can be eventually consistent.
Architecture Deep Dive
The architecture for a CDN should be built around the specific access patterns of the system. Do not apply generic templates; every system has unique hotspots, bottlenecks, and scaling challenges.
Write Path: How does data enter the system? Is it bursty (event-driven, flash sales) or steady (sensor data, logs)? Bursty writes need queuing and backpressure. Steady writes can go directly to the database.
Read Path: How is data consumed? Is it fan-out (one write, many reads like social feeds) or point lookups (one read for specific data like user profiles)? Fan-out reads benefit from pre-computation and caching. Point lookups benefit from efficient indexing.
Hot Spots: Where are the bottlenecks? For a CDN, identify the component that will fail first under load and design mitigation strategies: caching, sharding, rate limiting, or async processing.