Design Facebook
Design Facebook's social graph, news feed, friend suggestions, and notification system. Covers fan-out architecture, graph databases, and feed ranking.
Problem Statement
Design a social network like Facebook supporting a friend graph, news feed, posts (text/photo/video), comments, reactions, friend suggestions, and notifications. The system must handle 3B monthly users with a news feed that balances relevance ranking with freshness.
Requirements
Functional
- Manage social graph: send/accept/reject friend requests; mutual friendship model
- Post content (text, photos, videos) with privacy controls (public, friends-only, custom lists)
- Generate a personalized news feed ranked by relevance (engagement signals, recency, relationship closeness)
- Friend suggestions using mutual friends (friends-of-friends) and graph analysis
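The friends-of-friends requirement can be illustrated with a small in-memory sketch. It assumes the graph fits in a dictionary mapping each user to a set of friend IDs, which is only a stand-in for a real graph store; `suggest_friends` and `min_mutual` are hypothetical names, and the default of 4 mirrors the "more than 3 mutual friends" threshold used by the Friend Suggestion Engine described later.

```python
from collections import Counter


def suggest_friends(graph, user, min_mutual=4):
    """Return (candidate, mutual_count) pairs, most mutual friends first.

    graph: dict mapping user id -> set of friend ids (assumed in-memory).
    min_mutual: minimum number of shared friends for a suggestion.
    """
    friends = graph.get(user, set())
    mutual_counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user themselves and people who are already friends.
            if candidate != user and candidate not in friends:
                mutual_counts[candidate] += 1
    ranked = [(c, n) for c, n in mutual_counts.items() if n >= min_mutual]
    # Highest overlap first; ties broken by id so the order is deterministic.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked
```

A production job would run this as a distributed batch computation and blend in the other signals (interaction recency, workplace, school, contacts); the sketch covers only the mutual-friend count.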
Non-Functional
- Latency: News feed loads in <1 second, post creation confirmed in <2 seconds
- Scale: 3B MAU, 2B DAU, average 338 friends per user, 500M posts/day
- Consistency: Eventual consistency for feed (1-5s delay acceptable); strong consistency for friend graph mutations
- Storage: Petabytes of photo/video content
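As a back-of-envelope check on the scale numbers above, the following sketch converts the daily post volume into per-second rates. It assumes posts are spread evenly across the day and that every post is pushed to every friend (pure fan-out-on-write), so the result is an average-case estimate rather than a peak.

```python
SECONDS_PER_DAY = 24 * 60 * 60

posts_per_day = 500_000_000          # from the scale requirement
avg_friends = 338                    # from the scale requirement
daily_active_users = 2_000_000_000   # from the scale requirement

# Average write rate, assuming posts are evenly distributed over the day.
posts_per_second = posts_per_day / SECONDS_PER_DAY

# Timeline writes per second if every post is pushed to every friend.
feed_writes_per_second = posts_per_second * avg_friends

# Average number of posts each daily active user creates.
posts_per_active_user = posts_per_day / daily_active_users

print(f"{posts_per_second:,.0f} posts/s")
print(f"{feed_writes_per_second:,.0f} feed writes/s")
print(f"{posts_per_active_user:.2f} posts per DAU per day")
```

This works out to about 5,800 posts per second and close to 2 million timeline writes per second on average, which is why the fan-out strategy matters so much for this system.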
Core Architecture
- Social Graph Service -- Stores the friendship graph in a TAO-like system (graph database over MySQL shards). Each user is a node; friendships are bidirectional edges. Supports adjacency queries (list friends), membership queries (are A and B friends?), and traversal (friends-of-friends). Heavily cached -- friend lists change infrequently.
- News Feed Service -- Hybrid fan-out model. When an author with fewer than 5,000 friends or followers posts, the feed service pushes the post ID to each recipient's feed timeline (a Redis sorted set). For high-follower accounts, posts are instead pulled and merged at read time to avoid write amplification. Feed ranking applies an ML model that scores each candidate post on affinity (interaction history), content type weight, recency decay, and engagement velocity.
- Friend Suggestion Engine -- Runs a nightly batch job computing friends-of-friends with overlap counts. "People You May Know" = users who share >3 mutual friends, weighted by interaction recency. Also incorporates signals like shared workplace, school (from profile), and phone contacts.
- Notification Service -- Event-driven via Kafka. Actions (likes, comments, friend requests, mentions) produce events consumed by the notification service. Deduplicates and batches ("Alice and 5 others liked your post"). Delivers via push notification (APNs/FCM), in-app badge, and email digest.
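The deduplicate-and-batch step of the Notification Service could look roughly like the sketch below. It assumes events arrive as `(actor, action, target)` tuples already consumed from the event stream; the function name and the event shape are illustrative, not part of any real Kafka or Facebook API.

```python
from collections import defaultdict


def batch_notifications(events):
    """Collapse (actor, action, target) events into one message per (action, target)."""
    actors_by_key = defaultdict(list)
    for actor, action, target in events:
        actors = actors_by_key[(action, target)]
        if actor not in actors:  # deduplicate repeated events from the same actor
            actors.append(actor)

    messages = []
    for (action, target), actors in actors_by_key.items():
        first, others = actors[0], len(actors) - 1
        if others == 0:
            who = first
        elif others == 1:
            who = f"{first} and 1 other"
        else:
            who = f"{first} and {others} others"
        messages.append(f"{who} {action} your {target}")
    return messages
```

In practice the batching window (for example, a few minutes per recipient) and per-channel delivery rules would sit around this step; the sketch covers only the grouping logic.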
Database Choice
MySQL (sharded) for the social graph (TAO layer) -- partitioned by user_id for locality. TAO provides a graph API over relational storage with a massive memcache layer. Cassandra for durable feed timelines -- write-heavy, time-sorted, partitioned by user_id. Redis for the hot, precomputed feed cache (the sorted sets the feed service writes to) and online presence. S3 + CDN for media. Elasticsearch for people and post search.
Key API Endpoints
GET /api/v1/feed?cursor={score}&limit=20
-> Returns: { posts: [{ post_id, author, content, reactions_count, comments_preview }], next_cursor }
POST /api/v1/posts
-> Body: { content: "Hello world", media_ids: ["M1"], privacy: "friends" }
POST /api/v1/friends/{user_id}/request
-> Returns: { request_id: "FR-123", status: "PENDING" }
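The `cursor` and `next_cursor` fields of the feed endpoint suggest score-based cursor pagination. A minimal sketch of that idea follows, assuming the timeline is a list of `(score, post_id)` pairs already sorted by descending score and that scores are unique; with ties, a composite cursor such as `(score, post_id)` would be needed. `get_feed_page` is a hypothetical helper, not the actual endpoint handler.

```python
def get_feed_page(timeline, cursor=None, limit=20):
    """Return one page of post ids plus the cursor for the next page.

    timeline: list of (score, post_id), sorted by score descending.
    cursor: score of the last item on the previous page, or None for page one.
    """
    if cursor is None:
        remaining = timeline
    else:
        # Only items ranked strictly below the cursor belong to later pages.
        remaining = [item for item in timeline if item[0] < cursor]

    page = remaining[:limit]
    has_more = len(remaining) > limit
    next_cursor = page[-1][0] if has_more else None
    return {"posts": [post_id for _, post_id in page], "next_cursor": next_cursor}
```

Unlike offset pagination, the cursor stays valid when new posts are inserted at the top of the feed between requests, so the client does not see duplicates while scrolling.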
Scaling Insight
TAO (The Associations and Objects) is Facebook's key innovation here: a write-through, read-optimized cache over a sharded MySQL graph store. TAO is designed so that nearly every graph read (friend list, "is friend?" check) is served by a nearby cache server rather than the database. With a cache hit ratio in the high 90s (the published TAO paper reports roughly 96% for reads) and low-latency lookups on a hit, TAO makes the social graph feel like an in-memory data structure despite being backed by petabytes of MySQL data.
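The write-through idea can be shown with a toy sketch. This is not TAO's actual API: `WriteThroughGraphCache` is a hypothetical class, a plain dictionary stands in for the sharded MySQL store, and there is a single cache rather than tiers of regional cache servers.

```python
class WriteThroughGraphCache:
    """Toy write-through cache for friend lists (illustrative only)."""

    def __init__(self, store):
        self.store = store   # dict standing in for the database: user -> set of friends
        self.cache = {}      # user -> frozenset of friends
        self.hits = 0
        self.misses = 0

    def get_friends(self, user_id):
        if user_id in self.cache:
            self.hits += 1
            return self.cache[user_id]
        # Cache miss: read from the backing store and populate the cache.
        self.misses += 1
        friends = frozenset(self.store.get(user_id, set()))
        self.cache[user_id] = friends
        return friends

    def are_friends(self, a, b):
        return b in self.get_friends(a)

    def add_friendship(self, a, b):
        # Friendship is bidirectional, so both adjacency lists are updated.
        for src, dst in ((a, b), (b, a)):
            self.store.setdefault(src, set()).add(dst)
            # Write-through: refresh the cached entry as part of the write,
            # so subsequent reads never see a stale friend list.
            self.cache[src] = frozenset(self.store[src])
```

Because the cache is updated in the same operation as the store, reads immediately after a friendship change are served from the cache and still see the new edge.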
Key Tradeoffs
| Decision | Option A | Option B | Chosen |
|---|---|---|---|
| Graph storage | Native graph DB (Neo4j) | Sharded MySQL + cache (TAO) | TAO -- proven at Facebook scale, relational sharding is well-understood operationally |
| Feed ranking | Chronological | ML-ranked relevance | ML-ranked -- drives engagement, but chronological toggle offered to users |
| Fan-out | Pure push (fan-out-on-write) | Pure pull (fan-out-on-read) | Hybrid -- push for most users, pull for celebrities to avoid write amplification |
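The hybrid fan-out decision in the last row can be sketched as follows. The sketch assumes in-memory dictionaries in place of Redis and Cassandra, orders the feed by timestamp only (no ML ranking), and uses hypothetical names (`FeedService`, `push_threshold`); the default of 5,000 mirrors the threshold mentioned for the News Feed Service.

```python
class FeedService:
    """Toy hybrid fan-out: push for small audiences, pull for large ones."""

    def __init__(self, followers, push_threshold=5000):
        self.followers = followers          # author -> set of recipient ids
        self.push_threshold = push_threshold
        self.timelines = {}                 # recipient -> [(timestamp, post_id)]
        self.pull_posts = {}                # high-follower author -> [(timestamp, post_id)]

    def publish(self, author, post_id, timestamp):
        audience = self.followers.get(author, set())
        if len(audience) < self.push_threshold:
            # Fan-out-on-write: copy the post id into every recipient's timeline.
            for user in audience:
                self.timelines.setdefault(user, []).append((timestamp, post_id))
        else:
            # Fan-out-on-read: store once, merge when a follower loads the feed.
            self.pull_posts.setdefault(author, []).append((timestamp, post_id))

    def read_feed(self, user, following, limit=20):
        entries = list(self.timelines.get(user, []))
        for author in following:
            entries.extend(self.pull_posts.get(author, []))
        entries.sort(reverse=True)          # newest first
        return [post_id for _, post_id in entries[:limit]]
```

The write cost for a high-follower post is one insert instead of one per follower, at the price of extra merge work on every read by that account's followers.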
Practical Implementation for .NET Developers
In a .NET application, you would typically implement the services described above using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services such as Azure Cache for Redis, Azure SQL, Azure Service Bus, and Azure Cosmos DB. These reduce operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Fanned out post {PostId} to {RecipientCount} feed timelines", postId, recipientCount);
This gives you searchable, structured logs in Azure Monitor or Seq.
System-Specific Clarifying Questions
Before designing Facebook, ask questions specific to this system:
- Who are the primary users? Understanding the user base shapes every technical decision; consumer apps have different requirements than enterprise B2B systems.
- What is the read-to-write ratio? This determines whether you optimize for fast reads (caching, denormalization) or fast writes (write-ahead logs, async processing).
- What is the geographic distribution? Serving users in one country versus users worldwide fundamentally changes your data replication and CDN strategy.
- What is the acceptable latency? Some features need sub-100ms responses, others can tolerate seconds. This determines your caching and architecture strategy.
- What is the consistency requirement? Some data (payments, inventory) needs strong consistency. Other data (social feeds, recommendations) can be eventually consistent.
Architecture Deep Dive
The architecture for Facebook should be designed around the specific access patterns of the system. Do not apply generic templates; every system has its own hotspots, bottlenecks, and scaling challenges.
Write Path: How does data enter the system? Is it bursty (event-driven, flash sales) or steady (sensor data, logs)? Bursty writes need queuing and backpressure. Steady writes can go directly to the database.
Read Path: How is data consumed? Is it fan-out (one write, many reads like social feeds) or point lookups (one read for specific data like user profiles)? Fan-out reads benefit from pre-computation and caching. Point lookups benefit from efficient indexing.
Hot Spots: Where are the bottlenecks? For Facebook, identify the component that will fail first under load and design mitigation strategies: caching, sharding, rate limiting, or async processing.