Design Reddit
Design Reddit with subreddits, voting (hot/top/controversial algorithms), threaded comment trees, and karma.
Problem Statement
Design a community-driven content platform like Reddit with subreddits, post/comment voting, ranking algorithms (hot, top, new, controversial), nested comment threads, and a karma system. The system must handle viral posts with millions of votes while keeping ranking scores fresh.
Requirements
Functional
- Create subreddits with rules and moderators; subscribe/unsubscribe
- Submit posts (text, link, image) to subreddits; upvote/downvote posts and comments
- Rank posts by Hot (time-decayed score), Top (by time range), New, and Controversial
- Display comment threads as nested trees with collapsible replies
Non-Functional
- Latency: Front page and subreddit feed loads in <500ms
- Scale: 50M DAU, 100K new posts/day, 10M comments/day, 500M votes/day
- Consistency: Vote counts eventually consistent (1-2s delay acceptable); comment tree structure strongly consistent
- Availability: 99.95% -- read-heavy system (100:1 read-write ratio)
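To get a feel for the Scale figures, the daily volumes can be converted into average per-second rates. This is a back-of-the-envelope sketch that assumes traffic is spread evenly over the day; real traffic peaks well above the average, so treat these as lower bounds for capacity planning.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Daily volumes taken from the Scale requirement above.
daily_volume = {
    "posts": 100_000,
    "comments": 10_000_000,
    "votes": 500_000_000,
}


def average_per_second(per_day: int) -> float:
    """Average event rate, assuming an even spread over 24 hours."""
    return per_day / SECONDS_PER_DAY


rates = {name: average_per_second(n) for name, n in daily_volume.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:,.1f}/s on average")
```

Votes dominate the write load at several thousand per second on average, which is why the vote path gets its own pipeline in the architecture.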
Core Architecture
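- Ranking Engine -- Computes the Hot score using the formula from Reddit's open-source code: score = sign(ups - downs) * log10(max(|ups - downs|, 1)) + (post_timestamp - epoch) / 45000, with timestamps in seconds. The time term gives newer posts a boost, and the logarithm makes each tenfold increase in net votes worth as much as 45,000 seconds (12.5 hours) of recency, so heavily upvoted posts stay visible longer. Because the score depends only on vote counts and submission time, it can be recomputed asynchronously on vote events and cached in Redis sorted sets per subreddit.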
- Vote Processing Pipeline -- Votes are written to Kafka, deduplicated (one vote per user per item), and aggregated. A vote flipping from up to down is a delta of -2. Aggregated counts are flushed to the database every 5 seconds. Redis holds the real-time approximate count for display.
- Comment Tree Service -- Comments are stored with a materialized path (e.g., "c1/c2/c5"), which enables efficient subtree queries. Each comment also has a parent_id and a depth field. Tree rendering sorts sibling comments by score. Deep threads (>5 levels) show a "Continue this thread" link and load lazily.
- Content Moderation Pipeline -- Combines automated filters (spam detection, banned words, URL blacklist) with human moderator actions. AutoModerator rules are evaluated on post/comment creation. Reported content enters a moderation queue. ML models flag potential policy violations for human review.
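The Hot score used by the Ranking Engine can be sketched in a few lines. This version follows the form in Reddit's last open-source release, where the sign multiplies the logarithmic vote term and the age term is always added. The constant `REDDIT_EPOCH_SECONDS` is the reference epoch from that code; the function and parameter names are illustrative.

```python
from datetime import datetime, timedelta, timezone
from math import log10

# Reference epoch (Unix seconds) used in Reddit's open-source ranking code.
REDDIT_EPOCH_SECONDS = 1134028003


def hot_score(ups: int, downs: int, posted_at: datetime) -> float:
    """Time-boosted score: log of net votes plus age credit in 45,000 s units."""
    net = ups - downs
    order = log10(max(abs(net), 1))
    sign = (net > 0) - (net < 0)  # 1, 0, or -1
    seconds = posted_at.timestamp() - REDDIT_EPOCH_SECONDS
    return round(sign * order + seconds / 45000, 7)


# A post with 10x the net votes ties with one posted 12.5 hours later.
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
older_popular = hot_score(100, 0, t0)
newer_modest = hot_score(10, 0, t0 + timedelta(seconds=45000))
```

Since the score does not depend on the current time, it only needs recomputing when a post's vote counts change, which is what makes caching it in a sorted set practical.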
Database Choice
PostgreSQL for posts, comments, subreddits, and user profiles -- relational queries for comment trees (materialized path with LIKE prefix queries) and subreddit membership. Redis sorted sets for ranked feeds per subreddit (ZADD with Hot score, ZREVRANGE for feed). Cassandra for vote records -- write-heavy (500M/day), partitioned by item_id, simple lookup pattern. Elasticsearch for post and subreddit search.
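The materialized-path prefix query can be tried out locally. The sketch below uses SQLite from the Python standard library as a stand-in for PostgreSQL; the table layout, column names, and sample rows are assumptions for illustration. Note the `/` appended to the prefix: a bare `c1%` pattern would also match an unrelated comment such as `c10`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    " id TEXT PRIMARY KEY, parent_id TEXT, path TEXT NOT NULL,"
    " depth INTEGER NOT NULL, score INTEGER NOT NULL, body TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("c1", None, "c1", 0, 40, "top-level"),
        ("c2", "c1", "c1/c2", 1, 12, "reply to c1"),
        ("c5", "c2", "c1/c2/c5", 2, 3, "reply to c2"),
        ("c3", "c1", "c1/c3", 1, 25, "another reply to c1"),
        ("c10", None, "c10", 0, 7, "unrelated top-level"),
    ],
)


def subtree(conn: sqlite3.Connection, root_path: str) -> list:
    """Return the root comment and all of its descendants, ordered by path."""
    cursor = conn.execute(
        "SELECT id, path, depth FROM comments"
        " WHERE path = ? OR path LIKE ? ORDER BY path",
        (root_path, root_path + "/%"),
    )
    return cursor.fetchall()


print([row[0] for row in subtree(conn, "c1")])  # ['c1', 'c2', 'c5', 'c3']
```

Ordering by path returns each parent before its children, so one query yields the thread in depth-first order. The sketch assumes comment ids never contain `%`, `_`, or `/`.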
Key API Endpoints
GET /api/v1/r/{subreddit}/{sort}?cursor={score}&limit=25
-> Returns: { posts: [{ id, title, author, score, num_comments, created_at }], next_cursor }
POST /api/v1/vote
-> Body: { item_id: "P-123", item_type: "post", direction: "up" }
GET /api/v1/posts/{post_id}/comments?sort=top&depth=5
-> Returns: { comments: [{ id, body, score, replies: [...recursive...] }] }
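The recursive `replies` shape returned by the comments endpoint can be assembled from flat database rows. The sketch below is one possible approach, not a prescribed implementation: it groups rows by `parent_id`, sorts siblings by score, and stops at `max_depth` levels. The `more_replies` flag is a hypothetical field a client could use to render a "Continue this thread" link.

```python
from collections import defaultdict


def build_tree(rows: list, max_depth: int = 5) -> list:
    """Nest flat comment rows into a tree, siblings sorted by score descending."""
    children = defaultdict(list)
    for row in rows:
        children[row["parent_id"]].append(row)

    def render(parent_id, depth):
        nodes = []
        siblings = sorted(children[parent_id], key=lambda r: r["score"], reverse=True)
        for row in siblings:
            node = {"id": row["id"], "body": row["body"], "score": row["score"]}
            if depth + 1 < max_depth:
                node["replies"] = render(row["id"], depth + 1)
            else:
                # Depth limit reached: omit children, flag whether any exist.
                node["replies"] = []
                node["more_replies"] = bool(children[row["id"]])
            nodes.append(node)
        return nodes

    return render(None, 0)


sample_rows = [
    {"id": "c1", "parent_id": None, "score": 40, "body": "top-level"},
    {"id": "c2", "parent_id": "c1", "score": 12, "body": "reply"},
    {"id": "c3", "parent_id": "c1", "score": 25, "body": "better reply"},
    {"id": "c5", "parent_id": "c2", "score": 3, "body": "nested reply"},
]
tree = build_tree(sample_rows, max_depth=2)
```

With `max_depth=2`, the response contains `c1` and its two replies (`c3` before `c2`), and `c2` is flagged as having further replies to load lazily.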
Scaling Insight
Approximate vote counts are the key to handling Reddit's vote volume. Instead of incrementing a database counter for every vote, votes are buffered in Kafka and flushed in batches every 5 seconds. The displayed count is the last flushed value plus the not-yet-flushed delta held in Redis. Users see a near-instant response to their vote (optimistic UI update), while the backend processes votes asynchronously. Because each flush writes one net delta per item rather than one update per vote, database write pressure drops from 500M individual updates to roughly 10M batched updates per day.
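The buffering logic can be sketched with an in-memory stand-in. In the design above, per-user vote state would live in Cassandra and pending deltas in Redis, with a timer triggering the flush every 5 seconds; here both are plain dictionaries. The `VoteBuffer` class and the `"none"` direction (retracting a vote) are assumptions for illustration.

```python
from collections import defaultdict

DIRECTION_VALUE = {"up": 1, "down": -1, "none": 0}


class VoteBuffer:
    """Deduplicates votes per (user, item) and accumulates net deltas per item."""

    def __init__(self):
        self.current = {}                # (user_id, item_id) -> -1, 0, or 1
        self.pending = defaultdict(int)  # item_id -> net delta since last flush

    def apply(self, user_id: str, item_id: str, direction: str) -> int:
        """Record a vote and return the change it makes to the item's score."""
        new = DIRECTION_VALUE[direction]
        old = self.current.get((user_id, item_id), 0)
        delta = new - old  # a repeated vote gives 0; up -> down gives -2
        if delta:
            self.current[(user_id, item_id)] = new
            self.pending[item_id] += delta
        return delta

    def flush(self) -> dict:
        """Return one net delta per changed item and reset the buffer."""
        batch = {item: d for item, d in self.pending.items() if d}
        self.pending.clear()
        return batch
```

Each flushed entry becomes a single `score = score + delta` update, so a post receiving thousands of votes in one window still costs one database write.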
Key Tradeoffs
| Decision | Option A | Option B | Chosen |
|---|---|---|---|
| Comment storage | Adjacency list (parent_id) | Materialized path ("c1/c2/c5") | Materialized path -- faster subtree queries, single query loads a full thread |
| Vote counting | Exact real-time count | Approximate batched count | Approximate -- 5s delay invisible to users, 50x reduction in DB writes |
| Ranking | Precomputed scores in cache | Computed on read | Precomputed -- amortizes calculation cost, sub-ms feed reads from Redis sorted sets |
Practical Implementation for .NET Developers
In a .NET application, the components described above map onto the following pieces:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
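Azure integration: If deploying to Azure, leverage managed services: Azure Cache for Redis for the ranked feeds, Azure Database for PostgreSQL (or Azure SQL) for relational data, Azure Event Hubs (which exposes a Kafka-compatible endpoint) or Azure Service Bus for the vote stream, and Azure Cosmos DB (which offers an API for Apache Cassandra) for vote records. These reduce operational overhead and provide built-in monitoring through Application Insights.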
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing vote on {ItemId} by {UserId}", itemId, userId);
This gives you searchable, structured logs in Azure Monitor or Seq.
System-Specific Clarifying Questions
Before designing Reddit, ask questions specific to THIS system:
- Who are the primary users? Understanding the user base shapes every technical decision — consumer apps have different requirements than enterprise B2B systems.
- What is the read-to-write ratio? This determines whether you optimize for fast reads (caching, denormalization) or fast writes (write-ahead logs, async processing).
- What is the geographic distribution? Users in one country vs. global users fundamentally changes your data replication and CDN strategy.
- What is the acceptable latency? Some features need sub-100ms responses, others can tolerate seconds. This determines your caching and architecture strategy.
- What is the consistency requirement? Some data (payments, inventory) needs strong consistency. Other data (social feeds, recommendations) can be eventually consistent.
Architecture Deep Dive
The architecture for Reddit should be designed around the specific access patterns of the system. Do not apply generic templates — every system has unique hotspots, bottlenecks, and scaling challenges.
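Write Path: How does data enter the system? Is it bursty (event-driven, flash sales) or steady (sensor data, logs)? Bursty writes need queuing and backpressure. Steady writes can go directly to the database. Reddit's vote traffic is bursty around viral posts, which is why votes pass through Kafka before reaching the database.
Read Path: How is data consumed? Is it fan-out (one write, many reads like social feeds) or point lookups (one read for specific data like user profiles)? Fan-out reads benefit from pre-computation and caching. Point lookups benefit from efficient indexing. Reddit's feeds are fan-out reads at a 100:1 read-write ratio, so they are served from precomputed Redis sorted sets.
Hot Spots: Where are the bottlenecks? For Reddit, the component most likely to fail first under load is the vote path for a viral post, where millions of votes target a single item. The mitigations in this design are asynchronous processing through Kafka, batched counter updates, and cached feeds; sharding and rate limiting are further options.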