Design Notification Service
System design interview solution for a notification service. Includes requirements, API design, data model, architecture, and scaling strategy.
Problem Statement
Design a notification service: a system that delivers messages to users over channels such as push, email, SMS, and in-app. It should handle millions of users and stay reliable as traffic grows.
Step 1: Clarifying Questions
Before diving into the design, ask these clarifying questions:
- What is the expected scale (users, requests per second)?
- What are the most critical features to support?
- What are the latency requirements?
- Do we need to support real-time features?
- What consistency guarantees are needed?
Step 2: Functional Requirements
- Core feature set for Notification Service
- User-facing APIs and interactions
- Data storage and retrieval
- Search and discovery (if applicable)
- Notifications (if applicable)
Step 3: Non-Functional Requirements
- Scalability: Handle millions of concurrent users
- Availability: 99.99% uptime (four nines)
- Latency: Sub-200ms for read operations
- Consistency: Eventually consistent where acceptable, strongly consistent for critical paths
- Durability: No data loss
Step 4: Back-of-the-Envelope Estimation
| Metric | Estimate |
|---|---|
| Daily Active Users | 10M |
| Read:Write Ratio | 10:1 |
| Average Request Size | 1 KB |
| Storage per year | ~10 TB |
| Peak QPS | 100K |
Step 5: API Design
POST /api/v1/resource
GET /api/v1/resource/{id}
PUT /api/v1/resource/{id}
DELETE /api/v1/resource/{id}
Step 6: Data Model
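Define the core entities and their relationships, and let the access patterns decide between SQL and NoSQL. For this service the core entities are the notification log, the user preferences, and the device registry.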
Step 7: High-Level Architecture
The system consists of these major components:
- Client Layer — Web/mobile clients
- API Gateway — Rate limiting, authentication, routing
- Application Servers — Business logic
- Database Layer — Primary storage
- Cache Layer — Redis/Memcached for hot data
- Message Queue — Async processing
Step 8: Detailed Component Design
Write Path
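How data flows from client to persistent storage: the client calls the send API, the application server validates the request and enqueues it on the message queue, and workers then check preferences, apply rate limits, render the template, hand the message to the channel provider, and record the outcome in the notification log.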
Read Path
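How data is retrieved, including cache interactions: a request for a user's notifications checks the cache first, falls back to the notification log on a miss, and then populates the cache for later reads.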
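One common way to arrange the read path is cache-aside. The sketch below is illustrative only: `TTLCache` is a hypothetical in-memory stand-in for Redis/Memcached, and `load_from_db` stands for whatever query reads a user's rows from the notification log.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLCache:
    """In-memory stand-in for Redis/Memcached (hypothetical, for illustration)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, List[dict]]] = {}

    def get(self, key: str) -> Optional[List[dict]]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: List[dict]) -> None:
        self._items[key] = (self._clock() + self._ttl, value)


def read_notifications(
    user_id: str,
    cache: TTLCache,
    load_from_db: Callable[[str], List[dict]],
) -> List[dict]:
    """Cache-aside read: try the cache, fall back to storage, then fill the cache."""
    key = f"notifications:{user_id}"
    cached = cache.get(key)
    if cached is not None:  # an empty list is still a valid cached answer
        return cached
    rows = load_from_db(user_id)
    cache.set(key, rows)
    return rows
```

A short TTL keeps the unread list reasonably fresh without an explicit invalidation step. A real deployment would likely also delete the key when a new notification is written for that user.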
Step 9: Scaling Strategy
- Horizontal scaling of application servers behind a load balancer
- Database sharding by user ID or geographic region
- Read replicas for read-heavy workloads
- CDN for static content delivery
- Auto-scaling based on traffic patterns
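As a rough illustration of sharding by user ID, the sketch below maps a user to a shard with a stable hash. The function name and the modulo scheme are assumptions made for this example. Python's built-in `hash()` is avoided because it is randomized per process for strings.

```python
import hashlib


def shard_for_user(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index in [0, num_shards) using a stable hash."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo sharding remaps most users when the shard count changes. Consistent hashing or a lookup table of virtual shards is the usual way around that, at the cost of extra bookkeeping.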
Step 10: Reliability and Fault Tolerance
- Data replication across availability zones
- Circuit breakers for dependent services
- Graceful degradation under high load
- Health checks and automated failover
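To make the circuit breaker item concrete, here is a minimal sketch, not a production implementation. The class name, the thresholds, and the three-state model (closed, open, half-open) are assumptions chosen for the example. Libraries such as Polly in .NET offer hardened versions of the same idea.

```python
import time
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half_open"  # allow one trial call through
        return "open"

    def call(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        if self.state == "open":
            raise CircuitOpenError("dependency is failing; call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each external provider (APNs, FCM, the SMS gateway) in its own breaker keeps one failing provider from tying up workers that could be serving the other channels.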
Step 11: Monitoring and Observability
- Request latency (p50, p95, p99)
- Error rates by endpoint
- Database query performance
- Cache hit/miss ratios
- Queue depth and processing lag
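For readers unsure what p50, p95, and p99 mean in practice, the sketch below computes them with the nearest-rank method over a list of latency samples. Metrics backends typically use histograms or sketches rather than sorting raw samples, so treat this as a definition rather than a recommended implementation.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # 1 ms .. 100 ms, one sample each
summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With these samples `summary` holds 50, 95, and 99 ms. Tail percentiles matter here because a single slow provider call can dominate the p99 while leaving the median untouched.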
Key Tradeoffs
| Decision | Option A | Option B | Chosen |
|---|---|---|---|
| Database | SQL | NoSQL | Depends on access patterns |
| Consistency | Strong | Eventual | Eventual for most reads |
| Communication | Sync | Async | Async for non-critical paths |
How to Present This in an Interview
- Start with clarifying questions (2 min)
- Define requirements (3 min)
- Do estimation (2 min)
- Design API and data model (5 min)
- Draw high-level architecture (10 min)
- Deep dive into critical components (10 min)
- Discuss tradeoffs and bottlenecks (5 min)
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.
Deep-Dive: Clarifying Questions for Notification Service
- What notification channels? Push notifications (iOS APNs, Android FCM), email (SMTP), SMS (Twilio/SNS), in-app notifications, and webhooks. Each channel has different delivery guarantees and costs.
- What is the volume? A large platform sends 1-10 billion notifications per day. Most are push notifications; email and SMS are an order of magnitude less.
- Do we need priority levels? A security alert (password changed) is high priority and must be delivered immediately. A "someone liked your photo" notification can be batched and delayed.
- Do we need rate limiting per user? Users should not receive more than N notifications per hour to avoid notification fatigue and uninstalls.
- How do we handle user preferences? Users should be able to opt out of specific notification types per channel (e.g., receive likes via push but not email).
- Do we need notification templates? Localized templates with variable substitution for consistent messaging across channels.
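The per-user rate limit question above can be sketched as a sliding-window limiter. This version keeps timestamps in process memory purely for illustration. The class name and parameters are hypothetical, and a real deployment would more likely hold the counters in Redis so that all workers share them.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class PerUserRateLimiter:
    """Allow at most `max_per_window` notifications per user in any rolling window."""

    def __init__(
        self,
        max_per_window: int,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._clock = clock
        self._sent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        timestamps = self._sent[user_id]
        # Drop sends that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_per_window:
            return False
        timestamps.append(now)
        return True
```

Whether high-priority notifications such as security alerts should bypass this limit is a product decision worth raising with the interviewer.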
Specific Functional Requirements
- Multi-Channel Delivery: Send notifications via push (iOS/Android), email, SMS, in-app, and webhooks
- Priority Queuing: High-priority notifications (security alerts, OTPs) are delivered immediately; low-priority ones can be batched
- User Preferences: Per-user settings for which notification types they receive on which channels
- Rate Limiting: Limit notifications per user per time window to prevent notification fatigue
- Template Engine: Localized notification templates with variable substitution
- Delivery Tracking: Track sent, delivered, opened, and clicked status per notification
- Retry and Failover: Retry failed deliveries with exponential backoff; fall back to alternate channels
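The retry and failover requirement could look roughly like the sketch below. `send` is a hypothetical callable that raises `DeliveryError` when one attempt fails. The delay values, the attempt count, and the use of full jitter are assumptions, not figures from this design.

```python
import random
import time
from typing import Callable, Sequence


class DeliveryError(Exception):
    """Raised by a channel sender when one delivery attempt fails."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Upper bound on the wait after failed attempt `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap, base * 2 ** attempt)


def deliver_with_fallback(
    channels: Sequence[str],
    send: Callable[[str], None],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    jitter: Callable[[], float] = random.random,
) -> str:
    """Try each channel in order, retrying with backoff; return the channel that succeeded."""
    for channel in channels:
        for attempt in range(max_attempts):
            try:
                send(channel)
                return channel
            except DeliveryError:
                # Wait before the next retry; skip the wait after the final attempt.
                if attempt < max_attempts - 1:
                    sleep(backoff_delay(attempt) * jitter())
    raise DeliveryError(f"delivery failed on all channels: {list(channels)}")
```

Multiplying the delay by a random factor spreads retries out so that a provider recovering from an outage is not hit by every worker at the same instant.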
Specific API Endpoints
POST /api/v1/notifications/send
Body: {
"user_id": "user_123",
"type": "like_photo",
"priority": "low",
"data": { "liker_name": "Alice", "photo_id": "p456" },
"channels": ["push", "in_app"]
}
Response: { "notification_id": "n789", "status": "queued" }
POST /api/v1/notifications/send-bulk
Body: { "user_ids": ["user_1", "user_2", ...], "type": "new_feature", "data": {...} }
Response: { "batch_id": "batch_abc", "queued_count": 50000 }
GET /api/v1/users/{user_id}/notifications?unread=true&limit=20
Response: { "notifications": [...], "unread_count": 5 }
PUT /api/v1/users/{user_id}/notification-preferences
Body: { "like_photo": { "push": true, "email": false, "sms": false }, "security_alert": { "push": true, "email": true, "sms": true } }
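Given a preferences document shaped like the PUT body above, filtering the requested channels might look like this. The function name is hypothetical. Treating a missing entry as enabled is an assumption; an opt-in product would flip that default.

```python
from typing import Dict, List

Preferences = Dict[str, Dict[str, bool]]  # notification_type -> channel -> enabled


def allowed_channels(
    notification_type: str,
    requested: List[str],
    preferences: Preferences,
    default_enabled: bool = True,
) -> List[str]:
    """Keep only the requested channels the user has not switched off for this type."""
    type_prefs = preferences.get(notification_type, {})
    return [ch for ch in requested if type_prefs.get(ch, default_enabled)]


prefs: Preferences = {
    "like_photo": {"push": True, "email": False, "sms": False},
    "security_alert": {"push": True, "email": True, "sms": True},
}
like_channels = allowed_channels("like_photo", ["push", "email"], prefs)
```

Here `like_channels` contains only `"push"`, because email is disabled for `like_photo`.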
Specific Data Model
Notification Queue (Kafka): One topic per priority level (high, medium, low). The high-priority topic has more partitions and dedicated consumers, so urgent notifications are not stuck behind bulk traffic.
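A producer-side routing step for the priority queues could be sketched as follows. The topic names are hypothetical, and defaulting a missing priority to low is an assumption. No Kafka client is used here: the function only decides the topic and the message key that a producer would then send with.

```python
from typing import Dict, Tuple

# Hypothetical topic names, one per priority level.
TOPIC_BY_PRIORITY: Dict[str, str] = {
    "high": "notifications.high",
    "medium": "notifications.medium",
    "low": "notifications.low",
}


def route(notification: dict) -> Tuple[str, bytes]:
    """Return (topic, key) for a notification request."""
    priority = notification.get("priority", "low")  # assumed default
    try:
        topic = TOPIC_BY_PRIORITY[priority]
    except KeyError:
        raise ValueError(f"unknown priority: {priority!r}") from None
    # Keying by user ID sends one user's notifications to the same partition,
    # which preserves their relative order within that topic.
    key = str(notification["user_id"]).encode("utf-8")
    return topic, key
```

Ordering is only preserved within a topic, so a high-priority alert can overtake an earlier low-priority message for the same user. For this design that is the desired behavior.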
Notification Log (Cassandra)
| Column | Type | Notes |
|---|---|---|
| user_id | BIGINT | Partition key |
| notification_id | TIMEUUID | Clustering key |
| type | VARCHAR | like_photo, comment, follow, security_alert |
| channel | VARCHAR | push, email, sms, in_app |
| status | VARCHAR | queued, sent, delivered, opened, failed |
| data | TEXT | Template variables, serialized as JSON (Cassandra has no JSON column type) |
| created_at | TIMESTAMP | |
| delivered_at | TIMESTAMP | Nullable |
User Preferences (PostgreSQL)
| Column | Type | Notes |
|---|---|---|
| user_id | BIGINT | Primary key |
| preferences | JSONB | Map of notification_type -> channel -> enabled |
| quiet_hours_start | TIME | Do not disturb start |
| quiet_hours_end | TIME | Do not disturb end |
| timezone | VARCHAR | For quiet hours calculation |
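The quiet-hours columns imply a check like the one sketched below. The function name is hypothetical, and treating equal start and end times as "quiet hours disabled" is an assumption. The window may wrap past midnight (for example 22:00 to 07:00), which is the case that is easy to get wrong.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, start: time, end: time, tz: tzinfo) -> bool:
    """True if `now_utc` (timezone-aware) falls inside the user's local quiet window."""
    if now_utc.tzinfo is None:
        raise ValueError("now_utc must be timezone-aware")
    if start == end:
        return False  # assumption: an empty window means quiet hours are off
    local = now_utc.astimezone(tz).time()
    if start < end:
        return start <= local < end  # window within a single day
    return local >= start or local < end  # window wraps past midnight


# Example with a fixed UTC-5 offset; 04:30 UTC is 23:30 local time.
utc_minus_5 = timezone(timedelta(hours=-5))
late_night = datetime(2024, 1, 15, 4, 30, tzinfo=timezone.utc)
is_quiet = in_quiet_hours(late_night, time(22, 0), time(7, 0), utc_minus_5)
```

In practice the stored `timezone` string would be an IANA name resolved with `zoneinfo.ZoneInfo`, so that daylight saving changes are handled. The fixed offset above only keeps the example self-contained.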
Device Registry (Redis/PostgreSQL): Maps user_id to device tokens for push notifications. Users may have multiple devices.
Specific Back-of-the-Envelope Numbers
Traffic:
- 500M DAU generating ~10 notification-triggering events each = 5 billion notification requests/day
- After preference filtering and deduplication: ~2 billion actual deliveries/day
- Push notifications: ~1.5B/day (~70%), in-app: ~400M/day (~20%), email: ~150M/day (~7%), SMS: ~50M/day (~3%); the per-channel figures are rounded and sum to ~2.1B
- Average: ~23,000 notifications/second, peak: ~70,000/second
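The traffic figures above can be reproduced with a few lines of arithmetic. The peak factor of 3 is an assumption inferred from the ratio of the stated peak to the stated average. All other inputs are the numbers quoted in this section.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

dau = 500_000_000
events_per_user = 10
requests_per_day = dau * events_per_user  # 5 billion requests/day

deliveries_per_day = 2_000_000_000  # after preference filtering and deduplication
average_per_second = deliveries_per_day / SECONDS_PER_DAY  # about 23,000/second

peak_factor = 3  # assumed; inferred from 70,000 / 23,000
peak_per_second = average_per_second * peak_factor  # about 70,000/second

channel_volumes = {
    "push": 1_500_000_000,
    "in_app": 400_000_000,
    "email": 150_000_000,
    "sms": 50_000_000,
}
total_by_channel = sum(channel_volumes.values())  # 2.1 billion, slightly above the ~2B total
channel_share = {ch: v / total_by_channel for ch, v in channel_volumes.items()}
```

The per-channel volumes add up to 2.1 billion rather than exactly 2 billion, which is why the shares are computed against their own sum here.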
Processing:
- Each notification: check preferences (Redis lookup), apply rate limit (Redis counter), render template, route to channel
- Processing time per notification: ~5ms
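- Need ~350 worker instances at peak to keep queue latency under 1 second: 70,000 notifications/second * 5 ms each = 350 notifications in flight, assuming each worker handles one notification at a time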
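The worker count follows from Little's law (work in flight = arrival rate * service time). The sketch below reproduces it and adds a hypothetical `target_utilization` parameter, since running workers at 100% leaves no headroom for bursts. That parameter is an assumption, not part of the original estimate.

```python
import math


def workers_needed(
    peak_per_second: int,
    service_time_ms: int,
    target_utilization: float = 1.0,
) -> int:
    """Workers required when each worker processes one notification at a time."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy_workers = peak_per_second * service_time_ms / 1000  # Little's law
    # Rounding before ceil guards against floating-point noise such as 500.00000000000006.
    return math.ceil(round(busy_workers / target_utilization, 6))


at_full_load = workers_needed(70_000, 5)
with_headroom = workers_needed(70_000, 5, target_utilization=0.7)
```

At full utilization this gives the 350 workers quoted above. Targeting 70% utilization raises the figure to 500.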
Storage:
- Notification log: 2B/day * 200 bytes = 400 GB/day, retained for 90 days = 36 TB before replication
- User preferences: 500M users * 1 KB = 500 GB (fits in a single PostgreSQL instance with read replicas)
External provider limits:
- APNs (Apple): effectively unlimited but throttles per device token
- FCM (Google): 240 messages/minute per device; a multicast request accepts up to 500 device tokens
- Email (SES): 50,000 emails/day on standard, need dedicated IPs for higher volumes
- SMS: $0.0075 per message (US) — expensive at scale, use sparingly