Design an Analytics Platform
Design an analytics platform with event ingestion, real-time aggregation, dashboards, and ad-hoc queries. Covers Lambda architecture and columnar storage.
Problem Statement
Design an analytics platform (like Google Analytics or Amplitude) that ingests billions of events per day from web/mobile apps, processes them in real time for live dashboards, stores them for historical ad-hoc queries, and supports custom metrics, funnels, and cohort analysis. Must handle bursty ingestion and provide query results in seconds.
Requirements
Functional
- Ingest events (page_view, click, purchase, custom) with arbitrary properties via a lightweight SDK
- Real-time dashboards showing active users, event counts, and conversion rates updated every 5 seconds
- Ad-hoc queries: filter/group events by any property, time range, and user segments
- Funnel analysis: define multi-step funnels and compute conversion rates
Non-Functional
- Ingestion: 10B events/day, peak 500K events/second, <1 second end-to-end to dashboard
- Query latency: Real-time dashboards <2 seconds, ad-hoc queries <10 seconds for 90-day range
- Retention: Raw events stored for 2 years, aggregates forever
- Scale: 100K customer apps, each sending millions of events/day
Core Architecture
-
Event Ingestion Gateway -- Lightweight HTTP endpoint that accepts event batches from client SDKs. Validates schema, enriches with server timestamp and geo-IP, and publishes to Kafka. The gateway is stateless and horizontally scaled behind a load balancer. Client SDKs batch events locally (every 10 seconds or 20 events) to reduce HTTP overhead.
-
Stream Processing Layer (Flink) -- Consumes events from Kafka in real time. Computes pre-aggregated metrics: event counts, unique users (HyperLogLog), and active session counts per 5-second window per customer app. Outputs aggregates to Redis for live dashboards and to the raw event store for historical queries.
-
Columnar Event Store (ClickHouse) -- Stores raw events in a columnar format partitioned by date and customer_id. Columnar storage enables fast aggregation queries (scanning only relevant columns). Compression ratios of 10-20x reduce storage costs. Supports SQL-like queries with sub-second response for most analytical patterns.
- Query Engine -- Translates user-defined dashboards and ad-hoc queries into optimized SQL against ClickHouse. For funnel analysis, uses window functions to detect sequential event patterns per user. For real-time metrics, reads pre-aggregated data from Redis. Caches frequently-run dashboard queries with a 5-second TTL.
Database Choice
ClickHouse for raw event storage and ad-hoc queries -- columnar format ideal for analytical workloads, handles 10B rows/day insertion with excellent compression. Redis for real-time pre-aggregated metrics (counters, HyperLogLog sketches for unique users, sorted sets for top events). Kafka as the event bus between ingestion and processing. PostgreSQL for customer accounts, dashboard definitions, and funnel configurations.
Key API Endpoints
POST /api/v1/events/batch
-> Body: \{ app_id: "APP-1", events: [\{ name: "page_view", properties: \{ url: "/home" \}, user_id: "U1", timestamp: ... \}] \}
GET /api/v1/analytics/\{app_id\}/query
-> Body: \{ metric: "unique_users", filters: [\{ property: "country", op: "eq", value: "US" \}], group_by: "device_type", range: "7d" \}
-> Returns: \{ results: [\{ device_type: "mobile", unique_users: 142000 \}, ...], query_time_ms: 850 \}
POST /api/v1/analytics/\{app_id\}/funnel
-> Body: \{ steps: ["page_view", "add_to_cart", "purchase"], range: "30d" \}
-> Returns: \{ funnel: [\{ step: "page_view", users: 100000 \}, \{ step: "add_to_cart", users: 25000, conversion: 0.25 \}, \{ step: "purchase", users: 5000, conversion: 0.20 \}] \}
Scaling Insight
Pre-aggregation at the stream processing layer is the secret to real-time dashboards at scale. Instead of querying 10B raw events for a simple "active users in the last hour" metric, Flink maintains a rolling HyperLogLog sketch per customer app per 5-second window. The dashboard reads these pre-computed sketches from Redis, merging windows as needed. This turns a billion-row scan into a sub-millisecond Redis read, making real-time dashboards feasible regardless of event volume.
Key Tradeoffs
| Decision | Option A | Option B | Chosen |
|---|---|---|---|
| Event store | Row-oriented (PostgreSQL) | Columnar (ClickHouse) | Columnar -- 10-100x faster for analytical aggregation queries, 10x better compression |
| Real-time metrics | Query raw events on every dashboard load | Pre-aggregate in stream processor | Pre-aggregate -- O(1) dashboard reads instead of scanning billions of rows |
| Ingestion | Synchronous write to database | Kafka buffer + async write | Kafka buffer -- absorbs traffic spikes, decouples ingestion from storage speed |
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.
System-Specific Clarifying Questions
Before designing Analytics Platform, ask questions specific to THIS system:
- Who are the primary users? Understanding the user base shapes every technical decision — consumer apps have different requirements than enterprise B2B systems.
- What is the read-to-write ratio? This determines whether you optimize for fast reads (caching, denormalization) or fast writes (write-ahead logs, async processing).
- What is the geographic distribution? Users in one country vs. global users fundamentally changes your data replication and CDN strategy.
- What is the acceptable latency? Some features need sub-100ms responses, others can tolerate seconds. This determines your caching and architecture strategy.
- What is the consistency requirement? Some data (payments, inventory) needs strong consistency. Other data (social feeds, recommendations) can be eventually consistent.
Architecture Deep Dive
The architecture for Analytics Platform should be designed around the specific access patterns of the system. Do not apply generic templates — every system has unique hotspots, bottlenecks, and scaling challenges.
Write Path: How does data enter the system? Is it bursty (event-driven, flash sales) or steady (sensor data, logs)? Bursty writes need queuing and backpressure. Steady writes can go directly to the database.
Read Path: How is data consumed? Is it fan-out (one write, many reads like social feeds) or point lookups (one read for specific data like user profiles)? Fan-out reads benefit from pre-computation and caching. Point lookups benefit from efficient indexing.
Hot Spots: Where are the bottlenecks? For Analytics Platform, identify the component that will fail first under load and design mitigation strategies: caching, sharding, rate limiting, or async processing.