Monitoring
Learn monitoring for distributed systems: build dashboards, set SLOs, configure alerts, and establish processes to detect, diagnose, and respond to incidents.
Monitoring is the practice of collecting, visualizing, and alerting on system health metrics to detect problems before they impact users. Dashboards display real-time and historical data such as request rates, error rates, latency percentiles, and resource utilization. Service Level Objectives (SLOs) define reliability targets, and alerts fire when those targets are at risk. In distributed systems, effective monitoring requires tracking health across dozens of services, their dependencies, and the infrastructure they run on, using a structured approach such as the RED or USE method.
| Aspect | Details |
|---|---|
| What it is | The practice of collecting system health data, visualizing it on dashboards, setting SLOs, and alerting when reliability targets are at risk |
| When to use | Always in production systems — monitoring is the minimum viable operational practice for any service that has users |
| When NOT to use | As your only tool when you need to understand why something failed: monitoring tells you what is wrong, but you need observability data (logs, traces) to find out why |
| Real-world example | Google codified monitoring best practices in the SRE handbook, defining SLIs, SLOs, and error budgets as the foundation for running reliable services |
| Interview tip | Define SLI vs. SLO vs. SLA precisely and explain error budgets — shows you understand Google SRE principles beyond just building dashboards |
| Common mistake | Alert fatigue from too many low-priority alerts — teams start ignoring alerts, and real incidents get missed in the noise |
| Key tradeoff | Signal vs. noise — too few monitors miss real problems, too many create alert fatigue; SLO-based alerting balances this |
Why This Matters
Monitoring is the operational foundation of running production systems. Without it, you discover problems when users complain, which may be hours after the issue started. With monitoring, dashboards show real-time system health, and alerts notify on-call engineers within minutes of degradation. The modern approach centers on Service Level Objectives (SLOs): defining, for example, that 99.9% of requests should complete in under 300ms, tracking the error budget (the allowed 0.1% unreliability), and alerting when the burn rate suggests the budget will be exhausted before the SLO window ends. This approach focuses engineering attention on what matters, user-perceived reliability, rather than arbitrary thresholds on internal metrics. Grafana, Datadog, and New Relic are among the most widely used monitoring platforms.
The Building Blocks
- Service Level Indicators: Measurable metrics that reflect user experience — request latency, error rate, availability — the quantitative inputs to SLO calculations
- Service Level Objectives: Reliability targets expressed as percentages over time windows — 99.9% of requests succeed within 300ms over a rolling 30-day period
- Dashboards: Visual displays of real-time and historical metrics using tools like Grafana, organized by service, team, or user journey for quick health assessment
- Health Checks: Endpoints that report whether a service and its dependencies are functioning correctly, used by load balancers, orchestrators, and monitoring systems
- Error Budgets: The allowed unreliability (100% minus SLO) that teams can spend on feature development versus reliability work, balancing velocity and stability
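The relationship between the SLI, the SLO, and the error budget can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `Request` record, the function names, and the choice to treat "2xx within 300ms" as a good request are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical request record; field names are assumptions for this sketch."""
    status: int        # HTTP status code
    latency_ms: float  # observed latency in milliseconds


def is_good(req: Request, latency_threshold_ms: float = 300.0) -> bool:
    """A request is 'good' if it returned 2xx within the latency threshold."""
    return 200 <= req.status < 300 and req.latency_ms <= latency_threshold_ms


def sli(requests: list[Request]) -> float:
    """SLI: fraction of good requests (defined as 1.0 when there is no traffic)."""
    if not requests:
        return 1.0
    return sum(is_good(r) for r in requests) / len(requests)


def error_budget_remaining(requests: list[Request], slo: float = 0.999) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is violated."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if not requests:
        return 1.0
    allowed_bad = (1.0 - slo) * len(requests)  # error budget expressed in requests
    actual_bad = sum(not is_good(r) for r in requests)
    return 1.0 - actual_bad / allowed_bad
```

With 10,000 requests and a 99.9% SLO, the budget is about 10 bad requests, so 5 bad requests leave roughly half of the budget unspent.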
Under the Hood
Modern monitoring follows a layered approach. At the bottom, infrastructure monitoring tracks host-level metrics (CPU, memory, disk, network) using agents like node_exporter or the CloudWatch agent. Above that, application monitoring tracks service-level metrics using the RED method (Request rate, Error rate, Duration) for request-driven services and the USE method (Utilization, Saturation, Errors) for resources like queues and connection pools.
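As a rough illustration of the RED method, the sketch below derives rate, error ratio, and duration percentiles from raw request samples for one time window. The function name, the input shape, and the decision to count only 5xx responses as errors are assumptions made for the example; real systems usually compute these from counters and histograms in the metrics backend.

```python
import math


def red_metrics(samples: list[tuple[int, float]], window_seconds: float) -> dict[str, float]:
    """Compute RED metrics from (status_code, duration_ms) samples in one window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")

    total = len(samples)
    errors = sum(1 for status, _ in samples if status >= 500)  # assumption: 5xx = error
    durations = sorted(duration for _, duration in samples)

    def percentile(p: int) -> float:
        """Nearest-rank percentile; returns 0.0 when there are no samples."""
        if not durations:
            return 0.0
        rank = max(1, math.ceil(p * len(durations) / 100))
        return durations[rank - 1]

    return {
        "rate_per_s": total / window_seconds,                # Rate
        "error_ratio": errors / total if total else 0.0,     # Errors
        "p50_ms": percentile(50),                            # Duration (median)
        "p99_ms": percentile(99),                            # Duration (tail)
    }
```

Reporting percentiles rather than the mean matters here because a small number of slow requests can hide behind a healthy-looking average.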
SLO-based monitoring adds a user-centric layer. An SLI might be the proportion of HTTP requests returning 2xx status codes within 300ms. The SLO states this should be 99.9% over 30 days. The error budget is 0.1% — roughly 43 minutes of total downtime or a proportional number of failed requests per month. Multi-window, multi-burn-rate alerting detects when the error budget is being consumed too quickly: a fast burn alert (e.g., 14x burn rate over 1 hour) catches sudden outages, while a slow burn alert (e.g., 2x burn rate over 3 days) catches gradual degradation.
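The arithmetic above can be made concrete with a small Python sketch. The burn-rate thresholds (14x over 1 hour, 2x over 3 days) come from the examples in the text; the short confirmation windows (5 minutes and 6 hours), the `RULES` table, and the function names are assumptions added for illustration. Pairing each long window with a shorter one is a common way to stop an alert from firing after the problem has already recovered.

```python
# Hypothetical rule table: (name, burn-rate threshold, long window, short window).
# Windows are in minutes. The short windows are assumptions for this sketch.
RULES = [
    ("fast_burn", 14.0, 60, 5),
    ("slow_burn", 2.0, 3 * 24 * 60, 6 * 60),
]


def allowed_downtime_minutes(slo: float, period_days: int = 30) -> float:
    """Error budget expressed as minutes of full downtime over the SLO period."""
    return (1.0 - slo) * period_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def firing_alerts(error_ratio_by_window: dict[int, float], slo: float = 0.999) -> list[str]:
    """Return the rules whose long AND short windows both exceed the threshold.

    error_ratio_by_window maps a window length in minutes to the observed
    fraction of bad requests over that window.
    """
    firing = []
    for name, threshold, long_window, short_window in RULES:
        long_burn = burn_rate(error_ratio_by_window[long_window], slo)
        short_burn = burn_rate(error_ratio_by_window[short_window], slo)
        if long_burn >= threshold and short_burn >= threshold:
            firing.append(name)
    return firing
```

For a 99.9% SLO, `allowed_downtime_minutes(0.999)` is about 43.2, which matches the 43-minute figure, and a 2% error ratio corresponds to a burn rate of roughly 20x.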
Dashboard design follows information hierarchy. A top-level dashboard shows overall service health across all services. Drilling down reveals per-service dashboards with RED metrics, dependency health, and infrastructure utilization. The most effective dashboards are organized by user journey — showing the complete health of the checkout flow from cart to payment to confirmation — rather than by team ownership, which creates blind spots at service boundaries.
How Companies Actually Do This
- Google: Pioneered SLO-based monitoring and error budgets through its SRE practice, defining a widely adopted framework for balancing reliability targets with feature development velocity
- Grafana Labs: Created the open-source Grafana dashboarding platform, which is widely adopted and commonly paired with Prometheus, Loki, and Tempo as a complete monitoring stack
- Datadog: Built a unified monitoring platform combining infrastructure, application, and log monitoring with AI-powered anomaly detection, used by thousands of enterprises
Common Pitfalls
- Alert fatigue from hundreds of low-priority alerts — teams learn to ignore notifications and miss critical incidents; every alert should be actionable and require human intervention
- Dashboard sprawl with hundreds of panels nobody looks at — focus on a few high-signal dashboards aligned with user journeys and SLOs rather than metrics for every internal component
- Setting SLOs without measuring SLIs first — SLOs must be based on real user experience data, not aspirational targets invented in a planning meeting
Interview Questions Worth Practicing
- What is the difference between SLI, SLO, and SLA, and how do error budgets connect them?
- How would you design a monitoring strategy for a new microservices platform with 20 services?
- What is multi-window, multi-burn-rate alerting and why is it better than simple threshold alerts?
The Tradeoffs
- Coverage vs. Noise: More monitors catch more issues but generate more alerts; SLO-based alerting focuses on user impact to reduce noise
- Real-Time vs. Cost: Higher-resolution metrics and faster scrape intervals give quicker detection but increase storage, query, and infrastructure costs
- Centralized vs. Federated: A single monitoring platform simplifies operations but creates a single point of failure; federated monitoring is resilient but harder to correlate
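The real-time versus cost tradeoff is easy to quantify with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function names are hypothetical, and `bytes_per_sample` is an input you would need to measure for your own storage backend rather than a fixed property of any tool.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(series: int, scrape_interval_s: int) -> int:
    """Number of samples ingested per day for a given number of time series."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape_interval_s must be positive")
    return series * (SECONDS_PER_DAY // scrape_interval_s)


def storage_bytes_per_day(series: int, scrape_interval_s: int, bytes_per_sample: float) -> float:
    """Approximate daily storage; bytes_per_sample is an assumption supplied by the caller."""
    return samples_per_day(series, scrape_interval_s) * bytes_per_sample
```

For 100,000 series, moving from a 60-second to a 15-second scrape interval raises ingestion from 144 million to 576 million samples per day, a fourfold increase in storage and query load for detection that is at most 45 seconds faster.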
How to Explain This in an Interview
Here is how I would explain Monitoring in a system design interview:
Monitoring is the practice of collecting metrics, visualizing them on dashboards, and alerting when reliability targets are at risk. I follow Google's SRE framework: define SLIs (measurable quality metrics like latency and error rate), set SLOs (99.9% of requests under 300ms), and track error budgets (the 0.1% allowed unreliability). I use multi-window, multi-burn-rate alerting — fast burn catches sudden outages, slow burn catches gradual degradation. Dashboards should be organized by user journey, not team. I instrument services with RED metrics (Rate, Errors, Duration) and resources with USE metrics (Utilization, Saturation, Errors). Every alert must be actionable — if it does not require human intervention, it should not page anyone.
The Real-World Incident That Made This Famous
Monitoring became a critical discipline as production incidents at large tech companies showed how costly slow detection can be. When systems serve millions of users, gaps in monitoring can let a small fault grow into a cascading failure that costs revenue and erodes user trust. Companies such as Netflix, Google, Amazon, and Meta have invested heavily in monitoring for this reason.
The key lesson from these incidents: monitoring is not just a theoretical concept; it is a practical skill that separates engineers who build resilient systems from those who build fragile ones. Outage reports frequently point to a monitoring-related design decision, such as a missing alert or an unwatched dependency, that was implemented incorrectly or overlooked during the initial architecture review.
How Senior Engineers Think About This
Senior engineers approach monitoring differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does monitoring solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.
When evaluating monitoring in a system design context, experienced engineers consider the failure modes first. What happens when the monitoring pipeline itself goes down? Would anyone notice if metrics stopped arriving or alerts stopped being delivered? Does the system degrade gracefully or catastrophically? These questions reveal more about your understanding than any textbook definition.
The key difference between junior and senior engineers when it comes to monitoring: juniors focus on the happy path, while seniors design for what happens when things go wrong. They consider operational cost, team expertise, on-call burden, and how the decision will look six months from now when traffic has grown 10x.
Common Interview Mistakes
Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect monitoring to real systems and real problems. Instead of reciting definitions, explain which SLIs you would track and which alerts you would set for the system you are designing.
Mistake 2: Not discussing trade-offs. Every monitoring decision has trade-offs, such as coverage versus noise or metric resolution versus cost. Discuss what you gain and what you give up, and explain why the benefits outweigh the downsides for your specific use case.
Mistake 3: Overcomplicating the solution. Start with the simplest monitoring setup that meets the requirements, then add complexity only when justified. Many candidates jump to complex implementations when a few well-chosen SLOs and dashboards would work perfectly.
Production Checklist
- Define clear metrics for measuring the effectiveness of your monitoring, such as how quickly real incidents are detected
- Monitor the monitoring pipeline itself, alerting when metric collection or alert delivery stops working
- Document your monitoring design decisions in Architecture Decision Records (ADRs)
- Test failure scenarios in staging and confirm the expected alerts fire before production deployment
- Review and update your SLOs, dashboards, and alert rules quarterly as system requirements evolve
- Train new team members on the specific monitoring patterns used in your system
- Establish runbooks for common alerts and their recovery procedures
Practical Implementation for .NET Developers
In .NET, health checks are implemented with Microsoft.Extensions.Diagnostics.HealthChecks, exposing /healthz and /ready endpoints consumed by Kubernetes and load balancers. AspNetCore.HealthChecks.UI provides a dashboard. For metrics dashboards, Prometheus scrapes OpenTelemetry metrics exported via OpenTelemetry.Exporter.Prometheus.AspNetCore, displayed in Grafana. Application Insights (Microsoft.ApplicationInsights.AspNetCore) provides built-in dashboards, alerts, and SLO tracking in Azure Monitor. dotnet-monitor is a sidecar tool that exposes diagnostics endpoints for containerized .NET applications without code changes.
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing {Operation} for {ResourceId}", operation, resourceId);
This gives you searchable, structured logs in Azure Monitor or Seq.