Beginner | 11 min read | Updated 2026-06-08

Monitoring

Learn monitoring for distributed systems: build dashboards, set SLOs, configure alerts, and establish processes to detect, diagnose, and respond to problems.

Monitoring is the practice of collecting, visualizing, and alerting on system health metrics to detect problems before they impact users. Dashboards display real-time and historical data such as request rates, error rates, latency percentiles, and resource utilization. Service Level Objectives (SLOs) define reliability targets, and alerts fire when those targets are at risk. In distributed systems, effective monitoring requires tracking health across dozens of services, their dependencies, and the infrastructure they run on, using a structured approach such as the RED or USE method.

Aspect	Details
What it is	The practice of collecting system health data, visualizing it on dashboards, setting SLOs, and alerting when reliability targets are at risk
When to use	Always in production systems — monitoring is the minimum viable operational practice for any service that has users
When NOT to use	When you need to understand why something failed — monitoring tells you what is wrong, but you need observability (logs, traces) for why
Real-world example	Google codified monitoring best practices in the SRE handbook, defining SLIs, SLOs, and error budgets as the foundation for running reliable services
Interview tip	Define SLI vs. SLO vs. SLA precisely and explain error budgets — shows you understand Google SRE principles beyond just building dashboards
Common mistake	Alert fatigue from too many low-priority alerts — teams start ignoring alerts, and real incidents get missed in the noise
Key tradeoff	Signal vs. noise — too few monitors miss real problems, too many create alert fatigue; SLO-based alerting balances this

Why This Matters

Monitoring is the operational foundation of running production systems. Without it, you discover problems when users complain, which may be hours after the issue started. With monitoring, dashboards show real-time system health, and alerts notify on-call engineers within minutes of degradation. The modern approach centers on Service Level Objectives (SLOs): defining that 99.9% of requests should complete in under 300ms, tracking the error budget (the allowed 0.1% unreliability), and alerting when the burn rate suggests the budget will be exhausted before the window ends. This approach focuses engineering attention on what matters, user-perceived reliability, rather than on arbitrary thresholds for internal metrics. Grafana, Datadog, and New Relic are among the widely used monitoring platforms.

[Figure: System architecture for Monitoring]

The Building Blocks

Service Level Indicators: Measurable metrics that reflect user experience — request latency, error rate, availability — the quantitative inputs to SLO calculations
Service Level Objectives: Reliability targets expressed as percentages over time windows — 99.9% of requests succeed within 300ms over a rolling 30-day period
Dashboards: Visual displays of real-time and historical metrics using tools like Grafana, organized by service, team, or user journey for quick health assessment
Health Checks: Endpoints that report whether a service and its dependencies are functioning correctly, used by load balancers, orchestrators, and monitoring systems
Error Budgets: The allowed unreliability (100% minus SLO) that teams can spend on feature development versus reliability work, balancing velocity and stability
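
The health-check building block can be sketched as a small aggregator that runs one probe per dependency and reports an overall status. This is a minimal illustration, not a reference implementation: `HealthReport`, `run_health_checks`, and the probe names `database` and `cache` are hypothetical, and a real service would expose the report over an HTTP endpoint.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class HealthReport:
    status: str             # "healthy" or "unhealthy"
    checks: Dict[str, str]  # per-dependency result


def run_health_checks(probes: Dict[str, Callable[[], bool]]) -> HealthReport:
    """Run every probe; the service is healthy only if all probes pass."""
    results: Dict[str, str] = {}
    for name, probe in probes.items():
        try:
            ok = probe()
        except Exception:
            # A probe that raises is treated as a failed dependency.
            ok = False
        results[name] = "healthy" if ok else "unhealthy"
    all_ok = all(value == "healthy" for value in results.values())
    return HealthReport(status="healthy" if all_ok else "unhealthy", checks=results)


def failing_cache_probe() -> bool:
    # Hypothetical probe that simulates an unreachable cache.
    raise ConnectionError("cache unreachable")


report = run_health_checks({
    "database": lambda: True,
    "cache": failing_cache_probe,
})
print(report.status, report.checks)
```

Reporting per-dependency results alongside the overall status lets an orchestrator act on the summary while an engineer reads the detail.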

Under the Hood

Modern monitoring follows a layered approach. At the bottom, infrastructure monitoring tracks host-level metrics (CPU, memory, disk, network) using agents such as node_exporter or the CloudWatch agent. Above that, application monitoring tracks service-level metrics using the RED method (Request rate, Error rate, Duration) for request-driven services and the USE method (Utilization, Saturation, Errors) for resources like queues and connection pools.
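
To make the RED method concrete, here is a minimal sketch that derives rate, error ratio, and duration percentiles from a batch of request records. The `Request` type, the `red_summary` function, the choice to count only 5xx responses as errors, and the nearest-rank percentile are all assumptions made for illustration; a metrics library would normally compute these from histograms.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: List[Request], window_seconds: float) -> dict:
    """Summarize Rate, Errors, and Duration for one time window."""
    if not requests:
        raise ValueError("need at least one request to summarize")
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_ms for r in requests]
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p99_ms": percentile(durations, 99),
    }


# Synthetic 10-second window: mostly fast successes, a few slow ones, one failure.
sample = (
    [Request(200, 100.0)] * 95
    + [Request(200, 400.0)] * 4
    + [Request(500, 900.0)]
)
summary = red_summary(sample, window_seconds=10)
print(summary)
```

Reporting percentiles rather than an average keeps the slow tail visible: in this sample the median is 100 ms while the 99th percentile is 400 ms.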

[Figure: How Monitoring works step by step]

SLO-based monitoring adds a user-centric layer. An SLI might be the proportion of HTTP requests returning 2xx status codes within 300ms. The SLO states this should be 99.9% over 30 days. The error budget is 0.1% — roughly 43 minutes of total downtime or a proportional number of failed requests per month. Multi-window, multi-burn-rate alerting detects when the error budget is being consumed too quickly: a fast burn alert (e.g., 14x burn rate over 1 hour) catches sudden outages, while a slow burn alert (e.g., 2x burn rate over 3 days) catches gradual degradation.
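
The arithmetic above can be sketched in a few lines. The thresholds (14x over 1 hour, 2x over 3 days) come from the examples in the text; the short confirmation windows (`5m`, `6h`), the `BurnRule` type, and the request counts are assumptions added for illustration. Burn rate here means the observed error ratio divided by the error ratio the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

SLO_TARGET = 0.999  # 99.9% of requests must be good
WINDOW_DAYS = 30


def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed full-outage time in the window, in minutes."""
    return (1 - slo_target) * window_days * 24 * 60


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if total == 0:
        return 0.0
    return (bad / total) / (1 - slo_target)


@dataclass(frozen=True)
class BurnRule:
    name: str
    long_window: str
    short_window: str  # assumed confirmation window, not from the text
    threshold: float


RULES = [
    BurnRule("fast_burn", "1h", "5m", 14.0),
    BurnRule("slow_burn", "3d", "6h", 2.0),
]


def firing_alerts(counts: Dict[str, Tuple[int, int]], slo_target: float) -> List[str]:
    """A rule fires only when both its long and short windows exceed the threshold."""
    fired = []
    for rule in RULES:
        long_rate = burn_rate(*counts[rule.long_window], slo_target)
        short_rate = burn_rate(*counts[rule.short_window], slo_target)
        if long_rate >= rule.threshold and short_rate >= rule.threshold:
            fired.append(rule.name)
    return fired


# Hypothetical (bad, total) request counts per window during a sudden outage.
counts = {
    "5m": (300, 8_000),
    "1h": (2_000, 100_000),
    "6h": (2_100, 600_000),
    "3d": (2_500, 7_200_000),
}
budget = error_budget_minutes(SLO_TARGET, WINDOW_DAYS)  # about 43.2 minutes
alerts = firing_alerts(counts, SLO_TARGET)
print(round(budget, 1), alerts)
```

Requiring the short window to agree with the long one is a common way to stop an alert from firing long after the problem has already ended.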

Dashboard design follows information hierarchy. A top-level dashboard shows overall service health across all services. Drilling down reveals per-service dashboards with RED metrics, dependency health, and infrastructure utilization. The most effective dashboards are organized by user journey — showing the complete health of the checkout flow from cart to payment to confirmation — rather than by team ownership, which creates blind spots at service boundaries.
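
One way to picture the journey-oriented view is a roll-up in which a user journey is only as healthy as its least healthy service. The sketch below is illustrative only: the status names, the `journey_status` function, and the `browse` journey with its `search` service are hypothetical, while the checkout steps come from the example above.

```python
from typing import Dict, List

# Higher number means worse health.
SEVERITY = {"healthy": 0, "degraded": 1, "down": 2}


def journey_status(services: List[str], service_status: Dict[str, str]) -> str:
    """Return the worst status among the services a journey depends on."""
    return max((service_status[name] for name in services), key=lambda s: SEVERITY[s])


service_status = {
    "cart": "healthy",
    "payment": "degraded",
    "confirmation": "healthy",
    "search": "down",
}
journeys = {
    "checkout": ["cart", "payment", "confirmation"],
    "browse": ["search"],
}

overview = {name: journey_status(svcs, service_status) for name, svcs in journeys.items()}
print(overview)
```

A top-level dashboard built this way surfaces a degraded checkout flow even when the team that owns the payment service has not yet looked at its own panels.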

How Companies Actually Do This

Google: Pioneered SLO-based monitoring and error budgets through its SRE practice, defining a widely adopted framework for balancing reliability targets with feature development velocity

Grafana Labs: Created the widely used open-source Grafana dashboarding platform, which pairs with Prometheus, Loki, and Tempo to form a complete monitoring stack

Datadog: Built a unified monitoring platform combining infrastructure, application, and log monitoring with AI-powered anomaly detection, used by thousands of enterprises

[Figure: Comparing key aspects of Monitoring]

Common Pitfalls

Alert fatigue from hundreds of low-priority alerts — teams learn to ignore notifications and miss critical incidents; every alert should be actionable and require human intervention
Dashboard sprawl with hundreds of panels nobody looks at — focus on a few high-signal dashboards aligned with user journeys and SLOs rather than metrics for every internal component
Setting SLOs without measuring SLIs first — SLOs must be based on real user experience data, not aspirational targets invented in a planning meeting

[Figure: Data flow through Monitoring]

Interview Questions Worth Practicing

What is the difference between SLI, SLO, and SLA, and how do error budgets connect them?
How would you design a monitoring strategy for a new microservices platform with 20 services?
What is multi-window, multi-burn-rate alerting and why is it better than simple threshold alerts?

The Tradeoffs

Coverage vs. Noise: More monitors catch more issues but generate more alerts; SLO-based alerting focuses on user impact to reduce noise
Real-Time vs. Cost: Higher-resolution metrics and faster scrape intervals give quicker detection but increase storage, query, and infrastructure costs
Centralized vs. Federated: A single monitoring platform simplifies operations but creates a single point of failure; federated monitoring is resilient but harder to correlate
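
The real-time versus cost tradeoff can be estimated with back-of-the-envelope arithmetic: sample volume grows inversely with the scrape interval. The series count, the retention period, and the figure of 2 bytes per compressed sample below are assumptions chosen for illustration; actual storage depends on the backend and its compression.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(active_series: int, scrape_interval_s: int) -> int:
    """Each series produces one sample per scrape."""
    return active_series * SECONDS_PER_DAY // scrape_interval_s


def storage_gib(active_series: int, scrape_interval_s: int,
                retention_days: int, bytes_per_sample: float) -> float:
    """Rough storage estimate in GiB for the retention period."""
    total_bytes = (
        samples_per_day(active_series, scrape_interval_s)
        * retention_days
        * bytes_per_sample
    )
    return total_bytes / 2**30


ACTIVE_SERIES = 100_000   # assumed number of time series
RETENTION_DAYS = 30       # assumed retention
BYTES_PER_SAMPLE = 2.0    # assumed compressed size per sample

for interval in (15, 60):
    gib = storage_gib(ACTIVE_SERIES, interval, RETENTION_DAYS, BYTES_PER_SAMPLE)
    print(f"{interval}s interval: {samples_per_day(ACTIVE_SERIES, interval):,} samples/day, "
          f"{gib:.1f} GiB over {RETENTION_DAYS} days")
```

Under these assumptions, moving from a 60-second to a 15-second scrape interval quadruples sample volume and storage in exchange for detecting a change up to 45 seconds sooner.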

[Figure: Key components of Monitoring]

How to Explain This in an Interview

Here is how I would explain Monitoring in a system design interview:

Monitoring is the practice of collecting metrics, visualizing them on dashboards, and alerting when reliability targets are at risk. I follow Google's SRE framework: define SLIs (measurable quality metrics like latency and error rate), set SLOs (99.9% of requests under 300ms), and track error budgets (the 0.1% allowed unreliability). I use multi-window, multi-burn-rate alerting — fast burn catches sudden outages, slow burn catches gradual degradation. Dashboards should be organized by user journey, not team. I instrument services with RED metrics (Rate, Errors, Duration) and resources with USE metrics (Utilization, Saturation, Errors). Every alert must be actionable — if it does not require human intervention, it should not page anyone.

[Figure: Interview tips for Monitoring]

The Real-World Incident That Made This Famous

Monitoring became a core engineering discipline after repeated high-profile production incidents at major tech companies. When systems serve millions of users, gaps in monitoring let small faults grow into cascading failures that cost revenue and erode user trust. Companies like Netflix, Google, Amazon, and Meta have all invested heavily in monitoring for this reason.

The key lesson from these incidents: monitoring is not just a theoretical concept. It is a practical skill that separates engineers who build resilient systems from those who build fragile ones. Outage reports frequently point to a monitoring gap, such as a missing alert, an ignored alert, or a dashboard that hid the failing dependency, that was overlooked during the initial architecture review.

[Figure: When to use Monitoring]

How Senior Engineers Think About This

Senior engineers approach Monitoring differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does Monitoring solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.

When evaluating monitoring in a system design context, experienced engineers consider the failure modes first. What happens when the monitoring pipeline itself goes down? Does visibility degrade gracefully or disappear all at once? Who finds out, and how? These questions reveal more about your understanding than any textbook definition.

The key difference between junior and senior engineers when it comes to monitoring: juniors focus on the happy path, while seniors design for what happens when things go wrong. They consider operational cost, team expertise, on-call load, and how the decision will look six months from now when traffic has grown 10x.

[Figure: Advantages and disadvantages of Monitoring]

Common Interview Mistakes

Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect Monitoring to real systems and real problems. Instead of reciting definitions, explain when and why you would use Monitoring in the system you are designing.

Mistake 2: Not discussing trade-offs. Every design decision involving Monitoring has trade-offs. Discuss what you gain and what you give up. Acknowledge the downsides and explain why the benefits outweigh them for your specific use case.

Mistake 3: Overcomplicating the solution. Start with the simplest approach to Monitoring that meets the requirements, then add complexity only when justified. Many candidates jump to complex implementations when a simpler solution would work perfectly.

[Figure: Real-world examples of Monitoring]

Production Checklist

Define clear metrics for measuring the effectiveness of your Monitoring implementation
Monitor the monitoring system itself: alert when metric collection stops, exporters go silent, or the alerting pipeline fails to deliver notifications
Document your Monitoring design decisions in Architecture Decision Records (ADRs)
Inject failures in staging to confirm that the expected alerts fire and reach the on-call engineer before production deployment
Review and update your Monitoring implementation quarterly as system requirements evolve
Train new team members on the specific Monitoring patterns used in your system
Establish runbooks for common Monitoring-related incidents and recovery procedures

Practical Implementation for .NET Developers

In .NET, health checks are implemented with Microsoft.Extensions.Diagnostics.HealthChecks and are typically mapped to endpoints such as /healthz and /ready, which Kubernetes probes and load balancers consume. AspNetCore.HealthChecks.UI provides a dashboard for them. For metrics dashboards, Prometheus scrapes OpenTelemetry metrics exposed through OpenTelemetry.Exporter.Prometheus.AspNetCore, and Grafana displays them. Application Insights (Microsoft.ApplicationInsights.AspNetCore) provides built-in dashboards, alerts, and SLO tracking in Azure Monitor. dotnet-monitor is a tool, often run as a sidecar, that exposes diagnostics endpoints for containerized .NET applications without code changes.

ASP.NET Core setup: Create a service class that encapsulates the monitoring logic (for example, a custom health check or a metrics recorder), register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

Log.Information("Processing {Operation} for {ResourceId}", operation, resourceId);

This gives you searchable, structured logs in Azure Monitor or Seq.