Intermediate · 11 min read · Updated 2026-06-08

Metrics

Learn about metrics in distributed systems: counters, gauges, and histograms that enable real-time dashboards, alerting, and capacity planning.

Metrics are numeric measurements collected at regular intervals that describe system behavior over time. Unlike logs, which record individual events, metrics aggregate data into time series (request counts, error rates, latency percentiles, CPU utilization) that reveal trends, trigger alerts, and power dashboards. The three fundamental metric types are counters (monotonically increasing totals), gauges (point-in-time values), and histograms (distributions). Metrics are usually the most cost-effective observability signal because aggregated numbers are far cheaper to store and query than individual events.

Aspect	Details
What it is	Numeric time-series measurements (counters, gauges, histograms) that quantify system behavior for dashboards, alerting, and capacity planning
When to use	Always in production — metrics are the first signal for detecting problems and the primary input for alerts, SLOs, and autoscaling decisions
When NOT to use	When you need to understand the specific details of individual requests — use logs or traces for event-level granularity
Real-world example	Prometheus, created at SoundCloud, became the industry standard for metrics collection in cloud-native environments and Kubernetes
Interview tip	Explain the RED method (Rate, Errors, Duration) for services and USE method (Utilization, Saturation, Errors) for resources — shows structured thinking
Common mistake	Using averages instead of percentiles — an average latency of 100ms hides the fact that 1% of users experience 5-second responses
Key tradeoff	Cardinality vs. utility — high-cardinality labels provide detailed breakdowns but explode storage and query costs in time-series databases

Why This Matters

Metrics answer the fundamental question: is the system healthy right now, and how does today compare to yesterday? A single Prometheus counter tracking HTTP requests, labeled by status code, can tell you the request rate, error rate, and availability, which are the core health indicators for any service. Histograms reveal latency distributions, showing that while median latency is 50ms, the p99 is 2 seconds, meaning 1% of users have a terrible experience. Metrics are essential for alerting (page me if error rate exceeds 1%), capacity planning (we need to add servers before traffic doubles next quarter), and autoscaling (scale up when CPU exceeds 70%). Without metrics, you are flying blind: unable to set SLOs, detect degradation, or justify infrastructure investments.

Figure: System architecture for Metrics

The Building Blocks

Counters: Monotonically increasing values that only go up (resetting to zero only when the process restarts), such as total requests served, total errors, or total bytes processed; the rate is computed at query time
Gauges: Point-in-time values that can go up or down — current CPU utilization, active connections, queue depth, memory usage
Histograms: Distribution measurements that bucket observations into ranges — request latency distribution enabling percentile calculations like p50, p95, p99
Labels and Dimensions: Key-value tags attached to metrics (service, endpoint, status_code, region) enabling slicing and filtering of aggregated data
Collection and Storage: Pull-based (Prometheus scrapes endpoints) or push-based (StatsD, OTLP) collection into time-series databases optimized for write-heavy numeric workloads
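
The three metric types above can be sketched in a few lines of Python. This is a minimal illustration, not a real client library: the class names, the `bounds` parameter, and the sample values are all assumptions made for the example.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonically increasing total; it can only go up."""
    value: float = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can move in either direction."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; bounds are inclusive upper limits."""
    bounds: list[float]  # sorted upper bounds; one overflow bucket is implied
    counts: list[int] = field(default_factory=list)
    total: float = 0.0

    def __post_init__(self) -> None:
        self.counts = [0] * (len(self.bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value


# Hypothetical instruments and observations, for illustration only.
requests = Counter()
queue_depth = Gauge()
latency_ms = Histogram(bounds=[10, 50, 100, 500])

for observed in (4, 12, 48, 75, 900):
    requests.inc()
    latency_ms.observe(observed)
queue_depth.set(3)

print(requests.value, queue_depth.value, latency_ms.counts)
```

Note that the histogram stores only one integer per bucket plus a running sum, which is why it stays cheap no matter how many observations arrive.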

Under the Hood

Metrics work through a pipeline of instrumentation, collection, storage, and visualization. At the instrumentation layer, applications expose metric values through libraries. Prometheus client libraries expose an HTTP /metrics endpoint that the Prometheus server scrapes at configured intervals (typically 15-30 seconds). Each scrape captures the current value of all registered metrics.
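
To make the scrape step concrete, here is a rough sketch of what a /metrics response body looks like in the Prometheus text exposition format. The `render_metric` helper and the sample values are hypothetical; a real service would rely on a client library, which also handles label escaping that this sketch skips.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in Prometheus-style text exposition format.

    samples maps a tuple of (label, value) pairs to the current numeric value.
    Label values are not escaped here, so this is for illustration only.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples.items():
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        selector = f"{{{label_text}}}" if label_text else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter values at the moment of a scrape.
page = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests handled.",
    {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "GET"), ("status", "500")): 3,
    },
)
print(page)
```

Each scrape returns only the current values; the server attaches the timestamp and builds the time series from successive scrapes.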

Figure: How Metrics works step by step

Counters are the simplest and most widely used type. A counter tracking total HTTP requests might show a value of 1,000,000. The rate() function in PromQL calculates the per-second rate of change, giving you requests per second. Dividing the rate of an error counter by the rate of the request counter gives the error rate. Histograms are more complex: they maintain configurable buckets (0-10ms, 10-50ms, 50-100ms, 100-500ms, 500ms+) and count observations per bucket, which lets percentiles be estimated without storing individual values. The estimate is only as precise as the bucket boundaries allow.
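
The two calculations described above can be sketched as follows. This is a simplified illustration, not PromQL's actual implementation: it skips the window-edge extrapolation that rate() performs, and the function names, sample data, and bucket counts are assumptions.

```python
def per_second_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset (for example a restart),
    so the value seen after the reset counts as new increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def estimate_quantile(q, bounds, counts):
    """Estimate the q-quantile from per-bucket counts by linear interpolation.

    bounds are bucket upper limits; counts has one extra entry for overflow.
    """
    rank = q * sum(counts)
    seen = 0
    lower = 0.0
    for upper, count in zip(bounds, counts):
        if count and seen + count >= rank:
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - seen) / count
        seen += count
        lower = upper
    return bounds[-1]  # rank is in the overflow bucket; report the last bound


# Three hypothetical scrapes, 15 seconds apart.
scrapes = [(0, 1_000_000), (15, 1_000_300), (30, 1_000_600)]
print(per_second_rate(scrapes))

# Hypothetical counts for buckets <=10, <=50, <=100, <=500, and overflow.
bounds_ms = [10, 50, 100, 500]
bucket_counts = [40, 40, 15, 4, 1]
print(estimate_quantile(0.5, bounds_ms, bucket_counts))
print(estimate_quantile(0.9, bounds_ms, bucket_counts))
```

The interpolation assumes observations are evenly distributed within a bucket, so a wide bucket (100-500ms here) yields a coarse estimate for any percentile that lands in it.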

The cardinality challenge is the primary operational concern. Each unique combination of metric name and label values creates a distinct time series. A metric with labels for service (10 values), endpoint (50 values), status_code (5 values), and region (3 values) creates 10 × 50 × 5 × 3 = 7,500 time series. Adding a userId label with 1 million users would create 7.5 billion series, overwhelming any time-series database. Cardinality must be carefully controlled — use logs for high-cardinality dimensions and metrics only for bounded label sets.
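
The arithmetic above is easy to reproduce when reviewing a proposed label set. A small sketch, assuming the worst case in which every label combination actually occurs (the `series_count` helper is hypothetical):

```python
from math import prod


def series_count(label_cardinalities):
    """Worst-case time series for one metric: product of distinct label values."""
    return prod(label_cardinalities.values())


bounded = {"service": 10, "endpoint": 50, "status_code": 5, "region": 3}
print(series_count(bounded))  # 7500

# Adding an unbounded label multiplies every existing series.
unbounded = {**bounded, "user_id": 1_000_000}
print(series_count(unbounded))  # 7500000000
```

In practice not every combination appears, so the real count is usually lower, but the product is the bound to plan against.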

How Companies Actually Do This

SoundCloud: Created Prometheus to solve metrics collection at scale; it became a CNCF graduated project and the de facto standard for Kubernetes monitoring

Figure: Comparing key aspects of Metrics

Netflix: Uses Atlas, their custom time-series database, to ingest millions of metrics per second from streaming infrastructure, enabling real-time dashboards and automated anomaly detection

Uber: Built M3, an open-source metrics platform handling billions of time series, to aggregate metrics from thousands of microservices across multiple data centers

Common Pitfalls

Using averages for latency metrics instead of percentiles — the average masks tail latency problems; always track p50, p95, and p99 at minimum
High-cardinality label explosion (adding user IDs or request IDs as metric labels) which overwhelms time-series database storage and query performance
Not defining clear metric naming conventions across teams, leading to inconsistent dashboards where http_requests_total and request_count_http mean the same thing
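
The first pitfall is easy to demonstrate with a small example. The latency sample below is hypothetical, and `percentile` is a simple nearest-rank helper written for this sketch rather than a library function.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceiling of p% of n
    return ordered[rank - 1]


# Hypothetical sample: 98 requests at 50 ms and 2 requests at 2,000 ms.
latencies_ms = [50] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print(mean_ms)                       # 89.0
print(percentile(latencies_ms, 50))  # 50
print(percentile(latencies_ms, 95))  # 50
print(percentile(latencies_ms, 99))  # 2000
```

The mean of 89 ms looks healthy, and even p95 shows nothing unusual; only p99 exposes the two-second tail.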

Figure: Data flow through Metrics

Interview Questions Worth Practicing

What are the three fundamental metric types and when would you use each one?
How does the RED method guide you in choosing which metrics to instrument for a service?
What is the cardinality problem in metrics and how do you prevent it from crashing your monitoring infrastructure?

The Tradeoffs

Granularity vs. Cost: Higher scrape frequency and more labels provide finer-grained data but increase storage, query latency, and infrastructure costs
Simplicity vs. Accuracy: Counters and gauges are simple but histograms give percentiles — you need both, and choosing bucket boundaries for histograms is a design decision
Pull vs. Push: Pull-based (Prometheus) is simple for long-running services that can be discovered and scraped, but it can miss short-lived jobs that exit before the next scrape; push-based (StatsD/OTLP) works for any workload but requires a receiving agent or collector

Figure: Key components of Metrics

How to Explain This in an Interview

Here is how I would explain Metrics in a system design interview:

Metrics are numeric time-series measurements that quantify system behavior. The three types are counters (monotonically increasing, like total requests), gauges (current values, like active connections), and histograms (distributions, like latency percentiles). I follow the RED method for services — Rate, Errors, Duration — and the USE method for resources — Utilization, Saturation, Errors. Prometheus is the industry standard, scraping /metrics endpoints. The critical pitfall is cardinality: each unique label combination creates a time series, so adding unbounded labels like userId explodes storage. I always use percentiles (p50, p95, p99) rather than averages because averages hide tail latency affecting real users.

Figure: Interview tips for Metrics

The Real-World Incident That Made This Famous

Metrics earned their place in production practice through high-profile incidents at major tech companies. When systems handle millions of users, gaps in instrumentation or alerting can let small faults grow into cascading failures that cost revenue and erode user trust. Companies like Netflix, Google, Amazon, and Meta have all invested heavily in metrics infrastructure for this reason.

The key lesson from these incidents: working with metrics is not just a theoretical concept; it is a practical skill that separates engineers who build resilient systems from those who build fragile ones. Outage postmortems frequently trace part of the impact to a metrics-related design decision, such as a missing alert or a misleading dashboard, that was implemented incorrectly or overlooked during the initial architecture review.

Figure: When to use Metrics

How Senior Engineers Think About This

Senior engineers approach Metrics differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does Metrics solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.

When evaluating Metrics in a system design context, experienced engineers consider the failure modes first. What happens when this component goes down? How does the system degrade? Is the degradation graceful or catastrophic? These questions reveal more about your understanding than any textbook definition.

The key difference between junior and senior engineers when it comes to Metrics: juniors focus on the happy path, while seniors design for what happens when things go wrong. They consider operational cost, team expertise, monitoring requirements, and how the decision will look six months from now when traffic has grown 10x.

Figure: Advantages and disadvantages of Metrics

Common Interview Mistakes

Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect Metrics to real systems and real problems. Instead of reciting definitions, explain when and why you would use Metrics in the system you are designing.

Mistake 2: Not discussing trade-offs. Every design decision involving Metrics has trade-offs. Discuss what you gain and what you give up. Acknowledge the downsides and explain why the benefits outweigh them for your specific use case.

Mistake 3: Overcomplicating the solution. Start with the simplest approach to Metrics that meets the requirements, then add complexity only when justified. Many candidates jump to complex implementations when a simpler solution would work perfectly.

Figure: Real-world examples of Metrics

Production Checklist

Define clear metrics for measuring the effectiveness of your Metrics implementation
Set up monitoring and alerting that specifically tracks Metrics-related failures
Document your Metrics design decisions in Architecture Decision Records (ADRs)
Test failure scenarios related to Metrics in staging before production deployment
Review and update your Metrics implementation quarterly as system requirements evolve
Train new team members on the specific Metrics patterns used in your system
Establish runbooks for common Metrics-related incidents and recovery procedures

Practical Implementation for .NET Developers

In .NET, the System.Diagnostics.Metrics API has been the native metrics framework since .NET 6, with Meter and Instrument types for counters, histograms, and gauges. OpenTelemetry.Exporter.Prometheus.AspNetCore exposes a /metrics endpoint for Prometheus scraping. For push-based collection, OpenTelemetry OTLP exporters send to collectors. App.Metrics is a popular third-party library with built-in reservoir sampling for histograms. Since .NET 8, ASP.NET Core emits built-in metrics (http.server.request.duration, kestrel.active_connections) through Meters, alongside the older EventCounters. Azure Monitor Application Insights can also consume these metrics.

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

```csharp
Log.Information("Processing {Operation} for {ResourceId}", operation, resourceId);
```

This gives you searchable, structured logs in Azure Monitor or Seq.