Distributed Tracing
In a microservices system, a single user request may pass through ten or more services. When something is slow or fails, you need to see the entire chain to find where the problem is.
The Core Idea
Distributed tracing tracks a request as it flows through multiple services in a microservices architecture. Each service adds a span (a timed operation) to the trace, creating a complete picture of the request's journey, timing, and any errors encountered.
Step-by-Step Walkthrough
- A user request arrives at the API gateway.
- The gateway generates a trace ID (e.g., abc123) and adds it to the request headers.
- Each downstream service reads the trace ID from the incoming headers, forwards it on any calls it makes, and creates a span with a start time, end time, and metadata.
- Spans are sent to a tracing backend (Jaeger, Zipkin, Datadog).
- The backend assembles the spans that share a trace ID into a trace and renders a waterfall view showing where time was spent.
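The walkthrough can be sketched in a few lines of Python. This is a hand-rolled illustration, not the API of Jaeger, Zipkin, or OpenTelemetry: the `Collector` class, the `record_span` helper, the header names `x-trace-id` and `x-parent-span-id`, the service names, and the hard-coded timings are all assumptions made for the example.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None for the root span
    name: str
    start_ms: int
    end_ms: int


class Collector:
    """Stand-in for a tracing backend; it only stores spans in memory."""

    def __init__(self):
        self.spans = []

    def report(self, span):
        self.spans.append(span)

    def waterfall(self, trace_id):
        # Assemble the trace: select spans with this trace ID, order by start time.
        spans = sorted(
            (s for s in self.spans if s.trace_id == trace_id),
            key=lambda s: s.start_ms,
        )
        by_id = {s.span_id: s for s in spans}

        def depth(span):
            # Count how many ancestors this span has within the trace.
            d = 0
            while span.parent_id in by_id:
                span = by_id[span.parent_id]
                d += 1
            return d

        return [
            f"{'  ' * depth(s)}{s.name} [{s.start_ms}ms..{s.end_ms}ms]"
            for s in spans
        ]


def record_span(collector, headers, name, start_ms, end_ms):
    """What each service does: read the context, report a span, and
    return the headers it would attach to its own downstream calls."""
    span = Span(
        trace_id=headers["x-trace-id"],
        span_id=secrets.token_hex(8),
        parent_id=headers.get("x-parent-span-id"),
        name=name,
        start_ms=start_ms,
        end_ms=end_ms,
    )
    collector.report(span)
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


collector = Collector()
trace_id = secrets.token_hex(16)  # generated once, at the gateway
gateway = record_span(collector, {"x-trace-id": trace_id}, "api-gateway", 0, 120)
orders = record_span(collector, gateway, "orders-service", 5, 110)
record_span(collector, orders, "db-query", 20, 70)
record_span(collector, orders, "payments-service", 75, 105)

lines = collector.waterfall(trace_id)
print("\n".join(lines))
# api-gateway [0ms..120ms]
#   orders-service [5ms..110ms]
#     db-query [20ms..70ms]
#     payments-service [75ms..105ms]
```

The indentation in the output reflects the parent-child relationships between spans, and the time ranges show which operation dominates the request.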
Key Concepts
- Trace: A complete record of a request's journey through the system. Has a unique trace ID.
- Span: A single operation within a trace (e.g., 'database query took 50ms'). Spans have parent-child relationships.
- Context propagation: The trace ID is passed between services (typically in HTTP headers) so all spans are correlated.
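- Sampling: Tracing every request is usually too expensive at high traffic volumes, so a common approach is to sample a fraction of requests (for example 1-10%). Head-based sampling decides when the request starts; tail-based sampling decides after the trace is complete.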
- OpenTelemetry: The industry standard for distributed tracing instrumentation. Supports many languages and backends.
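As one concrete illustration of context propagation, the W3C Trace Context standard carries the trace ID, the parent span ID, and a sampled flag in a single `traceparent` HTTP header. The sketch below is a simplified reading of that format, not a complete implementation: the function names `extract`, `inject`, and `start_span` are hypothetical, only version `00` is accepted, and header keys are assumed to be lowercase.

```python
import re
import secrets

# traceparent layout: version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers):
    """Return (trace_id, parent_span_id, sampled), or None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


def inject(trace_id, span_id, sampled):
    """Build the header a service attaches to its outgoing calls."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def start_span(incoming_headers):
    """Continue the caller's trace if a valid context arrived, else start a new one.

    Returns (trace_id, span_id, parent_id, outgoing_headers).
    """
    context = extract(incoming_headers)
    if context is None:
        trace_id, parent_id, sampled = secrets.token_hex(16), None, True
    else:
        trace_id, parent_id, sampled = context
    span_id = secrets.token_hex(8)
    return trace_id, span_id, parent_id, inject(trace_id, span_id, sampled)


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
trace_id, span_id, parent_id, outgoing = start_span(incoming)
print(trace_id)   # 4bf92f3577b34da6a3ce929d0e0e4736
print(parent_id)  # 00f067aa0ba902b7
```

The trace ID stays the same across every hop, while each service replaces the parent ID with its own span ID before calling downstream. That is what lets the backend rebuild the parent-child tree.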
In Production
Google's Dapper was a pioneering distributed tracing system, built to trace requests across Google's production services. The 2010 paper describing it influenced later tracers such as Zipkin and Jaeger.
Uber uses Jaeger (which they open-sourced) to trace requests across thousands of microservices.
Netflix uses distributed tracing across its microservices to help identify latency bottlenecks.
Tradeoffs and Limitations
- Overhead vs Visibility: Tracing adds latency (span creation, context propagation) and storage costs.
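- Sampling vs Completeness: Head-based sampling decides before the outcome is known, so it can miss slow or failed requests. Tail-based sampling can keep them, but it must buffer spans until the trace completes, which makes it more complex.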
- Cost: Storing and analyzing traces at scale requires significant infrastructure.
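The sampling tradeoff is easier to see side by side. The following is a minimal sketch under stated assumptions: `head_sample` and `tail_sample` are hypothetical helpers, spans are plain dicts with `start_ms`, `end_ms`, and an optional `error` key, and the 500 ms threshold is arbitrary. Real tracing systems implement both strategies with more care.

```python
import secrets


def head_sample(trace_id, rate):
    """Head-based: decide up front from the trace ID alone.

    Every service that sees the same trace ID reaches the same decision,
    so a trace is kept or dropped as a whole.
    """
    # Interpret the last 16 hex characters as a number in [0, 2**64).
    return int(trace_id[-16:], 16) < rate * 2**64


def tail_sample(spans, latency_threshold_ms):
    """Tail-based: decide after the trace is complete.

    Keep the trace if any span errored or the whole request was slow.
    """
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    has_error = any(s.get("error") for s in spans)
    return has_error or duration >= latency_threshold_ms


# Head-based sampling keeps roughly the requested fraction of random trace IDs.
kept = sum(head_sample(secrets.token_hex(16), 0.10) for _ in range(10_000))
print(f"head-based kept {kept} of 10000 traces")

fast_ok = [{"start_ms": 0, "end_ms": 40}]
slow_ok = [{"start_ms": 0, "end_ms": 900}]
fast_error = [{"start_ms": 0, "end_ms": 30, "error": True}]

print(tail_sample(fast_ok, 500))     # False
print(tail_sample(slow_ok, 500))     # True
print(tail_sample(fast_error, 500))  # True
```

The head-based function needs no state, but it cannot know that a request will turn out slow. The tail-based function sees the outcome, but only because something upstream has held every span until the trace finished.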
Production Gotchas
- Not propagating trace context — breaks the trace between services
- Tracing 100% of requests — too expensive at scale
- Not correlating traces with logs and metrics — tracing alone does not tell the full story
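One common way to correlate logs with traces is to stamp every log line with the current trace ID. The sketch below does this with Python's standard `logging` and `contextvars` modules. The names `current_trace_id`, `TraceIdFilter`, and `handle_request` are hypothetical, and the log output goes to an in-memory buffer only so the result can be inspected here.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request currently being handled ("-" if none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never suppress the record


stream = io.StringIO()  # a real service would log to stderr or a file
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False


def handle_request(trace_id):
    # Set the trace ID for the duration of this request, then restore it.
    token = current_trace_id.set(trace_id)
    try:
        logger.info("processing order")
    finally:
        current_trace_id.reset(token)


handle_request("abc123")
logger.info("background task")  # no request in flight

output = stream.getvalue()
print(output, end="")
# INFO trace_id=abc123 processing order
# INFO trace_id=- background task
```

With the trace ID in every log line, a slow span in the tracing backend leads directly to the matching log entries, and the reverse lookup works too.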
The Interview Angle
- What is distributed tracing and why is it needed?
- How does context propagation work?
- What is the difference between a trace and a span?
- How do you handle the overhead of tracing?
The Real-World Incident That Made This Famous
Distributed tracing became critical as large companies moved to microservices and production incidents grew harder to diagnose. When a system serves millions of users, a failure in one service can cascade through its callers, and without traces it is hard to tell where the problem started. Companies like Netflix, Google, Amazon, and Meta have all invested heavily in distributed tracing for this reason.
The key lesson: distributed tracing is not just a theoretical concept. It is a practical skill that separates engineers who build resilient systems from those who build fragile ones.
How Senior Engineers Think About This
Senior engineers approach distributed tracing differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does distributed tracing solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.
When evaluating distributed tracing in a system design context, experienced engineers consider the failure modes first. What happens when the tracing backend or a collector goes down? Do services keep handling requests and simply lose spans, or do they block and fail with it? Is the degradation graceful or catastrophic? These questions reveal more about your understanding than any textbook definition.
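One way to make the degradation graceful is to keep span reporting off the request path and let it lose data rather than block. The following is a minimal sketch of that idea, assuming a hypothetical `SpanBuffer` class and an `export` callable that raises when the backend is unreachable. It is not how any particular tracing client is implemented.

```python
import queue


class SpanBuffer:
    """Bounded, non-blocking buffer between request handlers and the exporter."""

    def __init__(self, max_size):
        self._queue = queue.Queue(maxsize=max_size)
        self.dropped = 0

    def submit(self, span):
        """Called on the request path: never blocks, drops the span if full."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, export):
        """Called off the request path: send whatever is buffered.

        A failed export counts as a dropped span; this sketch does not retry.
        """
        sent = 0
        while True:
            try:
                span = self._queue.get_nowait()
            except queue.Empty:
                return sent
            try:
                export(span)
                sent += 1
            except Exception:
                self.dropped += 1


def backend_down(span):
    raise ConnectionError("tracing backend unreachable")


buffer = SpanBuffer(max_size=2)
accepted = [buffer.submit(f"span-{i}") for i in range(3)]
print(accepted)                     # [True, True, False]
print(buffer.drain(backend_down))   # 0
print(buffer.dropped)               # 3
```

The application keeps serving requests in every case. The cost of a tracing outage is missing spans, and the `dropped` counter makes that loss visible.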
Common Interview Mistakes
Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect distributed tracing to real systems and real problems.
Mistake 2: Not discussing trade-offs. Every design decision involving distributed tracing has trade-offs, such as sampling rate against cost. Discuss what you gain and what you give up.
Mistake 3: Overcomplicating the solution. Start with the simplest approach to distributed tracing that meets the requirements, then add complexity only when justified.
Production Checklist
- Define clear metrics for your tracing setup, such as the share of requests traced and the share of traces with missing spans
- Set up monitoring and alerting for tracing failures, such as dropped spans or an unreachable tracing backend
- Document your distributed tracing design decisions in Architecture Decision Records (ADRs)
- Test tracing-related failure scenarios in staging before production deployment
- Review and update your distributed tracing implementation quarterly as system requirements evolve
- Train new team members on the specific distributed tracing patterns used in your system
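As an example of a metric for the first checklist item, the sketch below computes the fraction of spans whose parent was never reported in the same trace. The function name `orphan_span_ratio` and the dict-based span shape are assumptions for illustration. A high ratio may point to broken context propagation, though dropped or late-arriving spans can cause it too.

```python
from collections import defaultdict


def orphan_span_ratio(spans):
    """Fraction of spans that name a parent missing from their own trace."""
    if not spans:
        return 0.0

    # Index the span IDs seen in each trace.
    ids_by_trace = defaultdict(set)
    for span in spans:
        ids_by_trace[span["trace_id"]].add(span["span_id"])

    orphans = sum(
        1
        for span in spans
        if span["parent_id"] is not None
        and span["parent_id"] not in ids_by_trace[span["trace_id"]]
    )
    return orphans / len(spans)


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None},  # root
    {"trace_id": "t1", "span_id": "b", "parent_id": "a"},   # linked to root
    {"trace_id": "t1", "span_id": "c", "parent_id": "x"},   # parent never reported
    {"trace_id": "t2", "span_id": "d", "parent_id": None},  # root of another trace
]
print(orphan_span_ratio(spans))  # 0.25
```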
Content from System-Design-Overview
Practical Implementation for .NET Developers
In a .NET application, you would typically implement this pattern using the following approach:
ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.
Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper (the .NET micro-ORM, unrelated to Google's Dapper tracing system) for read-heavy paths where EF Core's overhead matters.
Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.
Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.
Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:
Log.Information("Processing order {OrderId} for {CustomerId}", orderId, customerId);
This gives you searchable, structured logs in Azure Monitor or Seq.