SDMastery
Beginner | 11 min read | Updated 2026-06-08

Logging

Learn structured logging and log levels for distributed systems: capture meaningful context, correlate events across services, and build queryable logs.

Figure: High-level overview of Logging (key components and data flow)

Logging records discrete events as timestamped entries, providing the most detailed view of what happened in a system at any given moment. In distributed systems, structured logging — emitting events as machine-parseable JSON with consistent fields — transforms logs from grep-able text files into a queryable diagnostic database. Combined with log levels (DEBUG, INFO, WARN, ERROR) and centralized aggregation, logging enables engineers to search across millions of events from dozens of services to understand failures that cross service boundaries.

Aspect: Details
What it is: The practice of recording timestamped events from applications with structured fields, log levels, and centralized aggregation for debugging
When to use: Always. Every service should emit structured logs with request context; logging is the most fundamental observability signal
When NOT to use: When you need real-time numeric dashboards or alerting. Use metrics instead of trying to derive numbers from log analysis at query time
Real-world example: Stripe processes billions of structured log events daily through a centralized pipeline to debug payment processing issues within minutes
Interview tip: Mention structured logging specifically and explain why unstructured text logs fail at scale; this shows you understand production realities
Common mistake: Using string concatenation for log messages instead of structured fields. 'User 123 failed' can be grepped but not reliably filtered or aggregated, unlike structured key-value pairs such as userId=123, event=auth_failure
Key tradeoff: Verbosity vs. cost. Detailed logging aids debugging but generates massive storage and ingestion costs at scale

Why This Matters

Logs are the most detailed observability signal: they capture the exact sequence of events, the specific data involved, and the error messages that explain failures. In a monolith, you might grep a single log file. In a distributed system with 50 services, you need centralized log aggregation (ELK Stack, Datadog, Splunk) with structured fields so you can ask questions such as "show me all ERROR logs for user X in the last hour across all services". Without structured logging, this query is impractical, because unstructured text logs from different services use different formats, field names, and timestamp conventions. Structured logging with consistent schemas and correlation IDs turns logs into a diagnostic tool that scales with your architecture.
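
To make the example query concrete, here is a minimal sketch of filtering structured log lines. It assumes each line is a JSON object with `timestamp` (ISO 8601), `level`, `service`, and `userId` fields; the field names, the sample events, and the `find_errors_for_user` helper are illustrative assumptions, not a specific product's query API.

```python
import json
from datetime import datetime, timedelta, timezone


def find_errors_for_user(lines, user_id, now, window=timedelta(hours=1)):
    """Return ERROR events for one user inside the time window, across all services."""
    matches = []
    for line in lines:
        event = json.loads(line)
        age = now - datetime.fromisoformat(event["timestamp"])
        if (
            event["level"] == "ERROR"
            and event.get("userId") == user_id
            and timedelta(0) <= age <= window
        ):
            matches.append(event)
    return matches


# Hypothetical events from three services that share one schema.
sample_lines = [
    '{"timestamp": "2024-01-01T11:30:00+00:00", "level": "ERROR", "service": "payments", "userId": "u-42"}',
    '{"timestamp": "2024-01-01T11:45:00+00:00", "level": "INFO", "service": "checkout", "userId": "u-42"}',
    '{"timestamp": "2024-01-01T11:50:00+00:00", "level": "ERROR", "service": "checkout", "userId": "u-7"}',
    '{"timestamp": "2024-01-01T09:00:00+00:00", "level": "ERROR", "service": "auth", "userId": "u-42"}',
    '{"timestamp": "2024-01-01T11:55:00+00:00", "level": "ERROR", "service": "auth", "userId": "u-42"}',
]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
results = find_errors_for_user(sample_lines, "u-42", now)
print([event["service"] for event in results])
```

The filter only works because every service uses the same field names and timestamp format, which is the point of a shared schema. A real log store would answer the same question through its index rather than a linear scan.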

Figure: System architecture for Logging (how services, databases, and caches connect)

The Building Blocks

  • Log Levels: Severity classifications (TRACE, DEBUG, INFO, WARN, ERROR, FATAL) that control verbosity and enable filtering by importance during incidents
  • Structured Fields: Key-value pairs (userId, requestId, duration, statusCode) attached to every log event, enabling precise queries across billions of records
  • Log Aggregation: Centralized collection of logs from all services into a single queryable store like Elasticsearch, Loki, or CloudWatch Logs
  • Context Propagation: Automatically injecting trace IDs, span IDs, and request metadata into every log line so events can be correlated across service boundaries
  • Log Sampling: Reducing log volume by selectively logging based on criteria — always log errors, sample debug logs at 1%, increase verbosity during incidents
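
As an illustration of context propagation, the sketch below uses Python's standard `contextvars` and `logging` modules to stamp a trace ID onto every log line without passing it through each function call. The `trace_id_var` variable, the `TraceContextFilter` and `ListHandler` classes, and the `handle_request` function are hypothetical names chosen for this example; a tracing library would normally supply the ID.

```python
import contextvars
import logging

# Holds the trace ID for the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceContextFilter(logging.Filter):
    """Copy the current trace ID onto each record so formatters can use it."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


class ListHandler(logging.Handler):
    """Collect formatted lines in memory so the example is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


handler = ListHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceContextFilter())

order_logger = logging.getLogger("orders")
order_logger.setLevel(logging.INFO)
order_logger.propagate = False
order_logger.addHandler(handler)


def handle_request(trace_id):
    token = trace_id_var.set(trace_id)  # set once at the request boundary
    try:
        order_logger.info("order received")
        order_logger.info("order persisted")
    finally:
        trace_id_var.reset(token)  # do not leak the ID into the next request


handle_request("abc123")
print("\n".join(handler.lines))
```

Both log calls pick up the same trace ID even though neither mentions it, which is what lets events be correlated later.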

Under the Hood

Modern logging pipelines have three stages: emission, collection, and storage. At the emission stage, applications use logging libraries that format events as structured JSON with consistent fields. A well-structured log line includes timestamp, log level, service name, trace ID, span ID, and business-specific fields. This is fundamentally different from printf-style logging because every field is independently queryable.
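
A minimal sketch of the emission stage, using only Python's standard library: a custom formatter that renders each event as one JSON object. The `JsonFormatter` class, the convention of passing business fields under `extra={"fields": ...}`, and the field names are assumptions made for this example; production services typically use an established structured logging library instead.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Business-specific fields arrive via extra={"fields": {...}}.
        event.update(getattr(record, "fields", {}))
        return json.dumps(event)


stream = io.StringIO()  # stands in for stdout or a log file
json_handler = logging.StreamHandler(stream)
json_handler.setFormatter(JsonFormatter(service="checkout"))

checkout_logger = logging.getLogger("checkout")
checkout_logger.setLevel(logging.INFO)
checkout_logger.propagate = False
checkout_logger.addHandler(json_handler)

checkout_logger.error(
    "payment declined",
    extra={"fields": {"traceId": "abc123", "spanId": "s-1", "userId": "u-42", "statusCode": 402}},
)
print(stream.getvalue().strip())
```

Because `statusCode` and `userId` are separate JSON keys rather than text inside the message, a log store can filter or aggregate on them directly.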

Figure: How Logging works step by step

Collection is handled by log agents (Fluentd, Fluent Bit, Logstash, Vector) running on each host. These agents tail log files or receive logs over the network, parse and enrich them (adding hostname, environment, region), and forward them to a centralized store. In Kubernetes, sidecar or DaemonSet collectors automatically capture stdout from every pod.
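
The parse-and-enrich step can be sketched in a few lines. This is a simplified stand-in for what an agent does, not the configuration or API of Fluentd, Fluent Bit, Logstash, or Vector; the `enrich` function and the metadata values are hypothetical.

```python
import json


def enrich(raw_line, host_metadata):
    """Parse one log line and attach host-level fields, as a collection agent might."""
    try:
        event = json.loads(raw_line)
        if not isinstance(event, dict):
            raise ValueError("log line is not a JSON object")
    except ValueError:
        # Keep unparseable lines instead of dropping them, and flag them for review.
        event = {"message": raw_line.rstrip("\n"), "parse_error": True}
    # Application fields win if a key appears in both dictionaries.
    return {**host_metadata, **event}


host_metadata = {"hostname": "web-1", "environment": "prod", "region": "eu-west-1"}
print(enrich('{"level": "INFO", "message": "ok"}', host_metadata))
print(enrich("plain text line\n", host_metadata))
```

Enriching at the agent means services do not each need to know their hostname or region, and the added fields are consistent across every service on the host.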

Storage and querying are where the value is realized. Elasticsearch powers keyword search with inverted indexes. Grafana Loki takes a different approach, indexing only labels and compressing log content, which makes it much cheaper for high-volume environments. The critical design decision is retention: keeping 30 days of DEBUG logs at scale is prohibitively expensive. Production systems typically retain ERROR logs for months, INFO logs for weeks, and DEBUG logs for days, with the ability to dynamically increase verbosity for specific services during active incidents.
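
A level-based retention policy can be expressed as a small lookup. The specific durations below (90, 30, 14, and 3 days) and the `is_expired` helper are assumptions chosen to match the "months, weeks, days" guidance, not recommended values.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention windows per level; tune these to your own cost and compliance needs.
RETENTION = {
    "ERROR": timedelta(days=90),
    "WARN": timedelta(days=30),
    "INFO": timedelta(days=14),
    "DEBUG": timedelta(days=3),
}
DEFAULT_RETENTION = timedelta(days=14)


def is_expired(level, timestamp, now):
    """Return True when an event is older than the retention window for its level."""
    return now - timestamp > RETENTION.get(level, DEFAULT_RETENTION)


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(is_expired("ERROR", now - timedelta(days=60), now))
print(is_expired("DEBUG", now - timedelta(days=5), now))
```

In practice this logic usually lives in the log store's lifecycle or retention settings rather than in application code.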

How Companies Actually Do This

Stripe: Uses structured logging with request IDs in every event, enabling support engineers to trace a single payment through dozens of internal services within seconds.

Figure: Comparing key aspects of Logging (approaches, tradeoffs, and when to use each)

Netflix: Processes petabytes of logs daily through its centralized logging pipeline, using automated pattern detection to identify novel error types before they escalate to incidents.

Datadog: Pioneered log analytics with automatic pattern clustering, grouping millions of similar log lines into patterns and surfacing anomalous new patterns that may indicate emerging issues.

Common Pitfalls

  1. Logging sensitive data (passwords, tokens, PII) in plaintext — always redact or mask sensitive fields to comply with security and privacy requirements
  2. Using unstructured string messages ('Error processing order for user John') instead of structured fields (event=order_error, userId=john) makes searching impossible at scale
  3. Setting all services to DEBUG level in production — the volume overwhelms storage and ingestion systems, increasing costs 10-100x without proportional debugging value
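
For the first pitfall, one common mitigation is to redact known sensitive keys before an event is emitted. The sketch below assumes events are dictionaries; the `SENSITIVE_KEYS` set and the `redact` function are illustrative and deliberately simple.

```python
# Assumed list of sensitive field names; extend it to match your own schema.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}


def redact(fields):
    """Return a copy of the fields with sensitive values masked, including nested dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {
    "userId": "u-42",
    "password": "hunter2",
    "headers": {"Authorization": "Bearer abc", "accept": "application/json"},
}
print(redact(event))
```

Key-based redaction does not catch secrets embedded in free-text messages, so it complements the habit of keeping sensitive values out of log calls rather than replacing it.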

Figure: Data flow through Logging

Interview Questions Worth Practicing

  1. How does structured logging differ from traditional text logging and why does it matter at scale?
  2. How would you design a log aggregation pipeline for a microservices system with 100 services?
  3. What log levels would you emit for different types of events, and how would you manage log volume in production?

The Tradeoffs

  • Verbosity vs. Cost: More detailed logs help debugging but sharply increase storage, ingestion, and query costs at high scale
  • Structure vs. Developer Effort: Structured logging requires discipline and conventions across teams but pays off enormously in queryability
  • Centralized vs. Local: Centralized aggregation enables cross-service queries but introduces a critical dependency and network overhead
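
Sampling is one lever on the verbosity-versus-cost tradeoff. The sketch below is a standard-library `logging.Filter` that keeps everything above DEBUG and passes only a fraction of DEBUG records. The `SamplingFilter` class and its `debug_rate` parameter are hypothetical names, and keeping all INFO records is an assumption of this example.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every record above DEBUG; keep DEBUG records with probability debug_rate."""

    def __init__(self, debug_rate=0.01, rng=None):
        super().__init__()
        self.debug_rate = debug_rate
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True  # INFO, WARNING, ERROR, and above are never sampled away
        return self.rng.random() < self.debug_rate


sampler = SamplingFilter(debug_rate=0.01, rng=random.Random(7))
debug_record = logging.makeLogRecord({"levelno": logging.DEBUG, "levelname": "DEBUG"})
kept = sum(sampler.filter(debug_record) for _ in range(10_000))
print(f"kept {kept} of 10000 DEBUG records")

# During an incident, verbosity can be raised without redeploying the service.
sampler.debug_rate = 1.0
```

Attach the filter to a handler with `handler.addFilter(sampler)`. Because `debug_rate` is a plain attribute, an admin endpoint or configuration watcher could change it at runtime.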

Figure: Key components of Logging (each building block and its responsibility)

How to Explain This in an Interview

Here is how I would explain Logging in a system design interview:

Logging captures discrete timestamped events from applications. In distributed systems, structured logging — emitting JSON with consistent fields like userId, requestId, statusCode — is essential because it transforms logs into a queryable database across services. I would use log levels strategically: ERROR for failures needing attention, WARN for degradation, INFO for key business events, DEBUG for troubleshooting detail. The pipeline involves emission via a structured logging library, collection via agents like Fluentd, and storage in Elasticsearch or Loki. I always include trace IDs in log context so I can correlate logs with distributed traces. Critical considerations include PII redaction, retention policies by level, and dynamic verbosity control during incidents.

Figure: Interview tips for Logging (key points to mention and mistakes to avoid)

The Real-World Incident That Made This Famous

Logging became a priority at large tech companies through repeated production incidents rather than a single event. When systems handle millions of users, gaps in logging (missing context, inconsistent formats, or logs that are unavailable when they are needed) slow diagnosis and prolong outages, which costs revenue and erodes user trust. Companies like Netflix, Google, Amazon, and Meta have invested heavily in logging infrastructure for this reason.

The key lesson from these incidents: Logging is not just a theoretical concept. It is a practical skill that separates engineers who build resilient systems from those who build fragile ones. Logging decisions that are made incorrectly or overlooked during the initial architecture review often determine how quickly a later outage can be diagnosed.

Figure: When to use Logging (and when alternative approaches are better)

How Senior Engineers Think About This

Senior engineers approach Logging differently from textbook definitions. Instead of memorizing rules, they build mental models. They ask: "What problem does Logging solve? When does it fail? What are the alternatives?" This problem-first thinking leads to better design decisions because every system has unique constraints.

When evaluating Logging in a system design context, experienced engineers consider the failure modes first. What happens when this component goes down? How does the system degrade? Is the degradation graceful or catastrophic? These questions reveal more about your understanding than any textbook definition.

The key difference between junior and senior engineers when it comes to Logging: juniors focus on the happy path, while seniors design for what happens when things go wrong. They consider operational cost, team expertise, monitoring requirements, and how the decision will look six months from now when traffic has grown 10x.

Figure: Advantages and disadvantages of Logging

Common Interview Mistakes

Mistake 1: Giving a textbook definition without context. Interviewers want to see you connect Logging to real systems and real problems. Instead of reciting definitions, explain when and why you would use Logging in the system you are designing.

Mistake 2: Not discussing trade-offs. Every design decision involving Logging has trade-offs. Discuss what you gain and what you give up. Acknowledge the downsides and explain why the benefits outweigh them for your specific use case.

Mistake 3: Overcomplicating the solution. Start with the simplest approach to Logging that meets the requirements, then add complexity only when justified. Many candidates jump to complex implementations when a simpler solution would work perfectly.

Figure: Real-world examples of Logging

Production Checklist

  • Define clear metrics for measuring the effectiveness of your Logging implementation
  • Set up monitoring and alerting that specifically tracks Logging-related failures
  • Document your Logging design decisions in Architecture Decision Records (ADRs)
  • Test failure scenarios related to Logging in staging before production deployment
  • Review and update your Logging implementation quarterly as system requirements evolve
  • Train new team members on the specific Logging patterns used in your system
  • Establish runbooks for common Logging-related incidents and recovery procedures

Practical Implementation for .NET Developers

In .NET, logging is built around the Microsoft.Extensions.Logging abstraction. ILogger<T> combined with LoggerMessage.Define provides high-performance structured logging, and the [LoggerMessage] attribute adds compile-time source generation (available since .NET 6). Serilog is one of the most popular third-party libraries, offering structured logging sinks for Elasticsearch (Serilog.Sinks.Elasticsearch), Seq, and console JSON output. For OpenTelemetry integration, the OpenTelemetry.Exporter.OpenTelemetryProtocol package exports logs over OTLP with trace context attached. NLog and log4net remain options, but Serilog's structured-first approach aligns best with modern observability. ASP.NET Core logs the start and completion of each request by default; fuller request/response logging is available through the opt-in HTTP logging middleware (UseHttpLogging).

ASP.NET Core setup: Create a service class that encapsulates the logic, register it with dependency injection, and inject it into your controllers or minimal API endpoints. The built-in DI container handles lifecycle management.

Entity Framework Core: For database interactions, EF Core provides the ORM layer. Use migrations for schema management and raw SQL for performance-critical queries. Consider Dapper for read-heavy paths where EF Core overhead matters.

Azure integration: If deploying to Azure, leverage managed services — Azure Cache for Redis, Azure SQL, Azure Service Bus, Azure Cosmos DB. These eliminate operational overhead and provide built-in monitoring through Application Insights.

Testing: Use xUnit with Testcontainers for integration tests that spin up real databases in Docker. Mock external dependencies with NSubstitute. The WebApplicationFactory class lets you test your entire HTTP pipeline in-process.

Monitoring: Add Application Insights telemetry to track request latency, dependency calls, and custom metrics. Use structured logging with Serilog to make production debugging possible:

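```csharp
// {Operation} and {ResourceId} are captured as named structured properties, not just formatted into the text.
Log.Information("Processing {Operation} for {ResourceId}", operation, resourceId);
```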

With a matching sink configured, this gives you searchable, structured logs in Azure Monitor or Seq.