System Design: Distributed Tracing System (Jaeger/Zipkin-style)
Distributed tracing tracks a single request as it propagates through microservices, allowing engineers to identify latency bottlenecks, errors, and service dependencies. It is foundational to observability and is asked in senior interviews at Uber (Jaeger’s origin), Netflix, Databricks, and Cloudflare.
Core Concepts
- Trace: the end-to-end journey of a single request. Identified by a globally unique trace_id.
- Span: a single unit of work within a trace (e.g., “database query”, “HTTP call to payment service”). Has a span_id, parent_span_id, start_time, duration, and key-value tags.
- Context Propagation: the trace_id and span_id are forwarded between services via HTTP headers (the W3C traceparent header, 00-{trace_id}-{span_id}-01, where the final byte is the trace-flags field and 01 means "sampled") or message metadata.
Data Model
from dataclasses import dataclass, field
from typing import Optional
import time
import uuid

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    service_name: str
    operation: str
    start_time_us: int          # microseconds since epoch
    duration_us: int = 0        # set on span.finish()
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)  # timed events

    def finish(self) -> None:
        self.duration_us = int(time.time() * 1e6) - self.start_time_us

    def set_tag(self, key: str, value) -> "Span":
        self.tags[key] = value
        return self

    def log(self, event: str, **fields) -> "Span":
        self.logs.append({"timestamp_us": int(time.time() * 1e6),
                          "event": event, **fields})
        return self

class Tracer:
    def __init__(self, service_name: str, reporter):
        self.service_name = service_name
        self.reporter = reporter

    def start_span(self, operation: str,
                   trace_id: Optional[str] = None,
                   parent_span_id: Optional[str] = None) -> Span:
        return Span(
            trace_id=trace_id or uuid.uuid4().hex,  # 32 hex chars, W3C-compatible
            span_id=uuid.uuid4().hex[:16],          # 16 hex chars
            parent_span_id=parent_span_id,
            service_name=self.service_name,
            operation=operation,
            start_time_us=int(time.time() * 1e6),
        )

    def finish_span(self, span: Span) -> None:
        span.finish()
        self.reporter.report(span)
Context Propagation (HTTP)
class TraceContext:
    HEADER = "traceparent"

    @staticmethod
    def inject(span: Span, headers: dict) -> None:
        """Inject trace context into outgoing HTTP headers."""
        headers[TraceContext.HEADER] = f"00-{span.trace_id}-{span.span_id}-01"

    @staticmethod
    def extract(headers: dict) -> Optional[tuple[str, str]]:
        """Extract (trace_id, parent_span_id) from incoming headers."""
        tp = headers.get(TraceContext.HEADER, "")
        parts = tp.split("-")
        if len(parts) == 4:
            return parts[1], parts[2]  # trace_id, parent_span_id
        return None
# Usage in a service handler:
def handle_request(request):
    ctx = TraceContext.extract(request.headers)
    trace_id, parent_id = ctx if ctx else (None, None)
    span = tracer.start_span("handle_checkout",
                             trace_id=trace_id,
                             parent_span_id=parent_id)
    span.set_tag("user_id", request.user_id)
    # ... do work, call downstream services with injected context ...
    tracer.finish_span(span)
Architecture
Instrumentation Layer
- SDKs in each service language (Python, Java, Go) intercept HTTP calls, DB queries, and message sends to automatically create spans.
- Auto-instrumentation via monkey-patching (OpenTelemetry SDK patches requests, psycopg2, kafka-python, etc.).
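Auto-instrumentation boils down to wrapping a library's entry points so every call produces a span. A minimal, self-contained sketch of the monkey-patching pattern (patching json.dumps stands in for patching requests.Session.request; the span dict loosely mirrors the data model above — real OpenTelemetry instrumentation is far more thorough):

```python
import functools
import json
import time
import uuid

def traced(operation, spans):
    """Wrap a callable so each invocation records a finished span."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {"trace_id": uuid.uuid4().hex,
                    "operation": operation,
                    "start_time_us": int(time.time() * 1e6)}
            try:
                return fn(*args, **kwargs)
            finally:
                span["duration_us"] = int(time.time() * 1e6) - span["start_time_us"]
                spans.append(span)
        return wrapper
    return deco

# Monkey-patching replaces a library attribute with the wrapped version:
collected = []
json.dumps = traced("json.dumps", collected)(json.dumps)
json.dumps({"a": 1})
# collected now holds one finished span for the json.dumps call
```

The same wrapper applied to an HTTP client's request method would also call TraceContext.inject on the outgoing headers before delegating.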
Span Reporting Pipeline
- In-process buffer: finished spans are queued in-memory (max 1000 spans). Overflow is sampled away.
- Async reporter: a background thread batches spans and sends UDP/HTTP to the Collector every 100ms.
- Collector: receives spans, validates, and writes to Kafka for async processing.
- Ingestion workers: consume from Kafka, normalise, and write to Cassandra (optimised for time-series wide rows).
- Query service: serves trace lookups and dependency graphs from Cassandra.
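The first two pipeline stages (bounded in-process buffer plus a background flusher) can be sketched as follows. This is an illustrative skeleton, not Jaeger's actual reporter: send_batch stands in for the UDP/HTTP call to the collector, and the buffer size and flush interval mirror the figures above:

```python
import queue
import threading
import time

class AsyncReporter:
    """Finished spans go into a bounded queue; a daemon thread drains
    them in batches on a fixed interval. Overflow spans are dropped,
    which is the 'sampled away' behaviour described above."""

    def __init__(self, send_batch, max_buffer=1000, flush_interval=0.1):
        self.q = queue.Queue(maxsize=max_buffer)
        self.send_batch = send_batch
        self.flush_interval = flush_interval
        self.dropped = 0
        threading.Thread(target=self._run, daemon=True).start()

    def report(self, span):
        try:
            self.q.put_nowait(span)      # never blocks the request path
        except queue.Full:
            self.dropped += 1            # buffer overflow: drop the span

    def _run(self):
        while True:
            time.sleep(self.flush_interval)
            self.flush()

    def flush(self):
        batch = []
        while len(batch) < 100:          # cap batch size per send
            try:
                batch.append(self.q.get_nowait())
            except queue.Empty:
                break
        if batch:
            self.send_batch(batch)
```

The key design point is that report() never blocks application threads: tracing degrades (drops spans) rather than adding latency to requests.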
Storage Schema (Cassandra)
-- Traces by trace_id (for trace detail view)
CREATE TABLE spans_by_trace (
    trace_id   text,
    span_id    text,
    parent_id  text,
    service    text,
    operation  text,
    start_us   bigint,
    duration   bigint,
    tags       map<text, text>,
    PRIMARY KEY (trace_id, start_us, span_id)
) WITH CLUSTERING ORDER BY (start_us ASC);

-- Service operation index (for search by service+operation)
CREATE TABLE traces_by_service (
    service    text,
    operation  text,
    start_date text,   -- partition by date to bound row width
    trace_id   text,
    duration   bigint,
    PRIMARY KEY ((service, operation, start_date), duration, trace_id)
) WITH CLUSTERING ORDER BY (duration DESC, trace_id ASC);  -- slowest traces first
Sampling Strategies
Tracing 100% of requests is too expensive at scale. Common sampling strategies:
| Strategy | How it works | Pros/Cons |
|---|---|---|
| Head-based sampling | Decision made at trace root; propagated to all services | Simple; misses interesting slow/errored requests |
| Rate-based (1%) | Keep 1 in 100 traces uniformly | Low overhead; misses rare errors |
| Adaptive / dynamic | Rate auto-adjusts per operation to maintain target QPS | Fair representation; more complex |
| Tail-based sampling | Collect all spans; sample-in after trace completes if slow/errored | Never misses errors; requires full buffering |
Jaeger uses head-based sampling, with an adaptive mode that tunes per-operation rates. Production systems often combine approaches: always sample errors and high-latency traces (tail-based over a short window), plus a low uniform rate (e.g. 0.1%) for baseline coverage.
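A head-based decision must come out the same in every service that sees the trace. One common trick for probabilistic samplers is to hash the trace_id deterministically into [0, 1), so no coordination or propagated flag is strictly required. A sketch, assuming the 32-hex-char trace IDs generated above:

```python
def should_sample(trace_id: str, rate: float = 0.001) -> bool:
    """Deterministic head-based sampling: map the first 8 hex chars of
    the trace_id to a bucket in [0, 1) and keep the trace if the bucket
    falls under the sampling rate. Every service computes the same
    answer for the same trace_id."""
    bucket = int(trace_id[:8], 16) / 0x100000000   # 2^32 -> [0, 1)
    return bucket < rate
```

In practice the root service's decision is also carried in the traceparent trace-flags byte, so downstream services can simply honour it.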
Interview Extensions
How does distributed tracing differ from logging?
Logs record discrete events at a point in time with no structural relationship between services. Distributed tracing records causally-related spans forming a tree, enabling latency attribution across service boundaries. A single request’s logs from 5 services are hard to correlate; its trace shows the exact call graph with timing. The two complement each other: traces for latency and call graph; logs for detailed event information within a span.
How do you build a service dependency graph from traces?
For each span with a parent in a different service, emit a directed edge (parent_service → child_service). Aggregate these edges over a time window (e.g., last 5 minutes). The resulting graph shows which services call which, with edge weights (call count, error rate, p99 latency). Store as a materialized view updated by a streaming job on the span Kafka topic.
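The aggregation step can be sketched as a pure function over a batch of spans (field names follow the data model above; a real system would run this as a streaming job on the span Kafka topic and keep error-rate and latency stats per edge as well):

```python
from collections import defaultdict

def dependency_edges(spans):
    """Build (caller_service -> callee_service) call counts from a span
    batch. Each span is a dict with span_id, parent_span_id, and
    service_name keys. An edge is emitted only when parent and child
    belong to different services."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(int)
    for s in spans:
        parent = by_id.get(s.get("parent_span_id"))
        if parent and parent["service_name"] != s["service_name"]:
            edges[(parent["service_name"], s["service_name"])] += 1
    return dict(edges)
```

Note this only sees edges whose parent and child spans land in the same batch; a streaming implementation would group spans by trace_id before emitting edges.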
Asked at: Uber, Netflix, Databricks, Cloudflare, Stripe