System Design: Distributed Tracing System (Jaeger/Zipkin/OpenTelemetry)

Distributed tracing tracks a single request as it propagates through microservices, allowing engineers to identify latency bottlenecks, errors, and service dependencies. It is foundational to observability and is asked in senior interviews at Uber (Jaeger’s origin), Netflix, Databricks, and Cloudflare.

Core Concepts

  • Trace: the end-to-end journey of a single request. Identified by a globally unique trace_id.
  • Span: a single unit of work within a trace (e.g., “database query”, “HTTP call to payment service”). Has a span_id, parent_span_id, start_time, duration, and key-value tags.
  • Context Propagation: the trace_id and span_id are forwarded between services via HTTP headers (W3C traceparent: 00-{trace_id}-{span_id}-01) or message metadata.
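As a concrete illustration of the traceparent format, here is a minimal parser for the header; the regex and helper name are mine, and the W3C Trace Context spec defines the full validation rules:

```python
import re

# Shape of a version-00 traceparent: version, 32-hex trace-id,
# 16-hex parent-id (span-id of the caller), 2-hex flags.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(value: str):
    """Return (trace_id, parent_span_id, sampled) or None if malformed."""
    m = TRACEPARENT_RE.match(value)
    if not m:
        return None
    # An all-zero trace-id or parent-id is invalid per the spec.
    if set(m["trace_id"]) == {"0"} or set(m["parent_id"]) == {"0"}:
        return None
    sampled = int(m["flags"], 16) & 0x01 == 1   # low bit = sampled flag
    return m["trace_id"], m["parent_id"], sampled
```

The trailing `01` in the example above is the flags byte with the sampled bit set, which is how the head-based sampling decision (see below) rides along with the context.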

Data Model

from dataclasses import dataclass, field
from typing import Optional
import uuid, time

@dataclass
class Span:
    trace_id:      str
    span_id:       str
    parent_span_id: Optional[str]
    service_name:  str
    operation:     str
    start_time_us: int          # microseconds since epoch
    duration_us:   int = 0      # set on span.finish()
    tags:          dict = field(default_factory=dict)
    logs:          list = field(default_factory=list)  # timed events

    def finish(self) -> None:
        self.duration_us = int(time.time() * 1e6) - self.start_time_us

    def set_tag(self, key: str, value) -> "Span":
        self.tags[key] = value
        return self

    def log(self, event: str, **fields) -> "Span":
        self.logs.append({"timestamp_us": int(time.time() * 1e6),
                          "event": event, **fields})
        return self

class Tracer:
    def __init__(self, service_name: str, reporter):
        self.service_name = service_name
        self.reporter     = reporter

    def start_span(self, operation: str,
                   trace_id: Optional[str] = None,
                   parent_span_id: Optional[str] = None) -> Span:
        span = Span(
            trace_id       = trace_id or str(uuid.uuid4()).replace("-", ""),
            span_id        = str(uuid.uuid4()).replace("-", "")[:16],
            parent_span_id = parent_span_id,
            service_name   = self.service_name,
            operation      = operation,
            start_time_us  = int(time.time() * 1e6),
        )
        return span

    def finish_span(self, span: Span) -> None:
        span.finish()
        self.reporter.report(span)

Context Propagation (HTTP)

class TraceContext:
    HEADER = "traceparent"

    @staticmethod
    def inject(span: Span, headers: dict) -> None:
        """Inject trace context into outgoing HTTP headers."""
        headers[TraceContext.HEADER] = f"00-{span.trace_id}-{span.span_id}-01"

    @staticmethod
    def extract(headers: dict) -> Optional[tuple[str, str]]:
        """Extract (trace_id, parent_span_id) from incoming headers."""
        tp = headers.get(TraceContext.HEADER, "")
        parts = tp.split("-")
        if len(parts) == 4:
            return parts[1], parts[2]   # trace_id, parent_span_id
        return None

# Usage in a service handler:
def handle_request(request):
    ctx      = TraceContext.extract(request.headers)
    trace_id, parent_id = ctx if ctx else (None, None)
    span     = tracer.start_span("handle_checkout",
                                 trace_id=trace_id,
                                 parent_span_id=parent_id)
    span.set_tag("user_id", request.user_id)

    # ... do work, call downstream services with injected context ...

    tracer.finish_span(span)
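The elided downstream call amounts to starting a child span that shares the trace_id and injecting its ids into the outgoing request. A dict-based sketch of those two steps (helper names are hypothetical, independent of the classes above):

```python
import time, uuid

def start_child_span(parent: dict, operation: str) -> dict:
    """Derive a child span from its parent: same trace_id, fresh span_id."""
    return {
        "trace_id":       parent["trace_id"],     # shared by the whole trace
        "span_id":        uuid.uuid4().hex[:16],  # new id for this hop
        "parent_span_id": parent["span_id"],
        "operation":      operation,
        "start_time_us":  int(time.time() * 1e6),
    }

def outgoing_headers(span: dict) -> dict:
    """W3C traceparent header to attach to the downstream HTTP request."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-01"}
```

The downstream service then extracts these values and repeats the cycle, which is what produces the span tree.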

Architecture

Instrumentation Layer

  • SDKs in each service language (Python, Java, Go) intercept HTTP calls, DB queries, and message sends to automatically create spans.
  • Auto-instrumentation via monkey-patching (OpenTelemetry SDK patches requests, psycopg2, kafka-python, etc.).
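The effect of such patching can be sketched as a generic wrapper that times a callable and reports a span on every invocation, including failures. This is a simplified model of what the patches do, not the OpenTelemetry API:

```python
import functools, time

def traced(operation: str, report):
    """Wrap a callable so each call produces a finished span dict,
    handed to `report` (e.g. the async reporter's enqueue method)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            span = {"operation": operation, "tags": {}}
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                span["tags"]["error"] = True            # mark failed spans
                span["tags"]["error.message"] = str(exc)
                raise                                    # never swallow the error
            finally:
                span["duration_us"] = int((time.time() - start) * 1e6)
                report(span)                             # report even on failure
        return wrapper
    return decorator
```

Real auto-instrumentation applies essentially this wrapper to library entry points (`requests.Session.request`, DB cursor `execute`, producer `send`) at import time.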

Span Reporting Pipeline

  1. In-process buffer: finished spans are queued in memory (e.g., max 1,000 spans); on overflow, spans are dropped rather than blocking the request path.
  2. Async reporter: a background thread batches spans and sends UDP/HTTP to the Collector every 100ms.
  3. Collector: receives spans, validates, and writes to Kafka for async processing.
  4. Ingestion workers: consume from Kafka, normalise, and write to Cassandra (optimised for time-series wide rows).
  5. Query service: serves trace lookups and dependency graphs from Cassandra.
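Steps 1-2 can be sketched as a bounded queue plus a background flusher. The buffer size and interval follow the numbers above; the transport callable (`send_batch`) stands in for the UDP/HTTP client and is an assumption:

```python
import queue, threading

class AsyncReporter:
    """Bounded in-process buffer with a periodic background flush."""

    def __init__(self, send_batch, max_buffer=1000, flush_interval_s=0.1):
        self._q = queue.Queue(maxsize=max_buffer)
        self._send_batch = send_batch          # e.g. UDP/HTTP client callable
        self._interval = flush_interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def report(self, span) -> bool:
        try:
            self._q.put_nowait(span)           # never block the request path
            return True
        except queue.Full:
            return False                       # overflow: span is dropped

    def _run(self):
        while not self._stop.wait(self._interval):
            self._flush()
        self._flush()                          # final drain on shutdown

    def _flush(self):
        batch = []
        while not self._q.empty():
            batch.append(self._q.get_nowait())
        if batch:
            self._send_batch(batch)

    def close(self):
        self._stop.set()
        self._thread.join()
```

The key design point is that `report` is non-blocking: tracing overhead in the request path is a queue insert, and everything slow (batching, serialisation, network) happens on the background thread.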

Storage Schema (Cassandra)

-- Traces by trace_id (for trace detail view)
CREATE TABLE spans_by_trace (
    trace_id   text,
    span_id    text,
    parent_id  text,
    service    text,
    operation  text,
    start_us   bigint,
    duration   bigint,
    tags       map<text, text>,
    PRIMARY KEY (trace_id, start_us, span_id)
) WITH CLUSTERING ORDER BY (start_us ASC);

-- Service operation index (for search by service+operation)
CREATE TABLE traces_by_service (
    service    text,
    operation  text,
    start_date text,   -- partition by date to bound row width
    trace_id   text,
    duration   bigint,
    PRIMARY KEY ((service, operation, start_date), duration, trace_id)
) WITH CLUSTERING ORDER BY (duration DESC);   -- slowest traces first
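Because traces_by_service is partitioned by day, a time-range search must fan out over one partition per day in the range. A sketch of the key computation (helper names are mine, not Jaeger's):

```python
from datetime import datetime, timedelta, timezone

def partition_key(service: str, operation: str, ts_us: int):
    """(service, operation, start_date) partition for a span timestamp."""
    day = datetime.fromtimestamp(ts_us / 1e6, tz=timezone.utc).strftime("%Y-%m-%d")
    return (service, operation, day)

def partitions_for_range(service: str, operation: str,
                         start_us: int, end_us: int):
    """All partitions a search over [start_us, end_us] must query."""
    start = datetime.fromtimestamp(start_us / 1e6, tz=timezone.utc).date()
    end = datetime.fromtimestamp(end_us / 1e6, tz=timezone.utc).date()
    return [(service, operation, (start + timedelta(days=i)).strftime("%Y-%m-%d"))
            for i in range((end - start).days + 1)]
```

Bounding each partition to one day keeps rows from growing without limit on hot operations, at the cost of this per-day fan-out (and a merge of the per-partition results) at query time.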

Sampling Strategies

Tracing every request at 100% is too expensive at scale. Sampling strategies:

  • Head-based sampling: the decision is made once at the trace root and propagated to every downstream service. Simple, but the decision is made before the outcome is known, so interesting slow or errored requests can be missed.
  • Rate-based (e.g., 1%): keep 1 in 100 traces uniformly. Low overhead; misses rare errors.
  • Adaptive / dynamic: the rate auto-adjusts per operation to maintain a target traces-per-second. Fairer representation across hot and cold operations; more complex.
  • Tail-based sampling: collect all spans and decide after the trace completes, keeping it if it was slow or errored. Never misses errors; requires buffering every trace until completion.

Jaeger uses head-based adaptive sampling. Production systems often combine: always sample errors and high-latency traces (tail-based on a short window), plus 0.1% uniform sampling for base coverage.
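A common way to implement the head-based decision without coordination is to hash the trace_id into [0, 1) so every service that sees the same trace_id agrees; the combined keep-errors-and-slow-traces policy then layers on top. A sketch, not Jaeger's actual sampler:

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Consistent probabilistic sampler: map the first 64 bits of the
    hex trace_id into [0, 1) and compare against the sampling rate."""
    bucket = int(trace_id[:16], 16) / 2**64
    return bucket < rate

def should_keep(trace_id: str, rate: float,
                is_error: bool = False,
                duration_ms: float = 0,
                slow_threshold_ms: float = 1000) -> bool:
    """Combined policy: always keep errors and slow traces,
    plus a uniform head-sampled baseline."""
    return is_error or duration_ms >= slow_threshold_ms or head_sample(trace_id, rate)
```

Deriving the decision from the trace_id rather than a random draw means a service that missed the propagated sampled flag still makes the same choice, keeping traces complete.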

Interview Extensions

How does distributed tracing differ from logging?

Logs record discrete events at a point in time with no structural relationship between services. Distributed tracing records causally-related spans forming a tree, enabling latency attribution across service boundaries. A single request’s logs from 5 services are hard to correlate; its trace shows the exact call graph with timing. The two complement each other: traces for latency and call graph; logs for detailed event information within a span.

How do you build a service dependency graph from traces?

For each span with a parent in a different service, emit a directed edge (parent_service → child_service). Aggregate these edges over a time window (e.g., last 5 minutes). The resulting graph shows which services call which, with edge weights (call count, error rate, p99 latency). Store as a materialized view updated by a streaming job on the span Kafka topic.
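The edge-extraction step can be sketched as a single pass over one window's spans; the span dicts and field names here are assumptions:

```python
from collections import Counter

def dependency_edges(spans):
    """Count cross-service parent→child edges in one window of spans.
    Each span is a dict with span_id, parent_span_id, and service."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for s in spans:
        parent = by_id.get(s.get("parent_span_id"))
        # Only a parent in a *different* service produces a dependency edge.
        if parent and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return edges
```

In production this runs as a streaming aggregation keyed on trace_id, since a span and its parent may arrive in different Kafka batches.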
