System Design: Distributed Tracing System (Jaeger/Zipkin/OpenTelemetry)

Distributed tracing tracks a single request as it propagates through microservices, allowing engineers to identify latency bottlenecks, errors, and service dependencies. It is foundational to observability and is asked in senior interviews at Uber (Jaeger’s origin), Netflix, Databricks, and Cloudflare.

Core Concepts

  • Trace: the end-to-end journey of a single request. Identified by a globally unique trace_id.
  • Span: a single unit of work within a trace (e.g., “database query”, “HTTP call to payment service”). Has a span_id, parent_span_id, start_time, duration, and key-value tags.
  • Context Propagation: the trace_id and span_id are forwarded between services via HTTP headers (W3C traceparent: 00-{trace_id}-{span_id}-01) or message metadata.
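As a concrete illustration of the traceparent format, here is a minimal parser for the header; the regex and helper name are mine, and the W3C Trace Context spec defines the full validation rules:

```python
import re

# Shape of a version-00 traceparent: version, 32-hex trace-id,
# 16-hex parent-id (span-id of the caller), 2-hex flags.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(value: str):
    """Return (trace_id, parent_span_id, sampled) or None if malformed."""
    m = TRACEPARENT_RE.match(value)
    if not m:
        return None
    # An all-zero trace-id or parent-id is invalid per the spec.
    if set(m["trace_id"]) == {"0"} or set(m["parent_id"]) == {"0"}:
        return None
    sampled = int(m["flags"], 16) & 0x01 == 1   # low bit = sampled flag
    return m["trace_id"], m["parent_id"], sampled
```

The trailing `01` in the example above is the flags byte with the sampled bit set, which is how the head-based sampling decision (see below) rides along with the context.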

Data Model

from dataclasses import dataclass, field
from typing import Optional
import uuid, time

@dataclass
class Span:
    trace_id:      str
    span_id:       str
    parent_span_id: Optional[str]
    service_name:  str
    operation:     str
    start_time_us: int          # microseconds since epoch
    duration_us:   int = 0      # set on span.finish()
    tags:          dict = field(default_factory=dict)
    logs:          list = field(default_factory=list)  # timed events

    def finish(self) -> None:
        self.duration_us = int(time.time() * 1e6) - self.start_time_us

    def set_tag(self, key: str, value) -> "Span":
        self.tags[key] = value
        return self

    def log(self, event: str, **fields) -> "Span":
        self.logs.append({"timestamp_us": int(time.time() * 1e6),
                          "event": event, **fields})
        return self

class Tracer:
    def __init__(self, service_name: str, reporter):
        self.service_name = service_name
        self.reporter     = reporter

    def start_span(self, operation: str,
                   trace_id: Optional[str] = None,
                   parent_span_id: Optional[str] = None) -> Span:
        span = Span(
            trace_id       = trace_id or str(uuid.uuid4()).replace("-", ""),
            span_id        = str(uuid.uuid4()).replace("-", "")[:16],
            parent_span_id = parent_span_id,
            service_name   = self.service_name,
            operation      = operation,
            start_time_us  = int(time.time() * 1e6),
        )
        return span

    def finish_span(self, span: Span) -> None:
        span.finish()
        self.reporter.report(span)

Context Propagation (HTTP)

class TraceContext:
    HEADER = "traceparent"

    @staticmethod
    def inject(span: Span, headers: dict) -> None:
        """Inject trace context into outgoing HTTP headers."""
        headers[TraceContext.HEADER] = f"00-{span.trace_id}-{span.span_id}-01"

    @staticmethod
    def extract(headers: dict) -> Optional[tuple[str, str]]:
        """Extract (trace_id, parent_span_id) from incoming headers."""
        tp = headers.get(TraceContext.HEADER, "")
        parts = tp.split("-")
        if len(parts) == 4:
            return parts[1], parts[2]   # trace_id, parent_span_id
        return None

# Usage in a service handler:
def handle_request(request):
    ctx      = TraceContext.extract(request.headers)
    trace_id, parent_id = ctx if ctx else (None, None)
    span     = tracer.start_span("handle_checkout",
                                 trace_id=trace_id,
                                 parent_span_id=parent_id)
    span.set_tag("user_id", request.user_id)

    # ... do work, call downstream services with injected context ...

    tracer.finish_span(span)
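The elided downstream call amounts to starting a child span that shares the trace_id and injecting its ids into the outgoing request. A dict-based sketch of those two steps (helper names are hypothetical, independent of the classes above):

```python
import time, uuid

def start_child_span(parent: dict, operation: str) -> dict:
    """Derive a child span from its parent: same trace_id, fresh span_id."""
    return {
        "trace_id":       parent["trace_id"],     # shared by the whole trace
        "span_id":        uuid.uuid4().hex[:16],  # new id for this hop
        "parent_span_id": parent["span_id"],
        "operation":      operation,
        "start_time_us":  int(time.time() * 1e6),
    }

def outgoing_headers(span: dict) -> dict:
    """W3C traceparent header to attach to the downstream HTTP request."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-01"}
```

The downstream service then extracts these values and repeats the cycle, which is what produces the span tree.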

Architecture

Instrumentation Layer

  • SDKs in each service language (Python, Java, Go) intercept HTTP calls, DB queries, and message sends to automatically create spans.
  • Auto-instrumentation via monkey-patching (OpenTelemetry SDK patches requests, psycopg2, kafka-python, etc.).
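The effect of such patching can be sketched as a generic wrapper that times a callable and reports a span on every invocation, including failures. This is a simplified model of what the patches do, not the OpenTelemetry API:

```python
import functools, time

def traced(operation: str, report):
    """Wrap a callable so each call produces a finished span dict,
    handed to `report` (e.g. the async reporter's enqueue method)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            span = {"operation": operation, "tags": {}}
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                span["tags"]["error"] = True            # mark failed spans
                span["tags"]["error.message"] = str(exc)
                raise                                    # never swallow the error
            finally:
                span["duration_us"] = int((time.time() - start) * 1e6)
                report(span)                             # report even on failure
        return wrapper
    return decorator
```

Real auto-instrumentation applies essentially this wrapper to library entry points (`requests.Session.request`, DB cursor `execute`, producer `send`) at import time.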

Span Reporting Pipeline

  1. In-process buffer: finished spans are queued in memory (e.g., max 1,000 spans); on overflow, spans are dropped rather than blocking the request path.
  2. Async reporter: a background thread batches spans and sends UDP/HTTP to the Collector every 100ms.
  3. Collector: receives spans, validates, and writes to Kafka for async processing.
  4. Ingestion workers: consume from Kafka, normalise, and write to Cassandra (optimised for time-series wide rows).
  5. Query service: serves trace lookups and dependency graphs from Cassandra.
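Steps 1-2 can be sketched as a bounded queue plus a background flusher. The buffer size and interval follow the numbers above; the transport callable (`send_batch`) stands in for the UDP/HTTP client and is an assumption:

```python
import queue, threading

class AsyncReporter:
    """Bounded in-process buffer with a periodic background flush."""

    def __init__(self, send_batch, max_buffer=1000, flush_interval_s=0.1):
        self._q = queue.Queue(maxsize=max_buffer)
        self._send_batch = send_batch          # e.g. UDP/HTTP client callable
        self._interval = flush_interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def report(self, span) -> bool:
        try:
            self._q.put_nowait(span)           # never block the request path
            return True
        except queue.Full:
            return False                       # overflow: span is dropped

    def _run(self):
        while not self._stop.wait(self._interval):
            self._flush()
        self._flush()                          # final drain on shutdown

    def _flush(self):
        batch = []
        while not self._q.empty():
            batch.append(self._q.get_nowait())
        if batch:
            self._send_batch(batch)

    def close(self):
        self._stop.set()
        self._thread.join()
```

The key design point is that `report` is non-blocking: tracing overhead in the request path is a queue insert, and everything slow (batching, serialisation, network) happens on the background thread.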

Storage Schema (Cassandra)

-- Traces by trace_id (for trace detail view)
CREATE TABLE spans_by_trace (
    trace_id   text,
    span_id    text,
    parent_id  text,
    service    text,
    operation  text,
    start_us   bigint,
    duration   bigint,
    tags       map<text, text>,
    PRIMARY KEY (trace_id, start_us, span_id)
) WITH CLUSTERING ORDER BY (start_us ASC);

-- Service operation index (for search by service+operation)
CREATE TABLE traces_by_service (
    service    text,
    operation  text,
    start_date text,   -- partition by date to bound row width
    trace_id   text,
    duration   bigint,
    PRIMARY KEY ((service, operation, start_date), duration, trace_id)
) WITH CLUSTERING ORDER BY (duration DESC);   -- slowest traces first
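Because traces_by_service is partitioned by day, a time-range search must fan out over one partition per day in the range. A sketch of the key computation (helper names are mine, not Jaeger's):

```python
from datetime import datetime, timedelta, timezone

def partition_key(service: str, operation: str, ts_us: int):
    """(service, operation, start_date) partition for a span timestamp."""
    day = datetime.fromtimestamp(ts_us / 1e6, tz=timezone.utc).strftime("%Y-%m-%d")
    return (service, operation, day)

def partitions_for_range(service: str, operation: str,
                         start_us: int, end_us: int):
    """All partitions a search over [start_us, end_us] must query."""
    start = datetime.fromtimestamp(start_us / 1e6, tz=timezone.utc).date()
    end = datetime.fromtimestamp(end_us / 1e6, tz=timezone.utc).date()
    return [(service, operation, (start + timedelta(days=i)).strftime("%Y-%m-%d"))
            for i in range((end - start).days + 1)]
```

Bounding each partition to one day keeps rows from growing without limit on hot operations, at the cost of this per-day fan-out (and a merge of the per-partition results) at query time.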

Sampling Strategies

Tracing every request at 100% is too expensive at scale. Sampling strategies:

  • Head-based sampling: the decision is made once at the trace root and propagated to every downstream service. Simple, but the decision is made before the outcome is known, so interesting slow or errored requests can be missed.
  • Rate-based (e.g., 1%): keep 1 in 100 traces uniformly. Low overhead; misses rare errors.
  • Adaptive / dynamic: the rate auto-adjusts per operation to maintain a target traces-per-second. Fairer representation across hot and cold operations; more complex.
  • Tail-based sampling: collect all spans and decide after the trace completes, keeping it if it was slow or errored. Never misses errors; requires buffering every trace until completion.

Jaeger uses head-based adaptive sampling. Production systems often combine: always sample errors and high-latency traces (tail-based on a short window), plus 0.1% uniform sampling for base coverage.
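A common way to implement the head-based decision without coordination is to hash the trace_id into [0, 1) so every service that sees the same trace_id agrees; the combined keep-errors-and-slow-traces policy then layers on top. A sketch, not Jaeger's actual sampler:

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Consistent probabilistic sampler: map the first 64 bits of the
    hex trace_id into [0, 1) and compare against the sampling rate."""
    bucket = int(trace_id[:16], 16) / 2**64
    return bucket < rate

def should_keep(trace_id: str, rate: float,
                is_error: bool = False,
                duration_ms: float = 0,
                slow_threshold_ms: float = 1000) -> bool:
    """Combined policy: always keep errors and slow traces,
    plus a uniform head-sampled baseline."""
    return is_error or duration_ms >= slow_threshold_ms or head_sample(trace_id, rate)
```

Deriving the decision from the trace_id rather than a random draw means a service that missed the propagated sampled flag still makes the same choice, keeping traces complete.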

Interview Extensions

How does distributed tracing differ from logging?

Logs record discrete events at a point in time with no structural relationship between services. Distributed tracing records causally-related spans forming a tree, enabling latency attribution across service boundaries. A single request’s logs from 5 services are hard to correlate; its trace shows the exact call graph with timing. The two complement each other: traces for latency and call graph; logs for detailed event information within a span.

How do you build a service dependency graph from traces?

For each span with a parent in a different service, emit a directed edge (parent_service → child_service). Aggregate these edges over a time window (e.g., last 5 minutes). The resulting graph shows which services call which, with edge weights (call count, error rate, p99 latency). Store as a materialized view updated by a streaming job on the span Kafka topic.
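The edge-extraction step can be sketched as a single pass over one window's spans; the span dicts and field names here are assumptions:

```python
from collections import Counter

def dependency_edges(spans):
    """Count cross-service parent→child edges in one window of spans.
    Each span is a dict with span_id, parent_span_id, and service."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for s in spans:
        parent = by_id.get(s.get("parent_span_id"))
        # Only a parent in a *different* service produces a dependency edge.
        if parent and parent["service"] != s["service"]:
            edges[(parent["service"], s["service"])] += 1
    return edges
```

In production this runs as a streaming aggregation keyed on trace_id, since a span and its parent may arrive in different Kafka batches.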
