The Three Pillars of Observability
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars: Logs — structured event records (what happened, when, with what context). Metrics — numeric measurements over time (request rate, error rate, latency, CPU). Traces — records of a request propagating across services (distributed tracing). A mature observability platform provides all three, correlated by time and request ID.
Log Aggregation Architecture
Log pipeline: Application writes structured logs (JSON) to stdout/stderr. A log collector (Fluentd, Fluent Bit, Vector) runs as a DaemonSet on each Kubernetes node, tails container logs, and ships them to a central store. The central store indexes logs for search: Elasticsearch (self-hosted), OpenSearch, or managed services (Splunk, Datadog, CloudWatch Logs). Parsing: extract structured fields from log lines (timestamp, level, service, trace_id, message). Tag with Kubernetes metadata (namespace, pod, container). The pipeline must handle backpressure — if the destination is slow, the collector buffers (on disk) rather than dropping logs.
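The parse-and-tag step above can be sketched in a few lines. This is a minimal illustration, not any specific collector's implementation; the field names (`ts`, `msg`) and the Kubernetes metadata values are assumptions for the example.

```python
import json

# Illustrative Kubernetes metadata a collector would attach per container
# (real collectors read this from the kubelet / API server).
K8S_METADATA = {"namespace": "payments", "pod": "api-7f9c", "container": "app"}

def parse_log_line(line: str) -> dict:
    """Extract structured fields from a JSON log line and tag with k8s metadata."""
    record = json.loads(line)
    event = {
        "timestamp": record.get("ts"),
        "level": record.get("level", "info"),
        "service": record.get("service"),
        "trace_id": record.get("trace_id"),
        "message": record.get("msg"),
    }
    event.update(K8S_METADATA)  # enrich with namespace/pod/container tags
    return event

line = '{"ts": "2024-01-01T00:00:00Z", "level": "error", "service": "checkout", "trace_id": "abc123", "msg": "payment failed"}'
print(parse_log_line(line))
```

In a real pipeline this function would sit between the tail stage and the buffered output stage, so a slow destination backs up the on-disk buffer rather than dropping parsed events.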
ELK stack: Elasticsearch (storage and search), Logstash (collection and transformation), Kibana (visualization). The modern variant uses Beats (lightweight collectors) instead of Logstash for collection, and Logstash only for complex transformations. Elastic Agent is the unified collector in newer versions.
Metrics Collection
Pull model (Prometheus): Prometheus scrapes /metrics endpoints from services on a configurable interval (commonly 15s). Services expose metrics in the Prometheus text format. Prometheus stores time-series data in a local TSDB. PromQL queries the data. Alertmanager handles alerts. The pull model is simple but requires service discovery to know what to scrape. Push model (StatsD, InfluxDB): services push metrics to a collector. Better for short-lived jobs that may not survive until the next scrape.
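A service's side of the pull model is just a plain-text endpoint. A minimal sketch using only the standard library (the metric name and HELP/TYPE lines are illustrative; a real service would use a client library):

```python
from http.server import BaseHTTPRequestHandler

REQUEST_COUNT = 0  # incremented by request handlers in a real service

def render_metrics() -> str:
    """Render a counter in the Prometheus text exposition format."""
    return (
        "# HELP http_requests_total Total HTTP requests served.\n"
        "# TYPE http_requests_total counter\n"
        f"http_requests_total {REQUEST_COUNT}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()
```

Prometheus discovers this endpoint via its service-discovery config and scrapes it each interval; the service never needs to know where Prometheus lives.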
Metric types: Counter (monotonically increasing: request_count). Gauge (current value: memory_usage). Histogram (bucketed distribution: request_latency_seconds with buckets at .01, .05, .1, .5, 1, 5). Summary (quantile estimates computed at the client: p50, p95, p99). Histograms are preferred over summaries — they can be aggregated across instances (sum histogram buckets across pods, then compute percentiles server-side).
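The "aggregate buckets, then compute percentiles" point is worth making concrete. A rough sketch with made-up bucket counts from two pods (Prometheus-style cumulative `le` buckets; the quantile lookup returns the bucket bound, without interpolation):

```python
# Cumulative bucket counts (count of requests with latency <= bound) per pod.
pod_a = {0.01: 50, 0.05: 80, 0.1: 90, 0.5: 98, 1.0: 100}
pod_b = {0.01: 30, 0.05: 70, 0.1: 85, 0.5: 95, 1.0: 100}

def merge(*histograms):
    """Sum cumulative bucket counts across instances; valid because buckets align."""
    merged = {}
    for h in histograms:
        for le, count in h.items():
            merged[le] = merged.get(le, 0) + count
    return merged

def percentile(buckets, q):
    """Return the smallest bucket bound whose cumulative count covers quantile q."""
    total = max(buckets.values())
    target = q * total
    for le in sorted(buckets):
        if buckets[le] >= target:
            return le

combined = merge(pod_a, pod_b)
print(percentile(combined, 0.95))  # → 0.5
```

This is exactly why summaries can't be aggregated: a p95 computed on each pod cannot be combined into a fleet-wide p95, but bucket counts can simply be summed.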
Distributed Tracing
A trace represents a single request’s journey across services. Each service creates a span (start time, end time, tags, logs). Spans are linked via parent_span_id. The trace_id propagates in HTTP headers (W3C Trace Context: traceparent header). Architecture: services emit spans to a collector (Jaeger Agent, OpenTelemetry Collector). Collector batches and exports to a trace store (Jaeger backend: Cassandra or Elasticsearch). Sampling: 100% tracing is too expensive at scale. Use head-based sampling (decision at first service: sample 1% of traces) or tail-based sampling (collect all spans, make sampling decision after the full trace is available — can sample 100% of error traces).
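Context propagation via the `traceparent` header can be sketched as follows. This is a simplified illustration of the W3C Trace Context format (`version-trace_id-span_id-flags`), not a full implementation; real services would use an OpenTelemetry SDK for this.

```python
import secrets

def make_traceparent(trace_id=None):
    """Build a W3C-style traceparent header; reuse trace_id if continuing a trace."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)                # new span for this hop
    return f"00-{trace_id}-{span_id}-01", trace_id, span_id

def parse_traceparent(header):
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "parent_span_id": span_id, "sampled": flags == "01"}

# Service A starts a trace; service B continues it from the incoming header.
header, trace_id, span_a = make_traceparent()
ctx = parse_traceparent(header)
child_header, _, span_b = make_traceparent(trace_id=ctx["trace_id"])
assert child_header.split("-")[1] == trace_id  # same trace_id across hops
```

Each hop records the incoming span ID as its span's `parent_span_id`, which is how the trace store reassembles the tree.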
Alerting
Alert on symptoms, not causes. Bad: “CPU > 80%” (cause). Good: “error rate > 1% for 5 minutes” (symptom customers feel). Alert tiers: P1 (page on-call immediately: service down, error rate spike). P2 (page in business hours: elevated latency, degraded component). P3 (ticket: trend approaching a limit). Alert fatigue: too many low-quality alerts cause on-call engineers to ignore alerts. Fix: raise thresholds, add minimum duration (“for 10 minutes”), and reduce P1 alerts to only true customer-facing issues. Dead Man’s Switch: a heartbeat alert that fires when a monitoring job stops running — catches the case where the monitoring system itself fails.
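The “for 5 minutes” clause is the key anti-fatigue mechanism, and it is easy to get wrong: the breach must be continuous, not merely recent. A minimal sketch (thresholds and timestamps are illustrative):

```python
class ErrorRateAlert:
    """Fire only when the error rate exceeds a threshold for a sustained window."""

    def __init__(self, threshold=0.01, duration_s=300):
        self.threshold = threshold      # e.g. 1% error rate
        self.duration_s = duration_s    # e.g. sustained for 5 minutes
        self.breach_start = None

    def observe(self, errors, requests, now):
        rate = errors / requests if requests else 0.0
        if rate > self.threshold:
            if self.breach_start is None:
                self.breach_start = now          # breach begins
            return (now - self.breach_start) >= self.duration_s
        self.breach_start = None                 # reset: breach must be continuous
        return False

alert = ErrorRateAlert()
assert alert.observe(2, 100, now=0) is False    # breach starts, not yet sustained
assert alert.observe(2, 100, now=300) is True   # 5 minutes of breach -> page
```

Production systems evaluate this as a query (e.g. a PromQL expression with a `for:` clause in the alert rule) rather than in application code, but the semantics are the same.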
Interview Tips
- Cardinality is the main scaling challenge for metrics. High-cardinality labels (user_id, request_id) on metrics create millions of time series. Reserve high-cardinality data for logs and traces.
- OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. Libraries emit OTel data; the OTel Collector exports to any backend (Jaeger, Prometheus, Datadog). Instrument once, switch backends without code changes.
- Log sampling: at very high volume, sample debug logs (keep 1%) while keeping all error/warning logs. Reservoir sampling ensures statistical validity.
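The log-sampling tip can be sketched as a single pass that keeps every error/warning and reservoir-samples a fixed number of the rest (the level names and sample size are illustrative):

```python
import random

def sample_logs(stream, k, rng=random):
    """Keep all error/warning logs; keep a uniform reservoir sample of k others."""
    kept, reservoir, seen = [], [], 0
    for record in stream:
        if record["level"] in ("error", "warning"):
            kept.append(record)       # never drop high-severity logs
            continue
        if seen < k:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, seen)  # item survives with probability k/(seen+1)
            if j < k:
                reservoir[j] = record
        seen += 1
    return kept + reservoir
```

Reservoir sampling gives every low-severity record an equal chance of being kept without knowing the stream length in advance, which is what makes the retained sample statistically valid.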
Asked at: Databricks, Cloudflare, Netflix, Uber, Twitter/X.