The Three Pillars of Observability
Observability lets engineers understand the internal state of a system from its external outputs. Three pillars: Logs (discrete event records — “user 123 logged in at 2:03pm”), Metrics (numerical measurements over time — “HTTP latency p99 = 450ms”), and Traces (end-to-end request journeys across microservices — “request ABC took 230ms: 10ms in API gateway, 120ms in user service, 100ms in DB”). Modern platforms combine all three for root-cause analysis: a metric alert shows latency spiked → a trace shows which service is slow → logs show the specific error in that service.
Log Aggregation Pipeline
Services write structured logs (JSON) to stdout. A log agent (Fluentd, Filebeat, Vector) on each host captures stdout, adds metadata (hostname, service name, environment), and ships to a central log store. Transport: Kafka buffers the log stream. Log store: Elasticsearch for full-text search (ELK stack), or ClickHouse for analytics queries. Index strategy: one index per day per service (logs-user-service-2025-04-17). Retention: keep 30 days in hot storage (Elasticsearch), archive to S3 Glacier for 1 year for compliance. Sampling: at high throughput (millions of log lines/sec), sample verbose logs (debug, info) — only store 1% of INFO logs but 100% of WARN and ERROR. Structured logging: enforce JSON format with required fields (request_id, user_id, service, timestamp, level) — enables fast filtered queries without regex parsing.
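A minimal sketch of the application side, assuming Python's standard logging module: a formatter that writes each record as one JSON object to stdout with the required fields, plus a filter that keeps roughly 1% of INFO records while always passing WARN and ERROR. The service name and field values are illustrative, and in practice the sampling often happens in the log agent or pipeline rather than in-process.

```python
import json
import logging
import random
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the log agent can ship it without regex parsing."""
    def format(self, record):
        return json.dumps({
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "user-service",                      # illustrative service name
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        })

class InfoSampler(logging.Filter):
    """Keep ~1% of INFO/DEBUG records; always keep WARN and above."""
    def __init__(self, rate=0.01):
        super().__init__()
        self.rate = rate
    def filter(self, record):
        return record.levelno >= logging.WARNING or random.random() < self.rate

handler = logging.StreamHandler(sys.stdout)   # write to stdout, where the log agent reads
handler.setFormatter(JsonFormatter())
handler.addFilter(InfoSampler())

logger = logging.getLogger("user-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user logged in", extra={"request_id": "req-abc", "user_id": 123})
```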
Metrics Collection and Storage
Services expose metrics via an HTTP endpoint (/metrics) in Prometheus format. A Prometheus scraper pulls metrics from each service every 15 seconds. For push-based metrics (short-lived jobs, IoT): services push to a push gateway, which Prometheus scrapes. Metrics types: Counter (monotonically increasing — request count), Gauge (can go up or down — memory usage), Histogram (buckets for distribution — latency buckets: 10ms, 50ms, 100ms, 500ms, 1s+), Summary (pre-computed quantiles — p50, p95, p99). Storage: Prometheus stores data in a local TSDB (time-series DB) for 2 weeks. Long-term storage: remote write to Thanos or Cortex (horizontally scalable Prometheus) backed by S3. Query language: PromQL — rate(http_requests_total[5m]) gives the per-second request rate over the last 5 minutes.
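A sketch of the metric types using the Python prometheus_client library, assuming Prometheus scrapes the /metrics endpoint this process exposes on port 8000; the metric names, labels, and buckets are illustrative (a Summary is omitted since a Histogram plus PromQL quantiles is the more common choice).

```python
import random
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Counter: monotonically increasing (request count). Keep labels low-cardinality.
REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["service", "endpoint", "status_code"])

# Gauge: can go up or down (in-flight requests, memory usage).
IN_FLIGHT = Gauge("http_in_flight_requests", "Requests currently being handled")

# Histogram: observations counted into latency buckets.
LATENCY = Histogram("request_duration_seconds", "Request latency in seconds",
                    buckets=[0.01, 0.05, 0.1, 0.5, 1.0])

def handle_request():
    with IN_FLIGHT.track_inprogress(), LATENCY.time():
        time.sleep(random.uniform(0.01, 0.2))                # simulated work
    REQUESTS.labels(service="user-service", endpoint="/login", status_code="200").inc()

if __name__ == "__main__":
    start_http_server(8000)                                   # exposes /metrics on :8000
    while True:
        handle_request()
```

With this histogram, histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m]))) gives the p99 latency over the last 5 minutes.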
Distributed Tracing
A trace represents a single request’s journey across all services. Each operation is a span: (trace_id, span_id, parent_span_id, service, operation, start_time, duration, tags, status). The trace_id propagates through all service calls via HTTP headers (the W3C traceparent header, or B3/X-Trace-ID-style headers in older systems) or gRPC metadata. Instrumentation: OpenTelemetry SDK auto-instruments HTTP clients and servers to create and propagate spans. Trace data is sent to a collector (OpenTelemetry Collector) which batches and forwards to a trace backend (Jaeger, Zipkin, or a commercial APM like Datadog or Honeycomb). Sampling: tracing 100% of requests is expensive (high storage and CPU overhead). Head-based sampling: sample N% of all traces at the entry point (e.g., 1% for high-throughput services). Tail-based sampling: collect all spans for every request, then discard non-interesting traces (no errors, below a latency threshold) at the collector, which captures all errors and slow traces.
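A minimal setup sketch with the OpenTelemetry Python SDK, assuming an OTLP/gRPC exporter pointed at a collector; the collector endpoint, service name, and span names are illustrative. In a real service the auto-instrumentation packages create most of these spans, and context propagation (e.g., the traceparent header) happens without hand-written code.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Register a tracer provider that batches spans and ships them to a collector.
provider = TracerProvider(resource=Resource.create({"service.name": "user-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def handle_login(user_id: int):
    # Parent span for this operation; child spans nest via the context manager.
    with tracer.start_as_current_span("handle_login") as span:
        span.set_attribute("user.id", user_id)
        with tracer.start_as_current_span("db.query"):
            pass  # real DB clients get this span from auto-instrumentation
```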
Correlation and Alerting
Correlation: every log line should include the trace_id, allowing one-click navigation from a trace span to the corresponding log lines. A metric alert says “p99 latency > 2s” → find a trace with high latency from that time window → follow to log lines in the slow service. Alerting: Prometheus Alertmanager. Alert rules are PromQL expressions defined in rule files, e.g. a HighLatency alert fires when histogram_quantile(0.99, sum by (le) (rate(request_duration_seconds_bucket[5m]))) > 2. Notification routing: send P0 alerts to PagerDuty (wakes the on-call), P1 to Slack #incidents, P2 to email. Deduplication: group alerts on the same service to prevent alert storms. Inhibition: suppress lower-severity alerts when a higher-severity one is active for the same service (database is down → suppress all downstream service errors).
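One way to stamp the trace_id onto every log line, sketched with the OpenTelemetry Python API and stdlib logging; the field name and format are illustrative, and OpenTelemetry's logging instrumentation can inject these fields automatically.

```python
import logging
from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the current trace_id (hex) to each record, or '-' if no span is active."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"))
logging.getLogger().addHandler(handler)
```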
Interview Tips
- Logs vs metrics: logs are for debugging individual events; metrics are for aggregate health. Logs are expensive at scale; metrics are cheap (fixed cardinality).
- Cardinality explosion: avoid high-cardinality labels on metrics (e.g., user_id as a label creates millions of time series). Use low-cardinality labels only (service, endpoint, status_code); see the sketch after this list.
- OpenTelemetry is the vendor-neutral standard for instrumentation — use it to stay portable across backends.
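To make the cardinality point concrete, a small sketch with the Python prometheus_client (illustrative metric and label names):

```python
from prometheus_client import Counter

# BAD: one time series per user; millions of users means millions of series.
# logins = Counter("logins_total", "Login count", ["user_id"])

# GOOD: bounded label values; put user_id in logs and trace attributes instead.
logins = Counter("logins_total", "Login count", ["service", "endpoint", "status_code"])
logins.labels(service="user-service", endpoint="/login", status_code="200").inc()
```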
Asked at: Cloudflare, Databricks, Netflix, Atlassian