The Three Pillars of Observability
Metrics: numerical measurements over time (request rate, error rate, latency percentiles, CPU usage). Low cardinality, cheap to store, good for dashboards and alerting. Best for “what is broken.”

Logs: structured or unstructured text records of discrete events. High cardinality (each event is unique), expensive to store at scale, but rich in context. Best for “why is it broken.”

Traces: a record of a request’s path through distributed services. Connects the timing and causality of spans across multiple services. Best for “where is the bottleneck.”

Modern observability correlates all three: a dashboard alert (metrics) links to relevant logs, which link to a trace of a slow or failed request. Tools: Prometheus + Grafana (metrics), ELK/OpenSearch or Loki (logs), Jaeger or Zipkin (traces), OpenTelemetry (instrumentation standard for all three).
Metrics Pipeline: Prometheus Architecture
Prometheus uses a pull model: it scrapes metrics from an HTTP /metrics endpoint on each service at regular intervals (typically 15-30s). Services expose metrics using a client library (prometheus-client for Python, Micrometer for Java).

Metric types:
Counter: monotonically increasing (requests_total, errors_total).
Gauge: arbitrary value that can go up or down (active_connections, memory_used_bytes).
Histogram: samples observations into configurable buckets (request_duration_seconds). Used to compute percentiles (p50, p95, p99) via histogram_quantile(); see the query sketch after the code below.
Summary: like a histogram but computes quantiles on the client side, so quantiles cannot be meaningfully aggregated across instances.

Storage: Prometheus stores time-series data in its own TSDB (time-series database) with high compression. For long-term retention (> 2 weeks): remote write to Thanos, Cortex, or VictoriaMetrics for horizontal scaling and multi-cluster aggregation.
from prometheus_client import Counter, Histogram, start_http_server
import time

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

def handle_request(method, endpoint):
    start = time.time()
    try:
        response = process(method, endpoint)  # application logic, not shown
        REQUEST_COUNT.labels(method, endpoint, response.status_code).inc()
        return response
    finally:
        # Runs on success and on error, so latency is recorded either way
        REQUEST_LATENCY.labels(method, endpoint).observe(time.time() - start)

start_http_server(8000)  # expose /metrics for Prometheus to scrape
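At query time, percentiles come from the histogram's buckets. A minimal PromQL sketch, assuming the http_request_duration_seconds histogram above and a 5-minute rate window:

histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
)

The le label must survive the aggregation: histogram_quantile() interpolates the p99 from the cumulative bucket counts that le encodes.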
Log Pipeline: Structured Logging and ELK
Log pipeline: application emits structured JSON logs → Fluentd/Filebeat agent (runs as a DaemonSet on each K8s node) ships logs → Kafka (buffer, backpressure) → Logstash/OpenSearch ingestion pipeline → Elasticsearch/OpenSearch index.

Structured logging: always log JSON with consistent fields: timestamp (ISO 8601 UTC), service, level, trace_id, span_id, request_id, user_id, message, and any domain fields. Avoid unstructured text: it cannot be queried efficiently (see the sketch below).

Index lifecycle management (ILM): hot indices (last 7 days) on SSD with full indexing; warm indices (7-30 days) on HDD, fewer replicas; cold/frozen indices (30-90 days) on object storage; delete after 90 days (or per retention policy).

Cardinality: never index high-cardinality fields (user IDs, raw URLs) as keyword; this explodes the inverted index. Store them as text or exclude them from indexing.
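To make the "emits structured JSON logs" step concrete, a minimal sketch using only Python's standard library. The field names follow the list above; the service name and ID values are hypothetical, and production services often use python-json-logger or structlog instead:

import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    # Render each log record as one JSON object per line
    def format(self, record):
        entry = {
            'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime(record.created)),
            'service': 'checkout-service',  # hypothetical service name
            'level': record.levelname,
            'message': record.getMessage(),
        }
        # Correlation fields, if the caller attached them via extra=...
        for field in ('trace_id', 'span_id', 'request_id', 'user_id'):
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger('checkout')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info('payment authorized', extra={'trace_id': 'abc123', 'request_id': 'r-42'})  # hypothetical IDs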
Distributed Tracing: OpenTelemetry and Jaeger
OpenTelemetry (OTel) is the vendor-neutral standard for distributed tracing. Every request gets a trace_id (a 128-bit random identifier). Each service operation creates a span with: span_id, parent_span_id (forms the tree), operation_name, start_time, duration, status (OK/ERROR), and attributes (key-value metadata). The trace context (trace_id, parent span_id) is propagated via HTTP headers (W3C Trace Context: traceparent, tracestate) and message queue headers.

Sampling: tracing 100% of requests at high volume is prohibitively expensive. Strategies:
Head-based sampling: decide at request entry (e.g. sample 1% of requests). Cheap, but the trace you need may be dropped.
Tail-based sampling: record all spans, decide at trace completion whether to persist (e.g. keep 100% of error traces plus 1% of successful ones). Catches every failure, at the cost of buffering spans until the trace completes.

Jaeger and Zipkin store traces; Grafana Tempo is a cost-efficient trace backend built on object storage.
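A minimal instrumentation sketch with the OpenTelemetry Python SDK (opentelemetry-api + opentelemetry-sdk), assuming 1% head-based sampling and a console exporter standing in for a real backend (in production, an OTLP exporter pointed at Jaeger or Tempo):

from opentelemetry import trace
from opentelemetry.propagate import inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Head-based sampling: the keep/drop decision is made at the root span
provider = TracerProvider(sampler=TraceIdRatioBased(0.01))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer('checkout-service')  # hypothetical service name

with tracer.start_as_current_span('checkout') as span:
    span.set_attribute('order.id', 'o-42')  # illustrative attribute

    # Propagate context downstream: inject() writes the W3C traceparent
    # header into the carrier dict for the outgoing request
    headers = {}
    inject(headers)
    # http_client.post(url, headers=headers)  # downstream call, not shown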
Alerting and On-Call
Alert design principles: alert on symptoms (high error rate, high latency, low availability) rather than causes (CPU high, disk full); symptoms are what users experience.

Four golden signals (Google SRE): Latency (p99 request duration), Traffic (requests per second), Errors (error rate), Saturation (resource utilization).

Alert thresholds: use multi-window, multi-burn-rate alerts (SLO-based alerting). Example: 5% of the error budget burned in 1 hour → page immediately; 10% burned in 6 hours → ticket (see the rule sketch below). This balances sensitivity (catch real incidents early) with specificity (avoid alert fatigue from transient spikes).

Alert routing: PagerDuty/OpsGenie route alerts to on-call engineers based on service ownership.

Runbooks: every alert links to a runbook with diagnosis steps, common causes, and resolution procedures. A well-maintained runbook significantly reduces mean time to recovery (MTTR).
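A sketch of the fast-burn page as a Prometheus alerting rule, assuming a 99.9% availability SLO over 30 days and the http_requests_total counter from the metrics section. Burning 5% of the budget in 1 hour implies a burn rate of 0.05 × 720 ≈ 36, so the rule fires when the error ratio exceeds 36 × (1 - 0.999) = 0.036 over both a long and a short window (the short window stops the page once the burn has actually ended):

groups:
  - name: slo-burn-rate
    rules:
      - alert: ErrorBudgetBurnFast
        # 5% of a 30-day budget in 1h => burn rate 36; threshold 36 * (1 - 0.999)
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (36 * 0.001)
          and
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (36 * 0.001)
        labels:
          severity: page
        annotations:
          summary: Fast error-budget burn (5% of 30-day budget in 1h)
          runbook_url: https://runbooks.example.com/error-budget-burn  # hypothetical URL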