The Three Pillars of Observability
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars:
- Logs: structured event records (what happened, when, with what context).
- Metrics: numeric measurements over time (request rate, error rate, latency, CPU).
- Traces: records of a request propagating across services (distributed tracing).
A mature observability platform provides all three, correlated by time and request ID.
Log Aggregation Architecture
Log pipeline: Application writes structured logs (JSON) to stdout/stderr. A log collector (Fluentd, Fluent Bit, Vector) runs as a DaemonSet on each Kubernetes node, tails container logs, and ships them to a central store. The central store indexes logs for search: Elasticsearch (self-hosted), OpenSearch, or managed services (Splunk, Datadog, CloudWatch Logs). Parsing: extract structured fields from log lines (timestamp, level, service, trace_id, message). Tag with Kubernetes metadata (namespace, pod, container). The pipeline must handle backpressure — if the destination is slow, the collector buffers (on disk) rather than dropping logs.
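As a sketch of the first stage, here is a minimal JSON formatter for Python's standard logging module; the service name and the exact field set are illustrative, not a standard schema:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line, the shape a
    collector like Fluent Bit or Vector expects to tail from stdout."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# trace_id passed via `extra` lands on the record, so it ends up in the JSON line.
logger.info("order placed", extra={"trace_id": "abc123"})
```

Because every line is self-describing JSON, the collector's parsing stage reduces to a JSON decode plus Kubernetes metadata enrichment.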
ELK stack: Elasticsearch (storage and search), Logstash (collection and transformation), Kibana (visualization). The modern variant uses Beats (lightweight collectors) instead of Logstash for collection, and Logstash only for complex transformations. Elastic Agent is the unified collector in newer versions.
Metrics Collection
Pull model (Prometheus): Prometheus scrapes /metrics endpoints from services on a configurable interval (commonly 15s). Services expose metrics in the Prometheus text format. Prometheus stores time-series data in a local TSDB. PromQL queries the data. Alertmanager handles alerts. The pull model is simple but requires service discovery to know what to scrape. Push model (StatsD, InfluxDB): services push metrics to a collector. Better for short-lived jobs that may not survive until the next scrape.
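The pull model's contract is just an HTTP endpoint returning the Prometheus text exposition format. A stdlib-only sketch follows; the metric name and label are hypothetical, and a real service would normally use an official client library such as prometheus_client rather than hand-rolling the format:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # incremented by application code elsewhere (simplified)

def render_metrics():
    # Prometheus text exposition format: a "# TYPE" hint line,
    # then "name{labels} value" sample lines.
    return (
        "# TYPE http_requests_total counter\n"
        f'http_requests_total{{service="checkout"}} {REQUEST_COUNT}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    """Serve /metrics for a Prometheus scraper; anything else is 404."""
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To run: HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

Prometheus then discovers this target (e.g. via the Kubernetes API) and scrapes it on its interval.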
Metric types: Counter (monotonically increasing: request_count). Gauge (current value: memory_usage). Histogram (bucketed distribution: request_latency_seconds with buckets at .01, .05, .1, .5, 1, 5). Summary (quantile estimates computed at the client: p50, p95, p99). Histograms are preferred over summaries — they can be aggregated across instances (sum histogram buckets across pods, then compute percentiles server-side).
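Why histograms aggregate: cumulative bucket counts can be summed across pods, then a percentile estimated from the merged buckets. A simplified sketch of that idea follows; unlike PromQL's histogram_quantile it returns the bucket's upper bound instead of interpolating within the bucket, and the per-pod numbers are made up:

```python
def merge_buckets(per_pod_buckets):
    """Sum cumulative histogram buckets across instances -- conceptually what
    sum by (le) (rate(request_latency_seconds_bucket[5m])) does in PromQL."""
    merged = {}
    for buckets in per_pod_buckets:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged

def quantile(q, buckets):
    """Estimate quantile q from cumulative buckets: return the upper bound of
    the first bucket whose cumulative count reaches q * total observations."""
    total = buckets[float("inf")]
    rank = q * total
    for le in sorted(buckets):
        if buckets[le] >= rank:
            return le
    return float("inf")

# Hypothetical cumulative bucket counts from two pods.
pod_a = {0.05: 80, 0.1: 95, 0.5: 99, float("inf"): 100}
pod_b = {0.05: 40, 0.1: 70, 0.5: 95, float("inf"): 100}
merged = merge_buckets([pod_a, pod_b])
p95 = quantile(0.95, merged)  # server-side percentile over the fleet
```

A client-side summary's p95 per pod cannot be merged this way, which is exactly why histograms are preferred for fleet-wide percentiles.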
Distributed Tracing
A trace represents a single request’s journey across services. Each service creates a span (start time, end time, tags, logs). Spans are linked via parent_span_id. The trace_id propagates in HTTP headers (W3C Trace Context: traceparent header). Architecture: services emit spans to a collector (Jaeger Agent, OpenTelemetry Collector). Collector batches and exports to a trace store (Jaeger backend: Cassandra or Elasticsearch). Sampling: 100% tracing is too expensive at scale. Use head-based sampling (decision at first service: sample 1% of traces) or tail-based sampling (collect all spans, make sampling decision after the full trace is available — can sample 100% of error traces).
Alerting
Alert on symptoms, not causes. Bad: “CPU > 80%” (a cause). Good: “error rate > 1% for 5 minutes” (a symptom customers feel). Alert tiers: P1 (page on-call immediately: service down, error rate spike). P2 (page in business hours: elevated latency, degraded component). P3 (ticket: trend approaching a limit). Alert fatigue: too many low-quality alerts cause on-call engineers to ignore them. Fix: raise thresholds, add a minimum duration (“for 10 minutes”), and reduce P1 alerts to only true customer-facing issues. Dead man’s switch: a heartbeat alert that fires when a monitoring job stops running, catching the case where the monitoring system itself fails.
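The minimum-duration idea can be sketched as a tiny evaluator, analogous to (though far simpler than) the for: clause in a Prometheus alerting rule:

```python
import time

class ForDurationAlert:
    """Fire only after the condition has held continuously for `for_seconds`;
    any single false evaluation resets the pending state, so transient
    spikes never page anyone."""
    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.pending_since = None

    def evaluate(self, condition_true, now=None):
        now = time.monotonic() if now is None else now
        if not condition_true:
            self.pending_since = None  # transient spike: reset
            return False
        if self.pending_since is None:
            self.pending_since = now   # condition just started: pending
        return now - self.pending_since >= self.for_seconds

# "error rate > 1% for 5 minutes": evaluate() is called on each scrape cycle.
alert = ForDurationAlert(for_seconds=300)
```

The same pattern inverted (alert when a heartbeat has *not* arrived for N seconds) gives the dead man's switch.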
Interview Tips
- Cardinality is the main scaling challenge for metrics. High-cardinality labels (user_id, request_id) on metrics create millions of time series. Reserve high-cardinality data for logs and traces.
- OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. Libraries emit OTel data; the OTel Collector exports to any backend (Jaeger, Prometheus, Datadog). Instrument once, switch backends without code changes.
- Log sampling: at very high volume, sample debug logs (keep 1%) while keeping all error/warning logs. Reservoir sampling ensures statistical validity.
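The reservoir sampling mentioned above (Algorithm R) can be sketched as:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown
    length: the first k items fill the reservoir, then item i replaces a
    random slot with probability k/i, which keeps every item equally likely."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i)  # uniform in [0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

In a log pipeline this would run only over debug-level lines, with error and warning lines passed through unsampled.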
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How do you design a scalable log aggregation pipeline?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Log pipeline stages: (1) Collection: a lightweight agent (Fluent Bit, Vector) runs on each node, tails log files and container stdout, adds metadata (hostname, pod, namespace), and buffers locally to disk. (2) Transport: agents forward to an aggregator (Logstash, Vector, Kafka). Kafka as the transport decouples collection from storage: if Elasticsearch is slow, Kafka absorbs the burst. (3) Processing: parse unstructured logs into structured fields (timestamp, level, service, trace_id). Drop low-value logs (health check noise). Sample debug logs. (4) Storage: write to Elasticsearch or OpenSearch, indexed by timestamp and service. Retention: hot storage 7-30 days, cold storage (S3) 1 year. Index lifecycle management (ILM) automates tier movement."
}
},
{
"@type": "Question",
"name": "What is the difference between push-based and pull-based metrics collection?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Pull-based (Prometheus): the metrics server scrapes each service endpoint (/metrics) on a schedule. Pros: the server controls when and how often to scrape; failed scrapes are visible (target shows as down). Natural for long-lived services. Cons: short-lived jobs (batch jobs, serverless functions) may complete before the next scrape; use Pushgateway for those. Pull requires service discovery (Prometheus discovers targets from the Kubernetes API). Push-based (StatsD, Graphite, InfluxDB): services push metrics to a collector. Pros: works for any job duration, no service discovery needed. Cons: dead services silently stop sending (no down state visible). For most web services: Prometheus pull is preferred. For jobs/functions: push."
}
},
{
"@type": "Question",
"name": "How do you implement distributed tracing across microservices?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Distributed tracing tracks a request as it propagates across services. Components: (1) Instrumentation: each service creates spans (start_time, end_time, operation_name, tags, logs). Auto-instrumentation via the OpenTelemetry SDK instruments HTTP frameworks, DB clients, and message queues automatically. (2) Context propagation: the trace_id and parent_span_id travel in HTTP headers (W3C traceparent header). Each service extracts the context, creates a child span, and injects context into outgoing calls. (3) Collector: spans are sent to an OTel Collector (or Jaeger Agent) which batches and exports to a trace store (Jaeger, Tempo). (4) Sampling: to reduce volume, sample 1% of traces uniformly, but 100% of traces with errors. Tail-based sampling (make the sampling decision after seeing the complete trace) allows prioritizing interesting traces."
}
},
{
"@type": "Question",
"name": "How do you design effective alerting to avoid alert fatigue?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Alert fatigue occurs when too many low-quality alerts cause on-call engineers to ignore or silence them. Prevention: (1) Alert on symptoms (error rate, latency, availability) not causes (CPU, disk). Customers feel symptoms; causes are for debugging after an alert fires. (2) Add minimum duration: alert only if the condition persists for 5 minutes, which eliminates transient spikes. (3) Tiered severity: P1 (wake someone up) for true customer-facing outages. P2 (notify in Slack) for degradation. P3 (create ticket) for trends. (4) Actionable alerts: every alert should have a runbook link describing exactly how to investigate and remediate. Remove alerts with no clear remediation. (5) Review weekly: track alert volume per team. Any team firing more than N P1 alerts/week needs to revisit thresholds."
}
},
{
"@type": "Question",
"name": "How do you correlate logs, metrics, and traces in an observability platform?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Correlation links the three signals so engineers can navigate between them. The correlation key is trace_id: embed trace_id in every log line (structured logging: logger.info('request processed', trace_id=trace_id, duration_ms=42)). Do not tag Prometheus metrics with trace_id as a label: metrics have cardinality limits. In Grafana: the Explore view links from a metric spike -> logs filtered by time range and service -> traces filtered by trace_id. Exemplars: Prometheus supports attaching a trace_id to individual histogram samples (exemplars). When viewing a latency histogram in Grafana, click a high-latency exemplar to jump directly to the trace. This seamless navigation (metrics -> logs -> traces) is the goal of a mature observability platform."
}
}
]
}