The Three Pillars of Observability
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars:
- Logs: structured event records (what happened, when, with what context).
- Metrics: numeric measurements over time (request rate, error rate, latency, CPU).
- Traces: records of a request propagating across services (distributed tracing).
A mature observability platform provides all three, correlated by time and request ID.
Log Aggregation Architecture
Log pipeline: Application writes structured logs (JSON) to stdout/stderr. A log collector (Fluentd, Fluent Bit, Vector) runs as a DaemonSet on each Kubernetes node, tails container logs, and ships them to a central store. The central store indexes logs for search: Elasticsearch (self-hosted), OpenSearch, or managed services (Splunk, Datadog, CloudWatch Logs). Parsing: extract structured fields from log lines (timestamp, level, service, trace_id, message). Tag with Kubernetes metadata (namespace, pod, container). The pipeline must handle backpressure — if the destination is slow, the collector buffers (on disk) rather than dropping logs.
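As a sketch of the first stage, here is a minimal JSON formatter for Python's standard logging module; the service name and the exact field set are illustrative, not a standard schema:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line, the shape a
    collector like Fluent Bit or Vector expects to tail from stdout."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": "checkout",  # hypothetical service name
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# trace_id passed via `extra` lands on the record, so it ends up in the JSON line.
logger.info("order placed", extra={"trace_id": "abc123"})
```

Because every line is self-describing JSON, the collector's parsing stage reduces to a JSON decode plus Kubernetes metadata enrichment.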
ELK stack: Elasticsearch (storage and search), Logstash (collection and transformation), Kibana (visualization). The modern variant uses Beats (lightweight collectors) instead of Logstash for collection, and Logstash only for complex transformations. Elastic Agent is the unified collector in newer versions.
Metrics Collection
Pull model (Prometheus): Prometheus scrapes /metrics endpoints from services on a configurable interval (commonly 15s). Services expose metrics in the Prometheus text format. Prometheus stores time-series data in a local TSDB. PromQL queries the data. Alertmanager handles alerts. The pull model is simple but requires service discovery to know what to scrape. Push model (StatsD, InfluxDB): services push metrics to a collector. Better for short-lived jobs that may not survive until the next scrape.
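The pull model's contract is just an HTTP endpoint returning the Prometheus text exposition format. A stdlib-only sketch follows; the metric name and label are hypothetical, and a real service would normally use an official client library such as prometheus_client rather than hand-rolling the format:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # incremented by application code elsewhere (simplified)

def render_metrics():
    # Prometheus text exposition format: a "# TYPE" hint line,
    # then "name{labels} value" sample lines.
    return (
        "# TYPE http_requests_total counter\n"
        f'http_requests_total{{service="checkout"}} {REQUEST_COUNT}\n'
    )

class MetricsHandler(BaseHTTPRequestHandler):
    """Serve /metrics for a Prometheus scraper; anything else is 404."""
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To run: HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

Prometheus then discovers this target (e.g. via the Kubernetes API) and scrapes it on its interval.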
Metric types: Counter (monotonically increasing: request_count). Gauge (current value: memory_usage). Histogram (bucketed distribution: request_latency_seconds with buckets at .01, .05, .1, .5, 1, 5). Summary (quantile estimates computed at the client: p50, p95, p99). Histograms are preferred over summaries — they can be aggregated across instances (sum histogram buckets across pods, then compute percentiles server-side).
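Why histograms aggregate: cumulative bucket counts can be summed across pods, then a percentile estimated from the merged buckets. A simplified sketch of that idea follows; unlike PromQL's histogram_quantile it returns the bucket's upper bound instead of interpolating within the bucket, and the per-pod numbers are made up:

```python
def merge_buckets(per_pod_buckets):
    """Sum cumulative histogram buckets across instances -- conceptually what
    sum by (le) (rate(request_latency_seconds_bucket[5m])) does in PromQL."""
    merged = {}
    for buckets in per_pod_buckets:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged

def quantile(q, buckets):
    """Estimate quantile q from cumulative buckets: return the upper bound of
    the first bucket whose cumulative count reaches q * total observations."""
    total = buckets[float("inf")]
    rank = q * total
    for le in sorted(buckets):
        if buckets[le] >= rank:
            return le
    return float("inf")

# Hypothetical cumulative bucket counts from two pods.
pod_a = {0.05: 80, 0.1: 95, 0.5: 99, float("inf"): 100}
pod_b = {0.05: 40, 0.1: 70, 0.5: 95, float("inf"): 100}
merged = merge_buckets([pod_a, pod_b])
p95 = quantile(0.95, merged)  # server-side percentile over the fleet
```

A client-side summary's p95 per pod cannot be merged this way, which is exactly why histograms are preferred for fleet-wide percentiles.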
Distributed Tracing
A trace represents a single request’s journey across services. Each service creates a span (start time, end time, tags, logs). Spans are linked via parent_span_id. The trace_id propagates in HTTP headers (W3C Trace Context: traceparent header). Architecture: services emit spans to a collector (Jaeger Agent, OpenTelemetry Collector). Collector batches and exports to a trace store (Jaeger backend: Cassandra or Elasticsearch). Sampling: 100% tracing is too expensive at scale. Use head-based sampling (decision at first service: sample 1% of traces) or tail-based sampling (collect all spans, make sampling decision after the full trace is available — can sample 100% of error traces).
Alerting
Alert on symptoms, not causes. Bad: “CPU > 80%” (a cause). Good: “error rate > 1% for 5 minutes” (a symptom customers feel). Alert tiers: P1 (page on-call immediately: service down, error rate spike). P2 (page in business hours: elevated latency, degraded component). P3 (ticket: trend approaching a limit). Alert fatigue: too many low-quality alerts cause on-call engineers to ignore them. Fix: raise thresholds, add a minimum duration (“for 10 minutes”), and reduce P1 alerts to only true customer-facing issues. Dead man’s switch: a heartbeat alert that fires when a monitoring job stops running, catching the case where the monitoring system itself fails.
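The minimum-duration idea can be sketched as a tiny evaluator, analogous to (though far simpler than) the for: clause in a Prometheus alerting rule:

```python
import time

class ForDurationAlert:
    """Fire only after the condition has held continuously for `for_seconds`;
    any single false evaluation resets the pending state, so transient
    spikes never page anyone."""
    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.pending_since = None

    def evaluate(self, condition_true, now=None):
        now = time.monotonic() if now is None else now
        if not condition_true:
            self.pending_since = None  # transient spike: reset
            return False
        if self.pending_since is None:
            self.pending_since = now   # condition just started: pending
        return now - self.pending_since >= self.for_seconds

# "error rate > 1% for 5 minutes": evaluate() is called on each scrape cycle.
alert = ForDurationAlert(for_seconds=300)
```

The same pattern inverted (alert when a heartbeat has *not* arrived for N seconds) gives the dead man's switch.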
Interview Tips
- Cardinality is the main scaling challenge for metrics. High-cardinality labels (user_id, request_id) on metrics create millions of time series. Reserve high-cardinality data for logs and traces.
- OpenTelemetry (OTel) is the vendor-neutral standard for instrumentation. Libraries emit OTel data; the OTel Collector exports to any backend (Jaeger, Prometheus, Datadog). Instrument once, switch backends without code changes.
- Log sampling: at very high volume, sample debug logs (keep 1%) while keeping all error/warning logs. Reservoir sampling ensures statistical validity.
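The reservoir sampling mentioned above (Algorithm R) can be sketched as:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown
    length: the first k items fill the reservoir, then item i replaces a
    random slot with probability k/i, which keeps every item equally likely."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(item)
        else:
            j = rng.randrange(i)  # uniform in [0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

In a log pipeline this would run only over debug-level lines, with error and warning lines passed through unsampled.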
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How do you design a scalable log aggregation pipeline?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Log pipeline stages: (1) Collection: a lightweight agent (Fluent Bit, Vector) runs on each node, tails log files and container stdout, adds metadata (hostname, pod, namespace), and buffers locally to disk. (2) Transport: agents forward to an aggregator (Logstash, Vector, Kafka). Kafka as the transport decouples collection from storage: if Elasticsearch is slow, Kafka absorbs the burst. (3) Processing: parse unstructured logs into structured fields (timestamp, level, service, trace_id). Drop low-value logs (health check noise). Sample debug logs. (4) Storage: write to Elasticsearch or OpenSearch, indexed by timestamp and service. Retention: hot storage 7-30 days, cold storage (S3) 1 year. Index lifecycle management (ILM) automates tier movement."
}
},
{
"@type": "Question",
"name": "What is the difference between push-based and pull-based metrics collection?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Pull-based (Prometheus): the metrics server scrapes each service endpoint (/metrics) on a schedule. Pros: the server controls when and how often to scrape; failed scrapes are visible (target shows as down). Natural for long-lived services. Cons: short-lived jobs (batch jobs, serverless functions) may complete before the next scrape; use Pushgateway for those. Pull requires service discovery (Prometheus discovers targets from the Kubernetes API). Push-based (StatsD, Graphite, InfluxDB): services push metrics to a collector. Pros: works for any job duration, no service discovery needed. Cons: dead services silently stop sending (no down state visible). For most web services: Prometheus pull is preferred. For jobs/functions: push."
}
},
{
"@type": "Question",
"name": "How do you implement distributed tracing across microservices?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Distributed tracing tracks a request as it propagates across services. Components: (1) Instrumentation: each service creates spans (start_time, end_time, operation_name, tags, logs). Auto-instrumentation via the OpenTelemetry SDK instruments HTTP frameworks, DB clients, and message queues automatically. (2) Context propagation: the trace_id and parent_span_id travel in HTTP headers (W3C traceparent header). Each service extracts the context, creates a child span, and injects context into outgoing calls. (3) Collector: spans are sent to an OTel Collector (or Jaeger Agent) which batches and exports to a trace store (Jaeger, Tempo). (4) Sampling: to reduce volume, sample 1% of traces uniformly, but 100% of traces with errors. Tail-based sampling (make the sampling decision after seeing the complete trace) allows prioritizing interesting traces."
}
},
{
"@type": "Question",
"name": "How do you design effective alerting to avoid alert fatigue?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Alert fatigue occurs when too many low-quality alerts cause on-call engineers to ignore or silence them. Prevention: (1) Alert on symptoms (error rate, latency, availability) not causes (CPU, disk). Customers feel symptoms; causes are for debugging after an alert fires. (2) Add minimum duration: alert only if the condition persists for 5 minutes, which eliminates transient spikes. (3) Tiered severity: P1 (wake someone up) for true customer-facing outages. P2 (notify in Slack) for degradation. P3 (create ticket) for trends. (4) Actionable alerts: every alert should have a runbook link describing exactly how to investigate and remediate. Remove alerts with no clear remediation. (5) Review weekly: track alert volume per team. Any team firing more than N P1 alerts/week needs to revisit thresholds."
}
},
{
"@type": "Question",
"name": "How do you correlate logs, metrics, and traces in an observability platform?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Correlation links the three signals so engineers can navigate between them. The correlation key is trace_id: embed trace_id in every log line (structured logging: logger.info('request processed', trace_id=trace_id, duration_ms=42)). Do not tag Prometheus metrics with trace_id as a label: metrics have cardinality limits. In Grafana: the Explore view links from a metric spike -> logs filtered by time range and service -> traces filtered by trace_id. Exemplars: Prometheus supports attaching a trace_id to individual histogram samples (exemplars). When viewing a latency histogram in Grafana, click a high-latency exemplar to jump directly to the trace. This seamless navigation (metrics -> logs -> traces) is the goal of a mature observability platform."
}
}
]
}