System Design Interview: Design a Metrics and Monitoring System (Datadog/Prometheus)

Every large tech company runs a metrics and monitoring platform, whether built in-house or bought: Datadog, Prometheus, and Graphite all solve this problem. This guide covers the architecture of a system that collects, stores, queries, and alerts on time-series metrics at scale.

Requirements

Functional: collect metrics from thousands of hosts (CPU, memory, request rate, error rate), store metrics for 1 year, support aggregation queries (avg, sum, p99 over time ranges), trigger alerts when metrics cross thresholds.

Non-functional: ingest 1M metrics/second, query latency <1 second, 99.9% availability.

Data Model: Time Series

A metric is a named measurement with a timestamp and tags:

MetricPoint {
  name:      "api.request.latency_ms"
  tags:      {"host": "web-01", "endpoint": "/checkout", "region": "us-east"}
  value:     45.2
  timestamp: 1713200000    # Unix epoch, second precision
}

# Time series = unique (name, tags) combination
# Example: api.request.latency_ms{host=web-01, endpoint=/checkout} is one time series

At 10,000 hosts × 100 metrics/host, each reported at 1-second resolution, ingest is 1M data points/second. Over 1 year: 1M × 86,400 × 365 ≈ 30 trillion data points. Storage is the primary challenge.
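
The data model and the back-of-envelope math above can be sketched as follows (field and function names are illustrative, not a real API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPoint:
    name: str
    tags: tuple       # sorted (key, value) pairs so equal tag sets compare equal
    value: float
    timestamp: int    # Unix epoch, second precision

def series_key(name: str, tags: dict) -> str:
    """A time series is identified by the unique (name, sorted tags) combination."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{name}{{{tag_str}}}"

key = series_key("api.request.latency_ms",
                 {"host": "web-01", "endpoint": "/checkout"})
# → "api.request.latency_ms{endpoint=/checkout,host=web-01}"

# Back-of-envelope from the text: 1M points/second for a year
points_per_year = 1_000_000 * 86_400 * 365   # ≈ 3.15e13, ~30 trillion
```

Sorting the tags matters: the same tag set must always produce the same series key, or one logical series splinters into many.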

Storage: Time-Series Database

Relational databases are poorly suited for time-series workloads. Use a specialized TSDB:

Column-oriented storage: store all timestamps together and all values together. Adjacent values in a column are similar and same-typed, so they compress 10-50x better than row-oriented storage, where unrelated fields sit next to each other.

Delta-of-delta encoding: timestamps increase monotonically at a near-constant interval. Store the change between consecutive deltas rather than raw timestamps: a delta sequence like [1000, 1000, 1000, 1001, 1000, …] becomes mostly zeros, compressing to ~1-2 bits/sample instead of 8 bytes.
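
A toy sketch of the idea (this shows why the stream compresses well, not Gorilla's exact bit layout):

```python
def delta_of_delta(timestamps):
    """Encode timestamps as the change between consecutive deltas.
    Regular intervals produce mostly zeros, each encodable in ~1 bit."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# Mostly-regular 1000 ms intervals, with one sample arriving 1 ms late:
delta_of_delta([1000, 2000, 3000, 4001, 5001])   # → [0, 1, -1]
```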

XOR compression (Gorilla, Facebook 2015): consecutive float values in a time series differ by small amounts. XOR the current value with the previous, encode only the meaningful bits. Gorilla achieves 1.37 bytes/data point (vs. 16 bytes uncompressed), enabling Facebook to keep 26 hours of data in memory.
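
The XOR trick can be sketched in a few lines (a toy version; Gorilla's actual encoder emits variable-length control bits rather than whole 64-bit words):

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a 64-bit float as its integer bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """XOR each value's bits with the previous value's bits.
    Nearly-equal consecutive floats share sign, exponent, and leading
    mantissa bits, so the XOR result has long runs of zeros that a
    bit-level encoder can store cheaply."""
    prev = float_bits(values[0])
    out = []
    for v in values[1:]:
        cur = float_bits(v)
        out.append(prev ^ cur)
        prev = cur
    return out

xors = xor_stream([45.2, 45.2, 45.3])
# identical consecutive values XOR to 0 — Gorilla encodes that as a single bit
assert xors[0] == 0
```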

Popular TSDBs: InfluxDB, TimescaleDB (PostgreSQL extension), Prometheus (pull-based), OpenTSDB (HBase-backed), ClickHouse.

Ingestion Pipeline

Agents (StatsD/collectd on each host)
    ↓ UDP/HTTP push every 10 seconds
Ingestion Service (stateless, horizontally scaled)
    ↓ Kafka (buffer for backpressure)
Stream Processor (aggregate, validate, tag enrichment)
    ↓
TSDB Cluster (sharded by metric name hash)
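
The "sharded by metric name hash" step needs a stable routing function so every point of a series lands on the same shard; a minimal sketch (the helper name and shard count are hypothetical):

```python
import hashlib

def shard_for(series_key: str, num_shards: int) -> int:
    """Stable hash routing: the same series always maps to the same shard,
    so a query for one series touches exactly one node."""
    digest = hashlib.sha1(series_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard_for("api.request.latency_ms{host=web-01}", 16)   # deterministic shard id
```

A production system would use consistent hashing instead of a plain modulo, so that adding a shard remaps only a fraction of the series.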

Pull vs. Push: Prometheus uses pull (scrapes each host on a schedule). Datadog uses push (agents send to collectors). Pull makes health checking trivial (a failed scrape means the target is down) but requires service discovery to find targets and misses short-lived jobs. Push handles short-lived jobs naturally but requires the ingestion tier to absorb variable load.
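
In the pull model, each scrape returns a text payload that the server parses into points; a simplified sketch (the real Prometheus exposition format also carries HELP/TYPE metadata and per-metric labels):

```python
def parse_scrape(payload: str) -> dict:
    """Parse a scraped payload of 'name value' lines into a metrics dict,
    skipping blank lines and '#' comment/metadata lines."""
    metrics = {}
    for line in payload.splitlines():
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

parse_scrape("# TYPE cpu gauge\ncpu.user 0.42\nmem.used_bytes 1.2e9")
# → {"cpu.user": 0.42, "mem.used_bytes": 1200000000.0}
```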

Data Retention and Downsampling

Storing 1-second resolution for a year is expensive. Use a tiered retention policy:

  • Raw (1s resolution): last 7 days in hot storage (SSD, 10x cost)
  • 1-minute aggregates: last 30 days in warm storage (HDD)
  • 1-hour aggregates: last 1 year in cold storage (object store, S3)

Downsampling: a batch job runs every minute, aggregating the last minute of raw data into a single (min, max, avg, sum, count) tuple per time series. 60 one-second samples collapse into one tuple, roughly a 12x reduction in stored values (60 samples → 5 aggregate fields).
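
The per-minute aggregation step can be sketched as (the function name is illustrative):

```python
def downsample(raw_values):
    """Collapse one minute of raw samples for a single series into the
    (min, max, avg, sum, count) tuple kept in the warm tier."""
    n = len(raw_values)
    total = sum(raw_values)
    return {"min": min(raw_values), "max": max(raw_values),
            "avg": total / n, "sum": total, "count": n}

downsample([45.2, 51.0, 38.7])
# → {"min": 38.7, "max": 51.0, "avg": 44.966..., "sum": 134.9..., "count": 3}
```

Keeping sum and count (not just avg) matters: they let a later 1-hour rollup, or a cross-shard query, recombine aggregates correctly.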

Querying

Time-series query languages (PromQL, InfluxQL) support:

# Average latency over 5-minute windows for a specific endpoint:
avg_over_time(api.request.latency_ms{endpoint="/checkout"}[5m])

# Error rate as a percentage:
sum(rate(api.errors[1m])) / sum(rate(api.requests[1m])) * 100

# P99 latency across all hosts (aggregate the histogram buckets, then take the quantile):
histogram_quantile(0.99, sum(rate(api.request.latency_ms_bucket[5m])) by (le))

Queries fan out to all shards holding the relevant time series; each shard applies the time filter and a partial aggregation, and a query coordinator merges the partial results.
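
The merge step is why shards return partial aggregates rather than finished answers; a minimal sketch for averaging (the function name is illustrative):

```python
def merge_avg(partials):
    """Merge per-shard (sum, count) pairs into a global average.
    Averaging the shards' averages would be wrong whenever shards
    hold different numbers of points."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# shard A: 2 points summing 100; shard B: 8 points summing 80
merge_avg([(100.0, 2), (80.0, 8)])   # → 18.0, not (50 + 10) / 2 = 30
```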

Alerting

Alert rule: IF avg(api.error_rate) > 0.01 FOR 5m THEN notify(oncall)
# Evaluation: run every 30 seconds
# Hysteresis: alert fires only after 5 continuous minutes above threshold
# Suppression: do not re-alert during a deployment window

Alert evaluation runs the same query engine as dashboards. Alerts are deduplicated (fire once, not every 30 seconds). An alert manager routes notifications to PagerDuty, Slack, or email based on severity and team ownership.
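
The FOR-duration check and the deduplication described above can be sketched as a small state machine (state is kept in memory here; a real alert manager persists it and evaluates many rules):

```python
class AlertRule:
    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.breach_start = None   # when the metric first crossed the threshold
        self.firing = False        # dedup: notify only on the transition

    def evaluate(self, value, now):
        """Return True only when the alert transitions into the firing state."""
        if value <= self.threshold:
            self.breach_start = None
            self.firing = False
            return False
        if self.breach_start is None:
            self.breach_start = now
        if now - self.breach_start >= self.for_seconds and not self.firing:
            self.firing = True
            return True            # fire once, not on every 30-second evaluation
        return False

rule = AlertRule(threshold=0.01, for_seconds=300)
rule.evaluate(0.05, now=0)      # breach starts, not fired yet → False
rule.evaluate(0.05, now=300)    # sustained for 5 minutes → fires (True)
rule.evaluate(0.05, now=330)    # still breaching → deduplicated (False)
```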

Interview Tips

  • Lead with the data model: “a metric is (name, tags, timestamp, value) — the unique (name, tags) combination is a time series”
  • Mention compression: delta encoding for timestamps, XOR/Gorilla for values — shows you understand why TSDBs exist
  • Explain the downsampling tier: raw → 1-minute → 1-hour, and why it is necessary for 1-year retention
  • For pull vs. push: mention both and say the choice depends on deployment topology
  • Alerting follow-up: explain hysteresis (sustained threshold breach) to avoid alert flapping on noisy metrics
