System Design Interview: Design a Metrics and Monitoring System (Datadog/Prometheus)

Every large tech company runs a metrics and monitoring platform, whether built in-house or bought: Datadog, Prometheus, and Graphite all solve this problem. This guide covers the architecture of a system that collects, stores, queries, and alerts on time-series metrics at scale.

Requirements

Functional: collect metrics from thousands of hosts (CPU, memory, request rate, error rate), store metrics for 1 year, support aggregation queries (avg, sum, p99 over time ranges), trigger alerts when metrics cross thresholds.

Non-functional: ingest 1M metrics/second, query latency <1 second, 99.9% availability.

Data Model: Time Series

A metric is a named measurement with a timestamp and tags:

MetricPoint {
  name:      "api.request.latency_ms"
  tags:      {"host": "web-01", "endpoint": "/checkout", "region": "us-east"}
  value:     45.2
  timestamp: 1713200000    # Unix epoch, second precision
}

# Time series = unique (name, tags) combination
# Example: api.request.latency_ms{host=web-01, endpoint=/checkout} is one time series

At 10,000 hosts × 100 metrics/host, each reported at 1-second resolution, ingest is 1M data points/second. Over 1 year: 1M × 86,400 × 365 ≈ 30 trillion data points. Storage is the primary challenge.
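
The data model and the back-of-envelope math above can be sketched as follows (field and function names are illustrative, not a real API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricPoint:
    name: str
    tags: tuple       # sorted (key, value) pairs so equal tag sets compare equal
    value: float
    timestamp: int    # Unix epoch, second precision

def series_key(name: str, tags: dict) -> str:
    """A time series is identified by the unique (name, sorted tags) combination."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{name}{{{tag_str}}}"

key = series_key("api.request.latency_ms",
                 {"host": "web-01", "endpoint": "/checkout"})
# → "api.request.latency_ms{endpoint=/checkout,host=web-01}"

# Back-of-envelope from the text: 1M points/second for a year
points_per_year = 1_000_000 * 86_400 * 365   # ≈ 3.15e13, ~30 trillion
```

Sorting the tags matters: the same tag set must always produce the same series key, or one logical series splinters into many.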

Storage: Time-Series Database

Relational databases are poorly suited for time-series workloads. Use a specialized TSDB:

Column-oriented storage: store all timestamps together and all values together. Adjacent values in a column are similar and same-typed, so they compress 10-50x better than row-oriented storage, where unrelated fields sit next to each other.

Delta-of-delta encoding: timestamps increase monotonically at a near-constant interval. Store the change between consecutive deltas rather than raw timestamps: a delta sequence like [1000, 1000, 1000, 1001, 1000, …] becomes mostly zeros, compressing to ~1-2 bits/sample instead of 8 bytes.
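
A toy sketch of the idea (this shows why the stream compresses well, not Gorilla's exact bit layout):

```python
def delta_of_delta(timestamps):
    """Encode timestamps as the change between consecutive deltas.
    Regular intervals produce mostly zeros, each encodable in ~1 bit."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# Mostly-regular 1000 ms intervals, with one sample arriving 1 ms late:
delta_of_delta([1000, 2000, 3000, 4001, 5001])   # → [0, 1, -1]
```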

XOR compression (Gorilla, Facebook 2015): consecutive float values in a time series differ by small amounts. XOR the current value with the previous, encode only the meaningful bits. Gorilla achieves 1.37 bytes/data point (vs. 16 bytes uncompressed), enabling Facebook to keep 26 hours of data in memory.
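
The XOR trick can be sketched in a few lines (a toy version; Gorilla's actual encoder emits variable-length control bits rather than whole 64-bit words):

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a 64-bit float as its integer bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """XOR each value's bits with the previous value's bits.
    Nearly-equal consecutive floats share sign, exponent, and leading
    mantissa bits, so the XOR result has long runs of zeros that a
    bit-level encoder can store cheaply."""
    prev = float_bits(values[0])
    out = []
    for v in values[1:]:
        cur = float_bits(v)
        out.append(prev ^ cur)
        prev = cur
    return out

xors = xor_stream([45.2, 45.2, 45.3])
# identical consecutive values XOR to 0 — Gorilla encodes that as a single bit
assert xors[0] == 0
```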

Popular TSDBs: InfluxDB, TimescaleDB (PostgreSQL extension), Prometheus (pull-based), OpenTSDB (HBase-backed), ClickHouse.

Ingestion Pipeline

Agents (StatsD/collectd on each host)
    ↓ UDP/HTTP push every 10 seconds
Ingestion Service (stateless, horizontally scaled)
    ↓ Kafka (buffer for backpressure)
Stream Processor (aggregate, validate, tag enrichment)
    ↓
TSDB Cluster (sharded by metric name hash)
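
The "sharded by metric name hash" step needs a stable routing function so every point of a series lands on the same shard; a minimal sketch (the helper name and shard count are hypothetical):

```python
import hashlib

def shard_for(series_key: str, num_shards: int) -> int:
    """Stable hash routing: the same series always maps to the same shard,
    so a query for one series touches exactly one node."""
    digest = hashlib.sha1(series_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard_for("api.request.latency_ms{host=web-01}", 16)   # deterministic shard id
```

A production system would use consistent hashing instead of a plain modulo, so that adding a shard remaps only a fraction of the series.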

Pull vs. Push: Prometheus uses pull (scrapes each host on a schedule). Datadog uses push (agents send to collectors). Pull makes health checking trivial (a failed scrape means the target is down) but requires service discovery to find targets and misses short-lived jobs. Push handles short-lived jobs naturally but requires the ingestion tier to absorb variable load.
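
In the pull model, each scrape returns a text payload that the server parses into points; a simplified sketch (the real Prometheus exposition format also carries HELP/TYPE metadata and per-metric labels):

```python
def parse_scrape(payload: str) -> dict:
    """Parse a scraped payload of 'name value' lines into a metrics dict,
    skipping blank lines and '#' comment/metadata lines."""
    metrics = {}
    for line in payload.splitlines():
        if not line or line.startswith("#"):
            continue
        name, value = line.rsplit(" ", 1)
        metrics[name] = float(value)
    return metrics

parse_scrape("# TYPE cpu gauge\ncpu.user 0.42\nmem.used_bytes 1.2e9")
# → {"cpu.user": 0.42, "mem.used_bytes": 1200000000.0}
```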

Data Retention and Downsampling

Storing 1-second resolution for a year is expensive. Use a tiered retention policy:

  • Raw (1s resolution): last 7 days in hot storage (SSD, 10x cost)
  • 1-minute aggregates: last 30 days in warm storage (HDD)
  • 1-hour aggregates: last 1 year in cold storage (object store, S3)

Downsampling: a batch job runs every minute, aggregating the last minute of raw data into a single (min, max, avg, sum, count) tuple per time series. 60 one-second samples collapse into one tuple, roughly a 12x reduction in stored values (60 samples → 5 aggregate fields).
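
The per-minute aggregation step can be sketched as (the function name is illustrative):

```python
def downsample(raw_values):
    """Collapse one minute of raw samples for a single series into the
    (min, max, avg, sum, count) tuple kept in the warm tier."""
    n = len(raw_values)
    total = sum(raw_values)
    return {"min": min(raw_values), "max": max(raw_values),
            "avg": total / n, "sum": total, "count": n}

downsample([45.2, 51.0, 38.7])
# → {"min": 38.7, "max": 51.0, "avg": 44.966..., "sum": 134.9..., "count": 3}
```

Keeping sum and count (not just avg) matters: they let a later 1-hour rollup, or a cross-shard query, recombine aggregates correctly.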

Querying

Time-series query languages (PromQL, InfluxQL) support:

# Average latency over 5-minute windows for a specific endpoint:
avg_over_time(api.request.latency_ms{endpoint="/checkout"}[5m])

# Error rate as a percentage:
sum(rate(api.errors[1m])) / sum(rate(api.requests[1m])) * 100

# P99 latency across all hosts (aggregate the histogram buckets, then take the quantile):
histogram_quantile(0.99, sum(rate(api.request.latency_ms_bucket[5m])) by (le))

Queries fan out to all shards holding the relevant time series; each shard applies the time filter and a partial aggregation, and a query coordinator merges the partial results.
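
The merge step is why shards return partial aggregates rather than finished answers; a minimal sketch for averaging (the function name is illustrative):

```python
def merge_avg(partials):
    """Merge per-shard (sum, count) pairs into a global average.
    Averaging the shards' averages would be wrong whenever shards
    hold different numbers of points."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# shard A: 2 points summing 100; shard B: 8 points summing 80
merge_avg([(100.0, 2), (80.0, 8)])   # → 18.0, not (50 + 10) / 2 = 30
```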

Alerting

Alert rule: IF avg(api.error_rate) > 0.01 FOR 5m THEN notify(oncall)
# Evaluation: run every 30 seconds
# Hysteresis: alert fires only after 5 continuous minutes above threshold
# Suppression: do not re-alert during a deployment window

Alert evaluation runs the same query engine as dashboards. Alerts are deduplicated (fire once, not every 30 seconds). An alert manager routes notifications to PagerDuty, Slack, or email based on severity and team ownership.
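
The FOR-duration check and the deduplication described above can be sketched as a small state machine (state is kept in memory here; a real alert manager persists it and evaluates many rules):

```python
class AlertRule:
    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.breach_start = None   # when the metric first crossed the threshold
        self.firing = False        # dedup: notify only on the transition

    def evaluate(self, value, now):
        """Return True only when the alert transitions into the firing state."""
        if value <= self.threshold:
            self.breach_start = None
            self.firing = False
            return False
        if self.breach_start is None:
            self.breach_start = now
        if now - self.breach_start >= self.for_seconds and not self.firing:
            self.firing = True
            return True            # fire once, not on every 30-second evaluation
        return False

rule = AlertRule(threshold=0.01, for_seconds=300)
rule.evaluate(0.05, now=0)      # breach starts, not fired yet → False
rule.evaluate(0.05, now=300)    # sustained for 5 minutes → fires (True)
rule.evaluate(0.05, now=330)    # still breaching → deduplicated (False)
```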

Interview Tips

  • Lead with the data model: “a metric is (name, tags, timestamp, value) — the unique (name, tags) combination is a time series”
  • Mention compression: delta encoding for timestamps, XOR/Gorilla for values — shows you understand why TSDBs exist
  • Explain the downsampling tier: raw → 1-minute → 1-hour, and why it is necessary for 1-year retention
  • For pull vs. push: mention both and say the choice depends on deployment topology
  • Alerting follow-up: explain hysteresis (sustained threshold breach) to avoid alert flapping on noisy metrics
