System Design Interview: Time Series Database (Prometheus / InfluxDB)

What Is a Time Series Database?

A time series database (TSDB) stores sequences of timestamped values — metrics, sensor readings, financial prices, IoT data. Unlike general-purpose databases, TSDBs are optimized for: high write throughput (millions of data points per second), time-range queries (give me CPU usage for the last hour), efficient storage with compression, and automatic data retention (delete data older than 90 days). Prometheus, InfluxDB, TimescaleDB, and Graphite are popular TSDBs. Use cases: infrastructure monitoring (CPU, memory, network), application performance monitoring (request latency, error rate), IoT sensor data, and financial tick data.

Data Model

A time series is identified by a metric name and a set of labels (key-value pairs). In Prometheus: metric_name{label1="value1", label2="value2"} timestamp value. Example: http_requests_total{method="GET", endpoint="/api/users", status="200"} 1714000000 42543. Each unique combination of metric name + labels is one time series. Labels enable filtering and aggregation: sum(http_requests_total{status="200"}) by (endpoint) gives total successful requests per endpoint. Cardinality: the number of unique time series = the number of distinct combinations of label values. High-cardinality labels (user_id, request_id) create millions of time series — a common anti-pattern that degrades performance.
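As a sketch of how label sets multiply into cardinality (the label names and counts below are illustrative, not from any real deployment):

```python
# Cardinality = product of the number of distinct values per label.
from math import prod

# Assumed label-value counts for a single metric (illustrative numbers).
label_values = {
    "method": 4,      # GET, POST, PUT, DELETE
    "endpoint": 50,   # bucketed API routes
    "status": 10,     # common HTTP status codes
}
cardinality = prod(label_values.values())
print(cardinality)  # prints 2000 -- a perfectly healthy series count

# Adding one unbounded label explodes the count (the anti-pattern):
label_values["user_id"] = 1_000_000
print(prod(label_values.values()))  # prints 2000000000
```

The multiplication is the whole story: one unbounded label dominates every other design decision.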

Write Path and Compression

TSDBs receive millions of writes per second. Efficient storage uses two tricks:

Delta encoding: instead of storing absolute timestamps (1714000000, 1714000060, 1714000120, …), store the first timestamp and then the deltas between consecutive samples (60, 60, …). Deltas are small integers, which compress well. If the scrape interval is consistent (every 60 seconds), the deltas are constant and collapse under run-length encoding (just store: first=1714000000, delta=60, count=100).
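A minimal Python sketch of delta plus run-length encoding (illustrative only; real TSDBs operate on bit-packed streams, not Python lists):

```python
def delta_encode(timestamps):
    """Store the first timestamp plus deltas between consecutive samples."""
    first = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return first, deltas

def run_length(deltas):
    """Collapse runs of identical deltas into [delta, count] pairs."""
    runs = []
    for d in deltas:
        if runs and runs[-1][0] == d:
            runs[-1][1] += 1
        else:
            runs.append([d, 1])
    return runs

# 100 samples scraped every 60 seconds:
ts = [1714000000 + 60 * i for i in range(100)]
first, deltas = delta_encode(ts)
print(first, run_length(deltas))  # prints 1714000000 [[60, 99]]
```

One hundred 8-byte timestamps reduce to a single (first, delta, count) triple when the scrape interval never jitters.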

Gorilla compression (Facebook's time series compression for floating-point values): XOR consecutive values. If CPU usage is 45.2, 45.3, 45.1, …, consecutive samples share their sign, exponent, and high mantissa bits, so their XOR has long runs of leading zeros. Store only the meaningful bits. Gorilla achieves 1.37 bytes/sample on average versus 16 bytes/sample raw (8-byte timestamp + 8-byte float64). Prometheus achieves similar compression in its block storage format.
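The XOR trick can be sketched as follows; the helper names are mine, and a real Gorilla encoder would bit-pack the leading-zero count and the meaningful bits rather than keep whole integers:

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a float64 as its raw 64-bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """XOR each float64's bit pattern with its predecessor's.
    Nearby values share sign, exponent, and high mantissa bits,
    so the XOR has long runs of leading zeros; only the few
    differing bits need to be stored."""
    prev = float_bits(values[0])
    out = [prev]  # the first value is stored verbatim
    for v in values[1:]:
        bits = float_bits(v)
        out.append(prev ^ bits)
        prev = bits
    return out

samples = [45.2, 45.3, 45.1, 45.2]
for word in xor_stream(samples):
    print(f"{word:064b}")  # XORed words start with long zero runs
```

Printing the 64-bit patterns makes the compression opportunity visible: after the first value, every word begins with a dozen or more zeros.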

In-memory write buffer: incoming samples are appended to an on-disk WAL (Write-Ahead Log) for crash recovery and buffered in an in-memory head chunk covering the most recent ~2 hours. These hot chunks are queried frequently and compressed lightly for fast access. Every 2 hours, the in-memory chunk is flushed to disk as an immutable block, recompressed more aggressively (Snappy/zstd, depending on the engine), and indexed. Old blocks are merged and compacted in the background.
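A toy version of this write path, with assumed class and method names (no real compression, indexing, or fsync — just the WAL-then-buffer-then-flush shape):

```python
class TinyTSDB:
    """Toy write path: append to a WAL for durability, buffer samples
    in memory, and seal the buffer as an immutable block when the
    2-hour window closes. Illustrative only."""
    WINDOW = 2 * 60 * 60  # 2-hour block window, in seconds

    def __init__(self):
        self.wal = []          # stand-in for an fsync'd on-disk log
        self.head = []         # hot in-memory chunk of recent samples
        self.blocks = []       # immutable flushed blocks
        self.window_start = None

    def append(self, ts, value):
        if self.window_start is None:
            self.window_start = ts
        if ts - self.window_start >= self.WINDOW:
            self.flush()                 # seal the previous 2h window
            self.window_start = ts
        self.wal.append((ts, value))     # durability before acknowledging
        self.head.append((ts, value))    # then the queryable hot chunk

    def flush(self):
        self.blocks.append(tuple(self.head))  # immutable block
        self.head = []
        self.wal = []  # everything in the WAL is now in a block
```

On restart, a real engine replays the WAL to rebuild the head chunk; here the WAL is only truncated once its contents are safely inside a flushed block.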

Query Language: PromQL

Prometheus Query Language (PromQL) enables flexible metric aggregation:

  • http_requests_total — select all time series with this name
  • rate(http_requests_total[5m]) — per-second rate over 5-minute window (for counter metrics)
  • histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) — P99 latency
  • sum(rate(http_requests_total[5m])) by (endpoint) — total RPS per endpoint
  • avg_over_time(cpu_usage[1h]) — average CPU over last hour

PromQL evaluates over time ranges by scanning relevant time series blocks, applying time-range filters (start/end timestamps), and computing aggregations over the matching data points.
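The counter-rate semantics behind rate() in the list above can be sketched as follows (simplified: real PromQL also extrapolates to the window boundaries rather than using the raw endpoint samples):

```python
def prom_rate(samples, window_seconds):
    """Per-second rate of increase of a counter over a window, with
    simple reset handling: a drop means the counter restarted at 0,
    so only the post-reset increase is counted."""
    values = [v for _, v in samples]
    increase = 0.0
    for prev, cur in zip(values, values[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase / window_seconds

# Counter sampled every 60s over 5 minutes; one reset (server restart).
samples = [(0, 1000), (60, 1300), (120, 1600), (180, 50), (240, 350)]
print(prom_rate(samples, 300))  # (300 + 300 + 50 + 300) / 300 ≈ 3.17 req/s
```

The reset branch is why the restart at t=180 contributes +50 (the counter's value since restarting) instead of a spurious -1550.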

Scalability: Thanos and Cortex

Single-node Prometheus handles ~1M active time series and ~1M samples/second. For larger deployments:

  • Thanos: adds a sidecar to Prometheus that uploads blocks to S3 for long-term storage. A Thanos Query component fans out queries to multiple Prometheus instances and S3, deduplicates results, and presents a single query endpoint. Scales horizontally — add Prometheus instances for more ingestion capacity.
  • Cortex / Mimir: fully distributed, horizontally scalable Prometheus-compatible TSDB. Each component (ingestor, querier, compactor, store-gateway) scales independently. Used by Grafana Cloud to serve thousands of tenants on a single multi-tenant cluster.
  • TimescaleDB: time series extension for PostgreSQL. Automatically partitions data into time-based chunks (hypertables). Enables SQL queries on time series data, referential integrity with other relational tables, and familiar tooling. Best for: IoT data with relational dimensions, financial data requiring SQL joins.

Downsampling and Retention

Raw metric data at 15-second resolution for 1 year: 1M time series × (365 × 24 × 60 × 4) samples ≈ 2.1 trillion samples. At 1.37 bytes/sample (Gorilla): ~2.9TB. Manageable for a single cluster, but not indefinitely. Downsampling: keep raw data for 15 days. After 15 days, downsample to 5-minute resolution (retain min/max/avg over each 5-minute window). After 90 days, downsample to 1-hour resolution. This reduces 1-year storage by over 95%. Downsampling runs as a background compaction job — never delete raw data until the downsampled output is confirmed.
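The retention arithmetic can be checked directly (constants from the text: 15-second scrape interval, 1M series, 1.37 bytes/sample for Gorilla, 16 bytes/sample raw):

```python
series = 1_000_000
samples_per_year = 365 * 24 * 60 * 4       # one sample every 15 seconds
total = series * samples_per_year
print(f"{total:,}")                         # prints 2,102,400,000,000

raw_tb = total * 16 / 1e12                  # uncompressed: ts + float64
gorilla_tb = total * 1.37 / 1e12            # Gorilla-compressed
print(f"raw: {raw_tb:.1f} TB, compressed: {gorilla_tb:.2f} TB")
# prints raw: 33.6 TB, compressed: 2.88 TB
```

Compression alone buys roughly 12×; the downsampling tiers described above are what make multi-year retention cheap.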

Interview Checklist

  • Data model: metric_name{label_key=label_value} → (timestamp, float) time series
  • Write path: WAL + memory buffer → periodic flush to immutable disk blocks
  • Compression: delta encoding for timestamps, XOR/Gorilla for float values
  • Query: time-range scan over indexed blocks, PromQL aggregations
  • Scale: sharded Prometheus + Thanos (S3 long-term) or Cortex/Mimir
  • Retention: raw → 5-minute → 1-hour downsampling tiers

Frequently Asked Questions

How does Prometheus store time series data efficiently?

Prometheus uses a custom storage format optimized for time series metrics. Write path: raw samples are first written to a WAL (Write-Ahead Log) on disk for crash recovery, then buffered in memory in 2-hour chunks per time series. When a 2-hour memory chunk is complete, it is flushed to disk as an immutable block and compressed. Block format: each block covers a 2-hour window and contains separate files for chunks (the actual sample data), an index (mapping metric labels to chunk locations), and metadata. Compression: Prometheus uses Gorilla-style XOR compression for sample values (consecutive float64 values share many bits, so XOR produces small integers) and delta encoding for timestamps (store differences, not absolute values). This achieves ~1.3-1.5 bytes per sample versus 16 bytes raw (8 bytes timestamp + 8 bytes float64). Compaction: over time, small 2-hour blocks are merged by a background process into larger 4-hour, 8-hour, … blocks. Larger blocks enable more aggressive compression and faster range queries (fewer blocks to scan). Retention: Prometheus deletes blocks outside the retention window (default 15 days) during compaction. In long-term storage setups (Thanos/Cortex), a separate compactor also handles downsampling.

What is high cardinality and why is it a problem for time series databases?

Cardinality in a TSDB is the number of unique time series: the number of unique combinations of metric name + all label values. High cardinality means millions or billions of unique series. Example: http_requests_total{user_id="12345", endpoint="/api/v2/users"} — if user_id can be any of 100M user IDs and there are 100 endpoints, cardinality = 100M × 100 = 10 billion unique time series. Why this is a problem: (1) Memory: Prometheus keeps the last two hours of all active series in RAM; at 4KB per series, 10 billion series would need 40 terabytes of RAM. (2) Disk: each unique series requires its own chunk data and index entries. (3) Query performance: queries that don't specify a high-cardinality label must scan all series (e.g., summing over all user_ids). (4) Index size: the label index grows proportionally to cardinality. Solution: never use unbounded high-cardinality identifiers (user_id, request_id, session_id, IP address) as labels. Use low-cardinality labels only: environment (prod/staging), region (us-east/eu-west), service, endpoint (bucketed path, not raw URL with IDs), HTTP method, status code. For per-user metrics, use a different storage system (an OLAP database or event store) designed for high-cardinality data.

How does PromQL compute rate() and why is it needed for counters?

Prometheus counters are monotonically increasing: they only go up (e.g., http_requests_total = 1,500,000 and climbing with every request). Querying the raw counter value tells you the total since the server started, not the current request rate. rate() computes the per-second rate of increase over a time window: rate(http_requests_total[5m]) ≈ (last_value - first_value) / 300 seconds. This gives "requests per second, averaged over the last 5 minutes." PromQL handles counter resets automatically: if the counter resets to 0 (e.g., on a server restart), rate() detects the reset (a sample lower than its predecessor) and counts only the increase since the reset, rather than producing a spurious negative spike. irate() computes the instantaneous rate from only the last two data points: more responsive to sudden changes, but noisier (high variance). rate() over a longer window is smoother, which is appropriate for dashboards and alerting. increase(counter[5m]) = rate(counter[5m]) × 300 = the total increase over the 5-minute window (as a float, not necessarily an integer, due to extrapolation). Rule: always use rate() or irate() with counters; never display a raw counter value on a dashboard, because it means nothing without knowing when the counter started.
