What Is a Metrics and Monitoring System?
A metrics system collects numerical measurements from services (request rate, error rate, latency, CPU usage), stores them as time series, and enables querying, visualization, and alerting. Prometheus (pull-based) and Datadog (push-based) are the dominant systems. At scale: collecting millions of metrics per second from thousands of services, retaining months of history.
Metrics Types
- Counter: monotonically increasing value (total requests, total errors). Rate = derivative over time.
- Gauge: current value, can go up or down (memory usage, active connections, queue depth)
- Histogram: distribution of values in buckets (request latency in [<10ms, <50ms, <100ms, <500ms, +Inf] buckets)
- Summary: pre-computed quantiles (p50, p99) on the client side
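The semantics of the three core types can be sketched in plain Python. These classes are illustrative only, not the `prometheus_client` API; the bucket bounds are example values:

```python
import bisect

class Counter:
    """Monotonically increasing; callers may only add non-negative amounts."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Current value; free to move in either direction."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into cumulative le ("less than or equal") buckets."""
    def __init__(self, buckets=(0.01, 0.05, 0.1, 0.5)):  # seconds; +Inf implicit
        self.bounds = list(buckets)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0
    def observe(self, v):
        # cumulative buckets: every bucket whose bound >= v counts the observation
        i = bisect.bisect_left(self.bounds, v)
        for j in range(i, len(self.counts)):
            self.counts[j] += 1
        self.total += v

h = Histogram()
h.observe(0.03)          # lands in le=0.05 and every larger bucket
# h.counts == [0, 1, 1, 1, 1]
```

The cumulative-bucket layout is what lets histograms be summed across replicas: adding two histograms bucket-by-bucket is still a valid histogram.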
Pull vs. Push Architecture
Pull (Prometheus): scraping model. Prometheus server periodically fetches /metrics endpoint from each service. Advantages: Prometheus controls the scrape rate, easy to detect down services (missing scrapes), simple to debug (just curl /metrics). Disadvantage: doesn’t work for short-lived jobs (batch jobs die before Prometheus scrapes them) — use PushGateway for those.
Push (Datadog, StatsD): services send metrics to an agent or collector. Advantages: works for short-lived processes, no need to configure the server with every service endpoint. Disadvantages: metric storms (services can push too fast), and it is harder to apply backpressure.
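In the pull model, each service only has to render its current state as text when scraped. A minimal sketch of the Prometheus text exposition format (the metric names and values here are hypothetical, and real exporters also emit `# HELP`/`# TYPE` lines):

```python
def render_exposition(metrics):
    """Render metrics in the Prometheus text exposition format that a
    pull-based scraper fetches from /metrics.
    `metrics` maps (name, tuple-of-label-pairs) -> value."""
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical snapshot of one process's metrics:
sample = {
    ("http_requests_total", (("method", "GET"), ("status", "200"))): 1027,
    ("process_open_fds", ()): 42,
}
print(render_exposition(sample))
```

Because the output is plain text, "simple to debug (just curl /metrics)" really does mean exactly that.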
Time Series Storage
Each metric is identified by: metric_name + label set (key-value pairs). Example: http_requests_total{method="GET", path="/api/users", status="200"}. Time series = sequence of (timestamp, value) pairs.
Prometheus TSDB (time series database): stores data in two-hour blocks. Within a block, each time series is held as compressed chunks. Timestamps use delta-of-delta encoding and float values use XOR encoding (Gorilla compression, from Facebook's 2015 paper), averaging ~1.37 bytes per sample vs. 16 bytes raw, roughly a 10x reduction. A background compaction process merges small blocks into larger ones; downsampling for long-term retention is not done by Prometheus itself but by layers such as Thanos.
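A small sketch of why Gorilla-style encoding is so cheap on regular scrapes: timestamp delta-of-deltas are almost always zero, and the XOR of two similar float64 values has a long run of leading zero bits. This computes those two quantities rather than emitting a real bitstream:

```python
import struct

def xor_bits(a: float, b: float) -> int:
    """XOR the raw IEEE-754 bit patterns of two float64 values."""
    ai, = struct.unpack(">Q", struct.pack(">d", a))
    bi, = struct.unpack(">Q", struct.pack(">d", b))
    return ai ^ bi

def compression_sketch(samples):
    """For each sample (after the first two), report the timestamp
    delta-of-delta and the number of leading zero bits in the XOR with
    the previous value - the two things Gorilla encodes compactly."""
    out = []
    for i, (ts, val) in enumerate(samples):
        if i < 2:
            out.append((None, None))  # first samples are stored explicitly
            continue
        dod = (ts - samples[i - 1][0]) - (samples[i - 1][0] - samples[i - 2][0])
        x = xor_bits(val, samples[i - 1][1])
        leading_zeros = 64 - x.bit_length() if x else 64
        out.append((dod, leading_zeros))
    return out

# Regular 15 s scrapes of a slowly moving gauge (values are illustrative):
samples = [(0, 0.453), (15, 0.451), (30, 0.454), (45, 0.454)]
```

For this input every delta-of-delta is 0 (one bit each in the real encoding), and the unchanged final value XORs to all zeros, which is why "mostly flat, regularly scraped" series compress toward ~1.37 bytes/sample.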
Querying: PromQL
# Request rate (per second) over 5-minute window:
rate(http_requests_total{status="200"}[5m])
# Error ratio:
rate(http_requests_total{status=~"5.."}[5m])
/ rate(http_requests_total[5m])
# 99th percentile latency:
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
)
# CPU usage per pod:
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
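The rate() in the first query can be sketched in Python. This is a simplified version: real PromQL also extrapolates the increase to the window edges, which is omitted here, and the sample values are made up:

```python
def prom_rate(samples, window):
    """Approximate PromQL rate(): per-second average increase of a counter
    over the last `window` seconds, handling counter resets (the value
    drops when a process restarts and its counter starts over from 0)."""
    end = samples[-1][0]
    in_win = [(t, v) for t, v in samples if t >= end - window]
    if len(in_win) < 2:
        return 0.0
    increase = 0.0
    for (t0, v0), (t1, v1) in zip(in_win, in_win[1:]):
        # On a reset, the post-restart value IS the increase since the reset.
        increase += (v1 - v0) if v1 >= v0 else v1
    return increase / (in_win[-1][0] - in_win[0][0])

# 15 s scrapes with a counter reset at t=45:
ticks = [(0, 0), (15, 60), (30, 120), (45, 0), (60, 60)]
print(prom_rate(ticks, 60))  # → 3.0 requests/sec despite the reset
```

Reset handling is why you always take rate() of the raw counter rather than subtracting samples by hand: a restart would otherwise show up as a huge negative spike.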
Alerting
Alert rules defined in Prometheus: evaluate PromQL expressions on a schedule. When condition is true for longer than `for` duration, alert fires. Alertmanager: receives alert notifications, deduplicates (same alert from multiple Prometheus instances), groups related alerts, routes to appropriate receiver (PagerDuty, Slack, email), silences during maintenance windows.
- alert: HighErrorRate
  expr: |
    rate(http_requests_total{status=~"5.."}[5m])
      / rate(http_requests_total[5m]) > 0.01
  for: 5m
  annotations:
    summary: "Error rate above 1% for 5 minutes"
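The `for: 5m` behavior amounts to a small state check over recent rule evaluations. A sketch (simplified: real Prometheus tracks a pending-since timestamp per alert instance rather than a boolean history):

```python
def alert_firing(eval_history, for_seconds, interval):
    """True if the alert condition has been continuously true for at least
    `for_seconds`. `eval_history` holds the booleans from the most recent
    rule evaluations (oldest first), one per `interval` seconds."""
    needed = for_seconds // interval
    if len(eval_history) < needed:
        return False
    return all(eval_history[-needed:])

# Rule evaluated every 60 s; for: 5m → needs 5 consecutive true evaluations.
history = [False, True, True, True, True, True]
print(alert_firing(history, 300, 60))  # → True

# One false evaluation inside the window resets the clock:
print(alert_firing([True, True, True, False, True, True], 300, 60))  # → False
```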
Long-Term Storage
Prometheus retains 15 days by default (limited by local disk). For months or years of history, use Thanos (a sidecar that ships Prometheus blocks to S3/GCS) or Cortex/Mimir (which receive data via remote write); both give effectively unlimited retention in object storage. A global query layer allows querying across multiple Prometheus instances (multi-cluster view). Downsampling (e.g., 5-minute aggregates for 1 month, 1-hour aggregates for 1 year) dramatically reduces storage for historical data.
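Downsampling keeps a few aggregates per window rather than every raw sample. A sketch under the assumption that min/max/avg/count per window is enough for later queries (Thanos actually retains a slightly different aggregate set, including sum and a counter-specific series):

```python
def downsample(samples, resolution):
    """Collapse raw (timestamp, value) samples into fixed windows of
    `resolution` seconds, keeping min/max/avg/count per window."""
    buckets = {}
    for t, v in samples:
        buckets.setdefault(t - t % resolution, []).append(v)
    return {
        w: {"min": min(vs), "max": max(vs),
            "avg": sum(vs) / len(vs), "count": len(vs)}
        for w, vs in sorted(buckets.items())
    }

raw = [(0, 1.0), (15, 3.0), (30, 2.0), (300, 5.0)]
agg = downsample(raw, 300)  # 5-minute resolution
# agg[0] summarizes the first three samples; agg[300] holds the fourth
```

Keeping min and max (not just the average) matters: an average-only downsample would silently erase the latency spikes you most want to see in historical data.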
Interview Tips
- Four golden signals (Google SRE): Latency, Traffic (requests/sec), Errors, Saturation (resource utilization).
- Rate over range vector: rate() computes per-second average over the window. increase() computes total increase.
- Cardinality explosion: each unique label combination is a separate time series. Avoid high-cardinality labels (user_id, request_id). At 1M series scraped every 15s, that is (86400/15) × 1M ≈ 5.8 billion samples/day; even at ~1.4 compressed bytes/sample that is roughly 8 GB/day, before index overhead.
- Histogram vs. Summary: histogram allows server-side aggregation (sum histograms across replicas); summary quantiles are client-side and cannot be aggregated.
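The histogram_quantile() estimate from the query section can be sketched directly: find the cumulative bucket the target rank falls into, then linearly interpolate within it (the bucket counts below are made up):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets, the way
    PromQL's histogram_quantile() does. `buckets` is a sorted list of
    (upper_bound, cumulative_count); the last bound is float('inf')."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count

buckets = [(0.01, 50), (0.05, 90), (0.1, 99), (0.5, 100), (float("inf"), 100)]
print(histogram_quantile(0.99, buckets))  # → 0.1 (rank 99 is the le=0.1 edge)
```

The linear interpolation is why bucket boundaries matter: the estimate can be off by up to a bucket's width, so choose bounds near the latencies you alert on (e.g., a bucket edge at your 500ms SLO).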
FAQ
What are the four golden signals and how do you monitor them?
The four golden signals (Google SRE book) are the most important metrics for any service. (1) Latency: time to serve a request. Measure percentiles (p50, p95, p99), not averages; a high p99 with a normal p50 means a subset of users has a very bad experience. PromQL: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])). (2) Traffic: demand on your system: requests per second, messages per second, transactions per second. PromQL: rate(http_requests_total[5m]). (3) Errors: rate of failed requests: 5xx responses, exception rate, failed Kafka consumer messages. PromQL: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]). (4) Saturation: how "full" your service is: CPU, memory, disk, queue depth; how close you are to capacity. PromQL: sum(container_cpu_usage_seconds_total) / sum(kube_node_status_capacity_cpu_cores). Typical alerts: error rate > 1%, p99 latency > 500ms, saturation > 80%. The signals are ordered: latency and errors are user-facing (impact now), traffic tells you why, and saturation predicts future problems.
How does Prometheus TSDB compress time series data efficiently?
Prometheus TSDB uses Gorilla-style compression (from Facebook's 2015 paper), achieving ~1.37 bytes per sample vs. 16 bytes raw (int64 timestamp + float64 value). Timestamp compression: samples arrive at regular intervals (e.g., every 15 seconds). The first timestamp is stored explicitly; subsequent timestamps store the delta from the previous one, and then the delta-of-delta, which is usually 0 for regular scrapes. A zero delta-of-delta costs a single bit; anything else gets a small variable-length encoding, so most samples spend almost nothing on timestamps. Value compression: XOR the current and previous float64. Consecutive measurements of a slowly changing gauge (CPU usage: 0.453, 0.451, 0.454) have nearly identical bit patterns, so the XOR has many leading zeros and only a small significant portion; encode a leading-zero count prefix plus just the changed bits. Storage structure: samples are grouped into immutable chunks of ~120 samples (30 minutes at a 15s interval), and multiple chunks form a 2-hour block. Background compaction merges blocks into larger ones; downsampling for long-term retention is handled by layers like Thanos rather than by Prometheus itself.
How do you design alerting to minimize alert fatigue?
Alert fatigue occurs when on-call engineers receive too many alerts, many of them low-severity, flapping, or duplicate. Engineers start ignoring or silencing alerts, which causes real incidents to be missed. Principles for good alerting: (1) Alert on symptoms, not causes. "User-facing error rate > 1%" is actionable; "MySQL slave replication lag" is a cause and should only alert if it leads to user impact. (2) Use a `for` duration. A momentary spike shouldn't wake someone at 3am; `for: 5m` means the condition must hold continuously for 5 minutes before firing. (3) Severity levels: P1 (service down, pages immediately), P2 (high error rate, pages), P3 (warning, Slack notification). Only P1 and P2 should page. (4) Deduplication and grouping: Alertmanager groups related alerts (same service, same time window) into one notification, so 100 pods all alerting on high memory become one grouped alert. (5) Inhibition rules: when a datacenter is down (P1), suppress lower-severity alerts for services in that datacenter, since they are expected. (6) Deadman's switch: alert if no data arrives at all; the monitoring system itself failing silently is as dangerous as the monitored system failing.