Core Entities
Metric: metric_id, name, type (COUNTER, GAUGE, HISTOGRAM), unit, description, owner_id, created_at.
DataPoint: metric_id, timestamp, value (DOUBLE), tags (key-value pairs for dimensions like region, service, environment).
Dashboard: dashboard_id, name, owner_id, layout (JSON grid config), created_at.
Widget: widget_id, dashboard_id, type (LINE_CHART, BAR_CHART, SINGLE_STAT, TABLE, HEATMAP), metric_query (JSON), position, size.
Alert: alert_id, metric_id, condition (threshold, anomaly), window_minutes, notification_channels.
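To make the field list concrete, here is a minimal sketch of the main entities as Python dataclasses (Dashboard and Widget follow the same pattern and are omitted). The field names come from the list above; the example tag and condition values are illustrative, not part of the source design.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class MetricType(Enum):
    COUNTER = "counter"
    GAUGE = "gauge"
    HISTOGRAM = "histogram"


@dataclass
class Metric:
    metric_id: str
    name: str
    type: MetricType
    unit: str
    description: str
    owner_id: str
    created_at: datetime


@dataclass
class DataPoint:
    metric_id: str
    timestamp: datetime
    value: float                                         # stored as DOUBLE in the TSDB
    tags: dict[str, str] = field(default_factory=dict)   # e.g. {"region": "us-east-1", "service": "api"}


@dataclass
class Alert:
    alert_id: str
    metric_id: str
    condition: dict                                       # e.g. {"kind": "threshold", "above": 0.95}
    window_minutes: int
    notification_channels: list[str] = field(default_factory=list)
```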
Time-Series Storage
Do not store raw data points in a plain relational database for high-volume, high-cardinality metrics (millions of data points per day). Use a time-series database: InfluxDB, TimescaleDB (a PostgreSQL extension), Prometheus, or ClickHouse. Key properties needed: fast writes (millions of inserts/sec), efficient time-range queries (SELECT value WHERE timestamp BETWEEN t1 AND t2), automatic downsampling (roll minute-level data up to hour/day), and data retention policies (automatically delete data older than 90 days). TimescaleDB, for example, turns a table into a hypertable partitioned into time-based chunks; a time-range query scans only the relevant chunks, so cost is O(chunk_size) rather than O(table_size).
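As a concrete illustration of the TimescaleDB approach, the sketch below creates a hypertable with daily chunks and a 90-day retention policy. The connection string, table, and column names are assumptions for this example; create_hypertable and add_retention_policy are standard TimescaleDB functions.

```python
import psycopg2

# Hypothetical connection string and schema for the raw data-point table.
conn = psycopg2.connect("dbname=metrics user=dashboard")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS data_points (
            metric_id   TEXT             NOT NULL,
            timestamp   TIMESTAMPTZ      NOT NULL,
            value       DOUBLE PRECISION NOT NULL,
            tags        JSONB            DEFAULT '{}'::jsonb
        );
    """)
    # Partition the table into time-based chunks (one chunk per day here).
    cur.execute("""
        SELECT create_hypertable('data_points', 'timestamp',
                                 chunk_time_interval => INTERVAL '1 day',
                                 if_not_exists => TRUE);
    """)
    # Automatically drop chunks older than 90 days.
    cur.execute("""
        SELECT add_retention_policy('data_points', INTERVAL '90 days',
                                    if_not_exists => TRUE);
    """)
```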
Downsampling: store raw data at 1-second or 1-minute granularity for 7 days, roll up to 5-minute averages kept for 30 days, and to 1-hour averages kept for 1 year. This cuts the volume stored and scanned for long-range queries by roughly 99% (one hourly average replaces sixty 1-minute points). A background job runs every hour to compute and store the aggregates. On query, if the requested window spans more than 30 days, read the hourly aggregate table; otherwise read the finer-grained table.
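A small sketch of that routing rule, assuming hypothetical table names for the three retention tiers:

```python
from datetime import datetime, timedelta

# Hypothetical table names for the three retention tiers described above.
RAW_TABLE = "data_points"            # 1-second/1-minute granularity, kept 7 days
FIVE_MIN_TABLE = "data_points_5m"    # 5-minute averages, kept 30 days
HOURLY_TABLE = "data_points_1h"      # 1-hour averages, kept 1 year


def table_for_window(start: datetime, end: datetime) -> str:
    """Pick the coarsest table whose retention still covers the requested window."""
    window = end - start
    if window > timedelta(days=30):
        return HOURLY_TABLE
    if window > timedelta(days=7):
        return FIVE_MIN_TABLE
    return RAW_TABLE
```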
Metric Ingestion Pipeline
Producers (services, IoT devices, client SDKs) send metrics to an ingestion API. The ingestion API validates each point and writes it to a Kafka topic, partitioned by metric_id so points for a given metric stay ordered. Consumers (stream processors) batch-write to the time-series DB every 100 ms. Kafka buffering absorbs write spikes: the DB receives a steady stream regardless of producer bursts. The ingestion API returns immediately (fire-and-forget; for metrics, occasional loss is acceptable). For critical business metrics such as revenue or error rate, use synchronous writes with acknowledgment. Tag dimensions (region, host, service) are stored as metadata alongside each point, enabling filtering without creating a separate metric series per tag combination.
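A sketch of the consumer side of this pipeline, using kafka-python. The topic name, consumer group, and write_batch helper are hypothetical; the loop drains Kafka for roughly 100 ms, then issues one bulk insert and commits offsets only after the write succeeds.

```python
import json
import time

from kafka import KafkaConsumer  # kafka-python

# Hypothetical topic and group names; the producer uses metric_id as the message key
# so that all points for a metric land on the same partition, in order.
consumer = KafkaConsumer(
    "metric-datapoints",
    bootstrap_servers="kafka:9092",
    group_id="tsdb-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,
)

BATCH_INTERVAL_S = 0.1  # batch-write to the time-series DB every 100 ms


def write_batch(points: list[dict]) -> None:
    """Hypothetical helper: one multi-row INSERT into the data_points hypertable."""
    ...


while True:
    batch: list[dict] = []
    deadline = time.monotonic() + BATCH_INTERVAL_S
    while time.monotonic() < deadline:
        records = consumer.poll(timeout_ms=20)
        for partition_records in records.values():
            batch.extend(record.value for record in partition_records)
    if batch:
        write_batch(batch)   # steady stream of bulk inserts, regardless of producer bursts
        consumer.commit()    # commit offsets only after the DB write succeeds
```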
Query Engine
Widget queries describe: metric, time range, aggregation function (sum, avg, max, p99), group-by dimensions, and resolution. Query resolution: for a 7-day chart with 200 data points, resolution = 7 days / 200 points = ~50 minutes. The query engine rounds up to the next available granularity (here, 1-hour aggregates). Query execution: SELECT time_bucket('1 hour', timestamp) AS t, avg(value) FROM metrics WHERE metric_id=X AND timestamp BETWEEN t1 AND t2 GROUP BY t ORDER BY t. Results are cached in Redis with a TTL proportional to the query window (1-minute TTL for real-time, 1-hour TTL for historical). Cache key: hash(metric_id, time_range, resolution, tags).
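A sketch of this query path, reusing the data_points hypertable assumed earlier and the 200-point target. The resolution rounding, cache key, and window-proportional TTL follow the description above; the exact TTL cutoffs and granularity list are illustrative.

```python
import hashlib
import json

import psycopg2
import redis

r = redis.Redis()
conn = psycopg2.connect("dbname=metrics user=dashboard")

GRANULARITIES_S = [60, 300, 3600]  # assumed available bucket widths: 1 min, 5 min, 1 hour


def pick_resolution(window_s: int, target_points: int = 200) -> int:
    """Round the ideal bucket width up to the next available granularity."""
    ideal = window_s / target_points
    return next((g for g in GRANULARITIES_S if g >= ideal), GRANULARITIES_S[-1])


def run_widget_query(metric_id: str, t1: str, t2: str, window_s: int, tags: dict) -> list:
    resolution = pick_resolution(window_s)
    cache_key = "q:" + hashlib.sha256(
        json.dumps([metric_id, t1, t2, resolution, tags], sort_keys=True).encode()
    ).hexdigest()

    if (cached := r.get(cache_key)) is not None:
        return json.loads(cached)

    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT time_bucket(%s * INTERVAL '1 second', timestamp) AS t, avg(value)
            FROM data_points
            WHERE metric_id = %s AND timestamp BETWEEN %s AND %s
            GROUP BY t ORDER BY t
            """,
            (resolution, metric_id, t1, t2),
        )
        rows = [[str(t), v] for t, v in cur.fetchall()]

    # TTL proportional to the window: short for near-real-time, long for historical queries.
    ttl = 60 if window_s <= 3600 else 3600
    r.setex(cache_key, ttl, json.dumps(rows))
    return rows
```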
Real-Time Updates
Dashboards can display live data (last 5 minutes refreshed every 10 seconds). Two approaches: (1) Polling: client requests new data points every 10 seconds. Simple, works everywhere, slight latency. (2) WebSocket push: server pushes new data points as they arrive. Lower latency, more complex. For most analytics dashboards, polling is sufficient — real-time to the second is not needed. When WebSocket is used: on each metric write, the stream processor publishes to a Redis pub/sub channel (metric:{id}:live). Dashboard server subscribes and pushes to connected clients. Rate-limit to one update per second per client to avoid overwhelming slow connections.
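For the WebSocket path, here is a minimal sketch of the Redis pub/sub fan-out with the per-client rate limit. The channel naming follows the text; send is a hypothetical callable that writes one text frame to a connected client.

```python
import json
import time

import redis

r = redis.Redis()


def publish_live_point(metric_id: str, timestamp: float, value: float) -> None:
    """Called by the stream processor on each metric write."""
    r.publish(f"metric:{metric_id}:live", json.dumps({"t": timestamp, "v": value}))


def stream_to_client(metric_id: str, send) -> None:
    """Dashboard-server side: forward live points to one WebSocket client."""
    pubsub = r.pubsub(ignore_subscribe_messages=True)
    pubsub.subscribe(f"metric:{metric_id}:live")
    last_sent = 0.0
    for message in pubsub.listen():
        now = time.monotonic()
        if now - last_sent < 1.0:
            continue  # rate-limit: at most one update per second per client; extras are dropped
        send(message["data"].decode())
        last_sent = now
```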
Alerting System
Alert definition: if metric X stays above threshold Y for Z consecutive minutes, fire. Evaluation: every minute, a background job queries the last Z minutes of data for each active alert. If the condition is met, create an AlertFire event and notify the configured channels (PagerDuty, Slack, email). Implement alert suppression: do not re-fire the same alert if it already fired within the last N minutes (a cooldown that prevents notification storms). Alert recovery: when the metric stays back below the threshold for Z minutes, fire a recovery notification. Store all AlertFire events for audit. The dashboard displays active alerts as colored banners on the affected widgets.
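A sketch of one evaluation pass, assuming the Alert entity from the core model. fetch_points and notify are hypothetical helpers around the query engine and notification integrations, and the 15-minute suppression window is an illustrative choice for N.

```python
from datetime import datetime, timedelta, timezone

SUPPRESSION_MINUTES = 15  # N: do not re-fire within this window (assumed value)


def evaluate_alert(alert, fetch_points, last_fired_at, notify) -> None:
    """Run once per minute for each active alert."""
    now = datetime.now(timezone.utc)
    since = now - timedelta(minutes=alert.window_minutes)
    points = fetch_points(alert.metric_id, since)
    if not points:
        return

    threshold = alert.condition["above"]

    if all(p.value > threshold for p in points):
        # Breaching for the full window: fire unless suppressed.
        suppressed = (
            last_fired_at is not None
            and now - last_fired_at < timedelta(minutes=SUPPRESSION_MINUTES)
        )
        if not suppressed:
            notify(alert.notification_channels,
                   f"ALERT {alert.alert_id}: {alert.metric_id} above {threshold} "
                   f"for {alert.window_minutes} min")
            # Persist an AlertFire event for audit (omitted in this sketch).
    elif last_fired_at is not None and all(p.value <= threshold for p in points):
        # Below threshold for the full window after a previous fire: send recovery.
        notify(alert.notification_channels, f"RECOVERED {alert.alert_id}")
```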
Asked at: Databricks, Cloudflare, Stripe, Netflix