Core Entities
Metric: metric_id, name, type (COUNTER, GAUGE, HISTOGRAM), unit, description, owner_id, created_at. DataPoint: metric_id, timestamp, value (DOUBLE), tags (key-value pairs for dimensions like region, service, environment). Dashboard: dashboard_id, name, owner_id, layout (JSON grid config), created_at. Widget: widget_id, dashboard_id, type (LINE_CHART, BAR_CHART, SINGLE_STAT, TABLE, HEATMAP), metric_query (JSON), position, size. Alert: alert_id, metric_id, condition (threshold, anomaly), window_minutes, notification_channels.
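A minimal sketch of the core entities as Python dataclasses. Field names follow the list above; the MetricType enum and float timestamps are illustrative representation choices, not part of the source design:

```python
from dataclasses import dataclass, field
from enum import Enum

class MetricType(Enum):
    COUNTER = "COUNTER"
    GAUGE = "GAUGE"
    HISTOGRAM = "HISTOGRAM"

@dataclass
class Metric:
    metric_id: str
    name: str
    type: MetricType
    unit: str
    description: str
    owner_id: str
    created_at: float  # unix epoch seconds

@dataclass
class DataPoint:
    metric_id: str
    timestamp: float
    value: float
    # Tag dimensions (region, service, environment) ride along as key-value pairs.
    tags: dict[str, str] = field(default_factory=dict)
```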
Time-Series Storage
Do not store raw data points in a relational database for high-cardinality metrics (millions of data points per day). Use a time-series database: InfluxDB, TimescaleDB (PostgreSQL extension), Prometheus, or ClickHouse. Key properties needed: fast writes (millions of inserts/sec), efficient time-range queries (SELECT value WHERE timestamp BETWEEN t1 AND t2), automatic downsampling (roll up minute-level data to hour/day), data retention policies (delete data older than 90 days automatically). TimescaleDB: partitions data into time-based chunks (hypertables). Queries within a time range only scan relevant chunks — O(chunk_size) not O(table_size).
Downsampling: store raw data at 1-second or 1-minute granularity for 7 days. Roll up to 5-minute averages for 30 days. Roll up to 1-hour averages for 1 year. This reduces storage by 99% for long-range queries. A background job runs every hour to compute and store aggregates. On query: if the requested window spans more than 30 days, query the hourly aggregate table; otherwise query the finer-grained table.
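The query-routing rule above can be sketched as a small tier lookup. The tier boundaries (7 days raw, 30 days of 5-minute rollups, 1 year of hourly rollups) come from the text; the table names are hypothetical:

```python
from datetime import timedelta

# Retention tiers, finest first. A query is routed to the finest table
# whose retention still covers the requested window.
TIERS = [
    (timedelta(days=7), "metrics_raw"),
    (timedelta(days=30), "metrics_5min"),
    (timedelta(days=365), "metrics_1h"),
]

def route_query(window: timedelta) -> str:
    """Pick the table to query for a given time window."""
    for retention, table in TIERS:
        if window <= retention:
            return table
    # Windows beyond the longest retention fall back to the coarsest table.
    return TIERS[-1][1]
```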
Metric Ingestion Pipeline
Producers (services, IoT devices, client SDKs) send metrics to an ingestion API. The ingestion API validates and writes to a Kafka topic (partitioned by metric_id for ordering). Consumers (stream processors) batch-write to the time-series DB every 100ms. Buffering in Kafka absorbs write spikes — the DB receives a steady stream regardless of producer bursts. The ingestion API returns immediately (fire-and-forget for metrics — occasional loss is acceptable). For critical business metrics (revenue, error rate): use synchronous writes with acknowledgment. Tag dimensions (region, host, service) are stored as metadata, enabling filtering without increasing the number of metric series.
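The consumer-side 100ms batching can be sketched as a micro-batcher. This is a simplified single-threaded sketch: write_fn stands in for a batched DB insert, and the injected clock exists only to keep the example self-contained — a real consumer would be driven by a Kafka poll loop:

```python
import time

class MicroBatcher:
    """Buffers incoming data points and flushes them in one batched write
    when either the batch is full or the flush interval (100 ms in the
    design above) has elapsed since the last flush."""

    def __init__(self, write_fn, flush_interval=0.1, max_batch=500,
                 clock=time.monotonic):
        self.write_fn = write_fn          # e.g. a batched INSERT into the TSDB
        self.flush_interval = flush_interval
        self.max_batch = max_batch
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def add(self, point):
        self.buffer.append(point)
        if (len(self.buffer) >= self.max_batch
                or self.clock() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if self.buffer:
            self.write_fn(self.buffer)
            self.buffer = []
        self.last_flush = self.clock()
```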
Query Engine
Widget queries describe: metric, time range, aggregation function (sum, avg, max, p99), group-by dimensions, and resolution. Query resolution: for a 7-day chart with 200 data points, resolution = 7 days / 200 points = ~50 minutes. The query engine rounds to the nearest available granularity (1-hour aggregates). Query execution: SELECT time_bucket('1 hour', timestamp) AS t, avg(value) FROM metrics WHERE metric_id=X AND timestamp BETWEEN t1 AND t2 GROUP BY t ORDER BY t. Results are cached in Redis with a TTL proportional to the query window (1-minute TTL for real-time, 1-hour TTL for historical). Cache key: hash(metric_id, time_range, resolution, tags).
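Resolution rounding and cache-key construction might look like the following sketch. The granularity list and the "q:" key prefix are assumptions for illustration; the key hashes the resolved time range, resolution, and sorted tags so two equivalent queries hit the same cache entry:

```python
import hashlib
import json

# Stored rollup granularities in seconds (assumed tiers: 1 min, 5 min, 1 h, 1 d).
GRANULARITIES = [60, 300, 3600, 86400]

def pick_resolution(window_seconds: float, target_points: int = 200) -> int:
    """Ideal bucket = window / target points, rounded UP to the nearest
    stored granularity so every chart point maps to a real aggregate."""
    ideal = window_seconds / target_points
    for g in GRANULARITIES:
        if g >= ideal:
            return g
    return GRANULARITIES[-1]

def cache_key(metric_id: str, t1: int, t2: int,
              resolution: int, tags: dict) -> str:
    """Deterministic Redis key; sorting tags makes it order-independent."""
    payload = json.dumps([metric_id, t1, t2, resolution, sorted(tags.items())])
    return "q:" + hashlib.sha256(payload.encode()).hexdigest()
```

For the 7-day / 200-point example above, the ideal bucket is ~3024 seconds, which rounds up to the 1-hour aggregate — matching the routing described in the text.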
Real-Time Updates
Dashboards can display live data (last 5 minutes refreshed every 10 seconds). Two approaches: (1) Polling: client requests new data points every 10 seconds. Simple, works everywhere, slight latency. (2) WebSocket push: server pushes new data points as they arrive. Lower latency, more complex. For most analytics dashboards, polling is sufficient — real-time to the second is not needed. When WebSocket is used: on each metric write, the stream processor publishes to a Redis pub/sub channel (metric:{id}:live). Dashboard server subscribes and pushes to connected clients. Rate-limit to one update per second per client to avoid overwhelming slow connections.
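The one-update-per-second-per-client rule can be sketched as a tiny rate limiter on the push path; the injected clock is only there to keep the sketch testable:

```python
import time

class PushRateLimiter:
    """Allows at most one pushed update per client per interval (1 s in
    the design above); extra updates inside the window are dropped, and
    the client catches up on the next allowed push."""

    def __init__(self, interval=1.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_sent = {}  # client_id -> timestamp of last push

    def allow(self, client_id) -> bool:
        now = self.clock()
        last = self.last_sent.get(client_id)
        if last is None or now - last >= self.interval:
            self.last_sent[client_id] = now
            return True
        return False
```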
Alerting System
Alert definition: if metric X stays above threshold Y for Z consecutive minutes, fire. Evaluation: a background job queries the last Z minutes of data for each active alert every minute. If the condition is met: create an AlertFire event, notify via configured channels (PagerDuty, Slack, email). Implement alert suppression: do not re-fire the same alert if it fired within the last N minutes (hysteresis). Alert recovery: when the metric drops back below the threshold for Z minutes, fire a recovery notification. Store all AlertFire events for audit. Dashboard displays active alerts as colored banners on affected widgets.
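The evaluate-every-minute loop with suppression and recovery can be sketched as a per-alert state machine. The NORMAL/PENDING/FIRING/RECOVERING states and the per-minute cadence follow the design above; the separate, lower recovery_threshold is a hysteresis assumption that keeps an oscillating metric from flapping:

```python
from enum import Enum

class AlertState(Enum):
    NORMAL = "NORMAL"
    PENDING = "PENDING"        # threshold crossed, not yet sustained
    FIRING = "FIRING"          # sustained breach, notification sent
    RECOVERING = "RECOVERING"  # below recovery threshold, not yet sustained

class ThresholdAlert:
    """Call evaluate() once per minute with the latest metric value.
    Returns 'fire' or 'recover' exactly on state transitions, else None —
    so notifications are sent only on NORMAL->FIRING and FIRING->NORMAL."""

    def __init__(self, alert_threshold, recovery_threshold, window_minutes):
        self.alert_threshold = alert_threshold
        self.recovery_threshold = recovery_threshold
        self.window = window_minutes
        self.state = AlertState.NORMAL
        self.streak = 0  # consecutive minutes the pending condition held

    def evaluate(self, value):
        if self.state in (AlertState.NORMAL, AlertState.PENDING):
            if value > self.alert_threshold:
                self.state = AlertState.PENDING
                self.streak += 1
                if self.streak >= self.window:
                    self.state = AlertState.FIRING
                    self.streak = 0
                    return "fire"
            else:
                self.state = AlertState.NORMAL
                self.streak = 0
        else:  # FIRING or RECOVERING
            if value < self.recovery_threshold:
                self.state = AlertState.RECOVERING
                self.streak += 1
                if self.streak >= self.window:
                    self.state = AlertState.NORMAL
                    self.streak = 0
                    return "recover"
            else:
                self.state = AlertState.FIRING
                self.streak = 0
        return None
```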
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "Why use a time-series database instead of a standard relational database for metrics?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Relational databases store metrics as rows with (timestamp, metric_id, value). For millions of data points per day, query performance degrades: a time-range query must scan an index for matching timestamps, then join with the metric table. Writes create index churn. Time-series databases (InfluxDB, TimescaleDB, Prometheus) are purpose-built: data is stored in time-ordered chunks (TimescaleDB hypertables), so queries for a time range only scan the relevant chunks — O(chunk_size) not O(table_size). Automatic downsampling (rolling up fine-grained data to coarser granularity) and retention policies are built in. Write throughput is 10-100x higher because appends to the current chunk avoid random I/O. ClickHouse is a good fit for analytics-scale metrics that need SQL flexibility."
}
},
{
"@type": "Question",
"name": "How do you implement metric downsampling and retention policies?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Downsampling reduces data volume for long time ranges while preserving query performance. Store data at multiple granularities: raw (1-second or 1-minute) for the last 7 days, 5-minute averages for 30 days, 1-hour averages for 1 year, 1-day averages for 5 years. A scheduled job runs every hour: INSERT INTO metrics_5min SELECT time_bucket('5 min', timestamp), avg(value) FROM raw_metrics WHERE timestamp >= last_run GROUP BY 1. Query routing: if the requested range > 7 days, query the 5-minute table; > 30 days, query the hourly table. TimescaleDB continuous aggregates automate this. For aggregation functions: store sum and count separately (not just the average) so aggregates can be further aggregated correctly (an avg of avgs is wrong; sum/count is correct)."
}
},
{
"@type": "Question",
"name": "How do you design a query builder for dashboard widgets?",
"acceptedAnswer": {
"@type": "Answer",
"text": "A widget query describes: metric name, time range (relative like 'last 6 hours' or absolute), aggregation function (avg, sum, max, min, p95, p99), group-by tags (group by 'region'), filter tags (region='us-east-1'), and display resolution (number of data points). Store this as a JSON query definition on the widget. The backend translates it to a SQL or PromQL query at execution time. Relative time ranges are resolved at query time (not stored as absolute timestamps) so 'last 6 hours' always means the last 6 hours. Parameterize dashboards with template variables: a dashboard variable 'env' lets a dropdown switch all widgets between prod/staging. Template variables are substituted into the widget query JSON before execution."
}
},
{
"@type": "Question",
"name": "How do you implement threshold-based alerting with hysteresis?",
"acceptedAnswer": {
"@type": "Answer",
"text": "A threshold alert fires when a metric exceeds a threshold for a sustained window (e.g., error_rate > 5% for 5 consecutive minutes). Without hysteresis: if a metric oscillates between 4.9% and 5.1%, the alert fires and recovers repeatedly — an alert storm. With hysteresis: fire when the metric > 5% for 5 minutes (alert threshold); recover only when the metric < 3% for 5 minutes (recovery threshold). The gap between alert and recovery thresholds prevents flapping. Implement this with an alert state machine: NORMAL → PENDING (threshold crossed but not for long enough) → FIRING (sustained breach) → RECOVERING (dropped below recovery threshold) → NORMAL. Store the state and last transition time. Evaluate every minute; only notify on NORMAL→FIRING and FIRING→NORMAL transitions."
}
},
{
"@type": "Question",
"name": "How do you scale an analytics dashboard to handle thousands of concurrent users?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Query caching is the primary lever: cache query results in Redis with TTL proportional to the time range (1-minute TTL for last-5-minutes queries; 1-hour TTL for last-30-days queries). Most dashboard views are identical — 1000 users looking at the same "company overview" dashboard all get the same cached result. Cache key includes the resolved time range, metric IDs, and tags. Pre-computation: for popular dashboards, run queries proactively every minute and warm the cache — viewers get instant results. Read replicas: route all dashboard queries to read replicas of the time-series DB; writes go to the primary. WebSocket connections for live data use a pub/sub fan-out model: one subscription per metric at the server level, fan out to N connected clients — avoids N database queries for N viewers of the same metric."
}
}
]
}