Core Entities
Metric: metric_id, name, type (COUNTER, GAUGE, HISTOGRAM), unit, description, owner_id, created_at. DataPoint: metric_id, timestamp, value (DOUBLE), tags (key-value pairs for dimensions like region, service, environment). Dashboard: dashboard_id, name, owner_id, layout (JSON grid config), created_at. Widget: widget_id, dashboard_id, type (LINE_CHART, BAR_CHART, SINGLE_STAT, TABLE, HEATMAP), metric_query (JSON), position, size. Alert: alert_id, metric_id, condition (threshold, anomaly), window_minutes, notification_channels.
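A minimal sketch of the core entities as Python dataclasses. Field names follow the list above; the MetricType enum and float timestamps are illustrative representation choices, not part of the source design:

```python
from dataclasses import dataclass, field
from enum import Enum

class MetricType(Enum):
    COUNTER = "COUNTER"
    GAUGE = "GAUGE"
    HISTOGRAM = "HISTOGRAM"

@dataclass
class Metric:
    metric_id: str
    name: str
    type: MetricType
    unit: str
    description: str
    owner_id: str
    created_at: float  # unix epoch seconds

@dataclass
class DataPoint:
    metric_id: str
    timestamp: float
    value: float
    # Tag dimensions (region, service, environment) ride along as key-value pairs.
    tags: dict[str, str] = field(default_factory=dict)
```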
Time-Series Storage
Do not store raw data points in a relational database for high-cardinality metrics (millions of data points per day). Use a time-series database: InfluxDB, TimescaleDB (PostgreSQL extension), Prometheus, or ClickHouse. Key properties needed: fast writes (millions of inserts/sec), efficient time-range queries (SELECT value WHERE timestamp BETWEEN t1 AND t2), automatic downsampling (roll up minute-level data to hour/day), data retention policies (delete data older than 90 days automatically). TimescaleDB: partitions data into time-based chunks (hypertables). Queries within a time range only scan relevant chunks — O(chunk_size) not O(table_size).
Downsampling: store raw data at 1-second or 1-minute granularity for 7 days. Roll up to 5-minute averages for 30 days. Roll up to 1-hour averages for 1 year. This reduces storage by 99% for long-range queries. A background job runs every hour to compute and store aggregates. On query: if the requested window spans more than 30 days, query the hourly aggregate table; otherwise query the finer-grained table.
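The query-routing rule above can be sketched as a small tier lookup. The tier boundaries (7 days raw, 30 days of 5-minute rollups, 1 year of hourly rollups) come from the text; the table names are hypothetical:

```python
from datetime import timedelta

# Retention tiers, finest first. A query is routed to the finest table
# whose retention still covers the requested window.
TIERS = [
    (timedelta(days=7), "metrics_raw"),
    (timedelta(days=30), "metrics_5min"),
    (timedelta(days=365), "metrics_1h"),
]

def route_query(window: timedelta) -> str:
    """Pick the table to query for a given time window."""
    for retention, table in TIERS:
        if window <= retention:
            return table
    # Windows beyond the longest retention fall back to the coarsest table.
    return TIERS[-1][1]
```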
Metric Ingestion Pipeline
Producers (services, IoT devices, client SDKs) send metrics to an ingestion API. The ingestion API validates and writes to a Kafka topic (partitioned by metric_id for ordering). Consumers (stream processors) batch-write to the time-series DB every 100ms. Buffering in Kafka absorbs write spikes — the DB receives a steady stream regardless of producer bursts. The ingestion API returns immediately (fire-and-forget for metrics — occasional loss is acceptable). For critical business metrics (revenue, error rate): use synchronous writes with acknowledgment. Tag dimensions (region, host, service) are stored as metadata, enabling filtering without increasing the number of metric series.
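The consumer-side 100ms batching can be sketched as a micro-batcher. This is a simplified single-threaded sketch: write_fn stands in for a batched DB insert, and the injected clock exists only to keep the example self-contained — a real consumer would be driven by a Kafka poll loop:

```python
import time

class MicroBatcher:
    """Buffers incoming data points and flushes them in one batched write
    when either the batch is full or the flush interval (100 ms in the
    design above) has elapsed since the last flush."""

    def __init__(self, write_fn, flush_interval=0.1, max_batch=500,
                 clock=time.monotonic):
        self.write_fn = write_fn          # e.g. a batched INSERT into the TSDB
        self.flush_interval = flush_interval
        self.max_batch = max_batch
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def add(self, point):
        self.buffer.append(point)
        if (len(self.buffer) >= self.max_batch
                or self.clock() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        if self.buffer:
            self.write_fn(self.buffer)
            self.buffer = []
        self.last_flush = self.clock()
```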
Query Engine
Widget queries describe: metric, time range, aggregation function (sum, avg, max, p99), group-by dimensions, and resolution. Query resolution: for a 7-day chart with 200 data points, resolution = 7 days / 200 points = ~50 minutes. The query engine rounds to the nearest available granularity (1-hour aggregates). Query execution: SELECT time_bucket('1 hour', timestamp) AS t, avg(value) FROM metrics WHERE metric_id=X AND timestamp BETWEEN t1 AND t2 GROUP BY t ORDER BY t. Results are cached in Redis with a TTL proportional to the query window (1-minute TTL for real-time, 1-hour TTL for historical). Cache key: hash(metric_id, time_range, resolution, tags).
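Resolution rounding and cache-key construction might look like the following sketch. The granularity list and the "q:" key prefix are assumptions for illustration; the key hashes the resolved time range, resolution, and sorted tags so two equivalent queries hit the same cache entry:

```python
import hashlib
import json

# Stored rollup granularities in seconds (assumed tiers: 1 min, 5 min, 1 h, 1 d).
GRANULARITIES = [60, 300, 3600, 86400]

def pick_resolution(window_seconds: float, target_points: int = 200) -> int:
    """Ideal bucket = window / target points, rounded UP to the nearest
    stored granularity so every chart point maps to a real aggregate."""
    ideal = window_seconds / target_points
    for g in GRANULARITIES:
        if g >= ideal:
            return g
    return GRANULARITIES[-1]

def cache_key(metric_id: str, t1: int, t2: int,
              resolution: int, tags: dict) -> str:
    """Deterministic Redis key; sorting tags makes it order-independent."""
    payload = json.dumps([metric_id, t1, t2, resolution, sorted(tags.items())])
    return "q:" + hashlib.sha256(payload.encode()).hexdigest()
```

For the 7-day / 200-point example above, the ideal bucket is ~3024 seconds, which rounds up to the 1-hour aggregate — matching the routing described in the text.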
Real-Time Updates
Dashboards can display live data (last 5 minutes refreshed every 10 seconds). Two approaches: (1) Polling: client requests new data points every 10 seconds. Simple, works everywhere, slight latency. (2) WebSocket push: server pushes new data points as they arrive. Lower latency, more complex. For most analytics dashboards, polling is sufficient — real-time to the second is not needed. When WebSocket is used: on each metric write, the stream processor publishes to a Redis pub/sub channel (metric:{id}:live). Dashboard server subscribes and pushes to connected clients. Rate-limit to one update per second per client to avoid overwhelming slow connections.
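The one-update-per-second-per-client rule can be sketched as a tiny rate limiter on the push path; the injected clock is only there to keep the sketch testable:

```python
import time

class PushRateLimiter:
    """Allows at most one pushed update per client per interval (1 s in
    the design above); extra updates inside the window are dropped, and
    the client catches up on the next allowed push."""

    def __init__(self, interval=1.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_sent = {}  # client_id -> timestamp of last push

    def allow(self, client_id) -> bool:
        now = self.clock()
        last = self.last_sent.get(client_id)
        if last is None or now - last >= self.interval:
            self.last_sent[client_id] = now
            return True
        return False
```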
Alerting System
Alert definition: if metric X stays above threshold Y for Z consecutive minutes, fire. Evaluation: a background job queries the last Z minutes of data for each active alert every minute. If the condition is met: create an AlertFire event, notify via configured channels (PagerDuty, Slack, email). Implement alert suppression: do not re-fire the same alert if it fired within the last N minutes (hysteresis). Alert recovery: when the metric drops back below the threshold for Z minutes, fire a recovery notification. Store all AlertFire events for audit. Dashboard displays active alerts as colored banners on affected widgets.
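The evaluate-every-minute loop with suppression and recovery can be sketched as a per-alert state machine. The NORMAL/PENDING/FIRING/RECOVERING states and the per-minute cadence follow the design above; the separate, lower recovery_threshold is a hysteresis assumption that keeps an oscillating metric from flapping:

```python
from enum import Enum

class AlertState(Enum):
    NORMAL = "NORMAL"
    PENDING = "PENDING"        # threshold crossed, not yet sustained
    FIRING = "FIRING"          # sustained breach, notification sent
    RECOVERING = "RECOVERING"  # below recovery threshold, not yet sustained

class ThresholdAlert:
    """Call evaluate() once per minute with the latest metric value.
    Returns 'fire' or 'recover' exactly on state transitions, else None —
    so notifications are sent only on NORMAL->FIRING and FIRING->NORMAL."""

    def __init__(self, alert_threshold, recovery_threshold, window_minutes):
        self.alert_threshold = alert_threshold
        self.recovery_threshold = recovery_threshold
        self.window = window_minutes
        self.state = AlertState.NORMAL
        self.streak = 0  # consecutive minutes the pending condition held

    def evaluate(self, value):
        if self.state in (AlertState.NORMAL, AlertState.PENDING):
            if value > self.alert_threshold:
                self.state = AlertState.PENDING
                self.streak += 1
                if self.streak >= self.window:
                    self.state = AlertState.FIRING
                    self.streak = 0
                    return "fire"
            else:
                self.state = AlertState.NORMAL
                self.streak = 0
        else:  # FIRING or RECOVERING
            if value < self.recovery_threshold:
                self.state = AlertState.RECOVERING
                self.streak += 1
                if self.streak >= self.window:
                    self.state = AlertState.NORMAL
                    self.streak = 0
                    return "recover"
            else:
                self.state = AlertState.FIRING
                self.streak = 0
        return None
```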
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "Why use a time-series database instead of a standard relational database for metrics?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Relational databases store metrics as rows with (timestamp, metric_id, value). For millions of data points per day, query performance degrades: a time-range query must scan an index for matching timestamps, then join with the metric table. Writes create index churn. Time-series databases (InfluxDB, TimescaleDB, Prometheus) are purpose-built: data is stored in time-ordered chunks (TimescaleDB hypertables), so queries for a time range only scan the relevant chunks — O(chunk_size) not O(table_size). Automatic downsampling (rolling up fine-grained data to coarser granularity) and retention policies are built in. Write throughput is 10-100x higher because appends to the current chunk avoid random I/O. ClickHouse is a good fit for analytics-scale metrics that need SQL flexibility."
}
},
{
"@type": "Question",
"name": "How do you implement metric downsampling and retention policies?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Downsampling reduces data volume for long time ranges while preserving query performance. Store data at multiple granularities: raw (1-second or 1-minute) for the last 7 days, 5-minute averages for 30 days, 1-hour averages for 1 year, 1-day averages for 5 years. A scheduled job runs every hour: INSERT INTO metrics_5min SELECT time_bucket('5 min', timestamp), avg(value) FROM raw_metrics WHERE timestamp >= last_run GROUP BY 1. Query routing: if the requested range > 7 days, query the 5-minute table; > 30 days, query the hourly table. TimescaleDB continuous aggregates automate this. For aggregation functions: store sum and count separately (not just the average) so aggregates can be further aggregated correctly (an avg of avgs is wrong; sum/count is correct)."
}
},
{
"@type": "Question",
"name": "How do you design a query builder for dashboard widgets?",
"acceptedAnswer": {
"@type": "Answer",
"text": "A widget query describes: metric name, time range (relative like 'last 6 hours' or absolute), aggregation function (avg, sum, max, min, p95, p99), group-by tags (group by 'region'), filter tags (region='us-east-1'), and display resolution (number of data points). Store this as a JSON query definition on the widget. The backend translates it to a SQL or PromQL query at execution time. Relative time ranges are resolved at query time (not stored as absolute timestamps) so 'last 6 hours' always means the last 6 hours. Parameterize dashboards with template variables: a dashboard variable 'env' lets a dropdown switch all widgets between prod/staging. Template variables are substituted into the widget query JSON before execution."
}
},
{
"@type": "Question",
"name": "How do you implement threshold-based alerting with hysteresis?",
"acceptedAnswer": {
"@type": "Answer",
"text": "A threshold alert fires when a metric exceeds a threshold for a sustained window (e.g., error_rate > 5% for 5 consecutive minutes). Without hysteresis: if a metric oscillates between 4.9% and 5.1%, the alert fires and recovers repeatedly — an alert storm. With hysteresis: fire when the metric > 5% for 5 minutes (alert threshold); recover only when the metric < 3% for 5 minutes (recovery threshold). The gap between alert and recovery thresholds prevents flapping. Implement this with an alert state machine: NORMAL → PENDING (threshold crossed but not for long enough) → FIRING (sustained breach) → RECOVERING (dropped below recovery threshold) → NORMAL. Store the state and last transition time. Evaluate every minute; only notify on NORMAL→FIRING and FIRING→NORMAL transitions."
}
},
{
"@type": "Question",
"name": "How do you scale an analytics dashboard to handle thousands of concurrent users?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Query caching is the primary lever: cache query results in Redis with TTL proportional to the time range (1-minute TTL for last-5-minutes queries; 1-hour TTL for last-30-days queries). Most dashboard views are identical — 1000 users looking at the same "company overview" dashboard all get the same cached result. Cache key includes the resolved time range, metric IDs, and tags. Pre-computation: for popular dashboards, run queries proactively every minute and warm the cache — viewers get instant results. Read replicas: route all dashboard queries to read replicas of the time-series DB; writes go to the primary. WebSocket connections for live data use a pub/sub fan-out model: one subscription per metric at the server level, fan out to N connected clients — avoids N database queries for N viewers of the same metric."
}
}
]
}