Requirements
An analytics dashboard lets users visualize business metrics: revenue over time, active users, conversion rates, funnel analysis, A/B test results. Two modes: real-time (data is seconds old, e.g., live event counts) and historical (pre-aggregated, queries over months of data). Scale: a mid-size SaaS company generates 10-100 million events per day. An analytics product (Amplitude, Mixpanel, Datadog) may ingest billions of events per day from thousands of customers. The core technical challenge: making arbitrary slice-and-dice queries fast over billions of rows without forcing users to wait minutes for results.
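A quick back-of-envelope check on these numbers (the peak factor and per-event size below are illustrative assumptions, not from any specific deployment):

```python
# Back-of-envelope ingest rate: assumes 100M events/day, peak ~5x the daily
# average, and ~1 KB per enriched event. All three figures are illustrative.
events_per_day = 100_000_000
avg_eps = events_per_day / 86_400              # ~1,157 events/sec on average
peak_eps = avg_eps * 5                         # ~5,800 events/sec at peak
daily_volume_gb = events_per_day * 1_000 / 1e9 # ~100 GB/day of raw events

print(f"avg {avg_eps:,.0f} eps, peak {peak_eps:,.0f} eps, ~{daily_volume_gb:.0f} GB/day")
```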
Event Ingestion Pipeline
Events are the raw input: page views, clicks, purchases, API calls. Each event has: event_id, event_type, user_id, timestamp, properties (JSON: {page, country, device, revenue, …}). Ingestion path: client SDK (web, mobile) → event collector API → validation and enrichment → Kafka topic. Why Kafka: decouples ingestion from processing, handles bursts, provides replay capability (reprocess events if a bug is found in the aggregation logic). Throughput: a well-partitioned Kafka cluster can sustain millions of events per second. Event validation at ingest: drop events with missing required fields, enforce schema per event type, check for bot traffic (rate limiting per user_id). Enrichment: add server-side properties (IP geolocation → country, timestamp normalization to UTC). Schema evolution: use Avro or Protobuf with a schema registry. New event properties can be added without breaking existing consumers. Kafka consumers: (1) real-time aggregator → Redis (for live dashboards), (2) batch loader → data warehouse (BigQuery, Snowflake, Redshift), (3) stream processor → ClickHouse or Druid (for fast analytical queries).
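A minimal sketch of the collector's validate → enrich → publish step, assuming the kafka-python client; the events.raw topic name, required-field set, and geolocation helper are illustrative placeholders:

```python
# Collector-side validate -> enrich -> publish sketch (assumes kafka-python).
import json
import time
from kafka import KafkaProducer

REQUIRED_FIELDS = {"event_id", "event_type", "user_id", "timestamp"}  # illustrative

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def lookup_country(ip: str) -> str:
    # Placeholder for an IP-geolocation lookup (e.g., a MaxMind-style database).
    return "US"

def ingest(event: dict, client_ip: str) -> bool:
    # Validation: drop events missing required fields.
    if not REQUIRED_FIELDS.issubset(event):
        return False
    # Enrichment: server-side properties added at ingest time.
    event.setdefault("properties", {})
    event["properties"]["country"] = lookup_country(client_ip)
    event["received_at"] = int(time.time())  # server-side UTC epoch timestamp
    # Key by user_id so one user's events land in the same partition (per-user ordering).
    producer.send("events.raw", key=event["user_id"].encode(), value=event)
    return True
```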
Aggregation Layer
Pre-aggregation is the key to fast dashboard queries. Don't query raw events at dashboard render time; pre-aggregate into materialized summaries. Two approaches: Lambda architecture: batch layer (Spark jobs on S3 data, runs hourly/daily, produces accurate aggregates) + speed layer (real-time stream processing for the latest hour). Query = batch result + speed-layer delta. Complex to operate but accurate. Kappa architecture: all processing through a single stream pipeline (Flink, Spark Streaming). Reprocessing is done by replaying Kafka topics. Operationally simpler. Aggregation granularity: store minute-level, hour-level, and day-level aggregates for each metric. Dashboard queries select the appropriate granularity based on the time window. Querying minute-level data for a 1-year view would read 525,600 rows per metric; day-level reads 365 rows. AggregateTable: metric_name, dimensions (JSON), window_start, window_end, granularity, value. Example: {metric: 'revenue', dimensions: {country: 'US', product: 'PRO'}, window: '2024-01-01 14:00', granularity: '1h', value: 12500.00}.
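A sketch of how the backend could map the requested time window to a granularity and query the AggregateTable; the thresholds, table name (aggregates), and SQL shape are assumptions for illustration:

```python
# Granularity selection for the AggregateTable described above.
# Thresholds and the SQL shape are illustrative assumptions.
from datetime import datetime, timedelta

def pick_granularity(start: datetime, end: datetime) -> str:
    window = end - start
    if window <= timedelta(hours=6):
        return "1m"    # minute-level: short, zoomed-in views
    if window <= timedelta(days=14):
        return "1h"    # hour-level: weekly / bi-weekly views
    return "1d"        # day-level: monthly and yearly views

def aggregate_query(metric: str, start: datetime, end: datetime) -> tuple[str, dict]:
    granularity = pick_granularity(start, end)
    sql = """
        SELECT window_start, sum(value) AS value
        FROM aggregates
        WHERE metric_name = %(metric)s
          AND granularity = %(granularity)s
          AND window_start >= %(start)s AND window_start < %(end)s
        GROUP BY window_start
        ORDER BY window_start
    """
    return sql, {"metric": metric, "granularity": granularity, "start": start, "end": end}

# A 1-year view resolves to day-level: 365 rows instead of 525,600 minute rows.
sql, params = aggregate_query("revenue", datetime(2024, 1, 1), datetime(2025, 1, 1))
```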
Query Engine and OLAP
For ad-hoc queries that can't be pre-aggregated, use a columnar OLAP database. ClickHouse: typically 10-100x faster than a row-oriented database like PostgreSQL for analytical queries. Stores data column-by-column (not row-by-row). Columnar storage: if a query needs only revenue and country from a 100-column table, it reads only 2 columns instead of all 100. Compression: same-type values in a column compress much better than mixed-type rows. Vectorized execution: processes values in batches (blocks of thousands of rows), letting the CPU apply SIMD instructions to many values at once instead of interpreting row by row. Typical performance: a 10-billion-row table, full scan with GROUP BY country, returns in under a second on appropriately sized hardware. Partitioning by date: ClickHouse automatically skips partitions outside the query's time range. Index: sparse primary index on (date, user_id) to find the approximate data range. ClickHouse's MergeTree engine merges data parts in the background, similar to LSM trees. Druid: similar to ClickHouse but optimized for sub-second queries on real-time streaming data. Segments are immutable; new data is ingested in real time and becomes queryable immediately. Used by Netflix, Airbnb, and Twitter for large-scale analytics.
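A sketch of what the MergeTree table and a typical group-by might look like, assuming the clickhouse-driver Python client; the table, column names, and partitioning key are illustrative:

```python
# MergeTree table plus a columnar group-by query (assumes clickhouse-driver).
from clickhouse_driver import Client

client = Client(host="clickhouse")

client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_date   Date,
        event_time   DateTime,
        event_type   LowCardinality(String),
        user_id      String,
        country      LowCardinality(String),
        revenue      Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)   -- date partitions let ClickHouse skip whole months
    ORDER BY (event_date, user_id)      -- sparse primary index
""")

# Columnar scan touches only the country and revenue columns of the table.
rows = client.execute("""
    SELECT country, sum(revenue) AS total
    FROM events
    WHERE event_date >= today() - 30
    GROUP BY country
    ORDER BY total DESC
""")
```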
Dashboard Rendering and Caching
Dashboard query plan: user selects a time range, metrics, and dimension filters. Backend translates to: check Redis cache (key = {dashboard_id, params_hash}, TTL 60s) → if miss, query ClickHouse or the pre-aggregate table → serialize to JSON → cache result → return. Cache hit rate: for popular dashboards with fixed time ranges (last 7 days), cache hit rate > 90%. For user-specific filters or custom date ranges, hit rates are too low for caching to help; query directly. Auto-refresh: dashboards refresh every 30-60 seconds for live metrics. Use polling or Server-Sent Events. For real-time counters (live active users), use Redis counters updated by the stream processor, not OLAP queries. Approximate counting: for very high-cardinality queries (e.g., unique users per minute at 100M DAU), exact counting is expensive. Use HyperLogLog (Redis PFCOUNT): approximate unique counts in a small fixed space (about 12 KB per key in Redis) with a standard error of roughly 0.81%. Store an HLL per {metric, window} in Redis. Merge across time windows by merging the HLL structures (PFMERGE).
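A sketch of the cache-aside path and the HyperLogLog counters, assuming the redis-py client; the key layout, TTL, and the OLAP fallback function are illustrative:

```python
# Cache-aside dashboard query path and HyperLogLog unique counts (assumes redis-py).
import hashlib
import json
import redis

r = redis.Redis(host="redis")

def dashboard_data(dashboard_id: str, params: dict) -> dict:
    params_hash = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()
    key = f"dash:{dashboard_id}:{params_hash}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit: >90% for popular fixed ranges
    result = query_olap(dashboard_id, params)   # fall through to ClickHouse / aggregates
    r.setex(key, 60, json.dumps(result))        # 60s TTL bounds staleness for auto-refresh
    return result

def query_olap(dashboard_id: str, params: dict) -> dict:
    return {}  # placeholder for the ClickHouse / pre-aggregate query

# One HLL key per {metric, minute window}: the stream processor calls pfadd,
# dashboards merge windows with pfcount across keys (or pfmerge into a rollup key).
def record_active_user(minute: str, user_id: str) -> None:
    r.pfadd(f"hll:active_users:{minute}", user_id)

def unique_users(minutes: list[str]) -> int:
    return r.pfcount(*[f"hll:active_users:{m}" for m in minutes])
```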