System Design: Analytics Dashboard — Real-Time Metrics, Aggregation Pipeline, and Query Engine

Requirements

An analytics dashboard lets users visualize business metrics: revenue over time, active users, conversion rates, funnel analysis, A/B test results. Two modes: real-time (data is seconds old, e.g., live event counts) and historical (pre-aggregated, queries over months of data). Scale: a mid-size SaaS company generates 10-100 million events per day. An analytics product (Amplitude, Mixpanel, Datadog) may ingest billions of events per day from thousands of customers. The core technical challenge: making arbitrary slice-and-dice queries fast over billions of rows without forcing users to wait minutes for results.

Event Ingestion Pipeline

Events are the raw input: page views, clicks, purchases, API calls. Each event has: event_id, event_type, user_id, timestamp, properties (JSON: {page, country, device, revenue, …}). Ingestion path: client SDK (web, mobile) → event collector API → validation and enrichment → Kafka topic. Why Kafka: decouples ingestion from processing, handles bursts, provides replay capability (reprocess events if a bug is found in the aggregation logic). Throughput: Kafka can handle millions of events per second. Event validation at ingest: drop events with missing required fields, enforce schema per event type, check for bot traffic (rate limiting per user_id). Enrichment: add server-side properties (IP geolocation → country, timestamp normalization to UTC). Schema evolution: use Avro or Protobuf with a schema registry. New event properties can be added without breaking existing consumers. Kafka consumers: (1) real-time aggregator → Redis (for live dashboards), (2) batch loader → data warehouse (BigQuery, Snowflake, Redshift), (3) stream processor → ClickHouse or Druid (for fast analytical queries).

Aggregation Layer

Pre-aggregation is the key to fast dashboard queries. Don’t query raw events at dashboard render time — pre-aggregate into materialized summaries. Two approaches: Lambda architecture: batch layer (Spark jobs on S3 data, runs hourly/daily, produces accurate aggregates) + speed layer (real-time stream processing for the latest hour). Query = batch result + speed layer delta. Complex but accurate. Kappa architecture: all processing through a single stream pipeline (Flink, Spark Streaming). Reprocessing is done by replaying Kafka topics. Simpler operations. Aggregation granularity: store minute-level, hour-level, and day-level aggregates for each metric. Dashboard queries select the appropriate granularity based on the time window. Querying minute-level data for a 1-year view would be 525,600 rows per metric; day-level is 365 rows. AggregateTable: metric_name, dimensions (JSON), window_start, window_end, granularity, value. Example: {metric: 'revenue', dimensions: {country: 'US', product: 'PRO'}, window: '2024-01-01 14:00', granularity: '1h', value: 12500.00}.
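The granularity selection described above can be sketched as a simple rule: pick the coarsest granularity that still gives the chart enough data points. The threshold of 500 points is an illustrative assumption, not a prescription:

```python
def pick_granularity(window_seconds: int, target_points: int = 500) -> str:
    """Choose the coarsest granularity that keeps the result at or under
    roughly `target_points` rows. Names match the pre-aggregate tables."""
    for name, seconds in (("1m", 60), ("1h", 3600), ("1d", 86400)):
        if window_seconds / seconds <= target_points:
            return name
    # Fall back to day-level: a 1-year window reads 365 rows, not 525,600.
    return "1d"
```

A 1-hour window resolves to minute-level rows, a 7-day window to hour-level, and a 1-year window to day-level, matching the row counts in the text.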

Query Engine and OLAP

For ad-hoc queries that can’t be pre-aggregated, use a columnar OLAP database. ClickHouse is often 10-100x faster than a row-oriented database like PostgreSQL for analytical queries because it stores data column-by-column (not row-by-row). Columnar storage: if a query needs only revenue and country from a 100-column table, it reads only 2 columns instead of all 100. Compression: same-type values in a column compress much better than mixed-type rows. Vectorized execution: operates on batches of column values at a time, letting the CPU apply SIMD instructions instead of interpreting one row at a time. Typical performance: on suitable hardware, a full scan with GROUP BY country over a billions-of-rows table can return in roughly a second. Partitioning by date: ClickHouse skips partitions outside the query's time range. Index: a sparse primary index on (date, user_id) locates the approximate data range. The ClickHouse MergeTree engine merges data parts in the background, similar to LSM trees. Druid: similar to ClickHouse but optimized for sub-second queries on real-time streaming data. Segments are immutable; new data is ingested in real time and becomes queryable immediately. Used by Netflix, Airbnb, and Twitter for large-scale analytics.
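A toy illustration (pure Python, not ClickHouse) of why the columnar layout wins: the table is stored as one list per column, so a GROUP BY that needs only country and revenue never touches the other columns at all:

```python
from collections import defaultdict

# Column-oriented table: one list per column. A query reading only
# `country` and `revenue` skips every other column entirely.
table = {
    "country": ["US", "DE", "US", "FR"],
    "revenue": [100.0, 50.0, 25.0, 80.0],
    # ...imagine 98 more columns here that this query never reads
}

def revenue_by_country(cols: dict) -> dict:
    """SELECT country, sum(revenue) ... GROUP BY country, columnar-style."""
    totals = defaultdict(float)
    for country, revenue in zip(cols["country"], cols["revenue"]):
        totals[country] += revenue
    return dict(totals)
```

In a row store, the same query would deserialize every column of every row just to extract two fields; that difference in bytes read is where most of the 10-100x comes from.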

Dashboard Rendering and Caching

Dashboard query plan: user selects a time range, metrics, and dimension filters. Backend translates to: check Redis cache (key = {dashboard_id, params_hash}, TTL 60s) → if miss, query ClickHouse or the pre-aggregate table → serialize to JSON → cache result → return. Cache hit rate: for popular dashboards with fixed time ranges (last 7 days), hit rates above 90% are achievable. For user-specific filters or custom date ranges, caching rarely helps; query directly. Auto-refresh: dashboards refresh every 30-60 seconds for live metrics, via polling or Server-Sent Events. For real-time counters (live active users), use Redis counters updated by the stream processor, not OLAP queries. Sampling: for very high-cardinality queries (e.g., unique users per minute at 100M DAU), exact counting is expensive. Use HyperLogLog (Redis PFADD/PFCOUNT): approximate unique counts in a small fixed space (about 12 KB per key) with standard error around 0.81%. Store one HLL per {metric, window} in Redis and merge across time windows with PFMERGE.
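The read-through cache step above can be sketched as follows. This is a minimal sketch: `query_olap` stands in for the ClickHouse call, and an in-process dict stands in for Redis; both are assumptions for illustration:

```python
import hashlib
import json
import time

CACHE: dict[str, tuple[float, str]] = {}  # key -> (expires_at, json_payload)
TTL_SECONDS = 60

def cache_key(dashboard_id: str, params: dict) -> str:
    # Stable hash of the query parameters; sort_keys makes it deterministic.
    params_hash = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:16]
    return f"{dashboard_id}:{params_hash}"

def get_dashboard_data(dashboard_id: str, params: dict, query_olap) -> str:
    key = cache_key(dashboard_id, params)
    entry = CACHE.get(key)
    if entry and entry[0] > time.time():
        return entry[1]  # cache hit: skip the OLAP query entirely
    payload = json.dumps(query_olap(params))  # cache miss: query the backend
    CACHE[key] = (time.time() + TTL_SECONDS, payload)
    return payload
```

Two identical requests inside the TTL hit the backend only once; a changed filter produces a different params_hash and therefore a different cache entry.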


FAQ

Q: What is the difference between lambda and kappa architecture for analytics?
A: Lambda architecture has two separate layers: a batch layer (accurate, runs hourly or daily on historical data) and a speed layer (approximate, processes real-time data); queries merge both results. Kappa architecture has a single stream-processing layer that handles everything; historical data is reprocessed by replaying the stream. Lambda is more complex but handles backfill naturally; kappa is simpler operationally, but reprocessing at scale can be slow.

Q: Why use a columnar database like ClickHouse for analytics instead of PostgreSQL?
A: Columnar databases store data by column rather than by row. Analytical queries (SELECT sum(revenue), country FROM events GROUP BY country) need only a few columns out of potentially hundreds; columnar storage reads only those columns, resulting in 10-100x less I/O. Values in the same column also compress much better (similar data types and patterns). ClickHouse additionally uses vectorized execution, processing column values in SIMD-friendly batches for further speed gains.

Q: How does pre-aggregation improve dashboard query performance?
A: Instead of querying raw events at render time, pre-aggregate them into materialized summaries at multiple granularities (minute, hour, day). A query for "daily revenue last 30 days" reads 30 pre-computed values instead of scanning 30 days of raw events. The aggregation job runs continuously (for the latest window) or on a schedule. The tradeoff: pre-aggregation is only efficient for known dimensions; ad-hoc slice-and-dice queries still need the raw OLAP engine.

Q: How do you compute approximate unique user counts at massive scale?
A: Use HyperLogLog (HLL), a probabilistic data structure that estimates cardinality (unique count) in a small, fixed amount of memory (about 12 KB per key in Redis) with under 1% error. Redis supports HLL natively via PFADD (add element) and PFCOUNT (estimate cardinality). Store one HLL per (metric, time_window). To count unique users in a date range, merge the HLLs for each window (PFMERGE) and call PFCOUNT on the merged structure. Merging HLLs is cheap and supports arbitrary time-range queries.

Q: How do you handle late-arriving events in a streaming analytics pipeline?
A: Allow a watermark delay (e.g., 5 minutes): wait for events up to 5 minutes late before closing a time window. Events arriving within the watermark are included in the correct window. Events arriving after the watermark are either dropped or used to update the aggregate (triggering a recomputation of the affected window). Flink and Spark Streaming both support configurable watermarks and late-data handling (e.g., allowedLateness in Flink).
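The watermark behavior described above can be sketched with a tumbling-window counter (pure Python for illustration; Flink and Spark Streaming provide this natively, and the window and lateness values here are just example settings):

```python
WINDOW = 60       # 1-minute tumbling windows
LATENESS = 300    # accept events up to 5 minutes late

class WindowedCounter:
    def __init__(self):
        self.counts = {}       # window_start -> event count
        self.max_event_ts = 0  # watermark = max_event_ts - LATENESS

    def add(self, event_ts: int) -> bool:
        """Count the event; return False if it arrived past the watermark."""
        self.max_event_ts = max(self.max_event_ts, event_ts)
        watermark = self.max_event_ts - LATENESS
        window_start = event_ts - event_ts % WINDOW
        if window_start + WINDOW <= watermark:
            # Window already closed: drop (or trigger a recompute of it).
            return False
        self.counts[window_start] = self.counts.get(window_start, 0) + 1
        return True
```

An event 50 seconds late still lands in its correct window; an event arriving after the watermark has passed its window is rejected rather than silently miscounted.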
