What Is a Real-Time Analytics Dashboard?
A real-time analytics dashboard displays live metrics and aggregations over streaming data: active users, revenue per minute, error rates, conversion funnels. Examples: Datadog dashboards, Stripe Radar, Google Analytics' real-time view. Core challenges: ingesting high-volume event streams, computing aggregations in near real time (seconds, not minutes), and serving many concurrent dashboard viewers efficiently.
System Requirements
Functional
- Ingest user events (page views, clicks, purchases, errors)
- Display real-time metrics: active users (last 5 min), events/second, revenue/hour
- Time-series charts with 1-minute granularity for the last 24 hours
- Filter by dimensions: country, device type, product category
- Anomaly highlighting: metric deviating more than 2 std devs from baseline
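The 2-standard-deviation rule can be sketched in a few lines of Python (a hypothetical `is_anomalous` helper; the shape of the baseline window is an assumption):

```python
import statistics

def is_anomalous(value, baseline, threshold=2.0):
    """Flag a metric value that deviates more than `threshold`
    standard deviations from the baseline window's mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:                 # flat baseline: any change is a deviation
        return value != mean
    return abs(value - mean) / stdev > threshold

# A steady baseline of ~100 events/min, then a spike to 180:
baseline = [98, 102, 100, 97, 103, 101, 99, 100]
```

In production the baseline would come from the same hour on previous days (to respect daily seasonality) rather than the immediately preceding minutes.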
Non-Functional
- 1M events/second ingestion
- Dashboard refresh every 10 seconds
- Query latency <500ms for a 24-hour time-series query
Architecture
Events ──► Kafka ──► Flink (streaming aggregation)
                       │
           ┌───────────┴───────────┐
           ▼                       ▼
  Redis (hot data,         ClickHouse/Druid
    last 24 h)          (historical, dimensional)
           │                       │
           └───────────┬───────────┘
                       ▼
                Query Service ──► Dashboard (WebSocket)
Event Ingestion
Client SDKs batch events (50 ms batches) and POST them to an ingestion service. The ingestion service validates, enriches (adds a server timestamp, geo from IP, device parsed from User-Agent), and produces to Kafka. Partitioning Kafka by user_id preserves per-user event ordering. 1M events/sec at 500 bytes/event = 500 MB/sec into Kafka, so roughly 50 partitions (~10 MB/sec each) across a 10-node cluster.
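The enrichment and partitioning steps can be sketched in plain Python (geo and device lookups are stubbed; a real system would use a GeoIP database and a User-Agent parser, and the Kafka producer call is omitted):

```python
import hashlib
import time

NUM_PARTITIONS = 50  # from the 500 MB/sec sizing above

def lookup_geo(ip: str) -> str:
    return "US"  # stub; real systems query a GeoIP database

def parse_device(ua: str) -> str:
    return "mobile" if "Mobile" in ua else "desktop"  # crude stub

def enrich(event: dict, client_ip: str, user_agent: str) -> dict:
    """Add server-side fields before producing to Kafka."""
    event["server_ts"] = time.time()
    event["geo"] = lookup_geo(client_ip)
    event["device"] = parse_device(user_agent)
    return event

def partition_for(user_id: str) -> int:
    """Stable hash of user_id -> partition, so one user's events
    always land on the same partition and stay ordered."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

The hash must be stable across producer restarts, which is why a fixed digest is used rather than Python's per-process `hash()`.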
Stream Processing with Flink
Flink jobs consume from Kafka and maintain windowed aggregations:
# Tumbling window: count events per minute per (event_type, country)
stream
    .key_by(lambda e: (e.event_type, e.country))
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(CountAggregate())
    .add_sink(RedisSink())

# Sliding window: active users over the last 5 minutes, advancing each minute.
# Key by a dimension (e.g. country), not user_id: keying by user_id would
# create one tiny window per user instead of one unique count per country.
stream
    .key_by(lambda e: e.country)
    .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
    .aggregate(UniqueCountAggregate(field="user_id"))  # HyperLogLog-backed
    .add_sink(RedisSink())
Hot Data in Redis
Flink writes aggregated results to Redis every minute (or every 10 seconds for near-real-time metrics). Data structures:
- Active users: HyperLogLog per minute bucket (low memory, approximate unique count)
- Event counts: Redis hash keyed by (event_type, minute_bucket)
- Revenue: Redis sorted set by timestamp for time-series
Redis holds 24 hours of per-minute data. At 1440 minutes/day * 50 metric combinations = 72K keys. Each key ~100 bytes = 7 MB — trivial.
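The key layout implied by the structures above might look like this (key names and the `minute_bucket` helper are illustrative, not prescribed by the design):

```python
from datetime import datetime, timezone

def minute_bucket(ts: float) -> str:
    """Floor a unix timestamp to its minute bucket, e.g. '2024-05-01T12:34'."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")

def keys_for(event_type: str, ts: float) -> dict:
    """Redis keys a Flink sink would touch for one event's minute."""
    bucket = minute_bucket(ts)
    return {
        "active_users": f"hll:active:{bucket}",  # PFADD <user_id>
        "event_count": f"counts:{bucket}",       # HINCRBY <event_type> 1
        "revenue": "ts:revenue",                 # ZADD score=ts member=amount
    }
```

Per-minute keys also make expiry trivial: set a 24-hour TTL on each bucket key and old data evicts itself.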
Historical Data in ClickHouse
For queries spanning days/weeks: events land in ClickHouse via Kafka consumer. ClickHouse uses a columnar engine with pre-aggregated materialized views. A query for “hourly revenue for the last 30 days” scans a pre-aggregated hourly rollup table rather than raw events. Query time: <500ms for a 30-day hourly rollup across millions of rows.
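The speedup comes from reading pre-aggregated rows rather than raw events. What a materialized view maintains incrementally can be illustrated in plain Python (an in-memory sketch, not the ClickHouse API):

```python
from collections import defaultdict

def hourly_revenue_rollup(events):
    """Pre-aggregate raw purchase events into hour_bucket -> revenue,
    the row shape a materialized view keeps up to date as events arrive."""
    rollup = defaultdict(float)
    for e in events:
        hour = e["ts"] - e["ts"] % 3600  # floor unix timestamp to the hour
        rollup[hour] += e["revenue"]
    return dict(rollup)

# A 30-day hourly query now scans at most 720 rollup rows (30 * 24),
# not millions of raw events:
events = [
    {"ts": 1000, "revenue": 9.99},
    {"ts": 2000, "revenue": 5.00},   # same hour as the first event
    {"ts": 4000, "revenue": 1.50},   # next hour
]
```

In real ClickHouse this would be a `SummingMergeTree` (or `AggregatingMergeTree`) target table fed by a materialized view over the raw events table.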
Dashboard Serving
Dashboards use WebSocket (persistent connection). On connect: serve the last 24 hours of time-series from Redis (fast) and ClickHouse (for older data). Then push updates every 10 seconds: just the latest minute’s metrics from Redis. This keeps update payloads tiny (delta only). For 10K concurrent dashboard users: fan out via Redis pub/sub to connection servers.
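The snapshot-then-delta protocol can be sketched as pure functions (the message shapes are assumptions, not a defined wire format):

```python
def initial_snapshot(series: dict) -> dict:
    """On WebSocket connect: the full window of per-minute points."""
    return {"type": "snapshot", "points": sorted(series.items())}

def delta_update(series: dict, last_sent_minute: int) -> dict:
    """Every 10 s: only minutes newer than what the client already has."""
    fresh = {m: v for m, v in series.items() if m > last_sent_minute}
    return {"type": "delta", "points": sorted(fresh.items())}

series = {1: 100, 2: 110, 3: 95}  # minute bucket -> metric value
```

Tracking `last_sent_minute` per connection is what keeps the steady-state payload to one or two points instead of 1,440.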
Approximate vs Exact Counts
Counting distinct active users exactly requires storing every user ID, O(N) memory. HyperLogLog approximates the unique count in constant memory (Redis's implementation caps at 12 KB per key regardless of N) with ~0.81% standard error. For 1M events/sec with high cardinality, HyperLogLog is the standard choice. Display as "~1.2M active users"; dashboards are used for trends, not billing, so approximation is acceptable.
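A minimal HyperLogLog in Python shows where the constant memory comes from: 2^p small registers, independent of how many items are added (this is a teaching sketch; Redis uses p=14, giving 16,384 six-bit registers, about 12 KB):

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog with 2**p registers, each storing the max
    observed rank (leading-zero run length) of hashed items."""

    def __init__(self, p: int = 10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # standard bias constant

    def add(self, item: str) -> None:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> int:
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:              # small-range correction
            est = self.m * math.log(self.m / zeros)
        return round(est)
```

Duplicates never grow the structure: re-adding a seen item can only leave each register's max unchanged, which is exactly why unions of per-minute HLLs (Redis `PFMERGE`) also cost nothing extra.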
Interview Tips
- Lambda architecture: Flink for real-time, ClickHouse for historical — name both.
- HyperLogLog for approximate unique counts — 12KB vs gigabytes for exact.
- Redis for hot data (last 24h), columnar DB for cold (historical).
- WebSocket delta updates: push only the latest minute, not the full 24h on each refresh.