System Design Interview: Design a Real-Time Analytics Dashboard

What Is a Real-Time Analytics Dashboard?

A real-time analytics dashboard displays live metrics and aggregations over streaming data: active users, revenue per minute, error rates, conversion funnels. Examples: Datadog dashboards, Stripe Radar, Google Analytics' real-time view. Core challenges: ingesting high-volume event streams, computing aggregations in near real time (seconds, not minutes), and serving many concurrent dashboard viewers efficiently.

    System Requirements

    Functional

    • Ingest user events (page views, clicks, purchases, errors)
    • Display real-time metrics: active users (last 5 min), events/second, revenue/hour
    • Time-series charts with 1-minute granularity for the last 24 hours
    • Filter by dimensions: country, device type, product category
    • Anomaly highlighting: flag metrics deviating more than 2 standard deviations from their baseline (see the sketch after this list)
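
    A minimal sketch of that anomaly rule: the 2-sigma threshold comes from the requirement above, while using a trailing window of recent values as the baseline is an assumption.

    from statistics import mean, stdev

    def is_anomalous(current: float, baseline: list[float],
                     threshold: float = 2.0) -> bool:
        """Flag a metric more than `threshold` std devs from its baseline."""
        mu, sigma = mean(baseline), stdev(baseline)  # stdev needs >= 2 points
        if sigma == 0:
            return False   # flat baseline: nothing meaningful to flag
        return abs(current - mu) / sigma > threshold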

    Non-Functional

    • 1M events/second ingestion
    • Dashboard refresh every 10 seconds
    • Query latency <500ms for a 24-hour time-series query

    Architecture

    Events ──► Kafka ──► Flink (streaming aggregation)
                               │
                        ┌──────┴──────────┐
                        ▼                 ▼
                  Redis (hot data)   ClickHouse/Druid
                  last 24 h          (historical, dimensional)
                        │                 │
                        └──────┬──────────┘
                               ▼
                        Query Service ──► Dashboard (WebSocket)
    

    Event Ingestion

    Client SDKs batch events (50ms batches) and POST them to an ingestion service. The ingestion service validates each event, enriches it (server timestamp, geo from IP, device parsed from the User-Agent), and produces to Kafka. Partitioning Kafka by user_id preserves per-user event ordering. 1M events/sec at 500 bytes/event is 500 MB/sec into Kafka, which calls for roughly 50 partitions (about 10 MB/sec each) across a 10-node cluster.
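
    A minimal sketch of the ingestion hot path, assuming confluent-kafka; geo_from_ip and parse_user_agent are hypothetical helpers standing in for a GeoIP database and a User-Agent parser:

    import json
    import time

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "kafka:9092",
        "linger.ms": 50,              # mirror the client-side 50ms batching
        "compression.type": "lz4",
    })

    def ingest(event: dict, client_ip: str, user_agent: str) -> None:
        event["server_ts"] = int(time.time() * 1000)    # enrichment
        event["geo"] = geo_from_ip(client_ip)           # hypothetical helper
        event["device"] = parse_user_agent(user_agent)  # hypothetical helper
        # Keying by user_id routes all of a user's events to one partition,
        # which is what preserves per-user ordering.
        producer.produce("events",
                         key=event["user_id"].encode(),
                         value=json.dumps(event).encode())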

    Flink jobs consume from Kafka and maintain windowed aggregations. A PyFlink-style sketch (the aggregate functions and the Redis sink are user-defined and not shown in full):

    from pyflink.common import Time
    from pyflink.datastream.window import (
        SlidingEventTimeWindows,
        TumblingEventTimeWindows,
    )

    # Tumbling window: count events per minute, per (event_type, country)
    (stream
        .key_by(lambda e: (e.event_type, e.country))
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .aggregate(CountAggregate())    # user-defined AggregateFunction
        .add_sink(RedisSink()))         # user-defined sink; one row per closed window

    # Sliding window: active users over the last 5 minutes, updated every minute.
    # Key by a coarse shard rather than user_id: each shard's HyperLogLog then
    # covers many users, and the per-shard sketches can be merged downstream.
    (stream
        .key_by(lambda e: hash(e.user_id) % 64)
        .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
        .aggregate(UniqueCountAggregate())   # HyperLogLog-backed distinct count
        .add_sink(RedisSink()))
    

    Hot Data in Redis

    Flink writes aggregated results to Redis every minute (or every 10 seconds for near-real-time metrics). Data structures:

    • Active users: HyperLogLog per minute bucket (low memory, approximate unique count)
    • Event counts: Redis hash keyed by (event_type, minute_bucket)
    • Revenue: Redis sorted set by timestamp for time-series

    Redis holds 24 hours of per-minute data: 1,440 minute buckets × 50 metric combinations ≈ 72K keys. At roughly 100 bytes per key that is about 7 MB, which is trivial.
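
    A sketch of the hot-path reads (redis-py; key names are illustrative). PFCOUNT over several keys merges the per-minute HyperLogLogs server-side, which is how "active users, last 5 min" falls out of per-minute buckets:

    import redis

    r = redis.Redis()

    def record_user(user_id: str, minute_bucket: int) -> None:
        r.pfadd(f"active:{minute_bucket}", user_id)     # one HLL per minute
        r.expire(f"active:{minute_bucket}", 24 * 3600)  # keep 24h of buckets

    def active_users_last_5_min(now_bucket: int) -> int:
        keys = [f"active:{now_bucket - i}" for i in range(5)]
        return r.pfcount(*keys)   # server-side merge of the 5 HLLs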

    Historical Data in ClickHouse

    For queries spanning days or weeks, events also land in ClickHouse via a separate Kafka consumer. ClickHouse uses a columnar engine with pre-aggregated materialized views: a query for “hourly revenue for the last 30 days” scans a pre-aggregated hourly rollup table rather than raw events. Query time: <500ms for a 30-day hourly rollup across millions of rows.
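
    A sketch of the rollup, assuming clickhouse-connect and illustrative table/view names; the materialized view pre-aggregates on insert, so the 30-day query reads ~720 hourly rows per dimension instead of raw events:

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="clickhouse")

    # Maintain an hourly revenue rollup as events are inserted.
    client.command("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS revenue_hourly
        ENGINE = SummingMergeTree ORDER BY (hour, country)
        AS SELECT toStartOfHour(ts) AS hour, country, sum(revenue) AS revenue
        FROM events GROUP BY hour, country
    """)

    # 30-day query: aggregate the rollup (sum() is still needed because
    # SummingMergeTree collapses rows lazily, during background merges).
    rows = client.query("""
        SELECT hour, sum(revenue) AS revenue FROM revenue_hourly
        WHERE hour >= now() - INTERVAL 30 DAY
        GROUP BY hour ORDER BY hour
    """).result_rows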

    Dashboard Serving

    Dashboards use a persistent WebSocket connection. On connect, serve the initial 24-hour time-series from Redis (fast), reaching into ClickHouse only for older data. After that, push updates every 10 seconds containing just the latest minute's metrics from Redis, which keeps update payloads tiny (delta only). For 10K concurrent dashboard viewers, fan out via Redis pub/sub to a pool of connection servers, as sketched below.
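
    A connection-server sketch, assuming the websockets library and redis-py's asyncio client (the channel name and payload are illustrative). Each server holds its local sockets and rebroadcasts one pub/sub message per tick:

    import asyncio

    import redis.asyncio as aioredis
    import websockets

    CLIENTS: set = set()

    async def handler(ws):
        CLIENTS.add(ws)               # one entry per open dashboard
        try:
            await ws.wait_closed()
        finally:
            CLIENTS.discard(ws)

    async def fan_out():
        pubsub = aioredis.Redis().pubsub()
        await pubsub.subscribe("metrics:latest")  # query service publishes deltas here
        async for msg in pubsub.listen():
            if msg["type"] == "message":          # skip subscribe confirmations
                # Broadcast the latest-minute delta to every local socket.
                websockets.broadcast(CLIENTS, msg["data"].decode())

    async def main():
        async with websockets.serve(handler, "0.0.0.0", 8765):
            await fan_out()

    asyncio.run(main())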

    Approximate vs Exact Counts

    Counting distinct active users exactly requires storing every user ID, which is O(N) memory. HyperLogLog approximates the unique count in O(1) memory (12KB regardless of N) with roughly 1% error (Redis's implementation has a 0.81% standard error). For 1M events/sec with high cardinality, HyperLogLog is the standard choice. Display it as “~1.2M active users”: dashboards are used for trends, not billing, so approximation is acceptable.
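
    A quick way to see the trade-off with redis-py (the observed error varies run to run): the exact count stores every ID, while the HLL stays at 12KB:

    import redis

    r = redis.Redis()
    r.delete("hll:demo")

    ids = [f"user-{i}" for i in range(1_000_000)]
    for i in range(0, len(ids), 10_000):
        r.pfadd("hll:demo", *ids[i:i + 10_000])   # fixed ~12KB at any cardinality

    exact, approx = len(set(ids)), r.pfcount("hll:demo")
    print(f"exact={exact} approx={approx} "
          f"error={abs(approx - exact) / exact:.2%}")   # typically well under 1%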

    Interview Tips

    • Lambda architecture: Flink for real-time, ClickHouse for historical — name both.
    • HyperLogLog for approximate unique counts — 12KB vs gigabytes for exact.
    • Redis for hot data (last 24h), columnar DB for cold (historical).
    • WebSocket delta updates: push only the latest minute, not the full 24h on each refresh.