System Design Interview: Design a Real-Time Analytics Dashboard

What Is a Real-Time Analytics Dashboard?

A real-time analytics dashboard displays live metrics and aggregations over streaming data: active users, revenue per minute, error rates, conversion funnels. Examples: Datadog dashboards, Stripe Radar, Google Analytics' real-time view. Core challenges: ingesting high-volume event streams, computing aggregations in near real time (seconds, not minutes), and serving many concurrent dashboard viewers efficiently.

    System Requirements

    Functional

    • Ingest user events (page views, clicks, purchases, errors)
    • Display real-time metrics: active users (last 5 min), events/second, revenue/hour
    • Time-series charts with 1-minute granularity for the last 24 hours
    • Filter by dimensions: country, device type, product category
    • Anomaly highlighting: flag metrics deviating more than 2 standard deviations from their baseline (see the sketch after this list)
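
    A minimal sketch of that anomaly rule: the 2-sigma threshold comes from the requirement above, while using a trailing window of recent values as the baseline is an assumption.

    from statistics import mean, stdev

    def is_anomalous(current: float, baseline: list[float],
                     threshold: float = 2.0) -> bool:
        """Flag a metric more than `threshold` std devs from its baseline."""
        mu, sigma = mean(baseline), stdev(baseline)  # stdev needs >= 2 points
        if sigma == 0:
            return False   # flat baseline: nothing meaningful to flag
        return abs(current - mu) / sigma > threshold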

    Non-Functional

    • 1M events/second ingestion
    • Dashboard refresh every 10 seconds
    • Query latency <500ms for a 24-hour time-series query

    Architecture

    Events ──► Kafka ──► Flink (streaming aggregation)
                               │
                        ┌──────┴──────────┐
                        ▼                 ▼
                  Redis (hot data)   ClickHouse/Druid
                  last 24 h          (historical, dimensional)
                        │                 │
                        └──────┬──────────┘
                               ▼
                        Query Service ──► Dashboard (WebSocket)
    

    Event Ingestion

    Client SDKs batch events (50ms batches) and POST them to an ingestion service. The ingestion service validates each event, enriches it (server timestamp, geo from IP, device parsed from the User-Agent), and produces to Kafka. Partitioning Kafka by user_id preserves per-user event ordering. 1M events/sec at 500 bytes/event is 500 MB/sec into Kafka, which calls for roughly 50 partitions (about 10 MB/sec each) across a 10-node cluster.
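
    A minimal sketch of the ingestion hot path, assuming confluent-kafka; geo_from_ip and parse_user_agent are hypothetical helpers standing in for a GeoIP database and a User-Agent parser:

    import json
    import time

    from confluent_kafka import Producer

    producer = Producer({
        "bootstrap.servers": "kafka:9092",
        "linger.ms": 50,              # mirror the client-side 50ms batching
        "compression.type": "lz4",
    })

    def ingest(event: dict, client_ip: str, user_agent: str) -> None:
        event["server_ts"] = int(time.time() * 1000)    # enrichment
        event["geo"] = geo_from_ip(client_ip)           # hypothetical helper
        event["device"] = parse_user_agent(user_agent)  # hypothetical helper
        # Keying by user_id routes all of a user's events to one partition,
        # which is what preserves per-user ordering.
        producer.produce("events",
                         key=event["user_id"].encode(),
                         value=json.dumps(event).encode())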

    Flink jobs consume from Kafka and maintain windowed aggregations. A PyFlink-style sketch (the aggregate functions and the Redis sink are user-defined and not shown in full):

    from pyflink.common import Time
    from pyflink.datastream.window import (
        SlidingEventTimeWindows,
        TumblingEventTimeWindows,
    )

    # Tumbling window: count events per minute, per (event_type, country)
    (stream
        .key_by(lambda e: (e.event_type, e.country))
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .aggregate(CountAggregate())    # user-defined AggregateFunction
        .add_sink(RedisSink()))         # user-defined sink; one row per closed window

    # Sliding window: active users over the last 5 minutes, updated every minute.
    # Key by a coarse shard rather than user_id: each shard's HyperLogLog then
    # covers many users, and the per-shard sketches can be merged downstream.
    (stream
        .key_by(lambda e: hash(e.user_id) % 64)
        .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
        .aggregate(UniqueCountAggregate())   # HyperLogLog-backed distinct count
        .add_sink(RedisSink()))
    

    Hot Data in Redis

    Flink writes aggregated results to Redis every minute (or every 10 seconds for near-real-time metrics). Data structures:

    • Active users: HyperLogLog per minute bucket (low memory, approximate unique count)
    • Event counts: Redis hash keyed by (event_type, minute_bucket)
    • Revenue: Redis sorted set by timestamp for time-series

    Redis holds 24 hours of per-minute data: 1,440 minute buckets × 50 metric combinations ≈ 72K keys. At roughly 100 bytes per key that is about 7 MB, which is trivial.
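
    A sketch of the hot-path reads (redis-py; key names are illustrative). PFCOUNT over several keys merges the per-minute HyperLogLogs server-side, which is how "active users, last 5 min" falls out of per-minute buckets:

    import redis

    r = redis.Redis()

    def record_user(user_id: str, minute_bucket: int) -> None:
        r.pfadd(f"active:{minute_bucket}", user_id)     # one HLL per minute
        r.expire(f"active:{minute_bucket}", 24 * 3600)  # keep 24h of buckets

    def active_users_last_5_min(now_bucket: int) -> int:
        keys = [f"active:{now_bucket - i}" for i in range(5)]
        return r.pfcount(*keys)   # server-side merge of the 5 HLLs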

    Historical Data in ClickHouse

    For queries spanning days or weeks, events also land in ClickHouse via a separate Kafka consumer. ClickHouse uses a columnar engine with pre-aggregated materialized views: a query for “hourly revenue for the last 30 days” scans a pre-aggregated hourly rollup table rather than raw events. Query time: <500ms for a 30-day hourly rollup across millions of rows.
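
    A sketch of the rollup, assuming clickhouse-connect and illustrative table/view names; the materialized view pre-aggregates on insert, so the 30-day query reads ~720 hourly rows per dimension instead of raw events:

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="clickhouse")

    # Maintain an hourly revenue rollup as events are inserted.
    client.command("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS revenue_hourly
        ENGINE = SummingMergeTree ORDER BY (hour, country)
        AS SELECT toStartOfHour(ts) AS hour, country, sum(revenue) AS revenue
        FROM events GROUP BY hour, country
    """)

    # 30-day query: aggregate the rollup (sum() is still needed because
    # SummingMergeTree collapses rows lazily, during background merges).
    rows = client.query("""
        SELECT hour, sum(revenue) AS revenue FROM revenue_hourly
        WHERE hour >= now() - INTERVAL 30 DAY
        GROUP BY hour ORDER BY hour
    """).result_rows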

    Dashboard Serving

    Dashboards use a persistent WebSocket connection. On connect, serve the initial 24-hour time-series from Redis (fast), reaching into ClickHouse only for older data. After that, push updates every 10 seconds containing just the latest minute's metrics from Redis, which keeps update payloads tiny (delta only). For 10K concurrent dashboard viewers, fan out via Redis pub/sub to a pool of connection servers, as sketched below.
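
    A connection-server sketch, assuming the websockets library and redis-py's asyncio client (the channel name and payload are illustrative). Each server holds its local sockets and rebroadcasts one pub/sub message per tick:

    import asyncio

    import redis.asyncio as aioredis
    import websockets

    CLIENTS: set = set()

    async def handler(ws):
        CLIENTS.add(ws)               # one entry per open dashboard
        try:
            await ws.wait_closed()
        finally:
            CLIENTS.discard(ws)

    async def fan_out():
        pubsub = aioredis.Redis().pubsub()
        await pubsub.subscribe("metrics:latest")  # query service publishes deltas here
        async for msg in pubsub.listen():
            if msg["type"] == "message":          # skip subscribe confirmations
                # Broadcast the latest-minute delta to every local socket.
                websockets.broadcast(CLIENTS, msg["data"].decode())

    async def main():
        async with websockets.serve(handler, "0.0.0.0", 8765):
            await fan_out()

    asyncio.run(main())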

    Approximate vs Exact Counts

    Counting distinct active users exactly requires storing every user ID, which is O(N) memory. HyperLogLog approximates the unique count in O(1) memory (12KB regardless of N) with roughly 1% error (Redis's implementation has a 0.81% standard error). For 1M events/sec with high cardinality, HyperLogLog is the standard choice. Display it as “~1.2M active users”: dashboards are used for trends, not billing, so approximation is acceptable.
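
    A quick way to see the trade-off with redis-py (the observed error varies run to run): the exact count stores every ID, while the HLL stays at 12KB:

    import redis

    r = redis.Redis()
    r.delete("hll:demo")

    ids = [f"user-{i}" for i in range(1_000_000)]
    for i in range(0, len(ids), 10_000):
        r.pfadd("hll:demo", *ids[i:i + 10_000])   # fixed ~12KB at any cardinality

    exact, approx = len(set(ids)), r.pfcount("hll:demo")
    print(f"exact={exact} approx={approx} "
          f"error={abs(approx - exact) / exact:.2%}")   # typically well under 1%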

    Interview Tips

    • Lambda architecture: Flink for real-time, ClickHouse for historical — name both.
    • HyperLogLog for approximate unique counts — 12KB vs gigabytes for exact.
    • Redis for hot data (last 24h), columnar DB for cold (historical).
    • WebSocket delta updates: push only the latest minute, not the full 24h on each refresh.