Requirements
An analytics dashboard lets users visualize business metrics: revenue over time, active users, conversion rates, funnel analysis, A/B test results. Two modes: real-time (data is seconds old, e.g., live event counts) and historical (pre-aggregated, queries over months of data). Scale: a mid-size SaaS company generates 10-100 million events per day. An analytics product (Amplitude, Mixpanel, Datadog) may ingest billions of events per day from thousands of customers. The core technical challenge: making arbitrary slice-and-dice queries fast over billions of rows without forcing users to wait minutes for results.
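A quick back-of-envelope check on these numbers (the peak factor and per-event size below are illustrative assumptions, not from any specific deployment):

```python
# Back-of-envelope ingest rate: assumes 100M events/day, peak ~5x the daily
# average, and ~1 KB per enriched event. All three figures are illustrative.
events_per_day = 100_000_000
avg_eps = events_per_day / 86_400              # ~1,157 events/sec on average
peak_eps = avg_eps * 5                         # ~5,800 events/sec at peak
daily_volume_gb = events_per_day * 1_000 / 1e9 # ~100 GB/day of raw events

print(f"avg {avg_eps:,.0f} eps, peak {peak_eps:,.0f} eps, ~{daily_volume_gb:.0f} GB/day")
```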
Event Ingestion Pipeline
Events are the raw input: page views, clicks, purchases, API calls. Each event has: event_id, event_type, user_id, timestamp, properties (JSON: {page, country, device, revenue, …}). Ingestion path: client SDK (web, mobile) → event collector API → validation and enrichment → Kafka topic. Why Kafka: decouples ingestion from processing, handles bursts, provides replay capability (reprocess events if a bug is found in the aggregation logic). Throughput: a well-partitioned Kafka cluster can sustain millions of events per second. Event validation at ingest: drop events with missing required fields, enforce schema per event type, check for bot traffic (rate limiting per user_id). Enrichment: add server-side properties (IP geolocation → country, timestamp normalization to UTC). Schema evolution: use Avro or Protobuf with a schema registry. New event properties can be added without breaking existing consumers. Kafka consumers: (1) real-time aggregator → Redis (for live dashboards), (2) batch loader → data warehouse (BigQuery, Snowflake, Redshift), (3) stream processor → ClickHouse or Druid (for fast analytical queries).
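A minimal sketch of the collector's validate → enrich → publish step, assuming the kafka-python client; the events.raw topic name, required-field set, and geolocation helper are illustrative placeholders:

```python
# Collector-side validate -> enrich -> publish sketch (assumes kafka-python).
import json
import time
from kafka import KafkaProducer

REQUIRED_FIELDS = {"event_id", "event_type", "user_id", "timestamp"}  # illustrative

producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def lookup_country(ip: str) -> str:
    # Placeholder for an IP-geolocation lookup (e.g., a MaxMind-style database).
    return "US"

def ingest(event: dict, client_ip: str) -> bool:
    # Validation: drop events missing required fields.
    if not REQUIRED_FIELDS.issubset(event):
        return False
    # Enrichment: server-side properties added at ingest time.
    event.setdefault("properties", {})
    event["properties"]["country"] = lookup_country(client_ip)
    event["received_at"] = int(time.time())  # server-side UTC epoch timestamp
    # Key by user_id so one user's events land in the same partition (per-user ordering).
    producer.send("events.raw", key=event["user_id"].encode(), value=event)
    return True
```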
Aggregation Layer
Pre-aggregation is the key to fast dashboard queries. Don't query raw events at dashboard render time; pre-aggregate into materialized summaries. Two approaches: Lambda architecture: batch layer (Spark jobs on S3 data, runs hourly/daily, produces accurate aggregates) + speed layer (real-time stream processing for the latest hour). Query = batch result + speed-layer delta. Complex to operate but accurate. Kappa architecture: all processing through a single stream pipeline (Flink, Spark Streaming). Reprocessing is done by replaying Kafka topics. Operationally simpler. Aggregation granularity: store minute-level, hour-level, and day-level aggregates for each metric. Dashboard queries select the appropriate granularity based on the time window. Querying minute-level data for a 1-year view would read 525,600 rows per metric; day-level reads 365 rows. AggregateTable: metric_name, dimensions (JSON), window_start, window_end, granularity, value. Example: {metric: 'revenue', dimensions: {country: 'US', product: 'PRO'}, window: '2024-01-01 14:00', granularity: '1h', value: 12500.00}.
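A sketch of how the backend could map the requested time window to a granularity and query the AggregateTable; the thresholds, table name (aggregates), and SQL shape are assumptions for illustration:

```python
# Granularity selection for the AggregateTable described above.
# Thresholds and the SQL shape are illustrative assumptions.
from datetime import datetime, timedelta

def pick_granularity(start: datetime, end: datetime) -> str:
    window = end - start
    if window <= timedelta(hours=6):
        return "1m"    # minute-level: short, zoomed-in views
    if window <= timedelta(days=14):
        return "1h"    # hour-level: weekly / bi-weekly views
    return "1d"        # day-level: monthly and yearly views

def aggregate_query(metric: str, start: datetime, end: datetime) -> tuple[str, dict]:
    granularity = pick_granularity(start, end)
    sql = """
        SELECT window_start, sum(value) AS value
        FROM aggregates
        WHERE metric_name = %(metric)s
          AND granularity = %(granularity)s
          AND window_start >= %(start)s AND window_start < %(end)s
        GROUP BY window_start
        ORDER BY window_start
    """
    return sql, {"metric": metric, "granularity": granularity, "start": start, "end": end}

# A 1-year view resolves to day-level: 365 rows instead of 525,600 minute rows.
sql, params = aggregate_query("revenue", datetime(2024, 1, 1), datetime(2025, 1, 1))
```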
Query Engine and OLAP
For ad-hoc queries that can't be pre-aggregated, use a columnar OLAP database. ClickHouse: typically 10-100x faster than a row-oriented database like PostgreSQL for analytical queries. Stores data column-by-column (not row-by-row). Columnar storage: if a query needs only revenue and country from a 100-column table, it reads only 2 columns instead of all 100. Compression: same-type values in a column compress much better than mixed-type rows. Vectorized execution: processes values in batches (blocks of thousands of rows), letting the CPU apply SIMD instructions to many values at once instead of interpreting row by row. Typical performance: a 10-billion-row table, full scan with GROUP BY country, returns in under a second on appropriately sized hardware. Partitioning by date: ClickHouse automatically skips partitions outside the query's time range. Index: sparse primary index on (date, user_id) to find the approximate data range. ClickHouse's MergeTree engine merges data parts in the background, similar to LSM trees. Druid: similar to ClickHouse but optimized for sub-second queries on real-time streaming data. Segments are immutable; new data is ingested in real time and becomes queryable immediately. Used by Netflix, Airbnb, and Twitter for large-scale analytics.
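A sketch of what the MergeTree table and a typical group-by might look like, assuming the clickhouse-driver Python client; the table, column names, and partitioning key are illustrative:

```python
# MergeTree table plus a columnar group-by query (assumes clickhouse-driver).
from clickhouse_driver import Client

client = Client(host="clickhouse")

client.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_date   Date,
        event_time   DateTime,
        event_type   LowCardinality(String),
        user_id      String,
        country      LowCardinality(String),
        revenue      Float64
    )
    ENGINE = MergeTree
    PARTITION BY toYYYYMM(event_date)   -- date partitions let ClickHouse skip whole months
    ORDER BY (event_date, user_id)      -- sparse primary index
""")

# Columnar scan touches only the country and revenue columns of the table.
rows = client.execute("""
    SELECT country, sum(revenue) AS total
    FROM events
    WHERE event_date >= today() - 30
    GROUP BY country
    ORDER BY total DESC
""")
```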
Dashboard Rendering and Caching
Dashboard query plan: user selects a time range, metrics, and dimension filters. Backend translates to: check Redis cache (key = {dashboard_id, params_hash}, TTL 60s) → if miss, query ClickHouse or the pre-aggregate table → serialize to JSON → cache result → return. Cache hit rate: for popular dashboards with fixed time ranges (last 7 days), cache hit rate > 90%. For user-specific filters or custom date ranges, hit rates are too low for caching to help; query directly. Auto-refresh: dashboards refresh every 30-60 seconds for live metrics. Use polling or Server-Sent Events. For real-time counters (live active users), use Redis counters updated by the stream processor, not OLAP queries. Approximate counting: for very high-cardinality queries (e.g., unique users per minute at 100M DAU), exact counting is expensive. Use HyperLogLog (Redis PFCOUNT): approximate unique counts in a small fixed space (about 12 KB per key in Redis) with a standard error of roughly 0.81%. Store an HLL per {metric, window} in Redis. Merge across time windows by merging the HLL structures (PFMERGE).
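A sketch of the cache-aside path and the HyperLogLog counters, assuming the redis-py client; the key layout, TTL, and the OLAP fallback function are illustrative:

```python
# Cache-aside dashboard query path and HyperLogLog unique counts (assumes redis-py).
import hashlib
import json
import redis

r = redis.Redis(host="redis")

def dashboard_data(dashboard_id: str, params: dict) -> dict:
    params_hash = hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()
    key = f"dash:{dashboard_id}:{params_hash}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit: >90% for popular fixed ranges
    result = query_olap(dashboard_id, params)   # fall through to ClickHouse / aggregates
    r.setex(key, 60, json.dumps(result))        # 60s TTL bounds staleness for auto-refresh
    return result

def query_olap(dashboard_id: str, params: dict) -> dict:
    return {}  # placeholder for the ClickHouse / pre-aggregate query

# One HLL key per {metric, minute window}: the stream processor calls pfadd,
# dashboards merge windows with pfcount across keys (or pfmerge into a rollup key).
def record_active_user(minute: str, user_id: str) -> None:
    r.pfadd(f"hll:active_users:{minute}", user_id)

def unique_users(minutes: list[str]) -> int:
    return r.pfcount(*[f"hll:active_users:{m}" for m in minutes])
```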