What Is Real-Time Analytics?
Real-time analytics provides insights on data that is seconds to minutes old — not the hours or days of traditional batch BI. Use cases: live dashboard showing active users on a website, per-minute conversion rate for an A/B test, real-time revenue during a product launch, and anomaly detection on live metrics. The challenge: high ingest volume (billions of events per day) combined with fast, complex analytical queries (GROUP BY, COUNT DISTINCT, percentile aggregations) over large datasets.
The OLAP vs OLTP Distinction
OLTP (Online Transaction Processing): databases like MySQL/PostgreSQL optimized for point reads/writes (SELECT user WHERE id=123), high concurrency, low latency per query, row-oriented storage. OLAP (Online Analytical Processing): databases like ClickHouse/Druid/BigQuery optimized for analytical scans (COUNT(*) WHERE date BETWEEN x AND y GROUP BY country), moderate concurrency, column-oriented storage. The fundamental difference: OLAP stores data by column (all values for “country” are together on disk), enabling 10-100× faster reads for analytical queries that only access a few columns out of hundreds. Row-oriented storage reads all columns even for single-column aggregations.
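The row-versus-column tradeoff can be sketched in a few lines of Python (the table shape and values are made up for illustration; real engines work on compressed binary arrays, not Python lists):

```python
# A hypothetical 4-column events table, stored both ways.
N = 100_000
rows = [  # row-oriented: one record per event
    {"user_id": i, "country": "US" if i % 3 else "DE",
     "event": "click", "revenue": i * 0.01}
    for i in range(N)
]
columns = {  # column-oriented: one contiguous array per column
    name: [r[name] for r in rows]
    for name in ("user_id", "country", "event", "revenue")
}

# Row store: summing one column still touches every field of every row.
total_row = sum(r["revenue"] for r in rows)

# Column store: the aggregation reads 1 of 4 arrays, ~4x less data
# (real engines add per-column compression on top, widening the gap).
total_col = sum(columns["revenue"])

assert total_row == total_col
```

With hundreds of columns instead of four, the same single-column aggregation skips hundreds of arrays, which is where the 10-100× read advantage comes from.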
ClickHouse Architecture
ClickHouse is a column-oriented OLAP database optimized for high-ingest analytics. Key features:
- MergeTree engine: data is written in sorted “parts” (like SSTable files in LSM trees) and merged asynchronously in the background. Parts are sorted by the ORDER BY key, and a sparse primary index (one mark per granule of rows) lets queries with a WHERE clause on this key skip entire granules and parts. This enables scan speeds of billions of rows per second on modern hardware.
- Columnar compression: each column is compressed independently using codecs suited to its data type (LZ4 for general data, DoubleDelta for timestamps, Gorilla for time-series floats). 10:1 compression is common — a 10TB dataset compresses to 1TB, reducing I/O by 10×.
- Vectorized execution: queries process data in blocks of thousands of rows at a time using SIMD CPU instructions, achieving roughly 10× throughput over row-by-row processing.
- Materialized views: pre-aggregated tables are maintained automatically as data is inserted. Queries against the materialized view run in milliseconds instead of seconds. Example: a materialized view aggregating (event_type, country, hour, count) is updated automatically on every insert.
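The sparse-index skipping described above can be sketched as a toy model (illustrative only, not ClickHouse's actual code; GRANULE mirrors ClickHouse's default index_granularity of 8192):

```python
import bisect

# Toy model of MergeTree range scans with a sparse index.
GRANULE = 8192   # rows per index mark (ClickHouse's default index_granularity)

def build_part(sorted_keys):
    """A 'part' stores rows sorted by the ORDER BY key plus a sparse index:
    the key value at the start of every granule."""
    return {"keys": sorted_keys, "marks": sorted_keys[::GRANULE]}

def scan_range(part, lo, hi):
    """Return keys in [lo, hi), reading only granules the sparse index admits."""
    marks = part["marks"]
    start = max(bisect.bisect_right(marks, lo) - 1, 0)  # first granule that may contain lo
    end = bisect.bisect_left(marks, hi)                 # first granule starting at/after hi
    out = []
    for g in range(start, end):
        granule = part["keys"][g * GRANULE:(g + 1) * GRANULE]  # only these rows are read
        out.extend(k for k in granule if lo <= k < hi)
    return out

part = build_part(list(range(100_000)))       # 100k sorted keys -> 13 granules
hits = scan_range(part, 25_000, 25_100)       # touches 1 of 13 granules
assert hits == list(range(25_000, 25_100))
```

The index is sparse (one key per 8,192 rows, not per row), so it stays small enough to keep in memory even for billions of rows, at the cost of reading a whole granule when any row in it matches.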
Apache Druid for Streaming Analytics
Druid combines streaming ingest (real-time segments updated as data arrives) with OLAP query performance. Architecture: brokers route queries; historical nodes serve immutable segments loaded from deep storage (e.g., S3); real-time ingestion tasks consume from Kafka and serve hot segments. Data flows: Kafka → Druid real-time tasks (segments published after a configurable window, e.g., 10 minutes) → historical nodes (for older data). Druid can pre-compute rollups at ingest time: raw events are aggregated to (minute, country, event_type, sum/count/distinct) as they arrive; with rollup enabled, the raw events are not retained, only the pre-aggregated rows. This reduces storage by 10-100× at the cost of not being able to re-aggregate at finer granularity than the ingest-time rollup.
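Ingest-time rollup can be sketched as follows (the event shape, dimensions, and metric names are assumed for illustration, not Druid's actual API):

```python
from collections import defaultdict

# Ingest-time rollup: collapse raw events to one row per (minute, country, event_type).
rollup = defaultdict(lambda: {"count": 0, "revenue": 0.0})

def ingest(event):
    minute = event["ts"] - event["ts"] % 60              # truncate epoch seconds to the minute
    key = (minute, event["country"], event["event_type"])
    rollup[key]["count"] += 1
    rollup[key]["revenue"] += event["revenue"]

# 1,000 raw events, one per second, all from one country/event_type.
for i in range(1000):
    ingest({"ts": 1_700_000_000 + i, "country": "US",
            "event_type": "purchase", "revenue": 1.0})

print(len(rollup))   # 17 rollup rows replace 1,000 raw events (~60x fewer)
```

Note the one-way door: from these 17 minute-level rows you can still roll up to hours or days, but you can never recover per-second or per-user detail.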
The Lambda Architecture
Lambda architecture handles both high-velocity streaming and historical batch analytics by maintaining two parallel pipelines:
- Batch layer: processes all historical data (Spark on S3/Parquet). Output: accurate batch views. Recomputed periodically (daily/hourly). Slow but correct — handles late-arriving data, corrections.
- Speed layer: processes real-time stream (Flink/Kafka Streams). Output: real-time views (fast, approximate for very recent data). Serves queries with low latency but only covers recent data.
- Serving layer: queries merge results from batch and speed layers. Example: for “active users in last 7 days,” combine exact batch results (the completed days) with real-time stream results (today so far).
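The serving-layer merge can be sketched with plain sets, assuming the batch layer emits exact per-day user sets and the speed layer accumulates today's users from the stream (all names and data below are hypothetical):

```python
# Batch layer output: exact per-day active-user sets, recomputed nightly.
batch_views = {
    f"2024-05-0{d}": {f"u{u}" for u in range(d * 100)}   # day -> set of user_ids
    for d in range(1, 7)                                  # six completed days
}
# Speed layer output: users seen today so far, from the stream processor.
speed_view = {"u1", "u2", "u999"}

def active_users_7d():
    """Merge exact batch results (completed days) with the live view (today)."""
    users = set()
    for day_users in batch_views.values():
        users |= day_users
    return users | speed_view

print(len(active_users_7d()))  # 601: 600 distinct batch users + "u999" from the stream
```

The merge must deduplicate across layers (hence sets here, or mergeable sketches like HyperLogLog in practice), since a user active both yesterday and today should count once.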
Kappa architecture simplifies this by using only a streaming pipeline for everything — reprocess historical data by replaying Kafka from the beginning. Simpler operations; requires the stream processor to handle reprocessing at batch scale.
Approximate Algorithms for Scale
Exact COUNT DISTINCT queries on billions of rows are expensive. Approximate algorithms trade a small accuracy loss for massive performance gains:
- HyperLogLog: estimates the count of distinct elements using only 12KB of memory per counter, with roughly 0.81% standard error. Used by Redis (PFADD/PFCOUNT), Druid, and BigQuery for approximate DISTINCT counts. Mergeable: HLLs from different time windows or shards can be combined with a lossless union (take the max of each register).
- Count-Min Sketch: estimates the frequency of individual items in a stream using a small fixed-size matrix. Used for finding heavy hitters (top-K most frequent events). Never underestimates a count; overestimation is bounded by the sketch dimensions.
- T-Digest: approximate percentile computation (P50, P95, P99 latency) in a streaming fashion with high accuracy at the tails. More accurate than reservoir sampling for high-percentile queries.
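A compact HyperLogLog can be written from the standard algorithm (a sketch, not any library's implementation; p=14 gives 2^14 = 16,384 registers, which is about 12KB when packed at 6 bits per register, and the hash choice here is illustrative):

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                   # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: top p bits pick a register, the rest estimate rarity.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union is lossless: take the max of each register pair.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:         # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(hll.count())   # close to 100,000 (standard error ~0.81% at p=14)
```

The merge property is what makes HLL useful at scale: each shard (or each hour) keeps its own 12KB counter, and a cross-shard or cross-window DISTINCT is just a register-wise max, never a rescan.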
Query Optimization for Dashboards
Analytics dashboards have predictable query patterns: the same queries run every 30 seconds to refresh metrics. Optimizations:
- Result caching: cache the last query result with TTL = dashboard refresh interval. The dashboard shows slightly stale data (30-60 seconds), but cache hits never touch the analytical database.
- Pre-aggregation: materialize common GROUP BY combinations at ingest time. “Active users per country per minute” is always queried at country-minute granularity, so pre-aggregate at that level.
- Scheduled queries: run expensive queries on a schedule (every 5 minutes) and serve from a result store rather than running them on demand.
- Tiered storage: hot data (last 30 days) in ClickHouse on SSD; cold data (older) in Parquet on S3, queried via Presto/Athena when needed.
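Result caching with a TTL tied to the refresh interval can be sketched as follows (the query runner, query text, and result shape are all hypothetical):

```python
import time

class ResultCache:
    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock                       # injectable for tests
        self._store = {}                         # query -> (expires_at, result)

    def get_or_compute(self, query, run_query):
        now = self.clock()
        hit = self._store.get(query)
        if hit and hit[0] > now:
            return hit[1]                        # hit: the OLAP database is not touched
        result = run_query(query)                # miss: run against ClickHouse/Druid
        self._store[query] = (now + self.ttl, result)
        return result

calls = []
def fake_db(query):                              # stand-in for the analytical database
    calls.append(query)
    return {"active_users": 42}

cache = ResultCache(ttl_seconds=30)
for _ in range(10):                              # ten dashboard refreshes within one TTL
    cache.get_or_compute("SELECT count() FROM events", fake_db)
print(len(calls))   # 1: only the first refresh reached the database
```

Setting TTL equal to the refresh interval means each distinct dashboard query hits the database at most once per interval, regardless of how many viewers have the dashboard open.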
Interview Checklist
- Ingest pipeline: Kafka → Flink/Spark Streaming → ClickHouse/Druid
- Storage: column-oriented OLAP DB; columnar compression; sparse index on ORDER BY key
- Real-time vs batch: speed layer (Flink) + batch layer (Spark) = Lambda; or Kappa (stream only)
- Approximate analytics: HyperLogLog for distinct counts, T-Digest for percentiles
- Query optimization: materialized views, result caching, pre-aggregation, scheduled queries
- Freshness vs accuracy tradeoff: real-time has seconds latency but may be approximate; batch is exact but hours stale