What Is Real-Time Analytics?
Real-time analytics provides insights on data that is seconds to minutes old — not the hours or days of traditional batch BI. Use cases: live dashboard showing active users on a website, per-minute conversion rate for an A/B test, real-time revenue during a product launch, and anomaly detection on live metrics. The challenge: high ingest volume (billions of events per day) combined with fast, complex analytical queries (GROUP BY, COUNT DISTINCT, percentile aggregations) over large datasets.
The OLAP vs OLTP Distinction
OLTP (Online Transaction Processing): databases like MySQL/PostgreSQL optimized for point reads/writes (SELECT user WHERE id=123), high concurrency, low latency per query, row-oriented storage. OLAP (Online Analytical Processing): databases like ClickHouse/Druid/BigQuery optimized for analytical scans (COUNT(*) WHERE date BETWEEN x AND y GROUP BY country), moderate concurrency, column-oriented storage. The fundamental difference: OLAP stores data by column (all values for “country” are together on disk), enabling 10-100× faster reads for analytical queries that only access a few columns out of hundreds. Row-oriented storage reads all columns even for single-column aggregations.
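The row-versus-column tradeoff can be sketched in a few lines of Python (the table shape and values are made up for illustration; real engines work on compressed binary arrays, not Python lists):

```python
# A hypothetical 4-column events table, stored both ways.
N = 100_000
rows = [  # row-oriented: one record per event
    {"user_id": i, "country": "US" if i % 3 else "DE",
     "event": "click", "revenue": i * 0.01}
    for i in range(N)
]
columns = {  # column-oriented: one contiguous array per column
    name: [r[name] for r in rows]
    for name in ("user_id", "country", "event", "revenue")
}

# Row store: summing one column still touches every field of every row.
total_row = sum(r["revenue"] for r in rows)

# Column store: the aggregation reads 1 of 4 arrays, ~4x less data
# (real engines add per-column compression on top, widening the gap).
total_col = sum(columns["revenue"])

assert total_row == total_col
```

With hundreds of columns instead of four, the same single-column aggregation skips hundreds of arrays, which is where the 10-100× read advantage comes from.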
ClickHouse Architecture
ClickHouse is a column-oriented OLAP database optimized for high-ingest analytics. Key features:
- MergeTree engine: data is written in sorted “parts” (like SSTable files in LSM trees) and merged asynchronously in the background. Parts are sorted by the ORDER BY key, and a sparse primary index (one mark per granule of rows) lets queries with a WHERE clause on this key skip entire granules and parts. This enables scan speeds of billions of rows per second on modern hardware.
- Columnar compression: each column is compressed independently using codecs suited to its data type (LZ4 for general data, DoubleDelta for timestamps, Gorilla for time-series floats). 10:1 compression is common — a 10TB dataset compresses to 1TB, reducing I/O by 10×.
- Vectorized execution: queries process data in blocks of thousands of rows at a time using SIMD CPU instructions, achieving roughly 10× throughput over row-by-row processing.
- Materialized views: pre-aggregated tables are maintained automatically as data is inserted. Queries against the materialized view run in milliseconds instead of seconds. Example: a materialized view aggregating (event_type, country, hour, count) is updated automatically on every insert.
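The sparse-index skipping described above can be sketched as a toy model (illustrative only, not ClickHouse's actual code; GRANULE mirrors ClickHouse's default index_granularity of 8192):

```python
import bisect

# Toy model of MergeTree range scans with a sparse index.
GRANULE = 8192   # rows per index mark (ClickHouse's default index_granularity)

def build_part(sorted_keys):
    """A 'part' stores rows sorted by the ORDER BY key plus a sparse index:
    the key value at the start of every granule."""
    return {"keys": sorted_keys, "marks": sorted_keys[::GRANULE]}

def scan_range(part, lo, hi):
    """Return keys in [lo, hi), reading only granules the sparse index admits."""
    marks = part["marks"]
    start = max(bisect.bisect_right(marks, lo) - 1, 0)  # first granule that may contain lo
    end = bisect.bisect_left(marks, hi)                 # first granule starting at/after hi
    out = []
    for g in range(start, end):
        granule = part["keys"][g * GRANULE:(g + 1) * GRANULE]  # only these rows are read
        out.extend(k for k in granule if lo <= k < hi)
    return out

part = build_part(list(range(100_000)))       # 100k sorted keys -> 13 granules
hits = scan_range(part, 25_000, 25_100)       # touches 1 of 13 granules
assert hits == list(range(25_000, 25_100))
```

The index is sparse (one key per 8,192 rows, not per row), so it stays small enough to keep in memory even for billions of rows, at the cost of reading a whole granule when any row in it matches.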
Apache Druid for Streaming Analytics
Druid combines streaming ingest (real-time segments updated as data arrives) with OLAP query performance. Architecture: brokers route queries; historical nodes serve immutable segments loaded from deep storage (e.g., S3); real-time ingestion tasks consume from Kafka and serve hot segments. Data flows: Kafka → Druid real-time tasks (segments published after a configurable window, e.g., 10 minutes) → historical nodes (for older data). Druid can pre-compute rollups at ingest time: raw events are aggregated to (minute, country, event_type, sum/count/distinct) as they arrive; with rollup enabled, the raw events are not retained, only the pre-aggregated rows. This reduces storage by 10-100× at the cost of not being able to re-aggregate at finer granularity than the ingest-time rollup.
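Ingest-time rollup can be sketched as follows (the event shape, dimensions, and metric names are assumed for illustration, not Druid's actual API):

```python
from collections import defaultdict

# Ingest-time rollup: collapse raw events to one row per (minute, country, event_type).
rollup = defaultdict(lambda: {"count": 0, "revenue": 0.0})

def ingest(event):
    minute = event["ts"] - event["ts"] % 60              # truncate epoch seconds to the minute
    key = (minute, event["country"], event["event_type"])
    rollup[key]["count"] += 1
    rollup[key]["revenue"] += event["revenue"]

# 1,000 raw events, one per second, all from one country/event_type.
for i in range(1000):
    ingest({"ts": 1_700_000_000 + i, "country": "US",
            "event_type": "purchase", "revenue": 1.0})

print(len(rollup))   # 17 rollup rows replace 1,000 raw events (~60x fewer)
```

Note the one-way door: from these 17 minute-level rows you can still roll up to hours or days, but you can never recover per-second or per-user detail.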
The Lambda Architecture
Lambda architecture handles both high-velocity streaming and historical batch analytics by maintaining two parallel pipelines:
- Batch layer: processes all historical data (Spark on S3/Parquet). Output: accurate batch views. Recomputed periodically (daily/hourly). Slow but correct — handles late-arriving data, corrections.
- Speed layer: processes real-time stream (Flink/Kafka Streams). Output: real-time views (fast, approximate for very recent data). Serves queries with low latency but only covers recent data.
- Serving layer: queries merge results from batch and speed layers. Example: for “active users in last 7 days,” combine exact batch results (the completed days) with real-time stream results (today so far).
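The serving-layer merge can be sketched with plain sets, assuming the batch layer emits exact per-day user sets and the speed layer accumulates today's users from the stream (all names and data below are hypothetical):

```python
# Batch layer output: exact per-day active-user sets, recomputed nightly.
batch_views = {
    f"2024-05-0{d}": {f"u{u}" for u in range(d * 100)}   # day -> set of user_ids
    for d in range(1, 7)                                  # six completed days
}
# Speed layer output: users seen today so far, from the stream processor.
speed_view = {"u1", "u2", "u999"}

def active_users_7d():
    """Merge exact batch results (completed days) with the live view (today)."""
    users = set()
    for day_users in batch_views.values():
        users |= day_users
    return users | speed_view

print(len(active_users_7d()))  # 601: 600 distinct batch users + "u999" from the stream
```

The merge must deduplicate across layers (hence sets here, or mergeable sketches like HyperLogLog in practice), since a user active both yesterday and today should count once.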
Kappa architecture simplifies this by using only a streaming pipeline for everything — reprocess historical data by replaying Kafka from the beginning. Simpler operations; requires the stream processor to handle reprocessing at batch scale.
Approximate Algorithms for Scale
Exact COUNT DISTINCT queries on billions of rows are expensive. Approximate algorithms trade a small accuracy loss for massive performance gains:
- HyperLogLog: estimates the count of distinct elements using only 12KB of memory per counter, with roughly 0.81% standard error. Used by Redis (PFADD/PFCOUNT), Druid, and BigQuery for approximate DISTINCT counts. Mergeable: HLLs from different time windows or shards can be combined with a lossless union (take the max of each register).
- Count-Min Sketch: estimates the frequency of individual items in a stream using a small fixed-size matrix. Used for finding heavy hitters (top-K most frequent events). Never underestimates a count; overestimation is bounded by the sketch dimensions.
- T-Digest: approximate percentile computation (P50, P95, P99 latency) in a streaming fashion with high accuracy at the tails. More accurate than reservoir sampling for high-percentile queries.
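A compact HyperLogLog can be written from the standard algorithm (a sketch, not any library's implementation; p=14 gives 2^14 = 16,384 registers, which is about 12KB when packed at 6 bits per register, and the hash choice here is illustrative):

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                   # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: top p bits pick a register, the rest estimate rarity.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1   # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Union is lossless: take the max of each register pair.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:         # small-range correction
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
print(hll.count())   # close to 100,000 (standard error ~0.81% at p=14)
```

The merge property is what makes HLL useful at scale: each shard (or each hour) keeps its own 12KB counter, and a cross-shard or cross-window DISTINCT is just a register-wise max, never a rescan.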
Query Optimization for Dashboards
Analytics dashboards have predictable query patterns: the same queries run every 30 seconds to refresh metrics. Optimizations:
- Result caching: cache the last query result with TTL = dashboard refresh interval. The dashboard shows slightly stale data (30-60 seconds), but cache hits never touch the analytical database.
- Pre-aggregation: materialize common GROUP BY combinations at ingest time. “Active users per country per minute” is always queried at country-minute granularity, so pre-aggregate at that level.
- Scheduled queries: run expensive queries on a schedule (every 5 minutes) and serve from a result store rather than running them on demand.
- Tiered storage: hot data (last 30 days) in ClickHouse on SSD; cold data (older) in Parquet on S3, queried via Presto/Athena when needed.
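Result caching with a TTL tied to the refresh interval can be sketched as follows (the query runner, query text, and result shape are all hypothetical):

```python
import time

class ResultCache:
    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock                       # injectable for tests
        self._store = {}                         # query -> (expires_at, result)

    def get_or_compute(self, query, run_query):
        now = self.clock()
        hit = self._store.get(query)
        if hit and hit[0] > now:
            return hit[1]                        # hit: the OLAP database is not touched
        result = run_query(query)                # miss: run against ClickHouse/Druid
        self._store[query] = (now + self.ttl, result)
        return result

calls = []
def fake_db(query):                              # stand-in for the analytical database
    calls.append(query)
    return {"active_users": 42}

cache = ResultCache(ttl_seconds=30)
for _ in range(10):                              # ten dashboard refreshes within one TTL
    cache.get_or_compute("SELECT count() FROM events", fake_db)
print(len(calls))   # 1: only the first refresh reached the database
```

Setting TTL equal to the refresh interval means each distinct dashboard query hits the database at most once per interval, regardless of how many viewers have the dashboard open.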
Interview Checklist
- Ingest pipeline: Kafka → Flink/Spark Streaming → ClickHouse/Druid
- Storage: column-oriented OLAP DB; columnar compression; sparse index on ORDER BY key
- Real-time vs batch: speed layer (Flink) + batch layer (Spark) = Lambda; or Kappa (stream only)
- Approximate analytics: HyperLogLog for distinct counts, T-Digest for percentiles
- Query optimization: materialized views, result caching, pre-aggregation, scheduled queries
- Freshness vs accuracy tradeoff: real-time has seconds latency but may be approximate; batch is exact but hours stale