System Design: Analytics Pipeline — Ingestion, Stream Processing, and OLAP Query Layer

Requirements and Scale

An analytics pipeline collects events from product surfaces (web, mobile, backend services), processes them both in real time and in batch, and makes the results queryable for dashboards and ad-hoc analysis. Functional requirements: ingest high-volume event streams (clickstreams, purchases, API calls), compute aggregates (DAU, revenue, funnel conversion), and expose a query layer for dashboards. Non-functional requirements: ingest 1M events/second at peak; dashboards must reflect new data within 5 minutes (near-real-time); historical queries over years of data must return in seconds. The fundamental tension: low-latency ingestion, high-throughput batch analytics, and interactive ad-hoc queries each favor a different storage and processing model.

Ingestion Layer

Clients send events to an ingest API (REST or gRPC). The API does minimal validation (schema check, auth) and publishes to Kafka, which acts as the ingestion buffer and decouples producers (apps) from consumers (processors). Producers write to a topic partitioned by event_type or user_id. At 1M events/second and an average event size of 500 bytes, that is ~500 MB/s of throughput; Kafka handles this with 20-50 partitions. Client-side batching: the SDK buffers events for up to 100 ms and sends batches of up to 100 events, cutting the request count by up to 100x. Kafka retention: keep 7 days of raw events, which allows reprocessing if a downstream processor has a bug. Schema registry (e.g., Confluent Schema Registry): enforce Avro or Protobuf schemas per topic so malformed events cannot propagate downstream.
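The client-side batching described above can be sketched as a small buffer that flushes on either a size or a time threshold. This is an illustrative Python sketch, not a real SDK: the class name, the `transport` callable, and the parameters are assumptions, and a production batcher would also flush from a background timer thread rather than only on the next `track` call.

```python
import threading
import time

class BatchingEventSender:
    """Sketch of an SDK-side event batcher: buffer events and flush when
    the batch reaches max_size or max_delay seconds have elapsed."""

    def __init__(self, transport, max_size=100, max_delay=0.1):
        self.transport = transport   # callable that ships one batch, e.g. an HTTP POST
        self.max_size = max_size
        self.max_delay = max_delay
        self.buffer = []
        self.lock = threading.Lock()
        self.last_flush = time.monotonic()

    def track(self, event):
        with self.lock:
            self.buffer.append(event)
            if (len(self.buffer) >= self.max_size
                    or time.monotonic() - self.last_flush >= self.max_delay):
                self._flush_locked()

    def _flush_locked(self):
        if self.buffer:
            self.transport(self.buffer)   # one request carries up to max_size events
            self.buffer = []
        self.last_flush = time.monotonic()
```

With batches of 100, 1M events/second becomes on the order of 10K requests/second at the ingest API, which is where the ~100x reduction in HTTP overhead comes from.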

Stream Processing

Real-time aggregations with Apache Flink or Spark Structured Streaming. Consume from Kafka, compute windowed aggregates, and write to the serving layer. Example: count active users per 5-minute window, per country. Flink job:

stream
  .filter(e -> "page_view".equals(e.type))  // string equality, not ==
  .keyBy(e -> e.country)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .aggregate(new CountAggregator())
  .addSink(new RedisSink());

Event time vs processing time: use the timestamp embedded in the event (event time) for accuracy, since events can arrive late (e.g., a mobile client flushing its buffer after being offline). Watermarks bound that lateness: a 30-second watermark tells Flink events may arrive up to 30 seconds out of order; events arriving after the watermark has passed their window are dropped or routed to a side output. Output: per-window aggregates written to Redis (for real-time dashboards) and to the OLAP store (for historical analysis). Exactly-once semantics: Flink checkpointing plus replayable Kafka offsets gives exactly-once updates to Flink state, so no double-counting on restart; end-to-end exactly-once into an external sink additionally requires a transactional or idempotent sink.
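The event-time windowing and watermark behavior above can be modeled in a few lines, independent of Flink. This is a simplified Python sketch under assumed semantics (watermark = max event time seen minus allowed lateness; one count per window per country); real Flink tracks watermarks per partition and fires windows incrementally.

```python
from collections import defaultdict

WINDOW = 300           # 5-minute tumbling windows, in seconds
ALLOWED_LATENESS = 30  # watermark lags the max event time by 30s

def window_start(ts):
    """Align a timestamp to the start of its tumbling window."""
    return ts - ts % WINDOW

def count_page_views(events):
    """events: iterable of (event_time, country) in *arrival* order.
    Returns ({(window_start, country): count}, late_events).
    An event older than the current watermark is shunted to the late
    list, which stands in for Flink's side output."""
    counts = defaultdict(int)
    late = []
    max_ts = float("-inf")
    for ts, country in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - ALLOWED_LATENESS
        if ts < watermark:
            late.append((ts, country))          # arrived after the watermark passed
        else:
            counts[(window_start(ts), country)] += 1
    return dict(counts), late
```

For example, an event timestamped 250 that arrives after an event timestamped 301 still lands in its correct window, but once an event timestamped 350 has advanced the watermark to 320, a straggler timestamped 250 is treated as late.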

Batch Processing and OLAP Store

For historical queries, raw events are also written to S3 (via the Kafka S3 sink connector) in Parquet format, partitioned by date and event_type. A nightly Spark job computes full-day aggregates and writes them to the OLAP store. OLAP options: ClickHouse (columnar, extremely fast aggregations), Apache Druid (real-time plus batch ingestion, good for time series), BigQuery (serverless, good for ad hoc). ClickHouse example: queries over 10 billion rows can return in under a second thanks to columnar storage and vectorized execution. Data retention: keep 90 days of raw events in S3 Standard, then transition them to Glacier for long-term compliance. Aggregate tables are retained indefinitely. Compaction: 5-minute aggregates are rolled up into daily aggregates, which reduces storage and speeds up year-over-year queries.
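The compaction step is a straightforward roll-up: 288 five-minute rows per day collapse into one daily row per dimension. A minimal Python sketch, with assumed row fields (`window_start`, `country`, `count`); a real pipeline would do this in Spark SQL or a ClickHouse materialized view.

```python
from collections import defaultdict
from datetime import datetime, timezone

def compact_to_daily(five_min_rows):
    """five_min_rows: iterable of dicts like
    {"window_start": unix_ts, "country": ..., "count": n}.
    Returns one row per (date, country) with the counts summed."""
    daily = defaultdict(int)
    for row in five_min_rows:
        day = datetime.fromtimestamp(row["window_start"], tz=timezone.utc).date()
        daily[(day.isoformat(), row["country"])] += row["count"]
    return [
        {"date": d, "country": c, "count": n}
        for (d, c), n in sorted(daily.items())
    ]
```

Note this works only for additive metrics like counts and sums; non-additive metrics (e.g., distinct users) need sketches such as HyperLogLog to be mergeable across windows.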

Query Layer and Dashboards

Two query patterns. (1) Pre-aggregated metrics (DAU, revenue): served from Redis or a summary table with sub-millisecond latency; dashboards read these via a metrics API. (2) Ad-hoc analysis: an analyst writes SQL against ClickHouse or BigQuery. Query router: parse the query and determine the time range and granularity; if the range fits in pre-computed aggregates, serve from the fast path, otherwise hit ClickHouse or BigQuery. Caching: cache query results for 60 seconds (dashboards refresh every minute), using a query fingerprint (hash of the normalized SQL plus time range) as the cache key. Rate limiting: cap ad-hoc queries per user so one analyst cannot exhaust cluster resources. Backfill: if a Flink job bug produces wrong metrics for 2 hours, replay that window from the raw events in Kafka (restore from a savepoint and reprocess). Kafka's 7-day retention makes this practical.
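The fingerprint-keyed result cache can be sketched as follows. This is an illustrative Python sketch, not a real query router: class and parameter names are assumptions, the SQL "normalization" here is just whitespace/case folding, and the injectable clock exists only to make expiry testable.

```python
import hashlib
import time

class QueryCache:
    """Sketch of a query-result cache: key is a fingerprint of
    (normalized SQL, time range); entries expire after ttl seconds."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock      # injectable for testing
        self.entries = {}       # fingerprint -> (expiry, result)

    @staticmethod
    def fingerprint(sql, start, end):
        normalized = " ".join(sql.split()).lower()
        return hashlib.sha256(f"{normalized}|{start}|{end}".encode()).hexdigest()

    def get_or_compute(self, sql, start, end, run_query):
        key = self.fingerprint(sql, start, end)
        hit = self.entries.get(key)
        now = self.clock()
        if hit and hit[0] > now:
            return hit[1]                       # fresh cached result
        result = run_query(sql, start, end)     # fall through to the OLAP store
        self.entries[key] = (now + self.ttl, result)
        return result
```

Because every dashboard tile issues the same SQL over the same rolling time range, a 60-second TTL collapses N viewers of a dashboard into roughly one backend query per tile per minute.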

Interview Tips

  • Lambda vs Kappa architecture: Lambda has separate batch and stream paths (complex but reliable batch layer). Kappa replaces batch with Kafka replay — simpler, but requires the stream processor to handle reprocessing efficiently.
  • Data freshness SLA drives architecture: 5-minute freshness means streaming; 1-hour freshness could be micro-batch; 24-hour is pure batch.
  • Hotspot partitioning: if partitioning by user_id, viral users create hot partitions. Mitigate with composite keys or salting.
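The salting mitigation in the last bullet can be sketched in a few lines: a known-hot key gets a random suffix so its events spread across several partitions, at the cost of the consumer having to merge per-salt partial aggregates. Illustrative Python only; the md5-based partitioner is a stand-in for Kafka's default hash partitioner, and the `hot_users` set is an assumed input (in practice it might come from a frequency sketch).

```python
import hashlib
import random

NUM_SALTS = 8  # fan a hot key out over up to 8 sub-keys (tunable)

def salted_key(user_id, hot_users, rng=random):
    """Partition key for a producer: hot users get a random salt suffix
    so their events spread over NUM_SALTS sub-keys; downstream
    aggregation must merge the per-salt partials afterwards."""
    if user_id in hot_users:
        return f"{user_id}#{rng.randrange(NUM_SALTS)}"
    return user_id

def partition_for(key, num_partitions):
    # Stand-in for Kafka's default hash partitioner.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

The trade-off: salting breaks per-key ordering and turns one consumer-side aggregate into NUM_SALTS partials, so apply it only to keys that are actually hot.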

Asked at: Databricks, Netflix, Twitter/X, Snap
