System Design: Distributed Message Queue (Kafka)
A distributed message queue decouples producers (services that generate events) from consumers (services that process them), enabling async communication, backpressure handling, and replay. Apache Kafka is the de facto standard — it handles trillions of messages per day at LinkedIn, Netflix, and Uber. This guide covers the design of a Kafka-like system from the ground up.
Requirements
Functional: Producers publish messages to named topics. Consumers subscribe to topics and receive messages. Messages are retained for a configurable duration (7 days by default). Messages can be replayed from any offset. At-least-once delivery guarantee.
Non-functional: 1 million messages/sec write throughput, sub-10ms publish latency, horizontal scalability, fault tolerance (survive broker failures), ordered messages within a partition.
Core Concepts
- Topic: Named stream of messages (e.g., “user.clicks”, “order.created”)
- Partition: A topic is divided into N ordered, immutable logs. Messages within a partition are totally ordered. Partitions enable parallelism.
- Offset: Sequential integer assigned to each message within a partition. Consumers track their position via offset.
- Producer: Writes messages to a partition (by key hash or round-robin). Receives ACK after message is durably written.
- Consumer Group: A group of consumers sharing a topic subscription. Each partition is assigned to exactly one consumer in the group. This gives parallel consumption without duplicate processing.
- Broker: A server that stores partitions and serves producers/consumers. Kafka cluster = multiple brokers.
Data Model
Topic: topic_name, num_partitions, retention_bytes, retention_ms
Partition: topic_name, partition_id, leader_broker_id, replica_broker_ids
Message: offset, key, value (bytes), timestamp, headers
ConsumerOffset: consumer_group, topic, partition_id, committed_offset
Write Path
Producer → Partition Selection → Leader Broker → Write to Segment File → Replicate to Follower Brokers → ACK to Producer (after N replicas confirm)
Partition selection: If the message has a key, partition = hash(key) % num_partitions, so the same key always lands on the same partition (an ordering guarantee for related messages, e.g., all events for user_id=123). If there is no key, the producer round-robins across partitions to maximize throughput.
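A minimal sketch of this selection logic. Kafka's default partitioner actually uses murmur2; the md5 digest here is just a deterministic stand-in (Python's built-in hash() is salted per process, so it would break the same-key guarantee across restarts):

```python
import hashlib
from itertools import count

_round_robin = count()  # process-local counter for unkeyed messages

def select_partition(key: bytes | None, num_partitions: int) -> int:
    if key is not None:
        # Keyed message: same key -> same partition (per-key ordering).
        digest = hashlib.md5(key).digest()
        return int.from_bytes(digest[:4], "big") % num_partitions
    # Unkeyed message: spread evenly across partitions for throughput.
    return next(_round_robin) % num_partitions

# Same key maps to the same partition, every time:
assert select_partition(b"user_id=123", 8) == select_partition(b"user_id=123", 8)
```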
Durability is controlled by the producer's acks config: acks=0 is fire-and-forget (fastest, can lose messages); acks=1 waits for the leader only (can lose data if the leader fails before replicating); acks=all waits for all in-sync replicas (no data loss, highest latency).
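For illustration, here is what the tradeoff looks like with the kafka-python client; the broker address, topic, and payload are assumptions:

```python
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    acks="all",   # 0 = fire-and-forget, 1 = leader only, "all" = full ISR
    retries=5,    # retry transient broker errors instead of dropping
)

future = producer.send("order.created", key=b"user_id=123", value=b'{"total": 42}')
metadata = future.get(timeout=10)  # blocks until the broker ACKs per the acks setting
print(metadata.topic, metadata.partition, metadata.offset)
```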
Storage: Each partition is a sequence of segment files (each ~1 GB). New messages are appended to the active segment. Sequential writes to disk are extremely fast (500 MB/s vs 100 MB/s random). Retention: old segments are deleted when size > retention_bytes or age > retention_ms.
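A toy sketch of size-based segment rollover, to make the append-only layout concrete. Offsets, record framing, and file naming are simplified; real Kafka segments also carry offset and time index files:

```python
import os
import struct
import time

class SegmentLog:
    """Toy append-only partition log with size-based segment rollover."""

    def __init__(self, directory: str, max_segment_bytes: int = 1_000_000):
        self.dir = directory
        self.max_segment_bytes = max_segment_bytes
        self.next_offset = 0
        os.makedirs(directory, exist_ok=True)
        self._roll()  # open the first active segment

    def _roll(self) -> None:
        # Like Kafka, name each segment after the first offset it contains.
        path = os.path.join(self.dir, f"{self.next_offset:020d}.log")
        self.active = open(path, "ab")

    def append(self, value: bytes) -> int:
        header = struct.pack(">QQI", self.next_offset, int(time.time() * 1000), len(value))
        record = header + value
        if self.active.tell() + len(record) > self.max_segment_bytes:
            self.active.close()
            self._roll()  # retire the full segment; retention deletes old ones later
        self.active.write(record)  # sequential append: the fast path
        offset = self.next_offset
        self.next_offset += 1
        return offset
```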
Read Path
Consumer → Poll(topic, partition, offset) → Broker → Return messages from offset → Process messages → Commit offset (mark progress)
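A minimal at-least-once consumer loop, again sketched with kafka-python; the broker address and group name are assumptions and process() is a placeholder handler:

```python
from kafka import KafkaConsumer  # pip install kafka-python

def process(payload: bytes) -> None:
    ...  # business logic placeholder

consumer = KafkaConsumer(
    "order.created",
    bootstrap_servers="localhost:9092",  # assumed local broker
    group_id="billing-service",          # assumed group name
    enable_auto_commit=False,            # commit manually for at-least-once
    auto_offset_reset="earliest",        # new groups start from the oldest retained offset
)

for message in consumer:     # poll loop under the hood
    process(message.value)
    consumer.commit()        # commit AFTER processing => at-least-once
                             # (batching commits would be more efficient)
```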
Zero-copy read: Kafka uses Linux sendfile() to move data from the page cache straight to the network socket, skipping the copy through application memory. This is roughly 4–6x faster than a conventional read-then-write loop.
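The kernel-level idea in a few lines, using Python's os.sendfile as a stand-in for the same syscall Kafka reaches through Java NIO's transferTo (Linux; the path and connection handling here are assumptions):

```python
import os
import socket

def serve_segment(conn: socket.socket, path: str, offset: int, count: int) -> int:
    """Send `count` bytes of a segment file to a client without copying
    them into user space: the kernel moves page-cache pages to the socket."""
    with open(path, "rb") as segment:
        return os.sendfile(conn.fileno(), segment.fileno(), offset, count)
```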
Consumer group rebalancing: When a consumer joins or leaves, the group coordinator (a special broker) triggers a rebalance — reassigns partitions among the active consumers. During rebalancing, consumption pauses briefly. New consumers get partitions at the last committed offset.
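A simplified, round-robin-style version of what the coordinator computes on each rebalance; real assignors (range, sticky, cooperative-sticky) are more involved:

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Deal partitions out to consumers as evenly as possible."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 6 partitions, 2 consumers -> 3 each; a third consumer joining triggers
# a rebalance and everyone ends up with 2.
print(assign_partitions(list(range(6)), ["c1", "c2"]))
print(assign_partitions(list(range(6)), ["c1", "c2", "c3"]))
```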
Delivery Guarantees
| Guarantee | How | Tradeoff |
|---|---|---|
| At-most-once | Commit offset before processing | May miss messages on crash |
| At-least-once | Commit offset after successful processing | May reprocess on crash (duplicates possible) |
| Exactly-once | Idempotent producer + transactional consumer | Higher latency, more complex |
At-least-once is the industry default. Handle idempotency in the consumer: track processed message IDs in Redis or DB, skip duplicates on reprocessing.
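One common dedup pattern, sketched with redis-py; the key prefix, TTL, and process() handler are assumptions:

```python
import redis  # pip install redis; assumes a Redis instance on the default port

r = redis.Redis()

def process(payload: bytes) -> None:
    ...  # business logic placeholder

def handle(message_id: str, payload: bytes) -> None:
    # SET NX is atomic: only the first delivery of this ID claims the key.
    # The TTL (7 days, matching retention) bounds the dedup set's growth.
    if not r.set(f"processed:{message_id}", 1, nx=True, ex=7 * 24 * 3600):
        return  # duplicate from an at-least-once redelivery; skip it
    process(payload)
```

Note that claiming the ID before processing drifts toward at-most-once if the handler crashes mid-message; systems that cannot tolerate that record the ID transactionally together with the processing result.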
Replication and Fault Tolerance
Each partition has a leader and N-1 followers (replicas). Producers and consumers talk only to the leader. Followers pull from the leader continuously. In-Sync Replicas (ISR) = followers that are caught up within a threshold. If the leader fails, the controller (coordinated via ZooKeeper, or via Kafka's built-in Raft-based KRaft mode in newer versions) elects a new leader from the ISR. With replication factor 3 and min.insync.replicas=2, the cluster survives one broker failure without data loss.
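A toy sketch of the failover rule (promote any surviving ISR member); real Kafka tracks ISR membership by replica fetch lag and elects through the controller:

```python
def elect_new_leader(replicas: list[str], isr: set[str], failed_leader: str) -> str:
    """Pick the first surviving in-sync replica as the new leader."""
    candidates = [b for b in replicas if b in isr and b != failed_leader]
    if not candidates:
        # No caught-up replica left: choosing an out-of-sync one would be
        # an "unclean" election and could lose acknowledged messages.
        raise RuntimeError("no in-sync replica available")
    return candidates[0]

print(elect_new_leader(["b1", "b2", "b3"], isr={"b1", "b2"}, failed_leader="b1"))  # -> b2
```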
Scale Numbers
- LinkedIn runs Kafka at ~7 trillion messages/day (roughly 80M messages/sec averaged over the day)
- Single broker handles 1–5 GB/s throughput with SSDs
- 1M messages/sec with 1KB messages = 1 GB/s of producer ingress; with replication factor 3 that becomes ~3 GB/s of cluster-wide writes, which lands at roughly 10 SSD-backed brokers (see the arithmetic sketch after this list).
- Consumer throughput limited by processing speed, not Kafka (Kafka can replay as fast as disk allows)
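A quick back-of-the-envelope check of that sizing; the ~400 MB/s sustained per-broker write budget is an assumption chosen to leave headroom below peak hardware numbers:

```python
msgs_per_sec = 1_000_000
msg_bytes = 1_000
replication = 3

ingress = msgs_per_sec * msg_bytes            # 1 GB/s arriving from producers
cluster_writes = ingress * replication        # 3 GB/s once replica traffic is included
per_broker_budget = 400 * 1_000_000           # assumed sustained bytes/s per SSD broker

brokers = -(-cluster_writes // per_broker_budget)  # ceiling division -> 8; ~10 with headroom
print(f"ingress={ingress/1e9:.1f} GB/s, cluster={cluster_writes/1e9:.1f} GB/s, brokers >= {brokers}")
```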
Interview Tips
- Explain partitions as the unit of parallelism — more partitions = more consumer group parallelism.
- Mention that partition ordering is guaranteed; topic-level ordering is not (messages across partitions are interleaved).
- At-least-once vs exactly-once: at-least-once with consumer-side idempotency is the pragmatic choice.
- Contrast with a traditional queue (RabbitMQ): Kafka retains messages and allows replay; RabbitMQ deletes after consumption. Use Kafka when you need event sourcing, audit trails, or stream processing.
- Use cases: event streaming, log aggregation, change data capture (CDC), decoupling microservices.