System Design Interview: Distributed Message Queue (Kafka / SQS)

Why Message Queues?

A message queue decouples producers (services that generate events) from consumers (services that process them), enabling async communication, load leveling, and fault isolation. Without a queue, if the downstream service is slow or down, the upstream service either blocks (synchronous) or loses the event (fire-and-forget). A durable message queue absorbs bursts (load leveling), allows consumers to scale independently, and retains unprocessed messages until consumers are ready. Core use cases: order processing (charge card → ship item → send receipt, each step async), event streaming (user activity logs → analytics pipeline), task queues (image resize jobs), and inter-service communication in microservices.

Kafka Architecture

Kafka is a distributed log-based message broker. Key concepts:

  • Topic: a named log of messages (like a table in a database). Producers append to a topic; consumers read from it.
  • Partition: a topic is split into N partitions, each an ordered, immutable sequence of messages. Partitions enable parallelism — consumers in a consumer group are assigned partitions (one partition = one consumer at a time). Partition count determines maximum consumer parallelism.
  • Offset: each message in a partition has a monotonically increasing offset. Consumers track their offset — this is what makes Kafka replayable. Consumer group A can be at offset 1,000; consumer group B (a different pipeline reading the same topic) can be at offset 500.
  • Broker: a Kafka server that stores partitions. A cluster has 3-100 brokers. Each partition has one leader broker (handles reads and writes) and N-1 follower brokers (replicate for durability).
  • ZooKeeper / KRaft: Kafka historically used ZooKeeper for cluster coordination (leader election, broker registration). KRaft mode (Kafka 3.0+) replaces ZooKeeper with Kafka’s own Raft consensus.
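
The topic/partition/offset model above can be sketched as a toy in-memory structure (plain Python, no Kafka client; the class and key names are illustrative, not broker code):

```python
class Partition:
    """An ordered, immutable, append-only sequence of messages."""
    def __init__(self):
        self.log = []  # index in this list == offset

    def append(self, message) -> int:
        self.log.append(message)
        return len(self.log) - 1  # offset of the newly appended message

class Topic:
    """A named log split into N partitions."""
    def __init__(self, name: str, num_partitions: int):
        self.name = name
        self.partitions = [Partition() for _ in range(num_partitions)]

    def produce(self, key: str, message) -> tuple:
        # Same key always maps to the same partition, preserving per-key order.
        p = hash(key) % len(self.partitions)
        return p, self.partitions[p].append(message)

topic = Topic("orders", num_partitions=3)
topic.produce("user-42", "order created")
topic.produce("user-42", "order paid")
# A consumer group is then just a per-partition offset map it owns:
offsets_group_a = {0: 0, 1: 0, 2: 0}  # group A starts from the beginning
```

This is why two pipelines can read the same topic independently: each group keeps its own offset map and nothing is deleted when a message is read.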

Message Durability and Replication

Kafka guarantees durability through replication. Each partition has a replication factor (typically 3): the leader plus 2 followers. Producers write to the leader; followers fetch and replicate asynchronously. The ISR (in-sync replica set) tracks which replicas are fully caught up. With acks=all, a write is acknowledged to the producer only when every ISR replica has confirmed it; the min.insync.replicas setting (typically 2) rejects writes if the ISR shrinks too far, so an acknowledged write always exists on at least that many brokers. If the leader fails, a follower from the ISR is elected as the new leader with no data loss, since every ISR replica had the message. Trade-off: acks=all gives the strongest durability but the highest latency; acks=1 (leader only) is faster but risks losing messages if the leader fails before replication; acks=0 (fire and forget) is fastest but offers no durability guarantee.
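
The acks trade-off can be expressed as a small decision function. This is a toy model of the broker-side rule, not real Kafka code; the function name and parameters are invented for illustration:

```python
def acknowledge(acks: str, leader_ok: bool,
                isr_confirmations: int, isr_size: int) -> bool:
    """Decide whether a produce request is acknowledged, per the acks setting.

    Real Kafka additionally enforces min.insync.replicas before
    accepting an acks=all write; that check is omitted here.
    """
    if acks == "0":    # fire and forget: ack immediately, no guarantee
        return True
    if acks == "1":    # leader only: durable on exactly one machine
        return leader_ok
    if acks == "all":  # wait until every in-sync replica has the write
        return leader_ok and isr_confirmations == isr_size
    raise ValueError(f"unknown acks setting: {acks}")
```

The latency cost of acks=all falls directly out of the last branch: the producer waits for the slowest in-sync replica before its request returns.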

Kafka vs SQS vs RabbitMQ

Apache Kafka: log-based, messages are retained for a configurable period (days/weeks) regardless of consumption — multiple consumer groups can read the same messages. Extremely high throughput (millions of msgs/sec). Best for: event sourcing, stream processing, audit logs, exactly-once semantics (with Kafka Streams / transactions).

Amazon SQS: traditional queue — messages are deleted after successful consumption. Simpler ops (managed service), at-least-once delivery, up to 14-day retention, and two queue types: standard (nearly unlimited throughput, best-effort ordering, occasional duplicates) or FIFO (exactly-once processing, ordered, limited to 300 msgs/sec per queue, or 3,000 with batching; higher in high-throughput mode). Best for: task queues, work distribution, decoupling microservices when you don’t need replay.

RabbitMQ: message broker with advanced routing (exchanges, bindings), push-based delivery (consumers subscribe and messages are pushed). Lower throughput than Kafka but rich routing patterns (topic, fanout, direct, headers exchanges). Best for: complex routing logic, request-reply patterns, low-volume / low-latency use cases.

Consumer Groups and Delivery Semantics

A Kafka consumer group is a set of consumers that collectively consume a topic. Each partition is consumed by exactly one consumer in the group at a time. This enables parallel processing: a topic with 10 partitions supports up to 10 consumers in a group, each processing a distinct partition. Adding more consumers than partitions is wasteful — extra consumers are idle.
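
A rough sketch of partition assignment within one group (round-robin here for brevity; Kafka's built-in assignors are range, round-robin, and sticky variants, but the invariant is the same: each partition goes to at most one consumer in the group):

```python
def assign_partitions(partitions: list, consumers: list) -> dict:
    """Round-robin assignment of partitions to consumers in one group.

    Shows the two properties from the text: a partition is owned by
    exactly one consumer, and consumers beyond the partition count
    end up with nothing to do.
    """
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

With 10 partitions and 4 consumers, each consumer owns 2 or 3 partitions; with 3 partitions and 5 consumers, two consumers sit idle.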

Delivery semantics describe what happens when a consumer crashes mid-processing:

  • At-most-once: commit the offset before processing. If the consumer crashes after committing but before processing, the message is lost.
  • At-least-once: commit the offset after processing. If the consumer crashes after processing but before committing, the message is re-delivered, so processing must be idempotent.
  • Exactly-once: Kafka transactions plus idempotent producers ensure each message is processed exactly once. Requires both the producer and consumer to use the Kafka transactions API and a compatible sink (another Kafka topic or an idempotent external system).
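
The standard defense against at-least-once redelivery is deduplication on a unique message ID. A minimal sketch, assuming an in-memory set stands in for what would be a durable dedup store (database table or cache) in production:

```python
processed_ids = set()  # in production: a durable store keyed by message ID

def handle(message: dict, side_effects: list) -> None:
    """At-least-once consumer made idempotent by deduplication.

    Redelivery of the same message (same 'id') is a no-op, so
    processing twice produces the same outcome as processing once.
    """
    if message["id"] in processed_ids:
        return                                  # duplicate delivery: skip
    side_effects.append(message["payload"])     # the real work (charge card, etc.)
    processed_ids.add(message["id"])
    # the offset commit happens here, AFTER processing
```

Note the ordering: record the ID before committing the offset, so a crash between the two causes a harmless duplicate rather than a lost message.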

Dead Letter Queues

Messages that repeatedly fail processing (due to malformed data or a bug in the consumer) must not block the queue indefinitely. A Dead Letter Queue (DLQ) receives messages after N failed processing attempts. SQS has native DLQ support: configure maxReceiveCount=5 and a DLQ ARN; after 5 failures, SQS moves the message to the DLQ automatically. In Kafka, DLQ is implemented by the consumer: catch processing exceptions, produce the failed message to a separate topic (e.g., “orders-dlq”), then commit the offset to continue. An ops team monitors the DLQ; after fixing the bug, they replay messages from the DLQ back to the main topic.
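
A Kafka-style DLQ loop might look like the following sketch, where process is a stand-in for the business logic and the dlq list stands in for a produce() call to a separate DLQ topic such as "orders-dlq":

```python
MAX_ATTEMPTS = 5  # analogous in spirit to SQS maxReceiveCount

def consume(message: dict, process, dlq: list) -> bool:
    """Try to process a message; after MAX_ATTEMPTS failures, park it in the DLQ.

    Returns True when the offset can be committed, which happens on
    success OR after routing to the DLQ, so one poison message never
    blocks the rest of the partition.
    """
    for _attempt in range(MAX_ATTEMPTS):
        try:
            process(message)
            return True          # success: safe to commit the offset
        except Exception:
            continue             # retry (real code would back off and log)
    dlq.append(message)          # retries exhausted: route to the DLQ topic
    return True                  # commit anyway so consumption continues
```

The key design point is the final return: committing the offset after a DLQ hand-off is what unblocks the partition.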

Ordering Guarantees

Kafka guarantees ordering within a partition, not across partitions. To ensure all events for a given entity (user_id, order_id) are processed in order: set the partition key to that entity ID. All events for the same key hash to the same partition and are processed in offset order by a single consumer. Trade-off: hot keys (one very active user) can overload one partition. SQS FIFO queues guarantee ordering within a message group — use message group ID to route related messages to the same group.
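
Key-based routing is just a deterministic hash modulo the partition count. Kafka's default partitioner uses murmur2 over the key bytes; the sketch below substitutes stdlib crc32, which demonstrates the property that matters (determinism), not Kafka's exact placement:

```python
import zlib

NUM_PARTITIONS = 12

def partition_for(key: str) -> int:
    """Map an entity key (user_id, order_id) to a partition.

    Deterministic, so every event for the same key lands in the
    same partition and is consumed in offset order.
    """
    return zlib.crc32(key.encode()) % NUM_PARTITIONS
```

The hot-key trade-off is visible here too: a single very active key maps to a single partition, so that partition's consumer absorbs all of its traffic.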

Scaling Kafka

Throughput scales by adding partitions and brokers. Storage scales by adding brokers and using Kafka’s partition reassignment tool. For very high fan-out (same message to 1000 consumers), Kafka is ideal — each consumer group reads independently from the same log at its own pace, with zero additional write cost. Network bandwidth is the typical bottleneck — a broker can handle ~1Gbps of replication traffic plus producer/consumer traffic. Kafka compression (snappy, lz4, zstd) reduces network and disk I/O by 3-5x for typical JSON payloads.
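
The compression claim can be sanity-checked with stdlib zlib as a stand-in for snappy/lz4/zstd. The payload below is invented, and exact ratios vary with the data, but repetitive JSON (field names repeated in every record) compresses well:

```python
import json
import zlib

# A typical repetitive JSON event batch: field names repeat per record.
batch = json.dumps([
    {"user_id": i, "event": "page_view", "url": "/home", "ts": 1700000000 + i}
    for i in range(1000)
]).encode()

compressed = zlib.compress(batch)  # stand-in for snappy/lz4/zstd
ratio = len(batch) / len(compressed)
```

Batching matters here: compressing a whole producer batch lets the codec exploit the redundancy across records, which is why Kafka compresses per batch rather than per message.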

Frequently Asked Questions

What are the at-most-once, at-least-once, and exactly-once delivery semantics in Kafka?

Delivery semantics describe what happens when a consumer crashes mid-processing. At-most-once: the consumer commits the offset before processing the message; if it crashes after committing but before finishing, the message is skipped and never redelivered (Kafka considers it done). Appropriate for metrics and logging where occasional loss is acceptable. At-least-once: the consumer processes the message and then commits the offset; if it crashes after processing but before committing, Kafka redelivers the message and it is processed twice. This is the default and most common choice, and it requires idempotent consumers (re-processing the same message produces the same outcome, typically via deduplication on a unique message ID). Exactly-once: requires the Kafka transactions API. The producer sends messages under a transactional ID; Kafka assigns a producer ID and sequence numbers and deduplicates retries at the broker. The consumer commits offsets atomically with its own Kafka producer output. This works only within a Kafka-to-Kafka pipeline; for Kafka-to-external-database, use idempotent writes (upsert by message_id) to approximate exactly-once.

How does Kafka achieve high throughput compared to traditional message brokers?

Kafka reaches millions of messages per second through several design decisions. (1) Sequential disk I/O: messages are appended to a log file sequentially, achieving hundreds of MB/s (versus 1-2 MB/s for random I/O), and sequential reads keep frequently accessed log segments in the OS page cache. (2) Zero-copy transfer: the sendfile() system call (Linux) moves data from the page cache directly to the network socket, bypassing user space and eliminating two memory copies and two context switches per delivery. (3) Batching: producers accumulate messages (by time or size) and send in bulk, consumers receive batches, and batch compression (snappy, lz4, zstd) reduces network bandwidth by 3-5x. (4) Partitioning: a topic with N partitions can be written to and read from in parallel by N producers and N consumer group members. (5) Pull-based consumers: consumers control their own read pace, eliminating backpressure issues. RabbitMQ, by contrast, pushes messages to consumers and stores them with random access patterns, typically yielding 5-10x lower throughput.

When should you choose Kafka over Amazon SQS for a message queue?

Choose Kafka when: (1) multiple consumer groups need to independently read the same messages (Kafka retains messages for a configurable period and allows replay; SQS deletes messages after consumption); (2) you need event replay or reprocessing (Kafka consumers can reset their offset and reprocess history; SQS cannot); (3) throughput exceeds SQS FIFO limits (Kafka scales to millions of msgs/sec); (4) you need strict ordering across the full event history, or you are building event sourcing/CQRS; (5) you need stream processing (Kafka Streams, or Flink reading from Kafka). Choose SQS when: (1) you want a fully managed service with no operational overhead (Kafka requires cluster management, or MSK/Confluent Cloud); (2) you need a simple dead letter queue with automatic routing after N failures; (3) messages should be deleted after consumption and replay is not needed; (4) throughput is modest (under 10K msgs/sec) and you don't want to size and manage Kafka partitions. In practice: use Kafka for event streaming and analytics pipelines; use SQS for task queues and microservice decoupling where replay isn't needed.
