Why Message Queues?
A message queue decouples producers (services that generate events) from consumers (services that process them), enabling async communication, load leveling, and fault isolation. Without a queue, if the downstream service is slow or down, the upstream service either blocks (synchronous) or loses the event (fire-and-forget). A durable message queue absorbs bursts (load leveling), allows consumers to scale independently, and retains unprocessed messages until consumers are ready. Core use cases: order processing (charge card → ship item → send receipt, each step async), event streaming (user activity logs → analytics pipeline), task queues (image resize jobs), and inter-service communication in microservices.
Kafka Architecture
Kafka is a distributed log-based message broker. Key concepts:
- Topic: a named log of messages (like a table in a database). Producers append to a topic; consumers read from it.
- Partition: a topic is split into N partitions, each an ordered, immutable sequence of messages. Partitions enable parallelism — consumers in a consumer group are assigned partitions (one partition = one consumer at a time). Partition count determines maximum consumer parallelism.
- Offset: each message in a partition has a monotonically increasing offset. Consumers track their offset — this is what makes Kafka replayable. Consumer group A can be at offset 1,000; consumer group B (a different pipeline reading the same topic) can be at offset 500.
- Broker: a Kafka server that stores partitions. A production cluster typically runs anywhere from 3 brokers to well over 100. Each partition has one leader broker (handles reads and writes) and N-1 follower brokers (replicate for durability).
- ZooKeeper / KRaft: Kafka historically used ZooKeeper for cluster coordination (leader election, broker registration). KRaft mode (introduced in Kafka 2.8, production-ready since 3.3) replaces ZooKeeper with Kafka’s own Raft-based consensus.
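The interplay of partitions, offsets, and independent consumer groups can be sketched with a toy in-memory model (plain Python, no broker; the class and variable names are illustrative, not a real client API):

```python
# Toy in-memory model of a topic: illustrative only. A real broker
# persists the log, replicates it, and handles consumer rebalancing.

class Partition:
    def __init__(self):
        self.log = []                   # append-only, immutable once written

    def append(self, msg):
        self.log.append(msg)
        return len(self.log) - 1        # offset of the new message

class Topic:
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def produce(self, key, msg):
        # Same key always maps to the same partition (per-key ordering).
        p = hash(key) % len(self.partitions)
        return p, self.partitions[p].append(msg)

class ConsumerGroup:
    """Each group tracks its own offsets, so groups read the same log
    independently, each at its own pace."""
    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self, partition):
        log = self.topic.partitions[partition].log
        if self.offsets[partition] < len(log):
            msg = log[self.offsets[partition]]
            self.offsets[partition] += 1   # advance (commit) the offset
            return msg
        return None                        # caught up: nothing new

topic = Topic(num_partitions=3)
part, _ = topic.produce("user-1", "login")
topic.produce("user-1", "click")

analytics = ConsumerGroup(topic)    # group A
billing = ConsumerGroup(topic)      # group B reads the same messages
```

Note that polling from `analytics` does not affect what `billing` sees: consuming never deletes from the log, which is what makes replay possible.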
Message Durability and Replication
Kafka guarantees durability through replication. Each partition has a replication factor (typically 3): the leader plus two followers. Producers write to the leader; followers fetch and replicate. The ISR (in-sync replica set) tracks which replicas are fully caught up. If the leader fails, a follower from the ISR is elected as the new leader; as long as writes used acks=all, no data is lost, since every ISR replica had the message. The producer's acks setting controls the durability/latency trade-off:
- acks=all: the write is acknowledged only after all ISR replicas have confirmed it. Strongest durability, highest latency.
- acks=1: acknowledged as soon as the leader has the write. Faster, but messages are lost if the leader fails before replication.
- acks=0: fire and forget. Fastest, but no durability guarantee.
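A minimal sketch of how the acks levels differ, as a toy simulation (illustrative only; this is not the real replication protocol, and the function name is made up):

```python
# Toy simulation of Kafka acks levels. Replicas are plain lists standing
# in for replica logs; isr holds the indexes of in-sync followers.

def produce(msg, acks, leader, followers, isr):
    """Append msg per the given acks policy; return True once the write
    is 'acknowledged' to the producer."""
    if acks == 0:
        # Fire and forget: producer does not wait for any confirmation,
        # so the message is lost if the leader dies before writing it.
        leader.append(msg)
        return True
    leader.append(msg)
    if acks == 1:
        # Leader-only ack: durable only if the leader survives long
        # enough for followers to fetch the message.
        return True
    # acks == "all": wait until every in-sync replica has the message.
    for i in isr:
        followers[i].append(msg)
    return True

leader, f1, f2 = [], [], []
produce("order-1", acks="all", leader=leader, followers=[f1, f2], isr=[0, 1])
# If the leader now fails, either follower can take over with no data loss.
```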
Kafka vs SQS vs RabbitMQ
Apache Kafka: log-based, messages are retained for a configurable period (days/weeks) regardless of consumption — multiple consumer groups can read the same messages. Extremely high throughput (millions of msgs/sec). Best for: event sourcing, stream processing, audit logs, exactly-once semantics (with Kafka Streams / transactions).
Amazon SQS: traditional queue — messages are deleted after successful consumption. Simpler ops (managed service), at-least-once delivery, up to 14-day retention, standard queue (unlimited throughput, best-effort ordering) or FIFO queue (exactly-once processing, ordered, roughly 300 msgs/sec per queue, or 3,000 with batching). Best for: task queues, work distribution, decoupling microservices when you don’t need replay.
RabbitMQ: message broker with advanced routing (exchanges, bindings), push-based delivery (consumers subscribe and messages are pushed). Lower throughput than Kafka but rich routing patterns (topic, fanout, direct, headers exchanges). Best for: complex routing logic, request-reply patterns, low-volume / low-latency use cases.
Consumer Groups and Delivery Semantics
A Kafka consumer group is a set of consumers that collectively consume a topic. Each partition is consumed by exactly one consumer in the group at a time. This enables parallel processing: a topic with 10 partitions supports up to 10 consumers in a group, each processing a distinct partition. Adding more consumers than partitions is wasteful — extra consumers are idle.
Delivery semantics:
- At-most-once: commit the offset before processing. If the consumer crashes after the commit but before processing, the message is lost.
- At-least-once: commit the offset after processing. If the consumer crashes after processing but before committing, the message is re-delivered, so processing must be idempotent.
- Exactly-once: Kafka transactions plus idempotent producers ensure each message is processed exactly once. Requires both the producer and the consumer to use the Kafka Transactions API and a compatible sink (another Kafka topic or an idempotent external system).
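Why commit ordering determines the guarantee can be demonstrated with a toy consumer (illustrative; `crash_mid` simulates a crash between the two steps):

```python
# Toy demo of commit ordering for at-most-once vs at-least-once.

def consume(messages, offset, crash_mid, commit_first):
    """Process one message; return (new_offset, processed).
    crash_mid=True simulates a crash between commit and processing."""
    processed = []
    msg = messages[offset]
    if commit_first:                    # at-most-once
        offset += 1                     # commit first...
        if crash_mid:
            return offset, processed    # crash: message lost forever
        processed.append(msg)           # ...then process
    else:                               # at-least-once
        processed.append(msg)           # process first...
        if crash_mid:
            return offset, processed    # crash: offset unchanged, so the
                                        # message will be re-delivered
        offset += 1                     # ...then commit
    return offset, processed

# at-most-once + crash: offset advanced, nothing processed (lost message)
off_a, done_a = consume(["charge-card"], 0, crash_mid=True, commit_first=True)
# at-least-once + crash: processed once, offset unchanged (will reprocess)
off_b, done_b = consume(["charge-card"], 0, crash_mid=True, commit_first=False)
```

In the second case the restarted consumer reprocesses the same message, which is exactly why at-least-once handlers must be idempotent.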
Dead Letter Queues
Messages that repeatedly fail processing (due to malformed data or a bug in the consumer) must not block the queue indefinitely. A Dead Letter Queue (DLQ) receives messages after N failed processing attempts. SQS has native DLQ support: configure maxReceiveCount=5 and a DLQ ARN; after 5 failures, SQS moves the message to the DLQ automatically. In Kafka, DLQ is implemented by the consumer: catch processing exceptions, produce the failed message to a separate topic (e.g., “orders-dlq”), then commit the offset to continue. An ops team monitors the DLQ; after fixing the bug, they replay messages from the DLQ back to the main topic.
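A consumer-side DLQ loop can be sketched like this (plain-Python stand-ins; a real implementation would poll a Kafka consumer and produce failures to a topic such as “orders-dlq”):

```python
# Sketch of consumer-side DLQ handling: retry a few times, then divert
# the poison message so the partition is not blocked. Illustrative only.

MAX_ATTEMPTS = 5

def handle(msg, process, dlq):
    """Try process(msg) up to MAX_ATTEMPTS times; on repeated failure,
    append the message plus error context to dlq and move on."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            process(msg)
            return "ok"
        except Exception as err:
            last_err = err              # remember why it failed
    dlq.append({"msg": msg, "error": str(last_err), "attempts": MAX_ATTEMPTS})
    return "dead-lettered"              # offset is committed either way

dlq = []

def broken(msg):                        # simulates a poison message / bug
    raise ValueError("malformed payload")

status = handle("order-42", broken, dlq)
```

The key point is that the offset is committed whether processing succeeded or the message was dead-lettered; only then can the consumer make progress past a poison message.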
Ordering Guarantees
Kafka guarantees ordering within a partition, not across partitions. To ensure all events for a given entity (user_id, order_id) are processed in order: set the partition key to that entity ID. All events for the same key hash to the same partition and are processed in offset order by a single consumer. Trade-off: hot keys (one very active user) can overload one partition. SQS FIFO queues guarantee ordering within a message group — use message group ID to route related messages to the same group.
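Key-based partitioning boils down to hashing the key onto a partition index. Kafka's default partitioner applies murmur2 to the key bytes; md5 below is just a deterministic stand-in for illustration:

```python
import hashlib

# Sketch of key-based partitioning: same key -> same partition, always.
# md5 is a deterministic stand-in for Kafka's murmur2 partitioner.

def partition_for(key: str, num_partitions: int) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event for order-123 lands on the same partition, so one consumer
# sees "created", "paid", "shipped" in offset order.
events = ["created", "paid", "shipped"]
parts = {partition_for("order-123", 12) for _ in events}
```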
Scaling Kafka
Throughput scales by adding partitions and brokers. Storage scales by adding brokers and using Kafka’s partition reassignment tool. For very high fan-out (same message to 1000 consumers), Kafka is ideal — each consumer group reads independently from the same log at its own pace, with zero additional write cost. Network bandwidth is the typical bottleneck — a broker can handle ~1Gbps of replication traffic plus producer/consumer traffic. Kafka compression (snappy, lz4, zstd) reduces network and disk I/O by 3-5x for typical JSON payloads.
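A rough illustration of why compression pays off for repetitive JSON payloads, using stdlib zlib as a stand-in for snappy/lz4/zstd (actual ratios vary with payload shape and codec):

```python
import json
import zlib

# A batch of near-identical events: repeated field names and values
# compress extremely well, which is typical of JSON event streams.
batch = [
    {"user_id": i % 100, "event": "page_view", "url": "/home"}
    for i in range(1000)
]
raw = json.dumps(batch).encode()
compressed = zlib.compress(raw)
ratio = len(raw) / len(compressed)
```

In Kafka, compression is applied per batch on the producer, so larger batches of similar messages compress better, which cuts both network and disk I/O.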
Companies That Ask This
Asked at: Twitter/X, Coinbase, Stripe.