What Is a Distributed Message Queue?
A message queue decouples producers (services that generate work) from consumers (services that process work). Producers publish messages without waiting for consumers to process them. Consumers pull messages at their own pace. This provides backpressure handling (consumers can fall behind without blocking producers), fault tolerance (messages persist even if consumers are down), and horizontal scalability (add consumers to increase throughput).
You’re designing something like Amazon SQS, RabbitMQ, or a simplified Kafka. The core difference from Kafka: a traditional message queue deletes messages after consumption (each message consumed by exactly one consumer), while Kafka retains messages as a log (multiple consumer groups can read independently).
System Requirements
Functional
- Producers enqueue messages to named queues
- Consumers dequeue messages (one consumer receives each message)
- At-least-once delivery: no message lost, possible duplicate on failure
- Visibility timeout: dequeued message is hidden from others for N seconds; if not acknowledged, becomes visible again
- Dead letter queue: messages that fail repeatedly move to a DLQ
- Message ordering: FIFO queues preserve order within a message group
Non-Functional
- High throughput: 100K messages/second per queue
- Low latency: enqueue <10ms p99, dequeue <20ms p99
- Durability: messages persisted to disk (replicated) before acknowledging to producer
Core Data Model
queues: id, name, visibility_timeout_sec, max_receive_count, dlq_id
messages: id, queue_id, body, status (available/invisible/deleted), receive_count,
enqueued_at, visible_at, receipt_handle
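The schema above can be sketched as Python dataclasses, a minimal sketch assuming an enum for the three message states (field names follow the schema; defaults like visibility_timeout_sec=30 are illustrative):

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class MessageStatus(Enum):
    AVAILABLE = "available"
    INVISIBLE = "invisible"
    DELETED = "deleted"

@dataclass
class Queue:
    id: str
    name: str
    visibility_timeout_sec: int = 30
    max_receive_count: int = 5
    dlq_id: Optional[str] = None  # queue to route poison messages to

@dataclass
class Message:
    queue_id: str
    body: bytes
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: MessageStatus = MessageStatus.AVAILABLE
    receive_count: int = 0
    enqueued_at: float = field(default_factory=time.time)
    visible_at: float = field(default_factory=time.time)  # eligible for dequeue when <= now
    receipt_handle: Optional[str] = None  # set on each dequeue, used to acknowledge
```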
Enqueue (Producer)
- Producer sends message to broker via HTTP or gRPC
- Broker assigns message_id (UUID), sets status=available, visible_at=now
- Broker writes to WAL (Write-Ahead Log) on disk — durability before acknowledgment
- Broker replicates to N-1 follower nodes (synchronous replication for durability)
- Broker acknowledges to producer with message_id
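The durability point in step 3 is the WAL append plus fsync: only after the record is on disk (and replicated) does the broker acknowledge. A minimal sketch of that append, assuming JSON-lines records (the record format is an assumption, not a real system's):

```python
import json
import os

def wal_append(wal_file, record: dict) -> None:
    """Append one record to the write-ahead log and force it to disk.

    A real broker would also replicate to followers before acknowledging;
    that step is elided here.
    """
    wal_file.write(json.dumps(record) + "\n")
    wal_file.flush()
    os.fsync(wal_file.fileno())  # durability point: data is on disk before we ack
```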
Dequeue (Consumer)
- Consumer sends dequeue request (with max_messages count, e.g., 10)
- Broker selects up to max_messages messages with status=available AND visible_at <= now
- Broker sets status=invisible, visible_at=now + visibility_timeout, generates receipt_handle
- Broker returns messages with their receipt handles
- Consumer processes messages, sends Acknowledge(receipt_handle) for each
- Broker marks message as deleted
If consumer crashes without acknowledging: when visibility_timeout expires, the message becomes visible again and a different consumer picks it up. This is the at-least-once delivery guarantee — duplicates are possible on failure.
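The dequeue/acknowledge/redelivery cycle can be sketched as a single-queue in-memory broker. This is a sketch, not a real broker: the class and method names are illustrative, and the `now` parameter exists only to make timeout expiry easy to exercise:

```python
import time
import uuid

class Broker:
    """Minimal single-queue broker: enqueue, dequeue with visibility
    timeout, acknowledge via receipt handle."""

    def __init__(self, visibility_timeout_sec: int = 30):
        self.visibility_timeout = visibility_timeout_sec
        self.messages = {}  # message_id -> message dict

    def enqueue(self, body: bytes) -> str:
        mid = str(uuid.uuid4())
        self.messages[mid] = {"id": mid, "body": body, "status": "available",
                              "receive_count": 0, "visible_at": time.time(),
                              "receipt_handle": None}
        return mid

    def dequeue(self, max_messages: int = 10, now: float = None) -> list:
        now = now if now is not None else time.time()
        batch = []
        for m in self.messages.values():
            if len(batch) >= max_messages:
                break
            # an expired visibility timeout makes the message available again
            if m["status"] == "invisible" and m["visible_at"] <= now:
                m["status"] = "available"
            if m["status"] == "available" and m["visible_at"] <= now:
                m["status"] = "invisible"
                m["visible_at"] = now + self.visibility_timeout
                m["receive_count"] += 1
                m["receipt_handle"] = str(uuid.uuid4())  # fresh handle per delivery
                batch.append(m)
        return batch

    def acknowledge(self, receipt_handle: str) -> bool:
        for m in self.messages.values():
            if m["receipt_handle"] == receipt_handle and m["status"] == "invisible":
                m["status"] = "deleted"
                return True
        return False  # stale handle: the message was already redelivered or deleted
```

Note that a stale receipt handle (from a delivery whose timeout already expired) is rejected, which is why acknowledging after a redelivery to another consumer does not delete the in-flight copy.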
Visibility Timeout and Redelivery
The visibility_timeout is critical. If too short: slow-processing consumers cause redeliveries (duplicates). If too long: failed messages stay hidden too long, causing delays. Best practice: set visibility_timeout to 3-6x the expected processing time. Consumers that need more time can extend their visibility lease via an ExtendVisibility API call.
Dead letter queue: after max_receive_count deliveries (e.g., 5), the message is moved to the DLQ instead of becoming visible again. DLQ messages can be inspected for debugging and replayed after fixing the bug.
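The DLQ routing decision fires when a visibility timeout expires. A hedged sketch of that branch, where `move_to_dlq` and `make_visible` are hypothetical callbacks standing in for the broker's storage operations:

```python
def on_visibility_expired(message: dict, queue: dict, move_to_dlq, make_visible) -> None:
    """Decide redelivery vs. dead-lettering when a message's visibility
    timeout expires without an acknowledgment."""
    if message["receive_count"] >= queue["max_receive_count"]:
        # poison message: stop retrying, park it for inspection/replay
        move_to_dlq(message, queue["dlq_id"])
    else:
        # redeliver: another consumer will pick it up
        make_visible(message)
```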
Storage Layer
Two options depending on scale:
Database-backed (SQS approach)
Store messages in a distributed database (DynamoDB, Cassandra). Simple, proven. Query for available messages: WHERE status='available' AND visible_at <= now LIMIT 10. Use optimistic locking or conditional writes to atomically claim messages. Index on (queue_id, status, visible_at) for efficient dequeue queries. Works well up to millions of messages in flight.
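The "claim" step is the subtle part: two consumers may read the same available row, so the status transition must be conditional. A sketch of that compare-and-set, using a process-wide lock to stand in for the database's atomic conditional write (e.g. a DynamoDB ConditionExpression or an `UPDATE ... WHERE status='available'` that checks rows affected):

```python
import threading

_claim_lock = threading.Lock()  # stand-in for the database's atomicity

def try_claim(message: dict, now: float, visibility_timeout_sec: int) -> bool:
    """Atomically claim a message iff it is still available and visible.

    Only one of any number of concurrent callers succeeds; the rest see
    status != 'available' and move on to the next candidate row.
    """
    with _claim_lock:
        if message["status"] == "available" and message["visible_at"] <= now:
            message["status"] = "invisible"
            message["visible_at"] = now + visibility_timeout_sec
            message["receive_count"] += 1
            return True
        return False  # another consumer claimed it first
```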
Log-based (Kafka-style)
Append messages to an ordered log per queue. Track consumer offset (next message index to deliver). No deletion — mark consumed via offset advancement. More efficient (sequential writes), but tracking per-message visibility state is complex. Better for high-throughput, ordered scenarios.
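A minimal sketch of the log-based approach, assuming one in-memory list per queue and per-group offsets (illustrative names, not any real system's API):

```python
class LogQueue:
    """Append-only log with per-consumer-group offsets.

    Records are never deleted; consumption is recorded by advancing a
    group's offset, so multiple groups read the same log independently.
    """

    def __init__(self):
        self.log = []      # sequential append: list index == offset
        self.offsets = {}  # group_id -> next offset to deliver

    def append(self, body: bytes) -> int:
        self.log.append(body)
        return len(self.log) - 1  # offset of the new record

    def poll(self, group_id: str, max_records: int = 10):
        start = self.offsets.get(group_id, 0)
        return start, self.log[start:start + max_records]

    def commit(self, group_id: str, offset: int) -> None:
        self.offsets[group_id] = offset  # "consume" by advancing, not deleting
```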
Ensuring FIFO Ordering
Basic queues don’t guarantee order (multiple consumers, concurrent dequeues). For FIFO:
- Single-consumer constraint: only one consumer processes a given message group at a time
- Message group ID: producers tag messages with a group_id. Messages with the same group_id are locked: only one consumer can have an in-flight message from a group at once. Others wait.
- Implementation: group_id → consumer_id lock in Redis with TTL = visibility_timeout. Released on acknowledgment or timeout.
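In Redis this lock is one command, `SET group_id consumer_id NX EX ttl`, released on acknowledgment. A sketch of the same semantics with an in-memory stand-in (the class is illustrative; the `now` parameter only makes lease expiry testable):

```python
import time

class GroupLockTable:
    """In-memory stand-in for the Redis group lock: SET key value NX EX ttl."""

    def __init__(self):
        self.locks = {}  # group_id -> (consumer_id, expires_at)

    def try_lock(self, group_id: str, consumer_id: str, ttl_sec: int,
                 now: float = None) -> bool:
        now = now if now is not None else time.time()
        holder = self.locks.get(group_id)
        if holder is None or holder[1] <= now:  # free, or previous lease expired
            self.locks[group_id] = (consumer_id, now + ttl_sec)
            return True
        return holder[0] == consumer_id  # current holder may keep processing

    def release(self, group_id: str, consumer_id: str) -> None:
        """Called on acknowledgment; only the holder may release."""
        holder = self.locks.get(group_id)
        if holder and holder[0] == consumer_id:
            del self.locks[group_id]
```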
Scaling
- Partitioning: split a high-volume queue across multiple broker partitions. Producer hashes message key to partition. Consumers assigned to partitions. Ordering guaranteed within a partition, not across partitions.
- Consumer scaling: add consumers to process faster. Automatically distributed across partitions via a group coordinator (similar to Kafka’s consumer group protocol).
- Broker scaling: add broker nodes. Reassign partitions to new brokers. Leader election via Raft or ZooKeeper.
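The producer-side routing in the first bullet is a stable hash of the message key. A sketch (hash choice is an assumption; any stable hash works, but it must not vary across producer processes the way Python's built-in `hash()` does):

```python
import hashlib

def partition_for(message_key: str, num_partitions: int) -> int:
    """Stable key-to-partition routing: the same key always lands in the
    same partition, so per-key ordering holds within that partition."""
    digest = hashlib.md5(message_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Note the caveat this implies: changing num_partitions remaps keys, which is why partition counts are usually chosen generously up front or changed with care.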
Interview Tips
- The visibility timeout mechanism is the key idea — explain it carefully. It’s what enables at-least-once delivery without a central lock.
- Distinguish traditional queue (each message consumed once, then deleted) from Kafka log (messages retained, multiple consumer groups). Know which you’re designing.
- Dead letter queue is a real production concern — mention max_receive_count and DLQ routing to show operational maturity.
- FIFO queues: the group_id locking pattern is the right answer — prevents two consumers from processing group messages out of order.