System Design: Distributed Task Queue and Job Scheduler (Celery, SQS, Redis)

What Is a Distributed Task Queue?

A distributed task queue decouples work producers from work executors. Producers enqueue jobs; workers poll the queue and execute them asynchronously. Examples: sending emails after user signup, resizing images, processing payments, running ML inference. Systems like Celery (Python), Sidekiq (Ruby), Bull (Node.js), and cloud-native options (AWS SQS + Lambda, Google Cloud Tasks) implement this pattern.

Requirements

Functional: enqueue tasks with a payload, process tasks by worker pool, support priority queues, delayed/scheduled tasks, retry with exponential backoff on failure, dead letter queue for permanently failed tasks, task status tracking.

Non-functional: at-least-once delivery (no task loss), exactly-once processing for critical tasks (idempotent workers), horizontal worker scaling, monitoring (queue depth, worker utilization, error rate).

Core Design

Task Model

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Task:
    task_id: str
    queue: str           # 'email', 'image_resize', 'payment'
    payload: dict        # serialized job arguments
    priority: int        # higher = more urgent (0-9)
    status: str          # PENDING | RUNNING | DONE | FAILED | DEAD
    created_at: datetime
    scheduled_at: datetime               # earliest time the task may run (for delayed tasks)
    run_at: Optional[datetime] = None    # when a worker actually picked it up
    completed_at: Optional[datetime] = None
    retry_count: int = 0
    max_retries: int = 3
    worker_id: Optional[str] = None
    error: Optional[str] = None

Queue Backend Options

Backend             | Strengths                                               | Weaknesses
Redis (LPUSH/BRPOP) | Sub-millisecond latency; sorted sets for priority/delay | In-memory; durability risk
Redis Streams       | Persistent; consumer groups; at-least-once semantics    | More complex API
PostgreSQL          | ACID; reuses existing infra; advisory locks             | Not built for queuing; high polling load
Kafka               | High throughput; replay; partitioned parallelism        | Overkill at low volume; no native delay
SQS                 | Managed; near-unlimited scale; built-in DLQ             | At-least-once only; ordering only via FIFO queues

Priority Queue with Redis Sorted Sets

import json
import time
import uuid
from typing import Optional

import redis

r = redis.Redis()
QUEUE = 'tasks:priority'    # ready tasks, scored by priority
DELAYED = 'tasks:delayed'   # future tasks, scored by due time
RUNNING = 'tasks:running'   # in-flight tasks, scored by visibility deadline

def enqueue(payload: dict, priority: int = 5, delay_seconds: int = 0):
    task = {
        'task_id': str(uuid.uuid4()),
        'payload': payload,
        'priority': priority,
        'retry_count': 0,
        'created_at': time.time(),
    }
    if delay_seconds > 0:
        # Delayed tasks are scored by their due time; promote_delayed
        # moves them to the priority queue once that time arrives.
        r.zadd(DELAYED, {json.dumps(task): time.time() + delay_seconds})
    else:
        # Score: lower = higher priority. Negative priority dominates the
        # timestamp, which breaks ties FIFO within a priority level.
        score = -priority * 1e12 + time.time()
        r.zadd(QUEUE, {json.dumps(task): score})

def promote_delayed():
    """Move due delayed tasks to the priority queue. Run every second."""
    now = time.time()
    for raw in r.zrangebyscore(DELAYED, '-inf', now):
        # ZREM is atomic and returns 1 only for the caller that actually
        # removed the member, so concurrent promoters can't enqueue the
        # same task twice.
        if r.zrem(DELAYED, raw):
            task = json.loads(raw)
            score = -task['priority'] * 1e12 + time.time()
            r.zadd(QUEUE, {raw: score})

def dequeue(worker_id: str) -> Optional[dict]:
    """Atomically pop the highest-priority task and record a visibility deadline."""
    # ZPOPMIN (Redis 5+) removes the lowest-score member atomically, so two
    # workers can never claim the same task. A peek-then-ZREM sequence would
    # race: both workers could see the same head of the queue.
    popped = r.zpopmin(QUEUE)
    if not popped:
        return None
    raw, _score = popped[0]
    task = json.loads(raw)
    task['worker_id'] = worker_id
    task['started_at'] = time.time()
    visibility_deadline = time.time() + 300  # 5 min visibility timeout
    # A crash between ZPOPMIN and ZADD can still lose the task; a Lua script
    # combining both steps closes that window.
    r.zadd(RUNNING, {json.dumps(task): visibility_deadline})
    return task

Worker Loop with Retry

BACKOFF_BASE = 2  # seconds
MAX_RETRIES = 3

def run_worker(worker_id: str):
    while True:
        task = dequeue(worker_id)
        if not task:
            time.sleep(1)
            continue
        try:
            handler = HANDLERS[task['payload']['type']]
            handler(task['payload'])
            acknowledge(task)   # remove from RUNNING
        except Exception as e:
            handle_failure(task, str(e))

def handle_failure(task: dict, error: str):
    task['retry_count'] += 1
    task['error'] = error
    if task['retry_count'] <= MAX_RETRIES:
        delay = BACKOFF_BASE ** task['retry_count']  # 2, 4, 8 seconds
        # Re-schedule the mutated task dict directly so retry_count survives;
        # calling enqueue() would mint a fresh task with retry_count reset to 0.
        r.zadd(DELAYED, {json.dumps(task): time.time() + delay})
    else:
        # Move to dead letter queue for inspection and manual replay
        r.lpush('tasks:dead', json.dumps(task))
    remove_from_running(task)

Visibility Timeout and At-Least-Once Delivery

When a worker dequeues a task, it moves to a RUNNING set with a deadline. If the worker crashes before acknowledging, a watchdog job re-enqueues the task after the deadline expires. This ensures no task is silently dropped at the cost of potential duplicate processing.

def recover_stuck_tasks():
    """Watchdog: re-enqueue tasks whose workers timed out."""
    now = time.time()
    for raw in r.zrangebyscore(RUNNING, '-inf', now):
        if not r.zrem(RUNNING, raw):
            continue  # another watchdog instance already claimed it
        task = json.loads(raw)
        # Count the timeout as a failed attempt so a poison task eventually
        # lands in the DLQ instead of looping between RUNNING and the queue.
        handle_failure(task, 'visibility timeout')

Exactly-Once Processing

At-least-once delivery means a task may be processed twice (e.g., the worker crashes after processing but before the ACK). For exactly-once semantics, make workers idempotent: use the task_id as a deduplication key in Redis. Claim the key atomically with SET NX before processing (a separate check-then-add leaves a race where two workers both pass the check), with a TTL at least as long as the visibility timeout. If the claim fails, the task was already handled: skip processing and ACK.
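
A minimal sketch of that dedup check. The client is passed in, and `process_once` and the `tasks:processed:` key prefix are illustrative names, not part of any library; note this variant claims the key before running the handler, trading a possible lost task after a crash-after-claim for zero duplicates.

```python
def process_once(r, task: dict, handler, ttl_seconds: int = 300) -> bool:
    """Run handler at most once per task_id; returns False for duplicates."""
    key = f"tasks:processed:{task['task_id']}"
    # SET ... NX EX ttl succeeds only for the first delivery of this task_id.
    if not r.set(key, '1', nx=True, ex=ttl_seconds):
        return False  # duplicate delivery: skip the work, still ACK the task
    handler(task['payload'])
    return True
```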

Cron / Scheduled Tasks

For recurring jobs (daily reports, hourly cleanup), store cron schedules in a separate table. A scheduler process runs every minute, queries cron jobs due in the current window, and enqueues them. Use a database lock or Redis SET NX to prevent multiple scheduler instances from enqueuing the same cron job twice.
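
The SET NX guard might look like this; `cron:lock:*` and both function names are illustrative. Each scheduler instance tries to claim the current minute, and only the winner enqueues.

```python
import time

def try_claim_minute(r, minute: int, ttl_seconds: int = 55) -> bool:
    # SET NX: exactly one scheduler instance gets True for a given minute.
    # The TTL keeps stale locks from blocking the next window after a crash.
    return bool(r.set(f'cron:lock:{minute}', '1', nx=True, ex=ttl_seconds))

def scheduler_tick(r, due_jobs, enqueue) -> None:
    """Run once a minute: enqueue cron jobs due in the current window."""
    minute = int(time.time() // 60)
    if not try_claim_minute(r, minute):
        return  # another instance already handled this minute
    for job in due_jobs(minute):
        enqueue(job)
```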

Monitoring

  • Queue depth: ZCARD on each queue. Alert if depth exceeds threshold (backlog building).
  • Worker utilization: active workers / total workers. Alert if below 20% (overprovisioned) or above 90% (underprovisioned).
  • Error rate: DLQ growth rate. Alert if tasks are dying faster than being processed.
  • End-to-end latency: time from enqueue to completion. Alert if P99 exceeds SLA.
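
The queue-depth check, for instance, is one ZCARD per queue; a sketch with a hypothetical helper name and the client passed in:

```python
def queues_over_threshold(r, queues, threshold: int) -> dict:
    """Return {queue_name: depth} for queues whose backlog exceeds threshold."""
    # ZCARD is O(1), so polling every scrape interval is cheap.
    return {q: depth for q in queues if (depth := r.zcard(q)) > threshold}
```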

Interview Questions

Q: How do you scale a task queue to handle 100K tasks/second?

Partition by queue name (email, image_resize) — each queue is an independent Redis sorted set. Scale workers horizontally — each worker pulls from its assigned queues. Use Redis Cluster to distribute queue data across multiple nodes (partition by queue name hash). For burst capacity, use auto-scaling worker pools (e.g., AWS ECS with target tracking on queue depth). SQS handles this natively at any scale.

Q: How do you prevent a slow task from blocking the queue?

Use separate queues for fast (under 1 second) and slow (over 10 seconds) tasks, and assign more workers to the fast queues. Each worker processes one task at a time per thread, so a slow task never blocks other workers. Add a task-level timeout: if a task exceeds max_runtime, the watchdog marks it as failed and re-enqueues it or sends it to the DLQ.
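
One way to enforce a task-level timeout inside the worker itself is SIGALRM; `run_with_timeout` is an illustrative helper, and this only works in the main thread of a Unix process.

```python
import signal

def run_with_timeout(handler, payload, max_runtime: int):
    """Raise TimeoutError if handler exceeds max_runtime seconds (Unix, main thread only)."""
    def _on_timeout(signum, frame):
        raise TimeoutError(f'task exceeded {max_runtime}s')
    old = signal.signal(signal.SIGALRM, _on_timeout)
    signal.alarm(max_runtime)          # kernel delivers SIGALRM after max_runtime
    try:
        return handler(payload)
    finally:
        signal.alarm(0)                # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)
```

The TimeoutError propagates into the worker loop's except clause, so the usual handle_failure retry/DLQ path applies.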

Asked at: Uber, Netflix, Databricks, Stripe, Atlassian
