Core Entities
JobDefinition: job_id, name, type (CRON, INTERVAL, ONE_TIME), schedule (cron expression or interval_seconds), handler (class name or function reference), params (JSON), max_retries, retry_delay_seconds, timeout_seconds, enabled, created_by, created_at.
JobExecution: execution_id, job_id, status (QUEUED, RUNNING, COMPLETED, FAILED, TIMED_OUT, CANCELLED), scheduled_at, started_at, completed_at, worker_id, attempt_number, error_message, result (JSON).
Worker: worker_id, hostname, status (ACTIVE, DRAINING, OFFLINE), last_heartbeat, current_job_id.
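The entities above can be sketched as Python dataclasses. This is a minimal illustration, not a schema definition: field names follow the text, and defaults such as max_retries=3 are arbitrary placeholders.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class JobType(Enum):
    CRON = "CRON"
    INTERVAL = "INTERVAL"
    ONE_TIME = "ONE_TIME"

class ExecutionStatus(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    CANCELLED = "CANCELLED"

@dataclass
class JobDefinition:
    job_id: str
    name: str
    type: JobType
    schedule: str                 # cron expression, or interval_seconds as text
    handler: str                  # dotted path to the handler function
    params: dict
    max_retries: int = 3          # illustrative defaults
    retry_delay_seconds: int = 10
    timeout_seconds: int = 300
    enabled: bool = True

@dataclass
class JobExecution:
    execution_id: str
    job_id: str
    status: ExecutionStatus = ExecutionStatus.QUEUED
    scheduled_at: Optional[datetime] = None
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    worker_id: Optional[str] = None
    attempt_number: int = 1
    error_message: Optional[str] = None
    result: Optional[dict] = None
```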
Cron Scheduling Engine
The scheduler runs a loop that determines which jobs are due for execution. Cron expression parsing: parse a 5-field cron expression (minute, hour, day, month, weekday) to determine the next execution time for each job. Libraries: croniter (Python), node-cron (JavaScript), Quartz (Java). On each tick (every 10 seconds): find all enabled jobs where next_execution_at <= NOW(). For each due job: create a JobExecution record with status=QUEUED and enqueue to a message queue (Redis Sorted Set, Kafka, or SQS). Update the job's next_execution_at to the next scheduled time.
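The tick described above can be sketched as follows for INTERVAL jobs, with jobs and the queue as plain in-memory structures for illustration. For CRON jobs, the next_execution_at update would come from a cron library such as croniter rather than a fixed delta.

```python
from datetime import datetime, timedelta

def tick(jobs, queue, now):
    """One scheduler tick: enqueue all due jobs and advance next_execution_at.

    `jobs` is a list of dicts with keys: job_id, enabled, interval_seconds,
    next_execution_at. `queue` stands in for the message queue.
    """
    for job in jobs:
        if not job["enabled"]:
            continue
        # Catch up on any runs missed while the scheduler was down.
        while job["next_execution_at"] <= now:
            queue.append({"job_id": job["job_id"],
                          "scheduled_at": job["next_execution_at"],
                          "status": "QUEUED"})
            job["next_execution_at"] += timedelta(seconds=job["interval_seconds"])
```

The while loop re-enqueues every missed interval; that is a design choice — some schedulers instead skip missed runs and jump straight to the next future time.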
Multiple scheduler instances: to avoid duplicate job creation when running multiple scheduler instances for high availability, use a distributed lock per job. Before creating a JobExecution: SET job_lock:{job_id} scheduler_instance_id NX EX 60 (note: the legacy SETNX command cannot set an expiry; SET with the NX and EX options sets the key and its TTL atomically). Only the instance that wins the lock creates the execution — exactly one instance proceeds. Alternative: use a leader election pattern (Zookeeper, etcd) — only the current leader schedules jobs. Simpler to reason about, but failover adds latency.
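A sketch of the per-job lock logic, using a hypothetical in-memory stand-in for Redis SET NX EX so it is testable without a server. With the real redis-py client, the equivalent call is r.set(key, instance_id, nx=True, ex=60), which returns True only for the winner.

```python
import time

class InMemoryLockStore:
    """Illustrative stand-in for Redis SET ... NX EX semantics."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl_seconds, now=None):
        now = time.time() if now is None else now
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False                    # lock held and not expired
        self._data[key] = (value, now + ttl_seconds)
        return True

def try_schedule(store, job_id, scheduled_ts, instance_id):
    """Return True iff this scheduler instance won the lock and should
    create the JobExecution. Keyed on (job_id, scheduled time) so each
    trigger time is claimed at most once."""
    return store.set_nx_ex(f"job_lock:{job_id}:{scheduled_ts}", instance_id, 60)
```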
Worker Execution Model
Workers poll the job queue (or receive push via queue subscription). On dequeue: update JobExecution status=RUNNING, set worker_id=self, started_at=NOW(). Execute the job handler. On success: status=COMPLETED, completed_at=NOW(), result=output. On exception: status=FAILED, error_message=exception. Timeout enforcement: each worker has a watchdog thread that checks whether the running job has exceeded its timeout_seconds. On timeout: terminate the job and set status=TIMED_OUT (send SIGTERM if the handler runs as a subprocess; threads cannot be forcibly killed, which is one reason to run handlers in subprocesses). Workers send heartbeats every 30 seconds: UPDATE workers SET last_heartbeat=NOW(). A monitor process checks for workers whose last_heartbeat is older than 90 seconds — those workers are declared dead. Any RUNNING jobs on dead workers are re-queued (set status=QUEUED, clear worker_id).
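The dead-worker sweep can be sketched as follows, with in-memory dicts standing in for the workers and job_executions tables. The 90-second threshold comes from the text (three missed 30-second heartbeats).

```python
from datetime import timedelta

HEARTBEAT_TIMEOUT = timedelta(seconds=90)

def reap_dead_workers(workers, executions, now):
    """Monitor pass: mark workers with stale heartbeats OFFLINE and
    re-queue their RUNNING jobs. Returns the number of re-queued jobs."""
    dead = {w["worker_id"] for w in workers
            if now - w["last_heartbeat"] > HEARTBEAT_TIMEOUT}
    for w in workers:
        if w["worker_id"] in dead:
            w["status"] = "OFFLINE"
    requeued = 0
    for ex in executions:
        if ex["status"] == "RUNNING" and ex["worker_id"] in dead:
            ex["status"] = "QUEUED"     # another worker will pick it up
            ex["worker_id"] = None
            requeued += 1
    return requeued
```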
Retry Logic with Exponential Backoff
On job failure: if attempt_number < max_retries: create a new JobExecution with attempt_number+1, scheduled_at = NOW() + retry_delay_seconds * 2^(attempt_number-1) (exponential backoff). Add jitter: multiply by a random factor between 0.8 and 1.2 to prevent retry storms. On final failure (attempt_number == max_retries): mark as PERMANENTLY_FAILED, send an alert. Retry-safe jobs: job handlers must be idempotent — retrying a failed job should produce the same result as the first successful execution. Non-idempotent operations (charge a credit card, send an email): use an idempotency key in the job params. The handler checks if this key was already processed before acting.
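The backoff-with-jitter formula above reduces to a small helper. The 0.8–1.2 jitter range is from the text; the one-hour cap is the illustrative maximum mentioned later in this guide.

```python
import random

def retry_delay(base_delay, attempt, cap=3600, jitter=(0.8, 1.2)):
    """Exponential backoff: base_delay * 2^(attempt-1), multiplied by a
    random jitter factor to de-synchronize retries, capped so a job
    always retries within `cap` seconds."""
    delay = base_delay * (2 ** (attempt - 1))
    delay *= random.uniform(*jitter)
    return min(delay, cap)
```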
Job Dependencies and Workflows
Some jobs must run in sequence (Job B can only start after Job A succeeds). Model as a DAG (Directed Acyclic Graph) of jobs. On completion of Job A: check if any jobs have Job A as a prerequisite. If all prerequisites for Job B are COMPLETED: enqueue Job B. Store dependencies in a job_dependencies table: (downstream_job_id, upstream_job_id). This is the basis of workflow orchestrators like Apache Airflow (DAG-based ETL pipelines), Temporal (durable workflow execution), and Prefect. For complex workflows: track the state of each step, support human approval gates (pause workflow until a user approves), and support conditional branching (run Job C only if Job A result meets a condition).
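The event-driven dependency resolution described above can be sketched like this, with job_dependencies rows as (downstream_job_id, upstream_job_id) pairs and per-job statuses held in a dict.

```python
def on_job_completed(completed_job_id, dependencies, statuses, enqueue):
    """When a job completes, enqueue every downstream job whose upstream
    jobs are now all COMPLETED.

    dependencies: list of (downstream_job_id, upstream_job_id) pairs,
    mirroring the job_dependencies table.
    statuses: dict mapping job_id -> status string.
    enqueue: callable invoked with each newly runnable job_id.
    """
    statuses[completed_job_id] = "COMPLETED"
    downstream = {d for d, u in dependencies if u == completed_job_id}
    for job_id in downstream:
        upstreams = [u for d, u in dependencies if d == job_id]
        if all(statuses.get(u) == "COMPLETED" for u in upstreams):
            enqueue(job_id)
```

No polling is needed: resolution happens only on completion events. Cycle detection (a DFS at definition time, as the text notes) is a separate validation step, omitted here.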
Monitoring and Observability
Key metrics: job success rate per job_id (alert if < 99%), execution duration p50/p95/p99 (alert on regression), queue depth per priority tier (alert if growing — workers can't keep up), worker utilization (% of workers busy). Dashboard: list of all job definitions with last execution status, next scheduled time, and 24-hour success rate. For long-running jobs: emit progress events (job handler calls report_progress(50%) midway). Store progress on the JobExecution record. Display a progress bar in the dashboard. Log all execution lifecycle events (QUEUED, RUNNING, COMPLETED/FAILED) with timestamps for audit and debugging.
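The per-job success-rate metric might be computed like this — a sketch over in-memory execution records, assuming terminal executions record a completed_at timestamp (in production this would be an aggregation query or a metrics-system rollup).

```python
def success_rate(executions, job_id, window_start):
    """Fraction of terminal executions for job_id since window_start that
    COMPLETED. Returns None when there is no data in the window."""
    terminal = [e for e in executions
                if e["job_id"] == job_id
                and e.get("completed_at") is not None
                and e["completed_at"] >= window_start
                and e["status"] in ("COMPLETED", "FAILED", "TIMED_OUT")]
    if not terminal:
        return None
    succeeded = sum(1 for e in terminal if e["status"] == "COMPLETED")
    return succeeded / len(terminal)
```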
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How do you prevent duplicate job execution when running multiple scheduler instances?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Multiple scheduler instances are needed for high availability. But they must not both trigger the same cron job. Two approaches: (1) Distributed lock per job: before creating a JobExecution, try to acquire a Redis lock: SET job_trigger:{job_id}:{scheduled_time} instance_id NX EX 60. Only the instance that wins creates the execution. The EX 60 ensures the lock expires if the winner crashes before releasing it. (2) Leader election: elect one scheduler as the leader (using Redis, etcd, or Zookeeper). Only the leader triggers jobs. On leader failure, a new leader is elected (typically within 5-15 seconds). Follower instances monitor but do not trigger. Approach 1 is more fault-tolerant (any instance can trigger); approach 2 is simpler to reason about. Both are used in production (Airflow uses database-level locking; Quartz uses JDBC clustering)."
}
},
{
"@type": "Question",
"name": "How do you handle job timeout and detect crashed workers?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Timeout enforcement requires an external watchdog — the job itself cannot reliably detect its own timeout (it might be in an infinite loop). Worker heartbeat: every worker sends a heartbeat every 30 seconds: UPDATE workers SET last_heartbeat=NOW() WHERE worker_id=X. A monitor process runs every 60 seconds: SELECT jobs WHERE status=RUNNING AND worker.last_heartbeat < NOW()-90s. These workers are presumed dead. Actions: mark the worker as OFFLINE, mark its running jobs as status=ORPHANED or re-queue for retry (if max_retries not exceeded). Job timeout: when a worker starts a job, it sets a deadline = NOW() + timeout_seconds on the JobExecution. Another watchdog thread in the worker checks each running job against its deadline. On timeout: send SIGTERM (graceful), wait 10 seconds, then SIGKILL. Update status=TIMED_OUT. Re-queue if retries remain."
}
},
{
"@type": "Question",
"name": "How does exponential backoff with jitter prevent retry storms?",
"acceptedAnswer": {
"@type": "Answer",
"text": "When many jobs fail simultaneously (e.g., a downstream service goes down), they all retry at the same time — causing another wave of failures. Exponential backoff: retry_delay = base_delay * 2^(attempt-1). Attempt 1: 10s, attempt 2: 20s, attempt 3: 40s, attempt 4: 80s. This spreads retries over time as the delay grows. But without jitter: all failed jobs that started at the same time still retry at the same exponential intervals — synchronized. Full jitter: retry_delay = random(0, base_delay * 2^(attempt-1)). Now retries are spread randomly across the interval. Decorrelated jitter (AWS recommendation): delay = min(cap, random(base, prev_delay * 3)). This avoids both the synchronization problem and the slow convergence of uniform random. Always cap the maximum retry delay (e.g., at 1 hour) to ensure jobs eventually retry."
}
},
{
"@type": "Question",
"name": "How do you design job dependencies to build a workflow DAG?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Store dependencies in a job_dependencies table: (job_id, depends_on_job_id). A job is eligible to run only after all its dependencies are in COMPLETED status. Execution flow: when a job completes, query which downstream jobs have this job as a dependency. For each downstream job: check if ALL its dependencies are now COMPLETED. If yes: enqueue the downstream job. This is event-driven dependency resolution — no polling. Cycle detection: validate the DAG on definition (DFS to detect cycles). If a cycle is introduced: reject the dependency update. Failure propagation: if a job fails, you can either (1) propagate failure to all downstream jobs (mark them SKIPPED_DUE_TO_UPSTREAM_FAILURE) or (2) allow downstream jobs to proceed with partial inputs (if they handle missing dependencies gracefully). Store the decision as a policy on the downstream job."
}
},
{
"@type": "Question",
"name": "How do you scale a job scheduler to handle millions of jobs per day?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Scheduling bottleneck: the scheduler loop that queries due jobs every N seconds. At 1M jobs/day: ~12 jobs/second — trivial. At 100M jobs/day: ~1150 jobs/second. Solutions: (1) Index next_execution_at with a partial index on enabled=true. The \"find due jobs\" query becomes O(k) where k = jobs due in this interval, not O(total_jobs). (2) Partition jobs by time bucket: pre-compute next_execution_at, store in a sorted set in Redis (ZADD jobs_due {next_ts} {job_id}). The scheduler polls ZRANGEBYSCORE jobs_due 0 {now} to find due jobs in O(k + log n). (3) Separate scheduler and executor: the scheduler only does coordination (queuing), not execution. Workers scale independently based on queue depth. (4) Priority queues: separate Kafka topics for high-priority and low-priority jobs. High-priority workers are always available; low-priority workers scale with demand."
}
}
]
}
Asked at: Atlassian, Cloudflare, Databricks, Uber.