Core Entities
JobDefinition: job_id, name, type (CRON, INTERVAL, ONE_TIME), schedule (cron expression or interval_seconds), handler (class name or function reference), params (JSON), max_retries, retry_delay_seconds, timeout_seconds, enabled, created_by, created_at.
JobExecution: execution_id, job_id, status (QUEUED, RUNNING, COMPLETED, FAILED, TIMED_OUT, CANCELLED), scheduled_at, started_at, completed_at, worker_id, attempt_number, error_message, result (JSON).
Worker: worker_id, hostname, status (ACTIVE, DRAINING, OFFLINE), last_heartbeat, current_job_id.
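The entities above can be sketched as Python dataclasses. This is a minimal illustration, not a schema definition: field names follow the text, and defaults such as max_retries=3 are arbitrary placeholders.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class JobType(Enum):
    CRON = "CRON"
    INTERVAL = "INTERVAL"
    ONE_TIME = "ONE_TIME"

class ExecutionStatus(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    CANCELLED = "CANCELLED"

@dataclass
class JobDefinition:
    job_id: str
    name: str
    type: JobType
    schedule: str                 # cron expression, or interval_seconds as text
    handler: str                  # dotted path to the handler function
    params: dict
    max_retries: int = 3          # illustrative defaults
    retry_delay_seconds: int = 10
    timeout_seconds: int = 300
    enabled: bool = True

@dataclass
class JobExecution:
    execution_id: str
    job_id: str
    status: ExecutionStatus = ExecutionStatus.QUEUED
    scheduled_at: Optional[datetime] = None
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    worker_id: Optional[str] = None
    attempt_number: int = 1
    error_message: Optional[str] = None
    result: Optional[dict] = None
```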
Cron Scheduling Engine
The scheduler runs a loop that determines which jobs are due for execution. Cron expression parsing: parse a 5-field cron expression (minute, hour, day, month, weekday) to determine the next execution time for each job. Libraries: croniter (Python), node-cron (JavaScript), Quartz (Java). On each tick (every 10 seconds): find all enabled jobs where next_execution_at <= NOW(). For each due job: create a JobExecution record with status=QUEUED and enqueue to a message queue (Redis Sorted Set, Kafka, or SQS). Update the job's next_execution_at to the next scheduled time.
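The tick described above can be sketched as follows for INTERVAL jobs, with jobs and the queue as plain in-memory structures for illustration. For CRON jobs, the next_execution_at update would come from a cron library such as croniter rather than a fixed delta.

```python
from datetime import datetime, timedelta

def tick(jobs, queue, now):
    """One scheduler tick: enqueue all due jobs and advance next_execution_at.

    `jobs` is a list of dicts with keys: job_id, enabled, interval_seconds,
    next_execution_at. `queue` stands in for the message queue.
    """
    for job in jobs:
        if not job["enabled"]:
            continue
        # Catch up on any runs missed while the scheduler was down.
        while job["next_execution_at"] <= now:
            queue.append({"job_id": job["job_id"],
                          "scheduled_at": job["next_execution_at"],
                          "status": "QUEUED"})
            job["next_execution_at"] += timedelta(seconds=job["interval_seconds"])
```

The while loop re-enqueues every missed interval; that is a design choice — some schedulers instead skip missed runs and jump straight to the next future time.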
Multiple scheduler instances: to avoid duplicate job creation when running multiple scheduler instances for high availability, use a distributed lock per job. Before creating a JobExecution: SET job_lock:{job_id} scheduler_instance_id NX EX 60 (note: the legacy SETNX command cannot set an expiry; SET with the NX and EX options sets the key and its TTL atomically). Only the instance that wins the lock creates the execution — exactly one instance proceeds. Alternative: use a leader election pattern (Zookeeper, etcd) — only the current leader schedules jobs. Simpler to reason about, but failover adds latency.
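A sketch of the per-job lock logic, using a hypothetical in-memory stand-in for Redis SET NX EX so it is testable without a server. With the real redis-py client, the equivalent call is r.set(key, instance_id, nx=True, ex=60), which returns True only for the winner.

```python
import time

class InMemoryLockStore:
    """Illustrative stand-in for Redis SET ... NX EX semantics."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl_seconds, now=None):
        now = time.time() if now is None else now
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False                    # lock held and not expired
        self._data[key] = (value, now + ttl_seconds)
        return True

def try_schedule(store, job_id, scheduled_ts, instance_id):
    """Return True iff this scheduler instance won the lock and should
    create the JobExecution. Keyed on (job_id, scheduled time) so each
    trigger time is claimed at most once."""
    return store.set_nx_ex(f"job_lock:{job_id}:{scheduled_ts}", instance_id, 60)
```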
Worker Execution Model
Workers poll the job queue (or receive push via queue subscription). On dequeue: update JobExecution status=RUNNING, set worker_id=self, started_at=NOW(). Execute the job handler. On success: status=COMPLETED, completed_at=NOW(), result=output. On exception: status=FAILED, error_message=exception. Timeout enforcement: each worker has a watchdog thread that checks whether the running job has exceeded its timeout_seconds. On timeout: terminate the job and set status=TIMED_OUT (send SIGTERM if the handler runs as a subprocess; threads cannot be forcibly killed, which is one reason to run handlers in subprocesses). Workers send heartbeats every 30 seconds: UPDATE workers SET last_heartbeat=NOW(). A monitor process checks for workers whose last_heartbeat is older than 90 seconds — those workers are declared dead. Any RUNNING jobs on dead workers are re-queued (set status=QUEUED, clear worker_id).
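The dead-worker sweep can be sketched as follows, with in-memory dicts standing in for the workers and job_executions tables. The 90-second threshold comes from the text (three missed 30-second heartbeats).

```python
from datetime import timedelta

HEARTBEAT_TIMEOUT = timedelta(seconds=90)

def reap_dead_workers(workers, executions, now):
    """Monitor pass: mark workers with stale heartbeats OFFLINE and
    re-queue their RUNNING jobs. Returns the number of re-queued jobs."""
    dead = {w["worker_id"] for w in workers
            if now - w["last_heartbeat"] > HEARTBEAT_TIMEOUT}
    for w in workers:
        if w["worker_id"] in dead:
            w["status"] = "OFFLINE"
    requeued = 0
    for ex in executions:
        if ex["status"] == "RUNNING" and ex["worker_id"] in dead:
            ex["status"] = "QUEUED"     # another worker will pick it up
            ex["worker_id"] = None
            requeued += 1
    return requeued
```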
Retry Logic with Exponential Backoff
On job failure: if attempt_number < max_retries: create a new JobExecution with attempt_number+1, scheduled_at = NOW() + retry_delay_seconds * 2^(attempt_number-1) (exponential backoff). Add jitter: multiply by a random factor between 0.8 and 1.2 to prevent retry storms. On final failure (attempt_number == max_retries): mark as PERMANENTLY_FAILED, send an alert. Retry-safe jobs: job handlers must be idempotent — retrying a failed job should produce the same result as the first successful execution. Non-idempotent operations (charge a credit card, send an email): use an idempotency key in the job params. The handler checks if this key was already processed before acting.
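The backoff-with-jitter formula above reduces to a small helper. The 0.8–1.2 jitter range is from the text; the one-hour cap is the illustrative maximum mentioned later in this guide.

```python
import random

def retry_delay(base_delay, attempt, cap=3600, jitter=(0.8, 1.2)):
    """Exponential backoff: base_delay * 2^(attempt-1), multiplied by a
    random jitter factor to de-synchronize retries, capped so a job
    always retries within `cap` seconds."""
    delay = base_delay * (2 ** (attempt - 1))
    delay *= random.uniform(*jitter)
    return min(delay, cap)
```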
Job Dependencies and Workflows
Some jobs must run in sequence (Job B can only start after Job A succeeds). Model as a DAG (Directed Acyclic Graph) of jobs. On completion of Job A: check if any jobs have Job A as a prerequisite. If all prerequisites for Job B are COMPLETED: enqueue Job B. Store dependencies in a job_dependencies table: (downstream_job_id, upstream_job_id). This is the basis of workflow orchestrators like Apache Airflow (DAG-based ETL pipelines), Temporal (durable workflow execution), and Prefect. For complex workflows: track the state of each step, support human approval gates (pause workflow until a user approves), and support conditional branching (run Job C only if Job A result meets a condition).
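The event-driven dependency resolution described above can be sketched like this, with job_dependencies rows as (downstream_job_id, upstream_job_id) pairs and per-job statuses held in a dict.

```python
def on_job_completed(completed_job_id, dependencies, statuses, enqueue):
    """When a job completes, enqueue every downstream job whose upstream
    jobs are now all COMPLETED.

    dependencies: list of (downstream_job_id, upstream_job_id) pairs,
    mirroring the job_dependencies table.
    statuses: dict mapping job_id -> status string.
    enqueue: callable invoked with each newly runnable job_id.
    """
    statuses[completed_job_id] = "COMPLETED"
    downstream = {d for d, u in dependencies if u == completed_job_id}
    for job_id in downstream:
        upstreams = [u for d, u in dependencies if d == job_id]
        if all(statuses.get(u) == "COMPLETED" for u in upstreams):
            enqueue(job_id)
```

No polling is needed: resolution happens only on completion events. Cycle detection (a DFS at definition time, as the text notes) is a separate validation step, omitted here.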
Monitoring and Observability
Key metrics: job success rate per job_id (alert if < 99%), execution duration p50/p95/p99 (alert on regression), queue depth per priority tier (alert if growing — workers can't keep up), worker utilization (% of workers busy). Dashboard: list of all job definitions with last execution status, next scheduled time, and 24-hour success rate. For long-running jobs: emit progress events (job handler calls report_progress(50%) midway). Store progress on the JobExecution record. Display a progress bar in the dashboard. Log all execution lifecycle events (QUEUED, RUNNING, COMPLETED/FAILED) with timestamps for audit and debugging.
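The per-job success-rate metric might be computed like this — a sketch over in-memory execution records, assuming terminal executions record a completed_at timestamp (in production this would be an aggregation query or a metrics-system rollup).

```python
def success_rate(executions, job_id, window_start):
    """Fraction of terminal executions for job_id since window_start that
    COMPLETED. Returns None when there is no data in the window."""
    terminal = [e for e in executions
                if e["job_id"] == job_id
                and e.get("completed_at") is not None
                and e["completed_at"] >= window_start
                and e["status"] in ("COMPLETED", "FAILED", "TIMED_OUT")]
    if not terminal:
        return None
    succeeded = sum(1 for e in terminal if e["status"] == "COMPLETED")
    return succeeded / len(terminal)
```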
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How do you prevent duplicate job execution when running multiple scheduler instances?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Multiple scheduler instances are needed for high availability. But they must not both trigger the same cron job. Two approaches: (1) Distributed lock per job: before creating a JobExecution, try to acquire a Redis lock: SET job_trigger:{job_id}:{scheduled_time} instance_id NX EX 60. Only the instance that wins creates the execution. The EX 60 ensures the lock expires if the winner crashes before releasing it. (2) Leader election: elect one scheduler as the leader (using Redis, etcd, or Zookeeper). Only the leader triggers jobs. On leader failure, a new leader is elected (typically within 5-15 seconds). Follower instances monitor but do not trigger. Approach 1 is more fault-tolerant (any instance can trigger); approach 2 is simpler to reason about. Both are used in production (Airflow uses database-level locking; Quartz uses JDBC clustering)."
}
},
{
"@type": "Question",
"name": "How do you handle job timeout and detect crashed workers?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Timeout enforcement requires an external watchdog — the job itself cannot reliably detect its own timeout (it might be in an infinite loop). Worker heartbeat: every worker sends a heartbeat every 30 seconds: UPDATE workers SET last_heartbeat=NOW() WHERE worker_id=X. A monitor process runs every 60 seconds: SELECT jobs WHERE status=RUNNING AND worker.last_heartbeat < NOW()-90s. These workers are presumed dead. Actions: mark the worker as OFFLINE, mark its running jobs as status=ORPHANED or re-queue for retry (if max_retries not exceeded). Job timeout: when a worker starts a job, it sets a deadline = NOW() + timeout_seconds on the JobExecution. Another watchdog thread in the worker checks each running job against its deadline. On timeout: send SIGTERM (graceful), wait 10 seconds, then SIGKILL. Update status=TIMED_OUT. Re-queue if retries remain."
}
},
{
"@type": "Question",
"name": "How does exponential backoff with jitter prevent retry storms?",
"acceptedAnswer": {
"@type": "Answer",
"text": "When many jobs fail simultaneously (e.g., a downstream service goes down), they all retry at the same time — causing another wave of failures. Exponential backoff: retry_delay = base_delay * 2^(attempt-1). Attempt 1: 10s, attempt 2: 20s, attempt 3: 40s, attempt 4: 80s. This spreads retries over time as the delay grows. But without jitter: all failed jobs that started at the same time still retry at the same exponential intervals — synchronized. Full jitter: retry_delay = random(0, base_delay * 2^(attempt-1)). Now retries are spread randomly across the interval. Decorrelated jitter (AWS recommendation): delay = min(cap, random(base, prev_delay * 3)). This avoids both the synchronization problem and the slow convergence of uniform random. Always cap the maximum retry delay (e.g., at 1 hour) to ensure jobs eventually retry."
}
},
{
"@type": "Question",
"name": "How do you design job dependencies to build a workflow DAG?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Store dependencies in a job_dependencies table: (job_id, depends_on_job_id). A job is eligible to run only after all its dependencies are in COMPLETED status. Execution flow: when a job completes, query which downstream jobs have this job as a dependency. For each downstream job: check if ALL its dependencies are now COMPLETED. If yes: enqueue the downstream job. This is event-driven dependency resolution — no polling. Cycle detection: validate the DAG on definition (DFS to detect cycles). If a cycle is introduced: reject the dependency update. Failure propagation: if a job fails, you can either (1) propagate failure to all downstream jobs (mark them SKIPPED_DUE_TO_UPSTREAM_FAILURE) or (2) allow downstream jobs to proceed with partial inputs (if they handle missing dependencies gracefully). Store the decision as a policy on the downstream job."
}
},
{
"@type": "Question",
"name": "How do you scale a job scheduler to handle millions of jobs per day?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Scheduling bottleneck: the scheduler loop that queries due jobs every N seconds. At 1M jobs/day: ~12 jobs/second — trivial. At 100M jobs/day: ~1150 jobs/second. Solutions: (1) Index next_execution_at with a partial index on enabled=true. The \"find due jobs\" query becomes O(k) where k = jobs due in this interval, not O(total_jobs). (2) Partition jobs by time bucket: pre-compute next_execution_at, store in a sorted set in Redis (ZADD jobs_due {next_ts} {job_id}). The scheduler polls ZRANGEBYSCORE jobs_due 0 {now} to find due jobs in O(k + log n). (3) Separate scheduler and executor: the scheduler only does coordination (queuing), not execution. Workers scale independently based on queue depth. (4) Priority queues: separate Kafka topics for high-priority and low-priority jobs. High-priority workers are always available; low-priority workers scale with demand."
}
}
]
}
Asked at: Atlassian, Cloudflare, Databricks, Uber.