Core Entities
JobDefinition: job_id, name, type (CRON, INTERVAL, ONE_TIME), schedule (cron expression or interval_seconds), handler (class name or function reference), params (JSON), max_retries, retry_delay_seconds, timeout_seconds, enabled, created_by, created_at.
JobExecution: execution_id, job_id, status (QUEUED, RUNNING, COMPLETED, FAILED, TIMED_OUT, CANCELLED), scheduled_at, started_at, completed_at, worker_id, attempt_number, error_message, result (JSON).
Worker: worker_id, hostname, status (ACTIVE, DRAINING, OFFLINE), last_heartbeat, current_job_id.
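The entities above can be sketched as Python dataclasses. The field names come from the text; the concrete types and defaults are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional

class JobType(Enum):
    CRON = "CRON"
    INTERVAL = "INTERVAL"
    ONE_TIME = "ONE_TIME"

class ExecutionStatus(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED_OUT"
    CANCELLED = "CANCELLED"

@dataclass
class JobDefinition:
    job_id: str
    name: str
    type: JobType
    schedule: str                  # cron expression, or interval_seconds for INTERVAL jobs
    handler: str                   # class name or function reference
    params: dict = field(default_factory=dict)
    max_retries: int = 3           # default values are assumptions
    retry_delay_seconds: int = 60
    timeout_seconds: int = 3600
    enabled: bool = True
    created_by: str = ""
    created_at: Optional[datetime] = None

@dataclass
class JobExecution:
    execution_id: str
    job_id: str
    status: ExecutionStatus
    scheduled_at: datetime
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    worker_id: Optional[str] = None
    attempt_number: int = 1
    error_message: Optional[str] = None
    result: Optional[dict] = None
```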
Cron Scheduling Engine
The scheduler runs a loop that determines which jobs are due for execution. Cron expression parsing: parse a 5-field cron expression (minute, hour, day, month, weekday) to determine the next execution time for each job. Libraries: croniter (Python), node-cron (JavaScript), Quartz (Java). On each tick (every 10 seconds): find all enabled jobs where next_execution_at <= NOW(). For each due job: create a JobExecution record with status=QUEUED and enqueue to a message queue (Redis Sorted Set, Kafka, or SQS). Update the job's next_execution_at to the next scheduled time.
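The tick loop above can be sketched as follows — a minimal in-memory version, assuming a list-backed queue and an INTERVAL job for simplicity. A real deployment would compute next_execution_at for cron jobs with croniter and enqueue to Redis, Kafka, or SQS instead.

```python
from datetime import datetime, timedelta

def scheduler_tick(jobs, queue, now):
    """One scheduler tick: find due jobs, enqueue an execution, advance the schedule."""
    for job in jobs:
        if not job["enabled"] or job["next_execution_at"] > now:
            continue
        # Create a QUEUED execution record and enqueue it.
        queue.append({
            "job_id": job["job_id"],
            "status": "QUEUED",
            "scheduled_at": job["next_execution_at"],
        })
        # Advance to the next slot; for a cron job this would be
        # croniter(job["schedule"], now).get_next(datetime) instead.
        job["next_execution_at"] = now + timedelta(seconds=job["interval_seconds"])

jobs = [{"job_id": "j1", "enabled": True, "interval_seconds": 300,
         "next_execution_at": datetime(2024, 1, 1, 0, 0, 0)}]
queue = []
scheduler_tick(jobs, queue, now=datetime(2024, 1, 1, 0, 0, 5))
```

After this tick, the queue holds one QUEUED execution for j1 and the job's next slot is 300 seconds later; re-running the tick before that slot enqueues nothing.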
Multiple scheduler instances: to avoid duplicate job creation when running multiple scheduler instances for high availability, use a distributed lock per job. Before creating a JobExecution: SET job_lock:{job_id} scheduler_instance_id NX EX 60 (plain SETNX cannot attach an expiry; the SET ... NX EX form acquires the lock and sets its TTL in one atomic command). Only the instance that wins the lock creates the execution — Redis guarantees exactly one caller succeeds. Alternative: use a leader election pattern (ZooKeeper, etcd) so that only the current leader schedules jobs. This is simpler to reason about, but failover leaves a scheduling gap until a new leader is elected.
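The winner-takes-all semantics can be illustrated with an in-memory stand-in for Redis. FakeLockStore is purely illustrative; with the real redis-py client, the acquire step is r.set(f"job_lock:{job_id}", instance_id, nx=True, ex=60).

```python
import time

class FakeLockStore:
    """In-memory stand-in for Redis, enough to show set-if-absent lock semantics."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_nx_ex(self, key, value, ttl_seconds):
        """Atomic set-if-absent with TTL, mirroring SET ... NX EX."""
        entry = self._data.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return False  # lock already held and not yet expired
        self._data[key] = (value, now + ttl_seconds)
        return True

def try_acquire_schedule_lock(store, job_id, instance_id):
    """Only the scheduler instance that wins the lock creates the JobExecution."""
    return store.set_nx_ex(f"job_lock:{job_id}", instance_id, ttl_seconds=60)

store = FakeLockStore()
won_a = try_acquire_schedule_lock(store, "j1", "scheduler-a")
won_b = try_acquire_schedule_lock(store, "j1", "scheduler-b")  # loses the race
```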
Worker Execution Model
Workers poll the job queue (or receive a push via queue subscription). On dequeue: update JobExecution status=RUNNING, set worker_id=self, started_at=NOW(). Execute the job handler. On success: status=COMPLETED, completed_at=NOW(), result=output. On exception: status=FAILED, error_message=exception. Timeout enforcement: each worker runs a watchdog thread that checks whether the running job has exceeded its timeout_seconds. On timeout: kill the job and set status=TIMED_OUT — SIGTERM works if the job runs in a child process; threads cannot receive signals, so in-process jobs need a cooperative cancellation flag. Workers send heartbeats every 30 seconds: UPDATE workers SET last_heartbeat=NOW(). A monitor process looks for workers whose last_heartbeat is older than 90 seconds and declares them dead. Any RUNNING jobs on dead workers are re-queued (set status=QUEUED, clear worker_id).
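The dead-worker sweep can be sketched like this, assuming the workers and executions tables are loaded as lists of dicts (in production these would be UPDATE statements against the database):

```python
from datetime import datetime, timedelta

HEARTBEAT_TIMEOUT = timedelta(seconds=90)

def reap_dead_workers(workers, executions, now):
    """Declare silent workers dead and re-queue their RUNNING jobs."""
    dead = [w for w in workers
            if w["status"] == "ACTIVE"
            and now - w["last_heartbeat"] > HEARTBEAT_TIMEOUT]
    for worker in dead:
        worker["status"] = "OFFLINE"
        for ex in executions:
            if ex["worker_id"] == worker["worker_id"] and ex["status"] == "RUNNING":
                ex["status"] = "QUEUED"   # back to the queue for another worker
                ex["worker_id"] = None
    return [w["worker_id"] for w in dead]
```

A monitor process would call this on a timer; making the re-queue and the OFFLINE flip one transaction avoids a job being picked up twice during the sweep.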
Retry Logic with Exponential Backoff
On job failure: if attempt_number < max_retries, create a new JobExecution with attempt_number+1 and scheduled_at = NOW() + retry_delay_seconds * 2^(attempt_number-1) (exponential backoff). Add jitter: multiply by a random factor between 0.8 and 1.2 to prevent retry storms. On final failure (attempt_number == max_retries): mark the job PERMANENTLY_FAILED and send an alert. Retry-safe jobs: job handlers must be idempotent — running a job twice must produce the same outcome as running it once. For non-idempotent operations (charging a credit card, sending an email): include an idempotency key in the job params, and have the handler check whether that key was already processed before acting.
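The backoff formula with jitter, as a small helper. The rng parameter is an assumption added so the jitter can be pinned down in tests; production code would just use random.random.

```python
import random

def retry_delay(base_delay_seconds, attempt_number, rng=random.random):
    """Exponential backoff with +/-20% jitter: base * 2^(attempt-1) * U[0.8, 1.2)."""
    delay = base_delay_seconds * 2 ** (attempt_number - 1)
    jitter = 0.8 + 0.4 * rng()  # uniform in [0.8, 1.2)
    return delay * jitter
```

With base_delay_seconds=60 and no jitter, attempts 1, 2, 3 wait roughly 60 s, 120 s, and 240 s; jitter spreads retries of simultaneously failed jobs so they do not all hit the dependency at once.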
Job Dependencies and Workflows
Some jobs must run in sequence (Job B can only start after Job A succeeds). Model as a DAG (Directed Acyclic Graph) of jobs. On completion of Job A: check if any jobs have Job A as a prerequisite. If all prerequisites for Job B are COMPLETED: enqueue Job B. Store dependencies in a job_dependencies table: (downstream_job_id, upstream_job_id). This is the basis of workflow orchestrators like Apache Airflow (DAG-based ETL pipelines), Temporal (durable workflow execution), and Prefect. For complex workflows: track the state of each step, support human approval gates (pause workflow until a user approves), and support conditional branching (run Job C only if Job A result meets a condition).
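The prerequisite check run on job completion can be sketched as follows, assuming the job_dependencies rows are loaded as (downstream_job_id, upstream_job_id) pairs and job statuses as a dict:

```python
def ready_downstream_jobs(completed_job_id, dependencies, statuses):
    """Return downstream jobs whose prerequisites are now all COMPLETED."""
    # Jobs that list the completed job as a prerequisite.
    downstream = {d for (d, u) in dependencies if u == completed_job_id}
    ready = []
    for job_id in downstream:
        prereqs = [u for (d, u) in dependencies if d == job_id]
        if all(statuses.get(u) == "COMPLETED" for u in prereqs):
            ready.append(job_id)
    return ready

# Example DAG: B depends on A; C depends on both A and B.
deps = [("B", "A"), ("C", "A"), ("C", "B")]
statuses = {"A": "COMPLETED", "B": "RUNNING"}
```

When A completes here, only B becomes ready; C still waits on B. The scheduler enqueues each returned job and repeats the check as those complete, walking the DAG forward.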
Monitoring and Observability
Key metrics: job success rate per job_id (alert if < 99%), execution duration p50/p95/p99 (alert on regression), queue depth per priority tier (alert if growing — workers can't keep up), worker utilization (% of workers busy). Dashboard: list all job definitions with last execution status, next scheduled time, and 24-hour success rate. For long-running jobs: emit progress events (the job handler calls report_progress(50) at the halfway point). Store progress on the JobExecution record and display a progress bar in the dashboard. Log all execution lifecycle events (QUEUED, RUNNING, COMPLETED/FAILED) with timestamps for audit and debugging.
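Progress reporting can be sketched like this; ExecutionStore, report_progress, and batch_handler are hypothetical names for illustration, not a fixed API.

```python
class ExecutionStore:
    """Minimal stand-in for the JobExecution table's progress column."""
    def __init__(self):
        self.progress = {}  # execution_id -> percent complete (0-100)

    def report_progress(self, execution_id, percent):
        # Clamp to [0, 100] so a buggy handler can't break the progress bar.
        self.progress[execution_id] = max(0, min(100, percent))

def batch_handler(store, execution_id, items):
    """Example long-running handler that reports progress after each item."""
    for done, item in enumerate(items, start=1):
        # ... process item ...
        store.report_progress(execution_id, int(100 * done / len(items)))
```

The dashboard then renders progress[execution_id] as a bar; emitting progress only every N items (or every few seconds) keeps write volume down for large batches.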