What is a Workflow Engine?
A workflow engine executes multi-step processes where steps have dependencies, can run in parallel, and must handle failures. Examples: Airflow (data pipelines), Temporal (business workflows), AWS Step Functions, Cadence. Use cases: order processing (payment → inventory → shipping → notification), data pipelines (extract → transform → load → aggregate), CI/CD pipelines (build → test → deploy), document approval workflows. Core requirements: define workflows as code (or configuration), execute steps in dependency order, retry failed steps, resume from failure without re-running completed steps, track execution history.
Workflow Definition as a DAG
A workflow is a Directed Acyclic Graph (DAG): nodes = tasks, edges = dependencies. Example order processing workflow:
workflow = DAG("order_processing")
validate_payment = Task("validate_payment", fn=check_payment)
reserve_inventory = Task("reserve_inventory", fn=reserve_items)
charge_payment = Task("charge_payment", fn=run_charge,
                      depends_on=[validate_payment])
ship_order = Task("ship_order", fn=create_shipment,
                  depends_on=[reserve_inventory, charge_payment])
send_confirmation = Task("send_confirmation", fn=notify_customer,
                         depends_on=[ship_order])
workflow.add_tasks([validate_payment, reserve_inventory,
                    charge_payment, ship_order, send_confirmation])
Parallel execution: validate_payment and reserve_inventory have no dependencies — they run concurrently. charge_payment waits for validate_payment. ship_order waits for both charge_payment and reserve_inventory. Topological sort determines the execution order.
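The batching described above can be sketched with Python's standard-library graphlib; the task names match the order-processing example, and the dependency map is a plain dict rather than the illustrative Task/DAG objects:

```python
from graphlib import TopologicalSorter

# Dependency map for the order-processing DAG above: task -> predecessors.
deps = {
    "validate_payment": [],
    "reserve_inventory": [],
    "charge_payment": ["validate_payment"],
    "ship_order": ["reserve_inventory", "charge_payment"],
    "send_confirmation": ["ship_order"],
}

ts = TopologicalSorter(deps)
ts.prepare()

batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # all tasks whose dependencies have finished
    batches.append(ready)           # each batch could be dispatched concurrently
    for task in ready:
        ts.done(task)

print(batches)
# [['reserve_inventory', 'validate_payment'], ['charge_payment'],
#  ['ship_order'], ['send_confirmation']]
```

The first batch contains both independent tasks, confirming they can run in parallel; every later batch waits only for its own predecessors.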
Execution State Persistence
Each workflow run (instance) is a separate entity. Schema: WorkflowRun: run_id (UUID), workflow_id, status (RUNNING, COMPLETED, FAILED, CANCELLED), created_at, completed_at, input (JSON). TaskRun: task_run_id, run_id, task_id, status (PENDING, RUNNING, SUCCEEDED, FAILED, SKIPPED), attempt_number, started_at, completed_at, output (JSON), error_message.
Execution engine: a scheduler polls for PENDING tasks whose dependencies are all SUCCEEDED. It claims a task atomically (UPDATE status=RUNNING WHERE status=PENDING, or SELECT FOR UPDATE) to prevent concurrent claims. A worker executes the task and reports the result (SUCCEEDED or FAILED) to the scheduler, which re-evaluates the run's pending tasks. If all tasks are SUCCEEDED, mark the WorkflowRun as COMPLETED. If a task fails permanently, mark its dependent tasks as SKIPPED and the run as FAILED.
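The atomic claim can be sketched with an in-memory SQLite table; the table and column names follow the schema above, while the single demo row is illustrative. The claim is safe because the UPDATE only matches rows still in status='PENDING', so two schedulers racing for the same task cannot both see an affected row:

```python
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE task_run (
    task_run_id    TEXT PRIMARY KEY,
    run_id         TEXT,
    task_id        TEXT,
    status         TEXT,
    attempt_number INTEGER)""")
db.execute("INSERT INTO task_run VALUES (?, 'run-1', 'charge_payment', 'PENDING', 0)",
           (str(uuid.uuid4()),))
db.commit()

def claim(task_id: str) -> bool:
    """Atomically move one PENDING task to RUNNING; False if already claimed."""
    cur = db.execute("UPDATE task_run SET status = 'RUNNING' "
                     "WHERE task_id = ? AND status = 'PENDING'", (task_id,))
    db.commit()
    return cur.rowcount == 1

first = claim("charge_payment")   # this scheduler wins the claim
second = claim("charge_payment")  # the row is now RUNNING, so this claim is rejected
print(first, second)
```

In a production engine the same conditional UPDATE runs against a shared database (or a SELECT FOR UPDATE transaction), but the invariant is identical: exactly one claimer observes an affected row.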
Retry and Fault Tolerance
Task-level retry: each task has a max_retries configuration (e.g., 3 retries). On failure: increment TaskRun.attempt_number and re-queue the task, typically with exponential backoff between attempts; once attempt_number exceeds max_retries, mark the task FAILED. Timeout handling: the scheduler marks tasks that exceed their execution timeout (no result reported within the timeout) as FAILED and re-queues them, which handles worker crashes. Checkpointing: for tasks with complex computation, support saving intermediate state (checkpoint) and resuming from the last checkpoint. The task reports checkpoint_data (JSON) to the scheduler; on retry, the scheduler passes the checkpoint back to the task.
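A minimal sketch of the retry loop with exponential backoff: max_retries matches the example value in the text, while the base delay, cap, and jitter factor are illustrative choices, not part of the design above.

```python
import random

MAX_RETRIES = 3  # e.g., 3 retries, so up to 4 attempts in total

def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before retrying attempt N: base * 2^N, capped, plus jitter."""
    delay = min(cap, base * (2 ** attempt))
    return delay + random.uniform(0, delay * 0.2)  # jitter avoids thundering herds

def run_with_retries(task_fn) -> str:
    for attempt in range(MAX_RETRIES + 1):
        try:
            task_fn()
            return "SUCCEEDED"
        except Exception:
            if attempt == MAX_RETRIES:
                return "FAILED"  # retries exhausted: mark the TaskRun FAILED
            # A real engine would re-queue the task and sleep for
            # backoff_seconds(attempt) before the next attempt.
    return "FAILED"

print(run_with_retries(lambda: None))   # succeeds on the first attempt
print(run_with_retries(lambda: 1 / 0))  # always raises, so it exhausts retries
```

In the real engine the retry happens by re-queuing the TaskRun with an incremented attempt_number rather than looping in-process, but the terminal states are the same.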
Scalability and Observability
Scheduler scalability: a single scheduler becomes a bottleneck at thousands of concurrent workflow runs. Partition workflow runs across multiple scheduler instances by run_id (consistent hashing or database-level partitioning); each scheduler instance owns a subset of runs.
Worker scalability: workers are stateless and horizontally scalable. Use Kafka or a work queue (RabbitMQ, SQS) for task distribution: the scheduler publishes ready tasks to the queue; workers consume and execute.
Observability: real-time DAG visualization shows completed, running, and pending tasks per run. Duration metrics per task (P50, P95, P99) identify slow tasks. Alert on: run duration exceeding expected time, failure rate exceeding threshold, task retry rate spikes (which indicate flaky external dependencies).
Workflow versioning: when the workflow definition changes, in-progress runs use the old version and new runs use the new version. Store the workflow definition version on each WorkflowRun.
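Run-to-scheduler assignment can be sketched as a hash of run_id; the scheduler names and sample run_ids here are illustrative, and a simple hash-mod stands in for a full consistent-hashing ring (which would reassign only about 1/N of the runs when an instance is added or removed):

```python
import hashlib

SCHEDULERS = ["scheduler-0", "scheduler-1", "scheduler-2"]

def owner(run_id: str) -> str:
    """Deterministically map a run_id to the scheduler instance that owns it."""
    digest = hashlib.sha256(run_id.encode()).digest()
    return SCHEDULERS[int.from_bytes(digest[:8], "big") % len(SCHEDULERS)]

for rid in ["run-a1", "run-b2", "run-c3"]:
    print(rid, "->", owner(rid))  # same run_id always maps to the same instance
```

Because the mapping is deterministic, any component (worker, API server) can compute which scheduler owns a given run without a lookup service.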
Asked at: Databricks, Stripe, Uber, Atlassian