What is a Workflow Engine?
A workflow engine executes multi-step processes where steps have dependencies, can run in parallel, and must handle failures. Examples: Airflow (data pipelines), Temporal (business workflows), AWS Step Functions, Cadence. Use cases: order processing (payment → inventory → shipping → notification), data pipelines (extract → transform → load → aggregate), CI/CD pipelines (build → test → deploy), document approval workflows. Core requirements: define workflows as code (or configuration), execute steps in dependency order, retry failed steps, resume from failure without re-running completed steps, track execution history.
Workflow Definition as a DAG
A workflow is a Directed Acyclic Graph (DAG): nodes = tasks, edges = dependencies. Example order processing workflow:
workflow = DAG("order_processing")
validate_payment = Task("validate_payment", fn=check_payment)
reserve_inventory = Task("reserve_inventory", fn=reserve_items)
charge_payment = Task("charge_payment", fn=run_charge,
                      depends_on=[validate_payment])
ship_order = Task("ship_order", fn=create_shipment,
                  depends_on=[reserve_inventory, charge_payment])
send_confirmation = Task("send_confirmation", fn=notify_customer,
                         depends_on=[ship_order])
workflow.add_tasks([validate_payment, reserve_inventory,
                    charge_payment, ship_order, send_confirmation])
Parallel execution: validate_payment and reserve_inventory have no dependencies — they run concurrently. charge_payment waits for validate_payment. ship_order waits for both charge_payment and reserve_inventory. Topological sort determines the execution order.
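The topological ordering above can be sketched with Kahn's algorithm, grouping tasks into "waves" where every task in a wave can run in parallel. This is a minimal sketch (the `deps` map mirrors the order-processing DAG; `execution_waves` is an illustrative helper, not any engine's real API):

```python
from collections import defaultdict

def execution_waves(deps):
    """Group tasks into waves: every task in a wave has all of its
    dependencies satisfied by earlier waves, so a wave can run in parallel."""
    indegree = {task: len(parents) for task, parents in deps.items()}
    children = defaultdict(list)
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    waves = []
    ready = [task for task, n in indegree.items() if n == 0]
    while ready:
        waves.append(sorted(ready))  # sorted only for deterministic output
        next_ready = []
        for task in ready:
            for child in children[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = next_ready

    if sum(len(w) for w in waves) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return waves

# The order-processing DAG above, as task -> direct dependencies.
deps = {
    "validate_payment": [],
    "reserve_inventory": [],
    "charge_payment": ["validate_payment"],
    "ship_order": ["reserve_inventory", "charge_payment"],
    "send_confirmation": ["ship_order"],
}
waves = execution_waves(deps)
# [['reserve_inventory', 'validate_payment'], ['charge_payment'],
#  ['ship_order'], ['send_confirmation']]
```

The first wave contains both dependency-free tasks, confirming they can run concurrently; ship_order only appears once both of its parents have completed.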
Execution State Persistence
Each workflow run (instance) is a separate entity. Schema: WorkflowRun: run_id (UUID), workflow_id, status (RUNNING, COMPLETED, FAILED, CANCELLED), created_at, completed_at, input (JSON). TaskRun: task_run_id, run_id, task_id, status (PENDING, RUNNING, SUCCEEDED, FAILED, SKIPPED), attempt_number, started_at, completed_at, output (JSON), error_message. Execution engine: a scheduler polls for PENDING tasks whose dependencies are all SUCCEEDED. It claims a task atomically (UPDATE ... SET status = RUNNING WHERE status = PENDING, or SELECT ... FOR UPDATE, so two schedulers cannot claim the same task). A worker executes the task and reports the result (SUCCEEDED or FAILED) to the scheduler, which then re-evaluates the run's remaining pending tasks. If all tasks are SUCCEEDED, mark the WorkflowRun COMPLETED. If a critical task fails permanently, mark its dependent tasks SKIPPED and the run FAILED.
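The atomic-claim step can be sketched against an in-memory SQLite database (table and column names follow the schema above; the task names are the order-processing example). SQLite has no SELECT ... FOR UPDATE, so the conditional UPDATE plays that role here; on Postgres you would typically use SELECT ... FOR UPDATE SKIP LOCKED instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task_runs (
        task_run_id TEXT PRIMARY KEY,
        run_id      TEXT,
        task_id     TEXT,
        status      TEXT  -- PENDING / RUNNING / SUCCEEDED / FAILED / SKIPPED
    );
    CREATE TABLE task_deps (task_id TEXT, depends_on TEXT);
""")
conn.executemany("INSERT INTO task_runs VALUES (?, ?, ?, ?)", [
    ("tr_validate", "r1", "validate_payment", "PENDING"),
    ("tr_charge",   "r1", "charge_payment",   "PENDING"),
])
conn.execute("INSERT INTO task_deps VALUES ('charge_payment', 'validate_payment')")

def claim_ready_task(conn, run_id):
    """Pick one PENDING task whose dependencies are all SUCCEEDED and
    atomically flip it to RUNNING. The status = 'PENDING' guard on the
    UPDATE is what stops two schedulers from claiming the same row."""
    row = conn.execute("""
        SELECT t.task_run_id FROM task_runs t
        WHERE t.run_id = ? AND t.status = 'PENDING'
          AND NOT EXISTS (
              SELECT 1 FROM task_deps d
              JOIN task_runs p
                ON p.task_id = d.depends_on AND p.run_id = t.run_id
              WHERE d.task_id = t.task_id AND p.status != 'SUCCEEDED')
        LIMIT 1""", (run_id,)).fetchone()
    if row is None:
        return None
    claimed = conn.execute(
        "UPDATE task_runs SET status = 'RUNNING' "
        "WHERE task_run_id = ? AND status = 'PENDING'", (row[0],))
    conn.commit()
    return row[0] if claimed.rowcount == 1 else None

first = claim_ready_task(conn, "r1")   # claims tr_validate
second = claim_ready_task(conn, "r1")  # None: charge_payment is still blocked
```

Once tr_validate is marked SUCCEEDED, the next poll claims tr_charge, which is exactly the scheduler's re-evaluation loop.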
Retry and Fault Tolerance
Task-level retry: each task has a max_retries configuration (e.g., 3 retries). On failure: increment TaskRun.attempt_number; if it is below max_retries, reset the task to PENDING with an exponential backoff delay before the next attempt; otherwise mark it FAILED. Worker liveness: workers send periodic heartbeats while running a task; the scheduler marks tasks with a stale heartbeat (no heartbeat within the timeout) as FAILED and re-queues them — handles worker crashes. Checkpointing: for tasks with complex computation, support saving intermediate state (a checkpoint) and resuming from the last checkpoint. The task reports checkpoint_data (JSON) to the scheduler; on retry, the scheduler passes the checkpoint back to the task.
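The retry-with-checkpoint loop can be sketched as follows. Everything here is illustrative (`run_with_retry`, the `save_checkpoint` callback, and the tiny base delay are assumptions for the sketch; a real scheduler would persist the checkpoint and use delays on the order of seconds):

```python
import random
import time

MAX_RETRIES = 3
BASE_DELAY_S = 0.05  # illustrative; a production scheduler would use ~1s

def backoff_delay(attempt):
    """Exponential backoff with full jitter, capped."""
    return random.uniform(0, min(2.0, BASE_DELAY_S * 2 ** attempt))

def run_with_retry(task_fn, task_input):
    """Drive one task to completion. task_fn(input, checkpoint, save_checkpoint)
    raises on transient failure; whatever it last passed to save_checkpoint
    is handed back on the next attempt, so completed work is not redone."""
    state = {"checkpoint": None}
    for attempt in range(MAX_RETRIES + 1):
        try:
            return task_fn(task_input, state["checkpoint"],
                           lambda data: state.update(checkpoint=data))
        except Exception:
            if attempt == MAX_RETRIES:
                raise  # retries exhausted: the scheduler marks the TaskRun FAILED
            time.sleep(backoff_delay(attempt))

# A task that fails twice mid-batch, then resumes from its checkpoint.
attempts = {"n": 0}
def flaky_batch(items, checkpoint, save_checkpoint):
    start = checkpoint or 0       # resume past already-processed items
    for i in range(start, len(items)):
        if i == 1 and attempts["n"] < 2:
            attempts["n"] += 1
            raise RuntimeError("transient failure")
        save_checkpoint(i + 1)    # record progress after each item
    return f"processed {len(items)} items"

result = run_with_retry(flaky_batch, ["a", "b", "c"])
# succeeds on the third attempt without ever reprocessing item 0
```

Note that the checkpoint is saved after each completed item, so the two failed attempts resume at item 1 rather than restarting the batch.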
Scalability and Observability
Scheduler scalability: a single scheduler becomes a bottleneck for thousands of concurrent workflow runs. Partition workflow runs across multiple scheduler instances by run_id (consistent hashing or database-level partitioning). Each scheduler instance owns a subset of runs. Worker scalability: workers are stateless and horizontally scalable. Use Kafka or a work queue (RabbitMQ, SQS) for task distribution. The scheduler publishes ready tasks to the queue; workers consume and execute. Observability: real-time DAG visualization shows completed, running, and pending tasks per run. Duration metrics per task: P50, P95, P99 — identifies slow tasks. Alert on: run duration exceeding expected time, failure rate exceeding threshold, task retry rate spike (indicates flaky external dependencies). Workflow versioning: when the workflow definition changes: in-progress runs use the old version; new runs use the new version. Store the workflow definition version on each WorkflowRun.
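Partitioning runs by run_id can be sketched with a consistent-hash ring (the `SchedulerRing` class, scheduler names, and vnode count are all assumptions for illustration):

```python
import bisect
import hashlib
import uuid

class SchedulerRing:
    """Consistent-hash ring mapping run_ids to scheduler instances.
    Virtual nodes smooth the distribution; adding or removing a
    scheduler only moves the runs that hashed to its slots."""

    def __init__(self, schedulers, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{name}#{v}"), name)
            for name in schedulers for v in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def owner(self, run_id):
        """First ring slot at or after the run_id's hash, wrapping around."""
        i = bisect.bisect(self.keys, self._hash(run_id)) % len(self.ring)
        return self.ring[i][1]

ring = SchedulerRing(["scheduler-a", "scheduler-b", "scheduler-c"])
run_id = str(uuid.uuid4())
# Each scheduler instance polls only the runs it owns:
#     if ring.owner(run_id) == MY_INSTANCE_NAME: schedule(run_id)
```

Database-level partitioning (e.g., `hash(run_id) % num_schedulers` as a partition key) achieves the same ownership split without a ring, at the cost of more run reshuffling when scheduler count changes.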
Frequently Asked Questions

How does a workflow engine differ from a simple task queue?
A task queue (Celery, SQS, RabbitMQ) handles individual tasks: enqueue a task, a worker picks it up, executes it, and the queue moves on. There is no concept of dependencies between tasks, no tracking of workflow-level state, and no automatic chaining. A workflow engine manages multi-step processes: it understands that task B depends on task A, executes them in the correct order, handles task-level retries independently, tracks the state of the entire workflow run (which tasks completed, which failed), and can resume a failed workflow from the last checkpoint. Additional capabilities of workflow engines: wait states (pause until an external event — e.g., wait for human approval), conditional branching (if payment fails, go to fallback path; else, proceed with fulfillment), sub-workflows (a task can itself be a workflow), timeouts and deadlines on individual tasks or the entire workflow, and complete audit history of every task execution. Use a task queue when you need fire-and-forget async processing. Use a workflow engine when you need to orchestrate a complex, multi-step process with dependencies, error handling, and observability.

How does Temporal differ from Apache Airflow for workflow orchestration?
Airflow is designed for batch data pipelines: scheduled DAGs, typically running on a cadence (hourly, daily). It is optimized for data engineering workflows: ETL, ML pipelines, dbt runs. Airflow DAGs are defined in Python; tasks execute as individual workers. Weaknesses: not designed for event-driven or real-time workflows, limited support for long-running workflows (days to months), UI is primarily for monitoring rather than operational workflows. Temporal (and Cadence, its predecessor) is designed for operational workflows: long-running business processes that span minutes, hours, or months. Workflows are written as ordinary code (Go, Java, Python, TypeScript) with the Temporal SDK. The SDK intercepts function calls and makes them durable — if the worker crashes, Temporal replays the workflow history to restore state. Temporal excels at: microservice orchestration, saga patterns (distributed transactions), workflows requiring human approval steps, stateful long-running processes. Choose Airflow for data pipelines and batch schedules. Choose Temporal for operational workflows and event-driven long-running processes.

How do you implement the Saga pattern with a workflow engine?
The Saga pattern handles distributed transactions: a sequence of local transactions, each with a compensating transaction to undo it on failure. Example: book a trip (hotel + flight + car). If the car booking fails after hotel and flight succeed: run compensating transactions to cancel hotel and flight. Workflow engine implementation: define each booking step as a task. Define a corresponding cancel/compensate task for each. On task failure: the workflow engine runs compensation tasks in reverse order. Temporal implementation: in the workflow code, catch exceptions and explicitly call compensation activities. The workflow code is ordinary: try booking car → on exception → cancel flight → cancel hotel. Temporal makes this code durable: even if the worker crashes while running compensations, Temporal replays the history and continues from where it left off. Choreography vs orchestration: the Saga pattern can be implemented via event choreography (each service reacts to events and publishes its own) or orchestration (a central workflow coordinator tells each service what to do). Workflow engines implement the orchestration approach — easier to reason about and debug.

How do you handle workflow versioning when the workflow definition changes?
Workflow versioning is one of the hardest problems in workflow systems. The challenge: a workflow run may be in progress when the workflow code is deployed with changes. The new code may have a different number of steps, different step order, or different step logic. If the workflow engine tries to resume the in-progress run with the new code, it may fail (the history doesn't match the new code). Strategies: (1) Concurrent versioning: keep multiple versions of the workflow code deployed simultaneously. In-progress runs continue with the old version; new runs use the new version. Route runs to the correct version by workflow version tag. (2) Compatibility shims: add version checks in the workflow code: if self.workflow_version >= 2: # new path. Temporal provides a Workflow.get_version() API specifically for this. (3) Drain and deploy: wait for all in-progress runs to complete before deploying the new version. Only feasible for short-lived workflows. (4) New workflow name: deploy the new version as a different workflow type (order_processing_v2). New runs use v2; old runs complete on v1. Regardless of approach: maintain backward compatibility in the workflow code for at least one deployment cycle.

How do you implement parallel fan-out and fan-in in a workflow?
Fan-out: start multiple tasks concurrently. Fan-in: wait for all of them to complete before proceeding. Example: process a batch of items — create N parallel tasks, one per item, then aggregate results. Temporal implementation in Python: activities = [workflow.execute_activity(process_item, item) for item in items]. results = await asyncio.gather(*activities). All activities run in parallel; gather waits for all to complete. Handle partial failures: if some activities fail: catch exceptions individually, collect successes and failures. Decide: fail the entire workflow (strict mode) or proceed with partial results (lenient mode). Airflow fan-out: dynamic task mapping (Airflow 2.3+): .expand(item=items) creates N tasks at runtime. The downstream task (fan-in) has all_done trigger rule — waits for all upstream tasks to succeed/fail. Large fan-outs: if N is very large (thousands of parallel tasks), the workflow engine itself can become a bottleneck. Strategies: batch the fan-out (groups of 100), use a separate work queue for the parallel tasks and have the workflow poll for completion, or use a map-reduce pattern with intermediate aggregation steps.
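The fan-out/fan-in pattern with strict and lenient failure handling can be sketched in plain asyncio (this stands in for Temporal's execute_activity/gather idiom; `process_item` and the "bad" item are illustrative assumptions):

```python
import asyncio

async def process_item(item):
    """Stand-in for a task/activity; fails for one item to show partial failure."""
    await asyncio.sleep(0)  # simulate I/O
    if item == "bad":
        raise ValueError(f"cannot process {item!r}")
    return item.upper()

async def fan_out_fan_in(items, strict=False):
    """Fan out one coroutine per item, fan in with gather.
    return_exceptions=True collects failures instead of cancelling the
    sibling tasks, so we can choose strict mode (fail the whole run)
    or lenient mode (proceed with partial results)."""
    results = await asyncio.gather(
        *(process_item(i) for i in items), return_exceptions=True)
    successes = [r for r in results if not isinstance(r, Exception)]
    failures = [r for r in results if isinstance(r, Exception)]
    if strict and failures:
        raise failures[0]
    return successes, failures

successes, failures = asyncio.run(fan_out_fan_in(["a", "bad", "c"]))
# lenient mode: successes == ["A", "C"], one ValueError collected in failures
```

With `return_exceptions=False` (the default), gather would propagate the first exception and cancel the remaining coroutines — that default is effectively the strict mode, but without the chance to inspect partial results.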