What is a Workflow Engine?
A workflow engine executes multi-step processes where steps have dependencies, can run in parallel, and must handle failures. Examples: Airflow (data pipelines), Temporal (business workflows), AWS Step Functions, Cadence. Use cases: order processing (payment → inventory → shipping → notification), data pipelines (extract → transform → load → aggregate), CI/CD pipelines (build → test → deploy), document approval workflows. Core requirements: define workflows as code (or configuration), execute steps in dependency order, retry failed steps, resume from failure without re-running completed steps, track execution history.
Workflow Definition as a DAG
A workflow is a Directed Acyclic Graph (DAG): nodes = tasks, edges = dependencies. Example order processing workflow:
workflow = DAG("order_processing")
validate_payment = Task("validate_payment", fn=check_payment)
reserve_inventory = Task("reserve_inventory", fn=reserve_items)
charge_payment = Task("charge_payment", fn=run_charge,
                      depends_on=[validate_payment])
ship_order = Task("ship_order", fn=create_shipment,
                  depends_on=[reserve_inventory, charge_payment])
send_confirmation = Task("send_confirmation", fn=notify_customer,
                         depends_on=[ship_order])
workflow.add_tasks([validate_payment, reserve_inventory,
                    charge_payment, ship_order, send_confirmation])
Parallel execution: validate_payment and reserve_inventory have no dependencies — they run concurrently. charge_payment waits for validate_payment. ship_order waits for both charge_payment and reserve_inventory. Topological sort determines the execution order.
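The topological ordering above can be sketched with Kahn's algorithm, grouping tasks into "waves" where every task in a wave can run in parallel. This is a minimal sketch (the `deps` map mirrors the order-processing DAG; `execution_waves` is an illustrative helper, not any engine's real API):

```python
from collections import defaultdict

def execution_waves(deps):
    """Group tasks into waves: every task in a wave has all of its
    dependencies satisfied by earlier waves, so a wave can run in parallel."""
    indegree = {task: len(parents) for task, parents in deps.items()}
    children = defaultdict(list)
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    waves = []
    ready = [task for task, n in indegree.items() if n == 0]
    while ready:
        waves.append(sorted(ready))  # sorted only for deterministic output
        next_ready = []
        for task in ready:
            for child in children[task]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = next_ready

    if sum(len(w) for w in waves) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return waves

# The order-processing DAG above, as task -> direct dependencies.
deps = {
    "validate_payment": [],
    "reserve_inventory": [],
    "charge_payment": ["validate_payment"],
    "ship_order": ["reserve_inventory", "charge_payment"],
    "send_confirmation": ["ship_order"],
}
waves = execution_waves(deps)
# [['reserve_inventory', 'validate_payment'], ['charge_payment'],
#  ['ship_order'], ['send_confirmation']]
```

The first wave contains both dependency-free tasks, confirming they can run concurrently; ship_order only appears once both of its parents have completed.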
Execution State Persistence
Each workflow run (instance) is a separate entity. Schema: WorkflowRun: run_id (UUID), workflow_id, status (RUNNING, COMPLETED, FAILED, CANCELLED), created_at, completed_at, input (JSON). TaskRun: task_run_id, run_id, task_id, status (PENDING, RUNNING, SUCCEEDED, FAILED, SKIPPED), attempt_number, started_at, completed_at, output (JSON), error_message. Execution engine: a scheduler polls for PENDING tasks whose dependencies are all SUCCEEDED. It claims a task atomically (UPDATE ... SET status = RUNNING WHERE status = PENDING, or SELECT ... FOR UPDATE, so two schedulers cannot claim the same task). A worker executes the task and reports the result (SUCCEEDED or FAILED) to the scheduler, which then re-evaluates the run's remaining pending tasks. If all tasks are SUCCEEDED, mark the WorkflowRun COMPLETED. If a critical task fails permanently, mark its dependent tasks SKIPPED and the run FAILED.
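The atomic-claim step can be sketched against an in-memory SQLite database (table and column names follow the schema above; the task names are the order-processing example). SQLite has no SELECT ... FOR UPDATE, so the conditional UPDATE plays that role here; on Postgres you would typically use SELECT ... FOR UPDATE SKIP LOCKED instead:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task_runs (
        task_run_id TEXT PRIMARY KEY,
        run_id      TEXT,
        task_id     TEXT,
        status      TEXT  -- PENDING / RUNNING / SUCCEEDED / FAILED / SKIPPED
    );
    CREATE TABLE task_deps (task_id TEXT, depends_on TEXT);
""")
conn.executemany("INSERT INTO task_runs VALUES (?, ?, ?, ?)", [
    ("tr_validate", "r1", "validate_payment", "PENDING"),
    ("tr_charge",   "r1", "charge_payment",   "PENDING"),
])
conn.execute("INSERT INTO task_deps VALUES ('charge_payment', 'validate_payment')")

def claim_ready_task(conn, run_id):
    """Pick one PENDING task whose dependencies are all SUCCEEDED and
    atomically flip it to RUNNING. The status = 'PENDING' guard on the
    UPDATE is what stops two schedulers from claiming the same row."""
    row = conn.execute("""
        SELECT t.task_run_id FROM task_runs t
        WHERE t.run_id = ? AND t.status = 'PENDING'
          AND NOT EXISTS (
              SELECT 1 FROM task_deps d
              JOIN task_runs p
                ON p.task_id = d.depends_on AND p.run_id = t.run_id
              WHERE d.task_id = t.task_id AND p.status != 'SUCCEEDED')
        LIMIT 1""", (run_id,)).fetchone()
    if row is None:
        return None
    claimed = conn.execute(
        "UPDATE task_runs SET status = 'RUNNING' "
        "WHERE task_run_id = ? AND status = 'PENDING'", (row[0],))
    conn.commit()
    return row[0] if claimed.rowcount == 1 else None

first = claim_ready_task(conn, "r1")   # claims tr_validate
second = claim_ready_task(conn, "r1")  # None: charge_payment is still blocked
```

Once tr_validate is marked SUCCEEDED, the next poll claims tr_charge, which is exactly the scheduler's re-evaluation loop.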
Retry and Fault Tolerance
Task-level retry: each task has a max_retries configuration (e.g., 3 retries). On failure: increment TaskRun.attempt_number; if it is below max_retries, reset the task to PENDING with an exponential backoff delay before the next attempt; otherwise mark it FAILED. Worker liveness: workers send periodic heartbeats while running a task; the scheduler marks tasks with a stale heartbeat (no heartbeat within the timeout) as FAILED and re-queues them — handles worker crashes. Checkpointing: for tasks with complex computation, support saving intermediate state (a checkpoint) and resuming from the last checkpoint. The task reports checkpoint_data (JSON) to the scheduler; on retry, the scheduler passes the checkpoint back to the task.
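The retry-with-checkpoint loop can be sketched as follows. Everything here is illustrative (`run_with_retry`, the `save_checkpoint` callback, and the tiny base delay are assumptions for the sketch; a real scheduler would persist the checkpoint and use delays on the order of seconds):

```python
import random
import time

MAX_RETRIES = 3
BASE_DELAY_S = 0.05  # illustrative; a production scheduler would use ~1s

def backoff_delay(attempt):
    """Exponential backoff with full jitter, capped."""
    return random.uniform(0, min(2.0, BASE_DELAY_S * 2 ** attempt))

def run_with_retry(task_fn, task_input):
    """Drive one task to completion. task_fn(input, checkpoint, save_checkpoint)
    raises on transient failure; whatever it last passed to save_checkpoint
    is handed back on the next attempt, so completed work is not redone."""
    state = {"checkpoint": None}
    for attempt in range(MAX_RETRIES + 1):
        try:
            return task_fn(task_input, state["checkpoint"],
                           lambda data: state.update(checkpoint=data))
        except Exception:
            if attempt == MAX_RETRIES:
                raise  # retries exhausted: the scheduler marks the TaskRun FAILED
            time.sleep(backoff_delay(attempt))

# A task that fails twice mid-batch, then resumes from its checkpoint.
attempts = {"n": 0}
def flaky_batch(items, checkpoint, save_checkpoint):
    start = checkpoint or 0       # resume past already-processed items
    for i in range(start, len(items)):
        if i == 1 and attempts["n"] < 2:
            attempts["n"] += 1
            raise RuntimeError("transient failure")
        save_checkpoint(i + 1)    # record progress after each item
    return f"processed {len(items)} items"

result = run_with_retry(flaky_batch, ["a", "b", "c"])
# succeeds on the third attempt without ever reprocessing item 0
```

Note that the checkpoint is saved after each completed item, so the two failed attempts resume at item 1 rather than restarting the batch.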
Scalability and Observability
Scheduler scalability: a single scheduler becomes a bottleneck for thousands of concurrent workflow runs. Partition workflow runs across multiple scheduler instances by run_id (consistent hashing or database-level partitioning). Each scheduler instance owns a subset of runs. Worker scalability: workers are stateless and horizontally scalable. Use Kafka or a work queue (RabbitMQ, SQS) for task distribution. The scheduler publishes ready tasks to the queue; workers consume and execute. Observability: real-time DAG visualization shows completed, running, and pending tasks per run. Duration metrics per task: P50, P95, P99 — identifies slow tasks. Alert on: run duration exceeding expected time, failure rate exceeding threshold, task retry rate spike (indicates flaky external dependencies). Workflow versioning: when the workflow definition changes: in-progress runs use the old version; new runs use the new version. Store the workflow definition version on each WorkflowRun.
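Partitioning runs by run_id can be sketched with a consistent-hash ring (the `SchedulerRing` class, scheduler names, and vnode count are all assumptions for illustration):

```python
import bisect
import hashlib
import uuid

class SchedulerRing:
    """Consistent-hash ring mapping run_ids to scheduler instances.
    Virtual nodes smooth the distribution; adding or removing a
    scheduler only moves the runs that hashed to its slots."""

    def __init__(self, schedulers, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{name}#{v}"), name)
            for name in schedulers for v in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def owner(self, run_id):
        """First ring slot at or after the run_id's hash, wrapping around."""
        i = bisect.bisect(self.keys, self._hash(run_id)) % len(self.ring)
        return self.ring[i][1]

ring = SchedulerRing(["scheduler-a", "scheduler-b", "scheduler-c"])
run_id = str(uuid.uuid4())
# Each scheduler instance polls only the runs it owns:
#     if ring.owner(run_id) == MY_INSTANCE_NAME: schedule(run_id)
```

Database-level partitioning (e.g., `hash(run_id) % num_schedulers` as a partition key) achieves the same ownership split without a ring, at the cost of more run reshuffling when scheduler count changes.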
Frequently Asked Questions

How does a workflow engine differ from a simple task queue?
A task queue (Celery, SQS, RabbitMQ) handles individual tasks: enqueue a task, a worker picks it up, executes it, and the queue moves on. There is no concept of dependencies between tasks, no tracking of workflow-level state, and no automatic chaining. A workflow engine manages multi-step processes: it understands that task B depends on task A, executes them in the correct order, handles task-level retries independently, tracks the state of the entire workflow run (which tasks completed, which failed), and can resume a failed workflow from the last checkpoint. Additional capabilities of workflow engines: wait states (pause until an external event — e.g., wait for human approval), conditional branching (if payment fails, go to fallback path; else, proceed with fulfillment), sub-workflows (a task can itself be a workflow), timeouts and deadlines on individual tasks or the entire workflow, and complete audit history of every task execution. Use a task queue when you need fire-and-forget async processing. Use a workflow engine when you need to orchestrate a complex, multi-step process with dependencies, error handling, and observability.

How does Temporal differ from Apache Airflow for workflow orchestration?
Airflow is designed for batch data pipelines: scheduled DAGs, typically running on a cadence (hourly, daily). It is optimized for data engineering workflows: ETL, ML pipelines, dbt runs. Airflow DAGs are defined in Python; tasks execute as individual workers. Weaknesses: not designed for event-driven or real-time workflows, limited support for long-running workflows (days to months), UI is primarily for monitoring rather than operational workflows. Temporal (and Cadence, its predecessor) is designed for operational workflows: long-running business processes that span minutes, hours, or months. Workflows are written as ordinary code (Go, Java, Python, TypeScript) with the Temporal SDK. The SDK intercepts function calls and makes them durable — if the worker crashes, Temporal replays the workflow history to restore state. Temporal excels at: microservice orchestration, saga patterns (distributed transactions), workflows requiring human approval steps, stateful long-running processes. Choose Airflow for data pipelines and batch schedules. Choose Temporal for operational workflows and event-driven long-running processes.

How do you implement the Saga pattern with a workflow engine?
The Saga pattern handles distributed transactions: a sequence of local transactions, each with a compensating transaction to undo it on failure. Example: book a trip (hotel + flight + car). If the car booking fails after hotel and flight succeed: run compensating transactions to cancel hotel and flight. Workflow engine implementation: define each booking step as a task. Define a corresponding cancel/compensate task for each. On task failure: the workflow engine runs compensation tasks in reverse order. Temporal implementation: in the workflow code, catch exceptions and explicitly call compensation activities. The workflow code is ordinary: try booking car → on exception → cancel flight → cancel hotel. Temporal makes this code durable: even if the worker crashes while running compensations, Temporal replays the history and continues from where it left off. Choreography vs orchestration: the Saga pattern can be implemented via event choreography (each service reacts to events and publishes its own) or orchestration (a central workflow coordinator tells each service what to do). Workflow engines implement the orchestration approach — easier to reason about and debug.

How do you handle workflow versioning when the workflow definition changes?
Workflow versioning is one of the hardest problems in workflow systems. The challenge: a workflow run may be in progress when the workflow code is deployed with changes. The new code may have a different number of steps, different step order, or different step logic. If the workflow engine tries to resume the in-progress run with the new code, it may fail (the history doesn't match the new code). Strategies: (1) Concurrent versioning: keep multiple versions of the workflow code deployed simultaneously. In-progress runs continue with the old version; new runs use the new version. Route runs to the correct version by workflow version tag. (2) Compatibility shims: add version checks in the workflow code: if self.workflow_version >= 2: # new path. Temporal provides a Workflow.get_version() API specifically for this. (3) Drain and deploy: wait for all in-progress runs to complete before deploying the new version. Only feasible for short-lived workflows. (4) New workflow name: deploy the new version as a different workflow type (order_processing_v2). New runs use v2; old runs complete on v1. Regardless of approach: maintain backward compatibility in the workflow code for at least one deployment cycle.

How do you implement parallel fan-out and fan-in in a workflow?
Fan-out: start multiple tasks concurrently. Fan-in: wait for all of them to complete before proceeding. Example: process a batch of items — create N parallel tasks, one per item, then aggregate results. Temporal implementation in Python: activities = [workflow.execute_activity(process_item, item) for item in items]. results = await asyncio.gather(*activities). All activities run in parallel; gather waits for all to complete. Handle partial failures: if some activities fail: catch exceptions individually, collect successes and failures. Decide: fail the entire workflow (strict mode) or proceed with partial results (lenient mode). Airflow fan-out: dynamic task mapping (Airflow 2.3+): .expand(item=items) creates N tasks at runtime. The downstream task (fan-in) has all_done trigger rule — waits for all upstream tasks to succeed/fail. Large fan-outs: if N is very large (thousands of parallel tasks), the workflow engine itself can become a bottleneck. Strategies: batch the fan-out (groups of 100), use a separate work queue for the parallel tasks and have the workflow poll for completion, or use a map-reduce pattern with intermediate aggregation steps.
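The fan-out/fan-in pattern with strict and lenient failure handling can be sketched in plain asyncio (this stands in for Temporal's execute_activity/gather idiom; `process_item` and the "bad" item are illustrative assumptions):

```python
import asyncio

async def process_item(item):
    """Stand-in for a task/activity; fails for one item to show partial failure."""
    await asyncio.sleep(0)  # simulate I/O
    if item == "bad":
        raise ValueError(f"cannot process {item!r}")
    return item.upper()

async def fan_out_fan_in(items, strict=False):
    """Fan out one coroutine per item, fan in with gather.
    return_exceptions=True collects failures instead of cancelling the
    sibling tasks, so we can choose strict mode (fail the whole run)
    or lenient mode (proceed with partial results)."""
    results = await asyncio.gather(
        *(process_item(i) for i in items), return_exceptions=True)
    successes = [r for r in results if not isinstance(r, Exception)]
    failures = [r for r in results if isinstance(r, Exception)]
    if strict and failures:
        raise failures[0]
    return successes, failures

successes, failures = asyncio.run(fan_out_fan_in(["a", "bad", "c"]))
# lenient mode: successes == ["A", "C"], one ValueError collected in failures
```

With `return_exceptions=False` (the default), gather would propagate the first exception and cancel the remaining coroutines — that default is effectively the strict mode, but without the chance to inspect partial results.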