What is the Saga pattern and how does it differ from two-phase commit?

A Saga is a sequence of local transactions where each step publishes an event or sends a command to trigger the next step. If a step fails, compensating transactions run in reverse order to undo the completed steps. Unlike two-phase commit (2PC), Sagas never hold distributed locks: each service commits locally and immediately. This gives better availability and throughput at the cost of temporary inconsistency. 2PC provides atomic, ACID-style commit across participants within a single distributed transaction, but it blocks resources during the commit phase and can leave participants stuck if the coordinator crashes, which makes it a poor fit for cross-microservice workflows.

What is the difference between choreography and orchestration in the Saga pattern?

Choreography-based Sagas have each service react to events and emit new events; there is no central coordinator. Services are fully decoupled but the overall workflow is implicit and hard to trace. Orchestration-based Sagas use a central orchestrator (e.g., Temporal, AWS Step Functions) that explicitly commands each step and handles failures. Orchestration makes workflow state observable and compensation logic centralized. Choose choreography for simple 2-3 step flows; choose orchestration for complex flows with branching, timeouts, and many compensation paths.

What is the Outbox Pattern and why is it needed with Event Sourcing and Sagas?

The Outbox Pattern solves the dual-write problem: you cannot atomically update a database AND publish a message to Kafka/RabbitMQ. If the service crashes between the two operations, you get inconsistency (DB updated, event not published, or vice versa). The solution: write events to an outbox table within the same local transaction as the business data update. A separate relay process (using CDC via Debezium or polling) reads the outbox table and publishes events to the message broker, then deletes processed rows. This guarantees at-least-once delivery; consumers must be idempotent to handle potential duplicates.

System Design Interview: Distributed Transactions and Saga Pattern

Maintaining data consistency across multiple microservices is one of the hardest distributed systems problems. Two-phase commit (2PC) ensures ACID guarantees but kills availability. The Saga pattern achieves eventual consistency with local transactions and compensating actions, enabling resilient workflows at scale.

The Problem: Distributed Transactions

Consider an e-commerce order: reserve inventory → debit customer balance → create shipment → send confirmation. Each step touches a different service with its own database. Traditional 2PC requires all services to participate in one distributed transaction, but what happens when the shipment service is down?

Two-Phase Commit (2PC):
  Coordinator → Participants: "PREPARE" (lock resources)
  All reply "READY" → Coordinator → "COMMIT"
  Any reply "ABORT" → Coordinator → "ROLLBACK"

Problems:
  - Blocking: participants hold locks until coordinator responds
  - Coordinator failure: system stuck in uncertain state
  - Latency: multiple round trips across services
  - Availability: any participant failure blocks entire transaction
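
To make the protocol and its blocking behaviour concrete, here is a minimal in-memory sketch of a 2PC coordinator. The `Participant` interface, `memParticipant` type, and `TwoPhaseCommit` function are illustrative names, not part of any real transaction manager, and the sketch ignores timeouts and coordinator recovery.

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical interface for a resource that votes in 2PC.
type Participant interface {
	Prepare(txID string) error // lock resources and vote; nil means READY
	Commit(txID string)
	Rollback(txID string)
}

// memParticipant is an in-memory stand-in that records its final state.
type memParticipant struct {
	name     string
	failVote bool
	state    string // "", "prepared", "committed", or "rolled-back"
}

func (p *memParticipant) Prepare(txID string) error {
	if p.failVote {
		return errors.New(p.name + ": cannot prepare")
	}
	p.state = "prepared" // locks are held from here until Commit or Rollback
	return nil
}

func (p *memParticipant) Commit(txID string)   { p.state = "committed" }
func (p *memParticipant) Rollback(txID string) { p.state = "rolled-back" }

// TwoPhaseCommit runs phase 1 (PREPARE) on every participant, then phase 2.
func TwoPhaseCommit(txID string, ps []Participant) error {
	for i, p := range ps {
		if err := p.Prepare(txID); err != nil {
			// One ABORT vote: roll back everyone who already prepared.
			for _, q := range ps[:i] {
				q.Rollback(txID)
			}
			return fmt.Errorf("tx %s aborted: %w", txID, err)
		}
	}
	for _, p := range ps {
		p.Commit(txID)
	}
	return nil
}

func main() {
	payment := &memParticipant{name: "payment"}
	shipping := &memParticipant{name: "shipping", failVote: true}
	err := TwoPhaseCommit("tx-1", []Participant{payment, shipping})
	fmt.Println(err, payment.state)
}
```

Note that a participant in the "prepared" state can do nothing but wait for the coordinator's decision, which is where the blocking and coordinator-failure problems listed above come from.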

The Saga Pattern

A Saga is a sequence of local transactions. Each local transaction publishes an event or sends a command that triggers the next step. If step N fails, the compensating transactions for the steps already completed run in reverse order (T(N-1)_comp, …, T2_comp, T1_comp) to undo them.
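
The reverse-order compensation rule can be sketched as a small runner. This is an in-process illustration only: `Step`, `RunSaga`, and `ErrShipmentDown` are made-up names, and a real saga would run each step in a different service and persist its progress.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrShipmentDown is a sample failure used in the demo below.
var ErrShipmentDown = errors.New("shipment service unavailable")

// Step pairs a local transaction with its compensating transaction.
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// RunSaga executes steps in order. If step N fails, it runs the
// compensations of steps N-1 down to 1, in that (reverse) order.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Action(); err != nil {
			for j := i - 1; j >= 0; j-- {
				if cerr := steps[j].Compensate(); cerr != nil {
					// A real system would retry or park this for manual repair.
					return fmt.Errorf("%s failed (%v); compensating %s failed: %w",
						s.Name, err, steps[j].Name, cerr)
				}
			}
			return fmt.Errorf("saga aborted at %s: %w", s.Name, err)
		}
	}
	return nil
}

func main() {
	var trace []string
	step := func(name string, fail error) Step {
		return Step{
			Name:       name,
			Action:     func() error { trace = append(trace, name); return fail },
			Compensate: func() error { trace = append(trace, "undo-"+name); return nil },
		}
	}
	err := RunSaga([]Step{
		step("reserve-inventory", nil),
		step("process-payment", nil),
		step("create-shipment", ErrShipmentDown),
	})
	fmt.Println(trace, err)
}
```

The failed step itself is not compensated, on the assumption that a failed local transaction left no committed effect behind.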

Choreography-Based Saga

OrderService          InventoryService      PaymentService
    │                      │                     │
    ├── OrderCreated ──────►│                     │
    │                      ├── InventoryReserved ►│
    │                      │                     ├── PaymentProcessed ──► ShippingService
    │                      │                     │                            │
    │  ◄─ OrderShipped ────────────────────────────────────────────────────────┤

If payment fails:
    PaymentFailed ──► InventoryService.ReleaseInventory
                 ──► OrderService.CancelOrder

Pros: Fully decoupled; no orchestrator single point of failure.
Cons: Hard to track overall workflow state; compensations are distributed; difficult to add new steps.
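
A rough sketch of the choreography above, using a synchronous in-memory bus in place of a real broker. The `Bus` type, the `Wire` function, and the `cardDeclined` switch are assumptions made for illustration; the event names follow the diagram, plus hypothetical `InventoryReleased` and `OrderCancelled` events for the failure path.

```go
package main

import "fmt"

// Event is a minimal message; a real event would carry a payload.
type Event struct {
	Type    string
	OrderID string
}

// Bus is a synchronous in-memory stand-in for a message broker.
type Bus struct {
	handlers map[string][]func(Event)
	Log      []string // event types in publish order, for tracing
}

func NewBus() *Bus { return &Bus{handlers: map[string][]func(Event){}} }

func (b *Bus) Subscribe(eventType string, h func(Event)) {
	b.handlers[eventType] = append(b.handlers[eventType], h)
}

func (b *Bus) Publish(e Event) {
	b.Log = append(b.Log, e.Type)
	for _, h := range b.handlers[e.Type] {
		h(e)
	}
}

// Wire registers each service's reactions. No component knows the whole flow.
// cardDeclined simulates a payment failure.
func Wire(b *Bus, cardDeclined bool) {
	// InventoryService
	b.Subscribe("OrderCreated", func(e Event) { b.Publish(Event{"InventoryReserved", e.OrderID}) })
	b.Subscribe("PaymentFailed", func(e Event) { b.Publish(Event{"InventoryReleased", e.OrderID}) })
	// PaymentService
	b.Subscribe("InventoryReserved", func(e Event) {
		if cardDeclined {
			b.Publish(Event{"PaymentFailed", e.OrderID})
			return
		}
		b.Publish(Event{"PaymentProcessed", e.OrderID})
	})
	// ShippingService
	b.Subscribe("PaymentProcessed", func(e Event) { b.Publish(Event{"OrderShipped", e.OrderID}) })
	// OrderService
	b.Subscribe("PaymentFailed", func(e Event) { b.Publish(Event{"OrderCancelled", e.OrderID}) })
}

func main() {
	b := NewBus()
	Wire(b, true)
	b.Publish(Event{"OrderCreated", "ord_123"})
	fmt.Println(b.Log)
}
```

The workflow exists only as the sum of the subscriptions, which is why tracing it usually needs a correlation ID carried on every event.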

Orchestration-Based Saga

OrderSaga Orchestrator (Temporal/AWS Step Functions):
  1. ReserveInventory(orderId)  → success → continue
  2. ProcessPayment(orderId)    → failure → trigger compensation:
     2a. ReleaseInventory(orderId)
     2b. MarkOrderFailed(orderId)
  3. CreateShipment(orderId)    → success → continue
  4. SendConfirmation(orderId)

State machine stored in orchestrator — easy to inspect, retry, debug.

Pros: Centralized workflow logic; easy to monitor and debug; clear compensation logic.
Cons: Orchestrator is a coupling point (though not a data path); requires orchestration infrastructure.
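
The "state machine stored in the orchestrator" idea can be illustrated with a tiny resumable runner. `SagaState`, `StepFunc`, `Advance`, and `ErrTransient` are hypothetical names; in a real orchestrator the state update after each step would be a durable write, not an in-memory field.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTransient is a sample retryable failure used in the demo below.
var ErrTransient = errors.New("transient failure")

// SagaState is what a hypothetical orchestrator would persist per order.
type SagaState struct {
	OrderID  string
	NextStep int    // index of the next step to run
	Status   string // "running", "failed", or "completed"
}

// StepFunc is one saga step, identified by its position in the step list.
type StepFunc func(orderID string) error

// Advance runs the remaining steps and records progress after each one, so a
// restarted orchestrator can call Advance again and skip finished steps.
func Advance(st *SagaState, steps []StepFunc) error {
	st.Status = "running"
	for st.NextStep < len(steps) {
		if err := steps[st.NextStep](st.OrderID); err != nil {
			st.Status = "failed"
			return err
		}
		st.NextStep++ // a real orchestrator would persist this durably
	}
	st.Status = "completed"
	return nil
}

func main() {
	attempts := 0
	steps := []StepFunc{
		func(id string) error { fmt.Println("reserve inventory for", id); return nil },
		func(id string) error {
			attempts++
			if attempts == 1 {
				return ErrTransient
			}
			fmt.Println("process payment for", id)
			return nil
		},
	}
	st := &SagaState{OrderID: "ord_123"}
	fmt.Println(Advance(st, steps), st.Status, st.NextStep)
	fmt.Println(Advance(st, steps), st.Status, st.NextStep)
}
```

Because the state records the last completed step, inspecting a stuck saga reduces to reading one record.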

Compensating Transactions

Compensating transactions must be idempotent and semantically reverse the original action. They are NOT simple rollbacks — they are new forward transactions that undo the business effect:

Original Action	Compensating Transaction
Reserve inventory (lock N units)	Release reservation (unlock N units)
Debit customer $100	Refund $100 to customer
Create shipment record	Cancel shipment, mark as cancelled
Send confirmation email	Send cancellation email (can’t unsend)

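Note: some actions (email sent, notification pushed) cannot be truly undone. Place such non-compensatable steps at or after the saga's pivot transaction, the go/no-go step after which the saga must run to completion, so that they execute only once every step that can still fail has succeeded.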
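
As an illustration of an idempotent compensating transaction, the sketch below models the first table row (reserve and release inventory) in memory. The `Inventory` type and its methods are hypothetical; a real service would keep reservations in its own database.

```go
package main

import "fmt"

// Inventory is an in-memory stand-in for the inventory service's local state.
type Inventory struct {
	Available    int
	reservations map[string]int // reservation ID -> units held
}

func NewInventory(units int) *Inventory {
	return &Inventory{Available: units, reservations: map[string]int{}}
}

// Reserve holds n units under the given ID. Retrying with the same ID is a no-op.
func (inv *Inventory) Reserve(id string, n int) error {
	if _, ok := inv.reservations[id]; ok {
		return nil // already reserved by an earlier attempt
	}
	if n > inv.Available {
		return fmt.Errorf("insufficient stock: want %d, have %d", n, inv.Available)
	}
	inv.Available -= n
	inv.reservations[id] = n
	return nil
}

// Release is the compensating transaction. Calling it twice, or for an
// unknown reservation, changes nothing, so it is safe to retry.
func (inv *Inventory) Release(id string) {
	n, ok := inv.reservations[id]
	if !ok {
		return
	}
	inv.Available += n
	delete(inv.reservations, id)
}

func main() {
	inv := NewInventory(10)
	_ = inv.Reserve("ord_123:reserve", 3)
	inv.Release("ord_123:reserve")
	inv.Release("ord_123:reserve") // duplicate delivery of the compensation
	fmt.Println(inv.Available)
}
```

Keying the reservation by an ID, rather than just adding N units back, is what makes the duplicate release harmless.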

Temporal: Durable Execution for Sagas

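// Temporal workflow in Go (sketch). Order and the activities (ReserveInventory,
// ProcessPayment, CreateShipment, RefundPayment, ReleaseInventory) are assumed
// to be defined and registered elsewhere.
package orders

import (
    "time"

    "go.temporal.io/sdk/workflow"
)

// compensate runs a compensating activity and logs a failure instead of dropping it.
func compensate(ctx workflow.Context, activity, arg interface{}) {
    if err := workflow.ExecuteActivity(ctx, activity, arg).Get(ctx, nil); err != nil {
        workflow.GetLogger(ctx).Error("compensation failed", "error", err)
    }
}

func OrderWorkflow(ctx workflow.Context, order Order) error {
    // Activities need a timeout; 30s is an illustrative value.
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
    })

    // Step 1: Reserve inventory
    var reservationID string
    err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &reservationID)
    if err != nil {
        return err // nothing to compensate yet
    }

    // Step 2: Process payment
    err = workflow.ExecuteActivity(ctx, ProcessPayment, order).Get(ctx, nil)
    if err != nil {
        compensate(ctx, ReleaseInventory, reservationID) // undo step 1
        return err
    }

    // Step 3: Create shipment
    err = workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, nil)
    if err != nil {
        // Undo in reverse order: refund the payment, then release the reservation
        compensate(ctx, RefundPayment, order)
        compensate(ctx, ReleaseInventory, reservationID)
        return err
    }

    return nil
}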

// Temporal guarantees:
// - Workflow event history is persisted to the configured store (e.g., Cassandra, MySQL, PostgreSQL) as activities complete
// - Automatic activity retry with exponential backoff for transient failures
// - Worker crash-safe: another worker resumes the workflow by replaying its event history

Outbox Pattern: Reliable Event Publishing

A saga step must atomically update local state AND publish an event. Without coordination:

// WRONG: two separate operations — can fail between them
db.UpdateOrder(status="reserved")
kafka.Publish(OrderReserved{...})  // crash here = event lost

// CORRECT: Outbox Pattern
BEGIN TRANSACTION
  UPDATE orders SET status = 'reserved' WHERE id = ?
  INSERT INTO outbox (event_type, payload) VALUES ('OrderReserved', ?)
COMMIT

// Separate outbox relay process (CDC or polling):
// Reads outbox table → publishes to Kafka → deletes processed rows
// Guarantees at-least-once delivery; consumers must be idempotent
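
The relay half of the pattern, in its polling variant, might look like the sketch below. The `Outbox` type stands in for the outbox table and `Publisher` for a broker client; both, along with `RelayOnce` and `recordingPublisher`, are illustrative names and not a real Kafka or database API.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBrokerDown is a sample publish failure used by recordingPublisher.
var ErrBrokerDown = errors.New("broker unavailable")

// OutboxRow mirrors one row of the outbox table.
type OutboxRow struct {
	ID        int
	EventType string
	Payload   string
}

// Outbox stands in for the outbox table; a real service would use SQL here.
type Outbox struct {
	rows   []OutboxRow
	nextID int
}

func (o *Outbox) Add(eventType, payload string) {
	o.nextID++
	o.rows = append(o.rows, OutboxRow{ID: o.nextID, EventType: eventType, Payload: payload})
}

// Pending returns a copy of the unprocessed rows in insertion order.
func (o *Outbox) Pending() []OutboxRow { return append([]OutboxRow(nil), o.rows...) }

func (o *Outbox) Delete(id int) {
	for i, r := range o.rows {
		if r.ID == id {
			o.rows = append(o.rows[:i], o.rows[i+1:]...)
			return
		}
	}
}

// Publisher is a hypothetical broker client.
type Publisher interface {
	Publish(eventType, payload string) error
}

// recordingPublisher records what it sent and can simulate an outage.
type recordingPublisher struct {
	Sent      []string
	FailAfter int // fail once len(Sent) reaches this value; negative = never fail
}

func (p *recordingPublisher) Publish(eventType, payload string) error {
	if p.FailAfter >= 0 && len(p.Sent) >= p.FailAfter {
		return ErrBrokerDown
	}
	p.Sent = append(p.Sent, eventType)
	return nil
}

// RelayOnce publishes pending rows in order and deletes each row only after
// the broker accepted it. A crash between Publish and Delete re-sends the
// row on the next pass, which is why delivery is at-least-once.
func RelayOnce(o *Outbox, p Publisher) (int, error) {
	sent := 0
	for _, row := range o.Pending() {
		if err := p.Publish(row.EventType, row.Payload); err != nil {
			return sent, fmt.Errorf("publish row %d: %w", row.ID, err)
		}
		o.Delete(row.ID)
		sent++
	}
	return sent, nil
}

func main() {
	o := &Outbox{}
	o.Add("OrderReserved", `{"orderId":"ord_123"}`)
	p := &recordingPublisher{FailAfter: -1}
	n, err := RelayOnce(o, p)
	fmt.Println(n, err, p.Sent)
}
```

Stopping at the first failed publish keeps events in order; the unsent rows simply wait for the next polling pass.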

Idempotency: The Critical Requirement

Every Saga step must be idempotent — safe to retry:

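// Use idempotency keys. paymentGateway, alreadyProcessed, and markProcessed are
// placeholders for the payment provider client and a durable processed-keys store.
func ProcessPayment(ctx context.Context, req PaymentRequest) error {
    // Check if already processed
    done, err := alreadyProcessed(ctx, req.IdempotencyKey)
    if err != nil {
        return err
    }
    if done {
        return nil // success: don't double-charge
    }
    // Pass the key to the provider as well, so a retry after a crash between
    // the charge and markProcessed is deduplicated on the provider side.
    chargeID, err := paymentGateway.Charge(ctx, req.Amount, req.IdempotencyKey)
    if err != nil {
        return err // let the saga retry or compensate
    }
    return markProcessed(ctx, req.IdempotencyKey, chargeID)
}

// Idempotency key = orderId + stepName (e.g., "ord_123:payment")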

Saga vs 2PC Decision Framework

Criterion	Use 2PC	Use Saga
Consistency	Strong ACID required	Eventual consistency acceptable
Availability	Can tolerate blocking	High availability required
Duration	Milliseconds (same DB cluster)	Seconds to minutes (cross-service)
Infrastructure	XA-capable databases or brokers under one transaction manager	Microservices with message bus
Examples	Transfers between accounts held in two XA-capable databases	Order fulfillment, booking flows

Interview Discussion Points

Why not XA transactions? XA (2PC across databases) is supported by most RDBMSs, but it holds locks on participants until the commit phase finishes, is unsupported by most NoSQL stores and many managed cloud databases, and recovers poorly from failures: if the coordinator crashes, in-doubt transactions may have to be resolved manually.
Choreography vs Orchestration: Choose choreography for simple 2-3 step flows with clear ownership boundaries. Choose orchestration (Temporal, AWS Step Functions) for complex flows with many steps, conditional branches, timeouts, and compensation logic.
How to debug a failed saga? Distributed tracing (correlation ID through all events), saga state table showing last completed step, and dead letter queues for failed compensation events.
What about read-your-writes consistency? After a saga completes, users expect to see the updated state. Issue a read-your-writes token (for example, the saga completion timestamp) and serve a read only from a replica that has caught up to that token; otherwise route the read to the primary until a replica catches up or the token expires.
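
The read-your-writes routing rule from the last point can be sketched in a few lines. `Replica`, `AppliedAt`, and `PickReader` are hypothetical names, and the sketch assumes each replica can report the timestamp up to which it has applied the primary's writes.

```go
package main

import "fmt"

// Replica is a stand-in for a read replica that reports its replication progress.
type Replica struct {
	Name      string
	AppliedAt int64 // all primary writes up to this timestamp (ms) are applied
}

// PickReader returns the first replica that has caught up to the token,
// or "primary" if none has.
func PickReader(token int64, replicas []Replica) string {
	for _, r := range replicas {
		if r.AppliedAt >= token {
			return r.Name
		}
	}
	return "primary"
}

func main() {
	replicas := []Replica{{"replica-1", 100}, {"replica-2", 250}}
	fmt.Println(PickReader(200, replicas)) // token = saga completion timestamp
	fmt.Println(PickReader(300, replicas))
}
```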
