Maintaining data consistency across multiple microservices is one of the hardest problems in distributed systems. Two-phase commit (2PC) gives atomic commit across participants, but it blocks whenever the coordinator or any participant is unavailable. The Saga pattern instead achieves eventual consistency through a sequence of local transactions with compensating actions, which keeps workflows resilient at scale.
The Problem: Distributed Transactions
Consider an e-commerce order: debit customer balance → reserve inventory → create shipment → send confirmation. Each step touches a different service with its own database. Traditional 2PC requires all services to participate in a distributed transaction — but what happens when the shipment service is down?
Two-Phase Commit (2PC):
Coordinator → Participants: "PREPARE" (lock resources)
All reply "READY" → Coordinator → "COMMIT"
Any reply "ABORT" → Coordinator → "ROLLBACK"
Problems:
- Blocking: participants hold locks until coordinator responds
- Coordinator failure: system stuck in uncertain state
- Latency: multiple round trips across services
- Availability: any participant failure blocks entire transaction
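The protocol above can be sketched in a few lines of Go. This is a minimal in-memory illustration, not a production coordinator: the `Participant` interface, `memParticipant`, and `TwoPhaseCommit` are names invented for this example, and durable logging, timeouts, and coordinator recovery are all left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical resource manager taking part in 2PC.
type Participant interface {
	Prepare() error // vote: nil means READY, an error means ABORT
	Commit()
	Rollback()
}

// memParticipant is an in-memory stand-in for a service with its own database.
type memParticipant struct {
	name      string
	available bool
	state     string // "init", "prepared", "committed", or "rolledback"
}

func (p *memParticipant) Prepare() error {
	if !p.available {
		return errors.New(p.name + ": unavailable")
	}
	p.state = "prepared" // a real participant holds its locks from here on
	return nil
}

func (p *memParticipant) Commit()   { p.state = "committed" }
func (p *memParticipant) Rollback() { p.state = "rolledback" }

// TwoPhaseCommit sends PREPARE to every participant and commits only if all
// vote READY; otherwise it rolls back the participants that had prepared.
func TwoPhaseCommit(ps []Participant) error {
	var prepared []Participant
	for _, p := range ps {
		if err := p.Prepare(); err != nil {
			for _, q := range prepared {
				q.Rollback()
			}
			return fmt.Errorf("aborted: %w", err)
		}
		prepared = append(prepared, p)
	}
	for _, p := range ps {
		p.Commit()
	}
	return nil
}

func main() {
	payment := &memParticipant{name: "payment", available: true, state: "init"}
	shipment := &memParticipant{name: "shipment", available: false, state: "init"}
	err := TwoPhaseCommit([]Participant{payment, shipment})
	fmt.Println(err, payment.state, shipment.state)
	// prints: aborted: shipment: unavailable rolledback init
}
```

One unavailable participant aborts the whole transaction, which is the availability problem listed above.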
The Saga Pattern
A Saga is a sequence of local transactions T1, T2, …, TN, each committed independently by one service. If step N fails, the compensating transactions for the steps already completed are executed in reverse order (T(N-1)_comp, …, T2_comp, T1_comp) to undo them. Each local transaction publishes an event or sends a command that triggers the next step.
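The definition translates almost directly into code. The following is a minimal sketch, assuming each step is a plain function that runs in-process; `Step`, `RunSaga`, and `ErrShipmentDown` are illustrative names, and a real saga would persist its progress and retry failed compensations.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrShipmentDown is a made-up failure used to trigger compensation.
var ErrShipmentDown = errors.New("shipment service down")

// Step pairs a local transaction with the compensation that undoes it.
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// RunSaga executes steps in order. If a step fails, it runs the compensations
// of the steps that already completed, in reverse order.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Action(); err != nil {
			for j := i - 1; j >= 0; j-- {
				if cerr := steps[j].Compensate(); cerr != nil {
					// A real system would retry or park this for an operator.
					return fmt.Errorf("step %s failed (%v); compensating %s failed: %w",
						s.Name, err, steps[j].Name, cerr)
				}
			}
			return fmt.Errorf("step %s failed: %w", s.Name, err)
		}
	}
	return nil
}

func main() {
	var log []string
	record := func(msg string) func() error {
		return func() error { log = append(log, msg); return nil }
	}
	steps := []Step{
		{Name: "debit", Action: record("debit"), Compensate: record("refund")},
		{Name: "reserve", Action: record("reserve"), Compensate: record("release")},
		{Name: "ship", Action: func() error { return ErrShipmentDown }, Compensate: record("cancel shipment")},
	}
	fmt.Println(RunSaga(steps)) // prints: step ship failed: shipment service down
	fmt.Println(log)            // prints: [debit reserve release refund]
}
```

The failed step itself is not compensated, because its local transaction never committed.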
Choreography-Based Saga
OrderService ── OrderCreated ──► InventoryService
InventoryService ── InventoryReserved ──► PaymentService
PaymentService ── PaymentProcessed ──► ShippingService
ShippingService ── OrderShipped ──► OrderService
If payment fails:
PaymentFailed ──► InventoryService.ReleaseInventory
──► OrderService.CancelOrder
Pros: Fully decoupled; no orchestrator single point of failure.
Cons: Hard to track overall workflow state; compensations are distributed; difficult to add new steps.
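A small in-memory sketch makes the trade-off concrete. It assumes a synchronous `Bus` in place of a real broker. `Bus`, `Wire`, and the `InventoryReleased` and `OrderCancelled` event names are invented for illustration, while the other event names follow the diagram above.

```go
package main

import "fmt"

// Event is a minimal domain event.
type Event struct {
	Type    string
	OrderID string
}

// Bus is a synchronous in-memory stand-in for a message broker.
type Bus struct {
	handlers map[string][]func(Event)
	Log      []string // every published event type, in order
}

func NewBus() *Bus { return &Bus{handlers: map[string][]func(Event){}} }

func (b *Bus) Subscribe(eventType string, h func(Event)) {
	b.handlers[eventType] = append(b.handlers[eventType], h)
}

func (b *Bus) Publish(e Event) {
	b.Log = append(b.Log, e.Type)
	for _, h := range b.handlers[e.Type] {
		h(e)
	}
}

// Wire registers each service's reactions. No component knows the whole flow.
func Wire(b *Bus, paymentOK bool) {
	// InventoryService
	b.Subscribe("OrderCreated", func(e Event) { b.Publish(Event{"InventoryReserved", e.OrderID}) })
	b.Subscribe("PaymentFailed", func(e Event) { b.Publish(Event{"InventoryReleased", e.OrderID}) })
	// PaymentService
	b.Subscribe("InventoryReserved", func(e Event) {
		if paymentOK {
			b.Publish(Event{"PaymentProcessed", e.OrderID})
		} else {
			b.Publish(Event{"PaymentFailed", e.OrderID})
		}
	})
	// ShippingService
	b.Subscribe("PaymentProcessed", func(e Event) { b.Publish(Event{"OrderShipped", e.OrderID}) })
	// OrderService
	b.Subscribe("PaymentFailed", func(e Event) { b.Publish(Event{"OrderCancelled", e.OrderID}) })
}

func main() {
	ok := NewBus()
	Wire(ok, true)
	ok.Publish(Event{"OrderCreated", "ord_123"})
	fmt.Println(ok.Log)
	// prints: [OrderCreated InventoryReserved PaymentProcessed OrderShipped]

	failed := NewBus()
	Wire(failed, false)
	failed.Publish(Event{"OrderCreated", "ord_124"})
	fmt.Println(failed.Log)
	// prints: [OrderCreated InventoryReserved PaymentFailed InventoryReleased OrderCancelled]
}
```

The overall workflow exists only as the sum of the subscriptions in `Wire`, which is why tracking state and adding steps gets harder as the flow grows.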
Orchestration-Based Saga
OrderSaga Orchestrator (Temporal/AWS Step Functions):
1. ReserveInventory(orderId) → success → continue
2. ProcessPayment(orderId) → failure → trigger compensation:
2a. ReleaseInventory(orderId)
2b. MarkOrderFailed(orderId)
3. CreateShipment(orderId) → success → continue
4. SendConfirmation(orderId)
State machine stored in orchestrator — easy to inspect, retry, debug.
Pros: Centralized workflow logic; easy to monitor and debug; clear compensation logic.
Cons: Orchestrator is a coupling point (though not a data path); requires orchestration infrastructure.
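What "state machine stored in orchestrator" buys you can be sketched as follows. This is an illustration under simplifying assumptions, not how Temporal or Step Functions store state: `SagaState`, `Resume`, and the `save` callback (standing in for a database write) are hypothetical, and compensation is omitted to keep the focus on inspection and resumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

var errUnavailable = errors.New("payment service unavailable")

// SagaState is the record an orchestrator would persist after every step.
type SagaState struct {
	OrderID   string   `json:"order_id"`
	Completed []string `json:"completed"` // step names, in execution order
	Status    string   `json:"status"`    // "running" or "completed"
}

type step struct {
	name string
	run  func(orderID string) error
}

// Resume runs every step not yet listed in st.Completed and calls save after
// each state change, so a restarted orchestrator can pick up where it stopped.
func Resume(st *SagaState, steps []step, save func(SagaState)) error {
	done := map[string]bool{}
	for _, name := range st.Completed {
		done[name] = true
	}
	for _, s := range steps {
		if done[s.name] {
			continue // finished before the restart; do not run it again
		}
		if err := s.run(st.OrderID); err != nil {
			return err
		}
		st.Completed = append(st.Completed, s.name)
		save(*st)
	}
	st.Status = "completed"
	save(*st)
	return nil
}

func main() {
	calls := map[string]int{}
	paymentUp := false
	steps := []step{
		{"ReserveInventory", func(string) error { calls["ReserveInventory"]++; return nil }},
		{"ProcessPayment", func(string) error {
			if !paymentUp {
				return errUnavailable
			}
			return nil
		}},
		{"CreateShipment", func(string) error { return nil }},
	}
	var stored []byte
	save := func(s SagaState) { stored, _ = json.Marshal(s) }

	st := &SagaState{OrderID: "ord_123", Status: "running"}
	fmt.Println(Resume(st, steps, save)) // payment is down, so the saga stops
	fmt.Println(string(stored))          // the stored record shows where it stopped

	// A new orchestrator instance loads the stored record and continues.
	var loaded SagaState
	_ = json.Unmarshal(stored, &loaded)
	paymentUp = true
	fmt.Println(Resume(&loaded, steps, save), loaded.Status, calls["ReserveInventory"])
}
```

After the second call, `ReserveInventory` has still run only once, because the stored record already listed it as completed.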
Compensating Transactions
Compensating transactions must be idempotent and semantically reverse the original action. They are NOT simple rollbacks — they are new forward transactions that undo the business effect:
| Original Action | Compensating Transaction |
|---|---|
| Reserve inventory (lock N units) | Release reservation (unlock N units) |
| Debit customer $100 | Refund $100 to customer |
| Create shipment record | Cancel shipment, mark as cancelled |
| Send confirmation email | Send cancellation email (can’t unsend) |
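Note: some actions (email sent, notification pushed) cannot be undone, only followed by a corrective action. Place them at or after the saga's pivot transaction: the go/no-go step after which the saga no longer rolls back and every remaining step is retried until it succeeds. In practice that means running non-compensatable steps last, after the durable commitments.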
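The first table row can be sketched as an idempotent pair of operations. This is a minimal in-memory example; the `Inventory` type and its methods are hypothetical, and a real service would do the same checks inside a database transaction.

```go
package main

import "fmt"

// Inventory is an in-memory stand-in for the inventory service's database.
type Inventory struct {
	Available    int
	reservations map[string]int // reservation ID -> units held
}

func NewInventory(units int) *Inventory {
	return &Inventory{Available: units, reservations: map[string]int{}}
}

// Reserve holds n units under id. Repeating the call with the same id is a no-op.
func (inv *Inventory) Reserve(id string, n int) error {
	if _, ok := inv.reservations[id]; ok {
		return nil
	}
	if n > inv.Available {
		return fmt.Errorf("insufficient stock: want %d, have %d", n, inv.Available)
	}
	inv.Available -= n
	inv.reservations[id] = n
	return nil
}

// Release is the compensating transaction for Reserve. It is a new forward
// operation on current state, and releasing an unknown or already released
// id succeeds without changing anything, so retries are safe.
func (inv *Inventory) Release(id string) error {
	n, ok := inv.reservations[id]
	if !ok {
		return nil
	}
	delete(inv.reservations, id)
	inv.Available += n
	return nil
}

func main() {
	inv := NewInventory(10)
	_ = inv.Reserve("ord_123", 3)
	_ = inv.Release("ord_123")
	_ = inv.Release("ord_123") // retried compensation: no double release
	fmt.Println(inv.Available) // prints: 10
}
```

Keying the reservation by an ID is what makes the second `Release` harmless; a compensation written as "add 3 units back" would over-credit stock on retry.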
Temporal: Durable Execution for Sagas
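// Temporal workflow in Go (Order and the activity functions are defined elsewhere)
import (
	"time"

	"go.temporal.io/sdk/workflow"
)

func OrderWorkflow(ctx workflow.Context, order Order) error {
	// Every activity needs a timeout; ExecuteActivity fails without one.
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second, // illustrative value
	})
	// compensate runs one compensating activity and logs a failure instead of hiding it.
	compensate := func(activity interface{}, arg interface{}) {
		if cerr := workflow.ExecuteActivity(ctx, activity, arg).Get(ctx, nil); cerr != nil {
			workflow.GetLogger(ctx).Error("compensation failed", "error", cerr)
		}
	}

	// Step 1: Reserve inventory
	var reservationID string
	err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &reservationID)
	if err != nil {
		return err // nothing to compensate yet
	}

	// Step 2: Process payment; on failure, release the reservation
	err = workflow.ExecuteActivity(ctx, ProcessPayment, order).Get(ctx, nil)
	if err != nil {
		compensate(ReleaseInventory, reservationID)
		return err
	}

	// Step 3: Create shipment; on failure, undo steps 2 and 1 in reverse order
	err = workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, nil)
	if err != nil {
		compensate(RefundPayment, order)
		compensate(ReleaseInventory, reservationID)
		return err
	}
	return nil
}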
// Temporal guarantees:
// - Workflow event history is persisted (Cassandra, MySQL, or PostgreSQL) as each activity is scheduled and completes
// - Activities are retried automatically with exponential backoff under the default retry policy
// - Worker crash-safe: another worker replays the history and resumes where the workflow left off
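"Retry with exponential backoff" is easy to picture in plain Go. The sketch below is a generic illustration, not Temporal's implementation or API: `RetryPolicy`, `Delay`, and `Retry` are names made up for this example, and the interval values are chosen only to make the schedule easy to read.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTransient = errors.New("transient failure")

// RetryPolicy holds the usual backoff knobs; the field names are illustrative.
type RetryPolicy struct {
	InitialInterval    time.Duration
	BackoffCoefficient float64
	MaximumInterval    time.Duration
	MaximumAttempts    int // values below 1 are treated as a single attempt
}

// Delay returns the wait before retry number `retry` (1-based):
// initial * coefficient^(retry-1), capped at MaximumInterval.
func (p RetryPolicy) Delay(retry int) time.Duration {
	d := float64(p.InitialInterval)
	limit := float64(p.MaximumInterval)
	for i := 1; i < retry && d < limit; i++ {
		d *= p.BackoffCoefficient
	}
	if d > limit {
		d = limit
	}
	return time.Duration(d)
}

// Retry calls op until it succeeds or attempts run out. sleep is injected so
// the example can run without actually waiting.
func Retry(p RetryPolicy, sleep func(time.Duration), op func() error) error {
	for attempt := 1; ; attempt++ {
		err := op()
		if err == nil {
			return nil
		}
		if attempt >= p.MaximumAttempts {
			return err
		}
		sleep(p.Delay(attempt))
	}
}

// noSleep skips waiting; useful in tests.
func noSleep(time.Duration) {}

func main() {
	p := RetryPolicy{
		InitialInterval:    time.Second,
		BackoffCoefficient: 2.0,
		MaximumInterval:    100 * time.Second,
		MaximumAttempts:    5,
	}
	var waits []time.Duration
	failures := 3
	err := Retry(p, func(d time.Duration) { waits = append(waits, d) }, func() error {
		if failures > 0 {
			failures--
			return errTransient
		}
		return nil
	})
	fmt.Println(err, waits) // prints: <nil> [1s 2s 4s]
}
```

Retries like this are only safe when the retried operation is idempotent, which is why the activities in a saga need idempotency keys.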
Outbox Pattern: Reliable Event Publishing
A saga step must atomically update local state AND publish an event. Without coordination:
// WRONG: two separate operations; a crash between them loses the event
db.UpdateOrder(orderID, "reserved")
kafka.Publish(OrderReserved{...}) // never runs if the process dies after the update
// CORRECT: Outbox Pattern (state change and event row commit in one local transaction)
BEGIN TRANSACTION;
UPDATE orders SET status = 'reserved' WHERE id = ?;
INSERT INTO outbox (event_type, payload) VALUES ('OrderReserved', ?);
COMMIT;
// Separate outbox relay process (CDC or polling):
// Reads outbox table → publishes to Kafka → deletes (or marks) processed rows
// Guarantees at-least-once delivery; consumers must be idempotent
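The relay's at-least-once behaviour, and why consumers must deduplicate, can be shown with an in-memory sketch. Everything here is illustrative: `Store` stands in for one service's database (its mutex plays the role of the local transaction), and `Relay`, `Consumer`, and `errAckLost` are hypothetical names. Holding the lock while publishing is a simplification a real relay would avoid.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAckLost = errors.New("ack lost")

type OutboxRow struct {
	ID        int
	EventType string
	Payload   string
}

// Store holds orders and the outbox together, as one database would.
type Store struct {
	mu     sync.Mutex
	orders map[string]string // order ID -> status
	outbox []OutboxRow
	nextID int
}

func NewStore() *Store { return &Store{orders: map[string]string{}} }

// ReserveOrder updates state and enqueues the event in one critical section.
func (s *Store) ReserveOrder(orderID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.orders[orderID] = "reserved"
	s.nextID++
	s.outbox = append(s.outbox, OutboxRow{s.nextID, "OrderReserved", orderID})
}

// Relay publishes pending rows in order and deletes a row only after its
// publish succeeded. A failure leaves the row in place, so it is sent again.
func (s *Store) Relay(publish func(OutboxRow) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for len(s.outbox) > 0 {
		if err := publish(s.outbox[0]); err != nil {
			return err
		}
		s.outbox = s.outbox[1:]
	}
	return nil
}

func (s *Store) Pending() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.outbox)
}

// Consumer deduplicates by row ID so redelivery is harmless.
type Consumer struct {
	seen    map[int]bool
	Handled int
}

func NewConsumer() *Consumer { return &Consumer{seen: map[int]bool{}} }

func (c *Consumer) Handle(r OutboxRow) {
	if c.seen[r.ID] {
		return
	}
	c.seen[r.ID] = true
	c.Handled++
}

func main() {
	s, c := NewStore(), NewConsumer()
	s.ReserveOrder("ord_123")

	ackLost := true
	publish := func(r OutboxRow) error {
		c.Handle(r) // the broker delivered the message...
		if ackLost {
			ackLost = false
			return errAckLost // ...but the relay never saw the acknowledgement
		}
		return nil
	}
	fmt.Println(s.Relay(publish))       // prints: ack lost
	fmt.Println(s.Relay(publish))       // prints: <nil>
	fmt.Println(c.Handled, s.Pending()) // prints: 1 0
}
```

The event is delivered twice but handled once, which is the division of labour the pattern relies on: the relay never loses an event, and the consumer absorbs duplicates.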
Idempotency: The Critical Requirement
Every Saga step must be idempotent — safe to retry:
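// Use idempotency keys. alreadyProcessed, markProcessed, and gateway are
// illustrative helpers, not a specific provider's API.
func ProcessPayment(ctx context.Context, req PaymentRequest) error {
	// Fast path: this key was already charged, so report success without charging again.
	if alreadyProcessed(req.IdempotencyKey) {
		return nil
	}
	// Pass the key to the provider as well (Stripe, for example, accepts one),
	// so a retry that races past the check above is deduplicated on its side.
	charge, err := gateway.Charge(ctx, req.Amount, req.IdempotencyKey)
	if err != nil {
		return err // not marked processed, so the saga can retry safely
	}
	return markProcessed(req.IdempotencyKey, charge.ID)
}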
// Idempotency key = orderId + stepName (e.g., "ord_123:payment")
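A check followed by a separate write leaves a window in which two concurrent retries both pass the check. One way to close it is to make "check and record" a single atomic operation. The sketch below assumes an in-memory `Ledger` whose mutex plays the role of a unique constraint on the key column; `Ledger`, `Do`, and the charge ID `ch_1` are invented for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyKey builds the key format described above: orderId + ":" + stepName.
func IdempotencyKey(orderID, step string) string { return orderID + ":" + step }

// Ledger remembers the result recorded for each key.
type Ledger struct {
	mu      sync.Mutex
	results map[string]string // key -> charge ID
}

func NewLedger() *Ledger { return &Ledger{results: map[string]string{}} }

// Do runs charge only if no result is recorded for key, and returns the
// recorded result on repeats. Failed charges are not recorded, so a retry
// runs the charge again.
func (l *Ledger) Do(key string, charge func() (string, error)) (string, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if id, ok := l.results[key]; ok {
		return id, nil
	}
	id, err := charge()
	if err != nil {
		return "", err
	}
	l.results[key] = id
	return id, nil
}

func main() {
	l := NewLedger()
	charges := 0
	charge := func() (string, error) { charges++; return "ch_1", nil }
	key := IdempotencyKey("ord_123", "payment")

	// Ten concurrent retries of the same step.
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = l.Do(key, charge)
		}()
	}
	wg.Wait()
	fmt.Println(key, charges) // prints: ord_123:payment 1
}
```

Holding one lock across the charge serializes all payments, which is fine for a sketch; a database-backed version would instead insert the key first and rely on the unique constraint to reject the duplicate.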
Saga vs 2PC Decision Framework
| Criterion | Use 2PC | Use Saga |
|---|---|---|
| Consistency | Strong ACID required | Eventual consistency acceptable |
| Availability | Can tolerate blocking | High availability required |
| Duration | Milliseconds (same DB cluster) | Seconds to minutes (cross-service) |
| Infrastructure | XA-capable databases or brokers under one transaction manager | Microservices with message bus |
| Examples | Transfers between two XA-capable databases in one data center | Order fulfillment, booking flows |
Interview Discussion Points
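- Why not XA transactions? XA (2PC across resource managers) is supported by most relational databases, but participants hold locks from prepare until the final commit or rollback, most NoSQL stores and many managed cloud services do not support it, and failure recovery is weak: if the coordinator crashes after prepare, in-doubt transactions can stay blocked until it recovers or an operator resolves them by hand.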
- Choreography vs Orchestration: Choose choreography for simple 2-3 step flows with clear ownership boundaries. Choose orchestration (Temporal, AWS Step Functions) for complex flows with many steps, conditional branches, timeouts, and compensation logic.
- How to debug a failed saga? Distributed tracing (correlation ID through all events), saga state table showing last completed step, and dead letter queues for failed compensation events.
- What about read-your-writes consistency? After a saga completes, users expect to see the updated state. Return a read-your-writes token (for example the saga's completion timestamp) to the client and serve its reads only from replicas that have caught up to that token; otherwise route those reads to the primary until the replicas catch up or the token expires.
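The read-your-writes routing in the last point can be sketched as a small function. This assumes the token and each replica's progress are comparable logical timestamps (a commit sequence number, say, rather than wall-clock time, which is subject to skew); `Replica` and `ChooseReader` are hypothetical names.

```go
package main

import "fmt"

// Replica reports how far it has applied the primary's log.
type Replica struct {
	Name      string
	AppliedAt int64 // logical timestamp of the last applied write
}

// ChooseReader returns the node that should serve a read for a client holding
// token (the saga's completion timestamp). Any replica that has applied at
// least up to token is safe; otherwise the read falls back to the primary.
func ChooseReader(token int64, replicas []Replica) string {
	for _, r := range replicas {
		if r.AppliedAt >= token {
			return r.Name
		}
	}
	return "primary"
}

func main() {
	replicas := []Replica{{"replica-a", 95}, {"replica-b", 102}}
	fmt.Println(ChooseReader(100, replicas)) // prints: replica-b
	fmt.Println(ChooseReader(110, replicas)) // prints: primary
}
```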