Maintaining data consistency across multiple microservices is one of the hardest distributed systems problems. Two-phase commit (2PC) gives atomic commits across participants, but it blocks whenever a participant or the coordinator fails, which hurts availability. The Saga pattern instead accepts eventual consistency, using local transactions plus compensating actions, which keeps workflows resilient at scale.
The Problem: Distributed Transactions
Consider an e-commerce order: debit customer balance → reserve inventory → create shipment → send confirmation. Each step touches a different service with its own database. Traditional 2PC requires all services to participate in a distributed transaction — but what happens when the shipment service is down?
Two-Phase Commit (2PC):
Coordinator → Participants: "PREPARE" (lock resources)
All reply "READY" → Coordinator → "COMMIT"
Any reply "ABORT" → Coordinator → "ROLLBACK"
Problems:
- Blocking: participants hold locks until coordinator responds
- Coordinator failure: system stuck in uncertain state
- Latency: multiple round trips across services
- Availability: any participant failure blocks entire transaction
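The message flow above can be sketched as a small in-memory coordinator. This is only an illustration under simplifying assumptions: `Participant`, `memParticipant`, and `RunTwoPhaseCommit` are hypothetical names, and the sketch has no timeouts, durable log, or coordinator recovery, which is where the problems listed above come from.

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical resource manager taking part in 2PC.
type Participant interface {
	Prepare() error // nil means READY (resources locked), an error means ABORT
	Commit()
	Rollback()
}

// memParticipant is an in-memory stand-in used only for this illustration.
type memParticipant struct {
	name     string
	failVote bool
	state    string // "", "prepared", "committed" or "rolled-back"
}

func (p *memParticipant) Prepare() error {
	if p.failVote {
		return errors.New(p.name + ": cannot prepare")
	}
	p.state = "prepared"
	return nil
}

func (p *memParticipant) Commit()   { p.state = "committed" }
func (p *memParticipant) Rollback() { p.state = "rolled-back" }

// RunTwoPhaseCommit reports whether the transaction committed.
func RunTwoPhaseCommit(ps []Participant) bool {
	// Phase 1: ask every participant to prepare.
	for i, p := range ps {
		if err := p.Prepare(); err != nil {
			// Any ABORT vote rolls back everyone that already prepared.
			for _, q := range ps[:i] {
				q.Rollback()
			}
			return false
		}
	}
	// Phase 2: everyone voted READY, so tell them all to commit.
	for _, p := range ps {
		p.Commit()
	}
	return true
}

func main() {
	payments := &memParticipant{name: "payments"}
	shipping := &memParticipant{name: "shipping", failVote: true}
	committed := RunTwoPhaseCommit([]Participant{payments, shipping})
	fmt.Println(committed, payments.state) // false rolled-back
}
```

Between `Prepare` and the final decision each participant holds its locks, so a slow or crashed coordinator stalls everyone in the sketch just as it would in a real deployment.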
The Saga Pattern
A Saga is a sequence of local transactions T1, T2, …, each paired with a compensating transaction. If step N fails, the compensations for the steps that already completed run in reverse order (T(N-1)_comp, …, T2_comp, T1_comp) to undo them. Each local transaction publishes an event or sends a command that triggers the next step.
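A minimal in-process sketch of that rule follows. `Step`, `RunSaga`, and `demoSaga` are hypothetical names chosen for illustration; a real saga would persist its progress and run each step in a different service.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs a local transaction with its compensating transaction.
type Step struct {
	Name       string
	Action     func() error
	Compensate func() error
}

// RunSaga executes steps in order. If a step fails, it runs the compensations
// of the steps that already completed, in reverse order, and returns the error.
func RunSaga(steps []Step) error {
	for i, s := range steps {
		if err := s.Action(); err != nil {
			for j := i - 1; j >= 0; j-- {
				if cerr := steps[j].Compensate(); cerr != nil {
					// A real system would retry or park this for an operator.
					return fmt.Errorf("compensating %s: %w", steps[j].Name, cerr)
				}
			}
			return fmt.Errorf("step %s failed: %w", s.Name, err)
		}
	}
	return nil
}

// demoSaga builds a three-step saga whose last step fails and records
// every call in log so the execution order is visible.
func demoSaga(log *[]string) []Step {
	record := func(msg string, err error) func() error {
		return func() error {
			*log = append(*log, msg)
			return err
		}
	}
	return []Step{
		{"debit", record("debit", nil), record("refund", nil)},
		{"reserve", record("reserve", nil), record("release", nil)},
		{"ship", record("ship", errors.New("shipment service down")), record("cancel-shipment", nil)},
	}
}

func main() {
	var log []string
	err := RunSaga(demoSaga(&log))
	fmt.Println(log) // [debit reserve ship release refund]
	fmt.Println(err) // step ship failed: shipment service down
}
```

The failed step itself is not compensated, because its local transaction never committed; only the steps before it are undone.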
Choreography-Based Saga
OrderService            InventoryService        PaymentService          ShippingService
     │                       │                       │                       │
     ├── OrderCreated ──────►│                       │                       │
     │                       ├── InventoryReserved ─►│                       │
     │                       │                       ├── PaymentProcessed ──►│
     │                       │                       │                       │
     │◄── OrderShipped ──────────────────────────────────────────────────────┤
If payment fails:
  PaymentFailed ──► InventoryService.ReleaseInventory
                ──► OrderService.CancelOrder
Pros: Fully decoupled; no orchestrator single point of failure.
Cons: Hard to track overall workflow state; compensations are distributed; difficult to add new steps.
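The event flow in the diagram above can be imitated with an in-memory bus. This is a rough sketch, not a broker integration: `Bus`, `Wire`, and the `InventoryReleased` and `OrderCancelled` event names are assumptions made for the example, and delivery is synchronous to keep it short.

```go
package main

import "fmt"

// Event is a minimal domain event; Type names follow the diagram above.
type Event struct {
	Type    string
	OrderID string
}

// Bus is an in-memory, synchronous stand-in for a message broker.
type Bus struct {
	handlers map[string][]func(Event)
	Log      []string // every published event type, in order
}

func NewBus() *Bus { return &Bus{handlers: map[string][]func(Event){}} }

func (b *Bus) Subscribe(eventType string, h func(Event)) {
	b.handlers[eventType] = append(b.handlers[eventType], h)
}

func (b *Bus) Publish(e Event) {
	b.Log = append(b.Log, e.Type)
	for _, h := range b.handlers[e.Type] {
		h(e)
	}
}

// Wire registers each service's reactions. No service knows the whole flow.
func Wire(b *Bus, paymentOK bool) {
	// InventoryService: reserve on new orders, release when payment fails.
	b.Subscribe("OrderCreated", func(e Event) { b.Publish(Event{"InventoryReserved", e.OrderID}) })
	b.Subscribe("PaymentFailed", func(e Event) { b.Publish(Event{"InventoryReleased", e.OrderID}) })
	// PaymentService: charge once inventory is reserved.
	b.Subscribe("InventoryReserved", func(e Event) {
		if paymentOK {
			b.Publish(Event{"PaymentProcessed", e.OrderID})
		} else {
			b.Publish(Event{"PaymentFailed", e.OrderID})
		}
	})
	// ShippingService: ship once payment went through.
	b.Subscribe("PaymentProcessed", func(e Event) { b.Publish(Event{"OrderShipped", e.OrderID}) })
	// OrderService: cancel the order when payment fails.
	b.Subscribe("PaymentFailed", func(e Event) { b.Publish(Event{"OrderCancelled", e.OrderID}) })
}

func main() {
	b := NewBus()
	Wire(b, false) // simulate a declined payment
	b.Publish(Event{"OrderCreated", "ord_123"})
	fmt.Println(b.Log)
	// [OrderCreated InventoryReserved PaymentFailed InventoryReleased OrderCancelled]
}
```

Note that the only place the full workflow is visible is the log: the ordering lives implicitly in the subscriptions, which is the tracking difficulty listed under the cons.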
Orchestration-Based Saga
OrderSaga Orchestrator (Temporal/AWS Step Functions):
1. ReserveInventory(orderId) → success → continue
2. ProcessPayment(orderId) → failure → trigger compensation:
2a. ReleaseInventory(orderId)
2b. MarkOrderFailed(orderId)
3. CreateShipment(orderId) → success → continue
4. SendConfirmation(orderId)
State machine stored in orchestrator — easy to inspect, retry, debug.
Pros: Centralized workflow logic; easy to monitor and debug; clear compensation logic.
Cons: Orchestrator is a coupling point (though not a data path); requires orchestration infrastructure.
Compensating Transactions
Compensating transactions must be idempotent and semantically reverse the original action. They are NOT simple rollbacks — they are new forward transactions that undo the business effect:
| Original Action | Compensating Transaction |
|---|---|
| Reserve inventory (lock N units) | Release reservation (unlock N units) |
| Debit customer $100 | Refund $100 to customer |
| Create shipment record | Cancel shipment, mark as cancelled |
| Send confirmation email | Send cancellation email (can’t unsend) |
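Note: some actions (email sent, notification pushed) cannot be undone, only followed by a corrective action. Sagas are usually structured around a pivot transaction, the go/no-go step after which the saga no longer rolls back: steps before the pivot must be compensatable, and steps after it must be retriable until they succeed. Place actions that cannot be compensated at or after the pivot, that is, last or after durable commitments.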
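To make the idempotency requirement concrete, here is a small sketch of the inventory row from the table above. The `Inventory` type and its reservation IDs are hypothetical; the point is only that running the compensation twice, or for a reservation that never happened, leaves the stock unchanged.

```go
package main

import "fmt"

// Inventory tracks free stock and open reservations by reservation ID.
type Inventory struct {
	Free     int
	reserved map[string]int
}

func NewInventory(free int) *Inventory {
	return &Inventory{Free: free, reserved: map[string]int{}}
}

// Reserve locks n units under id. Repeating the call with the same id is a no-op.
func (inv *Inventory) Reserve(id string, n int) error {
	if _, ok := inv.reserved[id]; ok {
		return nil
	}
	if n > inv.Free {
		return fmt.Errorf("only %d units free, need %d", inv.Free, n)
	}
	inv.Free -= n
	inv.reserved[id] = n
	return nil
}

// Release is the compensating transaction. It is idempotent: releasing an
// unknown or already released reservation changes nothing.
func (inv *Inventory) Release(id string) {
	n, ok := inv.reserved[id]
	if !ok {
		return
	}
	inv.Free += n
	delete(inv.reserved, id)
}

func main() {
	inv := NewInventory(10)
	if err := inv.Reserve("ord_123", 3); err != nil {
		fmt.Println("reserve failed:", err)
		return
	}
	inv.Release("ord_123")
	inv.Release("ord_123") // a retried compensation must not add stock twice
	fmt.Println(inv.Free)  // 10
}
```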
Temporal: Durable Execution for Sagas
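// Temporal workflow in Go. Order and the activity functions are defined elsewhere.
import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

func OrderWorkflow(ctx workflow.Context, order Order) error {
	// Activities need a timeout. Bounding the retries lets a persistent failure
	// surface as an error so compensation can start (values are illustrative).
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy:         &temporal.RetryPolicy{MaximumAttempts: 5},
	})
	// Compensations use a disconnected context so they still run if the workflow is cancelled.
	compCtx, _ := workflow.NewDisconnectedContext(ctx)
	// Step 1: Reserve inventory
	var reservationID string
	err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &reservationID)
	if err != nil {
		return err // nothing to compensate yet
	}
	// Step 2: Process payment
	err = workflow.ExecuteActivity(ctx, ProcessPayment, order).Get(ctx, nil)
	if err != nil {
		// Compensate: release reservation
		compensate(compCtx, ReleaseInventory, reservationID)
		return err
	}
	// Step 3: Create shipment
	err = workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, nil)
	if err != nil {
		// Compensate in reverse order: refund the payment, then release the reservation
		compensate(compCtx, RefundPayment, order)
		compensate(compCtx, ReleaseInventory, reservationID)
		return err
	}
	return nil
}

// compensate runs one compensating activity and logs a failure instead of silently dropping it.
func compensate(ctx workflow.Context, activity interface{}, arg interface{}) {
	if err := workflow.ExecuteActivity(ctx, activity, arg).Get(ctx, nil); err != nil {
		workflow.GetLogger(ctx).Error("compensation failed", "error", err)
	}
}

// Temporal guarantees:
// - Workflow event history persisted to the backing store (e.g., Cassandra/PostgreSQL) as each activity completes
// - Automatic retry with exponential backoff for transient activity failures
// - Worker crash-safe: another worker replays the history and resumes the workflow where it left off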
Outbox Pattern: Reliable Event Publishing
A saga step must atomically update local state AND publish an event. Done as two independent writes, a crash in between leaves the database and the broker out of sync:
// WRONG: two separate operations that can fail independently
db.UpdateOrder(status="reserved")
// a crash here means the event is never published
kafka.Publish(OrderReserved{...})
// CORRECT: Outbox Pattern (state change and event row commit in one local transaction)
BEGIN TRANSACTION;
UPDATE orders SET status = 'reserved' WHERE id = ?;
INSERT INTO outbox (event_type, payload) VALUES ('OrderReserved', ?);
COMMIT;
// Separate outbox relay process (CDC or polling):
// Reads outbox table → publishes to Kafka → deletes processed rows
// Guarantees at-least-once delivery; consumers must be idempotent
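The relay and the idempotent consumer can be sketched in memory as follows. Everything here is an assumption made for illustration: `Store` stands in for the service database (its mutex plays the role of the local transaction), `RelayOnce` for the polling relay, and `ErrAckLost` simulates a publish whose acknowledgement never arrives.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrAckLost simulates a publish whose acknowledgement never reached the relay.
var ErrAckLost = errors.New("ack lost")

// OutboxRow is one pending event; ID doubles as the deduplication key.
type OutboxRow struct {
	ID      int
	Type    string
	Payload string
}

// Store is an in-memory stand-in for the service database.
type Store struct {
	mu     sync.Mutex
	status map[string]string
	outbox []OutboxRow
	nextID int
}

func NewStore() *Store { return &Store{status: map[string]string{}} }

// ReserveOrder updates the order and enqueues the event in one atomic step.
func (s *Store) ReserveOrder(orderID string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.status[orderID] = "reserved"
	s.nextID++
	s.outbox = append(s.outbox, OutboxRow{s.nextID, "OrderReserved", orderID})
}

// RelayOnce publishes pending rows in order and deletes each row only after
// its publish succeeded, so a failure leaves the row for the next attempt.
func (s *Store) RelayOnce(publish func(OutboxRow) error) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for len(s.outbox) > 0 {
		if err := publish(s.outbox[0]); err != nil {
			return err
		}
		s.outbox = s.outbox[1:]
	}
	return nil
}

// Consumer skips rows it has already handled (idempotent consumer).
type Consumer struct {
	seen    map[int]bool
	Handled []string
}

func NewConsumer() *Consumer { return &Consumer{seen: map[int]bool{}} }

func (c *Consumer) Handle(r OutboxRow) {
	if c.seen[r.ID] {
		return // duplicate delivery
	}
	c.seen[r.ID] = true
	c.Handled = append(c.Handled, r.Type+":"+r.Payload)
}

func main() {
	store, consumer := NewStore(), NewConsumer()
	store.ReserveOrder("ord_123")

	ackLost := true
	publish := func(r OutboxRow) error {
		consumer.Handle(r) // the broker delivered the message...
		if ackLost {
			ackLost = false
			return ErrAckLost // ...but the relay never saw the acknowledgement
		}
		return nil
	}
	fmt.Println(store.RelayOnce(publish)) // ack lost (row is kept)
	fmt.Println(store.RelayOnce(publish)) // <nil> (row redelivered, then deleted)
	fmt.Println(consumer.Handled)         // [OrderReserved:ord_123]
}
```

The second relay run delivers the same row again, which is the at-least-once behaviour described above; the consumer's `seen` set is what keeps the effect exactly-once.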
Idempotency: The Critical Requirement
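Every Saga step must be idempotent, that is, safe to retry:
// Use idempotency keys. alreadyProcessed, markProcessed and gateway are placeholders
// for your own persistence layer and payment client.
func ProcessPayment(ctx context.Context, req PaymentRequest) error {
	// Check if already processed
	done, err := alreadyProcessed(ctx, req.IdempotencyKey)
	if err != nil {
		return err
	}
	if done {
		return nil // success: don't double-charge
	}
	// Pass the same key to the payment provider (Stripe, for example, accepts
	// idempotency keys) so a retry after a crash between the charge and
	// markProcessed is deduplicated on the provider side.
	chargeID, err := gateway.Charge(ctx, req.Amount, req.IdempotencyKey)
	if err != nil {
		return err
	}
	return markProcessed(ctx, req.IdempotencyKey, chargeID)
}
// Idempotency key = orderId + stepName (e.g., "ord_123:payment")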
Saga vs 2PC Decision Framework
| Criterion | Use 2PC | Use Saga |
|---|---|---|
| Consistency | Strong ACID required | Eventual consistency acceptable |
| Availability | Can tolerate blocking | High availability required |
| Duration | Milliseconds (same DB cluster) | Seconds to minutes (cross-service) |
| Infrastructure | One DB cluster, or a few XA-capable databases behind a single transaction manager | Microservices with message bus |
| Examples | Bank transfers across shards of one DB cluster | Order fulfillment, booking flows |
Interview Discussion Points
- Why not XA transactions? XA (2PC across databases) is supported by most RDBMS, but participants hold their locks from prepare until the coordinator's decision arrives, most NoSQL stores and many managed cloud databases do not support it, and failure recovery is weak: resolving in-doubt transactions after a coordinator crash often requires manual intervention.
- Choreography vs Orchestration: Choose choreography for simple 2-3 step flows with clear ownership boundaries. Choose orchestration (Temporal, AWS Step Functions) for complex flows with many steps, conditional branches, timeouts, and compensation logic.
- How to debug a failed saga? Distributed tracing (correlation ID through all events), saga state table showing last completed step, and dead letter queues for failed compensation events.
- What about read-your-writes consistency? After a saga completes, users expect to see updated state. Issue a read-your-writes token (the saga completion timestamp) and serve reads only from replicas that have caught up to it, or route reads to the primary until the token expires.
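The read-your-writes idea from the last point can be sketched as a routing decision. This is a simplified illustration: `Replica`, `ChooseReadTarget`, and the use of Unix-millisecond timestamps as the token are assumptions, and a real system would more likely compare replication positions reported by the database.

```go
package main

import "fmt"

// Replica reports how far it has applied the primary's changes.
type Replica struct {
	Name          string
	AppliedUnixMs int64 // commit time (Unix ms) of the newest change applied here
}

// ChooseReadTarget returns the first replica that has caught up to the token
// (the saga completion time), or "primary" if none has.
func ChooseReadTarget(tokenUnixMs int64, replicas []Replica) string {
	for _, r := range replicas {
		if r.AppliedUnixMs >= tokenUnixMs {
			return r.Name
		}
	}
	return "primary"
}

func main() {
	const sagaDone int64 = 1_700_000_000_000 // token returned when the saga completed
	replicas := []Replica{
		{"replica-a", sagaDone - 2000}, // lagging by 2 s
		{"replica-b", sagaDone + 1000}, // caught up
	}
	fmt.Println(ChooseReadTarget(sagaDone, replicas))      // replica-b
	fmt.Println(ChooseReadTarget(sagaDone+5000, replicas)) // primary
}
```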