System Design Interview: Distributed Transactions, 2PC, and the Saga Pattern
When a business operation spans multiple microservices or databases, maintaining consistency is hard. ACID transactions do not cross service boundaries. This guide covers the two main patterns for distributed transactions — Two-Phase Commit (2PC) and the Saga Pattern — along with the trade-offs that make one or the other the right choice.
Why Distributed Transactions Are Hard
Consider an e-commerce order: deduct inventory, charge the credit card, create the order record. If these live in separate services, any step can fail after the previous ones succeeded. Without coordination:
- Inventory deducted, payment fails → stock stays held for an order that will never complete
- Payment charged, order creation crashes → money taken, no order created
We need atomicity across services: either all steps succeed or all are rolled back.
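The partial-failure problem above can be reproduced in a few lines. This is a minimal sketch, assuming three in-memory stand-ins for the services; `place_order_naively`, `PaymentDeclined`, and the `card_ok` flag are hypothetical names used only for illustration.

```python
class PaymentDeclined(Exception):
    """Raised by the hypothetical payment step when the card is rejected."""


def place_order_naively(inventory, payments, orders, sku, card_ok):
    # Step 1 takes effect in the "inventory service" immediately.
    inventory[sku] -= 1
    # Step 2 fails after step 1 has already taken effect.
    if not card_ok:
        raise PaymentDeclined(sku)
    payments.append(sku)
    # Step 3 is only reached when payment succeeded.
    orders.append(sku)


inventory, payments, orders = {"widget": 5}, [], []
try:
    place_order_naively(inventory, payments, orders, "widget", card_ok=False)
except PaymentDeclined:
    pass
# Stock was deducted even though no payment or order exists.
```

Nothing undoes the inventory change, which is exactly the gap that 2PC and Sagas address in different ways.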
Two-Phase Commit (2PC)
2PC uses a coordinator to achieve distributed atomicity:
Phase 1 — Prepare:
Coordinator → "Can you commit?" → Service A, Service B, Service C
Each service: acquire locks, write the pending change to its write-ahead log (WAL), respond "Yes" or "No"
Phase 2 — Commit/Abort:
If all Yes: Coordinator → "Commit!" → all services apply the changes, release locks
If any No: Coordinator → "Abort!" → all services roll back
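The two phases can be sketched as a small in-process simulation. This is an illustrative sketch, not a production protocol: `Participant`, its `can_commit` switch, and `two_phase_commit` are hypothetical names, and real participants would take locks, write durable log records, and talk over the network.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    can_commit: bool = True   # hypothetical switch used to simulate a "No" vote
    state: str = "idle"       # idle -> prepared -> committed | aborted

    def prepare(self) -> bool:
        # A real participant would acquire locks and flush a WAL record here.
        if self.can_commit:
            self.state = "prepared"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    # Phase 1: ask every participant to prepare and collect all votes.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the single global decision to everyone.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


happy = [Participant("A"), Participant("B"), Participant("C")]
committed = two_phase_commit(happy)

mixed = [Participant("A"), Participant("B", can_commit=False), Participant("C")]
aborted = two_phase_commit(mixed)
```

In the second run a single "No" vote from B is enough to abort all three participants, including the ones that had already prepared.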
Strengths: strong consistency, atomic commit across all participants.
Weaknesses:
- Blocking: if the coordinator crashes after Phase 1 but before Phase 2, participants that voted "Yes" cannot decide on their own and hold their locks until the coordinator recovers
- Availability: all participants must be available. One unavailable service blocks the entire transaction
- Latency: two round trips across all participants add significant overhead
- Not microservice-friendly: requires all services to implement the 2PC protocol and share a coordinator
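The blocking weakness is easiest to see from the participant's side. The sketch below is a simplified model with hypothetical names (`Participant`, `on_timeout`, `holds_locks`): before voting, a participant may abort on its own, but once it has voted "Yes" it has to wait for the coordinator's decision.

```python
class Participant:
    def __init__(self):
        self.state = "idle"        # idle -> prepared -> committed | aborted
        self.holds_locks = False

    def prepare(self):
        # Voting "Yes" is a promise to commit if told to, so locks stay held.
        self.state, self.holds_locks = "prepared", True
        return True

    def on_decision(self, decision):
        self.state = "committed" if decision == "commit" else "aborted"
        self.holds_locks = False

    def on_timeout(self):
        """Return True only if the timeout let this participant abort alone."""
        if self.state == "idle":
            # No vote cast yet, so aborting unilaterally is safe.
            self.state = "aborted"
            return True
        # Prepared: the coordinator may have decided either way, keep waiting.
        return False


p = Participant()
p.prepare()
resolved_alone = p.on_timeout()                    # coordinator is silent
state_while_waiting = (p.state, p.holds_locks)
p.on_decision("commit")                            # coordinator finally recovers
```

Until `on_decision` arrives, the prepared participant keeps its locks, which is why a crashed coordinator can stall unrelated work on the same rows.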
Use 2PC only when you control all participants and can tolerate reduced availability. It is used internally by distributed databases such as Spanner and CockroachDB (in some cases as a variant of the basic protocol) and by XA transactions in Java EE.
The Saga Pattern
A Saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that undoes it if something later fails.
Order Flow:
Step 1: Reserve inventory → compensate: release inventory
Step 2: Charge payment → compensate: issue refund
Step 3: Create order → compensate: cancel order
Success path: 1 → 2 → 3 ✓
Failure at Step 3: run the compensations for the completed steps in reverse order: refund (2) → release inventory (1)
Each step is a local ACID transaction on a single service. No distributed lock. No coordinator blocking.
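The order flow above can be sketched as a list of (action, compensation) pairs. This is a minimal illustration under simplifying assumptions: `run_saga` is a hypothetical helper, the steps just append to a log, and compensations are assumed never to fail.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo in reverse on failure."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Only steps that already succeeded are compensated, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []


def fail_create_order():
    raise RuntimeError("order service unavailable")


steps = [
    (lambda: log.append("reserve inventory"), lambda: log.append("release inventory")),
    (lambda: log.append("charge payment"), lambda: log.append("issue refund")),
    (fail_create_order, lambda: log.append("cancel order")),
]
ok = run_saga(steps)
```

Because Step 3 never completed, its own compensation ("cancel order") is not run; only the refund and the inventory release are. A real system would also need to retry compensations until they succeed.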
Choreography-Based Saga
Services communicate via events. Each service listens for events, performs its step, and publishes the next event:
Order Service publishes: OrderPlaced
→ Inventory Service listens, reserves stock, publishes: InventoryReserved
→ Payment Service listens, charges card, publishes: PaymentCharged
→ Order Service listens, creates order record, publishes: OrderCreated
Failure: Payment Service publishes: PaymentFailed
→ Inventory Service listens, releases stock, publishes: InventoryReleased
→ Order Service listens, marks order as cancelled
Pros: loose coupling, no central coordinator, each service only knows about its own step.
Cons: hard to track overall workflow state, difficult to debug, event chains can become complex.
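The event chain above can be simulated with a tiny in-memory event bus. This is a sketch only: `EventBus`, `wire_services`, and the `card_ok` flag are hypothetical, delivery is synchronous, and a real deployment would use a message broker with at-least-once delivery.

```python
from collections import defaultdict


class EventBus:
    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []              # every event published, in order

    def subscribe(self, event, handler):
        self.handlers[event].append(handler)

    def publish(self, event, payload):
        self.history.append(event)
        for handler in list(self.handlers[event]):
            handler(payload)


def wire_services(bus, stock, card_ok):
    def reserve(order):                # Inventory Service
        stock[order["sku"]] -= 1
        bus.publish("InventoryReserved", order)

    def charge(order):                 # Payment Service
        bus.publish("PaymentCharged" if card_ok else "PaymentFailed", order)

    def release(order):                # Inventory Service, compensation
        stock[order["sku"]] += 1
        bus.publish("InventoryReleased", order)

    def create_order(order):           # Order Service
        order["status"] = "created"
        bus.publish("OrderCreated", order)

    def cancel_order(order):           # Order Service, compensation
        order["status"] = "cancelled"

    bus.subscribe("OrderPlaced", reserve)
    bus.subscribe("InventoryReserved", charge)
    bus.subscribe("PaymentCharged", create_order)
    bus.subscribe("PaymentFailed", release)
    bus.subscribe("InventoryReleased", cancel_order)


bus, stock = EventBus(), {"widget": 3}
wire_services(bus, stock, card_ok=False)
order = {"sku": "widget"}
bus.publish("OrderPlaced", order)
```

No single function describes the whole workflow; it only emerges from the subscriptions, which is the debugging cost mentioned in the cons.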
Orchestration-Based Saga
A central Saga Orchestrator explicitly tells each service what to do:
OrderSagaOrchestrator:
1. Send "ReserveInventory" to Inventory Service → await result
2. Send "ChargePayment" to Payment Service → await result
3. Send "CreateOrder" to Order Service → await result
On failure at step N:
For i from N-1 down to 1: send compensation command to service i
Pros: workflow is visible and testable in one place, easier to debug.
Cons: orchestrator becomes a central point of coupling. It need not be a single point of failure, provided it persists saga state durably so that another instance can resume an in-flight saga.
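One way to keep the orchestrator process itself replaceable is to hold saga progress in a durable store rather than in memory. The sketch below assumes a plain dict standing in for that store; `SagaOrchestrator`, the step functions, and the simulated crash are all hypothetical.

```python
class SagaOrchestrator:
    def __init__(self, store, steps):
        self.store = store   # stands in for a durable table keyed by saga_id
        self.steps = steps   # list of (name, action, compensation)

    def run(self, saga_id):
        state = self.store.setdefault(saga_id, {"done": 0, "status": "running"})
        while state["status"] == "running" and state["done"] < len(self.steps):
            _, action, _ = self.steps[state["done"]]
            if action(saga_id):
                state["done"] += 1             # progress recorded after each step
            else:
                state["status"] = "compensating"
        if state["status"] == "compensating":
            while state["done"] > 0:           # steps N-1 down to 1
                _, _, compensate = self.steps[state["done"] - 1]
                compensate(saga_id)
                state["done"] -= 1
            state["status"] = "aborted"
        elif state["status"] == "running":
            state["status"] = "completed"
        return state["status"]


calls, crash = [], {"armed": True}


def reserve(saga_id):
    calls.append("reserve")
    return True


def charge(saga_id):
    if crash["armed"]:                         # simulate the process dying once
        crash["armed"] = False
        raise RuntimeError("orchestrator process died")
    calls.append("charge")
    return True


def create(saga_id):
    calls.append("create")
    return True


def noop(saga_id):
    pass


steps = [("ReserveInventory", reserve, noop),
         ("ChargePayment", charge, noop),
         ("CreateOrder", create, noop)]
store = {}
try:
    SagaOrchestrator(store, steps).run("saga_123")
except RuntimeError:
    pass
status = SagaOrchestrator(store, steps).run("saga_123")   # a fresh instance resumes
```

The second instance picks up at "ChargePayment" instead of reserving inventory again. A crash between a step finishing and its progress being recorded would still cause a repeat, so the steps themselves need to be idempotent.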
Saga vs. 2PC: When to Use Which
| Criterion | 2PC | Saga |
|---|---|---|
| Consistency | Strong (atomic commit) | Eventual (no isolation between steps) |
| Availability | Lower (all must be up) | Higher (partial failures handled) |
| Complexity | Protocol complexity | Compensation logic complexity |
| Cross-service | Difficult | Natural fit |
| Best for | Same database cluster | Microservices, long-running workflows |
Idempotency in Sagas
Saga steps must be idempotent because a retry or a redelivered message can run the same step more than once. Use idempotency keys:
-- Charge payment step (PostgreSQL syntax).
-- Requires a unique constraint or primary key on (saga_id, step).
INSERT INTO payments (saga_id, step, status)
VALUES ('saga_123', 'charge', 'processing')
ON CONFLICT (saga_id, step) DO NOTHING;
-- If no row was inserted, this saga_id + step was already processed → skip
-- and return the previously stored result
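The same idempotency-key idea can be exercised end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the `result` column, `charge_once`, and `fake_charge_card` are hypothetical additions, and SQLite (3.24 or newer, for `ON CONFLICT ... DO NOTHING`) stands in for the real database.

```python
import sqlite3


def charge_once(conn, saga_id, amount_cents, charge_card):
    # Claim the (saga_id, step) key; a repeated call inserts nothing.
    conn.execute(
        "INSERT INTO payments (saga_id, step, status) "
        "VALUES (?, 'charge', 'processing') "
        "ON CONFLICT (saga_id, step) DO NOTHING",
        (saga_id,),
    )
    status, result = conn.execute(
        "SELECT status, result FROM payments WHERE saga_id = ? AND step = 'charge'",
        (saga_id,),
    ).fetchone()
    if status == "done":
        return result                  # already processed: return stored result
    result = charge_card(amount_cents)
    conn.execute(
        "UPDATE payments SET status = 'done', result = ? "
        "WHERE saga_id = ? AND step = 'charge'",
        (result, saga_id),
    )
    conn.commit()
    return result


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (saga_id TEXT, step TEXT, status TEXT, result TEXT, "
    "PRIMARY KEY (saga_id, step))"
)

charges = []


def fake_charge_card(amount_cents):
    charges.append(amount_cents)
    return f"receipt-{len(charges)}"


first = charge_once(conn, "saga_123", 1999, fake_charge_card)
second = charge_once(conn, "saga_123", 1999, fake_charge_card)   # retried step
```

The retry returns the stored receipt without charging again. This sketch does not cover a crash between the charge and the `UPDATE`; closing that gap usually means passing the same key to the payment provider so that it can deduplicate as well.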
Interview Tips
- Start with why standard transactions don't work across services; this sets the context
- Explain the coordinator failure scenario in 2PC; this is the blocking problem interviewers probe
- For Sagas: define “compensating transaction” clearly. It is a new transaction that semantically undoes an already committed step, not a database rollback
- Recommend orchestration over choreography for complex workflows — it is easier to debug
- Emphasize idempotency: “Saga steps must be designed so that running them twice produces the same result”