The Problem with Distributed Transactions
In a monolith with a single database, ACID transactions guarantee atomicity: either all operations succeed or none do. In a microservices architecture, a single business operation (e.g., placing an order: reserve inventory, charge payment, notify shipping) spans multiple services with separate databases, and traditional ACID transactions do not work across service boundaries. Three common solutions: Two-Phase Commit (2PC), the Saga pattern, and eventual consistency with idempotent operations.
Two-Phase Commit (2PC)
Phase 1 (Prepare): the Coordinator sends PREPARE to all participants. Each participant durably logs that it is prepared and replies YES or NO. Phase 2 (Commit/Abort): if all reply YES, the Coordinator sends COMMIT; participants commit and release their locks. If any replies NO, the Coordinator sends ABORT and all participants roll back.
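The two phases can be sketched as follows. This is a minimal in-memory model, not a real coordinator: the `Participant` class and its methods are illustrative, and a production implementation would durably log every state transition before sending each message.

```python
# Minimal sketch of a 2PC coordinator with in-memory participants.
# Real systems write each state transition to a durable log first.

class Participant:
    """A resource manager that can prepare, commit, or roll back one transaction."""
    def __init__(self, name, will_prepare=True):
        self.name = name
        self.will_prepare = will_prepare  # simulates whether Phase 1 succeeds
        self.state = "idle"

    def prepare(self):
        # Phase 1: durably record "prepared" before voting YES.
        if self.will_prepare:
            self.state = "prepared"
            return "YES"
        return "NO"

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1 (Prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2 (Commit/Abort): commit only if every vote is YES.
    if all(v == "YES" for v in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "aborted"
```

Note where the blocking problem lives: between the two loops, every participant that voted YES is holding locks, and if the coordinator dies at that point, none of them can unilaterally decide.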
Problems with 2PC: (1) Blocking protocol: if the Coordinator crashes after Phase 1 but before Phase 2, participants are stuck holding locks indefinitely (this is why 2PC is called a blocking protocol). (2) Single point of failure: the Coordinator is critical for progress. (3) Poor performance: two network round trips plus a durable write at every step. (4) Works poorly across a WAN (high latency, unreliable links). In practice, 2PC is used within a single data center with databases that support it (PostgreSQL supports 2PC via PREPARE TRANSACTION).
Three-Phase Commit (3PC)
Adds a pre-commit phase between Prepare and Commit to make the protocol non-blocking. Rarely used in practice — adds complexity and latency. Real-world systems use Paxos or Raft for consensus instead of 3PC.
Saga Pattern
A saga is a sequence of local transactions with compensating transactions for rollback. No global lock — each step commits independently. Two implementations:
Choreography: each service publishes an event on success; the next service listens and acts. On failure: the failing service publishes a failure event; upstream services listen and execute compensating transactions. Pro: no central coordinator. Con: difficult to trace and reason about for complex flows.
Orchestration: a Saga Orchestrator sends commands to each service and handles failures centrally. Easier to monitor. The orchestrator maintains saga state (which steps completed, which are pending) in a database. Used by Uber (Money saga), Netflix (subscription workflows).
Saga limitations: no isolation (intermediate states are visible to other transactions), compensations must be designed carefully (some operations are hard to compensate — e.g., sent emails cannot be unsent).
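An orchestration-style saga with compensations can be sketched like this. The step names and actions below are hypothetical stand-ins for service calls; the point is the control flow: each step commits locally, and on failure the orchestrator runs the compensations for completed steps in reverse order.

```python
# Sketch of an orchestration-style saga. Each step is a local transaction
# paired with a compensating transaction; on failure, completed steps are
# compensated in reverse order. Step names and bodies are illustrative.

def run_saga(steps):
    """steps: list of (name, action, compensation). Returns 'completed' or 'compensated'."""
    done = []
    for name, action, compensation in steps:
        try:
            action()
            done.append((name, compensation))
        except Exception:
            # Roll the saga back: compensate completed steps, newest first.
            for _, comp in reversed(done):
                comp()
            return "compensated"
    return "completed"


log = []  # records side effects so we can observe the saga's behavior

def charge_payment():
    raise RuntimeError("card declined")  # simulate a failure mid-saga

steps = [
    ("reserve_inventory", lambda: log.append("reserved"),
                          lambda: log.append("released")),
    ("charge_payment",    charge_payment,
                          lambda: log.append("refunded")),
    ("notify_shipping",   lambda: log.append("notified"),
                          lambda: log.append("cancelled")),
]
```

Running `run_saga(steps)` reserves inventory, fails on payment, and releases the reservation, ending in the `"compensated"` state. Note that only steps that completed are compensated: the failed payment charge and the never-attempted shipping step are not.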
Idempotency and At-Least-Once Delivery
In distributed systems, messages may be delivered more than once (network retries). Services must be idempotent: processing the same message twice produces the same result as processing it once. Implement with idempotency keys: store (idempotency_key, result) in a database. On receipt: check if key exists. If yes, return stored result. If no, process and store. This is the foundation of reliable distributed processing.
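The check-then-store pattern above can be sketched with an in-memory dict standing in for the database table. This is a simplification: in a real database you would enforce atomicity with a unique constraint (e.g., INSERT ... ON CONFLICT DO NOTHING) so two concurrent requests with the same key cannot both process.

```python
# Idempotency-key sketch. A dict stands in for the database table;
# a real implementation would use a unique constraint so that the
# check-and-store is atomic under concurrent requests.

_results = {}  # idempotency_key -> stored result

def process_once(idempotency_key, operation):
    """Run operation() at most once per key; replay the stored result on retries."""
    if idempotency_key in _results:
        return _results[idempotency_key]  # duplicate delivery: return cached result
    result = operation()
    _results[idempotency_key] = result
    return result


calls = []  # counts how many times the real side effect actually ran

def charge():
    calls.append(1)
    return "charged $10"
```

Calling `process_once("key-1", charge)` twice returns the same result both times, but `charge()` executes only once; the retry is absorbed by the stored result.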
Eventual Consistency
In many systems, strict consistency is unnecessary. Eventual consistency: after all updates propagate, all replicas converge to the same value. Acceptable for: social media feeds (showing a post to some users before others), DNS propagation (takes minutes), product catalog (pricing may lag by seconds). Not acceptable for: bank account balances, inventory reservation (must be exact), payment processing.
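One simple mechanism by which replicas converge is a last-write-wins (LWW) register, sketched below. This is a toy model (real systems also need clock handling and tie-breaking), but it shows the convergence property: once both replicas have exchanged state, they hold the same value.

```python
# Toy last-write-wins (LWW) register: each replica keeps (timestamp, value)
# and merging keeps the newer write, so after all updates propagate,
# every replica converges to the same value.

class LWWRegister:
    def __init__(self):
        self.ts, self.value = 0, None

    def write(self, ts, value):
        if ts > self.ts:  # keep only the newer write
            self.ts, self.value = ts, value

    def merge(self, other):
        # Anti-entropy: adopt the other replica's state if it is newer.
        self.write(other.ts, other.value)


a, b = LWWRegister(), LWWRegister()
a.write(1, "v1")         # earlier write lands on replica a
b.write(2, "v2")         # later write lands on replica b
a.merge(b); b.merge(a)   # gossip both ways: replicas converge
```

Before the merge, readers of `a` and `b` see different values (the "eventual" part); after the merge, both hold the later write.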
Choosing the Right Approach
| Approach | Consistency | Availability | Complexity | Use When |
|---|---|---|---|---|
| 2PC | Strong | Low | Medium | Same datacenter, short transactions |
| Saga (choreography) | Eventual | High | Medium | Independent services, simple flows |
| Saga (orchestration) | Eventual | High | High | Complex multi-step workflows |
| Eventual consistency | Eventual | Very High | Low | Non-critical reads, high-scale |
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the main problem with Two-Phase Commit (2PC)?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "2PC has two critical problems. (1) Blocking: after Phase 1 (all participants say YES), if the Coordinator crashes before sending Phase 2, all participants are stuck holding locks indefinitely — they cannot commit (no COMMIT message) or roll back (they said YES). Other transactions are blocked until the Coordinator recovers. This is the fundamental flaw of 2PC. (2) Single point of failure: the Coordinator is critical for progress. Solutions: use persistent logging so the Coordinator can recover its state; use Paxos/Raft for the Coordinator to handle leader failure. In practice, 2PC is acceptable within a single datacenter where network reliability is high and Coordinator recovery is fast (seconds). Across data centers, use Saga instead."
      }
    },
    {
      "@type": "Question",
      "name": "What is the difference between saga choreography and orchestration?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Choreography: each service listens for domain events and publishes its own events. No central coordinator. Example: Order Service publishes OrderCreated -> Inventory Service listens, reserves stock, publishes InventoryReserved -> Payment Service listens, charges, publishes PaymentCharged -> Order Service listens and confirms. Advantages: loose coupling, no SPOF. Disadvantages: logic is distributed across services (hard to trace the full flow), no single place to see saga state, circular event chains are possible. Orchestration: a Saga Orchestrator sends explicit commands (RESERVE_INVENTORY, CHARGE_PAYMENT) and handles success/failure. The orchestrator stores saga state. Advantages: clear ownership, easy to monitor, explicit error handling. Disadvantages: orchestrator is a new SPOF and coupling point. Prefer orchestration for complex long-running workflows."
      }
    },
    {
      "@type": "Question",
      "name": "How do you implement idempotency in a distributed system?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Idempotency means that processing the same operation multiple times has the same effect as processing it once. Implementation: the caller generates a unique idempotency_key (UUID) for each logical operation. The server stores (idempotency_key, response) in a database with a unique constraint. On receipt: SELECT by idempotency_key. If found, return the stored response immediately. If not found: process the operation, store the result with the key, return the result. The SELECT + INSERT must be atomic (use INSERT … ON CONFLICT DO NOTHING and check if a row was inserted). TTL: expire idempotency keys after 24 hours or a business-appropriate window. Critical for: payment processing (retry on network failure should not double-charge), inventory reservation, and any state-changing operation in a distributed system."
      }
    },
    {
      "@type": "Question",
      "name": "When should you use eventual consistency instead of strong consistency?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Use eventual consistency when: (1) the data is non-critical for correctness (social media likes, view counts, analytics metrics) — a few seconds of lag is invisible to users. (2) You need maximum availability — eventual consistency allows writes to proceed even when some replicas are unreachable. (3) Geographic distribution — eventual consistency allows low-latency writes to the nearest datacenter without waiting for cross-region acknowledgment. Use strong consistency when: (1) correctness is critical (account balances, inventory counts, order status). (2) Multiple services must agree on the same value simultaneously. (3) Users expect to immediately see their own writes (profile updates, settings changes). CAP theorem: you cannot have both availability and strong consistency in the presence of network partitions — eventual consistency chooses availability."
      }
    },
    {
      "@type": "Question",
      "name": "How does the outbox pattern ensure reliable event publishing in a saga?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The dual-write problem: a service updates its database and publishes an event to Kafka. If the service crashes between the DB write and the Kafka publish, the event is never sent and the saga stalls. Outbox pattern: write the event to an outbox table in the same database transaction as the state change. A separate relay process reads unpublished outbox events and publishes to Kafka, then marks them as published. Even if the relay crashes: on restart, it re-reads unpublished outbox events and republishes (at-least-once delivery). Consumers must be idempotent to handle duplicate events. The relay can use: polling (simple, slightly higher latency), or CDC (Change Data Capture) with Debezium (reads the database WAL, near real-time, no polling overhead)."
      }
    }
  ]
}