Day 17
Distributed Transactions
Master Saga, 2PC, and TCC — understand when each pattern applies, how compensating transactions work, and why distributed atomicity is harder than it looks.
Mapping E-Commerce Checkout to Saga Steps
An e-commerce platform processes 50,000 orders/day through a microservices checkout flow. The checkout involves three independent microservices: Inventory Service (reserves stock), Payment Service (charges the card), and Shipping Service (creates a shipment label). Each service has its own database. A single database transaction across all three services is impossible — you must design this as a Saga with a clear sequence of steps and corresponding compensating transactions for each failure scenario.
Tasks
- List the Saga steps in order (T1: Reserve Inventory, T2: Charge Payment, T3: Create Shipment) and write the compensating transaction for each step (C1, C2, C3) that reverses it if a later step fails.
- Trace the failure path when T3 (Create Shipment) fails: which compensating transactions execute, in what order, and what is the final state of inventory and payment?
- Explain the difference between a choreography-based Saga (services react to events) and an orchestration-based Saga (a central coordinator issues commands). Which is more appropriate for this 3-step checkout flow and why?
- Identify the "window of inconsistency" in a Saga: between T2 completing and C2 executing after a T3 failure, how long is the payment charged but shipment not created? What does the user experience during this window?
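The T3 failure path described in the tasks above can be sketched as a small orchestrator. This is a minimal in-memory illustration, not a production design: the names `SagaStep`, `run_saga`, and `checkout_with_failing_shipment` are hypothetical, and the three services are replaced by counters in a dictionary. It assumes compensations always succeed, which a real system cannot assume.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]        # forward transaction (T1, T2, T3)
    compensation: Callable[[], None]  # compensating transaction (C1, C2, C3)


@dataclass
class SagaResult:
    succeeded: bool
    log: List[str] = field(default_factory=list)


def run_saga(steps: List[SagaStep]) -> SagaResult:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    result = SagaResult(succeeded=True)
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            result.log.append(f"{step.name}: done")
            completed.append(step)
        except Exception as exc:
            result.succeeded = False
            result.log.append(f"{step.name}: failed ({exc})")
            # The failed step never completed, so only earlier steps are undone.
            for done in reversed(completed):
                done.compensation()
                result.log.append(f"{done.name}: compensated")
            break
    return result


def checkout_with_failing_shipment() -> Tuple[SagaResult, Dict[str, int]]:
    """Hypothetical checkout where T3 always fails (stand-in for a real outage)."""
    state = {"reserved": 0, "charged": 0, "shipments": 0}

    def reserve() -> None:
        state["reserved"] += 1

    def release() -> None:
        state["reserved"] -= 1

    def charge() -> None:
        state["charged"] += 1

    def refund() -> None:
        state["charged"] -= 1

    def create_shipment() -> None:
        raise RuntimeError("label service unavailable")

    def cancel_shipment() -> None:
        state["shipments"] -= 1

    steps = [
        SagaStep("T1 ReserveInventory", reserve, release),
        SagaStep("T2 ChargePayment", charge, refund),
        SagaStep("T3 CreateShipment", create_shipment, cancel_shipment),
    ]
    return run_saga(steps), state
```

In this sketch C2 runs before C1 and C3 never runs, because T3 did not complete. The interval between the "T2 ChargePayment: done" entry and the "T2 ChargePayment: compensated" entry in the log corresponds to the window of inconsistency.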
Payment Fails After Inventory Reserved — Designing Compensation Flow
During Black Friday, the Payment Service experienced a 40-second outage. During this time, 3,200 orders reached the "inventory reserved" state (T1 complete) but did not complete T2 (payment charge). The Saga orchestrator must now execute compensating transactions for the failed orders. The Inventory Service can process 500 reservation releases/sec, and the orchestrator must not overwhelm it. Additionally, 120 of the 3,200 orders have an indeterminate outcome: the payment API returned an ambiguous response (neither success nor explicit failure), so the orchestrator cannot yet tell whether to compensate or fulfill them.
Tasks
- Calculate how long it takes to release all 3,200 inventory reservations at 500 releases/sec and design a rate-limited compensation queue that respects this limit while also handling normal order cancellations sharing the same endpoint.
- Design the compensation state machine for the orchestrator: what states does each Saga instance move through (PENDING_PAYMENT → PAYMENT_FAILED → COMPENSATING_INVENTORY → COMPENSATION_COMPLETE / COMPENSATION_FAILED)?
- Handle the 120 ambiguous payment responses: design an idempotency-key-based verification protocol — how does the orchestrator determine whether the payment actually succeeded or failed before deciding to compensate or fulfill?
- Design the "compensation failed" case: what happens if the Inventory Service is also unavailable when trying to release reservations? Describe the retry strategy, dead-letter queue, and manual intervention escalation path.
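One possible shape for the state machine and the drain-time arithmetic in the tasks above is sketched below. The states `PAYMENT_UNKNOWN` and `PAYMENT_SUCCEEDED` are assumptions added to cover the ambiguous responses. The `reserved_for_cancellations` parameter and the `provider_status` values ("succeeded", "failed", or `None`) are hypothetical. They stand in for a capacity share left to normal cancellations and for whatever a payment provider returns when a charge is looked up by its idempotency key.

```python
from enum import Enum
from typing import Optional


class SagaState(Enum):
    PENDING_PAYMENT = "PENDING_PAYMENT"
    PAYMENT_UNKNOWN = "PAYMENT_UNKNOWN"      # assumed: ambiguous provider response
    PAYMENT_SUCCEEDED = "PAYMENT_SUCCEEDED"  # assumed: saga continues to shipping
    PAYMENT_FAILED = "PAYMENT_FAILED"
    COMPENSATING_INVENTORY = "COMPENSATING_INVENTORY"
    COMPENSATION_COMPLETE = "COMPENSATION_COMPLETE"
    COMPENSATION_FAILED = "COMPENSATION_FAILED"


# Legal transitions; anything else is rejected.
ALLOWED = {
    SagaState.PENDING_PAYMENT: {
        SagaState.PAYMENT_SUCCEEDED,
        SagaState.PAYMENT_FAILED,
        SagaState.PAYMENT_UNKNOWN,
    },
    SagaState.PAYMENT_UNKNOWN: {SagaState.PAYMENT_SUCCEEDED, SagaState.PAYMENT_FAILED},
    SagaState.PAYMENT_FAILED: {SagaState.COMPENSATING_INVENTORY},
    SagaState.COMPENSATING_INVENTORY: {
        SagaState.COMPENSATION_COMPLETE,
        SagaState.COMPENSATION_FAILED,
    },
    # A failed compensation may be retried (after backoff or manual review).
    SagaState.COMPENSATION_FAILED: {SagaState.COMPENSATING_INVENTORY},
}


def transition(current: SagaState, target: SagaState) -> SagaState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def resolve_ambiguous(provider_status: Optional[str]) -> Optional[SagaState]:
    """Map a status looked up by idempotency key to the next state.

    Returns None while the provider still cannot say; the caller should keep
    the inventory reservation and poll again rather than compensate.
    """
    if provider_status == "succeeded":
        return SagaState.PAYMENT_SUCCEEDED
    if provider_status == "failed":
        return SagaState.PAYMENT_FAILED
    return None


def drain_seconds(backlog: int, endpoint_limit_per_sec: int,
                  reserved_for_cancellations: int = 0) -> float:
    """Time to drain the compensation backlog at the usable release rate."""
    usable = endpoint_limit_per_sec - reserved_for_cancellations
    if usable <= 0:
        raise ValueError("no capacity left for compensations")
    return backlog / usable
```

With the figures from the scenario, `drain_seconds(3200, 500)` gives 6.4 seconds when the whole endpoint is available. Reserving a hypothetical 100 releases/sec for normal cancellations stretches that to 8.0 seconds.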
2PC vs Saga for a Bank Transfer Between 2 Microservices
A neobank needs to implement a fund transfer between two internal microservices: Account Service A (sender) and Account Service B (receiver), each with its own PostgreSQL database. The transfer must be atomic — money must not disappear or duplicate under any failure scenario. The engineering team is debating 2PC (Two-Phase Commit) using a distributed transaction coordinator vs a Saga pattern with compensating transactions. Transfers happen at 1,200/sec peak.
Tasks
- Describe the 2PC protocol for this transfer: what happens in the Prepare phase (both services write provisional debit/credit and vote yes/no), and what happens in the Commit phase? Draw out the coordinator-to-participant message flow.
- Identify the 2PC failure mode that makes it impractical for microservices at scale: what happens if the coordinator crashes between sending "Commit" to Service A (acknowledged) and "Commit" to Service B (not yet sent)? How long is the system blocked?
- Design the equivalent Saga: T1=DebitSenderAccount, T2=CreditReceiverAccount, C1=RefundSender. Explain why this Saga is "eventually consistent" rather than "atomically consistent" and what a user would observe during the compensation window.
- State your final recommendation: 2PC or Saga? Justify based on the 1,200 transfers/sec load, the duration of locks held during 2PC, and the operational complexity of running a coordinator service vs a Saga orchestrator.
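The Prepare and Commit phases and the coordinator-crash scenario from the tasks above can be illustrated with an in-memory sketch. The `Account` class and `transfer_2pc` function are hypothetical stand-ins for the two account services and the coordinator. The `crash_after_first_commit` flag simulates the crash. The sketch assumes one pending transaction per account and omits the durable decision log a real coordinator would need for recovery.

```python
from typing import Dict


class Account:
    """In-memory stand-in for one account service participating in 2PC."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        # tx_id -> provisional delta; a non-empty dict means the row is locked.
        self.pending: Dict[str, int] = {}

    def prepare(self, tx_id: str, delta: int) -> bool:
        # Vote "no" if another transaction holds the lock or the
        # change would overdraw the account.
        if self.pending or self.balance + delta < 0:
            return False
        self.pending[tx_id] = delta
        return True

    def commit(self, tx_id: str) -> None:
        self.balance += self.pending.pop(tx_id)

    def abort(self, tx_id: str) -> None:
        self.pending.pop(tx_id, None)

    @property
    def locked(self) -> bool:
        return bool(self.pending)


def transfer_2pc(tx_id: str, sender: Account, receiver: Account, amount: int,
                 crash_after_first_commit: bool = False) -> str:
    """Coordinator logic: collect votes, then commit or abort everywhere."""
    participants = [(sender, -amount), (receiver, amount)]

    # Phase 1 (Prepare): each participant records a provisional change and votes.
    prepared = []
    for account, delta in participants:
        if account.prepare(tx_id, delta):
            prepared.append(account)
        else:
            for voter in prepared:
                voter.abort(tx_id)
            return "aborted"

    # Phase 2 (Commit): the decision is final once every vote is "yes".
    sender.commit(tx_id)
    if crash_after_first_commit:
        # Simulated crash: the receiver stays prepared and locked until
        # a recovered coordinator re-sends the decision.
        return "coordinator crashed"
    receiver.commit(tx_id)
    return "committed"
```

After a simulated crash, the receiver's `locked` property stays true and every later transfer touching that account aborts in the Prepare phase. This is the blocking behaviour the second task asks about.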
TCC Pattern for Hotel + Flight Booking
A travel platform sells vacation packages that bundle a hotel room and a flight seat. Both the Hotel Service and the Flight Service are third-party microservices with their own databases. The bundle must be booked atomically — a user should never pay for a hotel with no available flight, or vice versa. You choose the TCC (Try-Confirm-Cancel) pattern because 2PC is not supported by the third-party APIs and Saga compensations (refunds) take 5–7 business days — unacceptable for users. Implement TCC for this booking flow.
Tasks
- Explain the three phases of TCC: Try (reserve but don't commit), Confirm (finalize the reservation), Cancel (release the reservation). Show what each phase does for Hotel Service and Flight Service — what database state changes at each step?
- Design the Try phase timeout: if the Hotel Try succeeds but the Flight Try takes more than 5 seconds, what happens to the hotel reservation? How does the TCC coordinator manage reservation TTLs to prevent resources being blocked indefinitely?
- Handle the "Confirm partial failure" scenario: Confirm is sent to Hotel (success) and to Flight (network timeout, outcome unknown). Unless the Flight API's Confirm is idempotent, the coordinator cannot safely retry Confirm (it might double-book), and it cannot safely Cancel either (it might cancel a successful booking). Design the resolution protocol.
- Compare TCC to Saga for this use case: why is TCC superior when compensations are slow (refunds take 5–7 business days), and what is the implementation cost of TCC (hint: all participant services must implement all three phases, a significant API contract requirement)?
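The Try, Confirm, and Cancel phases and the reservation TTL from the tasks above could look roughly like the following sketch. `ReservableResource` and `book_bundle` are hypothetical names, and the 30-second default TTL is an arbitrary illustration value. Time is passed in explicitly to keep the example deterministic. The sketch assumes Confirm is idempotent per booking id, which is one common way to make retries safe but is not guaranteed by any real third-party API.

```python
from typing import Dict, Set


class ReservableResource:
    """In-memory stand-in for a Hotel or Flight service exposing Try/Confirm/Cancel."""

    def __init__(self, capacity: int) -> None:
        self.available = capacity
        self.held: Dict[str, float] = {}  # booking_id -> time the hold expires
        self.confirmed: Set[str] = set()

    def expire_holds(self, now: float) -> None:
        # Release any hold whose TTL has passed so stock is not blocked forever.
        for booking_id, expiry in list(self.held.items()):
            if expiry <= now:
                del self.held[booking_id]
                self.available += 1

    def try_reserve(self, booking_id: str, now: float, ttl: float) -> bool:
        self.expire_holds(now)
        if booking_id in self.held or booking_id in self.confirmed:
            return True  # repeated Try for the same booking is a no-op
        if self.available == 0:
            return False
        self.available -= 1
        self.held[booking_id] = now + ttl
        return True

    def confirm(self, booking_id: str, now: float) -> bool:
        if booking_id in self.confirmed:
            return True  # idempotent: retrying Confirm cannot double-book
        self.expire_holds(now)
        if booking_id not in self.held:
            return False  # the hold expired or was cancelled
        del self.held[booking_id]
        self.confirmed.add(booking_id)
        return True

    def cancel(self, booking_id: str) -> None:
        if booking_id in self.held:
            del self.held[booking_id]
            self.available += 1


def book_bundle(booking_id: str, hotel: ReservableResource,
                flight: ReservableResource, now: float, ttl: float = 30.0) -> str:
    """Coordinator: Try both, Cancel on any Try failure, then Confirm both."""
    if not hotel.try_reserve(booking_id, now, ttl):
        return "rejected"
    if not flight.try_reserve(booking_id, now, ttl):
        hotel.cancel(booking_id)
        return "rejected"
    hotel_ok = hotel.confirm(booking_id, now)
    flight_ok = flight.confirm(booking_id, now)
    return "confirmed" if hotel_ok and flight_ok else "needs_reconciliation"
```

Because a hold only reduces `available` and never charges the customer, a Cancel or an expired TTL simply returns the unit to stock. No refund is involved, which is the advantage over a Saga compensation in this scenario.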