Day 17 — Distributed Transactions: 2PC & Saga

The Distributed Transaction Problem

💸 Example: Book a Flight + Hotel

A travel booking service must deduct payment from the user's account (Payment DB), reserve a seat (Airline DB), and book a room (Hotel DB) — three separate databases. If payment succeeds but the hotel is full, we need to refund. If the network dies mid-way, we need to know where we are. This is the distributed transaction problem.

❌ Why ACID Doesn't Scale

Traditional ACID transactions (BEGIN/COMMIT) work within one database. Distributed transactions require locking rows across multiple services, holding locks while network calls happen — causing deadlocks, high latency, and preventing horizontal scaling.

✅ The Two Approaches

2PC (2-Phase Commit): Coordinator asks all nodes to prepare, then commits if all agree. Strong consistency but blocking.
Saga: Break into local transactions with compensating transactions for rollback. Eventual consistency, no locks.

Two-Phase Commit (2PC)

Phase 1: Prepare (Voting)

Coordinator

──Prepare──►

Payment DB

Airline DB

Hotel DB

◄──Vote Yes/No──

Coordinator sends PREPARE. Each participant checks if it can commit, writes to a durable prepare log, acquires locks, responds YES or NO.

Phase 2a: Commit (all voted YES)

Coordinator

──COMMIT──►

Payment DB ✓

Airline DB ✓

Hotel DB ✓

Phase 2b: Abort (any voted NO)

Coordinator

──ROLLBACK──►

Payment DB ✗

Airline DB ✗

Hotel DB (NO)

✅ 2PC Strengths

Strong consistency — all-or-nothing guarantee
Simple mental model
ACID compliant across services
Built into: XA standard, PostgreSQL, MySQL

❌ 2PC Problems

Blocking: locks held during phase 2
Coordinator is single point of failure
If coordinator crashes after Prepare: participants stuck
Latency: 2 round trips across all participants
Poor availability — fails if any node fails

🎭 Saga Pattern — Interactive Walkthrough

E-commerce order saga: Order → Payment → Inventory → Shipping. Click "Run Happy Path" or "Inject Failure" to see compensating transactions.

Running

Committed

Failed

Compensating

Compensated

Saga Implementation Patterns

🎭 Choreography-based Saga

Decentralized

Each service publishes domain events and subscribes to others' events. No central coordinator. Order service emits OrderPlaced → Payment service subscribes → emits PaymentProcessed → Inventory subscribes, etc.

No single point of failure
Harder to track overall saga state
Good for simple, linear flows
Debugging is complex

🎬 Orchestration-based Saga

Centralized

A dedicated orchestrator service drives the saga — it calls each service in order and tells them what to do. If a step fails, the orchestrator calls compensating actions. Used in AWS Step Functions, Temporal.

Clear saga state in one place
Easier to add retry logic
Orchestrator can be a bottleneck
Best for complex, branching flows

Outbox Pattern: Reliable Event Publishing

📬 The Dual-Write Problem

When a service updates its DB and also publishes an event to Kafka, what if it commits the DB but crashes before publishing? Or publishes but the DB rollback happens? Either way, the system is inconsistent. The Outbox pattern solves this.

-- Outbox pattern: write event atomically with business data
BEGIN;
  -- Business operation
  INSERT INTO orders (id, user_id, status, total)
  VALUES ('ord-123', 'usr-456', 'CONFIRMED', 99.99);

  -- Outbox event (same transaction!)
  INSERT INTO outbox_events (id, aggregate_type, aggregate_id, event_type, payload, created_at)
  VALUES (
    gen_random_uuid(),
    'Order',
    'ord-123',
    'OrderConfirmed',
    '{"orderId":"ord-123","userId":"usr-456","total":99.99}',
    NOW()
  );
COMMIT;

-- Separate outbox publisher process (Debezium or custom poller):
-- SELECT * FROM outbox_events WHERE published = false ORDER BY created_at LIMIT 100;
-- Publish to Kafka, then mark as published.
-- At-least-once delivery — consumers must be idempotent!

Saga Orchestrator — Python Example

from enum import Enum
from dataclasses import dataclass, field
from typing import Callable
import asyncio, logging

class StepState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"

@dataclass
class SagaStep:
    name: str
    action: Callable          # The forward action
    compensation: Callable    # Undo action if later step fails
    state: StepState = StepState.PENDING
    result: dict = field(default_factory=dict)

class SagaOrchestrator:
    def __init__(self, saga_id: str, steps: list[SagaStep]):
        self.saga_id = saga_id
        self.steps = steps
        self.completed = []   # Track for compensation

    async def execute(self) -> bool:
        for step in self.steps:
            step.state = StepState.RUNNING
            try:
                step.result = await step.action()
                step.state = StepState.DONE
                self.completed.append(step)
                logging.info(f"[{self.saga_id}] {step.name}: DONE")
            except Exception as e:
                step.state = StepState.FAILED
                logging.error(f"[{self.saga_id}] {step.name}: FAILED — {e}")
                await self._compensate()
                return False
        return True

    async def _compensate(self):
        """Roll back completed steps in reverse order."""
        for step in reversed(self.completed):
            step.state = StepState.COMPENSATING
            try:
                await step.compensation()
                step.state = StepState.COMPENSATED
                logging.info(f"[{self.saga_id}] {step.name}: COMPENSATED")
            except Exception as e:
                # Compensation failed — requires manual intervention!
                logging.critical(f"[{self.saga_id}] Compensation failed for {step.name}: {e}")
                # Alert on-call, add to dead letter queue

# Usage
async def book_order(order_id: str):
    saga = SagaOrchestrator(
        saga_id=order_id,
        steps=[
            SagaStep("CreateOrder", create_order, cancel_order),
            SagaStep("ProcessPayment", charge_card, refund_card),
            SagaStep("ReserveInventory", reserve_items, release_items),
            SagaStep("ScheduleShipping", schedule_ship, cancel_ship),
        ]
    )
    success = await saga.execute()
    return {"status": "completed" if success else "rolled_back"}

2PC vs Saga — Decision Guide

Aspect	2PC	Saga
Consistency	Strong (ACID)	Eventual (BASE)
Failure handling	Coordinator rollback	Compensating transactions
Blocking	Yes (locks held)	No (non-blocking)
Availability	Low (all nodes must agree)	High (partial progress OK)
Latency	High (2 round trips)	Low (async steps)
Complexity	Protocol complexity	Compensation logic complexity
Use when	Financial accuracy critical, few services	Microservices, high throughput, long-running
Examples	Bank transfers, XA databases	Order fulfillment, booking systems

Knowledge Check

1. In 2PC, the coordinator crashes after sending PREPARE but before COMMIT. What happens to participants?

2. A Saga's Payment step succeeds, then the Inventory step fails. What must happen?

3. What is the Outbox Pattern used for?

4. In a choreography-based Saga, OrderService publishes "OrderPlaced". PaymentService handles it and publishes "PaymentFailed". What must OrderService do?

5. Why must Saga consumers be idempotent?

Day 17 Complete!

You can now design reliable distributed transactions. Next: Clocks, Ordering & Causality.