Day 17 of 30 • Distributed Transactions

Distributed Transactions: 2PC & Saga

How to maintain data consistency across multiple microservices and databases — the patterns powering modern e-commerce, banking, and travel systems.

2-Phase Commit Saga Pattern Compensating Txns Idempotency Outbox Pattern

The Distributed Transaction Problem

💸 Example: Book a Flight + Hotel

A travel booking service must deduct payment from the user's account (Payment DB), reserve a seat (Airline DB), and book a room (Hotel DB) — three separate databases. If payment succeeds but the hotel is full, we need to refund. If the network dies mid-way, we need to know where we are. This is the distributed transaction problem.

❌ Why ACID Doesn't Scale

Traditional ACID transactions (BEGIN/COMMIT) work within one database. Distributed transactions require locking rows across multiple services, holding locks while network calls happen — causing deadlocks, high latency, and preventing horizontal scaling.

✅ The Two Approaches

  • 2PC (2-Phase Commit): Coordinator asks all nodes to prepare, then commits if all agree. Strong consistency but blocking.
  • Saga: Break into local transactions with compensating transactions for rollback. Eventual consistency, no locks.

Two-Phase Commit (2PC)

Phase 1: Prepare (Voting)
Coordinator
──Prepare──►
Payment DB
Airline DB
Hotel DB
◄──Vote Yes/No──

Coordinator sends PREPARE. Each participant checks if it can commit, writes to a durable prepare log, acquires locks, responds YES or NO.

Phase 2a: Commit (all voted YES)
Coordinator
──COMMIT──►
Payment DB ✓
Airline DB ✓
Hotel DB ✓
Phase 2b: Abort (any voted NO)
Coordinator
──ROLLBACK──►
Payment DB ✗
Airline DB ✗
Hotel DB (NO)

✅ 2PC Strengths

  • Strong consistency — all-or-nothing guarantee
  • Simple mental model
  • ACID compliant across services
  • Built into: XA standard, PostgreSQL, MySQL

❌ 2PC Problems

  • Blocking: locks held during phase 2
  • Coordinator is single point of failure
  • If coordinator crashes after Prepare: participants stuck
  • Latency: 2 round trips across all participants
  • Poor availability — fails if any node fails
🎭 Saga Pattern — Interactive Walkthrough

E-commerce order saga: Order → Payment → Inventory → Shipping. Click "Run Happy Path" or "Inject Failure" to see compensating transactions.

Running
Committed
Failed
Compensating
Compensated

Saga Implementation Patterns

🎭 Choreography-based Saga

Decentralized

Each service publishes domain events and subscribes to others' events. No central coordinator. Order service emits OrderPlaced → Payment service subscribes → emits PaymentProcessed → Inventory subscribes, etc.

  • No single point of failure
  • Harder to track overall saga state
  • Good for simple, linear flows
  • Debugging is complex

🎬 Orchestration-based Saga

Centralized

A dedicated orchestrator service drives the saga — it calls each service in order and tells them what to do. If a step fails, the orchestrator calls compensating actions. Used in AWS Step Functions, Temporal.

  • Clear saga state in one place
  • Easier to add retry logic
  • Orchestrator can be a bottleneck
  • Best for complex, branching flows

Outbox Pattern: Reliable Event Publishing

📬 The Dual-Write Problem

When a service updates its DB and also publishes an event to Kafka, what if it commits the DB but crashes before publishing? Or publishes but the DB rollback happens? Either way, the system is inconsistent. The Outbox pattern solves this.

-- Outbox pattern: write event atomically with business data
BEGIN;
  -- Business operation
  INSERT INTO orders (id, user_id, status, total)
  VALUES ('ord-123', 'usr-456', 'CONFIRMED', 99.99);

  -- Outbox event (same transaction!)
  INSERT INTO outbox_events (id, aggregate_type, aggregate_id, event_type, payload, created_at)
  VALUES (
    gen_random_uuid(),
    'Order',
    'ord-123',
    'OrderConfirmed',
    '{"orderId":"ord-123","userId":"usr-456","total":99.99}',
    NOW()
  );
COMMIT;

-- Separate outbox publisher process (Debezium or custom poller):
-- SELECT * FROM outbox_events WHERE published = false ORDER BY created_at LIMIT 100;
-- Publish to Kafka, then mark as published.
-- At-least-once delivery — consumers must be idempotent!

Saga Orchestrator — Python Example

from enum import Enum
from dataclasses import dataclass, field
from typing import Callable
import asyncio, logging

class StepState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"

@dataclass
class SagaStep:
    name: str
    action: Callable          # The forward action
    compensation: Callable    # Undo action if later step fails
    state: StepState = StepState.PENDING
    result: dict = field(default_factory=dict)

class SagaOrchestrator:
    def __init__(self, saga_id: str, steps: list[SagaStep]):
        self.saga_id = saga_id
        self.steps = steps
        self.completed = []   # Track for compensation

    async def execute(self) -> bool:
        for step in self.steps:
            step.state = StepState.RUNNING
            try:
                step.result = await step.action()
                step.state = StepState.DONE
                self.completed.append(step)
                logging.info(f"[{self.saga_id}] {step.name}: DONE")
            except Exception as e:
                step.state = StepState.FAILED
                logging.error(f"[{self.saga_id}] {step.name}: FAILED — {e}")
                await self._compensate()
                return False
        return True

    async def _compensate(self):
        """Roll back completed steps in reverse order."""
        for step in reversed(self.completed):
            step.state = StepState.COMPENSATING
            try:
                await step.compensation()
                step.state = StepState.COMPENSATED
                logging.info(f"[{self.saga_id}] {step.name}: COMPENSATED")
            except Exception as e:
                # Compensation failed — requires manual intervention!
                logging.critical(f"[{self.saga_id}] Compensation failed for {step.name}: {e}")
                # Alert on-call, add to dead letter queue

# Usage
async def book_order(order_id: str):
    saga = SagaOrchestrator(
        saga_id=order_id,
        steps=[
            SagaStep("CreateOrder", create_order, cancel_order),
            SagaStep("ProcessPayment", charge_card, refund_card),
            SagaStep("ReserveInventory", reserve_items, release_items),
            SagaStep("ScheduleShipping", schedule_ship, cancel_ship),
        ]
    )
    success = await saga.execute()
    return {"status": "completed" if success else "rolled_back"}

2PC vs Saga — Decision Guide

Aspect2PCSaga
ConsistencyStrong (ACID)Eventual (BASE)
Failure handlingCoordinator rollbackCompensating transactions
BlockingYes (locks held)No (non-blocking)
AvailabilityLow (all nodes must agree)High (partial progress OK)
LatencyHigh (2 round trips)Low (async steps)
ComplexityProtocol complexityCompensation logic complexity
Use whenFinancial accuracy critical, few servicesMicroservices, high throughput, long-running
ExamplesBank transfers, XA databasesOrder fulfillment, booking systems

Knowledge Check

1. In 2PC, the coordinator crashes after sending PREPARE but before COMMIT. What happens to participants?

2. A Saga's Payment step succeeds, then the Inventory step fails. What must happen?

3. What is the Outbox Pattern used for?

4. In a choreography-based Saga, OrderService publishes "OrderPlaced". PaymentService handles it and publishes "PaymentFailed". What must OrderService do?

5. Why must Saga consumers be idempotent?

Day 17 Complete!

You can now design reliable distributed transactions. Next: Clocks, Ordering & Causality.