How to maintain data consistency across multiple microservices and databases — the patterns powering modern e-commerce, banking, and travel systems.
A travel booking service must deduct payment from the user's account (Payment DB), reserve a seat (Airline DB), and book a room (Hotel DB) — three separate databases. If payment succeeds but the hotel is full, we need to refund. If the network dies mid-way, we need to know where we are. This is the distributed transaction problem.
Traditional ACID transactions (BEGIN/COMMIT) work within one database. Distributed transactions require holding locks across multiple services while network calls complete, which causes deadlocks and high latency and prevents horizontal scaling.
Two-Phase Commit (2PC), phase 1: the coordinator sends PREPARE. Each participant checks whether it can commit, writes to a durable prepare log, acquires locks, and responds YES or NO. Phase 2: if every participant voted YES, the coordinator sends COMMIT; if any voted NO (or timed out), it sends ABORT to all.
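The two phases can be sketched in a few lines. This is a minimal in-memory model: `Participant` and its list-based "durable log" are illustrative stand-ins for real resource managers, not an XA implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Participant:
    # Hypothetical participant: votes YES only if it can durably prepare.
    name: str
    can_commit: bool = True
    log: list = field(default_factory=list)  # stand-in for the durable prepare log

    def prepare(self) -> bool:
        if self.can_commit:
            self.log.append("PREPARED")  # write-ahead; locks are held from here
            return True
        return False

    def commit(self):
        self.log.append("COMMITTED")

    def abort(self):
        self.log.append("ABORTED")

def two_phase_commit(participants: list[Participant]) -> bool:
    # Phase 1: PREPARE — commit only if every vote is YES
    votes = [p.prepare() for p in participants]
    # Phase 2: COMMIT on unanimity, otherwise ABORT everyone
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False
```

Note the blocking behavior this models: a participant that voted YES holds its locks until phase 2 arrives, which is exactly the availability cost discussed below.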
E-commerce order saga: Order → Payment → Inventory → Shipping. If any step fails, the previously completed steps are undone by compensating transactions.
Choreography: each service publishes domain events and subscribes to others' events; there is no central coordinator. The Order service emits OrderPlaced → the Payment service subscribes and emits PaymentProcessed → the Inventory service subscribes, and so on.
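The event flow can be sketched in-process. The `EventBus` here is a hypothetical stand-in for a real broker such as Kafka; in practice each subscriber would be a separate service.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process stand-in for a message broker."""
    def __init__(self):
        self.handlers = defaultdict(list)
        self.history = []  # records every published event type, in order

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.history.append(event_type)
        for handler in self.handlers[event_type]:
            handler(payload)

bus = EventBus()
# Each "service" only knows the events it reacts to — no central coordinator
bus.subscribe("OrderPlaced", lambda e: bus.publish("PaymentProcessed", e))
bus.subscribe("PaymentProcessed", lambda e: bus.publish("InventoryReserved", e))

bus.publish("OrderPlaced", {"orderId": "ord-123"})
```

The trade-off is visible even at this scale: the overall workflow exists nowhere in the code; it emerges from the subscriptions, which makes it harder to see and to change than an orchestrated saga.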
Orchestration: a dedicated orchestrator service drives the saga, calling each service in order and telling it what to do. If a step fails, the orchestrator invokes the compensating actions. This is the model used by AWS Step Functions and Temporal.
When a service updates its DB and also publishes an event to Kafka, what if it commits the DB but crashes before publishing? Or publishes the event but the DB transaction rolls back? Either way, the system is inconsistent. The Outbox pattern solves this.
```sql
-- Outbox pattern: write event atomically with business data
BEGIN;

-- Business operation
INSERT INTO orders (id, user_id, status, total)
VALUES ('ord-123', 'usr-456', 'CONFIRMED', 99.99);

-- Outbox event (same transaction!)
INSERT INTO outbox_events (id, aggregate_type, aggregate_id, event_type, payload, created_at)
VALUES (
  gen_random_uuid(),
  'Order',
  'ord-123',
  'OrderConfirmed',
  '{"orderId":"ord-123","userId":"usr-456","total":99.99}',
  NOW()
);

COMMIT;

-- Separate outbox publisher process (Debezium or custom poller):
-- SELECT * FROM outbox_events WHERE published = false ORDER BY created_at LIMIT 100;
-- Publish to Kafka, then mark as published.
-- At-least-once delivery — consumers must be idempotent!
```
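The polling half of the pattern might look like the sketch below, shown against SQLite purely for self-containment. The `publish` callable stands in for a real Kafka producer, and the table schema (with a `published` flag) is an assumption of this example.

```python
import json
import sqlite3

def poll_and_publish(conn, publish, batch_size=100):
    """One polling cycle: read unpublished events, publish each, then mark
    it published. `publish` is any callable that sends to the broker (a
    Kafka producer in production). Publishing *before* marking means a
    crash between the two steps causes a redelivery on the next cycle —
    at-least-once semantics."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox_events "
        "WHERE published = 0 ORDER BY created_at LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # may be re-sent after a crash
        conn.execute(
            "UPDATE outbox_events SET published = 1 WHERE id = ?", (event_id,)
        )
        conn.commit()
    return len(rows)
```

A log-based connector like Debezium replaces the polling loop by tailing the database's write-ahead log, but the delivery guarantee (at-least-once, hence idempotent consumers) is the same.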
```python
from enum import Enum
from dataclasses import dataclass, field
from typing import Callable
import asyncio, logging

class StepState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"
    COMPENSATING = "compensating"
    COMPENSATED = "compensated"

@dataclass
class SagaStep:
    name: str
    action: Callable        # The forward action
    compensation: Callable  # Undo action if a later step fails
    state: StepState = StepState.PENDING
    result: dict = field(default_factory=dict)

class SagaOrchestrator:
    def __init__(self, saga_id: str, steps: list[SagaStep]):
        self.saga_id = saga_id
        self.steps = steps
        self.completed = []  # Track for compensation

    async def execute(self) -> bool:
        for step in self.steps:
            step.state = StepState.RUNNING
            try:
                step.result = await step.action()
                step.state = StepState.DONE
                self.completed.append(step)
                logging.info(f"[{self.saga_id}] {step.name}: DONE")
            except Exception as e:
                step.state = StepState.FAILED
                logging.error(f"[{self.saga_id}] {step.name}: FAILED — {e}")
                await self._compensate()
                return False
        return True

    async def _compensate(self):
        """Roll back completed steps in reverse order."""
        for step in reversed(self.completed):
            step.state = StepState.COMPENSATING
            try:
                await step.compensation()
                step.state = StepState.COMPENSATED
                logging.info(f"[{self.saga_id}] {step.name}: COMPENSATED")
            except Exception as e:
                # Compensation failed — requires manual intervention!
                logging.critical(f"[{self.saga_id}] Compensation failed for {step.name}: {e}")
                # Alert on-call, add to dead letter queue

# Usage
async def book_order(order_id: str):
    saga = SagaOrchestrator(
        saga_id=order_id,
        steps=[
            SagaStep("CreateOrder", create_order, cancel_order),
            SagaStep("ProcessPayment", charge_card, refund_card),
            SagaStep("ReserveInventory", reserve_items, release_items),
            SagaStep("ScheduleShipping", schedule_ship, cancel_ship),
        ],
    )
    success = await saga.execute()
    return {"status": "completed" if success else "rolled_back"}
```
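Because outbox publishing and saga retries both give at-least-once delivery, every consumer must deduplicate. A minimal sketch, assuming each event carries a unique `eventId` field (an assumption of this example; in production the seen-set would live in the consumer's database, checked in the same transaction as the side effect, not in memory):

```python
class IdempotentConsumer:
    """Deduplicates events by ID so a redelivered message produces no
    second side effect. The in-memory set is for illustration only."""
    def __init__(self, handler):
        self.handler = handler
        self.seen = set()

    def handle(self, event: dict) -> bool:
        event_id = event["eventId"]
        if event_id in self.seen:
            return False  # duplicate delivery — skip the side effect
        self.handler(event)
        self.seen.add(event_id)
        return True
```

With this guard, "charge the card for evt-1" executes once no matter how many times evt-1 is delivered.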
| Aspect | 2PC | Saga |
|---|---|---|
| Consistency | Strong (ACID) | Eventual (BASE) |
| Failure handling | Coordinator rollback | Compensating transactions |
| Blocking | Yes (locks held) | No (non-blocking) |
| Availability | Low (all nodes must agree) | High (partial progress OK) |
| Latency | High (2 round trips) | Low (async steps) |
| Complexity | Protocol complexity | Compensation logic complexity |
| Use when | Financial accuracy critical, few services | Microservices, high throughput, long-running |
| Examples | Bank transfers, XA databases | Order fulfillment, booking systems |
1. In 2PC, the coordinator crashes after sending PREPARE but before COMMIT. What happens to participants?
2. A Saga's Payment step succeeds, then the Inventory step fails. What must happen?
3. What is the Outbox Pattern used for?
4. In a choreography-based Saga, OrderService publishes "OrderPlaced". PaymentService handles it and publishes "PaymentFailed". What must OrderService do?
5. Why must Saga consumers be idempotent?
You can now design reliable distributed transactions. Next: Clocks, Ordering & Causality.