Every distributed system will fail. Design for failure: circuit breakers, bulkheads, retry strategies, and chaos engineering.
Crash failure: node stops responding entirely. Easiest to detect via heartbeat timeouts. Handled by: redundancy, leader election, health checks.
Omission/gray failure: node is alive but slow or dropping messages. Hardest to distinguish from a crash. Handled by: timeouts, retry with backoff, circuit breakers.
Network partition: nodes can't communicate but both are alive. The CAP theorem applies. Handled by: quorum, hinted handoff, anti-entropy repair.
Byzantine failure: node sends arbitrary/malicious responses. Requires BFT algorithms (3f+1 nodes to tolerate f faulty ones). Common in blockchains; rare in internal clusters.
If one server has 99.9% uptime (8.7 hrs downtime/year), what's the availability of a 3-server cluster with NO redundancy (all three must be up)?
P(all available) = 0.999³ ≈ 99.7% — worse! But with redundancy (any one server suffices): P(all down) = 0.001³ = 10⁻⁹ = 0.0000001% → effective uptime = 99.9999999% (nine nines). Redundancy transforms series failures into parallel failures.
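A quick back-of-the-envelope check of these numbers:

```python
p = 0.999  # availability of a single server

# Series (no redundancy): all 3 servers must be up.
series = p ** 3

# Parallel (full redundancy): the cluster fails only if all 3 are down.
parallel = 1 - (1 - p) ** 3

print(f"series:   {series:.6f}")    # ≈ 0.997003 → 99.7%
print(f"parallel: {parallel:.9f}")  # 0.999999999 → nine nines
```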
Without CB: 10K threads waiting on a failing service. Each thread holds memory and a connection. The caller's thread pool exhausts — the caller also fails. This is cascading failure.
With CB: after 5 failures, requests instantly rejected. Thread pool freed. Caller can serve fallback responses. Downstream service gets breathing room to recover.
Threshold: 3 failures → OPEN. Half-open probe after 5 seconds.
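The state machine behind those parameters can be sketched as a minimal illustration (same settings as above: 3 failures to open, a half-open probe after 5 seconds; not a production implementation):

```python
class CircuitBreaker:
    """Illustrative CLOSED -> OPEN -> HALF_OPEN breaker with an injectable clock."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 5.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self, now: float) -> bool:
        if self.state == "OPEN":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # let exactly one probe through
                return True
            return False  # fail fast: no thread blocks on the dead service
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now

cb = CircuitBreaker()
for t in (0.0, 1.0, 2.0):
    cb.record_failure(now=t)
print(cb.state)                   # OPEN: 3 failures reached the threshold
print(cb.allow_request(now=3.0))  # False: still inside the 5s recovery window
print(cb.allow_request(now=7.5))  # True: half-open probe is allowed through
```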
If 1000 clients all retry immediately after failure, they create a retry storm that prevents the server from recovering. 1000 clients × 10 retries/second = 10K requests/second on an already-failing server.
Wait = min(base × 2^attempt, max_wait) + random_jitter. Jitter prevents all clients retrying at the exact same time (thundering herd). AWS SDK uses "full jitter" by default.
```python
import asyncio, random, logging

async def retry_with_backoff(
    fn,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retryable_exceptions: tuple = (ConnectionError, TimeoutError),
):
    for attempt in range(max_attempts):
        try:
            return await fn()
        except retryable_exceptions as e:
            if attempt == max_attempts - 1:
                raise  # Final attempt — propagate
            # Exponential backoff with full jitter
            cap = min(base_delay * (2 ** attempt), max_delay)
            wait = random.uniform(0, cap)  # Full jitter
            logging.warning(f"Attempt {attempt+1} failed: {e}. Retrying in {wait:.2f}s")
            await asyncio.sleep(wait)
        except Exception:
            raise  # Non-retryable — propagate immediately

# Usage with circuit breaker integration
# (circuit_breaker and payment_service are assumed to be defined elsewhere)
async def call_payment_service(payment_data):
    if circuit_breaker.state == "OPEN":
        return {"status": "service_unavailable", "fallback": True}
    return await retry_with_backoff(
        lambda: payment_service.charge(payment_data),
        max_attempts=3,
        retryable_exceptions=(TimeoutError, ConnectionError),
    )
```
Isolate resources per downstream service. Payment service gets its own thread pool (10 threads). If payment is slow, only those 10 threads are blocked — order processing still has its own 20 threads. Prevent one slow dependency from taking down everything.
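In async code, the per-service pools above can be sketched with one semaphore per dependency (service names and pool sizes here are illustrative assumptions, standing in for the thread pools described):

```python
import asyncio

# One concurrency pool per downstream service: a slow "payment" dependency can
# exhaust at most its own 10 slots, never the 20 reserved for "orders".
POOLS = {
    "payment": asyncio.Semaphore(10),
    "orders": asyncio.Semaphore(20),
}

async def call_with_bulkhead(service: str, fn):
    sem = POOLS[service]
    if sem.locked():  # pool exhausted: shed load instead of queueing forever
        raise RuntimeError(f"{service} bulkhead full")
    async with sem:
        return await fn()

async def demo():
    async def slow_payment():
        await asyncio.sleep(0.05)  # simulate a slow downstream call
        return "ok"
    # 12 concurrent calls against a pool of 10: two are rejected immediately,
    # and the "orders" pool is untouched either way.
    results = await asyncio.gather(
        *[call_with_bulkhead("payment", slow_payment) for _ in range(12)],
        return_exceptions=True,
    )
    succeeded = sum(r == "ok" for r in results)
    shed = sum(isinstance(r, RuntimeError) for r in results)
    return succeeded, shed

succeeded, shed = asyncio.run(demo())
print(succeeded, shed)  # 10 2
```

Rejecting when the pool is full (rather than waiting) is the load-shedding variant; a queuing variant would simply `async with sem:` and accept the latency.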
Always set timeouts — without them, threads wait forever. Timeouts must cascade: if A calls service B with a 30s timeout, B should call C with a timeout under 30s. If B's upstream budget is only 5s, B must time out its call to C in under 5s, or B's caller gives up before B does.
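The cascade can be sketched with a timeout budget passed down the call chain (service names, the 80% budget split, and the hung-service simulation are illustrative assumptions):

```python
import asyncio

async def call_c():
    await asyncio.sleep(10)  # simulate C hanging far past any reasonable budget
    return "c-result"

async def service_b(budget: float):
    # Give C strictly less than B's remaining budget, reserving a slice
    # for B's own work, so B can still respond before its caller gives up.
    c_timeout = budget * 0.8
    try:
        return await asyncio.wait_for(call_c(), timeout=c_timeout)
    except asyncio.TimeoutError:
        return "b-fallback"  # degrade gracefully instead of hanging

print(asyncio.run(service_b(budget=0.1)))  # b-fallback
```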
"If systems aren't tested against failures, they will fail when you least expect it." Netflix randomly kills EC2 instances in production to force engineers to build resilient systems. Simian Army extended this to network latency, region failures, and security vulnerabilities.
Scheduled chaos: kill a database primary, disconnect a service, inject 500ms latency. Teams practice response. Failures in controlled game day > failures at 3am.
Start small: inject failures for 1% of traffic in staging, then 1% production. Gradually increase. Always have a kill switch. Monitor impact before widening the blast radius.
| Pattern | Problem Solved | Trade-off |
|---|---|---|
| Circuit Breaker | Cascading failures, thread exhaustion | May open on transient blips (false positives) |
| Retry + Backoff | Transient network errors, momentary overload | Retry storms if no backoff/jitter |
| Bulkhead | Slow dependency taking down all services | Resource overhead; pool sizing complexity |
| Timeout | Hung threads waiting forever | Too short = false failures; too long = resource waste |
| Fallback | Graceful degradation when service fails | Fallback may be stale or incomplete |
| Idempotency Key | Duplicate requests from retries | Storage overhead for key tracking |
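The idempotency-key row deserves a sketch, since retries make duplicates inevitable: the client attaches one key per logical operation, and the server replays the stored result for repeats (function names and the in-memory store are illustrative; a real service would use a durable store with expiry):

```python
import uuid

_results: dict[str, dict] = {}  # idempotency key -> first result (storage overhead)

def charge(amount: int, idempotency_key: str) -> dict:
    if idempotency_key in _results:
        # Duplicate request (e.g. a client retry after a timeout):
        # replay the original result instead of charging twice.
        return _results[idempotency_key]
    result = {"status": "charged", "amount": amount, "txn": str(uuid.uuid4())}
    _results[idempotency_key] = result
    return result

key = "order-1234-charge"       # client generates one key per logical operation
first = charge(500, key)
retry = charge(500, key)        # the retry from retry_with_backoff lands here
print(first["txn"] == retry["txn"])  # True: same transaction, no double charge
```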
1. A circuit breaker is in OPEN state. A request comes in. What happens?
2. 1000 clients retry a failed API every 1 second simultaneously. What problem occurs?
3. The Bulkhead pattern uses separate thread pools per downstream service. What failure does this prevent?
4. Service A calls Service B with a 30s timeout. B calls C. What should B's timeout for C be?
5. Netflix runs Chaos Monkey in production (not just staging). What is the key reason?
You can now design fault-tolerant systems. Next: Distributed Observability & Tracing.