Day 19 of 30 • Resilience Engineering

Failure Modes & Fault Tolerance

Every distributed system will fail. Design for failure: circuit breakers, bulkheads, retry strategies, and chaos engineering.

Topics: Circuit Breaker · Bulkhead · Retry + Backoff · Timeouts · Chaos Engineering

Types of Failures

💥 Crash Failures

Node stops responding entirely. Easiest to detect via heartbeat timeouts. Handled by: redundancy, leader election, health checks.

⏳ Timeout / Omission

Node is alive but slow or dropping messages. Hardest to distinguish from crash. Handled by: timeouts, retry with backoff, circuit breakers.

🔌 Network Partition

Nodes can't communicate but both are alive. CAP theorem applies. Handled by: quorum, hinted handoff, anti-entropy repair.

😈 Byzantine Failures

Node sends arbitrary/malicious responses. Requires BFT algorithms (3f+1 nodes for f failures). Common in blockchains. Rare in internal clusters.

📊 Failure Rate Math

If one server has 99.9% uptime (8.7 hrs downtime/year), what's a 3-server cluster with NO redundancy?

P(all available) = 0.999³ ≈ 99.7% — worse than a single server! But with redundancy (the cluster is up if any one replica is up): P(all down) = 0.001³ = 10⁻⁹ = 0.0000001% → effective uptime = 99.9999999%. Redundancy transforms series failures (multiply availabilities) into parallel failures (multiply failure probabilities).
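The arithmetic above can be checked directly — a quick sketch of series vs. parallel availability:

```python
# Availability math: series vs. parallel (redundant) composition.
single = 0.999  # one server: 99.9% uptime

# Series: ALL three servers must be up (no redundancy).
series = single ** 3              # 0.997002999 -> ~99.7%

# Parallel: at least one of three replicas must be up.
parallel = 1 - (1 - single) ** 3  # 1 - 1e-9 -> 99.9999999%

print(f"series:   {series:.4%}")
print(f"parallel: {parallel:.7%}")
```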

Circuit Breaker Pattern

🔌 State Machine

  • CLOSED: Normal operation. Requests flow through. Failure counter tracked.
  • OPEN: Failure threshold exceeded. Requests immediately rejected (fail-fast). After timeout, try half-open.
  • HALF-OPEN: Let one test request through. If it succeeds, close. If it fails, open again.

🎯 Why Circuit Breakers?

Without a CB: 10K threads sit waiting on a failing service. Each thread holds memory and a connection. The caller's thread pool is exhausted, so the caller fails too. This is cascading failure.

With CB: after 5 failures, requests instantly rejected. Thread pool freed. Caller can serve fallback responses. Downstream service gets breathing room to recover.
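The state machine above can be sketched in a few lines. This is a minimal, single-threaded illustration — the class name, thresholds, and fast-fail exception are this sketch's own choices, not any particular library's API:

```python
import time

class CircuitBreaker:
    """Minimal CLOSED -> OPEN -> HALF_OPEN circuit breaker (not thread-safe)."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"   # recovery timer expired: allow one probe
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed half-open probe, or crossing the threshold, opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"          # success (or successful probe) closes it
            return result
```

Production libraries add sliding failure windows, per-exception policies, and metrics, but the three-state core is the same.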

⚡ Circuit Breaker Simulator

[Interactive demo: threshold 3 failures → OPEN; half-open probe after 5 seconds. The widget tracks the current state, failures in the window, total requests, fast-fail rejections, and the recovery timer.]

Retry Strategies & Exponential Backoff

♻️ Naive Retry — Don't Do This

If 1000 clients all retry immediately after failure, they create a retry storm that prevents the server from recovering. 1000 clients × 10 retries/second = 10K requests/second on an already-failing server.

📈 Exponential Backoff + Jitter

Wait = min(base × 2^attempt, max_wait) + random_jitter. Jitter prevents all clients retrying at the exact same time (thundering herd). AWS SDK uses "full jitter" by default.

Attempt 1: cap 1s
Attempt 2: cap 2s
Attempt 3: cap 4s
Attempt 4: cap 8s
With full jitter, the actual wait is drawn uniformly from [0, cap], so these are upper bounds rather than fixed delays.
import asyncio, random, logging

async def retry_with_backoff(
    fn,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retryable_exceptions: tuple = (ConnectionError, TimeoutError),
):
    for attempt in range(max_attempts):
        try:
            return await fn()
        except retryable_exceptions as e:
            if attempt == max_attempts - 1:
                raise  # Final attempt — propagate
            # Exponential backoff with full jitter
            cap = min(base_delay * (2 ** attempt), max_delay)
            wait = random.uniform(0, cap)  # Full jitter
            logging.warning(f"Attempt {attempt+1} failed: {e}. Retrying in {wait:.2f}s")
            await asyncio.sleep(wait)
        except Exception:
            raise  # Non-retryable — propagate immediately

# Usage with circuit breaker integration
async def call_payment_service(payment_data):
    if circuit_breaker.state == "OPEN":
        return {"status": "service_unavailable", "fallback": True}
    return await retry_with_backoff(
        lambda: payment_service.charge(payment_data),
        max_attempts=3,
        retryable_exceptions=(TimeoutError, ConnectionError)
    )

Bulkhead & Timeout Patterns

🚢 Bulkhead Pattern

Isolation

Isolate resources per downstream service. Payment service gets its own thread pool (10 threads). If payment is slow, only those 10 threads are blocked — order processing still has its own 20 threads. Prevent one slow dependency from taking down everything.

  • Separate thread pools per external call
  • Semaphore-based connection limits
  • Hystrix, Resilience4j implement this

⏱️ Timeout Strategy

Fail Fast

Always set timeouts. Without them, threads wait forever. Timeout cascade: if you call service B with a 30s timeout, B should call C with a timeout shorter than 30s. If B's own upstream deadline is 5s, B must time out C in under 5s.

  • Connection timeout: 1–3s (TCP handshake)
  • Read timeout: depends on operation
  • Deadline propagation via context
  • gRPC: deadline in every call
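Deadline propagation can be sketched by passing one absolute deadline down the call chain and converting it to a remaining-time budget at each hop. The service names here are hypothetical stand-ins for the A → B → C chain above:

```python
import asyncio, time

async def call_with_deadline(coro_fn, deadline):
    """Convert an absolute deadline into the remaining budget at this hop."""
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("deadline already exceeded")
    return await asyncio.wait_for(coro_fn(), timeout=remaining)

async def service_b(deadline):
    # B does not invent its own 30s timeout; it inherits whatever budget
    # its caller has left, so C is always cut off before B's caller is.
    return await call_with_deadline(service_c, deadline)

async def service_c():
    await asyncio.sleep(0.01)
    return "done"

# The caller (service A) sets one absolute deadline for the whole chain.
result = asyncio.run(service_b(deadline=time.monotonic() + 5.0))
```

gRPC bakes this into the protocol: a call's deadline travels with the request, and each hop sees only the time remaining.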

Chaos Engineering

🐒 Netflix Chaos Monkey

"If systems aren't tested against failures, they will fail when you least expect it." Netflix randomly kills EC2 instances in production to force engineers to build resilient systems. Simian Army extended this to network latency, region failures, and security vulnerabilities.

🔥 Game Day Exercises

Scheduled chaos: kill a database primary, disconnect a service, inject 500ms latency. Teams practice response. Failures in controlled game day > failures at 3am.

📊 Blast Radius Control

Start small: inject failures for 1% of traffic in staging, then 1% production. Gradually increase. Always have a kill switch. Monitor impact before widening the blast radius.

🛠️ Tools

  • Netflix Chaos Monkey
  • AWS Fault Injection Simulator
  • Gremlin (SaaS chaos platform)
  • Chaos Mesh (Kubernetes)
  • Toxiproxy (network faults)

Resilience Patterns at a Glance

Pattern         | Problem Solved                               | Trade-off
Circuit Breaker | Cascading failures, thread exhaustion        | May open incorrectly (false positives) on transient blips
Retry + Backoff | Transient network errors, momentary overload | Retry storms if no backoff/jitter
Bulkhead        | Slow dependency taking down all services     | Resource overhead; pool-sizing complexity
Timeout         | Hung threads waiting forever                 | Too short = false failures; too long = resource waste
Fallback        | Graceful degradation when a service fails    | Fallback may be stale or incomplete
Idempotency Key | Duplicate requests from retries              | Storage overhead for key tracking

Knowledge Check

1. A circuit breaker is in OPEN state. A request comes in. What happens?

2. 1000 clients retry a failed API every 1 second simultaneously. What problem occurs?

3. The Bulkhead pattern uses separate thread pools per downstream service. What failure does this prevent?

4. Service A calls Service B with a 30s timeout. B calls C. What should B's timeout for C be?

5. Netflix runs Chaos Monkey in production (not just staging). What is the key reason?

Day 19 Complete!

You can now design fault-tolerant systems. Next: Distributed Observability & Tracing.