Every distributed system will fail. Design for failure: circuit breakers, bulkheads, retry strategies, and chaos engineering.
Crash failure: node stops responding entirely. Easiest to detect via heartbeat timeouts. Handled by: redundancy, leader election, health checks.
Omission/gray failure: node is alive but slow or dropping messages. Hardest to distinguish from a crash. Handled by: timeouts, retry with backoff, circuit breakers.
Network partition: nodes can't communicate but both are alive. The CAP theorem applies. Handled by: quorum, hinted handoff, anti-entropy repair.
Byzantine failure: node sends arbitrary/malicious responses. Requires BFT algorithms (3f+1 nodes to tolerate f faulty ones). Common in blockchains; rare in internal clusters.
If one server has 99.9% uptime (8.7 hrs downtime/year), what's the availability of a 3-server cluster with NO redundancy (all three must be up)?
P(all available) = 0.999³ ≈ 99.7% — worse! But with redundancy (any one server suffices): P(all down) = 0.001³ = 10⁻⁹ = 0.0000001% → effective uptime = 99.9999999% (nine nines). Redundancy transforms series failures into parallel failures.
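A quick back-of-the-envelope check of these numbers:

```python
p = 0.999  # availability of a single server

# Series (no redundancy): all 3 servers must be up.
series = p ** 3

# Parallel (full redundancy): the cluster fails only if all 3 are down.
parallel = 1 - (1 - p) ** 3

print(f"series:   {series:.6f}")    # ≈ 0.997003 → 99.7%
print(f"parallel: {parallel:.9f}")  # 0.999999999 → nine nines
```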
Without CB: 10K threads waiting on a failing service. Each thread holds memory and a connection. The caller's thread pool exhausts — the caller also fails. This is cascading failure.
With CB: after 5 failures, requests instantly rejected. Thread pool freed. Caller can serve fallback responses. Downstream service gets breathing room to recover.
Threshold: 3 failures → OPEN. Half-open probe after 5 seconds.
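The state machine behind those parameters can be sketched as a minimal illustration (same settings as above: 3 failures to open, a half-open probe after 5 seconds; not a production implementation):

```python
class CircuitBreaker:
    """Illustrative CLOSED -> OPEN -> HALF_OPEN breaker with an injectable clock."""

    def __init__(self, failure_threshold: int = 3, recovery_timeout: float = 5.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def allow_request(self, now: float) -> bool:
        if self.state == "OPEN":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # let exactly one probe through
                return True
            return False  # fail fast: no thread blocks on the dead service
        return True

    def record_success(self) -> None:
        self.failures = 0
        self.state = "CLOSED"

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now

cb = CircuitBreaker()
for t in (0.0, 1.0, 2.0):
    cb.record_failure(now=t)
print(cb.state)                   # OPEN: 3 failures reached the threshold
print(cb.allow_request(now=3.0))  # False: still inside the 5s recovery window
print(cb.allow_request(now=7.5))  # True: half-open probe is allowed through
```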
If 1000 clients all retry immediately after failure, they create a retry storm that prevents the server from recovering. 1000 clients × 10 retries/second = 10K requests/second on an already-failing server.
Wait = min(base × 2^attempt, max_wait) + random_jitter. Jitter prevents all clients retrying at the exact same time (thundering herd). AWS SDK uses "full jitter" by default.
```python
import asyncio, random, logging

async def retry_with_backoff(
    fn,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    retryable_exceptions: tuple = (ConnectionError, TimeoutError),
):
    for attempt in range(max_attempts):
        try:
            return await fn()
        except retryable_exceptions as e:
            if attempt == max_attempts - 1:
                raise  # Final attempt — propagate
            # Exponential backoff with full jitter
            cap = min(base_delay * (2 ** attempt), max_delay)
            wait = random.uniform(0, cap)  # Full jitter
            logging.warning(f"Attempt {attempt+1} failed: {e}. Retrying in {wait:.2f}s")
            await asyncio.sleep(wait)
        except Exception:
            raise  # Non-retryable — propagate immediately

# Usage with circuit breaker integration
# (circuit_breaker and payment_service are assumed to be defined elsewhere)
async def call_payment_service(payment_data):
    if circuit_breaker.state == "OPEN":
        return {"status": "service_unavailable", "fallback": True}
    return await retry_with_backoff(
        lambda: payment_service.charge(payment_data),
        max_attempts=3,
        retryable_exceptions=(TimeoutError, ConnectionError),
    )
```
Isolate resources per downstream service. Payment service gets its own thread pool (10 threads). If payment is slow, only those 10 threads are blocked — order processing still has its own 20 threads. Prevent one slow dependency from taking down everything.
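In async code, the per-service pools above can be sketched with one semaphore per dependency (service names and pool sizes here are illustrative assumptions, standing in for the thread pools described):

```python
import asyncio

# One concurrency pool per downstream service: a slow "payment" dependency can
# exhaust at most its own 10 slots, never the 20 reserved for "orders".
POOLS = {
    "payment": asyncio.Semaphore(10),
    "orders": asyncio.Semaphore(20),
}

async def call_with_bulkhead(service: str, fn):
    sem = POOLS[service]
    if sem.locked():  # pool exhausted: shed load instead of queueing forever
        raise RuntimeError(f"{service} bulkhead full")
    async with sem:
        return await fn()

async def demo():
    async def slow_payment():
        await asyncio.sleep(0.05)  # simulate a slow downstream call
        return "ok"
    # 12 concurrent calls against a pool of 10: two are rejected immediately,
    # and the "orders" pool is untouched either way.
    results = await asyncio.gather(
        *[call_with_bulkhead("payment", slow_payment) for _ in range(12)],
        return_exceptions=True,
    )
    succeeded = sum(r == "ok" for r in results)
    shed = sum(isinstance(r, RuntimeError) for r in results)
    return succeeded, shed

succeeded, shed = asyncio.run(demo())
print(succeeded, shed)  # 10 2
```

Rejecting when the pool is full (rather than waiting) is the load-shedding variant; a queuing variant would simply `async with sem:` and accept the latency.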
Always set timeouts — without them, threads wait forever. Timeouts must cascade: if A calls service B with a 30s timeout, B should call C with a timeout under 30s. If B's upstream budget is only 5s, B must time out its call to C in under 5s, or B's caller gives up before B does.
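The cascade can be sketched with a timeout budget passed down the call chain (service names, the 80% budget split, and the hung-service simulation are illustrative assumptions):

```python
import asyncio

async def call_c():
    await asyncio.sleep(10)  # simulate C hanging far past any reasonable budget
    return "c-result"

async def service_b(budget: float):
    # Give C strictly less than B's remaining budget, reserving a slice
    # for B's own work, so B can still respond before its caller gives up.
    c_timeout = budget * 0.8
    try:
        return await asyncio.wait_for(call_c(), timeout=c_timeout)
    except asyncio.TimeoutError:
        return "b-fallback"  # degrade gracefully instead of hanging

print(asyncio.run(service_b(budget=0.1)))  # b-fallback
```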
"If systems aren't tested against failures, they will fail when you least expect it." Netflix randomly kills EC2 instances in production to force engineers to build resilient systems. Simian Army extended this to network latency, region failures, and security vulnerabilities.
Scheduled chaos: kill a database primary, disconnect a service, inject 500ms latency. Teams practice response. Failures in controlled game day > failures at 3am.
Start small: inject failures for 1% of traffic in staging, then 1% production. Gradually increase. Always have a kill switch. Monitor impact before widening the blast radius.
| Pattern | Problem Solved | Trade-off |
|---|---|---|
| Circuit Breaker | Cascading failures, thread exhaustion | May open on transient blips (false positives) |
| Retry + Backoff | Transient network errors, momentary overload | Retry storms if no backoff/jitter |
| Bulkhead | Slow dependency taking down all services | Resource overhead; pool sizing complexity |
| Timeout | Hung threads waiting forever | Too short = false failures; too long = resource waste |
| Fallback | Graceful degradation when service fails | Fallback may be stale or incomplete |
| Idempotency Key | Duplicate requests from retries | Storage overhead for key tracking |
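The idempotency-key row deserves a sketch, since retries make duplicates inevitable: the client attaches one key per logical operation, and the server replays the stored result for repeats (function names and the in-memory store are illustrative; a real service would use a durable store with expiry):

```python
import uuid

_results: dict[str, dict] = {}  # idempotency key -> first result (storage overhead)

def charge(amount: int, idempotency_key: str) -> dict:
    if idempotency_key in _results:
        # Duplicate request (e.g. a client retry after a timeout):
        # replay the original result instead of charging twice.
        return _results[idempotency_key]
    result = {"status": "charged", "amount": amount, "txn": str(uuid.uuid4())}
    _results[idempotency_key] = result
    return result

key = "order-1234-charge"       # client generates one key per logical operation
first = charge(500, key)
retry = charge(500, key)        # the retry from retry_with_backoff lands here
print(first["txn"] == retry["txn"])  # True: same transaction, no double charge
```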
1. A circuit breaker is in OPEN state. A request comes in. What happens?
2. 1000 clients retry a failed API every 1 second simultaneously. What problem occurs?
3. The Bulkhead pattern uses separate thread pools per downstream service. What failure does this prevent?
4. Service A calls Service B with a 30s timeout. B calls C. What should B's timeout for C be?
5. Netflix runs Chaos Monkey in production (not just staging). What is the key reason?
You can now design fault-tolerant systems. Next: Distributed Observability & Tracing.