Day 19
Failure Modes & Fault Tolerance
Production systems fail in unexpected ways — cascading failures, slow dependencies, and thundering herds. Learn to identify failure modes and design systems that degrade gracefully.
Identify the Failure Mode — Slow DB Causes 500s on Unrelated Endpoints
A monolithic API service has 12 endpoints. One endpoint (/api/reports) runs heavy aggregation queries against a PostgreSQL database, each taking 8–15 seconds. On Monday morning, traffic to /api/reports spikes. Within 3 minutes, the entire API, including fast endpoints like /api/health and /api/users (normally <50 ms), starts returning HTTP 500 errors. The database itself shows only 65% CPU utilization, and no code was deployed. The on-call engineer must identify the failure mode within 5 minutes.
Tasks
- Name the failure mode and explain the mechanism: how does a slow /api/reports endpoint cause thread pool exhaustion that affects /api/health and /api/users?
- Identify three observable symptoms in metrics/logs that would confirm this diagnosis: what would you see in thread pool queue depth, request latency percentiles, and error rate by endpoint?
- Describe two immediate mitigations an on-call engineer can apply within 5 minutes without a code deploy (hint: traffic management and connection pool settings).
- Design the long-term architectural fix: how do you isolate /api/reports so it cannot consume resources needed by fast endpoints? Name the pattern (bulkhead) and describe its implementation for this service.
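As a sketch of the bulkhead fix asked for in the last task (the class name, pool size, and error message below are illustrative assumptions, not part of the scenario): a small semaphore-based bulkhead caps how many workers /api/reports may occupy and rejects excess calls immediately, so fast endpoints always find a free thread instead of queueing behind slow report queries.

```python
import threading

class Bulkhead:
    """Caps concurrent calls into a slow dependency; fails fast when full."""

    def __init__(self, max_concurrent):
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: shed load immediately rather than queueing,
        # so the shared thread pool is never exhausted by /api/reports.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: shedding /api/reports load")
        try:
            return fn(*args, **kwargs)
        finally:
            self._sem.release()

# Hypothetical sizing: at most 4 of the pool's threads may run reports.
reports_bulkhead = Bulkhead(max_concurrent=4)
```

The rejected calls surface as fast 503s on /api/reports only, which is exactly the isolation the bulkhead pattern is meant to provide.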
Retry Strategy with Backoff and Jitter for a Payment API
Your service calls an external payment provider API (Stripe-like) to charge cards. The payment API returns HTTP 429 (rate limited) above a sustained 100 requests/sec, returns transient 503 errors on 2% of requests, and normally has a p99 latency of 800 ms; each failed charge must be retried to maximize revenue. However, naive retry-immediately behavior causes a thundering herd: all failed requests retry simultaneously, worsening the overload. You need to design a production-grade retry policy.
Tasks
- Define which HTTP status codes should trigger a retry (503, 429, 408, network timeout) vs which should never retry (400, 401, 402 Payment Required, 422) — justify each decision.
- Design the exponential backoff formula: base=1s, multiplier=2, max_delay=32s, max_attempts=5. Calculate the delay for each attempt and the total time elapsed if all 5 attempts are made.
- Explain the "full jitter" and "equal jitter" strategies for decorrelating retries: show how full jitter (delay = random(0, base * 2^attempt)) prevents the thundering herd that pure exponential backoff causes when 1,000 clients all fail simultaneously.
- Design the idempotency key strategy for payment retries: why must you pass the same idempotency key on every retry attempt for the same charge, and what happens if you don't?
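The backoff, jitter, and idempotency-key tasks above can be sketched together. Everything here is an illustrative assumption (the status-code sets, the constants, and the `do_charge` callback shape), not the payment provider's actual API:

```python
import random
import uuid

BASE, MULT, MAX_DELAY, MAX_ATTEMPTS = 1.0, 2, 32.0, 5
RETRYABLE = {408, 429, 503}           # also retry on network timeouts
NON_RETRYABLE = {400, 401, 402, 422}  # client errors: retrying cannot help

def backoff_delay(attempt, jitter=True):
    """Seconds to wait before retry `attempt` (1-based).

    Full jitter draws uniformly from [0, cap], spreading out the retries
    of clients that all failed at the same instant.
    """
    cap = min(MAX_DELAY, BASE * MULT ** (attempt - 1))
    return random.uniform(0.0, cap) if jitter else cap

def charge_with_retries(do_charge, sleep=lambda s: None):
    """do_charge(idempotency_key) -> HTTP status code.

    The idempotency key is generated once per logical charge and reused
    on every attempt, so a retry after an ambiguous failure cannot
    double-charge the card.
    """
    key = str(uuid.uuid4())
    for attempt in range(1, MAX_ATTEMPTS + 1):
        status = do_charge(key)
        if status not in RETRYABLE:
            return status
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt))
    return status
```

Note that with 5 attempts there are only 4 waits; disabling jitter makes the per-attempt caps (1, 2, 4, 8, 16 s) easy to verify against the formula in the task.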
Circuit Breaker Design for a Service with 99.5% Success Rate
A recommendation service calls a downstream ML inference API. The ML API has a baseline success rate of 99.5% (0.5% error rate). During an incident, the error rate spikes to 45% for about 8 minutes before auto-recovery. During this spike, your recommendation service continues retrying all failed calls, amplifying load on the already-struggling ML API and making recovery slower. You need to design a circuit breaker that opens fast enough to protect the ML API but doesn't trip on normal 0.5% noise.
Tasks
- Define the three circuit breaker states (Closed, Open, Half-Open) and describe the exact transition conditions between each state for this service, using specific thresholds (failure %, minimum request count, timeout duration).
- Design the sliding window: should you use a count-based window (last N requests) or time-based window (last N seconds)? For a service handling 500 req/sec, calculate the minimum window size to distinguish a genuine 45% failure rate from a statistical fluctuation of the normal 0.5%.
- Design the Half-Open probe strategy: after the circuit opens and waits 30 seconds, how many probe requests do you send to the ML API, and what does "success" look like to transition back to Closed vs staying Open?
- Calculate the error budget savings: if the circuit opens in 3 seconds after the incident starts (vs 8 minutes of continued retries), how many failed calls and how much amplified load does the circuit breaker prevent, given 500 req/sec with 45% failure rate?
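One way to prototype the three-state breaker from the tasks above is a count-based sliding window. The thresholds, window size, probe count, and timeout below are placeholder assumptions to be tuned as the tasks demand; the injectable clock is only there to make the state machine testable.

```python
import time
from collections import deque

class CircuitBreaker:
    """Count-based sliding-window circuit breaker (sketch, assumed thresholds)."""

    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half_open"

    def __init__(self, failure_threshold=0.25, min_requests=100,
                 window_size=500, open_timeout=30.0, probe_count=5,
                 clock=time.monotonic):
        self.state = self.CLOSED
        self.window = deque(maxlen=window_size)  # True = failed call
        self.failure_threshold = failure_threshold
        self.min_requests = min_requests          # avoid tripping on noise
        self.open_timeout = open_timeout
        self.probe_count = probe_count
        self.clock = clock
        self._opened_at = 0.0
        self._probe_results = []

    def allow_request(self):
        if self.state == self.OPEN:
            if self.clock() - self._opened_at >= self.open_timeout:
                self.state = self.HALF_OPEN      # start probing
                self._probe_results = []
                return True
            return False                          # shed load while open
        return True

    def record(self, success):
        if self.state == self.HALF_OPEN:
            self._probe_results.append(success)
            if not success:
                self._trip()                      # one failed probe -> reopen
            elif len(self._probe_results) >= self.probe_count:
                self.state = self.CLOSED          # all probes passed
                self.window.clear()
            return
        self.window.append(not success)
        if (len(self.window) >= self.min_requests and
                sum(self.window) / len(self.window) >= self.failure_threshold):
            self._trip()

    def _trip(self):
        self.state = self.OPEN
        self._opened_at = self.clock()
```

The `min_requests` guard is what keeps the normal 0.5% error rate from tripping the breaker; the window-size calculation in the second task determines how large it must be at 500 req/sec.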
Chaos Engineering Plan for a Checkout Service
An e-commerce platform is preparing for Black Friday. The checkout service depends on: Payment API (external), Inventory Service (internal), User Profile Service (internal), Tax Calculation Service (internal), and a PostgreSQL database. The CTO has mandated a chaos engineering exercise before the sale to identify hidden single points of failure and validate that the system degrades gracefully. You are the chaos engineering lead and must design 5 experiments, each with a specific failure scenario and expected system behavior.
Tasks
- Design 5 chaos experiments. For each: name the failure scenario, the injection method (latency, error injection, resource kill, network partition), the blast radius (which users/% of traffic are affected), and the expected behavior (graceful degradation vs full failure).
- For experiment "Tax Service returns 503 for 60 seconds": should checkout still complete? Design the fallback behavior (default tax rate, cached tax rate, or hard block) and justify based on legal and UX requirements.
- For experiment "PostgreSQL primary becomes unreachable for 45 seconds": walk through the exact sequence of events in the checkout service — connection pool behavior, retry exhaustion, error propagation to the user, and what happens when the DB recovers.
- Design the steady-state hypothesis and success criteria: how do you define "the system is healthy" in terms of measurable SLOs before starting each experiment, and what metric threshold triggers an automatic rollback of the chaos injection?
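A minimal sketch of the steady-state check from the last task, assuming hypothetical SLO names and thresholds: a guardrail function returns the list of violated SLOs, and an experiment runner aborts the chaos injection the moment that list is non-empty.

```python
# Assumed steady-state SLOs for the checkout service; every name and
# bound here is a placeholder to be replaced with real dashboard metrics.
SLOS = {
    "checkout_success_rate":   (">=", 0.995),
    "checkout_p99_latency_ms": ("<=", 1200),
    "payment_error_rate":      ("<=", 0.01),
}

def steady_state_ok(metrics):
    """Return the violated SLO names; an empty list means keep injecting."""
    violations = []
    for name, (op, bound) in SLOS.items():
        value = metrics[name]
        ok = value >= bound if op == ">=" else value <= bound
        if not ok:
            violations.append(name)
    return violations
```

Checking the same hypothesis before the experiment starts (to confirm the system is healthy) and continuously during injection (to trigger automatic rollback) is the core loop of each of the 5 experiments.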