Day 20

Distributed Observability

Metrics, logs, and traces are the three pillars — but knowing how to use them together, define meaningful SLOs, and alert on burn rate rather than thresholds is what separates reactive from proactive engineering.

Exercise 1 · 🟢 Easy · 15 min
Calculating Error Budget for a 99.95% Monthly SLO
Your payment processing service has a published SLO of 99.95% availability per calendar month. The service processes 4 million transactions/month. Your SRE team needs to translate this SLO into concrete error budget numbers to guide release decisions, incident escalation thresholds, and feature development pace. A new feature deployment is planned that historically causes 0.02% error rate for 30 minutes during rollout.

Tasks

  • Calculate the monthly error budget in minutes of downtime, number of failed transactions, and percentage of requests allowed to fail — show all three representations.
  • Calculate how much of the monthly error budget the planned feature deployment consumes (0.02% error rate × 30 min × ~93 req/min, since 4 million transactions over a 30-day month averages ~93 requests/min) and state whether this is acceptable.
  • Explain the error budget policy decision: if the service has already consumed 80% of its monthly error budget by the 20th of the month, should the planned deployment proceed? What process should govern this decision?
  • Define what "availability" means for this specific payment service — is it request success rate, latency-based (requests completing under 500ms), or both? How does the definition change the error budget calculation?
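The arithmetic behind the first two tasks can be sketched in a few lines of Python (the `error_budget` helper is illustrative, assuming a 30-day / 43,200-minute month):

```python
def error_budget(slo: float,
                 minutes_per_month: int = 43_200,      # 30-day month
                 requests_per_month: int = 4_000_000):
    """Express an availability SLO as three equivalent budget numbers."""
    budget_fraction = 1 - slo                          # 99.95% -> 0.05%
    return {
        "allowed_failure_pct": budget_fraction * 100,
        "downtime_minutes": budget_fraction * minutes_per_month,
        "failed_requests": budget_fraction * requests_per_month,
    }

budget = error_budget(0.9995)
# ~0.05% of requests, ~21.6 minutes of full outage, ~2,000 failed transactions

# Planned deployment: 0.02% error rate for 30 min at ~93 req/min (4M / 43,200)
deploy_failures = 0.0002 * 93 * 30                     # under 1 transaction
budget_consumed = deploy_failures / budget["failed_requests"]
```

Note how tiny the deployment's cost is relative to the budget; the interesting question in the third task is what changes once 80% of the budget is already gone.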
Your Notes
Exercise 2 · 🔴 Medium · 30 min
Diagnosing p50=50ms but p99=4000ms — Finding the Structural Cause
A distributed tracing dashboard shows your API gateway has p50 latency of 50ms (excellent) but p99 latency of 4,000ms (alarming). The service handles 10,000 requests/sec. This extreme percentile divergence — 80x difference between median and 99th percentile — means 1% of users (100 requests/sec) experience 4-second latency while 99% experience 50ms. This is not a uniform slowdown but a structural pattern. Your job is to diagnose the root cause and propose a fix.

Tasks

  • List 4 structural causes that produce high p99 but normal p50 (not random noise): head-of-line blocking, GC pauses, hot partitions, synchronous external calls. For each, explain the mechanism that affects only the tail.
  • Describe how distributed tracing (OpenTelemetry spans) would reveal which of these causes is responsible — what does the trace waterfall look like for a 4000ms request vs a 50ms request?
  • Focus on the "hot partitions" cause: if 1% of requests trigger a query on a hot database shard (key-range hotspot), how would you identify this using slow query logs + trace correlation, and what schema/sharding fix resolves it?
  • Explain why optimizing for p50 (median) latency is insufficient for user experience — calculate the probability that a user making 10 API calls in a session experiences at least one p99 event, and why this matters for SLO design.
Your Notes
Exercise 3 · 🔴 Medium · 35 min
Designing the 3 Observability Signals for a Payment Processing Service
Your team is instrumenting a newly built payment processing service from scratch using OpenTelemetry. The service receives payment requests, calls a fraud detection API, charges a payment gateway, and records the transaction to a PostgreSQL database. The service processes 2,000 payments/minute. You must design all three observability signals (metrics, structured logs, distributed traces) to give on-call engineers full visibility into any incident within 60 seconds of it starting.

Tasks

  • Design the metrics: list 8 specific Prometheus metrics (counters, gauges, histograms) with their labels. Include payment success rate, fraud API latency, DB connection pool usage, and payment amount distribution.
  • Design the structured log schema: what fields must every log line contain for a payment event? Include trace_id (for correlation), payment_id, user_id, amount, currency, status, fraud_score, duration_ms, and error_code. Show a JSON log example for a failed payment.
  • Design the distributed trace: what spans should exist within a single payment transaction trace? For each span, name it, identify the span kind (client/server/internal), and list 3 span attributes that would help diagnose a latency spike in that span.
  • Explain trace sampling strategy: at 2,000 payments/min, storing 100% of traces costs ~$8,000/month in your observability stack. Design a sampling strategy that ensures all failed payments and all payments over 1 second are traced, while sampling successful fast payments at 1%.
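One way to think about the sampling task is as a per-trace keep/drop decision. A minimal sketch in plain Python (the function name is illustrative; a real deployment would express this same policy as a tail-based sampling processor configuration in the OpenTelemetry Collector):

```python
import random

def keep_trace(status: str, duration_ms: float, base_rate: float = 0.01) -> bool:
    """Tail-based sampling policy: keep every failed payment and every
    payment slower than 1 second; sample fast successes at base_rate (1%)."""
    if status != "success":
        return True          # all failed payments are traced
    if duration_ms > 1000:
        return True          # all payments over 1 second are traced
    return random.random() < base_rate   # 1% of fast successes
```

Because failures and slow requests are rare, this keeps roughly 1% of total trace volume while guaranteeing that every trace an on-call engineer actually needs is present.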
Your Notes
Exercise 4 · 🔥 Hard · 55 min
SLO Alerting System with Multi-Window Burn Rate Alerts
Your payment service has a 99.9% monthly availability SLO. The SRE team currently uses a simple threshold alert: "page if error rate exceeds 1% for 5 minutes." This alert has two problems: it fires too late (a 0.9% error rate sustained for two days silently burns roughly 60% of the monthly budget yet never crosses the threshold) and it fires too often on noise (brief 2-minute spikes that recover immediately). Google's SRE workbook recommends multi-window burn rate alerts to solve both problems. You must design this alerting system.

Tasks

  • Calculate the monthly error budget in minutes (99.9% SLO): how many minutes of full outage are allowed per month? What is the error budget consumption rate that burns the entire budget in exactly 1 hour (a fast burn)?
  • Design the 3-level burn rate alert system: (Fast burn: 14.4x burn rate, 1h + 5min windows — page immediately), (Medium burn: 6x rate, 6h + 30min windows — page urgently), (Slow burn: 1x rate, 3d + 6h windows — create ticket). For each, calculate the actual error rate threshold that triggers each alert.
  • Explain the "short window + long window" approach: why does the Google SRE model require BOTH a 1-hour window AND a 5-minute window to both exceed the threshold before paging? What problem does the short window solve that the long window cannot?
  • Design the alert routing: who gets paged at each level (fast=on-call SRE, medium=team lead, slow=next-day ticket), what runbook link is included, and how do you prevent alert fatigue from the medium burn alert firing during every maintenance window?
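The two-window check at the heart of the tasks above can be sketched as follows (the helper names are illustrative; burn rate here is observed error rate divided by the budget fraction 1 − SLO, and the 14.4x example follows the Google SRE workbook's fast-burn recommendation):

```python
def burn_rate(error_rate: float, slo: float = 0.999) -> float:
    """Burn rate = observed error rate / error-budget fraction (1 - SLO).
    A burn rate of 1x exhausts the budget in exactly one SLO window."""
    return error_rate / (1 - slo)

def should_page(long_err: float, short_err: float, threshold: float,
                slo: float = 0.999) -> bool:
    """Page only when BOTH windows exceed the burn-rate threshold: the
    long window proves the burn is sustained (not a brief spike); the
    short window proves it is still happening now, so the alert stops
    firing quickly once the incident recovers."""
    return (burn_rate(long_err, slo) >= threshold
            and burn_rate(short_err, slo) >= threshold)

# Fast burn at 14.4x: error-rate threshold = 14.4 * 0.1% = 1.44%
fires = should_page(long_err=0.02, short_err=0.025, threshold=14.4)
recovered = should_page(long_err=0.02, short_err=0.001, threshold=14.4)
```

In the second call the 1-hour window still shows a 2% error rate from the earlier incident, but the 5-minute window has dropped to 0.1%, so no page is sent — exactly the reset behavior the short window exists to provide.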
Your Notes