Day 20

Distributed Observability

Metrics, logs, and traces are the three pillars — but knowing how to use them together, define meaningful SLOs, and alert on burn rate rather than thresholds is what separates reactive from proactive engineering.

Exercise 1 · 🟢 Easy · 15 min
Calculating Error Budget for a 99.95% Monthly SLO
Your payment processing service has a published SLO of 99.95% availability per calendar month. The service processes 4 million transactions/month. Your SRE team needs to translate this SLO into concrete error budget numbers to guide release decisions, incident escalation thresholds, and feature development pace. A new feature deployment is planned that historically causes 0.02% error rate for 30 minutes during rollout.

Tasks

  • Calculate the monthly error budget in minutes of downtime, number of failed transactions, and percentage of requests allowed to fail — show all three representations.
  • Calculate how much of the monthly error budget the planned feature deployment consumes (0.02% error rate × 30 min × ~93 req/min, since 4 million transactions over a 30-day month averages ~93 requests/min) and state whether this is acceptable.
  • Explain the error budget policy decision: if the service has already consumed 80% of its monthly error budget by the 20th of the month, should the planned deployment proceed? What process should govern this decision?
  • Define what "availability" means for this specific payment service — is it request success rate, latency-based (requests completing under 500ms), or both? How does the definition change the error budget calculation?
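The arithmetic behind the first two tasks can be sketched in a few lines of Python (the `error_budget` helper is illustrative, assuming a 30-day / 43,200-minute month):

```python
def error_budget(slo: float,
                 minutes_per_month: int = 43_200,      # 30-day month
                 requests_per_month: int = 4_000_000):
    """Express an availability SLO as three equivalent budget numbers."""
    budget_fraction = 1 - slo                          # 99.95% -> 0.05%
    return {
        "allowed_failure_pct": budget_fraction * 100,
        "downtime_minutes": budget_fraction * minutes_per_month,
        "failed_requests": budget_fraction * requests_per_month,
    }

budget = error_budget(0.9995)
# ~0.05% of requests, ~21.6 minutes of full outage, ~2,000 failed transactions

# Planned deployment: 0.02% error rate for 30 min at ~93 req/min (4M / 43,200)
deploy_failures = 0.0002 * 93 * 30                     # under 1 transaction
budget_consumed = deploy_failures / budget["failed_requests"]
```

Note how tiny the deployment's cost is relative to the budget; the interesting question in the third task is what changes once 80% of the budget is already gone.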
Your Notes
Exercise 2 · 🔴 Medium · 30 min
Diagnosing p50=50ms but p99=4000ms — Finding the Structural Cause
A distributed tracing dashboard shows your API gateway has p50 latency of 50ms (excellent) but p99 latency of 4,000ms (alarming). The service handles 10,000 requests/sec. This extreme percentile divergence — 80x difference between median and 99th percentile — means 1% of users (100 requests/sec) experience 4-second latency while 99% experience 50ms. This is not a uniform slowdown but a structural pattern. Your job is to diagnose the root cause and propose a fix.

Tasks

  • List 4 structural causes that produce high p99 but normal p50 (not random noise): head-of-line blocking, GC pauses, hot partitions, synchronous external calls. For each, explain the mechanism that affects only the tail.
  • Describe how distributed tracing (OpenTelemetry spans) would reveal which of these causes is responsible — what does the trace waterfall look like for a 4000ms request vs a 50ms request?
  • Focus on the "hot partitions" cause: if 1% of requests trigger a query on a hot database shard (key-range hotspot), how would you identify this using slow query logs + trace correlation, and what schema/sharding fix resolves it?
  • Explain why optimizing for p50 (median) latency is insufficient for user experience — calculate the probability that a user making 10 API calls in a session experiences at least one p99 event, and why this matters for SLO design.
Your Notes
Exercise 3 · 🔴 Medium · 35 min
Designing the 3 Observability Signals for a Payment Processing Service
Your team is instrumenting a newly built payment processing service from scratch using OpenTelemetry. The service receives payment requests, calls a fraud detection API, charges a payment gateway, and records the transaction to a PostgreSQL database. The service processes 2,000 payments/minute. You must design all three observability signals (metrics, structured logs, distributed traces) to give on-call engineers full visibility into any incident within 60 seconds of it starting.

Tasks

  • Design the metrics: list 8 specific Prometheus metrics (counters, gauges, histograms) with their labels. Include payment success rate, fraud API latency, DB connection pool usage, and payment amount distribution.
  • Design the structured log schema: what fields must every log line contain for a payment event? Include trace_id (for correlation), payment_id, user_id, amount, currency, status, fraud_score, duration_ms, and error_code. Show a JSON log example for a failed payment.
  • Design the distributed trace: what spans should exist within a single payment transaction trace? For each span, name it, identify the span kind (client/server/internal), and list 3 span attributes that would help diagnose a latency spike in that span.
  • Explain trace sampling strategy: at 2,000 payments/min, storing 100% of traces costs ~$8,000/month in your observability stack. Design a sampling strategy that ensures all failed payments and all payments over 1 second are traced, while sampling successful fast payments at 1%.
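One way to think about the sampling task is as a per-trace keep/drop decision. A minimal sketch in plain Python (the function name is illustrative; a real deployment would express this same policy as a tail-based sampling processor configuration in the OpenTelemetry Collector):

```python
import random

def keep_trace(status: str, duration_ms: float, base_rate: float = 0.01) -> bool:
    """Tail-based sampling policy: keep every failed payment and every
    payment slower than 1 second; sample fast successes at base_rate (1%)."""
    if status != "success":
        return True          # all failed payments are traced
    if duration_ms > 1000:
        return True          # all payments over 1 second are traced
    return random.random() < base_rate   # 1% of fast successes
```

Because failures and slow requests are rare, this keeps roughly 1% of total trace volume while guaranteeing that every trace an on-call engineer actually needs is present.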
Your Notes
Exercise 4 · 🔥 Hard · 55 min
SLO Alerting System with Multi-Window Burn Rate Alerts
Your payment service has a 99.9% monthly availability SLO. The SRE team currently uses a simple threshold alert: "page if error rate exceeds 1% for 5 minutes." This alert has two problems: it fires too late (a 0.9% error rate sustained for two days silently burns roughly 60% of the monthly budget yet never crosses the threshold) and it fires too often on noise (brief 2-minute spikes that recover immediately). Google's SRE workbook recommends multi-window burn rate alerts to solve both problems. You must design this alerting system.

Tasks

  • Calculate the monthly error budget in minutes (99.9% SLO): how many minutes of full outage are allowed per month? What is the error budget consumption rate that burns the entire budget in exactly 1 hour (a fast burn)?
  • Design the 3-level burn rate alert system: (Fast burn: 14.4x burn rate, 1h + 5min windows — page immediately), (Medium burn: 6x rate, 6h + 30min windows — page urgently), (Slow burn: 1x rate, 3d + 6h windows — create ticket). For each, calculate the actual error rate threshold that triggers each alert.
  • Explain the "short window + long window" approach: why does the Google SRE model require BOTH a 1-hour window AND a 5-minute window to both exceed the threshold before paging? What problem does the short window solve that the long window cannot?
  • Design the alert routing: who gets paged at each level (fast=on-call SRE, medium=team lead, slow=next-day ticket), what runbook link is included, and how do you prevent alert fatigue from the medium burn alert firing during every maintenance window?
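The two-window check at the heart of the tasks above can be sketched as follows (the helper names are illustrative; burn rate here is observed error rate divided by the budget fraction 1 − SLO, and the 14.4x example follows the Google SRE workbook's fast-burn recommendation):

```python
def burn_rate(error_rate: float, slo: float = 0.999) -> float:
    """Burn rate = observed error rate / error-budget fraction (1 - SLO).
    A burn rate of 1x exhausts the budget in exactly one SLO window."""
    return error_rate / (1 - slo)

def should_page(long_err: float, short_err: float, threshold: float,
                slo: float = 0.999) -> bool:
    """Page only when BOTH windows exceed the burn-rate threshold: the
    long window proves the burn is sustained (not a brief spike); the
    short window proves it is still happening now, so the alert stops
    firing quickly once the incident recovers."""
    return (burn_rate(long_err, slo) >= threshold
            and burn_rate(short_err, slo) >= threshold)

# Fast burn at 14.4x: error-rate threshold = 14.4 * 0.1% = 1.44%
fires = should_page(long_err=0.02, short_err=0.025, threshold=14.4)
recovered = should_page(long_err=0.02, short_err=0.001, threshold=14.4)
```

In the second call the 1-hour window still shows a 2% error rate from the earlier incident, but the 5-minute window has dropped to 0.1%, so no page is sent — exactly the reset behavior the short window exists to provide.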
Your Notes