You can't fix what you can't see. Build observability into your system: logs, metrics, traces — and the tools to make sense of them.
A typical debugging flow: an alert fires on a P99 latency spike (metrics) → find the slow traces (traces) → pull the logs for that trace ID to see the error (logs). The three signals are complementary; full observability needs all of them.
[Interactive demo: a user checkout request flows through microservices; each bar is one span; an "Inject Error" control simulates a database timeout.]
Every request gets a unique trace_id propagated via HTTP headers (W3C Trace Context: traceparent). Each service creates a child span with its own span_id, parent_span_id, start/end timestamps, and attributes.
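The traceparent layout can be sketched in a few lines. This is a simplified illustration of the W3C header format (version-traceid-spanid-flags), with hypothetical helper names; in practice the OTel propagators build and parse this header for you:

```python
import re

def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id:032x}-{span_id:016x}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    """Parse a traceparent header; return None if it is malformed."""
    m = re.fullmatch(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    return {
        "version": m.group(1),
        "trace_id": int(m.group(2), 16),
        "parent_span_id": int(m.group(3), 16),
        "sampled": int(m.group(4), 16) & 0x01 == 1,
    }
```

Each downstream service parses the incoming header, records the parent span ID, and forwards a new header with its own span ID but the same trace ID, which is what ties all the spans into one trace.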
Tracing 100% of requests at 10K req/s generates massive data. Common sampling strategies: head-based (decide at the start of a trace, e.g. keep a fixed 1%), tail-based (decide after the trace completes, keeping errors and slow traces), and rate-limited (cap the number of traces kept per second).
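A minimal sketch of the two main sampling decisions, with made-up thresholds (`head_sample` and `tail_sample` are illustrative helpers, not OTel APIs):

```python
def head_sample(trace_id: int, rate: float = 0.01) -> bool:
    """Head-based sampling: decide at the trace's start. Keying the
    decision off the trace_id keeps it deterministic, so every service
    in the call chain makes the same keep/drop choice."""
    return (trace_id % 10_000) < rate * 10_000

def tail_sample(spans: list, latency_threshold_ms: float = 500) -> bool:
    """Tail-based sampling: decide after the trace completes. Keep any
    trace containing an error or a slow span; drop the boring ones."""
    return any(s.get("error") or s["duration_ms"] > latency_threshold_ms
               for s in spans)
```

Head-based sampling is cheap but blind: it drops 99% of traces before knowing whether they were interesting. Tail-based sampling keeps exactly the traces you debug with, at the cost of buffering every span until the trace completes (which is why it usually runs in a collector, not in the application).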
Vendor-neutral instrumentation framework for traces, metrics, and logs. One SDK, multiple backends: Jaeger, Zipkin, Datadog, Honeycomb. OTel Collector aggregates and routes telemetry.
SLI (Indicator): What you measure — e.g., request success rate. SLO (Objective): Target — 99.9% success rate over 30 days. SLA (Agreement): Contract — 99.9% or we refund. Error budget = 1 - SLO = how much you can spend on outages before breaching.
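The error-budget arithmetic is easy to check; a 30-day window is assumed here:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed before breaching the SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

error_budget_minutes(0.999)  # ≈ 43.2 min for an exact 30-day window
```

The commonly quoted "43.8 min/month" figure for a 99.9% SLO uses the average month length (365.25 days / 12) rather than an exact 30-day window.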
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
import structlog

# Set up the tracer provider (exported to Jaeger/Tempo via OTLP)
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Auto-instrument common libraries (no code changes in handlers)
FastAPIInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
RedisInstrumentor().instrument()

tracer = trace.get_tracer("checkout-service")
log = structlog.get_logger()

async def process_checkout(order_id: str, user_id: str):
    with tracer.start_as_current_span("checkout.process") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", user_id)
        # Structured log with trace context for correlation
        log.info("checkout.started",
                 order_id=order_id,
                 trace_id=format(span.get_span_context().trace_id, "032x"))
        try:
            with tracer.start_as_current_span("checkout.validate_inventory") as child:
                result = await inventory_service.check(order_id)
                child.set_attribute("items.available", result["available"])
            with tracer.start_as_current_span("checkout.charge_payment") as child:
                charge = await payment_service.charge(user_id, result["total"])
                child.set_attribute("payment.method", charge["method"])
            span.set_attribute("checkout.status", "success")
            return {"status": "confirmed"}
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            raise
```
Latency: time to serve a request. Distinguish success from error latency, and track P50, P95, and P99, not just the average: a 1 s average can hide 10% of users waiting 10 s.
Traffic: demand placed on your system. Requests per second, active connections, messages/sec. Used for capacity planning and for detecting unusual spikes.
Errors: rate of failed requests, both explicit (5xx) and implicit (a 200 with the wrong content). Error budget: a 99.9% SLO allows roughly 43 minutes of downtime per month.
Saturation: how "full" your service is. CPU%, memory%, disk I/O%, queue depth. A leading indicator: saturation predicts problems before errors occur.
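The latency point above (an average can hide the tail) is easy to demonstrate with a nearest-rank percentile over made-up numbers:

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: smallest value >= p% of samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 90 fast requests at 100 ms, 10 slow ones at 10 s:
latencies = [0.1] * 90 + [10.0] * 10
mean = sum(latencies) / len(latencies)  # 1.09 s: looks "fine"
p50 = percentile(latencies, 50)         # 0.1 s
p99 = percentile(latencies, 99)         # 10.0 s: the real user pain
```

The average (1.09 s) looks acceptable, but one in ten users is waiting ten seconds, which only the P99 reveals.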
| Category | Open Source | Commercial |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic, Dynatrace |
| Tracing | Jaeger, Zipkin, Tempo | Honeycomb, Lightstep, AWS X-Ray |
| Logs | Loki, Elasticsearch (ELK) | Splunk, Datadog Logs, Papertrail |
| Alerting | Alertmanager, Grafana OnCall | PagerDuty, OpsGenie, VictorOps |
| Instrumentation | OpenTelemetry SDK | Datadog APM, New Relic agents |
| All-in-one | SigNoz, Uptrace | Datadog, Honeycomb, Grafana Cloud |
1. Your P99 latency alert fires (metrics). What's the best next step to find the cause?
2. What is the advantage of tail-based sampling over head-based sampling?
3. Your SLO is 99.9% request success rate over 30 days. You've used 80% of your error budget in week 1. What should you do?
4. Why should you use P99 latency rather than average latency for SLOs?
5. What is the W3C Trace Context (traceparent header) used for?
You can now build observable distributed systems. Next: Distributed KV Store Capstone.