You can't fix what you can't see. Build observability into your system: logs, metrics, traces — and the tools to make sense of them.
A typical debugging flow: an alert fires on a P99 latency spike (metrics) → find the slow traces (traces) → pull the logs for that trace ID to see the error (logs). The three signals are complementary; full observability needs all of them.
[Interactive demo: a user checkout request flows through microservices; each bar is one span; an "Inject Error" control simulates a database timeout.]
Every request gets a unique trace_id propagated via HTTP headers (W3C Trace Context: traceparent). Each service creates a child span with its own span_id, parent_span_id, start/end timestamps, and attributes.
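The traceparent layout can be sketched in a few lines. This is a simplified illustration of the W3C header format (version-traceid-spanid-flags), with hypothetical helper names; in practice the OTel propagators build and parse this header for you:

```python
import re

def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id:032x}-{span_id:016x}-{'01' if sampled else '00'}"

def parse_traceparent(header: str):
    """Parse a traceparent header; return None if it is malformed."""
    m = re.fullmatch(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    return {
        "version": m.group(1),
        "trace_id": int(m.group(2), 16),
        "parent_span_id": int(m.group(3), 16),
        "sampled": int(m.group(4), 16) & 0x01 == 1,
    }
```

Each downstream service parses the incoming header, records the parent span ID, and forwards a new header with its own span ID but the same trace ID, which is what ties all the spans into one trace.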
Tracing 100% of requests at 10K req/s generates massive data. Common sampling strategies: head-based (decide at the start of a trace, e.g. keep a fixed 1%), tail-based (decide after the trace completes, keeping errors and slow traces), and rate-limited (cap the number of traces kept per second).
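A minimal sketch of the two main sampling decisions, with made-up thresholds (`head_sample` and `tail_sample` are illustrative helpers, not OTel APIs):

```python
def head_sample(trace_id: int, rate: float = 0.01) -> bool:
    """Head-based sampling: decide at the trace's start. Keying the
    decision off the trace_id keeps it deterministic, so every service
    in the call chain makes the same keep/drop choice."""
    return (trace_id % 10_000) < rate * 10_000

def tail_sample(spans: list, latency_threshold_ms: float = 500) -> bool:
    """Tail-based sampling: decide after the trace completes. Keep any
    trace containing an error or a slow span; drop the boring ones."""
    return any(s.get("error") or s["duration_ms"] > latency_threshold_ms
               for s in spans)
```

Head-based sampling is cheap but blind: it drops 99% of traces before knowing whether they were interesting. Tail-based sampling keeps exactly the traces you debug with, at the cost of buffering every span until the trace completes (which is why it usually runs in a collector, not in the application).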
Vendor-neutral instrumentation framework for traces, metrics, and logs. One SDK, multiple backends: Jaeger, Zipkin, Datadog, Honeycomb. OTel Collector aggregates and routes telemetry.
SLI (Indicator): What you measure — e.g., request success rate. SLO (Objective): Target — 99.9% success rate over 30 days. SLA (Agreement): Contract — 99.9% or we refund. Error budget = 1 - SLO = how much you can spend on outages before breaching.
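The error-budget arithmetic is easy to check; a 30-day window is assumed here:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed before breaching the SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

error_budget_minutes(0.999)  # ≈ 43.2 min for an exact 30-day window
```

The commonly quoted "43.8 min/month" figure for a 99.9% SLO uses the average month length (365.25 days / 12) rather than an exact 30-day window.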
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
import structlog

# Set up the tracer provider (exported to Jaeger/Tempo via OTLP)
provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Auto-instrument common libraries (no code changes in handlers)
FastAPIInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
RedisInstrumentor().instrument()

tracer = trace.get_tracer("checkout-service")
log = structlog.get_logger()

async def process_checkout(order_id: str, user_id: str):
    with tracer.start_as_current_span("checkout.process") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", user_id)
        # Structured log with trace context for correlation
        log.info("checkout.started",
                 order_id=order_id,
                 trace_id=format(span.get_span_context().trace_id, "032x"))
        try:
            with tracer.start_as_current_span("checkout.validate_inventory") as child:
                result = await inventory_service.check(order_id)
                child.set_attribute("items.available", result["available"])
            with tracer.start_as_current_span("checkout.charge_payment") as child:
                charge = await payment_service.charge(user_id, result["total"])
                child.set_attribute("payment.method", charge["method"])
            span.set_attribute("checkout.status", "success")
            return {"status": "confirmed"}
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            raise
```
Latency: time to serve a request. Distinguish success from error latency, and track P50, P95, and P99, not just the average: a 1 s average can hide 10% of users waiting 10 s.
Traffic: demand placed on your system. Requests per second, active connections, messages/sec. Used for capacity planning and for detecting unusual spikes.
Errors: rate of failed requests, both explicit (5xx) and implicit (a 200 with the wrong content). Error budget: a 99.9% SLO allows roughly 43 minutes of downtime per month.
Saturation: how "full" your service is. CPU%, memory%, disk I/O%, queue depth. A leading indicator: saturation predicts problems before errors occur.
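The latency point above (an average can hide the tail) is easy to demonstrate with a nearest-rank percentile over made-up numbers:

```python
import math

def percentile(samples: list, p: float) -> float:
    """Nearest-rank percentile: smallest value >= p% of samples."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 90 fast requests at 100 ms, 10 slow ones at 10 s:
latencies = [0.1] * 90 + [10.0] * 10
mean = sum(latencies) / len(latencies)  # 1.09 s: looks "fine"
p50 = percentile(latencies, 50)         # 0.1 s
p99 = percentile(latencies, 99)         # 10.0 s: the real user pain
```

The average (1.09 s) looks acceptable, but one in ten users is waiting ten seconds, which only the P99 reveals.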
| Category | Open Source | Commercial |
|---|---|---|
| Metrics | Prometheus + Grafana | Datadog, New Relic, Dynatrace |
| Tracing | Jaeger, Zipkin, Tempo | Honeycomb, Lightstep, AWS X-Ray |
| Logs | Loki, Elasticsearch (ELK) | Splunk, Datadog Logs, Papertrail |
| Alerting | Alertmanager, Grafana OnCall | PagerDuty, OpsGenie, VictorOps |
| Instrumentation | OpenTelemetry SDK | Datadog APM, New Relic agents |
| All-in-one | SigNoz, Uptrace | Datadog, Honeycomb, Grafana Cloud |
1. Your P99 latency alert fires (metrics). What's the best next step to find the cause?
2. What is the advantage of tail-based sampling over head-based sampling?
3. Your SLO is 99.9% request success rate over 30 days. You've used 80% of your error budget in week 1. What should you do?
4. Why should you use P99 latency rather than average latency for SLOs?
5. What is the W3C Trace Context (traceparent header) used for?
You can now build observable distributed systems. Next: Distributed KV Store Capstone.