Service Discovery & API Gateway
Client-side vs server-side discovery, circuit breakers, API gateway patterns, and service mesh basics.
4 Exercises
12 Concept Checks
~90 min total
System Design
Service Discovery Mechanism
A payments microservice needs to call the fraud detection service. The fraud service runs on dynamic IPs — Kubernetes pods spin up and down constantly. Hardcoding the IP in the payments service config causes connection failures every time the fraud service restarts. You need a discovery mechanism that always resolves to a live instance.
Architecture Diagram
💳 Payments Service
→
🗂 Service Registry (Consul/etcd)
→
🔍 Fraud Pod 1
🔍 Fraud Pod 2
🔍 Fraud Pod 3
Concept Check — 3 questions
Q1. How does Kubernetes service discovery work for pod-to-pod communication?
A. Each pod registers its IP in a shared config file read by other pods
B. Kubernetes DNS + Service resource — each Service gets a stable DNS name like fraud-svc.payments.svc.cluster.local that always resolves to a live pod
C. Ops manually updates a ConfigMap every time pods restart
D. Each pod gets its own load balancer with a static IP
Q2. Client-side discovery: the client queries a service registry and picks an instance. Server-side discovery: requests go to a load balancer/proxy. What is the key advantage of server-side discovery?
A. It is always faster because the load balancer has a local cache
B. It uses less memory on each client service
C. It decouples the client from discovery logic — the client just calls the LB endpoint; registry logic lives in the infrastructure
D. It is cheaper to operate than client-side discovery
Q3. A service instance crashes without deregistering from the service registry. Health checks prevent routing to dead instances by doing what?
A. Periodically polling the instance's /health endpoint and removing it from the registry on failure
B. Waiting for a TCP connection timeout before removing the entry
C. Scanning the instance's log files for ERROR entries
D. Requiring manual removal by the operations team
Kubernetes creates a virtual Service IP (ClusterIP) backed by DNS — the name fraud-svc resolves to a virtual IP that kube-proxy load-balances across healthy pods. Server-side discovery moves complexity out of every client. Health checks with short intervals (5–10s) catch crashes quickly — Consul default TTL is 30s, Kubernetes liveness probes can be configured to 5s.
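The health-check behavior described above can be sketched in a few lines. This is a hypothetical in-memory registry, not Consul's or Kubernetes' actual API: instances register, a poller runs each instance's health check, and an instance that fails several consecutive checks is evicted even though it never deregistered itself.

```python
class Registry:
    """Toy service registry with active health checking (illustrative only)."""

    def __init__(self, check, max_failures=3):
        self.instances = {}        # address -> consecutive failure count
        self.check = check         # callable(address) -> bool, e.g. GET /health
        self.max_failures = max_failures

    def register(self, address):
        self.instances[address] = 0

    def healthy(self):
        # Addresses that clients may be routed to.
        return [a for a, fails in self.instances.items() if fails < self.max_failures]

    def poll_once(self):
        # Run one health-check sweep; evict instances that keep failing.
        for address in list(self.instances):
            if self.check(address):
                self.instances[address] = 0
            else:
                self.instances[address] += 1
                if self.instances[address] >= self.max_failures:
                    # The crashed instance never deregistered itself; remove it.
                    del self.instances[address]

# Simulated checks: the second pod has crashed.
alive = {"10.0.0.1:8080": True, "10.0.0.2:8080": False}
reg = Registry(check=lambda addr: alive[addr], max_failures=3)
reg.register("10.0.0.1:8080")
reg.register("10.0.0.2:8080")
for _ in range(3):        # three 5-10s intervals later...
    reg.poll_once()
print(reg.healthy())      # only the live pod remains routable
```

With a 5s poll interval and a threshold of 3 consecutive failures, a crashed instance stops receiving traffic within roughly 15 seconds — the trade-off being that an aggressive threshold can evict instances during transient slowness.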
Open Design Challenge
Design a health check endpoint for the fraud service. What should it verify (DB connection, downstream services)? What HTTP status codes should it return?
If the service registry (Consul) itself goes down, how do clients continue to route traffic? Describe a fallback strategy.
Compare DNS-based discovery vs sidecar proxy discovery (Envoy). When would you choose each approach?
API Gateway Responsibilities
You have 50 microservices, each implementing its own authentication, rate limiting, and logging. Cross-cutting logic is duplicated everywhere and inconsistently implemented. A security patch requires updating all 50 services; adding request tracing requires changes to every codebase. An API Gateway centralizes this shared infrastructure logic.
Architecture Diagram
📱 Mobile Client
🌐 Web Client
🤝 Partner API
→
🚪 API Gateway (auth · rate limit · logging · routing)
→
⚙️ Service A
⚙️ Service B
⚙️ Service N
Concept Check — 3 questions
Q1. Which of the following concerns should NOT be implemented in the API Gateway?
A. Authentication — verifying JWT tokens before forwarding requests
B. Rate limiting — throttling clients that exceed 100 requests/second
C. Business logic — calculating discount prices or processing payment workflows
D. Request logging — emitting structured access logs for every request
Q2. An API Gateway that is not highly available introduces which type of risk to the entire platform?
A. SQL injection attacks from malformed client requests
B. Single point of failure — all traffic routes through it; if it goes down, all services become unreachable
C. Data loss from corrupted database writes
D. Reduced latency due to lack of response caching
Q3. Request routing in an API gateway primarily maps what to what?
A. External paths (e.g., /api/v1/users) to internal services (e.g., user-service:8080/users)
B. SQL queries to equivalent NoSQL operations
C. User IDs to session tokens stored in Redis
D. HTTP REST calls to gRPC method names
The gateway handles cross-cutting concerns: auth, rate limiting, logging, tracing, SSL termination, and routing. Business logic belongs in domain services — putting it in the gateway couples infrastructure to domain rules. A single gateway instance is a single point of failure; run multiple instances behind a load balancer (active-active). Routing config maps external URL patterns to internal service addresses.
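The routing responsibility can be sketched as a longest-prefix match from external URL patterns to internal service addresses. The route table below is a made-up example, not any particular gateway's configuration format:

```python
# Hypothetical static route table: external prefix -> internal backend.
ROUTES = {
    "/api/v1/users": "http://user-service:8080/users",
    "/api/v1/fraud": "http://fraud-svc:8080/checks",
}

def route(path):
    # Longest-prefix match so /api/v1/users/42 beats any shorter overlap.
    for prefix in sorted(ROUTES, key=len, reverse=True):
        if path.startswith(prefix):
            return ROUTES[prefix] + path[len(prefix):]
    return None  # no route: the gateway can return 404 without touching any service

print(route("/api/v1/users/42"))  # forwarded to user-service
```

Real gateways (NGINX, Kong, Envoy) express the same idea declaratively, but the core mapping — match an external pattern, rewrite to an internal address — is the same.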
Open Design Challenge
Design a rate limiting scheme for the API gateway. How do you handle per-user limits (100 req/min), per-IP limits (1000 req/min), and global limits (1M req/min) simultaneously?
The API gateway needs to enforce authentication. Sketch the JWT verification flow — what happens if the public key signing service is temporarily down?
How would you implement A/B testing at the gateway layer? Route 10% of /api/v2/checkout traffic to the new checkout service while keeping 90% on the old one.
Circuit Breaker Pattern
The checkout service calls the inventory service. The inventory service is slow — 500ms response time, up from a normal 20ms. Checkout has no circuit breaker, so it queues all requests waiting for inventory. After 60 seconds, checkout has 10,000 queued requests, runs out of memory, and crashes — taking down checkout even though only inventory is degraded. Classic cascading failure.
Circuit Breaker State Machine
🛒 Checkout
→
CLOSED: route to Inventory
→ failures exceed threshold →
OPEN: fail fast / return fallback
→ timeout period →
HALF-OPEN: test single request
Concept Check — 3 questions
Q1. When a circuit breaker is in the OPEN state, what happens to incoming requests?
A. All requests pass through normally to the downstream service
B. Requests are queued and retried every 5 seconds
C. Requests fail immediately without calling the downstream service — fast failure, no resource wasting
D. The circuit physically disconnects the network connection
Q2. The circuit breaker transitions from OPEN to HALF-OPEN after what condition?
A. Manual reset by an operator — it never auto-recovers
B. A configurable timeout period — after N seconds, allow a single probe request to test if downstream recovered
C. After exactly 10 consecutive failed requests are accumulated
D. When the downstream service restarts and re-registers in the service registry
Q3. Best fallback strategy when the circuit is OPEN for an inventory stock check at checkout:
A. Always return "out of stock" to completely prevent any oversell risk
B. Return the cached last-known-good value or optimistically return "in stock", then reconcile post-recovery
C. Block the checkout flow entirely until inventory recovers
D. Call a third-party inventory service as a backup
OPEN state = fail fast to protect your service from resource exhaustion. HALF-OPEN = cautious probe — let one request through to test recovery. The timeout before HALF-OPEN should be longer than the average restart time of the downstream service. Fallback quality matters — returning stale cache is almost always better than failing the entire user request.
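The full CLOSED → OPEN → HALF-OPEN cycle fits in a small state machine. A minimal sketch, with illustrative names and an injected clock so the example is deterministic (production libraries like resilience4j or Polly track failure *rates* over sliding windows rather than a simple consecutive count):

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: consecutive-failure threshold, timed recovery probe."""

    def __init__(self, threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold          # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds in OPEN before probing
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"    # let one probe request through
            else:
                return fallback()           # fail fast: no downstream call at all
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"         # probe failed or threshold hit
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.state = "CLOSED"               # success closes the circuit
        return result

# Deterministic demo with a fake clock.
now = [0.0]
cb = CircuitBreaker(threshold=2, reset_timeout=30.0, clock=lambda: now[0])

def broken():  raise RuntimeError("inventory down")
def cached():  return "in stock (cached)"

cb.call(broken, cached); cb.call(broken, cached)  # two failures -> OPEN
print(cb.state)                                   # OPEN: later calls fail fast
now[0] = 31.0                                     # timeout elapses
print(cb.call(lambda: "in stock", cached))        # HALF_OPEN probe succeeds
print(cb.state)                                   # back to CLOSED
```

Notice that the fallback (`cached`) is exactly the "stale cache beats total failure" strategy from Q3 — checkout degrades gracefully instead of queueing 10,000 requests against a dead dependency.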
Open Design Challenge
What thresholds would you configure for the circuit breaker? Define: failure rate %, minimum request count window, and timeout before HALF-OPEN transition.
How do you monitor circuit breaker state changes? What metrics and alerts would you configure in your observability stack?
Should circuit breakers be implemented per-client (in the calling service) or centrally (in a service mesh sidecar)? Compare the trade-offs of each approach.
Service Mesh vs API Gateway
200 microservices need mutual TLS between all services, distributed tracing, and per-service rate limiting. The API gateway handles north-south traffic (external clients to services) but east-west traffic (service-to-service) has no observability, no encryption, and no traffic policies. Adding mTLS to 200 services manually would take months of engineering work.
North-South vs East-West Traffic
🌐 Internet
→ north-south →
🚪 API Gateway
→
🖥 Services
|
east-west: Svc A ↔ [sidecar proxy] ↔ [sidecar proxy] ↔ Svc B
Concept Check — 3 questions
Q1. How does a service mesh (Istio/Linkerd) solve east-west traffic observability and security without code changes?
A. Centralizing all internal service calls through a single reverse proxy node
B. Sidecar proxies injected alongside each service that handle mTLS, observability, and traffic policies transparently
C. Adding HTTP interceptor middleware to each service's application codebase
D. Replacing the API gateway with a more capable version that handles internal traffic too
Q2. mTLS (mutual TLS) in a service mesh means what about connection authentication?
A. One-way TLS — only the server presents a certificate to the client
B. TLS with a shared secret password instead of certificates
C. Mutual TLS — both the client and server authenticate with certificates, verifying identity on both sides of every connection
D. Data encrypted at rest on disk, not in transit between services
Q3. When does the operational overhead of a service mesh justify its cost?
A. Any microservice architecture with more than 2 services communicating
B. When you have 20+ services with strict security/compliance requirements or significant duplicate cross-cutting logic across services
C. Only at FAANG-scale companies with billions of daily users
D. Never — a well-configured API gateway can always replace a service mesh
API Gateway = north-south traffic (external-facing). Service mesh = east-west traffic (internal service-to-service). Sidecars are injected automatically by the mesh control plane — services need zero code changes. mTLS via sidecars means every service-to-service call is encrypted and both sides are authenticated by certificate, not IP. The mesh adds ~1ms latency per hop — acceptable for most workloads.
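One traffic policy a mesh sidecar applies per request is weighted splitting, the mechanism behind canary rollouts. A sketch of the routing decision only (the function name and version labels are illustrative; Istio expresses this declaratively in a VirtualService): hashing a stable key such as a user ID gives each user a sticky bucket, so the same user consistently sees the same version.

```python
import hashlib

def pick_version(user_id, canary_percent):
    """Route a request to the canary or stable version of a service.

    Hashing the user ID to a stable bucket in [0, 100) makes the split
    sticky per user, unlike a per-request coin flip.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2-canary" if bucket < canary_percent else "v1-stable"

# Across a large population, roughly canary_percent% of users hit the canary.
users = [f"user-{i}" for i in range(10_000)]
share = sum(pick_version(u, 10) == "v2-canary" for u in users) / len(users)
print(f"canary share: {share:.1%}")   # close to 10%
```

Ramping the rollout is then just changing `canary_percent` (5 → 25 → 100) in the mesh's routing config — no service redeploys, since the decision lives in the sidecar, not the application.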
Open Design Challenge
Design the certificate rotation strategy for mTLS in a service mesh with 200 services. How often do certs rotate, and how do you prevent any downtime during rotation?
A new version of Service A is being deployed. Design a canary rollout using Istio traffic splitting: start at 5%, then 25%, then 100% — with automatic rollback triggered when error rate exceeds 1%.
You have both an API Gateway and a service mesh. Trace the full request path for an external mobile client calling the order service, which then calls inventory and payment services internally.