Service Discovery & API Gateway
Client-side vs server-side discovery, circuit breakers, API gateway patterns, and service mesh basics.
4 Exercises
12 Concept Checks
~90 min total
System Design
Service Discovery Mechanism
A payments microservice needs to call the fraud detection service. The fraud service runs on dynamic IPs — Kubernetes pods spin up and down constantly. Hardcoding the IP in the payments service config causes connection failures every time the fraud service restarts. You need a discovery mechanism that always resolves to a live instance.
Architecture Diagram
💳 Payments Service
→
🗂 Service Registry (Consul/etcd)
→
🔍 Fraud Pod 1
🔍 Fraud Pod 2
🔍 Fraud Pod 3
Concept Check — 3 questions
Q1. How does Kubernetes service discovery work for pod-to-pod communication?
A. Each pod registers its IP in a shared config file read by other pods
B. Kubernetes DNS + Service resource — each Service gets a stable DNS name like fraud-svc.payments.svc.cluster.local that always resolves to a live pod
C. Ops manually updates a ConfigMap every time pods restart
D. Each pod gets its own load balancer with a static IP
Q2. Client-side discovery: the client queries a service registry and picks an instance. Server-side discovery: requests go to a load balancer/proxy. What is the key advantage of server-side discovery?
A. It is always faster because the load balancer has a local cache
B. It uses less memory on each client service
C. It decouples the client from discovery logic — the client just calls the LB endpoint; registry logic lives in the infrastructure
D. It is cheaper to operate than client-side discovery
Q3. A service instance crashes without deregistering from the service registry. Health checks prevent routing to dead instances by doing what?
A. Periodically polling the instance's /health endpoint and removing it from the registry on failure
B. Waiting for a TCP connection timeout before removing the entry
C. Scanning the instance's log files for ERROR entries
D. Requiring manual removal by the operations team
Kubernetes creates a virtual Service IP (ClusterIP) backed by DNS — the name fraud-svc resolves to a virtual IP that kube-proxy load-balances across healthy pods. Server-side discovery moves complexity out of every client. Health checks with short intervals (5–10s) catch crashes quickly — Consul default TTL is 30s, Kubernetes liveness probes can be configured to 5s.
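The health-check behavior described above can be sketched in a few lines. This is a hypothetical in-memory registry, not Consul's or Kubernetes' actual API: instances register, a poller runs each instance's health check, and an instance that fails several consecutive checks is evicted even though it never deregistered itself.

```python
class Registry:
    """Toy service registry with active health checking (illustrative only)."""

    def __init__(self, check, max_failures=3):
        self.instances = {}        # address -> consecutive failure count
        self.check = check         # callable(address) -> bool, e.g. GET /health
        self.max_failures = max_failures

    def register(self, address):
        self.instances[address] = 0

    def healthy(self):
        # Addresses that clients may be routed to.
        return [a for a, fails in self.instances.items() if fails < self.max_failures]

    def poll_once(self):
        # Run one health-check sweep; evict instances that keep failing.
        for address in list(self.instances):
            if self.check(address):
                self.instances[address] = 0
            else:
                self.instances[address] += 1
                if self.instances[address] >= self.max_failures:
                    # The crashed instance never deregistered itself; remove it.
                    del self.instances[address]

# Simulated checks: the second pod has crashed.
alive = {"10.0.0.1:8080": True, "10.0.0.2:8080": False}
reg = Registry(check=lambda addr: alive[addr], max_failures=3)
reg.register("10.0.0.1:8080")
reg.register("10.0.0.2:8080")
for _ in range(3):        # three 5-10s intervals later...
    reg.poll_once()
print(reg.healthy())      # only the live pod remains routable
```

With a 5s poll interval and a threshold of 3 consecutive failures, a crashed instance stops receiving traffic within roughly 15 seconds — the trade-off being that an aggressive threshold can evict instances during transient slowness.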
Open Design Challenge
Design a health check endpoint for the fraud service. What should it verify (DB connection, downstream services)? What HTTP status codes should it return?
If the service registry (Consul) itself goes down, how do clients continue to route traffic? Describe a fallback strategy.
Compare DNS-based discovery vs sidecar proxy discovery (Envoy). When would you choose each approach?
API Gateway Responsibilities
You have 50 microservices, each implementing its own authentication, rate limiting, and logging. Cross-cutting logic is duplicated everywhere and inconsistently implemented. A security patch requires updating all 50 services; adding request tracing requires changes to every codebase. An API Gateway centralizes this shared infrastructure logic.
Architecture Diagram
📱 Mobile Client
🌐 Web Client
🤝 Partner API
→
🚪 API Gateway (auth · rate limit · logging · routing)
→
⚙️ Service A
⚙️ Service B
⚙️ Service N
Concept Check — 3 questions
Q1. Which of the following concerns should NOT be implemented in the API Gateway?
A. Authentication — verifying JWT tokens before forwarding requests
B. Rate limiting — throttling clients that exceed 100 requests/second
C. Business logic — calculating discount prices or processing payment workflows
D. Request logging — emitting structured access logs for every request
Q2. An API Gateway that is not highly available introduces which type of risk to the entire platform?
A. SQL injection attacks from malformed client requests
B. Single point of failure — all traffic routes through it; if it goes down, all services become unreachable
C. Data loss from corrupted database writes
D. Reduced latency due to lack of response caching
Q3. Request routing in an API gateway primarily maps what to what?
A. External paths (e.g., /api/v1/users) to internal services (e.g., user-service:8080/users)
B. SQL queries to equivalent NoSQL operations
C. User IDs to session tokens stored in Redis
D. HTTP REST calls to gRPC method names
The gateway handles cross-cutting concerns: auth, rate limiting, logging, tracing, SSL termination, and routing. Business logic belongs in domain services — putting it in the gateway couples infrastructure to domain rules. A single gateway instance is a single point of failure; run multiple instances behind a load balancer (active-active). Routing config maps external URL patterns to internal service addresses.
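The routing responsibility can be sketched as a longest-prefix match from external URL patterns to internal service addresses. The route table below is a made-up example, not any particular gateway's configuration format:

```python
# Hypothetical static route table: external prefix -> internal backend.
ROUTES = {
    "/api/v1/users": "http://user-service:8080/users",
    "/api/v1/fraud": "http://fraud-svc:8080/checks",
}

def route(path):
    # Longest-prefix match so /api/v1/users/42 beats any shorter overlap.
    for prefix in sorted(ROUTES, key=len, reverse=True):
        if path.startswith(prefix):
            return ROUTES[prefix] + path[len(prefix):]
    return None  # no route: the gateway can return 404 without touching any service

print(route("/api/v1/users/42"))  # forwarded to user-service
```

Real gateways (NGINX, Kong, Envoy) express the same idea declaratively, but the core mapping — match an external pattern, rewrite to an internal address — is the same.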
Open Design Challenge
Design a rate limiting scheme for the API gateway. How do you handle per-user limits (100 req/min), per-IP limits (1000 req/min), and global limits (1M req/min) simultaneously?
The API gateway needs to enforce authentication. Sketch the JWT verification flow — what happens if the public key signing service is temporarily down?
How would you implement A/B testing at the gateway layer? Route 10% of /api/v2/checkout traffic to the new checkout service while keeping 90% on the old one.
Circuit Breaker Pattern
The checkout service calls the inventory service. The inventory service is slow — 500ms response time, up from a normal 20ms. Checkout has no circuit breaker, so it queues all requests waiting for inventory. After 60 seconds, checkout has 10,000 queued requests, runs out of memory, and crashes — taking down checkout even though only inventory is degraded. Classic cascading failure.
Circuit Breaker State Machine
🛒 Checkout
→
CLOSED: route to Inventory
→ failures exceed threshold →
OPEN: fail fast / return fallback
→ timeout period →
HALF-OPEN: test single request
Concept Check — 3 questions
Q1. When a circuit breaker is in the OPEN state, what happens to incoming requests?
A. All requests pass through normally to the downstream service
B. Requests are queued and retried every 5 seconds
C. Requests fail immediately without calling the downstream service — fast failure, no resource wasting
D. The circuit physically disconnects the network connection
Q2. The circuit breaker transitions from OPEN to HALF-OPEN after what condition?
A. Manual reset by an operator — it never auto-recovers
B. A configurable timeout period — after N seconds, allow a single probe request to test if downstream recovered
C. After exactly 10 consecutive failed requests are accumulated
D. When the downstream service restarts and re-registers in the service registry
Q3. Best fallback strategy when the circuit is OPEN for an inventory stock check at checkout:
A. Always return "out of stock" to completely prevent any oversell risk
B. Return the cached last-known-good value or optimistically return "in stock", then reconcile post-recovery
C. Block the checkout flow entirely until inventory recovers
D. Call a third-party inventory service as a backup
OPEN state = fail fast to protect your service from resource exhaustion. HALF-OPEN = cautious probe — let one request through to test recovery. The timeout before HALF-OPEN should be longer than the average restart time of the downstream service. Fallback quality matters — returning stale cache is almost always better than failing the entire user request.
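The full CLOSED → OPEN → HALF-OPEN cycle fits in a small state machine. A minimal sketch, with illustrative names and an injected clock so the example is deterministic (production libraries like resilience4j or Polly track failure *rates* over sliding windows rather than a simple consecutive count):

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: consecutive-failure threshold, timed recovery probe."""

    def __init__(self, threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.threshold = threshold          # consecutive failures before opening
        self.reset_timeout = reset_timeout  # seconds in OPEN before probing
        self.clock = clock
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"    # let one probe request through
            else:
                return fallback()           # fail fast: no downstream call at all
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"         # probe failed or threshold hit
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.state = "CLOSED"               # success closes the circuit
        return result

# Deterministic demo with a fake clock.
now = [0.0]
cb = CircuitBreaker(threshold=2, reset_timeout=30.0, clock=lambda: now[0])

def broken():  raise RuntimeError("inventory down")
def cached():  return "in stock (cached)"

cb.call(broken, cached); cb.call(broken, cached)  # two failures -> OPEN
print(cb.state)                                   # OPEN: later calls fail fast
now[0] = 31.0                                     # timeout elapses
print(cb.call(lambda: "in stock", cached))        # HALF_OPEN probe succeeds
print(cb.state)                                   # back to CLOSED
```

Notice that the fallback (`cached`) is exactly the "stale cache beats total failure" strategy from Q3 — checkout degrades gracefully instead of queueing 10,000 requests against a dead dependency.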
Open Design Challenge
What thresholds would you configure for the circuit breaker? Define: failure rate %, minimum request count window, and timeout before HALF-OPEN transition.
How do you monitor circuit breaker state changes? What metrics and alerts would you configure in your observability stack?
Should circuit breakers be implemented per-client (in the calling service) or centrally (in a service mesh sidecar)? Compare the trade-offs of each approach.
Service Mesh vs API Gateway
200 microservices need mutual TLS between all services, distributed tracing, and per-service rate limiting. The API gateway handles north-south traffic (external clients to services) but east-west traffic (service-to-service) has no observability, no encryption, and no traffic policies. Adding mTLS to 200 services manually would take months of engineering work.
North-South vs East-West Traffic
🌐 Internet
→ north-south →
🚪 API Gateway
→
🖥 Services
|
east-west: Svc A ↔ [sidecar proxy] ↔ [sidecar proxy] ↔ Svc B
Concept Check — 3 questions
Q1. How does a service mesh (Istio/Linkerd) solve east-west traffic observability and security without code changes?
A. Centralizing all internal service calls through a single reverse proxy node
B. Sidecar proxies injected alongside each service that handle mTLS, observability, and traffic policies transparently
C. Adding HTTP interceptor middleware to each service's application codebase
D. Replacing the API gateway with a more capable version that handles internal traffic too
Q2. mTLS (mutual TLS) in a service mesh means what about connection authentication?
A. One-way TLS — only the server presents a certificate to the client
B. TLS with a shared secret password instead of certificates
C. Mutual TLS — both the client and server authenticate with certificates, verifying identity on both sides of every connection
D. Data encrypted at rest on disk, not in transit between services
Q3. When does the operational overhead of a service mesh justify its cost?
A. Any microservice architecture with more than 2 services communicating
B. When you have 20+ services with strict security/compliance requirements or significant duplicate cross-cutting logic across services
C. Only at FAANG-scale companies with billions of daily users
D. Never — a well-configured API gateway can always replace a service mesh
API Gateway = north-south traffic (external-facing). Service mesh = east-west traffic (internal service-to-service). Sidecars are injected automatically by the mesh control plane — services need zero code changes. mTLS via sidecars means every service-to-service call is encrypted and both sides are authenticated by certificate, not IP. The mesh adds ~1ms latency per hop — acceptable for most workloads.
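One traffic policy a mesh sidecar applies per request is weighted splitting, the mechanism behind canary rollouts. A sketch of the routing decision only (the function name and version labels are illustrative; Istio expresses this declaratively in a VirtualService): hashing a stable key such as a user ID gives each user a sticky bucket, so the same user consistently sees the same version.

```python
import hashlib

def pick_version(user_id, canary_percent):
    """Route a request to the canary or stable version of a service.

    Hashing the user ID to a stable bucket in [0, 100) makes the split
    sticky per user, unlike a per-request coin flip.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2-canary" if bucket < canary_percent else "v1-stable"

# Across a large population, roughly canary_percent% of users hit the canary.
users = [f"user-{i}" for i in range(10_000)]
share = sum(pick_version(u, 10) == "v2-canary" for u in users) / len(users)
print(f"canary share: {share:.1%}")   # close to 10%
```

Ramping the rollout is then just changing `canary_percent` (5 → 25 → 100) in the mesh's routing config — no service redeploys, since the decision lives in the sidecar, not the application.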
Open Design Challenge
Design the certificate rotation strategy for mTLS in a service mesh with 200 services. How often do certs rotate, and how do you prevent any downtime during rotation?
A new version of Service A is being deployed. Design a canary rollout using Istio traffic splitting: start at 5%, then 25%, then 100% — with automatic rollback triggered when error rate exceeds 1%.
You have both an API Gateway and a service mesh. Trace the full request path for an external mobile client calling the order service, which then calls inventory and payment services internally.