Week 1 · Day 1

Horizontal Scaling &
Load Balancer Internals

Click through interactive simulations to see exactly how load balancers distribute traffic, how consistent hashing minimizes remapping, and when different algorithms break down.

4 Simulations · ~4h Study Time · 2 Quizzes

Topics: L4 vs L7 · Consistent Hashing · Session Affinity · Health Checks · Connection Draining
01 — The Problem

What happens when one server isn't enough?

A single server has a ceiling — CPU, RAM, network bandwidth. When you hit it, you need more servers. But how does a client know which server to talk to? That's what load balancers solve.

⚡ Load Balancer Algorithm Simulator
Watch how different algorithms distribute incoming requests across 4 servers in real-time. Click "Send Request" repeatedly to see the distribution unfold.
💡

Round Robin

Requests rotate through servers in sequence: 1→2→3→4→1→2→... Each server gets exactly equal traffic assuming uniform request duration. Simple and effective for stateless services with similar request costs.
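The rotation described above fits in a few lines (a minimal sketch; the server names are placeholders):

```python
from itertools import cycle

# Round robin: an infinite iterator that walks the pool in order,
# wrapping back to server-1 after server-4.
servers = ["server-1", "server-2", "server-3", "server-4"]
next_server = cycle(servers)

def route() -> str:
    """Return the server the next incoming request is sent to."""
    return next(next_server)

# Twelve requests land three on each server, in strict rotation.
assignments = [route() for _ in range(12)]
```

Note the balancer holds only one piece of state (the cursor position), which is why round robin is the default in most LBs.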

02 — OSI Layers in Practice

What does "L4 vs L7" actually see?

L4 (transport layer) sees IP addresses and ports — that's it. L7 (application layer) decrypts TLS and reads HTTP content. This determines what routing decisions are possible.

🔌 L4 Load Balancer Sees
src_ip: 203.0.113.42
src_port: 54321
dst_ip: 10.0.0.1
dst_port: 443
TLS encrypted payload...
HTTP headers: hidden
URL path: hidden
Cookie: hidden
Can route on: IP, port, TCP flags. Cannot inspect HTTP headers, URL, cookies, or content.
🧠 L7 Load Balancer Sees
src_ip: 203.0.113.42
Host: api.example.com
GET /api/v2/users/123
Authorization: Bearer eyJ...
Cookie: session=abc123
Content-Type: application/json
X-Region: us-east-1
Can route on anything: URL path, headers, cookies, request body. Enables A/B testing, auth, canary deployments.
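The kind of decision only an L7 balancer can make, as a sketch (the route table and service names are invented for illustration):

```python
# An L7 balancer sees the parsed HTTP request, so it can route on the
# URL path. An L4 balancer only sees (src_ip, src_port, dst_ip, dst_port)
# and could not make this choice at all.
ROUTES = [
    ("/api/payments", "payment-service"),
    ("/api/orders",   "order-service"),
    ("/",             "web-frontend"),   # catch-all, checked last
]

def route_l7(path: str) -> str:
    """First matching prefix wins, mirroring typical path-based rules."""
    for prefix, service in ROUTES:
        if path.startswith(prefix):
            return service
    return "web-frontend"
```

Real balancers express the same idea as config (Nginx `location` blocks, ALB listener rules) rather than code.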
💡

The Trade-off

L4 processes packets at near line-rate (millions/sec, microseconds latency). L7 must terminate TLS, parse HTTP, read headers — roughly 10-100× more CPU per connection. AWS NLB (L4) handles 10M connections on bare metal. AWS ALB (L7) is optimized for HTTP but can't match raw NLB throughput. Real systems often chain both: NLB → ALB → service.

Load Balancer Selection

✅ Nginx / HAProxy (L7)
HTTP-aware routing, path-based rules (/api → service A, / → service B), SSL termination, sticky sessions, rate limiting. Self-hosted, open-source, battle-tested. Used by: Airbnb, GitHub, Netflix.
☁️ AWS ALB / CloudFront (Managed L7)
Zero ops overhead, auto-scaling, native AWS integrations (ECS, Lambda, WAF). Use when on AWS and you want managed infra. Cost: higher than self-hosted at scale.
⚡ L4 Load Balancer (TCP)
Protocol-agnostic: works for TCP, UDP, gRPC, gaming, streaming. Faster than L7 (no HTTP parsing). Use for non-HTTP services or when lowest latency matters. AWS NLB, HAProxy TCP mode.
❌ Round-Robin DNS (avoid)
DNS-level load balancing. Clients cache DNS → uneven distribution. No health checking → clients hit dead servers. No session affinity. Only use as a last resort or for geographic routing (GeoDNS).
Interview Answer: "For HTTP services I'd use an L7 load balancer like Nginx or AWS ALB — it gives me path-based routing, SSL termination, and health checks. If I need protocol-agnostic load balancing or lower latency, I'd use an L4 load balancer like AWS NLB."
🎯 Quick Check: Your team needs to route requests to different microservices based on the URL path (/api/payments → payment-service, /api/orders → order-service). Which load balancer type do you use?
03 — The Key Algorithm

Consistent Hashing: Minimizing Key Remapping

When you add or remove a cache server, how many keys get remapped? With naive modulo hashing: almost all of them. With consistent hashing: ~1/N. See the difference below.

⭕ Consistent Hash Ring
The ring represents the hash space 0 to 2³²−1. Servers and keys are hashed to ring positions. A key's server is the first server clockwise from its position.
Add keys and see which server they route to. Then add a server — notice only the keys between the new server and its predecessor get remapped. With modulo hashing, all keys would remap.

Modulo Hashing: key % N

Add 1 server (N→N+1): a key stays on its server only when key % N equals key % (N+1), which holds for just ~1/(N+1) of keys. In practice, N/(N+1) ≈ 90–100% of keys remap. This causes a massive cache miss storm — every remapped key hits a cold cache, and your DB gets 10× normal load.
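You can measure the modulo churn directly (a sketch; MD5 is used only as a stable hash so results are reproducible):

```python
import hashlib

def h(key: str) -> int:
    # Stable 64-bit hash. Python's built-in hash() is salted per process,
    # so it would give different results on every run.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

def owner(key: str, n_servers: int) -> int:
    """Modulo placement: server index for this key."""
    return h(key) % n_servers

keys = [f"user:{i}" for i in range(10_000)]

# Grow the pool from 10 to 11 servers and count keys that move.
moved = sum(owner(k, 10) != owner(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")  # expect roughly 10/11 ≈ 91%
```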

Consistent Hashing: ~1/N remapped

Add a server: only keys between the new server's hash position and its predecessor on the ring remap. For 10 servers, adding 1 remaps ~10% of keys. Cache miss rate stays manageable — no DB stampede. Used by: Cassandra, Redis Cluster, CDNs, DynamoDB.

⚠️

Why Virtual Nodes?

With 3 servers and 3 ring positions, each server might own unequal hash space (33% average but high variance). One server might own 60% of the ring by chance. Virtual nodes (vnodes) give each server 150+ positions, distributing load much more evenly. Cassandra defaults to 256 vnodes per physical node.
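A minimal ring with virtual nodes, to contrast with the modulo result (a sketch: the vnode count and `"{server}#{vnode}"` naming are illustrative, not any particular library's scheme):

```python
import bisect
import hashlib

def h(s: str) -> int:
    # Stable 64-bit hash so runs are reproducible.
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class HashRing:
    """Consistent hash ring with virtual nodes."""
    def __init__(self, servers, vnodes=150):
        # Each server gets `vnodes` positions on the ring, hashed by name.
        self.ring = sorted((h(f"{s}#{v}"), s)
                           for s in servers for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # First server clockwise from the key's position (wrap past the top).
        i = bisect.bisect_right(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

keys = [f"user:{i}" for i in range(10_000)]
before = HashRing([f"s{i}" for i in range(10)])
after = HashRing([f"s{i}" for i in range(11)])  # add an 11th server

moved = sum(before.lookup(k) != after.lookup(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")  # roughly 1/11, vs ~91% for modulo
```

Only the keys whose clockwise successor became one of the new server's vnodes move; everything else stays put.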

🎯 Quick Check: Your Redis cluster has 10 servers using consistent hashing with 150 vnodes each. You add an 11th server. Approximately what percentage of keys get remapped?
04 — Common Pitfall

Session Affinity: When and Why to Avoid It

Affinity Method        | Mechanism                                              | Problem                                                     | Verdict
IP Hash                | Hash client IP → fixed server                          | Corporate NAT maps millions of users to one IP → hot server | Dangerous
Cookie-based           | LB sets cookie with server ID                          | Better than IP, still breaks uniform distribution           | Acceptable
External Session Store | Redis/memcached stores session; all servers stateless  | Extra hop for session lookup (usually <1 ms)                | Recommended
💡

The Right Mental Model

If you need sticky sessions, it's usually a sign that your app stores state on the server (local filesystem, in-memory session, etc.). The fix isn't better affinity — it's making your app stateless. Store sessions in Redis. Store uploads in S3. Make any server able to handle any request. This is what enables auto-scaling.
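The stateless pattern in miniature (a sketch: a dict stands in for Redis here, and `login`/`handle_request` are invented names — in real code `session_store` would be a shared Redis client):

```python
import uuid

# Stands in for Redis: a store every server can reach.
session_store = {}

def login(username: str) -> str:
    """Any server can create a session; the state lives outside the server."""
    sid = str(uuid.uuid4())
    session_store[sid] = {"user": username}
    return sid

def handle_request(sid: str) -> str:
    # No affinity needed: whichever server receives the request
    # looks the session up in the shared store.
    session = session_store.get(sid)
    return f"hello {session['user']}" if session else "please log in"
```

Because no server holds session state locally, any server can be added, removed, or replaced without breaking logged-in users.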

05 — Operations

Health Checks & Connection Draining

💓

Shallow vs Deep Health Checks

Shallow (TCP connect): only checks that the process is listening. It misses an app that accepts TCP connections but can't query its DB, and zombie processes that accept connections and then time out.

Deep (HTTP /health): App queries all dependencies and returns 200 only if truly healthy. This is what you want in production.

🔌

Connection Draining

When you deregister a server, the LB marks it "draining" — stops sending new requests but lets in-flight requests complete (up to drain timeout, usually 300s). Never kill a server without draining: you'll drop active requests mid-flight. This is the mechanism that makes zero-downtime deploys possible.
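The draining handshake can be modeled as a small state machine (a sketch, not any particular LB's API; the timeout value is illustrative):

```python
import time

class DrainingBackend:
    """LB-side view of one backend during deregistration."""
    def __init__(self, drain_timeout: float = 300.0):
        self.draining = False
        self.in_flight = 0
        self.drain_timeout = drain_timeout

    def start_request(self) -> bool:
        # A draining backend refuses new work; the balancer sends
        # those requests to other servers instead.
        if self.draining:
            return False
        self.in_flight += 1
        return True

    def finish_request(self):
        self.in_flight -= 1

    def deregister(self):
        """Stop new traffic, then wait for in-flight requests (up to timeout)."""
        self.draining = True
        deadline = time.monotonic() + self.drain_timeout
        while self.in_flight > 0 and time.monotonic() < deadline:
            time.sleep(0.01)
        # Only now is it safe to terminate the instance.
```

Deploys then become: deregister → drain → replace → re-register, with no request dropped mid-flight.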

💡

What Your /health Endpoint Should Check

Database connection pool: can you acquire a connection and run SELECT 1?
Redis/cache: can you ping?
External APIs (with timeout): are your critical dependencies up?
Return 200 with a JSON body of dependency statuses; return 503 if any critical dependency is down. Never return 200 when your app can't actually serve requests — that misleads the LB and causes user-facing errors.
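A minimal sketch of such an endpoint (the dependency checks are stubbed out; real code would hit the actual connection pool and cache client, with timeouts):

```python
import json

def check_db() -> bool:
    # Real code: acquire a pooled connection and run SELECT 1 with a timeout.
    return True

def check_cache() -> bool:
    # Real code: ping the cache (e.g. Redis PING) with a short timeout.
    return True

def health() -> tuple:
    """Return (status_code, body): 200 only if every critical check passes."""
    checks = {"db": check_db(), "cache": check_cache()}
    status = 200 if all(checks.values()) else 503
    body = json.dumps({
        "status": "ok" if status == 200 else "degraded",
        "checks": checks,
    })
    return status, body
```

The LB only looks at the status code; the JSON body is for humans debugging why a server was pulled out of rotation.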