Horizontal Scaling & Load Balancers
Round-robin, consistent hashing, session affinity, and back-of-envelope capacity estimation.
4 Exercises
12 Concept Checks
~90 min total
System Design
Load Balancer Algorithm Selection
A social network has 3 backend servers. Server A handles auth (stateful sessions), servers B and C handle feed queries (stateless). Current round-robin sends auth requests to different servers each time, causing session loss — users are randomly logged out mid-session.
Architecture Diagram
👤 Client
→
⚖️ Load Balancer
→
🔐 Server A (Auth/Stateful)
📰 Server B (Feed)
📰 Server C (Feed)
Concept Check — 3 questions
Q1. Which load balancer algorithm prevents session loss for the stateful auth server?
A. Round-robin — distributes requests evenly across all servers
B. IP hash / sticky sessions — the same client IP always routes to the same server
C. Random selection — randomizes server choice each request
D. Weighted round-robin — sends more traffic to faster servers
Q2. Stateless services like the feed servers benefit most from which algorithm?
A. Round-robin — spreads load evenly since any server can handle any request
B. IP hash — ensures the same user always hits the same server
C. Static assignment — each client is permanently mapped to one server
D. Least-connections only — always routes to the least busy server
Q3. Server B crashes. With health checks enabled, what happens to traffic destined for Server B?
A. The load balancer still routes to B, causing errors for those users
B. The load balancer detects the failure via health check and removes B from the rotation
C. The load balancer crashes along with Server B
D. All requests are redirected to Server A only, overloading it
IP hash maps the user's IP to a consistent server, so the same client always hits the same backend. A shared session store such as Redis is even better: it makes auth stateless, so any server can handle any request. Health checks: typical settings are interval=10s, failure threshold=3 consecutive failures, timeout=5s. After 3 consecutive failures the server is removed from the pool automatically.
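The IP-hash routing described above can be sketched in a few lines. This is a minimal illustration, not a real load balancer: the server names and the choice of SHA-256 are assumptions for the example.

```python
import hashlib

# Hypothetical server pool for this scenario (names are assumptions).
SERVERS = ["server-a", "server-b", "server-c"]

def pick_server(client_ip: str, pool: list[str]) -> str:
    """IP-hash routing: hash the client IP so the same client always
    maps to the same backend, as long as the pool is unchanged."""
    digest = hashlib.sha256(client_ip.encode()).hexdigest()
    return pool[int(digest, 16) % len(pool)]

# The same client IP always routes to the same server:
server = pick_server("203.0.113.7", SERVERS)
```

Note the caveat built into the modulo step: if the pool size changes, most clients remap to a different server — which is exactly the problem consistent hashing (covered later in this session) addresses.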
Open Design Challenge
Design a session storage system using Redis to make the auth server fully stateless. What key structure and TTL would you use?
Calculate the number of servers needed for 10,000 concurrent users if each server handles 100 requests/second and average session is 5 minutes.
Sketch health check parameters: what interval, failure threshold, and timeout would you set? What happens during a rolling deploy?
Concept score: 0/3
Capacity Estimation — Social Network
Instagram-scale: 500M registered users, 10% DAU = 50M daily active users. Each user views 50 photos per day. Each photo request = 50KB thumbnail. You need to calculate required QPS and bandwidth to size the infrastructure.
Back-of-Envelope Calculation
50M DAU × 50 photos
÷ 86,400 sec
≈ 29,000 QPS
× 50KB
≈ 1.4 GB/s bandwidth
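The back-of-envelope arithmetic above, written out as a runnable check (using SI units, 1 KB = 1,000 bytes):

```python
# Inputs from the scenario
dau = 50_000_000          # 10% of 500M registered users
photos_per_user = 50      # views per day
seconds_per_day = 86_400
bytes_per_photo = 50 * 1000  # 50 KB thumbnail

# Required read QPS: total daily requests spread over the day
qps = dau * photos_per_user / seconds_per_day        # ≈ 28,935

# Required read bandwidth in GB/s
bandwidth_gbps = qps * bytes_per_photo / 1e9         # ≈ 1.45
```

This is an average; real traffic is peaky, so sizing usually multiplies the average by a peak factor (often 2–3×) before provisioning.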
Concept Check — 3 questions
Q1. 50M DAU × 50 photo views ÷ 86,400 seconds per day ≈ ?
A. 500 QPS — far too low for a 50M user platform
B. 29,000 QPS — correct: 50M × 50 / 86,400 ≈ 28,935
C. 500,000 QPS — overestimate by 17×
D. 5,000,000 QPS — overestimate by 170×
Q2. At 29,000 QPS × 50KB per photo request, the required read bandwidth is approximately?
A. 1.4 MB/s — three orders of magnitude too low
B. 14 MB/s — still too low by 100×
C. ~1.4 GB/s — correct: 29,000 × 50,000 bytes ≈ 1.45 GB/s
D. 14 GB/s — overestimate by 10×
Q3. The read:write ratio for a photo-sharing application is typically?
A. 1:1 — equal reads and writes
B. 100:1 or higher — reads dominate heavily in photo apps
C. 1:100 — writes dominate (each photo is uploaded repeatedly)
D. 50:50 — roughly balanced
50M × 50 / 86,400 ≈ 28,935 QPS ≈ 29K. Bandwidth: 29K × 50KB ≈ 1.45 GB/s. Storage: 1 billion photos × 500KB = 500TB. CDNs absorb 95%+ of read traffic, so the origin only needs to serve ~5% of reads directly. With a 100:1 read:write ratio, invest heavily in the read path: caching and a CDN pay off far more than write-side optimization.
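The storage and CDN-offload figures from the explanation above, as a runnable sketch (SI units; the 95% CDN hit rate is the figure stated above, not a measured value):

```python
total_qps = 29_000
photo_request_bytes = 50 * 1000                 # 50 KB thumbnail per request

# Storage: 1 billion full-size photos at 500 KB each
storage_tb = 1_000_000_000 * 500 * 1000 / 1e12  # = 500 TB

# Origin load after the CDN absorbs 95% of reads
cdn_hit_rate = 0.95
origin_qps = total_qps * (1 - cdn_hit_rate)     # ~1,450 QPS at origin
origin_mbps = origin_qps * photo_request_bytes / 1e6  # ~72.5 MB/s at origin
```

The takeaway: the CDN turns a ~1.45 GB/s origin problem into a ~72 MB/s one, which is why photo-sharing architectures are designed CDN-first.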
Open Design Challenge
How many servers (each handling 1,000 QPS) do you need for 29K QPS? Include a 30% headroom buffer.
Calculate storage for 1 billion photos at 500KB each plus 3 thumbnail sizes at 50KB each. Show your working.
If CDN absorbs 95% of reads, recalculate origin QPS and bandwidth requirements.
Consistent Hashing for Cache Distribution
A distributed cache has 4 nodes. With simple modulo hashing (key % 4), adding a 5th node invalidates 80% of cache keys. The cache hit rate drops from 95% to 15%, causing a database thundering herd that takes down the primary DB within minutes.
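The 80% figure can be verified directly: with modulo hashing, a key stays on its node only when k mod 4 equals k mod 5, which holds for just 1 in 5 keys (k = 0..3 within each period of 20).

```python
def remap_fraction(num_keys: int, old_n: int, new_n: int) -> float:
    """Fraction of keys whose modulo bucket changes when the node
    count changes from old_n to new_n."""
    moved = sum(1 for k in range(num_keys) if k % old_n != k % new_n)
    return moved / num_keys

print(remap_fraction(100_000, 4, 5))  # → 0.8
```

So adding one node invalidates 80% of the cache under modulo hashing — the thundering-herd trigger in this scenario.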
Consistent Hash Ring — Adding Node E only remaps ~20% of keys
🔑 key hash
→ clockwise
Node A (0°)
Node B (90°)
Node C (180°)
Node D (270°)
+ Node E →
Only A→E keys remapped (~20%)
Concept Check — 3 questions
Q1. With modulo hashing and 4 nodes, adding a 5th node remaps what percentage of cache keys?
A. 20% — only the keys that land on the new node
B. 50% — half the keys need remapping
C. 80% — most keys map to a different node after the divisor changes from 4 to 5
D. 0% — modulo hashing is stable across node additions
Q2. Going from a 4-node ring to 5 nodes, consistent hashing reduces key remapping to approximately?
A. 50% of keys — consistent hashing doesn't help much
B. 1/n ≈ 20% — only the keys in the new node's arc move, taken from its clockwise successor
C. 100% — all keys must be re-evaluated on the ring
D. 0% — no keys ever move with consistent hashing
Q3. Virtual nodes (vnodes) in consistent hashing primarily solve which problem?
A. Uneven key distribution and hot spots — high-capacity nodes can hold proportionally more vnodes
B. Network latency — vnodes are placed in geographically closer data centers
C. Cache invalidation — vnodes expire stale entries automatically
D. Replication lag — vnodes synchronize data faster between replicas
Consistent hash ring: hash(key) → find the nearest clockwise node. Adding 1 node to a 5-node ring remaps only 1/5 = 20% of keys (the arc covered by the new node). Virtual nodes: a node with 2× capacity gets 2× the virtual nodes on the ring, receiving 2× the keys. When Node C fails, only the keys that mapped to C move to C's clockwise successor.
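The ring described above can be sketched as a small data structure. This is an illustrative minimal version (MD5 and 100 vnodes per node are arbitrary choices for the example, not a production recommendation):

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Hash a string to a point on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []      # sorted (hash, node) pairs
        self._vnodes = vnodes  # points per physical node
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_h(f"{node}#vn{i}"), node))

    def remove(self, node):
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def get(self, key):
        # Nearest clockwise vnode: first ring point at or after the
        # key's hash, wrapping around at the top of the ring.
        idx = bisect.bisect_left(self._ring, (_h(key), ""))
        return self._ring[idx % len(self._ring)][1]

# Adding a 5th node remaps only ~1/5 of keys — and every remapped
# key lands on the new node, never shuffling between old nodes.
keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["A", "B", "C", "D"])
before = {k: ring.get(k) for k in keys}
ring.add("E")
moved = [k for k in keys if ring.get(k) != before[k]]
```

To model the 2× capacity case from the explanation, the larger node would simply register 2× as many vnodes as the others.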
Open Design Challenge
Trace how key "user:42601" gets assigned to a node on a consistent hash ring with 4 nodes. Describe the steps from hash computation to node selection.
If Node A has 2× the memory capacity of Nodes B, C, D — how many virtual nodes should each physical node receive if the total is 100 vnodes?
Node C fails unexpectedly. Which keys get remapped and to which node? How do you handle cache warming for those keys?
Multi-Region Active-Active Load Balancing
A global fintech app has users in US-East, EU-West, and APAC. Current setup: single US data center. EU users see 150ms latency. APAC users see 280ms. Compliance requires EU user data must stay in EU (GDPR). You must design multi-region active-active routing with data sovereignty.
GeoDNS Multi-Region Architecture
🌍 GeoDNS
→
US-East LB → 3 servers
EU-West LB → 3 servers (GDPR)
APAC LB → 3 servers
→
Cross-region sync (non-PII only)
Concept Check — 3 questions
Q1. How does GeoDNS route EU users to the EU load balancer?
A. Round-robin across all regional load balancers regardless of user location
B. DNS returns the EU load balancer IP based on the user's IP geolocation
C. Users manually select their region in account settings
D. All traffic routes via US first, then the US LB redirects to EU
Q2. GDPR requires EU user data stays in EU. For a global authentication system, the correct pattern is?
A. Replicate all user data including PII to all regions globally for low latency
B. Block all non-EU IP addresses from accessing EU user accounts
C. Data sovereignty routing: EU users auth against EU replica; only anonymized/non-PII data crosses regions
D. Store everything in US and add a GDPR compliance disclaimer to the terms of service
Q3. An EU-West server crashes. EU users should failover to which region to maintain GDPR compliance?
A. APAC immediately — it has spare capacity
B. EU fallback zone first; only route to US if all EU zones are unavailable
C. US-East immediately — fastest recovery path
D. Serve 503 errors until the EU region recovers — never leave EU
Data sovereignty: user PII stays in the home region's database. Session tokens use stateless JWT so they work cross-region without sharing session state. For failover: prefer same geographic region first (EU-West-1 → EU-Central-1), then US only if all EU zones are down. Data migration when a user moves: copy non-PII first, then schedule PII migration with explicit user consent.
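The "same-region-first" failover ordering above can be expressed as a simple preference table. The zone names here are assumptions for illustration (the source only names EU-West-1 and EU-Central-1); a real setup would drive this from service discovery rather than a hard-coded dict:

```python
# Preference-ordered failover candidates per zone. EU zones list other
# EU zones first so traffic stays in-region (GDPR) whenever possible.
FAILOVER_ORDER = {
    "eu-west-1": ["eu-central-1", "us-east-1"],   # EU first, US last resort
    "us-east-1": ["us-west-1", "eu-west-1"],
    "apac-1":    ["apac-2", "us-west-1"],
}

def failover_targets(failed_zone: str, healthy: set[str]) -> list[str]:
    """Return healthy candidate zones in compliance-aware preference order."""
    return [z for z in FAILOVER_ORDER[failed_zone] if z in healthy]
```

With this ordering, an EU-West outage routes to EU-Central while it is healthy, and only falls through to US-East when no EU zone remains — matching the failover policy described above.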
Open Design Challenge
Design the database strategy: which tables replicate globally (non-PII) vs stay regional (PII)? Give concrete examples for a fintech app.
A user moves from EU to US permanently. How do you migrate their PII data while maintaining service continuity and GDPR compliance?
Write a failover decision tree for when EU-West is down: step-by-step decisions from detection to recovery including DNS TTL implications.