Design YouTube

CDN strategy, video transcoding pipelines, adaptive bitrate, and view count architecture at billion-user scale.

4 Exercises
12 Concept Checks
~95 min total
System Design
Session Progress
0 / 4 completed
Exercise 1 🟡 Easy ⏱ 15 min
Video Upload Decoupling
YouTube receives 500 hours of video per minute. A user uploads a 4GB video file. Without async architecture, the upload API must transcode in-line — causing 30-minute request timeouts and server overload that takes down the entire upload fleet.
Architecture Diagram
👤 Client
📡 Upload API
🪣 S3 Raw
📨 Kafka Queue
🔧 Worker 360p
🔧 Worker 720p
🔧 Worker 1080p
🌐 CDN
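The diagram's flow can be sketched as a minimal upload handler. This is an illustrative stand-in, not real YouTube code: `RAW_BUCKET` and `TRANSCODE_QUEUE` are in-memory dictionaries playing the role of S3 and Kafka so the pattern runs anywhere. The point is that the request path only stores bytes and enqueues a job — transcoding happens elsewhere.

```python
import uuid

# In-memory stand-ins for S3 and Kafka — names are hypothetical.
RAW_BUCKET = {}        # key -> raw bytes ("S3")
TRANSCODE_QUEUE = []   # list of job dicts ("Kafka topic")

def handle_upload(filename, payload):
    """Store the raw file, enqueue a transcode job, return 202 immediately.

    No transcoding runs in the request path; workers consume
    TRANSCODE_QUEUE asynchronously.
    """
    video_id = str(uuid.uuid4())
    raw_key = "raw/{}/{}".format(video_id, filename)
    RAW_BUCKET[raw_key] = payload            # S3 PUT stand-in
    TRANSCODE_QUEUE.append({                 # Kafka produce stand-in
        "video_id": video_id,
        "raw_key": raw_key,
        "qualities": ["360p", "720p", "1080p"],
    })
    # 202 Accepted: request received; processing continues in the background.
    return 202, {"video_id": video_id, "status": "processing"}

status, body = handle_upload("cat.mp4", b"\x00" * 1024)
print(status, body["status"])  # 202 processing
```

In a real deployment the 4GB payload would be streamed to object storage (often via a pre-signed URL) rather than buffered in the API process.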
Concept Check — 3 questions
Q1. When the upload API writes to S3 and enqueues a Kafka event, what HTTP status should it return to the client?
A. 200 OK — the upload is done
B. 202 Accepted — request received, processing in background
C. 201 Created — the video resource has been created
D. 204 No Content — the video is ready
Q2. A transcoding worker crashes halfway through a 4K video job. How does Kafka prevent job loss?
A. Kafka automatically completes the job on another worker
B. The job is permanently lost — Kafka doesn't persist failed jobs
C. The consumer never committed the offset, so Kafka redelivers the message to another worker
D. The API server retries the upload
Q3. How does the uploader know when transcoding is complete?
A. The transcoder sends an email/push notification, or the client polls a /status endpoint
B. The client keeps the upload HTTP connection open until transcoding finishes
C. YouTube shows the video immediately during transcoding
D. Kafka notifies the client directly over TCP
202 Accepted is the right code for async work. Kafka offsets: the consumer commits its offset ONLY after successfully processing — a crash before commit causes re-delivery. Notification: emit a video_ready event from the transcoder; a notification service subscribes and sends the email/push.
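The offset-commit rule above can be shown with a tiny at-least-once simulation — a plain list stands in for a Kafka partition, and a global integer for the committed offset; all names are illustrative. A worker that "crashes" before committing leaves its message behind for the next worker.

```python
# At-least-once delivery: commit the offset only AFTER successful processing,
# so a crash mid-job causes redelivery to the next consumer.
messages = ["job-1", "job-2", "job-3"]   # the "partition"
committed_offset = 0                     # last committed position

def run_consumer(crash_on=None):
    """Process messages from the committed offset; commit after each success."""
    global committed_offset
    processed = []
    for offset in range(committed_offset, len(messages)):
        msg = messages[offset]
        if msg == crash_on:
            return processed             # crash BEFORE committing this offset
        processed.append(msg)            # do the transcoding work here
        committed_offset = offset + 1    # commit only after success
    return processed

first = run_consumer(crash_on="job-2")   # worker dies while handling job-2
second = run_consumer()                  # replacement worker picks up
print(first, second)  # ['job-1'] ['job-2', 'job-3']
```

The trade-off: redelivery means job-2 may be processed twice, so transcoding jobs should be idempotent (e.g. write outputs to a deterministic path keyed by video ID and quality).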
Open Design Challenge
1. Design the data model for a video: what fields does the videos table need? Include status, storage path, and quality variants.
2. If the transcoder must produce 360p, 720p, 1080p, and 4K versions — how do you parallelize this across 4 workers for one video?
3. A transcoding job fails 3 times. What is your dead-letter strategy, and how do you alert the engineering team?
Concept score: 0/3
Exercise 2 🔴 Medium ⏱ 20 min
Adaptive Bitrate Streaming (ABR)
A user on mobile opens a YouTube video. Their network speed changes every 30 seconds: 15 Mbps → 1.5 Mbps → 8 Mbps. Without ABR, they would see constant buffering. With ABR, quality adapts automatically — the player switches between pre-transcoded quality levels at segment boundaries.
HLS / DASH Segment Structure
🎞️ 360p segments (500Kbps)
🎞️ 720p segments (2.5Mbps)
🎞️ 1080p segments (8Mbps)
📋 .m3u8 Manifest
📱 Player
monitors bandwidth
⬆/⬇ Switch quality
Concept Check — 3 questions
Q1. The user's bandwidth is 2 Mbps. Which quality level should the ABR player choose?
A. 1080p (8 Mbps) — always stream the highest quality
B. 720p (2.5 Mbps) — the quality closest to the available bandwidth
C. 360p (500 Kbps) — safely below the 2 Mbps available bandwidth
D. Pause and wait until bandwidth reaches 8 Mbps
Q2. HLS splits videos into segments. What is the typical segment duration, and why does it matter for quality switching?
A. 30 seconds — quality only switches every 30 seconds
B. 2–10 seconds — the player can switch quality on each segment boundary, adapting quickly
C. 1 millisecond — for real-time adaptation
D. The entire video file — no splitting
Q3. Where should video segments be stored for best performance? What cache hit rate do you expect for a popular video?
A. Origin servers only — CDNs add too much latency
B. Only in RAM — video files are too small for CDN
C. CDN edge nodes globally — popular video segments achieve 99%+ cache hit rates, served from edges <50 ms away
D. User's local device only
C is correct for Q1: 2 Mbps is below the 2.5 Mbps that 720p needs, so the player chooses 360p safely. ABR is conservative — it picks a quality BELOW the available bandwidth to avoid stalling. Segment durations of 2–10s let the player adapt quickly at each boundary. The CDN stores hot segments at 300+ edge PoPs globally.
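The conservative selection rule can be sketched in a few lines: pick the highest rendition whose bitrate fits within a safety fraction of measured bandwidth. The bitrate ladder matches the one in the diagram; the 0.8 safety factor is an assumed illustrative value, not a standard.

```python
# Conservative ABR selection: highest quality under safety * bandwidth.
LADDER_KBPS = {"360p": 500, "720p": 2500, "1080p": 8000}

def pick_quality(bandwidth_kbps, safety=0.8):
    """Return the best rendition that fits under safety * bandwidth."""
    budget = bandwidth_kbps * safety
    best = "360p"                                  # floor: lowest rung
    for name, kbps in sorted(LADDER_KBPS.items(), key=lambda kv: kv[1]):
        if kbps <= budget:
            best = name
    return best

for bw in (15000, 2000, 1500, 8000):
    print(bw, "->", pick_quality(bw))
# 15000 -> 1080p, 2000 -> 360p, 1500 -> 360p, 8000 -> 720p
```

Note that at 8 Mbps measured bandwidth the player still picks 720p, not 1080p — the safety margin absorbs the next bandwidth drop. Real players also factor in buffer depth, not just throughput.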
Open Design Challenge
1. Design the CDN cache key for a 720p segment of video "abc123", minute 3, segment 4. What URL structure guarantees cache hits for the same segment across all users?
2. A new video goes viral 10 minutes after upload. The CDN has 0% cache hit rate. Design "request coalescing" to prevent 100,000 requests hitting the origin simultaneously.
3. How does "predictive preloading" work? When a user is 3 seconds into a video, what segments should already be buffered?
Concept score: 0/3
Exercise 3 🔴 Medium ⏱ 25 min
View Count at Scale
YouTube serves 1 billion views per day = ~11,600 view events per second. A single PostgreSQL counter column for a viral video receives 500,000 writes in the first hour (139 writes/second on ONE ROW). The database row is locked for each increment, causing writes to queue and timeout.
Counter Architecture Options
❌ NAIVE (breaks at scale)
View event
↓ direct write
UPDATE videos SET views=views+1 WHERE id=X
Row lock contention → 139 writes/sec on 1 row
✅ CORRECT (Redis + batch)
View event
↓ INCR in-memory
Redis: INCR view:{video_id}
↓ batch every 30s
UPDATE videos SET views=views+N
Concept Check — 3 questions
Q1. Using Redis INCR to count views: the server crashes before flushing to PostgreSQL. What is the worst-case data loss?
A. All view counts are lost permanently
B. No data loss — Redis persists everything to disk instantly
C. Up to ~30 seconds of view counts are lost — acceptable for a view counter but not for financial data
D. PostgreSQL rolls back to a consistent state automatically
Q2. "Sharded counters" split one counter into N shards (e.g., 100 Redis keys for one video). What problem does this solve?
A. It reduces total storage for view counts
B. A single Redis key becomes a hot spot at extreme write rates. 100 shards distribute writes — each key gets 1/100th of the traffic. Sum all shards to get the total.
C. It makes reads faster by caching the result
D. It prevents duplicate view counts from the same user
Q3. YouTube shows "1.2M views" but the exact count is 1,247,832. This is intentional — why?
A. YouTube's engineers can't count accurately at this scale
B. Exact counts require synchronous writes, which would slow down video loading
C. Rounding avoids unnecessary DB reads, and the precision difference (0.06%) doesn't affect user experience
D. Exact counts are stored but rounding is a legal requirement
Redis INCR is atomic and handles ~100K+ ops/sec per instance — far faster than PostgreSQL's row locking. Batch flush every 30s: a cron job or Kafka consumer runs GETDEL view:{id} and writes one SQL UPDATE with the accumulated count. Loss on crash = at most 30 seconds of views. For YouTube this is acceptable — it's not money.
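The sharded-counter-plus-batch-flush pattern above can be sketched with plain dictionaries standing in for Redis and the `videos` table (names and shard count are illustrative): writes hit a random shard, and the periodic flush drains all shards into one batched UPDATE.

```python
import random
from collections import defaultdict

N_SHARDS = 100
redis = defaultdict(int)        # dict standing in for Redis; INCR == += 1
sql_views = defaultdict(int)    # dict standing in for the videos table

def record_view(video_id):
    """INCR a random shard so no single key becomes a write hot spot."""
    shard = random.randrange(N_SHARDS)
    redis["view:{}:{}".format(video_id, shard)] += 1

def flush(video_id):
    """Every ~30s: drain all shards (GETDEL) and apply one batched UPDATE."""
    total = 0
    for shard in range(N_SHARDS):
        total += redis.pop("view:{}:{}".format(video_id, shard), 0)
    sql_views[video_id] += total   # UPDATE videos SET views = views + N
    return total

for _ in range(5000):
    record_view("abc123")
flushed = flush("abc123")
print(flushed, sql_views["abc123"])  # 5000 5000
```

Reading the live total means summing all 100 shards plus the SQL baseline — cheap for one video, which is why the flush also serves as compaction.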
Open Design Challenge
1. Design the full view counting pipeline: from browser click → Redis → Kafka → PostgreSQL. Include deduplication (the same user shouldn't count twice in 24h).
2. The "views in last 24 hours" feature requires a sliding window count. How do you implement this without scanning billions of rows?
Concept score: 0/3
Exercise 4 🔥 Hard ⏱ 35 min
Viral Video: Cold Start Problem
A creator uploads a video at 9:00 AM. At 9:05 AM, a celebrity tweets the link. By 9:07 AM, 100,000 users click simultaneously. The CDN has 0% cache hit rate — every request passes through to the origin. 3 origin servers handling video must each serve 33,000 concurrent streams from disk.
Cold Start: Request Coalescing
User 1
User 2
User 3...100K
🔒 CDN Edge
(coalescing hold)
→ single request
📦 Origin
→ cached
CDN stores
→ all 100K served
⚡ Fast delivery
Concept Check — 3 questions
Q1. 100,000 requests miss the CDN cache simultaneously. With "request coalescing" enabled, what does the CDN do?
A. Forwards all 100,000 requests to the origin server simultaneously
B. Returns 503 errors until the origin responds
C. Holds all 100,000 in a waiting queue, sends ONE request to the origin, and when it responds, serves the cached copy to all 100,000
D. Randomly drops 99% of requests and only forwards 1,000
Q2. Virality prediction: a video gets 500 views in its first 60 seconds. What should the system do automatically?
A. Nothing — wait until the video is popular before caching
B. Trigger CDN pre-warming: proactively push all quality segments to edge nodes in the top 50 cities before the traffic arrives
C. Limit the video to 1,000 viewers until transcoding catches up
D. Delete the old video and re-upload it with better compression
Q3. During a viral spike, origin servers are overwhelmed. What should the CDN return if it can't reach the origin?
A. 503 Service Unavailable immediately
B. Serve a stale cached version with a "stale-while-revalidate" directive — users get slightly old content rather than an error
C. Delete the cache and force all users to wait for a fresh response
D. Redirect users to a competitor's video
Request coalescing is a CDN feature: only 1 cache miss travels to origin regardless of how many users miss simultaneously. Pre-warming: Kafka event video_velocity_high (500 views/min) → pre-warming service calls each CDN PoP's edge API to pull-and-cache all segments. Stale-while-revalidate: CDN serves old content immediately while revalidating in the background.
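Request coalescing is the "single-flight" pattern, and it can be demonstrated with threads: a per-key lock ensures only the first miss reaches the origin, while concurrent misses wait and then read the freshly cached copy. This is a minimal sketch — real CDN edges do this per cache node with streaming responses, and all names here are illustrative.

```python
import threading

origin_hits = 0                  # count of round-trips to origin
cache = {}                       # edge cache: key -> bytes
locks = {}                       # one lock per cache key
locks_guard = threading.Lock()   # protects the locks dict itself

def fetch_origin(key):
    global origin_hits
    origin_hits += 1             # only called while holding the key's lock
    return b"segment-bytes"

def get_segment(key):
    """Single-flight: first miss fetches; concurrent misses wait, then hit cache."""
    if key in cache:
        return cache[key]
    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())
    with lock:                   # only one thread fetches per key
        if key not in cache:     # double-check after acquiring the lock
            cache[key] = fetch_origin(key)
    return cache[key]

threads = [threading.Thread(target=get_segment, args=("abc123/720p/seg4.ts",))
           for _ in range(1000)]
for t in threads: t.start()
for t in threads: t.join()
print(origin_hits)  # 1
```

A thousand concurrent misses produce exactly one origin fetch — the same collapse a CDN edge performs during a cold-start spike.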
Open Design Challenge
1. Design the virality detection pipeline: from view events → velocity calculation → CDN pre-warm trigger. What Kafka topics and consumers are involved?
2. Pre-warming 300 CDN PoPs × 5 quality levels × 100 segments/level = 150,000 HTTP calls. How do you execute this within 60 seconds without overloading the origin?
3. What is the SLA you'd set for a video to be "CDN warm" after upload? How do you measure it?
Concept score: 0/3