Day 22 · Week 4 — Real-World Systems

Design YouTube

Video upload pipelines, transcoding at scale, adaptive bitrate streaming, CDN strategy, and view counting architecture for 2.5 billion monthly users.

7 Stages
~4h Study
2 Simulations
5 Quizzes

The Scale Problem

📤
Video Processing
500 hours uploaded every minute. A 4K video can take 10+ minutes to transcode. Synchronous processing would cause request timeouts — async pipeline is mandatory.
💾
Storage
1B+ videos × 7 quality levels (144p to 4K) × HLS segments = petabytes of object storage. S3/GCS with lifecycle policies moves cold content to cheaper tiers automatically.
🌐
Delivery
2.5B users globally. Raw origin bandwidth would be astronomical. CDN is not optional — it's the architecture. 300+ Points of Presence (PoPs) serve content from the nearest edge.
👁️
View Counting
5B views/day ≈ 58,000 views/second. Writing to a PostgreSQL counter per view causes lock contention. Redis INCR + batch flush is the only viable pattern at this scale.
⚠️
Without architecture: 30-minute synchronous upload processing, request timeouts, server overload. Every 4K upload would hold a worker thread captive for the full transcode duration. YouTube would be unusable within hours of launch.
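The numbers in these cards can be sanity-checked with quick arithmetic. A minimal sketch, using the figures above plus an assumed average of ~1 GB of stored renditions per upload-hour (an illustrative guess, not a YouTube figure):

```python
# Back-of-envelope math for the scale figures above (illustrative, not official data)
UPLOAD_HOURS_PER_MIN = 500
VIEWS_PER_DAY = 5_000_000_000

# 500 hours uploaded per minute -> hours of new video per day
daily_upload_hours = UPLOAD_HOURS_PER_MIN * 60 * 24   # 720,000 upload-hours/day

# View write rate if every view hit the database directly
views_per_second = VIEWS_PER_DAY / 86_400             # ~57,870 writes/sec

# Storage growth, assuming ~1 GB/upload-hour across the whole rendition ladder
daily_storage_tb = daily_upload_hours * 1 / 1024      # ~703 TB/day

print(f"{daily_upload_hours:,} upload-hours/day")
print(f"{views_per_second:,.0f} views/sec")
print(f"~{daily_storage_tb:,.0f} TB of new storage/day")
```

Even with a conservative per-hour storage guess, the growth rate lands in the hundreds of terabytes per day, which is why lifecycle tiering in S3/GCS is non-negotiable.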

The Upload Pipeline

ℹ️
How to use: Click "Next Step" to walk through each stage of the upload pipeline. Each stage shows what's happening, why, and the approximate latency at that step.
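The pipeline's first rule, "accept fast, process later," can be sketched as follows. This is a toy sketch: the in-process `Queue` stands in for Kafka, the `videos` dict stands in for the PostgreSQL metadata table, and all names are illustrative rather than YouTube's actual API.

```python
# Async upload pattern: return 202 Accepted immediately, transcode in the background.
import uuid
from queue import Queue

queue: Queue = Queue()          # stands in for a Kafka topic
videos: dict[str, dict] = {}    # stands in for the metadata database

def handle_upload(raw_object_location: str) -> tuple[int, dict]:
    """Accept the upload, record 'processing' status, enqueue work, return fast."""
    video_id = str(uuid.uuid4())
    videos[video_id] = {"status": "processing", "source": raw_object_location}
    queue.put({"event": "video.uploaded", "video_id": video_id})  # fan-out point
    return 202, {"video_id": video_id, "status": "processing"}

def transcode_worker():
    """Consumer side: pull one upload event and mark the video ready when done."""
    event = queue.get()
    # ... run the rendition ladder (ffmpeg jobs) here ...
    videos[event["video_id"]]["status"] = "ready"

status, body = handle_upload("s3://uploads/raw/example.mp4")
transcode_worker()
```

The worker thread never blocks the request path: the API's only job is to durably record the event, which is why a 202 (not 200 or 201) is the honest response code.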

Adaptive Bitrate (ABR) Quality Selector

Drag the bandwidth slider to simulate what quality level YouTube's player would select. The highlighted bar is the active quality.

144p · 360p · 480p · 720p · 1080p · 1440p · 4K

Adaptive Bitrate Streaming (ABR)

ABR streaming is the technology that lets YouTube automatically switch quality based on your network. The video is pre-encoded at multiple bitrates and split into short segments. The player requests segments one at a time, measuring download speed to determine which quality to request next.
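A minimal throughput-based selector, mirroring what the slider above simulates. The bitrate ladder here is a typical example, not YouTube's actual encoding ladder:

```python
# Pick the highest rendition whose bitrate fits within a safety margin of
# measured throughput. Ladder values are illustrative.
LADDER = [  # (label, bitrate in kbps)
    ("144p", 100), ("360p", 700), ("480p", 1_200),
    ("720p", 2_500), ("1080p", 5_000), ("1440p", 10_000), ("4K", 20_000),
]

def pick_quality(measured_kbps: float, safety: float = 0.8) -> str:
    """Highest rung whose bitrate fits in `safety` x measured throughput."""
    budget = measured_kbps * safety
    best = LADDER[0][0]                 # never drop below the lowest rung
    for label, bitrate in LADDER:
        if bitrate <= budget:
            best = label
    return best
```

The safety factor is the interesting design choice: requesting at 100% of measured throughput leaves no headroom, so any dip stalls playback.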

📋
Manifest Files
The .m3u8 (HLS) or .mpd (DASH) file lists all available renditions with their bitrates and segment URLs. The player fetches this first.
✂️
Segment Duration
Segments are typically 2–10 seconds. Quality switches happen at segment boundaries. Shorter segments = faster quality adaptation but more HTTP requests.
📊
Buffer-Based ABR
Buffer-based algorithms such as BOLA (YouTube's exact player logic is not public): if the buffer is full, request higher quality; if the buffer is draining fast, step down. Prioritizes smooth playback over maximum quality.
💡
Why segment duration matters: A 2-second segment means quality can switch every 2 seconds — great for fluctuating networks. A 10-second segment means you're committed to the current quality for 10 seconds. YouTube uses 2–4 second segments for live, 6–10 seconds for VOD.
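The buffer-based idea can be sketched as a toy step-up/step-down rule. Real BOLA maximizes a utility function over buffer occupancy; this simplified heuristic only captures the spirit, and the watermark values are arbitrary:

```python
# Simplified buffer-based quality decision (in the spirit of BOLA, not the real thing)
QUALITIES = ["144p", "360p", "480p", "720p", "1080p", "1440p", "4K"]

def next_quality(current: str, buffer_s: float,
                 high_mark: float = 20.0, low_mark: float = 8.0) -> str:
    """Step up when the buffer is comfortably full, down when it drains."""
    i = QUALITIES.index(current)
    if buffer_s >= high_mark:
        return QUALITIES[min(i + 1, len(QUALITIES) - 1)]  # room to be ambitious
    if buffer_s <= low_mark:
        return QUALITIES[max(i - 1, 0)]                   # protect smooth playback
    return current                                        # steady state
```

Because switches only take effect at segment boundaries, the segment duration above directly bounds how often this function's decision can actually be applied.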

HLS vs DASH Comparison

| Feature | HLS (Apple) | MPEG-DASH | Winner |
| --- | --- | --- | --- |
| Full name | HTTP Live Streaming | Dynamic Adaptive Streaming over HTTP | |
| Manifest format | .m3u8 (Apple-defined; published as RFC 8216) | .mpd (XML, open standard) | DASH (open) |
| iOS/Safari support | Native | Requires a JS library | HLS |
| Android/Chrome | Requires a JS library | Native in ExoPlayer; via MSE/EME in browsers | DASH |
| Segment format | .ts (MPEG-TS) or fMP4 | fMP4 / WebM | DASH (fMP4 more efficient) |
| Live streaming | Excellent | Good | HLS |
| YouTube uses | For Apple devices | Primary protocol | Both (adaptive) |
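For concreteness, here is what a small HLS master playlist (.m3u8) might look like: one #EXT-X-STREAM-INF entry per rendition, each pointing to a per-quality media playlist. Paths and bitrates are illustrative, not taken from any real service.

```
#EXTM3U
#EXT-X-VERSION:6
#EXT-X-STREAM-INF:BANDWIDTH=700000,RESOLUTION=640x360
360p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/playlist.m3u8
```

This is the file the player fetches first; every ABR decision afterwards is just choosing which of these per-rendition playlists to pull the next segment from.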

View Count Architecture

The naive approach — incrementing a database counter per view — breaks catastrophically at YouTube's scale. Here's why, and how the correct architecture works.

NAIVE APPROACH (Broken)
User watches → UPDATE counter → Row lock
Sitewide traffic is 5B views/day, and a single viral video can absorb ~12M of those: ~139 writes/sec against one row. Each UPDATE acquires a row-level lock. Queries queue. Timeouts cascade. The database melts.
CORRECT APPROACH
User watches → Redis INCR → Flush every 30s
Redis INCR is atomic and non-blocking. Each region has its own counter. A background job flushes accumulated counts to PostgreSQL every 30 seconds. No locks, no contention.
PYTHON · view_counter.py
# View counting with Redis batching (redis-py asyncio client)
import redis.asyncio as redis_lib

redis = redis_lib.Redis(decode_responses=True)

async def record_view(video_id: str, user_id: str):
    pipe = redis.pipeline()
    pipe.incr(f"views:{video_id}")              # atomic counter
    pipe.sadd(f"viewers:{video_id}", user_id)   # unique-viewer set
    pipe.expire(f"views:{video_id}", 3600)      # 1hr TTL as a safety net
    pipe.expire(f"viewers:{video_id}", 86400)   # bound memory for the viewer set
    await pipe.execute()

# Background flush every 30 seconds
async def flush_view_counts():
    # SCAN, not KEYS: KEYS blocks Redis while it walks the entire keyspace
    async for key in redis.scan_iter("views:*"):
        video_id = key.split(":", 1)[1]
        count = int(await redis.getdel(key) or 0)  # read and reset atomically
        if count == 0:
            continue
        await db.execute(
            "UPDATE videos SET view_count = view_count + $1 WHERE id = $2",
            count, video_id
        )
        # Also flush unique viewer count to analytics
        unique = await redis.scard(f"viewers:{video_id}")
        await analytics.record(video_id, views=count, unique=unique)
💡
Why eventual consistency is acceptable here: View counts are not financial transactions. Showing "14.2M views" when the true count is 14.3M is perfectly acceptable. Users don't notice. What matters is no data loss — the flush job persists counts durably before TTL expiry.

Technology Decisions

Storage
S3 / GCS + PostgreSQL
Object storage for video files (effectively unlimited scale, cheap). PostgreSQL for metadata (video title, uploader, status). Chosen over MySQL for richer JSONB querying and MVCC concurrency under heavy writes.
Message Queue
Kafka
Not RabbitMQ. Kafka allows multiple consumers (transcoding + thumbnail + content-id + search indexing all consume the same upload event) and replay for debugging failed jobs.
View Counts + Cache
Redis
INCR is atomic and non-blocking. Also caches trending videos and hot video metadata. Memcached not chosen — Redis supports sorted sets for trending leaderboards.
Content Delivery
Akamai / CloudFront
300+ PoPs globally. For popular content, CDN hit rate reaches 99%+. Immutable segment URLs (Cache-Control: max-age=31536000) mean segments never need to be purged.
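One way to make segment URLs immutable is content-addressed naming: the URL embeds a hash of the segment bytes, so a re-encoded segment gets a new URL and the old one can safely be cached for a year with no purging. A hypothetical sketch (the URL scheme and helper names are assumptions, not YouTube's):

```python
# Content-addressed segment URLs: same bytes -> same URL, changed bytes -> new URL.
import hashlib

def segment_url(video_id: str, rendition: str, seq: int, data: bytes) -> tuple[str, dict]:
    digest = hashlib.sha256(data).hexdigest()[:16]   # short content hash in the name
    url = f"/v/{video_id}/{rendition}/{seq}-{digest}.ts"
    headers = {"Cache-Control": "public, max-age=31536000, immutable"}
    return url, headers

url, headers = segment_url("abc123", "720p", 0, b"\x00" * 1024)
```

Because the hash changes whenever the content does, cache invalidation reduces to publishing a new manifest that references the new URLs.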

Quiz

Question 1 of 5
When a user uploads a video, the API should return which HTTP status code while transcoding continues in the background?
A. 200 OK — the upload is complete
B. 202 Accepted — the request was received and is being processed
C. 201 Created — the video was created successfully
D. 204 No Content — nothing to return yet
Question 2 of 5
YouTube uses HLS/DASH for video delivery. These protocols split video into segments. Why does segment duration (2–10s) matter?
A. Shorter segments enable better video compression algorithms
B. Segment boundaries are where quality switches happen — shorter segments mean faster adaptation to network changes
C. Longer segments always reduce server load proportionally
D. It's a licensing requirement from the HLS patent holders
Question 3 of 5
At 5B views/day on viral videos, directly incrementing a PostgreSQL counter (UPDATE videos SET views=views+1) fails because:
A. PostgreSQL doesn't support integer arithmetic on large numbers
B. The UPDATE acquires a row lock — at ~139 writes/sec on one row, queries queue and time out, causing a cascading failure
C. View counts are stored as strings and can't be incremented directly
D. PostgreSQL replication can't handle this write throughput
Question 4 of 5
Content ID (copyright checking) happens at which stage in YouTube's upload pipeline?
A. Before accepting the upload — to prevent copyrighted content from ever being stored
B. After transcoding — fingerprint matching runs on the processed video, not during upload
C. Only after a video reaches 100 organic views
D. During live streaming only, not for uploaded videos
Question 5 of 5
For a video with 10M daily views, what CDN cache hit rate should you target?
A. 50% — many users bypass the CDN by refreshing or using VPNs
B. 80% — typical for popular content across a normal CDN
C. 99%+ — popular video segments are cached at 300+ edge PoPs with immutable URLs (max-age=1yr)
D. 0% — videos are personalized per user so they cannot be cached at a shared CDN