Day 7 · Week 1 Capstone ⭐

URL Shortener
Capstone

Your first complete system design exercise. Practice the interview format: Requirements → Estimation → Design → Deep Dive. Design bit.ly from scratch, with real numbers and real trade-offs at each layer.

4 Simulations · ~4h Study Time · 5 Quizzes

Topics: Capacity Estimation · Base62 Encoding · Consistent Hashing · Redis Caching · Abuse Prevention
01 — Requirements & Scale

Start with numbers — always

Interviewers expect you to do back-of-envelope math before drawing any boxes. Adjust the sliders to match your assumed scale and read off the derived numbers. 100:1 read:write is typical for URL shorteners.

📋

Interview format: use this structure every time

1) Clarify requirements (functional + non-functional) → 2) Estimate scale (QPS, storage, bandwidth) → 3) High-level design (components) → 4) Deep dive (bottlenecks, trade-offs). Never jump to boxes without numbers first.

📊 Capacity Estimator
Drag sliders to set your assumptions. All numbers update live. These are the numbers to say out loud in your interview.

Assumptions: 10M URLs created/day · 100:1 read:write ratio · 5-year retention · 500 B per URL record

Derived numbers:
  Write QPS: ~116
  Read QPS: ~11.6K
  Total URLs: ~18.3B
  Total storage: ~9.1 TB
  Hot cache (20%): ~1.5 GB
  API servers (est.): 2
02 — ID Generation: Base62 Encoding

How 6 characters cover 56 billion unique URLs

Base62 uses digits (0-9), lowercase (a-z), and uppercase (A-Z) — 62 characters total. 6-character codes give 62⁶ = 56,800,235,584 unique combinations. That's 56 billion — enough for decades of URL creation at any realistic scale.

🔢 Interactive Base62 Encoder
Enter any integer to see the base62 encoding step-by-step.
Charset: 0–9  a–z  A–Z
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
62^6 = 56,800,235,584 unique codes
62^7 = 3,521,614,606,208 (7-char)
16^6 = 16,777,216 (6-char hex — far fewer!)

Python Implementation

import string

BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase
# "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(num: int) -> str:
    """Convert an integer to a base62 string."""
    if num == 0:
        return BASE62[0]
    chars = []
    while num:
        chars.append(BASE62[num % 62])
        num //= 62
    return ''.join(reversed(chars))

def decode(s: str) -> int:
    """Convert a base62 string back to an integer."""
    return sum(BASE62.index(c) * (62 ** i) for i, c in enumerate(reversed(s)))

# Examples
# encode(0)        → "0"
# encode(61)       → "Z"
# encode(62)       → "10"
# encode(12345678) → "PNFQ" (4 chars)
# encode(62**6)    → "1000000" (the first 7-char code)

# 62^6 = 56,800,235,584 unique short codes with 6 characters
# At 10M URLs/day: 56.8B / 10M = 5,680 days ≈ 15+ years before exhaustion

# Key Generation Service (KGS) approach — avoids collisions entirely:
# 1. Pre-generate millions of random 6-char codes offline
# 2. Store them in a "used keys" table and an "unused keys" table
# 3. On each write request: pop one unused key (O(1), no collision check)
# 4. Background job refills the pool when < 20% remains
# Advantage: no race conditions, no collision detection, O(1) key assignment
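The KGS approach described in the comments can also be sketched end to end. This toy version keeps the "used" and "unused" pools in plain in-memory sets; a real deployment would back them with a database or Redis, and the class name here is mine:

```python
import secrets
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

class KeyGenerationService:
    """Toy KGS: pre-generates random 6-char codes into a pool,
    hands them out in O(1), and refills when the pool runs low."""

    def __init__(self, pool_size: int = 1000):
        self.pool_size = pool_size
        self.unused: set[str] = set()
        self.used: set[str] = set()
        self._refill()

    def _refill(self) -> None:
        # Generate until the pool is full; a code that collides with an
        # already-used key is simply skipped and regenerated.
        while len(self.unused) < self.pool_size:
            code = ''.join(secrets.choice(ALPHABET) for _ in range(6))
            if code not in self.used:
                self.unused.add(code)

    def next_key(self) -> str:
        # Pop one pre-generated key: no collision check on the write path.
        if len(self.unused) < self.pool_size // 5:   # < 20% remaining
            self._refill()
        code = self.unused.pop()
        self.used.add(code)
        return code

kgs = KeyGenerationService(pool_size=100)
code = kgs.next_key()   # O(1), guaranteed unique, 6 chars
```

The collision check happens offline in `_refill`, never on the request path, which is the whole point of the pattern.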
⚠️

Hash-based vs Counter-based IDs

Hash approach: MD5(url)[:6] encoded as base62. Risk: birthday paradox causes collisions — detect with ON CONFLICT and retry with salted hash. Counter approach: global auto-increment. Risk: single point of failure and reveals your volume to competitors. KGS (pre-generated pool) avoids both risks.

🛡️

Custom Alias Handling

Allow users to request custom slugs like sho.rt/my-brand. Validate: 6–50 chars, alphanumeric + hyphens, not in reserved list (admin, api, login, www). Store with is_custom=true flag. Custom aliases never expire unless the user deletes them — unlike auto-generated codes which may have TTL.
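The validation rules above translate almost directly into code. A minimal sketch, assuming the reserved list from the text lives in an in-process set (the real list would sit in Redis or config behind a feature flag):

```python
import re

# Reserved slugs named in the text; a real list would live in Redis/config.
RESERVED = {"admin", "api", "login", "signup", "www", "help", "support", "blog"}

# 6-50 chars, alphanumeric plus hyphens.
ALIAS_RE = re.compile(r"^[A-Za-z0-9-]{6,50}$")

def validate_alias(alias: str) -> bool:
    """True if the custom alias is well-formed and not reserved."""
    return bool(ALIAS_RE.match(alias)) and alias.lower() not in RESERVED

# validate_alias("my-brand") → True
# validate_alias("support")  → False (reserved)
# validate_alias("hi")       → False (too short)
```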

03 — System Architecture

Two paths: create URL (write) and redirect (read)

The redirect path is the hot path — it must return a 302 response in under 5ms at scale. Every component choice on this path matters. The write path is less latency-sensitive but needs durability.

Complete URL Shortener Architecture
Browser / App · Mobile Client
  → Load Balancer (Nginx / AWS ALB)
  → API Server 1 / API Server 2 / API Server 3 (stateless)
  → reads hit cache first: Redis Cluster (HOT PATH)
  → on cache miss: PostgreSQL (URL store)
  → async: Kafka (analytics events)

The Redirect Hot Path — must be under 5ms

1. Request arrives: GET /abc123. The load balancer routes to any API server (stateless, so any server works).
2. Redis lookup: GET url:abc123 (~0.2ms). Cache HIT (95%+ of popular URLs): return the 302 redirect immediately. Done. Total: ~1ms.
3. Cache MISS: query PostgreSQL (~5ms). SELECT long_url FROM urls WHERE code = 'abc123'. Re-warm the cache: SET url:abc123 <long_url> EX 86400.
4. Async: publish to the Kafka analytics topic. Fire-and-forget: {code, timestamp, ip, user_agent}. Never block the redirect response for analytics.

Never do a synchronous DB write on the redirect path

A common mistake: updating a click counter directly in PostgreSQL on every redirect. At 100K read QPS, that's 100K writes/sec to a single DB row — instant bottleneck. Instead: increment Redis counter (atomic, in-memory), batch-flush to DB every minute via a background job.
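The buffered-counter pattern looks like this in miniature. Here a `Counter` stands in for Redis INCR and a dict for the PostgreSQL table; names are mine for the sketch:

```python
from collections import Counter

click_buffer = Counter()              # stand-in for Redis: hot-path increments
db_click_counts: dict[str, int] = {}  # stand-in for the PostgreSQL counter column

def record_click(code: str) -> None:
    click_buffer[code] += 1           # O(1) in-memory; never touches the DB

def flush_clicks() -> None:
    """Background job (e.g. every minute): fold the buffered counts into
    the DB as one write per URL instead of one write per click."""
    for code, n in click_buffer.items():
        db_click_counts[code] = db_click_counts.get(code, 0) + n
    click_buffer.clear()

for _ in range(1000):
    record_click("abc123")            # 1000 redirects → 1000 memory increments
flush_clicks()                        # → a single DB write for this code
```

The same shape works with real infrastructure: `INCR clicks:{code}` on the hot path, then a cron or consumer that drains the counters in batches.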

04 — Caching Strategy

Why caching is the most critical design decision

URL shortener traffic follows Zipf's law — a small fraction of URLs get the vast majority of clicks. This makes caching extraordinarily effective. Get this right and your DB barely matters for reads.

📈

Zipf's Law in Action

The most popular URL gets 2× the traffic of the second most popular, 3× the third, etc. In practice: top 20% of URLs receive 80% of all redirect traffic. Cache those 20%, and your cache hit rate exceeds 80% — the DB barely matters for reads.
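The 80/20 claim can be sanity-checked directly from the 1/rank weights (pure arithmetic over an assumed 10,000-URL population, not a simulation of real traffic):

```python
# Zipf check: with click weight proportional to 1/rank over N URLs,
# what share of traffic goes to the top 20% of URLs?
N = 10_000
weights = [1 / rank for rank in range(1, N + 1)]
total = sum(weights)
top20_share = sum(weights[: N // 5]) / total   # comes out a bit above 0.8

print(f"top 20% of URLs get {top20_share:.0%} of clicks")
```

Which is why a cache sized for the top 20% of URLs clears an 80% hit rate before you tune anything.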

🗂️

LRU Eviction for URL Cache

Least Recently Used (LRU) eviction works perfectly for URL shorteners — if a URL hasn't been accessed recently, it's unlikely to be popular. Set Redis maxmemory-policy allkeys-lru. Cache size: allocate ~20% of daily active URL set. At 500 bytes/URL and 1M daily URLs, that's ~100MB — very cheap.

⏱️

TTL Strategy

Default TTL: 24 hours. Long-lived URLs (company links, QR codes): extend to 7 days or no expiry (refreshed on access). Short campaign URLs: TTL matches the campaign end date. Expired URL codes: return 410 Gone, not 404 Not Found — 410 tells crawlers the resource is permanently gone, while 404 implies it may never have existed.

🌍

CDN for Global Edge Caching

For globally popular URLs (viral content), the redirect response itself can be cached at CDN edge nodes. Cache-Control header on 301 redirects (if using 301) lets browsers cache indefinitely. For 302 (analytics-enabled), add Cache-Control: private, no-store to force browser re-request.

05 — Abuse Prevention

Protecting your platform from misuse

URL shorteners are high-value abuse targets — they can be used to hide phishing links, spam, or malware. At scale, you need automated defenses or you'll be deplatformed by browsers and security tools.

🔗 Phishing Detection

Integrate the Google Safe Browsing API on URL creation. Check the long URL against the malware/phishing database before generating a short code. Block immediately if flagged. Re-check periodically for URLs that become malicious after creation. The consumer Safe Browsing API is free but quota-limited; commercial-scale checking goes through Google's paid Web Risk API.

🚦 Rate Limiting URL Creation

Unauthenticated: 10 URL creates/hour per IP. Authenticated free tier: 100/day. Paid tier: 10,000/day. Use Redis with sliding window counter. Return 429 Too Many Requests with Retry-After header. Exponential backoff for repeated violations → temporary IP ban.

🚫 Reserved Slug Blocklist

Block custom aliases that could cause confusion: admin, api, login, signup, www, help, support, blog, brand names (Google, Apple, Amazon). Store as a Redis Set for O(1) lookup. Update via feature flag without deployment.

📱 Link Preview & QR Codes

Show a preview page for suspicious domains before redirect (user explicitly clicks "Continue"). Generate QR codes only for verified/authenticated URLs. Include warning on preview page for known shortener-heavy phishing patterns. Log all preview page visits for abuse analysis.

The 301 vs 302 trade-off

301 Permanent: browser caches redirect forever — reduces your server load but you lose analytics (browser goes direct, bypassing your server). Can't update the destination. 302 Temporary: browser always asks you — full analytics, can change destination, but higher server load. Analytics-first shorteners always use 302.

06 — Technology Decisions

The right tool for each layer

Redirect Cache

Redis

  • Sub-millisecond lookups (avg 0.2ms)
  • LRU eviction built-in
  • Atomic INCR for click counters
  • Cluster mode for horizontal scale

Config: maxmemory 8gb · maxmemory-policy allkeys-lru

URL Storage

PostgreSQL

  • ACID transactions for URL creation
  • ON CONFLICT for collision detection
  • Unique index on short code (fast lookups)
  • Read replicas for cache miss fallback

Schema: urls(code, long_url, created_at, user_id, ttl)

Analytics Events

Kafka

  • Async, non-blocking — redirect not delayed
  • Fan-out: one event → multiple consumers
  • Replay clicks if analytics pipeline fails
  • Batch write to analytics DB (ClickHouse)

Topic: url.clicks, partition by short code

Global Performance

CDN (CloudFront)

  • Edge-cache 301 redirects globally
  • Serve error pages for 404/410
  • DDoS protection at edge
  • Geographic latency reduction

Only viable with 301 (browser-cached redirects)

07 — Knowledge Check

Five questions on the URL shortener design

1. bit.ly uses 302 redirects instead of 301. What is the primary reason?
301 Permanent tells the browser to cache the redirect forever — once visited, future requests go directly to the destination, bypassing your servers entirely. You lose all click analytics, can't retarget the URL, and can't measure engagement. 302 Temporary means the browser asks your server on every click. The slight performance overhead is irrelevant compared to the analytics value.
2. How many unique short codes does a 6-character base62 system support?
62^6 = 56,800,235,584 — about 56 billion. Compare to 16^6 (hex) = 16,777,216 — only 16 million. Using the full alphanumeric charset (A-Z, a-z, 0-9 = 62 chars) instead of just hex (0-9, a-f = 16 chars) gives you ~3,400× more unique codes at the same code length. At 100M URLs/day, 56B codes last ~1.5 years; move to 7 chars (3.5 trillion) for 95+ years of capacity.
3. When a redirect request hits the API server and the URL is in Redis cache — which components are involved?
On a cache HIT: Load Balancer routes the request to an API server (stateless, any server works). API server does GET url:{code} on Redis (~0.2ms). Redis returns the long URL. API server returns HTTP 302 with Location: {long_url}. Total latency: ~1-2ms. PostgreSQL is never touched. This is why the cache hit rate is so critical — at 95% hit rate, only 5% of requests need a DB round-trip.
4. What is the main drawback of using a global auto-increment counter for ID generation?
A global counter requires a single synchronized counter — if that DB goes down, you can't create any URLs. At high write QPS, the counter becomes a write bottleneck. Worse: sequential IDs (1, 2, 3, ...) reveal your exact creation rate and let competitors enumerate all your URLs (just increment the counter and decode). Solutions: range-based counters (each server leases a block of IDs), a KGS with pre-generated random keys, or hash-based IDs with collision detection.
5. Why use Kafka for analytics (click tracking) instead of writing directly to PostgreSQL on each redirect?
The redirect path must be as fast as possible. Writing to PostgreSQL synchronously on every redirect would add 5-15ms of latency per request and create a massive write bottleneck at 100K+ QPS. With Kafka: the API server fires off an async message (non-blocking, ~0.1ms), returns the 302 immediately, and Kafka consumers process analytics events in batches. If the analytics pipeline fails, you can replay events from Kafka — no data loss. This is the key insight: separate latency-critical paths from eventually-consistent analytics.