URL Shortener Capstone
Capacity estimation, ID generation, redirect mechanics, and abuse prevention at bit.ly scale.
4 Exercises
12 Concept Checks
~100 min total
System Design
Capacity Estimation
Design bit.ly at scale: 100M shortened URLs created per day. 10:1 read-to-write ratio means 1 billion redirects per day. How many servers, how much storage, and how much cache capacity do you need? Good estimation shows you understand order-of-magnitude thinking, not exact numbers.
Back-of-Envelope Numbers
100M writes/day = ~1,200 writes/sec
×10 read ratio
1B reads/day = ~11,600 reads/sec
6-char base62
62^6 = 56B unique codes
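These back-of-envelope figures can be sanity-checked in a few lines of Python (a sketch of the arithmetic above, nothing more):

```python
# Back-of-envelope capacity math for the numbers above.
SECONDS_PER_DAY = 24 * 60 * 60                    # 86,400

writes_per_sec = 100_000_000 / SECONDS_PER_DAY    # ~1,157, rounded to ~1,200
reads_per_sec = 10 * writes_per_sec               # ~11,574, rounded to ~11,600

# 5-year storage at 500 bytes per URL record
storage_bytes = 5 * 365 * 100_000_000 * 500
storage_tb = storage_bytes / 1e12                 # 91.25 TB

# 6-character base62 keyspace
keyspace = 62 ** 6                                # ~56.8 billion codes
```

Rounding up to ~1,200 writes/sec and ~11,600 reads/sec keeps a little headroom in the estimate, which is the point of order-of-magnitude thinking.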
Concept Check — 3 questions
Q1. 100M URL creations per day converts to approximately how many writes per second?
A. ~100 writes/sec
B. ~1,200 writes/sec
C. ~10,000 writes/sec
D. ~100,000 writes/sec
Q2. For 5 years of operation at 100M URLs/day, with each URL record averaging 500 bytes — what is the total storage needed?
A. ~1 GB — URLs compress well
B. ~1 TB — one terabyte is enough
C. ~91 TB — 5 × 365 × 100M × 500B
D. ~1 PB — petabyte scale is required
Q3. The redirect operation returns a 301 or 302 HTTP response. What is the key difference between them for a URL shortener?
A. 301 is faster because it skips the DNS lookup on subsequent redirects
B. 302 enables click analytics by hitting your server on every redirect; 301 is cached by browsers permanently, bypassing your server
C. 301 is for HTTP-to-HTTPS upgrades; 302 is for URL shortening
D. There is no meaningful difference — browsers treat them identically
100M / 86,400 seconds ≈ 1,157 writes/sec. Storage: 5yr × 365days × 100M URLs × 500B = 91.25 TB — plan for 100 TB with replication. 301 (permanent): browsers cache forever, great for load reduction but you lose analytics. 302 (temporary): browser always re-requests, you see every click for analytics. bit.ly uses 302 by default for click tracking.
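The 301/302 trade-off can be sketched in a few lines of Python. This is an illustrative stand-in, not a real framework: `url_table` plays the datastore and `click_log` the analytics pipeline.

```python
# Sketch of the 301-vs-302 choice: a 302 keeps every click on our server,
# so analytics can be recorded before the browser is redirected.
url_table = {"abc123": "https://example.com/long-page"}  # stand-in datastore
click_log = []                                           # stand-in analytics sink

def redirect(short_code, track_clicks=True):
    """Return (HTTP status, Location header) for a short code, or (404, None)."""
    long_url = url_table.get(short_code)
    if long_url is None:
        return 404, None
    if track_clicks:
        click_log.append(short_code)  # analytics only exists on the 302 path
        return 302, long_url          # temporary: browser re-requests every click
    return 301, long_url              # permanent: browser caches, server never sees repeats
```

With a 301, repeat visitors never reach `redirect()` at all, which is exactly why bit.ly defaults to 302.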
Open Design Challenge
Estimate the number of application servers needed to handle 11,600 redirect reads/sec. Assume each server handles 1,000 req/sec with 2× headroom. How many do you need?
Design the Redis cache tier: how much RAM do you need to cache the top 20% of URLs (those receiving 80% of traffic)? Assume the hot set is 20M URLs at 500 bytes each.
At 1,200 writes/sec, is a single PostgreSQL primary sufficient? At what write rate would you shard, and by what key?
ID Generation Strategies
You need to generate unique 6-character short codes for 100M URLs per day. Options: random base62, sequential DB counter, MD5 hash of the URL, or Snowflake-style timestamp IDs. Each has trade-offs around uniqueness guarantees, collision probability, predictability, and database performance.
ID Generation Comparison
MD5(url)[:6] — may collide
vs
Counter — predictable
vs
Random — 62^6 = 56B space
vs
Snowflake — time+machine+seq
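Two of these strategies are easy to sketch concretely: random base62 draws from the full 62⁶ keyspace, while counter- or Snowflake-based schemes need a base62 encoder to turn a numeric ID into a fixed-width code. A minimal sketch (helper names are illustrative):

```python
import secrets
import string

# Digits, then uppercase, then lowercase: 10 + 26 + 26 = 62 symbols.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

def random_code(length=6):
    """Random base62 short code: 62^6 ≈ 56.8B possibilities."""
    return "".join(secrets.choice(BASE62) for _ in range(length))

def encode_base62(n, length=6):
    """Encode a numeric ID (counter or Snowflake) as a fixed-width base62 code."""
    digits = []
    for _ in range(length):
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))
```

Note that random codes still need a uniqueness check on insert (retry on conflict), while encoded counter IDs are unique by construction but predictable.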
Concept Check — 3 questions
Q1. Using MD5(url)[:6] to generate a short code — what is the main risk?
A. MD5 is too slow for 1,200 writes/sec
B. Hash collision — two different URLs could produce the same 6-character prefix
C. The generated codes are too long for practical use
D. MD5 is cryptographically deprecated and blocked by browsers
Q2. A global auto-increment DB counter for short codes: what is the main bottleneck at 1,200 writes/sec?
A. Counter values are not unique across distributed nodes
B. Counter rolls over after 2^32 values, causing duplicates
C. Serialization bottleneck — all writes must serialize on the single counter row, limiting throughput
D. Counter-based codes are too long to fit in 6 characters
Q3. Snowflake IDs embed a millisecond timestamp in the first 41 bits. What does this mean for generated short codes?
A. Codes are time-ordered — early URLs have lower IDs, and creation time can be extracted from the code
B. Codes are completely random and unpredictable
C. Codes are based on the content of the URL being shortened
D. Codes are guaranteed collision-free at any distributed scale without coordination
MD5 truncated to 6 chars: by the birthday paradox, a collision becomes likely after ~√(62^6) = 62^3 ≈ 238,000 URLs — at 100M/day that happens within the first few minutes, which is unacceptable. A DB counter serializes all writes — use range pre-allocation instead (each server grabs 1,000 IDs at once). Snowflake: 41-bit ms timestamp + 10-bit machine ID + 12-bit sequence = guaranteed unique per machine per millisecond. Creation time can be decoded straight out of the ID — a privacy consideration for short codes.
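The Snowflake bit layout described above can be sketched directly — this is a simplified illustration (a production generator must also handle clock skew and block when the 12-bit sequence overflows within one millisecond):

```python
import time

# Layout: 41-bit ms timestamp | 10-bit machine ID | 12-bit per-ms sequence.
EPOCH_MS = 1_288_834_974_657  # Twitter's custom epoch; any fixed past instant works

class SnowflakeSketch:
    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024   # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self, now_ms=None):
        now_ms = int(time.time() * 1000) if now_ms is None else now_ms
        if now_ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence (wraps here; real impl blocks)
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return ((now_ms - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

def creation_time_ms(snowflake_id):
    """The privacy leak in one line: creation time decodes straight out of the ID."""
    return (snowflake_id >> 22) + EPOCH_MS
```

`creation_time_ms` is exactly the property the third open challenge asks you to hide.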
Open Design Challenge
Design a range-based counter allocation: each application server pre-fetches a batch of 1,000 IDs from the DB. Sketch the schema and the pre-fetch logic.
A user submits the same long URL twice. Should you return the same short code (deduplication) or generate a new one? Discuss the trade-offs for each approach.
If short codes expose creation time (Snowflake-based), how does that affect privacy? Design a scrambling function that preserves uniqueness but hides ordering.
Cache Strategy for Redirects
80% of redirect traffic hits the top 20% of URLs (power law distribution). Read latency target is <5ms. PostgreSQL lookup takes ~10ms. Redis lookup takes ~0.5ms. Design a caching strategy to serve hot URLs from Redis while keeping PostgreSQL as the source of truth.
Read Path with Cache
📎 short URL request
→
Redis lookup (0.5ms)
hit →
return redirect ✓
miss →
PostgreSQL (10ms) → populate Redis → return redirect
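The read path in the diagram is the classic cache-aside pattern. A minimal sketch, with plain dicts standing in for Redis and PostgreSQL:

```python
# Cache-aside read path: Redis hit → return; miss → PostgreSQL,
# populate Redis, then return. Dicts are stand-ins for the real stores.
redis_cache = {}                                       # ~0.5 ms lookup in production
postgres = {"abc123": "https://example.com/long"}      # ~10 ms lookup in production

def resolve(short_code):
    """Return the long URL, preferring the cache; None if unknown."""
    long_url = redis_cache.get(short_code)
    if long_url is not None:
        return long_url                      # cache hit: fast path
    long_url = postgres.get(short_code)      # cache miss: slow path
    if long_url is not None:
        redis_cache[short_code] = long_url   # populate for next time (add a TTL in production)
    return long_url
```

The first lookup for a hot URL pays the 10 ms PostgreSQL cost; every subsequent lookup is served from cache, which is how the <5 ms latency target is met for the hot 20%.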
Concept Check — 3 questions
Q1. Which cache eviction policy best preserves the hot 20% of URLs in Redis for a URL shortener workload?
A. FIFO — first-in, first-out, evicting the oldest cached entries first
B. LRU — least recently used. Popular URLs stay in cache; cold URLs evict naturally as memory fills
C. Random — evict a random entry on memory pressure
D. LIFO — last-in, first-out, evicting the most recently cached entries
Q2. A user deletes their short URL. How should you invalidate the Redis cache entry immediately?
A. Wait for the TTL to expire naturally — the entry will be gone soon enough
B. Restart Redis to flush all cached entries
C. Explicit cache invalidation: issue DEL url:{short_code} to Redis synchronously before returning 200 to the client
D. Set a very short TTL (1 second) on all cache entries as a precaution
Q3. The cache hit rate for URL shortener redirects is expected to be very high because:
A. All URLs are accessed with roughly equal frequency
B. Power-law distribution — a small percentage of URLs get the vast majority of clicks; these hot URLs stay warm in cache
C. Redis pre-loads all URLs from PostgreSQL on startup
D. Browsers cache 301 redirects so the cache is rarely needed
LRU is perfect for power-law workloads — frequently accessed URLs naturally stay hot, rarely accessed URLs evict on their own. Cache invalidation on delete must be synchronous and happen before the 200 OK, otherwise a deleted URL may still redirect for minutes. Power law: in practice, ~1% of URLs get ~50% of traffic — the hot set fits comfortably in Redis RAM.
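The delete-path ordering matters enough to spell out. A sketch of synchronous invalidation (dicts again stand in for the real stores):

```python
# Synchronous invalidation: remove from the source of truth, then DEL the
# cache entry, and only then acknowledge the client — so a deleted URL can
# never keep redirecting from a stale cache entry after the 200 OK.
postgres = {"abc123": "https://example.com/long"}
redis_cache = {"abc123": "https://example.com/long"}

def delete_short_url(short_code):
    """Return an HTTP-style status code for the delete request."""
    if short_code not in postgres:
        return 404
    del postgres[short_code]                # source of truth first
    redis_cache.pop(short_code, None)       # synchronous DEL before responding
    return 200
```

If the DEL were fired asynchronously after the 200, a client could delete a URL, receive success, and still see it redirect until the cache entry expired.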
Open Design Challenge
A viral URL gets 500,000 concurrent redirects. All are cache misses simultaneously (cache stampede / thundering herd). Design a solution to prevent 500,000 PostgreSQL queries firing at once.
Design a two-level cache: local in-memory cache in each app server (L1) + shared Redis (L2). What are the TTL and consistency trade-offs between levels?
How would you monitor cache effectiveness? Define the metrics (hit rate, latency p50/p99, eviction rate) and set alert thresholds.
Abuse Prevention & Custom Slugs
Users can create custom short URLs (bit.ly/my-brand). Attackers exploit this to create phishing links (bit.ly/paypal-login pointing to evil-site.com). 15% of new URLs created are malicious. Custom slugs risk brand abuse (bit.ly/google, bit.ly/apple). You need automated detection and brand protection at creation time.
URL Creation Safety Pipeline
🔗 New URL submitted
→
🛡 Safe Browsing API check
🏷 Brand protection check
📊 Spam score model
→
✅ Approved
or
❌ Rejected
Concept Check — 3 questions
Q1. How does the Google Safe Browsing API help detect malicious short URLs at creation time?
A. It checks the short code for suspicious character patterns
B. It checks the destination (long) URL against a continuously updated database of known phishing and malware sites
C. It verifies the identity of the user creating the short URL
D. It rate-limits URL creation requests from suspicious IP addresses
Q2. Reserved slug protection prevents bit.ly/google from being registered. How is this implemented?
A. Disabling custom slugs entirely — only random slugs are permitted
B. Filing a lawsuit after brand abuse is discovered
C. Maintaining a blocklist of reserved brand names and common words; the slug is checked against it at creation time before being accepted
D. Charging premium fees for custom slugs to deter abuse
Q3. Short URL expiry after N days of no clicks is a good practice. Why?
A. To monetize users by forcing them to re-shorten expired links
B. It reduces storage costs, removes stale dead links, and limits the time window for destination-swapping attacks
C. DNS TTL requirements mandate that records expire periodically
D. No reason — all links should be permanent for reliability
Safe Browsing API: check the full destination URL before storing. Destination-swapping attack: a URL starts benign, gets cached by Safe Browsing, then the destination is changed to malware. Expiry combats this — short windows limit the blast radius. Brand protection blocklist should include: top 1000 brands, common words (support, login, account, password), your own infrastructure terms (api, admin, cdn).
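The blocklist check runs at creation time, before a custom slug is accepted. A minimal sketch — the lists below are illustrative, not the exhaustive top-1000-brands blocklist the text describes:

```python
# Creation-time slug check: reject reserved brands, common phishing words,
# and our own infrastructure terms. Illustrative lists only.
RESERVED = {
    "google", "apple", "paypal",                 # top brands (real list: ~1000)
    "support", "login", "account", "password",   # common phishing words
    "api", "admin", "cdn",                       # our own infrastructure terms
}
HIGH_RISK_SUBSTRINGS = ("paypal", "login", "password")  # catch look-alikes like "paypal-login"

def slug_allowed(slug):
    """True if a custom slug passes the brand-protection check."""
    s = slug.lower()
    if s in RESERVED:
        return False
    # Substring guard: an attacker's "paypal-login" embeds a reserved term.
    return not any(term in s for term in HIGH_RISK_SUBSTRINGS)
```

An exact-match blocklist alone misses slugs like `paypal-login`; the substring pass is the cheapest fix, at the cost of some false positives (e.g. a legitimate slug containing "account-login").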
Open Design Challenge
Design the destination-swap attack detection: a URL passes Safe Browsing at creation but the destination server later starts serving malware. How do you detect and respond after the fact?
Design a rate limiting system specifically for URL creation: prevent one user from creating 1M phishing links. What signals would you use beyond IP address?
Should you show a warning interstitial page before redirecting to unknown domains? Design the UX and specify when it triggers vs when you redirect immediately.