URL Shortener Capstone
Capacity estimation, ID generation, redirect mechanics, and abuse prevention at bit.ly scale.
4 Exercises
12 Concept Checks
~100 min total
System Design
Capacity Estimation
Design bit.ly at scale: 100M shortened URLs created per day. 10:1 read-to-write ratio means 1 billion redirects per day. How many servers, how much storage, and how much cache capacity do you need? Good estimation shows you understand order-of-magnitude thinking, not exact numbers.
Back-of-Envelope Numbers
100M writes/day = ~1,200 writes/sec
×10 read ratio
1B reads/day = ~11,600 reads/sec
6-char base62
62^6 = 56B unique codes
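These back-of-envelope figures can be sanity-checked in a few lines of Python (a sketch of the arithmetic above, nothing more):

```python
# Back-of-envelope capacity math for the numbers above.
SECONDS_PER_DAY = 24 * 60 * 60                    # 86,400

writes_per_sec = 100_000_000 / SECONDS_PER_DAY    # ~1,157, rounded to ~1,200
reads_per_sec = 10 * writes_per_sec               # ~11,574, rounded to ~11,600

# 5-year storage at 500 bytes per URL record
storage_bytes = 5 * 365 * 100_000_000 * 500
storage_tb = storage_bytes / 1e12                 # 91.25 TB

# 6-character base62 keyspace
keyspace = 62 ** 6                                # ~56.8 billion codes
```

Rounding up to ~1,200 writes/sec and ~11,600 reads/sec keeps a little headroom in the estimate, which is the point of order-of-magnitude thinking.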
Concept Check — 3 questions
Q1. 100M URL creations per day converts to approximately how many writes per second?
A. ~100 writes/sec
B. ~1,200 writes/sec
C. ~10,000 writes/sec
D. ~100,000 writes/sec
Q2. For 5 years of operation at 100M URLs/day, with each URL record averaging 500 bytes — what is the total storage needed?
A. ~1 GB — URLs compress well
B. ~1 TB — one terabyte is enough
C. ~91 TB — 5 × 365 × 100M × 500B
D. ~1 PB — petabyte scale is required
Q3. The redirect operation returns a 301 or 302 HTTP response. What is the key difference between them for a URL shortener?
A. 301 is faster because it skips the DNS lookup on subsequent redirects
B. 302 enables click analytics by hitting your server on every redirect; 301 is cached by browsers permanently, bypassing your server
C. 301 is for HTTP-to-HTTPS upgrades; 302 is for URL shortening
D. There is no meaningful difference — browsers treat them identically
100M / 86,400 seconds ≈ 1,157 writes/sec. Storage: 5yr × 365days × 100M URLs × 500B = 91.25 TB — plan for 100 TB with replication. 301 (permanent): browsers cache forever, great for load reduction but you lose analytics. 302 (temporary): browser always re-requests, you see every click for analytics. bit.ly uses 302 by default for click tracking.
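The 301/302 trade-off can be sketched in a few lines of Python. This is an illustrative stand-in, not a real framework: `url_table` plays the datastore and `click_log` the analytics pipeline.

```python
# Sketch of the 301-vs-302 choice: a 302 keeps every click on our server,
# so analytics can be recorded before the browser is redirected.
url_table = {"abc123": "https://example.com/long-page"}  # stand-in datastore
click_log = []                                           # stand-in analytics sink

def redirect(short_code, track_clicks=True):
    """Return (HTTP status, Location header) for a short code, or (404, None)."""
    long_url = url_table.get(short_code)
    if long_url is None:
        return 404, None
    if track_clicks:
        click_log.append(short_code)  # analytics only exists on the 302 path
        return 302, long_url          # temporary: browser re-requests every click
    return 301, long_url              # permanent: browser caches, server never sees repeats
```

With a 301, repeat visitors never reach `redirect()` at all, which is exactly why bit.ly defaults to 302.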
Open Design Challenge
Estimate the number of application servers needed to handle 11,600 redirect reads/sec. Assume each server handles 1,000 req/sec with 2× headroom. How many do you need?
Design the Redis cache tier: how much RAM do you need to cache the top 20% of URLs (those receiving 80% of traffic)? Assume the hot set is 20M URLs at 500 bytes each.
At 1,200 writes/sec, is a single PostgreSQL primary sufficient? At what write rate would you shard, and by what key?
ID Generation Strategies
You need to generate unique 6-character short codes for 100M URLs per day. Options: random base62, sequential DB counter, MD5 hash of the URL, or Snowflake-style timestamp IDs. Each has trade-offs around uniqueness guarantees, collision probability, predictability, and database performance.
ID Generation Comparison
MD5(url)[:6] — may collide
vs
Counter — predictable
vs
Random — 62^6 = 56B space
vs
Snowflake — time+machine+seq
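Two of these strategies are easy to sketch concretely: random base62 draws from the full 62⁶ keyspace, while counter- or Snowflake-based schemes need a base62 encoder to turn a numeric ID into a fixed-width code. A minimal sketch (helper names are illustrative):

```python
import secrets
import string

# Digits, then uppercase, then lowercase: 10 + 26 + 26 = 62 symbols.
BASE62 = string.digits + string.ascii_uppercase + string.ascii_lowercase

def random_code(length=6):
    """Random base62 short code: 62^6 ≈ 56.8B possibilities."""
    return "".join(secrets.choice(BASE62) for _ in range(length))

def encode_base62(n, length=6):
    """Encode a numeric ID (counter or Snowflake) as a fixed-width base62 code."""
    digits = []
    for _ in range(length):
        n, rem = divmod(n, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits))
```

Note that random codes still need a uniqueness check on insert (retry on conflict), while encoded counter IDs are unique by construction but predictable.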
Concept Check — 3 questions
Q1. Using MD5(url)[:6] to generate a short code — what is the main risk?
A. MD5 is too slow for 1,200 writes/sec
B. Hash collision — two different URLs could produce the same 6-character prefix
C. The generated codes are too long for practical use
D. MD5 is cryptographically deprecated and blocked by browsers
Q2. A global auto-increment DB counter for short codes: what is the main bottleneck at 1,200 writes/sec?
A. Counter values are not unique across distributed nodes
B. Counter rolls over after 2^32 values, causing duplicates
C. Serialization bottleneck — all writes must serialize on the single counter row, limiting throughput
D. Counter-based codes are too long to fit in 6 characters
Q3. Snowflake IDs embed a millisecond timestamp in the first 41 bits. What does this mean for generated short codes?
A. Codes are time-ordered — early URLs have lower IDs, and creation time can be extracted from the code
B. Codes are completely random and unpredictable
C. Codes are based on the content of the URL being shortened
D. Codes are guaranteed collision-free at any distributed scale without coordination
MD5 truncated to 6 chars: by the birthday paradox, a collision becomes likely after ~√(62^6) = 62^3 ≈ 238,000 URLs — at 100M/day that happens within the first few minutes, which is unacceptable. A DB counter serializes all writes — use range pre-allocation instead (each server grabs 1,000 IDs at once). Snowflake: 41-bit ms timestamp + 10-bit machine ID + 12-bit sequence = guaranteed unique per machine per millisecond. Creation time can be decoded straight out of the ID — a privacy consideration for short codes.
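The Snowflake bit layout described above can be sketched directly — this is a simplified illustration (a production generator must also handle clock skew and block when the 12-bit sequence overflows within one millisecond):

```python
import time

# Layout: 41-bit ms timestamp | 10-bit machine ID | 12-bit per-ms sequence.
EPOCH_MS = 1_288_834_974_657  # Twitter's custom epoch; any fixed past instant works

class SnowflakeSketch:
    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024   # must fit in 10 bits
        self.machine_id = machine_id
        self.sequence = 0
        self.last_ms = -1

    def next_id(self, now_ms=None):
        now_ms = int(time.time() * 1000) if now_ms is None else now_ms
        if now_ms == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence (wraps here; real impl blocks)
        else:
            self.sequence = 0
        self.last_ms = now_ms
        return ((now_ms - EPOCH_MS) << 22) | (self.machine_id << 12) | self.sequence

def creation_time_ms(snowflake_id):
    """The privacy leak in one line: creation time decodes straight out of the ID."""
    return (snowflake_id >> 22) + EPOCH_MS
```

`creation_time_ms` is exactly the property the third open challenge asks you to hide.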
Open Design Challenge
Design a range-based counter allocation: each application server pre-fetches a batch of 1,000 IDs from the DB. Sketch the schema and the pre-fetch logic.
A user submits the same long URL twice. Should you return the same short code (deduplication) or generate a new one? Discuss the trade-offs for each approach.
If short codes expose creation time (Snowflake-based), how does that affect privacy? Design a scrambling function that preserves uniqueness but hides ordering.
Cache Strategy for Redirects
80% of redirect traffic hits the top 20% of URLs (power law distribution). Read latency target is <5ms. PostgreSQL lookup takes ~10ms. Redis lookup takes ~0.5ms. Design a caching strategy to serve hot URLs from Redis while keeping PostgreSQL as the source of truth.
Read Path with Cache
📎 short URL request
→
Redis lookup (0.5ms)
hit →
return redirect ✓
miss →
PostgreSQL (10ms) → populate Redis → return redirect
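The read path in the diagram is the classic cache-aside pattern. A minimal sketch, with plain dicts standing in for Redis and PostgreSQL:

```python
# Cache-aside read path: Redis hit → return; miss → PostgreSQL,
# populate Redis, then return. Dicts are stand-ins for the real stores.
redis_cache = {}                                       # ~0.5 ms lookup in production
postgres = {"abc123": "https://example.com/long"}      # ~10 ms lookup in production

def resolve(short_code):
    """Return the long URL, preferring the cache; None if unknown."""
    long_url = redis_cache.get(short_code)
    if long_url is not None:
        return long_url                      # cache hit: fast path
    long_url = postgres.get(short_code)      # cache miss: slow path
    if long_url is not None:
        redis_cache[short_code] = long_url   # populate for next time (add a TTL in production)
    return long_url
```

The first lookup for a hot URL pays the 10 ms PostgreSQL cost; every subsequent lookup is served from cache, which is how the <5 ms latency target is met for the hot 20%.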
Concept Check — 3 questions
Q1. Which cache eviction policy best preserves the hot 20% of URLs in Redis for a URL shortener workload?
A. FIFO — first-in, first-out, evicting the oldest cached entries first
B. LRU — least recently used. Popular URLs stay in cache; cold URLs evict naturally as memory fills
C. Random — evict a random entry on memory pressure
D. LIFO — last-in, first-out, evicting the most recently cached entries
Q2. A user deletes their short URL. How should you invalidate the Redis cache entry immediately?
A. Wait for the TTL to expire naturally — the entry will be gone soon enough
B. Restart Redis to flush all cached entries
C. Explicit cache invalidation: issue DEL url:{short_code} to Redis synchronously before returning 200 to the client
D. Set a very short TTL (1 second) on all cache entries as a precaution
Q3. The cache hit rate for URL shortener redirects is expected to be very high because:
A. All URLs are accessed with roughly equal frequency
B. Power-law distribution — a small percentage of URLs get the vast majority of clicks; these hot URLs stay warm in cache
C. Redis pre-loads all URLs from PostgreSQL on startup
D. Browsers cache 301 redirects so the cache is rarely needed
LRU is perfect for power-law workloads — frequently accessed URLs naturally stay hot, rarely accessed URLs evict on their own. Cache invalidation on delete must be synchronous and happen before the 200 OK, otherwise a deleted URL may still redirect for minutes. Power law: in practice, ~1% of URLs get ~50% of traffic — the hot set fits comfortably in Redis RAM.
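The delete-path ordering matters enough to spell out. A sketch of synchronous invalidation (dicts again stand in for the real stores):

```python
# Synchronous invalidation: remove from the source of truth, then DEL the
# cache entry, and only then acknowledge the client — so a deleted URL can
# never keep redirecting from a stale cache entry after the 200 OK.
postgres = {"abc123": "https://example.com/long"}
redis_cache = {"abc123": "https://example.com/long"}

def delete_short_url(short_code):
    """Return an HTTP-style status code for the delete request."""
    if short_code not in postgres:
        return 404
    del postgres[short_code]                # source of truth first
    redis_cache.pop(short_code, None)       # synchronous DEL before responding
    return 200
```

If the DEL were fired asynchronously after the 200, a client could delete a URL, receive success, and still see it redirect until the cache entry expired.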
Open Design Challenge
A viral URL gets 500,000 concurrent redirects. All are cache misses simultaneously (cache stampede / thundering herd). Design a solution to prevent 500,000 PostgreSQL queries firing at once.
Design a two-level cache: local in-memory cache in each app server (L1) + shared Redis (L2). What are the TTL and consistency trade-offs between levels?
How would you monitor cache effectiveness? Define the metrics (hit rate, latency p50/p99, eviction rate) and set alert thresholds.
Abuse Prevention & Custom Slugs
Users can create custom short URLs (bit.ly/my-brand). Attackers exploit this to create phishing links (bit.ly/paypal-login pointing to evil-site.com). 15% of new URLs created are malicious. Custom slugs risk brand abuse (bit.ly/google, bit.ly/apple). You need automated detection and brand protection at creation time.
URL Creation Safety Pipeline
🔗 New URL submitted
→
🛡 Safe Browsing API check
🏷 Brand protection check
📊 Spam score model
→
✅ Approved
or
❌ Rejected
Concept Check — 3 questions
Q1. How does the Google Safe Browsing API help detect malicious short URLs at creation time?
A. It checks the short code for suspicious character patterns
B. It checks the destination (long) URL against a continuously updated database of known phishing and malware sites
C. It verifies the identity of the user creating the short URL
D. It rate-limits URL creation requests from suspicious IP addresses
Q2. Reserved slug protection prevents bit.ly/google from being registered. How is this implemented?
A. Disabling custom slugs entirely — only random slugs are permitted
B. Filing a lawsuit after brand abuse is discovered
C. Maintaining a blocklist of reserved brand names and common words; the slug is checked against it at creation time before being accepted
D. Charging premium fees for custom slugs to deter abuse
Q3. Short URL expiry after N days of no clicks is a good practice. Why?
A. To monetize users by forcing them to re-shorten expired links
B. It reduces storage costs, removes stale dead links, and limits the time window for destination-swapping attacks
C. DNS TTL requirements mandate that records expire periodically
D. No reason — all links should be permanent for reliability
Safe Browsing API: check the full destination URL before storing. Destination-swapping attack: a URL starts benign, gets cached by Safe Browsing, then the destination is changed to malware. Expiry combats this — short windows limit the blast radius. Brand protection blocklist should include: top 1000 brands, common words (support, login, account, password), your own infrastructure terms (api, admin, cdn).
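The blocklist check runs at creation time, before a custom slug is accepted. A minimal sketch — the lists below are illustrative, not the exhaustive top-1000-brands blocklist the text describes:

```python
# Creation-time slug check: reject reserved brands, common phishing words,
# and our own infrastructure terms. Illustrative lists only.
RESERVED = {
    "google", "apple", "paypal",                 # top brands (real list: ~1000)
    "support", "login", "account", "password",   # common phishing words
    "api", "admin", "cdn",                       # our own infrastructure terms
}
HIGH_RISK_SUBSTRINGS = ("paypal", "login", "password")  # catch look-alikes like "paypal-login"

def slug_allowed(slug):
    """True if a custom slug passes the brand-protection check."""
    s = slug.lower()
    if s in RESERVED:
        return False
    # Substring guard: an attacker's "paypal-login" embeds a reserved term.
    return not any(term in s for term in HIGH_RISK_SUBSTRINGS)
```

An exact-match blocklist alone misses slugs like `paypal-login`; the substring pass is the cheapest fix, at the cost of some false positives (e.g. a legitimate slug containing "account-login").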
Open Design Challenge
Design the destination-swap attack detection: a URL passes Safe Browsing at creation but the destination server later starts serving malware. How do you detect and respond after the fact?
Design a rate limiting system specifically for URL creation: prevent one user from creating 1M phishing links. What signals would you use beyond IP address?
Should you show a warning interstitial page before redirecting to unknown domains? Design the UX and specify when it triggers vs when you redirect immediately.