Caching Architecture & Invalidation

Cache-aside, write-through, write-behind, TTL strategy, and cache stampede prevention.

4 Exercises
12 Concept Checks
~90 min total
System Design
Exercise 1 🟡 Easy ⏱ 15 min
Write Strategy Selection
An e-commerce product page shows price, inventory, and description. Price changes during flash sales (high write frequency); description rarely changes. With a naive cache, a price update takes 45 seconds to propagate because of a stale cache entry — users see wrong prices and over-purchase.
Cache Write Strategies Compared
Cache-Aside
Read: check cache → miss → read DB → populate cache
Write-Through
Write: update cache + DB simultaneously — always consistent
Write-Behind
Write: update cache only → async flush to DB later
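The three write paths above can be sketched in a few lines. This is a minimal illustration, not production code: plain dicts stand in for Redis and the database, and `flush_dirty` plays the role of the write-behind async flusher.

```python
db = {}        # stands in for the database
cache = {}     # stands in for Redis
dirty = set()  # keys written to cache but not yet flushed to the DB

def read_cache_aside(key):
    """Cache-aside read: check cache -> miss -> read DB -> populate cache."""
    if key in cache:
        return cache[key]
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value

def write_through(key, value):
    """Update cache and DB synchronously: never a stale read."""
    cache[key] = value
    db[key] = value

def write_behind(key, value):
    """Update cache only; the key is flushed to the DB later."""
    cache[key] = value
    dirty.add(key)

def flush_dirty():
    """The async flusher. Data written via write_behind is lost if
    the cache dies before this runs."""
    for key in list(dirty):
        db[key] = cache[key]
        dirty.discard(key)
```

Note how the data-loss window of write-behind is visible here: between `write_behind` and `flush_dirty`, the value exists only in `cache`.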
Concept Check — 3 questions
Q1. Price updates must be visible within 1 second. Which write strategy ensures this?
A. Write-behind — update cache asynchronously before flushing to DB
B. Write-through — update both cache and DB synchronously on every write
C. Cache-aside with 60s TTL — the cache auto-expires in under a minute
D. No cache — always read from DB directly for fresh prices
Q2. Which cache write strategy risks data loss if the cache server crashes before the DB flush?
A. Write-through — both DB and cache are updated synchronously
B. Cache-aside — the app reads cache then falls back to DB
C. Write-behind — data lives in cache before the async DB flush completes
D. Read-through — the cache populates itself from the DB on reads
Q3. Cache-aside pattern: user requests a product page. Cache miss. What are the steps in correct order?
A. Check cache → miss → query DB → write result to cache → return to user
B. Query DB → write to cache → return to user (skip cache check)
C. Write to cache → query DB → return to user
D. Query DB → return to user without caching
Write-through synchronously updates cache + DB — no stale reads. Write-behind is faster (write to cache only, flush async) but risks data loss on crash. TTL strategy: product descriptions 24h, prices 5s or event-driven invalidation, inventory 1s or no cache. Warming: pre-populate cache on deploy to avoid first-user cold start penalty.
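The per-field TTLs and the warming step can be combined in one sketch. `TTLCache` is an in-memory stand-in for Redis, and `warm_cache` is a hypothetical deploy-time helper; the TTL values are the ones given above.

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-key TTL (a stand-in for Redis)."""
    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        self._data[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazy expiry on access, as Redis does
            return None
        return value

# Per-field TTLs from the strategy above, in seconds.
TTLS = {"description": 24 * 3600, "price": 5, "inventory": 1}

def warm_cache(cache, product_ids, fetch):
    """Pre-populate hot keys on deploy so the first user never hits a cold miss."""
    for pid in product_ids:
        for field, ttl in TTLS.items():
            cache.set(f"product:{pid}:{field}", fetch(pid, field), ttl)
```

Running `warm_cache` over the top-selling product IDs during deployment means the first request after a restart is served from cache, not the database.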
Open Design Challenge
1
A flash sale drops price from $99 to $10 for 1 hour. Design cache invalidation so all 5M users see the correct price within 1 second. Show the event flow.
2
Define TTL strategy for: product description (changes rarely), price (changes often), inventory count (changes very often during flash sale).
3
How does cache warming prevent the first user after a deploy from experiencing a cold-start slow response? Describe the warming process.
Exercise 2 🔴 Medium ⏱ 20 min
Cache Stampede Prevention
Reddit's front page is cached with a 60s TTL. At T=0, 50,000 concurrent users are viewing the page. At T=60, the cache expires. All 50,000 requests see a cache miss simultaneously and query the database — the thundering herd crashes the DB within 2 seconds.
The Thundering Herd Problem
T=60: cache expires
50K concurrent requests
all miss → 50K DB queries
DB CPU 100% → timeout cascade
Concept Check — 3 questions
Q1. Which technique allows only 1 request to repopulate the cache while all others wait for the result?
A. Shorter TTL — reducing cache lifetime prevents synchronized expiry
B. Mutex lock on the cache key — only the lock holder queries DB; others wait for cache population
C. Add a CDN layer in front of the cache
D. Rate limiting users to 1 request per second each
Q2. Probabilistic Early Rehydration (PER) prevents cache stampede by doing what?
A. Caching responses at the CDN edge nodes instead of in Redis
B. Using write-through to keep cache always fresh
C. Probabilistically refreshing the cache BEFORE it expires, based on TTL remaining and request rate
D. Returning stale data forever without ever refreshing
Q3. The root cause of a cache stampede is?
A. Synchronized cache expiry under high concurrency — all replicas expire at the same instant
B. Too much data stored in the cache causing memory pressure
C. Slow network between cache and application servers
D. A poorly designed database schema causing slow queries
Redis lock: SETNX lock:{key} 1 then EXPIRE lock:{key} 5 — only one client acquires it and queries the DB; others wait or serve stale. (In practice prefer the single atomic command SET lock:{key} 1 NX EX 5, since a crash between SETNX and EXPIRE leaves a lock with no expiry.) Jitter: instead of an exact 60s TTL, use 60 ± random(0, 15) seconds to spread expiry times across the cluster. Stale-while-revalidate: return the cached value immediately and spawn an async refresh.
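A sketch of the mutex and jitter, using a `FakeRedis` stand-in that implements only the acquire-if-absent behaviour the lock needs (lock TTL is omitted in the stand-in). `refresh_with_lock` and `jittered_ttl` are hypothetical names for illustration.

```python
import random
import threading

class FakeRedis:
    """In-memory stand-in for the two commands the mutex uses."""
    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def set_nx(self, key, value):
        """Acquire-if-absent, like SETNX: True only for the first caller."""
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def delete(self, key):
        """Release the lock, like DEL lock:{key}."""
        with self._mutex:
            self._data.pop(key, None)

def refresh_with_lock(r, key, fetch, cache):
    """Only the lock holder hits the DB; everyone else serves stale."""
    if r.set_nx(f"lock:{key}", "1"):
        try:
            cache[key] = fetch()
        finally:
            r.delete(f"lock:{key}")
    return cache.get(key)

def jittered_ttl(base=60.0, jitter=15.0):
    """60 +/- 15 s: keys set at the same moment no longer expire in lockstep."""
    return base + random.uniform(-jitter, jitter)
```

With 50K concurrent misses, `set_nx` succeeds for exactly one caller; the other 49,999 immediately return the stale value instead of piling onto the database.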
Open Design Challenge
1
Write the Redis command sequence (SETNX, EXPIRE, GET, DEL) to implement a mutex lock that prevents cache stampede on key "front_page".
2
Add TTL jitter: instead of exact 60s, use 60s ± 15s. Explain mathematically why this prevents synchronized expiry across a 10-node Redis cluster.
3
Design a stale-while-revalidate system: serve stale cache immediately while asynchronously refreshing. How do you prevent multiple concurrent refresh attempts?
Exercise 3 🔴 Medium ⏱ 25 min
Cache Invalidation in Microservices
An order service updates an order status. Three downstream services cache order data: a notification service, a dashboard service, and a mobile API. How does the order service invalidate 3 separate caches without creating tight coupling between services?
Event-Driven Cache Invalidation
Order Service
writes DB
publishes order_updated → Kafka
Notification Cache (subscribes)
Dashboard Cache (subscribes)
Mobile Cache (subscribes)
Concept Check — 3 questions
Q1. What is the cleanest way for the Order service to invalidate caches in other services?
A. Order service directly calls each service's cache invalidation REST API endpoint
B. Publish a domain event to a message bus; each service subscribes and self-invalidates its own cache
C. Use a shared cache namespace with a global invalidation command that clears all services
D. Set very short TTLs (1s) in all downstream caches so they self-expire quickly
Q2. Event-driven cache invalidation introduces which consistency model between services?
A. Strong consistency — all caches update atomically with the write
B. Linearizability — reads always reflect the latest write globally
C. Eventual consistency — caches may be stale for milliseconds to seconds after the event
D. Causal consistency — caches update in causal order
Q3. A cache invalidation event is lost due to a Kafka partition failure. What do downstream services see?
A. They immediately fall back to the database for fresh data
B. Stale cached data until the TTL expires or the next successful invalidation event
C. The service crashes due to missing invalidation signal
D. An automatic rollback of the order update in the database
Event schema: {entity_type, entity_id, operation, timestamp, version}. Recovery after downtime: replay missed events from the Kafka offset where the service last committed; a blunter fallback is to flush all cached entities of the affected type. Last-resort TTL: set the TTL to the maximum acceptable staleness — for order status, 5 minutes is reasonable.
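The event schema and a subscriber-side handler can be sketched as follows. The schema fields are the ones listed above; `SubscriberCache` is a hypothetical consumer, and the per-entity version check is what makes replaying old Kafka offsets after downtime safe.

```python
import json
import time

def invalidation_event(entity_type, entity_id, operation, version):
    """Build an event with the schema from the text:
    {entity_type, entity_id, operation, timestamp, version}."""
    return {
        "entity_type": entity_type,
        "entity_id": entity_id,
        "operation": operation,
        "timestamp": time.time(),
        "version": version,
    }

class SubscriberCache:
    """A downstream service's cache that self-invalidates on events.
    Tracking the highest version seen per entity makes replay
    idempotent: stale or duplicate events are simply ignored."""
    def __init__(self, entity_type):
        self.entity_type = entity_type
        self.data = {}
        self._seen = {}  # entity_id -> highest version handled

    def handle(self, raw):
        event = json.loads(raw)
        if event["entity_type"] != self.entity_type:
            return  # not our entity; other subscribers handle it
        eid = event["entity_id"]
        if event["version"] <= self._seen.get(eid, -1):
            return  # replayed or out-of-order event: already handled
        self.data.pop(eid, None)  # drop the stale cached entity
        self._seen[eid] = event["version"]
```

Because `handle` is idempotent, a service returning from 30 minutes of downtime can rewind to its last committed offset and reprocess everything without double-invalidating.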
Open Design Challenge
1
Design the Kafka topic schema for cache invalidation events. What fields are required? Show a JSON example.
2
A service comes back online after 30 minutes of downtime with a stale cache. How do you recover? Describe the replay strategy.
3
What TTL should serve as the "last resort" failsafe if events fail to arrive? Justify your choice for order status data.
Exercise 4 🔥 Hard ⏱ 30 min
Multi-Tier Cache Architecture
A news site has 10M articles. Top 1,000 articles get 95% of traffic (power law distribution). Articles have 3 tiers: breaking news (updated every minute), feature articles (updated daily), archived content (never changes). Design a multi-tier caching strategy optimizing for each tier.
Multi-Tier Cache Hierarchy
Browser Cache (L1)
→ miss →
CDN Edge (L2)
→ miss →
Redis (L3)
→ miss →
PostgreSQL (source)
Concept Check — 3 questions
Q1. For archived content that never changes, what is the ideal CDN Cache-Control header?
A. no-cache — always revalidate with the origin
B. max-age=60 — cache for 1 minute
C. Cache-Control: public, max-age=31536000, immutable — cache for 1 year, never revalidate
D. private — only cache in the browser, not CDN
Q2. Breaking news articles (updated every minute) should use which cache strategy?
A. No cache — breaking news is too dynamic to cache at all
B. Short TTL (60s) + event-driven CDN purge when the article is updated
C. 24h TTL — acceptable staleness for news content
D. Browser cache only — do not use CDN for breaking news
Q3. The correct cache hierarchy hit order for a CDN-served article request is?
A. Browser L1 → CDN Edge L2 → CDN Origin Shield L3 → origin server
B. Origin server → CDN Edge → Browser (data flows outward only)
C. Redis → CDN Edge → Browser
D. DB → Redis → User directly
Cache key convention: /articles/{id}/v{version} — version bump on update creates a new cache entry. CDN purge: call CDN purge API on publish event; use surrogate/cache-tag keys to purge all quality variants atomically. Storage for top-1,000: 1,000 × 500KB × 5 variants = 2.5GB — fits comfortably in L2 CDN edge cache. Immutable flag tells browsers never to revalidate the file.
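The key convention, the per-tier headers, and the storage arithmetic above fit in a short sketch. The header strings and the 2.5 GB figure come from the text; the function names are illustrative, and max-age=86400 for feature articles is an assumption matching their daily update cadence.

```python
def article_cache_key(content_type, article_id, version):
    """Versioned key: bumping the version on update is instant cache
    busting, because the old key simply stops being requested."""
    return f"/{content_type}/{article_id}/v{version}"

# Cache-Control per content tier.
CACHE_CONTROL = {
    "archived": "public, max-age=31536000, immutable",  # 1 year, never revalidate
    "feature":  "public, max-age=86400",                # 1 day (assumed)
    "breaking": "public, max-age=60",                   # 60s + event-driven purge
}

def cdn_storage_bytes(n_articles, avg_bytes, variants):
    """Top-1,000 x 500 KB x 5 variants = 2.5 GB, as computed above."""
    return n_articles * avg_bytes * variants
```

Requesting `/articles/42/v3` after an update to v4 is impossible by construction: every link and API response carries the new version, so no purge race exists for versioned keys.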
Open Design Challenge
1
Design a cache key naming convention for articles that includes content type, article ID, and version. How does versioning enable instant cache busting?
2
When a breaking news article is updated, how do you purge it from CDN in under 5 seconds? Describe the CDN purge API flow including surrogate keys.
3
Calculate CDN storage needed for the top-1,000 articles averaging 500KB each with 5 quality/format variants.