Day 24 · Week 4 — Real-World Systems
Design Google Search
Web crawling, inverted index construction, PageRank, query serving at 8.5 billion searches per day, and real-time index updates at global scale.
5 Stages
~4h Study
2 Simulations
5 Quizzes
Section 1
Three Massive Systems
Crawler
Discovers and downloads the web. 20+ billion pages indexed. Breadth-first traversal from seed URLs. Bloom filter tracks visited URLs. Respects robots.txt and per-domain crawl-delay politeness.
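The crawl loop above can be sketched in a few lines: breadth-first traversal from seed URLs, with a small hand-rolled Bloom filter standing in for visited-URL tracking. The `link_graph` dict is a toy stand-in for real HTTP fetching; all names and sizes here are illustrative, not Google's.

```python
import hashlib
from collections import deque

class BloomFilter:
    """Tiny Bloom filter: no false negatives, rare false positives."""
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def crawl(seeds, link_graph, max_pages=100):
    """BFS from seed URLs; link_graph maps url -> outlinks (toy fetcher)."""
    visited = BloomFilter()
    frontier = deque(seeds)
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:   # Bloom filter: may skip a URL on a rare false positive
            continue
        visited.add(url)
        order.append(url)    # a real crawler would fetch and store the page here
        frontier.extend(link_graph.get(url, []))
    return order
```

A production crawler adds per-domain politeness delays and robots.txt checks around the fetch step; this sketch only shows the frontier/visited-set mechanics.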
Indexer
Builds inverted index via MapReduce: Map emits (term, doc_id) pairs; Reduce merges into sorted posting lists. Index sharded by document range across thousands of servers.
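The Map and Reduce phases described above can be sketched on a single machine; a real job shuffles the emitted pairs across thousands of workers, but the data flow is the same. Tokenization here is naive whitespace splitting, for illustration only.

```python
from collections import defaultdict

def map_phase(docs: dict) -> list:
    """docs: {doc_id: text}. Emit one (term, doc_id) pair per occurrence."""
    pairs = []
    for doc_id, text in docs.items():
        for term in text.lower().split():
            pairs.append((term, doc_id))
    return pairs

def reduce_phase(pairs: list) -> dict:
    """Group pairs by term into deduplicated, sorted posting lists."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "google indexes the web", 2: "the web is big"}
index = reduce_phase(map_phase(docs))
# index["web"] -> [1, 2]; index["google"] -> [1]
```

Sorted posting lists are what make multi-term queries cheap: intersecting two sorted lists is a linear merge.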
Ranker
PageRank (link graph authority) + TF-IDF (term relevance) + 200+ signals. ML learning-to-rank (LambdaMART) re-ranks top-1000 candidates. Updated continuously without full re-index.
Query Server
Returns results in <200ms. Serves 99,000 queries/sec globally. Query fans out to all index shards simultaneously. Serving layer merges top-K from each shard and re-ranks.
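The serving-layer merge described above can be sketched with a heap: each shard returns its local top-K as (score, doc_id) pairs, and the merger combines them into a global top-K. Scores and doc IDs are made up for illustration.

```python
import heapq

def merge_top_k(shard_results: list, k: int) -> list:
    """shard_results: list of per-shard lists of (score, doc_id) pairs.
    Returns the global top-k, highest score first."""
    all_hits = (hit for shard in shard_results for hit in shard)
    return heapq.nlargest(k, all_hits)

shard_a = [(0.9, "doc1"), (0.4, "doc7")]
shard_b = [(0.8, "doc3"), (0.7, "doc2")]
top = merge_top_k([shard_a, shard_b], 3)
# top -> [(0.9, 'doc1'), (0.8, 'doc3'), (0.7, 'doc2')]
```

Because each shard only ships its top-K (not all candidates), the fan-in stays small even with thousands of shards.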
Scale that's hard to grasp:
8.5 billion searches/day = 99,000 queries/second. Each query must complete in 200ms and touches thousands of servers in parallel. The index covers 20+ billion pages. The crawler downloads billions of pages per day. Nothing about this is off-the-shelf.
Section 2 — Interactive Simulation
Inverted Index Builder
How it works:
Edit the documents below and click "Build Index" to see how words map to documents. Then search for a term to see real-time document retrieval using the inverted index you just built.
Document Corpus
Doc 1 — Technology
Doc 2 — Databases
Doc 3 — Infrastructure
Doc 4 — Search
Inverted Index
Click "Build Index" to see the term → document postings list
Search Documents
PageRank Link Graph Visualization
Node size = PageRank score. Arrows = links. High-authority pages (linked by many others) grow larger. Click "Run PageRank" to animate the iterative calculation.
Section 3
PageRank Explained
PR(A) = (1 - d) / N + d × Σ ( PR(Bᵢ) / L(Bᵢ) )
d = 0.85 (damping factor)
N = total pages
PR(Bᵢ) = PageRank of linking page
L(Bᵢ) = outbound links from Bᵢ
Google's original breakthrough:
PageRank grew out of Larry Page's PhD research at Stanford, developed with Sergey Brin. Before it, search engines ranked pages largely by keyword frequency, which was easily gamed. PageRank models the web as a random walk: a "random surfer" follows a link with probability d = 0.85 or jumps to a random page with probability 0.15. Pages linked by high-PR pages (like Wikipedia) accumulate authority and rank higher.
Why the damping factor (0.85) matters:
Without damping, "rank sinks" (pages or closed loops with no outbound links to the rest of the web) would absorb all PageRank. With d = 0.85, 15% of every page's rank is redistributed uniformly across all pages at each step, preventing rank from accumulating in dead ends. In practice the iteration converges after ~50–100 passes.
PYTHON · pagerank.py
def pagerank(graph: dict, d=0.85, iterations=100) -> dict:
    """
    graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
    Returns approximately: {'A': 0.388, 'B': 0.215, 'C': 0.397}
    """
    N = len(graph)
    pr = {page: 1 / N for page in graph}  # init equal
    for _ in range(iterations):
        new_pr = {}
        for page in graph:
            # Sum contributions from all pages that link to this page
            rank_sum = sum(
                pr[linker] / len(outlinks)  # PR divided by outbound links
                for linker, outlinks in graph.items()
                if page in outlinks
            )
            new_pr[page] = (1 - d) / N + d * rank_sum
        pr = new_pr
    return pr

# Converges in ~50-100 iterations
# Wikipedia links to 10,000 pages: each page gets PR(Wikipedia)/10,000
# One link from Wikipedia >> 10,000 links from spam blogs
Section 4
Query Serving at Scale
99,000 queries/second × 200ms budget: using Little's Law (L = λ × W), Google processes 19,800 concurrent queries at any moment (99,000 × 0.2 seconds). Each query fans out to thousands of servers in parallel to return results within the 200ms window.
Spelling Correction
"gogle" → "google" — statistical correction using query frequency logs
<5ms
Query Understanding
Identify entities, synonyms, intent. "apple" → fruit or tech company?
<10ms
Index Lookup (Fan-out)
Query broadcasts to thousands of index shards simultaneously. Each returns top-K candidate docs.
<50ms
Ranking
Apply 200+ signals (PageRank, TF-IDF, freshness, location, CTR) to candidates. ML re-ranks.
<100ms
Snippet Generation
Extract and highlight relevant portions of each document. Generate title, URL, description.
<30ms
Serving Architecture Breakdown
| Component | What it does | Latency Budget | Scaling Approach |
|---|---|---|---|
| Spell check | "gogle" → "google" using n-gram frequency model | <5ms | In-memory trie / pre-computed corrections |
| Query parser | Identify entities, synonyms, intent classification | <10ms | Stateless — horizontally scalable |
| Index lookup | Inverted index lookup → candidate documents | <50ms | Fan-out to 1000s of index shards in parallel |
| Ranking | Score candidates with 200+ signals + ML model | <100ms | Dedicated ranking servers with GPU/TPU |
| Snippet gen | Extract relevant text from matching documents | <30ms | Forward index lookup + highlighting |
| Cache | Top-1M queries cached — ~80% hit rate globally | <5ms | Distributed Redis / Memcached tier |
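The spell-check row above can be illustrated with a minimal frequency-based corrector in the spirit of Norvig's classic approach: generate every string one edit away from the input and pick the candidate with the highest query frequency. The `query_freq` dict is a hypothetical stand-in for real query logs.

```python
def edits1(word: str) -> set:
    """All strings one edit (delete, swap, replace, insert) from word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    swaps = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | swaps | replaces | inserts

def correct(word: str, query_freq: dict) -> str:
    """Return the most frequent known query within one edit, else the word itself."""
    if word in query_freq:
        return word
    candidates = edits1(word) & query_freq.keys()
    return max(candidates, key=query_freq.get, default=word)

query_freq = {"google": 1_000_000, "goggle": 5_000, "maps": 80_000}
# correct("gogle", query_freq) -> "google"
```

Production spell check layers n-gram context and phonetic models on top, but frequency-weighted edit distance is the core idea.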
Little's Law in action:
L (concurrent queries) = λ (queries/sec) × W (avg time per query) = 99,000 × 0.2s = 19,800. This means at steady state, Google has ~19,800 queries simultaneously in-flight across its serving stack. This directly drives infrastructure sizing — you need enough servers to handle 19,800 parallel queries without queuing.
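The sizing arithmetic above, written out as a quick calculation (numbers from the text; the per-server capacity is an assumed figure for illustration):

```python
lam = 99_000   # arrival rate lambda, queries/sec
w = 0.2        # average time in system W, seconds

concurrent = lam * w          # Little's Law: L = lambda * W
# concurrent -> 19800.0 queries simultaneously in flight

# If each serving machine can hold, say, 100 in-flight queries
# (an assumed figure), steady state alone needs ~198 machines:
servers_needed = concurrent / 100
```

Real capacity planning adds headroom for traffic spikes and shard fan-out, so the true fleet is far larger; Little's Law just sets the floor.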
Section 5
Technology Decisions
Index Storage
Custom SSTable Format
Not Elasticsearch or Solr. Google's index is a custom SSTable-like binary format optimized for their specific access patterns. Off-the-shelf search engines can't handle 20B+ pages efficiently.
Crawler Storage
Bigtable + Bloom Filter
Crawl frontier (URLs to visit) in a priority queue. Visited-URL tracking uses a Bloom filter: no false negatives, so an already-crawled URL is never wastefully re-crawled, and an occasional false positive means an unvisited URL is wrongly skipped — a rare, acceptable loss at this scale.
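The priority-queue frontier mentioned above can be sketched with `heapq`: URLs are popped highest-priority first (priority might be the linking page's PageRank, for example). The URLs and priority values here are illustrative.

```python
import heapq

class CrawlFrontier:
    """Priority queue of URLs to visit; higher priority pops first."""
    def __init__(self):
        self._heap = []
        self._counter = 0   # tie-breaker so equal priorities stay FIFO

    def push(self, url: str, priority: float):
        # heapq is a min-heap, so negate priority to pop the highest first
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

frontier = CrawlFrontier()
frontier.push("https://example.com/low", 0.1)
frontier.push("https://example.com/high", 0.9)
# frontier.pop() -> "https://example.com/high"
```

Keeping the frontier priority-ordered means freshly discovered high-authority pages jump the queue ahead of low-value URLs.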
Result Caching
Distributed Cache (80% hit)
Top-1M queries account for ~80% of traffic. Cache these results globally. Cache invalidated when pages are re-crawled and scores change significantly. Stale results for seconds are acceptable.
Infrastructure
Colossus + Bigtable
Colossus (Google's distributed filesystem, successor to GFS) stores index shards. Bigtable stores document metadata and crawl state. Custom hardware (TPUs) for ML ranking inference.
Section 6 — Knowledge Check
Quiz
Question 1 of 5
An inverted index maps:
A. Document IDs to their full text content and URLs
B. Each word to the sorted list of documents containing that word (a postings list)
C. User queries to their historically most common result sets
D. Web pages to their PageRank score and authority metrics
Question 2 of 5
PageRank's damping factor (0.85) models:
A. The rate at which the web crawler downloads new pages per second
B. The probability that a random web surfer follows a link (0.85) vs jumps to a completely random page (0.15)
C. The freshness decay rate of indexed web pages over time
D. The compression ratio applied to the inverted index on disk
Question 3 of 5
Google serves 99,000 queries per second. Each query takes 200ms to complete. Using Little's Law (L = λ × W), how many concurrent queries are being processed at any moment?
A. 99,000 — each query occupies one server for its full duration
B. 19,800 — from L = 99,000 qps × 0.2 seconds
C. 495,000 — accounting for parallel shard fan-out per query
D. 1,980 — using a 10% utilization safety factor
Question 4 of 5
A web page links to 10 other pages. In PageRank, how does it distribute its authority to those pages?
A. It gives its full PageRank value to each of the 10 linked pages
B. It divides its PageRank equally — each linked page receives PR(source) / 10
C. It contributes nothing — only inbound links to the source page matter
D. It multiplies its PageRank by 10 and shares the total across linked pages
Question 5 of 5
Google's web crawler must respect robots.txt. If a site's robots.txt says "Disallow: /private/", the crawler should:
A. Crawl /private/ anyway since it's on the public internet and technically accessible
B. Skip all pages on that entire domain as a precaution
C. Respect the directive and skip /private/ and all subpaths — this is a standard, voluntary but universally honored convention
D. Crawl it once for discovery, then stop re-visiting after the first crawl