Day 24 · Week 4 — Real-World Systems

Design Google Search

Web crawling, inverted index construction, PageRank, query serving at 8.5 billion searches per day, and real-time index updates at global scale.

5 Stages
~4h Study
2 Simulations
5 Quizzes

Three Massive Systems

🕷️
Crawler
Discovers and downloads the web. 20+ billion pages indexed. Breadth-first traversal from seed URLs. Bloom filter tracks visited URLs. Respects robots.txt and per-domain crawl-delay politeness.
📑
Indexer
Builds inverted index via MapReduce: Map emits (term, doc_id) pairs; Reduce merges into sorted posting lists. Index sharded by document range across thousands of servers.
🏆
Ranker
PageRank (link graph authority) + TF-IDF (term relevance) + 200+ signals. ML learning-to-rank (LambdaMART) re-ranks top-1000 candidates. Updated continuously without full re-index.
⚡
Query Server
Returns results in <200ms. Serves 99,000 queries/sec globally. Query fans out to all index shards simultaneously. Serving layer merges top-K from each shard and re-ranks.
⚠️
Scale that's hard to grasp: 8.5 billion searches/day = 99,000 queries/second. Each query must complete in 200ms and touches thousands of servers in parallel. The index covers 20+ billion pages. The crawler downloads billions of pages per day. Nothing about this is off-the-shelf.
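The crawler's breadth-first traversal can be sketched in a few lines. This is a toy model with a hypothetical in-memory link graph in place of real HTTP fetches; a production crawler gates each fetch on robots.txt and per-domain crawl-delay, and tracks visited URLs in a Bloom filter rather than an exact set.

PYTHON · crawler.py

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl from seed URLs.

    fetch_links(url) -> list of outbound URLs. In production this is an
    HTTP fetch gated by robots.txt and crawl-delay politeness, and the
    visited set is a Bloom filter (see Technology Decisions below).
    """
    frontier = deque(seed_urls)   # crawl frontier (FIFO = breadth-first)
    visited = set(seed_urls)      # exact set stands in for the Bloom filter
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:   # Bloom-filter check in the real system
                visited.add(link)
                frontier.append(link)
    return order

# Hypothetical toy web as an in-memory link graph
web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": ["a.com"],
    "d.com": [],
}
print(crawl(["a.com"], lambda u: web.get(u, [])))
# BFS order: ['a.com', 'b.com', 'c.com', 'd.com']
```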

Inverted Index Builder

ℹ️
How it works: Edit the documents below and click "Build Index" to see how words map to documents. Then search for a term to see real-time document retrieval using the inverted index you just built.

Document Corpus

Doc 1 — Technology
Doc 2 — Databases
Doc 3 — Infrastructure
Doc 4 — Search

Inverted Index

Click "Build Index" to see the term → document postings list

Search Documents
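The index the widget builds follows the Map/Reduce shape described earlier: Map emits (term, doc_id) pairs, Reduce merges them into sorted posting lists, and an AND query intersects the posting lists of its terms. A minimal single-machine sketch, using made-up documents:

PYTHON · inverted_index.py

```python
from collections import defaultdict

def map_phase(docs):
    """Map: emit (term, doc_id) pairs, one per distinct term per document."""
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):  # dedupe terms within a doc
            yield term, doc_id

def reduce_phase(pairs):
    """Reduce: merge pairs into a sorted posting list per term."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):  # sort groups terms, orders doc_ids
        index[term].append(doc_id)
    return dict(index)

def search(index, query):
    """AND query: intersect the posting lists of each query term."""
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "distributed systems scale horizontally",
    2: "databases index data for fast lookup",
    3: "search engines index the distributed web",
}
index = reduce_phase(map_phase(docs))
print(index["index"])                      # [2, 3]
print(search(index, "distributed index"))  # [3]
```

In the real system the Reduce output is sharded by document range across thousands of servers; here everything fits in one dictionary.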

PageRank Link Graph Visualization

Node size = PageRank score. Arrows = links. High-authority pages (linked by many others) grow larger. Click "Run PageRank" to animate the iterative calculation.

PageRank Explained

PR(A) = (1 - d) / N + d × Σ ( PR(Bᵢ) / L(Bᵢ) )
d = 0.85 (damping factor)
N = total pages
PR(Bᵢ) = PageRank of linking page
L(Bᵢ) = outbound links from Bᵢ
ℹ️
Google's original breakthrough: PageRank was Larry Page's PhD thesis insight. Before it, search engines ranked pages by keyword frequency alone — easily gamed. PageRank models the web as a random walk: a "random surfer" clicks links with probability d=0.85 or jumps to a random page with probability 0.15. Pages linked by high-PR pages (like Wikipedia) accumulate authority and rank higher.
💡
Why the damping factor (0.85) matters: Without damping, "rank sinks" (pages with no outbound links) would absorb all PageRank. With d=0.85, every page leaks 15% of its rank back to the random distribution, preventing rank accumulation in dead ends. The formula converges after ~50–100 iterations.
PYTHON · pagerank.py
def pagerank(graph: dict, d=0.85, iterations=100) -> dict:
    """
    graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
    Returns (approximately): {'A': 0.388, 'B': 0.215, 'C': 0.397}
    """
    N = len(graph)
    pr = {page: 1/N for page in graph}  # init equal

    for _ in range(iterations):
        new_pr = {}
        for page in graph:
            # Sum contributions from all pages that link to this page
            rank_sum = sum(
                pr[linker] / len(outlinks)  # PR divided by outbound links
                for linker, outlinks in graph.items()
                if page in outlinks
            )
            new_pr[page] = (1 - d) / N + d * rank_sum
        pr = new_pr

    return pr  # Converges in ~50-100 iterations

# Wikipedia links to 10,000 pages: each page gets PR(Wikipedia)/10,000
# One link from Wikipedia >> 10,000 links from spam blogs

Query Serving at Scale

99,000 queries/second × 200ms budget: using Little's Law (L = λ × W), Google processes 19,800 concurrent queries at any moment (99,000 × 0.2 seconds). Each query fans out to thousands of servers in parallel to return results within the 200ms window.

Spelling Correction
"gogle" → "google" — statistical correction using query frequency logs
<5ms
Query Understanding
Identify entities, synonyms, intent. "apple" → fruit or tech company?
<10ms
Index Lookup (Fan-out)
Query broadcasts to thousands of index shards simultaneously. Each returns top-K candidate docs.
<50ms
Ranking
Apply 200+ signals (PageRank, TF-IDF, freshness, location, CTR) to candidates. ML re-ranks.
<100ms
Snippet Generation
Extract and highlight relevant portions of each document. Generate title, URL, description.
<30ms
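The spelling-correction stage above can be sketched as a frequency-based corrector: generate every string one edit away from the query term, keep those seen in the query logs, and pick the most frequent. The query-frequency table here is made up; the real system mines billions of logged queries.

PYTHON · spell_correct.py

```python
import string

# Hypothetical query-frequency log (real system: mined from query logs)
QUERY_FREQ = {"google": 1_000_000, "goggle": 5_000, "gaggle": 2_000}

def edits1(word):
    """All strings one edit (delete/transpose/replace/insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the known candidate with the highest logged frequency."""
    if word in QUERY_FREQ:
        return word
    candidates = edits1(word) & QUERY_FREQ.keys()
    return max(candidates, key=QUERY_FREQ.get) if candidates else word

print(correct("gogle"))  # "google" — beats "goggle" on query frequency
```

"gogle" is one insert away from both "google" and "goggle"; frequency breaks the tie, which is why the stage needs the logs, not just an edit-distance model.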

Serving Architecture Breakdown

Component     | What it does                                       | Latency Budget | Scaling Approach
Spell check   | "gogle" → "google" using n-gram frequency model    | <5ms           | In-memory trie / pre-computed corrections
Query parser  | Identify entities, synonyms, intent classification | <10ms          | Stateless — horizontally scalable
Index lookup  | Inverted index lookup → candidate documents        | <50ms          | Fan-out to 1000s of index shards in parallel
Ranking       | Score candidates with 200+ signals + ML model      | <100ms         | Dedicated ranking servers with GPU/TPU
Snippet gen   | Extract relevant text from matching documents      | <30ms          | Forward index lookup + highlighting
Cache         | Top-1M queries cached — ~80% hit rate globally     | <5ms           | Distributed Redis / Memcached tier
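The index-lookup fan-out is a scatter-gather: broadcast the query to every shard, let each return its own top-k, then merge into a global top-k at the serving layer. A minimal sketch with two hypothetical in-memory shards standing in for parallel RPCs:

PYTHON · scatter_gather.py

```python
import heapq

def serve_query(shards, query, k=3):
    """Scatter-gather: fan query out to all shards, merge their top-k lists.

    Each shard returns (doc_id, score) pairs sorted by descending score;
    heapq.merge keeps that order while merging, so slicing gives global top-k.
    """
    per_shard = [shard(query, k) for shard in shards]        # scatter
    merged = heapq.merge(*per_shard, key=lambda ds: -ds[1])  # gather
    return list(merged)[:k]

# Two hypothetical shards, each covering a slice of the document space
def shard_a(query, k):
    return [("doc3", 0.9), ("doc1", 0.4)][:k]

def shard_b(query, k):
    return [("doc7", 0.8), ("doc5", 0.6)][:k]

print(serve_query([shard_a, shard_b], "example", k=3))
# [('doc3', 0.9), ('doc7', 0.8), ('doc5', 0.6)]
```

In production the scatter is thousands of parallel RPCs and the merge must tolerate stragglers (return after a deadline with whatever shards have answered), which is part of how the <50ms budget is met.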
💡
Little's Law in action: L (concurrent queries) = λ (queries/sec) × W (avg time per query) = 99,000 × 0.2s = 19,800. This means at steady state, Google has ~19,800 queries simultaneously in-flight across its serving stack. This directly drives infrastructure sizing — you need enough servers to handle 19,800 parallel queries without queuing.
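The arithmetic is worth making concrete, since it drives server counts directly. A minimal calculation (the per-server concurrency figure is a made-up illustration, not a Google number):

PYTHON · littles_law.py

```python
def concurrent_queries(arrival_rate_qps, avg_latency_s):
    """Little's Law: L = λ × W, in-flight requests at steady state."""
    return arrival_rate_qps * avg_latency_s

# 8.5B searches/day ≈ 99,000 qps, 200ms average latency
in_flight = concurrent_queries(99_000, 0.2)
print(in_flight)  # 19800.0

# Hypothetical sizing: if one server handles 50 concurrent queries,
# the serving tier needs at least in_flight / 50 servers (before headroom)
print(in_flight / 50)  # 396.0
```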

Technology Decisions

Index Storage
Custom SSTable Format
Not Elasticsearch or Solr. Google's index is a custom SSTable-like binary format optimized for their specific access patterns. Off-the-shelf search engines can't handle 20B+ pages efficiently.
Crawler Storage
Bigtable + Bloom Filter
Crawl frontier (URLs to visit) in a priority queue. Visited URL tracking uses a Bloom filter — zero false negatives (never miss a URL), occasional false positive means a rare harmless re-crawl.
Result Caching
Distributed Cache (80% hit)
Top-1M queries account for ~80% of traffic. Cache these results globally. Cache invalidated when pages are re-crawled and scores change significantly. Stale results for seconds are acceptable.
Infrastructure
Colossus + Bigtable
Colossus (Google's distributed filesystem, successor to GFS) stores index shards. Bigtable stores document metadata and crawl state. Custom hardware (TPUs) for ML ranking inference.
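The Bloom filter behind visited-URL tracking is simple enough to sketch. This toy version derives k bit positions from one SHA-256 digest via double hashing; the sizes are arbitrary illustrations, but the guarantee is the real one — no false negatives, occasional false positives (a rare harmless skipped URL).

PYTHON · bloom_filter.py

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: membership test with no false negatives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from one SHA-256 digest
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/a")
print("https://example.com/a" in seen)  # True — an added URL is never missed
print("https://example.com/b" in seen)  # almost certainly False
```

The trade-off matches the crawler's needs exactly: a false positive only means one URL is wrongly treated as visited, while the bit array stays tiny compared to storing billions of URLs verbatim.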

Quiz

Question 1 of 5
An inverted index maps:
A. Document IDs to their full text content and URLs
B. Each word to the sorted list of documents containing that word (a postings list)
C. User queries to their historically most common result sets
D. Web pages to their PageRank score and authority metrics
Question 2 of 5
PageRank's damping factor (0.85) models:
A. The rate at which the web crawler downloads new pages per second
B. The probability that a random web surfer follows a link (0.85) vs jumps to a completely random page (0.15)
C. The freshness decay rate of indexed web pages over time
D. The compression ratio applied to the inverted index on disk
Question 3 of 5
Google serves 99,000 queries per second. Each query takes 200ms to complete. Using Little's Law (L = λ × W), how many concurrent queries are being processed at any moment?
A. 99,000 — each query occupies one server for its full duration
B. 19,800 — from L = 99,000 qps × 0.2 seconds
C. 495,000 — accounting for parallel shard fan-out per query
D. 1,980 — using a 10% utilization safety factor
Question 4 of 5
A web page links to 10 other pages. In PageRank, how does it distribute its authority to those pages?
A. It gives its full PageRank value to each of the 10 linked pages
B. It divides its PageRank equally — each linked page receives PR(source) / 10
C. It contributes nothing — only inbound links to the source page matter
D. It multiplies its PageRank by 10 and shares the total across linked pages
Question 5 of 5
Google's web crawler must respect robots.txt. If a site's robots.txt says "Disallow: /private/", the crawler should:
A. Crawl /private/ anyway since it's on the public internet and technically accessible
B. Skip all pages on that entire domain as a precaution
C. Respect the directive and skip /private/ and all subpaths — this is a standard, voluntary but universally honored convention
D. Crawl it once for discovery, then stop re-visiting after the first crawl