Day 24 · Week 4 — Real-World Systems

Design Google Search

Web crawling, inverted index construction, PageRank, query serving at 8.5 billion searches per day, and real-time index updates at global scale.

5 Stages
~4h Study
2 Simulations
5 Quizzes

Three Massive Systems

🕷️
Crawler
Discovers and downloads the web. 20+ billion pages indexed. Breadth-first traversal from seed URLs. Bloom filter tracks visited URLs. Respects robots.txt and per-domain crawl-delay politeness.
📑
Indexer
Builds inverted index via MapReduce: Map emits (term, doc_id) pairs; Reduce merges into sorted posting lists. Index sharded by document range across thousands of servers.
🏆
Ranker
PageRank (link graph authority) + TF-IDF (term relevance) + 200+ signals. ML learning-to-rank (LambdaMART) re-ranks top-1000 candidates. Updated continuously without full re-index.
⚡
Query Server
Returns results in <200ms. Serves 99,000 queries/sec globally. Query fans out to all index shards simultaneously. Serving layer merges top-K from each shard and re-ranks.
⚠️
Scale that's hard to grasp: 8.5 billion searches/day = 99,000 queries/second. Each query must complete in 200ms and touches thousands of servers in parallel. The index covers 20+ billion pages. The crawler downloads billions of pages per day. Nothing about this is off-the-shelf.
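The crawler's breadth-first traversal can be sketched in a few lines. This is a toy model with a hypothetical in-memory link graph in place of real HTTP fetches; a production crawler gates each fetch on robots.txt and per-domain crawl-delay, and tracks visited URLs in a Bloom filter rather than an exact set.

PYTHON · crawler.py

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl from seed URLs.

    fetch_links(url) -> list of outbound URLs. In production this is an
    HTTP fetch gated by robots.txt and crawl-delay politeness, and the
    visited set is a Bloom filter (see Technology Decisions below).
    """
    frontier = deque(seed_urls)   # crawl frontier (FIFO = breadth-first)
    visited = set(seed_urls)      # exact set stands in for the Bloom filter
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:   # Bloom-filter check in the real system
                visited.add(link)
                frontier.append(link)
    return order

# Hypothetical toy web as an in-memory link graph
web = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": ["a.com"],
    "d.com": [],
}
print(crawl(["a.com"], lambda u: web.get(u, [])))
# BFS order: ['a.com', 'b.com', 'c.com', 'd.com']
```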

Inverted Index Builder

ℹ️
How it works: Edit the documents below and click "Build Index" to see how words map to documents. Then search for a term to see real-time document retrieval using the inverted index you just built.

Document Corpus

Doc 1 — Technology
Doc 2 — Databases
Doc 3 — Infrastructure
Doc 4 — Search

Inverted Index

Click "Build Index" to see the term → document postings list

Search Documents
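The index the widget builds follows the Map/Reduce shape described earlier: Map emits (term, doc_id) pairs, Reduce merges them into sorted posting lists, and an AND query intersects the posting lists of its terms. A minimal single-machine sketch, using made-up documents:

PYTHON · inverted_index.py

```python
from collections import defaultdict

def map_phase(docs):
    """Map: emit (term, doc_id) pairs, one per distinct term per document."""
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):  # dedupe terms within a doc
            yield term, doc_id

def reduce_phase(pairs):
    """Reduce: merge pairs into a sorted posting list per term."""
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):  # sort groups terms, orders doc_ids
        index[term].append(doc_id)
    return dict(index)

def search(index, query):
    """AND query: intersect the posting lists of each query term."""
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "distributed systems scale horizontally",
    2: "databases index data for fast lookup",
    3: "search engines index the distributed web",
}
index = reduce_phase(map_phase(docs))
print(index["index"])                      # [2, 3]
print(search(index, "distributed index"))  # [3]
```

In the real system the Reduce output is sharded by document range across thousands of servers; here everything fits in one dictionary.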

PageRank Link Graph Visualization

Node size = PageRank score. Arrows = links. High-authority pages (linked by many others) grow larger. Click "Run PageRank" to animate the iterative calculation.

PageRank Explained

PR(A) = (1 - d) / N + d × Σ ( PR(Bᵢ) / L(Bᵢ) )
d = 0.85 (damping factor)
N = total pages
PR(Bᵢ) = PageRank of linking page
L(Bᵢ) = outbound links from Bᵢ
ℹ️
Google's original breakthrough: PageRank was Larry Page's PhD thesis insight. Before it, search engines ranked pages by keyword frequency alone — easily gamed. PageRank models the web as a random walk: a "random surfer" clicks links with probability d=0.85 or jumps to a random page with probability 0.15. Pages linked by high-PR pages (like Wikipedia) accumulate authority and rank higher.
💡
Why the damping factor (0.85) matters: Without damping, "rank sinks" (pages with no outbound links) would absorb all PageRank. With d=0.85, every page leaks 15% of its rank back to the random distribution, preventing rank accumulation in dead ends. The formula converges after ~50–100 iterations.
PYTHON · pagerank.py
def pagerank(graph: dict, d=0.85, iterations=100) -> dict:
    """
    graph = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A']}
    Returns (approximately): {'A': 0.388, 'B': 0.215, 'C': 0.397}
    """
    N = len(graph)
    pr = {page: 1/N for page in graph}  # init equal

    for _ in range(iterations):
        new_pr = {}
        for page in graph:
            # Sum contributions from all pages that link to this page
            rank_sum = sum(
                pr[linker] / len(outlinks)  # PR divided by outbound links
                for linker, outlinks in graph.items()
                if page in outlinks
            )
            new_pr[page] = (1 - d) / N + d * rank_sum
        pr = new_pr

    return pr  # Converges in ~50-100 iterations

# Wikipedia links to 10,000 pages: each page gets PR(Wikipedia)/10,000
# One link from Wikipedia >> 10,000 links from spam blogs

Query Serving at Scale

99,000 queries/second × 200ms budget: using Little's Law (L = λ × W), Google processes 19,800 concurrent queries at any moment (99,000 × 0.2 seconds). Each query fans out to thousands of servers in parallel to return results within the 200ms window.

Spelling Correction
"gogle" → "google" — statistical correction using query frequency logs
<5ms
Query Understanding
Identify entities, synonyms, intent. "apple" → fruit or tech company?
<10ms
Index Lookup (Fan-out)
Query broadcasts to thousands of index shards simultaneously. Each returns top-K candidate docs.
<50ms
Ranking
Apply 200+ signals (PageRank, TF-IDF, freshness, location, CTR) to candidates. ML re-ranks.
<100ms
Snippet Generation
Extract and highlight relevant portions of each document. Generate title, URL, description.
<30ms
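The spelling-correction stage above can be sketched as a frequency-based corrector: generate every string one edit away from the query term, keep those seen in the query logs, and pick the most frequent. The query-frequency table here is made up; the real system mines billions of logged queries.

PYTHON · spell_correct.py

```python
import string

# Hypothetical query-frequency log (real system: mined from query logs)
QUERY_FREQ = {"google": 1_000_000, "goggle": 5_000, "gaggle": 2_000}

def edits1(word):
    """All strings one edit (delete/transpose/replace/insert) away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the known candidate with the highest logged frequency."""
    if word in QUERY_FREQ:
        return word
    candidates = edits1(word) & QUERY_FREQ.keys()
    return max(candidates, key=QUERY_FREQ.get) if candidates else word

print(correct("gogle"))  # "google" — beats "goggle" on query frequency
```

"gogle" is one insert away from both "google" and "goggle"; frequency breaks the tie, which is why the stage needs the logs, not just an edit-distance model.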

Serving Architecture Breakdown

Component     | What it does                                       | Latency Budget | Scaling Approach
Spell check   | "gogle" → "google" using n-gram frequency model    | <5ms           | In-memory trie / pre-computed corrections
Query parser  | Identify entities, synonyms, intent classification | <10ms          | Stateless — horizontally scalable
Index lookup  | Inverted index lookup → candidate documents        | <50ms          | Fan-out to 1000s of index shards in parallel
Ranking       | Score candidates with 200+ signals + ML model      | <100ms         | Dedicated ranking servers with GPU/TPU
Snippet gen   | Extract relevant text from matching documents      | <30ms          | Forward index lookup + highlighting
Cache         | Top-1M queries cached — ~80% hit rate globally     | <5ms           | Distributed Redis / Memcached tier
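The index-lookup fan-out is a scatter-gather: broadcast the query to every shard, let each return its own top-k, then merge into a global top-k at the serving layer. A minimal sketch with two hypothetical in-memory shards standing in for parallel RPCs:

PYTHON · scatter_gather.py

```python
import heapq

def serve_query(shards, query, k=3):
    """Scatter-gather: fan query out to all shards, merge their top-k lists.

    Each shard returns (doc_id, score) pairs sorted by descending score;
    heapq.merge keeps that order while merging, so slicing gives global top-k.
    """
    per_shard = [shard(query, k) for shard in shards]        # scatter
    merged = heapq.merge(*per_shard, key=lambda ds: -ds[1])  # gather
    return list(merged)[:k]

# Two hypothetical shards, each covering a slice of the document space
def shard_a(query, k):
    return [("doc3", 0.9), ("doc1", 0.4)][:k]

def shard_b(query, k):
    return [("doc7", 0.8), ("doc5", 0.6)][:k]

print(serve_query([shard_a, shard_b], "example", k=3))
# [('doc3', 0.9), ('doc7', 0.8), ('doc5', 0.6)]
```

In production the scatter is thousands of parallel RPCs and the merge must tolerate stragglers (return after a deadline with whatever shards have answered), which is part of how the <50ms budget is met.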
💡
Little's Law in action: L (concurrent queries) = λ (queries/sec) × W (avg time per query) = 99,000 × 0.2s = 19,800. This means at steady state, Google has ~19,800 queries simultaneously in-flight across its serving stack. This directly drives infrastructure sizing — you need enough servers to handle 19,800 parallel queries without queuing.
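The arithmetic is worth making concrete, since it drives server counts directly. A minimal calculation (the per-server concurrency figure is a made-up illustration, not a Google number):

PYTHON · littles_law.py

```python
def concurrent_queries(arrival_rate_qps, avg_latency_s):
    """Little's Law: L = λ × W, in-flight requests at steady state."""
    return arrival_rate_qps * avg_latency_s

# 8.5B searches/day ≈ 99,000 qps, 200ms average latency
in_flight = concurrent_queries(99_000, 0.2)
print(in_flight)  # 19800.0

# Hypothetical sizing: if one server handles 50 concurrent queries,
# the serving tier needs at least in_flight / 50 servers (before headroom)
print(in_flight / 50)  # 396.0
```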

Technology Decisions

Index Storage
Custom SSTable Format
Not Elasticsearch or Solr. Google's index is a custom SSTable-like binary format optimized for their specific access patterns. Off-the-shelf search engines can't handle 20B+ pages efficiently.
Crawler Storage
Bigtable + Bloom Filter
Crawl frontier (URLs to visit) in a priority queue. Visited URL tracking uses a Bloom filter — zero false negatives (never miss a URL), occasional false positive means a rare harmless re-crawl.
Result Caching
Distributed Cache (80% hit)
Top-1M queries account for ~80% of traffic. Cache these results globally. Cache invalidated when pages are re-crawled and scores change significantly. Stale results for seconds are acceptable.
Infrastructure
Colossus + Bigtable
Colossus (Google's distributed filesystem, successor to GFS) stores index shards. Bigtable stores document metadata and crawl state. Custom hardware (TPUs) for ML ranking inference.
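The Bloom filter behind visited-URL tracking is simple enough to sketch. This toy version derives k bit positions from one SHA-256 digest via double hashing; the sizes are arbitrary illustrations, but the guarantee is the real one — no false negatives, occasional false positives (a rare harmless skipped URL).

PYTHON · bloom_filter.py

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: membership test with no false negatives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Double hashing: derive k positions from one SHA-256 digest
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("https://example.com/a")
print("https://example.com/a" in seen)  # True — an added URL is never missed
print("https://example.com/b" in seen)  # almost certainly False
```

The trade-off matches the crawler's needs exactly: a false positive only means one URL is wrongly treated as visited, while the bit array stays tiny compared to storing billions of URLs verbatim.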

Quiz

Question 1 of 5
An inverted index maps:
A. Document IDs to their full text content and URLs
B. Each word to the sorted list of documents containing that word (a postings list)
C. User queries to their historically most common result sets
D. Web pages to their PageRank score and authority metrics
Question 2 of 5
PageRank's damping factor (0.85) models:
A. The rate at which the web crawler downloads new pages per second
B. The probability that a random web surfer follows a link (0.85) vs jumps to a completely random page (0.15)
C. The freshness decay rate of indexed web pages over time
D. The compression ratio applied to the inverted index on disk
Question 3 of 5
Google serves 99,000 queries per second. Each query takes 200ms to complete. Using Little's Law (L = λ × W), how many concurrent queries are being processed at any moment?
A. 99,000 — each query occupies one server for its full duration
B. 19,800 — from L = 99,000 qps × 0.2 seconds
C. 495,000 — accounting for parallel shard fan-out per query
D. 1,980 — using a 10% utilization safety factor
Question 4 of 5
A web page links to 10 other pages. In PageRank, how does it distribute its authority to those pages?
A. It gives its full PageRank value to each of the 10 linked pages
B. It divides its PageRank equally — each linked page receives PR(source) / 10
C. It contributes nothing — only inbound links to the source page matter
D. It multiplies its PageRank by 10 and shares the total across linked pages
Question 5 of 5
Google's web crawler must respect robots.txt. If a site's robots.txt says "Disallow: /private/", the crawler should:
A. Crawl /private/ anyway since it's on the public internet and technically accessible
B. Skip all pages on that entire domain as a precaution
C. Respect the directive and skip /private/ and all subpaths — this is a standard, voluntary but universally honored convention
D. Crawl it once for discovery, then stop re-visiting after the first crawl