Day 24 Exercises
Search Engine Design
Apply inverted index construction, ranking algorithms, and web crawling architecture to search engine design.
Inverted Index Construction
You have 5 web pages and need to build an inverted index. Doc 1: "cat and dog", Doc 2: "dog and fish", Doc 3: "cat fish food".
Tasks
- Build the complete inverted index (term β list of doc IDs).
- What is a "posting list" and what additional information does it store beyond doc IDs?
- Run a boolean AND query for "cat AND fish" β which docs match?
- How do you compress posting lists when they contain millions of document IDs?
PageRank Intuition
The web has 4 pages: A links to B and C; B links to D; C links to D; D links to A. Design a simplified PageRank computation.
Tasks
- Describe what PageRank measures in plain English.
- Write one iteration of PageRank for all 4 pages (start with equal ranks 0.25).
- Why does PageRank use a "damping factor" of 0.85?
- How does link spam (buying links from low-quality sites) get detected and devalued?
Web Crawler Design
Design a distributed web crawler that can crawl 1 billion URLs. The crawler must be polite (obey robots.txt), deduplicate (not re-crawl the same URL), and handle failures.
Tasks
- Design the URL frontier β how do you prioritize which URLs to crawl first?
- How do you deduplicate 1B URLs efficiently? (Hint: Bloom filter)
- How do you enforce crawl politeness (max 1 request/second per domain)?
- What's a "crawl trap" and how do you detect/avoid it?
Sub-200ms Search Query Serving
Google serves 99,000 searches/second with p99 latency under 200ms. The inverted index is 100TB across 1,000 index servers. Design the query serving path.
Tasks
- How is the 100TB index sharded across 1,000 servers? Describe two strategies.
- A query fans out to how many index servers simultaneously? What pattern is this?
- How does the serving system handle one slow index server (stragglers)?
- Design the result merging and ranking step after collecting partial results.