Day 24 Exercises

Search Engine Design

Apply inverted index construction, ranking algorithms, and web crawling architecture to search engine design.

Exercise 1🟑 Easy15 min
Inverted Index Construction
You have 5 web pages and need to build an inverted index. Doc 1: "cat and dog", Doc 2: "dog and fish", Doc 3: "cat fish food".

Tasks

  • Build the complete inverted index (term β†’ list of doc IDs).
  • What is a "posting list" and what additional information does it store beyond doc IDs?
  • Run a boolean AND query for "cat AND fish" β€” which docs match?
  • How do you compress posting lists when they contain millions of document IDs?
Your Notes
Exercise 2πŸ”΄ Medium20 min
PageRank Intuition
The web has 4 pages: A links to B and C; B links to D; C links to D; D links to A. Design a simplified PageRank computation.

Tasks

  • Describe what PageRank measures in plain English.
  • Write one iteration of PageRank for all 4 pages (start with equal ranks 0.25).
  • Why does PageRank use a "damping factor" of 0.85?
  • How does link spam (buying links from low-quality sites) get detected and devalued?
Your Notes
Exercise 3πŸ”΄ Medium25 min
Web Crawler Design
Design a distributed web crawler that can crawl 1 billion URLs. The crawler must be polite (obey robots.txt), deduplicate (not re-crawl the same URL), and handle failures.

Tasks

  • Design the URL frontier β€” how do you prioritize which URLs to crawl first?
  • How do you deduplicate 1B URLs efficiently? (Hint: Bloom filter)
  • How do you enforce crawl politeness (max 1 request/second per domain)?
  • What's a "crawl trap" and how do you detect/avoid it?
Your Notes
Exercise 4πŸ”₯ Hard35 min
Sub-200ms Search Query Serving
Google serves 99,000 searches/second with p99 latency under 200ms. The inverted index is 100TB across 1,000 index servers. Design the query serving path.

Tasks

  • How is the 100TB index sharded across 1,000 servers? Describe two strategies.
  • A query fans out to how many index servers simultaneously? What pattern is this?
  • How does the serving system handle one slow index server (stragglers)?
  • Design the result merging and ranking step after collecting partial results.
Your Notes