System Design Interview: Design a Web Search Engine
Designing a web search engine is one of the most technically rich system design questions. It requires understanding web crawling, large-scale indexing, ranking algorithms, and low-latency query serving. This guide covers the architecture used by systems like Google and Bing, distilled into a framework you can present confidently in an interview.
Clarifying Questions
- Scale: how many web pages to index? (Assume 50 billion pages)
- Query volume: how many searches per second? (Assume 100K QPS)
- Freshness: how quickly should new content appear in results? (Minutes for news, days for general content)
- Features: full-text search only, or also images, videos, knowledge panels?
Core Components
1. Web Crawler
The crawler discovers and downloads web pages. Key design decisions:
- Seed URLs: start from a curated list of high-authority domains
- URL Frontier: priority queue of URLs to crawl, prioritized by PageRank estimate and freshness
- Politeness: respect robots.txt, limit crawl rate per domain (1 req/sec default)
- Distributed crawlers: partition URL space by domain hash across thousands of crawler nodes
- Deduplication: use SimHash to detect near-duplicate pages and skip them
At 50B pages averaging 100KB each, raw HTML alone totals ~5 PB. Crawlers must also revisit pages; revisit frequency should be proportional to each page's observed change rate (detected via ETag/Last-Modified headers).
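The SimHash deduplication step mentioned above can be sketched in a few lines. This is a minimal illustration (the 64-bit token hash via MD5 and the threshold of 3 bits are illustrative choices, not a production configuration):

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit SimHash: near-duplicate token streams yield nearby fingerprints."""
    v = [0] * bits
    for tok in tokens:
        # Hash each token to a 64-bit integer
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Fingerprint bit i is set where the weighted vote is positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

def is_near_duplicate(fp_a, fp_b, threshold=3):
    """Pages are near-duplicates if fingerprints differ in <= threshold bits."""
    return bin(fp_a ^ fp_b).count("1") <= threshold
```

The crawler would store fingerprints of crawled pages and skip any new page whose fingerprint is within the Hamming-distance threshold of an existing one.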
2. Document Processor
After downloading, each page is processed:
- HTML parsing: extract text, links, metadata (title, description, canonical URL)
- Link extraction: add discovered URLs to the URL frontier
- Language detection: route to language-specific indexes
- Content extraction: remove boilerplate (navigation, ads) to isolate main content
- Indexing pipeline: tokenize text, stem words, compute term frequencies
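The last step of the pipeline — tokenize, stem, count term frequencies — can be sketched as follows. The stemmer here is a toy suffix-stripper and the stopword list is illustrative; a real pipeline would use something like Porter/Snowball stemming and a full stopword set:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def stem(token):
    """Toy suffix-stripping stemmer (stand-in for Porter/Snowball)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_frequencies(text):
    """Map each stemmed term to its frequency in the document."""
    return Counter(stem(t) for t in tokenize(text))
```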
3. Inverted Index
The inverted index maps each term to the list of documents containing it (a “posting list”):
"distributed" -> [(doc_12, tf=3, positions=[45,102,890]), (doc_47, tf=1, positions=[23]), ...]
"systems" -> [(doc_12, tf=7, positions=[...]), ...]
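A minimal in-memory builder for the posting-list structure shown above (field layout matches the example; a real index would compress postings and write them to immutable files):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: [token, ...]} -> {term: [(doc_id, tf, positions), ...]}."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            # tf is simply the number of recorded positions
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)
```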
Storage: the index at Google scale is ~100+ PB. It is sharded by document ID range and replicated across datacenters. Each shard is stored as an immutable file (similar to LSM-tree SSTables) updated via periodic merges.
Index tiers: a three-tier architecture balances freshness vs. cost:
- Realtime index: last 24-48 hours of crawled content, kept in memory, updated continuously
- Recent index: last 30 days, on fast SSDs
- Full index: all 50B pages, on HDD/object storage, rebuilt weekly
4. Ranking System
Given a query, millions of documents may match. Ranking determines the order. The process has two stages:
Retrieval (recall): fetch the top-1000 candidate documents from the inverted index using BM25 (a TF-IDF-family function with term-frequency saturation and document-length normalization). BM25 score for a document D given query Q:
score(D,Q) = Σ IDF(qi) × (tf(qi,D) × (k1+1)) / (tf(qi,D) + k1×(1 - b + b×|D|/avgdl))
Where k1≈1.5, b≈0.75 are tuning constants, |D| is document length, avgdl is average document length.
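The formula above translates directly into code. A minimal sketch, taking precomputed corpus statistics as arguments (one common IDF variant is used here; implementations differ slightly in how they define IDF):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, N, df, k1=1.5, b=0.75):
    """BM25 score of one document for a query.
    doc_tf: term -> frequency in this document
    N: total number of documents; df: term -> document frequency."""
    score = 0.0
    for q in query_terms:
        tf = doc_tf.get(q, 0)
        if tf == 0:
            continue
        # IDF: rare terms contribute more
        idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
        # TF with saturation (k1) and document-length normalization (b)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))
        score += idf * norm
    return score
```

Note the saturation behavior: doubling a term's frequency does not double the score, which prevents keyword stuffing from dominating.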
Ranking (precision): rerank the 1000 candidates using a Learning to Rank model with hundreds of signals:
- Link signals: PageRank (recursive authority from inbound links), domain authority
- Content signals: query term density, title match, URL match, freshness
- User signals: CTR from past queries, dwell time, bounce rate
- Quality signals: HTTPS, mobile-friendly, page speed, Core Web Vitals
Google uses a neural ranker (MUM/BERT-based) to understand query intent beyond keyword matching.
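The second-stage reranker can be sketched as a linear model over the signals listed above. The feature names and weights here are illustrative; production systems learn these with gradient-boosted trees or neural Learning-to-Rank models trained on click and relevance data:

```python
def rerank(candidates, weights):
    """candidates: [(doc_id, {feature: value}), ...] from the retrieval stage.
    Scores each candidate with a linear model and sorts descending."""
    def score(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)
```

Usage: with weights `{"bm25": 1.0, "pagerank": 2.0}`, a candidate with a strong link profile can outrank one with a slightly better text match — exactly the precision-stage behavior described above.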
5. Query Serving
For 100K QPS with <200ms p99 latency:
- Query understanding: spell correction, query expansion, entity recognition, intent classification (navigational / informational / transactional)
- Fan-out: query dispatcher sends the query to all index shards in parallel; each shard returns its top-K results; dispatcher merges and re-ranks globally
- Caching: popular queries (top 5% cover ~50% of traffic) are cached in Redis with a short TTL (minutes for news-sensitive queries, hours for stable queries)
- Snippets: extract the most relevant sentence around query terms from the document for the result description
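The fan-out merge step can be sketched with a streaming k-way merge. This assumes each shard has already sorted its own top-K by descending score:

```python
import heapq
import itertools

def merge_shard_results(shard_results, k=10):
    """shard_results: one list of (score, doc_id) per shard, each sorted
    descending by score. heapq.merge streams a global descending order
    across all shards; we take the top k."""
    merged = heapq.merge(*shard_results, key=lambda r: r[0], reverse=True)
    return list(itertools.islice(merged, k))
```

In practice the dispatcher would also apply global re-ranking and deduplication at this point, but the merge itself is the latency-critical core of scatter-gather serving.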
6. PageRank Algorithm
PageRank models the probability that a random web surfer ends up on a page by following links. Formula:
PR(A) = (1-d)/N + d × Σ PR(Bi)/L(Bi)
Where d≈0.85 (damping factor), N is total pages, Bi are pages linking to A, L(Bi) is the number of outbound links from Bi. Computed iteratively via MapReduce until convergence (~50 iterations for the web graph). At 50B pages this is a massive distributed computation run periodically (weekly) and stored per-document.
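The iterative computation can be sketched in plain Python for a small graph (the real computation is distributed across the 50B-page link graph; here every page must appear as a dict key, with an empty list if it has no outlinks):

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [outbound pages]}. Returns {page: rank}; ranks sum to 1."""
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        # (1-d)/N base probability of jumping to any page at random
        new = {p: (1 - d) / n for p in links}
        for page, outs in links.items():
            if outs:
                share = d * pr[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # Dangling page: spread its rank uniformly over all pages
                for target in links:
                    new[target] += d * pr[page] / n
        pr = new
    return pr
```

Each iteration preserves total rank mass, so the scores remain a probability distribution as they converge.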
Data Flow Summary
Web → Crawler → Document Processor → Inverted Index Builder → Index Shards
User Query → Query Understanding → Index Shards (parallel) → Merge & Rank → Results
Scalability and Reliability
- Crawl at scale: 50B pages ÷ 30-day crawl cycle = ~20K pages/second. Achieved with thousands of distributed crawler nodes
- Fault tolerance: each index shard is replicated 3x. If a shard replica fails, another serves the query; results are slightly degraded but not unavailable
- Datacenter redundancy: queries are routed to the nearest datacenter; if unavailable, traffic fails over to another region
- Index consistency: eventual consistency is acceptable — a page not appearing for a few hours is tolerable; serving stale results is better than downtime
Interview Tips
- Start with the data pipeline: crawl → process → index → rank → serve
- Know BM25 at a high level — you don't need the exact formula, but mention TF-IDF and document-length normalization
- Explain the two-stage retrieval+ranking pattern — interviewers appreciate knowing you understand recall vs. precision
- Discuss the index tier architecture (realtime / recent / full) to show you understand freshness-cost trade-offs
- If asked about anti-spam, mention link farm detection, trust scores for domains, and manual quality rater programs