System Design Interview: Design a Web Search Engine

Designing a web search engine is one of the most technically rich system design questions. It requires understanding web crawling, large-scale indexing, ranking algorithms, and low-latency query serving. This guide covers the architecture used by systems like Google and Bing, distilled into a framework you can present confidently in an interview.

Clarifying Questions

  • Scale: how many web pages to index? (Assume 50 billion pages)
  • Query volume: how many searches per second? (Assume 100K QPS)
  • Freshness: how quickly should new content appear in results? (Minutes for news, days for general content)
  • Features: full-text search only, or also images, videos, knowledge panels?

Core Components

1. Web Crawler

The crawler discovers and downloads web pages. Key design decisions:

  • Seed URLs: start from a curated list of high-authority domains
  • URL Frontier: priority queue of URLs to crawl, prioritized by PageRank estimate and freshness (a minimal frontier with per-domain politeness is sketched at the end of this subsection)
  • Politeness: respect robots.txt, limit crawl rate per domain (1 req/sec default)
  • Distributed crawlers: partition URL space by domain hash across thousands of crawler nodes
  • Deduplication: use SimHash to detect near-duplicate pages and skip them

At 50B pages with an average of 100 KB per page, the raw HTML comes to about 5 PB. Crawlers must also revisit pages; revisit frequency is proportional to each page's observed change rate (detected via ETag/Last-Modified headers).
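
Below is a minimal, single-process sketch of the URL frontier with per-domain politeness. It assumes a heap keyed by priority and a fixed 1 req/sec per-domain delay; the Frontier class and its methods are illustrative, not taken from any real crawler.

import heapq
import time
from urllib.parse import urlparse

class Frontier:
    """Priority queue of URLs with a per-domain politeness delay."""

    def __init__(self, per_domain_delay=1.0):
        self.heap = []          # (priority, url); lower value = crawl sooner
        self.next_allowed = {}  # domain -> earliest time we may fetch it again
        self.delay = per_domain_delay

    def push(self, url, priority):
        # In a real crawler, priority combines a PageRank estimate and freshness.
        heapq.heappush(self.heap, (priority, url))

    def pop(self):
        # Return the best URL whose domain is not currently rate-limited.
        deferred, url_out = [], None
        while self.heap:
            priority, url = heapq.heappop(self.heap)
            domain = urlparse(url).netloc
            if time.time() >= self.next_allowed.get(domain, 0.0):
                self.next_allowed[domain] = time.time() + self.delay
                url_out = url
                break
            deferred.append((priority, url))
        for item in deferred:   # put rate-limited URLs back on the heap
            heapq.heappush(self.heap, item)
        return url_out          # None if every queued domain is rate-limited

frontier = Frontier()
frontier.push("https://example.com/a", priority=0.1)
frontier.push("https://example.com/b", priority=0.2)
print(frontier.pop())  # https://example.com/a
print(frontier.pop())  # None: example.com is rate-limited for 1 second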

2. Document Processor

After downloading, each page is processed:

  • HTML parsing: extract text, links, metadata (title, description, canonical URL)
  • Link extraction: add discovered URLs to the URL frontier
  • Language detection: route to language-specific indexes
  • Content extraction: remove boilerplate (navigation, ads) to isolate main content
  • Indexing pipeline: tokenize text, stem words, compute term frequencies (see the sketch below)
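
A toy version of this step, assuming regex tokenization and a deliberately naive suffix-stripping stemmer; a real pipeline would use a proper stemmer (e.g. Porter) and language-aware tokenization:

import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def stem(token):
    # Toy stemmer: strip a few common English suffixes (illustration only).
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_frequencies(text):
    return Counter(stem(t) for t in tokenize(text))

print(term_frequencies("Distributed systems: designing distributed indexes"))
# Counter({'distribut': 2, 'system': 1, 'design': 1, 'index': 1})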

3. Inverted Index

The inverted index maps each term to the list of documents containing it (a “posting list”):

"distributed" -> [(doc_12, tf=3, positions=[45,102,890]), (doc_47, tf=1, positions=[23]), ...]
"systems"     -> [(doc_12, tf=7, positions=[...]), ...]

Storage: the index at Google scale is on the order of 100 PB or more. It is sharded by document ID range and replicated across datacenters. Each shard is stored as an immutable file (similar to LSM-tree SSTables) and updated via periodic merges.

Index tiers: a three-tier architecture balances freshness vs. cost (a merge sketch follows the list):

  • Realtime index: last 24-48 hours of crawled content, kept in memory, updated continuously
  • Recent index: last 30 days, on fast SSDs
  • Full index: all 50B pages, on HDD/object storage, rebuilt weekly
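
One way the serving layer might combine hits across these tiers, assuming each tier returns (doc_id, score, crawl_time) tuples and that a fresher crawl of a document supersedes an older one; the tuple shape and merge policy are assumptions for illustration:

def merge_tiers(realtime, recent, full, k=10):
    """Merge per-tier hit lists; the freshest copy of each document wins."""
    best = {}  # doc_id -> (score, crawl_time)
    for tier in (realtime, recent, full):
        for doc_id, score, crawl_time in tier:
            if doc_id not in best or crawl_time > best[doc_id][1]:
                best[doc_id] = (score, crawl_time)
    ranked = sorted(best.items(), key=lambda item: item[1][0], reverse=True)
    return [(doc_id, score) for doc_id, (score, _) in ranked[:k]]

print(merge_tiers(
    realtime=[("doc_12", 8.0, 1_700_000_000)],
    recent=[("doc_12", 7.1, 1_690_000_000), ("doc_47", 6.5, 1_690_000_001)],
    full=[("doc_99", 5.0, 1_680_000_000)],
))
# [('doc_12', 8.0), ('doc_47', 6.5), ('doc_99', 5.0)]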

4. Ranking System

Given a query, thousands of documents match. Ranking determines the order. The process has two stages:

Retrieval (recall): fetch the top-1000 candidate documents from the inverted index using BM25, a TF-IDF-style scoring function with document-length normalization. The BM25 score for a document D given query Q:

score(D,Q) = Σ IDF(qi) × (tf(qi,D) × (k1+1)) / (tf(qi,D) + k1×(1 - b + b×|D|/avgdl))

Where k1≈1.5, b≈0.75 are tuning constants, |D| is document length, avgdl is average document length.
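
The formula transcribed directly into Python, assuming precomputed per-term document frequencies; the smoothed IDF below is one common variant:

import math

def idf(term_df, n_docs):
    # Smoothed BM25 IDF; term_df = number of documents containing the term.
    return math.log((n_docs - term_df + 0.5) / (term_df + 0.5) + 1)

def bm25(query_terms, doc_tf, doc_len, avgdl, df, n_docs, k1=1.5, b=0.75):
    """doc_tf: {term: frequency in this doc}; df: {term: document frequency}."""
    score = 0.0
    for q in query_terms:
        tf = doc_tf.get(q, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        norm = k1 * (1 - b + b * doc_len / avgdl)
        score += idf(df[q], n_docs) * (tf * (k1 + 1)) / (tf + norm)
    return score

print(bm25(["distributed", "systems"],
           doc_tf={"distributed": 3, "systems": 7}, doc_len=1200, avgdl=900,
           df={"distributed": 2_000_000, "systems": 5_000_000},
           n_docs=50_000_000_000))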

Ranking (precision): rerank the 1000 candidates using a learning-to-rank (LTR) model with hundreds of signals (a toy scorer is sketched at the end of this subsection):

  • Link signals: PageRank (recursive authority from inbound links), domain authority
  • Content signals: query term density, title match, URL match, freshness
  • User signals: CTR from past queries, dwell time, bounce rate
  • Quality signals: HTTPS, mobile-friendly, page speed, Core Web Vitals

Google uses a neural ranker (MUM/BERT-based) to understand query intent beyond keyword matching.
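
As a stand-in for a trained model, here is the second stage reduced to a hand-weighted linear scorer. The features and weights are invented for illustration; a production ranker learns them, e.g. with gradient-boosted trees or a neural LTR model:

def rerank(candidates, weights):
    """candidates: [(doc_id, {feature: value})]; returns doc_ids, best first."""
    def score(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    ordered = sorted(candidates, key=lambda cand: score(cand[1]), reverse=True)
    return [doc_id for doc_id, _ in ordered]

print(rerank(
    [("doc_12", {"bm25": 24.1, "pagerank": 0.8, "ctr": 0.3}),
     ("doc_47", {"bm25": 19.5, "pagerank": 2.1, "ctr": 0.6})],
    weights={"bm25": 1.0, "pagerank": 5.0, "ctr": 10.0},
))
# ['doc_47', 'doc_12']: stronger link and click signals outweigh the lower BM25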

5. Query Serving

For 100K QPS with <200ms p99 latency:

  • Query understanding: spell correction, query expansion, entity recognition, intent classification (navigational / informational / transactional)
  • Fan-out: query dispatcher sends the query to all index shards in parallel; each shard returns its top-K results; the dispatcher merges and re-ranks globally (sketched after this list)
  • Caching: popular queries (top 5% cover ~50% of traffic) are cached in Redis with a short TTL (minutes for news-sensitive queries, hours for stable queries)
  • Snippets: extract the most relevant sentence around query terms from the document for the result description
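
A sketch of the scatter-gather step, assuming each shard exposes a search(query, k) method returning (score, doc_id) pairs; the shard interface is hypothetical, with an in-memory stand-in for the demo:

import heapq
from concurrent.futures import ThreadPoolExecutor

class FakeShard:
    """In-memory stand-in for a remote index shard (returns canned hits)."""
    def __init__(self, hits):
        self.hits = hits  # list of (score, doc_id) pairs
    def search(self, query, k):
        return sorted(self.hits, reverse=True)[:k]

def fan_out(shards, query, k=10):
    # Query every shard in parallel, then merge the per-shard top-k globally.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = pool.map(lambda shard: shard.search(query, k), shards)
        hits = [hit for shard_hits in per_shard for hit in shard_hits]
    return heapq.nlargest(k, hits)  # highest scores across all shards

shards = [FakeShard([(9.1, "doc_12"), (3.2, "doc_88")]),
          FakeShard([(7.5, "doc_47")])]
print(fan_out(shards, "distributed systems", k=2))
# [(9.1, 'doc_12'), (7.5, 'doc_47')]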

6. PageRank Algorithm

PageRank models the probability that a random web surfer ends up on a page by following links. Formula:

PR(A) = (1-d)/N + d × Σ PR(Bi)/L(Bi)

Where d≈0.85 (damping factor), N is total pages, Bi are pages linking to A, L(Bi) is the number of outbound links from Bi. Computed iteratively via MapReduce until convergence (~50 iterations for the web graph). At 50B pages this is a massive distributed computation run periodically (weekly) and stored per-document.
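
The same iteration on a toy graph. At web scale this runs as a distributed batch job, and dangling pages need extra handling that is omitted here:

def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}; returns {page: PageRank score}."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - d) / n for p in pages}  # the (1-d)/N teleport term
        for page, outs in links.items():
            for target in outs:                # page splits its PR across its links
                nxt[target] += d * pr[page] / len(outs)
        pr = nxt
    return pr

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
# C scores highest: it collects links from both A and B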

Data Flow Summary

Web → Crawler → Document Processor → Inverted Index Builder
                                              ↓
User Query → Query Understanding → Index Shards (parallel) → Merge & Rank → Results

Scalability and Reliability

  • Crawl at scale: 50B pages ÷ 30-day crawl cycle = ~20K pages/second. Achieved with thousands of distributed crawler nodes
  • Fault tolerance: each index shard is replicated 3x. If a shard replica fails, another serves the query; results are slightly degraded but not unavailable
  • Datacenter redundancy: queries are routed to the nearest datacenter; if unavailable, traffic fails over to another region
  • Index consistency: eventual consistency is acceptable — a page not appearing for a few hours is tolerable; serving stale results is better than downtime

Interview Tips

  • Start with the data pipeline: crawl → process → index → rank → serve
  • Know BM25 at a high level — you don't need the exact formula but mention TF-IDF and document length normalization
  • Explain the two-stage retrieval+ranking pattern — interviewers appreciate knowing you understand recall vs. precision
  • Discuss the index tier architecture (realtime / recent / full) to show you understand freshness-cost trade-offs
  • If asked about anti-spam, mention link farm detection, trust scores for domains, and manual quality rater programs
