System Design Interview: Design a Web Search Engine
Designing a web search engine is one of the most technically rich system design questions. It requires understanding web crawling, large-scale indexing, ranking algorithms, and low-latency query serving. This guide covers the architecture used by systems like Google and Bing, distilled into a framework you can present confidently in an interview.
Clarifying Questions
- Scale: how many web pages to index? (Assume 50 billion pages)
- Query volume: how many searches per second? (Assume 100K QPS)
- Freshness: how quickly should new content appear in results? (Minutes for news, days for general content)
- Features: full-text search only, or also images, videos, knowledge panels?
Core Components
1. Web Crawler
The crawler discovers and downloads web pages. Key design decisions:
- Seed URLs: start from a curated list of high-authority domains
- URL Frontier: priority queue of URLs to crawl, prioritized by PageRank estimate and freshness
- Politeness: respect robots.txt, limit crawl rate per domain (1 req/sec default)
- Distributed crawlers: partition URL space by domain hash across thousands of crawler nodes
- Deduplication: use SimHash to detect near-duplicate pages and skip them
At 50B pages averaging 100KB each, raw HTML alone totals ~5 PB. Crawlers must also revisit pages; revisit frequency should be proportional to each page's observed change rate (detected via ETag/Last-Modified headers).
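The SimHash deduplication step mentioned above can be sketched in a few lines. This is a minimal illustration (the 64-bit token hash via MD5 and the threshold of 3 bits are illustrative choices, not a production configuration):

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit SimHash: near-duplicate token streams yield nearby fingerprints."""
    v = [0] * bits
    for tok in tokens:
        # Hash each token to a 64-bit integer
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    # Fingerprint bit i is set where the weighted vote is positive
    return sum(1 << i for i in range(bits) if v[i] > 0)

def is_near_duplicate(fp_a, fp_b, threshold=3):
    """Pages are near-duplicates if fingerprints differ in <= threshold bits."""
    return bin(fp_a ^ fp_b).count("1") <= threshold
```

The crawler would store fingerprints of crawled pages and skip any new page whose fingerprint is within the Hamming-distance threshold of an existing one.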
2. Document Processor
After downloading, each page is processed:
- HTML parsing: extract text, links, metadata (title, description, canonical URL)
- Link extraction: add discovered URLs to the URL frontier
- Language detection: route to language-specific indexes
- Content extraction: remove boilerplate (navigation, ads) to isolate main content
- Indexing pipeline: tokenize text, stem words, compute term frequencies
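The last step of the pipeline — tokenize, stem, count term frequencies — can be sketched as follows. The stemmer here is a toy suffix-stripper and the stopword list is illustrative; a real pipeline would use something like Porter/Snowball stemming and a full stopword set:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def stem(token):
    """Toy suffix-stripping stemmer (stand-in for Porter/Snowball)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def term_frequencies(text):
    """Map each stemmed term to its frequency in the document."""
    return Counter(stem(t) for t in tokenize(text))
```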
3. Inverted Index
The inverted index maps each term to the list of documents containing it (a “posting list”):
"distributed" -> [(doc_12, tf=3, positions=[45,102,890]), (doc_47, tf=1, positions=[23]), ...]
"systems" -> [(doc_12, tf=7, positions=[...]), ...]
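A minimal in-memory builder for the posting-list structure shown above (field layout matches the example; a real index would compress postings and write them to immutable files):

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: [token, ...]} -> {term: [(doc_id, tf, positions), ...]}."""
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(tokens):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            # tf is simply the number of recorded positions
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)
```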
Storage: the index at Google scale is ~100+ PB. It is sharded by document ID range and replicated across datacenters. Each shard is stored as an immutable file (similar to LSM-tree SSTables) updated via periodic merges.
Index tiers: a three-tier architecture balances freshness vs. cost:
- Realtime index: last 24-48 hours of crawled content, kept in memory, updated continuously
- Recent index: last 30 days, on fast SSDs
- Full index: all 50B pages, on HDD/object storage, rebuilt weekly
4. Ranking System
Given a query, millions of documents may match. Ranking determines the order. The process has two stages:
Retrieval (recall): fetch the top-1000 candidate documents from the inverted index using BM25 (a TF-IDF-family function with term-frequency saturation and document-length normalization). BM25 score for a document D given query Q:
score(D,Q) = Σ IDF(qi) × (tf(qi,D) × (k1+1)) / (tf(qi,D) + k1×(1 - b + b×|D|/avgdl))
Where k1≈1.5, b≈0.75 are tuning constants, |D| is document length, avgdl is average document length.
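The formula above translates directly into code. A minimal sketch, taking precomputed corpus statistics as arguments (one common IDF variant is used here; implementations differ slightly in how they define IDF):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, N, df, k1=1.5, b=0.75):
    """BM25 score of one document for a query.
    doc_tf: term -> frequency in this document
    N: total number of documents; df: term -> document frequency."""
    score = 0.0
    for q in query_terms:
        tf = doc_tf.get(q, 0)
        if tf == 0:
            continue
        # IDF: rare terms contribute more
        idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
        # TF with saturation (k1) and document-length normalization (b)
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avgdl))
        score += idf * norm
    return score
```

Note the saturation behavior: doubling a term's frequency does not double the score, which prevents keyword stuffing from dominating.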
Ranking (precision): rerank the 1000 candidates using a Learning to Rank model with hundreds of signals:
- Link signals: PageRank (recursive authority from inbound links), domain authority
- Content signals: query term density, title match, URL match, freshness
- User signals: CTR from past queries, dwell time, bounce rate
- Quality signals: HTTPS, mobile-friendly, page speed, Core Web Vitals
Google uses a neural ranker (MUM/BERT-based) to understand query intent beyond keyword matching.
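The second-stage reranker can be sketched as a linear model over the signals listed above. The feature names and weights here are illustrative; production systems learn these with gradient-boosted trees or neural Learning-to-Rank models trained on click and relevance data:

```python
def rerank(candidates, weights):
    """candidates: [(doc_id, {feature: value}), ...] from the retrieval stage.
    Scores each candidate with a linear model and sorts descending."""
    def score(features):
        return sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sorted(candidates, key=lambda c: score(c[1]), reverse=True)
```

Usage: with weights `{"bm25": 1.0, "pagerank": 2.0}`, a candidate with a strong link profile can outrank one with a slightly better text match — exactly the precision-stage behavior described above.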
5. Query Serving
For 100K QPS with <200ms p99 latency:
- Query understanding: spell correction, query expansion, entity recognition, intent classification (navigational / informational / transactional)
- Fan-out: query dispatcher sends the query to all index shards in parallel; each shard returns its top-K results; dispatcher merges and re-ranks globally
- Caching: popular queries (top 5% cover ~50% of traffic) are cached in Redis with a short TTL (minutes for news-sensitive queries, hours for stable queries)
- Snippets: extract the most relevant sentence around query terms from the document for the result description
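The fan-out merge step can be sketched with a streaming k-way merge. This assumes each shard has already sorted its own top-K by descending score:

```python
import heapq
import itertools

def merge_shard_results(shard_results, k=10):
    """shard_results: one list of (score, doc_id) per shard, each sorted
    descending by score. heapq.merge streams a global descending order
    across all shards; we take the top k."""
    merged = heapq.merge(*shard_results, key=lambda r: r[0], reverse=True)
    return list(itertools.islice(merged, k))
```

In practice the dispatcher would also apply global re-ranking and deduplication at this point, but the merge itself is the latency-critical core of scatter-gather serving.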
6. PageRank Algorithm
PageRank models the probability that a random web surfer ends up on a page by following links. Formula:
PR(A) = (1-d)/N + d × Σ PR(Bi)/L(Bi)
Where d≈0.85 (damping factor), N is total pages, Bi are pages linking to A, L(Bi) is the number of outbound links from Bi. Computed iteratively via MapReduce until convergence (~50 iterations for the web graph). At 50B pages this is a massive distributed computation run periodically (weekly) and stored per-document.
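The iterative computation can be sketched in plain Python for a small graph (the real computation is distributed across the 50B-page link graph; here every page must appear as a dict key, with an empty list if it has no outlinks):

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [outbound pages]}. Returns {page: rank}; ranks sum to 1."""
    n = len(links)
    pr = {p: 1.0 / n for p in links}
    for _ in range(iterations):
        # (1-d)/N base probability of jumping to any page at random
        new = {p: (1 - d) / n for p in links}
        for page, outs in links.items():
            if outs:
                share = d * pr[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # Dangling page: spread its rank uniformly over all pages
                for target in links:
                    new[target] += d * pr[page] / n
        pr = new
    return pr
```

Each iteration preserves total rank mass, so the scores remain a probability distribution as they converge.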
Data Flow Summary
Web → Crawler → Document Processor → Inverted Index Builder → Index Shards
User Query → Query Understanding → Index Shards (parallel) → Merge & Rank → Results
Scalability and Reliability
- Crawl at scale: 50B pages ÷ 30-day crawl cycle = ~20K pages/second. Achieved with thousands of distributed crawler nodes
- Fault tolerance: each index shard is replicated 3x. If a shard replica fails, another serves the query; results are slightly degraded but not unavailable
- Datacenter redundancy: queries are routed to the nearest datacenter; if unavailable, traffic fails over to another region
- Index consistency: eventual consistency is acceptable — a page not appearing for a few hours is tolerable; serving stale results is better than downtime
Interview Tips
- Start with the data pipeline: crawl → process → index → rank → serve
- Know BM25 at a high level — you don't need the exact formula, but mention TF-IDF and document-length normalization
- Explain the two-stage retrieval+ranking pattern — interviewers appreciate knowing you understand recall vs. precision
- Discuss the index tier architecture (realtime / recent / full) to show you understand freshness-cost trade-offs
- If asked about anti-spam, mention link farm detection, trust scores for domains, and manual quality rater programs