What Is a Search Engine?
A search engine indexes a corpus of documents and retrieves the most relevant results for a query in milliseconds. Examples: Google Search, Elasticsearch, Algolia. Core challenges: inverted index construction at scale, query processing (tokenization, stemming, ranking), and relevance ranking (BM25, PageRank, ML re-ranking). This complements the web crawler design — this post covers the query-time retrieval side.
System Requirements
Functional
- Index documents: tokenize, stem, build inverted index
- Search: return top-K relevant documents for a query
- Support: phrase search, boolean operators (AND/OR/NOT), filters
- Spell correction and query suggestion
Non-Functional
- Search latency <100ms at P99
- 100B documents indexed, 10K queries/second
- Index updates propagated within 60 seconds of document change
Inverted Index
An inverted index maps each term to the list of documents containing it (postings list):
term → [(doc_id, tf, positions), ...]
"python" → [(doc1, tf=3, [5,12,45]), (doc5, tf=1, [7]), ...]
"interview" → [(doc1, tf=2, [8,20]), (doc3, tf=5, [1,2,3,4,5]), ...]
tf = term frequency (number of occurrences of the term in the document). Positions enable phrase search: the query “python interview” matches only documents where “python” and “interview” occur at adjacent positions. Postings lists are stored as sorted arrays of doc_ids, which enables fast intersection via merge.
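As a toy sketch of the structure above (plain whitespace tokenization, no stemming or stopword removal — the document names and corpus here are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build term -> [(doc_id, tf, positions)] postings, doc_ids sorted."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # sorted doc_ids enable merge-based intersection
        positions = defaultdict(list)
        for pos, token in enumerate(docs[doc_id].lower().split()):
            positions[token].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

docs = {1: "python interview tips for python jobs", 3: "interview prep"}
index = build_inverted_index(docs)
# index["python"] → [(1, 2, [0, 4])]
```

A real indexer would also stem tokens and compress postings, but the shape of the data is the same.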
Query Processing
Query: "python interview tips"
1. Tokenize: ["python", "interview", "tips"]
2. Normalize: lowercase, stem ("tips" → "tip")
3. Lookup each term's postings list
4. Intersect (AND query): merge sorted doc_id lists
5. Score each surviving document with BM25
6. Return top-K by score
Postings list intersection: merge two sorted lists in O(N1 + N2). For AND of multiple terms: start with the shortest postings list (most selective term), intersect incrementally. Phrase search: after AND intersection, verify positions are adjacent.
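A minimal sketch of the intersection and phrase-check machinery, assuming postings are sorted doc_id lists and, for phrase checks, per-document position lists:

```python
def intersect(p1, p2):
    """Merge-intersect two sorted doc_id lists in O(|p1| + |p2|)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(postings_lists):
    """AND of many terms: start from the shortest (most selective) list."""
    lists = sorted(postings_lists, key=len)
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result

def phrase_adjacent(pos_a, pos_b):
    """True if some position in pos_b is exactly one past a position in pos_a."""
    next_positions = {p + 1 for p in pos_a}
    return any(p in next_positions for p in pos_b)
```

Starting from the shortest list keeps each intersection's output small, so later merges get cheaper rather than more expensive.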
BM25 Ranking Formula
BM25(d, q) = sum over terms t in q:
IDF(t) * (tf(t,d) * (k1+1)) / (tf(t,d) + k1*(1 - b + b*|d|/avgdl))
IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5))
IDF (Inverse Document Frequency): terms appearing in fewer documents are more informative (higher weight). TF normalization: document length normalization (b parameter, typically 0.75) prevents long documents from dominating just by containing more words. k1 (typically 1.5) controls TF saturation — repeated occurrences have diminishing returns. BM25 is the standard baseline ranking function, outperforming TF-IDF.
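The formula translates directly to code. This sketch assumes precomputed per-document term frequencies and corpus statistics (`doc_tf`, `df`, `N`, `avgdl` are illustrative names, not a real library API):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, N, k1=1.5, b=0.75):
    """Score one document for a query with BM25.
    doc_tf: term -> frequency in this document
    df:     term -> number of documents containing the term
    N:      total documents; avgdl: average document length."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # term absent from this document contributes nothing
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        norm = k1 * (1 - b + b * doc_len / avgdl)  # length normalization
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```

Note the saturation: doubling tf less than doubles a term's contribution, because tf also appears in the denominator.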
Distributed Index Architecture
100B docs / 1K docs per shard = 100M shards (too many)
100B docs / 1M docs per shard = 100K shards (more realistic)
Query fan-out:
Query → Query Router → N index shards (parallel)
← top-K results from each shard
Merge and re-rank ← aggregate top-K globally
Each shard independently scores and returns its top-K. The query router merges results from all shards (heap merge) and returns the global top-K. Sharding by document ID (random) ensures even distribution. Replication: each shard has 3 replicas for availability and read throughput.
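The router's merge step is a small heap operation. A sketch assuming each shard returns (score, doc_id) pairs (shard and doc names here are made up):

```python
import heapq
from itertools import chain

def merge_shard_results(shard_results, k):
    """Merge each shard's local top-K (score, doc_id) pairs into the
    global top-K, highest scores first."""
    return heapq.nlargest(k, chain.from_iterable(shard_results))

shard_a = [(9.1, "doc42"), (7.3, "doc7")]
shard_b = [(8.5, "doc99"), (2.0, "doc13")]
# merge_shard_results([shard_a, shard_b], 3)
# → [(9.1, 'doc42'), (8.5, 'doc99'), (7.3, 'doc7')]
```

With N shards each returning K results, the merge touches only N*K items — independent of corpus size.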
Index Updates
New documents: write to a small in-memory “delta index” that is merged with the main index periodically (segment merge, like Lucene/Elasticsearch). Deleted documents: mark as deleted in a tombstone list; filter from results until the next full merge. This write-optimized approach avoids expensive random updates to the main index.
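A minimal in-memory sketch of the delta/tombstone mechanics (real engines like Lucene use immutable on-disk segments; this only shows the pattern):

```python
class SearchIndex:
    """Delta-index pattern: new docs go to a small in-memory index,
    deletes are tombstoned, and a periodic merge folds both into main."""

    def __init__(self):
        self.main = {}         # term -> sorted list of doc_ids
        self.delta = {}        # recent additions, same shape
        self.tombstones = set()

    def add(self, doc_id, terms):
        for t in set(terms):
            self.delta.setdefault(t, []).append(doc_id)

    def delete(self, doc_id):
        self.tombstones.add(doc_id)  # filtered at query time until merged

    def lookup(self, term):
        docs = self.main.get(term, []) + self.delta.get(term, [])
        return sorted(d for d in set(docs) if d not in self.tombstones)

    def merge(self):
        """Fold the delta into main and sweep tombstoned documents."""
        for t, docs in self.delta.items():
            self.main[t] = sorted(set(self.main.get(t, [])) | set(docs))
        self.delta.clear()
        for t in self.main:
            self.main[t] = [d for d in self.main[t] if d not in self.tombstones]
        self.tombstones.clear()
```

Writes only ever touch the small delta structure, which is the point: the large main index is rewritten in bulk, never updated randomly.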
Spell Correction
Edit distance (Levenshtein) finds dictionary words within distance 1 or 2 of the query term. At scale: a BK-tree prunes candidates using the triangle inequality, or SymSpell gives near-constant-time lookup via precomputed deletions. “Did you mean” is shown when the query returns zero or few results and a corrected query produces significantly more.
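A brute-force sketch of both pieces — the full dictionary scan here stands in for a BK-tree or SymSpell lookup:

```python
def edit_distance(a, b):
    """Classic single-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(term, dictionary, max_dist=2):
    """Nearest dictionary word within max_dist, else None.
    Production systems avoid this O(|dictionary|) scan."""
    best = min(dictionary, key=lambda w: edit_distance(term, w))
    return best if edit_distance(term, best) <= max_dist else None
```

The single-row DP keeps memory at O(|b|) instead of the full O(|a|*|b|) table.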
ML Re-Ranking
After BM25 retrieves the top-100 candidates, a neural re-ranker (e.g. a BERT cross-encoder) re-scores the query-document pairs using semantic understanding. This is expensive (on the order of 100ms per query for BERT inference), so it is applied only to the top-100 BM25 results, not all documents. The two-stage approach combines BM25 efficiency with ML accuracy.
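The two-stage flow in miniature; `bm25_retrieve` and `rerank_score` are hypothetical callables standing in for the index lookup and the cross-encoder:

```python
def two_stage_search(query, bm25_retrieve, rerank_score, k=10, candidates=100):
    """Stage 1: cheap BM25 over the full index.
    Stage 2: an expensive semantic scorer over only the top candidates.
    bm25_retrieve(query, n) -> [(doc_id, bm25_score)] is an assumed signature."""
    top = bm25_retrieve(query, candidates)
    rescored = [(doc_id, rerank_score(query, doc_id)) for doc_id, _ in top]
    rescored.sort(key=lambda x: x[1], reverse=True)
    return rescored[:k]
```

The design choice: stage 1 bounds how many documents ever see the expensive model, so query cost is fixed regardless of corpus size.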
Interview Tips
- Inverted index is the foundational data structure — describe it before anything else.
- BM25 > TF-IDF — know the formula and what each term does.
- Two-stage ranking: BM25 for recall, ML re-ranker for precision.
- Shard fan-out with per-shard top-K and global merge is the distributed query pattern.
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How does an inverted index work and why is it more efficient than a forward index?",
"acceptedAnswer": { "@type": "Answer", "text": "A forward index maps document → list of terms (what terms does this document contain?). A search query needs the opposite: for a given term, which documents contain it? The forward index answers this in O(N*|terms_per_doc|) — scan all documents. An inverted index maps term → sorted list of document IDs (postings list). A search for 'python' returns the precomputed list of all documents containing 'python' in O(1) lookup + O(|postings|) to read. For multi-term queries ('python interview'): fetch postings lists for 'python' and 'interview', intersect the two sorted lists in O(|p1| + |p2|). This is orders of magnitude faster than scanning all documents. Storage: the inverted index for a corpus of N documents with average M terms each is O(N*M) total postings entries. For 100B web pages at 500 words each, this is 50 trillion entries — stored compressed (delta encoding of doc IDs + variable-length integers) across thousands of shards. Web-scale inverted indexes are among the largest data structures in production use." }
},
{
"@type": "Question",
"name": "How does BM25 improve on TF-IDF for document ranking?",
"acceptedAnswer": { "@type": "Answer", "text": "TF-IDF (Term Frequency-Inverse Document Frequency) has two weaknesses: (1) TF is unbounded — a document mentioning 'python' 100 times scores 10x higher than one mentioning it 10 times, even though the extra repetitions add little additional relevance signal. (2) No document length normalization — a 10,000-word document will naturally contain more occurrences of any term than a 100-word document, artificially boosting its TF score. BM25 fixes both. TF saturation parameter k1 (typically 1.5): effective TF = tf * (k1+1) / (tf + k1). As tf grows, effective TF approaches (k1+1) — it saturates at a maximum, so a document with tf=100 gets nearly the same score as tf=50. Length normalization parameter b (typically 0.75): penalizes documents longer than average. Effective TF is divided by (1 - b + b * doc_length / avg_doc_length). A short document mentioning the term twice scores higher than a long document mentioning it twice. BM25 consistently outperforms TF-IDF in information retrieval benchmarks (TREC evaluations) and is the default in Elasticsearch and Lucene." }
},
{
"@type": "Question",
"name": "How do you shard an inverted index across thousands of machines for 100 billion documents?",
"acceptedAnswer": { "@type": "Answer", "text": "Two sharding strategies: document sharding (horizontal) and term sharding (vertical). Document sharding: each shard holds a subset of all documents and a complete inverted index for those documents. A query fans out to all shards in parallel: each shard independently computes its top-K results using BM25 and returns them to the router, which merges all shards' top-K results into a global top-K. Trade-off: every query touches every shard (expensive fan-out), but each shard's index is independent and simple. Term sharding: each shard holds a subset of terms and their complete postings lists across all documents. A query for 'python interview' routes 'python' to shard A and 'interview' to shard B; results are fetched and intersected at the router. Trade-off: queries require coordination across shards (complex), but single-term queries touch only one shard. Google and most production search engines use document sharding because: (1) fan-out parallelism is well-understood and scalable, (2) adding documents requires updating only the shard assigned to that document, not the entire index. Replica shards handle read throughput and availability." }
}
]
}