What Is a Search Engine?
A search engine indexes a corpus of documents and retrieves the most relevant results for a query in milliseconds. Examples: Google Search, Elasticsearch, Algolia. Core challenges: inverted index construction at scale, query processing (tokenization, stemming, ranking), and relevance ranking (BM25, PageRank, ML re-ranking). This complements the web crawler design — this post covers the query-time retrieval side.
System Requirements
Functional
- Index documents: tokenize, stem, build inverted index
- Search: return top-K relevant documents for a query
- Support: phrase search, boolean operators (AND/OR/NOT), filters
- Spell correction and query suggestion
Non-Functional
- Search latency <100ms at P99
- 100B documents indexed, 10K queries/second
- Index updates propagated within 60 seconds of document change
Inverted Index
An inverted index maps each term to the list of documents containing it (postings list):
term → [(doc_id, tf, positions), ...]
"python" → [(doc1, tf=3, [5,12,45]), (doc5, tf=1, [7]), ...]
"interview" → [(doc1, tf=2, [8,20]), (doc3, tf=5, [1,2,3,4,5]), ...]
tf = term frequency (number of occurrences of the term in the document). Positions enable phrase search: the query “python interview” matches only documents where “python” and “interview” occur at adjacent positions. Postings lists are stored as sorted arrays of doc_ids, which enables fast intersection via merge.
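As a toy sketch of the structure above (plain whitespace tokenization, no stemming or stopword removal — the document names and corpus here are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build term -> [(doc_id, tf, positions)] postings, doc_ids sorted."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # sorted doc_ids enable merge-based intersection
        positions = defaultdict(list)
        for pos, token in enumerate(docs[doc_id].lower().split()):
            positions[token].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return index

docs = {1: "python interview tips for python jobs", 3: "interview prep"}
index = build_inverted_index(docs)
# index["python"] → [(1, 2, [0, 4])]
```

A real indexer would also stem tokens and compress postings, but the shape of the data is the same.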
Query Processing
Query: "python interview tips"
1. Tokenize: ["python", "interview", "tips"]
2. Normalize: lowercase, stem ("tips" → "tip")
3. Lookup each term's postings list
4. Intersect (AND query): merge sorted doc_id lists
5. Score each surviving document with BM25
6. Return top-K by score
Postings list intersection: merge two sorted lists in O(N1 + N2). For AND of multiple terms: start with the shortest postings list (most selective term), intersect incrementally. Phrase search: after AND intersection, verify positions are adjacent.
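A minimal sketch of the intersection and phrase-check machinery, assuming postings are sorted doc_id lists and, for phrase checks, per-document position lists:

```python
def intersect(p1, p2):
    """Merge-intersect two sorted doc_id lists in O(|p1| + |p2|)."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(postings_lists):
    """AND of many terms: start from the shortest (most selective) list."""
    lists = sorted(postings_lists, key=len)
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
    return result

def phrase_adjacent(pos_a, pos_b):
    """True if some position in pos_b is exactly one past a position in pos_a."""
    next_positions = {p + 1 for p in pos_a}
    return any(p in next_positions for p in pos_b)
```

Starting from the shortest list keeps each intersection's output small, so later merges get cheaper rather than more expensive.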
BM25 Ranking Formula
BM25(d, q) = sum over terms t in q:
IDF(t) * (tf(t,d) * (k1+1)) / (tf(t,d) + k1*(1 - b + b*|d|/avgdl))
IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5))
IDF (Inverse Document Frequency): terms appearing in fewer documents are more informative (higher weight). TF normalization: document length normalization (b parameter, typically 0.75) prevents long documents from dominating just by containing more words. k1 (typically 1.5) controls TF saturation — repeated occurrences have diminishing returns. BM25 is the standard baseline ranking function, outperforming TF-IDF.
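The formula translates directly to code. This sketch assumes precomputed per-document term frequencies and corpus statistics (`doc_tf`, `df`, `N`, `avgdl` are illustrative names, not a real library API):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, N, k1=1.5, b=0.75):
    """Score one document for a query with BM25.
    doc_tf: term -> frequency in this document
    df:     term -> number of documents containing the term
    N:      total documents; avgdl: average document length."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # term absent from this document contributes nothing
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        norm = k1 * (1 - b + b * doc_len / avgdl)  # length normalization
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```

Note the saturation: doubling tf less than doubles a term's contribution, because tf also appears in the denominator.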
Distributed Index Architecture
100B docs / 1K docs per shard = 100M shards (too many)
100B docs / 1M docs per shard = 100K shards (more realistic)
Query fan-out:
Query → Query Router → N index shards (parallel)
← top-K results from each shard
Merge and re-rank ← aggregate top-K globally
Each shard independently scores and returns its top-K. The query router merges results from all shards (heap merge) and returns the global top-K. Sharding by document ID (random) ensures even distribution. Replication: each shard has 3 replicas for availability and read throughput.
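The router's merge step is a small heap operation. A sketch assuming each shard returns (score, doc_id) pairs (shard and doc names here are made up):

```python
import heapq
from itertools import chain

def merge_shard_results(shard_results, k):
    """Merge each shard's local top-K (score, doc_id) pairs into the
    global top-K, highest scores first."""
    return heapq.nlargest(k, chain.from_iterable(shard_results))

shard_a = [(9.1, "doc42"), (7.3, "doc7")]
shard_b = [(8.5, "doc99"), (2.0, "doc13")]
# merge_shard_results([shard_a, shard_b], 3)
# → [(9.1, 'doc42'), (8.5, 'doc99'), (7.3, 'doc7')]
```

With N shards each returning K results, the merge touches only N*K items — independent of corpus size.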
Index Updates
New documents: write to a small in-memory “delta index” that is merged with the main index periodically (segment merge, like Lucene/Elasticsearch). Deleted documents: mark as deleted in a tombstone list; filter from results until the next full merge. This write-optimized approach avoids expensive random updates to the main index.
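A minimal in-memory sketch of the delta/tombstone mechanics (real engines like Lucene use immutable on-disk segments; this only shows the pattern):

```python
class SearchIndex:
    """Delta-index pattern: new docs go to a small in-memory index,
    deletes are tombstoned, and a periodic merge folds both into main."""

    def __init__(self):
        self.main = {}         # term -> sorted list of doc_ids
        self.delta = {}        # recent additions, same shape
        self.tombstones = set()

    def add(self, doc_id, terms):
        for t in set(terms):
            self.delta.setdefault(t, []).append(doc_id)

    def delete(self, doc_id):
        self.tombstones.add(doc_id)  # filtered at query time until merged

    def lookup(self, term):
        docs = self.main.get(term, []) + self.delta.get(term, [])
        return sorted(d for d in set(docs) if d not in self.tombstones)

    def merge(self):
        """Fold the delta into main and sweep tombstoned documents."""
        for t, docs in self.delta.items():
            self.main[t] = sorted(set(self.main.get(t, [])) | set(docs))
        self.delta.clear()
        for t in self.main:
            self.main[t] = [d for d in self.main[t] if d not in self.tombstones]
        self.tombstones.clear()
```

Writes only ever touch the small delta structure, which is the point: the large main index is rewritten in bulk, never updated randomly.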
Spell Correction
Edit distance (Levenshtein) finds dictionary words within distance 1 or 2 of the query term. At scale: a BK-tree prunes candidates using the triangle inequality, or SymSpell gives near-constant-time lookup via precomputed deletions. “Did you mean” is shown when the query returns zero or few results and a corrected query produces significantly more.
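A brute-force sketch of both pieces — the full dictionary scan here stands in for a BK-tree or SymSpell lookup:

```python
def edit_distance(a, b):
    """Classic single-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct(term, dictionary, max_dist=2):
    """Nearest dictionary word within max_dist, else None.
    Production systems avoid this O(|dictionary|) scan."""
    best = min(dictionary, key=lambda w: edit_distance(term, w))
    return best if edit_distance(term, best) <= max_dist else None
```

The single-row DP keeps memory at O(|b|) instead of the full O(|a|*|b|) table.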
ML Re-Ranking
After BM25 retrieves the top-100 candidates, a neural re-ranker (e.g. a BERT cross-encoder) re-scores the query-document pairs using semantic understanding. This is expensive (on the order of 100ms per query for BERT inference), so it is applied only to the top-100 BM25 results, not all documents. The two-stage approach combines BM25 efficiency with ML accuracy.
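The two-stage flow in miniature; `bm25_retrieve` and `rerank_score` are hypothetical callables standing in for the index lookup and the cross-encoder:

```python
def two_stage_search(query, bm25_retrieve, rerank_score, k=10, candidates=100):
    """Stage 1: cheap BM25 over the full index.
    Stage 2: an expensive semantic scorer over only the top candidates.
    bm25_retrieve(query, n) -> [(doc_id, bm25_score)] is an assumed signature."""
    top = bm25_retrieve(query, candidates)
    rescored = [(doc_id, rerank_score(query, doc_id)) for doc_id, _ in top]
    rescored.sort(key=lambda x: x[1], reverse=True)
    return rescored[:k]
```

The design choice: stage 1 bounds how many documents ever see the expensive model, so query cost is fixed regardless of corpus size.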
Interview Tips
- Inverted index is the foundational data structure — describe it before anything else.
- BM25 > TF-IDF — know the formula and what each term does.
- Two-stage ranking: BM25 for recall, ML re-ranker for precision.
- Shard fan-out with per-shard top-K and global merge is the distributed query pattern.
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How does an inverted index work and why is it more efficient than a forward index?",
"acceptedAnswer": { "@type": "Answer", "text": "A forward index maps document → list of terms (what terms does this document contain?). A search query needs the opposite: for a given term, which documents contain it? The forward index answers this in O(N*|terms_per_doc|) — scan all documents. An inverted index maps term → sorted list of document IDs (postings list). A search for 'python' returns the precomputed list of all documents containing 'python' in O(1) lookup + O(|postings|) to read. For multi-term queries ('python interview'): fetch postings lists for 'python' and 'interview', intersect the two sorted lists in O(|p1| + |p2|). This is orders of magnitude faster than scanning all documents. Storage: the inverted index for a corpus of N documents with average M terms each is O(N*M) total postings entries. For 100B web pages at 500 words each, this is 50 trillion entries — stored compressed (delta encoding of doc IDs + variable-length integers) across thousands of shards. Web-scale inverted indexes are among the largest data structures in production use." }
},
{
"@type": "Question",
"name": "How does BM25 improve on TF-IDF for document ranking?",
"acceptedAnswer": { "@type": "Answer", "text": "TF-IDF (Term Frequency-Inverse Document Frequency) has two weaknesses: (1) TF is unbounded — a document mentioning 'python' 100 times scores 10x higher than one mentioning it 10 times, even though the extra repetitions add little additional relevance signal. (2) No document length normalization — a 10,000-word document will naturally contain more occurrences of any term than a 100-word document, artificially boosting its TF score. BM25 fixes both. TF saturation parameter k1 (typically 1.5): effective TF = tf * (k1+1) / (tf + k1). As tf grows, effective TF approaches (k1+1) — it saturates at a maximum, so a document with tf=100 gets nearly the same score as tf=50. Length normalization parameter b (typically 0.75): penalizes documents longer than average. Effective TF is divided by (1 - b + b * doc_length / avg_doc_length). A short document mentioning the term twice scores higher than a long document mentioning it twice. BM25 consistently outperforms TF-IDF in information retrieval benchmarks (TREC evaluations) and is the default in Elasticsearch and Lucene." }
},
{
"@type": "Question",
"name": "How do you shard an inverted index across thousands of machines for 100 billion documents?",
"acceptedAnswer": { "@type": "Answer", "text": "Two sharding strategies: document sharding (horizontal) and term sharding (vertical). Document sharding: each shard holds a subset of all documents and a complete inverted index for those documents. A query fans out to all shards in parallel: each shard independently computes its top-K results using BM25 and returns them to the router, which merges all shards' top-K results into a global top-K. Trade-off: every query touches every shard (expensive fan-out), but each shard's index is independent and simple. Term sharding: each shard holds a subset of terms and their complete postings lists across all documents. A query for 'python interview' routes 'python' to shard A and 'interview' to shard B; results are fetched and intersected at the router. Trade-off: queries require coordination across shards (complex), but single-term queries touch only one shard. Google and most production search engines use document sharding because: (1) fan-out parallelism is well-understood and scalable, (2) adding documents requires updating only the shard assigned to that document, not the entire index. Replica shards handle read throughput and availability." }
}
]
}