What Is a Search Engine?
A search engine indexes a corpus of documents and retrieves the most relevant results for a query in milliseconds. Examples: Google Search, Elasticsearch, Algolia. Core challenges: inverted index construction at scale, query processing (tokenization, stemming, ranking), and relevance ranking (BM25, PageRank, ML re-ranking). This complements the web crawler design — this post covers the query-time retrieval side.
System Requirements
Functional
- Index documents: tokenize, stem, build inverted index
- Search: return top-K relevant documents for a query
- Support: phrase search, boolean operators (AND/OR/NOT), filters
- Spell correction and query suggestion
Non-Functional
- 100B documents indexed, 10K queries/second
- Search latency <100ms at P99
- Index updates propagated within 60 seconds of a document change
Inverted Index
An inverted index maps each term to the list of documents containing it (postings list):
term → [(doc_id, tf, positions), ...]
"python" → [(doc1, tf=3, [5,12,45]), (doc5, tf=1, [7]), ...]
"interview" → [(doc1, tf=2, [8,20]), (doc3, tf=5, [1,2,3,4,5]), ...]
tf = term frequency (occurrences of the term in the document). Positions enable phrase search: “python interview” matches only where “python” and “interview” occupy adjacent positions. Postings lists are stored as arrays sorted by doc_id, enabling fast intersection via merge.
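The structure above can be sketched as a toy in-memory index. This is illustrative only; the `InvertedIndex` class and its methods are made up for this sketch, not taken from any library:

```python
from collections import defaultdict

class InvertedIndex:
    """Maps term -> list of (doc_id, tf, positions); postings sorted by doc_id."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add_document(self, doc_id, text):
        # Collect every position of every term, then emit one posting per term.
        positions_by_term = defaultdict(list)
        for pos, token in enumerate(text.lower().split()):
            positions_by_term[token].append(pos)
        for term, positions in positions_by_term.items():
            # Documents are added in increasing doc_id order, so lists stay sorted.
            self.postings[term].append((doc_id, len(positions), positions))

index = InvertedIndex()
index.add_document(1, "python makes python interview prep easier")
index.add_document(5, "python tips")
# index.postings["python"] → [(1, 2, [0, 2]), (5, 1, [0])]
```

A real index would also stem tokens and compress the postings (delta-encoded doc_ids, variable-byte integers); the shape of the data is the same.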
Query Processing
Query: "python interview tips"
1. Tokenize: ["python", "interview", "tips"]
2. Normalize: lowercase, stem ("tips" → "tip")
3. Lookup each term's postings list
4. Intersect (AND query): merge sorted doc_id lists
5. Score each surviving document with BM25
6. Return top-K by score
Postings list intersection: merge two sorted lists in O(N1 + N2). For AND of multiple terms: start with the shortest postings list (most selective term), intersect incrementally. Phrase search: after AND intersection, verify positions are adjacent.
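The merge intersection and the phrase check can be sketched as follows, using the toy postings from the index example above (function names are my own):

```python
def intersect(list_a, list_b):
    """Merge-intersect two postings lists sorted by doc_id; O(n1 + n2)."""
    i = j = 0
    out = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i][0] == list_b[j][0]:
            out.append((list_a[i], list_b[j]))   # same doc contains both terms
            i += 1
            j += 1
        elif list_a[i][0] < list_b[j][0]:
            i += 1
        else:
            j += 1
    return out

def phrase_match(posting_a, posting_b):
    """True if some position of term B directly follows a position of term A."""
    positions_b = set(posting_b[2])
    return any(p + 1 in positions_b for p in posting_a[2])

python_postings = [(1, 3, [5, 12, 45]), (5, 1, [7])]
interview_postings = [(1, 2, [8, 20]), (3, 5, [1, 2, 3, 4, 5])]
both = intersect(python_postings, interview_postings)
# both → [((1, 3, [5, 12, 45]), (1, 2, [8, 20]))] — only doc1 has both terms
# phrase_match(*both[0]) → False: positions 5/12/45 and 8/20 are never adjacent
```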
BM25 Ranking Formula
BM25(d, q) = sum over terms t in q:
IDF(t) * (tf(t,d) * (k1+1)) / (tf(t,d) + k1*(1 - b + b*|d|/avgdl))
IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5))
IDF (Inverse Document Frequency): terms appearing in fewer documents are more informative (higher weight). TF normalization: document length normalization (b parameter, typically 0.75) prevents long documents from dominating just by containing more words. k1 (typically 1.5) controls TF saturation — repeated occurrences have diminishing returns. BM25 is the standard baseline ranking function, outperforming TF-IDF.
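The formula transcribes directly into code. A minimal sketch with the parameter defaults named above (the function signature and dict-based inputs are my own convention):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, num_docs, k1=1.5, b=0.75):
    """BM25 score of one document for a query.

    doc_tf: {term: term frequency in this document}
    df:     {term: number of documents containing the term}
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # term absent from the document contributes nothing
        idf = math.log((num_docs - df[term] + 0.5) / (df[term] + 0.5))
        norm = k1 * (1 - b + b * doc_len / avgdl)   # length normalization
        score += idf * (tf * (k1 + 1)) / (tf + norm)
    return score
```

Note the saturation: raising tf from 1 to 3 adds far less than the first occurrence did, which is exactly the diminishing-returns behavior k1 controls.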
Distributed Index Architecture
100B docs / 1K docs per shard = 100M shards (too many)
100B docs / 1M docs per shard = 100K shards (more realistic)
Query fan-out:
Query → Query Router → N index shards (parallel)
← top-K results from each shard
Merge and re-rank ← aggregate top-K globally
Each shard independently scores and returns its top-K. The query router merges results from all shards (heap merge) and returns the global top-K. Sharding by document ID (random) ensures even distribution. Replication: each shard has 3 replicas for availability and read throughput.
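The router-side merge is small enough to sketch in a few lines. Assuming each shard returns its results as (score, doc_id) pairs (my convention), the global top-K is a heap selection over the per-shard lists:

```python
import heapq
from itertools import chain

def merge_shard_results(shard_results, k):
    """Merge per-shard top-K lists of (score, doc_id) into the global top-K."""
    return heapq.nlargest(k, chain.from_iterable(shard_results))

shard1 = [(9.1, "doc42"), (7.3, "doc7")]
shard2 = [(8.6, "doc99"), (2.0, "doc13")]
# merge_shard_results([shard1, shard2], 3)
# → [(9.1, 'doc42'), (8.6, 'doc99'), (7.3, 'doc7')]
```

Each shard only ships K results over the network regardless of how many documents matched locally, which is what keeps fan-out cheap.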
Index Updates
New documents: write to a small in-memory “delta index” that is merged with the main index periodically (segment merge, like Lucene/Elasticsearch). Deleted documents: mark as deleted in a tombstone list; filter from results until the next full merge. This write-optimized approach avoids expensive random updates to the main index.
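A sketch of the read and merge paths under this scheme. The class and method names are illustrative, not Lucene's actual API, and postings are the (doc_id, tf, positions) tuples from earlier:

```python
class IndexWithDeltas:
    """Main segment plus a small delta index and a tombstone set."""

    def __init__(self, main_index):
        self.main = main_index   # large, rarely rewritten: term -> postings
        self.delta = {}          # recent writes: term -> postings
        self.tombstones = set()  # doc_ids marked deleted

    def lookup(self, term):
        # Query time: combine main + delta, filter out tombstoned docs.
        postings = self.main.get(term, []) + self.delta.get(term, [])
        return [p for p in postings if p[0] not in self.tombstones]

    def merge(self):
        # Periodic segment merge: fold delta into main, purge tombstoned docs.
        for term, postings in self.delta.items():
            self.main.setdefault(term, []).extend(postings)
        for term in list(self.main):
            self.main[term] = [p for p in self.main[term]
                               if p[0] not in self.tombstones]
        self.delta.clear()
        self.tombstones.clear()
```

Writes and deletes touch only the small mutable structures; the large main index is rewritten only during the periodic merge.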
Spell Correction
Edit distance (Levenshtein) finds dictionary words within distance 1 or 2 of each query term. A brute-force scan of the dictionary is too slow at scale; a BK-tree (metric-tree search) or SymSpell (precomputed deletion variants) answers each lookup far faster, with SymSpell approaching constant time per query. Show “Did you mean” when the query returns zero or few results and a corrected query returns significantly more.
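The core primitive is the Levenshtein distance itself; a minimal dynamic-programming version, plus the brute-force candidate scan that BK-tree/SymSpell exist to avoid (function names are my own):

```python
def edit_distance(a, b):
    """Levenshtein distance via a single rolling DP row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def corrections(word, dictionary, max_dist=2):
    """Brute-force scan; fine for a demo, too slow for a million-word dictionary."""
    return sorted(w for w in dictionary if edit_distance(word, w) <= max_dist)

# corrections("pyhton", {"python", "pythonic", "typhoon"}) → ["python"]
```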
ML Re-Ranking
After BM25 retrieves the top-100 candidates, a neural re-ranker (e.g. a BERT cross-encoder) re-scores the query-document pairs using semantic understanding. This is expensive (on the order of 100ms of inference per query), so it is applied only to the top-100 BM25 results, never to the full corpus. The two-stage approach combines BM25 efficiency with ML accuracy.
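The two-stage pipeline itself is just a composition of the two scorers. A sketch where `bm25_retrieve` and `neural_score` are assumed interfaces, not any specific library's API:

```python
def two_stage_search(query, bm25_retrieve, neural_score,
                     k_candidates=100, k_final=10):
    """Stage 1: cheap BM25 retrieval over the full index.
    Stage 2: expensive re-ranker over only the small candidate set.

    bm25_retrieve(query, k) -> list of (doc_id, text)
    neural_score(query, text) -> float  (e.g. a cross-encoder's relevance score)
    """
    candidates = bm25_retrieve(query, k_candidates)
    rescored = sorted(((neural_score(query, text), doc_id)
                       for doc_id, text in candidates), reverse=True)
    return [doc_id for _, doc_id in rescored[:k_final]]
```

The cost asymmetry is the whole point: the re-ranker runs 100 times per query no matter how large the corpus is, while BM25 does the corpus-scale work.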
Interview Tips
- Inverted index is the foundational data structure — describe it before anything else.
- BM25 > TF-IDF — know the formula and what each term does.
- Two-stage ranking: BM25 for recall, ML re-ranker for precision.
- Shard fan-out with per-shard top-K and global merge is the distributed query pattern.