Core Architecture
A search system takes a text query and returns a ranked list of documents (web pages, products, articles). Components: (1) Crawler/indexer: fetches and processes documents, builds the inverted index. (2) Query processor: parses the query, looks up the index, computes relevance scores, returns ranked results. (3) Serving layer: low-latency API that handles millions of queries per second. At Google scale: hundreds of billions of documents, trillions of queries per year. Even for application-level search (e-commerce search, internal search), the same principles apply at smaller scale.
Inverted Index
An inverted index maps each term to the list of documents containing that term: term → [(doc_id, position, frequency), …]. This “posting list” is the core data structure. Building the index: tokenize documents (split by whitespace, punctuation), normalize (lowercase, stemming: “running” → “run”), remove stop words (the, a, an), compute term frequency per document. For a query “fast search”: look up the posting lists for “fast” and “search”, then intersect the lists (AND query) or union them (OR query). Intersection is fastest when the lists are processed shortest-first, since the running intersection can only shrink. Index size: with compression, the inverted index is typically 10-30% of the raw text size. Sharding: partition the index by document (each shard holds the full index for a subset of documents). Query fan-out: scatter the query to all shards, gather results, merge-sort by score, return the top K. Elasticsearch/Lucene use this architecture.
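As a concrete toy sketch of the build and intersection steps (the tokenizer here is deliberately minimal: lowercase plus whitespace split, no stemming or stop-word removal, and the function names are illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map term -> posting list [(doc_id, [positions]), ...], sorted by doc_id."""
    postings = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):  # toy tokenizer
            postings[term].setdefault(doc_id, []).append(pos)
    # posting lists sorted by doc_id, as needed for merge-join intersection
    return {term: sorted(by_doc.items()) for term, by_doc in postings.items()}

def and_query(index, terms):
    """AND query: intersect posting lists, processing the shortest list first."""
    lists = sorted((index.get(t, []) for t in terms), key=len)
    if not lists or not lists[0]:
        return []
    result = {doc_id for doc_id, _ in lists[0]}
    for plist in lists[1:]:
        result &= {doc_id for doc_id, _ in plist}
    return sorted(result)

index = build_inverted_index(["fast search engine", "a fast car", "search quality"])
```

With this corpus, `and_query(index, ["fast", "search"])` returns `[0]`: only document 0 contains both terms.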
Relevance Scoring: TF-IDF and BM25
TF-IDF: score(term, doc) = TF(term, doc) * IDF(term). TF (term frequency) = log(1 + count of term in doc). IDF (inverse document frequency) = log(N / df(term)) where N = total documents and df = documents containing the term. High IDF = rare term = more discriminative. “The” has IDF ≈ 0 (appears in almost every document); “photosynthesis” has high IDF. BM25 (Best Match 25): the industry standard. Improvements over TF-IDF: (1) TF saturation: BM25 uses TF / (TF + k1) — more occurrences help, but with diminishing returns (k1 typically 1.2). (2) Document length normalization: longer documents get a slight penalty. score = IDF * TF * (k1+1) / (TF + k1 * (1 – b + b * dl/avgdl)) where b=0.75, dl=document length, avgdl=average document length. BM25 is the default in Elasticsearch and Solr.
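The scoring formula above translates directly into code. A minimal sketch of the per-term BM25 contribution, using the simple unsmoothed IDF (production implementations such as Lucene's add smoothing):

```python
import math

def bm25_term_score(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of a single term to a document's score.

    tf: term count in the document, df: documents containing the term,
    N: total documents, dl: document length, avgdl: average document length.
    """
    idf = math.log(N / df)  # unsmoothed IDF; real systems smooth this
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

With N=1000, df=10, dl=avgdl=100, scoring tf = 1, 2, 3 shows the saturation effect: each additional occurrence adds less than the previous one, and a document twice the average length scores lower at the same tf.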
Query Processing Pipeline
Query flow: (1) Query parsing: tokenize, normalize, handle operators (AND, OR, NOT, phrase “exact match”, field:value). (2) Query understanding: spell correction, synonym expansion (“car” → also “automobile”), entity recognition (recognizing “iPhone 15” as a product entity). (3) Index lookup: retrieve posting lists for all query terms. (4) Candidate selection: intersect/union posting lists to get candidate document set. For large posting lists: apply WAND (Weak AND) algorithm to skip documents that can’t possibly rank in the top K without scoring them. (5) Scoring: compute BM25 + additional signals (recency, popularity, click-through rate). (6) Re-ranking: apply ML model (LambdaMART, BERT-based cross-encoder) to top-N candidates. (7) Return top K results with snippets. Latency budget: total query latency target = 100ms. Index lookup: 10ms. Scoring: 20ms. Re-ranking: 50ms. Serialization and network: 20ms.
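Steps (4)-(5) ultimately reduce to scoring candidates and keeping the best K. A common sketch uses a bounded min-heap so the candidate set never needs a full sort (O(n log k) instead of O(n log n); the names here are illustrative, and `score_fn` stands in for BM25 plus the additional signals):

```python
import heapq

def top_k(candidate_ids, score_fn, k=10):
    """Score candidates and keep the global top K via a size-K min-heap."""
    heap = []  # (score, doc_id); heap[0] is the lowest score currently retained
    for doc_id in candidate_ids:
        s = score_fn(doc_id)
        if len(heap) < k:
            heapq.heappush(heap, (s, doc_id))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, doc_id))  # evict the current minimum
    return sorted(heap, reverse=True)  # highest score first
```

The same kept-top-K structure reappears at the merge step of sharded search: each shard returns its local top K, and the coordinator merges them with the same bounded-heap logic.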
Personalization and Learning to Rank
Pure lexical relevance is insufficient for production search. Ranking signals: (1) Text relevance: BM25 score. (2) Click-through rate (CTR): documents clicked more often for similar queries rank higher. (3) Recency: newer documents rank higher for time-sensitive queries. (4) Authority: PageRank-style link analysis for web; seller rating for e-commerce. (5) User context: past purchases, browsing history (personalized ranking). Learning to Rank (LTR): train a gradient boosted model (XGBoost, LightGBM) or neural model on query-document pairs labeled with relevance judgments (from human raters or click data). Offline evaluation: NDCG (Normalized Discounted Cumulative Gain) measures ranking quality. A/B testing: deploy new ranking model to 1% traffic, measure click-through rate and purchase conversion improvement. Feedback loop: click data from production continuously retrains the ranking model.
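The NDCG metric used for offline evaluation is only a few lines to compute. A sketch with graded relevance labels (one label per result position, highest grade = most relevant):

```python
import math

def dcg(relevances):
    """DCG@K: gains discounted by log2(position + 1), positions 1-based."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """NDCG = DCG / IDCG, where IDCG is the DCG of the ideal (descending) order."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, `ndcg([3, 1, 0])` is 1.0 (the list is already in ideal order), while `ndcg([0, 1, 3])` falls below 1 because the most relevant document sits at the bottom.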
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How does an inverted index differ from a forward index?",
"acceptedAnswer": {
"@type": "Answer",
"text": "A forward index maps document → list of terms it contains: doc_id → [word1, word2, word3, …]. This is the natural representation after document processing. An inverted index maps term → list of documents containing it: word → [(doc_id, position, frequency), …]. The forward index is useful for re-indexing (recomputing a document's index entries when it changes) and for computing per-document features. The inverted index is the core data structure for query processing — given a query word, instantly find all documents containing it. In practice, both are built. The build process: for each document in the corpus, tokenize and normalize it, then emit (term, doc_id, position) tuples. Sort by term. Group by term to create posting lists. The posting list for each term is stored sorted by doc_id for efficient intersection (merge-join style). Posting lists can be compressed (delta encoding: store differences between consecutive doc_ids, then variable-length encode — typical 4-8x compression)."
}
},
{
"@type": "Question",
"name": "What is BM25 and why is it better than raw TF-IDF?",
"acceptedAnswer": {
"@type": "Answer",
"text": "TF-IDF has two weaknesses: (1) Term frequency is unbounded and linear — a document containing “apple” 100 times scores 100x higher than one containing “apple” once, even though the extra occurrences add diminishing value. (2) No document length normalization — a 10,000-word document has more raw term occurrences than a 100-word document just because it's longer. BM25 (Best Match 25) fixes both. TF saturation: the BM25 score uses TF / (TF + k1) where k1 is typically 1.2-2.0. As TF grows, the score asymptotically approaches 1 — each additional occurrence adds less weight. Document length normalization: scale k1 by (1 – b + b * dl/avgdl) in the denominator, where b=0.75. A longer-than-average document is slightly penalized; a shorter one gets a slight bonus. IDF component: same as TF-IDF but with slight smoothing to avoid zero for terms appearing in all documents. Empirically, BM25 significantly outperforms TF-IDF on standard IR benchmarks. It is the default algorithm in Elasticsearch, Solr, and most search libraries."
}
},
{
"@type": "Question",
"name": "How do you handle typos and misspellings in search queries?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Spell correction approaches: (1) Edit distance (Levenshtein): for each query term not found in the index vocabulary, find vocabulary words within edit distance 1 or 2. Levenshtein automata allow efficient fuzzy matching against large vocabularies; the Elasticsearch fuzzy query uses this. (2) Phonetic matching: Soundex or Metaphone converts words to phonetic representations — “colour” and “color” have the same Soundex code. (3) N-gram index: index all N-grams (e.g. trigrams) of vocabulary words. For a misspelled query term, find vocabulary words sharing many trigrams. Fast, at the cost of a larger index. (4) Statistical spell correction: use query logs to learn common misspellings. If 90% of users who type “applle” go on to re-type “apple”, treat “applle” as a misspelling of “apple.” (5) Context-aware correction: “read” is not misspelled, but given the other query terms (a medical query), the intended word might be “reed.” Use language models to suggest corrections in context. Modern search engines combine multiple signals and show “Did you mean X?” below the first few results without hard-redirecting."
}
},
{
"@type": "Question",
"name": "What is NDCG and how is it used to evaluate search ranking quality?",
"acceptedAnswer": {
"@type": "Answer",
"text": "NDCG (Normalized Discounted Cumulative Gain) measures ranking quality by weighting both relevance and position. DCG = sum of (relevance_i / log2(i+1)) for positions i=1..K. The log2 discount penalizes relevant documents placed lower in the ranking. NDCG = DCG / IDCG, where IDCG is the ideal DCG (the documents perfectly ranked by relevance). NDCG ranges from 0 to 1; 1 = perfect ranking. Example: query “laptop charger”. Position 1: highly relevant (relevance=3), position 2: slightly relevant (relevance=1), position 3: not relevant (relevance=0). DCG@3 = 3/log2(2) + 1/log2(3) + 0/log2(4) = 3 + 0.63 + 0 = 3.63. IDCG@3 (ideal order 3, 1, 0) is also 3.63 here, so NDCG = 3.63/3.63 = 1.0. Relevance judgments come from human raters or click-through data (clicks serve as an implicit relevance signal). Common practice: compute NDCG@5 or NDCG@10 (top 5 or 10 results) since users rarely go beyond page 1. A/B test new ranking models by measuring NDCG uplift on held-out query sets before deploying to production."
}
},
{
"@type": "Question",
"name": "How does Elasticsearch handle distributed search across multiple shards?",
"acceptedAnswer": {
"@type": "Answer",
"text": "An Elasticsearch index is divided into N primary shards (typically 3-5). Each shard is a complete Lucene index. Queries fan out to all shards in parallel: (1) Scatter phase: the coordinating node sends the query to every shard. Each shard runs the query locally and returns its top K doc IDs and scores (not the full documents). (2) Gather phase: the coordinator merges all shard results, sorts by score, and selects the global top K. (3) Fetch phase: the coordinator fetches the full documents for the top K from the relevant shards. The two-phase approach (scatter-gather + fetch) minimizes data transfer: only scores and IDs (small) move in the gather phase; full document data (large) is fetched only for the final K results. Replica shards: each primary has a configurable number of replicas (default 1). Reads are load-balanced across replicas; writes go to the primary, then replicate to the replicas. Deep pagination problem: requesting page 100 with 10 results per page requires fetching the top 1,000 from every shard (expensive). Use search_after (keyset pagination) for deep paging."
}
}
]
}