System Design: Search Ranking — Query Processing, Inverted Index, and Relevance Scoring

Core Architecture

A search system takes a text query and returns a ranked list of documents (web pages, products, articles). Components: (1) Crawler/indexer: fetches and processes documents and builds the inverted index. (2) Query processor: parses the query, looks up the index, computes relevance scores, and returns ranked results. (3) Serving layer: a low-latency API that handles millions of queries per second. At Google scale this means hundreds of billions of documents and trillions of queries per year; even application-level search (e-commerce, internal search) applies the same principles at smaller scale.

Inverted Index

An inverted index maps each term to the list of documents containing that term: term → [(doc_id, position, frequency), …]. This “posting list” is the core data structure. Building the index: tokenize documents (split on whitespace and punctuation), normalize (lowercase, stemming: “running” → “run”), remove stop words (the, a, an), and compute term frequency per document. For the query “fast search”: look up the posting lists for “fast” and “search”, then intersect them (AND query) or union them (OR query). Intersection is fastest when lists are processed in order of increasing length (shortest first), since each step can only shrink the candidate set. Index size: with compression, the inverted index is typically 10-30% of the raw text size. Sharding: partition the index by document (each shard holds the full index for a subset of documents). Query fan-out: scatter the query to all shards, gather results, merge-sort by score, and return the top K. Elasticsearch/Lucene use this architecture.
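
A minimal in-memory sketch of the index build and the shortest-first AND intersection, assuming simplified tokenization (no stemming, positions, or compression); all helper names here are hypothetical:

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "an"}

def tokenize(text):
    # Simplified normalization: lowercase + whitespace split + stop-word removal.
    # A real tokenizer also strips punctuation and applies stemming.
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def build_index(docs):
    """docs: {doc_id: text} -> inverted index {term: {doc_id: term_frequency}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def and_query(index, terms):
    """Intersect posting lists, shortest first, so each step only shrinks the set."""
    postings = sorted((index.get(t, {}) for t in terms), key=len)
    if not postings or not postings[0]:
        return set()
    result = set(postings[0])
    for p in postings[1:]:
        result.intersection_update(p)
        if not result:
            break  # early exit: the intersection is already empty
    return result

docs = {1: "fast search engine", 2: "a fast car", 3: "search ranking"}
index = build_index(docs)
print(and_query(index, tokenize("fast search")))  # {1}
```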

Relevance Scoring: TF-IDF and BM25

TF-IDF: score(term, doc) = TF(term, doc) * IDF(term). TF (term frequency) = log(1 + count of term in doc). IDF (inverse document frequency) = log(N / df(term)), where N = total documents and df = number of documents containing the term. High IDF = rare term = more discriminative: “the” has IDF ≈ 0 (it appears in almost every document); “photosynthesis” has high IDF. BM25 (Best Match 25): the industry standard. Improvements over TF-IDF: (1) TF saturation: BM25’s term-frequency component behaves like TF / (TF + k1), so more occurrences help, but with diminishing returns (k1 is typically 1.2). (2) Document length normalization: documents longer than average are penalized, with b controlling the strength. score = IDF * TF * (k1 + 1) / (TF + k1 * (1 - b + b * dl/avgdl)), where b = 0.75, dl = document length, and avgdl = average document length. BM25 is the default similarity in Elasticsearch and Solr.
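
A small sketch of the BM25 formula above in Python, using the document’s simple IDF = log(N/df) (real implementations typically use a smoothed IDF variant):

```python
import math

def bm25(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """Score one (term, doc) pair with the formula above."""
    idf = math.log(N / df)                # rare term -> high IDF
    length_norm = 1 - b + b * dl / avgdl  # > 1 penalizes longer-than-average docs
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Term appears 3x in a 120-word doc; 1,000 of 1,000,000 docs contain it,
# and the average doc length is 100 words.
print(round(bm25(tf=3, df=1_000, N=1_000_000, dl=120, avgdl=100), 2))  # ≈ 10.41
```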

Query Processing Pipeline

Query flow: (1) Query parsing: tokenize, normalize, handle operators (AND, OR, NOT, phrase “exact match”, field:value). (2) Query understanding: spell correction, synonym expansion (“car” → also “automobile”), entity recognition (recognizing “iPhone 15” as a product entity). (3) Index lookup: retrieve posting lists for all query terms. (4) Candidate selection: intersect/union posting lists to get the candidate document set. For large posting lists, apply the WAND (Weak AND) algorithm to skip documents that cannot possibly make the top K, without fully scoring them. (5) Scoring: compute BM25 plus additional signals (recency, popularity, click-through rate). (6) Re-ranking: apply an ML model (LambdaMART, BERT-based cross-encoder) to the top-N candidates. (7) Return the top K results with snippets. Latency budget: with a 100ms total target, a typical split is 10ms index lookup, 20ms scoring, 50ms re-ranking, and 20ms serialization and network.
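
A compact sketch of the two-stage scoring in steps (5)-(6): cheap first-pass scoring over all candidates, then an expensive re-ranker over only the top N. Here `bm25_score` and `rerank_model` are hypothetical callables standing in for the components described above:

```python
import heapq

def rank(query, candidates, bm25_score, rerank_model, n=200, k=10):
    """Two-stage ranking: fast first-pass scoring, slow re-rank of a short list."""
    # Stage 1: score every candidate with cheap index-level signals.
    top_n = heapq.nlargest(n, candidates, key=lambda doc: bm25_score(query, doc))
    # Stage 2: apply the expensive model (e.g. a cross-encoder) to just n docs.
    top_n.sort(key=lambda doc: rerank_model(query, doc), reverse=True)
    return top_n[:k]
```

The point of the split is that the re-ranker’s cost is paid per (query, document) pair, so it is only affordable on a few hundred candidates, never on the full candidate set.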

Personalization and Learning to Rank

Pure lexical relevance is insufficient for production search. Ranking signals: (1) Text relevance: BM25 score. (2) Click-through rate (CTR): documents clicked more often for similar queries rank higher. (3) Recency: newer documents rank higher for time-sensitive queries. (4) Authority: PageRank-style link analysis for the web; seller rating for e-commerce. (5) User context: past purchases, browsing history (personalized ranking). Learning to Rank (LTR): train a gradient-boosted model (XGBoost, LightGBM) or a neural model on query-document pairs labeled with relevance judgments (from human raters or click data). Offline evaluation: NDCG (Normalized Discounted Cumulative Gain) measures ranking quality. A/B testing: deploy the new ranking model to 1% of traffic and measure improvements in click-through rate and purchase conversion. Feedback loop: production click data continuously retrains the ranking model.
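
A minimal NDCG@k sketch for offline evaluation. Here `rels` is an assumed list of graded relevance labels for results in ranked order (e.g. 0 = irrelevant to 3 = perfect), and the 2^rel - 1 gain is the common exponential formulation:

```python
import math

def dcg(rels):
    # Gain 2^rel - 1, discounted by log2 of the (1-indexed) position + 1.
    return sum((2**r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels, k=10):
    """DCG of the actual ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(rels, reverse=True)[:k])
    return dcg(rels[:k]) / ideal if ideal > 0 else 0.0

# Positions 3 and 4 are swapped relative to the ideal ordering [3, 2, 1, 0].
print(round(ndcg([3, 2, 0, 1]), 3))  # ≈ 0.993
```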

Asked at: LinkedIn, Shopify, Databricks, Twitter/X
