System Design: Distributed Search Engine
A distributed search engine indexes documents and serves queries with sub-100ms latency at massive scale. Elasticsearch powers search at Wikipedia, GitHub, Uber, and Airbnb. This guide covers how to build a search system from scratch: the inverted index, distributed query execution, relevance scoring, and operational concerns.
Requirements
Functional: Index documents (text, structured fields). Full-text search with ranking by relevance. Boolean queries (AND, OR, NOT). Faceted search (filter by category, price range). Near-real-time indexing (<1 second from write to searchable).
Non-functional: 1 billion documents, 10,000 QPS search, p99 < 100ms, horizontal scalability, fault tolerance.
Core Data Structure: Inverted Index
An inverted index maps each term (word) to a list of document IDs that contain it — the inverse of a document containing a list of words.
Term → Posting List of (doc_id, position of first occurrence, frequency) entries — a real index stores every position, but one is shown here for brevity
"python" → [(doc:1, pos:5, freq:3), (doc:4, pos:1, freq:1), (doc:9, pos:12, freq:2)]
"interview" → [(doc:1, pos:8, freq:1), (doc:3, pos:2, freq:4)]
"design" → [(doc:2, pos:1, freq:2), (doc:3, pos:7, freq:1), (doc:9, pos:3, freq:5)]
Query “python AND interview” = intersection of posting lists for “python” and “interview”. Posting lists are stored sorted by doc_id, enabling efficient merge with two pointers in O(|posting1| + |posting2|). Position data enables phrase queries (“system design” as adjacent words).
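The two-pointer merge over sorted posting lists can be sketched in a few lines. This is a minimal illustration (function and variable names are my own, not from any library); real engines add skip pointers and compression on top of the same idea:

```python
def intersect(p1, p2):
    """AND of two posting lists sorted by doc_id.

    Two pointers advance in lockstep, so the merge is O(|p1| + |p2|).
    """
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])       # doc contains both terms
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                  # advance the pointer at the smaller doc_id
        else:
            j += 1
    return out

# "python" AND "interview", using the doc_ids from the example above:
print(intersect([1, 4, 9], [1, 3]))  # [1]
```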
Indexing Pipeline
Document → Tokenizer → Analyzer (lowercase, stemming, stopwords) → Token Stream
↓
Inverted Index Update
↓
Segment File (immutable)
Text analysis steps:
- Tokenization: Split text into tokens (“System Design” → [“System”, “Design”])
- Lowercasing: “Python” → “python”
- Stop word removal: Remove common words (“the”, “is”, “at”) that add noise
- Stemming/Lemmatization: “running” → “run”, “designs” → “design”. Enables matching “design” queries to “designs” in documents.
- Synonym expansion: “car” → also index “automobile”
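The analysis steps above chain into a pipeline. Below is a toy sketch (my own names; the crude suffix stripper stands in for a real stemmer like Porter or Snowball, and synonym expansion is omitted):

```python
import re

STOPWORDS = {"the", "is", "at", "a", "an", "of"}

def analyze(text):
    """Toy analyzer: tokenize -> lowercase -> drop stopwords -> strip suffixes."""
    tokens = re.findall(r"[A-Za-z]+", text)   # tokenization
    out = []
    for tok in tokens:
        tok = tok.lower()                     # lowercasing
        if tok in STOPWORDS:                  # stop word removal
            continue
        for suffix in ("ing", "s"):           # very crude stemming
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out

print(analyze("The System Designs"))  # ['system', 'design']
```

The same analyzer must run at index time and at query time, so that "designs" in a document and "design" in a query normalize to the same term.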
Lucene Segment Architecture
Lucene (the engine under Elasticsearch) uses an immutable segment design. New documents go into an in-memory buffer. Periodically (every 1 second in Elasticsearch), the buffer is flushed to a new immutable segment file on disk. This makes each write O(1) — just append to the buffer.
Segment merging: Many small segments accumulate over time. Background threads merge small segments into larger ones. Merging improves query performance (fewer segments to search) and reclaims space from deleted documents. Deleted documents are tracked with a tombstone bitset — physically removed during merges.
Query execution: Each query searches all segments in parallel and merges results. 5 segments × 200ms each → parallelized → effectively 200ms total (not 1000ms serial).
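The per-segment fan-out can be sketched with a thread pool. This is an illustrative stand-in (segments modeled as dicts of term → scored postings, names are my own), not Lucene's actual executor:

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_segment(segment, term, k):
    # Stand-in for a real per-segment search; returns (score, doc_id) pairs.
    return heapq.nlargest(k, segment.get(term, []))

def search_all(segments, term, k=2):
    # Query every immutable segment in parallel, then merge the partial top-k lists.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda seg: search_segment(seg, term, k), segments))
    return heapq.nlargest(k, (hit for part in partials for hit in part))

segments = [
    {"python": [(0.9, 1), (0.4, 4)]},
    {"python": [(0.7, 9)]},
]
print(search_all(segments, "python"))  # [(0.9, 1), (0.7, 9)]
```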
Distributed Architecture
Client → Load Balancer → Coordinator Node
↓
[shard_0 on Node1] [shard_1 on Node2] [shard_2 on Node3]
↓
Coordinator merges results and returns top-K to client
Sharding: The document corpus is split into shards. Shard assignment: shard = hash(document_id) % num_shards. Each shard is a complete Lucene index. A query fans out to all shards in parallel, each returns its top-K results, the coordinator merges and re-ranks.
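The routing and scatter-gather flow can be sketched as follows. A toy model (integer doc_ids so modulo routing is deterministic; the gather step is serial here and ignores scoring, whereas a real coordinator fans out in parallel and re-ranks):

```python
NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]  # each shard: term -> posting list

def index(doc_id, terms):
    """Route a document to its shard, then update that shard's local index."""
    shard = shards[doc_id % NUM_SHARDS]       # shard = hash(document_id) % num_shards
    for term in terms:
        shard.setdefault(term, []).append(doc_id)

def query(term, k=10):
    """Scatter-gather: every shard returns its matches; the coordinator merges."""
    hits = [d for shard in shards for d in shard.get(term, [])]
    return sorted(hits)[:k]

for doc_id, terms in [(1, ["python"]), (4, ["python"]), (9, ["python", "design"])]:
    index(doc_id, terms)
print(query("python"))  # [1, 4, 9]
```

Note that writes touch exactly one shard, while every query touches all of them — which is why query latency is governed by the slowest shard.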
Replication: Each shard has a primary + N replicas. Writes go to the primary and replicate. Reads can be served by any replica (load balancing). Replication factor = 2 (3 copies total) is standard for fault tolerance — survives loss of any one node.
Relevance Scoring: BM25
BM25 (Best Match 25) is the standard relevance ranking algorithm, used by Elasticsearch by default. It extends TF-IDF with document length normalization:
BM25(q, d) = Σ IDF(t) * (TF(t,d) * (k1+1)) / (TF(t,d) + k1*(1-b+b*|d|/avgdl))
Where:
IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5) + 1) # rare terms score higher
TF(t,d) = frequency of term t in document d
|d| = length of document d in tokens
avgdl = average document length in corpus
k1=1.2, b=0.75 # tuning parameters
Key properties: rare terms (low document frequency) score higher (IDF). High term frequency helps but with diminishing returns (the denominator grows). Short documents score higher for the same term frequency (length normalization with b).
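The formula above translates directly into code. A minimal sketch of one term's contribution (parameter names mirror the definitions above), demonstrating the length-normalization property:

```python
import math

def bm25_term(tf, df, doc_len, avgdl, n_docs, k1=1.2, b=0.75):
    """BM25 contribution of a single term to a single document's score."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)   # rare terms score higher
    norm = k1 * (1 - b + b * doc_len / avgdl)              # document length penalty
    return idf * (tf * (k1 + 1)) / (tf + norm)

# Same term frequency, different document lengths:
short = bm25_term(tf=2, df=100, doc_len=50, avgdl=100, n_docs=10_000)
long_ = bm25_term(tf=2, df=100, doc_len=300, avgdl=100, n_docs=10_000)
print(short > long_)  # True: shorter documents score higher
```

The full score of a document is the sum of `bm25_term` over every query term it contains.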
Near-Real-Time Indexing
Elasticsearch’s default refresh_interval=1s: every second, buffered changes are written out as a new segment and become searchable. The write path: each document is appended to the translog (for durability) and added to the in-memory indexing buffer. The read path searches the open segments; documents still in the buffer are not visible until the next refresh — this is the "near" in near-real-time. The translog acts as a WAL (write-ahead log): after a crash, operations since the last commit are replayed from it. A full Lucene commit (fsync of segments to disk) happens far less often, triggered periodically or by an explicit flush — that is the expensive operation, not the 1-second refresh.
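A toy model of the refresh cycle makes the visibility window concrete. All names here are illustrative, not Elasticsearch APIs:

```python
class NRTIndex:
    """Toy near-real-time index: writes land in a buffer, become visible on refresh."""

    def __init__(self):
        self.buffer = {}      # term -> doc_ids, not yet searchable
        self.segments = []    # immutable refreshed segments
        self.translog = []    # WAL: replayed after a crash to rebuild the buffer

    def add(self, doc_id, terms):
        self.translog.append((doc_id, terms))   # durability before visibility
        for term in terms:
            self.buffer.setdefault(term, []).append(doc_id)

    def refresh(self):
        # The periodic (default 1s) refresh: freeze the buffer into a new segment.
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = {}

    def search(self, term):
        # Only refreshed segments are visible to searches.
        return sorted(d for seg in self.segments for d in seg.get(term, []))

idx = NRTIndex()
idx.add(1, ["python"])
print(idx.search("python"))  # [] (not visible until the next refresh)
idx.refresh()
print(idx.search("python"))  # [1]
```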
Query Types
- Term query: Exact match on a field. “status:active” → looks up posting list for “active” in the status field.
- Match query: Full-text. “python interview” → analyzes input, generates term queries, OR/AND of posting lists.
- Phrase query: “system design” as adjacent words. Uses position data in posting lists.
- Range query: “price: [10 TO 100]”. Uses a numeric index (doc values) not the inverted index.
- Fuzzy query: “phyton” matches “python” with edit distance ≤ 2. Uses BK-tree or automaton.
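The edit distance behind fuzzy matching is the classic Levenshtein dynamic program (production engines precompile a Levenshtein automaton instead of running this per candidate term, but the distance it computes is the same):

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = cur
    return prev[-1]

print(levenshtein("phyton", "python"))  # 2, so it matches with distance <= 2
```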
Interview Tips
- Explain the inverted index clearly — it’s the core data structure.
- Mention the sharding fan-out pattern: query → all shards in parallel → coordinator merges.
- BM25 vs TF-IDF: BM25 adds term-frequency saturation and document length normalization, and outperforms plain TF-IDF in practice. Know the length normalization intuition.
- Near-real-time vs real-time: explain the 1-second refresh window and when to use force refresh.
- Faceted search: use aggregations (bucket by field value, compute counts) run in parallel with the search query on the same shards.