System Design: Vector Database — Embeddings Storage, ANN Search, and Semantic Retrieval

What is a Vector Database?

A vector database stores high-dimensional numerical vectors (embeddings) and supports efficient similarity search: “find the K vectors most similar to this query vector.” Embeddings are produced by ML models — a sentence embedding might be 1536 dimensions (OpenAI text-embedding-3-small), an image embedding 2048 dimensions (ResNet). Vector databases power semantic search (find documents by meaning, not keywords), recommendation systems (find items similar to what a user liked), RAG (Retrieval-Augmented Generation for LLM applications), and duplicate detection. Systems: Pinecone, Weaviate, Qdrant, Milvus, pgvector (PostgreSQL extension). This is increasingly asked at AI-focused companies (OpenAI, Cohere, Anthropic, Databricks) and any company building AI features.
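To make "find the K vectors most similar to this query vector" concrete, here is a minimal brute-force cosine-similarity search in NumPy. It is purely illustrative (random toy data, hypothetical function name), and the next section explains why real systems replace this linear scan with ANN indexes.

import numpy as np

def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k stored vectors most similar to the query (cosine similarity)."""
    # Normalize both sides so that a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                         # one similarity score per stored vector: O(N * D)
    return np.argsort(-scores)[:k]         # indices of the k highest-scoring vectors

# Toy corpus: 1,000 embeddings of dimension 1536 (the text-embedding-3-small size).
store = np.random.rand(1000, 1536).astype(np.float32)
query = np.random.rand(1536).astype(np.float32)
print(top_k_cosine(query, store, k=5))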

Exact Search vs. Approximate Nearest Neighbor (ANN)

Exact nearest neighbor (brute force): compute the cosine similarity or L2 distance between the query vector and every stored vector. This costs O(N * D), where N = number of vectors and D = dimensions. For 100M vectors at 1536 dimensions: 100M * 1536 = 153 billion multiplications per query. At ~1ns per operation, that is roughly 153 seconds per query, which is unacceptable. ANN algorithms trade a small accuracy loss for orders-of-magnitude speedup. Three main approaches:

  • HNSW (Hierarchical Navigable Small World graphs): builds a multi-layer graph where each node connects to nearby nodes. Search traverses from a coarse upper layer to a fine lower layer, following edges greedily toward the query. O(log N) average query time and 99%+ recall at typical settings. The gold standard for production ANN.
  • IVF (Inverted File Index): partition vectors into K clusters (k-means). At query time, search only the C closest clusters (C << K), reducing comparisons from N to roughly N/K * C. Less accurate than HNSW but simpler and more memory-efficient.
  • LSH (Locality-Sensitive Hashing): hash similar vectors to the same bucket with high probability. Very fast but lower recall than HNSW/IVF; rarely used in production today.

HNSW performance: 100M vectors at 1536 dimensions with ef_search=100 gives roughly 5ms per query at 98% recall on a single machine. Index build time is O(N log N), which takes hours for 100M vectors.
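As a concrete illustration of HNSW in practice, here is a minimal sketch using the open-source hnswlib library on random toy data; the values of M, ef_construction, and ef are illustrative defaults, not tuned recommendations.

import numpy as np
import hnswlib

dim, num_vectors = 1536, 10_000                    # toy scale; production indexes hold millions
vectors = np.random.rand(num_vectors, dim).astype(np.float32)
ids = np.arange(num_vectors)

# Build the HNSW graph. M = edges per node, ef_construction = build-time beam width.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, M=16, ef_construction=200)
index.add_items(vectors, ids)

# ef = search-time beam width: higher ef means better recall but higher latency.
index.set_ef(100)

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)   # approximate top-10 neighbors
print(labels[0], distances[0])

The set_ef call is the recall-vs-latency knob referenced throughout this page: a larger ef searches more of the graph per query.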

Architecture

Data model: each vector has an id, a vector (float32 array), and optional metadata ({title, author, date, category}). Metadata is used for filtering (e.g., search only within category='finance').

Write path: the client sends (id, vector, metadata) to the API. The API validates the dimension (it must match the index configuration). The vector is written to the HNSW index (updating the graph) and to the metadata store (PostgreSQL or DynamoDB) for filter queries. HNSW inserts are O(log N) but require updating graph edges, which makes them slower than a simple append-only store. Typical throughput: 1K-10K inserts/second on a single node.

Read path: the client sends a query vector, k, and an optional metadata filter. Pre-filtering: retrieve the set of IDs matching the filter from the metadata store, then search only within that subset. Post-filtering: run the ANN search across all vectors, then filter the top-k results by metadata. Pre-filtering is more accurate (it returns the true top-k within the filtered subset) but slower, because the ANN index cannot exploit its full structure when restricted to an arbitrary subset; post-filtering is faster but may return fewer than k results if many candidates are filtered out. A sketch of both strategies follows this section.

Sharding: partition vectors across shards (by id range or consistent hashing). Each shard maintains its own HNSW index. A query is broadcast to all shards and the results are merged (top-k from each shard → global top-k).

Replication: each shard has multiple replicas. Reads are load-balanced across replicas; writes go to all replicas.
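Here is a minimal sketch of the pre-filter vs. post-filter tradeoff, using a brute-force cosine scorer as a stand-in for the ANN index; the in-memory vectors, metadata list, and helper functions are illustrative, not a real vector-database API.

import numpy as np

# Illustrative in-memory "database": a matrix of vectors plus a parallel list of metadata dicts.
vectors = np.random.rand(10_000, 128).astype(np.float32)
metadata = [{"category": "finance" if i % 4 == 0 else "other"} for i in range(len(vectors))]

def score(query: np.ndarray, subset: np.ndarray) -> np.ndarray:
    """Cosine similarity of the query against a set of vectors (stand-in for an ANN search)."""
    q = query / np.linalg.norm(query)
    s = subset / np.linalg.norm(subset, axis=1, keepdims=True)
    return s @ q

def pre_filter_search(query, k, category):
    # 1) Fetch the IDs matching the metadata filter, 2) search only within that subset.
    ids = np.array([i for i, m in enumerate(metadata) if m["category"] == category])
    scores = score(query, vectors[ids])
    return ids[np.argsort(-scores)[:k]]            # true top-k within the filtered subset

def post_filter_search(query, k, category, overfetch=4):
    # 1) Search across all vectors (fetching extra candidates), 2) drop non-matching results.
    scores = score(query, vectors)
    candidates = np.argsort(-scores)[:k * overfetch]
    kept = [int(i) for i in candidates if metadata[i]["category"] == category]
    return kept[:k]                                 # may return fewer than k results

query = np.random.rand(128).astype(np.float32)
print(pre_filter_search(query, 5, "finance"))
print(post_filter_search(query, 5, "finance"))

The overfetch factor is the usual mitigation for post-filtering returning fewer than k results: fetch more candidates than needed and keep the ones that survive the filter.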

Quantization and Compression

100M vectors * 1536 dimensions * 4 bytes (float32) = 614GB of raw vector data, too large for RAM on a single machine. Quantization reduces memory:

  • Product Quantization (PQ): divide each vector into M sub-vectors and quantize each to one of 256 centroids (1 byte per sub-vector). A 1536-dim float32 vector (6KB) compresses to, e.g., 192 bytes (32x compression) with modest accuracy loss.
  • Scalar Quantization (SQ): quantize each float32 to int8. 4x compression (1536 * 1 byte = 1.5KB vs 6KB), with fast hardware support (int8 multiply-accumulate).
  • Binary quantization: each float → 1 bit. 32x compression with significant accuracy loss; used together with re-ranking.

Re-ranking: first pass with compressed vectors (fast, approximate), then re-score the top-2K candidates with the original float32 vectors (slower, accurate). Returns accurate top-K results at a 10-20x lower memory footprint. IVF + PQ (IVFPQ) combines IVF partitioning with PQ quantization and is the most memory-efficient production configuration for very large indexes; a sketch follows below.
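As a sketch of the IVFPQ configuration, the snippet below builds an IVF-PQ index with the open-source FAISS library; the nlist, m, and nprobe values are illustrative rather than tuned recommendations.

import numpy as np
import faiss

d = 1536                  # embedding dimension
nlist = 1024              # number of IVF clusters (coarse partitions)
m = 96                    # PQ sub-vectors: d must be divisible by m -> 96 bytes per vector
xb = np.random.rand(100_000, d).astype(np.float32)    # toy corpus to index
xq = np.random.rand(10, d).astype(np.float32)         # toy query batch

# The coarse quantizer assigns each vector to a cluster; IVFPQ stores PQ codes per cluster.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-vector code

index.train(xb)           # learn the k-means centroids and PQ codebooks
index.add(xb)

index.nprobe = 16         # clusters scanned per query: the recall-vs-latency knob for IVF
distances, ids = index.search(xq, 10)                 # approximate top-10 per query
print(ids[0])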

Interview Tips

  • HNSW is the standard algorithm to know — explain multi-layer graph, O(log N) search, and recall vs. latency tradeoff.
  • Distinguish exact nearest neighbor search (prohibitively slow at scale) from ANN (practical with 98%+ recall).
  • Quantization (PQ, SQ) is how you fit 100M vectors in RAM — mention it when scale is discussed.
  • Pre-filter vs post-filter is a key design tradeoff for metadata-filtered search.

See also: Databricks Interview Prep

See also: Meta Interview Prep

See also: Atlassian Interview Prep
