System Design: Vector Database — Embeddings Storage, ANN Search, and Semantic Retrieval

What is a Vector Database?

A vector database stores high-dimensional numerical vectors (embeddings) and supports efficient similarity search: “find the K vectors most similar to this query vector.” Embeddings are produced by ML models — a sentence embedding might be 1536 dimensions (OpenAI text-embedding-3-small), an image embedding 2048 dimensions (ResNet). Vector databases power semantic search (find documents by meaning, not keywords), recommendation systems (find items similar to what a user liked), RAG (Retrieval-Augmented Generation for LLM applications), and duplicate detection. Systems: Pinecone, Weaviate, Qdrant, Milvus, pgvector (PostgreSQL extension). This is increasingly asked at AI-focused companies (OpenAI, Cohere, Anthropic, Databricks) and any company building AI features.
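To make "find the K vectors most similar to this query vector" concrete, here is a minimal brute-force cosine-similarity search in NumPy. It is purely illustrative (random toy data, hypothetical function name), and the next section explains why real systems replace this linear scan with ANN indexes.

import numpy as np

def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k stored vectors most similar to the query (cosine similarity)."""
    # Normalize both sides so that a plain dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                         # one similarity score per stored vector: O(N * D)
    return np.argsort(-scores)[:k]         # indices of the k highest-scoring vectors

# Toy corpus: 1,000 embeddings of dimension 1536 (the text-embedding-3-small size).
store = np.random.rand(1000, 1536).astype(np.float32)
query = np.random.rand(1536).astype(np.float32)
print(top_k_cosine(query, store, k=5))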

Exact Search vs. Approximate Nearest Neighbor (ANN)

Exact nearest neighbor (brute force): compute the cosine similarity or L2 distance between the query vector and every stored vector. This costs O(N * D), where N = number of vectors and D = dimensions. For 100M vectors at 1536 dimensions: 100M * 1536 = 153 billion multiplications per query. At ~1ns per operation, that is roughly 153 seconds per query, which is unacceptable. ANN algorithms trade a small accuracy loss for orders-of-magnitude speedup. Three main approaches:

  • HNSW (Hierarchical Navigable Small World graphs): builds a multi-layer graph where each node connects to nearby nodes. Search traverses from a coarse upper layer to a fine lower layer, following edges greedily toward the query. O(log N) average query time and 99%+ recall at typical settings. The gold standard for production ANN.
  • IVF (Inverted File Index): partition vectors into K clusters (k-means). At query time, search only the C closest clusters (C << K), reducing comparisons from N to roughly N/K * C. Less accurate than HNSW but simpler and more memory-efficient.
  • LSH (Locality-Sensitive Hashing): hash similar vectors to the same bucket with high probability. Very fast but lower recall than HNSW/IVF; rarely used in production today.

HNSW performance: 100M vectors at 1536 dimensions with ef_search=100 gives roughly 5ms per query at 98% recall on a single machine. Index build time is O(N log N), which takes hours for 100M vectors.
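As a concrete illustration of HNSW in practice, here is a minimal sketch using the open-source hnswlib library on random toy data; the values of M, ef_construction, and ef are illustrative defaults, not tuned recommendations.

import numpy as np
import hnswlib

dim, num_vectors = 1536, 10_000                    # toy scale; production indexes hold millions
vectors = np.random.rand(num_vectors, dim).astype(np.float32)
ids = np.arange(num_vectors)

# Build the HNSW graph. M = edges per node, ef_construction = build-time beam width.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_vectors, M=16, ef_construction=200)
index.add_items(vectors, ids)

# ef = search-time beam width: higher ef means better recall but higher latency.
index.set_ef(100)

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=10)   # approximate top-10 neighbors
print(labels[0], distances[0])

The set_ef call is the recall-vs-latency knob referenced throughout this page: a larger ef searches more of the graph per query.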

Architecture

Data model: each vector has an id, a vector (float32 array), and optional metadata ({title, author, date, category}). Metadata is used for filtering (e.g., search only within category='finance').

Write path: the client sends (id, vector, metadata) to the API. The API validates the dimension (it must match the index configuration). The vector is written to the HNSW index (updating the graph) and to the metadata store (PostgreSQL or DynamoDB) for filter queries. HNSW inserts are O(log N) but require updating graph edges, which makes them slower than a simple append-only store. Typical throughput: 1K-10K inserts/second on a single node.

Read path: the client sends a query vector, k, and an optional metadata filter. Pre-filtering: retrieve the set of IDs matching the filter from the metadata store, then search only within that subset. Post-filtering: run the ANN search across all vectors, then filter the top-k results by metadata. Pre-filtering is more accurate (it returns the true top-k within the filtered subset) but slower, because the ANN index cannot exploit its full structure when restricted to an arbitrary subset; post-filtering is faster but may return fewer than k results if many candidates are filtered out. A sketch of both strategies follows this section.

Sharding: partition vectors across shards (by id range or consistent hashing). Each shard maintains its own HNSW index. A query is broadcast to all shards and the results are merged (top-k from each shard → global top-k).

Replication: each shard has multiple replicas. Reads are load-balanced across replicas; writes go to all replicas.
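Here is a minimal sketch of the pre-filter vs. post-filter tradeoff, using a brute-force cosine scorer as a stand-in for the ANN index; the in-memory vectors, metadata list, and helper functions are illustrative, not a real vector-database API.

import numpy as np

# Illustrative in-memory "database": a matrix of vectors plus a parallel list of metadata dicts.
vectors = np.random.rand(10_000, 128).astype(np.float32)
metadata = [{"category": "finance" if i % 4 == 0 else "other"} for i in range(len(vectors))]

def score(query: np.ndarray, subset: np.ndarray) -> np.ndarray:
    """Cosine similarity of the query against a set of vectors (stand-in for an ANN search)."""
    q = query / np.linalg.norm(query)
    s = subset / np.linalg.norm(subset, axis=1, keepdims=True)
    return s @ q

def pre_filter_search(query, k, category):
    # 1) Fetch the IDs matching the metadata filter, 2) search only within that subset.
    ids = np.array([i for i, m in enumerate(metadata) if m["category"] == category])
    scores = score(query, vectors[ids])
    return ids[np.argsort(-scores)[:k]]            # true top-k within the filtered subset

def post_filter_search(query, k, category, overfetch=4):
    # 1) Search across all vectors (fetching extra candidates), 2) drop non-matching results.
    scores = score(query, vectors)
    candidates = np.argsort(-scores)[:k * overfetch]
    kept = [int(i) for i in candidates if metadata[i]["category"] == category]
    return kept[:k]                                 # may return fewer than k results

query = np.random.rand(128).astype(np.float32)
print(pre_filter_search(query, 5, "finance"))
print(post_filter_search(query, 5, "finance"))

The overfetch factor is the usual mitigation for post-filtering returning fewer than k results: fetch more candidates than needed and keep the ones that survive the filter.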

Quantization and Compression

100M vectors * 1536 dimensions * 4 bytes (float32) = 614GB of raw vector data, too large for RAM on a single machine. Quantization reduces memory:

  • Product Quantization (PQ): divide each vector into M sub-vectors and quantize each to one of 256 centroids (1 byte per sub-vector). A 1536-dim float32 vector (6KB) compresses to, e.g., 192 bytes (32x compression) with modest accuracy loss.
  • Scalar Quantization (SQ): quantize each float32 to int8. 4x compression (1536 * 1 byte = 1.5KB vs 6KB), with fast hardware support (int8 multiply-accumulate).
  • Binary quantization: each float → 1 bit. 32x compression with significant accuracy loss; used together with re-ranking.

Re-ranking: first pass with compressed vectors (fast, approximate), then re-score the top-2K candidates with the original float32 vectors (slower, accurate). Returns accurate top-K results at a 10-20x lower memory footprint. IVF + PQ (IVFPQ) combines IVF partitioning with PQ quantization and is the most memory-efficient production configuration for very large indexes; a sketch follows below.
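As a sketch of the IVFPQ configuration, the snippet below builds an IVF-PQ index with the open-source FAISS library; the nlist, m, and nprobe values are illustrative rather than tuned recommendations.

import numpy as np
import faiss

d = 1536                  # embedding dimension
nlist = 1024              # number of IVF clusters (coarse partitions)
m = 96                    # PQ sub-vectors: d must be divisible by m -> 96 bytes per vector
xb = np.random.rand(100_000, d).astype(np.float32)    # toy corpus to index
xq = np.random.rand(10, d).astype(np.float32)         # toy query batch

# The coarse quantizer assigns each vector to a cluster; IVFPQ stores PQ codes per cluster.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-vector code

index.train(xb)           # learn the k-means centroids and PQ codebooks
index.add(xb)

index.nprobe = 16         # clusters scanned per query: the recall-vs-latency knob for IVF
distances, ids = index.search(xq, 10)                 # approximate top-10 per query
print(ids[0])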

Interview Tips

  • HNSW is the standard algorithm to know — explain multi-layer graph, O(log N) search, and recall vs. latency tradeoff.
  • Distinguish exact nearest neighbor search (prohibitively slow at scale) from ANN (practical with 98%+ recall).
  • Quantization (PQ, SQ) is how you fit 100M vectors in RAM — mention it when scale is discussed.
  • Pre-filter vs post-filter is a key design tradeoff for metadata-filtered search.

See also: Databricks Interview Prep

See also: Meta Interview Prep

See also: Atlassian Interview Prep
