System Design Interview: Recommendation Engine at Scale

Recommendation systems power product discovery at Netflix, Spotify, Amazon, and LinkedIn. Designing one at scale calls for a multi-stage architecture that balances relevance quality against the latency constraints of serving millions of users.

Requirements

Functional: Return N personalized item recommendations per user, support real-time signals (recent clicks), handle cold start for new users/items, support multiple contexts (homepage, item detail, email).

Non-functional: < 100ms p99 for recommendation serving, handle 10M DAU, update models daily (batch) and recommendations in near-real-time (stream), support A/B testing of recommendation algorithms.

Multi-Stage Recommendation Architecture

All items (millions)
    │
    ▼
[Candidate Generation]  ← retrieval: narrow to ~1000 candidates
    │  (ANN search, collaborative filtering, content rules)
    │
    ▼
[Pre-Ranking / Filtering]  ← remove seen, blocked, low-quality
    │
    ▼
[Ranking Model]  ← ML model scores each candidate (~50ms budget)
    │  (gradient boosted trees, deep learning ranker)
    │
    ▼
[Post-Processing]  ← diversity injection, business rules, A/B
    │
    ▼
Final N recommendations
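The funnel above can be sketched as a chain of functions. Everything here (function names, the hash-based stand-in scorer) is illustrative, not a real production API:

```python
# Hypothetical sketch of the four-stage funnel; each function is a
# stand-in for a real subsystem described in the sections below.

def candidate_generation(user_id, catalog):
    # Retrieval: narrow millions of items to ~1000 candidates.
    return catalog[:1000]

def pre_rank_filter(user_id, candidates, seen):
    # Drop items the user has already seen (plus blocked/low-quality).
    return [item for item in candidates if item not in seen]

def rank(user_id, candidates):
    # Stand-in scorer: a real system calls an ML ranker here.
    return sorted(candidates,
                  key=lambda item: hash((user_id, item)) % 1000,
                  reverse=True)

def post_process(ranked, n=10):
    # Diversity injection and business rules would go here.
    return ranked[:n]

catalog = [f"item_{i}" for i in range(5000)]
seen = {"item_0"}
recs = post_process(rank("u1", pre_rank_filter("u1",
           candidate_generation("u1", catalog), seen)))
print(len(recs))  # 10
```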

Candidate Generation: Collaborative Filtering

Matrix Factorization

User-Item interaction matrix R (sparse):
    Items →
U  [5, ?, 3, ?, 1]
s  [?, 4, ?, 5, ?]
e  [1, ?, ?, 3, 5]
r  ...
s
↓
Factorize: R ≈ U × Vᵀ
  U: user embeddings (n_users × k)
  V: item embeddings (n_items × k)
  k = 64-256 latent factors

Prediction: r̂_ij = u_i · v_j (dot product)

Training: minimize Σ(r_ij - u_i · v_j)² + λ(‖u_i‖² + ‖v_j‖²)
  via ALS (Alternating Least Squares) or SGD
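A toy ALS loop, as a sketch of the objective above. The data is synthetic and genuinely low-rank so the factorization has structure to recover; all sizes and the λ value are made up for illustration:

```python
import numpy as np

# Toy ALS matrix-factorization sketch: R ≈ U Vᵀ with k latent factors
# and L2 penalty lam. Sizes and lam are illustrative, not tuned.
rng = np.random.default_rng(0)
n_users, n_items, k, lam = 30, 40, 5, 0.1

# Synthetic low-rank "ratings", with ~30% of entries observed.
R = rng.standard_normal((n_users, k)) @ rng.standard_normal((k, n_items))
mask = rng.random((n_users, n_items)) < 0.3

U = rng.standard_normal((n_users, k))
V = rng.standard_normal((n_items, k))

for _ in range(10):  # alternate closed-form ridge solves for U and V
    for i in range(n_users):
        idx = np.where(mask[i])[0]                 # items user i rated
        A = V[idx].T @ V[idx] + lam * np.eye(k)
        U[i] = np.linalg.solve(A, V[idx].T @ R[i, idx])
    for j in range(n_items):
        idx = np.where(mask[:, j])[0]              # users who rated item j
        A = U[idx].T @ U[idx] + lam * np.eye(k)
        V[j] = np.linalg.solve(A, U[idx].T @ R[idx, j])

rmse = np.sqrt(np.mean((R[mask] - (U @ V.T)[mask]) ** 2))
print(round(float(rmse), 4))
```

Each inner solve is an independent ridge regression, which is why ALS parallelizes well across users and items in batch systems like Spark MLlib.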

Two-Tower Neural Model

User Tower                Item Tower
────────────              ──────────
user_id embedding         item_id embedding
+ user features           + item features
+ recent history          + content features
     │                         │
  Dense layers              Dense layers
     │                         │
  user vector (256d) ──── item vector (256d)
                    cosine similarity → score

Training: contrastive loss (positive pairs from interactions,
          negative sampling from random items)

Two-tower models allow pre-computing item embeddings offline. At serving time, only the user tower runs, and ANN search finds nearest item vectors in milliseconds.
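A serving-time sketch of this split, with a placeholder user tower and brute-force top-K standing in for a real ANN index (dimensions and names are illustrative):

```python
import numpy as np

# Serving-time sketch: item vectors are precomputed offline; only a
# (stand-in) user tower runs per request, then top-K retrieval.
rng = np.random.default_rng(1)
d, n_items = 32, 10_000

item_vecs = rng.standard_normal((n_items, d)).astype(np.float32)
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)  # unit norm

def user_tower(user_features):
    # Placeholder for a trained network; returns a unit-norm user vector.
    v = np.asarray(user_features, dtype=np.float32)
    return v / np.linalg.norm(v)

def retrieve_top_k(user_vec, k=10):
    scores = item_vecs @ user_vec            # cosine sim (unit-norm vectors)
    top = np.argpartition(-scores, k)[:k]    # unordered top-K in O(n)
    return top[np.argsort(-scores[top])]     # exact order within top-K

u = user_tower(rng.standard_normal(d))
print(retrieve_top_k(u).shape)  # (10,)
```

In production the brute-force matrix product is replaced by an ANN index over the same precomputed `item_vecs`, which is what the next section covers.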

Approximate Nearest Neighbor (ANN) Search

Problem: find top-K closest item vectors to user vector
         from 10M item embeddings in < 10ms

Algorithms:
  HNSW (Hierarchical Navigable Small World):
    - Multi-layer graph; traverse from top layer to bottom
    - ~95% recall at 10x speedup vs brute force
    - Memory: O(n × d × 4 bytes) = 10M × 256 × 4 = 10GB

  FAISS (Facebook AI Similarity Search):
    - Supports IVF (inverted file index): cluster items, search top clusters
    - IVFFlat: exact within clusters, approximate overall
    - GPU acceleration for batch serving

Services: Pinecone, Weaviate, Milvus, Elasticsearch kNN
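A toy IVF index in the spirit of the FAISS description above: cluster the items, then search only the top `nprobe` clusters exactly. This is an illustrative sketch, not the FAISS API:

```python
import numpy as np

# Toy IVF (inverted file) index: crude k-means over item vectors,
# then exact search within the nprobe nearest clusters only.
rng = np.random.default_rng(2)
d, n, n_clusters, nprobe = 16, 5000, 50, 5
items = rng.standard_normal((n, d)).astype(np.float32)

# A few Lloyd iterations from random centroids (stand-in for training).
centroids = items[rng.choice(n, n_clusters, replace=False)].copy()
for _ in range(5):
    assign = np.argmin(((items[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(n_clusters):
        members = items[assign == c]
        if len(members):
            centroids[c] = members.mean(0)

inverted = {c: np.where(assign == c)[0] for c in range(n_clusters)}

def ivf_search(q, k=10):
    near = np.argsort(((centroids - q) ** 2).sum(-1))[:nprobe]  # top clusters
    cand = np.concatenate([inverted[c] for c in near])
    dists = ((items[cand] - q) ** 2).sum(-1)     # exact within probed clusters
    return cand[np.argsort(dists)[:k]]

q = rng.standard_normal(d).astype(np.float32)
print(len(ivf_search(q)))  # 10
```

Searching 5 of 50 clusters scans roughly a tenth of the items, which is the speed/recall trade-off `nprobe` controls in real IVF indexes.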

Ranking Model

Input features per (user, item) pair:
  User features:  age, country, device, account_age, preferences
  Item features:  category, popularity, recency, avg_rating
  Interaction:    historical CTR for this user × category
  Context:        time_of_day, surface (homepage vs search)
  Cross features: user_category_affinity, item_user_overlap

Model: LightGBM (fast, interpretable) or
       Deep & Cross Network (DCN) for feature crosses

Output: P(click), P(purchase), P(watch_completion)
        → weighted combination = ranking score

Serving: pre-score candidates in batch at candidate generation time
         for returning users; re-score in real time when fresh
         personalization signals arrive
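The weighted combination of the predicted probabilities can be sketched as below; the weights are made up here, and in practice are tuned via A/B tests:

```python
# Toy multi-objective ranking score: weighted sum of the model's
# predicted probabilities. WEIGHTS values are illustrative only.
WEIGHTS = {"p_click": 1.0, "p_purchase": 5.0, "p_watch_completion": 2.0}

def ranking_score(preds):
    return sum(WEIGHTS[k] * preds[k] for k in WEIGHTS)

candidates = {
    "item_a": {"p_click": 0.30, "p_purchase": 0.02, "p_watch_completion": 0.60},
    "item_b": {"p_click": 0.10, "p_purchase": 0.08, "p_watch_completion": 0.20},
}
ranked = sorted(candidates, key=lambda i: -ranking_score(candidates[i]))
print(ranked)  # ['item_a', 'item_b']
```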

Cold Start Problem

New User Cold Start

  • Onboarding signals: ask users to rate seed items or select interests
  • Demographic-based: serve popular items in user’s country/age cohort
  • Session-based: update recommendations after first 2-3 interactions using session context model (RNN/Transformer over click sequence)

New Item Cold Start

  • Content-based: embed item using text/image features; find nearest existing items in embedding space
  • Exploration injection: insert new items in ranked list with boosted score for first N impressions to gather interaction data
  • Warm-up period: use content features for ranking; switch to collaborative features after 100+ interactions
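One way to implement the exploration injection and warm-up ideas above is a score bonus that decays with impressions; the function and its parameters are hypothetical:

```python
# Hypothetical exploration boost for new items: add a bonus that
# decays linearly with impressions, so fresh items get traffic until
# enough interaction data accumulates. Parameters are illustrative.
def boosted_score(base_score, impressions, boost=0.5, warmup=100):
    if impressions >= warmup:
        return base_score            # fully warmed up: model score only
    return base_score + boost * (1 - impressions / warmup)

print(boosted_score(0.4, 0))    # 0.9  (brand-new item, full boost)
print(boosted_score(0.4, 100))  # 0.4  (past warm-up, no boost)
```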

Real-Time Personalization

User clicks item → Kafka event
  → Feature pipeline: update user session features (Redis, TTL 30min)
  → Stream processor: rerank current recommendation slate
  → A/B framework: route to correct model variant
  → Cache invalidation: bust user's cached recommendations

Architecture:
  Flink consumer → user session store (Redis)
                 → online feature store (Feast/Tecton)
                 → trigger re-ranking if session changed significantly
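An in-memory stand-in for the Redis session store above, keeping recent events per user with a TTL (30 minutes in the text; shortened to fractions of a second here for the demo):

```python
import time

# Stand-in for a Redis session store: recent events keyed by user,
# expiring after a TTL. The class and its API are illustrative.
class SessionStore:
    def __init__(self, ttl_seconds=1.0):
        self.ttl = ttl_seconds
        self._data = {}

    def update(self, user_id, event):
        feats, _ = self._data.get(user_id, ([], None))
        feats = (feats + [event])[-20:]          # keep last 20 session events
        self._data[user_id] = (feats, time.monotonic())

    def get(self, user_id):
        feats, ts = self._data.get(user_id, ([], None))
        if ts is None or time.monotonic() - ts > self.ttl:
            return []                            # expired or missing session
        return feats

store = SessionStore(ttl_seconds=0.05)
store.update("u1", {"item": "x", "action": "click"})
print(store.get("u1"))
time.sleep(0.1)
print(store.get("u1"))  # [] after TTL expiry
```

In the real pipeline, Redis handles expiry natively (per-key TTL) and the Flink consumer performs the `update` calls as click events arrive.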

Diversity and Serendipity

Pure relevance optimization creates filter bubbles and repetitive recommendations. Post-processing injects diversity:

  • MMR (Maximal Marginal Relevance): iteratively select next item maximizing λ × relevance – (1-λ) × similarity to already selected items
  • Category caps: max 2 items per category in top-10
  • Freshness injection: reserve slots for items newer than 7 days
  • Exploration slots: ε-greedy: 5% of slots serve exploratory items to combat popularity bias
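The MMR selection from the first bullet can be sketched directly; the relevance scores and similarity matrix here are toy values chosen to show a near-duplicate being skipped:

```python
import numpy as np

# MMR re-ranking: trade off relevance against similarity to items
# already selected (lambda_ weights relevance vs redundancy).
def mmr(relevance, sim, k, lambda_=0.5):
    selected, remaining = [], list(range(len(relevance)))
    while remaining and len(selected) < k:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rel = np.array([0.9, 0.85, 0.3])
sim = np.array([[1.0, 0.95, 0.1],   # items 0 and 1 are near-duplicates
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(mmr(rel, sim, k=2))  # [0, 2]: item 2 beats the redundant item 1
```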

Offline Evaluation vs Online Metrics

Metric                         Stage        Measures
─────────────────────────────────────────────────────────────────────
Recall@K, NDCG@K               Offline      Ranking quality on held-out interactions
AUC-ROC                        Offline      Discriminative power of the ranker
CTR, CVR                       Online A/B   User engagement
Session length, D7 retention   Online A/B   Long-term user satisfaction
Catalog coverage               Online       Diversity of items surfaced

Logged-policy bias: offline metrics evaluate only items users were actually shown by the deployed (logging) policy, so they say little about items a new ranker would surface. Use inverse propensity scoring or counterfactual evaluation to correct for this.
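The offline metrics in the table can be computed per user as below (binary relevance, single user, for illustration):

```python
import math

# Offline metric sketches: Recall@K and NDCG@K over one user's
# held-out interactions, with binary relevance labels.
def recall_at_k(ranked, relevant, k):
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    dcg = sum(1 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1 / math.log2(pos + 2) for pos in range(min(len(relevant), k)))
    return dcg / ideal

ranked = ["a", "b", "c", "d"]          # model's ranking for this user
relevant = {"a", "c"}                  # held-out positive interactions
print(recall_at_k(ranked, relevant, 3))           # 1.0 (both hits in top-3)
print(round(ndcg_at_k(ranked, relevant, 3), 3))   # 0.92
```

System-level numbers average these per-user values over the evaluation set.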

System Design Summary

  • Scale: Two-tower model for retrieval (ANN over 10M item embeddings), LightGBM/DCN for ranking ~1000 candidates
  • Latency: ANN search < 10ms, ranking < 50ms, post-processing < 10ms ≈ 70ms end to end, inside the 100ms p99 target
  • Freshness: Flink pipeline for real-time user session features; model retrained daily
  • Infrastructure: Kafka → Flink → Feast (feature store) → Serving layer (TF Serving / Triton)

