Recommendation systems power product discovery at Netflix, Spotify, Amazon, and LinkedIn. Designing one at scale means building a multi-stage architecture that balances relevance quality against strict latency budgets while serving millions of users.
Requirements
Functional: Return N personalized item recommendations per user, support real-time signals (recent clicks), handle cold start for new users/items, support multiple contexts (homepage, item detail, email).
Non-functional: < 100ms p99 for recommendation serving, handle 10M DAU, update models daily (batch) and recommendations in near-real-time (stream), support A/B testing of recommendation algorithms.
Multi-Stage Recommendation Architecture
All items (millions)
│
▼
[Candidate Generation] ← retrieval: narrow to ~1000 candidates
│ (ANN search, collaborative filtering, content rules)
│
▼
[Pre-Ranking / Filtering] ← remove seen, blocked, low-quality
│
▼
[Ranking Model] ← ML model scores each candidate (~50ms budget)
│ (gradient boosted trees, deep learning ranker)
│
▼
[Post-Processing] ← diversity injection, business rules, A/B
│
▼
Final N recommendations
Candidate Generation: Collaborative Filtering
Matrix Factorization
User-Item interaction matrix R (sparse):
Items →
U [5, ?, 3, ?, 1]
s [?, 4, ?, 5, ?]
e [1, ?, ?, 3, 5]
r ...
s
↓
Factorize: R ≈ U × Vᵀ
U: user embeddings (n_users × k)
V: item embeddings (n_items × k)
k = 64-256 latent factors
Prediction: r̂_ui = u_u · v_i (dot product of user and item embeddings)
Training: minimize Σ(r_ui - u_u · v_i)² + λ(‖u_u‖² + ‖v_i‖²)
via ALS (Alternating Least Squares) or SGD
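A minimal SGD sketch of this objective (the toy ratings, k, and hyperparameters below are made up for illustration; production systems typically use Spark ALS or implicit-feedback variants):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 5, 8
lam, lr, epochs = 0.05, 0.05, 200

# Observed (user, item, rating) triples from the sparse matrix R.
ratings = [(0, 0, 5.0), (0, 2, 3.0), (0, 4, 1.0),
           (1, 1, 4.0), (1, 3, 5.0),
           (2, 0, 1.0), (2, 3, 3.0), (2, 4, 5.0)]

U = rng.normal(scale=0.1, size=(n_users, k))   # user embeddings
V = rng.normal(scale=0.1, size=(n_items, k))   # item embeddings

for _ in range(epochs):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]                   # prediction error
        U[u] += lr * (err * V[i] - lam * U[u])  # gradient step + L2
        V[i] += lr * (err * U[u] - lam * V[i])

# Predicted score for a (user, item) pair after training:
print(round(float(U[0] @ V[0]), 2))
```

ALS instead alternates closed-form least-squares solves for U (holding V fixed) and V (holding U fixed), which parallelizes well over users and items.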
Two-Tower Neural Model
User Tower Item Tower
──────────── ──────────
user_id embedding item_id embedding
+ user features + item features
+ recent history + content features
│ │
Dense layers Dense layers
│ │
user vector (256d) ──── item vector (256d)
cosine similarity → score
Training: contrastive loss (positive pairs from interactions,
negative sampling from random items)
Two-tower models allow pre-computing item embeddings offline. At serving time, only the user tower runs, and ANN search finds nearest item vectors in milliseconds.
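A serving-time sketch of that pattern, with a single random linear layer standing in for trained tower parameters and brute-force dot products standing in for ANN search:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_emb, n_items, K = 16, 32, 1000, 5

W_user = rng.normal(size=(d_in, d_emb))        # placeholder "trained" user tower
item_vecs = rng.normal(size=(n_items, d_emb))  # precomputed offline by the item tower
item_vecs /= np.linalg.norm(item_vecs, axis=1, keepdims=True)

def user_tower(features: np.ndarray) -> np.ndarray:
    v = features @ W_user                # real towers stack several dense layers
    return v / np.linalg.norm(v)         # L2-normalize so dot product = cosine

u = user_tower(rng.normal(size=d_in))    # only this runs per request
scores = item_vecs @ u                   # cosine similarity to every item
top_k = np.argsort(-scores)[:K]          # candidate set passed to the ranker
print(top_k)
```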
Approximate Nearest Neighbor (ANN) Search
Problem: find top-K closest item vectors to user vector
from 10M item embeddings in < 10ms
Algorithms:
HNSW (Hierarchical Navigable Small World):
- Multi-layer graph; traverse from top layer to bottom
- ~95% recall at 10x speedup vs brute force
- Memory: raw vectors alone need n × d × 4 bytes = 10M × 256 × 4 ≈ 10GB (float32); graph links add overhead on top
FAISS (Facebook AI Similarity Search):
- Supports IVF (inverted file index): cluster items, search top clusters
- IVFFlat: exact within clusters, approximate overall
- GPU acceleration for batch serving
Services: Pinecone, Weaviate, Milvus, Elasticsearch kNN
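The IVF idea reduces to a few lines: cluster the item vectors, then scan only the closest few clusters per query. This toy version uses random data and a single crude assignment pass in place of real k-means, purely to illustrate the mechanism and the recall trade-off:

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, d, n_clusters, n_probe, K = 5000, 32, 50, 5, 10

items = rng.normal(size=(n_items, d)).astype(np.float32)

# Crude clustering: sampled centroids + one assignment pass (illustration only).
centroids = items[rng.choice(n_items, n_clusters, replace=False)]
assign = np.argmin(((items[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
inverted = {c: np.where(assign == c)[0] for c in range(n_clusters)}  # inverted file

def ivf_search(q, k=K):
    near = np.argsort(((centroids - q) ** 2).sum(-1))[:n_probe]  # top clusters only
    cand = np.concatenate([inverted[c] for c in near])
    dists = ((items[cand] - q) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]

q = rng.normal(size=d).astype(np.float32)
approx = set(ivf_search(q))
exact = set(np.argsort(((items - q) ** 2).sum(-1))[:K])
print(len(approx & exact) / K)   # recall of the approximate search
```

Raising `n_probe` trades speed for recall, which is exactly the knob FAISS exposes for IVF indexes.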
Ranking Model
Input features per (user, item) pair:
User features: age, country, device, account_age, preferences
Item features: category, popularity, recency, avg_rating
Interaction: historical CTR for this user × category
Context: time_of_day, surface (homepage vs search)
Cross features: user_category_affinity, item_user_overlap
Model: LightGBM (fast, interpretable) or
Deep & Cross Network (DCN) for feature crosses
Output: P(click), P(purchase), P(watch_completion)
→ weighted combination = ranking score
Serving: pre-score at candidate generation time (batch for
returning users), real-time for personalization signals
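The multi-objective outputs can be blended into the single ranking score; the business weights below are hypothetical, not values from any real system:

```python
# Hypothetical weights expressing that a purchase is worth more than a click.
WEIGHTS = {"p_click": 1.0, "p_purchase": 5.0, "p_watch_completion": 2.0}

def ranking_score(preds: dict) -> float:
    """Weighted combination of the ranker's calibrated probabilities."""
    return sum(WEIGHTS[k] * preds[k] for k in WEIGHTS)

candidates = [
    {"item": "a", "p_click": 0.10, "p_purchase": 0.02, "p_watch_completion": 0.30},
    {"item": "b", "p_click": 0.20, "p_purchase": 0.01, "p_watch_completion": 0.10},
]
ranked = sorted(candidates, key=ranking_score, reverse=True)
print([c["item"] for c in ranked])   # → ['a', 'b']
```

Item "a" wins despite the lower click probability because the purchase and completion heads carry more weight.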
Cold Start Problem
New User Cold Start
- Onboarding signals: ask users to rate seed items or select interests
- Demographic-based: serve popular items in user’s country/age cohort
- Session-based: update recommendations after first 2-3 interactions using session context model (RNN/Transformer over click sequence)
New Item Cold Start
- Content-based: embed item using text/image features; find nearest existing items in embedding space
- Exploration injection: insert new items in ranked list with boosted score for first N impressions to gather interaction data
- Warm-up period: use content features for ranking; switch to collaborative features after 100+ interactions
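The content-based fallback for a new item might look like this sketch, with random vectors standing in for real text/image encoder outputs and a made-up CTR table standing in for historical stats:

```python
import numpy as np

rng = np.random.default_rng(3)
catalog = rng.normal(size=(100, 64))                      # existing item embeddings
catalog /= np.linalg.norm(catalog, axis=1, keepdims=True)
catalog_ctr = rng.uniform(0.01, 0.2, size=100)            # their historical CTRs

new_item = rng.normal(size=64)                            # content-only embedding
new_item /= np.linalg.norm(new_item)

neighbors = np.argsort(-(catalog @ new_item))[:5]         # 5 most similar items
prior_ctr = float(catalog_ctr[neighbors].mean())          # warm-start estimate
print(round(prior_ctr, 3))
```

The neighbor-averaged CTR serves as a prior for the ranker until the item accumulates enough interactions of its own.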
Real-Time Personalization
User clicks item → Kafka event
→ Feature pipeline: update user session features (Redis, TTL 30min)
→ Stream processor: rerank current recommendation slate
→ A/B framework: route to correct model variant
→ Cache invalidation: evict the user's cached recommendations
Architecture:
Flink consumer → user session store (Redis)
→ online feature store (Feast/Tecton)
→ trigger re-ranking if session changed significantly
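An in-process stand-in for the Redis session store sketches the TTL behavior; the 30-minute TTL comes from the pipeline above, while the 50-item window cap is an assumption for illustration:

```python
import time

TTL_SECONDS = 30 * 60   # session TTL, matching the Redis TTL above
MAX_ITEMS = 50          # assumed cap on the rolling click window

class SessionStore:
    def __init__(self):
        self._sessions = {}   # user_id -> (last_update_ts, [item_ids])

    def record_click(self, user_id, item_id, now=None):
        now = time.time() if now is None else now
        ts, items = self._sessions.get(user_id, (now, []))
        if now - ts > TTL_SECONDS:            # stale session → start fresh
            items = []
        self._sessions[user_id] = (now, (items + [item_id])[-MAX_ITEMS:])

    def recent_items(self, user_id, now=None):
        now = time.time() if now is None else now
        entry = self._sessions.get(user_id)
        if entry is None or now - entry[0] > TTL_SECONDS:
            return []
        return entry[1]

store = SessionStore()
store.record_click("u1", "item_42")
store.record_click("u1", "item_7")
print(store.recent_items("u1"))   # → ['item_42', 'item_7']
```

In production, Redis gives the same semantics via a list per user plus `EXPIRE`, shared across serving replicas.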
Diversity and Serendipity
Pure relevance optimization creates filter bubbles and repetitive recommendations. Post-processing injects diversity:
- MMR (Maximal Marginal Relevance): iteratively select next item maximizing λ × relevance – (1-λ) × similarity to already selected items
- Category caps: max 2 items per category in top-10
- Freshness injection: reserve slots for items newer than 7 days
- Exploration slots: ε-greedy: 5% of slots serve exploratory items to combat popularity bias
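MMR itself is a short greedy loop. This sketch assumes unit-normalized item vectors so a dot product serves as the similarity term:

```python
import numpy as np

def mmr(relevance, item_vecs, k, lam=0.7):
    """Greedily pick items maximizing lam*relevance - (1-lam)*max_sim_to_selected."""
    selected, remaining = [], list(range(len(relevance)))
    while remaining and len(selected) < k:
        def score(i):
            if not selected:                    # first pick: pure relevance
                return lam * relevance[i]
            max_sim = max(float(item_vecs[i] @ item_vecs[j]) for j in selected)
            return lam * relevance[i] - (1 - lam) * max_sim
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(4)
vecs = rng.normal(size=(20, 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
rel = rng.uniform(size=20)
print(mmr(rel, vecs, k=5))
```

With λ = 1 this degenerates to pure relevance ranking; lowering λ penalizes items too similar to what is already on the slate.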
Offline Evaluation vs Online Metrics
| Metric | Stage | Measures |
|---|---|---|
| Recall@K, NDCG@K | Offline | Ranking quality on held-out interactions |
| AUC-ROC | Offline | Discriminative power of ranker |
| CTR, CVR | Online A/B | User engagement |
| Session length, D7 retention | Online A/B | Long-term user satisfaction |
| Catalog coverage | Online | Diversity of items surfaced |
Selection bias: offline metrics only evaluate items the logging policy actually showed to users, not the full catalog. Use inverse propensity scoring or counterfactual evaluation to correct for this.
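The offline metrics in the table reduce to a few lines each; a sketch for a single user's ranked list (binary relevance, toy data):

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of held-out relevant items recovered in the top-k."""
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the best achievable DCG."""
    dcg = sum(1.0 / np.log2(i + 2) for i, item in enumerate(ranked[:k])
              if item in relevant)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal

ranked = ["a", "b", "c", "d", "e"]     # model's ranked output
relevant = ["b", "e", "f"]             # held-out interactions
print(recall_at_k(ranked, relevant, 3))   # "b" is the only top-3 hit → 1/3
print(round(ndcg_at_k(ranked, relevant, 5), 3))
```

In practice these are averaged over all test users, and "f" never appearing in the ranked list is exactly the kind of miss Recall@K penalizes.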
System Design Summary
- Scale: Two-tower model for retrieval (ANN over 10M item embeddings), LightGBM/DCN for ranking ~1000 candidates
- Latency: ANN search < 10ms, ranking < 50ms, post-processing < 10ms ≈ 70ms total, leaving headroom under the 100ms p99 target
- Freshness: Flink pipeline for real-time user session features; model retrained daily
- Infrastructure: Kafka → Flink → Feast (feature store) → Serving layer (TF Serving / Triton)