As machine learning moves from research to production, companies need platforms that manage the full ML lifecycle: data ingestion, feature engineering, training, evaluation, deployment, and monitoring. Designing an ML platform is a key interview topic at companies like Meta, Google, Uber, and any organization running models at scale.
The ML Lifecycle
Data → Features → Training → Evaluation → Deployment → Monitoring
  ↑                                                          │
  └────────────── feedback loop (retraining signals) ────────┘
Feature Store: The Central Hub
A feature store solves the training-serving skew problem — the same feature computation logic must produce identical results during training (offline) and serving (online).
Feature Store Architecture:
Offline Store (batch, historical)
├── Source: Hive, Spark, dbt pipeline
├── Storage: S3 + Parquet (point-in-time correct joins)
└── Use: model training, offline evaluation
Online Store (low-latency, real-time)
├── Source: Kafka stream → Flink materialization
├── Storage: Redis / DynamoDB (< 5ms p99 reads)
└── Use: model serving (inference time feature lookup)
Feature Registry
└── Metadata: name, type, owner, freshness SLA, lineage
Point-in-time correct join (critical for avoiding leakage):
Training data: join user features as of event timestamp
NOT current feature values → prevents future data leakage
Tools: Feast, Tecton, Hopsworks, AWS SageMaker Feature Store
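The sketch below shows one way to do a point-in-time correct join with pandas `merge_asof`; the tiny synthetic tables and the `user_30d_txn_count` feature are illustrative, and a real feature store (e.g., Feast's `get_historical_features`) performs the same join at scale.

```python
import pandas as pd

# Label events: one row per training example, keyed by entity and event time.
labels = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-01-10", "2024-01-20", "2024-01-15"]),
    "label": [0, 1, 0],
})

# Feature snapshots: the value of user_30d_txn_count as of each snapshot time.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_ts": pd.to_datetime(["2024-01-05", "2024-01-18", "2024-01-12"]),
    "user_30d_txn_count": [3, 7, 11],
})

# Point-in-time join: for each label event, take the latest feature value with
# feature_ts <= event_ts, never a future value, so there is no leakage.
train = pd.merge_asof(
    labels.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
    direction="backward",
)
print(train[["user_id", "event_ts", "user_30d_txn_count", "label"]])
```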
Model Training Infrastructure
Distributed training patterns:
Data Parallelism (most common):
Split dataset across N GPUs
Each GPU has full model copy
Forward pass → gradients → AllReduce → update weights
Tools: PyTorch DDP, Horovod
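A minimal data-parallel training sketch with PyTorch DDP (one process per GPU, launched via torchrun); the linear model and random tensors are stand-ins for a real model and dataset.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE; one process per GPU.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])      # full model copy per GPU
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    data = TensorDataset(torch.randn(10_000, 128), torch.randn(10_000, 1))
    sampler = DistributedSampler(data)               # shards the dataset across ranks
    loader = DataLoader(data, batch_size=256, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)                     # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            loss = torch.nn.functional.mse_loss(model(x), y)
            opt.zero_grad()
            loss.backward()                          # gradients AllReduced across GPUs here
            opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launch with `torchrun --nproc_per_node=8 train_ddp.py`; DDP runs the AllReduce during backward(), so every rank finishes each step with identical weights.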
Model Parallelism (for models too large for one GPU):
Split model layers across GPUs (pipeline parallelism)
LLaMA 70B: 80 transformer layers across 8 × A100 GPUs
Tools: DeepSpeed, Megatron-LM
Hyperparameter Tuning:
Grid search → Bayesian optimization (TPE algorithm)
→ Population-based training (PBT) for RL
Tools: Ray Tune, Optuna, Weights & Biases Sweeps
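A minimal Bayesian-optimization sketch with Optuna (the TPE sampler is its default); `train_and_evaluate` is a hypothetical stand-in for the real training loop.

```python
import optuna

def train_and_evaluate(lr: float, batch_size: int) -> float:
    # Hypothetical stand-in for real training; returns a fake validation AUC.
    return 0.9 - abs(lr - 0.01) - 0.0001 * batch_size

def objective(trial):
    # Search space: hyperparameters sampled per trial.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
    return train_and_evaluate(lr=lr, batch_size=batch_size)

study = optuna.create_study(direction="maximize")   # maximize validation AUC
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```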
Experiment Tracking (mandatory):
Log: hyperparameters, metrics, artifacts, code version
Tools: MLflow, Weights & Biases, Comet
Schema: run_id, experiment_id, params {lr, batch_size},
metrics {train_loss, val_auc per epoch}, model artifact
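A minimal MLflow tracking sketch covering the schema above; the synthetic data and experiment name are illustrative.

```python
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature/label tables.
X = np.random.randn(5_000, 3)
y = (X[:, 0] + np.random.randn(5_000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

mlflow.set_experiment("fraud_detector")

with mlflow.start_run() as run:
    params = {"C": 1.0, "max_iter": 1000}
    mlflow.log_params(params)                                  # hyperparameters

    model = LogisticRegression(**params).fit(X_train, y_train)
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

    mlflow.log_metric("val_auc", val_auc)                      # metrics
    mlflow.sklearn.log_model(model, artifact_path="model")     # model artifact
    print("run_id:", run.info.run_id)                          # ties the run to the registry entry
```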
Model Registry and Versioning
Model lifecycle stages:
Staging → Candidate → Production → Archived
Model registry entry:
{
"model_name": "fraud_detector",
"version": 42,
"stage": "Production",
"training_run_id": "abc123",
"framework": "scikit-learn 1.3",
"metrics": {"val_auc": 0.953, "val_f1": 0.881},
"features": ["amount", "merchant_category", "user_30d_txn_count"],
"training_dataset": "s3://ml-data/fraud/2024-01-01/",
"registered_at": "2024-01-15T10:30:00Z",
"promoted_by": "alice@company.com"
}
Promotion workflow:
1. Train model → log to experiment tracker
2. Pass offline evaluation thresholds → register in staging
3. Shadow test: challenger scores live traffic, predictions logged and compared offline vs champion
4. Shadow passed → promote to Candidate
5. Online A/B test (traffic split) → promote to Production
6. Old model → Archived (retained for rollback)
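A sketch of the registration and promotion steps using MLflow's registry API; the run id and model name come from the example entry above. Note that MLflow's built-in stages are Staging/Production/Archived, so a separate "Candidate" gate would typically be modeled with a tag or alias.

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Step 2: register the model artifact logged by a tracked training run.
version = mlflow.register_model(
    model_uri="runs:/abc123/model",   # run_id from the experiment tracker
    name="fraud_detector",
)

# Steps 5-6: after the online A/B test passes, promote to Production;
# the previous champion is archived but retained for rollback.
client.transition_model_version_stage(
    name="fraud_detector",
    version=version.version,
    stage="Production",
    archive_existing_versions=True,
)
```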
Model Serving Architecture
Batch inference (offline predictions):
Trigger: nightly Airflow job
Input: feature table in S3 (yesterday's user features)
Output: prediction table (user_id → score) in S3 → Redis
Latency: hours acceptable; throughput is key metric
Tools: Spark MLlib, SageMaker Batch Transform, Ray Data (batch inference)
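A sketch of the nightly batch scoring step; the S3 paths are illustrative, and loading the output table into Redis is assumed to be a separate downstream job.

```python
import pandas as pd
import mlflow.sklearn

FEATURES_PATH = "s3://ml-data/features/users/2024-01-15/"      # illustrative path
OUTPUT_PATH = "s3://ml-data/predictions/fraud/2024-01-15/"     # illustrative path

# Load the current Production model from the registry.
model = mlflow.sklearn.load_model("models:/fraud_detector/Production")

features = pd.read_parquet(FEATURES_PATH)                      # yesterday's user features
scores = model.predict_proba(features.drop(columns=["user_id"]))[:, 1]

out = pd.DataFrame({"user_id": features["user_id"], "score": scores})
out.to_parquet(OUTPUT_PATH)   # a downstream job loads this table into Redis
```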
Real-time inference (online serving):
Request → Feature Store lookup (Redis, < 5ms)
→ Model Server (TF Serving / Triton / TorchServe)
→ Post-processing (threshold, calibration)
→ Response
Latency: < 20ms p99 target
Scaling: k8s HPA on GPU utilization or request queue depth
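A minimal online-serving sketch (FastAPI endpoint, Redis feature lookup, registry model); the hostname, key format, and 0.9 decision threshold are illustrative, and a production path would typically call out to Triton/TF Serving rather than run inference in-process.

```python
import json
import pandas as pd
import redis
import mlflow.sklearn
from fastapi import FastAPI

app = FastAPI()
feature_store = redis.Redis(host="online-features", port=6379)          # illustrative host
model = mlflow.sklearn.load_model("models:/fraud_detector/Production")  # loaded once, kept warm

@app.get("/score/{user_id}")
def score(user_id: str, amount: float, merchant_category: str):
    # Online feature lookup: values precomputed by the streaming pipeline (< 5ms).
    stored = json.loads(feature_store.get(f"user_features:{user_id}") or "{}")
    row = pd.DataFrame([{
        "amount": amount,
        "merchant_category": merchant_category,
        "user_30d_txn_count": stored.get("user_30d_txn_count", 0),
    }])
    raw = float(model.predict_proba(row)[0, 1])
    # Post-processing: thresholding / calibration.
    return {"user_id": user_id, "score": raw, "flag": raw > 0.9}
```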
Model server optimizations:
- Model quantization: FP32 → INT8 (4× smaller, ~2× faster, ~1% accuracy drop)
- Batching: collect N requests → single GPU forward pass (amortize overhead)
- ONNX: convert from PyTorch/TF → unified runtime
- Model caching: warm model in GPU memory (cold start = seconds)
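One concrete quantization path, as a sketch: post-training dynamic quantization in PyTorch (INT8 weights for Linear layers, CPU inference). The toy model is a placeholder; GPU serving stacks would more commonly use TensorRT or Triton for quantized inference.

```python
import torch

# Toy FP32 model standing in for a real network.
fp32_model = torch.nn.Sequential(
    torch.nn.Linear(128, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)

# Post-training dynamic quantization: Linear weights stored as INT8,
# activations quantized on the fly at inference time (CPU).
int8_model = torch.quantization.quantize_dynamic(
    fp32_model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(fp32_model(x), int8_model(x))   # outputs close; weights ~4x smaller
```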
A/B Testing and Shadow Mode
Shadow mode (safe challenger evaluation):
All requests → Champion model → response to user
All requests → Challenger model → prediction logged (not used)
Offline: compare champion vs challenger on same inputs
A/B test (traffic split for online evaluation):
10% traffic → Challenger (new model)
90% traffic → Champion (current model)
Track: CTR, conversion rate, revenue, long-term engagement
Duration: 2+ weeks (statistical significance + seasonality)
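A sketch of sticky traffic assignment for the 90/10 split: hashing the user id keeps each user on the same arm for the whole experiment (the hash function and bucket count are illustrative).

```python
import hashlib

def assign_arm(user_id: str, challenger_pct: int = 10) -> str:
    """Deterministic bucket: the same user always hits the same model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

assert assign_arm("user_42") == assign_arm("user_42")   # sticky assignment
print(assign_arm("user_42"))
```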
Metrics hierarchy:
Primary: Business metric (revenue, CTR, D7 retention)
Secondary: Model metric (AUC, precision, recall)
Guardrail: Latency p99, error rate, cost-per-prediction
ML Monitoring: Detecting Drift
Types of drift:
Data drift: input feature distribution changes
e.g., user age distribution shifts after new market launch
Concept drift: relationship between features and label changes
e.g., fraud patterns evolve, so a model trained on the old patterns goes stale
Label drift: outcome distribution changes
e.g., click-through rate drops across the board
Detection methods:
PSI (Population Stability Index): compare feature distributions
PSI < 0.1: no significant change
PSI 0.1–0.2: moderate shift (monitor)
PSI > 0.2: major shift (retrain)
KS test: Kolmogorov-Smirnov statistic for continuous features
Chi-squared: for categorical features
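A minimal PSI computation sketch (NumPy, quantile bins taken from the baseline sample); the synthetic normal samples are illustrative.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current feature sample."""
    # Bin edges come from the baseline (training-time) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf

    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Clip to avoid log(0) / division by zero for empty bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)

    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

baseline = np.random.normal(0.0, 1.0, 50_000)   # training-time feature sample
current = np.random.normal(0.3, 1.0, 50_000)    # serving-time sample (shifted)
print(psi(baseline, current))                   # > 0.2 would trigger a retraining review
```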
Model output monitoring:
Track prediction score distribution daily
Alert if mean score shifts > 2 standard deviations
Monitoring stack:
Model → log predictions + features → Kafka
→ Flink: compute drift metrics per feature
→ Time-series DB (Prometheus/InfluxDB)
→ Grafana dashboard + alert if PSI > threshold
→ Trigger retraining pipeline
Retraining Strategy
| Strategy | Trigger | Cost | Best For |
|---|---|---|---|
| Scheduled | Weekly/monthly cron | Low | Stable, slow-changing domains |
| Triggered | Drift detected (PSI > threshold) | Medium | Dynamic environments |
| Continuous | New data available (streaming) | High | Real-time personalization, fraud |
ML Platform Component Summary
Data Layer: Kafka → Flink → Feature Store (offline: S3, online: Redis)
Training: Airflow DAG → Spark / PyTorch DDP → MLflow (tracking)
Registry: Model Registry (staging → production pipeline)
Serving: Triton / TF Serving → k8s HPA → < 20ms p99
Monitoring: Prediction logs → drift detection → alert → retrain trigger
Orchestration: Airflow / Kubeflow Pipelines / Metaflow (pipelines as code)
Interview Discussion Points
- Training/serving skew: The #1 production ML bug. Same feature code must run in training and serving. Feature store enforces this by being the single source of feature logic. Without it, teams independently implement features and diverge.
- Online vs batch features: Some features require real-time computation (user’s last 5 actions), others are batch (user lifetime value). Hybrid feature stores serve both from a unified API — online for real-time, offline for training — hiding the implementation difference from model code.
- Model rollback: Always retain the previous champion model in the registry. Rollback = update serving config to point to previous version. Should complete in < 5 minutes. Canary deployment (5% → 20% → 100% traffic) enables early detection of regressions.
- Cold start in ML serving: Loading a large model (GPT-scale) from disk takes 30-120 seconds. Mitigate with: keep model warm in GPU memory, use smaller distilled models for latency-critical paths, preload on startup, readiness probe gates traffic until model loaded.