System Design Interview: Machine Learning Platform and MLOps

As machine learning moves from research to production, companies need platforms that manage the full ML lifecycle: data ingestion, feature engineering, training, evaluation, deployment, and monitoring. Designing an ML platform is a key interview topic at companies like Meta, Google, Uber, and any organization running models at scale.

The ML Lifecycle

Data → Features → Training → Evaluation → Deployment → Monitoring
  ↑                                                          │
  └──────────── feedback loop (retraining signals) ──────────┘

Feature Store: The Central Hub

A feature store solves the training-serving skew problem — the same feature computation logic must produce identical results during training (offline) and serving (online).

Feature Store Architecture:
  Offline Store (batch, historical)
    ├── Source: Hive, Spark, dbt pipeline
    ├── Storage: S3 + Parquet (point-in-time correct joins)
    └── Use: model training, offline evaluation

  Online Store (low-latency, real-time)
    ├── Source: Kafka stream → Flink materialization
    ├── Storage: Redis / DynamoDB (< 5ms p99 reads)
    └── Use: model serving (inference time feature lookup)

  Feature Registry
    └── Metadata: name, type, owner, freshness SLA, lineage

Point-in-time correct join (critical for avoiding leakage):
  Training data: join user features as of event timestamp
  NOT current feature values → prevents future data leakage
  Tools: Feast, Tecton, Hopsworks, AWS SageMaker Feature Store
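The as-of lookup can be sketched in a few lines of pure Python. The table contents and helper name here are made up for illustration; tools like Feast implement the same point-in-time semantics over full historical feature tables.

```python
from bisect import bisect_right

# Hypothetical feature history: user_id -> sorted list of
# (timestamp, user_30d_txn_count) snapshots.
feature_history = {
    "u1": [(100, 3), (200, 5), (300, 9)],
}

def point_in_time_lookup(user_id, event_ts):
    """Return the latest feature value written at or before event_ts.

    Never returns a value written after the event: that would leak
    future information into the training set.
    """
    history = feature_history[user_id]
    timestamps = [ts for ts, _ in history]
    i = bisect_right(timestamps, event_ts) - 1
    if i < 0:
        return None  # no feature value existed yet at event time
    return history[i][1]

# An event at t=250 sees the value written at t=200, not the t=300 one.
print(point_in_time_lookup("u1", 250))  # -> 5
```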

Model Training Infrastructure

Distributed training patterns:

Data Parallelism (most common):
  Split dataset across N GPUs
  Each GPU has full model copy
  Forward pass → gradients → AllReduce → update weights
  Tools: PyTorch DDP, Horovod
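The data-parallel loop above can be simulated without GPUs: each "replica" computes a gradient on its own shard, the gradients are averaged (the role NCCL AllReduce plays in PyTorch DDP), and every replica applies the identical update. The linear model, data, and learning rate are illustrative.

```python
def local_gradient(weight, shard):
    # Gradient of mean squared error for the toy model y = w * x
    # computed only on this replica's shard of the data.
    return sum(2 * x * (weight * x - y) for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for AllReduce: average gradients across all replicas,
    # so every replica applies the same weight update.
    return sum(grads) / len(grads)

# Dataset generated by y = 2x, split across two "GPUs".
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(200):
    grads = [local_gradient(w, shard) for shard in shards]
    w -= 0.05 * all_reduce_mean(grads)
print(round(w, 3))  # converges to w = 2.0
```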

Model Parallelism (for models too large for one GPU):
  Split model layers across GPUs (pipeline parallelism)
  LLaMA 70B: 80 transformer layers across 8 × A100 GPUs
  Tools: DeepSpeed, Megatron-LM

Hyperparameter Tuning:
  Grid search → Bayesian optimization (TPE algorithm)
  → Population-based training (PBT) for RL
  Tools: Ray Tune, Optuna, Weights & Biases Sweeps
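Random search, the usual baseline before Bayesian methods, is only a few lines. The objective here is a made-up stand-in for a full training run; Ray Tune and Optuna wrap this same sample-evaluate-keep-best loop with smarter samplers and early stopping.

```python
import random

def objective(lr, batch_size):
    # Fake validation score that peaks at lr=0.01, batch_size=64;
    # in reality this would launch a training job and return val AUC.
    return 1.0 - abs(lr - 0.01) * 10 - abs(batch_size - 64) / 1000

random.seed(0)
best = None
for _ in range(50):
    trial = {
        "lr": 10 ** random.uniform(-4, -1),     # log-uniform sample
        "batch_size": random.choice([16, 32, 64, 128]),
    }
    score = objective(**trial)
    if best is None or score > best[0]:
        best = (score, trial)
print(best)
```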

Experiment Tracking (mandatory):
  Log: hyperparameters, metrics, artifacts, code version
  Tools: MLflow, Weights & Biases, Comet
  Schema: run_id, experiment_id, params {lr, batch_size},
          metrics {train_loss, val_auc per epoch}, model artifact
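The schema above maps to a small run record. This sketch shows the shape of what a tracker stores (the helper name and field values are illustrative); MLflow and Weights & Biases add the backend, UI, and artifact storage on top.

```python
import json
import time
import uuid

def log_run(experiment_id, params, metrics_per_epoch, artifact_uri):
    """Build a run record matching the schema above and serialize it.

    A real tracker would write this to its backend store instead of
    returning a JSON string.
    """
    run = {
        "run_id": uuid.uuid4().hex,
        "experiment_id": experiment_id,
        "params": params,                 # e.g. {lr, batch_size}
        "metrics": metrics_per_epoch,     # per-epoch loss / val metrics
        "model_artifact": artifact_uri,
        "logged_at": int(time.time()),
    }
    return json.dumps(run)

record = log_run(
    "fraud-exp-1",
    {"lr": 0.001, "batch_size": 256},
    [{"epoch": 1, "train_loss": 0.41, "val_auc": 0.92}],
    "s3://models/fraud/run1/",
)
```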

Model Registry and Versioning

Model lifecycle stages:
  Staging → Candidate → Production → Archived

Model registry entry:
{
  "model_name": "fraud_detector",
  "version": 42,
  "stage": "Production",
  "training_run_id": "abc123",
  "framework": "scikit-learn 1.3",
  "metrics": {"val_auc": 0.953, "val_f1": 0.881},
  "features": ["amount", "merchant_category", "user_30d_txn_count"],
  "training_dataset": "s3://ml-data/fraud/2024-01-01/",
  "registered_at": "2024-01-15T10:30:00Z",
  "promoted_by": "alice@company.com"
}

Promotion workflow:
  1. Train model → log to experiment tracker
  2. Pass offline evaluation thresholds → register in staging
  3. Run in shadow mode (offline comparison vs champion)
  4. Shadow passed → promote to Candidate
  5. Online A/B test (traffic split) → promote to Production
  6. Old model → Archived (retained for rollback)
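The stage transitions above can be enforced as a small guarded state machine. This is a sketch, not any particular registry's API: the transition table and helper names are assumptions.

```python
# Legal stage transitions; anything else is rejected.
ALLOWED = {
    "Staging": {"Candidate"},
    "Candidate": {"Production", "Staging"},  # demote if the A/B test fails
    "Production": {"Archived"},
}

def promote(registry, model_name, version, new_stage):
    entry = registry[(model_name, version)]
    if new_stage not in ALLOWED.get(entry["stage"], set()):
        raise ValueError(f"illegal transition {entry['stage']} -> {new_stage}")
    if new_stage == "Production":
        # Archive the current champion so it stays available for rollback.
        for (name, _), other in registry.items():
            if name == model_name and other["stage"] == "Production":
                other["stage"] = "Archived"
    entry["stage"] = new_stage

registry = {
    ("fraud_detector", 41): {"stage": "Production"},
    ("fraud_detector", 42): {"stage": "Candidate"},
}
promote(registry, "fraud_detector", 42, "Production")
# v42 is now Production; v41 is Archived, retained for rollback.
```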

Model Serving Architecture

Batch inference (offline predictions):
  Trigger: nightly Airflow job
  Input:  feature table in S3 (yesterday's user features)
  Output: prediction table (user_id → score) in S3 → Redis
  Latency: hours acceptable; throughput is key metric
  Tools: Spark MLlib, SageMaker Batch Transform, Ray Data

Real-time inference (online serving):
  Request → Feature Store lookup (Redis, < 5ms)
           → Model Server (TF Serving / Triton / TorchServe)
           → Post-processing (threshold, calibration)
           → Response
  Latency: < 20ms p99 target
  Scaling: k8s HPA on GPU utilization or request queue depth
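The request path can be sketched with in-memory stand-ins: a dict in place of the Redis online store, and a trivial scoring function in place of the model server. The feature names and the 0.5 threshold are illustrative.

```python
# Stand-in for the Redis online feature store.
online_store = {"u1": {"amount_avg_30d": 120.0, "txn_count_30d": 14}}

def model_predict(features):
    # Stand-in for a TF Serving / Triton inference call.
    return min(1.0, features["txn_count_30d"] / 20)

def handle_request(user_id, threshold=0.5):
    features = online_store.get(user_id)       # feature lookup (< 5 ms)
    if features is None:
        # Cold-start fallback when no features exist yet for this user.
        return {"user_id": user_id, "decision": "default"}
    score = model_predict(features)            # model server call
    decision = "flag" if score >= threshold else "allow"  # post-processing
    return {"user_id": user_id, "score": score, "decision": decision}

print(handle_request("u1"))
```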

Model server optimizations:
  - Model quantization: FP32 → INT8 (4× smaller, ~2× faster, ~1% accuracy drop)
  - Batching: collect N requests → single GPU forward pass (amortize overhead)
  - ONNX: convert from PyTorch/TF → unified runtime
  - Model caching: warm model in GPU memory (cold start = seconds)
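Dynamic batching, the second optimization above, amortizes per-request overhead by running one forward pass over many requests. A toy size-triggered batcher (production servers like Triton also flush on a timeout so a lone request is never stuck waiting):

```python
class Batcher:
    """Accumulate requests, then run one batched 'forward pass'."""

    def __init__(self, max_batch, run_batch):
        self.max_batch = max_batch
        self.run_batch = run_batch  # the batched model call
        self.pending = []

    def submit(self, request):
        self.pending.append(request)
        if len(self.pending) >= self.max_batch:
            return self.flush()
        return None  # caller waits; a timer would flush eventually

    def flush(self):
        batch, self.pending = self.pending, []
        return self.run_batch(batch)

# Doubling stands in for the model; a real run_batch hits the GPU once.
batcher = Batcher(max_batch=4, run_batch=lambda xs: [x * 2 for x in xs])
results = [batcher.submit(x) for x in [1, 2, 3, 4]]
# The first three submits return None; the fourth triggers the batch.
```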

A/B Testing and Shadow Mode

Shadow mode (safe challenger evaluation):
  All requests → Champion model → response to user
  All requests → Challenger model → prediction logged (not used)
  Offline: compare champion vs challenger on same inputs

A/B test (traffic split for online evaluation):
  10% traffic → Challenger (new model)
  90% traffic → Champion (current model)
  Track: CTR, conversion rate, revenue, long-term engagement
  Duration: 2+ weeks (statistical significance + seasonality)
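The traffic split is usually implemented by hashing a stable ID, so the same user always sees the same variant for the experiment's duration. A sketch, with the experiment name and 10% challenger share as assumptions:

```python
import hashlib

def assign_variant(user_id, experiment="fraud_v42", challenger_pct=10):
    # Hash (experiment, user) so assignment is deterministic per user
    # and independent across experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "challenger" if bucket < challenger_pct else "champion"

counts = {"champion": 0, "challenger": 0}
for i in range(10_000):
    counts[assign_variant(f"user{i}")] += 1
# The challenger share lands close to 10% of traffic.
```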

Metrics hierarchy:
  Primary:   Business metric (revenue, CTR, D7 retention)
  Secondary: Model metric (AUC, precision, recall)
  Guardrail: Latency p99, error rate, cost-per-prediction

ML Monitoring: Detecting Drift

Types of drift:
  Data drift:    input feature distribution changes
    e.g., user age distribution shifts after new market launch
  Concept drift: relationship between features and label changes
    e.g., fraud patterns change, model predictions stale
  Label drift:   outcome distribution changes
    e.g., click-through rate drops across the board

Detection methods:
  PSI (Population Stability Index): compare feature distributions
    PSI < 0.1:    no significant change
    PSI 0.1-0.2:  moderate shift (investigate)
    PSI > 0.2:    major shift (retrain)

  KS test: Kolmogorov-Smirnov statistic for continuous features
  Chi-squared: for categorical features

  Model output monitoring:
    Track prediction score distribution daily
    Alert if mean score shifts > 2 standard deviations
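PSI itself is a short computation over binned distributions: for each bin, (actual% - expected%) * ln(actual% / expected%), summed. A sketch with made-up histograms (bin edges and counts are illustrative):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned distributions.

    eps guards against empty bins, where the log would blow up.
    """
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

baseline = [500, 300, 150, 50]   # training-time feature histogram
today    = [480, 310, 160, 50]   # near-identical -> PSI well under 0.1
shifted  = [200, 300, 300, 200]  # large shift -> PSI well over 0.2
print(round(psi(baseline, today), 4), round(psi(baseline, shifted), 4))
```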

Monitoring stack:
  Model → log predictions + features → Kafka
        → Flink: compute drift metrics per feature
        → Time-series DB (Prometheus/InfluxDB)
        → Grafana dashboard + alert if PSI > threshold
        → Trigger retraining pipeline

Retraining Strategy

Strategy     Trigger                             Cost     Best For
Scheduled    Weekly/monthly cron                 Low      Stable, slow-changing domains
Triggered    Drift detected (PSI > threshold)    Medium   Dynamic environments
Continuous   New data available (streaming)      High     Real-time personalization, fraud

ML Platform Component Summary

Data Layer:      Kafka → Flink → Feature Store (offline: S3, online: Redis)
Training:        Airflow DAG → Spark / PyTorch DDP → MLflow (tracking)
Registry:        Model Registry (staging → production pipeline)
Serving:         Triton / TF Serving → k8s HPA → < 20ms p99
Monitoring:      Prediction logs → drift detection → alert → retrain trigger
Orchestration:   Airflow / Kubeflow Pipelines / Metaflow (pipelines as code)

Interview Discussion Points

  • Training/serving skew: The #1 production ML bug. Same feature code must run in training and serving. Feature store enforces this by being the single source of feature logic. Without it, teams independently implement features and diverge.
  • Online vs batch features: Some features require real-time computation (user’s last 5 actions), others are batch (user lifetime value). Hybrid feature stores serve both from a unified API — online for real-time, offline for training — hiding the implementation difference from model code.
  • Model rollback: Always retain the previous champion model in the registry. Rollback = update serving config to point to previous version. Should complete in < 5 minutes. Canary deployment (5% → 20% → 100% traffic) enables early detection of regressions.
  • Cold start in ML serving: Loading a large model (GPT-scale) from disk takes 30-120 seconds. Mitigate with: keep model warm in GPU memory, use smaller distilled models for latency-critical paths, preload on startup, readiness probe gates traffic until model loaded.

