System Design: ML Training and Serving Pipeline — Feature Store, Training, and Inference at Scale (2025)

ML Platform Architecture Overview

An ML platform has three distinct phases:
- Feature Engineering — transform raw data into model-ready features and store them for reuse.
- Training Pipeline — fetch features, train models, evaluate, version, and register.
- Serving Pipeline — retrieve features in real time, run inference, and log predictions for feedback loops.
The critical insight is the training-serving skew problem: if features are computed differently at training time vs. serving time, the model sees different distributions and performance degrades. A Feature Store with shared computation logic solves this.
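
The skew fix is easiest to see with a concrete, made-up feature. In the sketch below, the name seconds_since_last_purchase is an illustration only: the point is that one function is the single source of truth, called by the offline backfill job when building training data and by the online ingestion path when writing to the key-value store.

from datetime import datetime


# Hypothetical shared feature definition: the offline backfill and the online
# write path both call this one function, so the two planes cannot diverge.
def seconds_since_last_purchase(event_time: datetime,
                                last_purchase_time: datetime) -> float:
    return (event_time - last_purchase_time).total_seconds()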

Feature Store: Online and Offline

The Feature Store has two planes:
- Offline Store — a data warehouse (Hive, BigQuery, Snowflake) containing historical feature values with point-in-time correctness (no data leakage — features only use data available at the timestamp of the training label). Used for training data generation.
- Online Store — a low-latency key-value store (Redis, DynamoDB, Cassandra) serving features at inference time with < 10ms p99. Feature entities (user, item, session) are the primary keys.

from datetime import datetime
from functools import reduce
import pandas as pd


class FeatureStore:
    def get_training_data(self, entity_ids: list, feature_names: list,
                          start_time: datetime, end_time: datetime) -> pd.DataFrame:
        # Point-in-time join: for each label timestamp,
        # fetch the feature value that was current at that time
        query = """
            SELECT label.entity_id, label.timestamp, label.label,
                   feat.value
            FROM labels label
            JOIN feature_log feat
              ON label.entity_id = feat.entity_id
             AND feat.timestamp = (
                 SELECT MAX(timestamp) FROM feature_log
                 WHERE entity_id = label.entity_id
                   AND feature_name = %s
                   AND timestamp <= label.timestamp
             )
            WHERE label.timestamp BETWEEN %s AND %s
        """
        # One query per requested feature (self.offline_store.query is assumed
        # to return a DataFrame); merge the per-feature frames on the label rows.
        frames = [
            self.offline_store.query(query, (name, start_time, end_time))
                .rename(columns={"value": name})
            for name in feature_names
        ]
        return reduce(lambda a, b: a.merge(b, on=["entity_id", "timestamp", "label"]), frames)

    def get_online_features(self, entity_id: str, feature_names: list) -> dict:
        keys = [f"{name}:{entity_id}" for name in feature_names]
        values = self.redis.mget(keys)
        return dict(zip(feature_names, values))
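
The class above leaves implicit how the online plane gets populated. A minimal materialization sketch, assuming a redis-py client and a latest_values DataFrame with entity_id / feature_name / value columns (both are assumptions, not part of the class above); it writes the same "feature_name:entity_id" keys that get_online_features reads:

def materialize_to_online(redis_client, latest_values) -> None:
    # Hypothetical helper: copy the freshest offline value of each feature
    # into the online store so low-latency reads stay in sync with the warehouse.
    for row in latest_values.itertuples(index=False):
        redis_client.set(f"{row.feature_name}:{row.entity_id}", row.value)

In practice this job runs on a schedule or off a change stream, and its lag is exactly the feature-freshness metric discussed in the monitoring section below.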

Training Pipeline

Orchestrated by a workflow engine (Airflow, Kubeflow, Metaflow). Stages:
- Data Validation — check for schema drift, null rates, and feature distribution shifts (Great Expectations or TFX).
- Feature Generation — point-in-time join of labels with offline features.
- Model Training — distributed training on a GPU cluster (Kubernetes + GPU nodes); hyperparameter tuning via Ray Tune or Optuna.
- Evaluation — compute held-out metrics; compare against the production model (challenger vs. champion).
- Model Registration — if the challenger beats the champion by a threshold, register it in the Model Registry (MLflow) with metadata: metrics, training data hash, feature schema version, code SHA. A sketch of this promotion gate follows the list.
- Deployment Trigger — the CI/CD pipeline picks up the newly registered model version.
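
A minimal sketch of the challenger/champion gate. The names (CandidateModel, should_promote) and the 0.005 margin are assumptions, not a particular registry's API; the metadata fields mirror what the registration step records.

from dataclasses import dataclass


@dataclass
class CandidateModel:
    version: str
    auc: float
    training_data_hash: str
    feature_schema_version: str
    code_sha: str


def should_promote(champion_auc: float, challenger: CandidateModel,
                   min_gain: float = 0.005) -> bool:
    # Register the challenger only when it beats the champion by a clear margin,
    # so noise-level "improvements" never trigger a deployment.
    return challenger.auc >= champion_auc + min_gain

The margin guards against promoting on evaluation noise; a stricter variant compares confidence intervals from bootstrapped evaluation sets.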

Model Serving: Online Inference

from datetime import datetime

# PredictRequest, PredictResponse, and PredictionLog are assumed to be plain
# request/response/log containers defined alongside the service.
class InferenceService:
    def predict(self, request: PredictRequest) -> PredictResponse:
        # 1. Fetch real-time features from online store
        features = self.feature_store.get_online_features(
            entity_id=request.entity_id,
            feature_names=self.model.required_features
        )

        # 2. Optionally join with request-time context features
        features.update(request.context_features)

        # 3. Run model inference
        score = self.model.predict(features)

        # 4. Log prediction for training feedback loop
        self.prediction_logger.log(PredictionLog(
            entity_id=request.entity_id,
            features=features,
            score=score,
            model_version=self.model.version,
            timestamp=datetime.utcnow()
        ))
        return PredictResponse(score=score)

Latency budget for online inference: feature retrieval < 5ms, model forward pass < 10ms, total < 20ms p99. Achieve this with model quantization (INT8) and ONNX export + TensorRT; use batch inference for non-real-time use cases. Shadow mode: run the new model version in parallel without serving its results to users, and compare its predictions to the production model's to catch regressions before cutover.
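
A shadow-mode sketch under stated assumptions: ShadowModeRouter and the logger name are invented here, and in a real service the challenger call would typically run asynchronously so it never adds latency to the served path.

import logging

logger = logging.getLogger("shadow_compare")


class ShadowModeRouter:
    # Serves the champion's score; scores the challenger silently for comparison.
    def __init__(self, champion, challenger):
        self.champion = champion
        self.challenger = challenger

    def predict(self, features: dict) -> float:
        served = self.champion.predict(features)
        try:
            shadow = self.challenger.predict(features)
            # Both scores are logged for offline diffing; users only ever see `served`.
            logger.info("served=%s shadow=%s", served, shadow)
        except Exception:
            # A broken challenger must never affect the live response.
            logger.exception("shadow model failed")
        return served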

Monitoring: Training-Serving Skew and Data Drift

Log feature distributions at serving time and compare them to the training data distributions using PSI (Population Stability Index): PSI = sum((actual% - expected%) * ln(actual% / expected%)). Rule of thumb: PSI < 0.1 means no significant shift; 0.1–0.25 means moderate drift, investigate; > 0.25 means significant drift, retrain. Model performance monitoring: track the prediction score distribution (a score distribution shift often precedes metric degradation), and compare against ground truth labels once available (labels are typically delayed). Set up automated retraining triggers when drift metrics exceed thresholds. Feature freshness: monitor Redis TTLs and feature update lag — stale features silently degrade model quality without raising errors.
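
A sketch of the PSI computation, assuming NumPy and using the training ("expected") scores to fix the bin edges; the function and variable names are illustrative.

import numpy as np


def population_stability_index(expected, actual, bins: int = 10) -> float:
    # Bin both samples on edges derived from the training distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) and division by zero when a bin is empty.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

Serving-time values that fall outside the training range land in no bin with this simple scheme; production implementations usually add open-ended edge bins.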

See also: Databricks Interview Prep

See also: Meta Interview Prep

See also: Netflix Interview Prep
