ML Platform Architecture Overview
An ML platform has three distinct phases:
- Feature Engineering — transform raw data into model-ready features and store them for reuse.
- Training Pipeline — fetch features, train models, evaluate, version, and register.
- Serving Pipeline — retrieve features in real time, run inference, and log predictions for feedback loops.
The critical insight is the training-serving skew problem: if features are computed differently at training time than at serving time, the model sees a different input distribution in production and performance degrades. A Feature Store with shared computation logic solves this.
Feature Store: Online and Offline
The Feature Store has two planes:
- Offline Store: a data warehouse (Hive, BigQuery, Snowflake) containing historical feature values with point-in-time correctness (no data leakage — features only use data available at the timestamp of the training label). Used for training data generation.
- Online Store: a low-latency key-value store (Redis, DynamoDB, Cassandra) serving features at inference time with < 10 ms p99. Feature entities (user, item, session) are the primary keys.
from datetime import datetime
import pandas as pd

class FeatureStore:
    def get_training_data(self, entity_ids: list, feature_names: list,
                          start_time: datetime, end_time: datetime) -> pd.DataFrame:
        # Point-in-time join: for each label timestamp,
        # fetch the feature value that was current at that time
        query = """
            SELECT label.entity_id, label.timestamp, label.label, feat.value
            FROM labels label
            JOIN feature_log feat
              ON label.entity_id = feat.entity_id
             AND feat.timestamp = (
                 SELECT MAX(timestamp) FROM feature_log
                 WHERE entity_id = label.entity_id
                   AND feature_name = %s
                   AND timestamp <= label.timestamp
             )
        """
        # Execute against the offline store (warehouse connection assumed;
        # one feature shown for brevity)
        return pd.read_sql(query, self.warehouse, params=(feature_names[0],))

    def get_online_features(self, entity_id: str, feature_names: list) -> dict:
        keys = [f"{name}:{entity_id}" for name in feature_names]
        values = self.redis.mget(keys)
        return dict(zip(feature_names, values))
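As an illustration of the point-in-time join described above, here is a minimal, self-contained pandas sketch (the table and column names mirror the query; `pd.merge_asof` with `direction="backward"` picks, for each label, the latest feature value at or before the label's timestamp — exactly the no-leakage guarantee):

```python
import pandas as pd

labels = pd.DataFrame({
    "entity_id": ["u1", "u1"],
    "timestamp": pd.to_datetime(["2024-01-05", "2024-01-10"]),
    "label": [0, 1],
})
feature_log = pd.DataFrame({
    "entity_id": ["u1", "u1", "u1"],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-07", "2024-01-12"]),
    "value": [10.0, 20.0, 30.0],
})

# merge_asof requires both frames sorted by the join key; "backward"
# matches the latest feature row with timestamp <= label timestamp.
training = pd.merge_asof(
    labels.sort_values("timestamp"),
    feature_log.sort_values("timestamp"),
    on="timestamp",
    by="entity_id",
    direction="backward",
)
print(training["value"].tolist())  # [10.0, 20.0] — the 01-12 value never leaks backward
```

Note that the feature value from 2024-01-12 is never joined to either label, even though it exists in the log: future data is invisible to past labels.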
Training Pipeline
Orchestrated by a workflow engine (Airflow, Kubeflow, Metaflow). Stages:
1. Data Validation — check for schema drift, null rates, feature distribution shifts (Great Expectations or TFX).
2. Feature Generation — point-in-time join of labels with offline features.
3. Model Training — distributed training on a GPU cluster (Kubernetes + GPU nodes); hyperparameter tuning via Ray Tune or Optuna.
4. Evaluation — compute held-out metrics; compare against the production model (challenger/champion).
5. Model Registration — if the challenger beats the champion by a threshold, register it in the Model Registry (MLflow) with metadata: metrics, training data hash, feature schema version, code SHA.
6. Deployment Trigger — the CI/CD pipeline picks up the newly registered model version.
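The gating logic in the validation and registration stages can be sketched as plain functions (a minimal illustration — the names, thresholds, and `TrainingRun` type are hypothetical; a real deployment wraps each stage in an orchestrator task such as an Airflow operator or Kubeflow component):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingRun:
    metrics: dict
    model_version: Optional[str] = None  # set only if the challenger is promoted

def validate_data(df_stats: dict) -> None:
    # Fail fast on data-quality issues before spending GPU hours.
    if df_stats["null_rate"] > 0.05:
        raise ValueError("null rate above threshold; aborting pipeline")

def should_promote(challenger_auc: float, champion_auc: float,
                   threshold: float = 0.005) -> bool:
    # Challenger must beat the champion by a margin, not merely tie it.
    return challenger_auc - champion_auc > threshold

def run_pipeline(df_stats: dict, challenger_auc: float,
                 champion_auc: float) -> TrainingRun:
    validate_data(df_stats)
    run = TrainingRun(metrics={"auc": challenger_auc})
    if should_promote(challenger_auc, champion_auc):
        run.model_version = "v2"  # a real pipeline would call the Model Registry here
    return run

run = run_pipeline({"null_rate": 0.01}, challenger_auc=0.91, champion_auc=0.90)
print(run.model_version)  # "v2": challenger beat champion by more than 0.005
```

The point of the explicit `should_promote` threshold is that a challenger that merely matches the champion is not worth the deployment risk.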
Model Serving: Online Inference
class InferenceService:
    def predict(self, request: PredictRequest) -> PredictResponse:
        # 1. Fetch real-time features from online store
        features = self.feature_store.get_online_features(
            entity_id=request.entity_id,
            feature_names=self.model.required_features
        )
        # 2. Optionally join with request-time context features
        features.update(request.context_features)
        # 3. Run model inference
        score = self.model.predict(features)
        # 4. Log prediction for training feedback loop
        self.prediction_logger.log(PredictionLog(
            entity_id=request.entity_id,
            features=features,
            score=score,
            model_version=self.model.version,
            timestamp=datetime.utcnow()
        ))
        return PredictResponse(score=score)
Latency budget for online inference: feature retrieval < 5 ms, model forward pass < 10 ms, total < 20 ms p99. Achieve this with model quantization (INT8), ONNX export + TensorRT, and batch inference for non-real-time use cases. Shadow mode: run a new model version in parallel without serving its results to users, and compare its predictions against the production model's to catch regressions before cutover.
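Shadow mode can be sketched as a thin router in front of the two models (a minimal illustration with a hypothetical `predict(features) -> float` model interface; the disagreement threshold of 0.1 is arbitrary):

```python
import logging

class ShadowRouter:
    """Serve the champion; score the challenger silently for offline comparison."""
    def __init__(self, champion, challenger):
        self.champion = champion
        self.challenger = challenger
        self.disagreements = []

    def predict(self, features: dict) -> float:
        served = self.champion.predict(features)
        try:
            shadow = self.challenger.predict(features)
            # Record the pair for offline analysis; never return `shadow` to callers.
            if abs(served - shadow) > 0.1:
                self.disagreements.append((served, shadow))
        except Exception:
            # A crashing challenger must never take down serving.
            logging.getLogger("shadow").exception("challenger failed in shadow mode")
        return served

class ConstModel:
    # Stand-in model for the sketch; a real model would run actual inference.
    def __init__(self, score: float):
        self.score = score
    def predict(self, features: dict) -> float:
        return self.score

router = ShadowRouter(champion=ConstModel(0.8), challenger=ConstModel(0.5))
print(router.predict({"user_id": "u1"}))  # 0.8 — the champion's score is served
print(len(router.disagreements))          # 1 — the 0.3 gap was recorded for review
```

Two design points worth noting: challenger failures are swallowed so shadow traffic cannot affect production availability, and only disagreements are retained to keep the comparison log small.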
Monitoring: Training-Serving Skew and Data Drift
Log feature distributions at serving time. Compare to training data distributions using PSI (Population Stability Index): PSI = sum((actual% – expected%) * ln(actual% / expected%)). PSI < 0.1: stable. PSI 0.1–0.25: moderate drift, increase monitoring frequency. PSI > 0.25: significant drift, retrain. Model performance monitoring: track the prediction distribution (a score distribution shift often precedes metric degradation) and compare against ground truth labels once available (delayed labels). Set up automated retraining triggers when drift metrics exceed thresholds. Feature freshness: monitor Redis TTL and feature update lag — stale features silently degrade model quality without raising errors.
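The PSI formula above translates directly into a short numpy sketch (binning choices are assumptions: decile bins derived from the training distribution, with open-ended edges to catch out-of-range serving values, and a small epsilon to avoid log(0) on empty bins):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the training (expected) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range serving values
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip avoids division by zero / log(0) for empty bins.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
print(psi(train, rng.normal(0, 1, 10_000)))  # well under 0.1: same distribution, stable
print(psi(train, rng.normal(1, 1, 10_000)))  # well over 0.25: shifted mean, retrain
```

Run per feature on a schedule (e.g. hourly), with the serving-time sample taken from the prediction logs and the expected distribution frozen at training time.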