System Design: ML Training and Serving Pipeline — Feature Store, Training, and Inference at Scale (2025)

ML Platform Architecture Overview

An ML platform has three distinct phases:
- Feature Engineering — transform raw data into model-ready features and store them for reuse.
- Training Pipeline — fetch features, train models, evaluate, version, and register.
- Serving Pipeline — retrieve features in real time, run inference, and log predictions for feedback loops.
The critical insight is the training-serving skew problem: if features are computed differently at training time vs. serving time, the model sees different distributions and performance degrades. A Feature Store with shared computation logic solves this.
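
The skew fix is easiest to see with a concrete, made-up feature. In the sketch below, the name seconds_since_last_purchase is an illustration only: the point is that one function is the single source of truth, called by the offline backfill job when building training data and by the online ingestion path when writing to the key-value store.

from datetime import datetime


# Hypothetical shared feature definition: the offline backfill and the online
# write path both call this one function, so the two planes cannot diverge.
def seconds_since_last_purchase(event_time: datetime,
                                last_purchase_time: datetime) -> float:
    return (event_time - last_purchase_time).total_seconds()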

Feature Store: Online and Offline

The Feature Store has two planes:
- Offline Store — a data warehouse (Hive, BigQuery, Snowflake) containing historical feature values with point-in-time correctness (no data leakage — features only use data available at the timestamp of the training label). Used for training data generation.
- Online Store — a low-latency key-value store (Redis, DynamoDB, Cassandra) serving features at inference time with < 10ms p99. Feature entities (user, item, session) are the primary keys.

from datetime import datetime
from functools import reduce
import pandas as pd


class FeatureStore:
    def get_training_data(self, entity_ids: list, feature_names: list,
                          start_time: datetime, end_time: datetime) -> pd.DataFrame:
        # Point-in-time join: for each label timestamp,
        # fetch the feature value that was current at that time
        query = """
            SELECT label.entity_id, label.timestamp, label.label,
                   feat.value
            FROM labels label
            JOIN feature_log feat
              ON label.entity_id = feat.entity_id
             AND feat.timestamp = (
                 SELECT MAX(timestamp) FROM feature_log
                 WHERE entity_id = label.entity_id
                   AND feature_name = %s
                   AND timestamp <= label.timestamp
             )
            WHERE label.timestamp BETWEEN %s AND %s
        """
        # One query per requested feature (self.offline_store.query is assumed
        # to return a DataFrame); merge the per-feature frames on the label rows.
        frames = [
            self.offline_store.query(query, (name, start_time, end_time))
                .rename(columns={"value": name})
            for name in feature_names
        ]
        return reduce(lambda a, b: a.merge(b, on=["entity_id", "timestamp", "label"]), frames)

    def get_online_features(self, entity_id: str, feature_names: list) -> dict:
        keys = [f"{name}:{entity_id}" for name in feature_names]
        values = self.redis.mget(keys)
        return dict(zip(feature_names, values))
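
The class above leaves implicit how the online plane gets populated. A minimal materialization sketch, assuming a redis-py client and a latest_values DataFrame with entity_id / feature_name / value columns (both are assumptions, not part of the class above); it writes the same "feature_name:entity_id" keys that get_online_features reads:

def materialize_to_online(redis_client, latest_values) -> None:
    # Hypothetical helper: copy the freshest offline value of each feature
    # into the online store so low-latency reads stay in sync with the warehouse.
    for row in latest_values.itertuples(index=False):
        redis_client.set(f"{row.feature_name}:{row.entity_id}", row.value)

In practice this job runs on a schedule or off a change stream, and its lag is exactly the feature-freshness metric discussed in the monitoring section below.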

Training Pipeline

Orchestrated by a workflow engine (Airflow, Kubeflow, Metaflow). Stages:
- Data Validation — check for schema drift, null rates, and feature distribution shifts (Great Expectations or TFX).
- Feature Generation — point-in-time join of labels with offline features.
- Model Training — distributed training on a GPU cluster (Kubernetes + GPU nodes); hyperparameter tuning via Ray Tune or Optuna.
- Evaluation — compute held-out metrics; compare against the production model (challenger vs. champion).
- Model Registration — if the challenger beats the champion by a threshold, register it in the Model Registry (MLflow) with metadata: metrics, training data hash, feature schema version, code SHA. A sketch of this promotion gate follows the list.
- Deployment Trigger — the CI/CD pipeline picks up the newly registered model version.
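
A minimal sketch of the challenger/champion gate. The names (CandidateModel, should_promote) and the 0.005 margin are assumptions, not a particular registry's API; the metadata fields mirror what the registration step records.

from dataclasses import dataclass


@dataclass
class CandidateModel:
    version: str
    auc: float
    training_data_hash: str
    feature_schema_version: str
    code_sha: str


def should_promote(champion_auc: float, challenger: CandidateModel,
                   min_gain: float = 0.005) -> bool:
    # Register the challenger only when it beats the champion by a clear margin,
    # so noise-level "improvements" never trigger a deployment.
    return challenger.auc >= champion_auc + min_gain

The margin guards against promoting on evaluation noise; a stricter variant compares confidence intervals from bootstrapped evaluation sets.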

Model Serving: Online Inference

from datetime import datetime

# PredictRequest, PredictResponse, and PredictionLog are assumed to be plain
# request/response/log containers defined alongside the service.
class InferenceService:
    def predict(self, request: PredictRequest) -> PredictResponse:
        # 1. Fetch real-time features from online store
        features = self.feature_store.get_online_features(
            entity_id=request.entity_id,
            feature_names=self.model.required_features
        )

        # 2. Optionally join with request-time context features
        features.update(request.context_features)

        # 3. Run model inference
        score = self.model.predict(features)

        # 4. Log prediction for training feedback loop
        self.prediction_logger.log(PredictionLog(
            entity_id=request.entity_id,
            features=features,
            score=score,
            model_version=self.model.version,
            timestamp=datetime.utcnow()
        ))
        return PredictResponse(score=score)

Latency budget for online inference: feature retrieval < 5ms, model forward pass < 10ms, total < 20ms p99. Achieve this with model quantization (INT8) and ONNX export + TensorRT; use batch inference for non-real-time use cases. Shadow mode: run the new model version in parallel without serving its results to users, and compare its predictions to the production model's to catch regressions before cutover.
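
A shadow-mode sketch under stated assumptions: ShadowModeRouter and the logger name are invented here, and in a real service the challenger call would typically run asynchronously so it never adds latency to the served path.

import logging

logger = logging.getLogger("shadow_compare")


class ShadowModeRouter:
    # Serves the champion's score; scores the challenger silently for comparison.
    def __init__(self, champion, challenger):
        self.champion = champion
        self.challenger = challenger

    def predict(self, features: dict) -> float:
        served = self.champion.predict(features)
        try:
            shadow = self.challenger.predict(features)
            # Both scores are logged for offline diffing; users only ever see `served`.
            logger.info("served=%s shadow=%s", served, shadow)
        except Exception:
            # A broken challenger must never affect the live response.
            logger.exception("shadow model failed")
        return served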

Monitoring: Training-Serving Skew and Data Drift

Log feature distributions at serving time and compare them to the training data distributions using PSI (Population Stability Index): PSI = sum((actual% - expected%) * ln(actual% / expected%)). Rule of thumb: PSI < 0.1 means no significant shift; 0.1–0.25 means moderate drift, investigate; > 0.25 means significant drift, retrain. Model performance monitoring: track the prediction score distribution (a score distribution shift often precedes metric degradation), and compare against ground truth labels once available (labels are typically delayed). Set up automated retraining triggers when drift metrics exceed thresholds. Feature freshness: monitor Redis TTLs and feature update lag — stale features silently degrade model quality without raising errors.
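
A sketch of the PSI computation, assuming NumPy and using the training ("expected") scores to fix the bin edges; the function and variable names are illustrative.

import numpy as np


def population_stability_index(expected, actual, bins: int = 10) -> float:
    # Bin both samples on edges derived from the training distribution.
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid log(0) and division by zero when a bin is empty.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

Serving-time values that fall outside the training range land in no bin with this simple scheme; production implementations usually add open-ended edge bins.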

See also: Databricks Interview Prep

See also: Meta Interview Prep

See also: Netflix Interview Prep
