System Design: ML Feature Store — Feature Computation, Storage, Serving, and Point-in-Time Correctness

What Is a Feature Store?

A feature store is a centralized platform for computing, storing, serving, and sharing ML features. Without one, every team re-computes the same features (wasted effort), training data uses features computed differently than at serving time (training-serving skew), and features built for one model are unavailable to others. With a feature store you get shared feature definitions, consistency between training and serving, and low-latency online lookup. Key components: a feature registry (metadata and definitions), an offline store (historical features for training), an online store (latest features for real-time serving), and feature pipelines (computation).
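As a concrete picture of the registry component, here is a minimal sketch of what a registry entry might hold. The field names and `register` helper are illustrative, not any particular platform's API:

```python
from dataclasses import dataclass

@dataclass
class FeatureDefinition:
    """One registry entry: the single canonical definition shared by the
    training pipeline and the serving layer. Fields are illustrative."""
    name: str
    entity: str      # which entity the feature is keyed on, e.g. "user"
    dtype: str       # e.g. "int", "float"
    source: str      # SQL / Spark / Python code that computes the feature
    freshness: str   # "batch-daily", "streaming", or "on-demand"

# A toy in-memory registry: name -> definition.
registry = {}

def register(feature: FeatureDefinition) -> None:
    registry[feature.name] = feature

register(FeatureDefinition(
    name="purchase_count_30d",
    entity="user",
    dtype="int",
    source="SELECT COUNT(*) FROM purchases WHERE ts > now() - INTERVAL 30 DAY",
    freshness="batch-daily",
))
print(registry["purchase_count_30d"].entity)  # user
```

Because every consumer reads the same `source`, the definition cannot silently diverge between training and serving.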

Offline Store

The offline store provides historical features for model training and batch inference. It is typically a data warehouse (BigQuery, Snowflake, Redshift) or a data lake (S3 + Parquet). Features are stored as time series: (entity_id, timestamp, feature_value). Point-in-time correct joins: when creating a training dataset, for each training label at timestamp T, fetch feature values as they existed at time T, not the current values. This prevents data leakage (using future information during training). Implementation: for each (entity_id, label_timestamp) row e in the training set, find the most recent feature row with timestamp <= label_timestamp, e.g. as a correlated subquery: SELECT f.feature_value FROM features f WHERE f.entity_id = e.entity_id AND f.timestamp <= e.label_timestamp ORDER BY f.timestamp DESC LIMIT 1.
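The point-in-time join above can be sketched with pandas `merge_asof`, which picks, for each label row, the most recent feature row at or before the label timestamp (table and column names here are illustrative):

```python
import pandas as pd

# Historical feature rows: one value per (entity, timestamp).
features = pd.DataFrame({
    "entity_id": [1, 1, 2, 2],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-10",
                                 "2024-01-02", "2024-01-20"]),
    "purchase_count_30d": [3, 5, 1, 4],
})

# Training labels: each row needs features as they existed at label_timestamp.
labels = pd.DataFrame({
    "entity_id": [1, 2],
    "label_timestamp": pd.to_datetime(["2024-01-15", "2024-01-05"]),
    "label": [1, 0],
})

# direction="backward" selects the latest feature row with
# timestamp <= label_timestamp -- the point-in-time correct value.
# Note: entity 2's 2024-01-20 row is in the future relative to its
# label and is correctly ignored (no leakage).
training = pd.merge_asof(
    labels.sort_values("label_timestamp"),
    features.sort_values("timestamp"),
    left_on="label_timestamp",
    right_on="timestamp",
    by="entity_id",
    direction="backward",
)
print(training[["entity_id", "label_timestamp", "purchase_count_30d"]])
```

Entity 1 gets the value 5 (from 2024-01-10, the latest row before its 2024-01-15 label), and entity 2 gets 1 (from 2024-01-02), not the later value 4.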

Online Store

The online store serves the latest feature values at low latency for real-time model inference. Storage: Redis (sub-millisecond reads), DynamoDB, or Cassandra. Schema: {entity_id -> {feature_name: value}}. Latency target: under 5 ms p99 per feature lookup. Write path: either a streaming feature pipeline computes features and writes them to the online store, or a sync job periodically copies the latest values from the offline store. Freshness: real-time features ("user clicked this item 5 minutes ago") require streaming pipelines; slowly-changing features (user age, account tier) can tolerate daily batch updates. Pre-fetch: for high-traffic entities, pre-populate the online store before the model server requests them.
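The read/write interface is small, which a dict-backed sketch can show. This stands in for Redis/DynamoDB; the optional `max_age_s` staleness check is one common design choice, not part of any specific product's API:

```python
import time

class OnlineStore:
    """Minimal in-memory stand-in for an online feature store.
    Production systems back this with Redis or DynamoDB; the interface
    is the same: write the latest value per (entity, feature), read by entity."""

    def __init__(self):
        self._data = {}  # entity_id -> {feature_name: (value, written_at)}

    def write(self, entity_id, feature_name, value):
        self._data.setdefault(entity_id, {})[feature_name] = (value, time.time())

    def read(self, entity_id, feature_names, max_age_s=None):
        # Return the latest values; optionally drop stale features so the
        # model can fall back to a default instead of using old data.
        row = self._data.get(entity_id, {})
        now = time.time()
        out = {}
        for name in feature_names:
            if name in row:
                value, written_at = row[name]
                if max_age_s is None or now - written_at <= max_age_s:
                    out[name] = value
        return out

store = OnlineStore()
store.write("user_42", "cart_value", 87.50)
store.write("user_42", "account_tier", "gold")
print(store.read("user_42", ["cart_value", "account_tier"]))
# {'cart_value': 87.5, 'account_tier': 'gold'}
```

In Redis the same layout maps naturally onto one hash per entity (`HSET user:42 cart_value 87.5`), keeping a multi-feature lookup to a single round trip.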

Feature Computation Pipelines

Batch features: computed by scheduled jobs (Apache Spark, dbt). Examples: user lifetime value, 30-day purchase count. Run daily or hourly; written to the offline store, with the latest values synced to the online store. Streaming features: computed from real-time event streams (Kafka + Flink/Spark Streaming). Examples: items viewed in the last 5 minutes, current cart value, real-time fraud signals. Written directly to the online store. On-demand features: computed at serving time from raw inputs rather than pre-computed. Examples: distance between user and restaurant (requires live GPS). Computed in the feature-serving layer using fast in-memory lookups. The feature store supports all three types; the choice depends on freshness requirements and computation cost.
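A streaming feature like "items viewed in the last 5 minutes" reduces to a sliding-window count per entity. This is a single-process sketch of what a Flink/Spark Streaming job would maintain, with explicit timestamps instead of a real event stream:

```python
from collections import deque

class SlidingWindowCount:
    """Streaming feature: count of events per entity within the last
    window_s seconds. Each update yields the fresh feature value, which
    a real pipeline would immediately write to the online store."""

    def __init__(self, window_s=300):
        self.window_s = window_s
        self.events = {}  # entity_id -> deque of event timestamps

    def on_event(self, entity_id, ts):
        q = self.events.setdefault(entity_id, deque())
        q.append(ts)
        # Evict events that have aged out of the window.
        while q and ts - q[0] > self.window_s:
            q.popleft()
        return len(q)  # latest feature value

counter = SlidingWindowCount(window_s=300)
counter.on_event("user_1", ts=0)
counter.on_event("user_1", ts=100)
print(counter.on_event("user_1", ts=350))  # event at ts=0 aged out -> 2
```

Real stream processors add what this sketch omits: event-time watermarks for late events, checkpointed state for fault tolerance, and partitioning by entity for scale.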

Training-Serving Skew Prevention

Training-serving skew: the model is trained on features computed one way and served with features computed differently. This is a top cause of model performance degradation in production. Prevention: (1) Use the same feature definitions for both training and serving. The feature registry stores the canonical definition (SQL, Spark, or Python code). Both the training pipeline and serving layer run this same definition. (2) Log served features: when the model serves a prediction, log the feature values used. Use these logged features as training data for future model versions (ensures training data matches serving distribution exactly). (3) Shadow evaluation: during training data generation, compute features using both the current pipeline and the legacy pipeline; alert on discrepancies.
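Prevention (1) comes down to one rule: a single function is the feature. The toy `purchase_rate` definition below is hypothetical, but it shows the pattern, with both the training pipeline and the serving path calling the same registered code:

```python
# Hypothetical shared feature definition: registered once in the feature
# registry, executed by both the training pipeline and the serving layer.
def purchase_rate(purchases: int, sessions: int) -> float:
    """Canonical definition of the purchase_rate feature.
    Edge-case handling (zero sessions) lives here, in exactly one place."""
    return purchases / sessions if sessions else 0.0

# Training pipeline: applied over historical rows.
training_rows = [{"purchases": 3, "sessions": 10},
                 {"purchases": 0, "sessions": 0}]
train_features = [purchase_rate(r["purchases"], r["sessions"])
                  for r in training_rows]

# Serving layer: the exact same function, applied to a live request.
serve_feature = purchase_rate(purchases=3, sessions=10)

assert serve_feature == train_features[0]  # identical by construction
print(train_features, serve_feature)
```

Had the serving layer re-implemented the ratio and handled `sessions == 0` differently (say, returning `NaN`), the model would see a feature distribution at inference time that never appeared in training, which is exactly the skew this design rules out.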

Interview Tips

  • Feature stores solve the dual problem of offline (training) and online (serving) consistency — explain both stores.
  • Point-in-time correctness is the critical concept that prevents data leakage in training.
  • Training-serving skew is the #1 production ML bug — feature stores prevent it by sharing definitions.
  • Platforms: Feast (open-source), Tecton (managed), Vertex AI Feature Store (GCP), SageMaker Feature Store (AWS).

Asked at: Databricks, Netflix, Uber, LinkedIn
