System Design Interview: Design a Feature Flag System
Feature flags (feature toggles) enable teams to deploy code to production without activating it for users, control gradual rollouts, and kill-switch problematic features instantly. Systems like LaunchDarkly, Optimizely, and Statsig power feature flags at scale. This guide covers the architecture of a production feature flag system.
Requirements
Functional: create and manage flags (on/off), target flags to specific users/groups/percentages, evaluate flags in real-time with <5ms latency, support A/B experiments with assignment logging, instant flag updates without deployment.
Non-functional: 10K flag evaluations/second, flag evaluation must never fail (fallback to defaults if flag service is unavailable), 100ms maximum propagation delay for flag updates.
Flag Data Model
Flag {
  key: "new_checkout_flow",
  description: "New checkout UI for A/B test",
  enabled: true,
  default_variation: "off",
  rules: [
    {targeting: {user_ids: ["user1", "user2"]}, variation: "on"},
    {targeting: {group: "beta_users"}, variation: "on"},
    {targeting: {percentage: 10}, variation: "on"},   // 10% of users
    {targeting: "everyone", variation: "off"}         // default fallthrough
  ],
  variations: {
    "on":  {value: true,  metadata: {experiment_arm: "treatment"}},
    "off": {value: false, metadata: {experiment_arm: "control"}}
  }
}
Flag Evaluation Engine
Rules are evaluated top-to-bottom; the first matching rule wins:
def evaluate_flag(flag_key, user_context, default="off"):
    # user_context = {user_id, groups, attributes}
    flag = get_flag(flag_key)  # from local SDK cache
    if flag is None:
        return default  # no cached copy: fall back to a safe default
    if not flag.enabled:
        return flag.default_variation
    for rule in flag.rules:
        if matches_rule(rule.targeting, user_context, flag_key):
            return rule.variation
    return flag.default_variation

def matches_rule(targeting, user_context, flag_key):
    if targeting == "everyone":
        return True  # catch-all rule matches unconditionally
    if "user_ids" in targeting:
        return user_context["user_id"] in targeting["user_ids"]
    if "group" in targeting:
        return targeting["group"] in user_context["groups"]
    if "percentage" in targeting:
        # Consistent bucketing for stable assignment. The hash must be
        # deterministic across processes (Python's builtin hash() is
        # salted per process and would not be stable).
        bucket = stable_hash(user_context["user_id"] + flag_key) % 100
        return bucket < targeting["percentage"]
    return False
Consistent bucketing: the same user always gets the same flag variant. Hashing user_id + flag_key and taking the result modulo 100 is stable across requests and ensures A/B test groups don't flicker between page loads. The hash function must itself be deterministic across processes and machines (a cryptographic or non-cryptographic fingerprint hash, not a per-process salted hash).
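A minimal sketch of stable percentage bucketing using hashlib (chosen here because Python's builtin hash() is salted per process; function names are illustrative):

```python
import hashlib

def bucket_for(user_id: str, flag_key: str) -> int:
    """Map a (user, flag) pair to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{user_id}:{flag_key}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(user_id: str, flag_key: str, percentage: int) -> bool:
    """True if this user falls inside the rollout percentage for this flag."""
    return bucket_for(user_id, flag_key) < percentage
```

Because the flag key is part of the hash input, a user's bucket for one flag is independent of their bucket for another, so overlapping experiments don't share the same 10% of users.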
SDK Architecture (Client-Side)
The flag SDK runs in the application process (not as a network call per evaluation):
Application startup:
1. SDK fetches all flags from Flag Service → local in-memory cache
2. SDK subscribes to real-time updates (SSE or WebSocket)
Flag evaluation (in-process, ~0.1ms):
sdk.variation("new_checkout_flow", user_context)
→ reads from local cache
→ no network call
Real-time updates:
Flag Service → publishes to Redis Pub/Sub → SDK receives update → updates local cache
This architecture means flag evaluations are always fast (local cache) and always available (no dependency on external service for each evaluation). Even if the Flag Service goes down, the SDK continues serving the last known flag state.
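The SDK-local-cache pattern can be sketched as a small class. This is an illustrative simplification (real SDKs also handle typed variations, rule evaluation, and reconnection logic):

```python
import threading

class FlagSDK:
    """Minimal in-process flag SDK: evaluations read a local cache,
    never the network. Illustrative sketch, not a vendor SDK."""

    def __init__(self, initial_flags: dict):
        self._lock = threading.Lock()
        self._flags = dict(initial_flags)   # flag_key -> value

    def on_update(self, flag_key: str, value: bool) -> None:
        # Called by the real-time listener thread (SSE/WebSocket).
        with self._lock:
            self._flags[flag_key] = value

    def variation(self, flag_key: str, default: bool = False) -> bool:
        # In-process dictionary lookup: no network call, no hard
        # dependency on the Flag Service being up.
        with self._lock:
            return self._flags.get(flag_key, default)

sdk = FlagSDK({"new_checkout_flow": True})
```

The caller-supplied default is the availability guarantee: an unknown or never-fetched flag degrades to a safe value instead of raising or blocking.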
Flag Service Architecture
Flag Management UI (create/edit flags)
↓
Flag API (CRUD, authentication)
↓
Flag Database (PostgreSQL: source of truth for flag configs)
↓
Cache Layer (Redis: serves flag state to SDKs)
↓
Real-time Update Channel (Redis Pub/Sub → SSE → SDK local caches)
When a flag is updated: write to PostgreSQL → update Redis cache → publish update event to Redis Pub/Sub → all subscribed SDK instances receive the event within 100ms and update their local cache.
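The write path above can be simulated in-process. The dictionaries and the subscriber list stand in for PostgreSQL, Redis, and the Pub/Sub channel respectively; this only demonstrates the ordering of the three steps:

```python
class FlagService:
    """Simulates the update path: persist -> refresh cache -> fan out."""

    def __init__(self):
        self.db = {}            # stand-in for PostgreSQL (source of truth)
        self.cache = {}         # stand-in for Redis
        self.subscribers = []   # stand-in for Pub/Sub subscribers (SDKs)

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def update_flag(self, flag_key, enabled):
        self.db[flag_key] = enabled        # 1. write the source of truth
        self.cache[flag_key] = enabled     # 2. refresh the cache layer
        for notify in self.subscribers:    # 3. fan out to SDK local caches
            notify(flag_key, enabled)

service = FlagService()
sdk_cache = {}
service.subscribe(lambda key, value: sdk_cache.__setitem__(key, value))
service.update_flag("new_checkout_flow", False)
```

Writing the database before publishing matters: an SDK that reconnects and re-fetches all flags must never see state older than what was just broadcast.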
A/B Experiment Logging
For experiments to be statistically valid, every flag evaluation must be logged:
ExposureEvent {
timestamp: 1713200000,
user_id: "user_12345",
flag_key: "new_checkout_flow",
variation: "on",
experiment_arm: "treatment"
}
Events are logged asynchronously to Kafka to avoid adding latency to flag evaluation. Downstream, a Spark job aggregates exposures and outcomes (purchases, click-through rates) per experiment arm to compute statistical significance.
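The async logging path can be sketched with a queue and a background worker. Here an in-memory queue and a list stand in for the Kafka producer buffer and the Kafka topic:

```python
import json
import queue
import threading
import time

exposure_queue = queue.Queue()   # stand-in for a Kafka producer buffer
shipped_events = []              # stand-in for the Kafka topic

def log_exposure(user_id, flag_key, variation, experiment_arm):
    # Enqueue and return immediately: flag evaluation latency is
    # unaffected by the downstream logging pipeline.
    exposure_queue.put({
        "timestamp": int(time.time()),
        "user_id": user_id,
        "flag_key": flag_key,
        "variation": variation,
        "experiment_arm": experiment_arm,
    })

def drain():
    # Background worker: ships queued events until it sees the sentinel.
    while True:
        event = exposure_queue.get()
        if event is None:
            break
        shipped_events.append(json.dumps(event))

worker = threading.Thread(target=drain, daemon=True)
worker.start()
log_exposure("user_12345", "new_checkout_flow", "on", "treatment")
exposure_queue.put(None)   # sentinel: flush and stop the worker
worker.join()
```

A real producer would also batch events and handle broker failures; the essential property shown here is that the evaluation path only pays for an enqueue.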
Emergency Kill Switch
A kill switch is a flag that can instantly disable a problematic feature for all users:
- Set flag enabled=false → propagates to all SDKs within 100ms
- No code deployment required — a single API call or UI click
- On-call engineers have access to the management UI 24/7
- Alert when a kill switch is activated (PagerDuty notification)
Interview Tips
- The SDK-local-cache pattern is the key insight: flag evaluation must never depend on a network call
- Consistent hashing for percentage rollouts prevents flickering — explain this explicitly
- Real-time updates via Pub/Sub enable <100ms propagation without polling
- Mention A/B experiment logging via Kafka — shows you understand the analytics use case
- The kill switch use case is compelling to mention — directly relevant to production reliability