Why must feature flag evaluation be local and not a network call?

Feature flags are evaluated on every request or user interaction - potentially millions of times per second. A network call to a remote flag service would add 10-100ms of latency to every operation, which is unacceptable for production APIs. The solution is SDK-side caching: the SDK fetches all flag rules at startup, caches them in local memory, and evaluates flags with pure local logic (microseconds). Updates stream in via SSE in the background without blocking request processing.

How does percentage rollout work in feature flags?

Percentage rollout uses deterministic hashing: bucket = hash(user_id + flag_key) % 100. Users with bucket 0-9 are in the 10% rollout, 0-49 in the 50% rollout. The combination of user_id and flag_key ensures: (1) same user always gets same bucket for this flag (sticky), (2) different flags have independent bucket assignments. Increasing from 10% to 20% adds users with buckets 10-19, so the 10% group sees no change.
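The bucketing described above can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the choice of SHA-256, the `:` separator, and the helper names `bucket` and `in_rollout` are assumptions made here. A stable hash is used because Python's built-in `hash()` is randomized per process for strings and would not keep users sticky across servers.

```python
import hashlib

def bucket(user_id: str, flag_key: str) -> int:
    """Map (user_id, flag_key) to a stable bucket in [0, 100)."""
    # A separator reduces accidental collisions such as ("ab", "c") vs ("a", "bc").
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def in_rollout(user_id: str, flag_key: str, percent: int) -> bool:
    """A user is in the rollout when their bucket is below the percentage."""
    return bucket(user_id, flag_key) < percent

users = [f"user-{i}" for i in range(10_000)]
at_10 = {u for u in users if in_rollout(u, "new-checkout-flow", 10)}
at_20 = {u for u in users if in_rollout(u, "new-checkout-flow", 20)}

# Raising the rollout from 10% to 20% only adds users; nobody is removed.
print(at_10 <= at_20)
```

Because the bucket depends only on the user and the flag key, every server computes the same answer without coordinating.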

What is the difference between feature flags and A/B testing?

Feature flags control whether a feature is visible/active for a user segment. A/B testing measures the impact of a change by running multiple variants simultaneously and tracking metrics (conversion, retention, revenue). Feature flags are the mechanism; A/B testing is the methodology. LaunchDarkly and GrowthBook support both: use flag targeting to split users into variants, then analyze metrics per variant in an analytics system.

How do you handle the case when the flag service is down?

The SDK should continue operating with cached flag rules. If the service is unavailable on startup, the SDK serves default values (fallback). Once connected, rules are cached locally, so a service outage after startup does not affect flag evaluations. The SDK should retry connection with exponential backoff. For critical systems, the SDK can persist flag rules to local disk as an additional fallback. Never let flag service downtime take down the main application.
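A rough sketch of that startup path, under stated assumptions: `fetch` is a hypothetical callable standing in for the SDK's network request, `load_rules` and its parameters are names invented for this example, and the retry is shown inline for brevity (a real SDK would more likely keep retrying in the background rather than block startup).

```python
import json
import os
import tempfile
import time

def load_rules(fetch, cache_path, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Return (rules, source): try the service, then the disk copy, then defaults."""
    for attempt in range(attempts):
        try:
            rules = fetch()
        except OSError:  # ConnectionError and TimeoutError are subclasses of OSError
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # exponential backoff: 0.5s, 1s, 2s, ...
            continue
        # Persist the fresh rules; write to a temp file, then swap it in atomically.
        tmp_path = cache_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as f:
            json.dump(rules, f)
        os.replace(tmp_path, cache_path)
        return rules, "service"
    try:
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f), "disk"
    except (OSError, ValueError):
        return {}, "defaults"  # no rules at all: callers get their coded defaults

def failing_fetch():
    raise ConnectionError("flag service unreachable")

cache_file = os.path.join(tempfile.mkdtemp(), "flags.json")
delays = []
rules, source = load_rules(failing_fetch, cache_file, sleep=delays.append)
print(source, delays)  # defaults [0.5, 1.0]
```

The application starts either way; the only difference is how fresh the rules are.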

What is a kill switch and when would you use it?

A kill switch is a feature flag used to turn a feature off for all users immediately, with no code deployment. Setting the flag status to INACTIVE propagates via SSE to connected SDK instances, typically within about a second. Typical uses: disabling a broken feature that is causing production errors, quickly reverting a failed experiment, and responding to a security vulnerability. This is the primary operational benefit of feature flags over conditional code that requires a deployment to disable.

System Design Interview: Design a Feature Flag System

Feature flag (feature toggle) systems allow engineers to enable or disable features at runtime without deploying code. They support gradual rollouts, A/B testing, and instant rollback. This question is asked at LinkedIn, Atlassian, Shopify, and other growth-focused companies.

Requirements Clarification

Functional Requirements

Create and manage feature flags with multiple targeting rules
Target flags by user ID, percentage rollout, country, user segment, or custom attributes
Evaluate flags in real-time (<1ms latency)
Gradual rollout: increase percentage from 0% to 100%
Kill switch: instantly disable a feature for all users
Audit log: track who changed what and when

Non-Functional Requirements

Scale: 1B flag evaluations/day, 10K flags, 100M users
Latency: <1ms for flag evaluation (must not slow down critical paths)
Availability: 99.99% (flag service outage should not take down main application)
Consistency: eventual OK (brief inconsistency during flag updates acceptable)
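A quick back-of-the-envelope check of these numbers. The per-flag size is an assumption made for this estimate (the requirements do not state one), and the evaluation rate is a daily average, so peaks will be higher.

```python
EVALS_PER_DAY = 1_000_000_000
FLAG_COUNT = 10_000
SECONDS_PER_DAY = 24 * 60 * 60

# Average evaluation rate across the whole fleet.
avg_evals_per_sec = EVALS_PER_DAY / SECONDS_PER_DAY  # about 11,574 per second

# Assumption: roughly 2 KiB of rules per flag once serialized.
ASSUMED_BYTES_PER_FLAG = 2_048
cache_bytes = FLAG_COUNT * ASSUMED_BYTES_PER_FLAG  # 20,480,000 bytes, about 20 MB

print(f"{avg_evals_per_sec:,.0f} evaluations/sec on average")
print(f"{cache_bytes / 1_000_000:.1f} MB of rules per SDK instance")
```

Under that assumption the full rule set fits comfortably in each application process's memory, which is what makes local evaluation practical.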

Core Concept: Client-Side SDK

The key architectural insight is that flag evaluation happens inside the application process, not via a network call. SDKs that follow the LaunchDarkly, Unleash, and GrowthBook model work as follows:

On startup, SDK fetches all flag configurations from flag service
SDK caches rules in local memory
Flag evaluation: pure local computation using cached rules (microseconds)
SDK subscribes to streaming updates (SSE or WebSocket) for real-time rule changes
Fallback: if streaming disconnects, poll every 30s

This eliminates network latency from the hot path: each evaluation is an in-memory lookup followed by a few rule checks.
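The shape of such an SDK can be sketched in a few lines. This is a simplified illustration: `FlagClient`, `fetch_all`, and `apply_update` are hypothetical names, and each flag is reduced to a single `enabled` boolean instead of full targeting rules.

```python
import threading

class FlagClient:
    """Holds all flag configs in memory; evaluation never touches the network."""

    def __init__(self, fetch_all):
        self._lock = threading.Lock()
        self._flags = dict(fetch_all())  # the only network call, made once at startup

    def apply_update(self, key, config):
        """Called by the background streaming or polling thread when a flag changes."""
        with self._lock:
            self._flags[key] = config

    def is_enabled(self, key, default=False):
        """Hot path: an in-memory lookup guarded by a lock."""
        with self._lock:
            config = self._flags.get(key)
        if config is None:
            return default  # unknown flag: fall back to the caller's default
        return config["enabled"]

network_calls = []

def fetch_all():
    # Stand-in for the HTTP request to the flag service.
    network_calls.append("fetch")
    return {"new-checkout-flow": {"enabled": True}}

client = FlagClient(fetch_all)
results = [client.is_enabled("new-checkout-flow") for _ in range(1000)]
print(len(network_calls))  # 1
```

A thousand evaluations cost one fetch; everything after startup is served from memory while updates arrive on a separate thread.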

Flag Data Model

Flag:
  id: string
  key: string (e.g., "new-checkout-flow")
  status: ACTIVE | INACTIVE | ARCHIVED
  variations: [{value: true}, {value: false}]  # or strings, numbers
  targeting_rules: [
    {
      condition: {attribute: "country", operator: "in", values: ["US", "CA"]},
      variation: 0  # index into variations
    },
    {
      condition: {attribute: "user_id", operator: "in_percentage", values: [0, 10]},
      variation: 0  # 10% rollout
    }
  ]
  default_variation: 1  # fallback if no rule matches
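
One possible in-memory representation of this model, sketched with dataclasses. The class names and the validation in `__post_init__` are choices made for this example, and variations are stored as plain values rather than `{value: ...}` objects.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    ARCHIVED = "ARCHIVED"

@dataclass(frozen=True)
class Condition:
    attribute: str
    operator: str
    values: tuple

@dataclass(frozen=True)
class Rule:
    condition: Condition
    variation: int  # index into Flag.variations

@dataclass(frozen=True)
class Flag:
    id: str
    key: str
    status: Status
    variations: tuple
    targeting_rules: tuple
    default_variation: int

    def __post_init__(self):
        # Reject configs whose variation indexes point outside the variations list.
        indexes = [rule.variation for rule in self.targeting_rules]
        indexes.append(self.default_variation)
        for index in indexes:
            if not 0 <= index < len(self.variations):
                raise ValueError(f"variation index {index} out of range")

flag = Flag(
    id="1",
    key="new-checkout-flow",
    status=Status.ACTIVE,
    variations=(True, False),
    targeting_rules=(
        Rule(Condition("country", "in", ("US", "CA")), variation=0),
        Rule(Condition("user_id", "in_percentage", (0, 10)), variation=0),
    ),
    default_variation=1,
)
```

Freezing the objects means a flag update replaces the whole `Flag` in the cache, so a request never sees a half-applied change.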

Flag Evaluation Algorithm

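import hashlib

def evaluate(flag_key, user_context, default=False):
    # local_cache is the SDK's in-memory dict of flag_key -> flag config
    flag = local_cache.get(flag_key)
    if flag is None:
        return default  # unknown flag: serve the caller's default
    if flag.status != ACTIVE:
        return flag.variations[flag.default_variation]

    # Rules are checked in order; the first match wins
    for rule in flag.targeting_rules:
        if matches_condition(rule.condition, user_context, flag_key):
            return flag.variations[rule.variation]

    return flag.variations[flag.default_variation]

def bucket_for(user_id, flag_key):
    # Use a stable hash: Python's built-in hash() is randomized per process for strings
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def matches_condition(condition, user_context, flag_key):
    if condition.operator == "in_percentage":
        # Deterministic: the same user and flag always land in the same bucket
        bucket = bucket_for(user_context["user_id"], flag_key)
        return condition.values[0] <= bucket < condition.values[1]
    if condition.operator == "in":
        return user_context.get(condition.attribute) in condition.values
    # ... other operators
    return False  # unknown operator: treat as no match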

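Deterministic Hashing for Percentage Rollout

Use hash(user_id + flag_key) % 100 for bucket assignment, with a hash function that is stable across processes and languages (for example SHA-256 or MurmurHash, not a per-process randomized built-in). This ensures three properties: the same user always gets the same bucket for the same flag (sticky); different flags bucket users independently; and raising the rollout from 0% toward 100% only ever adds users, in a fixed order, so nobody who already has the feature loses it.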

Architecture

Engineers use UI/API to create/modify flags
    |
Flag Service (CRUD API)
    |
PostgreSQL (source of truth)
    |
Change events -> Kafka
    |
Streaming Update Service (SSE/WebSocket)
    |
SDK instances in application servers (local cache)
    |
Flag evaluation (local, microseconds)

Real-Time Updates

When a flag changes, updates propagate to all SDK instances:

Server-Sent Events (SSE): the SDK maintains a persistent HTTP connection to the streaming service, and the server pushes an update on each flag change. Simple, works through load balancers, one-directional.
WebSocket: bi-directional, better for high-frequency updates. Flag updates only flow from server to SDK, so SSE is usually sufficient.
Propagation latency: <1 second for 99% of SDK instances
Fallback: SDK polls every 30s if streaming connection drops
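
To make the SSE path concrete, here is a small sketch of how an SDK might turn the event stream into cache updates. The parser follows the standard SSE wire format (`event:` and `data:` fields, a blank line ends an event, lines starting with `:` are comments). The `flag-update` event name and the JSON payload shape are assumptions for this example, not part of any specific product's protocol.

```python
import json

def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far, if it has data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space is not part of the value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)

def apply_stream(lines, cache):
    """Apply each streamed flag update to the local cache."""
    for event, data in parse_sse(lines):
        if event == "flag-update":  # hypothetical event name
            flag = json.loads(data)
            cache[flag["key"]] = flag

stream = [
    ": keep-alive\n",
    "event: flag-update\n",
    'data: {"key": "new-checkout-flow", "status": "INACTIVE"}\n',
    "\n",
]
cache = {"new-checkout-flow": {"key": "new-checkout-flow", "status": "ACTIVE"}}
apply_stream(stream, cache)
print(cache["new-checkout-flow"]["status"])  # INACTIVE
```

If the connection drops, the same cache update can be driven by a periodic full fetch instead, which is the polling fallback listed above.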

Percentage Rollout and A/B Testing

Gradual rollout: increase percentage in flag config (5% → 20% → 50% → 100%). Users in bucket 0-4 see feature at 5%, 0-19 at 20%.
A/B testing: run two variations simultaneously (50/50 split). Track conversion metrics per variation in analytics system.
Multi-variate testing: multiple variations (A/B/C/D). Each variation gets a percentage range.
Sticky assignment: hash-based bucketing ensures the same user sees the same variation consistently.
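
Multi-variate allocation can be sketched by giving each variation a contiguous slice of the 0-99 bucket range. The function names, the SHA-256 bucketing, and the `checkout-button-color` flag are illustrative assumptions.

```python
import hashlib
from collections import Counter

def bucket(user_id: str, flag_key: str) -> int:
    """Stable bucket in [0, 100) derived from the user and the flag."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def pick_variation(user_id, flag_key, allocation):
    """allocation is a list of (variation, percent) pairs that must sum to 100."""
    if sum(percent for _, percent in allocation) != 100:
        raise ValueError("allocation percentages must sum to 100")
    user_bucket = bucket(user_id, flag_key)
    upper = 0
    for variation, percent in allocation:
        upper += percent  # this variation owns buckets [upper - percent, upper)
        if user_bucket < upper:
            return variation
    raise AssertionError("unreachable when percentages sum to 100")

allocation = [("A", 25), ("B", 25), ("C", 25), ("D", 25)]
counts = Counter(
    pick_variation(f"user-{i}", "checkout-button-color", allocation)
    for i in range(10_000)
)
print(sorted(counts))  # ['A', 'B', 'C', 'D']
```

A 50/50 A/B test is the two-variation case of the same function; the analytics system then compares metrics grouped by the returned variation.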

Kill Switch and Rollback

Kill switch = set flag status to INACTIVE. Propagates via streaming to all SDKs within 1 second. All users see default variation (feature off). No deployment needed. This is the primary benefit over code-level feature gating.
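
A minimal sketch of the kill switch from the SDK's point of view. The `serve` field is a hypothetical stand-in for the targeting rules, and `kill` simulates the streamed update an SDK would receive after someone flips the flag in the UI.

```python
local_cache = {
    "new-checkout-flow": {
        "status": "ACTIVE",
        "variations": [True, False],
        "serve": 0,              # simplification: variation served while ACTIVE
        "default_variation": 1,  # feature off
    }
}

def evaluate(flag_key):
    flag = local_cache[flag_key]
    if flag["status"] != "ACTIVE":
        # Inactive flags skip all targeting and serve the default variation.
        return flag["variations"][flag["default_variation"]]
    return flag["variations"][flag["serve"]]

def kill(flag_key):
    # Stands in for the update pushed over the streaming connection.
    local_cache[flag_key] = {**local_cache[flag_key], "status": "INACTIVE"}

before = evaluate("new-checkout-flow")
kill("new-checkout-flow")
after = evaluate("new-checkout-flow")
print(before, after)  # True False
```

No application code changed between the two evaluations; only the cached flag status did.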

Interview Tips

Lead with SDK caching – flag evaluation must be <1ms so it cannot be a network call
Explain streaming updates (SSE) for real-time propagation
Describe hash-based bucketing for deterministic percentage rollout
Know the difference between feature flags and A/B testing (flags control visibility; A/B tests measure impact)
Mention GrowthBook, LaunchDarkly, Unleash as real implementations
