System Design Interview: Design a Feature Flag System
Feature flags (feature toggles) enable teams to deploy code to production without activating it for users, control gradual rollouts, and kill-switch problematic features instantly. Systems like LaunchDarkly, Optimizely, and Statsig power feature flags at scale. This guide covers the architecture of a production feature flag system.
Requirements
Functional: create and manage flags (on/off), target flags to specific users/groups/percentages, evaluate flags in real-time with <5ms latency, support A/B experiments with assignment logging, instant flag updates without deployment.
Non-functional: 10K flag evaluations/second, flag evaluation must never fail (fallback to defaults if flag service is unavailable), 100ms maximum propagation delay for flag updates.
Flag Data Model
Flag {
  key: "new_checkout_flow",
  description: "New checkout UI for A/B test",
  enabled: true,
  default_variation: "off",
  rules: [
    {targeting: {user_ids: ["user1", "user2"]}, variation: "on"},
    {targeting: {group: "beta_users"}, variation: "on"},
    {targeting: {percentage: 10}, variation: "on"},   // 10% of users
    {targeting: "everyone", variation: "off"}         // default fallthrough
  ],
  variations: {
    "on":  {value: true,  metadata: {experiment_arm: "treatment"}},
    "off": {value: false, metadata: {experiment_arm: "control"}}
  }
}
Flag Evaluation Engine
Rules are evaluated top-to-bottom; the first matching rule wins:
def evaluate_flag(flag_key, user_context, default="off"):
    # user_context = {user_id, groups, attributes}
    flag = get_flag(flag_key)  # from local SDK cache
    if flag is None:
        return default  # no cached copy: fall back to a safe default
    if not flag.enabled:
        return flag.default_variation
    for rule in flag.rules:
        if matches_rule(rule.targeting, user_context, flag_key):
            return rule.variation
    return flag.default_variation

def matches_rule(targeting, user_context, flag_key):
    if targeting == "everyone":
        return True  # catch-all rule matches unconditionally
    if "user_ids" in targeting:
        return user_context["user_id"] in targeting["user_ids"]
    if "group" in targeting:
        return targeting["group"] in user_context["groups"]
    if "percentage" in targeting:
        # Consistent bucketing for stable assignment. The hash must be
        # deterministic across processes (Python's builtin hash() is
        # salted per process and would not be stable).
        bucket = stable_hash(user_context["user_id"] + flag_key) % 100
        return bucket < targeting["percentage"]
    return False
Consistent bucketing: the same user always gets the same flag variant. Hashing user_id + flag_key and taking the result modulo 100 is stable across requests and ensures A/B test groups don't flicker between page loads. The hash function must itself be deterministic across processes and machines (a cryptographic or non-cryptographic fingerprint hash, not a per-process salted hash).
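A minimal sketch of stable percentage bucketing using hashlib (chosen here because Python's builtin hash() is salted per process; function names are illustrative):

```python
import hashlib

def bucket_for(user_id: str, flag_key: str) -> int:
    """Map a (user, flag) pair to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{user_id}:{flag_key}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(user_id: str, flag_key: str, percentage: int) -> bool:
    """True if this user falls inside the rollout percentage for this flag."""
    return bucket_for(user_id, flag_key) < percentage
```

Because the flag key is part of the hash input, a user's bucket for one flag is independent of their bucket for another, so overlapping experiments don't share the same 10% of users.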
SDK Architecture (Client-Side)
The flag SDK runs in the application process (not as a network call per evaluation):
Application startup:
1. SDK fetches all flags from Flag Service → local in-memory cache
2. SDK subscribes to real-time updates (SSE or WebSocket)
Flag evaluation (in-process, ~0.1ms):
sdk.variation("new_checkout_flow", user_context)
→ reads from local cache
→ no network call
Real-time updates:
Flag Service → publishes to Redis Pub/Sub → SDK receives update → updates local cache
This architecture means flag evaluations are always fast (local cache) and always available (no dependency on external service for each evaluation). Even if the Flag Service goes down, the SDK continues serving the last known flag state.
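The SDK-local-cache pattern can be sketched as a small class. This is an illustrative simplification (real SDKs also handle typed variations, rule evaluation, and reconnection logic):

```python
import threading

class FlagSDK:
    """Minimal in-process flag SDK: evaluations read a local cache,
    never the network. Illustrative sketch, not a vendor SDK."""

    def __init__(self, initial_flags: dict):
        self._lock = threading.Lock()
        self._flags = dict(initial_flags)   # flag_key -> value

    def on_update(self, flag_key: str, value: bool) -> None:
        # Called by the real-time listener thread (SSE/WebSocket).
        with self._lock:
            self._flags[flag_key] = value

    def variation(self, flag_key: str, default: bool = False) -> bool:
        # In-process dictionary lookup: no network call, no hard
        # dependency on the Flag Service being up.
        with self._lock:
            return self._flags.get(flag_key, default)

sdk = FlagSDK({"new_checkout_flow": True})
```

The caller-supplied default is the availability guarantee: an unknown or never-fetched flag degrades to a safe value instead of raising or blocking.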
Flag Service Architecture
Flag Management UI (create/edit flags)
↓
Flag API (CRUD, authentication)
↓
Flag Database (PostgreSQL: source of truth for flag configs)
↓
Cache Layer (Redis: serves flag state to SDKs)
↓
Real-time Update Channel (Redis Pub/Sub → SSE → SDK local caches)
When a flag is updated: write to PostgreSQL → update Redis cache → publish update event to Redis Pub/Sub → all subscribed SDK instances receive the event within 100ms and update their local cache.
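The write path above can be simulated in-process. The dictionaries and the subscriber list stand in for PostgreSQL, Redis, and the Pub/Sub channel respectively; this only demonstrates the ordering of the three steps:

```python
class FlagService:
    """Simulates the update path: persist -> refresh cache -> fan out."""

    def __init__(self):
        self.db = {}            # stand-in for PostgreSQL (source of truth)
        self.cache = {}         # stand-in for Redis
        self.subscribers = []   # stand-in for Pub/Sub subscribers (SDKs)

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def update_flag(self, flag_key, enabled):
        self.db[flag_key] = enabled        # 1. write the source of truth
        self.cache[flag_key] = enabled     # 2. refresh the cache layer
        for notify in self.subscribers:    # 3. fan out to SDK local caches
            notify(flag_key, enabled)

service = FlagService()
sdk_cache = {}
service.subscribe(lambda key, value: sdk_cache.__setitem__(key, value))
service.update_flag("new_checkout_flow", False)
```

Writing the database before publishing matters: an SDK that reconnects and re-fetches all flags must never see state older than what was just broadcast.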
A/B Experiment Logging
For experiments to be statistically valid, every flag evaluation must be logged:
ExposureEvent {
timestamp: 1713200000,
user_id: "user_12345",
flag_key: "new_checkout_flow",
variation: "on",
experiment_arm: "treatment"
}
Events are logged asynchronously to Kafka to avoid adding latency to flag evaluation. Downstream, a Spark job aggregates exposures and outcomes (purchases, click-through rates) per experiment arm to compute statistical significance.
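The async logging path can be sketched with a queue and a background worker. Here an in-memory queue and a list stand in for the Kafka producer buffer and the Kafka topic:

```python
import json
import queue
import threading
import time

exposure_queue = queue.Queue()   # stand-in for a Kafka producer buffer
shipped_events = []              # stand-in for the Kafka topic

def log_exposure(user_id, flag_key, variation, experiment_arm):
    # Enqueue and return immediately: flag evaluation latency is
    # unaffected by the downstream logging pipeline.
    exposure_queue.put({
        "timestamp": int(time.time()),
        "user_id": user_id,
        "flag_key": flag_key,
        "variation": variation,
        "experiment_arm": experiment_arm,
    })

def drain():
    # Background worker: ships queued events until it sees the sentinel.
    while True:
        event = exposure_queue.get()
        if event is None:
            break
        shipped_events.append(json.dumps(event))

worker = threading.Thread(target=drain, daemon=True)
worker.start()
log_exposure("user_12345", "new_checkout_flow", "on", "treatment")
exposure_queue.put(None)   # sentinel: flush and stop the worker
worker.join()
```

A real producer would also batch events and handle broker failures; the essential property shown here is that the evaluation path only pays for an enqueue.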
Emergency Kill Switch
A kill switch is a flag that can instantly disable a problematic feature for all users:
- Set flag enabled=false → propagates to all SDKs within 100ms
- No code deployment required — a single API call or UI click
- On-call engineers have access to the management UI 24/7
- Alert when a kill switch is activated (PagerDuty notification)
Interview Tips
- The SDK-local-cache pattern is the key insight: flag evaluation must never depend on a network call
- Consistent hashing for percentage rollouts prevents flickering — explain this explicitly
- Real-time updates via Pub/Sub enable <100ms propagation without polling
- Mention A/B experiment logging via Kafka — shows you understand the analytics use case
- The kill switch use case is compelling to mention — directly relevant to production reliability