System Design Interview: Design a Feature Flag System
Feature flag (feature toggle) systems let engineers enable or disable features at runtime without deploying code. They support gradual rollouts, A/B testing, and instant rollback. This question is asked at LinkedIn, Atlassian, Shopify, and other growth-focused companies.
Requirements Clarification
Functional Requirements
- Create and manage feature flags with multiple targeting rules
- Target flags by user ID, percentage rollout, country, user segment, or custom attributes
- Evaluate flags in real-time (<1ms latency)
- Gradual rollout: increase percentage from 0% to 100%
- Kill switch: instantly disable a feature for all users
- Audit log: track who changed what and when
Non-Functional Requirements
- Scale: 1B flag evaluations/day, 10K flags, 100M users
- Latency: <1ms for flag evaluation (must not slow down critical paths)
- Availability: 99.99% (flag service outage should not take down main application)
- Consistency: eventual OK (brief inconsistency during flag updates acceptable)
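A quick back-of-the-envelope check of the scale numbers above, as a rough sketch. The 10x peak factor and the 2 KB average flag size are assumptions made for illustration; they do not come from the requirements.

```python
# Back-of-the-envelope sizing for the stated scale.
EVALS_PER_DAY = 1_000_000_000
FLAG_COUNT = 10_000
SECONDS_PER_DAY = 24 * 60 * 60

# Assumptions (not from the requirements): peak traffic is 10x the average,
# and a serialized flag averages about 2 KB.
PEAK_FACTOR = 10
AVG_FLAG_BYTES = 2 * 1024

avg_evals_per_sec = EVALS_PER_DAY / SECONDS_PER_DAY
peak_evals_per_sec = avg_evals_per_sec * PEAK_FACTOR
cache_size_mb = FLAG_COUNT * AVG_FLAG_BYTES / (1024 * 1024)

print(f"average: {avg_evals_per_sec:,.0f} evals/s")
print(f"peak:    {peak_evals_per_sec:,.0f} evals/s")
print(f"full flag set in memory: {cache_size_mb:.1f} MB per SDK instance")
```

Under these assumptions the average load is roughly 11.6K evaluations per second, and the whole flag set fits comfortably in each application server's memory, which is what makes local evaluation practical.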
Core Concept: Client-Side SDK
The key architectural insight is that flag evaluation happens inside the application process, not via a network call. SDKs that follow the LaunchDarkly, Unleash, and GrowthBook model work as follows:
- On startup, SDK fetches all flag configurations from flag service
- SDK caches rules in local memory
- Flag evaluation: pure local computation using cached rules (microseconds)
- SDK subscribes to streaming updates (SSE or WebSocket) for real-time rule changes
- Fallback: if streaming disconnects, poll every 30s
This removes network latency from the hot path: each evaluation is a local dictionary lookup followed by in-memory rule matching.
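The SDK behaviour described above can be sketched roughly as follows. `FlagClient`, `apply_update`, `get_config`, and `fake_fetch_all` are hypothetical names for illustration, not the API of LaunchDarkly, Unleash, or GrowthBook; the fetch function stands in for the real HTTP call.

```python
import threading
from typing import Callable, Dict, Optional


class FlagClient:
    """Hypothetical in-process SDK that keeps flag configs in memory."""

    def __init__(self, fetch_all: Callable[[], Dict[str, dict]]):
        self._fetch_all = fetch_all
        self._lock = threading.Lock()
        self._flags: Dict[str, dict] = {}

    def start(self) -> None:
        # One network round trip at startup, never on the request path.
        snapshot = self._fetch_all()
        with self._lock:
            self._flags = dict(snapshot)

    def apply_update(self, key: str, config: Optional[dict]) -> None:
        # Called from the streaming (or polling) thread; None means deleted.
        with self._lock:
            if config is None:
                self._flags.pop(key, None)
            else:
                self._flags[key] = config

    def get_config(self, key: str) -> Optional[dict]:
        # Hot path: an in-memory dictionary read, no I/O.
        with self._lock:
            return self._flags.get(key)


def fake_fetch_all() -> Dict[str, dict]:
    # Stand-in for the HTTP call to the flag service.
    return {"new-checkout-flow": {"status": "ACTIVE", "default_variation": 1}}


client = FlagClient(fake_fetch_all)
client.start()
client.apply_update("new-checkout-flow", {"status": "INACTIVE", "default_variation": 1})
print(client.get_config("new-checkout-flow")["status"])  # INACTIVE
```

The lock keeps the background update thread from racing with request threads; a production SDK might instead swap an immutable snapshot to avoid locking on reads.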
Flag Data Model
Flag:
  id: string
  key: string (e.g., "new-checkout-flow")
  status: ACTIVE | INACTIVE | ARCHIVED
  variations: [{value: true}, {value: false}]  # or strings, numbers
  targeting_rules: [
    {
      condition: {attribute: "country", operator: "in", values: ["US", "CA"]},
      variation: 0  # index into variations
    },
    {
      condition: {attribute: "user_id", operator: "in_percentage", values: [0, 10]},
      variation: 0  # 10% rollout
    }
  ]
  default_variation: 1  # fallback if no rule matches
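One way to express this model in code, as a minimal sketch. The class names and the index validation are illustrative additions, and variations are simplified to plain values rather than `{value: ...}` objects.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, List


class Status(Enum):
    ACTIVE = "ACTIVE"
    INACTIVE = "INACTIVE"
    ARCHIVED = "ARCHIVED"


@dataclass(frozen=True)
class Condition:
    attribute: str
    operator: str          # e.g. "in", "in_percentage"
    values: List[Any]


@dataclass(frozen=True)
class Rule:
    condition: Condition
    variation: int         # index into Flag.variations


@dataclass(frozen=True)
class Flag:
    id: str
    key: str
    status: Status
    variations: List[Any]
    default_variation: int
    targeting_rules: List[Rule] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject configs whose indexes point outside `variations`.
        indexes = [self.default_variation] + [r.variation for r in self.targeting_rules]
        if any(i < 0 or i >= len(self.variations) for i in indexes):
            raise ValueError(f"flag {self.key!r}: variation index out of range")


flag = Flag(
    id="f1",
    key="new-checkout-flow",
    status=Status.ACTIVE,
    variations=[True, False],
    default_variation=1,
    targeting_rules=[
        Rule(Condition("country", "in", ["US", "CA"]), variation=0),
        Rule(Condition("user_id", "in_percentage", [0, 10]), variation=0),
    ],
)
```

Validating indexes when the flag is built means a bad config fails in the flag service at save time rather than inside an SDK during evaluation.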
Flag Evaluation Algorithm
import hashlib

def evaluate(flag_key, user_context):
    flag = local_cache[flag_key]
    # Inactive or archived flags always serve the default variation.
    if flag.status != "ACTIVE":
        return flag.variations[flag.default_variation]
    # Rules are checked in order; the first match wins.
    for rule in flag.targeting_rules:
        if matches_condition(rule.condition, user_context, flag_key):
            return flag.variations[rule.variation]
    return flag.variations[flag.default_variation]

def matches_condition(condition, user_context, flag_key):
    if condition.operator == "in_percentage":
        # Deterministic bucket in [0, 100). Python's built-in hash() is salted
        # per process for strings, so a stable digest is used instead.
        seed = (user_context["user_id"] + flag_key).encode("utf-8")
        bucket = int.from_bytes(hashlib.sha256(seed).digest()[:8], "big") % 100
        return condition.values[0] <= bucket < condition.values[1]
    value = user_context.get(condition.attribute)
    if condition.operator == "in":
        return value in condition.values
    # ... other operators
    return False
Deterministic Hashing for Percentage Rollout
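Assign buckets with hash(user_id + flag_key) % 100, using a hash that is stable across processes and languages (for example MD5, SHA-256, or MurmurHash, rather than a per-process salted hash such as Python's built-in hash()). This gives three properties: the same user always lands in the same bucket for a given flag (sticky); different flags bucket users independently, because the flag key is part of the hash input; and raising the percentage only adds users, so everyone enabled at 10% stays enabled at 20% (predictable).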
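The bucketing properties can be checked with a short sketch. SHA-256 is one possible stable hash, chosen here as an assumption; `bucket_for` and `in_rollout` are hypothetical helper names.

```python
import hashlib


def bucket_for(user_id: str, flag_key: str) -> int:
    """Map (user, flag) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256((user_id + flag_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(user_id: str, flag_key: str, percentage: int) -> bool:
    return bucket_for(user_id, flag_key) < percentage


users = [f"user-{i}" for i in range(10_000)]
at_10 = {u for u in users if in_rollout(u, "new-checkout-flow", 10)}
at_20 = {u for u in users if in_rollout(u, "new-checkout-flow", 20)}

# Everyone enabled at 10% stays enabled at 20%.
print(at_10 <= at_20)  # True
# Observed share of users at 10%; close to 0.10 but not exact.
print(len(at_10) / len(users))
```

Because the comparison is `bucket < percentage`, the set of enabled users can only grow as the percentage rises, which is the property that keeps a gradual rollout from flipping early users back off.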
Architecture
Engineers use UI/API to create/modify flags
|
Flag Service (CRUD API)
|
PostgreSQL (source of truth)
|
Change events -> Kafka
|
Streaming Update Service (SSE/WebSocket)
|
SDK instances in application servers (local cache)
|
Flag evaluation (local, microseconds)
Real-Time Updates
When a flag changes, updates propagate to all SDK instances:
- Server-Sent Events (SSE): SDK maintains persistent HTTP connection to streaming service. On flag change, server pushes update. Simple, works through load balancers, one-directional
- WebSocket: bi-directional, better for high-frequency updates
- Propagation latency: <1 second for 99% of SDK instances
- Fallback: SDK polls every 30s if streaming connection drops
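How an SDK might consume the SSE stream can be sketched without any networking by parsing the text framing directly. The `flag-updated` event name and the JSON payload with `key`, `version`, and `status` fields are assumptions for illustration; the version check is one way to ignore stale or duplicate deliveries after a reconnect.

```python
import json
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) for each event in an SSE text stream."""
    event = "message"
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "":                      # a blank line ends the event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):          # comment, often a keep-alive
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            data.append(value[1:] if value.startswith(" ") else value)


def apply_events(cache: Dict[str, dict], lines: Iterable[str]) -> None:
    for event, data in parse_sse(lines):
        if event != "flag-updated":         # hypothetical event name
            continue
        update = json.loads(data)
        current = cache.get(update["key"])
        # Ignore stale or duplicate deliveries (e.g. after a reconnect).
        if current is None or update["version"] > current["version"]:
            cache[update["key"]] = update


stream = [
    ": keep-alive\n",
    "event: flag-updated\n",
    'data: {"key": "new-checkout-flow", "version": 2, "status": "INACTIVE"}\n',
    "\n",
    "event: flag-updated\n",
    'data: {"key": "new-checkout-flow", "version": 1, "status": "ACTIVE"}\n',
    "\n",
]
cache: Dict[str, dict] = {}
apply_events(cache, stream)
print(cache["new-checkout-flow"]["status"])  # INACTIVE
```

The same `apply_events` logic can serve the 30s polling fallback: a poll response is just another source of versioned updates, so the cache converges regardless of which path delivered the change.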
Percentage Rollout and A/B Testing
- Gradual rollout: increase the percentage in the flag config (5% → 20% → 50% → 100%). At 5%, users in buckets 0-4 see the feature; at 20%, buckets 0-19 do, so earlier users keep it.
- A/B testing: run two variations simultaneously (50/50 split). Track conversion metrics per variation in analytics system.
- Multi-variate testing: multiple variations (A/B/C/D). Each variation gets a percentage range.
- Sticky assignment: hash-based bucketing ensures the same user sees the same variation consistently.
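Multi-variate allocation can be sketched by giving each variation a contiguous slice of the 0-99 bucket range. `assign_variation` and the flag key `pricing-page-test` are hypothetical, and SHA-256 is an assumed choice of stable hash.

```python
import hashlib
from typing import List, Tuple


def bucket_for(user_id: str, flag_key: str) -> int:
    digest = hashlib.sha256((user_id + flag_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def assign_variation(user_id: str, flag_key: str,
                     allocation: List[Tuple[str, int]]) -> str:
    """Pick a variation from (name, percent) pairs that sum to 100."""
    if sum(percent for _, percent in allocation) != 100:
        raise ValueError("allocation must sum to 100")
    bucket = bucket_for(user_id, flag_key)
    upper = 0
    for name, percent in allocation:
        upper += percent            # upper bound (exclusive) of this slice
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always below 100")


allocation = [("A", 25), ("B", 25), ("C", 25), ("D", 25)]
counts = {name: 0 for name, _ in allocation}
for i in range(10_000):
    counts[assign_variation(f"user-{i}", "pricing-page-test", allocation)] += 1
print(counts)
```

Each user's variation depends only on the user ID, the flag key, and the allocation, so the analytics system can recompute or verify assignments offline from the same inputs.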
Kill Switch and Rollback
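Kill switch = set flag status to INACTIVE. The change propagates via streaming to connected SDKs, typically within about 1 second; instances that have lost the stream pick it up on their next 30s poll. All users then see the default variation (feature off). No deployment is needed. This is the primary benefit over code-level feature gating.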
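On the flag service side, a kill switch can be sketched as a status change that also writes an audit record and produces a change event. The dictionary shapes, the `version` counter, and the function names are assumptions for illustration; targeting rules are omitted from the simplified `evaluate`.

```python
from datetime import datetime, timezone
from typing import Dict, List


def kill_flag(flags: Dict[str, dict], audit_log: List[dict],
              flag_key: str, actor: str) -> dict:
    """Deactivate a flag, record who did it, and return the change event."""
    flag = flags[flag_key]
    previous = flag["status"]
    flag["status"] = "INACTIVE"
    flag["version"] += 1
    audit_log.append({
        "flag_key": flag_key,
        "actor": actor,
        "change": {"status": [previous, "INACTIVE"]},
        "at": datetime.now(timezone.utc).isoformat(),
    })
    # This event is what would be published for the streaming service.
    return {"key": flag_key, "version": flag["version"], "status": "INACTIVE"}


def evaluate(flag: dict) -> object:
    # Simplified: targeting rules omitted; inactive flags serve the default.
    if flag["status"] != "ACTIVE":
        return flag["variations"][flag["default_variation"]]
    return flag["variations"][0]


flags = {"new-checkout-flow": {"status": "ACTIVE", "version": 7,
                               "variations": [True, False], "default_variation": 1}}
audit_log: List[dict] = []
event = kill_flag(flags, audit_log, "new-checkout-flow", actor="alice@example.com")
print(evaluate(flags["new-checkout-flow"]))      # False
print(event["version"], audit_log[0]["actor"])   # 8 alice@example.com
```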
Interview Tips
- Lead with SDK caching – flag evaluation must be <1ms so it cannot be a network call
- Explain streaming updates (SSE) for real-time propagation
- Describe hash-based bucketing for deterministic percentage rollout
- Know the difference between feature flags and A/B testing (flags control visibility; A/B tests measure impact)
- Mention GrowthBook, LaunchDarkly, Unleash as real implementations
Frequently Asked Questions
Why must feature flag evaluation be local and not a network call?
Feature flags are evaluated on every request or user interaction – potentially millions of times per second. A network call to a remote flag service would add 10-100ms of latency to every operation, which is unacceptable for production APIs. The solution is SDK-side caching: the SDK fetches all flag rules at startup, caches them in local memory, and evaluates flags with pure local logic (microseconds). Updates stream in via SSE in the background without blocking request processing.
How does percentage rollout work in feature flags?
Percentage rollout uses deterministic hashing: bucket = hash(user_id + flag_key) % 100. Users with bucket 0-9 are in the 10% rollout, 0-49 in the 50% rollout. The combination of user_id and flag_key ensures: (1) same user always gets same bucket for this flag (sticky), (2) different flags have independent bucket assignments. Increasing from 10% to 20% adds users with buckets 10-19, so the 10% group sees no change.
What is the difference between feature flags and A/B testing?
Feature flags control whether a feature is visible/active for a user segment. A/B testing measures the impact of a change by running multiple variants simultaneously and tracking metrics (conversion, retention, revenue). Feature flags are the mechanism; A/B testing is the methodology. LaunchDarkly and GrowthBook support both: use flag targeting to split users into variants, then analyze metrics per variant in an analytics system.
How do you handle the case when the flag service is down?
The SDK should continue operating with cached flag rules. If the service is unavailable on startup, the SDK serves default values (fallback). Once connected, rules are cached locally, so a service outage after startup does not affect flag evaluations. The SDK should retry connection with exponential backoff. For critical systems, the SDK can persist flag rules to local disk as an additional fallback. Never let flag service downtime take down the main application.
What is a kill switch and when would you use it?
A kill switch is a feature flag that is set to disabled (off for all users) immediately, with no code deployment. Set flag status to INACTIVE, which propagates via SSE to all SDK instances within 1 second. Used for: disabling a broken feature causing production errors, quickly reverting a failed experiment, emergency response to a security vulnerability. This is the primary operational benefit of feature flags over conditional code that requires a deployment to disable.
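The retry and disk-fallback behaviour from the outage answer above can be sketched as follows. The function names, the 1s base and 60s cap, and the use of full jitter are assumptions; the sketch is synchronous for brevity, whereas a real SDK would more likely serve defaults immediately and retry in the background.

```python
import random
from typing import Callable, Iterator, Optional


def backoff_delays(base: float = 1.0, cap: float = 60.0,
                   attempts: int = 8) -> Iterator[float]:
    """Yield capped exponential delays with full jitter."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield random.uniform(0, ceiling)


def connect_with_retry(connect: Callable[[], dict],
                       sleep: Callable[[float], None],
                       load_from_disk: Callable[[], Optional[dict]]) -> dict:
    """Return flag rules from the service, else disk, else an empty set."""
    for delay in backoff_delays():
        try:
            return connect()
        except ConnectionError:
            sleep(delay)
    # Service still unreachable: fall back to the last snapshot saved on disk.
    return load_from_disk() or {}


calls = {"n": 0}


def flaky_connect() -> dict:
    # Fails twice, then succeeds, to exercise the retry loop.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("flag service unavailable")
    return {"new-checkout-flow": {"status": "ACTIVE"}}


rules = connect_with_retry(flaky_connect, sleep=lambda s: None,
                           load_from_disk=lambda: None)
print(calls["n"], list(rules))  # 3 ['new-checkout-flow']
```

An empty rule set means every evaluation falls through to the default value coded at the call site, so the application keeps running even if neither the service nor a disk snapshot is available.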