System Design: Configuration Management Service — Feature Flags, Dynamic Config, and Safe Rollouts

What Is a Config Management Service?

A configuration management service stores application settings that need to change without a code deployment: feature flags (enable/disable features), A/B test parameters, rate limits, timeout values, and system thresholds. Well-known examples include Netflix's Archaius, Facebook's GateKeeper, LaunchDarkly, and HashiCorp Consul.

Core Features

Feature flags: boolean toggles that enable or disable code paths. Enable a flag for 1% of users (canary), then roll out to 100% over days. If issues arise, disable it instantly without a deploy.

Dynamic config: numeric or string values (timeout_ms=500, max_retries=3) that can be changed at runtime. Services poll or subscribe for updates.

Targeting: evaluate flags differently per user, based on attributes such as user_id, country, plan tier, or employee vs external.

Audit log: every config change is recorded with who changed it, when, and what the previous value was.

Data Model

Config (flag): flag_id, name, type (BOOLEAN, STRING, NUMBER, JSON), default_value, description, owner_team, created_at.

ConfigRule: rule_id, flag_id, priority, condition (JSON, e.g. {"country": "US", "plan": "premium"}), value.

ConfigVersion: version_id, flag_id, changed_by, changed_at, previous_value, new_value.

Rules are evaluated in priority order; the first matching rule wins, and if no rule matches, the flag's default_value is used.
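The three records above could be modelled roughly as follows. This is a minimal sketch using Python dataclasses: the field types, the "lower number is evaluated first" priority convention, and the `record_change` helper are illustrative assumptions, not part of the schema described above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any


class FlagType(Enum):
    BOOLEAN = "BOOLEAN"
    STRING = "STRING"
    NUMBER = "NUMBER"
    JSON = "JSON"


@dataclass
class Config:
    flag_id: int
    name: str
    type: FlagType
    default_value: Any
    description: str = ""
    owner_team: str = ""
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class ConfigRule:
    rule_id: int
    flag_id: int
    priority: int      # assumption: lower number = evaluated first
    condition: dict    # e.g. {"country": "US", "plan": "premium"}
    value: Any


@dataclass(frozen=True)  # frozen: audit records are never modified after creation
class ConfigVersion:
    version_id: int
    flag_id: int
    changed_by: str
    changed_at: datetime
    previous_value: Any
    new_value: Any


def record_change(flag: Config, new_value: Any, changed_by: str, version_id: int) -> ConfigVersion:
    """Hypothetical helper: apply a new default and return the audit record."""
    version = ConfigVersion(version_id, flag.flag_id, changed_by,
                            datetime.now(timezone.utc), flag.default_value, new_value)
    flag.default_value = new_value
    return version
```

Returning the `ConfigVersion` from the same function that applies the change keeps the audit record and the update together, so a change is hard to make without also logging it.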

Rule Evaluation

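from typing import Any


class FlagEvaluator:
    def __init__(self, cache):
        self.cache = cache  # any object exposing get(key), e.g. an in-process cache

    def evaluate(self, flag_name: str, context: dict) -> Any:
        flag = self.cache.get(f"flag:{flag_name}")
        if flag is None:
            raise KeyError(f"unknown flag: {flag_name}")
        # Rules are stored pre-sorted by priority; the first match wins.
        rules = self.cache.get(f"rules:{flag_name}") or []
        for rule in rules:
            if self.matches(rule.condition, context):
                return rule.value
        return flag.default_value

    def matches(self, condition: dict, context: dict) -> bool:
        # Every key in the condition must match (logical AND).
        for key, expected in condition.items():
            actual = context.get(key)
            if isinstance(expected, list):  # list membership
                if actual not in expected:
                    return False
            elif isinstance(expected, dict) and "gte" in expected:  # numeric >=
                # A missing attribute never satisfies a numeric comparison.
                if actual is None or actual < expected["gte"]:
                    return False
            elif actual != expected:  # exact match
                return False
        return True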

Config Distribution Architecture

Storage: PostgreSQL is the source of truth (all flags, rules, history). Cache: Redis serves fast reads, with each flag serialized as a JSON string. Services: client SDKs poll Redis every 10 seconds and also subscribe to change events. Change propagation: when a flag is updated in PostgreSQL, the config service writes the new value to Redis and then publishes an invalidation event to Redis Pub/Sub. Every subscribed SDK instance receives the event and refreshes the affected flag from Redis. End-to-end propagation latency is typically under 1 second. Redis Pub/Sub does not redeliver messages to subscribers that were disconnected, so polling remains the safety net for missed events.
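The propagation path can be illustrated without a real Redis instance. The sketch below is only a simulation: `InMemoryBus` is a hypothetical in-process stand-in for the Redis key-value store plus Pub/Sub, and the channel name `flag-invalidations` is an assumption made for the example.

```python
from collections import defaultdict
from typing import Any, Callable


class InMemoryBus:
    """Stand-in for Redis: a key-value store plus Pub/Sub channels."""

    def __init__(self) -> None:
        self.store: dict = {}
        self.subscribers: dict = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[str], None]) -> None:
        self.subscribers[channel].append(handler)

    def publish(self, channel: str, message: str) -> None:
        # Deliver synchronously to every current subscriber.
        for handler in self.subscribers[channel]:
            handler(message)


class SdkClient:
    """Keeps a local copy of all flags and refreshes one flag per event."""

    def __init__(self, bus: InMemoryBus) -> None:
        self.bus = bus
        self.local: dict = dict(bus.store)  # initial full load
        bus.subscribe("flag-invalidations", self.on_invalidate)

    def on_invalidate(self, flag_name: str) -> None:
        # The event carries only the flag name; the value is re-read from the cache.
        self.local[flag_name] = self.bus.store.get(flag_name)


def update_flag(bus: InMemoryBus, flag_name: str, value: Any) -> None:
    # Step 1 (omitted here): commit the change and its history row to PostgreSQL.
    bus.store[flag_name] = value                  # step 2: refresh the shared cache
    bus.publish("flag-invalidations", flag_name)  # step 3: notify subscribers
```

Publishing only the flag name, rather than the new value, means a subscriber always re-reads the current state, so events that arrive out of order cannot leave a stale value behind.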

Safe Rollout

Percentage rollout: assign each user to a consistent bucket (0-99) using hash(flag_name + user_id) % 100, where hash is a deterministic function such as MD5 or SHA-256 (not a per-process salted hash like Python's built-in hash()). Rule: if bucket < rollout_percentage, return the new value. The assignment is stable: the same user always gets the same bucket for a given flag, so they do not flip between old and new behavior as the percentage grows. Canary deployment: start at 1%, monitor error rates and latency for 30 minutes, then step up to 5%, 10%, 25%, 50%, and 100%. Automatic rollback: if the error rate exceeds a threshold (e.g., 5x baseline), automatically disable the flag and alert the team.
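A minimal sketch of the bucketing and rollback checks described above. The choice of SHA-256, the ":" separator between flag name and user ID, and the function names are assumptions made for illustration; any stable hash would serve.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    # The separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: str, rollout_percentage: int) -> bool:
    """A user is in the rollout when their bucket is below the percentage."""
    return bucket(flag_name, user_id) < rollout_percentage


def should_roll_back(error_rate: float, baseline: float, factor: float = 5.0) -> bool:
    """True when the observed error rate exceeds factor x baseline."""
    return error_rate > baseline * factor
```

Because the bucket depends only on the flag name and user ID, raising the percentage from 10 to 25 keeps every user who was already enabled and only adds new ones.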

Consistency and Caching

Client SDKs cache flag values in process memory, so an evaluation needs no network call. A background thread refreshes the cache on a configurable interval (default 10s). On startup, the SDK loads all flags from Redis before serving traffic; if the load fails, whether a flag fails open or fails closed is configurable per flag. For flags controlling billing or security, use a shorter refresh interval (1-5s) or evaluate synchronously against Redis. Consistency trade-off: with 10s polling, a flag change can take up to 10s to propagate. For immediate kills (security incidents), rely on the Pub/Sub invalidation path, which propagates in under 1 second.
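The in-process cache with a background refresh might look like the sketch below. `CachedFlags` and its `fetch_all` loader are hypothetical names; in a real SDK the loader would read every flag from Redis. The `fallback` argument to `get` is one simple way to express the per-flag fail-open or fail-closed choice.

```python
import threading
from typing import Any, Callable


class CachedFlags:
    def __init__(self, fetch_all: Callable[[], dict], interval_s: float = 10.0) -> None:
        self._fetch_all = fetch_all  # hypothetical loader, e.g. reads all flags from Redis
        self._interval_s = interval_s
        self._values: dict = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def refresh_once(self) -> bool:
        """Reload every flag; keep the last known values if the fetch fails."""
        try:
            fresh = self._fetch_all()
        except Exception:
            return False
        with self._lock:
            self._values = dict(fresh)
        return True

    def start(self) -> None:
        """Load once before serving traffic, then refresh in the background."""
        self.refresh_once()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self.refresh_once()

    def stop(self) -> None:
        self._stop.set()

    def get(self, name: str, fallback: Any) -> Any:
        """Read from process memory; `fallback` is used when the flag is unknown."""
        with self._lock:
            return self._values.get(name, fallback)
```

A caller guarding a risky feature would pass `fallback=False` (fail closed), while a caller guarding a non-critical optimization might pass `fallback=True` (fail open).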

Operational Features

Scheduled rollouts: “enable at 9am on launch day”. Store scheduled_at on the rule; the rule only applies once the current time has reached scheduled_at. Kill switch: a special always-off override rule with maximum priority, so it wins over every other rule. Dependency tracking: flag A may depend on flag B, so evaluate flags in topological order of their dependencies. Dashboard: real-time flag state, rollout percentage, and error rate correlation.
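Three of these features fit in a short sketch. It assumes rules are plain dicts, that a lower priority number means the rule is evaluated first, and it ignores targeting conditions to keep the example small; `KILL_SWITCH_PRIORITY` and the function names are illustrative. Dependency ordering uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from datetime import datetime, timezone
from graphlib import TopologicalSorter  # standard library, Python 3.9+

KILL_SWITCH_PRIORITY = -1  # assumption: lower number = higher priority


def rule_is_active(rule: dict, now: datetime) -> bool:
    """A rule with scheduled_at only applies once that time has been reached."""
    scheduled_at = rule.get("scheduled_at")
    return scheduled_at is None or now >= scheduled_at


def evaluate(rules: list, default, now: datetime):
    """Return the value of the first active rule in priority order."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule_is_active(rule, now):
            return rule["value"]
    return default


def evaluation_order(depends_on: dict) -> list:
    """Order flags so that each one comes after the flags it depends on."""
    # TopologicalSorter raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(depends_on).static_order())
```

For example, `evaluation_order({"A": {"B"}, "B": set()})` places B before A, and appending a rule with `KILL_SWITCH_PRIORITY` and value `False` overrides any scheduled or targeted rule.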

Frequently Asked Questions

What is a feature flag and why use it instead of a code deploy?

A feature flag (feature toggle) is a configuration value that enables or disables a code path at runtime without requiring a new deployment. Benefits: (1) Instant rollback: if a new feature causes issues, flip the flag off in seconds vs 10-30 minutes for a rollback deploy. (2) Gradual rollout: enable for 1% of users, monitor for errors, expand to 100% over days. (3) A/B testing: show two variants to different user segments and measure impact. (4) Kill switch: disable a feature under load without touching the codebase. (5) Dark launches: deploy code that is off, test it internally, then enable for users. Flags decouple deployment (when code ships) from release (when users see it).

How do you implement percentage-based feature flag rollouts?

Assign each user to a consistent bucket using a deterministic hash: bucket = hash(flag_name + user_id) % 100. This gives a value 0-99 that is stable for the same user and flag, so the user does not flip between old and new behavior during a gradual rollout. Enable the flag if bucket < rollout_percentage. [...] config service publishes -> Redis delivers to all subscribers -> SDK refreshes = under 1 second. The polling serves as a safety net if the Pub/Sub message is missed.

How do you implement targeting rules for feature flags (user segments, country, plan)?

Store targeting rules as a priority-ordered list of conditions and values. Each rule has: condition (JSON: {"country": ["US", "CA"], "plan": "premium"}), value (true/false or a variant), and priority. Evaluation: for a given user context (user_id, country, plan, account_age), evaluate rules in priority order. The first matching rule returns its value. If no rule matches, return the default value. Condition types: exact match (country == US), list membership (plan in [premium, enterprise]), numeric comparison (account_age_days >= 30), percentage rollout (hash(flag+user_id) % 100 < 10). Store rules in Redis as a sorted set by priority for fast evaluation without a database query.

How do you audit and recover from a bad config change?

Every config change is written to an immutable audit log: who changed it, when, the previous value, and the new value. To recover from a bad change, a one-click rollback in the dashboard restores the previous version (looks up ConfigVersion, updates the flag to previous_value, logs the rollback event). For automated rollback, integrate with monitoring: for example, if the error rate increases by 3x within 5 minutes of a flag change, automatically disable the flag and alert the team. The audit log is append-only: never delete entries, even for rollbacks. This provides a complete history for post-incident review. Store audit logs in a separate table with restricted write access (only the config service can write, not application code).
