System Design: Configuration Management Service — Feature Flags, Dynamic Config, and Safe Rollouts

What Is a Config Management Service?

A configuration management service stores application settings that need to change without a code deployment: feature flags (enable/disable features), A/B test parameters, rate limits, timeout values, and system thresholds. Used by Netflix (Archaius), Facebook (GateKeeper), LaunchDarkly, and HashiCorp Consul.

Core Features

Feature flags: boolean toggles to enable/disable code paths. Enable for 1% of users (canary), then roll out to 100% over days. If issues arise, disable instantly without a deploy. Dynamic config: numeric/string values (timeout_ms=500, max_retries=3) changeable at runtime. Services poll or subscribe for updates. Targeting: evaluate flags differently per user (user_id, country, plan tier, employee vs external). Audit log: every config change recorded with who changed it, when, and what the previous value was.
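From a service's point of view, these features are consumed through a client SDK. The sketch below shows one plausible shape for that interface; the class and method names (`FlagClient`, `is_enabled`, `get_number`) are illustrative, not a real library.

```python
# Minimal in-memory stand-in for a feature-flag client SDK. In a real SDK,
# the flags dict would be refreshed in the background from the config service.
class FlagClient:
    def __init__(self, flags: dict):
        self.flags = flags  # flag_name -> current value

    def is_enabled(self, name: str, default: bool = False) -> bool:
        """Boolean feature flag lookup."""
        return bool(self.flags.get(name, default))

    def get_number(self, name: str, default: float) -> float:
        """Dynamic numeric config with a type-checked fallback."""
        value = self.flags.get(name, default)
        return value if isinstance(value, (int, float)) else default

client = FlagClient({"new_checkout": True, "timeout_ms": 500})
if client.is_enabled("new_checkout"):
    timeout = client.get_number("timeout_ms", default=300)  # 500
```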

Data Model

Config (flag): flag_id, name, type (BOOLEAN, STRING, NUMBER, JSON), default_value, description, owner_team, created_at. ConfigRule: rule_id, flag_id, priority, condition (JSON: {"country": "US", "plan": "premium"}), value. ConfigVersion: version_id, flag_id, changed_by, changed_at, previous_value, new_value. Rules are evaluated in priority order; first matching rule wins; no match uses default_value.
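One way to express this data model in code is as plain dataclasses (a sketch: field names follow the text above, while defaults and types are assumptions):

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class Config:
    flag_id: str
    name: str
    type: str                # "BOOLEAN" | "STRING" | "NUMBER" | "JSON"
    default_value: Any       # returned when no rule matches
    description: str = ""
    owner_team: str = ""
    created_at: datetime = field(default_factory=datetime.now)

@dataclass
class ConfigRule:
    rule_id: str
    flag_id: str
    priority: int            # rules are evaluated in priority order
    condition: dict          # e.g. {"country": "US", "plan": "premium"}
    value: Any

@dataclass
class ConfigVersion:
    version_id: str
    flag_id: str
    changed_by: str
    changed_at: datetime
    previous_value: Any
    new_value: Any
```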

Rule Evaluation

from typing import Any

class FlagEvaluator:
    def __init__(self, cache):
        self.cache = cache  # e.g. a Redis client or an in-process dict

    def evaluate(self, flag_name: str, context: dict) -> Any:
        flag = self.cache.get(f"flag:{flag_name}")
        rules = self.cache.get(f"rules:{flag_name}") or []  # sorted by priority
        for rule in rules:
            if self.matches(rule.condition, context):
                return rule.value  # first matching rule wins
        return flag.default_value  # no rule matched: fall back to the default

    def matches(self, condition: dict, context: dict) -> bool:
        for key, expected in condition.items():
            if isinstance(expected, list):  # membership, e.g. {"country": ["US", "CA"]}
                if context.get(key) not in expected:
                    return False
            elif isinstance(expected, dict) and "gte" in expected:  # lower bound, e.g. {"age": {"gte": 18}}
                if context.get(key, 0) < expected["gte"]:
                    return False
            elif context.get(key) != expected:  # everything else is an exact match
                return False
        return True
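To make the evaluation semantics concrete, here is a standalone walk-through with two rules and three contexts (rule values and conditions are made up for illustration; `SimpleNamespace` stands in for real rule objects):

```python
from types import SimpleNamespace

# Rules already sorted by priority; first match wins, default otherwise.
rules = [
    SimpleNamespace(priority=1, condition={"plan": "premium"}, value=1000),
    SimpleNamespace(priority=2, condition={"country": ["US", "CA"]}, value=750),
]
DEFAULT_VALUE = 500

def matches(condition: dict, context: dict) -> bool:
    # Same semantics as FlagEvaluator.matches: list = membership,
    # {"gte": n} = numeric lower bound, anything else = exact match.
    for key, expected in condition.items():
        if isinstance(expected, list):
            if context.get(key) not in expected:
                return False
        elif isinstance(expected, dict) and "gte" in expected:
            if context.get(key, 0) < expected["gte"]:
                return False
        elif context.get(key) != expected:
            return False
    return True

def evaluate(context: dict):
    for rule in rules:
        if matches(rule.condition, context):
            return rule.value
    return DEFAULT_VALUE

print(evaluate({"plan": "premium", "country": "US"}))  # 1000: premium rule matches first
print(evaluate({"plan": "free", "country": "US"}))     # 750: falls through to the country rule
print(evaluate({"plan": "free", "country": "DE"}))     # 500: no match, default
```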

Config Distribution Architecture

Storage: PostgreSQL for source of truth (all flags, rules, history). Cache: Redis for fast reads — each flag serialized as a JSON string. Services: client SDKs poll Redis every 10 seconds or subscribe to change events. Change propagation: when a flag is updated in PostgreSQL, the config service publishes an invalidation event to Redis Pub/Sub. All SDK instances subscribed to Pub/Sub receive the event and refresh the affected flag from Redis. End-to-end propagation latency: under 1 second.
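The propagation path can be simulated end to end in memory. This is a sketch of the flow only: the dicts below stand in for PostgreSQL and Redis, and the subscriber list stands in for a Pub/Sub channel; none of this is a specific client library's API.

```python
import json

class ConfigService:
    def __init__(self):
        self.db = {}           # stands in for PostgreSQL (source of truth)
        self.cache = {}        # stands in for Redis (flags as JSON strings)
        self.subscribers = []  # SDK instances on the Pub/Sub channel

    def update_flag(self, name: str, value):
        self.db[name] = value                 # 1. write the source of truth
        self.cache[name] = json.dumps(value)  # 2. refresh the serialized cache copy
        for sdk in self.subscribers:          # 3. publish an invalidation event
            sdk.on_invalidate(name)

class SDKInstance:
    def __init__(self, service):
        self.service = service
        self.local = {}        # in-process copy of flags
        service.subscribers.append(self)

    def on_invalidate(self, name: str):
        # Re-read only the affected flag from the cache.
        self.local[name] = json.loads(self.service.cache[name])

service = ConfigService()
sdk = SDKInstance(service)
service.update_flag("timeout_ms", 500)
print(sdk.local["timeout_ms"])  # 500: propagated via the invalidation event
```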

Safe Rollout

Percentage rollout: assign each user to a consistent bucket (0-99) using hash(flag_name + user_id) % 100. Rule: if bucket < rollout_percentage, return the new value. This is stable — the same user always gets the same bucket, so they do not flip between old and new behavior during a rollout. Canary deployment: start at 1%, monitor error rates and latency for 30 minutes, then 5%, 10%, 25%, 50%, 100%. Automatic rollback: if error rate exceeds threshold (e.g., 5x baseline), automatically disable the flag and alert the team.
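The bucketing scheme above can be sketched as follows. One caveat worth encoding: the hash must be deterministic across processes, and Python's built-in `hash()` is salted per process, so this sketch uses an MD5 digest instead.

```python
import hashlib

def bucket(flag_name: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 99]. MD5 is used here
    because it is deterministic across processes and restarts, unlike
    Python's built-in hash(), which is randomized per process."""
    digest = hashlib.md5(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(flag_name: str, user_id: str, rollout_percentage: int) -> bool:
    # bucket < percentage means a user admitted at 5% stays admitted
    # at 25%, 50%, ... -- no user flips back to the old behavior.
    return bucket(flag_name, user_id) < rollout_percentage

b = bucket("new_checkout", "user-42")
print(b == bucket("new_checkout", "user-42"))  # True: same user, same bucket
```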

Consistency and Caching

Client SDKs cache flag values in process memory (fastest, no network call per evaluation). Background thread refreshes cache on a configurable interval (default 10s). On startup: load all flags from Redis before serving traffic (fail open vs fail closed is configurable per flag). For flags controlling billing or security: lower TTL (1-5s) or synchronous evaluation from Redis. Consistency trade-off: with 10s polling, a flag change takes up to 10s to propagate. For immediate kills (security incidents), use the Pub/Sub invalidation path which propagates in under 1 second.
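The per-flag TTL policy can be sketched as a small wrapper around a fetch callback. This is a simplified, single-threaded version (a real SDK would refresh from Redis on a background thread rather than on the read path):

```python
import time

class CachedFlag:
    """In-process cache for one flag with a configurable staleness bound:
    ~10s for ordinary flags, 1-5s for billing/security flags."""
    def __init__(self, fetch, ttl_seconds: float):
        self.fetch = fetch                  # reads the flag from Redis in a real SDK
        self.ttl = ttl_seconds
        self.value = fetch()                # load before serving traffic
        self.loaded_at = time.monotonic()

    def get(self):
        if time.monotonic() - self.loaded_at > self.ttl:
            self.value = self.fetch()       # stale: re-read from the source
            self.loaded_at = time.monotonic()
        return self.value                   # fresh enough: no network call

store = {"timeout_ms": 500}
flag = CachedFlag(lambda: store["timeout_ms"], ttl_seconds=10.0)
store["timeout_ms"] = 900      # changed upstream
flag.get()                     # still 500: within the 10s staleness bound
flag.loaded_at -= 11           # simulate the TTL expiring
flag.get()                     # 900: refreshed on the next read
```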

Operational Features

Scheduled rollouts: "enable at 9am on launch day" — store scheduled_at on the rule, evaluate based on current time. Kill switch: a special always-off override rule with maximum priority. Dependency tracking: flag A may depend on flag B — evaluate dependencies in topological order. Dashboard: real-time flag state, rollout percentage, and error rate correlation.
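Dependency-ordered evaluation can lean on the standard library's topological sorter. A sketch, assuming the semantics that a flag is forced off when any flag it depends on is off (the flag names and that semantics are illustrative):

```python
from graphlib import TopologicalSorter

# Maps each flag to the set of flags it depends on (its predecessors).
dependencies = {
    "new_checkout_ui": {"new_checkout_backend"},
    "new_checkout_backend": set(),
}
raw_values = {"new_checkout_backend": False, "new_checkout_ui": True}

resolved = {}
# static_order() yields dependencies before dependents, so every
# predecessor is already resolved when a flag is evaluated.
for flag in TopologicalSorter(dependencies).static_order():
    deps_on = all(resolved[d] for d in dependencies[flag])
    resolved[flag] = raw_values[flag] and deps_on

print(resolved)  # new_checkout_ui ends up off because its dependency is off
```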

Asked at: Netflix, Cloudflare, Databricks, Atlassian.
