System Design Interview: Design a Configuration Management System (etcd/Consul)

What Is a Configuration Management System?

A configuration management system stores key-value configuration data that services read at startup or at runtime, enabling feature flags, service discovery, and dynamic tuning without redeployment. Examples: etcd (Kubernetes backbone), Consul, AWS AppConfig, LaunchDarkly. Core challenges: strong consistency (every reader sees the same value), watch notifications (push updates to subscribers within milliseconds), and high availability despite distributed consensus overhead.

System Requirements

    Functional

    • Get/Put/Delete key-value pairs
    • Watch: subscribe to changes on a key or prefix
    • Transactions: compare-and-swap (CAS) for leader election and distributed locks
    • Leases: keys auto-expire if the holder does not renew (ephemeral keys)
    • RBAC: per-key and per-prefix access control
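    A minimal sketch of the client API implied by these requirements (hypothetical names and signatures, not a specific client library; RBAC would be enforced in front of each call):

    from typing import Callable, Optional

    class ConfigClient:
        """Hypothetical client interface covering the functional requirements above."""

        def get(self, key: str) -> Optional[bytes]: ...                          # read a single key
        def put(self, key: str, value: bytes, lease: Optional[int] = None): ...  # write, optionally bound to a lease
        def delete(self, key: str) -> None: ...                                  # remove a key
        def watch(self, prefix: str, on_event: Callable[[dict], None]): ...      # push notifications on key/prefix changes
        def txn(self, compare: list, success: list, failure: list) -> bool: ...  # compare-and-swap transaction
        def grant_lease(self, ttl_seconds: int) -> int: ...                      # ephemeral keys: returns a lease ID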

    Non-Functional

    • Strong consistency: linearizable reads (no stale reads)
    • Watch latency: config changes propagated to subscribers in <100ms
    • High availability: survive minority node failures (3-node or 5-node cluster)
    • 10K reads/second, 100 writes/second (config is read-heavy)

    Raft Consensus

    etcd uses the Raft consensus algorithm to replicate writes across nodes. A cluster of N nodes tolerates (N-1)/2 failures (rounded down): a 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. Raft basics:

    1. One node is elected leader (via randomized election timeout)
    2. All writes go to the leader
    3. Leader appends the write to its log and replicates to followers
    4. Once a majority (quorum) acknowledge, the write is committed
    5. Committed writes are applied to the state machine (key-value store)
    6. Readers can read from the leader (linearizable) or from followers (possibly stale)

    # etcd client operations
    etcdctl put /services/auth/host "10.0.1.5"
    etcdctl get /services/auth/host
    etcdctl watch /services/auth/         # watch prefix, get notified on any change
    etcdctl lease grant 60                                 # prints a lease ID
    etcdctl put /locks/job1 "worker-3" --lease=<lease-id>  # key auto-expires when the 60s lease does
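    A small sketch of the quorum arithmetic behind steps 4 and 5, just to make the failure-tolerance rule concrete:

    def quorum(n_nodes: int) -> int:
        """Smallest majority: the number of acks needed to commit a write."""
        return n_nodes // 2 + 1

    def tolerated_failures(n_nodes: int) -> int:
        """Nodes that can fail while the cluster can still reach quorum."""
        return (n_nodes - 1) // 2

    def is_committed(n_nodes: int, acks: int) -> bool:
        """Step 4: the write commits once a majority (leader included) has acknowledged it."""
        return acks >= quorum(n_nodes)

    assert quorum(3) == 2 and tolerated_failures(3) == 1   # 3-node cluster survives 1 failure
    assert quorum(5) == 3 and tolerated_failures(5) == 2   # 5-node cluster survives 2 failures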
    

    Watch Mechanism

    Clients register watches on keys or prefixes. The server stores active watchers. When a write is committed, the server scans watchers matching the written key and pushes a WatchEvent to each subscriber over a gRPC stream. The event includes: key, new value, old value, revision number. Clients reconnect automatically on disconnect and resume from the last seen revision (no events missed).
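    A toy, in-memory sketch of that server-side bookkeeping (hypothetical names; real etcd pushes events over gRPC streams and tracks revisions so clients can resume):

    from collections import defaultdict
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class WatchEvent:
        key: str
        new_value: bytes
        old_value: bytes
        revision: int

    class WatchRegistry:
        """Toy version of the server-side watcher table."""

        def __init__(self) -> None:
            # prefix -> callbacks; in a real server each callback writes to a gRPC stream
            self.watchers: Dict[str, List[Callable[[WatchEvent], None]]] = defaultdict(list)

        def register(self, prefix: str, push: Callable[[WatchEvent], None]) -> None:
            self.watchers[prefix].append(push)

        def notify(self, event: WatchEvent) -> None:
            # Called after a write is committed: fan the event out to every matching watcher.
            for prefix, subscribers in self.watchers.items():
                if event.key.startswith(prefix):
                    for push in subscribers:
                        push(event)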

    Feature Flags with Config System

    PUT /features/dark_mode {"enabled": true, "rollout_percent": 20}
    
    # Application code (pseudocode; the value is stored as JSON)
    config = json.loads(etcd.get("/features/dark_mode"))
    # Note: use a stable hash in production; Python's built-in hash() varies per process.
    if config["enabled"] and hash(user_id) % 100 < config["rollout_percent"]:
        show_dark_mode()
    

    Percentage rollout without redeployment. Watch for changes: when the config is updated, the app receives a WatchEvent and hot-reloads the feature flag within 100ms. No restart required.
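    A sketch of that hot-reload pattern in the same pseudocode style as the snippet above (the watch registration and event fields are assumptions, not a specific client's API):

    import hashlib
    import json

    flag_cache = {"enabled": False, "rollout_percent": 0}   # in-memory copy of the flag

    def stable_hash(user_id: str) -> int:
        # Stable across processes, so a user lands in the same rollout bucket everywhere.
        return int(hashlib.sha256(user_id.encode()).hexdigest(), 16)

    def on_flag_change(event):
        # Called by the watch stream when /features/dark_mode changes: hot reload, no restart.
        global flag_cache
        flag_cache = json.loads(event.new_value)

    # etcd.watch("/features/dark_mode", on_flag_change)     # pseudocode: register the callback

    def should_show_dark_mode(user_id: str) -> bool:
        return flag_cache["enabled"] and stable_hash(user_id) % 100 < flag_cache["rollout_percent"]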

    Service Discovery

    Services register themselves on startup with a lease:

    lease = etcd.grant_lease(ttl=30)
    etcd.put(f"/services/auth/{instance_id}", json.dumps({"host": "10.0.1.5", "port": 8080}), lease=lease)
    # Keepalive: renew lease every 10 seconds
    etcd.keepalive(lease)
    

    On crash: the keepalive stops, the lease expires within 30 seconds, and the key is auto-deleted. Other services watching /services/auth/ receive a delete event and remove the dead instance from their load balancer. Kubernetes relies on etcd in much the same way: the API server persists cluster state (pods, services, endpoints) in etcd, and controllers watch it to react to changes.
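    A consumer-side sketch in the same pseudocode style (event fields and the watch call are assumptions): the watcher keeps an in-memory view of live instances for the load balancer.

    import json

    live_instances = {}                          # instance_id -> {"host": ..., "port": ...}

    def on_auth_service_event(event):
        instance_id = event.key.rsplit("/", 1)[-1]
        if event.type == "DELETE":               # lease expired or instance deregistered
            live_instances.pop(instance_id, None)
        else:                                    # PUT: new or re-registered instance
            live_instances[instance_id] = json.loads(event.new_value)

    # etcd.watch_prefix("/services/auth/", on_auth_service_event)   # pseudocode: register the watch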

    Compare-and-Swap for Leader Election

    # Only succeeds if the key does not currently exist (version 0 means "no key")
    succeeded, _ = etcd.transaction(
        compare=[etcd.transactions.version("/election/leader") == 0],
        success=[etcd.transactions.put("/election/leader", node_id, lease=my_lease)],
        failure=[]
    )
    if succeeded:
        become_leader()  # this node won the election
    

    CAS atomicity: only one node succeeds in creating the key. The winning node is the leader. When the leader crashes, its lease expires and the key is deleted. Other nodes retry the CAS and a new leader is elected.
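    A sketch of the surrounding election loop in the same pseudocode style: losers wait for the key to disappear, then retry the CAS.

    import time

    def run_for_leader(etcd, node_id, my_lease):
        """Contend for leadership until this node wins the CAS (same pseudocode client as above)."""
        while True:
            succeeded, _ = etcd.transaction(
                compare=[etcd.transactions.version("/election/leader") == 0],
                success=[etcd.transactions.put("/election/leader", node_id, lease=my_lease)],
                failure=[],
            )
            if succeeded:
                return                           # leader for as long as my_lease keeps being renewed
            # Lost the race: wait for the current leader's key to disappear, then retry.
            # (A real implementation would block on a watch instead of polling.)
            value, _ = etcd.get("/election/leader")
            while value is not None:
                time.sleep(1)
                value, _ = etcd.get("/election/leader")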

    Scaling Reads

    Config systems are roughly 100:1 read-heavy, and Raft requires a quorum round only for writes, so reads have three options:

    • Linearizable reads: route to the leader, which confirms its term with a quorum heartbeat. Highest consistency, higher latency.
    • Serializable reads: read from any follower. May be slightly stale, but 3-5ms faster.
    • Client-side caching: cache config values in the application and invalidate on WatchEvent. Most apps cache config in memory and update on watch, which reduces etcd reads to near zero.
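    A sketch of that caching pattern (pseudocode client calls, not a specific library's exact API): reads are served from process memory, and the watch stream keeps the copy current.

    import json

    class ConfigCache:
        """In-memory mirror of every key under a prefix, kept fresh by a watch stream."""

        def __init__(self, etcd, prefix: str):
            self.values = {}
            for value, meta in etcd.get_prefix(prefix):      # pseudocode: one bulk read at startup
                self.values[meta.key.decode()] = json.loads(value)
            # etcd.watch_prefix(prefix, self._on_event)      # pseudocode: register the watch

        def _on_event(self, event):
            if event.type == "DELETE":
                self.values.pop(event.key, None)
            else:
                self.values[event.key] = json.loads(event.new_value)

        def get(self, key: str):
            return self.values.get(key)                      # served from memory, no network call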

    Interview Tips

    • Raft is the consensus algorithm — know the quorum rule: N nodes tolerate (N-1)/2 failures.
    • Leases enable ephemeral keys without explicit deletion — key for service discovery and leader election.
    • Watch + gRPC stream is the push mechanism — not polling.
    • Client-side caching with watch invalidation is the production read pattern.