System Design Interview: Design a Key-Value Store (Redis / DynamoDB)

What Is a Distributed Key-Value Store?

A key-value store maps arbitrary keys to values with O(1) average-time get and put operations. Distributed key-value stores (Redis Cluster, DynamoDB, Cassandra, etcd) partition data across nodes for horizontal scalability and replicate it for fault tolerance. The core trade-off is the CAP theorem: when a network partition occurs, the system must choose between consistency and availability.

Data Partitioning

    Consistent Hashing

    Map both nodes and keys onto a ring of hash values (0 to 2^64 − 1). A key is assigned to the first node clockwise from its hash position. Virtual nodes (vnodes): each physical node gets 150–200 virtual positions on the ring, distributing load more evenly and reducing the impact of adding or removing nodes. Adding a node: only keys between the new node and its predecessor migrate — O(K/N) keys move, not O(K). Without consistent hashing (e.g., naive hash(key) mod N placement), growing from N to N+1 nodes remaps roughly N/(N+1) of all K keys — an O(K) migration.
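A minimal sketch of the ring described above, with virtual nodes and clockwise lookup. The class name, vnode count, and MD5-based 64-bit positioning are illustrative choices, not any particular system's implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.hashes = []   # sorted 64-bit ring positions
        self.owners = []   # owners[i] is the node at position hashes[i]

    def _position(self, key):
        # First 8 bytes of MD5 as a position on the 0..2^64-1 ring
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node):
        # Each physical node occupies `vnodes` positions on the ring
        for i in range(self.vnodes):
            h = self._position(f"{node}#{i}")
            idx = bisect.bisect(self.hashes, h)
            self.hashes.insert(idx, h)
            self.owners.insert(idx, node)

    def lookup(self, key):
        # First node clockwise from the key's hash position (wrap around)
        idx = bisect.bisect(self.hashes, self._position(key)) % len(self.hashes)
        return self.owners[idx]
```

Adding a node only reassigns keys to that node; every other key keeps its previous owner, which is exactly the O(K/N) migration property.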

    Replication

    Replicate each key to N neighbors clockwise on the ring (N=3 is typical). Replication factor = 3 means data survives the loss of 2 replicas. Write quorum W: how many replicas must acknowledge before a write is considered successful. Read quorum R: how many replicas must respond before a read is returned. Strong consistency: R + W > N (e.g., R=2, W=2, N=3). Eventual consistency: W=1, R=1 — fast, but stale reads are possible. Tunable consistency: Cassandra lets the caller specify a ConsistencyLevel per request; DynamoDB exposes a per-read ConsistentRead flag.
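The quorum arithmetic above can be made concrete. A sketch under simplifying assumptions — replicas are modeled as (version, value) pairs, and "first R to respond" is a plain slice rather than parallel RPCs:

```python
def is_strongly_consistent(n, r, w):
    """R + W > N guarantees every read quorum overlaps the latest write quorum,
    so at least one responding replica has the newest version."""
    return r + w > n

def quorum_read(replicas, r):
    """Return the newest (version, value) among the first R replicas to respond.
    A real store issues parallel reads and resolves by version or timestamp."""
    responses = replicas[:r]        # stand-in for "first R responders"
    return max(responses)           # highest version wins

# N=3 configurations from the text
strong = is_strongly_consistent(3, r=2, w=2)    # True: strong consistency
eventual = is_strongly_consistent(3, r=1, w=1)  # False: stale reads possible

# One replica lags at version 1; an R=2 read still sees the newest write
replicas = [(2, "b"), (1, "a"), (2, "b")]
latest = quorum_read(replicas, r=2)
```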

    Storage Engine

    LSM-Tree (Log-Structured Merge-Tree): the standard for write-heavy KV stores (LevelDB, RocksDB, Cassandra).

    1. Writes go to WAL (crash recovery) + in-memory MemTable (sorted)
    2. MemTable flushed to SSTable (immutable sorted file) when full
    3. SSTables compacted in background — merge-sort, remove tombstones
    4. Read: check MemTable, then L0 SSTables, then L1, L2… (bloom filters accelerate misses)
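The write and read paths above can be sketched in a few lines. This toy omits the WAL, compaction, and bloom filters, and stores SSTables as in-memory dicts rather than immutable sorted files:

```python
class TinyLSM:
    """Toy LSM-tree: in-memory MemTable, flushed to 'SSTables' when full,
    newest-first read path. Tombstones mark deletions until compaction."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []   # newest last; real SSTables are sorted on-disk files

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush: persist entries in sorted key order, start a fresh MemTable
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def delete(self, key):
        self.put(key, None)   # tombstone; dropped during compaction

    def get(self, key):
        if key in self.memtable:              # step 4: MemTable first
            return self.memtable[key]
        for table in reversed(self.sstables): # then SSTables, newest first
            if key in table:
                return table[key]
        return None
```

The newest-first read order is what makes overwrites and tombstones work: a later SSTable shadows older versions of the same key without ever modifying old files.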

    B-Tree (InnoDB, PostgreSQL): better for read-heavy workloads — in-place updates and lower read amplification, but higher write amplification than an LSM-tree.

    Conflict Resolution

    Concurrent writes to the same key on different replicas create conflicts. Strategies:

    • Last-Write-Wins (LWW): the highest timestamp wins. Simple, but silently discards concurrent writes, and clock skew can pick the wrong winner
    • Vector clocks: each replica increments its own counter on write. A conflict is detected when two clocks are incomparable; the client resolves it (as in Amazon Dynamo's shopping cart, where conflicting carts are merged)
    • CRDTs: data types designed to merge automatically (G-Counter, 2P-Set)
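The "incomparable clocks" test is the whole trick of vector-clock conflict detection. A sketch, representing each clock as a dict of replica → counter:

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks.
    Returns "a<b", "a>b", "equal", or "concurrent" (incomparable = conflict)."""
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<b"         # b causally follows a: b wins
    if b_le_a:
        return "a>b"         # a causally follows b: a wins
    return "concurrent"      # neither dominates: surface both versions to the client
```

Two writes that each advanced a different replica's counter (e.g. `{r1: 1}` vs `{r2: 1}`) are concurrent, and the store must keep both versions until the client or a merge function reconciles them.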

    Failure Detection

    Gossip protocol: each node periodically sends its local view of which nodes are up/down to a random neighbor. Rumors spread exponentially — after O(log N) rounds, all nodes know about a failure. Suspicion mechanism: mark nodes as SUSPECT before FAILED to handle slow nodes. Phi-accrual failure detector: instead of binary up/down, outputs a probability of failure that increases with time since last heartbeat. Used by Cassandra.
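A minimal sketch of the phi-accrual idea, simplified to an exponential model of heartbeat inter-arrival times (Cassandra's implementation fits a distribution over a sliding window of recent intervals instead of using a fixed mean):

```python
import math

def phi(time_since_last_heartbeat, mean_interval):
    """Suspicion level phi = -log10(P(heartbeat would arrive this late)).
    Simplified: assumes exponentially distributed heartbeat gaps, so
    P(gap > t) = exp(-t / mean) and phi grows linearly with silence."""
    p_heartbeat_still_coming = math.exp(-time_since_last_heartbeat / mean_interval)
    return -math.log10(p_heartbeat_still_coming)

# Heartbeats normally arrive every 1s: 8s of silence is far more suspicious
# than 1s. A typical convention is to declare failure once phi crosses a
# configured threshold rather than on a single missed heartbeat.
low = phi(1.0, mean_interval=1.0)
high = phi(8.0, mean_interval=1.0)
```

This is what "probability of failure that increases with time" means in practice: phi is continuous, so operators tune a threshold (trading detection speed against false positives) instead of a binary timeout.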

    Interview Framework

    1. Partitioning: consistent hashing with virtual nodes
    2. Replication: N=3, configurable R and W quorums
    3. Storage: LSM-Tree for write-heavy, B-Tree for read-heavy
    4. Consistency: CAP trade-off, vector clocks for conflict detection
    5. Failure detection: gossip + phi-accrual