System Design Interview: Design a Distributed Key-Value Store
A distributed key-value store is one of the foundational system design questions — it tests your understanding of distributed systems fundamentals: consistency, partitioning, replication, and fault tolerance. Systems like DynamoDB, Cassandra, and Redis Cluster solve this problem. Here is the architecture and the trade-offs you need to articulate in an interview.
Requirements
Functional: get(key), put(key, value), delete(key). Keys up to 256 bytes, values up to 10 MB.
Non-functional: low latency (<10ms p99), high availability (99.99%), horizontal scalability to petabytes of data, eventual consistency acceptable (tunable).
Data Partitioning: Consistent Hashing
Distribute keys across nodes without rehashing everything when nodes join or leave.
Virtual nodes (vnodes): each physical node owns 150-300 vnodes on the hash ring. Benefits: (1) keys distribute evenly, and heterogeneous hardware is handled by simply giving bigger machines more vnodes; (2) when a node fails, its vnodes scatter its load across many nodes rather than one neighbor absorbing everything.
hash_ring: sorted array of (hash_value, node_id)
lookup(key): hash(key) → find first node clockwise on ring
add_node(new_node): take ~150 vnodes from existing nodes, redistribute only those keys
remove_node(dead_node): its vnodes transfer to the next clockwise node
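A minimal Python sketch of this ring; the class and method names are illustrative, not from any particular system:

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, vnodes_per_node: int = 150):
        self.vnodes = vnodes_per_node
        self.ring = []          # sorted list of (hash_value, node_id)

    def add_node(self, node_id: str):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node_id}#{i}"), node_id))

    def remove_node(self, node_id: str):
        self.ring = [(h, n) for h, n in self.ring if n != node_id]

    def lookup(self, key: str) -> str:
        # first vnode clockwise from the key's hash (wrap around at the end)
        idx = bisect.bisect_right(self.ring, (_hash(key), ""))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]

ring = ConsistentHashRing()
for node in ["node-a", "node-b", "node-c"]:
    ring.add_node(node)
print(ring.lookup("user:42"))   # owning node for this key
```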
Replication
Replicate each key to N nodes (typically N=3) for fault tolerance. With consistent hashing: the primary node plus the next N-1 clockwise nodes in the ring are the replica set for a key.
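Building on the ring sketch above, a hypothetical `replica_set` helper walks clockwise from the key's position and collects N distinct physical nodes, skipping additional vnodes of nodes already chosen:

```python
import bisect

def replica_set(ring, key: str, n: int = 3) -> list:
    """Primary plus the next N-1 distinct physical nodes clockwise on the ring."""
    start = bisect.bisect_right(ring.ring, (_hash(key), ""))
    replicas = []
    for i in range(len(ring.ring)):
        node = ring.ring[(start + i) % len(ring.ring)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == n:
            break
    return replicas
```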
Quorum reads and writes: with N=3 replicas, W write quorum, R read quorum:
- Strong consistency: W + R > N (e.g., W=2, R=2)
- High availability: W=1, R=1 (eventual consistency, lowest latency)
- Write-heavy optimization: W=1, R=3 (fast writes; every read must reach all replicas, so read latency and availability suffer)
The coordinator node (the node that receives the client request) sends the operation to all N replicas in parallel and returns once W (for a write) or R (for a read) replicas have acknowledged.
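A simplified coordinator sketch for the write path; the `replica.put` interface and the timeout value are assumptions, not part of any real client library:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as QuorumTimeout

def quorum_put(replicas, key, value, w=2, timeout=0.05):
    """Send the write to all N replicas in parallel; succeed once W acknowledge."""
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(replica.put, key, value) for replica in replicas]
    acks = 0
    try:
        for fut in as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                acks += 1
            if acks >= w:
                return True              # quorum reached; slower replicas finish on their own
    except QuorumTimeout:
        pass
    finally:
        pool.shutdown(wait=False)        # don't block the client on stragglers
    return False                         # fewer than W acks in time: the write fails
```

The same pattern handles reads, with R in place of W.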
Storage Engine
On each node, use an LSM-tree (Log-Structured Merge-tree):
- Write path: write to WAL (write-ahead log) for durability, then to an in-memory MemTable (sorted by key, ~64MB)
- Flush: when MemTable fills, flush to an immutable SSTable (Sorted String Table) on disk
- Compaction: background process merges SSTables, removes deleted/overwritten keys, maintains sorted order
- Read path: check MemTable → check Bloom filter for each SSTable → read from disk if needed
Bloom filters eliminate most unnecessary disk reads: if the filter says the key is not in an SSTable, skip that file entirely (false positive rate is typically ~1%). This keeps read latency competitive with B-tree engines despite the multi-layer structure.
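A toy single-node sketch of this write and read path; the JSON file format stands in for a real SSTable encoding, and compaction and Bloom filters are omitted so only the layering shows:

```python
import json, os

class MiniLSM:
    def __init__(self, data_dir: str, memtable_limit: int = 4):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit
        self.memtable = {}                                # in-memory buffer, sorted at flush time
        self.sstables = []                                # SSTable file paths, oldest first
        os.makedirs(data_dir, exist_ok=True)
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")

    def put(self, key, value):
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()                                  # 1. durable WAL append before acking
        self.memtable[key] = value                        # 2. update the MemTable
        if len(self.memtable) >= self.memtable_limit:
            self._flush()                                 # 3. spill to an immutable SSTable

    def _flush(self):
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstables.append(path)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                          # MemTable first
            return self.memtable[key]
        for path in reversed(self.sstables):              # then SSTables, newest first
            with open(path) as f:                         # (a real engine consults a Bloom filter
                table = json.load(f)                      #  before touching the file at all)
            if key in table:
                return table[key]
        return None
```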
Conflict Resolution: Vector Clocks
With eventual consistency, a key can be written on multiple replicas concurrently before they sync. Detecting that two versions are concurrent, rather than one superseding the other, requires a causality mechanism such as vector clocks:
vector_clock: {node_id: sequence_number}
# Write on node A: increment A's counter
# Example: two divergent versions after concurrent writes on nodes A and B:
value1: {A:2, B:1} # written mostly on A
value2: {A:1, B:2} # written mostly on B
# These are concurrent (neither dominates) → conflict
# Resolution: last-write-wins (timestamp), or return both to client for application-level merge
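A small helper that makes the dominance check above concrete; a version dominates another only if its counter is greater than or equal on every node:

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Return 'a_dominates', 'b_dominates', 'equal', or 'concurrent'."""
    nodes = set(vc_a) | set(vc_b)
    a_ge = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    b_ge = all(vc_b.get(n, 0) >= vc_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_dominates"
    if b_ge:
        return "b_dominates"
    return "concurrent"                                   # neither dominates: a genuine conflict

print(compare({"A": 2, "B": 1}, {"A": 1, "B": 2}))        # -> concurrent
```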
Production systems often simplify this: DynamoDB, for example, resolves concurrent writes with “last writer wins” by default and offers conditional writes for optimistic locking.
Handling Failures
Hinted Handoff
When a replica node is temporarily down, the coordinator writes the value to another healthy node along with a “hint” recording the intended recipient. When the failed node recovers, the hinted writes are replayed to bring it up to date. This preserves write availability (a sloppy quorum) during node failures.
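A rough sketch of the bookkeeping; real systems persist hints to disk with a TTL, and `send_put` here stands in for whatever node-to-node write RPC the cluster uses:

```python
class HintStore:
    """Holds writes intended for a down replica; replays them when it recovers."""
    def __init__(self):
        self.hints = {}                                   # intended_node_id -> [(key, value), ...]

    def store(self, intended_node_id, key, value):
        self.hints.setdefault(intended_node_id, []).append((key, value))

    def replay(self, recovered_node_id, send_put):
        # drain and deliver every hinted write to the recovered node
        for key, value in self.hints.pop(recovered_node_id, []):
            send_put(recovered_node_id, key, value)
```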
Anti-Entropy with Merkle Trees
To detect data divergence between replicas without comparing every key:
- Each node builds a Merkle tree over its key space (leaf = hash of key-value, internal nodes = hash of children)
- Nodes exchange tree roots periodically; if roots differ, traverse to find divergent subtrees
- Sync only the divergent key ranges — typically O(differences) not O(total keys)
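A compact sketch of the build-and-compare step, assuming both replicas bucket their key space identically and `buckets[i]` is already a hash over the key-value pairs in range i:

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class MerkleNode:
    def __init__(self, lo, hi, digest, left=None, right=None):
        self.lo, self.hi = lo, hi                 # range of leaf buckets this node covers
        self.digest = digest
        self.left, self.right = left, right

def build(buckets, lo=0, hi=None):
    hi = len(buckets) if hi is None else hi
    if hi - lo == 1:
        return MerkleNode(lo, hi, _h(buckets[lo]))        # leaf: one key range
    mid = (lo + hi) // 2
    left, right = build(buckets, lo, mid), build(buckets, mid, hi)
    return MerkleNode(lo, hi, _h(left.digest + right.digest), left, right)

def divergent_ranges(a, b):
    """Descend both trees in lockstep; return only leaf ranges whose hashes differ."""
    if a.digest == b.digest:
        return []                                 # identical subtree: skip it entirely
    if a.left is None:                            # differing leaf: this key range needs syncing
        return [(a.lo, a.hi)]
    return divergent_ranges(a.left, b.left) + divergent_ranges(a.right, b.right)
```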
API Design
PUT /keys/{key} body: {value, ttl_seconds (optional)} → 200 OK
GET /keys/{key} → 200 {value, version} or 404
DELETE /keys/{key} → 200 OK (tombstone written, not immediate deletion)
Client SDK: consistent_put(key, value, quorum=W)
consistent_get(key, quorum=R, read_repair=True)
Read repair: when a GET reads from R replicas and finds divergent values, it asynchronously updates stale replicas with the latest value. This passively heals inconsistency without a separate background job.
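A read-repair sketch that reuses the hypothetical replica interface from the quorum example and a plain numeric version for ordering (a real system would compare vector clocks or timestamps):

```python
def quorum_get(replicas, key, r=2):
    """Read from R replicas, return the freshest value, and repair stale replicas."""
    responses = []
    for replica in replicas[:r]:
        value, version = replica.get(key)                 # assumed to return (value, version)
        responses.append((replica, value, version))

    _, latest_value, latest_version = max(responses, key=lambda resp: resp[2])

    for replica, _, version in responses:
        if version < latest_version:                      # stale copy: write back the winner
            replica.put(key, latest_value)                # (done asynchronously in a real system)

    return latest_value, latest_version
```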
Comparison: DynamoDB vs. Redis Cluster vs. Cassandra
| Feature | DynamoDB | Redis Cluster | Cassandra |
|---|---|---|---|
| Consistency | Eventual (tunable) | Eventual | Eventual (tunable) |
| Storage engine | B-tree + LSM | Hash table (in-memory) | LSM-tree |
| Max value size | 400 KB | 512 MB | 2 GB |
| Latency | <10ms | <1ms | <10ms |
| Use case | General purpose | Caching, sessions | Time-series, write-heavy |
Interview Tips
- Open with CAP theorem: you're designing an AP system (availability + partition tolerance), accepting eventual consistency
- Explain consistent hashing and why virtual nodes matter — this is a signal of depth
- Walk through the write path: client → coordinator → W replicas → WAL → MemTable → SSTable
- Mention Bloom filters when discussing reads — shows you know the storage engine internals
- For conflict resolution: “last-write-wins is simple; vector clocks are needed when you can't afford data loss”