What Is a Distributed Key-Value Store?
A key-value store maps opaque keys to values and supports get, put, and delete, typically in O(1) average time for hash-based lookups. Distributed key-value stores (Redis Cluster, DynamoDB, Cassandra, etcd) partition data across nodes for horizontal scalability and replicate it for fault tolerance. The core trade-off is the CAP theorem: when a network partition occurs, a system must choose between consistency and availability.
Data Partitioning
Consistent Hashing
Map both nodes and keys onto a ring of hash values (0 to 2^64 − 1). A key is assigned to the first node encountered moving clockwise from the key's hash position. Virtual nodes (vnodes): each physical node takes on the order of 100–200 positions on the ring, which spreads load more evenly and softens the impact of adding or removing a node. Adding a node: only the keys between the new node and its predecessor migrate, so roughly K/N keys move rather than all K. Without consistent hashing (e.g., a modulo-based scheme), adding one node would rehash nearly all K keys across N+1 nodes, an O(K) migration.
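A minimal ring sketch in Python to make this concrete; the class name, the MD5-based hash, and the 150-vnodes-per-node default are illustrative assumptions, not any particular store's implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, vnodes_per_node=150):
        self.vnodes = vnodes_per_node
        self.hashes = []   # sorted vnode positions on the ring
        self.ring = []     # parallel list of (hash, physical node)

    def _hash(self, key: str) -> int:
        # 64-bit position on the ring (MD5 here is an arbitrary choice)
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            h = self._hash(f"{node}#vnode{i}")
            idx = bisect.bisect(self.hashes, h)
            self.hashes.insert(idx, h)
            self.ring.insert(idx, (h, node))

    def remove_node(self, node: str) -> None:
        kept = [(h, n) for h, n in self.ring if n != node]
        self.ring = kept
        self.hashes = [h for h, _ in kept]

    def get_node(self, key: str) -> str:
        # First vnode clockwise from the key's hash, wrapping past the end
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing()
for n in ("node-a", "node-b", "node-c"):
    ring.add_node(n)
print(ring.get_node("user:42"))   # the key's owning node
```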
Replication
Replicate each key to the next N nodes clockwise on the ring (N = 3 is typical); a replication factor of 3 lets data survive two node failures. Write quorum W: how many replicas must acknowledge a write before it is considered successful. Read quorum R: how many replicas must respond before the read is returned. Strong consistency requires R + W > N (e.g., R = 2, W = 2, N = 3), so every read quorum overlaps every write quorum. Eventual consistency: W = 1, R = 1 is fast, but stale reads are possible. Tunable consistency: Cassandra lets the caller set a ConsistencyLevel per query, and DynamoDB exposes a per-request flag for strongly consistent reads.
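A toy quorum coordinator in Python to make the R/W arithmetic concrete; the Replica class and its put/get methods are stand-ins for real network calls, and versions stand in for timestamps or vector clocks:

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    version: int                     # stand-in for a timestamp or vector clock

class Replica:
    """Toy in-memory replica; a real one lives on another node."""
    def __init__(self):
        self.store = {}
    def put(self, key, versioned):
        self.store[key] = versioned
        return True                  # acknowledge the write
    def get(self, key):
        return self.store.get(key)

def quorum_write(replicas, key, value, version, W):
    """Succeed once at least W replicas acknowledge the write."""
    acks = sum(1 for r in replicas if r.put(key, Versioned(value, version)))
    return acks >= W

def quorum_read(replicas, key, R):
    """Require at least R responses and return the freshest version seen."""
    responses = [v for r in replicas if (v := r.get(key)) is not None]
    if len(responses) < R:
        raise RuntimeError("read quorum not reached")
    return max(responses, key=lambda v: v.version)

N = 3
replicas = [Replica() for _ in range(N)]
quorum_write(replicas, "user:42", "alice", version=1, W=2)
# R=2, W=2, N=3 satisfies R + W > N, so the read quorum overlaps the write quorum
print(quorum_read(replicas, "user:42", R=2).value)
```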
Storage Engine
LSM-Tree (Log-Structured Merge-Tree): the standard for write-heavy KV stores (LevelDB, RocksDB, Cassandra).
- Writes go to a WAL (for crash recovery) and to an in-memory MemTable (kept sorted)
- When the MemTable fills, it is flushed to an SSTable (an immutable sorted file)
- SSTables are compacted in the background: merge-sorted together, with tombstones and overwritten versions dropped
- Reads check the MemTable, then L0 SSTables, then L1, L2, … (Bloom filters let reads skip SSTables that cannot contain the key)
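A toy sketch of that write/read path in Python; the WAL, Bloom filters, and a real leveled compaction policy are deliberately omitted, so this is only the shape of the idea:

```python
import bisect

class TinyLSM:
    """Toy LSM engine: a MemTable flushed to immutable sorted 'SSTables'."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}    # recent writes (a real store also appends to a WAL)
        self.sstables = []    # newest-first list of sorted (key, value) lists
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self._flush()

    def delete(self, key):
        self.put(key, None)   # tombstone; dropped later during compaction

    def get(self, key):
        if key in self.memtable:              # 1. MemTable
            return self.memtable[key]
        for table in self.sstables:           # 2. SSTables, newest first
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None

    def _flush(self):
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        merged = {}
        for table in reversed(self.sstables):     # oldest first, so newer wins
            merged.update(dict(table))
        live = [(k, v) for k, v in sorted(merged.items()) if v is not None]
        self.sstables = [live]
```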
B-Tree (InnoDB, PostgreSQL): better for read-heavy workloads. In-place updates keep read amplification low, but random-I/O page writes make write throughput lower than an LSM-Tree's sequential appends.
Conflict Resolution
Concurrent writes to the same key on different replicas create conflicts. Strategies:
- Last-Write-Wins (LWW): the highest timestamp wins. Simple, but concurrent writes are silently dropped and clock skew can pick the wrong winner
- Vector clocks: each replica increments its own counter on write; a conflict is detected when two clocks are incomparable, and the client resolves it (the approach Amazon's Dynamo used for its shopping cart). See the sketch after this list
- CRDTs: data types designed to merge automatically (G-Counter, 2P-Set)
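A minimal vector-clock comparison in Python; the dict-based clocks and replica names are illustrative, and real systems attach a clock like this to every stored version of a key:

```python
def vc_increment(clock, node):
    """Return a copy of the vector clock with `node`'s counter bumped."""
    bumped = dict(clock)
    bumped[node] = bumped.get(node, 0) + 1
    return bumped

def vc_compare(a, b):
    """'<' if a happened before b, '>' if after, '=' if equal, 'conflict' if concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "="
    if a_le_b:
        return "<"
    if b_le_a:
        return ">"
    return "conflict"   # incomparable counters: concurrent writes, client must merge

# Two replicas accept writes for the same key independently:
v1 = vc_increment({}, "replica-a")   # {'replica-a': 1}
v2 = vc_increment({}, "replica-b")   # {'replica-b': 1}
print(vc_compare(v1, v2))            # 'conflict' -> surfaced to the client to resolve
```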
Failure Detection
Gossip protocol: each node periodically sends its local view of which nodes are up or down to a random peer. Rumors spread exponentially, so after O(log N) rounds every node has heard about a failure. Suspicion mechanism: mark nodes SUSPECT before FAILED so slow nodes are not declared dead prematurely. Phi-accrual failure detector: instead of a binary up/down verdict, it outputs a suspicion level (phi) that grows with the time since the last heartbeat, based on the observed distribution of heartbeat intervals. Used by Cassandra.
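A simplified phi-accrual sketch in Python; modeling heartbeat gaps as exponentially distributed is an assumption made here for brevity (the original paper, and Cassandra, fit the observed inter-arrival distribution more carefully):

```python
import math
import time

class PhiAccrualDetector:
    """Simplified phi-accrual detector for one peer. Assumes heartbeat gaps are
    roughly exponentially distributed; the original paper fits a fuller model."""

    def __init__(self, window=100):
        self.intervals = []          # recent gaps between heartbeats, in seconds
        self.window = window
        self.last = None

    def heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self.last is not None:
            self.intervals = (self.intervals + [now - self.last])[-self.window:]
        self.last = now

    def phi(self, now=None):
        """Suspicion level: phi = 1 means roughly a 10% chance the heartbeat is
        merely late; phi = 3 means roughly 0.1%."""
        now = time.time() if now is None else now
        if self.last is None or not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        p_later = math.exp(-(now - self.last) / mean)   # P(heartbeat still in flight)
        return -math.log10(max(p_later, 1e-300))        # avoid log(0)

# Typical use: declare a peer FAILED once phi crosses a threshold (e.g. 8).
```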
Interview Framework
- Partitioning: consistent hashing with virtual nodes
- Replication: N=3, configurable R and W quorums
- Storage: LSM-Tree for write-heavy, B-Tree for read-heavy
- Consistency: CAP trade-off, vector clocks for conflict detection
- Failure detection: gossip + phi-accrual