What Is a Distributed Key-Value Store?
A key-value store maps arbitrary keys to values with O(1) average operations. Distributed key-value stores (Redis Cluster, DynamoDB, Cassandra, etcd) partition data across nodes for horizontal scalability and replicate for fault tolerance. Core trade-offs involve the CAP theorem: choose between consistency and availability when network partitions occur.
Data Partitioning
Consistent Hashing
Map both nodes and keys onto a ring of hash values (0 to 2^64 − 1). A key is assigned to the first node clockwise from its hash position. Virtual nodes (vnodes): each physical node gets 150–200 virtual positions on the ring, distributing load more evenly and reducing the impact of adding/removing nodes. Adding a node: only keys between the new node and its predecessor migrate — O(K/N) keys move, not O(K). Without consistent hashing: with modulo placement (key % N), adding one node forces nearly all K keys to rehash — an O(K) migration.
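The ring described above fits in a few lines of Python. A minimal sketch — the class name, SHA-1-derived 64-bit positions, and the vnode count are all illustrative choices, not any particular system's API:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, vnodes_per_node=150):
        self.vnodes_per_node = vnodes_per_node
        self.ring = []  # sorted list of (hash, physical_node)

    @staticmethod
    def _hash(s: str) -> int:
        # 64-bit position on the ring, derived from SHA-1
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        # Each physical node occupies many ring positions (vnodes),
        # so its share of the keyspace evens out statistically.
        for i in range(self.vnodes_per_node):
            bisect.insort(self.ring, (self._hash(f"{node}#vn{i}"), node))

    def get_node(self, key: str) -> str:
        # First node clockwise from the key's hash, wrapping past 2^64 - 1
        i = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[i][1]
```

Adding a fourth node to a three-node ring remaps only the keys that land on the new node's arcs; the rest stay put, which is exactly the O(K/N) migration property.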
Replication
Replicate each key to its home node plus the next N−1 nodes clockwise on the ring (N=3 is typical). Replication factor = 3 means data survives 2 node failures. Write quorum W: how many replicas must acknowledge before a write is considered successful. Read quorum R: how many replicas must respond before a read is returned. Strong consistency: R + W > N (e.g., R=2, W=2, N=3). Eventual consistency: W=1, R=1 — fast but stale reads possible. Tunable consistency: the caller chooses per request (Cassandra's ConsistencyLevel, DynamoDB's ConsistentRead flag).
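The quorum overlap rule is simple arithmetic: if R + W > N, every read quorum intersects every write quorum in at least one replica, so some replica in the read set holds the latest write. A sketch (function name is illustrative):

```python
def quorum_is_strong(n: int, r: int, w: int) -> bool:
    """True when read and write quorums must overlap in at least
    one replica, guaranteeing a read sees the latest committed write."""
    return r + w > n

# Typical N=3 configurations:
# R=2, W=2 -> strong (overlap guaranteed)
# R=1, W=1 -> eventual (fast, but a read may hit a stale replica)
```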
Storage Engine
LSM-Tree (Log-Structured Merge-Tree): the standard for write-heavy KV stores (LevelDB, RocksDB, Cassandra).
- Writes go to WAL (crash recovery) + in-memory MemTable (sorted)
- MemTable flushed to SSTable (immutable sorted file) when full
- SSTables compacted in background — merge-sort, remove tombstones
- Read: check MemTable, then L0 SSTables, then L1, L2… (bloom filters accelerate misses)
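The write and read paths above can be sketched in Python. A deliberately tiny model — no WAL, bloom filters, or compaction, and the class name and flush threshold are made up for illustration:

```python
import bisect

class TinyLSM:
    """Minimal LSM sketch: a sorted in-memory memtable flushed to
    immutable sorted 'SSTables'; reads check newest data first.
    (Omits WAL, bloom filters, and compaction.)"""

    TOMBSTONE = object()  # deletes are writes of a marker value

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # newest last; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, self.TOMBSTONE)

    def _flush(self):
        # Sequential write of an immutable sorted run
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            v = self.memtable[key]
            return None if v is self.TOMBSTONE else v
        for table in reversed(self.sstables):  # newest SSTable first
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                v = table[i][1]
                return None if v is self.TOMBSTONE else v
        return None
```

Note how a delete never touches the SSTables: it writes a tombstone that shadows older versions until compaction (not modeled here) physically removes them.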
B-Tree (InnoDB, PostgreSQL): better for read-heavy workloads. In-place updates give a single lookup path and low read amplification, but random-I/O page writes mean lower write throughput than an LSM-Tree.
Conflict Resolution
Concurrent writes to the same key on different replicas create conflicts. Strategies:
- Last-Write-Wins (LWW): timestamp determines winner. Simple but loses concurrent writes (clock skew risks)
- Vector clocks: each replica increments its own counter on write. Conflict detected when clocks are incomparable. Client resolves (as in Amazon Dynamo's shopping-cart merge)
- CRDTs: data type designed to merge automatically (G-Counter, 2P-Set)
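The vector-clock comparison that detects conflicts is a pairwise check. A sketch, assuming clocks are plain dicts of replica-id to counter (function name and return values are illustrative):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (replica_id -> counter).
    Returns 'a<b', 'a>b', 'equal', or 'concurrent'."""
    keys = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(k, 0) <= vc_b.get(k, 0) for k in keys)
    b_le_a = all(vc_b.get(k, 0) <= vc_a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<b"     # a happened before b: b supersedes a
    if b_le_a:
        return "a>b"     # b happened before a: a supersedes b
    return "concurrent"  # incomparable: true conflict, client must resolve
```

Only the "concurrent" case is a real conflict; in the ordered cases the later version simply replaces the earlier one.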
Failure Detection
Gossip protocol: each node periodically sends its local view of which nodes are up/down to a random neighbor. Rumors spread exponentially — after O(log N) rounds, all nodes know about a failure. Suspicion mechanism: mark nodes as SUSPECT before FAILED to handle slow nodes. Phi-accrual failure detector: instead of binary up/down, outputs a probability of failure that increases with time since last heartbeat. Used by Cassandra.
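The phi-accrual idea can be shown with a simplified model: assume heartbeat inter-arrival times are exponentially distributed with the observed mean (an approximation; the original paper fits a normal distribution). Phi is then the negative log10 of the probability that a heartbeat would arrive this late from a live node:

```python
import math

def phi(time_since_last_heartbeat: float, mean_interval: float) -> float:
    """Simplified phi-accrual suspicion level, assuming exponentially
    distributed heartbeat intervals with the given mean."""
    p_later = math.exp(-time_since_last_heartbeat / mean_interval)
    return -math.log10(p_later)

# Phi grows continuously with silence: phi = 1 means roughly a 10%
# chance the node is still alive, phi = 8 means roughly 1e-8.
# The operator picks a phi threshold instead of a fixed timeout.
```

In practice the mean interval is tracked from a sliding window of recent heartbeats, so the detector adapts to changing network conditions rather than using a hard-coded timeout.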
Interview Framework
- Partitioning: consistent hashing with virtual nodes
- Replication: N=3, configurable R and W quorums
- Storage: LSM-Tree for write-heavy, B-Tree for read-heavy
- Consistency: CAP trade-off, vector clocks for conflict detection
- Failure detection: gossip + phi-accrual
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "How does consistent hashing prevent hotspots in a distributed KV store?",
"acceptedAnswer": { "@type": "Answer", "text": "In simple modulo hashing (key % N), adding or removing a node requires rehashing ~K*(N-1)/N keys — expensive. Consistent hashing maps both nodes and keys to a hash ring (0 to 2^64). A key is stored on the first node clockwise from its hash. Adding a node: only keys between the new node and its predecessor move. Removing a node: only that node's keys move to its successor. ~K/N keys move in either case. Without virtual nodes: nodes are unevenly placed on the ring, causing load imbalance. A node with a large arc serves more keys. Virtual nodes: each physical node gets 150–200 random positions on the ring. Load is distributed proportionally to vnode count. Hot key problem (one key gets disproportionate traffic): consistent hashing alone doesn't help — hot keys need application-level sharding (e.g., append random suffix to key for cache sharding) or client-side request coalescing." }
},
{
"@type": "Question",
"name": "How does an LSM-Tree work and why is it better than B-Tree for write-heavy workloads?",
"acceptedAnswer": { "@type": "Answer", "text": "B-Tree updates in place: every write requires finding the right page and modifying it on disk. Random I/O. Write amplification: one logical write may cause multiple disk writes (page splits, parent updates). LSM-Tree (Log-Structured Merge-Tree): all writes are sequential. Write goes to: (1) WAL (append-only on disk, crash recovery), (2) MemTable (in-memory sorted structure, AVL tree or skip list). When MemTable reaches a threshold (e.g., 64MB): flush to disk as an SSTable (Sorted String Table) — sequential write, fast. SSTables are immutable. Background compaction merges SSTables, removes deleted keys (tombstones). Read: check MemTable first, then SSTables from newest to oldest (bloom filter per SSTable to skip misses). Write amplification in LSM: data is written multiple times during compaction — but total writes are sequential (fast). B-Tree reads are faster (one lookup path). LSM reads are slower (may check multiple SSTables). Trade-off: LSM wins on write throughput; B-Tree wins on read latency. That's why LSM is used by RocksDB, Cassandra, LevelDB; B-Tree by PostgreSQL, MySQL (InnoDB)." }
},
{
"@type": "Question",
"name": "What is the difference between eventual consistency and strong consistency in distributed KV stores?",
"acceptedAnswer": { "@type": "Answer", "text": "Strong consistency (linearizability): every read returns the most recently written value, as if all operations executed on a single machine. Requires all reads to go through a quorum or a single leader. Latency: higher (cross-datacenter quorum adds RTT). Systems: etcd, ZooKeeper, Google Spanner, DynamoDB with ConsistentRead. Eventual consistency: reads may return stale data, but all replicas converge to the same value given enough time and no new writes. No quorum required for reads. Latency: lower (any replica responds). Systems: Cassandra with ONE consistency level, DynamoDB without ConsistentRead, Redis replication. When to use strong consistency: financial transactions (current balance must be exact), leader election (exactly one leader must be seen). When to use eventual consistency: social media likes/views (approximately correct is fine), user session data (slightly stale login info is acceptable), shopping cart (merge conflicts client-side). The CAP theorem states you cannot have both strong consistency and availability during network partitions — choose one. Most real systems (DynamoDB, Cassandra) offer tunable consistency per request." }
}
]
}