What Is a Distributed Key-Value Store?
A key-value store maps opaque keys to values and supports get, put, and delete, typically in O(1) average time for hash-based lookups. Distributed key-value stores (Redis Cluster, DynamoDB, Cassandra, etcd) partition data across nodes for horizontal scalability and replicate it for fault tolerance. The core trade-off is the CAP theorem: when a network partition occurs, a system must choose between consistency and availability.
Data Partitioning
Consistent Hashing
Map both nodes and keys onto a ring of hash values (0 to 2^64 − 1). A key is assigned to the first node encountered moving clockwise from the key's hash position. Virtual nodes (vnodes): each physical node takes on the order of 100–200 positions on the ring, which spreads load more evenly and softens the impact of adding or removing a node. Adding a node: only the keys between the new node and its predecessor migrate, so roughly K/N keys move rather than all K. Without consistent hashing (e.g., a modulo-based scheme), adding one node would rehash nearly all K keys across N+1 nodes, an O(K) migration.
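A minimal ring sketch in Python to make this concrete; the class name, the MD5-based hash, and the 150-vnodes-per-node default are illustrative assumptions, not any particular store's implementation:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, vnodes_per_node=150):
        self.vnodes = vnodes_per_node
        self.hashes = []   # sorted vnode positions on the ring
        self.ring = []     # parallel list of (hash, physical node)

    def _hash(self, key: str) -> int:
        # 64-bit position on the ring (MD5 here is an arbitrary choice)
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def add_node(self, node: str) -> None:
        for i in range(self.vnodes):
            h = self._hash(f"{node}#vnode{i}")
            idx = bisect.bisect(self.hashes, h)
            self.hashes.insert(idx, h)
            self.ring.insert(idx, (h, node))

    def remove_node(self, node: str) -> None:
        kept = [(h, n) for h, n in self.ring if n != node]
        self.ring = kept
        self.hashes = [h for h, _ in kept]

    def get_node(self, key: str) -> str:
        # First vnode clockwise from the key's hash, wrapping past the end
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing()
for n in ("node-a", "node-b", "node-c"):
    ring.add_node(n)
print(ring.get_node("user:42"))   # the key's owning node
```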
Replication
Replicate each key to the next N nodes clockwise on the ring (N = 3 is typical); a replication factor of 3 lets data survive two node failures. Write quorum W: how many replicas must acknowledge a write before it is considered successful. Read quorum R: how many replicas must respond before the read is returned. Strong consistency requires R + W > N (e.g., R = 2, W = 2, N = 3), so every read quorum overlaps every write quorum. Eventual consistency: W = 1, R = 1 is fast, but stale reads are possible. Tunable consistency: Cassandra lets the caller set a ConsistencyLevel per query, and DynamoDB exposes a per-request flag for strongly consistent reads.
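A toy quorum coordinator in Python to make the R/W arithmetic concrete; the Replica class and its put/get methods are stand-ins for real network calls, and versions stand in for timestamps or vector clocks:

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    version: int                     # stand-in for a timestamp or vector clock

class Replica:
    """Toy in-memory replica; a real one lives on another node."""
    def __init__(self):
        self.store = {}
    def put(self, key, versioned):
        self.store[key] = versioned
        return True                  # acknowledge the write
    def get(self, key):
        return self.store.get(key)

def quorum_write(replicas, key, value, version, W):
    """Succeed once at least W replicas acknowledge the write."""
    acks = sum(1 for r in replicas if r.put(key, Versioned(value, version)))
    return acks >= W

def quorum_read(replicas, key, R):
    """Require at least R responses and return the freshest version seen."""
    responses = [v for r in replicas if (v := r.get(key)) is not None]
    if len(responses) < R:
        raise RuntimeError("read quorum not reached")
    return max(responses, key=lambda v: v.version)

N = 3
replicas = [Replica() for _ in range(N)]
quorum_write(replicas, "user:42", "alice", version=1, W=2)
# R=2, W=2, N=3 satisfies R + W > N, so the read quorum overlaps the write quorum
print(quorum_read(replicas, "user:42", R=2).value)
```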
Storage Engine
LSM-Tree (Log-Structured Merge-Tree): the standard for write-heavy KV stores (LevelDB, RocksDB, Cassandra).
- Writes go to a WAL (for crash recovery) and to an in-memory MemTable (kept sorted)
- When the MemTable fills, it is flushed to an SSTable (an immutable sorted file)
- SSTables are compacted in the background: merge-sorted together, with tombstones and overwritten versions dropped
- Reads check the MemTable, then L0 SSTables, then L1, L2, … (Bloom filters let reads skip SSTables that cannot contain the key)
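A toy sketch of that write/read path in Python; the WAL, Bloom filters, and a real leveled compaction policy are deliberately omitted, so this is only the shape of the idea:

```python
import bisect

class TinyLSM:
    """Toy LSM engine: a MemTable flushed to immutable sorted 'SSTables'."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}    # recent writes (a real store also appends to a WAL)
        self.sstables = []    # newest-first list of sorted (key, value) lists
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self._flush()

    def delete(self, key):
        self.put(key, None)   # tombstone; dropped later during compaction

    def get(self, key):
        if key in self.memtable:              # 1. MemTable
            return self.memtable[key]
        for table in self.sstables:           # 2. SSTables, newest first
            keys = [k for k, _ in table]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return table[i][1]
        return None

    def _flush(self):
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        merged = {}
        for table in reversed(self.sstables):     # oldest first, so newer wins
            merged.update(dict(table))
        live = [(k, v) for k, v in sorted(merged.items()) if v is not None]
        self.sstables = [live]
```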
B-Tree (InnoDB, PostgreSQL): better for read-heavy workloads. In-place updates keep read amplification low, but random-I/O page writes make write throughput lower than an LSM-Tree's sequential appends.
Conflict Resolution
Concurrent writes to the same key on different replicas create conflicts. Strategies:
- Last-Write-Wins (LWW): the highest timestamp wins. Simple, but concurrent writes are silently dropped and clock skew can pick the wrong winner
- Vector clocks: each replica increments its own counter on write; a conflict is detected when two clocks are incomparable, and the client resolves it (the approach Amazon's Dynamo used for its shopping cart). See the sketch after this list
- CRDTs: data types designed to merge automatically (G-Counter, 2P-Set)
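A minimal vector-clock comparison in Python; the dict-based clocks and replica names are illustrative, and real systems attach a clock like this to every stored version of a key:

```python
def vc_increment(clock, node):
    """Return a copy of the vector clock with `node`'s counter bumped."""
    bumped = dict(clock)
    bumped[node] = bumped.get(node, 0) + 1
    return bumped

def vc_compare(a, b):
    """'<' if a happened before b, '>' if after, '=' if equal, 'conflict' if concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "="
    if a_le_b:
        return "<"
    if b_le_a:
        return ">"
    return "conflict"   # incomparable counters: concurrent writes, client must merge

# Two replicas accept writes for the same key independently:
v1 = vc_increment({}, "replica-a")   # {'replica-a': 1}
v2 = vc_increment({}, "replica-b")   # {'replica-b': 1}
print(vc_compare(v1, v2))            # 'conflict' -> surfaced to the client to resolve
```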
Failure Detection
Gossip protocol: each node periodically sends its local view of which nodes are up or down to a random peer. Rumors spread exponentially, so after O(log N) rounds every node has heard about a failure. Suspicion mechanism: mark nodes SUSPECT before FAILED so slow nodes are not declared dead prematurely. Phi-accrual failure detector: instead of a binary up/down verdict, it outputs a suspicion level (phi) that grows with the time since the last heartbeat, based on the observed distribution of heartbeat intervals. Used by Cassandra.
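A simplified phi-accrual sketch in Python; modeling heartbeat gaps as exponentially distributed is an assumption made here for brevity (the original paper, and Cassandra, fit the observed inter-arrival distribution more carefully):

```python
import math
import time

class PhiAccrualDetector:
    """Simplified phi-accrual detector for one peer. Assumes heartbeat gaps are
    roughly exponentially distributed; the original paper fits a fuller model."""

    def __init__(self, window=100):
        self.intervals = []          # recent gaps between heartbeats, in seconds
        self.window = window
        self.last = None

    def heartbeat(self, now=None):
        now = time.time() if now is None else now
        if self.last is not None:
            self.intervals = (self.intervals + [now - self.last])[-self.window:]
        self.last = now

    def phi(self, now=None):
        """Suspicion level: phi = 1 means roughly a 10% chance the heartbeat is
        merely late; phi = 3 means roughly 0.1%."""
        now = time.time() if now is None else now
        if self.last is None or not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        p_later = math.exp(-(now - self.last) / mean)   # P(heartbeat still in flight)
        return -math.log10(max(p_later, 1e-300))        # avoid log(0)

# Typical use: declare a peer FAILED once phi crosses a threshold (e.g. 8).
```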
Interview Framework
- Partitioning: consistent hashing with virtual nodes
- Replication: N=3, configurable R and W quorums
- Storage: LSM-Tree for write-heavy, B-Tree for read-heavy
- Consistency: CAP trade-off, vector clocks for conflict detection
- Failure detection: gossip + phi-accrual