System Design Interview: Design a Distributed Key-Value Store (DynamoDB/Cassandra)

A distributed key-value store is one of the foundational system design questions — it tests your understanding of distributed systems fundamentals: consistency, partitioning, replication, and fault tolerance. Systems like DynamoDB, Cassandra, and Redis Cluster solve this problem. Here is the architecture and the trade-offs you need to articulate in an interview.

Requirements

Functional: get(key), put(key, value), delete(key). Keys up to 256 bytes, values up to 10 MB.

Non-functional: low latency (<10ms p99), high availability (99.99%), horizontal scalability to petabytes of data, eventual consistency acceptable (tunable).

Data Partitioning: Consistent Hashing

Distribute keys across nodes without rehashing everything when nodes join or leave.

Virtual nodes (vnodes): each physical node gets 150-300 vnodes on the hash ring. Benefits: (1) even distribution even with heterogeneous hardware, (2) when a node fails, its vnodes scatter load across many nodes rather than one neighbor absorbing everything.

hash_ring: sorted array of (hash_value, node_id)
lookup(key): hash(key) → find first node clockwise on ring
add_node(new_node): take ~150 vnodes from existing nodes, redistribute only those keys
remove_node(dead_node): its vnodes transfer to the next clockwise node
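The ring operations above can be sketched in Python. This is a minimal illustration, not a production implementation: the hash function, vnode count, and `HashRing` class name are all choices made for the example.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Map a string to a 64-bit point on the ring."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, vnodes_per_node: int = 150):
        self.vnodes = vnodes_per_node
        self.ring = []  # sorted list of (hash_value, node_id)

    def add_node(self, node_id: str):
        # Each vnode gets its own position by hashing "node#i"
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node_id}#{i}"), node_id))

    def remove_node(self, node_id: str):
        # Dropping a node's vnodes implicitly hands its keys to the
        # next clockwise vnodes, which belong to many different nodes
        self.ring = [(h, n) for h, n in self.ring if n != node_id]

    def lookup(self, key: str) -> str:
        # First vnode clockwise from hash(key); wrap around at the end
        idx = bisect.bisect(self.ring, (_hash(key),))
        return self.ring[idx % len(self.ring)][1]
```

Note how `lookup` is O(log V) via binary search, and `remove_node` never touches keys owned by surviving nodes — the property that motivates consistent hashing in the first place.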

Replication

Replicate each key to N nodes (typically N=3) for fault tolerance. With consistent hashing: the primary node plus the next N-1 clockwise nodes in the ring are the replica set for a key.

Quorum reads and writes: with N=3 replicas, W write quorum, R read quorum:

  • Strong consistency: W + R > N (e.g., W=2, R=2)
  • High availability, lowest latency: W=1, R=1 — eventual consistency (the original Dynamo paper's common configuration was N=3, W=2, R=2)
  • Write-heavy optimization: W=1, R=3 — fast writes, strong reads

The coordinator node (receives the client request) sends the operation to all N replicas in parallel and returns after W (or R) acknowledge.
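The coordinator's "return after W acks" behavior can be sketched with a thread pool. The `replicas` here are plain callables standing in for RPC stubs — an assumption made to keep the example self-contained.

```python
import concurrent.futures as cf

def quorum_write(replicas, key, value, w):
    """Send the write to all replicas in parallel; succeed once W ack.

    `replicas` is a list of callables replica(key, value) -> bool,
    hypothetical stand-ins for network calls to replica nodes."""
    acks = 0
    with cf.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r, key, value) for r in replicas]
        for f in cf.as_completed(futures):
            try:
                if f.result():
                    acks += 1
            except Exception:
                pass  # a failed replica simply doesn't count toward the quorum
            if acks >= w:
                return True  # quorum met; this sketch still waits for stragglers on exit
    return False  # fewer than W replicas acknowledged
```

A quorum read works the same way with R in place of W, returning the versioned values from the first R responders.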

Storage Engine

On each node, use an LSM-tree (Log-Structured Merge-tree):

  • Write path: write to WAL (write-ahead log) for durability, then to an in-memory MemTable (sorted by key, ~64MB)
  • Flush: when MemTable fills, flush to an immutable SSTable (Sorted String Table) on disk
  • Compaction: background process merges SSTables, removes deleted/overwritten keys, maintains sorted order
  • Read path: check MemTable → check Bloom filter for each SSTable → read from disk if needed

Bloom filters eliminate unnecessary disk reads: if the filter says the key is not in an SSTable, skip it (false positive rate ~1%). This makes LSM reads nearly as fast as B-tree reads despite the multi-layer structure.
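A Bloom filter is compact to sketch. This toy version (parameters `m_bits` and `k_hashes` chosen for illustration) carves its k probe positions out of one SHA-256 digest; real storage engines tune m and k to hit the target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array."""

    def __init__(self, m_bits: int = 1 << 16, k_hashes: int = 4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _probes(self, key: str):
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.k):
            # carve 4 bytes of the digest per probe position
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, key: str):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means the key is definitely absent: skip this SSTable.
        # True may be a false positive, so the SSTable must still be read.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(key))
```

The read path keeps one filter per SSTable in memory; a negative answer saves a disk seek, which is why the "no false negatives" guarantee is the property that matters.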

Conflict Resolution: Vector Clocks

With eventual consistency, a key can be written to multiple replicas concurrently before they sync. Detecting conflicts requires vector clocks:

vector_clock: {node_id: sequence_number}
# Write on node A: increment A's counter
# Example: after 3 writes across nodes A,B,C:
value1: {A:2, B:1}  # written mostly on A
value2: {A:1, B:2}  # written mostly on B
# These are concurrent (neither dominates) → conflict
# Resolution: last-write-wins (timestamp), or return both to client for application-level merge
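The dominance check in the pseudocode above reduces to a pairwise comparison of counters. A sketch (the function name and return labels are this example's own):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks {node_id: sequence_number}.

    Returns 'a_dominates', 'b_dominates', 'equal', or 'concurrent'.
    A missing node_id counts as sequence number 0."""
    nodes = set(vc_a) | set(vc_b)
    a_ge = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    b_ge = all(vc_b.get(n, 0) >= vc_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_dominates"
    if b_ge:
        return "b_dominates"
    return "concurrent"  # neither dominates → true conflict, must resolve
```

Only the "concurrent" case forces a resolution step (last-write-wins or an application-level merge); a dominated value can be discarded safely.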

The original Dynamo paper used vector clocks; DynamoDB simplified this to “last writer wins” by default, with optional conditional updates for optimistic locking.

Handling Failures

Hinted Handoff

When a replica node is temporarily down, the coordinator writes to a healthy node as a “hint.” When the failed node recovers, the hint is replayed to bring it up to date. This maintains write availability (sloppy quorum) during node failures.
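The hint buffer itself is little more than a per-coordinator queue keyed by the down node. A sketch, with `HintStore` and its method names invented for the example:

```python
from collections import defaultdict

class HintStore:
    """Coordinator-side buffer of writes destined for a down replica."""

    def __init__(self):
        self.hints = defaultdict(list)  # dead_node -> [(key, value), ...]

    def record(self, dead_node: str, key: str, value: str):
        # Write accepted by a healthy node on behalf of dead_node
        self.hints[dead_node].append((key, value))

    def replay(self, recovered_node: str, send):
        """Drain hints to the recovered node via send(key, value)."""
        for key, value in self.hints.pop(recovered_node, []):
            send(key, value)
```

In a real system the hints are persisted (a coordinator crash must not lose them) and expire after a window, beyond which anti-entropy takes over.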

Anti-Entropy with Merkle Trees

To detect data divergence between replicas without comparing every key:

  1. Each node builds a Merkle tree over its key space (leaf = hash of key-value, internal nodes = hash of children)
  2. Nodes exchange tree roots periodically; if roots differ, traverse to find divergent subtrees
  3. Sync only the divergent key ranges — typically O(differences) not O(total keys)
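Step 2's root comparison is the cheap part, and it can be sketched compactly. This example computes only the root over sorted leaves (a real implementation keeps the whole tree so divergent subtrees can be walked, per step 2):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(kv_pairs: dict) -> bytes:
    """Root hash over sorted (key, value) leaves.

    Identical replica data yields identical roots; any differing
    key or value changes the root."""
    leaves = [_h(f"{k}={v}".encode()) for k, v in sorted(kv_pairs.items())]
    if not leaves:
        return _h(b"")
    while len(leaves) > 1:
        if len(leaves) % 2:              # odd level: duplicate the last leaf
            leaves.append(leaves[-1])
        leaves = [_h(leaves[i] + leaves[i + 1])
                  for i in range(0, len(leaves), 2)]
    return leaves[0]
```

Two replicas exchange 32 bytes to learn whether they agree at all; only on a mismatch do they descend into child hashes to localize the divergent key range.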

API Design

PUT  /keys/{key}         body: {value, ttl_seconds (optional)}  → 200 OK
GET  /keys/{key}         → 200 {value, version} or 404
DELETE /keys/{key}       → 200 OK (tombstone written, not immediate deletion)

Client SDK: consistent_put(key, value, quorum=W)
            consistent_get(key, quorum=R, read_repair=True)

Read repair: when a GET reads from R replicas and finds divergent values, it asynchronously updates stale replicas with the latest value. This passively heals inconsistency without a separate background job.
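Read repair can be sketched with dicts standing in for replicas and an integer version standing in for a timestamp or vector clock — both simplifying assumptions of this example:

```python
def read_with_repair(replicas, key):
    """Read from all replicas; return the newest value and repair stale copies.

    Each replica is a dict mapping key -> (value, version);
    higher version wins (a stand-in for LWW timestamps)."""
    results = [(r, r.get(key)) for r in replicas]
    hits = [vv for _, vv in results if vv is not None]
    if not hits:
        return None
    newest = max(hits, key=lambda vv: vv[1])   # highest version wins
    for r, vv in results:
        if vv != newest:
            r[key] = newest                    # repair stale or missing replica
    return newest[0]
```

In practice the repair writes are fired asynchronously after the response is returned to the client, so the read's latency is unaffected.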

Comparison: DynamoDB vs. Redis Cluster vs. Cassandra

Feature          DynamoDB             Redis Cluster            Cassandra
Consistency      Eventual (tunable)   Eventual                 Eventual (tunable)
Storage engine   B-tree + LSM         Hash table (in-memory)   LSM-tree
Max value size   400 KB               512 MB                   2 GB
Latency          <10ms                <1ms                     <10ms
Use case         General purpose      Caching, sessions        Time-series, write-heavy

Interview Tips

  • Open with CAP theorem: you're designing an AP system (availability + partition tolerance), accepting eventual consistency
  • Explain consistent hashing and why virtual nodes matter — this is a signal of depth
  • Walk through the write path: client → coordinator → W replicas → WAL → MemTable → SSTable
  • Mention Bloom filters when discussing reads — shows you know the storage engine internals
  • For conflict resolution: “last-write-wins is simple; vector clocks are needed when you can't afford data loss”
