Design a distributed key-value store from scratch — the internals of DynamoDB, etcd, and Redis Cluster. Raft consensus, consistent hashing, read/write quorum.
Each node maintains a membership list. Every second, it gossips its list to K random peers. A node is marked failed when its heartbeat TTL expires. Eventual consistency — all nodes converge on the same membership within seconds.
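A minimal sketch of that gossip round (class and parameter names are illustrative, not from any real implementation):

```python
import random
import time

class GossipNode:
    def __init__(self, node_id, ttl_s=3.0, fanout_k=3):
        self.node_id = node_id
        self.ttl_s = ttl_s        # heartbeat TTL before a peer is considered dead
        self.fanout_k = fanout_k  # K random peers contacted per round
        # membership: node_id -> freshest heartbeat timestamp we have heard
        self.membership = {node_id: time.time()}

    def gossip_round(self, peers):
        """Push our membership list to K random peers."""
        self.membership[self.node_id] = time.time()  # refresh own heartbeat
        for peer in random.sample(peers, min(self.fanout_k, len(peers))):
            peer.merge(self.membership)

    def merge(self, remote):
        """Keep the freshest heartbeat seen for each node."""
        for node_id, heartbeat in remote.items():
            if heartbeat > self.membership.get(node_id, 0.0):
                self.membership[node_id] = heartbeat

    def alive(self):
        now = time.time()
        return {n for n, hb in self.membership.items() if now - hb < self.ttl_s}
```

After a round or two, every node's list contains every other node — the convergence claimed above.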
Virtual nodes (vnodes) on a 2^32 ring. Each physical node owns multiple vnodes (150 by default). The client hashes the key → finds the responsible node. Only ~K/N keys (K keys, N nodes) are remapped when a node is added or removed.
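The ring can be sketched with Python's `bisect` module (the vnode count and MD5 hash are assumptions for illustration, not a reference implementation):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (ring_position, physical_node)

    def _pos(self, key: str) -> int:
        # Hash onto the 2^32 ring
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def add_node(self, node: str):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._pos(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def lookup(self, key: str) -> str:
        # First vnode clockwise from the key's position owns the key
        i = bisect.bisect(self.ring, (self._pos(key), ""))
        return self.ring[i % len(self.ring)][1]
```

Removing a node only reassigns keys that pointed at that node's vnodes; every other key keeps its owner, which is the ~K/N remapping property.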
Each shard has a 3-node Raft group. Writes go to the leader and are replicated to a quorum before the ACK. Followers redirect clients to the leader. Leader lease: prevents a deposed leader from serving stale reads.
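This is not Raft itself — just the quorum-ack rule from this step, as a sketch (the callable-follower shape is hypothetical):

```python
def replicate_write(entry, followers, n=3):
    """Leader counts its own durable append, then needs a majority of n acks."""
    acks = 1  # the leader's own append
    for follower in followers:
        if follower(entry):  # returns True if the follower persisted the entry
            acks += 1
    quorum = n // 2 + 1      # majority: 2 of 3, 3 of 5, ...
    return acks >= quorum    # ACK the client only once a majority has the entry
```

With n=3 the leader can ACK with one follower down, which answers why a single partitioned node does not block writes.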
Simulate PUT/GET/DELETE with quorum replication. Kill nodes to trigger leader election. Watch consistency in action.
Check MemTable → check Level-0 SSTables (newest first) → check higher levels. Bloom filter says "definitely not here" → skip SSTable. Block cache (LRU) for frequently accessed blocks.
Worst case: one SSTable probe per level — O(log N) levels under leveled compaction. Compaction merges overlapping key ranges and discards tombstones.
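The read path above, condensed into a sketch (a plain `set` stands in for the per-SSTable Bloom filter; names are illustrative):

```python
def lsm_get(key, memtable, sstables):
    """memtable: dict; sstables: newest-first list of (bloom_keys, data) pairs.

    An empty value b"" plays the tombstone, matching the MemTable code below.
    """
    if key in memtable:
        value = memtable[key]
        return None if value == b"" else value
    for bloom_keys, data in sstables:
        if key not in bloom_keys:  # "definitely not here" -> skip this SSTable
            continue
        if key in data:            # newest-first: first hit wins
            value = data[key]
            return None if value == b"" else value
    return None
```

Because the list is ordered newest-first, an overwritten key in an older SSTable is shadowed until compaction merges the two.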
```python
import hashlib
import math
import os
import struct
import time

import sortedcontainers

class MemTable:
    """In-memory sorted map. Flush to an SSTable when full (>64 MB)."""

    def __init__(self, size_limit=64 * 1024 * 1024):  # 64 MB
        self.data = sortedcontainers.SortedDict()
        self.wal = open("wal.log", "ab")
        self.size = 0
        self.size_limit = size_limit

    def put(self, key: str, value: bytes) -> None:
        # Write-ahead log first (durability)
        record = (struct.pack(">H", len(key)) + key.encode()
                  + struct.pack(">I", len(value)) + value)
        self.wal.write(record)
        self.wal.flush()
        os.fsync(self.wal.fileno())  # flush() alone stops at the OS buffer cache
        # Then update in-memory
        self.data[key] = (value, time.time_ns())  # (value, timestamp)
        self.size += len(key) + len(value)

    def get(self, key: str):
        return self.data.get(key)

    def delete(self, key: str) -> None:
        """Tombstone deletion — actual removal happens at compaction."""
        # Empty value = tombstone (a real store uses a marker flag so that
        # genuinely empty values remain representable)
        self.put(key, b"")

    def is_full(self) -> bool:
        return self.size >= self.size_limit

class BloomFilter:
    """Probabilistic membership test. False-positive rate ~1% at ~10 bits/element."""

    def __init__(self, capacity=1_000_000, error_rate=0.01):
        # Optimal sizing: m = -n * ln(p) / (ln 2)^2, k = (m/n) * ln 2
        self.n_bits = int(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k_hashes = max(1, round(self.n_bits / capacity * math.log(2)))
        self.bits = bytearray(self.n_bits // 8 + 1)

    def _hashes(self, key: str):
        # Double hashing: derive k bit indexes from two base hashes
        h1 = int(hashlib.md5(key.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.n_bits for i in range(self.k_hashes)]

    def add(self, key: str):
        for bit in self._hashes(key):
            self.bits[bit // 8] |= 1 << (bit % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[b // 8] & (1 << (b % 8)) for b in self._hashes(key))
```
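As a sanity check on the sizing formula in `BloomFilter.__init__`, a small standalone helper (hypothetical, not part of the store) reproduces the ~10 bits/element figure quoted above:

```python
import math

def bloom_params(capacity: int, error_rate: float):
    """Optimal bit count and hash count for a target false-positive rate."""
    n_bits = int(-capacity * math.log(error_rate) / math.log(2) ** 2)
    k = max(1, round(n_bits / capacity * math.log(2)))
    return n_bits, k

bits, k = bloom_params(1_000_000, 0.01)
print(bits / 1_000_000, k)  # ~9.6 bits per element, 7 hash functions
```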
| Decision | Choice | Rationale |
|---|---|---|
| Consistency model | Tunable (R+W>N for strong) | R=1,W=1 for speed; R=2,W=2 for safety |
| Storage engine | LSM tree | Sequential writes → high throughput; compaction manages reads |
| Replication | Raft (leader-based) | Strong consistency; linearizable reads from leader |
| Sharding | Consistent hash with vnodes | Minimal key redistribution on scale-out |
| Failure detection | Gossip + Phi accrual detector | Probabilistic — adapts to network conditions |
| Anti-entropy | Merkle trees | Efficiently identify diverged data between replicas |
| Client routing | Smart client (hash locally) | No proxy hop; client finds shard directly |
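The R+W>N rule from the first table row is plain arithmetic — a one-line sketch, not an API:

```python
def is_strongly_consistent(r: int, w: int, n: int) -> bool:
    """R + W > N guarantees read and write quorums overlap in >=1 replica,
    so every read sees at least one copy of the latest acknowledged write."""
    return r + w > n

# N=3 examples from the table:
print(is_strongly_consistent(2, 2, 3))  # True: overlap guaranteed
print(is_strongly_consistent(1, 1, 3))  # False: fast, but reads may be stale
```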
1. In a 3-node Raft cluster (N=3, quorum=2), a client writes with W=2. Node 3 is partitioned. Can the write succeed?
2. Why does an LSM tree use a Bloom filter per SSTable?
3. A client does GET with R=1 from a follower (not the leader). The follower is lagging 500ms behind. What could happen?
4. When a node joins an existing cluster, what mechanism does a distributed KV store use to transfer data to the new node?
5. DynamoDB uses Merkle trees for anti-entropy. What problem do they solve?
You've mastered distributed systems internals: consensus, transactions, clocks, fault tolerance, observability, and now a full KV store design. Week 4 covers real-world system design interviews.