Day 21 of 30 • Capstone Project ★

Distributed KV Store Capstone

Design a distributed key-value store from scratch: the internals behind DynamoDB, etcd, and Redis Cluster. Covers Raft consensus, consistent hashing, and read/write quorums.

Raft Consensus · Consistent Hashing · Quorum Reads/Writes · Anti-Entropy · Compaction

Requirements

🎯 Functional

  • PUT(key, value) — write a key-value pair
  • GET(key) → value — read by key
  • DELETE(key) — tombstone deletion
  • Automatic data replication (N=3)
  • Leader election on node failure
  • Partition tolerance across network splits

📊 Scale Targets

  • 1M reads/sec, 100K writes/sec
  • P99 read latency < 1ms (in-region)
  • P99 write latency < 10ms (quorum)
  • Petabytes of data across shards
  • Configurable consistency (eventual vs strong)
  • Automatic data compaction (LSM)

Architecture Design

📡 Gossip Protocol

Membership

Each node maintains a membership list and gossips it to K random peers once per second. A peer is marked failed when its heartbeat TTL expires. Membership is eventually consistent: all nodes converge on the same list within seconds.
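A minimal sketch of this membership protocol, assuming names of our own invention (`GossipNode`, `FAIL_TTL`); real detectors like Phi accrual are more sophisticated:

```python
import random
import time

FAIL_TTL = 3.0  # seconds without a heartbeat advance before a peer is suspected

class GossipNode:
    def __init__(self, name):
        self.name = name
        # membership: peer -> (heartbeat_counter, local_time_of_last_advance)
        self.membership = {name: (0, time.monotonic())}

    def tick(self):
        """Bump our own heartbeat once per gossip round."""
        hb, _ = self.membership[self.name]
        self.membership[self.name] = (hb + 1, time.monotonic())

    def gossip_to(self, peers, k=2):
        """Push our membership list to K random peers."""
        for peer in random.sample(peers, min(k, len(peers))):
            peer.merge(self.membership)

    def merge(self, remote):
        """Keep the entry with the higher heartbeat counter."""
        now = time.monotonic()
        for node, (hb, _) in remote.items():
            local = self.membership.get(node)
            if local is None or hb > local[0]:
                self.membership[node] = (hb, now)

    def alive(self):
        """Peers whose heartbeat advanced within the TTL (self is always alive)."""
        now = time.monotonic()
        return {n for n, (_, seen) in self.membership.items()
                if n == self.name or now - seen < FAIL_TTL}
```

A peer whose counter stops advancing keeps its old `seen` timestamp and eventually falls out of `alive()`, which is exactly the TTL-expiry detection described above.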

🔄 Consistent Hashing

Routing

Virtual nodes (vnodes) are placed on a 2^32 ring; each physical node owns multiple vnodes (150 by default). A client hashes the key to find the responsible node, and only ~K/N of the K keys are remapped when one of N nodes is added or removed.
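A sketch of such a ring; the 150-vnode default follows the text, but `HashRing` and the MD5-based placement are illustrative choices, not a specific library's API:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Map a string to a position on the 2^32 ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, physical_node)

    def add_node(self, node: str):
        # Each physical node gets `vnodes` positions, spreading load evenly.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_h(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's position.
        pos = _h(key)
        i = bisect.bisect(self.ring, (pos,)) % len(self.ring)
        return self.ring[i][1]
```

Removing a node moves only the keys whose clockwise successor vnode belonged to it; every other key keeps its owner, which is the ~K/N remapping property.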

🗳️ Raft per Shard

Consistency

Each shard is a 3-node Raft group. Writes go to the leader and are replicated to a quorum before the ACK; followers redirect clients to the leader. A leader lease prevents a stale, deposed leader from serving split-brain reads.
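A toy sketch of the quorum-ack rule only (real Raft also tracks terms, log indices, and elections, all omitted; `ShardLeader` is an illustrative name):

```python
class ShardLeader:
    """Leader of a Raft group; commits a write once a quorum has appended it."""
    def __init__(self, followers):
        self.followers = followers                # list of append callbacks
        group_size = len(followers) + 1           # followers + the leader itself
        self.quorum = group_size // 2 + 1         # majority, e.g. 2 of 3
        self.log = []                             # committed entries

    def put(self, key, value) -> bool:
        acks = 1                                  # leader appends locally first
        for append in self.followers:
            try:
                if append((key, value)):
                    acks += 1
            except ConnectionError:
                pass                              # partitioned follower: no ack
        if acks >= self.quorum:
            self.log.append((key, value))         # committed: safe to ACK client
            return True
        return False                              # quorum unreachable: reject
```

With one of three nodes partitioned, the remaining two still form a majority and the write succeeds; with the leader isolated from both followers, it cannot commit at all.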

Write Path — Step by Step

🗄️ Distributed KV Store Simulator (3-node Raft cluster)

Simulate PUT/GET/DELETE with quorum replication. Kill nodes to trigger leader election. Watch consistency in action.


Storage Engine: LSM Tree

📝 Write Path (LSM)

  • Append to WAL (sequential write, fast)
  • Insert into MemTable (Red-Black tree, sorted)
  • MemTable → SSTable on disk (immutable, sorted)
  • Background compaction merges SSTables
  • Bloom filter per SSTable (avoid disk reads)

🔍 Read Path (LSM)

Check the MemTable, then Level-0 SSTables (newest first), then higher levels. When a Bloom filter says "definitely not here," the SSTable is skipped entirely. An LRU block cache serves frequently accessed blocks.

Worst case: O(log N) SSTable checks. Compaction merges overlapping key ranges and discards tombstones.
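The lookup order above can be sketched with toy in-memory structures; a plain `set` stands in for the Bloom filter and a `dict` for each SSTable (a real engine reads blocks from disk):

```python
def lsm_get(key, memtable, sstables):
    """LSM read path sketch. `sstables` is a newest-first list of
    (bloom, table) pairs; the first hit wins, so newer data shadows older."""
    if key in memtable:                 # 1. freshest data lives in the MemTable
        return memtable[key]
    for bloom, table in sstables:       # 2. scan SSTables newest-first
        if key not in bloom:            #    "definitely not here" -> skip table
            continue
        if key in table:                #    possible hit -> actually look
            return table[key]
    return None                         # 3. not found at any level
```

Note that a Bloom hit only means "maybe here": the table lookup can still miss, which is the ~1% false-positive cost.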

import sortedcontainers, struct, hashlib, os, time

class MemTable:
    """In-memory sorted map. Flush to SSTable when full (>64MB)."""
    def __init__(self, size_limit=64*1024*1024):  # 64MB
        self.data = sortedcontainers.SortedDict()
        self.wal = open("wal.log", "ab")
        self.size = 0
        self.size_limit = size_limit

    def put(self, key: str, value: bytes) -> None:
        # Write-ahead log first (durability)
        record = struct.pack(">H", len(key)) + key.encode() + \
                 struct.pack(">I", len(value)) + value
        self.wal.write(record)
        self.wal.flush()                # drain the user-space buffer...
        os.fsync(self.wal.fileno())     # ...then force it to disk for durability
        # Then update in-memory
        self.data[key] = (value, time.time_ns())  # (value, timestamp)
        self.size += len(key) + len(value)

    def get(self, key: str):
        return self.data.get(key)

    def delete(self, key: str) -> None:
        """Tombstone deletion — actual removal happens at compaction."""
        self.put(key, b"")  # Empty value = tombstone; readers must treat it as deleted

    def is_full(self) -> bool:
        return self.size >= self.size_limit

class BloomFilter:
    """Probabilistic membership test. False positive rate ~1% at ~10 bits/element."""
    def __init__(self, capacity=1000000, error_rate=0.01):
        import math
        # Optimal sizing: bits m = -n*ln(p)/(ln 2)^2, hash count k = (m/n)*ln 2
        self.n_bits = int(-capacity * math.log(error_rate) / math.log(2)**2)
        self.k_hashes = int(self.n_bits / capacity * math.log(2))
        self.bits = bytearray(self.n_bits // 8 + 1)

    def _hashes(self, key: str):
        h1 = int(hashlib.md5(key.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.n_bits for i in range(self.k_hashes)]

    def add(self, key: str):
        for bit in self._hashes(key):
            self.bits[bit // 8] |= 1 << (bit % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[b//8] & (1 << (b%8)) for b in self._hashes(key))

Key Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Consistency model | Tunable (R+W>N for strong) | R=1,W=1 for speed; R=2,W=2 for safety |
| Storage engine | LSM tree | Sequential writes → high throughput; compaction manages reads |
| Replication | Raft (leader-based) | Strong consistency; linearizable reads from leader |
| Sharding | Consistent hash with vnodes | Minimal key redistribution on scale-out |
| Failure detection | Gossip + Phi accrual detector | Probabilistic; adapts to network conditions |
| Anti-entropy | Merkle trees | Efficiently identify diverged data between replicas |
| Client routing | Smart client (hash locally) | No proxy hop; client finds shard directly |
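The R+W>N rule behind tunable consistency can be checked exhaustively for small clusters (the function name is ours): strong consistency requires every read quorum to overlap every write quorum in at least one replica.

```python
import itertools

def quorum_read_is_fresh(n: int, r: int, w: int) -> bool:
    """True iff every R-replica read set intersects every W-replica write set,
    i.e. a read is guaranteed to touch at least one up-to-date replica.
    By the pigeonhole principle this holds exactly when R + W > N."""
    replicas = range(n)
    for wset in itertools.combinations(replicas, w):
        for rset in itertools.combinations(replicas, r):
            if not set(wset) & set(rset):
                return False  # found a read set that could miss the write
    return True
```

For N=3, R=2/W=2 always overlaps (strong), while R=1/W=1 can read a replica the write never touched, which is why that setting is fast but eventually consistent.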

Knowledge Check

1. In a 3-node Raft cluster (N=3, quorum=2), a client writes with W=2. Node 3 is partitioned. Can the write succeed?

2. Why does an LSM tree use a Bloom filter per SSTable?

3. A client does GET with R=1 from a follower (not the leader). The follower is lagging 500ms behind. What could happen?

4. When a node joins an existing cluster, what mechanism does a distributed KV store use to transfer data to the new node?

5. DynamoDB uses Merkle trees for anti-entropy. What problem do they solve?

Week 3 Complete! Days 15–21 Done.

You've mastered distributed systems internals: consensus, transactions, clocks, fault tolerance, observability, and now a full KV store design. Week 4 covers real-world system design interviews.