Day 21 of 30 • Capstone Project ★

Distributed KV Store Capstone

Design a distributed key-value store from scratch: the internals behind DynamoDB, etcd, and Redis Cluster. Covers Raft consensus, consistent hashing, and read/write quorums.

Raft Consensus · Consistent Hashing · Quorum Reads/Writes · Anti-Entropy · Compaction

Requirements

🎯 Functional

  • PUT(key, value) — write a key-value pair
  • GET(key) → value — read by key
  • DELETE(key) — tombstone deletion
  • Automatic data replication (N=3)
  • Leader election on node failure
  • Partition tolerance across network splits

📊 Scale Targets

  • 1M reads/sec, 100K writes/sec
  • P99 read latency < 1ms (in-region)
  • P99 write latency < 10ms (quorum)
  • Petabytes of data across shards
  • Configurable consistency (eventual vs strong)
  • Automatic data compaction (LSM)

Architecture Design

📡 Gossip Protocol

Membership

Each node maintains a membership list and gossips it to K random peers once per second. A peer is marked failed when its heartbeat TTL expires. Membership is eventually consistent: all nodes converge on the same list within seconds.
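A minimal sketch of this membership protocol, assuming names of our own invention (`GossipNode`, `FAIL_TTL`); real detectors like Phi accrual are more sophisticated:

```python
import random
import time

FAIL_TTL = 3.0  # seconds without a heartbeat advance before a peer is suspected

class GossipNode:
    def __init__(self, name):
        self.name = name
        # membership: peer -> (heartbeat_counter, local_time_of_last_advance)
        self.membership = {name: (0, time.monotonic())}

    def tick(self):
        """Bump our own heartbeat once per gossip round."""
        hb, _ = self.membership[self.name]
        self.membership[self.name] = (hb + 1, time.monotonic())

    def gossip_to(self, peers, k=2):
        """Push our membership list to K random peers."""
        for peer in random.sample(peers, min(k, len(peers))):
            peer.merge(self.membership)

    def merge(self, remote):
        """Keep the entry with the higher heartbeat counter."""
        now = time.monotonic()
        for node, (hb, _) in remote.items():
            local = self.membership.get(node)
            if local is None or hb > local[0]:
                self.membership[node] = (hb, now)

    def alive(self):
        """Peers whose heartbeat advanced within the TTL (self is always alive)."""
        now = time.monotonic()
        return {n for n, (_, seen) in self.membership.items()
                if n == self.name or now - seen < FAIL_TTL}
```

A peer whose counter stops advancing keeps its old `seen` timestamp and eventually falls out of `alive()`, which is exactly the TTL-expiry detection described above.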

🔄 Consistent Hashing

Routing

Virtual nodes (vnodes) are placed on a 2^32 ring; each physical node owns multiple vnodes (150 by default). A client hashes the key to find the responsible node, and only ~K/N of the K keys are remapped when one of N nodes is added or removed.
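A sketch of such a ring; the 150-vnode default follows the text, but `HashRing` and the MD5-based placement are illustrative choices, not a specific library's API:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Map a string to a position on the 2^32 ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, physical_node)

    def add_node(self, node: str):
        # Each physical node gets `vnodes` positions, spreading load evenly.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_h(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first vnode at or after the key's position.
        pos = _h(key)
        i = bisect.bisect(self.ring, (pos,)) % len(self.ring)
        return self.ring[i][1]
```

Removing a node moves only the keys whose clockwise successor vnode belonged to it; every other key keeps its owner, which is the ~K/N remapping property.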

🗳️ Raft per Shard

Consistency

Each shard is a 3-node Raft group. Writes go to the leader and are replicated to a quorum before the ACK; followers redirect clients to the leader. A leader lease prevents a stale, deposed leader from serving split-brain reads.
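A toy sketch of the quorum-ack rule only (real Raft also tracks terms, log indices, and elections, all omitted; `ShardLeader` is an illustrative name):

```python
class ShardLeader:
    """Leader of a Raft group; commits a write once a quorum has appended it."""
    def __init__(self, followers):
        self.followers = followers                # list of append callbacks
        group_size = len(followers) + 1           # followers + the leader itself
        self.quorum = group_size // 2 + 1         # majority, e.g. 2 of 3
        self.log = []                             # committed entries

    def put(self, key, value) -> bool:
        acks = 1                                  # leader appends locally first
        for append in self.followers:
            try:
                if append((key, value)):
                    acks += 1
            except ConnectionError:
                pass                              # partitioned follower: no ack
        if acks >= self.quorum:
            self.log.append((key, value))         # committed: safe to ACK client
            return True
        return False                              # quorum unreachable: reject
```

With one of three nodes partitioned, the remaining two still form a majority and the write succeeds; with the leader isolated from both followers, it cannot commit at all.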

Write Path — Step by Step

🗄️ Distributed KV Store Simulator (3-node Raft cluster)

Simulate PUT/GET/DELETE with quorum replication. Kill nodes to trigger leader election. Watch consistency in action.


Storage Engine: LSM Tree

📝 Write Path (LSM)

  • Append to WAL (sequential write, fast)
  • Insert into MemTable (Red-Black tree, sorted)
  • MemTable → SSTable on disk (immutable, sorted)
  • Background compaction merges SSTables
  • Bloom filter per SSTable (avoid disk reads)

🔍 Read Path (LSM)

Check the MemTable, then Level-0 SSTables (newest first), then higher levels. When a Bloom filter says "definitely not here," the SSTable is skipped entirely. An LRU block cache serves frequently accessed blocks.

Worst case: O(log N) SSTable checks. Compaction merges overlapping key ranges and discards tombstones.
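The lookup order above can be sketched with toy in-memory structures; a plain `set` stands in for the Bloom filter and a `dict` for each SSTable (a real engine reads blocks from disk):

```python
def lsm_get(key, memtable, sstables):
    """LSM read path sketch. `sstables` is a newest-first list of
    (bloom, table) pairs; the first hit wins, so newer data shadows older."""
    if key in memtable:                 # 1. freshest data lives in the MemTable
        return memtable[key]
    for bloom, table in sstables:       # 2. scan SSTables newest-first
        if key not in bloom:            #    "definitely not here" -> skip table
            continue
        if key in table:                #    possible hit -> actually look
            return table[key]
    return None                         # 3. not found at any level
```

Note that a Bloom hit only means "maybe here": the table lookup can still miss, which is the ~1% false-positive cost.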

import sortedcontainers, struct, hashlib, os, time

class MemTable:
    """In-memory sorted map. Flush to SSTable when full (>64MB)."""
    def __init__(self, size_limit=64*1024*1024):  # 64MB
        self.data = sortedcontainers.SortedDict()
        self.wal = open("wal.log", "ab")
        self.size = 0
        self.size_limit = size_limit

    def put(self, key: str, value: bytes) -> None:
        # Write-ahead log first (durability)
        record = struct.pack(">H", len(key)) + key.encode() + \
                 struct.pack(">I", len(value)) + value
        self.wal.write(record)
        self.wal.flush()                # drain the user-space buffer...
        os.fsync(self.wal.fileno())     # ...then force it to disk for durability
        # Then update in-memory
        self.data[key] = (value, time.time_ns())  # (value, timestamp)
        self.size += len(key) + len(value)

    def get(self, key: str):
        return self.data.get(key)

    def delete(self, key: str) -> None:
        """Tombstone deletion — actual removal happens at compaction."""
        self.put(key, b"")  # Empty value = tombstone; readers must treat it as deleted

    def is_full(self) -> bool:
        return self.size >= self.size_limit

class BloomFilter:
    """Probabilistic membership test. False positive rate ~1% at ~10 bits/element."""
    def __init__(self, capacity=1000000, error_rate=0.01):
        import math
        # Optimal sizing: bits m = -n*ln(p)/(ln 2)^2, hash count k = (m/n)*ln 2
        self.n_bits = int(-capacity * math.log(error_rate) / math.log(2)**2)
        self.k_hashes = int(self.n_bits / capacity * math.log(2))
        self.bits = bytearray(self.n_bits // 8 + 1)

    def _hashes(self, key: str):
        h1 = int(hashlib.md5(key.encode()).hexdigest(), 16)
        h2 = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return [(h1 + i * h2) % self.n_bits for i in range(self.k_hashes)]

    def add(self, key: str):
        for bit in self._hashes(key):
            self.bits[bit // 8] |= 1 << (bit % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[b//8] & (1 << (b%8)) for b in self._hashes(key))

Key Design Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Consistency model | Tunable (R+W>N for strong) | R=1,W=1 for speed; R=2,W=2 for safety |
| Storage engine | LSM tree | Sequential writes → high throughput; compaction manages reads |
| Replication | Raft (leader-based) | Strong consistency; linearizable reads from leader |
| Sharding | Consistent hash with vnodes | Minimal key redistribution on scale-out |
| Failure detection | Gossip + Phi accrual detector | Probabilistic; adapts to network conditions |
| Anti-entropy | Merkle trees | Efficiently identify diverged data between replicas |
| Client routing | Smart client (hash locally) | No proxy hop; client finds shard directly |
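The R+W>N rule behind tunable consistency can be checked exhaustively for small clusters (the function name is ours): strong consistency requires every read quorum to overlap every write quorum in at least one replica.

```python
import itertools

def quorum_read_is_fresh(n: int, r: int, w: int) -> bool:
    """True iff every R-replica read set intersects every W-replica write set,
    i.e. a read is guaranteed to touch at least one up-to-date replica.
    By the pigeonhole principle this holds exactly when R + W > N."""
    replicas = range(n)
    for wset in itertools.combinations(replicas, w):
        for rset in itertools.combinations(replicas, r):
            if not set(wset) & set(rset):
                return False  # found a read set that could miss the write
    return True
```

For N=3, R=2/W=2 always overlaps (strong), while R=1/W=1 can read a replica the write never touched, which is why that setting is fast but eventually consistent.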

Knowledge Check

1. In a 3-node Raft cluster (N=3, quorum=2), a client writes with W=2. Node 3 is partitioned. Can the write succeed?

2. Why does an LSM tree use a Bloom filter per SSTable?

3. A client does GET with R=1 from a follower (not the leader). The follower is lagging 500ms behind. What could happen?

4. When a node joins an existing cluster, what mechanism does a distributed KV store use to transfer data to the new node?

5. DynamoDB uses Merkle trees for anti-entropy. What problem do they solve?

Week 3 Complete! Days 15–21 Done.

You've mastered distributed systems internals: consensus, transactions, clocks, fault tolerance, observability, and now a full KV store design. Week 4 covers real-world system design interviews.