Day 8  Β·  Storage Engines

Storage Engines:
B-Tree vs LSM

Understand how databases physically store and retrieve data β€” the foundation of every read/write performance decision you will ever make in system design.

3 Simulations  ·  ~3.5h Study Time  ·  5 Quizzes

Topics: B-Tree · LSM Tree · Write Amplification · Compaction · Bloom Filters · Column Stores

Four Ideas to Lock In

Before diving into mechanics, internalise these four mental models. Everything else β€” every performance tradeoff, every database choice β€” flows from them.

🌲

B-Tree

In-place updates on balanced tree pages. O(log n) reads and writes. Dominant in OLTP databases (PostgreSQL, MySQL). Each write requires random disk I/O to find and update the correct page.

πŸ“

LSM Tree

Log-Structured Merge Tree. Buffers writes in memory (MemTable), then flushes to immutable SSTables sequentially. Compaction merges levels in the background. Dominant in write-heavy systems.

πŸ”

Write Amplification

The ratio of bytes actually written to disk vs bytes the app requested to write. B-Tree: 2–4x. LSM: 10–30x because data is rewritten at each compaction level. Directly impacts SSD lifespan and IO cost.

🎯

Bloom Filter

Probabilistic data structure per SSTable. Tells you if a key definitely does NOT exist (zero false negatives). Small false positive rate (~1%). Eliminates 99%+ of unnecessary SSTable disk reads.

B-Tree Write Path β€” Live Animation

Watch how a B-tree traverses from root to leaf on every single write. Each level hop is a random disk seek β€” that is the fundamental cost of B-trees on spinning disks and even SSDs.

B-Tree Visualizer

Click "Insert Key" to add values and watch the traversal path light up in blue. After enough inserts a node split is triggered.

Insert keys to populate the tree. Blue path shows disk seeks on each write.
O(log n) read & write complexity  ·  3–4 disk seeks for a 1TB DB  ·  ~40ms HDD write latency (4 seeks)
πŸ’‘

Why HDD Destroys B-Tree Write Performance

A 1TB database with 4KB pages has a B-tree height of ~4. Each write requires 4 random disk seeks. At 10ms per seek on HDD = 40ms per write. NVMe SSD reduces this to ~0.4ms β€” still random I/O but much faster. This is why write-heavy workloads moved to LSM.

btree_insert.pseudo
btree_insert(root, key, value):
  # 1. Traverse root β†’ leaf (O(log n) random disk reads)
  node = root
  path = []
  while not node.is_leaf:
    path.append(node)
    child_idx = find_child(node, key)        # binary search within 4KB page
    node = read_page_from_disk(node.children[child_idx])  # RANDOM DISK SEEK!

  # 2. Insert into leaf page in-place
  insert_into_page(node, key, value)
  write_page_to_disk(node)               # RANDOM DISK WRITE

  # 3. Split node if full β†’ propagate upward (cascade splits possible)
  if node.is_full():
    split_and_propagate(node, path)

# Performance reality for a 1TB database (4KB pages, fan-out 500):
# Tree height β‰ˆ log_500(1TB / 4KB) β‰ˆ 4 levels
# Each level = 1 random disk read
# HDD:  4 Γ— 10ms = 40ms write latency
# NVMe: 4 Γ— 0.1ms = 0.4ms write latency
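The latency arithmetic in the comments above can be checked directly. A minimal runnable version of the same math, using the same round figures (4KB pages, fan-out 500, 10ms HDD seek, 0.1ms NVMe seek):

```python
import math

PAGE_SIZE = 4 * 1024           # 4 KB pages
FAN_OUT = 500                  # child pointers per interior page
DB_SIZE = 1 * 1024**4          # 1 TB

num_pages = DB_SIZE // PAGE_SIZE                    # ~268M pages
height = math.ceil(math.log(num_pages, FAN_OUT))    # levels root -> leaf

HDD_SEEK_MS = 10.0
NVME_SEEK_MS = 0.1

print(f"tree height: {height} levels")
print(f"HDD write:   {height * HDD_SEEK_MS:.1f} ms")   # 40.0 ms
print(f"NVMe write:  {height * NVME_SEEK_MS:.1f} ms")  # 0.4 ms
```

Note that height grows with log base fan-out, which is why B-trees stay shallow: even at 100TB the tree only gains roughly one more level.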

LSM Write Path β€” Step-Through

LSM trees convert random writes into sequential ones by buffering in memory first. Step through each stage to understand why write-heavy workloads prefer LSM over B-trees.

LSM Write Pipeline

Advance through each step. Notice how sub-millisecond durability is achieved without touching a random disk location.


Ready β€” PUT("user:42", "Alice")

A write request arrives at the LSM engine. It will be processed through 4 stages, achieving sub-millisecond durability without random disk seeks.

lsm_tree.py
from typing import Optional

MEMTABLE_MAX = 64 * 1024 * 1024      # flush threshold (e.g. 64 MB)

class LSMTree:
    def __init__(self):
        self.wal = WriteAheadLog()       # Sequential write — crash recovery
        self.memtable = SkipList()       # In-memory sorted structure
        self.sstables = []               # Immutable sorted disk files
        self.bloom_filters = {}          # One Bloom filter per SSTable

    def put(self, key: str, value: str):
        self.wal.append(key, value)      # Step 1: WAL β€” sequential, 0.1ms
        self.memtable.set(key, value)    # Step 2: MemTable β€” pure RAM, 0.01ms
        if self.memtable.size() > MEMTABLE_MAX:
            self._flush_to_sstable()     # Step 3: flush to L0 SSTable (background)

    def get(self, key: str) -> Optional[str]:
        if val := self.memtable.get(key):  # Check MemTable first (newest)
            return val
        for sstable in reversed(self.sstables):
            if not self.bloom_filters[sstable.id].might_contain(key):
                continue               # Bloom says NO β€” skip disk read entirely
            if val := sstable.get(key):
                return val
        return None

    def _flush_to_sstable(self):
        # Write MemTable to disk as sorted, immutable SSTable file
        sstable = SSTable.from_sorted_iter(self.memtable.items())
        self.bloom_filters[sstable.id] = BloomFilter(sstable.keys)
        self.sstables.append(sstable)
        self.memtable = SkipList()
        self._maybe_compact()            # Step 4: compaction in background
⚠️

LSM Read Tradeoff

LSM trees pay for fast writes with slower reads. Without Bloom filters, a read must check the MemTable plus every SSTable file β€” potentially dozens. Bloom filters reduce this to near O(1) for missing keys by eliminating 99%+ of SSTable checks without a disk read.
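This tradeoff is easy to quantify. For a lookup of a key that does not exist, every SSTable must be consulted; with a Bloom filter of false-positive rate p, only the false positives trigger an actual disk read. A sketch of the expected-read math, using the ~1% rate quoted above:

```python
from typing import Optional

def expected_reads_for_miss(num_sstables: int, fp_rate: Optional[float]) -> float:
    """Expected SSTable disk reads for a key that does not exist."""
    if fp_rate is None:
        return float(num_sstables)       # no filters: binary-search every SSTable
    return num_sstables * fp_rate        # only false positives reach disk

print(expected_reads_for_miss(30, None))   # 30.0 reads without filters
print(expected_reads_for_miss(30, 0.01))   # ~0.3 reads with a 1% FPR
```

With 30 SSTables, Bloom filters turn an average of 30 disk reads per missing key into roughly 0.3, which is where the "99%+ eliminated" figure comes from.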

Complete Engine Comparison

Choosing the wrong storage engine can cause 10x worse performance. This table covers every property you need to evaluate.

| Property            | B-Tree (PostgreSQL/MySQL)                    | LSM (RocksDB/Cassandra)                      |
|---------------------|----------------------------------------------|----------------------------------------------|
| Write latency       | Higher — random I/O to update page in place  | Lower — sequential WAL + MemTable write      |
| Read latency        | Lower — direct page lookup, O(log n)         | Higher — may check multiple SSTable levels   |
| Write amplification | 2–4x (page rewrite + WAL)                    | 10–30x (repeated compaction rewrites)        |
| Read amplification  | 1–3x disk seeks                              | 10–30x without Bloom filter assistance       |
| Space amplification | Low — in-place updates reuse pages           | Moderate — 2 copies exist during compaction  |
| Range queries       | Excellent — data on-disk in sorted order     | Good — SSTables are individually sorted      |
| Best workload       | Read-heavy OLTP, mixed reads/writes, joins   | Write-heavy, time-series, append-only, logs  |
| Real examples       | PostgreSQL, MySQL InnoDB, SQLite, Oracle     | RocksDB, Cassandra, LevelDB, FoundationDB    |
⚠️

Write Amplification at Scale

Cassandra at Facebook was measured at 10–30x write amplification. For a workload writing 1 GB/s at the application layer, actual disk write throughput is 10–30 GB/s. This significantly accelerates SSD wear-out and increases cloud storage costs.
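A back-of-envelope sketch of what this does to hardware budgets, using the 1 GB/s workload and the midpoint of the 10–30x range above. The 3,000-TBW endurance rating is an assumed value for illustration, not a quoted figure:

```python
SECONDS_PER_DAY = 86_400
app_write_gb_per_s = 1.0
write_amp = 20                    # midpoint of the 10-30x range

disk_gb_per_day = app_write_gb_per_s * write_amp * SECONDS_PER_DAY
disk_tb_per_day = disk_gb_per_day / 1000

SSD_ENDURANCE_TBW = 3_000         # assumed enterprise NVMe rating (TB written)
days_per_ssd = SSD_ENDURANCE_TBW / disk_tb_per_day

print(f"{disk_tb_per_day:,.0f} TB actually written per day")
print(f"one 3,000-TBW SSD exhausted in {days_per_ssd:.1f} days")
```

At 1,728 TB/day of physical writes, a single drive's endurance budget lasts under two days, so this class of workload is always spread across many drives and the amplification factor directly multiplies the fleet size.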

Bloom Filters β€” The Read Accelerator

Without Bloom filters an LSM read must open and binary-search every SSTable for a key. Bloom filters eliminate 99%+ of unnecessary disk reads using only kilobytes of memory per SSTable.

βœ…

No False Negatives

If the Bloom filter says "NO" β€” the key definitely does not exist in that SSTable. We can skip the disk read entirely. This is the guarantee that makes Bloom filters safe to use.

❓

Tunable False Positive Rate

Occasionally the filter says "might exist" but the key is not there. Causes one wasted disk read. Typical production rate: 0.1%–1%. Tunable via bit array size.

πŸ’Ύ

Massive Space Savings

1 billion keys β†’ only 1.2GB for a Bloom filter at 1% FPR. A hash set for 1B keys needs ~40GB. Bloom filters fit in L3 cache β€” no RAM pressure.

πŸ”£

k Hash Functions

Each add/check runs k hash functions. All k bit positions must be 1 for "might contain." Any single 0 means definite miss. Optimal k = (m/n) Γ— ln(2).

bloom_filter.py
from bitarray import bitarray
import mmh3   # MurmurHash3 β€” fast non-cryptographic hash
import math

class BloomFilter:
    def __init__(self, capacity=1_000_000, error_rate=0.01):
        # Optimal bit array size: m = -n*ln(p) / (ln(2)^2)
        self.size = int(-capacity * math.log(error_rate) / (math.log(2)**2))
        self.bits = bitarray(self.size)
        self.bits.setall(0)
        # Optimal hash count: k = (m/n) * ln(2), rounded to the nearest integer
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        # At capacity=1M, error_rate=1%: size ≈ 9.6M bits (~1.2 MB), num_hashes = 7

    def add(self, item: str):
        for seed in range(self.num_hashes):
            pos = mmh3.hash(item, seed) % self.size
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # ALL k bits must be 1. Any single 0 β†’ definite miss
        return all(
            self.bits[mmh3.hash(item, seed) % self.size]
            for seed in range(self.num_hashes)
        )

# In LSM tree: one BloomFilter per SSTable, kept in memory
bf = BloomFilter(capacity=1_000_000, error_rate=0.01)
bf.add("user:42")
bf.might_contain("user:42")   # True  β€” might exist, do disk read
bf.might_contain("user:99")   # False β€” definitely not here, skip!
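The listing above depends on the third-party mmh3 package. A stdlib-only variant, hashing with hashlib.blake2b and a per-index salt (an illustrative substitute, not what any production engine uses), lets you empirically verify that the configured 1% error rate holds:

```python
import hashlib
import math

class StdlibBloomFilter:
    """Same sizing math as above, but hashed with hashlib instead of mmh3."""
    def __init__(self, capacity=100_000, error_rate=0.01):
        self.size = int(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray(self.size)   # one byte per bit, for simplicity

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            h = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "little"))
            yield int.from_bytes(h.digest()[:8], "little") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

# Fill to capacity, then probe 10,000 keys that were never inserted
bf = StdlibBloomFilter(capacity=100_000, error_rate=0.01)
for i in range(100_000):
    bf.add(f"user:{i}")

false_positives = sum(bf.might_contain(f"ghost:{i}") for i in range(10_000))
print(f"measured FPR: {false_positives / 10_000:.3%}")   # close to the 1% target
```

Every inserted key always returns True (no false negatives), while the measured false-positive rate lands near the configured 1%.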
βœ…

Bloom Filter Impact on RocksDB Reads

RocksDB benchmarks show Bloom filters reduce read I/O by ~90% for point lookups on write-heavy workloads. A database with 10 SSTable levels drops from ~50ms read latency to ~5ms β€” without changing the storage layout at all.

Choose the Right Engine

Every production database decision involves matching your access pattern to the storage engine's strengths. Start by answering: what is my read/write ratio and do I need joins?

🐘

PostgreSQL (B-Tree)

OLTP workloads, read-heavy (reads outweigh writes 10:1+), complex joins, strong ACID guarantees. Banking, e-commerce, user accounts. Ideal baseline for most startups.

πŸͺ¨

Cassandra / RocksDB (LSM)

Write-heavy, time-series, IoT sensors, logs, activity feeds. Writes outnumber reads 3:1+. No complex joins. Used by Netflix, Apple, Instagram at billion-user scale.

πŸ–±οΈ

ClickHouse (Columnar LSM)

Analytics and aggregations over billions of rows. Columnar layout gives 10–100x compression. Ideal for OLAP dashboards, but not for OLTP: point reads and single-row updates are slow.

⚑

Redis (In-Memory)

Cache, session store, rate limiting, leaderboards, pub/sub. Sub-millisecond latency because data lives in RAM. Data size limited to available memory.
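The four cards above reduce to a small decision rule. A sketch of that rule as code; the thresholds are the rough ratios quoted in the cards, not hard cutoffs, and real choices weigh many more factors:

```python
def suggest_engine(reads_per_write: float, needs_joins: bool,
                   analytical: bool, fits_in_ram: bool) -> str:
    """Rule-of-thumb mapping from the workload questions above (illustrative)."""
    if fits_in_ram and reads_per_write > 100:
        return "Redis (in-memory)"            # cache / session / leaderboard
    if analytical:
        return "ClickHouse (columnar LSM)"    # aggregations over billions of rows
    if needs_joins or reads_per_write >= 10:
        return "PostgreSQL (B-tree)"          # read-heavy OLTP, joins, ACID
    if reads_per_write < 1 / 3:               # writes outnumber reads 3:1+
        return "Cassandra / RocksDB (LSM)"    # time-series, logs, feeds
    return "PostgreSQL (B-tree)"              # sane default for mixed workloads

print(suggest_engine(reads_per_write=0.1, needs_joins=False,
                     analytical=False, fits_in_ram=False))
```

Running it on a write-heavy, join-free workload (10 writes per read) points at the LSM engines, which matches the IoT/time-series guidance above.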

πŸ’‘

Interview Pattern: Lead with Trade-offs

In system design interviews, always state WHY you're choosing a storage engine. "I'd use Cassandra because this is a write-heavy time-series workload with 1M writes/sec β€” the LSM tree's sequential write path and wide partitions are ideal here. We lose complex joins but this access pattern doesn't need them."

Knowledge Check β€” 5 Questions

These questions test deeper understanding, not memorisation. Think about the mechanism before clicking.

1. HDD random seek time (~10ms) makes B-tree writes slow because…
2. LSM trees convert random writes to sequential by…
3. A Bloom filter false positive means…
4. The primary purpose of compaction in an LSM tree is…
5. You are storing 1 million IoT sensor readings per day for 5 years. Which storage engine fits best?