Day 8  Β·  Storage Engines

Storage Engines:
B-Tree vs LSM

Understand how databases physically store and retrieve data β€” the foundation of every read/write performance decision you will ever make in system design.

3 Simulations  ·  ~3.5h Study Time  ·  5 Quizzes

Topics: B-Tree · LSM Tree · Write Amplification · Compaction · Bloom Filters · Column Stores

Four Ideas to Lock In

Before diving into mechanics, internalise these four mental models. Everything else β€” every performance tradeoff, every database choice β€” flows from them.

🌲

B-Tree

In-place updates on balanced tree pages. O(log n) reads and writes. Dominant in OLTP databases (PostgreSQL, MySQL). Each write requires random disk I/O to find and update the correct page.

πŸ“

LSM Tree

Log-Structured Merge Tree. Buffers writes in memory (MemTable), then flushes to immutable SSTables sequentially. Compaction merges levels in the background. Dominant in write-heavy systems.

πŸ”

Write Amplification

The ratio of bytes actually written to disk vs bytes the app requested to write. B-Tree: 2–4x. LSM: 10–30x because data is rewritten at each compaction level. Directly impacts SSD lifespan and IO cost.

🎯

Bloom Filter

Probabilistic data structure per SSTable. Tells you if a key definitely does NOT exist (zero false negatives). Small false positive rate (~1%). Eliminates 99%+ of unnecessary SSTable disk reads.

B-Tree Write Path β€” Live Animation

Watch how a B-tree traverses from root to leaf on every single write. Each level hop is a random disk seek β€” that is the fundamental cost of B-trees on spinning disks and even SSDs.

B-Tree Visualizer

Click "Insert Key" to add values and watch the traversal path light up in blue. After enough inserts a node split is triggered.

Insert keys to populate the tree. Blue path shows disk seeks on each write.
O(log n) read & write complexity  ·  3–4 disk seeks for a 1TB DB  ·  ~40ms HDD write latency (4 seeks)
πŸ’‘

Why HDD Destroys B-Tree Write Performance

A 1TB database with 4KB pages has a B-tree height of ~4. Each write requires 4 random disk seeks. At 10ms per seek on HDD = 40ms per write. NVMe SSD reduces this to ~0.4ms β€” still random I/O but much faster. This is why write-heavy workloads moved to LSM.

btree_insert.pseudo
btree_insert(root, key, value):
  # 1. Traverse root β†’ leaf (O(log n) random disk reads)
  node = root
  path = []
  while not node.is_leaf:
    path.append(node)
    child_idx = find_child(node, key)        # binary search within 4KB page
    node = read_page_from_disk(node.children[child_idx])  # RANDOM DISK SEEK!

  # 2. Insert into leaf page in-place
  insert_into_page(node, key, value)
  write_page_to_disk(node)               # RANDOM DISK WRITE

  # 3. Split node if full β†’ propagate upward (cascade splits possible)
  if node.is_full():
    split_and_propagate(node, path)

# Performance reality for a 1TB database (4KB pages, fan-out 500):
# Tree height β‰ˆ log_500(1TB / 4KB) β‰ˆ 4 levels
# Each level = 1 random disk read
# HDD:  4 Γ— 10ms = 40ms write latency
# NVMe: 4 Γ— 0.1ms = 0.4ms write latency
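The latency arithmetic in the comments above can be checked directly. A minimal runnable version of the same math, using the same round figures (4KB pages, fan-out 500, 10ms HDD seek, 0.1ms NVMe seek):

```python
import math

PAGE_SIZE = 4 * 1024           # 4 KB pages
FAN_OUT = 500                  # child pointers per interior page
DB_SIZE = 1 * 1024**4          # 1 TB

num_pages = DB_SIZE // PAGE_SIZE                    # ~268M pages
height = math.ceil(math.log(num_pages, FAN_OUT))    # levels root -> leaf

HDD_SEEK_MS = 10.0
NVME_SEEK_MS = 0.1

print(f"tree height: {height} levels")
print(f"HDD write:   {height * HDD_SEEK_MS:.1f} ms")   # 40.0 ms
print(f"NVMe write:  {height * NVME_SEEK_MS:.1f} ms")  # 0.4 ms
```

Note that height grows with log base fan-out, which is why B-trees stay shallow: even at 100TB the tree only gains roughly one more level.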

LSM Write Path β€” Step-Through

LSM trees convert random writes into sequential ones by buffering in memory first. Step through each stage to understand why write-heavy workloads prefer LSM over B-trees.

LSM Write Pipeline

Advance through each step. Notice how sub-millisecond durability is achieved without touching a random disk location.


Ready β€” PUT("user:42", "Alice")

A write request arrives at the LSM engine. It will be processed through 4 stages, achieving sub-millisecond durability without random disk seeks.

lsm_tree.py
from typing import Optional

MEMTABLE_MAX = 64 * 1024 * 1024      # flush threshold (e.g. 64 MB)

class LSMTree:
    def __init__(self):
        self.wal = WriteAheadLog()       # Sequential write — crash recovery
        self.memtable = SkipList()       # In-memory sorted structure
        self.sstables = []               # Immutable sorted disk files
        self.bloom_filters = {}          # One Bloom filter per SSTable

    def put(self, key: str, value: str):
        self.wal.append(key, value)      # Step 1: WAL β€” sequential, 0.1ms
        self.memtable.set(key, value)    # Step 2: MemTable β€” pure RAM, 0.01ms
        if self.memtable.size() > MEMTABLE_MAX:
            self._flush_to_sstable()     # Step 3: flush to L0 SSTable (background)

    def get(self, key: str) -> Optional[str]:
        if val := self.memtable.get(key):  # Check MemTable first (newest)
            return val
        for sstable in reversed(self.sstables):
            if not self.bloom_filters[sstable.id].might_contain(key):
                continue               # Bloom says NO β€” skip disk read entirely
            if val := sstable.get(key):
                return val
        return None

    def _flush_to_sstable(self):
        # Write MemTable to disk as sorted, immutable SSTable file
        sstable = SSTable.from_sorted_iter(self.memtable.items())
        self.bloom_filters[sstable.id] = BloomFilter(sstable.keys)
        self.sstables.append(sstable)
        self.memtable = SkipList()
        self._maybe_compact()            # Step 4: compaction in background
⚠️

LSM Read Tradeoff

LSM trees pay for fast writes with slower reads. Without Bloom filters, a read must check the MemTable plus every SSTable file β€” potentially dozens. Bloom filters reduce this to near O(1) for missing keys by eliminating 99%+ of SSTable checks without a disk read.
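This tradeoff is easy to quantify. For a lookup of a key that does not exist, every SSTable must be consulted; with a Bloom filter of false-positive rate p, only the false positives trigger an actual disk read. A sketch of the expected-read math, using the ~1% rate quoted above:

```python
from typing import Optional

def expected_reads_for_miss(num_sstables: int, fp_rate: Optional[float]) -> float:
    """Expected SSTable disk reads for a key that does not exist."""
    if fp_rate is None:
        return float(num_sstables)       # no filters: binary-search every SSTable
    return num_sstables * fp_rate        # only false positives reach disk

print(expected_reads_for_miss(30, None))   # 30.0 reads without filters
print(expected_reads_for_miss(30, 0.01))   # ~0.3 reads with a 1% FPR
```

With 30 SSTables, Bloom filters turn an average of 30 disk reads per missing key into roughly 0.3, which is where the "99%+ eliminated" figure comes from.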

Complete Engine Comparison

Choosing the wrong storage engine can cause 10x worse performance. This table covers every property you need to evaluate.

| Property            | B-Tree (PostgreSQL/MySQL)                    | LSM (RocksDB/Cassandra)                      |
|---------------------|----------------------------------------------|----------------------------------------------|
| Write latency       | Higher — random I/O to update page in place  | Lower — sequential WAL + MemTable write      |
| Read latency        | Lower — direct page lookup, O(log n)         | Higher — may check multiple SSTable levels   |
| Write amplification | 2–4x (page rewrite + WAL)                    | 10–30x (repeated compaction rewrites)        |
| Read amplification  | 1–3x disk seeks                              | 10–30x without Bloom filter assistance       |
| Space amplification | Low — in-place updates reuse pages           | Moderate — 2 copies exist during compaction  |
| Range queries       | Excellent — data on-disk in sorted order     | Good — SSTables are individually sorted      |
| Best workload       | Read-heavy OLTP, mixed reads/writes, joins   | Write-heavy, time-series, append-only, logs  |
| Real examples       | PostgreSQL, MySQL InnoDB, SQLite, Oracle     | RocksDB, Cassandra, LevelDB, FoundationDB    |
⚠️

Write Amplification at Scale

Cassandra at Facebook was measured at 10–30x write amplification. For a workload writing 1 GB/s at the application layer, actual disk write throughput is 10–30 GB/s. This significantly accelerates SSD wear-out and increases cloud storage costs.
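A back-of-envelope sketch of what this does to hardware budgets, using the 1 GB/s workload and the midpoint of the 10–30x range above. The 3,000-TBW endurance rating is an assumed value for illustration, not a quoted figure:

```python
SECONDS_PER_DAY = 86_400
app_write_gb_per_s = 1.0
write_amp = 20                    # midpoint of the 10-30x range

disk_gb_per_day = app_write_gb_per_s * write_amp * SECONDS_PER_DAY
disk_tb_per_day = disk_gb_per_day / 1000

SSD_ENDURANCE_TBW = 3_000         # assumed enterprise NVMe rating (TB written)
days_per_ssd = SSD_ENDURANCE_TBW / disk_tb_per_day

print(f"{disk_tb_per_day:,.0f} TB actually written per day")
print(f"one 3,000-TBW SSD exhausted in {days_per_ssd:.1f} days")
```

At 1,728 TB/day of physical writes, a single drive's endurance budget lasts under two days, so this class of workload is always spread across many drives and the amplification factor directly multiplies the fleet size.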

Bloom Filters β€” The Read Accelerator

Without Bloom filters an LSM read must open and binary-search every SSTable for a key. Bloom filters eliminate 99%+ of unnecessary disk reads using only kilobytes of memory per SSTable.

βœ…

No False Negatives

If the Bloom filter says "NO" β€” the key definitely does not exist in that SSTable. We can skip the disk read entirely. This is the guarantee that makes Bloom filters safe to use.

❓

Tunable False Positive Rate

Occasionally the filter says "might exist" but the key is not there. Causes one wasted disk read. Typical production rate: 0.1%–1%. Tunable via bit array size.

πŸ’Ύ

Massive Space Savings

1 billion keys β†’ only 1.2GB for a Bloom filter at 1% FPR. A hash set for 1B keys needs ~40GB. Bloom filters fit in L3 cache β€” no RAM pressure.

πŸ”£

k Hash Functions

Each add/check runs k hash functions. All k bit positions must be 1 for "might contain." Any single 0 means definite miss. Optimal k = (m/n) Γ— ln(2).

bloom_filter.py
from bitarray import bitarray
import mmh3   # MurmurHash3 β€” fast non-cryptographic hash
import math

class BloomFilter:
    def __init__(self, capacity=1_000_000, error_rate=0.01):
        # Optimal bit array size: m = -n*ln(p) / (ln(2)^2)
        self.size = int(-capacity * math.log(error_rate) / (math.log(2)**2))
        self.bits = bitarray(self.size)
        self.bits.setall(0)
        # Optimal hash count: k = (m/n) * ln(2), rounded to the nearest integer
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        # At capacity=1M, error_rate=1%: size ≈ 9.6M bits (~1.2 MB), num_hashes = 7

    def add(self, item: str):
        for seed in range(self.num_hashes):
            pos = mmh3.hash(item, seed) % self.size
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # ALL k bits must be 1. Any single 0 β†’ definite miss
        return all(
            self.bits[mmh3.hash(item, seed) % self.size]
            for seed in range(self.num_hashes)
        )

# In LSM tree: one BloomFilter per SSTable, kept in memory
bf = BloomFilter(capacity=1_000_000, error_rate=0.01)
bf.add("user:42")
bf.might_contain("user:42")   # True  β€” might exist, do disk read
bf.might_contain("user:99")   # False β€” definitely not here, skip!
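The listing above depends on the third-party mmh3 package. A stdlib-only variant, hashing with hashlib.blake2b and a per-index salt (an illustrative substitute, not what any production engine uses), lets you empirically verify that the configured 1% error rate holds:

```python
import hashlib
import math

class StdlibBloomFilter:
    """Same sizing math as above, but hashed with hashlib instead of mmh3."""
    def __init__(self, capacity=100_000, error_rate=0.01):
        self.size = int(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray(self.size)   # one byte per bit, for simplicity

    def _positions(self, item: str):
        for seed in range(self.num_hashes):
            h = hashlib.blake2b(item.encode(), salt=seed.to_bytes(8, "little"))
            yield int.from_bytes(h.digest()[:8], "little") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))

# Fill to capacity, then probe 10,000 keys that were never inserted
bf = StdlibBloomFilter(capacity=100_000, error_rate=0.01)
for i in range(100_000):
    bf.add(f"user:{i}")

false_positives = sum(bf.might_contain(f"ghost:{i}") for i in range(10_000))
print(f"measured FPR: {false_positives / 10_000:.3%}")   # close to the 1% target
```

Every inserted key always returns True (no false negatives), while the measured false-positive rate lands near the configured 1%.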
βœ…

Bloom Filter Impact on RocksDB Reads

RocksDB benchmarks show Bloom filters reduce read I/O by ~90% for point lookups on write-heavy workloads. A database with 10 SSTable levels drops from ~50ms read latency to ~5ms β€” without changing the storage layout at all.

Choose the Right Engine

Every production database decision involves matching your access pattern to the storage engine's strengths. Start by answering: what is my read/write ratio and do I need joins?

🐘

PostgreSQL (B-Tree)

OLTP workloads, read-heavy (reads outweigh writes 10:1+), complex joins, strong ACID guarantees. Banking, e-commerce, user accounts. Ideal baseline for most startups.

πŸͺ¨

Cassandra / RocksDB (LSM)

Write-heavy, time-series, IoT sensors, logs, activity feeds. Writes outnumber reads 3:1+. No complex joins. Used by Netflix, Apple, Instagram at billion-user scale.

πŸ–±οΈ

ClickHouse (Columnar LSM)

Analytics and aggregations over billions of rows. Columnar layout gives 10–100x compression. Ideal for OLAP dashboards, but not for OLTP: point reads and single-row updates are slow.

⚑

Redis (In-Memory)

Cache, session store, rate limiting, leaderboards, pub/sub. Sub-millisecond latency because data lives in RAM. Data size limited to available memory.
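The four cards above reduce to a small decision rule. A sketch of that rule as code; the thresholds are the rough ratios quoted in the cards, not hard cutoffs, and real choices weigh many more factors:

```python
def suggest_engine(reads_per_write: float, needs_joins: bool,
                   analytical: bool, fits_in_ram: bool) -> str:
    """Rule-of-thumb mapping from the workload questions above (illustrative)."""
    if fits_in_ram and reads_per_write > 100:
        return "Redis (in-memory)"            # cache / session / leaderboard
    if analytical:
        return "ClickHouse (columnar LSM)"    # aggregations over billions of rows
    if needs_joins or reads_per_write >= 10:
        return "PostgreSQL (B-tree)"          # read-heavy OLTP, joins, ACID
    if reads_per_write < 1 / 3:               # writes outnumber reads 3:1+
        return "Cassandra / RocksDB (LSM)"    # time-series, logs, feeds
    return "PostgreSQL (B-tree)"              # sane default for mixed workloads

print(suggest_engine(reads_per_write=0.1, needs_joins=False,
                     analytical=False, fits_in_ram=False))
```

Running it on a write-heavy, join-free workload (10 writes per read) points at the LSM engines, which matches the IoT/time-series guidance above.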

πŸ’‘

Interview Pattern: Lead with Trade-offs

In system design interviews, always state WHY you're choosing a storage engine. "I'd use Cassandra because this is a write-heavy time-series workload with 1M writes/sec β€” the LSM tree's sequential write path and wide partitions are ideal here. We lose complex joins but this access pattern doesn't need them."

Knowledge Check β€” 5 Questions

These questions test deeper understanding, not memorisation. Think about the mechanism before clicking.

1. HDD random seek time (~10ms) makes B-tree writes slow because…
2. LSM trees convert random writes to sequential by…
3. A Bloom filter false positive means…
4. The primary purpose of compaction in an LSM tree is…
5. You are storing 1 million IoT sensor readings per day for 5 years. Which storage engine fits best?