Understand how databases physically store and retrieve data: the foundation of every read/write performance decision you will ever make in system design.
Before diving into mechanics, internalise these four mental models. Everything else, from every performance tradeoff to every database choice, flows from them.
In-place updates on balanced tree pages. O(log n) reads and writes. Dominant in OLTP databases (PostgreSQL, MySQL). Each write requires random disk I/O to find and update the correct page.
Log-Structured Merge Tree. Buffers writes in memory (MemTable), then flushes to immutable SSTables sequentially. Compaction merges levels in the background. Dominant in write-heavy systems.
The ratio of bytes actually written to disk to bytes the application asked to write. B-Tree: 2–4x. LSM: 10–30x, because data is rewritten at each compaction level. Directly impacts SSD lifespan and I/O cost.
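The amplification factors above translate directly into disk traffic. A minimal sketch (the factors used here are the illustrative ranges from this section, not measurements):

```python
def disk_bytes_written(app_bytes: int, write_amp: float) -> int:
    """Bytes that actually hit the disk for a given write amplification factor."""
    return int(app_bytes * write_amp)

app_writes = 1 * 1024**3                            # app writes 1 GiB
btree_cost = disk_bytes_written(app_writes, 4)      # B-tree upper bound (4x)
lsm_cost = disk_bytes_written(app_writes, 30)       # LSM worst case (30x)
# The same 1 GiB of application writes can cost 4 GiB vs 30 GiB of disk I/O.
```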
Probabilistic data structure per SSTable. Tells you if a key definitely does NOT exist (zero false negatives). Small false positive rate (~1%). Eliminates 99%+ of unnecessary SSTable disk reads.
Watch how a B-tree traverses from root to leaf on every single write. Each level hop is a random disk seek; that is the fundamental cost of B-trees on spinning disks and even SSDs.
A 1TB database with 4KB pages has a B-tree height of ~4. Each write requires 4 random disk seeks. At 10ms per seek on HDD, that is 40ms per write. NVMe SSD reduces this to ~0.4ms: still random I/O, but much faster. This is why write-heavy workloads moved to LSM.
```python
def btree_insert(root, key, value):
    # 1. Traverse root → leaf (O(log n) random disk reads)
    node = root
    path = []
    while not node.is_leaf:
        path.append(node)
        child_idx = find_child(node, key)  # binary search within 4KB page
        node = read_page_from_disk(node.children[child_idx])  # RANDOM DISK SEEK!

    # 2. Insert into leaf page in-place
    insert_into_page(node, key, value)
    write_page_to_disk(node)  # RANDOM DISK WRITE

    # 3. Split node if full → propagate upward (cascade splits possible)
    if node.is_full():
        split_and_propagate(node, path)

# Performance reality for a 1TB database (4KB pages, fan-out 500):
#   Tree height ≈ log_500(1TB / 4KB) ≈ 4 levels
#   Each level = 1 random disk read
#   HDD:  4 × 10ms  = 40ms write latency
#   NVMe: 4 × 0.1ms = 0.4ms write latency
```
LSM trees convert random writes into sequential ones by buffering in memory first. Step through each stage to understand why write-heavy workloads prefer LSM over B-trees.
```python
from typing import Optional

class LSMTree:
    def __init__(self):
        self.wal = WriteAheadLog()   # Sequential write → crash recovery
        self.memtable = SkipList()   # In-memory sorted structure
        self.sstables = []           # Immutable sorted disk files
        self.bloom_filters = {}      # One Bloom filter per SSTable

    def put(self, key: str, value: str):
        self.wal.append(key, value)        # Step 1: WAL → sequential, 0.1ms
        self.memtable.set(key, value)      # Step 2: MemTable → pure RAM, 0.01ms
        if self.memtable.size() > MEMTABLE_MAX:
            self._flush_to_sstable()       # Step 3: flush to L0 SSTable (background)

    def get(self, key: str) -> Optional[str]:
        if val := self.memtable.get(key):  # Check MemTable first (newest)
            return val
        for sstable in reversed(self.sstables):
            if not self.bloom_filters[sstable.id].might_contain(key):
                continue                   # Bloom says NO → skip disk read entirely
            if val := sstable.get(key):
                return val
        return None

    def _flush_to_sstable(self):
        # Write MemTable to disk as a sorted, immutable SSTable file
        sstable = SSTable.from_sorted_iter(self.memtable.items())
        self.bloom_filters[sstable.id] = BloomFilter(sstable.keys)
        self.sstables.append(sstable)
        self.memtable = SkipList()
        self._maybe_compact()              # Step 4: compaction in background
```
LSM trees pay for fast writes with slower reads. Without Bloom filters, a read must check the MemTable plus every SSTable file, potentially dozens. Bloom filters reduce this to near O(1) for missing keys by eliminating 99%+ of SSTable checks without a disk read.
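A rough expected-cost model makes the benefit concrete. This is a sketch assuming independent per-SSTable filters with a fixed false-positive rate, which is a simplification of real LSM read paths:

```python
def expected_sstable_reads(num_sstables: int, fpr: float, key_present: bool) -> float:
    """Expected number of SSTable disk reads for a point lookup.

    Assumes each SSTable has an independent Bloom filter with
    false-positive rate `fpr`.
    """
    if key_present:
        # Simplification: one real read; false positives in the levels
        # scanned before the key is found are ignored here.
        return 1.0
    # Missing key: each filter wrongly says "maybe" with probability fpr,
    # and each false positive costs one wasted disk read.
    return num_sstables * fpr

# 20 SSTables at 1% FPR: a missing key costs ~0.2 disk reads on average,
# versus 20 reads with no filters at all.
```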
Choosing the wrong storage engine can cause 10x worse performance. This table covers every property you need to evaluate.
| Property | B-Tree (PostgreSQL/MySQL) | LSM (RocksDB/Cassandra) |
|---|---|---|
| Write latency | Higher: random I/O to update the page in place | Lower: sequential WAL + MemTable write |
| Read latency | Lower: direct page lookup, O(log n) | Higher: may check multiple SSTable levels |
| Write amplification | 2–4x (page rewrite + WAL) | 10–30x (repeated compaction rewrites) |
| Read amplification | 1–3x disk seeks | 10–30x without Bloom filter assistance |
| Space amplification | Low: in-place updates reuse pages | Moderate: 2 copies exist during compaction |
| Range queries | Excellent: data stored on disk in sorted order | Good: SSTables are individually sorted |
| Best workload | Read-heavy OLTP, mixed reads/writes, joins | Write-heavy, time-series, append-only, logs |
| Real examples | PostgreSQL, MySQL InnoDB, SQLite, Oracle | RocksDB, Cassandra, LevelDB, FoundationDB |
Cassandra at Facebook was measured at 10–30x write amplification. For a workload writing 1 GB/s at the application layer, actual disk write throughput is 10–30 GB/s. This significantly accelerates SSD wear-out and increases cloud storage costs.
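To see why this matters for hardware budgeting, a back-of-the-envelope sketch. All the drive and fleet numbers below are hypothetical, chosen only to illustrate the arithmetic:

```python
def ssd_fleet_lifetime_days(app_write_gb_per_s: float, write_amp: float,
                            drive_tbw: float, num_drives: int) -> float:
    """Days until a fleet's total rated endurance (TBW, terabytes written)
    is exhausted, given an application write rate and amplification factor.
    """
    disk_tb_per_day = app_write_gb_per_s * write_amp * 86_400 / 1_000
    return drive_tbw * num_drives / disk_tb_per_day

# Hypothetical fleet: 100 drives rated at 10,000 TBW each,
# app writing 1 GB/s with 10x amplification → ~3.2 years of endurance.
# The same fleet at 30x amplification wears out 3x faster.
```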
Without Bloom filters an LSM read must open and binary-search every SSTable for a key. Bloom filters eliminate 99%+ of unnecessary disk reads using only kilobytes of memory per SSTable.
If the Bloom filter says "NO", the key definitely does not exist in that SSTable, and we can skip the disk read entirely. This is the guarantee that makes Bloom filters safe to use.
Occasionally the filter says "might exist" when the key is not there. This costs one wasted disk read. Typical production rate: 0.1%–1%, tunable via the bit-array size.
1 billion keys need only ~1.2GB of Bloom filter at 1% FPR, versus ~40GB for a hash set holding the same keys. Per-SSTable filters are small enough to stay memory-resident, and often CPU-cache-resident, without meaningful RAM pressure.
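The 1.2GB figure falls out of the standard sizing formula, which this section uses elsewhere as well. A quick check:

```python
import math

def bloom_bits(n_keys: int, fpr: float) -> int:
    """Optimal bit-array size: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-n_keys * math.log(fpr) / math.log(2) ** 2)

bits = bloom_bits(1_000_000_000, 0.01)
gigabytes = bits / 8 / 1e9   # ≈ 1.2 GB for 1 billion keys at 1% FPR
```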
Each add/check runs k hash functions. All k bit positions must be 1 for "might contain." Any single 0 means a definite miss. Optimal k = (m/n) × ln(2).
```python
import math

import mmh3                    # MurmurHash3: fast non-cryptographic hash
from bitarray import bitarray

class BloomFilter:
    def __init__(self, capacity=1_000_000, error_rate=0.01):
        # Optimal bit-array size: m = -n * ln(p) / (ln 2)^2
        self.size = int(-capacity * math.log(error_rate) / (math.log(2) ** 2))
        self.bits = bitarray(self.size)
        self.bits.setall(0)
        # Optimal hash count: k = (m/n) * ln(2)
        self.num_hashes = round(self.size / capacity * math.log(2))
        # At capacity=1M, error_rate=1%: size ≈ 9.6M bits (~1.2 MB), num_hashes = 7

    def add(self, item: str):
        for seed in range(self.num_hashes):
            pos = mmh3.hash(item, seed) % self.size
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # ALL k bits must be 1. Any single 0 → definite miss
        return all(
            self.bits[mmh3.hash(item, seed) % self.size]
            for seed in range(self.num_hashes)
        )

# In an LSM tree: one BloomFilter per SSTable, kept in memory
bf = BloomFilter(capacity=1_000_000, error_rate=0.01)
bf.add("user:42")
bf.might_contain("user:42")  # True → might exist, do the disk read
bf.might_contain("user:99")  # False → definitely not here, skip!
```
RocksDB benchmarks show Bloom filters reduce read I/O by ~90% for point lookups on write-heavy workloads. A database with 10 SSTable levels drops from ~50ms read latency to ~5ms, without changing the storage layout at all.
Every production database decision involves matching your access pattern to the storage engine's strengths. Start by answering: what is my read/write ratio and do I need joins?
OLTP workloads, read-heavy (reads outweigh writes 10:1+), complex joins, strong ACID guarantees. Banking, e-commerce, user accounts. Ideal baseline for most startups.
Write-heavy, time-series, IoT sensors, logs, activity feeds. Writes outnumber reads 3:1+. No complex joins. Used by Netflix, Apple, Instagram at billion-user scale.
Analytics and aggregations over billions of rows. Columnar layout gives 10–100x compression. Ideal for OLAP dashboards. Not for OLTP transactions: columnar stores are a poor fit for point reads and single-row updates.
Cache, session store, rate limiting, leaderboards, pub/sub. Sub-millisecond latency because data lives in RAM. Data size limited to available memory.
In system design interviews, always state WHY you're choosing a storage engine. "I'd use Cassandra because this is a write-heavy time-series workload with 1M writes/sec; the LSM tree's sequential write path and wide partitions are ideal here. We lose complex joins, but this access pattern doesn't need them."
These questions test deeper understanding, not memorisation. Think about the mechanism before clicking.