Storage Engines: B-Tree vs LSM
B-tree read/write characteristics, LSM write path, compaction strategies, and when to choose each storage engine.
4 Exercises
12 Concept Checks
~90 min total
System Design
B-Tree vs LSM Write Characteristics
A time-series metrics database ingests 500,000 writes per second. PostgreSQL (a B-tree engine) saturates disk I/O at 100,000 writes/sec — each write requires a random disk seek to update a B-tree node in place. You need a storage engine optimized for write-heavy workloads.
Write Path Comparison
B-Tree: Write → find leaf (random seek) → in-place update
LSM: Write → MemTable (RAM) → sequential WAL → flush SSTable
Concept Check — 3 questions
Q1. B-tree updates require random disk writes. LSM writes are sequential. Why does this distinction matter for traditional HDDs?
A. There is no meaningful difference — modern OS disk caches make random and sequential access equivalent
B. HDD random seek takes ~10ms; sequential write is ~0.1ms — a 100× throughput difference that LSM exploits by converting all writes to sequential appends
C. B-tree is always faster for writes because the tree structure is more compact
D. The distinction only matters for network-attached storage, not local disks
Q2. LSM "write amplification" refers to what phenomenon?
A. Writing duplicate rows to multiple replica nodes simultaneously
B. Transactions being logged twice — once to WAL and once to the data file
C. Data being written multiple times: to WAL, MemTable, L0 SSTable, then rewritten during each compaction level merge
D. Write errors being broadcast and amplified across all nodes in the cluster
Q3. For a workload requiring 500,000 writes/sec (time-series metrics), which storage engine is more appropriate?
A. B-tree (PostgreSQL) — add more indexes to improve write throughput
B. LSM (RocksDB/Cassandra) — sequential writes to MemTable then sequential flush; designed for write-heavy workloads
C. B-tree with more indexes and a larger buffer pool
D. In-memory only — no disk persistence needed for metrics
HDD random seek = 5–10ms, sequential = 0.1ms. SSDs narrow the gap but LSM still wins at extreme write rates. LSM write path: (1) append to WAL for durability, (2) insert into in-memory MemTable, (3) when MemTable is full, flush to disk as an immutable sorted SSTable. All disk writes are sequential — no random seeks ever on the write path.
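A minimal sketch of this three-step write path in Python — the names (`LSMWriter`, `wal.log`) are invented for illustration, JSON-lines files stand in for real binary formats, and a dict stands in for a sorted MemTable; real engines such as RocksDB add group commit, background flush threads, and compaction:

```python
import json
import os
import tempfile

class LSMWriter:
    """Toy LSM write path: WAL append -> MemTable insert -> SSTable flush."""

    def __init__(self, data_dir, memtable_limit=4):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit
        self.memtable = {}                      # in-RAM map; sorted only at flush
        self.sstable_count = 0
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")

    def put(self, key, value):
        # (1) durability: sequential append to the WAL -- no random seek
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        # (2) in-memory insert into the MemTable
        self.memtable[key] = value
        # (3) when full, flush as an immutable sorted SSTable (also sequential)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        path = os.path.join(self.data_dir, f"sstable_{self.sstable_count:04d}.jsonl")
        with open(path, "w") as f:
            for k in sorted(self.memtable):     # SSTable = sorted immutable run
                f.write(json.dumps({"k": k, "v": self.memtable[k]}) + "\n")
        self.sstable_count += 1
        self.memtable.clear()

data_dir = tempfile.mkdtemp()
writer = LSMWriter(data_dir, memtable_limit=4)
for i in range(8):
    writer.put(f"metric:{i}", i)
print(writer.sstable_count)   # 2 -- two MemTable fills, two sequential flushes
```

Note that nothing in `put` ever seeks: both the WAL and the SSTable are written front to back, which is exactly the property that lets LSM engines saturate sequential disk bandwidth.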
Open Design Challenge
Design the WAL (Write-Ahead Log) for an LSM engine. What gets logged, in what format, and how does recovery work after a crash that lost the MemTable?
If B-tree write performance is limited by random I/O, can you use SSDs to bring B-tree performance close to LSM? Discuss at what point SSDs eliminate the LSM advantage.
Design a hybrid approach: use LSM for the write path but maintain a B-tree index for fast point reads. What consistency guarantees can you provide?
LSM Read Path and Compaction
RocksDB stores user event data. After 6 months, the LSM tree has accumulated 7 levels with hundreds of SSTables. A point read for user_id=12345 must check the MemTable, then each SSTable level from newest to oldest until the key is found. Read latency grows over time as more SSTables accumulate — this is read amplification.
LSM Read Path
Read: user_id=12345
→
MemTable (check first)
→ miss →
L0 SSTables (multiple, overlapping)
→ miss →
L1 → L2 → L3... (sorted, non-overlapping per level)
Bloom filter skips most SSTables
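The read path above as a toy model — plain dicts stand in for SSTable files, and a Python set plays the role of a per-SSTable Bloom filter (idealized here: no false positives); real engines binary-search sorted files on disk:

```python
def lsm_get(key, memtable, levels, bloom):
    """Point read, newest data first: MemTable, then each level L0, L1, ..."""
    if key in memtable:                      # 1. freshest data lives in RAM
        return memtable[key]
    for level in levels:                     # 2. levels ordered newest to oldest
        for name, table in level:
            if key not in bloom[name]:       # "definitely not here" -> skip I/O
                continue
            if key in table:                 # real engines binary-search a file
                return table[key]
    return None                              # key absent everywhere

memtable = {"user:9": "fresh"}
levels = [
    [("L0-a", {"user:12345": "v3"})],                      # L0: newest on disk
    [("L1-a", {"user:12345": "v1", "user:7": "x"})],       # older version below
]
bloom = {name: set(table) for level in levels for name, table in level}
print(lsm_get("user:12345", memtable, levels, bloom))      # v3 -- newest wins
```

Because the scan stops at the first hit, the newest version of a key always shadows older versions in deeper levels — the same rule compaction later applies when it merges duplicates away.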
Concept Check — 3 questions
Q1. A Bloom filter in an LSM tree is used for what purpose during reads?
A. Compressing SSTable data to reduce disk usage
B. Probabilistically determining that a key does NOT exist in an SSTable, avoiding unnecessary disk reads
C. Sorting keys within an SSTable for binary search
D. Writing WAL entries in the correct order
Q2. LSM compaction merges SSTables to reduce their count. Its primary cost is:
A. Increased memory usage — compacted SSTables must fit in RAM
B. Slower reads during compaction — all reads are blocked
C. Additional write I/O and CPU — compaction reads all input SSTables and rewrites merged output, adding 10–30× write amplification
D. Exclusive locks on all read operations during the compaction window
Q3. For a read-heavy workload (1M reads/sec, 10K writes/sec), which storage engine is more appropriate?
A. B-tree — O(log n) point reads with no multi-level SSTable lookup; better suited when reads dominate the workload
B. LSM — better read performance due to SSTable compression
C. Append-only log — fastest for all workloads
D. Hash table — O(1) reads always beat O(log n)
Bloom filters answer "definitely not here" or "probably here" — false positives are possible, false negatives are impossible. With a Bloom filter per SSTable, a read typically touches only 1–2 SSTables even across 7 levels. Compaction write amplification: in leveled compaction each level is 10× the previous, and a byte is rewritten roughly 10× each time it merges down a level, so total write amplification across several levels lands in the 10–30× range — the price of keeping the SSTable count low for fast reads.
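The memory cost of those per-SSTable filters comes from the standard Bloom filter sizing formulas — general textbook math, not tied to any particular engine:

```python
import math

def bloom_sizing(n_keys, target_fpr):
    """Textbook Bloom filter sizing.

    bits:   m = -n * ln(p) / (ln 2)^2
    hashes: k = (m / n) * ln 2   (optimal hash-function count)
    """
    m_bits = math.ceil(-n_keys * math.log(target_fpr) / math.log(2) ** 2)
    k_hashes = round(m_bits / n_keys * math.log(2))
    return m_bits, k_hashes

m, k = bloom_sizing(1_000_000, 0.01)
print(m / 1_000_000)   # ~9.59 bits per key at a 1% false-positive rate
print(k)               # 7 hash functions
print(m / 8 / 2**20)   # ~1.14 MiB of filter per million-key SSTable
```

About a megabyte of RAM per million keys is cheap insurance against a disk read per SSTable, which is why essentially every LSM engine enables Bloom filters by default.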
Open Design Challenge
Design the bloom filter sizing for an SSTable containing 1 million keys. At 1% false positive rate, how many bits of memory does the bloom filter require? (Hint: ~10 bits per key at 1% FPR)
Compare size-tiered vs leveled compaction: which has lower write amplification, which has lower read amplification? When would you choose each?
A key is updated 1,000 times over 6 months. Before compaction runs, how many versions of the key exist across SSTables? After full compaction, how many remain?
SSTables and Columnar Storage
An analytics team runs aggregation queries on 1 billion rows: SELECT avg(revenue) FROM orders WHERE date > '2024-01-01'. A row-store reads all columns for every qualifying row — loading id, name, email, date, revenue, status, address, etc. A column-store reads only the revenue and date columns, skipping all others. The I/O difference is enormous.
Row Store vs Column Store I/O
Row store: read all columns per row → [id, name, email, date, revenue, status...] = high I/O
vs
Column store: read only [date, revenue] columns → 10-100× less I/O
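The I/O gap in the comparison above is simple arithmetic — illustrative numbers (row count, column count, average value size), not a benchmark:

```python
# 1 billion rows, 100 columns, ~10 bytes per stored value on average (assumed).
rows, total_cols, avg_bytes = 1_000_000_000, 100, 10
query_cols = 2   # the query only needs [date, revenue]

row_store_io = rows * total_cols * avg_bytes   # row store reads every column
col_store_io = rows * query_cols * avg_bytes   # column store reads only two

print(row_store_io / 1e9)              # 1000.0 -- ~1 TB scanned by the row store
print(col_store_io / 1e9)              # 20.0   -- ~20 GB scanned by the column store
print(row_store_io // col_store_io)    # 50 -- 50x less I/O before any compression
```

The ratio is just `total_cols / query_cols`, which is why the advantage grows with table width — and columnar compression then multiplies it further.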
Concept Check — 3 questions
Q1. ClickHouse and BigQuery use columnar storage. The main advantage for analytics queries is:
A. Faster row inserts — columnar stores have lower write overhead
B. Only the required columns are read from disk; irrelevant columns are skipped entirely — 10–100× I/O reduction for wide tables
C. Better performance for high-frequency transactional workloads like banking
D. Better join performance across multiple large tables
Q2. Columnar compression ratios are dramatically better than row-store compression because:
A. Row stores have inherently more storage overhead from metadata
B. Column stores use newer and more advanced compression algorithms
C. Values within a single column are homogeneous in type — an integer revenue column compresses 10–100× better than a row mixing strings, integers, timestamps, and booleans
D. Column stores do not need indexes, saving substantial space
Q3. A columnar database is NOT well-suited for:
A. Analytics aggregations over billions of rows
B. OLTP — high-frequency single-row reads and writes like banking transactions where row-level access patterns dominate
C. Reporting dashboards with aggregations over many columns
D. Data warehousing with infrequent bulk loads
Column stores shine for OLAP (online analytical processing) — aggregations, GROUP BY, analytics. Row stores shine for OLTP — single-row lookups, updates, inserts. In a 100-column table where your query touches 3 columns, a row store reads 100% of the data while a column store reads 3%. Run-length encoding on a sorted, low-cardinality column (e.g., status=active repeated 10,000 times) compresses to essentially nothing.
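A minimal run-length encoder makes the compression claim concrete — a sketch only; real column stores like ClickHouse layer RLE with dictionary and general-purpose codecs:

```python
def rle_encode(values):
    """Run-length encode a column into [value, count] pairs.

    Works best on sorted or low-cardinality columns, where long runs
    of identical values collapse into a single pair.
    """
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

# A status column with 10,500 values collapses to just 2 runs:
column = ["active"] * 10_000 + ["inactive"] * 500
print(rle_encode(column))   # [['active', 10000], ['inactive', 500]]
```

The same column stored row-wise would be interleaved with ids, emails, and timestamps, destroying the runs — which is the structural reason columnar layouts compress so much better.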
Open Design Challenge
Design a lambda architecture: row store (PostgreSQL) for OLTP writes + column store (ClickHouse) for analytics. How do you keep them in sync, and what is the acceptable lag?
A column store uses zone maps (min/max per block) to skip entire blocks during query execution. Sketch how this works for the query WHERE revenue > 1000000 on a column sorted by revenue.
Parquet (columnar file format) is used in data lakes. How does Parquet's row group structure (e.g., 128MB groups of rows, columns stored together within each group) balance column-store efficiency with I/O locality?
LSM Tree Tuning in Production
Cassandra stores IoT sensor data at 200,000 events/sec. Reads: 50,000 point lookups/sec. After 2 months, SSTable count grew to 800. Read latency p99 jumped from 2ms to 45ms. Compaction is not keeping up with the write rate. You need to diagnose and fix the SSTable accumulation problem.
SSTable Accumulation Problem
200K writes/sec
→
800 SSTables accumulated
→ read must check all →
p99 latency: 45ms (was 2ms)
fix →
aggressive compaction → fewer SSTables → fast reads
Concept Check — 3 questions
Q1. SSTable accumulation causing slow reads can be fixed by:
A. Adding more RAM to increase the MemTable size, delaying SSTable creation
B. Increasing compaction throughput — more aggressive compaction reduces SSTable count, directly reducing read amplification
C. Disabling Bloom filters to speed up SSTable scans
D. Migrating to a B-tree engine which does not have SSTable accumulation
Q2. Comparing leveled compaction vs size-tiered compaction: leveled compaction specifically achieves:
A. Higher write amplification only, with no read benefit
B. Identical performance to size-tiered for all workloads
C. Lower read amplification — each level has non-overlapping SSTables, so a point read checks at most one SSTable per level
D. Faster compaction speed with less CPU usage than size-tiered
Q3. Tombstones in LSM trees are delete markers written as new entries. The danger of tombstone accumulation is:
A. Tombstones cause permanent data loss by overwriting live data
B. Tombstones occupy space and slow reads until compaction removes them; excessive tombstones cause severe latency spikes and GC pressure
C. Too many tombstones cause the database process to crash immediately
D. Tombstones block all new write operations until compaction completes
Leveled compaction: L1 = 10 MB, L2 = 100 MB, L3 = 1 GB, and so on. Within each level, SSTables cover non-overlapping key ranges → a point read checks at most 1 SSTable per level. Size-tiered compaction groups similar-sized SSTables → faster writes, but a read may have to check many overlapping SSTables. In Cassandra, gc_grace_seconds defaults to 10 days — a tombstone must survive that window (so repairs can propagate the delete to all replicas) before compaction is allowed to purge it. Time-series data (sensor events) with TTL-based expiry is the classic tombstone accumulation scenario.
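The level geometry and the write-amplification estimate can be checked with quick arithmetic — a rough model with a fanout of 10 assumed, as in the sizes above:

```python
# Leveled-compaction geometry: each level is `fanout` times the previous one.
fanout, l1_mb = 10, 10
level_sizes_mb = [l1_mb * fanout ** i for i in range(4)]
print(level_sizes_mb)   # [10, 100, 1000, 10000] -- L1 through L4 in MB

# Each merge one level down rewrites roughly `fanout` bytes of existing data
# for every incoming byte, so a crude bound on total write amplification is
# fanout times the number of downward merges:
merges = len(level_sizes_mb) - 1
print(fanout * merges)  # 30 -- consistent with the commonly quoted 10-30x range
```

The same arithmetic explains the tuning lever: a larger fanout means fewer levels (less read amplification) but more data rewritten per merge, while a smaller fanout trades the other way.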
Open Design Challenge
Design the compaction throttling strategy: compaction consumes I/O that competes with user reads/writes. How do you set compaction throughput limits that keep SSTable count low without degrading user latency?
IoT sensor data has a natural TTL — data older than 90 days is never queried. Design a time-window compaction strategy that efficiently expires old SSTables without scanning all data.
Monitor LSM health in production: define the 5 key metrics you would track (hint: SSTable count per level, compaction lag, tombstone ratio, read/write amplification, bloom filter false positive rate) and set alert thresholds for each.