Storage Engines: B-Tree vs LSM
B-tree read/write characteristics, LSM write path, compaction strategies, and when to choose each storage engine.
4 Exercises
12 Concept Checks
~90 min total
System Design
B-Tree vs LSM Write Characteristics
A time-series metrics database ingests 500,000 writes per second. PostgreSQL (a B-tree engine) saturates disk I/O at 100,000 writes/sec — each write requires a random disk seek to update a B-tree node in place. You need a storage engine optimized for write-heavy workloads.
Write Path Comparison
B-Tree: Write → find leaf (random seek) → in-place update
LSM: Write → MemTable (RAM) → sequential WAL → flush SSTable
Concept Check — 3 questions
Q1. B-tree updates require random disk writes. LSM writes are sequential. Why does this distinction matter for traditional HDDs?
A. There is no meaningful difference — modern OS disk caches make random and sequential access equivalent
B. HDD random seek takes ~10ms; sequential write is ~0.1ms — a 100× throughput difference that LSM exploits by converting all writes to sequential appends
C. B-tree is always faster for writes because the tree structure is more compact
D. The distinction only matters for network-attached storage, not local disks
Q2. LSM "write amplification" refers to what phenomenon?
A. Writing duplicate rows to multiple replica nodes simultaneously
B. Transactions being logged twice — once to WAL and once to the data file
C. Data being written multiple times: to WAL, MemTable, L0 SSTable, then rewritten during each compaction level merge
D. Write errors being broadcast and amplified across all nodes in the cluster
Q3. For a workload requiring 500,000 writes/sec (time-series metrics), which storage engine is more appropriate?
A. B-tree (PostgreSQL) — add more indexes to improve write throughput
B. LSM (RocksDB/Cassandra) — sequential writes to MemTable then sequential flush; designed for write-heavy workloads
C. B-tree with more indexes and a larger buffer pool
D. In-memory only — no disk persistence needed for metrics
HDD random seek = 5–10ms, sequential = 0.1ms. SSDs narrow the gap but LSM still wins at extreme write rates. LSM write path: (1) append to WAL for durability, (2) insert into in-memory MemTable, (3) when MemTable is full, flush to disk as an immutable sorted SSTable. All disk writes are sequential — no random seeks ever on the write path.
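A minimal sketch of this three-step write path in Python — the names (`LSMWriter`, `wal.log`) are invented for illustration, JSON-lines files stand in for real binary formats, and a dict stands in for a sorted MemTable; real engines such as RocksDB add group commit, background flush threads, and compaction:

```python
import json
import os
import tempfile

class LSMWriter:
    """Toy LSM write path: WAL append -> MemTable insert -> SSTable flush."""

    def __init__(self, data_dir, memtable_limit=4):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit
        self.memtable = {}                      # in-RAM map; sorted only at flush
        self.sstable_count = 0
        self.wal = open(os.path.join(data_dir, "wal.log"), "a")

    def put(self, key, value):
        # (1) durability: sequential append to the WAL -- no random seek
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        # (2) in-memory insert into the MemTable
        self.memtable[key] = value
        # (3) when full, flush as an immutable sorted SSTable (also sequential)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        path = os.path.join(self.data_dir, f"sstable_{self.sstable_count:04d}.jsonl")
        with open(path, "w") as f:
            for k in sorted(self.memtable):     # SSTable = sorted immutable run
                f.write(json.dumps({"k": k, "v": self.memtable[k]}) + "\n")
        self.sstable_count += 1
        self.memtable.clear()

data_dir = tempfile.mkdtemp()
writer = LSMWriter(data_dir, memtable_limit=4)
for i in range(8):
    writer.put(f"metric:{i}", i)
print(writer.sstable_count)   # 2 -- two MemTable fills, two sequential flushes
```

Note that nothing in `put` ever seeks: both the WAL and the SSTable are written front to back, which is exactly the property that lets LSM engines saturate sequential disk bandwidth.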
Open Design Challenge
Design the WAL (Write-Ahead Log) for an LSM engine. What gets logged, in what format, and how does recovery work after a crash that lost the MemTable?
If B-tree write performance is limited by random I/O, can you use SSDs to bring B-tree performance close to LSM? Discuss at what point SSDs eliminate the LSM advantage.
Design a hybrid approach: use LSM for the write path but maintain a B-tree index for fast point reads. What consistency guarantees can you provide?
LSM Read Path and Compaction
RocksDB stores user event data. After 6 months, the LSM tree has accumulated 7 levels with hundreds of SSTables. A point read for user_id=12345 must check the MemTable, then each SSTable level from newest to oldest until the key is found. Read latency grows over time as more SSTables accumulate — this is read amplification.
LSM Read Path
Read: user_id=12345
→
MemTable (check first)
→ miss →
L0 SSTables (multiple, overlapping)
→ miss →
L1 → L2 → L3... (sorted, non-overlapping per level)
Bloom filter skips most SSTables
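The read path above as a toy model — plain dicts stand in for SSTable files, and a Python set plays the role of a per-SSTable Bloom filter (idealized here: no false positives); real engines binary-search sorted files on disk:

```python
def lsm_get(key, memtable, levels, bloom):
    """Point read, newest data first: MemTable, then each level L0, L1, ..."""
    if key in memtable:                      # 1. freshest data lives in RAM
        return memtable[key]
    for level in levels:                     # 2. levels ordered newest to oldest
        for name, table in level:
            if key not in bloom[name]:       # "definitely not here" -> skip I/O
                continue
            if key in table:                 # real engines binary-search a file
                return table[key]
    return None                              # key absent everywhere

memtable = {"user:9": "fresh"}
levels = [
    [("L0-a", {"user:12345": "v3"})],                      # L0: newest on disk
    [("L1-a", {"user:12345": "v1", "user:7": "x"})],       # older version below
]
bloom = {name: set(table) for level in levels for name, table in level}
print(lsm_get("user:12345", memtable, levels, bloom))      # v3 -- newest wins
```

Because the scan stops at the first hit, the newest version of a key always shadows older versions in deeper levels — the same rule compaction later applies when it merges duplicates away.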
Concept Check — 3 questions
Q1. A Bloom filter in an LSM tree is used for what purpose during reads?
A. Compressing SSTable data to reduce disk usage
B. Probabilistically determining that a key does NOT exist in an SSTable, avoiding unnecessary disk reads
C. Sorting keys within an SSTable for binary search
D. Writing WAL entries in the correct order
Q2. LSM compaction merges SSTables to reduce their count. Its primary cost is:
A. Increased memory usage — compacted SSTables must fit in RAM
B. Slower reads during compaction — all reads are blocked
C. Additional write I/O and CPU — compaction reads all input SSTables and rewrites merged output, adding 10–30× write amplification
D. Exclusive locks on all read operations during the compaction window
Q3. For a read-heavy workload (1M reads/sec, 10K writes/sec), which storage engine is more appropriate?
A. B-tree — O(log n) point reads with no multi-level SSTable lookup; better suited when reads dominate the workload
B. LSM — better read performance due to SSTable compression
C. Append-only log — fastest for all workloads
D. Hash table — O(1) reads always beat O(log n)
Bloom filters answer "definitely not here" or "probably here" — false positives are possible, false negatives are impossible. With a Bloom filter per SSTable, a read typically touches only 1–2 SSTables even across 7 levels. Compaction write amplification: in leveled compaction each level is 10× the previous, and a byte is rewritten roughly 10× each time it merges down a level, so total write amplification across several levels lands in the 10–30× range — the price of keeping the SSTable count low for fast reads.
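The memory cost of those per-SSTable filters comes from the standard Bloom filter sizing formulas — general textbook math, not tied to any particular engine:

```python
import math

def bloom_sizing(n_keys, target_fpr):
    """Textbook Bloom filter sizing.

    bits:   m = -n * ln(p) / (ln 2)^2
    hashes: k = (m / n) * ln 2   (optimal hash-function count)
    """
    m_bits = math.ceil(-n_keys * math.log(target_fpr) / math.log(2) ** 2)
    k_hashes = round(m_bits / n_keys * math.log(2))
    return m_bits, k_hashes

m, k = bloom_sizing(1_000_000, 0.01)
print(m / 1_000_000)   # ~9.59 bits per key at a 1% false-positive rate
print(k)               # 7 hash functions
print(m / 8 / 2**20)   # ~1.14 MiB of filter per million-key SSTable
```

About a megabyte of RAM per million keys is cheap insurance against a disk read per SSTable, which is why essentially every LSM engine enables Bloom filters by default.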
Open Design Challenge
Design the bloom filter sizing for an SSTable containing 1 million keys. At 1% false positive rate, how many bits of memory does the bloom filter require? (Hint: ~10 bits per key at 1% FPR)
Compare size-tiered vs leveled compaction: which has lower write amplification, which has lower read amplification? When would you choose each?
A key is updated 1,000 times over 6 months. Before compaction runs, how many versions of the key exist across SSTables? After full compaction, how many remain?
SSTables and Columnar Storage
An analytics team runs aggregation queries on 1 billion rows: SELECT avg(revenue) FROM orders WHERE date > '2024-01-01'. A row-store reads all columns for every qualifying row — loading id, name, email, date, revenue, status, address, etc. A column-store reads only the revenue and date columns, skipping all others. The I/O difference is enormous.
Row Store vs Column Store I/O
Row store: read all columns per row → [id, name, email, date, revenue, status...] = high I/O
vs
Column store: read only [date, revenue] columns → 10-100× less I/O
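The I/O gap in the comparison above is simple arithmetic — illustrative numbers (row count, column count, average value size), not a benchmark:

```python
# 1 billion rows, 100 columns, ~10 bytes per stored value on average (assumed).
rows, total_cols, avg_bytes = 1_000_000_000, 100, 10
query_cols = 2   # the query only needs [date, revenue]

row_store_io = rows * total_cols * avg_bytes   # row store reads every column
col_store_io = rows * query_cols * avg_bytes   # column store reads only two

print(row_store_io / 1e9)              # 1000.0 -- ~1 TB scanned by the row store
print(col_store_io / 1e9)              # 20.0   -- ~20 GB scanned by the column store
print(row_store_io // col_store_io)    # 50 -- 50x less I/O before any compression
```

The ratio is just `total_cols / query_cols`, which is why the advantage grows with table width — and columnar compression then multiplies it further.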
Concept Check — 3 questions
Q1. ClickHouse and BigQuery use columnar storage. The main advantage for analytics queries is:
A. Faster row inserts — columnar stores have lower write overhead
B. Only the required columns are read from disk; irrelevant columns are skipped entirely — 10–100× I/O reduction for wide tables
C. Better performance for high-frequency transactional workloads like banking
D. Better join performance across multiple large tables
Q2. Columnar compression ratios are dramatically better than row-store compression because:
A. Row stores have inherently more storage overhead from metadata
B. Column stores use newer and more advanced compression algorithms
C. Values within a single column are homogeneous in type — an integer revenue column compresses 10–100× better than a row mixing strings, integers, timestamps, and booleans
D. Column stores do not need indexes, saving substantial space
Q3. A columnar database is NOT well-suited for:
A. Analytics aggregations over billions of rows
B. OLTP — high-frequency single-row reads and writes like banking transactions where row-level access patterns dominate
C. Reporting dashboards with aggregations over many columns
D. Data warehousing with infrequent bulk loads
Column stores shine for OLAP (online analytical processing) — aggregations, GROUP BY, analytics. Row stores shine for OLTP — single-row lookups, updates, inserts. In a 100-column table where your query touches 3 columns, a row store reads 100% of the data while a column store reads 3%. Run-length encoding on a sorted, low-cardinality column (e.g., status=active repeated 10,000 times) compresses to essentially nothing.
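A minimal run-length encoder makes the compression claim concrete — a sketch only; real column stores like ClickHouse layer RLE with dictionary and general-purpose codecs:

```python
def rle_encode(values):
    """Run-length encode a column into [value, count] pairs.

    Works best on sorted or low-cardinality columns, where long runs
    of identical values collapse into a single pair.
    """
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

# A status column with 10,500 values collapses to just 2 runs:
column = ["active"] * 10_000 + ["inactive"] * 500
print(rle_encode(column))   # [['active', 10000], ['inactive', 500]]
```

The same column stored row-wise would be interleaved with ids, emails, and timestamps, destroying the runs — which is the structural reason columnar layouts compress so much better.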
Open Design Challenge
Design a lambda architecture: row store (PostgreSQL) for OLTP writes + column store (ClickHouse) for analytics. How do you keep them in sync, and what is the acceptable lag?
A column store uses zone maps (min/max per block) to skip entire blocks during query execution. Sketch how this works for the query WHERE revenue > 1000000 on a column sorted by revenue.
Parquet (columnar file format) is used in data lakes. How does Parquet's row group structure (e.g., 128MB groups of rows, columns stored together within each group) balance column-store efficiency with I/O locality?
LSM Tree Tuning in Production
Cassandra stores IoT sensor data at 200,000 events/sec. Reads: 50,000 point lookups/sec. After 2 months, SSTable count grew to 800. Read latency p99 jumped from 2ms to 45ms. Compaction is not keeping up with the write rate. You need to diagnose and fix the SSTable accumulation problem.
SSTable Accumulation Problem
200K writes/sec
→
800 SSTables accumulated
→ read must check all →
p99 latency: 45ms (was 2ms)
fix →
aggressive compaction → fewer SSTables → fast reads
Concept Check — 3 questions
Q1. SSTable accumulation causing slow reads can be fixed by:
A. Adding more RAM to increase the MemTable size, delaying SSTable creation
B. Increasing compaction throughput — more aggressive compaction reduces SSTable count, directly reducing read amplification
C. Disabling Bloom filters to speed up SSTable scans
D. Migrating to a B-tree engine which does not have SSTable accumulation
Q2. Comparing leveled compaction vs size-tiered compaction: leveled compaction specifically achieves:
A. Higher write amplification only, with no read benefit
B. Identical performance to size-tiered for all workloads
C. Lower read amplification — each level has non-overlapping SSTables, so a point read checks at most one SSTable per level
D. Faster compaction speed with less CPU usage than size-tiered
Q3. Tombstones in LSM trees are delete markers written as new entries. The danger of tombstone accumulation is:
A. Tombstones cause permanent data loss by overwriting live data
B. Tombstones occupy space and slow reads until compaction removes them; excessive tombstones cause severe latency spikes and GC pressure
C. Too many tombstones cause the database process to crash immediately
D. Tombstones block all new write operations until compaction completes
Leveled compaction: L1 = 10 MB, L2 = 100 MB, L3 = 1 GB, and so on. Within each level, SSTables cover non-overlapping key ranges → a point read checks at most 1 SSTable per level. Size-tiered compaction groups similar-sized SSTables → faster writes, but a read may have to check many overlapping SSTables. In Cassandra, gc_grace_seconds defaults to 10 days — a tombstone must survive that window (so repairs can propagate the delete to all replicas) before compaction is allowed to purge it. Time-series data (sensor events) with TTL-based expiry is the classic tombstone accumulation scenario.
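The level geometry and the write-amplification estimate can be checked with quick arithmetic — a rough model with a fanout of 10 assumed, as in the sizes above:

```python
# Leveled-compaction geometry: each level is `fanout` times the previous one.
fanout, l1_mb = 10, 10
level_sizes_mb = [l1_mb * fanout ** i for i in range(4)]
print(level_sizes_mb)   # [10, 100, 1000, 10000] -- L1 through L4 in MB

# Each merge one level down rewrites roughly `fanout` bytes of existing data
# for every incoming byte, so a crude bound on total write amplification is
# fanout times the number of downward merges:
merges = len(level_sizes_mb) - 1
print(fanout * merges)  # 30 -- consistent with the commonly quoted 10-30x range
```

The same arithmetic explains the tuning lever: a larger fanout means fewer levels (less read amplification) but more data rewritten per merge, while a smaller fanout trades the other way.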
Open Design Challenge
Design the compaction throttling strategy: compaction consumes I/O that competes with user reads/writes. How do you set compaction throughput limits that keep SSTable count low without degrading user latency?
IoT sensor data has a natural TTL — data older than 90 days is never queried. Design a time-window compaction strategy that efficiently expires old SSTables without scanning all data.
Monitor LSM health in production: define the 5 key metrics you would track (hint: SSTable count per level, compaction lag, tombstone ratio, read/write amplification, bloom filter false positive rate) and set alert thresholds for each.