Database Sharding & Partitioning

Horizontal partitioning strategies, consistent hashing, cross-shard queries, resharding, and distributed ID generation.

4 Exercises
12 Concept Checks
~90 min total
System Design
Exercise 1 🟢 Easy ⏱ 15 min
Shard Key Selection
A social platform shards its users table by user_id % 4 (4 shards). 80% of users are from the US — their user IDs are numerically lower (earlier registrations). Shards 0 and 1 handle 80% of the load; shards 2 and 3 handle 20%. The overloaded shards become a bottleneck causing p99 latency spikes, while two shards sit nearly idle.
Shard Imbalance (user_id % 4)
Shard 0: US users (40% load) 🔥
Shard 1: US users (40% load) 🔥
Shard 2: EU users (15% load)
Shard 3: APAC users (5% load)
Concept Check — 3 questions
Q1. The problem with user_id % N sharding when user IDs are assigned sequentially is:
A. Individual shards grow too large — the modulo operation doesn't limit shard size
B. Hot spots — sequential IDs paired with non-uniform access patterns create uneven distribution across shards
C. Cross-shard joins become completely impossible
D. Replication breaks down when the shard count is odd
Q2. Consistent hashing solves uneven distribution by:
A. Generating random IDs for all users to ensure uniform modulo distribution
B. Always using exactly 256 shards to guarantee even distribution
C. Mapping keys to a hash ring where each shard covers a range — adding or removing a shard only remaps approximately 1/N of all keys
D. Routing all traffic through a geographic load balancer
Q3. A good shard key should have which property above all others?
A. Alphabetical ordering so shards can be accessed in sorted order
B. High cardinality — enough unique values to distribute data and load evenly across all shards
C. Sequential integer values to make range queries easy within a shard
D. A foreign key relationship to every other table to minimize cross-shard joins
The hot spot problem: modulo is deterministic — the same ID always goes to the same shard. If early IDs are more active (older users are more engaged), the low-numbered shards are always hot. Consistent hashing maps keys to a ring using a hash function — distribution depends on the hash output (uniform), not the raw key value. Adding a shard: only its neighbors' data moves. Removing a shard: only that shard's data redistributes. Average remapping = 1/N of total data.
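The ring mechanics above can be sketched in a few lines of Python. This is a minimal illustration, not a production router — the shard names and the use of MD5 as the ring hash are assumptions, and real systems assign many virtual nodes per shard rather than one point each:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: one point per shard (no vnodes).
    Each shard owns the arc of hash space ending at its point."""

    def __init__(self, shards):
        self.ring = sorted((self._hash(s), s) for s in shards)
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        # Hash output is uniform regardless of the raw key value —
        # this is what breaks the sequential-user_id hot spot.
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16)

    def lookup(self, key):
        # First ring point clockwise of the key's hash (wrapping around).
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["shard0", "shard1", "shard2", "shard3"])
before = {uid: ring.lookup(uid) for uid in range(100_000)}

# Add a fifth shard: only keys on the new shard's arc should remap.
bigger = HashRing(["shard0", "shard1", "shard2", "shard3", "shard4"])
moved = [uid for uid in before if bigger.lookup(uid) != before[uid]]
print(f"keys remapped after adding a shard: {len(moved) / len(before):.0%}")
```

Note that every remapped key moves *to* the new shard — existing shard-to-shard assignments are untouched, which is exactly what `user_id % N` cannot guarantee when N changes. With only one point per shard, the remapped fraction is 1/N only on average; vnodes (Exercise 3) tighten that up.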
Open Design Challenge
1
Design the shard key for a messaging system where users send messages to each other. If you shard by sender_id, what query patterns become cross-shard? If you shard by conversation_id, what changes?
2
A celebrity user (10M followers) on a social platform creates a hot shard. Design a "celebrity routing" exception: how do you handle accounts whose activity level exceeds single-shard capacity?
3
Design the shard routing layer: given a user_id, how does the application determine which shard DB to connect to? Sketch the lookup table vs hash ring approaches.
Exercise 2 🟡 Medium ⏱ 20 min
Cross-Shard Queries
Orders are sharded by user_id. A business analytics query needs: SELECT sum(revenue) FROM orders WHERE date = '2024-01-01'. Since date is not the shard key, this query must fan out to all shards to get partial results, then aggregate them. At 32 shards, this is 32 parallel database calls for every analytics query.
Scatter-Gather Query Pattern
Analytics query (no shard key in WHERE)
→ fan out →
Shard 0: partial sum
Shard 1: partial sum
... Shard 31
→ merge →
Aggregator: total sum
Concept Check — 3 questions
Q1. A scatter-gather query fans out to all 32 shards in parallel. If each shard takes 100ms, what is the total query latency?
A. ~100ms — queries run in parallel; latency equals the slowest shard, not the sum
B. ~3,200ms — all shard queries execute sequentially
C. ~32ms — parallelism reduces latency by a factor of N
D. ~1,600ms — half the shards run in parallel, then the rest
Q2. To avoid scatter-gather queries for business analytics on sharded OLTP data, the recommended pattern is:
A. Create a secondary index on the date column across all shards
B. Increase the shard count to 256 to reduce per-shard data size
C. Maintain a separate OLAP store (data warehouse like Redshift/BigQuery) fed by ETL from all shards; run analytics there
D. Normalize the schema to eliminate the need for cross-shard aggregation
Q3. Cross-shard transactions (distributed transactions across multiple shard databases) should generally be avoided because:
A. They consume excessive disk space on each participating shard
B. They require 2-phase commit (2PC), which is slow, operationally complex, and can block indefinitely if the coordinator fails mid-transaction
C. They increase memory usage beyond what typical databases support
D. Distributed transactions violate SQL standards and are unsupported
Scatter-gather latency = max(shard latencies), not the sum — parallel execution is the key. However, each shard query still consumes a DB connection and some CPU, so 32 parallel queries are 32× more resource-intensive than one. OLAP separation: OLTP shards are optimized for point reads/writes; analytics queries need full scans, which wreck OLTP latency. Run ETL periodically (e.g. hourly) into a data warehouse where full-table scans are expected and optimized. 2PC: Phase 1 = prepare (all shards lock resources). Phase 2 = commit. If the coordinator dies after Phase 1, shards hold their locks indefinitely — blocking all conflicting transactions.
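The latency-equals-max property can be demonstrated directly. The sketch below simulates the fan-out with threads and sleeps — the per-shard latencies and the fake partial sums are invented for illustration, not taken from any real workload:

```python
import concurrent.futures
import time

def query_shard(shard_id):
    """Stand-in for one shard's partial aggregate
    (SELECT sum(revenue) FROM orders WHERE date = ...)."""
    latency = 0.05 + 0.005 * (shard_id % 4)  # simulated; slowest shard ~65ms
    time.sleep(latency)
    return shard_id * 100                     # fake partial sum

start = time.monotonic()
# Scatter: one worker per shard so all 32 queries run concurrently.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    partials = list(pool.map(query_shard, range(32)))
elapsed = time.monotonic() - start

total = sum(partials)  # gather: merge partial sums at the aggregator
print(f"total={total}, wall time ~{elapsed * 1000:.0f}ms (max shard, not 32x)")
```

Running it shows wall time near the slowest shard's latency rather than the ~1.8s the 32 sleeps add up to — while still paying for 32 connections' worth of work, which is the resource-cost point made above.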
Open Design Challenge
1
Design the ETL pipeline from 32 OLTP shards to Redshift. What is the acceptable lag (near-real-time vs daily batch)? How do you handle schema changes in the OLTP layer?
2
An order transfer moves money from user A (shard 0) to user B (shard 5). Design this without 2PC: use the SAGA pattern with compensating transactions. Sketch the steps and failure scenarios.
3
Design a secondary index for querying orders by date across all shards without scatter-gather. Hint: maintain a separate "date index" table on a non-sharded global DB that maps (date, order_id, shard_id).
Exercise 3 🟡 Medium ⏱ 25 min
Resharding Without Downtime
A startup began with 4 shards. After rapid growth, each shard holds 500GB of data at 80% CPU utilization. They need to double to 8 shards to restore headroom. The challenge: during migration, a write could land on the old shard location (user_id % 4) while the data is being moved to the new location (user_id % 8), causing data divergence.
Resharding: 4 → 8 Shards
4-shard ring (80% CPU)
→ split each →
8-shard ring (migration in progress)
→ traffic switch →
8-shard ring (40% CPU)
Concept Check — 3 questions
Q1. To reshard from 4 to 8 shards without taking downtime, the safest approach is:
A. Stop all writes, migrate all data to 8 shards, update routing config, restart writes
B. Double-write to both old and new shard locations during migration, migrate historical data in the background, flip reads to the new shards, then stop writes to the old shards
C. Export all 4 shards to CSV files and reimport into 8 shards at scheduled maintenance
D. Increase individual shard storage limits instead of adding more shards
Q2. Virtual nodes (vnodes) in consistent hashing make resharding easier by:
A. Automating schema migration scripts across all physical shards
B. Compressing shard data to make it faster to transfer between nodes
C. Assigning many small virtual shards per physical node — when adding a node, only the relevant vnodes move; remapping is granular and gradual
D. Enabling atomic cross-shard transactions during the migration window
Q3. During resharding, a read arrives for a key that has already been migrated to the new shard but hasn't been deleted from the old shard yet. The read-migration pattern handles this by:
A. Returning HTTP 404 during the migration window — reads are disabled
B. First checking the new shard; if the key is not found, falling back to the old shard — then lazily migrating the data and removing it from the old shard
C. Locking all reads globally during migration to prevent inconsistency
D. Serving only from the cache layer during migration to hide inconsistency
Double-write migration: Phase 1 — start writing new writes to both the old and new shard. Phase 2 — a background job migrates historical data (batched, rate-limited so it doesn't overload production). Phase 3 — flip reads to the new shard. Phase 4 — stop writing to the old shard. Phase 5 — decommission the old shard. Each phase can be rolled back independently. Vnodes (Cassandra historically defaulted to 256 vnodes per node): adding one physical node moves a small share of vnodes from each existing node — gradual, controlled redistribution.
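The phases above can be sketched as a routing layer with two feature flags. This is a toy model — shard stores are plain dicts and the phase flags are instance attributes, where a real deployment would use actual databases and a config service:

```python
class ReshardingRouter:
    """Sketch of the double-write / read-fallback phases for 4 -> 8 shards."""

    def __init__(self):
        self.old = [dict() for _ in range(4)]   # routed by user_id % 4
        self.new = [dict() for _ in range(8)]   # routed by user_id % 8
        self.double_write = False               # phase 1 flag
        self.read_from_new = False              # phase 3 flag

    def write(self, user_id, row):
        if not self.read_from_new:              # old is source of truth until phase 3
            self.old[user_id % 4][user_id] = row
        if self.double_write or self.read_from_new:
            self.new[user_id % 8][user_id] = row

    def read(self, user_id):
        if self.read_from_new:
            # Read-migration fallback: try the new location first,
            # fall back to the old copy if the backfill hasn't reached it.
            row = self.new[user_id % 8].get(user_id)
            return row if row is not None else self.old[user_id % 4].get(user_id)
        return self.old[user_id % 4].get(user_id)

    def migrate_backfill(self):
        """Phase 2: background copy of historical rows.
        setdefault() never clobbers a fresher double-written row."""
        for shard in self.old:
            for uid, row in shard.items():
                self.new[uid % 8].setdefault(uid, row)

router = ReshardingRouter()
router.write(42, "pre-migration row")   # phase 0: old shards only
router.double_write = True              # phase 1
router.write(43, "double-written row")
router.migrate_backfill()               # phase 2
router.read_from_new = True             # phase 3
print(router.read(42), "/", router.read(43))
```

Note the ordering guarantee the flags encode: reads only flip to the new shards (phase 3) after double-writing has started (phase 1) and the backfill has run (phase 2), so every key is reachable throughout — either directly or via the fallback.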
Open Design Challenge
1
Design the rate limiting for the background data migration job. The migration is copying 500GB × 4 shards = 2TB of data. At what MB/sec should it run to complete in 24 hours without saturating production disk I/O?
2
Design a migration progress dashboard: what metrics would you show to operators monitoring the resharding? Include per-shard migration %, write lag between old/new, and estimated completion time.
3
During double-write, a key is written to the old shard but the write to the new shard fails. Design the reconciliation mechanism to detect and fix this divergence before the final cutover.
Exercise 4 🔴 Hard ⏱ 30 min
Global ID Generation at Scale
A distributed system sharded across 32 nodes needs globally unique IDs for every record. Options: UUID v4 (128-bit random), DB auto-increment per shard (not globally unique), Snowflake ID (64-bit timestamp+machine+sequence), ULID (128-bit sortable random). Each trades off size, sortability, uniqueness guarantees, and database performance.
ID Generation Comparison
UUID v4: 128-bit random, not sortable
vs
Snowflake: 41-bit ms-timestamp + 10-bit machine + 12-bit seq = 64-bit sortable
vs
ULID: 48-bit time + 80-bit random, sortable base32
Concept Check — 3 questions
Q1. Why are random UUID v4 IDs problematic for B-tree database indexes?
A. 128-bit is too large — databases have a maximum index key size
B. Random insertion order causes B-tree page splits and cache fragmentation — sequential IDs produce much better index locality and fewer page splits
C. UUID v4 values are not globally unique — collisions occur at scale
D. UUID generation is too CPU-intensive for high-throughput inserts
Q2. A Snowflake ID embeds a millisecond timestamp in the top 41 bits. What does this mean for the generated IDs?
A. IDs expire and become invalid after a certain number of years
B. IDs are only valid for a maximum of 10 years before rollover
C. IDs are time-sortable, B-tree friendly, and allow approximate creation time to be decoded — but correctness depends on synchronized clocks and a unique machine ID registry
D. IDs must be assigned by a central server to ensure the timestamp is consistent
Q3. A Snowflake ID generator process restarts and the sequence counter resets to 0. What is the risk?
A. No risk — Snowflake IDs are globally unique by design regardless of restarts
B. If the process restarts within the same millisecond with the same machine ID, the sequence counter starts at 0 again — potentially generating a duplicate ID that was already assigned before the restart
C. All previously generated IDs become invalid and the system must regenerate them
D. ID generation pauses for exactly 1 second after restart to skip the risky millisecond
UUID v4 B-tree problem: every insert must go to a random position in the tree → frequent page splits, poor cache utilization, 50–100% write amplification vs sequential IDs. Snowflake IDs are monotonically increasing within a machine → sequential B-tree inserts, near-perfect cache hit rate for recent data. Restart fix: at startup, sleep until the next millisecond before generating any IDs (ensures sequence=0 at a new timestamp). Alternatively, persist last_used_ms and refuse to generate IDs with a timestamp ≤ last_used_ms.
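The 41+10+12 bit layout described above can be sketched as follows. This is a minimal single-process illustration — the epoch constant is Twitter's published custom epoch, and the restart mitigations discussed above (sleep to the next millisecond, or persist `last_used_ms`) are represented only by the backwards-clock guard:

```python
import threading
import time

EPOCH_MS = 1288834974657  # Twitter's custom Snowflake epoch; any fixed epoch works

class Snowflake:
    """Sketch of a 64-bit Snowflake generator:
    41-bit ms timestamp | 10-bit machine ID | 12-bit per-ms sequence."""

    def __init__(self, machine_id):
        assert 0 <= machine_id < 1024      # must fit in 10 bits
        self.machine_id = machine_id
        self.last_ms = -1
        self.seq = 0
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now < self.last_ms:
                # Clock skew guard — the persisted-last_used_ms variant
                # of the restart fix relies on the same refusal.
                raise RuntimeError("clock moved backwards; refusing to generate")
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF
                if self.seq == 0:          # 4096 IDs this ms: spin to the next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.seq = 0
            self.last_ms = now
            return ((now - EPOCH_MS) << 22) | (self.machine_id << 12) | self.seq

gen = Snowflake(machine_id=7)
a, b = gen.next_id(), gen.next_id()
print(a < b, (a >> 12) & 0x3FF)  # monotonic; machine ID decodes back out
```

Within one process the IDs are strictly increasing (timestamp advances, or the sequence does), which is what buys the sequential B-tree inserts — and the bit shifts make the timestamp and machine ID recoverable by masking, as Exercise 4's `snowflake_at()` challenge exploits.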
Open Design Challenge
1
Design the machine ID registry for a 32-node Snowflake cluster. How do nodes get assigned their 10-bit machine IDs? What happens when a node is decommissioned and a new node comes up — can it reuse the old machine ID?
2
Snowflake IDs embed creation time. This means you can do time-range queries (WHERE id BETWEEN snowflake_at('2024-01-01') AND snowflake_at('2024-12-31')) without a separate created_at column. Sketch the implementation of a snowflake_at() function.
3
Compare Snowflake vs ULID for a multi-tenant SaaS: which would you choose if (a) you need to sort records by creation time and (b) you don't want to expose a centralized machine ID registry? Justify your answer.