NoSQL Data Modeling

Partition key design, hot partition avoidance, denormalization, and access pattern-driven modeling.

4 Exercises
12 Concept Checks
~90 min total
System Design
Exercise 1 🟡 Easy ⏱ 15 min
Partition Key Design in DynamoDB
An IoT platform stores sensor readings. Each sensor sends 1 event/second. The DynamoDB table uses sensor_id as the partition key. Sensor #42 monitors a critical server and generates 10× the traffic of any other sensor — 10 writes/sec on a single partition while the others get 1 write/sec each.
Hot Partition Problem
Sensor #1 → 1 write/sec
Sensor #2 → 1 write/sec
Sensor #42 → 10 writes/sec (HOT)
DynamoDB
PK = sensor_id
Partition A (sensor #1) — normal
Partition B (sensor #2) — normal
Partition C (sensor #42) ⚡ THROTTLED
Concept Check — 3 questions
Q1. A hot partition in DynamoDB causes?
A. All partitions to slow down uniformly across the table
B. Throttling on the hot partition — writes and reads to that partition are rejected when the throughput limit is exceeded
C. Permanent data loss on the affected partition
D. Replication failure across all availability zones
Q2. To prevent hot partitions for high-write sensors, you can?
A. Use sensor_id as the sort key instead of the partition key
B. Increase the DynamoDB table's provisioned throughput capacity
C. Shard the partition key with a random suffix: sensor_id#1, sensor_id#2, etc. — spread writes, then aggregate at read time
D. Switch to synchronous writes to slow down the producer
Q3. DynamoDB's partition key + sort key composite key allows?
A. Cross-partition transactions for multi-item atomic consistency
B. Multiple items per partition key, sorted by sort key — enables range queries within a single partition
C. Full table scans without consuming read capacity units
D. Multi-table joins exactly like relational SQL
Hot partitions occur when one partition key receives disproportionate traffic. DynamoDB allocates throughput per partition — exceed the limit and requests get throttled (ProvisionedThroughputExceededException). Fix: append a random suffix 0-9 to sensor_id → 10 partitions share the write load → at read time, query all 10 suffixes and aggregate. Composite keys enable range queries: sensor_id#5 + created_at lets you fetch all events for a sensor in a time range efficiently.
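The suffix-sharding fix can be sketched in Python. A dict stands in for the DynamoDB table, and the names (write_reading, read_readings, NUM_SHARDS) are illustrative, not a real SDK API:

```python
import random
from collections import defaultdict

NUM_SHARDS = 10  # suffixes 0-9, as described above

# In-memory stand-in for a DynamoDB table: partition key -> list of items
table = defaultdict(list)

def write_reading(sensor_id: str, payload: dict) -> str:
    """Spread writes for one sensor across NUM_SHARDS partition keys."""
    shard_key = f"{sensor_id}#{random.randrange(NUM_SHARDS)}"
    table[shard_key].append(payload)
    return shard_key

def read_readings(sensor_id: str) -> list:
    """Scatter-gather read: query every suffix, then merge by timestamp."""
    items = []
    for suffix in range(NUM_SHARDS):
        items.extend(table[f"{sensor_id}#{suffix}"])
    return sorted(items, key=lambda item: item["ts"])

# 100 writes for the hot sensor now land on up to 10 partitions instead of one
for ts in range(100):
    write_reading("sensor-42", {"ts": ts, "value": 20.0 + ts % 5})

readings = read_readings("sensor-42")
print(len(readings), readings[0]["ts"])  # all 100 come back, time-ordered
```

The trade-off is visible in read_readings: one logical query becomes NUM_SHARDS physical queries, which is why suffix sharding only pays off when write hotness, not read latency, is the bottleneck.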
Open Design Challenge
1. Design the DynamoDB table schema for IoT sensor readings. Include partition key, sort key, and at least 3 attribute fields. Justify your key choices.
2. With 10 partition shards per sensor (sensor_id#0 through sensor_id#9), how do you query "get last 100 readings for sensor #42"?
3. Design a GSI (Global Secondary Index) to enable the query: "find all sensors that reported temperature > 90°C in the last hour."
Exercise 2 🟡 Easy ⏱ 20 min
Denormalization for Read Performance
A social network uses MongoDB. Normalized schema: posts collection (author_id reference) and a users collection. Fetching a feed requires: query posts → for each post, query users for author name and photo. This N+1 pattern hits the DB once per post. Denormalized: embed author name and photo directly in each post document — no join needed.
Normalized vs Denormalized Schema
❌ NORMALIZED (N+1 problem)
Post{author_id: "u42"}
↓ lookup per post
User{name, photo} — extra DB query
✅ DENORMALIZED (single read)
Post{author_id, author_name, author_photo}
↓ one document fetch
All data embedded — no lookup needed
Concept Check — 3 questions
Q1. Denormalization in MongoDB embeds author data in the post. The main risk is?
A. Post documents become too large to store in MongoDB (exceeds 16MB limit)
B. Stale embedded data — if the user changes their name or photo, all embedded posts have outdated information
C. Writes become significantly slower than the normalized schema
D. You can no longer query posts by author_id field
Q2. The "write duplication" cost of denormalization is acceptable when?
A. Always — denormalization is always the right choice in NoSQL databases
B. Never — duplicate data always causes more problems than it solves
C. When read frequency greatly exceeds write frequency — denormalization trades write overhead for dramatic read speed gains
D. Only for small datasets that fit entirely in memory
Q3. For a user's post feed, which MongoDB schema is better?
A. Always keep posts and users in completely separate collections for cleanliness
B. Embed frequently-read fields (author name, photo) in the post document — avoids lookups for the common read path
C. Switch to a relational SQL database to enable proper joins
D. Store posts as XML strings for maximum query flexibility
The golden rule of NoSQL modeling: model for how you query, not how you store. Feeds are read 100× more than users change their profiles. Embedding author_name and author_photo is worth the staleness risk. To handle updates: when a user changes their display name, run an async background job to update embedded copies — eventual consistency is fine for display names. MongoDB's 16MB document limit is almost never a concern for this pattern.
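The embed-then-backfill pattern can be sketched in Python, with plain dicts and lists standing in for MongoDB collections. Field names and the backfill_author helper are illustrative assumptions, not a fixed schema:

```python
# In-memory stand-ins for MongoDB collections; in real MongoDB the
# backfill would be an update_many on {"author_id": user_id}.
users = {"u42": {"name": "Ada", "photo": "https://cdn.example/ada-v1.jpg"}}

posts = [
    {"post_id": i,
     "author_id": "u42",
     "author_name": users["u42"]["name"],    # embedded copy
     "author_photo": users["u42"]["photo"],  # embedded copy
     "text": f"post {i}"}
    for i in range(3)
]

def render_feed(posts):
    # Single pass, no per-post user lookup: the N+1 problem is gone.
    return [(p["author_name"], p["text"]) for p in posts]

def backfill_author(user_id, users, posts):
    """Async-style background job: refresh stale embedded copies
    after a profile change (eventual consistency)."""
    user = users[user_id]
    for p in posts:
        if p["author_id"] == user_id:
            p["author_name"] = user["name"]
            p["author_photo"] = user["photo"]

users["u42"]["photo"] = "https://cdn.example/ada-v2.jpg"  # profile change
backfill_author("u42", users, posts)  # run out-of-band, not on the write path
print(posts[0]["author_photo"])
```

The key point is that render_feed never touches the users collection; the staleness window between the profile change and the backfill is the price paid for that.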
Open Design Challenge
1. Design the full MongoDB document schema for a social media post. Include embedded author info, media attachments, and a like count. What fields would you embed vs reference?
2. A user changes their profile photo. You have 50 million posts with their embedded photo URL. How do you update all of them without downtime?
3. Design the feed query: "get 20 most recent posts from users I follow, sorted by created_at." What indexes are needed?
Exercise 3 🔴 Medium ⏱ 25 min
Access Pattern-Driven Modeling
An e-commerce platform uses Cassandra. Three distinct access patterns: 1) Get all orders for a user, 2) Get a specific order by ID, 3) Get orders within a date range. Cassandra requires knowing access patterns upfront — it cannot efficiently execute ad-hoc queries on non-key columns. Each pattern needs its own dedicated table.
One Table Per Access Pattern (Cassandra)
orders_by_user (partition key: user_id, clustering key: created_at)
orders_by_id (partition key: order_id)
orders_by_date (partition key: date, clustering key: created_at)
Cassandra — data duplicated per table; fast single-partition reads for each pattern
Concept Check — 3 questions
Q1. Creating 3 separate tables for 3 access patterns (the Cassandra way) is called?
A. Normalization — systematically reducing data redundancy across tables
B. Denormalized query tables — one table per access pattern, data duplicated to serve each query shape efficiently
C. Vertical partitioning — splitting columns into separate tables
D. Table inheritance — sharing data through a parent-child table hierarchy
Q2. In Cassandra, running SELECT * FROM orders WHERE date > '2024-01-01' requires?
A. Creating a secondary index on the date column
B. Using the ALLOW FILTERING directive — expensive full cluster scan
C. A partition key design that includes date — Cassandra cannot efficiently filter on non-partition-key columns without scanning all partitions
D. Enabling a SQL full-text search extension within Cassandra
Q3. A wide row (many columns per partition) in Cassandra is used for?
A. Time-series data — each column is a measurement at a specific timestamp, making sequential time-range reads extremely efficient
B. User profile data where each attribute is a separate column
C. Transactional order data requiring full ACID guarantees
D. Analytics aggregation tables meant to replace a data warehouse
Cassandra's mantra: model for queries, not for relations. Before writing any schema, list every query the application will run. Then create one table per query shape. Storage is cheap; slow reads are not. Wide rows shine for time-series: partition key = sensor_id, clustering key = timestamp → one sequential disk read returns hundreds of time-ordered measurements. Never use ALLOW FILTERING in production — it scans every node in the cluster.
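The fan-out write that keeps one table per query shape in sync can be sketched in Python, with dicts standing in for the three Cassandra tables. The write_order helper is illustrative; in Cassandra the three inserts would typically go in a logged BATCH:

```python
from datetime import date

# Three in-memory "tables", one per access pattern
# (denormalized query tables, dicts standing in for partitions).
orders_by_user = {}  # user_id -> {created_at: order}
orders_by_id = {}    # order_id -> order
orders_by_date = {}  # date -> {created_at: order}

def write_order(order: dict) -> None:
    """Fan-out write: every order is copied into all three tables so
    each read pattern becomes a single-partition lookup."""
    orders_by_user.setdefault(order["user_id"], {})[order["created_at"]] = order
    orders_by_id[order["order_id"]] = order
    orders_by_date.setdefault(order["date"], {})[order["created_at"]] = order

order = {"order_id": "o1", "user_id": "u7",
         "date": date(2024, 1, 2), "created_at": 1704153600,
         "status": "PENDING"}
write_order(order)

# Each access pattern hits exactly one table, no cross-partition filtering:
print(orders_by_user["u7"][1704153600]["order_id"])  # pattern 1: by user
print(orders_by_id["o1"]["status"])                  # pattern 2: by order id
print(len(orders_by_date[date(2024, 1, 2)]))         # pattern 3: by date
```

Storage is paid three times per order; in exchange, none of the three reads ever needs the equivalent of ALLOW FILTERING.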
Open Design Challenge
1. Write the CQL CREATE TABLE statements for all three order tables. Include complete PRIMARY KEY definitions with partition and clustering keys.
2. When an order status changes from PENDING to SHIPPED, you must update all 3 tables consistently. Design the write path — how do you keep them in sync?
3. A new requirement: "get orders totaling over $500 for a user." Cassandra can't filter on non-key columns efficiently. What are your options?
Exercise 4 🔥 Hard ⏱ 30 min
TTL and Time-Series Patterns
An analytics platform stores clickstream events. Storage grows unbounded — 10 billion new events/month. Events older than 90 days are never queried. Running DELETE jobs against hundreds of billions of accumulated rows is prohibitively expensive. The solution: Cassandra's TTL (Time To Live), attached to each insert so rows expire automatically with no manual cleanup.
Cassandra TTL Lifecycle
Event written: INSERT ... USING TTL 7776000
↓ 90 days pass: TTL counter reaches 0
↓ Tombstone written: row hidden from reads
↓ Compaction runs: tombstone permanently removed
Concept Check — 3 questions
Q1. Cassandra TTL (Time To Live) automatically expires rows by?
A. Immediately deleting the row from disk the moment the TTL value reaches zero
B. Writing a tombstone marker at expiry time — the row is hidden from reads but physically removed only during compaction
C. Automatically archiving the row to a cold storage tier
D. Overwriting the row's values with nulls to indicate expiry
Q2. Time-series data in Cassandra uses (date, user_id) as the partition key. The date component prevents?
A. Hot partitions from high-traffic users generating too many writes
B. Cross-datacenter replication lag from heavy write volume
C. Unbounded partition growth — without date, one user's partition accumulates events forever; with date, each day creates a new bounded partition
D. Read amplification caused by compaction overhead
Q3. For 10B events/month with 90-day retention, total storage at 200 bytes/event is approximately?
A. 6 TB total storage
B. ~60 TB total — 10B events × 3 months × 200 bytes/event, multiplied by replication factor and overhead
C. 600 TB total storage
D. 6 PB total storage
TTL in Cassandra: INSERT INTO events (...) VALUES (...) USING TTL 7776000 (90 days in seconds = 90 × 86400). Tombstones are lightweight — data appears deleted to readers immediately, but disk space is freed only at compaction. Watch out: tombstone accumulation slows reads if compaction runs too infrequently. For time-series storage math: 10B events/month × 3 months = 30B events × 200 bytes = 6 TB raw. Replication factor 3 triples that to 18 TB, and compaction headroom, indexes, and tombstone overhead push provisioned cluster storage to roughly 60 TB total.
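The back-of-envelope storage math can be checked in a few lines of Python. The overhead beyond replication is a rough planning assumption, not a measured value:

```python
# Capacity estimate from the text; only raw and replicated figures
# are exact arithmetic, the rest is planning headroom.
events_per_month = 10_000_000_000
retention_months = 3          # 90-day TTL
bytes_per_event = 200
replication_factor = 3

raw_bytes = events_per_month * retention_months * bytes_per_event
replicated_bytes = raw_bytes * replication_factor

TB = 10**12
print(raw_bytes / TB)         # 6.0  -> 6 TB raw
print(replicated_bytes / TB)  # 18.0 -> 18 TB after RF=3
# Compaction headroom, indexes, and tombstones push provisioned
# capacity toward the ~60 TB figure quoted above.
```

Also worth verifying once: 90 × 86400 = 7,776,000 seconds, the TTL value used in the INSERT.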
Open Design Challenge
1. Write the Cassandra CREATE TABLE for clickstream events with a default TTL and a (date, user_id) partition key. What clustering columns would you include?
2. With TTL, tombstones accumulate until compaction runs. Which Cassandra compaction strategy (STCS, LCS, TWCS) is best for time-series data and why?
3. Design the data lifecycle: raw events in Cassandra (90-day TTL) → aggregated daily metrics in a data warehouse. What triggers the aggregation job?