Replication Strategies & Consistency

Primary-replica replication, replication lag, read-your-writes consistency, and multi-primary conflict resolution.

4 Exercises
12 Concept Checks
~90 min total
System Design
Exercise 1 🟢 Easy ⏱ 15 min
Replication Modes
An e-commerce platform has one primary and two read replicas with asynchronous replication. A customer updates their shipping address. Seconds later they refresh their profile page, and it shows the old address: the read was served from a replica with 200 ms of replication lag. The customer panics, thinking their update was lost.
Async Replication Lag
✏️ Write: new address
→ immediate ACK →
Primary DB
→ async →
Replica 1 (lag: 200ms)
Replica 2 (lag: 350ms)
Concept Check — 3 questions
Q1. Async replication's core trade-off: writes are acknowledged quickly, but what is the downside?
A. Write latency is higher because the primary must format data for the replica before acknowledging
B. Potential data loss on primary failure and stale reads from replicas that haven't caught up
C. Complex distributed transactions are required for every write
D. Each replica requires 2× the storage of the primary
Q2. Synchronous replication: the primary waits for at least 1 replica to confirm before acknowledging the write. What is the main downside?
A. Data loss on primary failure — the replica might not have the latest data
B. Stale reads are still possible — the replica can lag behind the primary
C. Higher write latency — every write must wait for a cross-network round-trip to the replica before returning to the client
D. No fault tolerance — if the synchronous replica fails, writes must stop
Q3. Semi-synchronous replication (MySQL feature): primary waits for 1 of N replicas to acknowledge. What does this provide?
A. Zero write latency — the primary never waits for any replica
B. Durability guarantee with moderate latency — 1 confirmed copy ensures no data loss even if the primary crashes immediately after
C. Zero replication lag — all replicas are always current
D. Automatic conflict resolution for concurrent writes
Async replication: primary ACKs immediately. If primary crashes before the write propagates to replicas, that write is permanently lost. Semi-sync (MySQL): wait for 1 replica, then fall back to async if the replica doesn't respond within a timeout (default 10s). This is the practical compromise: 1 confirmed copy for durability, minimal latency impact for normal operation.
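The acknowledgement logic above can be sketched as a small model. This is an illustrative simplification, not MySQL's actual implementation: the primary acknowledges the client once the fastest replica confirms, but falls back to async if no replica responds within the timeout.

```python
# Minimal sketch of semi-synchronous acknowledgement (illustrative model).
# The primary waits for the first replica ack, up to a timeout; after
# that it falls back to async and acknowledges without a confirmed copy.

def ack_write(replica_ack_times_ms, timeout_ms=10_000):
    """Return (mode, client_ack_latency_ms) for one write.

    replica_ack_times_ms: per-replica time until that replica would
    confirm the write. The write is durable off-primary as soon as
    the fastest replica confirms.
    """
    fastest = min(replica_ack_times_ms) if replica_ack_times_ms else None
    if fastest is not None and fastest <= timeout_ms:
        # Semi-sync: 1 confirmed copy exists before the client is ACKed.
        return "semi-sync", fastest
    # No replica confirmed in time: fall back to async (durability risk).
    return "async-fallback", timeout_ms

print(ack_write([120, 350]))        # fastest replica confirms in 120 ms
print(ack_write([15_000], 10_000))  # replica too slow, async fallback
```

Note the trade-off the model makes visible: the client-visible write latency is exactly the fastest replica's round-trip, which is why semi-sync costs far less than full sync across all replicas.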
Open Design Challenge
1. Design the failover process when the primary crashes in an async replication setup. Which replica becomes the new primary? How do you detect split-brain and prevent data divergence?
2. For a payment system, should replication be sync or async? Justify your choice considering the consequences of data loss vs latency increase.
3. Design monitoring for replication lag: what metric do you track, what is the alert threshold, and what action do you take when lag exceeds 1 second?
Exercise 2 🟢 Easy ⏱ 20 min
Read-Your-Writes Consistency
Social network: a user posts a tweet (write hits the primary). Seconds later they view their own profile (read served from a replica 500ms behind). The tweet is missing. The user refreshes several times — it eventually appears. Users file bug reports: "my tweet disappeared." This is a read-your-writes violation.
Read-Your-Writes Violation
👤 User posts tweet
→ write →
Primary (has tweet)
→ user reads profile →
Replica (500ms behind, no tweet)
User sees missing tweet 😡
Concept Check — 3 questions
Q1. To implement read-your-writes consistency, the simplest approach is:
A. Disable all read replicas — every read must hit the primary
B. Sticky reads to primary: route reads to the primary for a short window (e.g., 5 seconds) after the same user makes a write
C. Switch all replication to synchronous mode globally
D. Store the write result in the client's local browser cache
Q2. The session-token approach for read-your-writes: store the last-write timestamp in the user's session. Reads then check:
A. Only the user's IP address to route to the nearest replica
B. The global database version number for the current schema
C. Whether the replica's replication position is ahead of the timestamp in the session; if not, route to the primary instead
D. The user's last login timestamp to determine session freshness
Q3. Monotonic reads consistency guarantees what property for a user's reads?
A. A user never sees data older than what they previously read — no going backwards in time between successive requests
B. All reads are served from the primary to guarantee freshness
C. All writes are totally ordered so reads always reflect causal order
D. All replicas always return the same value at any point in time
Sticky-primary approach: track last_write_at in the user session. For the next N seconds (or until replica catches up), route that user's reads to primary. Session-token approach: store the replication position (binlog offset / LSN) at write time. When reading, pick a replica that is at or ahead of that position. Monotonic reads: use sticky routing to the same replica within a session — if the replica doesn't go backwards, reads are monotonic.
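The session-token approach above can be sketched in a few lines. The names (`Replica`, `route_read`) and the integer LSN are illustrative assumptions, not a real driver API; in practice the position would be a binlog coordinate or a PostgreSQL LSN.

```python
# Sketch of session-token read routing: the write path records the
# primary's replication position in the session; the read path only
# uses a replica that has applied at least that position.

from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    applied_lsn: int  # last replication position this replica has applied

def route_read(session, replicas):
    """Pick a replica that satisfies read-your-writes, else the primary."""
    min_lsn = session.get("last_write_lsn", 0)
    for r in replicas:
        if r.applied_lsn >= min_lsn:
            return r.name
    return "primary"  # nobody has caught up to this user's last write

replicas = [Replica("replica-1", applied_lsn=95),
            Replica("replica-2", applied_lsn=102)]
print(route_read({"last_write_lsn": 100}, replicas))  # replica-2 is caught up
print(route_read({"last_write_lsn": 200}, replicas))  # all replicas stale
```

A user who hasn't written recently has no `last_write_lsn`, so any replica qualifies; only recent writers pay the cost of waiting for a caught-up replica or hitting the primary.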
Open Design Challenge
1. A user posts a tweet from their phone and then opens a laptop browser. Both devices should see the tweet immediately. How do you implement cross-device read-your-writes consistency?
2. Design a read routing layer that checks replica replication lag before routing. If lag exceeds 200ms, route to primary. Sketch the architecture and the lag check mechanism.
3. Read-your-writes increases load on the primary for N seconds after every write. At 1M writes/hour with a 5-second primary-sticky window, estimate the additional primary read load vs baseline.
Exercise 3 🟡 Medium ⏱ 25 min
Multi-Primary (Active-Active) Conflicts
A globally distributed app has primary databases in US and EU. Both accept writes. A user concurrently changes their email from US (to alice@us.com) and EU (to alice@eu.com) at the same millisecond. Both primaries accept the write. During synchronization, US and EU have conflicting email values for the same user — and there's no clear winner.
Multi-Primary Write Conflict
US Primary: email=alice@us.com
←→ conflict ←→
EU Primary: email=alice@eu.com
→ need resolution →
LWW / CRDT / Application logic
Concept Check — 3 questions
Q1. Last-Write-Wins (LWW) conflict resolution uses what to determine which write wins?
A. The longest value — alice@us-company-domain.com beats alice@eu.com
B. Wall clock timestamp — the write with the higher timestamp wins
C. Alphabetical ordering — alice@eu.com comes before alice@us.com
D. The first write received by any node in the cluster wins
Q2. LWW's fundamental problem in a distributed system is:
A. Timestamps take too much storage overhead in each record
B. Comparing timestamps is too computationally expensive at scale
C. Clock skew — clocks on different machines may differ by milliseconds, so the "later" timestamp may actually belong to the write that happened earlier in real causal time
D. LWW requires a central coordinator that becomes a bottleneck
Q3. Application-level conflict resolution (CouchDB, DynamoDB): the application receives both conflicting versions and decides which to keep. This approach is most appropriate when:
A. Data is simple key-value pairs where the latest value always wins
B. Domain-specific merge logic is needed — e.g., merging shopping carts, taking max of counters, or using CRDT data types
C. It is always the best approach — applications always know best
D. Only for append-only data where merging is trivial
LWW silently discards writes — the "losing" email is simply gone. This causes data loss without any error. Clock skew of even 1ms between US and EU nodes can flip the winner incorrectly. Hybrid Logical Clocks (HLC) partially solve clock skew by combining physical and logical time. For shopping carts: merge by union (take all items from both versions). For email: present conflict to user. CRDTs (Conflict-free Replicated Data Types) are data structures that always converge automatically (counters, sets, etc.).
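Two of the strategies above, sketched side by side as a minimal illustration (the record shapes and timestamps are hypothetical): LWW silently drops the losing write, while a union merge of shopping carts keeps data from both versions.

```python
def lww(a, b):
    """Last-Write-Wins: higher wall-clock timestamp wins. With clock
    skew, this can pick the causally earlier write and lose the other."""
    return a if a["ts"] >= b["ts"] else b

def merge_carts(a, b):
    """Shopping-cart merge by union: never loses an added item (though
    a removed item can resurface, the classic drawback of naive union)."""
    return sorted(set(a) | set(b))

# Concurrent email updates from the US/EU scenario above:
us = {"email": "alice@us.com", "ts": 1_700_000_000_001}
eu = {"email": "alice@eu.com", "ts": 1_700_000_000_002}
print(lww(us, eu)["email"])  # EU timestamp is 1 ms later, US write is gone
print(merge_carts(["book"], ["book", "pen"]))  # union keeps both items
```

Note that if the US node's clock had been 2 ms fast, `lww` would have returned the US value instead, with no error either way: that is exactly the clock-skew hazard from Q2.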
Open Design Challenge
1. Design a conflict resolution strategy for a shared document editor (Google Docs-style) where two users in different regions edit the same paragraph simultaneously. What data structure would you use?
2. Implement a version vector (vector clock) for the US/EU email conflict: what does the vector look like before and after the conflict? How does it help determine causality?
3. Some conflicts cannot be auto-resolved (e.g., two users book the last flight seat simultaneously). Design a "conflict surfacing" system that presents these conflicts to users and tracks resolution status.
Exercise 4 🔴 Hard ⏱ 30 min
Replication Topologies
A global fintech platform runs in 5 regions. Each region must survive independently. Currently: hub-and-spoke topology with a single primary in US and 4 replica regions. When the US region goes down, a 15-minute failover window causes a global service outage while a new primary is elected and DNS propagates. You need to eliminate this downtime.
Hub-and-Spoke vs Active-Active
Hub-Spoke: US Primary → EU Replica, APAC Replica, SA Replica (read-only)
Active-Active: US ↔ EU ↔ APAC ↔ SA (all accept writes)
Concept Check — 3 questions
Q1. Active-active multi-region replication reduces downtime compared to hub-and-spoke by:
A. Using faster hardware in each region to reduce failover duration
B. Each region accepts writes independently — if the US goes down, traffic simply routes to other primaries with no failover procedure needed
C. Using synchronous replication so all regions are always in sync
D. Having more replica nodes so election is faster
Q2. A daisy-chain replication topology (A → B → C → D) carries what specific risk?
A. Faster change propagation — each hop accelerates delivery to the next node
B. Less total replication lag because each node is closer to the next
C. Cumulative lag — each hop adds delay; D's lag equals the sum of A→B, B→C, and C→D lags combined
D. Automatic conflict resolution at each intermediary node
Q3. Galera Cluster implements synchronous multi-master replication using write-set certification. Certification-based conflict detection means:
A. Writes are digitally signed with SSL certificates before being applied
B. Before applying, each node validates the write-set has no key conflicts with concurrent in-flight transactions, rolling back conflicting transactions
C. Nodes vote by majority on which writes to accept, rejecting minority writes
D. Writes are applied in random order and reconciled afterward
Active-active eliminates failover but introduces write conflicts (Exercise 3). Hub-and-spoke eliminates conflicts but requires failover on primary failure. For fintech: strong consistency often requires a single primary per data partition — use geography-based sharding (US users → US primary, EU users → EU primary) for active-active without cross-region conflicts on the same record. Galera certification: the write-set contains modified row keys; if another node committed a write to the same key concurrently, the later one is rolled back.
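The geography-based sharding described above can be sketched as a tiny write router. The region names and the `HOME_REGION` mapping are illustrative assumptions: each user has exactly one home region, so no two primaries ever own the same record.

```python
# Sketch of geo-sharded write routing: US and EU both accept writes,
# but every user's rows have a single home-region primary, so the
# same record is never written in two regions concurrently.

HOME_REGION = {"alice": "eu", "bob": "us"}  # e.g., assigned at signup

def primary_for(user_id, default_region="us"):
    """Route a write to the user's home-region primary. A traveling
    user still writes to their home region (extra latency, but no
    conflicts) unless an explicit re-homing migration moves the shard."""
    return HOME_REGION.get(user_id, default_region) + "-primary"

print(primary_for("alice"))  # eu-primary, even while visiting the US
print(primary_for("bob"))    # us-primary
```

The design choice here is the same one the paragraph makes: conflicts are avoided by construction rather than resolved after the fact, at the cost of cross-region write latency for users outside their home region.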
Open Design Challenge
1. Design geographic sharding to enable active-active without write conflicts: US users write to the US primary, EU users write to the EU primary. How do you handle a user who travels from US to EU?
2. Design the automated failover process for hub-and-spoke: when should the system auto-promote a replica to primary vs require manual intervention? What signals trigger auto-promotion?
3. A split-brain scenario: US primary and EU replica both think they are the primary (network partition). Design a fencing mechanism (STONITH — Shoot The Other Node In The Head) that ensures only one is active.