Day 16

Consensus: Raft & Paxos

Understand how distributed systems agree on a single value in the presence of failures — quorums, leader election, log replication, and the read-latency optimization that separates theory from production.

Exercise 1 · 🟢 Easy · 15 min
Calculating Quorum for 5, 7, and 9 Node Clusters
Your infrastructure team is planning Raft-based clusters of different sizes for three services: a configuration store (5 nodes), a metadata registry (7 nodes), and a distributed lock service (9 nodes). Each cluster is deployed across 3 availability zones. You need to calculate the quorum majority, the maximum tolerable failures, and evaluate whether odd vs even node counts matter before the hardware is ordered.

Tasks

  • Calculate the quorum (majority) for 5, 7, and 9 node clusters using the formula floor(N/2) + 1. Show your work for each.
  • State the maximum number of simultaneous node failures each cluster can tolerate while still making progress (electing a leader and committing entries).
  • Explain why Raft clusters are almost always deployed with an odd number of nodes — compare a 6-node cluster with a 5-node cluster when exactly 3 nodes fail. Does the sixth node buy any additional fault tolerance?
  • For the 9-node cluster spread across 3 AZs (3 nodes/AZ), what is the worst-case AZ failure scenario that the cluster can survive, and what is the one that causes it to halt?
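The quorum arithmetic in these tasks can be sketched as a small helper (hypothetical function names, illustrative only):

```python
def quorum(n: int) -> int:
    """Majority needed to elect a leader or commit an entry: floor(N/2) + 1."""
    return n // 2 + 1

def max_failures(n: int) -> int:
    """Simultaneous failures tolerated while a quorum still remains: N - quorum(N)."""
    return n - quorum(n)

for n in (5, 6, 7, 9):
    print(f"N={n}: quorum={quorum(n)}, tolerates {max_failures(n)} failures")
```

Note that N=5 and N=6 both tolerate only 2 failures — the even-sized cluster pays for an extra node without gaining any fault tolerance, which is the heart of the odd-vs-even question above.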
Your Notes
Exercise 2 · 🔴 Medium · 30 min
5-Node Raft Cluster Loses 3 Nodes Simultaneously
A production 5-node Raft cluster (nodes A, B, C, D, E — where A is the current leader) experiences a cascading failure during a rolling restart gone wrong: nodes C, D, and E all crash within 200ms of each other. Nodes A and B remain up. The cluster's election timeout is 150–300ms (randomized). The system serves 2,000 write operations/sec and the application clients have a 5-second timeout. You must explain exactly what happens and what the operators must do.

Tasks

  • Walk through the exact sequence of events: What does Node A (the current leader) do when it can no longer reach C, D, or E? What does Node B do? Does a new leader get elected?
  • Explain what happens to the 2,000 write operations/sec in flight: are they accepted, queued, or rejected? What error do application clients receive, and after how many seconds?
  • Describe the "split brain" risk in this scenario — can Node A falsely believe it is still the leader and serve stale reads after losing quorum? How does Raft's heartbeat mechanism prevent or allow this?
  • Design the recovery procedure: once C, D, E are restarted, walk through how the cluster re-forms quorum. Does Node A automatically resume leadership, or does a new election occur? What log entries might need to be rolled back?
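As a starting point for the walkthrough, here is a minimal sketch (hypothetical names, not a real Raft implementation) of the commit check Node A performs after the crash — it shows why writes stall the moment quorum is lost:

```python
CLUSTER = {"A", "B", "C", "D", "E"}
QUORUM = len(CLUSTER) // 2 + 1          # 3 of 5

def can_commit(leader: str, reachable_followers: set[str]) -> bool:
    """A leader counts itself plus every follower that acks AppendEntries;
    an entry commits only when acks reach a majority."""
    acks = {leader} | (reachable_followers & CLUSTER)
    return len(acks) >= QUORUM

# Before the failure: A reaches B, C, D, E -> commits succeed.
print(can_commit("A", {"B", "C", "D", "E"}))   # True
# After C, D, E crash: A reaches only B -> 2 acks < quorum of 3.
print(can_commit("A", {"B"}))                  # False: writes cannot commit
```

A keeps accepting entries into its local log, but none of them can commit, so clients eventually hit their 5-second timeout.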
Your Notes
Exercise 3 · 🔴 Medium · 35 min
Distributed Lock Service Using Raft for a Job Scheduler
A batch job scheduler needs a distributed lock service to ensure that only one worker node runs each job at a time across a 200-node worker fleet. Jobs run for 10 seconds to 30 minutes. The lock service must be highly available (no single point of failure) and must automatically release locks if the lock-holder worker crashes (no lock leaking forever). You decide to build the lock service on top of a 5-node Raft cluster.

Tasks

  • Design the lock data model stored in the Raft-replicated state machine: what fields does each lock entry need (lock_id, holder_worker_id, granted_at, lease_expires_at, term) and why is "term" necessary?
  • Describe the lock acquisition protocol: how does a worker acquire a lock via the Raft leader, and what happens if two workers simultaneously request the same lock (race condition handling at the leader)?
  • Design the lease-based lock expiry: if a worker crashes while holding a lock with a 60-second lease, when is the lock released? How does a new worker know the lock is available without polling the Raft cluster every second?
  • Explain the "fencing token" problem: if a worker holds a lock, the lock expires, a new worker acquires it, and then the original worker wakes up and tries to use the lock — how does the term/token field prevent the zombie worker from corrupting shared state?
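The fencing-token question above can be sketched concretely. This is an illustrative model, not a prescribed design — field and class names are assumptions for the exercise:

```python
from dataclasses import dataclass

@dataclass
class LockEntry:
    """One entry in the Raft-replicated lock table (fields from the task list)."""
    lock_id: str
    holder_worker_id: str
    granted_at: float
    lease_expires_at: float
    term: int                 # monotonically increasing fencing token

class SharedResource:
    """Downstream storage that rejects writes carrying a stale fencing token."""
    def __init__(self) -> None:
        self.highest_term_seen = 0

    def write(self, term: int) -> bool:
        if term < self.highest_term_seen:
            return False      # zombie holder with an expired lock: fenced off
        self.highest_term_seen = term
        return True

resource = SharedResource()
old = LockEntry("job-42", "worker-1", granted_at=0.0, lease_expires_at=60.0, term=7)
new = LockEntry("job-42", "worker-2", granted_at=65.0, lease_expires_at=125.0, term=8)
print(resource.write(new.term))   # True: current holder writes
print(resource.write(old.term))   # False: the woken-up zombie is rejected
```

The key design choice is that the *resource*, not the lock service, enforces the token — the zombie worker may genuinely believe it still holds the lock, so the last line of defense must sit where the shared state lives.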
Your Notes
Exercise 4 · 🔥 Hard · 55 min
Leader Lease Optimization to Reduce Read Latency in Raft
A metadata service backed by a 5-node Raft cluster in us-east-1 serves 80,000 read requests/sec and 2,000 write requests/sec. All reads are routed to the leader to guarantee linearizability. The leader's p99 read latency is 18ms because every read requires a round-trip to confirm leadership (a "ReadIndex" heartbeat to a majority of followers before serving the read). The team wants to reduce p99 reads to under 5ms without sacrificing linearizability guarantees.

Tasks

  • Explain why a naive "just serve reads from the leader's local state" approach breaks linearizability — describe the specific scenario where a deposed leader serves stale reads during the brief window before it learns it has been replaced.
  • Describe the "leader lease" optimization: how does a leader use the fact that no new leader can be elected until at least one election timeout has passed to serve reads from local state safely, without a round-trip to followers?
  • Identify the clock synchronization assumption that makes leader leases safe and explain what happens if the leader's clock drifts faster than followers' clocks by more than the lease duration — which databases use bounded clock drift to make this safe (hint: TiKV, CockroachDB)?
  • Design the lease renewal protocol: how does the leader extend its lease before it expires (before the election timeout), and what happens to reads if the leader fails to extend in time — does it stop serving reads, or does it risk serving stale data?
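A minimal sketch of the lease check, assuming a bounded clock-drift ratio (the numbers and names here are illustrative, chosen to match the 150 ms election timeout in Exercise 2):

```python
ELECTION_TIMEOUT = 0.150      # seconds: minimum time before a rival can win an election
MAX_CLOCK_DRIFT = 0.10        # assumed bound: any two clocks diverge by at most 10%

class LeaderLease:
    def __init__(self) -> None:
        self.lease_start = 0.0

    def renew(self, heartbeat_sent_at: float) -> None:
        """Measure the lease from when the heartbeat round *began*, not when
        majority acks arrived — the conservative end of the window."""
        self.lease_start = heartbeat_sent_at

    def can_serve_local_read(self, now: float) -> bool:
        # Shrink the window by the drift bound so a fast local clock cannot
        # stretch the lease past the moment a rival's election timeout fires.
        safe_window = ELECTION_TIMEOUT * (1 - MAX_CLOCK_DRIFT)
        return now - self.lease_start < safe_window

lease = LeaderLease()
lease.renew(heartbeat_sent_at=10.0)
print(lease.can_serve_local_read(10.05))   # True: within lease, local read is safe
print(lease.can_serve_local_read(10.20))   # False: expired, fall back to ReadIndex
```

When the lease lapses, a safe leader stops serving local reads and degrades to the ReadIndex round-trip rather than risk staleness — which is exactly the trade-off the last task asks you to reason through.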
Your Notes