How distributed systems agree on a single value despite failures. The algorithms powering etcd, Consul, and CockroachDB.
In a 3-node cluster, if the leader loses network connectivity, the remaining two nodes may elect a new leader while the old one keeps accepting writes — split brain: two nodes each believe they are leader. Without consensus, different parts of your system end up with incompatible state.
A consensus algorithm requires a majority quorum (N/2 + 1) to make decisions. In a 5-node cluster: 3 nodes must agree. Even if 2 nodes fail or are partitioned, the remaining 3 can still make progress — but neither partition of 2 can.
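The quorum arithmetic can be sketched in a few lines (Python here and below, purely for illustration):

```python
def quorum(n: int) -> int:
    """Minimum number of nodes that must agree: floor(n / 2) + 1."""
    return n // 2 + 1

def faults_tolerated(n: int) -> int:
    """How many nodes can fail while a majority can still form."""
    return n - quorum(n)

# 3 nodes -> quorum 2, tolerates 1 failure
# 5 nodes -> quorum 3, tolerates 2 failures
# 7 nodes -> quorum 4, tolerates 3 failures
```

This is also why clusters use odd sizes: a 4-node cluster has a quorum of 3 and tolerates only 1 failure, the same fault tolerance as a 3-node cluster but with a larger quorum to assemble.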
Fischer, Lynch, and Paterson (1985) proved that in an asynchronous system where even one process can fail, there is no deterministic algorithm that always achieves consensus in finite time. Real systems work around this by using timeouts (partial synchrony), randomization (Raft election timeouts), or weaker guarantees. Raft and Paxos assume partially synchronous networks where messages eventually arrive.
Every node is in one of three roles:
- **Follower** — passive; responds to RPCs from the leader and candidates, never initiates requests.
- **Candidate** — a follower whose election timeout expired; it increments its term and requests votes.
- **Leader** — handles all client writes and replicates log entries to followers.
Time is divided into terms (monotonically increasing integers). Each term begins with an election. Terms serve as logical clocks — nodes reject messages from older terms.
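The "reject messages from older terms" rule can be sketched as a small guard that every RPC handler runs first (field and role names are illustrative, not from any particular implementation):

```python
class Node:
    def __init__(self):
        self.current_term = 0
        self.role = "follower"  # "follower" | "candidate" | "leader"

    def on_rpc(self, msg_term: int) -> bool:
        """Return True if the incoming message should be processed."""
        if msg_term < self.current_term:
            return False            # stale sender: reject outright
        if msg_term > self.current_term:
            self.current_term = msg_term
            self.role = "follower"  # saw a newer term: step down
        return True
```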
The leader sends heartbeats (AppendEntries with no entries) to all followers every 150ms. If a follower doesn't receive a heartbeat in its election timeout (150–300ms random), it starts an election.
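A follower's randomized election timer might be sketched like this (constants mirror the numbers above; assume a real implementation resets the timer on every received heartbeat):

```python
import random

HEARTBEAT_INTERVAL_MS = 150           # leader sends AppendEntries this often
ELECTION_TIMEOUT_RANGE_MS = (150, 300)

def new_election_timeout_ms() -> float:
    # A fresh random value each time the timer is reset; staggered
    # timeouts make it unlikely two followers become candidates at once.
    return random.uniform(*ELECTION_TIMEOUT_RANGE_MS)
```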
When a client writes to the leader:
1. The leader appends the command to its local log (uncommitted).
2. It sends AppendEntries RPCs carrying the new entry to all followers in parallel.
3. When a majority of the cluster (counting the leader) has persisted the entry, the leader marks it committed and applies it to its state machine.
4. The leader responds to the client; followers learn the new commit index from subsequent AppendEntries messages.
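The leader can decide when an entry is committed by tracking, per follower, the highest log index known to be replicated there (`match_index` in the Raft paper): the commit index is the highest index stored on a majority, i.e. the majority-th value of the sorted match indexes. A sketch:

```python
def commit_index(leader_last_index: int, follower_match: list[int]) -> int:
    """Largest log index stored on a majority of the cluster."""
    indexes = sorted(follower_match + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    return indexes[majority - 1]

# Leader at index 42; followers have replicated up to 42, 42, 41, 40.
# Index 42 is on 3 of 5 nodes, so it is committed.
print(commit_index(42, [42, 42, 41, 40]))  # → 42
```

(Real Raft adds one more safety condition: the entry at that index must belong to the leader's current term before it can be committed.)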
AppendEntries includes the previous entry's (term, index). Followers reject if they don't have that entry, forcing the leader to step back and find the common prefix — guaranteeing log consistency without extra RPCs.
# Simplified Raft log entry structure
{
    "term": 3,
    "index": 42,
    "command": "SET x = 100",
    "committed": true
}
# AppendEntries RPC payload
{
    "leader_term": 3,
    "leader_id": "node-1",
    "prev_log_index": 41,
    "prev_log_term": 3,
    "entries": [
        {"term": 3, "index": 42, "command": "SET x = 100"}
    ],
    "leader_commit": 41  # followers commit up to this index
}
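The follower's side of this consistency check can be sketched as follows (the log is a plain Python list here; field names follow the payload above):

```python
def handle_append_entries(log: list[dict], prev_log_index: int,
                          prev_log_term: int, entries: list[dict]) -> bool:
    """Return True and append the entries if the consistency check passes.

    The log is 1-indexed conceptually: log[i - 1] holds the entry at index i.
    """
    if prev_log_index > 0:
        if len(log) < prev_log_index:
            return False  # missing the previous entry entirely
        if log[prev_log_index - 1]["term"] != prev_log_term:
            return False  # conflicting entry: leader must back up further
    # Check passed: drop any conflicting suffix, then append the new entries.
    del log[prev_log_index:]
    log.extend(entries)
    return True
```

A rejection tells the leader to decrement `prev_log_index` and retry, stepping back until the two logs share a common prefix.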
Paxos is the original consensus algorithm. It operates in two phases: Prepare/Promise and Accept/Accepted. Each value is decided independently — no notion of a persistent leader, making it harder to understand and implement.
Raft was designed explicitly for understandability. It separates concerns into leader election, log replication, and safety, and uses a strong leader — all writes go through one node. This makes it easier to reason about and implement correctly.
| Property | Paxos | Raft |
|---|---|---|
| Understandability | Very complex | Designed to be understandable |
| Leader | Optional (any proposer) | Required (single strong leader) |
| Log ordering | Gaps allowed | Sequential, no gaps |
| Client redirect | Complex (no single leader) | Simple (leader is known) |
| Membership change | Manually specified | Joint consensus (built-in) |
| Real-world use | Google Chubby, Spanner | etcd, Consul, TiKV, CockroachDB |
Kubernetes control plane uses etcd for all cluster state — pod assignments, config, secrets. A 3-node etcd cluster tolerates 1 failure. Raft ensures the API server always reads consistent state.
Consul is a service discovery tool and KV store. Its servers use Raft to replicate cluster state; agents on every node use a gossip protocol for membership and health checks — gossip for availability, Raft for consistency.
CockroachDB gives each 512MB data range its own Raft group — thousands of Raft groups per cluster. This allows per-range leader election and fault isolation, unlike running a single Raft group for the whole cluster.
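The per-range idea can be sketched as a routing table from key to Raft group (illustrative only; CockroachDB's real range addressing is more involved):

```python
import bisect

class MultiRaft:
    """Maps keys to per-range Raft groups by range start key."""

    def __init__(self, range_starts: list[str]):
        # One Raft group per range; group i owns keys >= range_starts[i]
        # and < range_starts[i + 1]. Each group elects its own leader.
        self.starts = sorted(range_starts)

    def group_for(self, key: str) -> int:
        return bisect.bisect_right(self.starts, key) - 1

cluster = MultiRaft(["", "g", "p"])  # three ranges: ["", "g"), ["g", "p"), ["p", ...)
print(cluster.group_for("etcd"))     # → 0
print(cluster.group_for("kafka"))    # → 1
print(cluster.group_for("raft"))     # → 2
```

A failure in one range's Raft group stalls only that range; the other groups keep electing leaders and committing writes independently.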
1. In a 5-node Raft cluster, how many nodes must be alive to continue accepting writes?
2. A Raft follower has election timeout of 200ms. The leader's heartbeat interval is 150ms. What happens?
3. Why does Raft use a random election timeout (e.g., 150–300ms) rather than a fixed timeout?
4. An old leader (term 3) sends AppendEntries after a new leader has been elected in term 4. What happens?
5. What is the key architectural difference between Paxos and Raft?
You understand how distributed systems achieve consensus. Next: Distributed Transactions (2PC & Saga).