Day 16 of 30 • Consensus Algorithms

Consensus: Raft & Paxos

How distributed systems agree on a single value despite failures. The algorithms powering etcd, Consul, and CockroachDB.

Leader Election · Log Replication · Split Brain · Quorum · Fault Tolerance

Why Consensus Matters

❌ The Problem: Split Brain

In a 3-node cluster, if the leader loses network connectivity, two nodes may each believe they are leader and accept conflicting writes — split brain. Without consensus, different parts of your system end up with incompatible state.

  • Concurrent writes to both "leaders"
  • Diverged logs, lost updates
  • Data corruption after partition heals

✅ The Solution: Quorum

A consensus algorithm requires a majority quorum (N/2 + 1) to make decisions. In a 5-node cluster: 3 nodes must agree. Even if 2 nodes fail or are partitioned, the remaining 3 can still make progress — but neither partition of 2 can.

  • 3-node cluster: tolerates 1 failure
  • 5-node cluster: tolerates 2 failures
  • 7-node cluster: tolerates 3 failures
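The quorum arithmetic above can be sketched in a couple of one-liners (a minimal Python sketch; quorum_size and tolerated_failures are illustrative names, not from any library):

```python
# Majority quorum math for an N-node cluster (illustrative helpers).

def quorum_size(n: int) -> int:
    """Smallest majority: more than half of all nodes."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Failures survivable while a majority is still alive."""
    return (n - 1) // 2

for n in (3, 5, 7):
    print(f"{n}-node cluster: quorum {quorum_size(n)}, tolerates {tolerated_failures(n)}")
```

This is also why even-sized clusters buy nothing: a 4-node cluster needs 3 nodes for quorum yet tolerates only 1 failure, the same as a 3-node cluster.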

📐 The FLP Impossibility Theorem

Fischer, Lynch, and Paterson (1985) proved that in an asynchronous system where even one process can fail, there is no deterministic algorithm that always achieves consensus in finite time. Real systems work around this by using timeouts (partial synchrony), randomization (Raft election timeouts), or weaker guarantees. Raft and Paxos assume partially synchronous networks where messages eventually arrive.

Raft Algorithm

🏷️ Node Roles

Every node is one of three roles:

  • Leader: handles all writes, sends heartbeats
  • Follower: replicates log, redirects clients
  • Candidate: running for election

📅 Terms

Time is divided into terms (monotonically increasing integers). Each term begins with an election. Terms serve as logical clocks — nodes reject messages from older terms.
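The terms-as-logical-clocks rule can be sketched as the check every node applies before processing any RPC (illustrative Python; Node and handle_term are made-up names, not from a real implementation):

```python
# Sketch of the universal Raft term rule: reject stale terms,
# adopt newer ones and step down. Names are illustrative.

from dataclasses import dataclass

@dataclass
class Node:
    current_term: int = 0
    role: str = "follower"

def handle_term(node: Node, msg_term: int) -> bool:
    """Returns True if the message should be processed further."""
    if msg_term < node.current_term:
        return False                  # stale term: reject outright
    if msg_term > node.current_term:
        node.current_term = msg_term  # newer term: adopt it and step down
        node.role = "follower"
    return True
```

This single rule is what deposes a stale leader: the moment it sees a higher term in any reply, it reverts to follower.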

❤️ Heartbeats

The leader sends heartbeats (AppendEntries RPCs with no entries) to all followers at a fixed interval that is much shorter than the election timeout (e.g., every 50ms against a randomized 150–300ms timeout). If a follower hears no heartbeat within its election timeout, it becomes a candidate and starts an election.

Raft Leader Election — Step by Step

🗳️ Interactive Raft Election Simulator

Click node circles to kill/revive them. Use buttons to trigger elections and observe consensus in action.

[Interactive simulator: a 5-node cluster with live counters for Current Term (1), Leader (N1), Log Entries (0), Alive Nodes (5), and Elections Held (0).]
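Under the hood, an election like the simulator's boils down to counting RequestVote grants against a majority of the full cluster, dead nodes included. A minimal sketch, assuming one vote per node per term (run_election and the cluster dict shape are made up for illustration; real Raft also compares log freshness before a node grants its vote):

```python
# Minimal sketch of one Raft election round. Each node grants at most
# one vote per term; the candidate needs a majority of ALL nodes.

def run_election(candidate: str, nodes: dict, term: int) -> bool:
    """nodes maps node_id -> {'alive': bool, 'voted_in_term': int}."""
    votes = 1                                    # candidate votes for itself
    nodes[candidate]["voted_in_term"] = term
    for node_id, state in nodes.items():
        if node_id == candidate or not state["alive"]:
            continue                             # dead nodes never reply
        if state["voted_in_term"] < term:        # one vote per term
            state["voted_in_term"] = term
            votes += 1
    return votes >= len(nodes) // 2 + 1          # majority of the full cluster

cluster = {f"n{i}": {"alive": i != 3, "voted_in_term": 0} for i in range(1, 6)}
print(run_election("n1", cluster, term=2))       # True: 4 of 5 votes
```

Note that the majority is computed over all five nodes, not just the alive ones; this is what prevents a minority partition from electing its own leader.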

Log Replication

📜 AppendEntries RPC

When a client writes to the leader:

  1. Leader appends the entry to its local log
  2. Sends AppendEntries to all followers
  3. Waits for quorum ACKs (N/2 + 1, counting itself)
  4. Commits the entry and applies it to the state machine
  5. Notifies followers of the new commit index on the next heartbeat
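The steps above can be sketched as a single leader-side function (illustrative Python; the network is faked as a list of follower ACK booleans, and leader_write is a made-up name, not a real API):

```python
# Sketch of the leader's write path. Replication is faked: follower_acks
# holds one boolean per follower instead of real AppendEntries replies.

def leader_write(log: list, command: str, term: int, follower_acks: list) -> bool:
    """Append, replicate, and commit once a quorum (incl. leader) agrees."""
    entry = {"term": term, "index": len(log) + 1, "command": command}
    log.append(entry)                        # 1. append to local log
    acks = 1 + sum(follower_acks)            # 2-3. leader counts itself + ACKs
    cluster_size = 1 + len(follower_acks)
    if acks >= cluster_size // 2 + 1:        # quorum reached
        entry["committed"] = True            # 4. commit, apply to state machine
        return True                          # 5. commit index piggybacks on heartbeats
    return False
```

An uncommitted entry stays in the log; it may later be committed when enough followers catch up, or overwritten by a new leader.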

🔧 Log Consistency Check

AppendEntries carries the (term, index) of the entry immediately preceding the new ones. A follower rejects the RPC if its own log has no matching entry at that position, and the leader steps back one entry at a time until it finds the point where the two logs agree. By induction, this check gives Raft's Log Matching property without any separate log-repair mechanism.

  • Leader tracks nextIndex[] per follower
  • Decrements and retries on rejection
  • Log never has holes — sequential only
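The nextIndex backoff in the bullets above can be sketched by walking backwards until the logs agree (illustrative Python; in real Raft this happens one AppendEntries rejection at a time over the network, and checking only the last entry of the prefix is sound because of the Log Matching property):

```python
# Sketch of the leader's nextIndex backoff. Returns the length of the
# matching prefix, i.e. the index of the first entry to (re)send.

def find_common_prefix(leader_log: list, follower_log: list) -> int:
    next_index = len(leader_log)          # optimistic: start at the leader's end
    while next_index > 0:
        i = next_index - 1
        if i < len(follower_log) and follower_log[i]["term"] == leader_log[i]["term"]:
            break                         # logs match up to and including i
        next_index -= 1                   # "rejection": step back one entry
    return next_index
```

Real implementations often speed this up by having the follower return a conflict term and index, letting the leader skip a whole term per rejection instead of one entry.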
# Simplified Raft log entry structure
{
  "term": 3,
  "index": 42,
  "command": "SET x = 100",
  "committed": true
}

# AppendEntries RPC payload
{
  "leader_term": 3,
  "leader_id": "node-1",
  "prev_log_index": 41,
  "prev_log_term": 3,
  "entries": [
    {"term": 3, "index": 42, "command": "SET x = 100"}
  ],
  "leader_commit": 41  # followers commit up to this index
}
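A follower-side handler for that payload might look like the following sketch (illustrative Python; indices are 1-based as in the example above, and the unconditional truncation after prev_log_index simplifies real Raft, which truncates only on a term conflict):

```python
# Sketch of the follower's AppendEntries consistency check.
# `log` is the follower's list of entries; `rpc` matches the payload above.

def append_entries(log: list, rpc: dict, current_term: int):
    """Returns (success, new_term)."""
    if rpc["leader_term"] < current_term:
        return False, current_term         # stale leader: reject
    current_term = rpc["leader_term"]      # adopt the leader's term
    prev = rpc["prev_log_index"]
    if prev > 0:
        if prev > len(log) or log[prev - 1]["term"] != rpc["prev_log_term"]:
            return False, current_term     # missing/mismatched prefix: reject
    del log[prev:]                         # drop the suffix (simplified)
    log.extend(rpc["entries"])             # append the leader's entries
    return True, current_term
```

A rejection here is exactly what drives the leader's nextIndex backoff: the leader retries with an earlier prev_log_index until the check passes.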

Paxos vs Raft

📚 Paxos (Lamport, 1998)

Academic Standard

The original consensus algorithm, circulated by Lamport in 1989 and formally published in 1998. It operates in two phases, Prepare/Promise and Accept/Accepted, and decides each value independently: there is no notion of a persistent leader, which makes it harder to understand and implement.

  • Flexible leader (any node can propose)
  • Multi-Paxos for log replication
  • Used in: Google Chubby, Google Spanner (ZooKeeper's ZAB is Paxos-like but distinct)
  • Very hard to implement correctly

🦋 Raft (Ongaro & Ousterhout, 2014)

Practical Standard

Designed explicitly for understandability. Separates concerns into: leader election, log replication, safety. Strong leader — all writes go through one node. Easier to reason about and implement correctly.

  • Single strong leader per term
  • Log always flows leader → followers
  • Used in: etcd, Consul, TiKV, CockroachDB
  • Equivalent guarantees to Paxos
| Property          | Paxos                      | Raft                           |
|-------------------|----------------------------|--------------------------------|
| Understandability | Very complex               | Designed to be understandable  |
| Leader            | Optional (any proposer)    | Required (single strong leader)|
| Log ordering      | Gaps allowed               | Sequential, no gaps            |
| Client redirect   | Complex (no single leader) | Simple (leader is known)       |
| Membership change | Manually specified         | Joint consensus (built-in)     |
| Real-world use    | Google Chubby, Spanner     | etcd, Consul, TiKV             |

Real-World Use Cases

🗄️ etcd

Raft

Kubernetes control plane uses etcd for all cluster state — pod assignments, config, secrets. A 3-node etcd cluster tolerates 1 failure. Raft ensures the API server always reads consistent state.

🔒 HashiCorp Consul

Raft

Service discovery and KV store. The server nodes replicate state with Raft, while agents on every node use a gossip protocol for membership and health checks: gossip for availability, Raft for consistency.

🐘 CockroachDB

Raft per range

Each data range (512MB by default) has its own Raft group, so a single cluster runs thousands of Raft groups. This allows per-range leader placement and fault isolation, in contrast to a single cluster-wide Raft group.

Knowledge Check

1. In a 5-node Raft cluster, how many nodes must be alive to continue accepting writes?

2. A Raft follower has election timeout of 200ms. The leader's heartbeat interval is 150ms. What happens?

3. Why does Raft use a random election timeout (e.g., 150–300ms) rather than a fixed timeout?

4. An old leader (term 3) sends AppendEntries after a new leader has been elected in term 4. What happens?

5. What is the key architectural difference between Paxos and Raft?

Day 16 Complete!

You understand how distributed systems achieve consensus. Next: Distributed Transactions (2PC & Saga).