How distributed systems agree on a single value despite failures. The algorithms powering etcd, Consul, and CockroachDB.
In a 3-node cluster, if the leader loses network connectivity, the remaining two nodes may elect a new leader while the old one keeps accepting writes — split brain: two nodes each believe they are leader. Without consensus, different parts of your system end up with incompatible state.
A consensus algorithm requires a majority quorum (N/2 + 1) to make decisions. In a 5-node cluster: 3 nodes must agree. Even if 2 nodes fail or are partitioned, the remaining 3 can still make progress — but neither partition of 2 can.
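The quorum arithmetic can be sketched in a few lines (Python here and below, purely for illustration):

```python
def quorum(n: int) -> int:
    """Minimum number of nodes that must agree: floor(n / 2) + 1."""
    return n // 2 + 1

def faults_tolerated(n: int) -> int:
    """How many nodes can fail while a majority can still form."""
    return n - quorum(n)

# 3 nodes -> quorum 2, tolerates 1 failure
# 5 nodes -> quorum 3, tolerates 2 failures
# 7 nodes -> quorum 4, tolerates 3 failures
```

This is also why clusters use odd sizes: a 4-node cluster has a quorum of 3 and tolerates only 1 failure, the same fault tolerance as a 3-node cluster but with a larger quorum to assemble.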
Fischer, Lynch, and Paterson (1985) proved that in an asynchronous system where even one process can fail, there is no deterministic algorithm that always achieves consensus in finite time. Real systems work around this by using timeouts (partial synchrony), randomization (Raft election timeouts), or weaker guarantees. Raft and Paxos assume partially synchronous networks where messages eventually arrive.
Every node is in one of three roles:
- **Follower** — passive; responds to RPCs from the leader and candidates, never initiates requests.
- **Candidate** — a follower whose election timeout expired; it increments its term and requests votes.
- **Leader** — handles all client writes and replicates log entries to followers.
Time is divided into terms (monotonically increasing integers). Each term begins with an election. Terms serve as logical clocks — nodes reject messages from older terms.
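The "reject messages from older terms" rule can be sketched as a small guard that every RPC handler runs first (field and role names are illustrative, not from any particular implementation):

```python
class Node:
    def __init__(self):
        self.current_term = 0
        self.role = "follower"  # "follower" | "candidate" | "leader"

    def on_rpc(self, msg_term: int) -> bool:
        """Return True if the incoming message should be processed."""
        if msg_term < self.current_term:
            return False            # stale sender: reject outright
        if msg_term > self.current_term:
            self.current_term = msg_term
            self.role = "follower"  # saw a newer term: step down
        return True
```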
The leader sends heartbeats (AppendEntries with no entries) to all followers every 150ms. If a follower doesn't receive a heartbeat in its election timeout (150–300ms random), it starts an election.
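A follower's randomized election timer might be sketched like this (constants mirror the numbers above; assume a real implementation resets the timer on every received heartbeat):

```python
import random

HEARTBEAT_INTERVAL_MS = 150           # leader sends AppendEntries this often
ELECTION_TIMEOUT_RANGE_MS = (150, 300)

def new_election_timeout_ms() -> float:
    # A fresh random value each time the timer is reset; staggered
    # timeouts make it unlikely two followers become candidates at once.
    return random.uniform(*ELECTION_TIMEOUT_RANGE_MS)
```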
When a client writes to the leader:
1. The leader appends the command to its local log (uncommitted).
2. It sends AppendEntries RPCs carrying the new entry to all followers in parallel.
3. When a majority of the cluster (counting the leader) has persisted the entry, the leader marks it committed and applies it to its state machine.
4. The leader responds to the client; followers learn the new commit index from subsequent AppendEntries messages.
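The leader can decide when an entry is committed by tracking, per follower, the highest log index known to be replicated there (`match_index` in the Raft paper): the commit index is the highest index stored on a majority, i.e. the majority-th value of the sorted match indexes. A sketch:

```python
def commit_index(leader_last_index: int, follower_match: list[int]) -> int:
    """Largest log index stored on a majority of the cluster."""
    indexes = sorted(follower_match + [leader_last_index], reverse=True)
    majority = len(indexes) // 2 + 1
    return indexes[majority - 1]

# Leader at index 42; followers have replicated up to 42, 42, 41, 40.
# Index 42 is on 3 of 5 nodes, so it is committed.
print(commit_index(42, [42, 42, 41, 40]))  # → 42
```

(Real Raft adds one more safety condition: the entry at that index must belong to the leader's current term before it can be committed.)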
AppendEntries includes the previous entry's (term, index). Followers reject if they don't have that entry, forcing the leader to step back and find the common prefix — guaranteeing log consistency without extra RPCs.
# Simplified Raft log entry structure
{
    "term": 3,
    "index": 42,
    "command": "SET x = 100",
    "committed": true
}
# AppendEntries RPC payload
{
    "leader_term": 3,
    "leader_id": "node-1",
    "prev_log_index": 41,
    "prev_log_term": 3,
    "entries": [
        {"term": 3, "index": 42, "command": "SET x = 100"}
    ],
    "leader_commit": 41  # followers commit up to this index
}
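The follower's side of this consistency check can be sketched as follows (the log is a plain Python list here; field names follow the payload above):

```python
def handle_append_entries(log: list[dict], prev_log_index: int,
                          prev_log_term: int, entries: list[dict]) -> bool:
    """Return True and append the entries if the consistency check passes.

    The log is 1-indexed conceptually: log[i - 1] holds the entry at index i.
    """
    if prev_log_index > 0:
        if len(log) < prev_log_index:
            return False  # missing the previous entry entirely
        if log[prev_log_index - 1]["term"] != prev_log_term:
            return False  # conflicting entry: leader must back up further
    # Check passed: drop any conflicting suffix, then append the new entries.
    del log[prev_log_index:]
    log.extend(entries)
    return True
```

A rejection tells the leader to decrement `prev_log_index` and retry, stepping back until the two logs share a common prefix.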
Paxos is the original consensus algorithm. It operates in two phases: Prepare/Promise and Accept/Accepted. Each value is decided independently — no notion of a persistent leader, making it harder to understand and implement.
Raft was designed explicitly for understandability. It separates concerns into leader election, log replication, and safety, and uses a strong leader — all writes go through one node. This makes it easier to reason about and implement correctly.
| Property | Paxos | Raft |
|---|---|---|
| Understandability | Very complex | Designed to be understandable |
| Leader | Optional (any proposer) | Required (single strong leader) |
| Log ordering | Gaps allowed | Sequential, no gaps |
| Client redirect | Complex (no single leader) | Simple (leader is known) |
| Membership change | Manually specified | Joint consensus (built-in) |
| Real-world use | Google Chubby, Spanner | etcd, Consul, TiKV, CockroachDB |
Kubernetes control plane uses etcd for all cluster state — pod assignments, config, secrets. A 3-node etcd cluster tolerates 1 failure. Raft ensures the API server always reads consistent state.
Consul is a service discovery tool and KV store. Its servers use Raft to replicate cluster state; agents on every node use a gossip protocol for membership and health checks — gossip for availability, Raft for consistency.
CockroachDB gives each 512MB data range its own Raft group — thousands of Raft groups per cluster. This allows per-range leader election and fault isolation, unlike running a single Raft group for the whole cluster.
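The per-range idea can be sketched as a routing table from key to Raft group (illustrative only; CockroachDB's real range addressing is more involved):

```python
import bisect

class MultiRaft:
    """Maps keys to per-range Raft groups by range start key."""

    def __init__(self, range_starts: list[str]):
        # One Raft group per range; group i owns keys >= range_starts[i]
        # and < range_starts[i + 1]. Each group elects its own leader.
        self.starts = sorted(range_starts)

    def group_for(self, key: str) -> int:
        return bisect.bisect_right(self.starts, key) - 1

cluster = MultiRaft(["", "g", "p"])  # three ranges: ["", "g"), ["g", "p"), ["p", ...)
print(cluster.group_for("etcd"))     # → 0
print(cluster.group_for("kafka"))    # → 1
print(cluster.group_for("raft"))     # → 2
```

A failure in one range's Raft group stalls only that range; the other groups keep electing leaders and committing writes independently.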
1. In a 5-node Raft cluster, how many nodes must be alive to continue accepting writes?
2. A Raft follower has election timeout of 200ms. The leader's heartbeat interval is 150ms. What happens?
3. Why does Raft use a random election timeout (e.g., 150–300ms) rather than a fixed timeout?
4. An old leader (term 3) sends AppendEntries after a new leader has been elected in term 4. What happens?
5. What is the key architectural difference between Paxos and Raft?
You understand how distributed systems achieve consensus. Next: Distributed Transactions (2PC & Saga).