System Design: Consensus Algorithms — Raft, Paxos, Leader Election, and Distributed Agreement

The Consensus Problem

In a distributed system, multiple nodes must agree on a single value (which node is the leader, what the next log entry is, whether a transaction committed). Consensus is hard because nodes can fail, networks can drop or delay messages, and there is no shared clock. The fundamental result is the FLP impossibility theorem: in a fully asynchronous system with even one faulty process, no deterministic consensus algorithm can guarantee termination. Practical systems work around this by using timeouts to detect suspected failures (a partial-synchrony assumption): safety is preserved unconditionally, while liveness holds whenever messages are delivered within the expected time bounds.

Raft Algorithm

Raft is designed for understandability. It has three roles: Leader (at most one per term; handles all client requests), Follower (passive; replicates the leader's log), and Candidate (a temporary role during elections).

Leader election: every node starts as a Follower with a randomized election timeout (commonly 150-300 ms). If a Follower hears no heartbeat from the Leader before the timeout fires, it becomes a Candidate, increments its term, votes for itself, and sends RequestVote RPCs to its peers. A Candidate that receives votes from a majority becomes Leader. Randomized timeouts make split votes rare; if one does occur, the Candidates time out and retry with a higher term.

Log replication: the Leader appends each client request to its own log and sends AppendEntries RPCs to all Followers. Once a majority of the cluster has acknowledged the entry, the Leader commits it, applies it to its state machine, and responds to the client. Followers apply committed entries in log order.
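
To make the election mechanics concrete, here is a minimal single-process Go sketch (not taken from any real library; Node, startElection, and the simulated peer votes are illustrative assumptions): a follower waits for a heartbeat, and when none arrives before its randomized timeout, it stands for election and wins if it can count a majority.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Raft roles.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Node holds the minimal per-node election state.
type Node struct {
	id          int
	state       State
	currentTerm int
	votedFor    int // -1 = no vote cast in the current term
}

// electionTimeout returns a randomized timeout in the 150-300 ms range,
// so followers rarely become candidates at the same instant.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// startElection converts the node to Candidate: bump the term, vote for
// itself, and (in a real system) send RequestVote RPCs to all peers.
// Here the peer votes are passed in to keep the sketch single-process.
func (n *Node) startElection(clusterSize, peerVotesGranted int) {
	n.state = Candidate
	n.currentTerm++
	n.votedFor = n.id
	votes := 1 + peerVotesGranted // self-vote plus granted peer votes
	if votes >= clusterSize/2+1 { // majority quorum
		n.state = Leader
		fmt.Printf("node %d won the election for term %d\n", n.id, n.currentTerm)
	}
}

func main() {
	n := &Node{id: 1, state: Follower, votedFor: -1}
	heartbeats := make(chan struct{}) // never written to: simulates a dead leader

	select {
	case <-heartbeats:
		// Heartbeat arrived in time: stay a Follower and reset the timer.
	case <-time.After(electionTimeout()):
		// No heartbeat before the timeout: stand for election.
		// Suppose 2 of the 4 peers grant their votes in a 5-node cluster.
		n.startElection(5, 2)
	}
}
```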

Safety guarantee: only a candidate with a sufficiently up-to-date log can win an election. RequestVote carries the candidate's last log index and term, and a follower grants its vote only if the candidate's log is at least as up-to-date as its own. Because a committed entry already exists on a majority of nodes, any candidate able to win a majority must contain it, so committed entries are never lost across leader changes.
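
A hedged sketch of that vote-granting comparison (the type and function names are mine, not from any particular implementation):

```go
package main

import "fmt"

// LogInfo summarizes the tail of a node's log; field names are illustrative.
type LogInfo struct {
	LastTerm  int
	LastIndex int
}

// candidateIsUpToDate implements the check a voter applies before granting
// its vote: a higher last term wins; on equal terms, the longer log wins.
// Rejecting candidates that fail this check is what keeps committed entries
// from being lost across leader changes.
func candidateIsUpToDate(candidate, voter LogInfo) bool {
	if candidate.LastTerm != voter.LastTerm {
		return candidate.LastTerm > voter.LastTerm
	}
	return candidate.LastIndex >= voter.LastIndex
}

func main() {
	voter := LogInfo{LastTerm: 3, LastIndex: 12}
	fmt.Println(candidateIsUpToDate(LogInfo{LastTerm: 3, LastIndex: 15}, voter)) // true
	fmt.Println(candidateIsUpToDate(LogInfo{LastTerm: 2, LastIndex: 40}, voter)) // false: stale term
}
```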

Paxos

Paxos is the theoretical foundation; Multi-Paxos is the variant used in practice. Single-decree Paxos runs in two phases.

Phase 1 (Prepare/Promise): a Proposer sends Prepare(n) to a majority of Acceptors. Each Acceptor promises not to accept any proposal numbered lower than n and returns the highest-numbered proposal it has already accepted, if any.

Phase 2 (Accept/Accepted): the Proposer sends Accept(n, value). If any Acceptor reported a previously accepted proposal, the Proposer must propose the value from the highest-numbered one rather than its own. Once a majority of Acceptors accept, the value is chosen.

Paxos is notoriously difficult to understand and implement correctly; Raft was created explicitly as a more understandable alternative. Used in: Google Chubby (Multi-Paxos) and Apache ZooKeeper (ZAB, a Paxos-like protocol).
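
As a rough single-acceptor sketch of the two phases (Go, with illustrative names; a real deployment runs this against a majority of acceptors and layers Multi-Paxos on top):

```go
package main

import "fmt"

// Acceptor holds the state one Paxos acceptor needs for both phases.
// Field and method names are illustrative, not from any real library.
type Acceptor struct {
	promisedN   int    // highest proposal number promised (0 = none yet)
	acceptedN   int    // number of the last accepted proposal (0 = none)
	acceptedVal string // value of the last accepted proposal
}

// Prepare handles Phase 1: promise to ignore proposals numbered below n and
// report any previously accepted proposal so the proposer can adopt its value.
func (a *Acceptor) Prepare(n int) (promised bool, prevN int, prevVal string) {
	if n <= a.promisedN {
		return false, 0, ""
	}
	a.promisedN = n
	return true, a.acceptedN, a.acceptedVal
}

// Accept handles Phase 2: accept unless a higher-numbered Prepare has been
// promised in the meantime.
func (a *Acceptor) Accept(n int, val string) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN = n
	a.acceptedN = n
	a.acceptedVal = val
	return true
}

func main() {
	a := &Acceptor{}
	ok, prevN, prevVal := a.Prepare(7)
	fmt.Println("promised:", ok, "previously accepted:", prevN, prevVal)
	fmt.Println("accepted:", a.Accept(7, "leader=node-3"))

	stale, _, _ := a.Prepare(5) // a lower-numbered Prepare arrives late
	fmt.Println("stale prepare promised:", stale)
}
```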

Practical Applications

  • Leader election: etcd (Raft; backs the Kubernetes control plane) and ZooKeeper (ZAB, a Paxos-like protocol, which provides leader election as a service for Kafka and HDFS).
  • Distributed databases: CockroachDB (one Raft group per range), TiDB (Raft), YugabyteDB.
  • Log replication: every replicated state machine needs either true consensus or a simpler primary-backup scheme (Redis Sentinel uses quorum-based failover; PostgreSQL streaming replication is primary-backup). The difference: consensus handles leader failures automatically; primary-backup typically requires external tooling or manual intervention to fail over.

Interview Tips

  • You do not need to know the full Raft specification. Know: leader election via randomized timeouts, log replication requiring majority acknowledgment, and the safety property (committed entries survive leader changes).
  • CAP theorem connection: consensus algorithms choose consistency + partition tolerance (CP). During a partition, a minority partition cannot commit entries (it cannot reach majority). Availability is sacrificed.
  • Quorum: majority quorum = floor(n/2) + 1. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. Prefer odd cluster sizes: going from 3 to 4 nodes raises the quorum from 2 to 3 but still tolerates only 1 failure (see the sketch below).
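
A tiny sketch of the quorum arithmetic from the last bullet, showing why even cluster sizes add cost without adding fault tolerance:

```go
package main

import "fmt"

func quorum(n int) int          { return n/2 + 1 }     // majority quorum: floor(n/2) + 1
func faultsTolerated(n int) int { return (n - 1) / 2 } // failures survivable while keeping a quorum

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("n=%d quorum=%d tolerates=%d failure(s)\n", n, quorum(n), faultsTolerated(n))
	}
	// n=3 quorum=2 tolerates=1 failure(s)
	// n=4 quorum=3 tolerates=1 failure(s)  <- even size buys no extra tolerance
	// n=5 quorum=3 tolerates=2 failure(s)
}
```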

Asked at: Databricks, Cloudflare, Netflix, Uber, Atlassian
