The Consensus Problem
In a distributed system, multiple nodes must agree on a single value (which node is the leader, what the next log entry is, whether a transaction committed). Consensus is hard because nodes can fail, networks can drop or delay messages, and there is no shared clock. The fundamental result is the FLP impossibility theorem: in an asynchronous system with even one faulty process, no deterministic consensus protocol can guarantee termination. In practice, systems use timeouts to detect failures, effectively assuming partial synchrony, and reach consensus whenever the network behaves well for long enough.
Raft Algorithm
Raft is designed for understandability. Three roles: Leader (exactly one per term, handles all client requests), Follower (passive, replicates the leader's log), Candidate (temporary role during elections). Leader election: all nodes start as Followers. Each has a randomized election timeout (typically 150-300ms). If no heartbeat arrives from the Leader within the timeout: become a Candidate, increment the term, vote for yourself, and send RequestVote RPCs. Receiving a majority of votes makes you Leader. Randomized timeouts prevent all nodes from timing out simultaneously and splitting the vote. Log replication: the Leader receives entries, appends them to its log, and sends AppendEntries RPCs to all Followers. Once a majority acknowledge: the Leader commits the entry, applies it to its state machine, and responds to the client. Followers apply committed entries in order.
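The election flow above can be sketched as a minimal state transition. This is a hypothetical `Node` class for illustration, not a real Raft implementation: RPC transport, heartbeats, and log replication are omitted.

```python
import random

class Node:
    """Illustrative sketch of a Raft node's election-related state."""

    def __init__(self, node_id, cluster_size):
        self.id = node_id
        self.cluster_size = cluster_size
        self.term = 0
        self.voted_for = None
        self.role = "follower"
        # Randomized timeout (ms): staggers candidacies so one node
        # usually times out first, avoiding repeated split votes.
        self.election_timeout = random.uniform(150, 300)

    def on_election_timeout(self):
        # No heartbeat arrived within the timeout: become a Candidate,
        # bump the term, vote for yourself, and ask peers for votes.
        self.role = "candidate"
        self.term += 1
        self.voted_for = self.id
        return {"type": "RequestVote", "term": self.term, "candidate": self.id}

    def count_votes(self, votes_granted):
        # A majority (including our own vote) wins the election.
        if votes_granted > self.cluster_size // 2:
            self.role = "leader"
        return self.role
```

In a 5-node cluster, 3 granted votes (a majority) are enough to become Leader; 2 are not.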
Safety guarantee: only nodes with up-to-date logs can win an election (RequestVote includes the candidate’s log index and term; followers reject candidates with older logs). This ensures committed entries are never lost across leader changes.
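The "at least as up-to-date" rule compares the last entry's term first and falls back to log length on a tie. A hedged sketch of the check a voter applies during RequestVote (the function name is illustrative):

```python
def log_is_up_to_date(candidate_last_term, candidate_last_index,
                      voter_last_term, voter_last_index):
    """Grant a vote only if the candidate's log is at least as
    current as the voter's: higher last term wins; on equal terms,
    the longer (or equal-length) log wins."""
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index
```

Because an entry is committed only once a majority hold it, any candidate that fails this check against a majority of voters cannot win, so a leader missing committed entries is never elected.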
Paxos
Paxos is the theoretical foundation of consensus; Multi-Paxos is the variant used in practice. Two phases: Phase 1 (Prepare/Promise): a Proposer sends Prepare(n) to a majority of Acceptors. Each Acceptor promises not to accept any proposal numbered less than n and returns the highest-numbered proposal it has already accepted, if any. Phase 2 (Accept/Accepted): the Proposer sends Accept(n, value); if any Acceptor reported a previously accepted value, the Proposer must use the value from the highest-numbered prior proposal rather than its own. Once a majority of Acceptors accept, the value is chosen. Paxos is difficult to understand and implement correctly; Raft was created explicitly as a more understandable alternative. Used in: Google Chubby; Apache ZooKeeper uses ZAB, a Paxos-like atomic broadcast protocol.
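The two phases can be sketched from the Acceptor's side. This is a simplified single-instance sketch under stated assumptions: proposal numbers are globally unique integers, and Learners, retries, and Multi-Paxos leader leases are omitted.

```python
class Acceptor:
    """Illustrative single-decree Paxos acceptor."""

    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # highest proposal number accepted
        self.accepted_value = None

    def on_prepare(self, n):
        # Phase 1: promise to ignore proposals numbered below n,
        # and report any proposal already accepted.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject",)

    def on_accept(self, n, value):
        # Phase 2: accept unless it would violate an earlier promise.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("reject",)

def choose_value(promises, my_value):
    """Proposer rule: if any acceptor reported a prior accept, adopt the
    value from the highest-numbered one; otherwise propose our own."""
    prior = [(n, v) for (_, n, v) in promises if n >= 0]
    return max(prior)[1] if prior else my_value
```

The `choose_value` rule is what makes Paxos safe: a later proposer cannot overwrite a value that may already have been chosen by an earlier majority.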
Practical Applications
Leader election: etcd (the Kubernetes control plane uses Raft); ZooKeeper (ZAB, a Paxos-like protocol) provides leader election as a service for Kafka and HDFS. Distributed databases: CockroachDB (Raft per range), TiDB (Raft), YugabyteDB. Log replication: every replicated state machine uses some form of consensus, while simpler systems (Redis Sentinel, PostgreSQL streaming replication) use primary-backup replication. The difference: consensus handles leader failures automatically; primary-backup typically requires manual intervention or an external failover mechanism.
Interview Tips
- You do not need to know the full Raft specification. Know: leader election via randomized timeouts, log replication requiring majority acknowledgment, and the safety property (committed entries survive leader changes).
- CAP theorem connection: consensus algorithms choose consistency + partition tolerance (CP). During a partition, a minority partition cannot commit entries (it cannot reach majority). Availability is sacrificed.
- Quorum: majority quorum = floor(n/2) + 1. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. Prefer odd cluster sizes: adding a fourth node raises the quorum to 3 without tolerating any additional failures.
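The quorum arithmetic above can be checked directly, and it shows why even cluster sizes add cost without adding fault tolerance:

```python
def quorum(n):
    """Majority quorum size for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def tolerated_failures(n):
    """Nodes that can fail while a majority remains reachable."""
    return n - quorum(n)

# 3 nodes: quorum 2, tolerates 1 failure.
# 4 nodes: quorum 3, still tolerates only 1 failure.
# 5 nodes: quorum 3, tolerates 2 failures.
for n in (3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```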
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does Raft leader election work and why are randomized timeouts used?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Raft leader election: all nodes start as Followers. Each node has an election timeout (150-300ms, re-randomized for each election cycle). If a Follower receives no heartbeat from the Leader within the timeout, it becomes a Candidate, increments its term, votes for itself, and sends RequestVote RPCs to all other nodes. Nodes grant their vote if they have not already voted in this term AND the candidate's log is at least as up-to-date as their own. If the Candidate receives a majority of votes, it becomes the Leader and immediately sends heartbeats. Randomized timeouts are key: if all nodes had the same timeout, they would all become Candidates simultaneously and split the vote. Randomization ensures one node usually times out first and wins the election before others start."
      }
    },
    {
      "@type": "Question",
      "name": "How does Raft ensure committed log entries are never lost?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Raft's safety property: once a log entry is committed, it will never be lost, even across leader changes. How it is guaranteed: (1) An entry is only committed once a majority of nodes have it. (2) A Candidate can only win an election if its log is at least as up-to-date as the logs of the majority that voted for it (RequestVote grants the vote only if the candidate's last log term and index are >= the voter's). Combining (1) and (2): any new leader must have all committed entries, because it won a majority vote and every majority overlaps with the majority that committed those entries. The new leader inherits all committed entries before it starts leading. This prevents the split-brain scenario where two leaders commit conflicting entries."
      }
    },
    {
      "@type": "Question",
      "name": "What is the difference between Raft and Paxos?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Paxos (1989): the theoretical foundation of consensus. Multi-Paxos is the practical version (Paxos for a sequence of values, not just one). Paxos is notoriously difficult to understand and has many implementation subtleties. It separates the roles of Proposer, Acceptor, and Learner. Real-world use: Google Chubby; Apache ZooKeeper uses ZAB, a Paxos-like protocol. Raft (2014): designed explicitly for understandability. Subsumes all roles into a single node state machine (Follower, Candidate, Leader). Decomposes consensus into leader election and log replication, with clear rules for each. Real-world use: etcd (Kubernetes), CockroachDB, TiDB, Consul. Key difference: Raft elects a strong leader who handles all decisions; Multi-Paxos also typically uses a distinguished proposer, but the protocol is less prescriptive about it."
      }
    },
    {
      "@type": "Question",
      "name": "How does a split brain situation occur in distributed consensus and how is it prevented?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Split brain: two nodes both believe they are the leader simultaneously and accept writes, causing divergent state. Prevention in Raft: the term number. Each election increments the term, and a node immediately steps down as Leader if it receives a message with a higher term (it knows another election has occurred). A Leader cannot commit entries without a majority; if it cannot reach a majority (network partition), it cannot commit anything, so the minority partition's leader is effectively frozen. If a second leader is elected on the majority side, it will have a higher term. When the network heals, the old leader sees the higher term and steps down. The quorum requirement is the fundamental protection: only one leader can hold a majority at any time, preventing two leaders from both committing."
      }
    },
    {
      "@type": "Question",
      "name": "What are the practical trade-offs of using a consensus-based system like etcd?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Advantages: strong consistency (linearizability), meaning reads always see the latest committed write; automatic leader failover, with no manual intervention needed when a node fails; no single point of failure if the cluster has 3+ nodes. Disadvantages: write latency, since every write requires a round trip to a majority of nodes (with 3 nodes in 3 different data centers, write latency is roughly 2x the cross-DC RTT, often 50-100ms); throughput limited by the leader, since all writes go through one node (etcd is typically limited to thousands of writes per second); limited scale, since consensus clusters are usually 3 or 5 nodes and adding more increases write latency (larger majorities to contact). Not suitable for high-throughput data storage; use it for metadata, configuration, and leader election only (as Kubernetes does)."
      }
    }
  ]
}