System Design: Consensus Algorithms — Raft, Paxos, Leader Election, and Distributed Agreement

The Consensus Problem

In a distributed system, multiple nodes must agree on a single value (which node is the leader, what the next log entry is, whether a transaction committed). Consensus is hard because nodes can fail, networks can drop or delay messages, and there is no shared clock. The fundamental result is the FLP impossibility theorem: in a fully asynchronous system with even one faulty process, no deterministic consensus algorithm can guarantee termination. Practical systems work around this by using timeouts to detect suspected failures (a partial-synchrony assumption): safety is preserved unconditionally, while liveness holds whenever messages are delivered within the expected time bounds.

Raft Algorithm

Raft is designed for understandability. It has three roles: Leader (at most one per term; handles all client requests), Follower (passive; replicates the leader's log), and Candidate (a temporary role during elections).

Leader election: every node starts as a Follower with a randomized election timeout (commonly 150-300 ms). If a Follower hears no heartbeat from the Leader before the timeout fires, it becomes a Candidate, increments its term, votes for itself, and sends RequestVote RPCs to its peers. A Candidate that receives votes from a majority becomes Leader. Randomized timeouts make split votes rare; if one does occur, the Candidates time out and retry with a higher term.

Log replication: the Leader appends each client request to its own log and sends AppendEntries RPCs to all Followers. Once a majority of the cluster has acknowledged the entry, the Leader commits it, applies it to its state machine, and responds to the client. Followers apply committed entries in log order.
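
To make the election mechanics concrete, here is a minimal single-process Go sketch (not taken from any real library; Node, startElection, and the simulated peer votes are illustrative assumptions): a follower waits for a heartbeat, and when none arrives before its randomized timeout, it stands for election and wins if it can count a majority.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Raft roles.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

// Node holds the minimal per-node election state.
type Node struct {
	id          int
	state       State
	currentTerm int
	votedFor    int // -1 = no vote cast in the current term
}

// electionTimeout returns a randomized timeout in the 150-300 ms range,
// so followers rarely become candidates at the same instant.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}

// startElection converts the node to Candidate: bump the term, vote for
// itself, and (in a real system) send RequestVote RPCs to all peers.
// Here the peer votes are passed in to keep the sketch single-process.
func (n *Node) startElection(clusterSize, peerVotesGranted int) {
	n.state = Candidate
	n.currentTerm++
	n.votedFor = n.id
	votes := 1 + peerVotesGranted // self-vote plus granted peer votes
	if votes >= clusterSize/2+1 { // majority quorum
		n.state = Leader
		fmt.Printf("node %d won the election for term %d\n", n.id, n.currentTerm)
	}
}

func main() {
	n := &Node{id: 1, state: Follower, votedFor: -1}
	heartbeats := make(chan struct{}) // never written to: simulates a dead leader

	select {
	case <-heartbeats:
		// Heartbeat arrived in time: stay a Follower and reset the timer.
	case <-time.After(electionTimeout()):
		// No heartbeat before the timeout: stand for election.
		// Suppose 2 of the 4 peers grant their votes in a 5-node cluster.
		n.startElection(5, 2)
	}
}
```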

Safety guarantee: only a candidate with a sufficiently up-to-date log can win an election. RequestVote carries the candidate's last log index and term, and a follower grants its vote only if the candidate's log is at least as up-to-date as its own. Because a committed entry already exists on a majority of nodes, any candidate able to win a majority must contain it, so committed entries are never lost across leader changes.
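
A hedged sketch of that vote-granting comparison (the type and function names are mine, not from any particular implementation):

```go
package main

import "fmt"

// LogInfo summarizes the tail of a node's log; field names are illustrative.
type LogInfo struct {
	LastTerm  int
	LastIndex int
}

// candidateIsUpToDate implements the check a voter applies before granting
// its vote: a higher last term wins; on equal terms, the longer log wins.
// Rejecting candidates that fail this check is what keeps committed entries
// from being lost across leader changes.
func candidateIsUpToDate(candidate, voter LogInfo) bool {
	if candidate.LastTerm != voter.LastTerm {
		return candidate.LastTerm > voter.LastTerm
	}
	return candidate.LastIndex >= voter.LastIndex
}

func main() {
	voter := LogInfo{LastTerm: 3, LastIndex: 12}
	fmt.Println(candidateIsUpToDate(LogInfo{LastTerm: 3, LastIndex: 15}, voter)) // true
	fmt.Println(candidateIsUpToDate(LogInfo{LastTerm: 2, LastIndex: 40}, voter)) // false: stale term
}
```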

Paxos

Paxos is the theoretical foundation; Multi-Paxos is the variant used in practice. Single-decree Paxos runs in two phases.

Phase 1 (Prepare/Promise): a Proposer sends Prepare(n) to a majority of Acceptors. Each Acceptor promises not to accept any proposal numbered lower than n and returns the highest-numbered proposal it has already accepted, if any.

Phase 2 (Accept/Accepted): the Proposer sends Accept(n, value). If any Acceptor reported a previously accepted proposal, the Proposer must propose the value from the highest-numbered one rather than its own. Once a majority of Acceptors accept, the value is chosen.

Paxos is notoriously difficult to understand and implement correctly; Raft was created explicitly as a more understandable alternative. Used in: Google Chubby (Multi-Paxos) and Apache ZooKeeper (ZAB, a Paxos-like protocol).
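
As a rough single-acceptor sketch of the two phases (Go, with illustrative names; a real deployment runs this against a majority of acceptors and layers Multi-Paxos on top):

```go
package main

import "fmt"

// Acceptor holds the state one Paxos acceptor needs for both phases.
// Field and method names are illustrative, not from any real library.
type Acceptor struct {
	promisedN   int    // highest proposal number promised (0 = none yet)
	acceptedN   int    // number of the last accepted proposal (0 = none)
	acceptedVal string // value of the last accepted proposal
}

// Prepare handles Phase 1: promise to ignore proposals numbered below n and
// report any previously accepted proposal so the proposer can adopt its value.
func (a *Acceptor) Prepare(n int) (promised bool, prevN int, prevVal string) {
	if n <= a.promisedN {
		return false, 0, ""
	}
	a.promisedN = n
	return true, a.acceptedN, a.acceptedVal
}

// Accept handles Phase 2: accept unless a higher-numbered Prepare has been
// promised in the meantime.
func (a *Acceptor) Accept(n int, val string) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN = n
	a.acceptedN = n
	a.acceptedVal = val
	return true
}

func main() {
	a := &Acceptor{}
	ok, prevN, prevVal := a.Prepare(7)
	fmt.Println("promised:", ok, "previously accepted:", prevN, prevVal)
	fmt.Println("accepted:", a.Accept(7, "leader=node-3"))

	stale, _, _ := a.Prepare(5) // a lower-numbered Prepare arrives late
	fmt.Println("stale prepare promised:", stale)
}
```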

Practical Applications

  • Leader election: etcd (Raft; backs the Kubernetes control plane) and ZooKeeper (ZAB, a Paxos-like protocol, which provides leader election as a service for Kafka and HDFS).
  • Distributed databases: CockroachDB (one Raft group per range), TiDB (Raft), YugabyteDB.
  • Log replication: every replicated state machine needs either true consensus or a simpler primary-backup scheme (Redis Sentinel uses quorum-based failover; PostgreSQL streaming replication is primary-backup). The difference: consensus handles leader failures automatically; primary-backup typically requires external tooling or manual intervention to fail over.

Interview Tips

  • You do not need to know the full Raft specification. Know: leader election via randomized timeouts, log replication requiring majority acknowledgment, and the safety property (committed entries survive leader changes).
  • CAP theorem connection: consensus algorithms choose consistency + partition tolerance (CP). During a partition, a minority partition cannot commit entries (it cannot reach majority). Availability is sacrificed.
  • Quorum: majority quorum = floor(n/2) + 1. A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2. Prefer odd cluster sizes: going from 3 to 4 nodes raises the quorum from 2 to 3 but still tolerates only 1 failure (see the sketch below).
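
A tiny sketch of the quorum arithmetic from the last bullet, showing why even cluster sizes add cost without adding fault tolerance:

```go
package main

import "fmt"

func quorum(n int) int          { return n/2 + 1 }     // majority quorum: floor(n/2) + 1
func faultsTolerated(n int) int { return (n - 1) / 2 } // failures survivable while keeping a quorum

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("n=%d quorum=%d tolerates=%d failure(s)\n", n, quorum(n), faultsTolerated(n))
	}
	// n=3 quorum=2 tolerates=1 failure(s)
	// n=4 quorum=3 tolerates=1 failure(s)  <- even size buys no extra tolerance
	// n=5 quorum=3 tolerates=2 failure(s)
}
```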

Asked at: Databricks, Cloudflare, Netflix, Uber, Atlassian
