Designing a distributed file system (DFS) like HDFS, Google File System, or Amazon S3 is a classic system design interview question that tests your understanding of fault tolerance, data replication, consistency, and large-scale storage architecture.
Core Requirements
Functional: Store and retrieve large files (GBs to TBs), support sequential reads/writes, provide a hierarchical namespace, handle concurrent access, support append operations.
Non-functional: High durability (11 nines = 99.999999999%), fault tolerance (survive rack failures), throughput-optimized (not latency), horizontal scalability to exabytes, cost efficiency via commodity hardware.
Architecture: Master-Chunkserver Model (GFS/HDFS)
Client
│
├─ metadata request ──→ Master/NameNode (single, replicated)
│ ← chunk locations
│
└─ data transfer ─────→ ChunkServer/DataNode (thousands)
↕ replication pipeline
ChunkServer ChunkServer
Master Node Responsibilities
- Namespace management: file → chunk list mapping, stored in memory + write-ahead log
- Chunk placement: decides which chunkservers hold each chunk replica
- Lease management: grants leases to primary replicas for mutations
- Garbage collection: lazy deletion via periodic scans
- Heartbeat processing: chunk reports from all chunkservers
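The master's core state boils down to two in-memory maps plus a write-ahead log. A toy sketch (class and method names are hypothetical, and the log is omitted):

```python
import uuid

class Master:
    """Toy master: namespace (path -> ordered chunk handles) and
    chunk locations (handle -> replica servers), both held in memory.
    A real master also appends every mutation to a write-ahead log."""

    def __init__(self):
        self.namespace = {}        # "/path" -> [handle, ...]
        self.chunk_locations = {}  # handle -> {chunkserver, ...}

    def create_file(self, path):
        self.namespace[path] = []

    def allocate_chunk(self, path, servers):
        handle = uuid.uuid4().hex  # stand-in for the 64-bit chunk handle
        self.namespace[path].append(handle)
        self.chunk_locations[handle] = set(servers)
        return handle

    def lookup(self, path, chunk_index):
        handle = self.namespace[path][chunk_index]
        return handle, self.chunk_locations[handle]

master = Master()
master.create_file("/logs/web-01")
h = master.allocate_chunk("/logs/web-01", ["cs-1", "cs-2", "cs-3"])
handle, replicas = master.lookup("/logs/web-01", 0)
```

Because both maps live in RAM, a lookup is a couple of dictionary reads, which is what lets one master serve thousands of clients.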
Chunk Design
- Fixed 64MB chunks (GFS) or 128MB blocks (HDFS) — large to amortize metadata overhead
- Each chunk has a globally unique 64-bit chunk handle
- Default 3x replication across different racks
- Each replica stored as a plain Linux file on the chunkserver
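Rack-aware placement can be sketched in the spirit of HDFS's default policy; this simplified version assumes at least two racks and two free nodes per rack:

```python
import random

def place_replicas(racks, writer_rack):
    """Rack-aware 3x placement: replica 1 on the writer's rack,
    replicas 2 and 3 on two different nodes of one other rack,
    so a whole-rack failure destroys at most two replicas.
    `racks` maps rack id -> list of chunkserver ids."""
    first = random.choice(racks[writer_rack])
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [first, second, third]

racks = {"rack-1": ["cs-a", "cs-b"], "rack-2": ["cs-c", "cs-d"]}
replicas = place_replicas(racks, writer_rack="rack-1")
```

Putting two replicas on one remote rack (rather than three racks) halves the cross-rack traffic of the write pipeline while still surviving a rack failure.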
Write Path
1. Client → Master: "Write to file F, offset O"
2. Master → Client: Primary chunkserver + secondary replicas
3. Client → All replicas: Push data to TCP pipeline (chain replication)
(data flows to closest replica first, then forwards)
4. Client → Primary: "Write command"
5. Primary assigns serial order, applies mutation, forwards to secondaries
6. Secondaries apply in same serial order, reply to Primary
7. Primary → Client: success (or partial failure)
Key insight: Data flow is decoupled from control flow. Data pipelining over TCP maximizes throughput by using each machine’s full outbound bandwidth.
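Steps 4-7 can be sketched as follows; this toy model (hypothetical classes, no networking or failure handling) shows how the primary's serial numbering keeps every replica's mutation log identical:

```python
class Secondary:
    def __init__(self):
        self.log = []

    def apply(self, serial, data):
        self.log.append((serial, data))

class Primary:
    """Holds the lease for a chunk: assigns each mutation a serial
    number, applies it locally, then forwards the command so all
    replicas apply mutations in the same order."""
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.serial = 0
        self.log = []

    def write(self, data):
        self.serial += 1
        self.log.append((self.serial, data))
        for secondary in self.secondaries:
            secondary.apply(self.serial, data)
        return "success"

secondaries = [Secondary(), Secondary()]
primary = Primary(secondaries)
primary.write(b"rec-1")
primary.write(b"rec-2")
```

The data itself was already buffered at every replica in step 3; only this small ordering command flows through the primary.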
Read Path
1. Client → Master: file name + byte range
2. Master → Client: chunk handle + replica locations (cached by client)
3. Client → Nearest ChunkServer: read chunk at offset
(client picks nearest by network topology)
Clients cache chunk locations for 60 seconds, reducing master load. The master is never in the data path for reads.
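A minimal sketch of that client-side cache, using the 60-second TTL above (the `master_lookup` callback stands in for the metadata RPC):

```python
import time

class ChunkLocationCache:
    """Client-side cache of chunk locations with a 60 s TTL, so
    repeated reads of the same chunk skip the master entirely."""
    TTL = 60.0

    def __init__(self, master_lookup):
        self.master_lookup = master_lookup
        self.cache = {}        # (path, chunk_index) -> (expires_at, locations)
        self.master_calls = 0

    def locations(self, path, chunk_index, now=None):
        now = time.monotonic() if now is None else now
        key = (path, chunk_index)
        entry = self.cache.get(key)
        if entry and entry[0] > now:   # still fresh: no master round-trip
            return entry[1]
        self.master_calls += 1
        locs = self.master_lookup(path, chunk_index)
        self.cache[key] = (now + self.TTL, locs)
        return locs

cache = ChunkLocationCache(lambda path, idx: ["cs-1", "cs-2", "cs-3"])
cache.locations("/logs/web-01", 0, now=0.0)
cache.locations("/logs/web-01", 0, now=30.0)   # cache hit
cache.locations("/logs/web-01", 0, now=90.0)   # TTL expired, re-fetches
```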
Fault Tolerance
ChunkServer Failures
- Master detects failures via missed heartbeats (a timeout of tens of seconds in GFS; HDFS waits ~10.5 minutes by default before declaring a DataNode dead)
- Re-replication triggered when a chunk's live replica count drops below its replication factor (default 3)
- Priority queue: chunks with only 1 replica replicated first
- Bandwidth throttled to not impact client I/O
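The priority queue can be sketched as a heap keyed on live replica count, so chunks one failure away from data loss come out first (a simplification of the real scheduler, which also throttles bandwidth):

```python
import heapq

def rereplication_order(chunk_replicas, target=3):
    """Return under-replicated chunk handles, most endangered first:
    a chunk with a single live replica is one failure from data loss."""
    heap = [(len(replicas), handle)
            for handle, replicas in chunk_replicas.items()
            if len(replicas) < target]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

state = {"chunk-a": {"cs-1", "cs-2"},
         "chunk-b": {"cs-1"},
         "chunk-c": {"cs-1", "cs-2", "cs-3"}}
```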
Master Failure (HDFS)
- HDFS HA: Active + Standby NameNode with shared edit log (NFS or QJM — Quorum Journal Manager)
- QJM: edit log written to 2N+1 journal nodes; majority quorum required
- ZooKeeper-based leader election for failover (~30s failover time)
- DataNodes send block reports to both NameNodes
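The quorum rule itself fits in a few lines: an edit is durable once a majority of journal nodes acknowledge it, so a minority can be down or slow without blocking writes (journal nodes are modeled here as simple callables):

```python
def quorum_write(journal_nodes, edit):
    """Toy QJM rule: an edit is durable once a majority of the
    2N+1 journal nodes acknowledge it."""
    acks = sum(1 for journal in journal_nodes if journal(edit))
    majority = len(journal_nodes) // 2 + 1
    return acks >= majority

# Three journal nodes (N=1): one failure is tolerated, two are not.
healthy = [lambda e: True, lambda e: True, lambda e: False]
degraded = [lambda e: True, lambda e: False, lambda e: False]
```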
Data Integrity
- Each chunk is checksummed in 64 KB blocks (one 32-bit CRC per block), with checksums stored separately from the data
- Verified on every read; corrupt chunks trigger re-replication from a healthy replica
- Background scrubbing: chunkservers scan their chunks during idle periods (HDFS's block scanner verifies every block roughly every three weeks by default)
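A toy version of block-level checksumming, using CRC-32 over 64 KB blocks (real systems also persist the checksums apart from the data and scrub in the background):

```python
import zlib

BLOCK = 64 * 1024   # checksum granularity: one 32-bit CRC per 64 KB

def checksum_chunk(data):
    """Compute a CRC-32 for each 64 KB block of a chunk."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, stored_checksums):
    """Re-compute on read and compare; any mismatch means corruption."""
    return checksum_chunk(data) == stored_checksums

chunk = bytes(200 * 1024)                    # a 200 KB chunk of zeros
sums = checksum_chunk(chunk)                 # 4 checksums cover 200 KB
corrupt = chunk[:100] + b"X" + chunk[101:]   # flip a single byte
```

Checksumming per small block, rather than per chunk, means a read of a few KB only has to verify the blocks it touches.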
Consistency Model
GFS provides a relaxed consistency model:
- Defined: after a successful serial mutation, all clients see exactly the data the mutation wrote
- Consistent but undefined: after concurrent successful mutations, all clients see the same data, but it may be mingled fragments from multiple writers
- Inconsistent (and hence undefined): a failed mutation can leave different clients seeing different data
Applications handle this via checksums, unique identifiers, and idempotent operations (Google MapReduce was designed around GFS’s consistency model).
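A sketch of the reader-side cleanup this implies: writers tag each record with a unique id, and readers drop duplicates left by retried appends along with records that fail validation (the record format here is hypothetical):

```python
def dedup_records(raw_records):
    """Reader-side cleanup under GFS-style relaxed consistency: skip
    records that fail their checksum (padding / partial writes) and
    duplicates left behind by retried appends, identified by unique ids."""
    seen, out = set(), []
    for rec_id, payload, checksum_ok in raw_records:
        if not checksum_ok or rec_id in seen:
            continue
        seen.add(rec_id)
        out.append(payload)
    return out

# (id, payload, checksum_ok): record 1 was appended twice by a retry,
# and one region is unreadable padding.
raw = [(1, "event-a", True), (2, "event-b", True),
       (1, "event-a", True), (3, "", False)]
```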
Object Storage vs Block Storage (S3 Architecture)
S3 Architecture:
PUT /bucket/key → S3 Frontend (stateless, authenticates, routes)
→ Metadata service (consistent key-value store, like DynamoDB)
→ Storage backend (erasure-coded chunks across AZs)
Key differences from GFS/HDFS:
- No filesystem namespace (flat key-value, no directories)
- Historically eventual consistency for overwrite PUTs and deletes; strong read-after-write consistency since December 2020
- Erasure coding (8+4 Reed-Solomon) instead of 3x replication
- Storage separated from compute (unlike HDFS co-location)
Erasure Coding vs Replication
| Approach | Storage Overhead | Durability | Read Latency | Write Latency |
|---|---|---|---|---|
| 3x Replication | 200% | High | Low (any replica) | Low |
| Reed-Solomon 8+4 | 50% | Higher | Higher (decode) | Higher (encode) |
HDFS applies erasure coding to cold data (natively since Hadoop 3; earlier via the HDFS-RAID add-on), keeping 3x replication for hot data where read latency matters.
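To make the trade-off concrete, here is the simplest erasure code: k data blocks plus one XOR parity block, which survives any single loss. Real deployments use full Reed-Solomon (e.g. 8+4) to tolerate multiple losses; this is only the k+1 special case:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

def encode(data_blocks):
    """k data blocks + 1 XOR parity block (single-loss tolerance)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the lost block by XOR-ing all surviving blocks."""
    return xor_blocks([b for i, b in enumerate(stripe) if i != lost_index])

stripe = encode([b"abcd", b"wxyz", b"1234"])   # 3 data blocks + 1 parity
```

At 8+4 the stripe stores 12 units for every 8 units of data (50% overhead), versus 200% extra for 3x replication, but every rebuild requires reading multiple surviving blocks, which is where the higher read latency comes from.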
Capacity Estimation
- 1 billion files × 64 MB avg = 64 PB raw; with 3x replication = 192 PB on disk
- Master metadata: ~200 bytes per chunk; 1 PB / 64 MB ≈ 15.6M chunks ≈ 3 GB of metadata per PB
- At 10 PB: ~156M chunks × 200 bytes ≈ 31 GB of master memory, manageable on one machine; at 64 PB metadata approaches 200 GB, pushing past a single node
- Chunkserver count: 192 PB / 10 TB per server ≈ 19,200 servers
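The metadata arithmetic can be checked in a few lines, assuming 64 MB chunks and ~200 bytes of metadata per chunk:

```python
def master_memory_gb(total_bytes, chunk_bytes=64 * 2**20, md_bytes_per_chunk=200):
    """Back-of-envelope master metadata footprint in GB."""
    n_chunks = total_bytes / chunk_bytes
    return n_chunks * md_bytes_per_chunk / 2**30

PB = 2**50
print(master_memory_gb(1 * PB))    # → 3.125 (≈3 GB per PB)
print(master_memory_gb(10 * PB))   # → 31.25
```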
Scaling Challenges
Master Bottleneck
- HDFS Federation: multiple independent NameNodes, each owning a namespace volume
- ViewFS client-side mount table maps paths to correct NameNode
- Allows linear scaling of namespace capacity and throughput
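Mount-table resolution can be sketched as a longest-prefix match (the table contents here are hypothetical):

```python
MOUNT_TABLE = {          # namespace volume -> owning NameNode
    "/user": "namenode-1",
    "/logs": "namenode-2",
    "/tmp":  "namenode-3",
}

def resolve_namenode(path, table=MOUNT_TABLE):
    """Longest-prefix match of a path against the client-side mount
    table, in the spirit of ViewFS."""
    matches = [m for m in table if path == m or path.startswith(m + "/")]
    if not matches:
        raise KeyError(f"no mount point covers {path}")
    return table[max(matches, key=len)]
```

Because resolution happens entirely on the client, adding a NameNode is a mount-table change, not a data migration through any central component.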
Small File Problem
Many small files (< chunk size) waste master memory (full chunk metadata per file) and create excess chunkserver connections. Solutions:
- HAR files (HDFS Archives): bundle small files into a single HDFS file with index
- SequenceFiles: pack many small files into one large file of key-value records (filename as key, contents as value)
- Application-level compaction (periodic merge jobs)
Interview Discussion Points
- Why large chunk sizes? Reduces master metadata, amortizes TCP connection overhead, enables pipelining
- Why not strong consistency? Network partitions and replica failures make it expensive; GFS workloads (MapReduce, append-only logs) tolerate relaxed consistency
- Trade-off, HDFS vs S3: HDFS gives high throughput, co-located compute, strong consistency, and no egress cost; S3 gives near-unlimited scale, pay-per-use pricing, cross-region access, and simpler operations
- How to detect stale replicas? Chunk version numbers incremented on each lease grant — chunkservers report versions at heartbeat; stale replicas (missed lease) are garbage collected
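The stale-replica check from the last point can be sketched directly (the heartbeat payload is modeled as a dict of reported versions):

```python
def stale_replicas(current_version, heartbeat_report):
    """Replicas reporting an older chunk version missed a lease grant
    while down; the master schedules them for garbage collection."""
    return [server for server, version in heartbeat_report.items()
            if version < current_version]

report = {"cs-1": 7, "cs-2": 6, "cs-3": 7}   # cs-2 was down during a lease grant
```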