Designing a distributed file system (DFS) like HDFS, Google File System, or Amazon S3 is a classic system design interview question that tests your understanding of fault tolerance, data replication, consistency, and large-scale storage architecture.
Core Requirements
Functional: Store and retrieve large files (GBs to TBs), support sequential reads/writes, provide a hierarchical namespace, handle concurrent access, support append operations.
Non-functional: High durability (11 nines = 99.999999999%), fault tolerance (survive rack failures), throughput-optimized (not latency), horizontal scalability to exabytes, cost efficiency via commodity hardware.
Architecture: Master-Chunkserver Model (GFS/HDFS)
Client
│
├─ metadata request ──→ Master/NameNode (single, replicated)
│ ← chunk locations
│
└─ data transfer ─────→ ChunkServer/DataNode (thousands)
↕ replication pipeline
ChunkServer ChunkServer
Master Node Responsibilities
- Namespace management: file → chunk list mapping, stored in memory + write-ahead log
- Chunk placement: decides which chunkservers hold each chunk replica
- Lease management: grants leases to primary replicas for mutations
- Garbage collection: lazy deletion via periodic scans
- Heartbeat processing: chunk reports from all chunkservers
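The master's core state boils down to two in-memory maps plus a write-ahead log. A toy sketch (class and method names are hypothetical, and the log is omitted):

```python
import uuid

class Master:
    """Toy master: namespace (path -> ordered chunk handles) and
    chunk locations (handle -> replica servers), both held in memory.
    A real master also appends every mutation to a write-ahead log."""

    def __init__(self):
        self.namespace = {}        # "/path" -> [handle, ...]
        self.chunk_locations = {}  # handle -> {chunkserver, ...}

    def create_file(self, path):
        self.namespace[path] = []

    def allocate_chunk(self, path, servers):
        handle = uuid.uuid4().hex  # stand-in for the 64-bit chunk handle
        self.namespace[path].append(handle)
        self.chunk_locations[handle] = set(servers)
        return handle

    def lookup(self, path, chunk_index):
        handle = self.namespace[path][chunk_index]
        return handle, self.chunk_locations[handle]

master = Master()
master.create_file("/logs/web-01")
h = master.allocate_chunk("/logs/web-01", ["cs-1", "cs-2", "cs-3"])
handle, replicas = master.lookup("/logs/web-01", 0)
```

Because both maps live in RAM, a lookup is a couple of dictionary reads, which is what lets one master serve thousands of clients.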
Chunk Design
- Fixed 64MB chunks (GFS) or 128MB blocks (HDFS) — large to amortize metadata overhead
- Each chunk has a globally unique 64-bit chunk handle
- Default 3x replication across different racks
- Each replica stored as a plain Linux file on the chunkserver
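Rack-aware placement can be sketched in the spirit of HDFS's default policy; this simplified version assumes at least two racks and two free nodes per rack:

```python
import random

def place_replicas(racks, writer_rack):
    """Rack-aware 3x placement: replica 1 on the writer's rack,
    replicas 2 and 3 on two different nodes of one other rack,
    so a whole-rack failure destroys at most two replicas.
    `racks` maps rack id -> list of chunkserver ids."""
    first = random.choice(racks[writer_rack])
    remote_rack = random.choice([r for r in racks if r != writer_rack])
    second, third = random.sample(racks[remote_rack], 2)
    return [first, second, third]

racks = {"rack-1": ["cs-a", "cs-b"], "rack-2": ["cs-c", "cs-d"]}
replicas = place_replicas(racks, writer_rack="rack-1")
```

Putting two replicas on one remote rack (rather than three racks) halves the cross-rack traffic of the write pipeline while still surviving a rack failure.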
Write Path
1. Client → Master: "Write to file F, offset O"
2. Master → Client: Primary chunkserver + secondary replicas
3. Client → All replicas: Push data to TCP pipeline (chain replication)
(data flows to closest replica first, then forwards)
4. Client → Primary: "Write command"
5. Primary assigns serial order, applies mutation, forwards to secondaries
6. Secondaries apply in same serial order, reply to Primary
7. Primary → Client: success (or partial failure)
Key insight: Data flow is decoupled from control flow. Data pipelining over TCP maximizes throughput by using each machine’s full outbound bandwidth.
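Steps 4-7 can be sketched as follows; this toy model (hypothetical classes, no networking or failure handling) shows how the primary's serial numbering keeps every replica's mutation log identical:

```python
class Secondary:
    def __init__(self):
        self.log = []

    def apply(self, serial, data):
        self.log.append((serial, data))

class Primary:
    """Holds the lease for a chunk: assigns each mutation a serial
    number, applies it locally, then forwards the command so all
    replicas apply mutations in the same order."""
    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.serial = 0
        self.log = []

    def write(self, data):
        self.serial += 1
        self.log.append((self.serial, data))
        for secondary in self.secondaries:
            secondary.apply(self.serial, data)
        return "success"

secondaries = [Secondary(), Secondary()]
primary = Primary(secondaries)
primary.write(b"rec-1")
primary.write(b"rec-2")
```

The data itself was already buffered at every replica in step 3; only this small ordering command flows through the primary.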
Read Path
1. Client → Master: file name + byte range
2. Master → Client: chunk handle + replica locations (cached by client)
3. Client → Nearest ChunkServer: read chunk at offset
(client picks nearest by network topology)
Clients cache chunk locations for 60 seconds, reducing master load. The master is never in the data path for reads.
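A minimal sketch of that client-side cache, using the 60-second TTL above (the `master_lookup` callback stands in for the metadata RPC):

```python
import time

class ChunkLocationCache:
    """Client-side cache of chunk locations with a 60 s TTL, so
    repeated reads of the same chunk skip the master entirely."""
    TTL = 60.0

    def __init__(self, master_lookup):
        self.master_lookup = master_lookup
        self.cache = {}        # (path, chunk_index) -> (expires_at, locations)
        self.master_calls = 0

    def locations(self, path, chunk_index, now=None):
        now = time.monotonic() if now is None else now
        key = (path, chunk_index)
        entry = self.cache.get(key)
        if entry and entry[0] > now:   # still fresh: no master round-trip
            return entry[1]
        self.master_calls += 1
        locs = self.master_lookup(path, chunk_index)
        self.cache[key] = (now + self.TTL, locs)
        return locs

cache = ChunkLocationCache(lambda path, idx: ["cs-1", "cs-2", "cs-3"])
cache.locations("/logs/web-01", 0, now=0.0)
cache.locations("/logs/web-01", 0, now=30.0)   # cache hit
cache.locations("/logs/web-01", 0, now=90.0)   # TTL expired, re-fetches
```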
Fault Tolerance
ChunkServer Failures
- Master detects failures via missed heartbeats (a timeout of tens of seconds in GFS; HDFS waits ~10.5 minutes by default before declaring a DataNode dead)
- Re-replication triggered when a chunk's live replica count drops below its replication factor (default 3)
- Priority queue: chunks with only 1 replica replicated first
- Bandwidth throttled to not impact client I/O
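The priority queue can be sketched as a heap keyed on live replica count, so chunks one failure away from data loss come out first (a simplification of the real scheduler, which also throttles bandwidth):

```python
import heapq

def rereplication_order(chunk_replicas, target=3):
    """Return under-replicated chunk handles, most endangered first:
    a chunk with a single live replica is one failure from data loss."""
    heap = [(len(replicas), handle)
            for handle, replicas in chunk_replicas.items()
            if len(replicas) < target]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

state = {"chunk-a": {"cs-1", "cs-2"},
         "chunk-b": {"cs-1"},
         "chunk-c": {"cs-1", "cs-2", "cs-3"}}
```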
Master Failure (HDFS)
- HDFS HA: Active + Standby NameNode with shared edit log (NFS or QJM — Quorum Journal Manager)
- QJM: edit log written to 2N+1 journal nodes; majority quorum required
- ZooKeeper-based leader election for failover (~30s failover time)
- DataNodes send block reports to both NameNodes
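The quorum rule itself fits in a few lines: an edit is durable once a majority of journal nodes acknowledge it, so a minority can be down or slow without blocking writes (journal nodes are modeled here as simple callables):

```python
def quorum_write(journal_nodes, edit):
    """Toy QJM rule: an edit is durable once a majority of the
    2N+1 journal nodes acknowledge it."""
    acks = sum(1 for journal in journal_nodes if journal(edit))
    majority = len(journal_nodes) // 2 + 1
    return acks >= majority

# Three journal nodes (N=1): one failure is tolerated, two are not.
healthy = [lambda e: True, lambda e: True, lambda e: False]
degraded = [lambda e: True, lambda e: False, lambda e: False]
```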
Data Integrity
- Each chunk is checksummed in 64 KB blocks (one 32-bit CRC per block), with checksums stored separately from the data
- Verified on every read; corrupt chunks trigger re-replication from a healthy replica
- Background scrubbing: chunkservers scan their chunks during idle periods (HDFS's block scanner verifies every block roughly every three weeks by default)
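A toy version of block-level checksumming, using CRC-32 over 64 KB blocks (real systems also persist the checksums apart from the data and scrub in the background):

```python
import zlib

BLOCK = 64 * 1024   # checksum granularity: one 32-bit CRC per 64 KB

def checksum_chunk(data):
    """Compute a CRC-32 for each 64 KB block of a chunk."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, stored_checksums):
    """Re-compute on read and compare; any mismatch means corruption."""
    return checksum_chunk(data) == stored_checksums

chunk = bytes(200 * 1024)                    # a 200 KB chunk of zeros
sums = checksum_chunk(chunk)                 # 4 checksums cover 200 KB
corrupt = chunk[:100] + b"X" + chunk[101:]   # flip a single byte
```

Checksumming per small block, rather than per chunk, means a read of a few KB only has to verify the blocks it touches.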
Consistency Model
GFS provides a relaxed consistency model:
- Defined: after a successful serial mutation, all clients see exactly the data the mutation wrote
- Consistent but undefined: after concurrent successful mutations, all clients see the same data, but it may be mingled fragments from multiple writers
- Inconsistent (and hence undefined): a failed mutation can leave different clients seeing different data
Applications handle this via checksums, unique identifiers, and idempotent operations (Google MapReduce was designed around GFS’s consistency model).
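A sketch of the reader-side cleanup this implies: writers tag each record with a unique id, and readers drop duplicates left by retried appends along with records that fail validation (the record format here is hypothetical):

```python
def dedup_records(raw_records):
    """Reader-side cleanup under GFS-style relaxed consistency: skip
    records that fail their checksum (padding / partial writes) and
    duplicates left behind by retried appends, identified by unique ids."""
    seen, out = set(), []
    for rec_id, payload, checksum_ok in raw_records:
        if not checksum_ok or rec_id in seen:
            continue
        seen.add(rec_id)
        out.append(payload)
    return out

# (id, payload, checksum_ok): record 1 was appended twice by a retry,
# and one region is unreadable padding.
raw = [(1, "event-a", True), (2, "event-b", True),
       (1, "event-a", True), (3, "", False)]
```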
Object Storage vs Block Storage (S3 Architecture)
S3 Architecture:
PUT /bucket/key → S3 Frontend (stateless, authenticates, routes)
→ Metadata service (consistent key-value store, like DynamoDB)
→ Storage backend (erasure-coded chunks across AZs)
Key differences from GFS/HDFS:
- No filesystem namespace (flat key-value, no directories)
- Historically eventual consistency for overwrite PUTs and deletes; strong read-after-write consistency since December 2020
- Erasure coding (8+4 Reed-Solomon) instead of 3x replication
- Storage separated from compute (unlike HDFS co-location)
Erasure Coding vs Replication
| Approach | Storage Overhead | Durability | Read Latency | Write Latency |
|---|---|---|---|---|
| 3x Replication | 200% | High | Low (any replica) | Low |
| Reed-Solomon 8+4 | 50% | Higher | Higher (decode) | Higher (encode) |
HDFS applies erasure coding to cold data (natively since Hadoop 3; earlier via the HDFS-RAID add-on), keeping 3x replication for hot data where read latency matters.
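To make the trade-off concrete, here is the simplest erasure code: k data blocks plus one XOR parity block, which survives any single loss. Real deployments use full Reed-Solomon (e.g. 8+4) to tolerate multiple losses; this is only the k+1 special case:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks))

def encode(data_blocks):
    """k data blocks + 1 XOR parity block (single-loss tolerance)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def recover(stripe, lost_index):
    """Rebuild the lost block by XOR-ing all surviving blocks."""
    return xor_blocks([b for i, b in enumerate(stripe) if i != lost_index])

stripe = encode([b"abcd", b"wxyz", b"1234"])   # 3 data blocks + 1 parity
```

At 8+4 the stripe stores 12 units for every 8 units of data (50% overhead), versus 200% extra for 3x replication, but every rebuild requires reading multiple surviving blocks, which is where the higher read latency comes from.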
Capacity Estimation
- 1 billion files × 64 MB avg = 64 PB raw; with 3x replication = 192 PB on disk
- Master metadata: ~200 bytes per chunk; 1 PB / 64 MB ≈ 15.6M chunks ≈ 3 GB of metadata per PB
- At 10 PB: ~156M chunks × 200 bytes ≈ 31 GB of master memory, manageable on one machine; at 64 PB metadata approaches 200 GB, pushing past a single node
- Chunkserver count: 192 PB / 10 TB per server ≈ 19,200 servers
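The metadata arithmetic can be checked in a few lines, assuming 64 MB chunks and ~200 bytes of metadata per chunk:

```python
def master_memory_gb(total_bytes, chunk_bytes=64 * 2**20, md_bytes_per_chunk=200):
    """Back-of-envelope master metadata footprint in GB."""
    n_chunks = total_bytes / chunk_bytes
    return n_chunks * md_bytes_per_chunk / 2**30

PB = 2**50
print(master_memory_gb(1 * PB))    # → 3.125 (≈3 GB per PB)
print(master_memory_gb(10 * PB))   # → 31.25
```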
Scaling Challenges
Master Bottleneck
- HDFS Federation: multiple independent NameNodes, each owning a namespace volume
- ViewFS client-side mount table maps paths to correct NameNode
- Allows linear scaling of namespace capacity and throughput
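Mount-table resolution can be sketched as a longest-prefix match (the table contents here are hypothetical):

```python
MOUNT_TABLE = {          # namespace volume -> owning NameNode
    "/user": "namenode-1",
    "/logs": "namenode-2",
    "/tmp":  "namenode-3",
}

def resolve_namenode(path, table=MOUNT_TABLE):
    """Longest-prefix match of a path against the client-side mount
    table, in the spirit of ViewFS."""
    matches = [m for m in table if path == m or path.startswith(m + "/")]
    if not matches:
        raise KeyError(f"no mount point covers {path}")
    return table[max(matches, key=len)]
```

Because resolution happens entirely on the client, adding a NameNode is a mount-table change, not a data migration through any central component.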
Small File Problem
Many small files (< chunk size) waste master memory (full chunk metadata per file) and create excess chunkserver connections. Solutions:
- HAR files (HDFS Archives): bundle small files into a single HDFS file with index
- SequenceFiles: pack many small files into one large file of key-value records (filename as key, contents as value)
- Application-level compaction (periodic merge jobs)
Interview Discussion Points
- Why large chunk sizes? Reduces master metadata, amortizes TCP connection overhead, enables pipelining
- Why not strong consistency? Network partitions and replica failures make it expensive; GFS workloads (MapReduce, append-only logs) tolerate relaxed consistency
- Trade-off, HDFS vs S3: HDFS gives high throughput, co-located compute, strong consistency, and no egress cost; S3 gives near-unlimited scale, pay-per-use pricing, cross-region access, and simpler operations
- How to detect stale replicas? Chunk version numbers incremented on each lease grant — chunkservers report versions at heartbeat; stale replicas (missed lease) are garbage collected
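The stale-replica check from the last point can be sketched directly (the heartbeat payload is modeled as a dict of reported versions):

```python
def stale_replicas(current_version, heartbeat_report):
    """Replicas reporting an older chunk version missed a lease grant
    while down; the master schedules them for garbage collection."""
    return [server for server, version in heartbeat_report.items()
            if version < current_version]

report = {"cs-1": 7, "cs-2": 6, "cs-3": 7}   # cs-2 was down during a lease grant
```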