What Is a Distributed File System?
A distributed file system (DFS) stores files across many machines while exposing them as a single unified namespace. The Google File System (GFS, 2003) and Hadoop HDFS (2006, an open-source implementation of the GFS design) are the foundational systems. Key insight: commodity hardware fails constantly at scale, so the file system must treat failure as the norm, not the exception. GFS stores petabytes on thousands of cheap servers.
Architecture: Master-Worker (NameNode-DataNode)
NameNode (Master)
Single master node (with a standby for HA) that stores all metadata: the file namespace and directory tree, the file → block mapping, and block → DataNode locations. Metadata lives entirely in memory for fast access and is persisted to disk via an edit log plus periodic checkpoints (FsImage). Operations: create file, delete file, open file, rename. Does NOT store file data, only metadata. A single master simplifies the design and the consistency model but requires careful capacity planning: at roughly 150 bytes of heap per namespace object (file, directory, or block), metadata for hundreds of millions of files runs to tens of GB.
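To make the metadata layout concrete, here is a minimal sketch of the in-memory maps a NameNode maintains. The class and field names are illustrative, not the real HDFS Java classes.

```python
from dataclasses import dataclass, field

# Illustrative sketch of NameNode in-memory metadata (hypothetical names,
# not the actual HDFS implementation). Note: no block contents anywhere.

@dataclass
class BlockInfo:
    block_id: int
    length: int                                           # bytes written so far
    locations: list[str] = field(default_factory=list)    # DataNode IDs, rebuilt from block reports

@dataclass
class INodeFile:
    path: str                                              # e.g. "/logs/2024/part-0001"
    replication: int                                       # target replica count (default 3)
    block_size: int                                        # e.g. 128 * 1024 * 1024
    blocks: list[BlockInfo] = field(default_factory=list)

class Namespace:
    """Maps file path -> inode; the directory tree is implied by path prefixes here."""
    def __init__(self):
        self.files: dict[str, INodeFile] = {}

    def create(self, path: str, block_size: int = 128 * 1024 * 1024, replication: int = 3) -> INodeFile:
        inode = INodeFile(path, replication, block_size)
        self.files[path] = inode   # a real NameNode also appends this op to the edit log
        return inode
```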
DataNode (Worker)
Hundreds to thousands of DataNodes. Each stores file blocks as ordinary files on its local disks. Each periodically sends a heartbeat to the NameNode (liveness) and a block report (the list of blocks it holds). On failure: NameNode stops receiving heartbeats → marks the DataNode dead → re-replicates its blocks from surviving replicas on other DataNodes.
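The two messages look roughly like this. The sketch below uses plain dicts for readability; real HDFS uses Hadoop RPC with protobuf messages, so the field names are assumptions.

```python
import time

# Hypothetical shapes of the two DataNode -> NameNode messages.

def heartbeat(datanode_id: str) -> dict:
    """Lightweight liveness signal, sent every few seconds."""
    return {"type": "heartbeat", "datanode": datanode_id, "ts": time.time()}

def block_report(datanode_id: str, local_block_ids: list[int]) -> dict:
    """Full list of blocks this DataNode holds. The NameNode rebuilds its
    block -> locations map from these reports; that map is never persisted."""
    return {"type": "block_report", "datanode": datanode_id, "blocks": local_block_ids}
```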
Block Storage
Files are split into fixed-size blocks (HDFS default: 128 MB; GFS: 64 MB). Large blocks reduce metadata overhead: 1 PB of data in 128 MB blocks is about 8 million blocks versus roughly 1 billion blocks at 1 MB. Each block is replicated 3× (the default replication factor) across different DataNodes, with rack awareness: replicas are spread across 2 different racks so that a whole-rack failure cannot lose all copies.
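A quick back-of-the-envelope check of that claim. The ~150 bytes of NameNode heap per block object is a common rule of thumb, not an exact HDFS figure.

```python
# Why block size matters for NameNode memory (rough sketch).
PB = 10**15
MB = 10**6
BYTES_PER_BLOCK_OBJECT = 150   # assumption: rough heap cost per block entry

def block_count(data_bytes: int, block_size: int) -> int:
    return -(-data_bytes // block_size)   # ceiling division

for block_size in (128 * MB, 1 * MB):
    n = block_count(1 * PB, block_size)
    heap_gb = n * BYTES_PER_BLOCK_OBJECT / 10**9
    print(f"{block_size // MB} MB blocks: {n:,} blocks, ~{heap_gb:.1f} GB of NameNode heap")

# 128 MB blocks: 7,812,500 blocks, ~1.2 GB of NameNode heap
# 1 MB blocks: 1,000,000,000 blocks, ~150.0 GB of NameNode heap
```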
Read Path
- Client asks NameNode for block locations for the file
- NameNode returns: [block_id → [DataNode1, DataNode2, DataNode3] for each block]
- Client reads directly from the nearest DataNode (by rack proximity)
- Client streams blocks sequentially — no further NameNode involvement for data transfer
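A minimal sketch of that flow, assuming a hypothetical client API (`get_block_locations`, `read_block`) rather than the real org.apache.hadoop.fs.FileSystem interface:

```python
# One metadata RPC, then data flows DataNode -> client with no NameNode involvement.

def read_file(namenode, path: str, pick_closest) -> bytes:
    # 1. Ask the NameNode where every block of the file lives, in file order,
    #    e.g. [(block_id, ["dn1", "dn2", "dn3"]), ...].
    located_blocks = namenode.get_block_locations(path)

    data = bytearray()
    for block_id, replicas in located_blocks:
        # 2. Pick the closest replica (same node > same rack > remote rack).
        dn = pick_closest(replicas)
        # 3. Stream the block straight from that DataNode; the NameNode is
        #    never on the data path.
        data += dn.read_block(block_id)
    return bytes(data)
```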
Write Path
- Client asks NameNode to create the file → NameNode returns the pipeline of 3 DataNodes
- Client streams data to DataNode1; DataNode1 forwards to DataNode2; DataNode2 to DataNode3 (pipeline replication)
- Each DataNode acknowledges the write back up the chain; the client waits for all 3 ACKs
- On block completion: client informs NameNode of successful write
Pipeline replication: the client sends each block to one DataNode rather than three, which keeps client outbound bandwidth at 1× the data size; the DataNodes form a chain and forward the data among themselves.
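A sketch of one block going through the pipeline. The `datanodes` list is the pipeline the NameNode returned; the method names (`receive`, `wait_for_acks`) are hypothetical stand-ins for the real streaming protocol.

```python
CHUNK = 64 * 1024   # stream the block in small packets rather than one giant write

def write_block(block_id: int, data: bytes, datanodes: list) -> None:
    dn1, dn2, dn3 = datanodes
    # The client opens a stream only to the first DataNode in the pipeline.
    for off in range(0, len(data), CHUNK):
        packet = data[off:off + CHUNK]
        # dn1 writes the packet locally and forwards it to dn2; dn2 writes and
        # forwards to dn3. ACKs travel back dn3 -> dn2 -> dn1 -> client.
        dn1.receive(block_id, packet, forward_to=[dn2, dn3])
    acks = dn1.wait_for_acks(block_id, expected=3)
    assert acks == 3, "write not fully replicated"
```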
Fault Tolerance
- DataNode failure: DataNodes heartbeat every 3 s by default; after ~30 s without one the node is treated as stale (avoided for new reads and writes), and after ~10 minutes it is marked dead. The NameNode then decrements the replica count for every block that node held and triggers re-replication from the surviving replicas
- Data corruption: blocks are stored with CRC checksums. The client verifies checksums on read; a corrupted block is reported to the NameNode and the data is re-fetched from another replica
- NameNode failure: Standby NameNode maintains a synchronized edit log via a shared journal (QJM — Quorum Journal Manager). On primary failure, standby takes over within seconds
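A sketch of the read-side corruption check. For simplicity it assumes one CRC32 per block; real HDFS checksums every 512-byte chunk with CRC32C, and the replica/NameNode methods here are hypothetical.

```python
import zlib

def read_block_verified(block_id: int, replicas: list, namenode) -> bytes:
    for dn in replicas:
        data, stored_crc = dn.read_block_with_checksum(block_id)
        if zlib.crc32(data) == stored_crc:
            return data
        # Checksum mismatch: report it so the NameNode can re-replicate the
        # block from a healthy copy, then fall through to the next replica.
        namenode.report_bad_block(block_id, dn)
    raise IOError(f"all replicas of block {block_id} are corrupt")
```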
HDFS vs. Object Storage (S3)
| Feature | HDFS | S3 |
|---|---|---|
| Locality | Compute on same nodes as data (Spark/Hadoop locality) | Compute and storage separated — network hop required |
| Throughput | Very high for sequential reads (local disk) | High but network-bound |
| Mutation | Append-only (no in-place edits to existing files) | Object immutable, PUT new version |
| Management | Self-managed cluster | Fully managed by AWS |
Interview Tips
- NameNode = metadata only, DataNode = data only. Clear separation of concerns.
- Large block size reduces NameNode memory pressure — critical design choice.
- Pipeline replication: client → DN1 → DN2 → DN3 reduces client write bandwidth.
- Modern trend: use S3/GCS as storage with Spark/Flink doing compute separately (S3+Spark beats HDFS operationally for most new deployments).