System Design Interview: Design a Distributed File System (HDFS / GFS)

What Is a Distributed File System?

A distributed file system (DFS) stores files across many machines, exposing them as a single unified namespace. Google File System (GFS, 2003) and Hadoop HDFS (2006, open-source GFS implementation) are the foundational designs. Key insight: commodity hardware fails constantly at scale, so the file system must treat failure as the norm, not the exception. GFS stores petabytes on thousands of cheap servers.

  • Twitter Interview Guide
  • LinkedIn Interview Guide
  • Uber Interview Guide
  • Netflix Interview Guide
  • Cloudflare Interview Guide
  • Databricks Interview Guide
  • Architecture: Master-Worker (NameNode-DataNode)

    NameNode (Master)

    Single master node (with standby for HA) that stores all metadata: file namespace, directory tree, file → block mapping, block → DataNode locations. Entirely in-memory for fast access. Metadata persisted to disk via edit log and periodic checkpoints (FsImage). Operations: create file, delete file, open file, rename. Does NOT store file data — only metadata. Single master simplifies design and consistency but requires careful capacity planning (metadata can grow to tens of GB for billions of files).

    DataNode (Worker)

    Hundreds to thousands of DataNodes. Each stores file blocks as local files on disk. Periodically sends heartbeat to NameNode (alive status) and block report (which blocks it holds). On failure: NameNode detects missing heartbeat → marks DataNode dead → re-replicates its blocks from other DataNodes.

    Block Storage

    Files are split into fixed-size blocks (HDFS default: 128MB, GFS: 64MB). Large blocks reduce metadata overhead: a 1PB file with 128MB blocks = 8 million blocks vs. 1 trillion blocks at 1MB. Each block replicated 3x (default replication factor) across different DataNodes, with rack awareness: replicas placed on 2 different racks to tolerate rack failure.

    Read Path

    1. Client asks NameNode for block locations for the file
    2. NameNode returns: [block_id → [DataNode1, DataNode2, DataNode3] for each block]
    3. Client reads directly from the nearest DataNode (by rack proximity)
    4. Client streams blocks sequentially — no further NameNode involvement for data transfer

    Write Path

    1. Client asks NameNode to create the file → NameNode returns the pipeline of 3 DataNodes
    2. Client streams data to DataNode1; DataNode1 forwards to DataNode2; DataNode2 to DataNode3 (pipeline replication)
    3. Each DataNode acknowledges write upstream; client waits for all 3 ACKs
    4. On block completion: client informs NameNode of successful write

    Pipeline replication: client sends to one DataNode, not three — reduces client bandwidth usage. DataNodes form a chain for replication.

    Fault Tolerance

    • DataNode failure: NameNode detects missing heartbeat after 30s, decrements replication count for all blocks on that node, triggers re-replication from surviving replicas
    • Data corruption: blocks stored with CRC checksum. Client verifies checksum on read; corrupted block is reported to NameNode, re-fetched from another replica
    • NameNode failure: Standby NameNode maintains a synchronized edit log via a shared journal (QJM — Quorum Journal Manager). On primary failure, standby takes over within seconds

    HDFS vs. Object Storage (S3)

    Feature HDFS S3
    Locality Compute on same nodes as data (Spark/Hadoop locality) Compute and storage separated — network hop required
    Throughput Very high for sequential reads (local disk) High but network-bound
    Mutation Append-only (no in-place edits to existing files) Object immutable, PUT new version
    Management Self-managed cluster Fully managed by AWS

    Interview Tips

    • NameNode = metadata only, DataNode = data only. Clear separation of concerns.
    • Large block size reduces NameNode memory pressure — critical design choice.
    • Pipeline replication: client → DN1 → DN2 → DN3 reduces client write bandwidth.
    • Modern trend: use S3/GCS as storage with Spark/Flink doing compute separately (S3+Spark beats HDFS operationally for most new deployments).

    {
    “@context”: “https://schema.org”,
    “@type”: “FAQPage”,
    “mainEntity”: [
    {
    “@type”: “Question”,
    “name”: “What is the role of the NameNode in HDFS and what are its failure modes?”,
    “acceptedAnswer”: { “@type”: “Answer”, “text”: “The HDFS NameNode is the single master that maintains all filesystem metadata: the directory tree, file-to-block mapping, and block-to-DataNode locations. All metadata is kept in memory for fast access (~200 bytes per file/block). Persistence: edits are written to an edit log on disk, and periodic checkpoints (FsImage) snapshot the full metadata state. Failure modes: (1) NameNode crash: without HA, HDFS goes down completely. Data on DataNodes is safe but inaccessible. (2) Memory exhaustion: very large namespaces (billions of files) can exhaust NameNode heap. Mitigate by using large block sizes (128MB reduces block count) and federation (multiple NameNodes each owning a namespace partition). (3) Edit log corruption: if the edit log is corrupted, metadata may be lost for recent operations — edit log is replicated to multiple directories (local + NFS) for redundancy. HDFS HA solution: two NameNodes (Active and Standby) sharing a Quorum Journal Manager (QJM — 3+ JournalNodes). Standby continuously replays the shared edit log; on Active failure, Standby takes over in seconds.” }
    },
    {
    “@type”: “Question”,
    “name”: “How does rack-aware replication work in HDFS?”,
    “acceptedAnswer”: { “@type”: “Answer”, “text”: “HDFS replicates each block 3 times (default replication factor = 3) with rack awareness to tolerate both node and rack failures. Default placement policy: (1) First replica on the same node as the writer (or a random node if the client is not in the cluster). (2) Second replica on a different rack from the first. (3) Third replica on the same rack as the second, but on a different node. Why this placement: if an entire rack loses power or network connectivity, at least one replica (the first) is on a different rack and remains accessible. The second and third replicas being on the same rack reduces cross-rack bandwidth for replication (only one cross-rack transfer per block instead of two). Trade-off: this policy tolerates failure of one rack. For three-rack clusters: place one replica per rack for maximum fault tolerance, but this doubles cross-rack replication bandwidth. HDFS administrator configures rack topology via a script that maps IP addresses to rack identifiers.” }
    },
    {
    “@type”: “Question”,
    “name”: “What is Gorilla compression for time series data and how does HDFS block storage compare to it?”,
    “acceptedAnswer”: { “@type”: “Answer”, “text”: “These are distinct concepts addressing different problems. Gorilla (Facebook's TSDB, 2015) is a compression algorithm for time series data: delta-delta encoding for timestamps (most successive timestamps differ by the same amount, so the second-order delta is often zero or very small), and XOR encoding for float values (successive measurements often have similar bits — XOR captures the difference compactly). Achieves ~1.37 bytes per sample vs. 16 bytes raw. Used by Prometheus TSDB and Facebook's internal metrics system. HDFS block storage is a general-purpose distributed storage with fixed 128MB blocks, pipeline replication, and CRC checksumming. It doesn't apply domain-specific compression to data — it stores raw bytes (though files themselves can be stored in compressed formats like Snappy or LZ4 which Spark/Parquet use). The distinction: Gorilla is application-level compression optimized for time series access patterns; HDFS is infrastructure-level storage optimized for large sequential reads/writes. In practice, time series systems (like Thanos storing Prometheus blocks on S3/HDFS) apply Gorilla compression first, then store the compressed blocks in object/distributed storage.” }
    }
    ]
    }

    Scroll to Top