System Design Interview: Design an Object Storage System (Amazon S3)

What Is an Object Storage System?

Object storage (Amazon S3, Google Cloud Storage) stores arbitrary-size files (objects) in named buckets. Unlike a filesystem (hierarchical directories), objects are flat key-value pairs: bucket/key → bytes. Objects are immutable — you write a new version, not modify in place. Amazon S3 stores trillions of objects, exabytes of data.
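The flat bucket/key model and write-new-version semantics can be sketched in a few lines. This is a toy in-memory store; the class and method names are illustrative, not S3's API:

```python
# Toy in-memory object store illustrating the flat bucket/key namespace
# and immutable, versioned writes (illustrative names, not S3 internals).

class ObjectStore:
    def __init__(self):
        # Flat namespace: (bucket, key) -> list of immutable versions.
        self._objects = {}

    def put(self, bucket, key, data):
        """Writes never modify in place: each PUT appends a new version."""
        versions = self._objects.setdefault((bucket, key), [])
        versions.append(bytes(data))
        return len(versions) - 1  # version_id

    def get(self, bucket, key, version=None):
        """Latest version by default, or a specific version_id."""
        versions = self._objects[(bucket, key)]
        return versions[-1 if version is None else version]

store = ObjectStore()
v0 = store.put("photos", "2024/cat.jpg", b"v1 bytes")
store.put("photos", "2024/cat.jpg", b"v2 bytes")
assert store.get("photos", "2024/cat.jpg") == b"v2 bytes"      # latest
assert store.get("photos", "2024/cat.jpg", version=v0) == b"v1 bytes"
```

Note there is no directory hierarchy: "2024/" in the key is just part of the string, not a folder.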

System Requirements

    Functional

    • PUT object: upload bytes, return URL
    • GET object: download bytes by bucket/key
    • DELETE object
    • Multipart upload for large objects (>5GB)
    • Versioning: multiple versions of the same key
    • Lifecycle policies: auto-delete or archive after N days

    Non-Functional

    • Durability: 99.999999999% (11 nines), S3's published design target (the S3 SLA covers availability, not durability)
    • Availability: 99.99%
    • Throughput: terabytes/second aggregate

    Architecture

    Metadata Service

    Stores object metadata: bucket, key, size, content-type, owner, checksum (MD5/SHA-256), version_id, and storage_location (which data nodes hold the chunks). Backed by a strongly consistent distributed database (DynamoDB or a custom sharded MySQL), partitioned by a hash of bucket+key. On PUT: write the data chunks first, then commit the metadata, so the metadata never points at data that does not exist yet. On GET: read the metadata to find the data location, then stream the chunks.
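    A sketch of the metadata record and the partitioning step, assuming a fixed partition count and SHA-256 over bucket/key. The field names and `partition_for` helper are illustrative, not S3's actual schema:

```python
import hashlib
from dataclasses import dataclass, field

# Illustrative metadata record; field names are assumptions, not S3's schema.
@dataclass
class ObjectMeta:
    bucket: str
    key: str
    size: int
    content_type: str
    checksum: str            # hex SHA-256 of the object bytes
    version_id: int
    chunk_locations: list = field(default_factory=list)  # [(chunk_id, [node_ids])]

NUM_PARTITIONS = 256  # assumed shard count for the metadata store

def partition_for(bucket: str, key: str) -> int:
    """Hash bucket+key to pick a metadata partition (shard) deterministically."""
    digest = hashlib.sha256(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

p = partition_for("photos", "2024/cat.jpg")
assert 0 <= p < NUM_PARTITIONS
```

Hashing by bucket+key spreads hot buckets across shards, at the cost of making ordered key listing within a bucket a scatter-gather operation.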

    Data Nodes

    Objects are split into chunks (typically 64MB each for large objects). Each chunk is replicated 3x across data nodes in different availability zones (cross-AZ replication). Chunks stored as flat files on disk — no filesystem abstractions needed beyond a local key-value store. Data nodes expose a simple HTTP API: PUT /chunk/{id}, GET /chunk/{id}.
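    A minimal sketch of chunking and cross-AZ replica placement, with chunk sizes scaled down from 64MB and an assumed node-to-AZ map (`NODES` and `place_replicas` are made-up names):

```python
# Sketch: split an object into fixed-size chunks and place 3 replicas of
# each chunk on nodes in distinct availability zones. Sizes are scaled
# down (4 bytes here vs ~64MB in practice); names are illustrative.

CHUNK_SIZE = 4

def split_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Assumed cluster map: node id -> availability zone.
NODES = {"n1": "az-a", "n2": "az-a", "n3": "az-b", "n4": "az-c"}

def place_replicas(chunk_index: int, replicas: int = 3):
    """Pick one node per AZ so no two replicas of a chunk share an AZ."""
    by_az = {}
    for node, az in sorted(NODES.items()):
        by_az.setdefault(az, []).append(node)
    azs = sorted(by_az)
    # Rotate the starting AZ by chunk index to spread load across zones.
    return [by_az[azs[(chunk_index + i) % len(azs)]][0] for i in range(replicas)]

chunks = split_chunks(b"hello world!")
assert chunks == [b"hell", b"o wo", b"rld!"]
assert len({NODES[n] for n in place_replicas(0)}) == 3  # 3 distinct AZs
```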

    Durability via Erasure Coding

    For cost-efficient durability, replace 3x replication with erasure coding (Reed-Solomon). Split an object into k data chunks and compute m parity chunks; any k of the k+m chunks can reconstruct the full object, tolerating up to m failures. A representative scheme is 6+3: 6 data chunks and 3 parity chunks spread across 9 data nodes, so any 3 nodes can fail with the data still fully recoverable. Storage overhead is 9/6 = 1.5x vs. 3x for replication: 50% storage savings at the cost of extra CPU for encoding and decoding.
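    The any-k-of-k+m reconstruction property is easy to demonstrate in the m=1 special case, where the single parity chunk is just the XOR of the k data chunks. Real systems use Reed-Solomon over GF(256) to get m > 1 parity chunks; this toy keeps the same idea:

```python
# Toy erasure code: the m=1 special case, where one parity chunk is the
# XOR of the k data chunks. Any one lost chunk (data or parity) can be
# rebuilt from the remaining k. Full Reed-Solomon over GF(256) generalizes
# this to m > 1 parity chunks (e.g. the 6+3 scheme in the text).

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_chunks):
    """Return the k data chunks plus one XOR parity chunk (k+1 total)."""
    parity = data_chunks[0]
    for chunk in data_chunks[1:]:
        parity = xor_bytes(parity, chunk)
    return list(data_chunks) + [parity]

def reconstruct(chunks, lost_index):
    """Rebuild the chunk at lost_index by XOR-ing the k survivors."""
    survivors = [c for i, c in enumerate(chunks) if i != lost_index]
    rebuilt = survivors[0]
    for chunk in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, chunk)
    return rebuilt

data = [b"aaaa", b"bbbb", b"cccc"]        # k = 3 data chunks
stored = encode(data)                      # 4 chunks on 4 nodes
assert reconstruct(stored, 1) == b"bbbb"   # lost a data chunk: recovered
assert reconstruct(stored, 3) == stored[3] # lost the parity chunk: recovered
```

Overhead here is 4/3 ≈ 1.33x for tolerating one failure; the 6+3 scheme pays 1.5x to tolerate three.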

    Multipart Upload

    For large files (GB–TB): split into parts, upload concurrently, complete when all parts arrive.

    1. CreateMultipartUpload → returns upload_id
    2. UploadPart(upload_id, part_number, bytes) for each part (min 5MB each, except the last)
    3. CompleteMultipartUpload(upload_id, [part_number, etag] list) → atomically commits

    Benefits: resume on failure (only re-upload failed parts), parallel upload from multiple threads, no single-connection bandwidth bottleneck.
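    The three-step flow above can be sketched against a toy in-memory "server". The function names here are illustrative stand-ins for the real CreateMultipartUpload / UploadPart / CompleteMultipartUpload calls:

```python
import hashlib
import uuid

# Toy multipart protocol. UPLOADS stands in for server-side staged state;
# create_upload / upload_part / complete are illustrative names.

UPLOADS = {}  # upload_id -> {part_number: (etag, bytes)}

def create_upload() -> str:
    """Step 1: open a multipart upload, return its id."""
    upload_id = uuid.uuid4().hex
    UPLOADS[upload_id] = {}
    return upload_id

def upload_part(upload_id: str, part_number: int, data: bytes) -> str:
    """Step 2: stage one part; the returned ETag identifies its content."""
    etag = hashlib.md5(data).hexdigest()
    UPLOADS[upload_id][part_number] = (etag, data)
    return etag

def complete(upload_id: str, parts):
    """Step 3: verify every (part_number, etag), then atomically assemble."""
    staged = UPLOADS.pop(upload_id)
    for number, etag in parts:
        if staged.get(number, (None,))[0] != etag:
            raise ValueError(f"part {number} missing or etag mismatch")
    return b"".join(staged[n][1] for n, _ in sorted(parts))

uid = create_upload()
parts = [(i + 1, upload_part(uid, i + 1, chunk))
         for i, chunk in enumerate([b"part-one|", b"part-two|", b"part-three"])]
assert complete(uid, parts) == b"part-one|part-two|part-three"
```

Because parts are staged independently, a failed part is simply re-uploaded under the same part_number; nothing is visible until complete() commits.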

    Consistency Model

    S3 now offers strong read-after-write consistency: after a successful PUT, subsequent GET operations will return the new object. Before 2020, S3 provided eventual consistency for new objects. Strong consistency is implemented by routing all requests for the same key through a consistent hash ring to the same primary node (or using conditional writes with a distributed lock).
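    A minimal consistent hash ring showing the routing idea: every request for the same bucket/key lands on the same primary, so reads and writes for that key serialize through one node. This is a sketch of the general technique, not S3's actual mechanism (which AWS has not published in detail); virtual nodes are assumed to smooth the distribution:

```python
import bisect
import hashlib

# Minimal consistent hash ring with virtual nodes. Illustrative sketch.

class HashRing:
    def __init__(self, nodes, vnodes=64):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for v in range(vnodes):
                self._ring.append((self._hash(f"{node}#{v}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def primary_for(self, bucket: str, key: str) -> str:
        """Walk clockwise from the key's hash to the first virtual node."""
        h = self._hash(f"{bucket}/{key}")
        i = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
# The same key always routes to the same primary node:
assert ring.primary_for("photos", "cat.jpg") == ring.primary_for("photos", "cat.jpg")
```

Consistent hashing also means adding or removing a node only remaps keys adjacent to it on the ring, not the whole keyspace.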

    Caching and CDN

    S3 itself is not a CDN — it serves from a single region. For global low-latency access: configure CloudFront (CDN) in front of S3. Edge PoPs cache objects near users. TTL controlled by Cache-Control headers on the S3 object. For frequently-accessed objects (images, JS bundles), 95%+ of traffic served from CDN edge without hitting S3.
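    A toy TTL-honoring edge cache illustrating how Cache-Control: max-age controls how long the edge serves without touching the origin. ORIGIN stands in for S3, and all names here are illustrative:

```python
import time

# Toy edge cache honoring per-object TTLs, as a CDN does with
# Cache-Control: max-age. ORIGIN stands in for S3; names are illustrative.

ORIGIN = {("assets", "app.js"): (b"console.log('v1')", 3600)}  # (bytes, max-age)
origin_hits = 0

def origin_get(bucket, key):
    """Simulated fetch from S3; counts how often the origin is hit."""
    global origin_hits
    origin_hits += 1
    return ORIGIN[(bucket, key)]

cache = {}  # (bucket, key) -> (bytes, expires_at)

def edge_get(bucket, key, now=None):
    now = time.time() if now is None else now
    entry = cache.get((bucket, key))
    if entry and entry[1] > now:
        return entry[0]                      # cache hit: S3 is not touched
    data, max_age = origin_get(bucket, key)  # miss or expired: go to origin
    cache[(bucket, key)] = (data, now + max_age)
    return data

edge_get("assets", "app.js", now=0)     # miss -> origin fetch
edge_get("assets", "app.js", now=10)    # hit, within the 3600s TTL
assert origin_hits == 1
edge_get("assets", "app.js", now=4000)  # TTL expired -> origin again
assert origin_hits == 2
```

With a long max-age on immutable assets, nearly all reads resolve at the edge, which is where the 95%+ offload figure for hot objects comes from.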

    Interview Tips

    • Metadata service + data nodes = separation of concerns. Metadata is small but consistency-critical; data is large but throughput-critical.
    • Erasure coding vs. replication: 1.5x overhead vs. 3x — erasure coding dominates at petabyte scale.
    • Multipart upload: required for any object > 5GB (the single-PUT size limit), and useful well below that, since one long-lived HTTP connection is an unreliable way to move huge files.
    • Durability (11 nines) requires cross-AZ replication — a single datacenter fire cannot cause data loss.