System Design Interview: Design an Object Storage System (Amazon S3)

What Is an Object Storage System?

Object storage (Amazon S3, Google Cloud Storage) stores arbitrary-size files (objects) in named buckets. Unlike a filesystem (hierarchical directories), objects are flat key-value pairs: bucket/key → bytes. Objects are immutable — you write a new version, not modify in place. Amazon S3 stores trillions of objects, exabytes of data.

System Requirements

Functional

• PUT object: upload bytes, return a URL
• GET object: download bytes by bucket/key
• DELETE object
• Multipart upload for large objects (a single PUT is capped at 5GB)
• Versioning: keep multiple versions of the same key
• Lifecycle policies: auto-delete or archive objects after N days

Non-Functional

• Durability: 99.999999999% (11 nines), S3's published durability design target (its SLA covers availability, not durability)
• Availability: 99.99%
• Throughput: terabytes/second aggregate across the service

Architecture

Metadata Service

Stores object metadata: bucket, key, size, content-type, owner, checksum (MD5/SHA-256), version_id, and storage_location (which data nodes hold the chunks). Backed by a strongly consistent distributed database (DynamoDB or a custom sharded MySQL), partitioned by a hash of bucket+key. On PUT: write the data chunks first, then commit the metadata, so metadata never points at missing data. On GET: read metadata to find the data location, then stream the data.
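The metadata path can be sketched with an in-memory stand-in for the sharded store. `ObjectMeta`, `MetadataStore`, and the shard-by-hash scheme below are illustrative names for this design, not S3's actual internals:

```python
import hashlib
from dataclasses import dataclass

@dataclass
class ObjectMeta:
    bucket: str
    key: str
    size: int
    content_type: str
    checksum: str          # hex SHA-256 of the object bytes
    version_id: int
    chunk_locations: list  # [(chunk_id, [data_node_ids]), ...]

class MetadataStore:
    """In-memory stand-in for the sharded, strongly consistent metadata DB."""

    def __init__(self, num_shards: int = 16):
        self.shards = [{} for _ in range(num_shards)]

    def _shard(self, bucket: str, key: str) -> dict:
        # Partition by a hash of bucket+key, as described above.
        h = hashlib.sha256(f"{bucket}/{key}".encode()).digest()
        return self.shards[h[0] % len(self.shards)]

    def put(self, meta: ObjectMeta) -> None:
        self._shard(meta.bucket, meta.key)[(meta.bucket, meta.key)] = meta

    def get(self, bucket: str, key: str) -> ObjectMeta:
        return self._shard(bucket, key)[(bucket, key)]
```

A real deployment would replace the dict shards with database partitions and add conditional writes for versioning; the routing logic stays the same.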

Data Nodes

Objects are split into chunks (typically 64MB each for large objects). Each chunk is replicated 3x across data nodes in different availability zones (cross-AZ replication). Chunks are stored as flat files on disk, indexed by a local key-value store; no further filesystem abstraction is needed. Data nodes expose a simple HTTP API: PUT /chunk/{id}, GET /chunk/{id}.
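Chunking and cross-AZ placement can be sketched as follows. `split_into_chunks` and `place_replicas` are hypothetical helpers for this design; real placement would also weigh node capacity and load:

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, as in the design above

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split an object into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(chunk_id: str, nodes_by_az: dict, replicas: int = 3) -> list:
    """Pick one data node in each of `replicas` AZs, chosen by chunk hash.

    nodes_by_az maps an AZ name to its list of data node ids. Using a stable
    hash keeps placement deterministic, so any server can recompute it.
    """
    h = int.from_bytes(hashlib.sha256(chunk_id.encode()).digest()[:8], "big")
    azs = sorted(nodes_by_az)[:replicas]
    return [nodes_by_az[az][h % len(nodes_by_az[az])] for az in azs]
```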

Durability via Erasure Coding

For cost-efficient durability, use erasure coding (Reed-Solomon) instead of 3x replication. Split an object into k data chunks and m parity chunks; any k of the k+m chunks can reconstruct the full object, tolerating up to m failures. A common scheme is 6+3: 6 data chunks and 3 parity chunks spread across 9 data nodes, so data survives the failure of any 3 nodes. Storage overhead is 9/6 = 1.5x versus 3x for replication: 50% storage savings, at the cost of extra CPU for encoding and decoding.
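Production systems use Reed-Solomon, but the recovery idea can be demonstrated with the simplest erasure code: a single XOR parity chunk (k data chunks + 1 parity, tolerating one loss). This is a sketch of the principle, not S3's actual codec:

```python
def xor_parity(chunks: list) -> bytes:
    """Compute a parity chunk as the bytewise XOR of equal-size chunks."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

def recover(chunks_with_gap: list, parity: bytes) -> list:
    """Rebuild the single missing chunk (marked None) from survivors + parity.

    Works because XOR-ing the parity with every surviving data chunk
    cancels them out, leaving exactly the missing chunk.
    """
    survivors = [c for c in chunks_with_gap if c is not None]
    missing = xor_parity(survivors + [parity])
    return [missing if c is None else c for c in chunks_with_gap]
```

For k=3 data chunks this scheme stores 4/3 = 1.33x the data but only survives one loss; Reed-Solomon generalizes the same cancellation idea over finite-field arithmetic so that m parity chunks survive any m losses.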

Multipart Upload

For large files (GB–TB): split into parts, upload them concurrently, and complete when all parts have arrived.

1. CreateMultipartUpload → returns upload_id
2. UploadPart(upload_id, part_number, bytes) for each part (min 5MB each, except the last part)
3. CompleteMultipartUpload(upload_id, [part_number, etag] list) → atomically commits the object

Benefits: resume on failure (only re-upload failed parts), parallel upload from multiple threads, no single-connection bandwidth bottleneck.
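The three-step flow above can be sketched end to end with an in-memory stand-in for the server. `UploadServer` and its method names are hypothetical; a real client would speak HTTP to the service:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

PART_SIZE = 5 * 1024 * 1024  # minimum part size (except the last part)

class UploadServer:
    """In-memory stand-in for the storage service's multipart API."""

    def __init__(self):
        self.uploads = {}  # upload_id -> {part_number: (etag, data)}

    def create_multipart_upload(self) -> str:
        upload_id = f"upl-{len(self.uploads)}"
        self.uploads[upload_id] = {}
        return upload_id

    def upload_part(self, upload_id: str, part_number: int, data: bytes) -> str:
        etag = hashlib.md5(data).hexdigest()
        self.uploads[upload_id][part_number] = (etag, data)
        return etag

    def complete(self, upload_id: str, parts: list) -> bytes:
        """Atomically commit: verify every (part_number, etag), then assemble."""
        stored = self.uploads.pop(upload_id)
        assert all(stored[n][0] == etag for n, etag in parts), "etag mismatch"
        return b"".join(stored[n][1] for n, _ in sorted(parts))

def multipart_upload(server, data: bytes, part_size: int = PART_SIZE,
                     workers: int = 8) -> bytes:
    """Split `data` into parts and upload them concurrently."""
    upload_id = server.create_multipart_upload()
    parts = [(i + 1, data[off:off + part_size])
             for i, off in enumerate(range(0, len(data), part_size))]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        etags = list(pool.map(
            lambda p: (p[0], server.upload_part(upload_id, *p)), parts))
    return server.complete(upload_id, etags)
```

Retry logic is the piece this sketch omits: a production client would retry only the parts whose upload failed, which is exactly the resumability benefit described above.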

Consistency Model

S3 now offers strong read-after-write consistency: after a successful PUT, subsequent GET operations return the new object. Before December 2020, S3 provided only eventual consistency for overwrites and deletes. One way to implement strong consistency is to route all requests for the same key through a consistent hash ring to the same primary node (or to use conditional writes coordinated by a distributed lock).
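A minimal consistent hash ring for the key-to-primary routing might look like this (the vnode count and hashing details are illustrative choices, not a specific system's parameters):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring: every request for a key routes to one primary."""

    def __init__(self, nodes: list, vnodes: int = 64):
        # Each node owns `vnodes` points on the ring to smooth the load.
        self.ring = sorted(
            (self._h(f"{node}#{v}"), node)
            for node in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def primary(self, bucket: str, key: str) -> str:
        """Walk clockwise from the key's hash to the first node point."""
        i = bisect.bisect(self.points, self._h(f"{bucket}/{key}"))
        return self.ring[i % len(self.ring)][1]
```

Because the mapping is deterministic, any front-end server computes the same primary for a given bucket/key, which is what lets that primary serialize reads and writes for the key.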

Caching and CDN

S3 itself is not a CDN; each bucket is served from a single region. For global low-latency access, configure CloudFront (a CDN) in front of S3. Edge PoPs cache objects near users, with TTLs controlled by Cache-Control headers on the S3 object. For frequently accessed objects (images, JS bundles), 95%+ of traffic can be served from the CDN edge without hitting S3.
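The edge-caching behavior reduces to a TTL cache in front of an origin fetch. `EdgeCache` below is an illustrative sketch; a real CDN edge also honors Cache-Control directives such as no-cache and revalidates with conditional GETs:

```python
import time

class EdgeCache:
    """Minimal per-PoP TTL cache: a stand-in for a CDN edge in front of S3."""

    def __init__(self):
        self.store = {}  # key -> (expires_at, value)

    def get(self, key, fetch_from_origin, ttl_seconds):
        now = time.time()
        entry = self.store.get(key)
        if entry and entry[0] > now:
            return entry[1], "HIT"       # served from the edge, S3 untouched
        value = fetch_from_origin(key)   # cache miss: go back to the origin
        self.store[key] = (now + ttl_seconds, value)
        return value, "MISS"
```

With a long TTL and a hot object, every request after the first is a HIT, which is how 95%+ of traffic for popular objects stays off the origin.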

Interview Tips

• Metadata service + data nodes = separation of concerns. Metadata is small but consistency-critical; data is large but throughput-critical.
• Erasure coding vs. replication: 1.5x overhead vs. 3x. Erasure coding wins at petabyte scale.
• Multipart upload: required for any object over 5GB (the single-PUT limit), and long-lived single HTTP connections are unreliable for huge uploads anyway.
• Durability (11 nines) requires cross-AZ replication, so a single datacenter fire cannot cause data loss.

FAQ

How does Amazon S3 achieve 11 nines of durability?

S3's 99.999999999% durability means losing about one object per 100 billion object-years. It is achieved through: (1) Geographic redundancy: objects are replicated across at least 3 Availability Zones within a region; each AZ is physically separate (different power grid, flood zone, network), so loss of an entire AZ doesn't affect the object. (2) Erasure coding (for the Standard storage class): Reed-Solomon coding splits data into k data shards + m parity shards, and any k shards can reconstruct the object. At S3's scale, 6+3 or 8+4 coding is common, tolerating the loss of 3–4 shards (entire storage nodes). (3) Data integrity verification: every write is checksummed (MD5 + CRC32), reads are verified against the checksum, and silent data corruption (bit rot on disk) is detected and repaired automatically. (4) Continuous scrubbing: background processes continuously read and verify all stored data, repairing corrupted blocks from intact shards before more shards fail. (5) Cross-region replication (optional): for additional durability, objects can be replicated asynchronously to a second region.

How does multipart upload work for large objects in S3?

Standard single PUT requests have a 5GB limit and are susceptible to connection failures that require restarting from zero. Multipart upload addresses this for large objects (S3 minimum part size: 5MB, except the last part). Flow: (1) InitiateMultipartUpload → the server returns an upload_id. (2) UploadPart(upload_id, part_number 1–10000, bytes) → returns an ETag (MD5 of the part). Each part can be uploaded in parallel from multiple threads or machines, and failed parts can be retried without restarting others. (3) CompleteMultipartUpload(upload_id, [part_number: etag] list) → the server assembles the final object atomically, verifying that all parts are present and their checksums match; if any part is missing, the complete operation fails. (4) AbortMultipartUpload cleans up incomplete uploads (add a lifecycle rule to auto-abort incomplete uploads after 7 days to avoid paying for stored parts). Concurrency: for a 100GB file with 64MB parts, that is roughly 1563 parts; uploading 50 at a time multiplies effective bandwidth accordingly.

What is the difference between S3 Standard, S3-IA, and S3 Glacier?

S3 offers multiple storage classes with different availability, retrieval-latency, and cost trade-offs. S3 Standard: 99.99% availability, millisecond retrieval; most expensive per GB stored (~$0.023/GB/month); for frequently accessed data. S3 Standard-IA (Infrequent Access): 99.9% availability, millisecond retrieval; cheaper storage (~$0.0125/GB/month) but with a per-GB retrieval fee and a minimum 30-day storage charge; for data accessed monthly or less (backups, DR). S3 Glacier Instant Retrieval: millisecond retrieval, 90-day minimum; cheapest for cold data with occasional access. S3 Glacier Flexible Retrieval: 3–5 hour retrieval, very cheap storage; for archives accessed annually. S3 Glacier Deep Archive: 12-hour retrieval, cheapest of all; true cold storage. Lifecycle policies automate transitions: Standard → IA after 30 days, → Glacier after 90 days, → delete after 365 days. Engineering decision framework: ask "how often is this data accessed?" and choose the storage class accordingly; use S3 Intelligent-Tiering when access patterns are unpredictable, since it moves objects between tiers automatically based on actual access.
