System Design Interview: Object Storage (Amazon S3)

What Is Object Storage?

Object storage stores unstructured data (images, videos, backups, logs) as discrete objects in a flat namespace (bucket/key), unlike a hierarchical file system. Amazon S3 stores trillions of objects and handles millions of requests per second. Object storage is optimized for large, immutable objects (write once, read many): an object is replaced as a whole rather than edited in place. S3 is designed for 99.999999999% (11 nines) durability, which it achieves by storing data redundantly across multiple Availability Zones.

Core API

  • PUT object: upload an object to bucket/key. S3 returns 200 only after the object is durably persisted to multiple AZs. Object size limit: 5GB for single PUT; use multipart upload for larger objects.
  • GET object: retrieve object by bucket/key. Range GET (HTTP Range header) retrieves a byte range — used by video streaming to seek within large video files.
  • DELETE object: removes the object (with versioning enabled, creates a delete marker — the object is recoverable).
  • Multipart upload: split a large object into parts of 5MB-5GB each (the last part may be smaller, and an upload can have at most 10,000 parts), upload the parts in parallel, then finish with a CompleteMultipartUpload call. Each part is a separate UploadPart request. Enables: parallel upload across multiple connections, resumable uploads (retry only the failed parts), and upload of objects > 5GB (up to 5TB).
  • Pre-signed URL: a time-limited URL that grants temporary access to a specific object. Generated server-side and sent to clients for direct browser-to-S3 upload/download, bypassing your application server. Avoids proxying large files through application servers.
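
The part-splitting step of a multipart upload can be sketched as follows. This is a minimal planning helper, not an S3 client: `plan_parts` and `range_header` are hypothetical names, the 8 MiB default part size is an arbitrary assumption, and no network calls are made. The same inclusive byte ranges could also drive parallel Range GETs.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # minimum size for every part except the last
MAX_PARTS = 10_000        # limit on parts per multipart upload


def plan_parts(object_size: int, part_size: int = 8 * MIB) -> list:
    """Return (part_number, first_byte, last_byte) tuples; offsets are inclusive."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size if the object would otherwise need more than MAX_PARTS parts.
    min_needed = -(-object_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, min_needed)
    parts = []
    offset = 0
    number = 1  # part numbers start at 1
    while offset < object_size:
        last = min(offset + part_size, object_size) - 1
        parts.append((number, offset, last))
        offset = last + 1
        number += 1
    return parts


def range_header(first_byte: int, last_byte: int) -> str:
    """Format an inclusive byte range as an HTTP Range header value."""
    return f"bytes={first_byte}-{last_byte}"


# A 20 MiB + 1 byte object splits into two full 8 MiB parts and a shorter last part.
parts = plan_parts(20 * MIB + 1, part_size=8 * MIB)
for number, first, last in parts:
    print(number, range_header(first, last))
```

Each tuple can be handed to a separate worker, and a failed part can be retried without re-sending the others.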

Data Durability via Replication

S3’s 11-nine durability comes from storing each object redundantly across at least 3 Availability Zones (AZs). A simplified model of a PUT: (1) S3 receives the object in one AZ. (2) It replicates the data to 2 or more additional AZs in parallel. (3) It returns success only after all replicas confirm. If an AZ loses power or hardware, the remaining AZs serve GET requests from the replicas they hold. An 11-nine design target corresponds to an expected loss probability of roughly 10^-11 per object per year; because AZs are built to fail independently, losing every copy at once is extremely unlikely.
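
The acknowledge-after-all-replicas rule can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not how S3 is implemented: `Replica`, `replicated_put`, and `get_from_any` are hypothetical names, and each replica is just a dictionary standing in for a storage node in one AZ.

```python
class Replica:
    """In-memory stand-in for a storage node in one AZ (hypothetical)."""

    def __init__(self, az: str, healthy: bool = True):
        self.az = az
        self.healthy = healthy
        self.objects = {}  # key -> bytes

    def store(self, key: str, data: bytes) -> bool:
        # An unhealthy replica cannot confirm the write.
        if not self.healthy:
            return False
        self.objects[key] = data
        return True


def replicated_put(replicas, key: str, data: bytes) -> bool:
    """Report success only if every replica confirms the write."""
    acks = [replica.store(key, data) for replica in replicas]
    return all(acks)


def get_from_any(replicas, key: str):
    """Serve the object from the first healthy replica that holds it."""
    for replica in replicas:
        if replica.healthy and key in replica.objects:
            return replica.objects[key]
    return None


replicas = [Replica("az-a"), Replica("az-b"), Replica("az-c")]
ok = replicated_put(replicas, "photos/cat.jpg", b"meow")

# Simulate losing one AZ: the remaining replicas still serve the object.
replicas[0].healthy = False
survivor_copy = get_from_any(replicas, "photos/cat.jpg")
```

A real system would also clean up or repair partial writes when a PUT fails; the sketch leaves that out.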

Erasure coding: for very large objects, S3 may use erasure coding (similar in spirit to RAID-6) instead of full replication. The object is split into k data shards, from which m parity shards are computed. Any k of the k+m shards can reconstruct the original data. With k=10 and m=4 (14 shards total), any 4 shards can be lost without losing data, and the storage footprint is 14/10 = 1.4× the object size (40% overhead), versus 3× for three full replicas.
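
The overhead arithmetic and the idea of rebuilding a lost shard can be sketched in a few lines. Note the simplification: the code below uses a single XOR parity shard (m=1), which tolerates only one lost shard; a k=10, m=4 scheme would need a Reed-Solomon style code, which is not shown here. All function names are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def storage_overhead(k: int, m: int) -> float:
    """Stored bytes divided by original bytes for k data + m parity shards."""
    return (k + m) / k


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal shards (zero-padded) plus one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\x00")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def recover(shards: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing shard by XOR-ing the surviving shards."""
    missing = [i for i, shard in enumerate(shards) if shard is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost shard")
    repaired = list(shards)
    if missing:
        survivors = [shard for shard in shards if shard is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


original = b"object storage!!"      # 16 bytes, so 4 shards of 4 bytes with no padding
shards = encode(original, k=4)
damaged = list(shards)
damaged[2] = None                   # simulate losing one data shard
restored = b"".join(recover(damaged)[:4])
```

When padding is added, a real implementation would record the original length in metadata and trim the reassembled data to it.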

Consistency Model

S3 provides strong read-after-write consistency (since December 2020): after a successful PUT or DELETE, any subsequent GET or LIST for that key reflects the change. Before that, only PUTs of new keys were read-after-write consistent; overwrite PUTs, DELETEs, and LIST results were eventually consistent, so a read could return stale data for a short time after the write. Strong consistency simplifies application development but required significant engineering changes to S3’s internal metadata layer.

Bucket Internals: Data Placement

A bucket is a logical namespace. Internally, in a typical design, object data is chunked into large blocks (for example 64MB or 128MB) stored on physical disks across many storage servers. The object metadata (bucket, key, size, ETag, created_at, version_id, block locations) is stored in a distributed metadata service. When a GET arrives: look up the metadata to find the block locations, fetch the blocks (possibly from multiple storage nodes), reassemble them, and return the object. How keys map to index partitions matters: if the key index is partitioned by key range (prefix), sequential keys (e.g., log files timestamped 2024-01-01, 2024-01-02) land in the same or neighboring partitions and create hotspots, whereas hashing keys spreads them evenly. In the past, S3 users were advised to add random prefixes to their keys for this reason; modern S3 detects hot prefixes and splits partitions automatically, so this is no longer necessary.
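
The GET path described above (metadata lookup, block fetch, reassembly) can be sketched with an in-memory model. Everything here is an illustrative assumption: `ObjectStore` is a hypothetical class, the block size is 4 bytes so the example stays readable, block placement uses a simple hash modulo the node count, and a SHA-256 checksum stands in for the ETag.

```python
import hashlib

BLOCK_SIZE = 4  # bytes; tiny on purpose, real systems use blocks in the MB range


class ObjectStore:
    def __init__(self, node_count: int):
        # Each storage node maps block_id -> block bytes.
        self.nodes = [dict() for _ in range(node_count)]
        # Metadata service: (bucket, key) -> size, checksum, block locations.
        self.metadata = {}

    def put(self, bucket: str, key: str, data: bytes) -> None:
        blocks = []
        for offset in range(0, len(data), BLOCK_SIZE):
            chunk = data[offset:offset + BLOCK_SIZE]
            block_id = hashlib.sha256(f"{bucket}/{key}/{offset}".encode()).hexdigest()
            node_index = int(block_id, 16) % len(self.nodes)  # hash-based placement
            self.nodes[node_index][block_id] = chunk
            blocks.append((node_index, block_id))
        self.metadata[(bucket, key)] = {
            "size": len(data),
            "checksum": hashlib.sha256(data).hexdigest(),
            "blocks": blocks,
        }

    def get(self, bucket: str, key: str) -> bytes:
        meta = self.metadata[(bucket, key)]  # 1. metadata lookup
        chunks = [self.nodes[node][block_id] for node, block_id in meta["blocks"]]  # 2. fetch blocks
        return b"".join(chunks)  # 3. reassemble in order


store = ObjectStore(node_count=3)
store.put("photos", "cat.jpg", b"0123456789")
body = store.get("photos", "cat.jpg")
```

Because placement hashes the block identifier, consecutive blocks of one object tend to spread across nodes, which is what allows parallel fetches.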

Storage Classes and Lifecycle Policies

S3 offers tiered storage with different cost/access trade-offs (prices are approximate, per GB-month, and vary by region):

  • S3 Standard: frequent access, data stored across at least 3 AZs, ~$0.023/GB-month, millisecond retrieval.
  • S3 Standard-IA (Infrequent Access): lower storage cost (~$0.0125/GB-month), retrieval fee per GB, same durability as Standard but slightly lower designed availability (99.9% vs 99.99%). For objects accessed less than once a month.
  • S3 Glacier Instant Retrieval: archive with millisecond retrieval, ~$0.004/GB-month. For objects accessed about once a quarter.
  • S3 Glacier Flexible Retrieval: 3-5 hour standard retrieval, ~$0.0036/GB-month. For compliance archives.
  • S3 Glacier Deep Archive: retrieval within 12 hours, ~$0.00099/GB-month. For 7-year compliance retention at minimum cost.

Lifecycle policies automate transitions: “move to Standard-IA after 30 days, Glacier after 90 days, delete after 365 days.” This automatically reduces storage costs for log files and backups.
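
The example policy above can be written down as a rule and evaluated locally. The dictionary below follows the general shape of an S3 lifecycle rule (ID, Filter, Status, Transitions, Expiration), but treat the exact field names as an assumption to verify against the S3 documentation; the rule ID, the `logs/` prefix, and the `storage_class_at` helper are hypothetical.

```python
LIFECYCLE_RULE = {
    "ID": "archive-logs",               # hypothetical rule name
    "Filter": {"Prefix": "logs/"},      # hypothetical key prefix
    "Status": "Enabled",
    "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER"},
    ],
    "Expiration": {"Days": 365},
}


def storage_class_at(age_days: int, rule: dict = LIFECYCLE_RULE):
    """Return the storage class an object would be in at a given age, or None once expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = "STANDARD"
    # Apply transitions in ascending order of age; the last one reached wins.
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


for age in (10, 45, 200, 400):
    print(age, storage_class_at(age))
```

This only models the timeline locally; in practice the storage service evaluates the rule and moves objects asynchronously.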

Access Control

Three layers: (1) Bucket policies (resource-based policies attached to a bucket): define which principals may perform which operations on the bucket and its objects. (2) IAM policies (identity-based): define what an AWS identity (user or role) is allowed to do. (3) Access Control Lists (ACLs): legacy per-object permissions. Modern recommendation: use bucket policies + IAM, and disable ACLs. Block Public Access settings (account-wide or per-bucket) prevent accidental public exposure, a common cause of data breaches.
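
As an illustration of the first layer, here is a sketch of a bucket policy document built as a Python dictionary and serialized to JSON. The bucket name, the account ID, the role name, and the `reports/` prefix are all hypothetical placeholders; the statement grants one role read access to objects under that prefix.

```python
import json

BUCKET = "example-bucket"  # hypothetical bucket name
READER_ROLE_ARN = "arn:aws:iam::111122223333:role/ReportReader"  # hypothetical role

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowReportReaderGet",
            "Effect": "Allow",
            "Principal": {"AWS": READER_ROLE_ARN},       # who
            "Action": "s3:GetObject",                     # which operation
            "Resource": f"arn:aws:s3:::{BUCKET}/reports/*",  # on which objects
        }
    ],
}

policy_json = json.dumps(bucket_policy, indent=2)
print(policy_json)
```

An identity-based IAM policy looks similar but omits `Principal`, since it is attached to the identity it applies to.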

CDN Integration

S3 serves as the origin for CloudFront (CDN). Workflow: user requests an image → CloudFront edge node checks its cache → on a miss, CloudFront fetches the object from S3 (origin) and caches the response → subsequent requests for the same object are served from the edge node near the user. For popular objects, the reduction in S3 GET requests tracks the cache hit ratio (an 80-90% hit ratio removes 80-90% of origin GETs), and latency drops from roughly 100-500ms (e.g., a transatlantic S3 request) to roughly 5-20ms (edge cache hit). S3 + CloudFront is the standard architecture for static asset serving, video streaming (HLS segments), and software distribution.
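
The cache-miss workflow can be sketched with a small simulation. This is an illustrative model only: `Origin` and `EdgeCache` are hypothetical classes, the TTL is an arbitrary assumption, and time is passed in explicitly so the behavior is deterministic.

```python
class Origin:
    """Stand-in for the S3 origin; counts how many requests reach it."""

    def __init__(self, objects: dict):
        self.objects = objects
        self.requests = 0

    def get(self, key: str) -> bytes:
        self.requests += 1
        return self.objects[key]


class EdgeCache:
    """Stand-in for one CDN edge node with a simple TTL cache."""

    def __init__(self, origin: Origin, ttl_seconds: float = 3600.0):
        self.origin = origin
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (body, expires_at)
        self.hits = 0
        self.misses = 0

    def get(self, key: str, now: float) -> bytes:
        entry = self.entries.get(key)
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: fetch from the origin and cache the response.
        body = self.origin.get(key)
        self.misses += 1
        self.entries[key] = (body, now + self.ttl)
        return body


origin = Origin({"img/logo.png": b"fake-image-bytes"})
edge = EdgeCache(origin, ttl_seconds=60.0)
for second in range(10):
    edge.get("img/logo.png", now=float(second))

hit_ratio = edge.hits / (edge.hits + edge.misses)
print(origin.requests, hit_ratio)
```

Ten requests for the same object reach the origin once, so the share of origin GETs avoided equals the hit ratio.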
