System Design: Media Storage and Delivery — CDN, Transcoding Pipeline, and Adaptive Streaming

Requirements

A media storage and delivery system stores uploaded videos or images and delivers them efficiently to users worldwide. Think YouTube, Netflix, Instagram, or TikTok. Core requirements: users upload raw video; the system transcodes it into multiple resolutions and formats; content is delivered via CDN with adaptive bitrate streaming; high availability and low latency worldwide. Scale: YouTube receives 500 hours of video per minute. Netflix serves 15% of global internet traffic during peak hours. This is among the most infrastructure-heavy system design problems.

Upload and Ingestion

Upload flow:

1. The client requests an upload URL from the API server.
2. The API creates a pending media record and returns a pre-signed upload URL pointing directly to object storage (S3, GCS).
3. The client uploads the file directly to object storage, bypassing your application servers entirely and reducing cost and latency.
4. Object storage fires an event (S3 event notification, pub/sub) when the upload completes.
5. The transcoding pipeline picks up the event and begins processing.

Why pre-signed URLs: they eliminate the app server as a proxy for large binary uploads. A 4GB video routed through your app server consumes bandwidth, memory, and CPU that should go to API requests.

Chunked upload: for large files, the client splits the file into 10MB chunks and uploads each via multipart upload. On failure, only the failed chunk needs to be re-sent. Chunk checksums (MD5) verify integrity. The server assembles the chunks once all are confirmed.
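The chunking step can be sketched with the standard library alone. The 10MB chunk size and MD5 checksums come from the text; the function name and yielded tuple shape are illustrative, not a real SDK API:

```python
import hashlib
import io

CHUNK_SIZE = 10 * 1024 * 1024  # 10MB, matching the multipart chunk size above

def split_into_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield (part_number, chunk_bytes, md5_hex) for a file-like object.

    Each part is uploaded independently via multipart upload; on failure,
    only that chunk is retried. The MD5 digest travels with the part so
    the server can verify integrity before assembling the final object.
    """
    part_number = 1
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield part_number, chunk, hashlib.md5(chunk).hexdigest()
        part_number += 1

# Example: a 25MB upload becomes three parts (10MB, 10MB, 5MB).
data = io.BytesIO(b"x" * (25 * 1024 * 1024))
parts = list(split_into_chunks(data))
```

A real client would PUT each chunk to its pre-signed part URL, then send the list of part numbers and ETags in the final complete-multipart request.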

Transcoding Pipeline

Transcoding converts the raw video into multiple formats and resolutions for different devices and network conditions. Outputs: 360p, 480p, 720p, 1080p, 4K (if the source is high enough quality). Formats: H.264/MP4 (universal), H.265/HEVC (better compression, newer devices), VP9/WebM (Chrome, Android).

Architecture: the upload event is placed on a job queue (SQS, Kafka). A fleet of transcoding workers pulls jobs and runs FFmpeg. Workers run on GPU-equipped VMs for hardware-accelerated encoding. Each resolution is a separate job, so the work parallelizes. A coordinator tracks job progress per media_id; when all jobs complete, the media status is updated to READY.

Fan-out and cost: a 4K source produces ~6 output files, and 1 minute of 4K video takes ~5 minutes of transcoding. Use spot/preemptible VMs to reduce cost; transcoding is stateless and restartable.

Store transcoded files in a separate S3 prefix organized by media_id and quality: media/{media_id}/1080p.mp4, media/{media_id}/720p.mp4, etc.
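A worker pulling one job off the queue can assemble its FFmpeg invocation as in this minimal sketch. The resolutions and output paths come from the text; the bitrate ladder values and helper name are assumptions:

```python
# Per-rendition encoding settings (bitrates are typical illustrative values).
RENDITIONS = {
    "360p":  {"height": 360,  "bitrate": "800k"},
    "480p":  {"height": 480,  "bitrate": "1400k"},
    "720p":  {"height": 720,  "bitrate": "2800k"},
    "1080p": {"height": 1080, "bitrate": "5000k"},
}

def build_ffmpeg_cmd(source_path, media_id, quality):
    """Build the FFmpeg argv for one rendition job.

    Each quality is a separate queue job, so a worker runs exactly one of
    these commands; the output lands under the per-media S3 prefix
    media/{media_id}/{quality}.mp4 described above.
    """
    r = RENDITIONS[quality]
    out = f"media/{media_id}/{quality}.mp4"
    return [
        "ffmpeg", "-i", source_path,
        "-vf", f"scale=-2:{r['height']}",   # keep aspect ratio, force even width
        "-c:v", "libx264", "-b:v", r["bitrate"],
        "-c:a", "aac", "-b:a", "128k",
        out,
    ]

cmd = build_ffmpeg_cmd("raw/abc123.mp4", "abc123", "1080p")
```

Because the command is stateless (source in, file out), a job killed by a spot-VM preemption can simply be re-run from the start on another worker.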

CDN and Adaptive Bitrate Streaming

CDN (CloudFront, Fastly, Akamai): transcoded files are served from CDN edge nodes, not directly from S3. The CDN caches files at ~200+ PoPs worldwide. Cache-Control: max-age=31536000 (1 year); video files are immutable (content-addressed). CDN hit ratio: for popular content, 99%+ of requests are served from cache.

Adaptive bitrate streaming (ABR): the player dynamically switches quality based on network conditions. Two protocols dominate: HLS (HTTP Live Streaming, the Apple standard) and DASH (Dynamic Adaptive Streaming over HTTP, the open standard).

HLS: the video is split into 6-second segments (.ts files). A manifest file (.m3u8) lists all available bitrates and their segment URLs. The player downloads the manifest, measures bandwidth, and starts downloading segments at the appropriate bitrate. If bandwidth drops, the player switches to a lower-quality segment at the next segment boundary, seamlessly from the user's perspective. Generating HLS: FFmpeg outputs the .ts segments and the manifest file during transcoding. All segments are uploaded to S3 and served via CDN.
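The manifest + segment model can be sketched end to end: render a master playlist (one EXT-X-STREAM-INF entry per bitrate) and make a naive player-side quality pick. The variant ladder, the 0.8 bandwidth headroom factor, and the function names are assumptions for illustration:

```python
# Illustrative bitrate ladder; segment playlists follow the
# {quality}/playlist.m3u8 convention under the media prefix.
VARIANTS = [
    {"quality": "360p",  "bandwidth": 800_000,   "resolution": "640x360"},
    {"quality": "720p",  "bandwidth": 2_800_000, "resolution": "1280x720"},
    {"quality": "1080p", "bandwidth": 5_000_000, "resolution": "1920x1080"},
]

def master_playlist(variants=VARIANTS):
    """Render the HLS master manifest: one stream entry per bitrate,
    each pointing at that rendition's media playlist of .ts segments."""
    lines = ["#EXTM3U"]
    for v in variants:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={v['bandwidth']},"
            f"RESOLUTION={v['resolution']}"
        )
        lines.append(f"{v['quality']}/playlist.m3u8")
    return "\n".join(lines)

def pick_variant(measured_bps, variants=VARIANTS, headroom=0.8):
    """Player-side ABR: highest bitrate that fits within a safety margin
    of measured bandwidth; fall back to the lowest rendition."""
    affordable = [v for v in variants if v["bandwidth"] <= measured_bps * headroom]
    return (affordable[-1] if affordable else variants[0])["quality"]

manifest = master_playlist()
choice = pick_variant(4_000_000)   # ~4 Mbps measured: 1080p won't fit
```

Real players use smoothed bandwidth estimates and buffer occupancy rather than a single measurement, but the switch still happens only at segment boundaries, which is what makes it seamless.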

Thumbnails, Metadata, and Search

Thumbnail extraction: during transcoding, extract frames at regular intervals (every 10 seconds) as JPEG thumbnails. Store them in S3 at media/{media_id}/thumbs/0010.jpg, /0020.jpg, etc. The player uses these for scrubbing previews (hovering over the timeline). Cover image: extract a representative frame at ~10% into the video.

Metadata: store it in a relational database (PostgreSQL): media_id, uploader_id, title, description, duration_seconds, width, height, status (UPLOADING, PROCESSING, READY, FAILED), storage_bytes, view_count, created_at.

Video search: index title and description in Elasticsearch; tag-based filtering stays in SQL. Recommendation: collaborative filtering based on watch history (an offline job, updated daily).

Abuse detection: run sampled frames through an ML classifier for policy violations (nudity, violence) and flag hits for human review or auto-removal. This runs asynchronously after upload, not in the serving path.
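The thumbnail naming scheme above can be sketched as follows; the helper names and the cover-frame fraction default are illustrative, while the 10-second interval and zero-padded key format come from the text:

```python
def thumbnail_keys(media_id, duration_seconds, interval=10):
    """S3 keys for the scrubbing thumbnails: one frame every `interval`
    seconds, named by zero-padded timestamp, e.g. media/{id}/thumbs/0010.jpg."""
    return [
        f"media/{media_id}/thumbs/{t:04d}.jpg"
        for t in range(interval, duration_seconds + 1, interval)
    ]

def cover_timestamp(duration_seconds, fraction=0.1):
    """Timestamp (seconds) of the representative cover frame, ~10% in."""
    return int(duration_seconds * fraction)

keys = thumbnail_keys("abc123", 45)   # 45-second video: frames at 10/20/30/40s
cover = cover_timestamp(600)          # 10-minute video: cover at 60s
```

The player fetches the key nearest to the hovered timeline position, so predictable timestamp-based names avoid any manifest lookup for previews.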

Interview Tips

  • Always mention pre-signed URLs for direct upload — it shows you understand cost and scale implications.
  • Transcoding is the most complex part: fan-out into multiple qualities, stateless workers, spot VMs for cost.
  • HLS/DASH adaptive streaming is the standard answer for video delivery — know the manifest + segment model.
  • CDN with immutable cache headers is the key to global low latency at scale.



