System Design Interview: Video Streaming Platform (Netflix/YouTube)

Designing a video streaming platform like Netflix or YouTube is one of the most comprehensive system design challenges, combining video processing pipelines, CDN architecture, adaptive bitrate streaming, and personalization at massive scale.

Core Requirements

Functional: Upload videos, transcode to multiple resolutions/formats, stream to users on any device with adaptive quality, support search and recommendations, track view history and progress.

Non-functional: 200M daily active users. 1B hours of video watched per day. Upload processing within 30 minutes. Stream start time < 2 seconds. 99.99% availability. Support 4K, 1080p, 720p, 480p, 360p.

Video Upload and Processing Pipeline

Upload flow:
  Creator → Upload API → S3 (raw video, presigned PUT URL)
                      → SQS: "video uploaded" event
                      → Video Processing Worker
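The worker's first step is decoding the "video uploaded" event pulled from SQS. A minimal sketch, assuming the standard S3 `ObjectCreated` notification format (bucket name and key are illustrative):

```python
import json

def parse_upload_event(message_body: str) -> dict:
    """Extract bucket/key/size from an S3 ObjectCreated notification
    delivered via SQS (standard S3 event notification field layout)."""
    record = json.loads(message_body)["Records"][0]
    return {
        "bucket": record["s3"]["bucket"]["name"],
        "key": record["s3"]["object"]["key"],
        "size_bytes": record["s3"]["object"]["size"],
    }

# Example event, trimmed to the fields used above
body = json.dumps({"Records": [{"s3": {
    "bucket": {"name": "raw-uploads"},
    "object": {"key": "videos/xyz123.mp4", "size": 734003200},
}}]})
job = parse_upload_event(body)
```

The parsed job record then drives the processing pipeline below.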

Processing pipeline stages:
  1. Validation: check video format, duration, file integrity (SHA-256)
  2. Transcoding: convert to multiple resolutions and formats
     Tool: FFmpeg (open source) or cloud services (AWS MediaConvert)
     Output per video:
       360p H.264 MP4  (mobile, low bandwidth)
       480p H.264 MP4
       720p H.264 MP4  (standard HD)
       1080p H.264 MP4
       1080p H.265 HEVC (50% smaller than H.264)
       4K H.265         (if source is 4K)
       Audio-only AAC   (for background play)
  3. Thumbnail generation: extract frames at 10s intervals
  4. Content moderation: run ML classifier (NSFW, copyright)
  5. Packaging: segment into chunks for adaptive streaming (HLS/DASH)
  6. CDN distribution: push to edge PoPs
  7. Update DB: mark video as available, publish event
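The transcoding stage can be driven by building one FFmpeg command per ladder rung. A sketch in Python — the bitrates are illustrative, and a production encoder would add many more flags (keyframe alignment, two-pass encoding, per-title optimization):

```python
# Transcoding ladder from the pipeline above (bitrates are illustrative).
LADDER = [
    ("360p",       640,  360,  "libx264",   800_000),
    ("480p",       854,  480,  "libx264", 1_400_000),
    ("720p",      1280,  720,  "libx264", 2_800_000),
    ("1080p",     1920, 1080,  "libx264", 5_000_000),
    ("1080p-hevc", 1920, 1080, "libx265", 2_500_000),  # HEVC: ~50% smaller
]

def ffmpeg_cmd(src, name, w, h, codec, bps):
    """Build argv for one rung: scale, encode video, re-encode audio to AAC."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale={w}:{h}",
        "-c:v", codec, "-b:v", str(bps),
        "-c:a", "aac", "-b:a", "128k",
        f"{name}.mp4",
    ]

cmds = [ffmpeg_cmd("raw.mp4", *rung) for rung in LADDER]
```

Each command runs as an independent job, so the rungs themselves can also be encoded in parallel.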

Distributed transcoding:
  Split 2-hour movie into 5-minute segments
  Transcode each segment in parallel across N workers
  Merge segments → final output
  Reduces 2-hour 4K transcode from 8 hours → 30 minutes
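The split step above is just boundary arithmetic. A sketch of computing the per-worker segment ranges:

```python
def split_points(duration_sec: int, segment_sec: int = 300):
    """(start, end) second offsets for parallel transcode segments."""
    return [(s, min(s + segment_sec, duration_sec))
            for s in range(0, duration_sec, segment_sec)]

# 2-hour movie, 5-minute segments → 24 independent transcode jobs,
# each handed to a worker, then concatenated in order.
segments = split_points(2 * 3600)
```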

Adaptive Bitrate Streaming (ABR)

Problem: users have different bandwidths; static quality = bad UX
         (buffering on slow connections, blurry on fast connections)

ABR: client switches quality mid-stream based on current bandwidth

HLS (HTTP Live Streaming) — Apple, widely supported:
  Master playlist (m3u8):
    #EXTM3U
    #EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
    360p/index.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
    720p/index.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
    1080p/index.m3u8

  Each quality-level playlist (360p/index.m3u8):
    #EXTM3U
    #EXT-X-TARGETDURATION:6
    #EXTINF:6.0,
    segment_001.ts
    #EXTINF:6.0,
    segment_002.ts
    ...
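The player's first job is parsing the master playlist into a variant list. A minimal sketch for the playlist above — note that real HLS attribute lists can contain quoted commas (e.g. `CODECS="avc1,mp4a"`), which this naive split does not handle:

```python
MASTER = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
360p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=2800000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/index.m3u8
"""

def parse_master(m3u8: str) -> list[dict]:
    """Pair each #EXT-X-STREAM-INF line with the variant URI on the next line."""
    variants, pending = [], None
    for line in m3u8.splitlines():
        if line.startswith("#EXT-X-STREAM-INF:"):
            attrs = dict(kv.split("=") for kv in line.split(":", 1)[1].split(","))
            pending = {"bandwidth": int(attrs["BANDWIDTH"]),
                       "resolution": attrs["RESOLUTION"]}
        elif pending and line and not line.startswith("#"):
            pending["uri"] = line
            variants.append(pending)
            pending = None
    return variants

variants = parse_master(MASTER)
```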

  Client logic:
    - Download master playlist → pick initial quality based on bandwidth estimate
    - Download segments → measure download speed
    - Buffer < 5s or slow downloads: switch to lower quality
    - Buffer > 20s, high download speed: switch to higher quality
    - Target: maintain 20-30s buffer for smooth playback
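The client logic above can be sketched as a buffer-aware rung selector. The 80% safety margin and 5-second panic threshold are illustrative; production players (e.g. dash.js, Shaka) use more sophisticated throughput and buffer models:

```python
def pick_quality(variants, est_bps, buffer_sec, current, safety=0.8):
    """Drop a rung when the buffer runs low; otherwise pick the highest
    rung sustainable at ~80% of the measured bandwidth."""
    if buffer_sec < 5:                       # rebuffer risk: step down
        return max(current - 1, 0)
    sustainable = [i for i, v in enumerate(variants)
                   if v["bandwidth"] <= est_bps * safety]
    return sustainable[-1] if sustainable else 0

RUNGS = [{"bandwidth": 800_000},     # 360p
         {"bandwidth": 2_800_000},   # 720p
         {"bandwidth": 5_000_000}]   # 1080p

pick_quality(RUNGS, est_bps=4_000_000, buffer_sec=25, current=0)   # → 1 (720p)
```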

CDN Architecture

Video storage distribution:
  Origin: S3 (all video segments, all qualities)
  CDN (CloudFront / Akamai / Fastly):
    300+ PoPs worldwide
    Each PoP caches popular video segments
    Cache hit ratio: ~80% (hot content cached at edge)
    Cache miss → origin pull → cache at edge (5-10s penalty)

Cache key: {video_id}/{quality}/{segment_number}.ts
  e.g., cdn.netflix.com/v/xyz123/1080p/segment_042.ts

Popular content (top 10%): pre-warmed at all PoPs
  When video is trending → push-based distribution to edge

Long-tail content: served from nearest PoP or origin on-demand
  First viewer in a region pulls from origin → cached for next viewers

Cache TTL:
  Video segments (immutable): Cache-Control: max-age=31536000 (1 year)
  Playlists (can change): Cache-Control: max-age=5 (live) or 60 (VOD)
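The cache-key and TTL rules above are mechanical enough to sketch directly. Segments are immutable (the key fully identifies the bytes), so they get a one-year TTL; playlists can change, so they get seconds:

```python
def segment_path(video_id: str, quality: str, n: int) -> str:
    """Cache key: {video_id}/{quality}/{segment_number}.ts"""
    return f"{video_id}/{quality}/segment_{n:03d}.ts"

def cache_headers(path: str, live: bool = False) -> dict:
    """Immutable segments: cache for a year. Playlists: short TTL."""
    if path.endswith(".ts") or path.endswith(".mp4"):
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.endswith(".m3u8"):
        return {"Cache-Control": f"public, max-age={5 if live else 60}"}
    return {"Cache-Control": "no-cache"}

segment_path("xyz123", "1080p", 42)   # → "xyz123/1080p/segment_042.ts"
```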

Video Storage Architecture

Storage tiers by access frequency:

Hot tier (< 30 days, frequently accessed):
  S3 Standard (origin for CDN pulls)

Warm tier (30-180 days):
  S3 Standard-IA

Cold tier (180 days - 2 years):
  S3 Glacier

Archive tier (> 2 years, rarely accessed):
  S3 Glacier Deep Archive
  Cost: $0.00099/GB/month, 12-48 hour retrieval

Lifecycle policy (automated):
  Day 0: S3 Standard
  Day 30: → S3-IA
  Day 180: → Glacier
  Day 730: → Glacier Deep Archive
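The lifecycle policy above maps directly onto an S3 lifecycle configuration (the payload shape used by `put_bucket_lifecycle_configuration`; the rule ID and `videos/` prefix are illustrative):

```python
# S3 lifecycle configuration for the tiering schedule above.
LIFECYCLE = {
    "Rules": [{
        "ID": "video-tiering",
        "Status": "Enabled",
        "Filter": {"Prefix": "videos/"},
        "Transitions": [
            {"Days": 30,  "StorageClass": "STANDARD_IA"},
            {"Days": 180, "StorageClass": "GLACIER"},
            {"Days": 730, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }]
}
```

Once attached to the bucket, S3 applies the transitions automatically; no application code runs on the data path.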

Storage at YouTube scale:
  500 hours of video uploaded per minute
  Avg 1 hour = 2GB raw = 1GB after transcoding (all qualities)
  500 GB/min = 720TB/day = 263PB/year
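The scale arithmetic above, spelled out (decimal units: 1 TB = 1000 GB):

```python
hours_per_min = 500      # upload rate: 500 hours of video per minute
gb_per_hour = 1          # storage after transcoding, all qualities combined

gb_per_day = hours_per_min * gb_per_hour * 60 * 24   # 720,000 GB
tb_per_day = gb_per_day / 1000                       # 720 TB/day
pb_per_year = tb_per_day * 365 / 1000                # ~263 PB/year
```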

Database Design

videos table (PostgreSQL/Spanner):
  id            UUID PRIMARY KEY
  title         TEXT
  description   TEXT
  creator_id    BIGINT
  status        ENUM(processing, published, removed)
  duration_sec  INT
  view_count    BIGINT DEFAULT 0
  published_at  TIMESTAMPTZ
  INDEX (creator_id, published_at DESC)
  INDEX (status, published_at DESC)  -- for feed queries

video_qualities table:
  video_id      UUID
  quality       ENUM(360p, 480p, 720p, 1080p, 4K)
  storage_path  TEXT  -- S3 key
  size_bytes    BIGINT
  PRIMARY KEY (video_id, quality)

view_events (ClickHouse / BigQuery — analytics):
  event_id    UUID
  video_id    UUID
  user_id     BIGINT
  started_at  TIMESTAMPTZ
  watch_secs  INT
  quality     TEXT
  device      TEXT
  country     TEXT

user_watch_history (Cassandra — high write rate):
  user_id      BIGINT       PARTITION KEY
  watched_at   TIMESTAMPTZ  CLUSTERING KEY (DESC)  -- newest first
  video_id     UUID         CLUSTERING KEY
  progress_sec INT  -- resume position

View Counter at Scale

Problem: 1B views/day = 11,600 views/sec
         Incrementing DB view_count per view → DB bottleneck

Solution: counter aggregation
  Redis INCR video:views:{video_id}  (atomic, fast)
  Background job every 30s:
    - Flush Redis counters to DB in batch
    - GETSET video:views:{video_id} 0  (atomic read + reset)
    - UPDATE videos SET view_count = view_count + {delta} WHERE id = ?
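The aggregation loop above, sketched with a plain dict standing in for Redis so it runs self-contained; a real worker would use redis-py (`r.incr`, `r.getset`) and a batched SQL UPDATE:

```python
# In-memory stand-in for Redis counters (video_id -> pending delta).
counters: dict[str, int] = {}

def record_view(video_id: str) -> None:
    """Hot path: equivalent of Redis INCR video:views:{video_id}."""
    counters[video_id] = counters.get(video_id, 0) + 1

def flush(db: dict) -> None:
    """Every 30s: read-and-reset each counter (GETSET ... 0),
    then apply the delta to the durable store in one batch."""
    for video_id in list(counters):
        delta, counters[video_id] = counters[video_id], 0
        db[video_id] = db.get(video_id, 0) + delta   # UPDATE ... + delta

db = {}
for _ in range(3):
    record_view("xyz123")
flush(db)
db["xyz123"]   # → 3
```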

Approximate counts (YouTube approach):
  Only update DB when count crosses thresholds: 100, 1000, 10000, ...
  Display: "1.2M views" is fine — exact count unimportant

Milestone events:
  On write: check if new count crosses 1M, 10M, 100M milestones
  → Trigger: notify creator, update trending algorithm, badge
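Because counts arrive in batched deltas, the milestone check compares the count before and after each flush. A sketch:

```python
MILESTONES = [1_000_000, 10_000_000, 100_000_000]

def crossed_milestone(old: int, new: int):
    """Return the milestone this batch update crossed, if any.
    Checked on flush, not per view, so it runs ~once per 30s per video."""
    for m in MILESTONES:
        if old < m <= new:
            return m
    return None

crossed_milestone(999_900, 1_000_250)   # → 1_000_000
```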

Live Streaming Architecture

Live streaming vs VOD (Video on Demand):
  VOD:  entire video pre-processed; segments pre-generated
  Live: real-time encoding; low latency required

Live ingest:
  Creator → RTMP client (OBS, mobile app)
          → RTMP ingest server
          → Transcoder: encode to HLS segments in real-time
          → 2-second segment length (lower = lower latency, more requests)
          → S3 + CDN: segments available ~6 seconds after capture
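The ~6-second figure follows from the segment length: the player cannot start a segment until it is fully encoded, and it typically buffers a few segments before playback. A rough lower-bound model (three buffered segments is a common default, not a fixed rule):

```python
def glass_to_glass_latency(segment_sec: float, buffered_segments: int = 3,
                           encode_overhead_sec: float = 0.0) -> float:
    """Rough lower bound on live latency: segments buffered by the
    player, each a full segment behind real time, plus encode overhead."""
    return segment_sec * buffered_segments + encode_overhead_sec

glass_to_glass_latency(2.0)   # → 6.0 seconds, matching the figure above
```

This is also why shorter segments lower latency at the cost of more playlist refreshes and per-segment requests.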

Ultra-low latency (WebRTC for < 1s delay):
  Used for: interactive live streams, auctions, sports betting
  WebRTC peer-to-peer or SFU (Selective Forwarding Unit) architecture
  See: System Design: WebRTC and Real-Time Video Architecture

Interview Discussion Points

  • Why segment videos into small chunks? Seekability: client jumps to any position by calculating the segment number. Adaptive quality: client switches quality at segment boundaries. Resilience: download one segment at a time; network interruption → just re-download that segment.
  • H.264 vs H.265 trade-off? H.265 (HEVC) produces 50% smaller files at equivalent quality, but requires 4× more CPU to encode. For Netflix: H.265 saves significant CDN costs at scale. Browser support: H.265 not universally supported (Edge yes, Chrome partially) → must serve both formats based on client capabilities.
  • How does Netflix achieve < 2s start time? Pre-buffer: on hover, client fetches the first 5 segments of the most likely quality. Predictive pre-loading: user’s network measured, quality pre-selected before play button is clicked. CDN PoP selection: DNS routes to nearest PoP with the content cached.
  • How to handle concurrent viewers for a viral video? The CDN absorbs the load — each PoP serves its regional audience from local cache. The origin server (S3) only handles CDN miss requests. For truly global events (World Cup), pre-distribute all segments to all PoPs hours before kickoff.

