System Design: Video Processing Pipeline (YouTube/Netflix) — Transcoding, HLS, and Scaling

The Video Upload and Processing Problem

YouTube processes 500 hours of video every minute. Each uploaded video must be transcoded into multiple resolutions (360p, 480p, 720p, 1080p, 4K), multiple formats (MP4/H.264, WebM/VP9, HLS segments for adaptive streaming), and have thumbnails generated — all before the video can be served. The challenge: decoupled async processing at massive scale with fault tolerance.

High-Level Architecture

User Browser
    │
    ├─[1] Upload raw video → Object Store (S3 "raw" bucket)
    │      via presigned URL — bypasses API servers
    │
API Server
    ├─[2] Create video metadata (title, status=PROCESSING)
    ├─[3] Publish "video.uploaded" event → Kafka
    │
Transcoding Workers (consume Kafka)
    ├─[4a] Download raw video from S3
    ├─[4b] Transcode to all target resolutions (FFmpeg)
    ├─[4c] Upload transcoded files → S3 "transcoded" bucket
    ├─[4d] Generate thumbnails
    ├─[5] Publish "video.transcoded" event → Kafka
    │
Post-Processing Workers
    ├─[6] Update video status → PUBLISHED
    ├─[7] Invalidate CDN cache for video page
    └─[8] Trigger search index update
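
The worker side of steps [3]-[5] can be sketched as a minimal consume-and-fan-out loop. This is a sketch, not the design's actual code: the kafka-python package is assumed, and `plan_transcode_tasks` / `run_transcode` are illustrative names.

```python
TARGETS = ['360p', '720p', '1080p']

def plan_transcode_tasks(event: dict) -> list:
    """Expand one video.uploaded event into one task per target resolution."""
    video_id = event['video_id']
    return [
        {
            'video_id': video_id,
            'resolution': res,
            'input_key': f"raw/{video_id}/original.mp4",
            'output_key': f"transcoded/{video_id}/{res}.mp4",
        }
        for res in TARGETS
    ]

if __name__ == '__main__':
    # Consume loop (requires a Kafka broker and the kafka-python package):
    # import json
    # from kafka import KafkaConsumer
    # consumer = KafkaConsumer('video.uploaded', group_id='transcoders',
    #                          enable_auto_commit=False)
    # for msg in consumer:
    #     for task in plan_transcode_tasks(json.loads(msg.value)):
    #         run_transcode(task)   # hypothetical: download, FFmpeg, upload
    #     consumer.commit()         # commit offset only after success
    pass
```

Note the manual offset commit: committing only after all resolutions succeed is what lets a crashed worker's job be redelivered to a healthy one.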

Upload via Presigned URL

Never route large file uploads through your API servers — it wastes bandwidth and CPU. Instead: (1) Client requests a presigned S3 URL from the API server. (2) API server generates a time-limited URL (15 minutes) directly from S3. (3) Client uploads the raw video directly to S3. (4) S3 event triggers a notification → API server marks upload as received. This keeps API servers thin and uses S3’s upload bandwidth directly.

import boto3

def generate_upload_url(video_id: str) -> dict:
    s3 = boto3.client('s3')
    key = f"raw/{video_id}/original.mp4"
    url = s3.generate_presigned_url(
        'put_object',
        Params={'Bucket': 'my-raw-videos', 'Key': key, 'ContentType': 'video/mp4'},
        ExpiresIn=900,  # 15 minutes
    )
    return {'upload_url': url, 'video_id': video_id, 'key': key}

Transcoding with FFmpeg

FFmpeg is the standard open-source tool for video transcoding. Each target resolution is a separate FFmpeg invocation (or a single pass with multiple outputs).

import subprocess
import os

RESOLUTIONS = [
    {'name': '360p',  'width': 640,  'height': 360,  'bitrate': '800k'},
    {'name': '720p',  'width': 1280, 'height': 720,  'bitrate': '2500k'},
    {'name': '1080p', 'width': 1920, 'height': 1080, 'bitrate': '5000k'},
]

def transcode(input_path: str, output_dir: str, video_id: str) -> list:
    outputs = []
    for res in RESOLUTIONS:
        output_path = os.path.join(output_dir, f"{res['name']}.mp4")
        cmd = [
            'ffmpeg', '-i', input_path,
            '-vf', f"scale={res['width']}:{res['height']}",
            '-b:v', res['bitrate'],
            '-c:v', 'libx264', '-c:a', 'aac',
            '-movflags', 'faststart',   # moov atom at start for fast seek
            '-y', output_path
        ]
        subprocess.run(cmd, check=True)
        outputs.append({'resolution': res['name'], 'path': output_path})
    return outputs

def generate_thumbnail(input_path: str, output_path: str, timestamp: str = '00:00:05'):
    subprocess.run([
        'ffmpeg', '-i', input_path,
        '-ss', timestamp, '-vframes', '1',
        '-q:v', '2', '-y', output_path
    ], check=True)
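
Workers also typically probe the source before transcoding to record its duration, resolution, and codec. A sketch using ffprobe (ships alongside FFmpeg and must be on PATH); `summarize_probe` is an illustrative helper, not part of FFmpeg:

```python
import json
import subprocess

def probe_metadata(input_path: str) -> dict:
    """Run ffprobe on the source file and return a metadata summary."""
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-show_format', '-show_streams',
         '-of', 'json', input_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return summarize_probe(json.loads(out))

def summarize_probe(probe: dict) -> dict:
    """Pull out the fields the pipeline records: duration, resolution, codec."""
    video = next(s for s in probe['streams'] if s['codec_type'] == 'video')
    return {
        'duration_sec': float(probe['format']['duration']),
        'width': video['width'],
        'height': video['height'],
        'codec': video['codec_name'],
    }
```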

Adaptive Bitrate Streaming (HLS)

Modern video players use adaptive streaming: they download short segments (2-10 seconds) and dynamically switch quality based on available bandwidth. HLS (HTTP Live Streaming) segments video into .ts chunks and serves an M3U8 playlist.

# Generate HLS segments for all resolutions
def create_hls(input_path: str, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)
    # Create per-resolution playlists
    for res in RESOLUTIONS:
        res_dir = os.path.join(output_dir, res['name'])
        os.makedirs(res_dir, exist_ok=True)
        subprocess.run([
            'ffmpeg', '-i', input_path,
            '-vf', f"scale={res['width']}:{res['height']}",
            '-b:v', res['bitrate'], '-c:v', 'libx264', '-c:a', 'aac',
            '-hls_time', '6',          # 6-second segments
            '-hls_playlist_type', 'vod',
            '-hls_segment_filename', os.path.join(res_dir, 'segment_%03d.ts'),
            os.path.join(res_dir, 'index.m3u8')
        ], check=True)
    # Create master playlist referencing all resolutions
    master_playlist = "#EXTM3U\n"
    for res in RESOLUTIONS:
        bandwidth = int(res['bitrate'][:-1]) * 1000   # '2500k' -> 2500000
        master_playlist += (
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},"
            f"RESOLUTION={res['width']}x{res['height']}\n"
        )
        master_playlist += f"{res['name']}/index.m3u8\n"
    with open(os.path.join(output_dir, 'master.m3u8'), 'w') as f:
        f.write(master_playlist)

Worker Scaling and Fault Tolerance

  • Kafka consumer groups: each transcoding worker is a consumer in the same consumer group. Kafka assigns partitions across workers for parallel processing. If a worker crashes, its partitions are rebalanced to healthy workers.
  • Idempotent workers: workers check if transcoded files already exist in S3 before transcoding (S3 HeadObject). If the worker crashes mid-transcode and restarts, it safely re-transcodes (overwriting the partial output).
  • Dead letter queue: after N failed transcode attempts, move the event to a DLQ for manual inspection and alerting.
  • Progress tracking: for long transcodes (4K video = hours), periodically update the job’s progress in Redis (percent complete) so the UI can show progress bars.
  • Spot/preemptible instances: transcoding is CPU-intensive but stateless (can restart from scratch). Use AWS Spot instances or GCP preemptible VMs — 60-90% cost reduction. If the instance is preempted, Kafka offset remains uncommitted and another worker picks up the job.

CDN Integration

After transcoding, video segments are served from a CDN (CloudFront, Fastly). The CDN caches segments globally — users stream from the nearest edge node. Videos are “push” cached (pre-warmed at edge) for popular content and “pull” cached (cached on first request) for long-tail content.
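
The invalidation step [7] from the architecture can be sketched as below. The path layout (/watch/..., master.m3u8) is illustrative, and the commented call assumes boto3 and a CloudFront distribution:

```python
import time

def build_invalidation_batch(video_id: str) -> dict:
    """InvalidationBatch payload covering the video page and its HLS manifest."""
    paths = [f"/watch/{video_id}", f"/videos/{video_id}/master.m3u8"]
    return {
        'Paths': {'Quantity': len(paths), 'Items': paths},
        # CallerReference deduplicates retried requests
        'CallerReference': f"{video_id}-{int(time.time())}",
    }

# Real call (assumes boto3 and a CloudFront distribution ID):
# import boto3
# cf = boto3.client('cloudfront')
# cf.create_invalidation(DistributionId='EDFDVBD6EXAMPLE',
#                        InvalidationBatch=build_invalidation_batch('v123'))
```

Invalidating only the page and manifest (not the immutable .ts segments) keeps invalidation costs low: segments get new paths per video, so they never need busting.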

Interview Questions

Q: How do you handle a 4-hour video upload and transcoding?

For large files: use multipart upload (S3 multipart splits the file into parts of 5 MB or more, uploaded in parallel and individually retryable, so the upload is resumable on failure). For transcoding: split the video into temporal segments (chunk by time), transcode each chunk independently in parallel, then concatenate. A 4-hour video split into 10-minute chunks = 24 parallel transcoding jobs, which reduces end-to-end latency from hours to minutes. Use a distributed job queue (Kafka + worker pool) to parallelize. Track chunk completion; merge when all chunks are done.
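
The chunk-and-merge approach can be sketched as follows. `plan_chunks` is an illustrative helper; the FFmpeg commands in the comments are the standard split/concat pattern, though clean splitting in practice requires care around keyframe alignment:

```python
def plan_chunks(duration_sec: float, chunk_sec: float = 600.0) -> list:
    """Split a video's timeline into (start, length) pairs for parallel transcode."""
    chunks, start = [], 0.0
    while start < duration_sec:
        chunks.append((start, min(chunk_sec, duration_sec - start)))
        start += chunk_sec
    return chunks

# Per-chunk worker job (re-encodes its slice of the timeline):
#   ffmpeg -ss {start} -i input.mp4 -t {length} -c:v libx264 ... chunk_{i}.mp4
# Merge once every chunk reports done, using the concat demuxer:
#   ffmpeg -f concat -safe 0 -i chunklist.txt -c copy output.mp4
```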

Q: How do you estimate the compute cost of transcoding?

Transcoding speed varies from roughly 3x faster than real-time (fast presets, low resolutions) to 3x slower (high-quality 1080p/4K). For 500 hours of video per minute: 500 * 60 = 30,000 minutes of raw video arriving per minute. With 4 resolutions: 120,000 output minutes per minute. Even at the optimistic 3x-faster end, that is 40,000 CPU-minutes of work per minute, i.e. ~40,000 cores running constantly (one CPU-minute of work arriving per minute keeps one core permanently busy). In practice, use GPU-accelerated transcoding (NVENC), which is 10-50x faster than CPU encoding and dramatically reduces cost.
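
As a quick sanity check, the back-of-envelope arithmetic:

```python
raw_min_per_min = 500 * 60                     # 500 hours uploaded per minute = 30,000 min
variants = 4
work_min_per_min = raw_min_per_min * variants  # 120,000 output minutes per minute
# CPU-minutes of work arriving per minute equals the number of cores
# that must run continuously to keep up.
cores_at_1x = work_min_per_min                 # encoding at 1:1 real-time
cores_at_3x = work_min_per_min // 3            # encoder running 3x faster than real-time
```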

Q: How does YouTube handle video upload and processing at scale?

The client requests a presigned URL from the upload service and uploads the video directly to object storage (S3/GCS), bypassing app servers. On upload completion, S3 fires an event to a message queue (SQS/Kafka). Transcoding workers consume these events and process each video: extract metadata (duration, resolution, codec), then transcode to multiple resolutions (360p, 480p, 720p, 1080p, 4K) using FFmpeg. Each resolution is a separate task, parallelizable across workers. Thumbnails are extracted at configurable timestamps. Output segments and manifests are written to a CDN origin bucket. The video record is updated to AVAILABLE only after all required resolutions are processed.

Q: What is HLS adaptive streaming and how does it work?

HLS (HTTP Live Streaming) splits a video into small segments (2-6 seconds each) and generates an M3U8 playlist file per resolution. A master playlist lists all variant streams with their bandwidth and resolution. The video player downloads the master playlist, then selects the variant stream matching available bandwidth. During playback, the player continuously monitors download speed and buffer level: if bandwidth drops, it switches to a lower-resolution playlist; if bandwidth improves, it switches up. Each resolution is independently segmented so the player can switch at any segment boundary. This allows smooth playback even on variable-quality connections.

Q: How do you prevent duplicate video processing when a worker crashes mid-job?

Use the message queue's visibility timeout. When a worker picks up a transcoding job, the message becomes invisible to other consumers for N minutes (e.g., 30 minutes for a long transcode). If the worker crashes, the message reappears after the timeout and is reprocessed by another worker. Make transcoding idempotent: output files are written to deterministic paths (video_id/720p/segment_001.ts), so a re-run overwrites the same files. Track progress in a jobs table: status=PROCESSING with worker_id and started_at. On crash recovery, the new worker picks up the message, checks the jobs table, and resumes from the last completed segment checkpoint. On success, status=DONE and the message is deleted.

Q: How do you scale the transcoding pipeline to handle 500 hours of video uploaded per minute?

Use a worker pool auto-scaled by queue depth. Each transcoding job (one video, one resolution) is an independent task; a 10-minute video at 720p takes about 2 minutes of CPU time. 500 hours/minute of uploads is roughly 3,000 ten-minute videos per minute; with 4 resolution variants each, that is ~12,000 transcoding jobs per minute, each taking about 2 minutes, so roughly 24,000 concurrent workers are needed at peak. Use spot/preemptible instances for transcoding (~70% cost savings) and handle preemption with job checkpointing. Use separate queues for priority tiers: premium users go to a fast queue served by on-demand instances; free uploads go to the spot queue. The CDN caches all segments, so playback load does not hit the origin.

Q: How do you implement resumable video uploads for large files?

Use multipart upload (S3 multipart or the tus protocol). The client splits the file into chunks (5-50 MB each) and uploads each chunk independently with a part number. S3 stores parts until CompleteMultipartUpload is called with the list of ETags. If a chunk fails, only that chunk is retried, not the entire file. The client tracks which parts succeeded (e.g., in localStorage); on browser restart, it queries which parts the server already has and resumes from the first missing chunk. Incomplete multipart uploads persist until completed or aborted, so set a lifecycle rule to abort them after e.g. 24 hours to avoid storage leakage.
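
The multipart flow above can be sketched as follows. `plan_parts` is an illustrative helper; the commented S3 calls assume boto3:

```python
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB parts (S3 minimum is 5 MB except the last)

def plan_parts(file_size: int, chunk_size: int = CHUNK_SIZE) -> list:
    """(part_number, offset, length) for each part; S3 part numbers start at 1."""
    parts = []
    offset, part_no = 0, 1
    while offset < file_size:
        length = min(chunk_size, file_size - offset)
        parts.append((part_no, offset, length))
        offset += length
        part_no += 1
    return parts

# Server-side flow (assumes boto3):
# s3 = boto3.client('s3')
# mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
# etags = {}
# for part_no, offset, length in plan_parts(size):
#     resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
#                           PartNumber=part_no, Body=read_chunk(offset, length))
#     etags[part_no] = resp['ETag']     # retry only failed parts
# s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu['UploadId'],
#     MultipartUpload={'Parts': [{'PartNumber': n, 'ETag': e}
#                                for n, e in sorted(etags.items())]})
```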

Asked at: Netflix, Uber, Databricks, Twitter/X, Cloudflare
