WebRTC Fundamentals
WebRTC (Web Real-Time Communication) enables peer-to-peer audio, video, and data transfer directly between browsers without a central media server. Three core APIs: MediaStream (camera/microphone), RTCPeerConnection (P2P connection), RTCDataChannel (arbitrary data). The challenge: browsers cannot connect directly due to NAT and firewalls. Solution: ICE (Interactive Connectivity Establishment) with STUN/TURN servers.
Signaling
WebRTC deliberately leaves signaling undefined; you implement it yourself, typically over WebSocket (long polling also works). Signaling exchanges two kinds of messages: SDP offers/answers (codec negotiation, media direction) and ICE candidates (network addresses). Typical flow: the caller creates an SDP offer and sends it via the signaling server, which relays it to the callee; the callee creates an SDP answer and sends it back; both sides exchange ICE candidates as they are discovered (trickle ICE). The signaling server is needed only during call setup and is stateless relative to media: it just relays messages, and once the peers connect, media flows P2P without it.
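The relay role of the signaling server can be sketched in a few lines. This is an illustrative sketch only; the function names, the socket shape, and the message format are assumptions, and a real server would use an actual WebSocket library and handle disconnects.

```javascript
// Minimal signaling-relay sketch (illustrative names, not a real API).
// Each call has a set of connected peers; the server just forwards
// offer/answer/candidate messages to everyone else in the call
// without ever inspecting the media itself.
const calls = new Map(); // callId -> Map(peerId -> socket-like {send})

function join(callId, peerId, socket) {
  if (!calls.has(callId)) calls.set(callId, new Map());
  calls.get(callId).set(peerId, socket);
}

function relaySignal(callId, fromPeerId, message) {
  // message is an SDP offer/answer or an ICE candidate; the server
  // is stateless relative to media -- it only relays.
  const peers = calls.get(callId) || new Map();
  for (const [peerId, socket] of peers) {
    if (peerId !== fromPeerId) {
      socket.send(JSON.stringify({ from: fromPeerId, ...message }));
    }
  }
}
```

The same relay handles offers, answers, and trickled ICE candidates; it never needs to understand SDP, which is why the signaling tier stays simple and cheap compared to the media tier.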
NAT Traversal: STUN and TURN
STUN (Session Traversal Utilities for NAT): the client queries a STUN server to discover its public IP:port; the server replies with the source address it observed. This works for most NAT types (full cone, address-restricted). TURN (Traversal Using Relays around NAT): when a direct path fails (symmetric NAT, corporate firewalls), TURN relays all media through a server, so server bandwidth scales with total media traffic, making it expensive. Only roughly 15-20% of calls need TURN; the rest succeed with STUN or a direct connection. ICE gathers all candidate types (host, server-reflexive from STUN, relayed from TURN) and tries them in priority order: direct > STUN-discovered > TURN-relayed.
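The priority ordering ICE applies can be illustrated with the type-preference term alone. The preference values below (host 126, server-reflexive 100, relay 0) follow the recommendations in RFC 8445, but real ICE computes a fuller priority formula; this sketch only shows the ranking that term produces.

```javascript
// Sketch of ICE candidate ordering by type preference:
// host (direct) > srflx (STUN-discovered) > relay (TURN).
// Values follow RFC 8445's recommended type preferences; real ICE
// combines them with component ID and local preference.
const TYPE_PREFERENCE = { host: 126, srflx: 100, relay: 0 };

function sortCandidates(candidates) {
  // Highest type preference first; ICE then tests pairs in this order
  // and selects the best working path.
  return [...candidates].sort(
    (a, b) => TYPE_PREFERENCE[b.type] - TYPE_PREFERENCE[a.type]
  );
}
```

Sorting this way means TURN relays are only used when every cheaper candidate pair fails connectivity checks.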
Topology: Mesh vs SFU vs MCU
Mesh (P2P): each participant connects to every other, so N participants need N*(N-1)/2 peer connections and each client uploads N-1 copies of its stream. Works for 2-3 people; client upload bandwidth grows linearly with participant count. At 6 people each client uploads 5 streams, so mesh is impractical beyond 4-5 participants. SFU (Selective Forwarding Unit): clients connect to a central server; each client uploads once, and the SFU forwards the right streams to each subscriber. The SFU does not decode or re-encode; it forwards RTP packets, so server CPU stays low. Supports simulcast: clients upload multiple quality levels (360p, 720p, 1080p) and the SFU selects the appropriate quality per subscriber. This is the industry standard: Zoom, Google Meet, and Discord all use SFU-style architectures. MCU (Multipoint Control Unit): the server decodes all streams, composites them into one video grid, re-encodes, and sends a single stream to each client. Lowest client download bandwidth, highest server CPU; used for recording and very low-bandwidth scenarios.
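The bandwidth arithmetic behind the mesh-vs-SFU decision is simple enough to write down directly. A back-of-envelope sketch:

```javascript
// Back-of-envelope comparison of mesh vs SFU for n participants.

function meshConnections(n) {
  // One peer connection per pair of participants.
  return (n * (n - 1)) / 2;
}

function uploadsPerClient(topology, n) {
  // mesh: send a copy of your stream to every other peer;
  // sfu/mcu: send a single stream to the server, which distributes it.
  return topology === "mesh" ? n - 1 : 1;
}
```

At 6 participants, mesh needs 15 peer connections and 5 upload streams per client, while an SFU needs one upload per client regardless of call size; that constant upload cost is why SFU is the default answer for group calls.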
SFU Architecture for Scale
Each SFU server handles ~1000 concurrent participants. For a large call (1000+ participants): use a cascade of SFU servers. Edge SFUs receive streams from participants and forward to a root SFU. Root SFU forwards to other edge SFUs. Sharding: route calls to SFU servers by call_id (consistent hashing). Geographic distribution: place SFU servers in regions close to users to minimize latency (target under 100ms RTT). Auto-scale SFU clusters by concurrent participant count. The signaling server assigns participants to an SFU server and handles SFU failover.
Quality Adaptation
WebRTC uses RTCP feedback for congestion control (REMB, Transport-CC). When bandwidth decreases, the sender reduces resolution (720p → 360p) or frame rate (30fps → 15fps). Simulcast: the sender uploads 3 quality layers simultaneously, and the SFU switches which layer it forwards without the sender changing anything; switching is fast because no sender-side re-encoding is needed (the SFU waits for or requests a keyframe on the target layer). On the receiving side, WebRTC's built-in jitter buffer handles packet reordering; NACK requests retransmission of lost packets; FEC (Forward Error Correction) recovers from loss without retransmission, which suits real-time audio where a retransmission would arrive too late.
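The SFU-side layer-selection step can be sketched as picking the highest simulcast layer whose bitrate fits the subscriber's estimated bandwidth. The layer bitrates below are illustrative ballpark figures, not values from any particular SFU.

```javascript
// Sketch of SFU simulcast layer selection: forward the highest
// quality layer that fits the subscriber's estimated bandwidth
// (from REMB / Transport-CC feedback). Bitrates are illustrative.
const LAYERS = [
  { name: "1080p", kbps: 2500 },
  { name: "720p", kbps: 1200 },
  { name: "360p", kbps: 400 },
];

function selectLayer(availableKbps, layers = LAYERS) {
  // Layers are ordered highest-first; fall back to the lowest layer
  // rather than dropping video entirely.
  return layers.find((l) => l.kbps <= availableKbps) || layers[layers.length - 1];
}
```

Because the sender is already uploading every layer, this decision is purely local to the SFU and can change per subscriber on every bandwidth estimate.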
Interview Tips
- Always mention TURN fallback: most candidates forget that 15-20% of calls need relay.
- SFU is the right answer for group calls; mention simulcast for quality adaptation.
- Signaling is a separate concern from media — keep them architecturally separate.
- Recording: tap the SFU, decode and mux to file, store in object storage.