WebRTC Fundamentals
WebRTC (Web Real-Time Communication) enables peer-to-peer audio, video, and data transfer directly between browsers, without routing media through a central server. It has three core APIs: MediaStream (camera/microphone capture), RTCPeerConnection (the P2P connection), and RTCDataChannel (arbitrary data). The challenge: browsers often cannot reach each other directly because of NAT and firewalls. The solution: ICE (Interactive Connectivity Establishment), assisted by STUN and TURN servers.
Signaling
WebRTC does not define signaling; you implement it yourself. Signaling exchanges two things: SDP offers/answers (codec negotiation, media direction) and ICE candidates (network addresses). Typical flow: the caller creates an offer (SDP) and sends it via WebSocket to the signaling server, which relays it to the callee. The callee creates an answer and sends it back the same way. Both sides exchange ICE candidates as they are discovered (trickle ICE). The signaling server is needed for call setup and for any later renegotiation (for example, adding a track or restarting ICE); once peers connect, media flows P2P. Implement signaling with WebSocket or long polling. The signaling server never touches media; it just relays messages.
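The relay role described above can be sketched without any networking. This is a minimal in-memory illustration, not a production server: the `SignalingRelay` class, its method names, and the message shape (`type`, `sdp`, `from`) are assumptions made for the example, and a real deployment would replace the `send` callbacks with WebSocket connections.

```python
from collections import defaultdict
from typing import Callable, Dict

Message = dict
SendFn = Callable[[Message], None]


class SignalingRelay:
    """Hypothetical in-memory stand-in for a signaling server.

    It only forwards messages between peers in the same call and never
    handles media.
    """

    ALLOWED_TYPES = {"offer", "answer", "candidate"}

    def __init__(self) -> None:
        # call_id -> {peer_id -> function that delivers a message to that peer}
        self._rooms: Dict[str, Dict[str, SendFn]] = defaultdict(dict)

    def join(self, call_id: str, peer_id: str, send: SendFn) -> None:
        self._rooms[call_id][peer_id] = send

    def leave(self, call_id: str, peer_id: str) -> None:
        self._rooms[call_id].pop(peer_id, None)

    def relay(self, call_id: str, sender: str, message: Message) -> int:
        """Forward a message to every other peer; return how many received it."""
        if message.get("type") not in self.ALLOWED_TYPES:
            raise ValueError(f"unsupported signaling message: {message.get('type')!r}")
        delivered = 0
        for peer_id, send in self._rooms[call_id].items():
            if peer_id != sender:
                send({**message, "from": sender})
                delivered += 1
        return delivered


# Usage: lists stand in for each peer's connection.
relay = SignalingRelay()
caller_inbox, callee_inbox = [], []
relay.join("call-1", "caller", caller_inbox.append)
relay.join("call-1", "callee", callee_inbox.append)
relay.relay("call-1", "caller", {"type": "offer", "sdp": "<offer SDP>"})
relay.relay("call-1", "callee", {"type": "answer", "sdp": "<answer SDP>"})
```

The relay keeps only a routing table of who is in which call, which is why the signaling tier can be scaled and replaced independently of the media path.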
NAT Traversal: STUN and TURN
STUN (Session Traversal Utilities for NAT): the client queries a STUN server to discover its public IP:port. The STUN server replies with the source IP:port it observed on the request. This works for most NATs (full cone, address-restricted). TURN (Traversal Using Relays around NAT): when P2P fails (symmetric NAT, restrictive corporate firewalls), TURN relays all media through a server. TURN is expensive because the server carries the full media bandwidth of every relayed call. A commonly cited figure is that roughly 15-20% of calls need TURN; the rest succeed with STUN or a direct connection. ICE gathers all candidate types (host, server-reflexive from STUN, relayed from TURN) and tries them in priority order: host (direct) > server-reflexive (STUN-discovered) > relayed (TURN).
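The priority ordering comes from a formula in the ICE specification (RFC 8445): priority = 2^24 * type preference + 2^8 * local preference + (256 - component ID). The sketch below applies it using the type preferences the RFC recommends; the candidate addresses are made-up documentation addresses, and the function name is an assumption for illustration.

```python
# Type preferences recommended by RFC 8445: host highest, relay lowest.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type: str, local_pref: int = 65535, component_id: int = 1) -> int:
    """Compute an ICE candidate priority per the RFC 8445 formula."""
    return (TYPE_PREFERENCE[cand_type] << 24) + (local_pref << 8) + (256 - component_id)


# Hypothetical candidates gathered by one peer: (type, address).
candidates = [
    ("relay", "203.0.113.7:50000"),   # allocated on a TURN server
    ("host", "192.168.1.10:54321"),   # local interface
    ("srflx", "198.51.100.4:61000"),  # public mapping learned via STUN
]

ordered = sorted(candidates, key=lambda c: candidate_priority(c[0]), reverse=True)
for cand_type, address in ordered:
    print(cand_type, address, candidate_priority(cand_type))
# host 192.168.1.10:54321 2130706431
# srflx 198.51.100.4:61000 1694498815
# relay 203.0.113.7:50000 16777215
```

Because the type preference occupies the top bits, a relayed candidate can never outrank a host or server-reflexive one, which is what keeps TURN as a last resort.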
Topology: Mesh vs SFU vs MCU
- Mesh (P2P): each participant connects to every other participant. N participants means N*(N-1)/2 peer connections carrying N*(N-1) one-way media streams. Works for 2-3 people; each client's upload bandwidth grows linearly with N (at 6 people, each client uploads 5 streams). Impractical beyond 4-5 participants.
- SFU (Selective Forwarding Unit): clients connect to a central server. Each client uploads once; the SFU forwards the right streams to each subscriber. The SFU does not decode or re-encode; it forwards RTP packets, so server CPU stays low. Supports simulcast: clients upload multiple quality levels (360p, 720p, 1080p) and the SFU selects the appropriate one per subscriber. This is the industry standard for group calls: Zoom, Google Meet, and Discord all use SFU-style forwarding.
- MCU (Multipoint Control Unit): the server decodes all streams, composites them into one video grid, re-encodes, and sends one stream to each client. Lowest client download bandwidth, highest server CPU. Used for recording and for very low-bandwidth scenarios.
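The trade-offs between the three topologies are easiest to see as stream counts. The sketch below is a simplified model (one video stream per participant, audio ignored); the `TopologyCost` type and function names are assumptions for illustration. A mesh connection is counted once per pair of peers, so a mesh has N*(N-1)/2 peer connections carrying N*(N-1) one-way streams.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyCost:
    connections: int           # total transport connections in the call
    uploads_per_client: int    # video streams each client sends
    downloads_per_client: int  # video streams each client receives


def mesh(n: int) -> TopologyCost:
    # Every pair of peers shares one connection; each peer sends to all others.
    return TopologyCost(n * (n - 1) // 2, n - 1, n - 1)


def sfu(n: int, simulcast_layers: int = 1) -> TopologyCost:
    # One connection per client to the SFU; upload is one stream per simulcast layer.
    return TopologyCost(n, simulcast_layers, n - 1)


def mcu(n: int) -> TopologyCost:
    # One connection per client; the server returns a single composited stream.
    return TopologyCost(n, 1, 1)


for name, cost in [("mesh", mesh(6)), ("sfu", sfu(6, simulcast_layers=3)), ("mcu", mcu(6))]:
    print(name, cost)
```

For a 6-person call this gives 5 uploads per client in a mesh, 3 (the simulcast layers) with an SFU, and 1 with an MCU, while the download side drops from 5 streams to 1 only with the MCU.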
SFU Architecture for Scale
A single SFU server handles on the order of 1,000 concurrent participants (a rough planning figure; the real limit depends on hardware, bitrates, and fan-out). For a call that exceeds one server's capacity, use a cascade of SFU servers: edge SFUs receive streams from participants and forward them to a root SFU, which forwards them to the other edge SFUs. Sharding: route calls to SFU servers by call_id (consistent hashing), so every participant in a call lands on the same server. Geographic distribution: place SFU servers in regions close to users to minimize latency (target under 100ms RTT). Auto-scale SFU clusters by concurrent participant count. The signaling server assigns participants to an SFU server and handles SFU failover.
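Routing by call_id with consistent hashing can be sketched as a hash ring. This is a minimal illustration under assumptions: the `SfuRing` class, the server names, and the choice of 100 virtual nodes per server are all hypothetical, and a real system would also weigh load and region when assigning calls.

```python
import bisect
import hashlib
from typing import List, Tuple


def _hash(key: str) -> int:
    # Stable 64-bit hash so every signaling node computes the same placement.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class SfuRing:
    """Hypothetical consistent-hash ring mapping call_id -> SFU server."""

    def __init__(self, servers: List[str], vnodes: int = 100) -> None:
        # Each server appears at many points on the ring to even out the load.
        self._ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{server}#{i}"), server) for server in servers for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, call_id: str) -> str:
        if not self._ring:
            raise ValueError("no SFU servers available")
        # First ring point clockwise from the call's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(call_id)) % len(self._ring)
        return self._ring[idx][1]


ring = SfuRing(["sfu-a", "sfu-b", "sfu-c"])
print(ring.server_for("call-42"))  # same answer on every node that builds this ring
```

The property that matters for failover is that removing one server only remaps the calls that were on it; calls on the surviving servers keep their placement.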
Quality Adaptation
WebRTC uses RTCP feedback for congestion control (REMB, Transport-CC). When available bandwidth drops, the sender reduces resolution (720p → 360p) or frame rate (30fps → 15fps). Simulcast: the sender uploads 3 quality layers simultaneously; the SFU switches which layer to forward without the sender changing anything. Switching is fast because it needs no renegotiation or encoder reconfiguration, but the SFU can only switch at a keyframe on the target layer, so it typically requests one (PLI) when moving a subscriber to a different layer. On the receiving side, WebRTC’s built-in jitter buffer handles packet reordering; NACK requests retransmission of lost packets; FEC (Forward Error Correction) recovers from loss without retransmission (better for real-time audio).
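The per-subscriber layer choice an SFU makes can be sketched as a simple threshold rule. This is an illustrative model only: the layer bitrates, the `headroom` safety factor, and the function name are assumptions, and real SFUs add hysteresis and other heuristics on top.

```python
from typing import NamedTuple, Sequence


class Layer(NamedTuple):
    rid: str     # simulcast layer identifier
    height: int  # vertical resolution in pixels
    kbps: int    # assumed target bitrate, for illustration only


LAYERS = (Layer("low", 360, 500), Layer("mid", 720, 1500), Layer("high", 1080, 3000))


def select_layer(estimated_kbps: float, layers: Sequence[Layer] = LAYERS,
                 headroom: float = 0.8) -> Layer:
    """Pick the highest layer that fits within a fraction of the estimated bandwidth."""
    budget = estimated_kbps * headroom
    ordered = sorted(layers, key=lambda layer: layer.kbps)
    chosen = ordered[0]  # never drop below the lowest layer
    for layer in ordered:
        if layer.kbps <= budget:
            chosen = layer
    return chosen


for estimate in (5000, 2000, 1000, 100):
    print(estimate, "kbps ->", select_layer(estimate).rid)
```

Keeping some headroom below the bandwidth estimate leaves room for audio, retransmissions, and estimate error, so a subscriber is less likely to bounce between layers.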
Interview Tips
- Always mention TURN fallback: most interviewees forget that roughly 15-20% of calls need a relay.
- SFU is the right answer for group calls; mention simulcast for quality adaptation.
- Signaling is a separate concern from media — keep them architecturally separate.
- Recording: tap the SFU, decode and mux to file, store in object storage.