Requirements
Functional: 1-on-1 and group messaging, real-time message delivery, message persistence and history, read receipts (sent/delivered/read), user presence (online/offline/last seen), media attachments, push notifications for offline users.
Non-functional: messages must be delivered in order, at-least-once delivery, low latency (< 100ms for online users), scale to 100M users with billions of messages/day.
Core Architecture
WebSocket vs. Long Polling vs. SSE
- WebSocket: bidirectional, persistent TCP connection. Best for chat — low overhead per message, server can push at any time. Used by WhatsApp, Slack, Discord.
- Long Polling: client makes HTTP request, server holds it open until a message arrives (or timeout). Simpler to implement, works through all firewalls. Higher latency, more overhead per message. Fallback for environments that block WebSockets.
- Server-Sent Events (SSE): server push over HTTP/1.1, unidirectional. Good for notifications but not chat (can’t send from client).
Chat Server Architecture
User A <--ws--> [Chat Server 1]       [Chat Server 2] <--ws--> User B
                      |                      |
               [Message Store]        [Message Store]
                      |                      |
                      +--[Redis Pub/Sub or Kafka]--+
User A connects to Chat Server 1. User B connects to Chat Server 2 (different node). When A sends a message to B: (1) Chat Server 1 persists the message. (2) Chat Server 1 publishes to Redis Pub/Sub channel user:{B_id}. (3) Chat Server 2 subscribes to that channel and receives the message. (4) Chat Server 2 pushes it to B’s WebSocket connection.
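The four-step flow above can be sketched with an in-memory pub/sub standing in for Redis. The class and method names (InMemoryPubSub, ChatServer, connect, send) are illustrative, not part of any real system's API:

```python
from collections import defaultdict
from typing import Callable, Dict, List

class InMemoryPubSub:
    """Stand-in for Redis Pub/Sub: channel -> list of subscriber callbacks."""
    def __init__(self):
        self.subscribers: Dict[str, List[Callable]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Callable) -> None:
        self.subscribers[channel].append(callback)

    def publish(self, channel: str, message: dict) -> None:
        for cb in self.subscribers[channel]:
            cb(message)

class ChatServer:
    def __init__(self, name: str, pubsub: InMemoryPubSub, store: list):
        self.name = name
        self.pubsub = pubsub
        self.store = store  # stand-in for the message store
        self.connections: Dict[str, list] = {}  # user_id -> messages pushed to that socket

    def connect(self, user_id: str) -> None:
        """Accept a 'WebSocket' and subscribe to the user's channel (step 3)."""
        self.connections[user_id] = []
        self.pubsub.subscribe(f"user:{user_id}",
                              lambda m: self.connections[user_id].append(m))

    def send(self, sender_id: str, recipient_id: str, content: str) -> None:
        msg = {"sender": sender_id, "to": recipient_id, "content": content}
        self.store.append(msg)                            # (1) persist first
        self.pubsub.publish(f"user:{recipient_id}", msg)  # (2) publish to user:{B_id}

pubsub, store = InMemoryPubSub(), []
server1 = ChatServer("cs1", pubsub, store)
server2 = ChatServer("cs2", pubsub, store)
server2.connect("user_b")                    # B is connected to Chat Server 2
server1.send("user_a", "user_b", "hello")    # A sends via Chat Server 1
# (4) server2 pushed the message to B's connection
```

Persisting before publishing (step 1 before step 2) matters: if the publish is lost, the recipient can still recover the message from the store on reconnect.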
Message Data Model
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Message:
    message_id: str           # UUID, globally unique
    conversation_id: str      # groups messages into a conversation
    sender_id: str
    content: str
    message_type: str         # 'text' | 'image' | 'video' | 'file'
    media_url: Optional[str]
    sent_at: datetime         # client-side timestamp
    server_at: datetime       # server-received timestamp (for ordering)
    sequence_num: int         # per-conversation monotonic sequence for ordering
@dataclass
class Conversation:
    conversation_id: str
    type: str                 # 'direct' | 'group'
    participant_ids: List[str]
    created_at: datetime
    last_message_id: Optional[str]
    last_activity: datetime
@dataclass
class MessageStatus:
    message_id: str
    user_id: str
    status: str               # 'delivered' | 'read'
    timestamp: datetime
Message Ordering and Sequencing
Challenge: two users sending simultaneously — which message comes first? Options:
- Server-assigned sequence number: a sequence service (or database auto-increment) assigns a monotonically increasing sequence_num per conversation. Messages are displayed sorted by sequence_num. Race condition: two near-simultaneous messages from different servers may get sequence numbers out of order.
- Logical clock (Lamport timestamp): each message has a logical clock value. On send, increment clock; on receive, set clock = max(local, received) + 1. Total ordering across all clients.
- Client timestamp + sequence number: hybrid — use sequence number for ordering within a session; use server_at for cross-session ordering. Good enough for most chat apps.
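The Lamport-timestamp option above can be shown in a few lines. This is a minimal sketch (the LamportClock name is illustrative); ties between equal timestamps are broken by comparing client IDs, which is what makes the ordering total:

```python
from dataclasses import dataclass

@dataclass
class LamportClock:
    clock: int = 0

    def on_send(self) -> int:
        """Increment the local clock before stamping an outgoing message."""
        self.clock += 1
        return self.clock

    def on_receive(self, received: int) -> int:
        """Merge the remote clock: clock = max(local, received) + 1."""
        self.clock = max(self.clock, received) + 1
        return self.clock

a, b = LamportClock(), LamportClock()
t1 = a.on_send()        # a sends:    a.clock = 1
t2 = b.on_receive(t1)   # b receives: b.clock = max(0, 1) + 1 = 2
t3 = b.on_send()        # b replies:  b.clock = 3
t4 = a.on_receive(t3)   # a receives: a.clock = max(1, 3) + 1 = 4
# Sorting by (timestamp, client_id) gives a total order consistent with
# causality: the reply (t3) always sorts after the message it answers (t1).
```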
Message Storage at Scale
Facebook Messenger uses HBase; WhatsApp uses Mnesia (Erlang); Discord uses Cassandra. Key requirements: write-heavy (every message), sequential reads (load conversation history), range queries (messages since timestamp X).
-- Cassandra schema: partition by conversation, cluster by sequence_num
CREATE TABLE messages (
    conversation_id UUID,
    sequence_num    BIGINT,
    message_id      UUID,
    sender_id       UUID,
    content         TEXT,
    sent_at         TIMESTAMP,
    PRIMARY KEY (conversation_id, sequence_num)
) WITH CLUSTERING ORDER BY (sequence_num DESC);

-- Query last 50 messages (newest first, thanks to DESC clustering):
-- SELECT * FROM messages WHERE conversation_id = ? LIMIT 50;
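What the schema buys can be illustrated with an in-memory stand-in: all messages for one conversation live in a single partition, kept sorted by sequence_num, so "last 50" is one contiguous read with no scatter-gather. The MessageStore class below is a sketch, not a Cassandra client:

```python
import bisect
from collections import defaultdict

class MessageStore:
    def __init__(self):
        # partition key: conversation_id -> list of (sequence_num, content),
        # kept sorted, mirroring the clustering column
        self.partitions = defaultdict(list)

    def write(self, conversation_id: str, sequence_num: int, content: str) -> None:
        bisect.insort(self.partitions[conversation_id], (sequence_num, content))

    def last_n(self, conversation_id: str, n: int = 50):
        """Equivalent of SELECT ... LIMIT n under DESC clustering order:
        a single tail read of one partition, newest first."""
        return list(reversed(self.partitions[conversation_id][-n:]))

store = MessageStore()
for i in range(1, 101):
    store.write("conv1", i, f"msg {i}")
recent = store.last_n("conv1", 50)   # recent[0] has sequence_num 100
```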
User Presence
from datetime import datetime
from typing import Optional

import redis

r = redis.Redis(decode_responses=True)

class PresenceService:
    ONLINE_TTL = 30  # seconds; clients heartbeat every 15s by re-calling user_online

    def user_online(self, user_id: str, server_id: str):
        # The key doubles as a routing-table entry: its value is the chat
        # server holding the user's WebSocket. TTL expiry catches crashed
        # clients that never sent an explicit disconnect.
        r.setex(f"presence:{user_id}", self.ONLINE_TTL, server_id)
        r.publish("presence_updates", f"{user_id}:online")

    def user_offline(self, user_id: str):
        r.delete(f"presence:{user_id}")
        r.set(f"last_seen:{user_id}", datetime.utcnow().isoformat())
        r.publish("presence_updates", f"{user_id}:offline")

    def is_online(self, user_id: str) -> bool:
        return r.exists(f"presence:{user_id}") > 0

    def get_server(self, user_id: str) -> Optional[str]:
        """Which chat server is this user connected to?"""
        return r.get(f"presence:{user_id}")
Push Notifications for Offline Users
When a user is offline (no WebSocket connection), fall back to push notifications:
- Message arrives at a chat server. Check presence: r.exists(f"presence:{recipient_id}").
- If online: push via WebSocket through the Pub/Sub routing.
- If offline: publish a "push_notification" event to Kafka. A notification worker consumes it and sends to FCM (Android) or APNs (iOS).
- When the user comes back online, they pull unread messages from the message store (catch-up).
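The online/offline branch can be sketched as follows, with plain dicts and lists standing in for the Redis presence keys, the Kafka topic, and the live WebSocket connections (the deliver function is illustrative, not from the original text):

```python
presence = {}    # stand-in for Redis keys: presence:{user_id} -> server_id
push_queue = []  # stand-in for the Kafka "push_notification" topic
delivered = []   # stand-in for pushes over live WebSocket connections

def deliver(recipient_id: str, message: dict) -> str:
    """Route one message: WebSocket if the recipient is online, else push."""
    if recipient_id in presence:                 # r.exists(f"presence:{id}")
        delivered.append((recipient_id, message))
        return "websocket"
    push_queue.append({"user": recipient_id, "preview": message["content"]})
    return "push"  # a notification worker later forwards this to FCM / APNs

presence["alice"] = "chat-server-1"           # alice is online, bob is not
route_a = deliver("alice", {"content": "hi"}) # -> "websocket"
route_b = deliver("bob", {"content": "hi"})   # -> "push"
```

Note the push path only enqueues; the actual FCM/APNs call happens asynchronously in the worker, so a slow provider never blocks the chat server.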
Read Receipts
from datetime import datetime

def mark_delivered(message_id: str, user_id: str):
    db.upsert(MessageStatus(message_id, user_id, 'delivered', datetime.utcnow()))
    # Notify the sender so their UI can flip the tick to 'delivered'
    sender_id = get_message(message_id).sender_id
    push_status_update(sender_id, message_id, 'delivered')

def mark_read(conversation_id: str, user_id: str, up_to_sequence: int):
    """Batch-mark all messages up to sequence_num as read (assumes the
    message_status table also carries conversation_id and sequence_num)."""
    db.execute(
        "UPDATE message_status SET status='read', timestamp=NOW() "
        "WHERE conversation_id=%s AND user_id=%s AND sequence_num <= %s AND status != 'read'",
        [conversation_id, user_id, up_to_sequence]
    )
    push_read_receipt(conversation_id, user_id, up_to_sequence)
Scaling
- Chat servers: stateless except for WebSocket connections. Use consistent hashing to route a user_id to a specific chat server (sticky sessions for Pub/Sub efficiency). Auto-scale based on connection count.
- Message fan-out in groups: for a group with 1000 members, sending a message requires 1000 Pub/Sub publishes. Cap group size or use a separate “group delivery” service. For very large groups (> 10K), use server-side fan-out via a precomputed member list stored in Redis.
- Hot conversations: a very active group chat (10K messages/minute) can overwhelm one Cassandra partition. Shard by (conversation_id, time_bucket) to spread load across multiple partitions.
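The consistent-hashing bullet above can be made concrete with a minimal hash ring. This is a sketch (ConsistentHashRing and its vnodes parameter are illustrative); production rings use many virtual nodes per server so that adding or removing a chat server only remaps a small fraction of users:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, servers, vnodes: int = 100):
        # Each server appears vnodes times on the ring to smooth the distribution
        self.ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_server(self, user_id: str) -> str:
        """First server clockwise from the user's position on the ring."""
        h = self._hash(user_id)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["cs-1", "cs-2", "cs-3"])
server = ring.get_server("user_42")  # stable: same user always routes to the same server
```

Because the mapping is deterministic, any node (load balancer, Pub/Sub router) can compute where a user's WebSocket lives without a lookup, and reconnecting users land on the same server, keeping its per-user subscriptions warm.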
Asked at: Meta, Snap, LinkedIn, Twitter/X, Atlassian.