Designing a transactional email service (like SendGrid or AWS SES) or an email client (like Gmail) involves deep distributed systems knowledge: message queuing, deliverability, inbox storage, and search. Both variants appear in senior engineering interviews.
Variant A: Transactional Email Sending Service (SendGrid-like)
Architecture
Application → Email API → Queue → SMTP Sender Pool → Internet MTAs
Components:
Email API: REST endpoint, validates, enqueues
Queue: Kafka, one topic per priority tier
Sender Pool: Workers that connect to recipient MTAs via SMTP
Bounce Handler: Processes delivery failures
Analytics: Open tracking, click tracking
Queue Design by Priority
Kafka topics:
email.transactional → password resets, purchase receipts (< 5s)
email.marketing → newsletters, promotions (< 1hr acceptable)
email.bulk → cold outreach, low priority (hours)
Partitioning: hash(sender_domain) → consistent sending from same IP range
Consumer groups: dedicated pool per topic tier
Transactional: 100 workers, auto-scale
Marketing: 50 workers, batch
Bulk: 20 workers, throttled to avoid blacklisting
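The routing described above can be sketched as follows. This is a minimal illustration, not a production producer: the category names, the fallback to the bulk topic, and the helper names `choose_topic` and `choose_partition` are assumptions made for the example. A real Kafka producer given a message key applies its own key hashing, so the explicit partition function here only shows the idea.

```python
import hashlib

# Hypothetical mapping from message category to the topics listed above.
TOPIC_BY_CATEGORY = {
    "transactional": "email.transactional",
    "marketing": "email.marketing",
    "bulk": "email.bulk",
}


def choose_topic(category: str) -> str:
    """Return the topic for a category; unknown categories fall back to bulk (an assumption)."""
    return TOPIC_BY_CATEGORY.get(category, "email.bulk")


def choose_partition(sender_domain: str, num_partitions: int) -> int:
    """Map a sender domain to a stable partition.

    SHA-256 is used instead of Python's built-in hash(), which is salted per
    process for strings and would differ between workers.
    """
    digest = hashlib.sha256(sender_domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the partition depends only on the domain, all mail for one sender lands on the same partition and therefore on the same consumer within a group.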
Deliverability: The Hard Part
Email deliverability requirements:
SPF: TXT record listing authorized sending IPs
"v=spf1 ip4:203.0.113.0/24 include:sendgrid.net ~all"
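DKIM: Cryptographic signature over selected headers and a hash of the body
Sender signs with a private key → recipient verifies with the public key published in DNS
Helps prevent spoofing of "From: security@yourbank.com" (DMARC adds the check that the signing domain aligns with the From domain)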
DMARC: Policy for SPF/DKIM failures
"v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"
p=none: monitor only; p=quarantine: spam folder; p=reject: block
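To make the DMARC policy values concrete, here is a small sketch of how a receiver might read the record shown above and pick a disposition for a message that fails authentication. The function names and the disposition labels are hypothetical, and a real parser would follow the full DMARC grammar (percent sampling, subdomain policy, and so on).

```python
def parse_dmarc(record: str) -> dict[str, str]:
    """Parse a DMARC TXT record such as 'v=DMARC1; p=quarantine' into a tag dict."""
    tags: dict[str, str] = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed tag: {part!r}")
        tags[key.strip().lower()] = value.strip()
    if tags.get("v") != "DMARC1":
        raise ValueError("not a DMARC1 record")
    return tags


# Hypothetical disposition labels for the three p= values described above.
DISPOSITION = {"none": "deliver", "quarantine": "spam_folder", "reject": "reject"}


def disposition_on_failure(record: str) -> str:
    """Return what to do with a message that fails DMARC under this record."""
    # Treating a missing or unknown p= as "none" is an assumption of this sketch.
    policy = parse_dmarc(record).get("p", "none")
    return DISPOSITION.get(policy, "deliver")
```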
IP reputation:
New sending IPs → warm up gradually (100 → 1K → 10K → 100K/day)
Monitor bounce rate (< 2%), spam complaint rate (< 0.1%)
Dedicated IPs for transactional (protect from marketing spam complaints)
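The warm-up ramp and the health thresholds above can be expressed in a few lines. This is a sketch under assumptions: the tenfold growth factor simply reproduces the 100 → 1K → 10K → 100K example, and `should_pause` is a hypothetical guard that halts the ramp when either rate reaches its limit.

```python
def warmup_schedule(start: int = 100, target: int = 100_000, factor: int = 10) -> list[int]:
    """Per-stage daily send caps, growing geometrically from start up to target."""
    if start <= 0 or factor <= 1:
        raise ValueError("start must be positive and factor greater than 1")
    caps = []
    cap = start
    while cap < target:
        caps.append(cap)
        cap *= factor
    caps.append(target)  # final stage is capped at the target volume
    return caps


def should_pause(sent: int, bounced: int, complaints: int,
                 max_bounce_rate: float = 0.02,
                 max_complaint_rate: float = 0.001) -> bool:
    """True if the bounce or complaint rate has reached the thresholds quoted above."""
    if sent == 0:
        return False
    return (bounced / sent >= max_bounce_rate
            or complaints / sent >= max_complaint_rate)
```

In practice each stage would last several days and advance only while `should_pause` stays false.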
Suppression list:
Maintain list of unsubscribed / hard-bounced / spam-complained emails
Never send to suppression list — automatic filtering before queue entry
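A minimal sketch of filtering before queue entry, assuming an in-memory store. A real service would back this with a database or cache; the class and function names are illustrative only.

```python
from typing import Optional


def normalize(address: str) -> str:
    """Trim and lowercase an address so lookups ignore case (a simplification)."""
    return address.strip().lower()


class SuppressionList:
    """In-memory stand-in for the suppression store."""

    def __init__(self) -> None:
        self._reasons: dict[str, str] = {}

    def add(self, address: str, reason: str) -> None:
        # reason is e.g. "unsubscribed", "hard_bounce", or "complaint"
        self._reasons[normalize(address)] = reason

    def reason_for(self, address: str) -> Optional[str]:
        return self._reasons.get(normalize(address))


def filter_recipients(recipients: list[str],
                      suppressed: SuppressionList) -> tuple[list[str], list[str]]:
    """Split recipients into (allowed, dropped) before anything is enqueued."""
    allowed: list[str] = []
    dropped: list[str] = []
    for rcpt in recipients:
        if suppressed.reason_for(rcpt) is None:
            allowed.append(rcpt)
        else:
            dropped.append(rcpt)
    return allowed, dropped
```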
Bounce Handling
Soft bounce: temporary failure (mailbox full, server down)
→ Retry with exponential backoff: 5min, 30min, 2hr, 6hr, 24hr
→ Discard after 72 hours without success
Hard bounce: permanent failure (user doesn't exist, domain invalid)
→ Add to suppression list immediately
→ Never retry: continuing to send to hard-bounced addresses leads to blacklisting
Feedback loops (FBL):
Yahoo and Outlook.com send per-message complaint reports via FBL; Gmail exposes only aggregate complaint data (Postmaster Tools)
→ Remove from list, add to suppression
→ Track complaint rate by campaign/sender
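The soft/hard bounce policy above can be sketched as a classifier plus a retry scheduler. The SMTP reply-code rule (4xx transient, 5xx permanent) is standard; treating every 5xx as a hard bounce is a simplification, since some 5xx replies are policy rejections rather than invalid addresses. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Backoff schedule and maximum age taken from the text above.
RETRY_DELAYS = [
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=6),
    timedelta(hours=24),
]
MAX_AGE = timedelta(hours=72)


def classify(smtp_code: int) -> str:
    """Classify an SMTP reply code as 'delivered', 'soft', or 'hard'."""
    if 200 <= smtp_code < 300:
        return "delivered"
    if 400 <= smtp_code < 500:
        return "soft"
    if 500 <= smtp_code < 600:
        return "hard"
    raise ValueError(f"unexpected SMTP reply code: {smtp_code}")


def next_retry(first_attempt: datetime, failed_attempts: int,
               now: datetime) -> Optional[datetime]:
    """Return when to retry a soft bounce, or None if the message should be discarded."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    if failed_attempts > len(RETRY_DELAYS):
        return None  # schedule exhausted
    retry_at = now + RETRY_DELAYS[failed_attempts - 1]
    if retry_at - first_attempt > MAX_AGE:
        return None  # would exceed the 72-hour window
    return retry_at
```

A hard bounce or an FBL complaint skips this scheduler entirely and goes straight to the suppression list.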
Open/Click Tracking
Open tracking: embed 1x1 pixel image
<img src="https://track.sendgrid.com/open/{encoded_email_id}" width="1" height="1">
When client loads image → tracking server logs open event
Limitation: Apple Mail Privacy Protection (iOS 15+) can prefetch images through a proxy, so opens are over-counted
Click tracking: rewrite all links
Original: https://example.com/product/123
Rewritten: https://click.sendgrid.com/{encoded_link}?eid={email_id}
On click: redirect to original, log click event
Events: {email_id, event_type, timestamp, user_agent, ip}
→ Kafka → Flink → ClickHouse (analytics) + Redis (real-time dashboard)
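One way to build the tracking pixel and rewritten links is sketched below. The hosts, the secret, and the token format are assumptions for illustration. The HMAC signature is an addition not mentioned above: without it, the click endpoint would redirect to any URL an attacker encodes (an open redirect).

```python
import base64
import hashlib
import hmac
from urllib.parse import urlencode

SECRET = b"demo-secret"                     # hypothetical signing key
CLICK_HOST = "https://click.example.com"    # hypothetical tracking hosts
TRACK_HOST = "https://track.example.com"


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode("ascii"), hashlib.sha256).hexdigest()[:16]


def encode_token(value: str) -> str:
    """URL-safe base64 of value plus a truncated HMAC, joined by a dot."""
    payload = base64.urlsafe_b64encode(value.encode("utf-8")).decode("ascii").rstrip("=")
    return f"{payload}.{_sign(payload)}"


def decode_token(token: str) -> str:
    """Verify the signature and return the original value, or raise ValueError."""
    payload, _, sig = token.rpartition(".")
    if not hmac.compare_digest(sig, _sign(payload)):
        raise ValueError("bad signature")
    padded = payload + "=" * (-len(payload) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


def rewrite_link(url: str, email_id: str) -> str:
    """Return the tracked URL that redirects to url after logging a click."""
    return f"{CLICK_HOST}/{encode_token(url)}?{urlencode({'eid': email_id})}"


def open_pixel(email_id: str) -> str:
    """Return the 1x1 tracking pixel tag for an email."""
    return f'<img src="{TRACK_HOST}/open/{encode_token(email_id)}" width="1" height="1">'
```

The click handler would call `decode_token` on the path segment, log the event, and issue an HTTP redirect to the decoded URL.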
Variant B: Email Client / Inbox (Gmail-like)
Storage Model
Core entities:
Users: {user_id, email, quota_bytes}
Threads: {thread_id, subject, participants[], created_at}
Messages: {message_id, thread_id, from, to[], cc[], body, size_bytes}
Labels: {label_id, user_id, name} -- INBOX, SENT, SPAM, custom
MessageLabels: {message_id, user_id, label_id}
Storage design:
Message bodies: object storage (S3 / GCS)
Key: messages/{user_id}/{message_id}
Content-Type: message/rfc822
Metadata: relational DB (PostgreSQL / Spanner for global)
Hot path: threads + messages for current user (< 10K rows typical)
Attachments: separate object storage with CDN
Key: attachments/{attachment_id}/{filename}
Quota: 15GB per user (Gmail) → track per-user bytes in DB
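As a small illustration of the key layout and quota tracking above, assuming 15 GB means 15 × 2^30 bytes and that quota is reserved before the body is written. The names are hypothetical, and in a real system the counter would be updated transactionally in the metadata DB.

```python
from dataclasses import dataclass

QUOTA_BYTES = 15 * 1024 ** 3  # assumption: 15 GB interpreted as GiB


def message_key(user_id: str, message_id: str) -> str:
    """Object-storage key for a raw RFC 822 message body."""
    return f"messages/{user_id}/{message_id}"


def attachment_key(attachment_id: str, filename: str) -> str:
    """Object-storage key for an attachment; path separators are stripped from the name."""
    safe_name = filename.replace("/", "_").replace("\\", "_")
    return f"attachments/{attachment_id}/{safe_name}"


@dataclass
class UserQuota:
    quota_bytes: int = QUOTA_BYTES
    used_bytes: int = 0

    def try_reserve(self, size_bytes: int) -> bool:
        """Reserve space for an incoming message; return False if it would exceed quota."""
        if size_bytes < 0:
            raise ValueError("size_bytes must be non-negative")
        if self.used_bytes + size_bytes > self.quota_bytes:
            return False
        self.used_bytes += size_bytes
        return True
```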
Inbox Loading: Performance
GET /inbox (most common operation — must be fast):
Query: threads with INBOX label, ordered by last_message_time DESC, LIMIT 50
Without optimization:
SELECT t.*, m.* FROM threads t
JOIN messages m ON m.thread_id = t.id AND m.id = (
SELECT id FROM messages WHERE thread_id = t.id ORDER BY ts DESC LIMIT 1
)
JOIN thread_labels tl ON tl.thread_id = t.id AND tl.label_id = 'INBOX'
WHERE tl.user_id = ?
ORDER BY t.last_message_ts DESC LIMIT 50
→ Slow: nested selects, many joins
Optimized with denormalization:
threads table includes: snippet, last_message_ts, unread_count, participants_json
→ Single-table indexed read (e.g. an index on user_id, last_message_ts DESC), no joins for inbox listing
→ Update denormalized fields on each new message (async worker)
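The denormalized design can be tried end to end with SQLite. This is a sketch under simplifying assumptions: an `in_inbox` flag stands in for the label join, messages are assumed to arrive in timestamp order, and the update runs inline rather than in an async worker. The table and function names are illustrative.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE threads (
            thread_id       INTEGER PRIMARY KEY,
            user_id         INTEGER NOT NULL,
            in_inbox        INTEGER NOT NULL DEFAULT 1,
            snippet         TEXT    NOT NULL DEFAULT '',
            last_message_ts INTEGER NOT NULL DEFAULT 0,
            unread_count    INTEGER NOT NULL DEFAULT 0
        )""")
    # Index that lets the inbox query read rows already in display order.
    conn.execute(
        "CREATE INDEX idx_inbox ON threads (user_id, in_inbox, last_message_ts DESC)")
    return conn


def on_new_message(conn: sqlite3.Connection, thread_id: int, user_id: int,
                   body: str, ts: int) -> None:
    """Refresh the denormalized fields of a thread when a message arrives."""
    conn.execute("""
        INSERT INTO threads (thread_id, user_id, snippet, last_message_ts, unread_count)
        VALUES (?, ?, ?, ?, 1)
        ON CONFLICT(thread_id) DO UPDATE SET
            snippet = excluded.snippet,
            last_message_ts = excluded.last_message_ts,
            unread_count = unread_count + 1
        """, (thread_id, user_id, body[:100], ts))


def load_inbox(conn: sqlite3.Connection, user_id: int, limit: int = 50) -> list[tuple]:
    """Inbox listing: one table, no joins."""
    return conn.execute("""
        SELECT thread_id, snippet, last_message_ts, unread_count
        FROM threads
        WHERE user_id = ? AND in_inbox = 1
        ORDER BY last_message_ts DESC
        LIMIT ?""", (user_id, limit)).fetchall()
```

The trade-off is write amplification: every incoming message touches the thread row, in exchange for a cheap read on the most frequent operation.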
Search: Full-Text Search on Email
Gmail search: "from:alice subject:invoice after:2024-01-01"
Options:
Option A: Elasticsearch
Index: {message_id, user_id, from, subject, body, ts}
Query: bool filter on user_id + full-text on body/subject
Latency: 50-200ms
Cost: significant (index plus replicas can reach 2-3× raw storage size)
Option B: Custom inverted index per user
Build per-user inverted index: word → list of message IDs
Store in user's namespace (Bigtable / Cassandra)
Reportedly close to Google's approach; chosen for scale + isolation
Option C: CloudSearch / Typesense (simpler)
Managed search; less control but faster to implement
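To illustrate Option B, here is a toy per-user inverted index held in memory. It is only a sketch of the data structure: the tokenizer is a plain alphanumeric split, queries are AND-only, and a real system would persist postings in a store such as those named above. The class name is hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class UserIndex:
    """Inverted index for one user's mail: word -> set of message IDs."""

    def __init__(self) -> None:
        self._postings: dict[str, set[int]] = defaultdict(set)

    def add(self, message_id: int, text: str) -> None:
        for token in TOKEN.findall(text.lower()):
            self._postings[token].add(message_id)

    def search(self, query: str) -> set[int]:
        """Return IDs of messages containing every query term."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return set()
        # .get avoids creating empty postings for unknown terms.
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return result


# One index per user keeps every query scoped to a single tenant.
indexes: dict[str, UserIndex] = defaultdict(UserIndex)
```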
Email-specific search optimizations:
Sender search: "from:alice" → index FROM field separately for exact match
Date range: partition index by year/month → prune partitions
Attachments: index a has_attachment flag and MIME types for "has:attachment" and file-type filters
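Before any index is consulted, the query string has to be split into structured filters and free-text terms. A minimal parser for the operators mentioned in this section might look like the following; it is a sketch that ignores quoted phrases, negation, and the many other operators a real client supports, and the `ParsedQuery` type is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ParsedQuery:
    sender: Optional[str] = None
    subject: Optional[str] = None
    after: Optional[date] = None
    has_attachment: bool = False
    terms: list[str] = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    """Split a query like 'from:alice subject:invoice after:2024-01-01' into filters."""
    parsed = ParsedQuery()
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and key == "from":
            parsed.sender = value
        elif sep and key == "subject":
            parsed.subject = value
        elif sep and key == "after":
            parsed.after = date.fromisoformat(value)  # expects YYYY-MM-DD
        elif token == "has:attachment":
            parsed.has_attachment = True
        else:
            parsed.terms.append(token)  # free-text term for the body index
    return parsed
```

The structured fields map onto exact-match and range filters, while `terms` go to the full-text index.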
Push Notifications: New Email Delivery
SMTP inbound → Parse → Store message → Notify user
Notification channels:
Web: WebSocket or long-polling while the page is open (Gmail historically used long-polling); Web Push API for background notifications
Mobile: APNs (iOS) / FCM (Android) → push notification
Desktop: OS notification API
Inbound SMTP flow:
MTA (Postfix) receives email → LMTP delivery to inbox service
Inbox service:
1. Spam/virus filtering (SpamAssassin, VirusTotal API)
2. Apply user filter rules (if subject contains "receipt" → label Bills)
3. Store message body to S3
4. Store metadata to DB
5. Publish "new_message" event to Redis Pub/Sub (at-most-once: events are dropped if no subscriber is connected)
6. Push service: Redis subscriber → APNs/FCM notification
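The six steps above can be sketched as a single delivery function with in-memory stand-ins for S3, the metadata DB, and the pub/sub channel. Everything here is illustrative: the class names, the hard-coded "receipt" rule, and the choice to store spam but not notify for it are assumptions of the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Inbound:
    message_id: str
    user_id: str
    subject: str
    raw: bytes
    labels: list[str] = field(default_factory=lambda: ["INBOX"])


@dataclass
class InboxService:
    is_spam: Callable[[Inbound], bool]                        # step 1 plug-in
    blobs: dict[str, bytes] = field(default_factory=dict)     # stand-in for S3
    metadata: dict[str, dict] = field(default_factory=dict)   # stand-in for the DB
    events: list[dict] = field(default_factory=list)          # stand-in for pub/sub

    def deliver(self, msg: Inbound) -> None:
        # 1. Spam filtering.
        if self.is_spam(msg):
            msg.labels = ["SPAM"]
        # 2. User filter rules (one hard-coded example rule).
        elif "receipt" in msg.subject.lower():
            msg.labels.append("Bills")
        # 3. Store the body first, so metadata never points at a missing blob.
        key = f"messages/{msg.user_id}/{msg.message_id}"
        self.blobs[key] = msg.raw
        # 4. Store metadata.
        self.metadata[msg.message_id] = {
            "user_id": msg.user_id,
            "subject": msg.subject,
            "labels": list(msg.labels),
            "body_key": key,
            "size_bytes": len(msg.raw),
        }
        # 5. Publish an event; step 6 (the push service) would consume it.
        if "SPAM" not in msg.labels:
            self.events.append({"type": "new_message",
                                "user_id": msg.user_id,
                                "message_id": msg.message_id})
```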
Interview Discussion Points
- Why use object storage for email bodies? Email bodies are immutable blobs of variable size (1KB to 25MB with attachments). Object storage is cheap (roughly $0.023/GB-month vs $0.10+/GB-month for database storage), scales with no practical capacity limit, and has built-in durability. Metadata (from, subject, labels, read status) is mutable and queried, so it belongs in a relational DB.
- How does Gmail achieve sub-second search on 15 years of email? The commonly described design is a per-user inverted index stored in Bigtable, partitioned by time range. Queries are scoped to one user (no cross-user queries), making it a single-tenant search problem. The index is updated asynchronously on message receipt.
- How do you reach 99.9% email deliverability? Warm up sending IPs gradually, keep transactional and marketing traffic on separate dedicated IPs, monitor bounce and complaint rates closely (automatically pause campaigns above a 0.1% complaint rate), and maintain SPF/DKIM/DMARC.