Designing a transactional email service (like SendGrid or AWS SES) or an email client (like Gmail) involves deep distributed systems knowledge: message queuing, deliverability, inbox storage, and search. Both variants appear in senior engineering interviews.
Variant A: Transactional Email Sending Service (SendGrid-like)
Architecture
Application → Email API → Queue → SMTP Sender Pool → Internet MTAs
Components:
Email API: REST endpoint, validates, enqueues
Queue: Kafka topics, one per priority tier
Sender Pool: Workers that connect to recipient MTAs via SMTP
Bounce Handler: Processes delivery failures
Analytics: Open tracking, click tracking
Queue Design by Priority
Kafka topics:
email.transactional → password resets, purchase receipts (< 5s)
email.marketing → newsletters, promotions (< 1hr acceptable)
email.bulk → cold outreach, low priority (hours)
Partitioning: hash(sender_domain) → consistent sending from same IP range
Consumer groups: dedicated pool per topic tier
Transactional: 100 workers, auto-scale
Marketing: 50 workers, batch
Bulk: 20 workers, throttled to avoid blacklisting
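The routing above can be sketched in a few lines. This is a minimal illustration, not a production router: the category names and the helper names `topic_for` and `partition_for` are assumptions made for the example, while the topic names are the three tiers listed above.

```python
import hashlib

# Topic names come from the tiers above; the category names are illustrative.
TOPIC_BY_CATEGORY = {
    "password_reset": "email.transactional",
    "receipt": "email.transactional",
    "newsletter": "email.marketing",
    "promotion": "email.marketing",
    "cold_outreach": "email.bulk",
}


def topic_for(category: str) -> str:
    """Pick the Kafka topic for a message category; unknown categories go to bulk."""
    return TOPIC_BY_CATEGORY.get(category, "email.bulk")


def partition_for(sender_domain: str, num_partitions: int) -> int:
    """Stable partition derived from the sender domain.

    SHA-256 is used instead of Python's built-in hash(), which is salted per
    process and would map the same domain to different partitions on
    different workers.
    """
    digest = hashlib.sha256(sender_domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

In practice, passing the sender domain as the Kafka message key gives the same effect, because the default partitioner hashes the key to choose a partition.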
Deliverability: The Hard Part
Email deliverability requirements:
SPF: TXT record listing authorized sending IPs
"v=spf1 ip4:203.0.113.0/24 include:sendgrid.net ~all"
DKIM: Cryptographic signature over selected headers and a hash of the body
Private key signs the message → recipient verifies with the public key published in DNS
Prevents spoofing: "From: security@yourbank.com"
DMARC: Policy for SPF/DKIM failures
"v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"
p=none: monitor only; p=quarantine: spam folder; p=reject: block
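To make the record formats concrete, here is a rough sketch that parses the DMARC record and checks an IP against the `ip4:` entries of the SPF record shown above. It is deliberately incomplete: a real SPF evaluator also resolves `include:`, `a`, and `mx` mechanisms and applies the qualifier on `all`. The function names are made up for this example.

```python
import ipaddress


def parse_tag_record(record: str) -> dict[str, str]:
    """Parse a 'tag=value; tag=value' TXT record (the DMARC format)."""
    tags = {}
    for part in record.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            tags[key.strip().lower()] = value.strip()
    return tags


def dmarc_action(record: str) -> str:
    """Map the DMARC p= tag to the handling described above."""
    policy = parse_tag_record(record).get("p", "none")
    actions = {"none": "monitor", "quarantine": "spam_folder", "reject": "block"}
    return actions.get(policy, "monitor")


def spf_ip4_allows(record: str, sender_ip: str) -> bool:
    """Check only the ip4: mechanisms of an SPF record (simplification)."""
    ip = ipaddress.ip_address(sender_ip)
    for term in record.split():
        if term.startswith("ip4:"):
            if ip in ipaddress.ip_network(term[4:], strict=False):
                return True
    return False
```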
IP reputation:
New sending IPs → warm up gradually (100 → 1K → 10K → 100K/day)
Monitor bounce rate (< 2%), spam complaint rate (< 0.1%)
Dedicated IPs for transactional (protect from marketing spam complaints)
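A warm-up schedule and the two reputation thresholds can be expressed as a small policy module. The daily caps and thresholds are the ones listed above; how long to hold each step is not stated in the text, so `DAYS_PER_STEP = 7` is purely an assumption.

```python
# Daily caps follow the 100 -> 1K -> 10K -> 100K ramp from the text.
WARMUP_STEPS = [100, 1_000, 10_000, 100_000]
DAYS_PER_STEP = 7  # assumption: the text does not say how long each step lasts

MAX_BOUNCE_RATE = 0.02      # 2%
MAX_COMPLAINT_RATE = 0.001  # 0.1%


def daily_cap(days_since_first_send: int) -> int:
    """Maximum messages a warming IP may send on a given day."""
    step = min(days_since_first_send // DAYS_PER_STEP, len(WARMUP_STEPS) - 1)
    return WARMUP_STEPS[step]


def should_pause(sent: int, bounced: int, complaints: int) -> bool:
    """Pause sending when either rate crosses its threshold."""
    if sent == 0:
        return False
    return (bounced / sent > MAX_BOUNCE_RATE
            or complaints / sent > MAX_COMPLAINT_RATE)
```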
Suppression list:
Maintain a list of unsubscribed, hard-bounced, and spam-complained addresses
Never send to a suppressed address: filter automatically before queue entry
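The pre-queue filter might look like the following sketch. The in-memory dictionary stands in for whatever shared store a real service would use, and lowercasing the whole address is an assumption (most providers treat the local part as case-insensitive, though the RFCs do not require it).

```python
from dataclasses import dataclass, field


@dataclass
class SuppressionList:
    """In-memory stand-in; production would back this with a shared store."""
    _reasons: dict[str, str] = field(default_factory=dict)

    @staticmethod
    def _normalize(address: str) -> str:
        return address.strip().lower()

    def add(self, address: str, reason: str) -> None:
        """Record an address with its reason: unsubscribe, hard_bounce, complaint."""
        self._reasons[self._normalize(address)] = reason

    def is_suppressed(self, address: str) -> bool:
        return self._normalize(address) in self._reasons


def filter_recipients(
    recipients: list[str], suppression: SuppressionList
) -> tuple[list[str], list[str]]:
    """Split recipients into (deliverable, dropped) before anything is enqueued."""
    deliverable, dropped = [], []
    for address in recipients:
        if suppression.is_suppressed(address):
            dropped.append(address)
        else:
            deliverable.append(address)
    return deliverable, dropped
```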
Bounce Handling
Soft bounce: temporary failure (mailbox full, server down)
→ Retry with exponential backoff: 5min, 30min, 2hr, 6hr, 24hr
→ Discard after 72 hours without success
Hard bounce: permanent failure (user doesn't exist, domain invalid)
→ Add to suppression list immediately
→ Never retry: continuing to send to hard-bounced addresses leads to blacklisting
Feedback loops (FBL):
Yahoo and Outlook.com send individual complaints back via FBL; Gmail exposes only aggregate spam-rate data through Postmaster Tools
→ Remove from list, add to suppression
→ Track complaint rate by campaign/sender
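The soft/hard split and the retry schedule above can be sketched as a small decision function. Classifying purely on the SMTP reply class (4xx transient, 5xx permanent) is a simplification; real bounce processors also inspect enhanced status codes and provider-specific text. The function names are illustrative.

```python
from datetime import timedelta

# Retry delays and the 72-hour cutoff are the values given above.
RETRY_DELAYS = [
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=6),
    timedelta(hours=24),
]
MAX_AGE = timedelta(hours=72)


def classify_bounce(smtp_code: int) -> str:
    """4xx replies are transient (soft); 5xx replies are permanent (hard)."""
    if 400 <= smtp_code < 500:
        return "soft"
    if 500 <= smtp_code < 600:
        return "hard"
    raise ValueError(f"not a failure reply code: {smtp_code}")


def next_action(smtp_code: int, retries_done: int, age: timedelta):
    """Return ("retry", delay), ("discard", None), or ("suppress", None).

    retries_done counts retries already attempted (0 after the first failure);
    age is the time since the message was first queued.
    """
    if classify_bounce(smtp_code) == "hard":
        return ("suppress", None)
    if age >= MAX_AGE or retries_done >= len(RETRY_DELAYS):
        return ("discard", None)
    return ("retry", RETRY_DELAYS[retries_done])
```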
Open/Click Tracking
Open tracking: embed 1x1 pixel image
<img src="https://track.sendgrid.com/open/{encoded_email_id}" width="1" height="1">
When client loads image → tracking server logs open event
Limitation: Apple Mail Privacy Protection prefetches images through a proxy, so opens reported from Apple Mail are inflated and unreliable
Click tracking: rewrite all links
Original: https://example.com/product/123
Rewritten: https://click.sendgrid.com/{encoded_link}?eid={email_id}
On click: redirect to original, log click event
Events: {email_id, event_type, timestamp, user_agent, ip}
→ Kafka → Flink → ClickHouse (analytics) + Redis (real-time dashboard)
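Link rewriting can be sketched as below. Two things here go beyond the text and are assumptions: the tracking host `click.example.com` is hypothetical, and the HMAC signature is an added safeguard so the redirect endpoint cannot be abused as an open redirector.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"                   # assumption: loaded from a secret store
CLICK_HOST = "https://click.example.com"  # hypothetical tracking domain


def _sign(payload: str) -> str:
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()[:16]


def rewrite_link(original_url: str, email_id: str) -> str:
    """Encode the destination into a tracking URL carrying the email ID."""
    encoded = base64.urlsafe_b64encode(original_url.encode()).decode().rstrip("=")
    query = urlencode({"eid": email_id, "sig": _sign(f"{encoded}.{email_id}")})
    return f"{CLICK_HOST}/{encoded}?{query}"


def resolve_click(tracking_url: str) -> tuple[str, str]:
    """Return (original_url, email_id); raise ValueError on a bad signature."""
    parsed = urlparse(tracking_url)
    encoded = parsed.path.lstrip("/")
    params = parse_qs(parsed.query)
    email_id, sig = params["eid"][0], params["sig"][0]
    if not hmac.compare_digest(sig, _sign(f"{encoded}.{email_id}")):
        raise ValueError("invalid signature")
    padded = encoded + "=" * (-len(encoded) % 4)  # restore stripped padding
    return base64.urlsafe_b64decode(padded).decode(), email_id
```

The click handler would call `resolve_click`, emit the click event, and then issue an HTTP redirect to the returned URL.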
Variant B: Email Client / Inbox (Gmail-like)
Storage Model
Core entities:
Users: {user_id, email, quota_bytes}
Threads: {thread_id, subject, participants[], created_at}
Messages: {message_id, thread_id, from, to[], cc[], body, size_bytes}
Labels: {label_id, user_id, name} -- INBOX, SENT, SPAM, custom
MessageLabels: {message_id, user_id, label_id}
Storage design:
Message bodies: object storage (S3 / GCS)
Key: messages/{user_id}/{message_id}
Content-Type: message/rfc822
Metadata: relational DB (PostgreSQL / Spanner for global)
Hot path: threads + messages for current user (< 10K rows typical)
Attachments: separate object storage with CDN
Key: attachments/{attachment_id}/{filename}
Quota: 15GB per user (Gmail) → track per-user bytes in DB
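As a small illustration of the key layout and quota tracking, the sketch below builds the object keys described above and reserves quota before a write. The `used_bytes` field and the `reserve_quota` helper are assumptions for the example; the text only says that per-user bytes are tracked in the database.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class User:
    user_id: str
    email: str
    quota_bytes: int = 15 * GIB
    used_bytes: int = 0  # assumption: running total stored next to the quota


def message_key(user_id: str, message_id: str) -> str:
    """Object-storage key for a raw message body."""
    return f"messages/{user_id}/{message_id}"


def attachment_key(attachment_id: str, filename: str) -> str:
    """Object-storage key for an attachment."""
    return f"attachments/{attachment_id}/{filename}"


def reserve_quota(user: User, size_bytes: int) -> bool:
    """Reserve space for a new message; return False if it would exceed quota."""
    if user.used_bytes + size_bytes > user.quota_bytes:
        return False
    user.used_bytes += size_bytes
    return True
```

In a real database the check and the increment would need to happen in one transaction (or a conditional update) so that concurrent deliveries cannot overshoot the quota.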
Inbox Loading: Performance
GET /inbox (most common operation — must be fast):
Query: threads with INBOX label, ordered by last_message_time DESC, LIMIT 50
Without optimization:
SELECT t.*, m.*
FROM threads t
JOIN thread_labels tl ON tl.thread_id = t.id AND tl.label_id = :inbox_label_id
JOIN messages m ON m.thread_id = t.id AND m.id = (
  SELECT id FROM messages WHERE thread_id = t.id ORDER BY ts DESC LIMIT 1
)
WHERE tl.user_id = :user_id
ORDER BY t.last_message_ts DESC LIMIT 50
→ Slow: a correlated subquery runs for every thread, on top of two joins
Optimized with denormalization:
threads table includes: snippet, last_message_ts, unread_count, participants_json
→ Single indexed read on the threads table, no joins for inbox listing
→ Update denormalized fields on each new message (async worker)
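The denormalized update can be sketched as a function the async worker would run for each new message. The snippet length and the in-memory `ThreadRow` are assumptions for illustration; in a real system this would be an UPDATE on the threads table.

```python
from dataclasses import dataclass, field

SNIPPET_LENGTH = 100  # assumption: snippet size is not given in the text


@dataclass
class ThreadRow:
    thread_id: str
    snippet: str = ""
    last_message_ts: int = 0
    unread_count: int = 0
    participants: list[str] = field(default_factory=list)


def apply_new_message(thread: ThreadRow, sender: str, body: str,
                      ts: int, is_read: bool = False) -> None:
    """Update the denormalized fields for one new message."""
    # Only a newer message replaces the snippet, so late arrivals are safe.
    if ts >= thread.last_message_ts:
        thread.last_message_ts = ts
        thread.snippet = " ".join(body.split())[:SNIPPET_LENGTH]
    if not is_read:
        thread.unread_count += 1
    if sender not in thread.participants:
        thread.participants.append(sender)


def inbox_page(threads: list[ThreadRow], limit: int = 50) -> list[ThreadRow]:
    """Inbox listing: newest threads first, no per-message lookups needed."""
    return sorted(threads, key=lambda t: t.last_message_ts, reverse=True)[:limit]
```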
Search: Full-Text Search on Email
Gmail search: "from:alice subject:invoice after:2024-01-01"
Options:
Option A: Elasticsearch
Index: {message_id, user_id, from, subject, body, ts}
Query: bool filter on user_id + full-text on body/subject
Latency: 50-200ms
Cost: significant (index = 2-3× raw storage size)
Option B: Custom inverted index per user
Build per-user inverted index: word → list of message IDs
Store in user's namespace (Bigtable / Cassandra)
Commonly cited as Google's approach, chosen for scale + isolation
Option C: CloudSearch / Typesense (simpler)
Managed search; less control but faster to implement
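To show what Option B means in practice, here is a toy per-user inverted index. It keeps everything in memory and uses naive tokenization; the class names are invented for this sketch, and a real system would persist postings in the user's namespace of a store such as Bigtable or Cassandra.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class UserIndex:
    """Inverted index for one user: word -> set of message IDs."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, message_id: str, text: str) -> None:
        for token in tokenize(text):
            self._postings[token].add(message_id)

    def search(self, query: str) -> set[str]:
        """Return IDs of messages containing every query word (AND semantics)."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], ()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result


class SearchService:
    """Routes every operation to exactly one user's index."""

    def __init__(self) -> None:
        self._by_user: dict[str, UserIndex] = {}

    def index_message(self, user_id: str, message_id: str, text: str) -> None:
        self._by_user.setdefault(user_id, UserIndex()).add(message_id, text)

    def search(self, user_id: str, query: str) -> set[str]:
        index = self._by_user.get(user_id)
        return index.search(query) if index else set()
```

Because each query touches only one user's postings, the index size that matters is one mailbox, not the whole service.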
Email-specific search optimizations:
Sender search: "from:alice" → index FROM field separately for exact match
Date range: partition index by year/month → prune partitions
Attachment type: index MIME types for "has:attachment"
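A query like the Gmail example above first has to be split into field operators and free-text terms. The sketch below handles only the operators mentioned in this section and then lists the year-month partitions an `after:` filter still needs to read; the names `ParsedQuery`, `parse_query`, and `index_partitions` are illustrative, and quoted phrases are not supported.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ParsedQuery:
    sender: Optional[str] = None
    subject: Optional[str] = None
    after: Optional[date] = None
    has_attachment: bool = False
    terms: list[str] = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    """Split a search string into operators and free-text terms."""
    parsed = ParsedQuery()
    for token in query.split():
        key, sep, value = token.partition(":")
        if not sep or not value:
            parsed.terms.append(token)
        elif key == "from":
            parsed.sender = value
        elif key == "subject":
            parsed.subject = value
        elif key == "after":
            parsed.after = date.fromisoformat(value)  # ValueError on bad dates
        elif key == "has" and value == "attachment":
            parsed.has_attachment = True
        else:
            parsed.terms.append(token)  # unknown operator: treat as plain text
    return parsed


def index_partitions(after: date, today: date) -> list[str]:
    """Year-month partitions that can contain matches for after:<date>."""
    partitions = []
    year, month = after.year, after.month
    while (year, month) <= (today.year, today.month):
        partitions.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return partitions
```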
Push Notifications: New Email Delivery
SMTP inbound → Parse → Store message → Notify user
Notification channels:
Web: WebSocket or long-polling while the tab is open; Web Push API for background notifications
Mobile: APNs (iOS) / FCM (Android) → push notification
Desktop: OS notification API
Inbound SMTP flow:
MTA (Postfix) receives email → LMTP delivery to inbox service
Inbox service:
1. Spam/virus filtering (SpamAssassin, VirusTotal API)
2. Apply user filter rules (if subject contains "receipt" → label Bills)
3. Store message body to S3
4. Store metadata to DB
5. Publish "new_message" event to Redis Pub/Sub (fire-and-forget: events are lost if no subscriber is connected)
6. Push service: Redis subscriber → APNs/FCM notification
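Steps 2 through 5 of the inbox service can be sketched with the standard library's email parser. Everything external is replaced by plain Python containers here, which is an assumption for readability: `blobs` stands in for S3, `metadata` for the database, and `events` for the pub/sub channel. Spam and virus filtering (step 1) is omitted.

```python
from email import message_from_bytes, policy


def process_inbound(raw: bytes, user_id: str,
                    blobs: dict, metadata: list, events: list) -> dict:
    """Parse one delivered message, label it, store it, and emit an event."""
    msg = message_from_bytes(raw, policy=policy.default)
    message_id = str(msg["Message-ID"] or "").strip().strip("<>")
    subject = str(msg["Subject"] or "")

    # Step 2: user filter rule from the example above.
    labels = ["INBOX"]
    if "receipt" in subject.lower():
        labels.append("Bills")

    # Step 3: raw RFC 822 bytes go to object storage.
    key = f"messages/{user_id}/{message_id}"
    blobs[key] = raw

    # Step 4: queryable metadata goes to the database.
    row = {
        "message_id": message_id,
        "user_id": user_id,
        "from": str(msg["From"] or ""),
        "subject": subject,
        "labels": labels,
        "size_bytes": len(raw),
        "body_key": key,
    }
    metadata.append(row)

    # Step 5: notify downstream push workers.
    events.append({"type": "new_message", "user_id": user_id,
                   "message_id": message_id})
    return row
```

Storing the body before the metadata row means a crash between the two steps leaves an orphaned blob rather than a row pointing at a missing body.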
Interview Discussion Points
- Why use object storage for email bodies? Email bodies are immutable blobs of variable size (1KB to 25MB with attachments). Object storage is cheap (roughly $0.023/GB-month for S3 Standard vs $0.10+/GB-month for database storage), scales without practical limit, and has built-in durability. Metadata (from, subject, labels, read status) is mutable and queried, so it belongs in a relational DB.
- How does Gmail achieve sub-second search on 15 years of email? The commonly described design is a per-user inverted index stored in Bigtable, partitioned by time range. Queries are scoped to one user (no cross-user queries), making it a single-tenant search problem. The index is updated asynchronously on message receipt.
- How do you reach 99.9% deliverability? Warm up sending IPs gradually, keep transactional and marketing traffic on separate IPs, monitor bounce and complaint rates continuously (automatically pause campaigns whose complaint rate exceeds 0.1%), keep SPF/DKIM/DMARC correctly configured, and give high-reputation senders dedicated IPs.