System Design Interview: Email Service at Scale (SendGrid/Gmail)

Designing a transactional email service (like SendGrid or AWS SES) or an email client (like Gmail) draws on core distributed systems topics: message queuing, deliverability, inbox storage, and search. Both variants come up in senior engineering interviews.

Variant A: Transactional Email Sending Service (SendGrid-like)

Architecture

Application → Email API → Queue → SMTP Sender Pool → Internet MTAs

Components:
  Email API:    REST endpoint, validates, enqueues
  Queue:        Kafka topics partitioned by priority
  Sender Pool:  Workers that connect to recipient MTAs via SMTP
  Bounce Handler: Processes delivery failures
  Analytics:   Open tracking, click tracking

Queue Design by Priority

Kafka topics:
  email.transactional   → password resets, purchase receipts (< 5s)
  email.marketing       → newsletters, promotions (< 1hr acceptable)
  email.bulk            → cold outreach, low priority (hours)

Partitioning: hash(sender_domain) → consistent sending from same IP range
Consumer groups: dedicated pool per topic tier
  Transactional: 100 workers, auto-scale
  Marketing:     50 workers, batch
  Bulk:          20 workers, throttled to avoid blacklisting
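
The topic tiers and partitioning rule above can be sketched as follows. This is a minimal illustration rather than a Kafka client: `route`, `TOPIC_BY_CATEGORY`, and `NUM_PARTITIONS` are hypothetical names, and a real producer would usually pass the sender domain as the message key and let the client library choose the partition.

```python
import hashlib

# Hypothetical mapping from message category to the topics listed above.
TOPIC_BY_CATEGORY = {
    "transactional": "email.transactional",
    "marketing": "email.marketing",
    "bulk": "email.bulk",
}
NUM_PARTITIONS = 12  # assumed partition count per topic


def route(category: str, sender: str) -> tuple[str, int]:
    """Return (topic, partition) for one outgoing message."""
    topic = TOPIC_BY_CATEGORY[category]
    domain = sender.rsplit("@", 1)[-1].lower()
    # Use a stable hash: Python's built-in hash() is salted per process.
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    partition = int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
    return topic, partition
```

Because the partition depends only on the sender domain, all mail for one domain lands on the same partition and is handled by the same consumer, which is what keeps its traffic on a consistent set of sending IPs.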

Deliverability: The Hard Part

Email deliverability requirements:
  SPF:   TXT record listing authorized sending IPs
    "v=spf1 ip4:203.0.113.0/24 include:sendgrid.net ~all"

  DKIM:  Cryptographic signature over selected headers and the message body
    Private key signs the message → recipient verifies via the public key published in DNS
    Helps prevent spoofing of "From: security@yourbank.com" when combined with DMARC alignment

  DMARC: Policy for SPF/DKIM failures
    "v=DMARC1; p=quarantine; rua=mailto:dmarc@example.com"
    p=none: monitor only; p=quarantine: spam folder; p=reject: block

  IP reputation:
    New sending IPs → warm up gradually (100 → 1K → 10K → 100K/day)
    Monitor bounce rate (< 2%), spam complaint rate (< 0.1%)
    Dedicated IPs for transactional (protect from marketing spam complaints)

  Suppression list:
    Maintain list of unsubscribed / hard-bounced / spam-complained emails
    Never send to suppression list — automatic filtering before queue entry
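
A minimal sketch of two of the checks above: filtering recipients against the suppression list before anything is enqueued, and testing a sender's bounce and complaint rates against the thresholds in the text. `filter_suppressed` and `SenderStats` are hypothetical names, and an in-memory set stands in for whatever store actually holds the suppression list.

```python
from dataclasses import dataclass

MAX_BOUNCE_RATE = 0.02      # 2% threshold from the text
MAX_COMPLAINT_RATE = 0.001  # 0.1% threshold from the text


def filter_suppressed(recipients, suppression_list):
    """Split recipients into (deliverable, dropped) before enqueueing."""
    suppressed = {addr.strip().lower() for addr in suppression_list}
    deliverable, dropped = [], []
    for addr in recipients:
        target = dropped if addr.strip().lower() in suppressed else deliverable
        target.append(addr)
    return deliverable, dropped


@dataclass
class SenderStats:
    sent: int
    bounced: int
    complaints: int

    def is_healthy(self) -> bool:
        """True while both rates stay under the thresholds."""
        if self.sent == 0:
            return True  # nothing sent yet, nothing to judge
        return (self.bounced / self.sent < MAX_BOUNCE_RATE
                and self.complaints / self.sent < MAX_COMPLAINT_RATE)
```

A sender that stops being healthy is a natural trigger for pausing its campaigns before mailbox providers start throttling the IP.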

Bounce Handling

Soft bounce: temporary failure (mailbox full, server down)
  → Retry with exponential backoff: 5min, 30min, 2hr, 6hr, 24hr
  → Discard after 72 hours without success

Hard bounce: permanent failure (user doesn't exist, domain invalid)
  → Add to suppression list immediately
  → Never retry: continuing to send to hard-bounced addresses leads to blacklisting

Feedback loops (FBL):
  Yahoo and Outlook.com report individual complaints via FBL; Gmail exposes only aggregate spam rates (Postmaster Tools)
  → Remove from list, add to suppression
  → Track complaint rate by campaign/sender
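
The bounce rules above can be sketched as a small decision function. It assumes the simplest possible classification, SMTP 4xx replies as soft bounces and 5xx as hard bounces; real systems also inspect enhanced status codes and provider-specific text, since some 5xx replies are policy blocks rather than dead addresses. `classify` and `next_action` are hypothetical names.

```python
from datetime import timedelta

# Retry delays and give-up window taken from the text above.
RETRY_DELAYS = [
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=6),
    timedelta(hours=24),
]
GIVE_UP_AFTER = timedelta(hours=72)


def classify(smtp_code: int) -> str:
    """SMTP 4xx replies are transient failures, 5xx are permanent."""
    if 400 <= smtp_code < 500:
        return "soft"
    if 500 <= smtp_code < 600:
        return "hard"
    raise ValueError(f"not a failure code: {smtp_code}")


def next_action(smtp_code: int, attempt: int, age: timedelta):
    """Decide what to do after a failed delivery.

    attempt is the number of deliveries tried so far (1 after the first
    failure); age is the time since the message was first queued.
    """
    if classify(smtp_code) == "hard":
        return ("suppress", None)      # add to suppression list, never retry
    if attempt > len(RETRY_DELAYS):
        return ("discard", None)       # retry schedule exhausted
    delay = RETRY_DELAYS[attempt - 1]
    if age + delay > GIVE_UP_AFTER:
        return ("discard", None)       # next try would exceed 72 hours
    return ("retry", delay)
```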

Open/Click Tracking

Open tracking: embed 1x1 pixel image
  <img src="https://track.sendgrid.com/open/{encoded_email_id}" width="1" height="1">
  When client loads image → tracking server logs open event
  Limitation: Apple Mail Privacy Protection can preload images whether or not the message is read, inflating open counts

Click tracking: rewrite all links
  Original: https://example.com/product/123
  Rewritten: https://click.sendgrid.com/{encoded_link}?eid={email_id}
  On click: redirect to original, log click event

Events: {email_id, event_type, timestamp, user_agent, ip}
  → Kafka → Flink → ClickHouse (analytics) + Redis (real-time dashboard)
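
Link rewriting for click tracking might look like the sketch below. The tracking host `click.example.com`, the base64 encoding of the original URL, and the function names are all assumptions made for illustration; the text does not say how the encoded link is built.

```python
import base64
import re
from urllib.parse import urlencode

CLICK_HOST = "https://click.example.com"  # hypothetical tracking host


def encode_link(url: str) -> str:
    """URL-safe base64 of the original link, without padding."""
    return base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii").rstrip("=")


def decode_link(token: str) -> str:
    """Inverse of encode_link, used by the redirect endpoint."""
    padded = token + "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


def rewrite_links(html: str, email_id: str) -> str:
    """Point every double-quoted http(s) href at the click tracker."""
    query = urlencode({"eid": email_id})

    def repl(match: re.Match) -> str:
        token = encode_link(match.group(1))
        return f'href="{CLICK_HOST}/{token}?{query}"'

    return re.sub(r'href="(https?://[^"]+)"', repl, html)
```

A redirect endpoint that accepts arbitrary encoded URLs is an open redirect, so a production version would sign the token (for example with an HMAC) and reject links it did not issue. A regex is also a shortcut here; a real rewriter would use an HTML parser.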

Variant B: Email Client / Inbox (Gmail-like)

Storage Model

Core entities:
  Users:    {user_id, email, quota_bytes}
  Threads:  {thread_id, subject, participants[], created_at}
  Messages: {message_id, thread_id, from, to[], cc[], body, size_bytes}
  Labels:   {label_id, user_id, name} -- INBOX, SENT, SPAM, custom
  MessageLabels: {message_id, user_id, label_id}

Storage design:
  Message bodies: object storage (S3 / GCS)
    Key: messages/{user_id}/{message_id}
    Content-Type: message/rfc822

  Metadata: relational DB (PostgreSQL / Spanner for global)
    Hot path: threads + messages for current user (< 10K rows typical)

  Attachments: separate object storage with CDN
    Key: attachments/{attachment_id}/{filename}
    Quota: 15GB per user (Gmail) → track per-user bytes in DB
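
The key layout and quota tracking above can be sketched as follows. The `used_bytes` counter and the `try_reserve` method are assumptions: the text only says that per-user bytes are tracked in the database.

```python
from dataclasses import dataclass

DEFAULT_QUOTA_BYTES = 15 * 1024**3  # the 15GB figure from the text


def message_key(user_id: str, message_id: str) -> str:
    """Object-storage key for a raw RFC 822 message body."""
    return f"messages/{user_id}/{message_id}"


def attachment_key(attachment_id: str, filename: str) -> str:
    """Object-storage key for an attachment."""
    return f"attachments/{attachment_id}/{filename}"


@dataclass
class User:
    user_id: str
    email: str
    quota_bytes: int = DEFAULT_QUOTA_BYTES
    used_bytes: int = 0  # assumed counter kept next to the metadata row

    def try_reserve(self, size_bytes: int) -> bool:
        """Reserve space for an incoming message; refuse if over quota."""
        if self.used_bytes + size_bytes > self.quota_bytes:
            return False
        self.used_bytes += size_bytes
        return True
```

In a real deployment the reservation would be an atomic update in the metadata database rather than an in-memory field, so that concurrent deliveries cannot overshoot the quota.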

Inbox Loading: Performance

GET /inbox (most common operation — must be fast):
  Query: threads with INBOX label, ordered by last_message_time DESC, LIMIT 50

  Without optimization:
    SELECT t.*, m.* FROM threads t
    JOIN messages m ON m.thread_id = t.id AND m.id = (
      SELECT id FROM messages WHERE thread_id = t.id ORDER BY ts DESC LIMIT 1
    )
    JOIN thread_labels tl ON tl.thread_id = t.id AND tl.label_id = :inbox_label_id
    WHERE tl.user_id = :user_id
    ORDER BY m.ts DESC LIMIT 50
    → Slow: a correlated subquery per thread plus two joins

  Optimized with denormalization:
    threads table includes: snippet, last_message_ts, unread_count, participants_json
    → Single-table indexed read, no joins for inbox listing
    → Update denormalized fields on each new message (async worker)
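
The denormalized approach can be tried out with SQLite standing in for the production database. This is a sketch under assumptions: the `in_inbox` flag (replacing the label join), the index, and the function names are hypothetical, and the update runs inline here rather than in an async worker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE threads (
        thread_id       TEXT PRIMARY KEY,
        user_id         TEXT NOT NULL,
        in_inbox        INTEGER NOT NULL DEFAULT 1,
        snippet         TEXT,
        last_message_ts INTEGER,
        unread_count    INTEGER NOT NULL DEFAULT 0
    )
""")
# Index matching the inbox query: filter by user and flag, newest first.
conn.execute(
    "CREATE INDEX idx_inbox ON threads (user_id, in_inbox, last_message_ts DESC)"
)


def on_new_message(thread_id, user_id, body, ts):
    """Refresh the denormalized thread row when a message arrives."""
    snippet = body[:80]
    cur = conn.execute(
        "UPDATE threads SET snippet = ?, last_message_ts = ?, "
        "unread_count = unread_count + 1 WHERE thread_id = ?",
        (snippet, ts, thread_id),
    )
    if cur.rowcount == 0:  # first message: create the thread row
        conn.execute(
            "INSERT INTO threads "
            "(thread_id, user_id, snippet, last_message_ts, unread_count) "
            "VALUES (?, ?, ?, ?, 1)",
            (thread_id, user_id, snippet, ts),
        )


def load_inbox(user_id, limit=50):
    """Inbox listing read from the threads table alone, with no joins."""
    return conn.execute(
        "SELECT thread_id, snippet, unread_count FROM threads "
        "WHERE user_id = ? AND in_inbox = 1 "
        "ORDER BY last_message_ts DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
```

The trade-off is extra write work per message in exchange for a cheap read on the most frequent request.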

Search: Full-Text Search on Email

Gmail search: "from:alice subject:invoice after:2024-01-01"

Options:
  Option A: Elasticsearch
    Index: {message_id, user_id, from, subject, body, ts}
    Query: bool filter on user_id + full-text on body/subject
    Latency: 50-200ms
    Cost: significant (index = 2-3× raw storage size)

  Option B: Custom inverted index per user
    Build per-user inverted index: word → list of message IDs
    Store in user's namespace (Bigtable / Cassandra)
    Often cited as the approach taken at Google's scale; gives per-user isolation

  Option C: CloudSearch / Typesense (simpler)
    Managed search; less control but faster to implement
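
To make Option B concrete, here is a toy per-user inverted index. It is an in-memory stand-in only: the `UserIndex` class is hypothetical, and a real system would persist the postings in a store such as Bigtable or Cassandra and add tokenization, stemming, and ranking.

```python
import re
from collections import defaultdict


class UserIndex:
    """In-memory stand-in for one user's inverted index: word -> message IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, message_id: str, text: str) -> None:
        """Index every lowercase alphanumeric word in the text."""
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[word].add(message_id)

    def search(self, query: str) -> set:
        """Return IDs of messages containing every query word (AND semantics)."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return set()
        result = set(self.postings.get(words[0], set()))
        for word in words[1:]:
            result &= self.postings.get(word, set())
        return result


# One index per user keeps every query scoped to a single tenant.
indexes = defaultdict(UserIndex)
```

Because each user has a separate index, a query never touches another user's postings, which is the isolation property Option B is chosen for.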

Email-specific search optimizations:
  Sender search: "from:alice" → index FROM field separately for exact match
  Date range: partition index by year/month → prune partitions
  Attachment type: index MIME types for "has:attachment"
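
A Gmail-style query string has to be split into structured filters before it reaches the index. The sketch below parses the operators from the example query and builds an Elasticsearch bool query for Option A. The field names follow the index definition above; `parse_query` and `to_es_query` are hypothetical helpers, and quoted phrases are not handled.

```python
OPERATORS = {"from", "subject", "after", "before"}


def parse_query(query: str) -> dict:
    """Split a Gmail-style query into operator filters and free-text terms."""
    filters, terms = {}, []
    for token in query.split():
        key, sep, value = token.partition(":")
        if sep and value and key.lower() in OPERATORS:
            filters[key.lower()] = value
        else:
            terms.append(token)
    return {"filters": filters, "text": " ".join(terms)}


def to_es_query(user_id: str, parsed: dict) -> dict:
    """Build an Elasticsearch bool query scoped to a single user."""
    f = parsed["filters"]
    filter_clauses = [{"term": {"user_id": user_id}}]  # always scope to one user
    must = []
    if "from" in f:
        filter_clauses.append({"term": {"from": f["from"]}})  # exact match
    date_range = {}
    if "after" in f:
        date_range["gte"] = f["after"]
    if "before" in f:
        date_range["lt"] = f["before"]
    if date_range:
        filter_clauses.append({"range": {"ts": date_range}})
    if "subject" in f:
        must.append({"match": {"subject": f["subject"]}})
    if parsed["text"]:
        must.append({"multi_match": {"query": parsed["text"],
                                     "fields": ["subject", "body"]}})
    return {"query": {"bool": {"filter": filter_clauses, "must": must}}}
```

Exact-match and date conditions go in `filter`, which skips relevance scoring, while free text goes in `must` so that matching messages are ranked.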

Push Notifications: New Email Delivery

SMTP inbound → Parse → Store message → Notify user

Notification channels:
  Web:    WebSocket or long-polling while the tab is open; Web Push API for background notifications
  Mobile: APNs (iOS) / FCM (Android) → push notification
  Desktop: OS notification API

Inbound SMTP flow:
  MTA (Postfix) receives email → LMTP delivery to inbox service
  Inbox service:
    1. Spam/virus filtering (SpamAssassin, VirusTotal API)
    2. Apply user filter rules (if subject contains "receipt" → label Bills)
    3. Store message body to S3
    4. Store metadata to DB
    5. Publish "new_message" event to Redis Pub/Sub
    6. Push service: Redis subscriber → APNs/FCM notification
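
The inbox-service steps can be sketched end to end with in-memory stand-ins. Everything here is illustrative: dicts replace S3 and the metadata DB, `publish` is any callable standing in for the Redis publish call, the spam check is a placeholder, and skipping the notification for spam is an assumption the text does not state.

```python
from dataclasses import dataclass, field


@dataclass
class InboundMessage:
    message_id: str
    user_id: str
    subject: str
    raw: bytes
    labels: list = field(default_factory=lambda: ["INBOX"])


def looks_like_spam(msg: InboundMessage) -> bool:
    # Placeholder for a real filter such as SpamAssassin.
    return b"X-Spam-Flag: YES" in msg.raw


def apply_user_rules(msg: InboundMessage) -> None:
    # Example rule from the text: subject contains "receipt" -> label Bills.
    if "receipt" in msg.subject.lower():
        msg.labels.append("Bills")


def deliver(msg: InboundMessage, blob_store: dict, metadata_db: dict, publish) -> None:
    """Run steps 1-5; the push service (step 6) consumes what publish emits."""
    if looks_like_spam(msg):                      # 1. spam filtering
        msg.labels = ["SPAM"]
    else:
        apply_user_rules(msg)                     # 2. user filter rules
    key = f"messages/{msg.user_id}/{msg.message_id}"
    blob_store[key] = msg.raw                     # 3. body to object storage
    metadata_db[msg.message_id] = {               # 4. metadata to DB
        "user_id": msg.user_id,
        "subject": msg.subject,
        "labels": list(msg.labels),
        "body_key": key,
    }
    if "SPAM" not in msg.labels:                  # 5. notify (assumed: not for spam)
        publish("new_message",
                {"user_id": msg.user_id, "message_id": msg.message_id})
```

Storing the body before the metadata means a crash between steps 3 and 4 leaves an orphaned blob rather than a metadata row that points at nothing.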

Interview Discussion Points

  • Why use object storage for email bodies? Email bodies are immutable blobs of variable size (1KB to 25MB with attachments). Object storage is cheap (roughly $0.023/GB per month vs $0.10+/GB for database storage), scales with essentially no capacity planning, and has built-in durability. Metadata (from, subject, labels, read status) is mutable and frequently queried, so it belongs in a relational DB.
  • How can a Gmail-like service achieve sub-second search over 15 years of email? With a per-user inverted index (for example in Bigtable), partitioned by time range. Every query is scoped to one user, with no cross-user queries, so it is effectively a single-tenant search problem. The index is updated asynchronously on message receipt.
  • How do you sustain 99.9% email deliverability? Warm up new sending IPs gradually, keep transactional and marketing traffic on separate dedicated IPs, monitor bounce and complaint rates continuously (automatically pausing campaigns above a 0.1% complaint rate), and keep SPF, DKIM, and DMARC correctly configured.
