System Design Interview: Multi-Region Architecture and Global Replication

Why Multi-Region?

Single-region deployments have two problems. (1) Latency: a user in Tokyo making a request to a US-East server sees roughly 150 ms of round-trip time before any processing begins. (2) Availability: if the US-East region goes down, all users worldwide lose service. Multi-region architecture addresses both: serve users from the closest region (often under 20 ms RTT) and survive a complete region failure. The challenge is keeping data consistent across geographically distributed databases separated by tens to hundreds of milliseconds of network latency.

The CAP Theorem Applied

CAP theorem: in the presence of a network partition, a distributed system can provide either Consistency (all nodes see the same data) or Availability (all requests receive a response), but not both simultaneously. Multi-region systems face a related trade-off even when the network is healthy, because inter-region latency is high: a round trip between us-east and eu-west takes roughly 80 ms (PACELC extends CAP with exactly this latency-versus-consistency choice). Choosing consistency: writes block until remote regions acknowledge → every write pays 80 ms or more of additional latency. Choosing availability: serve reads locally with potentially stale data → users in different regions may see different states. Most global systems choose availability + eventual consistency for reads, with strong consistency only where it is business-critical (payment balances, inventory counts).

Active-Passive (Primary-Secondary) Replication

One region is the primary (handles all writes); other regions are replicas (handle reads, replicate from the primary). Write flow: write goes to primary → primary commits → asynchronously replicates to replicas → replicas apply the change (eventually consistent). A read from a replica may return stale data whenever replication lag is non-zero. Use cases: read-heavy systems where slight staleness is acceptable (product catalog, blog posts, user profiles). Failover: if the primary fails, promote a replica to primary. The promoted replica may be slightly behind, so writes that were acknowledged but not yet replicated can be lost (recovery point objective, RPO > 0). Synchronous replication (write ack only after replicas confirm) eliminates that data loss but adds cross-region latency to every write.
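
The write, stale-read, and failover behavior described above can be sketched with a toy in-memory model. This is a minimal illustration, not a real database: the `Node` and `AsyncReplicatedStore` classes, and the explicit `replicate()` call that stands in for the asynchronous replication stream, are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One region's database, modeled as an ordered log of (key, value) writes."""
    name: str
    log: list = field(default_factory=list)

    def read(self, key):
        # Return the latest value for key in this node's log, or None.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None


class AsyncReplicatedStore:
    """One primary plus replicas; replication only happens when replicate() runs."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = list(replicas)

    def write(self, key, value):
        # Acknowledge as soon as the primary commits (no cross-region wait).
        self.primary.log.append((key, value))

    def replicate(self, replica):
        # Ship every log entry the replica has not applied yet.
        replica.log.extend(self.primary.log[len(replica.log):])

    def lag(self, replica):
        # Replication lag measured in unapplied log entries.
        return len(self.primary.log) - len(replica.log)

    def failover(self, replica):
        # Promote the replica; entries it never received are lost (RPO > 0).
        lost = self.primary.log[len(replica.log):]
        self.replicas.remove(replica)
        self.primary = replica
        return lost


us, eu = Node("us-east"), Node("eu-west")
store = AsyncReplicatedStore(us, [eu])
store.write("price", 10)
store.replicate(eu)
store.write("price", 12)          # committed on the primary only
stale = eu.read("price")          # replica still serves the old value
lost = store.failover(eu)         # the unreplicated write is lost
print(stale, lost)
```

In this run the replica returns 10 while the primary holds 12, and promoting the replica loses the single unreplicated write. A synchronous variant would call `replicate()` inside `write()` before acknowledging, trading write latency for an RPO of zero.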

Active-Active (Multi-Primary) Replication

Multiple regions accept writes simultaneously. Writes from different regions are replicated to all other regions and merged. Challenge: conflict resolution when the same record is updated in two regions simultaneously. Conflict resolution strategies:

  • Last-write-wins (LWW): accept the write with the latest timestamp. Simple but can lose data — a write with a slightly earlier timestamp is silently discarded. Suitable for user profile updates where overwriting is acceptable.
  • Vector clocks: each write carries a version vector (one counter per region). Concurrent writes (neither vector dominates the other) are flagged as conflicts for application-level resolution. Described in Amazon’s original Dynamo paper and used by Riak; note that the DynamoDB service itself does not expose vector clocks.
  • CRDTs: design the data structure to support automatic conflict-free merging — counters, sets, registers with specific semantics.
  • Application-level: detect conflicts and present to the user (Dropbox conflict copies, Google Docs version history).
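
The first three strategies in the list above can be sketched in a few lines each. The function and class names (`lww_merge`, `compare_vectors`, `GCounter`) are illustrative assumptions rather than any particular database's API, and the grow-only counter is only one of the simplest CRDTs.

```python
def lww_merge(a, b):
    """Last-write-wins over (timestamp, region, value) tuples.

    Ties on timestamp are broken by region name so that every replica
    deterministically picks the same winner. The loser is discarded.
    """
    return max(a, b, key=lambda w: (w[0], w[1]))


def compare_vectors(v1, v2):
    """Compare two version vectors (dict: region -> counter).

    Returns 'equal', 'before' (v1 happened before v2), 'after',
    or 'concurrent' (neither dominates, so this is a conflict).
    """
    regions = set(v1) | set(v2)
    le = all(v1.get(r, 0) <= v2.get(r, 0) for r in regions)
    ge = all(v1.get(r, 0) >= v2.get(r, 0) for r in regions)
    if le and ge:
        return "equal"
    if le:
        return "before"
    if ge:
        return "after"
    return "concurrent"


class GCounter:
    """Grow-only counter CRDT: one slot per region, merge = element-wise max."""

    def __init__(self, region):
        self.region = region
        self.counts = {}

    def increment(self, n=1):
        # A region only ever increments its own slot.
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other):
        # Max is commutative, associative, and idempotent, so merges converge.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())


winner = lww_merge((100, "us-east", "Alice"), (101, "eu-west", "Alicia"))
relation = compare_vectors({"us": 2, "eu": 1}, {"us": 1, "eu": 2})

us_likes, eu_likes = GCounter("us"), GCounter("eu")
us_likes.increment(3)
eu_likes.increment(2)
us_likes.merge(eu_likes)
eu_likes.merge(us_likes)
print(winner, relation, us_likes.value(), eu_likes.value())
```

Here LWW keeps the `eu-west` write and silently drops the other, the two version vectors are reported as `concurrent`, and both counters converge to 5 regardless of merge order.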

Global Load Balancing

Route users to the nearest healthy region via: (1) GeoDNS: the DNS server returns different IP addresses based on the client’s apparent location, inferred by IP geolocation of the recursive resolver (or of the client subnet, when EDNS Client Subnet is supported). Client queries DNS → receives the IP of the nearest region → connects to that region. Trade-off: the DNS TTL must be short (for example, 60 seconds) to redirect traffic quickly during failover, but a short TTL increases DNS query volume. (2) Anycast: the same IP address is advertised from multiple regions, and BGP routing directs packets to the topologically nearest advertisement. Used by Cloudflare for its network (single IP, routed to the nearest PoP). (3) Application-layer redirect: the application itself detects the user’s location and redirects to the nearest region’s endpoint. This adds one redirect hop but is more flexible.
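
The GeoDNS decision (pick the nearest region that is currently healthy) can be sketched as follows. The region table, coordinates, and IP addresses are hypothetical placeholders (the IPs come from documentation-only ranges), and real GeoDNS services typically use latency measurements and geolocation databases rather than raw great-circle distance.

```python
import math

# Hypothetical regions: approximate (lat, lon) and placeholder IPs.
REGIONS = {
    "us-east": {"coords": (39.0, -77.5), "ip": "192.0.2.10"},
    "eu-west": {"coords": (53.3, -6.3), "ip": "198.51.100.10"},
    "ap-northeast": {"coords": (35.7, 139.7), "ip": "203.0.113.10"},
}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def resolve(client_coords, healthy):
    """Return (region, ip) for the nearest region that passes health checks."""
    candidates = [name for name in REGIONS if name in healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    nearest = min(
        candidates,
        key=lambda name: haversine_km(client_coords, REGIONS[name]["coords"]),
    )
    return nearest, REGIONS[nearest]["ip"]


tokyo = (35.68, 139.69)
paris = (48.86, 2.35)
all_up = set(REGIONS)
print(resolve(tokyo, all_up))
print(resolve(paris, all_up - {"eu-west"}))  # failover away from eu-west
```

The Tokyo client resolves to `ap-northeast`; the Paris client, with `eu-west` marked unhealthy, falls back to `us-east`. How quickly real clients follow such a change is bounded by the record's TTL.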

Database Global Patterns

Google Spanner: globally distributed SQL database with external consistency (strict serializability, which is stronger than plain serializable isolation). Uses TrueTime (GPS + atomic clocks) to assign commit timestamps that respect real-time order, with bounded clock uncertainty. Achieves strong consistency across regions at the cost of write latency: each commit must wait out the TrueTime uncertainty interval, which the original Spanner paper reports as generally under 7 ms, in addition to cross-region consensus. Used inside Google by F1, the database behind Google Ads. Available as Cloud Spanner on GCP.
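
The waiting step, called commit wait in the Spanner paper, can be illustrated with a simulated clock. This is a sketch under stated assumptions: `SimulatedTrueTime` is a made-up stand-in whose `now()` returns an interval of plus or minus `epsilon_ms` around a simulated true time, and time advances only when the code calls `advance()`.

```python
from dataclasses import dataclass


@dataclass
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Stand-in for TrueTime: the true time lies within [earliest, latest]."""

    def __init__(self, epsilon_ms):
        self.epsilon_ms = epsilon_ms
        self.true_now_ms = 0.0  # simulated absolute time

    def now(self):
        return TTInterval(self.true_now_ms - self.epsilon_ms,
                          self.true_now_ms + self.epsilon_ms)

    def after(self, t):
        # True only once t has definitely passed, whatever the clock error.
        return self.now().earliest > t

    def advance(self, ms):
        self.true_now_ms += ms


def commit(tt, step_ms=0.5):
    """Choose commit timestamp s = now().latest, then wait until after(s).

    Returns (s, waited_ms). Only after the wait may the write become visible,
    so any transaction that starts later is guaranteed a larger timestamp.
    """
    s = tt.now().latest
    waited = 0.0
    while not tt.after(s):
        tt.advance(step_ms)
        waited += step_ms
    return s, waited


tt = SimulatedTrueTime(epsilon_ms=3.5)
s, waited = commit(tt)
print(s, waited)
```

With an uncertainty of 3.5 ms the commit timestamp is 3.5 and the simulated wait is 7.5 ms, that is, about twice the uncertainty plus one polling step. This is why tighter clock bounds translate directly into lower write latency.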

CockroachDB: distributed SQL database with serializable isolation across regions. Uses Raft consensus per range (a contiguous chunk of the keyspace; 512 MiB by default in recent versions, 64 MiB in older releases). Write latency in multi-region mode: a Raft quorum requires acknowledgment from a majority of replicas (2 of 3 regions), so when the range's leader is in the local region, latency ≈ RTT to the nearest non-local region. Configurable home regions per table or row: data locality reduces latency for region-specific data.
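
The quorum latency rule generalizes beyond three regions: the leader needs acknowledgments from just enough followers to form a majority, so commit latency is set by the slowest follower within the fastest such group. The sketch below is plain arithmetic, not CockroachDB code, and the RTT figures are hypothetical.

```python
def quorum_commit_latency_ms(follower_rtts_ms):
    """Approximate Raft commit latency given leader-to-follower RTTs in ms.

    The leader counts as one vote, so it needs (majority - 1) follower acks.
    Acks arrive in RTT order, so the latency is the RTT of the last ack needed.
    """
    replicas = len(follower_rtts_ms) + 1
    majority = replicas // 2 + 1
    acks_needed = majority - 1
    if acks_needed == 0:
        return 0.0  # single replica: the leader alone is a majority
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Leader in us-east; hypothetical RTTs to followers in us-west and eu-west.
three_regions = quorum_commit_latency_ms([60, 80])
# Five replicas: majority is 3, so two follower acks are needed.
five_regions = quorum_commit_latency_ms([60, 70, 80, 200])
print(three_regions, five_regions)
```

With three replicas the write commits after the 60 ms follower responds; with five it commits after the second-fastest follower at 70 ms. The distant 200 ms replica never sits on the critical path.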

DynamoDB Global Tables: fully managed multi-active replication across the AWS regions you select. Last-writer-wins conflict resolution. Replication between regions typically completes within about a second. In the default configuration, reads are eventually consistent across regions; a strongly consistent read only reflects writes made in the same region.

Data Residency and Compliance

GDPR does not strictly mandate that European users’ personal data stay in the EU, but it restricts transfers outside the EEA unless a legal basis applies (an adequacy decision, standard contractual clauses, or, in limited cases, explicit consent). In practice many teams treat this as a residency requirement: data for EU users lives in EU regions and is not replicated to US regions by default. Multi-region architecture must respect data residency: route EU users to EU regions, store PII in EU-only databases, and apply geo-fencing to replication (EU data replicated only within the EU). Operational challenge: supporting queries that join EU and US data (analytics, fraud detection) without violating residency; a common approach is to anonymize or aggregate before cross-region transfer.
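
Geo-fenced replication and aggregate-before-export can be sketched as two small functions. The policy table, region names, and record fields are hypothetical, and whether a given aggregate is anonymous enough to leave the region is a legal judgment that this code does not make.

```python
# Hypothetical policy: residency tag -> regions allowed to hold that data.
RESIDENCY_POLICY = {
    "eu": {"eu-west-1", "eu-central-1"},
    "global": {"eu-west-1", "eu-central-1", "us-east-1"},
}


def replication_targets(record, source_region, all_regions):
    """Regions this record may be replicated to, excluding where it already is."""
    allowed = RESIDENCY_POLICY[record["residency"]]
    return sorted(r for r in all_regions if r in allowed and r != source_region)


def aggregate_for_export(records):
    """Reduce records to per-country counts so no row-level PII is exported."""
    counts = {}
    for rec in records:
        counts[rec["country"]] = counts.get(rec["country"], 0) + 1
    return counts


ALL_REGIONS = ["eu-west-1", "eu-central-1", "us-east-1"]
eu_user = {"user_id": 1, "email": "a@example.com", "country": "DE", "residency": "eu"}
catalog_item = {"sku": "X1", "country": "US", "residency": "global"}

eu_targets = replication_targets(eu_user, "eu-west-1", ALL_REGIONS)
global_targets = replication_targets(catalog_item, "us-east-1", ALL_REGIONS)
export = aggregate_for_export([eu_user, {"user_id": 2, "country": "DE"},
                               {"user_id": 3, "country": "FR"}])
print(eu_targets, global_targets, export)
```

The EU user's record is replicated only to `eu-central-1`, the non-personal catalog item fans out to both EU regions, and the export contains only the counts `{"DE": 2, "FR": 1}`.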

Interview Checklist

  • Global low-latency or region-failure tolerance requirement → multi-region; otherwise a single region is simpler to operate
  • Read-heavy + tolerate staleness → active-passive (primary-secondary)
  • High availability for writes globally → active-active (multi-primary)
  • Conflict resolution: LWW (simple, may drop writes), vector clocks (detect concurrent writes), CRDTs (merge automatically)
  • Global routing: GeoDNS (flexibility) or Anycast (low overhead)
  • Strong consistency globally: Spanner / CockroachDB (at cost of write latency)
  • Compliance: data residency constraints drive region assignments
