Question 1

What are the trade-offs between schema-per-tenant and shared-schema multi-tenancy?

Accepted Answer

Schema-per-tenant: each tenant has its own set of tables in a separate schema (PostgreSQL) or a separate database. Pros: strong isolation (a query with a missing tenant filter still only sees the current tenant's schema), easy to add tenant-specific columns, straightforward GDPR deletion (drop the schema), simple to move a tenant to dedicated infrastructure. Cons: migrations must run against every tenant schema, so 1000 tenants means 1000 migration runs (slow, error-prone); cross-tenant analytics queries are hard; connection pooling is more complex (each request must be routed to the right schema, for example by setting search_path).

Shared-schema: all tenants share the same tables and every row carries a tenant_id. Pros: single migration run, easy cross-tenant analytics, simpler connection management. Cons: isolation is enforced at the application level (a missing WHERE tenant_id = ... clause leaks data); indexes should lead with tenant_id; one large tenant can degrade performance for others (noisy neighbor).

A common path is to start shared-schema and migrate selected enterprise customers to dedicated databases as needed.
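To make the shared-schema risk concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the real database. The orders table, its columns, and the tenant names are made up for illustration; the point is only that the scoped query filters on tenant_id while the unscoped one returns every tenant's rows.

```python
import sqlite3


def make_db():
    """Create an in-memory shared-schema table with rows for two tenants."""
    conn = sqlite3.connect(":memory:")
    # tenant_id leads the primary key, so the index is usable per tenant.
    conn.execute(
        "CREATE TABLE orders ("
        "tenant_id TEXT NOT NULL, id INTEGER NOT NULL, total REAL, "
        "PRIMARY KEY (tenant_id, id))"
    )
    conn.executemany(
        "INSERT INTO orders (tenant_id, id, total) VALUES (?, ?, ?)",
        [("acme", 1, 10.0), ("acme", 2, 25.0), ("globex", 1, 99.0)],
    )
    return conn


def orders_for_tenant(conn, tenant_id):
    """Correct access path: every query is filtered by tenant_id."""
    return conn.execute(
        "SELECT id, total FROM orders WHERE tenant_id = ? ORDER BY id",
        (tenant_id,),
    ).fetchall()


def orders_unscoped(conn):
    """The bug described above: no tenant filter, so all tenants' rows return."""
    return conn.execute(
        "SELECT tenant_id, id FROM orders ORDER BY tenant_id, id"
    ).fetchall()


conn = make_db()
scoped = orders_for_tenant(conn, "acme")
leaked = orders_unscoped(conn)
```

In practice, teams often route all data access through a helper like orders_for_tenant (or an ORM-level default filter) so the filter cannot be forgotten per query.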

Question 2

How does PostgreSQL Row-Level Security enforce tenant isolation?

Accepted Answer

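RLS attaches a security policy to a table so that the database itself filters rows, typically by comparing a column against a session setting. Example: after SET app.tenant_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'; any SELECT/UPDATE/DELETE on the orders table behaves as if a WHERE tenant_id = <that value> filter had been added. The policy is defined once: CREATE POLICY tenant_isolation ON orders USING (tenant_id = current_setting('app.tenant_id')::uuid). Enable it with ALTER TABLE orders ENABLE ROW LEVEL SECURITY. Every query from a role subject to the policy is then scoped to the current tenant, even if the application forgets the WHERE clause. When a policy has only a USING clause, the same expression is also applied as the check on new rows, so INSERTs for another tenant are rejected.

Benefits: defense in depth (isolation holds even with application bugs), auditable (policy lives in the database schema), works across all ORM queries transparently.

Considerations: superusers and roles with BYPASSRLS always bypass RLS, and the table owner bypasses it unless ALTER TABLE orders FORCE ROW LEVEL SECURITY is set, so the application should connect as an ordinary non-owner role. The setting must be established before any tenant query runs. With connection pooling, a session-level SET stays on the server connection and can leak to the next client; PgBouncer in transaction mode does not reset it. Set the value per transaction instead, with SET LOCAL or set_config('app.tenant_id', ..., true), which is discarded at commit or rollback.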
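The per-transaction setting can be sketched as a small Python helper. This is an illustration, not a definitive implementation: run_as_tenant and RecordingConnection are hypothetical names, the helper assumes a DB-API connection with autocommit off and the %s parameter style (as in psycopg2), and the recording stand-in exists only so the sketch runs without a database. set_config(name, value, true) is the PostgreSQL function that scopes a setting to the current transaction.

```python
import uuid

# Third argument `true` makes the setting transaction-local (like SET LOCAL).
SET_TENANT_SQL = "SELECT set_config('app.tenant_id', %s, true)"


def run_as_tenant(conn, tenant_id, sql, params=()):
    """Run one query inside a transaction scoped to tenant_id."""
    # Reject anything that is not a UUID before it reaches the database,
    # since the policy casts the setting to uuid.
    tenant = str(uuid.UUID(str(tenant_id)))
    cur = conn.cursor()
    try:
        cur.execute(SET_TENANT_SQL, (tenant,))
        cur.execute(sql, params)
        rows = cur.fetchall()
        conn.commit()  # the tenant setting is discarded here
        return rows
    except Exception:
        conn.rollback()  # and here
        raise
    finally:
        cur.close()


class RecordingConnection:
    """Hypothetical stand-in for a DB-API connection; records what was run."""

    def __init__(self):
        self.log = []

    def cursor(self):
        return self

    def execute(self, sql, params=()):
        self.log.append((sql, tuple(params)))

    def fetchall(self):
        return []

    def commit(self):
        self.log.append(("COMMIT", ()))

    def rollback(self):
        self.log.append(("ROLLBACK", ()))

    def close(self):
        pass


demo = RecordingConnection()
run_as_tenant(demo, "a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11", "SELECT * FROM orders")
```

Because the setting lives only for the transaction, the same pooled server connection can serve a different tenant on the next transaction without any explicit reset.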

Question 3

How do you implement per-tenant rate limiting without a global lock?

Accepted Answer

Use Redis with atomic increment operations. For each API request: INCR tenant:{tenant_id}:api:{minute} atomically increments the counter for the current minute and returns the new value. EXPIRE tenant:{tenant_id}:api:{minute} 120 keeps the key for 2 minutes (so the previous minute can still be read for a sliding-window approximation). Send both commands in one MULTI/EXEC block or Lua script so that a failure between them cannot leave a counter without a TTL. Compare the INCR result to the tenant's quota from TenantConfig. If exceeded, return HTTP 429 with Retry-After: {seconds_until_next_minute}. This is O(1) per request with no locks.

For a sliding window instead of a fixed window: use a Redis sorted set with ZADD (score=timestamp, member=request_id), ZREMRANGEBYSCORE to drop entries older than 60 seconds, and ZCOUNT to count requests in the last 60 seconds. This is more accurate, but memory grows with the number of requests per tenant.

For burst allowance: use a token bucket, for example a tokens-remaining key that is decremented on each request and refilled by a background job.
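The fixed-window check can be sketched in Python as follows. The function name check_rate_limit and the InMemoryCounterStore class are assumptions for illustration: the store is an in-memory stand-in that exposes only incr and expire (the same method names a redis-py client provides), so the sketch runs without a Redis server and does not actually evict keys.

```python
import time


class InMemoryCounterStore:
    """Stand-in for a Redis client: supports incr/expire only, no real expiry."""

    def __init__(self):
        self.counts = {}
        self.ttls = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key, seconds):
        self.ttls[key] = seconds
        return True


def check_rate_limit(store, tenant_id, quota, now=None):
    """Return (allowed, retry_after_seconds) for one request."""
    now = time.time() if now is None else now
    minute = int(now // 60)  # index of the current fixed window
    key = f"tenant:{tenant_id}:api:{minute}"
    count = store.incr(key)  # atomic in real Redis
    store.expire(key, 120)   # keep the previous window readable
    if count > quota:
        # Seconds until the next minute boundary, for the Retry-After header.
        return False, 60 - int(now % 60)
    return True, 0


store = InMemoryCounterStore()
# Three requests at t=125s (5 seconds into window 2) against a quota of 2.
results = [check_rate_limit(store, "t1", quota=2, now=125.0) for _ in range(3)]
```

With a real Redis client, the incr and expire calls would normally go through a pipeline or script so they are applied together.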

Question 4

How do you handle database migrations in a multi-tenant SaaS?

Accepted Answer

For shared-schema: run migrations once against the shared tables; all tenants see the change simultaneously. Use backwards-compatible migrations (add nullable columns, never drop columns immediately). Deployment sequence: deploy new code with feature flags disabled → run migration → enable feature flags. Because the migration is backwards-compatible, old code still works with the new schema at every step.

For schema-per-tenant: migrations must run against each tenant's schema. Use a migration runner that iterates over all active tenants and applies pending migrations. Run in parallel with a worker pool (10-20 concurrent migrations) to avoid taking hours for 1000+ tenants. Track migration state per tenant in a schema_migrations table in each tenant schema. Handle failures: if the migration fails for tenant X, log it and continue with the other tenants. Failed tenants remain on the old schema; retry them with a separate job. Never block new deployments on a partial migration failure.
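The schema-per-tenant runner described above might look roughly like this in Python. It is a sketch under assumptions: migrate_all_tenants is a hypothetical name, and apply_migration is a caller-supplied function that applies pending migrations to one tenant and raises on failure. The fake below simulates one failing tenant so the example runs without a database.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def migrate_all_tenants(tenants, apply_migration, max_workers=10):
    """Apply migrations to every tenant in parallel.

    Returns (succeeded, failed), where failed maps tenant -> error text.
    A failure for one tenant never stops the others.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(apply_migration, t): t for t in tenants}
        for future in as_completed(futures):
            tenant = futures[future]
            try:
                future.result()
                succeeded.append(tenant)
            except Exception as exc:
                # Log and continue; a separate job retries these later.
                failed[tenant] = repr(exc)
    return sorted(succeeded), failed


def fake_apply_migration(tenant):
    """Illustrative stand-in: fails for exactly one tenant."""
    if tenant == "tenant_3":
        raise RuntimeError("lock timeout")


tenants = [f"tenant_{i}" for i in range(5)]
ok, bad = migrate_all_tenants(tenants, fake_apply_migration, max_workers=3)
```

The failed mapping is what a retry job would consume; the deployment itself proceeds regardless of its contents.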

Question 5

How do you isolate a noisy neighbor tenant in a shared-infrastructure SaaS?

Accepted Answer

Detection: monitor per-tenant resource consumption: queries per second, query latency, rows scanned, storage I/O. Alert when a single tenant consumes a disproportionate share of cluster resources (for example, more than 10%).

Mitigation options (in escalating order): (1) Rate limit the tenant's API calls (HTTP 429 with clear messaging). (2) Query throttling: delay or queue the noisy tenant's queries in the application layer before they reach the database (a server-side pg_sleep would hold a connection open while it waits). (3) Move the tenant to a dedicated database connection pool with a limited number of connections. (4) Migrate the tenant to their own dedicated database instance (tenant tier upgrade).

For the last option: take a logical backup, restore it to a new instance, keep the new instance in sync with logical replication, run in parallel briefly to verify, then update the tenant routing table to point to the new instance and switch traffic. With logical replication carrying changes until cutover, this should be doable without tenant-visible downtime.
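The detection step can be sketched as a share-of-total check. The function name find_noisy_tenants, the 10% default, and the sample numbers are illustrative assumptions; usage stands for any single per-tenant metric collected over the same interval (for example rows scanned).

```python
def find_noisy_tenants(usage, threshold=0.10):
    """Return {tenant: share} for tenants above `threshold` of total usage.

    `usage` maps tenant id -> consumption of one resource over one interval.
    """
    total = sum(usage.values())
    if total <= 0:
        return {}  # nothing consumed, nothing to flag
    shares = {tenant: value / total for tenant, value in usage.items()}
    return {tenant: share for tenant, share in shares.items() if share > threshold}


# Hypothetical sample: rows scanned per tenant in the last minute.
rows_scanned = {"acme": 50, "globex": 900, "initech": 50}
noisy = find_noisy_tenants(rows_scanned)
```

A fixed share works poorly when there are only a handful of tenants (each would exceed 10% by default), so the threshold is best treated as a tunable value per cluster.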

Low-Level Design: Multi-Tenant SaaS Platform — Tenant Isolation, Schema Design, and Rate Limiting

What is Multi-Tenancy

Database Schema Approaches

Tenant Routing and Context

Tenant-Level Configuration

Per-Tenant Rate Limiting and Quotas

Tenant Onboarding and Offboarding