What Is a Multi-Tenant Architecture?
Multi-tenancy means a single instance of a software application serves multiple customers (tenants), with each tenant’s data isolated from others. All major SaaS products — Salesforce, Slack, GitHub, Datadog — are multi-tenant. The central design challenge is balancing isolation (tenant A cannot see tenant B’s data), operational efficiency (shared infrastructure reduces cost), and customization (enterprise tenants want their own configurations, SLAs, and sometimes their own databases).
Isolation Models
Silo (Database per Tenant)
Each tenant gets a dedicated database instance. Total data isolation — no risk of cross-tenant data leakage from a query bug. Easier compliance (PCI, HIPAA: data never co-located with other customers’ data). Enables tenant-specific database configurations (indexes, extensions). Downsides: high infrastructure cost (1,000 tenants = 1,000 database instances), complex schema migrations (must run against every tenant database), and long onboarding time (spinning up a database takes minutes).
When to use: large enterprise customers paying $100K+/year who require data isolation contractually; regulated industries with strict data residency requirements.
Pool (Shared Database, Shared Schema)
All tenants share one database with a tenant_id column in every table. Every query must include WHERE tenant_id = :current_tenant — enforced at the application layer or via Row-Level Security (RLS) in PostgreSQL. Maximum resource efficiency — all tenants share indexes, connections, and compute. Schema migrations run once. Noisy neighbor risk: a large tenant running heavy queries degrades performance for others.
When to use: SMB/self-serve customers; early-stage SaaS where cost matters; when tenant datasets are small and similar in size.
Bridge (Shared Database, Schema per Tenant)
All tenants share a database instance, but each gets their own schema (tenant_abc.users, tenant_xyz.users). Better isolation than the pool model (no tenant_id column needed; a misconfigured query cannot read another tenant’s data). Schema migrations must be applied to every tenant schema, but tools like Flyway can manage this. PostgreSQL supports thousands of schemas per database.
When to use: mid-market customers requiring schema isolation without the cost of dedicated databases.
Row-Level Security (RLS) in PostgreSQL
RLS enforces tenant isolation at the database level, not the application layer. Even if a bug in application code omits the WHERE tenant_id filter, the database enforces it.
-- Enable RLS on a table
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- Policy: users can only see rows where tenant_id matches their session variable
CREATE POLICY tenant_isolation ON orders
USING (tenant_id = current_setting('app.current_tenant_id')::uuid);
-- Application sets the tenant context before each query:
SET app.current_tenant_id = 'tenant-uuid-here';
SELECT * FROM orders; -- automatically filtered to current tenant
RLS has a small performance cost (policy evaluation on each row — keep tenant_id indexed) and requires setting the tenant variable on each connection or transaction. With a connection pooler such as PgBouncer in transaction pooling mode, use SET LOCAL (or set_config(..., true)) inside the transaction: a plain session-level SET can leak to the next client that reuses the pooled connection. Note also that table owners and superusers bypass RLS by default; use ALTER TABLE ... FORCE ROW LEVEL SECURITY if the application connects as the table owner.
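The transaction-scoped pattern can be sketched as a small helper — a minimal sketch assuming a psycopg-style cursor API; the stub classes exist only so the pattern can be exercised without a live database:

```python
from contextlib import contextmanager

@contextmanager
def tenant_transaction(conn, tenant_id):
    """Run statements in a transaction scoped to one tenant.

    set_config(..., true) is the parameterizable equivalent of SET LOCAL:
    it reverts at COMMIT/ROLLBACK, so a pooled connection (PgBouncer in
    transaction mode) cannot leak the tenant id to the next client.
    """
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        cur.execute(
            "SELECT set_config('app.current_tenant_id', %s, true)",
            (tenant_id,),
        )
        yield cur
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise

# Stub cursor/connection so the pattern can be demonstrated offline;
# in production, conn would be a real psycopg connection.
class RecordingCursor:
    def __init__(self):
        self.statements = []
    def execute(self, sql, params=None):
        self.statements.append((sql, params))

class StubConn:
    def __init__(self):
        self._cur = RecordingCursor()
    def cursor(self):
        return self._cur
```

Every query issued inside the `with tenant_transaction(...)` block is then filtered by the RLS policy to the tenant that was set.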
Tenant Onboarding
A new tenant should be active within seconds (for self-serve) or hours (for enterprise). Self-serve flow:
- User submits email + plan — a Tenant Creation Service writes a tenant record to the tenants table (id, slug, plan, created_at, status=provisioning)
- For pool model: no additional database work needed — schema already exists
- For bridge model: CREATE SCHEMA tenant_{id}; run migration scripts against new schema
- For silo model: provision a new RDS/CloudSQL instance asynchronously; store connection string in tenant config; mark status=active when ready
- Send welcome email; redirect to onboarding flow
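The flow above can be sketched as a dispatcher keyed on the isolation model — a hedged sketch in which the tenant directory is an in-memory dict and `run_sql` / `enqueue_provisioning_job` are hypothetical placeholders for a database call and a job queue:

```python
import uuid
from datetime import datetime, timezone

# Hypothetical in-memory tenant directory; a real service would insert
# into the tenants table and enqueue async provisioning jobs.
TENANTS = {}

def run_sql(sql):
    pass  # placeholder: execute DDL against the shared database

def enqueue_provisioning_job(tenant_id):
    pass  # placeholder: async worker provisions an RDS/CloudSQL instance

def create_tenant(slug, plan, model):
    """Self-serve onboarding: record the tenant, then provision per model."""
    tenant = {
        "id": str(uuid.uuid4()),
        "slug": slug,
        "plan": plan,
        "status": "provisioning",
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    TENANTS[tenant["id"]] = tenant

    if model == "pool":
        tenant["status"] = "active"  # shared schema already exists
    elif model == "bridge":
        run_sql(f'CREATE SCHEMA "tenant_{tenant["id"]}"')  # then run migrations
        tenant["status"] = "active"
    elif model == "silo":
        enqueue_provisioning_job(tenant["id"])  # worker flips status later
    return tenant
```

Pool and bridge tenants become active synchronously; silo tenants stay in status=provisioning until the async worker finishes.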
The Noisy Neighbor Problem
In the pool model, one tenant running a large report query or bulk import can consume CPU and I/O that degrades performance for thousands of other tenants. Solutions:
- Query timeouts: per-tenant statement timeouts (SET statement_timeout = '5s') kill runaway queries
- Connection limits: allocate a maximum number of database connections per tenant; excess requests queue or are rejected with a 429
- Rate limiting at the API layer: limit requests per tenant per second; large tenants get higher limits based on their plan
- Read replicas for reporting: route heavy analytical queries to a read replica so OLAP load does not compete with OLTP
- Tenant tiering: automatically detect high-usage tenants and migrate them to dedicated resources (silo tier) — often triggered by SLA thresholds
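The rate-limiting layer can be illustrated with a per-tenant token bucket — an in-memory sketch (production systems typically keep the bucket in Redis so all API nodes share state); the plan limits here are illustrative:

```python
import time

class TenantRateLimiter:
    """Per-tenant token bucket: rate = tokens refilled per second,
    burst = bucket capacity. Plan-based limits give paying tenants
    more headroom."""

    PLAN_LIMITS = {"free": (10, 20), "pro": (100, 200), "enterprise": (1000, 2000)}

    def __init__(self, now=time.monotonic):
        self.now = now          # injectable clock for testing
        self.buckets = {}       # tenant_id -> (tokens, last_refill)

    def allow(self, tenant_id, plan="free"):
        rate, burst = self.PLAN_LIMITS[plan]
        tokens, last = self.buckets.get(tenant_id, (burst, self.now()))
        t = self.now()
        tokens = min(burst, tokens + (t - last) * rate)  # refill since last call
        if tokens >= 1:
            self.buckets[tenant_id] = (tokens - 1, t)
            return True         # proceed with the request
        self.buckets[tenant_id] = (tokens, t)
        return False            # respond 429 Too Many Requests
```

Each tenant gets an independent bucket, so one tenant exhausting its quota has no effect on others — exactly the noisy-neighbor containment the API layer is responsible for.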
Data Model: Tenant-Aware Schema Design
-- Tenant directory (global)
CREATE TABLE tenants (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
slug TEXT UNIQUE NOT NULL,
plan TEXT NOT NULL, -- free/pro/enterprise
db_shard TEXT, -- shard id for pool; connection string for silo
created_at TIMESTAMPTZ DEFAULT now()
);
-- Pool model: all tenant data tables include tenant_id
CREATE TABLE users (
id UUID DEFAULT gen_random_uuid(),
tenant_id UUID NOT NULL REFERENCES tenants(id),
email TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT now(),
PRIMARY KEY (tenant_id, id) -- tenant_id first for index locality
);
-- Composite primary key with tenant_id as prefix ensures all rows for a
-- tenant are co-located on disk (PostgreSQL heap) and in B-tree pages.
Multi-Region Tenants
Enterprise customers in the EU may require data to stay in the EU (GDPR). Architecture: a global control plane (tenant directory, authentication) runs in a primary region. Each geographic region has its own data plane (application servers + databases) that hosts tenant data. The routing layer (a GeoDNS-aware API gateway or the control plane) directs each tenant’s requests to their designated region. Tenant metadata in the control plane stores: tenant_id → region. Cross-region reads are never needed for tenant data — each region is self-contained. The control plane is replicated globally but contains no PII.
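The routing layer reduces to a control-plane lookup — a minimal sketch in which the tenant directory and regional endpoints are illustrative dicts (names and URLs are hypothetical):

```python
# Regional data-plane endpoints (illustrative).
REGION_ENDPOINTS = {
    "eu-west-1": "https://eu.api.example.com",
    "us-east-1": "https://us.api.example.com",
}

# Stored in the globally replicated control plane: tenant_id -> home region.
# Contains no PII, so it is safe to replicate everywhere.
TENANT_REGION = {
    "tenant-acme": "eu-west-1",
    "tenant-globex": "us-east-1",
}

def route_request(tenant_id):
    """Return the data-plane endpoint for a tenant's home region."""
    region = TENANT_REGION.get(tenant_id)
    if region is None:
        raise LookupError(f"unknown tenant: {tenant_id}")
    return REGION_ENDPOINTS[region]
```

Because every tenant maps to exactly one region, the gateway never fans out across regions; data residency is enforced by the mapping itself.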
Schema Migrations Across All Tenants
For silo and bridge models, each schema migration must run against every tenant. At 10,000 tenants, running ALTER TABLE against each schema serially would take hours. Strategies:
- Expand-contract pattern: add a nullable column (a metadata-only change, so non-blocking), backfill data asynchronously in batches, then add the NOT NULL constraint once every row is populated (PostgreSQL still scans the table to validate, so schedule it off-peak or use a NOT VALID check constraint validated separately). Never block production with a long-running ALTER.
- Parallel migration runner: a migration job fetches the list of tenant schemas, distributes them across a worker pool, and applies migrations in parallel. Track completion per tenant — retry failed schemas.
- Blue-green schema: for large changes, deploy the new schema alongside the old; route new tenant onboardings to the new schema; migrate existing tenants in waves.
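The parallel migration runner can be sketched with a thread pool — a hedged sketch that assumes `apply_migration(schema)` is idempotent (a retried schema cannot be double-applied); in production, completion tracking would live in a migrations table rather than return values:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def migrate_all_tenants(schemas, apply_migration, workers=20, retries=2):
    """Apply one migration to every tenant schema in parallel.

    Returns (done, failed) so a follow-up run can target only the
    failures instead of re-running the whole fleet.
    """
    done, failed = [], []

    def run(schema):
        for attempt in range(retries + 1):
            try:
                apply_migration(schema)
                return
            except Exception:
                if attempt == retries:
                    raise  # exhausted retries; surface the failure

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run, s): s for s in schemas}
        for fut in as_completed(futures):
            schema = futures[fut]
            try:
                fut.result()
                done.append(schema)
            except Exception:
                failed.append(schema)
    return done, failed
```

With 20 workers, 10,000 schemas that take ~1s of DDL each finish in minutes instead of hours, and the failed list gives the retry set.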
Key Interview Points
- Match isolation model to tenant size: pool for SMB, bridge for mid-market, silo for enterprise
- Enforce isolation at the database layer (RLS) as a safety net — not only at the application layer
- Noisy neighbor: apply per-tenant rate limiting, connection limits, and query timeouts
- Onboarding must be automated and fast; enterprise silo provisioning is async
- Use expand-contract for schema migrations to avoid downtime at any tenant count
- Multi-region: route tenant data to a fixed region for data residency compliance
Frequently Asked Questions
What is the noisy neighbor problem in multi-tenant systems and how do you solve it?
The noisy neighbor problem occurs when one tenant's workload consumes a disproportionate share of shared resources (CPU, database connections, I/O bandwidth), degrading performance for other tenants on the same infrastructure. Common triggers: a large tenant running a bulk data export, an unoptimized report query that scans millions of rows, or a spike in their user traffic. Solutions at multiple layers: (1) API rate limiting: enforce per-tenant request quotas (e.g., 1000 req/min for free plan, 10000 for pro) using a Redis token bucket or sliding window counter. Return 429 Too Many Requests when exceeded. (2) Database query timeouts: set statement_timeout = '5s' per tenant connection to kill runaway queries automatically. (3) Connection pool limits: allocate a maximum number of database connections per tenant via PgBouncer pool_size per database. (4) Compute quotas: in Kubernetes, set resource requests/limits per tenant namespace or pod. (5) Read replica offloading: route analytical and report queries to a read replica so OLAP load does not compete with OLTP on the primary. (6) Tenant tiering: proactively detect high-usage tenants and migrate them to dedicated infrastructure.
How do you implement tenant isolation at the database level?
Application-layer isolation (always include WHERE tenant_id = :id in queries) is necessary but insufficient — a single missing WHERE clause can expose all tenants' data. Database-level enforcement provides a safety net. PostgreSQL Row-Level Security (RLS): ALTER TABLE orders ENABLE ROW LEVEL SECURITY; followed by CREATE POLICY tenant_isolation ON orders USING (tenant_id = current_setting('app.current_tenant_id')::uuid). Before each query, the application sets app.current_tenant_id for the session or transaction. Even if the application forgets the WHERE clause, the database policy silently filters to the current tenant's rows. RLS policies are invisible to the application — they behave like implicit WHERE clauses. Performance: RLS adds a small overhead per row evaluated; ensure tenant_id is indexed. For schema-per-tenant (bridge model), each tenant has their own PostgreSQL schema, so rows from different tenants never share a table — no tenant_id column needed. Isolation then depends on the connection manager setting search_path = tenant_schema_name correctly before each request, so that code path must be strictly enforced and audited.
How do you handle database migrations across thousands of tenant schemas?
Running ALTER TABLE statements against thousands of tenant schemas serially would take hours and risk partial failures leaving schemas in inconsistent states. The expand-contract (or parallel change) pattern minimizes disruption: (1) Expand: add the new column as nullable with no default (ALTER TABLE ... ADD COLUMN new_col type) — a metadata-only change in PostgreSQL, and since PostgreSQL 11 even a constant default is applied instantly. (2) Backfill: run a background job that batches UPDATE statements across tenants, updating rows with the new value. Run in small batches (e.g., 1,000 rows) with delays to avoid I/O spikes. (3) Switch application code to write both old and new columns. (4) Contract: once all rows are backfilled and code is deployed, add the NOT NULL constraint (PostgreSQL scans to validate — fast relative to a rewrite, but it takes a lock, so schedule it off-peak) and drop the old column. For the parallel schema runner: fetch all tenant schema names, distribute across a worker pool (e.g., 20 concurrent workers), track completion per schema in a migrations_log table, and retry failed schemas. Use schema versioning — track the current migration version per schema and only apply pending migrations.
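The backfill step can be sketched as a batched UPDATE loop — SQLite is used here only to make the sketch runnable; the same loop shape applies to PostgreSQL (add an app-side delay between batches in production), and the table/column names are illustrative:

```python
import sqlite3

def backfill_in_batches(conn, batch_size=1000):
    """Expand-contract step 2: populate the new column in small batches
    so no single UPDATE holds locks or spikes I/O for long."""
    total = 0
    while True:
        cur = conn.execute(
            """UPDATE users SET display_email = lower(email)
               WHERE id IN (SELECT id FROM users
                            WHERE display_email IS NULL LIMIT ?)""",
            (batch_size,),
        )
        conn.commit()  # release locks between batches
        if cur.rowcount == 0:
            return total  # nothing left to backfill
        total += cur.rowcount
```

The loop is idempotent — it only touches rows where the new column is still NULL, so a crashed backfill job can simply be restarted.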