System Design Interview: Design a Database Connection Pool

What Is a Database Connection Pool?

Opening a database connection is expensive: TCP handshake, authentication, SSL negotiation, session setup — typically 20–200ms. A connection pool maintains a set of pre-established connections that are reused across requests. Instead of opening/closing on every query, the application borrows a connection, uses it, and returns it to the pool. At 10K requests/second without a pool, you’d need 10K new connections/second — databases cap at 500–2000 total connections.
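The borrow/use/return cycle can be sketched with a toy fixed-size pool. This is a minimal illustration, not a production implementation; `make_connection` stands in for a real driver call such as `psycopg2.connect`:

```python
import queue

class SimplePool:
    """Toy fixed-size pool: pre-opens connections, hands them out, takes them back."""

    def __init__(self, make_connection, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):                  # pay the expensive setup cost once, up front
            self._idle.put(make_connection())

    def acquire(self, timeout=None):
        # Returns an idle connection, or blocks (up to timeout) until one is free.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        # Return the connection for reuse -- note it is NOT closed.
        self._idle.put(conn)
```

Because `release` puts the same object back, a subsequent `acquire` reuses it with zero handshake cost, which is the entire point of pooling.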

Connection Lifecycle

1. Initialization: the pool creates min_size connections at startup
2. Borrow: the application calls pool.acquire(), which returns an idle connection or blocks until one is available
3. Use: the application executes queries on the borrowed connection
4. Return: the application calls pool.release(); the connection goes back to the idle pool (not closed)
5. Validation: before handing a connection to the application, optionally ping it (SELECT 1) to verify it is still alive
6. Eviction: idle connections beyond max_idle_time are closed; broken connections are removed
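The validation and eviction steps can be sketched as follows. This is a simplified illustration; `PooledConn`, `is_alive`, and `max_idle_time` are stand-in names for whatever your driver and pool actually expose:

```python
import time

class PooledConn:
    """Wraps a raw connection with the bookkeeping the lifecycle needs."""

    def __init__(self, raw):
        self.raw = raw
        self.created_at = time.monotonic()
        self.last_used = time.monotonic()

def borrow_idle(idle, max_idle_time, is_alive):
    """Steps 2, 5 and 6: borrow from the idle list, evicting stale or dead connections."""
    while idle:
        conn = idle.pop()
        if time.monotonic() - conn.last_used > max_idle_time:
            continue                       # eviction: idle too long, close and drop
        if not is_alive(conn):             # validation: e.g. a SELECT 1 ping
            continue                       # broken: drop it and try the next one
        return conn
    return None                            # pool empty: caller opens a fresh connection
```

A real pool would close evicted connections and open a replacement when `None` comes back; the skeleton only shows the decision order.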

Pool Sizing

Classic formula: pool_size = (core_count * 2) + effective_spindle_count. For a 4-core app server with SSDs (effective_spindle_count ≈ 1): pool_size = 4*2 + 1 = 9. Too small, and requests queue while latency spikes. Too large, and the database is overwhelmed with concurrent connections (each consumes ~5MB of RAM), context-switching overhead, and lock contention.

Empirical approach: start with pool_size = 10 per app server. Monitor pool_wait_time (requests blocked waiting for a connection) and db_active_connections. Scale the pool up if wait_time > 1ms; scale down if connection utilization < 50%.
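The classic formula is simple enough to pin down in code (a one-liner shown for concreteness; the function name is my own):

```python
def pool_size(core_count, effective_spindles):
    """Classic starting-point formula: (cores * 2) + effective spindle count."""
    return core_count * 2 + effective_spindles

# 4-core app server, SSD-backed database (effective_spindle_count ~ 1):
print(pool_size(4, 1))  # -> 9
```

Treat the result as a starting point for load testing, not a final answer.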

Key Configuration Parameters

• min_size: connections kept alive even when idle (warmth, immediate availability)
• max_size: hard limit on total connections (prevents DB overload)
• connect_timeout: how long to wait for a new connection (fail fast)
• acquire_timeout: how long a request waits for an available connection before failing
• max_lifetime: maximum age of any connection — periodic refresh to prevent “drift” (stale sessions, memory leaks)
• idle_timeout: close connections idle longer than this to free DB-side resources
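The parameters above might be collected into a config object like this. The values are illustrative starting points only, not recommendations; tune them against your own workload and database limits:

```python
from dataclasses import dataclass

@dataclass
class PoolConfig:
    min_size: int = 2              # warm connections kept even when idle
    max_size: int = 10             # hard cap; keep well below DB max_connections
    connect_timeout: float = 5.0   # seconds to wait when opening a NEW connection
    acquire_timeout: float = 0.5   # seconds a request waits for an idle connection
    max_lifetime: float = 1800.0   # recycle any connection after 30 minutes
    idle_timeout: float = 600.0    # close connections idle for 10 minutes
```

Note the ordering constraints implied by the article: min_size ≤ max_size, and max_lifetime should sit below any firewall idle timeout.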

Overflow Handling

When all connections are in use and max_size is reached, a new acquire() request must either (1) block and wait up to acquire_timeout, then fail with a timeout error, or (2) immediately fail with a “pool exhausted” error. The right choice depends on the use case: an interactive user request should fail fast (return 503), while a background job can wait. Monitor pool_exhausted_count as a key SLO metric; spikes indicate capacity issues.
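Both overflow policies fit in a few lines. This is a sketch: the queue holds idle connections, and `RuntimeError` stands in for a library-specific pool-exhausted exception:

```python
import queue

def acquire_with_policy(idle, fail_fast, acquire_timeout):
    """Two policies for an exhausted pool: fail immediately, or wait then fail."""
    try:
        if fail_fast:
            return idle.get_nowait()              # (2) interactive path: error immediately
        return idle.get(timeout=acquire_timeout)  # (1) background path: wait, then fail
    except queue.Empty:
        # Surface as HTTP 503 / "pool exhausted"; also bump pool_exhausted_count here.
        raise RuntimeError("pool exhausted")
```

A real pool would additionally check whether it may open a new connection (current total below max_size) before giving up.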

Health Checking and Reconnection

Database connections can become stale: firewalls drop idle TCP connections, the database restarts, the network blips. The pool must detect and replace broken connections. Strategies:

• Test-on-borrow: ping the connection (SELECT 1) before returning it to the application. Adds latency (1-2ms) per acquire but guarantees a healthy connection.
• Background validator: a periodic thread pings idle connections and removes broken ones. Zero latency impact on requests.
• Exception-based: on query failure, close the connection and retry with a fresh one. Simpler, but one request sees the error.
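The exception-based strategy might look like this. It is a sketch, not a real driver API: `acquire`, `discard`, `open_fresh`, and `release` are hypothetical pool methods named for illustration:

```python
def run_query(pool, sql):
    """Exception-based recovery: on a broken connection, swap it out and retry once."""
    conn = pool.acquire()
    try:
        return conn.execute(sql)
    except ConnectionError:
        pool.discard(conn)          # drop the broken connection
        conn = pool.open_fresh()    # hypothetical: open a replacement
        return conn.execute(sql)    # one retry, now on a known-good connection
    finally:
        pool.release(conn)          # always return whichever connection we ended with
```

The single retry is deliberate: if the fresh connection also fails, something bigger than a stale socket is wrong, and the error should propagate.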

Read Replica Routing

Modern apps use read replicas for scalability. Pool implementation: maintain two pools, a primary pool (writes plus strongly consistent reads) and a replica pool (eventually consistent reads). The application layer routes queries: anything inside a transaction goes to the primary; a SELECT outside a transaction goes to the replica pool (round-robin across replicas). The replica pool is sized much larger than the primary pool, since reads usually far outnumber writes.
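A minimal routing layer under these rules could look like the following sketch. Real routers also handle read-your-writes hints, sticky sessions, and replica lag, none of which is shown here:

```python
import itertools

class RoutingPool:
    """Route writes and transactional reads to the primary, plain reads across replicas."""

    def __init__(self, primary_pool, replica_pools):
        self._primary = primary_pool
        self._replicas = itertools.cycle(replica_pools)  # simple round-robin

    def pool_for(self, sql, in_transaction=False):
        is_write = not sql.lstrip().upper().startswith("SELECT")
        if is_write or in_transaction:
            return self._primary          # strong consistency required
        return next(self._replicas)       # eventually consistent read
```

Classifying statements by their leading keyword is a deliberate simplification; production routers usually rely on explicit session or API hints instead of parsing SQL.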

Interview Tips

• Pool sizing: the database is usually the bottleneck, not the pool — keep pool_size well below DB max_connections.
• Acquire timeout prevents cascade failures — a slow DB shouldn’t cause unbounded request queueing.
• max_lifetime prevents subtle bugs from long-lived stale sessions (cached execution plans, timezone drift).
• PgBouncer (PostgreSQL) and ProxySQL (MySQL) are external connection poolers that sit between app servers and the DB — useful when the application can’t use driver-level pooling.

Frequently Asked Questions

How do you size a database connection pool correctly?

The widely cited formula is pool_size = (number_of_cores * 2) + effective_spindle_count. For a 4-core application server connecting to an SSD-backed DB: 4*2 + 1 = 9. The rationale: database operations involve I/O waits, and during a wait the CPU can service another connection, so with 4 cores roughly 8 connections keep all cores busy, and the extra one covers disk I/O. Too few connections and requests queue waiting for a connection, increasing latency; too many and DB process memory (each PostgreSQL connection uses ~5-10MB), context-switching overhead, and lock contention dominate. Practical approach: start with pool_size = 10 per application server, monitor pool_checkout_timeout_rate (a pool-exhausted signal) and DB active_connections vs. max_connections, and tune empirically with load testing. For read-heavy apps with replicas, keep the primary pool small (writes) and the replica pool larger (reads), with connection-level routing.

What is the thundering herd problem in connection pools and how do you solve it?

All application instances start simultaneously (deploy, restart) and each immediately tries to fill its min_pool_size connections. With 100 app servers and min_pool_size = 10, the database receives 1000 concurrent new-connection requests within seconds, and each connection handshake is expensive for the DB. Solutions: (1) staggered startup: add jitter to the pool initialization delay (random 0-5 seconds per instance) so startup connections spread over time; (2) lazy initialization: don’t pre-fill the pool at startup (min_size = 0); create connections on demand up to max_size; (3) PgBouncer or ProxySQL in front of the DB: the pool manager handles connection multiplexing at the infrastructure level, absorbing burst connection demand. The same thundering herd pattern occurs after a DB failover, when all app instances simultaneously detect the primary is gone and reconnect.

How does a connection pool handle stale or broken connections?

Network equipment (firewalls, NAT gateways, load balancers) silently drops idle TCP connections after a timeout (typically 5-30 minutes). The application pool still holds a reference to the connection object, but the underlying TCP session is gone; on next use, the query fails with a “connection reset” or “broken pipe” error. Detection strategies: (1) test-on-borrow: before returning a connection to the application, issue SELECT 1, and if it fails, discard the connection and create a new one (adds 1-3ms per acquire, acceptable for infrequent use); (2) keepalive: configure TCP keepalive probes (SO_KEEPALIVE) to detect dead connections and prevent NAT timeout; (3) max_lifetime: close and recreate connections older than max_lifetime (e.g., 30 minutes) regardless of health, which prevents silent stale state, cached-execution-plan drift, and session-level variable accumulation; (4) background health check: a pool thread periodically sends SELECT 1 to idle connections, removing those that fail. Best practices: always set max_lifetime below the firewall idle timeout, and always handle OperationalError in application code by retrying once.
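The staggered-startup mitigation for the thundering herd can be sketched in a few lines. `create_connection` and `max_jitter_s` are illustrative names, and the jitter range is an assumption taken from the 0-5 second figure above:

```python
import random
import time

def warm_pool(create_connection, min_size, max_jitter_s=5.0):
    """Staggered startup: delay pool pre-fill by a random per-instance jitter,
    so a fleet-wide restart does not hit the database with simultaneous handshakes."""
    time.sleep(random.uniform(0.0, max_jitter_s))   # spread the connection storm over time
    return [create_connection() for _ in range(min_size)]
```

With 100 instances and a 5-second window, the database sees the startup connections arrive over several seconds rather than all at once.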
