System Design Interview: Design a Database Connection Pool

What Is a Database Connection Pool?

Opening a database connection is expensive: TCP handshake, authentication, SSL negotiation, session setup — typically 20–200ms. A connection pool maintains a set of pre-established connections that are reused across requests. Instead of opening/closing on every query, the application borrows a connection, uses it, and returns it to the pool. At 10K requests/second without a pool, you’d need 10K new connections/second — databases cap at 500–2000 total connections.
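The borrow/use/return cycle can be sketched with a toy fixed-size pool. This is a minimal illustration, not a production implementation; `make_connection` stands in for a real driver call such as `psycopg2.connect`:

```python
import queue

class SimplePool:
    """Toy fixed-size pool: pre-opens connections, hands them out, takes them back."""

    def __init__(self, make_connection, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):                  # pay the expensive setup cost once, up front
            self._idle.put(make_connection())

    def acquire(self, timeout=None):
        # Returns an idle connection, or blocks (up to timeout) until one is free.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        # Return the connection for reuse -- note it is NOT closed.
        self._idle.put(conn)
```

Because `release` puts the same object back, a subsequent `acquire` reuses it with zero handshake cost, which is the entire point of pooling.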

Connection Lifecycle

1. Initialization: the pool creates min_size connections at startup
2. Borrow: the application calls pool.acquire(), which returns an idle connection or blocks until one is available
3. Use: the application executes queries on the borrowed connection
4. Return: the application calls pool.release(); the connection goes back to the idle pool (not closed)
5. Validation: before handing a connection to the application, optionally ping it (SELECT 1) to verify it is still alive
6. Eviction: idle connections beyond max_idle_time are closed; broken connections are removed
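The validation and eviction steps can be sketched as follows. This is a simplified illustration; `PooledConn`, `is_alive`, and `max_idle_time` are stand-in names for whatever your driver and pool actually expose:

```python
import time

class PooledConn:
    """Wraps a raw connection with the bookkeeping the lifecycle needs."""

    def __init__(self, raw):
        self.raw = raw
        self.created_at = time.monotonic()
        self.last_used = time.monotonic()

def borrow_idle(idle, max_idle_time, is_alive):
    """Steps 2, 5 and 6: borrow from the idle list, evicting stale or dead connections."""
    while idle:
        conn = idle.pop()
        if time.monotonic() - conn.last_used > max_idle_time:
            continue                       # eviction: idle too long, close and drop
        if not is_alive(conn):             # validation: e.g. a SELECT 1 ping
            continue                       # broken: drop it and try the next one
        return conn
    return None                            # pool empty: caller opens a fresh connection
```

A real pool would close evicted connections and open a replacement when `None` comes back; the skeleton only shows the decision order.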

Pool Sizing

Classic formula: pool_size = (core_count * 2) + effective_spindle_count. For a 4-core app server with SSDs (effective_spindle_count ≈ 1): pool_size = 4*2 + 1 = 9. Too small, and requests queue while latency spikes. Too large, and the database is overwhelmed with concurrent connections (each consumes ~5MB of RAM), context-switching overhead, and lock contention.

Empirical approach: start with pool_size = 10 per app server. Monitor pool_wait_time (requests blocked waiting for a connection) and db_active_connections. Scale the pool up if wait_time > 1ms; scale down if connection utilization < 50%.
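The classic formula is simple enough to pin down in code (a one-liner shown for concreteness; the function name is my own):

```python
def pool_size(core_count, effective_spindles):
    """Classic starting-point formula: (cores * 2) + effective spindle count."""
    return core_count * 2 + effective_spindles

# 4-core app server, SSD-backed database (effective_spindle_count ~ 1):
print(pool_size(4, 1))  # -> 9
```

Treat the result as a starting point for load testing, not a final answer.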

Key Configuration Parameters

• min_size: connections kept alive even when idle (warmth, immediate availability)
• max_size: hard limit on total connections (prevents DB overload)
• connect_timeout: how long to wait for a new connection (fail fast)
• acquire_timeout: how long a request waits for an available connection before failing
• max_lifetime: maximum age of any connection — periodic refresh to prevent “drift” (stale sessions, memory leaks)
• idle_timeout: close connections idle longer than this to free DB-side resources
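The parameters above might be collected into a config object like this. The values are illustrative starting points only, not recommendations; tune them against your own workload and database limits:

```python
from dataclasses import dataclass

@dataclass
class PoolConfig:
    min_size: int = 2              # warm connections kept even when idle
    max_size: int = 10             # hard cap; keep well below DB max_connections
    connect_timeout: float = 5.0   # seconds to wait when opening a NEW connection
    acquire_timeout: float = 0.5   # seconds a request waits for an idle connection
    max_lifetime: float = 1800.0   # recycle any connection after 30 minutes
    idle_timeout: float = 600.0    # close connections idle for 10 minutes
```

Note the ordering constraints implied by the article: min_size ≤ max_size, and max_lifetime should sit below any firewall idle timeout.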

Overflow Handling

When all connections are in use and max_size is reached, a new acquire() request must either (1) block and wait up to acquire_timeout, then fail with a timeout error, or (2) immediately fail with a “pool exhausted” error. The right choice depends on the use case: an interactive user request should fail fast (return 503), while a background job can wait. Monitor pool_exhausted_count as a key SLO metric; spikes indicate capacity issues.
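Both overflow policies fit in a few lines. This is a sketch: the queue holds idle connections, and `RuntimeError` stands in for a library-specific pool-exhausted exception:

```python
import queue

def acquire_with_policy(idle, fail_fast, acquire_timeout):
    """Two policies for an exhausted pool: fail immediately, or wait then fail."""
    try:
        if fail_fast:
            return idle.get_nowait()              # (2) interactive path: error immediately
        return idle.get(timeout=acquire_timeout)  # (1) background path: wait, then fail
    except queue.Empty:
        # Surface as HTTP 503 / "pool exhausted"; also bump pool_exhausted_count here.
        raise RuntimeError("pool exhausted")
```

A real pool would additionally check whether it may open a new connection (current total below max_size) before giving up.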

Health Checking and Reconnection

Database connections can become stale: firewalls drop idle TCP connections, the database restarts, the network blips. The pool must detect and replace broken connections. Strategies:

• Test-on-borrow: ping the connection (SELECT 1) before returning it to the application. Adds latency (1-2ms) per acquire but guarantees a healthy connection.
• Background validator: a periodic thread pings idle connections and removes broken ones. Zero latency impact on requests.
• Exception-based: on query failure, close the connection and retry with a fresh one. Simpler, but one request sees the error.
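The exception-based strategy might look like this. It is a sketch, not a real driver API: `acquire`, `discard`, `open_fresh`, and `release` are hypothetical pool methods named for illustration:

```python
def run_query(pool, sql):
    """Exception-based recovery: on a broken connection, swap it out and retry once."""
    conn = pool.acquire()
    try:
        return conn.execute(sql)
    except ConnectionError:
        pool.discard(conn)          # drop the broken connection
        conn = pool.open_fresh()    # hypothetical: open a replacement
        return conn.execute(sql)    # one retry, now on a known-good connection
    finally:
        pool.release(conn)          # always return whichever connection we ended with
```

The single retry is deliberate: if the fresh connection also fails, something bigger than a stale socket is wrong, and the error should propagate.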

Read Replica Routing

Modern apps use read replicas for scalability. Pool implementation: maintain two pools, a primary pool (writes plus strongly consistent reads) and a replica pool (eventually consistent reads). The application layer routes queries: anything inside a transaction goes to the primary; a SELECT outside a transaction goes to the replica pool (round-robin across replicas). The replica pool is sized much larger than the primary pool, since reads usually far outnumber writes.
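A minimal routing layer under these rules could look like the following sketch. Real routers also handle read-your-writes hints, sticky sessions, and replica lag, none of which is shown here:

```python
import itertools

class RoutingPool:
    """Route writes and transactional reads to the primary, plain reads across replicas."""

    def __init__(self, primary_pool, replica_pools):
        self._primary = primary_pool
        self._replicas = itertools.cycle(replica_pools)  # simple round-robin

    def pool_for(self, sql, in_transaction=False):
        is_write = not sql.lstrip().upper().startswith("SELECT")
        if is_write or in_transaction:
            return self._primary          # strong consistency required
        return next(self._replicas)       # eventually consistent read
```

Classifying statements by their leading keyword is a deliberate simplification; production routers usually rely on explicit session or API hints instead of parsing SQL.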

Interview Tips

• Pool sizing: the database is usually the bottleneck, not the pool — keep pool_size well below DB max_connections.
• Acquire timeout prevents cascade failures — a slow DB shouldn’t cause unbounded request queueing.
• max_lifetime prevents subtle bugs from long-lived stale sessions (cached execution plans, timezone drift).
• PgBouncer (PostgreSQL) and ProxySQL (MySQL) are external connection poolers that sit between app servers and the DB — useful when the application can’t use driver-level pooling.

Frequently Asked Questions

How do you size a database connection pool correctly?

The widely cited formula is pool_size = (number_of_cores * 2) + effective_spindle_count. For a 4-core application server connecting to an SSD-backed DB: 4*2 + 1 = 9. The rationale: database operations involve I/O waits, and during a wait the CPU can service another connection, so with 4 cores roughly 8 connections keep all cores busy, and the extra one covers disk I/O. Too few connections and requests queue waiting for a connection, increasing latency; too many and DB process memory (each PostgreSQL connection uses ~5-10MB), context-switching overhead, and lock contention dominate. Practical approach: start with pool_size = 10 per application server, monitor pool_checkout_timeout_rate (a pool-exhausted signal) and DB active_connections vs. max_connections, and tune empirically with load testing. For read-heavy apps with replicas, keep the primary pool small (writes) and the replica pool larger (reads), with connection-level routing.

What is the thundering herd problem in connection pools and how do you solve it?

All application instances start simultaneously (deploy, restart) and each immediately tries to fill its min_pool_size connections. With 100 app servers and min_pool_size = 10, the database receives 1000 concurrent new-connection requests within seconds, and each connection handshake is expensive for the DB. Solutions: (1) staggered startup: add jitter to the pool initialization delay (random 0-5 seconds per instance) so startup connections spread over time; (2) lazy initialization: don’t pre-fill the pool at startup (min_size = 0); create connections on demand up to max_size; (3) PgBouncer or ProxySQL in front of the DB: the pool manager handles connection multiplexing at the infrastructure level, absorbing burst connection demand. The same thundering herd pattern occurs after a DB failover, when all app instances simultaneously detect the primary is gone and reconnect.

How does a connection pool handle stale or broken connections?

Network equipment (firewalls, NAT gateways, load balancers) silently drops idle TCP connections after a timeout (typically 5-30 minutes). The application pool still holds a reference to the connection object, but the underlying TCP session is gone; on next use, the query fails with a “connection reset” or “broken pipe” error. Detection strategies: (1) test-on-borrow: before returning a connection to the application, issue SELECT 1, and if it fails, discard the connection and create a new one (adds 1-3ms per acquire, acceptable for infrequent use); (2) keepalive: configure TCP keepalive probes (SO_KEEPALIVE) to detect dead connections and prevent NAT timeout; (3) max_lifetime: close and recreate connections older than max_lifetime (e.g., 30 minutes) regardless of health, which prevents silent stale state, cached-execution-plan drift, and session-level variable accumulation; (4) background health check: a pool thread periodically sends SELECT 1 to idle connections, removing those that fail. Best practices: always set max_lifetime below the firewall idle timeout, and always handle OperationalError in application code by retrying once.
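The staggered-startup mitigation for the thundering herd can be sketched in a few lines. `create_connection` and `max_jitter_s` are illustrative names, and the jitter range is an assumption taken from the 0-5 second figure above:

```python
import random
import time

def warm_pool(create_connection, min_size, max_jitter_s=5.0):
    """Staggered startup: delay pool pre-fill by a random per-instance jitter,
    so a fleet-wide restart does not hit the database with simultaneous handshakes."""
    time.sleep(random.uniform(0.0, max_jitter_s))   # spread the connection storm over time
    return [create_connection() for _ in range(min_size)]
```

With 100 instances and a 5-second window, the database sees the startup connections arrive over several seconds rather than all at once.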
