System Design Interview: Design a Database Connection Pool

What Is a Database Connection Pool?

Opening a database connection is expensive: TCP handshake, authentication, SSL negotiation, session setup — typically 20–200ms. A connection pool maintains a set of pre-established connections that are reused across requests. Instead of opening/closing on every query, the application borrows a connection, uses it, and returns it to the pool. At 10K requests/second without a pool, you’d need 10K new connections/second — databases cap at 500–2000 total connections.

Connection Lifecycle

1. Initialization: the pool creates min_size connections at startup
2. Borrow: the application calls pool.acquire(), which returns an idle connection or blocks until one is available
3. Use: the application executes queries on the borrowed connection
4. Return: the application calls pool.release() — the connection goes back to the idle pool (it is not closed)
5. Validation: before handing a connection to the application, optionally ping it (SELECT 1) to verify it is still alive
6. Eviction: connections idle longer than max_idle_time are closed; broken connections are removed
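The lifecycle above can be sketched as a minimal pool. Everything here is illustrative: FakeConnection stands in for a real driver connection, and validation and eviction are omitted for brevity.

```python
import queue
import threading

class FakeConnection:
    """Stand-in for a real DB connection (illustrative only)."""
    def __init__(self, conn_id):
        self.conn_id = conn_id

class ConnectionPool:
    """Minimal sketch of the lifecycle steps; not production code."""
    def __init__(self, min_size=2, max_size=5):
        self.max_size = max_size
        self._idle = queue.Queue(maxsize=max_size)
        self._lock = threading.Lock()
        self._total = 0
        # 1. Initialization: pre-create min_size connections
        for _ in range(min_size):
            self._idle.put(self._new_connection())

    def _new_connection(self):
        with self._lock:
            self._total += 1
            return FakeConnection(self._total)

    def acquire(self, timeout=None):
        # 2. Borrow: prefer an idle connection; grow up to max_size; else block.
        # (The check-then-create race is tolerated in this sketch.)
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            if self._total < self.max_size:
                return self._new_connection()
            return self._idle.get(timeout=timeout)  # raises queue.Empty on timeout

    def release(self, conn):
        # 4. Return: back to the idle queue, not closed
        self._idle.put(conn)
```

Usage follows the borrow/use/return pattern: `conn = pool.acquire()`, run queries, then `pool.release(conn)` in a `finally` block so the connection is always returned.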

Pool Sizing

Classic formula: pool_size = (core_count * 2) + effective_spindle_count. For a 4-core app server with SSDs (effective spindle count ≈ 1): pool_size ≈ 9. Too small, and requests queue while latency spikes. Too large, and the database is overwhelmed with concurrent connections (each consumes ~5MB of RAM), context-switching overhead, and lock contention.

Empirical approach: start with pool_size = 10 per app server. Monitor pool_wait_time (time requests spend blocked waiting for a connection) and db_active_connections. Scale the pool up if wait_time > 1ms; scale it down if connection utilization < 50%.
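The arithmetic is simple enough to encode directly; taking the effective spindle count as 1 for SSDs is an assumption that reproduces the ≈ 9 figure above.

```python
def recommended_pool_size(core_count, effective_spindle_count):
    # Classic sizing formula: (cores * 2) + effective spindles
    return core_count * 2 + effective_spindle_count

print(recommended_pool_size(4, 1))  # 4-core app server with SSDs -> 9
```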

Key Configuration Parameters

• min_size: connections kept alive even when idle (warmth, immediate availability)
• max_size: hard limit on total connections (prevents DB overload)
• connect_timeout: how long to wait when establishing a new connection (fail fast)
• acquire_timeout: how long a request waits for an available connection before failing
• max_lifetime: maximum age of any connection — periodic refresh prevents “drift” (stale sessions, memory leaks)
• idle_timeout: close connections idle longer than this to free DB-side resources
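As a concrete illustration, these parameters might be grouped into a single config. The values below are hypothetical starting points, not recommendations from the text (apart from a pool size of roughly 10 per app server).

```python
# Hypothetical pool configuration; names follow the parameters above,
# values are illustrative starting points only.
POOL_CONFIG = {
    "min_size": 5,            # keep a few warm connections ready
    "max_size": 10,           # hard cap; stays well below DB max_connections
    "connect_timeout": 5.0,   # seconds to establish a new connection (fail fast)
    "acquire_timeout": 0.5,   # seconds a request waits before failing
    "max_lifetime": 1800.0,   # recycle connections every 30 min to prevent drift
    "idle_timeout": 300.0,    # close connections idle for more than 5 min
}
```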

Overflow Handling

When all connections are in use and max_size is reached, a new acquire() request must either (1) block and wait up to acquire_timeout, then fail with a timeout error, or (2) fail immediately with a “pool exhausted” error. The right choice depends on the use case: an interactive user request should fail fast (return 503), while a background job can afford to wait. Monitor pool_exhausted_count as a key SLO metric — spikes indicate capacity issues.
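Both overflow policies can be expressed in one helper. This is a sketch over Python’s queue module; mapping the error to an HTTP 503 is left to the caller.

```python
import queue

def acquire_from(idle, acquire_timeout=0.0):
    """Overflow policy sketch over an idle-connection queue.

    acquire_timeout <= 0: policy (2), fail immediately (interactive requests).
    acquire_timeout > 0:  policy (1), block up to the timeout (background jobs).
    """
    try:
        if acquire_timeout <= 0:
            return idle.get_nowait()              # (2) immediate "pool exhausted"
        return idle.get(timeout=acquire_timeout)  # (1) block, then time out
    except queue.Empty:
        raise TimeoutError("pool exhausted")      # caller maps this to e.g. HTTP 503
```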

Health Checking and Reconnection

Database connections can become stale: a firewall drops idle TCP connections, the database restarts, the network blips. The pool must detect and replace broken connections. Strategies:

• Test-on-borrow: ping the connection (SELECT 1) before returning it to the application. Adds 1–2ms of latency per acquire but guarantees a healthy connection.
• Background validator: a periodic thread pings idle connections and removes broken ones. Zero latency impact on requests.
• Exception-based: on query failure, close the connection and retry with a fresh one. Simpler, but one request sees the error.
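A test-on-borrow loop might look like the following sketch. StubPool, LiveConn, and BrokenConn are hypothetical stand-ins for a real pool and driver, used only to make the example self-contained.

```python
class LiveConn:
    """Stand-in for a healthy driver connection."""
    def execute(self, sql):
        return [(1,)]

class BrokenConn:
    """Stand-in for a connection whose TCP socket has silently died."""
    def execute(self, sql):
        raise ConnectionError("connection reset")

class StubPool:
    """Minimal pool stub: hands out connections in order."""
    def __init__(self, conns):
        self._conns = list(conns)
    def acquire(self):
        return self._conns.pop(0)
    def discard(self, conn):
        pass  # a real pool would close the socket and decrement its count

def acquire_validated(pool):
    # Test-on-borrow: ping before handing the connection to the caller
    while True:
        conn = pool.acquire()
        try:
            conn.execute("SELECT 1")  # cheap liveness probe (~1-2ms in practice)
            return conn
        except Exception:
            pool.discard(conn)  # broken: drop it and try the next one
```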

Read Replica Routing

Modern apps use read replicas for scalability. A common pool implementation maintains two pools: a primary pool (writes and strongly consistent reads) and a replica pool (eventually consistent reads). The application layer routes queries: queries inside a transaction go to the primary; SELECTs outside a transaction go to the replica pool (round-robin across replicas). The replica pool is sized much larger than the primary pool, since reads typically far outnumber writes.
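A sketch of that routing logic follows. The pool objects here are plain labels, and the SELECT-prefix check is deliberately naive; a real router must also send statements like SELECT ... FOR UPDATE to the primary.

```python
import itertools

class ReplicaRouter:
    """Two-pool routing sketch: primary for writes/transactions, replicas for reads."""
    def __init__(self, primary_pool, replica_pools):
        self.primary = primary_pool
        self._replicas = itertools.cycle(replica_pools)  # round-robin over replicas

    def pool_for(self, sql, in_transaction=False):
        # Anything inside a transaction, and any non-SELECT, goes to the primary.
        # Naive prefix check: e.g. SELECT ... FOR UPDATE also belongs on the
        # primary in a real implementation.
        if in_transaction or not sql.lstrip().lower().startswith("select"):
            return self.primary
        return next(self._replicas)
```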

Interview Tips

• Pool sizing: the database is usually the bottleneck, not the pool — keep pool_size well below the DB’s max_connections.
• An acquire timeout prevents cascade failures — a slow DB shouldn’t cause unbounded request queueing.
• max_lifetime prevents subtle bugs from long-lived stale sessions (cached execution plans, timezone drift).
• PgBouncer (PostgreSQL) and ProxySQL (MySQL) are external connection poolers that sit between app servers and the DB — useful when the application can’t use driver-level pooling.