Questions
Cloud SQL connections are exhausted and failover takes too long. Fix the database setup.
The Scenario
Your Cloud SQL PostgreSQL instance is experiencing issues:
Error: FATAL: remaining connection slots are reserved for
non-replication superuser connections
Monitoring shows:
- Active connections: 100/100 (max)
- Connection attempts: 500/sec (mostly rejected)
- Failover time during maintenance: 3 minutes
- Application errors during failover: 2,847
Multiple GKE pods connect directly to Cloud SQL, and during failovers or high load, the application becomes unavailable.
The Challenge
Design a resilient database architecture with proper connection management, high availability, and minimal failover impact.
A junior engineer might raise max_connections to 1000, move to a larger instance type, skip read replicas, or ignore connection pooling. These approaches waste resources, leave the connection exhaustion unsolved (every pod still holds its own connections), and do nothing to shorten failovers.
A senior engineer runs the Cloud SQL Auth Proxy for secure, IAM-based connectivity, adds PgBouncer for connection pooling, configures regional high availability with automatic failover, offloads reads to read replicas, and implements application retry logic with exponential backoff.
Step 1: Understand the Connection Problem
# Check current connections
gcloud sql connect my-instance --user=postgres

postgres=> SELECT count(*) FROM pg_stat_activity;
 count
-------
   100

postgres=> SELECT usename, application_name, count(*)
           FROM pg_stat_activity
           GROUP BY usename, application_name
           ORDER BY count DESC;

# Typical issue: each pod opens its own pool of 10 connections,
# so 20 pods × 10 connections = 200 demanded against a max of 100
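Before adding infrastructure, it helps to see where the 100 slots actually go. A quick diagnostic sketch, assuming psycopg2 is available and a DB_DSN environment variable (an illustrative name, not part of the setup above) points at the instance:

import os
import psycopg2

# Hypothetical DSN, e.g. "host=10.0.0.5 dbname=mydb user=postgres"
conn = psycopg2.connect(os.environ["DB_DSN"])
with conn.cursor() as cur:
    # Many "idle" rows usually mean app-side pools are holding
    # connections they rarely use: a pooling problem, not a load problem.
    cur.execute(
        """
        SELECT state, count(*)
        FROM pg_stat_activity
        GROUP BY state
        ORDER BY count(*) DESC
        """
    )
    for state, count in cur.fetchall():
        print(f"{state or 'background'}: {count}")
conn.close()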
Step 2: Implement Cloud SQL Auth Proxy with Connection Pooling
# Deploy the Cloud SQL Auth Proxy as a sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      serviceAccountName: app-service-account
      containers:
        - name: api
          image: gcr.io/project/api:latest
          env:
            - name: DB_HOST
              value: "127.0.0.1"  # Connect via the proxy on localhost
            - name: DB_PORT
              value: "5432"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
        - name: cloud-sql-proxy
          image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.7.0
          args:
            - "--structured-logs"
            - "--private-ip"
            - "--auto-iam-authn"  # Authenticate with Workload Identity
            - "project:region:instance"
          securityContext:
            runAsNonRoot: true
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
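With the sidecar in place, the application dials localhost and the proxy handles TLS and IAM. A minimal sketch of the application side, assuming SQLAlchemy and that the DATABASE_URL secret resolves to a localhost DSN (e.g. postgresql+psycopg2://app@127.0.0.1:5432/mydb; with --auto-iam-authn no password is embedded):

import os
from sqlalchemy import create_engine, text

# The proxy listens on 127.0.0.1:5432 inside the pod, so the app never
# sees the instance's real IP or certificates.
engine = create_engine(os.environ["DATABASE_URL"], pool_pre_ping=True)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # connectivity smoke test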
Step 3: Add PgBouncer for Connection Pooling
# PgBouncer deployment for connection pooling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
        - name: pgbouncer
          image: bitnami/pgbouncer:latest  # Pin a specific tag in production
          ports:
            - containerPort: 6432
          env:
            - name: POSTGRESQL_HOST
              value: "10.0.0.5"  # Cloud SQL private IP
            - name: POSTGRESQL_DATABASE
              value: "mydb"
            - name: PGBOUNCER_POOL_MODE
              value: "transaction"  # Best fit for short web transactions
            - name: PGBOUNCER_MAX_CLIENT_CONN
              value: "1000"  # Accept up to 1000 application connections...
            - name: PGBOUNCER_DEFAULT_POOL_SIZE
              value: "20"  # ...multiplexed onto 20 real DB connections
            - name: PGBOUNCER_MIN_POOL_SIZE
              value: "10"
            - name: PGBOUNCER_RESERVE_POOL_SIZE
              value: "5"
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
spec:
  selector:
    app: pgbouncer
  ports:
    - port: 5432        # Apps keep using the standard Postgres port...
      targetPort: 6432  # ...which maps to PgBouncer's listener
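One caveat with transaction pooling: server connections are shared across clients between transactions, so session state (SET commands, prepared statements, advisory locks) does not carry over from one transaction to the next. On the client side it is reasonable to let PgBouncer own the pooling entirely; a sketch with SQLAlchemy, assuming DATABASE_URL now points at the pgbouncer Service:

import os
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Hypothetical URL: postgresql+psycopg2://app:secret@pgbouncer:5432/mydb
# NullPool opens and closes a connection per checkout; that is cheap
# against PgBouncer, which keeps the real server connections warm.
engine = create_engine(os.environ["DATABASE_URL"], poolclass=NullPool)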
Step 4: Configure Cloud SQL High Availability
resource "google_sql_database_instance" "main" {
  name             = "production-db"
  database_version = "POSTGRES_15"
  region           = "us-central1"

  settings {
    tier              = "db-custom-4-16384"  # 4 vCPU, 16 GB RAM
    availability_type = "REGIONAL"           # HA with a standby in another zone

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "03:00"
      transaction_log_retention_days = 7

      backup_retention_settings {
        retained_backups = 30
        retention_unit   = "COUNT"
      }
    }

    ip_configuration {
      ipv4_enabled    = false  # No public IP; reachable only inside the VPC
      private_network = google_compute_network.main.id
      # GKE reaches the instance over the private VPC network;
      # authorized networks apply only to public-IP connections.
    }

    database_flags {
      name  = "max_connections"
      value = "200"  # A modest ceiling is enough once PgBouncer pools connections
    }

    database_flags {
      name  = "log_min_duration_statement"
      value = "1000"  # Log queries slower than 1s
    }

    maintenance_window {
      day          = 7  # Sunday
      hour         = 3  # 03:00
      update_track = "stable"
    }

    insights_config {
      query_insights_enabled  = true
      query_string_length     = 4096
      record_application_tags = true
      record_client_address   = true
    }
  }

  deletion_protection = true
}

# Read replica for read scaling
resource "google_sql_database_instance" "replica" {
  name                 = "production-db-replica"
  master_instance_name = google_sql_database_instance.main.name
  region               = "us-central1"
  database_version     = "POSTGRES_15"

  replica_configuration {
    failover_target = false  # Read replica, not a failover target
  }

  settings {
    tier              = "db-custom-4-16384"
    availability_type = "ZONAL"

    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.main.id
    }
  }
}
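The replica only pays off if read traffic actually reaches it. A minimal routing sketch, assuming two illustrative DSNs (PRIMARY_URL via PgBouncer for writes, REPLICA_URL for reads that tolerate some replication lag) and a hypothetical orders table:

import os
from sqlalchemy import create_engine, text

primary = create_engine(os.environ["PRIMARY_URL"], pool_pre_ping=True)
replica = create_engine(os.environ["REPLICA_URL"], pool_pre_ping=True)

def dashboard_order_count() -> int:
    # Read-only and lag-tolerant: safe to serve from the replica.
    with replica.connect() as conn:
        return conn.execute(text("SELECT count(*) FROM orders")).scalar()

def create_order(order_id: int) -> None:
    # Writes must always go to the HA primary.
    with primary.begin() as conn:
        conn.execute(text("INSERT INTO orders (id) VALUES (:id)"),
                     {"id": order_id})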
Step 5: Implement Application Retry Logic
import os
import random
import time

from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError

DATABASE_URL = os.environ["DATABASE_URL"]

def create_db_engine():
    return create_engine(
        DATABASE_URL,
        pool_size=5,         # Connections per worker
        max_overflow=10,     # Extra connections under burst load
        pool_pre_ping=True,  # Verify connections before use
        pool_recycle=1800,   # Recycle connections after 30 min
        connect_args={
            "connect_timeout": 10,
            "application_name": "api-server",
        },
    )

def execute_with_retry(engine, query, max_retries=3):
    """Execute a query with exponential backoff retry.

    `query` should be a SQLAlchemy executable, e.g. text("SELECT ...").
    """
    for attempt in range(max_retries):
        try:
            with engine.connect() as conn:
                # Materialize rows before the connection is returned
                # to the pool; the result is unusable afterwards.
                return conn.execute(query).fetchall()
        except OperationalError as e:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter avoids a thundering herd
            # of reconnects during a failover.
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Database error, retrying in {wait_time:.2f}s: {e}")
            time.sleep(wait_time)

The same retry pattern in Node.js with pg-pool:
const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                         // Max connections in the pool
  idleTimeoutMillis: 30000,        // Close idle connections after 30s
  connectionTimeoutMillis: 10000,
  application_name: 'api-server',
});

async function queryWithRetry(sql, params, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await pool.query(sql, params);
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Retry only errors that a failover or restart can explain
      if (error.code === 'ECONNREFUSED' ||
          error.code === '57P01' ||  // admin_shutdown
          error.code === '57P02') {  // crash_shutdown
        const waitTime = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
        console.log(`Database error, retrying in ${waitTime}ms`);
        await new Promise(r => setTimeout(r, waitTime));
      } else {
        throw error;
      }
    }
  }
}
Step 6: Monitor Database Health
# Alert on connection exhaustion
resource "google_monitoring_alert_policy" "db_connections" {
  display_name = "Cloud SQL Connection Exhaustion"
  combiner     = "OR"

  conditions {
    display_name = "Connections > 80%"

    condition_threshold {
      filter = <<-EOT
        resource.type="cloudsql_database" AND
        metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"
      EOT
      comparison      = "COMPARISON_GT"
      threshold_value = 160  # 80% of the 200-connection ceiling
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }
}

# Alert on replication lag (for read replicas)
resource "google_monitoring_alert_policy" "replication_lag" {
  display_name = "Cloud SQL Replication Lag"
  combiner     = "OR"

  conditions {
    display_name = "Replication lag > 60s"

    condition_threshold {
      filter = <<-EOT
        resource.type="cloudsql_database" AND
        metric.type="cloudsql.googleapis.com/database/replication/replica_lag"
      EOT
      comparison      = "COMPARISON_GT"
      threshold_value = 60
      duration        = "300s"
    }
  }
}
Connection Architecture
┌─────────────────────┐
│  Application Pods   │
│   (100+ instances)  │
└──────────┬──────────┘
           │  up to 1000 client connections
           ▼
┌─────────────────────┐
│      PgBouncer      │
│  (connection pool)  │
└──────────┬──────────┘
           │  20-50 server connections
         ┌─┴────────────────────────────────┐
         ▼                                  ▼
┌─────────────────┐                ┌─────────────────┐
│  Primary (HA)   │──replication──▶│  Read Replica   │
│    (writes)     │                │     (reads)     │
└─────────────────┘                └─────────────────┘
Failover Behavior
| Configuration | Failover Time | Data Loss | Cost |
|---|---|---|---|
| Single zone | Manual (hours) | Possible | $ |
| Regional HA | ~60 seconds | None | $$$ |
| With read replicas | Reads continue | None | $$$$ |
Practice Question
Why should you use transaction pooling mode in PgBouncer instead of session pooling for web applications?