Questions
Cloud SQL connections are exhausted and failover takes too long. Fix the database setup.
The Scenario
Your Cloud SQL PostgreSQL instance is experiencing issues:
Error: FATAL: remaining connection slots are reserved for
non-replication superuser connections
Monitoring shows:
- Active connections: 100/100 (max)
- Connection attempts: 500/sec (mostly rejected)
- Failover time during maintenance: 3 minutes
- Application errors during failover: 2,847
Multiple GKE pods connect directly to Cloud SQL, and during failovers or high load, the application becomes unavailable.
The Challenge
Design a resilient database architecture with proper connection management, high availability, and minimal failover impact.
A junior engineer might raise max_connections to 1000, move to a larger instance type, skip read replicas, or ignore connection pooling. These approaches waste resources, leave the connection exhaustion unsolved (every pod still holds its own connections), and do nothing to shorten failovers.
A senior engineer runs the Cloud SQL Auth Proxy for secure, IAM-based connectivity, adds PgBouncer for connection pooling, configures regional high availability with automatic failover, offloads reads to read replicas, and implements application retry logic with exponential backoff.
Step 1: Understand the Connection Problem
# Check current connections
gcloud sql connect my-instance --user=postgres

postgres=> SELECT count(*) FROM pg_stat_activity;
 count
-------
   100

postgres=> SELECT usename, application_name, count(*)
           FROM pg_stat_activity
           GROUP BY usename, application_name
           ORDER BY count DESC;

# Typical issue: each pod opens its own pool of 10 connections,
# so 20 pods × 10 connections = 200 demanded against a max of 100
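Before adding infrastructure, it helps to see where the 100 slots actually go. A quick diagnostic sketch, assuming psycopg2 is available and a DB_DSN environment variable (an illustrative name, not part of the setup above) points at the instance:

import os
import psycopg2

# Hypothetical DSN, e.g. "host=10.0.0.5 dbname=mydb user=postgres"
conn = psycopg2.connect(os.environ["DB_DSN"])
with conn.cursor() as cur:
    # Many "idle" rows usually mean app-side pools are holding
    # connections they rarely use: a pooling problem, not a load problem.
    cur.execute(
        """
        SELECT state, count(*)
        FROM pg_stat_activity
        GROUP BY state
        ORDER BY count(*) DESC
        """
    )
    for state, count in cur.fetchall():
        print(f"{state or 'background'}: {count}")
conn.close()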
Step 2: Implement Cloud SQL Auth Proxy with Connection Pooling
# Deploy the Cloud SQL Auth Proxy as a sidecar
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      serviceAccountName: app-service-account
      containers:
        - name: api
          image: gcr.io/project/api:latest
          env:
            - name: DB_HOST
              value: "127.0.0.1"  # Connect via the proxy on localhost
            - name: DB_PORT
              value: "5432"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
        - name: cloud-sql-proxy
          image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.7.0
          args:
            - "--structured-logs"
            - "--private-ip"
            - "--auto-iam-authn"  # Authenticate with Workload Identity
            - "project:region:instance"
          securityContext:
            runAsNonRoot: true
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
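With the sidecar in place, the application dials localhost and the proxy handles TLS and IAM. A minimal sketch of the application side, assuming SQLAlchemy and that the DATABASE_URL secret resolves to a localhost DSN (e.g. postgresql+psycopg2://app@127.0.0.1:5432/mydb; with --auto-iam-authn no password is embedded):

import os
from sqlalchemy import create_engine, text

# The proxy listens on 127.0.0.1:5432 inside the pod, so the app never
# sees the instance's real IP or certificates.
engine = create_engine(os.environ["DATABASE_URL"], pool_pre_ping=True)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # connectivity smoke test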
Step 3: Add PgBouncer for Connection Pooling
# PgBouncer deployment for connection pooling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
        - name: pgbouncer
          image: bitnami/pgbouncer:latest  # Pin a specific tag in production
          ports:
            - containerPort: 6432
          env:
            - name: POSTGRESQL_HOST
              value: "10.0.0.5"  # Cloud SQL private IP
            - name: POSTGRESQL_DATABASE
              value: "mydb"
            - name: PGBOUNCER_POOL_MODE
              value: "transaction"  # Best fit for short web transactions
            - name: PGBOUNCER_MAX_CLIENT_CONN
              value: "1000"  # Accept up to 1000 application connections...
            - name: PGBOUNCER_DEFAULT_POOL_SIZE
              value: "20"  # ...multiplexed onto 20 real DB connections
            - name: PGBOUNCER_MIN_POOL_SIZE
              value: "10"
            - name: PGBOUNCER_RESERVE_POOL_SIZE
              value: "5"
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "256Mi"
              cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
spec:
  selector:
    app: pgbouncer
  ports:
    - port: 5432        # Apps keep using the standard Postgres port...
      targetPort: 6432  # ...which maps to PgBouncer's listener
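One caveat with transaction pooling: server connections are shared across clients between transactions, so session state (SET commands, prepared statements, advisory locks) does not carry over from one transaction to the next. On the client side it is reasonable to let PgBouncer own the pooling entirely; a sketch with SQLAlchemy, assuming DATABASE_URL now points at the pgbouncer Service:

import os
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

# Hypothetical URL: postgresql+psycopg2://app:secret@pgbouncer:5432/mydb
# NullPool opens and closes a connection per checkout; that is cheap
# against PgBouncer, which keeps the real server connections warm.
engine = create_engine(os.environ["DATABASE_URL"], poolclass=NullPool)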
Step 4: Configure Cloud SQL High Availability
resource "google_sql_database_instance" "main" {
  name             = "production-db"
  database_version = "POSTGRES_15"
  region           = "us-central1"

  settings {
    tier              = "db-custom-4-16384"  # 4 vCPU, 16 GB RAM
    availability_type = "REGIONAL"           # HA with a standby in another zone

    backup_configuration {
      enabled                        = true
      point_in_time_recovery_enabled = true
      start_time                     = "03:00"
      transaction_log_retention_days = 7

      backup_retention_settings {
        retained_backups = 30
        retention_unit   = "COUNT"
      }
    }

    ip_configuration {
      ipv4_enabled    = false  # No public IP; reachable only inside the VPC
      private_network = google_compute_network.main.id
      # GKE reaches the instance over the private VPC network;
      # authorized networks apply only to public-IP connections.
    }

    database_flags {
      name  = "max_connections"
      value = "200"  # A modest ceiling is enough once PgBouncer pools connections
    }

    database_flags {
      name  = "log_min_duration_statement"
      value = "1000"  # Log queries slower than 1s
    }

    maintenance_window {
      day          = 7  # Sunday
      hour         = 3  # 03:00
      update_track = "stable"
    }

    insights_config {
      query_insights_enabled  = true
      query_string_length     = 4096
      record_application_tags = true
      record_client_address   = true
    }
  }

  deletion_protection = true
}

# Read replica for read scaling
resource "google_sql_database_instance" "replica" {
  name                 = "production-db-replica"
  master_instance_name = google_sql_database_instance.main.name
  region               = "us-central1"
  database_version     = "POSTGRES_15"

  replica_configuration {
    failover_target = false  # Read replica, not a failover target
  }

  settings {
    tier              = "db-custom-4-16384"
    availability_type = "ZONAL"

    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.main.id
    }
  }
}
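The replica only pays off if read traffic actually reaches it. A minimal routing sketch, assuming two illustrative DSNs (PRIMARY_URL via PgBouncer for writes, REPLICA_URL for reads that tolerate some replication lag) and a hypothetical orders table:

import os
from sqlalchemy import create_engine, text

primary = create_engine(os.environ["PRIMARY_URL"], pool_pre_ping=True)
replica = create_engine(os.environ["REPLICA_URL"], pool_pre_ping=True)

def dashboard_order_count() -> int:
    # Read-only and lag-tolerant: safe to serve from the replica.
    with replica.connect() as conn:
        return conn.execute(text("SELECT count(*) FROM orders")).scalar()

def create_order(order_id: int) -> None:
    # Writes must always go to the HA primary.
    with primary.begin() as conn:
        conn.execute(text("INSERT INTO orders (id) VALUES (:id)"),
                     {"id": order_id})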
Step 5: Implement Application Retry Logic
import os
import random
import time

from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError

DATABASE_URL = os.environ["DATABASE_URL"]

def create_db_engine():
    return create_engine(
        DATABASE_URL,
        pool_size=5,         # Connections per worker
        max_overflow=10,     # Extra connections under burst load
        pool_pre_ping=True,  # Verify connections before use
        pool_recycle=1800,   # Recycle connections after 30 min
        connect_args={
            "connect_timeout": 10,
            "application_name": "api-server",
        },
    )

def execute_with_retry(engine, query, max_retries=3):
    """Execute a query with exponential backoff retry.

    `query` should be a SQLAlchemy executable, e.g. text("SELECT ...").
    """
    for attempt in range(max_retries):
        try:
            with engine.connect() as conn:
                # Materialize rows before the connection is returned
                # to the pool; the result is unusable afterwards.
                return conn.execute(query).fetchall()
        except OperationalError as e:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter avoids a thundering herd
            # of reconnects during a failover.
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Database error, retrying in {wait_time:.2f}s: {e}")
            time.sleep(wait_time)

The same retry pattern in Node.js with pg-pool:
const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                         // Max connections in the pool
  idleTimeoutMillis: 30000,        // Close idle connections after 30s
  connectionTimeoutMillis: 10000,
  application_name: 'api-server',
});

async function queryWithRetry(sql, params, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await pool.query(sql, params);
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      // Retry only errors that a failover or restart can explain
      if (error.code === 'ECONNREFUSED' ||
          error.code === '57P01' ||  // admin_shutdown
          error.code === '57P02') {  // crash_shutdown
        const waitTime = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
        console.log(`Database error, retrying in ${waitTime}ms`);
        await new Promise(r => setTimeout(r, waitTime));
      } else {
        throw error;
      }
    }
  }
}
Step 6: Monitor Database Health
# Alert on connection exhaustion
resource "google_monitoring_alert_policy" "db_connections" {
  display_name = "Cloud SQL Connection Exhaustion"
  combiner     = "OR"

  conditions {
    display_name = "Connections > 80%"

    condition_threshold {
      filter = <<-EOT
        resource.type="cloudsql_database" AND
        metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"
      EOT
      comparison      = "COMPARISON_GT"
      threshold_value = 160  # 80% of the 200-connection ceiling
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
    }
  }
}

# Alert on replication lag (for read replicas)
resource "google_monitoring_alert_policy" "replication_lag" {
  display_name = "Cloud SQL Replication Lag"
  combiner     = "OR"

  conditions {
    display_name = "Replication lag > 60s"

    condition_threshold {
      filter = <<-EOT
        resource.type="cloudsql_database" AND
        metric.type="cloudsql.googleapis.com/database/replication/replica_lag"
      EOT
      comparison      = "COMPARISON_GT"
      threshold_value = 60
      duration        = "300s"
    }
  }
}
Connection Architecture
┌─────────────────────┐
│  Application Pods   │
│   (100+ instances)  │
└──────────┬──────────┘
           │  up to 1000 client connections
           ▼
┌─────────────────────┐
│      PgBouncer      │
│  (connection pool)  │
└──────────┬──────────┘
           │  20-50 server connections
         ┌─┴────────────────────────────────┐
         ▼                                  ▼
┌─────────────────┐                ┌─────────────────┐
│  Primary (HA)   │──replication──▶│  Read Replica   │
│    (writes)     │                │     (reads)     │
└─────────────────┘                └─────────────────┘
Failover Behavior
| Configuration | Failover Time | Data Loss | Cost |
|---|---|---|---|
| Single zone | Manual (hours) | Possible | $ |
| Regional HA | ~60 seconds | None | $$$ |
| With read replicas | Reads continue | None | $$$$ |
Practice Question
Why should you use transaction pooling mode in PgBouncer instead of session pooling for web applications?