RDS connections are exhausted and failover takes too long. Fix the database setup.
The Scenario
Your RDS PostgreSQL database is having issues:
Current Problems:
├── "FATAL: too many connections" errors
├── Application timeout during failover: 2-3 minutes
├── Max connections: 100 (db.t3.medium)
├── Active connections: 95-100 constantly
├── Lambda functions: 50 concurrent
├── ECS tasks: 20 replicas
└── Connection pooling: None
The Challenge
Fix the connection exhaustion problem, reduce failover time, and implement proper connection management for a highly available database setup.
Wrong Approach
A junior engineer might simply upgrade to a larger instance, raise the max_connections parameter, or open a new connection on every request. These approaches increase costs without fixing the root cause: a higher connection limit can trigger out-of-memory issues, and per-request connections create connection storms during failover.
Right Approach
A senior engineer implements RDS Proxy for connection pooling, configures Multi-AZ with proper DNS caching settings, uses connection retry logic, and right-sizes the instance based on actual workload needs.
Step 1: Understand the Connection Problem
Connection Math:
├── Lambda: 50 concurrent × 1 connection = 50 connections
├── ECS: 20 tasks × 10 pool size = 200 connections
├── Total needed: 250 connections
├── Available: 100 connections
└── Result: Connection exhaustion
With RDS Proxy:
├── Lambda: 50 concurrent → Proxy → 10 DB connections
├── ECS: 20 tasks → Proxy → 40 DB connections
├── Total DB connections: 50 (shared at the transaction level)
└── Result: Efficient connection reuse
Step 2: Deploy RDS Proxy
# Store database credentials in Secrets Manager
resource "aws_secretsmanager_secret" "db_credentials" {
name = "rds/postgres/credentials"
}
resource "aws_secretsmanager_secret_version" "db_credentials" {
secret_id = aws_secretsmanager_secret.db_credentials.id
secret_string = jsonencode({
username = "admin"
password = random_password.db.result
})
}
# RDS Proxy
resource "aws_db_proxy" "main" {
name = "orders-db-proxy"
debug_logging = false
engine_family = "POSTGRESQL"
idle_client_timeout = 1800
require_tls = true
vpc_security_group_ids = [aws_security_group.rds_proxy.id]
vpc_subnet_ids = aws_subnet.database[*].id
role_arn = aws_iam_role.rds_proxy.arn # required: role the proxy assumes to read the secret (sketched below)
auth {
auth_scheme = "SECRETS"
client_password_auth_type = "POSTGRES_SCRAM_SHA_256"
iam_auth = "REQUIRED"
secret_arn = aws_secretsmanager_secret.db_credentials.arn
}
tags = {
Name = "orders-db-proxy"
}
}
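# The role_arn above is required by aws_db_proxy: RDS Proxy assumes this role to
# fetch credentials from Secrets Manager. A minimal sketch (role and policy names
# are assumptions, not from the original setup):
resource "aws_iam_role" "rds_proxy" {
  name = "rds-proxy-secrets-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "rds.amazonaws.com" }
    }]
  })
}
resource "aws_iam_role_policy" "rds_proxy_secrets" {
  name = "rds-proxy-secrets-access"
  role = aws_iam_role.rds_proxy.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = [aws_secretsmanager_secret.db_credentials.arn]
    }]
  })
}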
# Proxy target group configuration
resource "aws_db_proxy_default_target_group" "main" {
db_proxy_name = aws_db_proxy.main.name
connection_pool_config {
max_connections_percent = 100
max_idle_connections_percent = 50
connection_borrow_timeout = 120
init_query = "SET timezone='UTC'"
session_pinning_filters = ["EXCLUDE_VARIABLE_SETS"]
}
}
# Associate proxy with RDS instance
resource "aws_db_proxy_target" "main" {
db_proxy_name = aws_db_proxy.main.name
target_group_name = aws_db_proxy_default_target_group.main.name
db_instance_identifier = aws_db_instance.main.id
}
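# Applications connect to the proxy endpoint, never to the instance endpoint.
# Exporting it as an output makes that explicit; the output name and the
# DB_PROXY_ENDPOINT variable used in later steps are assumptions, not required names.
output "db_proxy_endpoint" {
  description = "Writer endpoint of the RDS Proxy (exposed to apps as DB_PROXY_ENDPOINT)"
  value       = aws_db_proxy.main.endpoint
}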
# Security group for RDS Proxy
resource "aws_security_group" "rds_proxy" {
name = "rds-proxy-sg"
description = "Security group for RDS Proxy"
vpc_id = aws_vpc.main.id
ingress {
description = "PostgreSQL from app tier"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app.id]
}
egress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.rds.id]
}
}
Step 3: Configure Multi-AZ RDS
resource "aws_db_instance" "main" {
identifier = "orders-db"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.r6g.large"
allocated_storage = 100
max_allocated_storage = 500
storage_type = "gp3"
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
db_name = "orders"
username = "admin"
password = random_password.db.result
# High Availability
multi_az = true
# Networking
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [aws_security_group.rds.id]
publicly_accessible = false
# Performance
parameter_group_name = aws_db_parameter_group.main.name
# Backup
backup_retention_period = 7
backup_window = "03:00-04:00"
maintenance_window = "Mon:04:00-Mon:05:00"
# Monitoring
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
# Deletion protection
deletion_protection = true
skip_final_snapshot = false
final_snapshot_identifier = "orders-db-final"
tags = {
Name = "orders-db"
}
}
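# The db_subnet_group_name above references a subnet group that is not shown
# elsewhere; it must span at least two AZs for Multi-AZ to work. A minimal sketch,
# assuming the same database subnets used by the proxy:
resource "aws_db_subnet_group" "main" {
  name       = "orders-db-subnets"
  subnet_ids = aws_subnet.database[*].id
  tags = {
    Name = "orders-db-subnets"
  }
}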
# Parameter group for connection optimization
resource "aws_db_parameter_group" "main" {
family = "postgres15"
name = "orders-db-params"
# Connection settings
parameter {
name = "max_connections"
value = "200"
}
# Reduce failover detection time
parameter {
name = "tcp_keepalives_idle"
value = "60"
}
parameter {
name = "tcp_keepalives_interval"
value = "10"
}
parameter {
name = "tcp_keepalives_count"
value = "3"
}
# Performance
parameter {
name = "shared_buffers"
value = "{DBInstanceClassMemory/4}"
}
parameter {
name = "work_mem"
value = "65536"
}
}
Step 4: Connection Retry Logic
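Retry logic works best when each client also keeps only a small local pool pointed at the proxy rather than opening ad-hoc connections. A minimal sketch of a per-task pool an ECS service might use, assuming the proxy endpoint is exposed as DB_PROXY_ENDPOINT and credentials as environment variables (names and pool sizes are illustrative):
import os
from psycopg2 import pool

# A small pool per task keeps client connections bounded; the proxy multiplexes
# them onto a shared set of database connections. In production the password
# would come from Secrets Manager or an IAM token; DB_PASSWORD is a placeholder.
task_pool = pool.ThreadedConnectionPool(
    minconn=1,
    maxconn=2,
    host=os.environ['DB_PROXY_ENDPOINT'],
    port=5432,
    dbname=os.environ.get('DB_NAME', 'orders'),
    user=os.environ.get('DB_USER', 'app'),
    password=os.environ['DB_PASSWORD'],
    sslmode='require',    # the proxy enforces TLS (require_tls = true)
    connect_timeout=5,
)

conn = task_pool.getconn()
try:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")  # sample query / health check
finally:
    task_pool.putconn(conn)
Even with pooling, transient errors still surface during failover, so the helpers below wrap every operation in retry logic.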
import psycopg2
from psycopg2 import OperationalError
import time
import logging
from functools import wraps
logger = logging.getLogger(__name__)
def with_retry(max_attempts=3, backoff_factor=2):
"""Decorator for database operations with retry logic."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
last_exception = None
for attempt in range(max_attempts):
try:
return func(*args, **kwargs)
except OperationalError as e:
last_exception = e
error_msg = str(e).lower()
# Retryable errors
retryable = any([
'connection refused' in error_msg,
'connection reset' in error_msg,
'connection timed out' in error_msg,
'server closed the connection' in error_msg,
'ssl connection has been closed' in error_msg,
])
if not retryable:
raise
wait_time = backoff_factor ** attempt
logger.warning(
f"Database connection error (attempt {attempt + 1}/{max_attempts}). "
f"Retrying in {wait_time}s: {e}"
)
time.sleep(wait_time)
raise last_exception
return wrapper
return decorator
class DatabaseConnection:
"""Connection wrapper with automatic reconnection."""
def __init__(self, dsn: str):
self.dsn = dsn
self._connection = None
@property
def connection(self):
if self._connection is None or self._connection.closed:
self._connection = self._create_connection()
return self._connection
@with_retry(max_attempts=5, backoff_factor=2)
def _create_connection(self):
return psycopg2.connect(
self.dsn,
connect_timeout=5,
options='-c statement_timeout=30000'
)
@with_retry(max_attempts=3)
def execute(self, query: str, params=None):
with self.connection.cursor() as cursor:
cursor.execute(query, params)
if cursor.description:
return cursor.fetchall()
self.connection.commit()
return cursor.rowcount
def close(self):
if self._connection:
self._connection.close()
Step 5: Lambda with IAM Authentication
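Because the proxy was created with iam_auth = "REQUIRED", the Lambda execution role also needs permission to call rds-db:connect against the proxy before the handler below can authenticate with a token. A minimal sketch; the Lambda role name, region, account ID, proxy ID, and database user are placeholders:
resource "aws_iam_role_policy" "lambda_db_connect" {
  name = "lambda-rds-proxy-connect"
  role = aws_iam_role.lambda.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "rds-db:connect"
      # For a proxy, the resource ARN uses the proxy ID (prx-...) and the DB user name
      Resource = "arn:aws:rds-db:eu-west-1:123456789012:dbuser:prx-EXAMPLE1234/app_user"
    }]
  })
}
With that permission in place, the handler generates a short-lived token instead of using a static password.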
import boto3
import psycopg2
import os
def get_db_token():
"""Generate IAM authentication token for RDS Proxy."""
client = boto3.client('rds')
token = client.generate_db_auth_token(
DBHostname=os.environ['DB_PROXY_ENDPOINT'],
Port=5432,
DBUsername=os.environ['DB_USER'],
Region=os.environ['AWS_REGION']
)
return token
# Connection pool that works with RDS Proxy
connection = None
def get_connection():
"""Get or create database connection."""
global connection
if connection is None or connection.closed:
connection = psycopg2.connect(
host=os.environ['DB_PROXY_ENDPOINT'],
port=5432,
database=os.environ['DB_NAME'],
user=os.environ['DB_USER'],
password=get_db_token(),
sslmode='require',
connect_timeout=5
)
return connection
def handler(event, context):
"""Lambda handler with connection reuse."""
conn = get_connection()
try:
with conn.cursor() as cursor:
cursor.execute(
"SELECT * FROM orders WHERE id = %s",
(event['order_id'],)
)
result = cursor.fetchone()
return {
'statusCode': 200,
'body': {'order': result}
}
except Exception as e:
# Reset connection on error
global connection
if connection:
connection.close()
connection = None
raise
Step 6: Read Replicas for Read Scaling
# Read replica
resource "aws_db_instance" "read_replica" {
identifier = "orders-db-replica"
replicate_source_db = aws_db_instance.main.identifier
instance_class = "db.r6g.large"
# Replica-specific settings
multi_az = false
publicly_accessible = false
# Performance
parameter_group_name = aws_db_parameter_group.replica.name
# Monitoring
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
tags = {
Name = "orders-db-replica"
}
}
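# The parameter_group_name above points at a replica-specific group that is not
# shown, so read-only tuning can diverge from the primary. A minimal sketch; the
# hot_standby_feedback setting is an illustrative assumption, not from the original:
resource "aws_db_parameter_group" "replica" {
  family = "postgres15"
  name   = "orders-db-replica-params"
  parameter {
    name  = "hot_standby_feedback"
    value = "1"
  }
}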
# RDS Proxy endpoint for read replicas
resource "aws_db_proxy_endpoint" "read_only" {
db_proxy_name = aws_db_proxy.main.name
db_proxy_endpoint_name = "read-only"
vpc_subnet_ids = aws_subnet.database[*].id
target_role = "READ_ONLY"
vpc_security_group_ids = [aws_security_group.rds_proxy.id]
}
Step 7: Monitoring and Alerting
# Connection usage alarm
resource "aws_cloudwatch_metric_alarm" "db_connections" {
alarm_name = "rds-high-connections"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "DatabaseConnections"
namespace = "AWS/RDS"
period = 300
statistic = "Average"
threshold = 150 # 75% of max_connections=200
alarm_description = "Database connections are high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
DBInstanceIdentifier = aws_db_instance.main.identifier
}
}
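# The alarms in this step all publish to aws_sns_topic.alerts, which is not shown
# elsewhere. A minimal sketch (the email endpoint is a placeholder):
resource "aws_sns_topic" "alerts" {
  name = "rds-alerts"
}
resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "oncall@example.com"
}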
# Proxy connection alarm
resource "aws_cloudwatch_metric_alarm" "proxy_connections" {
alarm_name = "rds-proxy-connections"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ClientConnections"
namespace = "AWS/RDS/Proxy"
period = 60
statistic = "Sum"
threshold = 900 # RDS Proxy limit is 1000
alarm_description = "RDS Proxy connections are high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ProxyName = aws_db_proxy.main.name
}
}
# Replication lag alarm
resource "aws_cloudwatch_metric_alarm" "replication_lag" {
alarm_name = "rds-replication-lag"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "ReplicaLag"
namespace = "AWS/RDS"
period = 60
statistic = "Maximum"
threshold = 30 # 30 seconds
alarm_description = "Read replica lag is high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
DBInstanceIdentifier = aws_db_instance.read_replica.identifier
}
}
RDS High Availability Summary
| Feature | Benefit | Failover Time |
|---|---|---|
| Multi-AZ | Automatic failover | 60-120 seconds |
| RDS Proxy | Connection pooling | Near-instant |
| Read Replicas | Read scaling | Manual promotion |
| Aurora | Faster failover | Under 30 seconds |
Practice Question
Why does RDS Proxy significantly reduce failover time compared to direct RDS connections?