RDS connections are exhausted and failover takes too long. Fix the database setup.
The Scenario
Your RDS PostgreSQL database is having issues:
Current Problems:
├── "FATAL: too many connections" errors
├── Application timeout during failover: 2-3 minutes
├── Max connections: 100 (db.t3.medium)
├── Active connections: 95-100 constantly
├── Lambda functions: 50 concurrent
├── ECS tasks: 20 replicas
└── Connection pooling: None
The Challenge
Fix the connection exhaustion problem, reduce failover time, and implement proper connection management for a highly available database setup.
Wrong Approach
A junior engineer might simply upgrade to a larger instance, raise the max_connections parameter, or open a new connection on every request. These approaches increase costs without fixing the root cause: a higher connection limit can trigger out-of-memory issues, and per-request connections create connection storms during failover.
Right Approach
A senior engineer implements RDS Proxy for connection pooling, configures Multi-AZ with proper DNS caching settings, uses connection retry logic, and right-sizes the instance based on actual workload needs.
Step 1: Understand the Connection Problem
Connection Math:
├── Lambda: 50 concurrent × 1 connection = 50 connections
├── ECS: 20 tasks × 10 pool size = 200 connections
├── Total needed: 250 connections
├── Available: 100 connections
└── Result: Connection exhaustion
With RDS Proxy:
├── Lambda: 50 concurrent → Proxy → 10 DB connections
├── ECS: 20 tasks → Proxy → 40 DB connections
├── Total DB connections: 50 (shared at the transaction level)
└── Result: Efficient connection reuse
Step 2: Deploy RDS Proxy
# Store database credentials in Secrets Manager
resource "aws_secretsmanager_secret" "db_credentials" {
name = "rds/postgres/credentials"
}
resource "aws_secretsmanager_secret_version" "db_credentials" {
secret_id = aws_secretsmanager_secret.db_credentials.id
secret_string = jsonencode({
username = "admin"
password = random_password.db.result
})
}
# RDS Proxy
resource "aws_db_proxy" "main" {
name = "orders-db-proxy"
debug_logging = false
engine_family = "POSTGRESQL"
idle_client_timeout = 1800
require_tls = true
vpc_security_group_ids = [aws_security_group.rds_proxy.id]
vpc_subnet_ids = aws_subnet.database[*].id
role_arn = aws_iam_role.rds_proxy.arn # required: role the proxy assumes to read the secret (sketched below)
auth {
auth_scheme = "SECRETS"
client_password_auth_type = "POSTGRES_SCRAM_SHA_256"
iam_auth = "REQUIRED"
secret_arn = aws_secretsmanager_secret.db_credentials.arn
}
tags = {
Name = "orders-db-proxy"
}
}
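# The role_arn above is required by aws_db_proxy: RDS Proxy assumes this role to
# fetch credentials from Secrets Manager. A minimal sketch (role and policy names
# are assumptions, not from the original setup):
resource "aws_iam_role" "rds_proxy" {
  name = "rds-proxy-secrets-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "rds.amazonaws.com" }
    }]
  })
}
resource "aws_iam_role_policy" "rds_proxy_secrets" {
  name = "rds-proxy-secrets-access"
  role = aws_iam_role.rds_proxy.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["secretsmanager:GetSecretValue"]
      Resource = [aws_secretsmanager_secret.db_credentials.arn]
    }]
  })
}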
# Proxy target group configuration
resource "aws_db_proxy_default_target_group" "main" {
db_proxy_name = aws_db_proxy.main.name
connection_pool_config {
max_connections_percent = 100
max_idle_connections_percent = 50
connection_borrow_timeout = 120
init_query = "SET timezone='UTC'"
session_pinning_filters = ["EXCLUDE_VARIABLE_SETS"]
}
}
# Associate proxy with RDS instance
resource "aws_db_proxy_target" "main" {
db_proxy_name = aws_db_proxy.main.name
target_group_name = aws_db_proxy_default_target_group.main.name
db_instance_identifier = aws_db_instance.main.id
}
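# Applications connect to the proxy endpoint, never to the instance endpoint.
# Exporting it as an output makes that explicit; the output name and the
# DB_PROXY_ENDPOINT variable used in later steps are assumptions, not required names.
output "db_proxy_endpoint" {
  description = "Writer endpoint of the RDS Proxy (exposed to apps as DB_PROXY_ENDPOINT)"
  value       = aws_db_proxy.main.endpoint
}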
# Security group for RDS Proxy
resource "aws_security_group" "rds_proxy" {
name = "rds-proxy-sg"
description = "Security group for RDS Proxy"
vpc_id = aws_vpc.main.id
ingress {
description = "PostgreSQL from app tier"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.app.id]
}
egress {
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.rds.id]
}
}
Step 3: Configure Multi-AZ RDS
resource "aws_db_instance" "main" {
identifier = "orders-db"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.r6g.large"
allocated_storage = 100
max_allocated_storage = 500
storage_type = "gp3"
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
db_name = "orders"
username = "admin"
password = random_password.db.result
# High Availability
multi_az = true
# Networking
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = [aws_security_group.rds.id]
publicly_accessible = false
# Performance
parameter_group_name = aws_db_parameter_group.main.name
# Backup
backup_retention_period = 7
backup_window = "03:00-04:00"
maintenance_window = "Mon:04:00-Mon:05:00"
# Monitoring
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]
# Deletion protection
deletion_protection = true
skip_final_snapshot = false
final_snapshot_identifier = "orders-db-final"
tags = {
Name = "orders-db"
}
}
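# The db_subnet_group_name above references a subnet group that is not shown
# elsewhere; it must span at least two AZs for Multi-AZ to work. A minimal sketch,
# assuming the same database subnets used by the proxy:
resource "aws_db_subnet_group" "main" {
  name       = "orders-db-subnets"
  subnet_ids = aws_subnet.database[*].id
  tags = {
    Name = "orders-db-subnets"
  }
}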
# Parameter group for connection optimization
resource "aws_db_parameter_group" "main" {
family = "postgres15"
name = "orders-db-params"
# Connection settings
parameter {
name = "max_connections"
value = "200"
}
# Reduce failover detection time
parameter {
name = "tcp_keepalives_idle"
value = "60"
}
parameter {
name = "tcp_keepalives_interval"
value = "10"
}
parameter {
name = "tcp_keepalives_count"
value = "3"
}
# Performance
parameter {
name = "shared_buffers"
value = "{DBInstanceClassMemory/4}"
}
parameter {
name = "work_mem"
value = "65536"
}
}
Step 4: Connection Retry Logic
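Retry logic works best when each client also keeps only a small local pool pointed at the proxy rather than opening ad-hoc connections. A minimal sketch of a per-task pool an ECS service might use, assuming the proxy endpoint is exposed as DB_PROXY_ENDPOINT and credentials as environment variables (names and pool sizes are illustrative):
import os
from psycopg2 import pool

# A small pool per task keeps client connections bounded; the proxy multiplexes
# them onto a shared set of database connections. In production the password
# would come from Secrets Manager or an IAM token; DB_PASSWORD is a placeholder.
task_pool = pool.ThreadedConnectionPool(
    minconn=1,
    maxconn=2,
    host=os.environ['DB_PROXY_ENDPOINT'],
    port=5432,
    dbname=os.environ.get('DB_NAME', 'orders'),
    user=os.environ.get('DB_USER', 'app'),
    password=os.environ['DB_PASSWORD'],
    sslmode='require',    # the proxy enforces TLS (require_tls = true)
    connect_timeout=5,
)

conn = task_pool.getconn()
try:
    with conn.cursor() as cursor:
        cursor.execute("SELECT 1")  # sample query / health check
finally:
    task_pool.putconn(conn)
Even with pooling, transient errors still surface during failover, so the helpers below wrap every operation in retry logic.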
import psycopg2
from psycopg2 import OperationalError
import time
import logging
from functools import wraps
logger = logging.getLogger(__name__)
def with_retry(max_attempts=3, backoff_factor=2):
"""Decorator for database operations with retry logic."""
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
last_exception = None
for attempt in range(max_attempts):
try:
return func(*args, **kwargs)
except OperationalError as e:
last_exception = e
error_msg = str(e).lower()
# Retryable errors
retryable = any([
'connection refused' in error_msg,
'connection reset' in error_msg,
'connection timed out' in error_msg,
'server closed the connection' in error_msg,
'ssl connection has been closed' in error_msg,
])
if not retryable:
raise
wait_time = backoff_factor ** attempt
logger.warning(
f"Database connection error (attempt {attempt + 1}/{max_attempts}). "
f"Retrying in {wait_time}s: {e}"
)
time.sleep(wait_time)
raise last_exception
return wrapper
return decorator
class DatabaseConnection:
"""Connection wrapper with automatic reconnection."""
def __init__(self, dsn: str):
self.dsn = dsn
self._connection = None
@property
def connection(self):
if self._connection is None or self._connection.closed:
self._connection = self._create_connection()
return self._connection
@with_retry(max_attempts=5, backoff_factor=2)
def _create_connection(self):
return psycopg2.connect(
self.dsn,
connect_timeout=5,
options='-c statement_timeout=30000'
)
@with_retry(max_attempts=3)
def execute(self, query: str, params=None):
with self.connection.cursor() as cursor:
cursor.execute(query, params)
if cursor.description:
return cursor.fetchall()
self.connection.commit()
return cursor.rowcount
def close(self):
if self._connection:
self._connection.close()
Step 5: Lambda with IAM Authentication
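Because the proxy was created with iam_auth = "REQUIRED", the Lambda execution role also needs permission to call rds-db:connect against the proxy before the handler below can authenticate with a token. A minimal sketch; the Lambda role name, region, account ID, proxy ID, and database user are placeholders:
resource "aws_iam_role_policy" "lambda_db_connect" {
  name = "lambda-rds-proxy-connect"
  role = aws_iam_role.lambda.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "rds-db:connect"
      # For a proxy, the resource ARN uses the proxy ID (prx-...) and the DB user name
      Resource = "arn:aws:rds-db:eu-west-1:123456789012:dbuser:prx-EXAMPLE1234/app_user"
    }]
  })
}
With that permission in place, the handler generates a short-lived token instead of using a static password.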
import boto3
import psycopg2
import os
def get_db_token():
"""Generate IAM authentication token for RDS Proxy."""
client = boto3.client('rds')
token = client.generate_db_auth_token(
DBHostname=os.environ['DB_PROXY_ENDPOINT'],
Port=5432,
DBUsername=os.environ['DB_USER'],
Region=os.environ['AWS_REGION']
)
return token
# Connection pool that works with RDS Proxy
connection = None
def get_connection():
"""Get or create database connection."""
global connection
if connection is None or connection.closed:
connection = psycopg2.connect(
host=os.environ['DB_PROXY_ENDPOINT'],
port=5432,
database=os.environ['DB_NAME'],
user=os.environ['DB_USER'],
password=get_db_token(),
sslmode='require',
connect_timeout=5
)
return connection
def handler(event, context):
"""Lambda handler with connection reuse."""
conn = get_connection()
try:
with conn.cursor() as cursor:
cursor.execute(
"SELECT * FROM orders WHERE id = %s",
(event['order_id'],)
)
result = cursor.fetchone()
return {
'statusCode': 200,
'body': {'order': result}
}
except Exception as e:
# Reset connection on error
global connection
if connection:
connection.close()
connection = None
raise
Step 6: Read Replicas for Read Scaling
# Read replica
resource "aws_db_instance" "read_replica" {
identifier = "orders-db-replica"
replicate_source_db = aws_db_instance.main.identifier
instance_class = "db.r6g.large"
# Replica-specific settings
multi_az = false
publicly_accessible = false
# Performance
parameter_group_name = aws_db_parameter_group.replica.name
# Monitoring
performance_insights_enabled = true
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
tags = {
Name = "orders-db-replica"
}
}
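# The parameter_group_name above points at a replica-specific group that is not
# shown, so read-only tuning can diverge from the primary. A minimal sketch; the
# hot_standby_feedback setting is an illustrative assumption, not from the original:
resource "aws_db_parameter_group" "replica" {
  family = "postgres15"
  name   = "orders-db-replica-params"
  parameter {
    name  = "hot_standby_feedback"
    value = "1"
  }
}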
# RDS Proxy endpoint for read replicas
resource "aws_db_proxy_endpoint" "read_only" {
db_proxy_name = aws_db_proxy.main.name
db_proxy_endpoint_name = "read-only"
vpc_subnet_ids = aws_subnet.database[*].id
target_role = "READ_ONLY"
vpc_security_group_ids = [aws_security_group.rds_proxy.id]
}
Step 7: Monitoring and Alerting
# Connection usage alarm
resource "aws_cloudwatch_metric_alarm" "db_connections" {
alarm_name = "rds-high-connections"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "DatabaseConnections"
namespace = "AWS/RDS"
period = 300
statistic = "Average"
threshold = 150 # 75% of max_connections=200
alarm_description = "Database connections are high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
DBInstanceIdentifier = aws_db_instance.main.identifier
}
}
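# The alarms in this step all publish to aws_sns_topic.alerts, which is not shown
# elsewhere. A minimal sketch (the email endpoint is a placeholder):
resource "aws_sns_topic" "alerts" {
  name = "rds-alerts"
}
resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "oncall@example.com"
}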
# Proxy connection alarm
resource "aws_cloudwatch_metric_alarm" "proxy_connections" {
alarm_name = "rds-proxy-connections"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ClientConnections"
namespace = "AWS/RDS/Proxy"
period = 60
statistic = "Sum"
threshold = 900 # RDS Proxy limit is 1000
alarm_description = "RDS Proxy connections are high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ProxyName = aws_db_proxy.main.name
}
}
# Replication lag alarm
resource "aws_cloudwatch_metric_alarm" "replication_lag" {
alarm_name = "rds-replication-lag"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "ReplicaLag"
namespace = "AWS/RDS"
period = 60
statistic = "Maximum"
threshold = 30 # 30 seconds
alarm_description = "Read replica lag is high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
DBInstanceIdentifier = aws_db_instance.read_replica.identifier
}
}
RDS High Availability Summary
| Feature | Benefit | Failover Time |
|---|---|---|
| Multi-AZ | Automatic failover | 60-120 seconds |
| RDS Proxy | Connection pooling | Near-instant |
| Read Replicas | Read scaling | Manual promotion |
| Aurora | Faster failover | Under 30 seconds |
Practice Question
Why does RDS Proxy significantly reduce failover time compared to direct RDS connections?