Your AWS bill increased 40% last month. Identify waste and implement cost controls.
The Scenario
Your AWS costs are out of control:
Cost Breakdown (Last Month):
├── EC2: $45,000 (40% increase)
├── RDS: $12,000 (stable)
├── S3: $8,000 (20% increase)
├── NAT Gateway: $6,000 (100% increase!)
├── Data Transfer: $5,000
├── Lambda: $3,000
└── Total: $79,000 (+40% from previous month)
Issues Found:
├── Dev instances running 24/7
├── Unattached EBS volumes: 50TB
├── Old snapshots: 100TB
├── No Reserved Instances
└── No Savings Plans
The Challenge
Identify cost optimization opportunities, implement proper tagging for cost allocation, set up budgets and alerts, and reduce the bill by at least 30%.
Wrong Approach
A junior engineer might turn off instances at random, delete resources without checking dependencies, or ignore data transfer costs entirely. These approaches risk outages and data loss while missing the biggest savings opportunities.
Right Approach
A senior engineer analyzes Cost Explorer data, implements tagging for attribution, uses Compute Optimizer for right-sizing, leverages Savings Plans, and addresses each cost category systematically with proper controls.
Step 1: Cost Analysis with Cost Explorer
# Get cost breakdown by service (note: the End date is exclusive)
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-02-01 \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE
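Once cost allocation tags are active (see Step 2), the same API can attribute spend to owners. A minimal sketch, assuming a Team tag has already been activated as a cost allocation tag in the Billing console:

# Cost breakdown by team (requires an activated cost allocation tag)
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-02-01 \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--group-by Type=TAG,Key=Team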
# Find unused resources
# Unattached EBS volumes
aws ec2 describe-volumes \
--filters "Name=status,Values=available" \
--query 'Volumes[*].[VolumeId,Size,CreateTime]'
# Old snapshots (over 90 days)
aws ec2 describe-snapshots --owner-ids self \
--query 'Snapshots[?StartTime<=`2023-10-01`].[SnapshotId,VolumeSize,StartTime]'
# Idle EC2 instances (low CPU)
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxx \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-31T00:00:00Z \
--period 86400 \
--statistics Average
# Unattached Elastic IPs
aws ec2 describe-addresses \
--query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]'
Step 2: Cost Allocation Tags
# Enforce tagging with AWS Config
resource "aws_config_config_rule" "required_tags" {
name = "required-tags"
source {
owner = "AWS"
source_identifier = "REQUIRED_TAGS"
}
input_parameters = jsonencode({
tag1Key = "Environment"
tag1Value = "production,staging,development"
tag2Key = "Team"
tag3Key = "CostCenter"
})
scope {
compliance_resource_types = [
"AWS::EC2::Instance",
"AWS::RDS::DBInstance",
"AWS::S3::Bucket",
"AWS::Lambda::Function"
]
}
}
# Tag policy in AWS Organizations
resource "aws_organizations_policy" "tag_policy" {
name = "required-tags"
type = "TAG_POLICY"
content = jsonencode({
tags = {
Environment = {
tag_key = {
@@assign = "Environment"
}
tag_value = {
@@assign = ["production", "staging", "development"]
}
enforced_for = {
@@assign = ["ec2:instance", "rds:db", "s3:bucket"]
}
}
Team = {
tag_key = {
@@assign = "Team"
}
}
CostCenter = {
tag_key = {
@@assign = "CostCenter"
}
}
}
})
}
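Config flags non-compliant resources going forward, but resources created before the rule may still be untagged. A quick way to list them is the Resource Groups Tagging API; a minimal sketch (assumes jq is installed):

# EC2 instances missing the CostCenter tag
aws resourcegroupstaggingapi get-resources \
--resource-type-filters ec2:instance \
| jq -r '.ResourceTagMappingList[] | select(all(.Tags[]?; .Key != "CostCenter")) | .ResourceARN'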
Step 3: Savings Plans and Reserved Instances
# Check recommendations
# aws ce get-savings-plans-purchase-recommendation
# Compute Savings Plans (most flexible)
# Covers EC2, Lambda, and Fargate
# 1-year no upfront: ~20% savings
# 1-year all upfront: ~30% savings
# 3-year all upfront: ~50% savings
# Example: Purchase via CLI (commitment is in USD per hour;
# the plan type is determined by the offering you select)
# aws savingsplans create-savings-plan \
# --savings-plan-offering-id xxx \
# --commitment 10.0
# Reserved Instances for RDS
resource "aws_db_instance" "main" {
# ... configuration
# Use Reserved Instance pricing
# Purchase separately via console or CLI
# aws rds purchase-reserved-db-instances-offering
}
# Cost comparison table
#
# Resource Type | On-Demand | 1yr No Upfront | 1yr All Upfront | 3yr All Upfront
# EC2 m5.xlarge | $140/mo | $90/mo (-36%) | $84/mo (-40%) | $56/mo (-60%)
# RDS db.r5.large | $175/mo | $110/mo (-37%) | $100/mo (-43%) | $67/mo (-62%)
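Applied to this scenario, with the illustrative assumption that roughly two-thirds of the $45,000/month EC2 spend is steady-state baseline:

# Baseline EC2 spend:                    ~$30,000/month
# 1yr no-upfront Compute SP (~20% off):  saves ~$6,000/month
# 3yr all-upfront (~50% off):            saves ~$15,000/month
# Rule of thumb: commit only to the baseline you are sure will still be
# running in a year; leave bursty usage on-demand or on Spot.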
Step 4: Right-Sizing with Compute Optimizer
# Enable Compute Optimizer
aws compute-optimizer update-enrollment-status --status Active
# Get EC2 recommendations
aws compute-optimizer get-ec2-instance-recommendations \
--query 'instanceRecommendations[*].[instanceArn,currentInstanceType,recommendationOptions[0].instanceType,recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value]'
# Get Lambda recommendations
aws compute-optimizer get-lambda-function-recommendations \
--query 'lambdaFunctionRecommendations[*].[functionArn,currentMemorySize,memorySizeRecommendationOptions[0].memorySize]'

# Auto-shutdown for dev/test environments
resource "aws_autoscaling_schedule" "scale_down" {
scheduled_action_name = "scale-down-night"
min_size = 0
max_size = 0
desired_capacity = 0
recurrence = "0 20 * * MON-FRI" # 8 PM weekdays
autoscaling_group_name = aws_autoscaling_group.dev.name
}
resource "aws_autoscaling_schedule" "scale_up" {
scheduled_action_name = "scale-up-morning"
min_size = 2
max_size = 4
desired_capacity = 2
recurrence = "0 8 * * MON-FRI" # 8 AM weekdays
autoscaling_group_name = aws_autoscaling_group.dev.name
}
# Instance Scheduler for EC2/RDS
# Use AWS Instance Scheduler solution
# https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/
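For standalone dev instances that are not in an Auto Scaling group, a tag-driven stop can run nightly from cron or an EventBridge-scheduled job. A minimal sketch, assuming instances carry the Environment=development tag from Step 2:

# Stop every running dev-tagged instance (pair with a morning start job)
aws ec2 describe-instances \
--filters "Name=tag:Environment,Values=development" \
"Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[].InstanceId' \
--output text \
| xargs -r aws ec2 stop-instances --instance-ids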
resource "aws_s3_bucket_lifecycle_configuration" "main" {
bucket = aws_s3_bucket.data.id
# Move to cheaper storage classes
rule {
id = "transition-to-ia"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER_IR" # Instant Retrieval
}
transition {
days = 180
storage_class = "GLACIER"
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE"
}
}
# Delete old versions
rule {
id = "delete-old-versions"
status = "Enabled"
noncurrent_version_transition {
noncurrent_days = 30
storage_class = "STANDARD_IA"
}
noncurrent_version_expiration {
noncurrent_days = 90
}
}
# Abort incomplete multipart uploads
rule {
id = "abort-multipart"
status = "Enabled"
abort_incomplete_multipart_upload {
days_after_initiation = 7
}
}
# Delete expired objects
rule {
id = "expire-logs"
status = "Enabled"
filter {
prefix = "logs/"
}
expiration {
days = 30
}
}
}
# Enable S3 Intelligent-Tiering for unknown access patterns
resource "aws_s3_bucket_intelligent_tiering_configuration" "main" {
bucket = aws_s3_bucket.data.id
name = "entire-bucket"
tiering {
access_tier = "ARCHIVE_ACCESS"
days = 90
}
tiering {
access_tier = "DEEP_ARCHIVE_ACCESS"
days = 180
}
}
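To verify the lifecycle rules are actually moving data, S3 publishes free daily storage metrics to CloudWatch, broken down by storage class. A sketch, using a hypothetical bucket name:

# Bytes held in STANDARD (metric is emitted once per day)
aws cloudwatch get-metric-statistics \
--namespace AWS/S3 \
--metric-name BucketSizeBytes \
--dimensions Name=BucketName,Value=my-data-bucket Name=StorageType,Value=StandardStorage \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-31T00:00:00Z \
--period 86400 \
--statistics Average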
Step 6: NAT Gateway Optimization
# Use VPC endpoints instead of NAT Gateway
# Gateway endpoints (FREE): S3, DynamoDB
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = aws_route_table.private[*].id
}
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.dynamodb"
vpc_endpoint_type = "Gateway"
route_table_ids = aws_route_table.private[*].id
}
# Interface endpoints (~$7/month each, but saves NAT data transfer)
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
# NAT Gateway cost breakdown:
# - Hourly charge: $0.045/hour = ~$32/month per NAT GW
# - Data processing: $0.045/GB
# - Cross-AZ: Additional $0.01/GB
#
# If processing 1TB/month through NAT: $32 + $45 = $77/month
# With VPC endpoints for S3/ECR: Saves most of that data transfer
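Before and after adding endpoints, the NAT gateway's own CloudWatch metrics show how much data it is processing, which maps directly to the $0.045/GB charge. A sketch, with nat-xxx as a placeholder gateway ID:

# Bytes the NAT gateway sent out during January
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name BytesOutToDestination \
--dimensions Name=NatGatewayId,Value=nat-xxx \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-02-01T00:00:00Z \
--period 86400 \
--statistics Sum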
resource "aws_budgets_budget" "monthly" {
name = "monthly-budget"
budget_type = "COST"
limit_amount = "60000"
limit_unit = "USD"
time_unit = "MONTHLY"
time_period_start = "2024-01-01_00:00"
cost_filter {
name = "TagKeyValue"
values = ["user:Environment$production"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["team@example.com"]
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = ["team@example.com"]
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
# Per-service budgets
resource "aws_budgets_budget" "ec2" {
name = "ec2-budget"
budget_type = "COST"
limit_amount = "35000"
limit_unit = "USD"
time_unit = "MONTHLY"
time_period_start = "2024-01-01_00:00"
cost_filter {
name = "Service"
values = ["Amazon Elastic Compute Cloud - Compute"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
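Budgets compare spend to a fixed limit; Cost Anomaly Detection instead flags unusual spikes (like the NAT Gateway doubling) without one. A minimal sketch of creating a per-service monitor; pair it with create-anomaly-subscription to deliver alerts by email or SNS:

# Watch each AWS service's spend for anomalies
aws ce create-anomaly-monitor \
--anomaly-monitor '{"MonitorName": "service-monitor", "MonitorType": "DIMENSIONAL", "MonitorDimension": "SERVICE"}'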
Step 8: Automated Cleanup
import boto3
from datetime import datetime, timedelta, timezone
ec2 = boto3.client('ec2')
def cleanup_unused_resources():
    """Find and optionally delete unused resources."""
    savings = 0

    # Unattached EBS volumes older than 30 days
    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']
    for vol in volumes:
        age = (datetime.now(timezone.utc) - vol['CreateTime']).days
        if age > 30:
            print(f"Unused volume: {vol['VolumeId']}, Size: {vol['Size']}GB, Age: {age} days")
            savings += vol['Size'] * 0.10  # ~$0.10/GB/month
            # Uncomment to delete
            # ec2.delete_volume(VolumeId=vol['VolumeId'])

    # Snapshots older than 90 days
    snapshots = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    for snap in snapshots:
        if snap['StartTime'] < cutoff:
            print(f"Old snapshot: {snap['SnapshotId']}, Size: {snap['VolumeSize']}GB")
            savings += snap['VolumeSize'] * 0.05  # ~$0.05/GB/month

    # Unassociated Elastic IPs
    addresses = ec2.describe_addresses()['Addresses']
    for addr in addresses:
        if 'AssociationId' not in addr:
            print(f"Unused EIP: {addr['PublicIp']}")
            savings += 3.60  # ~$3.60/month

    print(f"\nEstimated monthly savings: ${savings:.2f}")

if __name__ == '__main__':
    cleanup_unused_resources()
Cost Optimization Summary
| Strategy | Potential Savings | Implementation Effort |
|---|---|---|
| Savings Plans (1yr) | 20-30% on compute | Low (purchase decision) |
| Right-sizing | 20-40% | Medium (analysis needed) |
| Dev shutdown | 65% on dev/test | Medium (scheduling) |
| S3 lifecycle | 50-80% on storage | Low (configuration) |
| VPC endpoints | 50%+ on NAT | Low (infrastructure) |
| Spot instances | 60-90% | High (architecture change) |
Practice Question
Why do NAT Gateway costs often spike unexpectedly?