Your AWS bill increased 40% last month. Identify waste and implement cost controls.
The Scenario
Your AWS costs are out of control:
Cost Breakdown (Last Month):
├── EC2: $45,000 (40% increase)
├── RDS: $12,000 (stable)
├── S3: $8,000 (20% increase)
├── NAT Gateway: $6,000 (100% increase!)
├── Data Transfer: $5,000
├── Lambda: $3,000
└── Total: $79,000 (+40% from previous month)
Issues Found:
├── Dev instances running 24/7
├── Unattached EBS volumes: 50TB
├── Old snapshots: 100TB
├── No Reserved Instances
└── No Savings Plans
The Challenge
Identify cost optimization opportunities, implement proper tagging for cost allocation, set up budgets and alerts, and reduce the bill by at least 30%.
Wrong Approach
A junior engineer might turn off instances at random, delete resources without checking dependencies, or ignore data transfer costs entirely. These approaches risk outages and data loss while missing the biggest savings opportunities.
Right Approach
A senior engineer analyzes Cost Explorer data, implements tagging for attribution, uses Compute Optimizer for right-sizing, leverages Savings Plans, and addresses each cost category systematically with proper controls.
Step 1: Cost Analysis with Cost Explorer
# Get cost breakdown by service (note: the End date is exclusive)
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-02-01 \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--group-by Type=DIMENSION,Key=SERVICE
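Once cost allocation tags are active (see Step 2), the same API can attribute spend to owners. A minimal sketch, assuming a Team tag has already been activated as a cost allocation tag in the Billing console:

# Cost breakdown by team (requires an activated cost allocation tag)
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-02-01 \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--group-by Type=TAG,Key=Team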
# Find unused resources
# Unattached EBS volumes
aws ec2 describe-volumes \
--filters "Name=status,Values=available" \
--query 'Volumes[*].[VolumeId,Size,CreateTime]'
# Old snapshots (over 90 days)
aws ec2 describe-snapshots --owner-ids self \
--query 'Snapshots[?StartTime<=`2023-10-01`].[SnapshotId,VolumeSize,StartTime]'
# Idle EC2 instances (low CPU)
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-xxx \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-31T00:00:00Z \
--period 86400 \
--statistics Average
# Unattached Elastic IPs
aws ec2 describe-addresses \
--query 'Addresses[?AssociationId==`null`].[PublicIp,AllocationId]'
Step 2: Cost Allocation Tags
# Enforce tagging with AWS Config
resource "aws_config_config_rule" "required_tags" {
name = "required-tags"
source {
owner = "AWS"
source_identifier = "REQUIRED_TAGS"
}
input_parameters = jsonencode({
tag1Key = "Environment"
tag1Value = "production,staging,development"
tag2Key = "Team"
tag3Key = "CostCenter"
})
scope {
compliance_resource_types = [
"AWS::EC2::Instance",
"AWS::RDS::DBInstance",
"AWS::S3::Bucket",
"AWS::Lambda::Function"
]
}
}
# Tag policy in AWS Organizations
resource "aws_organizations_policy" "tag_policy" {
name = "required-tags"
type = "TAG_POLICY"
content = jsonencode({
tags = {
Environment = {
tag_key = {
@@assign = "Environment"
}
tag_value = {
@@assign = ["production", "staging", "development"]
}
enforced_for = {
@@assign = ["ec2:instance", "rds:db", "s3:bucket"]
}
}
Team = {
tag_key = {
@@assign = "Team"
}
}
CostCenter = {
tag_key = {
@@assign = "CostCenter"
}
}
}
})
}
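Config flags non-compliant resources going forward, but resources created before the rule may still be untagged. A quick way to list them is the Resource Groups Tagging API; a minimal sketch (assumes jq is installed):

# EC2 instances missing the CostCenter tag
aws resourcegroupstaggingapi get-resources \
--resource-type-filters ec2:instance \
| jq -r '.ResourceTagMappingList[] | select(all(.Tags[]?; .Key != "CostCenter")) | .ResourceARN'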
Step 3: Savings Plans and Reserved Instances
# Check recommendations
# aws ce get-savings-plans-purchase-recommendation
# Compute Savings Plans (most flexible)
# Covers EC2, Lambda, and Fargate
# 1-year no upfront: ~20% savings
# 1-year all upfront: ~30% savings
# 3-year all upfront: ~50% savings
# Example: Purchase via CLI (commitment is in USD per hour;
# the plan type is determined by the offering you select)
# aws savingsplans create-savings-plan \
# --savings-plan-offering-id xxx \
# --commitment 10.0
# Reserved Instances for RDS
resource "aws_db_instance" "main" {
# ... configuration
# Use Reserved Instance pricing
# Purchase separately via console or CLI
# aws rds purchase-reserved-db-instances-offering
}
# Cost comparison table
#
# Resource Type | On-Demand | 1yr No Upfront | 1yr All Upfront | 3yr All Upfront
# EC2 m5.xlarge | $140/mo | $90/mo (-36%) | $84/mo (-40%) | $56/mo (-60%)
# RDS db.r5.large | $175/mo | $110/mo (-37%) | $100/mo (-43%) | $67/mo (-62%)
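Applied to this scenario, with the illustrative assumption that roughly two-thirds of the $45,000/month EC2 spend is steady-state baseline:

# Baseline EC2 spend:                    ~$30,000/month
# 1yr no-upfront Compute SP (~20% off):  saves ~$6,000/month
# 3yr all-upfront (~50% off):            saves ~$15,000/month
# Rule of thumb: commit only to the baseline you are sure will still be
# running in a year; leave bursty usage on-demand or on Spot.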
Step 4: Right-Sizing with Compute Optimizer
# Enable Compute Optimizer
aws compute-optimizer update-enrollment-status --status Active
# Get EC2 recommendations
aws compute-optimizer get-ec2-instance-recommendations \
--query 'instanceRecommendations[*].[instanceArn,currentInstanceType,recommendationOptions[0].instanceType,recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value]'
# Get Lambda recommendations
aws compute-optimizer get-lambda-function-recommendations \
--query 'lambdaFunctionRecommendations[*].[functionArn,currentMemorySize,memorySizeRecommendationOptions[0].memorySize]'

# Auto-shutdown for dev/test environments
resource "aws_autoscaling_schedule" "scale_down" {
scheduled_action_name = "scale-down-night"
min_size = 0
max_size = 0
desired_capacity = 0
recurrence = "0 20 * * MON-FRI" # 8 PM weekdays
autoscaling_group_name = aws_autoscaling_group.dev.name
}
resource "aws_autoscaling_schedule" "scale_up" {
scheduled_action_name = "scale-up-morning"
min_size = 2
max_size = 4
desired_capacity = 2
recurrence = "0 8 * * MON-FRI" # 8 AM weekdays
autoscaling_group_name = aws_autoscaling_group.dev.name
}
# Instance Scheduler for EC2/RDS
# Use AWS Instance Scheduler solution
# https://aws.amazon.com/solutions/implementations/instance-scheduler-on-aws/
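For standalone dev instances that are not in an Auto Scaling group, a tag-driven stop can run nightly from cron or an EventBridge-scheduled job. A minimal sketch, assuming instances carry the Environment=development tag from Step 2:

# Stop every running dev-tagged instance (pair with a morning start job)
aws ec2 describe-instances \
--filters "Name=tag:Environment,Values=development" \
"Name=instance-state-name,Values=running" \
--query 'Reservations[].Instances[].InstanceId' \
--output text \
| xargs -r aws ec2 stop-instances --instance-ids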
resource "aws_s3_bucket_lifecycle_configuration" "main" {
bucket = aws_s3_bucket.data.id
# Move to cheaper storage classes
rule {
id = "transition-to-ia"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER_IR" # Instant Retrieval
}
transition {
days = 180
storage_class = "GLACIER"
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE"
}
}
# Delete old versions
rule {
id = "delete-old-versions"
status = "Enabled"
noncurrent_version_transition {
noncurrent_days = 30
storage_class = "STANDARD_IA"
}
noncurrent_version_expiration {
noncurrent_days = 90
}
}
# Abort incomplete multipart uploads
rule {
id = "abort-multipart"
status = "Enabled"
abort_incomplete_multipart_upload {
days_after_initiation = 7
}
}
# Delete expired objects
rule {
id = "expire-logs"
status = "Enabled"
filter {
prefix = "logs/"
}
expiration {
days = 30
}
}
}
# Enable S3 Intelligent-Tiering for unknown access patterns
resource "aws_s3_bucket_intelligent_tiering_configuration" "main" {
bucket = aws_s3_bucket.data.id
name = "entire-bucket"
tiering {
access_tier = "ARCHIVE_ACCESS"
days = 90
}
tiering {
access_tier = "DEEP_ARCHIVE_ACCESS"
days = 180
}
}
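To verify the lifecycle rules are actually moving data, S3 publishes free daily storage metrics to CloudWatch, broken down by storage class. A sketch, using a hypothetical bucket name:

# Bytes held in STANDARD (metric is emitted once per day)
aws cloudwatch get-metric-statistics \
--namespace AWS/S3 \
--metric-name BucketSizeBytes \
--dimensions Name=BucketName,Value=my-data-bucket Name=StorageType,Value=StandardStorage \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-31T00:00:00Z \
--period 86400 \
--statistics Average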
Step 6: NAT Gateway Optimization
# Use VPC endpoints instead of NAT Gateway
# Gateway endpoints (FREE): S3, DynamoDB
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
vpc_endpoint_type = "Gateway"
route_table_ids = aws_route_table.private[*].id
}
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.dynamodb"
vpc_endpoint_type = "Gateway"
route_table_ids = aws_route_table.private[*].id
}
# Interface endpoints (~$7/month each, but saves NAT data transfer)
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
}
# NAT Gateway cost breakdown:
# - Hourly charge: $0.045/hour = ~$32/month per NAT GW
# - Data processing: $0.045/GB
# - Cross-AZ: Additional $0.01/GB
#
# If processing 1TB/month through NAT: $32 + $45 = $77/month
# With VPC endpoints for S3/ECR: Saves most of that data transfer
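Before and after adding endpoints, the NAT gateway's own CloudWatch metrics show how much data it is processing, which maps directly to the $0.045/GB charge. A sketch, with nat-xxx as a placeholder gateway ID:

# Bytes the NAT gateway sent out during January
aws cloudwatch get-metric-statistics \
--namespace AWS/NATGateway \
--metric-name BytesOutToDestination \
--dimensions Name=NatGatewayId,Value=nat-xxx \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-02-01T00:00:00Z \
--period 86400 \
--statistics Sum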
resource "aws_budgets_budget" "monthly" {
name = "monthly-budget"
budget_type = "COST"
limit_amount = "60000"
limit_unit = "USD"
time_unit = "MONTHLY"
time_period_start = "2024-01-01_00:00"
cost_filter {
name = "TagKeyValue"
values = ["user:Environment$production"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = ["team@example.com"]
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_email_addresses = ["team@example.com"]
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
# Per-service budgets
resource "aws_budgets_budget" "ec2" {
name = "ec2-budget"
budget_type = "COST"
limit_amount = "35000"
limit_unit = "USD"
time_unit = "MONTHLY"
time_period_start = "2024-01-01_00:00"
cost_filter {
name = "Service"
values = ["Amazon Elastic Compute Cloud - Compute"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
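Budgets compare spend to a fixed limit; Cost Anomaly Detection instead flags unusual spikes (like the NAT Gateway doubling) without one. A minimal sketch of creating a per-service monitor; pair it with create-anomaly-subscription to deliver alerts by email or SNS:

# Watch each AWS service's spend for anomalies
aws ce create-anomaly-monitor \
--anomaly-monitor '{"MonitorName": "service-monitor", "MonitorType": "DIMENSIONAL", "MonitorDimension": "SERVICE"}'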
Step 8: Automated Cleanup
import boto3
from datetime import datetime, timedelta, timezone
ec2 = boto3.client('ec2')
def cleanup_unused_resources():
    """Find and optionally delete unused resources."""
    savings = 0

    # Unattached EBS volumes older than 30 days
    volumes = ec2.describe_volumes(
        Filters=[{'Name': 'status', 'Values': ['available']}]
    )['Volumes']
    for vol in volumes:
        age = (datetime.now(timezone.utc) - vol['CreateTime']).days
        if age > 30:
            print(f"Unused volume: {vol['VolumeId']}, Size: {vol['Size']}GB, Age: {age} days")
            savings += vol['Size'] * 0.10  # ~$0.10/GB/month
            # Uncomment to delete
            # ec2.delete_volume(VolumeId=vol['VolumeId'])

    # Snapshots older than 90 days
    snapshots = ec2.describe_snapshots(OwnerIds=['self'])['Snapshots']
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    for snap in snapshots:
        if snap['StartTime'] < cutoff:
            print(f"Old snapshot: {snap['SnapshotId']}, Size: {snap['VolumeSize']}GB")
            savings += snap['VolumeSize'] * 0.05  # ~$0.05/GB/month

    # Unassociated Elastic IPs
    addresses = ec2.describe_addresses()['Addresses']
    for addr in addresses:
        if 'AssociationId' not in addr:
            print(f"Unused EIP: {addr['PublicIp']}")
            savings += 3.60  # ~$3.60/month

    print(f"\nEstimated monthly savings: ${savings:.2f}")

if __name__ == '__main__':
    cleanup_unused_resources()
Cost Optimization Summary
| Strategy | Potential Savings | Implementation Effort |
|---|---|---|
| Savings Plans (1yr) | 20-30% on compute | Low (purchase decision) |
| Right-sizing | 20-40% | Medium (analysis needed) |
| Dev shutdown | 65% on dev/test | Medium (scheduling) |
| S3 lifecycle | 50-80% on storage | Low (configuration) |
| VPC endpoints | 50%+ on NAT | Low (infrastructure) |
| Spot instances | 60-90% | High (architecture change) |
Practice Question
Why do NAT Gateway costs often spike unexpectedly?