Questions
Terraform plan takes 15 minutes and times out. Your state file has 2000+ resources. Optimize it.
The Scenario
Your infrastructure has grown over 3 years. Every terraform plan now takes 15+ minutes:
$ time terraform plan
aws_instance.web[0]: Refreshing state...
aws_instance.web[1]: Refreshing state...
# ... 2000+ resources refreshing ...
# 15 minutes later...
Plan: 0 to add, 1 to change, 0 to destroy.
real 15m23.456s
CI/CD pipelines are timing out. Engineers are frustrated waiting. Meanwhile, AWS API rate limiting errors appear intermittently.
The Challenge
Optimize Terraform performance for a large-scale infrastructure without sacrificing safety or manageability.
A junior engineer might add -parallelism=50 to speed up API calls, use -refresh=false to skip refresh entirely, or arbitrarily split everything into tiny states. These approaches respectively cause API rate limiting, risk applying stale plans, and create an unmaintainable web of cross-state dependencies.
A senior engineer analyzes where time is spent, implements strategic state separation along service boundaries, uses -target for specific changes during development, configures provider parallelism appropriately, and considers data sources for read-only lookups.
Step 1: Analyze Where Time Is Spent
# Enable trace logging
TF_LOG=TRACE terraform plan 2>&1 | tee plan.log
# Analyze API calls
grep "HTTP Request" plan.log | sort | uniq -c | sort -rn
# Count resources per type
terraform state list | cut -d'.' -f1 | sort | uniq -c | sort -rn
# Output might show:
# 500 aws_security_group_rule
# 400 aws_iam_policy_document
# 300 aws_route53_record
# 200 aws_instance
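The cut approach above miscounts resources that live inside modules (module.app.aws_instance.api) or are data sources (data.aws_iam_policy_document.assume). A slightly more robust variant strips those prefixes first; shown here against a canned sample standing in for live terraform state list output:

```shell
# Strip "module.<name>." paths and the "data." prefix with sed so only
# the bare resource type survives, then count per type as before.
out=$(printf '%s\n' \
  'aws_instance.web[0]' \
  'module.app.aws_instance.api' \
  'data.aws_iam_policy_document.assume' \
  'aws_instance.web[1]' |
  sed -E 's/^(module\.[^.]+\.)+//; s/^data\.//' |
  cut -d'.' -f1 | sort | uniq -c | sort -rn)
echo "$out"
# →       3 aws_instance
#         1 aws_iam_policy_document
```

In a real run, replace the printf sample with `terraform state list`.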
Step 2: Strategic State Separation
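Whichever boundaries you pick, the mechanical split is moving entries between state files. A minimal sketch, assuming local working copies of the state (file and resource names are illustrative; the -state/-state-out flags operate only on local files):

```shell
# Pull the monolithic state down to a local file
terraform state pull > monolith.tfstate

# Move resources one at a time into the new state file
terraform state mv \
  -state=monolith.tfstate \
  -state-out=network.tfstate \
  aws_vpc.main aws_vpc.main

# From the new root module, push the file to its backend location,
# then confirm a clean plan before touching the old state again
terraform state push network.tfstate
terraform plan
```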
Bad: Random splits
state-1.tfstate # 700 random resources
state-2.tfstate # 700 random resources
state-3.tfstate # 700 random resources with cross-dependencies
Good: Service/domain boundaries
infrastructure/
├── network/ # VPC, subnets, NAT gateways (changes rarely)
│ └── terraform.tfstate
├── security/ # IAM roles, policies, SCPs
│ └── terraform.tfstate
├── data/ # RDS, ElastiCache, S3
│ └── terraform.tfstate
└── compute/ # ECS, EC2, ASGs
└── terraform.tfstate
Implementation:
# network/main.tf
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
}
output "vpc_id" {
value = aws_vpc.main.id
}
output "private_subnet_ids" {
value = aws_subnet.private[*].id
}
# compute/main.tf
# Reference network state via data source
data "terraform_remote_state" "network" {
backend = "s3"
config = {
bucket = "terraform-state"
key = "network/terraform.tfstate"
region = "us-east-1"
}
}
resource "aws_instance" "web" {
subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
# ...
}
Step 3: Reduce Unnecessary Resources
Remove redundant security group rules:
# BEFORE: 500 individual rules
resource "aws_security_group_rule" "allow_443" {
# ...
}
# AFTER: Consolidated into security group
resource "aws_security_group" "web" {
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = var.allowed_cidrs # Pass list instead of creating rules per CIDR
}
}
Use for_each instead of count for stable addressing:
# for_each addresses resources by key, so adding or removing one entry
# does not shift the addresses of the rest (unlike count indexes)
resource "aws_route53_record" "services" {
for_each = var.services
name = each.key
type = "A"
zone_id = var.zone_id
# ...
}
Step 4: Provider Configuration Tuning
# provider.tf
provider "aws" {
region = "us-east-1"
# Retries smooth over throttling; note that parallelism (default 10)
# is a CLI flag (terraform plan -parallelism=N), not a provider argument
max_retries = 5
# Skip unnecessary metadata lookups
skip_metadata_api_check = true
skip_region_validation = true
skip_credentials_validation = true
}
Step 5: Use -target for Development
# During development, target specific resources
terraform plan -target=aws_instance.web
# Apply specific changes without full refresh
terraform apply -target=module.ecs_service
# WARNING: Only for development!
# Always run full plan before merging to main
Step 6: Selective Refresh
# Skip refresh entirely (use with caution!)
terraform plan -refresh=false
# Refresh only specific resources
terraform apply -refresh-only -target=aws_instance.web
# Better: refresh in a separate, periodic step
terraform apply -refresh-only # supersedes the deprecated "terraform refresh"
terraform plan -refresh=false # Faster subsequent plans
Step 7: State File Optimization
# Check state file size
aws s3 ls s3://terraform-state/terraform.tfstate --human-readable
# If state is huge, there might be orphaned resources
terraform state list | wc -l # Count managed resources
# Remove resources that no longer exist
terraform state rm 'aws_instance.old_server'
# Round-trip the current state; note this does not remove old versions,
# which live in S3 bucket versioning and are pruned via lifecycle rules
terraform state pull > state.json
terraform state push state.json
Step 8: Consider Terraform Cloud
# Terraform Cloud has optimized state handling
terraform {
cloud {
organization = "company"
workspaces {
name = "production"
}
}
}
# Benefits:
# - Remote execution (doesn't use your network)
# - Optimized state storage
# - Parallel runs across workspaces
# - Built-in cost estimation
Performance Comparison
| State Size | Plan Time (Before) | Plan Time (After) |
|---|---|---|
| 2000 resources (single state) | 15 min | 15 min |
| 500 resources (network) | - | 2 min |
| 400 resources (security) | - | 1.5 min |
| 600 resources (data) | - | 3 min |
| 500 resources (compute) | - | 2 min |
| Total (parallel) | 15 min | 3 min |
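The "Total (parallel)" row assumes the four per-domain plans run concurrently (for example as separate CI jobs), so wall-clock time tracks the slowest module rather than the sum. A toy shell demonstration, with sleep standing in for terraform plan:

```shell
# Four stub "plans" (2s, 1s, 3s, 2s) run as background jobs;
# elapsed wall-clock time is roughly the slowest (3s), not the sum (8s)
start=$(date +%s)
for t in 2 1 3 2; do sleep "$t" & done
wait
elapsed=$(( $(date +%s) - start ))
echo "elapsed: ${elapsed}s"
```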
Terragrunt for Parallel Execution
# Run plans for all modules in parallel
terragrunt run-all plan --terragrunt-parallelism 4
# Dependencies are automatically handled
# Independent modules run simultaneously
# terragrunt.hcl
dependency "network" {
config_path = "../network"
}
inputs = {
vpc_id = dependency.network.outputs.vpc_id
}
Quick Wins Checklist
| Action | Impact | Risk |
|---|---|---|
| Split state by domain | High | Medium (migration needed) |
| Reduce -parallelism | Medium | Low |
| Use -target (dev only) | High | Medium (incomplete plans) |
| Remove orphaned resources | Medium | Low |
| Consolidate SG rules | Medium | Low |
| Use for_each vs count | Medium | Low |
Practice Question
Why is using 'terraform plan -refresh=false' risky in production pipelines?