Questions
Terraform plan takes 15 minutes and times out. Your state file has 2000+ resources. Optimize it.
The Scenario
Your infrastructure has grown over 3 years. Every terraform plan now takes 15+ minutes:
$ time terraform plan
aws_instance.web[0]: Refreshing state...
aws_instance.web[1]: Refreshing state...
# ... 2000+ resources refreshing ...
# 15 minutes later...
Plan: 0 to add, 1 to change, 0 to destroy.
real 15m23.456s
CI/CD pipelines are timing out. Engineers are frustrated waiting. Meanwhile, AWS API rate limiting errors appear intermittently.
The Challenge
Optimize Terraform performance for a large-scale infrastructure without sacrificing safety or manageability.
A junior engineer might add -parallelism=50 to speed up API calls, use -refresh=false to skip refresh entirely, or arbitrarily split everything into tiny states. These approaches respectively cause API rate limiting, risk applying stale plans, and create an unmaintainable web of cross-state dependencies.
A senior engineer analyzes where time is spent, implements strategic state separation along service boundaries, uses -target for specific changes during development, configures provider parallelism appropriately, and considers data sources for read-only lookups.
Step 1: Analyze Where Time Is Spent
# Enable trace logging
TF_LOG=TRACE terraform plan 2>&1 | tee plan.log
# Analyze API calls
grep "HTTP Request" plan.log | sort | uniq -c | sort -rn
# Count resources per type
terraform state list | cut -d'.' -f1 | sort | uniq -c | sort -rn
# Output might show:
# 500 aws_security_group_rule
# 400 aws_iam_policy_document
# 300 aws_route53_record
# 200 aws_instance
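The cut approach above miscounts resources that live inside modules (module.app.aws_instance.api) or are data sources (data.aws_iam_policy_document.assume). A slightly more robust variant strips those prefixes first; shown here against a canned sample standing in for live terraform state list output:

```shell
# Strip "module.<name>." paths and the "data." prefix with sed so only
# the bare resource type survives, then count per type as before.
out=$(printf '%s\n' \
  'aws_instance.web[0]' \
  'module.app.aws_instance.api' \
  'data.aws_iam_policy_document.assume' \
  'aws_instance.web[1]' |
  sed -E 's/^(module\.[^.]+\.)+//; s/^data\.//' |
  cut -d'.' -f1 | sort | uniq -c | sort -rn)
echo "$out"
# →       3 aws_instance
#         1 aws_iam_policy_document
```

In a real run, replace the printf sample with `terraform state list`.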
Step 2: Strategic State Separation
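Whichever boundaries you pick, the mechanical split is moving entries between state files. A minimal sketch, assuming local working copies of the state (file and resource names are illustrative; the -state/-state-out flags operate only on local files):

```shell
# Pull the monolithic state down to a local file
terraform state pull > monolith.tfstate

# Move resources one at a time into the new state file
terraform state mv \
  -state=monolith.tfstate \
  -state-out=network.tfstate \
  aws_vpc.main aws_vpc.main

# From the new root module, push the file to its backend location,
# then confirm a clean plan before touching the old state again
terraform state push network.tfstate
terraform plan
```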
Bad: Random splits
state-1.tfstate # 700 random resources
state-2.tfstate # 700 random resources
state-3.tfstate # 700 random resources with cross-dependencies
Good: Service/domain boundaries
infrastructure/
├── network/ # VPC, subnets, NAT gateways (changes rarely)
│ └── terraform.tfstate
├── security/ # IAM roles, policies, SCPs
│ └── terraform.tfstate
├── data/ # RDS, ElastiCache, S3
│ └── terraform.tfstate
└── compute/ # ECS, EC2, ASGs
└── terraform.tfstate
Implementation:
# network/main.tf
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
}
output "vpc_id" {
value = aws_vpc.main.id
}
output "private_subnet_ids" {
value = aws_subnet.private[*].id
}
# compute/main.tf
# Reference network state via data source
data "terraform_remote_state" "network" {
backend = "s3"
config = {
bucket = "terraform-state"
key = "network/terraform.tfstate"
region = "us-east-1"
}
}
resource "aws_instance" "web" {
subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
# ...
}
Step 3: Reduce Unnecessary Resources
Remove redundant security group rules:
# BEFORE: 500 individual rules
resource "aws_security_group_rule" "allow_443" {
# ...
}
# AFTER: Consolidated into security group
resource "aws_security_group" "web" {
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = var.allowed_cidrs # Pass list instead of creating rules per CIDR
}
}
Use for_each instead of count for stable addressing:
# for_each addresses resources by key, so adding or removing one entry
# does not shift the addresses of the rest (unlike count indexes)
resource "aws_route53_record" "services" {
for_each = var.services
name = each.key
type = "A"
zone_id = var.zone_id
# ...
}
Step 4: Provider Configuration Tuning
# provider.tf
provider "aws" {
region = "us-east-1"
# Retries smooth over throttling; note that parallelism (default 10)
# is a CLI flag (terraform plan -parallelism=N), not a provider argument
max_retries = 5
# Skip unnecessary metadata lookups
skip_metadata_api_check = true
skip_region_validation = true
skip_credentials_validation = true
}
Step 5: Use -target for Development
# During development, target specific resources
terraform plan -target=aws_instance.web
# Apply specific changes without full refresh
terraform apply -target=module.ecs_service
# WARNING: Only for development!
# Always run full plan before merging to main
Step 6: Selective Refresh
# Skip refresh entirely (use with caution!)
terraform plan -refresh=false
# Refresh only specific resources
terraform apply -refresh-only -target=aws_instance.web
# Better: refresh in a separate, periodic step
terraform apply -refresh-only # supersedes the deprecated "terraform refresh"
terraform plan -refresh=false # Faster subsequent plans
Step 7: State File Optimization
# Check state file size
aws s3 ls s3://terraform-state/terraform.tfstate --human-readable
# If state is huge, there might be orphaned resources
terraform state list | wc -l # Count managed resources
# Remove resources that no longer exist
terraform state rm 'aws_instance.old_server'
# Round-trip the current state; note this does not remove old versions,
# which live in S3 bucket versioning and are pruned via lifecycle rules
terraform state pull > state.json
terraform state push state.json
Step 8: Consider Terraform Cloud
# Terraform Cloud has optimized state handling
terraform {
cloud {
organization = "company"
workspaces {
name = "production"
}
}
}
# Benefits:
# - Remote execution (doesn't use your network)
# - Optimized state storage
# - Parallel runs across workspaces
# - Built-in cost estimation
Performance Comparison
| State Size | Plan Time (Before) | Plan Time (After) |
|---|---|---|
| 2000 resources (single state) | 15 min | 15 min |
| 500 resources (network) | - | 2 min |
| 400 resources (security) | - | 1.5 min |
| 600 resources (data) | - | 3 min |
| 500 resources (compute) | - | 2 min |
| Total (parallel) | 15 min | 3 min |
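The "Total (parallel)" row assumes the four per-domain plans run concurrently (for example as separate CI jobs), so wall-clock time tracks the slowest module rather than the sum. A toy shell demonstration, with sleep standing in for terraform plan:

```shell
# Four stub "plans" (2s, 1s, 3s, 2s) run as background jobs;
# elapsed wall-clock time is roughly the slowest (3s), not the sum (8s)
start=$(date +%s)
for t in 2 1 3 2; do sleep "$t" & done
wait
elapsed=$(( $(date +%s) - start ))
echo "elapsed: ${elapsed}s"
```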
Terragrunt for Parallel Execution
# Run plans for all modules in parallel
terragrunt run-all plan --terragrunt-parallelism 4
# Dependencies are automatically handled
# Independent modules run simultaneously
# terragrunt.hcl
dependency "network" {
config_path = "../network"
}
inputs = {
vpc_id = dependency.network.outputs.vpc_id
}
Quick Wins Checklist
| Action | Impact | Risk |
|---|---|---|
| Split state by domain | High | Medium (migration needed) |
| Reduce -parallelism | Medium | Low |
| Use -target (dev only) | High | Medium (incomplete plans) |
| Remove orphaned resources | Medium | Low |
| Consolidate SG rules | Medium | Low |
| Use for_each vs count | Medium | Low |
Practice Question
Why is using 'terraform plan -refresh=false' risky in production pipelines?