DeployU

Terraform apply is stuck because of a state lock. Another engineer's apply crashed mid-execution. Fix it.

The Scenario

Your teammate started a terraform apply but their laptop crashed mid-execution. Now everyone on the team sees this error:

$ terraform plan
Acquiring state lock. This may take a few moments...

Error: Error acquiring the state lock

Error message: ConditionalCheckFailedException: The conditional request failed
Lock Info:
  ID:        a1b2c3d4-5678-90ab-cdef-1234567890ab
  Path:      s3://company-terraform-state/prod/vpc/terraform.tfstate
  Operation: OperationTypeApply
  Who:       john@company-laptop
  Version:   1.5.0
  Created:   2024-01-15 10:30:45.123456 +0000 UTC
  Info:

Terraform acquires a state lock to protect against two processes
writing to state simultaneously. This error indicates that the
lock has not been released.

The original engineer is unavailable and production changes are blocked.

The Challenge

Safely release the state lock, verify state integrity, and prevent this from happening again. Explain the risks and safeguards.

Wrong Approach

A junior engineer might immediately run 'terraform force-unlock' without understanding the risks, skip state verification afterward, or not investigate whether the previous apply partially completed. These approaches risk state corruption, resource inconsistencies, or applying changes on top of a broken state.

Right Approach

A senior engineer first verifies the lock holder is truly unavailable, checks if any apply is actually running, uses force-unlock with the correct lock ID, immediately runs terraform plan to verify state consistency, and implements safeguards to prevent future orphaned locks.

Step 1: Verify the Lock is Actually Orphaned

Before force-unlocking, confirm no apply is running:

# Check if the lock holder's process might still be running
# Contact the engineer if possible - their apply might be in progress!

# Check CloudWatch or your monitoring for recent Terraform activity
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedWriteCapacityUnits \
  --dimensions Name=TableName,Value=terraform-locks \
  --start-time 2024-01-15T10:00:00Z \
  --end-time 2024-01-15T11:00:00Z \
  --period 300 \
  --statistics Sum

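# Check DynamoDB directly for lock details
# The S3 backend stores the lock under "<bucket>/<key>" (no s3:// prefix).
# The separate "...-md5" item holds the state checksum, not the lock.
aws dynamodb get-item \
  --table-name terraform-locks \
  --key '{"LockID": {"S": "company-terraform-state/prod/vpc/terraform.tfstate"}}'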
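
One more signal for "is this lock orphaned?" is its age. The helper below is a minimal sketch, not part of Terraform: it assumes GNU date (Linux) and that the Created value in the error is in UTC, as in the scenario. The function name lock_age_seconds is hypothetical.

```shell
# lock_age_seconds CREATED [NOW_EPOCH]
# CREATED is the "Created:" value from the lock error, for example
# "2024-01-15 10:30:45.123456 +0000 UTC" (assumed to be UTC).
# NOW_EPOCH defaults to the current time; pass it explicitly for testing.
lock_age_seconds() {
  created=$1
  now=${2:-$(date -u +%s)}
  # Keep only "YYYY-MM-DD HH:MM:SS"; drop fractional seconds and zone suffix.
  stamp=$(printf '%s\n' "$created" | cut -c1-19)
  # GNU date: -u makes a zone-less timestamp parse as UTC.
  created_epoch=$(date -u -d "$stamp" +%s) || return 1
  echo $(( now - created_epoch ))
}
```

A lock that is hours old on a plan that normally takes minutes is a strong hint, but it is not proof. Still confirm with the lock holder or your CI logs before unlocking.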

Step 2: Force Unlock the State

# Use the Lock ID from the error message
terraform force-unlock a1b2c3d4-5678-90ab-cdef-1234567890ab

# You'll see a confirmation prompt:
Do you really want to force-unlock?
  Terraform will remove the lock on the remote state.
  This will allow local Terraform commands to modify this state, even though it
  may still be in use. Only 'yes' will be accepted to confirm.

  Enter a value: yes

# Output on success:
Terraform state has been successfully unlocked!
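
Copying the lock ID by hand is error-prone under pressure. As a rough sketch, the ID can be pulled out of the error text instead. This assumes the "ID:" line looks like the one in the scenario above; extract_lock_id is a hypothetical helper name, not a Terraform command.

```shell
# extract_lock_id: read Terraform's lock error on stdin, print the lock ID.
# Matches the first line containing "ID:" followed by a 36-character UUID,
# regardless of any leading border or indentation characters.
extract_lock_id() {
  sed -n 's/.*ID:[[:space:]]*\([0-9a-fA-F-]\{36\}\).*/\1/p' | head -n 1
}

# Example (not run here):
#   lock_id=$(terraform plan -no-color 2>&1 | extract_lock_id)
#   [ -n "$lock_id" ] && terraform force-unlock "$lock_id"
```

Leaving the confirmation prompt in place (no -force) keeps a human in the loop for the final decision.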

Step 3: Verify State Integrity

Critical: After force-unlock, immediately check for issues:

# Run plan to see current state vs reality
terraform plan

# If plan shows unexpected changes, the previous apply may have partially completed
# Look for resources that were being created/modified

# Check state list to see all managed resources
terraform state list

# For specific resources that might be affected:
terraform state show aws_vpc.main
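
To make the post-unlock check scriptable, terraform plan accepts -detailed-exitcode, which returns 0 for no changes, 1 for an error, and 2 when changes are pending. The function below is a small sketch for turning that code into a label; describe_plan_exit and its labels are hypothetical names.

```shell
# describe_plan_exit CODE
# Translate the exit status of `terraform plan -detailed-exitcode`.
describe_plan_exit() {
  case "$1" in
    0) echo "no-changes" ;;       # state matches configuration and reality
    1) echo "error" ;;            # plan itself failed
    2) echo "changes-pending" ;;  # review: may reflect a partial apply
    *) echo "unknown" ;;
  esac
}

# Example (not run here):
#   terraform plan -detailed-exitcode; describe_plan_exit "$?"
```

After a crashed apply, "changes-pending" is the case to read line by line, since it may include resources that exist in the cloud but not in state.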

Step 4: Handle Partial Apply Scenarios

If the previous apply partially completed:

# Option 1: If resources were partially created, import them
terraform import aws_instance.web i-1234567890abcdef0

# Option 2: If resources are in a bad state, force replacement
# (terraform taint is deprecated since v0.15.2; use -replace instead)
terraform apply -replace="aws_instance.web"

# Option 3: Refresh state to match reality
# (terraform refresh is deprecated since v0.15.4; use refresh-only mode)
terraform apply -refresh-only

Step 5: Implement Safeguards

Backend configuration with locking enabled:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/vpc/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"

    # Lock wait time is not a backend setting. Pass it per command,
    # e.g. terraform plan -lock-timeout=5m. The default is 0s, which
    # fails immediately if the lock is already held.
  }
}
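
Since the wait time is a command-line flag (-lock-timeout) rather than a backend setting, one way to apply it consistently is a thin wrapper. This is only a sketch: the function name and the TF_BIN override are hypothetical, and TF_BIN exists purely so the wrapper can be exercised without Terraform installed.

```shell
# tf_plan_with_lock_wait [DURATION] [extra plan args...]
# Runs `terraform plan`, retrying the lock for DURATION (default 5m)
# before giving up. TF_BIN is a hypothetical override for the binary.
tf_plan_with_lock_wait() {
  wait_for=${1:-5m}
  # Drop DURATION from the argument list if one was given.
  [ "$#" -gt 0 ] && shift
  "${TF_BIN:-terraform}" plan -lock-timeout="$wait_for" "$@"
}
```

A short wait smooths over brief overlaps between teammates. It does nothing for a truly orphaned lock, which still needs the manual procedure above.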

Use Terraform Cloud/Enterprise for better lock management:

terraform {
  cloud {
    organization = "company"
    workspaces {
      name = "prod-vpc"
    }
  }
}

CI/CD pipeline with bounded run time and failure alerting:

# .github/workflows/terraform.yml
jobs:
  terraform:
    runs-on: ubuntu-latest
    timeout-minutes: 30  # Job-level cap so a hung run cannot hold the lock indefinitely
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        run: terraform init -input=false

      - name: Terraform Apply
        run: terraform apply -auto-approve -input=false
        timeout-minutes: 20  # Step-level timeout

      # Runs only when an earlier step fails
      - name: Report Possible Orphaned Lock
        if: failure()
        run: |
          # Log for audit trail
          echo "Apply failed, checking for orphaned lock..."
          # Don't auto-unlock - notify team instead
          # terraform force-unlock would be dangerous here

State Locking Deep Dive

Backend          Lock Mechanism   Lock Location
S3               DynamoDB         Separate DynamoDB table
GCS              Native           Built into GCS
Azure Blob       Blob Lease       Built into Azure
Terraform Cloud  Native           Managed by TFC
Local            Filesystem       .terraform.tfstate.lock.info

DynamoDB Lock Table Schema

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Purpose = "Terraform state locking"
  }
}
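
When inspecting this table by hand, note that the S3 backend keeps two kinds of items in it: the lock entry itself, keyed by "<bucket>/<key>", and a checksum entry whose LockID ends in "-md5". The sketch below classifies a LockID value so the two are not confused; lock_table_item_kind is a hypothetical helper name.

```shell
# lock_table_item_kind LOCK_ID
# Classify a LockID value from the DynamoDB lock table.
lock_table_item_kind() {
  case "$1" in
    *-md5) echo "digest" ;;  # state checksum entry, not a lock
    *)     echo "lock" ;;    # an active (or orphaned) lock entry
  esac
}
```

Only a "lock" item indicates a held lock. The "digest" item is expected to be present at all times and should be left alone.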

Emergency Procedures Runbook

#!/bin/bash
# emergency-unlock.sh - Use with extreme caution!

set -euo pipefail

LOCK_ID=${1:-}
STATE_PATH=${2:-}

if [ -z "$LOCK_ID" ] || [ -z "$STATE_PATH" ]; then
  echo "Usage: ./emergency-unlock.sh <lock-id> <state-path>"
  exit 1
fi

echo "WARNING: About to force-unlock Terraform state"
echo "Lock ID: $LOCK_ID"
echo "State Path: $STATE_PATH"
echo ""
echo "Checklist before proceeding:"
echo "[ ] Confirmed lock holder is unavailable"
echo "[ ] Checked no apply is currently running"
echo "[ ] Have backup of current state"
echo ""
read -r -p "Type 'UNLOCK' to proceed: " confirm

if [ "$confirm" != "UNLOCK" ]; then
  echo "Aborted"
  exit 1
fi

# Create state backup first
terraform state pull > "state-backup-$(date +%Y%m%d-%H%M%S).json"

# Force unlock
terraform force-unlock -force "$LOCK_ID"

# Immediate verification
echo "Running terraform plan to verify state..."
terraform plan -detailed-exitcode || true

echo "Done. Review the plan output carefully!"
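
The backup taken by the runbook is also useful afterwards. Each state snapshot carries a top-level "serial" number that increases when state is written, so comparing the backup with a fresh terraform state pull shows whether anything has modified state since. The helpers below are a sketch that assumes the two-space-indented JSON that terraform state pull prints; both function names are hypothetical.

```shell
# state_serial FILE
# Print the top-level "serial" value from a pulled state file.
# Assumes the field appears as:   "serial": 12,   (two-space indent).
state_serial() {
  sed -n 's/^  "serial": \([0-9][0-9]*\),\{0,1\}[[:space:]]*$/\1/p' "$1" | head -n 1
}

# state_changed_since BACKUP CURRENT
# Succeeds (exit 0) when the two files carry different serials.
state_changed_since() {
  [ "$(state_serial "$1")" != "$(state_serial "$2")" ]
}

# Example (not run here):
#   terraform state pull > current.json
#   state_changed_since state-backup-20240115-110000.json current.json \
#     && echo "State was written after the backup"
```

A changed serial is not a problem in itself. It simply tells you the backup no longer matches what is in the backend.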

Practice Question

Why is it dangerous to run 'terraform force-unlock' without first verifying the lock holder's status?