Questions
A critical production pod is stuck in CrashLoopBackOff. How do you diagnose and fix it?
The Scenario
It’s 3 AM and you get paged. Your company’s payment processing service—a critical microservice handling thousands of transactions per minute—has been down for 5 minutes. The on-call engineer tried restarting the deployment, but the pods keep crashing.
When you check the cluster, you see:
kubectl get pods -n payments
NAME                           READY   STATUS             RESTARTS   AGE
payment-processor-6d4f7b-abc   0/1     CrashLoopBackOff   5          3m
payment-processor-6d4f7b-def   0/1     CrashLoopBackOff   5          3m
payment-processor-6d4f7b-ghi   0/1     CrashLoopBackOff   5          3m
Every transaction is failing. Revenue is being lost. Your VP of Engineering is awake and watching Slack. You have 10 minutes to diagnose and fix this.
The Challenge
Walk me through your systematic debugging process. What commands would you run, in what order, and why? How would you quickly isolate whether this is an application issue, configuration problem, or infrastructure failure?
A junior engineer might panic and randomly restart pods hoping they'll work, immediately rebuild the container without checking logs, scale up replicas thinking more pods will help, or SSH into nodes to check system resources. These approaches fail: without a systematic process you waste time, rebuilding without a diagnosis just reproduces the problem, scaling up only creates more crashing pods, and all of them ignore the actual error messages already sitting in the Kubernetes events.
A senior SRE follows a methodical process that starts with recent changes in the first 30 seconds: check recent deployments with rollout history and recent events sorted by timestamp. Since roughly 80% of production incidents are caused by recent changes, a deployment five minutes ago is your smoking gun, and the quick fix is an immediate rollback with rollout undo. If there was no recent deployment, examine the pod logs (using the --previous flag if the container crashed before it could log anything), then inspect the pod description for events that show why it failed and for exit codes that reveal the issue type (0 clean exit, 1 application error, 137 OOMKilled, 139 segfault, 143 terminated by SIGTERM).
Step 1: Check Recent Changes (First 30 seconds)
Before diving into logs, check what changed:
# Check recent deployments
kubectl rollout history deployment/payment-processor -n payments
# Check recent events
kubectl get events -n payments --sort-by='.lastTimestamp' | tail -20
Why: 80% of production incidents are caused by recent changes. If you see a deployment 5 minutes ago, that’s your smoking gun.
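If the history does show a fresh revision, you can inspect exactly what that revision changed before deciding to roll back; a quick sketch (the revision number 3 is illustrative):
# Show the pod template recorded for a specific revision
kubectl rollout history deployment/payment-processor -n payments --revision=3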
Quick Fix: If a recent deployment caused this:
# Immediate rollback
kubectl rollout undo deployment/payment-processor -n payments
# Verify pods are recovering
kubectl get pods -n payments -w
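To confirm the rollback actually finished rather than eyeballing the watch output, rollout status blocks until the deployment is healthy or the timeout expires; a minimal sketch:
# Exits successfully once the rolled-back ReplicaSet is fully available
kubectl rollout status deployment/payment-processor -n payments --timeout=120s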
Step 2: Examine Pod Logs (Next 2 minutes)
If rollback doesn’t help or there was no recent deployment:
# Get logs from the crashing pod
kubectl logs payment-processor-6d4f7b-abc -n payments
# If the container crashed before logging anything, check previous instance
kubectl logs payment-processor-6d4f7b-abc -n payments --previous
# Check all container logs if it's a multi-container pod
kubectl logs payment-processor-6d4f7b-abc -n payments --all-containers=true
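When every replica is crashing with the same symptom, it is often faster to pull logs by label than by pod name; a minimal sketch, assuming the pods carry an app=payment-processor label (not shown in the output above):
# Previous-instance logs from all matching replicas, prefixed with the pod name
kubectl logs -n payments -l app=payment-processor --previous --tail=50 --prefix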
What to Look For:
- Application errors: Stack traces, null pointer exceptions, connection errors
- Configuration errors: “Cannot read config file”, “Environment variable X not set”
- Dependency failures: “Cannot connect to database”, “Redis timeout”
- OOM kills: “Out of memory” or sudden termination with exit code 137
Step 3: Inspect Pod Description (Next 2 minutes)
kubectl describe pod payment-processor-6d4f7b-abc -n payments
Critical Sections to Check:
- Events section, showing why the pod failed
- Last State, showing the container's exit code
Common Exit Codes:
- 0: Clean exit. This shouldn't normally cause a crash loop, but with restartPolicy: Always (the Deployment default) a container that exits 0 is still restarted, so a process that finishes instead of running continuously can still show CrashLoopBackOff
- 1: Application error (check logs)
- 137: OOMKilled (out of memory)
- 139: Segmentation fault
- 143: Terminated by SIGTERM
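If the describe output is noisy, the exit code can also be pulled straight out of the pod status; a minimal sketch using the pod name from the example above:
# Exit code of the last terminated instance of the first container
kubectl get pod payment-processor-6d4f7b-abc -n payments \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'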
Step 4: Check Dependencies and ConfigMaps/Secrets (Next 2 minutes)
# Verify ConfigMap exists and has correct data
kubectl get configmap payment-config -n payments -o yaml
# Verify Secrets exist
kubectl get secrets -n payments
# Check if the database/Redis are accessible
kubectl run debug-pod --rm -it --image=busybox -n payments -- sh
# Inside the pod:
nslookup payment-db.payments.svc.cluster.local
wget -O- payment-db.payments.svc.cluster.local:5432
# (Postgres won't answer HTTP, but "connection refused" vs. a successful TCP connect tells you whether the port is reachable)
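It is also worth checking, from your normal kubectl context rather than inside the debug pod, that the database Service actually has endpoints behind it; a quick sketch, assuming the DNS name above corresponds to a Service named payment-db:
# No addresses listed here means the Service's selector matches no ready pods
kubectl get endpoints payment-db -n payments
# The Service definition itself (port, selector)
kubectl get service payment-db -n payments -o wide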
Step 5: Check Resource Limits (Next 1 minute)
# Check if pods are being OOMKilled
kubectl describe pod payment-processor-6d4f7b-abc -n payments | grep -A 5 "Last State"
# Check current resource usage vs limits
kubectl top pods -n payments
kubectl describe deployment payment-processor -n payments | grep -A 3 "Limits"
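If the limits really are too low, they can be raised in place without hand-editing YAML; a minimal sketch (the 1Gi/512Mi values are illustrative, size them from kubectl top and the app's actual footprint):
# Triggers a new rollout with the higher memory limit
kubectl set resources deployment/payment-processor -n payments \
  --limits=memory=1Gi --requests=memory=512Mi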
Common Root Causes and Fixes
| Symptom | Root Cause | Fix |
|---|---|---|
| ErrImagePull / ImagePullBackOff | Wrong image tag or registry auth failure | Check the image name, verify the image exists, check imagePullSecrets |
| CrashLoopBackOff + exit code 1 | Application error on startup | Check logs, verify env vars and config files |
| CrashLoopBackOff + exit code 137 | Pod using more memory than its limit | Increase memory limits or fix the memory leak |
| CreateContainerConfigError | Missing ConfigMap/Secret | Verify the ConfigMap/Secret exists and is mounted correctly |
| Pods start but immediately crash | Failed health check or unreachable dependency | Check liveness/readiness probes, verify database connectivity |
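For the image-pull rows in particular, you can confirm which image the deployment is actually asking for and wire in registry credentials without editing manifests; a sketch where the registry, secret name, and credentials are illustrative:
# Which image reference is the deployment using?
kubectl get deployment payment-processor -n payments \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# If registry auth is the problem, create a pull secret and attach it to the service account
kubectl create secret docker-registry regcred -n payments \
  --docker-server=registry.example.com --docker-username=USER --docker-password=PASS
kubectl patch serviceaccount default -n payments \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'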
Real-World Example: Database Connection Failure
Logs show:
Error: connect ECONNREFUSED payment-db:5432
Application failed to start
Root Cause: Database service name changed from payment-db to payment-database in a recent update, but the app’s environment variable still points to the old name.
Fix:
# Update the deployment's environment variable
kubectl set env deployment/payment-processor -n payments \
DATABASE_HOST=payment-database.payments.svc.cluster.local
# Verify pods are running
kubectl get pods -n payments -w
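If DATABASE_HOST is injected from a ConfigMap rather than set directly on the deployment, the same fix goes through the ConfigMap plus a restart, because environment variables are only read when the container starts; a sketch, assuming the payment-config ConfigMap from Step 4 holds that key:
# Update the key in place (the key name is an assumption)
kubectl patch configmap payment-config -n payments \
  --type merge -p '{"data":{"DATABASE_HOST":"payment-database.payments.svc.cluster.local"}}'
# Force new pods so they pick up the changed value
kubectl rollout restart deployment/payment-processor -n payments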
Practice Question
A pod is stuck in CrashLoopBackOff with exit code 137. What is the most likely cause?