
A critical production pod is stuck in CrashLoopBackOff. How do you diagnose and fix it?


The Scenario

It’s 3 AM and you get paged. Your company’s payment processing service—a critical microservice handling thousands of transactions per minute—has been down for 5 minutes. The on-call engineer tried restarting the deployment, but the pods keep crashing.

When you check the cluster, you see:

kubectl get pods -n payments
NAME                           READY   STATUS             RESTARTS   AGE
payment-processor-6d4f7b-abc   0/1     CrashLoopBackOff   5          3m
payment-processor-6d4f7b-def   0/1     CrashLoopBackOff   5          3m
payment-processor-6d4f7b-ghi   0/1     CrashLoopBackOff   5          3m

Every transaction is failing. Revenue is being lost. Your VP of Engineering is awake and watching Slack. You have 10 minutes to diagnose and fix this.

The Challenge

Walk me through your systematic debugging process. What commands would you run, in what order, and why? How would you quickly isolate whether this is an application issue, configuration problem, or infrastructure failure?

Wrong Approach

A junior engineer might panic: randomly restart pods and hope they recover, rebuild the container image without checking the logs, scale up replicas on the theory that more pods will help, or SSH into nodes to check system resources. This fails because there is no systematic approach, so time is wasted; rebuilding without a diagnosis just reproduces the problem; scaling up creates more crashing pods; and the actual error messages in the Kubernetes events never get read.

Right Approach

A senior SRE follows a methodical process. Start by checking recent changes within the first 30 seconds: review the rollout history and the most recent events sorted by timestamp. Most production incidents trace back to a recent change, so a deployment from 5 minutes ago is your smoking gun, and the quick fix is an immediate rollback with rollout undo. If there was no recent deployment, examine the pod logs (using the --previous flag if the container crashed before logging anything), then inspect the pod description: the Events section shows why the pod failed, and the exit code reveals the issue type (0 clean exit, 1 application error, 137 OOMKilled, 139 segfault, 143 SIGTERM).

Step 1: Check Recent Changes (First 30 seconds)

Before diving into logs, check what changed:

# Check recent deployments
kubectl rollout history deployment/payment-processor -n payments

# Check recent events
kubectl get events -n payments --sort-by='.lastTimestamp' | tail -20
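If the namespace has a lot of event noise, the stream can be narrowed to a single pod; the field selector below uses the pod name from the listing above:

# Events for one specific pod only
kubectl get events -n payments --field-selector involvedObject.name=payment-processor-6d4f7b-abc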

Why: 80% of production incidents are caused by recent changes. If you see a deployment 5 minutes ago, that’s your smoking gun.

Quick Fix: If a recent deployment caused this:

# Immediate rollback
kubectl rollout undo deployment/payment-processor -n payments

# Verify pods are recovering
kubectl get pods -n payments -w
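If you prefer a command that blocks until the rollback has fully rolled out (instead of watching the pod list), rollout status does that:

# Wait for the rollback to finish rolling out
kubectl rollout status deployment/payment-processor -n payments --timeout=120s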

Step 2: Examine Pod Logs (Next 2 minutes)

If rollback doesn’t help or there was no recent deployment:

# Get logs from the crashing pod
kubectl logs payment-processor-6d4f7b-abc -n payments

# If the container crashed before logging anything, check previous instance
kubectl logs payment-processor-6d4f7b-abc -n payments --previous

# Check all container logs if it's a multi-container pod
kubectl logs payment-processor-6d4f7b-abc -n payments --all-containers=true
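To pull recent lines from all three replicas at once, a label selector helps; the app=payment-processor label here is an assumption about how the deployment is labeled, so check the actual labels first:

# Last 50 lines from every replica, prefixed with the pod name (label is an assumption)
kubectl logs -l app=payment-processor -n payments --tail=50 --prefix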

What to Look For:

  • Application errors: Stack traces, null pointer exceptions, connection errors
  • Configuration errors: “Cannot read config file”, “Environment variable X not set”
  • Dependency failures: “Cannot connect to database”, “Redis timeout”
  • OOM kills: “Out of memory” or sudden termination with exit code 137
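A quick way to scan the previous crash's output for these patterns is a rough grep filter like the sketch below (not exhaustive, just a first pass):

# Grep the last crash's logs for common failure signatures
kubectl logs payment-processor-6d4f7b-abc -n payments --previous 2>&1 \
  | grep -iE "error|exception|refused|timeout|out of memory"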

Step 3: Inspect Pod Description (Next 2 minutes)

kubectl describe pod payment-processor-6d4f7b-abc -n payments

Critical Sections to Check:

  1. Events Section showing why the pod failed
  2. Last State showing the exit code

Common Exit Codes:

  • 0: Clean exit; the process completed on its own, but Kubernetes still restarts it, so a container whose main process simply finishes can loop
  • 1: Application error (check logs)
  • 137: OOMKilled (out of memory)
  • 139: Segmentation fault
  • 143: Terminated by SIGTERM
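If you want the exit code without scrolling through the full describe output, JSONPath can pull it directly (this assumes a single-container pod):

# Exit code of the last terminated container (single-container pod assumed)
kubectl get pod payment-processor-6d4f7b-abc -n payments \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'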

Step 4: Check Dependencies and ConfigMaps/Secrets (Next 2 minutes)

# Verify ConfigMap exists and has correct data
kubectl get configmap payment-config -n payments -o yaml

# Verify Secrets exist
kubectl get secrets -n payments

# Check if the database/Redis are accessible
kubectl run debug-pod --rm -it --image=busybox -n payments -- sh
# Inside the pod:
nslookup payment-db.payments.svc.cluster.local
# A "connection refused" here means the port is unreachable; any other response means
# the TCP connection succeeded (Postgres isn't HTTP, so don't expect readable output)
wget -O- payment-db.payments.svc.cluster.local:5432
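Back outside the debug pod, it's also worth confirming that the deployment actually references the ConfigMap/Secret you just checked; one rough way to eyeball the env wiring (the grep patterns are just a filter sketch):

# Show envFrom and key references in the deployment spec
kubectl get deployment payment-processor -n payments -o yaml \
  | grep -E -A 3 "envFrom:|configMapKeyRef:|secretKeyRef:"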

Step 5: Check Resource Limits (Next 1 minute)

# Check if pods are being OOMKilled
kubectl describe pod payment-processor-6d4f7b-abc -n payments | grep -A 5 "Last State"

# Check current resource usage vs limits
kubectl top pods -n payments
kubectl describe deployment payment-processor -n payments | grep -A 3 "Limits"
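If the pods really are being OOMKilled and you need a stopgap while the memory leak is investigated, raising the limit in place is one option; the 1Gi/512Mi values and the container name below are placeholders, not values from this scenario:

# Bump the memory limit as a stopgap (values and container name are placeholders)
kubectl set resources deployment/payment-processor -n payments \
  -c payment-processor --limits=memory=1Gi --requests=memory=512Mi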

Common Root Causes and Fixes

  • Symptom: ErrImagePull / ImagePullBackOff. Root cause: wrong image tag or registry auth failure. Fix: check the image name, verify the image exists in the registry, check imagePullSecrets.
  • Symptom: CrashLoopBackOff with exit code 1. Root cause: application error on startup. Fix: check logs, verify env vars and config files.
  • Symptom: CrashLoopBackOff with exit code 137. Root cause: pod using more memory than its limit. Fix: increase memory limits or fix the memory leak.
  • Symptom: CreateContainerConfigError. Root cause: missing ConfigMap/Secret. Fix: verify the ConfigMap/Secret exists and is mounted correctly.
  • Symptom: pods start but immediately crash. Root cause: failed health check or unreachable dependency. Fix: check liveness/readiness probes, verify database connectivity.
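For the last case (failed health checks), the configured probes are easy to inspect directly; this assumes a single-container deployment:

# Dump the configured liveness and readiness probes
kubectl get deployment payment-processor -n payments \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'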

Real-World Example: Database Connection Failure

Logs show:

Error: connect ECONNREFUSED payment-db:5432
Application failed to start

Root Cause: Database service name changed from payment-db to payment-database in a recent update, but the app’s environment variable still points to the old name.

Fix:

# Update the deployment's environment variable
kubectl set env deployment/payment-processor -n payments \
  DATABASE_HOST=payment-database.payments.svc.cluster.local

# Verify pods are running
kubectl get pods -n payments -w
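To confirm the change actually landed in the spec, kubectl set env can list the deployment's environment variables after the update:

# List the deployment's environment variables after the update
kubectl set env deployment/payment-processor -n payments --list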

Practice Question

A pod is stuck in CrashLoopBackOff with exit code 137. What is the most likely cause?