Questions
A critical production pod is stuck in CrashLoopBackOff. How do you diagnose and fix it?
The Scenario
It’s 3 AM and you get paged. Your company’s payment processing service—a critical microservice handling thousands of transactions per minute—has been down for 5 minutes. The on-call engineer tried restarting the deployment, but the pods keep crashing.
When you check the cluster, you see:
kubectl get pods -n payments
NAME                           READY   STATUS             RESTARTS   AGE
payment-processor-6d4f7b-abc   0/1     CrashLoopBackOff   5          3m
payment-processor-6d4f7b-def   0/1     CrashLoopBackOff   5          3m
payment-processor-6d4f7b-ghi   0/1     CrashLoopBackOff   5          3m
Every transaction is failing. Revenue is being lost. Your VP of Engineering is awake and watching Slack. You have 10 minutes to diagnose and fix this.
The Challenge
Walk me through your systematic debugging process. What commands would you run, in what order, and why? How would you quickly isolate whether this is an application issue, configuration problem, or infrastructure failure?
A junior engineer might panic and randomly restart pods hoping they'll work, immediately rebuild the container without checking logs, scale up replicas thinking more pods will help, or SSH into nodes to check system resources. These approaches fail: without a systematic process you waste time, rebuilding without a diagnosis just reproduces the problem, scaling up only creates more crashing pods, and all of them ignore the actual error messages already sitting in the Kubernetes events.
A senior SRE follows a methodical process that starts with recent changes in the first 30 seconds: check recent deployments with rollout history and recent events sorted by timestamp. Since roughly 80% of production incidents are caused by recent changes, a deployment five minutes ago is your smoking gun, and the quick fix is an immediate rollback with rollout undo. If there was no recent deployment, examine the pod logs (using the --previous flag if the container crashed before it could log anything), then inspect the pod description for events that show why it failed and for exit codes that reveal the issue type (0 clean exit, 1 application error, 137 OOMKilled, 139 segfault, 143 terminated by SIGTERM).
Step 1: Check Recent Changes (First 30 seconds)
Before diving into logs, check what changed:
# Check recent deployments
kubectl rollout history deployment/payment-processor -n payments
# Check recent events
kubectl get events -n payments --sort-by='.lastTimestamp' | tail -20
Why: 80% of production incidents are caused by recent changes. If you see a deployment 5 minutes ago, that’s your smoking gun.
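If the history does show a fresh revision, you can inspect exactly what that revision changed before deciding to roll back; a quick sketch (the revision number 3 is illustrative):
# Show the pod template recorded for a specific revision
kubectl rollout history deployment/payment-processor -n payments --revision=3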
Quick Fix: If a recent deployment caused this:
# Immediate rollback
kubectl rollout undo deployment/payment-processor -n payments
# Verify pods are recovering
kubectl get pods -n payments -w
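To confirm the rollback actually finished rather than eyeballing the watch output, rollout status blocks until the deployment is healthy or the timeout expires; a minimal sketch:
# Exits successfully once the rolled-back ReplicaSet is fully available
kubectl rollout status deployment/payment-processor -n payments --timeout=120s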
Step 2: Examine Pod Logs (Next 2 minutes)
If rollback doesn’t help or there was no recent deployment:
# Get logs from the crashing pod
kubectl logs payment-processor-6d4f7b-abc -n payments
# If the container crashed before logging anything, check previous instance
kubectl logs payment-processor-6d4f7b-abc -n payments --previous
# Check all container logs if it's a multi-container pod
kubectl logs payment-processor-6d4f7b-abc -n payments --all-containers=true
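When every replica is crashing with the same symptom, it is often faster to pull logs by label than by pod name; a minimal sketch, assuming the pods carry an app=payment-processor label (not shown in the output above):
# Previous-instance logs from all matching replicas, prefixed with the pod name
kubectl logs -n payments -l app=payment-processor --previous --tail=50 --prefix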
What to Look For:
- Application errors: Stack traces, null pointer exceptions, connection errors
- Configuration errors: “Cannot read config file”, “Environment variable X not set”
- Dependency failures: “Cannot connect to database”, “Redis timeout”
- OOM kills: “Out of memory” or sudden termination with exit code 137
Step 3: Inspect Pod Description (Next 2 minutes)
kubectl describe pod payment-processor-6d4f7b-abc -n payments
Critical Sections to Check:
- Events section, showing why the pod failed
- Last State, showing the container's exit code
Common Exit Codes:
- 0: Clean exit. This shouldn't normally cause a crash loop, but with restartPolicy: Always (the Deployment default) a container that exits 0 is still restarted, so a process that finishes instead of running continuously can still show CrashLoopBackOff
- 1: Application error (check logs)
- 137: OOMKilled (out of memory)
- 139: Segmentation fault
- 143: Terminated by SIGTERM
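If the describe output is noisy, the exit code can also be pulled straight out of the pod status; a minimal sketch using the pod name from the example above:
# Exit code of the last terminated instance of the first container
kubectl get pod payment-processor-6d4f7b-abc -n payments \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'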
Step 4: Check Dependencies and ConfigMaps/Secrets (Next 2 minutes)
# Verify ConfigMap exists and has correct data
kubectl get configmap payment-config -n payments -o yaml
# Verify Secrets exist
kubectl get secrets -n payments
# Check if the database/Redis are accessible
kubectl run debug-pod --rm -it --image=busybox -n payments -- sh
# Inside the pod:
nslookup payment-db.payments.svc.cluster.local
wget -O- payment-db.payments.svc.cluster.local:5432
# (Postgres won't answer HTTP, but "connection refused" vs. a successful TCP connect tells you whether the port is reachable)
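It is also worth checking, from your normal kubectl context rather than inside the debug pod, that the database Service actually has endpoints behind it; a quick sketch, assuming the DNS name above corresponds to a Service named payment-db:
# No addresses listed here means the Service's selector matches no ready pods
kubectl get endpoints payment-db -n payments
# The Service definition itself (port, selector)
kubectl get service payment-db -n payments -o wide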
Step 5: Check Resource Limits (Next 1 minute)
# Check if pods are being OOMKilled
kubectl describe pod payment-processor-6d4f7b-abc -n payments | grep -A 5 "Last State"
# Check current resource usage vs limits
kubectl top pods -n payments
kubectl describe deployment payment-processor -n payments | grep -A 3 "Limits"
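If the limits really are too low, they can be raised in place without hand-editing YAML; a minimal sketch (the 1Gi/512Mi values are illustrative, size them from kubectl top and the app's actual footprint):
# Triggers a new rollout with the higher memory limit
kubectl set resources deployment/payment-processor -n payments \
  --limits=memory=1Gi --requests=memory=512Mi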
Common Root Causes and Fixes
| Symptom | Root Cause | Fix |
|---|---|---|
| ErrImagePull / ImagePullBackOff | Wrong image tag or registry auth failure | Check the image name, verify the image exists, check imagePullSecrets |
| CrashLoopBackOff + exit code 1 | Application error on startup | Check logs, verify env vars and config files |
| CrashLoopBackOff + exit code 137 | Pod using more memory than its limit | Increase memory limits or fix the memory leak |
| CreateContainerConfigError | Missing ConfigMap/Secret | Verify the ConfigMap/Secret exists and is mounted correctly |
| Pods start but immediately crash | Failed health check or unreachable dependency | Check liveness/readiness probes, verify database connectivity |
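For the image-pull rows in particular, you can confirm which image the deployment is actually asking for and wire in registry credentials without editing manifests; a sketch where the registry, secret name, and credentials are illustrative:
# Which image reference is the deployment using?
kubectl get deployment payment-processor -n payments \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
# If registry auth is the problem, create a pull secret and attach it to the service account
kubectl create secret docker-registry regcred -n payments \
  --docker-server=registry.example.com --docker-username=USER --docker-password=PASS
kubectl patch serviceaccount default -n payments \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'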
Real-World Example: Database Connection Failure
Logs show:
Error: connect ECONNREFUSED payment-db:5432
Application failed to start
Root Cause: Database service name changed from payment-db to payment-database in a recent update, but the app’s environment variable still points to the old name.
Fix:
# Update the deployment's environment variable
kubectl set env deployment/payment-processor -n payments \
DATABASE_HOST=payment-database.payments.svc.cluster.local
# Verify pods are running
kubectl get pods -n payments -w
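If DATABASE_HOST is injected from a ConfigMap rather than set directly on the deployment, the same fix goes through the ConfigMap plus a restart, because environment variables are only read when the container starts; a sketch, assuming the payment-config ConfigMap from Step 4 holds that key:
# Update the key in place (the key name is an assumption)
kubectl patch configmap payment-config -n payments \
  --type merge -p '{"data":{"DATABASE_HOST":"payment-database.payments.svc.cluster.local"}}'
# Force new pods so they pick up the changed value
kubectl rollout restart deployment/payment-processor -n payments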
Practice Question
A pod is stuck in CrashLoopBackOff with exit code 137. What is the most likely cause?