
Pods are being OOMKilled in production. How do you diagnose and prevent this?


The Scenario

It’s Monday morning. Your analytics service—which processes customer data for reporting—keeps crashing. The on-call logs show pods restarting every 10-15 minutes during peak hours.

When you check the cluster:

$ kubectl get pods -n analytics
NAME                           READY   STATUS      RESTARTS   AGE
analytics-worker-7d9f8c-abc    0/1     OOMKilled   12         45m
analytics-worker-7d9f8c-def    1/1     Running     8          45m
analytics-worker-7d9f8c-ghi    0/1     OOMKilled   10         45m

The logs don’t show application errors—pods just suddenly restart. Users are complaining that reports are incomplete or missing data. Your VP of Product is asking for an ETA on the fix.

Current deployment configuration:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-worker
  namespace: analytics
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: worker
        image: company/analytics-worker:v2.1
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "512Mi"  # Same as request
            cpu: "1000m"

The Challenge

  1. Diagnose: Prove that OOMKilled is actually the issue and identify the memory consumption pattern
  2. Immediate fix: Get the service stable ASAP
  3. Root cause: Why is memory usage increasing?
  4. Long-term solution: Prevent this from happening again

Walk through your complete debugging and remediation process.


Wrong Approach: Just Increase Memory

The wrong approach is to blindly increase the memory limit:

resources:
  limits:
    memory: "4Gi"  # Just make it bigger

Problems with this approach:

  • Doesn’t address root cause (memory leak still exists)
  • Wastes cluster resources
  • Problem will return when memory reaches 4Gi
  • No monitoring or alerting setup
  • Doesn’t understand WHY pods are OOMKilled

Right Approach: Systematic Debugging and Comprehensive Solution

This is one of the most common production issues in Kubernetes. Here’s how senior SREs handle it:

Phase 1: Confirm OOMKilled (30 seconds)

# Check pod status
kubectl get pods -n analytics

# Describe pod to see exit code
kubectl describe pod analytics-worker-7d9f8c-abc -n analytics

# Look for this in the output:
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 15 Jan 2024 09:00:00 +0000
  Finished:     Mon, 15 Jan 2024 09:12:34 +0000

Exit Code 137 = 128 + 9 (SIGKILL). On its own, 137 only tells you the process was force-killed; combined with Reason: OOMKilled in the pod's last state, it confirms the kernel killed the container for exceeding its memory limit.
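
If several pods are crash-looping and you only want the termination details, a jsonpath query can pull them straight from pod status. The pod name below is the one from the scenario; adjust as needed:

# Print the last termination state (reason, exit code, timestamps) for one pod
kubectl get pod analytics-worker-7d9f8c-abc -n analytics \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}{"\n"}'

# List the last termination reason for every pod in the namespace
kubectl get pods -n analytics \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'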

Phase 2: Check Resource Metrics (Next 2 minutes)

# Check current memory usage across all pods
kubectl top pods -n analytics

NAME                           CPU(cores)   MEMORY(bytes)
analytics-worker-7d9f8c-abc    450m         498Mi
analytics-worker-7d9f8c-def    380m         512Mi  # At limit!
analytics-worker-7d9f8c-ghi    520m         501Mi

# Check node capacity
kubectl top nodes

# Get detailed resource info
kubectl describe node <node-name> | grep -A 5 "Allocated resources"

Phase 3: Analyze Historical Memory Usage

If you have Prometheus + Grafana, run these queries:

# Memory usage over time
container_memory_usage_bytes{
  namespace="analytics",
  pod=~"analytics-worker.*"
}

# Memory usage as % of limit
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100

What to look for:

  • Gradual increase: Memory leak in application
  • Sudden spikes: Processing large datasets
  • Cyclic pattern: Batch jobs running periodically
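
One way to tell these patterns apart is to look at the rate of change of pod memory over a longer window. A minimal PromQL sketch, assuming the same cAdvisor metrics as above are being scraped (container_memory_working_set_bytes is a reasonable proxy for what the OOM killer sees; substitute container_memory_usage_bytes to match the queries above):

# Approximate memory growth rate (bytes/second) over the last hour.
# A consistently positive slope across restarts points at a leak;
# a slope that swings around zero points at spiky or cyclic batch work.
deriv(
  container_memory_working_set_bytes{
    namespace="analytics",
    pod=~"analytics-worker.*"
  }[1h]
)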

Phase 4: Examine Application Logs

# Check logs from the crashed pod
kubectl logs analytics-worker-7d9f8c-abc -n analytics --previous

# Look for memory-related errors before crash
# Common patterns:
# - "heap out of memory"
# - "cannot allocate memory"
# - Processing large files: "loading 10GB CSV file"

Immediate Fix: Increase Memory Limits

Short-term solution (deploy immediately):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: analytics-worker
  namespace: analytics
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: worker
        image: company/analytics-worker:v2.1
        resources:
          requests:
            memory: "512Mi"   # Request stays same (guaranteed memory)
            cpu: "500m"
          limits:
            memory: "2Gi"     # Increased 4x (max burst capacity)
            cpu: "2000m"

Why this works:

  • Gives pods more headroom during memory spikes
  • Prevents OOMKilled during peak processing
  • Pods can burst above request but stay within limit

Deploy the fix:

kubectl apply -f deployment.yaml
kubectl rollout status deployment/analytics-worker -n analytics
kubectl get pods -n analytics -w

Understanding Kubernetes QoS Classes

Kubernetes assigns pods to QoS (Quality of Service) classes based on resources:

1. Guaranteed (Highest Priority)

resources:
  requests:
    memory: "1Gi"
    cpu: "1000m"
  limits:
    memory: "1Gi"  # Same as request
    cpu: "1000m"   # Same as request
  • Gets evicted last
  • Best for critical workloads
  • But: No burst capacity!

2. Burstable (Medium Priority)

resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "2Gi"   # Higher than request
    cpu: "2000m"
  • Can burst above request
  • Good for variable workloads
  • Gets evicted after BestEffort

3. BestEffort (Lowest Priority - AVOID)

resources: {}  # No requests or limits
  • Gets evicted first
  • Never use in production!

Our fix uses Burstable QoS:

  • Guaranteed 512Mi (request)
  • Can burst to 2Gi (limit) during processing
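
After rolling out the change, you can confirm which class Kubernetes actually assigned, since qosClass is recorded in the pod status:

# Expect "Burstable" for every analytics pod after the fix
kubectl get pods -n analytics -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass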

Root Cause Analysis: Memory Leak Investigation

Check application code for memory leaks:

// ❌ MEMORY LEAK - Array grows forever
const processedRecords = [];

async function processData() {
  while (true) {
    const batch = await fetchNextBatch();
    processedRecords.push(...batch);  // Never cleared!
    await generateReport(processedRecords);
  }
}

// ✅ FIX - Clear array after processing
const processedRecords = [];

async function processData() {
  while (true) {
    const batch = await fetchNextBatch();
    processedRecords.push(...batch);
    await generateReport(processedRecords);
    processedRecords.length = 0;  // Clear memory
  }
}

Common memory leak patterns:

  1. Event listeners not removed
  2. Caching without an eviction policy (see the sketch after this list)
  3. Large objects kept in memory
  4. Database connections not closed
  5. Timers/intervals not cleared
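
To illustrate pattern 2, here is a minimal sketch of an unbounded in-process cache versus one with a crude size cap. The names (reportCache, MAX_ENTRIES) are hypothetical, not taken from the analytics codebase:

// ❌ Unbounded cache - every unique key stays in memory for the life of the process
const reportCache = new Map();

function getReportCached(key, compute) {
  if (!reportCache.has(key)) {
    reportCache.set(key, compute(key));
  }
  return reportCache.get(key);
}

// ✅ Crude size cap - evict the oldest entry once the cache is "full"
// (a real service would likely use a maintained LRU library or TTL-based eviction)
const MAX_ENTRIES = 1000;
const boundedCache = new Map();

function getReportBounded(key, compute) {
  if (!boundedCache.has(key)) {
    if (boundedCache.size >= MAX_ENTRIES) {
      // Map iterates in insertion order, so the first key is the oldest
      const oldestKey = boundedCache.keys().next().value;
      boundedCache.delete(oldestKey);
    }
    boundedCache.set(key, compute(key));
  }
  return boundedCache.get(key);
}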

Node.js specific debugging:

# Enable the Node.js inspector. Note: this starts a new Node process inside the
# pod; to debug the already-running worker instead, send it SIGUSR1, which makes
# Node.js start listening for a debugger.
kubectl exec -it analytics-worker-7d9f8c-abc -n analytics -- \
  node --expose-gc --inspect=0.0.0.0:9229 app.js

# Port-forward to access debugger
kubectl port-forward analytics-worker-7d9f8c-abc 9229:9229 -n analytics

# Open Chrome DevTools → Memory tab → Take heap snapshot
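
If attaching a debugger to a production pod is not practical, a lighter-weight option is to log heap statistics periodically and watch for heapUsed climbing across intervals. A minimal sketch using Node's built-in process.memoryUsage(); this is illustrative, not code from the analytics worker:

// Log memory stats every 30 seconds; a heapUsed value that climbs across
// many intervals (and never drops after GC) points at a leak.
setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  const toMiB = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  console.log(
    `memory rss=${toMiB(rss)}MiB heapUsed=${toMiB(heapUsed)}MiB ` +
    `heapTotal=${toMiB(heapTotal)}MiB external=${toMiB(external)}MiB`
  );
}, 30_000);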

Long-Term Solutions

1. Implement Memory Monitoring with Alerts

# Prometheus alert rule
groups:
- name: kubernetes-memory
  rules:
  - alert: PodHighMemoryUsage
    expr: |
      (container_memory_usage_bytes / container_spec_memory_limit_bytes) > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod }} is using > 90% memory"
      description: "Memory usage: {{ $value | humanizePercentage }}"

2. Set Appropriate Resource Requests and Limits

Best practice formula:

Request = Average usage + 20% buffer
Limit = Peak usage + 30% buffer

Example calculation:

Average memory usage: 400Mi (from metrics)
Request: 400Mi * 1.2 = 480Mi → Round to 512Mi

Peak usage during processing: 1.5Gi (from metrics)
Limit: 1.5Gi * 1.3 = 1.95Gi → Round to 2Gi
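
If Prometheus is already scraping cAdvisor metrics (as assumed in Phase 3), the average and peak inputs to this formula can be pulled directly rather than estimated. A sketch, with the lookback window adjusted to your retention:

# Average memory per pod over the last 7 days - input for the request
avg_over_time(
  container_memory_working_set_bytes{namespace="analytics", pod=~"analytics-worker.*"}[7d]
)

# Peak memory per pod over the last 7 days - input for the limit
max_over_time(
  container_memory_working_set_bytes{namespace="analytics", pod=~"analytics-worker.*"}[7d]
)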

3. Implement Horizontal Pod Autoscaling

Instead of making pods bigger, make more pods:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: analytics-worker-hpa
  namespace: analytics
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-worker
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70  # Scale out when average usage exceeds 70% of the memory request
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100  # Double pods at once if needed
        periodSeconds: 60
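
Once the HPA is applied, you can watch its view of memory utilization and its scaling decisions:

# Current target vs. actual utilization and replica count
kubectl get hpa analytics-worker-hpa -n analytics

# Recent scaling events and any metric errors
kubectl describe hpa analytics-worker-hpa -n analytics

Keep in mind that scaling out only helps when per-pod memory tracks load (more queued reports spread across more workers); it will not rescue a single pod with a leak.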

4. Optimize Application Memory Usage

// Process data in smaller chunks instead of all at once
async function processDataOptimized() {
  const CHUNK_SIZE = 1000;

  while (true) {
    // Fetch small batch
    const batch = await fetchNextBatch(CHUNK_SIZE);
    if (batch.length === 0) break;

    // Process batch
    await processBatch(batch);

    // Batch is garbage collected after loop iteration
  }
}

// Use streams for large files
const fs = require('fs');
const readline = require('readline');

async function processLargeFile(filePath) {
  const fileStream = fs.createReadStream(filePath);
  const rl = readline.createInterface({
    input: fileStream,
    crlfDelay: Infinity
  });

  for await (const line of rl) {
    // Process one line at a time
    await processLine(line);
    // Memory is released after each line
  }
}

5. Enable Vertical Pod Autoscaler (VPA) for Automatic Tuning

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: analytics-worker-vpa
  namespace: analytics
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-worker
  updatePolicy:
    updateMode: "Auto"  # Automatically adjust requests/limits
  resourcePolicy:
    containerPolicies:
    - containerName: worker
      minAllowed:
        memory: "512Mi"
      maxAllowed:
        memory: "4Gi"
      controlledResources: ["memory"]

VPA will:

  • Monitor actual memory usage
  • Automatically adjust requests/limits
  • Prevent OOMKills by increasing limits proactively
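
One caveat: the VPA documentation advises against running VPA and HPA on the same CPU or memory metric, so if you keep the memory-based HPA from step 3, consider starting VPA in recommendation-only mode (updateMode: "Off") and applying its suggestions manually. To see what it recommends:

# Inspect VPA's current request/limit recommendations for the worker container
kubectl describe vpa analytics-worker-vpa -n analytics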

Production Checklist: Preventing OOMKilled

  • Set memory requests based on average usage + buffer
  • Set memory limits based on peak usage + buffer
  • Use Burstable QoS (requests < limits) for variable workloads
  • Implement memory usage alerts (> 80% for warning, > 90% for critical)
  • Enable HPA to scale out instead of up
  • Consider VPA for automatic resource tuning
  • Profile application for memory leaks
  • Process large datasets in chunks/streams
  • Monitor with Prometheus + Grafana
  • Test under production-like load before deploying

Practice Question

A pod has memory request of 512Mi and limit of 1Gi. The node has 2Gi total memory. What happens when the pod tries to use 1.5Gi?