Question
A container's memory usage keeps growing until it gets OOM killed. How do you diagnose and fix it?
The Scenario
Your production API container keeps getting killed after running for a few hours:
$ docker ps -a
CONTAINER ID IMAGE STATUS NAMES
a1b2c3d4e5f6 api:v2.0 Exited (137) 5 minutes ago api-service
$ docker inspect api-service --format='{{.State.OOMKilled}}'
true
$ docker stats --no-stream api-service
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM %
a1b2c3d4e5f6 api-service 2.5% 512MiB / 512MiB 100%
The container starts at 150MB memory usage but grows steadily until it hits the 512MB limit and gets killed.
The Challenge
Diagnose the memory leak and implement fixes. Explain how to monitor container memory, identify the source of leaks, and prevent OOM kills in production.
A junior engineer might simply raise the memory limit and hope the problem goes away, restart the container more often as a workaround, blame the container runtime, or conflate application-level and container-level memory issues. These approaches fail: a higher limit only delays the next OOM kill, frequent restarts cause downtime, and the leak is almost always in the application, so proper diagnosis is essential.
A senior engineer follows a systematic approach: confirm it is actually a leak (not just expected growth), profile memory inside the container, identify the leak source with tools appropriate to the language, fix the root cause, and add monitoring to catch regressions. The container is merely surfacing an application-level issue.
Step 1: Confirm It’s Actually a Leak
# Monitor memory over time
docker stats api-service
# Export metrics to analyze trend
docker stats --format "{{.MemUsage}}" api-service >> memory.log
# Check if memory grows continuously or reaches a plateau
watch -n 5 'docker stats --no-stream api-service'
Memory growth patterns (a sampling sketch follows the list):
- Leak: Continuous unbounded growth until OOM
- Cache: Growth to a plateau, then stable
- Normal: Growth during load, decrease after
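The same trend can be captured from inside the application: a small sampler, require()'d from server.js, logs RSS and heap usage at a fixed interval so the long-term shape is obvious over hours. This is a sketch, not part of the original service; the file path and interval are arbitrary choices.
// memory-sampler.js (hypothetical) - append one CSV row per minute so the
// long-term shape (unbounded leak vs. plateauing cache) is easy to plot.
const fs = require('fs');

const LOG_FILE = '/tmp/memory-samples.csv'; // illustrative path
const INTERVAL_MS = 60 * 1000;
const mb = (bytes) => Math.round(bytes / 1024 / 1024);

fs.appendFileSync(LOG_FILE, 'timestamp,rss_mb,heap_used_mb,heap_total_mb\n');
setInterval(() => {
  const m = process.memoryUsage();
  fs.appendFileSync(LOG_FILE,
    `${new Date().toISOString()},${mb(m.rss)},${mb(m.heapUsed)},${mb(m.heapTotal)}\n`);
}, INTERVAL_MS).unref(); // unref() so the sampler never keeps the server process alive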
Step 2: Profile Memory Inside Container
# Get shell access
docker exec -it api-service sh
# For Node.js - check heap usage
$ node -e "console.log(process.memoryUsage())"
# For Python - check memory
$ python -c "import resource; print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)"
# For any process - use top
$ top -b -n 1 | head -20
# Check process memory details
$ cat /proc/1/status | grep -i mem
VmRSS: 524288 kB # Resident memory
VmSize: 1048576 kB # Virtual memory
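For a Node.js service, it helps to relate these /proc numbers to what V8 itself reports. A minimal sketch using only built-in modules; run it inside the container with docker exec (the filename is arbitrary).
// heap-vs-rss.js - compare V8 heap statistics with the RSS that the cgroup
// limit (and /proc/<pid>/status) actually accounts for.
const v8 = require('v8');

const mb = (bytes) => `${Math.round(bytes / 1024 / 1024)} MB`;
const heap = v8.getHeapStatistics();
const mem = process.memoryUsage();

console.log('heap used      :', mb(heap.used_heap_size));
console.log('heap total     :', mb(heap.total_heap_size));
console.log('heap size limit:', mb(heap.heap_size_limit)); // governed by --max-old-space-size
console.log('rss            :', mb(mem.rss));              // what the container limit sees
console.log('external       :', mb(mem.external));         // Buffers and other memory outside the V8 heap
If RSS climbs while the heap stays flat, suspect external allocations (Buffers, native addons) rather than JavaScript objects.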
Step 3: Identify Leak Source (Node.js Example)
# Add debugging flags to CMD
CMD ["node", "--expose-gc", "--max-old-space-size=400", "server.js"]
// Add memory monitoring endpoint
app.get('/debug/memory', (req, res) => {
const used = process.memoryUsage();
res.json({
heapTotal: `${Math.round(used.heapTotal / 1024 / 1024)} MB`,
heapUsed: `${Math.round(used.heapUsed / 1024 / 1024)} MB`,
external: `${Math.round(used.external / 1024 / 1024)} MB`,
rss: `${Math.round(used.rss / 1024 / 1024)} MB`
});
});
// Force garbage collection endpoint (dev only)
app.post('/debug/gc', (req, res) => {
if (global.gc) {
global.gc();
res.json({ status: 'GC triggered' });
} else {
res.status(400).json({ error: 'GC not exposed. Run with --expose-gc' });
}
});
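Once growth is confirmed, heap snapshots show which objects are being retained. The endpoint below is a sketch that is not in the original service; it relies on Node's built-in v8.writeHeapSnapshot(). Take one snapshot shortly after startup and another after memory has grown, copy both out with docker cp, and compare them in Chrome DevTools (Memory tab).
// Dev-only sketch: write a V8 heap snapshot on demand. Comparing an "early"
// and a "grown" snapshot in Chrome DevTools reveals which constructors keep
// accumulating retained memory.
const v8 = require('v8');

app.post('/debug/heap-snapshot', (req, res) => {
  // writeHeapSnapshot() blocks the event loop while serializing the heap,
  // so keep this behind a dev flag or admin guard.
  const file = v8.writeHeapSnapshot();
  res.json({ file });
});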
Step 4: Common Leak Patterns and Fixes
Event Listeners Not Removed:
// LEAK: Adding listener on every request
app.get('/data', (req, res) => {
eventEmitter.on('data', handler); // Never removed!
});
// FIX: declare the handler first, then either auto-remove it with once()
// or keep on() and remove it when the request closes
app.get('/data', (req, res) => {
  const handler = (data) => { /* ... */ };
  eventEmitter.once('data', handler);
  // OR keep the listener attached and clean up explicitly:
  // eventEmitter.on('data', handler);
  // req.on('close', () => eventEmitter.off('data', handler));
});
Unclosed Database Connections:
// LEAK: Connections never returned to pool
async function query(sql) {
const conn = await pool.getConnection();
const result = await conn.query(sql);
// Connection never released!
return result;
}
// FIX: Always release connections
async function query(sql) {
const conn = await pool.getConnection();
try {
return await conn.query(sql);
} finally {
conn.release();
}
}
Growing Caches:
// LEAK: Unbounded cache
const cache = {};
function getCached(key) {
if (!cache[key]) {
cache[key] = expensiveComputation(key);
}
return cache[key];
}
// FIX: Use LRU cache with max size
const LRU = require('lru-cache');
const cache = new LRU({ max: 500 });
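With the bounded cache in place, getCached() keeps the same shape. A sketch using only has/get/set, which are stable across lru-cache major versions (the constructor options above vary between versions):
// Same lookup helper, now backed by the bounded LRU so entries are evicted
// instead of accumulating forever.
function getCached(key) {
  if (!cache.has(key)) {
    cache.set(key, expensiveComputation(key));
  }
  return cache.get(key);
}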
Container Memory Limits
# docker-compose.yml
services:
  api:
    image: api:v2.0
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 256M
    # Swap disabled for predictable behavior
    memswap_limit: 512M
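Inside the container, the application can read its own cgroup memory limit and size caches or alert thresholds relative to it instead of hard-coding numbers. A sketch assuming a Linux host; the cgroup v2 path is tried first, then the v1 path.
// Read the container's own memory limit so application thresholds track
// whatever limit the orchestrator sets, instead of hard-coding 512 MiB.
const fs = require('fs');

function containerMemoryLimitBytes() {
  const candidates = [
    '/sys/fs/cgroup/memory.max',                   // cgroup v2
    '/sys/fs/cgroup/memory/memory.limit_in_bytes', // cgroup v1
  ];
  for (const path of candidates) {
    try {
      const raw = fs.readFileSync(path, 'utf8').trim();
      if (raw === 'max') return null;              // v2: no limit set
      const n = Number(raw);
      if (n > 0 && n < Number.MAX_SAFE_INTEGER) return n; // v1 reports "no limit" as a huge number
    } catch (_) {
      // this cgroup layout is not present on the host; try the next one
    }
  }
  return null; // not running under a memory limit (or not in a container)
}

const limit = containerMemoryLimitBytes();
console.log('container memory limit:',
  limit ? `${Math.round(limit / 1024 / 1024)} MiB` : 'none detected');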
Memory Monitoring Stack
# Add cAdvisor for container metrics
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8081:8080"
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
OOM Prevention Strategies
| Strategy | Implementation |
|---|---|
| Set appropriate limits | Match actual app needs, not arbitrary numbers |
| Application-level limits | --max-old-space-size for Node.js |
| Memory monitoring | Prometheus + Grafana alerts |
| Graceful degradation | Shed load before OOM (see the sketch after this table) |
| Restart policies | restart: on-failure:3 |
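A minimal load-shedding sketch for the "Graceful degradation" row, assuming an Express app: once RSS crosses a soft threshold below the container limit, non-critical requests are rejected with 503 instead of letting the kernel OOM-kill the whole process. The threshold value is illustrative.
// Shed non-critical load before the kernel OOM-kills the whole container:
// sample RSS on an interval (process.memoryUsage() is not free per-request)
// and reject new work with 503 once a soft threshold is crossed.
const SOFT_LIMIT_BYTES = 450 * 1024 * 1024; // illustrative: ~88% of the 512 MiB limit

let currentRss = process.memoryUsage().rss;
setInterval(() => { currentRss = process.memoryUsage().rss; }, 5000).unref();

app.use((req, res, next) => {
  if (currentRss > SOFT_LIMIT_BYTES && req.path !== '/health') {
    res.set('Retry-After', '5');
    return res.status(503).json({ error: 'temporarily shedding load' });
  }
  next();
});
Register this middleware before the route handlers so it intercepts requests early; exempting /health keeps the orchestrator's liveness checks passing while load is shed.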
Practice Question
A Node.js container is OOM killed with exit code 137. Which flag helps prevent this by limiting the JavaScript heap size?