Question
A container's memory usage keeps growing until it gets OOM killed. How do you diagnose and fix it?
The Scenario
Your production API container keeps getting killed after running for a few hours:
$ docker ps -a
CONTAINER ID IMAGE STATUS NAMES
a1b2c3d4e5f6 api:v2.0 Exited (137) 5 minutes ago api-service
$ docker inspect api-service --format='{{.State.OOMKilled}}'
true
$ docker stats --no-stream api-service
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM %
a1b2c3d4e5f6 api-service 2.5% 512MiB / 512MiB 100%
The container starts at 150MB memory usage but grows steadily until it hits the 512MB limit and gets killed.
The Challenge
Diagnose the memory leak and implement fixes. Explain how to monitor container memory, identify the source of leaks, and prevent OOM kills in production.
A junior engineer might simply raise the memory limit and hope the problem goes away, restart the container more often as a workaround, blame the container runtime, or conflate application-level and container-level memory issues. These approaches fail: a higher limit only delays the next OOM kill, frequent restarts cause downtime, and the leak is almost always in the application, so proper diagnosis is essential.
A senior engineer follows a systematic approach: confirm it is actually a leak (not just expected growth), profile memory inside the container, identify the leak source with tools appropriate to the language, fix the root cause, and add monitoring to catch regressions. The container is merely surfacing an application-level issue.
Step 1: Confirm It’s Actually a Leak
# Monitor memory over time
docker stats api-service
# Export metrics to analyze trend
docker stats --format "{{.MemUsage}}" api-service >> memory.log
# Check if memory grows continuously or reaches a plateau
watch -n 5 'docker stats --no-stream api-service'
Memory growth patterns (a sampling sketch follows the list):
- Leak: Continuous unbounded growth until OOM
- Cache: Growth to a plateau, then stable
- Normal: Growth during load, decrease after
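The same trend can be captured from inside the application: a small sampler, require()'d from server.js, logs RSS and heap usage at a fixed interval so the long-term shape is obvious over hours. This is a sketch, not part of the original service; the file path and interval are arbitrary choices.
// memory-sampler.js (hypothetical) - append one CSV row per minute so the
// long-term shape (unbounded leak vs. plateauing cache) is easy to plot.
const fs = require('fs');

const LOG_FILE = '/tmp/memory-samples.csv'; // illustrative path
const INTERVAL_MS = 60 * 1000;
const mb = (bytes) => Math.round(bytes / 1024 / 1024);

fs.appendFileSync(LOG_FILE, 'timestamp,rss_mb,heap_used_mb,heap_total_mb\n');
setInterval(() => {
  const m = process.memoryUsage();
  fs.appendFileSync(LOG_FILE,
    `${new Date().toISOString()},${mb(m.rss)},${mb(m.heapUsed)},${mb(m.heapTotal)}\n`);
}, INTERVAL_MS).unref(); // unref() so the sampler never keeps the server process alive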
Step 2: Profile Memory Inside Container
# Get shell access
docker exec -it api-service sh
# For Node.js - check heap usage
$ node -e "console.log(process.memoryUsage())"
# For Python - check memory
$ python -c "import resource; print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)"
# For any process - use top
$ top -b -n 1 | head -20
# Check process memory details
$ cat /proc/1/status | grep -i mem
VmRSS: 524288 kB # Resident memory
VmSize: 1048576 kB # Virtual memory
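For a Node.js service, it helps to relate these /proc numbers to what V8 itself reports. A minimal sketch using only built-in modules; run it inside the container with docker exec (the filename is arbitrary).
// heap-vs-rss.js - compare V8 heap statistics with the RSS that the cgroup
// limit (and /proc/<pid>/status) actually accounts for.
const v8 = require('v8');

const mb = (bytes) => `${Math.round(bytes / 1024 / 1024)} MB`;
const heap = v8.getHeapStatistics();
const mem = process.memoryUsage();

console.log('heap used      :', mb(heap.used_heap_size));
console.log('heap total     :', mb(heap.total_heap_size));
console.log('heap size limit:', mb(heap.heap_size_limit)); // governed by --max-old-space-size
console.log('rss            :', mb(mem.rss));              // what the container limit sees
console.log('external       :', mb(mem.external));         // Buffers and other memory outside the V8 heap
If RSS climbs while the heap stays flat, suspect external allocations (Buffers, native addons) rather than JavaScript objects.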
Step 3: Identify Leak Source (Node.js Example)
# Add debugging flags to CMD
CMD ["node", "--expose-gc", "--max-old-space-size=400", "server.js"]
// Add memory monitoring endpoint
app.get('/debug/memory', (req, res) => {
const used = process.memoryUsage();
res.json({
heapTotal: `${Math.round(used.heapTotal / 1024 / 1024)} MB`,
heapUsed: `${Math.round(used.heapUsed / 1024 / 1024)} MB`,
external: `${Math.round(used.external / 1024 / 1024)} MB`,
rss: `${Math.round(used.rss / 1024 / 1024)} MB`
});
});
// Force garbage collection endpoint (dev only)
app.post('/debug/gc', (req, res) => {
if (global.gc) {
global.gc();
res.json({ status: 'GC triggered' });
} else {
res.status(400).json({ error: 'GC not exposed. Run with --expose-gc' });
}
});
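Once growth is confirmed, heap snapshots show which objects are being retained. The endpoint below is a sketch that is not in the original service; it relies on Node's built-in v8.writeHeapSnapshot(). Take one snapshot shortly after startup and another after memory has grown, copy both out with docker cp, and compare them in Chrome DevTools (Memory tab).
// Dev-only sketch: write a V8 heap snapshot on demand. Comparing an "early"
// and a "grown" snapshot in Chrome DevTools reveals which constructors keep
// accumulating retained memory.
const v8 = require('v8');

app.post('/debug/heap-snapshot', (req, res) => {
  // writeHeapSnapshot() blocks the event loop while serializing the heap,
  // so keep this behind a dev flag or admin guard.
  const file = v8.writeHeapSnapshot();
  res.json({ file });
});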
Step 4: Common Leak Patterns and Fixes
Event Listeners Not Removed:
// LEAK: Adding listener on every request
app.get('/data', (req, res) => {
eventEmitter.on('data', handler); // Never removed!
});
// FIX: declare the handler first, then either auto-remove it with once()
// or keep on() and remove it when the request closes
app.get('/data', (req, res) => {
  const handler = (data) => { /* ... */ };
  eventEmitter.once('data', handler);
  // OR keep the listener attached and clean up explicitly:
  // eventEmitter.on('data', handler);
  // req.on('close', () => eventEmitter.off('data', handler));
});
Unclosed Database Connections:
// LEAK: Connections never returned to pool
async function query(sql) {
const conn = await pool.getConnection();
const result = await conn.query(sql);
// Connection never released!
return result;
}
// FIX: Always release connections
async function query(sql) {
const conn = await pool.getConnection();
try {
return await conn.query(sql);
} finally {
conn.release();
}
}
Growing Caches:
// LEAK: Unbounded cache
const cache = {};
function getCached(key) {
if (!cache[key]) {
cache[key] = expensiveComputation(key);
}
return cache[key];
}
// FIX: Use LRU cache with max size
const LRU = require('lru-cache');
const cache = new LRU({ max: 500 });
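With the bounded cache in place, getCached() keeps the same shape. A sketch using only has/get/set, which are stable across lru-cache major versions (the constructor options above vary between versions):
// Same lookup helper, now backed by the bounded LRU so entries are evicted
// instead of accumulating forever.
function getCached(key) {
  if (!cache.has(key)) {
    cache.set(key, expensiveComputation(key));
  }
  return cache.get(key);
}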
Container Memory Limits
# docker-compose.yml
services:
  api:
    image: api:v2.0
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 256M
    # Swap disabled for predictable behavior
    memswap_limit: 512M
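Inside the container, the application can read its own cgroup memory limit and size caches or alert thresholds relative to it instead of hard-coding numbers. A sketch assuming a Linux host; the cgroup v2 path is tried first, then the v1 path.
// Read the container's own memory limit so application thresholds track
// whatever limit the orchestrator sets, instead of hard-coding 512 MiB.
const fs = require('fs');

function containerMemoryLimitBytes() {
  const candidates = [
    '/sys/fs/cgroup/memory.max',                   // cgroup v2
    '/sys/fs/cgroup/memory/memory.limit_in_bytes', // cgroup v1
  ];
  for (const path of candidates) {
    try {
      const raw = fs.readFileSync(path, 'utf8').trim();
      if (raw === 'max') return null;              // v2: no limit set
      const n = Number(raw);
      if (n > 0 && n < Number.MAX_SAFE_INTEGER) return n; // v1 reports "no limit" as a huge number
    } catch (_) {
      // this cgroup layout is not present on the host; try the next one
    }
  }
  return null; // not running under a memory limit (or not in a container)
}

const limit = containerMemoryLimitBytes();
console.log('container memory limit:',
  limit ? `${Math.round(limit / 1024 / 1024)} MiB` : 'none detected');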
Memory Monitoring Stack
# Add cAdvisor for container metrics
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8081:8080"
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
OOM Prevention Strategies
| Strategy | Implementation |
|---|---|
| Set appropriate limits | Match actual app needs, not arbitrary numbers |
| Application-level limits | --max-old-space-size for Node.js |
| Memory monitoring | Prometheus + Grafana alerts |
| Graceful degradation | Shed load before OOM (see the sketch after this table) |
| Restart policies | restart: on-failure:3 |
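A minimal load-shedding sketch for the "Graceful degradation" row, assuming an Express app: once RSS crosses a soft threshold below the container limit, non-critical requests are rejected with 503 instead of letting the kernel OOM-kill the whole process. The threshold value is illustrative.
// Shed non-critical load before the kernel OOM-kills the whole container:
// sample RSS on an interval (process.memoryUsage() is not free per-request)
// and reject new work with 503 once a soft threshold is crossed.
const SOFT_LIMIT_BYTES = 450 * 1024 * 1024; // illustrative: ~88% of the 512 MiB limit

let currentRss = process.memoryUsage().rss;
setInterval(() => { currentRss = process.memoryUsage().rss; }, 5000).unref();

app.use((req, res, next) => {
  if (currentRss > SOFT_LIMIT_BYTES && req.path !== '/health') {
    res.set('Retry-After', '5');
    return res.status(503).json({ error: 'temporarily shedding load' });
  }
  next();
});
Register this middleware before the route handlers so it intercepts requests early; exempting /health keeps the orchestrator's liveness checks passing while load is shed.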
Practice Question
A Node.js container is OOM killed with exit code 137. Which flag helps prevent this by limiting the JavaScript heap size?