ECS tasks are failing with exit code 137 and health check failures. Debug the container issues.
The Scenario
Your ECS service is unstable:
Current Problems:
├── Tasks constantly restarting
├── Exit code 137 (OOMKilled)
├── Health check failures: 50%
├── Container Insights: memory spikes to 100%
├── Task definition: 512 CPU, 1024 MB memory
├── Application: Node.js API server
└── Error: "Container killed due to memory"
The Challenge
Debug and fix the ECS task failures, optimize container resource allocation, and implement proper health checks and monitoring.
Wrong Approach
A junior engineer might just double the memory allocation, disable health checks, or ignore the OOM errors. These approaches waste resources, hide real problems, and don't address root causes like memory leaks or incorrect health check configuration.
Right Approach
A senior engineer analyzes memory usage patterns, profiles the application, configures appropriate resource limits with headroom, implements proper health checks with realistic timeouts, and uses Container Insights for monitoring.
Step 1: Understand Exit Codes
Common ECS Exit Codes:
├── 0: Normal exit (success)
├── 1: Application error
├── 137: SIGKILL (OOMKilled or manual stop)
├── 139: SIGSEGV (Segmentation fault)
├── 143: SIGTERM (Graceful shutdown)
└── 255: Exit status out of range
Exit code 137 = 128 + 9 (SIGKILL)
- Container exceeded memory limit
- Killed by the orchestrator
Step 2: Debug with ECS Exec
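Before exec'ing into a container, it is worth confirming the stop reason from the ECS API itself. A quick check, reusing the cluster, service, and example task ARN from this scenario (stopped tasks remain visible in the API for roughly an hour):
# List recently stopped tasks for the service
aws ecs list-tasks \
--cluster production \
--service-name api-service \
--desired-status STOPPED
# Inspect exit codes and the stopped reason for one task
aws ecs describe-tasks \
--cluster production \
--tasks arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
--query 'tasks[].{stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}'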
# Enable ECS Exec on the service (only tasks launched after this change can be exec'd into)
aws ecs update-service \
--cluster production \
--service api-service \
--enable-execute-command
# Execute into running container
aws ecs execute-command \
--cluster production \
--task arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
--container api \
--interactive \
--command "/bin/sh"
# Inside container: check memory usage (cgroup v1 paths; on cgroup v2 use memory.current and memory.max)
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# Check Node.js heap
node -e "console.log(process.memoryUsage())"
# Check for memory leaks
node --inspect app.js
# Then use Chrome DevTools to analyze heap snapshots
Step 3: Fix Task Definition
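Before settling on new limits, it helps to see how much memory the service actually uses over time. One way, assuming Container Insights is already enabled on the cluster (it is enabled in Step 7), is to pull the service-level memory metric from CloudWatch (GNU date syntax for the time range; values are reported in MB):
# Average and peak memory used by the service over the last 3 hours
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=production Name=ServiceName,Value=api-service \
--start-time "$(date -u -d '3 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 300 \
--statistics Average Maximum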
resource "aws_ecs_task_definition" "api" {
family = "api-task"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "1024" # 1 vCPU
memory = "2048" # 2 GB
execution_role_arn = aws_iam_role.ecs_execution.arn
task_role_arn = aws_iam_role.ecs_task.arn
container_definitions = jsonencode([
{
name = "api"
image = "${aws_ecr_repository.api.repository_url}:latest"
essential = true
# Resource limits
cpu = 896 # Leave some for sidecar
memory = 1792 # Leave headroom for container overhead
# Memory reservation (soft limit)
memoryReservation = 1536
portMappings = [
{
containerPort = 3000
protocol = "tcp"
}
]
# Environment for Node.js memory management
environment = [
{
name = "NODE_OPTIONS"
value = "--max-old-space-size=1536" # 75% of memory limit
},
{
name = "NODE_ENV"
value = "production"
}
]
# Health check (the container image must include curl for this command)
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60 # Give app time to start
}
# Logging
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/api"
"awslogs-region" = "us-east-1"
"awslogs-stream-prefix" = "api"
}
}
# Graceful shutdown
stopTimeout = 30
# Linux parameters
linuxParameters = {
initProcessEnabled = true # Enable init process for proper signal handling
}
},
# Sidecar for metrics
{
name = "datadog-agent"
image = "datadog/agent:latest"
essential = false
cpu = 128
memory = 256
environment = [
{
name = "DD_API_KEY"
value = "from-secrets-manager"
},
{
name = "ECS_FARGATE"
value = "true"
}
]
}
])
}
Step 4: Fix ALB Health Checks
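Before changing anything, it helps to see exactly why the load balancer is marking targets unhealthy. A quick check using the api-tg name from the target group below:
# Look up the target group and show per-target health and failure reasons
TG_ARN=$(aws elbv2 describe-target-groups \
--names api-tg \
--query 'TargetGroups[0].TargetGroupArn' --output text)
aws elbv2 describe-target-health --target-group-arn "$TG_ARN"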
resource "aws_lb_target_group" "api" {
name = "api-tg"
port = 3000
protocol = "HTTP"
vpc_id = aws_vpc.main.id
target_type = "ip"
health_check {
enabled = true
healthy_threshold = 2
unhealthy_threshold = 3
timeout = 5
interval = 30
path = "/health"
port = "traffic-port"
protocol = "HTTP"
matcher = "200"
}
# Important for graceful deployments
deregistration_delay = 30
stickiness {
type = "lb_cookie"
cookie_duration = 86400
enabled = false
}
tags = {
Name = "api-target-group"
}
}
# ECS Service with proper deployment configuration
resource "aws_ecs_service" "api" {
name = "api-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.api.arn
desired_count = 3
launch_type = "FARGATE"
# Enable ECS Exec for debugging
enable_execute_command = true
network_configuration {
subnets = aws_subnet.private[*].id
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.api.arn
container_name = "api"
container_port = 3000
}
# Deployment configuration (top-level arguments in the Terraform AWS provider)
deployment_maximum_percent = 200
deployment_minimum_healthy_percent = 100
deployment_circuit_breaker {
enable = true
rollback = true
}
# Service discovery (optional)
service_registries {
registry_arn = aws_service_discovery_service.api.arn
}
# Give new tasks time to start before failed ALB health checks cause them to be killed
health_check_grace_period_seconds = 60
lifecycle {
ignore_changes = [desired_count] # Allow auto-scaling to manage
}
}
Step 5: Implement Proper Health Check Endpoint
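The routes below split a cheap liveness check from a deeper readiness check. Once deployed, they can be exercised from inside a running task (via ECS Exec from Step 2) as a quick sanity check:
# From an ECS Exec shell inside the container
curl -s http://localhost:3000/health
curl -s http://localhost:3000/health/ready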
// health.js - Comprehensive health check
const express = require('express');
const router = express.Router();
// Assumed app modules exposing the existing database and cache clients
const db = require('./db');
const redis = require('./redis');
// Simple liveness check (for container health)
router.get('/health', (req, res) => {
res.status(200).json({ status: 'ok' });
});
// Detailed readiness check (for ALB)
router.get('/health/ready', async (req, res) => {
const checks = {
database: false,
cache: false,
memory: false,
};
try {
// Check database connection
await db.query('SELECT 1');
checks.database = true;
} catch (err) {
console.error('Database health check failed:', err);
}
try {
// Check Redis connection
await redis.ping();
checks.cache = true;
} catch (err) {
console.error('Cache health check failed:', err);
}
// Check memory usage (fail if over 90%)
const used = process.memoryUsage();
const heapUsedPercent = (used.heapUsed / used.heapTotal) * 100;
checks.memory = heapUsedPercent < 90;
const allHealthy = Object.values(checks).every(Boolean);
res.status(allHealthy ? 200 : 503).json({
status: allHealthy ? 'healthy' : 'unhealthy',
checks,
memory: {
heapUsed: Math.round(used.heapUsed / 1024 / 1024) + 'MB',
heapTotal: Math.round(used.heapTotal / 1024 / 1024) + 'MB',
rss: Math.round(used.rss / 1024 / 1024) + 'MB',
},
uptime: process.uptime(),
});
});
module.exports = router;
Step 6: Graceful Shutdown
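When ECS stops a task it sends SIGTERM, waits for stopTimeout (30 seconds in the task definition above), then sends SIGKILL. The handler below can be rehearsed locally with Docker, which follows the same SIGTERM-then-SIGKILL pattern; the image and container names here are placeholders and assume a Dockerfile at the repo root:
# Build and run the API image locally, then stop it with a 30s grace period
docker build -t api-local .
docker run -d --name api-test -p 3000:3000 api-local
docker stop -t 30 api-test   # Sends SIGTERM, then SIGKILL after 30s
docker logs api-test | tail  # Should show the graceful shutdown log lines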
// server.js - Handle shutdown signals
const express = require('express');
// Assumed app modules exposing the existing database and cache clients
const db = require('./db');
const redis = require('./redis');
const app = express();
let server;
let isShuttingDown = false;
// Health check returns 503 during shutdown
app.get('/health', (req, res) => {
if (isShuttingDown) {
return res.status(503).json({ status: 'shutting down' });
}
res.status(200).json({ status: 'ok' });
});
// Start server
server = app.listen(3000, () => {
console.log('Server started on port 3000');
});
// Graceful shutdown handler
async function gracefulShutdown(signal) {
console.log(`Received ${signal}, starting graceful shutdown`);
isShuttingDown = true;
// Stop accepting new connections
server.close(async () => {
console.log('HTTP server closed');
try {
// Close database connections
await db.end();
console.log('Database connections closed');
// Close Redis connections
await redis.quit();
console.log('Redis connections closed');
process.exit(0);
} catch (err) {
console.error('Error during shutdown:', err);
process.exit(1);
}
});
// Force shutdown after timeout
setTimeout(() => {
console.error('Forced shutdown after timeout');
process.exit(1);
}, 25000); // Less than ECS stopTimeout (30s)
}
// Handle termination signals
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
Step 7: Container Insights and Monitoring
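Once Container Insights is enabled (as in the cluster config below), ECS publishes per-task performance records to a log group named /aws/ecs/containerinsights/&lt;cluster&gt;/performance. A sketch of a Logs Insights query to find the most memory-hungry tasks (GNU date for the time range; the queryId placeholder comes from the first command's output):
# Top tasks by memory over the last hour
aws logs start-query \
--log-group-name /aws/ecs/containerinsights/production/performance \
--start-time "$(date -d '1 hour ago' +%s)" \
--end-time "$(date +%s)" \
--query-string 'fields @timestamp, TaskId, MemoryUtilized | filter Type = "Task" | sort MemoryUtilized desc | limit 10'
# Fetch results with the returned queryId
aws logs get-query-results --query-id <queryId-from-previous-command>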
# Enable Container Insights
resource "aws_ecs_cluster" "main" {
name = "production"
setting {
name = "containerInsights"
value = "enabled"
}
configuration {
execute_command_configuration {
logging = "OVERRIDE"
log_configuration {
cloud_watch_log_group_name = aws_cloudwatch_log_group.ecs_exec.name
}
}
}
}
# CloudWatch alarms for ECS
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "ecs-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CpuUtilized"
namespace = "ECS/ContainerInsights"
period = 60
statistic = "Average"
threshold = 80
alarm_description = "ECS CPU utilization is high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ClusterName = aws_ecs_cluster.main.name
ServiceName = aws_ecs_service.api.name
}
}
resource "aws_cloudwatch_metric_alarm" "memory_high" {
alarm_name = "ecs-memory-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "MemoryUtilized"
namespace = "ECS/ContainerInsights"
period = 60
statistic = "Average"
threshold = 80
alarm_description = "ECS memory utilization is high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ClusterName = aws_ecs_cluster.main.name
ServiceName = aws_ecs_service.api.name
}
}
resource "aws_cloudwatch_metric_alarm" "task_count" {
alarm_name = "ecs-task-count-low"
comparison_operator = "LessThanThreshold"
evaluation_periods = 2
metric_name = "RunningTaskCount"
namespace = "ECS/ContainerInsights"
period = 60
statistic = "Average"
threshold = 2
alarm_description = "ECS running task count is low"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ClusterName = aws_ecs_cluster.main.name
ServiceName = aws_ecs_service.api.name
}
}
Step 8: Auto Scaling
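After applying the scaling target and policies below, scaling behavior can be confirmed from the Application Auto Scaling API; the cluster and service names match the rest of this example:
# Show recent scale-out / scale-in activity for the service
aws application-autoscaling describe-scaling-activities \
--service-namespace ecs \
--resource-id service/production/api-service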
resource "aws_appautoscaling_target" "ecs" {
max_capacity = 10
min_capacity = 2
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
# Scale based on CPU
resource "aws_appautoscaling_policy" "cpu" {
name = "cpu-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs.resource_id
scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
target_value = 70.0
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
# Scale based on memory
resource "aws_appautoscaling_policy" "memory" {
name = "memory-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs.resource_id
scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageMemoryUtilization"
}
target_value = 70.0
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
ECS Debugging Checklist
| Issue | Check | Solution |
|---|---|---|
| Exit code 137 | Memory limit | Increase memory, fix leaks |
| Health check fail | Endpoint, timeout | Increase startPeriod, fix endpoint |
| Image pull error | ECR permissions | Check execution role |
| Network timeout | Security groups | Allow egress to endpoints |
| Slow startup | Container size | Use smaller base images |
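For the last two rows, a couple of quick checks; the security group ID and image repository name are placeholders:
# Confirm the task security group allows egress to ECR, CloudWatch, and other endpoints
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
--query 'SecurityGroups[0].IpPermissionsEgress'
# Compare image sizes when slimming the base image
docker images api --format '{{.Repository}}:{{.Tag}}  {{.Size}}'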
Practice Question
Why should you set NODE_OPTIONS='--max-old-space-size' to less than the container memory limit?