ECS tasks are failing with exit code 137 and health check failures. Debug the container issues.
The Scenario
Your ECS service is unstable:
Current Problems:
├── Tasks constantly restarting
├── Exit code 137 (OOMKilled)
├── Health check failures: 50%
├── Container Insights: memory spikes to 100%
├── Task definition: 512 CPU, 1024 MB memory
├── Application: Node.js API server
└── Error: "Container killed due to memory"
The Challenge
Debug and fix the ECS task failures, optimize container resource allocation, and implement proper health checks and monitoring.
Wrong Approach
A junior engineer might just double the memory allocation, disable health checks, or ignore the OOM errors. These approaches waste resources, hide real problems, and don't address root causes like memory leaks or incorrect health check configuration.
Right Approach
A senior engineer analyzes memory usage patterns, profiles the application, configures appropriate resource limits with headroom, implements proper health checks with realistic timeouts, and uses Container Insights for monitoring.
Step 1: Understand Exit Codes
Common ECS Exit Codes:
├── 0: Normal exit (success)
├── 1: Application error
├── 137: SIGKILL (OOMKilled or manual stop)
├── 139: SIGSEGV (Segmentation fault)
├── 143: SIGTERM (Graceful shutdown)
└── 255: Exit status out of range
Exit code 137 = 128 + 9 (SIGKILL)
- Container exceeded memory limit
- Killed by the orchestrator
Step 2: Debug with ECS Exec
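Before exec'ing into a container, it is worth confirming the stop reason from the ECS API itself. A quick check, reusing the cluster, service, and example task ARN from this scenario (stopped tasks remain visible in the API for roughly an hour):
# List recently stopped tasks for the service
aws ecs list-tasks \
--cluster production \
--service-name api-service \
--desired-status STOPPED
# Inspect exit codes and the stopped reason for one task
aws ecs describe-tasks \
--cluster production \
--tasks arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
--query 'tasks[].{stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}'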
# Enable ECS Exec on the service (only tasks launched after this change can be exec'd into)
aws ecs update-service \
--cluster production \
--service api-service \
--enable-execute-command
# Execute into running container
aws ecs execute-command \
--cluster production \
--task arn:aws:ecs:us-east-1:123456789:task/production/abc123 \
--container api \
--interactive \
--command "/bin/sh"
# Inside container: check memory usage (cgroup v1 paths; on cgroup v2 use memory.current and memory.max)
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# Check Node.js heap
node -e "console.log(process.memoryUsage())"
# Check for memory leaks
node --inspect app.js
# Then use Chrome DevTools to analyze heap snapshots
Step 3: Fix Task Definition
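Before settling on new limits, it helps to see how much memory the service actually uses over time. One way, assuming Container Insights is already enabled on the cluster (it is enabled in Step 7), is to pull the service-level memory metric from CloudWatch (GNU date syntax for the time range; values are reported in MB):
# Average and peak memory used by the service over the last 3 hours
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=production Name=ServiceName,Value=api-service \
--start-time "$(date -u -d '3 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
--period 300 \
--statistics Average Maximum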
resource "aws_ecs_task_definition" "api" {
family = "api-task"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "1024" # 1 vCPU
memory = "2048" # 2 GB
execution_role_arn = aws_iam_role.ecs_execution.arn
task_role_arn = aws_iam_role.ecs_task.arn
container_definitions = jsonencode([
{
name = "api"
image = "${aws_ecr_repository.api.repository_url}:latest"
essential = true
# Resource limits
cpu = 896 # Leave some for sidecar
memory = 1792 # Leave headroom for container overhead
# Memory reservation (soft limit)
memoryReservation = 1536
portMappings = [
{
containerPort = 3000
protocol = "tcp"
}
]
# Environment for Node.js memory management
environment = [
{
name = "NODE_OPTIONS"
value = "--max-old-space-size=1536" # 75% of memory limit
},
{
name = "NODE_ENV"
value = "production"
}
]
# Health check (the container image must include curl for this command)
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60 # Give app time to start
}
# Logging
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = "/ecs/api"
"awslogs-region" = "us-east-1"
"awslogs-stream-prefix" = "api"
}
}
# Graceful shutdown
stopTimeout = 30
# Linux parameters
linuxParameters = {
initProcessEnabled = true # Enable init process for proper signal handling
}
},
# Sidecar for metrics
{
name = "datadog-agent"
image = "datadog/agent:latest"
essential = false
cpu = 128
memory = 256
environment = [
{
name = "DD_API_KEY"
value = "from-secrets-manager"
},
{
name = "ECS_FARGATE"
value = "true"
}
]
}
])
}
Step 4: Fix ALB Health Checks
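Before changing anything, it helps to see exactly why the load balancer is marking targets unhealthy. A quick check using the api-tg name from the target group below:
# Look up the target group and show per-target health and failure reasons
TG_ARN=$(aws elbv2 describe-target-groups \
--names api-tg \
--query 'TargetGroups[0].TargetGroupArn' --output text)
aws elbv2 describe-target-health --target-group-arn "$TG_ARN"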
resource "aws_lb_target_group" "api" {
name = "api-tg"
port = 3000
protocol = "HTTP"
vpc_id = aws_vpc.main.id
target_type = "ip"
health_check {
enabled = true
healthy_threshold = 2
unhealthy_threshold = 3
timeout = 5
interval = 30
path = "/health"
port = "traffic-port"
protocol = "HTTP"
matcher = "200"
}
# Important for graceful deployments
deregistration_delay = 30
stickiness {
type = "lb_cookie"
cookie_duration = 86400
enabled = false
}
tags = {
Name = "api-target-group"
}
}
# ECS Service with proper deployment configuration
resource "aws_ecs_service" "api" {
name = "api-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.api.arn
desired_count = 3
launch_type = "FARGATE"
# Enable ECS Exec for debugging
enable_execute_command = true
network_configuration {
subnets = aws_subnet.private[*].id
security_groups = [aws_security_group.ecs_tasks.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.api.arn
container_name = "api"
container_port = 3000
}
# Deployment configuration (top-level arguments in the Terraform AWS provider)
deployment_maximum_percent = 200
deployment_minimum_healthy_percent = 100
deployment_circuit_breaker {
enable = true
rollback = true
}
# Service discovery (optional)
service_registries {
registry_arn = aws_service_discovery_service.api.arn
}
# Give new tasks time to start before failed ALB health checks cause them to be killed
health_check_grace_period_seconds = 60
lifecycle {
ignore_changes = [desired_count] # Allow auto-scaling to manage
}
}
Step 5: Implement Proper Health Check Endpoint
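The routes below split a cheap liveness check from a deeper readiness check. Once deployed, they can be exercised from inside a running task (via ECS Exec from Step 2) as a quick sanity check:
# From an ECS Exec shell inside the container
curl -s http://localhost:3000/health
curl -s http://localhost:3000/health/ready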
// health.js - Comprehensive health check
const express = require('express');
const router = express.Router();
// Assumed app modules exposing the existing database and cache clients
const db = require('./db');
const redis = require('./redis');
// Simple liveness check (for container health)
router.get('/health', (req, res) => {
res.status(200).json({ status: 'ok' });
});
// Detailed readiness check (for ALB)
router.get('/health/ready', async (req, res) => {
const checks = {
database: false,
cache: false,
memory: false,
};
try {
// Check database connection
await db.query('SELECT 1');
checks.database = true;
} catch (err) {
console.error('Database health check failed:', err);
}
try {
// Check Redis connection
await redis.ping();
checks.cache = true;
} catch (err) {
console.error('Cache health check failed:', err);
}
// Check memory usage (fail if over 90%)
const used = process.memoryUsage();
const heapUsedPercent = (used.heapUsed / used.heapTotal) * 100;
checks.memory = heapUsedPercent < 90;
const allHealthy = Object.values(checks).every(Boolean);
res.status(allHealthy ? 200 : 503).json({
status: allHealthy ? 'healthy' : 'unhealthy',
checks,
memory: {
heapUsed: Math.round(used.heapUsed / 1024 / 1024) + 'MB',
heapTotal: Math.round(used.heapTotal / 1024 / 1024) + 'MB',
rss: Math.round(used.rss / 1024 / 1024) + 'MB',
},
uptime: process.uptime(),
});
});
module.exports = router;
Step 6: Graceful Shutdown
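When ECS stops a task it sends SIGTERM, waits for stopTimeout (30 seconds in the task definition above), then sends SIGKILL. The handler below can be rehearsed locally with Docker, which follows the same SIGTERM-then-SIGKILL pattern; the image and container names here are placeholders and assume a Dockerfile at the repo root:
# Build and run the API image locally, then stop it with a 30s grace period
docker build -t api-local .
docker run -d --name api-test -p 3000:3000 api-local
docker stop -t 30 api-test   # Sends SIGTERM, then SIGKILL after 30s
docker logs api-test | tail  # Should show the graceful shutdown log lines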
// server.js - Handle shutdown signals
const express = require('express');
// Assumed app modules exposing the existing database and cache clients
const db = require('./db');
const redis = require('./redis');
const app = express();
let server;
let isShuttingDown = false;
// Health check returns 503 during shutdown
app.get('/health', (req, res) => {
if (isShuttingDown) {
return res.status(503).json({ status: 'shutting down' });
}
res.status(200).json({ status: 'ok' });
});
// Start server
server = app.listen(3000, () => {
console.log('Server started on port 3000');
});
// Graceful shutdown handler
async function gracefulShutdown(signal) {
console.log(`Received ${signal}, starting graceful shutdown`);
isShuttingDown = true;
// Stop accepting new connections
server.close(async () => {
console.log('HTTP server closed');
try {
// Close database connections
await db.end();
console.log('Database connections closed');
// Close Redis connections
await redis.quit();
console.log('Redis connections closed');
process.exit(0);
} catch (err) {
console.error('Error during shutdown:', err);
process.exit(1);
}
});
// Force shutdown after timeout
setTimeout(() => {
console.error('Forced shutdown after timeout');
process.exit(1);
}, 25000); // Less than ECS stopTimeout (30s)
}
// Handle termination signals
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
Step 7: Container Insights and Monitoring
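Once Container Insights is enabled (as in the cluster config below), ECS publishes per-task performance records to a log group named /aws/ecs/containerinsights/&lt;cluster&gt;/performance. A sketch of a Logs Insights query to find the most memory-hungry tasks (GNU date for the time range; the queryId placeholder comes from the first command's output):
# Top tasks by memory over the last hour
aws logs start-query \
--log-group-name /aws/ecs/containerinsights/production/performance \
--start-time "$(date -d '1 hour ago' +%s)" \
--end-time "$(date +%s)" \
--query-string 'fields @timestamp, TaskId, MemoryUtilized | filter Type = "Task" | sort MemoryUtilized desc | limit 10'
# Fetch results with the returned queryId
aws logs get-query-results --query-id <queryId-from-previous-command>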
# Enable Container Insights
resource "aws_ecs_cluster" "main" {
name = "production"
setting {
name = "containerInsights"
value = "enabled"
}
configuration {
execute_command_configuration {
logging = "OVERRIDE"
log_configuration {
cloud_watch_log_group_name = aws_cloudwatch_log_group.ecs_exec.name
}
}
}
}
# CloudWatch alarms for ECS
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "ecs-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CpuUtilized"
namespace = "ECS/ContainerInsights"
period = 60
statistic = "Average"
threshold = 80
alarm_description = "ECS CPU utilization is high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ClusterName = aws_ecs_cluster.main.name
ServiceName = aws_ecs_service.api.name
}
}
resource "aws_cloudwatch_metric_alarm" "memory_high" {
alarm_name = "ecs-memory-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "MemoryUtilized"
namespace = "ECS/ContainerInsights"
period = 60
statistic = "Average"
threshold = 80
alarm_description = "ECS memory utilization is high"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ClusterName = aws_ecs_cluster.main.name
ServiceName = aws_ecs_service.api.name
}
}
resource "aws_cloudwatch_metric_alarm" "task_count" {
alarm_name = "ecs-task-count-low"
comparison_operator = "LessThanThreshold"
evaluation_periods = 2
metric_name = "RunningTaskCount"
namespace = "ECS/ContainerInsights"
period = 60
statistic = "Average"
threshold = 2
alarm_description = "ECS running task count is low"
alarm_actions = [aws_sns_topic.alerts.arn]
dimensions = {
ClusterName = aws_ecs_cluster.main.name
ServiceName = aws_ecs_service.api.name
}
}
Step 8: Auto Scaling
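After applying the scaling target and policies below, scaling behavior can be confirmed from the Application Auto Scaling API; the cluster and service names match the rest of this example:
# Show recent scale-out / scale-in activity for the service
aws application-autoscaling describe-scaling-activities \
--service-namespace ecs \
--resource-id service/production/api-service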
resource "aws_appautoscaling_target" "ecs" {
max_capacity = 10
min_capacity = 2
resource_id = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.api.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
# Scale based on CPU
resource "aws_appautoscaling_policy" "cpu" {
name = "cpu-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs.resource_id
scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
target_value = 70.0
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
# Scale based on memory
resource "aws_appautoscaling_policy" "memory" {
name = "memory-scaling"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.ecs.resource_id
scalable_dimension = aws_appautoscaling_target.ecs.scalable_dimension
service_namespace = aws_appautoscaling_target.ecs.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageMemoryUtilization"
}
target_value = 70.0
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
ECS Debugging Checklist
| Issue | Check | Solution |
|---|---|---|
| Exit code 137 | Memory limit | Increase memory, fix leaks |
| Health check fail | Endpoint, timeout | Increase startPeriod, fix endpoint |
| Image pull error | ECR permissions | Check execution role |
| Network timeout | Security groups | Allow egress to endpoints |
| Slow startup | Container size | Use smaller base images |
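For the last two rows, a couple of quick checks; the security group ID and image repository name are placeholders:
# Confirm the task security group allows egress to ECR, CloudWatch, and other endpoints
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
--query 'SecurityGroups[0].IpPermissionsEgress'
# Compare image sizes when slimming the base image
docker images api --format '{{.Repository}}:{{.Tag}}  {{.Size}}'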
Practice Question
Why should you set NODE_OPTIONS='--max-old-space-size' to less than the container memory limit?