Production incidents take hours to detect. Implement CloudWatch alarms and dashboards.

The Scenario

Your monitoring is inadequate:

Current State:
├── CloudWatch agent: Not installed
├── Alarms: Only EC2 CPU above 80%
├── Dashboards: None
├── Log retention: Default (never expire, high costs)
├── MTTD (Mean Time to Detect): 2-4 hours
├── MTTR (Mean Time to Recover): 4-6 hours
└── Last incident: Customers reported before team noticed

The Challenge

Implement comprehensive monitoring with CloudWatch metrics, custom alarms, centralized logging, and actionable dashboards.

Wrong Approach

A junior engineer might create alarms for every metric, use static thresholds that cause alert fatigue, skip log aggregation, or create dashboards with too many widgets. These approaches cause noise, miss real issues, make debugging difficult, and provide no actionable insights.

Right Approach

A senior engineer implements composite alarms for meaningful alerts, uses anomaly detection for dynamic thresholds, centralizes logs with proper retention, creates focused dashboards, and implements metric filters for business KPIs.

Step 1: CloudWatch Alarm Strategy

# SNS Topic for alerts
resource "aws_sns_topic" "alerts" {
  name = "production-alerts"
}

resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "https"
  endpoint  = "https://events.pagerduty.com/integration/xxx/enqueue"
}

# High error rate alarm
resource "aws_cloudwatch_metric_alarm" "api_error_rate" {
  alarm_name          = "api-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5

  metric_query {
    id          = "error_rate"
    expression  = "(errors/requests)*100"
    label       = "Error Rate %"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "5XXError"
      namespace   = "AWS/ApiGateway"
      period      = 300
      stat        = "Sum"
      dimensions = {
        ApiName = "orders-api"
        Stage   = "prod"
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "Count"
      namespace   = "AWS/ApiGateway"
      period      = 300
      stat        = "Sum"
      dimensions = {
        ApiName = "orders-api"
        Stage   = "prod"
      }
    }
  }

  alarm_description = "API error rate exceeds 5%"
  alarm_actions     = [aws_sns_topic.alerts.arn]
  ok_actions        = [aws_sns_topic.alerts.arn]

  treat_missing_data = "notBreaching"
}

# Anomaly detection alarm for latency
resource "aws_cloudwatch_metric_alarm" "latency_anomaly" {
  alarm_name          = "api-latency-anomaly"
  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "anomaly_band"

  metric_query {
    id          = "latency"
    return_data = true
    metric {
      metric_name = "Latency"
      namespace   = "AWS/ApiGateway"
      period      = 300
      stat        = "p99"
      dimensions = {
        ApiName = "orders-api"
        Stage   = "prod"
      }
    }
  }

  metric_query {
    id          = "anomaly_band"
    expression  = "ANOMALY_DETECTION_BAND(latency, 2)"
    label       = "Latency Anomaly Band"
    return_data = true
  }

  alarm_description = "API latency is abnormally high"
  alarm_actions     = [aws_sns_topic.alerts.arn]
}

# Composite alarm for service health
resource "aws_cloudwatch_composite_alarm" "service_unhealthy" {
  alarm_name = "service-unhealthy"

  alarm_rule = join(" OR ", [
    "ALARM(${aws_cloudwatch_metric_alarm.api_error_rate.alarm_name})",
    "ALARM(${aws_cloudwatch_metric_alarm.latency_anomaly.alarm_name})",
    "ALARM(${aws_cloudwatch_metric_alarm.lambda_errors.alarm_name})"
  ])

  alarm_description = "Service is unhealthy - multiple issues detected"
  alarm_actions     = [aws_sns_topic.alerts.arn]
}
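
The composite alarm above references a lambda_errors alarm that is not defined in this snippet. A minimal sketch of that alarm, assuming the process-order function used later in this article and a threshold you would tune for your workload:

# Lambda error alarm referenced by the composite alarm (sketch)
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-process-order-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = 300
  statistic           = "Sum"

  dimensions = {
    FunctionName = "process-order"
  }

  alarm_description  = "Lambda function is throwing errors"
  alarm_actions      = [aws_sns_topic.alerts.arn]
  treat_missing_data = "notBreaching"
}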

Step 2: Custom Metrics

import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

def put_business_metrics(order_data: dict):
    """Publish custom business metrics."""
    cloudwatch.put_metric_data(
        Namespace='OrdersApp',
        MetricData=[
            {
                'MetricName': 'OrderValue',
                'Value': order_data['total_amount'],
                'Unit': 'None',
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'},
                    {'Name': 'Region', 'Value': order_data['region']}
                ],
                'Timestamp': datetime.utcnow()
            },
            {
                'MetricName': 'OrderCount',
                'Value': 1,
                'Unit': 'Count',
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'},
                    {'Name': 'ProductCategory', 'Value': order_data['category']}
                ]
            },
            {
                'MetricName': 'ProcessingTime',
                'Value': order_data['processing_time_ms'],
                'Unit': 'Milliseconds',
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'}
                ]
            }
        ]
    )

# Using EMF (Embedded Metric Format) for Lambda
import json

def emit_emf_metric(metric_name: str, value: float, dimensions: dict):
    """Emit metric using Embedded Metric Format."""
    emf_log = {
        "_aws": {
            "Timestamp": int(datetime.utcnow().timestamp() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "OrdersApp",
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": "None"}]
            }]
        },
        metric_name: value,
        **dimensions
    }
    print(json.dumps(emf_log))
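
For context, here is how these helpers might be wired into a hypothetical Lambda handler (the event field names are assumptions for illustration). EMF is generally preferable inside Lambda because the metric is extracted asynchronously from the log stream instead of requiring a PutMetricData API call per invocation:

def handler(event, context):
    """Hypothetical order-processing handler that records metrics."""
    order_data = {
        'total_amount': event.get('total_amount', 0),
        'region': event.get('region', 'us-east-1'),
        'category': event.get('category', 'unknown'),
        'processing_time_ms': event.get('processing_time_ms', 0),
    }

    # Direct PutMetricData call (one API request per invocation)
    put_business_metrics(order_data)

    # EMF alternative: just a structured log line, no extra API call
    emit_emf_metric('OrderValue', order_data['total_amount'],
                    {'Environment': 'production'})

    return {'statusCode': 200}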

Step 3: Log Aggregation

# Centralized log group with retention
resource "aws_cloudwatch_log_group" "app_logs" {
  name              = "/app/orders-service"
  retention_in_days = 30

  tags = {
    Environment = "production"
    Service     = "orders"
  }
}

# Metric filter for errors
resource "aws_cloudwatch_log_metric_filter" "errors" {
  name           = "error-count"
  pattern        = "[timestamp, level=ERROR, ...]"
  log_group_name = aws_cloudwatch_log_group.app_logs.name

  metric_transformation {
    name          = "ErrorCount"
    namespace     = "OrdersApp/Logs"
    value         = "1"
    default_value = "0"
  }
}

# Metric filter for specific errors
resource "aws_cloudwatch_log_metric_filter" "payment_failures" {
  name           = "payment-failures"
  pattern        = "{ $.error_type = \"PaymentFailed\" }"
  log_group_name = aws_cloudwatch_log_group.app_logs.name

  metric_transformation {
    name      = "PaymentFailures"
    namespace = "OrdersApp/Business"
    value     = "1"
    dimensions = {
      ErrorCode = "$.error_code"
    }
  }
}

# Subscription filter to stream logs
resource "aws_cloudwatch_log_subscription_filter" "to_elasticsearch" {
  name            = "logs-to-elasticsearch"
  log_group_name  = aws_cloudwatch_log_group.app_logs.name
  filter_pattern  = ""
  destination_arn = aws_lambda_function.log_shipper.arn
}
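
The subscription filter also requires CloudWatch Logs to be allowed to invoke the destination function, which is easy to forget. A minimal sketch, assuming the log_shipper Lambda referenced above:

# Allow CloudWatch Logs to invoke the log-shipping Lambda
resource "aws_lambda_permission" "allow_cloudwatch_logs" {
  statement_id  = "AllowExecutionFromCloudWatchLogs"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.log_shipper.function_name
  principal     = "logs.amazonaws.com"
  source_arn    = "${aws_cloudwatch_log_group.app_logs.arn}:*"
}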

Step 4: CloudWatch Dashboard

resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "orders-service-prod"

  dashboard_body = jsonencode({
    widgets = [
      # Row 1: Key Metrics
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 6
        height = 6
        properties = {
          title  = "Request Count"
          region = "us-east-1"
          metrics = [
            ["AWS/ApiGateway", "Count", "ApiName", "orders-api", "Stage", "prod",
             { stat = "Sum", period = 60 }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 6
        y      = 0
        width  = 6
        height = 6
        properties = {
          title  = "Error Rate (%)"
          region = "us-east-1"
          metrics = [
            [{ expression = "(m1/m2)*100", label = "Error Rate", id = "e1" }],
            ["AWS/ApiGateway", "5XXError", "ApiName", "orders-api", "Stage", "prod",
             { stat = "Sum", period = 60, id = "m1", visible = false }],
            ["AWS/ApiGateway", "Count", "ApiName", "orders-api", "Stage", "prod",
             { stat = "Sum", period = 60, id = "m2", visible = false }]
          ]
          view = "timeSeries"
          annotations = {
            horizontal = [
              { label = "Critical", value = 5, color = "#ff0000" }
            ]
          }
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 0
        width  = 6
        height = 6
        properties = {
          title  = "Latency (p50, p90, p99)"
          region = "us-east-1"
          metrics = [
            ["AWS/ApiGateway", "Latency", "ApiName", "orders-api", "Stage", "prod",
             { stat = "p50", period = 60, label = "p50" }],
            ["...", { stat = "p90", period = 60, label = "p90" }],
            ["...", { stat = "p99", period = 60, label = "p99" }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 18
        y      = 0
        width  = 6
        height = 6
        properties = {
          title  = "Active Alarms"
          region = "us-east-1"
          alarms = [
            aws_cloudwatch_metric_alarm.api_error_rate.arn,
            aws_cloudwatch_metric_alarm.latency_anomaly.arn
          ]
        }
      },

      # Row 2: Lambda Metrics
      {
        type   = "metric"
        x      = 0
        y      = 6
        width  = 8
        height = 6
        properties = {
          title  = "Lambda Invocations & Errors"
          region = "us-east-1"
          metrics = [
            ["AWS/Lambda", "Invocations", "FunctionName", "process-order",
             { stat = "Sum", period = 60 }],
            ["AWS/Lambda", "Errors", "FunctionName", "process-order",
             { stat = "Sum", period = 60, color = "#ff0000" }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 8
        y      = 6
        width  = 8
        height = 6
        properties = {
          title  = "Lambda Duration"
          region = "us-east-1"
          metrics = [
            ["AWS/Lambda", "Duration", "FunctionName", "process-order",
             { stat = "Average", period = 60 }],
            ["...", { stat = "Maximum", period = 60 }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 16
        y      = 6
        width  = 8
        height = 6
        properties = {
          title  = "Lambda Concurrent Executions"
          region = "us-east-1"
          metrics = [
            ["AWS/Lambda", "ConcurrentExecutions", "FunctionName", "process-order",
             { stat = "Maximum", period = 60 }]
          ]
          view = "timeSeries"
        }
      },

      # Row 3: Business Metrics
      {
        type   = "metric"
        x      = 0
        y      = 12
        width  = 12
        height = 6
        properties = {
          title  = "Order Value (Hourly)"
          region = "us-east-1"
          metrics = [
            ["OrdersApp", "OrderValue", "Environment", "production",
             { stat = "Sum", period = 3600 }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 12
        width  = 12
        height = 6
        properties = {
          title  = "Orders by Category"
          region = "us-east-1"
          metrics = [
            ["OrdersApp", "OrderCount", "ProductCategory", "electronics",
             { stat = "Sum", period = 3600 }],
            ["...", "clothing", { stat = "Sum", period = 3600 }],
            ["...", "books", { stat = "Sum", period = 3600 }]
          ]
          view = "timeSeries"
        }
      },

      # Row 4: Logs
      {
        type   = "log"
        x      = 0
        y      = 18
        width  = 24
        height = 6
        properties = {
          title  = "Recent Errors"
          region = "us-east-1"
          query  = <<-EOT
            SOURCE '/app/orders-service'
            | filter level = 'ERROR'
            | sort @timestamp desc
            | limit 100
          EOT
        }
      }
    ]
  })
}
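
Optionally, exposing the dashboard ARN as a Terraform output makes it easy for other tooling or runbooks to reference:

# Optional: surface the dashboard ARN from Terraform state
output "dashboard_arn" {
  value = aws_cloudwatch_dashboard.main.dashboard_arn
}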

Step 5: CloudWatch Logs Insights Queries

# Find slow requests
fields @timestamp, @message
| filter @message like /duration/
| parse @message "duration: * ms" as duration
| filter duration > 1000
| sort duration desc
| limit 100

# Error breakdown by type
fields @timestamp, error_type, error_message
| filter level = 'ERROR'
| stats count(*) as count by error_type
| sort count desc

# Request volume by endpoint
fields @timestamp, path, method
| filter @message like /request/
| stats count(*) as requests by path, method
| sort requests desc

# Find cold starts
fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message "Init Duration: * ms" as init_duration
| stats avg(init_duration) as avg_init, max(init_duration) as max_init by bin(1h)

# Trace request through logs
fields @timestamp, @message, request_id
| filter request_id = 'abc-123'
| sort @timestamp asc
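
These queries can also be run programmatically, which is useful for runbooks and scheduled reports. A sketch using boto3 against the log group from Step 3 (the time window and polling interval are arbitrary choices):

import time
import boto3

logs = boto3.client('logs')

def run_insights_query(query: str, hours: int = 1) -> list:
    """Run a CloudWatch Logs Insights query and wait for the results."""
    end = int(time.time())
    start = end - hours * 3600

    query_id = logs.start_query(
        logGroupName='/app/orders-service',
        startTime=start,
        endTime=end,
        queryString=query,
    )['queryId']

    # Poll until the query reaches a terminal state
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response['status'] in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
            return response['results']
        time.sleep(1)

# Example: the error breakdown query from above
results = run_insights_query(
    "filter level = 'ERROR' | stats count(*) as count by error_type | sort count desc"
)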

Step 6: X-Ray Tracing

# Enable X-Ray for Lambda
resource "aws_lambda_function" "process_order" {
  # ... other config

  tracing_config {
    mode = "Active"
  }
}

# X-Ray sampling rules
resource "aws_xray_sampling_rule" "main" {
  rule_name      = "orders-sampling"
  priority       = 1000
  reservoir_size = 5
  fixed_rate     = 0.05  # 5% of requests
  url_path       = "*"
  host           = "*"
  http_method    = "*"
  service_type   = "*"
  service_name   = "orders-service"
  version        = 1
}
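
On the application side, the function needs to emit subsegments for the trace to be useful. A minimal sketch using the aws-xray-sdk Python package (assumed to be bundled with the deployment; function and annotation names are illustrative):

from aws_xray_sdk.core import xray_recorder, patch_all

# Patch supported libraries (boto3, requests, etc.) so downstream AWS
# calls appear as subsegments in the trace
patch_all()

@xray_recorder.capture('validate_order')
def validate_order(order: dict) -> bool:
    """Business logic traced as its own subsegment."""
    # Annotations make traces filterable in the X-Ray console
    xray_recorder.put_annotation('order_region', order.get('region', 'unknown'))
    return order.get('total_amount', 0) > 0

def handler(event, context):
    # With tracing_config mode = "Active", Lambda opens the parent segment;
    # the SDK attaches subsegments created here to it automatically
    validate_order(event)
    return {'statusCode': 200}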

CloudWatch Best Practices

Component     Recommendation                      Purpose
Alarms        Composite with anomaly detection    Reduce false positives
Logs          30-90 day retention                 Balance cost and debugging
Dashboards    4-6 key metrics                     Quick health assessment
Metrics       Use EMF in Lambda                   Lower cost, simpler code
X-Ray         5% sampling                         Tracing without overhead

Practice Question

Why should you use anomaly detection alarms instead of static threshold alarms?