Production incidents take hours to detect. Implement CloudWatch alarms and dashboards.

The Scenario

Your monitoring is inadequate:

Current State:
├── CloudWatch agent: Not installed
├── Alarms: Only EC2 CPU above 80%
├── Dashboards: None
├── Log retention: Default (never expire, high costs)
├── MTTD (Mean Time to Detect): 2-4 hours
├── MTTR (Mean Time to Recover): 4-6 hours
└── Last incident: Customers reported before team noticed

The Challenge

Implement comprehensive monitoring with CloudWatch metrics, custom alarms, centralized logging, and actionable dashboards.

Wrong Approach

A junior engineer might create alarms for every metric, use static thresholds that cause alert fatigue, skip log aggregation, or create dashboards with too many widgets. These approaches cause noise, miss real issues, make debugging difficult, and provide no actionable insights.

Right Approach

A senior engineer implements composite alarms for meaningful alerts, uses anomaly detection for dynamic thresholds, centralizes logs with proper retention, creates focused dashboards, and implements metric filters for business KPIs.

Step 1: CloudWatch Alarm Strategy

# SNS Topic for alerts
resource "aws_sns_topic" "alerts" {
  name = "production-alerts"
}

resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "https"
  endpoint  = "https://events.pagerduty.com/integration/xxx/enqueue"
}

# High error rate alarm
resource "aws_cloudwatch_metric_alarm" "api_error_rate" {
  alarm_name          = "api-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5

  metric_query {
    id          = "error_rate"
    expression  = "(errors/requests)*100"
    label       = "Error Rate %"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "5XXError"
      namespace   = "AWS/ApiGateway"
      period      = 300
      stat        = "Sum"
      dimensions = {
        ApiName = "orders-api"
        Stage   = "prod"
      }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "Count"
      namespace   = "AWS/ApiGateway"
      period      = 300
      stat        = "Sum"
      dimensions = {
        ApiName = "orders-api"
        Stage   = "prod"
      }
    }
  }

  alarm_description = "API error rate exceeds 5%"
  alarm_actions     = [aws_sns_topic.alerts.arn]
  ok_actions        = [aws_sns_topic.alerts.arn]

  treat_missing_data = "notBreaching"
}

# Anomaly detection alarm for latency
resource "aws_cloudwatch_metric_alarm" "latency_anomaly" {
  alarm_name          = "api-latency-anomaly"
  comparison_operator = "GreaterThanUpperThreshold"
  evaluation_periods  = 3
  threshold_metric_id = "anomaly_band"

  metric_query {
    id          = "latency"
    return_data = true
    metric {
      metric_name = "Latency"
      namespace   = "AWS/ApiGateway"
      period      = 300
      stat        = "p99"
      dimensions = {
        ApiName = "orders-api"
        Stage   = "prod"
      }
    }
  }

  metric_query {
    id          = "anomaly_band"
    expression  = "ANOMALY_DETECTION_BAND(latency, 2)"
    label       = "Latency Anomaly Band"
    return_data = true
  }

  alarm_description = "API latency is abnormally high"
  alarm_actions     = [aws_sns_topic.alerts.arn]
}

# Composite alarm for service health
resource "aws_cloudwatch_composite_alarm" "service_unhealthy" {
  alarm_name = "service-unhealthy"

  alarm_rule = join(" OR ", [
    "ALARM(${aws_cloudwatch_metric_alarm.api_error_rate.alarm_name})",
    "ALARM(${aws_cloudwatch_metric_alarm.latency_anomaly.alarm_name})",
    "ALARM(${aws_cloudwatch_metric_alarm.lambda_errors.alarm_name})"
  ])

  alarm_description = "Service is unhealthy - multiple issues detected"
  alarm_actions     = [aws_sns_topic.alerts.arn]
}
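
The composite alarm above references a lambda_errors alarm that is not defined in this snippet. A minimal sketch of that alarm, assuming the process-order function used later in this article and a threshold you would tune for your workload:

# Lambda error alarm referenced by the composite alarm (sketch)
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "lambda-process-order-errors"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  threshold           = 5
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = 300
  statistic           = "Sum"

  dimensions = {
    FunctionName = "process-order"
  }

  alarm_description  = "Lambda function is throwing errors"
  alarm_actions      = [aws_sns_topic.alerts.arn]
  treat_missing_data = "notBreaching"
}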

Step 2: Custom Metrics

import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

def put_business_metrics(order_data: dict):
    """Publish custom business metrics."""
    cloudwatch.put_metric_data(
        Namespace='OrdersApp',
        MetricData=[
            {
                'MetricName': 'OrderValue',
                'Value': order_data['total_amount'],
                'Unit': 'None',
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'},
                    {'Name': 'Region', 'Value': order_data['region']}
                ],
                'Timestamp': datetime.utcnow()
            },
            {
                'MetricName': 'OrderCount',
                'Value': 1,
                'Unit': 'Count',
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'},
                    {'Name': 'ProductCategory', 'Value': order_data['category']}
                ]
            },
            {
                'MetricName': 'ProcessingTime',
                'Value': order_data['processing_time_ms'],
                'Unit': 'Milliseconds',
                'Dimensions': [
                    {'Name': 'Environment', 'Value': 'production'}
                ]
            }
        ]
    )

# Using EMF (Embedded Metric Format) for Lambda
import json

def emit_emf_metric(metric_name: str, value: float, dimensions: dict):
    """Emit metric using Embedded Metric Format."""
    emf_log = {
        "_aws": {
            "Timestamp": int(datetime.utcnow().timestamp() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": "OrdersApp",
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": "None"}]
            }]
        },
        metric_name: value,
        **dimensions
    }
    print(json.dumps(emf_log))
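
For context, here is how these helpers might be wired into a hypothetical Lambda handler (the event field names are assumptions for illustration). EMF is generally preferable inside Lambda because the metric is extracted asynchronously from the log stream instead of requiring a PutMetricData API call per invocation:

def handler(event, context):
    """Hypothetical order-processing handler that records metrics."""
    order_data = {
        'total_amount': event.get('total_amount', 0),
        'region': event.get('region', 'us-east-1'),
        'category': event.get('category', 'unknown'),
        'processing_time_ms': event.get('processing_time_ms', 0),
    }

    # Direct PutMetricData call (one API request per invocation)
    put_business_metrics(order_data)

    # EMF alternative: just a structured log line, no extra API call
    emit_emf_metric('OrderValue', order_data['total_amount'],
                    {'Environment': 'production'})

    return {'statusCode': 200}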

Step 3: Log Aggregation

# Centralized log group with retention
resource "aws_cloudwatch_log_group" "app_logs" {
  name              = "/app/orders-service"
  retention_in_days = 30

  tags = {
    Environment = "production"
    Service     = "orders"
  }
}

# Metric filter for errors
resource "aws_cloudwatch_log_metric_filter" "errors" {
  name           = "error-count"
  pattern        = "[timestamp, level=ERROR, ...]"
  log_group_name = aws_cloudwatch_log_group.app_logs.name

  metric_transformation {
    name          = "ErrorCount"
    namespace     = "OrdersApp/Logs"
    value         = "1"
    default_value = "0"
  }
}

# Metric filter for specific errors
resource "aws_cloudwatch_log_metric_filter" "payment_failures" {
  name           = "payment-failures"
  pattern        = "{ $.error_type = \"PaymentFailed\" }"
  log_group_name = aws_cloudwatch_log_group.app_logs.name

  metric_transformation {
    name      = "PaymentFailures"
    namespace = "OrdersApp/Business"
    value     = "1"
    dimensions = {
      ErrorCode = "$.error_code"
    }
  }
}

# Subscription filter to stream logs
resource "aws_cloudwatch_log_subscription_filter" "to_elasticsearch" {
  name            = "logs-to-elasticsearch"
  log_group_name  = aws_cloudwatch_log_group.app_logs.name
  filter_pattern  = ""
  destination_arn = aws_lambda_function.log_shipper.arn
}
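
The subscription filter also requires CloudWatch Logs to be allowed to invoke the destination function, which is easy to forget. A minimal sketch, assuming the log_shipper Lambda referenced above:

# Allow CloudWatch Logs to invoke the log-shipping Lambda
resource "aws_lambda_permission" "allow_cloudwatch_logs" {
  statement_id  = "AllowExecutionFromCloudWatchLogs"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.log_shipper.function_name
  principal     = "logs.amazonaws.com"
  source_arn    = "${aws_cloudwatch_log_group.app_logs.arn}:*"
}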

Step 4: CloudWatch Dashboard

resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "orders-service-prod"

  dashboard_body = jsonencode({
    widgets = [
      # Row 1: Key Metrics
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 6
        height = 6
        properties = {
          title  = "Request Count"
          region = "us-east-1"
          metrics = [
            ["AWS/ApiGateway", "Count", "ApiName", "orders-api", "Stage", "prod",
             { stat = "Sum", period = 60 }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 6
        y      = 0
        width  = 6
        height = 6
        properties = {
          title  = "Error Rate (%)"
          region = "us-east-1"
          metrics = [
            [{ expression = "(m1/m2)*100", label = "Error Rate", id = "e1" }],
            ["AWS/ApiGateway", "5XXError", "ApiName", "orders-api", "Stage", "prod",
             { stat = "Sum", period = 60, id = "m1", visible = false }],
            ["AWS/ApiGateway", "Count", "ApiName", "orders-api", "Stage", "prod",
             { stat = "Sum", period = 60, id = "m2", visible = false }]
          ]
          view = "timeSeries"
          annotations = {
            horizontal = [
              { label = "Critical", value = 5, color = "#ff0000" }
            ]
          }
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 0
        width  = 6
        height = 6
        properties = {
          title  = "Latency (p50, p90, p99)"
          region = "us-east-1"
          metrics = [
            ["AWS/ApiGateway", "Latency", "ApiName", "orders-api", "Stage", "prod",
             { stat = "p50", period = 60, label = "p50" }],
            ["...", { stat = "p90", period = 60, label = "p90" }],
            ["...", { stat = "p99", period = 60, label = "p99" }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 18
        y      = 0
        width  = 6
        height = 6
        properties = {
          title  = "Active Alarms"
          region = "us-east-1"
          alarms = [
            aws_cloudwatch_metric_alarm.api_error_rate.arn,
            aws_cloudwatch_metric_alarm.latency_anomaly.arn
          ]
        }
      },

      # Row 2: Lambda Metrics
      {
        type   = "metric"
        x      = 0
        y      = 6
        width  = 8
        height = 6
        properties = {
          title  = "Lambda Invocations & Errors"
          region = "us-east-1"
          metrics = [
            ["AWS/Lambda", "Invocations", "FunctionName", "process-order",
             { stat = "Sum", period = 60 }],
            ["AWS/Lambda", "Errors", "FunctionName", "process-order",
             { stat = "Sum", period = 60, color = "#ff0000" }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 8
        y      = 6
        width  = 8
        height = 6
        properties = {
          title  = "Lambda Duration"
          region = "us-east-1"
          metrics = [
            ["AWS/Lambda", "Duration", "FunctionName", "process-order",
             { stat = "Average", period = 60 }],
            ["...", { stat = "Maximum", period = 60 }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 16
        y      = 6
        width  = 8
        height = 6
        properties = {
          title  = "Lambda Concurrent Executions"
          region = "us-east-1"
          metrics = [
            ["AWS/Lambda", "ConcurrentExecutions", "FunctionName", "process-order",
             { stat = "Maximum", period = 60 }]
          ]
          view = "timeSeries"
        }
      },

      # Row 3: Business Metrics
      {
        type   = "metric"
        x      = 0
        y      = 12
        width  = 12
        height = 6
        properties = {
          title  = "Order Value (Hourly)"
          region = "us-east-1"
          metrics = [
            ["OrdersApp", "OrderValue", "Environment", "production",
             { stat = "Sum", period = 3600 }]
          ]
          view = "timeSeries"
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 12
        width  = 12
        height = 6
        properties = {
          title  = "Orders by Category"
          region = "us-east-1"
          metrics = [
            ["OrdersApp", "OrderCount", "ProductCategory", "electronics",
             { stat = "Sum", period = 3600 }],
            ["...", "clothing", { stat = "Sum", period = 3600 }],
            ["...", "books", { stat = "Sum", period = 3600 }]
          ]
          view = "timeSeries"
        }
      },

      # Row 4: Logs
      {
        type   = "log"
        x      = 0
        y      = 18
        width  = 24
        height = 6
        properties = {
          title  = "Recent Errors"
          region = "us-east-1"
          query  = <<-EOT
            SOURCE '/app/orders-service'
            | filter level = 'ERROR'
            | sort @timestamp desc
            | limit 100
          EOT
        }
      }
    ]
  })
}
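
Optionally, exposing the dashboard ARN as a Terraform output makes it easy for other tooling or runbooks to reference:

# Optional: surface the dashboard ARN from Terraform state
output "dashboard_arn" {
  value = aws_cloudwatch_dashboard.main.dashboard_arn
}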

Step 5: CloudWatch Logs Insights Queries

# Find slow requests
fields @timestamp, @message
| filter @message like /duration/
| parse @message "duration: * ms" as duration
| filter duration > 1000
| sort duration desc
| limit 100

# Error breakdown by type
fields @timestamp, error_type, error_message
| filter level = 'ERROR'
| stats count(*) as count by error_type
| sort count desc

# Request volume by endpoint
fields @timestamp, path, method
| filter @message like /request/
| stats count(*) as requests by path, method
| sort requests desc

# Find cold starts
fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message "Init Duration: * ms" as init_duration
| stats avg(init_duration) as avg_init, max(init_duration) as max_init by bin(1h)

# Trace request through logs
fields @timestamp, @message, request_id
| filter request_id = 'abc-123'
| sort @timestamp asc
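
These queries can also be run programmatically, which is useful for runbooks and scheduled reports. A sketch using boto3 against the log group from Step 3 (the time window and polling interval are arbitrary choices):

import time
import boto3

logs = boto3.client('logs')

def run_insights_query(query: str, hours: int = 1) -> list:
    """Run a CloudWatch Logs Insights query and wait for the results."""
    end = int(time.time())
    start = end - hours * 3600

    query_id = logs.start_query(
        logGroupName='/app/orders-service',
        startTime=start,
        endTime=end,
        queryString=query,
    )['queryId']

    # Poll until the query reaches a terminal state
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response['status'] in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
            return response['results']
        time.sleep(1)

# Example: the error breakdown query from above
results = run_insights_query(
    "filter level = 'ERROR' | stats count(*) as count by error_type | sort count desc"
)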

Step 6: X-Ray Tracing

# Enable X-Ray for Lambda
resource "aws_lambda_function" "process_order" {
  # ... other config

  tracing_config {
    mode = "Active"
  }
}

# X-Ray sampling rules
resource "aws_xray_sampling_rule" "main" {
  rule_name      = "orders-sampling"
  priority       = 1000
  reservoir_size = 5
  fixed_rate     = 0.05  # 5% of requests
  url_path       = "*"
  host           = "*"
  http_method    = "*"
  service_type   = "*"
  service_name   = "orders-service"
  version        = 1
}
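
On the application side, the function needs to emit subsegments for the trace to be useful. A minimal sketch using the aws-xray-sdk Python package (assumed to be bundled with the deployment; function and annotation names are illustrative):

from aws_xray_sdk.core import xray_recorder, patch_all

# Patch supported libraries (boto3, requests, etc.) so downstream AWS
# calls appear as subsegments in the trace
patch_all()

@xray_recorder.capture('validate_order')
def validate_order(order: dict) -> bool:
    """Business logic traced as its own subsegment."""
    # Annotations make traces filterable in the X-Ray console
    xray_recorder.put_annotation('order_region', order.get('region', 'unknown'))
    return order.get('total_amount', 0) > 0

def handler(event, context):
    # With tracing_config mode = "Active", Lambda opens the parent segment;
    # the SDK attaches subsegments created here to it automatically
    validate_order(event)
    return {'statusCode': 200}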

CloudWatch Best Practices

Component     Recommendation                      Purpose
Alarms        Composite with anomaly detection    Reduce false positives
Logs          30-90 day retention                 Balance cost and debugging
Dashboards    4-6 key metrics                     Quick health assessment
Metrics       Use EMF in Lambda                   Lower cost, simpler code
X-Ray         5% sampling                         Tracing without overhead

Practice Question

Why should you use anomaly detection alarms instead of static threshold alarms?