Production incidents take hours to detect. Implement CloudWatch alarms and dashboards.
The Scenario
Your monitoring is inadequate:
Current State:
├── CloudWatch agent: Not installed
├── Alarms: Only EC2 CPU above 80%
├── Dashboards: None
├── Log retention: Default (never expire, high costs)
├── MTTD (Mean Time to Detect): 2-4 hours
├── MTTR (Mean Time to Recover): 4-6 hours
└── Last incident: Customers reported before team noticed
The Challenge
Implement comprehensive monitoring with CloudWatch metrics, custom alarms, centralized logging, and actionable dashboards.
Wrong Approach
A junior engineer might create alarms for every metric, use static thresholds that cause alert fatigue, skip log aggregation, or create dashboards with too many widgets. These approaches cause noise, miss real issues, make debugging difficult, and provide no actionable insights.
Right Approach
A senior engineer implements composite alarms for meaningful alerts, uses anomaly detection for dynamic thresholds, centralizes logs with proper retention, creates focused dashboards, and implements metric filters for business KPIs.
Step 1: CloudWatch Alarm Strategy
# SNS Topic for alerts
resource "aws_sns_topic" "alerts" {
name = "production-alerts"
}
resource "aws_sns_topic_subscription" "pagerduty" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "https"
endpoint = "https://events.pagerduty.com/integration/xxx/enqueue"
}
# High error rate alarm
resource "aws_cloudwatch_metric_alarm" "api_error_rate" {
alarm_name = "api-high-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
threshold = 5
metric_query {
id = "error_rate"
expression = "(errors/requests)*100"
label = "Error Rate %"
return_data = true
}
metric_query {
id = "errors"
metric {
metric_name = "5XXError"
namespace = "AWS/ApiGateway"
period = 300
stat = "Sum"
dimensions = {
ApiName = "orders-api"
Stage = "prod"
}
}
}
metric_query {
id = "requests"
metric {
metric_name = "Count"
namespace = "AWS/ApiGateway"
period = 300
stat = "Sum"
dimensions = {
ApiName = "orders-api"
Stage = "prod"
}
}
}
alarm_description = "API error rate exceeds 5%"
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
treat_missing_data = "notBreaching"
}
# Anomaly detection alarm for latency
resource "aws_cloudwatch_metric_alarm" "latency_anomaly" {
alarm_name = "api-latency-anomaly"
comparison_operator = "GreaterThanUpperThreshold"
evaluation_periods = 3
threshold_metric_id = "anomaly_band"
metric_query {
id = "latency"
return_data = true
metric {
metric_name = "Latency"
namespace = "AWS/ApiGateway"
period = 300
stat = "p99"
dimensions = {
ApiName = "orders-api"
Stage = "prod"
}
}
}
metric_query {
id = "anomaly_band"
expression = "ANOMALY_DETECTION_BAND(latency, 2)"
label = "Latency Anomaly Band"
return_data = true
}
alarm_description = "API latency is abnormally high"
alarm_actions = [aws_sns_topic.alerts.arn]
}
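A quick way to verify the SNS → PagerDuty path before relying on these alarms is to force one into the ALARM state with boto3 and confirm a page arrives; CloudWatch reverts the state at the next evaluation period. This is only a sketch (the alarm name matches the Terraform above), and note that the composite alarm below also references a lambda_errors alarm that is assumed to be defined elsewhere.
import boto3

cloudwatch = boto3.client("cloudwatch")

def fire_test_alert(alarm_name: str = "api-high-error-rate") -> None:
    """Temporarily force an alarm into ALARM state to test notification delivery."""
    # The state override triggers alarm_actions (SNS -> PagerDuty); CloudWatch
    # re-evaluates the metric on the next period and flips the state back.
    cloudwatch.set_alarm_state(
        AlarmName=alarm_name,
        StateValue="ALARM",
        StateReason="Manual test of the alerting path",
    )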
# Composite alarm for service health
resource "aws_cloudwatch_composite_alarm" "service_unhealthy" {
alarm_name = "service-unhealthy"
alarm_rule = join(" OR ", [
"ALARM(${aws_cloudwatch_metric_alarm.api_error_rate.alarm_name})",
"ALARM(${aws_cloudwatch_metric_alarm.latency_anomaly.alarm_name})",
"ALARM(${aws_cloudwatch_metric_alarm.lambda_errors.alarm_name})"
])
alarm_description = "Service is unhealthy - multiple issues detected"
alarm_actions = [aws_sns_topic.alerts.arn]
}
Step 2: Custom Metrics
import boto3
from datetime import datetime
cloudwatch = boto3.client('cloudwatch')
def put_business_metrics(order_data: dict):
"""Publish custom business metrics."""
cloudwatch.put_metric_data(
Namespace='OrdersApp',
MetricData=[
{
'MetricName': 'OrderValue',
'Value': order_data['total_amount'],
'Unit': 'None',
'Dimensions': [
{'Name': 'Environment', 'Value': 'production'},
{'Name': 'Region', 'Value': order_data['region']}
],
'Timestamp': datetime.utcnow()
},
{
'MetricName': 'OrderCount',
'Value': 1,
'Unit': 'Count',
'Dimensions': [
{'Name': 'Environment', 'Value': 'production'},
{'Name': 'ProductCategory', 'Value': order_data['category']}
]
},
{
'MetricName': 'ProcessingTime',
'Value': order_data['processing_time_ms'],
'Unit': 'Milliseconds',
'Dimensions': [
{'Name': 'Environment', 'Value': 'production'}
]
}
]
)
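A call site might look like the following sketch (values are illustrative). Publishing is wrapped so a CloudWatch hiccup never fails the order itself, and because each put_metric_data call is a synchronous API request, the EMF variant below is usually the better fit inside Lambda.
try:
    put_business_metrics({
        "total_amount": 129.99,
        "region": "eu-west-1",
        "category": "electronics",
        "processing_time_ms": 87.0,
    })
except Exception:
    # Metrics are best-effort; never break order processing over monitoring.
    pass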
# Using EMF (Embedded Metric Format) for Lambda
import json
def emit_emf_metric(metric_name: str, value: float, dimensions: dict):
"""Emit metric using Embedded Metric Format."""
emf_log = {
"_aws": {
"Timestamp": int(datetime.utcnow().timestamp() * 1000),
"CloudWatchMetrics": [{
"Namespace": "OrdersApp",
"Dimensions": [list(dimensions.keys())],
"Metrics": [{"Name": metric_name, "Unit": "None"}]
}]
},
metric_name: value,
**dimensions
}
    print(json.dumps(emf_log))
Step 3: Log Aggregation
# Centralized log group with retention
resource "aws_cloudwatch_log_group" "app_logs" {
name = "/app/orders-service"
retention_in_days = 30
tags = {
Environment = "production"
Service = "orders"
}
}
# Metric filter for errors
resource "aws_cloudwatch_log_metric_filter" "errors" {
name = "error-count"
pattern = "[timestamp, level=ERROR, ...]"
log_group_name = aws_cloudwatch_log_group.app_logs.name
metric_transformation {
name = "ErrorCount"
namespace = "OrdersApp/Logs"
value = "1"
default_value = "0"
}
}
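The JSON-based filter below only matches log events whose entire message parses as a JSON object; here is a sketch of application code emitting a line that the { $.error_type = "PaymentFailed" } pattern would pick up (field names are illustrative, and error_code feeds the dimension defined on the filter).
import json

def log_payment_failure(order_id: str, error_code: str) -> None:
    # One JSON object per line; a prefix added by a log formatter would
    # prevent the JSON filter pattern from matching.
    print(json.dumps({
        "level": "ERROR",
        "error_type": "PaymentFailed",
        "error_code": error_code,
        "order_id": order_id,
    }))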
# Metric filter for specific errors
resource "aws_cloudwatch_log_metric_filter" "payment_failures" {
name = "payment-failures"
pattern = "{ $.error_type = \"PaymentFailed\" }"
log_group_name = aws_cloudwatch_log_group.app_logs.name
metric_transformation {
name = "PaymentFailures"
namespace = "OrdersApp/Business"
value = "1"
dimensions = {
ErrorCode = "$.error_code"
}
}
}
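The subscription filter defined next streams every log event to a log_shipper Lambda that is not shown here. A minimal sketch of such a handler follows: CloudWatch Logs delivers a base64-encoded, gzip-compressed JSON payload, and ship_to_search_cluster is a placeholder for whatever forwarding client you actually use. The filter also needs an aws_lambda_permission granting the CloudWatch Logs service principal the right to invoke the function.
import base64
import gzip
import json

def handler(event, context):
    """Decode a CloudWatch Logs subscription payload and forward each event."""
    payload = json.loads(
        gzip.decompress(base64.b64decode(event["awslogs"]["data"]))
    )
    for log_event in payload["logEvents"]:
        # payload also carries logGroup / logStream, useful as index metadata
        ship_to_search_cluster(  # placeholder for the real forwarding client
            index=payload["logGroup"],
            document={
                "timestamp": log_event["timestamp"],
                "message": log_event["message"],
            },
        )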
# Subscription filter to stream logs
resource "aws_cloudwatch_log_subscription_filter" "to_elasticsearch" {
name = "logs-to-elasticsearch"
log_group_name = aws_cloudwatch_log_group.app_logs.name
filter_pattern = ""
destination_arn = aws_lambda_function.log_shipper.arn
}
Step 4: CloudWatch Dashboard
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "orders-service-prod"
dashboard_body = jsonencode({
widgets = [
# Row 1: Key Metrics
{
type = "metric"
x = 0
y = 0
width = 6
height = 6
properties = {
title = "Request Count"
region = "us-east-1"
metrics = [
["AWS/ApiGateway", "Count", "ApiName", "orders-api", "Stage", "prod",
{ stat = "Sum", period = 60 }]
]
view = "timeSeries"
}
},
{
type = "metric"
x = 6
y = 0
width = 6
height = 6
properties = {
title = "Error Rate (%)"
region = "us-east-1"
metrics = [
[{ expression = "(m1/m2)*100", label = "Error Rate", id = "e1" }],
["AWS/ApiGateway", "5XXError", "ApiName", "orders-api", "Stage", "prod",
{ stat = "Sum", period = 60, id = "m1", visible = false }],
["AWS/ApiGateway", "Count", "ApiName", "orders-api", "Stage", "prod",
{ stat = "Sum", period = 60, id = "m2", visible = false }]
]
view = "timeSeries"
annotations = {
horizontal = [
{ label = "Critical", value = 5, color = "#ff0000" }
]
}
}
},
{
type = "metric"
x = 12
y = 0
width = 6
height = 6
properties = {
title = "Latency (p50, p90, p99)"
region = "us-east-1"
metrics = [
["AWS/ApiGateway", "Latency", "ApiName", "orders-api", "Stage", "prod",
{ stat = "p50", period = 60, label = "p50" }],
["...", { stat = "p90", period = 60, label = "p90" }],
["...", { stat = "p99", period = 60, label = "p99" }]
]
view = "timeSeries"
}
},
{
type = "metric"
x = 18
y = 0
width = 6
height = 6
properties = {
title = "Active Alarms"
region = "us-east-1"
alarms = [
aws_cloudwatch_metric_alarm.api_error_rate.arn,
aws_cloudwatch_metric_alarm.latency_anomaly.arn
]
}
},
# Row 2: Lambda Metrics
{
type = "metric"
x = 0
y = 6
width = 8
height = 6
properties = {
title = "Lambda Invocations & Errors"
region = "us-east-1"
metrics = [
["AWS/Lambda", "Invocations", "FunctionName", "process-order",
{ stat = "Sum", period = 60 }],
["AWS/Lambda", "Errors", "FunctionName", "process-order",
{ stat = "Sum", period = 60, color = "#ff0000" }]
]
view = "timeSeries"
}
},
{
type = "metric"
x = 8
y = 6
width = 8
height = 6
properties = {
title = "Lambda Duration"
region = "us-east-1"
metrics = [
["AWS/Lambda", "Duration", "FunctionName", "process-order",
{ stat = "Average", period = 60 }],
["...", { stat = "Maximum", period = 60 }]
]
view = "timeSeries"
}
},
{
type = "metric"
x = 16
y = 6
width = 8
height = 6
properties = {
title = "Lambda Concurrent Executions"
region = "us-east-1"
metrics = [
["AWS/Lambda", "ConcurrentExecutions", "FunctionName", "process-order",
{ stat = "Maximum", period = 60 }]
]
view = "timeSeries"
}
},
# Row 3: Business Metrics
{
type = "metric"
x = 0
y = 12
width = 12
height = 6
properties = {
title = "Order Value (Hourly)"
region = "us-east-1"
metrics = [
["OrdersApp", "OrderValue", "Environment", "production",
{ stat = "Sum", period = 3600 }]
]
view = "timeSeries"
}
},
{
type = "metric"
x = 12
y = 12
width = 12
height = 6
properties = {
title = "Orders by Category"
region = "us-east-1"
metrics = [
["OrdersApp", "OrderCount", "ProductCategory", "electronics",
{ stat = "Sum", period = 3600 }],
["...", "clothing", { stat = "Sum", period = 3600 }],
["...", "books", { stat = "Sum", period = 3600 }]
]
view = "timeSeries"
}
},
# Row 4: Logs
{
type = "log"
x = 0
y = 18
width = 24
height = 6
properties = {
title = "Recent Errors"
region = "us-east-1"
query = <<-EOT
SOURCE '/app/orders-service'
| filter level = 'ERROR'
| sort @timestamp desc
| limit 100
EOT
}
}
]
})
}
Step 5: CloudWatch Logs Insights Queries
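The queries in this step can also be run from a runbook script rather than pasted into the console; this sketch uses boto3's start_query / get_query_results against the log group from Step 3 and polls until the query finishes.
import time
import boto3

logs = boto3.client("logs")

def run_insights_query(query: str, hours: int = 1) -> list:
    """Run a Logs Insights query against /app/orders-service and return its rows."""
    now = int(time.time())
    query_id = logs.start_query(
        logGroupName="/app/orders-service",
        startTime=now - hours * 3600,
        endTime=now,
        queryString=query,
        limit=100,
    )["queryId"]
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response["status"] in ("Complete", "Failed", "Cancelled"):
            return response["results"]
        time.sleep(1)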
# Find slow requests
fields @timestamp, @message
| filter @message like /duration/
| parse @message "duration: * ms" as duration
| filter duration > 1000
| sort duration desc
| limit 100
# Error breakdown by type
fields @timestamp, error_type, error_message
| filter level = 'ERROR'
| stats count(*) as count by error_type
| sort count desc
# Request volume by endpoint
fields @timestamp, path, method
| filter @message like /request/
| stats count(*) as requests by path, method
| sort requests desc
# Find cold starts
fields @timestamp, @message
| filter @message like /Init Duration/
| parse @message "Init Duration: * ms" as init_duration
| stats avg(init_duration) as avg_init, max(init_duration) as max_init by bin(1h)
# Trace request through logs
fields @timestamp, @message, request_id
| filter request_id = 'abc-123'
| sort @timestamp asc
Step 6: X-Ray Tracing
# Enable X-Ray for Lambda
resource "aws_lambda_function" "process_order" {
# ... other config
tracing_config {
mode = "Active"
}
}
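Active tracing captures the Lambda invocation itself; to see downstream calls (DynamoDB, HTTP, and so on) as subsegments, the handler code can be instrumented with the X-Ray SDK. This sketch assumes the aws-xray-sdk package is bundled with the function; the sampling rule below then decides how many of those traces are actually recorded.
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # auto-instrument boto3, requests, etc. so AWS calls appear as subsegments

@xray_recorder.capture("validate_order")  # custom subsegment around business logic
def validate_order(order: dict) -> None:
    ...

def handler(event, context):
    validate_order(event)
    # boto3 calls made here are traced automatically via patch_all()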
# X-Ray sampling rules
resource "aws_xray_sampling_rule" "main" {
rule_name = "orders-sampling"
priority = 1000
reservoir_size = 5
fixed_rate = 0.05 # 5% of requests
url_path = "*"
host = "*"
http_method = "*"
service_type = "*"
service_name = "orders-service"
version = 1
resource_arn = "*" # required; "*" applies the rule to all resources
}
CloudWatch Best Practices
| Component | Recommendation | Purpose |
|---|---|---|
| Alarms | Composite with anomaly detection | Reduce false positives |
| Logs | 30-90 day retention | Balance cost and debugging |
| Dashboards | 4-6 key metrics | Quick health assessment |
| Metrics | Use EMF in Lambda | Lower cost, simpler code |
| X-Ray | 5% sampling | Tracing without overhead |
Practice Question
Why should you use anomaly detection alarms instead of static threshold alarms?