Build a monitoring dashboard and alerting system that catches issues before users notice.
The Scenario
Your team discovers issues from user complaints, not monitoring:
Timeline of a typical incident:
09:00 - Error rate increases to 5%
09:15 - Database connections exhausted
09:30 - First user complaint received
09:45 - Team starts investigating
10:00 - Root cause identified
10:30 - Issue resolved
Total user impact: 1.5 hours
You need a monitoring system that would have caught this at 09:00.
The Challenge
Design a comprehensive monitoring and alerting strategy using Cloud Monitoring, with SLIs/SLOs, proactive alerts, and runbooks for common issues.
A junior engineer might alert on every metric exceeding a threshold, use static thresholds that don't account for traffic patterns, create too many alerts causing fatigue, or only alert on errors without context. This leads to alert storms, missed issues, and ineffective response.
A senior engineer defines SLIs/SLOs based on user experience, creates multi-signal alerts that reduce false positives, implements alerting tiers (warning vs critical), uses anomaly detection for dynamic thresholds, and includes runbooks with each alert.
Step 1: Define Service Level Indicators (SLIs)
# SLIs based on user experience
availability_sli:
  description: "Percentage of successful requests"
  calculation: |
    successful_requests / total_requests
  good_threshold: "> 99.9%"

latency_sli:
  description: "P95 request latency"
  calculation: |
    95th percentile of request duration
  good_threshold: "< 500ms"

error_rate_sli:
  description: "Percentage of 5xx errors"
  calculation: |
    5xx_responses / total_responses
  good_threshold: "< 0.1%"
Step 2: Create Custom Metrics
from google.cloud import monitoring_v3
import time

PROJECT_ID = "your-project-id"  # replace with your project ID
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"

def record_custom_metric(metric_type, value, labels=None):
    """Record a single data point for a custom metric in Cloud Monitoring."""
    series = monitoring_v3.TimeSeries()
    series.metric.type = f"custom.googleapis.com/{metric_type}"
    if labels:
        series.metric.labels.update(labels)
    series.resource.type = "global"
    # Build the data point with an end time of "now"
    now = time.time()
    seconds = int(now)
    nanos = int((now - seconds) * 10**9)
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": seconds, "nanos": nanos}}
    )
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"double_value": value}}
    )
    series.points = [point]
    client.create_time_series(name=project_name, time_series=[series])

# Usage: Track business metrics
record_custom_metric("orders/processing_time", 1.5, {"status": "success"})
record_custom_metric("payments/amount", 99.99, {"currency": "USD"})
Step 3: Create Monitoring Dashboard
resource "google_monitoring_dashboard" "app" {
dashboard_json = jsonencode({
displayName = "Application Health Dashboard"
mosaicLayout = {
columns = 12
tiles = [
# Request Rate
{
width = 4
height = 4
widget = {
title = "Request Rate"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\" resource.type=\"https_lb_rule\""
aggregation = {
alignmentPeriod = "60s"
perSeriesAligner = "ALIGN_RATE"
}
}
}
}]
}
}
},
# Error Rate
{
width = 4
height = 4
xPos = 4
widget = {
title = "Error Rate (%)"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilterRatio = {
numerator = {
filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\" metric.labels.response_code_class=\"500\""
}
denominator = {
filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\""
}
}
}
}]
thresholds = [{
value = 0.01
color = "YELLOW"
direction = "ABOVE"
}, {
value = 0.05
color = "RED"
direction = "ABOVE"
}]
}
}
},
# P95 Latency
{
width = 4
height = 4
xPos = 8
widget = {
title = "P95 Latency (ms)"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"loadbalancing.googleapis.com/https/total_latencies\""
aggregation = {
alignmentPeriod = "60s"
perSeriesAligner = "ALIGN_PERCENTILE_95"
}
}
}
}]
}
}
},
# Database Connections
{
width = 6
height = 4
yPos = 4
widget = {
title = "Cloud SQL Connections"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"cloudsql.googleapis.com/database/postgresql/num_backends\""
}
}
}]
}
}
},
# GKE Pod Status
{
width = 6
height = 4
xPos = 6
yPos = 4
widget = {
title = "GKE Pod Status"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"kubernetes.io/container/restart_count\" resource.type=\"k8s_container\""
aggregation = {
alignmentPeriod = "300s"
perSeriesAligner = "ALIGN_DELTA"
}
}
}
}]
}
}
}
]
}
})
}Step 4: Create Alert Policies
# Critical: High Error Rate
resource "google_monitoring_alert_policy" "high_error_rate" {
  display_name = "[CRITICAL] High Error Rate"
  combiner     = "OR"
  conditions {
    display_name = "Error rate > 5%"
    condition_threshold {
      filter = <<-EOT
        metric.type="loadbalancing.googleapis.com/https/request_count"
        AND metric.labels.response_code_class="500"
      EOT
      aggregations {
        alignment_period     = "60s"
        per_series_aligner   = "ALIGN_RATE"
        cross_series_reducer = "REDUCE_SUM"
      }
      denominator_filter = <<-EOT
        metric.type="loadbalancing.googleapis.com/https/request_count"
      EOT
      denominator_aggregations {
        alignment_period     = "60s"
        per_series_aligner   = "ALIGN_RATE"
        cross_series_reducer = "REDUCE_SUM"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 0.05
      duration        = "60s"
      trigger {
        count = 1
      }
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.pagerduty.id,
    google_monitoring_notification_channel.slack_critical.id
  ]
  documentation {
    content   = <<-EOT
      ## High Error Rate Alert

      ### Impact
      Users are experiencing errors. Error rate exceeds 5%.

      ### Runbook
      1. Check Cloud Logging for error patterns:
         `resource.type="k8s_container" severity>=ERROR`
      2. Check backend health:
         `gcloud compute backend-services get-health app-backend --global`
      3. Check database connections:
         `gcloud sql instances describe production-db`
      4. Recent deployments:
         `kubectl rollout history deployment/api-server`

      ### Escalation
      If not resolved in 15 minutes, page on-call manager.
    EOT
    mime_type = "text/markdown"
  }
  alert_strategy {
    auto_close = "1800s" # Auto-close after 30 min if resolved
  }
}
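In the incident timeline, database connections were exhausted at 09:15, and the dashboard already charts Cloud SQL connections, but no alert covers them. The following is a minimal sketch of such a policy, assuming a PostgreSQL instance with max_connections around 100; the resource name, threshold, and durations are illustrative and should be tuned to your instance.
# Critical: Cloud SQL connections approaching the limit (sketch; tune the
# threshold to your instance's max_connections setting)
resource "google_monitoring_alert_policy" "db_connections_high" {
  display_name = "[CRITICAL] Cloud SQL Connections Near Limit"
  combiner     = "OR"
  conditions {
    display_name = "Active connections > 80% of max"
    condition_threshold {
      filter = <<-EOT
        metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"
        AND resource.type="cloudsql_database"
      EOT
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 80 # assumes max_connections = 100; adjust accordingly
      duration        = "120s"
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.pagerduty.id
  ]
}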
# Warning: Elevated Latency
resource "google_monitoring_alert_policy" "elevated_latency" {
  display_name = "[WARNING] Elevated Latency"
  combiner     = "OR"
  conditions {
    display_name = "P95 latency > 1s"
    condition_threshold {
      filter = "metric.type=\"loadbalancing.googleapis.com/https/total_latencies\""
      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_PERCENTILE_95"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 1000 # 1 second in ms
      duration        = "300s"
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.slack_warning.id
  ]
}
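The challenge also calls for multi-signal alerts that reduce false positives. One way to express that in Cloud Monitoring is an AND-combined policy that only fires when errors and latency degrade together, filtering out brief single-metric blips. The following is a sketch of that idea; the thresholds and the resource name are illustrative, not part of the original design.
# Multi-signal: page only when errors AND latency degrade together
resource "google_monitoring_alert_policy" "multi_signal_degradation" {
  display_name = "[CRITICAL] Errors and Latency Degraded"
  combiner     = "AND" # both conditions must hold before the alert fires
  conditions {
    display_name = "Error responses elevated"
    condition_threshold {
      filter = <<-EOT
        metric.type="loadbalancing.googleapis.com/https/request_count"
        AND metric.labels.response_code_class="500"
      EOT
      aggregations {
        alignment_period     = "60s"
        per_series_aligner   = "ALIGN_RATE"
        cross_series_reducer = "REDUCE_SUM"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 1 # errors per second; tune to your traffic
      duration        = "120s"
    }
  }
  conditions {
    display_name = "P95 latency > 1s"
    condition_threshold {
      filter = "metric.type=\"loadbalancing.googleapis.com/https/total_latencies\""
      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_PERCENTILE_95"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 1000
      duration        = "120s"
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.pagerduty.id
  ]
}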
# Anomaly Detection: Traffic Drop
resource "google_monitoring_alert_policy" "traffic_anomaly" {
  display_name = "[WARNING] Traffic Anomaly Detected"
  combiner     = "OR"
  conditions {
    display_name = "Traffic below expected range"
    condition_threshold {
      filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\""
      aggregations {
        alignment_period     = "300s"
        per_series_aligner   = "ALIGN_RATE"
        cross_series_reducer = "REDUCE_SUM"
      }
      # Alert when traffic falls below the expected floor
      comparison      = "COMPARISON_LT"
      threshold_value = 100 # Adjust based on baseline
      # Forecasting fires the alert when traffic is predicted to cross the
      # threshold within the horizon, instead of waiting for it to happen
      forecast_options {
        forecast_horizon = "3600s"
      }
      duration = "600s"
    }
  }
}
Step 5: Set Up Notification Channels
# PagerDuty for critical alerts
resource "google_monitoring_notification_channel" "pagerduty" {
  display_name = "PagerDuty"
  type         = "pagerduty"
  # The service key is a credential, so keep it in sensitive_labels rather
  # than plain labels
  sensitive_labels {
    service_key = var.pagerduty_service_key
  }
}
# Slack for warnings
resource "google_monitoring_notification_channel" "slack_warning" {
  display_name = "Slack #alerts-warning"
  type         = "slack"
  labels = {
    channel_name = "#alerts-warning"
  }
  sensitive_labels {
    auth_token = var.slack_token
  }
}
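The [CRITICAL] error-rate policy in Step 4 also references google_monitoring_notification_channel.slack_critical, which has not been defined. A minimal sketch, assuming a dedicated Slack channel for critical alerts; the #alerts-critical channel name is illustrative.
# Slack for critical alerts (referenced by the [CRITICAL] policy in Step 4)
resource "google_monitoring_notification_channel" "slack_critical" {
  display_name = "Slack #alerts-critical"
  type         = "slack"
  labels = {
    channel_name = "#alerts-critical" # illustrative channel name
  }
  sensitive_labels {
    auth_token = var.slack_token
  }
}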
# Email for daily digests
resource "google_monitoring_notification_channel" "email" {
  display_name = "Team Email"
  type         = "email"
  labels = {
    email_address = "team@example.com"
  }
}
Step 6: Create SLO Monitoring
resource "google_monitoring_slo" "availability" {
service = google_monitoring_custom_service.app.service_id
slo_id = "availability-slo"
display_name = "99.9% Availability"
goal = 0.999
rolling_period_days = 30
request_based_sli {
good_total_ratio {
good_service_filter = <<-EOT
metric.type="loadbalancing.googleapis.com/https/request_count"
metric.labels.response_code_class!="500"
EOT
total_service_filter = <<-EOT
metric.type="loadbalancing.googleapis.com/https/request_count"
EOT
}
}
}
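Step 1 also defined a latency SLI (P95 under 500ms), which can be tracked as an SLO using a distribution cut over the load balancer's latency distribution. The following is a minimal sketch; the 0.95 goal mirrors the P95 framing and, like the slo_id, is an assumption to adjust for your service.
# Latency SLO: 95% of requests complete within 500 ms (sketch; threshold
# mirrors the latency SLI from Step 1)
resource "google_monitoring_slo" "latency" {
  service             = google_monitoring_custom_service.app.service_id
  slo_id              = "latency-slo"
  display_name        = "95% of Requests Under 500ms"
  goal                = 0.95
  rolling_period_days = 30
  request_based_sli {
    distribution_cut {
      distribution_filter = "metric.type=\"loadbalancing.googleapis.com/https/total_latencies\" resource.type=\"https_lb_rule\""
      range {
        max = 500 # milliseconds, matching the "< 500ms" SLI threshold
      }
    }
  }
}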
# Alert when burning through error budget too fast
resource "google_monitoring_alert_policy" "slo_burn_rate" {
  display_name = "SLO Burn Rate Alert"
  combiner     = "OR"
  conditions {
    display_name = "Burning error budget too fast"
    condition_threshold {
      filter = <<-EOT
        select_slo_burn_rate(
          "projects/${var.project}/services/${google_monitoring_custom_service.app.service_id}/serviceLevelObjectives/${google_monitoring_slo.availability.slo_id}",
          "1h"
        )
      EOT
      comparison      = "COMPARISON_GT"
      threshold_value = 10 # 10x normal burn rate
      duration        = "0s"
    }
  }
}
Alert Severity Matrix
| Severity | Response Time | Notification | Examples |
|---|---|---|---|
| Critical | 5 minutes | PagerDuty + Slack | >5% errors, service down |
| Warning | 30 minutes | Slack | >1% errors, high latency |
| Info | Next business day | Email digest | SLO trending down |
Golden Signals
| Signal | Metric | Alert Threshold |
|---|---|---|
| Latency | P95 response time | > 1 second |
| Traffic | Requests per second | Anomaly detection |
| Errors | 5xx error rate | > 0.1% |
| Saturation | CPU/Memory usage | > 80% |
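Saturation is the one golden signal without an alert in the steps above. A minimal sketch for GKE container CPU follows, assuming CPU limits are set on the workloads (kubernetes.io/container/cpu/limit_utilization reports usage as a fraction of the limit); the threshold, durations, and resource name are illustrative.
# Saturation: CPU usage above 80% of the container's limit (sketch)
resource "google_monitoring_alert_policy" "cpu_saturation" {
  display_name = "[WARNING] Container CPU Saturation"
  combiner     = "OR"
  conditions {
    display_name = "CPU > 80% of limit"
    condition_threshold {
      filter = <<-EOT
        metric.type="kubernetes.io/container/cpu/limit_utilization"
        AND resource.type="k8s_container"
      EOT
      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MEAN"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 0.8
      duration        = "600s"
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.slack_warning.id
  ]
}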
Practice Question
Why should you alert on error budget burn rate rather than just error rate threshold?