Build a monitoring dashboard and alerting system that catches issues before users notice.
The Scenario
Your team discovers issues from user complaints, not monitoring:
Timeline of a typical incident:
09:00 - Error rate increases to 5%
09:15 - Database connections exhausted
09:30 - First user complaint received
09:45 - Team starts investigating
10:00 - Root cause identified
10:30 - Issue resolved
Total user impact: 1.5 hours
You need a monitoring system that would have caught this at 09:00.
The Challenge
Design a comprehensive monitoring and alerting strategy using Cloud Monitoring, with SLIs/SLOs, proactive alerts, and runbooks for common issues.
A junior engineer might alert on every metric exceeding a threshold, use static thresholds that don't account for traffic patterns, create too many alerts causing fatigue, or only alert on errors without context. This leads to alert storms, missed issues, and ineffective response.
A senior engineer defines SLIs/SLOs based on user experience, creates multi-signal alerts that reduce false positives, implements alerting tiers (warning vs critical), uses anomaly detection for dynamic thresholds, and includes runbooks with each alert.
Step 1: Define Service Level Indicators (SLIs)
# SLIs based on user experience
availability_sli:
  description: "Percentage of successful requests"
  calculation: |
    successful_requests / total_requests
  good_threshold: "> 99.9%"

latency_sli:
  description: "P95 request latency"
  calculation: |
    95th percentile of request duration
  good_threshold: "< 500ms"

error_rate_sli:
  description: "Percentage of 5xx errors"
  calculation: |
    5xx_responses / total_responses
  good_threshold: "< 0.1%"
Step 2: Create Custom Metrics
from google.cloud import monitoring_v3
import time

PROJECT_ID = "your-project-id"  # replace with your project ID
client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{PROJECT_ID}"

def record_custom_metric(metric_type, value, labels=None):
    """Record a single data point for a custom metric in Cloud Monitoring."""
    series = monitoring_v3.TimeSeries()
    series.metric.type = f"custom.googleapis.com/{metric_type}"
    if labels:
        series.metric.labels.update(labels)
    series.resource.type = "global"
    # Build the data point with an end time of "now"
    now = time.time()
    seconds = int(now)
    nanos = int((now - seconds) * 10**9)
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": seconds, "nanos": nanos}}
    )
    point = monitoring_v3.Point(
        {"interval": interval, "value": {"double_value": value}}
    )
    series.points = [point]
    client.create_time_series(name=project_name, time_series=[series])

# Usage: Track business metrics
record_custom_metric("orders/processing_time", 1.5, {"status": "success"})
record_custom_metric("payments/amount", 99.99, {"currency": "USD"})
Step 3: Create Monitoring Dashboard
resource "google_monitoring_dashboard" "app" {
dashboard_json = jsonencode({
displayName = "Application Health Dashboard"
mosaicLayout = {
columns = 12
tiles = [
# Request Rate
{
width = 4
height = 4
widget = {
title = "Request Rate"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\" resource.type=\"https_lb_rule\""
aggregation = {
alignmentPeriod = "60s"
perSeriesAligner = "ALIGN_RATE"
}
}
}
}]
}
}
},
# Error Rate
{
width = 4
height = 4
xPos = 4
widget = {
title = "Error Rate (%)"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilterRatio = {
numerator = {
filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\" metric.labels.response_code_class=\"500\""
}
denominator = {
filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\""
}
}
}
}]
thresholds = [{
value = 0.01
color = "YELLOW"
direction = "ABOVE"
}, {
value = 0.05
color = "RED"
direction = "ABOVE"
}]
}
}
},
# P95 Latency
{
width = 4
height = 4
xPos = 8
widget = {
title = "P95 Latency (ms)"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"loadbalancing.googleapis.com/https/total_latencies\""
aggregation = {
alignmentPeriod = "60s"
perSeriesAligner = "ALIGN_PERCENTILE_95"
}
}
}
}]
}
}
},
# Database Connections
{
width = 6
height = 4
yPos = 4
widget = {
title = "Cloud SQL Connections"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"cloudsql.googleapis.com/database/postgresql/num_backends\""
}
}
}]
}
}
},
# GKE Pod Status
{
width = 6
height = 4
xPos = 6
yPos = 4
widget = {
title = "GKE Pod Status"
xyChart = {
dataSets = [{
timeSeriesQuery = {
timeSeriesFilter = {
filter = "metric.type=\"kubernetes.io/container/restart_count\" resource.type=\"k8s_container\""
aggregation = {
alignmentPeriod = "300s"
perSeriesAligner = "ALIGN_DELTA"
}
}
}
}]
}
}
}
]
}
})
}Step 4: Create Alert Policies
# Critical: High Error Rate
resource "google_monitoring_alert_policy" "high_error_rate" {
  display_name = "[CRITICAL] High Error Rate"
  combiner     = "OR"
  conditions {
    display_name = "Error rate > 5%"
    condition_threshold {
      filter = <<-EOT
        metric.type="loadbalancing.googleapis.com/https/request_count"
        AND metric.labels.response_code_class="500"
      EOT
      aggregations {
        alignment_period     = "60s"
        per_series_aligner   = "ALIGN_RATE"
        cross_series_reducer = "REDUCE_SUM"
      }
      denominator_filter = <<-EOT
        metric.type="loadbalancing.googleapis.com/https/request_count"
      EOT
      denominator_aggregations {
        alignment_period     = "60s"
        per_series_aligner   = "ALIGN_RATE"
        cross_series_reducer = "REDUCE_SUM"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 0.05
      duration        = "60s"
      trigger {
        count = 1
      }
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.pagerduty.id,
    google_monitoring_notification_channel.slack_critical.id
  ]
  documentation {
    content   = <<-EOT
      ## High Error Rate Alert

      ### Impact
      Users are experiencing errors. Error rate exceeds 5%.

      ### Runbook
      1. Check Cloud Logging for error patterns:
         `resource.type="k8s_container" severity>=ERROR`
      2. Check backend health:
         `gcloud compute backend-services get-health app-backend --global`
      3. Check database connections:
         `gcloud sql instances describe production-db`
      4. Recent deployments:
         `kubectl rollout history deployment/api-server`

      ### Escalation
      If not resolved in 15 minutes, page on-call manager.
    EOT
    mime_type = "text/markdown"
  }
  alert_strategy {
    auto_close = "1800s" # Auto-close after 30 min if resolved
  }
}
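In the incident timeline, database connections were exhausted at 09:15, and the dashboard already charts Cloud SQL connections, but no alert covers them. The following is a minimal sketch of such a policy, assuming a PostgreSQL instance with max_connections around 100; the resource name, threshold, and durations are illustrative and should be tuned to your instance.
# Critical: Cloud SQL connections approaching the limit (sketch; tune the
# threshold to your instance's max_connections setting)
resource "google_monitoring_alert_policy" "db_connections_high" {
  display_name = "[CRITICAL] Cloud SQL Connections Near Limit"
  combiner     = "OR"
  conditions {
    display_name = "Active connections > 80% of max"
    condition_threshold {
      filter = <<-EOT
        metric.type="cloudsql.googleapis.com/database/postgresql/num_backends"
        AND resource.type="cloudsql_database"
      EOT
      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_MEAN"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 80 # assumes max_connections = 100; adjust accordingly
      duration        = "120s"
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.pagerduty.id
  ]
}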
# Warning: Elevated Latency
resource "google_monitoring_alert_policy" "elevated_latency" {
  display_name = "[WARNING] Elevated Latency"
  combiner     = "OR"
  conditions {
    display_name = "P95 latency > 1s"
    condition_threshold {
      filter = "metric.type=\"loadbalancing.googleapis.com/https/total_latencies\""
      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_PERCENTILE_95"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 1000 # 1 second in ms
      duration        = "300s"
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.slack_warning.id
  ]
}
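The challenge also calls for multi-signal alerts that reduce false positives. One way to express that in Cloud Monitoring is an AND-combined policy that only fires when errors and latency degrade together, filtering out brief single-metric blips. The following is a sketch of that idea; the thresholds and the resource name are illustrative, not part of the original design.
# Multi-signal: page only when errors AND latency degrade together
resource "google_monitoring_alert_policy" "multi_signal_degradation" {
  display_name = "[CRITICAL] Errors and Latency Degraded"
  combiner     = "AND" # both conditions must hold before the alert fires
  conditions {
    display_name = "Error responses elevated"
    condition_threshold {
      filter = <<-EOT
        metric.type="loadbalancing.googleapis.com/https/request_count"
        AND metric.labels.response_code_class="500"
      EOT
      aggregations {
        alignment_period     = "60s"
        per_series_aligner   = "ALIGN_RATE"
        cross_series_reducer = "REDUCE_SUM"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 1 # errors per second; tune to your traffic
      duration        = "120s"
    }
  }
  conditions {
    display_name = "P95 latency > 1s"
    condition_threshold {
      filter = "metric.type=\"loadbalancing.googleapis.com/https/total_latencies\""
      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_PERCENTILE_95"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 1000
      duration        = "120s"
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.pagerduty.id
  ]
}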
# Anomaly Detection: Traffic Drop
resource "google_monitoring_alert_policy" "traffic_anomaly" {
  display_name = "[WARNING] Traffic Anomaly Detected"
  combiner     = "OR"
  conditions {
    display_name = "Traffic below expected range"
    condition_threshold {
      filter = "metric.type=\"loadbalancing.googleapis.com/https/request_count\""
      aggregations {
        alignment_period     = "300s"
        per_series_aligner   = "ALIGN_RATE"
        cross_series_reducer = "REDUCE_SUM"
      }
      # Alert when traffic falls below the expected floor
      comparison      = "COMPARISON_LT"
      threshold_value = 100 # Adjust based on baseline
      # Forecasting fires the alert when traffic is predicted to cross the
      # threshold within the horizon, instead of waiting for it to happen
      forecast_options {
        forecast_horizon = "3600s"
      }
      duration = "600s"
    }
  }
}
Step 5: Set Up Notification Channels
# PagerDuty for critical alerts
resource "google_monitoring_notification_channel" "pagerduty" {
  display_name = "PagerDuty"
  type         = "pagerduty"
  # The service key is a credential, so keep it in sensitive_labels rather
  # than plain labels
  sensitive_labels {
    service_key = var.pagerduty_service_key
  }
}
# Slack for warnings
resource "google_monitoring_notification_channel" "slack_warning" {
  display_name = "Slack #alerts-warning"
  type         = "slack"
  labels = {
    channel_name = "#alerts-warning"
  }
  sensitive_labels {
    auth_token = var.slack_token
  }
}
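The [CRITICAL] error-rate policy in Step 4 also references google_monitoring_notification_channel.slack_critical, which has not been defined. A minimal sketch, assuming a dedicated Slack channel for critical alerts; the #alerts-critical channel name is illustrative.
# Slack for critical alerts (referenced by the [CRITICAL] policy in Step 4)
resource "google_monitoring_notification_channel" "slack_critical" {
  display_name = "Slack #alerts-critical"
  type         = "slack"
  labels = {
    channel_name = "#alerts-critical" # illustrative channel name
  }
  sensitive_labels {
    auth_token = var.slack_token
  }
}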
# Email for daily digests
resource "google_monitoring_notification_channel" "email" {
  display_name = "Team Email"
  type         = "email"
  labels = {
    email_address = "team@example.com"
  }
}
Step 6: Create SLO Monitoring
resource "google_monitoring_slo" "availability" {
service = google_monitoring_custom_service.app.service_id
slo_id = "availability-slo"
display_name = "99.9% Availability"
goal = 0.999
rolling_period_days = 30
request_based_sli {
good_total_ratio {
good_service_filter = <<-EOT
metric.type="loadbalancing.googleapis.com/https/request_count"
metric.labels.response_code_class!="500"
EOT
total_service_filter = <<-EOT
metric.type="loadbalancing.googleapis.com/https/request_count"
EOT
}
}
}
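Step 1 also defined a latency SLI (P95 under 500ms), which can be tracked as an SLO using a distribution cut over the load balancer's latency distribution. The following is a minimal sketch; the 0.95 goal mirrors the P95 framing and, like the slo_id, is an assumption to adjust for your service.
# Latency SLO: 95% of requests complete within 500 ms (sketch; threshold
# mirrors the latency SLI from Step 1)
resource "google_monitoring_slo" "latency" {
  service             = google_monitoring_custom_service.app.service_id
  slo_id              = "latency-slo"
  display_name        = "95% of Requests Under 500ms"
  goal                = 0.95
  rolling_period_days = 30
  request_based_sli {
    distribution_cut {
      distribution_filter = "metric.type=\"loadbalancing.googleapis.com/https/total_latencies\" resource.type=\"https_lb_rule\""
      range {
        max = 500 # milliseconds, matching the "< 500ms" SLI threshold
      }
    }
  }
}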
# Alert when burning through error budget too fast
resource "google_monitoring_alert_policy" "slo_burn_rate" {
  display_name = "SLO Burn Rate Alert"
  combiner     = "OR"
  conditions {
    display_name = "Burning error budget too fast"
    condition_threshold {
      filter = <<-EOT
        select_slo_burn_rate(
          "projects/${var.project}/services/${google_monitoring_custom_service.app.service_id}/serviceLevelObjectives/${google_monitoring_slo.availability.slo_id}",
          "1h"
        )
      EOT
      comparison      = "COMPARISON_GT"
      threshold_value = 10 # 10x normal burn rate
      duration        = "0s"
    }
  }
}
Alert Severity Matrix
| Severity | Response Time | Notification | Examples |
|---|---|---|---|
| Critical | 5 minutes | PagerDuty + Slack | >5% errors, service down |
| Warning | 30 minutes | Slack | >1% errors, high latency |
| Info | Next business day | Email digest | SLO trending down |
Golden Signals
| Signal | Metric | Alert Threshold |
|---|---|---|
| Latency | P95 response time | > 1 second |
| Traffic | Requests per second | Anomaly detection |
| Errors | 5xx error rate | > 0.1% |
| Saturation | CPU/Memory usage | > 80% |
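Saturation is the one golden signal without an alert in the steps above. A minimal sketch for GKE container CPU follows, assuming CPU limits are set on the workloads (kubernetes.io/container/cpu/limit_utilization reports usage as a fraction of the limit); the threshold, durations, and resource name are illustrative.
# Saturation: CPU usage above 80% of the container's limit (sketch)
resource "google_monitoring_alert_policy" "cpu_saturation" {
  display_name = "[WARNING] Container CPU Saturation"
  combiner     = "OR"
  conditions {
    display_name = "CPU > 80% of limit"
    condition_threshold {
      filter = <<-EOT
        metric.type="kubernetes.io/container/cpu/limit_utilization"
        AND resource.type="k8s_container"
      EOT
      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MEAN"
      }
      comparison      = "COMPARISON_GT"
      threshold_value = 0.8
      duration        = "600s"
    }
  }
  notification_channels = [
    google_monitoring_notification_channel.slack_warning.id
  ]
}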
Practice Question
Why should you alert on error budget burn rate rather than just error rate threshold?