Questions
Your GCP bill increased 40% last month. Identify waste and implement cost controls.
The Scenario
Your GCP bill jumped from $50,000 to $70,000 last month:
Cost breakdown:
├── Compute Engine: $25,000 (+50%)
├── BigQuery: $15,000 (+100%)
├── Cloud Storage: $12,000 (+20%)
├── GKE: $10,000 (+30%)
└── Other: $8,000 (+10%)
Finance is asking for explanations and a plan to reduce costs.
The Challenge
Analyze the cost increase, identify optimization opportunities, implement cost controls, and create a sustainable cost management strategy.
A junior engineer might immediately delete resources to cut costs, downsize all instances without analysis, turn off non-production environments entirely, or ignore the problem hoping it resolves itself. These approaches cause outages and performance degradation, block development, or let costs spiral further.
A senior engineer uses billing reports and cost analysis to identify specific causes, implements committed use discounts for predictable workloads, right-sizes resources based on utilization data, sets up budgets and alerts, and creates a culture of cost awareness with showback/chargeback.
Step 1: Analyze Cost Breakdown
# Export detailed billing to BigQuery for analysis
bq query --use_legacy_sql=false '
SELECT
service.description as service,
sku.description as sku,
project.id as project,
SUM(cost) as total_cost,
SUM(usage.amount) as usage_amount,
usage.unit
FROM `billing_export.gcp_billing_export_v1_*`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2, 3, 6
ORDER BY total_cost DESC
LIMIT 50'
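Before drilling into individual services, it helps to rank them by absolute month-over-month change. A minimal sketch, using the (hypothetical) per-service totals implied by the scenario's percentages:

```python
# Flag which services drive a cost increase, given last month's and this
# month's per-service totals. Numbers are back-calculated from the scenario
# percentages and are illustrative only.
def cost_deltas(previous, current):
    """Return {service: (absolute_change, percent_change)} sorted by absolute change."""
    deltas = {}
    for service, cost in current.items():
        prev = previous.get(service, 0)
        pct = (cost - prev) / prev * 100 if prev else float("inf")
        deltas[service] = (cost - prev, round(pct, 1))
    return dict(sorted(deltas.items(), key=lambda kv: -kv[1][0]))

previous = {"Compute Engine": 16667, "BigQuery": 7500, "Cloud Storage": 10000,
            "GKE": 7692, "Other": 7273}
current = {"Compute Engine": 25000, "BigQuery": 15000, "Cloud Storage": 12000,
           "GKE": 10000, "Other": 8000}
print(cost_deltas(previous, current))
```

Sorting by absolute dollars rather than percentage keeps attention on Compute Engine and BigQuery, which account for most of the $20,000 jump.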
# Find cost spikes by day
bq query --use_legacy_sql=false '
SELECT
DATE(usage_start_time) as date,
service.description as service,
SUM(cost) as daily_cost
FROM `billing_export.gcp_billing_export_v1_*`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY 1, 3 DESC'
Step 2: Identify Compute Waste
# Find idle VMs (low CPU utilization)
gcloud recommender recommendations list \
--project=my-project \
--location=us-central1-a \
--recommender=google.compute.instance.IdleResourceRecommender \
--format="table(content.overview.resourceName,content.overview.utilizationStats)"
# Find oversized VMs
gcloud recommender recommendations list \
--project=my-project \
--location=us-central1-a \
--recommender=google.compute.instance.MachineTypeRecommender \
--format="table(content.overview.resourceName,content.overview.recommendedMachineType)"
# List unattached disks
gcloud compute disks list \
--filter="NOT users:*" \
--format="table(name,sizeGb,zone,status)"
Step 3: Implement Committed Use Discounts
# 1-year commitment for predictable workloads (37% discount)
resource "google_compute_region_commitment" "cpu_commitment" {
name = "cpu-1year-commitment"
region = "us-central1"
type = "COMPUTE_OPTIMIZED"
plan = "TWELVE_MONTH"
resources {
type = "VCPU"
amount = 100 # 100 vCPUs committed
}
resources {
type = "MEMORY"
amount = 400 # 400 GB RAM committed
}
}
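The savings arithmetic behind a commitment is straightforward. A back-of-envelope sketch, using the 37% one-year discount from above with placeholder on-demand rates (the real rates vary by machine family and region):

```python
# Rough committed-use savings math. The vcpu/GB hourly rates below are
# placeholders; the 37% discount follows the one-year figure above.
def cud_savings(vcpus, memory_gb, vcpu_rate, gb_rate, discount):
    """Monthly savings for a commitment vs. on-demand (~730 hours/month)."""
    hours = 730
    on_demand = (vcpus * vcpu_rate + memory_gb * gb_rate) * hours
    return round(on_demand * discount, 2)

# 100 vCPUs + 400 GB at placeholder rates, one-year commitment
print(cud_savings(100, 400, vcpu_rate=0.03, gb_rate=0.004, discount=0.37))
```

The same function with `discount=0.57` models the three-year plan; the key input to get right is the committed amount, since unused commitment is billed regardless.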
# 3-year commitment for stable workloads (57% discount)
resource "google_compute_region_commitment" "cpu_commitment_3yr" {
name = "cpu-3year-commitment"
region = "us-central1"
type = "GENERAL_PURPOSE"
plan = "THIRTY_SIX_MONTH"
resources {
type = "VCPU"
amount = 50
}
resources {
type = "MEMORY"
amount = 200
}
}
Step 4: Right-Size GKE Clusters
# Enable cluster autoscaler with appropriate limits
resource "google_container_node_pool" "primary" {
name = "primary-pool"
cluster = google_container_cluster.main.name
location = "us-central1"
# Autoscaling based on actual demand
autoscaling {
min_node_count = 2 # Minimum for HA
max_node_count = 20 # Cap costs
location_policy = "BALANCED"
}
node_config {
# Use E2 instances for cost efficiency
machine_type = "e2-standard-4"
# Spot VMs for non-critical workloads (60-91% discount)
spot = true
# Only request what you need
labels = {
workload = "general"
}
}
management {
auto_repair = true
auto_upgrade = true
}
}
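Splitting the fleet between Spot and on-demand pools changes the blended node cost. A sketch with placeholder rates (Spot discounts actually vary by machine type and region within the 60-91% band):

```python
# Blended-cost effect of running part of the fleet on Spot VMs.
# The $0.13/hr node rate and 70% Spot discount are placeholder assumptions.
def blended_monthly_cost(on_demand_nodes, spot_nodes, node_rate, spot_discount=0.7):
    hours = 730
    on_demand = on_demand_nodes * node_rate * hours
    spot = spot_nodes * node_rate * (1 - spot_discount) * hours
    return round(on_demand + spot, 2)

# 13 on-demand nodes vs. 3 critical on-demand + 10 general Spot nodes
all_on_demand = blended_monthly_cost(13, 0, 0.13)
mixed = blended_monthly_cost(3, 10, 0.13)
print(all_on_demand, mixed)
```

The mixed layout mirrors the two node pools below: critical workloads stay on-demand, everything preemption-tolerant rides the discount.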
# Separate node pool for critical workloads (on-demand)
resource "google_container_node_pool" "critical" {
name = "critical-pool"
cluster = google_container_cluster.main.name
autoscaling {
min_node_count = 3
max_node_count = 10
}
node_config {
machine_type = "e2-standard-4"
spot = false # On-demand for reliability
taint {
key = "workload"
value = "critical"
effect = "NO_SCHEDULE"
}
}
}
Step 5: Optimize BigQuery Costs
-- Find expensive queries
SELECT
user_email,
query,
ROUND(total_bytes_processed / POW(10,12), 2) as tb_processed,
ROUND(total_bytes_processed / POW(10,12) * 5, 2) as cost_usd -- assumes $5/TB on-demand rate
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
ORDER BY total_bytes_processed DESC
LIMIT 20;
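On-demand BigQuery pricing is per byte scanned, which is why a bytes-billed cap bounds the worst case. A small estimator mirroring the `cost_usd` column above (it assumes the same $5/TB on-demand rate; check your contract's actual rate):

```python
# Estimate query cost from bytes scanned, and check a maximumBytesBilled cap.
PRICE_PER_TB = 5.0  # assumed on-demand rate, USD per TB scanned

def query_cost(bytes_processed, price_per_tb=PRICE_PER_TB):
    return round(bytes_processed / 10**12 * price_per_tb, 2)

def exceeds_cap(bytes_processed, maximum_bytes_billed):
    """True if the query would be rejected by the maximumBytesBilled cap."""
    return bytes_processed > maximum_bytes_billed

full_scan = 3 * 10**12      # a 3 TB full-table scan
cap = 10 * 1024**3          # a 10 GB cap
print(query_cost(full_scan))       # → 15.0
print(exceeds_cap(full_scan, cap)) # → True
```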
-- Set maximum bytes billed per query (prevent runaway costs)
-- In BigQuery console or via API:
-- maximumBytesBilled = 10737418240 (10 GB)
# Consider slot-based (reservation) pricing for heavy usage
# Compare reservation cost against your current on-demand spend before committing
resource "google_bigquery_reservation" "default" {
name = "default-reservation"
location = "US"
slot_capacity = 100
# Autoscale slots based on demand
autoscale {
max_slots = 200
}
}
Step 6: Set Up Budgets and Alerts
resource "google_billing_budget" "project_budget" {
billing_account = var.billing_account_id
display_name = "Monthly Project Budget"
budget_filter {
projects = ["projects/${var.project_id}"]
}
amount {
specified_amount {
currency_code = "USD"
units = "60000" # $60,000 budget
}
}
threshold_rules {
threshold_percent = 0.5
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 0.8
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 1.0
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 1.0
spend_basis = "FORECASTED_SPEND"
}
all_updates_rule {
monitoring_notification_channels = [
google_monitoring_notification_channel.email.id,
google_monitoring_notification_channel.slack.id
]
disable_default_iam_recipients = false
}
}
Step 7: Implement Resource Labels for Cost Attribution
# Standard labels for all resources
locals {
standard_labels = {
environment = var.environment
team = var.team
service = var.service
cost_center = var.cost_center
}
}
resource "google_compute_instance" "app" {
# ...
labels = local.standard_labels
}
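Once resources carry these labels, attribution is just a group-by over billing export rows. A sketch with hypothetical stand-in rows for the BigQuery export:

```python
# Group exported billing rows by a label key; untagged spend surfaces
# explicitly instead of disappearing. Rows are hypothetical.
from collections import defaultdict

def cost_by_label(rows, key="team"):
    totals = defaultdict(float)
    for row in rows:
        owner = row["labels"].get(key, "unlabeled")
        totals[owner] += row["cost"]
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

rows = [
    {"cost": 1200.0, "labels": {"team": "search", "environment": "prod"}},
    {"cost": 300.0,  "labels": {"team": "ads"}},
    {"cost": 450.0,  "labels": {"team": "search"}},
    {"cost": 80.0,   "labels": {}},
]
print(cost_by_label(rows))  # → {'search': 1650.0, 'ads': 300.0, 'unlabeled': 80.0}
```

Tracking the "unlabeled" bucket over time is a useful measure of how well the labeling standard is actually being followed.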
resource "google_storage_bucket" "data" {
# ...
labels = local.standard_labels
}
-- Query costs by team
SELECT
labels.value as team,
SUM(cost) as total_cost
FROM `billing_export.gcp_billing_export_v1_*`
CROSS JOIN UNNEST(labels) as labels
WHERE labels.key = 'team'
AND _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1
ORDER BY 2 DESC;
Step 8: Automate Cost Optimization
# Cloud Function to stop dev instances at night
# functions/auto-stop/main.py
from google.cloud import compute_v1

def stop_dev_instances(event, context):
    """Stop all instances labeled environment=dev."""
    client = compute_v1.InstancesClient()
    project = 'my-project'
    # Iterate every zone in the project
    zones_client = compute_v1.ZonesClient()
    zones = zones_client.list(project=project)
    for zone in zones:
        instances = client.list(project=project, zone=zone.name)
        for instance in instances:
            labels = instance.labels or {}
            if labels.get('environment') == 'dev' and instance.status == 'RUNNING':
                print(f"Stopping {instance.name} in {zone.name}")
                client.stop(project=project, zone=zone.name, instance=instance.name)
Cost Optimization Summary
| Category | Action | Estimated Savings |
|---|---|---|
| Compute | Right-size VMs | 20-40% |
| Compute | Committed use (1yr) | 37% |
| Compute | Spot VMs (non-critical) | 60-91% |
| GKE | Cluster autoscaler | 30-50% |
| BigQuery | Partitioning/clustering | 50-90% |
| Storage | Lifecycle policies | 40-70% |
| All | Delete unused resources | 10-20% |
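The percentages above apply to different slices of spend, so they don't simply add up. A sketch of combining them over disjoint slices of the (scenario's) $25,000 Compute Engine bill, with hypothetical slice sizes:

```python
# Combine savings measures applied to disjoint slices of a base bill.
# Slice fractions and discounts below are illustrative assumptions.
def apply_savings(base, steps):
    """steps: list of (fraction_of_base_affected, discount); fractions must not overlap."""
    cost = base
    for fraction, discount in steps:
        cost -= base * fraction * discount
    return round(cost, 2)

compute_bill = 25000
steps = [
    (0.3, 0.30),  # right-size 30% of the fleet, saving ~30% on that slice
    (0.4, 0.37),  # 1-year CUD covering the stable 40%
    (0.2, 0.70),  # move 20% of workloads to Spot
]
print(apply_savings(compute_bill, steps))
```

Modeling savings this way keeps the plan honest when reporting projected reductions to finance.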
Cost Optimization Checklist
| Check | Tool | Frequency |
|---|---|---|
| Idle VMs | Recommender | Weekly |
| Oversized VMs | Recommender | Weekly |
| Unattached disks | gcloud compute | Weekly |
| Expensive queries | INFORMATION_SCHEMA | Daily |
| Budget status | Billing console | Daily |
| Commitment coverage | Billing reports | Monthly |
Cost Governance Model
┌─────────────────┐
│ FinOps Team │
│ (Centralized) │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Team A │ │ Team B │ │ Team C │
│ Budget │ │ Budget │ │ Budget │
└──────────┘ └──────────┘ └──────────┘
Practice Question
Why are Committed Use Discounts more cost-effective than Sustained Use Discounts for predictable workloads?
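A back-of-envelope comparison for reasoning about the answer. The 30% figure below is the approximate maximum sustained-use discount (an instance running the entire month on eligible machine types); committed use applies its discount to the committed amount regardless of usage pattern:

```python
# Compare best-case sustained-use vs. 1-year committed-use on the same spend.
# The 30% SUD ceiling and hypothetical $10,000 spend are assumptions.
def monthly_cost(on_demand, discount):
    return round(on_demand * (1 - discount), 2)

on_demand = 10000  # hypothetical monthly on-demand spend
print(monthly_cost(on_demand, 0.30))  # best-case sustained use
print(monthly_cost(on_demand, 0.37))  # 1-year committed use
```

Even against the SUD best case, the committed rate is lower, and SUD's best case only materializes for instances that run continuously all month.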