Questions
Your GCP bill increased 40% last month. Identify waste and implement cost controls.
The Scenario
Your GCP bill jumped from $50,000 to $70,000 last month:
Cost breakdown:
├── Compute Engine: $25,000 (+50%)
├── BigQuery: $15,000 (+100%)
├── Cloud Storage: $12,000 (+20%)
├── GKE: $10,000 (+30%)
└── Other: $8,000 (+10%)
Finance is asking for explanations and a plan to reduce costs.
The Challenge
Analyze the cost increase, identify optimization opportunities, implement cost controls, and create a sustainable cost management strategy.
A junior engineer might immediately delete resources to cut costs, downsize all instances without analysis, turn off non-production environments entirely, or ignore the problem hoping it resolves itself. These approaches cause outages and performance degradation, block development, or let costs spiral further.
A senior engineer uses billing reports and cost analysis to identify specific causes, implements committed use discounts for predictable workloads, right-sizes resources based on utilization data, sets up budgets and alerts, and creates a culture of cost awareness with showback/chargeback.
Step 1: Analyze Cost Breakdown
# Export detailed billing to BigQuery for analysis
bq query --use_legacy_sql=false '
SELECT
service.description as service,
sku.description as sku,
project.id as project,
SUM(cost) as total_cost,
SUM(usage.amount) as usage_amount,
usage.unit
FROM `billing_export.gcp_billing_export_v1_*`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2, 3, 6
ORDER BY total_cost DESC
LIMIT 50'
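Before drilling into individual services, it helps to rank them by absolute month-over-month change. A minimal sketch, using the (hypothetical) per-service totals implied by the scenario's percentages:

```python
# Flag which services drive a cost increase, given last month's and this
# month's per-service totals. Numbers are back-calculated from the scenario
# percentages and are illustrative only.
def cost_deltas(previous, current):
    """Return {service: (absolute_change, percent_change)} sorted by absolute change."""
    deltas = {}
    for service, cost in current.items():
        prev = previous.get(service, 0)
        pct = (cost - prev) / prev * 100 if prev else float("inf")
        deltas[service] = (cost - prev, round(pct, 1))
    return dict(sorted(deltas.items(), key=lambda kv: -kv[1][0]))

previous = {"Compute Engine": 16667, "BigQuery": 7500, "Cloud Storage": 10000,
            "GKE": 7692, "Other": 7273}
current = {"Compute Engine": 25000, "BigQuery": 15000, "Cloud Storage": 12000,
           "GKE": 10000, "Other": 8000}
print(cost_deltas(previous, current))
```

Sorting by absolute dollars rather than percentage keeps attention on Compute Engine and BigQuery, which account for most of the $20,000 jump.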
# Find cost spikes by day
bq query --use_legacy_sql=false '
SELECT
DATE(usage_start_time) as date,
service.description as service,
SUM(cost) as daily_cost
FROM `billing_export.gcp_billing_export_v1_*`
WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY 1, 3 DESC'
Step 2: Identify Compute Waste
# Find idle VMs (low CPU utilization)
gcloud recommender recommendations list \
--project=my-project \
--location=us-central1-a \
--recommender=google.compute.instance.IdleResourceRecommender \
--format="table(content.overview.resourceName,content.overview.utilizationStats)"
# Find oversized VMs
gcloud recommender recommendations list \
--project=my-project \
--location=us-central1-a \
--recommender=google.compute.instance.MachineTypeRecommender \
--format="table(content.overview.resourceName,content.overview.recommendedMachineType)"
# List unattached disks
gcloud compute disks list \
--filter="NOT users:*" \
--format="table(name,sizeGb,zone,status)"
Step 3: Implement Committed Use Discounts
# 1-year commitment for predictable workloads (37% discount)
resource "google_compute_region_commitment" "cpu_commitment" {
name = "cpu-1year-commitment"
region = "us-central1"
type = "COMPUTE_OPTIMIZED"
plan = "TWELVE_MONTH"
resources {
type = "VCPU"
amount = 100 # 100 vCPUs committed
}
resources {
type = "MEMORY"
amount = 400 # 400 GB RAM committed
}
}
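The savings arithmetic behind a commitment is straightforward. A back-of-envelope sketch, using the 37% one-year discount from above with placeholder on-demand rates (the real rates vary by machine family and region):

```python
# Rough committed-use savings math. The vcpu/GB hourly rates below are
# placeholders; the 37% discount follows the one-year figure above.
def cud_savings(vcpus, memory_gb, vcpu_rate, gb_rate, discount):
    """Monthly savings for a commitment vs. on-demand (~730 hours/month)."""
    hours = 730
    on_demand = (vcpus * vcpu_rate + memory_gb * gb_rate) * hours
    return round(on_demand * discount, 2)

# 100 vCPUs + 400 GB at placeholder rates, one-year commitment
print(cud_savings(100, 400, vcpu_rate=0.03, gb_rate=0.004, discount=0.37))
```

The same function with `discount=0.57` models the three-year plan; the key input to get right is the committed amount, since unused commitment is billed regardless.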
# 3-year commitment for stable workloads (57% discount)
resource "google_compute_region_commitment" "cpu_commitment_3yr" {
name = "cpu-3year-commitment"
region = "us-central1"
type = "GENERAL_PURPOSE"
plan = "THIRTY_SIX_MONTH"
resources {
type = "VCPU"
amount = 50
}
resources {
type = "MEMORY"
amount = 200
}
}
Step 4: Right-Size GKE Clusters
# Enable cluster autoscaler with appropriate limits
resource "google_container_node_pool" "primary" {
name = "primary-pool"
cluster = google_container_cluster.main.name
location = "us-central1"
# Autoscaling based on actual demand
autoscaling {
min_node_count = 2 # Minimum for HA
max_node_count = 20 # Cap costs
location_policy = "BALANCED"
}
node_config {
# Use E2 instances for cost efficiency
machine_type = "e2-standard-4"
# Spot VMs for non-critical workloads (60-91% discount)
spot = true
# Only request what you need
labels = {
workload = "general"
}
}
management {
auto_repair = true
auto_upgrade = true
}
}
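Splitting the fleet between Spot and on-demand pools changes the blended node cost. A sketch with placeholder rates (Spot discounts actually vary by machine type and region within the 60-91% band):

```python
# Blended-cost effect of running part of the fleet on Spot VMs.
# The $0.13/hr node rate and 70% Spot discount are placeholder assumptions.
def blended_monthly_cost(on_demand_nodes, spot_nodes, node_rate, spot_discount=0.7):
    hours = 730
    on_demand = on_demand_nodes * node_rate * hours
    spot = spot_nodes * node_rate * (1 - spot_discount) * hours
    return round(on_demand + spot, 2)

# 13 on-demand nodes vs. 3 critical on-demand + 10 general Spot nodes
all_on_demand = blended_monthly_cost(13, 0, 0.13)
mixed = blended_monthly_cost(3, 10, 0.13)
print(all_on_demand, mixed)
```

The mixed layout mirrors the two node pools below: critical workloads stay on-demand, everything preemption-tolerant rides the discount.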
# Separate node pool for critical workloads (on-demand)
resource "google_container_node_pool" "critical" {
name = "critical-pool"
cluster = google_container_cluster.main.name
autoscaling {
min_node_count = 3
max_node_count = 10
}
node_config {
machine_type = "e2-standard-4"
spot = false # On-demand for reliability
taint {
key = "workload"
value = "critical"
effect = "NO_SCHEDULE"
}
}
}
Step 5: Optimize BigQuery Costs
-- Find expensive queries
SELECT
user_email,
query,
ROUND(total_bytes_processed / POW(10,12), 2) as tb_processed,
ROUND(total_bytes_processed / POW(10,12) * 5, 2) as cost_usd -- assumes $5/TB on-demand rate
FROM `region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND job_type = 'QUERY'
ORDER BY total_bytes_processed DESC
LIMIT 20;
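On-demand BigQuery pricing is per byte scanned, which is why a bytes-billed cap bounds the worst case. A small estimator mirroring the `cost_usd` column above (it assumes the same $5/TB on-demand rate; check your contract's actual rate):

```python
# Estimate query cost from bytes scanned, and check a maximumBytesBilled cap.
PRICE_PER_TB = 5.0  # assumed on-demand rate, USD per TB scanned

def query_cost(bytes_processed, price_per_tb=PRICE_PER_TB):
    return round(bytes_processed / 10**12 * price_per_tb, 2)

def exceeds_cap(bytes_processed, maximum_bytes_billed):
    """True if the query would be rejected by the maximumBytesBilled cap."""
    return bytes_processed > maximum_bytes_billed

full_scan = 3 * 10**12      # a 3 TB full-table scan
cap = 10 * 1024**3          # a 10 GB cap
print(query_cost(full_scan))       # → 15.0
print(exceeds_cap(full_scan, cap)) # → True
```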
-- Set maximum bytes billed per query (prevent runaway costs)
-- In BigQuery console or via API:
-- maximumBytesBilled = 10737418240 (10 GB)
# Consider slot-based (reservation) pricing for heavy usage
# Compare reservation cost against your current on-demand spend before committing
resource "google_bigquery_reservation" "default" {
name = "default-reservation"
location = "US"
slot_capacity = 100
# Autoscale slots based on demand
autoscale {
max_slots = 200
}
}
Step 6: Set Up Budgets and Alerts
resource "google_billing_budget" "project_budget" {
billing_account = var.billing_account_id
display_name = "Monthly Project Budget"
budget_filter {
projects = ["projects/${var.project_id}"]
}
amount {
specified_amount {
currency_code = "USD"
units = "60000" # $60,000 budget
}
}
threshold_rules {
threshold_percent = 0.5
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 0.8
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 1.0
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 1.0
spend_basis = "FORECASTED_SPEND"
}
all_updates_rule {
monitoring_notification_channels = [
google_monitoring_notification_channel.email.id,
google_monitoring_notification_channel.slack.id
]
disable_default_iam_recipients = false
}
}
Step 7: Implement Resource Labels for Cost Attribution
# Standard labels for all resources
locals {
standard_labels = {
environment = var.environment
team = var.team
service = var.service
cost_center = var.cost_center
}
}
resource "google_compute_instance" "app" {
# ...
labels = local.standard_labels
}
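Once resources carry these labels, attribution is just a group-by over billing export rows. A sketch with hypothetical stand-in rows for the BigQuery export:

```python
# Group exported billing rows by a label key; untagged spend surfaces
# explicitly instead of disappearing. Rows are hypothetical.
from collections import defaultdict

def cost_by_label(rows, key="team"):
    totals = defaultdict(float)
    for row in rows:
        owner = row["labels"].get(key, "unlabeled")
        totals[owner] += row["cost"]
    return dict(sorted(totals.items(), key=lambda kv: -kv[1]))

rows = [
    {"cost": 1200.0, "labels": {"team": "search", "environment": "prod"}},
    {"cost": 300.0,  "labels": {"team": "ads"}},
    {"cost": 450.0,  "labels": {"team": "search"}},
    {"cost": 80.0,   "labels": {}},
]
print(cost_by_label(rows))  # → {'search': 1650.0, 'ads': 300.0, 'unlabeled': 80.0}
```

Tracking the "unlabeled" bucket over time is a useful measure of how well the labeling standard is actually being followed.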
resource "google_storage_bucket" "data" {
# ...
labels = local.standard_labels
}
-- Query costs by team
SELECT
labels.value as team,
SUM(cost) as total_cost
FROM `billing_export.gcp_billing_export_v1_*`
CROSS JOIN UNNEST(labels) as labels
WHERE labels.key = 'team'
AND _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1
ORDER BY 2 DESC;
Step 8: Automate Cost Optimization
# Cloud Function to stop dev instances at night
# functions/auto-stop/main.py
from google.cloud import compute_v1

def stop_dev_instances(event, context):
    """Stop all instances labeled environment=dev."""
    client = compute_v1.InstancesClient()
    project = 'my-project'
    # Iterate every zone in the project
    zones_client = compute_v1.ZonesClient()
    zones = zones_client.list(project=project)
    for zone in zones:
        instances = client.list(project=project, zone=zone.name)
        for instance in instances:
            labels = instance.labels or {}
            if labels.get('environment') == 'dev' and instance.status == 'RUNNING':
                print(f"Stopping {instance.name} in {zone.name}")
                client.stop(project=project, zone=zone.name, instance=instance.name)
Cost Optimization Summary
| Category | Action | Estimated Savings |
|---|---|---|
| Compute | Right-size VMs | 20-40% |
| Compute | Committed use (1yr) | 37% |
| Compute | Spot VMs (non-critical) | 60-91% |
| GKE | Cluster autoscaler | 30-50% |
| BigQuery | Partitioning/clustering | 50-90% |
| Storage | Lifecycle policies | 40-70% |
| All | Delete unused resources | 10-20% |
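The percentages above apply to different slices of spend, so they don't simply add up. A sketch of combining them over disjoint slices of the (scenario's) $25,000 Compute Engine bill, with hypothetical slice sizes:

```python
# Combine savings measures applied to disjoint slices of a base bill.
# Slice fractions and discounts below are illustrative assumptions.
def apply_savings(base, steps):
    """steps: list of (fraction_of_base_affected, discount); fractions must not overlap."""
    cost = base
    for fraction, discount in steps:
        cost -= base * fraction * discount
    return round(cost, 2)

compute_bill = 25000
steps = [
    (0.3, 0.30),  # right-size 30% of the fleet, saving ~30% on that slice
    (0.4, 0.37),  # 1-year CUD covering the stable 40%
    (0.2, 0.70),  # move 20% of workloads to Spot
]
print(apply_savings(compute_bill, steps))
```

Modeling savings this way keeps the plan honest when reporting projected reductions to finance.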
Cost Optimization Checklist
| Check | Tool | Frequency |
|---|---|---|
| Idle VMs | Recommender | Weekly |
| Oversized VMs | Recommender | Weekly |
| Unattached disks | gcloud compute | Weekly |
| Expensive queries | INFORMATION_SCHEMA | Daily |
| Budget status | Billing console | Daily |
| Commitment coverage | Billing reports | Monthly |
Cost Governance Model
┌─────────────────┐
│ FinOps Team │
│ (Centralized) │
└────────┬────────┘
│
┌──────────────┼──────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Team A │ │ Team B │ │ Team C │
│ Budget │ │ Budget │ │ Budget │
└──────────┘ └──────────┘ └──────────┘
Practice Question
Why are Committed Use Discounts more cost-effective than Sustained Use Discounts for predictable workloads?
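A back-of-envelope comparison for reasoning about the answer. The 30% figure below is the approximate maximum sustained-use discount (an instance running the entire month on eligible machine types); committed use applies its discount to the committed amount regardless of usage pattern:

```python
# Compare best-case sustained-use vs. 1-year committed-use on the same spend.
# The 30% SUD ceiling and hypothetical $10,000 spend are assumptions.
def monthly_cost(on_demand, discount):
    return round(on_demand * (1 - discount), 2)

on_demand = 10000  # hypothetical monthly on-demand spend
print(monthly_cost(on_demand, 0.30))  # best-case sustained use
print(monthly_cost(on_demand, 0.37))  # 1-year committed use
```

Even against the SUD best case, the committed rate is lower, and SUD's best case only materializes for instances that run continuously all month.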