Question
Implement a CI/CD pipeline with Cloud Build that deploys to GKE with canary releases.
The Scenario
Your team deploys manually to GKE:
- Deployments take 2 hours of engineer time
- No consistent testing before deployment
- Rollbacks are painful and slow
- A bad deployment last month caused 4 hours of downtime
You need an automated pipeline with safety guardrails.
The Challenge
Design and implement a CI/CD pipeline using Cloud Build that builds, tests, and deploys to GKE with canary releases. Include automated rollback on failure.
A junior engineer might deploy straight to production without a staging environment, skip tests to save time, run `kubectl apply` with no health checks, or hand-roll a canary by manually adjusting replica counts with no automated validation. These approaches risk production outages, let bugs reach users, and make rollbacks slow and error-prone.
A senior engineer implements a multi-stage pipeline: build and test, push to Artifact Registry, deploy to staging, run integration tests, canary deployment to production with automated metrics validation, then gradual rollout or automatic rollback.
Pipeline Architecture
```
┌─────────────────────────────────────────────────────────┐
│                  Cloud Build Pipeline                   │
├─────────────────────────────────────────────────────────┤
│   PR Trigger              Push to Main                  │
│       │                        │                        │
│       ▼                        ▼                        │
│  ┌────────┐               ┌────────┐                    │
│  │ Lint   │               │ Build  │                    │
│  │ Test   │               │ Test   │                    │
│  └────────┘               └────┬───┘                    │
│                                │                        │
│                                ▼                        │
│                         ┌─────────────┐                 │
│                         │  Push to    │                 │
│                         │  Artifact   │                 │
│                         │  Registry   │                 │
│                         └──────┬──────┘                 │
│                                │                        │
│                    ┌───────────┴───────────┐            │
│                    │                       │            │
│                    ▼                       ▼            │
│              ┌───────────┐          ┌─────────────┐     │
│              │  Deploy   │          │   Deploy    │     │
│              │  Staging  │          │   Canary    │     │
│              └─────┬─────┘          │   (10%)     │     │
│                    │                └──────┬──────┘     │
│                    ▼                       │            │
│              ┌───────────┐                 │            │
│              │Integration│     ┌───────────┤            │
│              │   Tests   │     │           │            │
│              └───────────┘     ▼           ▼            │
│                         ┌─────────────┐ ┌──────────┐    │
│                         │   Metrics   │ │ Rollback │    │
│                         │ Validation  │ │ (if bad) │    │
│                         └──────┬──────┘ └──────────┘    │
│                                │                        │
│                                ▼                        │
│                         ┌─────────────┐                 │
│                         │    Full     │                 │
│                         │   Rollout   │                 │
│                         └─────────────┘                 │
└─────────────────────────────────────────────────────────┘
```

Step 1: Cloud Build Configuration
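Before wiring the config below into a trigger, it helps to sanity-check the small pieces of shell the pipeline relies on. The vulnerability gate in the scan step, for instance, just counts CRITICAL findings and fails the build if any exist. A minimal sketch with canned scanner output (the one-severity-per-line shape is an assumption for illustration):

```shell
# Count CRITICAL findings and trip the gate if any are present.
# SEVERITIES is canned sample data, not output from a real scan.
SEVERITIES="HIGH
CRITICAL
MEDIUM
CRITICAL"
CRITICAL=$(printf '%s\n' "$SEVERITIES" | grep -c '^CRITICAL$')
echo "critical findings: $CRITICAL"
if [ "$CRITICAL" -gt 0 ]; then
  echo "Critical vulnerabilities found!"
fi
```

In the pipeline, the severity list would come from the scanner's output rather than a literal; the counting and gating logic is the same.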
```yaml
# cloudbuild.yaml
steps:
  # Step 1: Run linting and unit tests
  - id: 'test'
    name: 'node:18'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        npm ci
        npm run lint
        npm run test:unit

  # Step 2: Build the Docker image
  - id: 'build'
    name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '-t'
      - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA}'
      - '-t'
      - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:latest'
      - '--cache-from'
      - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:latest'
      - '.'

  # Step 3: Push to Artifact Registry
  - id: 'push'
    name: 'gcr.io/cloud-builders/docker'
    args:
      - 'push'
      - '--all-tags'
      - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}'

  # Step 4: Scan the pushed image (On-Demand Scanning API)
  - id: 'scan'
    name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        # $$ escapes bash variables from Cloud Build substitution.
        # The scan returns a scan resource name; list its findings and
        # fail the build if any are CRITICAL.
        SCAN=$(gcloud artifacts docker images scan \
          ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA} \
          --remote --format='value(response.scan)')
        CRITICAL=$(gcloud artifacts docker images list-vulnerabilities "$$SCAN" \
          --format='value(vulnerability.effectiveSeverity)' | grep -c 'CRITICAL' || true)
        if [ "$$CRITICAL" -gt 0 ]; then
          echo "Critical vulnerabilities found!"
          exit 1
        fi

  # Step 5: Deploy to staging
  - id: 'deploy-staging'
    name: 'gcr.io/cloud-builders/gke-deploy'
    args:
      - 'run'
      - '--filename=k8s/'
      - '--image=${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA}'
      - '--cluster=${_STAGING_CLUSTER}'
      - '--location=${_REGION}'
      - '--namespace=staging'

  # Step 6: Wait for the staging rollout to finish
  - id: 'wait-staging'
    name: 'gcr.io/cloud-builders/kubectl'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        gcloud container clusters get-credentials ${_STAGING_CLUSTER} --region=${_REGION}
        kubectl rollout status deployment/${_SERVICE} -n staging --timeout=300s

  # Step 7: Run integration tests against staging
  # (node:18, not the gcloud builder, so npm is available)
  - id: 'integration-tests'
    name: 'node:18'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        npm ci
        npm run test:integration -- --baseUrl=https://staging.example.com

  # Step 8: Deploy canary to production
  - id: 'deploy-canary'
    name: 'gcr.io/cloud-builders/gke-deploy'
    args:
      - 'run'
      - '--filename=k8s/canary/'
      - '--image=${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA}'
      - '--cluster=${_PROD_CLUSTER}'
      - '--location=${_REGION}'
      - '--namespace=production'

  # Step 9: Validate canary metrics
  - id: 'validate-canary'
    name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        # Let the canary serve real traffic before judging it
        sleep 300
        # Query the mean canary error rate over the last 5 minutes from the
        # Cloud Monitoring API (the app is assumed to export this custom metric)
        START=$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
        END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
        ERROR_RATE=$(curl -s -G \
          -H "Authorization: Bearer $(gcloud auth print-access-token)" \
          "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
          --data-urlencode 'filter=metric.type="custom.googleapis.com/http/error_rate" AND metric.labels.version="canary"' \
          --data-urlencode "interval.startTime=$$START" \
          --data-urlencode "interval.endTime=$$END" \
          | python3 -c 'import sys, json; p = [pt["value"]["doubleValue"] for ts in json.load(sys.stdin).get("timeSeries", []) for pt in ts["points"]]; print(sum(p) / len(p) if p else 0.0)')
        if python3 -c "import sys; sys.exit(0 if $$ERROR_RATE > 0.01 else 1)"; then
          echo "Canary error rate too high: $$ERROR_RATE"
          exit 1
        fi
        echo "Canary validation passed"

  # Step 10: Full production rollout
  - id: 'deploy-production'
    name: 'gcr.io/cloud-builders/gke-deploy'
    args:
      - 'run'
      - '--filename=k8s/production/'
      - '--image=${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA}'
      - '--cluster=${_PROD_CLUSTER}'
      - '--location=${_REGION}'
      - '--namespace=production'

  # Step 11: Remove the canary now that stable runs the same image
  - id: 'cleanup-canary'
    name: 'gcr.io/cloud-builders/kubectl'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        gcloud container clusters get-credentials ${_PROD_CLUSTER} --region=${_REGION}
        kubectl delete deployment ${_SERVICE}-canary -n production --ignore-not-found

substitutions:
  _REGION: us-central1
  _REPO: app-images
  _SERVICE: api-server
  _STAGING_CLUSTER: staging-cluster
  _PROD_CLUSTER: prod-cluster

options:
  machineType: 'E2_HIGHCPU_8'
  logging: CLOUD_LOGGING_ONLY

# Build timeout is a top-level field, not part of options
timeout: '1800s'
```

Step 2: Canary Deployment Manifests
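One detail worth calling out before the manifests: the Service selects only `app: api-server`, so the canary's traffic share is set by replica counts, not by a percentage knob. With 1 canary pod alongside 5 stable pods, the canary actually receives about one sixth of traffic, a bit more than the 10% shown in the diagram. The arithmetic, as a plain-shell sketch:

```shell
# Effective canary traffic share under a shared Service that
# load-balances evenly across ready pods (illustrative arithmetic only).
CANARY_REPLICAS=1
STABLE_REPLICAS=5
SHARE=$(( 100 * CANARY_REPLICAS / (CANARY_REPLICAS + STABLE_REPLICAS) ))
echo "canary receives ~${SHARE}% of traffic"
```

To hit 10% exactly you would need 1 canary replica per 9 stable replicas, or a mesh/ingress that splits traffic by weight rather than by pod count.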
```yaml
# k8s/canary/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-canary
  labels:
    app: api-server
    version: canary
spec:
  replicas: 1  # One canary pod next to five stable pods, about 1/6 of traffic
  selector:
    matchLabels:
      app: api-server
      version: canary
  template:
    metadata:
      labels:
        app: api-server
        version: canary
    spec:
      containers:
        - name: api
          image: IMAGE_PLACEHOLDER  # Replaced by Cloud Build
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
---
# k8s/canary/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  # No version label in the selector, so traffic reaches both stable and canary
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
```

Step 3: Production Deployment with Rolling Update
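The rolling-update parameters below bound how many pods exist at any moment during a rollout: with `replicas: 5`, `maxUnavailable: 1` keeps at least 4 pods serving while `maxSurge: 2` allows at most 7 pods total. A sketch of that envelope:

```shell
# Pod-count envelope during a RollingUpdate (values mirror the manifest:
# replicas=5, maxUnavailable=1, maxSurge=2).
REPLICAS=5
MAX_UNAVAILABLE=1
MAX_SURGE=2
MIN_READY=$(( REPLICAS - MAX_UNAVAILABLE ))
MAX_TOTAL=$(( REPLICAS + MAX_SURGE ))
echo "at least ${MIN_READY} pods ready, at most ${MAX_TOTAL} pods total"
```

Tightening `maxUnavailable` to 0 trades rollout speed for zero capacity loss; raising `maxSurge` speeds rollouts at the cost of temporary extra resource usage.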
```yaml
# k8s/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  labels:
    app: api-server
    version: stable
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
      version: stable
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  template:
    metadata:
      labels:
        app: api-server
        version: stable
    spec:
      containers:
        - name: api
          image: IMAGE_PLACEHOLDER
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
          # Same probes as the canary, so the rolling update gates on readiness
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
```

Step 4: Cloud Build Trigger Configuration
```hcl
# Terraform configuration for the Cloud Build triggers
resource "google_cloudbuild_trigger" "pr_trigger" {
  name        = "pr-validation"
  description = "Run tests on pull requests"

  github {
    owner = "myorg"
    name  = "myrepo"
    pull_request {
      branch = "^main$"
    }
  }

  filename = "cloudbuild-pr.yaml"
}

resource "google_cloudbuild_trigger" "deploy_trigger" {
  name        = "deploy-to-production"
  description = "Build and deploy on push to main"

  github {
    owner = "myorg"
    name  = "myrepo"
    push {
      branch = "^main$"
    }
  }

  filename = "cloudbuild.yaml"

  # Require manual approval before production deployments run
  approval_config {
    approval_required = true
  }
}
```

Step 5: Automatic Rollback Script
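A note on finding the "previous" revision: `kubectl rollout undo` without `--to-revision` already rolls back to the revision immediately before the current one. If the revision number is needed explicitly (for audit logs, say), extracting it from `kubectl rollout history` output must skip the header line and account for the current revision being listed last. A sketch against canned output (the column layout mirrors kubectl's):

```shell
# Extract the previous revision number from sample `kubectl rollout history`
# output: the second-to-last line's first column (5 is current, 4 is previous).
HISTORY="REVISION  CHANGE-CAUSE
3         <none>
4         <none>
5         <none>"
PREV=$(printf '%s\n' "$HISTORY" | tail -2 | head -1 | awk '{print $1}')
echo "previous revision: $PREV"
```

Parsing human-readable kubectl output is fragile; prefer the flag-free `kubectl rollout undo` when you only need the rollback itself.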
```yaml
# cloudbuild-rollback.yaml
steps:
  - id: 'rollback'
    name: 'gcr.io/cloud-builders/kubectl'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        gcloud container clusters get-credentials ${_CLUSTER} --region=${_REGION}
        # Without --to-revision, rollout undo targets the revision
        # immediately before the current one
        kubectl rollout undo deployment/${_SERVICE} -n production
        # Wait for the rollback to complete
        kubectl rollout status deployment/${_SERVICE} -n production --timeout=300s
        # Delete the canary so it stops serving traffic
        kubectl delete deployment ${_SERVICE}-canary -n production --ignore-not-found

substitutions:
  _SERVICE: api-server
  _CLUSTER: prod-cluster
  _REGION: us-central1
```

Step 6: Service Account Permissions
```hcl
resource "google_service_account" "cloudbuild" {
  account_id   = "cloudbuild-deployer"
  display_name = "Cloud Build Deployer"
}

# GKE access
resource "google_project_iam_member" "cloudbuild_gke" {
  project = var.project
  role    = "roles/container.developer"
  member  = "serviceAccount:${google_service_account.cloudbuild.email}"
}

# Artifact Registry access
resource "google_project_iam_member" "cloudbuild_artifact" {
  project = var.project
  role    = "roles/artifactregistry.writer"
  member  = "serviceAccount:${google_service_account.cloudbuild.email}"
}

# Logging access
resource "google_project_iam_member" "cloudbuild_logs" {
  project = var.project
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.cloudbuild.email}"
}
```

Deployment Strategy Comparison
| Strategy | Risk | Rollback Speed | Complexity |
|---|---|---|---|
| Big Bang | High | Slow (redeploy) | Low |
| Rolling Update | Medium | Medium (rollback) | Low |
| Blue-Green | Low | Fast (switch) | Medium |
| Canary | Very Low | Fast (delete canary) | High |
Pipeline Best Practices
- Immutable images - tag with the commit SHA, never `latest`
- Run tests before deploying - unit, integration, and security scans
- Deploy to staging first - catch issues before they reach production
- Use canary deployments - validate changes with real traffic
- Automate rollbacks - don't rely on manual intervention
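The first practice in action: Cloud Build's `SHORT_SHA` is the first seven characters of the commit SHA, which gives every image a unique, reproducible tag. A sketch of the derivation (the SHA and registry path are placeholders):

```shell
# Derive an immutable image tag the way Cloud Build's SHORT_SHA does:
# the first 7 characters of the commit SHA. All values are placeholders.
COMMIT_SHA="9b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c"
SHORT_SHA="${COMMIT_SHA:0:7}"
IMAGE="us-central1-docker.pkg.dev/my-project/app-images/api-server:${SHORT_SHA}"
echo "$IMAGE"
```

Because the tag encodes the exact commit, rolling back is a matter of redeploying a previous tag, and two deployments of the same tag are guaranteed to run the same code.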
Practice Question
Why should canary deployments validate metrics rather than just checking if pods are healthy?