Implement a CI/CD pipeline with Cloud Build that deploys to GKE with canary releases.

The Scenario

Your team deploys manually to GKE:

  • Deployments take 2 hours of engineer time
  • No consistent testing before deployment
  • Rollbacks are painful and slow
  • A bad deployment last month caused 4 hours of downtime

You need an automated pipeline with safety guardrails.

The Challenge

Design and implement a CI/CD pipeline using Cloud Build that builds, tests, and deploys to GKE with canary releases. Include automated rollback on failure.

Wrong Approach

A junior engineer might deploy straight to production with no staging environment, skip tests to speed up deployment, run kubectl apply without health checks, or manage a canary by hand, adjusting replica counts with no metrics gate. These shortcuts risk production outages, let bugs through, and make rollbacks difficult.

Right Approach

A senior engineer implements a multi-stage pipeline: build and test, push to Artifact Registry, deploy to staging, run integration tests, deploy a canary to production with automated metrics validation, then either roll out fully or roll back automatically.

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        Cloud Build Pipeline                      │
├─────────────────────────────────────────────────────────────────┤
│  PR Trigger                    Push to Main                      │
│      │                              │                            │
│      ▼                              ▼                            │
│  ┌────────┐                    ┌────────┐                       │
│  │  Lint  │                    │ Build  │                       │
│  │  Test  │                    │  Test  │                       │
│  └────────┘                    └───┬────┘                       │
│                                    │                            │
│                                    ▼                            │
│                              ┌─────────────┐                    │
│                              │Push to      │                    │
│                              │Artifact     │                    │
│                              │Registry     │                    │
│                              └──────┬──────┘                    │
│                                     │                            │
│                    ┌────────────────┴────────────────┐          │
│                    │                                 │          │
│                    ▼                                 ▼          │
│              ┌───────────┐                   ┌─────────────┐   │
│              │  Deploy   │                   │   Deploy    │   │
│              │  Staging  │                   │   Canary    │   │
│              └─────┬─────┘                   │   (10%)     │   │
│                    │                         └──────┬──────┘   │
│                    ▼                                │          │
│              ┌───────────┐                         │          │
│              │Integration│           ┌─────────────┤          │
│              │   Tests   │           │             │          │
│              └───────────┘           ▼             ▼          │
│                              ┌─────────────┐ ┌──────────┐    │
│                              │   Metrics   │ │ Rollback │    │
│                              │ Validation  │ │ (if bad) │    │
│                              └──────┬──────┘ └──────────┘    │
│                                     │                         │
│                                     ▼                         │
│                              ┌─────────────┐                  │
│                              │   Full      │                  │
│                              │  Rollout    │                  │
│                              └─────────────┘                  │
└─────────────────────────────────────────────────────────────────┘

Step 1: Cloud Build Configuration

# cloudbuild.yaml
steps:
  # Step 1: Run linting and unit tests
  - id: 'test'
    name: 'node:18'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        npm ci
        npm run lint
        npm run test:unit

  # Step 2: Build Docker image
  - id: 'build'
    name: 'gcr.io/cloud-builders/docker'
    args:
      - 'build'
      - '-t'
      - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA}'
      - '-t'
      - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:latest'
      - '--cache-from'
      - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:latest'
      - '.'

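  # Step 3: Run security scan (requires the On-Demand Scanning API)
  - id: 'scan'
    name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        # Scan the locally built image; the command returns a scan resource name.
        # Shell variables are written with a doubled dollar sign so Cloud Build
        # does not treat them as substitutions.
        SCAN=$(gcloud artifacts docker images scan \
          ${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA} \
          --format='value(response.scan)')

        # Fail if critical vulnerabilities found
        CRITICAL=$(gcloud artifacts docker images list-vulnerabilities "$$SCAN" \
          --format='value(vulnerability.effectiveSeverity)' | grep -cx 'CRITICAL' || true)
        if [ "$$CRITICAL" -gt "0" ]; then
          echo "Critical vulnerabilities found!"
          exit 1
        fi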

  # Step 4: Push to Artifact Registry
  - id: 'push'
    name: 'gcr.io/cloud-builders/docker'
    args:
      - 'push'
      - '--all-tags'
      - '${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}'

  # Step 5: Deploy to staging
  - id: 'deploy-staging'
    name: 'gcr.io/cloud-builders/gke-deploy'
    args:
      - 'run'
      - '--filename=k8s/'
      - '--image=${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA}'
      - '--cluster=${_STAGING_CLUSTER}'
      - '--location=${_REGION}'
      - '--namespace=staging'

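  # Step 6: Run integration tests against staging
  # Wait for deployment to be ready (the kubectl builder fetches cluster
  # credentials from the CLOUDSDK_* variables below)
  - id: 'wait-staging'
    name: 'gcr.io/cloud-builders/kubectl'
    args: ['rollout', 'status', 'deployment/${_SERVICE}', '-n', 'staging', '--timeout=300s']
    env:
      - 'CLOUDSDK_COMPUTE_REGION=${_REGION}'
      - 'CLOUDSDK_CONTAINER_CLUSTER=${_STAGING_CLUSTER}'

  # Run integration tests in a Node image, since the gcloud builder has no npm;
  # node_modules persists in /workspace from the 'test' step
  - id: 'integration-tests'
    name: 'node:18'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        npm run test:integration -- --baseUrl=https://staging.example.com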

  # Step 7: Deploy canary to production (10%)
  - id: 'deploy-canary'
    name: 'gcr.io/cloud-builders/gke-deploy'
    args:
      - 'run'
      - '--filename=k8s/canary/'
      - '--image=${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA}'
      - '--cluster=${_PROD_CLUSTER}'
      - '--location=${_REGION}'
      - '--namespace=production'

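  # Step 8: Validate canary metrics
  - id: 'validate-canary'
    name: 'gcr.io/cloud-builders/gcloud'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        # Wait for canary to receive traffic
        sleep 300

        # Read the canary error rate for the last 5 minutes from the Cloud
        # Monitoring API (metrics-scopes list does not return time series).
        # Assumes curl and jq exist in the step image and that the custom
        # metric carries a "version" label. Shell variables use a doubled
        # dollar sign so Cloud Build does not treat them as substitutions.
        START=$(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
        END=$(date -u +%Y-%m-%dT%H:%M:%SZ)
        ERROR_RATE=$(curl -sf -G \
          -H "Authorization: Bearer $(gcloud auth print-access-token)" \
          --data-urlencode 'filter=metric.type="custom.googleapis.com/http/error_rate" AND metric.labels.version="canary"' \
          --data-urlencode "interval.startTime=$$START" \
          --data-urlencode "interval.endTime=$$END" \
          "https://monitoring.googleapis.com/v3/projects/${PROJECT_ID}/timeSeries" \
          | jq -r '.timeSeries[0].points[0].value.doubleValue // empty')

        # Fail closed if no data came back or the rate exceeds 1%
        if [ -z "$$ERROR_RATE" ] || awk -v r="$$ERROR_RATE" 'BEGIN { exit !(r > 0.01) }'; then
          echo "Canary error rate too high or missing: $$ERROR_RATE"
          exit 1
        fi

        echo "Canary validation passed"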

  # Step 9: Full production rollout
  - id: 'deploy-production'
    name: 'gcr.io/cloud-builders/gke-deploy'
    args:
      - 'run'
      - '--filename=k8s/production/'
      - '--image=${_REGION}-docker.pkg.dev/${PROJECT_ID}/${_REPO}/${_SERVICE}:${SHORT_SHA}'
      - '--cluster=${_PROD_CLUSTER}'
      - '--location=${_REGION}'
      - '--namespace=production'

substitutions:
  _REGION: us-central1
  _REPO: app-images
  _SERVICE: api-server
  _STAGING_CLUSTER: staging-cluster
  _PROD_CLUSTER: prod-cluster

options:
  machineType: 'E2_HIGHCPU_8'
  logging: CLOUD_LOGGING_ONLY

timeout: '1800s'

Step 2: Canary Deployment Manifests

# k8s/canary/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-canary
  labels:
    app: api-server
    version: canary
spec:
  replicas: 1  # Small canary: 1 of 6 pods behind the shared Service, roughly 17% of traffic
  selector:
    matchLabels:
      app: api-server
      version: canary
  template:
    metadata:
      labels:
        app: api-server
        version: canary
    spec:
      containers:
      - name: api
        image: us-central1-docker.pkg.dev/PROJECT_ID/app-images/api-server  # Replace PROJECT_ID; gke-deploy matches --image by name and pins the built digest
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
---
# k8s/canary/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server
    # No version selector - routes to both stable and canary
  ports:
  - port: 80
    targetPort: 8080
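
Because the Service selects on app: api-server only, requests are spread across every ready pod, so the canary's traffic share follows from the replica counts. A minimal sketch of that arithmetic, assuming the Service balances evenly across pods (real traffic can deviate with connection reuse); the function name canary_share is illustrative and not part of any tool:

```shell
#!/usr/bin/env bash
# Approximate percentage of requests that reach the canary, assuming the
# Service balances evenly across all ready pods (an assumption).
canary_share() {
  local canary=$1 stable=$2
  awk -v c="$canary" -v s="$stable" 'BEGIN { printf "%.1f\n", 100 * c / (c + s) }'
}

canary_share 1 5   # the manifests in this guide: 1 canary + 5 stable
canary_share 1 9   # 9 stable replicas bring the canary close to 10%
```

With the replica counts used here the canary receives about 17% of requests rather than 10%. Hitting an exact percentage regardless of replica count needs weighted routing from a service mesh or gateway, which this guide does not cover.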

Step 3: Production Deployment with Rolling Update

# k8s/production/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  labels:
    app: api-server
    version: stable
spec:
  replicas: 5
  selector:
    matchLabels:
      app: api-server
      version: stable
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  template:
    metadata:
      labels:
        app: api-server
        version: stable
    spec:
      containers:
      - name: api
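        image: us-central1-docker.pkg.dev/PROJECT_ID/app-images/api-server  # Replace PROJECT_ID; gke-deploy matches --image by name and pins the built digest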
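        ports:
        - containerPort: 8080
        # Same readiness check as the canary, so the rolling update only
        # sends traffic to pods that report healthy
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5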
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
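
The rollingUpdate settings bound how many pods exist during a rollout. A small sketch of those bounds for absolute (non-percentage) values; rollout_bounds is a hypothetical helper name used only for illustration:

```shell
#!/usr/bin/env bash
# Pod-count bounds during a RollingUpdate when maxUnavailable and maxSurge
# are given as absolute numbers (percentages are not handled here).
rollout_bounds() {
  local replicas=$1 max_unavailable=$2 max_surge=$3
  echo "min_available=$((replicas - max_unavailable)) max_total=$((replicas + max_surge))"
}

rollout_bounds 5 1 2   # values from k8s/production/deployment.yaml
```

For the manifest above, at least 4 pods stay available and at most 7 exist at once, so the namespace needs quota headroom for two extra pods during a rollout.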

Step 4: Cloud Build Trigger Configuration

# Terraform configuration for triggers
resource "google_cloudbuild_trigger" "pr_trigger" {
  name        = "pr-validation"
  description = "Run tests on pull requests"

  github {
    owner = "myorg"
    name  = "myrepo"

    pull_request {
      branch = "^main$"
    }
  }

  filename = "cloudbuild-pr.yaml"
}

resource "google_cloudbuild_trigger" "deploy_trigger" {
  name        = "deploy-to-production"
  description = "Build and deploy on push to main"

  github {
    owner = "myorg"
    name  = "myrepo"

    push {
      branch = "^main$"
    }
  }

  filename = "cloudbuild.yaml"

  # Required approvals for production
  approval_config {
    approval_required = true
  }
}

Step 5: Automatic Rollback Script

# cloudbuild-rollback.yaml
steps:
  - id: 'rollback'
    name: 'gcr.io/cloud-builders/kubectl'
    entrypoint: 'bash'
    args:
      - '-c'
      - |
        gcloud container clusters get-credentials ${_CLUSTER} --region=${_REGION}

        # Roll back to the previous revision, which is the default target
        # of rollout undo when no --to-revision is given
        kubectl rollout undo deployment/${_SERVICE} -n production

        # Wait for rollback
        kubectl rollout status deployment/${_SERVICE} -n production --timeout=300s

        # Delete canary
        kubectl delete deployment ${_SERVICE}-canary -n production --ignore-not-found

substitutions:
  _SERVICE: api-server
  _CLUSTER: prod-cluster
  _REGION: us-central1
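
The rollback config only helps if something runs it when validation fails. One possible way to wire the two together is a wrapper that runs a validation command and, on failure, runs a rollback command while still reporting the failure. This is a sketch under assumptions: validate_or_rollback and ./validate-canary.sh are hypothetical names, and the commented gcloud call assumes the rollback config needs no source upload.

```shell
#!/usr/bin/env bash
# Run a validation command; if it fails, run a rollback command and return
# non-zero so the calling pipeline is still marked as failed.
validate_or_rollback() {
  local validate_cmd=$1 rollback_cmd=$2
  if bash -c "$validate_cmd"; then
    echo "validation passed"
    return 0
  fi
  echo "validation failed, rolling back"
  bash -c "$rollback_cmd"
  return 1
}

# Hypothetical usage (script name is an assumption):
# validate_or_rollback "./validate-canary.sh" \
#   "gcloud builds submit --no-source --config=cloudbuild-rollback.yaml"
```

Returning non-zero after the rollback keeps the failed release visible in build history instead of hiding it behind a successful cleanup.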

Step 6: Service Account Permissions

resource "google_service_account" "cloudbuild" {
  account_id   = "cloudbuild-deployer"
  display_name = "Cloud Build Deployer"
}

# GKE access
resource "google_project_iam_member" "cloudbuild_gke" {
  project = var.project
  role    = "roles/container.developer"
  member  = "serviceAccount:${google_service_account.cloudbuild.email}"
}

# Artifact Registry access
resource "google_project_iam_member" "cloudbuild_artifact" {
  project = var.project
  role    = "roles/artifactregistry.writer"
  member  = "serviceAccount:${google_service_account.cloudbuild.email}"
}

# Logging access
resource "google_project_iam_member" "cloudbuild_logs" {
  project = var.project
  role    = "roles/logging.logWriter"
  member  = "serviceAccount:${google_service_account.cloudbuild.email}"
}

Deployment Strategy Comparison

Strategy         Risk       Rollback Speed         Complexity
Big Bang         High       Slow (redeploy)        Low
Rolling Update   Medium     Medium (rollback)      Low
Blue-Green       Low        Fast (switch)          Medium
Canary           Very Low   Fast (delete canary)   High

Pipeline Best Practices

  1. Immutable images - Deploy by commit SHA tag, never by ‘latest’
  2. Run tests before deploy - Unit, integration, security
  3. Deploy to staging first - Catch issues before production
  4. Use canary deployments - Validate with real traffic
  5. Automate rollbacks - Don’t rely on manual intervention
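
Practices 4 and 5 both depend on a gate that judges the canary by real request outcomes. A minimal sketch of such a gate, assuming request and error counts have already been fetched from monitoring; canary_gate and its argument order are illustrative, and the 1% threshold matches the pipeline above:

```shell
#!/usr/bin/env bash
# Decide whether a canary passes from observed traffic.
# Arguments: total requests, failed requests, max allowed error rate (fraction).
canary_gate() {
  local total=$1 errors=$2 threshold=$3
  awk -v t="$total" -v e="$errors" -v max="$threshold" 'BEGIN {
    if (t == 0) { print "no-traffic"; exit 1 }   # fail closed without data
    if (e / t > max) { print "fail"; exit 1 }
    print "pass"
  }'
}

canary_gate 2000 100 0.01 || true   # 5% errors: fails even if every pod is Ready
canary_gate 2000 10 0.01            # 0.5% errors: passes
```

The zero-traffic case fails on purpose: a canary that received no requests has not been validated.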

Practice Question

Why should canary deployments validate metrics rather than just checking if pods are healthy?