
A critical deployment workflow is failing intermittently. Debug and fix the issue.


The Scenario

Your production deployment workflow has been failing intermittently for the past week:

# Error from workflow run
Run kubectl apply -f k8s/
error: unable to recognize "k8s/deployment.yaml": Get "https://api.eks.us-east-1.amazonaws.com": dial tcp: lookup api.eks.us-east-1.amazonaws.com: no such host

Error: Process completed with exit code 1.

The workflow worked fine last month. Sometimes it passes, sometimes it fails. Developers are re-running workflows multiple times hoping they’ll succeed.

The Challenge

Debug why the workflow fails intermittently, identify the root cause, and implement robust fixes with proper error handling.

Wrong Approach

A junior engineer might just re-run the workflow hoping it passes, add arbitrary sleep commands, or blame GitHub's infrastructure. These approaches waste CI minutes, don't solve the underlying issue, and create unreliable deployments.

Right Approach

A senior engineer debugs systematically: examining the workflow logs, checking the runner environment, distinguishing transient from persistent failures, and implementing retry logic with exponential backoff for network-dependent operations.

Step 1: Enable Debug Logging

# Re-run workflow with debug logging enabled
# Repository Settings > Secrets and variables > Actions > New repository secret
# ACTIONS_RUNNER_DEBUG = true
# ACTIONS_STEP_DEBUG = true

# Or trigger with debug enabled
name: Deploy
on:
  workflow_dispatch:
    inputs:
      debug_enabled:
        description: 'Enable debug logging'
        required: false
        default: 'false'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Debug info
        if: ${{ inputs.debug_enabled == 'true' }}
        run: |
          echo "Runner: ${{ runner.name }}"
          echo "OS: ${{ runner.os }}"
          cat /etc/resolv.conf
          nslookup api.eks.us-east-1.amazonaws.com || true
          curl -v https://api.eks.us-east-1.amazonaws.com || true
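
If you use the GitHub CLI, debug logging can also be enabled when re-running a failed run. A quick sketch, assuming a reasonably recent gh version and a workflow file named deploy.yml (the run ID is a placeholder):

# Find the most recent failed run of the workflow
gh run list --workflow deploy.yml --status failure --limit 1

# Re-run it with debug logging enabled and follow the output
gh run rerun 1234567890 --debug
gh run watch 1234567890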

Step 2: Identify the Root Cause

# Add diagnostic steps to understand the failure
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Network diagnostics
        run: |
          echo "=== DNS Resolution ==="
          nslookup api.eks.us-east-1.amazonaws.com || echo "DNS failed"

          echo "=== DNS Servers ==="
          cat /etc/resolv.conf

          echo "=== Connectivity Test ==="
          curl -sS --connect-timeout 10 https://api.eks.us-east-1.amazonaws.com/healthz || echo "Connection failed"

          echo "=== AWS STS Test ==="
          aws sts get-caller-identity || echo "AWS auth failed"
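
Before adding retries, it is also worth confirming that the hostname in the error matches the cluster's real API endpoint, so a stale or hand-edited kubeconfig can be ruled out. A sketch, assuming AWS credentials are already configured in the job and the cluster is named production-cluster as in the later steps:

      - name: Compare kubeconfig endpoint with the cluster endpoint
        run: |
          echo "=== Endpoint reported by EKS ==="
          aws eks describe-cluster \
            --name production-cluster \
            --region us-east-1 \
            --query 'cluster.endpoint' \
            --output text

          echo "=== Endpoint in the current kubeconfig ==="
          kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'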

Step 3: Implement Robust Error Handling

name: Deploy to Production

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    timeout-minutes: 30

    # Note: GitHub Actions has no built-in job-level retry;
    # retries are implemented per step below

    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
          # Add retry for OIDC token fetch
          role-duration-seconds: 3600
          retry-max-attempts: 3

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'

      - name: Update kubeconfig with retry
        run: |
          max_attempts=3
          attempt=1

          while [ $attempt -le $max_attempts ]; do
            echo "Attempt $attempt of $max_attempts"

            if aws eks update-kubeconfig --name production-cluster --region us-east-1; then
              echo "Successfully updated kubeconfig"
              break
            fi

            if [ $attempt -eq $max_attempts ]; then
              echo "Failed after $max_attempts attempts"
              exit 1
            fi

            sleep_time=$((attempt * 10))
            echo "Retrying in ${sleep_time}s..."
            sleep $sleep_time
            attempt=$((attempt + 1))
          done

      - name: Deploy with retry
        run: |
          deploy_with_retry() {
            local max_attempts=3
            local attempt=1

            while [ $attempt -le $max_attempts ]; do
              echo "Deploy attempt $attempt of $max_attempts"

              if kubectl apply -f k8s/ --timeout=60s; then
                echo "Deployment applied successfully"

                if kubectl rollout status deployment/app --timeout=300s; then
                  echo "Rollout completed successfully"
                  return 0
                fi
              fi

              if [ $attempt -eq $max_attempts ]; then
                echo "Deployment failed after $max_attempts attempts"
                return 1
              fi

              # Exponential backoff
              sleep_time=$((2 ** attempt * 5))
              echo "Retrying in ${sleep_time}s..."
              sleep $sleep_time
              attempt=$((attempt + 1))
            done
          }

          deploy_with_retry

      - name: Verify deployment
        if: success()
        run: |
          kubectl get pods -l app=myapp
          kubectl get deployment app -o jsonpath='{.status.availableReplicas}'
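
The verify step above only prints the available replica count. If the job should fail when the deployment is not fully available, a small assertion can be added. A minimal sketch, assuming the deployment is named app as in the rollout step:

      - name: Assert replica count
        run: |
          # Fail the job if fewer replicas are available than desired
          desired=$(kubectl get deployment app -o jsonpath='{.spec.replicas}')
          available=$(kubectl get deployment app -o jsonpath='{.status.availableReplicas}')
          echo "Available: ${available:-0} / Desired: ${desired}"
          if [ "${available:-0}" -lt "${desired}" ]; then
            echo "::error::Only ${available:-0}/${desired} replicas are available"
            exit 1
          fi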

Step 4: Use the Retry Action for Cleaner Code

name: Deploy with Retry Action

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Deploy to Kubernetes
        uses: nick-fields/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 3
          retry_wait_seconds: 30
          command: |
            aws eks update-kubeconfig --name production-cluster
            kubectl apply -f k8s/
            kubectl rollout status deployment/app --timeout=300s

      - name: Notify on final failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          channel-id: 'deployments'
          slack-message: |
            Deployment failed after retries
            Workflow: ${{ github.workflow }}
            Run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

Step 5: Implement Health Checks Before Deploy

name: Deploy with Pre-flight Checks

on:
  push:
    branches: [main]

jobs:
  preflight:
    runs-on: ubuntu-latest
    outputs:
      cluster-healthy: ${{ steps.check.outputs.healthy }}
    steps:
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Check cluster health
        id: check
        run: |
          aws eks update-kubeconfig --name production-cluster

          # Check API server
          if ! kubectl cluster-info; then
            echo "healthy=false" >> $GITHUB_OUTPUT
            exit 0
          fi

          # Check nodes
          unhealthy_nodes=$(kubectl get nodes --no-headers | awk '$2 != "Ready"' | wc -l)
          if [ "$unhealthy_nodes" -gt 0 ]; then
            echo "Found $unhealthy_nodes unhealthy nodes"
            echo "healthy=false" >> $GITHUB_OUTPUT
            exit 0
          fi

          echo "healthy=true" >> $GITHUB_OUTPUT

  deploy:
    needs: preflight
    if: needs.preflight.outputs.cluster-healthy == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Deploy
        run: |
          aws eks update-kubeconfig --name production-cluster
          kubectl apply -f k8s/

  alert-unhealthy:
    needs: preflight
    if: needs.preflight.outputs.cluster-healthy == 'false'
    runs-on: ubuntu-latest
    steps:
      - name: Alert team
        run: |
          echo "::error::Cluster health check failed - deployment skipped"
          # Send alert to team
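
The alert step above is left as a stub. One way to fill it in is to reuse the Slack action from Step 4; this sketch assumes the same channel and secret names used there:

      - name: Notify Slack
        uses: slackapi/slack-github-action@v1
        with:
          channel-id: 'deployments'
          slack-message: |
            Pre-flight check failed - production deployment was skipped
            Run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}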

Step 6: Add Comprehensive Logging

- name: Deploy with detailed logging
  id: deploy
  run: |
    set -euo pipefail

    echo "::group::Cluster Info"
    kubectl cluster-info
    kubectl get nodes
    echo "::endgroup::"

    echo "::group::Current State"
    kubectl get deployments -o wide || true
    kubectl get pods -o wide || true
    echo "::endgroup::"

    echo "::group::Applying Changes"
    kubectl apply -f k8s/ --dry-run=server
    kubectl apply -f k8s/
    echo "::endgroup::"

    echo "::group::Rollout Status"
    kubectl rollout status deployment/app --timeout=300s
    echo "::endgroup::"

    echo "::group::Final State"
    kubectl get pods -o wide
    kubectl get events --sort-by='.lastTimestamp' | tail -20
    echo "::endgroup::"

Common Workflow Debugging Issues

Symptom                   Root Cause                 Fix
DNS resolution failures   Transient network issues   Add retry with backoff
Token expired             OIDC token timeout         Reduce job duration, refresh token
Rate limited              Too many API calls         Add delays, use caching
Random timeouts           Resource contention        Increase timeout, add health checks
Permission denied         Token scope insufficient   Check workflow permissions (see below)
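
For the last row (and the practice question below), the usual fix is an explicit permissions block on the workflow or job. A minimal sketch for the OIDC-based AWS authentication used in the examples above:

permissions:
  contents: read   # allow actions/checkout to read the repository
  id-token: write  # required for OIDC federation with configure-aws-credentials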

Practice Question

A GitHub Actions workflow fails with 'Resource not accessible by integration'. What is the most likely cause?