Questions
A critical deployment workflow is failing intermittently. Debug and fix the issue.
The Scenario
Your production deployment workflow has been failing intermittently for the past week:
# Error from workflow run
Run kubectl apply -f k8s/
error: unable to recognize "k8s/deployment.yaml": Get "https://api.eks.us-east-1.amazonaws.com": dial tcp: lookup api.eks.us-east-1.amazonaws.com: no such host
Error: Process completed with exit code 1.
The workflow worked fine last month. Sometimes it passes, sometimes it fails. Developers are re-running workflows multiple times hoping they’ll succeed.
The Challenge
Debug why the workflow fails intermittently, identify the root cause, and implement robust fixes with proper error handling.
A junior engineer might just re-run the workflow hoping it passes, add arbitrary sleep commands, or blame GitHub's infrastructure. These approaches waste CI minutes, don't solve the underlying issue, and create unreliable deployments.
A senior engineer systematically debugs by examining workflow logs, checking runner environment, identifying transient vs persistent failures, and implementing proper retry logic with exponential backoff for network-dependent operations.
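The retry-with-backoff pattern at the heart of the senior approach can be sketched as a small reusable shell helper. This is illustrative only; the function name and delay values are assumptions, not part of the workflow above:

```shell
#!/usr/bin/env bash
# Generic retry with exponential backoff (sketch).
# Usage: retry <max_attempts> <command> [args...]
retry() {
  local max_attempts=$1; shift
  local attempt=1 delay
  while true; do
    if "$@"; then
      return 0                       # command succeeded
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Failed after $max_attempts attempts: $*" >&2
      return 1
    fi
    delay=$(( 2 ** attempt ))        # 2s, 4s, 8s, ...
    echo "Attempt $attempt failed; retrying in ${delay}s..." >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
  done
}

# Example: any flaky network-dependent command can be wrapped
retry 3 true && echo "ok"
```

Wrapping the actual `kubectl` or `aws` call in such a helper keeps the workflow YAML short while still handling transient DNS and network failures.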
Step 1: Enable Debug Logging
# Re-run workflow with debug logging enabled
# Repository Settings > Secrets and variables > Actions > New repository secret
# ACTIONS_RUNNER_DEBUG = true
# ACTIONS_STEP_DEBUG = true

# Or trigger with debug enabled
name: Deploy
on:
  workflow_dispatch:
    inputs:
      debug_enabled:
        description: 'Enable debug logging'
        required: false
        default: 'false'

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Debug info
        if: ${{ inputs.debug_enabled == 'true' }}
        run: |
          echo "Runner: ${{ runner.name }}"
          echo "OS: ${{ runner.os }}"
          cat /etc/resolv.conf
          nslookup api.eks.us-east-1.amazonaws.com || true
          curl -v https://api.eks.us-east-1.amazonaws.com || true

Step 2: Identify the Root Cause
# Add diagnostic steps to understand the failure
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Network diagnostics
        run: |
          echo "=== DNS Resolution ==="
          nslookup api.eks.us-east-1.amazonaws.com || echo "DNS failed"
          echo "=== DNS Servers ==="
          cat /etc/resolv.conf
          echo "=== Connectivity Test ==="
          curl -sS --connect-timeout 10 https://api.eks.us-east-1.amazonaws.com/healthz || echo "Connection failed"
          echo "=== AWS STS Test ==="
          aws sts get-caller-identity || echo "AWS auth failed"

Step 3: Implement Robust Error Handling
name: Deploy to Production
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    # Note: GitHub Actions has no built-in job-level retry;
    # retries are handled per step below.
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
          role-duration-seconds: 3600
          # Retry the OIDC credential exchange on transient failures
          retry-max-attempts: 3

      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.28.0'

      - name: Update kubeconfig with retry
        run: |
          max_attempts=3
          attempt=1
          while [ $attempt -le $max_attempts ]; do
            echo "Attempt $attempt of $max_attempts"
            if aws eks update-kubeconfig --name production-cluster --region us-east-1; then
              echo "Successfully updated kubeconfig"
              break
            fi
            if [ $attempt -eq $max_attempts ]; then
              echo "Failed after $max_attempts attempts"
              exit 1
            fi
            sleep_time=$((attempt * 10))
            echo "Retrying in ${sleep_time}s..."
            sleep $sleep_time
            attempt=$((attempt + 1))
          done

      - name: Deploy with retry
        run: |
          deploy_with_retry() {
            local max_attempts=3
            local attempt=1
            while [ $attempt -le $max_attempts ]; do
              echo "Deploy attempt $attempt of $max_attempts"
              if kubectl apply -f k8s/ --timeout=60s; then
                echo "Deployment applied successfully"
                if kubectl rollout status deployment/app --timeout=300s; then
                  echo "Rollout completed successfully"
                  return 0
                fi
              fi
              if [ $attempt -eq $max_attempts ]; then
                echo "Deployment failed after $max_attempts attempts"
                return 1
              fi
              # Exponential backoff: 10s, then 20s
              sleep_time=$((2 ** attempt * 5))
              echo "Retrying in ${sleep_time}s..."
              sleep $sleep_time
              attempt=$((attempt + 1))
            done
          }
          deploy_with_retry

      - name: Verify deployment
        if: success()
        run: |
          kubectl get pods -l app=myapp
          kubectl get deployment myapp -o jsonpath='{.status.availableReplicas}'

Step 4: Use the Retry Action for Cleaner Code
name: Deploy with Retry Action
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Deploy to Kubernetes
        uses: nick-fields/retry@v2
        with:
          timeout_minutes: 10
          max_attempts: 3
          retry_wait_seconds: 30
          command: |
            aws eks update-kubeconfig --name production-cluster
            kubectl apply -f k8s/
            kubectl rollout status deployment/app --timeout=300s

      - name: Notify on final failure
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          channel-id: 'deployments'
          slack-message: |
            Deployment failed after retries
            Workflow: ${{ github.workflow }}
            Run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}
        env:
          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

Step 5: Implement Health Checks Before Deploy
name: Deploy with Pre-flight Checks
on:
  push:
    branches: [main]

jobs:
  preflight:
    runs-on: ubuntu-latest
    outputs:
      cluster-healthy: ${{ steps.check.outputs.healthy }}
    steps:
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Check cluster health
        id: check
        run: |
          aws eks update-kubeconfig --name production-cluster
          # Check API server
          if ! kubectl cluster-info; then
            echo "healthy=false" >> "$GITHUB_OUTPUT"
            exit 0
          fi
          # Check nodes: match the STATUS column exactly.
          # (A plain 'grep -v Ready' would also filter out NotReady nodes.)
          unhealthy_nodes=$(kubectl get nodes --no-headers | awk '$2 != "Ready"' | wc -l)
          if [ "$unhealthy_nodes" -gt 0 ]; then
            echo "Found $unhealthy_nodes unhealthy nodes"
            echo "healthy=false" >> "$GITHUB_OUTPUT"
            exit 0
          fi
          echo "healthy=true" >> "$GITHUB_OUTPUT"

  deploy:
    needs: preflight
    if: needs.preflight.outputs.cluster-healthy == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1
      - name: Deploy
        run: |
          aws eks update-kubeconfig --name production-cluster
          kubectl apply -f k8s/

  alert-unhealthy:
    needs: preflight
    if: needs.preflight.outputs.cluster-healthy == 'false'
    runs-on: ubuntu-latest
    steps:
      - name: Alert team
        run: |
          echo "::error::Cluster health check failed - deployment skipped"
          # Send alert to team

Step 6: Add Comprehensive Logging
- name: Deploy with detailed logging
  id: deploy
  run: |
    set -euo pipefail

    echo "::group::Cluster Info"
    kubectl cluster-info
    kubectl get nodes
    echo "::endgroup::"

    echo "::group::Current State"
    kubectl get deployments -o wide || true
    kubectl get pods -o wide || true
    echo "::endgroup::"

    echo "::group::Applying Changes"
    kubectl apply -f k8s/ --dry-run=server
    kubectl apply -f k8s/
    echo "::endgroup::"

    echo "::group::Rollout Status"
    kubectl rollout status deployment/app --timeout=300s
    echo "::endgroup::"

    echo "::group::Final State"
    kubectl get pods -o wide
    kubectl get events --sort-by='.lastTimestamp' | tail -20
    echo "::endgroup::"

Common Workflow Debugging Issues
| Symptom | Root Cause | Fix |
|---|---|---|
| DNS resolution failures | Transient network issues | Add retry with backoff |
| Token expired | OIDC token timeout | Reduce job duration, refresh token |
| Rate limited | Too many API calls | Add delays, use caching |
| Random timeouts | Resource contention | Increase timeout, add health checks |
| Permission denied | Token scope insufficient | Check workflow permissions |
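For the backoff-related fixes in the table, adding jitter matters when many jobs retry simultaneously: randomizing each delay avoids synchronized retry storms against the same endpoint. A hedged sketch of "full jitter" (the base and attempt counts below are arbitrary):

```shell
# Exponential backoff with full jitter (sketch): each retry would sleep
# a random duration between 1s and 2^attempt * base seconds, so
# parallel workflow runs don't all retry at the same instant.
base=5
for attempt in 1 2 3; do
  cap=$(( (2 ** attempt) * base ))   # 10, 20, 40
  jitter=$(( RANDOM % cap + 1 ))     # random value in 1..cap
  echo "attempt $attempt: sleep ${jitter}s (cap ${cap}s)"
done
```

The deterministic backoff shown in Step 3 works for a single workflow; jitter becomes useful once several runs can fail and retry together.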
Practice Question
A GitHub Actions workflow fails with 'Resource not accessible by integration'. What is the most likely cause?
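Hint: this error usually means the workflow's GITHUB_TOKEN lacks a permission the API call requires (or the run was triggered from a fork, which gets a read-only token). A sketch of the typical fix; the scopes shown are examples, not a universal answer:

```yaml
# Declare only the GITHUB_TOKEN scopes the workflow actually needs.
permissions:
  contents: read
  pull-requests: write   # example: required to comment on a PR
```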