Questions
A critical deployment pipeline is failing intermittently in production. Debug and fix the issue.
The Scenario
It’s Friday afternoon and the release train is blocked. Your deployment pipeline has been failing intermittently for the past hour:
Started by user deploy-bot
Running in Durability level: MAX_SURVIVABILITY
[Pipeline] Start of Pipeline
[Pipeline] node
Running on agent-prod-01 in /var/jenkins/workspace/deploy-production
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Deploy to Production)
[Pipeline] sh
+ kubectl apply -f k8s/deployment.yaml
error: unable to connect to the server: dial tcp 10.0.1.50:6443: i/o timeout
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
ERROR: script returned exit code 1
Finished: FAILURE
The same pipeline worked 3 hours ago. Nothing in the Jenkinsfile changed. The Kubernetes cluster is healthy when you check manually.
The Challenge
Debug why the pipeline fails intermittently, identify the root cause, and implement a robust fix with proper error handling and retry logic.
A junior engineer might just re-run the pipeline hoping it works, add a simple sleep before the kubectl command, blame the network team, or increase the timeout without understanding the root cause. These approaches waste time, don't fix the underlying issue, and leave the pipeline unreliable.
A senior engineer systematically debugs by checking agent connectivity, network configuration, credential expiration, and resource contention. They implement proper retry logic with exponential backoff, add health checks before deployment, and set up monitoring to catch issues early.
Step 1: Check Agent and Network Status
// Add debugging to your pipeline
pipeline {
agent { label 'prod-agents' }
environment {
KUBECONFIG = credentials('kubeconfig-prod')
}
stages {
stage('Debug Connectivity') {
steps {
script {
// Check agent details
sh '''
echo "=== Agent Information ==="
hostname
ip addr show
cat /etc/resolv.conf
echo "=== Kubernetes API Connectivity ==="
curl -v -k https://10.0.1.50:6443/healthz --max-time 10 || true
echo "=== DNS Resolution ==="
nslookup kubernetes.default.svc.cluster.local || true
echo "=== Route to API Server ==="
traceroute -m 5 10.0.1.50 || true
'''
}
}
}
}
}
Step 2: Identify Common Root Causes
// Check for credential expiration
stage('Validate Credentials') {
steps {
script {
// Test kubeconfig validity
def result = sh(
script: '''
kubectl cluster-info --request-timeout=10s 2>&1
''',
returnStatus: true
)
if (result != 0) {
// Check if token expired
sh '''
echo "Checking token expiration..."
# JWT payloads are base64url-encoded without padding; normalize the
# alphabet and pad before decoding
kubectl config view --raw -o jsonpath='{.users[0].user.token}' | \
cut -d'.' -f2 | tr '_-' '/+' | awk '{ while (length($0) % 4) $0 = $0 "="; print }' | \
base64 -d 2>/dev/null | jq -r '.exp' | \
xargs -I {} date -d @{} || echo "Token check failed"
'''
error("Kubernetes credentials may be expired or invalid")
}
}
}
}
Step 3: Implement Robust Retry Logic
pipeline {
agent { label 'prod-agents' }
options {
timeout(time: 30, unit: 'MINUTES')
retry(2) // Retry entire pipeline on failure
}
environment {
KUBECONFIG = credentials('kubeconfig-prod')
}
stages {
stage('Pre-flight Checks') {
steps {
script {
// Verify connectivity before proceeding
retry(3) {
sh '''
kubectl cluster-info --request-timeout=15s
'''
}
}
}
}
stage('Deploy to Production') {
steps {
script {
def maxRetries = 3
def retryDelay = 10
for (int i = 0; i < maxRetries; i++) {
try {
sh '''
kubectl apply -f k8s/deployment.yaml --timeout=60s
kubectl rollout status deployment/app --timeout=300s
'''
echo "Deployment successful on attempt ${i + 1}"
break
} catch (Exception e) {
if (i == maxRetries - 1) {
error("Deployment failed after ${maxRetries} attempts: ${e.message}")
}
echo "Attempt ${i + 1} failed, retrying in ${retryDelay} seconds..."
sleep(retryDelay)
retryDelay *= 2 // Exponential backoff
}
}
}
}
}
}
post {
failure {
script {
// Collect diagnostic information on failure
sh '''
echo "=== Collecting diagnostics ==="
kubectl get nodes -o wide || true
kubectl get pods -n kube-system || true
kubectl describe nodes | grep -A 5 "Conditions:" || true
'''
}
}
}
}
Step 4: Add Health Checks and Circuit Breaker
def checkKubernetesHealth() {
def healthChecks = [
'API Server': 'kubectl cluster-info --request-timeout=10s',
// fail when any node reports NotReady (grep -q succeeds on a match, which ! inverts);
// note that plain `grep -v NotReady` would always succeed because of the header row
'Node Status': '! kubectl get nodes --no-headers --request-timeout=10s | grep -q NotReady',
'CoreDNS': 'kubectl get pods -n kube-system -l k8s-app=kube-dns --request-timeout=10s'
]
def failures = []
healthChecks.each { name, command ->
def result = sh(script: command, returnStatus: true)
if (result != 0) {
failures.add(name)
}
}
if (failures.size() > 0) {
error("Health checks failed: ${failures.join(', ')}")
}
echo "All health checks passed"
}
// Use in pipeline
stage('Health Check') {
steps {
script {
checkKubernetesHealth()
}
}
}
Step 5: Implement Proper Error Handling
pipeline {
agent { label 'prod-agents' }
stages {
stage('Deploy') {
steps {
script {
try {
timeout(time: 5, unit: 'MINUTES') {
sh 'kubectl apply -f k8s/deployment.yaml'
}
} catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
echo "Deployment timed out - checking cluster status"
sh 'kubectl get events --sort-by=.lastTimestamp | tail -20'
throw e
} catch (hudson.AbortException e) {
echo "kubectl command failed - checking connectivity"
// kubernetes.default.svc only resolves from inside the cluster;
// probe the API server address directly instead
sh 'curl -k https://10.0.1.50:6443/healthz --max-time 10 || true'
throw e
}
}
}
}
}
}
Common Pipeline Debugging Issues
| Symptom | Root Cause | Fix |
|---|---|---|
| Intermittent timeout | Network instability or agent overload | Add retry with backoff, check agent resources |
| Connection refused | API server overloaded or firewall | Check server health, verify security groups |
| Certificate errors | Expired or mismatched certs | Refresh credentials, check cert validity |
| Permission denied | RBAC changes or token expiration | Verify service account, refresh tokens |
| Resource not found | Wrong namespace or context | Verify KUBECONFIG context and namespace |
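The "retry with backoff" fix from the first row does not have to live in the Jenkinsfile; the same pattern works as a plain shell wrapper around any flaky command. A minimal POSIX-shell sketch (the function name and the `RETRY_*` variables are illustrative, not a Jenkins or kubectl convention):

```shell
# Illustrative helper: retry a command with exponential backoff.
# RETRY_MAX_ATTEMPTS and RETRY_BASE_DELAY are made-up knobs for this sketch.
retry_with_backoff() {
    max_attempts=${RETRY_MAX_ATTEMPTS:-5}
    delay=${RETRY_BASE_DELAY:-2}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0                 # command succeeded
        fi
        if [ "$attempt" -eq "$max_attempts" ]; then
            echo "Failed after $max_attempts attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed, retrying in ${delay}s..." >&2
        sleep "$delay"
        delay=$((delay * 2))         # exponential backoff: 2s, 4s, 8s, ...
        attempt=$((attempt + 1))
    done
}

# Example:
# retry_with_backoff kubectl apply -f k8s/deployment.yaml --timeout=60s
```

Wrapping kubectl calls in a helper like this keeps the Jenkinsfile simple while still surviving transient API-server timeouts.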
Debugging Commands Cheat Sheet
# Check Jenkins agent logs
tail -f /var/log/jenkins/jenkins.log
# Verify agent connectivity
curl -I http://jenkins-master:8080/computer/agent-name/
# Test Kubernetes from agent
kubectl auth can-i --list
# Check for network issues (connections to the API server)
netstat -an | grep 6443
ss -tnp | grep 6443
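One root cause from the table above, expired certificates, can also be checked directly on the agent. A sketch assuming `openssl` is installed (`cert_expiry` is a hypothetical helper name, not a standard tool):

```shell
# Hypothetical helper: print the expiry date of a PEM certificate read from stdin.
cert_expiry() {
    openssl x509 -noout -enddate | cut -d= -f2
}

# Example: check the client certificate embedded in the active kubeconfig
# kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' \
#     | base64 -d | cert_expiry
```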
Practice Question
A Jenkins pipeline fails with 'connection refused' when connecting to a Kubernetes cluster, but works when run manually on the same agent. What is the most likely cause?