Questions
A critical deployment pipeline is failing intermittently in production. Debug and fix the issue.
The Scenario
It’s Friday afternoon and the release train is blocked. Your deployment pipeline has been failing intermittently for the past hour:
Started by user deploy-bot
Running in Durability level: MAX_SURVIVABILITY
[Pipeline] Start of Pipeline
[Pipeline] node
Running on agent-prod-01 in /var/jenkins/workspace/deploy-production
[Pipeline] {
[Pipeline] stage
[Pipeline] { (Deploy to Production)
[Pipeline] sh
+ kubectl apply -f k8s/deployment.yaml
error: unable to connect to the server: dial tcp 10.0.1.50:6443: i/o timeout
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
ERROR: script returned exit code 1
Finished: FAILURE
The same pipeline worked 3 hours ago. Nothing in the Jenkinsfile changed. The Kubernetes cluster is healthy when you check manually.
The Challenge
Debug why the pipeline fails intermittently, identify the root cause, and implement a robust fix with proper error handling and retry logic.
A junior engineer might just re-run the pipeline hoping it works, add a simple sleep before the kubectl command, blame the network team, or increase the timeout without understanding the root cause. These approaches waste time, don't fix the underlying issue, and leave the pipeline unreliable.
A senior engineer systematically debugs by checking agent connectivity, network configuration, credential expiration, and resource contention. They implement proper retry logic with exponential backoff, add health checks before deployment, and set up monitoring to catch issues early.
Step 1: Check Agent and Network Status
// Add debugging to your pipeline
pipeline {
agent { label 'prod-agents' }
environment {
KUBECONFIG = credentials('kubeconfig-prod')
}
stages {
stage('Debug Connectivity') {
steps {
script {
// Check agent details
sh '''
echo "=== Agent Information ==="
hostname
ip addr show
cat /etc/resolv.conf
echo "=== Kubernetes API Connectivity ==="
curl -v -k https://10.0.1.50:6443/healthz --max-time 10 || true
echo "=== DNS Resolution ==="
nslookup kubernetes.default.svc.cluster.local || true
echo "=== Route to API Server ==="
traceroute -m 5 10.0.1.50 || true
'''
}
}
}
}
}
Step 2: Identify Common Root Causes
// Check for credential expiration
stage('Validate Credentials') {
steps {
script {
// Test kubeconfig validity
def result = sh(
script: '''
kubectl cluster-info --request-timeout=10s 2>&1
''',
returnStatus: true
)
if (result != 0) {
// Check if token expired
sh '''
echo "Checking token expiration..."
# JWT payloads are base64url-encoded without padding; normalize the
# alphabet and pad before decoding
kubectl config view --raw -o jsonpath='{.users[0].user.token}' | \
cut -d'.' -f2 | tr '_-' '/+' | awk '{ while (length($0) % 4) $0 = $0 "="; print }' | \
base64 -d 2>/dev/null | jq -r '.exp' | \
xargs -I {} date -d @{} || echo "Token check failed"
'''
error("Kubernetes credentials may be expired or invalid")
}
}
}
}
Step 3: Implement Robust Retry Logic
pipeline {
agent { label 'prod-agents' }
options {
timeout(time: 30, unit: 'MINUTES')
retry(2) // Retry entire pipeline on failure
}
environment {
KUBECONFIG = credentials('kubeconfig-prod')
}
stages {
stage('Pre-flight Checks') {
steps {
script {
// Verify connectivity before proceeding
retry(3) {
sh '''
kubectl cluster-info --request-timeout=15s
'''
}
}
}
}
stage('Deploy to Production') {
steps {
script {
def maxRetries = 3
def retryDelay = 10
for (int i = 0; i < maxRetries; i++) {
try {
sh '''
kubectl apply -f k8s/deployment.yaml --timeout=60s
kubectl rollout status deployment/app --timeout=300s
'''
echo "Deployment successful on attempt ${i + 1}"
break
} catch (Exception e) {
if (i == maxRetries - 1) {
error("Deployment failed after ${maxRetries} attempts: ${e.message}")
}
echo "Attempt ${i + 1} failed, retrying in ${retryDelay} seconds..."
sleep(retryDelay)
retryDelay *= 2 // Exponential backoff
}
}
}
}
}
}
post {
failure {
script {
// Collect diagnostic information on failure
sh '''
echo "=== Collecting diagnostics ==="
kubectl get nodes -o wide || true
kubectl get pods -n kube-system || true
kubectl describe nodes | grep -A 5 "Conditions:" || true
'''
}
}
}
}
Step 4: Add Health Checks and Circuit Breaker
def checkKubernetesHealth() {
def healthChecks = [
'API Server': 'kubectl cluster-info --request-timeout=10s',
// fail when any node reports NotReady (grep -q succeeds on a match, which ! inverts);
// note that plain `grep -v NotReady` would always succeed because of the header row
'Node Status': '! kubectl get nodes --no-headers --request-timeout=10s | grep -q NotReady',
'CoreDNS': 'kubectl get pods -n kube-system -l k8s-app=kube-dns --request-timeout=10s'
]
def failures = []
healthChecks.each { name, command ->
def result = sh(script: command, returnStatus: true)
if (result != 0) {
failures.add(name)
}
}
if (failures.size() > 0) {
error("Health checks failed: ${failures.join(', ')}")
}
echo "All health checks passed"
}
// Use in pipeline
stage('Health Check') {
steps {
script {
checkKubernetesHealth()
}
}
}
Step 5: Implement Proper Error Handling
pipeline {
agent { label 'prod-agents' }
stages {
stage('Deploy') {
steps {
script {
try {
timeout(time: 5, unit: 'MINUTES') {
sh 'kubectl apply -f k8s/deployment.yaml'
}
} catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
echo "Deployment timed out - checking cluster status"
sh 'kubectl get events --sort-by=.lastTimestamp | tail -20'
throw e
} catch (hudson.AbortException e) {
echo "kubectl command failed - checking connectivity"
// kubernetes.default.svc only resolves from inside the cluster;
// probe the API server address directly instead
sh 'curl -k https://10.0.1.50:6443/healthz --max-time 10 || true'
throw e
}
}
}
}
}
}
Common Pipeline Debugging Issues
| Symptom | Root Cause | Fix |
|---|---|---|
| Intermittent timeout | Network instability or agent overload | Add retry with backoff, check agent resources |
| Connection refused | API server overloaded or firewall | Check server health, verify security groups |
| Certificate errors | Expired or mismatched certs | Refresh credentials, check cert validity |
| Permission denied | RBAC changes or token expiration | Verify service account, refresh tokens |
| Resource not found | Wrong namespace or context | Verify KUBECONFIG context and namespace |
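The "retry with backoff" fix from the first row does not have to live in the Jenkinsfile; the same pattern works as a plain shell wrapper around any flaky command. A minimal POSIX-shell sketch (the function name and the `RETRY_*` variables are illustrative, not a Jenkins or kubectl convention):

```shell
# Illustrative helper: retry a command with exponential backoff.
# RETRY_MAX_ATTEMPTS and RETRY_BASE_DELAY are made-up knobs for this sketch.
retry_with_backoff() {
    max_attempts=${RETRY_MAX_ATTEMPTS:-5}
    delay=${RETRY_BASE_DELAY:-2}
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0                 # command succeeded
        fi
        if [ "$attempt" -eq "$max_attempts" ]; then
            echo "Failed after $max_attempts attempts: $*" >&2
            return 1
        fi
        echo "Attempt $attempt failed, retrying in ${delay}s..." >&2
        sleep "$delay"
        delay=$((delay * 2))         # exponential backoff: 2s, 4s, 8s, ...
        attempt=$((attempt + 1))
    done
}

# Example:
# retry_with_backoff kubectl apply -f k8s/deployment.yaml --timeout=60s
```

Wrapping kubectl calls in a helper like this keeps the Jenkinsfile simple while still surviving transient API-server timeouts.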
Debugging Commands Cheat Sheet
# Check Jenkins agent logs
tail -f /var/log/jenkins/jenkins.log
# Verify agent connectivity
curl -I http://jenkins-master:8080/computer/agent-name/
# Test Kubernetes from agent
kubectl auth can-i --list
# Check for network issues (connections to the API server)
netstat -an | grep 6443
ss -tnp | grep 6443
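One root cause from the table above, expired certificates, can also be checked directly on the agent. A sketch assuming `openssl` is installed (`cert_expiry` is a hypothetical helper name, not a standard tool):

```shell
# Hypothetical helper: print the expiry date of a PEM certificate read from stdin.
cert_expiry() {
    openssl x509 -noout -enddate | cut -d= -f2
}

# Example: check the client certificate embedded in the active kubeconfig
# kubectl config view --raw -o jsonpath='{.users[0].user.client-certificate-data}' \
#     | base64 -d | cert_expiry
```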
Practice Question
A Jenkins pipeline fails with 'connection refused' when connecting to a Kubernetes cluster, but works when run manually on the same agent. What is the most likely cause?