Questions
Jenkins server crashed and we lost all job configurations. Implement a disaster recovery strategy.
The Scenario
Monday morning: Jenkins is down. The disk failed overnight.
Current Impact:
- Jenkins master: OFFLINE (disk failure)
- Jobs configured: 500+ (all lost)
- Credentials: 150 (need to be recreated)
- Build history: 2 years (gone)
- Pipeline libraries: Custom shared library (gone)
- Estimated recovery: Unknown (no backup tested)
Developers can’t deploy. The CEO is asking when services will be restored. You need to recover AND ensure this never happens again.
The Challenge
Implement a comprehensive disaster recovery strategy that includes automated backups, quick restoration procedures, and infrastructure that can survive failures.
A junior engineer might just add a cron job to copy JENKINS_HOME to S3 without ever testing a restore, skip credentials backup due to security concerns, or back up only on demand when someone remembers. These approaches leave you unable to restore, miss critical data, and create false confidence.
A senior engineer implements automated, tested backups using multiple strategies (JCasC for config, encrypted credential backups, build history archival), creates runbooks for restoration, regularly tests recovery procedures, and optionally implements high availability to prevent the scenario entirely.
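"Tested backups" starts with verifying that each archive even contains what a restore will need. A minimal sketch (the entry names follow the backup layout used later in this exercise; the function name is illustrative):

```shell
# Sketch: check that a backup archive contains what a restore will need.
# Entry names match the layout produced by the backup pipeline below;
# adjust the list to your own.
verify_backup() {
  archive="$1"
  listing=$(tar -tzf "$archive") || return 1
  for entry in jenkins-config.yaml plugins.txt jobs; do
    if ! printf '%s\n' "$listing" | grep -q "$entry"; then
      echo "MISSING: $entry"
      return 1
    fi
  done
  echo "OK"
}
```

Run it against every fresh upload; a backup that fails this check should page someone, not sit quietly in S3.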
Step 1: Define Recovery Objectives
# recovery-objectives.yaml
disaster_recovery:
  RTO: 30 minutes   # Recovery Time Objective
  RPO: 1 hour       # Recovery Point Objective

  what_to_backup:
    critical:         # Must restore within RTO
      - job_configurations
      - credentials
      - plugin_list
      - global_configuration
      - shared_libraries
    important:        # Restore within 4 hours
      - build_history
      - workspace_caches
    nice_to_have:     # Restore within 24 hours
      - archived_artifacts
      - old_build_logs

Step 2: Implement Configuration as Code Backup
# Store Jenkins configuration in Git (primary backup)
# jenkins-config-repo/jenkins.yaml
jenkins:
  systemMessage: "Jenkins - Disaster Recovery Enabled"
  numExecutors: 0
  mode: EXCLUSIVE
  securityRealm:
    ldap:
      configurations:
        - server: "${LDAP_SERVER}"
          rootDN: "dc=company,dc=com"
  authorizationStrategy:
    roleBased:
      roles:
        global:
          - name: "admin"
            permissions:
              - "Overall/Administer"
            entries:
              - group: "jenkins-admins"
  clouds:
    - kubernetes:
        name: "kubernetes"
        serverUrl: "${K8S_SERVER_URL}"
        # ... full cloud config
credentials:
  system:
    domainCredentials:
      - credentials:
          # Secrets from external vault - never in Git
          - usernamePassword:
              id: "github-credentials"
              username: "${GITHUB_USERNAME}"
              password: "${GITHUB_TOKEN}"
unclassified:
  globalLibraries:
    libraries:
      - name: "company-shared-library"
        defaultVersion: "main"
        retriever:
          modernSCM:
            scm:
              git:
                remote: "https://github.com/company/jenkins-shared-library.git"

Step 3: Automated Backup Pipeline
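The pipeline below captures the installed plugin list as `shortName:version` lines; a tiny format check can catch a truncated or garbled export before it ever gets archived (function name is illustrative):

```shell
# Sketch: sanity-check a plugins.txt export - every non-empty line
# should look like shortName:version.
valid_plugins_file() {
  # Print any line that does NOT match the format; succeed only if none exist
  ! grep -vE '^[^:[:space:]]+:[^[:space:]]+$' "$1" | grep -q .
}
```

Wiring a check like this into the backup's verify stage turns a silently-bad backup into a loud failure.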
// backup-pipeline.groovy
pipeline {
    agent { label 'backup-agent' }

    triggers {
        cron('H * * * *') // Hourly, to meet the 1-hour RPO
    }

    environment {
        BACKUP_BUCKET = 's3://jenkins-backups-company'
        JENKINS_HOME  = '/var/jenkins_home'
        TIMESTAMP     = sh(script: 'date +%Y%m%d-%H%M%S', returnStdout: true).trim()
        // JENKINS_USER / JENKINS_TOKEN are assumed to be injected,
        // e.g. via withCredentials or global environment configuration
    }

    stages {
        stage('Prepare Backup') {
            steps {
                sh '''
                    # Create backup directory
                    mkdir -p /tmp/jenkins-backup-${TIMESTAMP}
                    cd /tmp/jenkins-backup-${TIMESTAMP}

                    # Export configuration via JCasC (the export endpoint requires POST)
                    curl -sf -X POST -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                        ${JENKINS_URL}/configuration-as-code/export \
                        > jenkins-config.yaml

                    # List installed plugins with versions
                    curl -sf -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                        "${JENKINS_URL}/pluginManager/api/json?depth=1" | \
                        jq -r '.plugins[] | "\\(.shortName):\\(.version)"' | \
                        sort > plugins.txt
                '''
            }
        }

        stage('Backup Job Configurations') {
            steps {
                sh '''
                    cd /tmp/jenkins-backup-${TIMESTAMP}

                    # Backup all job configs (without build history)
                    mkdir -p jobs
                    find ${JENKINS_HOME}/jobs -name "config.xml" -type f | while read config; do
                        # Get path relative to the jobs directory
                        relpath=$(echo "$config" | sed "s|${JENKINS_HOME}/jobs/||")
                        mkdir -p "jobs/$(dirname "$relpath")"
                        cp "$config" "jobs/$relpath"
                    done

                    # Backup nodes configuration
                    mkdir -p nodes
                    cp -r ${JENKINS_HOME}/nodes/* nodes/ 2>/dev/null || true

                    # Views are defined in the top-level config.xml; back it up too
                    cp ${JENKINS_HOME}/config.xml global-config.xml
                '''
            }
        }

        stage('Backup Credentials (Encrypted)') {
            steps {
                withCredentials([string(credentialsId: 'backup-encryption-key', variable: 'ENCRYPTION_KEY')]) {
                    sh '''
                        cd /tmp/jenkins-backup-${TIMESTAMP}

                        # Backup secrets directory (encrypted); -C keeps archive
                        # paths relative so the restore can extract into any home
                        tar -czf - -C ${JENKINS_HOME} secrets | \
                            openssl enc -aes-256-cbc -salt -pbkdf2 \
                                -pass pass:${ENCRYPTION_KEY} \
                                > secrets.tar.gz.enc

                        # Backup credentials.xml (encrypted)
                        openssl enc -aes-256-cbc -salt -pbkdf2 \
                            -pass pass:${ENCRYPTION_KEY} \
                            -in ${JENKINS_HOME}/credentials.xml \
                            -out credentials.xml.enc
                    '''
                }
            }
        }

        stage('Backup Build History') {
            when {
                // Full history backup only on weekends
                expression { return new Date().format('u') in ['6', '7'] }
            }
            steps {
                sh '''
                    cd /tmp/jenkins-backup-${TIMESTAMP}

                    # Backup last 10 builds per job (excluding workspaces)
                    mkdir -p build-history
                    find ${JENKINS_HOME}/jobs -type d -name "builds" | while read builds_dir; do
                        job_name=$(echo "$builds_dir" | sed "s|${JENKINS_HOME}/jobs/||" | sed 's|/builds||')
                        mkdir -p "build-history/$job_name"
                        # Copy the 10 most recent builds
                        ls -1t "$builds_dir" | head -10 | while read build; do
                            if [ -d "$builds_dir/$build" ]; then
                                cp -r "$builds_dir/$build" "build-history/$job_name/" 2>/dev/null || true
                            fi
                        done
                    done
                '''
            }
        }

        stage('Upload to S3') {
            steps {
                withCredentials([[$class: 'AmazonWebServicesCredentialsBinding',
                                  credentialsId: 'aws-backup-credentials',
                                  accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                                  secretKeyVariable: 'AWS_SECRET_ACCESS_KEY']]) {
                    sh '''
                        cd /tmp

                        # Create compressed archive
                        tar -czf jenkins-backup-${TIMESTAMP}.tar.gz jenkins-backup-${TIMESTAMP}

                        # Upload to S3
                        aws s3 cp jenkins-backup-${TIMESTAMP}.tar.gz \
                            ${BACKUP_BUCKET}/daily/${TIMESTAMP}/

                        # Keep only the newest 30 backups
                        aws s3 ls ${BACKUP_BUCKET}/daily/ | sort | head -n -30 | \
                            awk '{print $2}' | while read old_backup; do
                                aws s3 rm ${BACKUP_BUCKET}/daily/$old_backup --recursive
                            done

                        # Cleanup local
                        rm -rf jenkins-backup-${TIMESTAMP} jenkins-backup-${TIMESTAMP}.tar.gz
                    '''
                }
            }
        }

        stage('Verify Backup') {
            steps {
                // Download the uploaded archive and verify its integrity
                sh '''
                    cd /tmp
                    aws s3 cp ${BACKUP_BUCKET}/daily/${TIMESTAMP}/jenkins-backup-${TIMESTAMP}.tar.gz .
                    tar -tzf jenkins-backup-${TIMESTAMP}.tar.gz > /dev/null
                    echo "Backup verification: SUCCESS"
                    rm jenkins-backup-${TIMESTAMP}.tar.gz
                '''
            }
        }
    }

    post {
        success {
            slackSend(
                channel: '#jenkins-ops',
                color: 'good',
                message: "Jenkins backup completed: ${TIMESTAMP}"
            )
        }
        failure {
            slackSend(
                channel: '#jenkins-alerts',
                color: 'danger',
                message: "Jenkins backup FAILED! Immediate attention required."
            )
        }
    }
}

Step 4: Disaster Recovery Runbook
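The restore pipeline assumes a recovery host where the CLI tools it shells out to already exist. A quick preflight sketch keeps the runbook from failing halfway through (the tool list is illustrative):

```shell
# Sketch: fail fast if any CLI tool the restore needs is missing.
preflight() {
  missing=0
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; missing=1; }
  done
  return "$missing"
}

# Example for this runbook:
#   preflight aws tar curl openssl jq jenkins-plugin-cli
```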
// restore-jenkins.groovy - Run from the recovery environment
pipeline {
    agent any

    parameters {
        string(
            name: 'BACKUP_TIMESTAMP',
            description: 'Backup timestamp to restore (e.g., 20240115-120000)'
        )
        booleanParam(
            name: 'RESTORE_CREDENTIALS',
            defaultValue: true,
            description: 'Restore encrypted credentials'
        )
        booleanParam(
            name: 'RESTORE_BUILD_HISTORY',
            defaultValue: false,
            description: 'Restore build history (takes longer)'
        )
    }

    environment {
        BACKUP_BUCKET    = 's3://jenkins-backups-company'
        NEW_JENKINS_HOME = '/var/jenkins_home'
        // JENKINS_USER / JENKINS_TOKEN are assumed to be injected,
        // as in the backup pipeline
    }

    stages {
        stage('Download Backup') {
            steps {
                sh '''
                    mkdir -p /tmp/jenkins-restore
                    cd /tmp/jenkins-restore
                    aws s3 cp ${BACKUP_BUCKET}/daily/${BACKUP_TIMESTAMP}/jenkins-backup-${BACKUP_TIMESTAMP}.tar.gz .
                    tar -xzf jenkins-backup-${BACKUP_TIMESTAMP}.tar.gz
                '''
            }
        }

        stage('Stop Jenkins') {
            steps {
                sh '''
                    # Stop Jenkins gracefully (no-op if it is already down)
                    curl -X POST -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                        "${JENKINS_URL}/safeExit" || true

                    # Wait for shutdown
                    sleep 30
                '''
            }
        }

        stage('Restore Configuration') {
            steps {
                sh '''
                    cd /tmp/jenkins-restore/jenkins-backup-${BACKUP_TIMESTAMP}

                    # Restore JCasC configuration
                    mkdir -p ${NEW_JENKINS_HOME}/casc_configs
                    cp jenkins-config.yaml ${NEW_JENKINS_HOME}/casc_configs/

                    # Restore job configurations
                    mkdir -p ${NEW_JENKINS_HOME}/jobs
                    cp -r jobs/* ${NEW_JENKINS_HOME}/jobs/

                    # Restore nodes
                    mkdir -p ${NEW_JENKINS_HOME}/nodes
                    cp -r nodes/* ${NEW_JENKINS_HOME}/nodes/ 2>/dev/null || true
                '''
            }
        }

        stage('Restore Credentials') {
            when {
                expression { params.RESTORE_CREDENTIALS }
            }
            steps {
                withCredentials([string(credentialsId: 'backup-encryption-key', variable: 'ENCRYPTION_KEY')]) {
                    sh '''
                        cd /tmp/jenkins-restore/jenkins-backup-${BACKUP_TIMESTAMP}

                        # Restore secrets (archive paths are relative to JENKINS_HOME)
                        openssl enc -aes-256-cbc -d -pbkdf2 \
                            -pass pass:${ENCRYPTION_KEY} \
                            -in secrets.tar.gz.enc | tar -xzf - -C ${NEW_JENKINS_HOME}/

                        # Restore credentials.xml
                        openssl enc -aes-256-cbc -d -pbkdf2 \
                            -pass pass:${ENCRYPTION_KEY} \
                            -in credentials.xml.enc \
                            -out ${NEW_JENKINS_HOME}/credentials.xml
                    '''
                }
            }
        }

        stage('Install Plugins') {
            steps {
                sh '''
                    cd /tmp/jenkins-restore/jenkins-backup-${BACKUP_TIMESTAMP}

                    # Install plugins from the recorded list
                    jenkins-plugin-cli --plugin-file plugins.txt
                '''
            }
        }

        stage('Start Jenkins') {
            steps {
                sh '''
                    # Start Jenkins
                    systemctl start jenkins

                    # Wait for startup (give up after 5 minutes)
                    timeout 300 bash -c 'until curl -sf -o /dev/null ${JENKINS_URL}/login; do sleep 5; done'
                    echo "Jenkins is up and running"
                '''
            }
        }

        stage('Verify Restoration') {
            steps {
                script {
                    // Verify jobs are present
                    def jobCount = sh(
                        script: """
                            curl -sf -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                                "${JENKINS_URL}/api/json?tree=jobs[name]" | \
                                jq '.jobs | length'
                        """,
                        returnStdout: true
                    ).trim().toInteger()

                    echo "Restored ${jobCount} jobs"
                    if (jobCount < 100) { // Expected minimum
                        error "Job count lower than expected!"
                    }

                    // Test a known credential (-f makes curl fail on HTTP errors)
                    def credTest = sh(
                        script: """
                            curl -sf -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                                "${JENKINS_URL}/credentials/store/system/domain/_/credential/github-credentials/api/json"
                        """,
                        returnStatus: true
                    )
                    if (credTest != 0) {
                        error "Credential restoration verification failed!"
                    }

                    echo "Restoration verified successfully"
                }
            }
        }
    }

    post {
        success {
            slackSend(
                channel: '#jenkins-alerts',
                color: 'good',
                message: """
                    *Jenkins Restored Successfully*
                    Backup: ${params.BACKUP_TIMESTAMP}
                    Time: ${currentBuild.durationString}
                """
            )
        }
        always {
            sh 'rm -rf /tmp/jenkins-restore'
        }
    }
}

Step 5: High Availability Setup (Prevention)
# kubernetes/jenkins-ha.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: jenkins
  namespace: jenkins
spec:
  serviceName: jenkins
  replicas: 1  # Active-passive with persistent storage
  selector:
    matchLabels:
      app: jenkins
  template:
    metadata:
      labels:
        app: jenkins  # must match the selector above
    spec:
      containers:
        - name: jenkins
          image: jenkins/jenkins:lts-jdk17
          volumeMounts:
            - name: jenkins-home
              mountPath: /var/jenkins_home
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
  volumeClaimTemplates:
    - metadata:
        name: jenkins-home
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3  # High-IOPS SSD
        resources:
          requests:
            storage: 100Gi
---
# Use EBS snapshots for point-in-time recovery
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: jenkins-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Retain
parameters:
  tagSpecification_1: "Environment=production"
---
# Automated daily snapshots
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jenkins-volume-snapshot
  namespace: jenkins
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          # Assumes a ServiceAccount bound with permission to create
          # VolumeSnapshot objects in this namespace
          containers:
            - name: snapshot-creator
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                - |
                  kubectl apply -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: jenkins-home-$(date +%Y%m%d)
                    namespace: jenkins
                  spec:
                    volumeSnapshotClassName: jenkins-snapshot-class
                    source:
                      persistentVolumeClaimName: jenkins-home-jenkins-0
                  EOF
          restartPolicy: OnFailure

Step 6: Regular Recovery Testing
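Before the monthly pipeline below touches real backups, the encrypted-credential round trip the restore depends on can be rehearsed standalone with throwaway data (same openssl flags as the backup pipeline; function name is illustrative):

```shell
# Sketch: encrypt a file the way the backup does, decrypt it back,
# and confirm the bytes survive the round trip. Use throwaway data only.
roundtrip_ok() {
  src="$1"; key="$2"
  openssl enc -aes-256-cbc -salt -pbkdf2 -pass pass:"$key" \
    -in "$src" -out "$src.enc" || return 1
  openssl enc -aes-256-cbc -d -pbkdf2 -pass pass:"$key" \
    -in "$src.enc" -out "$src.dec" || return 1
  cmp -s "$src" "$src.dec"
}
```

A drill like this catches mismatched openssl versions or flags (e.g., a host missing `-pbkdf2` support) long before a real recovery.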
// test-disaster-recovery.groovy - Run monthly
pipeline {
    agent any

    triggers {
        cron('0 3 1 * *') // First of every month at 3 AM
    }

    stages {
        stage('Create Test Environment') {
            steps {
                sh '''
                    # Spin up a temporary Jenkins instance
                    docker run -d --name jenkins-dr-test \
                        -p 9090:8080 \
                        jenkins/jenkins:lts-jdk17
                '''
            }
        }

        stage('Restore to Test Environment') {
            steps {
                // Run the restore pipeline against the test instance.
                // Assumes restore-jenkins has been extended with a
                // TARGET_INSTANCE parameter and support for 'latest'.
                build(
                    job: 'restore-jenkins',
                    parameters: [
                        string(name: 'BACKUP_TIMESTAMP', value: 'latest'),
                        string(name: 'TARGET_INSTANCE', value: 'test')
                    ]
                )
            }
        }

        stage('Validate Restoration') {
            steps {
                script {
                    // Each check must exit non-zero on failure:
                    // -f makes curl fail on HTTP errors, jq -e on false results
                    def tests = [
                        'Jobs present'          : 'curl -sf localhost:9090/api/json | jq -e ".jobs | length > 0"',
                        'Plugins loaded'        : 'curl -sf localhost:9090/pluginManager/api/json | jq -e ".plugins | length > 0"',
                        'Credentials accessible': 'curl -sf localhost:9090/credentials/api/json'
                    ]
                    tests.each { name, command ->
                        def result = sh(script: command, returnStatus: true)
                        if (result != 0) {
                            error "DR Test Failed: ${name}"
                        }
                    }
                }
            }
        }
    }

    post {
        always {
            sh 'docker rm -f jenkins-dr-test || true'
        }
        success {
            emailext(
                to: 'ops-team@company.com',
                subject: 'Jenkins DR Test: PASSED',
                body: 'Monthly disaster recovery test completed successfully.'
            )
        }
        failure {
            emailext(
                to: 'ops-team@company.com',
                subject: 'Jenkins DR Test: FAILED',
                body: 'Monthly disaster recovery test FAILED. Immediate attention required.'
            )
        }
    }
}

Disaster Recovery Checklist
| Component | Backup Method | Frequency | Restore Time |
|---|---|---|---|
| Job configs | Git + S3 | Hourly | 5 minutes |
| Credentials | Encrypted S3 | Hourly | 2 minutes |
| Plugins list | Git | On change | 10 minutes |
| JCasC config | Git | On change | 2 minutes |
| Build history | S3 | Weekly | 30 minutes |
| Shared libraries | Git | On change | 1 minute |
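The frequencies above only help if backups actually keep arriving. A freshness check compares the newest backup's timestamp against the RPO (timestamp format matches the pipeline's `%Y%m%d-%H%M%S`; GNU `date` and bash assumed):

```shell
# Sketch: succeed only if the newest backup is within the RPO window.
# Feed it the newest timestamp from `aws s3 ls` and the RPO in seconds.
backup_age_ok() {
  newest="$1"        # e.g. 20240115-120000
  rpo_seconds="$2"
  backup_epoch=$(date -d "${newest:0:8} ${newest:9:2}:${newest:11:2}:${newest:13:2}" +%s) || return 2
  now=$(date +%s)
  [ $(( now - backup_epoch )) -le "$rpo_seconds" ]
}
```

Wired into a monitoring cron, a non-zero exit here should alert `#jenkins-alerts` just like a failed backup run.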
Practice Question
Why should Jenkins credentials be backed up separately with encryption, rather than included in the standard JENKINS_HOME backup?