Jenkins server crashed and we lost all job configurations. Implement a disaster recovery strategy.

The Scenario

Monday morning: Jenkins is down. The disk failed overnight.

Current Impact:
- Jenkins master: OFFLINE (disk failure)
- Jobs configured: 500+ (all lost)
- Credentials: 150 (need to be recreated)
- Build history: 2 years (gone)
- Pipeline libraries: Custom shared library (gone)
- Estimated recovery: Unknown (no backup tested)

Developers can’t deploy. The CEO is asking when services will be restored. You need to recover AND ensure this never happens again.

The Challenge

Implement a comprehensive disaster recovery strategy that includes automated backups, quick restoration procedures, and infrastructure that can survive failures.

Wrong Approach

A junior engineer might just add a cron job that copies JENKINS_HOME to S3 without ever testing a restore, skip credential backups because of security concerns, or back up only on demand when someone remembers. These approaches miss critical data, create false confidence, and leave you unable to restore when it matters.

Right Approach

A senior engineer implements automated, tested backups using multiple strategies (JCasC for config, encrypted credential backups, build history archival), creates runbooks for restoration, regularly tests recovery procedures, and optionally implements high availability to prevent the scenario entirely.

Step 1: Define Recovery Objectives

# recovery-objectives.yaml
disaster_recovery:
  RTO: 30 minutes      # Recovery Time Objective
  RPO: 4 hours         # Recovery Point Objective (matches the 4-hourly backup schedule)

  what_to_backup:
    critical:          # Must restore within RTO
      - job_configurations
      - credentials
      - plugin_list
      - global_configuration
      - shared_libraries
    important:         # Restore within 4 hours
      - build_history
      - workspace_caches
    nice_to_have:      # Restore within 24 hours
      - archived_artifacts
      - old_build_logs
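
An RPO only means something if the age of the newest backup is checked against it. The following is a minimal sketch of such a check, not part of the original setup: the function name `rpo_met` and its arguments (all in epoch seconds) are illustrative assumptions.

```shell
#!/bin/sh
# rpo_met LAST_BACKUP_EPOCH NOW_EPOCH RPO_SECONDS
# Succeeds (exit 0) when the newest backup is no older than the RPO.
# All three arguments are whole seconds; the names are hypothetical.
rpo_met() {
    last_backup=$1
    now=$2
    rpo=$3
    age=$((now - last_backup))
    [ "$age" -le "$rpo" ]
}

# Example: a backup taken 2 hours ago (7200 s) fails a 1-hour RPO (3600 s)
# but satisfies a 4-hour RPO (14400 s).
```

A monitoring job could call it as `rpo_met "$last" "$(date +%s)" 14400 || echo "RPO violated"`, where `$last` is the timestamp of the most recent successful backup.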

Step 2: Implement Configuration as Code Backup

# Store Jenkins configuration in Git (primary backup)
# jenkins-config-repo/jenkins.yaml
jenkins:
  systemMessage: "Jenkins - Disaster Recovery Enabled"
  numExecutors: 0
  mode: EXCLUSIVE

  securityRealm:
    ldap:
      configurations:
        - server: "${LDAP_SERVER}"
          rootDN: "dc=company,dc=com"

  authorizationStrategy:
    roleBased:
      roles:
        global:
          - name: "admin"
            permissions:
              - "Overall/Administer"
            entries:
              - group: "jenkins-admins"

  clouds:
    - kubernetes:
        name: "kubernetes"
        serverUrl: "${K8S_SERVER_URL}"
        # ... full cloud config

credentials:
  system:
    domainCredentials:
      - credentials:
          # Secrets from external vault - never in Git
          - usernamePassword:
              id: "github-credentials"
              username: "${GITHUB_USERNAME}"
              password: "${GITHUB_TOKEN}"

unclassified:
  globalLibraries:
    libraries:
      - name: "company-shared-library"
        retriever:
          modernSCM:
            scm:
              git:
                remote: "https://github.com/company/jenkins-shared-library.git"
        defaultVersion: "main"

Step 3: Automated Backup Pipeline

// backup-pipeline.groovy
pipeline {
    agent { label 'backup-agent' }

    triggers {
        cron('0 */4 * * *')  // Every 4 hours
    }

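    environment {
        BACKUP_BUCKET = 's3://jenkins-backups-company'
        JENKINS_HOME = '/var/jenkins_home'
        TIMESTAMP = sh(script: 'date +%Y%m%d-%H%M%S', returnStdout: true).trim()
        // JENKINS_USER and JENKINS_TOKEN (an API token) are assumed to be
        // provided to this job, e.g. through a credentials binding.
    }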

    stages {
        stage('Prepare Backup') {
            steps {
                sh '''
                    # Create backup directory
                    mkdir -p /tmp/jenkins-backup-${TIMESTAMP}
                    cd /tmp/jenkins-backup-${TIMESTAMP}

                    # Export configuration via JCasC (the export endpoint requires POST;
                    # -f makes curl exit non-zero on an HTTP error)
                    curl -sf -X POST -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                        ${JENKINS_URL}/configuration-as-code/export \
                        > jenkins-config.yaml

                    # List installed plugins with versions
                    curl -sf -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                        "${JENKINS_URL}/pluginManager/api/json?depth=1" | \
                        jq -r '.plugins[] | "\\(.shortName):\\(.version)"' | \
                        sort > plugins.txt
                '''
            }
        }

        stage('Backup Job Configurations') {
            steps {
                sh '''
                    cd /tmp/jenkins-backup-${TIMESTAMP}

                    # Backup all job configs (without build history)
                    mkdir -p jobs
                    find ${JENKINS_HOME}/jobs -name "config.xml" -type f | while read config; do
                        # Get relative path
                        relpath=$(echo "$config" | sed "s|${JENKINS_HOME}/jobs/||")
                        mkdir -p "jobs/$(dirname "$relpath")"
                        cp "$config" "jobs/$relpath"
                    done

                    # Backup nodes configuration
                    mkdir -p nodes
                    cp -r ${JENKINS_HOME}/nodes/* nodes/ 2>/dev/null || true

                    # Backup views (top-level views are stored in the global config.xml, not in separate files)
                    mkdir -p views
                    cp ${JENKINS_HOME}/config.xml views/config.xml
                '''
            }
        }

        stage('Backup Credentials (Encrypted)') {
            steps {
                withCredentials([string(credentialsId: 'backup-encryption-key', variable: 'ENCRYPTION_KEY')]) {
                    sh '''
                        cd /tmp/jenkins-backup-${TIMESTAMP}

                        # Backup secrets directory (encrypted).
                        # -C keeps archive paths relative so it unpacks directly into JENKINS_HOME;
                        # -pass env: keeps the key off the process command line.
                        tar -czf - -C ${JENKINS_HOME} secrets | \
                            openssl enc -aes-256-cbc -salt -pbkdf2 \
                            -pass env:ENCRYPTION_KEY \
                            > secrets.tar.gz.enc

                        # Backup credentials.xml (encrypted)
                        openssl enc -aes-256-cbc -salt -pbkdf2 \
                            -pass env:ENCRYPTION_KEY \
                            -in ${JENKINS_HOME}/credentials.xml \
                            -out credentials.xml.enc
                    '''
                }
            }
        }

        stage('Backup Build History') {
            when {
                // Only full backup on weekends
                expression { return new Date().format('u') in ['6', '7'] }
            }
            steps {
                sh '''
                    cd /tmp/jenkins-backup-${TIMESTAMP}

                    # Backup last 10 builds per job (excluding workspaces)
                    mkdir -p build-history
                    find ${JENKINS_HOME}/jobs -type d -name "builds" | while read builds_dir; do
                        job_name=$(echo "$builds_dir" | sed "s|${JENKINS_HOME}/jobs/||" | sed 's|/builds||')
                        mkdir -p "build-history/$job_name"

                        # Copy last 10 builds
                        ls -1t "$builds_dir" | head -10 | while read build; do
                            if [ -d "$builds_dir/$build" ]; then
                                cp -r "$builds_dir/$build" "build-history/$job_name/" 2>/dev/null || true
                            fi
                        done
                    done
                '''
            }
        }

        stage('Upload to S3') {
            steps {
                withCredentials([[$class: 'AmazonWebServicesCredentialsBinding',
                    credentialsId: 'aws-backup-credentials',
                    accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                    secretKeyVariable: 'AWS_SECRET_ACCESS_KEY']]) {
                    sh '''
                        cd /tmp

                        # Create compressed archive
                        tar -czf jenkins-backup-${TIMESTAMP}.tar.gz jenkins-backup-${TIMESTAMP}

                        # Upload to S3
                        aws s3 cp jenkins-backup-${TIMESTAMP}.tar.gz \
                            ${BACKUP_BUCKET}/daily/${TIMESTAMP}/

                        # Keep the 30 most recent backups (about 5 days at one backup every 4 hours)
                        aws s3 ls ${BACKUP_BUCKET}/daily/ | sort | head -n -30 | \
                            awk '{print $2}' | while read old_backup; do
                                aws s3 rm ${BACKUP_BUCKET}/daily/$old_backup --recursive
                            done

                        # Cleanup local
                        rm -rf jenkins-backup-${TIMESTAMP} jenkins-backup-${TIMESTAMP}.tar.gz
                    '''
                }
            }
        }

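        stage('Verify Backup') {
            steps {
                // AWS credentials must be bound here too; bindings from the upload stage do not carry over
                withCredentials([[$class: 'AmazonWebServicesCredentialsBinding',
                    credentialsId: 'aws-backup-credentials',
                    accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                    secretKeyVariable: 'AWS_SECRET_ACCESS_KEY']]) {
                    // Download the archive and check that it can be read end to end
                    sh '''
                        cd /tmp
                        aws s3 cp ${BACKUP_BUCKET}/daily/${TIMESTAMP}/jenkins-backup-${TIMESTAMP}.tar.gz .
                        tar -tzf jenkins-backup-${TIMESTAMP}.tar.gz > /dev/null
                        echo "Backup verification: SUCCESS"
                        rm jenkins-backup-${TIMESTAMP}.tar.gz
                    '''
                }
            }
        }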
    }

    post {
        success {
            slackSend(
                channel: '#jenkins-ops',
                color: 'good',
                message: "Jenkins backup completed: ${TIMESTAMP}"
            )
        }
        failure {
            slackSend(
                channel: '#jenkins-alerts',
                color: 'danger',
                message: "Jenkins backup FAILED! Immediate attention required."
            )
        }
    }
}
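
The retention step in the upload stage relies on GNU `head -n -30`, which is not available everywhere. As a portable alternative, the selection logic can be isolated in a small function that is easy to test without touching S3. This is a sketch; `backups_to_delete` is a hypothetical name, and it assumes prefixes use the `YYYYMMDD-HHMMSS` format from the pipeline above.

```shell
#!/bin/sh
# backups_to_delete KEEP
# Reads one backup prefix per line on stdin (e.g. "20240115-120000/") and
# prints every prefix except the KEEP newest ones. This works because the
# YYYYMMDD-HHMMSS format sorts chronologically as plain text.
backups_to_delete() {
    keep=$1
    sort | awk -v keep="$keep" '
        { lines[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print lines[i] }
    '
}
```

In the pipeline it would slot in as `aws s3 ls ${BACKUP_BUCKET}/daily/ | awk '{print $2}' | backups_to_delete 30`, with the existing `aws s3 rm` loop consuming its output.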

Step 4: Disaster Recovery Runbook

// restore-jenkins.groovy - Run from recovery environment
pipeline {
    agent any

    parameters {
        string(
            name: 'BACKUP_TIMESTAMP',
            description: 'Backup timestamp to restore (e.g., 20240115-120000)'
        )
        booleanParam(
            name: 'RESTORE_CREDENTIALS',
            defaultValue: true,
            description: 'Restore encrypted credentials'
        )
        booleanParam(
            name: 'RESTORE_BUILD_HISTORY',
            defaultValue: false,
            description: 'Restore build history (takes longer)'
        )
    }

    environment {
        BACKUP_BUCKET = 's3://jenkins-backups-company'
        NEW_JENKINS_HOME = '/var/jenkins_home'
    }

    stages {
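        stage('Download Backup') {
            steps {
                // Assumes the AWS credential used by the backup job also exists on the recovery instance
                withCredentials([[$class: 'AmazonWebServicesCredentialsBinding',
                    credentialsId: 'aws-backup-credentials',
                    accessKeyVariable: 'AWS_ACCESS_KEY_ID',
                    secretKeyVariable: 'AWS_SECRET_ACCESS_KEY']]) {
                    sh '''
                        mkdir -p /tmp/jenkins-restore
                        cd /tmp/jenkins-restore

                        aws s3 cp ${BACKUP_BUCKET}/daily/${BACKUP_TIMESTAMP}/jenkins-backup-${BACKUP_TIMESTAMP}.tar.gz .
                        tar -xzf jenkins-backup-${BACKUP_TIMESTAMP}.tar.gz
                    '''
                }
            }
        }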

        stage('Stop Jenkins') {
            steps {
                sh '''
                    # Stop Jenkins gracefully
                    curl -X POST -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                        "${JENKINS_URL}/safeExit" || true

                    # Wait for shutdown
                    sleep 30
                '''
            }
        }

        stage('Restore Configuration') {
            steps {
                sh '''
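                    cd /tmp/jenkins-restore/jenkins-backup-${BACKUP_TIMESTAMP}

                    # Create target directories in case JENKINS_HOME is freshly provisioned
                    mkdir -p ${NEW_JENKINS_HOME}/casc_configs ${NEW_JENKINS_HOME}/jobs ${NEW_JENKINS_HOME}/nodes

                    # Restore JCasC configuration (CASC_JENKINS_CONFIG must point at this directory)
                    cp jenkins-config.yaml ${NEW_JENKINS_HOME}/casc_configs/

                    # Restore job configurations
                    cp -r jobs/* ${NEW_JENKINS_HOME}/jobs/

                    # Restore nodes
                    cp -r nodes/* ${NEW_JENKINS_HOME}/nodes/ 2>/dev/null || true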
                '''
            }
        }

        stage('Restore Credentials') {
            when {
                expression { params.RESTORE_CREDENTIALS }
            }
            steps {
                withCredentials([string(credentialsId: 'backup-encryption-key', variable: 'ENCRYPTION_KEY')]) {
                    sh '''
                        cd /tmp/jenkins-restore/jenkins-backup-${BACKUP_TIMESTAMP}

                        # Restore secrets (the key is read from the environment, not the command line)
                        openssl enc -aes-256-cbc -d -pbkdf2 \
                            -pass env:ENCRYPTION_KEY \
                            -in secrets.tar.gz.enc | tar -xzf - -C ${NEW_JENKINS_HOME}/

                        # Restore credentials.xml
                        openssl enc -aes-256-cbc -d -pbkdf2 \
                            -pass env:ENCRYPTION_KEY \
                            -in credentials.xml.enc \
                            -out ${NEW_JENKINS_HOME}/credentials.xml
                    '''
                }
            }
        }

        stage('Install Plugins') {
            steps {
                sh '''
                    cd /tmp/jenkins-restore/jenkins-backup-${BACKUP_TIMESTAMP}

                    # Install plugins from list
                    jenkins-plugin-cli --plugin-file plugins.txt
                '''
            }
        }

        stage('Start Jenkins') {
            steps {
                sh '''
                    # Start Jenkins
                    systemctl start jenkins

                    # Wait for startup (-f makes curl fail on the 503 Jenkins returns while it is still starting)
                    timeout 300 bash -c 'until curl -sf -o /dev/null ${JENKINS_URL}/login; do sleep 5; done'

                    echo "Jenkins is up and running"
                '''
            }
        }

        stage('Verify Restoration') {
            steps {
                script {
                    // Verify jobs are present.
                    // The script is single-quoted so the shell, not Groovy, expands the credentials;
                    // -g stops curl from treating [name] as a glob and -f fails on HTTP errors.
                    def jobCount = sh(
                        script: '''
                            curl -sfg -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                                "${JENKINS_URL}/api/json?tree=jobs[name]" | \
                                jq '.jobs | length'
                        ''',
                        returnStdout: true
                    ).trim().toInteger()

                    echo "Restored ${jobCount} jobs"

                    if (jobCount < 100) {  // Expected minimum
                        error "Job count lower than expected!"
                    }

                    // Test a known credential (-f turns an HTTP error such as 404 into a non-zero exit)
                    def credTest = sh(
                        script: '''
                            curl -sf -o /dev/null -u ${JENKINS_USER}:${JENKINS_TOKEN} \
                                "${JENKINS_URL}/credentials/store/system/domain/_/credential/github-credentials/api/json"
                        ''',
                        returnStatus: true
                    )

                    if (credTest != 0) {
                        error "Credential restoration verification failed!"
                    }

                    echo "Restoration verified successfully"
                }
            }
        }
    }

    post {
        success {
            slackSend(
                channel: '#jenkins-alerts',
                color: 'good',
                message: """
                    *Jenkins Restored Successfully*
                    Backup: ${params.BACKUP_TIMESTAMP}
                    Time: ${currentBuild.durationString}
                """
            )
        }
        always {
            sh 'rm -rf /tmp/jenkins-restore'
        }
    }
}
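
The restore pipeline expects an explicit `BACKUP_TIMESTAMP`. When the caller wants the most recent backup instead, one option is to resolve it from the bucket listing first. The sketch below assumes the listing format that `aws s3 ls` prints for prefixes (lines such as `PRE 20240115-120000/`); `latest_backup` is a hypothetical helper name.

```shell
#!/bin/sh
# latest_backup
# Reads `aws s3 ls <bucket>/daily/` style output on stdin, where each backup
# appears as a line such as "    PRE 20240115-120000/", and prints the newest
# timestamp without the trailing slash. Prints nothing if no backup is listed.
latest_backup() {
    awk '$1 == "PRE" { sub(/\/$/, "", $2); print $2 }' | sort | tail -n 1
}
```

A wrapper could then run `BACKUP_TIMESTAMP=$(aws s3 ls "${BACKUP_BUCKET}/daily/" | latest_backup)` before starting the download stage.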

Step 5: High Availability Setup (Prevention)

# kubernetes/jenkins-ha.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: jenkins
  namespace: jenkins
spec:
  serviceName: jenkins
  replicas: 1  # Single instance; Kubernetes reschedules the pod and reattaches the volume after a failure
  selector:
    matchLabels:
      app: jenkins
  template:
    metadata:
      labels:
        app: jenkins  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: jenkins
          image: jenkins/jenkins:lts-jdk17
          volumeMounts:
            - name: jenkins-home
              mountPath: /var/jenkins_home
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
            limits:
              memory: "8Gi"
              cpu: "4"
  volumeClaimTemplates:
    - metadata:
        name: jenkins-home
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: gp3  # General-purpose SSD (assumes a StorageClass named gp3 exists)
        resources:
          requests:
            storage: 100Gi
---
# Use EBS snapshots for point-in-time recovery
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: jenkins-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Retain
parameters:
  tagSpecification_1: "Environment=production"
---
# Automated daily snapshots
apiVersion: batch/v1
kind: CronJob
metadata:
  name: jenkins-volume-snapshot
  namespace: jenkins
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          # The pod needs a ServiceAccount with RBAC permission to create VolumeSnapshots
          containers:
            - name: snapshot-creator
              image: bitnami/kubectl
              command:
                - /bin/sh
                - -c
                - |
                  kubectl apply -f - <<EOF
                  apiVersion: snapshot.storage.k8s.io/v1
                  kind: VolumeSnapshot
                  metadata:
                    name: jenkins-home-$(date +%Y%m%d)
                    namespace: jenkins
                  spec:
                    volumeSnapshotClassName: jenkins-snapshot-class
                    source:
                      persistentVolumeClaimName: jenkins-home-jenkins-0
                  EOF
          restartPolicy: OnFailure
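
The manifests above create snapshots but do not show how to restore from one. A possible approach is to recreate the PersistentVolumeClaim with the snapshot as its `dataSource`. The sketch below only renders the manifest; `render_restore_pvc` is a hypothetical helper, and the claim name, storage class, and size are copied from the StatefulSet above.

```shell
#!/bin/sh
# render_restore_pvc SNAPSHOT_NAME
# Prints a PVC manifest that provisions a new volume from the given
# VolumeSnapshot. The claim keeps the name the StatefulSet expects
# (jenkins-home-jenkins-0) so the pod binds to it when it starts.
render_restore_pvc() {
    snapshot=$1
    cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-home-jenkins-0
  namespace: jenkins
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
  dataSource:
    name: ${snapshot}
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
EOF
}
```

After scaling the StatefulSet to zero and deleting the damaged claim, the output could be applied with `render_restore_pvc jenkins-home-20240115 | kubectl apply -f -`, where the snapshot name is an example following the CronJob's naming pattern.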

Step 6: Regular Recovery Testing

// test-disaster-recovery.groovy - Run monthly
pipeline {
    agent any

    triggers {
        cron('0 3 1 * *')  // First of every month at 3 AM
    }

    stages {
        stage('Create Test Environment') {
            steps {
                sh '''
                    # Spin up temporary Jenkins instance
                    docker run -d --name jenkins-dr-test \
                        -p 9090:8080 \
                        jenkins/jenkins:lts-jdk17
                '''
            }
        }

        stage('Restore to Test Environment') {
            steps {
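                // Use the restore pipeline on the test instance.
                // Assumes restore-jenkins accepts a TARGET_INSTANCE string parameter
                // and resolves 'latest' to the newest backup timestamp.
                build(
                    job: 'restore-jenkins',
                    parameters: [
                        string(name: 'BACKUP_TIMESTAMP', value: 'latest'),
                        string(name: 'TARGET_INSTANCE', value: 'test')
                    ]
                )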
            }
        }

        stage('Validate Restoration') {
            steps {
                script {
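                    // Run comprehensive tests; each command must exit non-zero on failure
                    // (-f fails on HTTP errors, jq -e fails when the expression is false)
                    def tests = [
                        'Jobs present': 'curl -sf localhost:9090/api/json | jq -e ".jobs | length > 0"',
                        'Plugins loaded': 'curl -sf localhost:9090/pluginManager/api/json | jq -e ".plugins | length > 0"',
                        'Credentials accessible': 'curl -sf -o /dev/null localhost:9090/credentials/api/json'
                    ]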

                    tests.each { name, command ->
                        def result = sh(script: command, returnStatus: true)
                        if (result != 0) {
                            error "DR Test Failed: ${name}"
                        }
                    }
                }
            }
        }
    }

    post {
        always {
            sh 'docker rm -f jenkins-dr-test || true'
        }
        success {
            emailext(
                to: 'ops-team@company.com',
                subject: 'Jenkins DR Test: PASSED',
                body: 'Monthly disaster recovery test completed successfully.'
            )
        }
        failure {
            emailext(
                to: 'ops-team@company.com',
                subject: 'Jenkins DR Test: FAILED',
                body: 'Monthly disaster recovery test FAILED. Immediate attention required.'
            )
        }
    }
}
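
The test pipeline starts a container and moves straight on to the restore, so validation can run before the instance is ready. A small retry helper is one way to wait for readiness. This is a sketch; `wait_for` is a hypothetical name and the attempt count and delay are up to the caller.

```shell
#!/bin/sh
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds, at most ATTEMPTS times, sleeping
# DELAY_SECONDS between tries. Returns 0 on the first success and
# 1 if every attempt fails.
wait_for() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -ge "$attempts" ]; then
            return 1
        fi
        i=$((i + 1))
        sleep "$delay"
    done
}
```

For the test instance above this might be called as `wait_for 60 5 curl -sf -o /dev/null http://localhost:9090/login`, which allows up to five minutes for startup.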

Disaster Recovery Checklist

Component         Backup Method   Frequency       Restore Time
Job configs       Git + S3        Every 4 hours   5 minutes
Credentials       Encrypted S3    Every 4 hours   2 minutes
Plugins list      Git             On change       10 minutes
JCasC config      Git             On change       2 minutes
Build history     S3              Weekly          30 minutes
Shared libraries  Git             On change       1 minute
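
As a quick sanity check, the restore times of the components classified as critical in Step 1 can be added up and compared with the 30-minute RTO. The sketch below assumes the steps run one after another, which is the worst case; `total_minutes` is a hypothetical helper.

```shell
#!/bin/sh
# total_minutes N1 N2 ...
# Prints the sum of its integer arguments.
total_minutes() {
    total=0
    for m in "$@"; do
        total=$((total + m))
    done
    echo "$total"
}

# Critical components from the checklist: job configs (5), credentials (2),
# plugins list (10), JCasC config (2), shared libraries (1).
critical_total=$(total_minutes 5 2 10 2 1)
echo "Critical restore time: ${critical_total} minutes"   # prints 20
```

At 20 minutes the critical set fits within the 30-minute RTO with some margin, while the 30-minute build history restore belongs to the "important" tier with its 4-hour target.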

Practice Question

Why should Jenkins credentials be backed up separately with encryption, rather than included in the standard JENKINS_HOME backup?