
Your production cluster went down. Walk through your disaster recovery and backup/restore strategy.


The Scenario

You’re the Infrastructure Architect at a healthcare SaaS company. Your production Kubernetes cluster hosts critical patient data and healthcare provider applications serving 500+ hospitals.

Business requirements from the CEO:

  • RTO (Recovery Time Objective): < 1 hour - System must be back online within 1 hour of disaster
  • RPO (Recovery Point Objective): < 15 minutes - Maximum acceptable data loss is 15 minutes
  • Compliance: HIPAA-compliant - All backups must be encrypted
  • Multi-region: Failover to secondary region if primary region fails
  • Testing: DR plan must be tested quarterly

What counts as a “disaster”:

  1. Entire AWS region outage (rare but happened: us-east-1 in 2017, 2021)
  2. Kubernetes cluster corruption (etcd data loss, control plane failure)
  3. Ransomware attack (malicious deletion of resources, data encryption)
  4. Accidental deletion (developer runs kubectl delete namespace production)
  5. Data center fire/natural disaster (hurricane, earthquake, flood)

Last week, during a routine upgrade, someone accidentally ran:

kubectl delete namespace production --force

Everything was deleted:

  • 50 microservices
  • 200 GB of persistent volume data
  • ConfigMaps, Secrets, RBAC policies
  • Ingress rules, Network Policies

Your CTO asks: “How quickly can we recover?”

Currently, you don’t have a good answer. Your job is to design a complete disaster recovery plan.

The Challenge

Design a comprehensive disaster recovery strategy that includes:

  1. Backup strategy: What to back up and how often
  2. Storage location: Where to store backups (encryption, geo-redundancy)
  3. Automated backup: CI/CD integration and scheduling
  4. Recovery procedures: Step-by-step restoration process
  5. Failover architecture: Multi-region active-passive setup
  6. Testing plan: Quarterly DR drills

Show complete configurations, tools (Velero, etcd backup), and runbooks.

How Different Experience Levels Approach This

Junior Approach: Basic Backups Without Comprehensive Planning

The junior approach uses weekly kubectl backups:

kubectl get all -o yaml > backup.yaml

Problems with this approach:

  • Backup frequency too low (weekly = up to 7 days data loss)
  • kubectl get all doesn’t capture everything (secrets, PVs, RBAC)
  • No automation (manual backups are unreliable)
  • No testing plan (backups might not work when needed)
  • No multi-region failover
  • No etcd backups (cluster state could be lost)

This approach violates both RTO (< 1 hour) and RPO (< 15 minutes) requirements.
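
To see the gap concretely, compare what kubectl get all returns against the resource kinds it silently skips (a quick check you can run against any cluster; the namespace is illustrative):

# "get all" only covers a fixed shortlist (pods, services, deployments, ...)
kubectl get all -n production -o name

# None of these kinds are included in "all", so they'd be missing from the backup
kubectl get secrets,configmaps,persistentvolumeclaims,ingresses,networkpolicies,serviceaccounts,roles,rolebindings -n production -o name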

Senior Approach: Enterprise Disaster Recovery Architecture

This is the pattern financial institutions, healthcare companies, and other large enterprises use for DR. Here’s the complete solution:

Three-Layer DR Strategy

Layer 1: etcd Backup (Kubernetes state)
   ↓ Every 15 minutes

Layer 2: Velero Backup (Resources + Volumes)
   ↓ Hourly incremental, Daily full

Layer 3: Multi-Region Replication
   ↓ Active-Passive setup

Storage: S3 with cross-region replication

Layer 1: etcd Backup (Control Plane State)

etcd stores the entire Kubernetes cluster state. If etcd is lost, the cluster is gone.

Automated etcd backup script:

#!/bin/bash
set -e

ETCD_ENDPOINTS="https://127.0.0.1:2379"
ETCD_CERT="/etc/kubernetes/pki/etcd/server.crt"
ETCD_KEY="/etc/kubernetes/pki/etcd/server.key"
ETCD_CA="/etc/kubernetes/pki/etcd/ca.crt"

BACKUP_DIR="/var/backups/etcd"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/etcd-backup-${TIMESTAMP}.db"

# Create backup directory
mkdir -p ${BACKUP_DIR}

# Create etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save ${BACKUP_FILE} \
  --endpoints=${ETCD_ENDPOINTS} \
  --cacert=${ETCD_CA} \
  --cert=${ETCD_CERT} \
  --key=${ETCD_KEY}

# Verify backup
ETCDCTL_API=3 etcdctl snapshot status ${BACKUP_FILE} -w table

# Upload to S3 with encryption
aws s3 cp ${BACKUP_FILE} \
  s3://company-k8s-backups/etcd/${TIMESTAMP}/ \
  --sse aws:kms \
  --sse-kms-key-id arn:aws:kms:us-east-1:123456789:key/abc-123

# Keep only last 7 days locally
find ${BACKUP_DIR} -type f -name "*.db" -mtime +7 -delete

echo "✅ etcd backup completed: ${BACKUP_FILE}"

Automated etcd backup CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  # Run every 15 minutes (RPO requirement)
  schedule: "*/15 * * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeName: master-node-1  # Pin to a control plane node (nodeName bypasses the scheduler, so the control-plane NoSchedule taint doesn't block this pod)
          containers:
          - name: etcd-backup
            image: company/etcd-backup:v1.0
            command: ["/scripts/backup-etcd.sh"]
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup-dir
              mountPath: /var/backups/etcd
            env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-access-key
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup-dir
            hostPath:
              path: /var/backups/etcd
          restartPolicy: OnFailure

etcd Restore Procedure:

#!/bin/bash
# Restore etcd from backup

BACKUP_FILE="/var/backups/etcd/etcd-backup-20250115-100000.db"
RESTORE_DIR="/var/lib/etcd-restore"

# Stop etcd
systemctl stop etcd

# Restore snapshot
ETCDCTL_API=3 etcdctl snapshot restore ${BACKUP_FILE} \
  --data-dir=${RESTORE_DIR} \
  --name=etcd-restore \
  --initial-cluster=etcd-restore=https://10.0.1.10:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# Update etcd data directory
rm -rf /var/lib/etcd
mv ${RESTORE_DIR} /var/lib/etcd

# Start etcd
systemctl start etcd

echo "✅ etcd restored from ${BACKUP_FILE}"

Layer 2: Velero Backup (Complete Cluster Backup)

Velero backs up all Kubernetes resources (Deployments, Services, ConfigMaps, Secrets, and so on), Persistent Volumes (via volume snapshots or file-system backup), plus Namespaces, RBAC, and Network Policies.

Install Velero:

# 1. Create S3 bucket for backups
aws s3 mb s3://company-velero-backups --region us-east-1

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket company-velero-backups \
  --versioning-configuration Status=Enabled

# Enable encryption
aws s3api put-bucket-encryption \
  --bucket company-velero-backups \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789:key/abc-123"
      }
    }]
  }'

# 2. Create IAM policy for Velero
cat > velero-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::company-velero-backups/*",
        "arn:aws:s3:::company-velero-backups"
      ]
    }
  ]
}
EOF

aws iam create-policy --policy-name VeleroPolicy --policy-document file://velero-policy.json

# 3. Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket company-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --use-node-agent
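
The --secret-file flag above points at an AWS credentials file in standard INI format; a minimal sketch with placeholder keys for a dedicated Velero IAM user:

cat > credentials-velero <<EOF
[default]
aws_access_key_id=<VELERO_ACCESS_KEY_ID>
aws_secret_access_key=<VELERO_SECRET_ACCESS_KEY>
EOF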

Velero Backup Schedules:

---
# Hourly incremental backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-backup
  namespace: velero
spec:
  schedule: "0 * * * *"  # Every hour
  template:
    # Application namespaces to back up
    includedNamespaces:
    - production
    - staging
    # System namespaces are excluded defensively
    excludedNamespaces:
    - kube-system
    - kube-public

    # Back up pod volumes with the node agent
    # (field renamed from defaultVolumesToRestic in Velero 1.10)
    defaultVolumesToFsBackup: true

    # Retention
    ttl: 72h  # Keep hourly backups for 3 days

    # Hooks for app-consistent backups
    hooks:
      resources:
      - name: postgres-backup
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            app: postgres
        pre:
        - exec:
            container: postgres
            command:
            - /bin/bash
            - -c
            - pg_dump -U postgres > /tmp/backup.sql
        post:
        - exec:
            container: postgres
            command:
            - /bin/bash
            - -c
            - rm /tmp/backup.sql

---
# Daily full backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    # Backup everything including cluster resources
    includedResources:
    - '*'
    includeClusterResources: true

    defaultVolumesToFsBackup: true  # node-agent file-system backup (Velero 1.10+)
    ttl: 720h  # Keep daily backups for 30 days

---
# Weekly compliance backup (long-term retention)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3 AM every Sunday
  template:
    includedResources:
    - '*'
    includeClusterResources: true
    defaultVolumesToFsBackup: true
    ttl: 8760h  # Keep weekly backups for 1 year

    # Store in separate long-term retention bucket
    storageLocation: long-term-storage
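
The long-term-storage location referenced above must exist as its own BackupStorageLocation resource; a minimal sketch, assuming a separate bucket for long-term retention:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: long-term-storage
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: company-velero-backups-longterm  # hypothetical long-term bucket
  config:
    region: us-east-1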

Velero Restore Procedure:

# 1. List available backups
velero backup get

NAME                STATUS      CREATED                         EXPIRES
hourly-backup-001   Completed   2025-01-15 10:00:00 +0000 UTC   3d
daily-backup-001    Completed   2025-01-15 02:00:00 +0000 UTC   30d

# 2. Restore from specific backup
velero restore create restore-prod-20250115 \
  --from-backup daily-backup-001 \
  --wait

# 3. Check restore status
velero restore describe restore-prod-20250115

# 4. Verify resources are restored
kubectl get all -n production
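
If the restore finishes with warnings (for example, resources that already existed and were skipped), the restore logs show exactly what happened:

# Inspect warnings/errors from the restore
velero restore logs restore-prod-20250115 | grep -iE "error|warn"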

Layer 3: Multi-Region Active-Passive Architecture

┌─────────────────────────────────────────────────────────┐
│ Primary Region (us-east-1)                              │
│  ├── Production Cluster (Active)                        │
│  ├── RDS Multi-AZ (Primary)                             │
│  └── S3 Bucket (Velero backups)                         │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ↓ Cross-region replication
┌─────────────────────────────────────────────────────────┐
│ Secondary Region (us-west-2)                            │
│  ├── Standby Cluster (Passive - Ready to activate)      │
│  ├── RDS Read Replica (Promoted to primary on failover) │
│  └── S3 Bucket (Replicated Velero backups)              │
└─────────────────────────────────────────────────────────┘

Cross-Region S3 Replication:

# Enable cross-region replication
aws s3api put-bucket-replication \
  --bucket company-velero-backups \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789:role/S3ReplicationRole",
    "Rules": [{
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "Destination": {
        "Bucket": "arn:aws:s3:::company-velero-backups-dr",
        "ReplicationTime": {
          "Status": "Enabled",
          "Time": {
            "Minutes": 15
          }
        },
        "Metrics": {
          "Status": "Enabled",
          "EventThreshold": {
            "Minutes": 15
          }
        }
      }
    }]
  }'
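
Replication requires the destination bucket to already exist in the DR region with versioning enabled (the -dr bucket name matches the configuration above):

# Create the DR bucket and enable versioning (a replication prerequisite)
aws s3 mb s3://company-velero-backups-dr --region us-west-2

aws s3api put-bucket-versioning \
  --bucket company-velero-backups-dr \
  --versioning-configuration Status=Enabled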

Multi-Region Database Replication (using Terraform):

# AWS RDS with cross-region read replica
resource "aws_db_instance" "primary" {
  identifier     = "production-db"
  engine         = "postgres"
  instance_class = "db.r5.2xlarge"

  # Multi-AZ for high availability
  multi_az = true

  # Enable automated backups
  backup_retention_period = 30
  backup_window          = "03:00-04:00"

  # Export PostgreSQL logs to CloudWatch (point-in-time recovery is
  # already provided by the automated backups above)
  enabled_cloudwatch_logs_exports = ["postgresql"]

  # Encryption
  storage_encrypted = true
  kms_key_id       = "arn:aws:kms:us-east-1:123456789:key/abc-123"
}

# Cross-region read replica for DR
resource "aws_db_instance" "replica" {
  identifier             = "production-db-replica"
  replicate_source_db    = aws_db_instance.primary.arn
  instance_class         = "db.r5.2xlarge"

  # Different region
  provider = aws.us-west-2

  # Can be promoted to standalone on failover
  backup_retention_period = 30
  storage_encrypted       = true

  # Encrypted cross-region replicas need a KMS key in the destination
  # region (this key ARN is a placeholder)
  kms_key_id = "arn:aws:kms:us-west-2:123456789:key/def-456"
}
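
The replica resource references a provider alias for the DR region, which has to be declared once in the configuration; a minimal sketch:

# Provider alias assumed by the replica resource above
provider "aws" {
  alias  = "us-west-2"
  region = "us-west-2"
}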

Disaster Recovery Runbooks

Scenario 1: Accidental Namespace Deletion

# INCIDENT: Someone ran kubectl delete namespace production

# Step 1: Identify the backup to restore from
velero backup get | grep production
hourly-backup-20250115-1400  Completed  15m ago

# Step 2: Restore the namespace
velero restore create prod-restore-ns \
  --from-backup hourly-backup-20250115-1400 \
  --include-namespaces production \
  --wait

# Step 3: Verify restoration
kubectl get all -n production
kubectl get pvc -n production

# Step 4: Verify application health
kubectl get pods -n production
curl https://api.company.com/health

# Recovery Time: ~10 minutes
# Data Loss: ~15 minutes (last backup)

Scenario 2: Complete Cluster Failure

# INCIDENT: Control plane nodes failed, etcd corrupted

# Step 1: Provision new cluster
eksctl create cluster -f cluster-config.yaml

# Step 2: Install Velero on new cluster
velero install --provider aws --bucket company-velero-backups ...

# Step 3: Restore from latest backup
velero restore create full-cluster-restore \
  --from-backup daily-backup-20250115 \
  --wait

# Step 4: Update DNS to point to new cluster
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.company.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z2FDTNDATAQYW2",
          "DNSName": "new-cluster-lb.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'

# Recovery Time: ~45 minutes
# Data Loss: ~1 hour if the latest hourly backup is used; restoring
# from a daily backup (as shown) risks up to 24 hours of loss

Scenario 3: Region Failure (Failover to DR Region)

# INCIDENT: Entire us-east-1 region is down

# Step 1: Promote RDS read replica in us-west-2 to primary
aws rds promote-read-replica \
  --db-instance-identifier production-db-replica \
  --region us-west-2

# Step 2: Scale up standby cluster in us-west-2
# (run kubectl against the DR cluster's kubeconfig context)
kubectl scale deployment --all --replicas=10 -n production

# Step 3: Update DNS to point to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.company.com",
        "Type": "A",
        "AliasTarget": {
          "DNSName": "dr-cluster-lb.us-west-2.elb.amazonaws.com"
        }
      }
    }]
  }'

# Step 4: Verify failover
curl https://api.company.com/health

# Recovery Time: ~30 minutes
# Data Loss: ~15 minutes (RDS replication lag)

Quarterly DR Testing Plan

# DR Test Checklist (Run every quarter)

Week 1: Test etcd Restore
  - Provision test cluster
  - Restore from etcd backup
  - Verify cluster state matches production
  - Document time taken

Week 2: Test Velero Namespace Restore
  - Delete test namespace
  - Restore from Velero backup
  - Verify all resources restored
  - Check PV data integrity

Week 3: Test Full Cluster Recovery
  - Provision new test cluster
  - Restore complete cluster from Velero
  - Run smoke tests
  - Measure RTO (should be < 1 hour)

Week 4: Test Region Failover
  - Simulate region failure
  - Failover to DR region
  - Promote RDS replica
  - Update DNS
  - Verify application functionality
  - Measure RTO and RPO
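
A simple way to capture RTO numbers during these drills is to time the restore itself (the backup name comes from the earlier listing; the dr-test namespace is illustrative):

# Time a namespace-restore drill and record the result
START=$(date +%s)
velero restore create drill-$(date +%Y%m%d) \
  --from-backup daily-backup-001 \
  --include-namespaces dr-test \
  --wait
echo "Restore completed in $(( $(date +%s) - START )) seconds"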

Monitoring and Alerting

# Prometheus alerts for backup failures
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring
spec:
  groups:
  - name: disaster-recovery
    rules:
    - alert: VeleroBackupFailed
      expr: |
        velero_backup_failure_total > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Velero backup failed"
        description: "Backup schedule {{ $labels.schedule }} failed. Check Velero logs."

    - alert: EtcdBackupMissing
      expr: |
        time() - etcd_backup_last_success_timestamp > 1800
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "etcd backup not taken in 30+ minutes"
        description: "Last successful backup was > 30 minutes ago"

    - alert: S3ReplicationLag
      expr: |
        aws_s3_replication_lag_seconds > 900
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "S3 cross-region replication lagging"
        description: "Replication lag is > 15 minutes (RPO violation)"
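
The etcd_backup_last_success_timestamp metric above is not built in; the backup script has to publish it. One option, assuming a Prometheus Pushgateway reachable at pushgateway.monitoring:9091, is to push the timestamp after a successful S3 upload:

# Publish the success timestamp so the EtcdBackupMissing alert can fire
cat <<EOF | curl --silent --data-binary @- \
  http://pushgateway.monitoring:9091/metrics/job/etcd-backup
etcd_backup_last_success_timestamp $(date +%s)
EOF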

Cost Analysis

Monthly DR Costs:

etcd Backups:
- S3 storage (96 backups/day * 100 MB * 30 days): ~$7/month
- S3 cross-region replication: ~$3/month

Velero Backups:
- S3 storage (hourly 500 GB * 72 hours): ~$40/month
- S3 storage (daily 2 TB * 30 days): ~$120/month
- EBS snapshots (100 volumes * 100 GB * $0.05): ~$500/month

Multi-Region Setup:
- Standby cluster (10% capacity): ~$500/month
- RDS read replica: ~$800/month
- Cross-region data transfer: ~$200/month

Total DR Cost: ~$2,170/month

Cost of 1 hour outage (healthcare SaaS):
- Revenue loss: ~$50,000
- SLA penalties: ~$20,000
- Reputation damage: Priceless

ROI: DR costs ~$26K/year but prevents $70K+ in losses per incident.

What Makes the Difference?
  • Context over facts: Explains when and why, not just what
  • Real examples: Provides specific use cases from production experience
  • Trade-offs: Acknowledges pros, cons, and decision factors

Practice Question

Your Velero backup completed successfully at 2:00 PM. At 2:30 PM, a developer accidentally deletes the production namespace. You restore from the 2:00 PM backup. What is the RPO and RTO?