
Your production cluster went down. Walk through your disaster recovery and backup/restore strategy.


The Scenario

You’re the Infrastructure Architect at a healthcare SaaS company. Your production Kubernetes cluster hosts critical patient data and healthcare provider applications serving 500+ hospitals.

Business requirements from the CEO:

  • RTO (Recovery Time Objective): < 1 hour - System must be back online within 1 hour of disaster
  • RPO (Recovery Point Objective): < 15 minutes - Maximum acceptable data loss is 15 minutes
  • Compliance: HIPAA-compliant - All backups must be encrypted
  • Multi-region: Failover to secondary region if primary region fails
  • Testing: DR plan must be tested quarterly

What counts as a “disaster”:

  1. Entire AWS region outage (rare but happened: us-east-1 in 2017, 2021)
  2. Kubernetes cluster corruption (etcd data loss, control plane failure)
  3. Ransomware attack (malicious deletion of resources, data encryption)
  4. Accidental deletion (developer runs kubectl delete namespace production)
  5. Data center fire/natural disaster (hurricane, earthquake, flood)

Last week, during a routine upgrade, someone accidentally ran:

kubectl delete namespace production --force

Everything was deleted:

  • 50 microservices
  • 200 GB of persistent volume data
  • ConfigMaps, Secrets, RBAC policies
  • Ingress rules, Network Policies

Your CTO asks: “How quickly can we recover?”

Currently, you don’t have a good answer. Your job is to design a complete disaster recovery plan.

The Challenge

Design a comprehensive disaster recovery strategy that includes:

  1. Backup strategy: What to back up and how often
  2. Storage location: Where to store backups (encryption, geo-redundancy)
  3. Automated backup: CI/CD integration and scheduling
  4. Recovery procedures: Step-by-step restoration process
  5. Failover architecture: Multi-region active-passive setup
  6. Testing plan: Quarterly DR drills

Show complete configurations, tools (Velero, etcd backup), and runbooks.

How Different Experience Levels Approach This

Junior Approach: Basic Backups Without Comprehensive Planning

The junior approach uses weekly kubectl backups:

kubectl get all -o yaml > backup.yaml

Problems with this approach:

  • Backup frequency too low (weekly = up to 7 days data loss)
  • kubectl get all doesn’t capture everything (secrets, PVs, RBAC)
  • No automation (manual backups are unreliable)
  • No testing plan (backups might not work when needed)
  • No multi-region failover
  • No etcd backups (cluster state could be lost)

This approach violates both RTO (< 1 hour) and RPO (< 15 minutes) requirements.
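
To see the gap concretely, compare what kubectl get all returns against the resource kinds it silently skips (a quick check you can run against any cluster; the namespace is illustrative):

# "get all" only covers a fixed shortlist (pods, services, deployments, ...)
kubectl get all -n production -o name

# None of these kinds are included in "all", so they'd be missing from the backup
kubectl get secrets,configmaps,persistentvolumeclaims,ingresses,networkpolicies,serviceaccounts,roles,rolebindings -n production -o name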

Senior Approach: Enterprise Disaster Recovery Architecture

This is the pattern financial institutions, healthcare companies, and other large enterprises use for DR. Here’s the complete solution:

Three-Layer DR Strategy

Layer 1: etcd Backup (Kubernetes state)
   ↓ Every 15 minutes

Layer 2: Velero Backup (Resources + Volumes)
   ↓ Hourly incremental, Daily full

Layer 3: Multi-Region Replication
   ↓ Active-Passive setup

Storage: S3 with cross-region replication

Layer 1: etcd Backup (Control Plane State)

etcd stores the entire Kubernetes cluster state. If etcd is lost, the cluster is gone.

Automated etcd backup script:

#!/bin/bash
set -e

ETCD_ENDPOINTS="https://127.0.0.1:2379"
ETCD_CERT="/etc/kubernetes/pki/etcd/server.crt"
ETCD_KEY="/etc/kubernetes/pki/etcd/server.key"
ETCD_CA="/etc/kubernetes/pki/etcd/ca.crt"

BACKUP_DIR="/var/backups/etcd"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/etcd-backup-${TIMESTAMP}.db"

# Create backup directory
mkdir -p ${BACKUP_DIR}

# Create etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save ${BACKUP_FILE} \
  --endpoints=${ETCD_ENDPOINTS} \
  --cacert=${ETCD_CA} \
  --cert=${ETCD_CERT} \
  --key=${ETCD_KEY}

# Verify backup
ETCDCTL_API=3 etcdctl snapshot status ${BACKUP_FILE} -w table

# Upload to S3 with encryption
aws s3 cp ${BACKUP_FILE} \
  s3://company-k8s-backups/etcd/${TIMESTAMP}/ \
  --sse aws:kms \
  --sse-kms-key-id arn:aws:kms:us-east-1:123456789:key/abc-123

# Keep only last 7 days locally
find ${BACKUP_DIR} -type f -name "*.db" -mtime +7 -delete

echo "✅ etcd backup completed: ${BACKUP_FILE}"

Automated etcd backup CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  # Run every 15 minutes (RPO requirement)
  schedule: "*/15 * * * *"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true
          nodeName: master-node-1  # Pin to a control plane node (nodeName bypasses the scheduler, so the control-plane NoSchedule taint doesn't block this pod)
          containers:
          - name: etcd-backup
            image: company/etcd-backup:v1.0
            command: ["/scripts/backup-etcd.sh"]
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/kubernetes/pki/etcd
              readOnly: true
            - name: backup-dir
              mountPath: /var/backups/etcd
            env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-access-key
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/kubernetes/pki/etcd
          - name: backup-dir
            hostPath:
              path: /var/backups/etcd
          restartPolicy: OnFailure

etcd Restore Procedure:

#!/bin/bash
# Restore etcd from backup

BACKUP_FILE="/var/backups/etcd/etcd-backup-20250115-100000.db"
RESTORE_DIR="/var/lib/etcd-restore"

# Stop etcd
systemctl stop etcd

# Restore snapshot
ETCDCTL_API=3 etcdctl snapshot restore ${BACKUP_FILE} \
  --data-dir=${RESTORE_DIR} \
  --name=etcd-restore \
  --initial-cluster=etcd-restore=https://10.0.1.10:2380 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# Update etcd data directory
rm -rf /var/lib/etcd
mv ${RESTORE_DIR} /var/lib/etcd

# Start etcd
systemctl start etcd

echo "✅ etcd restored from ${BACKUP_FILE}"

Layer 2: Velero Backup (Complete Cluster Backup)

Velero backs up all Kubernetes resources (Deployments, Services, ConfigMaps, Secrets, and so on), Persistent Volumes (via volume snapshots or file-system backup), plus Namespaces, RBAC, and Network Policies.

Install Velero:

# 1. Create S3 bucket for backups
aws s3 mb s3://company-velero-backups --region us-east-1

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket company-velero-backups \
  --versioning-configuration Status=Enabled

# Enable encryption
aws s3api put-bucket-encryption \
  --bucket company-velero-backups \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": {
        "SSEAlgorithm": "aws:kms",
        "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789:key/abc-123"
      }
    }]
  }'

# 2. Create IAM policy for Velero
cat > velero-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:PutObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::company-velero-backups/*",
        "arn:aws:s3:::company-velero-backups"
      ]
    }
  ]
}
EOF

aws iam create-policy --policy-name VeleroPolicy --policy-document file://velero-policy.json

# 3. Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket company-velero-backups \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1 \
  --secret-file ./credentials-velero \
  --use-volume-snapshots=true \
  --use-node-agent
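
The --secret-file flag above points at an AWS credentials file in standard INI format; a minimal sketch with placeholder keys for a dedicated Velero IAM user:

cat > credentials-velero <<EOF
[default]
aws_access_key_id=<VELERO_ACCESS_KEY_ID>
aws_secret_access_key=<VELERO_SECRET_ACCESS_KEY>
EOF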

Velero Backup Schedules:

---
# Hourly incremental backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: hourly-backup
  namespace: velero
spec:
  schedule: "0 * * * *"  # Every hour
  template:
    # Application namespaces to back up
    includedNamespaces:
    - production
    - staging
    # System namespaces are excluded defensively
    excludedNamespaces:
    - kube-system
    - kube-public

    # Back up pod volumes with the node agent
    # (field renamed from defaultVolumesToRestic in Velero 1.10)
    defaultVolumesToFsBackup: true

    # Retention
    ttl: 72h  # Keep hourly backups for 3 days

    # Hooks for app-consistent backups
    hooks:
      resources:
      - name: postgres-backup
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            app: postgres
        pre:
        - exec:
            container: postgres
            command:
            - /bin/bash
            - -c
            - pg_dump -U postgres > /tmp/backup.sql
        post:
        - exec:
            container: postgres
            command:
            - /bin/bash
            - -c
            - rm /tmp/backup.sql

---
# Daily full backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    # Backup everything including cluster resources
    includedResources:
    - '*'
    includeClusterResources: true

    defaultVolumesToFsBackup: true  # node-agent file-system backup (Velero 1.10+)
    ttl: 720h  # Keep daily backups for 30 days

---
# Weekly compliance backup (long-term retention)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: weekly-backup
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3 AM every Sunday
  template:
    includedResources:
    - '*'
    includeClusterResources: true
    defaultVolumesToFsBackup: true
    ttl: 8760h  # Keep weekly backups for 1 year

    # Store in separate long-term retention bucket
    storageLocation: long-term-storage
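
The long-term-storage location referenced above must exist as its own BackupStorageLocation resource; a minimal sketch, assuming a separate bucket for long-term retention:

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: long-term-storage
  namespace: velero
spec:
  provider: aws
  objectStorage:
    bucket: company-velero-backups-longterm  # hypothetical long-term bucket
  config:
    region: us-east-1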

Velero Restore Procedure:

# 1. List available backups
velero backup get

NAME                STATUS      CREATED                         EXPIRES
hourly-backup-001   Completed   2025-01-15 10:00:00 +0000 UTC   3d
daily-backup-001    Completed   2025-01-15 02:00:00 +0000 UTC   30d

# 2. Restore from specific backup
velero restore create restore-prod-20250115 \
  --from-backup daily-backup-001 \
  --wait

# 3. Check restore status
velero restore describe restore-prod-20250115

# 4. Verify resources are restored
kubectl get all -n production
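
If the restore finishes with warnings (for example, resources that already existed and were skipped), the restore logs show exactly what happened:

# Inspect warnings/errors from the restore
velero restore logs restore-prod-20250115 | grep -iE "error|warn"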

Layer 3: Multi-Region Active-Passive Architecture

┌─────────────────────────────────────────────────────────┐
│ Primary Region (us-east-1)                              │
│  ├── Production Cluster (Active)                        │
│  ├── RDS Multi-AZ (Primary)                             │
│  └── S3 Bucket (Velero backups)                         │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ↓ Cross-region replication
┌─────────────────────────────────────────────────────────┐
│ Secondary Region (us-west-2)                            │
│  ├── Standby Cluster (Passive - Ready to activate)      │
│  ├── RDS Read Replica (Promoted to primary on failover) │
│  └── S3 Bucket (Replicated Velero backups)              │
└─────────────────────────────────────────────────────────┘

Cross-Region S3 Replication:

# Enable cross-region replication
aws s3api put-bucket-replication \
  --bucket company-velero-backups \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789:role/S3ReplicationRole",
    "Rules": [{
      "Status": "Enabled",
      "Priority": 1,
      "Filter": {},
      "Destination": {
        "Bucket": "arn:aws:s3:::company-velero-backups-dr",
        "ReplicationTime": {
          "Status": "Enabled",
          "Time": {
            "Minutes": 15
          }
        },
        "Metrics": {
          "Status": "Enabled",
          "EventThreshold": {
            "Minutes": 15
          }
        }
      }
    }]
  }'
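
Replication requires the destination bucket to already exist in the DR region with versioning enabled (the -dr bucket name matches the configuration above):

# Create the DR bucket and enable versioning (a replication prerequisite)
aws s3 mb s3://company-velero-backups-dr --region us-west-2

aws s3api put-bucket-versioning \
  --bucket company-velero-backups-dr \
  --versioning-configuration Status=Enabled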

Multi-Region Database Replication (using Terraform):

# AWS RDS with cross-region read replica
resource "aws_db_instance" "primary" {
  identifier     = "production-db"
  engine         = "postgres"
  instance_class = "db.r5.2xlarge"

  # Multi-AZ for high availability
  multi_az = true

  # Enable automated backups
  backup_retention_period = 30
  backup_window          = "03:00-04:00"

  # Export PostgreSQL logs to CloudWatch (point-in-time recovery is
  # already provided by the automated backups above)
  enabled_cloudwatch_logs_exports = ["postgresql"]

  # Encryption
  storage_encrypted = true
  kms_key_id       = "arn:aws:kms:us-east-1:123456789:key/abc-123"
}

# Cross-region read replica for DR
resource "aws_db_instance" "replica" {
  identifier             = "production-db-replica"
  replicate_source_db    = aws_db_instance.primary.arn
  instance_class         = "db.r5.2xlarge"

  # Different region
  provider = aws.us-west-2

  # Can be promoted to standalone on failover
  backup_retention_period = 30
  storage_encrypted       = true

  # Encrypted cross-region replicas need a KMS key in the destination
  # region (this key ARN is a placeholder)
  kms_key_id = "arn:aws:kms:us-west-2:123456789:key/def-456"
}
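
The replica resource references a provider alias for the DR region, which has to be declared once in the configuration; a minimal sketch:

# Provider alias assumed by the replica resource above
provider "aws" {
  alias  = "us-west-2"
  region = "us-west-2"
}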

Disaster Recovery Runbooks

Scenario 1: Accidental Namespace Deletion

# INCIDENT: Someone ran kubectl delete namespace production

# Step 1: Identify the backup to restore from
velero backup get | grep production
hourly-backup-20250115-1400  Completed  15m ago

# Step 2: Restore the namespace
velero restore create prod-restore-ns \
  --from-backup hourly-backup-20250115-1400 \
  --include-namespaces production \
  --wait

# Step 3: Verify restoration
kubectl get all -n production
kubectl get pvc -n production

# Step 4: Verify application health
kubectl get pods -n production
curl https://api.company.com/health

# Recovery Time: ~10 minutes
# Data Loss: ~15 minutes (last backup)

Scenario 2: Complete Cluster Failure

# INCIDENT: Control plane nodes failed, etcd corrupted

# Step 1: Provision new cluster
eksctl create cluster -f cluster-config.yaml

# Step 2: Install Velero on new cluster
velero install --provider aws --bucket company-velero-backups ...

# Step 3: Restore from latest backup
velero restore create full-cluster-restore \
  --from-backup daily-backup-20250115 \
  --wait

# Step 4: Update DNS to point to new cluster
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.company.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z2FDTNDATAQYW2",
          "DNSName": "new-cluster-lb.us-east-1.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'

# Recovery Time: ~45 minutes
# Data Loss: ~1 hour if the latest hourly backup is used; restoring
# from a daily backup (as shown) risks up to 24 hours of loss

Scenario 3: Region Failure (Failover to DR Region)

# INCIDENT: Entire us-east-1 region is down

# Step 1: Promote RDS read replica in us-west-2 to primary
aws rds promote-read-replica \
  --db-instance-identifier production-db-replica \
  --region us-west-2

# Step 2: Scale up standby cluster in us-west-2
# (run kubectl against the DR cluster's kubeconfig context)
kubectl scale deployment --all --replicas=10 -n production

# Step 3: Update DNS to point to DR region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.company.com",
        "Type": "A",
        "AliasTarget": {
          "DNSName": "dr-cluster-lb.us-west-2.elb.amazonaws.com"
        }
      }
    }]
  }'

# Step 4: Verify failover
curl https://api.company.com/health

# Recovery Time: ~30 minutes
# Data Loss: ~15 minutes (RDS replication lag)

Quarterly DR Testing Plan

# DR Test Checklist (Run every quarter)

Week 1: Test etcd Restore
  - Provision test cluster
  - Restore from etcd backup
  - Verify cluster state matches production
  - Document time taken

Week 2: Test Velero Namespace Restore
  - Delete test namespace
  - Restore from Velero backup
  - Verify all resources restored
  - Check PV data integrity

Week 3: Test Full Cluster Recovery
  - Provision new test cluster
  - Restore complete cluster from Velero
  - Run smoke tests
  - Measure RTO (should be < 1 hour)

Week 4: Test Region Failover
  - Simulate region failure
  - Failover to DR region
  - Promote RDS replica
  - Update DNS
  - Verify application functionality
  - Measure RTO and RPO
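
A simple way to capture RTO numbers during these drills is to time the restore itself (the backup name comes from the earlier listing; the dr-test namespace is illustrative):

# Time a namespace-restore drill and record the result
START=$(date +%s)
velero restore create drill-$(date +%Y%m%d) \
  --from-backup daily-backup-001 \
  --include-namespaces dr-test \
  --wait
echo "Restore completed in $(( $(date +%s) - START )) seconds"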

Monitoring and Alerting

# Prometheus alerts for backup failures
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: backup-alerts
  namespace: monitoring
spec:
  groups:
  - name: disaster-recovery
    rules:
    - alert: VeleroBackupFailed
      expr: |
        velero_backup_failure_total > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Velero backup failed"
        description: "Backup schedule {{ $labels.schedule }} failed. Check Velero logs."

    - alert: EtcdBackupMissing
      expr: |
        time() - etcd_backup_last_success_timestamp > 1800
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "etcd backup not taken in 30+ minutes"
        description: "Last successful backup was > 30 minutes ago"

    - alert: S3ReplicationLag
      expr: |
        aws_s3_replication_lag_seconds > 900
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "S3 cross-region replication lagging"
        description: "Replication lag is > 15 minutes (RPO violation)"
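
The etcd_backup_last_success_timestamp metric above is not built in; the backup script has to publish it. One option, assuming a Prometheus Pushgateway reachable at pushgateway.monitoring:9091, is to push the timestamp after a successful S3 upload:

# Publish the success timestamp so the EtcdBackupMissing alert can fire
cat <<EOF | curl --silent --data-binary @- \
  http://pushgateway.monitoring:9091/metrics/job/etcd-backup
etcd_backup_last_success_timestamp $(date +%s)
EOF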

Cost Analysis

Monthly DR Costs:

etcd Backups:
- S3 storage (96 backups/day * 100 MB * 30 days): ~$7/month
- S3 cross-region replication: ~$3/month

Velero Backups:
- S3 storage (hourly 500 GB * 72 hours): ~$40/month
- S3 storage (daily 2 TB * 30 days): ~$120/month
- EBS snapshots (100 volumes * 100 GB * $0.05): ~$500/month

Multi-Region Setup:
- Standby cluster (10% capacity): ~$500/month
- RDS read replica: ~$800/month
- Cross-region data transfer: ~$200/month

Total DR Cost: ~$2,170/month

Cost of 1 hour outage (healthcare SaaS):
- Revenue loss: ~$50,000
- SLA penalties: ~$20,000
- Reputation damage: Priceless

ROI: DR costs ~$26K/year but prevents $70K+ in losses per incident.

What Makes the Difference?
  • Context over facts: Explains when and why, not just what
  • Real examples: Provides specific use cases from production experience
  • Trade-offs: Acknowledges pros, cons, and decision factors

Practice Question

Your Velero backup completed successfully at 2:00 PM. At 2:30 PM, a developer accidentally deletes the production namespace. You restore from the 2:00 PM backup. What is the RPO and RTO?