Questions
Your production cluster went down. Walk through your disaster recovery and backup/restore strategy.
The Scenario
You’re the Infrastructure Architect at a healthcare SaaS company. Your production Kubernetes cluster hosts critical patient data and healthcare provider applications serving 500+ hospitals.
Business requirements from the CEO:
- RTO (Recovery Time Objective): < 1 hour - System must be back online within 1 hour of disaster
- RPO (Recovery Point Objective): < 15 minutes - Maximum acceptable data loss is 15 minutes
- Compliance: HIPAA-compliant - All backups must be encrypted
- Multi-region: Failover to secondary region if primary region fails
- Testing: DR plan must be tested quarterly
What counts as a “disaster”:
- Entire AWS region outage (rare, but it has happened: us-east-1 in 2017 and 2021)
- Kubernetes cluster corruption (etcd data loss, control plane failure)
- Ransomware attack (malicious deletion of resources, data encryption)
- Accidental deletion (developer runs kubectl delete namespace production)
- Data center fire/natural disaster (hurricane, earthquake, flood)
Last week, during a routine upgrade, someone accidentally ran:
kubectl delete namespace production --force
Everything was deleted:
- 50 microservices
- 200 GB of persistent volume data
- ConfigMaps, Secrets, RBAC policies
- Ingress rules, Network Policies
Your CTO asks: “How quickly can we recover?”
Currently, you don’t have a good answer. Your job is to design a complete disaster recovery plan.
The Challenge
Design a comprehensive disaster recovery strategy that includes:
- Backup strategy: What to back up and how often
- Storage location: Where to store backups (encryption, geo-redundancy)
- Automated backup: CI/CD integration and scheduling
- Recovery procedures: Step-by-step restoration process
- Failover architecture: Multi-region active-passive setup
- Testing plan: Quarterly DR drills
Show complete configurations, tools (Velero, etcd backup), and runbooks.
How Different Experience Levels Approach This
- Junior: weekly kubectl get all exports pushed to S3. Backup frequency is far too low (up to 7 days of data loss), kubectl get all misses secrets, PVs, and RBAC, and there is no automation, no testing, no multi-region failover, and no etcd backup — violating both the < 1 hour RTO and the < 15 minute RPO.
- Senior: a three-layer enterprise DR architecture — etcd snapshots every 15 minutes, Velero hourly incremental and daily full backups of resources and volumes, and multi-region active-passive replication — with all backups stored in S3 under cross-region replication, KMS encryption, and versioning, plus automated CronJobs, restore procedures, Prometheus alerting, quarterly drills, and runbooks for the major disaster scenarios.
Junior Approach: Basic Backups Without Comprehensive Planning
The junior approach uses weekly kubectl backups:
kubectl get all -o yaml > backup.yaml

Problems with this approach:
- Backup frequency too low (weekly = up to 7 days data loss)
- kubectl get all doesn’t capture everything (secrets, PVs, RBAC)
- No automation (manual backups are unreliable)
- No testing plan (backups might not work when needed)
- No multi-region failover
- No etcd backups (cluster state could be lost)
This approach violates both RTO (< 1 hour) and RPO (< 15 minutes) requirements.
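Even a broader manual export — a sketch, with an illustrative resource list — still misses etcd state, PV contents, and anything not named explicitly, which is why purpose-built tooling is needed:

# Still inadequate: captures more object kinds, but not etcd state or volume data
kubectl get deployments,services,configmaps,secrets,pvc,ingress,networkpolicies,serviceaccounts,roles,rolebindings \
  --all-namespaces -o yaml > broader-backup.yaml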
Senior Approach: Enterprise Disaster Recovery Architecture
This is exactly how financial institutions, healthcare companies, and Fortune 500 companies implement DR. Here’s the complete solution:
Three-Layer DR Strategy
Layer 1: etcd Backup (Kubernetes state)
↓ Every 15 minutes
↓
Layer 2: Velero Backup (Resources + Volumes)
↓ Hourly incremental, Daily full
↓
Layer 3: Multi-Region Replication
↓ Active-Passive setup
↓
Storage: S3 with cross-region replication

Layer 1: etcd Backup (Control Plane State)
etcd stores the entire Kubernetes cluster state. If etcd is lost, the cluster is gone.
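Before trusting snapshots, it is worth confirming etcd is healthy; a minimal check (endpoint and certificate paths assume the same kubeadm-style layout as the backup script below):

ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key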
Automated etcd backup script:
#!/bin/bash
set -e
ETCD_ENDPOINTS="https://127.0.0.1:2379"
ETCD_CERT="/etc/kubernetes/pki/etcd/server.crt"
ETCD_KEY="/etc/kubernetes/pki/etcd/server.key"
ETCD_CA="/etc/kubernetes/pki/etcd/ca.crt"
BACKUP_DIR="/var/backups/etcd"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/etcd-backup-${TIMESTAMP}.db"
# Create backup directory
mkdir -p ${BACKUP_DIR}
# Create etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save ${BACKUP_FILE} \
--endpoints=${ETCD_ENDPOINTS} \
--cacert=${ETCD_CA} \
--cert=${ETCD_CERT} \
--key=${ETCD_KEY}
# Verify backup
ETCDCTL_API=3 etcdctl snapshot status ${BACKUP_FILE} -w table
# Upload to S3 with encryption
aws s3 cp ${BACKUP_FILE} \
s3://company-k8s-backups/etcd/${TIMESTAMP}/ \
--sse aws:kms \
--sse-kms-key-id arn:aws:kms:us-east-1:123456789:key/abc-123
# Keep only last 7 days locally
find ${BACKUP_DIR} -type f -name "*.db" -mtime +7 -delete
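# (Optional, an assumption: a Prometheus Pushgateway is reachable at pushgateway.monitoring:9091)
# Record the last-success timestamp so the EtcdBackupMissing alert defined later has a metric to watch
cat <<METRICS | curl --silent --data-binary @- http://pushgateway.monitoring:9091/metrics/job/etcd-backup
etcd_backup_last_success_timestamp $(date +%s)
METRICS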
echo "✅ etcd backup completed: ${BACKUP_FILE}"Automated etcd backup CronJob:
apiVersion: batch/v1
kind: CronJob
metadata:
name: etcd-backup
namespace: kube-system
spec:
# Run every 15 minutes (RPO requirement)
schedule: "*/15 * * * *"
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
hostNetwork: true
nodeName: master-node-1 # Run on control plane node
containers:
- name: etcd-backup
image: company/etcd-backup:v1.0
command: ["/scripts/backup-etcd.sh"]
volumeMounts:
- name: etcd-certs
mountPath: /etc/kubernetes/pki/etcd
readOnly: true
- name: backup-dir
mountPath: /var/backups/etcd
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: access-key-id
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: secret-access-key
volumes:
- name: etcd-certs
hostPath:
path: /etc/kubernetes/pki/etcd
- name: backup-dir
hostPath:
path: /var/backups/etcd
restartPolicy: OnFailure

etcd Restore Procedure:
#!/bin/bash
# Restore etcd from backup
BACKUP_FILE="/var/backups/etcd/etcd-backup-20250115-100000.db"
RESTORE_DIR="/var/lib/etcd-restore"
# Stop etcd
systemctl stop etcd
# Restore snapshot
ETCDCTL_API=3 etcdctl snapshot restore ${BACKUP_FILE} \
--data-dir=${RESTORE_DIR} \
--name=etcd-restore \
--initial-cluster=etcd-restore=https://10.0.1.10:2380 \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# Update etcd data directory
rm -rf /var/lib/etcd
mv ${RESTORE_DIR} /var/lib/etcd
# Start etcd
systemctl start etcd
echo "✅ etcd restored from ${BACKUP_FILE}"Layer 2: Velero Backup (Complete Cluster Backup)
Velero backs up Kubernetes resources (Deployments, Services, ConfigMaps, Secrets, and so on), Persistent Volumes via volume snapshots, as well as namespaces, RBAC, and Network Policies.
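Beyond scheduled backups, Velero also supports ad-hoc backups before risky changes; a minimal sketch (backup and namespace names are illustrative), usable once the installation below is in place:

velero backup create pre-upgrade-production \
  --include-namespaces production \
  --wait
velero backup describe pre-upgrade-production --details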
Install Velero:
# 1. Create S3 bucket for backups
aws s3 mb s3://company-velero-backups --region us-east-1
# Enable versioning
aws s3api put-bucket-versioning \
--bucket company-velero-backups \
--versioning-configuration Status=Enabled
# Enable encryption
aws s3api put-bucket-encryption \
--bucket company-velero-backups \
--server-side-encryption-configuration '{
"Rules": [{
"ApplyServerSideEncryptionByDefault": {
"SSEAlgorithm": "aws:kms",
"KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789:key/abc-123"
}
}]
}'
# 2. Create IAM policy for Velero
cat > velero-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:DescribeVolumes",
"ec2:DescribeSnapshots",
"ec2:CreateTags",
"ec2:CreateVolume",
"ec2:CreateSnapshot",
"ec2:DeleteSnapshot"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:DeleteObject",
"s3:PutObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::company-velero-backups/*",
"arn:aws:s3:::company-velero-backups"
]
}
]
}
EOF
aws iam create-policy --policy-name VeleroPolicy --policy-document file://velero-policy.json
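# (Assumption: static IAM user credentials; on EKS, IRSA is the usual alternative)
# The install step below reads ./credentials-velero; it is a standard AWS credentials file:
cat > credentials-velero <<EOF
[default]
aws_access_key_id=<ACCESS_KEY_ID>
aws_secret_access_key=<SECRET_ACCESS_KEY>
EOF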
# 3. Install Velero
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.8.0 \
--bucket company-velero-backups \
--backup-location-config region=us-east-1 \
--snapshot-location-config region=us-east-1 \
--secret-file ./credentials-velero \
--use-volume-snapshots=true \
--use-node-agent

Velero Backup Schedules:
---
# Hourly incremental backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: hourly-backup
namespace: velero
spec:
schedule: "0 * * * *" # Every hour
template:
# Back up the application namespaces (system namespaces are excluded below)
includedNamespaces:
- production
- staging
excludedNamespaces:
- kube-system
- kube-public
# Backup volumes
defaultVolumesToRestic: true
# Retention
ttl: 72h # Keep hourly backups for 3 days
# Hooks for app-consistent backups
hooks:
resources:
- name: postgres-backup
includedNamespaces:
- production
labelSelector:
matchLabels:
app: postgres
pre:
- exec:
container: postgres
command:
- /bin/bash
- -c
- pg_dump -U postgres > /tmp/backup.sql
post:
- exec:
container: postgres
command:
- /bin/bash
- -c
- rm /tmp/backup.sql
---
# Daily full backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: daily-backup
namespace: velero
spec:
schedule: "0 2 * * *" # 2 AM daily
template:
# Backup everything including cluster resources
includedResources:
- '*'
includeClusterResources: true
defaultVolumesToRestic: true
ttl: 720h # Keep daily backups for 30 days
---
# Weekly compliance backup (long-term retention)
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: weekly-backup
namespace: velero
spec:
schedule: "0 3 * * 0" # 3 AM every Sunday
template:
includedResources:
- '*'
includeClusterResources: true
defaultVolumesToRestic: true
ttl: 8760h # Keep weekly backups for 1 year
# Store in separate long-term retention bucket
storageLocation: long-term-storage

Velero Restore Procedure:
# 1. List available backups
velero backup get
NAME STATUS CREATED EXPIRES
hourly-backup-001 Completed 2025-01-15 10:00:00 +0000 UTC 3d
daily-backup-001 Completed 2025-01-15 02:00:00 +0000 UTC 30d
# 2. Restore from specific backup
velero restore create restore-prod-20250115 \
--from-backup daily-backup-001 \
--wait
# 3. Check restore status
velero restore describe restore-prod-20250115
# 4. Verify resources are restored
kubectl get all -n production

Layer 3: Multi-Region Active-Passive Architecture
┌─────────────────────────────────────────────────────────┐
│ Primary Region (us-east-1) │
│ ├── Production Cluster (Active) │
│ ├── RDS Multi-AZ (Primary) │
│ └── S3 Bucket (Velero backups) │
│ ↓ Cross-region replication │
└──────────────────────┬──────────────────────────────────┘
│
↓ Replication
┌─────────────────────────────────────────────────────────┐
│ Secondary Region (us-west-2) │
│ ├── Standby Cluster (Passive - Ready to activate) │
│ ├── RDS Read Replica (Promoted to primary on failover)│
│ └── S3 Bucket (Replicated Velero backups) │
└─────────────────────────────────────────────────────────┘

Cross-Region S3 Replication:
# Enable cross-region replication
aws s3api put-bucket-replication \
--bucket company-velero-backups \
--replication-configuration '{
"Role": "arn:aws:iam::123456789:role/S3ReplicationRole",
"Rules": [{
"Status": "Enabled",
"Priority": 1,
"Filter": {},
"Destination": {
"Bucket": "arn:aws:s3:::company-velero-backups-dr",
"ReplicationTime": {
"Status": "Enabled",
"Time": {
"Minutes": 15
}
},
"Metrics": {
"Status": "Enabled",
"EventThreshold": {
"Minutes": 15
}
}
}
}]
}'

Multi-Region Database Replication (using Terraform):
# AWS RDS with cross-region read replica
resource "aws_db_instance" "primary" {
identifier = "production-db"
engine = "postgres"
instance_class = "db.r5.2xlarge"
# Multi-AZ for high availability
multi_az = true
# Enable automated backups
backup_retention_period = 30
backup_window = "03:00-04:00"
# Enable point-in-time recovery
enabled_cloudwatch_logs_exports = ["postgresql"]
# Encryption
storage_encrypted = true
kms_key_id = "arn:aws:kms:us-east-1:123456789:key/abc-123"
}
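# The replica below references a provider alias; a minimal definition is assumed here
# (alias and region chosen to match this example)
provider "aws" {
  alias  = "us-west-2"
  region = "us-west-2"
}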
# Cross-region read replica for DR
resource "aws_db_instance" "replica" {
identifier = "production-db-replica"
replicate_source_db = aws_db_instance.primary.arn
instance_class = "db.r5.2xlarge"
# Different region
provider = aws.us-west-2
# Can be promoted to standalone on failover
backup_retention_period = 30
storage_encrypted = true
}

Disaster Recovery Runbooks
Scenario 1: Accidental Namespace Deletion
# INCIDENT: Someone ran kubectl delete namespace production
# Step 1: Identify the backup to restore from
velero backup get | grep production
hourly-backup-20250115-1400 Completed 15m ago
# Step 2: Restore the namespace
velero restore create prod-restore-ns \
--from-backup hourly-backup-20250115-1400 \
--include-namespaces production \
--wait
# Step 3: Verify restoration
kubectl get all -n production
kubectl get pvc -n production
# Step 4: Verify application health
kubectl get pods -n production
curl https://api.company.com/health
# Recovery Time: ~10 minutes
# Data Loss: ~15 minutes (last backup)

Scenario 2: Complete Cluster Failure
# INCIDENT: Control plane nodes failed, etcd corrupted
# Step 1: Provision new cluster
eksctl create cluster -f cluster-config.yaml
# Step 2: Install Velero on new cluster
velero install --provider aws --bucket company-velero-backups ...
# Step 3: Restore from latest backup
velero restore create full-cluster-restore \
--from-backup daily-backup-20250115 \
--wait
# Step 4: Update DNS to point to new cluster
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.company.com",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "Z2FDTNDATAQYW2",
"DNSName": "new-cluster-lb.us-east-1.elb.amazonaws.com",
"EvaluateTargetHealth": false
}
}
}]
}'
# Recovery Time: ~45 minutes
# Data Loss: up to ~24 hours from the daily full backup alone; restoring the latest hourly backup on top of it reduces application-namespace loss to ~1 hour

Scenario 3: Region Failure (Failover to DR Region)
# INCIDENT: Entire us-east-1 region is down
# Step 1: Promote RDS read replica in us-west-2 to primary
aws rds promote-read-replica \
--db-instance-identifier production-db-replica \
--region us-west-2
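# (Optional) Wait for the promotion to complete before shifting traffic — a sketch using the AWS CLI waiter
aws rds wait db-instance-available \
  --db-instance-identifier production-db-replica \
  --region us-west-2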
# Step 2: Scale up standby cluster in us-west-2
kubectl scale deployment --all --replicas=10 -n production
# Step 3: Update DNS to point to DR region
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.company.com",
"Type": "A",
"AliasTarget": {
"DNSName": "dr-cluster-lb.us-west-2.elb.amazonaws.com"
}
}
}]
}'
# Step 4: Verify failover
curl https://api.company.com/health
# Recovery Time: ~30 minutes
# Data Loss: ~15 minutes (RDS replication lag)

Quarterly DR Testing Plan
# DR Test Checklist (Run every quarter)
Week 1: Test etcd Restore
- Provision test cluster
- Restore from etcd backup
- Verify cluster state matches production
- Document time taken
Week 2: Test Velero Namespace Restore
- Delete test namespace
- Restore from Velero backup
- Verify all resources restored
- Check PV data integrity
Week 3: Test Full Cluster Recovery
- Provision new test cluster
- Restore complete cluster from Velero
- Run smoke tests
- Measure RTO (should be < 1 hour)
Week 4: Test Region Failover
- Simulate region failure
- Failover to DR region
- Promote RDS replica
- Update DNS
- Verify application functionality
- Measure RTO and RPO

Monitoring and Alerting
# Prometheus alerts for backup failures
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: backup-alerts
namespace: monitoring
spec:
groups:
- name: disaster-recovery
rules:
- alert: VeleroBackupFailed
expr: |
velero_backup_failure_total > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Velero backup failed"
description: "Backup {{ $labels.backup }} failed. Check Velero logs."
- alert: EtcdBackupMissing
expr: |
time() - etcd_backup_last_success_timestamp > 1800
for: 5m
labels:
severity: critical
annotations:
summary: "etcd backup not taken in 30+ minutes"
description: "Last successful backup was > 30 minutes ago"
- alert: S3ReplicationLag
expr: |
aws_s3_replication_lag_seconds > 900
for: 10m
labels:
severity: warning
annotations:
summary: "S3 cross-region replication lagging"
description: "Replication lag is > 15 minutes (RPO violation)"Cost Analysis
Monthly DR Costs:
etcd Backups:
- S3 storage (96 backups/day * 100 MB * 30 days): ~$7/month
- S3 cross-region replication: ~$3/month
Velero Backups:
- S3 storage (hourly 500 GB * 72 hours): ~$40/month
- S3 storage (daily 2 TB * 30 days): ~$120/month
- EBS snapshots (100 volumes * 100 GB * $0.05): ~$500/month
Multi-Region Setup:
- Standby cluster (10% capacity): ~$500/month
- RDS read replica: ~$800/month
- Cross-region data transfer: ~$200/month
Total DR Cost: ~$2,170/month
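A quick sanity check that the line items above add up to the stated monthly total:

echo $(( 7 + 3 + 40 + 120 + 500 + 500 + 800 + 200 ))   # prints 2170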
Cost of 1 hour outage (healthcare SaaS):
- Revenue loss: ~$50,000
- SLA penalties: ~$20,000
- Reputation damage: Priceless
ROI: DR costs ~$26K/year but prevents $70K+ in losses per incident.
What separates the senior answer:
- Context over facts: explains when and why, not just what
- Real examples: provides specific use cases from production experience
- Trade-offs: acknowledges pros, cons, and decision factors
Practice Question
Your Velero backup completed successfully at 2:00 PM. At 2:30 PM, a developer accidentally deleted the production namespace, and you restored from the 2:00 PM backup. What are the resulting RPO and RTO?