Question
Implement a zero-downtime deployment strategy for a critical microservice handling 10K requests/second.
The Scenario
You’re the DevOps Lead at a major e-commerce company. Your checkout microservice handles 10,000 requests per second during peak hours (Black Friday, holiday season). Each request represents real revenue—even 1 second of downtime costs thousands of dollars.
Your team needs to deploy a critical security patch today. The deployment includes:
- Updated container image with security fixes
- New environment variables for enhanced monitoring
- Updated database schema (backward-compatible migration)
The business requirement is clear: Zero downtime. Not even 1 failed request.
Your current deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
      - name: checkout
        image: company/checkout:v1.2.3
        ports:
        - containerPort: 8080
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
The Challenge
Design and implement a zero-downtime deployment strategy. Your solution must:
- Guarantee zero failed requests during the deployment
- Handle the database migration safely
- Provide instant rollback capability if something goes wrong
- Be automated (no manual kubectl commands during deployment)
Show the complete deployment configuration and explain each decision.
How Different Experience Levels Approach This
Junior: a basic rolling update without proper configuration, just update the image tag and let Kubernetes handle it. No readiness probes, no preStop hooks, the default maxUnavailable allows downtime, no PodDisruptionBudget, no database migration strategy, no graceful shutdown period. Result: failed requests during deployment.
Senior: a production-grade zero-downtime deployment using a RollingUpdate strategy with readiness probes, preStop hooks, a PodDisruptionBudget, and proper resource management. This approach ensures traffic only goes to ready pods, gracefully finishes in-flight requests, prevents too many pods from being down simultaneously, and handles database migrations safely.
Junior Approach: Basic Rolling Update
The junior developer just updates the image tag without proper configuration:
spec:
  replicas: 20
  template:
    spec:
      containers:
      - name: checkout
        image: company/checkout:v1.2.4 # Updated

Problems with this approach:
- No readiness probes (traffic sent to pods before they’re ready)
- No preStop hooks (abrupt connection closures)
- Default maxUnavailable allows downtime
- No PodDisruptionBudget (too many pods can be down)
- No database migration strategy
- No graceful shutdown period
- Result: Failed requests during deployment
Senior Approach: Production-Ready Zero-Downtime Deployment
This mirrors the kind of configuration senior engineers at companies like Netflix, Amazon, and Uber run in production. Here’s the complete solution:
Strategy: Rolling Update with Readiness Probes + PreStop Hooks + PodDisruptionBudget
Complete Production-Ready Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
  annotations:
    # Enable Prometheus monitoring for deployment metrics
    prometheus.io/scrape: "true"
spec:
  replicas: 20
  # Rolling update strategy - this is KEY for zero downtime
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 5       # Create up to 5 extra pods during update
      maxUnavailable: 0 # NEVER allow any pods to be unavailable
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
        version: v1.2.4 # New version label for monitoring
      annotations:
        # Force pod restart on configmap/secret change (Helm templating;
        # drop this line if you apply the manifest directly with kubectl)
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
    spec:
      # Graceful termination - critical for zero downtime
      terminationGracePeriodSeconds: 60
      containers:
      - name: checkout
        image: company/checkout:v1.2.4 # New version
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        # Environment variables
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        - name: NEW_MONITORING_ENDPOINT
          value: "https://metrics.company.com"
        # Requests guarantee scheduling capacity during the surge;
        # limits keep a misbehaving pod from starving its neighbors
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        # CRITICAL: Readiness probe ensures traffic only goes to ready pods
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3
        # Liveness probe restarts crashed pods
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        # Startup probe for slow-starting applications
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 30 # 30 * 5s = 150s max startup time
        # PreStop hook - gracefully finish in-flight requests
        lifecycle:
          preStop:
            exec:
              command:
              - sh
              - -c
              - |
                # Stop accepting new connections
                kill -TERM 1
                # Wait for existing requests to complete (up to 45s)
                sleep 45
---
# PodDisruptionBudget - prevents too many pods from being down
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-service-pdb
  namespace: production
spec:
  minAvailable: 15 # Always keep at least 15 pods running (out of 20)
  selector:
    matchLabels:
      app: checkout-service
---
# Service - ensures traffic routing
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  namespace: production
spec:
  type: ClusterIP
  selector:
    app: checkout-service
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
  # Session affinity for stateful connections (optional)
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
---
# HorizontalPodAutoscaler - handle traffic spikes during deployment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 20
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Don't scale down too quickly

How This Achieves Zero Downtime
1. maxUnavailable: 0
This is the most critical setting. Kubernetes will create new pods FIRST (up to maxSurge: 5), wait for them to become ready, and only THEN terminate old pods.
Timeline (illustrative):
- t=0s: 20 old pods running
- t=10s: 25 pods running (20 old + 5 new starting)
- t=30s: 25 pods running (20 old + 5 new READY)
- t=31s: first 5 old pods begin terminating (draining in-flight requests)
- The surge-then-terminate cycle repeats in waves of 5
- A few minutes later: 20 pods running (all new), deployment complete
At no point does the number of ready pods drop below 20.
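You can watch these waves happen live by following the old and new ReplicaSets during the rollout (run from any machine with kubectl access to the cluster):

# Old and new ReplicaSets scale in opposite directions during the rollout
kubectl get rs -n production -l app=checkout-service -w

# Or watch the pods themselves churn
kubectl get pods -n production -l app=checkout-service -w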
2. Readiness Probe - The Traffic Controller
The Kubernetes Service only sends traffic to pods that pass readiness checks. During the deployment, new pods start but DON’T receive traffic yet: they warm up caches and verify their dependencies (the schema migration has already been applied by a separate Job, covered below). Once /health/ready returns 200 OK, traffic flows to them. Old pods continue handling requests until removed.
Application code for /health/ready:
app.get('/health/ready', async (req, res) => {
  try {
    // Check database connectivity
    await db.query('SELECT 1');
    // Check cache is warmed up
    if (!cache.isReady()) {
      return res.status(503).send('Cache not ready');
    }
    // Check critical dependencies
    const redisOk = await redis.ping();
    if (!redisOk) {
      return res.status(503).send('Redis unavailable');
    }
    res.status(200).send('OK');
  } catch (error) {
    res.status(503).send('Not ready');
  }
});

3. PreStop Hook - Graceful Shutdown
When Kubernetes terminates a pod:
- The pod is marked Terminating and removed from Service endpoints, so no new requests are routed to it (this propagates in parallel with the preStop hook)
- PreStop hook runs - tells the app to stop accepting new requests
- 45-second grace period - the app finishes in-flight requests
- SIGTERM sent - the app shuts down cleanly
- After terminationGracePeriodSeconds (60s), SIGKILL if still running
Why 45 seconds? Most requests complete in well under that time, database transactions get time to commit, and it prevents abrupt connection closures.
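The preStop hook only buys time; the application itself must stop accepting new connections and drain in-flight work when SIGTERM arrives. A minimal app-side sketch, assuming the Express app from the readiness example and an HTTP server created with app.listen (names are illustrative):

// const server = app.listen(8080); // created at startup
let shuttingDown = false;

process.on('SIGTERM', () => {
  shuttingDown = true; // the /health/ready handler can now return 503
  // Stop accepting new connections; in-flight requests run to completion
  server.close(() => process.exit(0));
  // Safety net: exit just before the 60s terminationGracePeriodSeconds expires
  setTimeout(() => process.exit(1), 55000).unref();
});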
4. PodDisruptionBudget - Prevent Mass Termination
A PDB prevents Kubernetes from draining too many nodes at once, the cluster autoscaler from removing too many nodes, and an admin from accidentally evicting too many pods. If someone runs kubectl drain node-1 (which hosts 10 of the pods), evictions proceed one at a time and only while at least 15 pods remain available; once evicting another pod would violate the budget, the drain blocks until replacement pods become ready elsewhere.
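You can verify what the budget currently permits before any drain or maintenance:

# ALLOWED DISRUPTIONS shows how many pods may be evicted right now
kubectl get pdb checkout-service-pdb -n production
# With 20 healthy pods and minAvailable: 15, it reports 5 allowed disruptions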
Database Migration Strategy
Problem: New version requires database schema changes.
Solution: Backward-compatible migrations with separate job
# Run BEFORE deploying new version
apiVersion: batch/v1
kind: Job
metadata:
  name: checkout-db-migration-v124
  namespace: production
spec:
  template:
    spec:
      restartPolicy: OnFailure
      initContainers:
      - name: wait-for-db
        image: postgres:15
        # DB_HOST must be supplied via env (e.g., from the db-credentials secret)
        command:
        - sh
        - -c
        - until pg_isready -h $DB_HOST; do sleep 2; done
      containers:
      - name: migrate
        image: company/checkout:v1.2.4
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        command:
        - sh
        - -c
        - |
          # Run backward-compatible migration
          # Old version can still work with new schema
          npm run db:migrate

Migration best practices:
- Additive only: Add new columns/tables, don’t drop anything (see the sketch after this list)
- Default values: New columns have sensible defaults
- Two-phase deploy:
  - Phase 1: Deploy schema changes (old app still works)
  - Phase 2: Deploy new app version (uses new columns)
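A minimal sketch of what such an additive migration might look like, assuming node-postgres (pg) and a hypothetical orders table; the table and column names are illustrative only:

// migrate.js - additive, backward-compatible migration sketch (hypothetical schema)
const { Client } = require('pg');

async function migrate() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    // Additive change with a default: pods still on v1.2.3 never see an error
    await client.query(
      'ALTER TABLE orders ADD COLUMN IF NOT EXISTS fraud_score NUMERIC NOT NULL DEFAULT 0'
    );
  } finally {
    await client.end();
  }
}

migrate().catch((err) => { console.error(err); process.exit(1); });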
Deployment Commands (Automated in CI/CD)
#!/bin/bash
set -e
echo "1. Running database migration..."
kubectl apply -f db-migration-job.yaml
kubectl wait --for=condition=complete --timeout=300s job/checkout-db-migration-v124 -n production
echo "2. Deploying new version..."
kubectl apply -f deployment.yaml
echo "3. Watching rollout..."
kubectl rollout status deployment/checkout-service -n production --timeout=600s
echo "4. Verifying health..."
kubectl get pods -n production -l app=checkout-service
kubectl get deployment checkout-service -n production
echo "5. Checking metrics..."
curl -s http://checkout-service.production.svc.cluster.local/metrics | grep http_requests_total
echo "✅ Deployment complete!"Instant Rollback Strategy
If anything goes wrong during deployment:
# Immediate rollback to previous version
kubectl rollout undo deployment/checkout-service -n production

# Or rollback to specific revision
kubectl rollout history deployment/checkout-service -n production
kubectl rollout undo deployment/checkout-service --to-revision=42 -n production

Why rollback is instant: the old ReplicaSet is still there (not deleted), Kubernetes just scales it back up, and it takes approximately 30 seconds to restore service.
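Since the requirement forbids manual kubectl commands during the deployment, the rollback belongs in the same CI/CD script. One way to wire it, extending the deployment script shown above:

# Roll back automatically if the rollout stalls or fails
if ! kubectl rollout status deployment/checkout-service -n production --timeout=600s; then
  echo "Rollout failed - rolling back"
  kubectl rollout undo deployment/checkout-service -n production
  kubectl rollout status deployment/checkout-service -n production --timeout=300s
  exit 1
fi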
Monitoring During Deployment
# Watch pod status in real-time
watch kubectl get pods -n production -l app=checkout-service

# Monitor metrics during rollout
kubectl top pods -n production -l app=checkout-service

# Check for errors in logs
kubectl logs -n production -l app=checkout-service --tail=100 -f

Key metrics to watch:
- Error rate: Should stay below 0.01% during the rollout (a sample query follows this list)
- Response time: Should not increase
- Pod count: Should maintain minAvailable from PDB
- Request success rate: Should stay at 100%
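One way to check the error rate during the rollout, assuming a Prometheus server reachable at the URL below and an http_requests_total metric carrying app and status labels (both are assumptions about your setup):

# Hypothetical Prometheus address - adjust to your monitoring stack
PROM=http://prometheus.monitoring.svc.cluster.local:9090
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{app="checkout-service",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="checkout-service"}[1m]))'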
What separates a senior answer in this scenario:
- Context over facts: Explains when and why, not just what
- Real examples: Provides specific use cases from production experience
- Trade-offs: Acknowledges pros, cons, and decision factors
Practice Question
During a rolling update with maxUnavailable: 0 and maxSurge: 5, what is the maximum number of pods that will run simultaneously if the deployment has replicas: 20?