Question
Implement a zero-downtime deployment strategy for a critical microservice handling 10K requests/second.
The Scenario
You’re the DevOps Lead at a major e-commerce company. Your checkout microservice handles 10,000 requests per second during peak hours (Black Friday, holiday season). Each request represents real revenue—even 1 second of downtime costs thousands of dollars.
Your team needs to deploy a critical security patch today. The deployment includes:
- Updated container image with security fixes
- New environment variables for enhanced monitoring
- Updated database schema (backward-compatible migration)
The business requirement is clear: Zero downtime. Not even 1 failed request.
Your current deployment manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
      - name: checkout
        image: company/checkout:v1.2.3
        ports:
        - containerPort: 8080
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
The Challenge
Design and implement a zero-downtime deployment strategy. Your solution must:
- Guarantee zero failed requests during the deployment
- Handle the database migration safely
- Provide instant rollback capability if something goes wrong
- Be automated (no manual kubectl commands during deployment)
Show the complete deployment configuration and explain each decision.
How Different Experience Levels Approach This
Junior: a basic rolling update without proper configuration, just update the image tag and let Kubernetes handle it. No readiness probes, no preStop hooks, the default maxUnavailable allows downtime, no PodDisruptionBudget, no database migration strategy, no graceful shutdown period. Result: failed requests during deployment.
Senior: a production-grade zero-downtime deployment using a RollingUpdate strategy with readiness probes, preStop hooks, a PodDisruptionBudget, and proper resource management. This approach ensures traffic only goes to ready pods, gracefully finishes in-flight requests, prevents too many pods from being down simultaneously, and handles database migrations safely.
Junior Approach: Basic Rolling Update
The junior developer just updates the image tag without proper configuration:
spec:
  replicas: 20
  template:
    spec:
      containers:
      - name: checkout
        image: company/checkout:v1.2.4 # Updated

Problems with this approach:
- No readiness probes (traffic sent to pods before they’re ready)
- No preStop hooks (abrupt connection closures)
- Default maxUnavailable allows downtime
- No PodDisruptionBudget (too many pods can be down)
- No database migration strategy
- No graceful shutdown period
- Result: Failed requests during deployment
Senior Approach: Production-Ready Zero-Downtime Deployment
This mirrors the kind of configuration senior engineers at companies like Netflix, Amazon, and Uber run in production. Here’s the complete solution:
Strategy: Rolling Update with Readiness Probes + PreStop Hooks + PodDisruptionBudget
Complete Production-Ready Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
  annotations:
    # Enable Prometheus monitoring for deployment metrics
    prometheus.io/scrape: "true"
spec:
  replicas: 20
  # Rolling update strategy - this is KEY for zero downtime
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 5       # Create up to 5 extra pods during update
      maxUnavailable: 0 # NEVER allow any pods to be unavailable
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
        version: v1.2.4 # New version label for monitoring
      annotations:
        # Force pod restart on configmap/secret change (Helm templating;
        # drop this line if you apply the manifest directly with kubectl)
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
    spec:
      # Graceful termination - critical for zero downtime
      terminationGracePeriodSeconds: 60
      containers:
      - name: checkout
        image: company/checkout:v1.2.4 # New version
        imagePullPolicy: Always
        ports:
        - name: http
          containerPort: 8080
          protocol: TCP
        # Environment variables
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        - name: NEW_MONITORING_ENDPOINT
          value: "https://metrics.company.com"
        # Requests guarantee scheduling capacity during the surge;
        # limits keep a misbehaving pod from starving its neighbors
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        # CRITICAL: Readiness probe ensures traffic only goes to ready pods
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 3
        # Liveness probe restarts crashed pods
        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        # Startup probe for slow-starting applications
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 30 # 30 * 5s = 150s max startup time
        # PreStop hook - gracefully finish in-flight requests
        lifecycle:
          preStop:
            exec:
              command:
              - sh
              - -c
              - |
                # Stop accepting new connections
                kill -TERM 1
                # Wait for existing requests to complete (up to 45s)
                sleep 45
---
# PodDisruptionBudget - prevents too many pods from being down
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-service-pdb
  namespace: production
spec:
  minAvailable: 15 # Always keep at least 15 pods running (out of 20)
  selector:
    matchLabels:
      app: checkout-service
---
# Service - ensures traffic routing
apiVersion: v1
kind: Service
metadata:
  name: checkout-service
  namespace: production
spec:
  type: ClusterIP
  selector:
    app: checkout-service
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
  # Session affinity for stateful connections (optional)
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600
---
# HorizontalPodAutoscaler - handle traffic spikes during deployment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-service-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  minReplicas: 20
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300 # Don't scale down too quickly

How This Achieves Zero Downtime
1. maxUnavailable: 0
This is the most critical setting. Kubernetes will create new pods FIRST (up to maxSurge: 5), wait for them to become ready, and only THEN terminate old pods.
Timeline (illustrative):
- t=0s: 20 old pods running
- t=10s: 25 pods running (20 old + 5 new starting)
- t=30s: 25 pods running (20 old + 5 new READY)
- t=31s: first 5 old pods begin terminating (draining in-flight requests)
- The surge-then-terminate cycle repeats in waves of 5
- A few minutes later: 20 pods running (all new), deployment complete
At no point does the number of ready pods drop below 20.
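You can watch these waves happen live by following the old and new ReplicaSets during the rollout (run from any machine with kubectl access to the cluster):

# Old and new ReplicaSets scale in opposite directions during the rollout
kubectl get rs -n production -l app=checkout-service -w

# Or watch the pods themselves churn
kubectl get pods -n production -l app=checkout-service -w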
2. Readiness Probe - The Traffic Controller
The Kubernetes Service only sends traffic to pods that pass readiness checks. During the deployment, new pods start but DON’T receive traffic yet: they warm up caches and verify their dependencies (the schema migration has already been applied by a separate Job, covered below). Once /health/ready returns 200 OK, traffic flows to them. Old pods continue handling requests until removed.
Application code for /health/ready:
app.get('/health/ready', async (req, res) => {
  try {
    // Check database connectivity
    await db.query('SELECT 1');
    // Check cache is warmed up
    if (!cache.isReady()) {
      return res.status(503).send('Cache not ready');
    }
    // Check critical dependencies
    const redisOk = await redis.ping();
    if (!redisOk) {
      return res.status(503).send('Redis unavailable');
    }
    res.status(200).send('OK');
  } catch (error) {
    res.status(503).send('Not ready');
  }
});

3. PreStop Hook - Graceful Shutdown
When Kubernetes terminates a pod:
- The pod is marked Terminating and removed from Service endpoints, so no new requests are routed to it (this propagates in parallel with the preStop hook)
- PreStop hook runs - tells the app to stop accepting new requests
- 45-second grace period - the app finishes in-flight requests
- SIGTERM sent - the app shuts down cleanly
- After terminationGracePeriodSeconds (60s), SIGKILL if still running
Why 45 seconds? Most requests complete in well under that time, database transactions get time to commit, and it prevents abrupt connection closures.
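The preStop hook only buys time; the application itself must stop accepting new connections and drain in-flight work when SIGTERM arrives. A minimal app-side sketch, assuming the Express app from the readiness example and an HTTP server created with app.listen (names are illustrative):

// const server = app.listen(8080); // created at startup
let shuttingDown = false;

process.on('SIGTERM', () => {
  shuttingDown = true; // the /health/ready handler can now return 503
  // Stop accepting new connections; in-flight requests run to completion
  server.close(() => process.exit(0));
  // Safety net: exit just before the 60s terminationGracePeriodSeconds expires
  setTimeout(() => process.exit(1), 55000).unref();
});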
4. PodDisruptionBudget - Prevent Mass Termination
A PDB prevents Kubernetes from draining too many nodes at once, the cluster autoscaler from removing too many nodes, and an admin from accidentally evicting too many pods. If someone runs kubectl drain node-1 (which hosts 10 of the pods), evictions proceed one at a time and only while at least 15 pods remain available; once evicting another pod would violate the budget, the drain blocks until replacement pods become ready elsewhere.
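You can verify what the budget currently permits before any drain or maintenance:

# ALLOWED DISRUPTIONS shows how many pods may be evicted right now
kubectl get pdb checkout-service-pdb -n production
# With 20 healthy pods and minAvailable: 15, it reports 5 allowed disruptions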
Database Migration Strategy
Problem: New version requires database schema changes.
Solution: Backward-compatible migrations with separate job
# Run BEFORE deploying new version
apiVersion: batch/v1
kind: Job
metadata:
  name: checkout-db-migration-v124
  namespace: production
spec:
  template:
    spec:
      restartPolicy: OnFailure
      initContainers:
      - name: wait-for-db
        image: postgres:15
        # DB_HOST must be supplied via env (e.g., from the db-credentials secret)
        command:
        - sh
        - -c
        - until pg_isready -h $DB_HOST; do sleep 2; done
      containers:
      - name: migrate
        image: company/checkout:v1.2.4
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
        command:
        - sh
        - -c
        - |
          # Run backward-compatible migration
          # Old version can still work with new schema
          npm run db:migrate

Migration best practices:
- Additive only: Add new columns/tables, don’t drop anything (see the sketch after this list)
- Default values: New columns have sensible defaults
- Two-phase deploy:
  - Phase 1: Deploy schema changes (old app still works)
  - Phase 2: Deploy new app version (uses new columns)
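A minimal sketch of what such an additive migration might look like, assuming node-postgres (pg) and a hypothetical orders table; the table and column names are illustrative only:

// migrate.js - additive, backward-compatible migration sketch (hypothetical schema)
const { Client } = require('pg');

async function migrate() {
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    // Additive change with a default: pods still on v1.2.3 never see an error
    await client.query(
      'ALTER TABLE orders ADD COLUMN IF NOT EXISTS fraud_score NUMERIC NOT NULL DEFAULT 0'
    );
  } finally {
    await client.end();
  }
}

migrate().catch((err) => { console.error(err); process.exit(1); });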
Deployment Commands (Automated in CI/CD)
#!/bin/bash
set -e
echo "1. Running database migration..."
kubectl apply -f db-migration-job.yaml
kubectl wait --for=condition=complete --timeout=300s job/checkout-db-migration-v124 -n production
echo "2. Deploying new version..."
kubectl apply -f deployment.yaml
echo "3. Watching rollout..."
kubectl rollout status deployment/checkout-service -n production --timeout=600s
echo "4. Verifying health..."
kubectl get pods -n production -l app=checkout-service
kubectl get deployment checkout-service -n production
echo "5. Checking metrics..."
curl -s http://checkout-service.production.svc.cluster.local/metrics | grep http_requests_total
echo "✅ Deployment complete!"Instant Rollback Strategy
If anything goes wrong during deployment:
# Immediate rollback to previous version
kubectl rollout undo deployment/checkout-service -n production

# Or rollback to specific revision
kubectl rollout history deployment/checkout-service -n production
kubectl rollout undo deployment/checkout-service --to-revision=42 -n production

Why rollback is instant: the old ReplicaSet is still there (not deleted), Kubernetes just scales it back up, and it takes approximately 30 seconds to restore service.
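Since the requirement forbids manual kubectl commands during the deployment, the rollback belongs in the same CI/CD script. One way to wire it, extending the deployment script shown above:

# Roll back automatically if the rollout stalls or fails
if ! kubectl rollout status deployment/checkout-service -n production --timeout=600s; then
  echo "Rollout failed - rolling back"
  kubectl rollout undo deployment/checkout-service -n production
  kubectl rollout status deployment/checkout-service -n production --timeout=300s
  exit 1
fi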
Monitoring During Deployment
# Watch pod status in real-time
watch kubectl get pods -n production -l app=checkout-service

# Monitor metrics during rollout
kubectl top pods -n production -l app=checkout-service

# Check for errors in logs
kubectl logs -n production -l app=checkout-service --tail=100 -f

Key metrics to watch:
- Error rate: Should stay below 0.01% during the rollout (a sample query follows this list)
- Response time: Should not increase
- Pod count: Should maintain minAvailable from PDB
- Request success rate: Should stay at 100%
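One way to check the error rate during the rollout, assuming a Prometheus server reachable at the URL below and an http_requests_total metric carrying app and status labels (both are assumptions about your setup):

# Hypothetical Prometheus address - adjust to your monitoring stack
PROM=http://prometheus.monitoring.svc.cluster.local:9090
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=sum(rate(http_requests_total{app="checkout-service",status=~"5.."}[1m])) / sum(rate(http_requests_total{app="checkout-service"}[1m]))'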
What separates a senior answer in this scenario:
- Context over facts: Explains when and why, not just what
- Real examples: Provides specific use cases from production experience
- Trade-offs: Acknowledges pros, cons, and decision factors
Practice Question
During a rolling update with maxUnavailable: 0 and maxSurge: 5, what is the maximum number of pods that will run simultaneously if the deployment has replicas: 20?