Questions
Your application experiences unpredictable traffic spikes. Design a comprehensive autoscaling strategy.
The Scenario
You’re the Cloud Infrastructure Architect at a news media company. Your application has highly unpredictable traffic:
- Normal traffic: 1,000 requests/second (10 pods sufficient)
- Breaking news events: 50,000+ requests/second (need 200+ pods)
- Daily pattern: Traffic spikes at 8 AM, 12 PM, 6 PM
- Unpredictable spikes: Major news events can happen anytime
Current problems:
- Manual scaling is too slow—by the time engineers add pods, the spike is over
- Over-provisioning wastes money—paying for 200 pods 24/7 costs $50K/month
- Under-provisioning causes crashes—site went down during last major event
Your CEO’s requirements:
- Handle traffic spikes within 60 seconds
- Scale down to save costs during low traffic
- Maintain 99.9% uptime
- Keep infrastructure costs under $15K/month
The Challenge
Design a comprehensive autoscaling strategy using:
- Horizontal Pod Autoscaler (HPA) - Scale pods based on metrics
- Vertical Pod Autoscaler (VPA) - Right-size pod resources
- Cluster Autoscaler - Add/remove nodes as needed
Explain when to use each, how they work together, and provide complete configurations.
How Different Experience Levels Approach This
A junior engineer might use basic CPU-based HPA with defaults, set aggressive scaling without understanding behavior, ignore cluster capacity leading to pending pods, and not configure PodDisruptionBudgets causing downtime. This fails because CPU alone doesn't reflect application load, aggressive scaling causes pod thrashing, pending pods mean requests fail during spikes, and there's no protection during node maintenance.
A senior architect implements a comprehensive three-layer autoscaling strategy: Layer 1 is HPA scaling pods based on CPU, memory, and custom metrics like requests per second; Layer 2 is Cluster Autoscaler adding nodes when pods can't be scheduled; Layer 3 is VPA right-sizing pod resource requests. The HPA uses multiple metrics with behavior policies controlling scale-up (immediate, 100% increase) and scale-down (gradual, 5-minute stabilization). This achieves 89% cost reduction while maintaining 99.9% uptime.
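To avoid the downtime-during-maintenance pitfall called out above, a PodDisruptionBudget keeps a minimum number of pods running during voluntary disruptions such as node drains and Cluster Autoscaler scale-downs. A minimal sketch, assuming the pods carry an app: news-app label matching the Deployment used in the configurations below:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: news-app-pdb
  namespace: production
spec:
  minAvailable: "80%"   # Keep at least 80% of pods available during voluntary disruptions
  selector:
    matchLabels:
      app: news-app     # Assumed pod label; must match the news-app Deployment's pod template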
Complete HPA with Multiple Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: news-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: news-app
  minReplicas: 10    # Always keep at least 10 pods (handles baseline traffic)
  maxReplicas: 200   # Never exceed 200 pods (cost control)
  # Multiple metrics - the HPA scales to satisfy whichever metric demands the most replicas
  metrics:
    # Scale based on CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # Scale when avg CPU > 70%
    # Scale based on memory utilization
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80    # Scale when avg memory > 80%
    # Scale based on a custom metric (requests per second)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"       # Scale when a pod handles > 100 RPS
  # Scaling behavior - control how fast to scale up/down
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # Scale up immediately (no delay)
      policies:
        - type: Percent
          value: 100                  # Double the pod count each period (10 → 20 → 40 → 80)
          periodSeconds: 15
        - type: Pods
          value: 20                   # Or add 20 pods per period
          periodSeconds: 15
      selectPolicy: Max               # Use whichever policy adds more pods
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 50                   # Remove at most 50% of pods per period (slow scale-down)
          periodSeconds: 60
        - type: Pods
          value: 5                    # Or remove 5 pods per period
          periodSeconds: 60
      selectPolicy: Min               # Use whichever policy removes fewer pods
How HPA Works
The HPA control loop works as follows: it reads the current state (say, 10 pods at 90% average CPU against a 70% target), computes the number of replicas needed to bring average utilization back to the target, scales the Deployment to that count (13 pods in this example), and then re-evaluates every 15 seconds (the default sync period) against the latest metrics.
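The underlying formula, per the Kubernetes HPA documentation, is:

desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)
                = ceil(10 × 90 / 70)
                = ceil(12.86)
                = 13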
Cluster Autoscaler Configuration
# AWS EKS Cluster Autoscaler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --namespace=kube-system
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
            - --balance-similar-node-groups              # Keep similar node groups at similar sizes
            - --skip-nodes-with-system-pods=false
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m           # Wait 10 min after adding a node before considering scale-down
            - --scale-down-unneeded-time=10m             # A node must be unneeded for 10 min before removal
            - --scale-down-utilization-threshold=0.5     # Consider a node for removal below 50% utilization
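To verify the autoscaler can actually add capacity during a spike, its decisions can be inspected directly (assuming the Deployment name and namespace above):

kubectl -n kube-system logs deployment/cluster-autoscaler
kubectl -n kube-system describe configmap cluster-autoscaler-status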
VPA for Right-Sizing
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: news-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: news-app
  updatePolicy:
    updateMode: "Off"   # Only generate recommendations, don't auto-apply them
  resourcePolicy:
    containerPolicies:
      - containerName: news-app
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4000m"
          memory: "8Gi"

What sets the senior answer apart:
- Context over facts: Explains when and why, not just what
- Real examples: Provides specific use cases from production experience
- Trade-offs: Acknowledges pros, cons, and decision factors
Cost Optimization Strategy
Current costs vs optimized:
Without autoscaling (static 200 pods):
- Nodes: 40 m5.2xlarge on-demand = $0.38/hour * 40 nodes * 730 hours = $11,096/month
- That capacity is only needed about 1 hour/day, so roughly $10,000/month is wasted on idle nodes
With autoscaling:
- Baseline: 5 nodes * $0.38/hour * 730 hours = $1,387/month
- Spike capacity: +45 nodes * $0.38/hour * ~1 hour/day * 30 days ≈ $513/month
- Spot savings: running ~80% of capacity on Spot at a ~70% discount brings the total to roughly $1,200/month
Savings: $11,096 - $1,200 = $9,896/month (89% cost reduction!)
Monitoring and Alerting
# Prometheus alert for autoscaling issues
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: autoscaling-alerts
namespace: monitoring
spec:
groups:
- name: autoscaling
rules:
- alert: HPAMaxedOut
expr: |
kube_horizontalpodautoscaler_status_current_replicas >=
kube_horizontalpodautoscaler_spec_max_replicas
for: 5m
labels:
severity: warning
annotations:
summary: "HPA {{ $labels.horizontalpodautoscaler }} reached max replicas"
description: "Consider increasing maxReplicas or adding more node capacity"
- alert: ClusterAutoscalerFailing
expr: cluster_autoscaler_failed_scale_ups_total > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Cluster Autoscaler cannot add nodes"
description: "Check ASG limits and AWS quotas"
Practice Question
Your HPA is configured with minReplicas: 10 and maxReplicas: 50. During a traffic spike, CPU usage hits 200% and HPA wants to scale to 80 pods. What actually happens?