Questions
Your application experiences unpredictable traffic spikes. Design a comprehensive autoscaling strategy.
The Scenario
You’re the Cloud Infrastructure Architect at a news media company. Your application has highly unpredictable traffic:
- Normal traffic: 1,000 requests/second (10 pods sufficient)
- Breaking news events: 50,000+ requests/second (need 200+ pods)
- Daily pattern: Traffic spikes at 8 AM, 12 PM, 6 PM
- Unpredictable spikes: Major news events can happen anytime
Current problems:
- Manual scaling is too slow—by the time engineers add pods, the spike is over
- Over-provisioning wastes money—paying for 200 pods 24/7 costs $50K/month
- Under-provisioning causes crashes—site went down during last major event
Your CEO’s requirements:
- Handle traffic spikes within 60 seconds
- Scale down to save costs during low traffic
- Maintain 99.9% uptime
- Keep infrastructure costs under $15K/month
The Challenge
Design a comprehensive autoscaling strategy using:
- Horizontal Pod Autoscaler (HPA) - Scale pods based on metrics
- Vertical Pod Autoscaler (VPA) - Right-size pod resources
- Cluster Autoscaler - Add/remove nodes as needed
Explain when to use each, how they work together, and provide complete configurations.
How Different Experience Levels Approach This
A junior engineer might use basic CPU-based HPA with defaults, set aggressive scaling without understanding behavior, ignore cluster capacity leading to pending pods, and not configure PodDisruptionBudgets causing downtime. This fails because CPU alone doesn't reflect application load, aggressive scaling causes pod thrashing, pending pods mean requests fail during spikes, and there's no protection during node maintenance.
A senior architect implements a comprehensive three-layer autoscaling strategy: Layer 1 is HPA scaling pods based on CPU, memory, and custom metrics like requests per second; Layer 2 is Cluster Autoscaler adding nodes when pods can't be scheduled; Layer 3 is VPA right-sizing pod resource requests. The HPA uses multiple metrics with behavior policies controlling scale-up (immediate, 100% increase) and scale-down (gradual, 5-minute stabilization). This achieves 89% cost reduction while maintaining 99.9% uptime.
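To avoid the downtime-during-maintenance pitfall called out above, a PodDisruptionBudget keeps a minimum number of pods running during voluntary disruptions such as node drains and Cluster Autoscaler scale-downs. A minimal sketch, assuming the pods carry an app: news-app label matching the Deployment used in the configurations below:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: news-app-pdb
  namespace: production
spec:
  minAvailable: "80%"   # Keep at least 80% of pods available during voluntary disruptions
  selector:
    matchLabels:
      app: news-app     # Assumed pod label; must match the news-app Deployment's pod template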
Complete HPA with Multiple Metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: news-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: news-app
  minReplicas: 10    # Always keep at least 10 pods (handles baseline traffic)
  maxReplicas: 200   # Never exceed 200 pods (cost control)
  # Multiple metrics - the HPA scales to satisfy whichever metric demands the most replicas
  metrics:
    # Scale based on CPU utilization
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # Scale when avg CPU > 70%
    # Scale based on memory utilization
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80    # Scale when avg memory > 80%
    # Scale based on a custom metric (requests per second)
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"       # Scale when a pod handles > 100 RPS
  # Scaling behavior - control how fast to scale up/down
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # Scale up immediately (no delay)
      policies:
        - type: Percent
          value: 100                  # Double the pod count each period (10 → 20 → 40 → 80)
          periodSeconds: 15
        - type: Pods
          value: 20                   # Or add 20 pods per period
          periodSeconds: 15
      selectPolicy: Max               # Use whichever policy adds more pods
    scaleDown:
      stabilizationWindowSeconds: 300 # Wait 5 min before scaling down
      policies:
        - type: Percent
          value: 50                   # Remove at most 50% of pods per period (slow scale-down)
          periodSeconds: 60
        - type: Pods
          value: 5                    # Or remove 5 pods per period
          periodSeconds: 60
      selectPolicy: Min               # Use whichever policy removes fewer pods
How HPA Works
The HPA control loop works as follows: it reads the current state (say, 10 pods at 90% average CPU against a 70% target), computes the number of replicas needed to bring average utilization back to the target, scales the Deployment to that count (13 pods in this example), and then re-evaluates every 15 seconds (the default sync period) against the latest metrics.
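The underlying formula, per the Kubernetes HPA documentation, is:

desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue)
                = ceil(10 × 90 / 70)
                = ceil(12.86)
                = 13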
Cluster Autoscaler Configuration
# AWS EKS Cluster Autoscaler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --namespace=kube-system
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled
            - --balance-similar-node-groups              # Keep similar node groups at similar sizes
            - --skip-nodes-with-system-pods=false
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m           # Wait 10 min after adding a node before considering scale-down
            - --scale-down-unneeded-time=10m             # A node must be unneeded for 10 min before removal
            - --scale-down-utilization-threshold=0.5     # Consider a node for removal below 50% utilization
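To verify the autoscaler can actually add capacity during a spike, its decisions can be inspected directly (assuming the Deployment name and namespace above):

kubectl -n kube-system logs deployment/cluster-autoscaler
kubectl -n kube-system describe configmap cluster-autoscaler-status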
VPA for Right-Sizing
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: news-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: news-app
  updatePolicy:
    updateMode: "Off"   # Only generate recommendations, don't auto-apply them
  resourcePolicy:
    containerPolicies:
      - containerName: news-app
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "4000m"
          memory: "8Gi"

What sets the senior answer apart:
- Context over facts: Explains when and why, not just what
- Real examples: Provides specific use cases from production experience
- Trade-offs: Acknowledges pros, cons, and decision factors
Cost Optimization Strategy
Current costs vs optimized:
Without autoscaling (static 200 pods):
- Nodes: 40 m5.2xlarge on-demand = $0.38/hour * 40 nodes * 730 hours = $11,096/month
- That capacity is only needed about 1 hour/day, so roughly $10,000/month is wasted on idle nodes
With autoscaling:
- Baseline: 5 nodes * $0.38/hour * 730 hours = $1,387/month
- Spike capacity: +45 nodes * $0.38/hour * ~1 hour/day * 30 days ≈ $513/month
- Spot savings: running ~80% of capacity on Spot at a ~70% discount brings the total to roughly $1,200/month
Savings: $11,096 - $1,200 = $9,896/month (89% cost reduction!)
Monitoring and Alerting
# Prometheus alert for autoscaling issues
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: autoscaling-alerts
namespace: monitoring
spec:
groups:
- name: autoscaling
rules:
- alert: HPAMaxedOut
expr: |
kube_horizontalpodautoscaler_status_current_replicas >=
kube_horizontalpodautoscaler_spec_max_replicas
for: 5m
labels:
severity: warning
annotations:
summary: "HPA {{ $labels.horizontalpodautoscaler }} reached max replicas"
description: "Consider increasing maxReplicas or adding more node capacity"
- alert: ClusterAutoscalerFailing
expr: cluster_autoscaler_failed_scale_ups_total > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Cluster Autoscaler cannot add nodes"
description: "Check ASG limits and AWS quotas"
Practice Question
Your HPA is configured with minReplicas: 10 and maxReplicas: 50. During a traffic spike, CPU usage hits 200% and HPA wants to scale to 80 pods. What actually happens?