Questions
Pods are stuck in the Pending state, containers keep getting OOMKilled, and nodes are hitting memory pressure. Fix this GKE cluster.
The Scenario
Your production GKE cluster is experiencing issues. Some pods are stuck in Pending:
$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
api-server-7d9f8c-abc   0/1     Pending   0          45m
api-server-7d9f8c-def   1/1     Running   0          2h
worker-5c6b7d-ghi       0/1     Pending   0          30m
$ kubectl describe pod api-server-7d9f8c-abc
Events:
Warning FailedScheduling 2m default-scheduler
0/5 nodes are available: 2 Insufficient cpu, 3 Insufficient memory,
5 node(s) had taint {node.kubernetes.io/memory-pressure: NoSchedule}
Meanwhile, pods are being evicted and nodes are becoming unhealthy under memory pressure:
$ kubectl get nodes
NAME                                  STATUS                     ROLES    AGE
gke-cluster-default-pool-abc123-def   Ready,SchedulingDisabled   <none>   5h
gke-cluster-default-pool-abc123-ghi   NotReady                   <none>   3h
The Challenge
Diagnose the root cause of pod scheduling failures and node instability. Implement fixes for resource management, autoscaling, and node pool configuration.
A junior engineer might manually delete pending pods, increase node size without understanding the cause, disable resource limits entirely, or restart nodes. These approaches mask symptoms without fixing root causes and often make things worse.
A senior engineer investigates systematically (resource requests vs. actual usage, node allocatable resources, the causes of memory pressure, autoscaler configuration) and then implements proper resource management with requests/limits, PodDisruptionBudgets, and priority classes.
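Those fixes pair naturally with a PodDisruptionBudget, so node drains, upgrades, and autoscaler scale-downs never take out too many replicas at once. A minimal sketch, assuming the api-server pods carry an app: api-server label (as in the Deployment shown in Step 4):
# Sketch: PodDisruptionBudget for api-server
# (assumes pods labeled app: api-server and 3 replicas, per Step 4)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2        # with 3 replicas, voluntary disruptions leave at least 2 running
  selector:
    matchLabels:
      app: api-server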
Step 1: Understand Why Pods Are Pending
# Check why pods can't be scheduled
kubectl describe pod api-server-7d9f8c-abc | grep -A 20 Events
# Check node resources
kubectl describe nodes | grep -A 10 "Allocated resources"
# Example output:
# Allocated resources:
#   (Total limits may be over 100 percent, i.e., overcommitted.)
#   Resource   Requests      Limits
#   --------   --------      ------
#   cpu        3800m (95%)   8000m (200%)
#   memory     14Gi (95%)    20Gi (133%)
# See what's consuming resources
kubectl top nodes
kubectl top pods --all-namespaces --sort-by=memory
Step 2: Check Node Memory Pressure
# Check node conditions
kubectl get nodes -o custom-columns=\
'NAME:.metadata.name,'\
'MEMORY_PRESSURE:.status.conditions[?(@.type=="MemoryPressure")].status,'\
'DISK_PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status'
# Check kubelet logs for evictions
gcloud logging read \
'resource.type="k8s_node" AND
textPayload:"eviction"' \
--limit=50 \
--format="table(timestamp, textPayload)"
# Check what's being evicted
kubectl get events --sort-by='.lastTimestamp' | grep -i evict
Step 3: Analyze Pod Resource Configuration
# Find pods without resource limits (dangerous!)
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | select(.spec.containers[].resources.limits == null) |
"\(.metadata.namespace)/\(.metadata.name)"'
# Check current resource configuration
kubectl get pod api-server-7d9f8c-def -o yaml | \
yq '.spec.containers[].resources'
# Typical problematic config:
# resources:
#   requests:
#     memory: "256Mi"   # Too low a request
#     cpu: "100m"
#   limits:
#     memory: "8Gi"     # 32x the request - causes overcommit!
#     cpu: "2000m"
Step 4: Fix Resource Requests and Limits
# Proper resource configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api
        image: api-server:v1          # placeholder; use your real image
        resources:
          requests:
            memory: "512Mi"           # Based on actual P95 usage
            cpu: "250m"
          limits:
            memory: "1Gi"             # 2x request is reasonable
            cpu: "500m"               # Limit CPU bursting
        # Probes: restart unhealthy pods, send traffic only when ready
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
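The 512Mi request above should come from observed usage, not guesswork. One hedged way to get usage-based numbers is a VerticalPodAutoscaler in recommendation-only mode; this sketch assumes vertical Pod autoscaling is enabled on the GKE cluster (or the open-source VPA is installed) and targets the api-server Deployment above:
# Sketch: VPA in recommendation-only mode - it reports suggested
# requests without ever evicting or resizing pods
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"    # recommendations only, no automatic changes
Afterwards, kubectl describe vpa api-server-vpa should list target and upper-bound recommendations you can fold back into the Deployment's requests.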
Step 5: Configure Cluster Autoscaler
# Check current autoscaler status
gcloud container clusters describe my-cluster \
--zone=us-central1-a \
--format="yaml(autoscaling)"
# Enable cluster autoscaler with proper limits
gcloud container clusters update my-cluster \
--zone=us-central1-a \
--enable-autoscaling \
--min-nodes=2 \
--max-nodes=20 \
--node-pool=default-pool
# Check autoscaler events
kubectl get events -n kube-system | grep cluster-autoscaler
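Scale-up is only half the story; nodes often fail to scale back down because pods block node drain. For workloads that are genuinely disposable, one option is the cluster autoscaler's safe-to-evict annotation, sketched here on a hypothetical batch worker (the container name and resource values are illustrative):
# Sketch: allow the cluster autoscaler to evict this pod during
# scale-down (pod-template excerpt; "batch-worker" is illustrative)
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
    spec:
      containers:
      - name: batch-worker
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"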
Step 6: Create Properly Sized Node Pools
# Terraform: Create node pools with appropriate sizing
resource "google_container_node_pool" "primary" {
  name     = "primary-pool"
  cluster  = google_container_cluster.main.name
  location = "us-central1"

  # Autoscaling configuration
  autoscaling {
    min_node_count = 2
    max_node_count = 20
  }

  node_config {
    machine_type = "e2-standard-4" # 4 vCPU, 16GB RAM
    # GKE reserves resources for system daemons:
    # Allocatable = Total - Reserved
    # e2-standard-4: ~3.5 CPU, ~14GB allocatable

    labels = {
      workload = "general"
    }

    # Only pods that tolerate this taint can schedule on this pool
    taint {
      key    = "dedicated"
      value  = "general"
      effect = "NO_SCHEDULE"
    }

    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }

  management {
    auto_repair  = true
    auto_upgrade = true
  }
}

# Separate pool for memory-intensive workloads
resource "google_container_node_pool" "memory_optimized" {
  name     = "memory-pool"
  cluster  = google_container_cluster.main.name
  location = "us-central1"

  autoscaling {
    min_node_count = 0
    max_node_count = 10
  }

  node_config {
    machine_type = "n2-highmem-4" # 4 vCPU, 32GB RAM

    labels = {
      workload = "memory-intensive"
    }

    # Pods need a matching toleration and nodeSelector to land here
    # (see the example after this block)
    taint {
      key    = "workload"
      value  = "memory-intensive"
      effect = "NO_SCHEDULE"
    }
  }
}
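For a workload to land on the tainted memory-pool, it must both tolerate the taint and select the pool's label. A pod-template sketch (the worker container and its request sizes are illustrative; the label and taint values mirror the Terraform above; note that Kubernetes spells the effect NoSchedule, not NO_SCHEDULE):
# Sketch: run a memory-hungry workload on the memory-pool nodes
# (pod-template excerpt; values mirror the Terraform labels/taints above)
spec:
  template:
    spec:
      nodeSelector:
        workload: memory-intensive      # matches the pool's node label
      tolerations:
      - key: "workload"
        operator: "Equal"
        value: "memory-intensive"
        effect: "NoSchedule"            # Kubernetes spelling of NO_SCHEDULE
      containers:
      - name: worker
        resources:
          requests:
            memory: "8Gi"               # illustrative; sized for n2-highmem-4
            cpu: "1"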
Step 7: Implement Resource Quotas and Limit Ranges
# Namespace resource quota
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    pods: "50"
---
# Default limits for pods without explicit resources
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - default:
      memory: "512Mi"
      cpu: "500m"
    defaultRequest:
      memory: "256Mi"
      cpu: "100m"
    max:
      memory: "4Gi"
      cpu: "2"
    min:
      memory: "64Mi"
      cpu: "50m"
    type: Container
Step 8: Set Up Priority Classes
# High priority for critical workloads
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical production workloads"
---
# Default priority
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 0
globalDefault: true
description: "Default priority for all pods"
---
# Low priority for batch jobs (can be preempted)
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: -1000
globalDefault: false
preemptionPolicy: Never
description: "Batch jobs that can be preempted"

# Use priority class in deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      priorityClassName: high-priority
      containers:
      - name: api
        # ...
GKE Resource Debugging Cheatsheet
| Symptom | Likely Cause | Fix |
|---|---|---|
| Pods Pending | Insufficient resources | Right-size requests or add nodes |
| Node MemoryPressure | Overcommitted memory | Reduce limits:requests ratio |
| OOMKilled pods | Memory limit too low | Increase limit based on actual usage |
| Slow scaling | Autoscaler configuration | Reduce scale-down delay |
| Uneven distribution | No PodAntiAffinity | Add topology spread constraints (see example below) |
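For the last cheatsheet row, a minimal sketch of topology spread constraints on the api-server template from Step 4 (the maxSkew of 1 and the ScheduleAnyway policy are illustrative choices):
# Sketch: spread api-server pods across nodes and zones
# (pod-template excerpt; assumes the app: api-server label from Step 4)
spec:
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname        # spread across nodes
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: api-server
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone   # spread across zones
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: api-server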
Useful Debugging Commands
# Real-time resource monitoring
kubectl top pods --containers
kubectl top nodes
# Check why the autoscaler isn't scaling (self-managed autoscaler only;
# on GKE the autoscaler runs on the managed control plane, so check
# pending-pod events and Cloud Logging instead)
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=100
# Find resource hogs
kubectl get pods -A -o json | jq -r '
  .items[] |
  "\(.metadata.namespace)/\(.metadata.name):\n" +
  "  CPU: \(.spec.containers[0].resources.requests.cpu // "none")\n" +
  "  MEM: \(.spec.containers[0].resources.requests.memory // "none")"'
Practice Question
Why does having memory limits much higher than requests (e.g., 256Mi request, 8Gi limit) cause node memory pressure issues?