Questions
AKS pods can't pull images and nodes are NotReady. Diagnose and fix the cluster.
The Scenario
Your AKS cluster is experiencing multiple issues:
$ kubectl get nodes
NAME                              STATUS     ROLES   AGE   VERSION
aks-nodepool1-12345678-vmss0000   NotReady   agent   2d    v1.27.3
aks-nodepool1-12345678-vmss0001   Ready      agent   2d    v1.27.3
aks-nodepool1-12345678-vmss0002   NotReady   agent   2d    v1.27.3
$ kubectl get pods
NAME                READY   STATUS             RESTARTS   AGE
api-server-abc123   0/1     ImagePullBackOff   0          30m
worker-def456       0/1     Pending            0          25m
The cluster was working yesterday. You need to restore service quickly.
The Challenge
Systematically diagnose node failures and pod issues. Understand AKS-specific debugging techniques and implement fixes.
A junior engineer might immediately delete and recreate nodes, restart all pods hoping things fix themselves, or skip systematic debugging. These approaches cause data loss, don't address root causes, and waste time.
A senior engineer follows a systematic approach: check node conditions and events, verify network connectivity (especially for private clusters), validate ACR authentication, check resource quotas, and use Azure Monitor and kubectl logs for diagnosis.
Step 1: Diagnose Node Issues
# Get detailed node status
kubectl describe node aks-nodepool1-12345678-vmss0000
# Look for conditions:
# - MemoryPressure
# - DiskPressure
# - PIDPressure
# - NetworkUnavailable
# - Ready
# Check node events
kubectl get events --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp'
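The condition check above can also be scripted. A minimal jq sketch (assumes jq is installed; against a live cluster you would pipe `kubectl get nodes -o json` instead of the sample payload used here):

```shell
# Sample payload standing in for `kubectl get nodes -o json`
nodes_json='{"items":[{"metadata":{"name":"aks-nodepool1-12345678-vmss0000"},"status":{"conditions":[{"type":"Ready","status":"False","reason":"KubeletNotReady"},{"type":"MemoryPressure","status":"True","reason":"KubeletHasInsufficientMemory"}]}}]}'

# Flag nodes where Ready != True, or any pressure condition that is True
echo "$nodes_json" | jq -r '
  .items[] | .metadata.name as $node
  | .status.conditions[]
  | select((.type == "Ready" and .status != "True")
           or (.type != "Ready" and .status == "True"))
  | "\($node)  \(.type)=\(.status)  (\(.reason))"'
```

Run across all nodes, this surfaces the offending condition (and its reason) without paging through `kubectl describe` output node by node.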
# Common NotReady causes:
# 1. kubelet not running
# 2. Network plugin issues (Azure CNI)
# 3. Disk pressure
# 4. Memory exhaustion

Step 2: Check Node Health via Azure
# List VMSS instances
az vmss list-instances \
--resource-group MC_myResourceGroup_myAKSCluster_eastus \
--name aks-nodepool1-12345678-vmss \
--output table
# Check instance health
az vmss get-instance-view \
--resource-group MC_myResourceGroup_myAKSCluster_eastus \
--name aks-nodepool1-12345678-vmss \
--instance-id 0
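The VMSS commands want an `--instance-id`, but kubectl only shows node names. AKS appends the instance ID to the node name as a base-36 suffix, so the mapping can be done in pure shell; a small sketch (the node names are the ones from this scenario):

```shell
# Derive the VMSS instance ID from an AKS node name.
# AKS node names end in the VMSS instance ID encoded in base 36
# (e.g. ...vmss0000 -> 0, ...vmss000a -> 10).
node_to_instance_id() {
  local suffix="${1##*vmss}"   # strip everything up to and including "vmss"
  echo $((36#$suffix))         # base-36 -> decimal (bash arithmetic)
}

node_to_instance_id aks-nodepool1-12345678-vmss0000   # -> 0
node_to_instance_id aks-nodepool1-12345678-vmss000a   # -> 10
```

This is handy when scripting a targeted reimage of only the NotReady nodes.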
# Reimage unhealthy node (last resort)
az vmss reimage \
--resource-group MC_myResourceGroup_myAKSCluster_eastus \
--name aks-nodepool1-12345678-vmss \
--instance-id 0

Step 3: Diagnose Image Pull Issues
# Check pod events
kubectl describe pod api-server-abc123
# Common ImagePullBackOff causes:
# 1. ACR authentication failure
# 2. Image doesn't exist
# 3. Network connectivity to registry
# 4. Private endpoint DNS resolution
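To see which of these causes you are hitting, the waiting reason and message can be pulled straight out of the pod's container statuses. A jq sketch (sample payload in place of a live `kubectl get pod api-server-abc123 -o json` call; assumes jq is installed):

```shell
# Sample payload standing in for `kubectl get pod <name> -o json`
pod_json='{"status":{"containerStatuses":[{"name":"api","state":{"waiting":{"reason":"ImagePullBackOff","message":"Back-off pulling image \"myacr.azurecr.io/api:v1.2.3\""}}}]}}'

# Print reason and message for every container stuck in a waiting state
echo "$pod_json" | jq -r '
  .status.containerStatuses[]
  | select(.state.waiting != null)
  | "\(.name): \(.state.waiting.reason) - \(.state.waiting.message)"'
```

The message usually names the registry and image, which tells you immediately whether the problem is authentication, a bad tag, or connectivity.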
# Check if AKS can authenticate to ACR
az aks check-acr \
--resource-group myResourceGroup \
--name myAKSCluster \
--acr myacr.azurecr.io
# Attach ACR to AKS (if not done)
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--attach-acr myacr

Step 4: Fix ACR Authentication
// Bicep: Proper ACR integration with AKS
// (assumes aksName, location, and acrName parameters are declared elsewhere)
resource aks 'Microsoft.ContainerService/managedClusters@2023-05-01' = {
  name: aksName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    // ... other properties
  }
}

// Reference the existing registry so it can be used as the role-assignment scope
resource acr 'Microsoft.ContainerRegistry/registries@2023-07-01' existing = {
  name: acrName
}

// Grant the AKS kubelet identity the AcrPull role on the registry
resource acrPullRole 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(acr.id, aks.id, 'acrpull')
  scope: acr
  properties: {
    principalId: aks.properties.identityProfile.kubeletidentity.objectId
    principalType: 'ServicePrincipal'
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      '7f951dda-4ed3-4680-a7ca-43fe172d538d') // AcrPull built-in role
  }
}

Step 5: Debug Network Issues (Private Cluster)
# For private clusters, check DNS resolution
kubectl run debug-pod --image=busybox --rm -it --restart=Never -- nslookup myacr.azurecr.io
# Should resolve to private IP, not public
# If resolving to public IP, private endpoint DNS is misconfigured
# Check egress connectivity
kubectl run debug-pod --image=curlimages/curl --rm -it --restart=Never -- \
curl -v https://myacr.azurecr.io/v2/
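Whether the resolved address is private can be checked mechanically. A sketch of an RFC 1918 test in plain shell (only checks the three private IPv4 ranges; the sample IPs are illustrative):

```shell
# Return 0 if the IPv4 address is in a private (RFC 1918) range
is_private_ip() {
  case "$1" in
    10.*|192.168.*)                         return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*)  return 0 ;;
    *)                                      return 1 ;;
  esac
}

# A working private endpoint resolves the registry to a VNet address:
is_private_ip 10.224.0.5 && echo "private - private endpoint DNS looks correct"
# A public address means the privatelink DNS zone is not linked/configured:
is_private_ip 20.45.0.1  || echo "public - private endpoint DNS misconfigured"
```

Feed it the address from the nslookup output above to turn the "should resolve to private IP" rule into a pass/fail check.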
# Verify Private DNS Zone links
az network private-dns zone show \
--resource-group myResourceGroup \
--name "privatelink.azurecr.io"

Step 6: Check Resource Constraints
# Check node resources
kubectl top nodes
# Check pod resource usage
kubectl top pods
# Check resource quotas
kubectl describe resourcequotas
# Check LimitRanges
kubectl describe limitranges
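Pods with no resource requests at all are a common source of node pressure, because the scheduler packs them without accounting for what they will actually use. A jq sketch to find them (sample payload in place of a live `kubectl get pods --all-namespaces -o json` call):

```shell
# Sample payload: one pod without requests, one with
pods_json='{"items":[{"metadata":{"namespace":"default","name":"worker-def456"},"spec":{"containers":[{"name":"w","resources":{}}]}},{"metadata":{"namespace":"default","name":"api-server-abc123"},"spec":{"containers":[{"name":"api","resources":{"requests":{"cpu":"250m"}}}]}}]}'

# List pods where no container declares any resource requests
echo "$pods_json" | jq -r '
  .items[]
  | select([.spec.containers[].resources.requests] | all(. == null))
  | "\(.metadata.namespace)/\(.metadata.name)"'
```

This is the complement of the query below, which lists the requests that do exist.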
# Find pods causing resource pressure
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[]
    | select(.spec.containers[].resources.requests != null)
    | "\(.metadata.namespace)/\(.metadata.name): \(.spec.containers[].resources.requests)"'

Step 7: Fix Pending Pods
# Check why pod is pending
kubectl describe pod worker-def456
# Common Pending causes:
# 1. Insufficient CPU/memory
# 2. Node affinity/selectors don't match
# 3. Taints not tolerated
# 4. PVC not bound
# Check available resources per node
kubectl describe nodes | grep -A 5 "Allocated resources"
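Whether a Pending pod can ever fit is simple arithmetic once CPU quantities are normalized to millicores. A sketch (handles only the `250m` and whole-CPU integer forms; the numbers are illustrative):

```shell
# Convert a Kubernetes CPU quantity to millicores
# (covers "250m" and whole-CPU integers like "2"; not fractional forms like "0.5")
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

allocatable=$(to_millicores 2)      # node allocatable CPU
requested=$(to_millicores 1900m)    # already requested on the node
pod=$(to_millicores 250m)           # the Pending pod's request

if [ $(( requested + pod )) -gt "$allocatable" ]; then
  echo "does not fit: scale the node pool or lower requests"
fi
```

Plug in the "Allocated resources" numbers from the describe output above to decide between scaling the pool and trimming requests.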
# Scale node pool if needed
az aks nodepool scale \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name nodepool1 \
--node-count 5

Step 8: Enable AKS Diagnostics
// Enable diagnostic settings for AKS
// (assumes aks and logAnalyticsWorkspace resources are declared elsewhere in the template)
resource aksDiagnostics 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'aks-diagnostics'
  scope: aks
  properties: {
    workspaceId: logAnalyticsWorkspace.id
    logs: [
      {
        category: 'kube-apiserver'
        enabled: true
      }
      {
        category: 'kube-controller-manager'
        enabled: true
      }
      {
        category: 'kube-scheduler'
        enabled: true
      }
      {
        category: 'kube-audit'
        enabled: true
      }
      {
        category: 'cluster-autoscaler'
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}

// KQL query to find node issues
KubeNodeInventory
| where TimeGenerated > ago(1h)
| where Status != "Ready"
| project TimeGenerated, Computer, Status, Labels
| order by TimeGenerated desc
// KQL query to find pod failures
KubePodInventory
| where TimeGenerated > ago(1h)
| where PodStatus in ("Failed", "Unknown", "Pending")
| project TimeGenerated, Name, Namespace, PodStatus, ContainerStatusReason
| order by TimeGenerated desc

Step 9: Implement Proper Health Monitoring
# Deployment with proper probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server   # must match the selector, or the Deployment is rejected
    spec:
      containers:
      - name: api
        image: myacr.azurecr.io/api:v1.2.3
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          failureThreshold: 30
          periodSeconds: 10

AKS Debugging Cheatsheet
| Issue | Command | Fix |
|---|---|---|
| Node NotReady | kubectl describe node | Check conditions, reimage if needed |
| ImagePullBackOff | az aks check-acr | Attach ACR, check DNS |
| Pending pods | kubectl describe pod | Scale nodepool, fix selectors |
| CrashLoopBackOff | kubectl logs --previous | Fix application code |
| OOMKilled | kubectl describe pod | Increase memory limits |
Common AKS Issues
| Symptom | Likely Cause | Solution |
|---|---|---|
| All pods pending | Cluster autoscaler disabled | Enable autoscaler |
| Can’t pull from ACR | Missing AcrPull role | az aks update --attach-acr |
| DNS resolution fails | CoreDNS issues | Check CoreDNS pods |
| Private cluster no access | Jump box not configured | Deploy jump box in VNet |
Practice Question
What is the correct way to grant AKS access to pull images from Azure Container Registry?