Questions
AKS pods can't pull images and nodes are NotReady. Diagnose and fix the cluster.
The Scenario
Your AKS cluster is experiencing multiple issues:
$ kubectl get nodes
NAME                              STATUS     ROLES   AGE   VERSION
aks-nodepool1-12345678-vmss0000   NotReady   agent   2d    v1.27.3
aks-nodepool1-12345678-vmss0001   Ready      agent   2d    v1.27.3
aks-nodepool1-12345678-vmss0002   NotReady   agent   2d    v1.27.3
$ kubectl get pods
NAME                READY   STATUS             RESTARTS   AGE
api-server-abc123   0/1     ImagePullBackOff   0          30m
worker-def456       0/1     Pending            0          25m
The cluster was working yesterday. You need to restore service quickly.
The Challenge
Systematically diagnose node failures and pod issues. Understand AKS-specific debugging techniques and implement fixes.
A junior engineer might immediately delete and recreate nodes, restart all pods hoping things fix themselves, or skip systematic debugging. These approaches cause data loss, don't address root causes, and waste time.
A senior engineer follows a systematic approach: check node conditions and events, verify network connectivity (especially for private clusters), validate ACR authentication, check resource quotas, and use Azure Monitor and kubectl logs for diagnosis.
Step 1: Diagnose Node Issues
# Get detailed node status
kubectl describe node aks-nodepool1-12345678-vmss0000
# Look for conditions:
# - MemoryPressure
# - DiskPressure
# - PIDPressure
# - NetworkUnavailable
# - Ready
# Check node events
kubectl get events --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp'
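The condition check above can also be scripted. A minimal jq sketch (assumes jq is installed; against a live cluster you would pipe `kubectl get nodes -o json` instead of the sample payload used here):

```shell
# Sample payload standing in for `kubectl get nodes -o json`
nodes_json='{"items":[{"metadata":{"name":"aks-nodepool1-12345678-vmss0000"},"status":{"conditions":[{"type":"Ready","status":"False","reason":"KubeletNotReady"},{"type":"MemoryPressure","status":"True","reason":"KubeletHasInsufficientMemory"}]}}]}'

# Flag nodes where Ready != True, or any pressure condition that is True
echo "$nodes_json" | jq -r '
  .items[] | .metadata.name as $node
  | .status.conditions[]
  | select((.type == "Ready" and .status != "True")
           or (.type != "Ready" and .status == "True"))
  | "\($node)  \(.type)=\(.status)  (\(.reason))"'
```

Run across all nodes, this surfaces the offending condition (and its reason) without paging through `kubectl describe` output node by node.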
# Common NotReady causes:
# 1. kubelet not running
# 2. Network plugin issues (Azure CNI)
# 3. Disk pressure
# 4. Memory exhaustion

Step 2: Check Node Health via Azure
# List VMSS instances
az vmss list-instances \
--resource-group MC_myResourceGroup_myAKSCluster_eastus \
--name aks-nodepool1-12345678-vmss \
--output table
# Check instance health
az vmss get-instance-view \
--resource-group MC_myResourceGroup_myAKSCluster_eastus \
--name aks-nodepool1-12345678-vmss \
--instance-id 0
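The VMSS commands want an `--instance-id`, but kubectl only shows node names. AKS appends the instance ID to the node name as a base-36 suffix, so the mapping can be done in pure shell; a small sketch (the node names are the ones from this scenario):

```shell
# Derive the VMSS instance ID from an AKS node name.
# AKS node names end in the VMSS instance ID encoded in base 36
# (e.g. ...vmss0000 -> 0, ...vmss000a -> 10).
node_to_instance_id() {
  local suffix="${1##*vmss}"   # strip everything up to and including "vmss"
  echo $((36#$suffix))         # base-36 -> decimal (bash arithmetic)
}

node_to_instance_id aks-nodepool1-12345678-vmss0000   # -> 0
node_to_instance_id aks-nodepool1-12345678-vmss000a   # -> 10
```

This is handy when scripting a targeted reimage of only the NotReady nodes.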
# Reimage unhealthy node (last resort)
az vmss reimage \
--resource-group MC_myResourceGroup_myAKSCluster_eastus \
--name aks-nodepool1-12345678-vmss \
--instance-id 0

Step 3: Diagnose Image Pull Issues
# Check pod events
kubectl describe pod api-server-abc123
# Common ImagePullBackOff causes:
# 1. ACR authentication failure
# 2. Image doesn't exist
# 3. Network connectivity to registry
# 4. Private endpoint DNS resolution
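To see which of these causes you are hitting, the waiting reason and message can be pulled straight out of the pod's container statuses. A jq sketch (sample payload in place of a live `kubectl get pod api-server-abc123 -o json` call; assumes jq is installed):

```shell
# Sample payload standing in for `kubectl get pod <name> -o json`
pod_json='{"status":{"containerStatuses":[{"name":"api","state":{"waiting":{"reason":"ImagePullBackOff","message":"Back-off pulling image \"myacr.azurecr.io/api:v1.2.3\""}}}]}}'

# Print reason and message for every container stuck in a waiting state
echo "$pod_json" | jq -r '
  .status.containerStatuses[]
  | select(.state.waiting != null)
  | "\(.name): \(.state.waiting.reason) - \(.state.waiting.message)"'
```

The message usually names the registry and image, which tells you immediately whether the problem is authentication, a bad tag, or connectivity.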
# Check if AKS can authenticate to ACR
az aks check-acr \
--resource-group myResourceGroup \
--name myAKSCluster \
--acr myacr.azurecr.io
# Attach ACR to AKS (if not done)
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--attach-acr myacr

Step 4: Fix ACR Authentication
// Bicep: Proper ACR integration with AKS
// (assumes aksName, location, and acrName parameters are declared elsewhere)
resource aks 'Microsoft.ContainerService/managedClusters@2023-05-01' = {
  name: aksName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    // ... other properties
  }
}

// Reference the existing registry so it can be used as the role-assignment scope
resource acr 'Microsoft.ContainerRegistry/registries@2023-07-01' existing = {
  name: acrName
}

// Grant the AKS kubelet identity the AcrPull role on the registry
resource acrPullRole 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(acr.id, aks.id, 'acrpull')
  scope: acr
  properties: {
    principalId: aks.properties.identityProfile.kubeletidentity.objectId
    principalType: 'ServicePrincipal'
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      '7f951dda-4ed3-4680-a7ca-43fe172d538d') // AcrPull built-in role
  }
}

Step 5: Debug Network Issues (Private Cluster)
# For private clusters, check DNS resolution
kubectl run debug-pod --image=busybox --rm -it --restart=Never -- nslookup myacr.azurecr.io
# Should resolve to private IP, not public
# If resolving to public IP, private endpoint DNS is misconfigured
# Check egress connectivity
kubectl run debug-pod --image=curlimages/curl --rm -it --restart=Never -- \
curl -v https://myacr.azurecr.io/v2/
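Whether the resolved address is private can be checked mechanically. A sketch of an RFC 1918 test in plain shell (only checks the three private IPv4 ranges; the sample IPs are illustrative):

```shell
# Return 0 if the IPv4 address is in a private (RFC 1918) range
is_private_ip() {
  case "$1" in
    10.*|192.168.*)                         return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*)  return 0 ;;
    *)                                      return 1 ;;
  esac
}

# A working private endpoint resolves the registry to a VNet address:
is_private_ip 10.224.0.5 && echo "private - private endpoint DNS looks correct"
# A public address means the privatelink DNS zone is not linked/configured:
is_private_ip 20.45.0.1  || echo "public - private endpoint DNS misconfigured"
```

Feed it the address from the nslookup output above to turn the "should resolve to private IP" rule into a pass/fail check.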
# Verify Private DNS Zone links
az network private-dns zone show \
--resource-group myResourceGroup \
--name "privatelink.azurecr.io"

Step 6: Check Resource Constraints
# Check node resources
kubectl top nodes
# Check pod resource usage
kubectl top pods
# Check resource quotas
kubectl describe resourcequotas
# Check LimitRanges
kubectl describe limitranges
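Pods with no resource requests at all are a common source of node pressure, because the scheduler packs them without accounting for what they will actually use. A jq sketch to find them (sample payload in place of a live `kubectl get pods --all-namespaces -o json` call):

```shell
# Sample payload: one pod without requests, one with
pods_json='{"items":[{"metadata":{"namespace":"default","name":"worker-def456"},"spec":{"containers":[{"name":"w","resources":{}}]}},{"metadata":{"namespace":"default","name":"api-server-abc123"},"spec":{"containers":[{"name":"api","resources":{"requests":{"cpu":"250m"}}}]}}]}'

# List pods where no container declares any resource requests
echo "$pods_json" | jq -r '
  .items[]
  | select([.spec.containers[].resources.requests] | all(. == null))
  | "\(.metadata.namespace)/\(.metadata.name)"'
```

This is the complement of the query below, which lists the requests that do exist.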
# Find pods causing resource pressure
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[]
    | select(.spec.containers[].resources.requests != null)
    | "\(.metadata.namespace)/\(.metadata.name): \(.spec.containers[].resources.requests)"'

Step 7: Fix Pending Pods
# Check why pod is pending
kubectl describe pod worker-def456
# Common Pending causes:
# 1. Insufficient CPU/memory
# 2. Node affinity/selectors don't match
# 3. Taints not tolerated
# 4. PVC not bound
# Check available resources per node
kubectl describe nodes | grep -A 5 "Allocated resources"
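Whether a Pending pod can ever fit is simple arithmetic once CPU quantities are normalized to millicores. A sketch (handles only the `250m` and whole-CPU integer forms; the numbers are illustrative):

```shell
# Convert a Kubernetes CPU quantity to millicores
# (covers "250m" and whole-CPU integers like "2"; not fractional forms like "0.5")
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

allocatable=$(to_millicores 2)      # node allocatable CPU
requested=$(to_millicores 1900m)    # already requested on the node
pod=$(to_millicores 250m)           # the Pending pod's request

if [ $(( requested + pod )) -gt "$allocatable" ]; then
  echo "does not fit: scale the node pool or lower requests"
fi
```

Plug in the "Allocated resources" numbers from the describe output above to decide between scaling the pool and trimming requests.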
# Scale node pool if needed
az aks nodepool scale \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name nodepool1 \
--node-count 5

Step 8: Enable AKS Diagnostics
// Enable diagnostic settings for AKS
// (assumes aks and logAnalyticsWorkspace resources are declared elsewhere in the template)
resource aksDiagnostics 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'aks-diagnostics'
  scope: aks
  properties: {
    workspaceId: logAnalyticsWorkspace.id
    logs: [
      {
        category: 'kube-apiserver'
        enabled: true
      }
      {
        category: 'kube-controller-manager'
        enabled: true
      }
      {
        category: 'kube-scheduler'
        enabled: true
      }
      {
        category: 'kube-audit'
        enabled: true
      }
      {
        category: 'cluster-autoscaler'
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}

// KQL query to find node issues
KubeNodeInventory
| where TimeGenerated > ago(1h)
| where Status != "Ready"
| project TimeGenerated, Computer, Status, Labels
| order by TimeGenerated desc
// KQL query to find pod failures
KubePodInventory
| where TimeGenerated > ago(1h)
| where PodStatus in ("Failed", "Unknown", "Pending")
| project TimeGenerated, Name, Namespace, PodStatus, ContainerStatusReason
| order by TimeGenerated desc

Step 9: Implement Proper Health Monitoring
# Deployment with proper probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server   # must match the selector, or the Deployment is rejected
    spec:
      containers:
      - name: api
        image: myacr.azurecr.io/api:v1.2.3
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          failureThreshold: 30
          periodSeconds: 10

AKS Debugging Cheatsheet
| Issue | Command | Fix |
|---|---|---|
| Node NotReady | kubectl describe node | Check conditions, reimage if needed |
| ImagePullBackOff | az aks check-acr | Attach ACR, check DNS |
| Pending pods | kubectl describe pod | Scale nodepool, fix selectors |
| CrashLoopBackOff | kubectl logs --previous | Fix application code |
| OOMKilled | kubectl describe pod | Increase memory limits |
Common AKS Issues
| Symptom | Likely Cause | Solution |
|---|---|---|
| All pods pending | Cluster autoscaler disabled | Enable autoscaler |
| Can’t pull from ACR | Missing AcrPull role | az aks update --attach-acr |
| DNS resolution fails | CoreDNS issues | Check CoreDNS pods |
| Private cluster no access | Jump box not configured | Deploy jump box in VNet |
Practice Question
What is the correct way to grant AKS access to pull images from Azure Container Registry?