AKS pods can't pull images and nodes are NotReady. Diagnose and fix the cluster.

The Scenario

Your AKS cluster is experiencing multiple issues:

$ kubectl get nodes
NAME                              STATUS     ROLES   AGE   VERSION
aks-nodepool1-12345678-vmss0000   NotReady   agent   2d    v1.27.3
aks-nodepool1-12345678-vmss0001   Ready      agent   2d    v1.27.3
aks-nodepool1-12345678-vmss0002   NotReady   agent   2d    v1.27.3

$ kubectl get pods
NAME                    READY   STATUS             RESTARTS   AGE
api-server-abc123       0/1     ImagePullBackOff   0          30m
worker-def456           0/1     Pending            0          25m

The cluster was working yesterday. You need to restore service quickly.

The Challenge

Systematically diagnose node failures and pod issues. Understand AKS-specific debugging techniques and implement fixes.

Wrong Approach

A junior engineer might immediately delete and recreate nodes, restart every pod in the hope that things fix themselves, or skip systematic debugging altogether. These approaches risk data loss, leave the root cause unaddressed, and waste time.

Right Approach

A senior engineer follows a systematic approach: check node conditions and events, verify network connectivity (especially for private clusters), validate ACR authentication, check resource quotas, and use Azure Monitor and kubectl logs for diagnosis.

Step 1: Diagnose Node Issues

# Get detailed node status
kubectl describe node aks-nodepool1-12345678-vmss0000

# Look for conditions:
# - MemoryPressure
# - DiskPressure
# - PIDPressure
# - NetworkUnavailable
# - Ready

# Check node events
kubectl get events --field-selector involvedObject.kind=Node --sort-by='.lastTimestamp'

# Common NotReady causes:
# 1. kubelet not running
# 2. Network plugin issues (Azure CNI)
# 3. Disk pressure
# 4. Memory exhaustion
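
With several nodes down, it helps to pull out just the NotReady names before describing each one. The following is a minimal sketch, not part of the original walkthrough: the function names are hypothetical, and it assumes the default column layout of `kubectl get nodes --no-headers` (name first, status second).

```shell
#!/usr/bin/env bash

# Print the names of nodes whose STATUS column does not start with "Ready".
# Input: the output of `kubectl get nodes --no-headers` on stdin.
# A cordoned but healthy node ("Ready,SchedulingDisabled") is not listed.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Hypothetical wrapper: describe every NotReady node in turn.
describe_not_ready_nodes() {
  kubectl get nodes --no-headers | not_ready_nodes | while read -r node; do
    echo "=== ${node} ==="
    kubectl describe node "${node}"
  done
}
```

Calling `describe_not_ready_nodes` against the scenario cluster would describe vmss0000 and vmss0002 and skip the healthy vmss0001.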

Step 2: Check Node Health via Azure

# List VMSS instances
az vmss list-instances \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-nodepool1-12345678-vmss \
  --output table

# Check instance health
az vmss get-instance-view \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-nodepool1-12345678-vmss \
  --instance-id 0

# Reimage unhealthy node (last resort)
az vmss reimage \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --name aks-nodepool1-12345678-vmss \
  --instance-id 0

Step 3: Diagnose Image Pull Issues

# Check pod events
kubectl describe pod api-server-abc123

# Common ImagePullBackOff causes:
# 1. ACR authentication failure
# 2. Image doesn't exist
# 3. Network connectivity to registry
# 4. Private endpoint DNS resolution

# Check if AKS can authenticate to ACR
az aks check-acr \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --acr myacr.azurecr.io

# Attach ACR to AKS (if not done)
az aks update \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --attach-acr myacr
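
The event message shown by `kubectl describe pod` usually points at one of the four causes listed above. As a rough sketch, the helper below maps a message to a likely cause. The function name is hypothetical, and the substrings it matches are assumptions based on typical container runtime errors, so treat the result as a hint rather than a diagnosis.

```shell
#!/usr/bin/env bash

# Map an image-pull error message to the most likely cause.
# The matched substrings are assumed, commonly seen runtime errors.
classify_pull_error() {
  local msg
  # Lowercase the message so matching is case-insensitive.
  msg="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$msg" in
    *unauthorized*|*"authentication required"*) echo "acr-auth" ;;
    *"no such host"*)                           echo "dns" ;;
    *"i/o timeout"*|*"connection refused"*)     echo "network" ;;
    *"not found"*|*"manifest unknown"*)         echo "image-missing" ;;
    *)                                          echo "unknown" ;;
  esac
}
```

For example, `classify_pull_error "dial tcp: lookup myacr.azurecr.io: no such host"` prints `dns`, which suggests moving on to the private endpoint DNS checks rather than re-attaching the registry.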

Step 4: Fix ACR Authentication

// Bicep: Proper ACR integration with AKS
resource aks 'Microsoft.ContainerService/managedClusters@2023-05-01' = {
  name: aksName
  location: location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    // ... other properties
  }
}

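// Reference the existing registry so it can be used as the role assignment scope
// (acrName is an assumed parameter holding the registry name)
resource acr 'Microsoft.ContainerRegistry/registries@2023-07-01' existing = {
  name: acrName
}

// Grant AKS kubelet identity AcrPull role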
resource acrPullRole 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(acr.id, aks.id, 'acrpull')
  scope: acr
  properties: {
    principalId: aks.properties.identityProfile.kubeletidentity.objectId
    principalType: 'ServicePrincipal'
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions',
      '7f951dda-4ed3-4680-a7ca-43fe172d538d')  // AcrPull
  }
}

Step 5: Debug Network Issues (Private Cluster)

# For private clusters, check DNS resolution
kubectl run debug-pod --image=busybox --rm -it --restart=Never -- nslookup myacr.azurecr.io

# Should resolve to private IP, not public
# If resolving to public IP, private endpoint DNS is misconfigured

# Check egress connectivity
kubectl run debug-pod --image=curlimages/curl --rm -it --restart=Never -- \
  curl -v https://myacr.azurecr.io/v2/

# Verify the Private DNS Zone is linked to the cluster VNet
az network private-dns link vnet list \
  --resource-group myResourceGroup \
  --zone-name "privatelink.azurecr.io" \
  --output table
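
The private-versus-public check on the resolved address can be scripted. This is a small sketch with a hypothetical function name; it assumes the private endpoint sits in an RFC 1918 range (10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16) and only inspects IPv4 addresses.

```shell
#!/usr/bin/env bash

# Return 0 if the IPv4 address is in an RFC 1918 private range, 1 otherwise.
is_private_ip() {
  case "$1" in
    10.*|192.168.*)                        return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *)                                     return 1 ;;
  esac
}
```

Feeding it the address returned by the in-cluster `nslookup`, for instance `is_private_ip 10.0.1.4 && echo "private endpoint in use"`, makes a misconfigured DNS zone obvious at a glance.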

Step 6: Check Resource Constraints

# Check node resources
kubectl top nodes

# Check pod resource usage
kubectl top pods

# Check resource quotas
kubectl describe resourcequotas

# Check LimitRanges
kubectl describe limitranges

# List per-container resource requests across all namespaces
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | . as $pod | .spec.containers[] | select(.resources.requests != null) |
  "\($pod.metadata.namespace)/\($pod.metadata.name) [\(.name)]: \(.resources.requests)"'

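To spot nodes under pressure without reading the whole `kubectl top nodes` table, the output can be filtered by a utilization threshold. A minimal sketch follows; the function name and the 80% default are assumptions, and it expects the usual five columns (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).

```shell
#!/usr/bin/env bash

# Print nodes whose CPU% or MEMORY% is at or above a threshold (default 80).
# Input: the output of `kubectl top nodes --no-headers` on stdin.
# Nodes reporting "<unknown>" evaluate to 0 and are therefore not flagged.
hot_nodes() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '
    {
      cpu = $3; mem = $5
      sub(/%/, "", cpu); sub(/%/, "", mem)
      if (cpu + 0 >= t || mem + 0 >= t)
        print $1, "cpu=" cpu "%", "mem=" mem "%"
    }'
}
```

Usage would look like `kubectl top nodes --no-headers | hot_nodes 90`.
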
Step 7: Fix Pending Pods

# Check why pod is pending
kubectl describe pod worker-def456

# Common Pending causes:
# 1. Insufficient CPU/memory
# 2. Node affinity/selectors don't match
# 3. Taints not tolerated
# 4. PVC not bound

# Check available resources per node
kubectl describe nodes | grep -A 5 "Allocated resources"

# Scale node pool if needed
az aks nodepool scale \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --node-count 5
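
A single FailedScheduling message often combines several of the causes above, for example some nodes short on CPU and others tainted. The sketch below reports every cause it recognizes in one message. The function name is hypothetical and the matched phrases are assumptions about typical scheduler wording, which can differ between Kubernetes versions.

```shell
#!/usr/bin/env bash

# Print every Pending cause recognized in a scheduler event message.
# The matched phrases are assumed, commonly seen scheduler wording.
pending_reasons() {
  local msg="$1" found=0
  case "$msg" in *Insufficient*)
    echo "insufficient-resources"; found=1 ;; esac
  case "$msg" in *"node affinity"*|*"node selector"*)
    echo "affinity-or-selector"; found=1 ;; esac
  case "$msg" in *taint*)
    echo "untolerated-taint"; found=1 ;; esac
  case "$msg" in *PersistentVolumeClaim*)
    echo "pvc-unbound"; found=1 ;; esac
  [ "$found" -eq 1 ] || echo "unknown"
}
```

Only the `insufficient-resources` result calls for scaling the node pool; the other causes need a change to the pod spec, node taints, or storage.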

Step 8: Enable AKS Diagnostics

// Enable diagnostic settings for AKS
resource aksDiagnostics 'Microsoft.Insights/diagnosticSettings@2021-05-01-preview' = {
  name: 'aks-diagnostics'
  scope: aks
  properties: {
    workspaceId: logAnalyticsWorkspace.id
    logs: [
      {
        category: 'kube-apiserver'
        enabled: true
      }
      {
        category: 'kube-controller-manager'
        enabled: true
      }
      {
        category: 'kube-scheduler'
        enabled: true
      }
      {
        category: 'kube-audit'
        enabled: true
      }
      {
        category: 'cluster-autoscaler'
        enabled: true
      }
    ]
    metrics: [
      {
        category: 'AllMetrics'
        enabled: true
      }
    ]
  }
}

// KQL query to find node issues
KubeNodeInventory
| where TimeGenerated > ago(1h)
| where Status != "Ready"
| project TimeGenerated, Computer, Status, Labels
| order by TimeGenerated desc

// KQL query to find pod failures
KubePodInventory
| where TimeGenerated > ago(1h)
| where PodStatus in ("Failed", "Unknown", "Pending")
| project TimeGenerated, Name, Namespace, PodStatus, ContainerStatusReason
| order by TimeGenerated desc

Step 9: Implement Proper Health Monitoring

# Deployment with proper probes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: api
        image: myacr.azurecr.io/api:v1.2.3
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 15
          periodSeconds: 20
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          failureThreshold: 30
          periodSeconds: 10
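
The startup probe settings determine how long a slow container may take to start before the kubelet restarts it: roughly failureThreshold multiplied by periodSeconds. A tiny sketch of that arithmetic, with a hypothetical function name:

```shell
#!/usr/bin/env bash

# Approximate startup budget in seconds: failureThreshold * periodSeconds.
startup_budget_seconds() {
  local failure_threshold="$1" period_seconds="$2"
  echo $(( failure_threshold * period_seconds ))
}
```

With the values in the manifest above, `startup_budget_seconds 30 10` prints `300`, so the container has about five minutes to pass `/health` once. Liveness and readiness checks are held back until the startup probe succeeds.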

AKS Debugging Cheatsheet

Issue             Command                  Fix
Node NotReady     kubectl describe node    Check conditions, reimage if needed
ImagePullBackOff  az aks check-acr         Attach ACR, check DNS
Pending pods      kubectl describe pod     Scale nodepool, fix selectors
CrashLoopBackOff  kubectl logs --previous  Fix application code
OOMKilled         kubectl describe pod     Increase memory limits

Common AKS Issues

Symptom                    Likely Cause                 Solution
All pods pending           Cluster autoscaler disabled  Enable autoscaler
Can’t pull from ACR        Missing AcrPull role         az aks update --attach-acr
DNS resolution fails       CoreDNS issues               Check CoreDNS pods
Private cluster no access  Jump box not configured      Deploy jump box in VNet

Practice Question

What is the correct way to grant AKS access to pull images from Azure Container Registry?