
A StatefulSet's pods can't mount their persistent volumes. Troubleshoot and fix the issue.


The Scenario

You’re the Site Reliability Engineer at a fintech company running a PostgreSQL database cluster using StatefulSets. It’s 9 AM Monday and you receive alerts:

CRITICAL: Database pods failing to start
3 pods stuck in "Pending" state
Users cannot access account data

When you check the cluster:

$ kubectl get pods -n database
NAME          READY   STATUS    RESTARTS   AGE
postgres-0    1/1     Running   0          5d
postgres-1    1/1     Running   0          5d
postgres-2    0/1     Pending   0          15m
postgres-3    0/1     Pending   0          15m
postgres-4    0/1     Pending   0          15m

You were trying to scale from 2 to 5 replicas for increased capacity. Pods 2-4 won’t start.

$ kubectl describe pod postgres-2 -n database
Events:
  Warning  FailedScheduling  5m  persistentvolumeclaim "data-postgres-2" not found
  Warning  FailedMount       3m  MountVolume.SetUp failed for volume "pvc-xyz" :
           rpc error: code = DeadlineExceeded desc = context deadline exceeded

Your VP of Engineering needs the database scaled up today for a product launch tomorrow.

The Challenge

Debug and fix this persistent volume issue. Walk through:

  1. How do you diagnose PVC/PV binding problems?
  2. What are the common causes of volume mount failures?
  3. How do you verify the storage provisioner is working?
  4. What’s your complete fix and validation process?

Wrong Approach

A junior engineer might delete and recreate the pods repeatedly, manually create PVCs without understanding the root cause, bump storage quotas at random hoping something helps, or delete the entire StatefulSet. This fails because deleting pods doesn't fix PVC provisioning issues, manual PVC creation breaks the StatefulSet's automation, random quota changes don't address the actual problem, and deleting the StatefulSet risks losing production data.

Right Approach

A senior SRE follows a systematic debugging process. Start by checking PVC status to see whether the claims are Pending or Bound: list all PVCs and describe the pending ones, since their events usually name the failure (a storage quota exceeded, a provisioning error). Next, review the StorageClass configuration and list all PVs to see whether any are sitting in the Released state from previous deletions. Finally, check the storage provisioner's logs for errors such as VolumeQuotaExceeded or RequestLimitExceeded. In this scenario, that process surfaces two root causes: Released PVs that were never reclaimed, and an exhausted AWS volume quota.

Phase 1: Check PVC Status

# List all PVCs in the namespace
kubectl get pvc -n database

NAME              STATUS    VOLUME    CAPACITY   STORAGECLASS   AGE
data-postgres-0   Bound     pv-001    100Gi      fast-ssd       5d
data-postgres-1   Bound     pv-002    100Gi      fast-ssd       5d
data-postgres-2   Pending   -         -          fast-ssd       15m
data-postgres-3   Pending   -         -          fast-ssd       15m
data-postgres-4   Pending   -         -          fast-ssd       15m

# Describe the pending PVC to see events
kubectl describe pvc data-postgres-2 -n database

Events:
  Warning  ProvisioningFailed  1m  failed to provision volume:
           StorageQuota exceeded for StorageClass fast-ssd

This reveals the problem: Storage quota exceeded!
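
If a quota error like this appears, it's also worth checking whether a Kubernetes ResourceQuota in the namespace is capping PVC count or requested storage, in addition to any cloud-side limits. A minimal check, assuming only the scenario's namespace:

# Check for namespace-level quotas on storage requests or PVC count
kubectl get resourcequota -n database
kubectl describe resourcequota -n database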

Phase 2: Check StorageClass Configuration

# Get StorageClass details
kubectl get storageclass fast-ssd -o yaml
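
The output isn't reproduced here; the fields to inspect are provisioner, volumeBindingMode, reclaimPolicy, and allowVolumeExpansion. A hypothetical example of what the command might return (values are illustrative, not taken from the scenario):

# Hypothetical StorageClass output -- your values will differ
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com      # which CSI driver provisions the volumes
parameters:
  type: gp3
reclaimPolicy: Retain             # Released PVs are kept after PVC deletion
volumeBindingMode: Immediate      # volumes are created as soon as the PVC exists
allowVolumeExpansion: true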

Phase 3: Check Persistent Volumes

# List all PVs
kubectl get pv

NAME     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM
pv-001   100Gi      RWO            Retain           Bound       database/data-postgres-0
pv-002   100Gi      RWO            Retain           Bound       database/data-postgres-1
pv-003   100Gi      RWO            Retain           Released    database/old-postgres-2
pv-004   100Gi      RWO            Retain           Released    database/old-postgres-3

Key observation: PVs exist but are in “Released” state from previous deletions!
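
On a busy cluster the PV list can be long; a jsonpath one-liner makes it easy to pull out only the Released volumes and the stale claims they still reference (no scenario-specific names assumed):

# List only PVs stuck in Released state, plus the claim each still points at
kubectl get pv -o jsonpath='{range .items[?(@.status.phase=="Released")]}{.metadata.name}{"\t"}{.spec.claimRef.namespace}/{.spec.claimRef.name}{"\n"}{end}'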

Phase 4: Check Storage Provisioner Logs

# For AWS EBS CSI driver
kubectl logs -n kube-system -l app=ebs-csi-controller

ERROR: VolumeQuotaExceeded: Maximum number of volumes (20) reached for instance type m5.xlarge
ERROR: Failed to create volume: RequestLimitExceeded

# Check AWS quotas
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-D18FCD1D  # EBS volume quota

Quota: 20 volumes per instance
Current usage: 20 volumes
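
To see how close the account and the node actually are to those limits, a couple of extra checks help (the instance ID below is a placeholder):

# Count EBS volumes in the current region
aws ec2 describe-volumes --query 'length(Volumes[])'

# Count volumes attached to one node's instance (placeholder instance ID)
aws ec2 describe-volumes \
  --filters Name=attachment.instance-id,Values=i-0123456789abcdef0 \
  --query 'length(Volumes[])'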

Root Causes and Solutions

Root Cause #1: Released PVs Not Reclaimed

Problem: Old PVs are stuck in the "Released" state: their PVCs were deleted earlier, and with a Retain reclaim policy the volumes are never made Available again automatically.

# PVs in Released state still hold data and can't be reused
kubectl get pv pv-003 -o yaml

status:
  phase: Released  # Not Available!
spec:
  claimRef:
    name: data-postgres-2  # Still references old claim
    namespace: database

Solution: Manually reclaim the volumes

# Option 1: Patch the PV to remove claimRef (if data not needed)
kubectl patch pv pv-003 -p '{"spec":{"claimRef": null}}'
kubectl patch pv pv-004 -p '{"spec":{"claimRef": null}}'

# Verify PVs are now Available
kubectl get pv
NAME     STATUS      CLAIM
pv-003   Available   -
pv-004   Available   -

# Option 2: Delete and recreate PV (if using dynamic provisioning)
kubectl delete pv pv-003 pv-004
# Dynamic provisioner will create new volumes
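
Whichever option you choose, the pending PVCs should bind on their own once reusable or provisionable capacity exists. A quick confirmation, using the same namespace as above:

# Watch the pending PVCs bind, then confirm the pods schedule and start
kubectl get pvc -n database -w
kubectl get pods -n database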

Root Cause #2: AWS Volume Quota Exceeded

Problem: The node's EC2 instance has reached its maximum number of attached EBS volumes (20 for m5.xlarge in this scenario).

Solution: Request a quota increase or move to instance types with higher volume limits

# Increase AWS Service Quota
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-D18FCD1D \
  --desired-value 50

# OR: Use nodes with higher volume limits
# m5.xlarge: 20 volumes
# m5.2xlarge: 27 volumes
# m5.4xlarge: 27 volumes
# c5.9xlarge: 50 volumes
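
You can also read the per-node attach limit the EBS CSI driver is actually advertising straight from the cluster, via the CSINode objects (a sketch; no scenario-specific names assumed):

# Attachable volume count each node reports for the EBS CSI driver
kubectl get csinode -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.drivers[?(@.name=="ebs.csi.aws.com")].allocatable.count}{"\n"}{end}'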

Complete Working StatefulSet with Persistent Storage

---
# StorageClass for database workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: database-storage
provisioner: ebs.csi.aws.com  # AWS EBS CSI driver
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:123456789:key/abc-123"
volumeBindingMode: Immediate
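# Note: with the zone anti-affinity used in the StatefulSet below,
# volumeBindingMode: WaitForFirstConsumer is often the safer choice --
# it delays volume creation until the pod is scheduled, so the volume
# is provisioned in the same zone as the pod that will mount it.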
allowVolumeExpansion: true
reclaimPolicy: Retain  # Don't delete PV when PVC is deleted

---
# StatefulSet for PostgreSQL
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: database
spec:
  serviceName: postgres-headless
  replicas: 5
  selector:
    matchLabels:
      app: postgres

  # Volume claim templates - creates PVC for each pod
  volumeClaimTemplates:
  - metadata:
      name: data
      labels:
        app: postgres
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: database-storage
      resources:
        requests:
          storage: 100Gi

  template:
    metadata:
      labels:
        app: postgres
    spec:
      # Ensure pods spread across availability zones
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: postgres
            topologyKey: topology.kubernetes.io/zone

      containers:
      - name: postgres
        image: postgres:15-alpine
        ports:
        - containerPort: 5432
          name: postgres

        # Mount the persistent volume
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
          subPath: pgdata  # Use subdirectory to avoid permission issues

        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata

        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"

        # Probes for health checking
        livenessProbe:
          exec:
            command:
            - pg_isready
            - -U
            - postgres
          initialDelaySeconds: 30
          periodSeconds: 10

        readinessProbe:
          exec:
            command:
            - pg_isready
            - -U
            - postgres
          initialDelaySeconds: 5
          periodSeconds: 5
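
The manifest above references a headless Service named postgres-headless that isn't shown. A minimal sketch of that Service, plus the commands to roll everything out and confirm the claims bind (the file name postgres.yaml is a placeholder):

---
# Minimal headless Service assumed by serviceName: postgres-headless
apiVersion: v1
kind: Service
metadata:
  name: postgres-headless
  namespace: database
spec:
  clusterIP: None          # headless: gives each pod a stable DNS record
  selector:
    app: postgres
  ports:
  - port: 5432
    name: postgres

# Apply and verify
kubectl apply -f postgres.yaml
kubectl rollout status statefulset/postgres -n database
kubectl get pvc -n database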

Storage Troubleshooting Checklist

✓ Check PVC status: kubectl get pvc -n <namespace>
✓ Describe PVC for events: kubectl describe pvc <name>
✓ Check PV availability: kubectl get pv
✓ Verify StorageClass exists: kubectl get storageclass
✓ Check storage provisioner pods running
✓ Review provisioner logs for errors
✓ Verify cloud provider quotas (EBS volume limits, IOPS, etc.)
✓ Check IAM permissions for CSI driver
✓ Ensure nodes have capacity for new volumes
✓ Test with a simple PVC/Pod before the StatefulSet (see the sketch after this list)
✓ Check for Released PVs that need reclaiming
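
The "simple PVC/Pod" test from the checklist can be as small as this (resource names are hypothetical; the StorageClass matches the manifest above):

---
# Throwaway PVC to verify the provisioner works at all
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-smoke-test
  namespace: database
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: database-storage
  resources:
    requests:
      storage: 1Gi
---
# Throwaway pod that mounts the claim and writes to it
apiVersion: v1
kind: Pod
metadata:
  name: storage-smoke-test
  namespace: database
spec:
  containers:
  - name: test
    image: busybox:1.36
    command: ["sh", "-c", "echo ok > /data/probe && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: storage-smoke-test

# Clean up when done
kubectl delete pod/storage-smoke-test pvc/storage-smoke-test -n database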

Practice Question

You delete a StatefulSet but keep the PVCs. Later, you recreate the StatefulSet with the same name. What happens to the data?