Questions
A StatefulSet's pods can't mount their persistent volumes. Troubleshoot and fix the issue.
The Scenario
You’re the Site Reliability Engineer at a fintech company running a PostgreSQL database cluster using StatefulSets. It’s 9 AM Monday and you receive alerts:
CRITICAL: Database pods failing to start
3 pods stuck in "Pending" state
Users cannot access account data
When you check the cluster:
$ kubectl get pods -n database
NAME         READY   STATUS    RESTARTS   AGE
postgres-0   1/1     Running   0          5d
postgres-1   1/1     Running   0          5d
postgres-2   0/1     Pending   0          15m
postgres-3   0/1     Pending   0          15m
postgres-4   0/1     Pending   0          15m
You were trying to scale from 2 to 5 replicas for increased capacity. Pods 2-4 won’t start.
$ kubectl describe pod postgres-2 -n database
Events:
Warning FailedScheduling 5m persistentvolumeclaim "data-postgres-2" not found
Warning FailedMount 3m MountVolume.SetUp failed for volume "pvc-xyz" :
rpc error: code = DeadlineExceeded desc = context deadline exceeded
Your VP of Engineering needs the database scaled up today for a product launch tomorrow.
The Challenge
Debug and fix this persistent volume issue. Walk through:
- How do you diagnose PVC/PV binding problems?
- What are the common causes of volume mount failures?
- How do you verify the storage provisioner is working?
- What’s your complete fix and validation process?
A junior engineer might delete and recreate the pods repeatedly, manually create PVCs without understanding the root cause, increase storage quota randomly hoping it helps, or delete the StatefulSet and lose all data. This fails because deleting pods doesn't fix PVC provisioning issues, manual PVC creation breaks StatefulSet automation, random quota changes don't address the actual problem, and you might lose production data.
A senior SRE follows a systematic debugging process, starting with the PVC status to see whether the claims are Pending or Bound. List all PVCs and describe the pending ones; their events reveal issues such as an exceeded storage quota or provisioning failures. Check the StorageClass configuration and list all PVs to see whether any are sitting in the Released state from previous deletions. Finally, check the storage provisioner logs for errors like VolumeQuotaExceeded or RequestLimitExceeded. In this scenario, that process surfaces two problems: PVs stuck in the Released state that were never reclaimed, and an AWS volume limit that has been hit.
Phase 1: Check PVC Status
# List all PVCs in the namespace
kubectl get pvc -n database
NAME              STATUS    VOLUME   CAPACITY   STORAGECLASS   AGE
data-postgres-0   Bound     pv-001   100Gi      fast-ssd       5d
data-postgres-1   Bound     pv-002   100Gi      fast-ssd       5d
data-postgres-2   Pending   -        -          fast-ssd       15m
data-postgres-3   Pending   -        -          fast-ssd       15m
data-postgres-4   Pending   -        -          fast-ssd       15m
# Describe the pending PVC to see events
kubectl describe pvc data-postgres-2 -n database
Events:
Warning ProvisioningFailed 1m failed to provision volume:
StorageQuota exceeded for StorageClass fast-ssd
This reveals the problem: Storage quota exceeded!
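A quota error like this can come either from a namespace-level ResourceQuota or from the cloud provider itself. Checking the namespace side first is cheap (the commands below assume the quota object, if any, lives in the database namespace):
# Check whether a namespace ResourceQuota caps storage requests for this StorageClass
kubectl get resourcequota -n database
kubectl describe resourcequota -n database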
Phase 2: Check StorageClass Configuration
# Get StorageClass details
kubectl get storageclass fast-ssd -o yaml
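The exact output depends on your cluster; the manifest below is only an illustration of the shape to expect. The fields worth checking are provisioner, reclaimPolicy, and volumeBindingMode:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true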
Phase 3: Check Persistent Volumes
# List all PVs
kubectl get pv
NAME     CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM
pv-001   100Gi      RWO            Retain           Bound      database/data-postgres-0
pv-002   100Gi      RWO            Retain           Bound      database/data-postgres-1
pv-003   100Gi      RWO            Retain           Released   database/data-postgres-2
pv-004   100Gi      RWO            Retain           Released   database/data-postgres-3
Key observation: PVs exist but are stuck in the "Released" state from a previous deletion!
Phase 4: Check Storage Provisioner Logs
# For AWS EBS CSI driver
kubectl logs -n kube-system -l app=ebs-csi-controller
ERROR: VolumeQuotaExceeded: Maximum number of volumes (20) reached for instance type m5.xlarge
ERROR: Failed to create volume: RequestLimitExceeded
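If the logs command returns nothing at all, confirm the controller pods are actually running before digging further (the label below matches the selector used above; adjust it to your installation):
# Verify the EBS CSI controller pods are up
kubectl get pods -n kube-system -l app=ebs-csi-controller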
# Check AWS quotas
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-D18FCD1D # EBS volume quota
Quota: 20 volumes per instance
Current usage: 20 volumes
Root Causes and Solutions
Root Cause #1: Released PVs Not Reclaimed
Problem: Old PVs are stuck in the "Released" state because their PVCs were deleted while the reclaim policy was Retain. Kubernetes keeps the data but never makes these PVs Available again on its own.
# PVs in Released state still hold data and can't be reused
kubectl get pv pv-003 -o yaml
status:
  phase: Released          # Not Available!
spec:
  claimRef:
    name: data-postgres-2  # Still references the old claim
    namespace: database
Solution: Manually reclaim the volumes
# Option 1: Clear the claimRef so the PV becomes Available again
# (the new PVC will bind to it and see whatever data the old pod left behind)
kubectl patch pv pv-003 -p '{"spec":{"claimRef": null}}'
kubectl patch pv pv-004 -p '{"spec":{"claimRef": null}}'
# Verify PVs are now Available
kubectl get pv
NAME     STATUS      CLAIM
pv-003   Available   -
pv-004   Available   -
# Option 2: Delete the PV objects (if using dynamic provisioning and the old data is not needed)
kubectl delete pv pv-003 pv-004
# The dynamic provisioner will create new volumes for the pending PVCs.
# Note: with reclaimPolicy Retain, the underlying EBS volumes still exist in
# AWS and must be deleted there, or they keep counting against volume limits.
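Once the claimRef is cleared (or the PVs recreated), any pending claim that matches a freed volume's capacity, access mode, and StorageClass should bind within a few seconds; a claim still left Pending (here, data-postgres-4) points at the second root cause below:
# Freed PVs should bind to matching pending claims
kubectl get pvc -n database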
Root Cause #2: AWS Volume Quota Exceeded
Problem: The node has hit its EBS volume attachment limit (20 volumes for this m5.xlarge, per the provisioner logs). That limit is determined by the instance type, so raising an account-level quota will not help once the node itself is out of attachment slots.
Solution: Use instance types with higher attachment limits or add nodes so volumes spread across more instances; raise account-level EBS quotas only if those are also exhausted
# If an account-level EBS quota is also at its ceiling, request an increase
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-D18FCD1D \
  --desired-value 50
# OR: move to nodes with higher per-instance attachment limits. Most Nitro
# instance types share a fixed pool of roughly 28 attachment slots between
# ENIs, NVMe instance-store devices, and EBS volumes; check the EC2 user
# guide for the exact limit of each type before picking node sizes.
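Whichever combination of fixes applies, validate end to end before calling it done: all five claims Bound, all five pods Running and Ready.
# Confirm the claims are Bound and the StatefulSet finishes rolling out
kubectl get pvc -n database
kubectl get pods -n database
kubectl rollout status statefulset/postgres -n database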
Complete Working StatefulSet with Persistent Storage
---
# StorageClass for database workloads
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: database-storage
provisioner: ebs.csi.aws.com # AWS EBS CSI driver
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
  kmsKeyId: "arn:aws:kms:us-east-1:123456789:key/abc-123"
volumeBindingMode: WaitForFirstConsumer # Provision each volume in the zone where its pod is scheduled
allowVolumeExpansion: true
reclaimPolicy: Retain # Don't delete the PV when the PVC is deleted
---
# StatefulSet for PostgreSQL
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
  namespace: database
spec:
  serviceName: postgres-headless
  replicas: 5
  selector:
    matchLabels:
      app: postgres
  # Volume claim templates - creates a PVC for each pod
  volumeClaimTemplates:
  - metadata:
      name: data
      labels:
        app: postgres
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: database-storage
      resources:
        requests:
          storage: 100Gi
  template:
    metadata:
      labels:
        app: postgres
    spec:
      # Prefer spreading pods across availability zones (a hard zone
      # requirement would leave pods Pending once replicas outnumber zones)
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: postgres
              topologyKey: topology.kubernetes.io/zone
      containers:
      - name: postgres
        image: postgres:15-alpine
        ports:
        - containerPort: 5432
          name: postgres
        # Mount the persistent volume
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
          subPath: pgdata # Use subdirectory to avoid permission issues
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        resources:
          requests:
            memory: "4Gi"
            cpu: "2000m"
          limits:
            memory: "8Gi"
            cpu: "4000m"
        # Probes for health checking
        livenessProbe:
          exec:
            command:
            - pg_isready
            - -U
            - postgres
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          exec:
            command:
            - pg_isready
            - -U
            - postgres
          initialDelaySeconds: 5
          periodSeconds: 5
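To roll this out, something along these lines applies the manifests and confirms each pod binds its volume before the next one starts (the file name is illustrative):
# Apply the StorageClass and StatefulSet, then watch the rollout
kubectl apply -f postgres-storage.yaml
kubectl rollout status statefulset/postgres -n database
kubectl get pvc -n database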
Storage Troubleshooting Checklist
✓ Check PVC status: kubectl get pvc -n <namespace>
✓ Describe PVC for events: kubectl describe pvc <name>
✓ Check PV availability: kubectl get pv
✓ Verify StorageClass exists: kubectl get storageclass
✓ Check storage provisioner pods running
✓ Review provisioner logs for errors
✓ Verify cloud provider quotas (EBS volume limits, IOPS, etc.)
✓ Check IAM permissions for CSI driver
✓ Ensure nodes have capacity for new volumes
✓ Test with a simple PVC/Pod before the StatefulSet (see the example manifest after this checklist)
✓ Check for Released PVs that need reclaiming
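For the "test with a simple PVC/Pod" item, a minimal manifest like the one below (names and image are illustrative) separates storage problems from StatefulSet behaviour: if this PVC binds and the pod writes to the volume, the provisioner is healthy and the problem lies elsewhere.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-smoke-test
  namespace: database
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: database-storage
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: storage-smoke-test
  namespace: database
spec:
  containers:
  - name: test
    image: busybox:1.36
    # Write a marker file to prove the volume mounts and is writable
    command: ["sh", "-c", "echo ok > /data/ok && sleep 3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: storage-smoke-test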
Practice Question
You delete a StatefulSet but keep the PVCs. Later, you recreate the StatefulSet with the same name. What happens to the data?