Your Azure bill increased 50% last month. Identify waste and implement cost controls.
The Scenario
Your Azure spending is out of control:
Monthly Bill Breakdown:
├── Virtual Machines: $45,000 (30%)
│ └── Many D-series running 24/7 including dev/test
├── Azure SQL: $30,000 (20%)
│ └── Premium tier for all databases
├── Storage: $22,500 (15%)
│ └── All data in Hot tier
├── AKS: $25,500 (17%)
│ └── Overprovisioned node pools
├── App Services: $15,000 (10%)
│ └── Premium V3 for all environments
└── Other: $12,000 (8%)
Total: $150,000/month
YoY Growth: 40%
Reserved Instance Coverage: 0%
Finance is asking for a 30% cost reduction without impacting performance.
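To make Finance's request concrete, the breakdown above can be turned into a target figure. The following is a minimal Python sketch that uses only the numbers from the scenario; the variable names are illustrative.

```python
# Monthly bill from the scenario, in USD.
bill = {
    "Virtual Machines": 45_000,
    "Azure SQL": 30_000,
    "Storage": 22_500,
    "AKS": 25_500,
    "App Services": 15_000,
    "Other": 12_000,
}

total = sum(bill.values())
reduction_goal_pct = 30  # Finance's requested reduction, in percent

# Integer arithmetic keeps the dollar figures exact.
target_bill = total * (100 - reduction_goal_pct) // 100
required_savings = total - target_bill

# Share of the bill per service, rounded to whole percent.
shares = {name: round(100 * cost / total) for name, cost in bill.items()}

print(f"Total: ${total:,}/month")
print(f"Target after {reduction_goal_pct}% cut: ${target_bill:,}/month")
print(f"Savings needed: ${required_savings:,}/month")
```

Any plan that removes at least the `required_savings` amount each month meets the goal, which gives the later steps a number to add up against.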
The Challenge
Implement a comprehensive cost optimization strategy using Azure Cost Management, reservations, right-sizing, and architectural improvements.
A junior engineer might delete resources at random, downgrade everything to the smallest size, skip reserved instances out of fear of commitment, or ignore the problem and hope it goes away. These approaches break applications, degrade performance, or leave the root causes untouched.
A senior engineer analyzes usage patterns with Azure Advisor and Cost Management, implements reserved instances for stable workloads, right-sizes resources based on metrics, uses auto-scaling, implements proper resource lifecycle management, and sets up budgets with alerts.
Step 1: Analyze Current Spending
# Get pre-tax cost per resource (usage records have no resourceGroup field; the group name is part of instanceId)
az consumption usage list \
--start-date 2024-01-01 \
--end-date 2024-01-31 \
--query "[].{Resource:instanceName,ResourceId:instanceId,Cost:pretaxCost}" \
--output table
# Get Azure Advisor recommendations
az advisor recommendation list \
--category Cost \
--output table
# Export cost data for analysis
az costmanagement query \
--type Usage \
--scope "/subscriptions/{subscription-id}" \
--timeframe MonthToDate \
--dataset-grouping name=ResourceGroup type=Dimension \
--dataset-aggregation '{"totalCost":{"name":"Cost","function":"Sum"}}'
Step 2: Implement Reserved Instances
// Reserved Instance savings calculator
// Standard D4s_v5 (4 vCPU, 16GB) in East US:
// - Pay-as-you-go: $140.16/month
// - 1-year reserved: $89.79/month (36% savings)
// - 3-year reserved: $57.67/month (59% savings)
// For 10 production VMs running 24/7:
// - Current: 10 × $140.16 = $1,401.60/month
// - With 3-year RI: 10 × $57.67 = $576.70/month
// - Annual savings: $9,899/year
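The reserved-instance arithmetic in the comments above can be reproduced with a few lines of Python. This is a sketch, not a pricing tool: the prices are the ones quoted above, and the function names are hypothetical. The break-even helper adds one step the comments leave implicit: a reservation is billed whether or not the VM runs, so it only pays off for VMs that run more than a certain share of the month.

```python
def reservation_savings(payg_monthly, reserved_monthly, vm_count):
    """Return (percent saved, monthly savings, annual savings) for a fleet."""
    per_vm = payg_monthly - reserved_monthly
    monthly = per_vm * vm_count
    percent = 100 * per_vm / payg_monthly
    return percent, monthly, monthly * 12


def break_even_utilization(payg_monthly, reserved_monthly):
    """Fraction of the month a VM must run before reserving beats pay-as-you-go."""
    return reserved_monthly / payg_monthly


PAYG = 140.16  # Standard D4s_v5 pay-as-you-go price quoted above, USD/month

for label, reserved in [("1-year", 89.79), ("3-year", 57.67)]:
    pct, monthly, annual = reservation_savings(PAYG, reserved, vm_count=10)
    breakeven = break_even_utilization(PAYG, reserved)
    print(f"{label}: {pct:.0f}% saved, ${monthly:,.2f}/month, "
          f"${annual:,.2f}/year, break-even at {breakeven:.0%} utilization")
```

Under these prices a VM that runs only about a third of the time would cost less on pay-as-you-go than on a 3-year reservation, which is why reservations are aimed at the stable 24/7 workloads.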
// Purchase recommendations based on usage patterns
resource reservationOrder 'Microsoft.Capacity/reservationOrders@2022-11-01' = {
name: 'ro-production-vms'
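location: 'eastus' // Region of the VMs being reserved (East US, as priced above)
sku: {
name: 'Standard_D4s_v5' // VM size the reservation covers
}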
properties: {
reservedResourceType: 'VirtualMachines'
billingScopeId: subscription().id
term: 'P3Y' // 3-year term
billingPlan: 'Monthly'
quantity: 10
displayName: 'Production VM Reservations'
appliedScopeType: 'Shared' // Shared scope applies the discount across subscriptions; appliedScopes is only set with 'Single'
renew: true
}
}
// Azure SQL Database reservations
// vCore reservations apply across all SQL products
resource sqlReservation 'Microsoft.Capacity/reservationOrders@2022-11-01' = {
name: 'ro-sql-vcores'
location: 'global'
properties: {
reservedResourceType: 'SqlDatabases'
term: 'P1Y'
quantity: 24 // Total vCores
displayName: 'SQL vCore Reservations'
appliedScopeType: 'Shared'
}
}
Step 3: Right-Size Virtual Machines
// Enable Azure Monitor for VM metrics analysis
resource vmInsights 'Microsoft.Insights/dataCollectionRules@2022-06-01' = {
name: 'dcr-vm-performance'
location: location
properties: {
dataSources: {
performanceCounters: [
{
name: 'VMPerformance'
streams: ['Microsoft-Perf']
samplingFrequencyInSeconds: 60
counterSpecifiers: [
'\\Processor Information(_Total)\\% Processor Time'
'\\Memory\\% Committed Bytes In Use'
'\\LogicalDisk(_Total)\\% Disk Read Time'
'\\LogicalDisk(_Total)\\% Disk Write Time'
]
}
]
}
destinations: {
logAnalytics: [
{
workspaceResourceId: logAnalytics.id
name: 'vmLogs'
}
]
}
}
}
// KQL query to find oversized VMs
Perf
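| where ObjectName == "Processor Information" and CounterName == "% Processor Time" and InstanceName == "_Total" // Matches the counter collected by the data collection rule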
| where TimeGenerated > ago(30d)
| summarize AvgCPU = avg(CounterValue),
MaxCPU = max(CounterValue),
P95CPU = percentile(CounterValue, 95)
by Computer
| where P95CPU < 20 // VMs with P95 CPU < 20% are oversized
| order by AvgCPU asc
// Memory utilization
Perf
| where ObjectName == "Memory" and CounterName == "% Committed Bytes In Use"
| where TimeGenerated > ago(30d)
| summarize AvgMemory = avg(CounterValue),
MaxMemory = max(CounterValue),
P95Memory = percentile(CounterValue, 95)
by Computer
| where P95Memory < 40 // VMs with P95 Memory < 40% can be downsized
# Resize VM based on analysis
az vm resize \
--resource-group rg-production \
--name vm-web-01 \
--size Standard_D2s_v5 # Downsize from D4s_v5
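Before running a resize like the one above, the two KQL thresholds need to be read together: moving from D4s_v5 to D2s_v5 halves both vCPUs and memory, so a VM is a safe candidate only if both metrics have headroom. A minimal Python sketch of that decision follows. The `VmUsage` type and the sample values are hypothetical and stand in for exported query results.

```python
from dataclasses import dataclass

CPU_P95_THRESHOLD = 20.0     # percent, from the CPU query
MEMORY_P95_THRESHOLD = 40.0  # percent, from the memory query


@dataclass
class VmUsage:
    name: str
    p95_cpu: float
    p95_memory: float


def is_downsize_candidate(vm: VmUsage) -> bool:
    """Flag a VM only when both CPU and memory P95 are under their thresholds."""
    return vm.p95_cpu < CPU_P95_THRESHOLD and vm.p95_memory < MEMORY_P95_THRESHOLD


# Hypothetical 30-day results, for illustration only.
fleet = [
    VmUsage("vm-web-01", p95_cpu=12.0, p95_memory=31.0),
    VmUsage("vm-web-02", p95_cpu=14.0, p95_memory=78.0),  # memory-bound
    VmUsage("vm-api-01", p95_cpu=55.0, p95_memory=35.0),  # CPU-bound
]

candidates = [vm.name for vm in fleet if is_downsize_candidate(vm)]
print(candidates)
```

Only `vm-web-01` passes both checks here. The other two look idle on one metric but would likely be starved on the other after a downsize.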
# Savings: D4s_v5 ($140/mo) → D2s_v5 ($70/mo) = 50% per VM
Step 4: Auto-Shutdown for Non-Production
// Auto-shutdown for dev/test VMs
resource autoShutdown 'Microsoft.DevTestLab/schedules@2018-09-15' = {
name: 'shutdown-computevm-${vmName}'
location: location
properties: {
status: 'Enabled'
taskType: 'ComputeVmShutdownTask'
dailyRecurrence: {
time: '1900' // 7 PM
}
timeZoneId: 'Eastern Standard Time'
notificationSettings: {
status: 'Enabled'
timeInMinutes: 30
emailRecipient: 'team@contoso.com'
}
targetResourceId: vm.id
}
}
// Start VMs on schedule using Automation
resource automationRunbook 'Microsoft.Automation/automationAccounts/runbooks@2022-08-08' = {
parent: automationAccount
name: 'Start-DevVMs'
location: location
properties: {
runbookType: 'PowerShell'
logProgress: true
logVerbose: false
publishContentLink: {
uri: 'https://raw.githubusercontent.com/contoso/runbooks/main/Start-DevVMs.ps1'
}
}
}
resource startSchedule 'Microsoft.Automation/automationAccounts/schedules@2022-08-08' = {
parent: automationAccount
name: 'StartDevVMsWeekday'
properties: {
startTime: '2024-01-01T08:00:00+00:00'
frequency: 'Week'
interval: 1
timeZone: 'Eastern Standard Time'
advancedSchedule: {
weekDays: ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
}
}
}
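The schedules above start dev VMs at 08:00 on weekdays and shut them down at 19:00. A rough Python sketch of the resulting compute hours follows, comparing like with like (running hours per week against the 168 hours in a week). The function names are illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_running_hours(start_hour: int, stop_hour: int, days_per_week: int) -> int:
    """Hours a VM runs per week when started and stopped once per working day."""
    return (stop_hour - start_hour) * days_per_week


def compute_reduction(running_hours: int) -> float:
    """Fraction of always-on compute hours avoided by the schedule."""
    return 1 - running_hours / HOURS_PER_WEEK


running = weekly_running_hours(start_hour=8, stop_hour=19, days_per_week=5)
print(f"{running} of {HOURS_PER_WEEK} hours, {compute_reduction(running):.0%} reduction")
```

This covers compute only. Managed disks attached to a deallocated VM continue to be billed, so the saving on the total VM cost is somewhat lower than the reduction in hours.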
// Savings: dev VMs run 11 hrs/day (08:00-19:00) × 5 days = 55 hrs/week vs 168 hrs/week
// = ~67% fewer compute hours for dev VMs (disks are still billed while a VM is deallocated)
Step 5: Optimize Azure SQL
-- Identify unused indexes
SELECT
OBJECT_NAME(i.object_id) AS TableName,
i.name AS IndexName,
ius.user_seeks,
ius.user_scans,
ius.user_lookups,
ius.user_updates
FROM sys.indexes i
JOIN sys.dm_db_index_usage_stats ius
ON i.object_id = ius.object_id AND i.index_id = ius.index_id
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
AND ius.user_seeks = 0
AND ius.user_scans = 0
AND ius.user_lookups = 0
ORDER BY ius.user_updates DESC;
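-- Check DTU/vCore utilization
-- Note: sys.dm_db_resource_stats keeps only about the last hour of 15-second samples;
-- for a 14-day window, query sys.resource_stats in the master database instead
SELECT
AVG(avg_cpu_percent) as AvgCPU,
MAX(avg_cpu_percent) as MaxCPU,
AVG(avg_data_io_percent) as AvgIO,
AVG(avg_memory_usage_percent) as AvgMemory
FROM sys.dm_db_resource_stats
WHERE end_time > DATEADD(hour, -1, GETUTCDATE());
// Right-size based on analysis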
resource sqlDatabase 'Microsoft.Sql/servers/databases@2023-05-01-preview' = {
parent: sqlServer
name: 'appdb'
location: location
sku: {
// Before: Premium P4 (500 DTU) - $1,860/month
// After: Standard S3 (100 DTU) - $150/month
// Or: General Purpose 2 vCore - $370/month
name: 'GP_S_Gen5' // Serverless for variable workloads
tier: 'GeneralPurpose'
family: 'Gen5'
capacity: 2
}
properties: {
autoPauseDelay: 60 // Pause after 1 hour of inactivity
minCapacity: json('0.5') // Minimum 0.5 vCores when active (Bicep has no decimal literals, hence json())
zoneRedundant: false // Disable for non-prod
}
}
// Use Elastic Pools for multiple databases
resource elasticPool 'Microsoft.Sql/servers/elasticPools@2023-05-01-preview' = {
parent: sqlServer
name: 'pool-shared'
location: location
sku: {
name: 'GP_Gen5'
tier: 'GeneralPurpose'
family: 'Gen5'
capacity: 4 // 4 vCores shared across databases
}
properties: {
perDatabaseSettings: {
minCapacity: 0
maxCapacity: 2
}
}
}
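Whether a pool is cheaper than standalone databases depends on two things: the price difference and whether the databases peak at different times. The sketch below uses the prices quoted in this step ($370/month per 2-vCore database, $740/month for a 4-vCore pool). The capacity check and its `max_concurrent_peaks` parameter are assumptions added for illustration, not an Azure sizing rule.

```python
def pool_vs_standalone(db_count, single_db_monthly, pool_monthly):
    """Return (standalone cost, pool cost, fractional savings)."""
    standalone = db_count * single_db_monthly
    return standalone, pool_monthly, 1 - pool_monthly / standalone


def pool_fits(peak_vcores_per_db, pool_vcores, max_concurrent_peaks):
    """Rough check: can the pool absorb the busiest databases peaking together?"""
    busiest = sorted(peak_vcores_per_db, reverse=True)[:max_concurrent_peaks]
    return sum(busiest) <= pool_vcores


standalone, pooled, savings = pool_vs_standalone(10, 370, 740)
print(f"${standalone:,} standalone vs ${pooled:,} pooled: {savings:.0%} saved")

# Ten databases that each peak at 2 vCores (the per-database maxCapacity above).
peaks = [2] * 10
print(pool_fits(peaks, pool_vcores=4, max_concurrent_peaks=2))
print(pool_fits(peaks, pool_vcores=4, max_concurrent_peaks=3))
```

With a 4-vCore pool, two databases can peak at once but three cannot, so the saving holds only while peaks rarely overlap. That is worth confirming from the utilization data before consolidating.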
// 10 databases × $370/month = $3,700
// vs Elastic Pool 4 vCore: $740/month = 80% savings
Step 6: Optimize AKS
// Right-size AKS node pools
resource aksCluster 'Microsoft.ContainerService/managedClusters@2023-05-01' = {
name: aksName
location: location
properties: {
agentPoolProfiles: [
{
name: 'system'
count: 2 // Reduced from 3
vmSize: 'Standard_D2s_v5' // Reduced from D4s_v5
mode: 'System'
enableAutoScaling: true
minCount: 2
maxCount: 3
}
{
name: 'workload'
count: 3
vmSize: 'Standard_D4s_v5'
mode: 'User'
enableAutoScaling: true
minCount: 2
maxCount: 10 // Scale up only when needed
// Use spot instances for non-critical workloads
scaleSetPriority: 'Spot'
spotMaxPrice: -1 // Pay up to on-demand price
scaleSetEvictionPolicy: 'Delete'
nodeLabels: {
'workload-type': 'batch'
}
nodeTaints: [
'kubernetes.azure.com/scalesetpriority=spot:NoSchedule'
]
}
]
}
}
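With autoscaling enabled, each node pool has a cost range rather than a fixed cost. The Python sketch below estimates that range for the two pools defined above. It assumes the per-VM prices quoted earlier in this guide ($140.16/month for D4s_v5, roughly $70/month for D2s_v5) and takes the low end of the 60-90% spot discount range as a conservative guess. Actual spot prices vary by region and over time.

```python
from dataclasses import dataclass


@dataclass
class NodePool:
    name: str
    node_monthly: float         # on-demand price per node, USD/month
    min_count: int
    max_count: int
    spot_discount: float = 0.0  # 0.0 for regular (non-spot) nodes

    def monthly_range(self):
        """Return (cost at minCount, cost at maxCount) in USD/month."""
        price = self.node_monthly * (1 - self.spot_discount)
        return self.min_count * price, self.max_count * price


pools = [
    NodePool("system", node_monthly=70.00, min_count=2, max_count=3),
    NodePool("workload", node_monthly=140.16, min_count=2, max_count=10,
             spot_discount=0.60),  # assumed discount, not a quoted price
]

for pool in pools:
    low, high = pool.monthly_range()
    print(f"{pool.name}: ${low:,.2f} to ${high:,.2f} per month")
```

The upper bound is what `maxCount` can cost in a sustained peak, which is a useful sanity check when picking the autoscaler limits.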
// Spot instance savings: ~60-90% vs regular VMs
# Kubernetes resource optimization
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: production
spec:
  limits:
    - default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Step 7: Set Up Cost Alerts and Budgets
// Budget with alerts
resource budget 'Microsoft.Consumption/budgets@2023-05-01' = {
name: 'monthly-budget'
properties: {
category: 'Cost'
amount: 105000 // Target: $105K (the 30% reduction Finance asked for)
timeGrain: 'Monthly'
timePeriod: {
startDate: '2024-01-01'
endDate: '2025-12-31'
}
filter: {
dimensions: {
name: 'ResourceGroupName'
operator: 'In'
values: ['rg-production', 'rg-staging', 'rg-development']
}
}
notifications: {
Actual_GreaterThan_80_Percent: {
enabled: true
operator: 'GreaterThan'
threshold: 80
contactEmails: ['finance@contoso.com', 'platform@contoso.com']
contactRoles: ['Owner', 'Contributor']
thresholdType: 'Actual'
}
Forecasted_GreaterThan_100_Percent: {
enabled: true
operator: 'GreaterThan'
threshold: 100
contactEmails: ['finance@contoso.com', 'cto@contoso.com']
thresholdType: 'Forecasted'
}
}
}
}
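The two notifications in the budget above react to different signals: one to spend already incurred, the other to the forecast for the period. A small Python sketch of how the thresholds translate into dollar amounts follows. The function name and the sample figures are hypothetical round numbers, not values from the scenario.

```python
def fired_alerts(budget, actual, forecast):
    """Return the names of notifications whose threshold has been crossed."""
    rules = [
        ("Actual_GreaterThan_80_Percent", actual, 80),
        ("Forecasted_GreaterThan_100_Percent", forecast, 100),
    ]
    return [name for name, spend, pct in rules if spend > budget * pct / 100]


# Mid-month: spend has passed 80% of the budget, forecast is still under 100%.
print(fired_alerts(budget=10_000, actual=8_500, forecast=9_900))

# Early in the month: little spent so far, but the forecast overshoots.
print(fired_alerts(budget=10_000, actual=3_000, forecast=10_500))
```

The forecast alert tends to be the more useful of the two because it can fire while there is still time to act. Neither alert stops resources on its own; they only notify the listed contacts.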
// Resource group level budget
resource rgBudget 'Microsoft.Consumption/budgets@2023-05-01' = {
name: 'dev-budget'
scope: resourceGroup('rg-development')
properties: {
category: 'Cost'
amount: 5000
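timeGrain: 'Monthly'
timePeriod: {
startDate: '2024-01-01' // Required; must be the first day of a month
}
notifications: {
Actual_GreaterThan_90_Percent: {
enabled: true
operator: 'GreaterThan'
threshold: 90
contactEmails: ['dev-lead@contoso.com']
thresholdType: 'Actual'
}
}
}
}
Step 8: Implement Cost Tagging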
// Enforce tagging policy
resource taggingPolicy 'Microsoft.Authorization/policyDefinitions@2021-06-01' = {
name: 'require-cost-tags'
properties: {
policyType: 'Custom'
mode: 'Indexed'
displayName: 'Require cost center and environment tags'
policyRule: {
if: {
anyOf: [
{
field: 'tags[CostCenter]'
exists: 'false'
}
{
field: 'tags[Environment]'
exists: 'false'
}
{
field: 'tags[Owner]'
exists: 'false'
}
]
}
then: {
effect: 'deny'
}
}
}
}
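The `anyOf` rule above denies a resource as soon as one of the three tags is absent. The Python sketch below mirrors that logic so the behaviour can be checked against sample tag sets before the policy is assigned. The function names are hypothetical, and this is not how Azure Policy is implemented internally.

```python
REQUIRED_TAGS = ("CostCenter", "Environment", "Owner")


def missing_tags(resource_tags):
    """Return the required tags that are absent from a resource's tags."""
    return [tag for tag in REQUIRED_TAGS if tag not in resource_tags]


def policy_effect(resource_tags):
    """Mirror the anyOf rule: deny when at least one required tag does not exist."""
    return "deny" if missing_tags(resource_tags) else "allow"


print(policy_effect({"CostCenter": "1234", "Environment": "dev", "Owner": "a@contoso.com"}))
print(policy_effect({"CostCenter": "1234", "Environment": "dev"}))
print(missing_tags({"Project": "demo"}))
```

The rule checks only that each tag exists, not what it contains, so a tag with an empty value still passes. A definition also has no effect until it is assigned to a scope.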
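// utcNow() is only allowed as a parameter default value in Bicep
param createdDate string = utcNow('yyyy-MM-dd')
// Apply tags at the deployment scope (resources inside it do not inherit them automatically)
resource tagPolicy 'Microsoft.Resources/tags@2021-04-01' = {
name: 'default'
properties: {
tags: {
Environment: environment
CostCenter: costCenter
Owner: ownerEmail
Project: projectName
CreatedBy: 'Bicep'
CreatedDate: createdDate
}
}
}
Cost Optimization Summary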
Optimization Results:
BEFORE ($150,000/month):
├── VMs: $45,000
├── SQL: $30,000
├── Storage: $22,500
├── AKS: $25,500
├── App Services: $15,000
└── Other: $12,000
AFTER ($100,000/month):
├── VMs: $25,000 (-44%)
│ ├── Reserved instances: -$10,000
│ ├── Right-sizing: -$5,000
│ └── Auto-shutdown dev: -$5,000
├── SQL: $18,000 (-40%)
│ ├── Elastic pools: -$8,000
│ └── Serverless: -$4,000
├── Storage: $15,000 (-33%)
│ └── Lifecycle policies: -$7,500
├── AKS: $18,000 (-29%)
│ ├── Spot instances: -$5,000
│ └── Autoscaling: -$2,500
├── App Services: $12,000 (-20%)
│ └── Right-size non-prod: -$3,000
└── Other: $12,000
TOTAL SAVINGS: $50,000/month (33% reduction)
ANNUAL SAVINGS: $600,000
Cost Optimization Strategies
| Strategy | Savings | Effort | Risk |
|---|---|---|---|
| Reserved Instances | 30-60% | Low | Commitment |
| Spot Instances | 60-90% | Medium | Interruption |
| Right-sizing | 20-50% | Medium | Performance |
| Auto-shutdown | 50-90% | Low | Availability |
| Serverless | Variable | High | Architecture |
Quick Wins Checklist
| Action | Expected Savings |
|---|---|
| Delete unattached disks | $5-50/disk/month |
| Stop idle VMs | 100% of compute |
| Resize oversized VMs | 30-50% per VM |
| Enable auto-shutdown | 60% for dev/test |
| Use reserved instances | 30-60% for prod |
| Implement lifecycle policies | 30-90% on storage |
Practice Question
Why should you analyze at least 14-30 days of metrics before right-sizing a VM?