Last quarter our EC2 bill crossed a number that made our finance lead schedule a meeting. I owned the audit. We didn't migrate anything to Graviton, we didn't rearchitect a single service, we just measured what each instance actually did and matched it to the right size. The bill dropped 31% in six weeks.

Here's the exact process I used, including the data I pulled and the mistakes that cost us a week.

Start with utilization, not the instance list

The temptation is to open the EC2 console, sort by hourly price, and start shrinking the expensive ones. That's backwards. A pricey r5.4xlarge running at 80% CPU is fine; a "cheap" m5.xlarge sitting at 4% is the real waste. I pulled 14 days of CloudWatch metrics: CPU, network, and (critically) memory. Memory has to come from the CloudWatch agent, because EC2 does not report it by default.

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-06-01T00:00:00Z \
  --end-time 2026-06-15T00:00:00Z \
  --period 3600 \
  --statistics Average Maximum

The pairing of Average and Maximum matters. An instance at 9% average but 70% peak is probably running something bursty, such as a nightly batch job; downsizing it would tank that window. The ones worth cutting are flat-low on both.
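The average-versus-peak rule can be turned into a first-pass filter. Below is a minimal sketch, assuming you have already collected the hourly Average and Maximum datapoints into two lists. The function name and both thresholds (10% average, 30% peak) are illustrative assumptions, not values from the audit or AWS defaults.

```python
from statistics import mean


def classify_utilization(averages, maximums, avg_threshold=10.0, peak_threshold=30.0):
    """Label an instance from its per-period CPU Average and Maximum values.

    averages / maximums: lists of percentages, one entry per CloudWatch period.
    Thresholds are illustrative assumptions; tune them per workload.
    """
    if not averages or not maximums:
        return "no-data"
    overall_avg = mean(averages)
    overall_peak = max(maximums)
    if overall_avg < avg_threshold and overall_peak < peak_threshold:
        return "flat-low"  # low on both: candidate for downsizing
    if overall_avg < avg_threshold:
        return "spiky"     # low average but real peaks: investigate first
    return "busy"


print(classify_utilization([3, 4, 5], [6, 9, 12]))     # flat-low
print(classify_utilization([8, 9, 10], [15, 70, 20]))  # spiky
```

Only the "flat-low" bucket goes straight onto the downsize worklist; "spiky" instances need a look at what drives the peak before anything changes.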

Let Compute Optimizer do the first pass

AWS Compute Optimizer ingests this data for free and emits "Over-provisioned / Optimized / Under-provisioned" findings per instance, with a recommended type and projected savings. I treated it as a worklist, not gospel. It flagged 38 of our 112 instances as over-provisioned.

Compute Optimizer is only as good as its memory data. Without the CloudWatch agent reporting mem_used_percent, it assumes memory is fine and may recommend a downsize that triggers OOM kills in production.

I installed the agent fleet-wide first, then waited a full week for fresh memory metrics before trusting any recommendation that reduced RAM.
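That gate is easy to encode. The sketch below is one way to do it, assuming each recommendation has been flattened into a plain dict; the key names (`current_mem_gib`, `recommended_mem_gib`, `has_memory_metrics`) and the instance IDs are hypothetical and are not the Compute Optimizer response format.

```python
def split_by_memory_evidence(recommendations):
    """Separate recommendations that can be acted on from ones to hold back.

    A recommendation is held when it reduces RAM and there are no memory
    metrics to back it up. Dict keys are hypothetical, for illustration only.
    """
    actionable, held = [], []
    for rec in recommendations:
        reduces_ram = rec["recommended_mem_gib"] < rec["current_mem_gib"]
        if reduces_ram and not rec["has_memory_metrics"]:
            held.append(rec)
        else:
            actionable.append(rec)
    return actionable, held


sample = [
    # RAM halves and no agent data yet: hold.
    {"id": "i-aaa", "current_mem_gib": 32, "recommended_mem_gib": 16, "has_memory_metrics": False},
    # RAM halves, but memory metrics exist: act.
    {"id": "i-bbb", "current_mem_gib": 32, "recommended_mem_gib": 16, "has_memory_metrics": True},
    # RAM unchanged: memory data is not needed to act.
    {"id": "i-ccc", "current_mem_gib": 16, "recommended_mem_gib": 16, "has_memory_metrics": False},
]
actionable, held = split_by_memory_evidence(sample)
print([r["id"] for r in actionable], [r["id"] for r in held])
```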

Where the savings actually came from

The 31% wasn't evenly spread. Breaking it down clarified where to spend future effort.

| Change | Instances | Share of savings |
|---|---|---|
| Downsize one family step (e.g. m5.2xlarge → m5.xlarge) | 27 | ~52% |
| Terminate idle / orphaned instances | 9 | ~28% |
| Move to newer generation (m5 → m6i) | 14 | ~12% |
| Schedule non-prod shutdown nights/weekends | 18 | ~8% |
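Dividing each share by the number of instances touched gives a rough sense of payoff per change. This is a quick arithmetic sketch using only the figures in the breakdown above; the shares are approximate, so treat the results as a ranking rather than precise numbers.

```python
# (instances touched, approximate share of total savings in percent)
breakdown = {
    "downsize": (27, 52),
    "terminate": (9, 28),
    "newer_generation": (14, 12),
    "schedule_non_prod": (18, 8),
}

# Percentage points of total savings contributed per instance touched.
per_instance = {
    name: round(share / count, 2) for name, (count, share) in breakdown.items()
}

ranked = sorted(per_instance.items(), key=lambda item: item[1], reverse=True)
for name, points in ranked:
    print(f"{name}: {points} points of savings per instance")
```

Termination comes out on top per instance, with downsizing second, even though downsizing delivered the largest total.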

The orphaned instances stung. Three were dev boxes from people who'd left, two were a load test someone forgot to tear down. Tagging discipline would have caught all five.
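A tag audit along these lines is short to write. The sketch below assumes input shaped like the `Reservations` list that boto3's `describe_instances` returns, and assumes the required tag keys are `owner` and `environment`; the helper names and sample IDs are hypothetical. Tag keys are case-sensitive, so `Owner` would not satisfy `owner`.

```python
REQUIRED_TAGS = {"owner", "environment"}  # assumed tag keys


def missing_tags(instance):
    """Return the required tag keys that an instance dict lacks, sorted."""
    present = {tag["Key"] for tag in instance.get("Tags", [])}
    return sorted(REQUIRED_TAGS - present)


def untagged_instances(reservations):
    """Map instance ID to its missing required tags, skipping compliant ones."""
    findings = {}
    for reservation in reservations:
        for instance in reservation["Instances"]:
            missing = missing_tags(instance)
            if missing:
                findings[instance["InstanceId"]] = missing
    return findings


sample_reservations = [
    {"Instances": [
        {"InstanceId": "i-tagged", "Tags": [
            {"Key": "owner", "Value": "platform"},
            {"Key": "environment", "Value": "prod"},
        ]},
        {"InstanceId": "i-partial", "Tags": [{"Key": "owner", "Value": "dev"}]},
        {"InstanceId": "i-bare"},  # no Tags key at all
    ]},
]
print(untagged_instances(sample_reservations))
```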

Make downsizing safe and reversible

I batched changes by blast radius. Non-prod first, then prod stateless services behind auto scaling, then stateful last. Each change was a stop/modify/start, so I scripted it and verified the new type before restart.

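import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0abc123def456"
NEW_TYPE = "m5.xlarge"

# The instance type can only be changed while the instance is stopped.
# This assumes an EBS-backed instance; instance store data does not survive a stop.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.modify_instance_attribute(
    InstanceId=INSTANCE_ID,
    InstanceType={"Value": NEW_TYPE},
)

# Verify the new type took effect before restarting.
attr = ec2.describe_instance_attribute(InstanceId=INSTANCE_ID, Attribute="instanceType")
actual_type = attr["InstanceType"]["Value"]
if actual_type != NEW_TYPE:
    raise RuntimeError(f"{INSTANCE_ID} is {actual_type}, expected {NEW_TYPE}")

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
print(f"{INSTANCE_ID} is now {NEW_TYPE}")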

After each batch I watched CPU, memory, and p99 latency for 48 hours before moving on. Two instances had to bounce back up a size; both were JVM services where heap pressure showed up as GC pauses, not CPU. That's exactly why memory metrics are non-negotiable.
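The watch window is easier to hold to if the rollback rule is written down beforehand. Here is a minimal sketch of such a rule, assuming you have a pre-change p99 baseline to compare against. The function name, the 20% latency tolerance, and the 85% memory ceiling are all illustrative assumptions, not the thresholds used in this audit.

```python
def should_roll_back(baseline_p99_ms, current_p99_ms, peak_mem_percent,
                     max_latency_regression=0.20, mem_ceiling=85.0):
    """Return True if a downsized instance should go back up a size.

    Rolls back when p99 latency regresses beyond the tolerance or peak
    memory use exceeds the ceiling. CPU is deliberately not the trigger:
    heap pressure can surface as GC pauses (latency) while CPU looks fine.
    """
    latency_regressed = current_p99_ms > baseline_p99_ms * (1 + max_latency_regression)
    memory_tight = peak_mem_percent > mem_ceiling
    return latency_regressed or memory_tight


print(should_roll_back(120, 125, 60))  # False: within tolerance on both
print(should_roll_back(120, 180, 60))  # True: p99 regressed by 50%
print(should_roll_back(120, 121, 92))  # True: memory above the ceiling
```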

Keep it from creeping back

Right-sizing is not a one-time event; provisioning drifts back up as teams round up "to be safe." I set a monthly Compute Optimizer review, a budget alert in AWS Budgets, and a tagging policy enforced via SCP so every instance carries an owner and environment tag. The audit found the savings; the guardrails keep them.
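For the tagging guardrail, a common SCP pattern is one Deny statement per required tag, each using a `Null` condition on `aws:RequestTag/<key>` so that `ec2:RunInstances` fails when the tag is absent from the launch request. The sketch below builds such a policy document in Python; the tag keys, `Sid` values, and function name are assumptions for illustration, not the policy used here.

```python
import json

REQUIRED_TAGS = ["owner", "environment"]  # assumed tag keys


def build_tag_enforcement_scp(required_tags):
    """Build an SCP document denying instance launches that omit any required tag.

    One statement per tag: conditions inside a single statement are ANDed,
    which would only deny launches missing every tag at once.
    """
    statements = []
    for tag in required_tags:
        statements.append({
            "Sid": f"DenyRunInstancesWithout{tag.capitalize()}",
            "Effect": "Deny",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:*:*:instance/*",
            # "Null": "true" matches when the tag key is absent from the request.
            "Condition": {"Null": {f"aws:RequestTag/{tag}": "true"}},
        })
    return {"Version": "2012-10-17", "Statement": statements}


policy = build_tag_enforcement_scp(REQUIRED_TAGS)
print(json.dumps(policy, indent=2))
```

A policy like this only covers launch time. Stopping someone from stripping tags afterwards would need a further statement on `ec2:DeleteTags`, which is left out of this sketch.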

Takeaways

  • Sort by utilization, not price, and always pull both average and peak over at least 14 days.
  • Install the CloudWatch agent for memory metrics before trusting any downsize recommendation.
  • Idle and orphaned instances are pure waste; enforce owner/environment tags via SCP so every instance has someone accountable for it.
  • Bank the win with recurring reviews and budget alerts, or provisioning will creep right back up.