Killing idle resources automatically with EventBridge, The Cloud Ledger

Every cost review I run starts the same way: I open Cost Explorer, sort by service, and find a handful of resources that nobody remembers spinning up. An ml.g4dn.xlarge notebook left running over a long weekend. Three NAT gateways feeding a VPC with no traffic. A dev RDS instance that has served zero connections since March. None of it is dramatic on its own, but it compounds into four-figure monthly waste.

Manual cleanup never sticks. The fix that actually held for my team was making idleness a thing the platform reacts to automatically, with EventBridge as the nervous system.

Define "idle" before you automate it

"Idle" is not a single metric. What it means depends on the resource:

EC2 / SageMaker notebooks: CPU below ~5% and network below a few KB/s for a sustained window.
RDS / Aurora: DatabaseConnections at zero for N hours.
Load balancers: RequestCount (ALB) or ProcessedBytes (NLB) flat at or near zero for a full day.
EBS volumes: status=available (detached) for more than a week.

Get this wrong and you terminate something that was merely quiet, like a batch worker that wakes hourly. I always pair a metric threshold with a duration, and I scope automation by tags so production opts in explicitly.
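One way to keep threshold and duration welded together is to store them as a single record per resource type. The sketch below is illustrative rather than the exact code my team runs: the IdleRule class, the IDLE_RULES table, and the specific numbers are assumptions made for the example, and detached EBS volumes are left out because they are judged on status rather than a metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdleRule:
    """One idle definition: a metric threshold plus a sustained duration."""
    namespace: str    # CloudWatch namespace, e.g. "AWS/EC2"
    metric: str       # CloudWatch metric name
    threshold: float  # hourly values strictly below this count as quiet
    hours: int        # how many consecutive quiet hours are required


# Hypothetical thresholds for illustration; tune them per workload.
IDLE_RULES = {
    "ec2": IdleRule("AWS/EC2", "CPUUtilization", 5.0, 4),
    "rds": IdleRule("AWS/RDS", "DatabaseConnections", 1.0, 24),
    "alb": IdleRule("AWS/ApplicationELB", "RequestCount", 1.0, 24),
}


def sustained_idle(hourly_values, rule):
    """True only if a full window of hourly values exists and all are quiet."""
    window = hourly_values[-rule.hours:]
    return len(window) >= rule.hours and all(v < rule.threshold for v in window)
```

A batch worker that spikes once inside the window fails the all() check, and a resource with too little history fails the length check, so neither gets flagged.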

The EventBridge pattern that works

There are two triggers worth combining. A scheduled rule sweeps for resources matching an idle definition on a cadence, and event-driven rules react to CloudWatch alarm state changes in near real time. The schedule catches the long tail; the alarm path catches sharp drop-offs fast.

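Here is a scheduled rule that fires a Lambda janitor every hour. A rate expression runs around the clock, so restricting the sweep to off-hours takes either a cron schedule expression or an hour check inside the Lambda: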

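aws events put-rule \
  --name idle-resource-sweep \
  --schedule-expression "rate(1 hour)" \
  --state ENABLED

aws events put-targets \
  --rule idle-resource-sweep \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:111122223333:function:idle-janitor"

# put-targets does not grant invoke rights; without this the rule fires but the Lambda never runs.
aws lambda add-permission \
  --function-name idle-janitor \
  --statement-id idle-resource-sweep-invoke \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:111122223333:rule/idle-resource-sweep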
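The alarm-driven half of the pattern needs an event pattern instead of a schedule. Here is a minimal sketch, assuming your idle alarms share a name prefix (the "idle-" convention and both function names are my own for this example). The event layout in instance_ids_from_alarm follows the CloudWatch alarm state change format as I understand it, so check it against a real event from your account before relying on it.

```python
import json


def idle_alarm_pattern(alarm_prefix="idle-"):
    """Event pattern matching idle alarms that have just entered ALARM."""
    return {
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {
            # Prefix matching keeps unrelated alarms away from the janitor.
            "alarmName": [{"prefix": alarm_prefix}],
            "state": {"value": ["ALARM"]},
        },
    }


def instance_ids_from_alarm(event):
    """Collect InstanceId dimensions from the alarm's metric configuration."""
    ids = []
    metrics = event.get("detail", {}).get("configuration", {}).get("metrics", [])
    for m in metrics:
        dims = m.get("metricStat", {}).get("metric", {}).get("dimensions", {})
        if "InstanceId" in dims:
            ids.append(dims["InstanceId"])
    return ids


# This JSON string is what would be passed as the rule's event pattern.
pattern_json = json.dumps(idle_alarm_pattern())
```

The same Lambda can serve both triggers: scheduled events carry no alarm detail, so instance_ids_from_alarm returns an empty list and the handler can fall back to the full sweep.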

The janitor: act, but reversibly

The Lambda evaluates candidates and takes the cheapest reversible action first. Stopping an EC2 instance keeps its EBS volumes, private IP, and any Elastic IP association intact (only an auto-assigned public IPv4 address is released), so I stop before I ever terminate. SageMaker notebooks get stopped, never deleted.

import datetime

import boto3

cw = boto3.client("cloudwatch")
ec2 = boto3.client("ec2")


def is_idle(instance_id, hours=4, cpu_threshold=5.0):
    """Return True if hourly peak CPU stayed below the threshold for the whole window."""
    # utcnow() is deprecated since Python 3.12; use a timezone-aware timestamp.
    end = datetime.datetime.now(datetime.timezone.utc)
    start = end - datetime.timedelta(hours=hours)
    stats = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start, EndTime=end,
        Period=3600, Statistics=["Maximum"],
    )
    points = [d["Maximum"] for d in stats["Datapoints"]]
    # Require a full window of datapoints so an instance with too little history is not judged idle.
    return len(points) >= hours and max(points) < cpu_threshold


def handler(event, context):
    # Paginate: a single describe_instances call can return a partial list.
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:auto-stop", "Values": ["true"]},
        ]
    )
    for page in pages:
        for r in page["Reservations"]:
            for inst in r["Instances"]:
                iid = inst["InstanceId"]
                if is_idle(iid):
                    ec2.stop_instances(InstanceIds=[iid])
                    print(f"stopped {iid}")

Note the tag:auto-stop filter. Nothing is touched unless its owner opted in. That single design choice is what made the rollout politically survivable.

The goal isn't to delete things. It's to make the steady state of your account reflect what's actually being used, so the bill stops paying for hypothetical needs.

Guardrails that keep this safe

Automation that can stop infrastructure needs brakes:

Notify before terminate. Stop on day one, post to an SNS topic, and only terminate once a tag recording the stop date shows the 7-day grace period has passed.
Exclusion tags win. A do-not-stop=true tag always overrides idle logic.
Dry-run mode. Ship the janitor logging actions for a week before it acts.
Audit trail. Every action lands in CloudWatch Logs and a DynamoDB table, so you can answer "who stopped my box" in seconds.
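These guardrails are easiest to reason about as one pure decision function that the janitor calls before touching anything. The sketch below is one possible shape, not the exact code from my accounts: decide, audit_record, and the stopped_at argument (the stop time, read from a hypothetical tag or table) are names invented for this example.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(days=7)  # grace period between stop and terminate


def decide(tags, idle, stopped_at=None, now=None, dry_run=True):
    """Return (action, will_execute) for one resource."""
    now = now or datetime.now(timezone.utc)
    if tags.get("do-not-stop") == "true":
        action = "skip"  # exclusion tags always win
    elif tags.get("auto-stop") != "true":
        action = "skip"  # owner never opted in
    elif stopped_at is not None:
        # Already stopped by the janitor: terminate only after the grace period.
        action = "terminate" if now - stopped_at >= GRACE else "wait"
    elif idle:
        action = "stop"
    else:
        action = "skip"
    # In dry-run mode the action is reported but never executed.
    will_execute = (not dry_run) and action in ("stop", "terminate")
    return action, will_execute


def audit_record(resource_id, action, executed, now=None):
    """Build the item to write to the log and the audit table."""
    now = now or datetime.now(timezone.utc)
    return {
        "resource_id": resource_id,
        "action": action,
        "executed": executed,
        "timestamp": now.isoformat(),
    }
```

Defaulting dry_run to True means a misconfigured deployment logs instead of acting, which is the failure mode I would rather have.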

After this ran for a quarter, idle compute on my dev and staging accounts dropped by roughly 30%, with no production incidents traced to the sweep.

Takeaways

Define idleness per resource type with both a metric threshold and a sustained duration, never a single number.
Use EventBridge scheduled rules for the long tail and alarm-driven rules for fast drop-offs.
Always take the cheapest reversible action first: stop before terminate, and gate everything on opt-in tags.
Notify, log, and add a grace period; automation that can't be audited will get switched off.
