FinOps for engineers: making cost a first-class metric, The Cloud Ledger

For a long time, cost at my company was a thing the finance team looked at once a month, gasped, and emailed about. By the time anyone saw the number, the spending had already happened. Engineers had no feedback loop: we shipped features and the bill was someone else's problem.

FinOps is the practice of fixing that feedback loop: making cost a metric engineers see, own, and act on while they're still building, the same way we treat latency or error rate. Here's how I made that real on a team that had never thought about a dollar.

Cost is just another SLI

The mindset shift that worked was framing cost as a service-level indicator. You wouldn't ship a service with no p99 latency dashboard. So why ship one with no "$ per 1,000 requests" number? Once cost is a unit-economics metric tied to business value, it stops being scary accounting and becomes an engineering signal.

The goal of FinOps isn't to spend less. It's to spend deliberately: to know what each dollar buys and to make that visible to the people who can change it.

Step one: you cannot fix what you cannot allocate

None of this works without tagging. If 40% of your bill lands in "untagged," every conversation stalls. I enforce a small mandatory tag set and use AWS Organizations tag policies plus a Config rule to catch drift:

team: who owns it
service: what it is
environment: prod / staging / dev
cost-center: for chargeback

Then activate those as cost allocation tags in the Billing console (user-defined tag keys can take up to 24 hours to appear there, and by default they only apply to spend going forward; if you need history, you can request a backfill of up to 12 months). After that, Cost Explorer and the Cost and Usage Report can group spend by team and service.
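To make the mandatory set concrete, here is a minimal sketch of the kind of check a drift job might run over each resource's tags. The function name, the allowed environment values, and the input shape (a plain dict of tag key to value) are assumptions for illustration; in practice the tags would come from something like the Resource Groups Tagging API, and the actual enforcement still lives in the tag policies and the Config rule.

```python
# Hypothetical helper: report which mandatory tags are missing or invalid.
REQUIRED_TAGS = ("team", "service", "environment", "cost-center")
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev"}  # assumed from the list above


def tag_violations(tags: dict[str, str]) -> list[str]:
    """Return human-readable problems with a resource's tags (empty = compliant)."""
    problems = []
    for key in REQUIRED_TAGS:
        # Treat an empty or whitespace-only value the same as a missing tag.
        if not tags.get(key, "").strip():
            problems.append(f"missing tag: {key}")
    env = tags.get("environment", "").strip()
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid environment: {env}")
    return problems


print(tag_violations({"team": "payments", "service": "ledger-api", "environment": "qa"}))
```

Returning a list of problems rather than a boolean makes it easy to paste the result straight into a ticket or a Slack message for the owning team.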

Step two: put the number in front of engineers

A monthly finance email is too slow. I pull cost daily and post per-team deltas into Slack so anomalies surface within a day, not a quarter. A small boto3 job over Cost Explorer covers it:

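import boto3
from datetime import datetime, timedelta, timezone

# Needs credentials with the ce:GetCostAndUsage permission.
ce = boto3.client("ce")
# Cost Explorer reports in UTC, so take "today" in UTC rather than local time.
today = datetime.now(timezone.utc).date()

# End is exclusive, so this window covers exactly yesterday.
resp = ce.get_cost_and_usage(
    TimePeriod={
        "Start": str(today - timedelta(days=1)),
        "End": str(today),
    },
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    # Tag group keys come back as "team$<value>"; untagged spend is "team$".
    team = group["Keys"][0].split("$", 1)[-1] or "untagged"
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 50:  # only ping on meaningful spend
        print(f"{team}: ${amount:,.2f} yesterday")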
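The job above prints absolute spend. To get the per-team deltas, one option is to fetch two consecutive days and compare them. The sketch below works on the response shape that get_cost_and_usage returns; the helper names and the $50 minimum change are assumptions, and posting the lines to Slack (for example through an incoming webhook) is left out.

```python
def spend_by_team(resp: dict) -> dict[str, float]:
    """Flatten one day of a get_cost_and_usage response grouped by the team tag."""
    totals = {}
    for group in resp["ResultsByTime"][0]["Groups"]:
        # Tag group keys look like "team$payments"; untagged spend is "team$".
        team = group["Keys"][0].split("$", 1)[-1] or "untagged"
        totals[team] = float(group["Metrics"]["UnblendedCost"]["Amount"])
    return totals


def delta_lines(previous: dict[str, float], current: dict[str, float],
                min_change: float = 50.0) -> list[str]:
    """Format per-team day-over-day changes, largest movers first."""
    rows = []
    for team in sorted(set(previous) | set(current)):
        before, after = previous.get(team, 0.0), current.get(team, 0.0)
        change = after - before
        if abs(change) < min_change:
            continue  # skip noise below the assumed threshold
        pct = f"{change / before:+.0%}" if before else "new"
        rows.append((abs(change), f"{team}: ${after:,.2f} ({change:+,.2f}, {pct})"))
    return [text for _, text in sorted(rows, reverse=True)]


two_days_ago = {"payments": 400.0, "search": 120.0}
yesterday = {"payments": 520.0, "search": 130.0, "ml": 75.0}
for line in delta_lines(two_days_ago, yesterday):
    print(line)
```

Sorting by absolute change keeps the message short and puts the team most likely to care at the top.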

For genuine surprises, AWS Cost Anomaly Detection runs an ML model over your historical spend and alerts on statistically unusual jumps. That is far better than a static threshold that fires every time traffic doubles for a good reason.
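To illustrate why a baseline-relative check is less noisy than a fixed dollar limit, here is a toy comparison. This is not the model Cost Anomaly Detection uses; it is a plain z-score over recent daily spend, and the function names, the sample figures, and the cutoff of 3 standard deviations are all illustrative assumptions.

```python
from statistics import mean, stdev


def static_alert(today: float, limit: float) -> bool:
    """Fixed threshold: fires whenever spend crosses a hard-coded dollar amount."""
    return today > limit


def zscore_alert(history: list[float], today: float, threshold: float = 3.0) -> bool:
    """Baseline-relative check: fires only when today sits far above recent variation."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    spread = stdev(history)
    if spread == 0:
        return today > history[0]  # perfectly flat history: any increase is unusual
    return (today - mean(history)) / spread > threshold


# Spend that has been growing steadily along with traffic (made-up numbers).
history = [900.0, 950.0, 1000.0, 1050.0, 1100.0]
print(static_alert(1120.0, limit=1000.0))  # fixed limit fires on ordinary growth
print(zscore_alert(history, 1120.0))       # in line with the recent trend
print(zscore_alert(history, 1600.0))       # a real jump
```

With these numbers the fixed limit alerts on a normal day, while the relative check stays quiet until spend jumps well outside its recent range.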

Step three: connect cost to a unit of value

Absolute dollars lie. A bill that grows 20% while traffic grows 60% is a win. So I divide cost by a business unit (requests, active users, GB processed) to get unit economics:

Metric	Q1	Q2	Read
Total spend	$42k	$48k	Up 14%, looks bad
Requests (M)	180	260	Up 44%
$ / 1k req	$0.233	$0.185	Down 21%, actually efficient

This table is what turns a defensive budget meeting into a productive one. The total went up; the efficiency improved. Both facts are true and only the unit metric tells the real story.
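The arithmetic behind the table is simple enough to automate. A minimal sketch, using the figures above and two hypothetical helper names:

```python
def cost_per_thousand(total_spend: float, requests: float) -> float:
    """Dollars spent per 1,000 requests."""
    return total_spend / (requests / 1_000)


def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


q1 = cost_per_thousand(42_000, 180_000_000)
q2 = cost_per_thousand(48_000, 260_000_000)

print(f"Total spend: {pct_change(42_000, 48_000):+.0f}%")
print(f"Requests:    {pct_change(180, 260):+.0f}%")
print(f"$ / 1k req:  ${q1:.3f} -> ${q2:.3f} ({pct_change(q1, q2):+.0f}%)")
```

Run against the table's inputs, this reproduces the three "Read" figures: spend up 14%, requests up 44%, unit cost down 21%.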

Step four: make optimization part of the workflow

The classic FinOps loop is Inform → Optimize → Operate, and "Operate" is where most teams fail: they optimize once and let it rot. I bake it in:

Right-sizing reviews land as backlog tickets from Compute Optimizer recommendations, not heroics.
Commitment purchases (Savings Plans, Reserved capacity) are a quarterly ritual with an owner.
New services ship with a cost estimate in the design doc, the same way they ship a capacity plan.
Dev/staging environments auto-stop overnight: a Lambda on an EventBridge schedule routinely cuts non-prod spend 60-70%.
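As an example of the last item, here is a minimal sketch of an auto-stop Lambda. It assumes the `environment` tag from the mandatory set and only handles plain EC2 instances; the function names are hypothetical, and instances managed by an Auto Scaling group, RDS databases, and similar resources would need their own handling. The EventBridge schedule that triggers it is configured separately.

```python
def stop_non_prod_instances(ec2, environments=("dev", "staging")) -> list[str]:
    """Stop running EC2 instances whose `environment` tag is in `environments`.

    `ec2` is a boto3 EC2 client (or anything with the same two methods).
    """
    filters = [
        {"Name": "tag:environment", "Values": list(environments)},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
    instance_ids = []
    # Paginate so large accounts are covered, not just the first page.
    for page in ec2.get_paginator("describe_instances").paginate(Filters=filters):
        for reservation in page["Reservations"]:
            instance_ids += [inst["InstanceId"] for inst in reservation["Instances"]]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids


def lambda_handler(event, context):
    """Entry point for the scheduled invocation."""
    import boto3  # imported here so the helper above can be exercised without AWS

    return {"stopped": stop_non_prod_instances(boto3.client("ec2"))}
```

Passing the client in as an argument keeps the stop logic testable with a fake, which matters for code that shuts down other people's machines. A matching "start" job in the morning would follow the same pattern with `start_instances`.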

Takeaways

Treat cost as an SLI engineers see continuously, not a monthly finance report they react to.
Tagging discipline is the prerequisite: without allocation, no FinOps conversation goes anywhere.
Track unit economics ($/request, $/user), not just absolute dollars; growth and efficiency are different stories.
Bake the Inform-Optimize-Operate loop into normal engineering workflow so savings don't decay.