FinOps for engineers: making cost a first-class metric
Bringing cost into the engineering loop without turning everyone into an accountant.
For a long time, cost at my company was a thing the finance team looked at once a month, gasped, and emailed about. By the time anyone saw the number, the spending had already happened. Engineers had no feedback loop: we shipped features, and the bill was someone else's problem.
FinOps is the practice of fixing that feedback loop: making cost a metric engineers see, own, and act on while they're still building, the same way we treat latency or error rate. Here's how I made that real on a team that had never thought about a dollar.
Cost is just another SLI
The mindset shift that worked was framing cost as a service-level indicator. You wouldn't ship a service with no p99 latency dashboard. So why ship one with no "$ per 1,000 requests" number? Once cost is a unit-economics metric tied to business value, it stops being scary accounting and becomes an engineering signal.
The goal of FinOps isn't to spend less. It's to spend deliberately: to know what each dollar buys and to make that visible to the people who can change it.
Step one: you cannot fix what you cannot allocate
None of this works without tagging. If 40% of your bill lands in "untagged," every conversation stalls. I enforce a small mandatory tag set and use AWS Organizations tag policies plus a Config rule to catch drift:
- team: who owns it
- service: what it is
- environment: prod / staging / dev
- cost-center: for chargeback
Then activate those keys as cost allocation tags in the Billing console. User-defined tags can take up to 24 hours to appear there, and they only apply to spend going forward; nothing is backfilled automatically. After that, Cost Explorer and the Cost and Usage Report can group spend by team and service.
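To make the mandatory tag set concrete, here is a minimal sketch of a compliance check in Python. `REQUIRED_TAGS` mirrors the tag set listed earlier; `missing_tags`, `tag_problems`, and the sample resources are hypothetical names for illustration. A real setup would lean on tag policies and the Config rule rather than a script like this.

```python
REQUIRED_TAGS = ("team", "service", "environment", "cost-center")
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev"}


def missing_tags(tags: dict[str, str]) -> list[str]:
    """Return required tag keys that are absent or blank on a resource."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def tag_problems(tags: dict[str, str]) -> list[str]:
    """Describe every tagging problem on a resource (empty list = compliant)."""
    problems = [f"missing tag: {key}" for key in missing_tags(tags)]
    env = tags.get("environment", "").strip()
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unexpected environment: {env}")
    return problems


# Hypothetical inventory: resource name -> tags as a plain dict.
resources = {
    "checkout-api": {
        "team": "payments",
        "service": "checkout",
        "environment": "prod",
        "cost-center": "cc-100",
    },
    "scratch-box": {"team": "search", "environment": "qa"},
}

for name, tags in resources.items():
    for problem in tag_problems(tags):
        print(f"{name}: {problem}")
```

Running a check like this in CI or on a schedule gives each team a concrete list to fix instead of a vague "please tag your stuff."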
Step two: put the number in front of engineers
A monthly finance email is too slow. I pull cost daily and post per-team deltas into Slack so anomalies surface within a day, not a quarter. A small boto3 job over Cost Explorer covers it:
import boto3
from datetime import date, timedelta

ce = boto3.client("ce")
today = date.today()

# Cost Explorer treats End as exclusive, so this window is exactly yesterday.
resp = ce.get_cost_and_usage(
    TimePeriod={
        "Start": str(today - timedelta(days=1)),
        "End": str(today),
    },
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "team"}],
)

for group in resp["ResultsByTime"][0]["Groups"]:
    # Tag groups come back as "team$<value>"; an empty value means untagged.
    team = group["Keys"][0].split("$", 1)[-1] or "untagged"
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    if amount > 50:  # only ping on meaningful spend
        print(f"{team}: ${amount:,.2f} yesterday")
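The job above prints yesterday's totals. To turn them into per-team deltas, one option is to compare two days and format a single message. This is a sketch under assumptions: `daily_deltas`, `format_slack_message`, and the 20% change threshold are illustrative, and the two dictionaries stand in for two days of parsed Cost Explorer results. Slack incoming webhooks accept a JSON body with a `text` field, which is all this builds; the actual HTTP POST is left out.

```python
import json


def daily_deltas(previous, current, min_spend=50.0, min_change_pct=20.0):
    """Compare two {team: dollars} mappings and return notable changes.

    A team is reported when its current spend is at least `min_spend`
    and it moved by at least `min_change_pct` percent day over day.
    """
    rows = []
    for team, amount in current.items():
        if amount < min_spend:
            continue
        before = previous.get(team, 0.0)
        # A team with no spend the day before counts as new, not as +inf%.
        change_pct = None if before == 0 else (amount - before) / before * 100
        if change_pct is None or abs(change_pct) >= min_change_pct:
            rows.append((team, amount, change_pct))
    # Largest spend first so the most expensive movers lead the message.
    return sorted(rows, key=lambda row: row[1], reverse=True)


def format_slack_message(rows):
    """Build a JSON body with a single `text` field for a webhook POST."""
    if not rows:
        return json.dumps({"text": "No notable cost changes yesterday."})
    lines = []
    for team, amount, change_pct in rows:
        change = "new" if change_pct is None else f"{change_pct:+.0f}%"
        lines.append(f"{team}: ${amount:,.2f} ({change})")
    return json.dumps({"text": "\n".join(lines)})


# Made-up numbers for illustration only.
two_days_ago = {"payments": 100.0, "search": 200.0, "tiny": 10.0}
yesterday = {"payments": 150.0, "search": 205.0, "tiny": 30.0, "ml": 80.0}
print(format_slack_message(daily_deltas(two_days_ago, yesterday)))
```

Filtering on both a spend floor and a percentage change keeps the channel quiet on normal days, which matters more for adoption than any single alert.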
For genuine surprises, AWS Cost Anomaly Detection applies machine learning to your historical spend and alerts on statistically unusual jumps. That is far better than a static threshold that fires every time traffic doubles for a good reason.
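To see why a baseline learned from history beats a fixed dollar threshold, here is a toy z-score check. It only illustrates the idea and is not how Cost Anomaly Detection works internally; `is_anomalous`, the 3-sigma cutoff, and the sample numbers are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, today, z_cutoff=3.0):
    """Flag `today` if it sits more than `z_cutoff` standard deviations
    above the mean of `history` (a list of recent daily costs)."""
    if len(history) < 2:
        return False  # not enough data to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today > mu  # flat history: any increase is unusual
    return (today - mu) / sigma > z_cutoff


# A steady service: a jump to $140 stands out against ~$100/day.
steady = [100, 102, 98, 101, 99, 100, 103]
print(is_anomalous(steady, 140))

# A bursty service: $240 is within its normal swing, so no alert.
bursty = [100, 220, 90, 210, 110, 230, 95]
print(is_anomalous(bursty, 240))
```

A static "$150 per day" alarm would stay silent or fire constantly depending on the service, while a per-service baseline adapts to what is normal for each one.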
Step three: connect cost to a unit of value
Absolute dollars lie. A bill that grows 20% while traffic grows 60% is a win. So I divide cost by a business unit (requests, active users, GB processed) to get unit economics:
| Metric | Q1 | Q2 | Read |
|---|---|---|---|
| Total spend | $42k | $48k | Up 14%, looks bad |
| Requests (M) | 180 | 260 | Up 44% |
| $ / 1k req | $0.233 | $0.185 | Down 21%, actually efficient |
This table is what turns a defensive budget meeting into a productive one. The total went up; the efficiency improved. Both facts are true and only the unit metric tells the real story.
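The arithmetic behind the table is small enough to automate alongside the daily cost pull. A minimal sketch, using the table's own figures; `cost_per_thousand` and `pct_change` are hypothetical helper names:

```python
def cost_per_thousand(total_spend, requests):
    """Dollars spent per 1,000 requests."""
    if requests <= 0:
        raise ValueError("requests must be positive")
    return total_spend / (requests / 1_000)


def pct_change(old, new):
    """Percentage change from `old` to `new`."""
    return (new - old) / old * 100


# Figures from the table: $42k over 180M requests, $48k over 260M requests.
q1 = cost_per_thousand(42_000, 180_000_000)
q2 = cost_per_thousand(48_000, 260_000_000)

print(f"Q1: ${q1:.3f} per 1k requests")   # Q1: $0.233 per 1k requests
print(f"Q2: ${q2:.3f} per 1k requests")   # Q2: $0.185 per 1k requests
print(f"Spend:     {pct_change(42_000, 48_000):+.0f}%")   # +14%
print(f"Requests:  {pct_change(180, 260):+.0f}%")         # +44%
print(f"Unit cost: {pct_change(q1, q2):+.0f}%")           # -21%
```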
Step four: make optimization part of the workflow
The classic FinOps loop is Inform → Optimize → Operate, and "Operate" is where most teams fail: they optimize once and let it rot. I bake it in:
- Right-sizing reviews land as backlog tickets from Compute Optimizer recommendations, not heroics.
- Commitment purchases (Savings Plans, Reserved capacity) are a quarterly ritual with an owner.
- New services ship with a cost estimate in the design doc, the same way they ship a capacity plan.
- Dev/staging environments auto-stop overnight: a Lambda on an EventBridge schedule routinely cuts non-prod spend by 60-70%.
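As one way to implement the overnight auto-stop, here is a sketch of a Lambda handler that stops running EC2 instances tagged as non-prod. It assumes instances carry an `environment` tag with values `dev` or `staging`, and that an EventBridge schedule invokes the function each evening. The names `NON_PROD`, `running_instance_ids`, and the optional `ec2` parameter are illustrative, and other resource types (databases, containers) would need their own handling.

```python
NON_PROD = ["dev", "staging"]


def running_instance_ids(pages):
    """Collect instance IDs from describe_instances response pages."""
    return [
        instance["InstanceId"]
        for page in pages
        for reservation in page.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


def lambda_handler(event, context, ec2=None):
    """Stop running EC2 instances tagged environment=dev or staging."""
    if ec2 is None:
        # Imported lazily so the logic can be exercised without AWS access.
        import boto3

        ec2 = boto3.client("ec2")

    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag:environment", "Values": NON_PROD},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = running_instance_ids(pages)
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return {"stopped": instance_ids}
```

Instances managed by an Auto Scaling group would simply be replaced after stopping, so those are better handled by scaling the group to zero on the same schedule. A matching morning schedule can start everything again, or engineers can start what they need on demand.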
Takeaways
- Treat cost as an SLI engineers see continuously, not a monthly finance report they react to.
- Tagging discipline is the prerequisite: without allocation, no FinOps conversation goes anywhere.
- Track unit economics ($/request, $/user), not just absolute dollars; growth and efficiency are different stories.
- Bake the Inform-Optimize-Operate loop into normal engineering workflow so savings don't decay.