Cutting CloudWatch costs: logs, metrics, and retention, The Cloud Ledger

Our CloudWatch bill quietly grew to be the third-largest line item in the account, behind only EC2 and RDS. Nobody had provisioned a giant resource; it was death by a thousand log streams, custom metrics emitted per request, and a default retention setting of "never expire." Observability is worth paying for, but I was paying for data I would never look at again.

Here's what actually moved the number, in rough order of impact.

Understand what you're billed for

CloudWatch charges across four buckets that people conflate:

Logs ingestion: roughly $0.50/GB ingested. This is usually the biggest surprise.
Logs storage: ~$0.03/GB-month, indefinitely if retention is unset.
Custom metrics: ~$0.30 per metric per month. A "metric" is a unique combination of name + dimension values, so high-cardinality dimensions explode this.
API/dashboards/alarms: usually small, but GetMetricData calls from third-party tools add up.

You can't optimize what you can't attribute. Start by splitting your bill across these four before touching anything.
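To make the split concrete, here is a minimal sketch that prices the four buckets using the rough per-unit figures quoted above. The function name and the example volumes are hypothetical, and the flat rates ignore volume tiers and regional differences, so treat the output as a first approximation rather than a reproduction of your invoice.

```python
# Rough per-unit prices quoted in this article (assumed flat, no tiers).
INGEST_PER_GB = 0.50         # USD per GB ingested
STORAGE_PER_GB_MONTH = 0.03  # USD per GB-month stored
METRIC_PER_MONTH = 0.30      # USD per custom metric per month


def estimate_monthly_cost(ingested_gb, stored_gb, custom_metrics, api_and_other=0.0):
    """Return a per-bucket estimate of one month's CloudWatch bill in USD."""
    buckets = {
        "ingestion": ingested_gb * INGEST_PER_GB,
        "storage": stored_gb * STORAGE_PER_GB_MONTH,
        "custom_metrics": custom_metrics * METRIC_PER_MONTH,
        # API, dashboard, and alarm charges are passed in as a lump sum.
        "api_dashboards_alarms": api_and_other,
    }
    buckets["total"] = sum(buckets.values())
    return {name: round(cost, 2) for name, cost in buckets.items()}


if __name__ == "__main__":
    # Hypothetical account: 2 TB ingested, 10 TB retained, 5,000 custom metrics.
    for name, cost in estimate_monthly_cost(2000, 10000, 5000, 40.0).items():
        print(f"{name:>22}: ${cost:,.2f}")
```

Running it against your own volumes shows which bucket dominates, which in turn tells you which of the sections below to read first.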

Retention is the free win

The single highest-ROI change: set retention on every log group. The default is "never expire," which means you pay storage indefinitely for logs no human will read. I set 30 days for application logs, 90 for audit-adjacent ones, and archive anything I'm legally required to keep to S3, where archival storage classes such as S3 Glacier can be roughly an order of magnitude cheaper.

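# Find every log group with no retention set and apply a 30-day policy.
# --output text prints all names on one tab-separated line, so split them
# into one name per line before looping.
aws logs describe-log-groups \
  --query 'logGroups[?!retentionInDays].logGroupName' \
  --output text |
tr '\t' '\n' |
while read -r lg; do
  [ -n "$lg" ] || continue  # skip blank lines when nothing matches
  aws logs put-retention-policy \
    --log-group-name "$lg" \
    --retention-in-days 30
done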
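The shell loop applies 30 days to everything. If you want the 30/90 split described above, a Python variant might look like the sketch below. It assumes audit-adjacent groups can be recognised by name, which is a hypothetical convention (the `AUDIT_MARKERS` tuple); adjust it to however your account names things. The client is passed in so the logic can be dry-run or tested without touching AWS.

```python
# Hypothetical naming convention: groups containing these substrings get 90 days.
AUDIT_MARKERS = ("audit", "cloudtrail")


def retention_days(log_group_name):
    """Pick a retention period from the log group's name."""
    name = log_group_name.lower()
    return 90 if any(marker in name for marker in AUDIT_MARKERS) else 30


def apply_retention(logs_client, dry_run=True):
    """Set retention on every log group that has none.

    Returns a {log_group_name: days} mapping of what was (or would be) applied.
    """
    planned = {}
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            if "retentionInDays" in group:
                continue  # leave existing policies alone
            name = group["logGroupName"]
            planned[name] = retention_days(name)
            if not dry_run:
                logs_client.put_retention_policy(
                    logGroupName=name, retentionInDays=planned[name]
                )
    return planned


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   print(apply_retention(boto3.client("logs"), dry_run=True))
```

Running with `dry_run=True` first gives you a list to review before anything is changed.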

Retention only affects storage, not the ingestion charge you already paid. To cut ingestion you have to stop sending the bytes in the first place.

Cut ingestion at the source

Ingestion is billed on volume in, so the levers are upstream of CloudWatch:

Reduce log verbosity in prod. Shipping DEBUG from every Lambda invocation is the most common offender. Set the level via env var and emit INFO and above.
Sample high-volume, low-value logs. You rarely need 100% of access logs; 10% sampling keeps the shape of traffic at a tenth of the cost.
Stop logging giant payloads. One team was logging entire API request bodies, multi-KB JSON per request. Log a request ID and the handful of fields you actually query instead.
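The first two levers can be sketched with Python's standard logging module. This is an illustration rather than a drop-in: `LOG_LEVEL` is an assumed environment variable name, and `SampleFilter` and `build_logger` are hypothetical helpers. The filter samples only records below WARNING, so errors are never dropped.

```python
import logging
import os
import random


class SampleFilter(logging.Filter):
    """Pass a fixed fraction of records below WARNING; always pass the rest."""

    def __init__(self, rate, rng=None):
        super().__init__()
        self.rate = rate
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.rate


def build_logger(name, sample_rate=1.0):
    """Create a logger whose level comes from LOG_LEVEL (default INFO)."""
    logger = logging.getLogger(name)
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    try:
        logger.setLevel(level_name)
    except ValueError:
        # Unknown level name in the environment: fall back to INFO.
        logger.setLevel(logging.INFO)
    logger.addFilter(SampleFilter(sample_rate))
    return logger


if __name__ == "__main__":
    logging.basicConfig(format="%(levelname)s %(name)s %(message)s")
    access = build_logger("access", sample_rate=0.1)  # keep roughly 10%
    for i in range(100):
        access.info("GET /orders/%d 200", i)
    access.warning("upstream timeout")  # never sampled away
```

Sampling in the application, before the line is written, is what saves money; filtering after the log reaches CloudWatch does not reduce the ingestion charge.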

Tame custom metrics cardinality

Because a metric is a unique name + dimension combination, a dimension like customer_id or request_id can turn one metric into millions. The fix is to keep metrics low-cardinality and push the high-cardinality detail into logs, then query it with Logs Insights only when needed. If you use Embedded Metric Format, be deliberate about which fields become dimensions:

{
  "_aws": {
    "Timestamp": 1700000000000,
    "CloudWatchMetrics": [{
      "Namespace": "Orders",
      "Dimensions": [["Service", "Region"]],
      "Metrics": [{"Name": "Latency", "Unit": "Milliseconds"}]
    }]
  },
  "Service": "checkout",
  "Region": "us-east-1",
  "customer_id": "a1b2",
  "Latency": 84
}

Note customer_id is present in the log but not in Dimensions, so it's queryable but doesn't multiply your metric count.
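The multiplication is easy to underestimate, so here is a small worked calculation. The dimension counts (12 services, 4 regions, 50,000 customers) are made-up numbers for illustration, and the flat $0.30 rate is the rough figure quoted earlier, ignoring volume tiers. The product is an upper bound, since CloudWatch only bills combinations you actually emit.

```python
from math import prod

PRICE_PER_METRIC = 0.30  # USD per metric-month, rough figure from this article


def metric_count(distinct_values):
    """Upper bound on billed metrics for one metric name.

    distinct_values maps each dimension name to its number of distinct values.
    """
    return prod(distinct_values.values())


def monthly_metric_cost(distinct_values, price=PRICE_PER_METRIC):
    """Estimated monthly cost in USD for one metric name."""
    return round(metric_count(distinct_values) * price, 2)


if __name__ == "__main__":
    low = {"Service": 12, "Region": 4}          # hypothetical counts
    high = {**low, "customer_id": 50_000}       # same metric, one extra dimension
    print("low cardinality: ", metric_count(low), "metrics,", monthly_metric_cost(low), "USD")
    print("high cardinality:", metric_count(high), "metrics,", monthly_metric_cost(high), "USD")
```

Adding a single per-customer dimension multiplies both the count and the cost by the number of customers, which is why that field belongs in the log body rather than in Dimensions.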

Route around CloudWatch where it makes sense

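For very high log volumes, ingestion to CloudWatch Logs can cost more than the analysis is worth. I now route bulk logs through a subscription filter to Kinesis Data Firehose into S3, then query with Athena. You give up the live tail and Logs Insights convenience, but storage drops from ~$0.03/GB-month to ~$0.023/GB-month in S3 Standard, and lifecycle rules can move older objects to colder, cheaper classes. One caveat: a subscription filter only forwards data that CloudWatch Logs has already ingested, so it trims storage but not the ingestion charge. To skip ingestion as well, the logs have to reach Firehose or S3 directly from the source, for example from a log agent or from a service that can deliver its logs straight to S3. Keep hot, operational logs in CloudWatch; send cold, compliance-driven logs to S3.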
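Wiring a log group to Firehose comes down to one API call. The sketch below wraps boto3's `put_subscription_filter`; the helper name, the default filter name, and the ARNs in the usage comment are all placeholders. It assumes the delivery stream and an IAM role that CloudWatch Logs can assume already exist.

```python
def subscribe_to_firehose(logs_client, log_group, firehose_arn, role_arn,
                          filter_name="bulk-to-s3"):
    """Forward every event in log_group to a Firehose delivery stream."""
    logs_client.put_subscription_filter(
        logGroupName=log_group,
        filterName=filter_name,
        filterPattern="",          # an empty pattern matches all events
        destinationArn=firehose_arn,
        roleArn=role_arn,          # role CloudWatch Logs assumes to write to Firehose
    )
    return filter_name


# Usage, assuming boto3 is installed and credentials are configured
# (both ARNs below are placeholders):
#   import boto3
#   subscribe_to_firehose(
#       boto3.client("logs"),
#       "/aws/lambda/checkout",
#       "arn:aws:firehose:us-east-1:123456789012:deliverystream/example",
#       "arn:aws:iam::123456789012:role/example-cwl-to-firehose",
#   )
```

Pair this with a short retention period on the source log group so the same bytes are not stored at full price in both places.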

Takeaways

Split the bill into ingestion, storage, custom metrics, and API before optimizing; they have different levers.
Setting log retention is the fastest win; default "never expire" silently bills storage forever.
Ingestion is cut upstream: reduce prod log verbosity, sample, and stop logging large payloads. Retention won't help here.
Keep metric dimensions low-cardinality and route bulk/archival logs to S3 + Athena instead of CloudWatch.