Real-time inference endpoints that don’t break the bank
Autoscaling, multi-model endpoints, and serverless inference: pay for what you use.
I once watched a recommendation model burn through $9,000 a month on a fleet of always-on ml.g4dn.xlarge SageMaker endpoints that averaged 4% GPU utilization. The model was good. The serving setup was the problem: we'd provisioned for peak traffic that arrived for about ninety minutes a day, and paid for idle GPUs the other twenty-two and a half hours.
Real-time inference doesn't have to be expensive. The trick is matching the serving pattern to the actual traffic shape, and being honest about whether you even need a GPU. Here's the decision tree I use now.
First question: do you need real-time at all?
The cheapest inference is the kind you don't run synchronously. Before optimizing an endpoint, I check whether the workload tolerates latency:
- Batch Transform: periodic scoring of a dataset. Spins up, processes, tears down. Near-zero idle cost.
- Asynchronous Inference: queues requests, scales to zero when idle, returns results to S3. Perfect for large payloads or workloads that tolerate seconds of latency.
- Serverless Inference: true scale-to-zero for spiky, low-volume traffic. You pay per millisecond of compute and nothing when idle.
- Real-time endpoint: always-on, with the lowest and most consistent latency. The most expensive default.
The most common cost mistake isn't choosing the wrong instance; it's choosing real-time when async or serverless would have met the latency SLA at a fraction of the price.
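The list above can be read as a small decision function. The sketch below is illustrative only: the `Workload` fields and the 10-second cutoff for async are assumptions made for the example, not SageMaker limits or thresholds from this post.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload description; field names are illustrative."""
    max_latency_s: float     # slowest response the caller can tolerate
    scheduled_batch: bool    # True if scoring a dataset on a schedule
    spiky_low_volume: bool   # True if traffic arrives in bursts with idle gaps
    cold_start_ok: bool      # True if a few seconds of cold start is acceptable


def pick_serving_mode(w: Workload) -> str:
    """Return the cheapest mode that plausibly meets the latency need."""
    if w.scheduled_batch:
        return "batch-transform"
    # Assumed cutoff: callers that can wait 10 s or more go through the queue.
    if w.max_latency_s >= 10:
        return "async-inference"
    if w.spiky_low_volume and w.cold_start_ok:
        return "serverless-inference"
    # Everything else pays for an always-on endpoint.
    return "real-time-endpoint"
```

The order of the checks is the point: real-time is the fallback, not the starting assumption.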
Serverless inference for spiky traffic
For the recommendation model above, traffic was bursty and CPU-servable once we quantized it, which matters because Serverless Inference runs on CPU only. It scales to zero between bursts, so we stopped paying for idle entirely. You configure memory (1 to 6 GB, in 1 GB steps) and max concurrency:
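import boto3

sm = boto3.client("sagemaker")

# A serverless variant has no instance type or count, only memory and concurrency.
sm.create_endpoint_config(
    EndpointConfigName="reco-serverless",
    ProductionVariants=[{
        "ModelName": "reco-v3",
        "VariantName": "AllTraffic",
        "ServerlessConfig": {
            "MemorySizeInMB": 4096,  # 1024 to 6144, in 1024 MB steps
            "MaxConcurrency": 20,    # concurrent invocations before throttling
        },
    }],
)

# Create the endpoint from that config.
sm.create_endpoint(
    EndpointName="reco",
    EndpointConfigName="reco-serverless",
)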
The trade-off is cold starts: the first request after idle can add a few seconds while a worker spins up. If your p99 SLA can't absorb that, keep a small real-time variant warm or use provisioned concurrency.
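Provisioned concurrency is set through the same `ServerlessConfig` block, via its `ProvisionedConcurrency` field. Here is a minimal sketch, assuming the `reco-v3` model from above; the helper names, the config name `reco-serverless-warm`, and the value of 2 warm workers are made up for illustration.

```python
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_variant(model_name, memory_mb=4096, max_concurrency=20,
                       provisioned_concurrency=None):
    """Build a ProductionVariants entry for a serverless endpoint config."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    serverless = {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}
    if provisioned_concurrency is not None:
        # Warm capacity cannot exceed the concurrency ceiling.
        if not 1 <= provisioned_concurrency <= max_concurrency:
            raise ValueError("provisioned_concurrency must be in [1, max_concurrency]")
        serverless["ProvisionedConcurrency"] = provisioned_concurrency
    return {
        "ModelName": model_name,
        "VariantName": "AllTraffic",
        "ServerlessConfig": serverless,
    }


def create_warm_config(config_name="reco-serverless-warm"):
    """Create the endpoint config; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works offline

    sm = boto3.client("sagemaker")
    return sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[serverless_variant("reco-v3", provisioned_concurrency=2)],
    )
```

Provisioned concurrency is billed while it is allocated, so it trades some of the scale-to-zero saving for predictable latency. Size it to the traffic that truly can't wait.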
If you do need a real-time endpoint, right-size it
Four levers cut real-time cost dramatically, and the first two don't touch the model at all:
| Lever | Mechanism | Typical saving |
|---|---|---|
| Autoscaling | Scale instance count on InvocationsPerInstance | 40-70% off peak-provisioned cost |
| Multi-model endpoints | Host many models behind one endpoint, load on demand | Largest when many rarely called models share instances |
| Graviton / Inferentia | Move off GPU instances to inf2 accelerators or Graviton (ARM) CPUs | 30-50% on price-performance |
| Quantization (INT8) | Shrink model so CPU or smaller GPU serves it | Often eliminates GPU entirely |
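For the multi-model row, the caller chooses the model per request through the `TargetModel` parameter of `invoke_endpoint`. The sketch below assumes an existing multi-model endpoint and a container that accepts and returns JSON; the endpoint name, artifact name, and payload used in the usage note are hypothetical.

```python
import json


def mme_request(endpoint_name, target_model, features):
    """Build keyword arguments for sagemaker-runtime invoke_endpoint."""
    return {
        "EndpointName": endpoint_name,
        # Artifact path relative to the endpoint's S3 model prefix.
        "TargetModel": target_model,
        "ContentType": "application/json",
        "Body": json.dumps(features).encode("utf-8"),
    }


def invoke(endpoint_name, target_model, features):
    """Send the request; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so mme_request works offline

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        **mme_request(endpoint_name, target_model, features)
    )
    # Assumes the container responds with a JSON document.
    return json.loads(response["Body"].read())
```

A call such as `invoke("reco-mme", "tenant-42.tar.gz", {"user_id": 7})` loads that artifact on first use, so the first request to a cold model is slower than later ones.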
Autoscaling alone is the biggest win for most teams. Provision a sane minimum and let Application Auto Scaling add instances against a target-tracking metric:
# Register the variant's instance count as a scalable target (1 to 8 instances).
aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/reco/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 1 --max-capacity 8

# Track invocations per instance; TargetValue is a per-minute average.
aws application-autoscaling put-scaling-policy \
  --policy-name reco-tt \
  --service-namespace sagemaker \
  --resource-id endpoint/reco/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration \
  '{"TargetValue":750.0,"PredefinedMetricSpecification":{"PredefinedMetricType":"SageMakerVariantInvocationsPerInstance"}}'
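The `TargetValue` is counted in invocations per instance per minute, so it should come from a load test rather than a guess. One common rule of thumb is peak sustainable requests per second, times a safety factor, times 60. The sketch below applies it. The 25 requests per second figure is a hypothetical that happens to reproduce 750, not a measurement from this endpoint, and the cooldown values are assumptions as well.

```python
import json


def target_invocations_per_instance(max_rps, safety_factor=0.5):
    """Convert load-tested peak RPS per instance into a per-minute target."""
    if max_rps <= 0:
        raise ValueError("max_rps must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return max_rps * safety_factor * 60


def target_tracking_config(target_value, scale_in_cooldown_s=300,
                           scale_out_cooldown_s=60):
    """Build the JSON passed to --target-tracking-scaling-policy-configuration."""
    return json.dumps({
        "TargetValue": target_value,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
        # Assumed cooldowns: scale out quickly, scale in cautiously.
        "ScaleInCooldown": scale_in_cooldown_s,
        "ScaleOutCooldown": scale_out_cooldown_s,
    })


# Hypothetical load test result: one instance sustains 25 requests per second.
policy_json = target_tracking_config(target_invocations_per_instance(25))
```

A lower safety factor leaves more headroom per instance and scales out earlier, at the cost of running more instances on average.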
Don't forget the silicon and the savings plan
Two more things I now check by default. First, AWS Inferentia (inf2) instances are purpose-built for inference and frequently beat GPU instances on cost per inference once you compile the model with the Neuron SDK. Second, steady inference fleets are exactly the kind of predictable workload that SageMaker Savings Plans cover: committing to a baseline of compute can knock another 30%+ off the on-demand rate for the floor you'll always run.
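To see how a commitment on the floor combines with on-demand bursts, here is a rough cost model. Every number in it is a placeholder: the hourly rate, the discount, and the burst hours are assumptions for the arithmetic, not AWS prices or figures from this project.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cost(baseline_instances, burst_instance_hours, hourly_rate,
                 plan_discount=0.0):
    """Baseline runs all month at the discounted rate; bursts pay on-demand."""
    if not 0 <= plan_discount < 1:
        raise ValueError("plan_discount must be in [0, 1)")
    committed = (baseline_instances * HOURS_PER_MONTH
                 * hourly_rate * (1 - plan_discount))
    on_demand = burst_instance_hours * hourly_rate
    return committed + on_demand


# Placeholder scenario: 1 always-on instance, plus 3 extra instances
# for a 1.5-hour daily peak over 30 days, at a made-up $0.20/hour.
burst_hours = 3 * 1.5 * 30
without_plan = monthly_cost(1, burst_hours, hourly_rate=0.20)
with_plan = monthly_cost(1, burst_hours, hourly_rate=0.20, plan_discount=0.30)
```

The discount only applies to the committed floor, which is why it pays to shrink the floor with autoscaling first and commit second.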
The outcome
For that $9k recommendation model: we quantized to INT8, moved to serverless for the long tail, kept one autoscaled CPU real-time variant for the peak window, and ended at about $1,400/month, roughly an 85% cut, with no measurable change in recommendation quality and a p99 still inside our 200ms SLA.
Takeaways
- Ask whether you need real-time at all: batch, async, and serverless inference scale to zero and cover most workloads for less.
- Serverless inference eliminates idle cost for spiky traffic; the price is cold-start latency.
- For real-time endpoints, autoscaling is the single biggest lever, followed by Inferentia/Graviton and INT8 quantization.
- Cover your always-on baseline with a SageMaker Savings Plan and right-size the rest dynamically.