Real-time inference endpoints that don’t break the bank, The Cloud Ledger

I once watched a recommendation model burn through $9,000 a month on a fleet of always-on ml.g4dn.xlarge SageMaker endpoints that averaged 4% GPU utilization. The model was good. The serving setup was the problem: we'd provisioned for peak traffic that arrived for about ninety minutes a day, and we paid for idle GPUs the other twenty-two and a half hours.

Real-time inference doesn't have to be expensive. The trick is matching the serving pattern to the actual traffic shape, and being honest about whether you even need a GPU. Here's the decision tree I use now.

First question: do you need real-time at all?

The cheapest inference is the kind you don't run synchronously. Before optimizing an endpoint, I check whether the workload tolerates latency:

Batch Transform: periodic scoring of a dataset. Spins up, processes, tears down. Near-zero idle cost.
Asynchronous Inference: queues requests, scales to zero when idle, returns results to S3. Perfect for large payloads or seconds-tolerant latency.
Serverless Inference: true scale-to-zero for spiky, low-volume traffic. You pay per millisecond of compute and nothing when idle.
Real-time endpoint: always-on, with the lowest and most consistent latency. The most expensive default.

The most common cost mistake isn't choosing the wrong instance. It's choosing real-time when async or serverless would have met the latency SLA at a fraction of the price.
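
The first question can be written down as a small helper. This is only a sketch of the decision order described above: the `Workload` fields, the function name, and the 5-second cutoff for "seconds-tolerant" are my own illustrative assumptions, not SageMaker settings.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Traffic profile of one model. All fields are illustrative assumptions."""
    latency_budget_s: float      # how long a caller can wait for a result
    has_idle_gaps: bool          # traffic drops to zero for long stretches
    tolerates_cold_start: bool   # a few extra seconds after idle is acceptable
    scored_on_schedule: bool     # inputs arrive as a dataset, not per request


def pick_serving_option(w: Workload) -> str:
    """Return the cheapest of the four options that still fits the workload."""
    if w.scored_on_schedule:
        return "batch-transform"
    # Assumed cutoff: callers that can wait several seconds can read from S3.
    if w.latency_budget_s >= 5.0:
        return "async-inference"
    if w.has_idle_gaps and w.tolerates_cold_start:
        return "serverless-inference"
    # Nothing cheaper fits, so pay for an always-on endpoint.
    return "real-time-endpoint"


print(pick_serving_option(Workload(0.2, True, True, False)))    # serverless-inference
print(pick_serving_option(Workload(0.2, False, False, False)))  # real-time-endpoint
```

The order matters: each branch is cheaper than the one after it, so real-time is what you land on only when nothing else fits.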

Serverless inference for spiky traffic

For the recommendation model above, traffic was bursty and CPU-servable once we quantized it. SageMaker Serverless Inference scales to zero between bursts, so we stopped paying for idle entirely. You configure memory (1-6 GB) and max concurrency:

import boto3
sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="reco-serverless",
    ProductionVariants=[{
        "ModelName": "reco-v3",
        "VariantName": "AllTraffic",
        "ServerlessConfig": {
            "MemorySizeInMB": 4096,
            "MaxConcurrency": 20,
        },
    }],
)
sm.create_endpoint(
    EndpointName="reco",
    EndpointConfigName="reco-serverless",
)
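
Before switching, it's worth a rough check that serverless is actually cheaper at your request volume. The sketch below compares compute cost only and ignores data-processing charges. The function names are mine, and every price in the example is a made-up placeholder, not a real AWS rate; plug in current pricing for your region and memory size.

```python
HOURS_PER_MONTH = 730  # average month length, used for always-on billing


def always_on_monthly_cost(hourly_rate_usd: float, instance_count: int) -> float:
    """Cost of instances that run all month regardless of traffic."""
    return hourly_rate_usd * instance_count * HOURS_PER_MONTH


def serverless_monthly_cost(requests_per_month: int,
                            avg_duration_s: float,
                            price_per_second_usd: float) -> float:
    """Compute-only cost when you are billed for request duration."""
    return requests_per_month * avg_duration_s * price_per_second_usd


def break_even_requests(hourly_rate_usd: float, instance_count: int,
                        avg_duration_s: float, price_per_second_usd: float) -> float:
    """Monthly request volume above which always-on becomes cheaper."""
    fixed = always_on_monthly_cost(hourly_rate_usd, instance_count)
    return fixed / (avg_duration_s * price_per_second_usd)


# Placeholder prices, for illustration only.
fixed = always_on_monthly_cost(hourly_rate_usd=0.25, instance_count=1)
spiky = serverless_monthly_cost(1_000_000, avg_duration_s=0.1,
                                price_per_second_usd=0.0001)
print(f"always-on: ${fixed:.2f}, serverless: ${spiky:.2f}")
print(f"break-even: {break_even_requests(0.25, 1, 0.1, 0.0001):,.0f} requests/month")
```

If your real volume sits well below the break-even figure, serverless wins on cost; close to or above it, an autoscaled real-time endpoint is likely the better deal.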

The trade-off is cold starts: the first request after an idle period can take a few extra seconds while a worker spins up. If your p99 SLA can't absorb that, keep a small real-time variant warm or use provisioned concurrency.
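
For the provisioned-concurrency route, here is a minimal sketch of how the serverless config might be built. It assumes the `ProvisionedConcurrency` field of `ServerlessConfig` and the constraint that it cannot exceed `MaxConcurrency`; that matches my reading of the SageMaker API, but check it against the current reference. The helper name `serverless_config` is hypothetical.

```python
# Memory sizes accepted by Serverless Inference (1-6 GB in 1 GB steps).
VALID_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_config(memory_mb: int, max_concurrency: int,
                      provisioned_concurrency: int = 0) -> dict:
    """Build a ServerlessConfig dict, optionally keeping some workers warm."""
    if memory_mb not in VALID_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {VALID_MEMORY_MB}")
    if not 0 <= provisioned_concurrency <= max_concurrency:
        raise ValueError("provisioned_concurrency must be between 0 and max_concurrency")

    config = {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}
    if provisioned_concurrency:
        # Warm workers avoid cold starts but are billed while idle.
        config["ProvisionedConcurrency"] = provisioned_concurrency
    return config


print(serverless_config(4096, 20, provisioned_concurrency=2))
```

The returned dict would go in the `ServerlessConfig` slot of `create_endpoint_config`. Keeping workers warm reintroduces some idle cost, so size it to the smallest number that keeps p99 in budget.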

If you do need a real-time endpoint, right-size it

Four levers cut real-time cost dramatically, and the first two need no changes to the model at all:

Lever	Mechanism	Typical saving
Autoscaling	Scale instance count on `InvocationsPerInstance`	40-70% off peak-provisioned cost
Multi-model endpoints	Host many models behind one endpoint, load on demand	Huge for long-tail models
Graviton / Inferentia	Move off x86 GPU to `inf2` or ARM CPU	30-50% on price-performance
Quantization (INT8)	Shrink model so CPU or smaller GPU serves it	Often eliminates GPU entirely

Autoscaling alone is the biggest win for most teams. Provision a sane minimum and let SageMaker add capacity on the target-tracking metric:

aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/reco/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 1 --max-capacity 8

aws application-autoscaling put-scaling-policy \
  --policy-name reco-tt \
  --service-namespace sagemaker \
  --resource-id endpoint/reco/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration \
  '{"TargetValue":750.0,"PredefinedMetricSpecification":{"PredefinedMetricType":"SageMakerVariantInvocationsPerInstance"}}'
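
The `TargetValue` is invocations per instance per minute, and it should come from a load test rather than a guess. A common rule of thumb, which I believe matches AWS's load-testing guidance, is peak sustainable requests per second for one instance, times a safety factor, times 60. The sketch below shows one way a value like 750 could arise; the 25 RPS figure is a hypothetical example, not a measurement from this endpoint.

```python
def target_invocations_per_instance(max_rps_per_instance: float,
                                    safety_factor: float = 0.5) -> float:
    """Convert a load-tested per-instance RPS into a per-minute scaling target.

    safety_factor leaves headroom so new instances come up before the
    existing ones saturate; 0.5 is a conservative starting point.
    """
    if max_rps_per_instance <= 0:
        raise ValueError("max_rps_per_instance must be positive")
    if not 0 < safety_factor <= 1:
        raise ValueError("safety_factor must be in (0, 1]")
    return max_rps_per_instance * safety_factor * 60


# Hypothetical: one instance sustained 25 requests/second in a load test.
print(target_invocations_per_instance(25))  # 750.0
```

A lower safety factor scales out earlier and costs more; a higher one saves money but risks latency spikes while new instances start.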

Don't forget the silicon and the savings plan

Two more things I now check by default. First, AWS Inferentia (inf2) instances are purpose-built for inference and frequently beat GPU instances on cost-per-inference once you compile the model with the Neuron SDK. Second, steady inference fleets are exactly the kind of predictable workload that SageMaker Savings Plans cover. Committing to a baseline of compute can knock another 30%+ off the on-demand rate for the floor you'll always run.
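
To see what the commitment is worth, split the bill into the floor you always run and the burst capacity autoscaling adds on top. This is a back-of-the-envelope sketch: the function name, the hourly rate, and the burst hours are placeholders I made up, and the discount is simply the 30% figure mentioned above.

```python
HOURS_PER_MONTH = 730  # average month length


def blended_monthly_cost(baseline_instances: int,
                         burst_instance_hours: float,
                         hourly_rate_usd: float,
                         discount: float) -> float:
    """Monthly cost with a discounted always-on floor plus on-demand bursts."""
    committed = baseline_instances * HOURS_PER_MONTH * hourly_rate_usd * (1 - discount)
    on_demand = burst_instance_hours * hourly_rate_usd
    return committed + on_demand


# Placeholder numbers: 1 baseline instance, 100 burst instance-hours a month.
with_plan = blended_monthly_cost(1, 100, hourly_rate_usd=0.20, discount=0.30)
without_plan = blended_monthly_cost(1, 100, hourly_rate_usd=0.20, discount=0.0)
print(f"with plan: ${with_plan:.2f}, without: ${without_plan:.2f}")
```

Only the baseline term gets the discount, which is why it pays to commit to the autoscaling minimum and not to the peak.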

The outcome

For that $9k recommendation model: we quantized to INT8, moved to serverless for the long tail, kept one autoscaled CPU real-time variant for the peak window, and ended at about $1,400/month, roughly an 85% cut, with no measurable change in recommendation quality and a p99 still inside our 200ms SLA.

Takeaways

Ask whether you need real-time at all: batch, async, and serverless inference scale to zero and cover most workloads more cheaply.
Serverless inference eliminates idle cost for spiky traffic; the price is cold-start latency.
For real-time endpoints, autoscaling is the single biggest lever, followed by Inferentia/Graviton and INT8 quantization.
Cover your always-on baseline with a SageMaker Savings Plan and right-size the rest dynamically.