Scaling inference with SageMaker async endpoints
Queue-backed inference for large payloads and bursty traffic, scaling to zero between bursts.
I had a document-understanding model that took 12 to 40 seconds per request depending on page count. Behind a real-time SageMaker endpoint it was a disaster: clients timed out at 60 seconds, traffic was bursty, and to absorb spikes I was paying for a fleet of ml.g5.xlarge instances that sat idle most of the day. Synchronous inference was the wrong tool for a slow, spiky, latency-tolerant workload.
SageMaker Asynchronous Inference fixed all three problems at once. Here's how it works and where it pays off.
How async endpoints differ from real-time
With a real-time endpoint, the client holds an HTTP connection open until the prediction returns. With an async endpoint, you upload the payload to S3, call InvokeEndpointAsync, and get back an output location immediately. SageMaker queues the request, processes it when capacity is free, and drops the result in S3, optionally notifying you via SNS.
| | Real-time | Async |
|---|---|---|
| Request pattern | Sync HTTP | Queued, S3 in/out |
| Max payload | ~6 MB | up to 1 GB |
| Max processing time | ~60 s | up to 1 hour |
| Scale to zero | No | Yes |
That last row is the headline: async endpoints can scale instances to zero when the queue is empty, so a workload that runs in bursts stops costing you anything between bursts.
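The upload-then-invoke flow described above can be sketched in a few lines of Python. This is a minimal sketch, not the exact code from my pipeline: the `async-in/` key prefix, the function name `submit_async`, and the default content type are assumptions for illustration. The `s3` and `runtime` arguments are expected to be boto3 `s3` and `sagemaker-runtime` clients.

```python
import uuid


def submit_async(s3, runtime, endpoint_name, bucket, payload, content_type="application/pdf"):
    """Upload a payload to S3 and queue it on an async endpoint.

    Returns the S3 URI where SageMaker will write the result.
    """
    # Hypothetical input prefix; any key the endpoint's role can read works.
    key = f"async-in/{uuid.uuid4()}"
    s3.put_object(Bucket=bucket, Key=key, Body=payload, ContentType=content_type)

    # The call returns as soon as the request is queued, not when it is processed.
    response = runtime.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType=content_type,
    )
    return response["OutputLocation"]
```

With real clients this would be called as `submit_async(boto3.client("s3"), boto3.client("sagemaker-runtime"), "doc-async", "my-bucket", pdf_bytes)`. Passing the clients in keeps the function easy to exercise with stubs.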
Deploying an async endpoint
The config differs from real-time in the AsyncInferenceConfig block, where you set the S3 output path, the SNS topics for success and error notifications, and the per-instance concurrency limit:
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="doc-async-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "doc-understanding-v3",
        "InstanceType": "ml.g5.xlarge",
        "InitialInstanceCount": 1,
    }],
    # This block is what makes the endpoint asynchronous.
    AsyncInferenceConfig={
        "OutputConfig": {
            # Results are written under this prefix.
            "S3OutputPath": "s3://my-bucket/async-out/",
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:us-east-1:111122223333:infer-done",
                "ErrorTopic": "arn:aws:sns:us-east-1:111122223333:infer-err",
            },
        },
        # Upper bound on requests each instance processes at once.
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 4},
    },
)

sm.create_endpoint(EndpointName="doc-async", EndpointConfigName="doc-async-config")
Scaling to zero, and back up
The autoscaling target for async endpoints is the queue backlog per instance, the ApproximateBacklogSizePerInstance metric. You register a scaling target with a minimum capacity of zero, so when the queue drains, SageMaker spins instances down.
aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/doc-async/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 0 --max-capacity 8

aws application-autoscaling put-scaling-policy \
  --policy-name backlog-scaling \
  --service-namespace sagemaker \
  --resource-id endpoint/doc-async/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 5.0,
    "CustomizedMetricSpecification": {
      "MetricName": "ApproximateBacklogSizePerInstance",
      "Namespace": "AWS/SageMaker",
      "Statistic": "Average",
      "Dimensions": [{"Name": "EndpointName", "Value": "doc-async"}]
    },
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'
The trade-off with scale-to-zero is cold start. When the first request arrives at an idle endpoint, SageMaker provisions an instance and pulls the container, which took roughly 3-5 minutes in my case for a large GPU image. Acceptable for batch-style work, not for anything user-facing in real time.
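One caveat worth checking against the AWS docs: as I understand it, a target-tracking policy on ApproximateBacklogSizePerInstance alone may leave an endpoint at zero instances until the backlog exceeds the target, and AWS documents an extra step-scaling policy driven by the HasBacklogWithoutCapacity metric for waking from zero. Below is a minimal sketch of that setup in Python. The function name, policy and alarm names, cooldown, and evaluation periods are illustrative assumptions, and the `autoscaling` and `cloudwatch` arguments are expected to be boto3 `application-autoscaling` and `cloudwatch` clients.

```python
def add_scale_from_zero_policy(autoscaling, cloudwatch, endpoint_name, variant_name="AllTraffic"):
    """Add one instance whenever requests are queued but no capacity exists."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

    # Step-scaling policy: when its alarm fires, add exactly one instance.
    policy = autoscaling.put_scaling_policy(
        PolicyName="scale-from-zero",  # hypothetical name
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="StepScaling",
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Average",
            "Cooldown": 300,  # assumed value, tune to your cold-start time
            "StepAdjustments": [
                {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1},
            ],
        },
    )
    policy_arn = policy["PolicyARN"]

    # Alarm on HasBacklogWithoutCapacity, which reports 1 when the queue
    # holds requests and the endpoint has zero instances.
    cloudwatch.put_metric_alarm(
        AlarmName=f"{endpoint_name}-backlog-without-capacity",  # hypothetical name
        Namespace="AWS/SageMaker",
        MetricName="HasBacklogWithoutCapacity",
        Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
        Statistic="Average",
        Period=60,
        EvaluationPeriods=2,  # assumed value
        DatapointsToAlarm=2,  # assumed value
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="missing",
        AlarmActions=[policy_arn],
    )
    return policy_arn
```

With real clients this would be called as `add_scale_from_zero_policy(boto3.client("application-autoscaling"), boto3.client("cloudwatch"), "doc-async")`, after the scalable target has been registered.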
Async inference turns inference cost from a function of how much capacity you provision into a function of how much work you actually submit. For bursty workloads that's a step change.
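A back-of-the-envelope calculation makes that shift concrete. The numbers below are hypothetical, not taken from my bill, and the sketch counts instance-hours only: it ignores cold-start time and the scale-in cooldown, during which instances still bill.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def weekly_savings_fraction(always_on_instances, busy_instances, busy_hours_per_week):
    """Fraction of instance-hours saved by scaling to zero outside busy hours."""
    baseline_hours = always_on_instances * HOURS_PER_WEEK
    async_hours = busy_instances * busy_hours_per_week
    return 1 - async_hours / baseline_hours


# Hypothetical: 2 instances always on, versus 2 instances for 12 hours
# on each of 5 weekdays and zero instances the rest of the week.
print(f"{weekly_savings_fraction(2, 2, 12 * 5):.0%}")  # prints 64%
```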
Where async fits, and where it doesn't
- Good fit: large payloads (video, multi-page PDFs, genomics), processing times over a few seconds, latency-tolerant pipelines, spiky or scheduled batch traffic.
- Poor fit: interactive chat, sub-second SLAs, anything where a 3-5 minute cold start is unacceptable and traffic is too sparse to keep one instance warm.
For my document pipeline, moving to async cut the monthly inference bill by about 60% because the endpoint sat at zero instances overnight and on weekends, while throughput during business hours was unchanged. The client experience also improved: instead of timing out, callers poll the S3 output or react to the SNS notification.
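For callers that poll, a loop with exponential backoff avoids hammering S3. This is a minimal sketch under a few assumptions: the helper names, delays, and timeout are illustrative, `s3` is a boto3 S3 client, and the caller's role can list the bucket (without that permission, S3 may report a missing key as an access error instead of NoSuchKey). It also only watches the output location; a failed inference would surface here as a timeout.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_result(s3, output_location, timeout_s=900.0,
                    first_delay_s=2.0, max_delay_s=30.0, sleep=time.sleep):
    """Poll the async output object until it exists, then return its bytes."""
    bucket, key = split_s3_uri(output_location)
    delay = first_delay_s
    waited = 0.0
    while True:
        try:
            return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3.exceptions.NoSuchKey:
            # Result not written yet; give up once the time budget is spent.
            if waited >= timeout_s:
                raise TimeoutError(f"no result at {output_location} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        # Double the delay each round, capped at max_delay_s.
        delay = min(delay * 2, max_delay_s)
```

The `sleep` parameter exists only so the loop can be tested without real waiting; in normal use the default is fine.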
Takeaways
- Async endpoints suit large payloads and long, latency-tolerant inference where holding an HTTP connection open is impractical.
- Scale to zero by autoscaling on ApproximateBacklogSizePerInstance with a minimum capacity of zero.
- Budget for a multi-minute cold start when the endpoint wakes from zero; keep one instance warm if that's unacceptable.
- Wire SNS success/error topics so clients react to completion instead of polling in tight loops.