Batch inference pipelines with SageMaker and S3, The Cloud Ledger

We had a real-time SageMaker endpoint scoring our churn model, and it was costing us a fortune to sit idle 23 hours a day while the only consumer was a nightly job that scored the entire customer base at once. That's the textbook case for batch inference: when you don't need an answer in milliseconds, you don't need a server running around the clock. SageMaker Batch Transform spins up, scores a dataset straight from S3, writes results back to S3, and tears down.

Here's the pipeline I built and the things about the input format that tripped me up.

Batch Transform vs a real-time endpoint

The economics are stark. A real-time endpoint on an ml.m5.xlarge bills 24/7 whether or not it's serving traffic. A Batch Transform job on the same instance type bills only for the minutes it runs. For our nightly scoring of a few million rows, that's roughly 45 minutes of compute per day instead of 24 hours: over a 95% reduction per instance for that workload.
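
To make the arithmetic concrete, here is a back-of-the-envelope sketch. It compares billed instance-minutes only, since the hourly price cancels out when both options use the same instance type. The 45-minute figure is from our workload; the function name and the instance-count parameter are mine, added because running several instances in parallel multiplies the billed minutes.

```python
MINUTES_PER_DAY = 24 * 60


def batch_savings(job_minutes: float, instance_count: int = 1) -> float:
    """Fraction of always-on cost saved by one daily batch run.

    Baseline is a single endpoint instance of the same type running 24/7.
    """
    billed_minutes = job_minutes * instance_count
    return 1 - billed_minutes / MINUTES_PER_DAY


print(f"{batch_savings(45):.1%}")     # one instance for 45 minutes
print(f"{batch_savings(45, 4):.1%}")  # four instances, 45 minutes each
```

The saving shrinks as you add instances, so it is worth checking whether extra parallelism actually shortens the run enough to pay for itself.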

The deciding question is latency tolerance, not data size. If a few hours of turnaround is acceptable, batch beats a provisioned endpoint on cost almost every time, because batch scales to zero between runs.

Lay out the data in S3 correctly

Batch Transform reads input objects from an S3 prefix and writes one output object per input. Two settings control how records are fed to the model, and they're the most common source of confusion; two more tune throughput:

SplitType: how SageMaker splits each input file into records. Use Line for JSONL or CSV with one record per line; use None to send each file whole.
BatchStrategy: MultiRecord packs as many records as fit into each request (efficient); SingleRecord sends one at a time.
MaxPayloadInMB and MaxConcurrentTransforms: trade throughput against memory on the model container.

Getting SplitType wrong is why a job either sends the entire file as one giant request (and times out) or fails to parse records. For CSV with a record per line, Line + MultiRecord is the right combination.
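
As a rough mental model of what Line + MultiRecord does, the sketch below splits a CSV payload on newlines and greedily packs whole records into requests under a size cap. This is an illustration only, not SageMaker's actual batching code; `pack_records` and its parameters are hypothetical names.

```python
def pack_records(payload: str, max_payload_bytes: int) -> list[str]:
    """Split on newlines (like SplitType=Line), then pack whole records into
    mini-batches no larger than max_payload_bytes (like MultiRecord)."""
    batches: list[str] = []
    current: list[str] = []
    current_size = 0
    for record in payload.splitlines():
        if not record:
            continue  # skip blank lines
        size = len(record.encode("utf-8")) + 1  # +1 for the newline delimiter
        if size > max_payload_bytes:
            raise ValueError("a single record exceeds the payload limit")
        if current and current_size + size > max_payload_bytes:
            # The next record would overflow the request, so flush this batch.
            batches.append("\n".join(current) + "\n")
            current, current_size = [], 0
        current.append(record)
        current_size += size
    if current:
        batches.append("\n".join(current) + "\n")
    return batches


rows = "1,0.2,34\n2,0.9,12\n3,0.5,56\n"
print(pack_records(rows, max_payload_bytes=20))
```

In this model, a split type of None corresponds to skipping the split step entirely and sending the whole object as one request, which is where the oversized-payload failures come from.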

Run the job

from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="churn-model-2026-06",
    instance_count=4,                 # parallelism across data shards
    instance_type="ml.m5.xlarge",
    strategy="MultiRecord",
    max_payload=6,                    # MB per request
    accept="text/csv",
    assemble_with="Line",
    output_path="s3://ml-batch/output/churn/2026-06-24/",
)

transformer.transform(
    data="s3://ml-batch/input/churn/2026-06-24/",
    content_type="text/csv",
    split_type="Line",
    join_source="Input",              # join predictions back to input rows
)
transformer.wait()

Two things earn their keep here. instance_count=4 spreads the input across four instances. Batch Transform parallelizes by distributing S3 objects, not by splitting a single object across instances, so many smaller input files parallelize better than one huge file. And join_source="Input" stitches each prediction back onto its source row, so the output isn't just bare scores you have to re-key.
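
If your upstream export produces one large CSV, something like the following sketch can split it into several files before upload. The helper name and the shard naming scheme are mine, and it assumes a headerless file with exactly one record per line (no quoted fields containing newlines).

```python
from pathlib import Path


def split_csv(source: Path, out_dir: Path, num_shards: int) -> list[Path]:
    """Split a headerless, one-record-per-line CSV into num_shards files,
    distributing lines round-robin so shards end up similar in size."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = [out_dir / f"part-{i:04d}.csv" for i in range(num_shards)]
    handles = [p.open("w", encoding="utf-8", newline="") for p in paths]
    try:
        with source.open("r", encoding="utf-8", newline="") as src:
            for line_number, line in enumerate(src):
                handles[line_number % num_shards].write(line)
    finally:
        for handle in handles:
            handle.close()
    return paths
```

Round-robin distribution changes the row order across shards, which is harmless here because join_source="Input" keeps each prediction attached to its own input row.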

Orchestrate it on a schedule

For a nightly run, I trigger the job from EventBridge Scheduler into a small Lambda (or a Step Functions state machine for anything multi-stage). Step Functions is worth it once you have preprocessing → transform → postprocessing, because it gives you retries and visibility per step.

aws scheduler create-schedule \
  --name nightly-churn-scoring \
  --schedule-expression "cron(0 2 * * ? *)" \
  --flexible-time-window '{"Mode":"OFF"}' \
  --target '{
    "Arn":"arn:aws:lambda:us-east-1:123456789012:function:start-churn-batch",
    "RoleArn":"arn:aws:iam::123456789012:role/scheduler-invoke-lambda"
  }'
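
The Lambda behind that schedule can be a thin wrapper around boto3's create_transform_job. A minimal sketch follows, assuming the same model, bucket layout, and settings as the SDK example above. The function names, the job-name format, the use of the UTC date at invocation time as the run date, and the timeout and retry values are placeholders of mine.

```python
from datetime import datetime, timezone


def build_transform_request(run_date: str) -> dict:
    """Build CreateTransformJob parameters for one run (run_date: YYYY-MM-DD)."""
    return {
        "TransformJobName": f"churn-batch-{run_date}",
        "ModelName": "churn-model-2026-06",
        "BatchStrategy": "MultiRecord",
        "MaxPayloadInMB": 6,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": f"s3://ml-batch/input/churn/{run_date}/",
                }
            },
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        "TransformOutput": {
            "S3OutputPath": f"s3://ml-batch/output/churn/{run_date}/",
            "Accept": "text/csv",
            "AssembleWith": "Line",
        },
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 4},
        "DataProcessing": {"JoinSource": "Input"},
        # Assumed values: per-request timeout and retry count for the model container.
        "ModelClientConfig": {
            "InvocationsTimeoutInSeconds": 600,
            "InvocationsMaxRetries": 1,
        },
    }


def handler(event, context):
    import boto3  # imported here so the builder above can be tested without AWS

    run_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    response = boto3.client("sagemaker").create_transform_job(
        **build_transform_request(run_date)
    )
    return {"TransformJobArn": response["TransformJobArn"]}
```

Transform job names must be unique within an account and region, so a same-day rerun with this naming scheme would be rejected; appending a timestamp or attempt number is one way around that.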

Watch cost and failure modes

A few operational notes from running this in production:

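Size instance_count to your shard count, not higher; extra workers with no data to process just bill for nothing.
Set timeouts and CloudWatch alarms; a malformed input file can stall a worker silently. Batch Transform's own timeout is per invocation (InvocationsTimeoutInSeconds in ModelClientConfig), so enforce any overall deadline in the orchestrator.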
Land outputs under a date-partitioned prefix so reruns don't clobber prior results and downstream jobs can find them.
If inputs are large but read infrequently, keep them in S3 Standard during the run window and let lifecycle rules transition them to a cheaper storage class afterward.
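
Two of these notes are easy to encode as small helpers. The sketch below caps the instance count at the number of input objects and builds a date-partitioned output prefix; both function names are hypothetical, and the list of keys is assumed to come from whatever S3 listing you already do.

```python
from datetime import date


def choose_instance_count(input_keys: list[str], max_instances: int) -> int:
    """Never request more instances than there are input objects to distribute."""
    data_objects = [k for k in input_keys if not k.endswith("/")]  # skip folder markers
    if not data_objects:
        raise ValueError("no input objects found under the prefix")
    return min(len(data_objects), max_instances)


def output_prefix(base: str, run_date: date) -> str:
    """Build a date-partitioned prefix such as s3://bucket/path/2026-06-24/."""
    return f"{base.rstrip('/')}/{run_date.isoformat()}/"


keys = ["input/churn/2026-06-24/part-0000.csv", "input/churn/2026-06-24/part-0001.csv"]
print(choose_instance_count(keys, max_instances=4))
print(output_prefix("s3://ml-batch/output/churn/", date(2026, 6, 24)))
```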

Takeaways

Use Batch Transform whenever latency tolerance allows; it scales to zero and routinely beats an idle real-time endpoint by 90%+.
Get SplitType and BatchStrategy right; Line + MultiRecord is the norm for per-line records.
Use join_source="Input" to keep predictions tied to source rows, and many small input files to parallelize across workers.
Orchestrate with EventBridge Scheduler + Lambda (or Step Functions) and write to date-partitioned S3 prefixes.