Putting an ML model into production: a checklist
Everything between “it works in the notebook” and a model serving real traffic.
The gap between "the notebook gets 0.94 AUC" and "the model serves traffic reliably" is where most ML projects quietly die. I have shipped models that were technically excellent and operationally useless: no monitoring, no rollback, and no idea what the training data even was by the time someone asked six months later.
This is the checklist I now run before any model gets real traffic. It is deliberately boring. Boring is what survives a Friday-afternoon incident.
Reproducibility before anything else
If you cannot rebuild the exact artifact that is in production, you cannot debug it. Pin everything:
- Training data snapshotted and versioned (an S3 path with a fixed version, or a feature-store snapshot), not "the latest table."
- Code at a git SHA, dependencies locked, and the random seed recorded.
- The trained artifact registered with metadata linking it back to that data and code. The SageMaker Model Registry or a plain manifest both work.
If you can't answer "what data and code produced this exact model?" in under a minute, you are not ready to deploy it.
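As a rough illustration of the "plain manifest" option, here is a minimal sketch that ties an artifact to its data snapshot, commit, lock file, and seed. The class name, field names, and example values are all hypothetical; adapt them to whatever your registry actually stores.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ModelManifest:
    """Hypothetical manifest schema; every field name here is an assumption."""
    model_name: str
    artifact_sha256: str   # hash of the serialized model bytes
    data_snapshot: str     # versioned data location or feature-store snapshot id
    git_sha: str           # commit the training code ran from
    lockfile_sha256: str   # hash of the dependency lock file contents
    random_seed: int


def sha256_bytes(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def build_manifest(model_name: str, artifact_bytes: bytes, data_snapshot: str,
                   git_sha: str, lockfile_bytes: bytes, random_seed: int) -> ModelManifest:
    """Collect everything needed to answer 'what produced this model?'."""
    return ModelManifest(
        model_name=model_name,
        artifact_sha256=sha256_bytes(artifact_bytes),
        data_snapshot=data_snapshot,
        git_sha=git_sha,
        lockfile_sha256=sha256_bytes(lockfile_bytes),
        random_seed=random_seed,
    )


def to_json(manifest: ModelManifest) -> str:
    # sort_keys keeps the output stable so manifests diff cleanly
    return json.dumps(asdict(manifest), indent=2, sort_keys=True)


# Example values are placeholders, not real paths or commits.
manifest = build_manifest(
    model_name="fraud-scorer",
    artifact_bytes=b"model-bytes",
    data_snapshot="s3://example-bucket/train.parquet#version=EXAMPLE",
    git_sha="0123abc",
    lockfile_bytes=b"numpy==1.26.4\n",
    random_seed=42,
)
print(to_json(manifest))
```

Storing hashes rather than file names means a silently overwritten artifact or lock file shows up as a mismatch instead of going unnoticed.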
Serving shape: real-time, batch, or async
Match the serving pattern to the latency and volume the use case actually needs, rather than reaching for the fanciest option:
| Pattern | Latency | Use when |
|---|---|---|
| Real-time endpoint | milliseconds | user-facing, synchronous |
| Serverless inference | milliseconds, plus cold starts | spiky/low traffic, cost-sensitive |
| Async inference | seconds to minutes | large payloads, no live user waiting |
| Batch transform | offline | scoring a whole dataset on a schedule |
I have watched teams pay for an always-on GPU endpoint to serve a few hundred predictions a day. Serverless or batch would have cost a fraction.
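The table can be read as a small decision procedure. The sketch below is one possible encoding of it, not a rule from any AWS documentation; the function name, its parameters, and the order of the checks are assumptions made for illustration.

```python
from enum import Enum


class ServingPattern(Enum):
    REAL_TIME = "real-time endpoint"
    SERVERLESS = "serverless inference"
    ASYNC = "async inference"
    BATCH = "batch transform"


def choose_serving_pattern(*, user_waiting: bool, scheduled_full_dataset: bool,
                           spiky_or_low_traffic: bool,
                           tolerates_cold_starts: bool) -> ServingPattern:
    """Hypothetical heuristic mirroring the table's 'Use when' column."""
    if not user_waiting:
        # Nobody is blocked on the answer: go offline or asynchronous.
        if scheduled_full_dataset:
            return ServingPattern.BATCH
        return ServingPattern.ASYNC
    if spiky_or_low_traffic and tolerates_cold_starts:
        return ServingPattern.SERVERLESS
    return ServingPattern.REAL_TIME


# A few hundred user-facing predictions a day, cold starts acceptable:
print(choose_serving_pattern(
    user_waiting=True,
    scheduled_full_dataset=False,
    spiky_or_low_traffic=True,
    tolerates_cold_starts=True,
))
```

Writing the choice down as explicit questions makes it harder to default to an always-on endpoint without anyone asking whether a user is actually waiting.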
Ship behind a gate, not a switch
Never cut 100% of traffic over to a new model at once. Deploy it as a shadow first (it scores live traffic but its output is logged, not served), then canary a small percentage, then ramp. SageMaker endpoints support multiple production variants with weighted traffic:
```python
import boto3

sm = boto3.client("sagemaker")

# Send 10% of live traffic to the canary; the stable variant keeps the rest.
sm.update_endpoint_weights_and_capacities(
    EndpointName="fraud-scorer",
    DesiredWeightsAndCapacities=[
        {"VariantName": "model-v3", "DesiredWeight": 90},
        {"VariantName": "model-v4-canary", "DesiredWeight": 10},
    ],
)
# Watch error rate + business metric on v4-canary, then ramp or roll back.
```
Keep the previous variant warm so rollback is a weight change, not a redeploy.
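The "ramp or roll back" decision can be kept as a small pure function that produces the payload for the weight-update call above. This is a sketch under assumptions: the ramp steps are hypothetical numbers, and `healthy` stands in for whatever error-rate and business-metric check you actually run.

```python
from typing import Dict, List, Sequence, Union

# Hypothetical ramp schedule, in percent of traffic sent to the canary.
RAMP_STEPS: Sequence[int] = (10, 25, 50, 100)


def next_canary_weight(current: int, healthy: bool,
                       steps: Sequence[int] = RAMP_STEPS) -> int:
    """Advance one ramp step when healthy; drop to 0 (rollback) otherwise."""
    if not healthy:
        return 0
    for step in steps:
        if step > current:
            return step
    return steps[-1]  # already fully ramped


def desired_weights(stable_variant: str, canary_variant: str,
                    canary_weight: int) -> List[Dict[str, Union[str, int]]]:
    """Build a list shaped like DesiredWeightsAndCapacities."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return [
        {"VariantName": stable_variant, "DesiredWeight": 100 - canary_weight},
        {"VariantName": canary_variant, "DesiredWeight": canary_weight},
    ]


weight = next_canary_weight(current=10, healthy=True)
print(desired_weights("model-v3", "model-v4-canary", weight))
```

Because both variants stay in the payload at every step, a rollback only sets the canary weight to 0 and leaves the stable variant untouched.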
Monitor the inputs, not just the outputs
Models do not throw exceptions when they go wrong; they quietly get less accurate as the world drifts away from the training distribution. Capture inference inputs and outputs and watch for:
- Data drift: feature distributions shifting from the training baseline.
- Prediction drift: the output distribution moving (e.g., fraud rate suddenly triples).
- Operational health: p99 latency, error rate, throttles.
- Ground-truth quality: the real metric, computed on a delay once labels arrive.
SageMaker Model Monitor can baseline your training data and alert on drift, but a CloudWatch dashboard with input statistics gets you most of the value cheaply.
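One inexpensive input statistic is the Population Stability Index (PSI), which compares how a feature's live values fall into bins derived from the training baseline. The sketch below is a standard-library illustration of that idea, not what Model Monitor computes internally; the function names, the bin count, and the small floor `eps` are assumptions.

```python
import math
from bisect import bisect_right
from typing import List, Sequence


def quantile_edges(baseline: Sequence[float], bins: int = 10) -> List[float]:
    """Interior bin edges taken at evenly spaced ranks of the sorted baseline."""
    ordered = sorted(baseline)
    return [ordered[(len(ordered) * i) // bins] for i in range(1, bins)]


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Share of values per bin, floored at eps so the log below stays finite."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(baseline: Sequence[float], live: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `live` against `baseline` for one feature."""
    if not baseline or not live:
        raise ValueError("baseline and live must both be non-empty")
    edges = quantile_edges(baseline, bins)
    expected = bin_fractions(baseline, edges)
    actual = bin_fractions(live, edges)
    # Each term is non-negative, so the sum grows as the distributions diverge.
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


baseline = [i / 100 for i in range(1000)]   # stand-in for a training feature
shifted = [x + 5 for x in baseline]         # the same feature after a large shift
print(psi(baseline, baseline), psi(baseline, shifted))
```

A commonly quoted rule of thumb treats PSI above roughly 0.2 as a shift worth investigating, but any alert threshold should be treated as an assumption to tune per feature.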
Plan the unglamorous parts
Decide the retraining trigger (schedule vs. drift-based) before launch. Write down who gets paged and what the rollback command is. Confirm the endpoint's IAM role has least privilege and that PII in logged inputs is handled. These are the items that turn a 3 a.m. page into a five-minute fix instead of an outage.
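Writing the retraining trigger down as code before launch forces the schedule-versus-drift decision to be explicit. The sketch below combines both triggers; the policy fields, the threshold values, and the idea of a single numeric drift score are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class RetrainPolicy:
    """Hypothetical policy; both limits are values you choose before launch."""
    max_model_age: timedelta   # schedule-based trigger
    drift_threshold: float     # drift-based trigger, in the drift score's units


def retrain_reason(policy: RetrainPolicy, trained_at: datetime, now: datetime,
                   drift_score: float) -> Optional[str]:
    """Return why a retrain is due ('drift' or 'schedule'), or None if it is not."""
    if drift_score >= policy.drift_threshold:
        return "drift"
    if now - trained_at >= policy.max_model_age:
        return "schedule"
    return None


policy = RetrainPolicy(max_model_age=timedelta(days=30), drift_threshold=0.2)
print(retrain_reason(
    policy,
    trained_at=datetime(2024, 1, 1),
    now=datetime(2024, 1, 15),
    drift_score=0.05,
))
```

Returning the reason, rather than a bare boolean, gives whoever is on call an immediate answer to why a retraining job started.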
Takeaways
- Lock data, code, and seed so any production model is reproducible in minutes.
- Match serving pattern (real-time, serverless, async, batch) to actual latency and volume needs.
- Roll out via shadow then canary with weighted variants; keep the prior version warm for instant rollback.
- Monitor input/prediction drift and operational health, and define retraining and on-call before launch.