Monitoring ML models in production with SageMaker, The Cloud Ledger

The model that passed every offline metric in my notebook started quietly rotting in production about six weeks after launch. No errors, no alarms, no 500s. The endpoint was healthy by every infrastructure measure I had. What was actually happening was that the input distribution had drifted, and the model's predictions had degraded into something close to noise while everything looked green.

That experience taught me that monitoring an ML model means watching two completely different things: the system serving it, and the statistics of what it's being asked and what it answers. SageMaker gives you tools for both, but they sit in different places.

The two layers you have to watch

Operational health: latency, error rate, invocations, instance saturation. These are CloudWatch metrics emitted by the endpoint, and they tell you whether the service is up. This is the easy layer and the one teams usually have covered.
Model quality: data drift, prediction drift, feature attribution shift, and (when ground truth arrives) actual accuracy. This is what SageMaker Model Monitor exists for, and it's the layer that catches the silent failures.

Operational metrics first

SageMaker real-time endpoints publish invocation metrics to the AWS/SageMaker namespace. The ones I always alarm on are ModelLatency, OverheadLatency, and Invocation5XXErrors, plus the per-variant CPUUtilization / GPUUtilization, which are instance metrics and live in the separate /aws/sagemaker/Endpoints namespace. A simple p99 latency alarm:

aws cloudwatch put-metric-alarm \
  --alarm-name fraud-endpoint-p99-latency \
  --namespace AWS/SageMaker \
  --metric-name ModelLatency \
  --dimensions Name=EndpointName,Value=fraud-scoring Name=VariantName,Value=AllTraffic \
  --extended-statistic p99 \
  --period 60 --evaluation-periods 5 \
  --threshold 200000 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111122223333:ml-oncall

Note that ModelLatency is reported in microseconds, so 200000 is 200 ms. Getting that unit wrong is a classic way to build an alarm that never fires.
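
To keep the unit conversion out of my head, I'd wrap it in a helper. This is a minimal Python sketch, not part of any SDK: the function names latency_threshold_us and p99_latency_alarm_params are my own, and the dictionary keys are meant to match what boto3's put_metric_alarm accepts.

```python
def latency_threshold_us(threshold_ms: float) -> int:
    """Convert a latency budget in milliseconds to microseconds,
    the unit ModelLatency is reported in."""
    if threshold_ms <= 0:
        raise ValueError("threshold_ms must be positive")
    return int(round(threshold_ms * 1000))


def p99_latency_alarm_params(endpoint: str, variant: str,
                             threshold_ms: float, sns_topic_arn: str) -> dict:
    """Build keyword arguments for a p99 ModelLatency alarm.

    Hypothetical helper: it only assembles the request, it does not call AWS.
    """
    return {
        "AlarmName": f"{endpoint}-p99-latency",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint},
            {"Name": "VariantName", "Value": variant},
        ],
        # Percentiles go in ExtendedStatistic, not Statistic.
        "ExtendedStatistic": "p99",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": float(latency_threshold_us(threshold_ms)),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = p99_latency_alarm_params(
    "fraud-scoring", "AllTraffic", 200,
    "arn:aws:sns:us-east-1:111122223333:ml-oncall",
)
```

With credentials configured, the dictionary could then be passed along as boto3.client("cloudwatch").put_metric_alarm(**params). Stating the budget in milliseconds at the call site is the point: the microsecond conversion happens in exactly one place.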

Catching drift with Model Monitor

Model Monitor works by first baselining: you point it at your training (or a known-good) dataset and it computes statistics and constraints such as column means, ranges, distribution summaries, and null rates. Then you schedule monitoring jobs that compare captured production traffic against that baseline and flag violations.
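
The baseline-then-compare idea is easier to see in miniature. The sketch below is plain Python and is not how Model Monitor is implemented; build_baseline, check_batch, and the null_rate_slack tolerance are names and choices I made up to illustrate the two phases.

```python
from statistics import mean


def build_baseline(rows):
    """Compute per-feature statistics from known-good rows.

    rows is a list of dicts mapping feature name -> number or None.
    """
    baseline = {}
    for name in rows[0].keys():
        values = [r[name] for r in rows if r[name] is not None]
        baseline[name] = {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            "null_rate": 1 - len(values) / len(rows),
        }
    return baseline


def check_batch(baseline, rows, null_rate_slack=0.05):
    """Compare a batch of production rows against the baseline.

    Returns a list of (feature, violation_type) tuples.
    """
    violations = []
    for name, stats in baseline.items():
        present = [r.get(name) for r in rows if r.get(name) is not None]
        null_rate = 1 - len(present) / len(rows)
        # Flag features that go missing more often than they did at baseline.
        if null_rate > stats["null_rate"] + null_rate_slack:
            violations.append((name, "null_rate"))
        # Flag values outside the range seen in the baseline data.
        if any(v < stats["min"] or v > stats["max"] for v in present):
            violations.append((name, "out_of_range"))
    return violations


training = [
    {"amount": 10.0, "age_days": 100.0},
    {"amount": 50.0, "age_days": 400.0},
    {"amount": 30.0, "age_days": 250.0},
]
live = [
    {"amount": 20.0, "age_days": None},
    {"amount": 900.0, "age_days": 120.0},
]
baseline = build_baseline(training)
violations = check_batch(baseline, live)
```

Here the live batch trips two checks: amount contains a value far outside the training range, and age_days is suddenly missing half the time. The real service computes richer distribution summaries, but the shape of the workflow is the same.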

The prerequisite is data capture, which you enable on the endpoint so requests and responses land in S3:

from sagemaker.model_monitor import DataCaptureConfig

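capture = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,  # capture every request while baselining
    destination_s3_uri="s3://ml-monitoring/fraud-scoring/capture",
)
# predictor is the sagemaker Predictor already attached to the live endpoint
predictor.update_data_capture_config(data_capture_config=capture)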

Sample at 100% while you're establishing a baseline, then dial it down (10-20% is usually plenty) to control storage and processing cost once traffic is high. From there a scheduled DefaultModelMonitor job runs hourly or daily, writes a violations report to S3, and emits a CloudWatch metric you can alarm on.
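
It helps to look at what capture actually writes before trusting a monitor built on top of it. As far as I know, the files are JSON Lines with one event per line; the field names below (captureData, endpointInput, endpointOutput, eventMetadata) reflect my understanding of that layout and are worth checking against a real file from your own bucket. parse_capture_lines is a hypothetical helper, and the sample record is made up.

```python
import json


def parse_capture_lines(lines):
    """Extract request/response payloads from data capture JSON Lines.

    Assumes each non-empty line is one JSON event with the field
    layout described above; missing fields come back as None.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        capture = event.get("captureData", {})
        records.append({
            "event_id": event.get("eventMetadata", {}).get("eventId"),
            "input": capture.get("endpointInput", {}).get("data"),
            "output": capture.get("endpointOutput", {}).get("data"),
        })
    return records


# Made-up sample event for illustration only.
sample = json.dumps({
    "captureData": {
        "endpointInput": {"mode": "INPUT", "data": "12.5,3,0", "encoding": "CSV"},
        "endpointOutput": {"mode": "OUTPUT", "data": "0.91", "encoding": "CSV"},
    },
    "eventMetadata": {"eventId": "example-1", "inferenceTime": "2024-01-01T00:00:00Z"},
})
records = parse_capture_lines([sample, ""])
```

Reading a handful of captured events this way is a cheap sanity check that the payloads are what you think they are before you schedule anything against them.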

The four monitor types, and what each catches

Monitor	Catches	Needs ground truth?
Data Quality	Input drift, missing/out-of-range features, schema breaks	No
Model Quality	Accuracy/precision/recall degradation	Yes (labels)
Bias Drift	Fairness metrics shifting across groups	Sometimes
Feature Attribution	Which features drive predictions changing over time	No

The honest caveat here: Model Quality monitoring requires ground-truth labels, and labels almost always arrive late. For fraud you might not know the truth for 30-90 days. So in practice Data Quality and Feature Attribution drift are your early-warning system, and Model Quality confirms the damage after labels catch up.
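
The late-label problem shows up concretely when you try to score predictions. A rough sketch, assuming predictions and labels are both keyed by some record ID (the function name, the dict shapes, and the coverage field are all mine, not anything Model Monitor exposes):

```python
def precision_recall(predictions, labels):
    """Score only the predictions whose ground truth has arrived.

    predictions: dict of record id -> predicted class (0 or 1)
    labels:      dict of record id -> true class (0 or 1), possibly partial
    """
    tp = fp = fn = matched = 0
    for record_id, predicted in predictions.items():
        if record_id not in labels:
            continue  # label has not arrived yet, cannot score this one
        matched += 1
        actual = labels[record_id]
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
    return {
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
        # Share of predictions that could be scored at all.
        "coverage": matched / len(predictions) if predictions else 0.0,
    }


predictions = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 1}
labels = {"a": 1, "b": 0, "c": 1}  # d and e are still unlabeled
metrics = precision_recall(predictions, labels)
```

The coverage number is the part I'd watch: a precision figure computed on the 60% of records that happen to have labels can look very different from the eventual truth, which is why the label-free monitors carry the early-warning load.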

Drift detection tells you the world changed; it does not tell you the model got worse. Treat a drift alarm as "investigate," not "roll back." Sometimes the new distribution is fine and the model handles it.
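
To make "the world changed" concrete, here is one generic drift statistic, the Population Stability Index (PSI). I'm using it purely as an illustration: it is not the check Model Monitor runs, and the bin count and the epsilon floor are arbitrary choices of mine.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a live sample.

    Bins are equal-width over the baseline's range; live values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # constant baseline: everything lands in one bin

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


base = [i / 100 for i in range(100)]   # roughly uniform on [0, 1)
shifted = [v + 0.5 for v in base]      # same shape, moved right

same_score = psi(base, base)
shift_score = psi(base, shifted)
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, though that cutoff is folklore rather than a guarantee. Either way, a large value only says the inputs moved. Whether the model's answers got worse is a separate question that needs labels or a manual look.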

Closing the loop

A monitor that files reports nobody reads is theater. I wire the Model Monitor CloudWatch metric to the same SNS topic as the latency alarms, and a drift violation opens a ticket with the specific features that breached. The retraining trigger is deliberately a human decision gate, not fully automatic. Automatic retraining on a drift signal is how you train on a bad week of data and make things worse.
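
The ticket step can be a very small piece of glue. The sketch below assumes the violations report is a JSON document with a top-level "violations" list whose entries carry feature_name and constraint_check_type, which is my understanding of the report layout and should be verified against a real one. summarize_violations and the ticket fields are hypothetical, and the sample report is invented.

```python
import json


def summarize_violations(report_json: str) -> dict:
    """Turn a violations report into a ticket payload for a human.

    Never triggers retraining itself: auto_retrain is always False,
    so the decision stays with whoever picks up the ticket.
    """
    violations = json.loads(report_json).get("violations", [])
    features = sorted({v["feature_name"] for v in violations})
    return {
        "action": "investigate" if violations else "none",
        "auto_retrain": False,
        "title": (
            f"Drift: {len(features)} feature(s) breached baseline"
            if violations else "No violations"
        ),
        "features": features,
        "details": [
            f"{v['feature_name']}: {v.get('constraint_check_type', 'unknown')}"
            for v in violations
        ],
    }


# Invented sample report for illustration only.
report = json.dumps({
    "violations": [
        {"feature_name": "amount", "constraint_check_type": "baseline_drift_check"},
        {"feature_name": "age_days", "constraint_check_type": "completeness_check"},
    ]
})
ticket = summarize_violations(report)
```

Hard-coding auto_retrain to False looks redundant, but it puts the policy in the payload where the next person to touch the pipeline will see it.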

Takeaways

Monitor two layers: operational health via CloudWatch and model quality via SageMaker Model Monitor. Green infrastructure hides silent model rot.
Enable data capture and baseline against known-good data before you can detect any drift.
Model Quality monitoring needs labels that usually arrive late, so lean on Data Quality and feature-attribution drift for early warning.
Route drift to investigation, not automatic rollback or retraining; keep a human gate in the loop.