"We'll save a fortune by hosting the model ourselves" is a sentence I've heard in more than one architecture meeting, and it's true exactly often enough to be dangerous. The intuition makes sense: pay-per-token APIs feel expensive at scale, GPUs are a fixed cost, so surely owning the hardware wins eventually. Sometimes it does. But the breakeven point is much further out than most people guess, and the math is dominated by one number almost everyone ignores: utilization.

I've run both. Here's the actual cost breakdown for serving an LLM on Amazon Bedrock versus self-hosting an open-weight model on EC2 GPU instances.

How the two pricing models differ

Bedrock (and managed APIs generally) bill per token, split into input and output, with output typically priced higher. For Anthropic's Claude models the published rates are, per million tokens: Claude Sonnet 4.6 at $3 input / $15 output, Claude Haiku 4.5 at $1 / $5, and Claude Opus 4.8 at $5 / $25. There is no idle cost. Send zero requests, pay zero.

Self-hosting flips that. You rent a GPU instance by the hour whether it's serving traffic or sitting idle, and you pay for it 24/7 if you want the endpoint always available. The per-token marginal cost is effectively zero; the entire cost is the instance.

The number that decides everything: utilization

A self-hosted GPU only beats a managed API when you keep it busy. An idle GPU is the most expensive way to serve a token that exists.

Take a single g5.12xlarge (4x A10G GPUs), enough to serve a mid-size open model with reasonable throughput. On-demand it runs roughly $5.67/hour, about $4,080/month if left running continuously. To make that pay off against an API, you have to push enough tokens through it to beat what those tokens would have cost on Bedrock.

ScenarioSelf-host cost/moEffective behavior
GPU at 5% utilization~$4,080Paying full price to serve almost nothing
GPU at 70%+ utilization, steady traffic~$4,080 flatCheap per token; this is where self-hosting wins
Spiky / low total volume~$4,080 + opsAPI would have cost a fraction

The key insight: the API's cost scales with usage, the GPU's cost scales with wall-clock time. If your traffic is low or bursty, the API is almost always cheaper because you're not paying for idle silicon. Self-hosting only wins at high, sustained utilization where you'd otherwise be sending a very large, steady stream of tokens to the API.

The costs that don't show up on the EC2 bill

The instance price is the visible cost. The hidden ones sink most self-hosting business cases:

  • Engineering time. Someone has to set up the inference server (vLLM, TGI), tune batching, handle model loading, build autoscaling, and stay on call. That's a meaningful slice of an MLOps salary, recurring forever.
  • Redundancy. One instance is a single point of failure. Real availability means at least two, doubling the floor.
  • Underutilization from autoscaling lag. GPUs take minutes to spin up and load weights, so you over-provision to absorb spikes, which lowers average utilization, which is the exact number that has to stay high.
  • Model quality. The open-weight model you can afford to self-host may simply be less capable than the frontier model behind the API, and worse answers have their own cost.

Estimating the crossover in code

Before committing either way, I estimate the monthly cost both ways from real traffic numbers. A rough model:

def monthly_cost(reqs_per_day, in_tok, out_tok):
    days = 30

    # Bedrock-style per-token (Claude Sonnet 4.6 rates, $/token)
    in_rate, out_rate = 3.0 / 1e6, 15.0 / 1e6
    api = reqs_per_day * days * (in_tok * in_rate + out_tok * out_rate)

    # Self-host: one g5.12xlarge on-demand, always on, x2 for HA
    gpu_hourly = 5.67
    self_host = gpu_hourly * 24 * days * 2

    return round(api, 2), round(self_host, 2)

# 50k requests/day, 1k in / 500 out tokens each
print(monthly_cost(50_000, 1_000, 500))
# (~$11,250 API, ~$8,164 self-host x2)  -> self-host edges ahead, IF you keep it busy

Run that with your real numbers. At low volume the API number is tiny and self-hosting is absurd. As request volume climbs into the millions per day with steady load, the flat GPU cost wins, exactly as the utilization argument predicts. The crossover is a business-specific point, not a universal rule.

A pragmatic default

Start on Bedrock. You get frontier model quality, zero idle cost, no ops burden, and a per-token bill that tells you precisely how much inference you're actually consuming. Only when that bill is large, your traffic is steady, and an open model meets your quality bar should you model the self-hosted alternative, including the hidden costs. Most teams never cross that line.

Takeaways

  • Managed APIs bill per token with no idle cost; self-hosting bills per wall-clock hour with near-zero marginal token cost.
  • Self-hosting only wins at high, sustained GPU utilization; an idle or bursty GPU is the most expensive option.
  • Factor in the invisible costs: engineering time, redundancy, autoscaling lag, and potential model-quality gaps.
  • Default to Bedrock; model the self-hosted case only when volume is large and steady, and validate the crossover with your own traffic numbers.