The first time I shipped a Bedrock-backed feature, I picked the biggest model on the menu because the demo looked great. The bill three weeks later did not look great. Roughly 80% of my spend was coming from a summarization step that a much cheaper model handled just as well.

Model selection on Amazon Bedrock is mostly a cost-engineering problem dressed up as a quality problem. Once you measure both, the right answer is usually "use the smallest model that passes your eval."

Price is per-token, and the tiers are wide

Bedrock charges separately for input and output tokens, and the spread between model families is enormous. Output tokens are typically 4-5x the price of input tokens, so chatty responses hurt twice. A rough on-demand picture for Anthropic models on Bedrock:

ModelInput ($/1M tok)Output ($/1M tok)Good for
Claude Haiku~$0.25~$1.25Classification, extraction, routing
Claude Sonnet~$3.00~$15.00RAG, drafting, most workloads
Claude Opus~$15.00~$75.00Hard reasoning, agentic planning

That is a 60x range on output tokens between the cheapest and most expensive tier. If you default to the top tier for a task a Haiku-class model can do, you are paying that multiple for nothing.

Route by task, not by reflex

I split every feature into discrete steps and assign each step the cheapest model that clears its eval. A common pattern: a small model classifies or routes the request, and only the genuinely hard cases escalate to a larger model.

import boto3, json

brt = boto3.client("bedrock-runtime", region_name="us-east-1")

def invoke(model_id, prompt, max_tokens=512):
    resp = brt.invoke_model(
        modelId=model_id,
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return json.loads(resp["body"].read())

# Cheap model triages; only escalate when it flags low confidence
triage = invoke("anthropic.claude-3-5-haiku-20241022-v1:0", classify_prompt)
if needs_deep_reasoning(triage):
    answer = invoke("anthropic.claude-3-5-sonnet-20241022-v2:0", full_prompt)

Squeeze tokens before you switch models

Before downgrading a model, I check whether the prompt itself is bloated. The cheapest token is the one you never send. Tactics that have paid off for me:

  • Trim system prompts and few-shot examples once the model is reliably formatting output. I have cut 1,500-token system prompts to 300 with no quality loss.
  • Cap max_tokens aggressively. A summarizer that can run away to 2,000 tokens but only needs 200 is a silent cost leak.
  • Use prompt caching for stable context (long instructions, retrieved documents reused across turns). Cached reads are billed at a steep discount versus fresh input tokens.
  • Stop stuffing entire documents into context when retrieval can hand you the three relevant chunks.

Measure dollars-per-task, not dollars-per-token

Per-token price is misleading on its own. A cheaper model that needs two retries, a verification pass, and a longer prompt can cost more per completed task than a pricier model that nails it in one shot. I track cost per successful request and tag every Bedrock invocation so I can attribute spend by feature.

The right model is the cheapest one that passes your eval at your required quality bar. Everything above that bar is money you are setting on fire.

Build a small eval set of 50-100 real examples with graded outputs, then run each candidate model through it and chart quality against cost. The decision usually makes itself: most steps land on a small model, and you reserve the expensive tier for the genuinely hard 10%.

Takeaways

  • Output tokens dominate cost and span a 60x range across tiers; default to the smallest model that passes eval, not the largest.
  • Route by task: cheap models triage and handle the common case, larger models only handle escalations.
  • Trim prompts, cap max_tokens, and use prompt caching before you reach for a bigger model.
  • Optimize for cost-per-successful-task, and tag invocations so you can attribute spend per feature.