My first RAG demo took an afternoon: embed some docs, stuff the top matches into a prompt, ship it. My first production RAG system took two months, and most of that time went to the things the demo glossed over: keeping the index fresh, controlling cost per query, stopping the model from confidently inventing answers, and making retrieval quality something I could measure rather than judge by vibes.

This is the end-to-end shape of a production retrieval-augmented generation system on AWS, with the decisions that actually mattered.

The pipeline at a glance

A production RAG system is really two pipelines that meet at a vector store:

  1. Ingestion (offline): source documents → chunk → embed → upsert into a vector index, re-run on changes.
  2. Query (online): user question → embed → retrieve top-k → rerank → assemble prompt → generate → cite.

On AWS the building blocks I settled on were S3 for raw docs, a Lambda/Step Functions ingestion flow, Amazon Bedrock for both embeddings and generation, and a vector store. You can shortcut much of this with a Bedrock Knowledge Base, but building it explicitly gives you control over chunking and reranking that the managed path hides.
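To make the S3-to-Lambda hand-off concrete, here is a minimal sketch of a Lambda handler that turns an S3 event notification into a list of documents to re-ingest or remove. The field names follow the standard S3 notification format; `start_ingestion` is a hypothetical hook, not part of the original system, and in practice it might start a Step Functions execution.

```python
from urllib.parse import unquote_plus


def changed_objects(event):
    """Extract (action, bucket, key, version_id) tuples from an S3 event."""
    changes = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if s3 is None:
            continue  # not an S3 notification record
        # Event names look like "ObjectCreated:Put" or "ObjectRemoved:Delete".
        removed = record["eventName"].startswith("ObjectRemoved")
        changes.append((
            "delete" if removed else "upsert",
            s3["bucket"]["name"],
            unquote_plus(s3["object"]["key"]),  # keys arrive URL-encoded
            s3["object"].get("versionId"),      # only set on versioned buckets
        ))
    return changes


def start_ingestion(action, bucket, key, version_id):
    # Hypothetical hook: replace with the call that kicks off chunking,
    # embedding, and the index update for this object.
    print(action, bucket, key, version_id)


def handler(event, context):
    """Lambda entry point: fan each changed object out to ingestion."""
    for change in changed_objects(event):
        start_ingestion(*change)
```

Keeping the event parsing in its own function makes it easy to unit-test with a hand-written event before wiring up any AWS resources.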

Choosing a vector store

This decision drives cost and latency more than the model choice does:

Option                         | Best when                                  | Watch out for
-------------------------------|--------------------------------------------|----------------------------------
OpenSearch Serverless (vector) | Hybrid lexical+vector, large corpora       | Minimum OCU cost floor
Aurora PostgreSQL + pgvector   | You already run Postgres, want SQL filters | Index tuning at scale
Bedrock Knowledge Base         | Fastest path, less control                 | Less control over chunking/rerank

I went with pgvector on Aurora because I needed to filter retrieval by tenant and document ACL in the same query, and SQL made that trivial.
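As an illustration of what filtering by tenant and ACL in the same query can look like, here is a small sketch that builds the SQL and its parameters. It assumes a hypothetical `acl_groups text[]` column on the `chunks` table (the ACL schema is not described here, so treat that column and the function name as placeholders). `&&` is PostgreSQL's array-overlap operator and `<=>` is pgvector's cosine-distance operator.

```python
def build_retrieval_query(tenant, user_groups, query_vector, k=20):
    """Return (sql, params) for a tenant- and ACL-scoped vector search."""
    sql = (
        "SELECT doc_id, content FROM chunks "
        "WHERE tenant = %s "
        "AND acl_groups && %s::text[] "       # hypothetical ACL column: any shared group
        "ORDER BY embedding <=> %s::vector "  # pgvector cosine distance
        "LIMIT %s"
    )
    return sql, (tenant, list(user_groups), list(query_vector), k)
```

With a psycopg connection this would be run as `conn.execute(sql, params).fetchall()`. Because the filters live in the same statement as the similarity ordering, a row the caller may not see never leaves the database.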

Ingestion and embedding

Chunking is where retrieval quality is won or lost. Fixed 512-token chunks with ~15% overlap were my reliable default; I switched to structure-aware splitting (by heading) for technical docs. Each chunk gets embedded via Bedrock Titan Embeddings and upserted with its metadata.
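Both chunking strategies fit in a few lines. The sketch below is illustrative rather than the exact splitter used: it approximates tokens with whitespace-separated words (a real pipeline would count with the embedding model's tokenizer), and the heading splitter assumes Markdown-style `#` headings.

```python
import re


def chunk_fixed(text, size=512, overlap_ratio=0.15):
    """Split text into windows of `size` tokens overlapping by `overlap_ratio`."""
    tokens = text.split()  # assumption: words stand in for model tokens
    step = max(1, int(size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def chunk_by_heading(text):
    """Split a Markdown-style document so each chunk starts at a heading."""
    # Zero-width split just before any line starting with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]
```

With the defaults, each window advances 435 tokens, so consecutive chunks share 77 tokens of context.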

import boto3, json, psycopg

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed(text):
    body = json.dumps({"inputText": text})
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=body,
    )
    return json.loads(resp["body"].read())["embedding"]

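def upsert_chunk(conn, doc_id, tenant, chunk):
    # Note: this is a plain INSERT. A true upsert needs a unique key on the
    # chunk (for example doc_id plus a chunk index) and an ON CONFLICT clause;
    # without one, re-ingesting a document duplicates its rows.
    vec = embed(chunk)
    conn.execute(
        "INSERT INTO chunks (doc_id, tenant, content, embedding) "
        "VALUES (%s, %s, %s, %s)",
        (doc_id, tenant, chunk, vec),
    )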
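Re-running ingestion on every change gets expensive if each run re-embeds chunks whose text has not changed. One way to skip them, sketched here under the assumption that a content hash is stored next to each chunk (a hypothetical `content_hash` column, not shown in the schema above), is to compare hashes before calling the embedding model.

```python
import hashlib


def content_hash(chunk):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()


def chunks_to_embed(chunks, known_hashes):
    """Return (hash, chunk) pairs for chunks whose hash is not already stored."""
    seen = set(known_hashes)
    fresh = []
    for chunk in chunks:
        digest = content_hash(chunk)
        if digest not in seen:
            fresh.append((digest, chunk))
            seen.add(digest)  # also de-duplicates repeats within this batch
    return fresh
```

Only the pairs returned by `chunks_to_embed` need a Bedrock call; everything else can keep its existing vector.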

Retrieve, rerank, then generate

Naive top-k cosine search returns plausible-but-wrong chunks often enough to matter. Two fixes earned their keep: hybrid search (combine vector similarity with keyword match) and a reranking step that scores candidates against the query before they reach the prompt. I retrieve top-20, rerank to top-5, and only those go to the generator.

def retrieve(conn, query, tenant, k=20):
    qvec = embed(query)
    rows = conn.execute(
        "SELECT content FROM chunks "
        "WHERE tenant = %s "
        "ORDER BY embedding <=> %s::vector "  # cosine distance
        "LIMIT %s",
        (tenant, qvec, k),
    ).fetchall()
    return [r[0] for r in rows]

def answer(conn, query, tenant):
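    chunks = retrieve(conn, query, tenant)  # top-20 candidates
    # No reranker is wired in here: chunks[:5] simply keeps the five nearest
    # neighbours. A rerank call belongs at this point.
    context = "\n---\n".join(chunks[:5])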
    prompt = (
        "Answer using ONLY the context. If the answer is not in the "
        "context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    resp = bedrock.invoke_model(
        modelId="anthropic.claude-sonnet-4-5-20250929-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    return json.loads(resp["body"].read())["content"][0]["text"]
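The `retrieve` function above uses vector similarity alone, and `answer` keeps the first five rows as they come back. The following sketch shows one way the hybrid and rerank steps could be layered on top. Reciprocal rank fusion is my choice for merging the keyword and vector rankings here, not necessarily what the production system used, and `overlap_score` is a toy stand-in for a real reranking model.

```python
def rrf_fuse(rankings, k=60):
    """Merge ranked lists of chunk ids with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); k damps the top ranks.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


def rerank(query, candidates, score_fn, keep=5):
    """Keep the `keep` candidates that score_fn rates most relevant."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked[:keep]


def overlap_score(query, text):
    """Toy relevance score: fraction of query words found in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(1, len(q))
```

In use, `rrf_fuse([vector_ids, keyword_ids])` would produce the top-20 candidate list, and `rerank(query, candidate_texts, score_fn)` would cut it to the five passages that reach the prompt.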

Instructing the model to answer only from the context, and to admit when the answer isn't there, is the single biggest lever against hallucination. Pair it with citations so a human can verify each claim.
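The `answer` function above returns only the generated text, so here is a hedged sketch of how citations could be added: number each passage in the prompt, ask the model to cite those numbers, then map the numbers in its reply back to document ids. The dict shape with `doc_id` and `content` keys and both function names are assumptions for illustration.

```python
import re


def build_cited_prompt(query, chunks):
    """Build a prompt with numbered passages plus a number -> doc_id map."""
    numbered = [f"[{i}] {c['content']}" for i, c in enumerate(chunks, start=1)]
    context = "\n---\n".join(numbered)
    prompt = (
        "Answer using ONLY the context. Cite the supporting passage numbers "
        "in square brackets, e.g. [2]. If the answer is not in the context, "
        "say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    sources = {i: c["doc_id"] for i, c in enumerate(chunks, start=1)}
    return prompt, sources


def cited_sources(answer_text, sources):
    """Map [n] markers in the answer back to doc ids, skipping unknown numbers."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer_text)}
    return [sources[n] for n in sorted(cited) if n in sources]
```

Dropping citation numbers that do not match a supplied passage is a cheap extra check: a reply that cites `[7]` when only five passages were provided is a signal worth logging.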

In a demo, RAG is a retrieval problem. In production, it's a freshness, cost, and trust problem. The vector search is the easy 20%.

Making it production-grade

  • Evaluation: build a golden set of question/answer pairs and score retrieval (recall@k) and answer faithfulness on every change. Without this you can't tell if an "improvement" regressed quality.
  • Cost control: cache embeddings for unchanged chunks, cap retrieved context tokens, and use a smaller model for simple queries with a router.
  • Freshness: trigger re-ingestion from S3 events so the index reflects document changes within minutes, and track per-chunk source version for citations.
  • Guardrails: apply Bedrock Guardrails for PII and toxicity, and enforce tenant filtering in the retrieval query so one customer never sees another's data.
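Two of the items above reduce to small, testable functions. This sketch shows recall@k over a golden set and a cap on retrieved context tokens. It assumes the golden set stores the ids of the chunks that should be retrieved for each question, and it again approximates tokens with whitespace-separated words; both are simplifications, not a description of the original harness.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids found among the first k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(golden_set, retrieve_ids, k=5):
    """golden_set: (question, relevant_ids) pairs; retrieve_ids: question -> ranked ids."""
    scores = [recall_at_k(retrieve_ids(q), rel, k) for q, rel in golden_set]
    return sum(scores) / len(scores) if scores else 0.0


def cap_context(chunks, max_tokens):
    """Keep chunks in rank order until the token budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())  # assumption: words stand in for model tokens
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```

Running `mean_recall_at_k` in CI against a fixed golden set gives a single number to compare before and after a chunking or retrieval change.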

Takeaways

  • RAG is two pipelines (ingestion and query) meeting at a vector store; design both, not just the query path.
  • Chunking strategy and a rerank step (retrieve top-20, keep top-5) move quality more than swapping models.
  • Constrain the generator to answer only from context and return citations to control hallucination.
  • Production hinges on evaluation harnesses, freshness via S3-triggered re-ingestion, tenant-scoped retrieval, and cost caps on context tokens.