I built a semantic search feature for an internal docs portal last quarter, and the part that surprised me wasn't the embedding model; it was choosing where to store the vectors. We already ran an Amazon OpenSearch domain for logs, so reusing it for k-NN vector search meant one fewer system to operate. Here's how that turned out and the knobs that actually mattered.

The shape of the problem

Semantic search works by turning text into a high-dimensional vector with an embedding model, then finding the stored vectors closest to your query vector. The hard part at scale is that exact nearest-neighbor search is O(n) per query: every stored vector has to be compared against the query. OpenSearch's k-NN plugin solves this with approximate nearest neighbor (ANN) indexes, where you trade a small amount of recall for sub-50ms latency over millions of vectors.
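
To make the O(n) cost concrete, here is a minimal pure-Python sketch of exact nearest-neighbor search by cosine similarity. The function names and the toy three-dimensional vectors are made up for illustration, and it assumes no vector is all zeros; it is the baseline an ANN index approximates, not how OpenSearch works internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_knn(query, vectors, k):
    """Score every stored vector against the query: O(n) work per search."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Toy data standing in for real embeddings (hypothetical values).
docs = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}
print(exact_knn([1.0, 0.0, 0.0], docs, k=2))  # ['a', 'b']
```

With millions of 1024-dimension vectors, that loop is the cost you pay on every query, which is why the graph index is worth its memory.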

Defining the index

You enable k-NN at index creation and declare a knn_vector field with a fixed dimension that must match your embedding model's output. Amazon Titan Text Embeddings V2 emits 1024 dimensions; all-MiniLM-L6-v2 emits 384. Get this wrong and ingestion fails.

PUT /docs-index
{
  "settings": {
    "index.knn": true,
    "index.knn.algo_param.ef_search": 100
  },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "faiss",
          "parameters": { "ef_construction": 256, "m": 16 }
        }
      },
      "text": { "type": "text" },
      "source_url": { "type": "keyword" }
    }
  }
}

HNSW (Hierarchical Navigable Small World) is the graph index I default to. The three parameters that control the recall/cost trade-off:

  • m: edges per node. Higher means better recall but more memory. 16 is a sensible start.
  • ef_construction: search breadth while building the graph. Higher builds slower but yields a better graph.
  • ef_search: search breadth at query time. The lever you tune live to balance latency vs recall.
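
If you create the index from Python rather than by hand, a small builder keeps those three parameters and the dimension in one place. This is a sketch that mirrors the mapping above; build_index_body and check_dimension are hypothetical helper names, not part of opensearch-py.

```python
def build_index_body(dimension, m=16, ef_construction=256, ef_search=100):
    """Return an index body shaped like the docs-index mapping above."""
    return {
        "settings": {
            "index.knn": True,
            "index.knn.algo_param.ef_search": ef_search,
        },
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "faiss",
                        "parameters": {"ef_construction": ef_construction, "m": m},
                    },
                },
                "text": {"type": "text"},
                "source_url": {"type": "keyword"},
            }
        },
    }

def check_dimension(vector, index_body):
    """Fail early if an embedding does not match the mapped dimension."""
    expected = index_body["mappings"]["properties"]["embedding"]["dimension"]
    if len(vector) != expected:
        raise ValueError(f"embedding has {len(vector)} dims, index expects {expected}")

body = build_index_body(dimension=1024)
# With an opensearch-py client: client.indices.create(index="docs-index", body=body)
```

Calling check_dimension before each write catches a model/mapping mismatch in your own code instead of at ingestion.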

Ingesting and querying from Python

You generate the embedding (here via Bedrock), then index it alongside the original text. The query embeds the user's question the same way and runs a knn query.

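import json

import boto3
from opensearchpy import AWSV4SignerAuth, OpenSearch, RequestsHttpConnection

REGION = "us-east-1"
bedrock = boto3.client("bedrock-runtime", region_name=REGION)

def embed(text):
    """Return the Titan Text Embeddings V2 vector for a string."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(resp["body"].read())["embedding"]

# Assumption: the domain uses IAM (SigV4) auth. If yours uses fine-grained
# access control with an internal user, pass http_auth=(user, password) instead.
auth = AWSV4SignerAuth(boto3.Session().get_credentials(), REGION, "es")

client = OpenSearch(
    hosts=[{"host": "search-docs.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=auth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)

# Embed the question with the same model used at ingestion, then run the k-NN query.
query_vec = embed("how do I rotate database credentials?")
results = client.search(index="docs-index", body={
    "size": 5,
    "query": {"knn": {"embedding": {"vector": query_vec, "k": 5}}},
})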
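
The ingestion side is not shown above, so here is a minimal sketch of it. It assumes a client and an embedding function like the ones in the snippet; build_doc and index_doc are hypothetical helper names, and the field names come from the docs-index mapping.

```python
def build_doc(text, vector, source_url):
    """Shape one document to match the docs-index mapping."""
    return {"embedding": vector, "text": text, "source_url": source_url}

def index_doc(client, embed_fn, text, source_url, index="docs-index"):
    """Embed the text, then store the vector alongside the original text."""
    doc = build_doc(text, embed_fn(text), source_url)
    return client.index(index=index, body=doc)

# Usage with the objects defined above (the URL is a placeholder):
# index_doc(client, embed, "Rotate credentials with ...", "https://example.com/runbook")
```

For bulk loads you would likely batch these writes rather than send them one at a time.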

Memory is the real cost driver

HNSW graphs live in off-heap memory, and that dominates your sizing. A rough estimate for FAISS HNSW is 1.1 * (4 * dimension + 8 * m) * num_vectors bytes. For 5 million 1024-dim vectors at m=16, that's roughly 23 GB of graph memory alone. By default the JVM heap takes half of an instance's RAM (capped at 32 GB), and the k-NN circuit breaker lets graphs use 50% of what remains, so a 64 GB node has about 16 GB of graph budget. Three r6g.2xlarge.search data nodes (64 GB each) give you about 48 GB, which holds the primaries comfortably; replica shards load their own copy of the graph, so budget for those too.
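
The sizing arithmetic is easy to script. This sketch uses the estimate formula from the paragraph above and assumes the service defaults: half of RAM goes to the JVM heap (capped at 32 GiB), and the k-NN circuit breaker is set to 50% of the remainder. The function names are hypothetical.

```python
import math

def hnsw_graph_bytes(dimension, m, num_vectors):
    """Rough FAISS HNSW memory estimate, in bytes."""
    return 1.1 * (4 * dimension + 8 * m) * num_vectors

def knn_budget_bytes(node_ram_gib, heap_cap_gib=32, breaker_fraction=0.5):
    """Off-heap memory one node may use for graphs under the assumed defaults."""
    heap_gib = min(node_ram_gib / 2, heap_cap_gib)
    return (node_ram_gib - heap_gib) * breaker_fraction * 1024 ** 3

def nodes_needed(dimension, m, num_vectors, node_ram_gib, replicas=0):
    """Data nodes required to hold primaries plus each replica copy."""
    total = hnsw_graph_bytes(dimension, m, num_vectors) * (1 + replicas)
    return math.ceil(total / knn_budget_bytes(node_ram_gib))

graph = hnsw_graph_bytes(1024, 16, 5_000_000)
print(f"{graph / 1e9:.1f} GB")                             # 23.2 GB
print(nodes_needed(1024, 16, 5_000_000, 64, replicas=0))   # 2
print(nodes_needed(1024, 16, 5_000_000, 64, replicas=1))   # 3
```

With one replica the graph memory doubles, which is where a three-node cluster of 64 GiB instances stops being roomy.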

If your vector count outgrows RAM, compress the vectors with a quantized encoder such as FAISS product quantization, or use an OpenSearch Serverless vector search collection, which decouples storage from compute and bills on OCUs (OpenSearch Compute Units) instead.

Hybrid search beats pure vectors

Pure semantic search misses exact-match queries: product codes, error numbers, acronyms. I combine BM25 keyword scoring with k-NN using a hybrid query and a search pipeline with a normalization processor, which rescales the two score ranges so they can be combined. In our A/B test, hybrid lifted top-3 relevance by about 18% over k-NN alone, mostly by catching those literal lookups that embeddings smear together.
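
Here is a sketch of what that setup can look like from Python, assuming a domain version recent enough to support the hybrid query and the normalization-processor. The pipeline name, the 0.3/0.7 weights, and the helper names are illustrative assumptions, not the values from our test.

```python
def build_pipeline_body(keyword_weight=0.3, vector_weight=0.7):
    """Search pipeline that min-max normalizes scores, then averages them.

    The weights are hypothetical; they apply in the same order as the
    sub-queries inside the hybrid query (BM25 first, k-NN second).
    """
    return {
        "description": "Normalize and combine BM25 and k-NN scores",
        "phase_results_processors": [{
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [keyword_weight, vector_weight]},
                },
            }
        }],
    }

def build_hybrid_query(question, query_vec, k=5):
    """Hybrid query: BM25 match on the text field plus k-NN on the vector."""
    return {
        "size": k,
        "query": {"hybrid": {"queries": [
            {"match": {"text": {"query": question}}},
            {"knn": {"embedding": {"vector": query_vec, "k": k}}},
        ]}},
    }

def hybrid_search(client, question, query_vec,
                  pipeline="hybrid-pipeline", index="docs-index"):
    """Run the hybrid query through the named search pipeline."""
    return client.search(
        index=index,
        body=build_hybrid_query(question, query_vec),
        params={"search_pipeline": pipeline},
    )

# One-time setup with an opensearch-py client:
# client.transport.perform_request(
#     "PUT", "/_search/pipeline/hybrid-pipeline", body=build_pipeline_body())
```

Treat the weights as something to tune against your own relevance judgments rather than a recommendation.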

Takeaways

  • Match the knn_vector dimension to your embedding model exactly: 1024 for Titan V2, 384 for MiniLM.
  • Tune ef_search at query time to dial recall vs latency without rebuilding the index.
  • Size instances for off-heap graph memory, not document count; memory is the binding constraint.
  • Use hybrid BM25 + k-NN search; pure vectors lose on exact-match queries like codes and acronyms.