Name	OpenSearchEmbeddingRetriever
Path	https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/opensearch
Position in a Pipeline	1. After a Text Embedder and before a `PromptBuilder` in a RAG Pipeline 2. The last component in the semantic search Pipeline 3. After a Text Embedder and before an `ExtractiveReader` in an ExtractiveQA Pipeline
Inputs	`query_embedding`: a list of floats
Outputs	`documents`: a list of Documents

Overview

The OpenSearchEmbeddingRetriever is an embedding-based Retriever compatible with the OpenSearchDocumentStore. It compares the query embedding with the Document embeddings and returns the Documents from the OpenSearchDocumentStore whose embeddings are most similar to the query embedding.
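To make the comparison concrete, here is a minimal pure-Python sketch of embedding-based ranking. It is only an illustration: OpenSearch performs the actual search on the server side, and the similarity measure (cosine similarity), the `retrieve` helper, and the toy three-dimensional vectors are all assumptions made for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, doc_embeddings, top_k=2):
    """Hypothetical helper: rank documents by similarity and keep the top_k best."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in doc_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings; real models produce hundreds of dimensions.
doc_embeddings = {
    "languages": [0.9, 0.1, 0.0],
    "elephants": [0.1, 0.9, 0.1],
    "waves": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]

results = retrieve(query_embedding, doc_embeddings, top_k=2)
print([doc_id for doc_id, _ in results])  # ['languages', 'elephants']
```

The query vector points in nearly the same direction as the "languages" vector, so that document ranks first.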

When using the OpenSearchEmbeddingRetriever in your NLP system, make sure both the query and Document embeddings are available. You can do so by adding a Document Embedder to your indexing pipeline and a Text Embedder to your query pipeline. Both embedders must use the same model so that the query and Document embeddings are comparable.

In addition to the query_embedding, the OpenSearchEmbeddingRetriever accepts other optional parameters, including top_k (the maximum number of Documents to retrieve) and filters to narrow down the search space.
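As a rough sketch, the optional parameters can be supplied at query time as part of the pipeline inputs. The snippet below only builds the input dictionary. The metadata field `language` is a hypothetical example, and the filter is written in the Haystack 2.x `field` / `operator` / `value` form, which is assumed to apply to your installed version.

```python
# Hypothetical metadata field "language"; your Documents must carry it in their meta.
filters = {"field": "meta.language", "operator": "==", "value": "en"}

# Inputs keyed by component name, as used by the query pipeline shown under Usage.
run_inputs = {
    "text_embedder": {"text": "How many languages are there?"},
    "retriever": {"top_k": 2, "filters": filters},
}

# With a pipeline containing "text_embedder" and "retriever" components,
# these inputs would be passed as: query_pipeline.run(run_inputs)
print(run_inputs["retriever"])
```

Passing `top_k` and `filters` at run time leaves the Retriever's configuration unchanged, so the same pipeline can serve queries with different limits and filters.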

The embedding_dim for storing and retrieving embeddings must be defined when the corresponding OpenSearchDocumentStore is initialized.
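Because the dimension is fixed when the store is created, it can help to check vector lengths before writing. The sketch below assumes a dimension of 768, which matches the model used in the Usage example. `find_dim_mismatches` is a hypothetical helper and not part of Haystack or the integration.

```python
EMBEDDING_DIM = 768  # assumed to equal the embedding_dim given to OpenSearchDocumentStore

def find_dim_mismatches(embeddings, embedding_dim=EMBEDDING_DIM):
    """Return the indices of vectors whose length differs from embedding_dim."""
    return [i for i, vector in enumerate(embeddings) if len(vector) != embedding_dim]

# Placeholder vectors: the second one has the wrong size (384 instead of 768).
embeddings = [[0.0] * 768, [0.0] * 384, [0.0] * 768]

mismatches = find_dim_mismatches(embeddings)
print(mismatches)  # [1]
```

A mismatch usually means the embedding model was changed after the index was created.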

Setup and installation

Install and run an OpenSearch instance.

If you have Docker set up, we recommend pulling the Docker image and running it:

docker pull opensearchproject/opensearch:2.11.0
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "OPENSEARCH_JAVA_OPTS=-Xms1024m -Xmx1024m" opensearchproject/opensearch:2.11.0

As an alternative, you can go to the OpenSearch integration's GitHub repository and start a Docker container running OpenSearch using the provided docker-compose.yml:

docker compose up

Once you have a running OpenSearch instance, install the opensearch-haystack integration:

pip install opensearch-haystack

Usage

In a pipeline

Use this Retriever in a query Pipeline like this:

from haystack_integrations.components.retrievers.opensearch import OpenSearchEmbeddingRetriever
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

from haystack.document_stores.types import DuplicatePolicy
from haystack import Document
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

# use_ssl=True connects over HTTPS, so the host URL uses the https scheme
document_store = OpenSearchDocumentStore(hosts="https://localhost:9200", use_ssl=True,
                                         verify_certs=False, http_auth=("admin", "admin"))

model = "sentence-transformers/all-mpnet-base-v2"

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
             Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.")]

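# Embed the Documents with the same model that will embed the query
document_embedder = SentenceTransformersDocumentEmbedder(model=model)
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

# Write the embedded Documents, skipping any that already exist in the store
document_store.write_documents(documents_with_embeddings["documents"], policy=DuplicatePolicy.SKIP)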

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=model))
query_pipeline.add_component("retriever", OpenSearchEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result["retriever"]["documents"][0])

The example output would be:

Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.70026743, embedding: vector of size 768)