
DocumentJoiner

Use this component in hybrid retrieval pipelines or indexing pipelines with multiple file converters to join lists of Documents.

Name: DocumentJoiner
Folder Path: /joiners/
Position in a Pipeline: In indexing and query Pipelines, after components that return a list of Documents, such as multiple Retrievers or multiple File Converters.
Inputs: "documents": List of Document objects. This input is variadic, meaning you can connect a variable number of components to it.
Outputs: "documents": List of Document objects.

Overview

DocumentJoiner joins input lists of Documents from multiple connections and outputs them as one list. You can choose how you want the lists to be joined by specifying the join_mode. There are three options available:

  • concatenate - Combines Documents from multiple components, discarding any duplicates. Documents get their scores from the last component in the pipeline that assigns scores. This mode doesn't influence Document scores.
  • merge - Merges the scores of duplicate Documents coming from multiple components. You can also assign a weight to the scores to influence how they're merged and set the top_k limit to specify how many Documents you want DocumentJoiner to return.
  • reciprocal_rank_fusion - Combines Documents into a single list based on the ranks they received from multiple components. It then calculates a new score from the ranks of Documents in the input lists. If the same Document appears in more than one list (was returned by multiple components), it gets a higher score.
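
To make the rank-based scoring concrete, here is a minimal sketch of reciprocal rank fusion in plain Python. It assumes the commonly used formulation in which a Document receives 1 / (k + rank) from every list it appears in. The function name reciprocal_rank_fusion, the constant k = 60, and the document ids are illustrative assumptions; DocumentJoiner's own constant and score normalization may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of document ids into (doc_id, score) pairs.

    Hypothetical helper for illustration, not part of Haystack.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks start at 1, so the top result of each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Illustrative rankings, as if returned by two different retrievers.
bm25_ranking = ["paris", "berlin"]
embedding_ranking = ["paris", "rome"]

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Because "paris" appears in both rankings, it accumulates two contributions and ends up first, which matches the behavior described for Documents returned by multiple components.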

Usage

On its own

The example below uses DocumentJoiner in merge mode to combine two lists of Documents. Running the component with both lists returns a single list of Documents ranked by their combined scores. By default, each input list gets equal weight. To use custom weights, set the weights parameter to a list of floats with one weight per input component.

from haystack import Document
from haystack.components.joiners.document_joiner import DocumentJoiner

docs_1 = [Document(content="Paris is the capital of France.", score=0.5), Document(content="Berlin is the capital of Germany.", score=0.4)]
docs_2 = [Document(content="Paris is the capital of France.", score=0.6), Document(content="Rome is the capital of Italy.", score=0.5)]

joiner = DocumentJoiner(join_mode="merge")

joiner.run(documents=[docs_1, docs_2])

# {'documents': [
#     Document(id=0f5beda04153dbfc462c8b31f8536749e43654709ecf0cfe22c6d009c9912214, content: 'Paris is the capital of France.', score: 0.55),
#     Document(id=424beed8b549a359239ab000f33ca3b1ddb0f30a988bbef2a46597b9c27e42f2, content: 'Rome is the capital of Italy.', score: 0.25),
#     Document(id=312b465e77e25c11512ee76ae699ce2eb201f34c8c51384003bb367e24fb6cf8, content: 'Berlin is the capital of Germany.', score: 0.2),
# ]}
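
The scores in the output above are consistent with a weighted sum in which each of the two input lists gets a weight of 0.5 (for example, 0.5 × 0.5 + 0.5 × 0.6 = 0.55 for the Paris Document). The plain-Python sketch below reproduces that arithmetic and shows how custom weights would shift the ranking. It is an illustration under that assumption, not the library's implementation: merge_scores is a hypothetical helper, Documents are modeled as (content, score) tuples, and the weights [0.3, 0.7] are made up for the example.

```python
from collections import defaultdict


def merge_scores(scored_lists, weights=None):
    """Merge (content, score) lists by weighted sum. Hypothetical helper, not Haystack API."""
    if weights is None:
        # Assumption: equal weight per input list when none are given.
        weights = [1.0 / len(scored_lists)] * len(scored_lists)
    if len(weights) != len(scored_lists):
        raise ValueError("Provide exactly one weight per input list.")

    merged = defaultdict(float)
    for scored, weight in zip(scored_lists, weights):
        for content, score in scored:
            # Duplicates (same content) accumulate their weighted scores.
            merged[content] += weight * score
    # Highest combined score first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


docs_1 = [("Paris is the capital of France.", 0.5), ("Berlin is the capital of Germany.", 0.4)]
docs_2 = [("Paris is the capital of France.", 0.6), ("Rome is the capital of Italy.", 0.5)]

equal = merge_scores([docs_1, docs_2])
custom = merge_scores([docs_1, docs_2], weights=[0.3, 0.7])

for content, score in custom:
    print(round(score, 2), content)
```

With the illustrative weights, the second list dominates, so the Rome Document moves further ahead of the Berlin Document than it does with equal weights.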

In a Pipeline

Below is an example of a hybrid retrieval pipeline that retrieves Documents from an InMemoryDocumentStore based on keyword search (using InMemoryBM25Retriever) and embedding search (using InMemoryEmbeddingRetriever). It then uses the DocumentJoiner with its default join mode to concatenate the retrieved Documents into one list. The DocumentStore must contain Documents with embeddings, otherwise the InMemoryEmbeddingRetriever will not return any Documents.

from haystack.components.joiners.document_joiner import DocumentJoiner
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.embedders import SentenceTransformersTextEmbedder

# The store starts empty here; write Documents with embeddings to it before running real queries.
document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="bm25_retriever")
p.add_component(
    instance=SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"),
    name="text_embedder",
)
p.add_component(instance=InMemoryEmbeddingRetriever(document_store=document_store), name="embedding_retriever")
p.add_component(instance=DocumentJoiner(), name="joiner")

# The text embedder turns the query into the embedding used by the embedding retriever.
p.connect("text_embedder", "embedding_retriever")
# Both retrievers feed the joiner's variadic "documents" input.
p.connect("bm25_retriever", "joiner")
p.connect("embedding_retriever", "joiner")

query = "What is the capital of France?"
p.run(data={"bm25_retriever": {"query": query}, "text_embedder": {"text": query}})

Related Links

See the parameter details in our API reference: