
Tutorial: Creating Custom SuperComponents


Overview

In this tutorial, you’ll learn how to create custom SuperComponents using the @super_component decorator. SuperComponents are a powerful way to encapsulate complex pipelines into reusable components with simplified interfaces.

We’ll explore several examples:

  1. Creating a simple HybridRetriever SuperComponent
  2. Extending our HybridRetriever with a ranker component and a custom input mapping
  3. Serializing and deserializing a SuperComponent
  4. Creating a SuperComponent that exposes outputs from non-leaf components

The @super_component decorator converts a class that defines a pipeline into a fully functional Haystack component. You can use that component in other pipelines or applications without losing pipeline features such as content tracing and debugging. The only requirement is that the class has an attribute called pipeline.

Preparing the Environment

First, let’s install Haystack and the dependencies we’ll need:

%%bash

pip install haystack-ai
pip install "sentence-transformers>=3.0.0" datasets "transformers[torch,sentencepiece]"

Understanding the @super_component Decorator

The @super_component decorator is a powerful tool that allows you to create custom components by wrapping a Pipeline. It handles all the complexity of mapping inputs and outputs between the component interface and the underlying pipeline.

When you use the @super_component decorator, you need to define a class with:

  1. An __init__ method that creates a Pipeline and assigns it to self.pipeline
  2. Optionally, input_mapping and output_mapping attributes to customize how inputs and outputs are mapped

The decorator then:

  1. Creates a new class that inherits from SuperComponent
  2. Copies all methods and attributes from your original class
  3. Adds initialization logic to properly set up the SuperComponent
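
The three steps above can be sketched in plain Python. This is a simplified illustration, not Haystack's implementation: ToySuperComponent, toy_super_component, and the Echo class are hypothetical names invented for this sketch, and a plain list stands in for a real Pipeline.

```python
from typing import Any, Optional


class ToySuperComponent:
    """Hypothetical stand-in for a SuperComponent base class (not Haystack's)."""

    def _setup(self, pipeline: Any, input_mapping: Optional[dict], output_mapping: Optional[dict]) -> None:
        self.pipeline = pipeline
        self.input_mapping = input_mapping
        self.output_mapping = output_mapping
        self.is_set_up = True


def toy_super_component(cls):
    """Hypothetical decorator that mirrors the three steps listed above."""
    original_init = cls.__init__

    def init_wrapper(self, *args, **kwargs):
        # Run the user's __init__, which is expected to assign self.pipeline.
        original_init(self, *args, **kwargs)
        if not hasattr(self, "pipeline"):
            raise ValueError(f"{cls.__name__} must define a 'pipeline' attribute")
        # Step 3: extra initialization on behalf of the base class.
        ToySuperComponent._setup(
            self,
            self.pipeline,
            getattr(self, "input_mapping", None),
            getattr(self, "output_mapping", None),
        )

    # Steps 1 and 2: build a new class that inherits from the base class and
    # carries over the attributes and methods of the original class.
    namespace = {k: v for k, v in vars(cls).items() if k not in ("__dict__", "__weakref__")}
    namespace["__init__"] = init_wrapper
    return type(cls.__name__, (ToySuperComponent,), namespace)


@toy_super_component
class Echo:
    def __init__(self, prefix: str):
        self.prefix = prefix
        self.pipeline = ["step_a", "step_b"]  # placeholder for a real Pipeline

    def describe(self) -> str:
        return f"{self.prefix}: {len(self.pipeline)} steps"


echo = Echo("demo")
print(isinstance(echo, ToySuperComponent), echo.describe())
```

The decorated class keeps its own methods (describe) while gaining the base class and its setup logic, which is the general shape of what the real decorator provides.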

Let’s see how this works with some practical examples.

1. Creating a HybridRetriever SuperComponent

Let’s start with a simple example: creating a HybridRetriever that combines BM25 and embedding-based retrieval. This SuperComponent will take a query and return relevant documents.

from haystack import Document, Pipeline, super_component
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

from datasets import load_dataset


@super_component
class HybridRetriever:
    def __init__(self, document_store: InMemoryDocumentStore, embedder_model: str = "BAAI/bge-small-en-v1.5"):
        # Create the components
        embedding_retriever = InMemoryEmbeddingRetriever(document_store)
        bm25_retriever = InMemoryBM25Retriever(document_store)
        text_embedder = SentenceTransformersTextEmbedder(embedder_model)
        document_joiner = DocumentJoiner(join_mode="reciprocal_rank_fusion")

        # Create the pipeline
        self.pipeline = Pipeline()
        self.pipeline.add_component("text_embedder", text_embedder)
        self.pipeline.add_component("embedding_retriever", embedding_retriever)
        self.pipeline.add_component("bm25_retriever", bm25_retriever)
        self.pipeline.add_component("document_joiner", document_joiner)

        # Connect the components
        self.pipeline.connect("text_embedder", "embedding_retriever")
        self.pipeline.connect("bm25_retriever", "document_joiner")
        self.pipeline.connect("embedding_retriever", "document_joiner")

Now, let’s load a dataset and test our HybridRetriever:

# Load a dataset
dataset = load_dataset("HaystackBot/medrag-pubmed-chunk-with-embeddings", split="train")
docs = [Document(content=doc["contents"], embedding=doc["embedding"]) for doc in dataset]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

# Create and run the HybridRetriever
query = "What treatments are available for chronic bronchitis?"
retriever = HybridRetriever(document_store)
# Without an input_mapping, each pipeline input is passed under its own name:
# `text` goes to the text embedder and `query` goes to the BM25 retriever.
result = retriever.run(text=query, query=query)

# Print the results
print(f"Found {len(result['documents'])} documents")
for i, doc in enumerate(result["documents"][:3]):  # Show first 3 documents
    print(f"\nDocument {i+1} (Score: {doc.score:.4f}):")
    print(doc.content[:200] + "...")
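
The HybridRetriever above sets join_mode="reciprocal_rank_fusion" on the DocumentJoiner. As a rough illustration of the idea, here is a minimal sketch of the textbook reciprocal rank fusion formula, where each document scores 1 / (k + rank) in every list it appears in. The function name, the constant k=60, and the sample document IDs are assumptions for this sketch; Haystack's DocumentJoiner may use a different constant and may rescale the scores.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse several ranked lists of document IDs into one score per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the top document of a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Return documents ordered from highest to lowest fused score.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]  # hypothetical BM25 result order
embedding_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical embedding result order
fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
print(list(fused))
```

Documents that appear near the top of both lists (doc_b and doc_a here) end up ahead of documents that only one retriever found.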

How the HybridRetriever Works

Let’s break down what’s happening in our HybridRetriever SuperComponent:

  1. We define a class decorated with @super_component
  2. In the __init__ method, we:
    • Create all the components we need (embedding retriever, BM25 retriever, etc.)
    • Create a Pipeline and add all components to it
    • Connect the components to define the flow of data
  3. The @super_component decorator handles all the complexity of making our class work as a component

If we define an input mapping like {"query": ["text_embedder.text", "bm25_retriever.query"]}, we can call retriever.run(query=query), and the query will automatically be routed to both the text embedder’s text input and the BM25 retriever’s query input.

You can also specify how the outputs should be exposed through output_mapping. For example, output mapping {"document_joiner.documents": "documents"} means that the documents produced by the document_joiner will be returned under the name documents when you call retriever.run(...).
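
To make the two mappings concrete, the following sketch shows one way such routing could work in plain Python. The helpers route_inputs and collect_outputs are hypothetical and are not part of Haystack; only the mapping dictionaries follow the format described above.

```python
from typing import Any, Dict, List


def route_inputs(input_mapping: Dict[str, List[str]], **inputs: Any) -> Dict[str, Dict[str, Any]]:
    """Expand each named input into the 'component.socket' targets it maps to."""
    pipeline_inputs: Dict[str, Dict[str, Any]] = {}
    for name, value in inputs.items():
        for target in input_mapping[name]:
            component, socket = target.split(".", 1)
            pipeline_inputs.setdefault(component, {})[socket] = value
    return pipeline_inputs


def collect_outputs(
    output_mapping: Dict[str, str], pipeline_outputs: Dict[str, Dict[str, Any]]
) -> Dict[str, Any]:
    """Rename 'component.socket' outputs to the names exposed by the wrapper."""
    result: Dict[str, Any] = {}
    for source, exposed_name in output_mapping.items():
        component, socket = source.split(".", 1)
        if socket in pipeline_outputs.get(component, {}):
            result[exposed_name] = pipeline_outputs[component][socket]
    return result


input_mapping = {"query": ["text_embedder.text", "bm25_retriever.query"]}
output_mapping = {"document_joiner.documents": "documents"}

routed = route_inputs(input_mapping, query="chronic bronchitis")
# Placeholder strings stand in for the Document objects a pipeline would return.
exposed = collect_outputs(output_mapping, {"document_joiner": {"documents": ["doc_1", "doc_2"]}})
print(routed)
print(exposed)
```

A single query argument fans out to both component inputs, and the joiner's output comes back under the shorter name documents.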

2. A HybridRetriever with Re-Ranking and Custom ‘input_mapping’

Now, let’s enhance our HybridRetriever by adding a ranker component. This will re-rank the documents based on their semantic similarity to the query, potentially improving the quality of the results. We also define a custom input_mapping.

from haystack import Document, Pipeline, super_component
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

from datasets import load_dataset


@super_component
class HybridRetrieverWithRanker:
    def __init__(
        self,
        document_store: InMemoryDocumentStore,
        embedder_model: str = "BAAI/bge-small-en-v1.5",
        ranker_model: str = "BAAI/bge-reranker-base",
    ):
        # Create the components
        embedding_retriever = InMemoryEmbeddingRetriever(document_store)
        bm25_retriever = InMemoryBM25Retriever(document_store)
        text_embedder = SentenceTransformersTextEmbedder(embedder_model)
        document_joiner = DocumentJoiner()
        ranker = TransformersSimilarityRanker(ranker_model)

        # Create the pipeline
        self.pipeline = Pipeline()
        self.pipeline.add_component("text_embedder", text_embedder)
        self.pipeline.add_component("embedding_retriever", embedding_retriever)
        self.pipeline.add_component("bm25_retriever", bm25_retriever)
        self.pipeline.add_component("document_joiner", document_joiner)
        self.pipeline.add_component("ranker", ranker)

        # Connect the components
        self.pipeline.connect("text_embedder", "embedding_retriever")
        self.pipeline.connect("bm25_retriever", "document_joiner")
        self.pipeline.connect("embedding_retriever", "document_joiner")
        self.pipeline.connect("document_joiner", "ranker")

        # Define input mapping
        self.input_mapping = {"query": ["text_embedder.text", "bm25_retriever.query", "ranker.query"]}


# Create and run the HybridRetrieverWithRanker
retriever = HybridRetrieverWithRanker(document_store)
result = retriever.run(query=query)  # instead of retriever.run(text=query, query=query) thanks to input_mapping

# Print the results
print(f"Found {len(result['documents'])} documents")
for i, doc in enumerate(result["documents"][:3]):  # Show first 3 documents
    print(f"\nDocument {i+1} (Score: {doc.score:.4f}):")
    print(doc.content[:200] + "...")

Comparing the Two Retrievers

The main differences between the two retrievers are:

  1. Added Ranker Component: The second version includes a TransformersSimilarityRanker that re-ranks the documents based on their semantic similarity to the query.
  2. Custom Input Mapping: We defined an input_mapping that routes a single query input to "text_embedder.text", "bm25_retriever.query" and "ranker.query". This sends the query to all three components and simplifies the retriever.run call.

The ranker can significantly improve the quality of the results by re-ranking the documents based on their semantic similarity to the query, even if they were not ranked highly by the initial retrievers.
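
The re-ranking step can be pictured as scoring every joined document against the query and sorting by that score. The sketch below uses a toy token-overlap score purely for illustration; TransformersSimilarityRanker relies on a cross-encoder model instead, and the rerank and token_overlap functions and the sample texts are hypothetical.

```python
from typing import Callable, List, Tuple


def rerank(query: str, documents: List[str], score_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Score each document against the query and sort from best to worst."""
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def token_overlap(query: str, document: str) -> float:
    """Toy relevance score: fraction of query tokens that occur in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0


# Hypothetical joined results, in the order the joiner returned them.
joined = [
    "Antibiotics are rarely needed for viral infections",
    "Bronchodilators are one treatment for chronic bronchitis",
    "Chronic cough has many causes",
]
reranked = rerank("treatment for chronic bronchitis", joined, token_overlap)
for text, score in reranked:
    print(f"{score:.2f}  {text}")
```

The second document moves to the top because it matches the query best, regardless of where the retrievers placed it.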

3. Serialization and Deserialization of SuperComponents

One of the key benefits of using the @super_component decorator is that it automatically adds serialization and deserialization capabilities to your component. This means you can easily save and load your SuperComponents using the standard Haystack serialization functions.

Let’s see how this works with DocumentPreprocessor, one of Haystack’s ready-made SuperComponents:

from haystack import Document
from haystack.core.serialization import component_to_dict, component_from_dict
from haystack.components.preprocessors import DocumentPreprocessor

# Create an instance of the SuperComponent
preprocessor = DocumentPreprocessor()

# Serialize the component to a dictionary
serialized = component_to_dict(preprocessor, "document_preprocessor")
print("Serialized component:")
print(serialized)

# Deserialize the component from the dictionary
deserialized = component_from_dict(DocumentPreprocessor, serialized, "document_preprocessor")
print("\nDeserialized component:")
print(deserialized)

# Verify that the deserialized component works
doc = Document(content="I love pizza!")
result = deserialized.run(documents=[doc])
print(f"\nDeserialized component produced {len(result['documents'])} documents")

The serialization and deserialization process works seamlessly with SuperComponents because the @super_component decorator automatically adds the necessary functionality. This is particularly useful when you want to:

  1. Save and load pipelines: You can save your entire pipeline, including SuperComponents, to a file and load it later.
  2. Deploy components: You can deploy your SuperComponents to a server or cloud environment.
  3. Share components: You can share your SuperComponents with others, who can then load and use them in their own pipelines.

The serialization process captures all the initialization parameters of your SuperComponent, ensuring that when it’s deserialized, it’s recreated with the same configuration.
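
The round trip described above can be sketched with a small stand-alone class. ToyRetrieverConfig is hypothetical, and the dictionary layout with "type" and "init_parameters" keys is an assumption made for this illustration rather than a statement about the exact output of component_to_dict.

```python
from typing import Any, Dict


class ToyRetrieverConfig:
    """Hypothetical component that only stores its initialization parameters."""

    def __init__(self, embedder_model: str = "BAAI/bge-small-en-v1.5", top_k: int = 10):
        self.embedder_model = embedder_model
        self.top_k = top_k

    def to_dict(self) -> Dict[str, Any]:
        # Record the class path and the arguments needed to rebuild the object.
        return {
            "type": f"{type(self).__module__}.{type(self).__name__}",
            "init_parameters": {"embedder_model": self.embedder_model, "top_k": self.top_k},
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "ToyRetrieverConfig":
        # Recreate the object by calling __init__ with the stored arguments.
        return cls(**data["init_parameters"])


original = ToyRetrieverConfig(top_k=3)
serialized = original.to_dict()
restored = ToyRetrieverConfig.from_dict(serialized)
print(serialized["init_parameters"])
print(restored.top_k == original.top_k)
```

Because only the initialization parameters are stored, any state created inside __init__ (such as the wrapped pipeline) is rebuilt when the object is constructed again.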

4. Creating a SuperComponent with Outputs from Non-Leaf Components

One of the powerful features of SuperComponents is the ability to expose outputs from any component in the pipeline, not just the leaf components. Here, leaf components are components that do not send any outputs to other components in the pipeline. Let’s create a SuperComponent that demonstrates this capability.

from haystack import Document, Pipeline, super_component
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.document_stores.in_memory import InMemoryDocumentStore


@super_component
class AdvancedHybridRetriever:
    def __init__(
        self,
        document_store: InMemoryDocumentStore,
        embedder_model: str = "BAAI/bge-small-en-v1.5",
        ranker_model: str = "BAAI/bge-reranker-base",
    ):
        # Create the components
        embedding_retriever = InMemoryEmbeddingRetriever(document_store)
        bm25_retriever = InMemoryBM25Retriever(document_store)
        text_embedder = SentenceTransformersTextEmbedder(embedder_model)
        document_joiner = DocumentJoiner()
        ranker = TransformersSimilarityRanker(ranker_model)

        # Create the pipeline
        self.pipeline = Pipeline()
        self.pipeline.add_component("text_embedder", text_embedder)
        self.pipeline.add_component("embedding_retriever", embedding_retriever)
        self.pipeline.add_component("bm25_retriever", bm25_retriever)
        self.pipeline.add_component("document_joiner", document_joiner)
        self.pipeline.add_component("ranker", ranker)

        # Connect the components
        self.pipeline.connect("text_embedder", "embedding_retriever")
        self.pipeline.connect("bm25_retriever", "document_joiner")
        self.pipeline.connect("embedding_retriever", "document_joiner")
        self.pipeline.connect("document_joiner", "ranker")

        # Define input and output mappings
        self.input_mapping = {"query": ["text_embedder.text", "bm25_retriever.query", "ranker.query"]}

        # Expose outputs from multiple components, including non-leaf components
        self.output_mapping = {
            "bm25_retriever.documents": "bm25_documents",
            "embedding_retriever.documents": "embedding_documents",
            "document_joiner.documents": "joined_documents",
            "ranker.documents": "ranked_documents",
            "text_embedder.embedding": "query_embedding",
        }


# Create and run the AdvancedHybridRetriever
retriever = AdvancedHybridRetriever(document_store)
result = retriever.run(query=query)

# Print the results
print(f"BM25 documents: {len(result['bm25_documents'])}")
print(f"Embedding documents: {len(result['embedding_documents'])}")
print(f"Joined documents: {len(result['joined_documents'])}")
print(f"Ranked documents: {len(result['ranked_documents'])}")
print(f"Query embedding dimension: {len(result['query_embedding'])}")

# Compare the top document from each stage
print("\nTop BM25 document:")
print(result["bm25_documents"][0].content[:200] + "...")
print(f"Score: {result['bm25_documents'][0].score:.4f}")

print("\nTop embedding document:")
print(result["embedding_documents"][0].content[:200] + "...")
print(f"Score: {result['embedding_documents'][0].score:.4f}")

print("\nTop ranked document:")
print(result["ranked_documents"][0].content[:200] + "...")
print(f"Score: {result['ranked_documents'][0].score:.4f}")

Understanding Outputs from Non-Leaf Components

In this example, we’ve created a SuperComponent that exposes outputs from multiple components in the pipeline, including non-leaf components:

  1. BM25 Documents: Documents retrieved by the BM25 retriever
  2. Embedding Documents: Documents retrieved by the embedding retriever
  3. Joined Documents: Documents after joining the results from both retrievers
  4. Ranked Documents: Documents after re-ranking
  5. Query Embedding: The embedding of the query

This demonstrates how the @super_component decorator allows you to expose outputs from any component in the pipeline, not just the leaf components. This is particularly useful for debugging, analysis, or when you want to provide more detailed information to the user.
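
To see which components of this pipeline count as leaves, the sketch below applies the definition used in this section (a leaf sends no output to another component) to the connections made in AdvancedHybridRetriever. The leaf_components helper is hypothetical and works on plain name pairs rather than on a Pipeline object.

```python
from typing import List, Set, Tuple

# (sender, receiver) pairs copied from the connect() calls in AdvancedHybridRetriever.
connections: List[Tuple[str, str]] = [
    ("text_embedder", "embedding_retriever"),
    ("bm25_retriever", "document_joiner"),
    ("embedding_retriever", "document_joiner"),
    ("document_joiner", "ranker"),
]


def leaf_components(edges: List[Tuple[str, str]]) -> Set[str]:
    """Return the components that never appear as the sender of a connection."""
    senders = {sender for sender, _ in edges}
    receivers = {receiver for _, receiver in edges}
    return (senders | receivers) - senders


leaves = leaf_components(connections)
non_leaves = {name for edge in connections for name in edge} - leaves
print("Leaf components:", sorted(leaves))
print("Non-leaf components:", sorted(non_leaves))
```

Only the ranker is a leaf here, so the other four outputs in the output_mapping all come from non-leaf components.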

Ready-Made SuperComponents in Haystack

Haystack provides several ready-made SuperComponents that you can use in your applications, for example:

  1. MultiFileConverter: A SuperComponent that can convert various file types to documents.
  2. DocumentPreprocessor: A SuperComponent that combines document cleaning and splitting.

These SuperComponents provide a convenient way to use common pipelines without having to build them from scratch.

Conclusion

In this tutorial, you’ve learned how to create custom SuperComponents using the @super_component decorator. You’ve seen how to:

  1. Create a simple HybridRetriever SuperComponent
  2. Enhance it with a ranker and custom input mapping
  3. Serialize and deserialize a SuperComponent with Haystack’s standard serialization functions
  4. Create a SuperComponent that exposes outputs from non-leaf components

SuperComponents are a powerful way to encapsulate complex pipelines into reusable components with simplified interfaces. They make it easy to create higher-level components that abstract away the details of the underlying pipeline.

To stay up to date on the latest Haystack developments, you can sign up for our newsletter or join the Haystack Discord community.