Tutorial: Creating Custom SuperComponents
Last Updated: April 22, 2025
- Level: Intermediate
- Time to complete: 20 minutes
- Concepts and Components Used: @super_component, Pipeline, DocumentJoiner, SentenceTransformersTextEmbedder, InMemoryBM25Retriever, InMemoryEmbeddingRetriever, TransformersSimilarityRanker
- Goal: After completing this tutorial, you’ll have learned how to create custom SuperComponents using the @super_component decorator to simplify complex pipelines and make them reusable as components.
Overview
In this tutorial, you’ll learn how to create custom SuperComponents using the @super_component decorator. SuperComponents are a powerful way to encapsulate complex pipelines into reusable components with simplified interfaces.
We’ll explore several examples:
- Creating a simple HybridRetriever SuperComponent
- Extending our HybridRetriever with a ranker component
- Creating a SuperComponent with custom input and output mappings
- Creating a SuperComponent that exposes outputs from non-leaf components
The @super_component decorator makes it easy to convert a class that defines a pipeline into a fully functional Haystack component that can be used in other pipelines or applications without losing pipeline functionalities like content tracing and debugging. All it requires is that the class has an attribute called pipeline.
Preparing the Environment
First, let’s install Haystack and the dependencies we’ll need:
%%bash
pip install haystack-ai
pip install "sentence-transformers>=3.0.0" datasets "transformers[torch,sentencepiece]"
Understanding the @super_component Decorator
The @super_component decorator is a powerful tool that allows you to create custom components by wrapping a Pipeline. It handles all the complexity of mapping inputs and outputs between the component interface and the underlying pipeline.
When you use the @super_component decorator, you need to define a class with:
- An __init__ method that creates a Pipeline and assigns it to self.pipeline
- Optionally, input_mapping and output_mapping attributes to customize how inputs and outputs are mapped
The decorator then:
- Creates a new class that inherits from SuperComponent
- Copies all methods and attributes from your original class
- Adds initialization logic to properly set up the SuperComponent
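To build some intuition for the three steps above, here is a minimal pure-Python sketch of a class decorator that follows the same pattern. It is not Haystack’s implementation: ToySuperComponent, toy_super_component, and the Example class are hypothetical names made up for illustration, and a plain dict stands in for a real Pipeline.

```python
class ToySuperComponent:
    """Stand-in base class for illustration; NOT Haystack's SuperComponent."""

    def _setup(self, pipeline, input_mapping=None, output_mapping=None):
        self.pipeline = pipeline
        self.input_mapping = input_mapping
        self.output_mapping = output_mapping
        self.is_set_up = True


def toy_super_component(cls):
    """Hypothetical class decorator mirroring the three steps listed above."""
    original_init = cls.__init__

    def __init__(self, *args, **kwargs):
        # Run the user's __init__, which is expected to assign self.pipeline.
        original_init(self, *args, **kwargs)
        if not hasattr(self, "pipeline"):
            raise ValueError(f"{cls.__name__} must define a 'pipeline' attribute")
        # Added initialization logic: pass the pipeline and optional mappings to the base class.
        ToySuperComponent._setup(
            self,
            self.pipeline,
            getattr(self, "input_mapping", None),
            getattr(self, "output_mapping", None),
        )

    # Copy the original class's methods and attributes into a new class body.
    namespace = {k: v for k, v in vars(cls).items() if k not in ("__dict__", "__weakref__")}
    namespace["__init__"] = __init__
    # Create a new class with the same name that inherits from the base class.
    return type(cls.__name__, (ToySuperComponent,), namespace)


@toy_super_component
class Example:
    def __init__(self, name):
        self.pipeline = {"name": name}  # a plain dict stands in for a Pipeline
        self.input_mapping = {"query": ["a.text"]}

    def describe(self):
        return f"wraps {self.pipeline['name']}"


example = Example("demo")
print(isinstance(example, ToySuperComponent), example.describe(), example.is_set_up)
```

The decorated class keeps its own methods (describe still works) while gaining the base class and the extra setup step, which is the general shape of what the list above describes.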
Let’s see how this works with some practical examples.
1. Creating a HybridRetriever SuperComponent
Let’s start with a simple example: creating a HybridRetriever that combines BM25 and embedding-based retrieval. This SuperComponent will take a query and return relevant documents.
from haystack import Document, Pipeline, super_component
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from datasets import load_dataset


@super_component
class HybridRetriever:
    def __init__(self, document_store: InMemoryDocumentStore, embedder_model: str = "BAAI/bge-small-en-v1.5"):
        # Create the components
        embedding_retriever = InMemoryEmbeddingRetriever(document_store)
        bm25_retriever = InMemoryBM25Retriever(document_store)
        text_embedder = SentenceTransformersTextEmbedder(embedder_model)
        document_joiner = DocumentJoiner(join_mode="reciprocal_rank_fusion")

        # Create the pipeline
        self.pipeline = Pipeline()
        self.pipeline.add_component("text_embedder", text_embedder)
        self.pipeline.add_component("embedding_retriever", embedding_retriever)
        self.pipeline.add_component("bm25_retriever", bm25_retriever)
        self.pipeline.add_component("document_joiner", document_joiner)

        # Connect the components
        self.pipeline.connect("text_embedder", "embedding_retriever")
        self.pipeline.connect("bm25_retriever", "document_joiner")
        self.pipeline.connect("embedding_retriever", "document_joiner")
Now, let’s load a dataset and test our HybridRetriever:
# Load a dataset
dataset = load_dataset("HaystackBot/medrag-pubmed-chunk-with-embeddings", split="train")
docs = [Document(content=doc["contents"], embedding=doc["embedding"]) for doc in dataset]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

# Create and run the HybridRetriever
query = "What treatments are available for chronic bronchitis?"
retriever = HybridRetriever(document_store)
# Without an input_mapping, each pipeline input is passed by name:
# `text` feeds the text embedder and `query` feeds the BM25 retriever.
result = retriever.run(text=query, query=query)

# Print the results
print(f"Found {len(result['documents'])} documents")
for i, doc in enumerate(result["documents"][:3]):  # Show first 3 documents
    print(f"\nDocument {i+1} (Score: {doc.score:.4f}):")
    print(doc.content[:200] + "...")
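The HybridRetriever above joins its two result lists with join_mode="reciprocal_rank_fusion". As a rough illustration of the idea, here is a sketch of the textbook reciprocal rank fusion formula in plain Python. It assumes the commonly used constant k=60 and made-up document IDs; the DocumentJoiner may use a different constant, weighting, or normalization, so treat this as a conceptual aid rather than a description of Haystack’s code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]       # hypothetical BM25 order
embedding_ranking = ["doc_b", "doc_d", "doc_a"]  # hypothetical embedding order

fused = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents that appear near the top of both lists (doc_b and doc_a here) end up ahead of documents that only one retriever found, which is why this join mode suits hybrid retrieval.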
How the HybridRetriever Works
Let’s break down what’s happening in our HybridRetriever SuperComponent:
- We define a class decorated with @super_component
- In the __init__ method, we:
  - Create all the components we need (embedding retriever, BM25 retriever, etc.)
  - Create a Pipeline and add all components to it
  - Connect the components to define the flow of data
- The @super_component decorator handles all the complexity of making our class work as a component
If we define an input mapping like {"query": ["text_embedder.text", "bm25_retriever.query"]}, we can call retriever.run(query=query), and the query will automatically be routed to both the text embedder’s text input and the BM25 retriever’s query input.
You can also specify how the outputs should be exposed through output_mapping. For example, output mapping {"document_joiner.documents": "documents"} means that the documents produced by the document_joiner will be returned under the name documents when you call retriever.run(...).
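The routing described by these two mappings can be pictured with a small pure-Python sketch. The helpers route_inputs and rename_outputs are hypothetical names written for this illustration, and the pipeline result is faked with made-up values; Haystack’s own mapping logic is more involved (for example, it validates socket names and types).

```python
def route_inputs(input_mapping, **kwargs):
    """Fan each wrapper-level input out to the 'component.socket' targets it maps to."""
    pipeline_inputs = {}
    for name, value in kwargs.items():
        for target in input_mapping[name]:
            component, socket = target.split(".", 1)
            pipeline_inputs.setdefault(component, {})[socket] = value
    return pipeline_inputs


def rename_outputs(output_mapping, pipeline_outputs):
    """Pick 'component.socket' outputs and expose them under wrapper-level names."""
    result = {}
    for source, exposed_name in output_mapping.items():
        component, socket = source.split(".", 1)
        if component in pipeline_outputs and socket in pipeline_outputs[component]:
            result[exposed_name] = pipeline_outputs[component][socket]
    return result


input_mapping = {"query": ["text_embedder.text", "bm25_retriever.query"]}
output_mapping = {"document_joiner.documents": "documents"}

routed = route_inputs(input_mapping, query="chronic bronchitis")
print(routed)

# Pretend the pipeline ran and produced this nested result (made-up values).
fake_pipeline_outputs = {"document_joiner": {"documents": ["doc_1", "doc_2"]}}
exposed = rename_outputs(output_mapping, fake_pipeline_outputs)
print(exposed)
```

A single query argument becomes two component-level inputs, and the nested pipeline result is flattened into the simple dictionary that the caller of run sees.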
2. A HybridRetriever with Re-Ranking and Custom ‘input_mapping’
Now, let’s enhance our HybridRetriever by adding a ranker component. This will re-rank the documents based on their semantic similarity to the query, potentially improving the quality of the results. We also define a custom input_mapping.
from haystack import Document, Pipeline, super_component
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from datasets import load_dataset


@super_component
class HybridRetrieverWithRanker:
    def __init__(
        self,
        document_store: InMemoryDocumentStore,
        embedder_model: str = "BAAI/bge-small-en-v1.5",
        ranker_model: str = "BAAI/bge-reranker-base",
    ):
        # Create the components
        embedding_retriever = InMemoryEmbeddingRetriever(document_store)
        bm25_retriever = InMemoryBM25Retriever(document_store)
        text_embedder = SentenceTransformersTextEmbedder(embedder_model)
        document_joiner = DocumentJoiner()
        ranker = TransformersSimilarityRanker(ranker_model)

        # Create the pipeline
        self.pipeline = Pipeline()
        self.pipeline.add_component("text_embedder", text_embedder)
        self.pipeline.add_component("embedding_retriever", embedding_retriever)
        self.pipeline.add_component("bm25_retriever", bm25_retriever)
        self.pipeline.add_component("document_joiner", document_joiner)
        self.pipeline.add_component("ranker", ranker)

        # Connect the components
        self.pipeline.connect("text_embedder", "embedding_retriever")
        self.pipeline.connect("bm25_retriever", "document_joiner")
        self.pipeline.connect("embedding_retriever", "document_joiner")
        self.pipeline.connect("document_joiner", "ranker")

        # Define input mapping
        self.input_mapping = {"query": ["text_embedder.text", "bm25_retriever.query", "ranker.query"]}
# Create and run the HybridRetrieverWithRanker
retriever = HybridRetrieverWithRanker(document_store)
result = retriever.run(query=query)  # instead of retriever.run(text=query, query=query) thanks to input_mapping

# Print the results
print(f"Found {len(result['documents'])} documents")
for i, doc in enumerate(result["documents"][:3]):  # Show first 3 documents
    print(f"\nDocument {i+1} (Score: {doc.score:.4f}):")
    print(doc.content[:200] + "...")
Comparing the Two Retrievers
The main differences between the two retrievers are:
- Added Ranker Component: The second version includes a TransformersSimilarityRanker that re-ranks the documents based on their semantic similarity to the query.
- Custom Input Mapping: We mapped a single query input to "text_embedder.text", "bm25_retriever.query", and "ranker.query" so that the query reaches all three components and the retriever.run call needs only one argument.
The ranker can significantly improve the quality of the results by re-ranking the documents based on their semantic similarity to the query, even if they were not ranked highly by the initial retrievers.
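To make the effect of a second-stage ranker concrete, here is a toy sketch in plain Python. The toy_relevance function is a deliberately crude stand-in (word overlap) for the cross-encoder model that TransformersSimilarityRanker actually runs, and the candidate texts are made up; only the re-score-then-sort pattern carries over.

```python
def toy_relevance(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words present in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


def rerank(query, documents):
    """Re-score every candidate against the query and sort by the new score."""
    scored = [(toy_relevance(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


toy_query = "treatments for chronic bronchitis"
candidates = [  # made-up texts, in the order a first-stage retriever might return them
    "bronchitis is an inflammation of the airways",
    "asthma inhalers and their side effects",
    "common treatments for chronic bronchitis include bronchodilators",
]

reranked = rerank(toy_query, candidates)
for score, doc in reranked:
    print(f"{score:.2f}  {doc}")
```

The third candidate moves to the top after re-scoring, mirroring how a ranker can promote a document that the initial retrievers placed lower.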
3. Serialization and Deserialization of SuperComponents
One of the key benefits of using the @super_component decorator is that it automatically adds serialization and deserialization capabilities to your component. This means you can easily save and load your SuperComponents using the standard Haystack serialization functions.
Let’s see how this works with DocumentPreprocessor, one of Haystack’s ready-made SuperComponents:
from haystack import Document
from haystack.core.serialization import component_to_dict, component_from_dict
from haystack.components.preprocessors import DocumentPreprocessor

# Create an instance of the SuperComponent
preprocessor = DocumentPreprocessor()

# Serialize the component to a dictionary
serialized = component_to_dict(preprocessor, "document_preprocessor")
print("Serialized component:")
print(serialized)

# Deserialize the component from the dictionary
deserialized = component_from_dict(DocumentPreprocessor, serialized, "document_preprocessor")
print("\nDeserialized component:")
print(deserialized)

# Verify that the deserialized component works
doc = Document(content="I love pizza!")
result = deserialized.run(documents=[doc])
print(f"\nDeserialized component produced {len(result['documents'])} documents")
The serialization and deserialization process works seamlessly with SuperComponents because the @super_component decorator automatically adds the necessary functionality. This is particularly useful when you want to:
- Save and load pipelines: You can save your entire pipeline, including SuperComponents, to a file and load it later.
- Deploy components: You can deploy your SuperComponents to a server or cloud environment.
- Share components: You can share your SuperComponents with others, who can then load and use them in their own pipelines.
The serialization process captures all the initialization parameters of your SuperComponent, ensuring that when it’s deserialized, it’s recreated with the same configuration.
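The idea of capturing initialization parameters can be sketched without Haystack. The class below is a hypothetical toy (ToyRetrieverConfig and its top_k parameter are invented for this example), and the dictionary layout is only loosely modeled on the serialized output printed above; it shows the round trip, not the library’s exact format.

```python
import json


class ToyRetrieverConfig:
    """Hypothetical component whose state is fully determined by its init parameters."""

    def __init__(self, embedder_model="BAAI/bge-small-en-v1.5", top_k=10):
        self.embedder_model = embedder_model
        self.top_k = top_k

    def to_dict(self):
        # Record the class name and the arguments needed to rebuild the object.
        return {
            "type": type(self).__name__,
            "init_parameters": {"embedder_model": self.embedder_model, "top_k": self.top_k},
        }

    @classmethod
    def from_dict(cls, data):
        # Rebuild the object by calling __init__ with the stored arguments.
        return cls(**data["init_parameters"])


original = ToyRetrieverConfig(top_k=3)
payload = json.dumps(original.to_dict())  # this string could be written to a file
restored = ToyRetrieverConfig.from_dict(json.loads(payload))
print(restored.embedder_model, restored.top_k)
```

Because only the constructor arguments are stored, the restored object is rebuilt by running __init__ again, which is why a SuperComponent’s pipeline does not need to be saved separately in this simplified picture.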
4. Creating a SuperComponent with Outputs from Non-Leaf Components
One of the powerful features of SuperComponents is the ability to expose outputs from any component in the pipeline, not just the leaf components. Here, leaf components are components that do not send any outputs to other components in the pipeline. Let’s create a SuperComponent that demonstrates this capability.
from haystack import Document, Pipeline, super_component
from haystack.components.joiners import DocumentJoiner
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.document_stores.in_memory import InMemoryDocumentStore


@super_component
class AdvancedHybridRetriever:
    def __init__(
        self,
        document_store: InMemoryDocumentStore,
        embedder_model: str = "BAAI/bge-small-en-v1.5",
        ranker_model: str = "BAAI/bge-reranker-base",
    ):
        # Create the components
        embedding_retriever = InMemoryEmbeddingRetriever(document_store)
        bm25_retriever = InMemoryBM25Retriever(document_store)
        text_embedder = SentenceTransformersTextEmbedder(embedder_model)
        document_joiner = DocumentJoiner()
        ranker = TransformersSimilarityRanker(ranker_model)

        # Create the pipeline
        self.pipeline = Pipeline()
        self.pipeline.add_component("text_embedder", text_embedder)
        self.pipeline.add_component("embedding_retriever", embedding_retriever)
        self.pipeline.add_component("bm25_retriever", bm25_retriever)
        self.pipeline.add_component("document_joiner", document_joiner)
        self.pipeline.add_component("ranker", ranker)

        # Connect the components
        self.pipeline.connect("text_embedder", "embedding_retriever")
        self.pipeline.connect("bm25_retriever", "document_joiner")
        self.pipeline.connect("embedding_retriever", "document_joiner")
        self.pipeline.connect("document_joiner", "ranker")

        # Define input and output mappings
        self.input_mapping = {"query": ["text_embedder.text", "bm25_retriever.query", "ranker.query"]}

        # Expose outputs from multiple components, including non-leaf components
        self.output_mapping = {
            "bm25_retriever.documents": "bm25_documents",
            "embedding_retriever.documents": "embedding_documents",
            "document_joiner.documents": "joined_documents",
            "ranker.documents": "ranked_documents",
            "text_embedder.embedding": "query_embedding",
        }
# Create and run the AdvancedHybridRetriever
retriever = AdvancedHybridRetriever(document_store)
result = retriever.run(query=query)

# Print the results
print(f"BM25 documents: {len(result['bm25_documents'])}")
print(f"Embedding documents: {len(result['embedding_documents'])}")
print(f"Joined documents: {len(result['joined_documents'])}")
print(f"Ranked documents: {len(result['ranked_documents'])}")
print(f"Query embedding dimension: {len(result['query_embedding'])}")

# Compare the top document from each stage
print("\nTop BM25 document:")
print(result["bm25_documents"][0].content[:200] + "...")
print(f"Score: {result['bm25_documents'][0].score:.4f}")

print("\nTop embedding document:")
print(result["embedding_documents"][0].content[:200] + "...")
print(f"Score: {result['embedding_documents'][0].score:.4f}")

print("\nTop ranked document:")
print(result["ranked_documents"][0].content[:200] + "...")
print(f"Score: {result['ranked_documents'][0].score:.4f}")
Understanding Outputs from Non-Leaf Components
In this example, we’ve created a SuperComponent that exposes outputs from multiple components in the pipeline, including non-leaf components:
- BM25 Documents: Documents retrieved by the BM25 retriever
- Embedding Documents: Documents retrieved by the embedding retriever
- Joined Documents: Documents after joining the results from both retrievers
- Ranked Documents: Documents after re-ranking
- Query Embedding: The embedding of the query
This demonstrates how the @super_component decorator allows you to expose outputs from any component in the pipeline, not just the leaf components. This is particularly useful for debugging, analysis, or when you want to provide more detailed information to the user.
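To see which components in this pipeline count as leaf components, the connections from AdvancedHybridRetriever can be written down as sender/receiver pairs and checked with a few lines of plain Python. This is only an illustrative sketch over a hand-copied edge list; it does not inspect a real Pipeline object.

```python
# (sender, receiver) pairs copied by hand from the connect() calls above.
connections = [
    ("text_embedder", "embedding_retriever"),
    ("bm25_retriever", "document_joiner"),
    ("embedding_retriever", "document_joiner"),
    ("document_joiner", "ranker"),
]

components = sorted({name for edge in connections for name in edge})
senders = {sender for sender, _ in connections}

# A leaf component sends no output to any other component.
leaf = [name for name in components if name not in senders]
non_leaf = [name for name in components if name in senders]

print("leaf:", leaf)
print("non-leaf:", non_leaf)
```

Only the ranker is a leaf here, so four of the five outputs exposed in the output mapping (everything except ranked_documents) come from non-leaf components.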
Ready-Made SuperComponents in Haystack
Haystack provides several ready-made SuperComponents that you can use in your applications, for example:
- MultiFileConverter: A SuperComponent that can convert various file types to documents.
- DocumentPreprocessor: A SuperComponent that combines document cleaning and splitting.
These SuperComponents provide a convenient way to use common pipelines without having to build them from scratch.
Conclusion
In this tutorial, you’ve learned how to create custom SuperComponents using the @super_component decorator. You’ve seen how to:
- Create a simple HybridRetriever SuperComponent
- Enhance it with a ranker and custom input mapping
- Serialize and deserialize a SuperComponent with Haystack’s standard serialization functions
- Create a SuperComponent that exposes outputs from non-leaf components
SuperComponents are a powerful way to encapsulate complex pipelines into reusable components with simplified interfaces. They make it easy to create higher-level components that abstract away the details of the underlying pipeline.
If you liked this tutorial, there’s more to learn about Haystack:
- Building an Agentic RAG with Fallback to Websearch
- Build a Tool-Calling Agent
- Evaluating RAG Pipelines
To stay up to date on the latest Haystack developments, you can sign up for our newsletter or join the Haystack Discord community.