Tutorial: Creating Vision+Text RAG Pipelines
Last Updated: August 4, 2025
- Level: Intermediate
- Time to complete: 20 minutes
- Components Used: SentenceTransformersDocumentImageEmbedder, ImageFileToDocument, DocumentToImageContent, DocumentTypeRouter, LLMDocumentContentExtractor
- Prerequisites: You need an OpenAI API Key
- Goal: After completing this tutorial, you’ll have learned how to index and retrieve images using Haystack and build a RAG pipeline that can answer questions grounded in both images and text.
Overview
In this notebook, you’ll learn how to index and retrieve images using Haystack. By the end, you’ll be able to build a Retrieval-Augmented Generation (RAG) pipeline that can answer questions grounded in both images and text. This is useful when working with datasets like scientific papers, diagrams, or screenshots where meaning is spread across modalities.
This tutorial uses the following new components that enable image indexing:
- SentenceTransformersDocumentImageEmbedder: embeds image documents with CLIP-based models
- ImageFileToDocument: converts image files into Haystack Documents
- DocumentTypeRouter: routes retrieved documents by MIME type (e.g., image vs. text)
- DocumentToImageContent: converts image documents into ImageContent objects to be processed by our ChatGenerator
- LLMDocumentContentExtractor: extracts textual content from image-based documents using a vision-enabled LLM
In this notebook, we'll introduce all of these components and show an application that combines image + text retrieval with multimodal generation.
Setup Development Environment
First, let’s install required packages:
%%bash
pip install -q "haystack-ai>=2.16.0" pillow pypdf pypdfium2 "sentence-transformers>=4.1.0"
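To check the installation and get oriented, you can import the new components introduced above. These are the same import paths used throughout the rest of this tutorial:
from haystack.components.converters.image import ImageFileToDocument, DocumentToImageContent
from haystack.components.embedders.image import SentenceTransformersDocumentImageEmbedder
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.routers import DocumentTypeRouter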
Enter your API key for OpenAI:
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Introduction to Embedding Images
Let’s compare the similarity between a text and two images.
First, let's download two sample images: one of an apple and one of a capybara.
from urllib.request import urlretrieve
urlretrieve("https://upload.wikimedia.org/wikipedia/commons/2/26/Pink_Lady_Apple_%284107712628%29.jpg", "apple.jpg")
urlretrieve("https://upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg/960px-Cattle_tyrant_%28Machetornis_rixosa%29_on_Capybara.jpg?download", "capybara.jpg")
from PIL import Image
img = Image.open("apple.jpg")
# We resize the image here just to avoid it taking up too much space in the notebook
img_resized = img.resize((img.width // 6, img.height // 6))
img_resized
Next, we convert our image files into Haystack Documents so they can be used downstream in our SentenceTransformersDocumentImageEmbedder component.
from haystack.components.converters.image import ImageFileToDocument
image_file_converter = ImageFileToDocument()
image_docs = image_file_converter.run(sources=["apple.jpg", "capybara.jpg"])["documents"]
print(image_docs)
Next, we load our embedders with the sentence-transformers/clip-ViT-L-14 model, which maps text and images to a shared vector space. It's important to use the same CLIP model for both text and images so we can calculate the similarity between them.
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.embedders.image import SentenceTransformersDocumentImageEmbedder
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/clip-ViT-L-14", progress_bar=False)
image_embedder = SentenceTransformersDocumentImageEmbedder(
    model="sentence-transformers/clip-ViT-L-14", progress_bar=False
)
# Warm up the models to load them
text_embedder.warm_up()
image_embedder.warm_up()
Let's run the embedders to create vector embeddings for the query and the images, and see how semantically similar our query is to each of the two images.
import torch
from sentence_transformers import util
query = "A red apple on a white background"
text_embedding = text_embedder.run(text=query)["embedding"]
image_docs_with_embeddings = image_embedder.run(image_docs)["documents"]
# Compare the similarities between the query and two image documents
for doc in image_docs_with_embeddings:
    similarity = util.cos_sim(torch.tensor(text_embedding), torch.tensor(doc.embedding))
    print(f"Similarity with {doc.meta['file_path'].split('/')[-1]}: {similarity.item():.2f}")
As we can see, the text is most similar to our apple image, as expected! This shows that the CLIP model creates comparable representations for images and text in a shared vector space.
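Under the hood, util.cos_sim computes the cosine of the angle between two vectors, i.e. the normalized dot product. As a quick sanity check, here is a minimal sketch that recomputes the same scores with NumPy, using the embeddings created above:
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity: dot product divided by the product of the vector norms
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for doc in image_docs_with_embeddings:
    score = cosine_similarity(text_embedding, doc.embedding)
    print(f"{doc.meta['file_path'].split('/')[-1]}: {score:.2f}")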
Multimodal Retrieval with Image and Text Embeddings
In this approach, we'll use the sentence-transformers/clip-ViT-L-14 model to create embeddings for both images and text, and perform retrieval using these embeddings.
First, let's download a sample PDF file to see how we can retrieve over both text-based and image-based documents.
from urllib.request import urlretrieve
urlretrieve("https://arxiv.org/pdf/1706.03762.pdf", "attention_is_all_you_need.pdf")
Building an Image + Text Indexing Pipeline
Let’s create an indexing pipeline to process our image and PDF files at once and write them to our Document Store.
So, in the following Pipeline, we are:
- computing embeddings based on images for image files
- converting PDF files to textual Documents and then computing embeddings based on the text
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.converters.image import ImageFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.embedders.image import SentenceTransformersDocumentImageEmbedder
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack.components.routers.file_type_router import FileTypeRouter
from haystack.components.writers.document_writer import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
# Create our document store
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
# Define our components
file_type_router = FileTypeRouter(mime_types=["application/pdf", "image/jpeg"])
final_doc_joiner = DocumentJoiner(sort_by_score=False)
image_converter = ImageFileToDocument()
pdf_converter = PyPDFToDocument()
pdf_splitter = DocumentSplitter(split_by="page", split_length=1)
text_doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/clip-ViT-L-14", progress_bar=False
)
image_embedder = SentenceTransformersDocumentImageEmbedder(
    model="sentence-transformers/clip-ViT-L-14", progress_bar=False
)
document_writer = DocumentWriter(doc_store)
# Create the Indexing Pipeline
indexing_pipe = Pipeline()
indexing_pipe.add_component("file_type_router", file_type_router)
indexing_pipe.add_component("pdf_converter", pdf_converter)
indexing_pipe.add_component("pdf_splitter", pdf_splitter)
indexing_pipe.add_component("image_converter", image_converter)
indexing_pipe.add_component("text_doc_embedder", text_doc_embedder)
indexing_pipe.add_component("image_doc_embedder", image_embedder)
indexing_pipe.add_component("final_doc_joiner", final_doc_joiner)
indexing_pipe.add_component("document_writer", document_writer)
indexing_pipe.connect("file_type_router.application/pdf", "pdf_converter.sources")
indexing_pipe.connect("pdf_converter.documents", "pdf_splitter.documents")
indexing_pipe.connect("pdf_splitter.documents", "text_doc_embedder.documents")
indexing_pipe.connect("file_type_router.image/jpeg", "image_converter.sources")
indexing_pipe.connect("image_converter.documents", "image_doc_embedder.documents")
indexing_pipe.connect("text_doc_embedder.documents", "final_doc_joiner.documents")
indexing_pipe.connect("image_doc_embedder.documents", "final_doc_joiner.documents")
indexing_pipe.connect("final_doc_joiner.documents", "document_writer.documents")
Visualize the Indexing pipeline
# indexing_pipe.show()
Run the indexing pipeline with a PDF and an image file.
indexing_result = indexing_pipe.run(
    data={"file_type_router": {"sources": ["attention_is_all_you_need.pdf", "apple.jpg"]}}
)
Inspect the documents
indexed_documents = doc_store.filter_documents()
print(f"Indexed {len(indexed_documents)} documents")
Searching Image + Text
Let’s now set up our search and retrieve relevant data from our document store by passing a query.
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
text_embedding = text_embedder.run(text="An image of an apple")["embedding"]
results = retriever.run(text_embedding)["documents"]
for idx, doc in enumerate(results[:5]):
    print(f"Document {idx+1}:")
    print(f"Score: {doc.score}")
    print(f"File Path: {doc.meta['file_path']}")
    print("")
Huh, how odd! It doesn't seem like any of the top results are relevant. In fact, the top retrieved documents are all text-based, no matter how irrelevant they are.
This is a common scenario when running multimodal retrieval over images and text at the same time. Often, the underlying embedding model (CLIP in this case) is not trained to handle text and image documents together and ends up biased towards one type. In this case, the model we chose appears to be biased towards text-to-text similarities, which we can observe in the scores attached to each Document.
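To see this bias in the numbers, you can retrieve over every indexed document and compare the scores by modality. Here is a quick sketch using the retriever and query embedding from above; we identify the image document simply by its .jpg file extension:
# Retrieve over all indexed documents so the single image document is included in the results
all_results = retriever.run(query_embedding=text_embedding, top_k=len(doc_store.filter_documents()))["documents"]

image_scores = [d.score for d in all_results if d.meta["file_path"].endswith(".jpg")]
text_scores = [d.score for d in all_results if d.meta["file_path"].endswith(".pdf")]

print(f"Text documents:  {len(text_scores)} retrieved, best score {max(text_scores):.3f}")
print(f"Image documents: {len(image_scores)} retrieved, best score {max(image_scores):.3f}")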
Side Note: It's possible that more recent models like jina-embeddings-v4 or Cohere Embed 4 would work better for this type of scenario.
To combat this, let’s use a slightly different approach below.
Multimodal Retrieval with (Only) Text Embeddings
In this approach, we will use the LLMDocumentContentExtractor to first extract a textual representation of each image before writing it to our DocumentStore (see the standalone sketch after this list).
- This allows us to use text-only retrieval methods when searching through our DocumentStore.
- We still send the actual image to the vision LLM. This is helpful since the image may contain more information and nuance than the extracted text.
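If you'd like to see what the extractor produces before wiring it into a full pipeline, here is a minimal standalone sketch. It assumes the same gpt-4o-mini chat generator that the pipeline below uses:
from haystack.components.converters.image import ImageFileToDocument
from haystack.components.extractors.image import LLMDocumentContentExtractor
from haystack.components.generators.chat import OpenAIChatGenerator

# Convert the image file into a Document, then let the vision LLM describe its content
docs = ImageFileToDocument().run(sources=["apple.jpg"])["documents"]
extractor = LLMDocumentContentExtractor(chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"))
if hasattr(extractor, "warm_up"):
    extractor.warm_up()  # warm up if the component requires it in your Haystack version
extracted = extractor.run(documents=docs)["documents"]
print(extracted[0].content)  # the extracted textual description of the image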
Building an Image + Text Indexing Pipeline using the LLMDocumentContentExtractor
This time, in the following Pipeline, we are:
- extracting the textual representation of images with LLMDocumentContentExtractor
- converting PDF files to textual Documents
- creating text embeddings for both the PDF and image files' text content using mixedbread-ai/mxbai-embed-large-v1
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.converters.image import ImageFileToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.embedders.image import SentenceTransformersDocumentImageEmbedder
from haystack.components.joiners import DocumentJoiner
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack.components.routers.file_type_router import FileTypeRouter
from haystack.components.writers.document_writer import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.extractors.image import LLMDocumentContentExtractor
# Create our document store
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
# Define our components
file_type_router = FileTypeRouter(mime_types=["application/pdf", "image/jpeg"])
image_converter = ImageFileToDocument()
pdf_converter = PyPDFToDocument()
pdf_splitter = DocumentSplitter(split_by="page", split_length=1)
final_doc_joiner = DocumentJoiner(sort_by_score=False)
document_writer = DocumentWriter(doc_store)
# Now we use high-performing text-only embedders
doc_embedder = SentenceTransformersDocumentEmbedder(model="mixedbread-ai/mxbai-embed-large-v1", progress_bar=False)
# New LLMDocumentContentExtractor
llm_content_extractor = LLMDocumentContentExtractor(
    chat_generator=OpenAIChatGenerator(model="gpt-4o-mini"),  # you can replace this with other chat generators that support vision
    max_workers=1,  # this can be used to parallelize the content extraction
)
# Create the Indexing Pipeline
indexing_pipe = Pipeline()
indexing_pipe.add_component("file_type_router", file_type_router)
indexing_pipe.add_component("pdf_converter", pdf_converter)
indexing_pipe.add_component("pdf_splitter", pdf_splitter)
indexing_pipe.add_component("image_converter", image_converter)
indexing_pipe.add_component("llm_content_extractor", llm_content_extractor)
indexing_pipe.add_component("doc_embedder", doc_embedder)
indexing_pipe.add_component("final_doc_joiner", final_doc_joiner)
indexing_pipe.add_component("document_writer", document_writer)
indexing_pipe.connect("file_type_router.application/pdf", "pdf_converter.sources")
indexing_pipe.connect("pdf_converter.documents", "pdf_splitter.documents")
indexing_pipe.connect("pdf_splitter.documents", "final_doc_joiner.documents")
indexing_pipe.connect("file_type_router.image/jpeg", "image_converter.sources")
indexing_pipe.connect("image_converter.documents", "llm_content_extractor.documents")
indexing_pipe.connect("llm_content_extractor.documents", "final_doc_joiner.documents")
indexing_pipe.connect("final_doc_joiner.documents", "doc_embedder.documents")
indexing_pipe.connect("doc_embedder.documents", "document_writer.documents")
# indexing_pipe.show()
indexing_result = indexing_pipe.run(
    data={"file_type_router": {"sources": ["attention_is_all_you_need.pdf", "apple.jpg"]}}
)
indexed_documents = doc_store.filter_documents()
print(f"Indexed {len(indexed_documents)} documents")
Let’s inspect our image document to see what content was extracted
image_doc = [d for d in indexed_documents if d.meta.get("file_path") == "apple.jpg"]
image_doc
Nice, we have a caption for our image of an apple!
Now, let’s run the retrieval with the same query and see what we get.
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.embedders import SentenceTransformersTextEmbedder
retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
text_embedder = SentenceTransformersTextEmbedder(model="mixedbread-ai/mxbai-embed-large-v1", progress_bar=False)
text_embedder.warm_up()
text_embedding = text_embedder.run(text="An image of an apple")["embedding"]
results = retriever.run(text_embedding)["documents"]
for idx, doc in enumerate(results[:5]):
    print(f"Document {idx+1}:")
    print(f"Score: {doc.score}")
    print(f"File Path: {doc.meta['file_path']}")
    print("")
And now we can see that the document representing the apple.jpg file was retrieved first! We can use this approach to retrieve the image document at query time and still use the actual image to answer the user's question.
Multimodal RAG on Image + Text
In this section, we demonstrate a multimodal RAG pipeline that retrieves based on textual image captions, but uses the original images during generation. This allows us to combine the strengths of both modalities: fast and effective retrieval using a text-based index, and rich, grounded generation using visual inputs.
Specifically:
- During indexing, we use LLMDocumentContentExtractor to extract a caption from each image, which serves as its searchable text representation.
- At query time, these captions are embedded and used to retrieve relevant documents.
- Retrieved image documents are then passed through DocumentToImageContent, which converts them to base64 strings and packages them as ImageContent objects.
- These image objects are rendered directly in the prompt and sent to a vision-capable language model like gpt-4o-mini.
One thing to notice here: this time, instead of passing a list of ChatMessage objects to ChatPromptBuilder, we define the roles directly within the prompt using {%- message role="system" -%} and {%- message role="user" -%}. We then render the base64 strings using the templatize_part utility.
This approach makes it possible to retrieve both images and text with text, but still leverage the full detail of the image in the final answer.
from haystack import Pipeline
from haystack.components.embedders.sentence_transformers_text_embedder import SentenceTransformersTextEmbedder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.retrievers.in_memory.embedding_retriever import InMemoryEmbeddingRetriever
from haystack.components.builders import ChatPromptBuilder
from haystack.components.converters.image import DocumentToImageContent
from haystack.components.routers import DocumentTypeRouter
text_embedder = SentenceTransformersTextEmbedder(model="mixedbread-ai/mxbai-embed-large-v1", progress_bar=False)
retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3)
doc_type_router = DocumentTypeRouter(file_path_meta_field="file_path", mime_types=["image/jpeg", "application/pdf"])
doc_to_image = DocumentToImageContent(detail="auto")
chat_prompt_builder = ChatPromptBuilder(
    required_variables=["question"],
    template="""{% message role="system" %}
You are a friendly assistant that answers questions based on provided documents and images.
{% endmessage %}
{%- message role="user" -%}
Only provide an answer to the question using the images and text passages provided.
These are the text-only documents:
{%- if documents|length > 0 %}
{%- for doc in documents %}
Text Document [{{ loop.index }}] :
{{ doc.content }}
{% endfor -%}
{%- else %}
No relevant text documents were found.
{% endif %}
End of text documents.
Question: {{ question }}
Answer:
Images:
{%- if image_contents|length > 0 %}
{%- for img in image_contents -%}
{{ img | templatize_part }}
{%- endfor -%}
{% endif %}
{%- endmessage -%}
""")
llm = OpenAIChatGenerator(model="gpt-4o-mini")
# Create the Query Pipeline
pipe = Pipeline()
pipe.add_component("text_embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("doc_type_router", doc_type_router)
pipe.add_component("doc_to_image", doc_to_image)
pipe.add_component("chat_prompt_builder", chat_prompt_builder)
pipe.add_component("llm", llm)
pipe.connect("text_embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever.documents", "doc_type_router.documents")
pipe.connect("doc_type_router.image/jpeg", "doc_to_image.documents")
pipe.connect("doc_to_image.image_contents", "chat_prompt_builder.image_contents")
pipe.connect("doc_type_router.application/pdf", "chat_prompt_builder.documents")
pipe.connect("chat_prompt_builder.prompt", "llm.messages")
# pipe.show()
When we send a query to our pipeline now, we'll receive a response based on the apple image. The retriever fetches the relevant data, and a vision-capable language model like gpt-4o-mini generates the response using the base64-encoded image.
# Run the pipeline with a query about the apple
query = "What is the color of the background of the image with an apple in it?"
result = pipe.run(
    data={"text_embedder": {"text": query}, "chat_prompt_builder": {"question": query}}
)
print(result["llm"]["replies"][0].text)
# Run the pipeline with a query about the pdf document
query = "What is attention in the transformers architecture?"
result = pipe.run(data={"text_embedder": {"text": query}, "chat_prompt_builder": {"question": query}})
print(result["llm"]["replies"][0].text)
What’s next
Congrats! You just built a multimodal RAG system with Haystack and tried out different ways to retrieve both image and text data.
You can follow the progress of the multimodal features in this GitHub issue.
Curious to keep exploring? Here are a few great next steps:
- Creating a Multi-Agent System with Haystack
- Building an Agentic RAG with Fallback to Websearch
- AI Guardrails: Content Moderation and Safety with Open Language Models
To stay up to date on the latest Haystack developments, you can sign up for our newsletter or join the Haystack Discord community.
(Notebook by Sebastian Husch Lee)