Chroma Indexing and RAG Examples
Last Updated: July 8, 2025
Install dependencies
# Install the Chroma integration, Haystack will come as a dependency
!pip install -U chroma-haystack "huggingface_hub>=0.22.0"
Indexing Pipeline: preprocess, split and index documents
In this section, we will index documents into a Chroma DB collection by building a Haystack indexing pipeline. Here, we are indexing documents from the
VIM User Manual into the Haystack
ChromaDocumentStore.
We have the .txt files for these pages in the examples folder for the ChromaDocumentStore, so we are using the
TextFileToDocument and
DocumentWriter components to build this indexing pipeline.
# Fetch data files from the GitHub repo
!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar
!mkdir main
!tar xf main.tar -C main --strip-components 1
!mv main/integrations/chroma/example/data .
import os
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
file_paths = ["data" / Path(name) for name in os.listdir("data")]
# Chroma is used in-memory so we use the same instances in the two pipelines below
document_store = ChromaDocumentStore()
indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})
{'writer': {'documents_written': 36}}
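As an aside, the `"data" / Path(name)` expression above is valid even though a plain string appears on the left: `pathlib.Path` implements the `/` operator from the right-hand side as well. A minimal stdlib sketch (the file name is made up for illustration):

```python
from pathlib import Path

# A string on the left still yields a Path, because Path
# implements the right-hand division operator (__rtruediv__).
p = "data" / Path("usr_01.txt")
print(p.as_posix())
```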
Query Pipeline: build retrieval-augmented generation (RAG) pipelines
Once we have documents in the ChromaDocumentStore, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma’s
query API.
You can switch both the indexing and query pipelines to embedding search by using one of the
Haystack Embedders together with the ChromaEmbeddingRetriever.
In this example we are using:
- The OpenAIChatGenerator with gpt-4o-mini. (You will need an OpenAI API key to use this model.) You can replace this with any of the other Generators.
- The ChatPromptBuilder, which holds the prompt template. You can adjust this to a prompt of your choice.
- The ChromaQueryTextRetriever, which expects a query and retrieves the top_k most relevant documents from your Chroma collection.
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Enter OpenAI API key: ········
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses.chat_message import ChatMessage
prompt = """
Answer the query based on the provided context.
If the context does not contain the answer, say 'Answer not found'.
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
query: {{query}}
Answer:
"""
template = [ChatMessage.from_user(prompt)]
prompt_builder = ChatPromptBuilder(template=template)
llm = OpenAIChatGenerator()
retriever = ChromaQueryTextRetriever(document_store)
querying = Pipeline()
querying.add_component("retriever", retriever)
querying.add_component("prompt_builder", prompt_builder)
querying.add_component("llm", llm)
querying.connect("retriever.documents", "prompt_builder.documents")
querying.connect("prompt_builder", "llm")
ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.
<haystack.core.pipeline.pipeline.Pipeline object at 0x308f29880>
🚅 Components
- retriever: ChromaQueryTextRetriever
- prompt_builder: ChatPromptBuilder
- llm: OpenAIChatGenerator
🛤️ Connections
- retriever.documents -> prompt_builder.documents (List[Document])
- prompt_builder.prompt -> llm.messages (List[ChatMessage])
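Under the hood, ChatPromptBuilder renders the Jinja template above with the retrieved documents and the query. As a rough illustration (this is not Haystack's actual rendering code, and the document contents are invented), the template's loop behaves like this pure-Python sketch:

```python
# Hypothetical document contents, standing in for retrieved Documents.
documents = [
    "Vim plugins can define their own help files.",
    "Use :helptags to index them.",
]
query = "Should I write documentation for my plugin?"

# Mimic the {% for doc in documents %} loop in the prompt template.
context = "".join(f"{content}\n" for content in documents)
prompt = (
    "Answer the query based on the provided context.\n"
    "If the context does not contain the answer, say 'Answer not found'.\n"
    f"Context:\n{context}"
    f"query: {query}\nAnswer:"
)
print(prompt)
```

The rendered string, one document per line inside the Context block, is what reaches the LLM as a single user message.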
query = "Should I write documentation for my plugin?"
results = querying.run({"retriever": {"query": query, "top_k": 3}, "prompt_builder": {"query": query}})
print(results["llm"]["replies"][0].text)
Yes, it is a good idea to write documentation for your plugin. This helps users understand how to use it, especially when its behavior can be changed by the user.
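To build intuition for what the retriever's top_k parameter does, here is a toy sketch of similarity-based ranking in plain Python. The vectors and document IDs are invented, and this is not Chroma's actual implementation; Chroma's query API performs this kind of nearest-neighbour lookup over real embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embeddings" for three documents.
docs = {"doc1": [1.0, 0.0], "doc2": [0.6, 0.8], "doc3": [0.0, 1.0]}
query_vec = [0.9, 0.1]

top_k = 2
ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:top_k]
print(ranked)  # the top_k document IDs most similar to the query
```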
