Chroma Indexing and RAG Examples
Last Updated: July 8, 2025
Install dependencies
# Install the Chroma integration, Haystack will come as a dependency
!pip install -U chroma-haystack "huggingface_hub>=0.22.0"
Indexing Pipeline: preprocess, split and index documents
In this section, we will index documents into a Chroma DB collection by building a Haystack indexing pipeline. Here, we are indexing documents from the Vim User Manual into the Haystack ChromaDocumentStore.
We have the .txt files for these pages in the examples folder for the ChromaDocumentStore, so we are using the TextFileToDocument and DocumentWriter components to build this indexing pipeline.
# Fetch data files from the Github repo
!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar
!mkdir main
!tar xf main.tar -C main --strip-components 1
!mv main/integrations/chroma/example/data .
import os
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
file_paths = ["data" / Path(name) for name in os.listdir("data")]
# Chroma is used in-memory so we use the same instances in the two pipelines below
document_store = ChromaDocumentStore()
indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})
{'writer': {'documents_written': 36}}
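The section heading above mentions preprocessing and splitting, but this minimal pipeline writes each .txt file as a single document. If you want to chunk the manual pages before indexing, you can place a DocumentSplitter between the converter and the writer. A minimal sketch, assuming Haystack's DocumentSplitter component (the chunk sizes here are illustrative, not a recommendation):
from haystack.components.preprocessors import DocumentSplitter

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
# Split each page into ~200-word chunks with a small overlap
indexing.add_component("splitter", DocumentSplitter(split_by="word", split_length=200, split_overlap=20))
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "splitter")
indexing.connect("splitter", "writer")
indexing.run({"converter": {"sources": file_paths}})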
Query Pipeline: build retrieval-augmented generation (RAG) pipelines
Once we have documents in the ChromaDocumentStore, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma’s query API.
You can change the indexing and query pipelines here for embedding search by using one of the Haystack Embedders accompanied by the ChromaEmbeddingRetriever.
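For example, a minimal sketch of the embedding-based variant, assuming the sentence-transformers embedders that ship with Haystack (the model name here is illustrative):
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack_integrations.components.retrievers.chroma import ChromaEmbeddingRetriever

# Indexing: embed each document before writing it to Chroma.
# The embedder expects Documents, e.g. the output of TextFileToDocument.
indexing = Pipeline()
indexing.add_component("embedder", SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("embedder", "writer")

# Querying: embed the query text, then retrieve by vector similarity
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
query_pipeline.add_component("retriever", ChromaEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")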
In this example we are using:
- The OpenAIChatGenerator with gpt-4o-mini. (You will need an OpenAI API key to use this model.) You can replace this with any of the other Generators.
- The ChatPromptBuilder which holds the prompt template. You can adjust this to a prompt of your choice.
- The ChromaQueryTextRetriever which expects a query and retrieves the top_k most relevant documents from your Chroma collection.
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Enter OpenAI API key: ········
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses.chat_message import ChatMessage
prompt = """
Answer the query based on the provided context.
If the context does not contain the answer, say 'Answer not found'.
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
query: {{query}}
Answer:
"""
template = [ChatMessage.from_user(prompt)]
prompt_builder = ChatPromptBuilder(template=template)
llm = OpenAIChatGenerator()
retriever = ChromaQueryTextRetriever(document_store)
querying = Pipeline()
querying.add_component("retriever", retriever)
querying.add_component("prompt_builder", prompt_builder)
querying.add_component("llm", llm)
querying.connect("retriever.documents", "prompt_builder.documents")
querying.connect("prompt_builder", "llm")
ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.
<haystack.core.pipeline.pipeline.Pipeline object at 0x308f29880>
🚅 Components
- retriever: ChromaQueryTextRetriever
- prompt_builder: ChatPromptBuilder
- llm: OpenAIChatGenerator
🛤️ Connections
- retriever.documents -> prompt_builder.documents (List[Document])
- prompt_builder.prompt -> llm.messages (List[ChatMessage])
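The warning above can be silenced by telling the builder which template variables are required, so missing inputs fail loudly instead of rendering empty. A minimal sketch using the required_variables argument the warning refers to:
prompt_builder = ChatPromptBuilder(template=template, required_variables=["query", "documents"])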
query = "Should I write documentation for my plugin?"
results = querying.run({"retriever": {"query": query, "top_k": 3}, "prompt_builder": {"query": query}})
print(results["llm"]["replies"][0].text)
Yes, it is a good idea to write documentation for your plugin. This helps users understand how to use it, especially when its behavior can be changed by the user.
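By default, Pipeline.run only returns the output of the final component. If you also want to inspect the documents the retriever handed to the prompt builder, recent Haystack 2.x releases let you request intermediate outputs with include_outputs_from:
results = querying.run(
    {"retriever": {"query": query, "top_k": 3}, "prompt_builder": {"query": query}},
    include_outputs_from={"retriever"},
)
for doc in results["retriever"]["documents"]:
    print(doc.id, doc.score)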