
Chroma Indexing and RAG Examples


Install dependencies

# Install the Chroma integration; Haystack is pulled in as a dependency
!pip install -U chroma-haystack "huggingface_hub>=0.22.0"

Indexing Pipeline: preprocess, split and index documents

In this section, we will index documents into a Chroma DB collection by building a Haystack indexing pipeline. Here, we are indexing documents from the VIM User Manual into the Haystack ChromaDocumentStore.

We have the .txt files for these pages in the examples folder of the ChromaDocumentStore, so we use the TextFileToDocument and DocumentWriter components to build this indexing pipeline.

# Fetch data files from the Github repo
!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar
!mkdir main
!tar xf main.tar -C main --strip-components 1
!mv main/integrations/chroma/example/data .
import os
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter

from haystack_integrations.document_stores.chroma import ChromaDocumentStore

file_paths = ["data" / Path(name) for name in os.listdir("data")]

# Chroma runs in-memory here, so the same store instance is shared by the two pipelines below
document_store = ChromaDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})
{'writer': {'documents_written': 36}}

Query Pipeline: build retrieval-augmented generation (RAG) pipelines

Once we have documents in the ChromaDocumentStore, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma’s query API.

You can adapt both the indexing and query pipelines for embedding-based search by pairing one of the Haystack Embedders with the ChromaEmbeddingRetriever.

In this example we are using:

  • The OpenAIChatGenerator with gpt-4o-mini. (You will need an OpenAI API key to use this model.) You can replace this with any of the other Generators
  • The ChatPromptBuilder, which holds the prompt template. You can adjust this to a prompt of your choice
  • The ChromaQueryTextRetriever, which expects a query and retrieves the top_k most relevant documents from your Chroma collection.
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Enter OpenAI API key: ········
from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses.chat_message import ChatMessage

prompt = """
Answer the query based on the provided context.
If the context does not contain the answer, say 'Answer not found'.
Context:
{% for doc in documents %}
  {{ doc.content }}
{% endfor %}
query: {{query}}
Answer:
"""

template = [ChatMessage.from_user(prompt)]
prompt_builder = ChatPromptBuilder(template=template)

llm = OpenAIChatGenerator()
retriever = ChromaQueryTextRetriever(document_store)

querying = Pipeline()
querying.add_component("retriever", retriever)
querying.add_component("prompt_builder", prompt_builder)
querying.add_component("llm", llm)

querying.connect("retriever.documents", "prompt_builder.documents")
querying.connect("prompt_builder", "llm")
ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.





<haystack.core.pipeline.pipeline.Pipeline object at 0x308f29880>
🚅 Components
  - retriever: ChromaQueryTextRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
🛤️ Connections
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])
query = "Should I write documentation for my plugin?"
results = querying.run({"retriever": {"query": query, "top_k": 3}, "prompt_builder": {"query": query}})
print(results["llm"]["replies"][0].text)
Yes, it is a good idea to write documentation for your plugin. This helps users understand how to use it, especially when its behavior can be changed by the user.