🪁 RAG pipelines with Haystack + Zephyr 7B Beta

Last Updated: March 10, 2025

Notebook by Stefano Fiorucci and Tuana Celik

We are going to build a Retrieval Augmented Generation pipeline for rock music, using the 🏗️ Haystack LLM orchestration framework and a good LLM: 💬 Zephyr 7B Beta, a fine-tuned version of Mistral 7B v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks.

Install dependencies

wikipedia is needed to download data from Wikipedia
haystack-ai is the Haystack package
sentence_transformers is needed for embeddings
transformers is needed to use open-source LLMs
accelerate and bitsandbytes are required to use quantized versions of these models (with smaller memory footprint)

%%capture
! pip install wikipedia haystack-ai transformers accelerate bitsandbytes sentence_transformers

from IPython.display import Image
from pprint import pprint
import torch
import rich
import random

Load data from Wikipedia

We are going to download the Wikipedia pages of some rock bands, using the Python library wikipedia.

Each page is then converted into a Haystack Document.

favourite_bands="""Audioslave
Blink-182
Dire Straits
Evanescence
Green Day
Muse (band)
Nirvana (band)
Sum 41
The Cure
The Smiths""".split("\n")

import wikipedia
from haystack.dataclasses import Document

raw_docs=[]

for title in favourite_bands:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url":page.url})
    raw_docs.append(doc)

The Indexing Pipeline

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import ComponentDevice

We will save our final Documents in an InMemoryDocumentStore, a simple database which lives in memory.

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

Our indexing Pipeline transforms the original Documents and saves them in the Document Store.

It consists of several components:

DocumentCleaner: performs a basic cleaning of the Documents
DocumentSplitter: chunks each Document into smaller pieces (more appropriate for semantic search and RAG)
SentenceTransformersDocumentEmbedder:
- represents each Document as a vector (capturing its meaning).
- we choose a good but not too big model from the MTEB leaderboard.
- the metadata title is also embedded, because it contains relevant information (meta_fields_to_embed parameter).
- we use the GPU for this expensive operation (device parameter).
DocumentWriter: saves the Documents in the Document Store
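
To make the splitting step more concrete, here is a minimal plain-Python sketch of grouping sentences into chunks of two, which is roughly what split_by='sentence' with split_length=2 asks for. The helper chunk_sentences and its naive regex sentence boundary are assumptions made for illustration; Haystack's DocumentSplitter uses its own splitting logic and may produce different boundaries.

```python
import re


def chunk_sentences(text, split_length=2):
    """Group sentences into chunks of `split_length` (hypothetical helper)."""
    # Naive boundary: split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        " ".join(sentences[i:i + split_length])
        for i in range(0, len(sentences), split_length)
    ]


text = (
    "Audioslave was an American rock supergroup. "
    "It formed in Glendale, California, in 2001. "
    "It was a four-piece band."
)
chunks = chunk_sentences(text)
for chunk in chunks:
    print(chunk)
```

Small chunks like these keep each embedding focused on one or two facts, which tends to help semantic search, at the cost of storing many more Documents.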

indexing = Pipeline()
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=2))
indexing.add_component("doc_embedder", SentenceTransformersDocumentEmbedder(model="thenlper/gte-large",
                                                                            device=ComponentDevice.from_str("cuda:0"), 
                                                                            meta_fields_to_embed=["title"]))
indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "doc_embedder")
indexing.connect("doc_embedder", "writer")

Let’s draw the indexing pipeline

indexing.draw("indexing.png")
Image(filename='indexing.png')

We finally run the indexing pipeline

indexing.run({"cleaner":{"documents":raw_docs}})

{'writer': {'documents_written': 1554}}

Let’s inspect the total number of chunked Documents and examine a Document

len(document_store.filter_documents())

document_store.filter_documents()[0].meta

{'title': 'Audioslave',
 'url': 'https://en.wikipedia.org/wiki/Audioslave',
 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}

pprint(document_store.filter_documents()[0])
print(len(document_store.filter_documents()[0].embedding)) # embedding size

Document(id=3ca9785f81fb9fb0700f794b1fd2355626824599ecbce435e6f5e3babb05facc, content: 'Audioslave was an American rock supergroup formed in Glendale, California, in 2001. The four-piece b...', meta: {'title': 'Audioslave', 'url': 'https://en.wikipedia.org/wiki/Audioslave', 'source_id': 'e3deff3d39ef107e8b0d69415ea61644b73175086cfbeee03d5f5d6946619fcf'}, embedding: vector of size 1024)
1024

The RAG Pipeline

`HuggingFaceLocalGenerator` with `zephyr-7b-beta`

To load and manage open-source LLMs in Haystack, we can use the HuggingFaceLocalGenerator.
The LLM we choose is Zephyr 7B Beta, a fine-tuned version of Mistral 7B v0.1 that focuses on helpfulness and outperforms many larger models on the MT-Bench and AlpacaEval benchmarks; the model was fine-tuned by the Hugging Face team.
Since we are using a free Colab instance (with limited resources), we load the model using 4-bit quantization (passing the appropriate huggingface_pipeline_kwargs to our Generator). For an introduction to quantization in Hugging Face Transformers, you can read this simple blog post.

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator("HuggingFaceH4/zephyr-7b-beta",
                                 huggingface_pipeline_kwargs={"device_map":"auto",
                                               "model_kwargs":{"load_in_4bit":True,
                                                "bnb_4bit_use_double_quant":True,
                                                "bnb_4bit_quant_type":"nf4",
                                                "bnb_4bit_compute_dtype":torch.bfloat16}},
                                 generation_kwargs={"max_new_tokens": 350})
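
As a rough back-of-the-envelope check of why 4-bit quantization matters on a small GPU, the sketch below estimates the memory taken by the weights alone. It assumes "7B" means about 7 billion parameters and ignores quantization constants, activations, the KV cache, and any layers kept in higher precision, so real usage will be somewhat higher. The function weight_memory_gb is a hypothetical helper, not part of Haystack or Transformers.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: "7B" taken as roughly 7 billion parameters

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```

Under these assumptions the quantized weights need about a quarter of the memory of the 16-bit ones, which is what makes the model fit on a free Colab GPU.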

Let’s warm up the component and try the model…

generator.warm_up()

# quick check
rich.print(generator.run("Please write a rhyme about Italy."))

{
    'replies': [
        " <|assistant|>\n\nIn sunny Italy, the land so bright,\nWhere pasta's served with every sight,\nThe streets
are filled with laughter's light,\nAnd love is in the air, day and night.\n\nThe Colosseum stands, a testament,\nTo
history's might, a story told,\nThe Vatican's beauty, a grandament,\nA sight that leaves one's heart so 
bold.\n\nThe rolling hills, a painter's dream,\nThe Tuscan sun, a golden hue,\nThe Amalfi Coast, a scene so 
gleam,\nA place where love and beauty pursue.\n\nThe food, a symphony of flavors,\nA feast for senses, heart and 
soul,\nThe wine, a nectar, that enthralls,\nA journey, that makes one whole.\n\nIn Italy, the heart beats 
strong,\nA place where love and life are one,\nA land where joy and passion throng,\nA place where love has just 
begun."
    ]
}

Ok, nice!

`PromptBuilder`

It’s a component that renders a prompt from a template string using the Jinja2 engine.

Let’s set up our prompt builder, with a format like the following (appropriate for Zephyr):

from haystack.components.builders import PromptBuilder

prompt_template = """<|system|>Using the information contained in the context, give a comprehensive answer to the question.
If the answer is contained in the context, also report the source URL.
If the answer cannot be deduced from the context, do not give an answer.</s>
<|user|>
Context:
  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %};
  Question: {{query}}
  </s>
<|assistant|>
"""
prompt_builder = PromptBuilder(template=prompt_template)
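
To get a feel for what the LLM will receive, here is a plain-Python approximation of the text this template produces once the documents and the query are filled in. It is only a sketch: render_prompt is a hypothetical helper, the system instructions are abbreviated, documents are represented as plain dicts, and the exact whitespace differs from what Jinja2 emits. PromptBuilder itself renders the template with Jinja2.

```python
def render_prompt(documents, query):
    """Plain-Python approximation of the rendered template (hypothetical helper)."""
    # One line per retrieved document: its content followed by its source URL.
    context = "\n".join(
        f"  {doc['content']} URL:{doc['url']}" for doc in documents
    )
    return (
        "<|system|>Using the information contained in the context, "
        "give a comprehensive answer to the question.</s>\n"  # abbreviated
        "<|user|>\n"
        "Context:\n"
        f"{context}\n"
        f"  Question: {query}\n"
        "  </s>\n"
        "<|assistant|>\n"
    )


documents = [
    {
        "content": "Audioslave was an American rock supergroup formed in Glendale, California, in 2001.",
        "url": "https://en.wikipedia.org/wiki/Audioslave",
    }
]
prompt = render_prompt(documents, "Where was Audioslave formed?")
print(prompt)
```

The prompt ends with the <|assistant|> tag, so the model continues from there with its answer.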

Let’s create the RAG pipeline

from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

Our RAG Pipeline finds Documents relevant to the user query and passes them to the LLM to generate a grounded answer.

It consists of several components:

SentenceTransformersTextEmbedder: represents the query as a vector (capturing its meaning).
InMemoryEmbeddingRetriever: finds the (top 5) Documents that are most similar to the query vector
PromptBuilder
HuggingFaceLocalGenerator
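
The retrieval step listed above boils down to ranking Documents by the cosine similarity between their vectors and the query vector, then keeping the top k. The sketch below shows the idea with made-up 3-dimensional vectors; the functions cosine_similarity and top_k are hypothetical helpers, and the real embeddings produced by gte-large have 1024 dimensions. It is not how InMemoryEmbeddingRetriever is implemented internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, docs, k=2):
    """Return the k (title, score) pairs most similar to the query."""
    scored = [
        (title, cosine_similarity(query_vec, vec)) for title, vec in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors, invented for illustration only.
docs = {
    "The Cure": [0.9, 0.1, 0.0],
    "Nirvana (band)": [0.1, 0.9, 0.0],
    "Muse (band)": [0.5, 0.5, 0.7],
}
query = [1.0, 0.2, 0.0]

results = top_k(query, docs, k=2)
for title, score in results:
    print(f"{title}: {score:.3f}")
```

Because cosine similarity depends only on the direction of the vectors, not their length, it matches the embedding_similarity_function="cosine" setting chosen for the Document Store.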

rag = Pipeline()
rag.add_component("text_embedder", SentenceTransformersTextEmbedder(model="thenlper/gte-large",
                                                                    device=ComponentDevice.from_str("cuda:0")))
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=5))
rag.add_component("prompt_builder", prompt_builder)
rag.add_component("llm", generator)

rag.connect("text_embedder", "retriever")
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

Visualize our pipeline!

rag.draw("rag.png")
Image(filename='rag.png')

We create a utility function that runs the RAG pipeline and nicely prints the answer.

def get_generative_answer(query):

  results = rag.run({
      "text_embedder": {"text": query},
      "prompt_builder": {"query": query}
    }
  )

  answer = results["llm"]["replies"][0]
  rich.print(answer)

Let’s try our RAG pipeline…

get_generative_answer("What is the style of the Cure?")

get_generative_answer("Is the earth flat?")

Based on the provided context, the question "Is the earth flat?" is not related to the information provided. 
Therefore, there is no answer to this question.