RAG with Llama 3.1

Simple RAG example on the Oscars using Llama 3.1 open models and the Haystack LLM framework.

Installation

! pip install haystack-ai "transformers>=4.43.1" sentence-transformers accelerate bitsandbytes

Authorization

import getpass, os

os.environ["HF_API_TOKEN"] = getpass.getpass("Your Hugging Face token")

RAG with Llama-3.1-8B-Instruct (about the Oscars) 🏆🎬

! pip install wikipedia

Load data from Wikipedia

from IPython.display import Image
from pprint import pprint
import rich
import random
import wikipedia
from haystack.dataclasses import Document

title = "96th_Academy_Awards"
page = wikipedia.page(title=title, auto_suggest=False)
raw_docs = [Document(content=page.content, meta={"title": page.title, "url":page.url})]
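
If you want a slightly richer corpus, the same pattern extends to several pages; the extra title below is only an example and not part of the original notebook.

# optional extension: load more than one related page into raw_docs
titles = ["96th_Academy_Awards", "95th_Academy_Awards"]  # example titles
raw_docs = []
for t in titles:
    p = wikipedia.page(title=t, auto_suggest=False)
    raw_docs.append(Document(content=p.content, meta={"title": p.title, "url": p.url}))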

Indexing Pipeline

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.utils import ComponentDevice

document_store = InMemoryDocumentStore()
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))

indexing_pipeline.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        device=ComponentDevice.from_str("cuda:0"),    # load the model on GPU
    ))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# connect the components
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
indexing_pipeline.run({"splitter":{"documents":raw_docs}})
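
As a quick sanity check, we can count how many chunks ended up in the document store (count_documents is part of the InMemoryDocumentStore API):

# optional check: number of 200-word chunks indexed from the Wikipedia page
print(f"Indexed {document_store.count_documents()} document chunks")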

RAG Pipeline

from haystack.components.builders import PromptBuilder

prompt_template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>


Using the information contained in the context, give a comprehensive answer to the question.
If the answer cannot be deduced from the context, do not give an answer.

Context:
  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %}
  Question: {{query}}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>


"""
prompt_builder = PromptBuilder(template=prompt_template)
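
To see what this builder produces, you can render the template outside the pipeline with a dummy document; the snippet below is only an illustration (content and query are made up).

# optional preview of the rendered prompt, using a fake document
preview = prompt_builder.run(
    documents=[Document(content="Example passage about the ceremony.", meta={"url": "https://en.wikipedia.org/wiki/96th_Academy_Awards"})],
    query="Who hosted the ceremony?",
)["prompt"]
print(preview)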

Here, we use the HuggingFaceLocalGenerator, loading the model in Colab with 4-bit quantization.

import torch
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    huggingface_pipeline_kwargs={"device_map":"auto",
                                  "model_kwargs":{"load_in_4bit":True,
                                                  "bnb_4bit_use_double_quant":True,
                                                  "bnb_4bit_quant_type":"nf4",
                                                  "bnb_4bit_compute_dtype":torch.bfloat16}},
    generation_kwargs={"max_new_tokens": 500})

generator.warm_up()
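
Optionally, you can smoke-test the generator on its own before wiring it into the pipeline; the prompt below is just a hand-written Llama 3.1 chat-formatted string.

# quick check that generation works outside the pipeline
test_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Who directed Oppenheimer?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
print(generator.run(test_prompt)["replies"][0])
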
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

query_pipeline = Pipeline()

query_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        device=ComponentDevice.from_str("cuda:0"),  # load the model on GPU
        prefix="Represent this sentence for searching relevant passages: ",  # as explained in the model card (https://huggingface.co/Snowflake/snowflake-arctic-embed-l#using-huggingface-transformers), queries should be prefixed
    ))
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=5))
query_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
query_pipeline.add_component("generator", generator)

# connect the components
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "generator")

Let’s ask some questions!

def get_generative_answer(query):
  results = query_pipeline.run({
      "text_embedder": {"text": query},
      "prompt_builder": {"query": query}
    }
  )
  answer = results["generator"]["replies"][0]
  rich.print(answer)

get_generative_answer("Who won the Best Picture Award in 2024?")
get_generative_answer("What was the box office performance of the Best Picture nominees?")
get_generative_answer("What was the reception of the ceremony")
get_generative_answer("Can you name some of the films that got multiple nominations?")
# unrelated question: let's see how our RAG pipeline performs.

get_generative_answer("Audioslave was formed by members of two iconic bands. Can you name the bands and discuss the sound of Audioslave in comparison?")

This is a simple demo. We can improve the RAG pipeline in several ways, including better preprocessing of the input documents.
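
For example, one possible direction (a sketch, not part of the original notebook) is to clean the raw text and split it by sentences with a small overlap, combining Haystack's DocumentCleaner with the DocumentSplitter:

from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter

# sketch of an improved preprocessing front-end for the indexing pipeline:
# clean the raw Wikipedia text, then split by sentences with overlap so that
# relevant facts are less likely to be cut in half at chunk boundaries
cleaner = DocumentCleaner(remove_empty_lines=True, remove_extra_whitespaces=True)
splitter = DocumentSplitter(split_by="sentence", split_length=10, split_overlap=2)

cleaned = cleaner.run(documents=raw_docs)["documents"]
chunks = splitter.run(documents=cleaned)["documents"]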

To use Llama 3.1 models in Haystack, you also have other options:

  • LlamaCppGenerator and OllamaGenerator: using the GGUF quantized format, these solutions are ideal for running LLMs on standard machines (even without GPUs).
  • HuggingFaceAPIGenerator, which allows you to query a local TGI container or a (paid) HF Inference Endpoint. TGI is a toolkit for efficiently deploying and serving LLMs in production.
  • vLLM via OpenAIGenerator: a high-throughput and memory-efficient inference and serving engine for LLMs (see the sketch below).
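
For instance, here is a minimal sketch of the vLLM option, assuming an OpenAI-compatible vLLM server is already running locally on its default port (model name and URL are illustrative):

from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# assumes a server started with, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct
vllm_generator = OpenAIGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # vLLM does not validate the key by default
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    api_base_url="http://localhost:8000/v1",  # assumption: default vLLM port
    generation_kwargs={"max_tokens": 500},
)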

(Notebook by Stefano Fiorucci)