RAG Pipeline Evaluation Using DeepEval


DeepEval is a framework for evaluating Retrieval-Augmented Generation (RAG) pipelines. It supports metrics like context relevance, answer correctness, faithfulness, and more.

For more information about the evaluators, supported metrics, and usage, check out the DeepEval integration page.

This notebook shows how to use the DeepEval-Haystack integration to evaluate a RAG pipeline against various metrics.

Prerequisites:

  • OpenAI key
    • DeepEval uses OpenAI models to compute some metrics, so we need an OpenAI key.
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

Install dependencies

!pip install "pydantic<1.10.10"
!pip install haystack-ai
!pip install "datasets>=2.6.1"
!pip install deepeval-haystack

Create a RAG pipeline

We’ll first need to create a RAG pipeline. Refer to the Haystack tutorials for a detailed guide on how to create RAG pipelines.

In this notebook, we’re using the SQuAD v2 dataset to get the contexts, questions, and ground truth answers.

Initialize the document store

from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

dataset = load_dataset("rajpurkar/squad_v2", split="validation")
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)
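
As a quick check that indexing worked, we can count the stored documents (a small optional snippet; it only relies on the document store's standard count_documents() method):

# Number of unique SQuAD v2 contexts written to the store
print(document_store.count_documents())
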
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retriever = InMemoryBM25Retriever(document_store, top_k=3)

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-3.5-turbo-0125")
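
To see what the rendered prompt looks like, you can also run the prompt builder on its own (an illustrative call; PromptBuilder fills the template variables passed as keyword arguments and returns the rendered string under the "prompt" key):

# Render the template with an arbitrary document and a sample question
print(prompt_builder.run(documents=docs[:1], question="In what country is Normandy located?")["prompt"])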

Build the RAG pipeline

from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder

rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

# Now, connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

Running the pipeline

question = "In what country is Normandy located?"

response = rag_pipeline.run(
    {"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)
print(response["answer_builder"]["answers"][0].data)

We’re done building our RAG pipeline. Let’s evaluate it now!

Get questions, contexts, responses and ground truths for evaluation

For computing most metrics, we will need to provide the following to the evaluator (the expected shapes are sketched after this list):

  1. Questions
  2. Generated responses
  3. Retrieved contexts
  4. Ground truth answers (specifically needed for the context precision, context recall, and answer correctness metrics)
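
Concretely, the evaluator expects parallel lists with one entry per question, where each question's contexts are themselves a list of strings. An illustrative sketch of the shapes (the values here are only examples):

questions = ["In what country is Normandy located?"]        # list[str]
responses = ["France"]                                       # list[str]: one generated answer per question
contexts = [["Normandy is a region in France. ..."]]         # list[list[str]]: retrieved passages per question
ground_truths = ["France"]                                   # list[str]: one reference answer per question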

We’ll start with three random questions from the dataset (see below) and get the matching contexts and responses for those questions.

Helper function to get context and responses for our questions

def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )

        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
        responses.append(response["answer_builder"]["answers"][0].data)
    return contexts, responses

question_map = {
    "Which mountain range influenced the split of the regions?": 0,
    "What is the prize offered for finding a solution to P=NP?": 1,
    "Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)

Get ground truths and review all fields

Now that we have questions, contexts, and responses, we’ll also get the matching ground truth answers.

ground_truths = [""] * len(question_map)

for question, index in question_map.items():
    idx = dataset["question"].index(question)
    ground_truths[index] = dataset["answers"][idx]["text"][0]

print("Questions:\n")
print("\n".join(questions))
print("Contexts:\n")
for c in contexts:
  print(c[0])
print("Responses:\n")
print("\n".join(responses))
print("Ground truths:\n")
print("\n".join(ground_truths))

Evaluate the RAG pipeline

Now that we have the questions, contexts, responses, and ground truths, we can begin our pipeline evaluation and compute all the supported metrics.

Metrics computation

In addition to evaluating the final responses of the LLM, it is important to evaluate the individual components of the RAG pipeline, as they can significantly impact the overall performance. Therefore, there are different metrics for evaluating the retriever, the generator, and the pipeline as a whole. For a full list of available metrics and their expected inputs, check out the DeepEvalEvaluator docs.

The DeepEval documentation explains each of the individual metrics with simple examples.

Contextual Precision

The contextual precision metric measures our RAG pipeline’s retriever by evaluating whether items in our contexts that are relevant to the given input are ranked higher than irrelevant ones.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_precision_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_PRECISION, metric_params={"model":"gpt-4"})
context_precision_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_precision_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

Contextual Recall

Contextual recall measures the extent to which the retrieved contexts align with the ground truth.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_recall_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RECALL, metric_params={"model":"gpt-4"})
context_recall_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_recall_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

Contextual Relevancy

The contextual relevancy metric measures the quality of our RAG pipeline’s retriever by evaluating the overall relevance of the context for a given question.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RELEVANCE, metric_params={"model":"gpt-4"})
context_relevancy_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

Answer relevancy

The answer relevancy metric measures the quality of our RAG pipeline’s response by evaluating how relevant the response is to the provided question.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

answer_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.ANSWER_RELEVANCY, metric_params={"model":"gpt-4"})
answer_relevancy_pipeline.add_component("evaluator", evaluator)
evaluation_results = answer_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "responses": responses, "contexts": contexts}}
)
print(evaluation_results["evaluator"]["results"])

Note

When this notebook was created, version 0.20.57 of deepeval required contexts for calculating answer relevancy. Please note that future versions will no longer require the context field; specifically, the upcoming release of deepeval-haystack will no longer make the context field mandatory.

Faithfulness

The faithfulness metric measures the quality of our RAG pipeline’s responses by evaluating whether the response factually aligns with the contents of the retrieved context.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

faithfulness_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model":"gpt-4"})
faithfulness_pipeline.add_component("evaluator", evaluator)
evaluation_results = faithfulness_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

Our pipeline evaluation using DeepEval is now complete!
