RAG Pipeline Evaluation Using DeepEval
Last Updated: September 20, 2024
DeepEval is a framework to evaluate Retrieval Augmented Generation (RAG) pipelines. It supports metrics like context relevance, answer correctness, faithfulness, and more.
For more information about evaluators, supported metrics, and usage, check out the DeepEval documentation.
This notebook shows how to use DeepEval-Haystack integration to evaluate a RAG pipeline against various metrics.
Prerequisites:
- OpenAI API key: DeepEval uses OpenAI models to compute some metrics, so an OpenAI API key is required.
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Install dependencies
!pip install "pydantic<1.10.10"
!pip install haystack-ai
!pip install "datasets>=2.6.1"
!pip install deepeval-haystack
Create a RAG pipeline
We’ll first need to create a RAG pipeline. Refer to the Haystack tutorials for a detailed guide on creating RAG pipelines.
In this notebook, we’re using the SQuAD v2 dataset to get the contexts, questions, and ground-truth answers.
Initialize the document store
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
dataset = load_dataset("rajpurkar/squad_v2", split="validation")
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)
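Note that `set()` does not preserve order, so the document list above can vary between runs. If reproducible ordering matters, a dict-based deduplication keeps the first occurrence of each context. A small standalone sketch (not part of the original notebook):

```python
# Deduplicate while preserving first-seen order (dicts keep insertion order in Python 3.7+).
def dedupe_preserve_order(items):
    return list(dict.fromkeys(items))

sample_contexts = ["a", "b", "a", "c", "b"]
print(dedupe_preserve_order(sample_contexts))  # ['a', 'b', 'c']
```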
import os
from getpass import getpass
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
retriever = InMemoryBM25Retriever(document_store, top_k=3)
template = """
Given the following information, answer the question.
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-3.5-turbo-0125")
Build the RAG pipeline
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
# Now, connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")
Running the pipeline
question = "In what country is Normandy located?"
response = rag_pipeline.run(
{"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)
print(response["answer_builder"]["answers"][0].data)
We’re done building our RAG pipeline. Let’s evaluate it now!
Get questions, contexts, responses and ground truths for evaluation
For computing most metrics, we will need to provide the following to the evaluator:
- Questions
- Generated responses
- Retrieved contexts
- Ground truths (specifically needed for the context precision, context recall, and answer correctness metrics)
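As a reference for wiring up the evaluator later, the expected shapes of these inputs can be sketched as plain Python values (the dummy data below is illustrative, not drawn from the dataset):

```python
# Illustrative shapes of the evaluator inputs (dummy values).
questions = ["In what country is Normandy located?"]            # List[str]
contexts = [["Normandy is a region in France.", "Another doc"]]  # List[List[str]]: retrieved docs per question
responses = ["France"]                                           # List[str]: one generated answer per question
ground_truths = ["France"]                                       # List[str]: one reference answer per question

# All lists must be aligned by index, one entry per question.
assert len(questions) == len(contexts) == len(responses) == len(ground_truths)
print("inputs aligned")
```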
We’ll start with three random questions from the dataset (see below) and then fetch the matching contexts and responses for those questions.
Helper function to get context and responses for our questions
def get_contexts_and_responses(questions, pipeline):
contexts = []
responses = []
for question in questions:
response = pipeline.run(
{
"retriever": {"query": question},
"prompt_builder": {"question": question},
"answer_builder": {"query": question},
}
)
contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
responses.append(response["answer_builder"]["answers"][0].data)
return contexts, responses
question_map = {
"Which mountain range influenced the split of the regions?": 0,
"What is the prize offered for finding a solution to P=NP?": 1,
"Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)
Get the ground truths
Now that we have questions, contexts, and responses we’ll also get the matching ground truth answers.
ground_truths = [""] * len(question_map)
for question, index in question_map.items():
idx = dataset["question"].index(question)
ground_truths[index] = dataset["answers"][idx]["text"][0]
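`list.index` raises a `ValueError` when a question is missing from the dataset, so it can be worth checking membership first. A minimal defensive version of the lookup above, shown with a toy dataset dict standing in for the SQuAD v2 validation split (same field layout, hypothetical values):

```python
def lookup_ground_truths(question_map, dataset):
    """Return ground-truth answers ordered by the indices in question_map.

    Raises a clear error when a question is not present in the dataset.
    """
    ground_truths = [""] * len(question_map)
    for question, index in question_map.items():
        if question not in dataset["question"]:
            raise KeyError(f"Question not found in dataset: {question!r}")
        idx = dataset["question"].index(question)
        ground_truths[index] = dataset["answers"][idx]["text"][0]
    return ground_truths

# Toy stand-in mirroring the SQuAD v2 field layout.
toy_dataset = {
    "question": ["Q1?", "Q2?"],
    "answers": [{"text": ["A1"]}, {"text": ["A2"]}],
}
print(lookup_ground_truths({"Q2?": 0, "Q1?": 1}, toy_dataset))  # ['A2', 'A1']
```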
print("Questions:\n")
print("\n".join(questions))
print("Contexts:\n")
for c in contexts:
print(c[0])
print("Responses:\n")
print("\n".join(responses))
print("Ground truths:\n")
print("\n".join(ground_truths))
Evaluate the RAG pipeline
Now that we have the questions, contexts, responses, and ground truths, we can begin our pipeline evaluation and compute all the supported metrics.
Metrics computation
In addition to evaluating the final responses of the LLM, it is important that we also evaluate the individual components of the RAG pipeline, as they can significantly impact the overall performance. Therefore, there are different metrics to evaluate the retriever, the generator, and the overall pipeline. For a full list of available metrics and their expected inputs, check out the DeepEvalEvaluator docs.
The DeepEval documentation provides explanations of the individual metrics, with simple examples for each of them.
Contextual Precision
The contextual precision metric measures our RAG pipeline’s retriever by evaluating whether items in our contexts that are relevant to the given input are ranked higher than irrelevant ones.
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric
context_precision_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_PRECISION, metric_params={"model":"gpt-4"})
context_precision_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_precision_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
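Each run returns per-question results; if a single aggregate number is wanted, the scores can be averaged. The sketch below assumes each entry in `results` is a list of dicts with a `score` field (verify this shape against your installed version of the integration); it uses mocked scores so it runs standalone:

```python
def mean_metric_score(results):
    """Average the 'score' fields across all per-question metric results."""
    scores = [entry["score"] for per_question in results for entry in per_question]
    return sum(scores) / len(scores) if scores else 0.0

# Mocked results in the assumed shape: one list of metric dicts per question.
mock_results = [
    [{"name": "contextual_precision", "score": 1.0, "explanation": "..."}],
    [{"name": "contextual_precision", "score": 0.5, "explanation": "..."}],
]
print(mean_metric_score(mock_results))  # 0.75
```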
Contextual Recall
Contextual recall measures the extent to which the retrieved contexts align with the ground truth.
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric
context_recall_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RECALL, metric_params={"model":"gpt-4"})
context_recall_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_recall_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
Contextual Relevancy
The contextual relevancy metric measures the quality of our RAG pipeline’s retriever by evaluating the overall relevance of the context for a given question.
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric
context_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RELEVANCE, metric_params={"model":"gpt-4"})
context_relevancy_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_relevancy_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
Answer relevancy
The answer relevancy metric measures the quality of our RAG pipeline’s response by evaluating how relevant the response is compared to the provided question.
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric
answer_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.ANSWER_RELEVANCY, metric_params={"model":"gpt-4"})
answer_relevancy_pipeline.add_component("evaluator", evaluator)
evaluation_results = answer_relevancy_pipeline.run(
{"evaluator": {"questions": questions, "responses": responses, "contexts": contexts}}
)
print(evaluation_results["evaluator"]["results"])
Note
When this notebook was created, version 0.20.57 of deepeval required contexts for calculating Answer Relevancy. Future versions will no longer require the context field; specifically, the upcoming release of deepeval-haystack will make contexts optional for this metric.
Faithfulness
The faithfulness metric measures the quality of our RAG pipeline’s responses by evaluating whether the response factually aligns with the contents of context we provided.
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric
faithfulness_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model":"gpt-4"} )
faithfulness_pipeline.add_component("evaluator", evaluator)
evaluation_results = faithfulness_pipeline.run(
{"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])
Our pipeline evaluation using DeepEval is now complete!