RAG Pipeline Evaluation Using DeepEval


DeepEval is a framework for evaluating Retrieval-Augmented Generation (RAG) pipelines. It supports metrics like context relevance, answer correctness, faithfulness, and more.

For more information about the evaluators, supported metrics, and usage, check out the DeepEval integration page.

This notebook shows how to use the DeepEval-Haystack integration to evaluate a RAG pipeline against various metrics.

Prerequisites:

  • OpenAI key
    • DeepEval uses OpenAI models to compute some metrics, so we need an OpenAI key.
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

Install dependencies

!pip install "pydantic<1.10.10"
!pip install haystack-ai
!pip install "datasets>=2.6.1"
!pip install deepeval-haystack

Create a RAG pipeline

We’ll first need to create a RAG pipeline. Refer to the Haystack tutorials for a detailed guide on how to create RAG pipelines.

In this notebook, we’re using the SQuAD v2 dataset to get the contexts, questions, and ground truth answers.

Initialize the document store

from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

dataset = load_dataset("rajpurkar/squad_v2", split="validation")
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)
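
As a quick check that indexing worked, we can count the stored documents (a small optional snippet; it only relies on the document store's standard count_documents() method):

# Number of unique SQuAD v2 contexts written to the store
print(document_store.count_documents())
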
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retriever = InMemoryBM25Retriever(document_store, top_k=3)

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-3.5-turbo-0125")
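
To see what the rendered prompt looks like, you can also run the prompt builder on its own (an illustrative call; PromptBuilder fills the template variables passed as keyword arguments and returns the rendered string under the "prompt" key):

# Render the template with an arbitrary document and a sample question
print(prompt_builder.run(documents=docs[:1], question="In what country is Normandy located?")["prompt"])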

Build the RAG pipeline

from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder

rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

# Now, connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

Running the pipeline

question = "In what country is Normandy located?"

response = rag_pipeline.run(
    {"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)
print(response["answer_builder"]["answers"][0].data)

We’re done building our RAG pipeline. Let’s evaluate it now!

Get questions, contexts, responses and ground truths for evaluation

For computing most metrics, we will need to provide the following to the evaluator (the expected shapes are sketched after this list):

  1. Questions
  2. Generated responses
  3. Retrieved contexts
  4. Ground truth answers (specifically needed for the context precision, context recall, and answer correctness metrics)
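
Concretely, the evaluator expects parallel lists with one entry per question, where each question's contexts are themselves a list of strings. An illustrative sketch of the shapes (the values here are only examples):

questions = ["In what country is Normandy located?"]        # list[str]
responses = ["France"]                                       # list[str]: one generated answer per question
contexts = [["Normandy is a region in France. ..."]]         # list[list[str]]: retrieved passages per question
ground_truths = ["France"]                                   # list[str]: one reference answer per question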

We’ll start with three random questions from the dataset (see below) and get the matching contexts and responses for those questions.

Helper function to get context and responses for our questions

def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )

        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
        responses.append(response["answer_builder"]["answers"][0].data)
    return contexts, responses

question_map = {
    "Which mountain range influenced the split of the regions?": 0,
    "What is the prize offered for finding a solution to P=NP?": 1,
    "Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)

Get ground truths and review all fields

Now that we have questions, contexts, and responses, we’ll also get the matching ground truth answers.

ground_truths = [""] * len(question_map)

for question, index in question_map.items():
    idx = dataset["question"].index(question)
    ground_truths[index] = dataset["answers"][idx]["text"][0]

print("Questions:\n")
print("\n".join(questions))
print("Contexts:\n")
for c in contexts:
  print(c[0])
print("Responses:\n")
print("\n".join(responses))
print("Ground truths:\n")
print("\n".join(ground_truths))

Evaluate the RAG pipeline

Now that we have the questions, contexts, responses, and ground truths, we can begin our pipeline evaluation and compute all the supported metrics.

Metrics computation

In addition to evaluating the final responses of the LLM, it is important to evaluate the individual components of the RAG pipeline, as they can significantly impact the overall performance. Therefore, there are different metrics for evaluating the retriever, the generator, and the pipeline as a whole. For a full list of available metrics and their expected inputs, check out the DeepEvalEvaluator docs.

The DeepEval documentation explains each of the individual metrics with simple examples.

Contextual Precision

The contextual precision metric measures our RAG pipeline’s retriever by evaluating whether items in our contexts that are relevant to the given input are ranked higher than irrelevant ones.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_precision_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_PRECISION, metric_params={"model":"gpt-4"})
context_precision_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_precision_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

Contextual Recall

Contextual recall measures the extent to which the retrieved contexts align with the ground truth.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_recall_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RECALL, metric_params={"model":"gpt-4"})
context_recall_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_recall_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

Contextual Relevancy

The contextual relevancy metric measures the quality of our RAG pipeline’s retriever by evaluating the overall relevance of the context for a given question.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RELEVANCE, metric_params={"model":"gpt-4"})
context_relevancy_pipeline.add_component("evaluator", evaluator)
evaluation_results = context_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

Answer relevancy

The answer relevancy metric measures the quality of our RAG pipeline’s response by evaluating how relevant the response is to the provided question.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

answer_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.ANSWER_RELEVANCY, metric_params={"model":"gpt-4"})
answer_relevancy_pipeline.add_component("evaluator", evaluator)
evaluation_results = answer_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "responses": responses, "contexts": contexts}}
)
print(evaluation_results["evaluator"]["results"])

Note

When this notebook was created, version 0.20.57 of deepeval required contexts for calculating answer relevancy. Please note that future versions will no longer require the context field; specifically, the upcoming release of deepeval-haystack will no longer make the context field mandatory.

Faithfulness

The faithfulness metric measures the quality of our RAG pipeline’s responses by evaluating whether the response factually aligns with the contents of the retrieved context.

from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

faithfulness_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model":"gpt-4"})
faithfulness_pipeline.add_component("evaluator", evaluator)
evaluation_results = faithfulness_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])

Our pipeline evaluation using DeepEval is now complete!
