RAG Pipeline Evaluation Using UpTrain


UpTrain is an evaluation framework that provides a number of LLM-based evaluation metrics and can be used to evaluate Retrieval-Augmented Generation (RAG) pipelines. It supports metrics such as context relevance, factual accuracy, guideline adherence, and more.

For more information about evaluators, supported metrics, and usage, check out the UpTrainEvaluator documentation.

This notebook shows how to use the UpTrain-Haystack integration to evaluate a RAG pipeline against various metrics.

Notebook by Anushree Bannadabhavi

Prerequisites:

  • API key
    • We need an API key to run evaluations with UpTrain. For the list of LLMs supported by UpTrain, refer to the UpTrain documentation.
    • This notebook uses OpenAI for the evaluation pipeline.

Install dependencies

%%bash

pip install "pydantic<1.10.10"
pip install haystack-ai
pip install "datasets>=2.6.1"
pip install uptrain-haystack

Create a RAG pipeline

We’ll first need to create a RAG pipeline. Refer to the Haystack tutorials for a detailed walkthrough of how to create RAG pipelines.

In this notebook, we’re using the validation split of the SQuAD v2 dataset to get the contexts, questions, and ground-truth answers.

Initialize the document store

from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

dataset = load_dataset("rajpurkar/squad_v2", split="validation")
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)

import os
from getpass import getpass
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retriever = InMemoryBM25Retriever(document_store, top_k=3)

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)


os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
generator = OpenAIGenerator(model="gpt-3.5-turbo-0125")

Build the RAG pipeline

from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder

rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

# Now, connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

Running the pipeline

question = "In what country is Normandy located?"

response = rag_pipeline.run(
    {"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)
print(response["answer_builder"]["answers"][0].data)

We’re done building our RAG pipeline. Let’s evaluate it now!

Get questions, contexts, responses and ground truths for evaluation

For computing most metrics, we will need to provide the following to the evaluator:

  1. Questions
  2. Generated responses
  3. Retrieved contexts
  4. Ground truth

Below is a helper function to get contexts and responses.

Helper function to get contexts and responses

def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )

        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
        responses.append(response["answer_builder"]["answers"][0].data)
    return contexts, responses

question_map = {
    "Which mountain range influenced the split of the regions?": 0,
    "What is the prize offered for finding a solution to P=NP?": 1,
    "Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)

Get ground truths

Since we’re using the SQuAD dataset, ground-truth answers are also available.

ground_truths = [""] * len(question_map)

for question, index in question_map.items():
    idx = dataset["question"].index(question)
    ground_truths[index] = dataset["answers"][idx]["text"][0]

print("Questions\n")
print("\n".join(questions))
print("Responses\n")
print("\n".join(responses))
print("Ground truths\n")
print("\n".join(ground_truths))

Evaluate the RAG pipeline

Now that we have the questions, generated answers, contexts and the ground truths, we can begin our pipeline evaluation and compute all the supported metrics.

UpTrainEvaluator

We can use the UpTrainEvaluator component to evaluate our Pipeline against one of the metrics provided by UpTrain.

The evaluation pipeline follows a simple structure:

  1. Create the pipeline object.
  2. Instantiate the evaluator object with the following parameters:
    • metric: The UpTrain metric you want to compute
    • metric_params: Any additional parameters required to compute the metric
    • api: The API you want to use with your evaluator
    • api_key: By default, this component looks for an environment variable called OPENAI_API_KEY. To change this, pass Secret.from_env_var("YOUR_ENV_VAR") to this parameter (see the sketch after the skeleton below).
  3. Add the evaluator component to the pipeline.
    • Note: We can add multiple evaluator components to the pipeline and compute multiple metrics, as shown in the "Add multiple evaluator components in a single pipeline" section at the end of this notebook.
  4. Run the evaluation pipeline with the necessary inputs.

from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

# Create the pipeline object
eval_pipeline = Pipeline()

# Initialize the evaluator object ({METRIC_NAME} is a placeholder for the metric you want to compute)
evaluator = UpTrainEvaluator(metric=UpTrainMetric.{METRIC_NAME})

# Add the evaluator component to the pipeline
eval_pipeline.add_component("evaluator", evaluator)

# Run the evaluation pipeline with the necessary inputs
evaluation_results = eval_pipeline.run(
    {"evaluator": {"contexts": contexts, "responses": responses}}
)
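
If your OpenAI key lives under a different environment variable, you can pass it explicitly via a Secret, as described in step 2 above. A minimal sketch, assuming a hypothetical variable named MY_OPENAI_KEY (by default the evaluator reads OPENAI_API_KEY):

from haystack.utils import Secret
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CONTEXT_RELEVANCE,
    api="openai",
    api_key=Secret.from_env_var("MY_OPENAI_KEY"),  # hypothetical env var name; OPENAI_API_KEY is the default
)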

Metrics computation

For a full list of available metrics and their expected inputs, check out the UpTrainEvaluator Docs. Additionally, UpTrain offers score explanations which can be quite useful.

UpTrain Docs also offers explanations of all metrics along with simple examples.

Helper function to print results

Below is a helper function to print the results of our evaluation pipeline.

def print_results(questions: list, results: dict):
    for index, question in enumerate(questions):
        print(question)
        result_dict = results['results'][index][0]
        for key, value in result_dict.items():
            print(f"{key} : {value}")
        print("\n")

Context relevance

Measures how relevant the context was to the question specified.

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CONTEXT_RELEVANCE,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"questions": questions, "contexts": contexts}})
print("\n")
print_results(questions, results["evaluator"])

Factual accuracy

Factual accuracy score measures the degree to which a claim made in the response is true according to the context provided.

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.FACTUAL_ACCURACY,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}})
print("\n")
print_results(questions, results["evaluator"])

Response relevance

Response relevance is the measure of how relevant the generated response is to the question asked.

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_RELEVANCE,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"questions": questions, "responses": responses}})
print("\n")
print_results(questions, results["evaluator"])

Response completeness

Response completeness score measures whether the generated response has adequately answered all aspects of the question being asked.

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_COMPLETENESS,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"questions": questions, "responses": responses}})
print("\n")
print_results(questions, results["evaluator"])

Response completeness with respect to context

Measures how complete the generated response was for the question specified given the information provided in the context.

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_COMPLETENESS_WRT_CONTEXT,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}})
print("\n")
print_results(questions, results["evaluator"])

Response consistency

Response Consistency is the measure of how well the generated response aligns with both the question asked and the context provided.

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_CONSISTENCY,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}})
print("\n")
print_results(questions, results["evaluator"])

Response conciseness

Response conciseness score measures whether the generated response contains any additional information irrelevant to the question asked.

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_CONCISENESS,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"questions": questions, "responses": responses}})
print("\n")
print_results(questions, results["evaluator"])

Critique language

Evaluates the response on multiple aspects - fluency, politeness, grammar, and coherence. It provides a score for each of the aspects on a scale of 0 to 1, along with an explanation for the score.

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CRITIQUE_LANGUAGE,
    api="openai",
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"responses": responses}})
print("\n")
print_results(questions, results["evaluator"])

Critique tone

Evaluates the tone of machine-generated responses. Provide your required tone as an additional parameter. Example: metric_params={"llm_persona": "informative"}

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.CRITIQUE_TONE,
    api="openai",
    metric_params={"llm_persona": "informative"}
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"responses": responses}})
print("\n")
print_results(questions, results["evaluator"])

Guideline adherence

Evaluates how well the LLM adheres to a provided guideline when giving a response.

import os
from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

evaluator = UpTrainEvaluator(
    metric=UpTrainMetric.GUIDELINE_ADHERENCE,
    api="openai",
    metric_params={"guideline": "Response shouldn't contain any personal information.",
                   "guideline_name": "personal info check"}
)

evaluator_pipeline = Pipeline()
evaluator_pipeline.add_component("evaluator", evaluator)
results = evaluator_pipeline.run({"evaluator": {"questions": questions, "responses": responses}})
print("\n")
print_results(questions, results["evaluator"])

Add multiple evaluator components in a single pipeline

In the code above, we created separate pipeline objects for every metric. Instead, we can also create a single pipeline object and add all the evaluators as components.

The final result will then include all the metrics.

from haystack import Pipeline
from haystack_integrations.components.evaluators.uptrain import UpTrainEvaluator, UpTrainMetric

pipeline = Pipeline()
evaluator_response_consistency = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_CONSISTENCY,
)
evaluator_response_completeness = UpTrainEvaluator(
    metric=UpTrainMetric.RESPONSE_COMPLETENESS,
)
pipeline.add_component("evaluator_response_consistency", evaluator_response_consistency)
pipeline.add_component("evaluator_response_completeness", evaluator_response_completeness)
results = pipeline.run({
        "evaluator_response_consistency": {"questions": questions, "contexts": contexts, "responses": responses},
        "evaluator_response_completeness": {"questions": questions, "responses": responses},
})
print_results(questions, results["evaluator_response_consistency"])
print_results(questions, results["evaluator_response_completeness"])

Our pipeline evaluation using UpTrain is now complete!
