Trace and Evaluate RAG with Arize Phoenix
Last Updated: June 13, 2025
Phoenix is a tool for tracing and evaluating LLM applications. In this tutorial, we will trace and evaluate a Haystack RAG pipeline using three types of evaluations:
- Relevance: Whether the retrieved documents are relevant to the question.
- Q&A Correctness: Whether the answer to the question is correct.
- Hallucination: Whether the answer contains hallucinations.
ℹ️ This notebook requires an OpenAI API key.
!pip install -q openinference-instrumentation-haystack haystack-ai arize-phoenix
Set API Keys
from getpass import getpass
import os
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
🔑 Enter your OpenAI API key: ··········
Launch Phoenix and Enable Haystack Tracing
If you don’t have a Phoenix API key, you can get one for free at phoenix.arize.com. Arize Phoenix also provides self-hosting options if you’d prefer to run the application yourself instead.
if os.getenv("PHOENIX_API_KEY") is None:
os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API Key")
os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.environ['PHOENIX_API_KEY']}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
Enter your Phoenix API Key··········
The command below connects Phoenix to your Haystack application and instruments the Haystack library. Any calls to Haystack pipelines from this point forward will be traced and logged to the Phoenix UI.
from phoenix.otel import register
project_name = "Haystack RAG"
tracer_provider = register(project_name=project_name, auto_instrument=True)
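As an alternative to auto_instrument=True, you can attach the instrumentor yourself via the openinference-instrumentation-haystack package. The snippet below is a minimal sketch of that alternative; don't combine it with auto_instrument=True, or spans may be recorded twice.
# Alternative to auto_instrument=True: instrument Haystack explicitly.
# Assumes tracer_provider came from phoenix.otel.register(...) called without auto_instrument.
from openinference.instrumentation.haystack import HaystackInstrumentor

HaystackInstrumentor().instrument(tracer_provider=tracer_provider)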
Set up your Haystack app
For a step-by-step guide to create a RAG pipeline with Haystack, follow the Creating Your First QA Pipeline with Retrieval-Augmentation tutorial.
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage, Document
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack import Pipeline
# Write documents to InMemoryDocumentStore
document_store = InMemoryDocumentStore()
document_store.write_documents(
    [
        Document(content="My name is Jean and I live in Paris."),
        Document(content="My name is Mark and I live in Berlin."),
        Document(content="My name is Giorgio and I live in Rome."),
    ]
)
# Basic RAG Pipeline
template = [
    ChatMessage.from_system(
        """
        Answer the questions based on the given context.
        Context:
        {% for document in documents %}
        {{ document.content }}
        {% endfor %}
        Question: {{ question }}
        Answer:
        """
    )
]
rag_pipe = Pipeline()
rag_pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
rag_pipe.add_component("prompt_builder", ChatPromptBuilder(template=template, required_variables="*"))
rag_pipe.add_component("llm", OpenAIChatGenerator(model="gpt-4o-mini"))
rag_pipe.connect("retriever", "prompt_builder.documents")
rag_pipe.connect("prompt_builder.prompt", "llm.messages")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7f5e1e4be390>
🚅 Components
- retriever: InMemoryBM25Retriever
- prompt_builder: ChatPromptBuilder
- llm: OpenAIChatGenerator
🛤️ Connections
- retriever.documents -> prompt_builder.documents (List[Document])
- prompt_builder.prompt -> llm.messages (List[ChatMessage])
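Optionally, you can render the pipeline graph before running it. This is a hedged sketch: Pipeline.show() is designed for notebook environments and renders the graph via a Mermaid service, so it may need internet access.
# Optional: visualize the pipeline graph in a notebook (not required for tracing).
rag_pipe.show()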
Run the pipeline with a query. It will automatically create a trace on Phoenix.
# Ask a question
question = "Who lives in Paris?"
results = rag_pipe.run(
    {
        "retriever": {"query": question},
        "prompt_builder": {"question": question},
    }
)
print(results["llm"]["replies"][0].text)
Jean lives in Paris.
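To give the evaluations below more traces to work with, you can run the pipeline over a few more questions. This optional sketch reuses the same run() call as above; the extra questions are hypothetical examples.
# Optional: generate a few more traces before evaluating (hypothetical questions).
more_questions = [
    "Who lives in Berlin?",
    "Who lives in Rome?",
]
for q in more_questions:
    rag_pipe.run(
        {
            "retriever": {"query": q},
            "prompt_builder": {"question": q},
        }
    )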
Evaluating Retrieved Docs
Now that we’ve traced our pipeline, let’s start by evaluating the retrieved documents.
All evaluations in Phoenix use the same general process:
- Query and download trace data from Phoenix (a minimal sketch follows this list)
- Add evaluation labels to the trace data. This can be done using the Phoenix library, using Haystack evaluators, or using your own evaluators.
- Log the evaluation labels to Phoenix
- View evaluations
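As a concrete starting point for the first step, the Phoenix client can export spans as a pandas DataFrame. The snippet below is a minimal sketch assuming the px.Client().get_spans_dataframe() helper; the rest of this tutorial uses the more specialized get_retrieved_documents and get_qa_with_reference helpers instead.
# Hedged sketch: pull all spans for the project as a DataFrame for ad-hoc inspection.
import phoenix as px

all_spans_df = px.Client().get_spans_dataframe(project_name=project_name)
all_spans_df.head()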
We’ll use the get_retrieved_documents function to get the trace data for the retrieved documents.
import nest_asyncio

nest_asyncio.apply()  # allow nested asyncio event loops so Phoenix queries and evals run inside the notebook

import phoenix as px
from phoenix.session.evaluation import get_retrieved_documents

client = px.Client()
retrieved_documents_df = get_retrieved_documents(client, project_name=project_name)
retrieved_documents_df.head()
context.span_id   document_position  context.trace_id                  input                                               reference                                document_score
40880a3ade3753c3  0                  53d4a3ef151e2dc3009fa6aff152dc86  {"query": "Who lives in Paris?", "filters": nu...  My name is Jean and I live in Paris.     1.293454
                  1                  53d4a3ef151e2dc3009fa6aff152dc86  {"query": "Who lives in Paris?", "filters": nu...  My name is Mark and I live in Berlin.    0.768010
                  2                  53d4a3ef151e2dc3009fa6aff152dc86  {"query": "Who lives in Paris?", "filters": nu...  My name is Giorgio and I live in Rome.   0.768010
Next, we’ll use Phoenix’s RelevanceEvaluator to evaluate the relevance of the retrieved documents. This evaluator uses an LLM to determine whether the retrieved documents contain the answer to the question.
from phoenix.evals import OpenAIModel, RelevanceEvaluator, run_evals
relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4o-mini"))
retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]
retrieved_documents_relevance_df.head()
context.span_id   document_position  label      score  explanation
40880a3ade3753c3  0                  relevant   1      The question asks who lives in Paris. The refe...
                  1                  unrelated  0      The question asks about who lives in Paris, wh...
                  2                  unrelated  0      The question asks about who lives in Paris, wh...
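Before logging, it can be useful to summarize the labels, for example the share of retrieved documents judged relevant. This is a small pandas sketch over the dataframe above; it doesn't use any Phoenix APIs.
# Optional: quick summary of the relevance evals.
print(retrieved_documents_relevance_df["label"].value_counts())
print("Mean relevance score:", retrieved_documents_relevance_df["score"].mean())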
Finally, we’ll log the evaluation labels to Phoenix.
from phoenix.trace import DocumentEvaluations, SpanEvaluations
px.Client().log_evaluations(
    DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)
If you now click on your document retrieval span in Phoenix, you should see the evaluation labels.
Evaluate Response
With HallucinationEvaluator and QAEvaluator, we can measure whether the generated response is correct and whether it contains hallucinations.
from phoenix.session.evaluation import get_qa_with_reference
qa_with_reference_df = get_qa_with_reference(px.Client(), project_name=project_name)
qa_with_reference_df
context.span_id   input                                               output                                              reference
a3e33d1e526e97bd  {"data": {"retriever": {"query": "Who lives in...  {"llm": {"replies": ["ChatMessage(_role=<ChatR...  My name is Jean and I live in Paris.\n\nMy nam...
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))

qa_correctness_eval_df, hallucination_eval_df = run_evals(
    evaluators=[qa_evaluator, hallucination_evaluator],
    dataframe=qa_with_reference_df,
    provide_explanation=True,
    concurrency=20,
)
px.Client().log_evaluations(
    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
    SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
)
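If you want a quick local look at the response-level results before switching to the Phoenix UI, you can join the two evaluation dataframes on their shared span index. A small pandas sketch:
# Optional: inspect Q&A correctness and hallucination labels side by side.
response_evals_df = qa_correctness_eval_df[["label", "score"]].join(
    hallucination_eval_df[["label", "score"]],
    lsuffix="_qa_correctness",
    rsuffix="_hallucination",
)
response_evals_df.head()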
You should now see the Q&A correctness and hallucination evaluations in Phoenix.