Prompt Optimization with DSPy
Last Updated: September 20, 2024
When building applications with LLMs, writing effective prompts is a long process of trial and error. Often, if you switch models, you also have to change the prompt. What if you could automate this process?
That’s where DSPy comes in: a framework designed to algorithmically optimize prompts for Language Models. By applying classical machine learning concepts (training and evaluation data, metrics, optimization), DSPy generates better prompts for a given model and task.
In this notebook, we will see how to combine DSPy with the robustness of Haystack Pipelines.
- Start from a Haystack RAG pipeline with a basic prompt
- Define a goal (in this case, get correct and concise answers)
- Create a DSPy program, define data and metrics
- Optimize and evaluate -> improved prompt
- Build a refined Haystack RAG pipeline using the optimized prompt
Setup
! pip install haystack-ai datasets dspy-ai sentence-transformers rich
import os
from getpass import getpass
from rich import print
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Load data
We will use the first 1000 rows of a labeled PubMed dataset with questions, contexts and answers.
Initially, we will use only the contexts as documents and write them to a Document Store.
(Later, we will also use the questions and answers from a small subset of the dataset to create training and dev sets for optimization.)
from datasets import load_dataset
from haystack import Document
dataset = load_dataset("vblagoje/PubMedQA_instruction", split="train")
dataset = dataset.select(range(1000))
docs = [Document(content=doc["context"]) for doc in dataset]
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)
document_store.filter_documents()[:5]
Initial Haystack pipeline
Let’s create a simple RAG Pipeline in Haystack. For more information, see the documentation.
Next, we will see how to improve the prompt.
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders import PromptBuilder
from haystack import Pipeline
retriever = InMemoryBM25Retriever(document_store, top_k=3)
generator = OpenAIGenerator(model="gpt-3.5-turbo")
template = """
Given the following information, answer the question.
Context:
{% for document in documents %}
{{ document.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
prompt_builder = PromptBuilder(template=template)
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
Let’s ask some questions…
question = "What effects does ketamine have on rat neural stem cells?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})
print(response["llm"]["replies"][0])
question = "Is the anterior cingulate cortex linked to pain-induced depression?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})
print(response["llm"]["replies"][0])
The answers seem correct, but suppose that our use case requires shorter answers. How can we adjust the prompt to achieve this effect while maintaining correctness?
DSPy
We will use DSPy to automatically improve the prompt for our goal: getting correct and short answers.
We will perform several steps:
- define a DSPy module for RAG
- create training and dev sets
- define a metric
- evaluate the unoptimized RAG module
- optimize the module
- evaluate the optimized RAG
Broadly speaking, these steps follow those listed in the DSPy guide.
import dspy
from dspy.primitives.prediction import Prediction
lm = dspy.OpenAI(model='gpt-3.5-turbo')
dspy.settings.configure(lm=lm)
DSPy Signature
The RAG module involves two main tasks (smaller modules): retrieval and generation.
For generation, we need to define a signature: a declarative specification of input/output behavior of a DSPy module.
In particular, the generation module receives the context and a question as input and returns an answer.
In DSPy, the docstring and the field description are used to create the prompt.
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="short and precise answer")
DSPy RAG module
- the __init__ method can be used to declare sub-modules.
- the logic of the module is contained in the forward method.
- the ChainOfThought module encourages Language Model reasoning with a specific prompt (“Let’s think step by step”) and examples (see the original paper).
- we want to reuse our Haystack retriever and the already indexed data, so we also define a retrieve method.
class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    # this makes it possible to use the Haystack retriever
    def retrieve(self, question):
        results = retriever.run(query=question)
        passages = [res.content for res in results['documents']]
        return Prediction(passages=passages)

    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)
Create training and dev sets
In general, to use DSPy for prompt optimization, you have to prepare some examples for your task (or use a similar dataset).
The training set is used for optimization, while the dev set is used for evaluation.
We create them using respectively 20 and 50 examples (question and answer) from our original labeled PubMed dataset.
trainset, devset = [], []
for i, ex in enumerate(dataset):
    example = dspy.Example(question=ex["instruction"], answer=ex["response"]).with_inputs('question')
    if i < 20:
        trainset.append(example)
    elif i < 70:
        devset.append(example)
    else:
        break
Define a metric
Defining a metric is a crucial step for evaluating and optimizing our prompt.
As we show in this example, metrics can be defined in a very customized way.
In our case, we want to focus on two aspects: correctness and brevity of the answers.
- for correctness, we use semantic similarity between the predicted answer and the ground truth answer (Haystack SASEvaluator). The SAS score ranges from 0 to 1.
- to encourage short answers, we subtract a penalty that grows linearly with answer length. The penalty ranges from 0 (for answers of 20 words or fewer) to 0.5 (for answers of 40 words or more).
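To make the arithmetic of the length penalty concrete, here is a minimal standalone sketch. It takes the similarity score as a plain number instead of calling SASEvaluator, and the helper names long_answer_penalty and mixed_score are illustrative only, not part of Haystack or DSPy.

```python
def long_answer_penalty(n_words: int) -> float:
    """Penalty as described above: 0 up to 20 words, linear in between, capped at 0.5."""
    if n_words <= 20:
        return 0.0
    if n_words >= 40:
        return 0.5
    # 0.025 per extra word, so the penalty approaches 0.5 as n_words approaches 40
    return 0.025 * (n_words - 20)


def mixed_score(semantic_similarity: float, answer: str) -> float:
    """Combine a similarity score (assumed to lie in [0, 1]) with the length penalty."""
    return semantic_similarity - long_answer_penalty(len(answer.split()))


# Print the penalty for a few answer lengths
for n_words in (10, 20, 30, 40, 60):
    print(n_words, long_answer_penalty(n_words))
```

For example, a 30-word answer with a similarity of 0.8 would score 0.8 - 0.25 = 0.55. Note that the combined score can drop below zero for a long answer that is also dissimilar to the ground truth.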
from haystack.components.evaluators import SASEvaluator
sas_evaluator = SASEvaluator()
sas_evaluator.warm_up()
def mixed_metric(example, pred, trace=None):
semantic_similarity = sas_evaluator.run(ground_truth_answers=[example.answer], predicted_answers=[pred.answer])["score"]
n_words=len(pred.answer.split())
long_answer_penalty=0
if 20<n_words<40:
long_answer_penalty = 0.025 * (n_words - 20)
elif n_words>=40:
long_answer_penalty = 0.5
return semantic_similarity - long_answer_penalty
Evaluate unoptimized RAG module
Let’s first check how the unoptimized RAG module performs on the dev set. Then we will optimize it.
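Conceptually, evaluating a program on a dev set amounts to running it on each example and averaging the metric. The sketch below illustrates that idea with plain Python; it is not DSPy's Evaluate implementation, and toy_program, exact_match and the dictionary-based examples are hypothetical stand-ins for the RAG module, our metric and dspy.Example objects.

```python
def average_metric(program, devset, metric):
    """Run the program on every dev example and average the metric values."""
    scores = [metric(example, program(example["question"])) for example in devset]
    return sum(scores) / len(scores)


def toy_program(question):
    # Hypothetical stand-in for the RAG module: a fixed rule instead of an LLM call
    return "yes" if "linked" in question else "no"


def exact_match(example, prediction):
    # Hypothetical stand-in for the metric: 1.0 on an exact match, 0.0 otherwise
    return 1.0 if prediction == example["answer"] else 0.0


toy_devset = [
    {"question": "Is A linked to B?", "answer": "yes"},
    {"question": "Does C cause D?", "answer": "yes"},
]
toy_score = average_metric(toy_program, toy_devset, exact_match)
print(toy_score)
```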
uncompiled_rag = RAG()
from dspy.evaluate.evaluate import Evaluate
evaluate = Evaluate(
metric=mixed_metric, devset=devset, num_threads=1, display_progress=True, display_table=5
)
evaluate(uncompiled_rag)
Optimization
We can now compile/optimize the DSPy program we created.
This can be done using a teleprompter/optimizer, based on our metric and training set.
In particular, BootstrapFewShot tries to improve the metric on the training set by adding few-shot examples to the prompt.
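The idea behind bootstrapping few-shot examples can be sketched in plain Python: run the program on the training examples and keep, as demonstrations, the ones where the metric is high enough. This is a simplified illustration and not the optimizer's actual code; bootstrap_demos, its threshold and max_demos parameters, and the toy program and metric are all assumptions made for this sketch.

```python
def bootstrap_demos(program, trainset, metric, threshold=0.5, max_demos=4):
    """Collect training examples on which the program already scores well."""
    demos = []
    for example in trainset:
        prediction = program(example["question"])
        # Keep only predictions that the metric judges good enough
        if metric(example, prediction) >= threshold:
            demos.append({"question": example["question"], "answer": prediction})
        if len(demos) == max_demos:
            break
    return demos


def lookup_program(question):
    # Hypothetical stand-in for an LLM call: answers from a small lookup table
    known = {"2+2?": "4", "Capital of France?": "Paris"}
    return known.get(question, "unknown")


def exact_match(example, prediction):
    return 1.0 if prediction == example["answer"] else 0.0


toy_trainset = [
    {"question": "2+2?", "answer": "4"},
    {"question": "Capital of Italy?", "answer": "Rome"},
    {"question": "Capital of France?", "answer": "Paris"},
]
demos = bootstrap_demos(lookup_program, toy_trainset, exact_match)
print(demos)
```

The selected demonstrations would then be prepended to the prompt, which is why the optimized prompt inspected later contains worked examples.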
from dspy.teleprompt import BootstrapFewShot
optimizer = BootstrapFewShot(metric=mixed_metric)
compiled_rag = optimizer.compile(RAG(), trainset=trainset)
Evaluate optimized RAG module
Let’s now see if the optimization has been successful by evaluating the compiled RAG module on the dev set.
evaluate = Evaluate(
metric=mixed_metric, devset=devset, num_threads=1, display_progress=True, display_table=5
)
evaluate(compiled_rag)
Based on our simple metric, we got a significant improvement!
Inspect the optimized prompt
Let’s take a look at the few-shot examples that made our results improve…
lm.inspect_history(n=1)
Optimized Haystack Pipeline
We can now use the static part of the optimized prompt (including examples) and create a better Haystack RAG Pipeline.
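Extracting the static part relies on str.rpartition, which splits a string at the last occurrence of a separator. The sketch below shows the mechanics on a made-up prompt; the text in rendered_prompt is an assumption for illustration and does not reproduce the exact prompt DSPy generates.

```python
# Made-up prompt: instructions, one few-shot example, then the live question
rendered_prompt = (
    "Answer questions with short factoid answers.\n\n"
    "---\n\n"
    "Question: example question\nAnswer: example answer\n\n"
    "---\n\n"
    "Question: the live question\nAnswer:"
)

# rpartition splits at the LAST "---\n", so everything before it
# (instructions and examples) ends up in the first element
static_part, separator, dynamic_part = rendered_prompt.rpartition("---\n")

print(static_part)
```

Here static_part keeps the instructions and the example, while the block for the live question lands in dynamic_part and can be replaced by a template.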
We include an AnswerBuilder to capture only the relevant part of the generation (all text after Answer:).
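To see what a pattern such as "Answer: (.*)" captures, here is a small sketch using Python's re module directly. It is not Haystack's AnswerBuilder code; extract_answer is a hypothetical helper, and returning an empty string when nothing matches is a choice made for this sketch.

```python
import re

ANSWER_PATTERN = re.compile(r"Answer: (.*)")


def extract_answer(reply: str) -> str:
    """Return the text captured after 'Answer: ', or '' if the marker is missing."""
    match = ANSWER_PATTERN.search(reply)
    return match.group(1) if match else ""


reply = "produce the answer. We know that the context mentions X.\nAnswer: Yes, it does."
print(extract_answer(reply))
```

Because "." does not match a newline by default, the capture group stops at the end of the line that contains the marker, so the reasoning text before it is discarded.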
%%capture
static_prompt = lm.inspect_history(n=1).rpartition("---\n")[0]
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack import Pipeline
template = static_prompt+"""
---
Context:
{% for document in documents %}
«{{ document.content }}»
{% endfor %}
Question: {{question}}
Reasoning: Let's think step by step in order to
"""
new_prompt_builder = PromptBuilder(template=template)
new_retriever = InMemoryBM25Retriever(document_store, top_k=3)
new_generator = OpenAIGenerator(model="gpt-3.5-turbo")
answer_builder = AnswerBuilder(pattern="Answer: (.*)")
optimized_rag_pipeline = Pipeline()
optimized_rag_pipeline.add_component("retriever", new_retriever)
optimized_rag_pipeline.add_component("prompt_builder", new_prompt_builder)
optimized_rag_pipeline.add_component("llm", new_generator)
optimized_rag_pipeline.add_component("answer_builder", answer_builder)
optimized_rag_pipeline.connect("retriever", "prompt_builder.documents")
optimized_rag_pipeline.connect("prompt_builder", "llm")
optimized_rag_pipeline.connect("llm.replies", "answer_builder.replies")
Let’s ask the same questions as before…
question = "What effects does ketamine have on rat neural stem cells?"
response = optimized_rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}})
print(response["answer_builder"]["answers"][0].data)
question = "Is the anterior cingulate cortex linked to pain-induced depression?"
response = optimized_rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}})
print(response["answer_builder"]["answers"][0].data)
The answers are correct and shorter than before!
(Notebook by Stefano Fiorucci)