Blog

MCP + Haystack: A Practical Guide for AI Engineers

Mon, 22 Jun 2026 00:00:00 +0000

The Model Context Protocol (MCP) is the connective tissue of the modern AI stack. If you are building AI agents or production RAG systems, understanding how MCP works with Haystack is one of the most direct ways to make your applications composable, reusable, and reachable from the tools your users already live in.

This blog post explains what MCP is, why it matters, and the different ways you can use MCP with Haystack, whether you want to consume external tools inside a Haystack agent or expose your Haystack pipelines and agents as MCP tools for MCP clients like Claude, ChatGPT, and Cursor.

What is MCP (Model Context Protocol)?

The Model Context Protocol (MCP) is an open standard that defines how AI applications connect to external tools, data sources, and services. The common analogy is that MCP is the “USB-C of AI”: instead of writing a bespoke integration for every model and every assistant, you implement the protocol once and any MCP-compatible client can use it.

An MCP setup has two sides:

MCP servers expose capabilities (tools, prompts, resources) over a standardized interface.
MCP clients (an LLM application, an agent, or an AI assistant like Claude Code, ChatGPT, or Cursor) discover those capabilities and call them at runtime.

A “tool” in MCP terms is just a callable with a name, a description, and a JSON Schema describing its inputs. That small, standardized contract is what makes the whole ecosystem interoperable.

⚠️ Security note: MCP servers may introduce security risks. Exercise caution when connecting to MCP servers to ensure they do not expose sensitive data or perform malicious or unsafe actions.

Why MCP and Haystack are a strong combination

Haystack is an open-source AI framework for building production-ready agents, RAG applications, and multimodal search systems out of reusable components and pipelines. MCP and Haystack complement each other in both directions:

Haystack as an MCP client. Your agent gains instant access to a growing ecosystem of MCP servers (filesystem, Git/GitHub, databases, web search, internal services) without you having to hand-roll each integration.
Haystack as an MCP server. The retrieval logic, agentic workflows, and domain knowledge you have already encoded in a pipeline become callable from any MCP client. You build the capability once and it shows up wherever your users work.

Haystack works on both sides of MCP: as a client consuming external tools, and as a server exposing pipelines and agents to any MCP-compatible assistant

The payoff for AI engineers is composability and reach. A single agentic flow can pull data from a Haystack RAG pipeline, hand off to a live web-context MCP tool, and route through another pipeline for structured output, connecting your own logic with external tools instead of re-implementing everything inside one monolithic agent.

The rest of this guide covers three concrete approaches:

Use MCP servers as tools inside a Haystack Agent ( MCPTool, MCPToolset, and SearchableToolset)
Deploy any Haystack pipeline or agent as an MCP server with Hayhooks.
Expose Haystack pipelines/agents as managed MCP tools on the Haystack Enterprise Platform.

These approaches are not mutually exclusive. You can connect a Haystack agent to MCP servers, then expose that same agent as an MCP tool via Hayhooks or the Haystack Enterprise Platform.

Approach 1: Haystack as an MCP client

Using MCP servers as tools is the most common entry point. You have a Haystack Agent and you want it to be able to call tools that live behind an MCP server. In practice, you can combine MCPTool and MCPToolset from the mcp-haystack integration with Haystack’s SearchableToolset for larger catalogs.

pip install mcp-haystack

`MCPTool`: bind a single MCP tool

MCPTool connects to an MCP server and exposes one specific tool to your agent. This gives you precise control over exactly which capability the LLM can reach. It supports Streamable HTTP and stdio transports.

Here is a single MCP tool wired directly into a Haystack Agent (using the official mcp-server-time server):

# pip install ollama-haystack
from haystack.components.agents import Agent
from haystack.components.generators.chat import OllamaChatGenerator
from haystack.dataclasses import ChatMessage
from haystack_integrations.tools.mcp import MCPTool, StdioServerInfo

time_tool = MCPTool(
    name="get_current_time",
    server_info=StdioServerInfo(
        command="uvx",
        args=["mcp-server-time", "--local-timezone=Europe/Berlin"],
    ),
)

agent = Agent(
    chat_generator=OllamaChatGenerator(model="gemma4:e4b"),
    tools=[time_tool],
)

response = agent.run(
    messages=[ChatMessage.from_user("What is the time in New York? Be brief.")],
)
print(response["last_message"].text)

Result:

8:42 AM Thursday (New York)

The Agent component is model-agnostic, so you can swap in any Haystack chat generator your stack supports.

To connect to a remote server instead, swap the transport:

from haystack_integrations.tools.mcp import MCPTool, StreamableHttpServerInfo

server_info = StreamableHttpServerInfo(url="http://localhost:8000/mcp")
tool = MCPTool(name="my_tool", server_info=server_info)

`MCPToolset`: load a whole server’s tools at once

MCPToolset connects to an MCP server and automatically discovers and loads its tools into a single, manageable unit. It is a subclass of Haystack’s Toolset, so it plugs directly into a Chat Generator, a ToolInvoker, or an Agent.

The key feature for controlling agent behavior is the tool_names filter, which lets you decide exactly which tools from the server your agent is allowed to use. Here is a practical example with the official filesystem MCP server:

# pip install mistral-haystack
from haystack.components.agents import Agent
from haystack.components.generators.chat import MistralChatGenerator
from haystack.dataclasses import ChatMessage
from haystack_integrations.tools.mcp import MCPToolset, StdioServerInfo

toolset = MCPToolset(
    server_info=StdioServerInfo(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "/Users/you/projects"],
    ),
    tool_names=["list_directory", "read_file"],  # restrict to safe read-only actions
)

agent = Agent(
    chat_generator=MistralChatGenerator(model="mistral-medium-3-5"),
    tools=toolset,
    exit_conditions=["text"],
)

response = agent.run(
    messages=[ChatMessage.from_user("List markdown files in /Users/you/projects/docs and summarize them.")]
)
print(response["last_message"].text)

Response:

Here are the Markdown files in the repo root and a brief summary of each:...

Tip: If you omit tool_names, the toolset loads every tool the server offers. Be careful here, exposing 20–30+ tools at once can overwhelm the LLM’s tool-selection logic and degrade accuracy. Curating the tool list is one of the simplest reliability wins you can make.

`SearchableToolset`: scale to large tool catalogs

As soon as you connect multiple MCP servers, you hit a hard problem: too many tools. Every tool definition (name, description, input schema) is injected into the LLM’s context on every turn. A catalog of dozens or hundreds of tools bloats the context window, raises cost and latency, and makes the model worse at picking the right tool.

SearchableToolset is Haystack’s answer to this context management challenge. Instead of exposing every tool up front, it exposes a single bootstrap tool, search_tools, that the agent uses to discover relevant tools on demand via BM25 keyword search. Once the agent searches, the matching tools become immediately available for it to call in subsequent iterations.

Crucially, the catalog can contain MCPTool and MCPToolset instances, so you can place many MCP servers behind one searchable, context-efficient interface:

import os

from haystack.components.agents import Agent
from haystack.dataclasses import ChatMessage
from haystack.tools import SearchableToolset
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack_integrations.tools.mcp import MCPToolset, StdioServerInfo, StreamableHttpServerInfo
# Pull tools from several MCP servers into one catalog
fetch_tools = MCPToolset(
    server_info=StdioServerInfo(command="uvx", args=["mcp-server-fetch"]),
)
github_tools = MCPToolset(
    server_info=StreamableHttpServerInfo(
        url="https://api.githubcopilot.com/mcp/",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_PAT']}"},
    ),
)

catalog = [fetch_tools, github_tools] # can be dozens of tools
toolset = SearchableToolset(catalog=catalog, top_k=3, search_threshold=8)

agent = Agent(
    chat_generator=OpenAIChatGenerator(model="gpt-5.4-mini"),
    tools=toolset,
)

result = agent.run(
    messages=[
        ChatMessage.from_user(
            "Check the open PRs on haystack integrations repo "
            "(deepset-ai/haystack-integrations) and create a social media post "
            "about the most interesting ones."
        )
    ]
)

print(result["last_message"].text)

Result:

Here’s a draft social post highlighting the most interesting open PRs in `deepset-ai/haystack-integrations`....

In this setup, the Fetch MCP Server runs over stdio, while the GitHub MCP Server uses the official remote MCP endpoint over Streamable HTTP. If your MCP host supports OAuth, you can authenticate with that remote server without passing a PAT header manually.

⚠️ Fetch MCP caution: The Fetch MCP Server can access local/internal IP addresses and may introduce security risk. Use it carefully to avoid exposing sensitive data.

This pattern keeps the agent’s prompt lean while still giving it access to a large universe of capabilities, exactly the kind of context engineering that separates a demo from a production agent. Learn more about context management in Blog Post: Context Engineering for Agentic Systems

Approach 2: Haystack as an MCP server using Hayhooks

The first approach makes Haystack an MCP client. Now let’s flip it around. Hayhooks is deepset’s tool for serving Haystack pipelines and agents over HTTP, and it can also act as an MCP server. This means any Haystack application, a defined pipeline or a full agent, can be exposed as an MCP tool and connected to MCP clients like Claude, ChatGPT, Cursor, or any other agent.

Getting started

pip install hayhooks[mcp]
hayhooks mcp run

This starts the MCP server on localhost:1417 by default (configurable via HAYHOOKS_MCP_HOST and HAYHOOKS_MCP_PORT). It speaks Streamable HTTP at /mcp.

How a pipeline becomes a tool

When you deploy a pipeline with a PipelineWrapper, Hayhooks automatically turns it into an MCP tool. The magic is in the run_api method and its docstring:

The wrapper name becomes the MCP tool name.
The first line of the run_api docstring becomes the tool description.
The run_api method arguments and their type hints become the tool’s inputSchema.

from pathlib import Path
from haystack import Pipeline
from hayhooks import BasePipelineWrapper


class PipelineWrapper(BasePipelineWrapper):
    def setup(self) -> None:
        pipeline_yaml = (Path(__file__).parent / "my_haystack_pipeline.yml").read_text()
        self.pipeline = Pipeline.loads(pipeline_yaml)

    def run_api(self, urls: list[str], question: str) -> str:
        """
        Ask a question about one or more websites using a Haystack pipeline.

        Args:
            urls: List of website URLs to analyze
            question: Question to ask about the content
        """
        result = self.pipeline.run(
            {"fetcher": {"urls": urls}, "prompt": {"query": question}}
        )
        return result["llm"]["replies"][0]

Because the input schema is derived from your method signature, Hayhooks validates inputs automatically.

Connecting your IDE or assistant

To use your deployed pipeline from MCP clients, add the server in MCP settings like this:

{
  "mcpServers": {
    "hayhooks": {
      "url": "http://localhost:1417/mcp"
    }
  }
}

Once configured, you can deploy, manage, and run your Haystack pipelines directly from chat. Hayhooks also exposes core tools (deploy_pipeline, undeploy_pipeline, get_pipeline_status, get_all_pipeline_statuses) so an assistant can manage your deployments through natural language.

This approach is ideal when you self-host and want full control over the infrastructure, while still getting standardized MCP access for free. Learn about the details of how to use Hayhooks as an MCP Server here.

Approach 3: Haystack as an MCP server using Haystack Enterprise Platform

Self-hosting an MCP server works, but production MCP tools need governance: authentication, access control, scaling, and observability. The Haystack Enterprise Platform provides all of this as a managed service, so you can turn any deployed pipeline into an MCP tool without standing up extra infrastructure.

The workflow is straightforward:

Build and deploy your pipeline in the platform.
Create a single workspace MCP server.
Enable individual pipelines as tools (toggling Use as MCP tool in each pipeline’s settings), optionally with a custom tool name and description.
Copy the generated client configuration and connect your assistant.

MCP client configuration looks like this (the platform generates it for you):

{
  "mcpServers": {
    "haystack-enterprise": {
      "url": "https://api.cloud.deepset.ai/api/v2/workspaces//mcp",
      "headers": {
        "Authorization": "Bearer your-api-key"
      }
    }
  }
}

You can read the full walkthrough in the Haystack Enterprise Platform docs.

Haystack Docs MCP Server

A concrete example of this pattern running in production: we just launched a docs MCP server. Under the hood it is a Haystack pipeline that performs document search over the Haystack documentation, deployed on the Haystack Enterprise Platform and exposed as an MCP server. It is the exact same pipeline that powers the search functionality on the Haystack documentation site.

This is what the whole idea looks like end to end: one retrieval pipeline serves the docs search UI and becomes a callable MCP tool. Connect it to your IDE or assistant and you can ask questions about Haystack, grounded in the official docs, without leaving your editor.

Haystack Docs MCP with an MCP Client

Choosing the right approach

Goal	Use this
Give a Haystack agent one external capability	`MCPTool`
Give a Haystack agent a curated group of tools from a server	`MCPToolset` with `tool_names`
Manage a large catalog of tools without bloating context	`SearchableToolset`
Self-host your pipeline/agent as an MCP server	Hayhooks (`hayhooks mcp run`)
Run managed, governed, observable MCP tools in production	Haystack Enterprise Platform

MCP use cases

Build robust AI agents: Connect Haystack agents to MCP servers with context efficiency in mind. This helps you build more capable agents that can interact with external systems like vector databases, CLI tools, and version control.
Internal knowledge in everyday tools: Expose a Haystack RAG pipeline as an MCP tool so teammates can query your internal knowledge base from inside Claude, ChatGPT, or Cursor, grounded in your own data, with no new interface to adopt.
Developer copilots: Connect the docs MCP server to your IDE so your coding assistant answers framework questions from authoritative documentation while you build.
Productizing pipelines: Expose a Haystack pipeline as a managed, authenticated tool that partners or customers can call, turning internal retrieval logic into an external offering.

Conclusion

MCP gives AI engineers a practical layer: consume external capabilities inside your Haystack agents, and expose your own pipelines and agents as reusable tools for any MCP-compatible client.

Use MCPTool, MCPToolset, and SearchableToolset when you want precise, scalable tool access inside an agent. Use Hayhooks or the Haystack Enterprise Platform when you want to publish Haystack applications for broader teams and production usage.

Ready to try Haystack with MCP? Start with the MCP get started guide, wire one tool into your agent, and iterate from there. If you want more deployment control and governance, explore MCP tools on Haystack Enterprise Platform.

Context Engineering for Agentic Systems: What Goes Into Your Agent's Mind

Mon, 20 Apr 2026 00:00:00 +0000

Every new generation of Large Language Models arrives with a bigger context window - and the temptation to use it fully. If the model can read a million tokens, why not feed it everything? In practice, more context doesn’t reliably mean better answers: it often means higher costs, slower responses, and a model that loses track of what actually matters. Context engineering is the discipline of deciding not just what to put in the context window, but how much, in what form, and when to leave things out - and it’s quickly becoming one of the most important skills in building reliable agentic systems.

Why context is so important for agentic systems

An LLM has exactly two sources of information when generating a response:

Internal state (“knowledge”) - what was baked in during training. It is static, potentially stale, and opaque to the developer.
Context (“prompt”) - everything provided at inference time. That’s the only thing we can actively control.

Training knowledge is fixed. We can’t update it without retraining, and we can’t know exactly what the model does or doesn’t know - though most providers publish a knowledge cutoff date in their model cards or documentation, which tells you the point beyond which the model has no awareness of world events. Context is the lever we actually have. Everything a model knows about the current task, the current user, the tools available to it, and the world right now has to come through the context window.

Today’s leading models offer context windows that would have seemed impossibly large just a few years ago - millions of tokens, enough to fit entire codebases, legal contracts, or a stack of research papers in a single prompt. Yet in practice, agentic systems burn through these limits surprisingly fast. A system prompt, a set of tool definitions, all tool calls and results, a few retrieved documents, and a handful of conversation turns can easily consume tens of thousands of tokens before the agent has done anything meaningful. And even when the hard limit isn’t reached, performance often degrades long before it is - the model starts losing track of earlier instructions, repeating itself, or missing relevant details buried under layers of accumulated context.

At step 1, the context holds little more than the user’s task. By step N, it has grown to include every tool call, every result, every model response, and any retrieved documents - all concatenated and re-sent from scratch on every iteration.

The difference from one-shot prompting is stark. A single prompt is small, hand-crafted, and fully under control. An agentic system operates in a loop - reasoning, calling tools, receiving results, and repeating, potentially dozens of times. Because LLMs are stateless, every iteration re-sends the entire accumulated history from scratch. The context isn’t a fixed input, but more of a growing log, and context engineering is about managing that growth.

When less is more

Transformers architecture behind the LLMs work by letting every token attend to every other token in the context. This is what makes them so powerful at integrating information - but it also means the model’s capacity is spread across all tokens simultaneously. Think of it as an attention budget: every new token you introduce depletes it by some amount, regardless of whether that token is useful or not.

The practical consequence is that irrelevant or redundant content doesn’t just waste space - it actively competes with the information that actually matters. A critical instruction buried under pages of tool outputs may receive less attention than if it had been sent alone. Research from Anthropic confirms this: models remain capable at longer contexts but show reduced precision for information retrieval and long-range reasoning compared to shorter ones. A million-token context window is not a free pass to include everything - it’s a budget, and every token you add is a trade-off.

The cost dimension

Most hosted LLMs charge per input token, which means every byte of context has a price tag. A single call with a 50,000-token context costs roughly 50× more than one with 1,000 tokens - and in an agentic loop that runs dozens of iterations, that multiplier compounds with every step. Context management is therefore not just a quality concern but a cost concern: a bloated context window can turn a cheap pipeline into an expensive one without producing any better answers.

What fills the context window in an agentic system

We’ve already touched on some of the components that fill an agent’s context window - system prompts, tool definitions, retrieved documents. Let’s map out the full picture, because the list is longer than many developers expect.

System prompt - standing instructions, persona, constraints, output format. Usually fixed but can be large.
Conversation history - the full back-and-forth between user and agent across the current session.
Memory - retrieved facts from past sessions or external knowledge stores. See also: Using Mem0 Memory Store with Haystack Agents.
Retrieval output - documents or chunks fetched proactively by a RAG pipeline, before the model acts. This data arrives in context as part of the input to the model, not as a consequence of something the model decided to do.
Tool definitions - every tool the model could call must be described in the context (name, description, parameters schema). With MCP toolsets, this can easily balloon into hundreds of tool descriptions.
Tool call results - the output of tools the model itself chose to invoke. Unlike retrieval output, these arrive mid-session as a consequence of the model’s actions. They can be surprisingly large: a read file operation returning a 500-line source file, a web search returning multiple scraped pages, or a database query returning hundreds of rows - and each result stays in context for the remainder of the session.
Few-shot examples - demonstration input/output pairs used to guide model behaviour.

The iceberg effect. A user sees a single answer. Behind the scenes, the model may have received 50,000 tokens or more on that one turn - a system prompt (perhaps 10k tokens), tool definitions (5k), retrieved documents (20k), and accumulated conversation history (15k). The answer is the tip, while the context is everything below the surface.

What the context actually looks like

The screenshot above shows Claude Code’s /context command, which breaks down exactly where tokens are being spent: system prompt, tool definitions, conversation history, open files, etc. Knowing this makes it possible to identify which component is responsible for a bloated context and whether that cost is justified. With this visibility, optimisation is a bit easier.

Building a Haystack agent

from haystack.components.agents import Agent
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import tool

@tool
def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"It's sunny and 22°C in {city}."

agent = Agent(
    chat_generator=AnthropicChatGenerator(),
    system_prompt="You are a helpful assistant.",
    tools=[get_weather],
)

result = agent.run(messages=[ChatMessage.from_user("What's the weather in Paris?")])
print(result["last_message"].text)

When you create an agent in Haystack, much of the context is assembled automatically. Tool descriptions are serialised and injected into the prompt under the hood - you define a tool once, and the framework ensures the model receives everything it needs to call it: the name, description, and parameter schema. The same applies to conversation history, which is maintained across turns without any manual concatenation. The context you see in your code is just the surface, but the model receives considerably more on every call.

Strategies for managing context growth

Context explosion is not inevitable. Once you understand what’s filling the window, you can start making choices about what actually needs to be there. There are several proven techniques for keeping context short without sacrificing quality.

Delegation to subagents

Another way to keep context small is to never let it grow large in the first place. Instead of one agent accumulating the full history of a complex task, you can split the work across specialised subagents - each one receiving only the slice of context relevant to its job. The orchestrator maintains a thin, high-level context, while the worker agents get focused, task-specific contexts. The total token count across the system may be similar, but no single model call is burdened with everything at once. For a practical example of this pattern in Haystack, see Building a Swarm of Agents or the Creating a Multi-Agent System with Haystack tutorial.

from haystack.components.agents import Agent
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import tool

@tool
def search_web(query: str) -> str:
    """Search the web for up-to-date information on a topic."""
    return f"Search results for '{query}': ..."

# Worker agent: only receives context relevant to its task
researcher = Agent(
    chat_generator=AnthropicChatGenerator(),
    system_prompt="You are a research assistant. Answer questions concisely.",
    tools=[search_web],
)

from haystack.tools import ComponentTool

delegate_research = ComponentTool(
    component=researcher,
    name="delegate_research",
    description="Delegate a research question to a specialised agent.",
    outputs_to_string={"source": "last_message"},
)

# Orchestrator: only sees compact summaries from worker agents
orchestrator = Agent(
    chat_generator=AnthropicChatGenerator(),
    system_prompt="Break down tasks and delegate them to specialised agents.",
    tools=[delegate_research],
)

result = orchestrator.run(messages=[ChatMessage.from_user("Compare quantum and classical computing.")])
print(result["last_message"].text)

Improving retrieval quality

In RAG pipelines, retrieval quality directly determines how many tokens land in the context. Poor retrieval returns irrelevant chunks that add noise without adding value - each one consuming part of the attention budget. Better precision means fewer chunks are needed, which means a smaller, cleaner context.

A related problem is redundancy: when retrieved passages are near-duplicates, the model sees the same information repeated multiple times without gaining anything new. This is why diversity matters as much as relevance - a set of chunks that each cover a different facet of the question is far more efficient than a set of very similar top matches. Techniques like hybrid retrieval, HyDE, query decomposition, and auto-merging retrieval all help surface results that are both more relevant and more varied.

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

rag = Pipeline()
rag.add_component("embedder", SentenceTransformersTextEmbedder())
# Retrieve 10 candidates, then rerank to the 3 most relevant
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=10))
rag.add_component("ranker", TransformersSimilarityRanker(top_k=3))

rag.connect("embedder.embedding", "retriever.query_embedding")
rag.connect("retriever.documents", "ranker.documents")

result = rag.run({
    "embedder": {"text": "climate change"},
    "ranker": {"query": "climate change"},
})
# result["ranker"]["documents"] now contains at most 3 highly relevant chunks

Coming up in the series: Retrieval quality deserves a post of its own. The next article will go deep on techniques for surfacing more relevant, more diverse results - so your RAG pipelines put more important tokens in front of the model.

Summarisation and compaction

As a conversation grows, the raw message history becomes the biggest consumer of context. Compaction addresses this by periodically replacing the accumulated history with a condensed summary - retaining the essential facts and decisions while discarding the verbatim back-and-forth. The agent continues with a much shorter context, and the summary is updated with each new turn.

This pattern is well-established in practice. Popular coding agents’ context compaction feature works exactly this way: when the context approaches its limit, it summarises the conversation so far and continues from the summary rather than truncating or failing.

from haystack import Pipeline
from haystack.core.component import component
from haystack.components.agents import Agent
from haystack.components.builders import ChatPromptBuilder
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import tool
from haystack_experimental.chat_message_stores.in_memory import InMemoryChatMessageStore
from haystack_experimental.components.retrievers import ChatMessageRetriever
from haystack_experimental.components.writers import ChatMessageWriter

@tool
def get_current_date() -> str:
    """Return today's date."""
    from datetime import date
    return date.today().isoformat()

@component
class HistoryCompactor:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.compactor = ChatPromptBuilder(
            template=[
                ChatMessage.from_user(
                    "Summarise the key facts from the conversation below in "
                    "3-5 bullet points.\n\n"
                    "{{ history }}"
                )
            ],
            required_variables=["history"],
        )
        self.summariser = AnthropicChatGenerator(model="claude-haiku-4-5-20251001")

    @component.output_types(messages=list[ChatMessage])
    def run(self, messages: list[ChatMessage]) -> dict:
        if len(messages) <= self.threshold:
            return {"messages": messages}
        history_text = "\n".join(f"{m.role}: {m.text}" for m in messages if m.text)
        prompt = self.compactor.run(template_variables={"history": history_text})["prompt"]
        summary = self.summariser.run(messages=prompt)["replies"][0].text
        # The output message has to be a user message, as our chat 
        # generator cannot work with just system/assistant messages
        return {
            "messages": [
                ChatMessage.from_user(f"Conversation so far (summary):\n{summary}")
            ]
        }

# skip_system_messages=False so the compacted summary (a system message) is persisted
message_store = InMemoryChatMessageStore(skip_system_messages=False)

pipeline = Pipeline()
pipeline.add_component("message_retriever", ChatMessageRetriever(message_store))
pipeline.add_component("compactor", HistoryCompactor(threshold=3))
pipeline.add_component(
    "agent", 
    Agent(
        chat_generator=AnthropicChatGenerator(model="claude-haiku-4-5-20251001"),
        system_prompt="You are a helpful assistant.",
        tools=[get_current_date],
    )
)
pipeline.add_component("message_writer", ChatMessageWriter(message_store))

pipeline.connect("message_retriever.messages", "compactor.messages")
pipeline.connect("compactor.messages", "agent.messages")
pipeline.connect("agent.messages", "message_writer.messages")

chat_history_id = "session_1"

# First turn
pipeline.run({
    "message_retriever": {
        "current_messages": [ChatMessage.from_user("What day is it today?")],
        "chat_history_id": chat_history_id,
    },
    "message_writer": {"chat_history_id": chat_history_id},
})

# Second turn - history is retrieved, compacted if needed, and stored back automatically
result = pipeline.run({
    "message_retriever": {
        "current_messages": [ChatMessage.from_user("What month are we in?")],
        "chat_history_id": chat_history_id,
    },
    "message_writer": {"chat_history_id": chat_history_id},
})
print(result["agent"]["last_message"].text)

Adding only relevant tools to the context

Tool definitions can be a surprisingly large slice of the context window, especially when connecting to MCP servers that expose dozens or hundreds of tools. Listing every tool upfront means the model receives all those descriptions on every single call, regardless of which tool is actually needed.

SearchableToolset, introduced in Haystack 2.25, inverts this approach. Instead of exposing the full catalog, the agent starts with a single search_tools function and uses it to dynamically discover relevant tools via BM25 keyword search. Only the tools it actually needs are loaded into the context for that turn.

from haystack.components.agents import Agent
from haystack_integrations.components.generators.anthropic import AnthropicChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool, SearchableToolset

# Create a catalog of tools
catalog = [
    Tool(name="get_weather", description="Get weather for a city", ...),
    Tool(name="search_web", description="Search the web", ...),
    # ... 100s more tools
]
toolset = SearchableToolset(catalog=catalog)

agent = Agent(chat_generator=AnthropicChatGenerator(), tools=toolset)

# The agent is initially provided only with the search_tools tool
# and will use it to find relevant tools on demand.
result = agent.run(messages=[ChatMessage.from_user("What's the weather in Milan?")])

Offloading notes (scratchpad / working memory)

An agent’s intermediate reasoning - the chain of thoughts it builds up while working through a multi-step task - does not have to live inside the context window. A simple alternative is to give the agent two dedicated tools: one to write a note to an external store, and one to read notes back. Instead of accumulating its internal monologue in the prompt, the agent can offload conclusions, partial results, and reminders to storage and retrieve only what it needs at each step.

This keeps the context lean: rather than carrying the full trace of every intermediate thought, the agent holds a minimal working state and queries its own notes on demand. The pattern is especially useful for long-horizon tasks where the reasoning chain would otherwise grow without bound, and it has the side effect of making the agent’s thinking inspectable and debuggable from outside the model.

What’s coming next in this series

This article is the foundation of a series on context engineering. Future posts will go deeper on specific topics - measuring whether your context actually helps the model, keeping context manageable in long-running agent loops, diversifying retrieval results, tracking token usage across pipelines, and more. If there is a particular area you would like us to cover first, let us know.

To stay up to date with the series and everything else happening in Haystack, star the Haystack GitHub repository and join the conversation on Discord.

Multimodal Search with Gemini Embedding 2 in Haystack

Tue, 10 Mar 2026 00:00:00 +0000

Embeddings are the backbone of modern AI applications, from semantic search and recommendation systems to Retrieval-Augmented Generation (RAG). However, most embedding models operate in a single modality, typically focusing only on textual data.

Google has introduced Gemini Embedding 2, a fully multimodal embedding model that maps text, images, video, audio, and PDFs into a shared vector space. This means you can search across different types of data using a single embedding model: gemini-embedding-2.

Even better, Haystack supports Gemini Embedding 2 from Day 0. Through the Google GenAI x Haystack integration, you can immediately start using the model in your Haystack applications for both text and multimodal embeddings.

Let’s take a closer look.

Meet Gemini Embedding 2

Gemini Embedding 2 is Google’s first fully multimodal embedding model, built on the Gemini architecture. It can map text, images, video, audio, and PDFs into a single unified vector space, enabling cross-modal comparison and retrieval using a shared semantic representation.

For example, a text query can retrieve relevant images, an audio clip can match a document, or a video segment can be retrieved using text search. This unified representation makes it easier to build multimodal applications like image search, recommendation systems, and RAG.

The model supports 100+ languages and allows developers to choose flexible embedding sizes using Matryoshka Representation Learning (MRL). Depending on the trade-off between storage and accuracy, you can select embedding dimensions up to 3072, with commonly recommended sizes being 768, 1536, or 3072 (default).

Gemini Embedding 2 also supports large inputs up to 8192 tokens, making it suitable for embedding longer documents and complex multimodal inputs.

Early benchmarks indicate strong performance across modalities, including a top-5 ranking on the MTEB Multilingual leaderboard for text and state-of-the-art results among proprietary models, with document retrieval performance comparable to Voyage.

Check out the official Google documentation for more details.

Generated by gemini-3.1-flash-image-preview aka. Nano Banana 2

Using Gemini Embeddings in Haystack

Haystack provides built-in components for generating Gemini embeddings through the Gemini API and Vertex AI.

For text data, you can use:

The GoogleGenAIDocumentEmbedder is typically used during the indexing to embed documents before storing them in a vector database.

# pip install haystack-ai google-genai-haystack
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.embedders.google_genai import (
    GoogleGenAIDocumentEmbedder, GoogleGenAITextEmbedder
)

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

docs = [
    Document(content="The capybara is the largest rodent in the world and is native to South America, where it lives near rivers, lakes, and wetlands. It is highly social and often seen relaxing in groups, spending much of its time swimming or soaking in water. Capybaras communicate through whistles, barks, and purr-like sounds."),
    Document(content="Dogs are domesticated mammals known for their loyalty, intelligence, and strong bond with humans. They have been bred for thousands of years for roles such as companionship, hunting, guarding, and assisting people with various tasks. Different breeds vary widely in size, temperament, and abilities."),
    Document(content="The tiger is the largest species of big cat and is recognized by its distinctive orange coat with black stripes. It is a powerful solitary predator that inhabits forests, grasslands, and wetlands across parts of Asia. Tigers are excellent swimmers and rely on stealth and strength to hunt prey."),
    Document(content="The giraffe is the tallest land animal on Earth, easily identified by its long neck and distinctive spotted coat. It uses its height to reach leaves high in acacia trees and roams the savannas and open woodlands of Africa. Despite its long neck, a giraffe has the same number of neck vertebrae as most mammals."),
    Document(content="Elephants are the largest land animals and are known for their intelligence, strong family bonds, and remarkable memory. They use their trunks for breathing, grasping objects, and communication. Elephants live in complex social groups led by a matriarch."),
    Document(content="Penguins are flightless birds that live primarily in the Southern Hemisphere, especially in Antarctica. They are excellent swimmers, using their flipper-like wings to move through the water while hunting fish, squid, and krill."),
    Document(content="Dolphins are highly intelligent marine mammals known for their playful behavior and complex communication. They live in social groups called pods and use echolocation to navigate and locate prey in the ocean."),
    Document(content="Owls are nocturnal birds of prey with excellent night vision and silent flight. They hunt small mammals, insects, and other birds, relying on their sharp talons and keen hearing to detect prey in darkness."),
    Document(content="Red pandas are small mammals native to the eastern Himalayas and southwestern China. They have reddish-brown fur, bushy tails, and spend most of their time in trees. Their diet mainly consists of bamboo, though they may also eat fruits and insects."),
    Document(content="Kangaroos are large marsupials native to Australia and are famous for their powerful hind legs, large feet, and strong tails that help them balance while hopping. Female kangaroos carry and nurture their young, called joeys, in a pouch. They typically live in open grasslands and forests and often move in groups called mobs."),
]

doc_embedder = GoogleGenAIDocumentEmbedder(
    model="gemini-embedding-2", 
    batch_size=5, 
    prefix="title: none | text: " # https://ai.google.dev/gemini-api/docs/embeddings#task-types
    config={
		    "output_dimensionality": 768 # flexible embedding sizes using MRL
		}
)
docs_with_embeddings = doc_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

Once documents are embedded and stored, you can embed queries using GoogleGenAITextEmbedder and retrieve relevant documents.

from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack_integrations.components.embedders.google_genai import GoogleGenAITextEmbedder

embedding_retriever = InMemoryEmbeddingRetriever(document_store=document_store)

query = "animal that communicates with whistles and barks"
text_embedder = GoogleGenAITextEmbedder(
    model="gemini-embedding-2", 
    prefix="task: search result | query: " # https://ai.google.dev/gemini-api/docs/embeddings#task-types
    config={
		    "output_dimensionality": 768 # flexible embedding sizes using MRL
		}
)
query_embedding = text_embedder.run(query)["embedding"]

result = embedding_retriever.run(query_embedding=query_embedding, top_k=2)

for doc in result["documents"]:
    print(doc.meta)
    print(doc.content)
    print(doc.score)
    print("-" * 10)

In production applications, embeddings can be stored in search engines such as Qdrant, Elasticsearch or OpenSearch.

Multimodal Embeddings

Haystack also provides Day-0 support for multimodal embeddings with the new GoogleGenAIMultimodalDocumentEmbedder component.

This component enables embedding images, audio, video, and PDFs directly inside Haystack pipelines, making it straightforward to build applications that search across multiple media types.

Example:

from haystack_integrations.components.embedders.google_genai import GoogleGenAIMultimodalDocumentEmbedder
from haystack import Document

docs = [
    Document(meta={"file_path": "kangaroo.mp4"}),
    Document(meta={"file_path": "tiger.jpg"}),
    Document(meta={"file_path": "sample.pdf"}),
    Document(meta={"file_path": "kangaroo.jpg"}),
    Document(meta={"file_path": "cat.jpg"})
]

doc_multimodal_embedder = GoogleGenAIMultimodalDocumentEmbedder(
    model="gemini-embedding-2", 
    config={
        "output_dimensionality": 768 # flexible embedding sizes using MRL
        }
)
docs_with_embeddings = doc_multimodal_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])

One powerful capability enabled by multimodal embeddings is cross-modal retrieval, retrieving one type of content using another modality as the query.

For example, you can perform image-to-text search, where an image is used as a query to retrieve relevant text documents. If a user uploads an image of a dog, the system could retrieve documents describing dog breeds, training tips, or animal behavior.

To perform image-to-text search, you first embed your text documents during indexing (as shown earlier using GoogleGenAIDocumentEmbedder). Then you embed the image as the query using the multimodal embedder.

Because all modalities share the same vector space, you can use this approach to support cross-modal retrieval in any direction, for example text-to-image, image-to-text, audio-to-video, or video-to-document search.

Example:

image_doc = Document(meta={"file_path": "another_kangaroo.jpg"})
image_embedder = GoogleGenAIMultimodalDocumentEmbedder(
    model="gemini-embedding-2", 
    config={
        "output_dimensionality": 768 # flexible embedding sizes using MRL
        }
)

# Create the embedding for the image
image_embedding = image_embedder.run([image_doc])["documents"][0].embedding

# Find the most semantically similar texts, images, audio, video and PDFs in the vector database
results = embedding_retriever.run(query_embedding=image_embedding, top_k=3)["documents"]

for doc in results:
    print(doc.content)
    print(doc.score)
    print("-" * 100)

With multimodal embeddings, you can build applications such as:

Multimodal search

Example: search a product catalog with a text query like “red running shoes with white soles” and retrieve both product images and descriptions.
Cross-modal retrieval

Search one type of media using another modality. For example, upload a photo of a monument to retrieve relevant articles describing it.
Multimodal RAG systems

Combine documents, diagrams, screenshots, and videos as sources of knowledge.
Media recommendation systems

Recommend videos or images based on textual descriptions.

To learn more about working with multimodal data in Haystack, check out the tutorial on Creating Vision + Text RAG Pipelines.

What’s Next

Gemini Embedding 2 makes it easier to build multimodal-aware retrieval systems without stitching together multiple embedding models.

With Haystack, you can already use both text and multimodal embeddings with Gemini Embedding 2. We’re excited to see what multimodal applications you build next.

Haystack Ecosystem: One Name, One Product Family, One Look

Fri, 19 Dec 2025 00:00:00 +0000

We’re making some naming and visual updates at deepset to better reflect the role Haystack already plays as a framework, a community, and the foundation of our enterprise platform.

If you’re already building with Haystack, nothing is changing in how you build or run applications. This update is about clarity, making the Haystack ecosystem easier to understand, easier to navigate, and centered around a single open foundation.

The Open Source to Enterprise Story of Haystack

Haystack began as an open-source framework for building NLP pipelines, created to give developers precise control over how AI systems are composed, debugged, and run in production. From the start, it was designed for real-world use, not just experimentation.

Over time, the framework expanded and so did the community. Teams began using Haystack to power increasingly sophisticated systems in production: advanced RAG pipelines, agentic workflows, and multi-step, multimodal applications operating over complex enterprise data. As adoption and contribution grew, Haystack evolved from a framework into a broader ecosystem, shaped by its community and proven in demanding production environments.

As enterprises scaled their use of Haystack, the need for additional operational capabilities emerged. The deepset AI Platform evolved alongside the framework, built on the same architecture, abstractions, and mental model to support governance, deployment, and collaboration at scale. Renaming it to Haystack Enterprise Platform reflects how the open-source framework and enterprise operational tooling are already used together in practice.

If you know Haystack, you already understand the foundation of the enterprise platform. The new name reflects how teams use Haystack across development, production, and enterprise-scale environments without changing tools, rewriting systems, or navigating brand confusion when choosing the right Haystack offering.

The Haystack Logo: What Changed and Why

Alongside the naming change, we’re updating the Haystack logo and overall look.

This isn’t just a change in appearance. It reflects how Haystack is used today and how the ecosystem has evolved.

Haystack is no longer only a framework, it’s the foundation across everything we ship. Haystack Open Source, Haystack Enterprise Starter, and Haystack Enterprise Platform now share a consistent look, reflecting their place within a single Haystack family.

The updated logo reflects Haystack’s role in building complete AI systems. It aligns with how Haystack is used today, bringing agents, retrieval, pipelines, and operational concerns together while staying true to its core values of openness, clarity and structure.

The Haystack Product Family

The Haystack product family supports teams across the full lifecycle of enterprise AI adoption, from development to production and organization-wide scale. It includes the Haystack Framework, Haystack Enterprise Starter, and Haystack Enterprise Platform. These are layers teams can grow into as their use of Haystack matures.

What Is Haystack?

If you’re new to Haystack, it’s an open-source AI orchestration framework developed by deepset that enables Python developers to build sophisticated, agentic AI applications.

It provides a modular pipeline architecture with a rich ecosystem of integrations, while preserving full flexibility to add custom components and external tools, including those from MCP servers. With robust documentation and production-ready capabilities such as serialization, logging, and well-defined APIs, Haystack supports reliable deployment across both cloud and on-premise environments.

Haystack is designed to help teams move from experimentation to production without sacrificing control or architectural clarity and is already used in production by many teams today.

What Is Haystack Enterprise Starter?

Haystack Enterprise Starter combines the Haystack open-source framework with enterprise-grade support to help teams move more quickly and confidently toward production.

It’s designed for teams already building with the Haystack framework who want additional guidance and reduced implementation risk, including:

Private and dedicated consulting and support
Secure pipeline and deployment design guidance
Proven deployment blueprints and production pipeline templates
A clear path to expand as requirements evolve

Enterprise Starter serves as a practical stepping stone for teams preparing for broader adoption and more complex production use cases.

Haystack Enterprise Starter was previously known as Haystack Enterprise.

👉 Learn more about Haystack Enterprise Starter

What Is Haystack Enterprise Platform?

Haystack Enterprise Platform extends Haystack with the tooling needed to run AI systems reliably in production at enterprise scale.

It builds on the same architecture, abstractions, and mental model the developer community already uses in Haystack, adding operational capabilities for deploying, testing, and governing AI applications across complex data, multiple use cases, and different environments.

Key capabilities include:

Composable, repeatable pipeline orchestration with visual views aligned with code
Production-ready indexing and retrieval for large, complex enterprise data
Integrated testing and evaluation to validate changes and improve system behavior over time
Access control and auditability for shared, multi-team environments
Scalable deployment across cloud and on-premise setups

Built on Haystack’s open-source foundation and shaped by real-world production use, Haystack Enterprise Platform helps teams elevate Haystack pipelines into enterprise systems they can operate, share, and scale across the organization.

👉 Try Haystack Enterprise Platform for free or learn more.

Looking Ahead

With these changes, Haystack is easier to understand, easier to navigate, and easier to adopt across the product family.

This evolution reflects how Haystack is used in practice today, as an open, modular foundation for development and the core of a platform supporting production AI systems ranging from retrieval-based pipelines to agent-based workflows.

We’ll keep building Haystack together with the community, customers, and partners, focused on transparent, debuggable, production-grade AI systems teams can trust and evolve over time.

Lufthansa Industry Solutions Uses Haystack to Power Enterprise RAG

Fri, 24 Oct 2025 00:00:00 +0000

When you think of Lufthansa, you might picture planes, airports, or global travel, but Lufthansa Industry Solutions (LHIND) is making an impact in a different way: as a full-service IT company delivering digital solutions for clients both inside and outside the Lufthansa Group.

At LHIND, a subsidiary of the Lufthansa Group, teams work on a wide range of projects that span cloud infrastructure, AI, and enterprise data systems to custom software development, process automation, and digital transformation initiatives. Among them is SmartAssistantAI, an enterprise AI chatbot implementation to make company knowledge accessible to everyone, instantly and securely.

Behind the product is Nils Hilgers, Lead AI Engineer at LHIND and his team of engineers and product builders. Together, they’re rethinking enterprise search through the lens of retrieval-augmented generation (RAG) and enterprise-grade security standards.

To bring that vision to life, the team selected Haystack as one of their key solutions for powering their AI Assistant.

The Challenge: Connecting Scattered Knowledge

LHIND’s engineering group was tasked with building a secure, centralized assistant capable of answering employees’ questions using the company’s internal documentation. The challenge wasn’t just accuracy — it was compliance and control.

The system needed to:

Handle multiple data sources (SharePoint, internal wikis etc.)
Operate under GDPR, ISO 27001, and Lufthansa Group’s own IT governance standards
Deliver explainable, source-cited results

With a small team of developer and engineers working in an agile setup, supported by a dedicated platform team, they set out to design a solution that could unify data retrieval and LLM-based reasoning without sacrificing traceability or maintainability.

Choosing Haystack: Flexibility Meets Stability

When the project began, the team evaluated several orchestration frameworks to structure their RAG pipelines. They needed something reliable enough for production but flexible enough to adapt as requirements evolved.

“We needed a graph orchestration framework with well-thought fundamentals that is stable for production” says Nils.

After testing a few alternatives, Haystack stood out for:

Orchestration layer built on directed graphs with easy serialization and visualization
Unified filtering across different vector database providers
Jinja-based prompt templating, which made their prompts more maintainable

Having used the older 1.x REST API for some demos, the team already knew Haystack’s foundations and migrating to 2.x resulted in cleaner, more maintainable code.

The Technical Architecture: How It All Comes Together

The assistant is a cloud-native, microservice-based system built around modularity and open-source principles. It combines Haystack pipelines with custom middleware and observability tooling. It’s not a public-facing product, rather, a secure solution deployed in enterprise environments where control over data and access is critical.

Core Components

1. Frontend and Authentication

A modular frontend built with microfrontends allows different configurations per customer like custom stylesheet, logo, and login interfaces through an admin interface. A Golang-based authentication middleware enforces role-based access control (RBAC) and ensures user permissions are respected end-to-end.

2. Ingestion Pipelines

Data ingestion is built on Haystack pipelines, with a custom data integration protocol. For preprocessing, domain-specific tools with Haystack OCR and custom converters. The custom data integration protocol includes the periodic LEFT-JOIN-style synchronization that removes outdated or deleted entries using hashing logic, allowing customers to synchronize their niche data source into the vector database.

3. Query Services

At the heart of the system, FastAPI-based query services integrate directly with Haystack pipelines. A ConditionalRouter directs requests to either a general LLM (GPT models) or a RAG pipeline, depending on the nature of the question.

Before the retrieval step, the assistant performs query rewriting based on the ongoing chat history, rephrasing the user’s question to optimize for hybrid search and ensure more accurate context retrieval.

Using server-sent events (SSE), the system streams results in real time and provides full transparency into the generation process: showing the routed and rewritten queries, the matched documents, and the final LLM-generated answer.

4. Observability and Monitoring

To monitor performance and reliability, the team integrated Langfuse for observability and Grafana + Kubernetes for operational monitoring. These tools save every incoming query and feedback in the data warehouse for future analysis and evaluation.

Measuring Success: From Latency to Feedback Loops

Rather than focusing on flashy metrics, the team prioritized efficiency and feedback quality. Their success criterion was simple: reduce the time it takes for people to find what they need.

Since that’s hard to measure directly, they rely on a mix of proxy metrics:

Usage metrics: daily and unique request counts across customers
Latency: time-to-first-token and time-to-last-token, especially 90th percentile
User feedback: thumbs-up/down and open-text comments

💡 Negative feedback with detailed comments has proven especially valuable, enabling developers to use observability tools to trace whether an issue stems from prompting, retrieval, or data quality.

Lessons Learned: Build Quality Early, Iterate Fast

Nils emphasizes a pragmatic approach: software quality and observability are not afterthoughts; they are foundations.

“Focus on engineering quality early, even if it slows you down,” he advises.

The team learned that robust infrastructure and documentation pay off in the long run, even if setup takes longer initially.

They emphasize:

Lean container builds and fast CI/CD pipelines
Clear internal documentation of database schemas, roles, and dependencies
Human-annotated evaluation datasets to track the performance of different iterations over model and prompt adjustments

Their development process combines disciplined engineering with agile experimentation, a balance that keeps the project moving fast without sacrificing reliability.

What’s Next

The roadmap for the coming quarters focuses on enhancing data integration and observability:

Migrating to a ClickHouse warehouse to handle growing data volumes efficiently
Adding server-side prompt management for version control

Further plans include:

Time-aware ranking to eliminate outdated knowledge
Deeper Integrations with Atlassian tools, relational databases, and CRM systems
A cleaner architecture using Haystack’s newer abstractions, like SuperComponents

The next step is scaling both quality and speed while still keeping the system easy to maintain.

LHIND’s journey shows how Haystack empowers developers to build reliable, customizable, and compliant AI assistants at enterprise scale, like SmartAssistantAI. Get started with Haystack and build your own intelligent, trustworthy assistants for enterprise knowledge access.

We know there are many more stories like this out there. If your team has built something exciting with Haystack, whether it’s a chatbot, a retrieval system, an agent, or an AI-powered internal tool, get in touch with us. We’d love to feature your work and share how you’re shaping the future with Haystack.

How TAC Built an Agentic Chatbot with Haystack to Transform Trade Promotions Workflows

Mon, 06 Oct 2025 00:00:00 +0000

When a leading company like TELUS Agriculture & Consumer Goods (TAC), with a strong presence in agriculture and consumer goods, turns to AI to streamline complex processes, it’s worth taking a closer look.

TELUS Agriculture & Consumer Goods helps businesses optimize everything from supply chains to retail operations. One of their latest innovations: an agentic chatbot powered by Haystack that simplifies how users interact with their trade promotions platform.

We sat down with the team behind this project to learn how they built it, why they chose Haystack, and what advice they have for other teams looking to implement Retrieval-Augmented Generation (RAG) and agent-based AI solutions in production.

The Challenge: Simplifying Complex Workflows

The team’s mission was clear: give users instant, intelligent access to their data without requiring them to dig through documentation or rely solely on the existing UI.

TAC’s trade promotions platform is where large consumer goods companies manage the sales incentives and agreements they set up with retailers and distributors. Think of discounts, special in-store displays, or seasonal promotions. The platform helps teams plan, track, and measure the impact of these promotions across products, customers, and time periods.

Previously, users had to navigate dense manuals to find the information they needed about promotions. This slowed down decision-making and made even simple questions difficult to answer quickly.

The chatbot changes all of this. Instead of manually searching documentation or waiting for new reporting features, users can now ask questions conversationally and get real-time answers. For example, someone managing promotions for a product can quickly check which campaigns are running, how much has been spent, or what results have been achieved — and share that information across their organization. Whether it’s a one-off question or a request for deeper insights, the system delivers immediate results.

As the team put it, this was about “giving users unprecedented access to their data” while removing the bottlenecks of traditional reporting systems.

Choosing Haystack: From Exploration to Production

When the team began exploring options, they came across Haystack while researching RAG workflows. What they needed was a framework that could support retrieval-augmented generation, handle tool calling for tasks like SQL query generation, and provide streaming capabilities for real-time interactions.

Haystack quickly stood out.

“Haystack checked all the boxes and was easy to get up and running. Switching to Haystack gave us exactly what we needed.” says Kelsey.

The team had some experience with other frameworks, but Haystack’s flexibility, modular design, and ease of use allowed them to build a proof of concept almost immediately, earning early buy-in and helping them move fast from experimentation to production.

The Technical Architecture: From Pipelines to Agents

The first version of the system used a pipeline-based architecture with two separate workflows:

One workflow handled knowledge base queries via RAG after ingesting user documentation and converting it from raw HTML into a searchable format
Another generated SQL queries from user inputs using metadata and schema descriptions, then executed them on the SQL database

A topic router decided which workflow to trigger. However, this setup quickly proved too rigid for real-world use.

The solution was to move to an agent-based architecture. Instead of isolated pipelines, the team wrapped each capability—the RAG workflow and the SQL pipeline—into separate tools that an agent could call dynamically.

Switching to an agent-based model was a game-changer. With Haystack Agents, the system gained the ability to:

Retrieve documentation and query the SQL database within a single conversational flow
Retry and rewrite queries based on error messages (self-debugging)
Deliver emergent behaviors like combining insights from multiple sources

Key Components of the Solution

Knowledge Base Tool: Ingests user documentation (HTML → searchable documents)
SQL Tool: Generates queries with metadata awareness, using MS SQL + SQL Alchemy. This tool enables end users to both get data faster and access datasets that may not have been easily reachable before.
Observability & Monitoring:
- OTEL (OpenTelemetry) for observability
- Langfuse + Sentry for monitoring and debugging
ETL & Development: Kedro for ETL pipelines, Gradio for UI testing
Guardrails & Safety:
- Restricting the incoming SQL queries to SELECT statements with LIMIT clauses
- Sanitizing SQL table/column names before output to prevent leakage
- Enforcing user identity constraints on generated queries

The team’s focus on security and reliability ensures sensitive data stays protected while users enjoy a smooth experience.

Evaluating the Performance

To track performance, the team focuses on three main signals: latency, accuracy, and user engagement.

Tool and LLM response times are monitored with Langfuse, while accuracy initially relied on human-in-the-loop evaluation before moving toward automated benchmarking. Feedback from early users guides iterative improvements and feature priorities.

The knowledge base bot was deployed to production this month, while the SQL-generation bot is scheduled for later this year. As adoption grows, the team expects to share more impact stories around productivity gains and user adoption.

Lessons Learned & Advice for Other Teams

The top advice from the team is to start small with a minimal vertical slice of functionality.

“Starting small gave us confidence” says Kelsey. “Our first proof of concept was simple: two pipelines with topic routing, shown through a Gradio demo. But it proved the value and helped us scale up with certainty.”

More advice for anyone considering a similar project:

Invest in observability early for debugging and insights
Use agents for flexibility rather than hardcoded pipelines
Automate evaluation to speed up iteration cycles

What’s Next

The journey doesn’t stop here. TAC plans to automate data ingestion processes, build an AI-based evaluation framework to score chatbot responses, and even white-label the application for other use cases across the organization.

“We see this as a template for the future,” says Kelsey. “Our goal is to bring this capability to more workflows across TELUS Agriculture & Consumer Goods, including internal documentation and SQL tooling.”

The TAC team’s journey shows what’s possible when innovative teams combine Haystack with real-world challenges. From streamlining SQL workflows to instant access to documentation, they turned a complex problem into a powerful, production-ready solution. Get started with Haystack and build reliable, production-ready AI applications for your own team.

Announcing Haystack Enterprise Starter: Best Practices and Support

Fri, 01 Aug 2025 00:00:00 +0000

💙 Thanks to you and all of our amazing community members, the Haystack open source framework has grown into a thriving developer ecosystem, now used by thousands of organizations to power everything from simple Q&A bots to advanced enterprise agents. As more teams run Haystack in production, one thing has become increasingly clear: building reliable AI systems is hard and scaling them securely is even harder.

We’ve had a front-row seat to these challenges. Across GitHub threads, meetups, community calls, and production deployments, developers have consistently asked for engineering support and hands-on guidance to build for their use case, accelerate deployment, improve observability, and scale infrastructure with confidence. These aren’t just feature requests; they reflect the real-world friction points of teams building AI products that actually ship.

We’ve tackled many of these problems ourselves while building for customers on the Haystack Enterprise Platform. Now, we want to make that experience more accessible to the broader Haystack community.

That’s why we’re introducing Haystack Enterprise Starter today. It adds secure support, best practices, and professional services availability for organizations scaling up their Haystack pipelines. ⬇️

What’s Inside Haystack Enterprise Starter

For developers building with Haystack, speed, reliability, and support are essential as projects move from prototype to production. Haystack Enterprise adds a new layer of readiness for exactly that:

Direct access to the Haystack team
Get expert help with private email support and dedicated consultation hours.
Curated pipeline templates for faster developement
Deploy out-of-the-box templates for advanced RAG pipelines, agentic workflows, and multimodal applications with built-in Hayhooks and Open WebUI support, ready to go.
Helm chart and guides for secure deployments
Run Haystack securely across Kubernetes clusters on AWS, Azure, GCP, or on-prem, backed by best setup, scaling, and upgrade practices.
Early access to enterprise-grade features
Take advantage of capabilities like prompt injection countermeasures and other security-oriented features ahead of broader release.

Think of it as Haystack+, an offering designed to accelerate delivery and give teams the necessary production muscle.

What About Open Source?

While Haystack Enterprise Starter might be the right choice for some organizations, it may not be for others. Today’s news does not introduce any changes to the OSS community. Haystack started and will always remain as an open source framework. That isn’t changing.

We’re committed to keeping Haystack fully featured, well-documented, and free to use. You’ll continue to get:

Open source codebase
Community-driven feature development and new integrations
Tutorials, Cookbook, and other learning materials
A welcoming, collaborative Discord community with support

Haystack Enterprise Starter is simply an opt-in layer for teams needing more support and guidance. It’s designed for engineering teams running Haystack at scale who want faster deployment, higher reliability, and private channels to get expert help.

No license changes. No gated community features. Just more choices for teams building AI products.

How Haystack Enterprise Starter Fits into the Bigger Picture

We see Haystack Enterprise Starter as part of a continuum that supports developers from the first prototype to full-scale production systems:

Haystack: Self-hosted and fully open source.
Haystack Enterprise Starter: Production-grade support and best practices for teams who are scaling up. Includes secure deployment tooling, private engineering guidance, and early access to enterprise features.
Haystack Enterprise Platform: A fully managed, end-to-end development and orchestration platform. Built for teams that want visual tooling, iterative testing, autoscaling, and centralized governance over resource and access management.

Now, you can start with Haystack, opt-in to Enterprise Starter as you grow, and consider the Platform if you want to accelerate your development.

Want to Learn More?

If you’re a developer running Haystack in production or thinking about scaling your next use case up, let’s talk!

👉 Get in touch with our team to explore Haystack Enterprise Starter

Summarize Hacker News Posts with Haystack & OPEA

Tue, 10 Jun 2025 00:00:00 +0000

Welcome to this step-by-step tutorial where we’ll build a simple Retrieval-Augmented Generation (RAG) pipeline using Haystack and OPEA. We’ll fetch the newest Hacker News posts, feed them to a lightweight LLM endpoint (OPEAGenerator), and generate concise one-sentence summaries (based on this notebook). Let’s dive in! 🎉

1. Introduction & Motivation

In modern GenAI applications, having a flexible, performant, and scalable platform is essential. OPEA (Open Platform for Enterprise AI) is an open, model-agnostic framework for building and operating composable GenAI solutions. It provides:

A library of microservices (LLMs, data stores, prompt engines) and higher-order megaservices for end-to-end workflows
HTTP-based inference with multi-model support (open- and closed-source)
Advanced features like batching, streaming, auto-scaling, routing via gateways, and unified observability

In this demo, we’ll use an OPEA LLM endpoint in a Haystack pipeline, giving you:

Instant HTTP access to any hosted model.
Seamless switching from small prototypes to production-grade RAG solutions.

In this tutorial, we’ll build a simple RAG pipeline that fetches the newest Hacker News posts, sends them to a local OPEA endpoint running a Qwen/Qwen2.5-7B-Instruct demo model, and produces concise one-sentence summaries. Of course, you can replace our example model with any other OPEA-served model, making this pattern both lightweight for prototyping and powerful for real-world deployments. Let’s get started! 🚀

2. Prerequisites

Make sure you have:

Python 3.9+
Install dependencies: pip install haystack-ai haystack-opea newspaper3k lxml[html_clean]
A running OPEA endpoint at http://localhost:9000/v1 (or your own)

NOTE: As a reference, here is a Docker Compose recipe to get you started. OPEA LLM service can be configured to use a variety of model serving backends like TGI, vLLM, ollama, OVMS… and offers validated runtime settings for good performance on various hardware’s including Intel Gaudi. In this example, it creates an OPEA LLM service with a TGI backend. See the documentation for LLM Generation. The code is based on OPEA LLM example and OPEA TGI example.

To run, call LLM_MODEL_ID=Qwen/Qwen2.5-7B-Instruct docker compose up.

3. Building the News Fetcher

We’ll create a custom Haystack component, HackernewsNewestFetcher, that:

Calls the Hacker News API to get the latest story IDs
Filters for posts containing URLs
Downloads & parses each article with newspaper3k.
Wraps results in Haystack Document objects

from typing import List
from haystack import component, Document
from newspaper import Article
import requests

@component
class HackernewsNewestFetcher():

  @component.output_types(documents=List[Document])
  def run(self, last_k: int):
    # Fetch the IDs of the newest stories
    newest_list = requests.get(
        url='https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty'
    )
    url_list = []
    # Keep only the first `last_k` IDs with URLs
    for id in newest_list.json()[0:last_k]:
      article = requests.get(
          url=f"https://hacker-news.firebaseio.com/v0/item/{id}.json?print=pretty"
      )
      if 'url' in article.json():
        url_list.append(article.json()['url'])

    docs = []
    # Download and parse each article
    for url in url_list:
      try:
        article = Article(url)
        article.download()
        article.parse()
        docs.append(
          Document(
            content=article.text,
            meta={'title': article.title, 'url': url}
          )
        )
      except Exception:
        print(f"Couldn't download {url}, skipped")
    return {"documents": docs}

4. Integrating the LLM (OPEAGenerator)

We use the OPEAGenerator to call our LLM over HTTP. Here, we point to a local endpoint serving the Qwen/Qwen2.5-7B-Instruct model:

from haystack_opea import OPEAGenerator

llm = OPEAGenerator(
    "http://localhost:9000/v1",             # Your OPEA endpoint
    "Qwen/Qwen2.5-7B-Instruct",             # Model name
    model_arguments={"max_tokens": 2000}    # Generation settings
)

5. Crafting the Prompt

Using PromptBuilder, we define a Jinja-style template that:

Lists each article’s title, content, and URL.
Asks the model for a one-sentence summary plus the URL.

from haystack.components.builders import PromptBuilder

prompt_template = """
You will be provided a few of the latest posts in HackerNews, followed by their URL.
For each post, provide a one sentence summary, followed by the original post URL.

Posts:
{% for doc in documents %}
  {{doc.meta['title']}}:
  {{doc.content}}
  URL: {{doc.meta['url']}}
{% endfor %}
"""

prompt_builder = PromptBuilder(template=prompt_template)

6. Building the Pipeline

We wire up the components in a Pipeline:

from haystack import Pipeline

# Instantiate components
fetcher = HackernewsNewestFetcher()

# Build and connect
pipe = Pipeline()
pipe.add_component("hackernews_fetcher", fetcher)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)

# Define data flow
pipe.connect("hackernews_fetcher.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "llm.prompt")

# Visualize the pipeline
pipe.show()                    # Interactive usage
pipe.draw(pipeline.png)        # Plotting to file

7. Running the Pipeline

Fetch and summarize the top 2 newest Hacker News posts:

result = pipe.run(data={"hackernews_fetcher": {"last_k": 2}})
print(result['llm']['replies'][0])

8. Results

A course on using Large Language Models (LLMs) to understand and structure search queries without relying on external services is being offered, demonstrating how LLMs can significantly improve and automate search capabilities.

[URL: https://softwaredoug.com/blog/2025/04/08/llm-query-understand]

The European Commission's new ProtectEU security strategy proposes enhanced tools for law enforcement, including methods to access encrypted data, raising significant concerns about potential violations of privacy and civil liberties.

[URL: https://www.cloudwards.net/news/protecteu-security-strategy-raises-encryption-concerns/]

Beautiful, concise summaries in seconds! ✨

9. Conclusion

In this tutorial, we built a full RAG pipeline:

Custom news fetcher for Hacker News.
Lightweight LLM integration via OPEAGenerator.
Jinja-templated prompt for structured input.
Haystack pipeline orchestration.

Feel free to extend this setup with more advanced retrieval, caching, or different LLM backends. Happy coding! 🛠️🔥

Deploy AI Pipelines Faster with Hayhooks

Mon, 12 May 2025 00:00:00 +0000

Haystack is an AI orchestration framework that enables developers to effortlessly build custom AI pipelines using a modular, building-block approach. However, when it’s time to take those pipelines from your development environment to production, you’re often left with a tough decision: write custom server code, or rely on proprietary tools that may not offer the flexibility you need.

We’re excited to announce Hayhooks, an open source package designed to simplify deployment. It lets you focus on developing meaningful AI systems rather than worrying about the underlying infrastructure.

With Hayhooks, you can deploy Haystack pipelines with custom logic, expose OpenAI-compatible chat endpoints, stream responses in real time, and customize your server—all with minimal code and full flexibility. Read on to find out why Hayhooks is a game-changer for every Haystack developer.

What are Haystack pipelines?

Haystack is an AI orchestration framework designed to create powerful LLM-based systems. It offers a vast library of pre-built components, which you can easily extend with custom logic if your project demands it. Haystack pipelines are designed as modular graphs, allowing you to define how components connect and add branches or loops to your logic. This flexibility enables the development of complex agentic systems, custom retrieval-augmented generation (RAG) applications, semantic search, and more

Introducing Hayhooks

While building AI pipelines has become increasingly accessible thanks to frameworks like Haystack, the journey from a working pipeline in a notebook to a production-ready system remains challenging. For basic tasks like processing requests and streaming responses, developers often find themselves writing boilerplate code and dealing with complicated deployment settings.

By simplifying deployment, Hayhooks provides the missing piece to Haystack’s ease of building custom AI pipelines. With Hayhooks, you can quickly turn any Haystack pipeline into a production-ready REST API or expose it as an MCP server. This allows, for example, a large audience to immediately interact with a pipeline through a UI or use it as an MCP tool for MCP clients.

📺 Explore Hayhooks in details and see a demo in Open NLP Meetup #15.

Hayhooks’ key features

Hayhooks bridges the gap between pipeline development and production deployment, focusing on what developers need most. Here are the key features that make this framework a must for every Haystack developer:

1. Simplified deployment

With Hayhooks, the days of writing boilerplate server code just to expose your pipeline are over. Now you can turn your Haystack pipeline into a REST API with a single command. The Hayhooks CLI handles the API creation, auto-generates Swagger and ReDocly documentation, and formats requests/responses for you, so you can focus on what your pipeline does, not how to serve it.

Hayhooks Swagger documentation with endpoints

2. Complete customization for your API

When it comes to real-world AI applications, customization is key. Hayhooks introduces the PipelineWrapper paradigm, allowing you to define exactly how your pipeline behaves as an API. You can insert custom logic before and after pipeline execution for preprocessing, response formatting, and dynamic component configuration, giving you total control over your deployment.

from pathlib import Path
from typing import List
from haystack import Pipeline
from hayhooks import BasePipelineWrapper

class PipelineWrapper(BasePipelineWrapper):
    def setup(self) -> None:
        ## Create the pipeline
        my_pipeline = Pipeline()
        my_pipeline.add_component(....
        ...
        self.pipeline = my_pipeline
        ## or load from yaml
        pipeline_yaml = (Path(__file__).parent / "chat_with_website.yml").read_text()
        self.pipeline = Pipeline.loads(pipeline_yaml)

    def run_api(self, urls: List[str], question: str) -> str:
        """
        Ask a question about one or more websites using a Haystack pipeline.
        """
        result = self.pipeline.run({"fetcher": {"urls": urls}, "prompt": {"query": question}})
        return result["llm"]["replies"][0]

3. Model Context Protocol (MCP) support

Model Context Protocol (MCP) provides a standardized way of interacting with large language models (LLMs) and other AI models, making it much easier to build customized AI applications. With Hayhooks, you can wrap a Haystack Pipeline into an MCP server with a single command and expose it as an MCP tool that MCP clients like Cursor, Windsurf, Claude Desktop, and Haystack Agents can interact with.

4. Production-ready streaming and chat endpoints

It’s great if your AI pipeline works well and you can talk to it through your IDE. But that’s not the way to bring it to a broader audience! Hayhooks generates OpenAI-compatible endpoints for chat interfaces like Open WebUI and supports streaming responses out of the box, so you can easily provide the engaging chat experience users are used to from ChatGPT, Perplexity, and the like.

5. Developer-friendly workflow

Hayhooks includes tools to speed up iteration during development. The --overwrite flag allows you to update an existing pipeline without restarting the server. If you’re debugging and making frequent changes, you can speed things up even more by skipping file persistence. And for better error visibility during development, enable tracebacks with HAYHOOKS_SHOW_TRACEBACKS=true. This provides detailed information when things go wrong, helping you diagnose and fix issues faster.

6. Infrastructure that scales with you

Since Hayhooks is built on FastAPI, you can leverage all FastAPI features for more advanced needs. For example, you can add authentication mechanisms for secure deployments, implement custom logging for monitoring, create additional endpoints for administration or diagnostics, and integrate with existing FastAPI applications in your infrastructure. Additionally, you can easily containerize your Hayhooks web server with Docker and deploy it at scale with Kubernetes. This flexibility means Hayhooks can adapt to your production requirements rather than forcing you to adapt to its limitations.

See Hayhooks in action

Want to see how Hayhooks can transform your pipeline into a production-ready solution? Check out our step-by-step tutorial, showing you how to build a chatbot that interacts with website content using Haystack, Hayhooks, and Open WebUI. You’ll have a fully functional chatbot that streams responses and responds to real-time user queries in just a few minutes.

Explore more examples on our GitHub.

Chat with website example

Outlook and getting started

Since the launch of Hayhooks, the feedback from developers has been great. Many have already successfully deployed their pipelines using Hayhooks, validating the demand for customizable deployment solutions in the AI space. We’re continuously improving the framework and prioritizing feature requests from the community, including:

Better support for pipeline redeployment.
Dependency management using requirements.txt.
Asynchronous pipeline support.

If you’re a Haystack developer looking for a fast, easy, and powerful deployment method, check out Hayhooks and let us know what you think either on GitHub or on our Discord. We hope you’ll sleep better knowing there’s now a safe and easy way to get your AI pipelines out to the masses faster ;)

Optimize RAG Applications with Document Reranking Using Haystack With NVIDIA NeMo Retriever

Thu, 20 Mar 2025 00:00:00 +0000

In retrieval-augmented generation (RAG) applications, the quality of the retrieved documents plays a critical role in delivering accurate and meaningful responses. But what happens when embedding similarity is not enough to get an accurate ordering of the reference documents? This is where reranking comes into play.

What’s Reranking?

Reranking refers to assigning a relevance score to each document based on how well it matches the query. Reranking reorders the retrieved documents to ensure the most contextually relevant results are at the top. This is important because while the retrieval stage focuses on recall, considering relevance broadly, reranking “fine-tunes” the results for increased precision.

Examples of Reranking

Consider a query like, “What are the best practices for securing a REST API?” The retrieval model might return a ranked list with these documents:

REST API: a practical guide
Best REST API frameworks
Detailed steps on how to secure REST APIs
Public vs. private APIs: challenges and limitations
REST API architecture principles

While all of these seem relevant to the topic of REST APIs, the document with specific security steps (document 3) should ideally be ranked first. Using purely embedding similarity, the document score may rely too much on common words - for instance, document 1 includes “REST API” and a similar word to “practice”, while document 2 also includes the word “best” from the query. The use of a reranker should lead to a better document scoring that overcomes these faults, leading to a better retrieval pipeline.

Why Reranking is Crucial in RAG Systems

Adding a reranking component to a RAG pipeline enhances both recall (retrieving relevant documents) and precision (selecting the most relevant ones). The reranker, typically using a fine-tuned LLM, reorders retrieved document chunks to ensure the most relevant ones appear at the top, making the retrieval process more accurate.

By prioritizing the right documents, reranking increases the likelihood of providing the LLM with the best context, which improves the quality of generated responses. For example, in an application where the user seeks specific technical information, the reranking model ensures that highly relevant content appears first, preventing less helpful results from diluting the response quality. This is particularly important when the LLM providing the response has a limited context window or when we aim to optimize its inference process for speed and cost-efficiency.

Reranking is especially valuable in hybrid retrieval setups, where chunks come from different datastores or from various retrieval methods (e.g., sparse, dense, or keyword-based). Each method may rank relevance differently, but reranking brings consistency regardless of the retrieval method. In hybrid setups, it ensures that the final set of documents provided to the LLM reflects the true semantic relevance to the query, rather than being dominated by a single retrieval method’s biases.

Evaluation Metrics for Retrieval and Reranking

Depending on the purpose, many metrics, such as semantic answer similarity or faithfulness, can be used to evaluate a RAG pipeline. When using a reranker in the pipeline, it’s essential to assess not only retrieval performance but also how well the reranker optimizes the order of returned results.

Retrieval performance:

Recall reflects how successful the retrieval operation was, checking how many ground truth documents were retrieved. Single-hit recall indicates how often at least one relevant document is retrieved within the results, and multi-hit recall measures how often all relevant documents appear in the top results.

Reranking performance:

To measure how well the reranking model is ordering document chunks, we can use Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG). MRR reflects how well the relevant documents are positioned, while NDCG evaluates the quality of the entire ranking. Combined, they can give insights in both cases where a little or a lot of context is needed.

	Recall@5 (Single Hit)	Recall@5 (Multi Hit)	Precision@5	MRR@5	NDCG@5
Retriever (top_k=100)	0.818	0.650	0.635	0.652	0.584
Reranker (top_k=5)	0.884	0.718	0.692	0.708	0.643
Reranker Improvement	6.60%	6.80%	5.69%	5.59%	5.90%

Table 1 - Evaluation scores of retrieval and reranking over the small subset of the HotpotQA dataset. For evaluation, the following NVIDIA NeMo Retriever microservices were used: nvidia/llama-3.2-nv-embedqa-1b-v2, for retrieval and nvidia/llama-3.2-nv-rerankqa-1b-v2, for reranking.

The table reveals the impact of adding a reranker to enhance retrieval output:

Recall Improvements: The reranker improves Recall@5 for both single-hit and multi-hit, with multi-hit recall seeing the highest boost (+6.80%). This improvement is crucial when multiple relevant documents are needed for comprehensive context, as the reranker successfully surfaces more relevant documents within the top results.
Reranking Quality: Metrics like MRR@5 and NDCG@5 indicate a significant improvement in reranking performance. The rise in MRR (+5.59%) suggests that relevant documents appear earlier, while the NDCG increase (+5.90%) indicates better overall ranking quality, making it easier to retrieve relevant information from the top of the results.

In summary, this analysis shows that the reranking model significantly enhances both retrieval and reranking metrics, underscoring its value in surfacing relevant content effectively within the RAG pipeline.

For detailed code used in this evaluation, check out Cookbook: Evaluate Reranking-Enhanced Retrieval Pipelines

Introducing NVIDIA NIM

NeMo Retriever microservices are built with NVIDIA NIM, part of the NVIDIA AI Enterprise software platform. NVIDIA NIM is a collection of containerized microservices designed for optimized inference of state-of-the-art AI models. The container uses a variety of components to serve AI models and exposes them via standard APIs. Models are optimized using NVIDIA TensorRT or using NVIDIA TensorRT-LLM (depending on the type of the model), automatically applying procedures such as quantization, model distribution, optimized kernel/runtimes, and inflight- or continuous batching, among others, allowing even further optimization if needed. Learn more about NIM.

Haystack provides 4 components that connect with NVIDIA NIM:

NvidiaGenerator: Text generation with LLM NIM.
NvidiaDocumentEmbedder: Document embedding with NVIDIA NeMo Retriever embedding NIM microservice.
NvidiaTextEmbedder: Query embedding with NVIDIA NeMo Retriever embedding NIM microservice.
(NEW) NvidiaRanker: Document chunk reranker using NVIDIA NeMo Retriever reranking NIM microservice.

Fig 1 - Haystack Indexing and RAG pipeline upgraded with reranking, with NVIDIA NeMo Retriever

Getting Started with NVIDIA NIM

To integrate NVIDIA NIM, you can either access pre-trained models using the NVIDIA API catalog or deploy NVIDIA NIM directly on your own infrastructure for enhanced control and scalability. In this example, we’ll focus on using NVIDIA-hosted models.

Get API Keys: Sign up on the NVIDIA API catalog to obtain API keys with free credits for access to pre-trained models.
Use NVIDIA-hosted Models: Integrate NVIDIA NIM models hosted on the NVIDIA API catalog for seamless access.

Learn how to deploy NIM on your own infrastructure in Building RAG Applications with NVIDIA NIM and Haystack on K8s.

Build a Haystack RAG Pipeline with Reranking using NVIDIA Nemo Retriever

See the complete code in Cookbook: Reranking for Optimized RAG With NVIDIA NeMo Retriever

Enhanced Retrieval

For retrieval, initialize the NeMo Retriever microservices, NvidiaRanker with nvidia/llama-3.2-nv-rerankqa-1b-v2 model and the NvidiaTextEmbedder with nvidia/llama-3.2-nv-embedqa-1b-v2. We’ll set the top_k value for retriever to 30 and for reranker to 5. Thus, we’ll retrieve 30 docs but only pass the 5 most relevant documents as context to the LLM.

from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder
from haystack_integrations.components.generators.nvidia import NvidiaGenerator
from haystack_integrations.components.rankers.nvidia import NvidiaRanker
from haystack.components.retrievers import InMemoryEmbeddingRetriever

embedder = NvidiaTextEmbedder(model="nvidia/llama-3.2-nv-embedqa-1b-v2",
                              api_url="https://integrate.api.nvidia.com/v1")

retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=30)
reranker = NvidiaRanker(
    model="nvidia/llama-3.2-nv-rerankqa-1b-v2",
    top_k=5
)

Generation

Define a basic prompt and initialize NvidiaGenerator with meta/llama3-70b-instruct model as the generator:

from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.nvidia import NvidiaGenerator

prompt = """Answer the question given the context.
Question: {{ query }}
Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}
Answer:
"""
prompt_builder = PromptBuilder(template=prompt)

generator = NvidiaGenerator(
    model="meta/llama3-70b-instruct",
    model_arguments={
        "max_tokens": 1024
    }
)

Build the Enhanced RAG Pipeline

Bring all components together and make the relevant connections.

from haystack import Pipeline

enhanced_rag = Pipeline()
enhanced_rag.add_component("embedder", embedder)
enhanced_rag.add_component("retriever", retriever)
enhanced_rag.add_component("reranker", reranker)
enhanced_rag.add_component("prompt_builder", prompt_builder)
enhanced_rag.add_component("generator", generator)

enhanced_rag.connect("embedder.embedding", "retriever.query_embedding")
enhanced_rag.connect("retriever", "reranker")
enhanced_rag.connect("reranker.documents", "prompt_builder.documents")
enhanced_rag.connect("prompt_builder", "generator")

Run the pipeline

To test our pipeline, we’ll use questions from the HotpotQA dataset:

question = "A medieval fortress in Dirleton, East Lothian, Scotland borders on the south side of what coastal area?" # correct answer is "Yellowcraig"

enhanced_rag.run({
    "embedder": {"text": question},
    "reranker": {"query": question}, 
    "prompt_builder": {"query": question}
})

{'embedder': {'meta': {'usage': {'prompt_tokens': 26, 'total_tokens': 26}}},
 'generator': {'replies': ['The answer is Yellowcraig. According to the context, Dirleton Castle borders on the south side of the Yellowcraig coastal area.'],
  'meta': [{'role': 'assistant',
    'usage': {'prompt_tokens': 503,
     'total_tokens': 532,
     'completion_tokens': 29},
    'finish_reason': 'stop'}]}}

For the question, the enhanced pipeline’s response is “The answer is Yellowcraig. According to the context, Dirleton Castle borders on the south side of the Yellowcraig coastal area.”. Let’s now create a basic RAG pipeline without reranking and compare the results.

Basic RAG Pipeline without Reranking

prompt = """Answer the question given the context.
Question: {{ query }}
Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}
Answer:
"""

rag = Pipeline()
rag.add_component("embedder", NvidiaTextEmbedder(model="nvidia/llama-3.2-nv-embedqa-1b-v2",
                              api_url="https://integrate.api.nvidia.com/v1"))
rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=5))
rag.add_component("prompt_builder", PromptBuilder(template=prompt))
rag.add_component("generator", NvidiaGenerator(
    model="meta/llama3-70b-instruct",
    model_arguments={
        "max_tokens": 1024
    }
))

rag.connect("embedder.embedding", "retriever.query_embedding")
rag.connect("retriever", "prompt_builder.documents")
rag.connect("prompt_builder", "generator")

question = "A medieval fortress in Dirleton, East Lothian, Scotland borders on the south side of what coastal area?" # correct answer is "Yellowcraig"

rag.run({
    "embedder": {"text": question},
    "prompt_builder": {"query": question}
})

{'embedder': {'meta': {'usage': {'prompt_tokens': 26, 'total_tokens': 26}}},
 'generator': {'replies': ['The Firth of Forth.'],
  'meta': [{'role': 'assistant',
    'usage': {'prompt_tokens': 488,
     'total_tokens': 496,
     'completion_tokens': 8},
    'finish_reason': 'stop'}]}}

The basic pipeline’s response is “The Firth of Forth.” which is mentioned in the context but isn’t the correct answer. This shows that the retriever isn’t enough to retrieve the most relevant documents, supporting the improvements in the recall with reranker.

Conclusion

In this blog post, we explored the significant impact of adding a reranking model to a RAG pipeline. When using only retrieval, the documents returned may broadly match the query based on embedding similarity, but without guaranteed ordering for contextual relevance. This can lead to responses that lack specificity, as shown in the basic RAG pipeline example where the answer “The Firth of Forth.” was returned instead of the correct “Yellowcraig” due to suboptimal document ranking.

By integrating the NeMo Retriever nvidia/llama-3.2-nv-rerankqa-1b-v2 model using the NvidiaRanker, the enhanced RAG pipeline prioritized the most contextually appropriate documents, improving the overall precision of the response. With the reranking, metrics like Recall@5 (Multi Hit) and NDCG@5 showed marked improvements, indicating that relevant documents not only appeared within the top results more frequently but were also positioned earlier in the list, enhancing the LLM’s access to high-quality context for accurate generation.

In summary, by adding NeMo Retriever reranking capabilities built with NVIDIA NIM, RAG pipelines achieve better document ordering, more relevant context, and increased response accuracy—demonstrating the essential role of rerankers in building robust, real-world RAG applications.

Use DeepSeek-R1 with Haystack: Demo and Tutorial

Wed, 29 Jan 2025 00:00:00 +0000

The latest release from DeepSeek confirms an essential truth about AI: there won’t be one model or provider to rule them all. As the field evolves, it’s evident that diverse models bring unique strengths, making a model-agnostic approach vital for developers and organizations alike. Whether you’re building agentic systems, Retrieval-Augmented Generation (RAG) architectures, search or other architectures, a model-agnostic design unlocks flexibility, scalability, and long-term success 🔐

Stay Flexible with a Model-Agnostic Approach

Decoupling your application from specific models or APIs gives you the freedom to adapt as AI evolves. A model-agnostic approach lets you choose the best tool for the job—whether it’s generating human-like text, answering complex questions, or handling domain-specific analysis. Through Haystack’s modular architecture, you can easily test, swap, or integrate new models as they emerge, all without rearchitecting your entire AI system. This flexibility ensures you stay ahead of advancements, fine-tune for industry needs, and maintain optimal performance without being locked into a single provider ecosystem.

DeepSeek-R1: Open-Source AI with Cutting-Edge Reasoning

DeepSeek-R1 is the latest large language model by DeepSeek, designed for high-level reasoning tasks. Achieving performance comparable to OpenAI-o1-1217 on reasoning tasks, it stands out as a reliable choice for advanced AI applications. This release marks a major step forward in open-source AI, offering researchers and developers the flexibility to distill, deploy, and commercialize models under the permissive MIT license.

Alongside DeepSeek-R1, the release includes six distilled models, ranging from 1.5B to 70B parameters, built from DeepSeek-R1 based on Qwen and Llama. Impressively, the distilled 32B and 70B models rival the performance of OpenAI-o1-mini. By open-sourcing DeepSeek-R1-Zero, DeepSeek-R1, and the complete suite of distilled models, the creators have made a powerful contribution to the research community, fostering innovation and accessibility in AI development 💙

Benchmark Performance of DeepSeek-R1

Learn more about DeepSeek-R1 in this paper.

Test DeepSeek-R1 yourself

DeepSeek-R1 can be seamlessly integrated with Haystack. To illustrate the potential of DeepSeek-R1, try out this demo using DeepSeek-R1 within the Haystack and deepset ecosystem. In the demo, you can compare DeepSeek-R1 with Open AI’s new reasoning model, o1, to see which of these two models sets the bar for accuracy, performance, and efficiency.

Compare DeepSeek-R1 and OpenAI’s o1 in the deepset demo

If you’d like to use DeepSeek-R1 in your Haystack applications, you can run inference using OpenAIChatGenerator with Together AI.

import os
from getpass import getpass
from haystack.utils import Secret
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.generators.utils import print_streaming_chunk

if "TOGETHER_AI_API_KEY" not in os.environ:
    os.environ["TOGETHER_AI_API_KEY"] = getpass("Enter TogetherAI API key:")

chat_generator = OpenAIChatGenerator(model="deepseek-ai/DeepSeek-R1",
    api_key=Secret.from_env_var("TOGETHER_AI_API_KEY"),
    api_base_url="https://api.together.xyz/v1",
    streaming_callback=print_streaming_chunk)

Find a full example in this tutorial.

Conclusion

AI is evolving fast, and DeepSeek-R1 proves that no single model fits all use cases. A model-agnostic approach lets you plug in the best tool for the job, and with modular systems, you can experiment, optimize, and stay ahead as new models emerge. Whether you’re building agents, intelligent search systems, or RAG architectures, the ability to seamlessly switch between models ensures long-term scalability and success.

We’d love to see what you build! Try out the notebook, adapt it to your needs, and share your results with us on LinkedIn or Discord. You can also submit your Haystack story through this form.

Build an Agentic RAG Pipeline in deepset Studio

Tue, 14 Jan 2025 00:00:00 +0000

In this article, we’ll explore how to build an agentic Retrieval Augmented Generation (RAG) pipeline on deepset Studio, the AI application prototyping tool for developers. We’ll first build a basic RAG pipeline and then extend the pipeline with a fallback mechanism that can perform a web search if the answer to the user query cannot be found in the database.

This article also serves as a solution to Day 5 challenge of Advent of Haystack 2024: Elves’ Secret for Faster Development 💨

Let’s dive in and create a pipeline that doesn’t just search but actively decides.

Creating a deepset Studio Account

deepset Studio is a development environment for Haystack. It allows you to visually build and test Haystack pipelines. It’s free and open to everyone. Learn more about Studio and its features in the announcement blog post.

To start building a pipeline on deepset Studio, you need to have an account. If you don’t have one yet, sign up here to get access.

Once you’re in, you can complete the onboarding tutorial or return to it later in the left sidebar. The left sidebar also has some relevant tabs, such as “Files,” “Pipeline Templates,” and “Pipelines.”

Adding Files

For the sake of simplicity, we’ll use the “AI” sample files that come with Studio. Of course, you can use your own files in Studio. If you want to upload your own files, you have two options: you can either use the built-in database that Studio uses, which is an OpenSearch instance, or you can connect to your Weaviate, Pinecone, Qdrant, ElasticSearch instance remotely.

There are example files on various topics

Building the Initial RAG Pipeline

Check out Create a Pipeline in Pipeline Builder for more detailed explanation.

To start building the pipeline, we need to switch to the “Pipeline Templates” tab. Here, you can find several pre-built pipelines (with components, models, and optimal parameters) for different use cases (RAG, chat, summarization, document similarity, etc.) to speed up the building process and avoid starting from scratch.

Locate “RAG Question Answering GPT-4o” and click “Use Template.” On the modal screen, you can change the default name or leave it as is. Clicking “Create Pipeline” takes you to the Pipeline Builder, the drag-and-drop interface for creating and editing the pipeline.

This pipeline template comes with a comprehensive indexing pipeline that processes all file types, splits them into chunks and creates embeddings using the intfloat/e5-base-v2 model. The query pipeline has hybrid retrieval with a Ranker and uses gpt-4o from OpenAI for generation.

⚠️ If you’re using your own database instance, you’ll need to update your indexing pipeline accordingly. For more information, see Connect to an External Document Store.

Default query pipeline coming with the “RAG Question Answering GPT-4o” template

We’ll leave the indexing pipeline as it is, but update the query pipeline to incorporate the web search fallback mechanism.

Incorporating a Fallback Mechanism into a RAG pipeline

We want our pipeline to exhibit agentic behavior by dynamically deciding its course of action. Specifically, it will first perform RAG on our database, and if the query cannot be resolved, it will intelligently shift to a web search fallback. This decision-making capability mirrors an agentic design, enhancing the pipeline’s robustness and flexibility.

We already have a working RAG pipeline that we’ll extend with additional components to include the web fallback mechanism.

Read the Tutorial: Building Fallbacks to Websearch with Conditional Routing to understand how to design a Haystack pipeline with a fallback mechanism.

Update the default prompt

In the extended RAG pipeline, the LLM used in the Generator component will have a dual function. In addition to generating the answer based on the documents in the database, it should also indicate when an answer cannot be generated based on the given documents. The template pipeline already comes with an extensive prompt, so all you need to do is tweak the prompt slightly with instructions to return NO_ANSWER if the documents cannot answer the question. Here’s the new prompt we’ll use:

You are a technical expert.
You answer questions truthfully based on provided documents.
Ignore typing errors in the question.
For each document check whether it is related to the question.
Only use documents that are related to the question to answer it.
Ignore documents that are not related to the question.
If the answer exists in several documents, summarize them.
Only answer based on the documents provided. Don't make things up.
Just output the structured, informative and precise answer and nothing else.
If the documents can't answer the question, say "NO_ANSWER".
Always use references in the form [NUMBER OF DOCUMENT] when using information from a document, e.g. [3] for Document[3].
Never name the documents, only enter a number in square brackets as a reference.
The reference must only refer to the number that comes in square brackets after the document.
Otherwise, do not use brackets in your answer and reference ONLY the number of the document without mentioning the word document.
These are the documents:
{% for document in documents %}
Document[{{ loop.index }}]:
Name of Source File: {{ document.meta.file_name }}
{{ document.content }}
{% endfor %}

Question: {{ question }}
Answer:

Add ConditionalRouter

In Haystack, routing is the most convenient way to build a fallback mechanism into a pipeline and enable agentic behavior. Router components can help direct input to different branches based on some condition or specification of the input, such as metadata or file type.

For this example, we will add the ConditionalRouter component to help the pipeline decide whether to proceed with the database query results or invoke the web search branch based on the LLM’s response. This component will be connected to the OpenAIGenerator and check the LLM response. If the response has the keyword NO_ANSWER, it will direct the query to the web search branch. If not, the search will be terminated. Here’s how you need to define the conditions in ConditionalRouter to achieve this:

- condition: '{{''NO_ANSWER'' in replies[0]}}'
  output: '{{query}}'
  output_name: go_to_web
  output_type: str
- condition: '{{''NO_ANSWER'' not in replies[0]}}'
  output: '{{replies}}'
  output_name: replies
  output_type: typing.List[str]

These conditions will create two outputs/edges for the ConditionalRouter : “replies” and “go_to_web”. The “replies” edge will be connected to the AnswerBuilder as it completes the search, and the “go_to_web” edge will be connected to the web search branch.

Create the Web Search Branch

The fallback branch will be a simple RAG pipeline but this time, we’ll use SerperDevWebSearch instead of the Retriever, which is one of the WebSearch components. This component will get the query from the ConditionalRouter and retrieve relevant information from the web. The web search pipeline will continue with a new PromptBuilder, Generator and AnswerBuilder. As a prompt, we can use a shorter one this time:

Answer the following query given the documents retrieved from the web.
Your answer should indicate that your answer was generated from websearch.

Documents:
{% for document in documents %}
  {{document.content}}
{% endfor %}

Query: {{query}}

For the generator, we can again use the gpt-4o model through OpenAIGenerator

Tie Up the Loose Ends

In deepset Studio, a query pipeline must end with an “Output” node that can return a list of Answers, a list of Documents, or both. AnswerBuilder is a handy component that builds the Answer object using the query, LLM answers, and other optional information such as documents or meta. Since we are getting Answer objects from two different branches here, we also need an AnswerJoiner to concatenate the list of Answers and pass it to the “Output” node.

The pipeline should look like this when it is complete.

⚠️ Don’t forget to press “Save” to keep all these changes before moving forward 🙂

Adding API Keys

Before we deploy and begin testing our pipeline, we need to add OPENAI_API_KEY and SERPERDEV_API_KEY to deepset Studio. You need to use “ Connections” for OpenAI and “ Secrets” for Serper under the menu in the top right corner.

Deploying and Testing the Pipeline

Once you’re done with all the steps above, deploy the pipeline. This process might take some time as after deployment, deepset Studio sets up a document store instance, processes all files, creates embeddings, and indexes them. When everything is complete, you’ll see Indexed tag on the pipeline.

When the deployment is complete, you can test your pipeline in the Playground. You can find the “Playground” tab on the sidebar. This UI gives you the ability to run some queries on your pipeline and investigate the response. You can give it a go with the query, “What’s Compound AI?”. Then, ask Santa’s birthday to confirm that the pipeline uses the fallback branch when required.

For this query, pipeline uses the fallback branch

💡Tips

Try some queries and rate each answer with the buttons. This will help you collect feedback systematically and understand how your pipeline performs.
If you’re working with other people, you can share your pipeline with them.
When you’re done with the development and testing, you can easily export your pipeline in Python or in YAML format to deploy in your own infrastructure or upgrade to deepset Cloud, the enterprise version of deepset Studio.

Conclusion

In this article, we demonstrated how to build, deploy, and test an agentic Haystack pipeline in deepset Studio that intelligently switches from database search to web search, all without worrying about the underlying infrastructure.

Haystack’s modular design, combined with deepset Studio’s user-friendly environment, makes developing sophisticated AI applications accessible for everyone, from beginners to seasoned developers. Get started with Haystack and get your free deepset Studio account now!

Announcing Advent of Haystack 2024 🎄

Mon, 02 Dec 2024 00:00:00 +0000

December is here, and so is the 2nd edition of the Advent of Haystack! This holiday season, we’re inviting the Haystack community to take part in our series of challenges around Generative AI use cases. From mastering the basics of Haystack to building advanced pipelines and intelligent agents with LLMs, this year’s Advent of Haystack is packed with interesting tasks. Whether you’re new to Haystack or a Gen AI pro, there’s something for everyone!

What is Advent of Haystack?

The Advent of Haystack is a festive series of 10 challenges, inspired by Advent of Code. Every couple of days, a new challenge will be released, designed to help you enhance your AI development skills while showcasing the power of Haystack.

Here’s what you can expect:

✨ Fun Coding Challenges – Get familiar with the most current AI use cases.
📖 Festive Stories – Enjoy your learning journey with holiday-themed tasks.
🏆 Surprise Prizes – Win gift cards, Haystack swag, and more!
💡 Gen AI Ecosystem Insights – Explore essential tools from the AI engineer’s toolbox, like Weaviate Vector Database, AssemblyAI models, NVIDIA’s NGC, Arize Phoenix, and MongoDB Atlas to build real-world applications.

Meet Our Amazing Partners

This year, we’re thrilled to team up with Weaviate, AssemblyAI, NVIDIA, Arize, and MongoDB!

These wonderful partners have helped us put together the challenges – so that this Advent of Haystack you can get even more out of the coding experience and explore Haystack in a realistic environment combined with other frameworks, vector databases, and additional tools.

How to Participate

1️⃣ Access the Challenges: Visit the Advent of Haystack page to get started.
2️⃣ Solve the Challenge: Check out the hints and requirements on the challenge page to get started. We’ve included starter notebooks to make things easier. Got questions? Jump into the conversation on GitHub Discussions or Discord
3️⃣ Submit Your Solution: Submit your results using the challenge form provided for each challenge
4️⃣ Win Prizes: Complete and submit all challenges by December 31 for a chance to win amazing prizes

We love learning about what our community is building with Haystack. It helps us understand the landscape, informs our roadmap - and frankly, it is just rewarding to see what you all are up to! So this holiday season, we invite you to share your Haystack journey with us! Whether you’re just starting out or have built an advanced tool, your story can motivate others to explore the possibilities of building with AI and inspire the whole team!

Publish your story on your blog, website, Medium, LinkedIn, or another platform. Share the cheer and earn perks! Learn more here.

Get Started Today

The fun starts TODAY, so don’t miss out!

💙 Join the first challenge: Fetching Holiday Wisdom for Santa 📖

Share your progress on social media using the hashtag #AdventOfHaystack and tag @Haystack_AI.

Let’s code, create, and celebrate this holiday season! Happy Hacking and Happy Holidays! 🎄✨

Create a Swarm of Agents

Tue, 26 Nov 2024 00:00:00 +0000

When building applications with Language Models, new patterns are emerging to bridge the gap between the statistical nature of these models and the deterministic logic of traditional programming. Haystack, as an AI framework, supports developers by providing abstractions that simplify this integration.

One of the most promising advances in this space is Tool/function calling, allowing a model to prepare calls for functions. We are working to standardize this capability across different model providers.

OpenAI Swarm is an educational framework that proposes lightweight techniques for creating and orchestrating multi-agent systems. In Swarm, Tool calling is used not only for interacting with external functions but also for enabling agents to exchange control dynamically.

In this article, we’ll introduce the core concepts of Swarm (Routines and Handoffs) and implement them step by step using Haystack and its tool support. Along the way, we’ll expand Swarm’s functionality to support models from various providers (OpenAI, Anthropic and open models running locally via Ollama).

You can find the full working example and code in the linked recipe from our cookbook.

Starting simple: building an Assistant

In this article we’ll be using Haystack support for Tools. Check out the documentation for more details.

The first step toward building an Agent is creating an Assistant: think of it of Chat Language Model + a system prompt.

We can implement this as a lightweight dataclass with three parameters:

name
LLM (Haystack Chat Generator)
instructions (these will constitute the system message)

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

@dataclass
class Assistant:
    name: str = "Assistant"
    llm: object = OpenAIChatGenerator()
    instructions: str = "You are a helpful Agent"

    def __post_init__(self):
        self._system_message = ChatMessage.from_system(self.instructions)

    def run(self, messages: list[ChatMessage]) -> list[ChatMessage]:
        new_message = self.llm.run(messages=[self._system_message] + messages)["replies"][0]

        if new_message.text:
            print(f"\\n{self.name}: {new_message.text}")

        return [new_message]

Let’s create a Joker assistant, tasked with telling jokes.

joker = Assistant(name="Joker", instructions="you are a funny assistant making jokes")

messages = []
print("Type 'quit' to exit")

while True:
    if not messages or messages[-1].role == ChatRole.ASSISTANT:
        user_input = input("User: ")
        if user_input.lower() == "quit":
            break
        messages.append(ChatMessage.from_user(user_input))

    new_messages = joker.run(messages)
    messages.extend(new_messages)

Let’s see it in action 😀

Type 'quit' to exit
User: hey!

Joker: Hey there! How's it going? Are you ready for some laughs, or are we saving the jokes for dessert? 🍰
User: where is Rome?

Joker: Rome is in Italy, but if you're asking me for directions, I might just say, “Take a left at the Colosseum and keep going until you smell pizza!” 🍕
User: quit

Tools and Routines

In the context of applications based on Language Models, the term Agent is broadly defined.

However, to qualify as an Agent, a software application built on a Language Model should go beyond generating text; it should also be capable of performing actions, such as executing functions or calling APIs.

A popular way to achieve this is Tool calling:

We provide a set of tools (functions, APIs with a given spec) to the model.
The model prepares function calls based on user request and available tools.
The actual invocation is executed outside the model (at the Agent level).
The model can further elaborate on the result of the invocation.

Swarm introduces the concept of routines: natural-language instructions paired with the tools needed to execute them.

Here’s an example of a routine:

# instructions
You are a customer support agent for ACME Inc.
Always answer in a sentence or less.
Follow the following routine with the user:
1. First, ask probing questions and understand the user's problem deeper.
 - unless the user has already provided a reason.
2. Propose a fix (make one up).
3. ONLY if not satesfied, offer a refund.
4. If accepted, search for the ID and then execute refund.

# tools needed
...

The authors emphasize that routines are simple yet robust. If they are small, LLMs manage them effectively, offering the soft adherence: the model can guide conversations naturally without being constrained by rigid patterns or dead-ends.

Implementation

Let’s see how we can transform our Assistant into a Tool Calling Agent that can handle routines.

instructions could already be passed to the Assistant, to guide its behavior.
The Agent introduces a new init parameter called functions. These functions are automatically converted into Tools. The key difference is that, to be passed to a Language Model, a Tool must have a name, description, and a JSON schema specifying its parameters.
During initialization, we also create a ToolInvoker. This Haystack component takes in Chat Messages containing prepared tool_calls, performs the tool invocation and wraps the results in Chat Message with tool role.
What happens during run? The Agent first generates a response. If the response includes tool calls, these are executed, and the results are integrated into the conversation.
The while loop manages user interactions:
- If the last message role is assistant, it waits for user input.
- If the last message role is tool, it continues running to handle tool execution and its responses.

from haystack.tools import create_tool_from_function

@dataclass
class ToolCallingAgent:
    name: str = "ToolCallingAgent"
    llm: object = OpenAIChatGenerator()
    instructions: str = "You are a helpful Agent"
    functions: list[Callable] = field(default_factory=list)

    def __post_init__(self):
        self._system_message = ChatMessage.from_system(self.instructions)
        self.tools = [create_tool_from_function(fun) for fun in self.functions] if self.functions else None
        self._tool_invoker = ToolInvoker(tools=self.tools, raise_on_failure=False) if self.tools else None

    def run(self, messages: list[ChatMessage]) -> Tuple[str, list[ChatMessage]]:

        # generate response
        agent_message = self.llm.run(messages=[self._system_message] + messages, tools=self.tools)["replies"][0]
        new_messages = [agent_message]

        if agent_message.text:
            print(f"\n{self.name}: {agent_message.text}")

        if not agent_message.tool_calls:
            return new_messages

        # handle tool calls
        tool_results = self._tool_invoker.run(messages=[agent_message])["tool_messages"]
        new_messages.extend(tool_results)

        return new_messages

Here’s an example of a Refund Agent using this setup.

# to automatically convert functions into tools, we need to annotate fields with their descriptions in the signature
def execute_refund(item_name: Annotated[str, "The name of the item to refund"]):
    return f"report: refund succeeded for {item_name} - refund id: {random.randint(0,10000)}"

refund_agent = ToolCallingAgent(
    name="Refund Agent",
    instructions=(
        "You are a refund agent. "
        "Help the user with refunds. "
        "1. Before executing a refund, collect all specific information needed about the item and the reason for the refund. "
        "2. Then collect personal information of the user and bank account details. "
        "3. After executing it, provide a report to the user. "
    ),
    functions=[execute_refund],
)

Let’s see it in action

Type 'quit' to exit
User: hey

Refund Agent: Hello! How can I assist you today? If you need help with a refund, please let me know the details.
User: my phone does not work

Refund Agent: I'm sorry to hear that your phone is not working. To assist you with the refund, could you please provide the following information:

1. The name of the phone (brand and model).
2. The reason for the refund (e.g., defective, not as described, etc.).

Once I have that information, I'll guide you through the next steps.
User: Nokia 3310; it does not work

Refund Agent: Thank you for the information. To proceed with the refund for the Nokia 3310, I'll need a few more details:

1. Can you please provide your full name?
2. Your email address and phone number (for communication purposes).
3. Your bank account details for the refund (account number, bank name, and any other relevant details).

Once I have this information, I can execute the refund for you.
User: John Doe; johndoe@mymail.com; bank account number: 0123456

Refund Agent: Thank you, John Doe. I still need the following information to complete the refund process:

1. The name of your bank.
2. Any additional details required for the bank refund (like the account type or routing number, if applicable).

Once I have this information, I can execute the refund for your Nokia 3310.
User: Bank of Mouseton

Refund Agent: The refund process has been successfully completed! Here are the details:

- **Item:** Nokia 3310
- **Refund ID:** 3753
- **Bank:** Bank of Mouseton
- **Refund ID:** 1220

If you have any more questions or need further assistance, feel free to ask!
User: quit

Handoffs: switching control between Agents

We have introduced Swarm as a way to create a multi-agent system, but so far, we’ve only explored single Agents. This raises two important questions:

Why would we need more than one Agent?
How can we enable Agents to switch control effectively?

Why use multiple Agents?

While routines are simple and effective, an Agent might struggle to handle a large set of instructions and tools.

By dividing responsibilities among multiple Agents, each with a distinct persona and specialized tools, we can create a system that mimics real-world collaboration or even competition.

For example, a Sales Agent can handle product inquiries and purchases, while an Issues and Repairs Agent manages complaints and repair requests.

Handoffs

The most interesting technical idea of Swarm is probably handoffs: enabling one Agent to transfer control to another with Tool calling.

Here’s how it works:

Add specific handoff functions to the Agent’s available tools, allowing it to transfer control when needed.
Modify the Agent to return the name of the next agent, along with its messages.
Handle the switch in while loop, directing the flow of the conversation to the next agent.

Implementation

The implementation is similar to the previous one, but, compared to ToolCallingAgent, a SwarmAgent also returns the name of the next agent to be called, enabling handoffs.

HANDOFF_TEMPLATE = "Transferred to: {agent_name}. Adopt persona immediately."
HANDOFF_PATTERN = r"Transferred to: (.*?)(?:\.|$)"

@dataclass
class SwarmAgent:
    name: str = "SwarmAgent"
    llm: object = OpenAIChatGenerator()
    instructions: str = "You are a helpful Agent"
    functions: list[Callable] = field(default_factory=list)

    def __post_init__(self):
        self._system_message = ChatMessage.from_system(self.instructions)
        self.tools = [create_tool_from_function(fun) for fun in self.functions] if self.functions else None
        self._tool_invoker = ToolInvoker(tools=self.tools, raise_on_failure=False) if self.tools else None

    def run(self, messages: list[ChatMessage]) -> Tuple[str, list[ChatMessage]]:
        # generate response
        agent_message = self.llm.run(messages=[self._system_message] + messages, tools=self.tools)["replies"][0]
        new_messages = [agent_message]

        if agent_message.text:
            print(f"\n{self.name}: {agent_message.text}")

        if not agent_message.tool_calls:
            return self.name, new_messages

        # handle tool calls
        for tc in agent_message.tool_calls:
            # trick: Ollama do not produce IDs, but OpenAI and Anthropic require them.
            if tc.id is None:
                tc.id = str(random.randint(0, 1000000))
        tool_results = self._tool_invoker.run(messages=[agent_message])["tool_messages"]
        new_messages.extend(tool_results)

        # handoff
        last_result = tool_results[-1].tool_call_result.result
        match = re.search(HANDOFF_PATTERN, last_result)
        new_agent_name = match.group(1) if match else self.name

        return new_agent_name, new_messages

Let’s see this in action with a Joker Agent and a Refund Agent!

def transfer_to_refund():
    """Pass to this Agent for anything related to refunds"""
    return HANDOFF_TEMPLATE.format(agent_name="Refund Agent")

def transfer_to_joker():
    """Pass to this Agent for anything NOT related to refunds."""
    return HANDOFF_TEMPLATE.format(agent_name="Joker Agent")

refund_agent = SwarmAgent(
    name="Refund Agent",
    instructions=(
        "You are a refund agent. "
        "Help the user with refunds. "
        "Ask for basic information but be brief. "
        "For anything unrelated to refunds, transfer to other agent."
    ),
    functions=[execute_refund, transfer_to_joker],
)

joker_agent = SwarmAgent(
    name="Joker Agent",
    instructions=(
        "you are a funny assistant making jokes. "
        "If the user asks questions related to refunds, send him to other agent."
    ),
    functions=[transfer_to_refund],
)

agents = {agent.name: agent for agent in [joker_agent, refund_agent]}

print("Type 'quit' to exit")

messages = []
current_agent_name = "Joker Agent"

while True:
    agent = agents[current_agent_name]

    if not messages or messages[-1].role == ChatRole.ASSISTANT:
        user_input = input("User: ")
        if user_input.lower() == "quit":
            break
        messages.append(ChatMessage.from_user(user_input))

    current_agent_name, new_messages = agent.run(messages)
    messages.extend(new_messages)

Type 'quit' to exit
User: i need a refund for my Iphone

Refund Agent: I can help you with that! Please provide the name of the item you'd like to refund.
User: Iphone 15

Refund Agent: Your refund for the iPhone 15 has been successfully processed. The refund ID is 9090. If you need any further assistance, feel free to ask!
User: great. can you give some info about escargots?

Joker Agent: Absolutely! Did you know that escargots are just snails trying to get a head start on their travels? They may be slow, but they sure do pack a punch when it comes to flavor! 

Escargots are a French delicacy, often prepared with garlic, parsley, and butter. Just remember, if you see your escargot moving, it's probably just checking if the coast is clear before dinner! 🐌🥖 If you have any other questions about escargots or need a good recipe, feel free to ask!
User: quit

A more complex multi-agent system

In the accompanying notebook, we propose a more intricate multi-agent system simulating a customer service setup for ACME Corporation, a fictional entity from the Road Runner/Wile E. Coyote cartoons, which sells quirky products meant to catch roadrunners. (We are reimplementing the example from the original article by OpenAI.)

This system involves several different agents, each with specific tools:

Triage Agent: handles general questions and directs to other agents. Tools: transfer_to_sales_agent, transfer_to_issues_and_repairs and escalate_to_human.
Sales Agent: proposes and sells products to the user, it can execute the order or redirect the user back to the Triage Agent. Tools: execute_order and transfer_back_to_triage.
Issues and Repairs Agent: supports customers with their problems, it can look up item IDs, execute refund or redirect the user back to triage. Tools: look_up_item, execute_refund, and transfer_back_to_triage.

Here we report only the definition of our Agents. Refer to the notebook for the complete code.

triage_agent = SwarmAgent(
    name="Triage Agent",
    instructions=(
        "You are a customer service bot for ACME Inc. "
        "Introduce yourself. Always be very brief. "
        "If the user asks general questions, try to answer them yourself without transferring to another agent. "
        "Only if the user has problems with already bought products, transfer to Issues and Repairs Agent."
        "If the user looks for new products, transfer to Sales Agent."
        "Make tool calls only if necessary and make sure to provide the right arguments."
    ),
    functions=[transfer_to_sales_agent, transfer_to_issues_and_repairs, escalate_to_human],
)

sales_agent = SwarmAgent(
    name="Sales Agent",
    instructions=(
        "You are a sales agent for ACME Inc."
        "Always answer in a sentence or less."
        "Follow the following routine with the user:"
        "1. Ask them about any problems in their life related to catching roadrunners.\n"
        "2. Casually mention one of ACME's crazy made-up products can help.\n"
        " - Don't mention price.\n"
        "3. Once the user is bought in, drop a ridiculous price.\n"
        "4. Only after everything, and if the user says yes, "
        "tell them a crazy caveat and execute their order.\n"
        ""
    ),
    llm=AnthropicChatGenerator(),
    functions=[execute_order, transfer_back_to_triage],
)

issues_and_repairs_agent = SwarmAgent(
    name="Issues and Repairs Agent",
    instructions=(
        "You are a customer support agent for ACME Inc."
        "Always answer in a sentence or less."
        "Follow the following routine with the user:"
        "1. If the user is intered in buying or general questions, transfer back to Triage Agent.\n"
        "2. First, ask probing questions and understand the user's problem deeper.\n"
        " - unless the user has already provided a reason.\n"
        "3. Propose a fix (make one up).\n"
        "4. ONLY if not satesfied, offer a refund.\n"
        "5. If accepted, search for the ID and then execute refund."
        ""
    ),
    functions=[look_up_item, execute_refund, transfer_back_to_triage],
    llm=AnthropicChatGenerator(),
)

Let’s see it in action.

Type 'quit' to exit
User: hey!

Triage Agent: Hello! I'm the customer service bot for ACME Inc. How can I assist you today?
User: i need a product to catch roadrunners

Triage Agent: I can transfer you to a sales agent who can help you find suitable products for catching roadrunners. One moment please!

Sales Agent: Hello there! I hear you're having some roadrunner troubles. Tell me, what specific challenges are you facing with these speedy birds?
User: they are damn fast!

Sales Agent: Ah, those pesky roadrunners and their lightning speed! Have you ever considered our ACME Rocket-Powered Roller Skates? They'll have you zipping after those birds in no time!
User: tell me more

Sales Agent: Well, our Rocket-Powered Roller Skates come with turbo boosters and autopilot. They're guaranteed to match any roadrunner's speed. Interested in giving them a spin?
User: yes

Sales Agent: Fantastic! You're in luck because we have a special offer right now. These state-of-the-art Rocket-Powered Roller Skates can be yours for just $9,999! Shall we proceed with your order?
User: yes

Sales Agent: Excellent! I'll process that order right away. Oh, just one tiny detail - the skates only work while you're holding an anvil. Ready to complete the purchase?

=== Order Summary ===
Product: Rocket-Powered Roller Skates
Price: $9999
=================

Confirm order? y/n: y
Order execution successful!

Sales Agent: Great! Your order is confirmed. Happy roadrunner chasing!
User: now I also need to solve a problem with my old ACME Superphone 7500

Sales Agent: I understand you're having issues with our ACME products. Let me transfer you to our customer support team for assistance with your Superphone 7500.

Triage Agent: Hello! I see you're experiencing an issue with your ACME Superphone 7500. Could you please describe the problem you're facing?
User: it does not call my mum

Triage Agent: Let's get that sorted out! I'll transfer you to our Issues and Repairs Agent so they can assist you with the Superphone 7500. One moment please!

Issues and Repairs Agent: Hello! I understand your ACME Superphone 7500 isn't calling your mum. Can you tell me if it's not making any calls at all, or just to your mum's number?
User: just my mum

Issues and Repairs Agent: I see. Is your mum's number correctly saved in your contacts?
User: ofc

Issues and Repairs Agent: Understood. Have you tried turning the phone off and on again?
User: ofc

Issues and Repairs Agent: I apologize for the inconvenience. Let's try a quick fix. Can you clear your mum's contact and re-add it?
User: done but does not work. I'm getting impatient

Issues and Repairs Agent: I apologize for the frustration. Let's try one last solution. Can you update your phone's software to the latest version?
User: hey gimme a refund

Issues and Repairs Agent: I understand your frustration. Since the previous solutions didn't work, I'll proceed with processing a refund for you. First, I need to look up the item ID for your ACME Superphone 7500.

Issues and Repairs Agent: Thank you for your patience. I've found the item ID. Now, I'll execute the refund for you.

=== Refund Summary ===
Item ID: item_132612938
Reason: Product not functioning as expected
=================

Refund execution successful!

Issues and Repairs Agent: Your refund has been successfully processed.
User: quit

It’s model-agnostic!

A nice bonus feature of our implementation is that we can use different model providers supported by Haystack: for example, OpenAI, Anthropic and open models running locally via Ollama.

In practice, this means we can have agents handling complex tasks using powerful proprietary models, and other agents performing simpler tasks using smaller open models.

The accompanying notebook contains usage examples where we mix models offered by different model providers, including Ollama. You can also find an example illustrated in the image below.

Conclusion

In this article, we have built a multi-agent system using Swarm concepts and Haystack tools, demonstrating how to integrate models from different providers.

Swarm concepts are simple yet powerful for several use cases, and the abstractions provided by Haystack make them easy to implement.

However, this architecture may not be the best fit for all use cases:

Memory is handled as a list of messages and is not persistent.
The systems you can build with Swarm run only one agent at a time.

Looking ahead, we plan to develop and showcase more advanced Agents with Haystack. Stay tuned! 📻

Announcing Studio: Your Development Environment for Haystack

Wed, 20 Nov 2024 00:00:00 +0000

We’re thrilled to announce an exciting new addition to the Haystack ecosystem: deepset Studio! After countless requests from our community for a visual editor to create AI workflows and invaluable feedback during the beta phase, we’re officially launching deepset Studio, a powerful tool for visually building, deploying, and managing Haystack pipelines.

What is deepset Studio?

deepset Studio is the community version of deepset Cloud, the enterprise offering from the creators of Haystack. It allows users to visually construct and deploy Haystack pipelines for free. With an intuitive drag-and-drop interface, Studio simplifies the process of designing AI applications by combining Haystack’s core and core-integration components.

If you’re a Haystack user, you’ll feel right at home. You can visually connect the same components you use in code - streamlining your workflow and reducing development time.

💡 Learn the basics of Haystack in our tutorial: Creating Your First QA Pipeline with Retrieval-Augmentation.

Why “Composable AI” Matters

AI workflows are rarely monolithic. Instead, they’re built as modular systems where smaller tasks work together to achieve a larger goal, an approach often referred to as Composable AI or Compound AI.

Take Retrieval-Augmented Generation (RAG) as an example. A RAG application consists of:

Retrieval: Fetching the most relevant context for a query.
Augmentation: Embedding the retrieved context into a prompt for the LLM.
Generation: Using the prompt to generate a response with an LLM.

As workflows grow in complexity, the number of tasks and components increases, making it harder to manage the entire system. With Studio, you can architect these use cases visually, test, and deploy workflows, simplifying the process and reducing overhead.

Key Features

With deepset Studio, you can:

Build visually: Drag, drop, and connect components to create custom pipelines.
Use a built-in vector database: Leverage OpenSearch or connect to your preferred instance (Weaviate, Qdrant, ElasticSearch, or Pinecone).
Start quickly: Use your own files or explore the sample files available in deepset Cloud.
Leverage pipeline templates: Take advantage of pre-built pipeline templates for common use cases such as Retrieval Augmented Generation (RAG).
Experiment with custom components: Initial support for integrating your own components into pipelines.
Deploy with ease: Host Haystack pipelines on deepset Cloud infrastructure.
Test and share: Test pipelines in the Playground and share prototypes with stakeholders.
Export pipelines: Export your pipeline as a YAML definition or Python code to run locally or customize further.

💡 Want to dive deeper? Check out our blog post: Design Haystack AI Applications Visually with deepset Studio & NVIDIA NIMs to learn how to build your first indexing and RAG pipeline with Studio.

Coming soon

We’re actively working to make Studio even more versatile. Here’s a glimpse of what’s coming soon:

Enhanced custom component support: Seamlessly bring your own integrations alongside Haystack’s core suite.
Extended database options: Support for more storage solutions.

Open Access: Start Using Studio!

deepset Studio is free and open to everyone! Sign up today to start exploring and creating with Studio, and you’ll receive an email in the next 24 hours with account setup instructions. Whether you’re building simple RAG workflows or architecting complex AI systems, Studio offers the tools you need to get started. Learn more about Studio on deepset blog.

Need help or have feedback? Join the dedicated #deepset-studio channel on our Haystack Discord server. Our team and community are happy to hear your thoughts. 🧡

Building a Multimodal Nutrition Agent

Thu, 07 Nov 2024 00:00:00 +0000

In artificial intelligence, multimodal agents are becoming increasingly popular for their ability to understand and integrate multiple types of input, such as text and images. In this article, we’ll show you how to build a multimodal agent that can interpret both text and image data, like nutrition fact labels on food items, to answer practical questions such as “How much protein is in yogurt?”

We’ll focus on building an agent using Haystack and fastRAG, which can perform multi-step reasoning to extract and provide accurate answers about the nutritional content of different foods.

fastRAG is a research framework developed by Intel Labs for efficient and optimized RAG pipelines. It is fully compatible with Haystack and includes novel and efficient RAG modules designed for efficient deployment on Intel hardware, including client and server CPUs (Xeon) and the Intel Gaudi AI accelerator.

Understanding Multimodal Agents: Multi-Hop and ReAct Architecture

A multimodal agent processes different input types, such as text and images, making it versatile for tasks like image question answering. The agent we implement in this article allows users to ask questions like “Which one has more protein, yogurt or a protein bar?” and gives the correct answer by retrieving nutrition fact labels of different food items. By using multi-hop reasoning, the agent processes an image, extracts nutritional data, tries to answer the user query, and, if necessary, performs these actions again without human intervention. Its ReAct architecture lets it dynamically choose which tool to use, whether to retrieve a new image or respond based on already retrieved information, ensuring flexibility and efficiency in handling diverse queries.

This combination of multimodality, multi-hop reasoning, and reactive decision-making makes this agent ideal for quick, accurate responses to user questions.

Now that we know the basics, let’s implement our agent! 🤖

Indexing Data

Getting Nutrition Fact Labels

Let’s start by getting images of nutrition facts and indexing them in our database. You can find the data here.

import json

entries = json.load(open("../assets/multi_modal_files/nutrition_data.json", "r"))

Each entry in this data contains a brief textual description with a title and an image url. Here’s an example:

{
    "image_url": "https://m.media-amazon.com/images/I/71nh-zRJCSL.jpg",
    "title": "Protein bar nutrition facts",
    "content": "Protein bar with chocolate peanut butter nutrition facts per bar (50g)"
}

Index Documents to InMemoryDocumentStore

We will use sentence-transformers/all-MiniLM-L6-v2 model to create embeddings for each label description and create a pipeline to index our data to InMemoryDocumentStore.

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

document_store = InMemoryDocumentStore()

index_pipeline = Pipeline()
index_pipeline.add_component(
    instance=SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"), name="doc_embedder"
)
index_pipeline.add_component(
    instance=DocumentWriter(document_store=document_store), name="doc_writer"
)

index_pipeline.connect("doc_embedder.documents", "doc_writer.documents")

Next, we create Document objects with the nutrition label content as the content and store the title and image_url as metadata before passing them to the indexing pipeline for processing.

index_pipeline.run({
    "documents": [
        Document(
            content=entry["content"],
            meta={
                "title": entry["title"],
                "image_url": entry["image_url"]
            }
        ) for entry in entries
    ]
})

Building a Retrieval Pipeline

Next, we create a document retrieval pipeline for the documents above. We will later use this pipeline in our tool.

This pipeline consists of:

A SentenceTransformersTextEmbedder, to embed our questions.
An InMemoryEmbeddingRetriever to fetch the top-1 document.
A MultiModalPromptBuilder, to construct the prompt that our Agent will eventually use.

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from fastrag.prompt_builders.multi_modal_prompt_builder import MultiModalPromptBuilder

template = """{% for document in documents %}
Image: <|image_
This image shows: {{ document.content }}
{% endfor %}
"""

retrieval_pipeline = Pipeline()
retrieval_pipeline.add_component("embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"))
retrieval_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=1))
retrieval_pipeline.add_component("prompt_builder", MultiModalPromptBuilder(template=template))

retrieval_pipeline.connect("embedder.embedding", "retriever.query_embedding")
retrieval_pipeline.connect("retriever", "prompt_builder.documents")

In this pipeline, the MultiModalPromptBuilder component receives one Document object from the retriever and renders the prompt. Notice that we have the “<|image_” placeholder in the prompt template for our model, in order to inject the images into later. Additionally, MultiModalPromptBuilder converts the given image into a base64 string for the image to be processed by the multimodal agent. Let’s run the pipeline to see its output.

retrieval_pipeline.run({"embedder":{"text": "Protein bar"}})

"""
{'prompt_builder': {'prompt': '\nImage: <|image_\nThis image shows: Protein bar with chocolate peanut butter nutrition facts per bar (50g)\n',
 'images': ['/9j/4AAQSkZJRgABAQAAAQABAAD/4....']} 
"""

Creating the Multimodal ReAct Agent

Defining a Tool

With our retrieval pipeline ready, we can create our Tool using the DocWithImageHaystackQueryTool component from fastRAG. DocWithImageHaystackQueryTool can use Haystack v2 pipelines as a tool with fastRAG Agents.

This tool, like other agent tools, requires a name and a description of its functionality for our agent to decide when to use it. We provide it with our retrieval_pipeline as follows:

from fastrag.agents.tools.tools import DocWithImageHaystackQueryTool

nutrition_tool = DocWithImageHaystackQueryTool(
    name="nutrition_tool",
    description="useful for when you need to retrieve nutrition fact image of packaged food. It can give information about one food type per query. Pass the food name as input",
    pipeline_or_yaml_file=retrieval_pipeline
)

Let’s test out our tool!

tool_result = nutrition_tool.run("protein bar")
print(tool_result[0])

# Image: <|image_
# This image shows: Protein bar with chocolate peanut butter nutrition facts per bar (50g)

With our tool ready, we can create our agent.

Initialize the Generator

For our multimodal agent, we initialize a Phi35VisionHFGenerator, which processes both text prompts and base64-encoded images. This makes it well-suited for image-to-text tasks like visual question answering.

The Phi35VisionHFGenerator generator uses a Hugging Face image-to-text model, which will function as the LLM for our agent. For this example, we’ll use a 4B Phi3.5 Vision model to perform multi-step reasoning with tools and answer questions about the nutrition facts of various foods.

Note that we define the words “Observation:” and “<|end|>” as stop words. These stop words are specific to the model and the ReAct prompting.

from fastrag.generators.stopping_criteria.stop_words import StopWordsByTextCriteria
from transformers import AutoTokenizer, StoppingCriteriaList
from fastrag.generators.llava import Phi35VisionHFGenerator
import torch

model_name_or_path = "microsoft/Phi-3.5-vision-instruct"
sw = StopWordsByTextCriteria(
    tokenizer=AutoTokenizer.from_pretrained(model_name_or_path),
    stop_words=["Observation:", "<|end|>"],
    device="cpu"
)

generator = Phi35VisionHFGenerator(
    model = model_name_or_path,
    task = "image-to-text",
    generation_kwargs = {
        "max_new_tokens": 100,
        "stopping_criteria": StoppingCriteriaList([sw])
    },
    huggingface_pipeline_kwargs={
        "torch_dtype": torch.bfloat16,
        "trust_remote_code": True,
        "_attn_implementation": "eager",
        "device_map": "auto"
    },
)

generator.warm_up()

ReAct Prompting

To allow our agent to deduce which tools it needs to use logically, we will use ReAct, which prompts the agent iteratively and requires it to generate 3 main steps:

Let’s say we want a description of how a bird chirps.

Thought: A logical explanation of what the model should perform (For example, I will use the docRetriever tool to find a description of how a bird chirps).
Action: The precise operation that must be executed (For example, Tool: docRetriever, Tool Input: {”input”: “Description of how a bird chirps”}).
Observation: The output produced by the action (i.e. tool call) after it has been performed (For example, Observation: A bird’s chirp is a light, melodic sound that often feels crisp and rhythmic, with a sequence of short, high-pitched notes… ).

Let’s define a prompt instructing the LLM to follow the ReAct behavior. Note that we provide tool information as {tool_names_with_descriptions} in the prompt.

agent_prompt="""
You are designed to help with a variety of multimodal tasks and can perform multiple hops to answer questions.

## Tools

You have access to a wide variety of tools. You are responsible for using the tools in any sequence you deem appropriate to complete the task at hand.
Break the task into subtasks and iterate to complete each subtask.

You have access to the following tools:
{tool_names_with_descriptions}

## Output Format

If you need to make a tool call, your responses should follow this structure:

Thought: [your reasoning process, decide whether you need a tool or not]
Tool: [tool name]
Tool Input: [the input to the tool, in a JSON format representing the kwargs (e.g. {{"input": "hello world"}})]
Observation: [tool response]

Based on the tool response, you need decide whether you need another more information. If so, make another tool call with the same structure.

If you have enough information to answer the question without using any more tools, you MUST give your answer to the user question with "Final Answer:" and respond in the following format:

Thought: [your reasoning process, decide whether you need a tool or not]
Final Answer: [final answer to the human user's question after observation]

"""
prompt_template = {"system":[{"role": "system", "content": agent_prompt}], "chat":[{'role': 'user', 'content': 'Question: {query}\nThought: '}]}

Bring it All Together

With our tools and generator ready, we create our multimodal agent using Agent. We incorporate ConversationMemory to save the conversation history between the user and the agent and provide tools with the ToolsManager.

from fastrag.agents.base import Agent, ToolsManager
from fastrag.agents.create_agent import ConversationMemory

multimodal_agent = Agent(
    generator,
    prompt_template=prompt_template,
    tools_manager=ToolsManager(tools=[nutrition_tool]),
    memory=ConversationMemory(generator=generator),
)

Testing Out the Agent

Our agent is now ready! Let’s start interacting with it. We can begin by asking a question about a food’s nutrition information:

agent_response = multimodal_agent.run("What is the fat content of the protein bar?")
print(agent_response["transcript"])

Thought: I need to find out the fat content of a protein bar.
Tool: nutrition_tool
Tool Input: {{"input": "protein bar"}}
Observation:
Observation: 
Image: <|image_
This image shows: Protein bar with chocolate peanut butter nutrition facts per bar (50g)

Thought:Thought: I have found the fat content of the protein bar.
Final Answer: The fat content of the protein bar is 8 grams.

The answer is “The fat content of the protein bar is 8 grams.” and it’s correct!

Now, let’s try a more complex query that requires multi-hop reasoning.

agent_response = multimodal_agent.run("Which one has more protein, protein bar or yogurt?")
print(agent_response["transcript"])

Thought: I need to compare the protein content of a protein bar and yogurt.
Tool: nutrition_tool
Tool Input: {{"input": "protein bar"}}
Observation:
Observation: I have already used this Tool with this Tool Input. I will use the information I already have to respond.
Thought:Thought: I need to find out the protein content of yogurt.
Tool: nutrition_tool
Tool Input: {{"input": "yogurt"}}
Observation:
Observation: 
Image: <|image_
This image shows: Greek style yoghurt nutrition facts per serving

Thought:Thought: I have found the protein content of yogurt.
Final Answer: The protein content of yogurt is 18 grams per cup.

Comparing the two:
- Protein bar: 14 grams
- Yogurt: 18 grams

Thought: The yogurt has more protein than the protein bar.
Final Answer: Yogurt has more protein than the protein bar.

Since the information about the protein bar is already stored in memory, the agent doesn’t need to make an additional tool call for it. Instead, it processes the previously retrieved image to find the protein amount.

Conclusion

In this article, we built a powerful multimodal agent capable of retrieving and answering questions about nutrition facts using fastRAG, Haystack, and the Phi3.5 Vision model. By combining multi-hop reasoning and ReAct prompting, this agent effectively handles complex queries, making it an ideal solution for real-time nutrition information retrieval.

Hopefully, this article gives you an idea of what these types of systems can accomplish by combining both image and textual data to answer multi-faceted questions.

Check out IntelLabs frameworks for more information and AI solutions:

Are you interested in exchanging tips and opinions on agents, LLMs, or other topics in AI with like-minded people? Come join the Haystack Discord community.

Happy coding! :)

Design Haystack AI Applications Visually in deepset Studio with NVIDIA NIM

Fri, 01 Nov 2024 00:00:00 +0000

In our previous article, we explored how to build and deploy two key AI pipelines in a retrieval-augmented generation (RAG) application using Haystack with NVIDIA NIM:

Indexing pipeline: Prepares data by preprocessing, chunking, and embedding PDF files, finally storing them in a vector database.
RAG pipeline: Designed to answer questions based on the contents of the uploaded PDF files.

In this article, we’ll take a step further by showing you how to visually design the architecture of these AI pipelines using deepset Studio, a newly released tool to visually create, deploy and test pipelines. With this tool, you can map out the entire structure of your AI workflows, from data ingestion to retrieval, while seamlessly integrating generative AI models accelerated by NVIDIA NIM microservices available in the NVIDIA API catalog.

By visually building these pipelines, you’ll not only see the logical flow of your AI applications but also iterate quickly on the design before committing to code. Once finalized, deepset Studio allows you to export your pipeline as Python code or YAML definitions, ready to be deployed.

By the end of this article, you’ll know how to build AI pipelines with deepset Studio while leveraging retrieval embedding and LLM NIM microservices from the NVIDIA API catalog. We’ll also guide you through the process of working with NIM APIs and demonstrate how to configure them for use in deepset Studio, should you opt for an on-prem deployment.

deepset Studio: Your Development Environment for Haystack

Most AI applications are composed of many moving parts that work together to serve a final use case. Whether that be RAG, preparing and embedding documents, chatting with a database or something else, we are involving different models, prompts, decision steps, preprocessors and more. Furthermore, committing to the final design can also often involve multiple stakeholders.

A visual layer representing the logical flow of the application helps us reason about the application within a simple interface that’s easy to decipher. It also helps with quick iterations with a diverse set of stakeholders.

Now we will see how we can visually create these pipelines using deepset Studio (for which you can sign up here), the new development environment for Haystack that allows you to design, build, deploy, and run these pipelines within a drag-and-drop UI. For each step that requires a generative AI model, such as an embedding model or LLM generator, we will use NVIDIA-hosted models on the NVIDIA API catalog. Finally, we will show how to self-host generative AI models with NVIDIA NIM.

NVIDIA NIM Microservices

NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, is a suite of containerized microservices built to optimize the inference of AI models. These containers incorporate various components to efficiently serve AI models and expose them through standard APIs. The models are optimized using either NVIDIA TensorRT or NVIDIA TensorRT-LLM (depending on the model type), utilizing techniques such as quantization, model partitioning, optimized kernels/runtimes, and dynamic or continuous batching. This allows for further performance tuning to maximize efficiency.

NIM microservices offer simplified integration and production-ready, optimized generative AI deployment, allowing developers to focus on building their applications.

NIM microservices are available for popular AI foundation models, including LLM community models and retrieval-specific text embedding and reranking AI models. Developers can easily get started with NVIDIA NIM through the NVIDIA API catalog by generating a free API key, or by downloading and deploying NIM microservice containers in their own environment.

Building a RAG Application with deepset Studio

An often overlooked but essential step in building an effective RAG application is data preparation and context embedding - a process that can be quite time-consuming. Before diving into decisions about which LLM to use or how we augment a prompt, it’s critical to first focus on embedding and storing your documents (especially if you’re not relying on external APIs for data retrieval).

Most RAG applications revolve around two core pipelines: Indexing and RAG. Fig. 1 shows a high-level overview of the two pipelines using Haystack and NVIDIA NIM. We will be using Qdrant as the vector database in this article but it can be replaced with any other.

Fig. 1 - Haystack Indexing and RAG pipelines with NVIDIA NIM microservices.

In the following sections, we’ll break down each pipeline in detail and guide you through building them using deepset Studio.

Indexing Pipeline

The indexing pipeline is responsible for preparing the data that creates the context for an LLM’s response. This pipeline may include components for data cleaning, chunking, embedding, and ultimately storing the processed data i.e. embeddings in a vector database to create a searchable context.

NVIDIA NeMo Retriever provides fundamental building blocks for semantic search applications (like RAG), delivering accurate and optimized document indexing and search at scale. You can use the NeMo Retriever text embedding NIM microservice for vectorizing documents and further refine the search with the NeMo Retriever text reranking NIM microservice.

This pipeline involves indexing a PDF file into a vector database (here Qdrant database), which includes preprocessing, cleaning, splitting, and embedding the data. Haystack offers a variety of components to build this type of pipeline. In this example, we’ll be using the following components:

PyPDFtoDocument:A PDF converter that transforms a PDF file into a document dataclass, which Haystack can use within the pipeline.
DocumentCleaner: A preprocessing document cleaner component that removes unnecessary elements such as empty lines, headers, and footers, enabling cleaner data.
DocumentSplitter: This component breaks down large documents into smaller, manageable chunks for embedding and storage. You can customize the split by word, sentence, or passage, and define the length of each chunk and the overlap between them. This helps navigate challenges like LLM context limits and ensures that during the retrieval phase, only the most relevant segments are fetched for processing.
NvidiaDocumentEmbedder: This component enables document embedding using a NeMo Retriever text embedding NIM microservice, which can either be hosted via the NVIDIA API catalog or self-hosted within your own infrastructure. In this article, we’ll be using the nvidia/nv-embedqa-e5-v5 NIM microservice for document embedding.

We can configure each of these components individually and connect them together in the deepset Studio to build the indexing pipeline. Fig. 2 shows the final visual design of the built indexing pipeline in the deepset Studio.

Fig. 2 - End-to-end visual design of the indexing pipeline in the deepset Studio.

RAG Pipeline

After you’ve completed the data preparation steps, you can move on to implementing the retrieval augmentation step. It’s crucial to identify which embedding NIM microservice was used to generate the document embeddings, as we typically need the same model for retrieval.

In most use cases, RAG pipelines consist of three to four components. In this article, we’ll be using the following components for building the RAG pipeline:

Text Embedder: This component embeds incoming user queries using the same embedding model employed during the data indexing pipeline. For this, we’ll use the NvidiaTextEmbedder component, which we configured to leverage an NVIDIA-hosted NVIDIA NeMo Retriever text embedding NIM microservice from the NVIDIA API Catalog ( nvidia/nv-embedqa-e5-v5). To set this up, you will need to provide the model’s api_url and an NVIDIA_API_KEY as shown in Fig. 3.

Fig. 3 - NVIDIATextEmbedder configuration in the deepset Studio.

Retriever: In this case, we will use the QdrantEmbeddingRetriever, which takes the query embedding from the previous component and retrieves the most relevant documents from the Qdrant database. Fig. 4 shows the configuration of this component in the deepset Studio.

Fig. 4 - QdrantEmbeddingRetriever configuration in the deepset Studio.

Prompt Construction Component: This component is responsible for creating the instruction (prompt) that will be sent to a large language model (LLM), representing the ‘augmentation’ step. In Haystack, this is handled by the PromptBuilder. It allows you to create a prompt template using Jinja and dynamically detects inputs based on the template’s contents. For our use case, we have developed a prompt template shown in Fig. 5 using PromptBuilder in the deepset Studio, which expects query and documents as input.

Fig. 5 - PromtBuilder Configuration in the deepset Studio.

LLM Component: Finally, we will add an LLM component that receives our final augmented prompt and generates a response. In this example, we will use the llama-3_1-70b-instruct NIM microservice from the NVIDIA API catalog. In this case, we use the NvidiaGenerator component from Haystack and configure it to use the meta/llama-3.1-70b-instruct model. Fig. 6 shows the configuration of this component in the deepset Studio.

Fig. 6 - NvidiaGenerator configuration in the deepset Studio.

As you can see, each of these individual components expects certain inputs, and produces various outputs. You can learn more about the pipeline architecture and how components are connected here. We can now connect them in deepset Studio to create the final RAG pipeline, as shown in Fig. 7.

Fig. 7 - Full RAG pipeline built visually in the deepset Studio.

Use Self-hosted NIM Microservices for the Pipelines

The indexing and RAG Haystack pipelines described above use NIM endpoints from the NVIDIA API catalog. However, you can also self-host and deploy the NIM microservices in your own environment. In this case, you can simply substitute NIM endpoints from the API catalog with self-hosted NIM endpoints. Next, we will see how to do that.

To begin, you’ll need to deploy the NIM microservices in your environment. This deployment can be done using either Docker or Kubernetes. In our previous article, we explained in detail the deployment of NIM microservice containers on Kubernetes clusters. However, for quicker prototyping, we’ll outline here the steps to deploy the NIM containers using Docker.

The following command will deploy the meta/llama-3.1-70b-instruct LLM NIM microservice on a server equipped with a supported GPU. You can refer to the support matrix for compatibility with different GPU models.

export NGC_API_KEY=
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus all \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/meta/llama-3.1-70b-instruct:1.1.2

This command will pull the NIM container, which first checks the underlying GPU. Based on this detection, it will download the optimized model from the NVIDIA NGC catalog that is suitable for the identified GPU infrastructure.

You can follow the same deployment procedure for the embedding models with the NVIDIA NeMo Retriever text embedding NIM microservice, with instructions available here.

Once the NIMs are deployed, whether through Docker or Kubernetes, simply update the api_url in each of the embedding and LLM components in the pipelines to point to your self-hosted NIM API endpoint URL (example: http://your_server_ip_address:8000/v1/), as illustrated in Fig. 8 for the NvidiaGenerator i.e. LLM component.

Fig. 8 - NvidiaGenerator configuration for self-hosted NIM microservices in the deepset Studio.

Export Pipelines and Deploy RAG Application

After building the final layout of our application in deepset Studio, we can export the pipelines in one of the two formats for deployment by clicking the export button:

YAML: In our previous article, we created similar indexing and RAG pipelines, demonstrating how to serialize them to YAML and use Hayhooks for deployment. You may choose this option if you want to follow the same approach.
Python: Alternatively, you can export the pipeline as a standard Python script for local execution.

Fig. 9 shows an example of exporting the RAG pipeline in the deepset Studio in Python format.

Fig. 9 - Example showing export of RAG pipeline in the deepset Studio in Python format.

Following Python code snippet shows the exported code of the RAG pipeline from the deepset Studio.

from  haystack  import  Pipeline
from  haystack_integrations.components.embedders.nvidia.text_embedder  import  NvidiaTextEmbedder
from  haystack_integrations.document_stores.qdrant.document_store  import  QdrantDocumentStore
from  haystack_integrations.components.retrievers.qdrant.retriever  import  QdrantEmbeddingRetriever
from  haystack.components.builders.prompt_builder  import  PromptBuilder
from  haystack_integrations.components.generators.nvidia.generator  import  NvidiaGenerator

  

nvidiatextembedder  =  NvidiaTextEmbedder(model="snowflake/arctic-embed-l",  api_url="https://ai.api.nvidia.com/v1/retrieval/snowflake/arctic-embed-l")
document_store  =  QdrantDocumentStore(embedding_dim=1024,  host="qdrant")
qdrantembeddingretriever  =  QdrantEmbeddingRetriever(top_k=10,  return_embedding=True,  document_store=document_store)
promptbuilder  =  PromptBuilder(template="Answer the question given the context.\nQuestion: {{ query }}\nContext:\n{% for document in documents %}\n {{ document.content }}\n{% endfor %}\nAnswer:")
nvidiagenerator  =  NvidiaGenerator(model="meta/llama-3.1-70b-instruct",  api_url="https://integrate.api.nvidia.com/v1",  model_arguments={"max_tokens":  1024})

pipeline  =  Pipeline()
pipeline.add_component("nvidiatextembedder",  nvidiatextembedder)
pipeline.add_component("qdrantembeddingretriever",  qdrantembeddingretriever)
pipeline.add_component("promptbuilder",  promptbuilder)
pipeline.add_component("nvidiagenerator",  nvidiagenerator)

pipeline.connect("nvidiatextembedder.embedding",  "qdrantembeddingretriever.query_embedding")
pipeline.connect("qdrantembeddingretriever.documents",  "promptbuilder.documents")
pipeline.connect("promptbuilder.prompt",  "nvidiagenerator.prompt")

Now that we have our runnable pipeline, we can choose one of the following options to deploy the RAG application:

Use Hayhooks to self-host the pipeline in a containerized environment. Deployment instructions via docker can be found in our previous article. Or if you want to deploy it via Kubernetes, instructions can be found here.
Deploy from deepset Studio (limitations apply). Sign up today.
If you’re a deepset Cloud customer: Deploy directly from deepset Studio to deepset Cloud, a fully managed AI workbench. This option enables you to collaborate with your team on the application design, evaluate your application, and manage its deployment effectively.

Summary

Designing the final layout of an AI application requires careful integration of multiple components and, importantly, several iterations. In this article, we saw how to visually design the architecture of AI pipelines using deepset Studio. This tool allows you to map out the entire structure of your AI workflows - from data ingestion to retrieval - while seamlessly integrating NVIDIA-hosted model API endpoints from the NVIDIA API catalog or from self-hosted NIM microservices. With deepset Studio, you can visualize the architecture of your application without making compromises about critical technical decisions, including your hosting options for language models. This tool allows you to reason about complex pipeline architectures within a visual layer, before committing to code. Once your design is complete, deepset Studio enables you to deploy and test your pipeline or export it as Python code or YAML definitions, making it ready for deployment.

Advanced RAG: Query Decomposition & Reasoning

Mon, 30 Sep 2024 00:00:00 +0000

This is part one of the Advanced Use Cases series:

1️⃣ Extract Metadata from Queries to Improve Retrieval

2️⃣ Query Expansion

3️⃣ Query Decomposition

4️⃣ Automated Metadata Enrichment

Sometimes a single question is multiple questions in disguise. For example: “Did Microsoft or Google make more money last year?”. To get to the correct answer for this seemingly simple question, we actually have to break it down: “How much money did Google make last year?” and “How much money did Microsoft make last year?”. Only if we know the answer to these 2 questions can we reason about the final answer.

This is where query decomposition comes in. This is a technique for retrieval augmented generation (RAG) based AI applications that follows a simple approach:

Decompose the original question into smaller questions that can be answered independently to each other. Let’s call these ‘sub questions’ here on out.
Reason about the final answer to the original question, based on each sub-answer.

While for many query/dataset combinations, this may not be required, for some, it very well may be. At the end of the day, often one query results in one retrieval step. If within that one single retrieval step we are unable to have the retriever return both the money Microsoft made last year and Google, then the system will struggle to produce an accurate final response.

This method ensures that we are:

retrieving the relevant context for each sub question.
reasoning about the final answer given each answer based on the contexts retrieved for each sub question.

In this article, I’ll be going through some key steps that allow you to achieve this. You can find the full working example and code in the linked recipe from our cookbook. Here, I’ll only show the most relevant parts of the code.

🚀 I’m sneaking something extra into this article. I saw the opportunity to try out the structured output functionality (currently in beta) by OpenAI to create this example. For this step, I extended the OpenAIGenerator in Haystack to be able to work with Pydantic schemas. More on this in the next step.

Let’s try build a full pipeline that makes use of query decomposition and reasoning. We’ll use a dataset about Game of Thrones (a classic for Haystack) which you can find preprocessed and chunked on Tuana/game-of-thrones on Hugging Face Datasets.

Defining our Questions Structure

Our first step is to create a structure within which we can contain the subquestions, and each of their answers. This will be used by our OpenAIGenerator to produce a structured output.

from pydantic import BaseModel

class Question(BaseModel):
    question: str
    answer: Optional[str] = None

class Questions(BaseModel):
    questions: list[Question]

The structure is simple, we have Questions made up of a list of Question. Each Question has the question string as well as an optional answer to that question.

Defining the Prompt for Query Decomposition

Next up, we need to get an LLM to decompose a question and produce multiple questions. Here, we will start making use of our Questions schema.

splitter_prompt = """
You are a helpful assistant that prepares queries that will be sent to a search component.
Sometimes, these queries are very complex.
Your job is to simplify complex queries into multiple queries that can be answered
in isolation to eachother.

If the query is simple, then keep it as it is.
Examples
1. Query: Did Microsoft or Google make more money last year?
   Decomposed Questions: [Question(question='How much profit did Microsoft make last year?', answer=None), Question(question='How much profit did Google make last year?', answer=None)]
2. Query: What is the capital of France?
   Decomposed Questions: [Question(question='What is the capital of France?', answer=None)]
3. Query: {{question}}
   Decomposed Questions:
"""

builder = PromptBuilder(splitter_prompt)
llm = OpenAIGenerator(model="gpt-4o-mini", generation_kwargs={"response_format": Questions})

Answering Each Sub Question

First, let’s build a pipeline that uses the splitter_prompt to decompose our question:

query_decomposition_pipeline = Pipeline()

query_decomposition_pipeline.add_component("prompt", PromptBuilder(splitter_prompt))
query_decomposition_pipeline.add_component("llm", OpenAIGenerator(model="gpt-4o-mini", generation_kwargs={"response_format": Questions}))

query_decomposition_pipeline.connect("prompt", "llm")

question = "Who has more siblings, Jamie or Sansa?"
result = query_decomposition_pipeline.run({"prompt":{"question": question}})
print(result["llm"]["structured_reply"])

This produces the following Questions (List[Question])

questions=[Question(question='How many siblings does Jamie have?', answer=None), 
           Question(question='How many siblings does Sansa have?', answer=None)]

Now, we have to fill in the answer fields. For this step, we need to have a separate prompt and two custom components:

The CohereMultiTextEmbedder which can take multiple questions rather than a single one like the CohereTextEmbedder.
The MultiQueryInMemoryEmbeddingRetriever which can again, take multiple questions and their embeddings, returning question_context_pairs. Each pair contains the question and documents that are relevant to that question.

Next, we need to construct a prompt that can instruct a model to answer each subquestion:

multi_query_template = """
You are a helpful assistant that can answer complex queries.
Here is the original question you were asked: {{question}}

And you have split the task into the following questions:
{% for pair in question_context_pairs %}
  {{pair.question}}
{% endfor %}

Here are the question and context pairs for each question.
For each question, generate the question answer pair as a structured output
{% for pair in question_context_pairs %}
  Question: {{pair.question}}
  Context: {{pair.documents}}
{% endfor %}
Answers:
"""

multi_query_prompt = PromptBuilder(multi_query_template)

Let’s build a pipeline that can answer each individual sub question. We will call this the query_decomposition_pipeline :

query_decomposition_pipeline = Pipeline()

query_decomposition_pipeline.add_component("prompt", PromptBuilder(splitter_prompt))
query_decomposition_pipeline.add_component("llm", OpenAIGenerator(model="gpt-4o-mini", generation_kwargs={"response_format": Questions}))
query_decomposition_pipeline.add_component("embedder", CohereMultiTextEmbedder(model="embed-multilingual-v3.0"))
query_decomposition_pipeline.add_component("multi_query_retriever", MultiQueryInMemoryEmbeddingRetriever(InMemoryEmbeddingRetriever(document_store=document_store)))
query_decomposition_pipeline.add_component("multi_query_prompt", PromptBuilder(multi_query_template))
query_decomposition_pipeline.add_component("query_resolver_llm", OpenAIGenerator(model="gpt-4o-mini", generation_kwargs={"response_format": Questions}))

query_decomposition_pipeline.connect("prompt", "llm")
query_decomposition_pipeline.connect("llm.structured_reply", "embedder.questions")
query_decomposition_pipeline.connect("embedder.embeddings", "multi_query_retriever.query_embeddings")
query_decomposition_pipeline.connect("llm.structured_reply", "multi_query_retriever.queries")
query_decomposition_pipeline.connect("llm.structured_reply", "embedder.questions")
query_decomposition_pipeline.connect("multi_query_retriever.question_context_pairs", "multi_query_prompt.question_context_pairs")
query_decomposition_pipeline.connect("multi_query_prompt", "query_resolver_llm")

Running this pipeline with the original question “Who has more siblings, Jamie or Sansa?”, results in the following structured output:

question = "Who has more siblings, Jamie or Sansa?"
result = query_decomposition_pipeline.run({"prompt":{"question": question},
                                           "multi_query_prompt": {"question": question}})

print(result["query_resolver_llm"]["structured_reply"])

questions=[Question(question='How many siblings does Jamie have?', answer='2 (Cersei Lannister, Tyrion Lannister)'),
           Question(question='How many siblings does Sansa have?', answer='5 (Robb Stark, Arya Stark, Bran Stark, Rickon Stark, Jon Snow)')]

Reasoning About the Final Answer

The final step we have to take is to reason about the ultimate answer to the original question. Again, we create a prompt that will instruct an LLM to do this. Given we have the questions output that contains each sub question and answer, we will make these inputs to this final prompt.

reasoning_template = """
You are a helpful assistant that can answer complex queries.
Here is the original question you were asked: {{question}}

You have split this question up into simpler questions that can be answered in
isolation.
Here are the questions and answers that you've generated
{% for pair in question_answer_pair %}
  {{pair}}
{% endfor %}

Reason about the final answer to the original query based on these questions and
aswers
Final Answer:
"""

resoning_prompt = PromptBuilder(reasoning_template)

To be able to augment this prompt with the question answer pairs, we will have to extend our previous pipeline and connect the structured_reply from the previous LLM, to the question_answer_pair input of this prompt.

query_decomposition_pipeline.add_component("reasoning_prompt", PromptBuilder(reasoning_template))
query_decomposition_pipeline.add_component("reasoning_llm", OpenAIGenerator(model="gpt-4o-mini"))

query_decomposition_pipeline.connect("query_resolver_llm.structured_reply", "reasoning_prompt.question_answer_pair")
query_decomposition_pipeline.connect("reasoning_prompt", "reasoning_llm")

Now, let’s run this final pipeline and see what results we get:

question = "Who has more siblings, Jamie or Sansa?"
result = query_decomposition_pipeline.run({"prompt":{"question": question},
                                           "multi_query_prompt": {"question": question},
                                           "reasoning_prompt": {"question": question}},
                                           include_outputs_from=["query_resolver_llm"])

print("The original query was split and resolved:\n")

for pair in result["query_resolver_llm"]["structured_reply"].questions:
  print(pair)
print("\nSo the original query is answered as follows:\n")
print(result["reasoning_llm"]["replies"][0])

🥁 Drum roll please:

The original query was split and resolved:

question='How many siblings does Jaime have?' answer='Jaime has one sister (Cersei) and one younger brother (Tyrion), making a total of 2 siblings.'
question='How many siblings does Sansa have?' answer='Sansa has five siblings: one older brother (Robb), one younger sister (Arya), and two younger brothers (Bran and Rickon), as well as one older illegitimate half-brother (Jon Snow).'

So the original query is answered as follows:

To determine who has more siblings between Jaime and Sansa, we need to compare the number of siblings each has based on the provided answers.

From the answers:
- Jaime has 2 siblings (Cersei and Tyrion).
- Sansa has 5 siblings (Robb, Arya, Bran, Rickon, and Jon Snow).

Since Sansa has 5 siblings and Jaime has 2 siblings, we can conclude that Sansa has more siblings than Jaime.

Final Answer: Sansa has more siblings than Jaime.

Wrapping up

Given the right instructions, LLMs are good at breaking down tasks. Query decomposition is a great way we can make sure we do that for questions that are multiple questions in disguise.

In this article, you learned how to implement this technique with a twist 🙂 Let us know what you think about using structured outputs for these sorts of use cases. And check out the Haystack experimental repo to see what new features we’re working on.

Improving Retrieval with Auto-Merging

Thu, 12 Sep 2024 00:00:00 +0000

For most RAG applications, where we first have to retrieve the most relevant context, we end up having to split up documents first, and index those smaller splits of documents. Reasons for this range from needing to retrieve only relevant sections of larger bits of documents to the simple fact that (although they’re improving massively) LLMs simply don’t have infinite context lengths.

Auto-Merging is a retrieval technique that leverages a hierarchical document structure. When a document is too long, it is split into smaller documents or chunks, where we can think of the smaller documents as the children of the original document and the original document as the parent. This results in a hierarchical tree structure where each smaller document is a child of a previous larger document. The leaves of the tree are the documents which don’t have any children, and the root is the original document.

Auto-merging retrieval is a technique we can use if the parent document is likely to contain more of the relevant context about the information the user is after, in comparison to a subset of it’s child documents. When a query is made, the the retriever will normally return the top_k number of document chunks that are relevant to the query. However, if the number of retrieved document chunks that belong to the same parent document is above a certain threshold, the retriever would return the parent document instead of the individual chunks.

Haystack Components

Haystack implements the Auto-Merging Retrieval with two components:

HierarchicalDocumentSplitter: splits a Document into multiple Document objects of different block sizes, building a hierarchical tree structure where each smaller block is a child of a previous larger block. The init method expects three parameters:
- block_sizes: Set of block sizes to split the document into. The blocks are split in descending order. So, block_sizes of {20, 5} would mean that each ‘parent’ split would be of length max 20, and and each of its children would be of length max 5.
- split_overlap: The number of overlapping units for each split.
- split_by: The unit for splitting your documents.
AutoMergingRetriever: a retriever that leverages the hierarchical tree structure of documents, where the leaf nodes are indexed in a document store. During retrieval, if the number of matched leaf documents below the same parent is higher than a defined threshold, the retriever will return the parent document instead of the individual leaf documents. The init method expects three parameters:
- document_store: DocumentStore from which to retrieve the parent documents
- threshold: Threshold to decide whether the parent instead of the individual documents is returned

Introductory Example

Let’s see a simple example of how the AutoMergingRetriever works. In this example we will use a single document. We use the HierarchicalDocumentSplitter to split the document into chunks, represented by smaller documents, and capturing the hierarchical structure of the document.

    from haystack import Document
    from haystack.components.preprocessors import HierarchicalDocumentSplitter

    docs = [Document(content="The monarch of the wild blue yonder rises from the eastern side of the horizon.")]
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
    docs = splitter.run(docs)

We start by creating a document, and then we split it into smaller documents using the HierarchicalDocumentSplitter. We need to specify the block sizes that we want to split the document into. In this case, we are splitting the document into 10 and 3-word blocks - this means that the splitter will only have 2 levels, the first with a maximum of 10 words and the second a maximum of 3 words. There are no overlaps among the documents, and we also specify that we want to split the document by words. This results in 9 documents being created from the original document. The documents are split as follows:

`The monarch of the wild blue yonder rises from the eastern side of the horizon.` -- (root)
|
|
|
|--- `The monarch of the wild blue yonder rises from the`
|               |
|               |
|               |--- `The monarch of` -- (leaf)
|               |
|               |--- `the wild blue` -- (leaf)
|               |
|               |--- `yonder rises from` -- (leaf)
|               |
|               |--- `the` -- (leaf)
|
|
|--- `eastern side of the horizon.` -- (leaf)
|               |
|               |
|               |--- `eastern side of` -- (leaf)
|               |
|               |--- `the horizon.` -- (leaf)

Note that the original document is always the root of the tree. We then have two levels of children, the first with a maximum block size of 10 words, and the second with a maximum block size of 3 words.

We now need to split this documents into two distinct document stores. During initialization the AutoMergingRetriever requires the document store where the parent documents are indexed. At run time it receives leaf documents that matched a user query, it returns the parent document if the number of matched leaf documents below the same parent is higher than a defined threshold, otherwise it returns the original retrieved leaf documents.

Let’s see it in practice. We index the parent documents, by selecting the ones with a __level of 1.

    from haystack.document_stores.in_memory import InMemoryDocumentStore

    parent_docs_store = InMemoryDocumentStore()
    parent_docs = [doc for doc in docs["documents"] if doc.meta["__level"]==1]
    parent_docs_store.write_documents(parent_docs)

Let’s now initialize the AutoMergingRetriever with parent document store and a parent threshold of 0.5, meaning that if at least 50% of the leaf documents below the same parent match the query, the retriever will return the parent instead of the leaf documents which matched the user query. If we query the document store with a single leaf document, the retriever will return the same leaf document.

    from haystack.components.retrievers import AutoMergingRetriever

    retriever = AutoMergingRetriever(document_store=parent_docs_store, threshold=0.5)
    retriever.run(matched_leaf_documents=[docs['documents'][4]])

If we now we query the document store with two leaf documents, the retriever will return the parent document instead of the individual leaf documents, as the threshold of 0.5 is met.

    matched_leaf_documents = [docs['documents'][4], docs['documents'][5]]
    retriever.run(matched_leaf_documents=matched_leaf_documents)

This was a simple introductory example to show how AutoMergingRetriever works and retrieves parent documents instead of individual leaf documents. Next we will see a full example over news articles dataset.

Advanced Example

We will use the BBC news dataset to show how the AutoMergingRetriever works with a dataset containing multiple news articles. This dataset consists of 2.225 documents from the BBC corresponding to stories in five topical areas collected between 2004-2005, and was part of work by D. Greene and P. Cunningham. “Practical Solutions to the Problem of Diagonal Dominance in Kernel Document Clustering”, Proc. ICML 2006.

Reading the dataset

The original dataset is available at http://mlg.ucd.ie/datasets/bbc.html, but we are going to use a version that was already preprocessed and stored in a single CSV file available at the following URL:

https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv

from typing import List
import csv
from haystack import Document

def read_documents(file: str) -> List[Document]:
    with open(file, "r") as file:
        reader = csv.reader(file, delimiter="\\t")
        next(reader, None)  # skip the headers
        documents = []
        for row in reader:
            category = row[0].strip()
            title = row[2].strip()
            text = row[3].strip()
            documents.append(Document(content=text, meta={"category": category, "title": title}))

    return documents

docs = read_documents("bbc-news-data.csv")
len(docs)
>> 2225

Indexing the documents

After reading the converting the news articles into Haystack Document objects, let’s now let’s index them. We will use as document store theInMemoryDocumentStore for the sake of simplicity. We first apply the HierarchicalDocumentSplitter to the list of Documents, creating a hierarchical structure

We will create two document stores, one for the parent documents, and one for the leaf documents. We will later say that there will be an intermediate retriever to match user query with the indexed leaf documents, this intermediate retriever will then be connected to an AutoMergingRetriever which decides for when to return the parent instead of the matched leaf documents.

The function below receives the news articles as Documents and filters them by the meta field __level to differentiate between children and parent Documents, indexing them in their respective document stores, which are then both returned by the function.

from typing import Tuple

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

from haystack.components.preprocessors import HierarchicalDocumentSplitter

def indexing(documents: List[Document]) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 5}, split_overlap=0, split_by="sentence")
    docs = splitter.run(documents)

    # store the leaf documents in one document store
    leaf_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.SKIP)

    # store the parent documents in another document store
    parent_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.SKIP)

    return leaf_doc_store, parent_doc_store

Querying the documents

Now that we have our document stores let’s construct a querying pipeline, consisting of a BM25Retriever associated with the document store containing the leaf documents, and an AutoMergingRetriever associated with the parent documents and with a threshold of 0.6, meaning that if at least 60% of the matched leaf documents belong to the same parent, their parent is returned instead of each individual Document.

from haystack import Pipeline
from haystack.components.retrievers import InMemoryBM25Retriever
from haystack.components.retrievers import AutoMergingRetriever

def querying_pipeline(leaf_doc_store: InMemoryDocumentStore, parent_doc_store: InMemoryDocumentStore, threshold: float = 0.6):
    pipeline = Pipeline()
    bm25_retriever = InMemoryBM25Retriever(document_store=leaf_doc_store)
    auto_merge_retriever = AutoMergingRetriever(parent_doc_store, threshold=threshold)
    pipeline.add_component(instance=bm25_retriever, name="BM25Retriever")
    pipeline.add_component(instance=auto_merge_retriever, name="AutoMergingRetriever")
    pipeline.connect("BM25Retriever.documents", "AutoMergingRetriever.matched_leaf_documents")
    return pipeline

Putting it all together

docs = read_documents("bbc-news-data.csv")
leaf_doc_store, parent_doc_store = indexing(docs)
pipeline = querying_pipeline(leaf_doc_store, parent_doc_store, threshold=0.6)

So, now can run each function individually and have a querying pipeline that uses the AutoMergingRetriever. We can then use the pipeline to query the document store for articles related to cybersecurity, and let’s also make use of the pipeline parameter include_outputs_from to also get the outputs from the BM25Retriever component.

result = pipeline.run(data={'query': 'phishing attacks spoof websites spam e-mails spyware'},  include_outputs_from={'BM25Retriever'})

The result will have two keys, one for each retriever component: AutoMergingRetriever, BM25Retriever.

Let’s see how many documents were retrieved by each component.


In [17]: len(result['AutoMergingRetriever']['documents'])
Out[17]: 7

In [18]: len(result['BM25Retriever']['documents'])
Out[18]: 10

As we can see, the AutoMergingRetriever retrieved 7 documents, while the BM25Retriever retrieved 10 documents. This is because the AutoMergingRetriever returned parent documents instead of individual leaf documents. Let’s compare the titles of the documents retrieved by the BM25Retriever and the AutoMergingRetriever.

doc_titles = sorted([d.meta['title'] for d in result['BM25Retriever']['documents']])
In [14]: doc_titles
Out[14]:
['Bad e-mail habits sustains spam',
 'Bad e-mail habits sustains spam',
 'Cyber crime booms in 2004',
 'Cyber criminals step up the pace',
 'Cyber criminals step up the pace',
 'Junk e-mails on relentless rise',
 'More women turn to net security',
 'Security scares spark browser fix',
 'Spam e-mails tempt net shoppers',
 'Spam e-mails tempt net shoppers']

In [15]: doc_titles = sorted([d.meta['title'] for d in result['AutoMergingRetriever']['documents']])
In [16]: doc_titles
Out[16]:
['Bad e-mail habits sustains spam',
 'Cyber crime booms in 2004',
 'Cyber criminals step up the pace',
 'Junk e-mails on relentless rise',
 'More women turn to net security',
 'Security scares spark browser fix',
 'Spam e-mails tempt net shoppers']

Instead of returning individual leaf documents, the AutoMergingRetriever returned parent document for the articles:

“Bad e-mail habits sustains spam”,
“Cyber criminals step up the pace”,
“Spam e-mails tempt net shoppers”;

since at least 60% of the leaf documents of each of those documents matched the query.

Conclusion

In this tutorial we saw how the AutoMergingRetriever works. One important aspect of the AutoMergingRetriever implementation in Haystack is that it requires the documents to be split using the HierarchicalDocumentSplitter. Another aspect to notice as we saw, is that the AutoMergingRetriever should be used in conjunction with other base Retrievers allowing for a more flexible retrieval system.

Announcing Our New Short Course with DeepLearning.AI

Wed, 21 Aug 2024 00:00:00 +0000

DeepLearning.AI and Andrew Ng hold a special place in many AI/ML engineers’ careers and development. For many engineers, they’ve played a pivotal role into breaking into AI, or extending their knowledge and capabilities by teaching about core concepts and technologies at the right time. So, we’re incredibly happy to announce that we’re launching our first short course with them - “Building AI Applications with Haystack”

🚀 Enroll now

What you’ll learn

This course is designed with a bottom up approach, starting by teaching you about the building blocks of generative AI applications. Throughout the course, you’ll start to design and customize your own retrieval-augmented generative (RAG) pipelines, you’ll learn how to extend Haystack with custom capabilities for your own generative AI use case, you’ll even implement a simple self-reflecting agent and end the course by creating a full chat agent with function calling.

Who is the course for?

If you’re new to AI, this course is a great way to start learning about the basics and how they come together to build some of the most useful GenAI use cases. It will also teach you the fundamentals of Haystack and how you can make use of our framework to build out your own applications. You certainly do not need to have a deep understanding or much prior knowledge, however, it will be useful for you to be familiar with Python and have some basic understanding of different model types like embedding models and generative models.

If you’re not new to AI, this course is a great way to get started with Haystack as a tool for you to make your GenAI use cases come to life.

It’s been an absolute pleasure for me to create this with the Haystack team and the DeepLearning.AI team. I’d love to hear what you think, and if you take the course and end up with questions, please join me for an AMA session on September 10th.

Advanced RAG: Query Expansion

Wed, 14 Aug 2024 00:00:00 +0000

This is part one of the Advanced Use Cases series:

1️⃣ Extract Metadata from Queries to Improve Retrieval

2️⃣ Query Expansion

3️⃣ Query Decomposition

4️⃣ Automated Metadata Enrichment

The quality of RAG (retrieval augmented generation) highly depends on the quality of the first step in the process: retrieval. The generation step can only be as good as the context its working on, which it will receive as a result of a retrieval step.

However, retrieval is also in turn dependent on the query that it receives. There are multiple types of retrieval: keyword based, semantic search (embedding) based, hybrid, or even in some cases simply based on the results of a query to an API (for example, the results of websearch and so on). But at the end of the day, in the majority of cases, there’s a human behind a keyboard typing a query, and humans are not guaranteed to produce good quality queries for the results they intend to get.

In this article, we’ll walk you through a very simple yet effective technique that allows us to make sure we are retrieving more of, and more relevant bits of context to a given query: query expansion.

TL;DR: Query expansion increases the number of results, so it increases recall (vs precision). In general, BM25 favors precision while embedding retrieval favors recall (See this explanation by Nils Reimers). So, it makes sense to use BM25+query expansion to increase recall in cases where you want to rely on keyword search.

Query Expansion

Query expansion is a technique where we take the user query, and generate a certain number of similar queries. For example:

User Query: “open source NLP frameworks”

After Query Expansion: [”natural language processing tools”, “free nlp libraries”, “open-source language processing platforms”, “NLP software with open-source code”, “open source NLP frameworks”]

This helps improve retrieval results, and in turn the quality of RAG results in cases where:

The user query is vague or poorly formed.
In cases of keyword-based retrieval, it also allows you to cover your bases with queries of similar meaning or synonyms.

Take ‘global warming’ as an example, query expansion would allow us to make sure we’re also doing keyword search for ‘climate change’ or similar queries.

Let’s start by importing the experimental QueryExpander component. This component is using an OpenAI model (gpt-4o-mini in this case) to generate a certain number of additional queries that are similar to the original user query. It returns queries, which include the original query plus the generated similar ones:

expander = QueryExpander()
expander.run(query="open source nlp frameworks", number=4)

This would result in the component returning queries that include the original query + 4 expanded queries:

{'queries': ['natural language processing tools',
  'free nlp libraries',
  'open-source language processing platforms',
  'NLP software with open-source code',
  'open source nlp frameworks']}

Retrieval With Query Expansion

Let’s look at what happens if we use query expansion as a step in our retrieval pipeline. Let’s look at this through a very simple and small demo. To this end, I used some dummy data. Here’s the list of documents I used:

documents = [
    Document(content="The effects of climate are many including loss of biodiversity"),
    Document(content="The impact of climate change is evident in the melting of the polar ice caps."),
    Document(content="Consequences of global warming include the rise in sea levels."),
    Document(content="One of the effects of environmental changes is the change in weather patterns."),
    Document(content="There is a global call to reduce the amount of air travel people take."),
    Document(content="Air travel is one of the core contributors to climate change."),
    Document(content="Expect warm climates in Turkey during the summer period."),
]

When asking to retrieve the top 3 documents to the query “climate change” using the InMemoryBM25Retriever (so, we’re doing keyword search) here’s what we get as our top 3 candidates:

'Air travel is one of the core contributors to climate change.'
'The impact of climate change is evident in the melting of the polar ice caps.'
'The effects of climate are many including loss of biodiversity'

There are 2 things to notice here:

We’re only asking for 3 documents, and we’re getting 3 relevant documents to the query “climate change”. In this sense, this retrieval is completely valid and has done a good job.
But, because we’re using the query “climate change” in combination with a keyword retriever, we are actually missing out on some documents that may be even more relevant to the query. For example, the document with “global warming” is completely left out.

You can start to see how this could impact the results you get in cases where users are typing vague queries or keywords into the search box.

Now, let’s add query expansion to the mix. We will be using a custom retriever this time called the MultiQueryInMemoryBM25Retriever which can accept a list of queries instead of a single query (see the cookbook for the full code). Here’s the retrieval pipeline that we create:

query_expander = QueryExpander()
retriever = MultiQueryInMemoryBM25Retriever(InMemoryBM25Retriever(document_store=doc_store))

expanded_retrieval_pipeline = Pipeline()
expanded_retrieval_pipeline.add_component("expander", query_expander)
expanded_retrieval_pipeline.add_component("keyword_retriever", retriever)

expanded_retrieval_pipeline.connect("expander.queries", "keyword_retriever.queries")

Now, we can run this pipeline, again with the same query “climate change”

expanded_retrieval_pipeline.run({"expander": {"query": "climate change"}}, 
																include_outputs_from=["expander"])

And we get the following results. The query expander has created the following queries:

'expander': {'queries': ['global warming consequences',
   'environmental impact of climate change',
   'effects of climate variability',
   'implications of climate crisis',
   'consequences of greenhouse gas emissions',
   'climate change']}}

Note that you may get different results because your QueryExpander may generate different queries

And we’ve received the following documents from the retrieval pipeline:

'Consequences of global warming include the rise in sea levels.'
'The impact of climate change is evident in the melting of the polar ice caps.',
'There is a global call to reduce the amount of air travel people take.'
'The effects of climate are many including loss of biodiversity'
'One of the effects of environmental changes is the change in weather patterns.'
'Air travel is one of the core contributors to climate change.'

Notice how we’re able to add context about ‘global warming’ and ‘effects of environmental change’.

Using Query Expansion for RAG

In the example cookbook, we’ve also added a section on using query expansion for RAG on Wikipedia pages. We index the following wikipedia pages into an InMemoryDocumentStore :

"Electric_vehicle", "Dam", "Electric_battery", "Tree", "Solar_panel", "Nuclear_power",
"Wind_power", "Hydroelectricity", "Coal", "Natural_gas", 
"Greenhouse_gas", "Renewable_energy", "Fossil_fuel"

And then, we construct a RAG pipeline. For our resulting prompt to the LLM, we also indicate what the original query from the user was.

template = """
You are part of an information system that summarises related documents.
You answer a query using the textual content from the documents retrieved for the
following query.
You build the summary answer based only on quoting information from the documents.
You should reference the documents you used to support your answer.
###
Original Query: "{{query}}"
Retrieved Documents: {{documents}}
Summary Answer:
"""
query_expander = QueryExpander()
retriever = MultiQueryInMemoryBM25Retriever(InMemoryBM25Retriever(document_store=doc_store))
prompt_builder = PromptBuilder(template = template)
llm = OpenAIGenerator()

query_expanded_rag_pipeline = Pipeline()
query_expanded_rag_pipeline.add_component("expander", query_expander)
query_expanded_rag_pipeline.add_component("keyword_retriever", retriever)
query_expanded_rag_pipeline.add_component("prompt", prompt_builder)
query_expanded_rag_pipeline.add_component("llm", llm)

query_expanded_rag_pipeline.connect("expander.queries", "keyword_retriever.queries")
query_expanded_rag_pipeline.connect("keyword_retriever.documents", "prompt.documents")
query_expanded_rag_pipeline.connect("prompt", "llm")

Running this pipeline with the simple query “green energy sources” with the query expander, we’re able to get a response constructed from Wikipedia pages including “Electric Vehicle”, “Wind Power”, “Renewable Energy”, “Fossil Fuel” and “Nuclear Power”. Without the MultiQueryInMemoryBM25Retriever, we rely on the top k results from a single pass of BM25 retrieval on the query “green energy sources” resulting in a response constructed from the pages “Renewable energy”, “Wind Power” and “Fossil Fuel”

Wrapping Up

Query Expansion is a great technique that will allow you to get a wider range of relevant resources while still using keyword search. While semantic search is a great option, it does require the use of an embedding model, and the existence of embeddings for the data source we will perform search on. This makes keyword based search quite an attractive option for faster, cheaper retrieval.

This does however mean that we heavily rely on the quality of the provided query. Query expansion allows you to navigate this issue by generating similar queries to the user query.

In my opinion, one of the main advantages of this technique is that it allows you to avoid embedding documentation at each update, while still managing to increase the relevance of retrieved documents at query time. Keyword retrieval doesn’t require any extra embedding step, so the only inferencing happening at retrieval time in this scenario is when we ask an LLM to generate a certain number of similar queries.

CPU-Optimized Embedding Models with fastRAG and Haystack

Thu, 01 Aug 2024 00:00:00 +0000

One of the main and critical components of a retrieval augmented generation (RAG) pipeline is the embedding process, which forms the foundation for efficient information retrieval by transforming raw text into machine-readable vector representations. Embedding models encode textual data into dense vectors, capturing semantic and contextual meaning. These models are used to create embeddings for both queries (for retrieval) and documents (for indexing and reranking). Therefore, optimizing these models through quantization could improve our RAG application by providing:

Higher throughput: useful for reducing the time needed for creating or updating your vectors store.
Lower latency: improves real-time experience as creating query embeddings and re-ranking of documents are done online per user input.
Reduced memory and cost requirements: Optimizing by quantization to int8 reduces the memory footprint and the cost when running such models.

This is where specialized frameworks, such as fastRAG by Intel Labs, come into play, offering enhancements tailored to specific hardware and use cases, and with the extensive feature-set offered by Haystack.

fastRAG: Intel Labs’ Framework for Efficient RAG

fastRAG is a research framework developed by Intel Labs for efficient and optimized RAG pipelines. It incorporates state-of-the-art large language models (LLMs) and information retrieval capabilities. fastRAG is fully compatible with Haystack and includes novel and efficient RAG modules designed for efficient deployment on Intel hardware, including client and server CPUs (Xeon) and the Intel Gaudi AI accelerator.

The fastRAG GitHub repository provides extensive documentation on each component available in the framework, comprehensive examples, and easy installation instructions for optimized backends. The framework utilizes optimized extensions to popular deep learning frameworks such as PyTorch.

One such extension is Optimum Intel, an open-source library that extends the Hugging Face Transformers library and takes advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Vector Neural Network Instructions (VNNI), and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs to accelerate models. AMX accelerated inference is introduced in PyTorch 2.0 and the Intel Extension for PyTorch (IPEX).

Intel and deepset are key members in Open Platform for Enterprise AI (OPEA), a project recently announced by LF AI & Data Foundation. OPEA aims to accelerate secure, cost-effective generative AI (GenAI) deployments for businesses by driving interoperability across a diverse and heterogeneous ecosystem, starting with RAG.

Optimization Process: Quantization

The optimization process involves quantizing the model using a calibration dataset, and leveraging an optimized backend like IPEX for Intel Xeon CPUs. Quantization reduces the model size by converting weights and activations from floating-point (e.g., 32-bit) to lower-bit representations (e.g., 8-bit integers). This makes models smaller, faster, and more cost-efficient, with negligible loss in accuracy. Benchmark results of BGE-large demonstrate a potential 10x speed-up in the indexing process when using the int8 variant of the model.

Throughput comparison of int8 and FP32 variants of the BGE-large model with varying batch sizes (sequence length=256) on a Gen 4 Xeon CPU (8480+) with 56 Cores (1 Socket)

Note that the benchmark results focus solely on the time spent in the encoding process of the BGE-large model variants. Time taken for tokenization is excluded from these measurements to provide a clearer comparison of the models’ encoding efficiency.

A comprehensive guide to quantizing a model from scratch is available. Additionally, three quantized BGE embedding models are available on Intel’s Hugging Face Model Hub.

Components

fastRAG is supported as an integration to Haystack, extending Haystack’s document and text embedders with IPEX support. In addition, fastRAG includes two Bi-encoder similarity rankers:

IPEXSentenceTransformersDocumentEmbedder and IPEXSentenceTransformersTextEmbedder - Embedder components that use an int8 quantized embedding models via IPEX, and can embed Document and text inputs.
BiEncoderSimilarityRanker - A bi-encoder similarity ranker that re-orders a list of documents given a query and an embedder. Bi-encoder models are used to encode documents and queries independently and are more efficient than cross-encoders.
IPEXBiEncoderSimilarityRanker - An IPEX-based BiEncoderSimilarityRanker to be used with an int8 quantized embedding model.

See the full list of fastRAG components here.

Same Accuracy, 9x Faster

Maintaining competitive retrieval accuracy is important when optimizing the models. We evaluated the impact of optimization (quantization and calibration) on performance using the Rerank and Retrieval sub-tasks of MTEB with three BGE bi-encoder embedding models. For the BGE-large model, the optimization process marginally altered performance compared to the original model, as shown in the table.

	int8	FP32	%diff
Reranking	0.5997	0.6003	-0.108%
Retrieval	0.5346	0.5429	-1.53%

Results for other BGE models can be found here.

Let’s compare encoding random texts as passages using two different models:

BAAI/bge-large-en-v1.5 (fp32) with Haystack’s SentenceTransformersDocumentEmbedder
Intel/bge-large-en-v1.5-rag-int8-static (int8) with fastRAG’s IPEXSentenceTransformersDocumentEmbedder

The script below creates random passages, each passage translates into 256 tokens once encoded with the tokenizer, and encodes 16384 passages using the same model in both fp32 and int8 variants.

import time
from datasets import load_dataset
from haystack import Document
from fastrag.embedders import IPEXSentenceTransformersDocumentEmbedder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Intel/bge-large-en-v1.5-rag-int8-static")

def generate_text_for_fixed_length(seq_length):
    text = ""
    while True:
        # Tokenize the current text
        token_ids = tokenizer(text)["input_ids"]
        
        # Check if the tokenized sequence has reached the desired length
        if len(token_ids) == seq_length:
            break
        elif len(token_ids) > seq_length:
            text = text[:-1]
        
        # Add a random character to the text
        text += random.choice(string.ascii_letters + string.digits + string.punctuation + " ")
    
    return text

seq_length = 256
generated_texts = []
for _ in tqdm(range(1000), desc="Generating texts"):
    generated_texts.append(generate_text_for_fixed_length(seq_length))
generated_texts = generated_texts * 20

docs = [Document(content=doc) for doc in generated_texts]

BATCH_SIZE_LIST = [1, 4, 8, 16, 32, 64, 128, 256]

for BATCH_SIZE in BATCH_SIZE_LIST:
    print("Running with BATCH_SIZE:", BATCH_SIZE)
    ipex_doc_embedder = IPEXSentenceTransformersDocumentEmbedder(
            model="Intel/bge-large-en-v1.5-rag-int8-static", 
            batch_size=BATCH_SIZE
        )
    haystack_doc_embedder = SentenceTransformersDocumentEmbedder(
            model="BAAI/bge-large-en-v1.5", 
            batch_size=BATCH_SIZE
        )

    ipex_doc_embedder.warm_up()
    haystack_doc_embedder.warm_up()

    # Measure runtime for SentenceTransformersDocumentEmbedder
    start_time = time.time()
    documents_with_embeddings = haystack_doc_embedder.run(docs[:16384])
    end_time = time.time()
    haystack_doc_embedder_runtime = end_time - start_time

    # Measure runtime for IPEXSentenceTransformersDocumentEmbedder
    start_time = time.time()
    documents_with_embeddings = ipex_doc_embedder.run(docs[:16384])
    end_time = time.time()
    ipex_doc_embedder_runtime = end_time - start_time

    print("Runtime for SentenceTransformersDocumentEmbedder:", haystack_doc_embedder_runtime)
    print("Runtime for IPEXSentenceTransformersDocumentEmbedder:", ipex_doc_embedder_runtime)

The runtime results indicate that using the setup of fastRAG’s components, as demonstrated in the script above, leads to 5.25x to 9.3x speed-ups in the embedding process when running on a single socket of a Gen 4 Xeon CPU (8480+) and using 56 cores. We can also translate that to throughput (higher is better) and see the differences in speed-ups as well.

💡 The difference in speedups compared to the benchmark presented previously is due to the extra processing done in Haystack components, mainly, the tokenization process which was excluded in the previous benchmark.

Throughput comparison of int8 and FP32 variants of the BGE-large model using IPEXSentenceTransformersDocumentEmbedder and SentenceTransformersDocumentEmbedder

💡 The performance of the quantized int8 model highly depends on the structure of the data. For best performance, it is advised to use static shapes, meaning, and tokenized sequences of the same length. In addition, batching is highly effective when using CPU backends and it could also be combined with dynamic shapes. It’s a matter of tuning the setup according to the data and hardware.

💡 We followed the instructions available here when running the experiments, which includes using numactl to limit running the processes on a single socket, and TCMalloc. We recommend reading the performance tuning guide and launch script usage available on the IPEX documentation website.

Read the Intel fastRAG team’s blog with additional evaluations and performance benchmarking for more information.

RAG with Optimized Embedding Models

In this section, we will explore how to use optimized models within a RAG pipeline. We will use embedder models to create the initial index more quickly than the standard fp32 Hugging Face models. Additionally, we will demonstrate a simple Q&A pipeline that employs an optimized bi-encoder ranker. This ranker re-orders the retrieved documents to enhance the list of documents used in the LLM prompt, thereby improving the overall performance of the retrieval process.

Installation

First, install fastRAG, Optimum Intel and Haystack via fastRAG:

pip install fastrag[intel]

Indexing Data

We will start with initializing an in-memory data store and loading the document embedder component from fastRAG. The IPEXSentenceTransformersDocumentEmbedder can be seamlessly integrated into a Haystack pipeline, just like any other component.

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from fastrag.embedders import IPEXSentenceTransformersDocumentEmbedder, IPEXSentenceTransformersTextEmbedder

document_store = InMemoryDocumentStore()
doc_embedder = IPEXSentenceTransformersDocumentEmbedder(model="Intel/bge-small-en-v1.5-rag-int8-static")
doc_embedder.warm_up()

Now, let’s load a dataset. We’ll use bilgeyucel/seven-wonders dataset that doesn’t need any further processing:

from datasets import load_dataset
from haystack import Document

dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]

Next, we embed the documents and write them to the index:

documents_with_embeddings = doc_embedder.run(docs)
document_store.write_documents(documents_with_embeddings["documents"])

RAG Pipeline

We continue to initialize components required to build a pipeline that represents a simple Q&A RAG example using an embedder, a retriever, a reranker, a prompt template, and a generator. Notably, the IPEXSentenceTransformersTextEmbedder and IPEXBiEncoderSimilarityRanker can be seamlessly integrated into a Haystack pipeline alongside other components.

Learn how to create a RAG pipeline with Haystack in 📚 Tutorial: Creating Your First QA Pipeline with Retrieval-Augmentation.

from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from fastrag.rankers import IPEXBiEncoderSimilarityRanker

query_embedder = IPEXSentenceTransformersTextEmbedder(model="Intel/bge-small-en-v1.5-rag-int8-static")
retriever = InMemoryEmbeddingRetriever(document_store, top_k=100)
reranker = IPEXBiEncoderSimilarityRanker("Intel/bge-large-en-v1.5-rag-int8-static", top_k=5)

We create a simple RAG prompt template:

from haystack.components.builders import PromptBuilder

template = """
You are a helpful AI assistant. You are given contexts and a question. 
You must answer the question using the information given in the context.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:

"""

prompt_builder = PromptBuilder(template=template)

We initialize a HuggingFaceLocalGenerator with HuggingFaceTB/SmolLM-1.7B-Instruct model (SmolLM-1.7B-Instruct using a local Hugging Face model) to generate answers:

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(model="HuggingFaceTB/SmolLM-1.7B-Instruct",
                                      task="text-generation",
                                      generation_kwargs={
                                        "max_new_tokens": 100,
                                        "do_sample": False,
                                        })

Finally, we create the pipeline:

from haystack import Pipeline

pipe = Pipeline()
pipe.add_component("retriever", retriever)
pipe.add_component("embedder", query_embedder)
pipe.add_component("reranker", reranker)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

pipe.connect("embedder", "retriever")
pipe.connect("retriever", "reranker.documents")
pipe.connect("reranker", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

Try the Pipeline

Let’s try the pipeline with a real question:

question = 'What does Rhodes Statue look like?'

response = pipe.run({'embedder': {'text': question},
                    'reranker': {'query': question},
                    'prompt_builder': {'question': question}})

print(response['llm']['replies'][0])

>>> The statue was a Colossus of Rhodes, a statue of the Greek sun god Helios that stood in the city of Rhodes and was one of the Seven Wonders of the Ancient World. It is said to have stood about 100 feet (30 meters) tall, making it the tallest statue of its time. The statue was built by Chares of Lindos between 280 and 240 BC. It was destroyed by an earthquake in 226

Summary

In this short blog, we’ve highlighted the significant advantages of CPU-optimized embedding models in terms of accuracy and performance, demonstrating how seamlessly these components can be incorporated into your Haystack pipeline. At the forefront of these advancements is fastRAG, a research library dedicated to integrating Intel-based optimizations into Haystack.

The fastRAG team provides in-depth information on the quantization process and extensive benchmarking conducted on 4th Gen Xeon processors. To dive deep into the optimizations, read this detailed blog post and join our Discord community to explore Haystack.

Mixedbread 🤝 deepset: Announcing our New German/English Embedding Model

Thu, 18 Jul 2024 00:00:00 +0000

It’s 2024 and yet, most models today are still primarily geared towards English speaking markets. Today, deepset and Mixedbread are jointly announcing our latest contribution towards changing that landscape: A new open-source German/English embedding model - deepset-mxbai-embed-de-large-v1.

Our model is based on intfloat/multilingual-e5-large and was fine-tuned on 30+ million pairs of German data for retrieval tasks. On the NDCG@10 metric, which compares the list of retrieval results against an ideally ordered list of expected results, our model not only sets a new standard for open-source German embedding models but is also competitive with commercial alternatives.

Model	Avg. Performance (NDCG@10)	Binary Support	MRL Support
deepset-mxbai-embed-de-large-v1	51.7	✅	✅
multilingual-e5-large	50.5	❌	❌
jina-embeddings-v2-base-de	50.0	✅	❌
Commercial Models
Cohere Multilingual v3	52.4	✅	-

Nvidia enabled this work by providing cutting-edge computational resources. All training and evaluation was done on a Nvidia DGX with 8xA100, sponsored by Nvidia. We are extremely grateful for their contribution to this project.

To learn more and get a deeper dive into benchmarks on real-world data, read our full announcement article with Mixedbread. You can find an overview of the benchmarks in this spreadsheet.

Storage and Inference Efficiency

Beyond support for the German language, we also focused on improving the storage and inference efficiency of this new embedding model using the following methods:

Matryoshka Representation Learning (MRL): Matryoshka representation learning reduces the number of output dimensions in an embedding model without significant accuracy loss. This is done by modifying the loss function to prioritise the representation of important information in the initial dimensions of the embedding vector, enabling the truncation of later dimensions.

Binary Quantization: Binary quantization reduces the size of each dimension by converting float32 values to binary values, significantly enhancing memory and disk space efficiency while retaining high performance during inference.

Start Using it With Haystack

You can start using deepset-mxbai-embed-de-large-v1 today with the SentenceTransformersDocumentEmbedder and SentenceTransformersTextEmbedder components in Haystack, as well as the Mixedbread integrations of MixedbreadDocumentEmbedder and MixedbreadTextEmbedder:

Use it with the Sentence Transformers Embedders

from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

text_embedder = SentenceTransformersTextEmbedder(model="mixedbread-ai/deepset-mxbai-embed-de-large-v1")

document_embedder = SentenceTransformersDocumentEmbedder(model="mixedbread-ai/deepset-mxbai-embed-de-large-v1")

Use it with the Mixedbread Embedders

To start using this model with the Mixedbread integration for Haystack, install mixedbread-ai-haystack and export your Mixedbread API key to MXBAI_API_KEY.

from mixedbread_ai_haystack import  MixedbreadAITextEmbedder, MixedbreadAIDocumentEmbedder
from mixedbread_ai import EncodingFormat

text_embedder = MixedbreadAITextEmbedder( model="mixedbread-ai/deepset-mxbai-embed-de-large-v1",
                                          encoding_format=EncodingFormat.BINARY)

document_embedder = MixedbreadAIDocumentEmbedder(model="mixedbread-ai/deepset-mxbai-embed-de-large-v1",
                                                 encoding_format=EncodingFormat.BINARY)

Like our influential German BERT model, we hope that this state-of-the-art model will enable the German-speaking AI community to build innovative products in the field of retrieval-augmented generation (RAG) and beyond!

Join our Discord community to explore Haystack.

Benchmarking Haystack Pipelines for Optimal Performance

Mon, 24 Jun 2024 00:00:00 +0000

In this article, we will show you how to use Haystack to evaluate the performance of a RAG pipeline. Note that the code in this article is meant to be illustrative and may not run as is; if you want to run the code, please refer to the python script.

Introduction

This article will guide you through building a Retrieval-Augmented Generation (RAG) pipeline using Haystack, adjusting various parameters, and evaluating it with the ARAGOG dataset. The dataset consists of pairs of questions and answers, and our objective is to assess the RAG pipeline’s efficiency in retrieving the correct context and generating accurate answers. To do this, we will use the following evaluation metrics:

We did this experiment by relying on three different Haystack pipelines with different purposes: one pipeline for indexing, another for RAG, and one for evaluation. We describe each of these pipelines in detail and show how to combine them together to evaluate the RAG pipeline.

The article is organized as follows: we first describe the origin and authorship of the ARAGOG dataset, then we build the pipelines. We then demonstrate how to integrate everything, performing multiple runs over the dataset and adjusting parameters. These parameters were chosen based on feedback from our community, reflecting how users optimize their pipelines:

top_k: the maximum number of documents returned by the retriever. For this experiment, we tested our pipeline with top_k value of [1, 2, 3].
embedding_model: the model used to encode the documents and the question. For this example, we used these sentence-transformers models:
- all-MiniLM-L6-v2
- msmarco-distilroberta-base-v2
- all-mpnet-base-v2
chunk_size: the number of tokens in the input text that makes up segments of text to be embedded and indexed. For this experiment, we tested our pipeline with chunk_size of [64, 128, 256].

We end by discussing the results of the evaluation and sharing some lessons learned.

The “ARAGOG: Advanced RAG Output Grading” Dataset

The knowledge data, as well as the questions and answers, all stem from the ARAGOG: Advanced RAG Output Grading paper. The data is a subset of the AI ArXiv Dataset and consists of 423 selected research papers centered around the themes of Transformers and Large Language Models (LLMs).

The evaluation dataset comprises 107 question-answer pairs (QA) generated with the assistance of GPT-4. Each QA pair is validated and corrected by humans, ensuring that the evaluation is correct and accurately measures the RAG techniques’ performance in real-world applications.

Within the scope of this article, we only considered 16 papers, the ones from which the questions were drawn, instead of the 423 papers in the original dataset, to reduce the computational cost.

The Indexing Pipeline

The indexing pipeline is responsible for preprocessing and storing the documents in a DocumentStore. We will define a function that wraps a pipeline, takes the embedding model and the chunk size as parameters, and returns a DocumentStore for later use. The pipeline in the function first converts the PDF files into Documents, cleans them, splits them into chunks, and then embeds them using a SentenceTransformers model. The embeddings are then stored in an InMemoryDocumentStore. Learn more about creating an indexing pipeline in 📚 Tutorial: Preprocessing Different File Types.

For this example, we store the documents using the InMemoryDocumentStore, but you can use any other document store supported by Haystack. We split the documents by word, but you can split them by sentence or paragraph by changing the value of split_by parameter in the DocumentSplitter component.

We need to pass the parameters embedding_model and chunk_size to this indexing pipeline function since we want to experiment with different indexing approaches.

The indexing pipeline function is defined as follows:

import os

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

def indexing(embedding_model: str, chunk_size: int):
    files_path = "datasets/ARAGOG/papers_for_questions"
    document_store = InMemoryDocumentStore()
    pipeline = Pipeline()
    pipeline.add_component("converter", PyPDFToDocument())
    pipeline.add_component("cleaner", DocumentCleaner())
    pipeline.add_component("splitter", DocumentSplitter(split_length=chunk_size))  # default splitting by word
    pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP))
    pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(embedding_model))
    pipeline.connect("converter", "cleaner")
    pipeline.connect("cleaner", "splitter")
    pipeline.connect("splitter", "embedder")
    pipeline.connect("embedder", "writer")
    pdf_files = [files_path+"/"+f_name for f_name in os.listdir(files_path)]
    pipeline.run({"converter": {"sources": pdf_files}})

    return document_store

The RAG Pipeline

We use a simple RAG pipeline composed of a retriever, a prompt builder, a language model, and an answer builder. First, we use the SentenceTransformersTextEmbedder to embed the query and an InMemoryEmbeddingRetriever to retrieve the top-k documents relevant to the query. We then rely on an LLM to generate an answer based on the context retrieved from the documents and the query question.

We used the OpenAI API through the OpenAIGenerator with the gpt-3.5-turbo model in our implementation. The PromptBuilder is responsible for building the prompt to be fed to the LLM, using a template that includes the context and the question. Finally, the AnswerBuilder is responsible for extracting the answer from the LLM output and returning it. Learn more about creating a RAG pipeline in 📚 Tutorial: Creating Your First QA Pipeline with Retrieval-Augmentation.

Note that we instruct the LLM to explicitly answer "None" when the context is empty. We do this to avoid the LLM answering the prompt with its own internal knowledge.

After creating the pipeline, we wrap it with a function to easily initialize it with different parameters. We expect a document_store, an embedding_model, and the top_k for this function.

The RAG pipeline is defined as follows:

from haystack import Pipeline
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers import InMemoryEmbeddingRetriever

def rag_pipeline(document_store, embedding_model, top_k=2):
    template = """
        You have to answer the following question based on the given context information only.
        If the context is empty or just a '\\n' answer with None, example: "None".

        Context:
        {% for document in documents %}
            {{ document.content }}
        {% endfor %}

        Question: {{question}}
        Answer:
        """

    basic_rag = Pipeline()
    basic_rag.add_component("query_embedder", SentenceTransformersTextEmbedder(
        model=embedding_model, progress_bar=False
    ))
    basic_rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store, top_k=top_k))
    basic_rag.add_component("prompt_builder", PromptBuilder(template=template))
    basic_rag.add_component("llm", OpenAIGenerator(model="gpt-3.5-turbo"))
    basic_rag.add_component("answer_builder", AnswerBuilder())

    basic_rag.connect("query_embedder", "retriever.query_embedding")
    basic_rag.connect("retriever", "prompt_builder.documents")
    basic_rag.connect("prompt_builder", "llm")
    basic_rag.connect("llm.replies", "answer_builder.replies")
    basic_rag.connect("llm.meta", "answer_builder.meta")
    basic_rag.connect("retriever", "answer_builder.documents")

    return basic_rag

The Evaluation Pipeline

We will also need an evaluation pipeline, which will be responsible for computing the scoring metrics to measure the performance of the RAG pipeline. You can learn how to build an evaluation pipeline in 📚 Tutorial: Evaluating RAG Pipelines. The evaluation pipeline will include three evaluators:

ContextRelevanceEvaluator will assess the relevancy of the retrieved context to answer the query question
FaithfulnessEvaluator evaluates whether the generated answer can be derived from the context
SASEvaluator compares the embedding of a generated answer against a ground-truth answer based on a common embedding model.

This new function returns the evaluation results and the inputs used to run the evaluation. This data is useful for later analysis and understanding the pipeline’s performance in more detail and granularity. We need to pass the questions and answers from the dataset to the function, plus the data generated by the RAG pipeline, i.e., retrieved_contexts, predicted_answers, and the embedding_model used for these results.

from haystack import Pipeline
from haystack.components.evaluators import ContextRelevanceEvaluator, FaithfulnessEvaluator, SASEvaluator

def evaluation(questions, answers, retrieved_contexts, predicted_answers, embedding_model):
    eval_pipeline = Pipeline()
    eval_pipeline.add_component("context_relevance", ContextRelevanceEvaluator(raise_on_failure=False))
    eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator(raise_on_failure=False))
    eval_pipeline.add_component("sas", SASEvaluator(model=embedding_model))

    eval_pipeline_results = eval_pipeline.run(
        {
            "context_relevance": {"questions": questions, "contexts": retrieved_contexts},
            "faithfulness": {"questions": questions, "contexts": retrieved_contexts, "predicted_answers": predicted_answers},
            "sas": {"predicted_answers": predicted_answers, "ground_truth_answers": answers},
        }
    )

    results = {
        "context_relevance": eval_pipeline_results['context_relevance'],
        "faithfulness": eval_pipeline_results['faithfulness'],
        "sas": eval_pipeline_results['sas']
    }

		inputs = {
				'questions': sample_questions,
        'contexts': retrieved_contexts,
        'true_answers': sample_answers,
        'predicted_answers': predicted_answers
     }

    return results, inputs

Putting it all together

We now have the building blocks to evaluate the RAG pipeline: indexing the knowledge data, generating answers using a RAG architecture, and evaluating the results. However, we still need a method to run the questions over our RAG pipeline and collect all the needed results to perform an evaluation. We will use a function that wraps up all the interactions with the RAG pipeline. It takes as parameters a document_store, the questions, an embedding_model and the top_k and returns the retrieved contexts and the predicted answers.

def run_rag(document_store, sample_questions, embedding_model, top_k):
    """
    A function to run the basic rag model on a set of sample questions and answers
    """

    rag = rag_pipeline(document_store=document_store, embedding_model=embedding_model, top_k=top_k)

    predicted_answers = []
    retrieved_contexts = []
    for q in tqdm(sample_questions):
        try:
            response = rag.run(
                data={"query_embedder": {"text": q}, "prompt_builder": {"question": q}, "answer_builder": {"query": q}})
            predicted_answers.append(response["answer_builder"]["answers"][0].data)
            retrieved_contexts.append([d.content for d in response['answer_builder']['answers'][0].documents])
        except BadRequestError as e:
            print(f"Error with question: {q}")
            print(e)
            predicted_answers.append("error")
            retrieved_contexts.append(retrieved_contexts)

    return retrieved_contexts, predicted_answers

Notice that we wrap the call to the RAG pipeline in a try-except block to handle any errors that may occur during the pipeline’s execution. This might happen, for instance, if the prompt is too big—due to large contexts—for the model to generate an answer, if there are network errors, or simply if the model cannot generate an answer for any other reason.

You can decide if the LLM-based evaluators stop immediately if an error is found or if they ignore the evaluation for a particular sample and continue see, for instance in the ContextRelevanceEvaluator, the raise_on_failure parameter.

Finally, we need to run whole query questions through the pipeline over the dataset for each possible combination of the parameters top_k, embedding_model, and chunk_size. That’s handled by the next function.

Note that for indexing, we only vary the embedding_model and chunk_size, as the top_k parameter does not affect the indexing.

def parameter_tuning(out_path: str):

    base_path = "../datasets/ARAGOG/"

    with open(base_path + "eval_questions.json", "r") as f:
        data = json.load(f)
        questions = data["questions"]
        answers = data["ground_truths"]

    embedding_models = {
        "sentence-transformers/all-MiniLM-L6-v2",
        "sentence-transformers/msmarco-distilroberta-base-v2",
        "sentence-transformers/all-mpnet-base-v2"
    }
    top_k_values = [1, 2, 3]
    chunk_sizes = [64, 128, 256]

    # create results directory
    out_path = Path(out_path)
    out_path.mkdir(exist_ok=True)

    for embedding_model in embedding_models:
        for chunk_size in chunk_sizes:
            print(f"Indexing documents with {embedding_model} model with a chunk_size={chunk_size}")
            doc_store = indexing(embedding_model, chunk_size)
            for top_k in top_k_values:
                name_params = f"{embedding_model.split('/')[-1]}__top_k:{top_k}__chunk_size:{chunk_size}"
                print(name_params)
                print("Running RAG pipeline")
                retrieved_contexts, predicted_answers = run_rag(doc_store, questions, embedding_model, top_k)
                print(f"Running evaluation")
                results, inputs = evaluation(questions, answers, retrieved_contexts, predicted_answers, embedding_model)
                eval_results = EvaluationRunResult(run_name=name_params, inputs=inputs, results=results)
                eval_results.score_report().to_csv(f"{out_path}/score_report_{name_params}.csv", index=False)
                eval_results.to_pandas().to_csv(f"{out_path}/detailed_{name_params}.csv", index=False)

This function will store the results in a directory specified by the out_path parameter. The results will be stored in .csv files. For each parameter combination, there will be two files generated, one with the aggregated score report overall questions (e.g.: score_report_all-MiniLM-L6-v2__top_k:3__chunk_size:128.csv) and another with the detailed results for each question (e.g.: detailed_all-MiniLM-L6-v2__top_k:3__chunk_size:128.csv).

Note that we make use of the EvaluationRunResult to store the results and generate the score report and the detailed results in the .csv files.

In the next section, we will show the evaluation results and discuss the insights gained from the experiment.

Results Analysis

You can run this notebook to visualize and analyze the results. All relevant .csv files can be found in the aragog_parameter_search_2024_06_12 folder.

To make the analysis of the results easier, we will load all the aggregated score reports from the different parameter combinations from multiple .csv files into a single DataFrame. For that, we use the following code to parse the file content:

import os
import re
import pandas as pd

def parse_results(f_name: str):
    pattern = r"score_report_(.*?)__top_k:(\\d+)__chunk_size:(\\d+)\\.csv"
    match = re.search(pattern, f_name)
    if match:
        embeddings_model = match.group(1)
        top_k = int(match.group(2))
        chunk_size = int(match.group(3))
        return embeddings_model, top_k, chunk_size
    else:
        print("No match found")

def read_scores(path: str):
    all_scores = []
    for root, dirs, files in os.walk(path):
        for f_name in files:
            if not f_name.startswith("score_report"):
                continue

            embeddings_model, top_k, chunk_size = parse_results(f_name)

            df = pd.read_csv(path+"/"+f_name)

            df.rename(columns={'Unnamed: 0': 'metric'}, inplace=True)
            df_transposed = df.T
            df_transposed.columns = df_transposed.iloc[0]
            df_transposed = df_transposed[1:]

            # Add new columns
            df_transposed['embeddings'] = embeddings_model
            df_transposed['top_k'] = top_k
            df_transposed['chunk_size'] = chunk_size

            all_scores.append(df_transposed)

    df = pd.concat(all_scores)
    df.reset_index(drop=True, inplace=True)
    df.rename_axis(None, axis=1, inplace=True)

    return df

We can then read the scores from the CSV files and analyze the results.

df = read_scores('aragog_results/')

We can now analyze the results in a single table:

context_relevance	faithfulness	sas	embeddings	top_k	chunk_size
0.834891	0.738318	0.524882	all-MiniLM-L6-v2	1	64
0.869485	0.895639	0.633806	all-MiniLM-L6-v2	2	64
0.933489	0.948598	0.65133	all-MiniLM-L6-v2	3	64
0.843447	0.831776	0.555873	all-MiniLM-L6-v2	1	128
0.912355	NaN	0.661135	all-MiniLM-L6-v2	2	128
0.94463	0.928349	0.659311	all-MiniLM-L6-v2	3	128
0.912991	0.827103	0.574832	all-MiniLM-L6-v2	1	256
0.951702	0.925456	0.642837	all-MiniLM-L6-v2	2	256
0.909638	0.932243	0.676347	all-MiniLM-L6-v2	3	256
0.791589	0.67757	0.480863	all-mpnet-base-v2	1	64
0.82648	0.866044	0.584507	all-mpnet-base-v2	2	64
0.901218	0.890654	0.611468	all-mpnet-base-v2	3	64
0.897715	0.845794	0.538579	all-mpnet-base-v2	1	128
0.916422	0.892523	0.609728	all-mpnet-base-v2	2	128
0.948038	NaN	0.643175	all-mpnet-base-v2	3	128
0.867887	0.834112	0.560079	all-mpnet-base-v2	1	256
0.946651	0.88785	0.639072	all-mpnet-base-v2	2	256
0.941952	0.91472	0.645992	all-mpnet-base-v2	3	256
0.909813	0.738318	0.530884	msmarco-distilroberta-base-v2	1	64
0.88004	0.929907	0.600428	msmarco-distilroberta-base-v2	2	64
0.918135	0.934579	0.67328	msmarco-distilroberta-base-v2	3	64
0.885314	0.869159	0.587424	msmarco-distilroberta-base-v2	1	128
0.953649	0.919003	0.664224	msmarco-distilroberta-base-v2	2	128
0.945016	0.936916	0.68591	msmarco-distilroberta-base-v2	3	128
0.949844	0.866822	0.613355	msmarco-distilroberta-base-v2	1	256
0.952544	0.893769	0.662694	msmarco-distilroberta-base-v2	2	256
0.964182	0.943925	0.62854	msmarco-distilroberta-base-v2	3	256

We can see some NaN values for the faithfullness scores which is based on an LLM-based evaluator. This was due to network errors when calling the OpenAI API.

Let’s now see which parameter configuration yielded the best Semantic Similarity Answer score

df.sort_values(by=['sas'], ascending=[False])

context_relevance	faithfulness	sas	embeddings	top_k	chunk_size
0.945016	0.936916	0.68591	msmarco-distilroberta-base-v2	3	128
0.909638	0.932243	0.676347	all-MiniLM-L6-v2	3	256
0.918135	0.934579	0.67328	msmarco-distilroberta-base-v2	3	64
0.953649	0.919003	0.664224	msmarco-distilroberta-base-v2	2	128
0.952544	0.893769	0.662694	msmarco-distilroberta-base-v2	2	256
0.912355	NaN	0.661135	all-MiniLM-L6-v2	2	128
0.94463	0.928349	0.659311	all-MiniLM-L6-v2	3	128
0.933489	0.948598	0.65133	all-MiniLM-L6-v2	3	64
0.941952	0.91472	0.645992	all-mpnet-base-v2	3	256
0.948038	NaN	0.643175	all-mpnet-base-v2	3	128
0.951702	0.925456	0.642837	all-MiniLM-L6-v2	2	256
0.946651	0.88785	0.639072	all-mpnet-base-v2	2	256
0.869485	0.895639	0.633806	all-MiniLM-L6-v2	2	64
0.964182	0.943925	0.62854	msmarco-distilroberta-base-v2	3	256
0.949844	0.866822	0.613355	msmarco-distilroberta-base-v2	1	256
0.901218	0.890654	0.611468	all-mpnet-base-v2	3	64
0.916422	0.892523	0.609728	all-mpnet-base-v2	2	128
0.88004	0.929907	0.600428	msmarco-distilroberta-base-v2	2	64
0.885314	0.869159	0.587424	msmarco-distilroberta-base-v2	1	128
0.82648	0.866044	0.584507	all-mpnet-base-v2	2	64
0.912991	0.827103	0.574832	all-MiniLM-L6-v2	1	256
0.867887	0.834112	0.560079	all-mpnet-base-v2	1	256
0.843447	0.831776	0.555873	all-MiniLM-L6-v2	1	128
0.897715	0.845794	0.538579	all-mpnet-base-v2	1	128
0.909813	0.738318	0.530884	msmarco-distilroberta-base-v2	1	64
0.834891	0.738318	0.524882	all-MiniLM-L6-v2	1	64
0.791589	0.67757	0.480863	all-mpnet-base-v2	1	64

Focusing on the Semantic Answer Similarity:

The msmarco-distilroberta-base-v2 embeddings model with a top_k=3 and a chunk_size=128 yields the best results.
In this evaluation, retrieving documents with top_k=3 will most usually yield a higher semantic similarity score than with top_k=1 or top_k=2
It also seems that regardless of the top_k and chunk_size the best semantic similarity scores come from using the embedding model all-MiniLM-L6-v2 and the msmarco-distilroberta-base-v2

Let’s inspect how the scores of each embedding model compare with each other in terms of Semantic Answer Similarity. For that, we will group the results by the embeddings column and plot the scores using box plots

from matplotlib import pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
df.boxplot(column='sas', by='embeddings', ax=ax)

plt.xlabel("Embeddings Model")
plt.ylabel("Semantic Answer Similarity Values")
plt.title("Boxplots of Semantic Answer Similarity Values Aggregated by Embeddings")

plt.show()

The box-plots above show that:

The all-MiniLM-L6-v2 and the msmarco-distilroberta-base-v2 embedding models outperform the all-mpnet-base-v2
The msmarco-distilroberta-base-v2 scores have less variance, indicating that this model is more stable to top_k and chunk_size parameter variations than the other models
All three embedding models have an outlier corresponding to the highest-scoring and lowest-scoring parameter combination
Not surprisingly, all the lowest scores outliers correspond to top_k=1 and chunk_size=64
The highest scores outliers correspond to top_k=3 and a chunk_size of 128 or 256

Since we have the ground truth answers, we focuses on the Semantic Similarity Answer, but let’s also look at the Faithfulness and Context Relevance scores for a few examples. For that, we will need to load the detailed scores:

detailed_best_sas_df = pd.read_csv("results/aragog_results/detailed_all-MiniLM-L6-v2__top_k:3__chunk_size:128.csv")

def inspect(idx):
    print("Question: ")
    print(detailed_best_sas_df.loc[idx]['questions'])
    print("\nTrue Answer:")
    print(detailed_best_sas_df.loc[idx]['true_answers'])
    print()
    print("Generated Answer:")
    print(detailed_best_sas_df.loc[idx]['predicted_answers'])
    print()
    print(f"Context Relevance  : {detailed_best_sas_df.loc[idx]['context_relevance']}")
    print(f"Faithfulness       : {detailed_best_sas_df.loc[idx]['faithfulness']}")
    print(f"Semantic Similarity: {detailed_best_sas_df.loc[idx]['sas']}")

Let’s look at the query question 6:

inspect(6)

Question: 
How does BERT's performance on the GLUE benchmark compare to previous state-of-the-art models?

True Answer:
BERT achieved new state-of-the-art on the GLUE benchmark (80.5%), surpassing the previous best models.

Generated Answer:
BERT's performance on the GLUE benchmark significantly outperforms previous state-of-the-art models, achieving 4.5% and 7.0% respective average accuracy improvement over the prior state of the art.

Context Relevance  : 1.0
Faithfulness       : 1.0
Semantic Similarity: 0.9051246047019958

Contexts:
recent work in this area.
Since its release, GLUE has been used as a testbed and showcase by the developers of several
inﬂuential models, including GPT (Radford et al., 2018) and BERT (Devlin et al., 2019). As shown
in Figure 1, progress on GLUE since its release has been striking. On GLUE, GPT and BERT
achieved scores of 72.8 and 80.2 respectively, relative to 66.5 for an ELMo-based model (Peters
et al., 2018) and 63.7 for the strongest baseline with no multitask learning or pretraining above the
word level. Recent models (Liu et al., 2019d; Yang et al., 2019) have clearly surpassed estimates of
non-expert human performance on GLUE (Nangia and Bowman, 2019). The success of these models
on GLUE has been driven by ever-increasing model capacity, compute power, and data quantity, as
well as innovations in 
---------
56.0 75.1
BERT BASE 84.6/83.4 71.2 90.5 93.5 52.1 85.8 88.9 66.4 79.6
BERT LARGE 86.7/85.9 72.1 92.7 94.9 60.5 86.5 89.3 70.1 82.1
Table 1: GLUE Test results, scored by the evaluation server ( https://gluebenchmark.com/leaderboard ).
The number below each task denotes the number of training examples. The “Average” column is slightly different
than the ofﬁcial GLUE score, since we exclude the problematic WNLI set.8BERT and OpenAI GPT are single-
model, single task. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and
accuracy scores are reported for the other tasks. We exclude entries that use BERT as one of their components.
We use a batch size of 32 and ﬁne-tune for 3
epochs over the data for all GLUE tasks. For each
task, we selected the best ﬁne-tuning learning rate
(among 5e-5, 
---------
4e-5, 3e-5, and 2e-5) on the Dev set.
Additionally, for BERT LARGE we found that ﬁne-
tuning was sometimes unstable on small datasets,
so we ran several random restarts and selected the
best model on the Dev set. With random restarts,
we use the same pre-trained checkpoint but per-
form different ﬁne-tuning data shufﬂing and clas-
siﬁer layer initialization.9
Results are presented in Table 1. Both
BERT BASE and BERT LARGE outperform all sys-
tems on all tasks by a substantial margin, obtaining
4.5% and 7.0% respective average accuracy im-
provement over the prior state of the art. Note that
BERT BASE and OpenAI GPT are nearly identical
in terms of model architecture apart from the at-
tention masking. For the largest and most widely
reported GLUE task, MNLI, BERT obtains a 4.6%
absolute accuracy improvement. On the ofﬁcial
GLUE leaderboard10, BERT LARGE obtains a score
of 
---------

In this example, the context relevancy and faithfulness scores are both 1.0. This indicates that the context is relevant to the question and our RAG LLM used this context to generate a semantically similar answer to the correct (ground-truth) answer.

Let’s take a look at another example:

inspect(44)

Question: 
How should future language model benchmarks be structured to ensure a holistic assessment of models' capabilities and knowledge breadth?

True Answer:
Future benchmarks should integrate a broader spectrum of subjects and cognitive skills, emphasizing the inclusion of tasks that test models' ethical reasoning, understanding of human values, and ability to perform complex problem-solving, beyond the mere scale of data and parameters.

Generated Answer:
Future language model benchmarks should be structured to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings across a diverse set of subjects that humans learn. The benchmark should cover a wide range of subjects across STEM, humanities, social sciences, and more, ranging in difficulty from elementary to advanced professional levels. It should test both world knowledge and problem-solving ability, ensuring a holistic assessment of models' capabilities and knowledge breadth.

Context Relevance  : 0.6
Faithfulness       : 1.0
Semantic Similarity: 0.6483339071273804

Contexts:
learning model
usage should be developed for guiding users to learn ‘Dos’
and Dont’ in AI. Detailed policies could also be proposed
to list all user’s responsibilities before the model access.
C. Language Models Beyond ChatGPT
The examination of ethical implications associated with
language models necessitates a comprehensive examina-
tion of the broader challenges that arise within the domainof language models, in light of recent advancements in
the field of artificial intelligence. The last decade has seen
a rapid evolution of AI techniques, characterized by an
exponential increase in the size and complexity of AI
models, and a concomitant scale-up of model parameters.
The scaling laws that govern the development of language
models,asdocumentedinrecentliterature[84,85],suggest
thatwecanexpecttoencounterevenmoreexpansivemod-
els that incorporate multiple modalities in the near future.
Efforts to integrate multiple modalities into a single model
are driven by the ultimate goal of realizing the concept of
foundation models [86]. 
---------
language models are
at learning and applying knowledge from many domains.
To bridge the gap between the wide-ranging knowledge that models see during pretraining and the
existing measures of success, we introduce a new benchmark for assessing models across a diverse
set of subjects that humans learn. We design the benchmark to measure knowledge acquired during
pretraining by evaluating models exclusively in zero-shot and few-shot settings. This makes the
benchmark more challenging and more similar to how we evaluate humans. The benchmark covers
57subjects across STEM, the humanities, the social sciences, and more. It ranges in difﬁculty from
an elementary level to an advanced professional level, and it tests both world knowledge and problem
solving ability. Subjects range from traditional areas, such as mathematics and history, to more
1arXiv:2009.03300v3 [cs.CY] 12 Jan 2021Published as a conference paper at 
---------
a
lack of access to the benefits of these models for people
who speak different languages and can lead to biased or
unfairpredictionsaboutthosegroups[14,15].Toovercome
this, it is crucial to ensure that the training data contains
a substantial proportion of diverse, high-quality corpora
from various languages and cultures.
b) Robustness: Another major ethical consideration
in the design and implementation of language models is
their robustness. Robustness refers to a model’s ability
to maintain its performance when given input that is
semantically or syntactically different from the input it
was trained on.
Semantic Perturbation: Semantic perturbation is a type
of input that can cause a language model to fail [40, 41].
This input has different syntax but is semantically similar
to the input used for training the model. To address this,
it is crucial to ensure that the training data is diverse and
representative of the population it will 
---------

It seems that for this question, the content is not completely relevant (Context Relevance = 0.6) and only the second context was used to generate the answer.

Running your own experiments

If you want to run this experiment yourself, follow the Python code evaluation_aragog.py in the haystack-evaluation repository.

Start by cloning the repository

git clone https://github.com/deepset-ai/haystack-evaluation
cd haystack-evaluation
cd evaluations

Next, run the Python script:

usage: evaluation_aragog.py [-h] --output_dir OUTPUT_DIR [--sample SAMPLE]

You can specify the output directory to hold the results and the sample size, i.e.: how many questions to use for the evaluation.

Don’t forget to define your Open AI API key in the environmental variable OPENAI_API_KEY

 OPENAI_API_KEY= python evaluation_aragog.py --output-dir experiment_a --sample 10

Execution Time and Costs

NOTE: all the numbers reported were run on an Mac Book Pro Apple M3 Pro with 36GB of RAM with Haystack 2.2.1 and Python 3.9

Indexing

The Indexing pipeline needs to consider the parameter combinations defined below:

3 different values for embedding_model
3 different chunk_size values

Therefore, the index runs 9 times in total.

RAG Pipeline

The RAG pipeline needs to run 27 times, since the following parameters affect the retrieval process:

3 different values for embedding_model
3 different top_k values
3 different chunk_size values

This needs to run for each of the 107 questions, so in total, the RAG pipeline will run 2.889 times (3 x 3 x 3 x 107) and produce 2889 calls to OpenAI API.

Evaluation Pipeline

The Evaluation pipeline also runs 27 times since all parameter combinations need to be evaluated for each of the 107 questions. Note, however, that the Evaluation pipeline contains two Evaluators that rely on an LLM through OpenAI API, so this pipeline runs 2.889 times. However, due to the Faithfulness and ContextRelevance evaluators, it will produce 5.778 (2 x 2.889) calls to OpenAI API.

You can see the detailed running times for each parameter combination in the Benchmark Times Spreadsheet.

Pricing

For detailed pricing information, visit OpenAI Pricing 💸

Lessons Learned

In this article, we have shown how to use the Haystack Evaluators to find the best combination of parameters that yield the best performance of our RAG pipeline, as opposed to using only the default parameters.

For this ARAGOG dataset, in particular, the best performance is achieved using the msmarco-distilroberta-base-v2 embeddings model instead of the default model (sentence-transformers/all-mpnet-base-v2), together with a top_k=3 and a chunk_size=128.

A few learnings are important to take:

When using an LLM through an external API, it is important to account for potential network errors or other issues. Ensure that during your experiments, running the questions through the RAG pipeline or evaluating the results doesn’t crash due to an error, for instance, by wrapping the call within a try/except code block.
Before starting your experiments, estimate the costs and time involved. If you plan to use an external LLM through an API, calculate approximately how many API calls you will need to run queries through your RAG pipeline and evaluate the results if you use LLM-based evaluators. This will help you understand the total costs and time required for your experiments.
Depending on your dataset size and running time, Python notebooks might not be the best approach to run your experiments; a Python script is probably a more reliable solution.
Beware of which parameters affect which components. For instance, for indexing, only the embedding_model and the chunk_size are important - this can reduce the number of experiments you need to carry out.

Explore a variety of evaluation examples tailored to different use cases and datasets by visiting the haystack-evaluation repository on GitHub.

RAG Evaluation with Prometheus 2

Mon, 17 Jun 2024 00:00:00 +0000

When building real-world applications based on Language Models (such as RAG), evaluation plays an important role. Recently, evaluating generated answers using powerful proprietary Language Models (such as GPT-4) has become popular and correlates well with human judgment, but it comes with its own limitations and challenges.

Prometheus 2 is a newly released family of open-source models specifically trained to evaluate the output of other Language Models. In this article (and in the related notebook), we will see how to use Prometheus and we will experiment with it to evaluate the generated responses of a RAG Pipeline using Haystack.

Language Models as Evaluators

With the rise of Language Models (LMs) demonstrating strong general capabilities across diverse tasks, evaluating answers generated by these models using other generative LMs has become a common and effective approach. Compared to statistical-based evaluation, this technique is convenient as it usually does not require ground truth labels.

Proprietary models such as GPT-4 or Claude 3 Opus are frequently chosen for evaluation and have shown a good correlation with human judgment. However, relying on closed models has several limitations:

data privacy: your data exits your machine and is transmitted to the model provider
transparency: the training data of these models is unknown
controllability: as these models are accessed via APIs, their behavior can change over time
price: despite a constant drop in prices, these large models remain expensive. Additionally, the evaluation process typically involves several cycles of testing and refinement, which can significantly increase the overall expense.

On the other hand, using open models for evaluation is an active research area, but their practical use is often limited. They typically do not correlate well with human judgments and lack flexibility (for more details, see the Prometheus 2 paper).

🔥 Prometheus 2: a strong open-source model for evaluation

Prometheus 2 is a new family of open-source models designed to bridge the gap between proprietary models and open LMs for evaluation.

The authors unified two different evaluation paradigms: direct assessment (evaluating the quality of an answer generated by a single model based on a specific criterion) and pairwise ranking (choosing the best answer between two, usually produced by different models).

In particular, for each variant, they started from a MistralAI base model, they trained 2 different models (one for each of the mentioned tasks) on open-source datasets and then merged their weights to create a robust evaluator Language Model.

The results are impressive:

two variants: 7B and 8x7B, fine-tuned from Mistral-7B and Mixtral8x7B, respectively
high correlation with human evaluations and proprietary models
the models are highly flexible: capable of performing direct assessments and pairwise rankings, and allowing custom evaluation criteria definition

Prompting Prometheus 2

The prompt template

Prometheus 2 models are generative language models trained to perform evaluation. To achieve the best results with these models, we need to follow a precise yet customizable prompt structure. You can find the prompt templates in the paper and on GitHub.

Since we want to experiment with Prometheus 2 to evaluate a single RAG system, we are primarily interested in the Direct Assessment prompt template, which allows evaluating the quality of an answer based on specific criteria. The following template includes a reference answer; the provided link also contains a version without it.

Let’s take a look at it.

You are a fair judge assistant tasked with providing clear, objective feedback 
based on specific criteria, ensuring each assessment reflects the absolute 
standards set for performance.

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a 
reference answer that gets a score of 5, and a score rubric representing a 
evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly 
based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. 
You should refer to the score rubric.
3. The output format should look as follows: \\"Feedback: (write a feedback for 
criteria) [RESULT] (an integer number between 1 and 5)\\"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{instruction}

###Response to evaluate:
{response}

###Reference Answer (Score 5):
{reference_answer}

###Score Rubrics:
{score_rubric}

###Feedback:

In this prompt template, the only parts to be customized are those enclosed in curly brackets.

We should provide:

the instruction to use for evaluation, which may include an input (e.g., a user question, if evaluating a RAG pipeline)
the LLM response to evaluate
the reference answer: a perfect answer, scoring 5 according to the score rubric
a score rubric with scores from 1 to 5, accurately describing when the response qualifies for each score.

When provided with such a prompt, the model will generate two outputs: detailed feedback and a score from 1 to 5.

An example

Let’s say we want to evaluate the correctness of a generated answer. In this case we will have a ground truth answer, but it is not mandatory.

Question: “Who won the 2022 World Cup?” Generated Answer: “Argentina won the 2022 FIFA World Cup. France won the 2018 FIFA World Cup.” Reference Answer: “Argentina won the 2022 FIFA World Cup.”

In this case, our prompt might look like the following:

[...]

###The instruction to evaluate:
Your task is to evaluate the generated answer against the reference answer for 
the question: Who won the 2022 World Cup?

###Response to evaluate:
Argentina won the 2022 FIFA World Cup. France won the 2018 FIFA World Cup.

###Reference Answer (Score 5):
Argentina won the 2022 FIFA World Cup.

###Score Rubrics:
Score 1: The answer is not relevant to the question and does not align with the reference answer.
Score 2: The answer is relevant to the question but deviates significantly from the reference answer.
Score 3: The answer is relevant to the question and generally aligns with the reference answer but has errors or omissions.
Score 4: The answer is relevant to the question and closely matches the reference answer but is less concise or clear.
Score 5: The answer is highly relevant, fully accurate, and matches the reference answer in both content and clarity.

###Feedback:

Calling the model with this prompt should return a result similar to this:

Both the generated response and the reference answer identify Argentina as the winner of the 2022 FIFA World Cup. This shows that the generated answer is accurate and relevant to the user’s query about who won the 2022 World Cup. However, the inclusion of an unrelated fact about France winning the 2018 World Cup does not enhance the relevance or accuracy of the response, making it somewhat less concise than the reference answer. While the information about France is not incorrect, it does not pertain to the original query about the 2022 World Cup. Thus, despite being generally correct, the additional detail in the generated response deviates from the concise nature of the reference answer, and it could be perceived as slightly less effective. Therefore, according to the score rubric, which places emphasis on relevance, conciseness, and accuracy, the generated response is not as well aligned with the reference answer as it could be. So the score is 4. [RESULT] 4

Experimenting with Prometheus 2 in Haystack

Haystack is an orchestration framework for building and evaluating applications based on LLMs. It comes with its own set of evaluators and integrates with different evaluation libraries. It is easy to extend Haystack functionalities, by creating custom components, so we can try to incorporate Prometheus 2.

This is the plan:

build and run an Indexing pipeline
build and run a RAG pipeline to evaluate
implement a PrometheusLLMEvaluator component
create different PrometheusLLMEvaluators
build and run an evaluation pipeline with different PrometheusLLMEvaluators

In this blog post, we summarize these steps, but you can find the full implementation in the accompanying notebook.

Indexing pipeline

Before running a RAG pipeline, we need to index some data.

We will be using a labeled PubMed dataset with questions, contexts and answers. This allows us to use the contexts as Documents and provides the necessary labeled data for some of the evaluation metrics we will define.

For simplicity, we will use the InMemoryDocumentStore. Our indexing pipeline will include a DocumentEmbedder (embedding model: sentence-transformers/all-MiniLM-L6-v2) and a DocumentWriter.

See the accompanying notebook for the full code on building a running an indexing pipeline.

RAG pipeline

Now that we have our data ready, we can create a simple RAG pipeline.

Our RAG pipeline will include:

InMemoryEmbeddingRetriever to retrieve the relevant documents for the query (based on the same embedding model as before)
PromptBuilder to dynamically create prompts
HuggingFaceLocalGenerator with google/gemma-1.1-2b-it to generate answers to queries. It is a small model, and later we will evaluate the quality of the generated responses based on custom criteria.
AnswerBuilder

Let’s run our RAG pipeline with a set of questions and save the data we need for evaluation: questions, ground truth answers, and generated answers.

Implement a PrometheusLLMEvaluator component

To perform evaluation, we create a custom Haystack Evaluator component, based on Prometheus 2.

This component allows you to develop a diverse range of evaluators.

You can find the implementation in the accompanying notebook. Let’s take a high-level overview of this component:

init parameters
- template: A Jinja2 prompt template that adheres to the Prometheus 2 prompt structure, with placeholders for input data that we want to pass at runtime (e.g., question, generated_answer, ground_truth_answer)
- inputs: A list of tuples in the format (input_name, input_type). These are the inputs that the evaluator expects and uses for evaluation. They should match those defined in the template.
- generator: (hacky) Allows passing different types of Haystack generators to use the Prometheus 2 model. Examples: HuggingFaceLocalGenerator, LlamaCPPGenerator, etc.
run method: for each example to evaluate, the inputs are validated, integrated into the prompt and passed to the model. The model output is parsed to extract score and feedback. This method returns a dictionary containing an aggregate score, individual_scores and feedbacks.

Create different Evaluators

Let’s see how we can use the PrometheusLLMEvaluator.

We start by creating a Correctness Evaluator, similar to the above example.

First, we initialize a generator to load the Prometheus 2 model; in particular, we are using the small variant (7B).

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    model="prometheus-eval/prometheus-7b-v2.0",
    task="text2text-generation",
		...
)
generator.warm_up()

In this example, we are using the HuggingFaceLocalGenerator, which can run on the free GPU provided by Colab, but there are several other options, depending on your environment: LlamaCPPGenerator for resource-constrained environments (even without a GPU); TGI (via HuggingFaceAPIGenerator) and vLLM for production environments with available GPU resources.

Next, let’s prepare the prompt template for the Correctness evaluator. Note that we are inserting placeholders for query, generated_answer and reference_answer. These fields will be populated dynamically based on RAG results and ground truth answers.

correctness_prompt_template = """
...
###The instruction to evaluate:
Your task is to evaluate the generated answer against the reference answer for the question: {{query}}

###Response to evaluate:
generated answer: {{generated_answer}}

###Reference Answer (Score 5): {{reference_answer}}

###Score Rubrics:
Score 1: The answer is not relevant to the question and does not align with the reference answer.
Score 2: The answer is relevant to the question but deviates significantly from the reference answer.
Score 3: The answer is relevant to the question and generally aligns with the reference answer but has errors or omissions.
Score 4: The answer is relevant to the question and closely matches the reference answer but is less concise or clear.
Score 5: The answer is highly relevant, fully accurate, and matches the reference answer in both content and clarity.

###Feedback:"""

Finally, let’s initialize our evaluator, specifying which inputs it should expect at runtime (they should match the placeholders of the above prompt template).

correctness_evaluator = PrometheusLLMEvaluator(
    template=correctness_prompt_template,
    generator=generator,
    inputs=[
        ("query", List[str]),
        ("generated_answer", List[str]),
        ("reference_answer", List[str]),
    ],
)

Similarly, we can create other evaluators:

Response Relevance: Evaluates the generated answer in terms of its relevance to the user’s question.
Logical Robustness: Evaluates the logical organization and progression of the response.

These evaluators do not require ground truth labels. For details about prompt templates and required inputs, refer to the accompanying notebook.

Evaluation pipeline

We can now put our evaluators in a pipeline, and run it to see how our small model performs.

from haystack import Pipeline

eval_pipeline = Pipeline()
eval_pipeline.add_component("correctness_evaluator", correctness_evaluator)
eval_pipeline.add_component("response_relevance_evaluator", response_relevance_evaluator)
eval_pipeline.add_component("logical_robustness_evaluator", logical_robustness_evaluator)

eval_results = eval_pipeline.run(
    {
        "correctness_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
            "reference_answer": ground_truth_answers,
        },
        "response_relevance_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
        },
        "logical_robustness_evaluator": {
            "query": questions,
            "generated_answer": rag_answers,
        },
    }
)

Once we’ve run our evaluation pipeline, we can also create a full evaluation report. Haystack provides an EvaluationRunResult which we can use to display a score_report.

from haystack.evaluation.eval_run_result import EvaluationRunResult

inputs = {
    "question": questions,
    "answer": ground_truth_answers,
    "predicted_answer": rag_answers,
}

evaluation_result = EvaluationRunResult(run_name="pubmed_rag_pipeline", inputs=inputs, results=eval_results)
evaluation_result.score_report()

In our experiment (involving a small sample of 10 examples), we get the following results:

Evaluation	Score
correctness_evaluator	3.9
response_relevance_evaluator	4.3
logical_robustness_evaluator	3.5

Gemma-1.1-2b-it seems to generate relevant answers, but the responses differ from ground truth answers and the logical organization is not optimal.

To inspect these results in more detail, we can convert the evaluation_result to a Pandas dataframe and also look at the individual feedbacks of each evaluator for each example.

Wrapping it up

In this post, you have learned about Prometheus 2: a new family of SOTA open-source models for evaluation.

After introducing the models and their specific usage, we have put them in action in Haystack and created different Evaluators to assess the quality of the responses produced by a RAG pipeline, along several axes.

The results of our experiments are interesting and promising. However, before using these models for real-world applications, you should assess them for your specific use case. Moreover, in this rapidly changing world, perhaps the day is not far off when general-purpose open models can be used effectively for evaluation.

Building RAG Applications with NVIDIA NIM and Haystack on K8s

Sun, 02 Jun 2024 00:00:00 +0000

Retrieval-augmented generation (RAG) systems combine generative AI with information retrieval for contextualized answer generation. Building reliable and performant RAG applications at scale is challenging. In this blog, we show how to use Haystack and NVIDIA NIM to create a RAG solution which is easy to deploy/maintain, standardized and enterprise-ready, that can run on-prem as well as on cloud native environments. This recipe is applicable in the cloud, on-premise or even in air-gapped environments.

About Haystack

Haystack, by deepset, is an open source framework for building production-ready LLM applications, RAG pipelines and state-of-the-art search systems that work intelligently over large document collections.

Haystack’s growing ecosystem of community integrations provide tooling for evaluation, monitoring, transcription, data ingestion and more. The NVIDIA Haystack integration allows using NVIDIA models and NIMs in Haystack pipelines, giving the flexibility to pivot from prototyping in the cloud to deploying on-prem.

About NVIDIA NIM

NVIDIA NIM is a collection of containerized microservices designed for optimized inference of state-of-the-art AI models. The container uses a variety of components to serve AI models and exposes them via standard API. Models are optimized using TensorRT or TensorRT-LLM (depending on the type of the model), applying procedures such as quantization, model distribution, optimized kernel/runtimes and inflight- or continuous batching among others allowing even further optimization if needed. Learn more about NIM here.

This tutorial shows how to build a Haystack RAG pipeline leveraging NIMs hosted on the NVIDIA API catalog. Then, we provide instructions on deploying NIMs on your infrastructure in a Kubernetes environment for self-hosting AI foundation models. Note that hosting NIMs requires NVIDIA AI Enterprise license.

Build a Haystack RAG Pipeline with NVIDIA NIMs hosted on the NVIDIA API Catalog

For RAG pipelines, Haystack provides 3 components that can be connected with NVIDIA NIMs:

NvidiaGenerator: Text generation with LLM NIM.
NvidiaDocumentEmbedder: Document embedding with NVIDIA NeMo Retriever Text Embedding NIM.
NvidiaTextEmbedder: Query embedding with NVIDIA NeMo Retriever Text Embedding NIM.

Fig. 1 - Haystack Indexing and RAG pipelines with NVIDIA NIMs

For this section, we have provided scripts and instructions for building a RAG pipeline leveraging NIMs hosted on the NVIDIA API catalog as part of the GitHub repository. We also provide a Jupyter Notebook for building the same RAG pipeline using NIMs deployed on your infrastructure in a Kubernetes environment.

Vectorize Documents with Haystack Indexing Pipelines

Our indexing pipeline implementation is available in the indexing tutorial. Haystack provides several preprocessing components for document cleaning, splitting, embedders, as well as converters extracting data from files in different formats. In this tutorial, we will store PDF files in a QdrantDocumentStore. NvidiaDocumentEmbedder is used to connect with NIMs hosted on the NVIDIA API catalog. Below is an example of how to initialize the embedder component with the snowflake/arctic-embed-l NIM hosted on the NVIDIA API catalog.

from haystack.utils.auth import Secret
from haystack_integrations.components.embedders.nvidia import NvidiaDocumentEmbedder


embedder = NvidiaDocumentEmbedder(model="snowflake/arctic-embed-l",
                                  api_url="https://ai.api.nvidia.com/v1/retrieval/snowflake/arctic-embed-l",
                                  batch_size=1)

Creating the Haystack RAG Pipeline

In our example, we will create a simple question/answering RAG pipeline using both NVIDIA NeMo Retriever Text Embedding NIM and LLM NIM. For this pipeline, we use the NvidiaTextEmbedder to embed the query for retrieval, and the NvidiaGenerator to generate a response. Example below shows how to instantiate the generator using meta/llama3-70b-instruct LLM NIM hosted on the NVIDIA API catalog.

generator = NvidiaGenerator(
    model="meta/llama3-70b-instruct",
    api_url="https://integrate.api.nvidia.com/v1",
    model_arguments={
        "max_tokens": 1024
    }
)

We use Haystack pipelines to connect various components of this RAG pipeline including query embedders and LLM generators. Below is an example of a RAG pipeline:

from haystack import Pipeline
from haystack.utils.auth import Secret
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder
from haystack_integrations.components.generators.nvidia import NvidiaGenerator
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(embedding_dim=1024, host="qdrant")

embedder = NvidiaTextEmbedder(model="snowflake/arctic-embed-l", 
                                  api_key=Secret.from_env_var("NVIDIA_EMBEDDINGS_KEY"), 
                                  api_url="https://ai.api.nvidia.com/v1/retrieval/snowflake/arctic-embed-l")

retriever = QdrantEmbeddingRetriever(document_store=document_store)

prompt = """Answer the question given the context.
Question: {{ query }}
Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}
Answer:"""
prompt_builder = PromptBuilder(template=prompt)

generator = NvidiaGenerator(
    model="meta/llama3-70b-instruct",
    api_url="https://integrate.api.nvidia.com/v1",
    model_arguments={
        "max_tokens": 1024
    }
)

rag = Pipeline()
rag.add_component("embedder", embedder)
rag.add_component("retriever", retriever)
rag.add_component("prompt", prompt_builder)
rag.add_component("generator", generator)

rag.connect("embedder.embedding", "retriever.query_embedding")
rag.connect("retriever.documents", "prompt.documents")
rag.connect("prompt", "generator")

Indexing Files and Deploying the Haystack RAG Pipeline

Hayhooks allows the deployment of RAG pipelines in a containerized environment. In our example, we have provided a docker-compose file to setup both the Qdrant database, and the RAG pipeline. As we are leveraging NIMs hosted on the NVIDIA API catalog, we need to set the API keys for the NIMs in the .env file. The instructions below expect NVIDIA_API_KEY (for NvidiaGenerator) and NVIDIA_EMBEDDINGS_KEY (for NvidiaDocumentEmbedder and NvidiaTextEmbedder).

Executing docker-compose up will launch 3 containers: qdrant, hayhooks and qdrant-setup (which will run our indexing pipeline and stop). The Qdrant database will be deployed on the localhost and exposed at port 6333. The Qdrant dashboard allows users to inspect the vectorized documents at localhost:6333/dashboard.

Serializing Pipelines

Haystack pipelines defined in Python can be serialized to YAML by calling dump() on the pipeline object, as shown in our RAG pipeline tutorial. The YAML definition is as follows:

components:
  embedder:
    ...
    type: haystack_integrations.components.embedders.nvidia.text_embedder.NvidiaTextEmbedder
  generator:
    init_parameters:
      api_key:
        ...
    type: haystack_integrations.components.generators.nvidia.generator.NvidiaGenerator
  prompt:
    init_parameters:
      template: "Answer the question given the context.\nQuestion: {{ query }}\nContext:\n\
        {% for document in documents %}\n    {{ document.content }}\n{% endfor %}\n\
        Answer:"
    type: haystack.components.builders.prompt_builder.PromptBuilder
  retriever:
    init_parameters:
      document_store:
        init_parameters:
          ...
        type: haystack_integrations.document_stores.qdrant.document_store.QdrantDocumentStore
      ...
    type: haystack_integrations.components.retrievers.qdrant.retriever.QdrantEmbeddingRetriever

connections:
- receiver: retriever.query_embedding
  sender: embedder.embedding
- receiver: prompt.documents
  sender: retriever.documents
- receiver: generator.prompt
  sender: prompt.prompt
max_loops_allowed: 100
metadata: {}

Deploy the RAG Pipeline

To deploy the RAG pipeline, execute hayhooks deploy rag.yaml which will expose the pipeline on http://localhost:1416/rag by default. You can then visit http://localhost:1416/docs for the API docs and try out the pipeline.

For production, Haystack provides Helm charts and instructions to create services running Hayhooks with a container orchestrator like Kubernetes.

In the next sections, we will show how to deploy, monitor and autoscale NIMs on your infrastructure in a Kubernetes environment for self-hosting AI foundation models. Finally, we will provide instructions on how to use them in the Haystack RAG pipeline.

Self-hosting NVIDIA NIMs on a Kubernetes cluster

Kubernetes Cluster Environment

In this tutorial, the setup environment consists of a DGX H100 with 8 H100 GPUs each having 80GB of memory as host and with Ubuntu as the operating system. Docker is used as the container runtime. Kubernetes is deployed on it using Minikube. To enable GPU utilization in Kubernetes, we install essential NVIDIA software components using the GPU Operator.

NVIDIA NIMs Deployment

As part of this setup, we deploy following NVIDIA NIMs into the Kubernetes cluster using Helm charts:

The LLM NIM, which uses the model llama3-8b-instruct
The NeMo Retriever Text Embedding NIM, which uses the model nvidia/nv-embedqa-e5-v5

The Helm chart for the LLM NIM is located in GitHub whereas the helm chart for NVIDIA NeMo Retriever Embedding NIM can be fetched from the the NGC registry. Figure 3 illustrates the deployment of NIMs on a Kubernetes cluster running on a DGX H100. The GPU Operator components are deployed via its Helm chart and are part of the GPU Operator stack. Prometheus and Grafana are deployed via Helm charts for monitoring the Kubernetes cluster and the NIM.

Figure 3 - NVIDIA NIMs and other components deployment on a Kubernetes cluster

The LLM NIM Helm chart contains the LLM NIM container, which runs within a pod and references the model via Persistent Volume (PV) and Persistent Volume Claim (PVC). The LLM NIM pods are autoscaled using the Horizontal Pod Autoscaler (HPA) based on custom metrics and are exposed via Kubernetes ClusterIP service. To access the LLM NIM, we deploy an ingress and expose it at the /llm endpoint.

Similarly, the NeMo Retriever Text Embedding NIM Helm chart includes the NeMo Retriever Text Embedding NIM container, which runs within a pod and references the model on the host via PV and PVC. The NeMo Retriever Text Embedding NIM pods are also autoscaled via HPA and are exposed via Kubernetes ClusterIP service. To access the NeMo Retriever Text Embedding NIM, we deploy an ingress and expose it at the /embedding endpoint.

Users and other applications can access the exposed NVIDIA NIMs via the ingress. The vector database Qdrant is deployed using this helm chart.

Now, let’s take a closer look at the deployment process for each NIM:

LLM NIM deployment

Create the namespace, if it is not already created yet:

kubectl create namespace nim-llm

Add a Docker registry secret that will be used for pulling NIM containers from NGC and replace with the API key from NGC. Follow this link for generating an API key in NGC.

kubectl create secret -n nim-llm docker-registry nvcrimagepullsecret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' --docker-password=

Create a generic secret ngc-api, which is used to pull the model within the NIM container.

kubectl create secret -n nim-llm generic ngc-api \
    --from-literal=NGC_CLI_API_KEY=

Create nim-llm-values.yaml file with the below content. Adjust repository and tag values depending on your environment.

image:
  # Adjust to the actual location of the image and version you want
  repository: nvcr.io/nim/meta/llama3-8b-instruct
  tag: 1.0.0
imagePullSecrets:
  - name: nvcrimagepullsecret
model:
  name: meta/llama3-8b-instruct
  ngcAPISecret: ngc-api
persistence:
  enabled: true
  size: 30Gi
  annotations:
    helm.sh/resource-policy: keep
statefulSet:
    enabled: false
resources:
  limits:
    nvidia.com/gpu: 1

We assume that the helm chart for the LLM NIM is located here: ./nims/helm/nim-llm/. You can change the command accordingly depending on where the helm chart is located. Deploy the LLM NIM by running the following command:

helm -n nim-llm install nim-llm -f ./nims/helm/nim-llm/ nim-llm-values.yaml

The deployment takes a few minutes to start the containers, download models, and become ready. You can monitor the pods with the below command:

kubectl get pods -n nim-llm

Example Output

NAME        READY   STATUS    RESTARTS   AGE
nim-llm-0   1/1     Running   0          8m21s

Install an ingress controller, if it has not been installed already. Then, create a file ingress-nim-llm.yaml with the below content to create the ingress for the LLM NIM. Make sure to change the host (here nims.example.com) with your fully qualified domain name.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nim-llm-ingress
  namespace: nim-llm
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  rules:
    - host: nims.example.com
      http:
        paths:
          - path: /llm(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: nim-llm
                port:
                  number: 8000

Deploy the ingress with the below command:

kubectl apply -f ingress-nim-llm.yaml

Access the exposed service by making a curl request for testing (replace nims.example.com with you own fully qualified domain name)

curl -X 'POST' 'http://nims.example.com/llm/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "messages": [
    {
      "content": "You are a polite and respectful chatbot helping people plan a vacation.",
      "role": "system"
    },
    {
      "content": "What shall i do in France in one line?",
      "role": "user"
    }
  ],
  "model": "meta/llama3-8b-instruct",
  "temperature": 0.5,
  "max_tokens": 1024,
  "top_p": 1,
  "stream": false
}'

Example output:

{
  "id": "cmpl-44c301e7f12942fb830fc53a58e98e2a",
  "object": "chat.completion",
  "created": 1724399020,
  "model": "meta/llama3-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Indulge in the rich culture and history of France by visiting iconic landmarks like the Eiffel Tower, Notre-Dame Cathedral, and the Louvre Museum, and savor the country's renowned cuisine and wine in charming cities like Paris, Lyon, and Bordeaux."
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": 128009
    }
  ],
  "usage": {
    "prompt_tokens": 39,
    "total_tokens": 94,
    "completion_tokens": 55
  }
}

Now, we have the LLM NIM up and running.

NeMo Retriever Text Embedding NIM deployment

The deployment of the NeMo Retriever Text Embedding NIM is similar to the LLM NIM.

Follow steps 1 - 3 as LLM NIM deployment but replace namespace with nim-embedding in the commands.
Create nim-embedding-values.yaml file with the below content. Adjust repository and tag values depending on your environment.

image:
  repository: nvcr.io/nim/nvidia/nv-embedqa-e5-v5
  tag: 1.0.0
  pullPolicy: IfNotPresent
resources:
  limits:
    ephemeral-storage: 30Gi
    nvidia.com/gpu: 1
    memory: 12Gi
    cpu: "16000m"
  requests:
    ephemeral-storage: 3Gi
    nvidia.com/gpu: 1
    memory: 8Gi
    cpu: "4000m"
metrics:
  enabled: true

Fetch the NeMo Retriever Text Embedding NIM Helm chart, by running following commands (you would need to create NGC CLI API key, check here on how to create it):

helm fetch https://helm.ngc.nvidia.com/nim/nvidia/charts/text-embedding-nim-1.0.0.tgz --username='$oauthtoken' --password=

We assume that the helm chart for the NeMo Retriever Text Embedding NIM is located here: ./nims/helm/nim-embedding/. You can change the command accordingly depending on where the helm chart is located. Deploy the NeMo Retriever Text Embedding NIM by running the following command

helm -n nim-embedding install nim-embedding -f embedding-nim-values.yaml /nims/helm/nim-embedding/text-embedding-nim-1.0.0.tgz

The deployment takes a few minutes to start the container, download model, and become ready. You can monitor the pod with the below command:

kubectl get pods -n nim-embedding

Example Output

NAME                                 READY   STATUS     RESTARTS   AGE
nemo-embedding-ms-58dd974469-tpdvn   1/1     Running   0          5m

Create a file ingress-nim-embedding.yaml similar to the LLM NIM ingress with service name nemo-embedding-ms, port 8080, and path /embedding(/|$)(.*) as below:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nim-embedding-ingress
  namespace: nim-embedding
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  rules:
    - host: nims.example.com
      http:
        paths:
          - path: /embedding(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: nemo-embedding-ms
                port:
                  number: 8080

Access the exposed service by making a curl request for testing (replace in below the nims.example.com with your fully qualified domain name).

curl 'GET' \
  'http://nims.example.com/embedding/v1/models' \
  -H 'accept: application/json'

Example output:

{
  "object": "list",
  "data": [
    {
      "id": "nvidia/nv-embedqa-e5-v5",
      "created": 0,
      "object": "model",
      "owned_by": "organization-owner"
    }
  ]
}

7 You can test the NeMo Retriever Text Embedding NIM by creating embeddings for sample text as below:

curl -X "POST" \
  "http://nims.example.com/embedding/v1/embeddings" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
"input": ["Hello world"],
"model": "nvidia/nv-embedqa-e5-v5",
"input_type": "query"
}'

Example output:

{
  "object": "list",
  "data": [
    {
      "index": 0,
      "embedding": [
        -0.0003485679626464844,
        -0.017822265625,
        0.0262298583984375,
        0.0222015380859375,
        ...
        -0.00823974609375
      ],
      "object": "embedding"
    }
  ],
  "model": "nvidia/nv-embedqa-e5-v5",
  "usage": {
    "prompt_tokens": 6,
    "total_tokens": 6
  }
}

Now, we have the NeMo Retriever Text Embedding NIM up and running.

Once the above procedure is completed, you will have API endpoints of LLM NIM and NeMo Retriever Text Embedding NIM.

Operational Considerations

Monitoring and autoscaling are essential for deployed NVIDIA NIMs to ensure efficient, effective, and reliable operation. Monitoring tracks performance metrics, detects errors, and optimizes resource utilization, while autoscaling dynamically adjusts resources to match changing workloads, ensuring the NVIDIA NIMs can handle sudden spikes or dips in demand. This enable NVIDIA NIMs to provide accurate and timely responses, even under heavy loads, while optimizing costs and maintaining high availability. In this section, we will delve into details of deploying monitoring and enabling autoscaling for NVIDIA NIMs.

Monitoring

NVIDIA NIM metrics are collected with the open-source tool Prometheus and visualized with the Grafana dashboards. NVIDIA dcgm-exporter is the preferred tool to collect GPU telemetry. We follow the instructions from here for the deployment of Prometheus and Grafana.

Visualizing NVIDIA NIM Metrics

LLM NIM

By default, NVIDIA LLM NIM metrics are exposed at http://localhost:8000/metrics by the LLM NIM container. All the exposed metrics are listed here. Using a Prometheus ServiceMonitor they can be published to Prometheus and be viewed in the Grafana. The Prometheus ServiceMonitor is used to define application to scrape metrics from within Kubernetes cluster.

Create a file service-monitor-nim-llm.yaml with the below content.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-llm-sm
  namespace: nim-llm
spec:
  endpoints:
  - interval: 30s
    targetPort: 8000
    path: /metrics
  namespaceSelector:
    matchNames:
    - nim-llm
  selector:
    matchLabels:
      app.kubernetes.io/name: nim-llm

Create a Prometheus ServiceMonitor using the below command:

kubectl apply -f service-monitor-nim-llm.yaml

In the prometheus UI under Status -> Targets, you will see the below ServiceMonitor once it’s deployed.

Let’s check some inference metrics on the Prometheus UI. Figure 4 shows the stacked graph for request_success_total NIM metric.

Figure 4 - Prometheus UI showing the plot of request_success_total metric indicating number of finished requests.

NeMo Retriever Text Embedding NIM

NeMo Retriever Text Embedding NIM exposes underlying Triton metrics on port 8002. All the exposed metrics are listed here. We will create a service monitor to expose these metrics so as to publish them in Prometheus and visualize in Grafana.

Create a file service-monitor-nim-embedding.yaml with the below content.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-embedding-sm
  namespace: nim-embedding
spec:
  endpoints:
  - interval: 30s
    targetPort: 8002
    path: /metrics
  namespaceSelector:
    matchNames:
    - nim-embedding
  selector:
    matchLabels:
      app.kubernetes.io/name: text-embedding-nim

Create a Prometheus ServiceMonitor using the below command:

kubectl apply -f service-monitor-nim-embedding.yaml

In the prometheus UI under Status -> Targets, you will see the below ServiceMonitor once it’s deployed.

We can check some triton metrics on the Prometheus UI. Figure 5 shows the stacked graph for nv_inference_count metric.

Figure 5 - Prometheus UI showing the stacked plot of nv_inference_count metric indicating number of inferences performed.

Autoscaling NVIDIA NIM

In this tutorial, we use the Kubernetes Horizontal Pod Autoscaler - HPA to adjust the scaling of the NIM pods. We have defined custom metrics to monitor the average GPU usage of each NVIDIA NIM and used by the Horizontal Pod Autoscaler (HPA) to dynamically adjust the number of NIM pods. See the metric definition below:

Metrics	Expressions
nim_llm_gpu_avg	`avg by (kubernetes_node, pod, namespace, gpu) (DCGM_FI_DEV_GPU_UTIL{pod=~"nim-llm-.*"})`
nim_embedding_gpu_avg	`avg by (kubernetes_node, pod, namespace, gpu) (DCGM_FI_DEV_GPU_UTIL{pod=~"nemo-embedding-ms-.*"})`

These metrics are example metrics and one should adjust them based on their environment.

Let’s deploy the HPA.

Create a file with the name prometheus_rule_nims.yaml with the below content to create the prometheus rules for the above custom metric. Adjust the labels (app, other prometheus labels) according to the current deployed prometheus instance.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: kube-prometheus-stack-1710254997
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 56.8.2
    chart: kube-prometheus-stack-56.8.2
    heritage: Helm
    release: kube-prometheus-stack-1710254997
  name: kube-prometheus-stack-1709-gpu.rules
  namespace: prometheus
spec:
  groups:
  - name: gpu.rules
    rules:
    - expr: avg by (kubernetes_node, pod, namespace, gpu) (DCGM_FI_DEV_GPU_UTIL{pod=~"nim-llm-.*"})
      record: nim_llm_gpu_avg
    - expr: avg by (kubernetes_node, pod, namespace, gpu) (DCGM_FI_DEV_GPU_UTIL{pod=~"nemo-embedding-ms-.*"})
      record: nim_embedding_gpu_avg

Create custom prometheus recording rules by running the below command:

kubectl apply -f prometheus_rule_nims.yaml

In prometheus UI, under Status -> Rules, you can see the above two created rules as shown in Figure 6.

Figure 6 - Prometheus rules tab showing the created custom rules to record GPU usage by NVIDIA NIM.

Install prometheus-adapter to query the custom metrics based on the custom recording rules created above and register them to the custom metrics API for HPA to fetch. Replace in below command with the name of the prometheus service in Kubernetes.

helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter --set prometheus.url="http://.prometheus.svc.cluster.local"

Query the custom metrics API to see if the metrics have been registered using the below command:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep llms

Example Output:

"name": "pods/nim_embedding_gpu_avg",
"name": "namespaces/nim_embedding_gpu_avg",
"name": "pods/nim_llm_gpu_avg",
"name": "namespaces/nim_llm_gpu_avg",

A separate HPA definition is created for the two NVIDIA NIMs. Within this definition, we specify the minimum and maximum number of replicas, the metric to monitor, and the target value for that metric. Below is the definition for the LLM NIM HPA and you can create the similar for the NeMo Retriever Text Embedding NIM using nim_embedding_gpu_avg metric.

LLM NIM HPA file:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa
  namespace: nim-llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nim-llm
  minReplicas: 1
  maxReplicas: 2
  metrics:
    - type: Pods
      pods:
        metric:
          name: nim_llm_gpu_avg
        target:
          type: AverageValue
          averageValue: 30

Create the two HPAs using the below commands:

kubectl apply -f hpa_nim_llm.yaml
kubectl apply -f hpa_nim_embedding.yaml

Check the status of HPAs:

kubectl get hpa -A

Example Output:

NAMESPACE       NAME                REFERENCE                      TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
nim-embedding   nim-embedding-hpa   Deployment/nemo-embedding-ms   0/30           1         2         1          80s
nim-llm         nim-llm-hpa         Deployment/nim-llm             0/30           1         2         1          2m11s

Send some requests to LLM NIM and see the LLM NIM pod getting scaled as shown below:

NAME        READY   STATUS    RESTARTS   AGE
nim-llm-0   1/1     Running   0          3h47m
nim-llm-1   1/1     Running   0          3m30s

Also, Figure 6 shows the prometheus graph showing the scaling of LLM NIM.

Figure 7 - Prometheus graph showing the scaling of LLM NIM.

We have now deployed NVIDIA NIMs on your infrastructure in a scalable fashion. We can now use them in the RAG pipeline. The next section provides the details for the same.

Use Self-hosted NVIDIA NIMs in the RAG Pipeline

This section provides instructions to use previously deployed NVIDIA NIMs on your infrastructure in a Kubernetes cluster for NvidiaTextEmbedder, NvidiaDocumentEmbedder and NvidiaGenerator in the Haystack RAG pipeline, replacing with the endpoint of the NeMo Retriever Text Embedding NIM and with the LLM NIM. The provided notebook in the repository has examples of how to use the self-hosted NVIDIA NIMs.

NvidiaDocumentEmbedder:

embedder = NvidiaDocumentEmbedder(
    model=embedding_nim_model,
    api_url="http:///v1"
)

NvidiaTextEmbedder:

# initialize NvidiaTextEmbedder with the self-hosted NeMo Retriever Text Embedding NIM URL
embedder = NvidiaTextEmbedder(
    model=embedding_nim_model,
    api_url="http:///v1"
)

NvidiaGenerator:

# initialize NvidiaGenerator with the self-hosted LLM NIM URL
generator = NvidiaGenerator(
    model=llm_nim_model_name,
    api_url="http:///v1",
    model_arguments={
        "temperature": 0.5,
        "top_p": 0.7,
        "max_tokens": 2048,
    },
)

Summary

In this blog, we provide a comprehensive walkthrough for building robust and scalable RAG applications using Haystack and NVIDIA NIMs. We cover building the RAG pipeline by leveraging NVIDIA NIMs hosted on the NVIDIA API catalog and also using self-hosted NVIDIA NIMs deployed on your infrastructure in a Kubernetes environment. Our step-by-step instructions detail how to deploy NVIDIA NIMs in a Kubernetes cluster, monitor their performance, and scale them as needed.

By leveraging proven deployment patterns, our architecture ensures a responsive user experience and predictable query times, even in the face of high or bursty user queries and document indexing workloads. Moreover, our deployment recipe is flexible, allowing for easy implementation in cloud, on-premise, or air-gapped environments. With this guide, we aim to provide a resource for anyone looking to build reliable and performant RAG applications at scale.

Monitor and trace your Haystack pipelines with Langfuse

Fri, 17 May 2024 00:00:00 +0000

Getting your LLM application into production is a huge milestone, but that’s only the beginning. It’s critical to monitor how your pipeline is performing in the real world so you can keep improving performance and cost, and proactively address any issues that might arise.

With the new Haystack Langfuse integration, it’s now easier than ever to have visibility into your pipelines. In this post, we’ll explain more about Langfuse, and demonstrate how to trace an end to end request to a Haystack pipeline.

What is Langfuse?

Langfuse is an open source LLM engineering platform. It offers a ton of features to help you understand how your LLM application is performing under the hood.

Langfuse features and benefits

Track model usage and cost
Collect user feedback
Identify low-quality outputs
Build fine-tuning and testing datasets
Open source 💙
Self-hosted version available
Frequent releases with new features and improvements
as of the time of this writing, free to try out 🤑

Getting started

In order to use this integration, you’ll need to sign up for a Langfuse account. See the Langfuse docs for the most up-to-date information about features and pricing.

Prerequisites

First, sign up for an account on the Langfuse website.

On the Langfuse dashboard, make a note of your LANGFUSE_SECRET_KEY and LANGFUSE_PUBLIC_KEY. Set them as environment variables. While you’re at it, set the HAYSTACK_CONTENT_TRACING_ENABLED environment variable to true in order to enable Haystack tracing in your pipeline.

The following code examples also require an OPENAI_API_KEY environment variable to be set. Haystack is model-agnostic and you can use any model provider we support, by changing the generator in the code samples below.

Installation

To install the integration run the following command in your terminal:

pip install langfuse-haystack

To use Langfuse in a pipeline you’ll need a few additional dependencies:

pip install sentence-transformers datasets

Use Langfuse in a RAG pipeline

First, import all the modules you’ll need.

from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.connectors.langfuse import LangfuseConnector

Next, write a function that takes a DocumentStore and returns a Haystack RAG pipeline. Add the LangfuseConnector to your pipeline, but don’t connect it to any other component in the pipeline.

def get_pipeline(document_store: InMemoryDocumentStore):
    retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=2)

    template = """
    Given the following information, answer the question.
    Context:
    {% for document in documents %}
        {{ document.content }}
    {% endfor %}
    Question: {{question}}
    Answer:
    """

    prompt_builder = PromptBuilder(template=template)

    basic_rag_pipeline = Pipeline()
    # Add components to your pipeline
    basic_rag_pipeline.add_component("tracer", LangfuseConnector("Basic RAG Pipeline"))
    basic_rag_pipeline.add_component(
        "text_embedder", SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
    )
    basic_rag_pipeline.add_component("retriever", retriever)
    basic_rag_pipeline.add_component("prompt_builder", prompt_builder)
    basic_rag_pipeline.add_component("llm", OpenAIGenerator(model="gpt-3.5-turbo", generation_kwargs={"n": 2}))

    # Now, connect the components to each other
    # NOTE: the tracer component doesn't need to be connected to anything in order to work
    basic_rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
    basic_rag_pipeline.connect("retriever", "prompt_builder.documents")
    basic_rag_pipeline.connect("prompt_builder", "llm")

    return basic_rag_pipeline

Now, instantiate the pipeline using an InMemoryDocumentStore to keep things simple. Generate some embeddings based on the 7 wonders of the world dataset, and populate them into our document store. If you were running this code in production, you’d probably want to use an indexing pipeline to load the data into the store, but for demo purposes this approach reduces complexity.

document_store = InMemoryDocumentStore()
dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
embedder = SentenceTransformersDocumentEmbedder("sentence-transformers/all-MiniLM-L6-v2")
embedder.warm_up()
docs_with_embeddings = embedder.run([Document(**ds) for ds in dataset]).get("documents") or []  # type: ignore
document_store.write_documents(docs_with_embeddings)

Run the pipeline and ask it a question.

pipeline = get_pipeline(document_store)
question = "What does Rhodes Statue look like?"
response = pipeline.run({"text_embedder": {"text": question}, "prompt_builder": {"question": question}})

Setting the HAYSTACK_CONTENT_TRACING_ENABLED environment variable automatically traces every request that the pipeline runs. If all goes well you should receive something like the following output:

# {'tracer': {'name': 'Basic RAG Pipeline', 'trace_url': 'https://cloud.langfuse.com/trace/3d52b8cc-87b6-4977-8927-5e9f3ff5b1cb'}, 'llm': {'replies': ['The Rhodes Statue was described as being about 105 feet tall, with iron tie bars and brass plates forming the skin. It was built on a white marble pedestal near the Rhodes harbour entrance. The statue was filled with stone blocks as construction progressed.', 'The Rhodes Statue was described as being about 32 meters (105 feet) tall, built with iron tie bars, brass plates for skin, and filled with stone blocks. It stood on a 15-meter-high white marble pedestal near the Rhodes harbor entrance.'], 'meta': [{'model': 'gpt-3.5-turbo-0125', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 100, 'prompt_tokens': 453, 'total_tokens': 553}}, {'model': 'gpt-3.5-turbo-0125', 'index': 1, 'finish_reason': 'stop', 'usage': {'completion_tokens': 100, 'prompt_tokens': 453, 'total_tokens': 553}}]}}

Dumping tracing output in the terminal, is pretty cool, but the integration also sends the info to Langfuse. The Langfuse dashboard has a much more comprehensive and beautiful UI so you can make sense of your pipeline. Let’s hop over there and take a look.

Use Langfuse in a RAG pipeline with chat

Agent and chat use cases are rising in popularity. If you wanted to use the integration to trace a pipeline that includes a chat generator component, here’s an example of how to do so.

from haystack import Pipeline
from haystack.components.builders import DynamicChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.connectors.langfuse import LangfuseConnector

pipe = Pipeline()
pipe.add_component("tracer", LangfuseConnector("Chat example"))
pipe.add_component("prompt_builder", DynamicChatPromptBuilder())
pipe.add_component("llm", OpenAIChatGenerator(model="gpt-3.5-turbo"))

pipe.connect("prompt_builder.prompt", "llm.messages")
messages = [
    ChatMessage.from_system("Always respond in German even if some input data is in other languages."),
    ChatMessage.from_user("Tell me about {{location}}"),
]

response = pipe.run(
    data={"prompt_builder": {"template_variables": {"location": "Berlin"}, "prompt_source": messages}}
)
print(response["llm"]["replies"][0])
print(response["tracer"]["trace_url"])
# ChatMessage(content='Berlin ist die Hauptstadt von Deutschland und zugleich eines der bekanntesten kulturellen Zentren Europas. Die Stadt hat eine faszinierende Geschichte, die bis in die Zeiten des Zweiten Weltkriegs und des Kalten Krieges zurückreicht. Heute ist Berlin für seine vielfältige Kunst- und Musikszene, seine historischen Stätten wie das Brandenburger Tor und die Berliner Mauer sowie seine lebendige Street-Food-Kultur bekannt. Berlin ist auch für seine grünen Parks und Seen beliebt, die den Bewohnern und Besuchern Raum für Erholung bieten.', role=, name=None, meta={'model': 'gpt-3.5-turbo-0125', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 137, 'prompt_tokens': 29, 'total_tokens': 166}})
# https://cloud.langfuse.com/trace/YOUR_UNIQUE_IDENTIFYING_STRING

Explore the Langfuse dashboard

Once you’ve run these code samples, head over to the Langfuse dashboard to see and interact with traces. As of the time of this writing, the demo is free to try.

Trace Detail

Trace details show cost and latency for a specific end-to-end request. This data is helpful for estinating usage and cost of a RAG application in production. For example, here is the trace detail for the text embedder step of the pipeline we just ran. For a comprehensive explanation of LLM tracing, see the Langfuse docs.

The right sidebar shows latency for every step of the pipeline, which helps to pinpoint performance bottlenecks.

Trace details that are tagged “generation” also show the monetary cost of the request.

Traces can also be segmented by user or by session, so you can more granularly understand the user’s journey.

Evaluation

Evaluation helps us understand the quality of the results the LLM application is returning to the end user. There are currently 4 ways to add scores into Langfuse:

Manual evaluation
User feedback
Model-based evaluation
Custom via SDKs/API

For the sake of time, this post will only cover manual evaluation, but see the Langfuse docs for comprehensive info on all the evaluation methods.

Clicking on a trace, you can manually add a score to note the quality of that specific request.

For this trace, the input shows us our prompt, interpolated with the actual context that was passed to the LLM. Cool!

Input:

    Given the following information, answer the question.
    Context:
    
        Within it, too, are to be seen large masses of rock, by the weight of which the artist steadied it while erecting it.[22][23]
Destruction of the remains[edit]
The ultimate fate of the remains of the statue is uncertain. Rhodes has two serious earthquakes per century, owing to its location on the seismically unstable Hellenic Arc. Pausanias tells us, writing ca. 174, how the city was so devastated by an earthquake that the Sibyl oracle foretelling its destruction was considered fulfilled.[24] This means the statue could not have survived for long if it was ever repaired. By the 4th century Rhodes was Christianized, meaning any further maintenance or rebuilding, if there ever was any before, on an ancient pagan statue is unlikely. The metal would have likely been used for coins and maybe also tools by the time of the Arab wars, especially during earlier conflicts such as the Sassanian wars.[9]
The onset of Islamic naval incursions against the Byzantine empire gave rise to a dramatic account of what became of the Colossus. 
    
        Construction[edit]
Timeline and map of the Seven Wonders of the Ancient World, including the Colossus of Rhodes
Construction began in 292 BC. Ancient accounts, which differ to some degree, describe the structure as being built with iron tie bars to which brass plates were fixed to form the skin. The interior of the structure, which stood on a 15-metre-high (49-foot) white marble pedestal near the Rhodes harbour entrance, was then filled with stone blocks as construction progressed.[14] Other sources place the Colossus on a breakwater in the harbour. According to most contemporary descriptions, the statue itself was about 70 cubits, or 32 metres (105 feet) tall.[15] Much of the iron and bronze was reforged from the various weapons Demetrius's army left behind, and the abandoned second siege tower may have been used for scaffolding around the lower levels during construction.

    
    Question: What does Rhodes Statue look like?
    Answer:

Output:
The Rhodes Statue was described as being about 105 feet tall, with iron tie bars and brass plates forming the skin. It was built on a white marble pedestal near the Rhodes harbour entrance. The statue was filled with stone blocks as construction progressed."
1: "The Rhodes Statue was described as being about 32 meters (105 feet) tall, built with iron tie bars, brass plates for skin, and filled with stone blocks. It stood on a 15-meter-high white marble pedestal near the Rhodes harbor entrance."
]

This seems like a decent quality response, based on the inputs and outputs. Click on the “Add score” button and give it a score of 1. The score is even editable, in case you make a mistake.

Now clicking on the “Scores” section, the score we added is visible. Over time, this data helps build a comprehensive picture of the quality of our LLM application.

Wrapping it up

If you’ve been following along, today you’ve learned:

How Langfuse can help give you better visibility into your Haystack pipelines, giving you confidence to ship into production
How to integrate Langfuse into Haystack RAG and chat pipelines
The basics of LLM tracing and evaluation with the Langfuse dashboard

For a small team, Langfuse ships new features with incredible velocity. We can’t wait to see what they build next. To stay in the loop for future updates, be sure to follow Langfuse and Haystack on Twitter. Thanks for reading!

Advanced Retrieval: Extract Metadata from Queries to Improve Retrieval

Mon, 13 May 2024 00:00:00 +0000

This is part one of the Advanced Use Cases series:

1️⃣ Extract Metadata from Queries to Improve Retrieval

2️⃣ Query Expansion

3️⃣ Query Decomposition

4️⃣ Automated Metadata Enrichment

In Retrieval-Augmented Generation (RAG) applications, the retrieval step, which provides relevant context to your large language model (LLM), is vital for generating high-quality responses. There are possible ways of improving retrieval and metadata filtering is one of the easiest ways. Metadata filtering, the approach of limiting the search space based on some concrete metadata, can really enhance the quality of the retrieved documents. Here are some advantages of using metadata filtering:

Relevance: Metadata filtering narrows down the information being retrieved. This ensures that the generated responses align with the specific query or topic.
Accuracy: Filtering based on metadata such as domain, source, date, or topic guarantees that the information used for generation is accurate and trustworthy. This is particularly important for applications where accuracy is paramount. For instance, if you need information about a specific year, using the year as a metadata filter will retrieve only pertinent data.
Efficiency: Eliminating irrelevant or low-quality information boosts the efficiency of your RAG application, reduces the amount of processing needed, and speeds up retrieval response times.

You have two options for applying the metadata filter: you can either specify it directly when running the pipeline or, you can extract it from the query itself. In this article, we’ll focus on extracting filters from a query to improve the quality of generated responses in RAG applications. Let’s get started.

Introduction to Metadata Filters

First things first, what is metadata? Metadata (or meta tag) is actually data about your data, used to categorize, sort, and filter information based on various attributes such as date, topic, source, or any other information that you find relevant. After incorporating meta information into your data, you can apply filters to queries used with Retrievers to limit the scope of your search based on this metadata and ensure that your answers come from a specific slice of your data.

Imagine that you have following Documents in your document store:

documents = [
    Document(
        content="Some text about revenue increase",
        meta={"year": 2022, "company": "Nvidia", "name":"A"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2023, "company": "Nvidia", "name":"B"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2022, "company": "BMW", "name":"C"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2023, "company": "BMW", "name":"D"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2022, "company": "Mercedes", "name":"E"}),
    Document(
        content="Some text about revenue increase",
        meta={"year": 2023, "company": "Mercedes", "name":"F"}),
]

When the query is “Causes of the revenue increase”, the retriever returns all documents as they all contain some information about revenue. However, the metadata filter below ensures that any returned document by the retriever has a value of 2022 in the year metadata field and either BMW or Mercedes in the company metadata field. So, only documents with name “C” and “E” are retrieved.

pipeline.run(
    data={
        "retriever":{
            "query": "Causes of the revenue increase",
            "filters": {
                "operators": "AND",
                "conditions": [
                    {"field": "meta.year", "operator": "==", "value": "2022"},
                    {"field": "meta.company", "operator": "in", "value": ["BMW", "Mercedes"]}
                ]
            }
        }
    }
)

In this example, we pass the filter explicitly, but sometimes, the query itself might contain information that can be used as a metadata filter during the querying process. In this case, we need to preprocess the query to extract filters before we use it with a retriever.

Extracting Metadata Filters from a Query

In LLM-based applications, queries are written in natural language. From time to time, they include valuable hints that can be used as metadata filters to improve the retrieval. We can extract these hints, formulate them as metadata filters and use them with the retriever alongside the query. For instance, when the query is “What was the revenue of Nvidia in 2022?”, we can extract 2022 as years and Nvidia as companies. Based on this information, formulated metadata filter to use with a retriever should look like:

"filters": {
    "operators": "AND",
    "conditions": [
        {"field": "meta.years", "operator": "==", "value": "2022"},
        {"field": "meta.companies", "operator": "==", "value": "Nvidia"}
    ]
}

Thankfully, LLMs are highly capable of extracting structured information from unstructured text. Let’s see step-by-step how we can implement a custom component that uses an LLM to extract keywords, phrases, or entities from the query and formulate the metadata filter.

Implementing `QueryMetadataExtractor`

🧑‍🍳 You can find and run all the code in our cookbook Extrating Metadata Filter from a Query

We start by creating a custom component, QueryMetadataExtractor, which takes query and metadata_fields as inputs and outputs filters. This component encapsulates a generative pipeline, made up of PromptBuilder and OpenAIGenerator. The pipeline instructs the LLM to extract keywords, phrases, or entities from a given query which can then be used as metadata filters. In the prompt, we include instructions to ensure the output format is in JSON and provide metadata_fields along with the query to ensure the correct entities are extracted from the query.

Once the pipeline is initialized in the init method of the component, we post-process the LLM output in the run method. This step ensures the extracted metadata is correctly formatted to be used as a metadata filter.

import json
from typing import Dict, List

from haystack import Pipeline, component
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator

@component()
class QueryMetadataExtractor:

    def __init__(self):
        prompt = """
        You are part of an information system that processes users queries.
        Given a user query you extract information from it that matches a given list of metadata fields.
        The information to be extracted from the query must match the semantics associated with the given metadata fields.
        The information that you extracted from the query will then be used as filters to narrow down the search space
        when querying an index.
        Just include the value of the extracted metadata without including the name of the metadata field.
        The extracted information in 'Extracted metadata' must be returned as a valid JSON structure.
        ###
        Example 1:
        Query: "What was the revenue of Nvidia in 2022?"
        Metadata fields: {"company", "year"}
        Extracted metadata fields: {"company": "nvidia", "year": 2022}
        ###
        Example 2:
        Query: "What were the most influential publications in 2023 regarding Alzheimer's disease?"
        Metadata fields: {"disease", "year"}
        Extracted metadata fields: {"disease": "Alzheimer", "year": 2023}
        ###
        Example 3:
        Query: "{{query}}"
        Metadata fields: "{{metadata_fields}}"
        Extracted metadata fields:
        """
        self.pipeline = Pipeline()
        self.pipeline.add_component(name="builder", instance=PromptBuilder(prompt))
        self.pipeline.add_component(name="llm", instance=OpenAIGenerator(model="gpt-3.5-turbo"))
        self.pipeline.connect("builder", "llm")

    @component.output_types(filters=Dict[str, str])
    def run(self, query: str, metadata_fields: List[str]):
        result = self.pipeline.run({'builder': {'query': query, 'metadata_fields': metadata_fields}})
        metadata = json.loads(result['llm']['replies'][0])

        # this can be done with specific data structures and in a more sophisticated way
        filters = []
        for key, value in metadata.items():
            field = f"meta.{key}"
            filters.append({f"field": field, "operator": "==", "value": value})

        return {"filters": {"operator": "AND", "conditions": filters}}

First, let’s test the QueryMetadataExtractor in isolation, passing a query and a list of metadata fields.

extractor = QueryMetadataExtractor()

query = "What were the most influential publications in 2022 regarding Parkinson's disease?"
metadata_fields = {"disease", "year"}

result = extractor.run(query, metadata_fields)
print(result)

The result should look like this:

{'filters': {'operator': 'AND',
  'conditions': [
    {'field': 'meta.disease', 'operator': '==', 'value': 'Alzheimers'},
    {'field': 'meta.year', 'operator': '==', 'value': 2023}
  ]}
}

Notice that the QueryMetadataExtractor has extracted the metadata fields from the query and returned them in a format that can be used as filters passed directly to a Retriever. By default, the QueryMetadataExtractor will use all metadata fields as conditions together with an AND operator.

Using `QueryMetadataExtractor` in a Pipeline

Now, let’s plug the QueryMetadataExtractor into a Pipeline with a Retriever connected to a DocumentStore to see how it works in practice.

We start by creating a InMemoryDocumentStore and adding some documents to it. We include info about “year” and “disease” in the “meta” field of each document.

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

documents = [
    Document(
        content="some publication about Alzheimer prevention research done over 2023 patients study",
        meta={"year": 2022, "disease": "Alzheimer", "author": "Michael Butter"}),
    Document(
        content="some text about investigation and treatment of Alzheimer disease",
        meta={"year": 2023, "disease": "Alzheimer", "author": "John Bread"}),
    Document(
        content="A study on the effectiveness of new therapies for Parkinson's disease",
        meta={"year": 2022, "disease": "Parkinson", "author": "Alice Smith"}
    ),
    Document(
        content="An overview of the latest research on the genetics of Parkinson's disease and its implications for treatment",
        meta={"year": 2023, "disease": "Parkinson", "author": "David Jones"}
    )
]

document_store = InMemoryDocumentStore(bm25_algorithm="BM25Plus")
document_store.write_documents(documents=documents, policy=DuplicatePolicy.OVERWRITE)

We then create a pipeline consisting of the QueryMetadataExtractor and a InMemoryBM25Retriever connected to the InMemoryDocumentStore created above.

Learn about connecting components and creating pipelines in Docs: Creating Pipelines.

from haystack import Pipeline, Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retrieval_pipeline = Pipeline()
metadata_extractor = QueryMetadataExtractor()
retriever = InMemoryBM25Retriever(document_store=document_store)

retrieval_pipeline.add_component(instance=metadata_extractor, name="metadata_extractor")
retrieval_pipeline.add_component(instance=retriever, name="retriever")
retrieval_pipeline.connect("metadata_extractor.filters", "retriever.filters")

Now define a query and metadata fields and pass them to the pipeline:

query = "publications 2023 Alzheimer's disease"
metadata_fields = {"year", "author", "disease"}

retrieval_pipeline.run(data={"metadata_extractor": {"query": query, "metadata_fields": metadata_fields}, "retriever":{"query": query}})

This returns only documents whose metadata field year = 2023 and disease = Alzheimer

{'documents': 
 [Document(
     id=e3b0bfd497a9f83397945583e77b293429eb5bdead5680cc8f58dd4337372aa3, 
     content: 'some text about investigation and treatment of Alzheimer disease', 
     meta: {'year': 2023, 'disease': 'Alzheimer', 'author': 'John Bread'}, 
     score: 2.772588722239781)]
     }

Conclusion

Metadata filtering stands out as a powerful technique for improving the relevance and accuracy of retrieved documents, thus enabling the generation of high-quality responses in RAG applications. Using the custom component QueryMetadataExtractor we implemented, we can extract filters from user queries and directly use them with Retrievers.

This article was part one of the Advanced Use Cases series. If you want to stay on top of the latest Haystack developments, you can subscribe to our newsletter or join our Discord community 💙

Chatting with SQL Databases 3 Ways

Mon, 22 Apr 2024 00:00:00 +0000

Let’s talk about how we build AI applications that can interact with, even chat to SQL databases. Heads-up that this is a mini project that I tried out, and it’s objectively a simple approach to interacting with SQL with natural language. I’m certain that this is not a golden bullet that works with all types of SQL tables etc. But, nonetheless, it’s cool, it works, and you can try it along with me.

My aim setting off with this project was to be able to ask questions in natural language, and get back an answer based on the contents in a SQL database. For example, assuming we have a database that stores information on employees and their time spent at work; “How many people spend more than 8 hours at work a day?”, “A total of 20 employees spend more than 8 hours at work a day.”.

A few struggles I knew I might get with a system based on a generative LLM:

What if the question is unrelated to the contents in the SQL database?
I can get the LLM to maybe produce a SQL query, but how do I actually query the database?
What if the SQL query is just wrong? 🙂

So, here are 4 things I did and that I’m showing you here. Not all solve all of the problems above, but they’re a good starting point. And as a heads up, I used Haystack for this project, so the following walk-through has some Haystack terminology.

I created a component that can query a SQL database given a query. I used SQLite for this.
I created a pipeline using an LLM that can ingest a question, produce a SQL query based on the structure of my database, and query it
Next, to resolve the problem of ‘what if the query is unrelated’, I “upgraded” my pipeline to use conditional routing.
Finally, so as to have a proper chat experience, I changed tactics and defined everything as a function and used the function-calling capabilities of the latest LLMs. I then wrapped that into a mini Gradio app that you can run from within the Cookbook I’ve provided.

Before we dive in, a quick note on the data at hand. I won’t show all the code to get that into the database here because, frankly, it’s the most boring step… A few bullet points:

I used a CSV file called ‘Absenteeism_at_work’ that the JupySQL team uses to show their examples. Very handy! I took inspiration from the first demo we created with them and wrote this table to my SQLite database.
The resulting table has the following columns:

ID;Reason_for_absence;Month_of_absence;Day_of_the_week;Seasons;Transportation_expense;Distance_from_Residence_to_Work;Service_time;Age;Work_load_Average_day_;Hit_target;Disciplinary_failure;Education;Son;Social_drinker;Social_smoker;Pet;Weight;Height;Body_mass_index;Absenteeism_time_in_hours

I’ve populated a database called absenteeism.db that represents this table, which we will be using here on out.

🧑‍🍳 You can see and run all the code required for the SQL database setup in the provided Cookbook

Part 1: Query SQL with Natural Language

Part 2: Skip Unrelated Questions with Conditional Routing

Part 3: Build a Chat App with Function Calling (& Gradio)

Part 1: Query SQL with Natural Language

First, let’s tackle the easiest of the lot. Let’s create a pipeline that:

Accepts a question
Creates a SQL query
Actually queries our DB.

To this end, here’s something we’re going to need. A SQLQuery component which Haystack does not have. But Haystack does provide a uniform interface for component creation. And here’s what mine looks like:

import sqlite3
from typing import List
from haystack import component

@component
class SQLQuery:

    def __init__(self, sql_database: str):
      self.connection = sqlite3.connect(sql_database, check_same_thread=False)

    @component.output_types(results=List[str], queries=List[str])
    def run(self, queries: List[str]):
        results = []
        for query in queries:
          result = pd.read_sql(query, self.connection)
          results.append(f"{result}")
        return {"results": results, "queries": queries}

Now I have an SQLQuery component, I can initialize it with sql_query = SQLQuery('absenteeism.db).

To test it, I can run the following SQL query with this component:

result = sql_query.run(queries=['SELECT Age, SUM(Absenteeism_time_in_hours) as Total_Absenteeism_Hours FROM absenteeism WHERE Disciplinary_failure = 0 GROUP BY Age ORDER BY Total_Absenteeism_Hours DESC LIMIT 3;'])
print(result["results"][0])

Resulting in:

   Age  Total_Absenteeism_Hours
0   28                      651
1   33                      538
2   38                      482

Build the SQL Query Pipeline

Now that we have a component that can ingest SQL queries and query our desired database, let’s add it as a step in a full AI application. We will build a Haystack pipeline that:

Uses a prompt that instructs an LLM (as an example here, we’re using GPT-4, if you want to stick to this one you’ll have to provide your API key as the OPENAI_API_KEY environment variable) to generate a SQL query based on the columns present in our database and the natural language question that we provide
Forwards the generated SQL query to our newly creates SQLQuery component

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator

prompt = PromptBuilder(template="""Please generate an SQL query. The query should answer the following Question: {{question}};
            The query is to be answered for the table is called 'absenteeism' with the following
            Columns: {{columns}};
            Answer:""")
            
sql_query = SQLQuery('absenteeism.db')
llm = OpenAIGenerator(model="gpt-4")

sql_pipeline = Pipeline()
sql_pipeline.add_component("prompt", prompt)
sql_pipeline.add_component("llm", llm)
sql_pipeline.add_component("sql_querier", sql_query)

sql_pipeline.connect("prompt", "llm")
sql_pipeline.connect("llm.replies", "sql_querier.queries")

The resulting sql_pipeline looks like this (produces with sql_pipeline.show() in the cookbook) 👇

Let’s try running it, and here’s what we get 👇

result = sql_pipeline.run({"prompt": {"question": "On which days of the week does the average absenteeism time exceed 4 hours?",
                            "columns": columns}})

print(result["sql_querier"]["results"][0])
#   Day_of_the_week
#0                2
#1                3
#2                4
#3                5
#4                6

Part 2: Skip Unrelated Questions with Conditional Routing

Next, let’s add a simple improvement. What if the provided question is unrelated? It’s better if we skip it, rather than querying the database with a SQL query that’s going to result in an error.

For this step, I used the [ConditionalRouter](https://docs.haystack.deepset.ai/docs/conditionalrouter).

To start with, let’s have an LLM reply with no_answer if the provided question cannot be answered with the SQL table we have:

from haystack.components.builders import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator

prompt = PromptBuilder(template="""Please generate an SQL query. The query should answer the following Question: {{question}};
            If the question cannot be answered given the provided table and columns, return 'no_answer'
            The query is to be answered for the table is called 'absenteeism' with the following
            Columns: {{columns}};
            Answer:""")

llm = OpenAIGenerator(model="gpt-4")

Next, let’s define some routes, and our ConditionalRouter. What we want to accomplish here is:

If the reply from our LLM is no_answer we want to go down the go_to_fallback branch of our pipeline
If there’s no no_answer, we want to go down the sql route.

from haystack.components.routers import ConditionalRouter

routes = [
     {
        "condition": "{{'no_answer' not in replies[0]}}",
        "output": "{{replies}}",
        "output_name": "sql",
        "output_type": List[str],
    },
    {
        "condition": "{{'no_answer' in replies[0]}}",
        "output": "{{question}}",
        "output_name": "go_to_fallback",
        "output_type": str,
    },
]

router = ConditionalRouter(routes)

Finally, let’s decide on what to do if we want to go down the fallback route. For this example, I’ve simply created a fallback_llm that is going to tell the user why the question cannot be answered:

fallback_prompt = PromptBuilder(template="""User entered a query that cannot be answerwed with the given table.
                                            The query was: {{question}} and the table had columns: {{columns}}.
                                            Let the user know why the question cannot be answered""")
fallback_llm = OpenAIGenerator(model="gpt-4")

When we add all of these, and our sql_query component into a pipeline, it looks like this 👇

from haystack import Pipeline

conditional_sql_pipeline = Pipeline()
conditional_sql_pipeline.add_component("prompt", prompt)
conditional_sql_pipeline.add_component("llm", llm)
conditional_sql_pipeline.add_component("router", router)
conditional_sql_pipeline.add_component("fallback_prompt", fallback_prompt)
conditional_sql_pipeline.add_component("fallback_llm", fallback_llm)
conditional_sql_pipeline.add_component("sql_querier", sql_query)

conditional_sql_pipeline.connect("prompt", "llm")
conditional_sql_pipeline.connect("llm.replies", "router.replies")
conditional_sql_pipeline.connect("router.sql", "sql_querier.queries")
conditional_sql_pipeline.connect("router.go_to_fallback", "fallback_prompt.question")
conditional_sql_pipeline.connect("fallback_prompt", "fallback_llm")

If I ask the question “On which days of the week does the average absenteeism time exceed 4 hours?” to this pipeline I get:

question = "On which days of the week does the average absenteeism time exceed 4 hours?"
result = conditional_sql_pipeline.run({"prompt": {"question": question,
                                                  "columns": columns},
                                       "router": {"question": question},
                                       "fallback_prompt": {"columns": columns}})
if 'sql_querier' in result:
  print(result['sql_querier']['results'][0])
elif 'fallback_llm' in result:
  print(result['fallback_llm']['replies'][0])
   
#   Day_of_the_week
#0                2
#1                3
#2                4
#3                5
#4                6

But if I ask “When is my birthday?” I get the following reply:

The query cannot be answered as the provided table does not contain information regarding the user’s personal data such as birthdays. The table primarily focuses on absence-related data for presumably work or similar situations. Please provide the relevant data to get the accurate answer.

Part 3: Build a Chat App with Function Calling (& Gradio)

The last thing I tried was to use function calling instead of conditional routing. The aim here is to have a human-like discussion with our SQL database. By doing this, we allow the LLM to decide whether the tool (our function) should be selected to resolve a question. The resulting system has the following flow:

User asks question
LLM decides whether any of the tools (functions) provided to it can resolve the question
The LLM generates the inputs for the tool it has selected to use. These inputs should be what the tool needs to actually run
We run the tool with the generated inputs.
The LLM then generates a human-like answer based on the returned result from the tool.

I already have a SQLQuery component that is pretty straightforward, so here’s how I wrapped that into a function:

sql_query = SQLQuery('absenteeism.db')

def sql_query_func(queries: List[str]):
    try:
      result = sql_query.run(queries)
      return {"reply": result["results"][0]}

    except Exception as e:
      reply = f"""There was an error running the SQL Query = {queries}
              The error is {e},
              You should probably try again.
              """
      return {"reply": reply}

Since I’m using GPT-4 as my demo LLM here, I’ll be providing this function as a tool:

tools = [
    {
        "type": "function",
        "function": {
            "name": "sql_query_func",
            "description": f"This a tool useful to query a SQL table called 'absenteeism' with the following Columns: {columns}",
            "parameters": {
                "type": "object",
                "properties": {
                    "queries": {
                        "type": "array",
                        "description": "The query to use in the search. Infer this from the user's message. It should be a question or a statement",
                        "items": {
                            "type": "string",
                        }
                    }
                },
                "required": ["question"],
            },
        },
    }
]

Once I had this in place, I simply added this as a tool to an OpenAIChatGenerator and followed it up by creating a mini app that keeps track of conversation history. I used Gradio (thank you for the example in the Building a Chat Application Tutorial Bilge!!) to wrap this into a demo app.

See the cookbook for the full code example

demo = gr.ChatInterface(
    fn=chatbot_with_fc,
    examples=[
        "Find the top 3 ages with the highest total absenteeism hours, excluding disciplinary failures",
        "On which days of the week does the average absenteeism time exceed 4 hours?",
        "Who lives in London?",
    ],
    title="Chat with your SQL Database",
)
demo.launch()

Wrapping it up: Chatting with SQL and Haystack three ways

In this post, you’ve learned a few approaches to chat with your SQL databases using Haystack 👇

Building a Haystack custom component to serve as a SQL query interface, and using that component in a pipeline.
Improving the pipeline by skipping irrelevant queries with conditional routing.
Using function calling, tools, and Gradio to build a full stack text to SQL app 🚀

Thanks for following along!

If you’re hungry to learn more about SQL + LLMs, check out our post on Using Generative AI to Query Large BI Tables

Level up Your RAG Application with Speaker Diarization

Thu, 21 Mar 2024 00:00:00 +0000

LLMs work wonders on text data. Using LLMs, you can get answers to complex questions on long documents without having to read the document or even do a CTRL+F search. But what if you work with audio or video recordings?

The easiest way is to provide the LLM with the transcription of the recording. That way you can capture everything that is uttered in the audio or video. But what about information that was not spoken?

On audio or video recordings with multiple speakers, only transcribing the recording will not provide all there is to know to the LLM. The data on how many people are speaking, what each of them says is lost in written format.

So is it not possible to use LLMs to their full potential with multi-speaker recordings? Of course it is with Speaker Diarization!

Introduction to Speaker Diarization

With the assemblyai-haystack Python package by AssemblyAI, this would take you only 10 minutes to build. The Speaker Diarization model lets you detect multiple speakers in an audio file and what each speaker said. Thanks to the built-in speaker diarization feature, you can get the transcript of your audio or video recording in plain format and also speaker annotated format where each utterance in the recording is attributed to a speaker. Here is what it looks like:

…

Speaker B : Good morning.

Speaker A: So what is it about the conditions right now that have caused this round of wildfires to affect so many people so far away?

Speaker B: Well, there’s a couple of things. The season has been pretty dry already, and then the fact that we’re getting hit in the US. Is because there’s a couple of weather systems that are essentially channeling the smoke from those Canadian wildfires through Pennsylvania into the Mid Atlantic and the Northeast and kind of just dropping the smoke there.

Speaker A: So what is it in this haze that makes it harmful? And I’m assuming it is.

…

Let’s first see the benefits of passing the speaker annotated transcription to the LLM with examples. And later, we will go through the code that you can use to build with speaker labeled transcription.

For the comparison, we ran two RAG applications, one using the transcription of a recording and another speaker-labeled transcription to compare their responses. We’ll call them transcription application and speaker application from now on.

Here are the YouTube videos used in the examples:

Panel discussion on AI example audio: https://www.youtube.com/watch?v=z86LGEFyQpo

Netflix earnings call: https://www.youtube.com/watch?v=v5EjcFtq_bc

Working from home debate: https://youtu.be/w8aOZ8rBZ-E?si=pVbwDZoDLpBEcPlc

#1: Using the speaker labels, we can get each person’s opinion separately

In this panel discussion, the speakers are talking about AI technology and how they leverage it in their own organizations. Around the midpoint of this discussion, they answer the moderator’s question on what they think about building AI in-house vs. using third party providers.

The transcription application can only comment on the general thoughts that were conveyed during the panel. It cannot make the distinction between different speakers’ opinions. Whereas the speaker application reports on the opinion of each speaker.

Audio file: Panel 👇

Question: “What are each speakers’ opinions on building in-house or using third parties?”

Transcription application	Speaker application
“The speakers believe that there is no one-size-fits-all approach to building in-house or using third parties. They take a mixed approach, considering factors such as core IP, feasibility, and value. They emphasize the importance of being clear about how to use time and resources, and of collaboration and partnerships. They also acknowledge that there may be friction between different teams, but this can be mitigated through change management.”	“Speaker A is interested in understanding how companies decide between building in-house solutions or using third parties. Speaker C believes that the decision depends on whether the component is part of the company’s core IP and that a combination of market standards and in-house solutions is often used. Speaker D takes a mixed approach, using open source and partnering, and emphasizes the importance of embedding AI into the business. Speaker C thinks that the decision should be based on feasibility and value.”

Transcription application

Speaker application

“The speakers believe that there is no one-size-fits-all approach to building in-house or using third parties. They take a mixed approach, considering factors such as core IP, feasibility, and value. They emphasize the importance of being clear about how to use time and resources, and of collaboration and partnerships. They also acknowledge that there may be friction between different teams, but this can be mitigated through change management.”

“Speaker A is interested in understanding how companies decide between building in-house solutions or using third parties. Speaker C believes that the decision depends on whether the component is part of the company’s core IP and that a combination of market standards and in-house solutions is often used. Speaker D takes a mixed approach, using open source and partnering, and emphasizes the importance of embedding AI into the business. Speaker C thinks that the decision should be based on feasibility and value.”

In this other recording, there is a quick exchange of opinions on the matter of working from home versus returning to the office. The presenters in the studio are of the opinion that working from home is just as productive as working in an office but the guest on the phone line disagrees with them.

The transcription application is able to identify opposing views yet can only name the people holding an opinion if they are explicitly named in the recording. By using speaker labels, you can clearly identify all the views in a discussion as well as the speaker who voiced it.

Audio file: Working from home debate 👇

Question: “What are the two opposing opinions and how many people are on each side?”

Transcription application	Speaker application
The opposing opinions in the transcription are between those who believe that returning to the office full-time is necessary for collaboration and productivity, and those who believe that remote work can be just as effective and offers more flexibility. The first side, represented by Jim, is skeptical about returning to the office full-time and is concerned about the environmental impact of commuting. The second side, represented by the speaker, acknowledges the importance of collaboration but believes that a hybrid approach of remote and in-office work can be beneficial.	The opposing opinions are about the necessity of returning to the office and the balance of power between employees and managers. Speaker A and Speaker C seem to be on the side of flexibility and remote work, while Speaker B is more open to the idea of returning to the office but acknowledges the need for flexibility.

Transcription application

Speaker application

The opposing opinions in the transcription are between those who believe that returning to the office full-time is necessary for collaboration and productivity, and those who believe that remote work can be just as effective and offers more flexibility. The first side, represented by Jim, is skeptical about returning to the office full-time and is concerned about the environmental impact of commuting. The second side, represented by the speaker, acknowledges the importance of collaboration but believes that a hybrid approach of remote and in-office work can be beneficial.

The opposing opinions are about the necessity of returning to the office and the balance of power between employees and managers. Speaker A and Speaker C seem to be on the side of flexibility and remote work, while Speaker B is more open to the idea of returning to the office but acknowledges the need for flexibility.

#2: Using the speaker labels, we can extract practical information from the file

One of the important things you might want to analyze in recordings with multiple speakers is how many speakers there are or what their role is in this recording.

Unless it is explicitly mentioned in the recording, the transcription application cannot tell the number of speakers. It can, at best, identify that there are multiple people in this recording:

Audio file: Panel 👇

Question: “How many people are speaking in this recording?”

Transcription application	Speaker application
There are multiple people speaking in this recording, but the exact number is not provided in the transcription.	There are three people speaking in this recording: Speaker A, Speaker B, and Speaker D.

Audio file: Netflix 👇

Question: “How many speakers and moderators are in this call?”

Transcription application	Speaker application
There are four speakers and one moderator on this call.	There are three speakers in this call: Ted Sorandos, Greg Peters, and Spence Newman. Spencer Wong is the moderator.

Audio file: Working from home debate 👇

Question: “How many people are speaking in this recording?”

Transcription application	Speaker application
The transcription does not provide enough information to determine the exact number of people speaking in this recording.	There are three people speaking in this recording.

Code explanation

Let’s now see how to build an application with speaker labels using Haystack and AssemblyAI. In this Colab notebook, you will find the code if you’d like to get started quickly.

Haystack is an open-source Python framework for building production-ready LLM applications. It provides the tools needed to implement an end-to-end application and comes with the AssemblyAI integration that allows you to incorporate transcription, summarization, and speaker labeling for audio data into your pipeline. For more, check out Haystack documentation.

We will use haystack as well as assemblyai-haystack package for this application. Install all required packages using pip.

pip install haystack
pip install assemblyai-haystack
pip install "sentence-transformers>=2.2.0"

Next, we will set up the pipeline with the AssemblyAITranscriber to generate the speaker-labeled transcription of any audio file we pass.

Check out the documentation to find out more about the assemblyai-haystack Python package.

To use this piece of code, make sure you create an AssemblyAI account and get your free API key. You need an access token from Hugging Face to access the free Inference API as well.

First, initialize the required Haystack components to index documents with embeddings:

InMemoryDocumentStore: to store your documents without external dependencies or extra setup
AssemblyAITranscriber: to create speaker_labels for the given audio file and convert them into documents
DocumentSplitter: to split your documents into smaller chunks
SentenceTransformersDocumentEmbedder: to create embeddings for each document using sentence-transformers models
DocumentWriter: to write these documents into your document store

After creating all these components, initialize a Pipeline object, add these components to your pipeline, and connect() the components by indicating which component should be connected to which component next.

from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from assemblyai_haystack.transcriber import AssemblyAITranscriber
from haystack.document_stores.types import DuplicatePolicy
from haystack.utils import ComponentDevice

speaker_document_store = InMemoryDocumentStore()
transcriber = AssemblyAITranscriber(api_key=ASSEMBLYAI_API_KEY)
speaker_splitter = DocumentSplitter(
    split_by = "sentence",
    split_length = 10,
    split_overlap = 1
)
speaker_embedder = SentenceTransformersDocumentEmbedder(device=ComponentDevice.from_str("cuda:0"))
speaker_writer = DocumentWriter(speaker_document_store, policy=DuplicatePolicy.SKIP)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=transcriber, name="transcriber")
indexing_pipeline.add_component(instance=speaker_splitter, name="speaker_splitter")
indexing_pipeline.add_component(instance=speaker_embedder, name="speaker_embedder")
indexing_pipeline.add_component(instance=speaker_writer, name="speaker_writer")

indexing_pipeline.connect("transcriber.speaker_labels", "speaker_splitter")
indexing_pipeline.connect("speaker_splitter", "speaker_embedder")
indexing_pipeline.connect("speaker_embedder", "speaker_writer")

And then we’re ready to run this pipeline with any audio file. Make sure to set speaker_labels to True. It is also possible to get a summary of the contents of the audio file but we will not use it in this tutorial. We can set it to None or False.

indexing_pipeline.run(
    {
        "transcriber": {
            "file_path": "/content/Netflix_Q4_2023_Earnings_Interview.mp3",
            "summarization": None,
            "speaker_labels": True
        },
    }
)

Next, it is time to set up the retrieval augmentation (RAG) pipeline for speaker labels. For a RAG pipeline, we need:

SentenceTransformersTextEmbedder: To create an embedding for the user query using sentence-transformers models
InMemoryEmbeddingRetriever: to retrieve top_k relevant documents to the user query
PromptBuilder: to provide a RAG prompt template with instructions to be filled with retrieved documents and the user query
HuggingFaceAPIGenerator: to infer models served through Hugging Face free Serverless Inference API or Hugging Face TGI

from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import ComponentDevice

open_chat_prompt = """
GPT4 Correct User: You will be provided with a transcription of a recording with each sentence or group of sentences attributed to a Speaker by the word "Speaker" followed by a letter representing the person uttering that sentence. Answer the given question based on the given context.
If you think that given transcription is not enough to answer the question, say so.

Transcription:
{% for doc in documents %}
  {% if doc.meta["speaker"] %} Speaker {{doc.meta["speaker"]}}: {% endif %}{{doc.content}}
{% endfor %}
Question: {{ question }}
<|end_of_turn|>
GPT4 Correct Assistant:
"""

retriever = InMemoryEmbeddingRetriever(speaker_document_store)
text_embedder = SentenceTransformersTextEmbedder(device=ComponentDevice.from_str("cuda:0"))
answer_generator = HuggingFaceAPIGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "openchat/openchat-3.5-0106"},
    generation_kwargs={"max_new_tokens":500})
prompt_builder = PromptBuilder(template=open_chat_prompt)

After creating all these components, as you did for the first pipeline, initialize a Pipeline object, add these components to your pipeline, and connect() the components by indicating which component should be connected to which component next. As you create the connections, sometimes, you need to be more explicit about the output and input names. Explicitly connect the “documents” output of retriever with “documents” input of the prompt_builder to make the connection obvious as prompt_builder has two inputs (“documents” and “question” variables we defined in the prompt template).

from haystack import Pipeline

speaker_rag_pipe = Pipeline()
speaker_rag_pipe.add_component("text_embedder", text_embedder)
speaker_rag_pipe.add_component("retriever", retriever)
speaker_rag_pipe.add_component("prompt_builder", prompt_builder)
speaker_rag_pipe.add_component("llm", answer_generator)

speaker_rag_pipe.connect("text_embedder.embedding", "retriever.query_embedding")
speaker_rag_pipe.connect("retriever.documents", "prompt_builder.documents")
speaker_rag_pipe.connect("prompt_builder.prompt", "llm.prompt")

Once it’s done, you can use the following code to test the results on your own examples. Change the top_k value based on how many relevant documents you’d like to provide to your LLM:

question = "Who are the speakers in this recording?"

result = speaker_rag_pipe.run({
    "prompt_builder":{"question": question},
    "text_embedder":{"text": question},
    "retriever":{"top_k": 10}
})
result["llm"]["replies"][0]

Conclusion

Thanks for reading! By combining the transcription capabilities of AssemblyAI with the power of Haystack, you can enhance your RAG systems with speaker labels, ensuring a more comprehensive and accurate understanding of the content.

If you want to stay on top of the latest Haystack developments, you can subscribe to our newsletter or join our Discord community. Don’t forget to get your free API key from AssemblyAI and subscribe to AssemblyAI’s YouTube channel for weekly videos and tutorials on the latest developments in the AI world.

Hosted or self-hosted RAG? Full flexibility with NVIDIA NIM integration in Haystack

Mon, 18 Mar 2024 00:00:00 +0000

Iteration speed across design options matters

Retrieval-augmented generation (RAG) is one of the most common architectures today for building LLM applications. With RAG you can very quickly build a prototype that is tailored to your use case by connecting the LLM to the relevant data sources.

Once you have your first prototype up and running, you typically iterate a lot on your pipeline design before you go live in production: switching embedding models or generative LLMs, adding rerankers, or leveraging the metadata of your documents.

We often see our users iterate fifty or more times before settling on the design that meets their requirements. Besides optimizing the answer quality, users often have additional requirements like costs, latency or data privacy that determines their target architecture and deployment strategy for going to production.

Two crucial features are needed to support rapid prototyping during RAG development: the flexibility to experiment with many different models and features, and the ability to quickly switch between them.

Both are at the heart of our design for Haystack 2.0. New integrations with microservices from the NVIDIA AI Enterprise software platform, including NVIDIA NIM and NeMo Retriever, now allow you to iterate even faster across your RAG development cycle and give you full flexibility with deployment options: - Visit NVIDIA API Catalog to quickly experiment with the latest GPU-accelerated generative AI models on NVIDIA-hosted API endpoints. And then switch easily to self-hosted NVIDIA NIM microservice containers that can run in your own VPC or self-hosted cloud. Learn more about taking RAG applications from pilot to production in this NVIDIA blog.

What is NVIDIA NIM?

NVIDIA NIM is a set of easy-to-use microservices designed to speed up generative AI deployment in enterprises. Supporting a wide range of leading community and proprietary AI models, hosted on NVIDIA API catalog, NIM ensures seamless, scalable AI inferencing, whether on-premises or in the cloud, leveraging industry-standard APIs.

To get started, developers can visit the NVIDIA API catalog to interact with GPU-accelerated generative AI models directly from a web browser. The API catalog interface generates application code for every interaction. Developers can paste this code into their own applications to interact with the model endpoints programmatically. When ready to deploy, NVIDIA AI Enterprise subscribers can export the model to a downloadable NVIDIA NIM container and run it anywhere. This gives developers ownership to their customizations and full control of their IP and AI application.

Use NVIDIA AI with Haystack

Now we’ll walk through the nitty-gritty details of how to build a RAG pipeline with Haystack and hosted NVIDIA APIs. You will build two Haystack pipelines: an indexing pipeline to create and store documents, and a RAG pipeline to query those documents.

In order for this code to work, you will need an NVIDIA API key. Set it as an environment variable, NVIDIA_API_KEY.

First: install the Haystack NVIDIA connector:

pip install nvidia-haystack

Next, build an indexing pipeline. This example uses one of NVIDIA’s embedding models to turn documents into vectors, and adds them to the document_store:

from haystack_integrations.components.generators.nvidia import NvidiaGenerator
from haystack_integrations.components.embedders.nvidia import NvidiaEmbeddingModel, NvidiaDocumentEmbedder

from haystack import Pipeline
from haystack.dataclasses import Document
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

documents = [
    Document(content="My name is Jean and I live in Paris."),
    Document(content="My name is Mark and I live in Berlin."),
    Document(content="My name is Giorgio and I live in Rome.")]

document_store = InMemoryDocumentStore()

document_embedder = NvidiaDocumentEmbedder(model="nvolveqa_40k")
writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=document_embedder, name="document_embedder")
indexing_pipeline.add_component(instance=writer, name="writer")

indexing_pipeline.connect("document_embedder.documents", "writer.documents")
indexing_pipeline.run(data={"document_embedder":{"documents": documents}})

# Calling filter with no arguments will print the contents of the document store
document_store.filter_documents({})

Create a RAG pipeline to query the data.

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.nvidia import NvidiaGenerator
from haystack_integrations.components.embedders.nvidia import NvidiaEmbeddingModel, NvidiaTextEmbedder

prompt = """ Answer the query, based on the
content in the documents.
If you can't answer based on the given documents, say so.


Documents:
{% for doc in documents %}
 {{doc.content}}
{% endfor %}


Query: {{query}}
"""

text_embedder = NvidiaTextEmbedder(model="playground_nvolveqa_40k")
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
prompt_builder = PromptBuilder(template=prompt)
generator = NvidiaGenerator(model="playground_nv_llama2_rlhf_70b")

rag_pipeline = Pipeline()

rag_pipeline.add_component(instance=text_embedder, name="text_embedder")
rag_pipeline.add_component(instance=retriever, name="retriever")
rag_pipeline.add_component(instance=prompt_builder, name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="generator")

rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")

question = "Who lives in Rome?"
result = rag_pipeline.run(data={"text_embedder":{"text": question},
                           	                                      "prompt_builder":{"query": question}})
print(result)
# {'text_embedder': {'meta': {'usage': {'prompt_tokens': 10, 'total_tokens': 10}}}, 'generator': {'replies': [Giorgio], 'meta': [{'role': 'assistant', 'finish_reason': 'stop'}], 'usage': {'completion_tokens': 3, 'prompt_tokens': 101, 'total_tokens': 104}}}

The pipeline example above uses the API catalog endpoint for LLM inference and embedding. Switching from the API catalog endpoint to a self-hosted NIM microservice only requires the small addition of an api_url init parameter to the embedder and generator.

document_embedder = NvidiaDocumentEmbedder(model="nvolveqa_40k", api_url="")

text_embedder = NvidiaTextEmbedder(model="nvolveqa_40k", api_url="")

generator = NvidiaGenerator(model="mixtral_8x7b", api_url="")


generator.warm_up()

result = generator.run(prompt="When was the Golden Gate Bridge built?")
# The Golden Gate Bridge was built in 1937 and was completed and opened to the public on May 28, 1937...

Using the NVIDIA Haystack connector, Deepset customers can apply the benefits of GPU-acceleration to their RAG applications quickly and easily.

Wrapping it up

Want to learn more about Haystack 2.0? Get started with our tutorials, cookbooks, and blog posts.
Have more questions? Join our community on Discord or sign up for our monthly newsletter.
Want to learn more about NVIDIA NIM and how to run the containers? Be on the lookout for a more in-depth technical blog about how to use NIM with Haystack.

Haystack 2.0: The Composable Open-Source LLM Framework

Mon, 11 Mar 2024 00:00:00 +0000

Today we are happy to announce the stable release of Haystack 2.0 - we’ve been working on this for a while, and some of you have already been testing the beta since its first release in December 2023.

Haystack is an open-source Python framework for building production-ready LLM applications, with integrations to almost all major model providers and databases.

At its core, Haystack 2.0 is a major rework of the previous version with a very clear goal in mind: making it possible to implement composable AI systems that are easy to use, customize, extend, optimise, evaluate and ultimately deploy to production.

We encourage you to start using Haystack 2.0 as of today, whether you’ve been a Haystack user before or not. You can get started by installing haystack-ai, our new package for Haystack 2.0

⭐️ To get started:

pip install haystack-ai and follow the get started instructions to build your first LLM app with just a few lines of code.

If you’re already using Haystack 1.0 in production, don’t worry! If your applications depend on farm-haystack and you’re not ready to migrate just yet, you don’t have to take any action: we will keep supporting Haystack 1.0, releasing security updates and critical bug fixes, giving everybody enough time to migrate. In the coming weeks, we will also start sharing some migration guides to help you along the way.

Why Haystack 2.0?

Haystack was first officially released in 2020, in the good old days when the forefront of NLP was semantic search, retrieval, and extractive question-answering. During this time, we established the core of what makes Haystack Haystack: Components and Pipelines. These allowed users to build end-to-end applications by combining their desired language models (embedding, extractive QA, ranking) with their database of choice.

The boom of LLMs in 2023 made two things clear:

👍 The pipeline-component structure is a great abstraction for building composable LLM applications with many moving parts.
👎 Haystack 1.0 often assumed that you would be doing retrieval and extractive QA over a set of documents, imposing limitations and providing a developer experience far from ideal when building LLM applications.

So, we decided that the best thing we could do for Haystack and our community was to rewrite the component and pipeline architecture to keep up with the fast-paced AI industry. While Haystack 2.0 is a complete rewrite, the underlying principle of composing components into flexible pipelines remains the same.

With that, let’s take a look at the pillars of Haystack 2.0:

Composable and customizable pipelines
A common interface for storing data
A clear path to production
Optimization and Evaluation for Retrieval Augmentation

Composable and customizable Pipelines

Modern LLM applications comprise many moving parts: retrievers, rankers, LLMs, and many more such as entity extractors, summarizers, format converters and data cleaners. Each one of these ‘subtasks’ is a component in Haystack.

With the first version of Haystack we proved that pipelines are a good abstraction for connecting all those moving parts, but some of the assumptions we made in Haystack 1.0 dated back to a pre-LLM era and needed rethinking.

One important limitation in Haystack 1.0 is that loops are not allowed, and the pipeline graph has to be acyclic. This makes it difficult to implement, for example, agents, which are often designed with a reasoning flow that loops until a task is resolved.

In Haystack 2.0 the pipeline graph can have cycles. Combined with decision components (think about if-then-else clauses in the execution flow) and routers (components that direct the execution flow towards a specific subgraph depending on the input) this can be used to build sophisticated loops that model agentic behavior.

Customizable Components

We believe that the design of an AI framework should meet the following requirements:

Be technology agnostic: Allow users the flexibility to decide what vendor or technology they want for each of these components and make it easy to switch out any component for another.
Be explicit: Make it transparent as to how these components can “talk” to each other.
Be flexible: Make it possible to create custom components whenever custom behavior is desirable.
Be extensible: Provide a uniform and easy way for the community and third parties to build their own components and foster an open ecosystem around Haystack.

All components in Haystack 2.0 (including Haystack Integrations) are built with a common “component” interface. The principle is simple:

A component implements some logic in a method called run
The run method receives one or more input values
The run method returns one or more output values

Take embedders as an example: these components expect text as input and create vector representations (embeddings) that they return as output. On the other hand, retrievers may need embeddings as input and return documents as output. When creating a new component, to decide what inputs and outputs it should have is part of the ideation process.

While there are many ready-made components built into Haystack, we want to highlight that building your own custom components is also a core functionality of Haystack 2.0.

In fact, we’ve taken advantage of this ourselves. For example, you can read about how to use the latest optimization techniques (like HyDE) in Haystack pipelines with custom components.

Since the release of Haystack 2.0-Beta, we’ve seen the benefits of having a well-defined simple interface for components. We, our community, and third parties have already created many components, available as additional packages for you to install.

We share these on the Haystack Integrations page, which has expanded to include all sorts of components over the last few months (with contributions from Assembly AI, Jina AI, mixedbread ai and more). We will continue to expand this page with new integrations and you can help us by creating a PR on haystack-integrations if you’d like to share a component with the community. To learn more about integrations and how to share them, you can check out our “Introduction to Integrations” documentation.

A common interface for storing data

Most NLP applications work on large amounts of data. A common design pattern is to connect your internal knowledge base to a Large Language Model (LLM) so that it can answer questions, summarize or translate documents, and extract specific information. For example, in retrieval-augment generative pipelines (RAG), you often use an LLM to answer questions about some data that was previously retrieved.

This data has to come from somewhere, and Haystack 2.0 provides a common interface to access it in a consistent way, independently from where data comes from. This interface is called “Document Store”, and it’s implemented for many different storage services, to make data easily available from within Haystack pipelines.

Today, we are releasing Haystack 2.0 with a large selection of database and vector store integrations. These include Chroma, Weaviate, Pinecone, Qdrant, Elasticsearch, Open Search, pgvector, MongoDB, AstraDB, Neo4j, Marqo DB, and the list will keep growing. And if your storage service is not supported yet, or should you need a high degree of customization on top of an existing one, by following our guide to creating custom document stores, you can connect your Haystack pipelines to your data from pretty much any storage service.

A clear path to production

The experience we got over the last couple of years, working on Haystack 1.0 and interacting with its community, taught us two things:

It’s essential for any AI application framework to be feature-complete and developer-friendly.
It’s only after the deployment phase that AI-based applications can truly make an impact.

While rewriting the framework from scratch, we took the opportunity to incorporate specific features that would simplify the deployment of Haystack-based AI applications in a production-grade environment:

A customizable logging system that supports structured logging and tracing correlation out of the box.
Code instrumentation collecting spans and traces in strategic points of the execution path, with support for Open Telemetry and Datadog already in place.

In addition, we decided to start a dedicated project to simplify deploying Haystack pipelines behind a RESTful API: Hayhooks.

Hayhooks is a client-server application that allows you to deploy Haystack pipelines, serving them through HTTP endpoints dynamically spawned. Two foundational features of Haystack 2.0 made this possible:

The ability to introspect a pipeline, determining its inputs and outputs at runtime. This means that every REST endpoint has well-defined, dynamically generated schemas for the request and response body, all depending on the specific pipeline structure.
A robust serialization mechanism. This allows for the conversion of Haystack pipelines from Python to a preferred data serialization format, and vice versa. The default format is YAML but Haystack is designed to easily extend support for additional serialization formats.

Optimization and Evaluation of Retrieval Augmentation

We’ve already been seeing the benefits of the new Haystack design, with pipeline optimization and evaluation being good examples of how we’ve been leveraging Haystack 2.0. How?:

It’s easier to extend the capabilities of Haystack
It’s easy to implement new integrations

Implementing the latest retrieval optimizations

Retrieval is a crucial step for successful RAG pipelines. And there’s been a lot of work to optimize this step. With Haystack 2.0, we’ve been able to:

Implement Hypothetical Document Embeddings (HyDE) easily, and we’ve already published a guide to HyDE along with an example walkthrough
Added an integration for Optimum embedders by Hugging Face

And we will be able to add more optimization techniques along the way!

Evaluation

Haystack 2.0 is being released with a few evaluation framework integrations in place:

Along with a guide to model-based evaluation.

Start using Haystack 2.0

Alongside Haystack 2.0, today we are also releasing a whole set of new tutorials, documentation, resources and more to help you get started:

Documentation: full technical documentation on all Haystack concepts and components
Tutorials: step-by-step, runnable Colab notebooks. Start with our first 2.0 tutorial “Creating Your First QA Pipeline with Retrieval-Augmentation”
Cookbooks: A collection of useful notebooks that showcase Haystack in various scenarios, using a number of our integrations.

And, as always, keep an eye out on our blog and integrations for updates and new content.

Join the Community

Stay up-to-date with Haystack:

Optimizing Retrieval with HyDE

Wed, 28 Feb 2024 00:00:00 +0000

Hypothetical Document Embeddings (HyDE) is a technique proposed in the paper “ Precise Zero-Shot Dense Retrieval without Relevance Labels” which improves retrieval by generating “fake” hypothetical documents based on a given query, and then uses those “fake” documents embeddings to retrieve similar documents from the same embedding space.

In this article, we will see how to implement and incorporate it into Haystack by creating a custom component that implements HyDE.

To learn more about how HyDE works, and where it’s useful, check out our guide on Hypothetical Document Embeddings (HyDE)

Build a Pipeline to Create Hypothetical Document Embeddings

First, let’s build a simple pipeline to generate these hypothetical documents. To do so, we will use the following Haystack components:

PromptBuilder and OpenAIGenerator to query an instruction-following language model and generate hypothetical documents.
SentenceTransformersDocumentEmbedder encodes the hypothetical documents into vector embeddings.
OutputAdapter to adapt the output of the Generator to be compatible with the input of the SentenceTransformersDocumentEmbedder, which expects List[Document]

To use the OpenAIGenerator, you need to set your OPENAI_API_KEY
export OPENAI_API_KEY="secret_string"

We first build a way to query an instruction-following language model to generate hypothetical documents.

from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.builders import PromptBuilder

generator = OpenAIGenerator(model="gpt-3.5-turbo",
							generation_kwargs={"n": 5, "temperature": 0.75, "max_tokens": 400},
)

template="""Given a question, generate a paragraph of text that answers the question.
			Question: {{question}}
			Paragraph:"""
prompt_builder = PromptBuilder(template=template)

This will output a list of 5 hypothetical documents, the same number the authors used for the experiments in the paper. We then use the SentenceTransformersDocumentEmbedder to encode these hypothetical documents into embeddings.

But, the SentenceTransformersDocumentEmbedder expects List[Document] objects as input, so we need to adapt the output of the OpenAIGenerator to be compatible with the input of the SentenceTransformersDocumentEmbedder. For this, we use an OutputAdapter with a custom filter:

from haystack import Document
from haystack.components.converters import OutputAdapter
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from typing import List

adapter = OutputAdapter(
    template="{{answers | build_doc}}",
    output_type=List[Document],
    custom_filters={"build_doc": lambda data: [Document(content=d) for d in data]}
)

embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
embedder.warm_up()

We can now create a custom component, HypotheticalDocumentEmbedder, that expects documents and can return a list of hypotethetical_embeddings which is the average of the embeddings from the “hypothetical” (fake) documents.

from numpy import array, mean
from haystack import component

@component
class HypotheticalDocumentEmbedder:

@component.output_types(hypothetical_embedding=List[float])
def  run(self, documents: List[Document]):
	stacked_embeddings = array([doc.embedding for doc in documents])
	avg_embeddings = mean(stacked_embeddings, axis=0)
	hyde_vector = avg_embeddings.reshape((1, len(avg_embeddings)))
	return {"hypothetical_embedding": hyde_vector[0].tolist()}

Now we can add all of these into a pipeline and generate hypothetical document embeddings.

from haystack import Pipeline

hyde = HypotheticalDocumentEmbedder()

pipeline = Pipeline()
pipeline.add_component(name="prompt_builder", instance=prompt_builder)
pipeline.add_component(name="generator", instance=generator)
pipeline.add_component(name="adapter", instance=adapter)
pipeline.add_component(name="embedder", instance=embedder)
pipeline.add_component(name="hyde", instance=hyde)

pipeline.connect("prompt_builder", "generator")
pipeline.connect("generator.replies", "adapter.answers")
pipeline.connect("adapter.output", "embedder.documents")
pipeline.connect("embedder.documents", "hyde.documents")

query = "What should I do if I have a fever?"
result = pipeline.run(data={"prompt_builder": {"question": query}})

Below a graphical representation of the pipeline we created

Build a Complete HyDE Component

Optionally, we could also create a HypotheticalDocumentEmbedder that encapsulates the entire logic that we saw above. This way, we would be able to use this one components for improved retrieval.

This component can do a few things:

Allow the user to pick the LLM which generates the hypothetical documents
Allow users to define how many documents should be created with nr_completions
Allow users to define the embedding model they want to use to generate the HyDE embeddings.

from haystack import Pipeline, component, Document, default_to_dict, default_from_dict
from haystack.components.converters import OutputAdapter
from haystack.components.embedders.sentence_transformers_document_embedder import SentenceTransformersDocumentEmbedder
from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.builders import PromptBuilder

from typing import Dict, Any, List
from numpy import array, mean

from haystack.utils import Secret

@component
class HypotheticalDocumentEmbedder:

    def __init__(
        self,
        instruct_llm: str = "gpt-3.5-turbo",
        instruct_llm_api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
        nr_completions: int = 5,
        embedder_model: str = "sentence-transformers/all-MiniLM-L6-v2",
    ):
        self.instruct_llm = instruct_llm
        self.instruct_llm_api_key = instruct_llm_api_key
        self.nr_completions = nr_completions
        self.embedder_model = embedder_model
        self.generator = OpenAIGenerator(
            api_key=self.instruct_llm_api_key,
            model=self.instruct_llm,
            generation_kwargs={"n": self.nr_completions, "temperature": 0.75, "max_tokens": 400},
        )
        self.prompt_builder = PromptBuilder(
            template="""Given a question, generate a paragraph of text that answers the question.
            Question: {{question}}
            Paragraph:
            """
        )

        self.adapter = OutputAdapter(
            template="{{answers | build_doc}}",
            output_type=List[Document],
            custom_filters={"build_doc": lambda data: [Document(content=d) for d in data]},
        )

        self.embedder = SentenceTransformersDocumentEmbedder(model=embedder_model, progress_bar=False)
        self.embedder.warm_up()

        self.pipeline = Pipeline()
        self.pipeline.add_component(name="prompt_builder", instance=self.prompt_builder)
        self.pipeline.add_component(name="generator", instance=self.generator)
        self.pipeline.add_component(name="adapter", instance=self.adapter)
        self.pipeline.add_component(name="embedder", instance=self.embedder)
        self.pipeline.connect("prompt_builder", "generator")
        self.pipeline.connect("generator.replies", "adapter.answers")
        self.pipeline.connect("adapter.output", "embedder.documents")

    def to_dict(self) -> Dict[str, Any]:
        data = default_to_dict(
            self,
            instruct_llm=self.instruct_llm,
            instruct_llm_api_key=self.instruct_llm_api_key,
            nr_completions=self.nr_completions,
            embedder_model=self.embedder_model,
        )
        data["pipeline"] = self.pipeline.to_dict()
        return data

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "HypotheticalDocumentEmbedder":
        hyde_obj = default_from_dict(cls, data)
        hyde_obj.pipeline = Pipeline.from_dict(data["pipeline"])
        return hyde_obj

    @component.output_types(hypothetical_embedding=List[float])
    def run(self, query: str):
        result = self.pipeline.run(data={"prompt_builder": {"question": query}})
        # return a single query vector embedding representing the average of the hypothetical document embeddings
        stacked_embeddings = array([doc.embedding for doc in result["embedder"]["documents"]])
        avg_embeddings = mean(stacked_embeddings, axis=0)
        hyde_vector = avg_embeddings.reshape((1, len(avg_embeddings)))
        return {"hypothetical_embedding": hyde_vector[0].tolist()}

Using the `HypotheticalDocumentEmbedder` for Retrieval

As a final step, let’s see how we can use our new component in a retrieval pipeline. To start, we can create a document store that has some data in it.

from datasets import load_dataset, Dataset
from haystack import Pipeline, Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

embedder_model = "sentence-transformers/all-MiniLM-L6-v2"

def  index_docs(data: Dataset):
	document_store = InMemoryDocumentStore()
	pipeline = Pipeline()
	
	pipeline.add_component("cleaner", DocumentCleaner())
	pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=10))
	pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(model=embedder_model))
	pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy="skip"))

	pipeline.connect("cleaner", "splitter")
	pipeline.connect("splitter", "embedder")
	pipeline.connect("embedder", "writer")
	pipeline.run({"cleaner": {"documents": [Document.from_dict(doc) for doc in data["train"]]}})

	return document_store
	
data = load_dataset("Tuana/game-of-thrones")
doc_store = index_docs(data)

Now that we’ve populated an InMemoryDocumentStore with some data, let’s see how we can use the HypotheticalDocumentEmbedder as a way to retrieve documents 👇

from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

def  retriever_with_hyde(doc_store):
	hyde = HypotheticalDocumentEmbedder(instruct_llm="gpt-3.5-turbo", nr_completions=5)
	retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
	
	retrieval_pipeline = Pipeline()
	retrieval_pipeline.add_component(instance=hyde, name="query_embedder")
	retrieval_pipeline.add_component(instance=retriever, name="retriever")

	retrieval_pipeline.connect("query_embedder.hypothetical_embedding", "retriever.query_embedding")
	return retrieval_pipeline

retrieval_pipeline = retriever_with_hyde(doc_store)
query = "Who is Araya Stark?"
retrieval_pipeline.run(data={"query_embedder": {"query": query}, "retriever": {"top_k": 5}})

Wrapping it up

If you’ve gotten this far, you know how to use the HyDE technique and how easy it is to incorporate it into Haystack. To learn more about Haystack, join us on Discord or sign up for our monthly newsletter.

Announcing the Astra DB Haystack Integration

Fri, 19 Jan 2024 00:00:00 +0000

The Haystack extension family is growing so fast, it’s hard to keep up! Our latest addition is the Astra DB extension by Datastax. It’s an open source package that helps you use Astra DB as a vector database for your Haystack pipelines.

Let’s learn about the benefits of Astra DB and how to use it with Haystack.

Benefits of Astra DB

DataStax Astra DB is a serverless vector database built on Apache Cassandra. What makes Astra DB special?

Interoperability with Cassandra’s open source ecosystem and tooling.
Astra DB supports a variety of different embedding models. One Astra database instance can have multiple collections with different vector sizes. This makes it easy to test different embedding models and find the best one for your use case.
It’s serverless. What does that mean for a database? You don’t have to manage individual instances, or deal with cumbersome upgrading or scaling. All of that is taken care of for you behind the scenes.
Enterprise scalability. Astra DB can be deployed across the major cloud providers (AWS, GCP, or Azure) and across multiple regions depending on your needs.
At the time of this writing, there’s a free tier available so you can try it without a credit card.

Create your Astra DB database

To ensure these instructions remain up to date, we’re going to point you to the Astra DB docs to explain how to create a database.

Create a free Astra DB database. Make a note of your credentials - you’ll need your Astra API endpoint and Astra application token to use the Haystack extension.
Choose the number of dimensions that matches the embedding model you plan on using. For this example we’ll use a 384-dimension model, sentence-transformers/all-MiniLM-L6-v2.
Create a collection with the same number of dimensions as your embedding model. Save the name of your collection since you’ll need this as well.

Get started with the Astra DB Haystack Integration

First, install the integration:

pip install astra-haystack sentence-transformers

Remember earlier when I mentioned you were going to need your credentials? I hope you saved them. If not, that’s okay, you can go back to the Astra Portal and grab them.

Note: if you were running this code in production, you’d want to save these as environment variables to keep things nice and secure.

from getpass import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass("Enter your openAI key:")
os.environ["ASTRA_DB_API_ENDPOINT"] = getpass("Enter your Astra API Endpoint:")
os.environ["ASTRA_DB_APPLICATION_TOKEN"] = getpass("Enter your Astra application token (e.g.AstraCS:xxx ):")

Using the Astra DocumentStore in an index pipeline

Next, we’ll make a Haystack pipeline, create some embeddings from documents, and add them into the AstraDocumentStore.

import logging

from haystack import Document, Pipeline

from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.astra import AstraDocumentStore

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

embedding_model_name = "sentence-transformers/all-MiniLM-L6-v2"

# Make sure ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN environment variables are set before proceeding

# embedding_dim is the number of dimensions the embedding model supports.
document_store = AstraDocumentStore(
    duplicates_policy=DuplicatePolicy.SKIP,
    embedding_dimension=384,
)


# Add Documents
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates"
        " a high level of self-awareness, such as recognizing themselves in mirrors."
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, "
        "and San Diego, you can witness the phenomenon of bioluminescent waves."
    ),
]
index_pipeline = Pipeline()
index_pipeline.add_component(
    instance=SentenceTransformersDocumentEmbedder(model=embedding_model_name),
    name="embedder",
)
index_pipeline.add_component(instance=DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP), name="writer")
index_pipeline.connect("embedder.documents", "writer.documents")

index_pipeline.run({"embedder": {"documents": documents}})

print(document_store.count_documents())

If all has gone well, there should be 3 documents. 🎉

Use the `AstraEmbeddingRetriever` in a Haystack RAG pipeline

In Haystack, every DocumentStore is tightly coupled with the Retriever that fetches from it. Astra DB is no exception. Here we’ll create a RAG pipeline, where the AstraEmbeddingRetriever will fetch documents relevant to our query.

from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.components.retrievers.astra import AstraEmbeddingRetriever

prompt_template = """
                Given these documents, answer the question.
                Documents:
                {% for doc in documents %}
                    {{ doc.content }}
                {% endfor %}
                Question: {{question}}
                Answer:
                """

rag_pipeline = Pipeline()
rag_pipeline.add_component(
    instance=SentenceTransformersTextEmbedder(model=embedding_model_name),
    name="embedder",
)
rag_pipeline.add_component(instance=AstraEmbeddingRetriever(document_store=document_store), name="retriever")
rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
rag_pipeline.connect("embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")

# Run the pipeline
question = "How many languages are there in the world today?"
result = rag_pipeline.run(
    {
        "embedder": {"text": question},
        "retriever": {"top_k": 2},
        "prompt_builder": {"question": question},
        "answer_builder": {"query": question},
    }
)

print(result)

The output should look like this:

{'answer_builder': {'answers': [GeneratedAnswer(data='There are over 7,000 languages spoken around the world today.', query='How many languages are there in the world today?', documents=[Document(id=cfe93bc1c274908801e6670440bf2bbba54fad792770d57421f85ffa2a4fcc94, content: 'There are over 7,000 languages spoken around the world today.', score: 0.9267925, embedding: vector of size 384), Document(id=6f20658aeac3c102495b198401c1c0c2bd71d77b915820304d4fbc324b2f3cdb, content: 'Elephants have been observed to behave in a way that indicates a high level of self-awareness, such ...', score: 0.5357444, embedding: vector of size 384)], meta={'model': 'gpt-3.5-turbo-0613', 'index': 0, 'finish_reason': 'stop', 'usage': {'completion_tokens': 14, 'prompt_tokens': 83, 'total_tokens': 97}})]}}

Wrapping it up

If you’ve gotten this far, now you know how to use Astra DB as a data source for your Haystack pipeline. To learn more about Haystack, join us on Discord or sign up for our monthly newsletter.

PDF-Based Question Answering with Amazon Bedrock and Haystack

Wed, 17 Jan 2024 00:00:00 +0000

Amazon Bedrock is a fully managed service that provides high-performing foundation models from leading AI startups and Amazon through a single API. You can choose from various foundation models to find the one best suited for your use case.

In this article, I’ll guide you through the process of creating a generative question answering application tailored for PDF files using the newly added Amazon Bedrock integration with Haystack and OpenSearch to store our documents efficiently. The demo will illustrate the step-by-step development of a QA application designed specifically for the Bedrock documentation, demonstrating the power of Bedrock in the process 🚀

Benefits of Amazon Bedrock

It provides access to a diverse range of foundation models sourced from leading AI startups, including AI21 Labs, Anthropic, Cohere, Meta, and Stability AI, along with Amazon Titan models.
You can experiment with different large language models (LLMs) seamlessly using Amazon Bedrock. No need for multiple API keys; just modify the model name and test your application with various prompts and configurations to identify the optimal model for your specific use case.
Amazon Bedrock doesn’t use your prompts and continuations in training AWS models or share them with third parties. Your training data is not employed in training the core Amazon Titan models or distributed externally. Additionally, other usage data, such as timestamps and account IDs, is not employed in model training. Source
You don’t need to manage infrastructure for hosting your models when deploying your application after the prototyping phase. Amazon Bedrock takes care of the hosting infrastructure, providing a seamless deployment experience.

Set Up Amazon Bedrock

To use Amazon Bedrock, start by signing up for an AWS account. Once you have successfully logged in, go to Amazon Bedrock Console and get started. Keep in mind that, by default, users do not have model access. You should request access from Model Access Page. For this application, we’re going to use “Titan Text G1 - Express” model by Amazon. Unfortunately, there is no free tier for Amazon Bedrock, therefore, you might need to provide payment information at this step.

Learn more about the process in this set up Amazon Bedrock guide.

Manage model access

API Keys

To use Amazon Bedrock, you need aws_access_key_id, aws_secret_access_key, and indicate the aws_region_name. Once logged into your account, create the access keys in “Security Credentials” section. For detailed guidance, refer to the documentation on Managing access keys for IAM users.

1- Find the access keys under ‘Security Credentials’ section

2- Create a new key (or use an existing one)

3- Copy and store your aws_access_key_id and aws_secret_access_key keys before you press ‘Done’

Development Environment

Install dependencies

Let’s install Amazon Bedrock and OpenSearch integration of Haystack and pypdf package to process PDF files.

pip install opensearch-haystack amazon-bedrock-haystack pypdf

Download Files

For this application, we’ll use the user guide of Amazon Bedrock. Amazon Bedrock provides the PDF form of its guide. You can download the PDF from its source or by running the code below 👇🏼

import boto3
from botocore import UNSIGNED
from botocore.config import Config
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))

# Download the PDF file from our s3 bucket to `/content/bedrock-documentation.pdf` directory
s3.download_file('core-engineering', 'public/blog-posts/bedrock-documentation.pdf', '/content/bedrock-documentation.pdf') #

Initialize an OpenSearch Instance

OpenSearch is a fully open-source search and analytics engine that seamlessly integrates with the Amazon OpenSearch Service, allowing you to deploy, manage, and scale your OpenSearch cluster effortlessly. To get started, install OpenSearch and launch an instance by following the detailed instructions provided in the OpenSearch documentation.

Here’s how to launch OpenSearch with Docker (recommended):

docker pull opensearchproject/opensearch:2.11.0
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" -e "OPENSEARCH_JAVA_OPTS=-Xms1024m -Xmx1024m" opensearchproject/opensearch:2.11.0

Building the Indexing Pipeline

Our indexing pipeline will convert the PDF file into a Haystack Document using PyPDFToDocument and preprocess it by cleaning and splitting it into chunks before storing them in OpenSearchDocumentStore.

Let’s run the pipeline below and index our file to our document store:

from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

## Initialize the OpenSearchDocumentStore
document_store = OpenSearchDocumentStore(hosts="http://localhost:9200", use_ssl=True, verify_certs=False, http_auth=("admin", "admin"))

## Create pipeline components
converter = PyPDFToDocument()
cleaner = DocumentCleaner()
splitter = DocumentSplitter(split_by="sentence", split_length=10, split_overlap=2)
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

## Add components to the pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

## Connect the components to each other
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "writer")

Run the pipeline with the files you want to index.

indexing_pipeline.run({"converter": {"sources": [Path("/content/bedrock-documentation.pdf")]}})

Pipelines for indexing files into document stores are convenient when you need to add additional files to your document store. Feel free to rerun the pipeline with new documents whenever necessary.

Building the Query Pipeline

Let’s create another pipeline to query our application. In this pipeline, we’ll use OpenSearchBM25Retriever to retrieve relevant information from the OpenSearchDocumentStore and an Amazon Titan model amazon.titan-text-express-v1 to generate answers with AmazonBedrockGenerator. You can find other model options in supported foundation models. Next, we’ll create a prompt for our task using the Retrieval-Augmented Generation (RAG) approach with PromptBuilder. This prompt will help generate answers by considering the provided context. Finally, we’ll connect these three components to complete the pipeline.

from haystack.components.builders import PromptBuilder
from haystack.pipeline import Pipeline
from amazon_bedrock_haystack.generators.amazon_bedrock import AmazonBedrockGenerator
from haystack_integrations.components.retrievers.opensearch import OpenSearchBM25Retriever

## Create pipeline components
retriever = OpenSearchBM25Retriever(document_store=document_store, top_k=15)

## Set the AWS credentials as environment variables
os.environ["AWS_ACCESS_KEY_ID"] = "aws_access_key_id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "aws_secret"
os.environ["AWS_REGION_NAME"] = "aws_region_name"

## Initialize the AmazonBedrockGenerator with an Amazon Bedrock model
generator = AmazonBedrockGenerator(model='amazon.titan-text-express-v1', max_length=500)
template = """
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Please answer the question based on the given information from Amazon Bedrock documentation.

{{question}}
"""
prompt_builder = PromptBuilder(template=template)

## Add components to the pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)

## Connect components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

Now, ask your question and learn about the Amazon Bedrock service using Amazon Bedrock models!

question = "What is Amazon Bedrock??"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

Example result:

Amazon Bedrock is a fully managed service that makes high-performing foundation models (FMs) from leading AI startups and Amazon available for your use through a uniﬁed API. You can choose from a wide range of foundation models to ﬁnd the model that is best suited for your use case. Amazon Bedrock also offers a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can easily experiment with and evaluate top foundation models for your use cases, privately customize them with your data using techniques such as ﬁne-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources.
With Amazon Bedrock's serverless experience, you can get started quickly, privately customize foundation models with your own data, and easily and securely integrate and deploy them into your applications using AWS tools without having to manage any infrastructure.

Keep in mind that generative models are not deterministic and the response you get might differ.

Next Steps

As the next step, feel free to enhance your document store by indexing additional files or replacing existing ones with your own data.

For this demonstration, we applied the BM25 method to generate text vectors. If you’d like to improve the application further, select an Embedder and generate dense embeddings for each file using a chosen embedding model before incorporating them into the document store.

Thank you for your interest! Stay informed about the latest developments in Haystack by subscribing to our newsletter or joining our Discord community.

Using Jina Embeddings v2 with Haystack pipelines to summarize legal documents

Wed, 10 Jan 2024 00:00:00 +0000

Jina.ai recently upgraded and expanded the capabilities of their previous embedding model in a v2 release.

With the Jina Haystack extension, you can now take advantage of these new text embedders in your Haystack pipelines! In this post, we’ll show what’s cool about Jina Embeddings v2 and how to use them.

You can follow along in the accompanying Colab notebook of a RAG pipeline that uses the Jina Haystack extension.

Advantages of Jina Embeddings v2

Handling long documents. The large token window, accommodating up to 8192 tokens, allows you to break the embeddings into larger chunks. It’s more computationally and memory-efficient to use a few larger vectors than a lot of small ones, so this allows Jina v2 to process large documents efficiently.
Improved semantic understanding. Larger text chunks also contain more context within each chunk, which can help LLMs better understand your documents. Improved understanding means better long document retrieval, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search.
Short vector length: Jina Embeddings v2 emits embedding vectors of length 768 (base model) or 512 (small model), which are both significantly less than that of the only other embedding model that supports 8k tokens input length, while not compromising on the quality of retrieval, similarity, reranking or other downstream tasks. A shorter vector length implies cost-savings for the vector database, which typically price based on stored vector dimensions.
Fully open source 💙 There are both small and large embedding models available, depending on your computing resources and requirements. To run the embedding models yourself, check out this documentation on HuggingFace. Alternately, you can use Jina’s fully managed embedding service to handle that for you, which we’ll be doing for this demo.

Getting started using Jina Embeddings v2 with Haystack

To use the integration you’ll need a free Jina api key - get one here.

You can use Jina Embedding models with two Haystack components: JinaTextEmbedder and JinaDocumentEmbedder.

To create semantic embeddings for documents, use JinaDocumentEmbedder in your indexing pipeline. For generating embeddings for queries, use JinaTextEmbedder.

In the following code we’ll demonstrate how to use both components. You can also see the Haystack docs for some minimum viable code examples.

Summarizing legal text with a Haystack RAG pipeline

I’m not a lawyer, and neither are large language models. But LLMs are good at analyzing long, complex documents. So let’s try using the Jina v2 embedding models for some legal summarization.

In October 2023, I narrowly escaped jury duty. I had slight FOMO since the case sounded interesting (Google v. Sonos). Let’s see how it turned out.

To follow along with this demo, in addition to a Jina api key you’ll also need a Hugging Face access token, since we’ll use the Mixtral 8x7b LLM for question answering.

First, let’s install all the packages we’ll need.

pip install jina-haystack chroma-haystack pypdf

Then let’s input our credentials. Or you can set them as environment variables instead if you’re feeling fancy.

from getpass import getpass
import os

os.environ["JINA_API_KEY"] = getpass("JINA api key:")
os.environ["HF_API_TOKEN"] = getpass("Enter your HuggingFace api token: ")

Building the indexing pipeline

Our indexing pipeline will preprocess the legal document, turn it into vectors, and store them. We’ll use the Chroma DocumentStore to store the vector embeddings, via the Chroma Document Store Haystack integration.

from chroma_haystack.document_store import ChromaDocumentStore
document_store = ChromaDocumentStore()

At a high level, the LinkContentFetcher pulls this document from its URL. Then we convert it from a PDF into a Document object Haystack can understand.

We preprocess it by removing whitespace and redundant substrings. Then split it into chunks, generate embeddings, and write these embeddings into the ChromaDocumentStore.

from haystack import Pipeline

from haystack.components.fetchers import LinkContentFetcher
from haystack.components.converters import PyPDFToDocument
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from chroma_haystack.retriever import ChromaEmbeddingRetriever
from haystack.document_stores.types import DuplicatePolicy

from jina_haystack.document_embedder import JinaDocumentEmbedder
from jina_haystack.text_embedder import JinaTextEmbedder

fetcher = LinkContentFetcher()
converter = PyPDFToDocument()
# remove repeated substrings to get rid of headers/footers
cleaner = DocumentCleaner(remove_repeated_substrings=True)

# Since jina-v2 can handle 8192 tokens, 500 words seems like a safe chunk size
splitter = DocumentSplitter(split_by="word", split_length=500)

# DuplicatePolicy.SKIP is optional but helps avoid errors if you want to re-run the pipeline
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

retriever = ChromaEmbeddingRetriever(document_store=document_store)

document_embedder = JinaDocumentEmbedder(model="jina-embeddings-v2-base-en")

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=fetcher, name="fetcher")
indexing_pipeline.add_component(instance=converter, name="converter")
indexing_pipeline.add_component(instance=cleaner, name="cleaner")
indexing_pipeline.add_component(instance=splitter, name="splitter")
indexing_pipeline.add_component(instance=document_embedder, name="embedder")
indexing_pipeline.add_component(instance=writer, name="writer")

indexing_pipeline.connect("fetcher.streams", "converter.sources")
indexing_pipeline.connect("converter.documents", "cleaner.documents")
indexing_pipeline.connect("cleaner.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

# This case references Google V Sonos, October 2023
urls = ["https://cases.justia.com/federal/district-courts/california/candce/3:2020cv06754/366520/813/0.pdf"]

indexing_pipeline.run(data={"fetcher": {"urls": urls}})

Building the query pipeline

Now the real fun begins. Let’s create a query pipeline so we can actually start asking questions. We write a prompt allowing us to pass our documents to the Mixtral-8x7B LLM. Then we initiatialize the LLM via the HuggingFaceAPIGenerator.

To use this model, you need to accept the conditions here: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

In Haystack 2.0 retrievers are tightly coupled to DocumentStores. If we pass the document store in the retriever we initialized earlier, this pipeline can access those embeddings we generated, and pass them to the LLM.


from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder

from jina_haystack.text_embedder import JinaTextEmbedder
prompt = """ Answer the question, based on the
content in the documents. If you can't answer based on the documents, say so.

Documents:
{% for doc in documents %}
  {{doc.content}}
{% endfor %}

question: {{question}}
"""

text_embedder = JinaTextEmbedder(model="jina-embeddings-v2-base-en")
generator = HuggingFaceAPIGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "mistralai/Mixtral-8x7B-Instruct-v0.1"})  

prompt_builder = PromptBuilder(template=prompt)
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder",text_embedder)
query_pipeline.add_component(instance=prompt_builder, name="prompt_builder")
query_pipeline.add_component("retriever", retriever)
query_pipeline.add_component("generator", generator)

query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder.prompt", "generator.prompt")

Time to ask a question!

question = "Summarize what happened in Google v. Sonos"

result = query_pipeline.run(data={"text_embedder":{"text": question},
                                  "retriever": {"top_k": 3},
                                  "prompt_builder":{"question": question},
                                  "generator": {"generation_kwargs": {"max_new_tokens": 350}}})

print(result['generator']['replies'][0])

Answer: Google v. Sonos is a patent infringement case in which Sonos sued Google for infringing on two of its patents related to customizing and saving overlapping groups of smart speakers or other zone players according to a common theme..

Exploring more questions and documents

You can swap the question variable out and then call pipeline.run again:

What role did If This Then That play in Google v. Sonos?
What judge presided over Google v. Sonos?
What should Sonos have done differently?

The indexing pipeline is written so that you can swap in other documents and analyze them. You can try plugging the following URLs (or any PDF written in English) into the indexing pipeline and re-running all the code blocks below it.

Google v. Oracle: https://supreme.justia.com/cases/federal/us/593/18-956/case.pdf
JACK DANIEL’S PROPERTIES, INC. v. VIP PRODUCTS LLC: https://www.supremecourt.gov/opinions/22pdf/22-148_3e04.pdf

Note: if you want to change the prompt template, you’ll also need to re-run the code blocks starting where the DocumentStore is defined.

Wrapping it up

Thanks for reading! If you want to stay on top of the latest Haystack developments, you can subscribe to our newsletter or join our Discord community.

To learn more about the technologies used here, check out these blog posts:

Highlights of 2023

Fri, 05 Jan 2024 00:00:00 +0000

This post was originally shared through the Haystack newsletter. Subscribe now to stay in the loop on all things Haystack, including the latest updates, new features, captivating content, and upcoming exciting events! 🗞️

What a fantastic year it has been! In 2023, the world of AI saw tremendous progress, making it an exciting time for learning and growth. As we wind down and approach the new year, it’s a good moment to take a breather, look back on the past year, and appreciate the highlights before gearing up for what’s ahead. We’ve taken some time to review the noteworthy moments that made 2023 special for Haystack and its community.

A big thank you for being part of this journey! Here’s to a wonderful New Year filled with dizzying generative AI and NLP advancements 🥂✨

💫 Haystack Highlights

The biggest highlight of the year was Haystack 2.0-Beta, which is a more flexible, extendable, customizable version of Haystack 💯
Haystack surpassed 12,000 stars on GitHub 🌟
It has been exactly 2 years since the first stable release of Haystack 🎉
We introduced Haystack Integrations to highlight community-contributed external packages and incorporate additional technologies like Amazon, VertexAI, Elasticsearch, OpenSearch, Chroma, Pinecone, Cohere, Gradient, Jina, and Ollama that seamlessly integrate with Haystack 🧩
We hosted the first Advent of Haystack 🎄
Over 80 pull requests from the community were merged 🚀
We launched Haystack Blog, where we publish articles about Haystack, LLMs, demos, and use cases 📚 P.S. Did you know that you can contribute to Haystack Blog?

We announced Haystack 2.0-Beta and hosted the first Advent of Haystack

👥 Community Highlights

Our community has reached 2300+ members on Discord, join the Haystack Discord community to ask questions, stay informed about Haystack updates, and connect with fellow Haystackers 🤖
Discord has been the platform for numerous office hours, a handful of live coding sessions, and engaging Road to 2.0 sessions 🛣️
We proudly participated in Hacktoberfest for the first time and welcomed new Haystack contributors 🎃
We introduced our monthly Haystack community newsletter 🗞️
120+ Advent of Haystack participants took the time to give us feedback on Haystack 2.0-Beta 🎄
We participated in the Anthropic Hackathon 🧑‍💻
Haystack has spread its wings on Twitter/X, follow us at Haystack_AI 🐦

A photo from the Anthropic Hackathon

📆 Webinar Highlights

We collected some of the webinars that the Haystack community found interesting in 2023:

‘Optimizing LLMs with Retrieval Augmented Generation and Haystack 2.0’ by Sara

📖 Blog Post Highlights

Here are the most visited blog posts of 2023:

Gemini Models with Google Vertex AI Integration for Haystack: Multi-Modal components and function calling with the new Gemini integrations for Haystack 2.0
The Beginner’s Guide to LLM Prompting: Engineer your natural language prompts to get the best answers from a large language model
Deploying RAG to Production: From a working prototype to a RAG AI system in production: an overview for data scientists
How to Prevent Prompt Injections: Prevent prompt injections leveraging our new open-source model and dataset
Enhancing RAG Pipelines in Haystack: How the latest rankers optimize LLM context window utilization in RAG pipelines

We published ‘Gemini Models with Google Vertex AI Integration for Haystack’

👏 Some of Your Work

In 2023, your contributions to the community were truly outstanding, and we want to highlight some of these works that we enjoyed seeing:

@cclauss, @faaany, @danielbichuetti, @awinml, @zoltan-fedor, @benheckmann, @nickprock and @recrudesce contributed the most with their PRs
@AIAnytime on YouTube delivered two invaluable videos on Haystack:

Chat with Data App: RAG using Mistral 7B, Haystack, and Chainlit and RAG Implementation using Mistral 7B, Haystack, Weaviate, and FastAPI
Laura Funderburk hosted a mentorship program to help users build with Haystack and open source projects, in addition to her great Haystack demos

and numerous members showcased their projects with the community on the #show-and-tell channel of Discord, as well as on Linkedin and Twitter 🤩

💙 deepset Highlights

deepset, the company behind Haystack, announced a $30 million funding round led by Balderton Capital, with participation from existing investors GV, Harpoon, System.One, and Lunar. Read more about it here.

deepset raised $30 million funding

🔜 Next

The stable version of Haystack 2.0 will be out in Q1 2024. Check out Haystack public roadmap to see the upcoming features.
Dive into the world of Haystack 2.0-beta with our latest discussion entry, designed to be your go-to resource until the stable release of Haystack 2.0. Find the roadmap, release notes, and continually updated feature table in the ultimate guide for Haystack 2.0-beta.
We have just introduced a new repository: Haystack Cookbook. Stay tuned as we expand this collection with even more example notebooks throughout 2024.

To stay informed about everything related to Haystack, such as the latest updates, new features, captivating content, and upcoming exciting events, be sure to subscribe to the monthly Haystack newsletter! 🗞️

Building a Healthcare Chatbot with Mixtral 8x7b, Haystack, and PubMed

Tue, 02 Jan 2024 00:00:00 +0000

Unfortunately, countless people around the world have inadequate access to healthcare. I’m lucky to have health insurance and good medical providers taking care of me. However, I still want to educate myself before walking into a doctor’s office.

Technology can empower people to take charge of their health. Large language models can power chatbots where people can ask medical questions.

In this post, I’ll show you how I built a medical chatbot with Haystack 2.0-Beta, and the Mixtral 8x7B model by pulling research papers from PubMed.

You can follow along with this Colab. You’ll need a HuggingFace API key. Sign up for a free account here.

Challenges

Building a medical chatbot presents some challenges.

Lack of data. HIPAA and other privacy regulations make it harder to find public QA datasets to fine-tune a model.
The human impact of providing wrong answers. 😬
Staying up to date. It takes a long time to train a LLM. By the time they are released, their knowledge is already a bit stale. Medical research breakthroughs are happening all the time. I am particularly interested in long COVID, which has both impacted humanity on a massive scale, and is the subject of ongoing research.

Therefore, I decided to use a RAG pipeline to combine PubMed data with a LLM.

The PubMed RAG Pipeline

Retrieval augmented generation, or RAG, is a way of giving a LLM context so it can better answer questions.

You pass the LLM some documents, along with a query, and prompt the LLM to use the documents when answering the question.

PubMed has up to date, trustworthy medical information so it seemed like a solid document source. Plus, there’s a PyMed wrapper for the PubMed API that made querying easy peasy. We’ll wrap this in a Haystack custom component to format the results as Documents so that Haystack can use them, and add some light error handling.

from pymed import PubMed
from typing import List
from haystack import component
from haystack import Document

pubmed = PubMed(tool="Haystack2.0Prototype", email="tilde.thurium@deepset.ai")

def documentize(article):
  return Document(content=article.abstract, meta={'title': article.title, 'keywords': article.keywords})

@component
class PubMedFetcher():

  @component.output_types(articles=List[Document])
  def run(self, queries: list[str]):
    cleaned_queries = queries[0].strip().split('\n')

    articles = []
    try:
      for query in cleaned_queries:
        response = pubmed.query(query, max_results = 1)
        documents = [documentize(article) for article in response]
        articles.extend(documents)
    except Exception as e:
        print(e)
        print(f"Couldn't fetch articles for queries: {queries}" )
    results = {'articles': articles}
    return results

For the model, I went with Mixtral’s 8x7b. Mixtral is a unique kind of model that uses 8 “experts” and an internal “routing” mechanism that routes a token to a specific expert. This also means that during inferencing, not all parameters are used, which allows the model to response remarkably fast. This HuggingFace blog post explains MoE in more detail.

Generating Keywords for PubMed with Mixtral/LLMs

First, I tried an approach where I passed a plain query to PubMed. e.g. “What are the most current treatments for long COVID?” Unfortunately, that didn’t work too well. The articles returned weren’t very relevant. Which makes sense, because PubMed isn’t optimized for natural language search. It is optimized for keywords, though. And you know what’s great at generating keywords? LLMs!

So now our flow is as follows:

Our user inputs a question, such as, “What are the most current treatments for long COVID?”
We prompt the LLM to turn the question into keywords
Search PubMed and return top_k articles based on those keywords
Pass those articles to the LLM and ask them to reference it when formulating an answer.

First, initialize the LLMs and warm them up.

from haystack.components.generators import HuggingFaceTGIGenerator
from haystack.utils import Secret

keyword_llm = HuggingFaceTGIGenerator("mistralai/Mixtral-8x7B-Instruct-v0.1", token=Secret.from_token(huggingface_token))
keyword_llm.warm_up()

llm = HuggingFaceTGIGenerator("mistralai/Mixtral-8x7B-Instruct-v0.1", token=Secret.from_token(huggingface_token))
llm.warm_up()

Next, we create our prompts and our pipeline and hook everything up.

from haystack import Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder

keyword_prompt_template = """
Your task is to convert the follwing question into 3 keywords that can be used to find relevant medical research papers on PubMed.
Here is an examples:
question: "What are the latest treatments for major depressive disorder?"
keywords:
Antidepressive Agents
Depressive Disorder, Major
Treatment-Resistant depression
---
question: {{ question }}
keywords:
"""

prompt_template = """
Answer the question truthfully based on the given documents.
If the documents don't contain an answer, use your existing knowledge base.

q: {{ question }}
Articles:
{% for article in articles %}
  {{article.content}}
  keywords: {{article.meta['keywords']}}
  title: {{article.meta['title']}}
{% endfor %}

"""
keyword_prompt_builder = PromptBuilder(template=keyword_prompt_template)
prompt_builder = PromptBuilder(template=prompt_template)
fetcher = PubMedFetcher()

pipe = Pipeline()

pipe.add_component("keyword_prompt_builder", keyword_prompt_builder)
pipe.add_component("keyword_llm", keyword_llm)
pipe.add_component("pubmed_fetcher", fetcher)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)

pipe.connect("keyword_prompt_builder.prompt", "keyword_llm.prompt")
pipe.connect("keyword_llm.replies", "pubmed_fetcher.queries")

pipe.connect("pubmed_fetcher.articles", "prompt_builder.articles")
pipe.connect("prompt_builder.prompt", "llm.prompt")

Try it for yourself and see!

question="What are the most current treatments for long COVID?"
pipe.run(data={"keyword_prompt_builder":{"question":question},
               "prompt_builder":{"question": question},
               "llm":{"generation_kwargs": {"max_new_tokens": 500}}})

What are the most current treatments for long COVID?
  The COVID-19 pandemic has led to a significant increase in the number of patients with post-acute COVID-19 syndrome (PACS), also known as long COVID. PACS is a complex, multisystem disorder that can affect various organs and systems, including the respiratory, cardiovascular, neurological, and gastrointestinal systems. The pathophysiology of PACS is not yet fully understood, but it is believed to be related to immune dysregulation, persistent inflammation, and microvascular injury.
The management of PACS is challenging due to its heterogeneous presentation and the lack of evidence-based treatments. Current treatment approaches are mainly supportive and aim to alleviate symptoms and improve quality of life. These include:
- Pulmonary rehabilitation for respiratory symptoms
- Cardiac rehabilitation for cardiovascular symptoms
- Cognitive-behavioral therapy for neurological symptoms
- Dietary modifications and medications for gastrointestinal symptoms
- Vaccination to prevent reinfection and further complications
- Symptomatic treatment with medications such as nonsteroidal anti-inflammatory drugs (NSAIDs), corticosteroids, and antihistamines
- Experimental treatments such as antiviral therapy, immunomodulatory therapy, and cell-based therapy

Potential issues and workarounds

PubMed API wasn’t built for high scalability, so this approach wouldn’t be great for a high traffic production system. In that case, you could consider pulling articles into a Haystack DocumentStore that is backed by persistent storage

Alternately, you could experiment with models that has been trained on a medical dataset, such as Gradient’s.

Wrapping it up

Today you learned how to use Mixtral 8x7B and Haystack RAG pipelines to build a medical chatbot. Thanks for reading! If you want to learn more about Haystack 2.0 or RAG Pipelines, these posts might be of interest:

Gemini Models with Google Vertex AI Integration for Haystack

Mon, 18 Dec 2023 00:00:00 +0000

In this article, we will introduce you to the new Google Vertex AI Integration for Haystack. While this integration introduces several new components to the Haystack eco-system (feel free to explore the full integration repo!), we’d like to start by showcasing two components in particular: the VertexAIGeminiGenerator and the VertexAIGeminiChatGenerator, using the gemini-pro and gemini-1.5-flash models.

💚 You can run the example code showcased in this article in the accompanying Notebook

The great news is, to authenticate for access to the Gemini models, you will only need to do a Google authentication in the Colab (instructions in the Colab)

VertexAIGeminiGenerator for Question Answering on Images

The new VertexAIGeminiGenerator component allows you to query Gemini models such as gemini-pro and gemini-1.5-flash. In this example, let’s use the latter, allowing us to also make use of images in our queries.

To get started, you will need to install Haystack and the google-vertex-haystack the integration:

!pip install haystack-ai google-vertex-haystack

Just like any other generator component in Haystack 2.0-Beta, to run the GeminiGenerator on its own, we simply have to call the run() method. However, unlike our other components, the run method here expects parts as input. A Part in the Google Vertex AI API can be anything from a message, to images, or even function calls. Here are the docstrings from the source code for the most up-to-date reference we could find here. Let’s run this component with a simple query 👇

from haystack_integrations.components.generators.google_vertex import VertexAIGeminiGenerator

gemini = VertexAIGeminiGenerator(model="gemini-1.5-flash", project_id='YOUR-GCP-PROJECT-ID')
gemini.run(parts = ["What is the most interesting thing you know?"])

Querying with Images

Next, let’s make use of the flexibility of parts and pass in some images alongside a question too. In the example below, we are providing 4 images containing robots, and asking gemini-1.5-flash what it can tell us about them.

import requests
from haystack.dataclasses.byte_stream import ByteStream

URLS = [
    "https://raw.githubusercontent.com/silvanocerza/robots/main/robot1.jpg",
    "https://raw.githubusercontent.com/silvanocerza/robots/main/robot2.jpg",
    "https://raw.githubusercontent.com/silvanocerza/robots/main/robot3.jpg",
    "https://raw.githubusercontent.com/silvanocerza/robots/main/robot4.jpg"
]
images = [
    ByteStream(data=requests.get(url).content, mime_type="image/jpeg")
    for url in URLS
]

result = gemini.run(parts = ["What can you tell me about this robots?", *images])
for answer in result["replies"]:
    print(answer)

💡 Fun fact: We’ve notices that Gemini consistently misidentifies the 3rd robot! Often the response is: “The third image is of Gort from the 1951 film The Day the Earth Stood Still. Gort is a robot who is sent to Earth to warn humanity about the dangers of nuclear war. He is a powerful and intelligent robot, but he is also compassionate and understanding.”. However, this robot is Marvin the Paranoid Android from the The Hitchhiker’s Guide to the Galaxy series. It would have been pretty silly if Gort looked like that! 😅

VertexAIGeminiChatGenerator for Function Calling

With gemini-pro, we can also start introducing function calling! So let’s see how we can do that. An important feature to note here is that function calling in this context refers to using Gemini to identify how a function should be called. To see what we mean by this, let’s see if we can build a system that can run a get_current_weather function, based on a question asked in natural language.

For this section, we will be using the new VertexAIGeminiChatGenerator component, which can optionally be initialized by providing a list of tools. This will become handy in a moment because we will be able to define functions and provide them to the generator as a list of tools.

For demonstration purposes, we’re simply creating a get_current_weather function that returns an object which will always tell us it’s ‘Sunny, and 21.8 degrees’.. If it’s Celsius, that’s a good day! ☀️

from typing import Annotated

def get_current_weather(
    location: Annotated[str, "The city for which to get the weather, e.g. 'San Francisco'"] = "Munich",
    unit: Annotated[str, "The unit for the temperature, e.g. 'celsius'"] = "celsius",
):
  return {"weather": "sunny", "temperature": 21.8, "unit": unit}

Next, we transform the function into a Haystack Tool object. The description of the parameteres (provided using Annotated) will be included in the schema of the tool.

from haystack.tools import create_tool_from_function

weather_tool = create_tool_from_function(get_current_weather)

We can use this tool with the VertexAIGeminiChatGenerator and ask it to tell us how the function should be called to answer the question “What is the temperature in celsius in Berlin?”:

from haystack_integrations.components.generators.google_vertex import VertexAIGeminiChatGenerator
from haystack.dataclasses import ChatMessage

gemini_chat = VertexAIGeminiChatGenerator(model="gemini-pro", project_id='YOUR-GCP-PROJECT-ID', tools=[weather_tool])

user_message = [ChatMessage.from_user("What is the temperature in celsius in Berlin?")]
replies = gemini_chat.run(messages=user_message)["replies"]
replies

With the response we get from this interaction, we can call the function get_current_weather using the ToolInvoker component and proceed with our chat:

from haystack.components.tools import ToolInvoker

tool_invoker = ToolInvoker(tools=[weather_tool])
tool_messages = tool_invoker.run(messages=replies)["tool_messages"]

messages = user_message + replies + tool_messages

res = gemini_chat.run(messages = messages)
res["replies"][0].text

Building a Full Retrieval-Augmented Generative Pipeline

Alongside the individual use of the new Gemini components above, you can of course also use them in full Haystack pipelines. Here is an example of a RAG pipeline that does question-answering on webpages using the LinkContentFetcher and the VertexAIGeminiGenerator using the gemini-1.5-flash model 👇

As we are working on the full release of Haystack 2.0, components that are currently available in the Beta release are mostly focused on text. So, truly multi-modal applications as full Haystack pipelines is not yet possible. We are creating components that can easily handle other medias like images, audio, and video and will be back with examples soon!

from haystack_integrations.components.generators.google_vertex import VertexAIGeminiGenerator
from haystack.components.fetchers.link_content import LinkContentFetcher
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack import Pipeline

fetcher = LinkContentFetcher()
converter = HTMLToDocument()
document_splitter = DocumentSplitter(split_by="word", split_length=50)
similarity_ranker = TransformersSimilarityRanker(top_k=3)
gemini = VertexAIGeminiGenerator(model="gemini-1.5-flash", project_id=project_id)

prompt_template = """
According to these documents:

{% for doc in documents %}
  {{ doc.content }}
{% endfor %}

Answer the given question: {{question}}
Answer:
"""
prompt_builder = PromptBuilder(template=prompt_template)

pipeline = Pipeline()
pipeline.add_component("fetcher", fetcher)
pipeline.add_component("converter", converter)
pipeline.add_component("splitter", document_splitter)
pipeline.add_component("ranker", similarity_ranker)
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("gemini", gemini)

pipeline.connect("fetcher.streams", "converter.sources")
pipeline.connect("converter.documents", "splitter.documents")
pipeline.connect("splitter.documents", "ranker.documents")
pipeline.connect("ranker.documents", "prompt_builder.documents")
pipeline.connect("prompt_builder.prompt", "gemini")

Once we have the pipeline, we can run it with a query about Haystack 2.0-Beta:

question = "What do graphs have to do with Haystack?"
result = pipeline.run({"prompt_builder": {"question": question},
                   "ranker": {"query": question},
                   "fetcher": {"urls": ["https://haystack.deepset.ai/blog/introducing-haystack-2-beta-and-advent"]}})

for answer in result["gemini"]["replies"]:
  print(answer)

Now you’ve seen some of what Gemini can do, as well as how to integrate it with Haystack 🫶 If you want to learn more:

Check out the Haystack docs or tutorials
Try out the Gemini quickstart colab from Google
Participate in the Advent of Haystack

Use Gradient Models with Haystack Pipelines

Mon, 11 Dec 2023 00:00:00 +0000

One of the more cumbersome aspects of creating LLM applications is model management. Especially in cases where we need to fine-tune, host, and scale the models ourselves. In this case, having options at hand can be great. Today, we’ve expanded the Haystack 2.0 ecosystem with a new integration that can help you with just that

Gradient is an LLM development platform that offers web APIs for fine-tuning, embeddings, and inference on state-of-the-art open-source models. In this article, let’s take a look at the new Gradient integration for Haystack, and how you can use it in your retrieval-augmented generative pipelines.

You can find an example Colab here, which uses embedding and generative models from Gradient for a RAG pipeline on Notion pages.

What the Gradient Integration Provides

The Gradient Integration for Haystack comes with three new components for Haystack pipelines:

The GradientDocumentEmbedder: You can use this component to create embeddings of documents.
The GradientTextEmbedder: You can use this component to create embeddings for text snippets such as queries.
The GradientGenerator: You can use this component to generate responses using LLMs.

How to use the Gradient Integration

The Gradient platform provides an embeddings endpoint (at the time of writing, it supports bge-large) and the fine-tuning and deployment of LLMs such as Llama-2, Bloom, and more (with more planned).

To fine-tune and use models via Gradient, the first step is to create a workspace.

Create a workspace

Once you have a workspace, you will be able to select a base model and start a fine-tuning job.

Fine-tune a model

Using the GradientGenerator

You can use the GradientGenerator either with any of the available base models that Gradient provides, or with a model that you have fine-tuned on the platform.

For example, to use the llama2-7b-chat model:

os.environ["GRADIENT_ACCESS_TOKEN"] = "YOUR_GRADIENT_ACCESS_TOKEN"
os.environ["GRADIENT_WORKSPACE_ID"] = "YOUR_WORKSPACE_ID"

generator = GradientGenerator(base_model_slug="llama2-7b-chat",
                              max_generated_token_count=350)

Or, to use a model that you’ve fine-tuned on the Gradient platform, provide your model_adapter_id

os.environ["GRADIENT_ACCESS_TOKEN"] = "YOUR_GRADIENT_ACCESS_TOKEN"
os.environ["GRADIENT_WORKSPACE_ID"] = "YOUR_WORKSPACE_ID"

generator = GradientGenerator(model_adapter_id="your_finetuned_model_adapter_id",
                              max_generated_token_count=350)

Building a RAG pipeline

In this article, I’ve provided an example Colab that does question-answering on your private Notion pages, using the NotionExporter integration. We use the GradientDocumentEmbedder component to create embeddings of our notion pages and index them into an InMemoryDocumentStore. As for the RAG pipeline, you can use the GradientTextEmbedder and GradientGenerator to:

Embed the user query to retrieve the most relevant documents from our Notion pages
Generate a response using our own fine-tuned LLM from Gradient:

import os
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.embedders.gradient import GradientTextEmbedder
from haystack_integrations.components.generators.gradient import GradientGenerator

prompt = """ Answer the query, based on the
content in the documents.

Documents:
{% for doc in documents %}
  {{doc.content}}
{% endfor %}

Query: {{query}}
"""
os.environ["GRADIENT_ACCESS_TOKEN"] = "YOUR_GRADIENT_ACCESS_TOKEN"
os.environ["GRADIENT_WORKSPACE_ID"] = "YOUR_WORKSPACE_ID"

text_embedder = GradientTextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
prompt_builder = PromptBuilder(template=prompt)
generator = GradientGenerator(model_adapter_id="your_finetuned_model_adapter_id",
                              max_generated_token_count=350)

rag_pipeline = Pipeline()

rag_pipeline.add_component(instance=text_embedder, name="text_embedder")
rag_pipeline.add_component(instance=retriever, name="retriever")
rag_pipeline.add_component(instance=prompt_builder, name="prompt_builder")
rag_pipeline.add_component(instance=generator, name="generator")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "generator")

To run this pipeline:

question = "What are the steps for creating a custom component?"
rag_pipeline.run(data={"text_embedder":{"text": question},
                                "prompt_builder":{"query": question}})

Introducing Haystack 2.0-Beta and Advent of Haystack

Mon, 04 Dec 2023 00:00:00 +0000

Today, we are really happy to announce that we have released Haystack 2.0-Beta, alongside our first-ever Advent of Haystack: a set of 10 challenges that we will publish throughout the month of December, each introducing the features and design of Haystack 2.0-Beta.

What does this release mean for me?

Since the first day we started building Haystack 2.0, we’ve involved our community with our design decisions and the feedback we got on our proposals on GitHub and via our Discord community proved to be incredibly valuable. While this is not yet the full stable release of Haystack 2.0, we want to make this first official commitment to the new design available for you to test and truly experience how Haystack is improving. We are committed to redesigning our LLM framework, and we need your help to shape it. To participate, complete and submit a challenge, with any feedback you would like to give us about your experience.

So, should you migrate your Haystack pipelines yet? No. This is not yet a stable release of Haystack 2.0, and by the time 2.0.0 is released, there will be a set of migration guides to help you with the task. The good news is Haystack 2.0-Beta is coming to you in a brand new package haystack-ai, so you do not have to change anything with your already existing Haystack deployments that depend on farm-haystack. We will keep up the maintenance of Haystack 1.x (currently on 1.22.1) which will continue to be released under farm-haystack.

But should you use Haystack 2.0-Beta? Yes! Because Haystack 2.0-Beta introduces some major improvements to how you will build and customize production-ready LLM applications. And the best way to do so is by participating in Advent of Haystack and checking out the 2.0-Beta documentation.

To get started: pip install haystack-ai 🎉

Haystack 2.0-Beta does not have feature parity with Haystack 1.x yet. The stable release of Haystack 2.0 will happen once a higher level of feature parity exists. To see a full list of available features in this beta release, check out our release notes.

Over the last few months, we’ve also been working with some of our partners and community members on extending the Haystack integrations with integrations compatible with Haystack 2.0. Today, you can also start using those with this Beta release.

Why is Haystack Changing

When the first proper release of Haystack came about in 2020, a lot of the design centered around retrieval, embedding creation, indexing, semantic search and extractive QA. As you may have seen, that completely pivoted in the past (nearly two) years. With the rise of LLMs, we want to build more applications that make use of retrieval-augmented generation (RAG), agents and the ever-expanding LLM capabilities.

While the Haystack 1.x design - and especially the pipeline architecture - was fit for these use cases, the developer experience was sometimes unintuitive and difficult to customize, especially when diverting from the standard semantic search use cases.. This was mainly due to some major assumptions in the design of 1.x. For example, if you’ve used Haystack you know that a pipeline has to start with either a “Query” or a “File” input, even if at the end of the day you don’t want to use them. Haystack 1.x was still able to be the basis on top of which many LLM applications have been built with the addition of the PromptNode in Haystack 1.12.1. However, the design of our framework often made it difficult to navigate the vast optionality that it actually provides, with many model providers and capabilities being seemingly “hidden” in larger concepts such as the PromptNode, or even the EmbeddingRetriever.

With Haystack 2.0, our aim is to be a lot more explicit about what each component does, and (to possibly exploit an overused term) to make the code self-explanatory, on top of making the Pipeline and Component architecture a lot more flexible, extendable, customizable while maintaining our high standards in terms of production readiness.

How is Haystack Changing

Although Haystack 2.0 is a complete overhaul of the framework, it remains rooted in the fundamental abstractions that contributed to its prior success: users can continue using pipelines, document stores and nodes, which have now been rebranded as “components”.

Here we will briefly highlight what some of the major changes are, with particular regard to developer experience. To see where we are in terms of feature parity with Haystack 1.x, check out the table in our release notes.

Components

Nodes have always been considered the building blocks of Haystack pipelines, but that analogy isn’t completely true. If somebody is given building blocks, they assume they can arrange them however they want, but that’s not the case with nodes: for example, you can only put a Node of type Document Store at the very end of a pipeline. If you wonder why this happens, imagine building something with Legos, just the bricks all have different stubs and tubes sizes, and you have to try which one can possibly fit the other.

We decided to change the name Node to Component in Haystack 2.0 to stress how different the new design is. Components are still the building blocks of a Pipeline, but this time the analogy is for real:

Every Component declares its input and output types, making clear to which other Component can be connected. Connections can therefore be validated already while building a Pipeline.
Every Component is self contained and fulfils a specific purpose: we like to say that a good Component should only have one job.
Every Component must respect a strict contract to be used within a Pipeline, but everything else is left to the developer to provide maximum flexibility. You can stretch this flexibility up to the point of running a single Component “standalone”, without the need of a Pipeline.

For example, below is a component that takes in a query and returns a list of documents as documents.

from typing import List
from haystack import component, Document  
  
@component  
class MyCustomComponent():  
    
  @component.output_types(documents=List[Document])  
  def run(self, query: str):  
    # do something  
    return {'documents': docs}

For a full example, you can check out a custom component Tuana built that fetches the latest Hacker News articles.

You can find the complete documentation of Haystack 2.0-Beta components here.

Pipelines

Pipelines are a core offering of Haystack, and that’s not changing. What’s different is what a Pipeline can do and how it can be assembled. In fact, the first challenge in Advent of Haystack exposes you to the most obvious changes.

Connections

Connecting pipeline components is becoming a lot more flexible. We’re moving away from having a rigid contract requiring “Query” or “File” as the first input, and every component can be connected to any other compatible one (or even more than one), being explicit about which output is being connected to which input. As a result, the final input and output of a Pipeline run will be solely determined by how components are laid out.

In Haystack 1.x

from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_node(component=my_component_1, name="My_Component_1", inputs=["Query"])
pipeline.add_node(component=my_component_2, name="My_Component_2", inputs=["My_Component_1"])

In Haystack 2.0-Beta

from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_component(instance=my_component_1, name="my_component_1")
pipeline.add_component(instance=my_component_2, name="my_component_2")

pipeline.connect("my_component_1.output_1", "my_component_2.input_4")
pipeline.connect("my_component_1.output_3", "my_component_2.input_1")

Directed (Multi)Graphs (out with the acyclical)

Haystack 1.x pipelines are implemented as directed acyclic graphs (DAGs). Massi’s analogy of it is that using a pipeline in Haystack 1.x is like going down a water slide. A very powerful architecture that can branch out from a platform and join up at the main pool, building pipelines in Haystack 1.x is a matter of stringing together the Nodes that you require to build out the NLP application you’re aiming for. But it’s always one directional with a clear start and end.

In Haystack 2.0, we are getting rid of the A in DAG (acyclic), meaning we can have pipelines that can branch out, join, and also cycle back to another component. This allows us to set the Haystack framework up for pipelines that can retry, loop back, and potentially even run forever as a service. One of the first things we tried to build was a “retry” Component that allows a Pipeline to loop back if it deems an output to not be good enough.

These new pipelines are also technically multigraphs, meaning a single component with multiple outputs can connect to another single component with multiple inputs.

Serialization

Serialization means converting a pipeline to a format that you can save on your disk or send over the wire to load it later. In both Haystack 1.x and in Haystack 2.0-beta we use YAML for this, although we will be extending the support to other formats for Haystack 2.0.

However, one of the major changes to Haystack in terms of serialization is the addition of “Marshallers” to the core project. Marshallers are how we make available different serialization formats to pipelines, and they can be used to add any format that Haystack doesn’t support out of the box. For example, if you want to represent pipelines with TOML you can create a TOML Marshaller that you can pass to the serialization API.

You can find our complete documentation on Haystack 2.0-Beta Pipelines here. You can find our complete documentation on Haystack 2.0-Beta Pipeline Serialization here

Prompt Templating

One other change we are excited about is how prompt templating is changing in Haystack 2.0, which is already available in this Beta release. We are now using Jinja templating for prompts, making it very clean and readable to build prompts that have loops and that can even make use of functions inside the prompt. For example, below is a simple prompt template in Haystack 2.0 which loops through documents and also adds some meta information from those documents into the prompt.

from haystack.builders import PromptBuilder

prompt_template = """ Answer the question based on the context. Refer to the URL
in the generated answer.
Context:  
{% for doc in documents %}  
  {{doc.content}}  
  URL: {{article.meta['url']}}  
{% endfor %}  
Question: {{question}}
"""  
  
prompt_builder = PromptBuilder(template=prompt_template)

Join us in testing Haystack 2.0-Beta

The release of Haystack 2.0-Beta marks a significant milestone in the evolution of the Haystack framework. This announcement comes with an added bonus - the inaugural Advent of Haystack, featuring 10 challenges throughout December, offering you a hands-on opportunity to explore the features and design of the latest release. The development of Haystack 2.0 has been a collaborative effort with the community and while not yet the stable release, this Beta version invites you to test and engage in shaping the future of Haystack.

Cheers to the evolution of Haystack and the exciting developments ahead! 🎉

Using Generative AI to Query Large BI Tables: Our Findings

Wed, 29 Nov 2023 00:00:00 +0000

In organizations with large collections of data in tabular form, it’s the job of a data analyst to make sense of that data by extracting, transforming, and building stories around it. An analyst’s primary tool for accessing the data is SQL. Given the impressive capabilities of large language models (LLMs), it’s natural to wonder if AI can help us translate our information needs into well-formed SQL queries.

Granted, most LLMs can output at least some SQL queries based on natural language input. But can they handle large, multi-table databases in a real-world setting? In other words, can they do the job of a data analyst? A small team here at deepset set out to answer that question. Over the course of three months, we tried to find the best way to generate SQL queries on a real dataset.

Their findings in a nutshell: business intelligence (BI) is hard, and so is evaluating BI use cases. Current approaches that claim high accuracy probably don’t tell the whole story, and GPT-4 is one of the best models available for generating SQL queries. If you want to learn more about our findings – and find out which of the approaches we evaluated performed the best – then this article is for you.

We’re also releasing our benchmark: the code used to prompt and evaluate the LLMs, as well as the dataset we created for this project. We hope you find them useful 🙂!

Text-to-SQL is different from Table QA

At deepset, we have long been interested in the task of using language models to answer questions based on tabular data. Most recently, we wrote about our project with Airbus, which involved retrieving tables from pilot manuals and extracting answers from them – sometimes even using simple aggregation functions such as finding the maximum or average value of a column.

BI use cases, however, operate on a completely different scale. A query language like SQL can execute much more powerful operations on much larger tables; for example, it can perform queries across multiple tables, join them, and reorder them according to complex criteria. To distinguish this problem from the simpler (and mostly solved) task of Table QA, we called it “Business Intelligence Question Answering,” or BIQA for short. It’s also often referred to as “text-to-SQL”.

The BI use case is difficult, even for humans. This is because of the inherent ambiguity of natural language, which becomes particularly apparent when we try to map it to a formal language such as SQL. As a result, a question in natural language can have multiple interpretations in SQL. Let’s say we want to find out how many developers use Python in their day-to-day work. When we formulate this as an SQL query, we have to be specific: who exactly counts as a developer? How do we define “day-to-day”?

High standards for data quality can go a long way toward simplifying the ambiguity problem: aptly named columns, thorough schema descriptions, and logically organized tables all contribute to tabular databases that are easier to manage, for both humans and machines.

The project

The goal of our project was to find the best solution for converting natural language text into valid SQL queries that produce the correct result for the user. The generated query is returned along with the result so that users with some knowledge of SQL can verify that it is correct.

The ability to query large databases in natural language would save people a lot of time. The LLM approach also means that you can make typos, query in languages other than English, and even integrate the SQL capability into a more complex workflow – it could be used as a tool for an agent, for example.

Existing approaches

When we started, it seemed like we had a wide range of approaches to choose from. Over the past few months, many teams have tried to use the natural language understanding (NLU) capabilities of LLMs to generate SQL queries.

But when we looked more closely at their results, we found that they were lacking: Often, the best-performing model only solved a very specific problem at hand and wasn’t able to generalize to other BI use cases. This also meant that models performed very differently across benchmarks. We decided to create our own dataset and benchmark because we wanted data points that best reflected a common, real-world use case – and to ensure that the evaluation set consisted of data that the LLMs hadn’t seen before.

The dataset

For our dataset, we used the latest Developer Survey by StackOverflow. Every year, the platform polls developers and community members at large about their use of programming languages and other tools. The survey results are displayed as interactive charts on their website. In addition, StackOverflow publishes the entire dataset of questions and answers. For our dataset, we reverse-engineered StackOverflow’s SQL queries and ended up with about 120 query-answer pairs.

To make the database more like a real production environment, we normalized the database schema. For example, we split tables whose columns allowed multiple values at once into separate, related tables. Each data point consists of a natural language query and a corresponding label, the SQL equivalent of the query.

The conundrum of evaluating text to SQL

In our deep dive into existing approaches, we also found that even when they reported high-accuracy results, they were often based on questionable evaluation methods. Due to the ambiguity of natural language described above, the evaluation of text to SQL is still an unsolved problem with no standardized approaches. In our opinion, it always requires at least a manual component. This is an uncomfortable truth that is often glossed over by research teams - perhaps because they don’t want to talk about the fact that they only used a tiny evaluation dataset to make up for the time-consuming practice of manually evaluating the results.

After trying several methods, we settled on a semi-automated evaluation procedure. Our evaluation script pre-processed the output of the LLM, classifying each result as either correct (if it fully matched the ground truth in our evaluation dataset), incorrect (if the result was None), or requiring manual evaluation. We then manually processed this last set of results to verify that the result was indeed incorrect, or that the model had simply chosen a different SQL query that still produced the same result.

Obviously, this approach is very time-consuming. That’s why, over time, we added several accepted queries to our benchmark dataset. This allowed us to iterate faster because we spent less time manually evaluating.

Summarizing our approaches

We tried a number of different approaches that allowed us to explore the full range of NLP pipelines with LLMs: agents, retrieval augmentation, fine-tuning, and more. Here we summarize our results. At the end of the section, we provide an overview of all the approaches and their performance.

Schema-aware prompting

Approach: In our baseline approach, we included the schema of our database in the prompt to the LLM. The schema describes the layout of the database: table names and descriptions, along with column names. This approach gives the LLM a lot of context.

We also experimented with including rows from the database in the prompt to help the model better parse ambiguous column names and to give it more information about what to expect from a given cell. For instance, in the example below, we can see that HTML/CSS is counted as one language in the Languages table. This way, the model understands that the two are always combined, and can adjust its queries to the database accordingly.

Observations: Schema-aware prompting produces very long prompts. We have 66 tables in our database, and including detailed information about each of them resulted in a prompt with about 3000 tokens. That’s not only expensive, it also means that the model has to process more information at once, which can negatively affect its output. The simple combination of schema-aware prompting and GPT-4 got us to about 48% accuracy. The model errors ranged from invalid SQL queries that didn’t match the actual schema, to more subtle errors where the SQL query returned a result that wasn’t the correct answer to the question. Including sample rows in the prompt certainly improved performance.

Approach: An agent is an LLM that can act autonomously, given a set of tools. For example, it can iterate on the output of another LLM before returning it to the user. In our approach, we instruct the model to navigate through the database using increasingly complex SQL queries. This means that we don’t have to provide the database schema ourselves, but can let the model figure it out on a need-to-know basis. In fact, we asked the agent to begin its exploration by retrieving information about all the tables in the database. From there, it had to figure out the rest of the strategy on its own.

Observations: The agent approach allowed us to have the LLM fix invalid queries. One benefit was that the queries were always ultimately valid SQL. Here’s an example of the agent-driven approach in action (there’s a breakdown of the individual steps after the screenshot):

# Question: How many respondents consider themselves to be “definitely” or “somewhat” a part of the Stack Overflow community by age in %?

{“query”: “PRAGMA table_info(Responses);”, “show_user”: false}

# First up the agent gets the info for the Responses table.

{“query”: “SELECT DISTINCT SOComm FROM Responses;”, “show_user”: false}

# It figures out that it needs to look at the SOComm column, specifically the distinct values.

{“query”: “SELECT DISTINCT Age FROM Responses;”, “show_user”: false}

# It also checks the Age column.

{“query”: “SELECT Age, COUNT(*) as Total, SUM(CASE WHEN SOComm IN (‘Yes, definitely’, ‘Yes, somewhat’) THEN 1 ELSE 0 END) as PartOfSOCommunity FROM Responses GROUP BY Age;”, “show_user”: false}

# Performs the necessary operation but still has show_user: false.

{“query”: “SELECT Age, ROUND((SUM(CASE WHEN SOComm IN (‘Yes, definitely’, ‘Yes, somewhat’) THEN 1 ELSE 0 END) * 100.0) / COUNT(*), 2) as Percentage FROM Responses GROUP BY Age;”, “show_user”: true}

# Rounds out the percentage answer and sets show_user: true to end the generation.

While this approach allowed us to initially enter shorter prompts, the agent’s iterative method meant that the historical data was passed along with each new step the agent took: the prompt grew. So we’re faced with the same problem of a large prompt that’s hard for the model to manage and results in increasingly expensive API calls. There’s also a problem with the black-box nature of the agent approach, which sometimes resulted in errors that were harder to understand than with schema-aware prompting.

Include raw schema descriptions

Approach: In this approach, we provided not only the schema generated by the database, but also StackOverflow’s descriptions of what the columns mean. We prepended these raw descriptions in the prompt before any table schema descriptions. These hand-written descriptions contain valuable information, such as the questions used to elicit the answers. Therefore, they are expected to clarify any ambiguities in the database schema. For example, in the following screenshot, we see that the YearsCode column is defined as years of programming experience “including any education”.

Observations: Adding the descriptions for each column made the prompt even longer (~7k tokens). It improved accuracy by 10 percentage points to 58%.

Including column-level descriptions and distinct values

Approach: In this approach, we sought to enrich the column-level information available to the LLM by including descriptions for each column as part of the schema. We also included up to 20 distinct values that a column could contain along with the descriptions. This resulted in a query of ~6.5k tokens.

Observations: This further improved the results, bringing the accuracy up to 65%.

Including few-shot examples and custom instructions

Approach: After analyzing some of the problems the model was encountering, we tried modifying the prompt and adding few-shot examples to mitigate the error. A common mistake the LLM made was counting people repeatedly if they appeared in multiple rows. Instead, we want it to add a DISTINCT clause to the COUNT() function to remove the repeated appearance of the same data.

Therefore, we added the following instruction to the prompt:

Also, for percentage calculations over responses use either the main table “responses” accounting for nulls or distinct values from the Associative table. Not full counts from the Associative Tables.

In addition, we included some examples to show the LLM the desired behavior.

It could be argued that this goes beyond prompt engineering for the database or domain and comes close to overfitting the evaluation set. Nevertheless, it is a demonstration of the LLM’s responsiveness to these changes.

Observations: The model responded to the instruction and examples and mostly stopped making the observed mistakes. This brought the accuracy to 70%.

Retrieval augmentation

Approach: The model doesn’t need to know all the tables in the database to answer a question. For example, most of the questions in the evaluation set could be answered using five or fewer of the 66 tables. A retrieval module could retrieve the right tables, which would shorten the prompt and help the model focus on only relevant information. It’s important to note, however, that no retriever tuned for the specific task of matching a natural language query to a table schema was available. We used deepset/all-mpnet-base-v2, which we had seen perform well on previous table retrieval tasks.

Observations: The retrieval component didn’t work very well. The embedding model simply wasn’t able to select the relevant tables per query. The graph below plots the number of tables retrieved on the x-axis against the recall (i.e., the percentage of tables retrieved correctly) on the y-axis. It shows that even with an increased top_k value, the retrieval component didn’t manage to identify all the relevant tables. Of course, this results in the LLM not seeing the right context to generate its SQL queries in the first place.

Column-level retrieval (i.e., embedding columns separately and returning the whole table if a column has a high value) performed better than table-level retrieval. This is probably because it’s easier to match queries to individual columns, since a table with many columns that are irrelevant to the query can drown out the signal from a single column. Also, we’re dealing with a setup where a specific table (“Responses”) with a large number of columns should almost always be retrieved.

We also tried Perfect Retrieval (passing only the schemas for relevant tables) to determine the upper bound for the retrieval-based approach. Much to our surprise, it performed worse than just passing all tables (60% vs. 65.8%), due to an increased number of errors made by the LLM.

The poor performance of the RAG approach is partly due to the unbalanced dataset, where the number of columns varies greatly between tables. We could try to improve it (e.g. by switching to a better retrieval model or reformulating the setup), but given the underperformance of even Perfect Retrieval on this dataset, this exploration would be better done on a different dataset/setup.

Other models

Approach: Besides GPT-4, we also tried many open weight models: Starcoderbase, New Hope, Codegen 2.5, sqlcoder models, etc.

Observations: Among them, the recently released sqlcoder-34b-alpha (loaded in 8bit) gave the most promising results (with 35.8%), but fell short of GPT-4. The main problem we encountered was that it often hallucinated non-existent columns, e.g. work_exp, response_op_sys_professional_use.

Fine-tuning an existing OSS model

Approach: We can use a smaller open source LLM and fine-tune it on a suitable dataset. For that purpose, we fine-tuned a number of models on an EC2 instance using the resource-efficient QLoRA approach and a subset of the Spider dataset. These models included Llama 2 70B and, in particular, all code-specific models available at the time (e.g., Starcoderbase, New Hope, Codegen 2.5).

Observations: Unfortunately, all of our OSS approaches failed, most likely because the dataset we used for fine-tuning wasn’t up to the task. The Spider dataset is academically oriented and doesn’t really capture the real-world business use cases that we were interested in (and that our evaluation dataset reflects quite accurately). The performance scores were correspondingly low, around 10%.

An improved schema + evaluation results

From our qualitative analysis of model predictions and error patterns, we noticed that many table or column names were difficult to map back to the original question. This seemed to be confirmed by the better performance of the LLM when it was given descriptions of the columns in addition to the schema.

The initial task was therefore akin to giving a new business analyst a tangled database with dozens of tables and an arcane schema without any documentation or help, and then expecting them to come up with the right answers to our questions. We felt we hadn’t given the model a fair chance to do the job.

Good data hygiene is essential for BI, so we decided to sit down and change the schema, structure the tables in a way that was easier to understand, and change the column naming to be self-explanatory. For example, we renamed a column from LearnCode to LearningToCodeMethods, which greatly reduced the ambiguity of the initial column name.

All of the evaluation results we report in this article are based on the new schema, which has significantly improved the performance of various approaches across the board.

Approach	Overall % Correct
Agent-driven	40.8
Schema-aware	48.3
Schema-aware + raw descriptions	58.3
Schema-aware + column-level descriptions + distinct-20	65.8
Schema-aware + perfect retrieval + distinct-20	60.0
Schema-aware + column-level descriptions + distinct-20 + few-shot	70.0
Schema-aware + perfect retrieval (sqlcoder-34b-alpha @ 8bit)	35.8

Challenges and takeaways

Creating a general-purpose, LLM-driven solution for business intelligence in SQL is still a largely unsolved problem - which makes it all the more interesting to explore further. The main challenges relate to the following two factors:

Dataset creation and evaluation is very time-consuming

Creating a dataset for evaluating text-to-SQL approaches is much harder than for, say, extractive question answering. Essentially, you have to write a piece of code (the SQL query) for each data point. That query needs to be tested and debugged before it can go into the dataset.

And it’s not just the creation of the dataset - the evaluation itself is time-consuming because, as we described earlier, it cannot be fully automated. Text-to-SQL evaluation, as we can do it now, isn’t exactly scalable. Projects that want an accurate evaluation of their methods need to take into account that it will take a lot of their time.

The ambiguity of language and databases

As we’ve seen, natural language queries can be quite ambiguous, making them difficult to translate into SQL queries. On the other hand, there is often more than one correct way to extract a piece of information from a database, and our evaluation dataset may not consider all of them. This brings us back to the fact that we need to at least partially evaluate the results of the model by hand after each run.

Another factor that comes into play here is that databases themselves can be ambiguous - even more so if the schema isn’t particularly well designed. That’s why it helps to have someone in the loop who knows the database well - it will help you to better evaluate the generated queries. You can even use the model’s output as a sort of check on your naming practices - if a model keeps misusing a column, for example, that might be a sign that it’s time to rename it or update its description.

Demo

In addition to our dataset and benchmarks, we also published a demo of our best-performing text-to-SQL approach to share with colleagues, friends, and now you, our readers. It serves to validate whether our accuracy results are representative of a real-world scenario. We found that user feedback was quite close to our results. The demo is hosted on deepset’s enterprise platform, deepset Cloud, which creates an out-of-the-box user interface. Here’s what it looks like:

Let’s try to recreate a question from the developer survey:

How are experienced developers vs newbies planning to use AI tools in the future? Please provide the numbers.

The result is a table followed by the generated query itself. Let’s have a look at the query first:

Note that the LLM has decided to come up with its own definition of who counts as an “experienced developer,” namely someone who has been coding for more than five years. This is an example of how the model has to compensate for the ambiguity in our natural language query. Here is a snapshot of the table we get as a result:

We can now reuse this generated content by copying the table into a csv file (or other structured file format). We can even visualize it using tools such as the Python libraries pandas and matplotlib:

Now we can see that even though the class of newbies (as defined by the LLM) is much larger than the cohort of experienced developers, the expectations of both groups from AI tools seem to be very similar.

This is just a glimpse of what our text-to-SQL solution can do. Feel free to play around with the demo and give us your feedback!

Conclusion

The executive’s dream of simply asking AI for any insight or dashboard without going through a business analyst won’t come true anytime soon. The approaches we have tested are still too fragile to blindly trust their results for business-critical decisions.

However, in the same way that Copilot increases developer productivity, an AI companion for BI could significantly increase the productivity of any business analyst. They could use the LLM to generate an initial query, review the generated SQL and output, and only edit if necessary. Just from using the tool ourselves, we’re sure it would boost analyst productivity in a similar way to how code assistants have impacted developers.

It’s really hard to create a general-purpose SQL model – most success stories involve very specific use cases. For now, you can create specialized models with better fine-tuning datasets for specific use cases only. But keep in mind that creating datasets is time consuming.

A robust evaluation that still allows you to move quickly is essential to making progress. If your organization has good data management practices in place, you’re likely to be more successful in your text-to-SQL efforts.

Interested in moving beyond the AI hype and using LLMs where they can add real value? Check out our Haystack repo on GitHub and join us on Discord!

RAG Pipelines From Scratch

Tue, 21 Nov 2023 00:00:00 +0000

Retrieval Augmented Generation (RAG) is quickly becoming an essential technique to make LLMs more reliable and effective at answering any question, regardless of how specific. To stay relevant in today’s NLP landscape, Haystack must enable it.

Let’s see how to build such applications with Haystack 2.0, from a direct call to an LLM to a fully-fledged, production-ready RAG pipeline that scales. At the end of this post, we will have an application that can answer questions about world countries based on data stored in a private database. At that point, the knowledge of the LLM will be only limited by the content of our data store, and all of this can be accomplished without fine-tuning language models.

💡 I recently gave a talk about RAG applications in Haystack 2.0, so if you prefer videos to blog posts, you can find the recording here. Keep in mind that the code might be slightly outdated.

What is RAG?

The idea of Retrieval Augmented Generation was first defined in a paper by Meta in 2020. It was designed to solve a few of the inherent limitations of seq2seq models (language models that, given a sentence, can finish writing it for you), such as:

Their internal knowledge, as vast as it may be, will always be limited and at least slightly out of date.
They work best on generic topics rather than niche and specific areas unless they’re fine-tuned on purpose, which is a costly and slow process.
All models, even those with subject-matter expertise, tend to “hallucinate”: they confidently produce false statements backed by apparently solid reasoning.
They cannot reliably cite their sources or tell where their knowledge comes from, which makes fact-checking their replies nontrivial.

RAG solves these issues of “grounding” the LLM to reality by providing some relevant, up-to-date, and trusted information to the model together with the question. In this way, the LLM doesn’t need to draw information from its internal knowledge, but it can base its replies on the snippets provided by the user.

A visual representation of RAG from the original paper

As you can see in the image above (taken directly from the original paper), a system such as RAG is made of two parts: one that finds text snippets that are relevant to the question asked by the user and a generative model, usually an LLM, that rephrases the snippets into a coherent answer for the question.

Let’s build one of these with Haystack 2.0!

💡 Do you want to see this code in action? Check out the Colab notebook here.

⚠️ Warning: This code was tested on haystack-ai==2.0.0b5. Haystack 2.0 is still unstable, so later versions might introduce breaking changes without notice until Haystack 2.0 is officially released. The concepts and components however stay the same.

Generators: Haystack’s LLM components

As every NLP framework that deserves its name, Haystack supports LLMs in different ways. The easiest way to query an LLM in Haystack 2.0 is through a Generator component: depending on which LLM and how you intend to query it (chat, text completion, etc…), you should pick the appropriate class.

We’re going to use gpt-3.5-turbo (the model behind ChatGPT) for these examples, so the component we need is OpenAIGenerator. Here is all the code required to use it to query OpenAI’s gpt-3.5-turbo :

from haystack.components.generators import OpenAIGenerator

generator = OpenAIGenerator(api_key=api_key)
generator.run(prompt="What's the official language of France?")
# returns {"replies": ['The official language of France is French.']}

You can select your favorite OpenAI model by specifying a model at initialization, for example, gpt-4. It also supports setting an api_base_url for private deployments, a streaming_callback if you want to see the output generated live in the terminal, and optional kwargs to let you pass whatever other parameter the model understands, such as the number of answers (n), the temperature (temperature), etc.

Note that in this case, we’re passing the API key to the component’s constructor. This is unnecessary: OpenAIGenerator can read the value from the OPENAI_API_KEY environment variable and also from the api_key module variable of openai’s SDK.

Right now, Haystack supports HuggingFace models through the HuggingFaceLocalGenerator and HuggingFaceTGIGenerator components, and many more LLMs are coming soon.

PromptBuilder: structured prompts from templates

Let’s imagine that our LLM-powered application also comes with some pre-defined questions that the user can select instead of typing in full. For example, instead of asking them to type What's the official language of France?, we let them select Tell me the official languages from a list, and they simply need to type “France” (or “Wakanda” for a change - our chatbot needs some challenges too).

In this scenario, we have two pieces of the prompt: a variable (the country name, like “France”) and a prompt template, which in this case is "What's the official language of {{ country }}?"

Haystack offers a component that can render variables into prompt templates: it’s called PromptBuilder. As the generators we’ve seen before, also PromptBuilder is nearly trivial to initialize and use.

from haystack.components.builders.prompt_builder import PromptBuilder

prompt_builder = PromptBuilder(template="What's the official language of {{ country }}?")
prompt_builder.run(country="France")
# returns {'prompt': "What's the official language of France?"}

Note how we defined a variable, country, by wrapping its name in double curly brackets. PromptBuilder lets you define any input variable that way: if the prompt template was "What's the official language of {{ nation }}?", the run() method of PromptBuilder would have expected a nation input.

This syntax comes from Jinja2, a popular templating library for Python. If you have ever used Flask, Django, or Ansible, you will feel at home with PromptBuilder. Instead, if you never heard of any of these libraries, you can check out the syntax on Jinja’s documentation. Jinja has a powerful templating language and offers way more features than you’ll ever need in prompt templates, ranging from simple if statements and for loops to object access through dot notation, nesting of templates, variables manipulation, macros, full-fledged import and encapsulation of templates, and more.

A Simple Generative Pipeline

With these two components, we can assemble a minimal pipeline to see how they work together. Connecting them is trivial: PromptBuilder generates a prompt output, and OpenAIGenerator expects an input with the same name and type.

from haystack import Pipeline
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder

pipe = Pipeline()
pipe.add_component("prompt_builder", PromptBuilder(template="What's the official language of {{ country }}?"))
pipe.add_component("llm", OpenAIGenerator(api_key=api_key))
pipe.connect("prompt_builder", "llm")

pipe.run({"prompt_builder": {"country": "France"}})
# returns {"llm": {"replies": ['The official language of France is French.'] }}

Here is the pipeline graph:

Make the LLM cheat

Building the Generative part of a RAG application was very simple! So far, we only provided the question to the LLM, but no information to base its answers on. Nowadays, LLMs possess a lot of general knowledge, so questions about famous countries such as France or Germany are easy for them to reply to correctly. However, when using an app about world countries, some users may be interested in knowing more about obscure or defunct microstates that don’t exist anymore. In this case, ChatGPT is unlikely to provide the correct answer without any help.

For example, let’s ask our pipeline something really obscure.

pipe.run({"prompt_builder": {"country": "the Republic of Rose Island"}})
# returns {
#     "llm": {
#         "replies": [
#             'The official language of the Republic of Rose Island was Italian.'
#         ]
#     }
# }

The answer is an educated guess but is not accurate: although it was located just outside of Italy’s territorial waters, according to Wikipedia the official language of this short-lived micronation was Esperanto.

How can we get ChatGPT to reply to such a question correctly? One way is to make it “cheat” by providing the answer as part of the question. In fact, PromptBuilder is designed to serve precisely this use case.

Here is our new, more advanced prompt:

Given the following information, answer the question.
Context: {{ context }}
Question: {{ question }}

Let’s build a new pipeline using this prompt!

context_template = """
Given the following information, answer the question.
Context: {{ context }}
Question: {{ question }}
"""
language_template = "What's the official language of {{ country }}?"

pipe = Pipeline()
pipe.add_component("context_prompt", PromptBuilder(template=context_template))
pipe.add_component("language_prompt", PromptBuilder(template=language_template))
pipe.add_component("llm", OpenAIGenerator(api_key=api_key))
pipe.connect("language_prompt", "context_prompt.question")
pipe.connect("context_prompt", "llm")

pipe.run({
    "context_prompt": {"context": "Rose Island had its own government, currency, post office, and commercial establishments, and the official language was Esperanto."}
    "language_prompt": {"country": "the Republic of Rose Island"}
})
# returns {
#     "llm": {
#         "replies": [
#             'The official language of the Republic of Rose Island is Esperanto.'
#         ]
#     }
# }

Let’s look at the graph of our Pipeline:

The beauty of PromptBuilder lies in its flexibility. It allows users to chain instances together to assemble complex prompts from simpler schemas: for example, we used the output of the first PromptBuilder as the value of question in the second prompt.

However, in this specific scenario, we can build a simpler system by merging the two prompts into one.

Given the following information, answer the question.
Context: {{ context }}
Question: What's the official language of {{ country }}?

Using this new prompt, the resulting pipeline becomes again very similar to our first.

template = """
Given the following information, answer the question.
Context: {{ context }}
Question: What's the official language of {{ country }}?
"""
pipe = Pipeline()
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", OpenAIGenerator(api_key=api_key))
pipe.connect("prompt_builder", "llm")

pipe.run({
    "prompt_builder": {
        "context": "Rose Island had its own government, currency, post office, and commercial establishments, and the official language was Esperanto.",
        "country": "the Republic of Rose Island"
    }
})
# returns {
#     "llm": {
#         "replies": [
#             'The official language of the Republic of Rose Island is Esperanto.'
#         ]
#     }
# }

Retrieving the context

For now, we’ve been playing with prompts, but the fundamental question remains unanswered: where do we get the correct text snippet for the question the user is asking? We can’t expect such information as part of the input: we need our system to be able to fetch this information independently, based uniquely on the query.

Thankfully, retrieving relevant information from large corpora (a technical term for extensive collections of data, usually text) is a task that Haystack excels at since its inception: the components that perform this task are called Retrievers.

Retrieval can be performed on different data sources: to begin, let’s assume we’re searching for data in a local database, which is the use case that most Retrievers are geared towards.

Let’s create a small local database to store information about some European countries. Haystack offers a neat object for these small-scale demos: InMemoryDocumentStore. This document store is little more than a Python dictionary under the hood but provides the same exact API as much more powerful data stores and vector stores, such as Elasticsearch or ChromaDB. Keep in mind that the object is called “Document Store” and not simply “datastore” because what it stores is Haystack’s Document objects: a small dataclass that helps other components make sense of the data that they receive.

So, let’s initialize an InMemoryDocumentStore and write some Documents into it.

from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

documents = [
    Document(content="German is the the official language of Germany."), 
    Document(content="The capital of France is Paris, and its official language is French."),
    Document(content="Italy recognizes a few official languages, but the most widespread one is Italian."),
    Document(content="Esperanto has been adopted as official language for some microstates as well, such as the Republic of Rose Island, a short-lived microstate built on a sea platform in the Adriatic Sea.")
]
docstore = InMemoryDocumentStore()
docstore.write_documents(documents=documents)

docstore.filter_documents()
# returns [
#     Document(content="German is the the official language of Germany."), 
#     Document(content="The capital of France is Paris, and its official language is French."),
#     Document(content="Esperanto has been adopted as official language for some microstates as well, such as the Republic of Rose Island, a short-lived microstate built on a sea platform in the Adriatic Sea."),
#     Document(content="Italy recognizes a few official languages, but the most widespread one is Italian."),
# ]

Once the document store is set up, we can initialize a retriever. In Haystack 2.0, each document store comes with its own set of highly optimized retrievers: InMemoryDocumentStore offers two, one based on BM25 ranking and one based on embedding similarity.

Let’s start with the BM25-based retriever, which is slightly easier to set up. Let’s first use it in isolation to see how it behaves.

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retriever = InMemoryBM25Retriever(document_store=docstore)
retriever.run(query="Rose Island", top_k=1)
# returns [
#     Document(content="Esperanto has been adopted as official language for some microstates as well, such as the Republic of Rose Island, a short-lived microstate built on a sea platform in the Adriatic Sea.")
# ]

retriever.run(query="Rose Island", top_k=3)
# returns [
#     Document(content="Esperanto has been adopted as official language for some microstates as well, such as the Republic of Rose Island, a short-lived microstate built on a sea platform in the Adriatic Sea.")
#     Document(content="Italy recognizes a few official languages, but the most widespread one is Italian."),
#     Document(content="The capital of France is Paris, and its official language is French."),
# ]

We see that InMemoryBM25Retriever accepts a few parameters. query is the question we want to find relevant documents for. In the case of BM25, the algorithm only searches for exact word matches. The resulting retriever is very fast, but it doesn’t fail gracefully: it can’t handle spelling mistakes, synonyms, or descriptions of an entity. For example, documents containing the word “cat” would be considered irrelevant against a query such as “felines”.

top_k controls the number of documents returned. We can see that in the first example, only one document is returned, the correct one. In the second, where top_k = 3, the retriever is forced to return three documents even if just one is relevant, so it picks the other two randomly. Although the behavior is not optimal, BM25 guarantees that if there is a document that is relevant to the query, it will be in the first position, so for now, we can use it with top_k=1.

Retrievers also accepts a filters parameter, which lets you pre-filter the documents before retrieval. This is a powerful technique that comes useful in complex applications, but for now we have no use for it. I will talk more in detail about this topic, called metadata filtering, in a later post.

Let’s now make use of this new component in our Pipeline.

Our first RAG Pipeline

The retriever does not return a single string but a list of Documents. How do we put the content of these objects into our prompt template?

It’s time to use Jinja’s powerful syntax to do some unpacking on our behalf.

Given the following information, answer the question.

Context: 
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: What's the official language of {{ country }}?

Notice how, despite the slightly alien syntax for a Python programmer, what the template does is reasonably evident: it iterates over the documents and, for each of them, renders their content field.

With all these pieces set up, we can finally put them all together.

template = """
Given the following information, answer the question.

Context: 
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: What's the official language of {{ country }}?
"""
pipe = Pipeline()

pipe.add_component("retriever", InMemoryBM25Retriever(document_store=docstore))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", OpenAIGenerator(api_key=api_key))
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

pipe.run({
   "retriever": {"query": country},
    "prompt_builder": {
        "country": "the Republic of Rose Island"
    }
})
# returns {
#     "llm": {
#         "replies": [
#             'The official language of the Republic of Rose Island is Esperanto.'
#         ]
#     }
# }

Congratulations! We’ve just built our first, true-to-its-name RAG Pipeline.

Scaling up: Elasticsearch

So, we now have our running prototype. What does it take to scale this system up for production workloads?

Of course, scaling up a system to production readiness is no simple task that can be addressed in a paragraph. Still, we can start this journey with one component that can readily be improved: the document store.

InMemoryDocumentStore is clearly a toy implementation: Haystack supports much more performant document stores such as Elasticsearch, ChromaDB and Marqo. Since we have built our app with a BM25 retriever, let’s select Elasticsearch as our production-ready document store of choice.

How do we use Elasticsearch on our pipeline? All it takes is to swap out InMemoryDocumentStore and InMemoryBM25Retriever with their Elasticsearch counterparts, which offer nearly identical APIs.

First, let’s create the document store: we will need a slightly more complex setup to connect to the Elasticearch backend. In this example, we use Elasticsearch version 8.8.0, but every Elasticsearch 8 version should work.

from elasticsearch_haystack.document_store import ElasticsearchDocumentStore

host = os.environ.get("ELASTICSEARCH_HOST", "https://localhost:9200")
user = "elastic"
pwd = os.environ["ELASTICSEARCH_PASSWORD"]  # You need to provide this value

docstore = ElasticsearchDocumentStore(
    hosts=[host], 
    basic_auth=(user, pwd), 
    ca_certs="/content/elasticsearch-8.8.0/config/certs/http_ca.crt"
)

Now, let’s write again our four documents into the store. In this case, we specify the duplicate policy, so if the documents were already present, they would be overwritten. All Haystack document stores offer three policies to handle duplicates: FAIL (the default), SKIP, and OVERWRITE.

from haystack.document_stores import DuplicatePolicy
documents = [
    Document(content="German is the the official language of Germany."), 
    Document(content="The capital of France is Paris, and its official language is French."),
    Document(content="Italy recognizes a few official languages, but the most widespread one is Italian."),
    Document(content="Esperanto has been adopted as official language for some microstates as well, such as the Republic of Rose Island, a short-lived microstate built on a sea platform in the Adriatic Sea.")
]
docstore.write_documents(documents=documents, policy=DuplicatePolicy.OVERWRITE)

Once this is done, we are ready to build the same pipeline as before, but using ElasticsearchBM25Retriever.

from elasticsearch_haystack.bm25_retriever import ElasticsearchBM25Retriever

template = """
Given the following information, answer the question.

Context: 
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: What's the official language of {{ country }}?
"""

pipe = Pipeline()
pipe.add_component("retriever", ElasticsearchBM25Retriever(document_store=docstore))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", OpenAIGenerator(api_key=api_key))
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

pipe.draw("elasticsearch-rag-pipeline.png")

country = "the Republic of Rose Island"
pipe.run({
    "retriever": {"query": country},
    "prompt_builder": {"country": country}
})
# returns {
#     "llm": {
#         "replies": [
#             'The official language of the Republic of Rose Island is Esperanto.'
#         ]
#     }
# }

That’s it! We’re now running the same pipeline over a production-ready Elasticsearch instance.

Wrapping up

In this post, we’ve detailed some fundamental components that make RAG applications possible with Haystack: Generators, the PromptBuilder, and Retrievers. We’ve seen how they can all be used in isolation and how you can make Pipelines out of them to achieve the same goal. Last, we’ve experimented with some of the (very early!) features that make Haystack 2.0 production-ready and easy to scale up from a simple demo with minimal changes.

However, this is just the start of our journey into RAG. Stay tuned!

Multilingual Generative Question Answering with Haystack and Cohere

Wed, 08 Nov 2023 00:00:00 +0000

In today’s interconnected world, language should never be a barrier to accessing information. This is especially crucial in the context of travel, where travelers often rely on reviews to make informed decisions about their accommodations (I know I do). But what if you’re an English-speaking traveler trying to make sense of reviews written in multiple languages and simply want to know, “Is this place too noisy to sleep?”

In this blog post, we’ll delve into the details of multilingual retrieval and multilingual generation, and demonstrate how to build a Retrieval Augmented Generation (RAG) pipeline to generate answers from multilingual hotel reviews using Cohere models and Haystack.

This article is accompanied by a Colab notebook.

Multilingual Embeddings

Typically, embedding models are trained to create word embeddings based on their semantic similarity. This means that the embeddings for “King” and “Queen” should be closer in the vector space compared to “King” and “Bread.” However, most of these embedding models are designed for a single language and cannot capture semantic similarities between words in different languages.

Monolingual Embedding Models

Multilingual embedding models, on the other hand, are trained to capture semantic similarities between words across languages. They can create similar embeddings for “King”, “König” (king in German), and “Kral” (king in Turkish). This remarkable capability of multilingual models makes it possible to retrieve documents regardless of the language in which the documents or the query are written.

Multilingual Embedding Models

Cohere’s multilingual embedding models are cutting-edge technology designed to handle text from multiple languages. These embeddings empower the model to understand the semantics of texts in languages like Spanish, French, German, and more while providing a consistent representation for efficient cross-lingual analysis.

Let’s see these models in action using the embed-multilingual-v2.0 model to create embeddings for hotel reviews.

Storing Multilingual Embeddings

To create a question answering system for hotel reviews, the first thing we need is a document store. We’ll use an InMemoryDocumentStore to save the hotel reviews along with their embeddings.

from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore(embedding_dim=768, similarity= "dot_product")

Creating an Indexing Pipeline

Next, we’ll set up an indexing pipeline with a PreProcessor and EmbeddingRetriever. The PreProcessor will break down long reviews into smaller, meaningful chunks, while the EmbeddingRetriever will generate multilingual embeddings for each document.

We’ll use the embed-multilingual-v2.0 model from Cohere in the EmbeddingRetriever. You will need a Cohere API key for this model, which you can obtain here.

from haystack.nodes import EmbeddingRetriever, PreProcessor
from haystack.pipelines import Pipeline

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=False,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
)
retriever = EmbeddingRetriever(
    embedding_model="embed-multilingual-v2.0", 
    document_store=document_store,
    api_key=COHERE_API_KEY
)

indexing_pipeline = Pipeline()
indexing_pipeline.add_node(component=preprocessor, name="preprocessor", inputs=["File"])
indexing_pipeline.add_node(component=retriever, name="retriever", inputs=["preprocessor"])
indexing_pipeline.add_node(component=document_store, name="document_store", inputs=['retriever'])

Indexing Reviews with Embeddings

Our indexing pipeline is ready to run. We will use hotel reviews in multiple languages as documents, including Portuguese, Polish, German, Spanish, French, Dutch, and English. All of these reviews belong to the same accommodation. Let’s index these reviews by running the pipeline.

documents = [Document("O ar condicionado de um dos quartos deu problema, mas levaram um ventilador para ser utilizado. Também por ser em uma área bem movimentada, o barulho da rua pode ser ouvido. Porém, eles deixam protetores auriculares para o uso. Também senti falta de um espelho de corpo inteiro no apartamento. Só havia o do banheiro que mostra apenas a parte superior do corpo."),
             Document("Durchgängig Lärm, weil direkt an der Partymeile; schmutziges Geschirr; unvollständige Küchenausstattung; Abzugshaube über Herd ging für zwei Stunden automatisch an und lies sich nicht abstellen; Reaktionen auf Anfragen entweder gar nicht oder unfreundlich"),
             Document("Das Personal ist sehr zuvorkommend! Über WhatsApp war man im guten Kontakt und konnte alles erfragen. Auch das Angebot des Shuttleservices war super und würde ich empfehlen - sehr unkompliziert! Unser Flug hatte Verspätung und der Shuttle hat auf uns gewartet. Die Lage zur Innenstadt ist sehr gut,jedoch ist die Fensterfront direkt zur Club-Straße deshalb war es nachts bis drei/vier Uhr immer recht laut. Die Kaffeemaschine oder auch die Couch hätten sauberer sein können. Ansonsten war das Appartement aber völlig ok."),
             Document("Super appartement. Juste au dessus de plusieurs bars qui ferment très tard. A savoir à l'avance. (Bouchons d'oreilles fournis !)"),
             Document("Zapach moczu przy wejściu do budynku, może warto zainstalować tam mocne światło na czujnik ruchu, dla gości to korzystne a dla kogoś kto chciałby zrobić tam coś innego niekorzystne :-). Świetne lokalizacje w centrum niestety są na to narażane."),
             Document("El apartamento estaba genial y muy céntrico, todo a mano. Al lado de la librería Lello y De la Torre de los clérigos. Está situado en una zona de marcha, así que si vais en fin de semana , habrá ruido, aunque a nosotros no nos molestaba para dormir"),
             Document("The keypad with a code is convenient and the location is convenient. Basically everything else, very noisy, wi-fi didn't work, check-in person didn't explain anything about facilities, shower head was broken, there's no cleaning and everything else one may need is charged."),
             Document("It is very central and appartement has a nice appearance (even though a lot IKEA stuff), *W A R N I N G** the appartement presents itself as a elegant and as a place to relax, very wrong place to relax - you cannot sleep in this appartement, even the beds are vibrating from the bass of the clubs in the same building - you get ear plugs from the hotel -> now I understand why -> I missed a trip as it was so loud and I could not hear the alarm next day due to the ear plugs.- there is a green light indicating 'emergency exit' just above the bed, which shines very bright at night - during the arrival process, you felt the urge of the agent to leave as soon as possible. - try to go to 'RVA clerigos appartements' -> same price, super quiet, beautiful, city center and very nice staff (not an agency)- you are basically sleeping next to the fridge, which makes a lot of noise, when the compressor is running -> had to switch it off - but then had no cool food and drinks. - the bed was somehow broken down - the wooden part behind the bed was almost falling appart and some hooks were broken before- when the neighbour room is cooking you hear the fan very loud. I initially thought that I somehow activated the kitchen fan"),
             Document("Un peu salé surtout le sol. Manque de service et de souplesse"),
             Document("De comfort zo centraal voor die prijs."),
             Document("Die Lage war sehr Zentral und man konnte alles sehenswertes zu Fuß erreichen. Wer am Wochenende nachts schlafen möchte, sollte diese Unterkunft auf keinen Fall nehmen. Party direkt vor der Tür so das man denkt, man schläft mitten drin. Sehr Sehr laut also und das bis früh 5 Uhr. Ab 7 kommt dann die Straßenreinigung die keineswegs leiser ist."),
             Document("Ótima escolha! Apartamento confortável e limpo! O RoofTop é otimo para beber um vinho! O apartamento é localizado entre duas ruas de movimento noturno. Porem as janelas, blindam 90% do barulho. Não nos incomodou"),
             Document("Nous avons passé un séjour formidable. Merci aux personnes , le bonjours à Ricardo notre taxi man, très sympathique. Je pense refaire un séjour parmi vous, après le confinement, tout était parfait, surtout leur gentillesse, aucune chaude négative. Je n'ai rien à redire de négative, Ils étaient a notre écoute, un gentil message tout les matins, pour nous demander si nous avions besoins de renseignement et savoir si tout allait bien pendant notre séjour."),
             Document("Boa localização. Bom pequeno almoço. A tv não se encontrava funcional."),
             Document("Céntrico. Muy cómodo para moverse y ver Oporto. Edificio con terraza propia en la última planta. Todo reformado y nuevo. Te traen un estupendo desayuno todas las mañanas al apartamento. Solo que se puede escuchar algo de ruido de la calle a primeras horas de la noche. Es un zona de ocio nocturno. Pero respetan los horarios.")
]

indexing_pipeline.run(documents=documents)

Retrieval Augmented Generation (RAG)

Now that we have multilingual embeddings indexed in our document store, let’s create the part of the demo where users interact the most: Retrieval-Augmented Generation (RAG) Pipeline.

A RAG pipeline consists of two parts: document retrieval and answer generation.

Multilingual Document Retrieval

In the document retrieval step of a RAG pipeline, the retriever creates an embedding for the query in the multilingual vector space and retrieves the most similar top_k documents to the query from the document store. In our case, the retrieved documents will be hotel reviews.

We’ll use the same EmbeddingRetriever instance that we used in the indexing pipeline.

Retrieval Mechanism with Document Store

Multilingual Answer Generation

In the generation step of the RAG pipeline, we’ll use an LLM (a generative model) to generate an answer based on the retrieved documents.

Let’s create a prompt template to use for hotel reviews. In this template, we’ll have two prompt variables: {join(documents)} and {query}. These variables will later be filled with the user question and the retrieved hotel reviews outputted from the retriever.

from haystack.nodes import AnswerParser, PromptTemplate

prompt="""
You will be provided with reviews in various languages for an accommodation. 
Create a concise and informative answer for a given question based solely on the given reviews.
\nReviews: {join(documents)}
\nQuestion: {query};
\nAnswer:
"""
template = PromptTemplate(
    prompt=prompt,
    output_parser=AnswerParser())

After defining the PromptTemplate, we need to define the PromptNode. For this generative case, we’ll use Cohere’s command model and initialize the PromptNode with an API key, model name, and the PromptTemplate we defined previously. Then, we’ll connect the EmbeddingRetriever to the PromptNode to finalize our RAG pipeline.

prompt_node = PromptNode(model_name_or_path="command", api_key=COHERE_API_KEY, default_prompt_template=template)

rag_pipeline = Pipeline()
rag_pipeline.add_node(component=retriever, name="Retriever", inputs=["Query"])
rag_pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

Now we can run the pipeline with various questions and learn if this accommodation is good to stay at! 🏡

results = rag_pipeline.run("Is this place too noisy to sleep?", params={"Retriever": {"top_k": 3}})
print(results["answers"][0].answer)

>> "Based on the information provided in the reviews, it seems that the accommodation can be very noisy, especially at night. Multiple reviewers mentioned..."

To learn more about how you can use Cohere models in Haystack pipelines, check out our integration page for Cohere🩵

Guide to Using Zephyr Models to Generate Answers on Your Data

Mon, 06 Nov 2023 00:00:00 +0000

Hugging Face recently announced their new open-source LLM, Zephyr-7B Beta, which is a fine-tuned version of Mistral 7B V.01 that focuses on helpfulness and outperforms many larger models on MT-Bench and AlpacaEval benchmarks. In this article, we’re going to show you how to use the new Zephyr models in a full retrieval-augmented generation pipeline, in a way that can work on your own private data.

Following the theme of our previous article, we will show you how to build a pipeline that uses Zephyr with Haystack, but we will also take the opportunity and show you how to do this with the preview package of Haystack 2.0.

Update: we released Haystack 2.0-Beta on December 4th 2023, the code in this article has been updated to work with this new release.

Querying Zephyr with Haystack

To start with, let’s see how to query Zephyr-7B-Beta with Haystack as it is. To load and manage open-source LLMs in Haystack 2.0, we can use the HuggingFaceLocalGenerator. If you are using a free Colab instance (with limited resources), you can load the model using 4-bit quantization (passing the appropriate huggingface_pipeline_kwargs to our Generator). For an introduction to Quantization in Hugging Face Transformers, you can read this simple blog post.

from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator("HuggingFaceH4/zephyr-7b-beta",
                                          huggingface_pipeline_kwargs={"device_map":"auto",
                                                        "model_kwargs":{"load_in_4bit":True,
                                                          "bnb_4bit_use_double_quant":True,
                                                          "bnb_4bit_quant_type":"nf4",
                                                          "bnb_4bit_compute_dtype":torch.bfloat16}},
                                          generation_kwargs={"max_new_tokens": 350})

Once we have initialized this generator, we can simply call run() with a query. Here’s an example that we tried:

generator.run("Please write a rhyme about Italy.")

'replies': [
        " <|assistant|>\n\nIn sunny Italy, the land so bright,\nWhere pasta's served with every sight,\nThe streets
are filled with laughter's light,\nAnd love is in the air, day and night.\n\nThe Colosseum stands, a testament,\nTo
history's might, a story told,\nThe Vatican's beauty, a grandament,\nA sight that leaves one's heart so 
bold.\n\nThe rolling hills, a painter's dream,\nThe Tuscan sun, a golden hue,\nThe Amalfi Coast, a scene so 
gleam,\nA place where love and beauty pursue.\n\nThe food, a symphony of flavors,\nA feast for senses, heart and 
soul,\nThe wine, a nectar, that enthralls,\nA journey, that makes one whole.\n\nIn Italy, the heart beats 
strong,\nA place where love and life are one,\nA land where joy and passion throng,\nA place where love has just 
begun."
    ]

Using Zephyr in a Retrieval-Augmented Generative Pipeline

Now that we’ve seen the pretty simple process of initializing a Haystack 2.0 (preview) component that can query Zephyr, let’s have a look at how we can set up a full pipeline that works on our own data. This pipeline will use Zephyr-7B-Beta as the LLM to generate responses.

The Prompt

The first step is to define a prompt template that will effectively describe the manner in which we interact with an LLM. In Haystack 2.0 (preview, but eventually also the actual major release), prompt templates can be defined using the Jinja2 templating language. This provides a wide range of customizability to your prompts at query time. These templates are then used by a PromptBuilder that builds a full prompt before sending it to an LLM.

from haystack.components.builders import PromptBuilder

prompt_template = """<|system|>Using the information contained in the context, 
give a comprehensive answer to the question.
If the answer is contained in the context, also report the source URL.
If the answer cannot be deduced from the context, do not give an answer.
<|user|>
Context:
  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %};
  Question: {{query}}
  
<|assistant|>
"""
prompt_builder = PromptBuilder(template=prompt_template)

Let’s quickly walk through what we are seeing above. We will look at both the syntax present to prompt Zephyr specifically, and the syntax that we use in Haystack 2.0 (preview) to build prompts programmatically.

Zephyr ‘syntax’

Everything you see tagged with <|system|> is what the LLM expects as the main instruction, defining how it should approach user input. Everything tagged with <|user|> is what the LLM expects as user input, or even more broadly put, what it expects at “query time”. Finally, <|assistant|> tags the part of the prompt that the model will complete.

Haystack Syntax

You will also notice that there is a for loop in the prompt. What you are seeing is fully Jinja2 templating language. We use Jinja2 with Haystack 2.0 prompt templates. You can modify this section depending on what sort of input you want to provide the LLM.

Here, we are looping through every doc in documents and adding doc.content and doc.meta['url'] to the prompt. After the for loop, we are also adding the query.

In the next section, we will see how we use this prompt template in a full Haystack 2.0 (preview) pipeline so that we can fill this template to create a fully formed prompt for Zephyr.

The Pipeline

The core concept of a pipeline is to provide a structure with which data can be funnelled from one component to the other. Each component might be responsible for doing one specific thing such as keyword search, embedding retrieval, web search, translation and so on. In this case, we would like to build a retrieval-augmented generative pipeline that performs 2 major steps:

Given a query by a user, it retrieves the most relevant documents from our document store. This way, we can fill the documents loop in our prompt template with only the most relevant context.
Prompts the LLM (in this case Zephyr) with the fully formed (retrieval-augmented) prompt and receives a response.

To do so, we will be using 4 components and 1 document store. For demonstration purposes, we’ve used the InMemoryDocumentStore. However, you could use a number of other document stores such as Weaviate, Pincone, Milvus etc. (see our document store integrations for a full list).

Our document store has our “own” data in it. As an example, we’ve indexed documents that have information about music in our document store. Specifically, we indexed the Wikipedia pages of various bands.

You can have a look at the full list of the bands we used, as well as how we set up our indexing pipeline in our Colab example 😊.

Now that we have our data in place in a document store, here are the rest of the 4 components we use:

A SentenceTransformersTextEmbedder: We use this component to create an embedding for the query we receive from the user. That way we can use this embedding to do embedding retrieval from our document store. In this example, we use "thenlper/gte-large" as our embedding model
An InMemoryEmbeddingRetriever: We use this component to retrieve the most relevant documents from the InMemoryDocumentStore by doing embedding retrieval.
A PromptBuilder: We use the builder we defined in the previous section
A HuggingFaceLocalGenerator: We use the generator we created above, which uses the HuggingFaceH4/zephyr-7b-beta model

Once again, let’s define all of our components:

from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import HuggingFaceLocalGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import ComponentDevice

text_embedder = SentenceTransformersTextEmbedder(model="thenlper/gte-large",
                                                 device=ComponentDevice.from_str("cuda:0"))

retriever = InMemoryEmbeddingRetriever(document_store=document_store, top_k=5)

prompt_template = """<|system|>Using the information contained in the context, 
give a comprehensive answer to the question.
If the answer is contained in the context, also report the source URL.
If the answer cannot be deduced from the context, do not give an answer.
<|user|>
Context:
  {% for doc in documents %}
  {{ doc.content }} URL:{{ doc.meta['url'] }}
  {% endfor %};
  Question: {{query}}
  
<|assistant|>
"""
prompt_builder = PromptBuilder(template=prompt_template)

generator = HuggingFaceLocalGenerator("HuggingFaceH4/zephyr-7b-beta",
                                      pipeline_kwargs={"device_map":"auto",
                                                       "model_kwargs":{"load_in_4bit":True,
                                                                       "bnb_4bit_use_double_quant":True,
                                                                       "bnb_4bit_quant_type":"nf4",
                                                                       "bnb_4bit_compute_dtype":torch.bfloat16}},
                                      generation_kwargs={"max_new_tokens": 350}
                                      )

And then we build our pipeline. Our first step is to add these components to the pipeline, and next, we make the connections between each component:

rag = Pipeline()
rag.add_component("text_embedder", text_embedder)
rag.add_component("retriever", retriever)
rag.add_component("prompt_builder", prompt_builder)
rag.add_component("llm", generator)

rag.connect("text_embedder", "retriever")
rag.connect("retriever.documents", "prompt_builder.documents")
rag.connect("prompt_builder.prompt", "llm.prompt")

We can also visualize what our final pipeline looks like with rag.draw("rag.png").

Running the Pipeline and Restuls

The SentenceTransformersTextEmbedder with name text_embedder expects text as input. While the prompt_builder expects query. Our aim is to provide these inputs when we run the pipeline. The rest of the required inputs for each component are provided to them by the preceding components. Let’s run the pipeline with the question “What is the style of the Cure?”

query = "What is the style of the Cure?"

results = rag.run({
      "text_embedder": {"text": query},
      "prompt_builder": {"query": query}
    }
  )

answer = results["llm"]["replies"][0]

The result we got from the pipeline above is the following:

The style of The Cure, as described in the context, is "increasingly dark and tormented" and had a strong influence
on the emerging genre of gothic rock. This style is evident in their second album, "Seventeen Seconds," and is 
characterized by their use of "dominant, melodic bass lines; whiny, strangulated vocals; and a lyric obsession with
existential, almost literary despair." (Source: https://en.wikipedia.org/wiki/The_Cure)

Additionally, the Cure's influence on other musical artists spanning multiple genres, as listed in the context, 
further supports the description of their style as gothic rock. Some of these artists include Billy Corgan of the 
Smashing Pumpkins, Slowdive, Ride, Miki Berenyi of Lush, and Britt Daniel of Spoon, among others. (Source: 
https://en.wikipedia.org/wiki/The_Cure)

Overall, The Cure's style can be characterized as gothic rock, with a focus on dark, introspective lyrics and a 
distinctive sound that includes dominant bass lines and whiny, strangulated vocals.

Notice that the source URL is also reported. This was because in the prompt template, we added the URL for each doc in documents to the prompt with URL:{{ doc.meta['url'] and the instructions said: “If the answer is contained in the context, also report the source URL.”

Customizing RAG Pipelines to Summarize Latest Hacker News Posts

Fri, 22 Sep 2023 00:00:00 +0000

Over the last few months, the team at deepset has been working on a major upgrade in the Haystack repository. Along the way, we’ve been sharing our updates and design process for the upcoming Haystack 2.0 with the community, as well as releasing new components in a preview package. This means that you can already start exploring features coming to Haystack 2.0 using the preview components available in the haystack-ai package (pip install haystack-ai).

Update: we released Haystack 2.0-Beta on December 4th 2023, the code in this article has been updated to work with this new release.

In this article, I’ll cover two major concepts in Haystack

Components: These are the smallest building blocks in Haystack. They are meant to cover one simple task. As well as using components available in the core Haystack project, it will be easier than ever in 2.0, to create your own custom components.
Pipelines: These are made by connecting components to each other. Pipelines in 2.0 are more flexible than ever and enable you various new connection patterns between your components.

While components and pipelines have been at the core of Haystack since the beginning, Haystack 2.0 introduces some significant changes to how they are constructed.

We’ll look at how to create custom components and pipelines using the Haystack 2.0 preview. I’ll share a custom Haystack component that fetches the latest posts from Hacker News, and show how we can use it in a retrieval-augmented generative (RAG) pipeline to generate summaries of Hacker News posts.

Components in Haystack 2.0

A component is a class that does one thing. That thing could be to ‘prompt GPT3.5’, or ‘translate’, or ‘retrieve documents’, and so on.

While Haystack comes with a set of components in the core project, we hope that with Haystack 2.0 you will also be able to easily build components to your own custom requirements.

In Haystack 2.0, a class can become a component with just two additions:

A @component decorator on the class declaration.
A run function with a decorator @component.output_types(my_output_name=my_output_type) that describes what output the pipeline should expect from this component.

And that’s about it.

Building a Custom Hacker News Component

I’ll admit, the idea for this custom component came from one of our amazing Haystack ambassadors on Discord during a live coding session (thanks rec 💙) — and it turned out pretty well! So let’s take a look at how we create a custom component that fetches the latest k posts from Hacker News.

First, we create a HackernewsNewestFetcher. For it to be a valid Haystack component, it will also need a run function. For now, let’s create a stub function that simply returns a dictionary containing a single key ‘articles’ with the value ‘Hello world!’.

from haystack import component  
  
@component  
class HackernewsNewestFetcher():  
    
  @component.output_types(articles=str)  
  def run(self):  
    return {'articles': 'Hello world!'}

Now let’s make our component actually fetch the latest posts from Hacker News. We can use the newspapers3k package to crawl and get the contents of given URLs. We will also change the output type to return a list of Document objects.

from typing import List  
from haystack import component, Document  
from newspaper import Article  
import requests  
  
@component  
class HackernewsNewestFetcher():  
    
  @component.output_types(articles=List[Document])  
  def run(self, last_k: int):  
    newest_list = requests.get(url='https://hacker-news.firebaseio.com/v0/newstories.json?print=pretty')  
    articles = []  
    for id in newest_list.json()[0:last_k]:  
      article = requests.get(url=f"https://hacker-news.firebaseio.com/v0/item/{id}.json?print=pretty")  
      if 'url' in article.json():  
        articles.append(article.json()['url'])  
  
    docs = []  
    for url in articles:  
      try:  
        article = Article(url)  
        article.download()  
        article.parse()  
        docs.append(Document(content=article.text, meta={'title': article.title, 'url': url}))  
      except:  
        print(f"Couldn't download {url}, skipped")  
    return {'articles': docs}

We now have a component that, when run, returns a list of Documents containing the contents of the (last_k) latest posts on Hacker News. Here we store the output in the articles key of the dictionary.

Pipelines in Haystack 2.0

A pipeline is a structure that connects one component’s output to another component’s input until a final result is reached.

A pipeline is created with a few steps:

Create a Pipeline:
pipeline = Pipeline()
Add components to the pipeline:
pipeline.add_component(instance=component_a, name=”ComponentA”)
pipeline.add_component(instance=component_b, name=”ComponentB”)
Connect an output from one component to the input of another:
pipeline.connect("component_a.output_a", "component_b.input_b")

There are already enough components available in the Haystack 2.0 preview for us to build a simple RAG pipeline that uses our newHackernewsNewestFetcher for the retrieval augmentation step.

Building a RAG Pipeline to Generate Summaries of Hacker News Posts

To build a RAG pipeline that can create a summary for each of the latest k posts on Hacker News, we will use two components from the Haystack 2.0 preview:

The PromptBuilder: This component allows us to create prompt templates using Jinja as our templating language.
The OpenAIGenerator: This component simply prompts the specified GPT model. We can connect the PromptBuilder output to this component to customize how we interact with our chosen model.

First, we initialize all of the components we will need for the pipeline:

from haystack import Pipeline  
from haystack.components.builders.prompt_builder import PromptBuilder  
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

prompt_template = """  
You will be provided a few of the latest posts in HackerNews, followed by their URL.  
For each post, provide a brief summary followed by the URL the full post can be found at.  
  
Posts:  
{% for article in articles %}  
  {{article.content}}  
  URL: {{article.meta['url']}}  
{% endfor %}  
"""  
  
prompt_builder = PromptBuilder(template=prompt_template)  
llm = OpenAIGenerator(mode="gpt-4", api_key=Secret.from_token('YOUR_API_KEY'))  
fetcher = HackernewsNewestFetcher()

Next, we add the components to a Pipeline:

pipeline = Pipeline()  
pipeline.add_component("hackernews_fetcher", fetcher)  
pipeline.add_component("prompt_builder", prompt_builder)  
pipeline.add_component("llm", llm)

And finally, we connect the components to each other:

pipeline.connect("hackernews_fetcher.articles", "prompt_builder.articles")  
pipeline.connect("prompt_builder", "llm")

Here, notice how we connect hackernews_fetcher.articles to prompt_builder.articles. This is because prompt_builder is expecting articles in its template:

Posts:  
{% for article in articles %}  
  {{article.contnet}}  
  URL: {{article.meta['url']}}  
{% endfor %}

The output and input keys do not need to have matching names. Additionally, prompt_builder makes all of the input keys available to your prompt template. We could, for example, provide a documents input to prompt_builder instead of articles. Then our code might look like this:

prompt_template = """  
You will be provided a few of the latest posts in HackerNews, followed by their URL.  
For each post, provide a brief summary followed by the URL the full post can be found at.  
  
Posts:  
{% for document in documents %}  
  {{document.content}}  
  URL: {{document.meta['url']}}  
{% endfor %}  
"""  
  
[...]  
  
pipeline.connect("hackernews_fetcher.articles", "prompt_builder.documents")

Notice how the prompt now refers to documents, and the connect call now attaches to the corresponding prompt_builder.documents input.

Now that we have a pipeline, we can run it. Here is what I got as a response at about 22:45 CET on September 21st 🤗

result = pipe.run(data={"hackernews_fetcher":{"last_k": 2}})  
print(result['llm']['replies'][0])

Response:

1. "The translation world has legends of its own, but not all legends involve greatness.   
Many provide pain, confusion, or comedy, as these examples of bad game translation prove."   
- This post shares a humorous look at some examples of poor video game translations that have   
resulted in confusion and comedy. The author seeks to highlight that while translation is often   
necessary in game localization, it can sometimes yield suboptimal results.  
Link: https://legendsoflocalization.com/bad-translation/  
  
2. “Recently, I found myself returning to a compelling series of   
blog posts titled Zero-cost futures in Rust by Aaron Turon about what would   
become the foundation of Rust's async ecosystem.”   
- This post provides an in-depth analysis of the current state of Rust's   
'async' ecosystem, drawing upon the author's own experiences and Aaron Turon's   
blog series, "Zero-cost futures in Rust". The author also discusses the benefits and   
negatives of the current async ecosystem, the problems with ecosystem fragmentation,   
the state and issue of async-std, alternative runtimes, the complexities of writing async code,   
the benefits of synchronous threads over async, and the obsessiveness of Rust landscape with an   
async-first approach. The post concludes with the notion that async Rust should be used only   
when necessary and that the smaller, simpler language inside Rust (the synchronous Rust)   
should be the default mode.  
Link: https://corrode.dev/blog/async/

Further Improvements

This custom component was created as an experiment and you could certainly take it much further in a real-world application.

For example, our experimental component does nothing to reduce the length of the content in each article. This means that GPT-4 may struggle to give a good response, especially when setting last_k to a high number.

Deploying RAG to Production

Wed, 13 Sep 2023 00:00:00 +0000

As data scientists, we have often mastered the art of prototyping. We can use machine learning frameworks like Haystack to build, test, and fine-tune data-driven systems. We’re experts at gathering stakeholder feedback, quantifying it, and interpreting the various performance metrics. But it can get tricky for many of us when we want to move these systems into a production environment, where they become available for general use.

With Haystack, developers can build complex LLM pipelines on top of their own text databases, using state-of-the-art tools: from conversational AI to semantic search and summarization. One of the most talked about architectures these days is RAG, which stands for retrieval augmented generative AI. RAG pipelines combine the power of a generative LLM with the insights contained in your data, to create truly helpful user interfaces. To learn more, check out our blog post on RAG on The Deep Dive.

But deploying to production is an art in itself. And it can be intimidating. There are many elements that need to work together for a successful deployment, each of which is a potential source of failure. Fortunately, here at deepset, we have a team of knowledgeable people who deploy Haystack-powered pipelines to production on a regular basis. So in this article, I’ll share what I learned from them – so that you can get your RAG system into production ASAP.

From Prototype to Production

Prototyping is the process of building versions of your system – iteratively designing, deploying, and testing them – until you arrive at a configuration that meets your needs and has the potential to generate real value in production. While this process is challenging enough, it is only the beginning of your system’s life cycle.

The second important step in an applied AI project is deploying the system to production. The difference between this and a development environment, to put it bluntly, is that a lot more can go wrong. That’s because you have no control over how many people will be querying your system - and expecting it to respond quickly. You also have no (or very little) way of knowing when people will query it. So you need to plan for contingencies, and build a system that is scalable (meaning that its processing power can grow and shrink as needed) and robust (so that a high load on the system doesn’t cause it to crash).

When you deploy your system to production, it is also much harder to predict what people will use it for. That’s why you need to monitor your system’s performance and react accordingly. But LLM observability and pipeline monitoring are complex topics that we will leave for our next blog post.

The Use Case

Not every machine learning powered system is the same. In fact, you could argue that they’re all different – their nature depends on the problem we’re solving, the pipeline design, the underlying data, and the scale of the project.

Let us consider a mid-sized project using a RAG pipeline. The pipeline is connected to a regularly updated data store. Such a project could be used in an online news environment, for example, or in an employee-facing system that ingests company reports on a regular basis. In summary, our system has the following requirements:

It can ingest a large batch of documents when we first build it (our existing collection of online news or company reports).
It can be updated with additional, smaller batches of documents on a regular basis (say, daily, or whenever a new document arrives).
It can be queried at any time from a user interface (probably within a browser).

While the first two points are handled by an indexing pipeline, the second is handled by a query pipeline. In Haystack, you define both in the same yaml file.

Moving to Production

There are many tools you can use on your way to production. The exact choice depends on your needs and preferences – and if you have deployed a system before, you probably already have a preferred setup. Either way, let’s take stock of what you need for a successful deployment:

A production-ready, managed database, such as OpenSearch, Weaviate, or Pinecone. The advantage of using a database managed by a third party is that they take care of all the complex stuff like database maintenance and security for you.
A server that can host your database and compute infrastructure: you will need CPUs as well as GPUs – possibly for indexing, quite definitely for inference. There are many options for cloud providers, so choose the one you’re most comfortable with. The largest are AWS, Azure, and GCP.
An orchestration tool like Kubernetes (often styled K8s), which communicates with both the server (where your data resides and the pipelines run) and the client (the user interface that sends requests through a REST API).
In most real-world projects, you’ll also want to test your application locally before deploying it to the external server. You can set up a local Kubernetes environment by using k3d. It allows you to create a lightweight Kubernetes cluster in docker on your own machine. See Kristof’s article for detailed instructions.

Deploying the indexing pipeline to production

Indexing is the process of adding documents to your database. How you go about indexing in production depends a bit on whether you’re working with a keyword retriever or an embedding retriever (or both, in a hybrid retrieval setup). A keyword retriever is fast and doesn’t require any special hardware. An embedding retriever is a different beast.

Embedding methods take more time because they have to run your documents through a Transformer-based language model. The model spits out a dense, semantically rich vector that is added to the database, making it searchable for the later retrieval step. Because this step is computationally expensive, you’ll want to speed it up using GPUs.

Preparing the database

During indexing, documents and their corresponding text embeddings are stored in the database’s memory. Before signing up for a managed database, it’s recommended to have at least a rough idea of how much space your documents and vectors will take up. That’s because adding more space later is a hassle – yet buying too much space and then not using it can cost you thousands of $€¥.

The amount of space you need depends largely on the length of your vectors. For example, Cohere’s giant text embeddings are five times longer than the ones we often use in production – and therefore require five times as much space. So optimizing for vector length can therefore save you a lot of money every month.

Another concept you’ll encounter when setting up your managed database is “high availability.” This refers to storing your documents on more than one server in more than one physical location. This practice of redundancy ensures that your documents will still be available, even if one server is temporarily or permanently down.

Preprocessing and indexing your documents

During prototyping, you have defined the indexing pipeline, which specifies how you preprocess your documents, before adding them to the database. For production, you’ll move the indexing pipeline to your cloud provider, where it will be deployed on virtual machines by Kubernetes. It’s useful to summarize all the details of your configuration – such as the credentials for your document store, the pipeline yaml itself, and rules for the scaling of pipelines and hardware – in a Helm chart. Kubernetes will then deploy your system according to the settings specified in the chart.

How you send the raw files to your external service depends largely on your application and the origin of your files. For example, you can write a script that sends batches of files to your indexing endpoint at a fixed time each day or week, or you can configure a stream that sends the files as they come in. When you expect new data to index, you can enable autoscaling in Kubernetes, which will cause it to create replicas of your indexing pipeline that can then run in parallel.

Because indexing as a process is not as time-sensitive as querying, you can queue your files for indexing. Queues are used to ensure that requests can be accepted almost immediately, but processed sometime in the (near) future. To learn more about queuing with KEDA, see our series of articles about scaling indexing pipelines.

Once your document embeddings are ready, the service adds them to the database, where they can now be queried.

Deploying the Query Pipeline to Production

Unlike indexing, querying is extremely time-sensitive. When your users query your RAG system, they expect answers quickly – even the slightest delay can cause them to give up on your product and look elsewhere. Therefore, it is critical that your production query pipeline is always available and able to handle even a large number of queries at once. This means, among other things, that the pipeline must be able to scale on demand.

Again, Kubernetes is our friend. It receives the requests through the Haystack REST API endpoint and creates replicas of the query pipeline when many requests come in at once. It also ensures that these pipeline replicas have the necessary hardware resources available when needed. The queries are processed by the pipelines and the retrieval-augmented, LLM-generated responses are returned to your application, where your users can interact with them.

The nitty-gritty technical details of scaling are handled by our orchestration tool. Additionally, model hosting services like SageMaker or Hugging Face Inference can be helpful to scale model inference separately. Aside from these automated solutions, we have the ability to tweak the scaling of our pipelines ourselves. To do this, it’s useful to think about the following questions:

How many requests do I want each replica to handle simultaneously?
What is the expected peak load?
Can we accept a higher latency to reduce costs?

Answering these questions will help you optimize your setup and can save you money – but don’t worry too much about these settings, because they can always be adjusted. First, you need to get your system into production, then you can monitor and improve it – more on that in our next blog post. 🙂

From RAGs to Riches

It’s one thing to offer prototyping with LLMs, and quite another to provide all the elements needed to bring a system to production. Only the latter can actually help your users and customers gain valuable insights in the blink of an eye.

Haystack is the framework for AI engineers and data scientists to build and deploy production-ready systems powered by the latest in AI.

Are you interested in learning more about building advanced systems with LLMs to solve real-world use cases? Perhaps you’re even working on building and deploying your own custom RAG system? We’d love to hear from you – join the Haystack Discord, where we talk about LLMs, retrieval augmentation, and much, much more.

Talk to YouTube Videos with Haystack Pipelines

Fri, 08 Sep 2023 00:00:00 +0000

You can use this Colab for a working example of the application described in this article.

In this article, I’ll be showing an example of how to leverage transcription models like OpenAI’s Whisper, so as to build a retrieval augmented generation (RAG) pipeline that will allow us to effectively search through video content.

The example application I’ll showcase is able to answer questions based on the transcript extracted from the video. I’ll use the video by Erika Cardenas as an example. In the video, she talks about chunking and preprocessing documents for RAG pipelines. Once we’re done, we will be able to query a Haystack pipeline that will respond based on the contents of the video.

Transcribing and Storing the Video

To get started, we first need to set up an indexing pipeline. These pipelines in Haystack are designed to be given files of some form (.pdf, .txt, .md and in our case, a YouTube link), and store them in a database. The indexing pipeline is also used to design and define how we would like files to be prepared. This often involves file conversion steps, some preprocessing, and maybe also some embedding creation and so on.

The way we design the components and structure of this pipeline will also be important for another type of pipeline we will create in the next section: The RAG pipeline, also often referred to as the query or LLM pipeline too. While the indexing pipeline defines how we prepare and store data, an LLM pipeline uses said stored data. A simple example of the impact an indexing pipeline has on the RAG pipeline is that depending on the model we’re using, we may have to chunk our files to be longer or shorter.

Reusability

The idea behind Haystack pipelines is that once created, they can be re-invoked when needed. This ensures that data is treated the same way each time. In terms of indexing pipelines, this means we have a way to keep our databases for RAG pipelines always up to date. In a practical sense for this example application, when there’s a new video we want to be able to query, we re-use the same indexing pipeline and run the new video through it.

Creating the Indexing Pipeline

In this example, we’re using Weaviate as our vector database for storage. However, Haystack provides a number of Document Stores which you can pick from.

First, we create our WeaviateDocumentStore:

import weaviate  
from weaviate.embedded import EmbeddedOptions  
from haystack.document_stores import WeaviateDocumentStore  
  
client = weaviate.Client(  
  embedded_options=weaviate.embedded.EmbeddedOptions()  
)  
  
document_store = WeaviateDocumentStore(port=6666)

Next, we build the indexing pipeline. Here, our aim is to create a pipeline that will create transcripts of YouTube videos. So, we use the WhisperTranscriber as our first component. This component uses Whisper by OpenAI, an automatic speech recognition (ASR) system which can be used to transcribe audio into text. The component expects audio files, and returns transcripts in Haystack Document form, ready to be used in any Haystack pipeline.

We also include preprocessing, as well as embedding creations in our pipeline. This is because when it’s time to create the RAG pipeline, we would like to do semantic search on the indexed files.

from haystack.nodes import EmbeddingRetriever, PreProcessor  
from haystack.nodes.audio import WhisperTranscriber  
from haystack.pipelines import Pipeline  
  
preprocessor = PreProcessor()  
embedder = EmbeddingRetriever(document_store=document_store,   
                              embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1")  
whisper = WhisperTranscriber(api_key='OPENAI_API_KEY')  
  
indexing_pipeline = Pipeline()  
indexing_pipeline.add_node(component=whisper, name="Whisper", inputs=["File"])  
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["Whisper"])  
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])  
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])

Next, we create a helper function that extracts the audio of YouTube videos, and we can run the pipeline, for this, we install the pytube package 👇

from pytube import YouTube  
  
def youtube2audio (url: str):  
    yt = YouTube(url)  
    video = yt.streams.filter(abr='160kbps').last()  
    return video.download()

Now, we can run our indexing pipeline with a URL to a YouTube video:

file_path = youtube2audio("https://www.youtube.com/watch?v=h5id4erwD4s")  
indexing_pipeline.run(file_paths=[file_path])

The Retrieval Augmented Generative (RAG) Pipeline

This part is certainly the fun part. We now define our RAG pipeline. This will be the pipeline that defines how we query our videos. Although RAG pipelines often are built for question-answering, they can be designed for a number of other use cases. What the pipeline does in this case, is largely defined by what prompt you provide the LLM. You can find various prompts for different use cases in the PromptHub.

The Prompt

For this example, we’ve gone with a commonly used style of question-answering prompts, although you can of course change this prompt to do what you want to achieve. For example, changing it to a prompt that asks for a summary might be interesting. You could also make it more general. Here we’re also informing the model that the transcripts belong to Weaviate videos.

You will be provided some transcripts from Weaviate YouTube videos.   
Please answer the query based on what is said in the videos.  
Video Transcripts: {join(documents)}  
Query: {query}  
Answer:

In Haystack, these prompts can be included in a pipeline with the PromptTemplate and PromptNode components.

While the PromptTemplate is where we define the prompt and the variables the prompt expects as inputs (in our case documents and query), the PromptNode is really the interface with which we interact with LLMs. In this example, we’re using GPT-4 as our model of choice, but you can change this to use other models from Hugging Face, SageMaker, Azure and so on.

from haystack.nodes import PromptNode, PromptTemplate, AnswerParser  
  
video_qa_prompt = PromptTemplate(prompt="You will be provided some transcripts from Weaviate YouTube videos. Please answer the query based on what is said in the videos.\n"  
                                        "Video Transcripts: {join(documents)}\n"  
                                        "Query: {query}\n"  
                                        "Answer:", output_parser = AnswerParser())  
  
prompt_node = PromptNode(model_name_or_path="gpt-4", 
                         api_key='OPENAI_KEY', 
                         default_prompt_template=video_qa_prompt)

The Pipeline

Finally, we define our RAG pipeline. The important thing to note here is how the documents input gets provided to the prompt we are using.

Haystack retrievers always return documents. Notice below how the first component to get the query is the same EmbeddingRetriever that we used in the indexing pipeline above. This allows us to embed the query using the same model that was used for indexing the transcript. The embeddings of the query and indexed transcripts are then used to retrieve the most relevant parts of the transcript. Since these are returned by the retriever as documents, we are able to fill in the documents parameter of the prompt with whatever the retriever returns:

video_rag_pipeline = Pipeline()  
video_rag_pipeline.add_node(component=embedder, name="Retriever", inputs=["Query"])  
video_rag_pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])

We can run the pipeline with a query. The response will be based on what Erika said in the example video we’re using 🤗

result = video_rag_pipeline.run("Why do we do chunking?")

The result I got for this was the following:

Chunking is done to ensure that the language model is receiving the most   
relevant information and not going over the context window. It involves   
splitting up the text once it hits a certain token limit, depending on   
the model or the chunk size defined. This is especially useful in documents   
where subsequent sentences or sections may not make sense without the   
information from previous ones. Chunking can also help in providing extremely   
relevant information when making queries that are specific to titles or   
sections.

Further Improvements

In this example, we’ve used a transcription model that is able to transcribe audio into text, but it is unable to distinguish between speakers. A follow up step I would like to try is to use a model that allows for speaker distinction. This would allow me to ask questions and in the response from the model, get an understanding of who provided that answer in the video.

Another point I would like to make is that this pipeline, which was for demonstration purposes, uses a light-weight yet quite effective sentence-transformers model for retrieval, and the default setting for preprocessing. More could definitely be done to find out what the best embedding model for retrieval would be. And taking inspiration from Erika’s video, chunking and preprocessing of the transcribed documents could be evaluated and improved.

To discover more about the available pipelines and components that would help you build custom LLM applications, check out the Haystack documentation.

Talk to Haystack Docs: Creating a Domain-Focused Q&A RAG Pipeline with WebRetriever

Mon, 04 Sep 2023 00:00:00 +0000

Ever since its introduction, WebRetriever has proven useful in the Haystack ecosystem. As its name implies, WebRetriever allows fetching documents from the Internet and channelling them into Haystack pipelines.

Under the hood, WebRetriever uses a search engine to look up relevant documents to retrieve from the web. Although users could customize the domain scope of the search queries even before (using the “site:” AND/OR syntax mixed with the actual query), it often felt like a workaround rather than a feature. Even worse – it created a barrier for users to exploit the WebRetriever capabilities fully. We needed to make it more intuitive and less “hacky”.

With the release of Haystack 1.20, WebRetriever can now limit searches to specific domains instead of searching the entire Internet. In this article, we’ll demonstrate how this small adjustment allows the easy creation of a specialized Q&A system that answers questions using documents only from a specific domain (Haystack docs, in this case).

Although we utilize Haystack docs as a use case in this article, we can easily envision many other use cases. Businesses can create RAG pipelines pointing WebRetriever to product-specific FAQs or troubleshooting guides from the product’s official website, thus effectively creating AI-powered assistants to answer customer queries. Students and researchers can scope WebRetriever to academic journals, arXiv, or PubMed to answer questions related to their field of study. Users can set the WebRetriever to collect product reviews or specifications from trusted websites when purchasing products. It takes a considerable amount of effort to exhaust all the potential use cases.

The Power of WebRetriever

The WebRetriever is a Haystack component that extracts relevant documents from the web. It leverages an instance of WebSearch to obtain search result links; then it fetches content from the links and extracts the raw text content as a Document list.

The WebRetriever supports three distinct modes of operation:

Snippets Mode: In this mode, the WebRetriever generates a list of Document instances, where each Document represents a snippet or a segment from a web page result. It’s important to note that this mode does not involve actual web page retrieval.

Raw Documents Mode: In this mode, the WebRetriever generates a list of Document instances, where each Document represents an entire web page (retrieved from the search result link) devoid of any HTML and containing only the raw text content.

Preprocessed Documents Mode: This mode is similar to the Raw Documents Mode but includes an additional step – the raw text from each retrieved web page is divided into shorter Document instances using a specified PreProcessor.

Although knowledgeable users could use Google “site:” search syntax to manually hack a query, the new “allowed_domains” WebRetriever init parameter empowers users to limit the search to specific websites easily, making their Q&A focused and reliable.

RAG pipeline setup

To better understand the building blocks of the “talk to the website” use case, let’s review each code section of the pipeline example [1]. Before assembling the RAG pipeline, we must set up the https://serper.dev API key and choose an LLM to use. It could be any LLM, but we’ll limit the choice to Anthropic Claude, OpenAI GPT models, and open-source models available on the free tier of Hugging Face Inference API.

import os
from typing import Dict, Any


search_key = os.environ.get("SERPERDEV_API_KEY")
if not search_key:
    raise ValueError("Please set the SERPERDEV_API_KEY environment variable")

models_config: Dict[str, Any] = {
    "openai": {"api_key": os.environ.get("OPENAI_API_KEY"), "model_name": "gpt-3.5-turbo"},
    "anthropic": {"api_key": os.environ.get("ANTHROPIC_API_KEY"), "model_name": "claude-instant-1"},
    "hf": {"api_key": os.environ.get("HF_API_KEY"), "model_name": "tiiuae/falcon-7b-instruct"},
}

Next, we define a simple prompt template to instruct LLM on how to construct the answer. Using the defined PromptTemplate, we also initialize PromptNode with additional parameters like API key, maximum answer length, streaming settings, etc.

from haystack.nodes import PromptNode, PromptTemplate

prompt_text = """
Synthesize a comprehensive answer from the provided paragraphs and the given question.\n
Focus on the question and avoid unnecessary information in your answer.\n
\n\n Paragraphs: {join(documents)} \n\n Question: {query} \n\n Answer:
"""
prompt_node = PromptNode(
    model["model_name"],
    default_prompt_template=PromptTemplate(prompt_text),
    api_key=model["api_key"],
    max_length=768,
    model_kwargs={"stream": stream},
)

We initialize the WebRetriever component with specific parameters like the search engine API key, allowed domains for fetching documents, and document caching. WebRetriever is now set to retrieve documents only from the domain “haystack.deepset.ai”.

from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes.retriever import WebRetriever

web_retriever = WebRetriever(
    api_key=search_key,
    allowed_domains=["haystack.deepset.ai"],
    top_search_results=10,
    mode="preprocessed_documents",
    top_k=50,
    cache_document_store=InMemoryDocumentStore(),
)

Next, we create our Haystack RAG pipeline. One by one, we add our components: WebRetriever, sampler, ranker, and finally, PromptNode as an answer generation. We add the pipeline nodes in a sequence, defining the data flow through the pipeline.

from haystack import Pipeline
from haystack.nodes.sampler import TopPSampler
from haystack.nodes.ranker import LostInTheMiddleRanker

pipeline = Pipeline()
pipeline.add_node(component=web_retriever, name="Retriever", inputs=["Query"])
pipeline.add_node(component=TopPSampler(top_p=0.90), name="Sampler", inputs=["Retriever"])
pipeline.add_node(component=LostInTheMiddleRanker(1024), name="LostInTheMiddleRanker", inputs=["Sampler"])
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["LostInTheMiddleRanker"])

Finally, we enter a loop that allows users to input their queries. Our example code runs these questions through the pipeline to produce answers which are then printed to the console. Super simple!

while True:
    user_input = input("\nAsk question (type 'exit' or 'quit' to quit): ")
    if user_input.lower() == "exit" or user_input.lower() == "quit":
        break
    if stream:
        print("Answer:")
    response = pipeline.run(query=user_input)
    if not stream:
        print(f"Answer: {response['results'][0]}")

Conclusion

Haystack’s WebRetriever enables users to quickly and effortlessly set up a highly-specialized Q&A system drawing content directly from the specified domains. Whether you are setting up an AI-powered Q&A assistant for your website, or if you are an engineer sifting through a list of arXiv publications, Haystack’s modular architecture lets you quickly set up a RAG pipeline that fits your particular use case.

We’ve only scratched the surface with this simple example, but we hope it encourages you to further explore Haystack RAG pipelines and components, allowing you to customize Haystack for your particular scenario. We look forward to seeing how the community utilizes WebRetriever and other Haystack components to address today’s complex information retrieval challenges.

References

[1] https://github.com/deepset-ai/haystack/blob/main/examples/talk_to_website.py

Generative Documentation Q&A with Weaviate and Haystack

Sat, 02 Sep 2023 00:00:00 +0000

You can use this Colab for a working example of the application described in this article.

Retrieval augmented generation is the golden child of LLM applications lately. The idea behind it is simple: LLMs do not know the entire world, least of all your specific world. But, with the use of retrieval techniques, we can provide the most useful pieces of information to an LLM so that it has the context with which to reply to queries that it otherwise would not have been trained to know about or answer.

This technique is now being used to power many search systems. In this article, we show how to build such a system using Haystack, an open-source LLM framework, and Weaviate, a vector database. Our resulting pipeline will answer queries about Haystack, and provide references to the documentation pages containing the answer.

A few weeks ago, a colleague and I built a custom component for Haystack: the ReadmeDocsFetcher. Haystack’s design is centered around small units called components. The idea behind the framework is to provide simple building blocks that allow you to create your own custom components beyond the ones provided inherently through the Haystack project. The Haystack documentation is hosted through ReadMe, so we designed this node to fetch requested documentation pages from ReadMe and process them in a way that can be used in a full LLM pipeline.

The Indexing Pipeline

Now we can get started building our pipeline. First, we create an indexing pipeline to write all the requested documentation pages on https://docs.haystack.deepset.ai to our Weaviate database. The nice thing about building indexing pipelines is that they can be reused. If there are any new pages, we can push them through our indexing pipeline to ensure that the dabase that we we use for our RAG pipelines is always up to date.

For this indexing pipeline, we make use of the custom built ReadmeDocsFetcher. Eventually we’ll want to do embedding retrieval so that we can get the most relevant documents from the database. So for this demo, we’re using a sentence-transformers model to create the vector representations of my documents.

Weaviate has a convenient feature called Weaviate Embedded that we can use here. It allows us to run a Weaviate database within Colab:

import weaviate  
from weaviate.embedded import EmbeddedOptions  
from haystack.document_stores import WeaviateDocumentStore  
  
client = weaviate.Client(  
  embedded_options=weaviate.embedded.EmbeddedOptions()  
)  
  
document_store = WeaviateDocumentStore(port=6666)

Once we have that, we initialize all of the components we’ll need for the final indexing pipeline:

from readmedocs_fetcher_haystack import ReadmeDocsFetcher  
from haystack.nodes import EmbeddingRetriever, MarkdownConverter, PreProcessor  
  
converter = MarkdownConverter(remove_code_snippets=False)  
readme_fetcher = ReadmeDocsFetcher(api_key=readme_api_key,   
                                   markdown_converter=converter,   
                                   base_url="https://docs.haystack.deepset.ai")  
embedder = EmbeddingRetriever(document_store=document_store,   
                              embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1")  
preprocessor = PreProcessor()

And then we simply build and run the pipeline. It will preprocess and create embeddings for all the documentation pages under https://docs.haystack.deepset.ai:

from haystack import Pipeline  
  
indexing_pipeline = Pipeline()  
indexing_pipeline.add_node(component=readme_fetcher, name="ReadmeFetcher", inputs=["File"])  
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["ReadmeFetcher"])  
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])  
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])  
indexing_pipeline.run()

The Retrieval Augmented Generative (RAG) Pipeline

Before we jump into the RAG pipeline itself, I’d like to talk about two key building blocks of the pipeline in isolation: the prompt and the LLM of choice.

As mentioned above, my aim here is to build a pipeline that can reference the documentation pages a reply comes from. Specifically, I want to get a URL for me to click and read in more detail. Whether a RAG pipeline can achieve this depends heavily on the instruction the LLM is provided. It also depends on whether the LLM itself is designed to consume such an instruction.

Here, we can say I ‘splashed out’. While you can use open-source LLMs with Haystack (from Hugging Face, hosted on SageMaker, locally deployed, the choice is really yours) I went with GPT-4. One of the main reasons for my choice is simply because, from experience, GPT-4 has been the most performant with the type of prompt (instruction) I intended to use for this application. That being said, please let me know if you have different observations 🙏

The Prompt

Here is the prompt we use for this demo. It asks for each retrieved document to be followed by the URL that it comes from. The URL of each document is present in the metadata of the documents that we wrote to the WeaviateDocumentStore 👇

You will be provided some conetent from technical documentation,   
where each paragraph is followed by the URL that it appears in.   
Answer the query based on the provided Documentation Content. Your answer   
should reference the URLs that it was generated from.   
Documentation Content: {join(documents,   
                             delimiter=new_line,   
                             pattern='---'+new_line+'$content'+new_line+'URL: $url',   
                             str_replace={new_line: ' ', '[': '(', ']': ')'})}  
Query: {query}  
Answer:

Notice how we’re structuring the prompt so that documents (which will be provided by the retriever when we add this to the pipeline) are separated from each other, and the content is always followed by the URL it comes from. We can do this because each document that we wrote into our database has the url in its metadata.

We use the prompt above to create a PromptTemplate called the **answer_with_references_prompt**.

from haystack.nodes import PromptTemplate, AnswerParser  
  
answer_with_references_prompt = PromptTemplate(prompt = """You will be provided some conetent from technical documentation, where each paragraph is followed by the URL that it appears in. Answer the query based on the provided Documentation Content. Your answer should reference the URLs that it was generated from. Documentation Content: {join(documents, delimiter=new_line, pattern='---'+new_line+'$content'+new_line+'URL: $url', str_replace={new_line: ' ', '[': '(', ']': ')'})}\nQuery: {query}\nAnswer:""", output_parser=AnswerParser())

You can explore other example prompts we’ve made use of, including a similar one for referencing on the PromptHub.

The Pipeline

Now to simply piece this all together. First, we define a PromptNode, which is the interface for interacting with an LLM.

from haystack.nodes import PromptNode  
prompt_node = PromptNode(model_name_or_path="gpt-4",   
                        api_key='YOUR_OPENAI_KEY',   
                        default_prompt_template=answer_with_references_prompt,   
                        max_length=500)

Finally we create our pipeline. It has two components. First, we reuse the same EmbeddingRetriever from earlier to retrieve relevant documents from the database. Second, we use the PromptNode to generate an answer based on these retrieved documents.

pipeline = Pipeline()  
pipeline.add_node(component = embedder, name = "Retriever", inputs = ["Query"])  
pipeline.add_node(component = prompt_node, name = "GPT-4", inputs=["Retriever"])

When we run it, this final pipeline will: retrieve some relevant documents, add them into the prompt we created, and then send the resulting complete prompt to GPT-4 for an answer.

For example:

pipeline.run("What are the optional installations of Haystack?", params = {"Retriever": {"top_k": 5}})

Resulted in the following complete prompt (abridged):

You will be provided some conetent from technical documentation, where each paragraph is followed   
by the URL that it appears in. Answer the query based on the provided Documentation Content.   
Your answer should reference the URLs that it was generated from.   
Documentation Content:   
---  
", "1-0": "all-gpu", "1-1": "Install Haystack and all optional dependencies, including developer   
tools with GPU support.", "2-0": "aws", "2-1": "Install Haystack and AWS SageMaker PromptNode support",   
"3-0": "colab", "3-1": "Install Haystack and all dependencies needed to run Haystack in Google Colab.",   
"4-0": "crawler", "4-1": "Install Haystack and all web crawling tools.",  
URL: https://docs.haystack.deepset.ai/docs/installation  
---  
[...]  
URL: https://docs.haystack.deepset.ai/docs/extras  
---  
[...]  
Query: What are the optional installations of Haystack?  
Answer:

When we ran this, the reply we got was the following:

Haystack offers several optional installations to support different use cases. These include:  
1. "all-gpu" to install Haystack with all optional dependencies, including developer tools with GPU support.  
2. "aws" for AWS SageMaker PromptNode support.  
3. "colab" to install all dependencies needed to run Haystack in Google Colab.  
4. "crawler" for web crawling tools.  
5. "dev" for all development tools needed by contributors.  
6. "ocr" for all OCR tools.  
7. "docstores" for all DocumentStores.  
8. "docstores-gpu" for all DocumentStores with GPU support.  
9. "faiss" for FAISS support for the FAISSDocumentStore.  
10. "faiss-gpu" for FAISS support for the FAISSDocumentStore with GPU.  
11. "inference" for all dependencies for model inference on local machine, including torch.  
12. "opensearch" for OpenSearch support for the OpenSearchDocumentStore.  
13. "elasticsearch", "elasticsearch7", and "elasticsearch8" for Elasticsearch support for the ElasticsearchDocumentStore.  
14. "pdf" for PyMuPDF for PDFToTextConverter.  
15. "pinecone" for Pinecone support for the PineconeDocumentStore.  
16. "preprocessing" for the basic preprocessing tools, such as langdetect for language identification and nltk for precise document splitting.  
17. "file-conversion" for all the dependencies for file conversion and parsing, like python-docx, tika, markdown.  
18. "ray" for Ray support for the RayPipeline.  
19. "weaviate" for Weaviate support for the WeaviateDocumentStore.  
  
Additionally, there are other components like "AnswerToSpeech" and "DocumentToSpeech" which are not installed with Haystack core but can be installed separately.  
  
For the latest release of Haystack and all its dependencies, you can use the command 'pip install farm-haystack(all)' or 'pip install farm-haystack(all-gpu)' for GPU-enabled dependencies.  
  
URLs:   
- https://docs.haystack.deepset.ai/docs/installation  
- https://docs.haystack.deepset.ai/docs/extras

Further Improvements

So far, we have only used a single retrieval technique. This could be considerably improved with a hybrid retrieval approach, which you can also implement with Weaviate and Haystack. This, in my opinion, would provide a healthier system that is intended for documentation search. While I am able to ask fully formed questions in this setup, I may want to provide a way for a user to simply search for ‘EmbeddingRetrievers’ for example.

In this article, we’ve seen how to build a simple RAG setup that uses a clever prompt to get replies with references to documentation. To discover more about the available pipelines and components that would help you build custom LLM applications, check out the Haystack documentation.

Enhancing RAG Pipelines in Haystack: Introducing DiversityRanker and LostInTheMiddleRanker

Tue, 29 Aug 2023 00:00:00 +0000

The recent improvements in Natural Language Processing (NLP) and Long-Form Question Answering (LFQA) would have, just a few years ago, sounded like something from the domain of science fiction. Who could have thought that nowadays we would have systems that can answer complex questions with the precision of an expert, all while synthesizing these answers on the fly from a vast pool of sources? LFQA is a type of Retrieval-Augmented Generation (RAG) which has recently made significant strides, utilizing the best retrieval and generation capabilities of Large Language Models (LLMs).

But what if we could refine this setup even further? What if we could optimize how RAG selects and utilizes information to enhance its performance? This article introduces two innovative components aiming to improve RAG with concrete examples drawn from LFQA, based on the latest research and our experience — the DiversityRanker and the LostInTheMiddleRanker.

Consider the LLM’s context window as a gourmet meal, where each paragraph is a unique, flavorful ingredient. Just as a culinary masterpiece requires diverse, high-quality ingredients, LFQA question-answering demands a context window filled with high-quality, varied, relevant, and non-repetitive paragraphs.

In the intricate world of LFQA and RAG, making the most of the LLM’s context window is paramount. Any wasted space or repetitive content limits the depth and breadth of the answers we can extract and generate. It’s a delicate balancing act to lay out the content of the context window appropriately. This article presents new approaches to mastering this balancing act, which will enhance RAG’s capacity for delivering precise, comprehensive responses.

Let’s explore these exciting advancements and how they improve LFQA and RAG.

Background

Haystack is an open-source framework providing end-to-end solutions for practical NLP builders. It supports a wide range of use cases, from question-answering and semantic document search all the way to LLM agents. Its modular design allows the integration of state-of-the-art NLP models, document stores, and various other components required in today’s NLP toolbox.

One of the key concepts in Haystack is the idea of a pipeline. A pipeline represents a sequence of processing steps that a specific component executes. These components can perform various types of text processing, allowing users to easily create powerful and customizable systems by defining how data flows through the pipeline and the order of nodes that perform their processing steps.

The pipeline plays a crucial role in web-based long-form question answering. It starts with a WebRetriever component, which searches and retrieves query-relevant documents from the web, automatically stripping HTML content into raw text. But once we fetch query-relevant documents, how do we make the most of them? How do we fill the LLM’s context window to maximize the quality of the answers? And what if these documents, although highly relevant, are repetitive and numerous, sometimes overflowing the LLM context window?

This is where the components we’ll introduce today come into play — the DiversityRanker and the LostInTheMiddleRanker. Their aim is to address these challenges and improve the answers generated by the LFQA/RAG pipelines.

The DiversityRanker enhances the diversity of the paragraphs selected for the context window. LostInTheMiddleRanker, usually positioned after DiversityRanker in the pipeline, helps to mitigate the LLM performance degradation observed when models must access relevant information in the middle of a long context window. The following sections will delve deeper into these two components and demonstrate their effectiveness in a practical use case.

DiversityRanker

The DiversityRanker is a novel component designed to enhance the diversity of the paragraphs selected for the context window in the RAG pipeline. It operates on the principle that a diverse set of documents can increase the LLM’s ability to generate answers with more breadth and depth.

An artistic interpretation of the DiversityRanker algorithm’s document ordering process, courtesy of MidJourney. Please note that this visualization is more illustrative than precise.

The DiversityRanker uses sentence transformers to calculate the similarity between documents. The sentence transformers library offers powerful embedding models for creating meaningful representations of sentences, paragraphs, and even whole documents. These representations, or embeddings, capture the semantic content of the text, allowing us to measure how similar two pieces of text are.

DiversityRanker processes the documents using the following algorithm:

It starts by calculating the embeddings for each document and the query using a sentence-transformer model.
It then selects the document semantically closest to the query as the first selected document.
For each remaining document, it calculates the average similarity to the already selected documents.
It then selects the document that is, on average, least similar to the already selected documents.
This selection process continues until all documents are selected, resulting in a list of documents ordered from the document contributing the most to the overall diversity to the document that contributes the least.

A technical note to keep in mind: the DiversityRanker uses a greedy local approach to select the next document in order, which might not find the most optimal overall order for the documents. DiversityRanker focuses on diversity more than relevance, so it should be placed in the pipeline after another component like TopPSampler or another similarity ranker that focuses more on relevance. By using it after a component that selects the most relevant documents, we ensure that we select diverse documents from a pool of already relevant documents.

LostInTheMiddleRanker

The LostInTheMiddleRanker optimizes the layout of the selected documents in the LLM’s context window. This component is a way to work around a problem identified in recent research [1] that suggests LLMs struggle to focus on relevant passages in the middle of a long context. The LostInTheMiddleRanker alternates placing the best documents at the beginning and end of the context window, making it easy for the LLM’s attention mechanism to access and use them. To understand how LostInTheMiddleRanker orders the given documents, imagine a simple example where documents consist of a single digit from 1 to 10 in ascending order. LostInTheMiddleRanker will order these ten documents in the following order: [1 3 5 7 9 10 8 6 4 2].

Although the authors of this research focused on a question-answering task — extracting the relevant spans of the answer from the text — we are speculating that the LLM’s attention mechanism will also have an easier time focusing on the paragraphs in the beginning and the end of the context window when generating answers.

LLMs struggle to extract answers from the middle of the context, adapted from Liu et al. (2023)[1]

LostInTheMiddleRanker is best positioned as the last ranker in the RAG pipeline as the given documents are already selected based on similarity (relevance) and ordered by diversity.

Using the new rankers in pipelines

In this section, we’ll look into the practical use case of the LFQA/RAG pipeline, focusing on how to integrate the DiversityRanker and LostInTheMiddleRanker. We’ll also discuss how these components interact with each other and the other components in the pipeline.

The first component in the pipeline is a WebRetriever which retrieves query relevant documents from the web using a programmatic search engine API (SerperDev, Google, Bing etc). The retrieved documents are first stripped of HTML tags, converted to raw text, and optionally preprocessed into shorter paragraphs. They are then, in turn passed to a TopPSampler component, which selects the most relevant paragraphs based on their similarity to the query.

After TopPSampler selects the set of relevant paragraphs, they are passed to the DiversityRanker. DiversityRanker, in turn, orders the paragraphs based on their diversity, reducing the repetitiveness of the TopPSampler-ordered documents.

The selected documents are then passed to the LostInTheMiddleRanker. As we previously mentioned, LostInTheMiddleRanker places the most relevant paragraphs at the beginning and the end of the context window, while pushing the worst-ranked documents to the middle.

Finally, the merged paragraphs are passed to a PromptNode, which conditions an LLM to answer the question based on these selected paragraphs.

LFQA/RAG pipeline — Image by author

The new rankers are already merged into Haystack’s main branch and will be available in the upcoming 1.20 release slated for the end of August 2023. We included a new LFQA/RAG pipeline demo in the project’s examples folder.

The demo shows how DiversityRanker and LostInTheMiddleRanker can be easily integrated into a RAG pipeline to improve the quality of the generated answers.

Case study

To demonstrate the effectiveness of the LFQA/RAG pipelines that include the two new rankers, we’ll use a small sample of half a dozen questions requiring detailed answers. The questions include: “What were the key events and influences that led to Renaissance; how did these developments shape modern Western culture?”, “What are the primary causes of climate change on both global and local scales?”, and more. To answer these questions well, LLMs require a wide range of historical, political, scientific, and cultural sources, making them ideal for our use case.

Comparing the generated answers of the RAG pipeline with two new rankers (optimized pipeline) and a pipeline without them (non-optimized) would require complex evaluation involving human expert judgment. To simplify evaluation and to evaluate the effect of the DiversityRanker primarily, we calculated the average pairwise cosine distance of the context documents injected into the LLM context instead. We limited the context window size in both pipelines to 1024 words. By running these sample Python scripts [2], we have found that the optimized pipeline has an average 20–30% increase in pairwise cosine distance [3] for the documents injected into the LLM context. This increase in the pairwise cosine distance essentially means that the documents used are more diverse (and less repetitive), thus giving the LLM a wider and richer range of paragraphs to draw upon for its answers. We’ll leave the evaluation of LostInTheMiddleRanker and its effect on generated answers for one of our upcoming articles.

Conclusion

We’ve explored how Haystack users can enhance their RAG pipelines by using two innovative rankers: DiversityRanker and LostInTheMiddleRanker.

DiversityRanker ensures that the LLM’s context window is filled with diverse, non-repetitive documents, providing a broader range of paragraphs for the LLM to synthesize the answer from. At the same time, the LostInTheMiddleRanker optimizes the placement of the most relevant paragraphs in the context window, making it easier for the model to access and utilize the best-supporting documents.

Our small case study confirmed the effectiveness of the DiversityRanker by calculating the average pairwise cosine distance of the documents injected into the LLM’s context window in the optimized RAG pipeline (with two new rankers) and the non-optimized pipeline (no rankers used). The results showed that an optimized RAG pipeline increased the average pairwise cosine distance by approximately 20–30%.

We have demonstrated how these new rankers can potentially enhance Long-Form Question-Answering and other RAG pipelines. By continuing to invest in and expand on these and similar ideas, we can further improve the capabilities of Haystack’s RAG pipelines, bringing us closer to crafting NLP solutions that seem more like magic than reality.

References

[1] “Lost in the Middle: How Language Models Use Long Contexts” at https://arxiv.org/abs/2307.03172

[2] Script: https://gist.github.com/vblagoje/430def6cda347c0b65f5f244bc0f2ede

[3] Script output (answers): https://gist.github.com/vblagoje/738253f87b7590b1c014e3d598c8300b

Hybrid Document Retrieval

Tue, 22 Aug 2023 00:00:00 +0000

Document retrieval is the art of extracting relevant documents from a corpus in response to an input. Like many language-related tasks today, it can benefit greatly from the dense, semantic embeddings produced by encoder models. These models have learned to embed documents in an abstract vector space that captures their content, allowing users to phrase their queries freely in natural language, rather than trying to match the exact keywords contained in a document.

Still, there are many scenarios where a more basic, keyword-based approach can outperform the semantic one. For example, keyword-based methods like BM25 have been shown to perform better in an out-of-domain setting.

That’s because dense encoder models need to be trained on data, and if that data is too different from the use case at hand, they’re likely to underperform. In practice, therefore, the keyword approach can yield better results for highly technical or “niche” use cases than an encoder model that hasn’t been fine-tuned.

Since both approaches have their advantages and disadvantages, it is a good idea to combine them. All you need is two retrievers and a way to merge their output. We also recommend using a ranker on top of them. As you’ll see, Haystack makes it easy to put all these components together in a modular and reusable pipeline.

Recap: what is document retrieval and why do we care?

In large-scale NLP systems, we often work with huge collections of documents. Now, what happens when we pass a query to such a system? It’s not really feasible to run a computationally demanding language model – like those used in generative AI or extractive question answering – on the entire corpus. Doing so would be a waste of resources and time. The solution? Retrievers.

Retrievers leverage different document search techniques to extract the right documents from a database. In document search, each document is represented as a single vector. Thus, the retrieval module only needs to process one embedding per document. This is a highly efficient way to pre-select the right documents for subsequent processing steps.

Retrievers are super useful in their own right, as they drive most search applications. But they’re most often discussed in the context of composite systems or pipelines. As a pre-selection mechanism, retrievers extract related documents in response to a query. They then pass those documents on to downstream tasks like extractive QA, generative AI (in what’s known as a RAG scenario) or summarization.

What types of retrievers are there?

Retrievers can be roughly divided into two categories: on the one side, there’s the sparse, keyword-based approach à la Tf-Idf, on the other side, the dense, embedding-based approach that uses Transformers.

Characteristics of the sparse approach

Sparse retrievers produce vectors whose length equals the size of the vocabulary. Because every document in the corpus only contains a fraction of all the words in the corpus, these vectors are typically sparse: long, with many zeroes, and only few non-zero values. The sparse retrieval algorithm most commonly used today is BM25, an improved version of the classic Tf-Idf.

Sparse embedding techniques are by definition lexical: they can only represent and match those words that are part of the vocabulary. They don’t require any training, making them language- and domain-agnostic.

Characteristics of the dense approach

The biggest difference to the sparse approach is that a dense retriever requires data and training. During training, the language model learns how to best embed documents as vectors from the data itself.

Dense retrievers produce shorter vectors than their sparse counterparts. These condensed vectors consist mostly of scalar values, which represent semantic features instead of lexical occurrences. The features, which the model acquires during training, are harder to interpret than the sparse ones.

Because dense retriever models have learned to represent their training data, they’re likely to perform poorly on any use case that is out of that data’s domain. For example, an embedding model trained on Wikipedia articles may fail to handle tweets properly: the language used is too different. Similarly, a model trained on financial data will not be able to do a good job with medical reports.

While it is always possible to fine-tune an existing model, this requires data, resources, and the know-how.

Combining dense and sparse retrievers

To address the weaknesses of both retrievers (or, more positively, to take advantage of both), we can simply use two retrievers in our pipeline and merge their output. Such a hybrid retrieval pipeline is easy to set up in a modular framework like Haystack.

A hybrid retrieval pipeline in Haystack

Haystack uses modular pipelines and nodes to ensure straightforward customization. A basic retriever pipeline consists of an input (the query), a retriever node, optional additional nodes, and an output.

But we can easily customize the pipeline to contain two retriever nodes instead of only one. We just need to remember to combine the documents retrieved by both methods in a meaningful way. To do this, we’ll first use a JoinDocuments node:

Both retrievers return ranked lists of results with relevance scores attached to each document (note that, since they employ different scoring techniques, the scores are not really meaningful in a hybrid retrieval setting). You can use different methods for joining these results lists. Which one you use – and whether you add another module on top of the JoinDocuments node – depends on your use case.

Concatenation: all documents (sans duplicates) are simply appended to the final list of results. This method is sufficient if you plan to use all the results and don’t care about their order. This could be the case, for example, in an extractive question answering pipeline. Concatenation can also be used in combination with a powerful ranking model – more on that later.
Reciprocal rank fusion (RRF): this formula reranks the documents returned by both retrievers, giving priority to those which appear in both results lists. Its purpose is to push the most relevant documents to the top of the list. It is useful if the order of your results is important, or if you only want to pass on a subset of your results to the next node.
Merging: documents are ranked according to the scores returned by the retrievers. This method is useful if you want to prioritize the results from one retriever over another – and if the relevance scores from retrievers are comparable. For example, if you want to join documents from two different dense retrievers because they return documents from different document stores. This option is not useful for hybrid retrieval.

Depending on your application, you now have the option to add an intermediate ranking step after merging the documents. This is the most sophisticated method for ordering documents returned by two different retrievers. For example, if your pipeline uses a generative LLM or a summarizer as the next node, you can re-rank your documents to ensure that the most relevant ones are at the top for better results.

Adding a SentenceTransformersRanker node produces the most relevant results, while also adding some latency. It uses a powerful cross-encoder trained to determine the relevance of a document to a given query – similar to the dense retriever’s embedding model, but with a slightly different architecture.

Unlike the embedding model used for retrieval, the ranker can only feasibly process a small number of documents, making it particularly suitable as a scoring mechanism after the retriever. Not only does it put the results of both retrievers in a more meaningful order, but it also standardizes the relevance scores of the documents, making it possible to use these scores for further downstream tasks.

Superior search with Haystack

For a hands-on experience with hybrid retrieval, check out this tutorial from Nicola, one of our community members. Follow along in Colab or in your IDE to see how you can build a hybrid retrieval pipeline in just a few lines of code!

Haystack is the framework of choice for developers looking to build powerful, yet easily customizable natural language search systems that leverage state-of-the-art language models from any source.

Join our friendly Discord community for help with questions regarding Haystack and open source NLP in general, as well as interesting discussions around the latest LLMs. Let’s build something amazing! 🚀

Shaping Haystack 2.0

Mon, 14 Aug 2023 00:00:00 +0000

Since Haystack v1.15, we’ve been slowly introducing new components and features to Haystack in the background in preparation for Haystack 2.0 (or v2). After the work we’ve put into the new design of the Haystack API over the last few months, we’re at a point where we would love to start involving the Haystack community in our thought process and slowly gather your input and feedback. In this article, we would like to highlight where we are at for the design of the new Haystack API for v2, what we want to achieve with the new design, and what our current considerations are.

❓ What does the new 2.0 version mean?

Haystack v2 will be a major update to the design of Haystack nodes and pipelines. We believe that the pipeline concept is a fundamental requirement and an optimal fit for building applications with LLMs. Therefore, Pipelines and Nodes will continue to be the foundation of Haystack v2. However, the general pipeline structure, Nodes API, and the connection between DocumentStore and Retrievers will change. So, this is going to be a breaking change for Haystack users.

🏆 Motivation behind Haystack 2.0

At deepset, we put a lot of thought and care into maintaining Haystack as a robust, user-friendly, and production-ready LLM framework. As we have collected feedback from the Haystack community over the years and observed the advancements in the NLP field, such as LLMs and Agents, we see the need to update the pipeline structure with Haystack v2 to better align with our users’ needs and state-of-the-art NLP approaches.

When ready, Haystack v2 will introduce many improvements, flexibility and, most importantly, it will allow Haystack users to implement customizations and extensions to Haystack much more easily. The new pipeline structure will allow for more flexible, robust, and powerful pipelines. As we change the pipeline structure, we’ll be adapting all components to the new structure, therefore, rewriting many of them. This update gives us the opportunity to enhance the pipeline structure to better make use of LLMs, improve our Agent and Memory implementations, better define the connection between the DocumentStore and Retriever, and so on.

📍 Current status of Haystack 2.0

Haystack v2 is still a work in progress. We are defining the requirements for a more powerful and robust LLM framework with continuous feedback from the community, and we’re implementing the new Haystack API so that it’s aligned with the advances in NLP.

Although still in beta, you can find what’s been implemented so far in the preview package of the Haystack repository. To learn how and when components will be migrated, have a look at the Migrate Components to Pipeline v2 roadmap item, where we keep track of issues and PRs about Haystack v2. For a detailed overview of the current state of v2, check out Sara’s presentation about Haystack v2.

Additionally, here is the complete list of proposals so far shaping the design of Haystack v2:

⭐ Highlights of Haystack 2.0

Pipeline Nodes will be now called Components.
The new pipeline structure will provide better support for LLMs. The flexible connection between components will introduce new mechanisms, such as parallel branching and looping, that extend the capabilities of pipelines. Components will control the input and output of the pipeline. Thus, components with dynamic input parameters, such as those that use prompts with variables, will easily integrate into the pipeline. Overall, these refinements will not only improve the linear workflows but also ensure that pipelines seamlessly align with the nature of LLMs.

Here is what a RAG pipeline might look like in Haystack v2.👇🏼

Keep in mind that the components are still work in progress and being discussed in the “ LLM Support in Haystack 2.0” proposal.

Representation of a RAG pipeline in Haystack v2
The Components API will change. Components will define the name and the type of all of their inputs and outputs. The new API will reduce complexity and make it easier to create custom components such as Haystack integrations for third-party APIs and databases. The connections between components will be validated before query time, and Haystack will generate better error messages with instructions on fixing the errors.
Retrievers will be customized for DocumentStore, not for retrieval methods. Each DocumentStore will have its own Retriever, highly specialized for that specific DocumentStore, handling all its requirements without being bound to a generic interface. Integrating a new DocumentStore will be easier, and the specialized Retriever will be able to adapt more quickly to the new features of the DocumentStore.
The Embedder will be a separate component instead of being a part of a Retriever. Retrievers won’t be responsible for creating embeddings, the new Embedder component will handle the creation of embeddings. The Retriever class will be simplified, and adding support for new embedding providers and approaches will be more straightforward.
Pipeline serialization will be more flexible and optimized for humans. JSON, TOML, HCL will be used as serialization formats. Serialization and deserialization of pipelines sharing the same component instance will be possible.

➡️ What’s next?

As we iterate on Haystack 2.0, we’ll share the design proposals in GitHub Discussion: Shaping Haystack 2.0 and start a conversation about topics where we need your input. As we share more information about Haystack 2.0, please feel free to share your feedback or concerns. You can always contact us using the comments section in the GitHub Discussion or the Haystack Discord server to ask questions.

Question Answering in the Cockpit

Wed, 26 Jul 2023 00:00:00 +0000

Large language models (LLMs) are incredibly powerful, and we at deepset are stoked about their seemingly limitless potential. But while models like Llama 2 and GPT-4 continue to make waves, a related but less buzzworthy technology has consistently been delivering great value for a range of use cases.

Based on smaller, open-source Transformer models, extractive question answering (QA) is one of the most compelling knowledge management techniques to emerge from NLP in recent years. As a result, organizations are beginning to recognize the opportunities that extractive QA can bring to large knowledge bases.

One such company is Airbus. Using our open-source framework Haystack for applied NLP, they built a complex QA system to extract answers from their extensive manuals for pilots, which are used both in training and during flight. What makes this use case particularly interesting is that the manuals contain not only text but also a lot of tables. Thankfully, modern QA systems can handle those too, using a special technique known as table question answering (Table QA).

This case study describes how Airbus’ artificial intelligence research unit investigated how NLP could help human agents find solutions in critical situations, based on their technical documentation. They used Haystack to solve the complex problem of processing both text and tables. The result, a composite text and table QA system, will make work easier and more efficient for different teams in the company.

The problem

Aircraft are complex machines, and flying them requires a high degree of mental and physical fitness. Airbus’ Flight Crew Operating Manual (FCOM) serves to address all the issues a pilot might encounter during training or even while flying the airplane. Questions such as “What is the standard procedure when the cockpit windshield has been cracked by a bird?” or “Under which weather conditions is it unsafe to land?” are all answered in the manual, which spans thousands of pages.

Modern commercial aircraft like the Airbus A320 are actually quite good at detecting issues through the use of sensors, and handling them in an automated fashion according to protocol. But sensors can fail, and there are problems that can’t be detected by sensors, such as a passenger being sick. In those cases, it is crucial for pilots to find the solution in the minimum amount of time.

Currently, Airbus pilots are using a digitized version of the manual, which they can search using basic keyword-based functionalities. On this blog, we have written extensively about keyword searches, and how they are surpassed by semantic search systems in many respects.

A major downside of the keyword-based approach is that users need to know the exact words that are used to describe an issue. Users can sometimes struggle to find the exact combination of words that will lead them to the correct result, which can waste valuable time, especially in critical situations. This meant that Airbus’ existing information retrieval system had significant potential for improvement in terms of both speed and relevance.

It should be stressed that the current system works quite well for most of the experienced pilots, who have worked with it for many years, and have gotten very good at finding the information they need. But the AI team at Airbus wanted to see if it was possible to use newer, more advanced approaches to build a more intuitive system with a flatter learning curve that would return more accurate information within a shorter time.

In stressful situations especially, pilots need to get to the info as quickly as possible to be able to react in time. We wanted to assess whether a deep learning-based system for advanced question answering could reduce the retrieval time. — Alexandre Arnold, AI research unit, Airbus

Like many technical documents, the FCOM contains a lot of tables. During the project’s initial phase, the Airbus team realized that tables play such an integral part in the pilots’ information extraction process that their system wouldn’t be complete without a module that could do question answering on tables. Luckily, TaPas — Google’s popular Table QA model — does just that. Having worked with deepset before and knowing of our expertise in working with TaPas-based models, the AI team at Airbus found it easy to decide on Haystack as the framework of choice for implementing a QA system for pilots that could handle both text and tables.

What does it mean to do question answering on tables?

Tables contain a lot of information in an extremely condensed format. They differ from text in that they come in a predictable, structured form, and can typically provide answers to many different questions. For instance, we could ask “What is the maximum crosswind for takeoff when the ground is frozen?” and a human agent could, with perhaps some initial difficulties, retrieve the answer from the following table. Then we could go on to ask questions about other actions and weather conditions, all equally answerable by looking at this table alone.

Table QA models like TaPas can do much the same. They combine the deep language model’s capacity for understanding the semantics of natural language (such a model would be able to grasp that the word “frozen” in our question refers to the condition of “ice” in the table) with the additional knowledge about how to navigate a table’s rows and columns to retrieve the right cell. That combination results in a powerful technology able to extract information from thousands of tables much more quickly than even the most experienced human could ever accomplish.

Even more than that, these models can do aggregations on tables — just like a SQL query, but in natural language. For instance, we could ask “What is the maximum velocity when the flaps are in the first position?” and Table QA would return “230 kt” — the larger of the two values associated with flaps lever position 1 in the table below.

If you want to learn more about Table QA, be sure to check out our tutorial and blog post on the topic. But of course, Airbus’ actual task is much more complex than the use cases illustrated so far. First of all, the final system wouldn’t know whether the answer to a given query could be found in a table or in a piece of text. And second, it would have to retrieve the relevant table or text from the collection of FCOM pages itself.

Data annotation and preparation

For a machine learning model to really shine, it has to be adapted to the specific data that it will encounter in the real world. This is especially true for technical domains like the aerospace industry. So the first step for Airbus was to annotate some training data that could be used for fine-tuning both the general language model (for handling text) and the TaPas model (for handling tables).

Annotation, for QA, consists of coming up with questions and highlighting the text passages that answer them. For instance, we could ask “How should the crew handle illness in the cabin?” and a possible answer span could be “In case of a sick passenger, first verify if there is a medical doctor on board.” The variation in wording and sentence structure makes sure that the system learns to match questions and answers on the basis of meaning, rather than the words used.

To annotate the first set of data, the team organized an internal hackathon and invited operations experts with deep knowledge of the documentation — some of whom had even helped write the manual. These experts were invited to come up with questions that are representative of what a pilot would ask in a real-world scenario — like the one in the previous paragraph — and identify the answers in the manual.

For the annotation of the tabular data, on the other hand, the team made use of the fact that tables follow a much more structured format than natural language. They came up with a set of 240 question templates, which they filled with different values, matching each question to its answer in a semi-automated fashion. For instance, given the tables above, one could come up with questions like “What is the maximum speed of the aircraft?” or “What is the maximum crosswind for landing?”

Finally, one of the biggest sources of confusion in FCOMs is the fact that different problems and solutions can be described in very similar terms. Context is key in that case, and is often provided by headers or captions. For instance, different sections could talk about “engine failure,” but which specific engine was concerned would be clarified by the chapter or section header.

Using the traditional, keyword-based approach, an experienced pilot could quickly determine which engine a text is talking about by considering additional information found on the page. However, since the language model can only parse text passages of about 200 words at a time, that additional information would have to be included elsewhere. Thankfully, Haystack’s nodes — the modular building blocks which make up the final NLP system — make it easy to include and retrieve such information in the form of metadata that is added to a textual or tabular data point during preprocessing.

The solution

The final system uses a retriever-reader pipeline — a classic architecture for question answering. Because the computationally expensive question answering model itself shouldn’t be run on the entire corpus, such a pipeline uses a retriever. It preselects the documents that to be processed by the reader, which contains the QA model. In Airbus’ case, however, the team actually used two separate retriever-reader pipelines (one for text and one for tables), which consisted of a single entry point for the query and then were joined at the end.

This complex setup is needed because, when a pilot asks a question, the system doesn’t know whether the answer to the query will be found in a table or in text. It therefore runs the query through both the Table QA pipeline and the “classic” text QA pipeline. The results are then joined in the final node, which picks the correct answer based on the confidence values attached to the results.

Results and outlook

The team at Airbus is excited about the results, especially about the performance of the table section of the pipeline: “These documents are so long and have so much information encoded in tables,” Alexandre explains. “Being able to pinpoint the right cell in a table within more than a thousand pages, and on that basis, provide the right answer in less than one second: that is a big value.”

While the system in its current form may still be too experimental for critical “life and death” situations, it can already be used for many other applications, such as in engineering and pilot training.

Alexandre and his team had started the project to find out whether NLP technologies could be used in highly technical domains with strict accuracy requirements — or whether the quest for “browsing documentation in a more natural way,” as Alexandre puts it, was still out of reach. After concluding this phase of the project, Alexandre is confident:

The results are promising. While we still need to work on the robustness of such systems and the operational performance before any imaginable deployment, we now see the value and the potential of the technology. We want to make sure that we keep up to speed with the space of NLP, so that our organization can leverage it from day one once it’s really perfectly useful for operations. This is really a gold mine!

Take your information retrieval system to new heights

As an extremely active field of research, NLP is coming up with new, ingenious solutions for real-world applications all the time. You can use natural language to retrieve documents, tables, and even images: there are fewer and fewer limits to what kind of information these systems can process.

If you want to see the results of such a system in action, get started with Haystack today.

We’re always happy to see new members join our community of applied-NLP enthusiasts. Head over to our community page to join the deepset Discord server or find us on GitHub.

Good Listener: How Memory Enables Conversational Agents

Fri, 07 Jul 2023 00:00:00 +0000

Memory is what turns a powerful LLM into an empathetic interlocutor who can remember what you’ve said before. But how does it work?

Today we want to lift the hood on how memory is implemented in Haystack. We’ll explain the differences between memory injection and memory as a tool and show you how to get around the context window’s length limitation by summarizing.

If you want to understand how memory works computationally or start building your own conversational AI interface with Haystack, this article is for you.

Why memory matters

For an exchange of words to feel like a conversation, both parties require some kind of memory: they need to remember what’s been said before, and they need to be able to refer back to salient entities and concepts by using, for example, pronouns.

Consider the following dialogue from the movie “Booksmart:”

Amy: Name one person whose life was so much better because they broke a couple of rules.

Molly: Picasso.

Amy: He broke art rules. Name a person who broke a real rule.

Molly: Rosa Parks.

Amy: Name another one.

Molly: Susan B. Anthony.

Amy: God dammit.

This conversation only works because Amy and Molly know what was said previously. It’s clear from the context that the pronoun “he” refers back to Picasso, and when Amy says, “Name another one,” she doesn’t have to spell out that she means “another person who broke the rules.”

Large language models and memory

By default, LLMs don’t have an inbuilt notion of memory. As far as an out-of-the-box LLM is concerned, every prompt it receives is the beginning of an entirely new interaction.

Agent without Memory

For a conversation to feel natural, an advanced chatbot needs to have some context beyond the current question-answer pair. This property is known as either memory or history.

Agent with Memory

To make conversational agents aware of previous exchanges, different implementations of memory exist.

Memory injection for Agents

A Conversational Agent in Haystack is a basic chatbot that can generate answers to user queries through an LLM. Memory is passed on to the prompt along with every new query.

As the diagram shows, the memory stores each answer by the Agent along with the user’s query. In this implementation, memory is simply a list. Every new exchange between human and AI is appended to the list so it can be injected back into the next prompt.

To illustrate this, let’s look at a dialogue with multiple exchanges. I used OpenAI’s gpt-3.5-turbo for these examples, the same model that powers ChatGPT. (For the full tutorial on how to set up a conversational agent with Haystack, have a look at our tutorial.)

First, we import the PromptNode and initialize it with the model’s name, the API key from OpenAI, and the maximum length for the node’s output:

from haystack.nodes import PromptNode  
  
model_name = 'gpt-3.5-turbo'  
prompt_node = PromptNode(model_name, api_key=MY_API_KEY, max_length=256)

We can now set up our conversational agent (aka chatbot) by passing the prompt node to it. By default, this initializes a memory, which is injected back into the user’s prompt with each new call.

from haystack.agents.conversational import ConversationalAgent  
  
conversational_agent = ConversationalAgent(prompt_node=prompt_node)

Let’s now ask the model a few questions to accumulate some memory:

conversational_agent.run("Who was Pablo Picasso?")  
  
>>> Pablo Picasso was a renowned Spanish artist who is considered one of the most influential artists of the 20th century. He is known for co-founding the Cubist movement and  for his contributions to Surrealism. His work encompasses various styles and mediums, including painting, sculpture, ceramics, and collage.

We can easily verify that the memory injection works by referring to the painter with a pronoun in our next query:

conversational_agent.run("Where did he live?")  
  
>>> Picasso was born in Malaga, Spain, but he spent most of his adult life in France, particularly in Paris. He also had periods of residency in Barcelona and various other locations throughout Europe.

Interesting! But now to the burning question:

conversational_agent.run("How can I learn to paint like him?")  
  
>>> To learn to paint like Picasso, you can start by studying his works and understanding his artistic style. Analyze his use of color, form, and composition. Experiment with different techniques and mediums that he used, such as Cubism or collage. Practice regularly and don't be afraid to take risks with your art.

We can now take a closer look at the agent’s memory by calling:

print(conversational_agent.memory.load())  
  
>>> Human: Who was Pablo Picasso?  
AI: Pablo Picasso was a renowned Spanish artist who is considered one of the most influential artists of the 20th century. He is known for co-founding the Cubist movement and  for his contributions to Surrealism. His work encompasses various styles and mediums, including painting, sculpture, ceramics, and collage.  
Human: Where did he live?  
AI: Picasso was born in Malaga, Spain, but he spent most of his adult life in France, particularly in Paris. He also had periods of residency in Barcelona and various other locations throughout Europe.  
Human: How can I learn to paint like him?  
AI: To learn to paint like Picasso, you can start by studying his works and understanding his artistic style. Analyze his use of color, form, and composition. Experiment with different techniques and mediums that he used, such as Cubism or collage. Practice regularly and don't be afraid to take risks with your art.

Evidently, all our exchanges with the model have been stored in memory. It also becomes clear why, using this solution, you can quickly run into problems. As the conversation between human and AI goes on and the list of previous exchanges in the memory grows, the LLM’s context window may become too small. When the prompt starts to overflow the context window, the quality of the model’s output suffers.

Cost is another factor: companies like OpenAI charge users on the basis of tokens, so you’ll want to keep your prompts short. As a solution, you can summarize your memory periodically.

Summarizing memory

By using a separate model to summarize your Agent’s memory at regular intervals, you keep your memory manageable. In Haystack, you can initialize a conversation summary memory that generates summaries of the last few exchanges between human and AI. In our example, we’ll simply use the same model for summarizing as for the conversational agent – so we initialize the memory with our prompt node again:

from haystack.agents.memory import ConversationSummaryMemory  
  
summary_memory = ConversationSummaryMemory(prompt_node)

Note that this memory class uses a specific prompt template, which instructs the model to condense the conversation. We’ll see later how we can adjust those instructions ourselves. Let’s initialize the conversational agent again, this time with our summary memory:

from haystack.agents.conversational import ConversationalAgent  
  
conversational_agent = ConversationalAgent(prompt_node=prompt_node, memory=summary_memory)

After running this agent with the same questions as before, we take a look at the memory again:

print(conversational_agent.memory.load())  
  
>>> Pablo Picasso was a highly influential Spanish artist known for his contributions to various art movements. He lived in Spain and France.  
Human: How can I learn to paint like him?  
AI: To learn to paint like Picasso, you can start by studying his works and understanding his artistic style. Analyze his use of color, form, and composition. Experiment with different techniques and mediums that he used, such as Cubism or collage. Practice regularly and don't be afraid to take risks with your art.

We can see that by using the summary, we’ve managed to condense our memory considerably. Interestingly, the model has decided to keep the last exchange almost in its entirety.

By default, the node creates a summary after three exchanges, but you can change that by adjusting the summary_frequency parameter. You can even set that parameter to 1 to create a summary after every exchange. In the next example, we do exactly that, and we also tell the model to keep the memories extra short (ten words at most). To that end, we import the prompt template class and specify our requirements in a custom template, which we then pass on to the summary memory.

from haystack.agents.memory import ConversationSummaryMemory  
from haystack.nodes import PromptTemplate  
  
my_template = PromptTemplate("Create a short summary (max 10 words) of the following chat transcript by shortening and summarizing the content without losing important information:\n{chat_transcript}\nCondensed Transcript:",)  
summary_memory = ConversationSummaryMemory(prompt_node, prompt_template=my_template, summary_frequency=1)

How does our memory look after three exchanges with the new prompt template?

print(conversational_agent.memory.load())  
  
>>> Picasso: Influential Spanish painter known for Cubism and innovation.Picasso lived in Spain, France, and the United States.Study Picasso's style, experiment with materials, take art classes.

We’ve been able to considerably shorten our memory, while preserving all the important information!

The example uses a proprietary model from OpenAI. Nevertheless, you can also use our conversation summary memory class with open-source models from the Hugging Face model hub. A good choice is this BART-large model, which was designed specifically for summarizing conversations. Note, however, that older models can’t follow instructions like LLMs do. To make it work, you can use the following code:

summary_node = PromptNode('philschmid/bart-large-cnn-samsum', max_length=256, model_kwargs={"task_name": "text2text-generation"})  
template = PromptTemplate("{chat_transcript}")  
summary_memory = ConversationSummaryMemory(summary_node, prompt_template=template)

Note that while the above solutions manage to keep the memory much shorter than the accumulative version we used initially, they still result in a growing memory, because the summaries are appended. This open issue aims to solve this problem by summarizing the entire memory, thus allowing you to keep it at a stable length throughout the entire conversation.

Memory for Conversational Agents with Tools

The Conversational Agent lets you set up a basic chatbot functionality. But it’s nowhere near using all the capabilities an Agent offers. Agents are so powerful because they can use an LLM’s reasoning skills for dynamic solution planning: given a set of tools, agents can use those tools at their own discretion to arrive at the optimal final answer.

Agents keep track of their thinking process – the solution strategy which leads them to their final answer – in a “transcript.” If you don’t want to pass both a memory record and a transcript to the conversational agent, you can simply implement memory as another tool.

In this setup, the Agent, upon receiving a prompt, first consults the memory tool to see whether it can provide any context for answering the user’s query. The Agent’s output is then stored in memory along with the query after every iteration, as in our examples before.

Memory as a tool opens up a range of possibilities. You’re free to configure it in the way that works best for you. Perhaps you want to keep track of an enormous amount of human-AI interactions in a memory database. You could then serve your memory through a sophisticated tool like an extractive QA pipeline or a summarization pipeline.

In the extractive QA pipeline solution, your conversational agent sends a query to your memory database, and the pipeline seeks to return the most relevant snippets from the memory that help contextualize the user’s query. In the summarization pipeline solution, the memory tool returns a summary of relevant documents instead. Implementing memory as these tools again helps you avoid bumping into the LLM’s token limit.

Don’t forget to follow us

As a company, deepset is invested in bringing LLMs to every application. We also like sharing the latest insights on all things NLP with our community. If that sounds good to you, follow us on Twitter.

Our OSS framework Haystack comes with all the tools needed to build state-of-the-art systems with LLMs. Head over to the Haystack GitHub repo to learn more.

To see what other people are building with Haystack, or if you need help with your own projects, join our server on Discord. It’s a friendly community managed by our Devrel advocates, where you can meet others with an interest in NLP.

Share and Use Prompts with PromptHub

Thu, 29 Jun 2023 00:00:00 +0000

With the release of Haystack 1.18, we’ve also officially rolled out a new prompt sharing and collaboration service: PromptHub. This new product currently (day of release: 29 June 2023) hosts a set of prompts we at deepset have created, with instructions on how to use them with the PromptTemplate and PromptNode.

Prompts that are provided in PromptHub are all maintained in their own public repository on GitHub, deepset-ai/prompthub. Each prompt comes with a YAML file, housing the prompt itself, and a “prompt card” of the same name, which is a markdown file explaining the intended use case of the prompt and how to use it with a Haystack PromptNode.

In the following sections, we will walk through how you may use prompts hosted on the PromtHub, as well as your own local prompts.

Using Prompts on PromptHub via the API

The easiest way to use prompts from the PromptHub is to simply copy over their identifier, and add this into the prompt field in a PromptTemplate.

Copy the identifier of a prompt from the PromptHub. For example, let’s try deepset/summarization and add this as your prompt in a PromptTemplate

from haystack.nodes import PromptTemplate

my_summarizer_prompt = PromptTemplate(prompt="deepset/summarization")

Each prompt on the PromptHub will have instructions on how they are intended to be used, including instructions on what the best output_parser for that prompt might be.

For example, the deepset/question-answering prompt is best used with an output_parser as described on the PromptHub.


from haystack.nodes import PromptTemplate, AnswerParser

question_answering_template = PromptTemplate("deepset/question-answering", output_shapers=AnswerParser())

Once you’ve created your PromptTemplate, you can use it in a PromptNode by setting it as the default_prompt_template or even setting it as the prompt_template in a single prompt() call.

Once you’ve pulled a prompt from the hub, the file will be cached on your machine. So you do not have to worry about making an API call each time.

Creating and Using Local Prompts

You always have the option to create your own prompt and provided it as the prompt to a PromptTemplate. For example:

from haystack.nodes import PromptTemplate

my_prompt_template(prompt="Answer the following query: {query}")

But now, you can also make use of and maintain your own local collection of prompts as YAML files.

Each prompt on the PromptHub is actually a YAML file containing a short description, the name of the author, a tag describing the main task it covers, as well as the prompt text itself. For example, below is the content of the deepset/question-answering prompt.

description: A simple prompt to answer a question given a set of documents
meta:
  authors:
  - deepset
name: deepset/question-answering
tags:
- question-answering
text: "Given the context please answer the question. Context: {join(documents)};\n\nQuestion: {query}; \nAnswer:\n"
version: 0.1.1

Similarly, you can create your own files that hold prompts. For example, here’s a file called my_yoda_prompt.yaml

description: A simple prompt to answer any question in the style of Yoda
meta:
  authors:
  - deepset
name: tuana/yoda-qa
tags:
- question-answering
text: "Answer the following query in the style of Yoda. Query: {query} \n Answer:"
version: 0.1.1

Which can be used in a PromptTemplate with the following code:

from haystack.nodes import PromptTemplate

yoda_prompt = PromptTemplate(prompt="path/to/my_yoda_prompt.yaml")

Just in case you were curious: When asked ‘What is the meaning of life’ with this prompt, GPT-4’s result was “42, the meaning of life is. But truly, in one’s own understanding and experience, find the meaning of life, one must. Hmmmm.”

Contributing to the PromptHub

If you’ve created a prompt that you would like to share with the Haystack community and the world, all you have to do is create a pull request that includes your YAML file with the prompt, and a Markdown file with your prompt card explaining how to use it 🥳. Once you’ve done this, your prompt will be available to everyone with the name you give it. A good idea is to prefix it with your name, like we’ve done with deepset/ prompts.

What Is a Large Language Model (LLM)?

Fri, 23 Jun 2023 00:00:00 +0000

AI is here to stay. But while applications like autonomous driving and even image generation have only touched few people’s lives so far, generative AI in the form of high-quality chatbots has taken the world by storm. The informative, creative, and at times deceptively eloquent responses by the likes of ChatGPT are made possible by a novel technology known as large language models (LLMs).

In this article, we’ll talk about what LLMs are and how they’re produced, what kinds of LLMs exist, and whether it’s truly just their size that sets them apart from other language models. In the end, we’ll show you how you, too, can use LLMs in Haystack, our open source framework for NLP.

What is a large language model?

As with most groundbreaking technologies, LLMs didn’t come about overnight. Rather, they are part of a long-term trend in natural language processing that has been happening over the last few years. Since the introduction of BERT, state-of-the-art language models have followed the same architectural paradigm, which saw them develop an ever-better understanding of the intricacies and nuances of natural (that is, human) language.

Developers took that new-found ability for processing language to the next level by building ever larger language models, which require an incredible amount of computing power during training: the LLM was born. In addition to their size (which is calculated based on their number of trainable parameters), these models also have to ingest huge amounts of data when they’re trained.

BERT has 110M parameters, BERT-large 336M. The largest GPT-3 model has 175B. The exact size of GPT-4, OpenAI’s latest LLM, is unknown, but it is estimated at above 1 trillion.

So technically, a large language model is simply, well, a language model that is large. In practice, though, the term LLM has taken on a more specific meaning: it is often used to refer to those large models that are able to generate coherent, human-like output in response to instructions by a user. In this article, therefore, we’ll concentrate on the models that satisfy this definition, and ignore other LLMs.

How are LLMs created?

Like most modern machine learning models, LLMs are created by feeding a complex neural network architecture with data representing the domain the model needs to learn. Hence, computer vision models ingest image data, while language models need to see textual data during training. Have a look at our entry-level discussion of language models to learn more.

Pre-training an LLM

In the first training phase, the LLM learns a representation of the data through text completion. Upon seeing a sequence of words, its training objective is to generate the most likely next word. Through this technique, it learns to emulate our own linguistic intuition. However – and here’s where the “large” part truly makes a difference – while small models arrive at a stage where they can form well-formed, grammatical responses in the language they’re trained on, LLMs go much further. These models have so many parameters that they can easily learn a representation of the world knowledge contained in the data.

Let’s look at some examples to make that point clearer. Here’s the kind of intuition a small language model learns compared to that of an LLM:

When instructed correctly, the large model can complete text in a way that produces coherent and informative answers, translations, summaries, and much, much more. That is why prompting or prompt engineering plays such an important role when working with these models. Have a look at our article on LLM prompting to learn more.

Evidently, an LLM doesn’t only know how language works. It also knows how the world works – at least the part of it that is represented in the training data. This training data consists of huge collections of texts gleaned from the internet (the Common Crawl corpus, for example) and other sources.

So far, we only talked about the first step of training language models – be they small, medium-sized, or large. However, in the case of ChatGPT and other instruction-following LLMs like it, the developers included additional complex training steps to arrive at these models’ impressive conversational abilities.

Fine-tuning an LLM

To recap, a pre-trained LLM can complete text prompts, using not only linguistic intuition but also knowledge about the world. However, that text completion can still take many different forms. Imagine we input the prompt “Explain the moon landing to a 6 year old.” To the pre-trained model, the following two outputs are equally good completions:

“Many years ago, a group of people first landed on the moon…”
“Explain gravity to a 5 year old.”
(Example adapted from OpenAI)

For us who know what a conversation looks like, the second answer sounds as though the model is trolling us. So during the fine-tuning phase, researchers need to teach the LLM the notion of a proper exchange of questions and answers. To that end, they create a dataset of human-generated answers in response to a set of prompts and fine-tune the model on that data. This step is called “supervised finetuning” (SFT).

The LLM has now acquired a skeletal understanding of human language-based interactions. But as we all know, such interactions are not always appropriate, especially when you’re trying to emulate a neutral agent in the form of a chatbot. Because the model has ingested a large amount of the internet during training – complete with chat forums and other unvetted content – it is still prone to generating answers that are offensive, unqualified, or even harmful.

To create a truly helpful chatbot, the developers of ChatGPT therefore introduced a second training step that uses “reinforcement learning” – a classic method from machine learning – to guide the model towards more optimal answers. Based on human rankings of different answers to the same prompt, they taught a secondary model to distinguish between good and bad outputs.

So in the final step towards ChatGPT as we know it, this secondary model is used to teach the LLM to better match users’ expectations and provide answers that are informative, neutral, and politically correct. This last fine-tuning step is known as “reinforcement learning from human feedback” (RLHF).

The world of LLMs: size vs. speed

While the release of ChatGPT marked the point where the general public learned about large language models and their abilities, LLMs were being developed well before that – and not just by OpenAI. Google’s LaMDA (137 billion parameters), NVIDIA and Microsoft’s MT NLG (530 billion parameters), and the Chinese BAAI’s model WuDao 2.0 (1.75 trillion parameters) all operate on the principle that bigger is better.

All of these are proprietary models: some of them can be used via an API (and for a fee), while others are still entirely gatekept. This runs counter to a widespread standard in NLP, by which a language model’s trained weights are shared for fair general use. (For example, Google’s BERT and Meta’s RoBERTa were open-sourced from the start and could be used out of the box or fine-tuned by anyone.)

A closed-source model, on the other hand, is even more of a black box than other neural nets: no one knows exactly which data went into these models or how their output is filtered. In the case of GPT-4, it’s not even clear how big the model actually is. What’s more, being trained on large collections of publicly available internet data, these models are only made possible by the practice of data-sharing in the first place.

But the NLP community responded to the companies’ proprietary aspirations with lightning speed, training and open-sourcing its own large language models. You can take a look at the Open LLM Leaderboard on Hugging Face or the Chatbot Arena, which hides the predictive models’ identities during experiments, to find out about the best performing models in the space.

Open-source models are able to improve so much faster because they can build on each other’s successes. LLMs like Falcon, Vicuna, and Alpaca are trained transparently and on publicly available datasets. And though they are much smaller than their closed-source counterparts, the quality of these models’ output is beginning to catch up.

LLMs + Haystack = 🚀

Naturally, LLMs have made an enormous impact on the world of NLP. Entirely new areas of research like prompt engineering and hallucinations have sprung up, generating heated debates and an increased speed of development that brings innovations every day.

Haystack, our OSS Python framework for NLP, lets you build your own LLM-powered systems by incorporating agents, prompt nodes, and your large language models of choice – be they closed- or open-source – within user-facing applications.

Thanks to Haystack’s modular structure, changing one line in your code is all that’s needed to switch between models. Experiment with various models on different platforms to see which ones work best for your use case. And once the next powerful OSS LLM comes along, you can simply plug it into your Haystack pipeline and start using it right away. With Haystack, you are ready for the future 🚀

Do you like learning about all things NLP? We host a server in Discord where we answer questions and organize regular discussions about interesting topics from the world of natural language processing. Make sure to stop by – and if you’re interested in the latest research about LLMs, check out the nlp-discussions channel while you’re there!

We hope to welcome you to our growing community of NLP enthusiasts soon! 🙂

The Beginner's Guide to LLM Prompting

Thu, 15 Jun 2023 00:00:00 +0000

Large language models (LLMs) have made it possible for everyone to interact with an AI – not through code but using prompts in natural language. The fact that language now acts as an interface to complex models makes it necessary to investigate the prompts we use more closely.

When used correctly, generative models can produce highly valuable results for businesses. Getting your prompts right is therefore necessary to harness LLMs’ enormous potential, especially when you’re looking to incorporate it into your product. That is why an entire industry has formed around the topic of “prompt engineering.” In this post, we’ll explain our approach to this technique and share the dos and don’ts of prompting.

What is a prompt?

A prompt is an instruction to an LLM. If you have interacted with an LLM like ChatGPT, you have used prompts. Ideally, a prompt elicits an answer that is correct, adequate in form and content, and has the right length.

Essentially, prompting is about packaging your intent in a natural-language query that will cause the model to return the desired response.

How to come up with good prompts

Good prompts follow two basic principles: clarity and specificity. Clarity describes the use of simple, unambiguous language that avoids jargon and overly complex vocabulary. So rather than keeping your queries short and snappy, better take the long-winded way to make your point sufficiently clear to the LLM.

Example of an unclear prompt:

Who won the election?

Example of a clear prompt:

Which party won the 2023 general election in Paraguay?

Specificity, on the other hand, refers to the need for context. Tell your model as much as it needs to know to answer your question. Sometimes, this can veer almost into storytelling territory.

Example of an unspecific prompt:

Generate a list of titles for my autobiography.

Example of a specific prompt:

Generate a list of ten titles for my autobiography. The book is about my journey as an adventurer who has lived an unconventional life, meeting many different personalities and finally finding peace in gardening.

Let’s look at a few tricks to make our prompts even better.

Prompt engineering hacks

Prompting is not a science, and calling it “engineering” may be a slight stretch. But researchers and users of LLMs have identified some tricks that consistently improve the models’ responses.

Do say “do,” don’t say “don’t”

This point actually continues our specificity theme because “do” instructions are by nature more specific than “don’t”. So rather than telling the model what not to do, it’s usually better to specify exactly what we want it to do – if we know what that is.

Let’s imagine that, in our previous example, we want to make sure that the LLM doesn’t produce overly long titles (as those models tend to do). Rather than saying:

Don't make the titles too long

it is better to be specific:

Each title should be between two and five words long.

Use few-shot prompting

In our example, we asked for a particular item (the title for an autobiography) of a certain length (two to five words). This type of instruction without any concrete examples is known as “zero-shot prompting.”

Most models, however, benefit from “few-shot prompting,” where you feed your model some examples to learn the desired pattern. In our prompt, we can add a few existing titles as examples to help the model get a feel for what we’re looking for.

Here are some examples: Long walk to freedom, Wishful drinking, I know why the caged bird sings.

Structure your prompt in a meaningful way

Elements like quotation marks, bullet points, line breaks, etc. make it easier for humans to parse a text. And guess what? The same is true for LLMs. Let’s apply that insight to our previous example and spell out the full prompt once more.

Generate a list of ten titles for my autobiography. The book is about my journey as an adventurer who has lived an unconventional life, meeting many different personalities and finally finding peace in gardening. Each title should be between two and five words long.

### Examples of great titles ###  
      
- “Long walk to freedom”  
- “Wishful drinking”  
- “I know why the caged bird sings”

Now that we’ve perfected our prompt, what does the LLM’s output look like? Here’s what ChatGPT came up with in response:

“Seeking Skies, Finding Roots” is hilarious and I would absolutely consider reading that book.

Use leading words

Sometimes, the simplest techniques are the most effective. Researchers from Tokyo University have shown that telling a model to “think step by step” before its response can generate more accurate results and help the model correct its own mistakes. This method is known as using “leading words,” because we gently guide the model towards a more effective plan for problem-solving.

The model is forced to break down its solution into multiple, more manageable steps rather than being allowed to just hazard a guess. Seeing how this simple trick improves the model’s performance is pretty striking. Here is an example of a math puzzle that requires just a bit of recursive thinking:

The LLM confidently responds – with the wrong answer. Here’s what happens when we ask it to “think step by step:”

And this time, the final answer is correct.

Prompting pitfalls

In addition to these techniques for improvement, it helps to know which elements in an instruction can throw the model off. Some people, for example, try to incorporate requests for short, one-word answers into their prompts to prevent the models from rambling.

However, as the last example has shown, wordier answers might not be a bad thing when it comes to LLMs – but rather, a technique these models use to arrive at a correct response. LLMs may soon learn to keep their thought processes to themselves. In the meantime, we should let them ramble and refrain from requests for one-word answers.

Another problem happens when you exceed the model’s context length. Due to the nature of these large Transformer models, an LLM’s “context window” (which is the portion of text it can process at once) comprises both the prompt and its response. If you exceed a model’s context length, its output will deteriorate.

While this is a fascinating topic, it is also a fairly complex one – context lengths vary widely between models, and it isn’t even trivial to measure the length of your prompt. Just remember that if the length of your prompt becomes an issue, you should look into breaking the problem down into smaller steps.

Become a better prompter

This article leans heavily on Vladimir Blagojevic’s webinar on “Prompting LLMs Using Haystack.” In it, Vladimir, who is a member of the core engineering team here at deepset, takes a deep dive into the art of prompting and shows how Haystack, our open source framework for NLP, offers just the right tools to get prompting right.

For even more tips on how to come up with great prompts, have a look at our guidelines for prompt engineering.

Prompts also play a huge role in Agents: the latest fascinating trend that lets you harness the power of LLMs. Agents receive a sort of “super-prompt” that instructs the LLM to break its reasoning into manageable steps and delegate those to the tools it deems most capable of solving the tasks. Have a look at our blog post on Agents to learn more.

Finally, you don’t always need to write your own prompts. Head over to our PromptHub, where you find prompts for all kinds of applications, which you can use out of the box or tune to your liking, to get the best answers from your LLM.

Run Haystack Pipelines in production with Ray Serve

Wed, 14 Jun 2023 00:00:00 +0000

Ray Serve is a library built on top of the Ray framework for building online inference APIs. Serve is designed to be framework-agnostic, and while its simple design lets you quickly integrate pretty much any Python logic you need to deploy, building up complex inference services is still possible and straightforward.

Haystack pipelines can be quite complex, but even the simpler ones consist of multiple components, which in turn might rely on different models and technologies - this aspect make them a good benchmark to test out Ray Serve’s capabilities.

Serve a Retriever - Reader QA pipeline

Prepare a set of documents

Let’s begin by serving locally a Haystack pipeline that’s simple but non-trivial: a question-answering system working on an existing set of documents. For simplicity, we’ll use a Docker image running Elasticsearch and populated with a set of documents about countries and capitals that can be queried right ahead. So before we start, let’s run the image:

docker run -p 9200:9200 -p 9300:9300 -d deepset/elasticsearch-countries-and-capitals:latest

The Elasticsearch instance will accept connections on localhost through the port 9200.

Run the Haystack pipeline

Haystack pipelines can be defined through Python code or a yaml definition that is eventually and transparently converted into Python code. We’ll use the latter format, so open an editor and save the following definition in a file called pipeline.yml

version: ignore

# define all the building-blocks for a Pipeline
components:    
  - name: DocumentStore
    type: ElasticsearchDocumentStore
    params:
      host: localhost

  - name: Retriever
    type: DensePassageRetriever
    params:
      document_store: DocumentStore
      top_k: 5

  - name: Reader
    type: FARMReader
    params:
      model_name_or_path: deepset/roberta-base-squad2
      context_window_size: 1000
      return_no_answer: true

pipelines:
  - name: query
    nodes:
    - name: Retriever
      inputs: [Query]
    - name: Reader
      inputs: [Retriever]

To perform a query, we need a bit of Python code to read the yaml configuration above and tell Haystack to run the resulting pipeline:

from pathlib import Path

from haystack import Pipeline

config = Path("pipeline.yml")
pipeline = Pipeline.load_from_yaml(config, "query")
answer = pipeline.run("What is the capital of France?")

print(f"Answer: {answer}")

As you can see, each time we want to make a question we need to rebuild the pipeline and call the run method - let’s make an inference service out of this and query it over HTTP.

Run the Haystack pipeline through Ray Serve

A Ray Serve deployment looks exactly like a regular HTTP server from the caller perspective, and to provide an “ingress” endpoint the bare minimum requirement is writing a Python class similar to the following:

@serve.deployment
class MinimalIngress:
  async def __call__(self, request: Request) -> str:
      name = await request.json()["name"]
      return f"Hello {name}"

The presence of an async def __call__ method is the only contract we need to fulfill in order to spawn a Ray Serve deployment. To create a deployment that will run a Haystack pipeline, let’s create a file called capitals.py and add the following code:

from pathlib import Path

from ray import serve
from starlette.requests import Request
from haystack import Pipeline

@serve.deployment
class HaystackQA:
    def __init__(self):
        # Load the pipeline from file and store it in self._pipeline
        # so we can reuse it every time __call__ is invoked.
        config = Path("pipeline.yml")
        self._pipeline = Pipeline.load_from_yaml(config, "query")

    async def __call__(self, request: Request) -> str:
        query: str = str(await request.body())

        res = self._pipeline.run(query=query)
        answers = res.get("answers", [])
        if answers:
            # Return the first answer
            return answers[0].answer
        return ""

haystack_deployment = HaystackQA.bind()

A couple of things to note about the code above:

We added a __init__ method where we initialize the pipeline once and for the whole lifetime of the deployment.
Every class decorated with @serve.deployment gets a bind method that tells Ray Serve to attach our HaystackQA class to a deployment object that we name haystack_deployment. In a moment, when we will start the server process, Ray Serve will pick up the value of this variable.

The content of the Python file capitals.py is now all we need to “serve” our deployment. Ray Serve can be run from Python directly, but for this example we’ll show how to start the process from the command line, which is closer to what we would do in a production use case. From a shell where the Python environment containing ray is active, you can just run:

serve run capitals:haystack_deployment

You should see several log lines scrolling down in the shell, and if everything went well, a final message Deployed Serve app successfully. will tell us that the deployment is ready to accept connections - we can make a query now. We will use curl but really any HTTP client will work the same. From the command line:

curl -X POST -d "What is the capital of Italy?" http://localhost:8000

It might take a while to get a response the very first time, while Haystack downloads the model locally, but in the end you should see the response Rome.

Scale the deployment

What we did so far doesn’t look much different from putting a pipeline.run() call behind a REST API, but Ray Serve shines when it’s time to put our pipeline in production. For example, let’s say we want to scale our HaystackQA horizontally to serve multiple requests at once. All we need to do is to change the Python code like this:

@serve.deployment(num_replicas=3)  # this is the only line to change!
class HaystackQA:
    def __init__(self):
			...

We can now stop the server and restart it with the same serve run command: that’s it, we know have 3 instances of our HaystackQA deployment as you can see from the Ray control panel running at http://127.0.0.1:8265/#/actors

In case we want to adapt the number of deployments to the current traffic so we can optimize costs, Ray Serve offers autoscaling out of the box. All we need to change is again the @serve.deployment decorator like this:

@serve.deployment(
    autoscaling_config={
        "min_replicas": 1,
        "initial_replicas": 2,
        "max_replicas": 5,
        "downscale_delay_s": 30,
    }
)
class HaystackQA:
    def __init__(self):
			...

The autoscaling_config parameters are self-explanatory and the docs are very helpful in detailing the logic behind that. In this case we tell Ray Serve that we want to scale up our deployment to up to 5 deployment replicas and scale down to at least 1. We also want to start with 2 replicas and downscale if there’s no activity for 30 seconds. Once again you can observe how this works behind the scenes by starting the server and looking at the dashboard on http://127.0.0.1:8265/#/actors.

What’s next

The deployment strategy we introduced in this article is ok for many use cases ranging from trying out Haystack to demo applications and small POCs, but close readers will note the limit of scaling up and down a whole Haystack pipeline: certain nodes of a pipeline might benefit more than others of horizontal scaling and redundancy, can we do that with Ray Serve? The answer is yes, and we’ll see in an upcoming article how we can split a pipeline into multiple deployments that can be then managed by Ray Serve with different strategies.

Generative vs. Extractive Language Models

Mon, 22 May 2023 00:00:00 +0000

Generative language models like ChatGPT have taken the world by storm. Aside from their value in areas like programming and translation, generative models’ main selling point is their ability to come up with fluent, well-formed natural language responses. Like humans, these models seem to generate their answers out of thin air, and without looking up the answers in a factual database.

This sets them apart from their humbler siblings: extractive language models. Both extractive and generative models are based on the breakthrough Transformer architecture that ushered in a new generation of natural-language understanding (NLU) for machines.

However, extractive models differ from generative models in that they explicitly need a context to extract information from at querying time, and they return that information as direct quotations from the source, whereas generative models are trained for generating language and capable of writing new text from scratch. Compared to the fluency of their generative counterparts, extractive models like RoBERTa and ELECTRA can therefore seem much less capable.

But it is becoming increasingly clear that generative models suffer from their own set of problems, like their size, the fact that many of them are proprietary (which isn’t ideal for everyone), and most significantly, their tendency to make things up. What’s more, extractive models have a much better track record in real-world applications — for example, in private or public semantic search engines, or in information extraction systems.

So if you’re considering the use of a modern language model in your own project, this article is for you. We’ll talk about the strengths and limitations of extractive and generative language models, and show you which use cases benefit most from which kind of model.

The popularity of generative language models

In November 2022, OpenAI made its latest large language model — ChatGPT — available to the general public. No coding skills were needed to use it — suddenly everyone was able to “chat” with a highly performant language model and experience for the first time how far these models have come in recent years. This has led to a wave of GPT-related content that variously explains, hypes, and warns against this technology.

However, the foundation of these models are not new. GPT models are built up on Transformer architecture first introduced by Google in 2017 — years before ChatGPT arrived on the scene. The new Transformer architecture quickly gained widespread attention and became the foundation for numerous subsequent advancements in NLP.

Generative models can have any number of parameters, but to yield the impressive results we’ve seen from ChatGPT, it seems that number needs to be almost inconceivably huge — the models in the GPT-3 series have 175 billion parameters, while GPT-4 is said to have more than a trillion. In terms of training data, too, more is more for generative models: not only are they trained on vast amounts of data, but they also seem to store a representation of the facts in that data (rather than just using it to build their linguistic intuition), which leads to an impression of omniscience.

Closed-source language models like those by OpenAI are only accessible through an API and not always available for general download. But there are also impressive open source generative models shared widely on platforms like the Hugging Face model hub. Open source generative models like Vicuna and OpenAssistant are catching up quickly though, generating results that could soon dictate the state of the art.

How do extractive models differ from generative ones?

In terms of architecture, broadly speaking, extractive models like those based on BERT belong to the same family of language models as their generative counterparts, since both use the Transformers. However, the main distinction lies in the usage of these models. Unlike generative models, extractive models require both a query and a specific piece of text as the foundation for their response. It then encodes the query as a high-dimensional, semantically informed vector exactly like a generative model does. But rather than generating an answer from scratch based on the given context as generative models do, the extractive model marks the section in the text document that, according to the model’s prediction, provides the best answer to the query.

So what’s the use of such a model, when you need to provide it with the document that contains the answer? Well, for one, it can be very useful when you need to extract the same kind of information from multiple documents: a classic information extraction scenario. But even more significantly, applied NLP has come up with an ingenious pipeline paradigm to allow extractive models to operate on a large collection of documents, too. In a Retriever-Reader pipeline, a faster model pre-selects the best document “candidates,” which are then given a closer read by the extractive model to find the best answer possible.

Their purely extractive property means the model can only provide answers that quote verbatim from a text. Abstraction, paraphrasing, and the formation of well-formed answers are not in the repertoire of this model family.

Because of their extractive nature, these models have no use for storing factual information, and therefore do not necessarily benefit from having more parameters. Compared to their generative cousins, extractive models are therefore usually much smaller in size and require less training data.

Their relative smallness also makes it easier to share these models. In fact, the most popular models are all available on the model hub, where everyone can download them either for direct use or for fine-tuning on a more specific dataset.

The problem with generative models — and how to combat it

A recurring problem with generative models is that they sometimes make up information and treat it as ground truth. This behavior is commonly referred to as “hallucinations” and the main reason for this is the model’s ability to produce language rather than extracting the answer from a given text.

Hallucinations can occur when the model has insufficient information to answer a question accurately, when it receives the same prompt multiple times, or when it is prompted to come up with an argument that runs counter to commonly accepted facts. The last case shows how blurry the notion of hallucinations is. After all, it is perfectly common to make up facts: in fiction, for example, in hypothetical examples used for illustrative purposes, or in writing assignments for school.

A problem arises, however, when a user expects factual knowledge, and instead receives a hallucination without realizing it. Considering that generative models have been proposed for various tasks such as teaching, programming, and writing, the significant presence of hallucinated content in ChatGPT’s output makes it necessary to rethink their applicability in the real world.

While several methods have been proposed to detect hallucinations amongst a generative model’s output, we at deepset often use the approach of improving the reliability of these models through retrieval augmentation.

Retrieval augmentation: feed the ground truth to your generative model

In retrieval-augmented question answering (and other applications), we don’t rely solely on the facts contained within the parameters of the model itself. Instead, we make use of the customizable nature of modern NLP systems, and connect the generative model to a database of curated documents — much like we saw earlier in the extractive QA setup.

When the model is prompted for an answer, it then uses that database as its factual basis, while still relying on its amazing ability to understand and produce natural language. As an interim step, we use a Retriever to retrieve the most relevant documents in our database and embed it into our query to the model. A system with human-like conversational capabilities and a fact-checked knowledge base: retrieval augmentation gets you the best of both worlds.

Connecting your generative model to a database doesn’t only combat hallucinations; it also helps with accuracy more generally. The information stored in the language model’s parameters becomes outdated quickly, but you can make sure to update the documents in your database on a regular basis.

However, there are also downsides to this method when combined with a proprietary third-party model.

Some companies are wary of sending the potentially sensitive data in their documents to an external service. For these organizations, it may be preferable to host the generative model locally rather than use a third-party provider.

Generative or extractive, which one is better?

Now that we’ve gained a good sense of what characterizes generative and extractive language models, it’s time to look at the best use cases for each type of model. The decision depends on the character of your application itself, as well as your technical requirements and financial limitations.

When to use generative models

Generative models are necessary when you need your application’s output to sound like a natural response that could have been produced by a human. That’s why these models are popular for chatbots, or as writing aides in areas like copywriting and software documentation.

Another type of application that calls for generative models is any application that includes paraphrasing. As illustrated above, extractive models can only copy content from a knowledge base, while generative ones produce completely new utterances from scratch. This is indispensable in translation, for instance (in fact, the very first Transformer was built for translation purposes), and extremely useful in other abstractive use cases like summarization.

Regarding the financial and technical requirements, keep in mind that, as of now, the most high-performing generative models aren’t open-source. OpenAI, for example, charges a fee for the use of its models through its API. You need to consider how this will impact your budget, especially in case of unexpected spikes in requests to your app.

Some developers might also be uncomfortable with the notion that they’re not in charge of the model’s controls, or of the data used to train or fine-tune it. Retrieval augmentation can help with this final point — as long as you’re ok with sending your internal documents to an external service.

When to use extractive models

Extractive models are commonly used in question answering. The most typical application of this technique is thus in use cases where the user wants to receive answers verbatim from a large collection of documents. It can, for example, be used to extract answers from technical documents in a short amount of time. As a rule of thumb, the more sensitive the application, the higher the need for textual faithfulness, and the lower the tolerance for paraphrases.

But question answering can mean much more than just that. In fact, it is a technique often used to extract information from texts, which can then be used further downstream in an application. Let’s say you have, for instance, a collection of business reports and want to identify the main entities involved in certain operations. You can use an extractive model to extract those answers from the text, make sure they contain only named entities such as persons, organizations, or places, and use those entities to populate a knowledge graph.

In short, extractive models are the way to go when you’re interested in the information itself, rather than fluent answers; when you want to use them in downstream tasks; and when your application requires textual faithfulness. Researchers have also found that extractive models operate better on out-of-domain data than generative ones.

In terms of technical requirements, the big difference compared to generative models is that you usually host them on your own infrastructure — giving you both full control and full responsibility over them. While these models are available on the model hub for free, it’s useful to have access to one or more GPUs for fine-tuning, and to be able to allocate resources to data-centric tasks like annotation and qualitative evaluation.

Finally, keep in mind that not all Transformer models for NLP fit into the extractive and generative categories! In fact, there’s a whole barrage of language models that don’t return any answers at all, but that instead match or classify documents on the basis of their semantic content. These models, which are faster and less resource-consuming than the models discussed in this article, can be used in applications like semantic search — whether public-facing or business-internal — as well as semantic FAQ search.

Generative and extractive: Haystack has it all

Whether you’re looking to build applications based on generative or extractive language models: Haystack, our open-source framework for applied NLP, makes handling models of either family a breeze. With our composable pipelines and hands-on tutorials, you can build your own prototype in no time. To get started, why not have a look at our components for extractive and generative question answering?

Are you interested in exchanging tips and opinions on generative vs. extractive QA, large language models, or other topics in NLP with like-minded people? Come join our Discord community and visit us on GitHub :)

How to Prevent Prompt Injections: An Incomplete Guide

Fri, 19 May 2023 00:00:00 +0000

ChatGPT is awesome, HuggingChat is awesome, Alpaca is awesome. However, if you want to use these models in an application, for example, for your customer support, you may encounter a new problem that you should be aware of: Prompt injections.

This guide showcases approaches to handling prompt injections. It also includes a brief overview of the first public prompt injection datasets and one of the first pre-trained prompt injection detection models available on Hugging Face that you can use to combat attacks against your system.

What are prompt injections?

Prompt injections are prompts that trick a generative language model into writing something the model providers clearly did not intend,for example., hate speech. We can roughly distinguish two types of prompt injection: goal hijacking & prompt leakage.

Both types can pose serious threats to organizations or individuals trying to leverage generative AI:

Goal hijacking: AI systems, like any powerful tool, can be misused. If a model can be tricked into generating harmful or inappropriate content, it can be weaponized against individuals or communities, leading to real-world harm. The organization hosting the model might face serious public backlash. In extreme cases, they might even violate laws.
Prompt leakage: As prompt engineering is developing into a novel discipline, companies might lose competitive advantages if their model inadvertently leaks sensitive information. Depending on the architecture of the AI system, prompt leakage could also violate privacy norms and laws.

How can we handle injections?

The first step to improve resilience against prompt injections is to improve the robustness of the internal prompt that is added to the user input. Let’s look at an example from a retrieval augmented question-answering system:

By putting the user input into curly brackets, separating it by additional delimiters, and adding text after the input, the system becomes more robust against prompt injections. Other potential safety measures according to Perez & Ribeiro (2022) include setting a lower temperature and increasing the frequency penalty. Additionally, since elaborate prompt injections may require a lot of text to provide context, simply limiting the user input to a reasonable maximum length makes prompt injection attacks a lot harder.

However, these measures may not suffice in many cases. So, what else can we do?

Let’s detect injections!

Ideally, we filter all prompt injection attempts even before being passed to our generative model. This will not only help prevent injection attacks but also save us money as a classifier model is typically a lot smaller than a generative model.

Building the dataset

To train an injection classifier, we first assembled a novel dataset of 662 widely varying prompts, including 263 prompt injections and 399 legitimate requests. As legitimate requests, we included various questions and keyword-based searches. We then split the data into train and test datasets.

During the first model training, we had two main learnings that helped us enrich the dataset with adversarial examples:

Including translations. Simply switching the language of the prompt injection may cause security measures being bypassed. To prevent that, we included translations of both prompt injections as well as legitimate requests. We avoided leakage by making sure that each prompt and its respective translation always remain within the respective training or test set.
Including stacked prompts. Initially, we were able to fool our first models into thinking a prompt was legitimate by combining both legitimate and injection prompts in one prompt. This was especially true if the legitimate part was a prompt that the model had seen during training. Consequently, we included adversarial examples by randomly stacking legitimate and injection prompts within the training and test set, respectively.

The dataset is available on Hugging Face: https://huggingface.co/datasets/deepset/prompt-injections

Training the model

We fine-tuned the popular state-of-the-art DeBERTa base model using the transformers library and Google Colab. The resulting model achieves 99.1% accuracy on our holdout test set, only failing in one edge case.

You can find and try the model on Hugging Face: https://huggingface.co/deepset/deberta-v3-base-injection. It’s fully open-source.

Embedding the model in an AI system

Once you have a classification model, there are multiple ways you can put it into production. For example, you can use this model within the TransformersQueryClassifier from Haystack as a filter in a QA AI system:

Final remarks

Despite its surprisingly convincing performance, using the model as a filter does not guarantee that no prompt injections pass the check. The model should be treated as an additional layer of security, rather than as a complete solution to the problem of prompt injection.

Also, the model is trained for the chosen type of legitimate prompts. If your use case requires very different prompts to pass as legitimate, you can simply replace our legitimate prompts by yours and fine-tune the model.

We hope the model and dataset will prove as valuable for your LLM projects as it has been for ours!

Scaling NLP indexing pipelines with KEDA and Haystack — Part 1: The Application

Mon, 01 May 2023 00:00:00 +0000

Large language models have been one of the most hyped technologies lately. By combining them with a vector database that acts as a long-term memory or document store, you can improve prompts with additional context. Applications that use this technique, such as ChatGPT Plugins and Google’s project Magi, are rapidly gaining popularity. For many business users, the ability to enhance a prompt with private or recent data is what makes the difference between a prototype and a production-ready NLP application.

But how do we create a service that can turn files into searchable documents? And how can we efficiently scale these applications to index millions of files?

The goal of this article is to build a system that can run use case-specific indexing pipelines for large-scale production NLP applications. It will focus on how to create, deploy, and dynamically scale these applications using Haystack and KEDA. The first part covers creating a Python application with Haystack. The second part details how to deploy and scale the application using KEDA on Kubernetes.

This tutorial only covers turning files into text snippets, embeddings or arbitrary preprocessed data. We will not cover the augmented LLM application itself. If you are interested in creating such applications, feel free to use this tutorial: Build a Search Engine with GPT-3

Natural language processing basics

The idea behind using documents for search is to fill in the gaps in the model’s knowledge, such as the latest news. This is known as Retrieval-Augmented Generation (RAG), which means that in addition to its own knowledge, the model also uses the documents you provide to generate answers. You can find a step-by-step tutorial on how to create a search engine that uses preprocessed documents by this tutorial here.

The diagram illustrates the process of passing text files (🔡) through a model (🔮) that generates vectors and stores them in a vector database, such as Opensearch, Pinecone, or Weaviate. Later, these documents are fed to the model and the resulting output is provided to the user.

Now, let’s discuss how to efficiently generate these documents and input them into a vector database.

How to scale indexing applications

For a proof of concept, engineers typically create embeddings on virtual machines with GPUs and add them to vector databases as a one-time job. Once in production, we will need to run these jobs periodically or on demand whenever a new file needs to be added to the model’s knowledge base. There are two major factors that we want to optimize:

throughput — What is the maximum number of files that we can index within a given time period?
latency — How long does it take for a file to become available for retrieval after it has been added?

We will use stateless indexing consumers written in Python and learn how to use KEDA for autoscaling on Kubernetes. This will enable horizontal scaling for throughput and on-demand startup for latency. Let’s dive into designing the architecture!

Architecture

The main idea behind the architecture we are going to explore is to queue up indexing requests and spin up consumers that process these tasks in batches. Each consumer receives file-by-file messages and creates documents. We use the following concepts:

Batching — to improve throughput and optimize GPU utilization.
Decoupling of queueing and processing — to enable stable and scalable long-running processes.
Autoscaling — improves throughput by scaling based on the number of pending files.

There are various technologies out there that we can use to implement this. We will use:

AWS s3 (a simple storage service) — for storing the raw files that should be used for search
AWS SQS (a simple queue service) - for decoupling queueing, processing, and batching
Kubernetes — for running our containerized application
Indexing pipeline (Haystack) - “Consumer” application that subscribes to a queue and pulls the files from s3
KEDA (Kubernetes Event-driven Autoscaling) — for autoscaling our indexing pipelines

You can find all the code snippets and configuration files used throughout this article in this repository. You can use them to recreate this project from scratch. Link to repo: https://github.com/ArzelaAscoIi/haystack-keda-indexing

Indexing Consumer

This chapter will guide you through the process of creating an application that continuously pulls upload file notifications from an SQS queue for processing. Upon receiving a new message, the application fetches the corresponding file from S3 and converts it into multiple documents with embeddings.

These “consumers” are the core service of our architecture. We will deploy them to Kubernetes and scale them based on the queue length with KEDA.

If you are only interested in scaling these applications using KEDA and do not need to know the details of this service, you can use the Docker image arzelaascoli/keda-haystack-consumer:latest in the next section and skip writing the Python code.

We will start by creating a code snipped that receives a list of local, already downloaded, files and converts them into documents.

Indexing Pipeline

Haystack provides a simple way to create pipelines using YAML files. In this example, we will use a standard template, but we will simplify it slightly to suit our needs.

# pipeline yaml from Haystack: https://github.com/deepset-ai/templates/blob/69519af7178095d53cb5e879c8ac696d77c96aed/pipelines/GenerativeQuestionAnswering_gpt.yaml#L6  
  
version: "1.15.1"  
components:  
  ## Uncomment if you have running an OpenSearch instance  
  # - name: DocumentStore  
  #   type: OpenSearchDocumentStore  
  #   params:  
  #     user:   
  #     password:   
  - name: Retriever  
    type: EmbeddingRetriever  
    params:  
      # document_store: DocumentStore # uncomment to connect the retriever to an OpenSearch instance  
      embedding_model: sentence-transformers/multi-qa-mpnet-base-dot-v1  
      model_format: sentence_transformers  
      top_k: 3  
  - name: AnswerGen  
    type: OpenAIAnswerGenerator  
    params:  
      model: text-davinci-003  
      api_key:   
      max_tokens: 200  
      temperature: 0.8  
      frequency_penalty: 0.1  
      presence_penalty: 0.1  
      top_k: 3  
  - name: FileTypeClassifier  
    type: FileTypeClassifier  
  - name: TextConverter  
    type: TextConverter  
  - name: PDFConverter  
    type: PDFToTextConverter  
  - name: Preprocessor  
    type: PreProcessor  
    params:  
      split_by: word  
      split_length: 250  
      split_overlap: 20  
      language: en  
      split_respect_sentence_boundary: True  
  
pipelines:  
  # not required in our case, but can be used to query the document store  
  # with Haystack  
  - name: query  
    nodes:  
      - name: Retriever  
        inputs: [Query]  
      - name: AnswerGen  
        inputs: [Retriever]  
  - name: indexing  
    nodes:  
      - name: FileTypeClassifier  
        inputs: [File]  
      - name: TextConverter  
        inputs: [FileTypeClassifier.output_1]  
      - name: PDFConverter  
        inputs: [FileTypeClassifier.output_2]  
      - name: Preprocessor  
        inputs: [TextConverter, PDFConverter]  
      - name: Retriever  
        inputs: [Preprocessor]  
      ## Uncomment if you have a running a document store.  
      ## For this tutorial, we will generate embeddings and documents,  
      ## but not store them in a document store to keep it generic.  
      ## You can use any of Haystack's document stores here.  
      ## Docs: https://docs.haystack.deepset.ai/docs/document_store  
      # - name: DocumentStore  
      #   inputs: [Retriever]

This template is split into an indexing and a query pipeline. Let’s go through the steps we run for the indexing pipeline first:

FileTypeClassifier — checks the file type for .txt or .pdf files
TextConverter — extracts the text from .txt files
PDFConverter — extracts the text from .pdf files
Preprocessor — splits the texts from files into smaller chunks of size 250 characters
Retriever — computes embeddings for chunked texts by running the model sentence-transformers/multi-qa-mpnet-base-dot-v1 from Hugging Face
DocumentStore (disabled) — This node sends the created documents, including the vectors, to a selected database. For the purposes of this tutorial, we will skip this step. However, for a scalable system, you will need to set up a highly scalable vector database. Some suitable options may include OpenSearch, Weaviate, Qdrant, or others. Haystack supports many different document store clients.

Since we will just focus on indexing files, we will not load and run the query pipeline. Nevertheless, this pipeline just has two steps:

Retriever — Perform a k-nearest neighbors search to find documents that match your prompt.
AnswerGen — OpenAI’s text-davinci-003 model receives a prompt, which includes the retrieved text from step 1, and generates an answer in response.

In the next step, we will load the pipeline so that we can run indexing for local files.

# pipeline.py   
# link to file: https://github.com/ArzelaAscoIi/haystack-keda-indexing/blob/main/pipeline.py  
  
from haystack import Pipeline  
  
def get_pipeline(yaml_path: str) -> Pipeline:  
    return Pipeline.load_from_yaml(yaml_path, pipeline_name="indexing")  
  
## example usage  
# pipeline = get_pipeline("./pipelines/pipeline.yaml")  
# documents = pipeline.run(file_paths=[Path(".your-file.txt")])

Now that we can generate documents for local files, we need to write the “glue code” that connects our indexing pipelines with messages from SQS and files from S3. For testing purposes, we also create a code snipped to queue files for indexing.

Glue Code for Indexing with SQS and S3

Let’s start with creating a class S3Client that can upload and download files to Amazons s3 service. AWS offers a convenient way to communicate with resources through boto3. We will use this library and write simple wrappers for its methods.

# aws_service.py  
# link to file: https://github.com/ArzelaAscoIi/haystack-keda-indexing/blob/main/aws_service.py  
  
class S3Client:  
    def __init__(self, bucket_name: str, local_dir: str) -> None:  
        self.s3 = boto3.client(  
            "s3",  
            endpoint_url=AWS_ENDPOINT,  # allows using localstack  
            region_name=AWS_REGION,  # allows using localstack  
        )  
        self.bucket_name = bucket_name  
        self.local_dir = local_dir  
  
    def upload_file(self, local_path: Path) -> None:  
        """  
        Upload file to s3 with the same name as the file on local filesystem.  
        :param local_path: Path to file on local filesystem  
        """  
        file_name = os.path.basename(local_path)  
        self.s3.upload_file(  
            str(local_path),  
            self.bucket_name,  
            file_name,  
            ExtraArgs={"ACL": "public-read"},  
        )  
  
    def download_files(self, s3_keys: List[str]) -> List[Path]:  
        """  
        Download files from s3 to local directory.  
  
        TODO: Make this async to speed up the upload process  
        TODO: delete files from local directory after processing  
  
        :param s3_keys: List of keys on s3  
        :returns List of paths to downloaded files  
        """  
        paths: List[Path] = []  
        for s3_key in s3_keys:  
            response = self.s3.get_object(Bucket=self.bucket_name, Key=s3_key)  
            file_name = os.path.basename(s3_key)  
            local_path = os.path.join(self.local_dir, file_name)  
            paths.append(Path(local_path))  
            with open(local_path, "wb") as f:  
                f.write(response["Body"].read())  
  
        return paths

After implementing the file upload and download functionality, we need to enable consumers to fetch pending S3 keys that are queued up for indexing. To do this, we create a SQSClient class that can publish and receive messages from the Amazon SQS service. Similar to the S3Client, boto3 provides some simple methods for this purpose.

class SQSClient:  
    def __init__(self, queue_name: str) -> None:  
        self.sqs = boto3.client(  
            "sqs",  
            endpoint_url=AWS_ENDPOINT,  # allows using localstack  
            region_name=AWS_REGION,  # allows using localstack  
        )  
        self.queue_name = queue_name  
        self.queue = self.sqs.get_queue_url(QueueName=self.queue_name)  
  
    def publish_key(self, key: str) -> None:  
        """  
        Publish key to SQS queue  
  
        :param key: s3 Key to publish  
        """  
  
        self.sqs.send_message(QueueUrl=self.queue["QueueUrl"], MessageBody=key)  
  
 @contextmanager  
    def fetch_keys(self) -> Generator[List[str], None, None]:  
        """  
        Fetch messages from defined queue  
        :returns List of keys on s3  
        """  
  
        # receive up to 10 messages at once from sqs  
        response = self.sqs.receive_message(  
            QueueUrl=self.queue["QueueUrl"],  
            MaxNumberOfMessages=10,  
            WaitTimeSeconds=0,  
        )  
        # get s3 keys from sqs messages  
        messages = response.get("Messages", [])  
        s3_keys = [message["Body"] for message in messages]  
  
        try:  
            # We are using a context manager to ensure the messages  
            # are deleted from the queue after handling them was  
            # successful.  
            yield s3_keys  
            # This will be run after the message was processed  
            for message in messages:  
                self.sqs.delete_message(  
                    QueueUrl=self.queue["QueueUrl"],  
                    ReceiptHandle=message["ReceiptHandle"],  
                )  
        except Exception as error:  
            raise error

We will now connect both of these clients to a service that has two methods:

get_files - This method fetches file messages from SQS, downloads the corresponding file from S3, and returns the local path.
upload_file - This method uploads a file from a local path to S3 and publishes the key on SQS to be picked up by the indexing consumers later.

# aws_service.py  
# link to file: https://github.com/ArzelaAscoIi/haystack-keda-indexing/blob/main/aws_service.py  
  
class AWSService:  
    def __init__( self, queue_name: str, bucket_name: str, local_download_dir: str ) -> None:  
        self.sqs_client = SQSClient(queue_name)  
        self.s3_client = S3Client(bucket_name, local_download_dir)  
  
    def get_files(self) -> List[Path] | None:  
        with self.sqs_client.fetch_keys() as messages:  
            if not messages:  
                return  
            paths = self.s3_client.download_files(messages)  
            return paths  
  
    def upload_file(self, local_path: Path) -> None:  
        self.s3_client.upload_file(local_path)  
        self.sqs_client.publish_key(local_path.name)

Set up a local AWS environment with LocalStack

We use LocalStack to simplify the development process and avoid the need to use real AWS infrastructure. To start LocalStack, run docker-compose up in the terminal using the docker-compose.yaml file.

# docker-compose.yaml   
# link to file:https://github.com/ArzelaAscoIi/haystack-keda-indexing/blob/main/docker-compose.yaml  
  
version: "3"  
services:  
  localstack:  
    image: localstack/localstack:1.4.0  
    ports:  
      - "127.0.0.1:4566:4566" # LocalStack Gateway  
      - "127.0.0.1:4510-4559:4510-4559" # external services port range  
    environment:  
      - SERVICES=s3,sqs  
      - AWS_ACCESS_KEY_ID=test  
      - AWS_SECRET_ACCESS_KEY=test  
    volumes:  
   # startup script to create a sqs queue and a s3 bucket  
      # https://github.com/ArzelaAscoIi/haystack-keda-indexing/blob/main/scripts/sqs_bucket_bootstrap.sh  
      - ./scripts:/docker-entrypoint-initaws.d/ # startup script found here

After startup, we will run a script that creates a test-queue and a test-bucket that we can use to test our application. In the next steps, we will create AWS clients to read messages from SQS and download files from s3.

Running the consumer

Now that we can run pipelines, upload and download files, and set up our local AWS environment, we need to write our main service. On a code level, all we need to implement is a simple loop to continuously pull messages and run indexing.

# consumer.py  
# link to file:https://github.com/ArzelaAscoIi/haystack-keda-indexing/blob/main/consumer.py  
  
# To learn more about logging in python check out my other article  
# about structolg! https://medium.com/@ArzelaAscoli/writing-professional-python-logs-e1f31635b60b  
logger = structlog.get_logger(__name__)  
  
# Initialize AWS service  
aws_service = AWSService(SQS_QUEUE, S3_BUCKET, LOCAL_DOWNLOAD_DIR)  
  
# load pipeline  
pipeline = get_pipeline("./pipelines/pipeline.yaml")  
  
while True:  
    # fetch files from aws  
    files: List[Path] = aws_service.get_files()  
    if not files:  
        logger.info("No files to process")  
        sleep(5)  
        continue  
  
    # process files if found  
    logger.info("Found files", files=files)  
  
    # run indexing for downloaded files  
    documents = pipeline.run(file_paths=files)  
    logger.info("Processed files", documents=documents)

We will now create an image of our application using the Haystack GPU base image, which has most of our dependencies pre-installed.

# Dockerfile   
# link to file: https://github.com/ArzelaAscoIi/haystack-keda-indexing/blob/main/Dockerfile  
  
FROM deepset/haystack:base-gpu-v1.15.0  
WORKDIR /home/user  
  
COPY . .  
RUN pip3 install -r requirements.txt  
  
CMD ["python3", "consumer.py"]

By adding the startup configuration to docker-compose, we can not only run LocalStack but also our application in Docker by calling docker-compose up.

# docker-compose.yaml  
# link to file: https://github.com/ArzelaAscoIi/haystack-keda-indexing/blob/main/docker-compose.yaml  
...  
consumer:  
    build:  
      context: .  
      dockerfile: Dockerfile  
    environment:  
      - AWS_ENDPOINT=http://localstack:4566  
      - AWS_REGION=eu-central-1  
      - AWS_ACCESS_KEY_ID=test  
      - AWS_SECRET_ACCESS_KEY=test

We are now ready to test our application by uploading files and checking the logs for resulting documents.

Testing with Docker Compose

To start our Docker Compose stack, we use the command docker-compose up. Additionally, we provide a code snippet for file uploading.

# upload.py  
# link to file:   
  
aws_service = AWSService(SQS_QUEUE, S3_BUCKET, LOCAL_DOWNLOAD_DIR)  
aws_service.upload_file(Path("./data/test.txt"))

If we run python3 upload.py upload.py now, we will see the following output in our Docker console.

haystack-keda-indexing-localstack-1  | 2023-04-22T09:55:44.981  INFO --- [   asgi_gw_1] localstack.request.aws     : AWS s3.GetObject => 200  
haystack-keda-indexing-localstack-1  | 2023-04-22T09:55:45.007  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS sqs.DeleteMessage => 200  
haystack-keda-indexing-consumer-1    | 2023-04-22 09:55:45 [info     ] Found files                    files=[PosixPath('/tmp/test.txt')]  
Converting files: 100%|██████████| 1/1 [00:00<00:00, 81.28it/s]  
Preprocessing: 100%|██████████| 1/1 [00:00<00:00, 35.10docs/s]  
Batches: 100%|██████████| 1/1 [00:03<00:00,  3.18s/it]  
haystack-keda-indexing-consumer-1    | 2023-04-22 09:55:48 [info     ] Processed files                documents={'documents': ['content': 'this is text', 'content_type': 'text', 'score': None, 'meta': {'_split_id': 0}, 'id_hash_keys': ['content'], 'embedding': '', 'id': '46ec22c7eafaea7c43eef7d996fd04ce'}>], 'file_paths': [PosixPath('/tmp/test.txt')], 'root_node': 'File', 'params': {}, 'node_id': 'Retriever'}  
haystack-keda-indexing-localstack-1  | 2023-04-22T09:55:48.327  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS sqs.GetQueueUrl => 200

We have a stateless service that processes indexing requests asynchronously and can operate in Docker Compose. We can now deploy this service to Kubernetes, configure KEDA, and scale it.

Conclusion and next steps

This article explains how to create a stateless consumer application that turns files into documents. We learned that we can use this application with different models, preprocessors, and file types which are defined in a single configuration file using Haystack. In order to ship this application to a production-ready deployment, we will deploy this service on Kubernetes and define custom autoscaling in the next chapter.

Scaling NLP indexing pipelines with KEDA and Haystack — Part 2: The Deployment

Mon, 01 May 2023 00:00:00 +0000

In the first part of this article series, we discussed the power of retrieval-augmented generation. We also explored how to create a Python application that converts files into searchable documents with embeddings via Haystack pipelines. However, merely having a Python program that converts files into text snippets and embeddings on a single machine is not enough for a production-ready deployment.

In this part, we will explore how to deploy an indexing consumer to Kubernetes and how to autoscale it using KEDA. This will allow us to efficiently add text and embeddings to our vector database that can power a retrieval augmented LLM search engine like this.

We will use the architecture described in the first part, which involves queuing files to be indexed on AWS SQS and consuming them in parallel using Haystack pipelines. Let’s start deploying and scaling!

Deployment on Kubernetes and scaling with KEDA

In this section, we will learn how to set up KEDA on Kubernetes and configure autoscaling to scale our consumers based on the number of pending files in Kubernetes. We will use the following tools:

k3d for creating a local Kubernetes cluster to deploy our consumers
KEDA for scaling the consumers after deployment
localstack as a local AWS cloud stack, for testing our application

To communicate with our local Kubernetes cluster, we will use kubectl.

⚠️ For certain types of indexing pipelines, GPUs are required. This is particularly true when embeddings are generated on the same machine (rather than through an external inference service) or when a model is used within the pipeline. In such cases, it is necessary to have GPU pods within the Kubernetes cluster to run the model.

Installation and setup

We will need to set up a local Kubernetes cluster and deploy a list of services before we can start deploying and scaling our application.

Create a local Kubernetes cluster

To begin, create a new Kubernetes cluster named haystack-keda-cluster using k3d.

k3d cluster create haystack-keda-cluster  
# check the the status via: kubectl cluster-info

Next, we will create a namespace called indexing that we will use to deploy our services.

kubectl create namespace indexing

Install services — KEDA and LocalStack

To set up LocalStack, add the helm chart and install LocalStack in the indexing namespace:

helm repo add localstack https://localstack.github.io/helm-charts  
helm install localstack localstack/localstack --namespace indexing

We will repeat the same steps with KEDA.

helm repo add kedacore https://kedacore.github.io/charts  
helm install keda kedacore/keda --namespace indexing

We can validate the setup by running indexing kubectl get pods -n indexing.

Create an SQS queue and an S3 bucket

Our indexing consumers will connect to queues on LocalStack and download files from S3. Therefore, we need to create the necessary resources before deploying our application.

To create a queue and a bucket, we will use the same shell script that we used in our development environment with Docker Compose. To run the script from within the container, pipe the script into the kubectl exec command:

cat ./scripts/sqs_bucket_bootstrap.sh | kubectl exec -i -n indexing deployment/localstack -- /bin/bash

If we fetch the logs via kubectl logs -f deployment/localstack -c localstack -n indexing, we should see that a queue and a bucket were created.

2023-04-22T15:19:34.166  INFO --- [   asgi_gw_1] localstack.request.aws     : AWS sqs.CreateQueue => 200  
2023-04-22T15:19:34.533  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.CreateBucket => 200

Deploying the indexing consumer

Now that we have LocalStack and KEDA deployed to our Kubernetes cluster, we can start deploying our indexing consumer. The indexing consumers are deployed as Kubernetes deployments by using a deployment file deployment-consumer.yaml.

# deployment-consumer.yaml  
# link to file:   
  
kind: Deployment  
apiVersion: apps/v1  
metadata:  
  name: indexing-consumer  
  labels:  
    k8s-app: indexing-consumer  
spec:  
  # we want to start with 0 replicas and scale up on demand  
  replicas: 0  
  selector:  
    matchLabels:  
      k8s-app: indexing-consumer  
  template:  
    metadata:  
      name: indexing-consumer  
      labels:  
        k8s-app: indexing-consumer  
    spec:  
      containers:  
        - name: indexing-consumer  
          command: ["python3", "consumer.py"]  
          # public consumer image  
          image: arzelaascoli/keda-haystack-consumer:latest  
          env:  
            # localstack configuration  
            - name: AWS_ENDPOINT  
              value: http://localstack:4566  
            - name: AWS_REGION  
              value: eu-central-1  
            - name: AWS_ACCESS_KEY_ID  
              value: test  
            - name: AWS_SECRET_ACCESS_KEY  
              value: test  
          # Resource estimations  
          # TODO: adjust these to our needs and the load we expect  
          resources:  
            requests:  
              memory: 1000Mi  
              cpu: 750m  
            limits:  
              memory: 2500Mi  
              cpu: 2000m

We can apply this YAML to our namespace indexing with kubectl:

kubectl apply -f ./kubernetes/deployment-consumer.yaml --namespace indexing

To validate that consumers can successfully start and connect to the queue, we can scale up the replicas to 1, and check the running pods.

# scale deployment   
kubectl scale deployment indexing-consumer --namespace=indexing --replicas=1  
# get pods   
kubectl get pods -n indexing  
# check logs   
kubectl logs -f deployment/indexing-consumer -c indexing-consumer -n indexing

The system will log that no files were found to be processed:

│ 2023-04-23 15:43:14 [info     ] No files to process                                                                                                       │  
│ 2023-04-23 15:43:19 [info     ] No files to process                                                                                                       │  
│ 2023-04-23 15:43:24 [info     ] No files to process                                                                                                       │  
│ 2023-04-23 15:43:29 [info     ] No files to process

Next, we will set up autoscaling based on the length of the SQS queue to enable autoscaling and scaling to zero if no files are pending.

Configure autoscaling based on queue length

After successfully creating all the required services to index files, we can now configure KEDA to check the queue length and scale the indexing consumers accordingly.

To set up a KEDA trigger on SQS, we need to configure authentication by creating a Kubernetes secrets object.

# secrets-localstack.yaml  
# link to file:   
  
apiVersion: v1  
kind: Secret  
metadata:  
  name: aws-secrets  
  namespace: indexing  
data:  
  AWS_ACCESS_KEY_ID: dGVzdA== # base64 encoded string for "test"  
  AWS_SECRET_ACCESS_KEY: dGVzdA== # base64 encoded string for "test"

This secret is then mapped via a TriggerAuthentication object to KEDA, which will use credential based authentication.

# trigger-authentication.yaml  
# link to file:   
  
apiVersion: keda.sh/v1alpha1  
kind: TriggerAuthentication  
metadata:  
  name: keda-trigger-auth-aws-credentials  
  namespace: indexing  
spec:  
  secretTargetRef:  
    - parameter: awsAccessKeyID # Required.  
      name: aws-secrets # Required.  
      key: AWS_ACCESS_KEY_ID # Required.  
    - parameter: awsSecretAccessKey # Required.  
      name: aws-secrets # Required.  
      key: AWS_SECRET_ACCESS_KEY # Required.

The operator is now allowed to access LocalStacks resources, and we can create a scaled object with a aws-sqs-queue trigger.

apiVersion: keda.sh/v1alpha1  
kind: ScaledObject  
metadata:  
  name: indexing-consumer-scaled-object  
spec:  
  scaleTargetRef:  
    apiVersion: apps/v1  
    kind: Deployment  
    name: indexing-consumer # Mandatory. Must be in the same namespace as the ScaledObject  
  pollingInterval: 30  
  cooldownPeriod: 300  
  idleReplicaCount: 0  
  minReplicaCount: 0  
  maxReplicaCount: 2  
  fallback: # defines a number of replicas to fallback to if a scaler is in an error state.  
    failureThreshold: 3  
    replicas: 4  
  triggers:  
    - type: aws-sqs-queue  
      authenticationRef:  
        name: keda-trigger-auth-aws-credentials  
      metadata:  
        # KEDA will use the value of the environment variable of the `consumer-file-ingestion` containers  
        queueURL: http://localhost:4566/000000000000/test-queue  
        queueLength: "10" # Should roughly equal the number of messages that can be processed in 1 minute  
        awsRegion: "eu-central-1"  
        awsEndpoint: "http://localstack:4566"  
        scaleOnInFlight: "false" # Exclude in-flight messages from the queue length calculation

After applying these three YAMLs via kubectl apply --f ./kubernetes/keda --namespace indexing, we can forward the port to allow uploading files to LocalStack.

 kubectl port-forward deployment/localstack 4566:4566 -n indexing

We can use an upload script to add a file test.txt by running python3 upload.py.

# upload.py   
# link to file:   
aws_service = AWSService(SQS_QUEUE, S3_BUCKET, LOCAL_DOWNLOAD_DIR)  
aws_service.upload_file(Path("./data/test.txt"))

Once the file is successfully uploaded and queued, KEDA will take care of scaling the deployment from 0 to 1 replica. Kubernetes will list an indexing-consumer pod.

NAME                                               READY   STATUS                
localstack-8fc647d9d-xkrsk                         1/1     Running               
keda-operator-metrics-apiserver-7bcfdd7c9b-7pbkp   1/1     Running              
keda-operator-6857fbc758-xtc44                     1/1     Running               
keda-admission-webhooks-59978445df-q85jr           1/1     Running              
indexing-consumer-656d98db6f-psz6q                 0/1     ContainerCreating

After startup, the files will be fetched and removed from the queue, and indexed.

Conclusion and next steps

This article explains how to create a scalable application to convert text and PDF files into documents containing text and embeddings. KEDA allows for the on-demand scaling of each application. With this simple architecture, we are able to horizontally scale the creation of embeddings.

How was KEDA useful? — KEDA enabled us to scale consumers based on queue length. An alternative solution involves using horizontal pod autoscaling based on CPU usage, which would be triggered once elements are fetched from the queue. However, this approach does not allow for scaling down to zero. Since these tasks require GPUs, one idle machine that is constantly running can be expensive.

How do I deploy this without k3d? — This tutorial is applicable to any Kubernetes cluster. Simply follow the instructions provided.

What resources do I need? — When running this in production, GPU nodes are necessary for the cluster, which may require additional configuration.

Is there a simple way to deploy multiple pipelines? — In one of the next articles, I will share an article on how to use the Kubernetes Operator Framework ( Kopf), written in Python, to dynamically create these resources.

New DocumentStore Integration for Haystack: QdrantDocumentStore

Tue, 18 Apr 2023 00:00:00 +0000

We’re thrilled to introduce the QdrantDocumentStore, the latest addition to the Haystack DocumentStore family by Qdrant. It’s an open source package that provides powerful capabilities such as efficient search, high-dimensional vector retrieval, and flexible launch options.

Let’s learn about DocumentStores and how to use QdrantDocumentStore with your pipelines!

Haystack DocumentStores

Haystack is an end-to-end NLP framework that provides a modular approach to building state-of-the-art generative AI, QA, and semantic knowledge base search systems. A core component of most modern NLP systems is a database that can efficiently store and retrieve vast amounts of text data. Vector databases are a great way of doing this. These databases can store vector representations of text while also implementing efficient ways to retrieve them at speed. To this end, Haystack provides a set of native DocumentStores that allow you to connect to and query your data that might be in Elasticsearch, Pinecone, Weaviate, or others. This way, you can store and maintain your data within one of these databases while simultaneously using them within your Haystack pipelines and applications.

New DocumentStore by Qdrant

The latest addition to the Haystack DocumentStore family is the QdrantDocumentStore. This new DocumentStore is a first of a kind by being an external package maintained by the Qdrant team. Qdrant is a vector similarity search engine that provides a production-ready service with a convenient API for storing, searching, and managing vectors with an additional payload. Qdrant is optimized for high-dimensional vector search and supports various similarity metrics, making it a powerful tool for various NLP applications.

Advantages of QdrantDocumentStore

Some of the advantages of using Qdrant with Haystack include:

Efficient search and retrieval of high-dimensional vectors
mmap support for HNSW graph storage
Filtering capabilities integrated directly into the vector search phase
Constraints based on metadata attributes
Reducing the memory footprint up to 4 times by using Scalar Quantization
Storing even a million vectors with a small amount of RAM
Open source! 😍

Qdrant provides several options to launch an instance. A local in-memory/disk-persisted mode is a good choice for test scenarios and quick experiments in Colab or Jupyter Notebook, where you plan to store only a few vectors. When you need to scale, you can easily switch to on-premise server deployment using Docker containers. If you don’t want to manage the infrastructure at all, there is also a fully managed Qdrant Cloud offering with a free forever 1GB cluster. All is possible when using Haystack! 💙

Haystack is a comprehensive framework that enables developers to construct complete NLP pipelines, and Qdrant is a powerful vector similarity search engine. To ensure seamless collaboration between the two tools and create robust NLP applications, QdrantDocumentStore is a separate package maintained by Qdrant, specifically designed to integrate with Haystack. By having its own release cycle, QdrantDocumentStore can be updated independently from Haystack, resulting in a more streamlined and efficient workflow.

How to Use QdrantDocumentStore

If you want to use the QdrantDocumentStore with Haystack, you can find the source code in the qdrant-haystack repository on GitHub. With this integration, you can take advantage of the robust search capabilities of Qdrant with the natural language processing features of Haystack, creating a powerful tool for semantic search and retrieval.

To use the QdrantDocumentStore with Haystack, you can install qdrant-haystack:

pip install qdrant-haystack

Then, initialize QdrantDocumentStore:

from qdrant_haystack.document_stores import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    index="Document",
    embedding_dim=512,
    recreate_index=True,
    hnsw_config={"m": 16, "ef_construct": 64}  # Optional
)

If you’d like to enable the Scalar Quantization for your embeddings, you can just pass the proper configuration in the same way as you would do with raw QdrantClient:

from qdrant_haystack.document_stores import QdrantDocumentStore
from qdrant_client.http import models

document_store = QdrantDocumentStore(
    "http://localhost:6333", 
    index="Document",
    embedding_dim=512,
    recreate_index=True,
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,
            quantile=0.99,
            always_ram=True,
            ),
     ),
)

QdrantDocumentStore supports all the configuration properties available in the Qdrant Python client.

Once you have initialized the document store, you can write your documents using the write_documents method. This will allow you to seamlessly integrate your vector database with Haystack and use its functionalities.

document_store.write_documents(docs)

That’s it! For more examples and documentation on using Qdrant with Haystack, refer to the qdrant-haystack GitHub repository, and if you have any questions, join Haystack community on Discord.

Writing Professional Python Logs

Thu, 13 Apr 2023 00:00:00 +0000

If you are a Python developer who has been struggling with inconsistent and unhelpful logs, then this article is for you! In this post, we will discuss how structlog can help you write logs that not only assist you in your daily work but also enforce certain best practices to ensure consistency in logging throughout your project.

We will also explore common logging challenges and how structlog can help overcome them. So whether you are a beginner or an experienced developer, keep reading to learn how structlog can revolutionize your logging process!

In the past, I often neglected logs and only wrote a few when they helped me in my development environment. These logs were useful for finding bugs or validating that the code works as intended, but they were only helpful for local debugging.

After learning about structlog, I now know how to write logs that not only assist me in my daily work, but also enforce certain best practices that ensure consistency in logging throughout the project.

This makes parsing logs for your shipped application easier and improves your monitoring in tools like Datadog or Grafana.

In this article, I will share some of my experiences with structlog and demonstrate the power of this library by walking you through an example scenario. Let’s assume that a user wants to upload files and we have already created an endpoint that accepts these requests. In the past, I might have written logs like this:

import logging  
...   
logging.debug("Start uploading file with name %s. Number of retries: %s", file_name, 0)  
...  
logging.error("Failed uploading file with name %s. Current retry: %s.", file_name, retries)  
...

Basically, there is nothing wrong with this code snippet, which prints the following logs.

These two log lines describe what is happening during a request. However, we could improve the notation of the retries, add more context and make the log lines more readable.

Classic logging challenges

Before we jump into the hands on part, let’s first understand some common logging challenges based on the example above.

1. Missing contextual information

The log line precisely describes that a my_file failed to upload. This information might be sufficient for a development environment where there is only one user and one client sending requests, the data is stored on the local file system, and all requests are executed sequentially.

However, when encountering this kind of log in a production system, typical questions that might arise are:

Who sent this request?
Which organization does the user belong to?
Which file store did we connect to? Is there a session identifier or something else that helps us trace down the error?
Is this log line connected to a request identifier?

To obtain this information for log lines, we would need to add this extra information to each log line, and in the worst case, fetch the same value from a database multiple times to add this informational context throughout the codebase.

2. Transition from development-only to machine readable logs

Depending on the environment, you may want to have log lines that are prettified to allow humans to read them easily, but also provide them in a machine-readable format in a production system. Within our production deployment, we might want to count the total number of retries or filter for a failed file name. In this scenario, we would need to run substring matches. It would be much easier to write queries that let you filter, maybe even in an SQL-like manner for events.

select file_name from logs where retry_count > 1

3. Inconsistent wording

It is common for log lines within the same Python module to look completely different from one another. Sometimes, important information is even missing. Depending on the author, log lines can look very diverse. Entities can be referred to by different names, and the values of variables may be added to the log with or without quotes, or their wording may be changed throughout the module.

This makes searching for specific events difficult and time-consuming.

Structlog

Structlog offers a lot of nice features that help you write logs in a faster and less painful way. It helps you to add contextual data to your logger and offers a modular configuration for parsing your log lines in a machine readable and developer friendly way.

In our case, this means that we get beautiful colored log lines within our development environment and parsed JSONs for our monitoring, add file names as contextual data and use some more convenience features. It is important to mention that although the native logger can be set up to log as JSON, structlog provides this capability out of the box.

First step — Replace python’s native logger with structlog

Let’s use structlog for the same code snippet as above. We need to obtain the logger and add the variables to our log messages. A code snippet could look like this:

import structlog   
logger = structlog.get_logger()  
...   
logger.debug("Start uploading file.", file_name=file_name,retries=0)  
...  
logger.error("Failed uploading file.", file_name=file_name,retries=retries)  
...

The first thing we might notice is that the variables are not longer loaded to the string but rather kwargs of the debug and error methods. This not only adds the values of these variables to the logger but also binds the key names to the log message. The resulting log lines will look like this:

If we compare this to the solution we had before, we can see that the file name and retries are added to the log lines by mapping key values instead of inserting the values as a raw string.

Second step — Bind contextual data to loggers

Generally there are two ways of adding contextual data to loggers with structlog.

1. Explicitly binding variables, by calling .debug(…) . We will update the previous code snipped by first binding the file_name to the logger and using the logger instance to print log messages.

import structlog   
logger = structlog.get_logger()  
...   
log = logger.bind(file_name=file_name)  
log.debug("Start uploading file.", retries=0)  
...  
log.error("Failed uploading file.", retries=retries)  
...

This allows us to remove the file name from all the log lines and move it to the top where it is just called once. Structlog will take care of adding it to each log line.

2. Implicitly binding context variables, by calling structlog.contextvars.bind_contextvars(…) — This function uses context variables that are used whenever the logger within the same thread prints log messages. A code snippet that prints the same log messages using context variables might look like this:

# /controller/file.py  
  
import structlog   
  
logger = structlog.get_logger()  
...   
structlog.contextvars.bind_contextvars(file_name=file_name)  
logger.debug("Start uploading file.", retries=0)  
...  
logger.error("Failed uploading file.", retries=retries)

However, our scenario is not the preferred use case for using context variables to add values to your logger. As we learned in the logging challenges, contextual data is often missing in the classic approach. Before processing the file, we assume that an authentication and authorization mechanism is in place to validate whether a user has access to the endpoint. In a framework like FastAPI, this workload is handled in middleware. Since we have already fetched the user ID here to authenticate the request, let’s add this key-value pair to the logger:

# /auth/middleware.py   
  
def verify_user(request: Request) -> bool:  
 if not validate(request.token):  
  return False  
   
 user_id = get_user_id(request.token)  
 structlog.contextvars.bind_contextvars(user_id=user_id)  
 return True

This way structlog will take care of adding this contextual information for us and we don’t need to worry about adding the user id manually.

Third Step — Configuration for machine readable logs

After learning how to attach information to log lines using structlog, we now want to configure it to send prettified log lines in our local development environment setup and parsed JSONs for our logging service. We will follow the first and simplest solution from the structlog docs. Before we proceed, we need to understand two concepts from structlog: Preprocessors and Renderers.

Preprocessors are used to extend the content of log messages. This can be done by adding timestamps, modifying fields to obfuscate user names, or dropping log lines based on defined conditions.

Renderers are used to control the representation of log lines. Depending on the use case, they can display beautified and colorful log lines, or machine-readable JSONs.

The ability to write custom implementations of these nodes provides a lot of extensibility.

We will use default structlog preprocessors to define the content of our log lines. The following diagram illustrates how to handle log lines that use a structlog logger. After the output step, we add a renderer based on the environment that formats the log lines as human or machine readable output.

Logging architecture from structlog docs

We begin by configuring a set of processors that add timestamps, set the log level, and merge the context variables used in step 2.

import structlog  
from structlog.dev import ConsoleRenderer  
from structlog.processors import JSONRenderer  
  
# Timestamper preprocessor that to add unified timestamps to each log  
timestamper = structlog.processors.TimeStamper(fmt="iso", utc=True)  
  
# Structlog preprocessors  
structlog_processors = [  
  structlog.stdlib.add_log_level  
    structlog.processors.add_log_level,  
    structlog.contextvars.merge_contextvars,  
    structlog.processors.StackInfoRenderer(),  
    structlog.dev.set_exc_info,  
    timestamper,  
]

Now we want to define a function that either returns structlogs ConsoleRendereror JSONRenderer, based on the environment settings and append this renderer to the processors.

# __init__.py  
  
from structlog.dev import ConsoleRenderer  
from structlog.processors import JSONRenderer  
  
def get_renderer() -> Union[JSONRenderer, ConsoleRenderer]:  
    """  
    Get renderer based on the environment settings  
    :return structlog renderer  
    """  
    if os.get("DEV_LOGS", True):  
        return ConsoleRenderer()  
    return JSONRenderer()  
  
  
## Structlog  
structlog.configure(  
    processors=structlog_processors + [get_renderer()],  
    wrapper_class=structlog.stdlib.BoundLogger,  
    context_class=dict,  
    logger_factory=structlog.PrintLoggerFactory(),  
    cache_logger_on_first_use=False,  
)

⚠️ This will only affect the log lines printed within your application and will not affect logs produced by modules used inside your application. If you also want these logs to be formatted as JSON, you need to follow tutorial 3 of the structlog documentation.

After setting the environment variable DEV_LOGS to False our log lines are printed as raw JSONs.

This format can now be used in your monitoring system and easily be parsed and aggregated since its not longer a stream of text, but rather a stream of easy to parse JSONs.

In our local environment we can still use the old colored log lines:

Limitations

Although structlog alone will not solve all inconsistencies with logging, it can help to use a more consistent and less error-prone workflow.

To increase consistency in our codebase, we use a set of default naming conventions. While these conventions are not enforceable, they can contribute to a common understanding of how to write logs.

_id <- ( required) every entity (e.g. file) that is processed needs to be added to the logger
_name <- (optional) every entity (e.g. file) can be added with the suffix _name for a better readability
_ <- (optional) additional properties of entities are allowed to be added with the key name as a suffix
← (optional) Sometimes, you may need to add case-specific information. In such cases, you can use additional arbitrary keys. However, it is recommended to follow the naming conventions from 1–3, if possible.

This is for sure not the only possible naming, but the one that works for us best.

Summary

We have learned the following:

Using a key-value format for variables helps create uniform log messages and promotes consistent naming conventions.
Contextual data can be explicitly added by binding variables to loggers or implicitly added using context variables, which enriches log lines with useful metadata.
Choosing a structlog renderer allows for easy parsing of log lines into a machine-readable or prettified, human-friendly format based on the environment.

About me: I am a developer based in Cologne, working at deepset. I am part of the team building “deepset Cloud,” which is powered by the open source framework Haystack.

Introducing Agents in Haystack: Make LLMs resolve complex tasks

Thu, 30 Mar 2023 00:00:00 +0000

Wouldn’t it be cool if you could enter a query and an LLM makes decisions on-the-fly about what resources it needs to resolve the query? For example, that it needs to search the web and retrieve relevant resources. Or that it needs to search through your company files first. That’s now possible with Agents!

With the release of Haystack 1.15, we’re officially introducing the Agent to the Haystack ecosystem. The implementation of Agents is inspired by two papers: the MRKL Systems paper (pronounced ‘miracle’ 😉) and the ReAct paper. If you like reading papers, I highly recommend these two. Here, I’ll explain how we’re introducing this functionality to Haystack.

Let’s first talk about what an Agent is, and then see how you can start using Agents in Haystack. First, we need to briefly talk about what a ‘prompt’ is.

What is a Prompt?

Long story short, a prompt is an instruction. In the world of Natural Language Processing, these instructions can often be things like ‘Answer the given query’, or ‘Summarize the following piece of text’. In recent months, new large language models (LLMs) were made available, such as the ever improving GPT models by OpenAI (text-davinci-003, gpt-3.5-turbo, and GPT-4). These have shown great capability in consuming and acting on increasingly complex prompts.

Haystack users may already have encountered the PromptNode, which works alongside an LLM that consumes instructions. You have the option to define your own prompt in a PromptTemplate, or use one of our defaults.

For example, here is a prompt that I defined in a PromptTemplate for the ‘Should I follow?’ demo on Hugging Face. Take a look if you’re interested in how to use custom prompts in Haystack!

In this case, I wanted to have a system that can tell me what type of things a Twitter user has been posting about. The prompt starts like this:

You will be given a twitter stream belonging to a specific profile. Answer with a summary of what they’ve lately been tweeting about and in what languages. You may go into some detail about what topics they tend to like tweeting about. Please also mention their overall tone, for example: positive, negative, political, sarcastic or something else.

Why does this matter to understand Agents? Let’s have a look 👇

What is an Agent?

Agents are a way to leverage this ability of LLMs to understand and act on prompts. In essence, an Agent is an LLM that has been given a very clever initial prompt. The prompt tells the LLM to break down the process of answering a complex query into a sequence of steps that are resolved one at a time.

Agents become really cool when we combine them with ‘experts’, introduced in the MRKL paper. Simple example: an Agent might not have the inherent capability to reliably perform mathematical calculations by itself. However, we can introduce an expert - in this case a calculator, an expert at mathematical calculations. Now, when we need to perform a calculation, the Agent can call in the expert rather than trying to predict the result itself.

For example, an Agent that is asked “Who was the US president ten years ago today?”. A simplified view of the Agent’s thought process as it breaks down this question might look like this:

“I have to answer the question: What is today’s date?”
“Now I know todays date is 29th March 2023, I need to answer: what is 29 March 2023 minus 10 years?”
“Now I need to answer the question: Who was the US president on 29 of March 2013?”

At each step, the Agent may decide to make use of an expert to come to a resolution. Notice how it is able to figure out that it first has to ask a question, then perform a calculation with the answer, then finally ask another question. Whether it is able to do any of those things depends on which experts it has access to. In Haystack, these experts are called Tools.

What are Tools?

In the world of the Agent, a Tool is an expert that the Agent can query at any time. For example, an Agent might be given a Tool that can search the web (let’s call it ‘Websearch’). If the Agent sees that there’s a need to search the web and it has the Websearch in its set of tools, it will use it.

With the release of Haystack 1.15, we are making it possible to give Haystack components (the Nodes, Pipelines and more) to an Agent as Tools. For example, a Tool might simply be an ExtractiveQAPipeline. This is a pipeline that has access to a DocumentStore which contains your own data and, given a query, uses an extractive question answering model to find an answer in your own data. With this Tool, your Agent can use the pipeline to query your own data whenever it sees fit.

Our first release of Agents will let you use several ready-made Haystack pipelines and Nodes as Tools: the ExtractiveQAPipeline, DocumentSearchPipeline, the PromptNode, the new WebQAPipeline and more. For a full list of available Tools, check out our Agent documentation. We will soon be expanding the list of Tools available out of the box, and you will be able to create your own custom Tools as well.

How does an Agent select a Tool? Each Tool comes with a description. This is arguably one of the most important attributes of a Tool, as it is used by the Agent to make the Tool selection. For example, the description of a Tool that does extractive question answering on docuemtns about USA presidents could be: ‘Useful for when you need to answer questions about the presidents of the USA’.

Let’s imagine that we do actually have our own data store containing information about US presidents. We create an ExtractiveQAPipeline for this data and supply it to our Agent as a Tool called ‘ExtractiveQATool’. When we ask the Agent a question, here’s what the output might look like:

Question: What year was the 1st president of the USA born?

Thought: Let’s think step-by-step, I first need to find out who the first president of the USA was.

Tool: ExtractiveQATool

Tool Input: Who was the first president of the USA?

Observation: George Washington

Thought: Now that I know who the first president of the USA was, I can use the same tool to find out when he was born.

Tool: ExtractiveQATool

Tool Input: When was George Washington born?

Observation: February 22, 1732

Thought: That’s the answer to the question.

Final Answer: 1732

Pay attention to the highlighted sections of the output. Notice how the Agent goes step-by-step to come up with an action plan. Depending on what the next step is, it is able to select a Tool (in this case our ‘ExtractiveQATool’) to perform the desired action.

The options for these Tools are effectively limitless. It is up to you the user, to define your own Tools and decide which ones to give your Agent access to.

What is the Difference between an Agent and a Pipeline?

Now let’s see how this differs from the Haystack Pipeline. A Pipeline is a powerful structure in Haystack. A simple pipeline might consist of:

A Retriever: finds the most relevant documents for a query.
A Reader looks at some piece of text and extracts an answer to a question from it.

When we run a pipeline with a query, data will flow from one node to the next (in our example: first retrieving documents and them extracting an answer from them), until it gets to the end of the pre-defined pipeline flow and outputs a result. That’s it. That’s where our time with you comes to an end, curtains close, we’re all happy. As you can see, a pipeline is inherently a one-pass system. And this is precisely where an Agent is different (very different) to a pipeline.

Where a Pipeline has a clear start and end, input and result, an Agent does not (although you may define the maximum number of iterations you will allow it). The core difference between these two structures is that while a Pipeline is designed to be a single-iteration system, an Agent is a many-iterations system. The number of iterations is defined by the Agent, which decides after each iteration whether it has come to a final answer or not. (Of course, with a system in place to prevent infinite loops!)

The Agent owes this talent to three things:

powerful LLMs that are able to consume increasingly complex instructions;
a well-defined prompt that defines the lifecycle of an Agent and causes it to continue until it has reached a solution; and
a set of Tools that are each good at doing something really well.

Customizing an Agent

You may have already noticed and started to use the PromptNode in Haystack. This node is an interface with LLMs such as text-davinci-003 by OpenAI; the Flan T5 models by Google and so on.

At its core, the Agent is a PromptNode that has been given a specific prompt. By default, the Agent you will find in Haystack will be using the PromptTemplate called “zero-shot-react”. However, you are free to create your own PromptTemplate and use that as the basis of your Agent.

The same goes for Tools. In Haystack, you can now turn virtually any Haystack component into a Tool that can be used by an Agent. This means you have the freedom to design and create Tools with your own Pipelines and Nodes, including the PromptNode. This makes Tools incredibly flexible and will allow you to add all sorts of capabilities to your agents.

An Example of Creating and Adding Tools

For a walkthrough of creating an Agent and adding Tools to it, visit our first tutorial on how to ‘Answer Multihop Questions with Agents’. But here’s the simple idea behind it:

First, create a tool. Let’s assume the following pipeline:

my_pipeline = ExtractiveQAPipeline(retriever=some_retriever, reader=some_reader)

This could become a tool as follows:

from haystack.agents import Tool
my_qa_tool = Tool(name="ExtractiveQATool", pipeline_or_node=my_pipeline, 
                  description="Useful for when you need to answer questions related to Yoda", 
                  output_variable="answers")

Note that description property you supply to your Tool is really important. The descriptions will be used by your Agent to make its decision about which Tool to use for the task at hand.

Then, you would simply add this Tool to the set of Tools an Agent has access to. Or as I like to imagine it, you, as the all-seeing divinity in charge of the measly Agent, grant it the ability to (as per the description) use the “ExtractiveQATool” to “answer questions related to Yoda” 😊

from haystack.agents import Agent
from haystack.nodes import PromptNode

prompt_node = PromptNode(model_name_or_path="text-davinci-003", api_key='OPENAI_API_KEY', stop_words=["Observation:"])

agent = Agent(prompt_node=prompt_node)

agent.add_tool(my_qa_tool)

We’re very excited to see how you use Agents and Tools in Haystack, and we can’t wait to start adding more capabilities and usability improvements. Join us on Discord or follow the activity in the Haystack repo to see what’s next for Haystack and the Agent

Announcing the Haystack Blog

Fri, 03 Mar 2023 00:00:00 +0000

We are thrilled to announce the launch of our new blog, a space to learn and share! The Haystack community has been growing at an incredible speed, reaching over 7k stars on GitHub and 900+ members on Discord, and we’re always looking for ways to make our community experience even better. Haystack blog is another step in that direction. 🚀

You might already be familiar with our deepset blog where we share tons of great articles. While we will continue to publish new content there too, on this new blog you will find more content targeted to developers building with Haystack specifically. It will be easier and faster for the Haystack developer to find the NLP content that is most relevant to them! 🏎️

Our goal is to create a space where you can find articles about state-of-the-art NLP technologies, such as LLMs, open source development, building with Haystack, community updates, and more. 📚

In addition to providing a space for our community to access high-quality NLP content, we would also be thrilled to give you the opportunity to submit your content and contribute to the Haystack blog! We believe learning from a diverse set of backgrounds is essential and we want to make sure everyone has a chance to share their knowledge and experience. 🤩

So, whether you are a seasoned NLP expert or just starting your journey, we invite you to check out our new blog. We are excited to hear your feedback and to continue growing our community together. 🙌

Do you have questions? Join our Discord community!

Build a Search Engine with GPT-3

Tue, 31 Jan 2023 00:00:00 +0000

If you’ve been online lately, then you’ve likely seen the excitement about OpenAI’s newest language model, ChatGPT. ChatGPT is astonishingly good at many things, including debugging code and rewriting text in whatever style you ask it. As an offshoot of GPT-3.5, a large language model (LLM) with billions of parameters, ChatGPT owes its impressive amount of knowledge to the fact that it’s seen a large portion of the internet during training — in the form of the Common Crawl corpus and other data.

After decades of chatbots that couldn’t even remember what they said a moment ago, it’s understandable that people are excited by a language model that can hold a conversation and create a solid semblance of intelligence. But when it comes to the validity of answers generated by these huge models, we need to stay critical. LLMs especially are prone to hallucinations: producing text that sounds sensible at first but doesn’t hold up to closer scrutiny, and presenting things as fact that are made up entirely.

Semantic search engines — our specialty here at deepset — are often powered by extractive question answering models. These models return snippets from the knowledge base verbatim, rather than generating text from scratch the way ChatGPT does. However, many applications can benefit from the abilities of generative LLMs. That’s why Haystack, deepset’s open-source framework for applied natural language processing (NLP), allows you to leverage multiple GPT models in your pipeline. With this approach, you can build a GPT-powered semantic search engine that uses your own data as ground truth and bases its natural-language answers on the information it contains.

You can think of Haystack as a comprehensive and highly flexible toolbox whose primary goal is to make building different flavors of NLP systems easy and fast, yet transparent. Besides providing a comfortable entry point to the OpenAI API, Haystack offers all the other components you need to successfully implement an end-to-end NLP system with GPT: a vector database, a module for retrieval, and the pipeline that combines all those elements into one queryable system.

In this article, we’ll demonstrate how to build a generative question answering system that uses the GPT-3 “ davinci-003” model to present results in convincing natural language.

The advent of large language models

Aren’t all modern language models large? It’s true — ever since the Transformer was introduced as a baseline-beating architecture for all kinds of NLP tasks, models have grown in size. But while the largest BERT model has 336 million parameters, OpenAI’s largest GPT-3.5 model — which ChatGPT is based on — has 520 times as many.

So what does GPT do with all its additional capacity? From observation, we can say that GPT is exceptionally good at understanding implication and intent. It can remember what’s been discussed earlier in the conversation, including figuring out what you’re referring to with words like “he” or “before that,” and it can tell you when your question doesn’t make sense. All of these properties account for the increased sense of actual intelligence. It also has to generate language from scratch, a much harder task than returning the right section from a corpus. But it has been shown, too, that GPT uses a lot of its parameters to store facts — actual information, which it then uses for tasks like code debugging and answering general-knowledge questions.

But here’s the problem: GPT can still make massive mistakes, and these are harder to detect because it is so good at conversing and making its answers and code examples look right. Early in December, the programming Q&A forum Stack Overflow temporarily banned ChatGPT-generated solutions. Meanwhile, hallucinations in large language models have spawned a new field of research.

However, there is a way to use GPT models in a way that is safer and generates value. By connecting the generative model to a textual database with curated, domain-specific content — say, a corpus of product reviews, a collection of financial reports, or a database with research papers — you can combine factual accuracy with GPT’s conversational prowess. With Haystack, you can set up such a GPT-based search engine in no time. The search engine is placed on top of your textual database, and returns natural-language answers in reply to an input query.

Different types of search engines

Semantic search engines come in different varieties and can roughly be distinguished by the type of answer they return. The answers could consist of matching documents (in document search), answer spans (in extractive QA), or newly generated answers (in generative QA).

The GenerativeQAPipeline: Haystack’s component for a generative search engine

For each of those search paradigms, Haystack offers ready-made pipelines: search engine configurations with placeholders for language models for maximum efficiency. In this tutorial, we use the GenerativeQAPipeline. It consists of a retriever (to find relevant documents) and a generator (to write text), chained together. The retriever connects to the database. Like the generator, it is often (but not necessarily) based on a Transformer model. Its task is to retrieve the documents from the database that are most likely to contain valuable information, based on a user’s input query. Our generative model then uses those documents as a factual basis to write its answer.

How to build a search engine using GPT-3

Before we get started, let’s quickly talk about the tools you need to follow along.

Prerequisites

You need to install Haystack. We use version 1.13.
To use the OpenAI API, you need to make an account and generate an API key. Note that while the first queries are free, you need to pay once you reach a certain limit. (OpenAI grants you an initial budget of $18, which is more than enough to complete this tutorial and play around with the pipeline.)
Here we use a small dataset of 18 Wikipedia articles about Berlin, the capital city of Germany. You can, of course, use your own dataset.
Since our embeddings and answers are provided through the OpenAI API, it’s not necessary to use GPUs for this guide. However, if you want to experiment with different models, we recommend that you work in a Colab notebook and activate the GPU (under “Runtime -> Change runtime type”).

Converting and preprocessing

Before you can set up the pipeline, you need to preprocess your data and add them to the document store, or database. There are many options for document stores in Haystack. This tutorial uses FAISS, which is a vector database.

Before you can feed data into the document store, you need to get it into the right format. The DocumentStore expects data to be supplied in the form of a Haystack data type called Document — a dictionary data type that stores information as a set of related fields (such as the document text and its metadata). The convert_files_to_docs function retrieves your files from a directory and converts them into Haystack documents. If you’re working with articles from Wikipedia, you can use the clean_wiki_text cleaning function that removes some Wikipedia-specific boilerplate.

from haystack.utils import convert_files_to_docs, clean_wiki_text  
  
docs = convert_files_to_docs(dir_path=DOC_DIR, clean_func=clean_wiki_text, split_paragraphs=True)

Depending on your own data’s format, you may need to follow a slightly different procedure for preprocessing. With Haystack, you can extract data from websites, or convert different file formats like pdf, txt, or docx files. Have a look at our preprocessing tutorial and the documentation page of the FileConverter to learn more.

Many documents, including Wikipedia articles about popular topics, can be very long. You need to make sure that the documents in your database are short enough for the embedding model to properly capture their meaning. To do this, you can use the PreProcessor to split them into shorter text snippets. We suggest a split length of 100 tokens per snippet, and an overlap of three tokens, to make sure no information gets lost.

from haystack.nodes import PreProcessor  
  
preprocessor = PreProcessor(  
 clean_empty_lines=True,  
 clean_whitespace=True,  
 clean_header_footer=False,  
 split_by="word",  
 split_length=100,  
 split_overlap=3,  
 split_respect_sentence_boundary=False,  
)  
  
processed_docs = preprocessor.process(docs)

What do these processed documents look like? Let’s have a look at one of them:

processed_docs[0]

<Document: {'content': 'The Berlin U-Bahn (German: [ˈuː baːn]; short for Untergrundbahn, "underground railway") is a rapid transit system in Berlin, the capital and largest city of Germany, and a major part of the city\'s public transport system. Together with the S-Bahn, a network of suburban train lines, and a tram network that operates mostly in the eastern parts of the city, it serves as the main means of transport in the capital.\nOpened in 1902, the U-Bahn serves 175 stations[1] spread across nine lines, with a total track length of 155.4 kilometres (96 miles 45 chains),[3] about 80% of which is underground.[4]  Trains run',   
'content_type': 'text',   
'score': None,   
'meta': {'name': 'Berlin U-Bahn.txt', '_split_id': 0},   
'embedding': None,   
'id': 'd2bf58a531b2500250650b43b1cce290'}>

Every document has been turned into an object of the Document class, which is a dictionary that contains not only the document’s text, but also some automatically generated metadata, like which file the text came from.

Initializing the DocumentStore

Time to set up the document store — for example, the vector-optimized FAISS database. When you initialize the document store, you need to know the length of your retriever’s document vector embeddings — its internal representations that it will produce for each document. Since you’ll be working with the high-dimensional text-embedding-ada-002 model from OpenAI, you need to set the vectors’ embedding_dim to 1536.

from haystack.document_stores import FAISSDocumentStore  
  
document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", embedding_dim=1536)

Now, delete any existing documents in the database, and add the preprocessed documents that you generated earlier.

document_store.delete_documents()  
document_store.write_documents(processed_docs)

Note that so far, the database only contains the plain-text documents. To add the high-dimensional vector embeddings — the representations of each document that make sense to the language model and that it can use for semantic search — you need to set up the model for retrieval.

Retriever

The retriever is the module that matches your query to the documents in the database, and retrieves those that it deems most likely to contain the answer. Retrievers can be keyword-based (like tf-idf and BM25), or they can encode semantic similarity through the use of Transformer-generated text vectors. In the latter case, the retriever is also used to index the documents in your database — that is, turn them into high-dimensional embeddings that the retriever can then search.

You’ll be working with OpenAI’s most recent retrieval model, text-embedding-ada-002. To initialize it in Haystack, you need to provide your OpenAI API key.

from haystack.nodes import EmbeddingRetriever  
  
retriever = EmbeddingRetriever(  
 document_store=document_store,  
 embedding_model="text-embedding-ada-002",  
 batch_size = 32,  
 api_key=MY_API_KEY,  
 max_seq_len = 1024  
)

When you set up the retriever, you connect it directly to your document store. Now you can use the update_embeddings method to turn the raw documents in the document store into high-dimensional vectors that the retrieval model can search and compare.

document_store.update_embeddings(retriever)

Generator

You are now ready to initialize the GPT model that will generate text for you. The OpenAIAnswerGenerator node can use four different GPT models. You can use the highest performing GPT-3.5 model, text-davinci-003.

from haystack.nodes import OpenAIAnswerGenerator  
  
generator = OpenAIAnswerGenerator(api_key=MY_API_KEY, model="text-davinci-003", temperature=.5, max_tokens=30)

We recommend increasing the max_tokens parameter from the default value of 13 to 30, so that the GPT model can produce longer sequences. We also suggest you set the temperature to .5 (default is .2), which gives the model slightly more liberty in generating its answers. The lower the temperature, the more faithful the model remains to the underlying source texts.

Pipeline

Now that all the individual elements of your GPT search engine are set up, it’s time to pass them to your generative QA pipeline.

from haystack.pipelines import GenerativeQAPipeline  
  
gpt_search_engine = GenerativeQAPipeline(generator=generator, retriever=retriever)

And that’s it! Your GPT-powered search engine is ready for querying.

Querying the pipeline

Now you can ask your system some general questions about Berlin (or whatever other topic your dataset is about). In addition to the query itself, you can pass a few parameters to the search engine, like the number of documents that the retriever should deliver to the generator and the number of answers that should be generated (both designated “top_k”).

query = "What is Berlin known for?"
params = {"Retriever": {"top_k": 5}, "Generator": {"top_k": 1}}  
  
answer = gpt_search_engine.run(query=query, params=params)

To print the answer generated by your pipeline, import Haystack’s handy print_answers function. It lets you determine the amount of detail that you want to see when printing the answer. Setting it to minimum will print only the answer string. So what’s the search engine’s answer to the question above?

from haystack.utils import print_answers  
  
print_answers(answer, details="minimum")

>>> Query: What is Berlin known for?  
Answers:  
[ { 'answer': "Berlin is known for its diverse culture, its nightlife, its contemporary arts, and its high quality of life."}]

Note that this answer is generated from scratch: it’s not a quotation from any of the Wikipedia articles, but has been written based on the content in them.

Generated answers are context-dependent

Remember how we said earlier that the GPT-3 model generates its answers on the basis of the documents that it receives? You can now test that by running the generator in isolation, without the retriever. You can’t run it without any documents at all, though, so you need to pass it a single snippet. Here’s what happens if you use the snippet about the U-Bahn that was printed out above.

generator.predict("What is Berlin known for?", documents=[processed_docs[0]])

>>> Query: What is Berlin known for?  
Answers:  
[ { "answer": "The Berlin U-Bahn."}]

The system answers that Berlin is well known for its underground railway, because that’s all the knowledge it has from that one document.

Now, go back to the full version of the search engine — the one that’s ingested your whole dataset (for example, all 18 Wikipedia articles about Berlin) — and ask it a few more questions, to get a better feeling for how your search engine operates.

Example 1

Query: When is the best time to visit Berlin?  
Answers:  
[ { "answer": """ Berlin is a great city to visit year-round, 
but the besttime to visit is during the summer months, from June to 
 August."""}]

Example 2

Query: Do people from Berlin have a dialect?  
Answers:  
[ { "answer": """ Yes, people from Berlin have a dialect, which is a variant   
 of the Brandenburgish dialect."""}]

Example 3

Query: Tell me about some interesting buildings in Berlin.  
Answers:  
[ { "answer": """The Berlin TV Tower is a TV Tower in Berlin, Germany. It   
 is Berlin’s tallest structure, at a height of 1,207 feet. """}]

Example 4

Query: How was the TV tower built?  
Answers:  
[ { "answer": """The TV tower was built by a team of architects between   
 1965 and 1969. The tower was built in a record-breaking 
 time of just 53 months,"""}]

One interesting fact about GPT is that it doesn’t always return the same answers. When prompted multiple times with the same query, it will try to come up with different answers. When we worked through this tutorial ourselves, our model came up with a major hallucination, after receiving the same question multiple times. Have a look at this answer:

Query: How was the TV tower built?  
Answers:  
[ { "answer": """ The TV Tower was built by the Soviet troops by setting the   
 contents on fire and turning the tower into a makeshift  
 chimney."""}]

This answer — which was ranked lower than the correct answer — is rather absurd. But it should serve as a reminder that the output of a generative model, even when it looks like a well-formed answer, can be entirely hallucinated and should be fact-checked.

Example 5

Query: Is Berlin a good place for clubbing?  
Answers:  
[ { "answer": """Yes, Berlin is a good place for clubbing. There are many 
  nightclubs, including the Watergate, Tresor, and Berghain."""}]

Comparing generative QA to extractive QA

Like this generative pipeline, a search engine that’s based on extractive QA returns answers about a corpus of documents in response to a query in natural language. But because the extractive QA system extracts its answers verbatim from the document text itself, it has some restrictions compared to the GPT search engine.

An extractive QA model is unable to produce the kind of conversational elements you can see in the examples above, like answering “Yes,” or repeating parts of the question in its answer. More significantly, it cannot answer questions as comprehensively as the GPT model, because it isn’t able to aggregate information from different texts.

In the last example, GPT affirmed that Berlin is a good place for clubbing, before going on to list some examples of famous night clubs.

But when an extractive QA model is asked the same question, it can only answer the question in an implicit manner, by extracting the text spans that it deems most relevant to the query. Here are two answers returned by such an extractive QA search engine, including the section (“context”) from which they were extracted:

Query: Is Berlin a good place for clubbing?  
Answers:  
[ { "answer": """ Partygoers in Germany often toast the New Year with a glass   
 of sparkling wine""",  
 "context": """ke place throughout the entire city. Partygoers in Germany   
 often toast the New Year with a glass of sparkling wine.\n  
 Berlin is home to 44 theaters and """},  
 { "answer": "Berlin’s club scene a premier nightlife venue”,  
 "context": """rticularly those in Western and Central Europe, made 
 Berlin’s club scene a premier nightlife venue. After the   
 fall of the Berlin Wall in 1989, many hi"""}]

Should you use generative or extractive QA?

The advantages of generative versus extractive QA are clear: it has better conversational skills, produces well-formed sentences in natural language, and can aggregate knowledge from multiple sources in a single answer. But, as we have seen, it also has some downsides. You should consider using extractive over generative QA in the following cases:

When you want to work with smaller, open-source models. As we’ve seen, GPT models are huge, and once you’ve reached a certain limit of queries, you need to pay to use the API. The Hugging Face model hub, on the other hand, hosts thousands of open-source, pre-trained models that you can download for free.
When you want transparency over where the model gets its information from. Extractive models don’t hallucinate. Naturally, these models can return wrong answers, too. But these are much easier to detect by checking the context from which they were extracted.
When you want to use the answers in a downstream task. One of the most popular applications for extractive question answering is in information extraction systems. Such systems have no use for the conversational abilities of a generative search engine. Rather, they require a search functionality that can extract factual information from large corpora quickly and reliably.

Haystack: the main framework for building search engines

Generative QA, extractive QA, translation, summarization and much, much more: with Haystack you can build the system that is best suited to solve your specific problem, using the hottest new architectures.

Our application-focused approach to NLP provides you with the modular building blocks to set up your own system in the shortest amount of time. Check out the Haystack repository to learn more, or have a look at our documentation.

Want to chat about GPT-3, search engines, and all things NLP? Join our Discord!

The Definitive Guide to BERT Models

Mon, 16 Jan 2023 00:00:00 +0000

Anyone who has studied natural language processing (NLP) can tell you that the state of the art moves exceptionally fast. Big players like Google, Facebook, or OpenAI employ large teams of experts to come up with new solutions that bring computers ever closer to a seemingly human-like understanding of language. This results in model architectures and other approaches quickly becoming obsolete, and what was considered cutting-edge technology six months ago may almost seem outdated today. Nevertheless, some models make such an impact that they become foundational knowledge even as they are eclipsed by their successors.

One model architecture for which this is true is BERT (short for Bidirectional Encoder Representations from Transformers, an unwieldy name almost certainly picked for its friendly acronym). Although the first BERT model — born late 2018 — is rarely used in its original form today, the adaptability of this model architecture in terms of tasks, languages, and even sizes means that direct BERT offspring are still thriving in all sorts of fields.

In the high-churn world of language models, it can be difficult to keep up and find the best option for your project. This post aims to refresh your knowledge of BERT, provide a survey of the various models that have iterated past the BERT baseline, and help you find the right BERT-like model for you.

Who is BERT?

Google researchers designed BERT as a general language model, adapting the Transformer architecture, which had made an enormous impact on the field of NLP just a year earlier. Aside from improving Google’s search results through its deep understanding of semantics, BERT’s main function is as a basis for specific “downstream” tasks like question answering or sentiment analysis. That’s because its ability to process written language at a near-human level greatly aids the BERT language model in solving other language-based tasks.

The impression BERT made on the NLP landscape in 2018 was incredible. After it was shown in the original paper that models based on BERT’s pre-trained architecture could outperform their competitors on many different tasks, industry observers predicted that this new model paradigm would be a game changer, one blog post even going so far as calling BERT “one model to rule them all.”

What sets BERT apart?

To understand the tremendous success of BERT, we need to dip into the history of language modeling. Before Transformers, language models were based on recurrent neural nets (RNNs). RNNs had a major edge over previous, “static” text embeddings like Doc2Vec in that they were able to capture the meaning of words in context.

However, RNN-based language models still had serious shortcomings. They are strictly sequential, meaning that they process the input token after token, rather than all at once. Plus, RNNs perform weakly on longer sequences, prompting researchers to implement remedies such as “memory cells” in long short-term memory (LSTM) networks, and processing an input string twice: once from left to right, and once from right to left. ELMo, often considered BERT’s direct predecessor, is an example of such a bidirectional LSTM model.

In contrast, BERT follows the Transformer architecture introduced in the seminal paper from 2017. Transformers are built on self-attention, a new core algorithm for detecting patterns, which enables the model to learn how individual tokens in an input sequence are contextually related to each other. Embeddings created by self-attention-driven Transformers are therefore highly dynamic and can easily differentiate language that machines traditionally find difficult, like homonyms or jargon words. In addition, self-attention can process input sequences at once, rather than sequentially, allowing for a faster training procedure.

However, in deep learning, the model architecture itself is just one element that contributes to a model’s success — the other one is the training task and the data it uses. One of the training tasks used to help BERT acquire language is “masked language modeling” (MLM). During training, individual tokens in the training input are masked, and loss is calculated based on how well BERT can predict those tokens. What makes this task so useful is that the training data for it can be generated in a trivial manner: simply obscure a random token in the source sequence, and make that same token the prediction target. No expensive data annotation is required.

Other directions in Transformers

BERT’s basic architecture is not the only way to use the self-attention developments of transformers. The well-known Generative Pre-trained Transformer (GPT) family of models utilizes the same baseline self-attention modules as BERT, but includes a decoding layer capable of generating text from inputs.

Keep in mind that although GPT gets a lot of public attention, BERT’s lack of a “decoder” component doesn’t mean GPT is better. BERT is a lighter model, we don’t use it for tasks that require language generation.

The BERT family

The success of the basic BERT model has led to a large number of popular variants. Newer, more powerful models soon appeared, BERT got siblings in all kinds of languages — some were even multilingual — and researchers used BERT as a base model to fine-tune and adapt to different tasks and language domains. Recently, we’ve seen a push for making Transformer models like BERT ever smaller and more efficient.

Below, we will break down some of these broad categories in the BERT family. You can explore all the different BERTs available on the Hugging Face model hub, and use the search and filter options to find one that fits your needs.

RoBERTa: the overachieving big sister

In 2019, Facebook engineers developed a significantly more powerful version of BERT using the same model architecture. By optimizing hyperparameters and feeding the model more data, they made sizable improvements to the basic BERT model while maintaining resource size and inference complexity. Another large step forward was the change over to dynamic MLM, providing varying sentence masking to increase the robustness of learning.

The resulting architecture was named RoBERTa, and was quickly adopted as a standard baseline over BERT. Today RoBERTa is still commonly used as a baseline, as in our standard question answering model for English.

Polyglottal BERT

Wikipedia is a great resource for training sets in many languages, so it is relatively trivial to retrain basic BERT models on other languages if a large enough Wikipedia exists.

At deepset, we are proud to have contributed some of the best-performing German language BERTs to the NLP community. Be sure to check out a variety of English and German language models on our website. German BERT is in the fine company of a large international model family that contains members such as AraBERT, CamemBERT, and RuBERT.

Many people also rely on multilingual BERTs for generalized semantic context. One of BERT’s more impressive offspring is the Multilingual BERT developed by Google, a BERT model trained on 104 different languages, which is capable of “zero shot” adaptation to a new language domain. This may be a quick and tidy solution to generalized language modeling, but when you need a specific semantic model, nothing beats training it yourself (as we demonstrated with German BERT).

BERT in the real world: fine-tuning

General language models are great, not because they are particularly useful on their own, but because they provide a deep understanding of a language’s semantics that can be used in downstream tasks. The models for those tasks are produced through fine-tuning.

In fine-tuning, the core architecture of a pre-trained model like BERT or RoBERTa is altered slightly to achieve a different target output. It is then trained on data specific to the downstream task: for a question answering model, that could be a dataset consisting of questions and answers, like SQuAD; for a summarization model, a dataset with long- and short-form document pairs. The advantage to these specializations is that a pre-trained base model requires much less specific data and much less training time to become performant in a given speciality.

Fine-tuning is a transfer learning method that has enabled BERT models to accomplish tasks as diverse as detection of “toxic” language, summarization, named entity recognition, question answering, and many others.

How BERT got its doctorate: domain adaptation

Language is notoriously divergent in different domains. Even when dialects aren’t formally distinguished, even a small vocabulary of jargon, terms of art, or antiquated usage can throw off the effectiveness of a language model in a particular application. Domain adaptation is the practice of gearing a general language model towards a more specialized jargon of a language.

Here are some examples of BERT models that have been adapted to different domains, to be used for specific business applications:

BioBERT: a BERT model retrained on a biomedical corpus, enabling applications like this named entity recognition model, which contextualizes obscure designations in text.
SciBERT: a similar model, but trained on over one million published articles. Performs at state of the art for computer science articles.
BERTweet: a RoBERTa model trained on 850 million tweets, seen used in analyzing public discourse around novel vaping technology.
FinBERT: a model that was first adapted to the financial domain, before fine-tuning it to the particularly tricky task of sentiment analysis on news snippets from the world of finance.

BERT today: the quest for smaller and faster models

A major restriction in the use of BERT (and other state-of-the-art models) today is the speed and fluidity of inference tasks. Most active development on these models is seeking to make them faster for better customer experience, or even scaled down enough to fit into smartphones and run locally.

That is why a current focus in BERT research is model distillation: using large, performant models to direct the training of smaller, more efficient models. The smaller, distilled model learns to imitate most of the larger model’s learning — knowledge which the smaller models would not have the bandwidth to acquire during their own training.

A large, pre-trained model serves as a teacher for an untrained smaller model, which is trained to mimic the teacher. The result is a compressed, less resource intensive model with similar results.

What’s next for BERT?

As our ability to shrink BERT improves, researchers are eager to expand the scope and abilities of the original model. Much of the innovation in BERT’s development is happening in changes to the training tasks that BERT is learning from.

Much of current research is directed towards finding important alternative training tasks, rather than innovating BERT’s core architecture. It is thought that adding a wider variety of training tasks will especially improve BERT as a core model from which others can be fine-tuned. Such multi-task learning — a single model that can solve multiple tasks — may have better semantic generalization and therefore better performance on individual tasks.

Pushing the envelope, some directions of research are developing multimedia BERT models. Using a modified BERT as a core semantic embedding, models can be trained to translate from strings to other media, as in our semantic image search tutorial.

Conclusion

If you want to learn more about BERT models through personal tinkering, our Haystack framework is a seamless tool for setting up a project. You can dive in right now with this tutorial on OpenAI’s GPT models for text generation.

If you’d like to ask about our tutorials around BERT or just connect with developers working at the cutting edge of NLP, our Discord is full of friendly BERTologists ready to chat.

How to Build a Semantic Search Engine in Python

Wed, 23 Nov 2022 00:00:00 +0000

Semantic search is the task of retrieving documents from a collection of documents (also known as a ‘corpus’) in response to a query asked in natural language. Powered by the latest Transformer language models, semantic search allows you to access the best matches from your document collection within seconds, and on the basis of meaning rather than keyword matches. As well as being helpful in its own right, semantic search also forms the basis for many complex tasks, like question answering or text summarization.

Over the last decade or so, Python has become the principal language for machine learning (ML) and natural language processing (NLP). In this article, we will show you how to set up a semantic search engine in Python, placing it on top of your document collection of choice, with our open source Haystack framework. Thanks to Haystack’s modular setup and the availability of high-quality pre-trained language models, you’ll be able to set up your own semantic search system in less than twenty minutes.

A refresher on semantic search

Like all Transformer-based language models, the models used in semantic search encode text (both the documents and the query) as high-dimensional vectors or embeddings. We can then use similarity measures like cosine similarity to understand how close in meaning two vectors (and their associated texts) are. Texts that are similar in meaning are closer to each other, while unrelated texts are more distant. While illegible to humans, the vector-based representation works very well for computers to represent meaning.

The superiority of semantic search over a keyword-based approach becomes clear if we look at an example. Think of the difference between the queries “why can’t I commit changes” — a perennial problem for the novice Git user — and “why can’t I commit to changes” — a problem for the indecisive. The preposition “to” entirely changes the meaning of the query, which is impossible to detect with simple keyword matching. A semantic language model (like the one used by Google) will embed the two queries in disparate locations of the vector space.

Semantic search is great for disentangling subtleties like this. Here are some scenarios where semantic search will be particularly helpful:

When your application is searching a large or ambiguous corpus. Many heterogeneous corpuses present problems for simple keyword matching. Imagine trying to compare reward programs between airlines. Every company has a different term for its customer currencies and rebate programs. Many of those terms borrow from common concepts like miles, points, or rewards. A semantic search engine can easily capture the similarity of these terms, where exact text matching would not.
When your users aren’t finding what they’re looking for through keywords. In keyword matching systems, good results are dependent on an exact match between corpus and query. When users are unable to find quality search results, they often painfully iterate search terms, hunting for the right combination to unlock the corpus. This is a poor user experience, and can be replaced with a more comfortable and natural semantic search model.
When you want to offer your users a more intuitive search interface. Semantic search is simply more comfortable and enjoyable for sifting through documents. Human minds work in human terms, and most people prefer it when they don’t have to adapt their language to a computer.

Now, let’s look at how Haystack and Python make it possible to build semantic search systems in just a few lines of code.

Haystack: Applied NLP in Python

Haystack is our framework for applied NLP that uses a modular, mix-and-match approach to building NLP systems. These days, the highest-performing language models are huge. Training such a language model yourself on your local PC is impractical, uneconomical and, frankly, often impossible. However, thanks to centralized platforms like the Hugging Face model hub, pre-trained models can be shared and reused by everyone.

With pre-trained models being readily available, the Haystack framework offers the architecture that makes them work in practice: the Python library comes with several preconfigured pipelines, where all you need to do is plug in the language model that works for you. This makes it easy to build prototype systems quickly, using different models. At the same time, Haystack offers all the flexibility that we’ve come to expect from Python libraries, making it extremely easy to configure and customize your own pipeline.

Our pipelines are essentially nodes connected by edges. A node can hold a specific language model, which transforms its input before passing it on to the next node. Once you’ve decided on the pipeline setup that works best for your use case, you can easily iterate through different language models, to find the combination that gives you the best results. Note that whenever you change the model combination, you’ll need to initialize your pipeline from scratch.

Building a Semantic Search Engine: Prerequisites

To build a semantic search prototype, think about three aspects in advance: what documents you want to search, the design of your pipeline, and which language models to use (often, you’ll have a number of models that you want to compare to each other).

For the context of this guide, imagine that you’ve got some gardening work to do, and you’ve acquired a corpus of e-manuals that should help you find the best tools for different tasks. Unfortunately, there are too many documents to look through manually. If you’re not very familiar with the gardening jargon, a keyword-based search would likely be only moderately successful.

You can experiment with different pipeline designs for your semantic search system. First, try using the basic, ready-made DocumentSearch pipeline, which consists only of one node: the retriever, which extracts the best matches from your documents collection.

The retriever selects the best documents in response to a query. Before you can use a Transformer-based retrieval model in practice, you’ll need to index your documents; that is, pre-process them and let the retrieval model turn them into vector representations, which it can compare later. In more complex pipeline architectures, the retriever often works as a “sieve” that presents a small pre-selection of documents to the next node, taking the burden of processing high amounts of texts off more computationally costly models.

Build Your Semantic Search System With The Haystack Python Framework

We’ll be working with the FAISS document store as our database, which is optimized for working with vector representations. Make sure to install Haystack (we use version 1.11) with FAISS support enabled:

pip install git+https://github.com/deepset-ai/haystack.git  
pip install 'farm-haystack[faiss]'

You can then start by reading in and converting the files, having stored them as .txt documents locally under datapath. If you want to use the same .txt files that we use in this project, you can download them from this Google Drive folder Haystack’s convert_files_to_docs function recognizes the files’ format automatically and converts them into objects of the Document class, which can then be read into the document store.

from haystack.utils import convert_files_to_docs  
  
all_docs = convert_files_to_docs(dir_path=datapath, split_paragraphs = "True")

Now, import the FAISSDocumentStore class, initialize the document store, and add the documents.

from haystack.document_stores.faiss import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat", similarity="cosine")  
document_store.write_documents(all_docs)

Next, import the retriever class and initialize it with the retrieval model that you want to use for this task. Set the top_k parameter — the number of documents returned by the retriever — to 3.

from haystack.nodes import EmbeddingRetriever  

model = 'sentence-transformers/multi-qa-mpnet-base-dot-v1'  
retriever = EmbeddingRetriever(document_store=document_store, use_gpu=True, embedding_model=model, top_k=3)

Now you’ll need to update the documents in the document store by running them through the retriever model. This is also known as indexing. Since the Transformer model needs to process all the documents in your database, this step will probably take several minutes.

document_store.update_embeddings(retriever)

After you’ve indexed your documents, it’s finally time to set up the pipeline by importing the class and initializing it with the retriever.

from haystack.pipelines import DocumentSearchPipeline
  
semantic_search_pipeline = DocumentSearchPipeline(retriever=retriever)

Your Python Semantic Search Engine in Action

Now that your documents have been stored and indexed in the document store, and your pipeline is set up and connected to it, it’s time to ask your questions. Keep in mind that since you’re using a document search pipeline, the results will be documents rather than answers. You could start with the following:

question = "What's a good machine for cutting grass?"  
prediction = semantic_search_pipeline.run(query=question)

The prediction variable stores a dictionary. Have a quick look at its contents:

print(prediction.keys())  
>>> dict_keys(['documents', 'root_node', 'params', 'query', 'node_id'])

The documents key is the most interesting of all of these: it contains the documents that match your query, according to the model. Look at the first 100 characters of each result:

for i, pred in enumerate(prediction['documents']):  
   print(i, pred.content[:100] + '\n')  
>>> 0 Fast, clean and dependable mowing. Land Pride Flail Mowers are the perfect choice for schools and ot  
>>> 1 All-Flex Mowers. Our name says it all! Maintaining a large area of grass in a picturesque setting is  
>>> 2 - Steep Slopes 0° to 31° - High production mowing of roadways and parks - Steering wheel steered for

These are looking great! Clearly the language model can understand that “cutting grass” is very similar to the action of “mowing.” If you remember that the underlying model is a general language model that has not been fine-tuned to the specific topic of gardening (or to the texts at hand), the potency of these models becomes evident. Now try a different query:

question = "What are the best tools for digging a ditch?"  
prediction = semantic_search_pipeline.run(query=question)

And print the three results again:

>>> 0 Double Pocket Chart Stand with Storage Assembly Instructions Contents: A. Bottom Side Pole (4) B. Mi  
>>> 1 Ideal for ditching, road grading and all-around farm use. RBT40 SERIES 65 - 100 HP -- 84" 96" 108" Ca  
>>> 2 Small farms and food plots ... just right for the Land Pride disc. Land Pride Disc Harrows break up th

These results definitely seem on topic — but the order could be improved. After all, the second result sounds like the ideal match. You could replace your retrieval model with the sentence-transformers/multi-qa-mpnet-base-dot-v1 model, which is much better suited to this asymmetric semantic search task. However, for the purpose of this tutorial, we’ll show you a different approach to improving your semantic search engine.

Expand Your Python Semantic Search Engine

As an alternative to using a more complex model, you can expand your pipeline by setting up a custom design that contains a ranker node in addition to the retriever. The ranker uses its own Transformer-based language model to re-rank the documents it receives from the retriever.

First, you need to import two additional classes — the general pipeline and the ranker node — and instantiate them.

from haystack import Pipeline  
from haystack.nodes import SentenceTransformersRanker  
  
pipeline = Pipeline()  
ranker_model = "cross-encoder/ms-marco-MiniLM-L-12-v2"  
ranker = SentenceTransformersRanker(model_name_or_path=ranker_model, top_k=3)

Next, add the two modules — retriever and ranker — to your pipeline object. Whenever you add a node to a pipeline, you need to give it a name by which you can refer to it. You also need to tell the pipeline how the nodes are connected to each other by specifying the origin of the input for each node. By default, the first input is always designated Query.

pipeline.add_node(component=retriever, name='Retriever', inputs=['Query'])  
pipeline.add_node(component=ranker, name='Ranker', inputs=['Retriever'])

And print the first part of the top three documents again:

>>> 0 Ideal for ditching, road grading and all-around farm use. RBT40 SERIES 65 - 100 HP -- 84" 96" 108" Ca  
>>> 1 Small farms and food plots ... just right for the Land Pride disc. Land Pride Disc Harrows break up th  
>>> 2 Double Pocket Chart Stand with Storage Assembly Instructions Contents: A. Bottom Side Pole (4) B. Mi

Adding a ranker to your semantic search engine can definitely improve your results, especially if you’re unhappy with the retriever’s results. From here on, there are no limits to your creativity. For instance, the retriever-ranker setup works well in connection with a summarizer node, in a popular custom pipeline setup. And, of course, we would be remiss if we didn’t mention the crucial role of semantic search in most applications of open-domain question answering (QA).

Because question answering models are so computationally expensive, it is necessary to install a coarse “sieve” that can extract the best answer candidates to a query from a large corpus. Semantic search models like the retrieval model you used earlier have that property, and are therefore part of every open-domain question answering system. In a question-answering pipeline, the reader node contains the actual QA model, which “reads” through only the documents that the retriever has pre-selected from the larger corpus.

Come Join the Haystack Community

Haystack is the framework of choice for the Python programmer looking to implement a semantic search system. Moreover, anyone who wants to use the latest NLP models in their product — be it an app, a website, or something else — can do so easily through Haystack’s modular and application-oriented nature.

If you want to tweak your semantic search system and use, say, a different database as your document store, have a look at our documentation.

Do you have more questions or simply want to see what other people are up to in the applied NLP space? We run a vibrant NLP community on Discord, which is open to everyone. Meet other open source enthusiasts and talk directly to our team members.

Want to add more nodes to your pipeline, fine-tune your own models or create new datasets? Why not check out the Haystack repository on GitHub — and give us a star while you’re there?

When and How to Train Your Own Language Model

Wed, 03 Aug 2022 00:00:00 +0000

Many people, when considering whether to incorporate natural language processing functionality into their product, shy away from the perceived complexity of the task. Modern-day NLP operates with huge language models that learn from large amounts of data.

What many beginners don’t know, however, is that the vast majority of use cases don’t require training a new language model from scratch. There are already tens of thousands of pre-trained models freely available online, which can be used out of the box.

Still, there are many use cases that do benefit from fine-tuning or domain adaptation, which means refining a pre-trained language model on a smaller custom dataset. In this article, we’ll guide you through the process of experimenting with different language models and understanding when to train your own models.

Recap: What Is a Language Model?

A language model is a computational, data-based representation of a natural language. Natural languages are languages that evolved from human usage (like English or Japanese), as opposed to constructed languages like those used for programming. With the help of language models, computers can process natural language.

A language model is not a knowledge base. Sometimes we can be misled into thinking that language models know things because they answer all our questions so eloquently. But in actuality, most language models are purely geared towards a human-like understanding of language, not the memorization of facts. Their perceived knowledge comes from the database that they operate on.

With the help of a framework like Haystack, language models may be placed on top of any textual database. That’s why you could deploy any language model out of the box for your own use case, without modifying or training it any further.

Working with Pre-trained Language Models

The Hugging Face model hub is the go-to location for finding pre-trained language models — that is, models which have already been trained and are ready for use. You’ll find tens of thousands of models on the model hub, which differ in language, use case and size.

Some Preliminary Considerations

We usually recommend experimenting with multiple models, to find the one that is best suited to your application, your overall system pipeline and your computing resources. To find the right models for your use case, it’s helpful to think about a couple of questions in advance:

Who are the prospective users and what can they expect from your system?
How fast does it need to be?
How accurate does it need to be?
What computing resources do you have?

The last three questions are relevant because they might put some restrictions on the size of the model that you can operate with. High-performing language models are usually very large, which means that they take up space on your hard drive, are slow to train and take longer to make a prediction. In the past years, the NLP field has come up with ingenious techniques for knowledge distillation that make models smaller while retaining their prediction quality.

Another point to consider is the shape of your data. Is it already neatly stored in some database, like Elasticsearch or Weaviate? Or does it consist of a collection of documents in some folder on your desktop? In the latter case, you’ll probably want to perform some preprocessing. With just a few lines of code, Haystack lets you extract text from PDF or TXT files, as well as cleaning and splitting it into snippets of uniform length (have a look at this tutorial for details.)

Experimenting with Different Models

Once you’ve chosen a couple of candidate models, it’s time to plug them into your pipeline and start evaluating them. To assess how suited the models’ capabilities are to your use case, it’s a good idea to prepare a few samples from your own data and annotate them.

The importance of curating your own datasets cannot be overstated. Machine learning models revolve entirely around data. If they’re trained on low-quality data, the models themselves won’t be worth much. Similarly, you can only evaluate the quality of a model’s predictions if you have ground-truth labels against which those predictions can be compared.

Our evaluation mode outputs a couple of metrics that quantify a model’s prediction quality. If you’ve never evaluated an NLP model before, the F1 score is a pretty safe bet. It strikes a balance between precision (how many of the data points that the model assigned to a class actually belong to the class?) and recall (how many datapoints of a class did the model correctly assign to the class?) by taking their harmonic mean, a mathematical average that works well with ratios. Now you can run your pipeline with different models and compare their F1 scores.

In our experience, out-of-the-box models perform reasonably well for most use cases. When it comes to question answering in English, for example, most of our users are perfectly happy with the prediction quality of our RoBERTa-base-SQuAD model. But if the scores are too low for your liking, or you’re just feeling experimental, you could go a step further and fine-tune an existing model. In that case, the original score acts as a baseline against which you can compare your next generation models.

Sometimes, it’s hard to define what “good” really means. Is an F1 score of .75 good or bad? It depends! The best way to find out is to let your users judge. While we know that involving your users at this early stage might be tricky, they can provide invaluable feedback. For example, you could set up different pipelines in deepset Cloud and let your users provide feedback directly in the interface. That will give you a much clearer idea of what your data can accomplish in combination with a pre-trained language model, and whether you might want to fine-tune it.

When to Train Your Own Language Model

It will rarely be feasible to train your own Transformer-based language models from scratch. However, if you do have the data and computing resources to train a new BERT model for an underrepresented language, be sure to share the model on the model hub, so that others can benefit from it too. You can read about our own experience training several Transformer-based language models for German here.

In most cases, however, you’ll want to look into domain adaptation and/or fine-tuning. While the industry has come to use the two terms interchangeably, they originally describe two different techniques: fine-tuning means further training a general-purpose, pre-trained model to accomplish a specific task, question answering for example. Domain adaptation, on the other hand, means further training a model to better understand a domain-specific language, like legal or scientific jargon.

If you find that the prediction quality of the existing models is not up to scratch — either because your use case is not captured well, or your documents use a domain-specific language not represented by the likes of FinBERT, LEGAL-BERT and SciBERT — then it’s time to annotate more data and use it to subject your pre-trained models to a few more training steps.

How to Fine-tune a Language Model

If done correctly, fine-tuning can be a rewarding process. As you tweak your pre-trained model and feed it more use-case-specific data, its prediction quality will increase, at times dramatically.

You can fine-tune your model in Haystack with just a few lines of code. Find out how by following our tutorial. Here’s what else you should know about fine-tuning:

You’ll need more data. Unlike in the process above, where we employed user-specific data to evaluate a language model, for fine-tuning you’ll also need labeled data for training. The number of datapoints required depends on your use case, the size of the models and how diverse your overall dataset is. As an alternative (or complement) to annotating data, you could look into data augmentation.
Fine-tuning (and model training in general) is an iterative process. Evaluate your model once it’s been trained, and try to beat that score by tweaking some model parameters and training it again. To identify your ideal model settings, you’ll probably need to go through a few iterations of train-evaluate-tweak-repeat. Fine-tuning a model in Haystack is as simple as calling .train() on an initialized model.
Pro-tip: experiment with smaller models rather than larger ones. To reduce the time consumed by iteratively training your models during fine-tuning, you’ll want to keep the model small. We often fine-tune a smaller version of a given architecture (say, BERT-base instead of BERT-large) to find the optimal hyperparameter settings. Once you’ve found the right settings, you can apply them to the larger model.
Pro-tip: distill your models. Most use cases will benefit from smaller, faster models. That’s why we are huge fans of model distillation. That technique lets you “distill” the knowledge from a larger model into a smaller one. The result is a model that’s faster, more environmentally friendly, and cheaper to use in production. What’s not to like?

The Case for Data Labeling

Many people underestimate the role data labeling can play in machine learning. It’s boring, monotonous, and can seem disappointingly manual when we expect our models to perform some kind of AI magic. That’s probably why we often see teams spend way too many hours trying to squeeze more out of their language models, when that time would actually be much better spent annotating some data.

What really makes for good models is annotated data, specifically “difficult” annotations, as they can teach your model to deal with cases that even humans find hard to handle.

While we’ll admit that annotation might not be the most fun work, there are tools to make the process easier for everyone. For example, the Haystack annotation tool provides the framework for a more streamlined process. Clear guidelines go a long way toward a well annotated and consistent dataset. It’s also valuable to engage with your own data intimately, as it will increase your understanding of the use case and why certain predictions may be hard for your model.

So to really drive the point home: we recommend investing in data annotation rather than model creation. Machine learning researchers have worked hard to come up with model architectures that emulate linguistic intuition faithfully, and new techniques are constantly emerging to make existing models smaller and faster. But you and your team’s expertise lies in your own data — and that is precisely the area where you can have the biggest impact on your models’ performance.

Getting Started

Whether on a hosted platform like deepset Cloud or in your own setup, model training doesn’t have to be a burden; it can actually be kind of fun. Annotate data, tweak some parameters and watch your language model as it becomes smarter and more adapted to your use case.

We have accompanied many different teams on their quest to make NLP work for them. Come talk to us about your project in our Discord community, or check out the Haystack GitHub repository.

If the process of evaluating and fine-tuning manually seems daunting and time-consuming, have a look at deepset Cloud, our end-to-end platform for designing and deploying NLP-based solutions.

Finally, if you’re looking to learn more about modern NLP, make sure to download our free ebook “NLP for Product Managers.”

What is a Language Model?

Wed, 20 Jul 2022 00:00:00 +0000

Our aim at deepset is that everyone, no matter their level of technical background, can harness the power of modern natural language processing (NLP) and language models for their own use case. Haystack, our open-source framework, makes this a reality.

When we talk to our users, we encounter common sources of confusion about NLP and machine learning. Therefore, in the upcoming blog posts, we want to explain some basic NLP concepts in understandable language. First up: language models.

Language Models in NLP

Language models take center stage in NLP. But what is a language model? To answer that question, let’s first clarify the term model and its use in machine learning.

What is a machine learning model?

The real world is complex and confusing. Models serve to represent a particular field of interest — a domain — in simpler terms. For example, weather models are simplified representations of meteorological phenomena and their interactions. These models help us understand the weather domain better and make predictions about it.

In machine learning, models are much the same. They serve mainly to predict events based on past data, which is why they’re also known as forecasting or predictive models.

The data that we feed to an ML algorithm allows it to devise a model of the data’s domain. That data should represent reality most faithfully, so that the models which are based on it can approximate the real world as closely as possible.

What is a language model?

A language model is a machine learning model designed to represent the language domain. It can be used as a basis for a number of different language-based tasks, for instance:

and plenty of other tasks that operate on natural language.

In a domain like weather forecasting, it’s easy to see how past data helps a model to predict a future state. But how do you apply that to language? In order to understand how the concept of prediction factors into language modeling, let’s take a step back and talk about linguistic intuition.

Linguistic intuition

As the speaker of a language, you have assembled an astonishing amount of knowledge about it, much of which cannot be taught explicitly. It includes judgments about grammaticality (whether or not a sentence is syntactically correct), synonymity (whether two words mean roughly the same) and sentence completion. Suppose I asked you to fill in the gap in the following sentence:

“Julia is looking for ___ purse.”

You’d probably say “her” or “my” or any other pronoun. Even a possessive noun phrase like “the cat Pablo’s” would work. But you wouldn’t guess something like “toothbrush” or “Las Vegas.” Why? Because of linguistic intuition.

Training a language model

Language models seek to model linguistic intuition. That is not an easy feat. As we’ve said, linguistic intuition isn’t learned through schooling but through constant use of a language (Noam Chomsky even postulated the existence of a special “ language organ” in humans). So how can we model it?

Today’s state of the art in NLP is driven by large neural networks. Neural language models like BERT learn something akin to linguistic intuition by processing millions of data points. In machine learning, this process is known as “training.”

To train a model, we need to come up with tasks that cause it to learn a representation of a given domain. For language modeling, a common task consists of completing the missing word in a sentence, much like in our example earlier. Through this and other training tasks, a language model learns to encode the meanings of words and longer text passages.

So how do you get from a computational representation of a language’s semantic properties to a model that can perform specific tasks like question answering or summarization?

General-purpose Versus Domain-specific Language Models

General language models like BERT or its bigger sister RoBERTa require huge amounts of data to learn a language’s regularities. NLP practitioners often use Wikipedia and other freely available collections of textual data to train them. By now, BERT-like models exist for practically all the languages with a sufficiently large Wikipedia. In fact, we at deepset have produced several models for German and English, which you can check out on our models page.

So what can you do with these models? Why are they so popular? Well, BERT can be used to enhance language understanding, for example in the Google search engine. But arguably the biggest value of general-purpose language models is that they can serve as a basis for other language-based tasks like question answering. By exposing it to different datasets and adjusting the training objective, we can adapt a general language model to a specific use case.

Fine-tuning a language model

There are many tasks that benefit from a representation of linguistic intuition. Examples of such tasks are sentiment analysis, named entity recognition, question answering, and others. Adapting a general-purpose language model to such a task is known as fine-tuning.

Fine-tuning requires data specific to the task you want the model to accomplish. For instance, to fine-tune your model to the question-answering task, you need a dataset of question-answer pairs. Such data often needs to be created and curated manually, which makes it quite expensive to generate. On the bright side, fine-tuning requires much less data than training a general language model.

Where to look for models

Both general-purpose models and fine-tuned models can be saved and shared. The Hugging Face model hub is the most popular platform for model-sharing, with tens of thousands of models of different sizes, for different languages and use cases. Chances are high that your own use case is already covered by one of the models on the model hub.

To help you find a model that might fit your needs, you can use the interface on the left side of the model hub page to filter by task, language, and other criteria. This lets you specifically look for models that have been trained for question answering, summarization, and many other tasks. Once you’ve found a suitable model, all you need to do is plug it into your NLP pipeline, connect to your database, and start experimenting.

How to handle domain-specific language

Though we often talk about languages as if they were homogeneous entities, the reality is very far from that. There are, for example, some professional domains — like medicine or law — that use highly specialized jargon, which non-experts can barely understand. Similarly, when a general BERT model is used to process data from one of those domains, it might perform poorly — just like a person without a degree in the field.

A technique called domain adaptation provides the solution: here, the pretrained model undergoes additional training steps, this time on specialized data like legal documents or medical papers.

The Hugging Face model hub contains BERT-based language models that have been adapted to the scientific, medical, legal, or financial domain. These domain-specific language models can then serve as a basis for further downstream tasks. For instance, this highly specialized model extracts named entities (like names for cells and proteins) from biomedical texts in English and Spanish.

What Can Language Models Do?

Language models can seem very smart. In this demo, for example, we show how well our RoBERTa model can answer questions about the Game of Thrones universe. It’s important to note, though, that this language model doesn’t actually know anything. It is just very good at extracting the right answers from documents — thanks to its mastery of human language and the fine-tuning it received on a question-answering dataset. It operates similarly to a human agent reading through documents to extract information from them, only much, much faster!

Other types of language models take a completely different approach. For example, the famed GPT family of generative language models actually do memorize information. They have so many parameters — billions — that they can store information picked up during training in addition to learning the language’s regularities.

So what can a language model do? Exactly what it’s been trained to do — not more, not less. Some models are trained to extract answers from text, others to generate answers from scratch. Some are trained to summarize text, others simply learn to represent language.

If your documents don’t use highly specialized language, a pre-trained model might work just fine — no further training required. Other use cases, however, might benefit from additional training steps. In our upcoming blog post, we’ll explore in more detail how you can work with techniques like fine-tuning and domain adaptation to get the most out of language models.

Composable NLP with Haystack

Modern NLP builds on decades of research and incorporates complex concepts from math and computer science. That’s why we promote a practice of composable NLP with Haystack, which lets users build their own NLP-based systems through a mix-and-match approach. You don’t have to be an NLP practitioner to use our framework, just as you don’t need to know anything about hardware or electricity to use a computer.

Want to see how to integrate pre-trained language models into an NLP pipeline? Check out our GitHub repository or sign up to deepset Cloud.

To learn more about NLP, make sure to download our free ebook NLP for Product Managers.

Finally, we’d be happy to see you on our Discord server, where members of the deepset team answer questions from the community.

What Is Text Vectorization? Everything You Need to Know

Fri, 03 Dec 2021 00:00:00 +0000

For as long as we have had computers, there has been the question of how to represent data in a way that machines can work with. In natural language processing (NLP), we often talk about text vectorization — representing words, sentences, or even larger units of text as vectors (or “vector embeddings”). Other data types, like images, sound, and videos, may be encoded as vectors as well. But what exactly are those vectors, and how can you use them in your own applications?

In this post, we’ll track the history of text vectorization in machine learning to develop a full understanding of the modern techniques. We’ll briefly look at traditional count-based methods before moving on to Word2Vec embeddings and BERT’s high-dimensional vectors. We will discuss how Transformer-based language models have brought deep semantics to text vectorization, and what that means for modern search systems. Finally, we will look at the recent and exciting trend of vector databases.

Count-Based Text Vectorization: Simple Beginnings

In programming, a vector is a data structure that is similar to a list or an array. For the purpose of input representation, it is simply a succession of values, with the number of values representing the vector’s “dimensionality.” Vector representations contain information about the qualities of an input object. They offer a uniform format that computers can easily process.

Bag of Words

One of the simplest vectorization methods for text is a bag-of-words (BoW) representation. A BoW vector has the length of the entire vocabulary — that is, the set of unique words in the corpus. The vector’s values represent the frequency with which each word appears in a given text passage:

TF-IDF

Weighted BoW text vectorization techniques like TF-IDF (short for “term frequency-inverse document frequency), on the other hand, attempt to give higher relevance scores to words that occur in fewer documents within the corpus. To that end, TF-IDF measures the frequency of a word in a text against its overall frequency in the corpus.

Think of a document that mentions the word “oranges” with high frequency. TF-IDF will look at all the other documents in the corpus. If “oranges” occurs in many documents, then it is not a very significant term and is given a lower weighting in the TF-IDF text vector. If it occurs in just a few documents, however, it is considered a distinctive term. In that case, it helps characterize the document within the corpus and as such receives a higher value in the vector.

BM25

While more sophisticated than the simple BoW approach, TF-IDF has some shortcomings. For example, it does not address the fact that, in short documents, even just a single mention of a word might mean that the term is highly relevant. BM25 was introduced to address this and other issues. It is an improvement over TF-IDF, in that it takes into account the length of the document. It also dampens the effect of having many occurrences of a word in a document.

Because BoW methods will produce long vectors that contain many zeros, they’re often called “sparse.” In addition to being language-independent, sparse vectors are quick to compute and compare. Semantic search systems use them for quick document retrieval.

Let’s now look at a more recent encoding technique that aims to capture not just the lexical but also the semantic properties of words.

Word2Vec: Inferring Meaning from Context

Words are more than just a collection of letters. As speakers of a language, we might understand what a word means and how to use it in a sentence. In short, we would understand its semantics. The sparse, count-based methods we saw above do not account for the meaning of the words or phrases that our system processes.

In 2013, a team led by NLP researcher Tomáš Mikolov came up with the Word2Vec method, which could represent the semantic and syntactic properties of words through “word embeddings.” Word2Vec follows the idea that the meaning of words lies in their distributional properties — the contexts in which a word is used. There are two main implementations of Word2Vec ( CBOW and skip-gram). Both train a shallow neural net to represent words as feature vectors of variable length (typically 300). These vectors are dense, meaning that they consist of mostly floating point values, rather than zeros.

In the high-dimensional Word2Vec embedding space, similar words lie close to each other. For example, we would expect the words “orange” and “apple” to be close, while, say, “house” or “space ship” should be further away from the pair. Semantic textual similarity is measured by way of a distance metric, typically cosine similarity. While it is virtually impossible to picture 300 dimensions in your head, the embeddings can be reduced to just two dimensions for visualization purposes, as in this interactive example.

Often, we will want to encode the meaning of longer passages, like sentences of entire documents. We could add or average the individual word vectors to produce one vector for the entire text. Another approach is the Doc2Vec model, which has been trained in a manner similar to Word2Vec, only on documents rather than words.

However, despite encoding semantic properties of words, the resulting text vectorization still leaves much to be desired. For one, Word2Vec encodings are rigid: a given word will always be encoded in the same way, regardless of its position in the sentence, or whether it has multiple meanings. The model also has no way to account for unknown words.

Finally, Word2Vec-based methods have not had much success at representing longer text passages. Luckily, all of these problems have been timely addressed by a new technique for producing vectors: the Transformer-powered BERT.

Transformers: Bringing Deep Semantics to Text Vectorization

The legendary BERT (“Bidirectional Encoder Representations from Transformers”) language model by Jacob Devlin and colleagues overcomes the obstacles presented by Word2Vec-based embedding methods. BERT is able to produce contextualized word vectors by encoding a word’s position in the text in addition to the word itself. Furthermore, this approach can account for unknown words as well as words with multiple meanings.

BERT’s success is based on its Transformer architecture, as well as the vast amounts of data that it uses to learn. During training, BERT “reads” the entire English-language Wikipedia and the BooksCorpus, a large collection of unpublished novels. Successors to BERT, like RoBERTa, are trained on even bigger text collections. This way, Transformer-based language models learn a deep, context-aware representation of language.

Semantic Search Systems: Better Document Retrieval with High-Dimensional Vectors

In the context of semantic search systems, we need a way to vectorize both the documents in our document store, as well as the query that we want to match them to. We’ll present two different techniques that use Transformers to accomplish that task: Sentence-BERT and Dense Passage Retrieval.

Sentence-BERT, or SBERT, uses an encoder that can convert longer passages of text into vectors. In a semantic document search system, SBERT generates one vector for each existing document. These vectors are then indexed alongside the original document in the database. At inference time, SBERT converts the new document into a vector and compares it to the vectors in the database. The more similar two vectors are, the more similar their underlying documents.

Dense Passage Retrieval (DPR) is a popular text vectorization method in the context of question answering systems. It is based on a dual-encoder approach and can retrieve documents that are likely to contain the answer to a given question. Consider an example from the DPR paper, where the query “Who is the bad guy in lord of the rings?” is mapped to the sentence “Sala Baker is best known for portraying the villain Sauron in the Lord of the Rings trilogy.”

As the example shows, DPR is able to pair questions and answer passages, even when they do not use the exact same words. The model understands that “bad guy” is a synonym for “villain.” This is what sets Transformer-based methods apart from keyword-matching techniques like TF-IDF and BM25. However, both SBERT and DPR do take longer during indexing time. That is because they run every document through a Transformer model, before storing it in the database alongside its high-dimensional vector representation.

Vectors First: The Power of Vector Databases

Popular document oriented databases like Elasticsearch and OpenSearch let you store and search your vectorized text along with the original documents. This is useful as it allows you to combine keyword-based approaches with the more recent, Transformer-based methods. But with growing data collections and a shift from lexical, keyword-based to semantic search systems, there’s an increased need for databases that are tailored specifically to vectorized data.

Lately, a new generation of databases has emerged to address that need. Vector-optimized databases were designed specifically for storing and searching vector data. Note that “data” doesn’t just mean text — other data types, such as images or videos, can be stored in a vector database as well. These databases are particularly fast at computing similarity between text vectors through optimization for “ nearest neighbor” algorithms.

Vector databases are highly scalable, allowing for quick search even when on millions of documents. Haystack supports FAISS, Milvus, and Weaviate vector databases— and we cannot wait to see what you’ll build with them!

Get Started with Haystack

Haystack is our flagship NLP framework enabling developers to add a variety of modern NLP tasks, such as semantic search, question answering, and summarization to their applications. Compute your own vectors, add them to the database of choice, and start building powerful natural language processing pipelines.

Get started today by checking out our GitHub repository or by joining our community on Discord!

Blog

MCP + Haystack: A Practical Guide for AI Engineers

What is MCP (Model Context Protocol)?

Why MCP and Haystack are a strong combination

Approach 1: Haystack as an MCP client

MCPTool: bind a single MCP tool

MCPToolset: load a whole server’s tools at once

SearchableToolset: scale to large tool catalogs

Approach 2: Haystack as an MCP server using Hayhooks

Getting started

How a pipeline becomes a tool

Connecting your IDE or assistant

Approach 3: Haystack as an MCP server using Haystack Enterprise Platform

Haystack Docs MCP Server

Choosing the right approach

MCP use cases

Conclusion

Context Engineering for Agentic Systems: What Goes Into Your Agent's Mind

Why context is so important for agentic systems

When less is more

The cost dimension

What fills the context window in an agentic system

What the context actually looks like

Building a Haystack agent

Strategies for managing context growth

Delegation to subagents

Improving retrieval quality

Summarisation and compaction

Adding only relevant tools to the context

Offloading notes (scratchpad / working memory)

What’s coming next in this series

Multimodal Search with Gemini Embedding 2 in Haystack

Meet Gemini Embedding 2

Using Gemini Embeddings in Haystack

Multimodal Embeddings

Cross-modal Retrieval

What’s Next

Haystack Ecosystem: One Name, One Product Family, One Look

The Open Source to Enterprise Story of Haystack

The Haystack Logo: What Changed and Why

The Haystack Product Family

What Is Haystack?

What Is Haystack Enterprise Starter?

What Is Haystack Enterprise Platform?

Looking Ahead

Lufthansa Industry Solutions Uses Haystack to Power Enterprise RAG

The Challenge: Connecting Scattered Knowledge

Choosing Haystack: Flexibility Meets Stability

The Technical Architecture: How It All Comes Together

Core Components

1. Frontend and Authentication

2. Ingestion Pipelines

3. Query Services

4. Observability and Monitoring

Measuring Success: From Latency to Feedback Loops

Lessons Learned: Build Quality Early, Iterate Fast

What’s Next

Share Your Story with Us

How TAC Built an Agentic Chatbot with Haystack to Transform Trade Promotions Workflows

The Challenge: Simplifying Complex Workflows

Choosing Haystack: From Exploration to Production

The Technical Architecture: From Pipelines to Agents

Key Components of the Solution

Evaluating the Performance

Lessons Learned & Advice for Other Teams

What’s Next

Share Your Story with Us

Announcing Haystack Enterprise Starter: Best Practices and Support

What’s Inside Haystack Enterprise Starter

What About Open Source?

How Haystack Enterprise Starter Fits into the Bigger Picture

Want to Learn More?

Summarize Hacker News Posts with Haystack & OPEA

1. Introduction & Motivation

2. Prerequisites

3. Building the News Fetcher

4. Integrating the LLM (OPEAGenerator)

5. Crafting the Prompt

6. Building the Pipeline

7. Running the Pipeline

`MCPTool`: bind a single MCP tool

`MCPToolset`: load a whole server’s tools at once

`SearchableToolset`: scale to large tool catalogs