Integration: vLLM
Use the vLLM inference engine with Haystack
Overview
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It is an open-source project that lets you serve open models in production when GPU resources are available.
vLLM serves models behind an OpenAI-compatible HTTP server and supports generative, embedding, and ranking models. The vllm-haystack integration provides dedicated Haystack components that connect to a running vLLM server.
Installation
Install vLLM following the official instructions. For production use cases, other deployment options are available, including Docker.
Then install the Haystack integration:
pip install vllm-haystack
Components
This integration introduces the following components:
- VLLMChatGenerator: A component for chat completion using generative models served by vLLM. Supports streaming, tool calling, reasoning, and structured outputs.
- VLLMTextEmbedder: A component for embedding a single string (e.g., a query) using an embedding model served by vLLM.
- VLLMDocumentEmbedder: A component for embedding a list of Document objects using an embedding model served by vLLM.
- VLLMRanker: A component for reranking documents using a ranking model (cross-encoder or late interaction) served by vLLM.
Usage
Serving a model with vLLM
vllm serve launches an OpenAI-compatible server. For example, to serve a small generative model with reasoning and tool-calling enabled:
vllm serve "Qwen/Qwen3-0.6B" --port 8000 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser hermes
Embedding and ranking models are served the same way. Just point vllm serve at the relevant model (e.g., sentence-transformers/all-MiniLM-L6-v2 or BAAI/bge-reranker-base).
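For instance, the models used in the examples below could be launched like this (a sketch; run one model per server process and adjust the port to your setup):

# embedding model used by the embedder examples below
vllm serve "sentence-transformers/all-MiniLM-L6-v2" --port 8000

# ranking model used by the VLLMRanker example below
vllm serve "BAAI/bge-reranker-base" --port 8000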
VLLMChatGenerator
from haystack_integrations.components.generators.vllm import VLLMChatGenerator
from haystack.dataclasses import ChatMessage

llm = VLLMChatGenerator(
    model="Qwen/Qwen3-0.6B",
    api_base_url="http://localhost:8000/v1",
    # enable_thinking is a Qwen3 chat-template option that turns on a reasoning trace
    generation_kwargs={"extra_body": {"chat_template_kwargs": {"enable_thinking": True}}},
)

response = llm.run(messages=[ChatMessage.from_user("Write Python code to reverse a string.")])
print(response["replies"][0].text)

# When reasoning is enabled, the reasoning trace is available separately:
print(response["replies"][0].reasoning)
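Streaming works as well. As a minimal sketch, assuming VLLMChatGenerator accepts the standard Haystack streaming_callback parameter:

from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.vllm import VLLMChatGenerator

llm = VLLMChatGenerator(
    model="Qwen/Qwen3-0.6B",
    api_base_url="http://localhost:8000/v1",
    streaming_callback=print_streaming_chunk,  # print each chunk as it arrives
)
llm.run(messages=[ChatMessage.from_user("Summarize what vLLM does in one sentence.")])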
VLLMChatGenerator also supports structured outputs via response_format and tool calling, making it a drop-in chat generator for Haystack Agent pipelines.
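For example, here is a minimal tool-calling sketch. It assumes VLLMChatGenerator takes a tools parameter like Haystack's other OpenAI-compatible chat generators (the serve command above already enables the tool-call parser); the get_weather tool is a made-up example:

from haystack.dataclasses import ChatMessage
from haystack.tools import Tool
from haystack_integrations.components.generators.vllm import VLLMChatGenerator

def get_weather(city: str) -> str:
    """Hypothetical example tool: returns a canned weather report."""
    return f"It is sunny in {city}."

weather_tool = Tool(
    name="get_weather",
    description="Get the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    function=get_weather,
)

llm = VLLMChatGenerator(
    model="Qwen/Qwen3-0.6B",
    api_base_url="http://localhost:8000/v1",
    tools=[weather_tool],
)
reply = llm.run(messages=[ChatMessage.from_user("What is the weather in Berlin?")])["replies"][0]
# If the model decided to call the tool, the requested calls are on the reply
print(reply.tool_calls)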
VLLMTextEmbedder and VLLMDocumentEmbedder
Use the two embedders together to build a simple semantic retrieval pipeline:
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.embedders.vllm import (
    VLLMDocumentEmbedder,
    VLLMTextEmbedder,
)
document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
docs = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="My name is Luca and I live in Milan"),
    Document(content="Germany has many big cities"),
    Document(content="Italy is a country in Europe"),
]
document_embedder = VLLMDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
    api_base_url="http://localhost:8000/v1",
)
# Embed all documents via the vLLM server, then write them to the store
document_store.write_documents(document_embedder.run(docs)["documents"])
query_pipeline = Pipeline()
query_pipeline.add_component(
    "text_embedder",
    VLLMTextEmbedder(
        model="sentence-transformers/all-MiniLM-L6-v2",
        api_base_url="http://localhost:8000/v1",
    ),
)
query_pipeline.add_component(
    "retriever",
    InMemoryEmbeddingRetriever(document_store=document_store, top_k=2),
)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
result = query_pipeline.run({"text_embedder": {"text": "Who lives in Berlin?"}})
for doc in result["retriever"]["documents"]:
    print(doc.score, doc.content)
# 0.668... My name is Wolfgang and I live in Berlin
# 0.602... Germany has many big cities
VLLMRanker
Pair VLLMRanker with a fast first-stage retriever (e.g., BM25) to rerank candidates by relevance to the query:
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.rankers.vllm import VLLMRanker
docs = [
    Document(content="Paris is the capital city of France"),
    Document(content="Lyon is a major city in France known for cuisine"),
    Document(content="Toulouse is a large city in France known for aerospace"),
    Document(content="Marseille is a port city in southern France"),
    Document(content="France has a rich history and culture"),
    Document(content="Berlin is the capital of Germany"),
    Document(content="Madrid is the capital city of Spain"),
]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)
# First stage: BM25 cheaply narrows the corpus to 10 candidates
retriever = InMemoryBM25Retriever(document_store=document_store, top_k=10)
# Second stage: the vLLM-served reranker keeps the 3 most relevant documents
ranker = VLLMRanker(
    model="BAAI/bge-reranker-base",
    api_base_url="http://localhost:8000/v1",
    top_k=3,
)
pipeline = Pipeline()
pipeline.add_component("retriever", retriever)
pipeline.add_component("ranker", ranker)
pipeline.connect("retriever.documents", "ranker.documents")
query = "france cities"
result = pipeline.run({"retriever": {"query": query}, "ranker": {"query": query}})
for doc in result["ranker"]["documents"]:
    print(doc.score, doc.content)
# 0.986... Paris is the capital city of France
# 0.914... Lyon is a major city in France known for cuisine
# 0.858... Toulouse is a large city in France known for aerospace
End-to-end example
For a complete walkthrough covering generative, embedding, and ranking models, including a tool-calling agent and a retrieval + reranking pipeline, see the vLLM + Haystack cookbook notebook.
License
vllm-haystack is distributed under the terms of the Apache-2.0 license.
