
Integration: vLLM Invocation Layer
Use the vLLM inference engine with Haystack
Use vLLM in your Haystack pipelines to run fast, self-hosted LLMs.
Overview
vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. It is an open-source project that lets you serve open models in production when you have GPU resources available.
vLLM can be deployed as a server that implements the OpenAI API protocol, and integration with Haystack comes out of the box.
This allows vLLM to be used with the OpenAIGenerator and OpenAIChatGenerator components in Haystack.
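As a minimal sketch of this compatibility, the snippet below points OpenAIGenerator at a locally running vLLM server; the endpoint URL, model name, and placeholder API key are illustrative assumptions to adapt to your own deployment.
from haystack.components.generators import OpenAIGenerator
from haystack.utils import Secret

# Point the standard OpenAI generator at a local vLLM server (assumed to run on port 8000).
generator = OpenAIGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # placeholder key needed for OpenAI-client compatibility; vLLM does not check it by default
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base_url="http://localhost:8000/v1",
)
result = generator.run(prompt="What is vLLM?")
print(result["replies"][0])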
For an end-to-end example of vLLM + Haystack, see this notebook.
Installation
vLLM should be installed.
- You can use pip: pip install vllm (more information in the vLLM documentation)
- For production use cases, there are many other options, including Docker (docs)
Usage
You first need to run a vLLM OpenAI-compatible server. You can do that using Python or Docker.
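For example, with the pip installation the server can be launched from the command line; this is only a sketch, and the exact entrypoint and flags can vary between vLLM versions, so check the vLLM documentation for your release.
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.1
By default the server listens on port 8000, so clients use http://localhost:8000/v1 as the base URL, as in the example below.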
Then, you can use the OpenAIGenerator and OpenAIChatGenerator components in Haystack to query the vLLM server.
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret
generator = OpenAIChatGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # for compatibility with the OpenAI API, a placeholder api_key is needed
    model="mistralai/Mistral-7B-Instruct-v0.1",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512}
)
response = generator.run(messages=[ChatMessage.from_user("Hi. Can you help me plan my next trip to Italy?")])
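The run() method returns a dictionary whose "replies" key holds the generated ChatMessage objects, so you can inspect the answer with, for example:
print(response["replies"][0])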