Use the ⚡ vLLM inference engine in Haystack 2.x

Notebook by Stefano Fiorucci

This notebook shows how to use the vLLM inference engine in Haystack 2.x.

Install vLLM + Haystack

  • we install vLLM using pip (docs)
  • for production use cases, there are many other options, including Docker (docs)
# we check that CUDA is >=12.1 (https://docs.vllm.ai/en/latest/getting_started/installation.html#install-with-pip)
! nvcc --version
! pip install vllm haystack-ai

Run a vLLM OpenAI-compatible server in Colab

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API. Read more in the docs.

In Colab, we start the OpenAI-compatible server using Python. For environments that support Docker, we can run the server using Docker (docs).

Significant parameters:

  • model: TheBloke/notus-7B-v1-AWQ is the AWQ quantized version of a good LLM by Argilla. Several model architectures are supported; models are automatically downloaded from Hugging Face as needed. For a comprehensive list of the supported models, see the docs.

  • quantization: awq. AWQ is a quantization method that allows LLMs to run fast even when GPU resources are limited. For an introduction, see this simple blog post on quantization techniques.

  • max_model_len: we specify a maximum context length, i.e. the maximum number of tokens per request (prompt + response). Without this limit, the model does not fit on the Colab GPU and we get an OOM error.

# we prepend "nohup" and append "&" to make the Colab cell run in the background
! nohup python -m vllm.entrypoints.openai.api_server \
                  --model TheBloke/notus-7B-v1-AWQ \
                  --quantization awq \
                  --max-model-len 2048 \
                  > vllm.log &
# we check the logs until the server has been started correctly
!while ! grep -q "Application startup complete" vllm.log; do tail -n 1 vllm.log; sleep 5; done
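Optionally, as a quick sanity check before wiring the server into Haystack, we can query the OpenAI-compatible /v1/models endpoint (assuming the server listens on the default port 8000) and verify that our model is being served:

# quick sanity check: list the models served by the vLLM server
# (assumes the server is listening on the default port 8000)
import requests

models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])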

Chat with the model using OpenAIChatGenerator

Once we have launched the OpenAI-compatible vLLM server, we can simply initialize an OpenAIChatGenerator pointing to the vLLM server URL and start chatting!

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = OpenAIChatGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # for compatibility with the OpenAI API, a placeholder api_key is needed
    model="TheBloke/notus-7B-v1-AWQ",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512}
)
messages = []

# simple interactive chat loop: the conversation history is kept in `messages`
while True:
  msg = input("Enter your message or Q to exit\n🧑 ")
  if msg == "Q":
    break
  messages.append(ChatMessage.from_user(msg))
  response = generator.run(messages=messages)
  assistant_resp = response["replies"][0]
  print("🤖 " + assistant_resp.content)
  messages.append(assistant_resp)
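The same generator can also be used inside a Haystack Pipeline. Here is a minimal sketch (assuming a haystack-ai release that includes ChatPromptBuilder) that connects a prompt builder to the generator backed by the local vLLM server:

from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

# the template is rendered by ChatPromptBuilder using Jinja2 variables
template = [ChatMessage.from_user("Explain {{topic}} in one short paragraph.")]

pipeline = Pipeline()
pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
pipeline.add_component(
    "llm",
    OpenAIChatGenerator(
        api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # placeholder, as above
        model="TheBloke/notus-7B-v1-AWQ",
        api_base_url="http://localhost:8000/v1",
        generation_kwargs={"max_tokens": 256},
    ),
)
pipeline.connect("prompt_builder.prompt", "llm.messages")

result = pipeline.run({"prompt_builder": {"template_variables": {"topic": "AWQ quantization"}}})
print(result["llm"]["replies"][0].content)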