Use the ⚡ vLLM inference engine in Haystack 2.x

Notebook by Stefano Fiorucci

This notebook shows how to use the vLLM inference engine in Haystack 2.x.

Install vLLM + Haystack

  • we install vLLM using pip (docs)
  • for production use cases, there are many other options, including Docker (docs)
# we check that CUDA is >=12.1 (https://docs.vllm.ai/en/latest/getting_started/installation.html#install-with-pip)
! nvcc --version
! pip install vllm haystack-ai

Run a vLLM OpenAI-compatible server in Colab

vLLM can be deployed as a server that implements the OpenAI API protocol. This allows vLLM to be used as a drop-in replacement for applications using the OpenAI API. Read more in the docs.

In Colab, we start the OpenAI-compatible server using Python. For environments that support Docker, we can run the server using Docker (docs).

Significant parameters:

  • model: TheBloke/notus-7B-v1-AWQ is the AWQ quantized version of a good LLM by Argilla. Several model architectures are supported; models are automatically downloaded from Hugging Face as needed. For a comprehensive list of the supported models, see the docs.

  • quantization: awq. AWQ is a quantization method that allows LLMs to run fast even when GPU resources are limited. For an introduction, see this simple blog post on quantization techniques.

  • max_model_len: we specify a maximum context length, i.e. the maximum number of tokens per request (prompt + response). Without this limit, the model does not fit on the Colab GPU and we get an OOM error.

# we prepend "nohup" and append "&" to make the Colab cell run in the background
! nohup python -m vllm.entrypoints.openai.api_server \
                  --model TheBloke/notus-7B-v1-AWQ \
                  --quantization awq \
                  --max-model-len 2048 \
                  > vllm.log &
# we check the logs until the server has been started correctly
!while ! grep -q "Application startup complete" vllm.log; do tail -n 1 vllm.log; sleep 5; done
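Optionally, as a quick sanity check before wiring the server into Haystack, we can query the OpenAI-compatible /v1/models endpoint (assuming the server listens on the default port 8000) and verify that our model is being served:

# quick sanity check: list the models served by the vLLM server
# (assumes the server is listening on the default port 8000)
import requests

models = requests.get("http://localhost:8000/v1/models").json()
print([m["id"] for m in models["data"]])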

Chat with the model using OpenAIChatGenerator

Once we have launched the OpenAI-compatible vLLM server, we can simply initialize an OpenAIChatGenerator pointing to the vLLM server URL and start chatting!

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

generator = OpenAIChatGenerator(
    api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # for compatibility with the OpenAI API, a placeholder api_key is needed
    model="TheBloke/notus-7B-v1-AWQ",
    api_base_url="http://localhost:8000/v1",
    generation_kwargs={"max_tokens": 512}
)
messages = []

# simple interactive chat loop: the conversation history is kept in `messages`
while True:
  msg = input("Enter your message or Q to exit\n🧑 ")
  if msg == "Q":
    break
  messages.append(ChatMessage.from_user(msg))
  response = generator.run(messages=messages)
  assistant_resp = response["replies"][0]
  print("🤖 " + assistant_resp.content)
  messages.append(assistant_resp)
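The same generator can also be used inside a Haystack Pipeline. Here is a minimal sketch (assuming a haystack-ai release that includes ChatPromptBuilder) that connects a prompt builder to the generator backed by the local vLLM server:

from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.utils import Secret

# the template is rendered by ChatPromptBuilder using Jinja2 variables
template = [ChatMessage.from_user("Explain {{topic}} in one short paragraph.")]

pipeline = Pipeline()
pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
pipeline.add_component(
    "llm",
    OpenAIChatGenerator(
        api_key=Secret.from_token("VLLM-PLACEHOLDER-API-KEY"),  # placeholder, as above
        model="TheBloke/notus-7B-v1-AWQ",
        api_base_url="http://localhost:8000/v1",
        generation_kwargs={"max_tokens": 256},
    ),
)
pipeline.connect("prompt_builder.prompt", "llm.messages")

result = pipeline.run({"prompt_builder": {"template_variables": {"topic": "AWQ quantization"}}})
print(result["llm"]["replies"][0].content)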