
Use the ⚡ vLLM inference engine with Haystack


      

vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.

This notebook shows how to use it with Haystack.

Install vLLM + Haystack integration

  • we install vLLM using uv (installation docs). For production use cases, there are other options, including Docker (docs).
  • we also install vllm-haystack, the vLLM/Haystack integration.
! uv pip install vllm vllm-haystack nest-asyncio python-weather

Serving and using Generative Language Models

vLLM supports most open-weight Generative Language Models.

vllm serve launches an OpenAI-compatible server.

In the following cell we spin up a server running on localhost:8000.

We serve Qwen/Qwen3-0.6B, a very small but capable Language Model with reasoning and tool-calling capabilities. Some parameters, like --reasoning-parser and --tool-call-parser, follow the values indicated in the Qwen documentation.

We also set some parameters specific for our Colab environment:

  • --enforce-eager: disables CUDA graph construction. This negatively impacts performance but reduces memory requirements and server start time.
  • --gpu-memory-utilization 0.5: limits GPU utilization. This is needed since later we’ll run two other servers in this notebook for embedding and ranking models.
! setsid nohup vllm serve "Qwen/Qwen3-0.6B" --port 8000 \
            --reasoning-parser qwen3 \
            --max-model-len 1024 \
            --enforce-eager \
            --dtype half \
            --enable-auto-tool-choice \
            --tool-call-parser hermes \
            --gpu-memory-utilization 0.5 > vllm.log 2>&1 < /dev/null &
! until curl -s http://localhost:8000/health > /dev/null; do sleep 0.5; done; echo "ready"
ready

Chat with the model

Once we have launched the vLLM server, we can simply initialize a VLLMChatGenerator (docs) pointing to the vLLM server URL and start chatting!

Here we disable reasoning (Qwen docs) and set a streaming_callback to stream the response.

from haystack_integrations.components.generators.vllm import VLLMChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.components.generators.utils import print_streaming_chunk

llm = VLLMChatGenerator(
    model="Qwen/Qwen3-0.6B",
    generation_kwargs={"extra_body": {"chat_template_kwargs": {"enable_thinking": False}}},
    streaming_callback=print_streaming_chunk,
)
message = ChatMessage.from_user("Write Python code to print the reverse of a string using a for loop.")
response = llm.run(messages=[message])
[ASSISTANT]
Here's a Python code snippet that uses a **for loop** to print the **reverse** of a string:

```python
original_string = input("Enter a string: ")

for i in range(len(original_string) - 1, -1, -1):
    print(original_string[i], end=" ")
print()
```

### Explanation:
- The loop runs from the last character (`i = len(original_string) - 1`) down to the first character (`i = 0`).
- It prints each character in the original string in reverse order.
- A space is added between each character for readability.
- Finally, a print statement adds a newline character.

Not bad, but given “hello” as input, this would print “o l l e h”. Let’s see if reasoning helps.

With reasoning

We simply enable reasoning (Qwen docs).

from haystack_integrations.components.generators.vllm import VLLMChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.components.generators.utils import print_streaming_chunk

llm = VLLMChatGenerator(
    model="Qwen/Qwen3-0.6B",
    generation_kwargs={"extra_body": {"chat_template_kwargs": {"enable_thinking": True}}},
    streaming_callback=print_streaming_chunk,
)
message = ChatMessage.from_user("Write Python code to print the reverse of a string using a for loop.")
response = llm.run(messages=[message])
[REASONING]

Okay, I need to write Python code that prints the reverse of a string using a for loop. Let me think about how to approach this.

First, I remember that strings in Python are immutable, so I can't modify them directly. So the first step is to convert the string into a list of characters. That way, I can reverse the order of the list and then join them back into a string.

So, for example, if the input is "hello", converting it to a list would give ["h", "e", "l", "l", "o"]. Reversing this list would give ["o", "l", "l", "e", "h"], and then joining them back with an empty string would give "olleh".

But wait, how to reverse a list in Python? Oh right, using slicing. The slice would be [::-1], which reverses the list. So the code would be something like:

s = input("Enter a string: ")
chars = list(s)
chars.reverse()
reversed_str = ''.join(chars)
print(reversed_str)

Wait, but in Python, when you reverse a list, the order is reversed. So using chars.reverse() reverses the list, and then joining them would give the reversed string. That should work.

Testing this with "hello" would produce "olleh". Let me check if there's any edge cases. What if the string is empty? Well, the code would handle it, but maybe the input is supposed to be non-empty. The problem says to print the reverse, so maybe it's okay.

Another thing to consider: what if the string has non-alphanumeric characters? Well, the code converts everything to lowercase, so that's handled. So the code should work regardless.

So putting it all together, the steps are: take the input, split into characters, reverse the list, join back, print.

I think that's all. Let me write the code accordingly.
[ASSISTANT]


Here's a Python solution that reverses a string using a for loop:

```python
s = input("Enter a string: ")
chars = list(s)
for i in range(len(chars) - 1, -1, -1):
    print(chars[i], end="")
print()
```

**Explanation:**

1. **Reading Input:** The input string is read from the user using `input()`.
2. **Converting to List:** The string is converted into a list of characters using `list(s)`.
3. **Reversing the List:** Using a for loop with a range that starts from the last character and goes backward to the first, the list is reversed.
4. **Printing the Result:** Each character is printed in reverse order, and the loop ends with a print statement to ensure the output is printed.

This approach efficiently reverses the string using a single loop and handles all edge cases, including empty strings.

Better!
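Indeed, with reasoning enabled the model used `end=""` instead of `end=" "`, so the characters are joined without spaces. For reference, here is the same loop wrapped in a small helper (a hypothetical `reverse_string` function, added here only to make the behavior easy to verify; it is not part of the model's output):

```python
def reverse_string(s: str) -> str:
    """Reverse a string with an explicit for loop, mirroring the model's answer."""
    out = ""
    # Walk the indices from the last character down to the first
    for i in range(len(s) - 1, -1, -1):
        out += s[i]
    return out

print(reverse_string("hello"))  # olleh
```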

We can also easily extract reasoning as follows:

response["replies"][0].reasoning
ReasoningContent(reasoning_text='\nOkay, I need to write Python code that prints the reverse of a string using a for loop. Let me think about how to approach this.\n\nFirst, I remember that strings in Python are immutable, so I can\'t modify them directly. So the first step is to convert the string into a list of characters. That way, I can reverse the order of the list and then join them back into a string.\n\nSo, for example, if the input is "hello", converting it to a list would give ["h", "e", "l", "l", "o"]. Reversing this list would give ["o", "l", "l", "e", "h"], and then joining them back with an empty string would give "olleh".\n\nBut wait, how to reverse a list in Python? Oh right, using slicing. The slice would be [::-1], which reverses the list. So the code would be something like:\n\ns = input("Enter a string: ")\nchars = list(s)\nchars.reverse()\nreversed_str = \'\'.join(chars)\nprint(reversed_str)\n\nWait, but in Python, when you reverse a list, the order is reversed. So using chars.reverse() reverses the list, and then joining them would give the reversed string. That should work.\n\nTesting this with "hello" would produce "olleh". Let me check if there\'s any edge cases. What if the string is empty? Well, the code would handle it, but maybe the input is supposed to be non-empty. The problem says to print the reverse, so maybe it\'s okay.\n\nAnother thing to consider: what if the string has non-alphanumeric characters? Well, the code converts everything to lowercase, so that\'s handled. So the code should work regardless.\n\nSo putting it all together, the steps are: take the input, split into characters, reverse the list, join back, print.\n\nI think that\'s all. Let me write the code accordingly.\n', extra={})

Structured outputs

This model also supports structured outputs.

Let’s try it with reasoning enabled and temperature set to 0. These settings should help produce a more reliable output.

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "capital_info",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"capital": {"type": "string"}, "population": {"type": "number"}},
            "required": ["capital", "population"],
        },
    },
}

llm = VLLMChatGenerator(
    model="Qwen/Qwen3-0.6B",
    generation_kwargs={"extra_body": {"chat_template_kwargs": {"enable_thinking": True}},
                       "response_format": response_format, "temperature": 0.0},
)
result = llm.run(
    [ChatMessage.from_user("What's the capital of France and its population? Respond in JSON.")]
)
import json

json.loads(result["replies"][0].text)
{'capital': 'Paris', 'population': 67000000}
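Since the server constrains generation to the schema, the parsed reply can be validated with plain type checks before use; a minimal sketch (the `parse_capital_info` helper is hypothetical, for illustration only):

```python
import json

def parse_capital_info(text: str) -> dict:
    """Parse the model's JSON reply and check the fields the schema requires."""
    data = json.loads(text)
    expected = {"capital": str, "population": (int, float)}
    for field, typ in expected.items():
        if field not in data or not isinstance(data[field], typ):
            raise ValueError(f"missing or mistyped field: {field}")
    return data

info = parse_capital_info('{"capital": "Paris", "population": 67000000}')
print(info["capital"])  # Paris
```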

Tool-calling Agent

Now let’s build a simple agent with a weather tool. The example below requires handling parallel tool calls, which our model supports.

import asyncio
from typing import Annotated

from haystack.tools import tool
from haystack.components.agents import Agent

from haystack.dataclasses import ChatMessage
import python_weather

# only needed in Jupyter notebooks where there is an event loop running
import nest_asyncio
nest_asyncio.apply()


@tool
def get_weather(location: Annotated[str, "The location to get the weather for"]) -> dict:
    """A function to get the weather for a given location"""
    async def _fetch_weather():
        async with python_weather.Client(unit=python_weather.METRIC) as client:
            weather = await client.get(location)
            return {
                "description": weather.description,
                "temperature": weather.temperature,
                "humidity": weather.humidity,
                "precipitation": weather.precipitation,
                "wind_speed": weather.wind_speed,
                "wind_direction": weather.wind_direction
            }

    return asyncio.run(_fetch_weather())


chat_generator = VLLMChatGenerator(
    model="Qwen/Qwen3-0.6B",
    generation_kwargs={"extra_body": {"chat_template_kwargs": {"enable_thinking": True}}},
)

agent = Agent(chat_generator=chat_generator, streaming_callback=print_streaming_chunk, tools=[get_weather])
message = ChatMessage.from_user("What's the weather like in Marrakesh? And in Stockholm? Where should I go if I like the heat?")
response = agent.run(messages=[message])
[REASONING]

Okay, the user is asking about the weather in Marrakesh and Stockholm, and they want to know where to go if they like the heat. Let me start by recalling the tools provided. There's a function called get_weather that takes a location as a parameter. So I need to call this function twice, once for each city. First, Marrakesh, then Stockholm. 

Wait, the user mentioned both cities. So I should make two separate tool calls. The first one for Marrakesh, and the second for Stockholm. After that, the user wants to know where to go if they like the heat. But the tools only provide weather data. The function doesn't have a parameter for preferences, so I can't make a third tool call here. 

I need to check if there's any other function, but looking at the tools provided, only get_weather is available. So the answer should include two tool calls, each for Marrakesh and Stockholm, and then mention that the heat is a factor to consider for the destination.
[ASSISTANT]


[TOOL CALL]
Tool: get_weather 
Arguments: {"location": "Marrakesh"}

[TOOL CALL]
Tool: get_weather 
Arguments: {"location": "Stockholm"}

[TOOL RESULT]
{'description': 'Partly cloudy', 'temperature': 20, 'humidity': 68, 'precipitation': 0.0, 'wind_speed': 13, 'wind_direction': WindDirection.NORTHEAST}

[TOOL RESULT]
{'description': 'Sunny', 'temperature': 10, 'humidity': 46, 'precipitation': 0.0, 'wind_speed': 18, 'wind_direction': WindDirection.NORTHWEST}

[REASONING]

Okay, let me process the user's question. They asked for the weather in Marrakesh and Stockholm, and where to go if they like the heat. 

First, I called the get_weather function twice with Marrakesh and Stockholm. The responses came back with the weather details for each city. 

Now, I need to present this information clearly. For Marrakesh, the weather is partly cloudy with a temperature of 20°C, humidity 68%, and wind speed 13 km/h. In Stockholm, it's sunny with 10°C, humidity 46%, and wind 18 km/h. 

The user also mentioned they want to know where to go if they like the heat. Since Marrakesh has a higher temperature (20°C) compared to Stockholm (10°C), I should highlight Marrakesh as the destination they'd prefer. 

I should structure the answer to first state each city's weather, then mention the recommendation based on the temperature. Keep it concise and friendly.
[ASSISTANT]


Here's the weather information for Marrakesh and Stockholm:

- **Marrakesh**:  
  - **Weather**: Partly cloudy  
  - **Temperature**: 20°C  
  - **Humidity**: 68%  
  - **Wind**: 13 km/h (NORTHEAST)  

- **Stockholm**:  
  - **Weather**: Sunny  
  - **Temperature**: 10°C  
  - **Humidity**: 46%  
  - **Wind**: 18 km/h (NORTHWEST)  

If you like the heat, Marrakesh is the perfect destination! 🌞

Well done!

Serving and using Embedding Models

vLLM also supports embedding models, used for computing semantic vectors from text.

We serve the classic sentence-transformers/all-MiniLM-L6-v2 embedding model, on port 8001.

--enforce-eager and --gpu-memory-utilization are only used to limit memory utilization in Colab.

! setsid nohup vllm serve "sentence-transformers/all-MiniLM-L6-v2" --port 8001 \
            --enforce-eager \
            --gpu-memory-utilization 0.1 > vllm_embedding.log 2>&1 < /dev/null &
! until curl -s http://localhost:8001/health > /dev/null; do sleep 0.5; done; echo "ready"
ready

Let’s try both VLLMTextEmbedder (docs) and VLLMDocumentEmbedder (docs) and create a simple retrieval pipeline.

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.embedders.vllm import (
    VLLMDocumentEmbedder,
    VLLMTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

document_store = InMemoryDocumentStore(
    embedding_similarity_function="cosine"
)

docs = [
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="My name is Luca and I live in Milan"),
    Document(content="Italy is a country in Europe"),
]

document_embedder = VLLMDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
    api_base_url="http://localhost:8001/v1",
)

documents_with_embeddings = document_embedder.run(docs)["documents"]
document_store.write_documents(documents_with_embeddings)

query_pipeline = Pipeline()

query_pipeline.add_component(
    "text_embedder",
    VLLMTextEmbedder(
        model="sentence-transformers/all-MiniLM-L6-v2",
        api_base_url="http://localhost:8001/v1",
    ),
)

query_pipeline.add_component(
    "retriever",
    InMemoryEmbeddingRetriever(
        document_store=document_store,
        top_k=3,
    ),
)

query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "Who lives in Berlin?"

retrieved_docs = query_pipeline.run(
    {"text_embedder": {"text": query}}
)["retriever"]["documents"]

print()

for doc in retrieved_docs:
    print(doc.content)
    print(f"score: {doc.score}")
    print("---")
Calculating embeddings: 1it [00:03,  3.29s/it]



My name is Wolfgang and I live in Berlin
score: 0.6686882275912469
---
Germany has many big cities
score: 0.6028554173106674
---
My name is Luca and I live in Milan
score: 0.42261355074687296
---
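The scores above are cosine similarities between the query embedding and each document embedding. As a quick reminder of what the retriever computes, here is the formula in plain Python (illustrative only; the actual computation happens inside InMemoryEmbeddingRetriever):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```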

Serving and using Ranking Models

vLLM also supports ranking models, generally used after keyword or embedding retrieval to reorder the retrieved documents by relevance to the query. vLLM supports different ranking models, including cross-encoders and late interaction models.

We serve the BAAI/bge-reranker-base cross-encoder model, on port 8002.

--enforce-eager and --gpu-memory-utilization are only used to limit memory utilization in Colab.

! setsid nohup vllm serve "BAAI/bge-reranker-base" --port 8002 \
            --enforce-eager \
            --gpu-memory-utilization 0.1 > vllm_ranking.log 2>&1 < /dev/null &
! until curl -s http://localhost:8002/health > /dev/null; do sleep 0.5; done; echo "ready"
ready

Let’s build a two-stage retrieval pipeline:

from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.rankers.vllm import VLLMRanker

docs = [
    Document(content="France is a country in Western Europe"),
    Document(content="Berlin is the capital of Germany"),
    Document(content="Paris is the capital city of France"),
    Document(content="The economy of France is one of the largest in Europe"),
    Document(content="Madrid is the capital city of Spain"),
    Document(content="Lyon is a major city in France known for cuisine"),
    Document(content="French cuisine is famous worldwide"),
    Document(content="Rome is the capital of Italy"),
    Document(content="Marseille is a port city in southern France"),
    Document(content="Tourism in France attracts millions every year"),
    Document(content="Vienna is the capital of Austria"),
    Document(content="Toulouse is a large city in France known for aerospace"),
    Document(content="The president of France lives in Paris"),
    Document(content="Barcelona is a major city in Spain"),
    Document(content="France has a rich history and culture"),
]
document_store = InMemoryDocumentStore()
document_store.write_documents(docs)

retriever = InMemoryBM25Retriever(document_store=document_store, top_k=10)
ranker = VLLMRanker(model="BAAI/bge-reranker-base", api_base_url="http://localhost:8002/v1", top_k=5)

document_ranker_pipeline = Pipeline()
document_ranker_pipeline.add_component(instance=retriever, name="retriever")
document_ranker_pipeline.add_component(instance=ranker, name="ranker")

document_ranker_pipeline.connect("retriever.documents", "ranker.documents")

query = "france cities"
ranked_docs = document_ranker_pipeline.run(
    data={
        "retriever": {"query": query},
        "ranker": {"query": query},
    },
)["ranker"]["documents"]

print()
for doc in ranked_docs:
    print(doc.content)
    print(f"score: {doc.score}")
    print("---")
Paris is the capital city of France
score: 0.9862945675849915
---
Lyon is a major city in France known for cuisine
score: 0.9146546721458435
---
Toulouse is a large city in France known for aerospace
score: 0.8589105010032654
---
Marseille is a port city in southern France
score: 0.8240598440170288
---
France is a country in Western Europe
score: 0.3078511953353882
---
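Under the hood, VLLMRanker calls the server's rerank endpoint. For reference, here is a raw HTTP sketch of an equivalent request; the /v1/rerank path and payload shape follow vLLM's Cohere-style rerank API, but check your vLLM version's docs before relying on them:

```python
import json
import urllib.request

def rerank_request(query: str, documents: list[str], base_url: str = "http://localhost:8002") -> dict:
    """Send a rerank request to a running vLLM server and return the parsed JSON response."""
    payload = {
        "model": "BAAI/bge-reranker-base",
        "query": query,
        "documents": documents,
    }
    req = urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Requires the ranking server from above to be running:
# rerank_request("france cities", ["Paris is the capital city of France", "Rome is the capital of Italy"])
```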

Nice work! 🇫🇷

In this notebook, we explored serving Generative Language Models, Embedding Models and Ranking Models via vLLM and performing inference through Haystack.

For more information, check the vLLM and Haystack documentation.

Notebook by Stefano Fiorucci