
Integration: Llama Stack

Use generation models served through Llama Stack.

Authors
deepset

Overview

Llama Stack is an open-source framework of composable AI building blocks and unified APIs that standardizes how AI applications are built across different environments.

The LlamaStackChatGenerator lets you use any LLM made available by an inference provider running behind a Llama Stack server. It abstracts away the specifics of the underlying provider, so the same client-side code works across different inference backends. For a list of supported providers and configuration details, refer to the Llama Stack documentation.

To use this chat generator, youโ€™ll need:

  • A running instance of a Llama Stack server (either local or remote)
  • A valid model name supported by your chosen inference provider

Below are example configurations for using the Llama-3.2-3B model:

Ollama as the inference provider:

chat_generator = LlamaStackChatGenerator(model="llama3.2:3b")

vLLM as the inference provider:

chat_generator = LlamaStackChatGenerator(model="meta-llama/Llama-3.2-3B")
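
If your Llama Stack server is not running at the default local address, you can point the generator at it explicitly. The sketch below assumes the generator accepts an api_base_url parameter (mirroring Haystack's OpenAI-compatible chat generators); check the LlamaStackChatGenerator API reference for the exact parameter name and default, and replace the URL with your own deployment's OpenAI-compatible endpoint.

from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

# api_base_url is assumed here; the hostname and port are placeholders
# for your own Llama Stack deployment.
chat_generator = LlamaStackChatGenerator(
    model="llama3.2:3b",
    api_base_url="http://my-llama-stack-host:8321/v1/openai/v1",
)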

Installation

pip install llama-stack-haystack

Usage

Standalone with vLLM inference

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

client = LlamaStackChatGenerator(model="meta-llama/Llama-3.2-3B")
response = client.run(
    [ChatMessage.from_user("What are Agentic Pipelines? Be brief.")]
)
print(response["replies"])

[ChatMessage(_role=<ChatRole.ASSISTANT: 'assistant'>, _content=[TextContent(text='Agentic Pipelines are modular workflows in which LLM-powered agents plan, call tools, and chain steps to complete tasks with minimal human intervention.')], _name=None, _meta={'model': 'meta-llama/Llama-3.2-3B', 'index': 0, 'finish_reason': 'stop', 'usage': {...}})]
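
Like any Haystack chat generator, LlamaStackChatGenerator can also be used inside a pipeline. Below is a minimal sketch that wires it to a ChatPromptBuilder; the template, the query variable name, and the model are illustrative, and the example assumes the vLLM setup from above is running locally.

from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

# Build a prompt from a user template, then send it to the Llama Stack server.
prompt_builder = ChatPromptBuilder(
    template=[ChatMessage.from_user("Answer briefly: {{query}}")],
    required_variables=["query"],
)
llm = LlamaStackChatGenerator(model="meta-llama/Llama-3.2-3B")

pipeline = Pipeline()
pipeline.add_component("prompt_builder", prompt_builder)
pipeline.add_component("llm", llm)
pipeline.connect("prompt_builder.prompt", "llm.messages")

result = pipeline.run({"prompt_builder": {"query": "What are Agentic Pipelines?"}})
print(result["llm"]["replies"][0].text)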

LlamaStackChatGenerator also supports streaming responses if you pass a streaming callback:

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator
from haystack.components.generators.utils import print_streaming_chunk


client = LlamaStackChatGenerator(
    model="meta-llama/Llama-3.2-3B",
    streaming_callback=print_streaming_chunk,
)

response = client.run([ChatMessage.from_user("Summarize RAG in two lines.")])

print(response)
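
Instead of the built-in print_streaming_chunk, you can supply your own callback; it receives each StreamingChunk as it arrives. The sketch below is illustrative (the collect-into-a-list behavior is just one possible use), assuming the same vLLM-backed model as above.

from haystack.dataclasses import ChatMessage, StreamingChunk
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

collected_chunks = []

def collect_chunk(chunk: StreamingChunk) -> None:
    # Print tokens as they arrive and keep a copy for later inspection.
    print(chunk.content, end="", flush=True)
    collected_chunks.append(chunk.content)

client = LlamaStackChatGenerator(
    model="meta-llama/Llama-3.2-3B",
    streaming_callback=collect_chunk,
)
client.run([ChatMessage.from_user("Summarize RAG in two lines.")])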

License

llama-stack-haystack is distributed under the terms of the Apache-2.0 license.