Maintained by deepset

Integration: Hugging Face API

Use models through Hugging Face APIs: Inference Providers, Inference Endpoints, TGI, and TEI

Authors

deepset

Overview
Installation
Usage

Overview

With this integration, you can use models through Hugging Face APIs:

Serverless Inference API (Inference Providers): access many models from different providers through a unified API.
Inference Endpoints: deploy models on dedicated, fully managed infrastructure.
Self-hosted Text Generation Inference (TGI) and Text Embeddings Inference (TEI) servers.

Haystack supports Hugging Face models in other ways too:

Hugging Face Transformers for local models (LLMs, extractive QA, classification, NER)
Sentence Transformers for local embedding and ranking models
Optimum for high-performance inference with ONNX Runtime

Installation

pip install huggingface-api-haystack

Usage

Unless you are using a self-hosted TGI/TEI server, set your Hugging Face token as the HF_API_TOKEN or HF_TOKEN environment variable.
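Before constructing any component, it can help to check that a token is actually visible to your process. The following is a minimal sketch using only the two variable names mentioned above. The helper name resolve_hf_token and the lookup order (HF_API_TOKEN first) are illustrative assumptions, not part of the integration.

```python
import os
from typing import Mapping, Optional


def resolve_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty token found, or None.

    Hypothetical helper: the lookup order below is an assumption made
    for illustration, not documented behavior of the integration.
    """
    env = os.environ if env is None else env
    for name in ("HF_API_TOKEN", "HF_TOKEN"):
        value = env.get(name)
        if value:  # skip unset and empty variables
            return value
    return None


if __name__ == "__main__":
    if resolve_hf_token() is None:
        print("No Hugging Face token found in HF_API_TOKEN or HF_TOKEN.")
```

Passing a mapping instead of reading os.environ directly keeps the helper easy to test without touching the real environment.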

Components

This integration provides several components to interact with Hugging Face APIs:

HuggingFaceAPIChatGenerator: chat generation with LLMs.
HuggingFaceAPITextEmbedder: creates an embedding for text (used in query/RAG pipelines).
HuggingFaceAPIDocumentEmbedder: enriches documents with embeddings (used in indexing pipelines).
HuggingFaceTEIRanker: ranks documents based on their similarity to the query, using a TEI endpoint.

Chat Generation

Use HuggingFaceAPIChatGenerator with the Serverless Inference API (Inference Providers):

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.huggingface_api import HuggingFaceAPIChatGenerator

generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "Qwen/Qwen2.5-7B-Instruct", "provider": "together"},
)

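# run() takes a list of ChatMessage objects, not a bare string
messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
result = generator.run(messages)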
print(result)

To use a dedicated Inference Endpoint or a self-hosted TGI server, pass its URL instead:

generator = HuggingFaceAPIChatGenerator(
    api_type="inference_endpoints",  # or "text_generation_inference" for self-hosted TGI
    api_params={"url": "<your-endpoint-url>"},
)
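The two call patterns above differ only in which key goes into api_params: the serverless API is addressed by model name, while the other API types on this page are addressed by URL. As a rough sketch, a small helper can make that rule explicit. The name build_api_params is hypothetical and not part of the integration.

```python
from typing import Dict, Optional

# API types that appear on this page, grouped by how the target is addressed.
MODEL_BASED = {"serverless_inference_api"}
URL_BASED = {
    "inference_endpoints",
    "text_generation_inference",
    "text_embeddings_inference",
}


def build_api_params(
    api_type: str, target: str, provider: Optional[str] = None
) -> Dict[str, str]:
    """Build an api_params dict (hypothetical helper, for illustration only)."""
    if api_type in MODEL_BASED:
        params = {"model": target}
        if provider is not None:
            params["provider"] = provider
        return params
    if api_type in URL_BASED:
        # URL-addressed backends take no model or provider in this sketch.
        return {"url": target}
    raise ValueError(f"Unknown api_type: {api_type!r}")


print(build_api_params("serverless_inference_api", "Qwen/Qwen2.5-7B-Instruct", "together"))
print(build_api_params("text_generation_inference", "http://localhost:8080"))
```

The resulting dict can be passed as api_params together with the matching api_type.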

Embedding Models

To create semantic embeddings for documents, use HuggingFaceAPIDocumentEmbedder in your indexing pipeline. To embed queries, use HuggingFaceAPITextEmbedder in your query pipeline.

from haystack_integrations.components.embedders.huggingface_api import HuggingFaceAPITextEmbedder

text_embedder = HuggingFaceAPITextEmbedder(
    api_type="serverless_inference_api",
    api_params={"model": "BAAI/bge-small-en-v1.5"},
)

print(text_embedder.run("I love pizza!"))
# {'embedding': [0.017020374536514282, -0.023255806416273117, ...]}

Both embedders also work with a self-hosted TEI server:

text_embedder = HuggingFaceAPITextEmbedder(
    api_type="text_embeddings_inference",
    api_params={"url": "http://localhost:8080"},
)
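The text embedder returns a plain list of floats, as in the sample output above. One common way to compare a query embedding with a document embedding is cosine similarity. The sketch below uses only the standard library, and the three-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("Embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
query = [0.1, 0.3, 0.5]
doc_close = [0.2, 0.6, 1.0]   # same direction as the query
doc_far = [0.5, -0.3, 0.0]    # points elsewhere

print(cosine_similarity(query, doc_close) > cosine_similarity(query, doc_far))
```

In a real pipeline a retriever usually performs this comparison against a document store, so a helper like this is mainly useful for quick sanity checks.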

Ranking Models

Use HuggingFaceTEIRanker to rank documents with a reranking model served by a TEI endpoint:

from haystack import Document
from haystack_integrations.components.rankers.huggingface_api import HuggingFaceTEIRanker

ranker = HuggingFaceTEIRanker(url="http://localhost:8080", top_k=2)

docs = [Document(content="The capital of France is Paris"),
        Document(content="The capital of Germany is Berlin")]

result = ranker.run(query="What is the capital of France?", documents=docs)
print(result["documents"][0].content)
# The capital of France is Paris
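Conceptually, ranking comes down to scoring each document against the query and keeping the top_k highest-scoring ones. The sketch below shows only that final selection step in pure Python. The scores are invented for illustration (real scores come from the reranking model behind the TEI endpoint), and ScoredDoc and keep_top_k are hypothetical names, not part of the integration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredDoc:
    content: str
    score: float  # invented relevance score, higher means more relevant


def keep_top_k(docs: List[ScoredDoc], top_k: int) -> List[ScoredDoc]:
    """Sort by score, highest first, and keep the first top_k documents."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    return sorted(docs, key=lambda d: d.score, reverse=True)[:top_k]


docs = [
    ScoredDoc("The capital of Germany is Berlin", 0.12),
    ScoredDoc("The capital of France is Paris", 0.97),
    ScoredDoc("Pizza is popular in Italy", 0.01),
]

top = keep_top_k(docs, top_k=2)
print([d.content for d in top])
```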