Integration: Hugging Face API
Use models through Hugging Face APIs - Inference Providers, Inference Endpoints, TGI and TEI
Overview
With this integration, you can use models through Hugging Face APIs:
- Serverless Inference API (Inference Providers): access many models from different providers through a unified API.
- Inference Endpoints: deploy models on dedicated, fully managed infrastructure.
- Self-hosted Text Generation Inference (TGI) and Text Embeddings Inference (TEI) servers.
Haystack supports Hugging Face models in other ways too:
- Hugging Face Transformers for local models (LLMs, extractive QA, classification, NER)
- Sentence Transformers for local embedding and ranking models
- Optimum for high-performance inference with ONNX Runtime
Installation
pip install huggingface-api-haystack
Usage
Unless you are using a self-hosted TGI/TEI server, set your Hugging Face token as the HF_API_TOKEN or HF_TOKEN environment variable.
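As a quick sanity check before building any component, the token lookup can be sketched as below. The helper name `find_hf_token` is hypothetical and not part of the integration; it only reads the two environment variables named above. Checking `HF_API_TOKEN` before `HF_TOKEN` is an assumption made for this sketch, not documented behavior.

```python
import os
from typing import Mapping, Optional


def find_hf_token(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first non-empty Hugging Face token found, or None.

    Hypothetical helper for illustration only. The lookup order
    (HF_API_TOKEN, then HF_TOKEN) is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    for name in ("HF_API_TOKEN", "HF_TOKEN"):
        value = env.get(name)
        if value:  # skip unset and empty values
            return value
    return None


if find_hf_token() is None:
    print("No Hugging Face token found; set HF_API_TOKEN or HF_TOKEN.")
```

Passing a plain dictionary instead of the real environment makes the helper easy to try out without changing your shell configuration.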
Components
This integration provides several components to interact with Hugging Face APIs:
- HuggingFaceAPIChatGenerator: chat generation with LLMs.
- HuggingFaceAPITextEmbedder: creates an embedding for text (used in query/RAG pipelines).
- HuggingFaceAPIDocumentEmbedder: enriches documents with embeddings (used in indexing pipelines).
- HuggingFaceTEIRanker: ranks documents based on their similarity to the query, using a TEI endpoint.
Chat Generation
Use HuggingFaceAPIChatGenerator with the Serverless Inference API (Inference Providers):
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.huggingface_api import HuggingFaceAPIChatGenerator

generator = HuggingFaceAPIChatGenerator(
    api_type="serverless_inference_api",
    api_params={"model": "Qwen/Qwen2.5-7B-Instruct", "provider": "together"},
)
# Wrap the prompt in a ChatMessage, the input type chat generators expect.
messages = [ChatMessage.from_user("What's Natural Language Processing? Be brief.")]
result = generator.run(messages)
print(result)
To use a dedicated Inference Endpoint or a self-hosted TGI server, pass its URL instead:
generator = HuggingFaceAPIChatGenerator(
    api_type="inference_endpoints",  # or "text_generation_inference" for self-hosted TGI
    api_params={"url": "<your-endpoint-url>"},
)
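The two configurations above differ only in which keys go into `api_params`: a model name (and optionally a provider) for the serverless API, or a URL for a dedicated endpoint or TGI server. A minimal sketch of that rule follows. The helper `build_api_params` is hypothetical and not part of the integration; it only assembles the dictionary and does not contact any service.

```python
from typing import Optional

# API types from the examples above that are addressed by URL.
URL_BASED_API_TYPES = {"inference_endpoints", "text_generation_inference"}


def build_api_params(
    api_type: str,
    model: Optional[str] = None,
    url: Optional[str] = None,
    provider: Optional[str] = None,
) -> dict:
    """Assemble an api_params dict for the given api_type.

    Hypothetical helper for illustration; the validation rules mirror
    the examples in this section and are not the integration's own checks.
    """
    if api_type == "serverless_inference_api":
        if not model:
            raise ValueError("serverless_inference_api requires a model name")
        params = {"model": model}
        if provider:
            params["provider"] = provider
        return params
    if api_type in URL_BASED_API_TYPES:
        if not url:
            raise ValueError(f"{api_type} requires a url")
        return {"url": url}
    raise ValueError(f"Unknown api_type: {api_type!r}")


print(build_api_params("serverless_inference_api",
                       model="Qwen/Qwen2.5-7B-Instruct", provider="together"))
print(build_api_params("text_generation_inference", url="http://localhost:8080"))
```

The returned dictionary can then be passed as `api_params` alongside the same `api_type` when constructing the generator.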
Embedding Models
To create semantic embeddings for documents, use HuggingFaceAPIDocumentEmbedder in your indexing pipeline. To generate embeddings for queries, use HuggingFaceAPITextEmbedder.
from haystack_integrations.components.embedders.huggingface_api import HuggingFaceAPITextEmbedder

text_embedder = HuggingFaceAPITextEmbedder(
    api_type="serverless_inference_api",
    api_params={"model": "BAAI/bge-small-en-v1.5"},
)
print(text_embedder.run("I love pizza!"))
# {'embedding': [0.017020374536514282, -0.023255806416273117, ...]}
Both embedders also work with a self-hosted TEI server:
text_embedder = HuggingFaceAPITextEmbedder(
    api_type="text_embeddings_inference",
    api_params={"url": "http://localhost:8080"},
)
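Query and document embeddings are typically compared by vector similarity at retrieval time. The sketch below shows cosine similarity in plain Python, assuming embeddings are lists of floats like the `embedding` value printed above. The function name and the two-dimensional toy vectors are made up for illustration; real embeddings have many more dimensions, and in a pipeline a document store or retriever would normally do this comparison for you.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [0.9, 0.1], "doc_b": [0.0, 1.0]}

# Pick the document whose vector points in the most similar direction.
best = max(doc_vecs, key=lambda name: cosine_similarity(query_vec, doc_vecs[name]))
print(best)
```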
Ranking Models
Use HuggingFaceTEIRanker to rank documents with a reranking model served by a TEI endpoint:
from haystack import Document
from haystack_integrations.components.rankers.huggingface_api import HuggingFaceTEIRanker

ranker = HuggingFaceTEIRanker(url="http://localhost:8080", top_k=2)
docs = [
    Document(content="The capital of France is Paris"),
    Document(content="The capital of Germany is Berlin"),
]
result = ranker.run(query="What is the capital of France?", documents=docs)
print(result["documents"][0].content)
# The capital of France is Paris
