Integration: Voyage AI
Use text embeddings and rerankers from Voyage AI
Voyage AI’s embedding and ranking models are state-of-the-art in retrieval accuracy. The integration supports the following models:
- `voyage-3.5` and `voyage-3.5-lite` - Latest general-purpose embedding models with superior performance
- `voyage-3-large` and `voyage-3` - High-performance general-purpose embedding models
- `voyage-context-3` - Contextualized chunk embedding model that preserves document context for improved retrieval accuracy
- `voyage-2` and `voyage-large-2` - Proven models that outperform `intfloat/e5-mistral-7b-instruct` and `OpenAI/text-embedding-3-large` on the MTEB Benchmark
For the complete list of available models, see the Embeddings Documentation and Contextualized Chunk Embeddings.
Installation
```shell
pip install voyage-embedders-haystack
```
Usage
You can use Voyage models with four components:

- `VoyageTextEmbedder` - For embedding query text
- `VoyageDocumentEmbedder` - For embedding documents
- `VoyageContextualizedDocumentEmbedder` - For contextualized chunk embeddings with `voyage-context-3`
- `VoyageRanker` - For reranking documents
Standard Embeddings
To create semantic embeddings for documents, use `VoyageDocumentEmbedder` in your indexing pipeline. To generate embeddings for queries, use `VoyageTextEmbedder`. For reranking, use `VoyageRanker` with Voyage Rerankers.
Contextualized Embeddings
For improved retrieval quality, use `VoyageContextualizedDocumentEmbedder` with the `voyage-context-3` model. This component preserves context between related document chunks by grouping them together during embedding, reducing the context loss that occurs when chunks are embedded independently.
Important: You must explicitly specify the `model` parameter when initializing any component. Choose from the available models listed in the Embeddings Documentation. Recommended choices include:

- `voyage-3.5` - Latest general-purpose model for best performance
- `voyage-3.5-lite` - Efficient model with lower latency
- `voyage-3-large` - High-capacity model for complex tasks
- `voyage-context-3` - Contextualized embeddings for improved retrieval (use with `VoyageContextualizedDocumentEmbedder`)
- `voyage-2` - Proven general-purpose model
You can set the environment variable `VOYAGE_API_KEY` instead of passing the API key as an argument. To get an API key, see the Voyage AI website.
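For example, on Linux or macOS you can export the key in your shell before running your pipeline (replace the placeholder with your own key):

```shell
export VOYAGE_API_KEY="your-api-key"
```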
Note (v1.7.0+): The `model` parameter is required and must be explicitly specified. Earlier versions defaulted to `voyage-3` for embedders and `rerank-2` for the ranker.
Example
Below is an example semantic search pipeline that uses the Simple Wikipedia Dataset from HuggingFace. You can find more examples in the examples folder.
Load the dataset:
```python
# Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset

from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Import Voyage Embedders
from haystack_integrations.components.embedders.voyage_embedders import VoyageDocumentEmbedder, VoyageTextEmbedder

# Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```
Index the documents to the `InMemoryDocumentStore` using the `VoyageDocumentEmbedder` and `DocumentWriter`:
```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
retriever = InMemoryEmbeddingRetriever(document_store=doc_store)
doc_writer = DocumentWriter(document_store=doc_store)

doc_embedder = VoyageDocumentEmbedder(
    model="voyage-3.5",
    input_type="document",
)

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(instance=doc_writer, name="DocWriter")
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})

print(f"Number of documents in Document Store: {len(doc_store.filter_documents())}")
print(f"First Document: {doc_store.filter_documents()[0]}")
print(f"Embedding of first Document: {doc_store.filter_documents()[0].embedding}")
```
Query the semantic search pipeline using the `InMemoryEmbeddingRetriever` and `VoyageTextEmbedder`:
```python
text_embedder = VoyageTextEmbedder(model="voyage-3.5", input_type="query")

# Query Pipeline
query_pipeline = Pipeline()
query_pipeline.add_component(instance=text_embedder, name="TextEmbedder")
query_pipeline.add_component(instance=retriever, name="Retriever")
query_pipeline.connect("TextEmbedder.embedding", "Retriever.query_embedding")

# Search
results = query_pipeline.run({"TextEmbedder": {"text": "Which year did the Joker movie release?"}})

# Print text from top result
top_result = results["Retriever"]["documents"][0].content
print("The top search result is:")
print(top_result)
```
Contextualized Embeddings Example
The `voyage-context-3` model enables contextualized chunk embeddings, which preserve relationships between document chunks for better retrieval accuracy. Documents with the same `source_id` are embedded together in context:
```python
from haystack import Document
from haystack_integrations.components.embedders.voyage_embedders import VoyageContextualizedDocumentEmbedder

# Create documents with source_id to group related chunks
docs = [
    # Chunks from the same document (source_id: "doc1")
    Document(
        content="Apple Inc. released their Q1 earnings report.",
        meta={"source_id": "doc1", "title": "Apple News"},
    ),
    Document(
        content="Revenue increased by 12% year over year.",
        meta={"source_id": "doc1", "title": "Apple News"},
    ),
    # Chunks from another document (source_id: "doc2")
    Document(
        content="Tesla announced new vehicle production targets.",
        meta={"source_id": "doc2", "title": "Tesla Update"},
    ),
]

# Use VoyageContextualizedDocumentEmbedder for voyage-context-3
embedder = VoyageContextualizedDocumentEmbedder(
    model="voyage-context-3",
    input_type="document",
)
result = embedder.run(documents=docs)

# Chunks with the same source_id are embedded together, preserving context.
# This improves retrieval - e.g., searching "Apple revenue growth" will better
# match the second chunk because it maintains its connection to "Apple Inc."
```
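The grouping behaviour can be illustrated with a small, stdlib-only sketch. The helper below is purely illustrative (the real batching logic lives inside `VoyageContextualizedDocumentEmbedder` and may differ): chunks sharing a `source_id` end up in the same group, so each one is embedded with its neighbours as context.

```python
# Illustrative sketch of grouping chunks by source_id before embedding.
# This is not the integration's actual implementation, only the idea behind it.
from collections import defaultdict


def group_by_source_id(documents):
    """Group chunk dicts by their meta['source_id']."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc["meta"]["source_id"]].append(doc["content"])
    return dict(groups)


chunks = [
    {"content": "Apple Inc. released their Q1 earnings report.", "meta": {"source_id": "doc1"}},
    {"content": "Revenue increased by 12% year over year.", "meta": {"source_id": "doc1"}},
    {"content": "Tesla announced new vehicle production targets.", "meta": {"source_id": "doc2"}},
]

grouped = group_by_source_id(chunks)
print(grouped["doc1"])  # both Apple chunks are embedded together as one context
```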
For more examples, see the contextualized embedder example.
License
voyage-embedders-haystack is distributed under the terms of the Apache-2.0 license.
