Integration: Valyu Search

Search and content extraction components using Valyu's API for web and proprietary sources

Authors

Valyu


Overview
Installation
Usage
API Integration Details
- Authentication
License

Overview

This package provides Haystack components that integrate Valyu's search and content extraction APIs into your Haystack pipelines.

This package provides two main components:

ValyuSearch - Search component that queries the Valyu DeepSearch API and returns documents with content already included
ValyuContentFetcher - Content extraction component that fetches and cleans content from URLs

Key Features:

Search across web and proprietary sources
Full content included in search results
AI-powered content extraction and summarization

Installation

Use pip to install Valyu Search for Haystack:

pip install valyu-search-haystack

Or, from the root of a local clone of the repository, install from source in editable mode:

pip install -e .

Requirements:

Python 3.8+
haystack-ai >= 2.0.0
valyu >= 2.2.1

Usage

Set your Valyu API key as an environment variable:

export VALYU_API_KEY="your-api-key"

ValyuSearch

The ValyuSearch component queries the Valyu DeepSearch API. Search results include the full content of each source by default, so a RAG pipeline can use them without a separate fetch step.

Basic Usage:

from valyu_haystack import ValyuSearch
from haystack import Pipeline

# Create a search component (API key from VALYU_API_KEY env var)
search = ValyuSearch(
    top_k=5,
    search_type="all",  # "web", "proprietary", or "all"
    relevance_threshold=0.5
)

# Create and run a pipeline
pipeline = Pipeline()
pipeline.add_component("search", search)

result = pipeline.run({"search": {"query": "What is Haystack AI?"}})
documents = result["search"]["documents"]
links = result["search"]["links"]

Component Parameters:

api_key (Secret): Your Valyu API key. Defaults to VALYU_API_KEY environment variable
top_k (int, default=10): Maximum number of results to return
api_base_url (str): Base URL for the Valyu API
search_type (Literal["web", "proprietary", "all"], default="all"): Type of search
relevance_threshold (float, default=0.5): Minimum relevance score (0.0-1.0)
max_price (int, default=100): Maximum price per thousand queries in cents

Output:

documents (List[Document]): Documents with content and rich metadata
links (List[str]): List of URLs from search results

Metadata included:

title: Page title
url: Source URL
description: Page description
source: Data source identifier
relevance_score: Relevance score (0.0-1.0)
price: Cost of this result
length: Content length in characters
data_type: Type of data ("structured" or "unstructured")
image_url: Associated image URL (if any)
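The metadata fields above can be used to post-process results after a search. The following is a minimal sketch, not part of the package: filter_by_relevance is a hypothetical helper, and the SimpleNamespace objects only stand in for Haystack Document instances (anything with a meta dict works) so the snippet runs without an API call.

```python
from types import SimpleNamespace


def filter_by_relevance(documents, min_score=0.7):
    """Keep documents scoring at least min_score, highest score first.

    Hypothetical helper: it reads meta["relevance_score"], the field listed
    in the metadata above, and treats a missing score as 0.0.
    """
    def score(doc):
        return doc.meta.get("relevance_score", 0.0)

    kept = [doc for doc in documents if score(doc) >= min_score]
    return sorted(kept, key=score, reverse=True)


# Stand-ins for the Document objects returned in result["search"]["documents"].
docs = [
    SimpleNamespace(content="a", meta={"url": "https://example.com/a", "relevance_score": 0.62}),
    SimpleNamespace(content="b", meta={"url": "https://example.com/b", "relevance_score": 0.91}),
    SimpleNamespace(content="c", meta={"url": "https://example.com/c", "relevance_score": 0.78}),
    SimpleNamespace(content="d", meta={"url": "https://example.com/d"}),  # no score
]

best = filter_by_relevance(docs, min_score=0.7)
for doc in best:
    print(doc.meta["relevance_score"], doc.meta["url"])
```

With real search results, the same call would be filter_by_relevance(result["search"]["documents"]). This applies a stricter cut-off on the client side than the relevance_threshold passed to the component.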

ValyuContentFetcher

The ValyuContentFetcher component extracts clean, readable content from URLs using the Valyu Contents API. It supports batch processing and AI-powered summarization.

Basic Usage:

from valyu_haystack import ValyuContentFetcher
from haystack import Pipeline

# Create a content fetcher component
fetcher = ValyuContentFetcher(
    extract_effort="normal",  # "normal", "high", or "auto"
    response_length="short",  # "short", "medium", "large", "max", or int
    summary=True  # Enable AI summarization
)

# Create and run a pipeline
pipeline = Pipeline()
pipeline.add_component("fetcher", fetcher)

urls = ["https://example.com/article1", "https://example.com/article2"]
result = pipeline.run({"fetcher": {"urls": urls}})
documents = result["fetcher"]["documents"]

Component Parameters:

api_key (Secret): Your Valyu API key. Defaults to VALYU_API_KEY environment variable
api_base_url (str): Base URL for the Valyu API
timeout (int, default=30): Request timeout in seconds
extract_effort (Literal["normal", "high", "auto"], optional): Extraction thoroughness
response_length (Union[Literal["short", "medium", "large", "max"], int], optional): Content length per URL
summary (Union[bool, str, Dict], optional): AI summary config
- False or None: No AI processing (raw content)
- True: Basic automatic summarization
- str: Custom instructions (max 500 chars)
- dict: JSON schema for structured extraction
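Because summary accepts four different kinds of value, it can help to check a value before constructing the component. The sketch below is illustrative only: describe_summary_option and its return labels are hypothetical names, not part of valyu_haystack, and the 500-character limit is the one quoted in the list above.

```python
from typing import Any, Dict, Union

MAX_INSTRUCTION_CHARS = 500  # limit for custom instructions, as listed above


def describe_summary_option(summary: Union[bool, str, Dict[str, Any], None]) -> str:
    """Classify a `summary` value according to the four documented modes.

    Hypothetical helper; the labels it returns are made up for illustration.
    """
    if summary is None or summary is False:
        return "raw"           # no AI processing
    if summary is True:
        return "auto"          # basic automatic summarization
    if isinstance(summary, str):
        if len(summary) > MAX_INSTRUCTION_CHARS:
            raise ValueError(
                f"custom instructions are limited to {MAX_INSTRUCTION_CHARS} characters"
            )
        return "instructions"  # custom instructions
    if isinstance(summary, dict):
        return "schema"        # JSON schema for structured extraction
    raise TypeError(f"unsupported summary value: {type(summary).__name__}")


print(describe_summary_option(True))
print(describe_summary_option("List the three main findings."))
print(describe_summary_option({"type": "object", "properties": {}}))
```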

Input:

urls (List[str], optional): List of URLs to fetch
documents (List[Document], optional): Documents with URLs in metadata
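When the starting point is a list of documents rather than URLs, one option is to collect the URLs yourself and pass them through the urls input. This is a hedged sketch: urls_from_documents is a hypothetical helper, it assumes the URL sits under the "url" metadata key (the key ValyuSearch results use), and it says nothing about how the component itself reads URLs from documents. SimpleNamespace objects stand in for Haystack Document instances.

```python
from types import SimpleNamespace


def urls_from_documents(documents, key="url"):
    """Collect unique URLs from document metadata, preserving first-seen order.

    Hypothetical helper; `key` is an assumption about where the URL is stored.
    """
    seen = set()
    urls = []
    for doc in documents:
        url = (doc.meta or {}).get(key)
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Stand-ins for Document objects, including a duplicate and one without a URL.
docs = [
    SimpleNamespace(meta={"url": "https://example.com/article1"}),
    SimpleNamespace(meta={"url": "https://example.com/article2"}),
    SimpleNamespace(meta={"url": "https://example.com/article1"}),
    SimpleNamespace(meta={"title": "no url here"}),
]

urls = urls_from_documents(docs)
print(urls)
```

The resulting list has the same shape as the urls value shown in the basic usage example.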

Output:

documents (List[Document]): Documents with extracted content

Metadata included:

url: Source URL
title: Page title
length: Content length in characters
source: Data source identifier
data_type: Type of content

Pipeline Examples

RAG Pipeline with Search and Chat:

from haystack import Pipeline
from haystack.utils import Secret
from haystack.components.builders.chat_prompt_builder import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from valyu_haystack import ValyuSearch

# Create components
web_search = ValyuSearch(top_k=3)

prompt_template = [
    ChatMessage.from_system("You are a helpful assistant."),
    ChatMessage.from_user(
        "Given the information below:\n"
        "{% for document in documents %}{{ document.content }}{% endfor %}\n"
        "Answer question: {{ query }}.\nAnswer:"
    )
]

prompt_builder = ChatPromptBuilder(template=prompt_template, required_variables=["query", "documents"])
llm = OpenAIChatGenerator(api_key=Secret.from_env_var("OPENAI_API_KEY"), model="gpt-4o-mini")

# Build pipeline
pipe = Pipeline()
pipe.add_component("search", web_search)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", llm)

# Connect components
pipe.connect("search.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.messages", "llm.messages")

# Run pipeline
query = "What is the most famous landmark in Berlin?"
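result = pipe.run(data={"search": {"query": query}, "prompt_builder": {"query": query}})

# The generated answer is returned as a list of ChatMessage objects
replies = result["llm"]["replies"]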

Indexing Pipeline with Content Fetcher:

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from valyu_haystack import ValyuContentFetcher

# Create components
document_store = InMemoryDocumentStore()
fetcher = ValyuContentFetcher()
writer = DocumentWriter(document_store=document_store)

# Build indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=fetcher, name="fetcher")
indexing_pipeline.add_component(instance=writer, name="writer")

# Connect components
indexing_pipeline.connect("fetcher.documents", "writer.documents")

# Run pipeline
indexing_pipeline.run(data={
    "fetcher": {"urls": ["https://haystack.deepset.ai/blog/guide-to-using-zephyr-with-haystack2"]}
})

Advanced Configuration

Structured data extraction with Content Fetcher:

from valyu_haystack import ValyuContentFetcher

# Define JSON schema for structured extraction
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "published_date": {"type": "string"},
        "summary": {"type": "string"}
    }
}

fetcher = ValyuContentFetcher(summary=schema)
result = fetcher.run(urls=["https://example.com/article"])

# Extracted structured data will be in document metadata

API Integration Details

Authentication

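Both components use Haystack's Secret class to manage the API key. By default the key is read from an environment variable, and it is sent to the Valyu API in a request header: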

Header: x-api-key: your-api-key
Environment variable: VALYU_API_KEY
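To make the header and environment variable above concrete, here is a minimal standard-library sketch of how such a header could be assembled. It does not show the components' actual internals: build_auth_headers is a hypothetical helper, and the environ argument exists only so the example can run without a real key.

```python
import os


def build_auth_headers(env_var="VALYU_API_KEY", environ=None):
    """Return the x-api-key header, reading the key from the environment.

    Hypothetical helper for illustration; fails early if the key is missing.
    """
    environ = os.environ if environ is None else environ
    api_key = environ.get(env_var, "").strip()
    if not api_key:
        raise RuntimeError(f"{env_var} is not set; export it before running the pipeline")
    return {"x-api-key": api_key}


# A fake environment keeps the demo independent of the real shell environment.
headers = build_auth_headers(environ={"VALYU_API_KEY": "your-api-key"})
print(sorted(headers))
```

Failing early on a missing key gives a clearer error than a rejected request later in a pipeline run.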

License

valyu-search-haystack is distributed under the terms of the Apache-2.0 license.