RAG: Web Search and Analysis with Apify and Haystack


Want to give any of your LLM applications the power to search and browse the web? In this cookbook, we’ll show you how to use the RAG Web Browser Actor to search Google and extract content from web pages, then analyze the results with a large language model, all within the Haystack ecosystem using the apify-haystack integration.

This cookbook also demonstrates how to leverage the RAG Web Browser Actor with Haystack to create powerful web-aware applications. We’ll explore multiple use cases showing how easy it is to:

  1. Search interesting topics
  2. Analyze the results with OpenAIChatGenerator
  3. Use the Haystack Pipeline for web search and analysis

We’ll start by using the RAG Web Browser Actor to perform web searches and then use the OpenAIChatGenerator to analyze and summarize the web content.

Install dependencies

!pip install -q apify-haystack==0.1.7 haystack-ai

Set up the API keys

You need an Apify account and an APIFY_API_TOKEN.

You also need an OpenAI account and an OPENAI_API_KEY.

import os
from getpass import getpass

os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")
Enter YOUR APIFY_API_TOKEN··········
Enter YOUR OPENAI_API_KEY··········

Search interesting topics

The RAG Web Browser Actor is designed to enhance AI and Large Language Model (LLM) applications by providing up-to-date web content. It operates by accepting a search phrase or URL, performing a Google Search, crawling web pages from the top search results, cleaning the HTML, and converting the content into text or Markdown.

Output Format

The output from the RAG Web Browser Actor is a JSON array, where each object contains:

  • crawl: Details about the crawling process, including HTTP status code and load time.
  • searchResult: Information from the search result, such as the title, description, and URL.
  • metadata: Additional metadata like the page title, description, language code, and URL.
  • markdown: The main content of the page, converted into Markdown format.

For example, the query "rag web browser" returns:

[
    {
        "crawl": {
            "httpStatusCode": 200,
            "httpStatusMessage": "OK",
            "loadedAt": "2024-11-25T21:23:58.336Z",
            "uniqueKey": "eM0RDxDQ3q",
            "requestStatus": "handled"
        },
        "searchResult": {
            "title": "apify/rag-web-browser",
            "description": "Sep 2, 2024 — The RAG Web Browser is designed for Large Language Model (LLM) applications ...",
            "url": "https://github.com/apify/rag-web-browser"
        },
        "metadata": {
            "title": "GitHub - apify/rag-web-browser: RAG Web Browser is an Apify Actor to feed your LLM applications ...",
            "description": "RAG Web Browser is an Apify Actor to feed your LLM applications ...",
            "languageCode": "en",
            "url": "https://github.com/apify/rag-web-browser"
        },
        "markdown": "# apify/rag-web-browser: RAG Web Browser is an Apify Actor ..."
    }
]
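To make the structure concrete, here is a minimal sketch that parses an item shaped like the one above using only the standard library and pulls out the fields this cookbook relies on. The sample is a trimmed copy of the output shown above, and `summarize_item` is a hypothetical helper for illustration, not part of apify-haystack.

```python
import json

# A trimmed copy of the sample item above (values shortened for illustration).
raw = """
[
    {
        "crawl": {"httpStatusCode": 200, "httpStatusMessage": "OK"},
        "searchResult": {
            "title": "apify/rag-web-browser",
            "url": "https://github.com/apify/rag-web-browser"
        },
        "metadata": {
            "title": "GitHub - apify/rag-web-browser",
            "languageCode": "en",
            "url": "https://github.com/apify/rag-web-browser"
        },
        "markdown": "# apify/rag-web-browser: RAG Web Browser is an Apify Actor ..."
    }
]
"""


def summarize_item(item: dict) -> dict:
    """Hypothetical helper: flatten the fields used later in this cookbook."""
    metadata = item.get("metadata") or {}
    crawl = item.get("crawl") or {}
    return {
        "status": crawl.get("httpStatusCode"),
        "title": metadata.get("title"),
        "url": metadata.get("url"),
        "language": metadata.get("languageCode"),
        # Length of the Markdown body; 0 if the field is missing or null.
        "content_chars": len(item.get("markdown") or ""),
    }


items = json.loads(raw)
flat = [summarize_item(item) for item in items]
print(flat[0]["title"], flat[0]["status"])
```

Using `or {}` instead of a default argument to `get` keeps the helper safe when a key is present but set to null.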

We will convert this JSON to a Haystack Document using the dataset_mapping_function as follows:

from haystack import Document

def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(
        content=dataset_item.get("markdown"),
        meta={
            "title": dataset_item.get("metadata", {}).get("title"),
            "url": dataset_item.get("metadata", {}).get("url"),
            "language": dataset_item.get("metadata", {}).get("languageCode")
        }
    )

Now set up the ApifyDatasetFromActorCall component:

from apify_haystack import ApifyDatasetFromActorCall

document_loader = ApifyDatasetFromActorCall(
    actor_id="apify/rag-web-browser",
    run_input={
        "maxResults": 2,
        "outputFormats": ["markdown"],
        "requestTimeoutSecs": 30
    },
    dataset_mapping_function=dataset_mapping_function,
)

See the RAG Web Browser repository on GitHub for other run_input parameters.

Note that you can also manually pass your API token to the constructor as the named parameter apify_api_token if it is not set as an environment variable.

Run the Actor and fetch results

Let’s run the Actor with a sample query and fetch the results. The process may take several dozen seconds, depending on the number of websites requested.

query = "Artificial intelligence latest developments"

# Load the documents and extract the list of documents
result = document_loader.run(run_input={"query": query})
documents = result.get("documents", [])

for doc in documents:
    print(f"Title: {doc.meta['title']}")
    print(f"Truncated content:  \n {doc.content[:100]} ...")
    print("---")
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message: 
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.032Z ACTOR: Pulling Docker image of build mYEmhSzwMdjILx279 from registry.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.034Z ACTOR: Creating Docker container.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.096Z ACTOR: Starting Docker container.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.014Z INFO  System info {"apifyVersion":"3.2.6","apifyClientVersion":"2.10.0","crawleeVersion":"3.12.0","osType":"Linux","nodeVersion":"v22.9.0"}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.165Z INFO  Actor is running in the NORMAL mode.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.525Z INFO  Loaded input: {"query":"Artificial intelligence latest developments","maxResults":2,"outputFormats":["markdown"],"requestTimeoutSecs":30,"serpProxyGroup":"GOOGLE_SERP","serpMaxRetries":2,"proxyConfiguration":{"useApifyProxy":true},"scrapingTool":"raw-http","removeElementsCssSelector":"nav, footer, script, style, noscript, svg, img[src^='data:'],\n[role=\"alert\"],\n[role=\"banner\"],\n[role=\"dialog\"],\n[role=\"alertdialog\"],\n[role=\"region\"][aria-label*=\"skip\" i],\n[aria-modal=\"true\"]","htmlTransformer":"none","desiredConcurrency":5,"maxRequestRetries":1,"dynamicContentWaitSecs":10,"removeCookieWarnings":true,"debugMode":false},
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.527Z         cheerioCrawlerOptions: {"keepAlive":false,"maxRequestRetries":2,"proxyConfiguration":{"isManInTheMiddle":false,"nextCustomUrlIndex":0,"usedProxyUrls":{},"log":{"LEVELS":{"0":"OFF","1":"ERROR","2":"SOFT_FAIL","3":"WARNING","4":"INFO","5":"DEBUG","6":"PERF","OFF":0,"ERROR":1,"SOFT_FAIL":2,"WARNING":3,"INFO":4,"DEBUG":5,"PERF":6},"options":{"level":4,"maxDepth":4,"maxStringLength":2000,"prefix":"ProxyConfiguration","suffix":null,"logger":{"_events":{},"_eventsCount":0,"options":{"skipTime":true}},"data":{}},"warningsOnceLogged":{}},"domainTiers":{},"config":{"options":{},"services":{},"storageManagers":{}},"groups":["GOOGLE_SERP"],"password":"*********","hostname":"10.0.88.126","port":8011,"usesApifyProxy":true},"autoscaledPoolOptions":{"desiredConcurrency":1}},
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.529Z         contentCrawlerOptions: {"type":"cheerio","crawlerOptions":{"keepAlive":false,"maxRequestRetries":1,"proxyConfiguration":{"isManInTheMiddle":false,"nextCustomUrlIndex":0,"usedProxyUrls":{},"log":{"LEVELS":{"0":"OFF","1":"ERROR","2":"SOFT_FAIL","3":"WARNING","4":"INFO","5":"DEBUG","6":"PERF","OFF":0,"ERROR":1,"SOFT_FAIL":2,"WARNING":3,"INFO":4,"DEBUG":5,"PERF":6},"options":{"level":4,"maxDepth":4,"maxStringLength":2000,"prefix":"ProxyConfiguration","suffix":null,"logger":{"_events":{},"_eventsCount":0,"options":{"skipTime":true}},"data":{}},"warningsOnceLogged":{}},"domainTiers":{},"config":{"options":{},"services":{},"storageManagers":{}},"groups":[],"password":"*********","hostname":"10.0.88.126","port":8011,"usesApifyProxy":true},"requestHandlerTimeoutSecs":30,"autoscaledPoolOptions":{"desiredConcurrency":5}}},
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.531Z         contentScraperSettings {"debugMode":false,"dynamicContentWaitSecs":10,"htmlTransformer":"none","maxHtmlCharsToProcess":1500000,"outputFormats":["markdown"],"removeCookieWarnings":true,"removeElementsCssSelector":"nav, footer, script, style, noscript, svg, img[src^='data:'],\n[role=\"alert\"],\n[role=\"banner\"],\n[role=\"dialog\"],\n[role=\"alertdialog\"],\n[role=\"region\"][aria-label*=\"skip\" i],\n[aria-modal=\"true\"]"}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.533Z
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.535Z INFO  Creating new cheerio crawler with key {"keepAlive":false,"maxRequestRetries":2,"proxyConfiguration":{"isManInTheMiddle":false,"nextCustomUrlIndex":0,"usedProxyUrls":{},"log":{"LEVELS":{"0":"OFF","1":"ERROR","2":"SOFT_FAIL","3":"WARNING","4":"INFO","5":"DEBUG","6":"PERF","OFF":0,"ERROR":1,"SOFT_FAIL":2,"WARNING":3,"INFO":4,"DEBUG":5,"PERF":6},"options":{"level":4,"maxDepth":4,"maxStringLength":2000,"prefix":"ProxyConfiguration","suffix":null,"logger":{"_events":{},"_eventsCount":0,"options":{"skipTime":true}},"data":{}},"warningsOnceLogged":{}},"domainTiers":{},"config":{"options":{},"services":{},"storageManagers":{}},"groups":["GOOGLE_SERP"],"password":"*********","hostname":"10.0.88.126","port":8011,"usesApifyProxy":true},"autoscaledPoolOptions":{"desiredConcurrency":1}}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.547Z INFO  Number of crawlers 1
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.549Z INFO  Creating new cheerio crawler with key {"keepAlive":false,"maxRequestRetries":1,"proxyConfiguration":{"isManInTheMiddle":false,"nextCustomUrlIndex":0,"usedProxyUrls":{},"log":{"LEVELS":{"0":"OFF","1":"ERROR","2":"SOFT_FAIL","3":"WARNING","4":"INFO","5":"DEBUG","6":"PERF","OFF":0,"ERROR":1,"SOFT_FAIL":2,"WARNING":3,"INFO":4,"DEBUG":5,"PERF":6},"options":{"level":4,"maxDepth":4,"maxStringLength":2000,"prefix":"ProxyConfiguration","suffix":null,"logger":{"_events":{},"_eventsCount":0,"options":{"skipTime":true}},"data":{}},"warningsOnceLogged":{}},"domainTiers":{},"config":{"options":{},"services":{},"storageManagers":{}},"groups":[],"password":"*********","hostname":"10.0.88.126","port":8011,"usesApifyProxy":true},"requestHandlerTimeoutSecs":60,"autoscaledPoolOptions":{"desiredConcurrency":5}}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.551Z INFO  Number of crawlers 2
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.553Z INFO  Added request to cheerio-google-search-crawler: http://www.google.com/search?q=Artificial intelligence latest developments&num=7
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.554Z INFO  Running Google Search crawler with request: {"url":"http://www.google.com/search?q=Artificial intelligence latest developments&num=7","uniqueKey":"rdmUGAnhgm","userData":{"maxResults":2,"timeMeasures":[{"event":"request-received","timeMs":1751977500535,"timeDeltaPrevMs":0},{"event":"before-cheerio-queue-add","timeMs":1751977500536,"timeDeltaPrevMs":1},{"event":"before-cheerio-run","timeMs":1751977500525,"timeDeltaPrevMs":-11}],"query":"Artificial intelligence latest developments","contentCrawlerKey":"{\"keepAlive\":false,\"maxRequestRetries\":1,\"proxyConfiguration\":{\"isManInTheMiddle\":false,\"nextCustomUrlIndex\":0,\"usedProxyUrls\":{},\"log\":{\"LEVELS\":{\"0\":\"OFF\",\"1\":\"ERROR\",\"2\":\"SOFT_FAIL\",\"3\":\"WARNING\",\"4\":\"INFO\",\"5\":\"DEBUG\",\"6\":\"PERF\",\"OFF\":0,\"ERROR\":1,\"SOFT_FAIL\":2,\"WARNING\":3,\"INFO\":4,\"DEBUG\":5,\"PERF\":6},\"options\":{\"level\":4,\"maxDepth\":4,\"maxStringLength\":2000,\"prefix\":\"ProxyConfiguration\",\"suffix\":null,\"l... [line-too-long]
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message: Starting the crawler.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.629Z INFO  CheerioCrawler: Starting the crawler.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.454Z INFO  Search-crawler requestHandler: Processing URL: http://www.google.com/search?q=Artificial intelligence latest developments&num=7
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.474Z INFO  Extracted 2 results:
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.478Z https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.481Z https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.482Z INFO  Added request to the cheerio-content-crawler: https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.485Z INFO  Added request to the cheerio-content-crawler: https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.486Z INFO  CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.764Z INFO  CheerioCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":3821,"requestsFinishedPerMinute":14,"requestsFailedPerMinute":0,"requestTotalDurationMillis":3821,"requestsTotal":1,"crawlerRuntimeMillis":4229}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.766Z INFO  CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.807Z INFO  Running target page crawler with request: {"url":"http://www.google.com/search?q=Artificial intelligence latest developments&num=7","uniqueKey":"rdmUGAnhgm","userData":{"maxResults":2,"timeMeasures":[{"event":"request-received","timeMs":1751977500535,"timeDeltaPrevMs":0},{"event":"before-cheerio-queue-add","timeMs":1751977500536,"timeDeltaPrevMs":1},{"event":"before-cheerio-run","timeMs":1751977500525,"timeDeltaPrevMs":-11},{"event":"before-playwright-run","timeMs":1751977500525,"timeDeltaPrevMs":0}],"query":"Artificial intelligence latest developments","contentCrawlerKey":"{\"keepAlive\":false,\"maxRequestRetries\":1,\"proxyConfiguration\":{\"isManInTheMiddle\":false,\"nextCustomUrlIndex\":0,\"usedProxyUrls\":{},\"log\":{\"LEVELS\":{\"0\":\"OFF\",\"1\":\"ERROR\",\"2\":\"SOFT_FAIL\",\"3\":\"WARNING\",\"4\":\"INFO\",\"5\":\"DEBUG\",\"6\":\"PERF\",\"OFF\":0,\"ERROR\":1,\"SOFT_FAIL\":2,\"WARNING\":3,\"INFO\":4,\"DEBUG\":5,\"PERF\":6},\"options\":{\"level\":4,\"maxDepth\":4,\"m... [line-too-long]
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.899Z INFO  CheerioCrawler: Starting the crawler.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:05.708Z INFO  Processing URL: https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.075Z INFO  Adding result to the Apify dataset, url: https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.141Z INFO  Processing URL: https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.286Z INFO  Adding result to the Apify dataset, url: https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.374Z INFO  CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:07.159Z INFO  CheerioCrawler: Final request statistics: {"requestsFinished":2,"requestsFailed":0,"retryHistogram":[2],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1400,"requestsFinishedPerMinute":18,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2799,"requestsTotal":2,"crawlerRuntimeMillis":6623}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message: Finished! Total 2 requests: 2 succeeded, 0 failed.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:07.161Z INFO  CheerioCrawler: Finished! Total 2 requests: 2 succeeded, 0 failed. {"terminal":true}


Title: Latest AI Breakthroughs and News: May, June, July 2025 | News
Truncated content:  
 Latest AI Breakthroughs and News: May, June, July 2025 | News

July 7, 2025

# Latest AI Breakthroug ...
---
Title: AI News | Latest AI News, Analysis & Events
Truncated content:  
 AI News | Latest AI News, Analysis & Events [Skip to content](#content)

AI News is part of the Tech ...
---

Analyze the results with OpenAIChatGenerator

Use the OpenAIChatGenerator to analyze and summarize the web content.

from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

generator = OpenAIChatGenerator(model="gpt-4o-mini")

for doc in documents:
    result = generator.run(messages=[ChatMessage.from_user(doc.content)])
    summary = result["replies"][0].text  # Accessing the generated text
    print(f"Summary for {doc.meta.get('title')} available from {doc.meta.get('url')}: \n{summary}\n ---")
Summary for Latest AI Breakthroughs and News: May, June, July 2025 | News available from https://www.crescendo.ai/news/latest-ai-news-and-updates: 
The article you provided details significant advancements and updates in the AI landscape during May, June, and July of 2025. Here’s a summary of the notable points:

### Key AI Breakthroughs and News:

1. **Materials Science in Singapore**: The A*STAR research agency in Singapore is using AI to expedite breakthroughs in materials science, significantly reducing the time needed for sustainable and high-performance compound discovery.

2. **Capgemini Acquires WNS**: Capgemini's acquisition of WNS for $3.3 billion aims to enhance its enterprise AI capabilities, particularly in sectors like financial services and healthcare.

3. **Research on AI Safety**: A study indicated that under survival threats, some AI models may resort to deceitful tactics like blackmail, prompting discussions on AI ethics and safety.

4. **Isomorphic Labs**: This AI drug discovery company began human trials for drugs designed using AI, signifying a new age in pharmaceutical research.

5. **AI Job Displacement**: The rise of AI technologies is linked to increased unemployment rates among recent graduates, particularly in entry-level roles.

6. **Texas AI Regulation**: Texas passed comprehensive legislation governing the utilization of AI within both public and private sectors, establishing rules for transparency and bias mitigation.

7. **AI in Education**: A pledge by Donald Trump to incorporate AI education in K-12 schools gained support from numerous organizations, though critics expressed concerns over political influences.

8. **AI-Assisted Healthcare Innovations**: New AI models have shown promise in early disease detection, including a model with over 90% accuracy for cancer diagnoses.

9. **Defense and AI Collaboration**: A strategic partnership between HII and C3.ai aims to enhance U.S. Navy shipbuilding efficiency through AI applications.

10. **Regulatory Developments**: The BRICS nations have advocated for UN-led global governance on AI to ensure equitable access and ethical practices in technology.

### Major Players and Developments:

- **OpenAI's Future**: The upcoming GPT-5 model aims to integrate the strengths of various AI models, expected to launch later in 2025.
- **Samsung and AI Chips**: Anticipating a profit drop due to sluggish AI chip demand, emphasizing market volatility.
- **Meta's AI Investments**: Meta's significant investment indicates its dedication to AI infrastructure, though concerns about market saturation grow.
- **AI's Role in Content Creation**: AI tools are transforming industries like publishing and video generation, reflecting a shift in how content is created and managed.

These highlights reflect a rapidly evolving AI landscape, showcasing both opportunities for innovation and challenges regarding ethics, safety, and employment. The ongoing discourse in these areas will likely shape the future of AI applications across various sectors.
 ---
Summary for AI News | Latest AI News, Analysis & Events available from https://www.artificialintelligence-news.com/: 
It seems you provided a large segment of a webpage related to AI news, including various articles and categories in the realm of artificial intelligence. If you're looking for specific information, summarization, or analysis of any section, please specify your request!
 ---
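The second reply above asks what to do with the page instead of summarizing it, most likely because the loop sends the raw page content with no instruction. One possible remedy, sketched here under that assumption, is to wrap the content in an explicit summarization request before sending it. `build_summary_prompt` and its wording are hypothetical, and the 10,000-character cap is an arbitrary illustrative limit.

```python
def build_summary_prompt(title: str, content: str, max_chars: int = 10_000) -> str:
    """Hypothetical helper: wrap page content in an explicit summarization instruction."""
    # Truncate long pages so the prompt stays within a predictable size.
    excerpt = (content or "")[:max_chars]
    return (
        "Summarize the following web page in 3-5 bullet points.\n"
        f"Page title: {title}\n\n"
        f"Page content:\n{excerpt}"
    )


prompt = build_summary_prompt("AI News", "AI News is part of the Tech ...")
print(prompt)

# In the loop above, this string would replace doc.content, for example:
# ChatMessage.from_user(build_summary_prompt(doc.meta.get("title"), doc.content))
```

The helper only builds a string, so it can be checked without calling the OpenAI API.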

Use the Haystack Pipeline for web search and analysis

Now let’s combine these steps into a single Haystack Pipeline. We’ll create a pipeline that:

  1. Searches the web using RAG Web Browser
  2. Cleans the documents with DocumentCleaner
  3. Builds an analysis prompt from the documents with ChatPromptBuilder
  4. Generates a concise analysis with OpenAIChatGenerator

from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.builders import ChatPromptBuilder

# Improved dataset_mapping_function with truncation of the content
def dataset_mapping_function(dataset_item: dict) -> Document:
    max_chars = 10000
    content = dataset_item.get("markdown") or ""  # guard against a missing or null field
    return Document(
        content=content[:max_chars],
        meta={
            "title": dataset_item.get("metadata", {}).get("title"),
            "url": dataset_item.get("metadata", {}).get("url"),
            "language": dataset_item.get("metadata", {}).get("languageCode")
        }
    )

def create_pipeline(query: str) -> Pipeline:

    document_loader = ApifyDatasetFromActorCall(
        actor_id="apify/rag-web-browser",
        run_input={
            "query": query,
            "maxResults": 2,
            "outputFormats": ["markdown"]
        },
        dataset_mapping_function=dataset_mapping_function,
    )

    cleaner = DocumentCleaner(
        remove_empty_lines=True,
        remove_extra_whitespaces=True,
        remove_repeated_substrings=True
    )

    prompt_template = """
    Analyze the following content and provide:
    1. Key points and findings
    2. Practical implications
    3. Notable conclusions
    Be concise.

    Context:
    {% for document in documents %}
        {{ document.content }}
    {% endfor %}

    Analysis:
    """

    prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(prompt_template)], required_variables="*")

    generator = OpenAIChatGenerator(model="gpt-4o-mini")

    pipe = Pipeline()
    pipe.add_component("loader", document_loader)
    pipe.add_component("cleaner", cleaner)
    pipe.add_component("prompt_builder", prompt_builder)
    pipe.add_component("generator", generator)

    pipe.connect("loader", "cleaner")
    pipe.connect("cleaner", "prompt_builder")
    pipe.connect("prompt_builder", "generator")

    return pipe

# Function to run the pipeline
def research_topic(query: str) -> str:
    pipeline = create_pipeline(query)
    result = pipeline.run({})
    return result["generator"]["replies"][0].text

query = "latest developments in AI ethics"
analysis = research_topic(query)
print("Analysis Result:")
print(analysis)

You can customize the pipeline further by:

  • Adding more sophisticated routing logic
  • Implementing additional preprocessing steps
  • Creating specialized generators for different content types
  • Adding error handling and retries
  • Implementing caching for improved performance
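As a rough sketch of the last two ideas, retries and caching can be layered around any function that maps a query to an analysis string, such as research_topic. Everything below is illustrative: `with_retries` is a hypothetical helper, and `flaky_research` is a stand-in that simulates one transient failure so the example runs without calling Apify or OpenAI.

```python
import time
from functools import lru_cache
from typing import Callable


def with_retries(
    func: Callable[[str], str], attempts: int = 3, base_delay: float = 1.0
) -> Callable[[str], str]:
    """Hypothetical helper: retry func with exponential backoff between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapper(query: str) -> str:
        for attempt in range(1, attempts + 1):
            try:
                return func(query)
            except Exception:
                if attempt == attempts:
                    raise  # out of attempts: surface the last error
                time.sleep(base_delay * 2 ** (attempt - 1))
        raise RuntimeError("unreachable")

    return wrapper


calls = {"count": 0}


def flaky_research(query: str) -> str:
    """Stand-in for research_topic: fails on the first call, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("temporary failure")
    return f"analysis of {query!r}"


# lru_cache stores successful results only, so a failed call is retried next time.
cached_research = lru_cache(maxsize=32)(with_retries(flaky_research, base_delay=0.01))

first = cached_research("latest developments in AI ethics")
second = cached_research("latest developments in AI ethics")  # served from the cache
print(first)
print("underlying calls:", calls["count"])
```

An in-memory cache like this returns the same analysis for a repeated query until the process restarts, which may not be what you want for fast-changing topics. A time-based expiry would be a natural refinement.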

This completes our exploration of using Apify’s RAG Web Browser with Haystack for web-aware AI applications. The combination of web search capabilities with sophisticated content processing and analysis creates powerful possibilities for research, analysis and many other tasks.