RAG: Extract and use website content for question answering with Apify-Haystack integration


Author: Jiri Spilka (Apify)

In this tutorial, we’ll use the apify-haystack integration to call the Website Content Crawler, which crawls the Haystack website and scrapes its text content. Then, we’ll use the OpenAIDocumentEmbedder to compute text embeddings and the InMemoryDocumentStore to store documents in a temporary in-memory database. The last step will be a retrieval-augmented generation (RAG) pipeline that answers users’ questions from the scraped data.

Install dependencies

!pip install apify-haystack haystack-ai

Set up the API keys

You need to have an Apify account and obtain your APIFY_API_TOKEN.

You also need an OpenAI account and an OPENAI_API_KEY.

import os
from getpass import getpass

os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")

Use the Website Content Crawler to scrape data from the Haystack documentation

Now, let’s call the Website Content Crawler using the Haystack component ApifyDatasetFromActorCall. First, we need to define the parameters for the Website Content Crawler, and then specify which data to save to the vector database.

The actor_id and a detailed description of the input parameters (the run_input variable) can be found on the Website Content Crawler input page.

For this example, we will define startUrls and limit the number of crawled pages to five.

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 5,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}

Next, we need to define a dataset mapping function. For that, we need to know the output format of the Website Content Crawler. Typically, it is a JSON array of objects that look like this (truncated for brevity):

[
  {
    "url": "https://haystack.deepset.ai/",
    "text": "Haystack | Haystack - Multimodal - AI - Architect a next generation AI app around all modalities, not just text ..."
  },
  {
    "url": "https://haystack.deepset.ai/tutorials/24_building_chat_app",
    "text": "Building a Conversational Chat App ... "
  }
]
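
If you want to inspect the raw dataset yourself, here is a minimal sketch using the apify-client package (a dependency of apify-haystack). Note that this starts its own Actor run, separate from the one the Haystack loader will trigger below:

from apify_client import ApifyClient

# Call the Actor directly and iterate over the resulting dataset items.
# Reuses the run_input defined above; the crawl takes a minute or two.
client = ApifyClient(os.environ["APIFY_API_TOKEN"])
run = client.actor("apify/website-content-crawler").call(run_input=run_input)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item.get("url"))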

We will convert each JSON item to a Haystack Document using the dataset_mapping_function as follows:

from haystack import Document

def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})
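
As a quick sanity check, you can try the mapping function on a hand-written item shaped like the crawler output above (the sample values are made up):

# Test the mapping function on a single sample item before running the crawl
sample_item = {"url": "https://haystack.deepset.ai/", "text": "Haystack | Multimodal AI ..."}
doc = dataset_mapping_function(sample_item)
print(doc.content)      # Haystack | Multimodal AI ...
print(doc.meta["url"])  # https://haystack.deepset.ai/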

And the definition of the ApifyDatasetFromActorCall:

from apify_haystack import ApifyDatasetFromActorCall

apify_dataset_loader = ApifyDatasetFromActorCall(
    actor_id=actor_id,
    run_input=run_input,
    dataset_mapping_function=dataset_mapping_function,
)

Before actually running the Website Content Crawler, we need to define an embedding function and a document store:

from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
docs_embedder = OpenAIDocumentEmbedder()
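
Both OpenAIDocumentEmbedder (used here for documents) and the OpenAITextEmbedder we define later for queries accept a model parameter. If you pin one to a specific model, pin the other to the same one; otherwise, the query and document embeddings will come from different vector spaces. A sketch, with text-embedding-3-small as an example model:

# Optional: pin the embedding model explicitly. The query embedder defined
# later (OpenAITextEmbedder) must then use the same model.
docs_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")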

After that, we can call the Website Content Crawler and print the scraped data:

# Crawl the website and store documents in the document_store
# Crawling will take some time (1-2 minutes); you can monitor progress at https://console.apify.com/actors/runs

docs = apify_dataset_loader.run()
print(docs)

Compute the embeddings and store them in the database:

embeddings = docs_embedder.run(docs.get("documents"))
document_store.write_documents(embeddings["documents"])
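
As an optional sanity check, you can confirm that the documents were stored; count_documents() is part of the InMemoryDocumentStore API, and the embedder’s output includes a meta key with token usage:

# Verify that the embedded documents landed in the store
print(f"Documents in store: {document_store.count_documents()}")
print(f"Embedder usage: {embeddings.get('meta')}")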

Retrieval and LLM generation pipeline

Once we have the crawled data in the database, we can set up a classic retrieval-augmented generation pipeline. Refer to the Haystack RAG tutorial for details.

from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.embedders import OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

text_embedder = OpenAITextEmbedder()
retriever = InMemoryEmbeddingRetriever(document_store)
generator = OpenAIGenerator(model="gpt-3.5-turbo")

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)

# Add components to your pipeline
print("Initializing pipeline...")
pipe = Pipeline()
pipe.add_component("embedder", text_embedder)
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("llm", generator)

# Now, connect the components to each other
pipe.connect("embedder.embedding", "retriever.query_embedding")
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")

Now you can ask questions about Haystack and get answers grounded in the scraped data:

question = "What is haystack?"

response = pipe.run({"embedder": {"text": question}, "prompt_builder": {"question": question}})

print(f"question: {question}")
print(f"answer: {response['llm']['replies'][0]}")