🆕 Faster agents with parallel tool execution and guardrails & moderation for safer apps. See what's new in Haystack 2.15 🌟

PDF-Based Question Answering with Amazon Bedrock and Haystack


Notebook by Bilge Yucel

Amazon Bedrock is a fully managed service that provides high-performing foundation models from leading AI startups and Amazon through a single API. You can choose from various foundation models to find the one best suited for your use case.

In this notebook, we’ll go through the process of creating a generative question answering application tailored for PDF files using the newly added Amazon Bedrock integration with Haystack and OpenSearch to store our documents efficiently. The demo will illustrate the step-by-step development of a QA application designed specifically for the Bedrock documentation, demonstrating the power of Bedrock in the process 🚀

Setup the Development Environment

Install dependencies

%%bash

pip install -q opensearch-haystack amazon-bedrock-haystack pypdf

Download Files

For this application, we’ll use the user guide of Amazon Bedrock. Amazon Bedrock provides the PDF form of its guide. Let’s download it!

!wget "https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf"

Note: You can code to download the PDF to /content/bedrock-documentation.pdf directory as an alternative👇🏼

# import os

# import boto3
# from botocore import UNSIGNED
# from botocore.config import Config

# s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
# s3.download_file('core-engineering', 'public/blog-posts/bedrock-documentation.pdf', '/content/bedrock-documentation.pdf')

Initialize an OpenSearch Instance on Colab

OpenSearch is a fully open source search and analytics engine and is compatible with the Amazon OpenSearch Service that’s helpful if you’d like to deploy, operate, and scale your OpenSearch cluster later on.

Let’s install OpenSearch and start an instance on Colab. For other installation options, check out OpenSearch documentation.

!wget https://artifacts.opensearch.org/releases/bundle/opensearch/2.11.1/opensearch-2.11.1-linux-x64.tar.gz
!tar -xvf opensearch-2.11.1-linux-x64.tar.gz
!chown -R daemon:daemon opensearch-2.11.1
# disabling security. Be mindful when you want to disable security in production systems
!sudo echo 'plugins.security.disabled: true' >> opensearch-2.11.1/config/opensearch.yml
%%bash --bg
cd opensearch-2.11.1 && sudo -u daemon -- ./bin/opensearch

OpenSearch needs 30 seconds for a fully started server

import time

time.sleep(30)

API Keys

To use Amazon Bedrock, you need aws_access_key_id, aws_secret_access_key, and indicate the aws_region_name. Once logged into your account, locate these keys under the IAM user’s “Security Credentials” section. For detailed guidance, refer to the documentation on Managing access keys for IAM users.

from getpass import getpass

os.environ["AWS_ACCESS_KEY_ID"] = getpass("aws_access_key_id: ")
os.environ["AWS_SECRET_ACCESS_KEY"] = getpass("aws_secret_access_key: ")
os.environ["AWS_DEFAULT_REGION"] = input("aws_region_name: ")
aws_access_key_id: ··········
aws_secret_access_key: ··········
aws_region_name: us-east-1

Building the Indexing Pipeline

Our indexing pipeline will convert the PDF file into a Haystack Document using PyPDFToDocument and preprocess it by cleaning and splitting it into chunks before storing them in OpenSearchDocumentStore.

Let’s run the pipeline below and index our file to our document store:

from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

## Initialize the OpenSearchDocumentStore
document_store = OpenSearchDocumentStore()

## Create pipeline components
converter = PyPDFToDocument()
cleaner = DocumentCleaner()
splitter = DocumentSplitter(split_by="sentence", split_length=10, split_overlap=2)
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

## Add components to the pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

## Connect the components to each other
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "writer")

Run the pipeline with the pdf. This might take ~4mins if you’re running this notebook on CPU.

indexing_pipeline.run({"converter": {"sources": [Path("/content/bedrock-ug.pdf")]}})
{'writer': {'documents_written': 1060}}

Building the Query Pipeline

Let’s create another pipeline to query our application. In this pipeline, we’ll use OpenSearchBM25Retriever to retrieve relevant information from the OpenSearchDocumentStore and an Amazon Nova model amazon.nova-pro-v1:0 to generate answers with AmazonChatBedrockGenerator. You can select and test different models using the dropdown on right.

Next, we’ll create a prompt for our task using the Retrieval-Augmented Generation (RAG) approach with ChatPromptBuilder. This prompt will help generate answers by considering the provided context. Finally, we’ll connect these three components to complete the pipeline.

from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack import Pipeline
from haystack_integrations.components.generators.amazon_bedrock import AmazonBedrockChatGenerator
from haystack_integrations.components.retrievers.opensearch import OpenSearchBM25Retriever

## Create pipeline components
retriever = OpenSearchBM25Retriever(document_store=document_store, top_k=15)

## Initialize the AmazonBedrockGenerator with an Amazon Bedrock model
bedrock_model = 'amazon.nova-lite-v1:0'
generator = AmazonBedrockChatGenerator(model=bedrock_model)
template = """
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Please answer the question based on the given information from Amazon Bedrock documentation.

{{query}}
"""
prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(template)], required_variables="*")

## Add components to the pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)

## Connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7d1aa6550150>
🚅 Components
  - retriever: OpenSearchBM25Retriever
  - prompt_builder: ChatPromptBuilder
  - llm: AmazonBedrockChatGenerator
🛤️ Connections
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])

Ask your question and learn about the Amazon Bedrock service using Amazon Bedrock models!

question = "What is Amazon Bedrock?"
response = rag_pipeline.run({"query": question})

print(response["llm"]["replies"][0].text)
Amazon Bedrock is a fully managed service that makes high-performing foundation models (FMs) from leading AI startups and Amazon available for use through a unified API. Key capabilities include:

- Easily experiment with and evaluate top foundation models for various use cases. Models are available from providers like AI21 Labs, Anthropic, Cohere, Meta, and Stability AI.

- Privately customize models with your own data using techniques like fine-tuning and retrieval augmented generation (RAG). 

- Build agents that execute tasks using enterprise systems and data sources.

- Serverless experience so you can get started quickly without managing infrastructure.

- Integrate customized models into applications using AWS tools.

So in summary, Amazon Bedrock provides easy access to top AI models that you can customize and integrate into apps to build intelligent solutions. It's a fully managed service focused on generative AI.

Other Queries

You can also try these queries:

  • How can I setup Amazon Bedrock?
  • How can I finetune foundation models?
  • How should I form my prompts for Amazon Titan models?
  • How should I form my prompts for Claude models?