PDF-Based Question Answering with Amazon Bedrock and Haystack

_{Last Updated:
February 22, 2024}

Amazon Bedrock is a fully managed service that provides high-performing foundation models from leading AI startups and Amazon through a single API. You can choose from various foundation models to find the one best suited for your use case.

In this notebook, we’ll go through the process of creating a generative question answering application tailored for PDF files using the newly added Amazon Bedrock integration with Haystack and OpenSearch to store our documents efficiently. The demo will illustrate the step-by-step development of a QA application designed specifically for the Bedrock documentation, demonstrating the power of Bedrock in the process 🚀

Setup the Development Environment

Install dependencies

%%bash

pip install opensearch-haystack amazon-bedrock-haystack pypdf

Download Files

For this application, we’ll use the user guide of Amazon Bedrock. Amazon Bedrock provides the PDF form of its guide. Run the code to download the PDF to /content/bedrock-documentation.pdf directory 👇🏼

import os

import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
s3.download_file('core-engineering', 'public/blog-posts/bedrock-documentation.pdf', '/content/bedrock-documentation.pdf')

Initialize an OpenSearch Instance on Colab

OpenSearch is a fully open source search and analytics engine and is compatible with the Amazon OpenSearch Service that’s helpful if you’d like to deploy, operate, and scale your OpenSearch cluster later on.

Let’s install OpenSearch and start an instance on Colab. For other installation options, check out OpenSearch documentation.

!wget https://artifacts.opensearch.org/releases/bundle/opensearch/2.11.1/opensearch-2.11.1-linux-x64.tar.gz
!tar -xvf opensearch-2.11.1-linux-x64.tar.gz
!chown -R daemon:daemon opensearch-2.11.1
# disabling security. Be mindful when you want to disable security in production systems
!sudo echo 'plugins.security.disabled: true' >> opensearch-2.11.1/config/opensearch.yml

%%bash --bg
cd opensearch-2.11.1 && sudo -u daemon -- ./bin/opensearch

OpenSearch needs 30 seconds for a fully started server

import time

time.sleep(30)

API Keys

To use Amazon Bedrock, you need aws_access_key_id, aws_secret_access_key, and indicate the aws_region_name. Once logged into your account, locate these keys under the IAM user’s “Security Credentials” section. For detailed guidance, refer to the documentation on Managing access keys for IAM users.

from getpass import getpass

os.environ["AWS_ACCESS_KEY_ID"] = getpass("aws_access_key_id: ")
os.environ["AWS_SECRET_ACCESS_KEY"] = getpass("aws_secret_access_key: ")
os.environ["AWS_DEFAULT_REGION"] = input("aws_region_name: ")

Building the Indexing Pipeline

Our indexing pipeline will convert the PDF file into a Haystack Document using PyPDFToDocument and preprocess it by cleaning and splitting it into chunks before storing them in OpenSearchDocumentStore.

Let’s run the pipeline below and index our file to our document store:

from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore

## Initialize the OpenSearchDocumentStore
document_store = OpenSearchDocumentStore()

## Create pipeline components
converter = PyPDFToDocument()
cleaner = DocumentCleaner()
splitter = DocumentSplitter(split_by="sentence", split_length=10, split_overlap=2)
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)

## Add components to the pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

## Connect the components to each other
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "writer")

Run the pipeline with the files you want to index:

indexing_pipeline.run({"converter": {"sources": [Path("/content/bedrock-documentation.pdf")]}})

Building the Query Pipeline

Let’s create another pipeline to query our application. In this pipeline, we’ll use OpenSearchBM25Retriever to retrieve relevant information from the OpenSearchDocumentStore and an Amazon Titan model amazon.titan-text-express-v1 to generate answers with AmazonBedrockGenerator. You can select and test different models using the dropdown on right.

Next, we’ll create a prompt for our task using the Retrieval-Augmented Generation (RAG) approach with PromptBuilder. This prompt will help generate answers by considering the provided context. Finally, we’ll connect these three components to complete the pipeline.

from haystack.components.builders import PromptBuilder
from haystack import Pipeline
from haystack_integrations.components.generators.amazon_bedrock import AmazonBedrockGenerator
from haystack_integrations.components.retrievers.opensearch import OpenSearchBM25Retriever

## Create pipeline components
retriever = OpenSearchBM25Retriever(document_store=document_store, top_k=15)

## Initialize the AmazonBedrockGenerator with an Amazon Bedrock model
bedrock_model = 'amazon.titan-text-express-v1' # @param ["amazon.titan-text-express-v1", "amazon.titan-text-lite-v1", "anthropic.claude-instant-v1", "anthropic.claude-v1", "anthropic.claude-v2","anthropic.claude-v2:1", "meta.llama2-13b-chat-v1", "meta.llama2-70b-chat-v1", "ai21.j2-mid-v1", "ai21.j2-ultra-v1"]
generator = AmazonBedrockGenerator(model=bedrock_model, max_length=500)
template = """
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Please answer the question based on the given information from Amazon Bedrock documentation.

{{question}}
"""
prompt_builder = PromptBuilder(template=template)

## Add components to the pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)

## Connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")

Ask your question and learn about the Amazon Bedrock service using Amazon Bedrock models!

question = "What is Amazon Bedrock?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

question = "How can I setup Amazon Bedrock?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

question = "How can I finetune foundation models?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

question = "How should I form my prompts for Amazon Titan models?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])

question = "How should I form my prompts for Claude models?"
response = rag_pipeline.run({"retriever": {"query": question}, "prompt_builder": {"question": question}})

print(response["llm"]["replies"][0])