PDF-Based Question Answering with Amazon Bedrock and Haystack
Last Updated: July 8, 2025
Notebook by Bilge Yucel
Amazon Bedrock is a fully managed service that provides high-performing foundation models from leading AI startups and Amazon through a single API. You can choose from various foundation models to find the one best suited for your use case.
In this notebook, we’ll go through the process of creating a generative question answering application tailored for PDF files using the newly added Amazon Bedrock integration with Haystack and OpenSearch to store our documents efficiently. The demo will illustrate the step-by-step development of a QA application designed specifically for the Bedrock documentation, demonstrating the power of Bedrock in the process 🚀
Set Up the Development Environment
Install dependencies
%%bash
pip install -q opensearch-haystack amazon-bedrock-haystack pypdf
Download Files
For this application, we’ll use the user guide of Amazon Bedrock. Amazon Bedrock provides the PDF form of its guide. Let’s download it!
!wget "https://docs.aws.amazon.com/pdfs/bedrock/latest/userguide/bedrock-ug.pdf"
Note: As an alternative, you can run the code below to download the PDF to the
/content/bedrock-documentation.pdf
path 👇🏼
# import os
# import boto3
# from botocore import UNSIGNED
# from botocore.config import Config
# s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
# s3.download_file('core-engineering', 'public/blog-posts/bedrock-documentation.pdf', '/content/bedrock-documentation.pdf')
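Whichever download route you take, a quick sanity check that the file actually arrived as a PDF can save debugging time later. Here is a minimal sketch; the looks_like_pdf helper is our own, not part of Haystack, so adjust the path to wherever you saved the file:

```python
from pathlib import Path

def looks_like_pdf(path):
    """Cheap sanity check: a real PDF file starts with the '%PDF' magic bytes."""
    p = Path(path)
    return p.is_file() and p.read_bytes()[:5].startswith(b"%PDF")

# Example: looks_like_pdf("/content/bedrock-ug.pdf") should return True after the download.
```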
Initialize an OpenSearch Instance on Colab
OpenSearch is a fully open-source search and analytics engine. It is compatible with the Amazon OpenSearch Service, which is helpful if you’d like to deploy, operate, and scale your OpenSearch cluster later on.
Let’s install OpenSearch and start an instance on Colab. For other installation options, check out OpenSearch documentation.
!wget https://artifacts.opensearch.org/releases/bundle/opensearch/2.11.1/opensearch-2.11.1-linux-x64.tar.gz
!tar -xvf opensearch-2.11.1-linux-x64.tar.gz
!chown -R daemon:daemon opensearch-2.11.1
# Disable security for this demo. Be mindful before disabling security in production systems
!sudo echo 'plugins.security.disabled: true' >> opensearch-2.11.1/config/opensearch.yml
%%bash --bg
cd opensearch-2.11.1 && sudo -u daemon -- ./bin/opensearch
OpenSearch needs about 30 seconds for the server to fully start.
import time
time.sleep(30)
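Instead of sleeping for a fixed 30 seconds, you can poll the OpenSearch endpoint until it answers. This is a standard-library sketch; the wait_for_opensearch helper is hypothetical, and the default http://localhost:9200 assumes a local instance with security disabled as above:

```python
import time
import urllib.error
import urllib.request

def wait_for_opensearch(url="http://localhost:9200", timeout=120, interval=2):
    """Poll the OpenSearch root endpoint until it responds, or raise after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return
        except (urllib.error.URLError, ConnectionResetError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval)
    raise TimeoutError(f"OpenSearch at {url} did not become ready within {timeout}s")
```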
API Keys
To use Amazon Bedrock, you need an aws_access_key_id and an aws_secret_access_key, and you need to indicate the aws_region_name. Once logged into your AWS account, you can find these keys under the IAM user’s “Security Credentials” section. For detailed guidance, refer to the documentation on
Managing access keys for IAM users.
import os
from getpass import getpass

os.environ["AWS_ACCESS_KEY_ID"] = getpass("aws_access_key_id: ")
os.environ["AWS_SECRET_ACCESS_KEY"] = getpass("aws_secret_access_key: ")
os.environ["AWS_DEFAULT_REGION"] = input("aws_region_name: ")
aws_access_key_id: ··········
aws_secret_access_key: ··········
aws_region_name: us-east-1
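Before building any pipelines, it can help to verify that all three settings actually made it into the environment. The check_aws_env helper below is our own sketch, not part of any AWS SDK:

```python
import os

def check_aws_env():
    """Raise early if any of the AWS settings the Bedrock client reads are missing."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise EnvironmentError(f"Missing AWS settings: {', '.join(missing)}")
```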
Building the Indexing Pipeline
Our indexing pipeline will convert the PDF file into a Haystack Document using PyPDFToDocument and preprocess it by cleaning and splitting it into chunks before storing them in OpenSearchDocumentStore.
Let’s run the pipeline below and index our file to our document store:
from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.document_stores.opensearch import OpenSearchDocumentStore
## Initialize the OpenSearchDocumentStore
document_store = OpenSearchDocumentStore()
## Create pipeline components
converter = PyPDFToDocument()
cleaner = DocumentCleaner()
splitter = DocumentSplitter(split_by="sentence", split_length=10, split_overlap=2)
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP)
## Add components to the pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)
## Connect the components to each other
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "writer")
Run the pipeline with the PDF. This might take ~4 minutes if you’re running this notebook on CPU.
indexing_pipeline.run({"converter": {"sources": [Path("/content/bedrock-ug.pdf")]}})
{'writer': {'documents_written': 1060}}
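To build intuition for the splitter settings above (split_by="sentence", split_length=10, split_overlap=2), here is a plain-Python sketch of the windowing they describe. The real DocumentSplitter also handles sentence detection and document metadata; this only illustrates the overlap logic:

```python
def window_sentences(sentences, split_length=10, split_overlap=2):
    """Group sentences into overlapping chunks: each window holds `split_length`
    sentences and repeats the last `split_overlap` sentences of the previous one."""
    chunks = []
    step = split_length - split_overlap
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + split_length])
        if start + split_length >= len(sentences):
            break
    return chunks

# 20 numbered "sentences" produce 3 chunks; consecutive chunks share 2 sentences.
chunks = window_sentences([f"s{i}" for i in range(20)])
```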
Building the Query Pipeline
Let’s create another pipeline to query our application. In this pipeline, we’ll use
OpenSearchBM25Retriever to retrieve relevant information from the OpenSearchDocumentStore and an Amazon Nova model, amazon.nova-lite-v1:0, to generate answers with
AmazonBedrockChatGenerator. You can select and test different Bedrock models by changing the model ID below.
Next, we’ll create a prompt for our task using the Retrieval-Augmented Generation (RAG) approach with ChatPromptBuilder. This prompt will help generate answers by considering the provided context. Finally, we’ll connect these three components to complete the pipeline.
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack import Pipeline
from haystack_integrations.components.generators.amazon_bedrock import AmazonBedrockChatGenerator
from haystack_integrations.components.retrievers.opensearch import OpenSearchBM25Retriever
## Create pipeline components
retriever = OpenSearchBM25Retriever(document_store=document_store, top_k=15)
## Initialize the AmazonBedrockChatGenerator with an Amazon Bedrock model
bedrock_model = 'amazon.nova-lite-v1:0'
generator = AmazonBedrockChatGenerator(model=bedrock_model)
template = """
{% for document in documents %}
{{ document.content }}
{% endfor %}
Please answer the question based on the given information from Amazon Bedrock documentation.
{{query}}
"""
prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(template)], required_variables="*")
## Add components to the pipeline
rag_pipeline = Pipeline()
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
## Connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
<haystack.core.pipeline.pipeline.Pipeline object at 0x7d1aa6550150>
🚅 Components
- retriever: OpenSearchBM25Retriever
- prompt_builder: ChatPromptBuilder
- llm: AmazonBedrockChatGenerator
🛤️ Connections
- retriever.documents -> prompt_builder.documents (List[Document])
- prompt_builder.prompt -> llm.messages (List[ChatMessage])
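To see roughly what the ChatPromptBuilder feeds the model, here is a plain-Python stand-in for the Jinja template above; the build_prompt helper and the document contents are made up for illustration:

```python
def build_prompt(document_contents, query):
    """Expands the RAG template by hand: retrieved passages first,
    then the instruction and the user's question."""
    context = "".join(f"{content}\n" for content in document_contents)
    return (
        f"{context}"
        "Please answer the question based on the given information "
        "from Amazon Bedrock documentation.\n"
        f"{query}"
    )

prompt = build_prompt(
    ["Amazon Bedrock is a fully managed service."],
    "What is Amazon Bedrock?",
)
```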
Ask your question and learn about the Amazon Bedrock service using Amazon Bedrock models!
question = "What is Amazon Bedrock?"
response = rag_pipeline.run({"query": question})
print(response["llm"]["replies"][0].text)
Amazon Bedrock is a fully managed service that makes high-performing foundation models (FMs) from leading AI startups and Amazon available for use through a unified API. Key capabilities include:
- Easily experiment with and evaluate top foundation models for various use cases. Models are available from providers like AI21 Labs, Anthropic, Cohere, Meta, and Stability AI.
- Privately customize models with your own data using techniques like fine-tuning and retrieval augmented generation (RAG).
- Build agents that execute tasks using enterprise systems and data sources.
- Serverless experience so you can get started quickly without managing infrastructure.
- Integrate customized models into applications using AWS tools.
So in summary, Amazon Bedrock provides easy access to top AI models that you can customize and integrate into apps to build intelligent solutions. It's a fully managed service focused on generative AI.
Other Queries
You can also try these queries:
- How can I setup Amazon Bedrock?
- How can I finetune foundation models?
- How should I form my prompts for Amazon Titan models?
- How should I form my prompts for Claude models?
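If you’d like to try all of these in one go, a small loop over the pipeline works. This sketch assumes the rag_pipeline object built above and the same response shape as the single-question run:

```python
def ask_all(pipeline, questions):
    """Run each question through the RAG pipeline and collect the first reply's text."""
    answers = {}
    for question in questions:
        response = pipeline.run({"query": question})
        answers[question] = response["llm"]["replies"][0].text
    return answers

# Example usage with the pipeline built above:
# answers = ask_all(rag_pipeline, [
#     "How can I setup Amazon Bedrock?",
#     "How can I finetune foundation models?",
# ])
```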