Integration: Amazon Textract
Use Amazon Textract with Haystack to extract text, tables, forms, and answers to queries from documents
Overview
AmazonTextractConverter provides an integration of Amazon Textract with Haystack.
This component uses Amazon Textract’s synchronous API to convert images and single-page PDFs into Haystack Document objects using OCR. It supports plain text extraction, structural analysis for tables and forms, and natural-language queries on documents.
Supported file formats: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).
Key features:
- Plain text extraction with DetectDocumentText
- Table, form, signature, and layout detection with AnalyzeDocument
- Natural-language queries to extract specific answers from documents
- Access to the raw Textract response for downstream processing
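The format and size limits listed above can be checked locally before a file is sent to Textract. The following is a minimal sketch, not part of the integration: `check_source`, the extension spellings, and the reading of "10 MB" as 10 MiB are assumptions made for illustration, and the sketch does not verify that a PDF has a single page.

```python
from pathlib import Path

# Limits taken from the "Supported file formats" line above. The exact
# extension spellings are an assumption made for this sketch.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
MAX_BYTES = 10 * 1024 * 1024  # "10 MB", interpreted here as 10 MiB (assumption)


def check_source(path):
    """Return a list of problems found for one file; an empty list means OK."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {p.suffix or '(none)'}")
    if not p.is_file():
        problems.append("file not found")
    elif p.stat().st_size > MAX_BYTES:
        problems.append(f"file is {p.stat().st_size} bytes, over the 10 MB limit")
    return problems
```

Calling `check_source("document.png")` before `converter.run(...)` lets a batch job skip or report unsuitable files without spending an API call on them.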
Installation
Install the Amazon Textract integration:
pip install amazon-textract-haystack
Usage
The component uses the standard boto3 credential chain. You can set AWS credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION) as environment variables, configure them via ~/.aws/credentials and ~/.aws/config, rely on an IAM role when running on AWS infrastructure, or pass them explicitly as Secret arguments.
The Textract API is selected automatically based on how you configure the component: DetectDocumentText is used by default for plain text extraction, while AnalyzeDocument is used whenever you set feature_types or pass queries at runtime.
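The selection rule can be written out as a small function. This is only an illustration of the behavior described in this document, not the component's actual code. `select_textract_api` is a hypothetical name, and adding "QUERIES" to the feature list follows the description in the natural-language queries section.

```python
def select_textract_api(feature_types=None, queries=None):
    """Illustrate which Textract API the documented rule would pick.

    Returns a tuple of (API name, effective feature types).
    """
    features = list(feature_types or [])
    # Passing queries at runtime implies the QUERIES feature type.
    if queries and "QUERIES" not in features:
        features.append("QUERIES")
    if features:
        return "AnalyzeDocument", features
    # No features and no queries: plain text extraction.
    return "DetectDocumentText", []
```

With no arguments the function returns `("DetectDocumentText", [])`. Any feature type or query switches the result to AnalyzeDocument.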
Basic text extraction
Extract plain text from a document with the default configuration, which calls DetectDocumentText:
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
converter = AmazonTextractConverter()
results = converter.run(sources=["document.png"])
documents = results["documents"]
print(documents[0].content)
Table and form analysis
Use AnalyzeDocument to detect tables and forms by setting feature_types:
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(sources=["invoice.png"])
documents = results["documents"]
raw_responses = results["raw_textract_response"]
Valid feature_types values: "TABLES", "FORMS", "SIGNATURES", "LAYOUT".
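When TABLES is enabled, the raw response can be used to rebuild each table as rows of cell text. The sketch below assumes that each entry in raw_textract_response is the dictionary returned by Textract's AnalyzeDocument call, with a "Blocks" list in which TABLE blocks point to CELL blocks, and CELL blocks point to WORD blocks, through CHILD relationships. `tables_from_response` is a hypothetical helper, not part of the integration.

```python
def tables_from_response(response):
    """Rebuild each TABLE block as a list of rows, each a list of cell strings."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}

    def child_ids(block):
        # Collect the Ids of all CHILD relationships of a block.
        return [
            cid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
        ]

    tables = []
    for block in blocks.values():
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[w]["Text"]
                for w in child_ids(cell)
                if blocks[w]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in the Textract response.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables
```

If raw_responses is a list of such dictionaries, as assumed here, `tables_from_response(raw_responses[0])` would return the tables of the first source. Merged cells and selection elements are ignored in this sketch.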
Natural-language queries
Ask questions about a document and get extracted answers. The QUERIES feature type is enabled automatically when you pass the queries parameter at runtime:
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
converter = AmazonTextractConverter()
results = converter.run(
    sources=["medical_form.png"],
    queries=["What is the patient name?", "What is the date of birth?"],
)
documents = results["documents"]
raw_responses = results["raw_textract_response"]
Queries can be combined with feature_types for both structural and question-based extraction:
converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(
    sources=["invoice.png"],
    queries=["What is the total amount due?"],
)
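Query answers can also be read from the raw response by pairing each QUERY block with the QUERY_RESULT blocks it references. The sketch below assumes that each entry in raw_textract_response is the dictionary returned by Textract's AnalyzeDocument call, where a QUERY block carries the question under "Query" and links to its results through ANSWER relationships. `answers_from_response` is a hypothetical helper, not part of the integration.

```python
def answers_from_response(response):
    """Map each query text to a list of (answer text, confidence) pairs."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block["BlockType"] != "QUERY":
            continue
        found = []
        for rel in block.get("Relationships", []):
            if rel["Type"] != "ANSWER":
                continue
            for answer_id in rel["Ids"]:
                result = blocks.get(answer_id)
                if result and result["BlockType"] == "QUERY_RESULT":
                    found.append((result.get("Text", ""), result.get("Confidence")))
        # A query with no ANSWER relationship maps to an empty list.
        answers[block["Query"]["Text"]] = found
    return answers
```

Keeping the confidence next to each answer makes it straightforward to discard low-confidence results downstream. A query that Textract could not answer maps to an empty list.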
Explicit credentials
from haystack.utils import Secret
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter
converter = AmazonTextractConverter(
    aws_access_key_id=Secret.from_env_var("MY_AWS_KEY"),
    aws_secret_access_key=Secret.from_env_var("MY_AWS_SECRET"),
    aws_region_name=Secret.from_token("us-east-1"),
)
For more details on Amazon Textract capabilities and setup, refer to the Amazon Textract documentation.
License
amazon-textract-haystack is distributed under the terms of the Apache-2.0 license.
