Maintained by deepset

Integration: Amazon Textract

Use Amazon Textract with Haystack to extract text, tables, forms, and answers to queries from documents

Authors

deepset

Overview

AmazonTextractConverter integrates Amazon Textract with Haystack.

The component calls Amazon Textract’s synchronous API to run OCR on images and single-page PDFs and returns the results as Haystack Document objects. It supports plain text extraction, structural analysis of tables and forms, and natural-language queries on documents.

Supported file formats: JPEG, PNG, TIFF, BMP, and single-page PDF (up to 10 MB).
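A pre-flight check can catch unsupported files before they are sent to AWS. The following is a minimal sketch built only from the formats and size limit listed above; check_source is a hypothetical helper, not part of the integration, and it assumes that the 10 MB limit applies to every format and means 10 * 1024 * 1024 bytes. It does not verify that a PDF has a single page.

```python
from pathlib import Path

# Extensions derived from the supported formats listed above.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp", ".pdf"}
# Assumption: "10 MB" is read as 10 MiB and applied to all formats.
MAX_BYTES = 10 * 1024 * 1024


def check_source(path):
    """Return a list of problems found with a file; empty if none were found."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        problems.append(f"unsupported extension: {p.suffix or '(none)'}")
    if not p.is_file():
        problems.append("file not found")
    elif p.stat().st_size > MAX_BYTES:
        problems.append(f"file is larger than {MAX_BYTES} bytes")
    return problems
```

Calling check_source("document.png") before converter.run lets you skip or report bad inputs without spending an API call on them.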

Key features:

- Plain text extraction with DetectDocumentText
- Table, form, signature, and layout detection with AnalyzeDocument
- Natural-language queries to extract specific answers from documents
- Access to the raw Textract response for downstream processing

Installation

Install the Amazon Textract integration:

pip install amazon-textract-haystack

Usage

The component uses the standard boto3 credential chain. You can set AWS credentials (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION) as environment variables, configure them via ~/.aws/credentials and ~/.aws/config, rely on an IAM role when running on AWS infrastructure, or pass them explicitly as Secret arguments.
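If you choose the environment-variable route, a quick check can show which of the three variables named above are still unset. This is an illustrative sketch; missing_aws_env_vars is a hypothetical helper, and a non-empty result is not necessarily an error, because the credential chain can also draw on ~/.aws files or an IAM role.

```python
import os

# Variable names taken from the paragraph above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_aws_env_vars(env=None):
    """Return the names of the AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Running print(missing_aws_env_vars()) in the same shell that will launch your pipeline shows what the process will actually see.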

The Textract API is selected automatically based on how you configure the component: DetectDocumentText is used by default for plain text extraction, while AnalyzeDocument is used whenever you set feature_types or pass queries at runtime.
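The selection rule can be restated as a small function. This sketch only mirrors the behavior described above; select_textract_api is a hypothetical name, and treating an empty list like an unset value is an assumption rather than documented behavior of the component.

```python
def select_textract_api(feature_types=None, queries=None):
    """Return the name of the Textract API implied by the given settings."""
    # Any feature type or any query switches to structural analysis.
    # Assumption: empty lists count as "not set".
    if feature_types or queries:
        return "AnalyzeDocument"
    # Otherwise plain text extraction is enough.
    return "DetectDocumentText"
```

The distinction is worth keeping in mind when estimating cost and latency, since the two Textract operations are separate API calls.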

Basic text extraction

Extract plain text from a document with the default configuration, which calls DetectDocumentText:

from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter()
results = converter.run(sources=["document.png"])
documents = results["documents"]

print(documents[0].content)

Table and form analysis

Use AnalyzeDocument to detect tables and forms by setting feature_types:

from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(sources=["invoice.png"])

documents = results["documents"]
# Unprocessed Textract output, kept for downstream processing
raw_responses = results["raw_textract_response"]

Valid feature_types values: "TABLES", "FORMS", "SIGNATURES", "LAYOUT".
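When TABLES is enabled, the raw response can be used to rebuild each table as rows and columns. The sketch below relies on the Textract block structure (TABLE blocks point to CELL blocks, which point to WORD blocks, with 1-based RowIndex and ColumnIndex); tables_from_blocks is a hypothetical helper, and it assumes you pass it a single Textract response dictionary with a "Blocks" list. How raw_textract_response groups responses per source is not stated here, so inspect it before indexing into it.

```python
def tables_from_blocks(response):
    """Rebuild every TABLE block of one Textract response as a list of rows."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}

    def child_ids(block):
        # Collect the Ids of all CHILD relationships of a block.
        return [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for block_id in rel["Ids"]
        ]

    def cell_text(cell):
        # Join the WORD children of a cell; other child types are ignored.
        children = [blocks[i] for i in child_ids(cell) if i in blocks]
        return " ".join(c["Text"] for c in children if c["BlockType"] == "WORD")

    tables = []
    for block in blocks.values():
        if block["BlockType"] != "TABLE":
            continue
        cells = [
            blocks[i]
            for i in child_ids(block)
            if i in blocks and blocks[i]["BlockType"] == "CELL"
        ]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            # Textract indices start at 1.
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
        tables.append(grid)
    return tables
```

Each returned table is a list of rows, which can be handed to csv.writer or a DataFrame constructor. Merged cells are not handled specially in this sketch.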

Natural-language queries

Ask questions about a document and get extracted answers. The QUERIES feature type is enabled automatically when you pass the queries parameter at runtime:

from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter()
results = converter.run(
    sources=["medical_form.png"],
    queries=["What is the patient name?", "What is the date of birth?"],
)

documents = results["documents"]
raw_responses = results["raw_textract_response"]
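If you need each answer paired with its question, one option is to read the pairs from the raw response. The sketch below follows the Textract block structure, in which a QUERY block links to its QUERY_RESULT blocks through an ANSWER relationship; query_answers is a hypothetical helper, and it assumes a single Textract response dictionary with a "Blocks" list as input.

```python
def query_answers(response):
    """Map each query text to the list of answer texts linked to it."""
    blocks = {b["Id"]: b for b in response.get("Blocks", [])}
    answers = {}
    for block in blocks.values():
        if block["BlockType"] != "QUERY":
            continue
        # ANSWER relationships point at QUERY_RESULT blocks.
        result_ids = [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "ANSWER"
            for block_id in rel["Ids"]
        ]
        question = block["Query"]["Text"]
        # A query without an answer maps to an empty list.
        answers[question] = [blocks[i]["Text"] for i in result_ids if i in blocks]
    return answers
```

Returning a list per question keeps unanswered queries visible instead of silently dropping them.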

Queries can be combined with feature_types for both structural and question-based extraction:

from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter(feature_types=["TABLES", "FORMS"])
results = converter.run(
    sources=["invoice.png"],
    queries=["What is the total amount due?"],
)

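Explicit credentials

Pass the credentials and region explicitly as Secret arguments, for example to read them from custom environment variables: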

from haystack.utils import Secret
from haystack_integrations.components.converters.amazon_textract import AmazonTextractConverter

converter = AmazonTextractConverter(
    aws_access_key_id=Secret.from_env_var("MY_AWS_KEY"),
    aws_secret_access_key=Secret.from_env_var("MY_AWS_SECRET"),
    aws_region_name=Secret.from_token("us-east-1"),
)

For more details on Amazon Textract capabilities and setup, refer to the Amazon Textract documentation.

License

amazon-textract-haystack is distributed under the terms of the Apache-2.0 license.