Integration: PaddleOCR
Use PaddleOCR’s text-recognition and document-parsing capabilities with Haystack
Overview
PaddleOCR converts documents and images into structured, AI-friendly data (such as JSON and Markdown) with industry-leading accuracy. It powers AI applications for users ranging from indie developers and startups to large enterprises worldwide.
This integration allows you to use PaddleOCR’s text-recognition and document-parsing capabilities with Haystack.
Components
- PaddleOCRVLDocumentConverter: extracts text from documents using PaddleOCR’s large-model document parsing API.
Initialization
Every component of the PaddleOCR integration requires an access token from PaddlePaddle AI Studio. By default, authentication uses the AISTUDIO_ACCESS_TOKEN environment variable. You can also provide an access_token when initializing each component. The AI Studio access token can be obtained from this page.
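Because the default authentication path depends on an environment variable, it can help to fail early with a readable message when that variable is missing. The following is a minimal sketch using only the Python standard library; `require_access_token` is a hypothetical helper written for illustration and is not part of paddleocr-haystack.

```python
import os

# Default variable name taken from the paragraph above.
TOKEN_ENV_VAR = "AISTUDIO_ACCESS_TOKEN"


def require_access_token(env_var: str = TOKEN_ENV_VAR) -> str:
    """Return the token stored in `env_var`, or raise a readable error.

    Hypothetical helper: it only inspects the environment and does not
    contact AI Studio or validate the token itself.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set. Export your AI Studio access token first."
        )
    return token
```

Calling `require_access_token()` at the start of a script surfaces a missing token before any component is constructed.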
Installation
pip install paddleocr-haystack
Usage
How to use the PaddleOCRVLDocumentConverter
To start, visit the PaddleOCR official website, click the API button in the upper-left corner, choose the example code for Large Model document parsing (PaddleOCR-VL), and copy the API_URL.
Basic usage with a local file:
from pathlib import Path

from haystack.utils import Secret
from haystack_integrations.components.converters.paddleocr import PaddleOCRVLDocumentConverter

# Replace the placeholder with the API_URL copied from the PaddleOCR website.
converter = PaddleOCRVLDocumentConverter(
    api_url="<your-api-url>",
    access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
)

# Convert a local file; the extracted documents are returned under "documents".
result = converter.run(sources=[Path("my_document.pdf")])
documents = result["documents"]
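To check what the converter returned, a short preview of each document is often enough. The sketch below assumes only that each returned object has a `content` attribute, as Haystack `Document` objects do. `preview_documents` is a hypothetical helper, and the sample uses a stand-in object so the snippet runs without calling the API.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def preview_documents(documents: Iterable[Any], max_chars: int = 80) -> List[str]:
    """Build one summary line per document: index, length, and a short snippet."""
    lines = []
    for index, doc in enumerate(documents):
        # Treat a missing or None content field as empty text.
        content = getattr(doc, "content", None) or ""
        # Collapse newlines and repeated spaces so the snippet fits on one line.
        snippet = " ".join(content.split())[:max_chars]
        lines.append(f"[{index}] {len(content)} chars: {snippet}")
    return lines


# Stand-in for a converted document (illustrative data, not real OCR output).
sample = [SimpleNamespace(content="Invoice  #123\nTotal: 42 EUR", meta={})]
for line in preview_documents(sample):
    print(line)
```

In real use, pass the `documents` list from the converter result instead of `sample`.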
Here’s an example of an indexing pipeline that processes PDFs with OCR and writes them to a Document Store:
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.utils import Secret
from haystack_integrations.components.converters.paddleocr import PaddleOCRVLDocumentConverter

document_store = InMemoryDocumentStore()

# Build the pipeline: OCR conversion -> cleaning -> page-level splitting -> writing.
pipeline = Pipeline()
pipeline.add_component(
    "converter",
    PaddleOCRVLDocumentConverter(
        api_url="<your-api-url>",
        access_token=Secret.from_env_var("AISTUDIO_ACCESS_TOKEN"),
    ),
)
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="page", split_length=1))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Run the pipeline on a batch of files.
file_paths = ["invoice.pdf", "receipt.pdf", "contract.pdf"]
pipeline.run({"converter": {"sources": file_paths}})
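The example above hardcodes three file names. For a folder of files, the list of sources can be built with the standard library instead. This is a minimal sketch: `collect_sources` is a hypothetical helper, and the default `*.pdf` pattern is an assumption chosen to match the PDF example rather than a list of formats the converter is documented to accept.

```python
from pathlib import Path
from typing import List, Sequence


def collect_sources(directory: Path, patterns: Sequence[str] = ("*.pdf",)) -> List[Path]:
    """Return the files in `directory` matching any pattern, sorted and de-duplicated.

    Hypothetical helper: it searches only the top level of `directory`
    and skips anything that is not a regular file.
    """
    found = set()
    for pattern in patterns:
        found.update(path for path in directory.glob(pattern) if path.is_file())
    return sorted(found)
```

The result can be passed as the `sources` value in the `pipeline.run` call, for example `collect_sources(Path("scans"))`, where the `scans` directory name is illustrative.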
License
paddleocr-haystack is distributed under the terms of the Apache-2.0 license.
