Integration: Docling
Use Docling to locally parse and chunk PDF, DOCX, and other document types in Haystack
Overview
Docling locally parses PDF, DOCX, HTML, and other document formats into a rich, standardized representation (including layout, tables, etc.), which it can then export to Markdown, JSON, and other formats.
Check out the Docling docs for more details.
This integration introduces Docling support, enabling Haystack users to:
- use various document types in LLM applications with ease and speed, and
- leverage Docling’s rich format for advanced, document-native grounding.
Installation
pip install docling-haystack
Usage
Components
This integration introduces DoclingConverter, a component that reads document file paths (local or URL) and outputs Haystack Document objects.
DoclingConverter supports two export modes; see the export_type initialization argument further below.
Use Docling Converter
Docling Converter Initialization
DoclingConverter creation can be parametrized via the following __init__() arguments, most of which refer to the initialization and usage of the underlying Docling DocumentConverter and chunker instances:
- converter: The Docling DocumentConverter to use; if not set, a system default is used.
- convert_kwargs: Any parameters to pass to Docling conversion; if not set, a system default is used.
- export_type: The export mode to use: ExportType.DOC_CHUNKS (default) chunks each input document (see chunker) and captures each individual chunk as a separate Haystack Document, while ExportType.MARKDOWN captures each input document as a separate Haystack Document (in which case splitting is likely required downstream).
- md_export_kwargs: Any parameters to pass to Markdown export (in case of ExportType.MARKDOWN).
- chunker: The Docling chunker instance to use; if not set, a system default is used (in case of ExportType.DOC_CHUNKS).
- meta_extractor: The extractor instance to use for populating the output document metadata; if not set, a system default is used.
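To make the two export modes more concrete, here is a minimal sketch of the Markdown mode followed by a downstream split. The helper names convert_to_markdown and split_markdown_by_heading are hypothetical, and the heading-based splitter is only a naive stand-in for whatever splitter you use downstream; it is not part of Docling or Haystack. The docling_haystack import sits inside the function so that the pure-Python part can be tried without the package installed.

```python
import re


def convert_to_markdown(paths):
    """Convert each input file into one Markdown-formatted Haystack Document."""
    # Deferred import: requires `pip install docling-haystack`.
    from docling_haystack.converter import DoclingConverter, ExportType

    converter = DoclingConverter(export_type=ExportType.MARKDOWN)
    return converter.run(paths=paths)["documents"]


def split_markdown_by_heading(markdown):
    """Naive illustrative splitter: start a new section at every Markdown heading."""
    # Split just before each line that starts with 1-6 '#' characters and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


# Stand-in text; in practice this would be `document.content` of a converted file.
sample = "# Title\nIntro text.\n\n## Abstract\nThis report introduces Docling.\n"
sections = split_markdown_by_heading(sample)
print(len(sections))
```

With the default ExportType.DOC_CHUNKS, this extra splitting step is typically unnecessary, since each chunk already arrives as its own Document.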
Standalone
from docling_haystack.converter import DoclingConverter
converter = DoclingConverter()
documents = converter.run(paths=["https://arxiv.org/pdf/2408.09869"])["documents"]
print(repr(documents[2].content))
# -> Abstract\nThis technical report introduces Docling [...]
In a Pipeline
Check out this notebook illustrating usage in a complete example with indexing and RAG pipelines.
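For orientation, a minimal indexing pipeline might look like the sketch below. It is an assumption-laden illustration rather than the notebook's code: the function name build_indexing_pipeline and the component names "converter" and "writer" are made up here, and an in-memory document store is chosen only to keep the example small. The imports are placed inside the function because they need docling-haystack and haystack-ai to be installed.

```python
def build_indexing_pipeline():
    """Assemble a converter -> writer indexing pipeline (illustrative sketch)."""
    # Deferred imports: these require docling-haystack and haystack-ai.
    from docling_haystack.converter import DoclingConverter
    from haystack import Pipeline
    from haystack.components.writers import DocumentWriter
    from haystack.document_stores.in_memory import InMemoryDocumentStore

    document_store = InMemoryDocumentStore()

    pipeline = Pipeline()
    # Default export mode (DOC_CHUNKS): every chunk becomes its own Document.
    pipeline.add_component("converter", DoclingConverter())
    pipeline.add_component("writer", DocumentWriter(document_store=document_store))
    # Feed the converter's documents into the writer.
    pipeline.connect("converter.documents", "writer.documents")
    return pipeline, document_store


# Usage (fetches and parses the PDF, so it may take a while):
# pipeline, store = build_indexing_pipeline()
# pipeline.run({"converter": {"paths": ["https://arxiv.org/pdf/2408.09869"]}})
# print(store.count_documents())
```

A retrieval-augmented setup would usually add an embedder between the converter and the writer; that step is left out here to keep the sketch short.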
License
MIT License.