Integration: Kreuzberg
Locally convert 91+ document formats into Haystack Documents using Kreuzberg's Rust-core engine
Table of Contents
Overview
Kreuzberg is a document intelligence framework with a Rust core that extracts text, tables, and metadata from 91+ file formats โ entirely locally with no external API calls.
This integration provides KreuzbergConverter, a Haystack component that converts files into Haystack Document objects with rich metadata. It supports parallel batch extraction using Rust’s rayon thread pool for high throughput.
Supported format categories:
- Documents: PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, ODT, ODS, ODP, RTF, Pages, Keynote, Numbers, and more
- Images (via OCR): PNG, JPEG, TIFF, GIF, BMP, WebP, JPEG 2000, SVG
- Text/Markup: Markdown, HTML, XML, LaTeX, Typst, JSON, YAML, reStructuredText, Jupyter notebooks
- Email: EML, MSG (with attachment extraction)
- Archives: ZIP, TAR, GZIP, 7Z (extracts and processes contents recursively)
- eBooks & Academic: EPUB, BibTeX, DocBook, JATS
Installation
pip install kreuzberg-haystack
Usage
Components
This integration introduces one component:
KreuzbergConverter: Converts files and directories into HaystackDocumentobjects. Accepts file paths, directory paths, andByteStreamobjects as input.
Basic Usage
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
converter = KreuzbergConverter()
result = converter.run(sources=["report.pdf", "notes.docx"])
documents = result["documents"]
Markdown Output with OCR
Use ExtractionConfig to customize output format, OCR backend, and other extraction settings:
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
from kreuzberg import ExtractionConfig, OcrConfig
converter = KreuzbergConverter(
config=ExtractionConfig(
output_format="markdown",
ocr=OcrConfig(backend="tesseract", language="eng"),
),
)
result = converter.run(sources=["scanned_document.pdf"])
documents = result["documents"]
In a Pipeline
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.kreuzberg import KreuzbergConverter
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component("converter", KreuzbergConverter())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")
pipeline.run({"converter": {"sources": ["report.pdf", "presentation.pptx"]}})
Additional Features
- Per-page extraction: Create one
Documentper page usingPageConfig(extract_pages=True) - Chunking: Split documents by token count with configurable overlap via
ChunkingConfig - Token reduction: Reduce token count with modes from
"light"to"maximum"viaTokenReductionConfig - Rich metadata: Quality scores, detected languages, extracted keywords, table data, and PDF annotations
- Batch processing: Parallel extraction enabled by default; set
batch=Falsefor sequential mode - Config from file: Load extraction settings from a TOML, YAML, or JSON file via
config_path
For the full configuration reference and format support matrix, see the Kreuzberg documentation.
License
kreuzberg-haystack is distributed under the terms of the
Apache-2.0 license.
