Maintained by deepset

Integration: MarkItDown

Use Microsoft's MarkItDown to locally convert PDF, DOCX, PPTX, XLSX, HTML, images, and more into Markdown in Haystack

Authors
deepset

Overview

MarkItDown is a Python library by Microsoft for converting various file formats into Markdown. It supports a wide range of formats including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, images, and more, all processed locally.

This integration provides a MarkItDownConverter component that wraps Microsoft’s MarkItDown library, enabling Haystack users to convert files into Haystack Document objects with Markdown content.

Installation

pip install markitdown-haystack

Usage

Standalone

from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]
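
When the files live in a directory rather than in a hand-written list, the sources list can be assembled first. The following is a minimal sketch, not part of the integration: collect_sources and CANDIDATE_EXTENSIONS are hypothetical names, and the extension set only mirrors the formats named in the overview, so it is not an authoritative list of what MarkItDown accepts.

```python
from pathlib import Path

# Assumption: extensions taken from the formats named in the overview,
# not from MarkItDown's own list of supported types.
CANDIDATE_EXTENSIONS = {".pdf", ".docx", ".pptx", ".xlsx", ".html"}


def collect_sources(root, extensions=CANDIDATE_EXTENSIONS):
    """Return sorted file paths under `root` whose suffix is in `extensions`.

    Hypothetical helper: it only builds a list of path strings and does
    not call the converter itself.
    """
    root_path = Path(root)
    return sorted(
        str(path)
        for path in root_path.rglob("*")  # walk the directory tree recursively
        if path.is_file() and path.suffix.lower() in extensions
    )
```

The returned list can then be passed as the sources argument, for example converter.run(sources=collect_sources("docs")), where "docs" is a placeholder directory name.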

You can also pass metadata to attach to the resulting documents:

from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(
    sources=["document.pdf", "report.docx"],
    meta=[{"author": "Alice"}, {"author": "Bob"}]
)
documents = result["documents"]
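
The example above pairs one metadata dict with each source, in the same order. For longer source lists, that pairing could be generated instead of typed by hand. The sketch below is illustrative only: build_meta is a hypothetical helper, and the file_name, file_type, and project keys are arbitrary choices, not fields the converter requires.

```python
from pathlib import Path


def build_meta(sources, shared=None):
    """Build one metadata dict per source path, in the same order.

    Hypothetical helper: `shared` holds key/value pairs copied into every
    dict; the per-file keys are derived from the path alone.
    """
    shared = shared or {}
    return [
        {
            **shared,
            "file_name": Path(source).name,
            "file_type": Path(source).suffix.lstrip(".").lower(),
        }
        for source in sources
    ]


sources = ["document.pdf", "report.docx"]
meta = build_meta(sources, shared={"project": "demo"})

# One dict per source, matching the shape used in the example above
assert len(meta) == len(sources)
```

The resulting list can be passed alongside the sources, as in converter.run(sources=sources, meta=meta).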

To convert ByteStream objects:

from pathlib import Path

from haystack.dataclasses import ByteStream
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

# Read the raw bytes of the file to convert
file_bytes = Path("document.pdf").read_bytes()

converter = MarkItDownConverter()
bytestream = ByteStream(data=file_bytes, meta={"file_path": "document.pdf"})
result = converter.run(sources=[bytestream])
documents = result["documents"]

In a Haystack Pipeline

from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

document_store = InMemoryDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", MarkItDownConverter())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")

indexing.run({"converter": {"sources": ["a/file/path.pdf", "another/file.docx"]}})

License

markitdown-haystack is distributed under the terms of the Apache-2.0 license.