Maintained by deepset

Integration: LibreOffice File Converter

Convert office documents, spreadsheets, and presentations between formats using LibreOffice in Haystack pipelines.

Authors
Max Swain
deepset

Overview

LibreOfficeFileConverter is a Haystack component that uses LibreOffice’s command-line utility (soffice) to convert office files between formats. It supports documents, spreadsheets, and presentations, and can output ByteStream objects that plug directly into other Haystack components.

Sources can be file paths (str or Path) or ByteStream objects. Both synchronous (run) and asynchronous (run_async) execution modes are supported.

Installation

First, install LibreOffice on your system:

  • macOS: brew install --cask libreoffice
  • Ubuntu/Debian: sudo apt-get install libreoffice
  • Windows: Download from libreoffice.org

Then install the Python package:

pip install libreoffice-haystack
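
Before running a conversion, it can be useful to confirm that the soffice executable is discoverable. The sketch below uses only the Python standard library. The helper name find_soffice is hypothetical, and the check assumes the converter locates soffice through PATH, which this page does not state explicitly.

```python
import shutil
from typing import Optional


def find_soffice(executable: str = "soffice") -> Optional[str]:
    """Return the path of the LibreOffice CLI found on PATH, or None if absent."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_soffice()
    if path is None:
        print("soffice not found on PATH; install LibreOffice or adjust PATH")
    else:
        print(f"Found soffice at {path}")
```

If the check reports that soffice is missing even though LibreOffice is installed, the install directory is probably not on PATH.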

Usage

Standalone

from pathlib import Path
from haystack_integrations.components.converters.libreoffice import LibreOfficeFileConverter

converter = LibreOfficeFileConverter()
result = converter.run(sources=[Path("report.doc")], output_file_type="docx")
print(result["output"])  # [ByteStream(data=b'...')]
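
To keep the converted files, the returned bytes can be written to disk. The following is a minimal sketch rather than part of the integration: save_outputs is a hypothetical helper, and it relies only on each returned object exposing its bytes through a .data attribute, as the printed ByteStream(data=b'...') above suggests. The file naming scheme is an arbitrary choice for illustration.

```python
from pathlib import Path
from typing import Any, Iterable, List


def save_outputs(streams: Iterable[Any], out_dir: Path, stem: str, suffix: str) -> List[Path]:
    """Write each stream's raw bytes to out_dir as <stem>_<index>.<suffix>."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for index, stream in enumerate(streams):
        target = out_dir / f"{stem}_{index}.{suffix}"
        # Assumption: the stream object carries its payload in `.data` (bytes).
        target.write_bytes(stream.data)
        written.append(target)
    return written
```

With the result from the example above, a call such as save_outputs(result["output"], Path("converted"), "report", "docx") would write one file per converted source.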

The output_file_type can be set at initialization or passed per run() call (the latter takes precedence):

# Set at init
converter = LibreOfficeFileConverter(output_file_type="pdf")
result = converter.run(sources=[Path("report.docx")])

# Override per call
result = converter.run(sources=[Path("slides.pptx")], output_file_type="png")
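
The precedence rule can be pictured with a small stand-alone sketch. This is not the component's actual code: resolve_output_type is a hypothetical helper, and raising ValueError when neither value is given is an assumption made purely for illustration.

```python
from typing import Optional


def resolve_output_type(init_value: Optional[str], run_value: Optional[str]) -> str:
    """Pick the effective output type: the per-call value wins over the init value."""
    if run_value is not None:
        return run_value
    if init_value is not None:
        return init_value
    # Assumption for this sketch: with no value anywhere, fail loudly.
    raise ValueError("output_file_type must be set at init or passed to run()")
```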

In a Haystack Pipeline

LibreOfficeFileConverter outputs list[ByteStream], which connects directly to Haystack’s built-in converters. Here is an example that converts a legacy .doc file to .docx and then extracts its text as Haystack Document objects:

from pathlib import Path
from haystack import Pipeline
from haystack.components.converters import DOCXToDocument
from haystack_integrations.components.converters.libreoffice import LibreOfficeFileConverter

pipeline = Pipeline()
pipeline.add_component("libreoffice_converter", LibreOfficeFileConverter())
pipeline.add_component("docx_converter", DOCXToDocument())
pipeline.connect("libreoffice_converter.output", "docx_converter.sources")

result = pipeline.run({
    "libreoffice_converter": {
        "sources": [Path("legacy_report.doc")],
        "output_file_type": "docx",
    }
})
print(result["docx_converter"]["documents"])

Async Usage

LibreOfficeFileConverter also exposes a run_async method with the same signature as run, for use in async Haystack pipelines:

import asyncio
from pathlib import Path
from haystack_integrations.components.converters.libreoffice import LibreOfficeFileConverter

async def main():
    converter = LibreOfficeFileConverter()
    result = await converter.run_async(
        sources=[Path("presentation.pptx")],
        output_file_type="pdf",
    )
    print(result["output"])

asyncio.run(main())

Note: LibreOffice supports only one running soffice instance at a time, so conversions within a single run_async call are executed sequentially.
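
If your own application code starts several run_async calls concurrently (for example with asyncio.gather), one way to keep them from overlapping is to funnel them through a shared asyncio.Lock. The sketch below works under stated assumptions: SerializedConverter is a hypothetical wrapper that is not part of the package, and whether the component already guards against concurrent calls is not stated on this page.

```python
import asyncio
from typing import Any, Optional


class SerializedConverter:
    """Wrap any object with an async run_async method so calls run one at a time."""

    def __init__(self, converter: Any) -> None:
        self._converter = converter
        # Created lazily so the lock belongs to the event loop that is running.
        self._lock: Optional[asyncio.Lock] = None

    async def run_async(self, **kwargs: Any) -> Any:
        if self._lock is None:
            self._lock = asyncio.Lock()
        # Only one caller holds the lock, so only one conversion is in flight.
        async with self._lock:
            return await self._converter.run_async(**kwargs)
```

A wrapper built as SerializedConverter(LibreOfficeFileConverter()) can then be awaited with the same keyword arguments as the original run_async. Calls queue up behind the lock instead of reaching soffice at the same time.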

License

libreoffice-haystack is distributed under the Apache-2.0 License.