Maintained by deepset

Integration: Apache Tika

Convert files of different types (PDF, DOCX, HTML, and more) to Documents using Apache Tika

Authors

deepset


Overview
Installation
Usage
License

Overview

The tika-haystack integration provides TikaDocumentConverter, a component that converts files of different types (PDF, DOCX, HTML, RTF, and many others) into Haystack Document objects using Apache Tika.

Apache Tika is a content analysis toolkit that detects and extracts metadata and text from many file formats. The component requires a running Tika server to parse documents.

This component was previously part of Haystack core and now lives in the tika-haystack integration package, maintained in haystack-core-integrations.

Installation

Install the tika-haystack package:

pip install tika-haystack

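This integration requires a running Tika server. The easiest way to start one is with Docker. The command below runs the server in the background and publishes port 9998 on the local machine only: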

docker run -d -p 127.0.0.1:9998:9998 apache/tika:latest

For more options, see the Tika Docker documentation.
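
Before running the converter, it can help to confirm that the server is reachable. The following is a minimal sketch using only the Python standard library. The helper name tika_is_reachable is hypothetical and not part of tika-haystack, and the sketch assumes the server answers GET requests on Tika Server's /version endpoint at the port published above.

```python
import urllib.error
import urllib.request


def tika_is_reachable(base_url: str = "http://localhost:9998", timeout: float = 3.0) -> bool:
    """Return True if a GET to <base_url>/version answers with HTTP 200.

    Hypothetical helper, not part of tika-haystack. Assumes the server
    exposes a /version endpoint.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status
        return False
```

Calling tika_is_reachable() before building a pipeline gives an early, readable failure instead of an error during conversion.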

Usage

On its own

from pathlib import Path

from haystack_integrations.components.converters.tika import TikaDocumentConverter

# With no arguments, the converter uses the default Tika server URL
converter = TikaDocumentConverter()

# Sources are sent to the Tika server, which extracts their text
result = converter.run(sources=[Path("sample.docx"), Path("report.pdf")])
documents = result["documents"]

print(documents[0].content)
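
When the list of sources comes from user input or a directory listing, some paths may not exist. One possible way to handle this, sketched below, is to filter the list before passing it to run. The helper existing_files is hypothetical and uses only the standard library; it is an illustration, not something the integration provides.

```python
from pathlib import Path
from typing import Iterable, List


def existing_files(candidates: Iterable[str]) -> List[Path]:
    """Keep only the candidates that point to regular files on disk.

    Hypothetical helper, not part of tika-haystack.
    """
    paths = [Path(candidate) for candidate in candidates]
    return [path for path in paths if path.is_file()]
```

The filtered list can then be used as the sources argument, for example converter.run(sources=existing_files(["sample.docx", "report.pdf"])).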

In a pipeline

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.converters.tika import TikaDocumentConverter

document_store = InMemoryDocumentStore()

# Indexing pipeline: convert files, clean the text, split it, and store the chunks
pipeline = Pipeline()
pipeline.add_component("converter", TikaDocumentConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Documents flow from each component to the next
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Only the converter needs input; file paths can be given as strings
pipeline.run({"converter": {"sources": ["document.pdf", "report.docx"]}})

By default, the component connects to a Tika server at http://localhost:9998/tika. Use the tika_url parameter to point to a different server:

converter = TikaDocumentConverter(tika_url="http://my-tika-server:9998/tika")
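
In deployments where the server address differs between environments, one option is to read the URL from an environment variable and fall back to the default. The sketch below assumes a variable named TIKA_URL; that name and the helper resolve_tika_url are hypothetical, and the integration itself does not read this variable.

```python
import os

# Default URL stated in this document
DEFAULT_TIKA_URL = "http://localhost:9998/tika"


def resolve_tika_url(env_var: str = "TIKA_URL") -> str:
    """Return the URL stored in env_var, or the default if it is unset or empty.

    Hypothetical helper; TIKA_URL is an assumed variable name.
    """
    return os.environ.get(env_var) or DEFAULT_TIKA_URL
```

The result can be passed straight to the component, for example TikaDocumentConverter(tika_url=resolve_tika_url()).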

License

tika-haystack is distributed under the terms of the Apache-2.0 license.