Integration: Apache Tika
Convert files of different types (PDF, DOCX, HTML, and more) to Documents using Apache Tika
Overview
The tika-haystack integration provides TikaDocumentConverter, a component that converts files of different types (PDF, DOCX, HTML, RTF, and many others) into Haystack Document objects using Apache Tika.
Apache Tika is a content analysis toolkit that detects and extracts metadata and text from many file formats. The component requires a running Tika server to parse documents.
This component was previously part of Haystack core and now lives in the tika-haystack integration package, maintained in haystack-core-integrations.
Installation
Install the tika-haystack package:
pip install tika-haystack
This integration requires a running Tika server. The easiest way to start one is with Docker:
docker run -d -p 127.0.0.1:9998:9998 apache/tika:latest
For more options, see the Tika Docker documentation.
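Before running the converter, it can help to confirm that the server started by the Docker command is actually answering. The following is a minimal sketch using only the Python standard library. The helper name is_tika_reachable is hypothetical and not part of tika-haystack, and the check assumes the server answers a plain GET request on the /tika endpoint.

```python
import urllib.error
import urllib.request


def is_tika_reachable(url: str = "http://localhost:9998/tika", timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to `url` gets a 2xx response.

    Hypothetical helper: assumes the Tika server answers plain GET
    requests on this endpoint.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


# Example use (not executed here):
# if not is_tika_reachable():
#     raise RuntimeError("Tika server is not reachable; is the container running?")
```

A fresh container may need a few seconds before it accepts connections, so retrying the check a few times is a reasonable design choice.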
Usage
On its own
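from pathlib import Path
from haystack_integrations.components.converters.tika import TikaDocumentConverter
# Connects to the default Tika server URL; the server must already be running
converter = TikaDocumentConverter()
# run() returns a dict with the converted Documents under the "documents" key
result = converter.run(sources=[Path("sample.docx"), Path("report.pdf")])
documents = result["documents"]
print(documents[0].content)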
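The sources argument above is a list of paths. When the files live in a folder, that list can be built programmatically. This is a sketch under stated assumptions: collect_sources and SUPPORTED_SUFFIXES are hypothetical names chosen for illustration, and the suffix set is only an example, not the list of formats Tika supports.

```python
from pathlib import Path

# Hypothetical example set of extensions; adjust to the formats you need.
SUPPORTED_SUFFIXES = frozenset({".pdf", ".docx", ".html", ".rtf"})


def collect_sources(folder: Path, suffixes: frozenset = SUPPORTED_SUFFIXES) -> list:
    """Return files under `folder` (searched recursively) whose
    lower-cased suffix is in `suffixes`, sorted by path."""
    return sorted(
        path
        for path in folder.rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )


# Example use (not executed here):
# result = converter.run(sources=collect_sources(Path("docs")))
```

Sorting the paths keeps the order of the resulting documents stable between runs.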
In a pipeline
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.converters.tika import TikaDocumentConverter
document_store = InMemoryDocumentStore()
# Indexing pipeline: convert -> clean -> split -> write
pipeline = Pipeline()
pipeline.add_component("converter", TikaDocumentConverter())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=5))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
# Pass the documents from each component to the next one
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")
# The converter is the only component that needs external input
pipeline.run({"converter": {"sources": ["document.pdf", "report.docx"]}})
By default, the component connects to a Tika server at http://localhost:9998/tika. Use the tika_url parameter to point to a different server:
converter = TikaDocumentConverter(tika_url="http://my-tika-server:9998/tika")
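One way to avoid hard-coding the server address is to read it from the environment and fall back to the default URL mentioned above. This is a minimal sketch: the TIKA_URL variable name and the resolve_tika_url helper are assumptions made for illustration, and the integration itself is not known to read this variable.

```python
import os

# Default endpoint stated in this document.
DEFAULT_TIKA_URL = "http://localhost:9998/tika"


def resolve_tika_url(env_var: str = "TIKA_URL") -> str:
    """Return the URL stored in `env_var`, or the default if it is unset or blank.

    Hypothetical helper: `TIKA_URL` is an illustrative variable name.
    """
    return os.environ.get(env_var, "").strip() or DEFAULT_TIKA_URL


# Example use (not executed here):
# converter = TikaDocumentConverter(tika_url=resolve_tika_url())
```

This keeps the same code usable against a local Docker container and a remote server without edits.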
License
tika-haystack is distributed under the terms of the Apache-2.0 license.
