
DocumentSplitter

DocumentSplitter divides a list of text Documents into a list of shorter text Documents. This is useful for long texts that would otherwise exceed the maximum text length of a language model, and it can also speed up question answering.

Name: DocumentSplitter
Folder Path: /preprocessors/
Position in a Pipeline: In indexing Pipelines after File Converters and DocumentCleaner, before Classifiers
Inputs: "documents": a list of Documents
Outputs: "documents": a list of Documents

Overview

DocumentSplitter expects a list of Documents as input and returns a list of Documents with split texts. It splits each input Document into chunks of split_length units, where the unit is set by split_by, and consecutive chunks overlap by split_overlap units. All three parameters can be set when the component is initialized.

  • split_by can be "word", "sentence", or "passage" (paragraph).
  • split_length is an integer indicating the chunk size, which is the number of words, sentences, or passages.
  • split_overlap is an integer indicating the number of overlapping words, sentences, or passages between chunks.
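To make the interaction between split_length and split_overlap concrete, here is a minimal plain-Python sketch of sliding-window chunking over words. It is only an illustration under stated assumptions, not the component's actual implementation: the helper name split_units is hypothetical, and DocumentSplitter may handle edge cases (such as a short trailing chunk) differently.

```python
def split_units(units, split_length, split_overlap=0):
    """Group units into chunks of split_length that share split_overlap units.

    Hypothetical helper for illustration only; not part of Haystack.
    """
    if split_length <= 0:
        raise ValueError("split_length must be greater than 0")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in the range [0, split_length)")

    # Each new chunk starts this many units after the previous one.
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + split_length])
        # Stop once a chunk has reached the end of the input.
        if start + split_length >= len(units):
            break
    return chunks


words = "one two three four five six seven".split()
chunks = split_units(words, split_length=3, split_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With split_length=3 and split_overlap=1, each chunk starts two words after the previous one, so the last word of one chunk is repeated as the first word of the next.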

A "source_id" field is added to each Document's metadata to keep track of the original Document that was split. All other metadata is copied from the original Document.
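The metadata handling described above can be sketched in plain Python. This is an illustration under assumptions rather than the library's code: SimpleDoc is a hypothetical stand-in for a Document, and make_chunks is a hypothetical helper that receives already-split texts.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a Document: text content, a meta dict, and an id."""
    content: str
    meta: dict = field(default_factory=dict)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def make_chunks(original, texts):
    """Build one chunk per text, copying the original's metadata and adding source_id."""
    chunks = []
    for text in texts:
        meta = dict(original.meta)        # copy so the original's meta is left untouched
        meta["source_id"] = original.id   # link the chunk back to the Document it came from
        chunks.append(SimpleDoc(content=text, meta=meta))
    return chunks


original = SimpleDoc(
    content="First sentence. Second sentence.",
    meta={"file_name": "notes.txt"},
)
chunks = make_chunks(original, ["First sentence. ", "Second sentence."])
for chunk in chunks:
    print(chunk.meta)
```

Because every chunk carries the same source_id, chunks can later be grouped by the Document they were split from.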

Usage

On its own

You can use this component outside of a pipeline to shorten your Documents like this:

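from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(split_by="passage", split_length=200, split_overlap=0)
# run() returns a dict with the split Documents under the "documents" key
result = splitter.run(documents=[Document(content="Your long text here.")])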

In a pipeline

Here's how you can use DocumentSplitter in an indexing pipeline:

from pathlib import Path

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

# Build the indexing pipeline: convert files, clean, split, then write to the store
p = Pipeline()
p.add_component(instance=TextFileToDocument(), name="text_file_converter")
p.add_component(instance=DocumentCleaner(), name="cleaner")
p.add_component(instance=DocumentSplitter(split_by="sentence", split_length=1), name="splitter")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

# Pass each component's "documents" output to the next component's "documents" input
p.connect("text_file_converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

# Collect the files to index and run the pipeline on them
path = "path/to/your/files"
files = list(Path(path).glob("*.md"))
p.run({"text_file_converter": {"sources": files}})


Related Links

See the parameter details in our API reference: