Maintained by deepset

Integration: Chonkie

Fast, lightweight text chunking for Haystack indexing pipelines, powered by Chonkie.

Authors

deepset

Overview

Chonkie is a fast, lightweight chunking library designed for RAG applications. This integration provides four Haystack document splitter components backed by Chonkie’s chunkers:

Component	Chunking strategy
`ChonkieTokenDocumentSplitter`	Fixed-size token-based chunks with configurable overlap
`ChonkieSentenceDocumentSplitter`	Chunks that respect sentence boundaries
`ChonkieRecursiveDocumentSplitter`	Hierarchical recursive splitting using a rule set
`ChonkieSemanticDocumentSplitter`	Embedding-based splitting at semantic topic boundaries

All components accept a list[Document] and return a list[Document]. Each output document carries source_id, page_number, split_id, split_idx_start, split_idx_end, and token_count in its metadata.

Tokenizers (ChonkieTokenDocumentSplitter, ChonkieSentenceDocumentSplitter, ChonkieRecursiveDocumentSplitter): pass any tokenizer name accepted by Chonkie. Common options are "character" (default, no extra dependencies), "gpt2", and "cl100k_base". See the Chonkie token chunker docs for the full list of supported tokenizer backends.

Embedding models (ChonkieSemanticDocumentSplitter): the default is minishlab/potion-base-32M, a static model2vec model that runs on CPU with no GPU required. Chonkie also supports SentenceTransformers, OpenAI, Cohere, Google Gemini, JinaAI, VoyageAI, and Azure OpenAI embeddings — see the Chonkie embeddings overview for the full list and installation instructions.

Installation

pip install chonkie-haystack

Usage

Token-based splitting

Split documents into fixed-size token chunks:

from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieTokenDocumentSplitter

chunker = ChonkieTokenDocumentSplitter(tokenizer="gpt2", chunk_size=10, chunk_overlap=2)
result = chunker.run(documents=[Document(content=(
    "Haystack is an open-source framework for building LLM applications. "
    "It supports retrieval-augmented generation and custom components. "
    "Developers can connect models, databases, and tools in a pipeline."
))])
print(result["documents"])
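
To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal pure-Python sketch of a sliding window over a token list. It is an illustration only, not Chonkie's implementation: the helper name token_windows and the whitespace "tokenizer" are assumptions made for this example.

```python
def token_windows(tokens, chunk_size, chunk_overlap):
    """Return (start, end) index pairs for fixed-size windows with overlap.

    Hypothetical helper for illustration; not part of chonkie-haystack.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    spans = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        spans.append((start, end))
        if end == len(tokens):
            break  # the last window already reaches the end of the text
        start += step
    return spans


# Whitespace splitting stands in for a real tokenizer such as "gpt2".
tokens = "Haystack is an open-source framework for building LLM applications and pipelines".split()
for start, end in token_windows(tokens, chunk_size=5, chunk_overlap=2):
    print(start, end, " ".join(tokens[start:end]))
```

Each window starts chunk_size - chunk_overlap tokens after the previous one, so neighbouring chunks share chunk_overlap tokens.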

Sentence-based splitting

Split documents while keeping sentence boundaries intact:

from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSentenceDocumentSplitter

chunker = ChonkieSentenceDocumentSplitter(tokenizer="gpt2", chunk_size=10)
result = chunker.run(documents=[Document(content=(
    "Haystack is an open-source framework for building LLM applications. "
    "It supports retrieval-augmented generation and custom components. "
    "Developers can connect models, databases, and tools in a pipeline."
))])
print(result["documents"])
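
The idea of respecting sentence boundaries can be sketched as greedy packing: whole sentences are added to a chunk until the next one would exceed the size budget. The sketch below is a simplified stand-in, not Chonkie's algorithm; pack_sentences, the regex-based sentence split, and counting whitespace-separated words as tokens are all assumptions for illustration.

```python
import re


def pack_sentences(text, chunk_size):
    """Greedily group whole sentences into chunks of at most chunk_size words.

    Hypothetical helper; a sentence longer than chunk_size becomes its own chunk.
    """
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Close the current chunk if this sentence would overflow it.
        if current and current_len + n > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


for chunk in pack_sentences("One two three. Four five. Six seven eight nine.", chunk_size=5):
    print(repr(chunk))
```

No sentence is ever cut in the middle, which is the property that distinguishes this strategy from plain token windows.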

Recursive splitting

Apply a hierarchy of splitting rules — useful for structured text like Markdown or code:

from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieRecursiveDocumentSplitter

chunker = ChonkieRecursiveDocumentSplitter(chunk_size=30)
result = chunker.run(documents=[Document(content=(
    "# Introduction\n\n"
    "Haystack is an open-source framework for building LLM applications.\n\n"
    "## Features\n\n"
    "It supports retrieval-augmented generation, custom components, and production pipelines.\n\n"
    "## Installation\n\n"
    "Install Haystack with pip and start building your first pipeline today."
))])
print(result["documents"])
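
A hierarchy of splitting rules can be illustrated with a short recursive function: try the coarsest separator first, and only fall back to finer ones for pieces that are still too long. This is a rough sketch under stated assumptions, not Chonkie's rule set: recursive_split, the separator order, and measuring size in characters are chosen here for illustration.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters.

    Hypothetical helper: coarse separators are tried before fine ones, and
    adjacent pieces are merged back together when they fit in one chunk.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to hard cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        for piece in recursive_split(part, chunk_size, rest):
            if chunks and len(chunks[-1]) + len(sep) + len(piece) <= chunk_size:
                chunks[-1] = chunks[-1] + sep + piece  # merge small neighbours
            else:
                chunks.append(piece)
    return chunks


for chunk in recursive_split("# Title\n\nalpha beta gamma\n\nshort", chunk_size=12):
    print(repr(chunk))
```

Paragraph breaks are preferred as cut points, so headings and short paragraphs tend to stay whole while long paragraphs are broken at line or word boundaries.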

Semantic splitting

Split documents at topic boundaries detected via embedding similarity:

from haystack import Document
from haystack_integrations.components.preprocessors.chonkie import ChonkieSemanticDocumentSplitter

chunker = ChonkieSemanticDocumentSplitter(chunk_size=512, threshold=0.5)
result = chunker.run(documents=[
    Document(content=(
        "Haystack is an open-source framework for building LLM applications. "
        "It supports retrieval-augmented generation and custom components. "
        "The Eiffel Tower is a wrought-iron landmark on the Champ de Mars in Paris. "
        "It was constructed between 1887 and 1889 as the centrepiece of the World's Fair."
    ))
])
print(result["documents"])

The embedding model is loaded lazily on the first run() call. No explicit warm_up() is needed.
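
To show what a similarity threshold does, the sketch below starts a new group whenever the cosine similarity between consecutive sentence embeddings drops below the threshold. It is a deliberately simplified illustration, not how ChonkieSemanticDocumentSplitter works internally; the function names and the hand-written 2-D "embeddings" are assumptions, standing in for vectors from a real embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_on_topic_shift(sentences, embeddings, threshold):
    """Group consecutive sentences; break where similarity falls below threshold.

    Hypothetical helper for illustration only.
    """
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            groups.append([])  # topic shift detected: start a new group
        groups[-1].append(sentences[i])
    return groups


sentences = [
    "Haystack is a framework for LLM applications.",
    "It supports retrieval-augmented generation.",
    "The Eiffel Tower is a landmark in Paris.",
    "It was built for the World's Fair.",
]
# Toy vectors: the first two point one way, the last two another.
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
for group in split_on_topic_shift(sentences, embeddings, threshold=0.5):
    print(group)
```

In this simplified model, a higher threshold produces more, smaller groups, and a lower one merges more sentences together.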

In a pipeline

Any of the four components can be dropped into a standard indexing pipeline in place of a built-in splitter. This example uses ChonkieTokenDocumentSplitter:

from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.chonkie import ChonkieTokenDocumentSplitter

# In-memory store that receives the chunked documents
document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component("converter", TextFileToDocument())
p.add_component("cleaner", DocumentCleaner())
p.add_component("splitter", ChonkieTokenDocumentSplitter(tokenizer="gpt2", chunk_size=512))
p.add_component("writer", DocumentWriter(document_store=document_store))

# Wire the stages: convert -> clean -> split -> write
p.connect("converter.documents", "cleaner.documents")
p.connect("cleaner.documents", "splitter.documents")
p.connect("splitter.documents", "writer.documents")

# Index every .txt file in the target directory
files = list(Path("path/to/your/files").glob("*.txt"))
p.run({"converter": {"sources": files}})

License

chonkie-haystack is distributed under the terms of the Apache-2.0 license.