Maintained by deepset

Integration: HanLP

Use HanLP for Chinese text processing with Haystack

Authors
MaChi
deepset

Overview

You can use HanLP (Han Language Processing) in your Haystack pipelines or as a standalone component for Chinese text processing. HanLP is a comprehensive NLP library for Chinese that provides advanced tokenization, sentence segmentation, and other linguistic analysis capabilities.

The integration provides a specialized ChineseDocumentSplitter component that understands the unique characteristics of Chinese text, such as the lack of spaces between words and the multi-character nature of Chinese words.

Installation

pip install hanlp-haystack

Usage

Basic Configuration

Here’s a simple example of how to use the ChineseDocumentSplitter:

from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

# Create a document with Chinese text
doc = Document(content=
    "这是第一句话,这是第二句话,这是第三句话。"
    "这是第四句话,这是第五句话,这是第六句话!"
    "这是第七句话,这是第八句话,这是第九句话?"
)

# Initialize the splitter
splitter = ChineseDocumentSplitter(
    split_by="word",
    split_length=10,
    split_overlap=3,
    respect_sentence_boundary=True
)

# Warm up the component (loads the necessary models)
splitter.warm_up()

result = splitter.run(documents=[doc])
print(result["documents"])
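
Calling warm_up() loads the HanLP tokenization and sentence-segmentation models before the first run(); depending on your environment, HanLP may download these models the first time they are used.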

Advanced Configuration

The ChineseDocumentSplitter supports various configuration options:

from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

splitter = ChineseDocumentSplitter(
    split_by="sentence",
    split_length=1000,
    split_overlap=200,
    split_threshold=0,
    respect_sentence_boundary=True,
    granularity="coarse"
)
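
Besides split_by and granularity, split_threshold controls the minimum size of a chunk: as in Haystack's built-in DocumentSplitter, a chunk with fewer units than the threshold is merged into the previous chunk, and a value of 0 (as shown above) disables this behaviour.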

Available split_by options:

  • word: Split by Chinese words (default)
  • sentence: Split by sentences using HanLP sentence tokenizer
  • passage: Split by double line breaks (\n\n)
  • page: Split by form feed (\f)
  • line: Split by line breaks (\n)
  • period: Split by periods (.)
  • function: Use a custom splitting function

Granularity options:

  • coarse: Coarse granularity Chinese word segmentation (default)
  • fine: Fine granularity word segmentation
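
As an illustration, here is a minimal sketch combining the sentence split mode with fine-grained segmentation; the parameter values are chosen only for demonstration:

from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

# Split every two sentences with one sentence of overlap,
# using fine-grained word segmentation internally.
sentence_splitter = ChineseDocumentSplitter(
    split_by="sentence",
    split_length=2,
    split_overlap=1,
    granularity="fine"
)
sentence_splitter.warm_up()

doc = Document(content="这是第一句话。这是第二句话。这是第三句话。")
result = sentence_splitter.run(documents=[doc])
for split_doc in result["documents"]:
    print(split_doc.content)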

Respecting Sentence Boundaries

When splitting by words, you can ensure that splits respect sentence boundaries:

from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

doc = Document(content=
    "这是第一句话,这是第二句话,这是第三句话。"
    "这是第四句话,这是第五句话,这是第六句话!"
    "这是第七句话,这是第八句话,这是第九句话?"
)

splitter = ChineseDocumentSplitter(
    split_by="word",
    split_length=10,
    split_overlap=3,
    respect_sentence_boundary=True
)
splitter.warm_up()
result = splitter.run(documents=[doc])
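
To inspect the output, print the produced chunks; with respect_sentence_boundary=True each chunk should end at a sentence boundary. Each output Document also carries split metadata in its meta field (the exact keys may vary between versions):

for split_doc in result["documents"]:
    print(repr(split_doc.content))
    print(split_doc.meta)  # split metadata, e.g. a reference to the source document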

Custom Splitting Functions

You can also use custom splitting functions for specialized text processing:

from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

def custom_chinese_split(text: str) -> list[str]:
    # Split on the Chinese full stop; the delimiter itself is not kept.
    return text.split("。")

doc = Document(content="这是第一句话。这是第二句话。这是第三句话。")

splitter = ChineseDocumentSplitter(
    split_by="function",
    splitting_function=custom_chinese_split
)
splitter.warm_up()
result = splitter.run(documents=[doc])
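
If you want to keep the sentence-ending punctuation attached to each piece, a regex-based splitting function works just as well. The following is a minimal sketch; the pattern and the function name are illustrative, not part of the integration:

import re

from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

def split_keep_punctuation(text: str) -> list[str]:
    # Split after 。, ！ or ？ while keeping the punctuation attached to each sentence.
    parts = re.split(r"(?<=[。！？])", text)
    return [part for part in parts if part]

splitter = ChineseDocumentSplitter(
    split_by="function",
    splitting_function=split_keep_punctuation
)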

Integration with Haystack Pipelines

The ChineseDocumentSplitter integrates seamlessly with Haystack pipelines:

from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("splitter", ChineseDocumentSplitter(
    split_by="word",
    split_length=1000,
    split_overlap=200,
    respect_sentence_boundary=True
))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("splitter", "writer")

chinese_documents = [
    Document(content="这是第一个文档的内容。"),
    Document(content="这是第二个文档的内容。"),
]

indexing_pipeline.run({"splitter": {"documents": chinese_documents}})
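
After the pipeline has run, you can verify that the split documents were written to the store:

print(document_store.count_documents())
for stored_doc in document_store.filter_documents():
    print(stored_doc.content)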

License

hanlp-haystack is distributed under the terms of the Apache-2.0 license.