
# Integration: HanLP

Use HanLP for Chinese text processing with Haystack.
## Overview
You can use HanLP (Han Language Processing) in your Haystack pipelines or as a standalone component for Chinese text processing. HanLP is a comprehensive NLP library for Chinese that provides tokenization, sentence segmentation, and other linguistic analysis capabilities.
The integration provides a specialized `ChineseDocumentSplitter` component that understands the unique characteristics of Chinese text, such as the lack of spaces between words and the multi-character nature of Chinese words.
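If you want to see what HanLP's word segmentation looks like on its own, independent of Haystack, here is a minimal sketch using the `hanlp` package directly. The pretrained model constant is one of HanLP's published tokenizers and downloads on first use; the splitter component below loads comparable models for you:

```python
import hanlp

# Load a pretrained coarse-grained Chinese tokenizer (downloads on first use).
tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)

# Chinese has no spaces between words; the model recovers word boundaries.
print(tok("商品和服务。"))  # e.g. ['商品', '和', '服务', '。']
```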
## Installation

```shell
pip install hanlp-haystack
```
## Usage

### Basic Configuration

Here's a simple example of how to use the `ChineseDocumentSplitter`:
```python
from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

# Create a document with Chinese text
doc = Document(
    content=(
        "这是第一句话，这是第二句话，这是第三句话。"
        "这是第四句话，这是第五句话，这是第六句话！"
        "这是第七句话，这是第八句话，这是第九句话？"
    )
)

# Initialize the splitter
splitter = ChineseDocumentSplitter(
    split_by="word",
    split_length=10,
    split_overlap=3,
    respect_sentence_boundary=True,
)

# Warm up the component (loads the necessary models)
splitter.warm_up()

result = splitter.run(documents=[doc])
print(result["documents"])
```
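The splitter returns the chunks under the `"documents"` key as regular Haystack `Document` objects. To inspect what it produced, you can iterate over them; the exact `meta` keys attached to each chunk depend on the component, so treat that part of the output as illustrative:

```python
for i, chunk in enumerate(result["documents"]):
    print(f"chunk {i}: {chunk.content!r}")
    print(f"  meta: {chunk.meta}")
```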
### Advanced Configuration

The `ChineseDocumentSplitter` supports various configuration options:
```python
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

splitter = ChineseDocumentSplitter(
    split_by="sentence",
    split_length=1000,
    split_overlap=200,
    split_threshold=0,
    respect_sentence_boundary=True,
    granularity="coarse",
)
```
Available `split_by` options:

- `word`: Split by Chinese words (default)
- `sentence`: Split by sentences using the HanLP sentence tokenizer
- `passage`: Split by double line breaks (`\n\n`)
- `page`: Split by form feed (`\f`)
- `line`: Split by line breaks (`\n`)
- `period`: Split by periods (`.`)
- `function`: Use a custom splitting function (see "Custom Splitting Functions" below)
Available `granularity` options:

- `coarse`: Coarse-grained Chinese word segmentation (default)
- `fine`: Fine-grained word segmentation
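The choice affects how much text fits into a word-based chunk: coarse segmentation keeps compounds such as 自然语言处理 as one token, while fine segmentation breaks them into several shorter words, so the same `split_length` covers less text. A small comparison you could run; the example sentence and the expected tokenizations are illustrative:

```python
from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

doc = Document(content="自然语言处理是人工智能的一个重要方向。")

for granularity in ("coarse", "fine"):
    # Identical split settings; only the word segmentation model differs.
    splitter = ChineseDocumentSplitter(split_by="word", split_length=5, granularity=granularity)
    splitter.warm_up()
    chunks = splitter.run(documents=[doc])["documents"]
    print(granularity, [c.content for c in chunks])
```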
### Sentence Boundary Respect
When splitting by words, you can ensure that splits respect sentence boundaries:
```python
from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

doc = Document(
    content=(
        "这是第一句话，这是第二句话，这是第三句话。"
        "这是第四句话，这是第五句话，这是第六句话！"
        "这是第七句话，这是第八句话，这是第九句话？"
    )
)

# With respect_sentence_boundary=True, word-based splits are adjusted so
# that no chunk ends in the middle of a sentence.
splitter = ChineseDocumentSplitter(
    split_by="word",
    split_length=10,
    split_overlap=3,
    respect_sentence_boundary=True,
)

splitter.warm_up()
result = splitter.run(documents=[doc])
```
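If the component behaves as described, every chunk should end at a sentence boundary. A quick sanity check, assuming the fullwidth sentence-final punctuation used in the example text:

```python
for chunk in result["documents"]:
    # Each chunk should end with 。, ！ or ？ when sentence boundaries are respected.
    assert chunk.content.rstrip().endswith(("。", "！", "？")), chunk.content
```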
### Custom Splitting Functions
You can also use custom splitting functions for specialized text processing:
```python
from haystack import Document
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

def custom_chinese_split(text: str) -> list[str]:
    # Split on the fullwidth period; note that str.split drops the delimiter.
    return text.split("。")

doc = Document(content="这是第一句话。这是第二句话。这是第三句话。")

splitter = ChineseDocumentSplitter(
    split_by="function",
    splitting_function=custom_chinese_split,
)

splitter.warm_up()
result = splitter.run(documents=[doc])
```
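Because `str.split` drops the delimiter, the chunks above lose their trailing 。 and an empty trailing piece is produced. A hypothetical variant that keeps the delimiter and filters out empty pieces:

```python
def split_keep_delimiter(text: str) -> list[str]:
    # Re-attach the fullwidth period and drop empty pieces (e.g. after the final 。).
    return [part + "。" for part in text.split("。") if part]

# ['这是第一句话。', '这是第二句话。', '这是第三句话。']
print(split_keep_delimiter("这是第一句话。这是第二句话。这是第三句话。"))
```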
### Integration with Haystack Pipelines

The `ChineseDocumentSplitter` integrates seamlessly with Haystack pipelines:
```python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.preprocessors.hanlp import ChineseDocumentSplitter

document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(
    "splitter",
    ChineseDocumentSplitter(
        split_by="word",
        split_length=1000,
        split_overlap=200,
        respect_sentence_boundary=True,
    ),
)
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("splitter", "writer")

chinese_documents = [
    Document(content="这是第一个文档的内容。"),
    Document(content="这是第二个文档的内容。"),
]

indexing_pipeline.run({"splitter": {"documents": chinese_documents}})
```
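After the run, the writer has persisted the split chunks. `count_documents()` is part of the standard Haystack document store interface, so you can verify the result:

```python
# Each input document was split into one or more chunks and written to the store.
print(document_store.count_documents())
```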
## License

`hanlp-haystack` is distributed under the terms of the Apache-2.0 license.