Maintained by deepset

Integration: langdetect

Detect the language of documents and route text by language with langdetect

Authors

deepset

Overview
Installation
Usage
License

Overview

The langdetect-haystack integration provides two components for language detection in Haystack pipelines, built on top of the langdetect library:

DocumentLanguageClassifier: classifies the language of each document and stores the detected language in the document’s metadata.
TextLanguageRouter: routes a text string to a different output connection depending on its detected language.

Both components take a list of ISO language codes (for example "en") at initialization. If the detected language is not in that list, DocumentLanguageClassifier labels the document's language as "unmatched", and TextLanguageRouter sends the text to its "unmatched" output.
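The matching rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the integration's actual implementation: label_language, route_text, and fake_detect are hypothetical names, and fake_detect is a toy stand-in for a real detector such as the one langdetect provides.

```python
from typing import Callable, Dict, List

UNMATCHED = "unmatched"


def label_language(text: str, languages: List[str], detect: Callable[[str], str]) -> str:
    """Return the detected code if it is in the allowed list, else "unmatched"."""
    detected = detect(text)
    return detected if detected in languages else UNMATCHED


def route_text(text: str, languages: List[str], detect: Callable[[str], str]) -> Dict[str, str]:
    """Mimic a router: the text appears under exactly one output name."""
    return {label_language(text, languages, detect): text}


def fake_detect(text: str) -> str:
    # Hypothetical stand-in for a real language detector (assumption for this sketch).
    return "es" if "español" in text else "en"


# Only "en" is allowed, so the Spanish text falls through to "unmatched".
print(label_language("This is an English document", ["en"], fake_detect))
print(route_text("Este es un documento en español", ["en"], fake_detect))
```

In the real components, the label would end up in the document's metadata (DocumentLanguageClassifier) or decide which output connection receives the text (TextLanguageRouter).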

These components were previously part of Haystack core and now live in the langdetect-haystack integration package, maintained in haystack-core-integrations.

Installation

Install the langdetect-haystack package:

pip install langdetect-haystack

Usage

DocumentLanguageClassifier

DocumentLanguageClassifier adds a language field to the metadata of each document. Combine it with the MetadataRouter to send documents to different branches of a pipeline based on their language:

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.classifiers.langdetect import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter

docs = [
    Document(id="1", content="This is an English document"),
    Document(id="2", content="Este es un documento en español"),
]

document_store = InMemoryDocumentStore()

p = Pipeline()
# Label each document with its detected language; anything other than "en" becomes "unmatched".
p.add_component(instance=DocumentLanguageClassifier(languages=["en"]), name="language_classifier")
# Send documents whose meta.language is "en" to the router's "en" output.
p.add_component(
    instance=MetadataRouter(rules={"en": {"field": "meta.language", "operator": "==", "value": "en"}}),
    name="router",
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("language_classifier.documents", "router.documents")
# Only the "en" branch is connected to the writer.
p.connect("router.en", "writer.documents")

p.run({"language_classifier": {"documents": docs}})

# Only the English document was written to the store.
written_docs = document_store.filter_documents()
assert len(written_docs) == 1
assert written_docs[0].content == "This is an English document"

TextLanguageRouter

TextLanguageRouter routes a query string to the output named after its detected language. Use it as the first component of a query pipeline so that only queries in supported languages are forwarded:

from haystack import Pipeline, Document
from haystack_integrations.components.routers.langdetect import TextLanguageRouter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

document_store = InMemoryDocumentStore()
document_store.write_documents([Document(content="Elvis Presley was an American singer and actor.")])

p = Pipeline()
p.add_component(instance=TextLanguageRouter(languages=["en"]), name="text_language_router")
p.add_component(instance=InMemoryBM25Retriever(document_store=document_store), name="retriever")
# Only queries detected as English reach the retriever.
p.connect("text_language_router.en", "retriever.query")

# An English query is forwarded and retrieves the stored document.
result = p.run({"text_language_router": {"text": "Who was Elvis Presley?"}})
assert result["retriever"]["documents"][0].content == "Elvis Presley was an American singer and actor."

# A Greek query is not forwarded; it is returned on the router's "unmatched" output.
result = p.run({"text_language_router": {"text": "ένα ελληνικό κείμενο"}})
assert result["text_language_router"]["unmatched"] == "ένα ελληνικό κείμενο"

License

langdetect-haystack is distributed under the terms of the Apache-2.0 license.