Languages Other Than English
Haystack is well suited to open-domain QA on languages other than English. While our defaults are tuned for English, you will find some tips and tricks here for using Haystack in your language.
This feature will be implemented by this PR.
The PreProcessor's sentence tokenization is language specific. If you are using the PreProcessor on a language other than English, make sure to set the language argument when initializing it.
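To see why sentence tokenization depends on the language, here is a minimal standard-library sketch. It is not the PreProcessor's actual tokenizer: the `split_sentences` helper and the tiny abbreviation lists are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical, deliberately tiny abbreviation lists for illustration only.
ABBREVIATIONS = {
    "en": {"dr.", "mr.", "e.g."},
    "de": {"z.b.", "dr.", "bzw."},
}


def split_sentences(text, language):
    """Split after tokens ending in . ! or ?, unless the token is a
    known abbreviation for the given language."""
    abbreviations = ABBREVIATIONS[language]
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences


german = "Das Modell kennt viele Sprachen, z.B. Deutsch. Es funktioniert gut."
english_rules = split_sentences(german, "en")  # breaks wrongly after "z.B."
german_rules = split_sentences(german, "de")   # keeps "z.B. Deutsch." together
print(len(english_rules), english_rules)
print(len(german_rules), german_rules)
```

With English rules the German abbreviation "z.B." is mistaken for a sentence end, so the text is cut into three pieces instead of two.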
The sparse retriever methods themselves (BM25, TF-IDF) are language-agnostic. Their only requirement is that the text be split into words. The ElasticsearchDocumentStore relies on an analyzer to impose word boundaries, and also to handle punctuation, casing and stop words.
The default analyzer is an English analyzer. While it can still work decently for a large range of languages, you will want to set it to your language's analyzer for optimal performance. In some cases, such as with Thai, the default analyzer is completely incompatible. See this page for the full list of language specific analyzers.
from haystack.document_store import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(analyzer="thai")
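The role of the analyzer can be pictured with a toy version written in plain Python. This is only a sketch of the idea, not how Elasticsearch implements its analyzers: the `analyze` function and the short stop word lists are hypothetical.

```python
import string

# Hypothetical, tiny stop word lists for illustration only.
STOP_WORDS = {
    "english": {"the", "is", "a", "of"},
    "german": {"der", "die", "das", "ist", "ein", "eine"},
}


def analyze(text, language):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [token for token in tokens if token not in STOP_WORDS[language]]


# German text with the wrong stop word list keeps uninformative words.
wrong = analyze("Die Hauptstadt ist Berlin.", "english")
right = analyze("Die Hauptstadt ist Berlin.", "german")
# Thai is written without spaces, so whitespace splitting yields one token.
thai = analyze("ฉันรักภาษาไทย", "english")
print(wrong, right, thai)
```

The German sentence still gets usable tokens under the English settings, just with extra noise, while the Thai sentence collapses into a single unsearchable token. This mirrors why some languages merely lose quality with the default analyzer and others do not work at all.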
The models used in dense retrievers are language specific. Be sure to check the language of the model used in your EmbeddingRetriever. The default model that is loaded in the DensePassageRetriever is for English.
We have created a German DensePassageRetriever model and know of other teams working on models for further languages.
If you have a language model and a question answering dataset in your own language, you can also train a DPR model using Haystack!
Below is a simplified example.
See our tutorial and also the API reference for DensePassageRetriever.train() for more details.
from haystack.retriever import DensePassageRetriever

dense_passage_retriever = DensePassageRetriever(document_store)
# data_dir and train_filename are placeholders; point them at your own dataset.
# The remaining values mirror the defaults in the train() signature.
dense_passage_retriever.train(
    data_dir="path/to/your/data",
    train_filename="train.json",
    dev_filename=None,
    test_filename=None,
    batch_size=16,
    embed_title=True,
    num_hard_negatives=1,
    n_epochs=3,
)
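For orientation, the sketch below writes a one-example training file in the JSON layout used by the original DPR repository (`question`, `answers`, `positive_ctxs`, `negative_ctxs`, `hard_negative_ctxs`). That `train()` expects exactly this layout is an assumption here, so check the tutorial and API reference before relying on it. The file name, the temporary directory, and the German example content are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# One training example in the DPR-style layout (assumed, see the note above).
# The question and passages are made-up illustration data.
examples = [
    {
        "question": "Was ist die Hauptstadt von Deutschland?",
        "answers": ["Berlin"],
        "positive_ctxs": [
            {"title": "Berlin", "text": "Berlin ist die Hauptstadt von Deutschland."}
        ],
        "negative_ctxs": [],
        "hard_negative_ctxs": [
            {"title": "Bonn", "text": "Bonn ist eine Stadt am Rhein."}
        ],
    }
]

# Write the file into a temporary directory that could serve as data_dir.
data_dir = Path(tempfile.mkdtemp())
train_path = data_dir / "train.json"
train_path.write_text(
    json.dumps(examples, ensure_ascii=False, indent=2), encoding="utf-8"
)

# Read it back to confirm the file round-trips cleanly.
loaded = json.loads(train_path.read_text(encoding="utf-8"))
print(len(loaded), "example(s) written to", train_path)
```

Each example here carries one hard negative passage, which matches a `num_hard_negatives` value of 1.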
While models perform comparatively better on English, thanks to a wealth of available English training data, there are a couple of QA models for other languages that are directly usable in Haystack.
We are the creators of the German model and you can find out more about it here.
The French, Italian, Spanish, Portuguese and Chinese models are monolingual language models trained on versions of the SQuAD dataset in their respective languages and their authors report decent results in their model cards (e.g. here and here). There also exist Korean QA models on the model hub but their performance is not reported.
The zero-shot model that is shown above is a multilingual XLM-RoBERTa Large that is trained on English SQuAD. It is clear from our evaluations that the model has been able to transfer some of its English QA capabilities to other languages, but its performance still lags behind that of the monolingual models. Nonetheless, if there is not yet a monolingual model for your language and it is one of the 100 supported by XLM-RoBERTa, this zero-shot model may serve as a decent first baseline.
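The selection rule described above can be written down as a small helper. This is a sketch only: `choose_reader_model` is a hypothetical function, the model identifiers are placeholders rather than real model names, and the language set is a small illustrative subset of what XLM-RoBERTa covers.

```python
# Illustrative subset only; XLM-RoBERTa covers about 100 languages.
XLMR_LANGUAGES = {"de", "fr", "it", "es", "pt", "zh", "ko", "th"}

# Placeholder identifiers, not real model names.
MONOLINGUAL_QA_MODELS = {
    "de": "<german-qa-model>",
    "fr": "<french-qa-model>",
}
ZERO_SHOT_MODEL = "<multilingual-xlm-roberta-qa-model>"


def choose_reader_model(language):
    """Prefer a monolingual QA model; otherwise fall back to the
    multilingual zero-shot model if the language is covered."""
    if language in MONOLINGUAL_QA_MODELS:
        return MONOLINGUAL_QA_MODELS[language]
    if language in XLMR_LANGUAGES:
        return ZERO_SHOT_MODEL
    return None  # no suitable model known


print(choose_reader_model("de"))  # monolingual model is preferred
print(choose_reader_model("th"))  # falls back to the zero-shot baseline
print(choose_reader_model("xx"))  # unsupported language code
```

The monolingual check comes first because, as noted above, the zero-shot model lags behind monolingual models where both exist.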