
Extending your Metadata using DocumentClassifiers at Index Time


With DocumentClassifier it's possible to automatically enrich your documents with categories, sentiments, topics or whatever metadata you like. This metadata can then be used for efficient filtering or further processing. Say you have some categories your users typically filter on. If your documents are already tagged manually with these categories, you could train a model to automate the tagging. Or you can leverage the full power and flexibility of zero-shot classification: all you need to do is pass your categories to the classifier, no training labels required. This tutorial shows how to integrate it into your indexing pipeline.

DocumentClassifier adds the classification result (label and score) to the Document's meta property, which means we can use it to classify documents at index time.
The result can then be accessed at query time, for example by applying a filter on "classification.label".

This tutorial will show you how to integrate a classification model into your preprocessing steps and how you can filter for this additional metadata at query time. In the last section we show how to put it all together and create an indexing pipeline.
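For orientation, here is a minimal sketch of what the enriched meta looks like after classification (the score value is invented for illustration; the exact keys beyond "label" depend on the model and task):

# doc.meta["classification"]["label"]   # e.g. "music" -- the top predicted class
# doc.meta["classification"]["score"]   # e.g. 0.98 -- model-dependent confidence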

# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab, ocr]

!wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.03.tar.gz
!tar -xvf xpdf-tools-linux-4.03.tar.gz && sudo cp xpdf-tools-linux-4.03/bin64/pdftotext /usr/local/bin

# Install pygraphviz
!apt install libgraphviz-dev
!pip install pygraphviz
# Here are the imports we need
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore
from haystack.nodes import PreProcessor, TransformersDocumentClassifier, FARMReader, ElasticsearchRetriever
from haystack.schema import Document
from haystack.utils import convert_files_to_dicts, fetch_archive_from_http, print_answers
# This fetches some sample files to work with

doc_dir = "data/tutorial16"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/preprocessing_tutorial16.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

Read and preprocess documents

# note that you can also use the document classifier before applying the PreProcessor, e.g. before splitting your documents

all_docs = convert_files_to_dicts(dir_path=doc_dir)
preprocessor_sliding_window = PreProcessor(split_overlap=3, split_length=10, split_respect_sentence_boundary=False)
docs_sliding_window = preprocessor_sliding_window.process(all_docs)
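As a quick sanity check (an optional step we add here, not part of the original flow), we can compare the number of input files with the number of passages produced by the sliding-window split:

print(f"{len(all_docs)} input documents were split into {len(docs_sliding_window)} passages")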

Apply DocumentClassifier

We can enrich the document metadata at index time using any transformers document classifier model. While traditional classification models are trained to predict one of a few "hard-coded" classes and require a dedicated training dataset, zero-shot classification is far more flexible: you can switch the classes the model should predict on the fly, simply by supplying them via the labels param. Here we use a zero-shot model to classify our documents into 'music', 'natural language processing' and 'history'. Feel free to change these to whatever you'd like to classify.
These classes can later be accessed at query time.

doc_classifier = TransformersDocumentClassifier(
    model_name_or_path="cross-encoder/nli-distilroberta-base", task="zero-shot-classification",
    labels=["music", "natural language processing", "history"], batch_size=16,
)
# we can also use any other transformers model besides zero-shot classification

# doc_classifier_model = 'bhadresh-savani/distilbert-base-uncased-emotion'
# doc_classifier = TransformersDocumentClassifier(model_name_or_path=doc_classifier_model, batch_size=16, use_gpu=-1)
# we could also specify a different field we want to run the classification on

# doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
#    task="zero-shot-classification",
#    labels=["music", "natural language processing", "history"],
#    batch_size=16, use_gpu=-1,
#    classification_field="description")
# convert the dicts to Documents; a field_map can be used if the classification should run on custom content fields
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]
# classify using the GPU; batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)
# let's see how it looks: there should be a classification result in the meta entry containing labels and scores
print(classified_docs[0].to_dict())


# In Colab / No Docker environments: Start Elasticsearch from source
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
# Connect to Elasticsearch
document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
# Now, let's write the docs to our DB.
document_store.write_documents(classified_docs)

# check if indexed docs contain classification results
test_doc = document_store.get_all_documents()[0]
print(
    f'document {test_doc.id} with content \n\n{test_doc.content}\n\nhas label {test_doc.meta["classification"]["label"]}'
)
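To see how the predicted classes are distributed across the index (an optional check we add here, relying on the document store's standard filters argument), we can count the documents per label:

for label in ["music", "natural language processing", "history"]:
    doc_count = len(document_store.get_all_documents(filters={"classification.label": [label]}))
    print(f"{label}: {doc_count} documents")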

Querying the data

All we have to do to filter for one of our classes is to set a filter on "classification.label".

# Initialize QA-Pipeline
from haystack.pipelines import ExtractiveQAPipeline

retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
pipe = ExtractiveQAPipeline(reader, retriever)
## Voilà! Ask a question while filtering for "music"-only documents
prediction = pipe.run(
    query="What is heavy metal?",
    params={"Retriever": {"top_k": 10, "filters": {"classification.label": ["music"]}}, "Reader": {"top_k": 5}},
)
print_answers(prediction, details="high")

Wrapping it up in an indexing pipeline

from pathlib import Path
from haystack.pipelines import Pipeline
from haystack.nodes import TextConverter, PreProcessor, FileTypeClassifier, PDFToTextConverter, DocxToTextConverter
file_type_classifier = FileTypeClassifier()
text_converter = TextConverter()
pdf_converter = PDFToTextConverter()
docx_converter = DocxToTextConverter()

indexing_pipeline_with_classification = Pipeline()
indexing_pipeline_with_classification.add_node(component=file_type_classifier, name="FileTypeClassifier", inputs=["File"])
indexing_pipeline_with_classification.add_node(component=text_converter, name="TextConverter", inputs=["FileTypeClassifier.output_1"])
indexing_pipeline_with_classification.add_node(component=pdf_converter, name="PdfConverter", inputs=["FileTypeClassifier.output_2"])
indexing_pipeline_with_classification.add_node(component=docx_converter, name="DocxConverter", inputs=["FileTypeClassifier.output_4"])
indexing_pipeline_with_classification.add_node(component=preprocessor_sliding_window, name="Preprocessor", inputs=["TextConverter", "PdfConverter", "DocxConverter"])
indexing_pipeline_with_classification.add_node(component=doc_classifier, name="DocumentClassifier", inputs=["Preprocessor"])
indexing_pipeline_with_classification.add_node(component=document_store, name="DocumentStore", inputs=["DocumentClassifier"])
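Since we installed pygraphviz at the beginning, we can optionally render the pipeline graph to an image to double-check the wiring (the output filename is our choice):

indexing_pipeline_with_classification.draw("indexing_pipeline_with_classification.png")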

txt_files = [f for f in Path(doc_dir).iterdir() if f.suffix == ".txt"]
pdf_files = [f for f in Path(doc_dir).iterdir() if f.suffix == ".pdf"]
docx_files = [f for f in Path(doc_dir).iterdir() if f.suffix == ".docx"]

# clear the documents we indexed earlier, then index the files with the new pipeline
document_store.delete_documents()
indexing_pipeline_with_classification.run(file_paths=txt_files)
indexing_pipeline_with_classification.run(file_paths=pdf_files)
indexing_pipeline_with_classification.run(file_paths=docx_files)

# we can store this pipeline and use it from the REST-API
indexing_pipeline_with_classification.save_to_yaml(Path("indexing_pipeline_with_classification.yaml"))
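To reuse the stored pipeline later, e.g. behind the REST API, it can be loaded back from the YAML file; a minimal sketch:

reloaded_indexing_pipeline = Pipeline.load_from_yaml(Path("indexing_pipeline_with_classification.yaml"))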

About us

This Haystack notebook was made with love by deepset in Berlin, Germany

We bring NLP to the industry via open source!
Our focus: Industry specific language models & large scale QA systems.


Get in touch: Twitter | LinkedIn | Slack | GitHub Discussions | Website

By the way: we're hiring!