
Tutorial: Query Classification with TransformersTextRouter and TransformersZeroShotTextRouter


Overview

One of the great benefits of using state-of-the-art NLP models like those available in Haystack is that it allows users to state their queries as plain natural language questions: rather than trying to come up with just the right set of keywords to find the answer to their question, users can simply ask their question in much the same way that they would ask it of a (very knowledgeable!) person.

But just because users can ask their questions in “plain English” (or “plain German”, etc.), that doesn’t mean they always will. For instance, users might input a few keywords rather than a complete question because they don’t understand the pipeline’s full capabilities or are simply accustomed to keyword search. While a standard Haystack pipeline might handle such queries with reasonable accuracy, we might still prefer that our pipeline is sensitive to the type of query it receives, so that it behaves differently when a user inputs, say, a collection of keywords instead of a question. For this reason, Haystack comes with built-in capabilities to distinguish between types of text inputs, such as keyword queries, interrogative queries (questions), and statement queries.

Given a text input, classification models output a label, which can be LABEL_0, LABEL_1, … depending on how the model was trained. Haystack’s TransformersTextRouter component runs a classification model on the text and then routes the text to the output branch that is named after the predicted label.

In this tutorial you will learn how to use TransformersTextRouter and TransformersZeroShotTextRouter to branch your Haystack pipeline based on the type of query it receives:

  1. Keyword vs. Question/Statement: routes a query into one of two branches depending on whether it is a full question/statement or a collection of keywords.

  2. Question vs. Statement: routes a natural language query into one of two branches depending on whether it is a question or a statement.

With TransformersTextRouter, it’s also possible to route queries based on your custom classification models. With TransformersZeroShotTextRouter you can even do zero-shot classification, which means you can specify custom classes but don’t need any custom model.

With all of that explanation out of the way, let’s dive in!

Preparing the Colab Environment

Installing Haystack

To start, install the latest release of Haystack with pip:

%%bash

pip install --upgrade pip
pip install haystack-ai torch sentencepiece datasets sentence-transformers

Enabling Telemetry

Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.

from haystack.telemetry import tutorial_running

tutorial_running(41)

Trying out TransformersTextRouter

Before integrating a TransformersTextRouter into the pipeline, test it out on its own and see what it actually does. First, initialize a simple, out-of-the-box keyword vs. question/statement TransformersTextRouter, which uses the shahrukhx01/bert-mini-finetune-question-detection model. For this model, LABEL_0 corresponds to keyword queries and LABEL_1 to non-keyword queries.

from haystack.components.routers import TransformersTextRouter

text_router = TransformersTextRouter(model="shahrukhx01/bert-mini-finetune-question-detection")
text_router.warm_up()

Now feed some queries into this text router. Test with one keyword query, one interrogative query, and one statement query. Note that you don’t need to use any punctuation, such as question marks, for the text router to make the right decision.

queries = [
    "Arya Stark father",  # Keyword Query
    "Who was the father of Arya Stark",  # Interrogative Query
    "Lord Eddard was the father of Arya Stark",  # Statement Query
]

Below, you can see what the text router does with these queries: it correctly determines that “Arya Stark father” is a keyword query and sends it to the branch LABEL_0. It also correctly classifies both the interrogative query “Who was the father of Arya Stark” and the statement query “Lord Eddard was the father of Arya Stark” as non-keyword queries, and sends them to the branch LABEL_1.

result = text_router.run(text=queries[0])
next(iter(result))

import pandas as pd

results = {"Query": [], "Output Branch": [], "Class": []}

for query in queries:
    result = text_router.run(text=query)
    results["Query"].append(query)
    results["Output Branch"].append(next(iter(result)))
    results["Class"].append("Keyword Query" if next(iter(result)) == "LABEL_0" else "Question/Statement")

pd.DataFrame.from_dict(results)
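The loop above relies on the shape of the router's output. Judging by the calls in this tutorial, run() returns a dictionary with a single key, the predicted label, whose value is the routed text. The sketch below mimics that shape with a plain dictionary (fake_result is a hand-written stand-in, not real model output) to show why next(iter(result)) yields the branch name.

```python
# Hand-written stand-in for what text_router.run(text=...) is assumed to return:
# one key (the predicted label) mapped to the routed text.
fake_result = {"LABEL_0": "Arya Stark father"}

# Iterating over a dict yields its keys, so next(iter(...)) is the first (and only) key.
branch = next(iter(fake_result))
routed_text = fake_result[branch]

print(branch)       # LABEL_0
print(routed_text)  # Arya Stark father
```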

Next, you will try out a question vs. statement TransformersTextRouter using shahrukhx01/question-vs-statement-classifier. For this task, you need to initialize a new text router with this classification model.

text_router = TransformersTextRouter(model="shahrukhx01/question-vs-statement-classifier")
text_router.warm_up()

queries = [
    "Who was the father of Arya Stark",  # Interrogative Query
    "Lord Eddard was the father of Arya Stark",  # Statement Query
]

results = {"Query": [], "Output Branch": [], "Class": []}

for query in queries:
    result = text_router.run(text=query)
    results["Query"].append(query)
    results["Output Branch"].append(next(iter(result)))
    results["Class"].append("Question" if next(iter(result)) == "LABEL_1" else "Statement")

pd.DataFrame.from_dict(results)

Again, the text router sends the question and the statement to the expected output branches.

Custom Use Cases for Text Classification

TransformersTextRouter is very flexible and also supports other options for classifying texts beyond distinguishing keyword queries from interrogative queries. For example, you may be interested in detecting the sentiment of a text or classifying the topics. You can do this by loading a custom classification model from the Hugging Face Hub or by using zero-shot classification with TransformersZeroShotTextRouter.

  • Traditional text classification models are trained to predict one of a few “hard-coded” classes and require a dedicated training dataset. In the Hugging Face Hub, you can find many pre-trained models, maybe even related to your domain of interest.
  • Zero-shot classification is very versatile: by choosing a suitable base transformer, you can classify the text without any training dataset. You just have to provide the candidate categories.

Custom Classification Models with TransformersTextRouter

For this use case, you can use a public model available in the Hugging Face Hub. For example, if you want to classify the sentiment of the queries, you can choose an appropriate model, such as cardiffnlp/twitter-roberta-base-sentiment.

text_router = TransformersTextRouter(model="cardiffnlp/twitter-roberta-base-sentiment")
text_router.warm_up()
queries = [
    "What's the answer?",  # neutral query
    "Would you be so lovely to tell me the answer?",  # positive query
    "Can you give me the damn right answer for once??",  # negative query
]
sent_results = {"Query": [], "Output Branch": [], "Class": []}

for query in queries:
    result = text_router.run(text=query)
    sent_results["Query"].append(query)
    sent_results["Output Branch"].append(next(iter(result)))
    sent_results["Class"].append({"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2":"positive"}.get(next(iter(result)), "Unknown"))

pd.DataFrame.from_dict(sent_results)

Zero-Shot Classification with TransformersZeroShotTextRouter

TransformersZeroShotTextRouter lets you perform zero-shot classification by providing a suitable base transformer model and defining the classes the model should predict.

First, initialize a TransformersZeroShotTextRouter with some custom category labels. By default, it uses the base-size zero-shot classification model MoritzLaurer/deberta-v3-base-zeroshot-v1.1-all-33. You can switch to the larger model MoritzLaurer/deberta-v3-large-zeroshot-v2.0, which may give better results, by using TransformersZeroShotTextRouter(model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0").

Let’s look at an example. You may be interested in whether the user query is related to music or cinema. In this case, the labels parameter is a list containing the candidate classes and the output branches of the component are named accordingly.

from haystack.components.routers import TransformersZeroShotTextRouter

text_router = TransformersZeroShotTextRouter(labels=["music", "cinema"])
text_router.warm_up()
queries = [
    "In which films does John Travolta appear?",  # cinema
    "What is the Rolling Stones first album?",  # music
    "Who was Sergio Leone?",  # cinema
]
sent_results = {"Query": [], "Output Branch": []}

for query in queries:
    result = text_router.run(text=query)
    sent_results["Query"].append(query)
    sent_results["Output Branch"].append(next(iter(result)))

pd.DataFrame.from_dict(sent_results)
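Conceptually, a zero-shot classifier assigns a score to each candidate label, and the router emits the text on the branch of the highest-scoring label. The following is a minimal plain-Python sketch of that selection step only. The function name pick_branch and the scores are invented for illustration; a real model would compute the scores.

```python
def pick_branch(text: str, scores: dict[str, float]) -> dict[str, str]:
    """Return {best_label: text}, mirroring the one-key output shape used above."""
    best_label = max(scores, key=scores.get)
    return {best_label: text}


# Hypothetical scores, invented for illustration only.
example_scores = {"music": 0.08, "cinema": 0.92}
routed = pick_branch("Who was Sergio Leone?", example_scores)
print(routed)  # {'cinema': 'Who was Sergio Leone?'}
```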

Similar to the previous example, we can use zero-shot text classification to group questions into “Game of Thrones”, “Star Wars” and “Lord of the Rings” related questions. The number of labels is up to you!

from haystack.components.routers import TransformersZeroShotTextRouter

text_router = TransformersZeroShotTextRouter(labels=["Game of Thrones", "Star Wars", "Lord of the Rings"])
text_router.warm_up()

queries = [
    "Who was the father of Arya Stark",  # Game of Thrones
    "Who was the father of Luke Skywalker",  # Star Wars
    "Who was the father of Frodo Baggins",  # Lord of the Rings
]

results = {"Query": [], "Output Branch": []}

for query in queries:
    result = text_router.run(text=query)
    results["Query"].append(query)
    results["Output Branch"].append(next(iter(result)))

pd.DataFrame.from_dict(results)

And as you see, the question about “Arya Stark” is sent to branch “Game of Thrones”, while the question about “Luke Skywalker” is sent to branch “Star Wars” and the question about “Frodo Baggins” is sent to “Lord of the Rings”. This means you can have your pipeline treat questions about these universes differently.
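One way to picture "treating questions differently" is a dispatch table keyed by branch name. The sketch below is plain Python, not Haystack code: the handler functions and the HANDLERS table are hypothetical placeholders for whatever components you would connect to each branch.

```python
from typing import Callable


# Hypothetical per-universe handlers; in a real pipeline these would be components.
def answer_got(query: str) -> str:
    return f"[Game of Thrones index] {query}"


def answer_star_wars(query: str) -> str:
    return f"[Star Wars index] {query}"


def answer_lotr(query: str) -> str:
    return f"[Lord of the Rings index] {query}"


HANDLERS: dict[str, Callable[[str], str]] = {
    "Game of Thrones": answer_got,
    "Star Wars": answer_star_wars,
    "Lord of the Rings": answer_lotr,
}


def dispatch(router_output: dict[str, str]) -> str:
    """Send the routed text to the handler named after its branch."""
    branch, text = next(iter(router_output.items()))
    return HANDLERS[branch](text)


print(dispatch({"Star Wars": "Who was the father of Luke Skywalker"}))
```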

Congratulations! You’ve learned how TransformersZeroShotTextRouter and TransformersTextRouter work and how you can use these components individually. Now let’s explore how to use them in a pipeline.

Pipeline with Keyword vs. Question/Statement Query Classification

Now you will create a question-answering (QA) pipeline with keyword vs. question/statement query classification and route the questions based on the classification result.

Fetching and Indexing Documents

You’ll start creating your question answering system by downloading the data and indexing the data with its embeddings to a DocumentStore.

In this tutorial, you will take a simple approach to writing documents and their embeddings into the DocumentStore. For a full indexing pipeline with preprocessing, cleaning and splitting, check out our tutorial on Preprocessing Different File Types.

Initializing the DocumentStore

Initialize a DocumentStore to index your documents. A DocumentStore stores the Documents that the question answering system uses to find answers to your questions. In this tutorial, you’ll be using the InMemoryDocumentStore.

from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

InMemoryDocumentStore is the simplest DocumentStore to get started with. It requires no external dependencies and it’s a good option for smaller projects and debugging. But it doesn’t scale up so well to larger Document collections, so it’s not a good choice for production systems. To learn more about the different types of external databases that Haystack supports, see DocumentStore Integrations.

The DocumentStore is now ready. Now it’s time to fill it with some Documents.

Fetch the Data

You’ll use the Wikipedia pages of Seven Wonders of the Ancient World as Documents. We preprocessed the data and uploaded it to a Hugging Face Space: Seven Wonders. Thus, you don’t need to perform any additional cleaning or splitting.

Fetch the data and convert it into Haystack Documents:

from datasets import load_dataset
from haystack import Document

dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
docs = [Document(content=doc["content"], meta=doc["meta"]) for doc in dataset]

Initialize a Document Embedder

To store your data in the DocumentStore with embeddings, initialize a SentenceTransformersDocumentEmbedder with the model name and call warm_up() to download the embedding model.

If you’d like, you can use a different Embedder for your documents.

from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc_embedder = SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
doc_embedder.warm_up()

Write Documents to the DocumentStore

Run the doc_embedder with the Documents. The embedder will create an embedding for each document and save it in that Document object’s embedding field. Then, you can write the Documents to the DocumentStore with the write_documents() method.

docs_with_embeddings = doc_embedder.run(docs)
document_store.write_documents(docs_with_embeddings["documents"])
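To see why storing an embedding next to each document is useful, here is a minimal sketch of embedding-based retrieval in plain Python. ToyDocument, cosine and retrieve are illustrative names, the vectors are hand-picked rather than produced by a model, and cosine similarity is only one common choice of similarity function; the actual DocumentStore may use a different one.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Simplified stand-in for a document that carries its own embedding."""
    content: str
    embedding: list[float] = field(default_factory=list)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding: list[float], documents: list[ToyDocument], top_k: int = 1) -> list[ToyDocument]:
    """Return the top_k documents whose embeddings are most similar to the query."""
    ranked = sorted(documents, key=lambda d: cosine(query_embedding, d.embedding), reverse=True)
    return ranked[:top_k]


# Hand-picked 3-dimensional vectors; a real embedder produces hundreds of dimensions.
store = [
    ToyDocument("Colossus of Rhodes", [0.9, 0.1, 0.0]),
    ToyDocument("Great Pyramid of Giza", [0.1, 0.9, 0.1]),
]
top = retrieve([0.8, 0.2, 0.0], store)
print(top[0].content)  # Colossus of Rhodes
```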

2) Initialize Retrievers, TextEmbedder and TransformersTextRouter

Your pipeline will be a simple Retriever pipeline, but the Retriever choice will depend on the type of query received: keyword queries will use a sparse BM25Retriever, while question/statement queries will use the more accurate but also more computationally expensive EmbeddingRetriever.
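For intuition about why a sparse retriever suits keyword queries, the sketch below scores documents with a textbook BM25 formula in plain Python. It is an illustration under simplifying assumptions (whitespace tokenization, a generic BM25 variant with k1=1.5 and b=0.75); the InMemoryBM25Retriever may use a different variant and tokenizer. The three documents are made-up one-liners, not the tutorial dataset.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores


docs = [
    "the colossus of rhodes was a statue of the greek sun god helios",
    "the great pyramid of giza is the oldest of the seven wonders",
    "the hanging gardens of babylon were described by ancient writers",
]
scores = bm25_scores("colossus rhodes statue", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because the score depends only on exact term overlap, no model inference is needed at query time, which is what makes this kind of retriever cheap for keyword-style input.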

Now, initialize a Router, an Embedder, both Retrievers, and a Joiner.

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever, InMemoryEmbeddingRetriever
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.joiners import DocumentJoiner

text_router = TransformersTextRouter(model="shahrukhx01/bert-mini-finetune-question-detection")
text_embedder = SentenceTransformersTextEmbedder(model="sentence-transformers/all-MiniLM-L6-v2")
embedding_retriever = InMemoryEmbeddingRetriever(document_store)
bm25_retriever = InMemoryBM25Retriever(document_store)
document_joiner = DocumentJoiner()

3) Define the Pipeline

As promised, the keyword branch LABEL_0 from the text router is fed into the BM25Retriever, while the question/statement branch LABEL_1 from the same router is fed into the TextEmbedder and then the EmbeddingRetriever. Both of these retrievers are then fed into our joiner. Our pipeline can thus be thought of as having something of a diamond shape: all queries are sent into the router, which sends each query to one of two retrievers, and those retrievers feed their outputs to the same joiner.

from haystack import Pipeline

query_classification_pipeline = Pipeline()
query_classification_pipeline.add_component("text_router", text_router)
query_classification_pipeline.add_component("text_embedder", text_embedder)
query_classification_pipeline.add_component("embedding_retriever", embedding_retriever)
query_classification_pipeline.add_component("bm25_retriever", bm25_retriever)
query_classification_pipeline.add_component("document_joiner", document_joiner)

# LABEL_0 (keyword queries) goes to the sparse BM25 retriever
query_classification_pipeline.connect("text_router.LABEL_0", "bm25_retriever")
# LABEL_1 (question/statement queries) goes to the embedder and then the dense retriever
query_classification_pipeline.connect("text_router.LABEL_1", "text_embedder")
query_classification_pipeline.connect("text_embedder", "embedding_retriever")
query_classification_pipeline.connect("bm25_retriever", "document_joiner")
query_classification_pipeline.connect("embedding_retriever", "document_joiner")
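The diamond-shaped control flow can be sketched in a few lines of plain Python. This is not Haystack code: route uses a crude word-count rule as a stand-in for the classification model, and the two retrieve functions are hypothetical placeholders that return labeled strings instead of documents.

```python
def route(text: str) -> dict[str, str]:
    """Toy router: a crude word-count rule stands in for the classification model."""
    branch = "keyword" if len(text.split()) <= 3 else "question_or_statement"
    return {branch: text}


def sparse_retrieve(query: str) -> list[str]:
    return [f"sparse hit for: {query}"]


def dense_retrieve(query: str) -> list[str]:
    return [f"dense hit for: {query}"]


def join(*document_lists: list[str]) -> list[str]:
    """Merge the outputs of whichever branches actually ran."""
    return [doc for docs in document_lists for doc in docs]


def run_diamond(text: str) -> list[str]:
    routed = route(text)
    # Only the branch present in the router output runs; the other contributes nothing.
    sparse_docs = sparse_retrieve(routed["keyword"]) if "keyword" in routed else []
    dense_docs = (
        dense_retrieve(routed["question_or_statement"]) if "question_or_statement" in routed else []
    )
    return join(sparse_docs, dense_docs)


print(run_diamond("arya stark father"))
print(run_diamond("Who is the father of Arya Stark?"))
```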

4) Run the Pipeline

Below, you can see how this choice affects the branching structure: the keyword query “arya stark father” and the question query “Who is the father of Arya Stark?” generate noticeably different results, a distinction that is likely due to the use of different retrievers for keyword vs. question/statement queries.

# Useful for framing headers
equal_line = "=" * 30

# Run only the dense retriever on the full sentence query
res_1 = query_classification_pipeline.run({"text_router": {"text": "Who is the father of Arya Stark?"}})
print(f"\n\n{equal_line}\nQUESTION QUERY RESULTS\n{equal_line}")
print(res_1)

# Run only the sparse retriever on a keyword based query
res_2 = query_classification_pipeline.run({"text_router": {"text": "arya stark father"}})
print(f"\n\n{equal_line}\nKEYWORD QUERY RESULTS\n{equal_line}")
print(res_2)

Above you saw a potential use for keyword vs. question/statement classification: you might choose to use a less resource-intensive retriever for keyword queries than for question/statement queries. But what about question vs. statement classification?

Pipeline with Question vs. Statement Query Classifier

To illustrate one potential use for question vs. statement classification, you will build a pipeline that looks as follows:

  1. Every query is first classified and then passed to a retriever.
  2. The pipeline ends with a reader that only question queries go through.

In other words, your pipeline will be a retriever-only pipeline for statement queries (given the statement “Arya Stark was the daughter of a Lord”, all you will get back are the most relevant documents), but it will be a retriever-reader pipeline for question queries.

To make things more concrete, your pipeline will start with a TransformersTextRouter that is set to do question vs. statement classification. Each of the router’s two output branches feeds its own BM25 Retriever. The Retriever on the question branch is then connected to the Reader, while the Retriever on the statement branch is not connected to any other component. As a result, the last component of the pipeline depends on the type of query: questions go all the way through the Reader, while statements only go through the Retriever.

Now, define the pipeline. Keep in mind that you don’t need to write Documents to the DocumentStore again as they are already indexed.

2) Define the Pipeline and Components

from haystack.components.readers import ExtractiveReader

query_classification_pipeline = Pipeline()
query_classification_pipeline.add_component("bm25_retriever_0", InMemoryBM25Retriever(document_store))
query_classification_pipeline.add_component("bm25_retriever_1", InMemoryBM25Retriever(document_store))
query_classification_pipeline.add_component("text_router", TransformersTextRouter(model="shahrukhx01/question-vs-statement-classifier"))
query_classification_pipeline.add_component("reader", ExtractiveReader())

# LABEL_1 (questions): retriever followed by the reader
query_classification_pipeline.connect("text_router.LABEL_1", "bm25_retriever_0")
query_classification_pipeline.connect("bm25_retriever_0", "reader")
# LABEL_0 (statements): retriever only
query_classification_pipeline.connect("text_router.LABEL_0", "bm25_retriever_1")

3) Run the Pipeline

And here are the results of this pipeline: with a question query like “Who is the father of Arya Stark?”, you obtain answers from a Reader, and with a statement query like “Arya Stark was the daughter of a Lord”, you just obtain documents from a Retriever.

# Useful for framing headers
equal_line = "=" * 30

# Run the retriever + reader on the question query
query = "Who is the father of Arya Stark?"
res_1 = query_classification_pipeline.run({"text_router": {"text": query}, "reader": {"query": query}})
print(f"\n\n{equal_line}\nQUESTION QUERY RESULTS\n{equal_line}")
print(res_1)

# Run only the retriever on the statement query
query = "Arya Stark was the daughter of a Lord"
res_2 = query_classification_pipeline.run({"text_router": {"text": query}, "reader": {"query": query}})
print(f"\n\n{equal_line}\nSTATEMENT QUERY RESULTS\n{equal_line}")
print(res_2)
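Printing the raw results can be hard to read. The helper below is a small sketch for telling the two cases apart. It assumes the pipeline result is a dictionary keyed by component name, with reader output under "answers" and retriever output under "documents". The function name describe_result and the two fake results are hand-written stand-ins, so check the keys against your own res_1 and res_2 before relying on it.

```python
def describe_result(result: dict) -> str:
    """Report whether a pipeline result holds reader answers or only documents."""
    if "reader" in result:
        return f"{len(result['reader']['answers'])} answer(s) from the reader"
    # Otherwise, count documents from whichever component produced the final output.
    n_docs = sum(len(output.get("documents", [])) for output in result.values())
    return f"{n_docs} document(s) from the retriever"


# Hand-written stand-ins for res_1 and res_2; real results hold Haystack objects.
fake_question_result = {"reader": {"answers": ["answer A", "answer B"]}}
fake_statement_result = {"bm25_retriever_1": {"documents": ["doc 1", "doc 2", "doc 3"]}}

print(describe_result(fake_question_result))
print(describe_result(fake_statement_result))
```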