Tutorial: Utilizing Existing FAQs for Question Answering


  • Level: Beginner
  • Time to complete: 15 minutes
  • Nodes Used: InMemoryDocumentStore, EmbeddingRetriever
  • Goal: Learn how to use the EmbeddingRetriever in a FAQPipeline to answer incoming questions by matching them to the most similar questions in your existing FAQ.

Overview

While extractive Question Answering works on raw text and is therefore more generalizable, a common alternative is to answer incoming questions by matching them against existing FAQ data.

Pros:

  • Very fast at inference time
  • Utilize existing FAQ data
  • Quite good control over answers

Cons:

  • Limited generalizability: We can only answer questions that are similar to those already in the FAQ

In some use cases, a combination of extractive QA and FAQ-style matching can also be an interesting option, as sketched below.
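If you want to combine both approaches, one simple pattern is to try the FAQ match first and fall back to extractive QA when the best match scores too low. The following is only a sketch: extractive_pipe stands in for a hypothetical ExtractiveQAPipeline over your full documents, and the threshold value is an assumption you would tune on your own data.

def answer_question(query, faq_pipe, extractive_pipe, threshold=0.9):
    # Try the FAQ pipeline first and keep its answer if the match is confident
    result = faq_pipe.run(query=query, params={"Retriever": {"top_k": 1}})
    answers = result["answers"]
    # A useful threshold depends on the similarity function and on the
    # retriever's scale_score setting, so 0.9 is purely illustrative
    if answers and answers[0].score is not None and answers[0].score >= threshold:
        return result
    # Otherwise, fall back to extractive QA over the full document collection
    return extractive_pipe.run(query=query)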

Preparing the Colab Environment

Installing Haystack

To start, let’s install the latest release of Haystack with pip:

%%bash

pip install --upgrade pip
pip install farm-haystack[colab,inference]
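If you want to verify that the installation worked, you can print the installed version:

import haystack

print(haystack.__version__)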

Enabling Telemetry

Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.

from haystack.telemetry import tutorial_running

tutorial_running(4)

Set the logging level to INFO:

import logging

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)

Create a simple DocumentStore

The InMemoryDocumentStore is good for quick development and prototyping. For more scalable options, check out the docs.

from haystack.document_stores import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
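Switching to a more scalable store later usually only changes this construction step. As a hedged sketch (it assumes a locally running Elasticsearch instance, and the parameters are illustrative), the equivalent with Elasticsearch could look like this:

# from haystack.document_stores import ElasticsearchDocumentStore
#
# # embedding_dim must match the embedding model (384 for all-MiniLM-L6-v2 below)
# document_store = ElasticsearchDocumentStore(host="localhost", embedding_dim=384)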

Create a Retriever using embeddings

Instead of retrieving via plain keyword matching such as BM25, we want to use vector similarity between the questions (incoming user question vs. stored FAQ questions). We can use the EmbeddingRetriever for this purpose and specify the model to use for the embeddings.

from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    use_gpu=True,  # falls back to CPU automatically if no GPU is available
    scale_score=False,  # return raw similarity scores instead of scaling them to [0, 1]
)
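As an optional sanity check, you can embed a sample question and inspect the resulting vector. The query string here is just an example; all-MiniLM-L6-v2 produces 384-dimensional embeddings:

sample_embedding = retriever.embed_queries(queries=["What are the symptoms?"])
print(sample_embedding.shape)  # (1, 384): one query, one 384-dimensional vector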

Prepare & Index FAQ data

We create a pandas DataFrame containing some FAQ data (i.e., curated pairs of question + answer) and index those pairs in our DocumentStore. Here, we download some question-answer pairs related to COVID-19.

import pandas as pd

from haystack.utils import fetch_archive_from_http


# Download
doc_dir = "data/tutorial4"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/small_faq_covid.csv.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Get dataframe with columns "question", "answer" and some custom metadata
df = pd.read_csv(f"{doc_dir}/small_faq_covid.csv")
# Minimal cleaning
df.fillna(value="", inplace=True)
df["question"] = df["question"].apply(lambda x: x.strip())
print(df.head())

# Create embeddings for our questions from the FAQs
# In contrast to most other search use cases, we don't create the embeddings here from the content of our documents,
# but rather from the additional text field "question" as we want to match "incoming question" <-> "stored question".
questions = list(df["question"].values)
df["embedding"] = retriever.embed_queries(queries=questions).tolist()
df = df.rename(columns={"question": "content"})

# Convert Dataframe to list of dicts and index them in our DocumentStore
docs_to_index = df.to_dict(orient="records")
document_store.write_documents(docs_to_index)
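As a quick check that indexing worked, you can compare the number of stored documents with the number of rows in the dataframe:

print(f"Indexed {document_store.get_document_count()} documents from {len(df)} rows")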

Ask questions

Initialize a FAQPipeline (this time without a reader) and ask questions.

from haystack.pipelines import FAQPipeline
from haystack.utils import print_answers

pipe = FAQPipeline(retriever=retriever)

# Ask any question and change top_k to see more or fewer answers
prediction = pipe.run(query="How is the virus spreading?", params={"Retriever": {"top_k": 1}})

print_answers(prediction, details="medium")
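print_answers gives you a formatted view. If you prefer to work with the result programmatically, the prediction contains Answer objects whose attributes you can access directly:

best_answer = prediction["answers"][0]
print(best_answer.answer)  # the stored FAQ answer
print(best_answer.score)   # similarity score of the matched FAQ question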