EXECUTABLE VERSION: colab
The Retriever has a huge impact on the performance of our overall search pipeline.
A family of algorithms based on counting word occurrences (bag-of-words), which results in very sparse vectors whose length equals the vocabulary size.
Examples: BM25, TF-IDF
Pros: Simple, fast, well explainable
Cons: Relies on exact keyword matches between query and text
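To make the bag-of-words idea concrete, here is a minimal sketch of TF-IDF scoring in plain Python. It is not the implementation used by any Haystack retriever; the toy documents, the whitespace tokenizer and the helper names (`tokenize`, `idf`, `tfidf_vector`, `rank`) are assumptions made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer; real systems also handle punctuation, stemming, etc.
    return text.lower().split()

# Toy corpus, invented for this example
docs = [
    "arya stark is the daughter of eddard stark",
    "the dothraki language was created by a linguist",
    "daenerys rides a dragon",
]
tokenized = [tokenize(d) for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def idf(term):
    # Smoothed inverse document frequency: rare terms get a higher weight
    df = sum(1 for doc in tokenized if term in doc)
    return math.log((1 + len(tokenized)) / (1 + df)) + 1.0

def tfidf_vector(tokens):
    # One dimension per vocabulary entry; most entries stay at zero (sparse).
    # Tokens that are not in the vocabulary are silently ignored.
    counts = Counter(tokens)
    return [counts[term] * idf(term) for term in vocab]

doc_vecs = [tfidf_vector(doc) for doc in tokenized]

def rank(query):
    # Score every document by the dot product with the query vector
    q_vec = tfidf_vector(tokenize(query))
    scores = [sum(q * d for q, d in zip(q_vec, vec)) for vec in doc_vecs]
    best = max(range(len(docs)), key=lambda i: scores[i])
    return best, scores

best, scores = rank("who created the dothraki language")
print(docs[best], scores)

# No shared keyword means no signal at all, even though the meaning is close
print(rank("constructed tongue")[1])
```

The second query shares no word with any document, so every score is zero. This is the "exact keyword match" limitation listed under Cons.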
These retrievers use neural network models to create "dense" embedding vectors. Within this family there are two different approaches:
a) Single encoder: Use a single model to embed both query and passage.
b) Dual-encoder: Use two models, one to embed the query and one to embed the passage
Recent work suggests that dual encoders work better, likely because they can deal better with the different nature of query and passage (length, style, syntax ...).
Examples: REALM, DPR, Sentence-Transformers
Pros: Captures semantic similarity instead of "word matches" (e.g. synonyms, related topics ...)
Cons: Computationally heavier, requires initial training of the model
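The structure of a dual-encoder can be sketched without any neural network. In the toy example below, `ToyEncoder` is a hypothetical stand-in that averages fixed random word vectors; it is untrained, so the resulting ranking carries no meaning. The point is only the architecture: two independent encoders map into the same vector space, passages are embedded ahead of time, and relevance is a dot product.

```python
import random

DIM = 8  # toy embedding size, chosen arbitrarily for this sketch

class ToyEncoder:
    """Hypothetical stand-in for a neural encoder: averages fixed random word vectors."""

    def __init__(self, seed):
        self.seed = seed
        self.table = {}

    def _word_vector(self, word):
        # Deterministic pseudo-random vector per (encoder, word) pair
        if word not in self.table:
            rng = random.Random(f"{self.seed}:{word}")
            self.table[word] = [rng.uniform(-1, 1) for _ in range(DIM)]
        return self.table[word]

    def embed(self, text):
        vectors = [self._word_vector(w) for w in text.lower().split()]
        return [sum(column) / len(vectors) for column in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Dual-encoder: one model for queries, a different one for passages.
# A single-encoder setup would reuse the same object for both roles.
query_encoder = ToyEncoder(seed="query")
passage_encoder = ToyEncoder(seed="passage")

passages = [
    "arya stark is the daughter of eddard stark",
    "the dothraki language was created by a linguist",
    "daenerys rides a dragon",
]

# Passages are embedded once, before any query arrives
passage_embeddings = [passage_encoder.embed(p) for p in passages]

# At query time only the query is embedded and compared to the stored vectors
query_embedding = query_encoder.embed("who created the dothraki vocabulary")
scores = [dot(query_embedding, emb) for emb in passage_embeddings]
ranking = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

In a trained dual-encoder the two models are optimized jointly so that matching query/passage pairs receive high dot products; the random vectors above only mimic the data flow.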
In this tutorial, we want to highlight one "Dense Dual-Encoder" called Dense Passage Retriever. It was introduced by Karpukhin et al. (2020, https://arxiv.org/abs/2004.04906). The abstract of the paper reads:
"Open-domain question answering relies on efficient passage retrieval to select candidate contexts, where traditional sparse vector space models, such as TF-IDF or BM25, are the de facto method. In this work, we show that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual-encoder framework. When evaluated on a wide range of open-domain QA datasets, our dense retriever outperforms a strong Lucene-BM25 system largely by 9%-19% absolute in terms of top-20 passage retrieval accuracy, and helps our end-to-end QA system establish new state-of-the-art on multiple open-domain QA benchmarks."
Use this link to open the notebook in Google Colab.
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
Runtime -> Change Runtime type -> Hardware accelerator -> GPU
# Make sure you have a GPU running
!nvidia-smi
# Install the latest release of Haystack in your own environment
#! pip install farm-haystack

# Install the latest master of Haystack and install the version of torch that works with the colab GPUs
!pip install git+https://github.com/deepset-ai/haystack.git
!pip install torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
from haystack import Finder
from haystack.preprocessor.cleaning import clean_wiki_text
from haystack.preprocessor.utils import convert_files_to_dicts, fetch_archive_from_http
from haystack.reader.farm import FARMReader
from haystack.reader.transformers import TransformersReader
from haystack.utils import print_answers
FAISS is a library for efficient similarity search and clustering of dense vectors.
The FAISSDocumentStore uses a SQL database (SQLite in-memory by default) under the hood
to store the document text and other metadata. The vector embeddings of the text are
indexed in a FAISS index that is later queried to find answers.
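The split between a SQL database for text and a vector index for embeddings can be illustrated with the standard library alone. The sketch below is not how FAISSDocumentStore is implemented: `ToyVectorStore` and its methods are hypothetical names, and an exact brute-force dot-product scan stands in for the FAISS index.

```python
import sqlite3

class ToyVectorStore:
    """Hypothetical store: SQLite keeps text and metadata, a Python list keeps vectors."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE document (id INTEGER PRIMARY KEY, text TEXT, name TEXT)"
        )
        self.vector_ids = []  # position in the vector list -> SQL row id
        self.vectors = []     # stand-in for the vector index

    def write_document(self, text, name, embedding):
        cur = self.db.execute(
            "INSERT INTO document (text, name) VALUES (?, ?)", (text, name)
        )
        self.vector_ids.append(cur.lastrowid)
        self.vectors.append(list(embedding))

    def query_by_embedding(self, query_embedding, top_k=1):
        # Step 1: the vector search returns only ids and similarity scores
        scored = [
            (sum(q * v for q, v in zip(query_embedding, vec)), doc_id)
            for vec, doc_id in zip(self.vectors, self.vector_ids)
        ]
        scored.sort(reverse=True)
        # Step 2: the ids are resolved to text and metadata through SQL
        results = []
        for score, doc_id in scored[:top_k]:
            text, name = self.db.execute(
                "SELECT text, name FROM document WHERE id = ?", (doc_id,)
            ).fetchone()
            results.append({"id": doc_id, "name": name, "text": text, "score": score})
        return results

store = ToyVectorStore()
# Two-dimensional embeddings, invented for this example
store.write_document("Arya Stark is a daughter of Eddard Stark.", "arya.txt", [1.0, 0.0])
store.write_document("Dothraki is a constructed language.", "dothraki.txt", [0.0, 1.0])
store.write_document("Daenerys learns Dothraki.", "mix.txt", [0.5, 0.5])

hits = store.query_by_embedding([0.1, 0.9], top_k=2)
for hit in hits:
    print(hit["name"], hit["score"])
```

Keeping the vectors separate from the text means the similarity search only touches compact numeric data, and the full documents are fetched only for the few ids that come back.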
from haystack.document_store.faiss import FAISSDocumentStore

document_store = FAISSDocumentStore()
As in the previous tutorials, we download, convert and index some Game of Thrones articles into our DocumentStore.
# Let's first get some files that we want to use
doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

# Convert files to dicts
dicts = convert_files_to_dicts(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# Now, let's write the dicts containing documents to our DB.
document_store.write_documents(dicts)
Here: We use a DensePassageRetriever
Alternatives:
ElasticsearchRetriever with custom queries (e.g. boosting) and filters
EmbeddingRetriever to find candidate documents based on the similarity of embeddings (e.g. created via Sentence-BERT)
TfidfRetriever in combination with a SQL or InMemory Document store for simple prototyping and debugging
from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(document_store=document_store,
                                  query_embedding_model="facebook/dpr-question_encoder-single-nq-base",
                                  passage_embedding_model="facebook/dpr-ctx_encoder-single-nq-base",
                                  use_gpu=True,
                                  embed_title=True,
                                  max_seq_len=256,
                                  batch_size=16,
                                  remove_sep_tok_from_untitled_passages=True)

# Important:
# Now that we have the DPR initialized, we need to call update_embeddings() to iterate over all
# previously indexed documents and update their embedding representation.
# While this can be a time-consuming operation (depending on corpus size), it only needs to be done once.
# At query time, we only need to embed the query and compare it to the existing doc embeddings, which is very fast.
document_store.update_embeddings(retriever)
Similar to previous tutorials, we now initialize our reader.
Here we use a FARMReader with the deepset/roberta-base-squad2 model (see: https://huggingface.co/deepset/roberta-base-squad2).
# Load a local model or any of the QA models on
# Hugging Face's model hub (https://huggingface.co/models)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
The Finder sticks together reader and retriever in a pipeline to answer our actual questions.
finder = Finder(reader, retriever)
# You can configure how many candidates the reader and retriever shall return
# The higher top_k_retriever, the better (but also the slower) your answers.
prediction = finder.get_answers(question="Who created the Dothraki vocabulary?", top_k_retriever=10, top_k_reader=5)

#prediction = finder.get_answers(question="Who is the father of Arya Stark?", top_k_retriever=10, top_k_reader=5)
#prediction = finder.get_answers(question="Who is the sister of Sansa?", top_k_retriever=10, top_k_reader=5)
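To see why a higher top_k_retriever can improve answers, consider the following toy retrieve-then-read funnel. It does not reflect how the Finder, the retriever or the reader work internally: `retrieve`, `read` and `get_answers` are hypothetical helpers that score by simple word overlap, and the documents are invented for this example.

```python
import re

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents, top_k_retriever):
    # Stage 1: cheap scoring of whole documents, keep the best top_k_retriever
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k_retriever]

def read(question, candidates, top_k_reader):
    # Stage 2: finer scoring of sentences, but only inside the retrieved candidates
    q = words(question)
    spans = []
    for doc in candidates:
        for sentence in re.split(r"(?<=[.?!])\s+", doc):
            spans.append((len(q & words(sentence)), sentence))
    spans.sort(key=lambda s: s[0], reverse=True)
    return [sentence for _, sentence in spans[:top_k_reader]]

def get_answers(question, documents, top_k_retriever, top_k_reader):
    candidates = retrieve(question, documents, top_k_retriever)
    return read(question, candidates, top_k_reader)

documents = [
    "The Dothraki ride horses. The Dothraki language sounds harsh. "
    "Nobody knows who created the horses.",
    "David Peterson created the Dothraki language.",
    "Arya Stark trains in Braavos.",
]
question = "who created the dothraki language"

# With a single candidate the reader never sees the document holding the best sentence
narrow = get_answers(question, documents, top_k_retriever=1, top_k_reader=1)
# With two candidates the better sentence becomes reachable
wide = get_answers(question, documents, top_k_retriever=2, top_k_reader=1)
print(narrow)
print(wide)
```

The reader can only pick from what the retriever hands over, so a larger top_k_retriever raises the chance that the right document is among the candidates, at the cost of more reader work per question.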