Sparse Embedding Retrieval with Qdrant and FastEmbed


In this notebook, we will see how to use Sparse Embedding Retrieval techniques (such as SPLADE) in Haystack.

We will use the Qdrant Document Store and FastEmbed Sparse Embedders.

Why SPLADE?

  • Sparse Keyword-Based Retrieval (based on BM25 algorithm or similar ones) is simple and fast, requires few resources but relies on lexical matching and struggles to capture semantic meaning.
  • Dense Embedding-Based Retrieval takes semantics into account but requires considerable computational resources, usually does not work well on novel domains, and does not consider precise wording.

While good results can be achieved by combining the two approaches (see this tutorial), SPLADE (Sparse Lexical and Expansion Model for Information Retrieval) introduces a method that combines the strengths of both techniques. In particular, SPLADE uses Language Models like BERT to weigh the relevance of different terms in the query and to perform automatic term expansion, reducing the vocabulary mismatch problem (queries and relevant documents often share few terms).

Main features:

  • Better than dense embedding Retrievers on precise keyword matching
  • Better than BM25 on semantic matching
  • Slower than BM25
  • Still experimental compared to both BM25 and dense embeddings: few models; supported by few Document Stores

Install dependencies

!pip install -U fastembed-haystack qdrant-haystack wikipedia transformers

Sparse Embedding Retrieval

Indexing

Create a Qdrant Document Store

from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    return_embedding=True,
    use_sparse_embeddings=True  # set this parameter to True, otherwise the collection schema won't allow storing sparse vectors
)

Download Wikipedia pages and create raw documents

We download a few Wikipedia pages about animals and create Haystack documents from them.

import wikipedia

from haystack.dataclasses import Document

nice_animals = ["Capybara", "Dolphin"]

raw_docs = []
for title in nice_animals:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url": page.url})
    raw_docs.append(doc)

Initialize a FastembedSparseDocumentEmbedder

The FastembedSparseDocumentEmbedder enriches a list of documents with their sparse embeddings.

We are using prithvida/Splade_PP_en_v1, a good sparse embedding model with a permissive license.

We also want to embed the title of the document, because it contains relevant information.

For more customization options, refer to the docs.

from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder

sparse_doc_embedder = FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1",
                                                      meta_fields_to_embed=["title"])
sparse_doc_embedder.warm_up()

# let's try the embedder
print(sparse_doc_embedder.run(documents=[Document(content="An example document")]))
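
The embedder attaches a SparseEmbedding object to each document. It stores only the indices and values of the nonzero dimensions, which is how sparse vectors stay compact. A quick, optional check (assuming the indices and values attributes of SparseEmbedding, as in haystack.dataclasses):

# inspect the sparse embedding attached to the returned document
result = sparse_doc_embedder.run(documents=[Document(content="An example document")])
emb = result["documents"][0].sparse_embedding
print(f"nonzero dimensions: {len(emb.indices)}")
print(emb.indices[:5], emb.values[:5])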

Indexing pipeline

from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

indexing = Pipeline()
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=4))
indexing.add_component("sparse_doc_embedder", sparse_doc_embedder)
indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "sparse_doc_embedder")
indexing.connect("sparse_doc_embedder", "writer")

Let’s index our documents!

⚠ī¸ If you are running this notebook on Google Colab, please note that Google Colab only provides 2 CPU cores, so the sparse embedding generation could be not as fast as it can be on a standard machine.

indexing.run({"documents":raw_docs})
document_store.count_documents()

Retrieval

Retrieval pipeline

Now, we create a simple retrieval Pipeline:

  • FastembedSparseTextEmbedder: transforms the query into a sparse embedding
  • QdrantSparseEmbeddingRetriever: looks for relevant documents, based on the similarity of the sparse embeddings

from haystack import Pipeline
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder

sparse_text_embedder = FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1")

query_pipeline = Pipeline()
query_pipeline.add_component("sparse_text_embedder", sparse_text_embedder)
query_pipeline.add_component("sparse_retriever", QdrantSparseEmbeddingRetriever(document_store=document_store))

query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")

Try the retrieval pipeline

question = "Where do capybaras live?"

import rich

results = query_pipeline.run({"sparse_text_embedder": {"text": question}})

for d in results["sparse_retriever"]["documents"]:
    rich.print(f"\nid: {d.id}\n{d.content}\nscore: {d.score}\n---")
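
By default, the retriever returns up to 10 documents. If you need fewer (or more), you can pass top_k at query time; a minimal sketch, assuming the standard top_k run parameter of Haystack retrievers:

# retrieve only the 3 most similar documents
results = query_pipeline.run({
    "sparse_text_embedder": {"text": question},
    "sparse_retriever": {"top_k": 3},
})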

Understanding SPLADE vectors

(Inspiration: FastEmbed SPLADE notebook)

We have seen that our model encodes text into a sparse vector (= a vector with many zeros). An efficient representation of sparse vectors is to save the indices and values of nonzero elements.
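
As a toy illustration in plain Python (no library involved):

# a mostly-zero dense vector...
dense = [0.0, 0.0, 1.2, 0.0, 0.5, 0.0]

# ...stored compactly as the indices and values of its nonzero elements
indices = [i for i, v in enumerate(dense) if v != 0.0]  # [2, 4]
values = [v for v in dense if v != 0.0]                 # [1.2, 0.5]
print(indices, values)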

Let’s try to understand what information resides in these vectors…

question = "Where do capybaras live?"
sparse_embedding = sparse_text_embedder.run(text=question)["sparse_embedding"]
rich.print(sparse_embedding.to_dict())

from transformers import AutoTokenizer

# we need the tokenizer vocabulary to map token ids back to strings
tokenizer = AutoTokenizer.from_pretrained("Qdrant/Splade_PP_en_v1")  # ONNX export of the original model


def get_tokens_and_weights(sparse_embedding, tokenizer):
    token_weight_dict = {}
    for i in range(len(sparse_embedding.indices)):
        token = tokenizer.decode([sparse_embedding.indices[i]])
        weight = sparse_embedding.values[i]
        token_weight_dict[token] = weight

    # Sort the dictionary by weights
    token_weight_dict = dict(sorted(token_weight_dict.items(), key=lambda item: item[1], reverse=True))
    return token_weight_dict


rich.print(get_tokens_and_weights(sparse_embedding, tokenizer))

Very nice! đŸĻĢ

  • tokens are ordered by relevance
  • the query is expanded with relevant tokens/terms: “location”, “habitat”…
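
For scoring, Qdrant compares sparse vectors with a dot product, so only dimensions (tokens) that appear in both the query and the document contribute to the score. A minimal sketch with hypothetical toy vectors:

def sparse_dot(q_indices, q_values, d_indices, d_values):
    # only dimensions present in both vectors contribute to the score
    doc = dict(zip(d_indices, d_values))
    return sum(qv * doc[qi] for qi, qv in zip(q_indices, q_values) if qi in doc)

# toy example: the vectors overlap on dimensions 11 and 42
print(sparse_dot([11, 42, 99], [0.8, 1.5, 0.3], [11, 42, 7], [0.5, 1.0, 0.9]))  # 0.8*0.5 + 1.5*1.0 = 1.9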

Hybrid Retrieval

In principle, techniques like SPLADE are intended to replace other approaches (BM25 and Dense Embedding Retrieval) and their combinations.

However, it can sometimes make sense to combine Dense Embedding Retrieval and Sparse Embedding Retrieval. You can find some positive examples in the appendix of the paper An Analysis of Fusion Functions for Hybrid Retrieval. Make sure this approach works for your use case and conduct an evaluation.

Below we show how to create such an application in Haystack.

In the example, we use the Qdrant Hybrid Retriever: it compares dense and sparse query and document embeddings and retrieves the most relevant documents, merging the scores with Reciprocal Rank Fusion.
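
As a refresher, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of document ids (the constant k=60 comes from the original RRF paper; Qdrant's internal implementation details may differ):

def reciprocal_rank_fusion(rankings, k=60):
    # each document collects 1 / (k + rank) from every ranking it appears in
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']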

If you want to customize the behavior further, see the Hybrid Retrieval Pipelines tutorial.

from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder, FastembedDocumentEmbedder
from haystack.document_stores.types import DuplicatePolicy
from haystack import Pipeline

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    return_embedding=True,
    use_sparse_embeddings=True,
    embedding_dim=384  # must match the output dimension of the dense model (BAAI/bge-small-en-v1.5)
)

hybrid_indexing = Pipeline()
hybrid_indexing.add_component("cleaner", DocumentCleaner())
hybrid_indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=4))
hybrid_indexing.add_component("sparse_doc_embedder", FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("dense_doc_embedder", FastembedDocumentEmbedder(model="BAAI/bge-small-en-v1.5", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

hybrid_indexing.connect("cleaner", "splitter")
hybrid_indexing.connect("splitter", "sparse_doc_embedder")
hybrid_indexing.connect("sparse_doc_embedder", "dense_doc_embedder")
hybrid_indexing.connect("dense_doc_embedder", "writer")
hybrid_indexing.run({"documents": raw_docs})

# inspect the first document: it now carries both a dense embedding and a sparse_embedding
document_store.filter_documents()[0]

from haystack_integrations.components.retrievers.qdrant import QdrantHybridRetriever
from haystack_integrations.components.embedders.fastembed import FastembedTextEmbedder


hybrid_query = Pipeline()
hybrid_query.add_component("sparse_text_embedder", FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1"))
hybrid_query.add_component("dense_text_embedder", FastembedTextEmbedder(model="BAAI/bge-small-en-v1.5", prefix="Represent this sentence for searching relevant passages: "))
hybrid_query.add_component("retriever", QdrantHybridRetriever(document_store=document_store))

hybrid_query.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
hybrid_query.connect("dense_text_embedder.embedding", "retriever.query_embedding")

question = "Where do capybaras live?"

results = hybrid_query.run(
    {"dense_text_embedder": {"text": question},
     "sparse_text_embedder": {"text": question}}
)

for d in results["retriever"]["documents"]:
    rich.print(f"\nid: {d.id}\n{d.content}\nscore: {d.score}\n---")

📚 Docs on Sparse Embedding support in Haystack

(Notebook by Stefano Fiorucci)