Improve Retrieval by Embedding Meaningful Metadata


Notebook by Stefano Fiorucci

In this notebook, I do some experiments on embedding meaningful metadata to improve Document retrieval.

%%capture
! pip install wikipedia haystack-ai sentence_transformers rich
import rich

Load data from Wikipedia

We are going to download the Wikipedia pages related to some bands, using the wikipedia Python library.

These pages are converted into Haystack Documents.

some_bands="""The Beatles
Rolling stones
Dire Straits
The Cure
The Smiths""".split("\n")
import wikipedia
from haystack.dataclasses import Document

raw_docs=[]

for title in some_bands:
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url":page.url})
    raw_docs.append(doc)

🔧 Set up the experiment

Utility functions to create Pipelines

The indexing Pipeline transforms the Documents and stores them (together with their embeddings) in a Document Store. The retrieval Pipeline takes a query as input and performs the vector search.

I build some utility functions to create different indexing and retrieval Pipelines.

In fact, I am interested in comparing the standard approach (embedding only the text) with the metadata embedding strategy (embedding the text together with meaningful metadata).

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.utils import ComponentDevice
def create_indexing_pipeline(document_store, metadata_fields_to_embed):

  indexing = Pipeline()
  indexing.add_component("cleaner", DocumentCleaner())
  indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=2))

  # in the following component, we can use the `meta_fields_to_embed` parameter to specify which metadata fields to embed together with the text
  indexing.add_component("doc_embedder", SentenceTransformersDocumentEmbedder(model="thenlper/gte-large",
                                                                              device=ComponentDevice.from_str("cuda:0"),
                                                                              meta_fields_to_embed=metadata_fields_to_embed)
  )
  indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

  indexing.connect("cleaner", "splitter")
  indexing.connect("splitter", "doc_embedder")
  indexing.connect("doc_embedder", "writer")

  return indexing
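
To make the effect of `meta_fields_to_embed` concrete, here is a rough sketch (mine, not the library code) of the text the Document embedder encodes for each chunk: as far as I know, it joins the selected metadata values and the content, by default with a newline separator.

# rough illustration: with meta_fields_to_embed=["title"], the Document embedder
# prepends the metadata value to the content before encoding,
# so the resulting vector also carries the band name
example_doc = Document(content="The band played at the Cavern Club.", meta={"title": "The Beatles"})
fields_to_embed = ["title"]
text_to_embed = "\n".join(
    [str(example_doc.meta[f]) for f in fields_to_embed if example_doc.meta.get(f)]
    + [example_doc.content]
)
print(text_to_embed)
# The Beatles
# The band played at the Cavern Club.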
def create_retrieval_pipeline(document_store):

  retrieval = Pipeline()
  retrieval.add_component("text_embedder", SentenceTransformersTextEmbedder(model="thenlper/gte-large",
                                                                            device=ComponentDevice.from_str("cuda:0")))
  retrieval.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, scale_score=False, top_k=3))

  retrieval.connect("text_embedder", "retriever")

  return retrieval

Create the Pipelines

Let’s define two Document Stores to compare the different approaches.

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
document_store_w_embedded_metadata = InMemoryDocumentStore(embedding_similarity_function="cosine")

Now, I create the 2 indexing pipelines and run them.

indexing_pipe_std = create_indexing_pipeline(document_store=document_store, metadata_fields_to_embed=[])

# here we specify the fields to embed
# we select the field `title`, containing the name of the band
indexing_pipe_w_embedded_metadata = create_indexing_pipeline(document_store=document_store_w_embedded_metadata, metadata_fields_to_embed=["title"])
indexing_pipe_std.run({"cleaner":{"documents":raw_docs}})
indexing_pipe_w_embedded_metadata.run({"cleaner":{"documents":raw_docs}})
print(len(document_store.filter_documents()))
print(len(document_store_w_embedded_metadata.filter_documents()))
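
As an optional sanity check (assuming both stores return the chunks in the same insertion order), we can peek at the first stored chunk in each store: the chunk text is the same, only the text that was embedded (and therefore the vector) differs.

# optional sanity check: same chunks in both stores, different embeddings
sample_std = document_store.filter_documents()[0]
sample_meta = document_store_w_embedded_metadata.filter_documents()[0]
print(sample_std.meta["title"], "->", sample_std.content[:80])
print(sample_std.content == sample_meta.content)              # same chunk text
print(len(sample_std.embedding), len(sample_meta.embedding))  # same embedding size, different values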

Create the 2 retrieval pipelines.

retrieval_pipe_std = create_retrieval_pipeline(document_store=document_store)

retrieval_pipe_w_embedded_metadata = create_retrieval_pipeline(document_store=document_store_w_embedded_metadata)
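
If you want to try more queries later, a small helper like the following (just a convenience I sketch here; the cells below spell out the same steps explicitly) runs a query through both pipelines and prints the retrieved Documents side by side.

# convenience helper: query both retrieval pipelines with the same text
def compare_retrieval(query, preview=3):
  for name, pipeline in [("standard", retrieval_pipe_std),
                         ("embedded metadata", retrieval_pipe_w_embedded_metadata)]:
    res = pipeline.run({"text_embedder": {"text": query}})
    rich.print(f"[bold]{name}[/bold]")
    for doc in res["retriever"]["documents"][:preview]:
      rich.print(doc.content + "\n")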

🧪 Run the experiment!

# standard approach (no metadata embedding)

res=retrieval_pipe_std.run({"text_embedder":{"text":"have the beatles ever been to bangor?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content+"\n")

โŒ the retrieved Documents seem irrelevant

# embedding meaningful metadata

res=retrieval_pipe_w_embedded_metadata.run({"text_embedder":{"text":"have the beatles ever been to bangor?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content+"\n")

✅ the first Document is relevant

# standard approach (no metadata embedding)

res=retrieval_pipe_std.run({"text_embedder":{"text":"What announcements did the band The Cure make in 2022?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content)

โŒ the retrieved Documents seem irrelevant

# embedding meaningful metadata

res=retrieval_pipe_w_embedded_metadata.run({"text_embedder":{"text":"What announcements did the band The Cure make in 2022?"}})
for doc in res['retriever']['documents']:
  rich.print(doc)
  rich.print(doc.content)

✅ some Documents are relevant

โš ๏ธ Notes of caution

  • This technique is not a silver bullet
  • It works well when the embedded metadata are meaningful and distinctive
  • I would say that the embedded metadata should be meaningful from the perspective of the embedding model. For example, I don’t expect embedding numbers to work well (a rough sketch below illustrates this point).
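
For instance, here is a rough sketch of that last point, using a hypothetical metadata field n_tracks that is not part of this notebook:

# hypothetical example: `n_tracks` is an invented metadata field, not used above
numeric_doc = Document(content="The album was recorded at Abbey Road Studios.",
                       meta={"title": "The Beatles", "n_tracks": 13})

# with meta_fields_to_embed=["n_tracks"], the embedded text would roughly be
# "13\nThe album was recorded at Abbey Road Studios.": a bare "13" carries
# almost no semantic signal for a text embedding model
print("\n".join([str(numeric_doc.meta["n_tracks"]), numeric_doc.content]))

# with meta_fields_to_embed=["title"], it would roughly be
# "The Beatles\nThe album was recorded at Abbey Road Studios.": the band name
# is a distinctive entity the model can relate to a query
print("\n".join([numeric_doc.meta["title"], numeric_doc.content]))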