Improving Retrieval with Auto-Merging and Hierarchical Document Retrieval


This notebook shows how to use two experimental Haystack components: the AutoMergingRetriever and the HierarchicalDocumentSplitter.

Setting up

!pip install git+https://github.com/deepset-ai/haystack-experimental.git@main
!pip install haystack-ai

Let’s get a dataset to index and explore

!wget https://raw.githubusercontent.com/amankharwal/Website-data/master/bbc-news-data.csv

Let’s convert the raw data into Haystack Documents

import csv
from typing import List
from haystack import Document

def read_documents() -> List[Document]:
    with open("bbc-news-data.csv", "r") as file:
        reader = csv.reader(file, delimiter="\t")
        next(reader, None)  # skip the headers
        documents = []
        for row in reader:
            category = row[0].strip()
            title = row[2].strip()
            text = row[3].strip()
            documents.append(Document(content=text, meta={"category": category, "title": title}))

    return documents

docs = read_documents()
docs[0:5]

We can see that we have successfully created Documents.
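If you want a closer look at a single Document, you can inspect its content and meta attributes (a quick sanity check; the exact output depends on the dataset):

# Each Document holds the article text in `content` and the category/title we attached in `meta`
print(docs[0].meta)
print(docs[0].content[:200])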

Document Splitting and Indexing

Now we split each document into smaller ones, creating a hierarchical document structure that connects each smaller child document with its corresponding parent document.

We also create two document stores, one for the leaf documents and the other for the parent documents.

from typing import Tuple

from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy

from haystack_experimental.components.splitters import HierarchicalDocumentSplitter

def indexing(documents: List[Document]) -> Tuple[InMemoryDocumentStore, InMemoryDocumentStore]:
    splitter = HierarchicalDocumentSplitter(block_sizes={10, 3}, split_overlap=0, split_by="word")
    docs = splitter.run(documents)

    # Store the leaf documents in one document store
    leaf_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 1]
    leaf_doc_store = InMemoryDocumentStore()
    leaf_doc_store.write_documents(leaf_documents, policy=DuplicatePolicy.OVERWRITE)

    # Store the parent documents in another document store
    parent_documents = [doc for doc in docs["documents"] if doc.meta["__level"] == 0]
    parent_doc_store = InMemoryDocumentStore()
    parent_doc_store.write_documents(parent_documents, policy=DuplicatePolicy.OVERWRITE)

    return leaf_doc_store, parent_doc_store

leaf_doc_store, parent_doc_store = indexing(docs)
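Before querying, it can help to check how many documents ended up in each store. A minimal sketch (the exact counts depend on the block sizes chosen above):

# Leaf documents are the child chunks (meta["__level"] == 1); parent documents are the
# documents they were split from (meta["__level"] == 0), as selected in indexing() above
print(f"Leaf documents:   {leaf_doc_store.count_documents()}")
print(f"Parent documents: {parent_doc_store.count_documents()}")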

Retrieving Documents with Auto-Merging

We are now ready to query the document stores using the AutoMergingRetriever. Let’s build a pipeline that uses the BM25Retriever to handle the user queries and connect it to the AutoMergingRetriever, which, based on the retrieved leaf documents and the hierarchical structure, decides whether to return the leaf documents themselves or their parent document.

from haystack import Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack_experimental.components.retrievers import AutoMergingRetriever

def querying_pipeline(leaf_doc_store: InMemoryDocumentStore, parent_doc_store: InMemoryDocumentStore, threshold: float = 0.6):
    pipeline = Pipeline()
    bm25_retriever = InMemoryBM25Retriever(document_store=leaf_doc_store)
    auto_merge_retriever = AutoMergingRetriever(parent_doc_store, threshold=threshold)
    pipeline.add_component(instance=bm25_retriever, name="BM25Retriever")
    pipeline.add_component(instance=auto_merge_retriever, name="AutoMergingRetriever")
    pipeline.connect("BM25Retriever.documents", "AutoMergingRetriever.matched_leaf_documents")
    return pipeline

Let’s create this pipeline, setting the threshold for the AutoMergingRetriever to 0.6.

pipeline = querying_pipeline(leaf_doc_store, parent_doc_store, threshold=0.6)
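A higher threshold makes the retriever more conservative about merging. As a hedged sketch, assuming the threshold is the fraction of a parent's leaf documents that must be retrieved before the parent replaces them, you could build a stricter pipeline for comparison:

# Hypothetical variant for comparison: with threshold=0.9, a parent document would only be
# returned if roughly 90% of its leaf documents were matched (assumption about the threshold
# semantics; see the AutoMergingRetriever documentation)
strict_pipeline = querying_pipeline(leaf_doc_store, parent_doc_store, threshold=0.9)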

Let’s now query the pipeline for articles related to cybersecurity. We’ll also make use of the pipeline parameter include_outputs_from to get the outputs from the BM25Retriever component as well.

result = pipeline.run(data={'query': 'phishing attacks spoof websites spam e-mails spyware'}, include_outputs_from={'BM25Retriever'})
len(result['AutoMergingRetriever']['documents'])
len(result['BM25Retriever']['documents'])
retrieved_doc_titles_bm25 = sorted([d.meta['title'] for d in result['BM25Retriever']['documents']])
retrieved_doc_titles_bm25
retrieved_doc_titles_automerging = sorted([d.meta['title'] for d in result['AutoMergingRetriever']['documents']])
retrieved_doc_titles_automerging
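To see which results were actually merged, you can check the __level metadata set during splitting. A minimal sketch, assuming the parents keep __level == 0 and the leaves __level == 1, as in the indexing step above:

# Documents with __level == 0 are parents that replaced a group of matched leaves;
# documents with __level == 1 are leaves that were returned unchanged
merged_parents = [d for d in result['AutoMergingRetriever']['documents'] if d.meta.get("__level") == 0]
unmerged_leaves = [d for d in result['AutoMergingRetriever']['documents'] if d.meta.get("__level") == 1]
print(f"{len(merged_parents)} parent document(s) returned, {len(unmerged_leaves)} leaf document(s) returned as-is")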