Tutorial: Text-To-Image Search Pipeline with Multimodal Retriever


Level: Intermediate

Time to complete: 20 minutes

Prerequisites: This tutorial assumes basic knowledge of Haystack Retrievers and Pipelines. If you want to learn about them, have a look at our tutorials on Build Your First QA System and Fine-Tuning a Model on Your Own Data.

Prepare the Colab environment (see links below).

Nodes Used: InMemoryDocumentStore, MultiModalRetriever

Goal: After completing this tutorial, you will have built a search system that retrieves images as answers to a text query.

Description: In this tutorial, you’ll download a set of images that you’ll then turn into embeddings using a transformers model, OpenAI CLIP. You’ll then use the same model to embed the text query. Finally, you’ll perform a nearest neighbor search to retrieve the images relevant to the text query.

Let’s build a text-to-image search pipeline using a small animal dataset!

Preparing the Colab Environment

Installing Haystack

%%bash

pip install --upgrade pip
pip install farm-haystack[colab,inference]

Enabling Telemetry

Knowing you’re using this tutorial helps us decide where to invest our efforts to build a better product, but you can always opt out by commenting out the following line. See Telemetry for more details.

from haystack.telemetry import tutorial_running

tutorial_running(19)

Initializing the DocumentStore

A DocumentStore stores references to the images that Haystack will compare with your query. But before it can do that, you need to initialize it. In this tutorial, you’ll use the InMemoryDocumentStore.

If you want to learn more, see DocumentStore.

from haystack.document_stores import InMemoryDocumentStore

# Initialize the DocumentStore to hold the 512-dimensional image embeddings
# produced by the OpenAI CLIP model
document_store = InMemoryDocumentStore(embedding_dim=512)

Downloading Data

Download 18 sample images of different animals and store them locally. After the download, you can find them in data/tutorial19/spirit-animals/ as a set of .jpg files.

from haystack.utils import fetch_archive_from_http

doc_dir = "data/tutorial19"

fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/spirit-animals.zip",
    output_dir=doc_dir,
)

Create a Haystack Document object for each image you just downloaded, with the image path as its content, and write these Documents into the DocumentStore.

import os

from haystack import Document

images = [
    Document(content=f"./{doc_dir}/spirit-animals/{filename}", content_type="image")
    for filename in os.listdir(f"./{doc_dir}/spirit-animals/")
]

document_store.write_documents(images)

You have successfully stored your images in the DocumentStore.
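The listing step above passes every file in the folder to the DocumentStore. If the folder might also contain non-image files, such as hidden system files, you could filter by extension first. The following is a minimal sketch using only the standard library; the helper name `list_image_paths` and the set of accepted extensions are assumptions made for illustration, not part of Haystack.

```python
from pathlib import Path

# Hypothetical helper, not part of Haystack: the accepted suffixes are an assumption.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_image_paths(directory):
    """Return sorted paths of the files in `directory` whose suffix looks like an image."""
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        # Skip subdirectories and anything without an image-like extension.
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

You could then build the Documents from `list_image_paths(f"./{doc_dir}/spirit-animals/")` instead of `os.listdir()`, passing each returned path as the `content` of a Document with `content_type="image"`.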

Initializing the Retriever

Retrievers sift through all the images and return only those that are relevant to the query. To run a search on images, you’ll use the MultiModalRetriever with the OpenAI CLIP model.

For more details on supported modalities, see MultiModalRetriever.

Before adding the Retriever to your pipeline, let’s configure its parameters:

from haystack.nodes.retriever.multimodal import MultiModalRetriever

retriever_text_to_image = MultiModalRetriever(
    document_store=document_store,
    query_embedding_model="sentence-transformers/clip-ViT-B-32",
    query_type="text",
    document_embedding_models={"image": "sentence-transformers/clip-ViT-B-32"},
)

# Now let's turn our images into embeddings and store them in the DocumentStore.
document_store.update_embeddings(retriever=retriever_text_to_image)

Your retriever is now ready for search!
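To build some intuition for what happens at query time, here is a minimal sketch of a nearest neighbor search over embeddings using NumPy. It assumes cosine similarity as the scoring function and uses toy 3-dimensional vectors in place of real 512-dimensional CLIP embeddings. The function name `top_k_by_cosine` is hypothetical, and the similarity that Haystack actually applies depends on how the DocumentStore is configured.

```python
import numpy as np


def top_k_by_cosine(query_embedding, document_embeddings, top_k=3):
    """Return (index, score) pairs for the top_k documents most similar to the query."""
    query = np.asarray(query_embedding, dtype=float)
    docs = np.asarray(document_embeddings, dtype=float)

    # Normalize so that the dot product equals the cosine similarity.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    scores = docs @ query
    # argsort is ascending, so reverse it to put the best match first.
    best = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in best]


# Toy vectors standing in for image and query embeddings (illustrative values only).
toy_image_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query_embedding = [0.9, 0.1, 0.0]

for index, score in top_k_by_cosine(toy_query_embedding, toy_image_embeddings, top_k=2):
    print(index, round(score, 3))
```

Here the first toy image ranks highest because its vector points in nearly the same direction as the query vector. A text query and an image can be compared this way because the same CLIP model embeds both.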

Creating the MultiModal Search Pipeline

Now add the MultiModalRetriever as a node to a pipeline. This search pipeline takes a text query and returns the most relevant images from the DocumentStore.

from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_node(component=retriever_text_to_image, name="retriever_text_to_image", inputs=["Query"])

Now, you have a pipeline that uses the MultiModalRetriever and takes a text query as input. Let’s try it out.

Searching Through the Images

Use the pipeline run() method to query the images in the DocumentStore. The query argument is where you type your text query. Additionally, you can set the number of images you want the MultiModalRetriever to return using the top_k parameter. To learn more about setting arguments, see Pipeline Arguments.

results = pipeline.run(query="Animal that lives in the water", params={"retriever_text_to_image": {"top_k": 3}})

# Sort the results based on the scores
results = sorted(results["documents"], key=lambda d: d.score, reverse=True)

for doc in results:
    print(doc.score, doc.content)

Here are some more query strings you could try out:

  1. King of the Jungle
  2. Fastest animal
  3. Bird that can see clearly even in the dark
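To try several queries in one go, you could wrap the call to `run()` in a small helper. This is a sketch rather than a Haystack API: the function name `run_queries` is hypothetical, and it assumes the pipeline built earlier in this tutorial, whose Retriever node is named "retriever_text_to_image".

```python
def run_queries(pipeline, queries, top_k=3):
    """Run each query through the pipeline and collect (score, content) pairs.

    `pipeline` is assumed to expose run(query=..., params=...) and to contain
    a Retriever node named "retriever_text_to_image", as in this tutorial.
    """
    results_by_query = {}
    for query in queries:
        output = pipeline.run(query=query, params={"retriever_text_to_image": {"top_k": top_k}})
        # Sort by score so the best match for each query comes first.
        ranked = sorted(output["documents"], key=lambda d: d.score, reverse=True)
        results_by_query[query] = [(doc.score, doc.content) for doc in ranked]
    return results_by_query
```

Calling `run_queries(pipeline, ["King of the Jungle", "Fastest animal"])` returns a dictionary that maps each query string to its ranked list of scores and image paths.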

You can also easily visualize these images together with their scores using this code:

from io import BytesIO
from PIL import Image, ImageDraw, ImageOps
from IPython.display import display, Image as IPImage


def display_img_array(ima, score):
    im = Image.open(ima)
    img_with_border = ImageOps.expand(im, border=20, fill="white")

    # Add Text to an image
    img = ImageDraw.Draw(img_with_border)
    img.text((20, 0), f"Score: {score},    Path: {ima}", fill=(0, 0, 0))

    bio = BytesIO()
    img_with_border.save(bio, format="png")
    display(IPImage(bio.getvalue(), format="png"))


images_array = [doc.content for doc in results]
scores = [doc.score for doc in results]
for ima, score in zip(images_array, scores):
    display_img_array(ima, score)

Congratulations! You’ve created a search system that returns images of animals in answer to a text query.