Multilingual RAG on a Podcast


Notebook by Stefano Fiorucci

This notebook shows how to create a multilingual Retrieval Augmented Generation application, starting from a podcast.

🧰 Stack:

  • Haystack LLM framework
  • OpenAI Whisper model for audio transcription
  • Qdrant vector database
  • multilingual embedding model: multilingual-e5-large
  • multilingual LLM: Mixtral-8x7B-Instruct-v0.1

Installation

%%capture
! pip install -U haystack-ai qdrant-haystack "openai-whisper>=20231106" pytube "sentence-transformers>=3.0.0" "huggingface_hub>=0.23.0"

Podcast transcription

  • download the audio from YouTube using pytube
  • transcribe it locally using Haystack’s LocalWhisperTranscriber with the whisper-small model. We could use bigger models, which produce better transcriptions but take longer. We could also call the paid OpenAI API, using RemoteWhisperTranscriber (a sketch is shown after the commented-out code below).

Since the transcription takes some time (about 10 minutes), I commented out the following code and will provide the transcription.

# # https://www.tutorialspoint.com/download-video-in-mp3-format-using-pytube

# from pytube import YouTube

# url = "https://www.youtube.com/watch?v=vrf4_XMSlE0"


# video = YouTube(url)
# stream = video.streams.filter(only_audio=True).first()
# stream.download(filename=f"podcast.mp3")
# from haystack.components.audio import LocalWhisperTranscriber
# from haystack.utils import ComponentDevice

# whisper = LocalWhisperTranscriber(model="small", device=ComponentDevice.from_str("cuda:0"),)
# whisper.warm_up()
# transcription = whisper.run(sources=["podcast.mp3"])

# with open('podcast_transcript_whisper_small.txt', 'w') as fo:
#   fo.write(transcription["documents"][0].content)
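
If you prefer to skip the local transcription entirely, here is a minimal sketch of the RemoteWhisperTranscriber alternative mentioned above. It is kept commented out too, since it needs the downloaded audio file and assumes a paid OpenAI key exposed via the OPENAI_API_KEY environment variable (which the component reads by default).

# from haystack.components.audio import RemoteWhisperTranscriber

# # "whisper-1" is OpenAI's hosted Whisper model
# remote_whisper = RemoteWhisperTranscriber(model="whisper-1")
# remote_transcription = remote_whisper.run(sources=["podcast.mp3"])
# print(remote_transcription["documents"][0].content[:300])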

Indexing pipeline

Create an Indexing pipeline that stores chunks of the transcript in the Qdrant vector database.

# download the podcast transcript
# to create the transcript, you can uncomment and run the previous section
!wget "https://raw.githubusercontent.com/deepset-ai/haystack-cookbook/main/data/multilingual_rag_podcast/podcast_transcript_whisper_small.txt"

# let's take a look at the beginning of our 🇮🇹 transcript
!head --bytes 300 podcast_transcript_whisper_small.txt
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack import Document
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.utils import ComponentDevice

# initialize the Document store
document_store = QdrantDocumentStore(
    ":memory:",
    embedding_dim=1024,  # the embedding_dim should match that of the embedding model (multilingual-e5-large produces 1024-dimensional vectors)
)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("text_file_converter", TextFileToDocument())
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))

indexing_pipeline.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(
        model="intfloat/multilingual-e5-large",  # good multilingual model: https://huggingface.co/intfloat/multilingual-e5-large
        device=ComponentDevice.from_str("cuda:0"),    # load the model on GPU
        prefix="passage:",  # as explained in the model card (https://huggingface.co/intfloat/multilingual-e5-large#faq), documents should be prefixed with "passage:"
    ))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# connect the components
indexing_pipeline.connect("text_file_converter", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
# show the pipeline

from IPython.display import Image

indexing_pipeline.draw('indexing_pipeline.png')
Image('indexing_pipeline.png')
# Run the pipeline! 🚀
res = indexing_pipeline.run({"text_file_converter":{"sources":["podcast_transcript_whisper_small.txt"]}})
document_store.count_documents()
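
As a quick sanity check (a minimal sketch using the generic Document Store API), we can also peek at the first indexed chunk:

# peek at the first indexed chunk
sample_docs = document_store.filter_documents()
print(f"{len(sample_docs)} chunks indexed")
print(sample_docs[0].content[:200])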

RAG pipeline

Finally, our RAG pipeline: from an Italian podcast 🇮🇹🎧 to answering questions in English 🇬🇧

  • SentenceTransformersTextEmbedder transforms the query into a vector that captures its semantics, to allow vector retrieval.
  • QdrantEmbeddingRetriever compares the query and Document embeddings and fetches the Documents most relevant to the query.
  • PromptBuilder prepares the prompt for the LLM: it renders a prompt template and fills in the variable values.
  • HuggingFaceAPIGenerator allows using LLMs via the (free) Hugging Face Serverless Inference API.

from haystack.components.generators import HuggingFaceAPIGenerator
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from pprint import pprint
from getpass import getpass
import os

os.environ["HF_API_TOKEN"] = getpass("Enter your Hugging Face Token: https://huggingface.co/settings/tokens ")
# load the model (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) and try the Generator
# you first need to accept Mistral conditions here: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

generator = HuggingFaceAPIGenerator(
        api_type="serverless_inference_api",
        api_params={"model": "mistralai/Mistral-7B-Instruct-v0.1"},
        generation_kwargs={"max_new_tokens":500})

generator.run("Please explain in a fun way why vim is the ultimate IDE")
# define a multilingual prompt template

prompt_template = """
Using only the information contained in these documents in Italian, answer the question using English.
If the answer cannot be inferred from the documents, respond \"I don't know\".
Documents:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}
Question: {{question}}
Answer:
"""
# define the query pipeline
query_pipeline = Pipeline()

query_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="intfloat/multilingual-e5-large",  # good multilingual model: https://huggingface.co/intfloat/multilingual-e5-large
        device=ComponentDevice.from_str("cuda:0"),  # load the model on GPU
        prefix="query:",  # as explained in the model card (https://huggingface.co/intfloat/multilingual-e5-large#faq), queries should be prefixed with "query:"
    ))
query_pipeline.add_component("retriever", QdrantEmbeddingRetriever(document_store=document_store))
query_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
query_pipeline.add_component("generator", generator)

# connect the components
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "generator")
# show the pipeline

from IPython.display import Image

query_pipeline.draw('query_pipeline.png')
Image('query_pipeline.png')
# try the pipeline

question = "What is Pointer Podcast?"
results = query_pipeline.run(
    {   "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
    }
)

for d in results['generator']['replies']:
  pprint(d)

✨ Nice!

# let's create a simple wrapper to call the pipeline and show the answers

def ask_rag(question: str):
  results = query_pipeline.run(
      {
          "text_embedder": {"text": question},
          "prompt_builder": {"question": question},
      }
  )

  for d in results["generator"]["replies"]:
      pprint(d)

Try our multilingual RAG application!

import random
questions="""What are some interesting directions in Large Language Models?
What is Haystack?
What is Ollama?
How did Stefano end up working at deepset?
Will open source models achieve the quality of closed ones?
What are the main features of Haystack 2.0?
Summarize in a bulleted list the main stages of training a Large Language Model
What is Zephyr?
What is quantization of Large Language Models and why is it interesting?
Could you point out the names of the hosts and guests of the podcast?""".split("\n")
q = random.choice(questions)
print(q)
ask_rag(q)