Integration: Whisper
Transcribe audio files with OpenAI's Whisper, locally or via the OpenAI API
Overview
The whisper-haystack integration provides two components that transcribe audio files into Haystack documents using OpenAI’s Whisper model:
- LocalWhisperTranscriber: runs Whisper on your own machine. The audio is never sent to a third party.
- RemoteWhisperTranscriber: transcribes audio with the OpenAI Whisper API (and other OpenAI-compatible providers).
Both components are typically used as the first step of an indexing pipeline. They were previously part of Haystack core and now live in the whisper-haystack integration package, maintained in the haystack-core-integrations repository.
Installation
pip install whisper-haystack
This is all you need for RemoteWhisperTranscriber, which uses the OpenAI Whisper API (set the OPENAI_API_KEY environment variable).
To use LocalWhisperTranscriber, also install the optional openai-whisper dependency and make sure
ffmpeg is available on your system:
pip install -U openai-whisper
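Before running LocalWhisperTranscriber, it can help to confirm that both prerequisites are in place. The following is a minimal sketch using only the Python standard library. The helper name `check_local_whisper_prereqs` is hypothetical and not part of whisper-haystack. It relies on the fact that the openai-whisper package installs an importable module named `whisper` and that ffmpeg is found through the `PATH`.

```python
import importlib.util
import shutil


def check_local_whisper_prereqs() -> dict:
    """Report whether the optional pieces for local transcription are present.

    Hypothetical helper for illustration; not part of whisper-haystack.
    """
    return {
        # openai-whisper installs an importable top-level module named "whisper"
        "openai-whisper": importlib.util.find_spec("whisper") is not None,
        # ffmpeg must be discoverable on PATH
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_local_whisper_prereqs().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

If either entry is reported as missing, install it before creating a LocalWhisperTranscriber.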
Usage
RemoteWhisperTranscriber
RemoteWhisperTranscriber transcribes audio with the OpenAI Whisper API. Set your OPENAI_API_KEY and pass the audio sources to transcribe:
import os
from haystack_integrations.components.audio.whisper import RemoteWhisperTranscriber
os.environ["OPENAI_API_KEY"] = "your-api-key"
transcriber = RemoteWhisperTranscriber()
result = transcriber.run(sources=["path/to/audio/file.mp3"])
print(result["documents"][0].content)
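In practice you may prefer to export OPENAI_API_KEY in your shell rather than assign it in source code. The sketch below, which uses only the standard library, fails early with a readable message when the variable is missing. The helper name `require_api_key` is an assumption made for illustration and is not part of whisper-haystack.

```python
import os


def require_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of whisper-haystack.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"{var_name} is not set; export it before creating RemoteWhisperTranscriber."
        )
    return key
```

Calling `require_api_key()` before constructing the transcriber surfaces a missing key at startup instead of on the first request.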
LocalWhisperTranscriber
LocalWhisperTranscriber runs the Whisper model on your machine. Choose a model size (for example tiny, base, or small) and transcribe your audio files:
from haystack_integrations.components.audio.whisper import LocalWhisperTranscriber
transcriber = LocalWhisperTranscriber(model="small")
transcriber.warm_up()
result = transcriber.run(sources=["path/to/audio/file.mp3"])
print(result["documents"][0].content)
In a pipeline
The pipeline below fetches an audio file from a URL with LinkContentFetcher and transcribes it with LocalWhisperTranscriber:
from haystack import Pipeline
from haystack.components.fetchers import LinkContentFetcher
from haystack_integrations.components.audio.whisper import LocalWhisperTranscriber
pipe = Pipeline()
pipe.add_component("fetcher", LinkContentFetcher())
pipe.add_component("transcriber", LocalWhisperTranscriber(model="tiny"))
pipe.connect("fetcher", "transcriber")
result = pipe.run(
    data={
        "fetcher": {
            "urls": [
                "https://github.com/deepset-ai/haystack/raw/refs/heads/main/test/test_files/audio/MLK_Something_happening.mp3"
            ]
        }
    }
)
print(result["transcriber"]["documents"][0].content)
Alternatively, the pipeline below indexes audio files from a local folder using LocalWhisperTranscriber, DocumentCleaner, DocumentSplitter, and DocumentWriter:
from pathlib import Path
from haystack import Pipeline
from haystack_integrations.components.audio.whisper import LocalWhisperTranscriber
from haystack.components.preprocessors import DocumentSplitter, DocumentCleaner
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
document_store = InMemoryDocumentStore()
pipeline = Pipeline()
pipeline.add_component(instance=LocalWhisperTranscriber(model="small"), name="transcriber")
pipeline.add_component(instance=DocumentCleaner(), name="cleaner")
pipeline.add_component(instance=DocumentSplitter(), name="splitter")
pipeline.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
pipeline.connect("transcriber.documents", "cleaner.documents")
pipeline.connect("cleaner.documents", "splitter.documents")
pipeline.connect("splitter.documents", "writer.documents")
# The transcriber's run input is named "sources", as in the standalone examples above.
pipeline.run({"transcriber": {"sources": list(Path("path/to/audio/folder").iterdir())}})
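Note that `Path.iterdir()` returns every entry in the folder, including subdirectories and non-audio files. A minimal sketch of a filter is shown below. The helper name `list_audio_files` and the set of extensions are assumptions chosen for illustration, so adjust them to the formats you actually have.

```python
from pathlib import Path

# Assumed set of audio extensions; extend or trim it for your own data.
AUDIO_SUFFIXES = {".mp3", ".mp4", ".m4a", ".wav", ".flac", ".ogg", ".webm"}


def list_audio_files(folder, suffixes=AUDIO_SUFFIXES):
    """Return regular files in `folder` whose extension is in `suffixes`, sorted by path.

    Hypothetical helper for illustration; not part of whisper-haystack.
    """
    return sorted(
        p
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )
```

The result can be passed to the transcriber as its `sources` input, for example `pipeline.run({"transcriber": {"sources": list_audio_files("path/to/audio/folder")}})`.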
License
whisper-haystack is distributed under the terms of the
Apache-2.0 license.
