Integration: Google Drive
Search and fetch files from Google Drive via the Drive API.
Overview
This integration brings files from Google Drive into your Haystack pipelines through the Google Drive API v3. It ships two components:
GoogleDriveRetriever: runs a full-text search over the user’s Drive (and optionally shared drives) via the files.list endpoint and returns one Haystack Document per matching file. Each document carries resource metadata (file_name, file_id, web_url, mime_type, file_extension, author, timestamps). By default the content is the file description or name; set include_content=True to export native Google Docs/Sheets/Slides to text and use that as the content. Binary files (PDF, DOCX, …) are never downloaded.
GoogleDriveFetcher: downloads the full content of Drive files and returns them as ByteStreams, ready for a downstream converter. Binary files are downloaded as-is, native Google Docs/Sheets/Slides are exported (by default to DOCX/XLSX/PPTX), and folders or non-downloadable Google types are skipped. Feed it the retriever’s documents or a list of file ids / Drive URLs.
The two components are designed to work together, and with the OAuth integration for authentication, but each can be used on its own.
Installation
pip install google-drive-haystack
Authentication
Both components take a per-user access_token as a run input: a delegated Google OAuth bearer token for the
user whose Drive is searched or fetched. The token must carry a scope that allows reading file content, for example
https://www.googleapis.com/auth/drive.readonly. The metadata-only drive.metadata.readonly scope cannot search
file content or export documents.
You typically obtain this token from an upstream OAuthTokenResolver (provided by the
OAuth integration) and wire it into the components’ access_token
input. In the standalone examples below the token is passed directly as a string for brevity.
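The scope requirement can be checked before a token ever reaches the components. The sketch below is a hypothetical pre-flight helper (can_read_file_content is not part of this integration) that inspects the space-separated scope string an OAuth token response typically carries. The allow-list is an assumption: it treats only the full drive scope and drive.readonly as sufficient for reading content.

```python
# Hypothetical pre-flight check; not part of google-drive-haystack.
DRIVE_SCOPE_PREFIX = "https://www.googleapis.com/auth/"

# Assumption: a conservative allow-list of scopes that permit reading file content.
CONTENT_SCOPES = frozenset(
    {
        DRIVE_SCOPE_PREFIX + "drive",
        DRIVE_SCOPE_PREFIX + "drive.readonly",
    }
)


def can_read_file_content(granted_scopes: str) -> bool:
    """Return True if the space-separated scope string includes a content-reading scope."""
    return any(scope in CONTENT_SCOPES for scope in granted_scopes.split())


if __name__ == "__main__":
    # A read-only Drive scope is enough; the metadata-only scope is not.
    print(can_read_file_content(DRIVE_SCOPE_PREFIX + "drive.readonly"))
    print(can_read_file_content(DRIVE_SCOPE_PREFIX + "drive.metadata.readonly"))
```

Failing fast on a metadata-only token gives a clearer error than a rejected search or export request later in the pipeline.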
Usage
Search Google Drive
from haystack_integrations.components.retrievers.google_drive import GoogleDriveRetriever
retriever = GoogleDriveRetriever(top_k=5)
# `access_token` is a per-user delegated Google OAuth bearer token.
result = retriever.run(query="quarterly roadmap", access_token="my-delegated-google-token")
for document in result["documents"]:
    print(document.meta["file_name"], "->", document.meta["web_url"])
Pass include_content=True to export native Google Docs/Sheets/Slides to text, include_shared_drives=True to span
shared drives, or a query_filter (such as "'<folderId>' in parents") to scope the search.
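A query_filter is a plain string in the Drive API search syntax, so it can be assembled programmatically. The following is a minimal sketch, assuming you want to combine a parent-folder clause with a MIME-type clause; build_query_filter and its parameters are hypothetical names, not part of this integration. Single quotes and backslashes in values are escaped, as the Drive query syntax requires.

```python
from typing import List, Optional


def _quote(value: str) -> str:
    """Quote a string literal for a Drive query, escaping backslashes and single quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_query_filter(
    folder_id: Optional[str] = None,
    mime_type: Optional[str] = None,
    exclude_trashed: bool = False,
) -> str:
    """Join the requested clauses with `and`; returns "" when nothing is requested."""
    clauses: List[str] = []
    if folder_id:
        clauses.append(f"{_quote(folder_id)} in parents")
    if mime_type:
        clauses.append(f"mimeType = {_quote(mime_type)}")
    if exclude_trashed:
        clauses.append("trashed = false")
    return " and ".join(clauses)


if __name__ == "__main__":
    # The folder id here is a made-up placeholder.
    print(build_query_filter(folder_id="1AbCdEf", mime_type="application/pdf"))
```

The resulting string is what you would supply as the query_filter mentioned above. This page does not state whether query_filter is a constructor or a run argument, so check the component's signature before wiring it in.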
Fetch full content
The retriever returns only metadata (and optionally exported text). Pass its documents (or raw file ids / Drive
URLs) to the fetcher to download the full content as ByteStreams:
from haystack_integrations.components.fetchers.google_drive import GoogleDriveFetcher
fetcher = GoogleDriveFetcher()
result = fetcher.run(
    access_token="my-delegated-google-token",
    targets=["https://drive.google.com/file/d/1AbCdEfGhIjKlMnOpQrStUvWxYz/view"],
)
streams = result["streams"]
Each ByteStream’s meta carries file_id, web_url, file_name, and content_type, so you can route the
streams to the right converter, for example a FileTypeRouter in front of PyPDFToDocument, DOCXToDocument,
XLSXToDocument, or PPTXToDocument.
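To see what a fetch returned before wiring up a router, the streams can be bucketed by the content_type in their meta. This is a rough sketch only: FakeStream is a stand-in for Haystack's ByteStream so the example runs without any dependencies, and group_by_content_type is a hypothetical helper, not part of the integration.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


@dataclass
class FakeStream:
    """Stand-in for a Haystack ByteStream: raw bytes plus a meta dict."""

    data: bytes
    meta: Dict[str, Any] = field(default_factory=dict)


def group_by_content_type(streams: List[Any]) -> Dict[str, List[Any]]:
    """Bucket streams by meta["content_type"]; streams without one go under "unknown"."""
    groups: Dict[str, List[Any]] = defaultdict(list)
    for stream in streams:
        groups[stream.meta.get("content_type") or "unknown"].append(stream)
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample data in place of a real fetcher result.
    streams = [
        FakeStream(b"%PDF-1.7", {"file_name": "roadmap.pdf", "content_type": "application/pdf"}),
        FakeStream(b"PK", {"file_name": "notes.docx", "content_type": DOCX}),
        FakeStream(b"", {"file_name": "mystery"}),
    ]
    for content_type, group in group_by_content_type(streams).items():
        print(content_type, [s.meta["file_name"] for s in group])
```

In a pipeline, a FileTypeRouter does this routing for you; the helper is only meant for ad-hoc inspection or logging of a fetch result.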
End-to-end pipeline
The components shine when combined: resolve a token once with the OAuth integration, search with the retriever, and download the matching files with the fetcher.
from haystack import Pipeline
from haystack.utils import Secret
from haystack_integrations.components.connectors.oauth import OAuthTokenResolver
from haystack_integrations.utils.oauth import OAuthRefreshTokenSource
from haystack_integrations.components.retrievers.google_drive import GoogleDriveRetriever
from haystack_integrations.components.fetchers.google_drive import GoogleDriveFetcher
pipe = Pipeline()
pipe.add_component(
    "oauth",
    OAuthTokenResolver(
        token_source=OAuthRefreshTokenSource(
            token_url="https://oauth2.googleapis.com/token",
            client_id="aaa-bbb-ccc",
            refresh_token=Secret.from_env_var("GOOGLE_REFRESH_TOKEN"),
            scopes=["https://www.googleapis.com/auth/drive.readonly"],
        ),
    ),
)
pipe.add_component("retriever", GoogleDriveRetriever(top_k=5))
pipe.add_component("fetcher", GoogleDriveFetcher())
# Feed the resolved token into both components and the retrieved hits into the fetcher.
pipe.connect("oauth.access_token", "retriever.access_token")
pipe.connect("oauth.access_token", "fetcher.access_token")
pipe.connect("retriever.documents", "fetcher.targets")
result = pipe.run({"retriever": {"query": "quarterly roadmap"}})
streams = result["fetcher"]["streams"]
From here, connect the fetcher’s streams to a FileTypeRouter and the appropriate converters to turn the raw
content into Haystack Documents for indexing or RAG.
License
google-drive-haystack is distributed under the terms of the
Apache-2.0 license.
