Integration: Microsoft SharePoint
Search and fetch content from Microsoft SharePoint and OneDrive via the Microsoft Graph API.
Overview
This integration brings content from Microsoft SharePoint and OneDrive into your Haystack pipelines through the Microsoft Graph API. It ships two components:
MSSharePointRetriever: searches SharePoint and OneDrive via the Microsoft Search (Graph) API and returns one Haystack Document per hit. Each document’s content is the search snippet and its meta carries the resource metadata (file_name, web_url, entity_type, timestamps, author info, mime_type, …) plus the SharePoint identifiers a downstream fetcher needs. It does not download the underlying files.
MSSharePointFetcher: downloads the full content of SharePoint and OneDrive items and returns them as ByteStreams, ready for a downstream converter. Files (driveItem) come back as their raw bytes, list items as JSON, and SharePoint pages (sitePage) as HTML. Feed it the retriever’s documents or a list of web_urls.
The two components are designed to work together, and with the OAuth integration for authentication, but each can be used on its own.
Installation
pip install microsoft-sharepoint-haystack
Authentication
Both components take a per-user access_token as a run input: a delegated Microsoft Graph bearer token for the
user whose content is searched or fetched. The Microsoft Search API supports delegated permissions only. The token
must carry the relevant delegated scopes, for example Files.Read.All for files and Sites.Read.All for
site/list scoping and SharePoint pages.
You typically obtain this token from an upstream OAuthTokenResolver (provided by the
OAuth integration) and wire it into the components’ access_token
input. In the examples below the token is passed directly as a string for brevity.
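When a request fails with a permissions error, it can help to check which delegated scopes a token carries. The sketch below is a debugging aid only, not part of this integration: it assumes the access token happens to be a three-part JWT whose payload lists scopes in a space-separated `scp` claim, which Microsoft does not guarantee for Graph tokens. It does not verify the signature. The helper names `delegated_scopes` and `missing_scopes` are hypothetical.

```python
import base64
import json
from typing import Set


def delegated_scopes(access_token: str) -> Set[str]:
    """Return the scopes in the token's `scp` claim (assumed JWT, no signature check)."""
    parts = access_token.split(".")
    if len(parts) != 3:
        raise ValueError("token is not a three-part JWT; treat it as opaque")
    payload_b64 = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    # Assumption: delegated scopes are a space-separated string under `scp`.
    return set(claims.get("scp", "").split())


def missing_scopes(access_token: str, required: Set[str]) -> Set[str]:
    """Return the required scopes that the token does not carry."""
    return required - delegated_scopes(access_token)
```

For example, `missing_scopes(token, {"Files.Read.All", "Sites.Read.All"})` returns an empty set when both scopes mentioned above are present.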
Usage
Search SharePoint and OneDrive
from haystack_integrations.components.retrievers.microsoft_sharepoint import MSSharePointRetriever
retriever = MSSharePointRetriever(top_k=5)
# `access_token` is a per-user delegated Microsoft Graph bearer token.
result = retriever.run(query="quarterly roadmap", access_token="my-delegated-graph-token")
for document in result["documents"]:
    print(document.meta["file_name"], "->", document.meta["web_url"])
You can scope or filter results with
Keyword Query Language (KQL)
operators directly in the query, for example quarterly roadmap filetype:docx or
path:"https://contoso.sharepoint.com/sites/Team".
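If you build such queries programmatically, a small helper keeps the KQL restrictions in one place. This is a minimal sketch in plain Python: `build_kql_query` is a hypothetical name, not part of this integration, and it only covers the two operators mentioned above (`filetype:` and `path:`). It does no escaping, so it assumes the path contains no double quotes.

```python
from typing import Optional


def build_kql_query(
    keywords: str,
    filetype: Optional[str] = None,
    path: Optional[str] = None,
) -> str:
    """Combine free-text keywords with optional KQL property restrictions."""
    parts = [keywords.strip()]
    if filetype:
        # Restrict hits to one file extension, e.g. filetype:docx
        parts.append(f"filetype:{filetype}")
    if path:
        # Restrict hits to a site or folder; the URL is quoted.
        parts.append(f'path:"{path}"')
    # Drop empty pieces so a restriction-only query has no leading space.
    return " ".join(part for part in parts if part)


query = build_kql_query("quarterly roadmap", filetype="docx")
print(query)
```

The resulting string is passed as the `query` argument of `retriever.run(...)`, exactly like a hand-written query.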
Fetch full content
The retriever only returns snippets and metadata. Pass its documents (or raw web_url strings) to the fetcher to
download the full content as ByteStreams:
from haystack_integrations.components.fetchers.microsoft_sharepoint import MSSharePointFetcher
fetcher = MSSharePointFetcher()
result = fetcher.run(
    access_token="my-delegated-graph-token",
    targets=["https://contoso.sharepoint.com/sites/contoso-team/contoso-designs.docx"],
)
streams = result["streams"]
Each ByteStream’s meta carries url, file_name, content_type, and a normalized entity_type
(driveItem, listItem, or sitePage), so you can route the streams to the right converter, for example with a
FileTypeRouter in front of PyPDFToDocument, DOCXToDocument, or HTMLToDocument.
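The routing decision itself can be illustrated on the meta fields alone. The sketch below is plain Python on dictionaries shaped like a stream's `meta`. In a real pipeline a FileTypeRouter would do this work, and you would pass `stream.meta` for each returned ByteStream. The MIME-type table, the `"unrouted"` fallback, and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

# Assumed mapping from MIME type to the converter named in the text above.
CONVERTER_BY_CONTENT_TYPE = {
    "application/pdf": "PyPDFToDocument",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "DOCXToDocument",
    "text/html": "HTMLToDocument",
}


def pick_converter(meta: Mapping[str, str]) -> str:
    """Choose a converter name from a stream's entity_type and content_type."""
    # SharePoint pages come back as HTML regardless of the reported MIME type.
    if meta.get("entity_type") == "sitePage":
        return "HTMLToDocument"
    # Strip parameters such as "; charset=utf-8" before the lookup.
    content_type = (meta.get("content_type") or "").split(";")[0].strip().lower()
    return CONVERTER_BY_CONTENT_TYPE.get(content_type, "unrouted")


def group_by_converter(metas: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Group file names by the converter that should handle them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for meta in metas:
        groups[pick_converter(meta)].append(meta.get("file_name", ""))
    return dict(groups)
```

List items, which arrive as JSON, fall into `"unrouted"` here because the text above names no converter for them.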
End-to-end pipeline
The components shine when combined: resolve a token once with the OAuth integration, search with the retriever, and download the matching items with the fetcher.
from haystack import Pipeline
from haystack.utils import Secret
from haystack_integrations.components.connectors.oauth import OAuthTokenResolver
from haystack_integrations.utils.oauth import OAuthRefreshTokenSource
from haystack_integrations.components.retrievers.microsoft_sharepoint import MSSharePointRetriever
from haystack_integrations.components.fetchers.microsoft_sharepoint import MSSharePointFetcher
pipe = Pipeline()
pipe.add_component(
    "oauth",
    OAuthTokenResolver(
        token_source=OAuthRefreshTokenSource(
            token_url="https://login.microsoftonline.com/common/oauth2/v2.0/token",
            client_id="aaa-bbb-ccc",
            refresh_token=Secret.from_env_var("MS_REFRESH_TOKEN"),
            scopes=["https://graph.microsoft.com/Files.Read.All", "offline_access"],
        ),
    ),
)
pipe.add_component("retriever", MSSharePointRetriever(top_k=5))
pipe.add_component("fetcher", MSSharePointFetcher())
# Feed the resolved token into both components and the retrieved hits into the fetcher.
pipe.connect("oauth.access_token", "retriever.access_token")
pipe.connect("oauth.access_token", "fetcher.access_token")
pipe.connect("retriever.documents", "fetcher.targets")
result = pipe.run({"retriever": {"query": "quarterly roadmap"}})
streams = result["fetcher"]["streams"]
From here, connect the fetcher’s streams to a FileTypeRouter and the appropriate converters to turn the raw
content into Haystack Documents for indexing or RAG.
License
microsoft-sharepoint-haystack is distributed under the terms of the
Apache-2.0 license.
