Integration: Firecrawl
Crawl websites and extract LLM-ready content using Firecrawl
Overview
Firecrawl turns websites into LLM-ready data. It handles JavaScript rendering, anti-bot bypassing, and outputs clean Markdown.
This integration provides a FirecrawlCrawler component that crawls one or more URLs and returns the content as Haystack Document objects. Crawling starts from each given URL and follows links to discover subpages, up to a configurable limit.
You need a Firecrawl API key to use this integration. You can get one at firecrawl.dev.
Installation
pip install firecrawl-haystack
Usage
Components
This integration provides the following component:
FirecrawlCrawler: Crawls URLs and their subpages, returning extracted content as Haystack Documents.
Basic Example
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler
crawler = FirecrawlCrawler(params={"limit": 5})
result = crawler.run(urls=["https://docs.haystack.deepset.ai/docs/intro"])
documents = result["documents"]
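As a rough sketch of what can be done with the returned documents, the helper below builds a short preview line for each one. It assumes only that every document exposes content and meta attributes, as Haystack Document objects do. The helper name preview_documents and the "url" metadata key are hypothetical; the keys Firecrawl actually stores in meta may differ.

```python
from types import SimpleNamespace


def preview_documents(documents, max_chars=80):
    """Return one short preview line per document (hypothetical helper)."""
    lines = []
    for index, doc in enumerate(documents):
        # Collapse newlines so each preview fits on a single line.
        text = (doc.content or "").strip().replace("\n", " ")
        # "url" is an assumed metadata key; fall back when it is absent.
        source = (doc.meta or {}).get("url", "unknown source")
        lines.append(f"[{index}] {source}: {text[:max_chars]}")
    return lines


# Stand-in object so the sketch runs without an API key;
# in practice, pass result["documents"] from crawler.run(...).
sample = [
    SimpleNamespace(
        content="# Intro\nHaystack is ...",
        meta={"url": "https://example.com"},
    )
]
for line in preview_documents(sample):
    print(line)
```

With real crawl results, the same call would be preview_documents(result["documents"]).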
By default, the component reads the API key from the FIRECRAWL_API_KEY environment variable. You can also pass it explicitly:
from haystack.utils import Secret
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler
crawler = FirecrawlCrawler(
    api_key=Secret.from_token("your-api-key"),
    params={"limit": 10, "scrape_options": {"formats": ["markdown"]}},
)
Parameters
api_key: API key for Firecrawl. Defaults to the FIRECRAWL_API_KEY environment variable.
params: Parameters for the crawl request. Defaults to {"limit": 1, "scrape_options": {"formats": ["markdown"]}}. See the Firecrawl API reference for all available parameters. Without a limit, Firecrawl may crawl all subpages and consume credits quickly.
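Because an unbounded crawl can consume credits quickly, one option is to build the params dictionary through a small guard that always sets a bounded limit. The following is a minimal sketch, not part of the integration: build_crawl_params, DEFAULT_PARAMS, and the max_limit cap of 50 are hypothetical names and values chosen for illustration.

```python
import copy

# Mirrors the documented default value of `params`.
DEFAULT_PARAMS = {"limit": 1, "scrape_options": {"formats": ["markdown"]}}


def build_crawl_params(overrides=None, max_limit=50):
    """Merge overrides into the defaults and keep `limit` bounded (hypothetical helper)."""
    params = copy.deepcopy(DEFAULT_PARAMS)
    # Shallow merge: an override for "scrape_options" replaces the whole nested dict.
    params.update(overrides or {})
    limit = params.get("limit")
    # A missing or non-positive limit falls back to the documented default of 1.
    if not isinstance(limit, int) or limit < 1:
        limit = DEFAULT_PARAMS["limit"]
    # max_limit is an arbitrary safety cap for this sketch, not a Firecrawl setting.
    params["limit"] = min(limit, max_limit)
    return params


print(build_crawl_params({"limit": 500}))
```

The result could then be passed to the component as FirecrawlCrawler(params=build_crawl_params({"limit": 5})).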
Async Support
The component supports asynchronous execution via run_async:
import asyncio

from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler


async def main():
    crawler = FirecrawlCrawler(params={"limit": 5})
    result = await crawler.run_async(urls=["https://docs.haystack.deepset.ai/docs/intro"])
    print(f"Crawled {len(result['documents'])} documents")


asyncio.run(main())
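One place where run_async may pay off is crawling several independent sites at once. The sketch below runs one crawl per URL group with asyncio.gather and caps concurrency with a semaphore. The helper crawl_groups and the FakeCrawler stand-in are hypothetical; the only thing taken from the integration is the call shape run_async(urls=...) returning a dictionary with a "documents" key. Whether a single FirecrawlCrawler instance can safely serve concurrent calls is not stated here, so treat that as an assumption to verify.

```python
import asyncio


async def crawl_groups(crawler, url_groups, max_concurrent=2):
    """Run one crawl per URL group, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def crawl_one(urls):
        async with semaphore:
            result = await crawler.run_async(urls=urls)
            return result["documents"]

    # gather preserves the order of url_groups in its results.
    per_group = await asyncio.gather(*(crawl_one(urls) for urls in url_groups))
    # Flatten the per-group lists into a single list of documents.
    return [doc for docs in per_group for doc in docs]


class FakeCrawler:
    """Stand-in with the same run_async shape, so the sketch runs offline."""

    async def run_async(self, urls):
        await asyncio.sleep(0)
        return {"documents": [f"doc for {url}" for url in urls]}


documents = asyncio.run(
    crawl_groups(FakeCrawler(), [["https://a.example"], ["https://b.example"]])
)
print(documents)
```

To try it against the real service, replace FakeCrawler() with a configured FirecrawlCrawler and keep max_concurrent low, since each group consumes credits independently.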
License
firecrawl-haystack is distributed under the terms of the Apache-2.0 license.
