LinkedIn, Company Intelligence & Lead Enrichment with Haystack, MongoDB Atlas, and Bright Data
Last Updated: February 4, 2026
This cookbook demonstrates how to build an AI-powered sales research assistant that:
- Extracts live data from LinkedIn, Crunchbase, news sources, and job postings
- Stores and indexes data in MongoDB Atlas for semantic search
- Answers complex questions like “What pain points is this company facing?” and “Generate a personalized outreach angle”
The Tech Stack:
- 🌐 Bright Data: Web scraping for 45+ data sources (LinkedIn, Crunchbase, news, job boards)
- 🍃 MongoDB Atlas: Vector database for semantic search + structured metadata filtering
- 🧠 Haystack: Open-source LLM framework for building RAG pipelines
- 🤖 Google Gemini 2.5: Generate actionable sales intelligence from raw data
What You’ll Build:
- Find companies matching your Ideal Customer Profile (ICP) criteria
- Identify decision makers and research their backgrounds
- Extract pain points from job postings, news articles, and company data
- Generate personalized outreach angles based on comprehensive company intelligence
Let’s get started! 🎯
🏗️ Architecture Overview
How the Sales Research Assistant Works
Our AI assistant combines three powerful technologies to deliver comprehensive lead intelligence:
┌─────────────────┐
│   User Query    │  "Find AI startups in NYC with Series A funding"
└────────┬────────┘
         │
         ▼
┌───────────────────────────────────────────────────────────────┐
│                       HAYSTACK PIPELINE                       │
│  ┌──────────────┐    ┌──────────────┐    ┌────────────────┐   │
│  │   Embedder   │───▶│  Retriever   │───▶│ Prompt Builder │   │
│  └──────────────┘    └──────┬───────┘    └───────┬────────┘   │
│                             │                    │            │
│                             ▼                    ▼            │
│                    ┌──────────────────┐   ┌──────────────┐    │
│                    │  MongoDB Atlas   │   │  Gemini 2.5  │    │
│                    │  Vector Search + │   │  Generator   │    │
│                    │  Metadata Filter │   └──────────────┘    │
│                    └────────▲─────────┘                       │
└─────────────────────────────┼─────────────────────────────────┘
                              │
                    ┌─────────┴─────────┐
                    │  INDEXING LAYER   │
                    └─────────▲─────────┘
                              │
                    ┌─────────┴─────────┐
                    │    BRIGHT DATA    │
                    │ Web Scraping API  │
                    └─────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         ┌────▼────┐    ┌─────▼─────┐   ┌─────▼─────┐
         │LinkedIn │    │Crunchbase │   │Google SERP│
         │Profiles │    │ Companies │   │   News    │
         └─────────┘    └───────────┘   └───────────┘
Component Breakdown
1. Bright Data Layer (Data Collection)
- Web Scraper API: Extracts structured data from 45+ sources
  - linkedin_company_profile: Company size, industry, description, location
  - linkedin_person_profile: Decision maker titles, backgrounds, experience
  - crunchbase_company: Funding rounds, investors, employee count
- SERP API: Real-time search results from Google/Bing
- Company news and press releases
- Job postings (signal for pain points)
- Industry trends and mentions
- Compliance Built-in: Respects robots.txt, handles CAPTCHAs, rotates IPs automatically
2. MongoDB Atlas (Storage & Retrieval)
- Vector Search: Semantic similarity matching on embedded company/person descriptions
- Metadata Filtering: Hybrid search combining vectors with structured filters
- Filter by: industry, funding stage, location, company size, job titles
- Document Storage: Stores raw scraped data + embeddings + metadata
- Scalable: Handles millions of leads with sub-second query times
3. Haystack Pipeline (Orchestration)
- Embedder: Converts queries and documents to vector representations using Google’s text-embedding-004
- Retriever: Finds most relevant leads from MongoDB based on semantic + metadata match
- Prompt Builder: Constructs context-rich prompts with retrieved lead data
- LLM Generator: Gemini 2.5 Flash synthesizes insights and generates actionable intelligence
Agent Capabilities
This architecture enables four key workflows:
1. Company Discovery
- Input: ICP criteria (industry, funding stage, location, size)
- Process: Scrape Crunchbase/LinkedIn → Index in MongoDB → Semantic search
- Output: Ranked list of companies matching criteria
2. Decision Maker Identification
- Input: Company name or URL
- Process: Scrape LinkedIn company page → Extract employee profiles → Identify key roles
- Output: List of decision makers with titles, backgrounds, and contact hints
3. Pain Point Analysis
- Input: Company name
- Process: SERP search for job postings + news → Analyze requirements and challenges
- Output: Inferred pain points, hiring priorities, growth signals
4. Personalized Outreach Generation
- Input: Prospect name/company + context from above
- Process: RAG retrieval of all data → Gemini synthesis with sales prompts
- Output: Personalized email/message angle with specific talking points
Data Flow Example
Query: “Find AI startups in NYC that raised Series A in the last 6 months”
- Scraping: Bright Data queries Crunchbase for AI companies in NYC with recent Series A funding
- Indexing: Companies are converted to Documents with embeddings and metadata (industry=AI, location=NYC, funding_stage=Series A)
- Retrieval: Query embedding matches semantically similar companies + metadata filters enforce ICP criteria
- Generation: Gemini 2.5 Flash receives the top 10 matching companies and synthesizes a detailed report with key insights (see the preview sketch below)
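Condensed into code, the same four steps look like this. This is only a preview sketch: `scraper`, `create_company_documents`, `indexing_pipeline`, and `rag_pipeline` are all built in the sections below, and the Crunchbase URL is illustrative.
# Preview of the end-to-end flow; every component here is defined later in this cookbook
query = "Find AI startups in NYC that raised Series A in the last 6 months"

# 1. Scraping: pull structured company data via Bright Data
raw = scraper.run(
    dataset="crunchbase_company",
    url="https://www.crunchbase.com/organization/acme-ai",  # illustrative URL
)

# 2. Indexing: convert to Haystack Documents, embed, and write to MongoDB Atlas
docs = create_company_documents(
    scraper_result=raw,
    source_url="https://www.crunchbase.com/organization/acme-ai",
    dataset_type="crunchbase_company",
)
indexing_pipeline.run({"embedder": {"documents": docs}})

# 3 + 4. Retrieval and generation: answer the query grounded in the indexed data
result = rag_pipeline.run(data={
    "text_embedder": {"text": query},
    "prompt_builder": {"question": query},
})
print(result["generator"]["replies"][0].text)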
Now let’s build it! 🛠️
Setup
First, we need to install the required dependencies for our sales research assistant.
! pip install haystack-ai haystack-brightdata mongodb-atlas-haystack google-genai-haystack dotenv
API Configuration
Next, we’ll configure the API keys needed for our sales research assistant. You’ll need:
- Bright Data API Key: Get yours from the Bright Data Dashboard
- MongoDB Connection String: From your MongoDB Atlas cluster
- Google API Key: For Gemini access from Google AI Studio
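Your .env file should define these three variables. A minimal sketch: the names match what the verification code below checks, and the values are placeholders.
BRIGHT_DATA_API_KEY=your-bright-data-api-key
MONGO_CONNECTION_STRING=mongodb+srv://<user>:<password>@<cluster>.mongodb.net/
GOOGLE_API_KEY=your-google-ai-studio-key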
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv(override=True)

# Map GOOGLE_AI_API_KEY to GOOGLE_API_KEY if needed
if not os.environ.get("GOOGLE_API_KEY") and os.environ.get("GOOGLE_AI_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = os.environ["GOOGLE_AI_API_KEY"]

# Verify all required keys are loaded
required_keys = ["BRIGHT_DATA_API_KEY", "MONGO_CONNECTION_STRING", "GOOGLE_API_KEY"]
missing_keys = [key for key in required_keys if not os.environ.get(key)]

if missing_keys:
    print(f"❌ Missing keys: {', '.join(missing_keys)}")
    raise ValueError(f"Please add {', '.join(missing_keys)} to your .env file")
else:
    print("✅ All environment variables loaded successfully")
✅ All environment variables loaded successfully
Bright Data Datasets Reference
from haystack_brightdata import BrightDataWebScraper

# List all supported datasets
datasets = BrightDataWebScraper.get_supported_datasets()
print(f"Total available datasets: {len(datasets)}\n")
print("Sales research relevant datasets:")
print("-" * 50)

# Filter for relevant datasets
relevant_keywords = ["linkedin", "crunchbase", "company", "profile"]
for dataset in datasets:
    if any(keyword in dataset['id'].lower() for keyword in relevant_keywords):
        print(f"📊 {dataset['id']}")
        print(f"   {dataset['description']}")
        print()
Total available datasets: 43
Sales research relevant datasets:
--------------------------------------------------
📊 linkedin_person_profile
   Extract structured LinkedIn person profile data. Requires a valid LinkedIn profile URL.

📊 linkedin_company_profile
   Extract structured LinkedIn company profile data. Requires a valid LinkedIn company URL.

📊 linkedin_job_listings
   Extract structured LinkedIn job listings data. Requires a valid LinkedIn job URL.

📊 linkedin_posts
   Extract structured LinkedIn posts data. Requires a valid LinkedIn post URL.

📊 linkedin_people_search
   Extract structured LinkedIn people search data. Requires URL, first_name, and last_name.

📊 crunchbase_company
   Extract structured Crunchbase company data. Requires a valid Crunchbase company URL.

📊 zoominfo_company_profile
   Extract structured ZoomInfo company profile data. Requires a valid ZoomInfo company URL.

📊 instagram_profiles
   Extract structured Instagram profile data. Requires a valid Instagram profile URL.

📊 facebook_company_reviews
   Extract structured Facebook company reviews. Requires a valid Facebook company URL and num_of_reviews.

📊 tiktok_profiles
   Extract structured TikTok profile data. Requires a valid TikTok profile URL.

📊 youtube_profiles
   Extract structured YouTube channel profile data. Requires a valid YouTube channel URL.
MongoDB Atlas Setup
MongoDB Atlas will serve as our vector database for storing embedded lead data and enabling semantic search.
Setup Requirements
1. Create a MongoDB Atlas Cluster
Follow the Get Started with Atlas guide to:
- Create a free cluster (M0 tier is sufficient for testing)
- Set up database access credentials
- Configure network access (allow your IP or use 0.0.0.0/0 for testing)
- Get your connection string
2. Create Vector Search Index
1. Go to your cluster in the Atlas UI
2. Click the “Search” tab → “Create Search Index”
3. Select “Atlas Vector Search” → “JSON Editor”
4. Configure:
   - Index name: lead_vector_index
   - Database: sales_intelligence
   - Collection: leads
5. Paste this configuration:

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 768,
      "similarity": "cosine"
    }
  ]
}

6. Wait for the index status to change from “Building” to “Active”
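If you prefer to create the index from code instead of the Atlas UI, recent pymongo releases (4.7+) support this via create_search_index. A sketch, assuming the collection already exists:
import os
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(os.environ["MONGO_CONNECTION_STRING"])
collection = client["sales_intelligence"]["leads"]

# Same definition as the JSON above; requires pymongo 4.7+ for type="vectorSearch"
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 768,
                "similarity": "cosine",
            }
        ]
    },
    name="lead_vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=index_model)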
Let’s initialize the document store:
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore

# Initialize MongoDB Atlas Document Store
# Note: It automatically reads from the MONGO_CONNECTION_STRING environment variable
document_store = MongoDBAtlasDocumentStore(
    database_name="sales_intelligence",
    collection_name="leads",
    vector_search_index="lead_vector_index",
    full_text_search_index="lead_fulltext_index"
)

print("✅ MongoDB Atlas DocumentStore initialized")
print(f"   Database: sales_intelligence")
print(f"   Collection: leads")
print(f"   Vector Search Index: lead_vector_index")
print(f"   Full-Text Search Index: lead_fulltext_index")
✅ MongoDB Atlas DocumentStore initialized
   Database: sales_intelligence
   Collection: leads
   Vector Search Index: lead_vector_index
   Full-Text Search Index: lead_fulltext_index
Data Model Design
Our lead intelligence database uses a flexible schema that accommodates data from multiple sources while enabling powerful hybrid search capabilities.
Hybrid Search Strategy
This structure enables three search modes:
1. Semantic Search: Find similar companies/people based on meaning
   - Query: “AI startups focused on enterprise automation”
   - Matches: Companies with similar descriptions, even if wording differs
2. Metadata Filtering: Exact match on structured fields
   - Filter: funding_stage = "Series A" AND location = "New York, NY"
   - Returns: Only companies meeting exact criteria
3. Hybrid Search: Combine both approaches (see the sketch after this list)
   - Semantic query: “Companies building developer tools”
   - Filters: funding_stage = "Series A" AND location = "San Francisco, CA"
   - Result: Semantically relevant companies that also match exact criteria
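In Haystack 2.x filter syntax, the hybrid mode translates to a condition tree passed to the retriever at query time. A sketch; the meta.* field names follow the document schema shown below, and the retriever is the MongoDBAtlasEmbeddingRetriever we initialize later.
# Hybrid search sketch: semantic query embedding + exact metadata conditions
hybrid_filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.funding_stage", "operator": "==", "value": "Series A"},
        {"field": "meta.location", "operator": "==", "value": "San Francisco, CA"},
    ],
}

# `retriever` and `query_embedding` come from the components we build below:
# results = retriever.run(query_embedding=query_embedding, filters=hybrid_filters)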
Example Documents
Each document has three components: content (human-readable text for LLM context), embedding (768-dim vector from text-embedding-004 for semantic search), and meta (structured fields for filtering).
Company Document (Crunchbase):
{
  "content": "Company: Acme AI\nIndustry: Artificial Intelligence\nFunding: $15M Series A...",
  "embedding": [0.123, -0.456, ...],  # 768 dimensions
  "meta": {
    "source_url": "https://www.crunchbase.com/organization/acme-ai",
    "dataset_type": "crunchbase_company",
    "company_name": "Acme AI",
    "industry": "AI/ML",
    "funding_stage": "Series A",
    "location": "San Francisco, CA",
    "scraped_date": "2026-01-19"
  }
}

Person Document (LinkedIn):

{
  "content": "Name: Jane Smith\nTitle: VP of Engineering\nCompany: Acme AI\nExperience: 10+ years...",
  "embedding": [0.234, -0.567, ...],  # 768 dimensions
  "meta": {
    "source_url": "https://www.linkedin.com/in/janesmith",
    "dataset_type": "linkedin_person",
    "person_name": "Jane Smith",
    "person_title": "VP of Engineering",
    "company": "Acme AI",
    "location": "San Francisco, CA",
    "scraped_date": "2026-01-19"
  }
}

News Signal Document (SERP):

{
  "content": "News: Acme AI raises $15M Series A\nSource: TechCrunch\nSnippet: AI startup...",
  "embedding": [0.345, -0.678, ...],  # 768 dimensions
  "meta": {
    "source_url": "https://techcrunch.com/...",
    "dataset_type": "news",
    "company_name": "Acme AI",
    "scraped_date": "2026-01-19"
  }
}
This flexible schema allows us to enrich lead profiles with multiple data sources while maintaining fast, accurate search capabilities.
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever

# Initialize the retriever for vector search
retriever = MongoDBAtlasEmbeddingRetriever(document_store=document_store)

print("✅ MongoDB Atlas Retriever initialized")
print(f"   Connected to: {document_store.collection_name}")
print(f"   Using vector index: {document_store.vector_search_index}")
✅ MongoDB Atlas Retriever initialized
   Connected to: leads
   Using vector index: lead_vector_index
from haystack_brightdata import BrightDataWebScraper

# Initialize the Web Scraper
# Note: Automatically uses BRIGHT_DATA_API_KEY from environment
scraper = BrightDataWebScraper()

print("✅ Bright Data Web Scraper initialized")
print(f"   API Key configured: {os.environ.get('BRIGHT_DATA_API_KEY')[:20]}...")
print(f"   Ready to scrape from 45+ supported datasets")
✅ Bright Data Web Scraper initialized
   API Key configured: 2dceb1aa0cda2fc6f7f7...
   Ready to scrape from 45+ supported datasets
Example 1: Scraping Crunchbase Company Data
Let’s start by extracting company intelligence from Crunchbase. This gives us funding information, investors, employee count, and more.
import json

# Example: Scrape company data from Crunchbase
# Replace with an actual Crunchbase company URL you want to research
company_url = "https://www.crunchbase.com/organization/openai"

print("Scraping Crunchbase data for: {}".format(company_url))
print()

def coalesce(data, *keys, default="N/A"):
    """Return the first non-empty value among the given keys."""
    for key in keys:
        value = data.get(key)
        if value not in (None, "", [], {}):
            return value
    return default

def format_industries(industries):
    """Flatten Crunchbase's list-of-dicts industries field into a string."""
    if not industries:
        return "N/A"
    if isinstance(industries, list):
        values = []
        for item in industries:
            if isinstance(item, dict):
                value = item.get("value") or item.get("name") or item.get("id")
                if value:
                    values.append(value)
            else:
                values.append(str(item))
        return ", ".join(values) if values else "N/A"
    return industries

def parse_company(result):
    """Normalize the scraper response into a single company dict."""
    raw = result.get("data", result)
    if isinstance(raw, str):
        raw = json.loads(raw)
    if isinstance(raw, list):
        return raw[0] if raw else {}
    if isinstance(raw, dict):
        return raw
    return {}

try:
    result = scraper.run(
        dataset="crunchbase_company",
        url=company_url
    )
    company_data = parse_company(result)

    industries = format_industries(company_data.get("industries"))

    tech_list = company_data.get("builtwith_tech") or company_data.get("built_with_tech") or []
    tech_names = [
        item.get("name")
        for item in tech_list
        if isinstance(item, dict) and item.get("name")
    ]
    tech_preview = ", ".join(tech_names[:5]) if tech_names else "N/A"

    news_items = company_data.get("news") or []
    news_dates = [
        item.get("date")
        for item in news_items
        if isinstance(item, dict) and item.get("date")
    ]
    latest_news_date = max(news_dates) if news_dates else "N/A"

    print("✅ Successfully scraped company data!")
    print()
    print("📊 Key Information:")
    print("   Company: {}".format(coalesce(company_data, "name", "legal_name")))
    print("   Overview: {}".format(coalesce(company_data, "about", "company_overview")))
    print("   Industries: {}".format(industries))
    print("   Operating Status: {}".format(coalesce(company_data, "operating_status")))
    print("   Website: {}".format(coalesce(company_data, "website", "url")))
    print("   Employees: {}".format(coalesce(company_data, "num_employees", "number_of_employee_profiles")))
    print("   Phone: {}".format(coalesce(company_data, "contact_phone", "phone_number")))
    print(
        "   Active Tech Count: {}".format(
            coalesce(
                company_data,
                "active_tech_count",
                "builtwith_num_technologies_used",
                "built_with_num_technologies_used"
            )
        )
    )
    print("   Tech (sample): {}".format(tech_preview))
    print("   Latest News Date: {}".format(latest_news_date))
    print()
    print("📋 Full data structure (first 500 chars):")
    print(json.dumps(company_data, indent=2)[:500] + "...")

except Exception as e:
    print("❌ Error scraping data: {}".format(e))
    print("   This might be due to invalid URL or rate limiting")
Scraping Crunchbase data for: https://www.crunchbase.com/organization/openai

✅ Successfully scraped company data!

📊 Key Information:
   Company: OpenAI
   Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
   Industries: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS
   Operating Status: active
   Website: https://www.openai.com
   Employees: 1001-5000
   Phone: +1 800-242-8478
   Active Tech Count: 79
   Tech (sample): DNSSEC, SSL by Default, HSTS, U.S. Server Location, Mobile Non Scaleable Content
   Latest News Date: 2026-01-25

📋 Full data structure (first 500 chars):
{
  "name": "OpenAI",
  "url": "https://www.crunchbase.com/organization/openai",
  "id": "openai",
  "cb_rank": 3,
  "region": "California",
  "about": "OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.",
  "industries": [
    {
      "id": "agentic-ai-17fa",
      "value": "Agentic AI"
    },
    {
      "id": "artificial-intelligence",
      "value": "Artificial Intelligence (AI)"
    },
    {
      "id": "foundational-ai",
      "value": "Fou...
Example 2: Scraping LinkedIn Company Data
Now we’ll extract company data from LinkedIn, which adds broader context about the requested company.
import json

# Example: Scrape LinkedIn company profile
# Replace with an actual LinkedIn company URL you want to research
linkedin_url = "https://www.linkedin.com/company/openai/"

print(f"Scraping LinkedIn company data for: {linkedin_url}")
print()

try:
    result = scraper.run(
        dataset="linkedin_company_profile",
        url=linkedin_url
    )

    # Parse the JSON response
    if isinstance(result["data"], str):
        company_data = json.loads(result["data"])
    else:
        company_data = result["data"]

    # Handle list response
    if isinstance(company_data, list):
        company_data = company_data[0] if company_data else {}

    print("✅ Successfully scraped LinkedIn company data!")
    print("\n📊 Key Information:")
    print(f"   Company: {company_data.get('name', 'N/A')}")
    print(f"   Description: {company_data.get('description', 'N/A')[:200]}...")
    print(f"   Industry: {company_data.get('industry', 'N/A')}")
    print(f"   Company Size: {company_data.get('company_size', 'N/A')}")
    print(f"   Headquarters: {company_data.get('headquarters', 'N/A')}")
    print(f"   Website: {company_data.get('website', 'N/A')}")
    print(f"   Followers: {company_data.get('follower_count', 'N/A')}")
    print(f"   Specialties: {', '.join(company_data.get('specialties', [])[:5]) if company_data.get('specialties') else 'N/A'}")
    print("\n📋 Full data structure (first 500 chars):")
    print(json.dumps(company_data, indent=2)[:500] + "...")

except Exception as e:
    print(f"❌ Error scraping data: {e}")
    print("   This might be due to invalid URL, rate limiting, or authentication requirements")
Scraping LinkedIn company data for: https://www.linkedin.com/company/openai/

✅ Successfully scraped LinkedIn company data!

📊 Key Information:
   Company: OpenAI
   Description: OpenAI | 9,797,179 followers on LinkedIn. OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. AI is an extremel...
   Industry: N/A
   Company Size: 201-500 employees
   Headquarters: San Francisco, CA
   Website: https://openai.com/
   Followers: N/A
   Specialties: a, r, t, i, f

📋 Full data structure (first 500 chars):
{
  "id": "openai",
  "name": "OpenAI",
  "country_code": "US",
  "locations": [
    "San Francisco, CA 94110, US"
  ],
  "followers": 9797179,
  "employees_in_linkedin": 7020,
  "about": "OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. AI is an extremely powerful tool that must be created with safety and human needs at its core. OpenAI is dedicated to putting that alignment of interests first \u2014 ahe...
Example 3: Scraping LinkedIn Person Profile
Now let’s extract decision maker profiles from LinkedIn. This helps identify key contacts, their backgrounds, and experience.
import json

# Example: Scrape LinkedIn person profile
person_url = "https://www.linkedin.com/in/satyanadella/"

print(f"Scraping LinkedIn person profile for: {person_url}")
print()

try:
    result = scraper.run(
        dataset="linkedin_person_profile",
        url=person_url
    )

    # Parse the JSON response
    if isinstance(result["data"], str):
        person_data = json.loads(result["data"])
    else:
        person_data = result["data"]

    # Handle list response - LinkedIn returns a list with one person object
    if isinstance(person_data, list):
        person_data = person_data[0] if person_data else {}

    print("✅ Successfully scraped LinkedIn person profile!")
    print("\n📊 Key Information:")
    print(f"   Name: {person_data.get('name', 'N/A')}")
    print(f"   Position: {person_data.get('position', 'N/A')}")
    print(f"   Location: {person_data.get('city', 'N/A')}, {person_data.get('country_code', 'N/A')}")

    # Current company
    current_company = person_data.get('current_company', {})
    if current_company:
        print(f"   Current Company: {current_company.get('name', 'N/A')}")
    else:
        print(f"   Current Company: N/A")

    print(f"   Followers: {person_data.get('followers', 'N/A')}")
    print(f"   Connections: {person_data.get('connections', 'N/A')}")

    # About section
    about = person_data.get('about')
    if about:
        print(f"\n   About: {about[:200]}...")

    # Experience
    experience = person_data.get('experience', [])
    if experience:
        print(f"\n   Experience ({len(experience)} roles):")
        for i, exp in enumerate(experience[:3]):  # Show first 3 roles
            company = exp.get('company', 'N/A')
            title = exp.get('title', 'N/A')
            duration = exp.get('duration', 'N/A')
            print(f"     {i+1}. {title} at {company} ({duration})")

    # Education
    education = person_data.get('education', [])
    if education:
        print(f"\n   Education ({len(education)} entries):")
        for i, edu in enumerate(education[:2]):  # Show first 2 education entries
            title = edu.get('title', 'N/A')
            years = f"{edu.get('start_year', '')}-{edu.get('end_year', '')}"
            print(f"     {i+1}. {title} ({years})")

    print("\n📋 Full data structure (first 500 chars):")
    print(json.dumps(person_data, indent=2)[:500] + "...")

except Exception as e:
    print(f"❌ Error scraping data: {e}")
    print("   This might be due to invalid URL, rate limiting, or authentication requirements")
Scraping LinkedIn person profile for: https://www.linkedin.com/in/satyanadella/

✅ Successfully scraped LinkedIn person profile!

📊 Key Information:
   Name: Satya Nadella
   Position: Chairman and CEO at Microsoft
   Location: Redmond, Washington, United States, US
   Current Company: Microsoft
   Followers: 11816477
   Connections: 500

   About: As chairman and CEO of Microsoft, I define my mission and that of my company as empowering every person and every organization on the planet to achieve more....

   Experience (5 roles):
     1. Chairman and CEO at Microsoft (N/A)
     2. Member Board Of Trustees at University of Chicago (N/A)
     3. Board Member at Starbucks (N/A)

   Education (3 entries):
     1. The University of Chicago Booth School of Business (1994-1996)
     2. Manipal Institute of Technology, Manipal (-)

📋 Full data structure (first 500 chars):
{
  "id": "satyanadella",
  "name": "Satya Nadella",
  "city": "Redmond, Washington, United States",
  "country_code": "US",
  "position": "Chairman and CEO at Microsoft",
  "about": "As chairman and CEO of Microsoft, I define my mission and that of my company as empowering every person and every organization on the planet to achieve more.",
  "posts": [
    {
      "title": "A Positive-Sum Future",
      "attribution": "I\u2019ve been thinking a lot about what the net benefit of the AI platform...
SERP API for Market Signals
Bright Data’s SERP API lets us gather market signals from search results: hiring activity, news coverage, and pain-point indicators.
Example SERP Queries for Sales Research
# Hiring signals
query = 'site:linkedin.com/jobs "Company Name" engineering'
# Funding news
query = '"Company Name" funding Series A announcement'
# Recent news
query = '"Company Name" news (2024 OR 2025)'
Data Structure
SERP API returns search results:
{
  "results": [
    {
      "title": "Company raises $50M Series B...",
      "url": "https://techcrunch.com/...",
      "snippet": "AI startup Company announced today...",
      "date": "2025-01-15"
    }
  ]
}
Let’s see it in action!
from haystack_brightdata import BrightDataSERP

# Initialize the SERP API component
# Note: Automatically uses BRIGHT_DATA_API_KEY from environment
serp = BrightDataSERP()

print("✅ Bright Data SERP API initialized")
print(f"   API Key configured: {os.environ.get('BRIGHT_DATA_API_KEY')[:20]}...")
print(f"   Ready to search Google/Bing for market signals")
✅ Bright Data SERP API initialized
   API Key configured: 2dceb1aa0cda2fc6f7f7...
   Ready to search Google/Bing for market signals
Example: Using SERP API to Find Company News
Let’s use SERP to discover recent news and signals about a company. This is perfect for identifying buying signals like funding announcements, product launches, or hiring initiatives.
import json

# Example: Search for recent company news and announcements
company_name = "OpenAI"
search_query = f'"{company_name}" news funding OR announcement OR launch 2025 OR 2026'

print(f"Searching for recent news about: {company_name}")
print(f"Query: {search_query}")
print()

try:
    result = serp.run(
        query=search_query,
        num_results=10
    )

    # Parse the results
    if isinstance(result["results"], str):
        serp_data = json.loads(result["results"])
    else:
        serp_data = result["results"]

    # Extract organic results (may be at root level or nested)
    organic_results = serp_data.get("organic", [])
    if not organic_results and "results" in serp_data:
        organic_results = serp_data.get("results", [])

    if not organic_results:
        print("⚠️ No results found")
    else:
        print(f"✅ Found {len(organic_results)} results")
        print("\n📰 Recent News & Signals:\n")
        for i, item in enumerate(organic_results[:5], 1):  # Show top 5 results
            title = item.get("title", "N/A")
            link = item.get("link", item.get("url", "N/A"))
            snippet = item.get("snippet", item.get("description", "N/A"))
            print(f"{i}. {title}")
            print(f"   URL: {link}")
            print(f"   Snippet: {snippet[:150]}...")
            print()

        print("\n💡 Sales Intelligence Use Cases:")
        print("   • Store these results in MongoDB with embeddings")
        print("   • Use Gemini to summarize key developments")
        print("   • Set up alerts for specific keywords (funding, hiring, launch)")
        print("   • Identify warm leads (companies announcing growth)")

        print("\n📋 Full data structure (first 500 chars):")
        print(json.dumps(serp_data, indent=2)[:500] + "...")

except Exception as e:
    print(f"❌ Error searching: {e}")
    print("   This might be due to rate limiting or API issues")
Searching for recent news about: OpenAI
Query: "OpenAI" news funding OR announcement OR launch 2025 OR 2026

✅ Found 9 results

📰 Recent News & Signals:

1. OpenAI seek investments from Middle East for multibillion- ...
   URL: https://www.cnbc.com/2026/01/21/openai-seek-investments-from-middle-east-for-multibillion-dollar-round.html
   Snippet: OpenAI is in talks with sovereign wealth funds in the Middle East to try to secure investments for a new multibillion dollar funding round, CNBC ...Re...

2. Horizon 1000: Advancing AI for primary healthcare
   URL: https://openai.com/index/horizon-1000/
   Snippet: Together, the Gates Foundation and OpenAI are committing $50 million in funding, technology, and technical support to support their work ...Read more...

3. OpenAI is coming for those sweet enterprise dollars in 2026
   URL: https://techcrunch.com/2026/01/22/openai-is-coming-for-those-sweet-enterprise-dollars-in-2026/
   Snippet: OpenAI on the other hand has seen its usage market share drop from 50% in 2023 to 27% at the end of 2025 — a trend that appears to concern the ...Read...

4. OpenAI's Altman Meets Mideast Investors for $50 Billion ...
   URL: https://www.bloomberg.com/news/articles/2026-01-21/openai-s-altman-meets-mideast-investors-for-50-billion-round
   Snippet: OpenAI Chief Executive Officer Sam Altman has been meeting with top investors in the Middle East to line up funding for a new investment round ...Read...

5. Inside OpenAI's Plan To Make Money
   URL: https://www.forbes.com/sites/the-prompt/2026/01/20/inside-openais-plan-to-make-money/
   Snippet: OpenAI ended 2025 with back-to-back massive infrastructure deals with the likes of Oracle, AMD and Broadcom that tallied up to $1.4 trillion of ...Rea...

💡 Sales Intelligence Use Cases:
   • Store these results in MongoDB with embeddings
   • Use Gemini to summarize key developments
   • Set up alerts for specific keywords (funding, hiring, launch)
   • Identify warm leads (companies announcing growth)

📋 Full data structure (first 500 chars):
{
  "general": {
    "search_engine": "google",
    "query": "\"OpenAI\" news funding OR announcement OR launch 2025 OR 2026",
    "language": "en",
    "location": "San Antonio, Texas",
    "mobile": false,
    "basic_view": false,
    "search_type": "text",
    "page_title": "\"OpenAI\" news funding OR announcement OR launch 2025 OR 2026 - Google Search",
    "timestamp": "2026-01-25T12:05:32.212Z"
  },
  "input": {
    "original_url": "https://www.google.com/search?q=%22OpenAI%22+news+funding...
Data Processing & Indexing Pipeline
Now we need to process and index our scraped data into MongoDB Atlas for semantic search.
The Indexing Pipeline Flow
Raw Scraped Data → Document Creation → Embedding Generation → MongoDB Storage
     (JSON)            (Haystack)          (Gemini 768d)        (Vector DB)
Document Structure
Each document in MongoDB has three components:
{
  "content": "Human-readable text about company/person",
  "embedding": [0.123, -0.456, ...],  # 768-dimensional vector
  "meta": {
    "source_url": "...",
    "dataset_type": "crunchbase_company",
    "company_name": "...",
    "industry": "...",
    "funding_stage": "...",
    "location": "...",
    "scraped_date": "2026-01-19"
  }
}
Let’s build it!
Helper Functions: Transform Scraped Data into Haystack Documents
Before we can index data, we need to transform raw scraper responses into Haystack Document objects. Let’s create helper functions for each data source.
import json
from datetime import datetime
from haystack import Document

def create_company_documents(scraper_result, source_url, dataset_type):
    """
    Transform company data from Crunchbase or LinkedIn into Haystack Documents.

    Args:
        scraper_result: Raw result from BrightDataWebScraper.run()
        source_url: Original URL that was scraped
        dataset_type: "crunchbase_company" or "linkedin_company_profile"

    Returns:
        List of Document objects ready for indexing
    """
    # Parse the JSON response
    if isinstance(scraper_result["data"], str):
        data = json.loads(scraper_result["data"])
    else:
        data = scraper_result["data"]

    # Handle both list and single object responses
    if not isinstance(data, list):
        data = [data]

    documents = []
    scraped_date = datetime.now().strftime("%Y-%m-%d")

    for item in data:
        # Create content string based on dataset type
        if dataset_type == "crunchbase_company":
            content = f"""Company: {item.get('name', 'N/A')}
Overview: {item.get('about', 'N/A')}
Industries: {item.get('industries', 'N/A')}
Operating Status: {item.get('operating_status', 'N/A')}
Location: {item.get('headquarters', 'N/A')}
Founded: {item.get('founded_year') or item.get('founded_date', 'N/A')}
Employees: {item.get('num_employees', 'N/A')}
Website: {item.get('website', 'N/A')}"""
        elif dataset_type == "linkedin_company_profile":
            content = f"""Company: {item.get('name', 'N/A')}
About: {item.get('about') or item.get('description', 'N/A')}
Industries: {item.get('industries', 'N/A')}
Company Size: {item.get('company_size', 'N/A')}
Headquarters: {item.get('headquarters', 'N/A')}
Founded: {item.get('founded', 'N/A')}
Website: {item.get('website', 'N/A')}
Followers: {item.get('followers', 'N/A')}
Employees on LinkedIn: {item.get('employees_in_linkedin', 'N/A')}"""
        else:
            content = f"Company: {item.get('name', 'N/A')}"

        # Extract industry - handle both string and list formats
        industries = item.get('industries', item.get('industry', ''))
        if isinstance(industries, list):
            industries = ', '.join([
                ind.get('value', ind) if isinstance(ind, dict) else str(ind)
                for ind in industries
            ])

        # Create Document with metadata
        documents.append(Document(
            content=content,
            meta={
                "source_url": source_url,
                "dataset_type": dataset_type,
                "company_name": item.get('name', ''),
                "industry": industries,
                "location": item.get('headquarters') or item.get('location', ''),
                "scraped_date": scraped_date
            }
        ))

    return documents

print("✅ Helper function created: create_company_documents()")
print("   Supports: crunchbase_company, linkedin_company_profile")
✅ Helper function created: create_company_documents()
   Supports: crunchbase_company, linkedin_company_profile
def create_person_documents(scraper_result, source_url):
    """
    Transform LinkedIn person profile data into Haystack Documents.

    Args:
        scraper_result: Raw result from BrightDataWebScraper.run()
        source_url: Original LinkedIn profile URL

    Returns:
        List of Document objects ready for indexing
    """
    # Parse the JSON response
    if isinstance(scraper_result["data"], str):
        data = json.loads(scraper_result["data"])
    else:
        data = scraper_result["data"]

    # Handle both list and single object responses
    if not isinstance(data, list):
        data = [data]

    documents = []
    scraped_date = datetime.now().strftime("%Y-%m-%d")

    for person in data:
        # Extract experience summary (first 3 roles)
        experience = person.get('experience', [])
        experience_summary = []
        for i, exp in enumerate(experience[:3]):
            company = exp.get('company', 'N/A')
            title = exp.get('title', 'N/A')
            duration = exp.get('duration', 'N/A')
            experience_summary.append(f"{title} at {company} ({duration})")
        experience_text = '\n'.join(experience_summary) if experience_summary else 'N/A'

        # Extract education summary
        education = person.get('education', [])
        education_summary = []
        for edu in education[:2]:
            title = edu.get('title', 'N/A')
            years = f"{edu.get('start_year', '')}-{edu.get('end_year', '')}"
            education_summary.append(f"{title} ({years})")
        education_text = '\n'.join(education_summary) if education_summary else 'N/A'

        # Get current company info
        current_company = person.get('current_company', {})
        current_company_name = current_company.get('name', 'N/A') if current_company else 'N/A'

        # Create content string
        content = f"""Name: {person.get('name', 'N/A')}
Position: {person.get('position', 'N/A')}
Current Company: {current_company_name}
Location: {person.get('city', 'N/A')}, {person.get('country_code', 'N/A')}
About: {person.get('about', 'N/A')}
Followers: {person.get('followers', 'N/A')}
Connections: {person.get('connections', 'N/A')}
Recent Experience:
{experience_text}
Education:
{education_text}"""

        # Create Document with metadata
        documents.append(Document(
            content=content,
            meta={
                "source_url": source_url,
                "dataset_type": "linkedin_person_profile",
                "person_name": person.get('name', ''),
                "person_title": person.get('position', ''),
                "company": current_company_name,
                "location": f"{person.get('city', '')}, {person.get('country_code', '')}",
                "scraped_date": scraped_date
            }
        ))

    return documents

print("✅ Helper function created: create_person_documents()")
print("   Supports: linkedin_person_profile")
✅ Helper function created: create_person_documents()
   Supports: linkedin_person_profile
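The data model also calls for news signal documents from the SERP API, which we haven't written a helper for yet. A minimal sketch that reuses the imports above and assumes the organic/results shape we saw in the SERP section:
def create_news_documents(serp_result, company_name):
    """
    Transform SERP organic results into news signal Documents.
    Sketch: assumes the `organic`/`results` shape from the SERP section above.
    """
    # Parse the JSON response
    if isinstance(serp_result["results"], str):
        data = json.loads(serp_result["results"])
    else:
        data = serp_result["results"]

    organic = data.get("organic", []) or data.get("results", [])
    documents = []
    scraped_date = datetime.now().strftime("%Y-%m-%d")

    for item in organic:
        content = (
            f"News: {item.get('title', 'N/A')}\n"
            f"Source: {item.get('link', item.get('url', 'N/A'))}\n"
            f"Snippet: {item.get('snippet', item.get('description', 'N/A'))}"
        )
        documents.append(Document(
            content=content,
            meta={
                "source_url": item.get("link", item.get("url", "")),
                "dataset_type": "news",
                "company_name": company_name,
                "scraped_date": scraped_date,
            },
        ))

    return documents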
Build the Indexing Pipeline
Now let’s create a Haystack pipeline that automatically:
- Takes Document objects
- Generates embeddings using Gemini
- Writes to MongoDB Atlas
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder

# Create the indexing pipeline
indexing_pipeline = Pipeline()

# Add components - create a fresh embedder instance for this pipeline
indexing_pipeline.add_component("embedder", GoogleGenAIDocumentEmbedder(model="text-embedding-004"))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Connect components
indexing_pipeline.connect("embedder.documents", "writer.documents")

print("✅ Indexing pipeline created")
print("\nPipeline structure:")
print("  Documents → Embedder (Gemini text-embedding-004) → Writer (MongoDB)")
print("\nComponents:")
print(f"  • Embedder: GoogleGenAIDocumentEmbedder (768 dimensions)")
print(f"  • Writer: MongoDB Atlas ({document_store.collection_name})")
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
✅ Indexing pipeline created

Pipeline structure:
  Documents → Embedder (Gemini text-embedding-004) → Writer (MongoDB)

Components:
  • Embedder: GoogleGenAIDocumentEmbedder (768 dimensions)
  • Writer: MongoDB Atlas (leads)
Demo: Index Sample Companies
Let’s test the complete indexing flow by scraping a company and indexing it into MongoDB Atlas.
# Initialize the collection in MongoDB if it doesn't exist
# This creates the collection and ensures it's ready for indexing
try:
    # Get the MongoDB client and database
    from pymongo import MongoClient
    client = MongoClient(os.environ.get("MONGO_CONNECTION_STRING"))
    db = client[document_store.database_name]

    # Create the collection if it doesn't exist
    if document_store.collection_name not in db.list_collection_names():
        db.create_collection(document_store.collection_name)
        print(f"✅ Created collection '{document_store.collection_name}' in database '{document_store.database_name}'")
    else:
        print(f"✅ Collection '{document_store.collection_name}' already exists")

    # Count existing documents
    collection = db[document_store.collection_name]
    doc_count = collection.count_documents({})
    print(f"   Current document count: {doc_count}")

except Exception as e:
    print(f"⚠️ Error initializing collection: {e}")
    print("   You may need to create the collection manually in MongoDB Atlas")
✅ Collection 'leads' already exists
   Current document count: 2
With the collection in place, we can now scrape a company and index it end to end:
# Example: Scrape and index a company from Crunchbase
company_url = "https://www.crunchbase.com/organization/openai"

print(f"Step 1: Scraping company data from {company_url}")
print("-" * 60)

# Scrape the company
scraper_result = scraper.run(
    dataset="crunchbase_company",
    url=company_url
)
print("✅ Scraping complete")

# Transform into Haystack Documents
print("\nStep 2: Transforming into Haystack Documents")
print("-" * 60)
documents = create_company_documents(
    scraper_result=scraper_result,
    source_url=company_url,
    dataset_type="crunchbase_company"
)
print(f"✅ Created {len(documents)} document(s)")
print(f"\nDocument preview:")
print(f"   Content (first 200 chars): {documents[0].content[:200]}...")
print(f"   Metadata: {documents[0].meta}")

# Index into MongoDB
print("\nStep 3: Generating embeddings and indexing into MongoDB")
print("-" * 60)
result = indexing_pipeline.run({"embedder": {"documents": documents}})

print(f"✅ Indexed {result['writer']['documents_written']} document(s) into MongoDB")
print(f"\n🎉 Complete! The company is now searchable in your vector database")
print(f"   • Semantic search: Find similar companies")
print(f"   • Metadata filters: Filter by industry, location, etc.")
print(f"   • RAG pipeline: Answer questions about this company")
Step 1: Scraping company data from https://www.crunchbase.com/organization/openai
------------------------------------------------------------
✅ Scraping complete

Step 2: Transforming into Haystack Documents
------------------------------------------------------------
✅ Created 1 document(s)

Document preview:
   Content (first 200 chars): Company: OpenAI
Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'ar...
   Metadata: {'source_url': 'https://www.crunchbase.com/organization/openai', 'dataset_type': 'crunchbase_company', 'company_name': 'OpenAI', 'industry': 'Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS', 'location': [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}], 'scraped_date': '2026-01-25'}

Step 3: Generating embeddings and indexing into MongoDB
------------------------------------------------------------
Calculating embeddings: 1it [00:00, 1.18it/s]
✅ Indexed 1 document(s) into MongoDB

🎉 Complete! The company is now searchable in your vector database
   • Semantic search: Find similar companies
   • Metadata filters: Filter by industry, location, etc.
   • RAG pipeline: Answer questions about this company
RAG Pipeline for Sales Intelligence
RAG combines retrieval (finding relevant documents) with generation (LLM synthesis) to answer questions based on your indexed data.
User Question → Text Embedder → Retriever → Prompt Builder → Generator → Answer
Build the RAG Pipeline
Now let’s assemble all components into a complete RAG pipeline.
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.embedders.google_genai import GoogleGenAITextEmbedder
from haystack_integrations.components.generators.google_genai import GoogleGenAIChatGenerator

# Define the prompt template for sales intelligence
system_message = ChatMessage.from_system("""
You are a sales intelligence assistant. Your role is to analyze company and people data to provide actionable sales intelligence.

When answering queries:
- Cite specific company names and details from the data
- Provide insights relevant for sales outreach
- Highlight key information like funding, company size, location, recent news
- Suggest talking points for personalized outreach
""")

user_template = """
Based on the following company/person data, answer the user's question.

Context:
{% for document in documents %}
{{ document.content }}
---
{% endfor %}

Question: {{ question }}

Provide a detailed, actionable answer based on the retrieved data.
"""
user_message = ChatMessage.from_user(user_template)

# Create the RAG pipeline
rag_pipeline = Pipeline()

# Add components
rag_pipeline.add_component("text_embedder", GoogleGenAITextEmbedder(model="text-embedding-004"))
rag_pipeline.add_component("retriever", MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=5))
rag_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=[system_message, user_message]))
rag_pipeline.add_component("generator", GoogleGenAIChatGenerator(model="gemini-2.5-flash"))

# Connect components
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "generator.messages")

print("✅ RAG pipeline created")
print("\nPipeline structure:")
print("  Question → Text Embedder → Retriever → Prompt Builder → Generator → Answer")
print("\nComponents:")
print("  • Text Embedder: text-embedding-004 (768d)")
print("  • Retriever: MongoDB Atlas (top_k=5)")
print("  • Prompt Builder: Sales intelligence template")
print("  • Generator: gemini-2.5-flash")
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
✅ RAG pipeline created

Pipeline structure:
  Question → Text Embedder → Retriever → Prompt Builder → Generator → Answer

Components:
  • Text Embedder: text-embedding-004 (768d)
  • Retriever: MongoDB Atlas (top_k=5)
  • Prompt Builder: Sales intelligence template
  • Generator: gemini-2.5-flash
Demo: Query the RAG Pipeline
Let’s test our RAG pipeline with a sales intelligence question.
# Example query: Ask about companies in the database
question = "What can you tell me about OpenAI? Include details about their industry, products, and any relevant information for sales outreach."
print(f"Question: {question}")
print("\n" + "="*80)
print("Processing...")
print("="*80 + "\n")
try:
# Run the RAG pipeline with include_outputs_from to get retriever results
result = rag_pipeline.run(
data={
"text_embedder": {"text": question},
"prompt_builder": {"question": question}
},
include_outputs_from={"retriever"}
)
# Extract the answer using .text
answer = result["generator"]["replies"][0].text
print("Answer:")
print("-" * 80)
print(answer)
print("-" * 80)
# Show retrieved documents
if "retriever" in result:
retrieved_docs = result["retriever"]["documents"]
print(f"\nπ Retrieved {len(retrieved_docs)} relevant documents from MongoDB")
print("\n" + "="*80)
print("RETRIEVED DOCUMENTS:")
print("="*80)
for i, doc in enumerate(retrieved_docs, 1):
print(f"\nDocument {i}:")
print(f" Company: {doc.meta.get('company_name', 'N/A')}")
print(f" Source: {doc.meta.get('dataset_type', 'N/A')}")
print(f" Location: {doc.meta.get('location', 'N/A')}")
print(f" Industry: {doc.meta.get('industry', 'N/A')}")
print(f"\n Content:")
print(f" {doc.content[:300]}...")
print("-" * 80)
else:
print("\nβ οΈ Retriever output not available")
except Exception as e:
print(f"β Error: {e}")
import traceback
traceback.print_exc()
print("\nMake sure you have:")
print(" 1. Indexed at least one company (run the indexing demo cell)")
print(" 2. MongoDB collection exists and has data")
print(" 3. Vector search index is properly configured")
Question: What can you tell me about OpenAI? Include details about their industry, products, and any relevant information for sales outreach.
================================================================================
Processing...
================================================================================
Answer:
--------------------------------------------------------------------------------
Based on the provided data, here's what you can tell about OpenAI and relevant information for sales outreach:
**Company Overview:**
* **Name:** OpenAI
* **Core Business:** OpenAI is a leading AI research and deployment company. They are known for developing advanced AI models, most notably **ChatGPT**.
* **Key Industries:** They operate at the forefront of several cutting-edge AI fields, including:
* Agentic AI
* Artificial Intelligence (AI)
* Foundational AI
* Generative AI
* Machine Learning
* Natural Language Processing (NLP)
* SaaS (indicating they deploy their models as services)
* **Operating Status:** Active
* **Size:** The data indicates their employee count is substantial, either **1,001-5,000** or **5,001-10,000**. This suggests a rapidly growing, large enterprise.
* **Website:** https://www.openai.com
**Sales Intelligence and Talking Points:**
1. **Pioneers in AI:** OpenAI is a major player and innovator in the AI space, particularly in generative AI and foundational models. This indicates they are constantly looking for cutting-edge solutions, talent, and infrastructure to maintain their leadership.
* **Sales Angle:** Any product or service that enhances AI research, model development, deployment efficiency, or security for advanced AI systems would be highly relevant.
* **Talking Point:** "Given OpenAI's groundbreaking work in [Generative AI/Foundational AI] with models like ChatGPT, I imagine you're constantly seeking ways to optimize your [data processing/compute infrastructure/model deployment/AI safety protocols]."
2. **SaaS Provider:** Their inclusion in the "SaaS" industry means they are not just developing AI, but also productizing and deploying it as services. This implies needs related to scaling, customer support, API management, cloud infrastructure, and enterprise-grade reliability.
* **Sales Angle:** Solutions for large-scale SaaS operations, particularly those with high computational demands, would be valuable.
* **Talking Point:** "As a key SaaS provider in the AI space, managing the scalability and reliability of services like ChatGPT must be a critical focus. How are you currently addressing [specific SaaS challenge, e.g., low-latency inference at scale/secure API access]?"
3. **Large and Growing Organization:** With thousands of employees, OpenAI likely faces challenges typical of rapidly scaling enterprises, such as internal communication, talent management, complex project coordination, and managing diverse research and engineering teams.
* **Sales Angle:** Solutions for enterprise collaboration, project management, developer tools, or specialized HR/recruitment for AI talent could be relevant.
* **Talking Point:** "With OpenAI's rapid growth and the complexity of your AI projects, I'm curious how you manage [specific internal challenge, e.g., cross-functional collaboration between research and engineering/onboarding specialized AI talent]."
4. **Focus on Advanced AI:** Their specific industry tags like "Agentic AI" and "Foundational AI" highlight their focus on the most complex and impactful areas of AI. This implies a need for robust, high-performance, and secure infrastructure.
* **Sales Angle:** If your product or service provides a competitive advantage in areas like high-performance computing, specialized hardware (e.g., GPUs), data privacy, or ethical AI frameworks, it would resonate.
* **Talking Point:** "Your work in Foundational AI is truly pushing the boundaries. We've seen companies tackling similar challenges find significant value in our [specific solution, e.g., secure data pipelines for large models/compute orchestration for distributed AI training]."
**Overall Sales Strategy:**
When reaching out to OpenAI, emphasize how your solution directly supports their mission to advance AI, enhances their existing AI models or infrastructure, improves their SaaS offerings, or addresses the operational complexities of a leading, fast-growing AI enterprise. Tailor your message to their specific industry focus (e.g., Generative AI, Foundational AI) to demonstrate you understand their unique challenges and priorities.
--------------------------------------------------------------------------------
📄 Retrieved 3 relevant documents from MongoDB
================================================================================
RETRIEVED DOCUMENTS:
================================================================================
Document 1:
Company: OpenAI
Source: crunchbase_company
Location: [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}]
Industry: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS
Content:
Company: OpenAI
Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'artificial-intelligence', 'value': 'Artificial Intelligence (AI)'}, {'id': 'foundational-ai', 'value':...
--------------------------------------------------------------------------------
Document 2:
Company: OpenAI
Source: crunchbase_company
Location: [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}]
Industry: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS
Content:
Company: OpenAI
Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'artificial-intelligence', 'value': 'Artificial Intelligence (AI)'}, {'id': 'foundational-ai', 'value':...
--------------------------------------------------------------------------------
Document 3:
Company: OpenAI
Source: crunchbase_company
Location: [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}]
Industry: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS
Content:
Company: OpenAI
Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'artificial-intelligence', 'value': 'Artificial Intelligence (AI)'}, {'id': 'foundational-ai', 'value':...
--------------------------------------------------------------------------------
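Notice that all three hits are the same Crunchbase record: the demo index contains duplicates, likely because the same company was scraped and indexed more than once. A quick query-time guard is to deduplicate retrieved documents before they reach the prompt builder; the helper below is a minimal sketch (a hypothetical function, not part of the pipeline above) keyed on the metadata fields shown in the output. At indexing time, the cleaner fix is to write documents with stable IDs and Haystack's DuplicatePolicy.OVERWRITE so duplicates never enter the store.
# Hedged sketch: drop duplicate hits before prompt building.
# Keys on company_name + dataset_type, which the documents above carry;
# adjust the key to whatever uniquely identifies a record in your schema.
def dedupe_documents(documents):
    seen = set()
    unique = []
    for doc in documents:
        key = (doc.meta.get("company_name"), doc.meta.get("dataset_type"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

unique_docs = dedupe_documents(result["retriever"]["documents"])
print(f"{len(unique_docs)} unique document(s) after deduplication")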
