LinkedIn, Company Intelligence & Lead Enrichment with Haystack, MongoDB Atlas, and Bright Data
Last Updated: February 4, 2026
This cookbook demonstrates how to build an AI-powered sales research assistant that:
- Extracts live data from LinkedIn, Crunchbase, news sources, and job postings
- Stores and indexes data in MongoDB Atlas for semantic search
- Answers complex questions like “What pain points is this company facing?” and “Generate a personalized outreach angle”
The Tech Stack:
- 🌐 Bright Data: Web scraping for 45+ data sources (LinkedIn, Crunchbase, news, job boards)
- 🍃 MongoDB Atlas: Vector database for semantic search + structured metadata filtering
- 🧠 Haystack: Open-source LLM framework for building RAG pipelines
- 🤖 Google Gemini 2.5: Generate actionable sales intelligence from raw data
What You’ll Build:
- Find companies matching your Ideal Customer Profile (ICP) criteria
- Identify decision makers and research their backgrounds
- Extract pain points from job postings, news articles, and company data
- Generate personalized outreach angles based on comprehensive company intelligence
Let’s get started! 🎯
🏗️ Architecture Overview
How the Sales Research Assistant Works
Our AI assistant combines three powerful technologies to deliver comprehensive lead intelligence:
┌─────────────────┐
│   User Query    │  "Find AI startups in NYC with Series A funding"
└────────┬────────┘
         │
         ▼
┌───────────────────────────────────────────────────────────────┐
│                       HAYSTACK PIPELINE                       │
│  ┌──────────────┐    ┌──────────────┐    ┌────────────────┐   │
│  │   Embedder   │───▶│  Retriever   │───▶│ Prompt Builder │   │
│  └──────────────┘    └──────┬───────┘    └───────┬────────┘   │
│                             │                    │            │
│                             ▼                    ▼            │
│                    ┌──────────────────┐   ┌──────────────┐    │
│                    │  MongoDB Atlas   │   │  Gemini 2.5  │    │
│                    │  Vector Search + │   │  Generator   │    │
│                    │  Metadata Filter │   └──────────────┘    │
│                    └────────▲─────────┘                       │
└─────────────────────────────┼─────────────────────────────────┘
                              │
                    ┌─────────┴─────────┐
                    │  INDEXING LAYER   │
                    └─────────▲─────────┘
                              │
                    ┌─────────┴─────────┐
                    │    BRIGHT DATA    │
                    │ Web Scraping API  │
                    └─────────┬─────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
         ┌────▼────┐    ┌─────▼─────┐   ┌─────▼─────┐
         │LinkedIn │    │Crunchbase │   │Google SERP│
         │Profiles │    │ Companies │   │   News    │
         └─────────┘    └───────────┘   └───────────┘
Component Breakdown
1. Bright Data Layer (Data Collection)
- Web Scraper API: Extracts structured data from 45+ sources
  - linkedin_company_profile: Company size, industry, description, location
  - linkedin_person_profile: Decision maker titles, backgrounds, experience
  - crunchbase_company: Funding rounds, investors, employee count
- SERP API: Real-time search results from Google/Bing
- Company news and press releases
- Job postings (signal for pain points)
- Industry trends and mentions
- Compliance Built-in: Respects robots.txt, handles CAPTCHAs, rotates IPs automatically
2. MongoDB Atlas (Storage & Retrieval)
- Vector Search: Semantic similarity matching on embedded company/person descriptions
- Metadata Filtering: Hybrid search combining vectors with structured filters
- Filter by: industry, funding stage, location, company size, job titles
- Document Storage: Stores raw scraped data + embeddings + metadata
- Scalable: Handles millions of leads with sub-second query times
3. Haystack Pipeline (Orchestration)
- Embedder: Converts queries and documents to vector representations using Google’s text-embedding-004
- Retriever: Finds most relevant leads from MongoDB based on semantic + metadata match
- Prompt Builder: Constructs context-rich prompts with retrieved lead data
- LLM Generator: Gemini 2.5 Flash synthesizes insights and generates actionable intelligence
Agent Capabilities
This architecture enables four key workflows:
1. Company Discovery
- Input: ICP criteria (industry, funding stage, location, size)
- Process: Scrape Crunchbase/LinkedIn → Index in MongoDB → Semantic search
- Output: Ranked list of companies matching criteria
2. Decision Maker Identification
- Input: Company name or URL
- Process: Scrape LinkedIn company page → Extract employee profiles → Identify key roles
- Output: List of decision makers with titles, backgrounds, and contact hints
3. Pain Point Analysis
- Input: Company name
- Process: SERP search for job postings + news → Analyze requirements and challenges
- Output: Inferred pain points, hiring priorities, growth signals
4. Personalized Outreach Generation
- Input: Prospect name/company + context from above
- Process: RAG retrieval of all data → Gemini synthesis with sales prompts
- Output: Personalized email/message angle with specific talking points
Data Flow Example
Query: “Find AI startups in NYC that raised Series A in the last 6 months”
- Scraping: Bright Data queries Crunchbase for AI companies in NYC with recent Series A funding
- Indexing: Companies are converted to Documents with embeddings and metadata (industry=AI, location=NYC, funding_stage=Series A)
- Retrieval: Query embedding matches semantically similar companies + metadata filters enforce ICP criteria
- Generation: Gemini 2.5 Flash receives the top 10 matching companies and synthesizes a detailed report with key insights (see the preview sketch below)
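Condensed into code, the same four steps look like this. This is only a preview sketch: `scraper`, `create_company_documents`, `indexing_pipeline`, and `rag_pipeline` are all built in the sections below, and the Crunchbase URL is illustrative.
# Preview of the end-to-end flow; every component here is defined later in this cookbook
query = "Find AI startups in NYC that raised Series A in the last 6 months"

# 1. Scraping: pull structured company data via Bright Data
raw = scraper.run(
    dataset="crunchbase_company",
    url="https://www.crunchbase.com/organization/acme-ai",  # illustrative URL
)

# 2. Indexing: convert to Haystack Documents, embed, and write to MongoDB Atlas
docs = create_company_documents(
    scraper_result=raw,
    source_url="https://www.crunchbase.com/organization/acme-ai",
    dataset_type="crunchbase_company",
)
indexing_pipeline.run({"embedder": {"documents": docs}})

# 3 + 4. Retrieval and generation: answer the query grounded in the indexed data
result = rag_pipeline.run(data={
    "text_embedder": {"text": query},
    "prompt_builder": {"question": query},
})
print(result["generator"]["replies"][0].text)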
Now let’s build it! 🛠️
Setup
First, we need to install the required dependencies for our sales research assistant.
! pip install haystack-ai haystack-brightdata mongodb-atlas-haystack google-genai-haystack dotenv
API Configuration
Next, we’ll configure the API keys needed for our sales research assistant. You’ll need:
- Bright Data API Key: Get yours from the Bright Data Dashboard
- MongoDB Connection String: From your MongoDB Atlas cluster
- Google API Key: For Gemini access from Google AI Studio
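Your .env file should define these three variables. A minimal sketch: the names match what the verification code below checks, and the values are placeholders.
BRIGHT_DATA_API_KEY=your-bright-data-api-key
MONGO_CONNECTION_STRING=mongodb+srv://<user>:<password>@<cluster>.mongodb.net/
GOOGLE_API_KEY=your-google-ai-studio-key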
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv(override=True)

# Map GOOGLE_AI_API_KEY to GOOGLE_API_KEY if needed
if not os.environ.get("GOOGLE_API_KEY") and os.environ.get("GOOGLE_AI_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = os.environ["GOOGLE_AI_API_KEY"]

# Verify all required keys are loaded
required_keys = ["BRIGHT_DATA_API_KEY", "MONGO_CONNECTION_STRING", "GOOGLE_API_KEY"]
missing_keys = [key for key in required_keys if not os.environ.get(key)]

if missing_keys:
    print(f"❌ Missing keys: {', '.join(missing_keys)}")
    raise ValueError(f"Please add {', '.join(missing_keys)} to your .env file")
else:
    print("✅ All environment variables loaded successfully")
✅ All environment variables loaded successfully
Bright Data Datasets Reference
from haystack_brightdata import BrightDataWebScraper

# List all supported datasets
datasets = BrightDataWebScraper.get_supported_datasets()
print(f"Total available datasets: {len(datasets)}\n")
print("Sales research relevant datasets:")
print("-" * 50)

# Filter for relevant datasets
relevant_keywords = ["linkedin", "crunchbase", "company", "profile"]
for dataset in datasets:
    if any(keyword in dataset['id'].lower() for keyword in relevant_keywords):
        print(f"📊 {dataset['id']}")
        print(f"   {dataset['description']}")
        print()
Total available datasets: 43
Sales research relevant datasets:
--------------------------------------------------
📊 linkedin_person_profile
   Extract structured LinkedIn person profile data. Requires a valid LinkedIn profile URL.

📊 linkedin_company_profile
   Extract structured LinkedIn company profile data. Requires a valid LinkedIn company URL.

📊 linkedin_job_listings
   Extract structured LinkedIn job listings data. Requires a valid LinkedIn job URL.

📊 linkedin_posts
   Extract structured LinkedIn posts data. Requires a valid LinkedIn post URL.

📊 linkedin_people_search
   Extract structured LinkedIn people search data. Requires URL, first_name, and last_name.

📊 crunchbase_company
   Extract structured Crunchbase company data. Requires a valid Crunchbase company URL.

📊 zoominfo_company_profile
   Extract structured ZoomInfo company profile data. Requires a valid ZoomInfo company URL.

📊 instagram_profiles
   Extract structured Instagram profile data. Requires a valid Instagram profile URL.

📊 facebook_company_reviews
   Extract structured Facebook company reviews. Requires a valid Facebook company URL and num_of_reviews.

📊 tiktok_profiles
   Extract structured TikTok profile data. Requires a valid TikTok profile URL.

📊 youtube_profiles
   Extract structured YouTube channel profile data. Requires a valid YouTube channel URL.
MongoDB Atlas Setup
MongoDB Atlas will serve as our vector database for storing embedded lead data and enabling semantic search.
Setup Requirements
1. Create a MongoDB Atlas Cluster
Follow the Get Started with Atlas guide to:
- Create a free cluster (M0 tier is sufficient for testing)
- Set up database access credentials
- Configure network access (allow your IP or use 0.0.0.0/0 for testing)
- Get your connection string
2. Create Vector Search Index
1. Go to your cluster in the Atlas UI
2. Click the “Search” tab → “Create Search Index”
3. Select “Atlas Vector Search” → “JSON Editor”
4. Configure:
   - Index name: lead_vector_index
   - Database: sales_intelligence
   - Collection: leads
5. Paste this configuration:

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 768,
      "similarity": "cosine"
    }
  ]
}

6. Wait for the index status to change from “Building” to “Active”
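If you prefer to create the index from code instead of the Atlas UI, recent pymongo releases (4.7+) support this via create_search_index. A sketch, assuming the collection already exists:
import os
from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(os.environ["MONGO_CONNECTION_STRING"])
collection = client["sales_intelligence"]["leads"]

# Same definition as the JSON above; requires pymongo 4.7+ for type="vectorSearch"
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",
                "numDimensions": 768,
                "similarity": "cosine",
            }
        ]
    },
    name="lead_vector_index",
    type="vectorSearch",
)
collection.create_search_index(model=index_model)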
Let’s initialize the document store:
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore

# Initialize MongoDB Atlas Document Store
# Note: It automatically reads from the MONGO_CONNECTION_STRING environment variable
document_store = MongoDBAtlasDocumentStore(
    database_name="sales_intelligence",
    collection_name="leads",
    vector_search_index="lead_vector_index",
    full_text_search_index="lead_fulltext_index"
)

print("✅ MongoDB Atlas DocumentStore initialized")
print(f"   Database: sales_intelligence")
print(f"   Collection: leads")
print(f"   Vector Search Index: lead_vector_index")
print(f"   Full-Text Search Index: lead_fulltext_index")
✅ MongoDB Atlas DocumentStore initialized
   Database: sales_intelligence
   Collection: leads
   Vector Search Index: lead_vector_index
   Full-Text Search Index: lead_fulltext_index
Data Model Design
Our lead intelligence database uses a flexible schema that accommodates data from multiple sources while enabling powerful hybrid search capabilities.
Hybrid Search Strategy
This structure enables three search modes:
1. Semantic Search: Find similar companies/people based on meaning
   - Query: “AI startups focused on enterprise automation”
   - Matches: Companies with similar descriptions, even if wording differs
2. Metadata Filtering: Exact match on structured fields
   - Filter: funding_stage = "Series A" AND location = "New York, NY"
   - Returns: Only companies meeting exact criteria
3. Hybrid Search: Combine both approaches (see the sketch after this list)
   - Semantic query: “Companies building developer tools”
   - Filters: funding_stage = "Series A" AND location = "San Francisco, CA"
   - Result: Semantically relevant companies that also match exact criteria
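In Haystack 2.x filter syntax, the hybrid mode translates to a condition tree passed to the retriever at query time. A sketch; the meta.* field names follow the document schema shown below, and the retriever is the MongoDBAtlasEmbeddingRetriever we initialize later.
# Hybrid search sketch: semantic query embedding + exact metadata conditions
hybrid_filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.funding_stage", "operator": "==", "value": "Series A"},
        {"field": "meta.location", "operator": "==", "value": "San Francisco, CA"},
    ],
}

# `retriever` and `query_embedding` come from the components we build below:
# results = retriever.run(query_embedding=query_embedding, filters=hybrid_filters)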
Example Documents
Each document has three components: content (human-readable text for LLM context), embedding (768-dim vector from text-embedding-004 for semantic search), and meta (structured fields for filtering).
Company Document (Crunchbase):
{
  "content": "Company: Acme AI\nIndustry: Artificial Intelligence\nFunding: $15M Series A...",
  "embedding": [0.123, -0.456, ...],  # 768 dimensions
  "meta": {
    "source_url": "https://www.crunchbase.com/organization/acme-ai",
    "dataset_type": "crunchbase_company",
    "company_name": "Acme AI",
    "industry": "AI/ML",
    "funding_stage": "Series A",
    "location": "San Francisco, CA",
    "scraped_date": "2026-01-19"
  }
}

Person Document (LinkedIn):

{
  "content": "Name: Jane Smith\nTitle: VP of Engineering\nCompany: Acme AI\nExperience: 10+ years...",
  "embedding": [0.234, -0.567, ...],  # 768 dimensions
  "meta": {
    "source_url": "https://www.linkedin.com/in/janesmith",
    "dataset_type": "linkedin_person",
    "person_name": "Jane Smith",
    "person_title": "VP of Engineering",
    "company": "Acme AI",
    "location": "San Francisco, CA",
    "scraped_date": "2026-01-19"
  }
}

News Signal Document (SERP):

{
  "content": "News: Acme AI raises $15M Series A\nSource: TechCrunch\nSnippet: AI startup...",
  "embedding": [0.345, -0.678, ...],  # 768 dimensions
  "meta": {
    "source_url": "https://techcrunch.com/...",
    "dataset_type": "news",
    "company_name": "Acme AI",
    "scraped_date": "2026-01-19"
  }
}
This flexible schema allows us to enrich lead profiles with multiple data sources while maintaining fast, accurate search capabilities.
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever

# Initialize the retriever for vector search
retriever = MongoDBAtlasEmbeddingRetriever(document_store=document_store)

print("✅ MongoDB Atlas Retriever initialized")
print(f"   Connected to: {document_store.collection_name}")
print(f"   Using vector index: {document_store.vector_search_index}")
✅ MongoDB Atlas Retriever initialized
   Connected to: leads
   Using vector index: lead_vector_index
from haystack_brightdata import BrightDataWebScraper

# Initialize the Web Scraper
# Note: Automatically uses BRIGHT_DATA_API_KEY from environment
scraper = BrightDataWebScraper()

print("✅ Bright Data Web Scraper initialized")
print(f"   API Key configured: {os.environ.get('BRIGHT_DATA_API_KEY')[:20]}...")
print(f"   Ready to scrape from 45+ supported datasets")
✅ Bright Data Web Scraper initialized
   API Key configured: 2dceb1aa0cda2fc6f7f7...
   Ready to scrape from 45+ supported datasets
Example 1: Scraping Crunchbase Company Data
Let’s start by extracting company intelligence from Crunchbase. This gives us funding information, investors, employee count, and more.
import json

# Example: Scrape company data from Crunchbase
# Replace with an actual Crunchbase company URL you want to research
company_url = "https://www.crunchbase.com/organization/openai"

print("Scraping Crunchbase data for: {}".format(company_url))
print()

def coalesce(data, *keys, default="N/A"):
    """Return the first non-empty value among the given keys."""
    for key in keys:
        value = data.get(key)
        if value not in (None, "", [], {}):
            return value
    return default

def format_industries(industries):
    """Flatten Crunchbase's list-of-dicts industries field into a string."""
    if not industries:
        return "N/A"
    if isinstance(industries, list):
        values = []
        for item in industries:
            if isinstance(item, dict):
                value = item.get("value") or item.get("name") or item.get("id")
                if value:
                    values.append(value)
            else:
                values.append(str(item))
        return ", ".join(values) if values else "N/A"
    return industries

def parse_company(result):
    """Normalize the scraper response into a single company dict."""
    raw = result.get("data", result)
    if isinstance(raw, str):
        raw = json.loads(raw)
    if isinstance(raw, list):
        return raw[0] if raw else {}
    if isinstance(raw, dict):
        return raw
    return {}

try:
    result = scraper.run(
        dataset="crunchbase_company",
        url=company_url
    )
    company_data = parse_company(result)

    industries = format_industries(company_data.get("industries"))

    tech_list = company_data.get("builtwith_tech") or company_data.get("built_with_tech") or []
    tech_names = [
        item.get("name")
        for item in tech_list
        if isinstance(item, dict) and item.get("name")
    ]
    tech_preview = ", ".join(tech_names[:5]) if tech_names else "N/A"

    news_items = company_data.get("news") or []
    news_dates = [
        item.get("date")
        for item in news_items
        if isinstance(item, dict) and item.get("date")
    ]
    latest_news_date = max(news_dates) if news_dates else "N/A"

    print("✅ Successfully scraped company data!")
    print()
    print("📊 Key Information:")
    print("   Company: {}".format(coalesce(company_data, "name", "legal_name")))
    print("   Overview: {}".format(coalesce(company_data, "about", "company_overview")))
    print("   Industries: {}".format(industries))
    print("   Operating Status: {}".format(coalesce(company_data, "operating_status")))
    print("   Website: {}".format(coalesce(company_data, "website", "url")))
    print("   Employees: {}".format(coalesce(company_data, "num_employees", "number_of_employee_profiles")))
    print("   Phone: {}".format(coalesce(company_data, "contact_phone", "phone_number")))
    print(
        "   Active Tech Count: {}".format(
            coalesce(
                company_data,
                "active_tech_count",
                "builtwith_num_technologies_used",
                "built_with_num_technologies_used"
            )
        )
    )
    print("   Tech (sample): {}".format(tech_preview))
    print("   Latest News Date: {}".format(latest_news_date))
    print()
    print("📋 Full data structure (first 500 chars):")
    print(json.dumps(company_data, indent=2)[:500] + "...")

except Exception as e:
    print("❌ Error scraping data: {}".format(e))
    print("   This might be due to invalid URL or rate limiting")
Scraping Crunchbase data for: https://www.crunchbase.com/organization/openai

✅ Successfully scraped company data!

📊 Key Information:
   Company: OpenAI
   Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
   Industries: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS
   Operating Status: active
   Website: https://www.openai.com
   Employees: 1001-5000
   Phone: +1 800-242-8478
   Active Tech Count: 79
   Tech (sample): DNSSEC, SSL by Default, HSTS, U.S. Server Location, Mobile Non Scaleable Content
   Latest News Date: 2026-01-25

📋 Full data structure (first 500 chars):
{
  "name": "OpenAI",
  "url": "https://www.crunchbase.com/organization/openai",
  "id": "openai",
  "cb_rank": 3,
  "region": "California",
  "about": "OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.",
  "industries": [
    {
      "id": "agentic-ai-17fa",
      "value": "Agentic AI"
    },
    {
      "id": "artificial-intelligence",
      "value": "Artificial Intelligence (AI)"
    },
    {
      "id": "foundational-ai",
      "value": "Fou...
Example 2: Scraping LinkedIn Company Data
Now we’ll extract company data from LinkedIn, which adds broader context about the requested company.
import json

# Example: Scrape LinkedIn company profile
# Replace with an actual LinkedIn company URL you want to research
linkedin_url = "https://www.linkedin.com/company/openai/"

print(f"Scraping LinkedIn company data for: {linkedin_url}")
print()

try:
    result = scraper.run(
        dataset="linkedin_company_profile",
        url=linkedin_url
    )

    # Parse the JSON response
    if isinstance(result["data"], str):
        company_data = json.loads(result["data"])
    else:
        company_data = result["data"]

    # Handle list response
    if isinstance(company_data, list):
        company_data = company_data[0] if company_data else {}

    print("✅ Successfully scraped LinkedIn company data!")
    print("\n📊 Key Information:")
    print(f"   Company: {company_data.get('name', 'N/A')}")
    print(f"   Description: {company_data.get('description', 'N/A')[:200]}...")
    print(f"   Industry: {company_data.get('industry', 'N/A')}")
    print(f"   Company Size: {company_data.get('company_size', 'N/A')}")
    print(f"   Headquarters: {company_data.get('headquarters', 'N/A')}")
    print(f"   Website: {company_data.get('website', 'N/A')}")
    print(f"   Followers: {company_data.get('follower_count', 'N/A')}")
    print(f"   Specialties: {', '.join(company_data.get('specialties', [])[:5]) if company_data.get('specialties') else 'N/A'}")
    print("\n📋 Full data structure (first 500 chars):")
    print(json.dumps(company_data, indent=2)[:500] + "...")

except Exception as e:
    print(f"❌ Error scraping data: {e}")
    print("   This might be due to invalid URL, rate limiting, or authentication requirements")
Scraping LinkedIn company data for: https://www.linkedin.com/company/openai/

✅ Successfully scraped LinkedIn company data!

📊 Key Information:
   Company: OpenAI
   Description: OpenAI | 9,797,179 followers on LinkedIn. OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. AI is an extremel...
   Industry: N/A
   Company Size: 201-500 employees
   Headquarters: San Francisco, CA
   Website: https://openai.com/
   Followers: N/A
   Specialties: a, r, t, i, f

📋 Full data structure (first 500 chars):
{
  "id": "openai",
  "name": "OpenAI",
  "country_code": "US",
  "locations": [
    "San Francisco, CA 94110, US"
  ],
  "followers": 9797179,
  "employees_in_linkedin": 7020,
  "about": "OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. AI is an extremely powerful tool that must be created with safety and human needs at its core. OpenAI is dedicated to putting that alignment of interests first \u2014 ahe...
Example 3: Scraping LinkedIn Person Profile
Now let’s extract decision maker profiles from LinkedIn. This helps identify key contacts, their backgrounds, and experience.
import json

# Example: Scrape LinkedIn person profile
person_url = "https://www.linkedin.com/in/satyanadella/"

print(f"Scraping LinkedIn person profile for: {person_url}")
print()

try:
    result = scraper.run(
        dataset="linkedin_person_profile",
        url=person_url
    )

    # Parse the JSON response
    if isinstance(result["data"], str):
        person_data = json.loads(result["data"])
    else:
        person_data = result["data"]

    # Handle list response - LinkedIn returns a list with one person object
    if isinstance(person_data, list):
        person_data = person_data[0] if person_data else {}

    print("✅ Successfully scraped LinkedIn person profile!")
    print("\n📊 Key Information:")
    print(f"   Name: {person_data.get('name', 'N/A')}")
    print(f"   Position: {person_data.get('position', 'N/A')}")
    print(f"   Location: {person_data.get('city', 'N/A')}, {person_data.get('country_code', 'N/A')}")

    # Current company
    current_company = person_data.get('current_company', {})
    if current_company:
        print(f"   Current Company: {current_company.get('name', 'N/A')}")
    else:
        print(f"   Current Company: N/A")

    print(f"   Followers: {person_data.get('followers', 'N/A')}")
    print(f"   Connections: {person_data.get('connections', 'N/A')}")

    # About section
    about = person_data.get('about')
    if about:
        print(f"\n   About: {about[:200]}...")

    # Experience
    experience = person_data.get('experience', [])
    if experience:
        print(f"\n   Experience ({len(experience)} roles):")
        for i, exp in enumerate(experience[:3]):  # Show first 3 roles
            company = exp.get('company', 'N/A')
            title = exp.get('title', 'N/A')
            duration = exp.get('duration', 'N/A')
            print(f"     {i+1}. {title} at {company} ({duration})")

    # Education
    education = person_data.get('education', [])
    if education:
        print(f"\n   Education ({len(education)} entries):")
        for i, edu in enumerate(education[:2]):  # Show first 2 education entries
            title = edu.get('title', 'N/A')
            years = f"{edu.get('start_year', '')}-{edu.get('end_year', '')}"
            print(f"     {i+1}. {title} ({years})")

    print("\n📋 Full data structure (first 500 chars):")
    print(json.dumps(person_data, indent=2)[:500] + "...")

except Exception as e:
    print(f"❌ Error scraping data: {e}")
    print("   This might be due to invalid URL, rate limiting, or authentication requirements")
Scraping LinkedIn person profile for: https://www.linkedin.com/in/satyanadella/

✅ Successfully scraped LinkedIn person profile!

📊 Key Information:
   Name: Satya Nadella
   Position: Chairman and CEO at Microsoft
   Location: Redmond, Washington, United States, US
   Current Company: Microsoft
   Followers: 11816477
   Connections: 500

   About: As chairman and CEO of Microsoft, I define my mission and that of my company as empowering every person and every organization on the planet to achieve more....

   Experience (5 roles):
     1. Chairman and CEO at Microsoft (N/A)
     2. Member Board Of Trustees at University of Chicago (N/A)
     3. Board Member at Starbucks (N/A)

   Education (3 entries):
     1. The University of Chicago Booth School of Business (1994-1996)
     2. Manipal Institute of Technology, Manipal (-)

📋 Full data structure (first 500 chars):
{
  "id": "satyanadella",
  "name": "Satya Nadella",
  "city": "Redmond, Washington, United States",
  "country_code": "US",
  "position": "Chairman and CEO at Microsoft",
  "about": "As chairman and CEO of Microsoft, I define my mission and that of my company as empowering every person and every organization on the planet to achieve more.",
  "posts": [
    {
      "title": "A Positive-Sum Future",
      "attribution": "I\u2019ve been thinking a lot about what the net benefit of the AI platform...
SERP API for Market Signals
Bright Data’s SERP API lets us gather market signals from search results: hiring activity, news coverage, and pain-point indicators.
Example SERP Queries for Sales Research
# Hiring signals
query = 'site:linkedin.com/jobs "Company Name" engineering'
# Funding news
query = '"Company Name" funding Series A announcement'
# Recent news
query = '"Company Name" news (2024 OR 2025)'
Data Structure
SERP API returns search results:
{
  "results": [
    {
      "title": "Company raises $50M Series B...",
      "url": "https://techcrunch.com/...",
      "snippet": "AI startup Company announced today...",
      "date": "2025-01-15"
    }
  ]
}
Let’s see it in action!
from haystack_brightdata import BrightDataSERP

# Initialize the SERP API component
# Note: Automatically uses BRIGHT_DATA_API_KEY from environment
serp = BrightDataSERP()

print("✅ Bright Data SERP API initialized")
print(f"   API Key configured: {os.environ.get('BRIGHT_DATA_API_KEY')[:20]}...")
print(f"   Ready to search Google/Bing for market signals")
✅ Bright Data SERP API initialized
   API Key configured: 2dceb1aa0cda2fc6f7f7...
   Ready to search Google/Bing for market signals
Example: Using SERP API to Find Company News
Let’s use SERP to discover recent news and signals about a company. This is perfect for identifying buying signals like funding announcements, product launches, or hiring initiatives.
import json

# Example: Search for recent company news and announcements
company_name = "OpenAI"
search_query = f'"{company_name}" news funding OR announcement OR launch 2025 OR 2026'

print(f"Searching for recent news about: {company_name}")
print(f"Query: {search_query}")
print()

try:
    result = serp.run(
        query=search_query,
        num_results=10
    )

    # Parse the results
    if isinstance(result["results"], str):
        serp_data = json.loads(result["results"])
    else:
        serp_data = result["results"]

    # Extract organic results (may be at root level or nested)
    organic_results = serp_data.get("organic", [])
    if not organic_results and "results" in serp_data:
        organic_results = serp_data.get("results", [])

    if not organic_results:
        print("⚠️ No results found")
    else:
        print(f"✅ Found {len(organic_results)} results")
        print("\n📰 Recent News & Signals:\n")
        for i, item in enumerate(organic_results[:5], 1):  # Show top 5 results
            title = item.get("title", "N/A")
            link = item.get("link", item.get("url", "N/A"))
            snippet = item.get("snippet", item.get("description", "N/A"))
            print(f"{i}. {title}")
            print(f"   URL: {link}")
            print(f"   Snippet: {snippet[:150]}...")
            print()

        print("\n💡 Sales Intelligence Use Cases:")
        print("   • Store these results in MongoDB with embeddings")
        print("   • Use Gemini to summarize key developments")
        print("   • Set up alerts for specific keywords (funding, hiring, launch)")
        print("   • Identify warm leads (companies announcing growth)")

        print("\n📋 Full data structure (first 500 chars):")
        print(json.dumps(serp_data, indent=2)[:500] + "...")

except Exception as e:
    print(f"❌ Error searching: {e}")
    print("   This might be due to rate limiting or API issues")
Searching for recent news about: OpenAI
Query: "OpenAI" news funding OR announcement OR launch 2025 OR 2026

✅ Found 9 results

📰 Recent News & Signals:

1. OpenAI seek investments from Middle East for multibillion- ...
   URL: https://www.cnbc.com/2026/01/21/openai-seek-investments-from-middle-east-for-multibillion-dollar-round.html
   Snippet: OpenAI is in talks with sovereign wealth funds in the Middle East to try to secure investments for a new multibillion dollar funding round, CNBC ...Re...

2. Horizon 1000: Advancing AI for primary healthcare
   URL: https://openai.com/index/horizon-1000/
   Snippet: Together, the Gates Foundation and OpenAI are committing $50 million in funding, technology, and technical support to support their work ...Read more...

3. OpenAI is coming for those sweet enterprise dollars in 2026
   URL: https://techcrunch.com/2026/01/22/openai-is-coming-for-those-sweet-enterprise-dollars-in-2026/
   Snippet: OpenAI on the other hand has seen its usage market share drop from 50% in 2023 to 27% at the end of 2025 — a trend that appears to concern the ...Read...

4. OpenAI's Altman Meets Mideast Investors for $50 Billion ...
   URL: https://www.bloomberg.com/news/articles/2026-01-21/openai-s-altman-meets-mideast-investors-for-50-billion-round
   Snippet: OpenAI Chief Executive Officer Sam Altman has been meeting with top investors in the Middle East to line up funding for a new investment round ...Read...

5. Inside OpenAI's Plan To Make Money
   URL: https://www.forbes.com/sites/the-prompt/2026/01/20/inside-openais-plan-to-make-money/
   Snippet: OpenAI ended 2025 with back-to-back massive infrastructure deals with the likes of Oracle, AMD and Broadcom that tallied up to $1.4 trillion of ...Rea...

💡 Sales Intelligence Use Cases:
   • Store these results in MongoDB with embeddings
   • Use Gemini to summarize key developments
   • Set up alerts for specific keywords (funding, hiring, launch)
   • Identify warm leads (companies announcing growth)

📋 Full data structure (first 500 chars):
{
  "general": {
    "search_engine": "google",
    "query": "\"OpenAI\" news funding OR announcement OR launch 2025 OR 2026",
    "language": "en",
    "location": "San Antonio, Texas",
    "mobile": false,
    "basic_view": false,
    "search_type": "text",
    "page_title": "\"OpenAI\" news funding OR announcement OR launch 2025 OR 2026 - Google Search",
    "timestamp": "2026-01-25T12:05:32.212Z"
  },
  "input": {
    "original_url": "https://www.google.com/search?q=%22OpenAI%22+news+funding...
Data Processing & Indexing Pipeline
Now we need to process and index our scraped data into MongoDB Atlas for semantic search.
The Indexing Pipeline Flow
Raw Scraped Data → Document Creation → Embedding Generation → MongoDB Storage
     (JSON)            (Haystack)          (Gemini 768d)        (Vector DB)
Document Structure
Each document in MongoDB has three components:
{
  "content": "Human-readable text about company/person",
  "embedding": [0.123, -0.456, ...],  # 768-dimensional vector
  "meta": {
    "source_url": "...",
    "dataset_type": "crunchbase_company",
    "company_name": "...",
    "industry": "...",
    "funding_stage": "...",
    "location": "...",
    "scraped_date": "2026-01-19"
  }
}
Let’s build it!
Helper Functions: Transform Scraped Data into Haystack Documents
Before we can index data, we need to transform raw scraper responses into Haystack Document objects. Let’s create helper functions for each data source.
import json
from datetime import datetime
from haystack import Document

def create_company_documents(scraper_result, source_url, dataset_type):
    """
    Transform company data from Crunchbase or LinkedIn into Haystack Documents.

    Args:
        scraper_result: Raw result from BrightDataWebScraper.run()
        source_url: Original URL that was scraped
        dataset_type: "crunchbase_company" or "linkedin_company_profile"

    Returns:
        List of Document objects ready for indexing
    """
    # Parse the JSON response
    if isinstance(scraper_result["data"], str):
        data = json.loads(scraper_result["data"])
    else:
        data = scraper_result["data"]

    # Handle both list and single object responses
    if not isinstance(data, list):
        data = [data]

    documents = []
    scraped_date = datetime.now().strftime("%Y-%m-%d")

    for item in data:
        # Create content string based on dataset type
        if dataset_type == "crunchbase_company":
            content = f"""Company: {item.get('name', 'N/A')}
Overview: {item.get('about', 'N/A')}
Industries: {item.get('industries', 'N/A')}
Operating Status: {item.get('operating_status', 'N/A')}
Location: {item.get('headquarters', 'N/A')}
Founded: {item.get('founded_year') or item.get('founded_date', 'N/A')}
Employees: {item.get('num_employees', 'N/A')}
Website: {item.get('website', 'N/A')}"""
        elif dataset_type == "linkedin_company_profile":
            content = f"""Company: {item.get('name', 'N/A')}
About: {item.get('about') or item.get('description', 'N/A')}
Industries: {item.get('industries', 'N/A')}
Company Size: {item.get('company_size', 'N/A')}
Headquarters: {item.get('headquarters', 'N/A')}
Founded: {item.get('founded', 'N/A')}
Website: {item.get('website', 'N/A')}
Followers: {item.get('followers', 'N/A')}
Employees on LinkedIn: {item.get('employees_in_linkedin', 'N/A')}"""
        else:
            content = f"Company: {item.get('name', 'N/A')}"

        # Extract industry - handle both string and list formats
        industries = item.get('industries', item.get('industry', ''))
        if isinstance(industries, list):
            industries = ', '.join([
                ind.get('value', ind) if isinstance(ind, dict) else str(ind)
                for ind in industries
            ])

        # Create Document with metadata
        documents.append(Document(
            content=content,
            meta={
                "source_url": source_url,
                "dataset_type": dataset_type,
                "company_name": item.get('name', ''),
                "industry": industries,
                "location": item.get('headquarters') or item.get('location', ''),
                "scraped_date": scraped_date
            }
        ))

    return documents

print("✅ Helper function created: create_company_documents()")
print("   Supports: crunchbase_company, linkedin_company_profile")
✅ Helper function created: create_company_documents()
   Supports: crunchbase_company, linkedin_company_profile
def create_person_documents(scraper_result, source_url):
    """
    Transform LinkedIn person profile data into Haystack Documents.

    Args:
        scraper_result: Raw result from BrightDataWebScraper.run()
        source_url: Original LinkedIn profile URL

    Returns:
        List of Document objects ready for indexing
    """
    # Parse the JSON response
    if isinstance(scraper_result["data"], str):
        data = json.loads(scraper_result["data"])
    else:
        data = scraper_result["data"]

    # Handle both list and single object responses
    if not isinstance(data, list):
        data = [data]

    documents = []
    scraped_date = datetime.now().strftime("%Y-%m-%d")

    for person in data:
        # Extract experience summary (first 3 roles)
        experience = person.get('experience', [])
        experience_summary = []
        for i, exp in enumerate(experience[:3]):
            company = exp.get('company', 'N/A')
            title = exp.get('title', 'N/A')
            duration = exp.get('duration', 'N/A')
            experience_summary.append(f"{title} at {company} ({duration})")
        experience_text = '\n'.join(experience_summary) if experience_summary else 'N/A'

        # Extract education summary
        education = person.get('education', [])
        education_summary = []
        for edu in education[:2]:
            title = edu.get('title', 'N/A')
            years = f"{edu.get('start_year', '')}-{edu.get('end_year', '')}"
            education_summary.append(f"{title} ({years})")
        education_text = '\n'.join(education_summary) if education_summary else 'N/A'

        # Get current company info
        current_company = person.get('current_company', {})
        current_company_name = current_company.get('name', 'N/A') if current_company else 'N/A'

        # Create content string
        content = f"""Name: {person.get('name', 'N/A')}
Position: {person.get('position', 'N/A')}
Current Company: {current_company_name}
Location: {person.get('city', 'N/A')}, {person.get('country_code', 'N/A')}
About: {person.get('about', 'N/A')}
Followers: {person.get('followers', 'N/A')}
Connections: {person.get('connections', 'N/A')}
Recent Experience:
{experience_text}
Education:
{education_text}"""

        # Create Document with metadata
        documents.append(Document(
            content=content,
            meta={
                "source_url": source_url,
                "dataset_type": "linkedin_person_profile",
                "person_name": person.get('name', ''),
                "person_title": person.get('position', ''),
                "company": current_company_name,
                "location": f"{person.get('city', '')}, {person.get('country_code', '')}",
                "scraped_date": scraped_date
            }
        ))

    return documents

print("✅ Helper function created: create_person_documents()")
print("   Supports: linkedin_person_profile")
✅ Helper function created: create_person_documents()
   Supports: linkedin_person_profile
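The data model also calls for news signal documents from the SERP API, which we haven't written a helper for yet. A minimal sketch that reuses the imports above and assumes the organic/results shape we saw in the SERP section:
def create_news_documents(serp_result, company_name):
    """
    Transform SERP organic results into news signal Documents.
    Sketch: assumes the `organic`/`results` shape from the SERP section above.
    """
    # Parse the JSON response
    if isinstance(serp_result["results"], str):
        data = json.loads(serp_result["results"])
    else:
        data = serp_result["results"]

    organic = data.get("organic", []) or data.get("results", [])
    documents = []
    scraped_date = datetime.now().strftime("%Y-%m-%d")

    for item in organic:
        content = (
            f"News: {item.get('title', 'N/A')}\n"
            f"Source: {item.get('link', item.get('url', 'N/A'))}\n"
            f"Snippet: {item.get('snippet', item.get('description', 'N/A'))}"
        )
        documents.append(Document(
            content=content,
            meta={
                "source_url": item.get("link", item.get("url", "")),
                "dataset_type": "news",
                "company_name": company_name,
                "scraped_date": scraped_date,
            },
        ))

    return documents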
Build the Indexing Pipeline
Now let’s create a Haystack pipeline that automatically:
- Takes Document objects
- Generates embeddings using Gemini
- Writes to MongoDB Atlas
from haystack import Pipeline
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.google_genai import GoogleGenAIDocumentEmbedder

# Create the indexing pipeline
indexing_pipeline = Pipeline()

# Add components - create a fresh embedder instance for this pipeline
indexing_pipeline.add_component("embedder", GoogleGenAIDocumentEmbedder(model="text-embedding-004"))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Connect components
indexing_pipeline.connect("embedder.documents", "writer.documents")

print("✅ Indexing pipeline created")
print("\nPipeline structure:")
print("  Documents → Embedder (Gemini text-embedding-004) → Writer (MongoDB)")
print("\nComponents:")
print(f"  • Embedder: GoogleGenAIDocumentEmbedder (768 dimensions)")
print(f"  • Writer: MongoDB Atlas ({document_store.collection_name})")
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
✅ Indexing pipeline created

Pipeline structure:
  Documents → Embedder (Gemini text-embedding-004) → Writer (MongoDB)

Components:
  • Embedder: GoogleGenAIDocumentEmbedder (768 dimensions)
  • Writer: MongoDB Atlas (leads)
Demo: Index Sample Companies
Let’s test the complete indexing flow by scraping a company and indexing it into MongoDB Atlas.
# Initialize the collection in MongoDB if it doesn't exist
# This creates the collection and ensures it's ready for indexing
try:
    # Get the MongoDB client and database
    from pymongo import MongoClient
    client = MongoClient(os.environ.get("MONGO_CONNECTION_STRING"))
    db = client[document_store.database_name]

    # Create the collection if it doesn't exist
    if document_store.collection_name not in db.list_collection_names():
        db.create_collection(document_store.collection_name)
        print(f"✅ Created collection '{document_store.collection_name}' in database '{document_store.database_name}'")
    else:
        print(f"✅ Collection '{document_store.collection_name}' already exists")

    # Count existing documents
    collection = db[document_store.collection_name]
    doc_count = collection.count_documents({})
    print(f"   Current document count: {doc_count}")

except Exception as e:
    print(f"⚠️ Error initializing collection: {e}")
    print("   You may need to create the collection manually in MongoDB Atlas")
✅ Collection 'leads' already exists
   Current document count: 2
With the collection in place, we can now scrape a company and index it end to end:
# Example: Scrape and index a company from Crunchbase
company_url = "https://www.crunchbase.com/organization/openai"

print(f"Step 1: Scraping company data from {company_url}")
print("-" * 60)

# Scrape the company
scraper_result = scraper.run(
    dataset="crunchbase_company",
    url=company_url
)
print("✅ Scraping complete")

# Transform into Haystack Documents
print("\nStep 2: Transforming into Haystack Documents")
print("-" * 60)
documents = create_company_documents(
    scraper_result=scraper_result,
    source_url=company_url,
    dataset_type="crunchbase_company"
)
print(f"✅ Created {len(documents)} document(s)")
print(f"\nDocument preview:")
print(f"   Content (first 200 chars): {documents[0].content[:200]}...")
print(f"   Metadata: {documents[0].meta}")

# Index into MongoDB
print("\nStep 3: Generating embeddings and indexing into MongoDB")
print("-" * 60)
result = indexing_pipeline.run({"embedder": {"documents": documents}})

print(f"✅ Indexed {result['writer']['documents_written']} document(s) into MongoDB")
print(f"\n🎉 Complete! The company is now searchable in your vector database")
print(f"   • Semantic search: Find similar companies")
print(f"   • Metadata filters: Filter by industry, location, etc.")
print(f"   • RAG pipeline: Answer questions about this company")
Step 1: Scraping company data from https://www.crunchbase.com/organization/openai
------------------------------------------------------------
✅ Scraping complete

Step 2: Transforming into Haystack Documents
------------------------------------------------------------
✅ Created 1 document(s)

Document preview:
   Content (first 200 chars): Company: OpenAI
Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'ar...
   Metadata: {'source_url': 'https://www.crunchbase.com/organization/openai', 'dataset_type': 'crunchbase_company', 'company_name': 'OpenAI', 'industry': 'Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS', 'location': [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}], 'scraped_date': '2026-01-25'}

Step 3: Generating embeddings and indexing into MongoDB
------------------------------------------------------------
Calculating embeddings: 1it [00:00, 1.18it/s]
✅ Indexed 1 document(s) into MongoDB

🎉 Complete! The company is now searchable in your vector database
   • Semantic search: Find similar companies
   • Metadata filters: Filter by industry, location, etc.
   • RAG pipeline: Answer questions about this company
RAG Pipeline for Sales Intelligence
RAG combines retrieval (finding relevant documents) with generation (LLM synthesis) to answer questions based on your indexed data.
User Question → Text Embedder → Retriever → Prompt Builder → Generator → Answer
Build the RAG Pipeline
Now let’s assemble all components into a complete RAG pipeline.
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.embedders.google_genai import GoogleGenAITextEmbedder
from haystack_integrations.components.generators.google_genai import GoogleGenAIChatGenerator

# Define the prompt template for sales intelligence
system_message = ChatMessage.from_system("""
You are a sales intelligence assistant. Your role is to analyze company and people data to provide actionable sales intelligence.

When answering queries:
- Cite specific company names and details from the data
- Provide insights relevant for sales outreach
- Highlight key information like funding, company size, location, recent news
- Suggest talking points for personalized outreach
""")

user_template = """
Based on the following company/person data, answer the user's question.

Context:
{% for document in documents %}
{{ document.content }}
---
{% endfor %}

Question: {{ question }}

Provide a detailed, actionable answer based on the retrieved data.
"""
user_message = ChatMessage.from_user(user_template)

# Create the RAG pipeline
rag_pipeline = Pipeline()

# Add components
rag_pipeline.add_component("text_embedder", GoogleGenAITextEmbedder(model="text-embedding-004"))
rag_pipeline.add_component("retriever", MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=5))
rag_pipeline.add_component("prompt_builder", ChatPromptBuilder(template=[system_message, user_message]))
rag_pipeline.add_component("generator", GoogleGenAIChatGenerator(model="gemini-2.5-flash"))

# Connect components
rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "generator.messages")

print("✅ RAG pipeline created")
print("\nPipeline structure:")
print("  Question → Text Embedder → Retriever → Prompt Builder → Generator → Answer")
print("\nComponents:")
print("  • Text Embedder: text-embedding-004 (768d)")
print("  • Retriever: MongoDB Atlas (top_k=5)")
print("  • Prompt Builder: Sales intelligence template")
print("  • Generator: gemini-2.5-flash")
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.
Both GOOGLE_API_KEY and GEMINI_API_KEY are set. Using GOOGLE_API_KEY.
✅ RAG pipeline created

Pipeline structure:
  Question → Text Embedder → Retriever → Prompt Builder → Generator → Answer

Components:
  • Text Embedder: text-embedding-004 (768d)
  • Retriever: MongoDB Atlas (top_k=5)
  • Prompt Builder: Sales intelligence template
  • Generator: gemini-2.5-flash
Demo: Query the RAG Pipeline
Let’s test our RAG pipeline with a sales intelligence question.
# Example query: Ask about companies in the database
question = "What can you tell me about OpenAI? Include details about their industry, products, and any relevant information for sales outreach."
print(f"Question: {question}")
print("\n" + "="*80)
print("Processing...")
print("="*80 + "\n")
try:
# Run the RAG pipeline with include_outputs_from to get retriever results
result = rag_pipeline.run(
data={
"text_embedder": {"text": question},
"prompt_builder": {"question": question}
},
include_outputs_from={"retriever"}
)
# Extract the answer using .text
answer = result["generator"]["replies"][0].text
print("Answer:")
print("-" * 80)
print(answer)
print("-" * 80)
# Show retrieved documents
if "retriever" in result:
retrieved_docs = result["retriever"]["documents"]
print(f"\nπ Retrieved {len(retrieved_docs)} relevant documents from MongoDB")
print("\n" + "="*80)
print("RETRIEVED DOCUMENTS:")
print("="*80)
for i, doc in enumerate(retrieved_docs, 1):
print(f"\nDocument {i}:")
print(f" Company: {doc.meta.get('company_name', 'N/A')}")
print(f" Source: {doc.meta.get('dataset_type', 'N/A')}")
print(f" Location: {doc.meta.get('location', 'N/A')}")
print(f" Industry: {doc.meta.get('industry', 'N/A')}")
print(f"\n Content:")
print(f" {doc.content[:300]}...")
print("-" * 80)
else:
print("\nβ οΈ Retriever output not available")
except Exception as e:
print(f"β Error: {e}")
import traceback
traceback.print_exc()
print("\nMake sure you have:")
print(" 1. Indexed at least one company (run the indexing demo cell)")
print(" 2. MongoDB collection exists and has data")
print(" 3. Vector search index is properly configured")
Question: What can you tell me about OpenAI? Include details about their industry, products, and any relevant information for sales outreach.
================================================================================
Processing...
================================================================================
Answer:
--------------------------------------------------------------------------------
Based on the provided data, here's what you can tell about OpenAI and relevant information for sales outreach:
**Company Overview:**
* **Name:** OpenAI
* **Core Business:** OpenAI is a leading AI research and deployment company. They are known for developing advanced AI models, most notably **ChatGPT**.
* **Key Industries:** They operate at the forefront of several cutting-edge AI fields, including:
* Agentic AI
* Artificial Intelligence (AI)
* Foundational AI
* Generative AI
* Machine Learning
* Natural Language Processing (NLP)
* SaaS (indicating they deploy their models as services)
* **Operating Status:** Active
* **Size:** The data indicates their employee count is substantial, either **1,001-5,000** or **5,001-10,000**. This suggests a rapidly growing, large enterprise.
* **Website:** https://www.openai.com
**Sales Intelligence and Talking Points:**
1. **Pioneers in AI:** OpenAI is a major player and innovator in the AI space, particularly in generative AI and foundational models. This indicates they are constantly looking for cutting-edge solutions, talent, and infrastructure to maintain their leadership.
* **Sales Angle:** Any product or service that enhances AI research, model development, deployment efficiency, or security for advanced AI systems would be highly relevant.
* **Talking Point:** "Given OpenAI's groundbreaking work in [Generative AI/Foundational AI] with models like ChatGPT, I imagine you're constantly seeking ways to optimize your [data processing/compute infrastructure/model deployment/AI safety protocols]."
2. **SaaS Provider:** Their inclusion in the "SaaS" industry means they are not just developing AI, but also productizing and deploying it as services. This implies needs related to scaling, customer support, API management, cloud infrastructure, and enterprise-grade reliability.
* **Sales Angle:** Solutions for large-scale SaaS operations, particularly those with high computational demands, would be valuable.
* **Talking Point:** "As a key SaaS provider in the AI space, managing the scalability and reliability of services like ChatGPT must be a critical focus. How are you currently addressing [specific SaaS challenge, e.g., low-latency inference at scale/secure API access]?"
3. **Large and Growing Organization:** With thousands of employees, OpenAI likely faces challenges typical of rapidly scaling enterprises, such as internal communication, talent management, complex project coordination, and managing diverse research and engineering teams.
* **Sales Angle:** Solutions for enterprise collaboration, project management, developer tools, or specialized HR/recruitment for AI talent could be relevant.
* **Talking Point:** "With OpenAI's rapid growth and the complexity of your AI projects, I'm curious how you manage [specific internal challenge, e.g., cross-functional collaboration between research and engineering/onboarding specialized AI talent]."
4. **Focus on Advanced AI:** Their specific industry tags like "Agentic AI" and "Foundational AI" highlight their focus on the most complex and impactful areas of AI. This implies a need for robust, high-performance, and secure infrastructure.
* **Sales Angle:** If your product or service provides a competitive advantage in areas like high-performance computing, specialized hardware (e.g., GPUs), data privacy, or ethical AI frameworks, it would resonate.
* **Talking Point:** "Your work in Foundational AI is truly pushing the boundaries. We've seen companies tackling similar challenges find significant value in our [specific solution, e.g., secure data pipelines for large models/compute orchestration for distributed AI training]."
**Overall Sales Strategy:**
When reaching out to OpenAI, emphasize how your solution directly supports their mission to advance AI, enhances their existing AI models or infrastructure, improves their SaaS offerings, or addresses the operational complexities of a leading, fast-growing AI enterprise. Tailor your message to their specific industry focus (e.g., Generative AI, Foundational AI) to demonstrate you understand their unique challenges and priorities.
--------------------------------------------------------------------------------
📄 Retrieved 3 relevant documents from MongoDB
================================================================================
RETRIEVED DOCUMENTS:
================================================================================
Document 1:
Company: OpenAI
Source: crunchbase_company
Location: [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}]
Industry: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS
Content:
Company: OpenAI
Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'artificial-intelligence', 'value': 'Artificial Intelligence (AI)'}, {'id': 'foundational-ai', 'value':...
--------------------------------------------------------------------------------
Document 2:
Company: OpenAI
Source: crunchbase_company
Location: [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}]
Industry: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS
Content:
Company: OpenAI
Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'artificial-intelligence', 'value': 'Artificial Intelligence (AI)'}, {'id': 'foundational-ai', 'value':...
--------------------------------------------------------------------------------
Document 3:
Company: OpenAI
Source: crunchbase_company
Location: [{'name': 'San Francisco', 'permalink': 'san-francisco-california'}, {'name': 'California', 'permalink': 'california-united-states'}, {'name': 'United States', 'permalink': 'united-states'}, {'name': 'North America', 'permalink': 'north-america'}]
Industry: Agentic AI, Artificial Intelligence (AI), Foundational AI, Generative AI, Machine Learning, Natural Language Processing, SaaS
Content:
Company: OpenAI
Overview: OpenAI is an AI research and deployment company that develops advanced AI models, including ChatGPT.
Industries: [{'id': 'agentic-ai-17fa', 'value': 'Agentic AI'}, {'id': 'artificial-intelligence', 'value': 'Artificial Intelligence (AI)'}, {'id': 'foundational-ai', 'value':...
--------------------------------------------------------------------------------
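Notice that all three hits are the same Crunchbase record: the demo index contains duplicates, likely because the same company was scraped and indexed more than once. A quick query-time guard is to deduplicate retrieved documents before they reach the prompt builder; the helper below is a minimal sketch (a hypothetical function, not part of the pipeline above) keyed on the metadata fields shown in the output. At indexing time, the cleaner fix is to write documents with stable IDs and Haystack's DuplicatePolicy.OVERWRITE so duplicates never enter the store.
# Hedged sketch: drop duplicate hits before prompt building.
# Keys on company_name + dataset_type, which the documents above carry;
# adjust the key to whatever uniquely identifies a record in your schema.
def dedupe_documents(documents):
    seen = set()
    unique = []
    for doc in documents:
        key = (doc.meta.get("company_name"), doc.meta.get("dataset_type"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

unique_docs = dedupe_documents(result["retriever"]["documents"])
print(f"{len(unique_docs)} unique document(s) after deduplication")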
