Simple Keyword Extraction using OpenAIChatGenerator


This notebook demonstrates how to extract keywords and key phrases from text using Haystack’s ChatPromptBuilder together with an LLM via OpenAIChatGenerator. We will:

  • Define a prompt that instructs the model to identify single- and multi-word keywords.

  • Capture each keyword’s character offsets.

  • Assign a relevance score (0–1).

  • Parse and display the results as JSON.

Install packages and set up the OpenAI API key

!pip install haystack-ai
import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

Import Required Libraries

import json


from haystack.dataclasses import ChatMessage
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator

Prepare Text

Collect the text you want to analyze.

text_to_analyze = "Artificial intelligence models like large language models are increasingly integrated into various sectors including healthcare, finance, education, and customer service. They can process natural language, generate text, translate languages, and extract meaningful insights from unstructured data. When performing key word extraction, these systems identify the most significant terms, phrases, or concepts that represent the core meaning of a document. Effective extraction must balance between technical terminology, domain-specific jargon, named entities, action verbs, and contextual relevance. The process typically involves tokenization, stopword removal, part-of-speech tagging, frequency analysis, and semantic relationship mapping to prioritize terms that most accurately capture the document's essential information and main topics."

Build the Prompt

We construct a single-message template that instructs the model to extract keywords, their positions, and their scores, and to return the output as a JSON object.

messages = [
    ChatMessage.from_user(
        '''
You are a keyword extractor. Extract the most relevant keywords and phrases from the following text. For each keyword:
1. Find single and multi-word keywords that capture important concepts
2. Include the starting position (index) where each keyword appears in the text
3. Assign a relevance score between 0 and 1 for each keyword
4. Focus on nouns, noun phrases, and important terms

Text to analyze: {{text}}

Return the results as a JSON object in this exact format:
{
  "keywords": [
    {
      "keyword": "example term",
      "positions": [5],
      "score": 0.95
    },
    {
      "keyword": "another keyword",
      "positions": [20],
      "score": 0.85
    }
  ]
}

Important:
- Each keyword must have its EXACT character position in the text (counting from 0)
- Scores should reflect the relevance (0–1)
- Include both single words and meaningful phrases
- List results from highest to lowest score
'''
    )
]

builder = ChatPromptBuilder(template=messages, required_variables='*')
prompt = builder.run(text=text_to_analyze)

Initialize the Generator and Extract Keywords

We use OpenAIChatGenerator (here with the gpt-4o-mini model) to send our prompt, and request a JSON-formatted response through the response_format entry in generation_kwargs.

# Initialize the chat-based generator
extractor = OpenAIChatGenerator(model="gpt-4o-mini")

# Run the generator with our formatted prompt
results = extractor.run(
    messages=prompt["prompt"],
    generation_kwargs={"response_format": {"type": "json_object"}}
)

# Extract the raw text reply
output_str = results["replies"][0].text
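
JSON mode constrains the reply to syntactically valid JSON, but it does not by itself guarantee that the object has the keys or value ranges the prompt asked for. The following is a minimal sketch of a defensive parser, assuming the reply follows the shape requested in the prompt. parse_keywords is a hypothetical helper (not part of Haystack), and the sample string is made-up data for illustration.

```python
import json


def parse_keywords(raw: str) -> list[dict]:
    """Parse a reply and keep only well-formed keyword entries.

    Hypothetical helper, not part of Haystack. It assumes the
    {"keywords": [{"keyword", "positions", "score"}, ...]} shape
    requested in the prompt and drops entries that do not match it.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    items = data.get("keywords")
    if not isinstance(items, list):
        return []

    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        keyword = item.get("keyword")
        positions = item.get("positions")
        score = item.get("score")
        # The keyword must be a non-empty string.
        if not isinstance(keyword, str) or not keyword:
            continue
        # Positions must be a list of non-negative integers (bool is excluded).
        if not isinstance(positions, list) or not all(
            isinstance(p, int) and not isinstance(p, bool) and p >= 0
            for p in positions
        ):
            continue
        # The score must be a number in the closed interval [0, 1].
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            continue
        if not 0 <= score <= 1:
            continue
        valid.append(
            {"keyword": keyword, "positions": positions, "score": float(score)}
        )
    return valid


# Made-up sample reply: the second entry has a string instead of a list of
# positions, and the third has an out-of-range score, so both are dropped.
sample = (
    '{"keywords": ['
    '{"keyword": "healthcare", "positions": [63], "score": 0.9},'
    '{"keyword": "finance", "positions": "72", "score": 0.9},'
    '{"keyword": "education", "positions": [81], "score": 1.7}'
    ']}'
)
print(parse_keywords(sample))
```

In the notebook this could be called as parse_keywords(output_str). Dropping malformed entries silently is only one possible policy; logging them or raising an error may suit stricter pipelines better.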

Parse and Display Results

Finally, convert the returned JSON string into a Python object and iterate over the extracted keywords.

try:
    data = json.loads(output_str)
    for kw in data["keywords"]:
        print(f'Keyword: {kw["keyword"]}')
        print(f' Positions: {kw["positions"]}')
        print(f' Score: {kw["score"]}\n')
except json.JSONDecodeError:
    print("Failed to parse the output as JSON. Raw output:", output_str)

Keyword: artificial intelligence
 Positions: [0]
 Score: 1.0

Keyword: large language models
 Positions: [18]
 Score: 0.95

Keyword: healthcare
 Positions: [63]
 Score: 0.9

Keyword: finance
 Positions: [72]
 Score: 0.9

Keyword: education
 Positions: [81]
 Score: 0.9

Keyword: customer service
 Positions: [91]
 Score: 0.9

Keyword: natural language
 Positions: [108]
 Score: 0.85

Keyword: unstructured data
 Positions: [162]
 Score: 0.85

Keyword: key word extraction
 Positions: [193]
 Score: 0.8

Keyword: significant terms
 Positions: [215]
 Score: 0.8

Keyword: technical terminology
 Positions: [290]
 Score: 0.75

Keyword: domain-specific jargon
 Positions: [311]
 Score: 0.75

Keyword: named entities
 Positions: [334]
 Score: 0.7

Keyword: action verbs
 Positions: [352]
 Score: 0.7

Keyword: contextual relevance
 Positions: [367]
 Score: 0.7

Keyword: tokenization
 Positions: [406]
 Score: 0.65

Keyword: stopword removal
 Positions: [420]
 Score: 0.65

Keyword: part-of-speech tagging
 Positions: [437]
 Score: 0.65

Keyword: frequency analysis
 Positions: [457]
 Score: 0.65

Keyword: semantic relationship mapping
 Positions: [476]
 Score: 0.65

Keyword: essential information
 Positions: [508]
 Score: 0.6

Keyword: main topics
 Positions: [529]
 Score: 0.6
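
The character positions reported by the model are not necessarily accurate. In the output above, for example, "large language models" is listed at position 18, whereas a direct search of the text finds it at index 36. If exact offsets matter, one option is to recompute them locally rather than trust the reply. The sketch below assumes a case-insensitive literal match is what is wanted; find_offsets is a hypothetical helper, and the excerpt is just the opening of the sample text.

```python
import re


def find_offsets(text: str, keyword: str) -> list[int]:
    """Return the start index of every non-overlapping match, ignoring case."""
    # re.escape treats the keyword as literal text, so characters such as
    # "-" or "." in a keyword are not interpreted as regex syntax.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [match.start() for match in pattern.finditer(text)]


# Opening of the sample text, shortened to keep the example small.
excerpt = (
    "Artificial intelligence models like large language models are "
    "increasingly integrated into various sectors"
)

for kw in ["artificial intelligence", "large language models", "models"]:
    print(kw, find_offsets(excerpt, kw))
# artificial intelligence [0]
# large language models [36]
# models [24, 51]
```

Matching with a case-insensitive regex keeps the indices aligned with the original string, which lowercasing the text first does not always guarantee for non-ASCII input. In the notebook, each entry's "positions" list could be overwritten with find_offsets(text_to_analyze, kw["keyword"]).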