Simple Keyword Extraction using OpenAIChatGenerator
Last Updated: May 9, 2025
This notebook demonstrates how to extract keywords and key phrases from text using Haystack’s ChatPromptBuilder together with an LLM via OpenAIChatGenerator. We will:
- Define a prompt that instructs the model to identify single- and multi-word keywords.
- Capture each keyword’s character offsets.
- Assign a relevance score (0–1).
- Parse and display the results as JSON.
Install packages and set up the OpenAI API key
!pip install haystack-ai
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Import Required Libraries
import json
from haystack.dataclasses import ChatMessage
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
Prepare Text
Provide the text you want to analyze.
text_to_analyze = "Artificial intelligence models like large language models are increasingly integrated into various sectors including healthcare, finance, education, and customer service. They can process natural language, generate text, translate languages, and extract meaningful insights from unstructured data. When performing key word extraction, these systems identify the most significant terms, phrases, or concepts that represent the core meaning of a document. Effective extraction must balance between technical terminology, domain-specific jargon, named entities, action verbs, and contextual relevance. The process typically involves tokenization, stopword removal, part-of-speech tagging, frequency analysis, and semantic relationship mapping to prioritize terms that most accurately capture the document's essential information and main topics."
Build the Prompt
We construct a single-message template that instructs the model to extract keywords together with their positions and relevance scores, and to return the output as a JSON object.
messages = [
    ChatMessage.from_user(
        '''
You are a keyword extractor. Extract the most relevant keywords and phrases from the following text. For each keyword:
1. Find single and multi-word keywords that capture important concepts
2. Include the starting position (index) where each keyword appears in the text
3. Assign a relevance score between 0 and 1 for each keyword
4. Focus on nouns, noun phrases, and important terms
Text to analyze: {{text}}
Return the results as a JSON object in this exact format:
{
  "keywords": [
    {
      "keyword": "example term",
      "positions": [5],
      "score": 0.95
    },
    {
      "keyword": "another keyword",
      "positions": [20],
      "score": 0.85
    }
  ]
}
Important:
- Each keyword must have its EXACT character position in the text (counting from 0)
- Scores should reflect the relevance (0–1)
- Include both single words and meaningful phrases
- List results from highest to lowest score
'''
    )
]
builder = ChatPromptBuilder(template=messages, required_variables='*')
prompt = builder.run(text=text_to_analyze)
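The prompt asks for character positions counted from 0. As a point of reference, the following is a minimal sketch of how such offsets can be computed directly in Python. The helper name find_offsets and the shortened sample string (the opening words of text_to_analyze) are assumptions made for illustration and are not part of the original notebook.

```python
import re


def find_offsets(text: str, keyword: str) -> list[int]:
    """Return every 0-based start offset of `keyword` in `text`, ignoring case."""
    # re.escape keeps characters such as "-" or "." in the keyword literal.
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [match.start() for match in pattern.finditer(text)]


# Hypothetical sample: the opening words of text_to_analyze.
sample = (
    "Artificial intelligence models like large language models "
    "are increasingly integrated into various sectors"
)

print(find_offsets(sample, "artificial intelligence"))  # [0]
print(find_offsets(sample, "large language models"))    # [36]
print(find_offsets(sample, "models"))                   # [24, 51]
```

Offsets computed this way can serve as ground truth when checking the positions a model reports.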
Initialize the Generator and Extract Keywords
We use OpenAIChatGenerator with the gpt-4o-mini model to send our prompt, and set response_format to json_object in generation_kwargs so that the reply is returned as JSON.
# Initialize the chat-based generator
extractor = OpenAIChatGenerator(model="gpt-4o-mini")
# Run the generator with our formatted prompt
results = extractor.run(
    messages=prompt["prompt"],
    generation_kwargs={"response_format": {"type": "json_object"}}
)
# Extract the raw text reply
output_str = results["replies"][0].text
Parse and Display Results
Finally, convert the returned JSON string into a Python object and iterate over the extracted keywords.
try:
    data = json.loads(output_str)
    for kw in data["keywords"]:
        print(f'Keyword: {kw["keyword"]}')
        print(f' Positions: {kw["positions"]}')
        print(f' Score: {kw["score"]}\n')
except json.JSONDecodeError:
    print("Failed to parse the output as JSON. Raw output:", output_str)
Keyword: artificial intelligence
Positions: [0]
Score: 1.0
Keyword: large language models
Positions: [18]
Score: 0.95
Keyword: healthcare
Positions: [63]
Score: 0.9
Keyword: finance
Positions: [72]
Score: 0.9
Keyword: education
Positions: [81]
Score: 0.9
Keyword: customer service
Positions: [91]
Score: 0.9
Keyword: natural language
Positions: [108]
Score: 0.85
Keyword: unstructured data
Positions: [162]
Score: 0.85
Keyword: key word extraction
Positions: [193]
Score: 0.8
Keyword: significant terms
Positions: [215]
Score: 0.8
Keyword: technical terminology
Positions: [290]
Score: 0.75
Keyword: domain-specific jargon
Positions: [311]
Score: 0.75
Keyword: named entities
Positions: [334]
Score: 0.7
Keyword: action verbs
Positions: [352]
Score: 0.7
Keyword: contextual relevance
Positions: [367]
Score: 0.7
Keyword: tokenization
Positions: [406]
Score: 0.65
Keyword: stopword removal
Positions: [420]
Score: 0.65
Keyword: part-of-speech tagging
Positions: [437]
Score: 0.65
Keyword: frequency analysis
Positions: [457]
Score: 0.65
Keyword: semantic relationship mapping
Positions: [476]
Score: 0.65
Keyword: essential information
Positions: [508]
Score: 0.6
Keyword: main topics
Positions: [529]
Score: 0.6
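Character positions reported by an LLM are not guaranteed to match the text, so it can be worth checking them before relying on them. The sketch below is one possible way to do that, not part of the original notebook: the helper check_keywords is a hypothetical name, and the sample data reuses the first two entries from the output above against a shortened version of the input text.

```python
def check_keywords(text: str, data: dict) -> list[dict]:
    """Validate each keyword entry and flag positions that do not match the text."""
    report = []
    for entry in data.get("keywords", []):
        keyword = entry.get("keyword")
        positions = entry.get("positions", [])
        score = entry.get("score")
        # Shape checks: non-empty string, list of ints, score within [0, 1].
        valid_shape = (
            isinstance(keyword, str)
            and keyword != ""
            and isinstance(positions, list)
            and all(isinstance(p, int) for p in positions)
            and isinstance(score, (int, float))
            and 0 <= score <= 1
        )
        mismatched = []
        if valid_shape:
            # A position passes only if the text at that offset equals the
            # keyword, ignoring case.
            mismatched = [
                p for p in positions
                if p < 0 or text[p:p + len(keyword)].lower() != keyword.lower()
            ]
        report.append({
            "keyword": keyword,
            "valid_shape": valid_shape,
            "mismatched_positions": mismatched,
        })
    return report


# Hypothetical sample: opening words of the input text, plus two reported entries.
sample_text = "Artificial intelligence models like large language models"
sample_data = {
    "keywords": [
        {"keyword": "artificial intelligence", "positions": [0], "score": 1.0},
        {"keyword": "large language models", "positions": [18], "score": 0.95},
    ]
}

for row in check_keywords(sample_text, sample_data):
    print(row)
```

In this sample the first entry passes, while position 18 for "large language models" is flagged, because the phrase actually starts at offset 36 in the text. In the notebook, the same check could be run as check_keywords(text_to_analyze, data).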