
Improving PostgreSQL Keyword Search to Avoid Empty Results


Notebook by Mayank Laddha

As noted in the Haystack documentation for PgvectorKeywordRetriever, this component, unlike others such as ElasticsearchBM25Retriever, doesn’t apply fuzzy search by default. As a result, queries need to be crafted carefully to avoid returning empty results.

In this notebook, you’ll extend PgvectorDocumentStore to make it more forgiving and flexible. You’ll learn how to subclass it to use PostgreSQL’s websearch_to_tsquery and how to leverage NLTK to extract keywords and transform user queries.

Haystack’s modular design makes it easy to tweak or enhance components when results don’t meet expectations, and this notebook will show exactly how to do that.

Setting up the Development Environment

Install required dependencies and set up PostgreSQL

%%bash
pip install -q haystack-ai pgvector-haystack psycopg
%%capture
# %%capture must be the first line of the cell; it suppresses the noisy installation output
# Install PostgreSQL and pgvector (version-agnostic)
!sudo apt-get -y -qq update
!sudo apt-get -y -qq install postgresql postgresql-server-dev-all git make gcc

# Build and install pgvector manually (works for any PostgreSQL version)
!git clone --quiet https://github.com/pgvector/pgvector.git
!cd pgvector && make && sudo make install

# Start PostgreSQL service
!sudo service postgresql start

# Set password for default user
!sudo -u postgres psql -c "ALTER USER postgres PASSWORD 'postgres';"

# Create database only if it doesn't exist
!sudo -u postgres psql -tc "SELECT 1 FROM pg_database WHERE datname = 'sampledb'" | grep -q 1 || \
sudo -u postgres psql -c "CREATE DATABASE sampledb;"

# Enable pgvector extension in sampledb
!sudo -u postgres psql -d sampledb -c "CREATE EXTENSION IF NOT EXISTS vector;"

Set an environment variable PG_CONN_STR with the connection string to your PostgreSQL database. This is needed for Haystack.

# set connection
import os
os.environ["PG_CONN_STR"] = "postgresql://postgres:postgres@localhost:5432/sampledb"

Subclassing PgvectorDocumentStore to Enable Websearch-Style Queries

Why not plainto_tsquery? Why websearch_to_tsquery?

plainto_tsquery transforms the unformatted text querytext to a tsquery value. The text is parsed and normalized much as for to_tsvector, then the & (AND) tsquery operator is inserted between the surviving words. So all of your keywords need to be present in the document.
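To see why AND semantics so easily produce empty results, here is a toy illustration in plain Python. The term sets below are made up and simply stand in for tsvector/tsquery matching; this is not how PostgreSQL matches internally.

```python
# Toy illustration of AND vs OR matching semantics; the sets stand in
# for the lexemes PostgreSQL would extract into a tsvector.
doc_terms = {"name", "jean", "live", "paris"}      # terms in a document
query_terms = {"jean", "mayank", "alex", "paris"}  # keywords from a user query

# plainto_tsquery semantics: every query term must be present (AND)
and_match = query_terms <= doc_terms
# OR-joined query semantics: any single query term is enough
or_match = bool(query_terms & doc_terms)

print(and_match, or_match)  # → False True
```

One unmatched keyword ("mayank") is enough to make the AND query fail, while the OR query still finds the document.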

websearch_to_tsquery creates a tsquery value from querytext using an alternative syntax in which simple unformatted text is a valid query. Unlike plainto_tsquery and phraseto_tsquery, it also recognizes certain operators. Moreover, this function will never raise syntax errors, which makes it possible to use raw user-supplied input for search. The following syntax is supported:

unquoted text: text not inside quote marks will be converted to terms separated by & operators, as if processed by plainto_tsquery.

“quoted text”: text inside quote marks will be converted to terms separated by <-> operators, as if processed by phraseto_tsquery.

OR: the word “or” will be converted to the | operator.

-: a dash will be converted to the ! operator.
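The operator mapping above can be sketched in plain Python. This is a rough illustration only, not PostgreSQL's actual parser: it ignores stemming, stop words, and quoted phrases.

```python
def sketch_websearch_operators(user_query: str) -> str:
    """Rough sketch of how websearch_to_tsquery maps user syntax to
    tsquery operators. NOT PostgreSQL's real parser: no stemming,
    stop-word removal, or quoted-phrase handling."""
    out = []
    for token in user_query.split():
        if token.lower() == "or":
            out.append("|")              # the word "or" becomes the | operator
        elif token.startswith("-"):
            out.append("!" + token[1:])  # a dash becomes the ! operator
        else:
            out.append(token)
    # unquoted terms are otherwise joined with & (AND)
    joined = []
    for i, tok in enumerate(out):
        if i > 0 and tok != "|" and out[i - 1] != "|":
            joined.append("&")
        joined.append(tok)
    return " ".join(joined)

print(sketch_websearch_operators("Jean or Paris -Berlin"))  # → Jean | Paris & !Berlin
```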

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from psycopg.sql import SQL, Composed, Identifier, Literal as SQLLiteral
from typing import Dict, Any, Optional, Tuple, Union

class CustomPgvectorDocumentStore(PgvectorDocumentStore):
    def _build_keyword_retrieval_query(
        self, query: str, top_k: int, filters: Optional[Dict[str, Any]] = None
    ) -> Tuple[Composed, tuple]:

        # Replace plainto_tsquery with websearch_to_tsquery
        KEYWORD_QUERY_CUSTOM = """
        SELECT {table_name}.*, ts_rank_cd(to_tsvector({language}, content), query) AS score
        FROM {schema_name}.{table_name}, websearch_to_tsquery({language}, %s) query
        WHERE to_tsvector({language}, content) @@ query
        """
        sql_select = SQL(KEYWORD_QUERY_CUSTOM).format(
            schema_name=Identifier(self.schema_name),
            table_name=Identifier(self.table_name),
            language=SQLLiteral(self.language),
            query=SQLLiteral(query),
        )

        where_params = ()
        sql_where_clause: Union[Composed, SQL] = SQL("")
        if filters:
            sql_where_clause, where_params = self._convert_filters_to_where_clause_and_params(
                filters=filters, operator="AND"
            )

        sql_sort = SQL(" ORDER BY score DESC LIMIT {top_k}").format(top_k=SQLLiteral(top_k))
        sql_query = sql_select + sql_where_clause + sql_sort

        return sql_query, where_params

Detect Keywords with NLTK

Detecting keywords ensures that only the relevant words end up in the query. So even if you decide to use the default implementation with plainto_tsquery, which joins terms with the AND operator, you still have a better chance of not getting zero results.

Download required packages

import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

Simple keyword detector

from nltk import word_tokenize, pos_tag
def extract_keywords(query: str):
    tokens = word_tokenize(query)
    nouns = [word for word, pos in pos_tag(tokens) if pos.startswith("NN")]
    return nouns[:5]
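If the NLTK downloads are not available in your environment, the same idea can be approximated with a plain stop-word filter. This is only a sketch: the stop-word list below is illustrative, not exhaustive, and unlike the POS-tagging version it also keeps non-noun words.

```python
# Illustrative stop-word list; a real one would be much larger
STOPWORDS = {"do", "you", "think", "and", "in", "the", "a", "is", "are"}

def extract_keywords_simple(query: str, max_keywords: int = 5):
    # Keep alphabetic tokens that aren't stop words; no POS tagging involved
    tokens = [t.strip("?,.!").lower() for t in query.split()]
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS][:max_keywords]

print(extract_keywords_simple("Do you think Jean, Mayank and Alex live in Paris?"))
# → ['jean', 'mayank', 'alex', 'live', 'paris']
```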

Test the Improved Implementation

from haystack import Document
from haystack_integrations.components.retrievers.pgvector import PgvectorKeywordRetriever
import psycopg

# Use our custom store instead of PgvectorDocumentStore; that's the only change
document_store = CustomPgvectorDocumentStore()

# The rest of the flow/pipeline stays the same
retriever = PgvectorKeywordRetriever(document_store=document_store, top_k=1)

document_store.write_documents([
    Document(id="1", content="My name is Jean and I live in Paris."),
    Document(id="2", content="My name is Mark and I live in Berlin."),
    Document(id="3", content="My name is Giorgio and I live in Rome.")
])

query = "Do you think Jean, Mayank and Alex live in Paris?"
keywords = extract_keywords(query)
transformed_query = " OR ".join(keywords)
print("transformed query",transformed_query)
res = retriever.run(query=transformed_query)
print("result",res)

# Clean up: delete the test data
schema_name = "public"
vec_table_name = "haystack_documents"
vec_full_table_name = f"{schema_name}.{vec_table_name}"
db_url = os.environ.get("PG_CONN_STR")
with psycopg.connect(db_url) as conn:
    with conn.cursor() as cur:
        cur.execute(f"TRUNCATE TABLE {vec_full_table_name} RESTART IDENTITY CASCADE;")
transformed query Jean OR Mayank OR Alex OR Paris
result {'documents': [Document(id=1, content: 'My name is Jean and I live in Paris.', score: 0.2)]}
# Use the default store
document_store = PgvectorDocumentStore()

retriever = PgvectorKeywordRetriever(document_store=document_store, top_k=1)

document_store.write_documents([
    Document(id="1", content="My name is Jean and I live in Paris."),
    Document(id="2", content="My name is Mark and I live in Berlin."),
    Document(id="3", content="My name is Giorgio and I live in Rome.")
])

query = "Do you think Jean, Mayank and Alex live in Paris?"
res = retriever.run(query=query)
print("result",res)

# Clean up: delete the test data
schema_name = "public"
vec_table_name = "haystack_documents"
vec_full_table_name = f"{schema_name}.{vec_table_name}"
db_url = os.environ.get("PG_CONN_STR")
with psycopg.connect(db_url) as conn:
    with conn.cursor() as cur:
        cur.execute(f"TRUNCATE TABLE {vec_full_table_name} RESTART IDENTITY CASCADE;")
result {'documents': []}

As you can see, the results are empty with the default implementation, while our custom store found the right document.