Maintained by deepset

Integration: AstraDB

A Document Store for storing documents in AstraDB and retrieving them from it, built for Haystack 2.0.

Authors
Nicholas Brackley
deepset

Overview

DataStax Astra DB is a serverless vector database built on Apache Cassandra that supports vector-based search and auto-scaling. You can deploy it on AWS, GCP, or Azure and expand to one or more regions within those clouds for multi-region availability, low-latency data access, and data sovereignty, and to avoid cloud vendor lock-in. For more information, see the DataStax documentation.

This integration allows you to use AstraDB for document storage and retrieval in your Haystack 2.0 pipelines. This page provides instructions on how to initialize an AstraDB instance and connect it to Haystack.

Components

  • AstraDocumentStore. This component serves as a persistent data store for your Haystack documents, and supports a number of embedding models and vector dimensions.
  • AstraEmbeddingRetriever. This is an embedding-based Retriever compatible with the Astra Document Store.

Initialization

First, sign up for a free DataStax account. Follow these instructions for creating an AstraDB database in the DataStax console. Make sure you have a collection, a keyspace name, and an access token, since you’ll need those later.

Installation

pip install astra-haystack

Usage

This package includes Astra Document Store and Astra Retriever classes that integrate with Haystack 2.0, allowing you to easily perform document retrieval or RAG with AstraDB, and include those functions in Haystack pipelines.

To connect AstraDB with Haystack, you’ll need these pieces of information from your DataStax console:

  • Database API Endpoint
  • Application Token
  • Astra collection name (optional; if omitted, the default name documents is used)
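
Since the examples on this page rely on environment variables for the endpoint and token, it can help to check that they are set before creating the document store. The following is a minimal sketch using only the standard library; the helper name missing_astra_env is hypothetical and is not part of astra-haystack, while the two variable names are the ones used in the examples on this page.

```python
import os

# Variable names as used in the examples on this page.
REQUIRED_ENV_VARS = ("ASTRA_DB_API_ENDPOINT", "ASTRA_DB_APPLICATION_TOKEN")


def missing_astra_env(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; pass a plain dict to check other mappings.
    This helper is illustrative and not part of the astra-haystack package.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_astra_env()
    if missing:
        print("Set these before creating the document store:", ", ".join(missing))
    else:
        print("Astra environment variables are set.")
```

Running the check first turns a missing credential into a clear message instead of a connection error later in the pipeline.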

How to use the AstraDocumentStore:

from haystack import Document
from haystack_integrations.document_stores.astra import AstraDocumentStore

# Make sure ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN environment variables are set
document_store = AstraDocumentStore()

document_store.write_documents([
    Document(content="This is first"),
    Document(content="This is second")
    ])
print(document_store.count_documents())

How to use the AstraEmbeddingRetriever:

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack_integrations.components.retrievers.astra import AstraEmbeddingRetriever
from haystack_integrations.document_stores.astra import AstraDocumentStore


# Make sure ASTRA_DB_API_ENDPOINT and ASTRA_DB_APPLICATION_TOKEN environment variables are set
document_store = AstraDocumentStore()

model = "sentence-transformers/all-mpnet-base-v2"

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]

# Compute embeddings for the documents before writing them to the store
document_embedder = SentenceTransformersDocumentEmbedder(model=model)
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings["documents"])

# Build a query pipeline: embed the query text, then retrieve by embedding
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder(model=model))
query_pipeline.add_component("retriever", AstraEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result['retriever']['documents'][0])

Note:

The current version of the Astra JSON API does not support the following operators: $lt, $lte, $gt, $gte, $nin, $not, $neq. It also does not support filtering on None values: documents are stored as JSON, and fields whose value is None are not inserted.
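
To catch these limitations before a query reaches the database, you could pre-check filters and metadata on the client side. The sketch below is an assumption-laden illustration: it scans a nested, $-prefixed filter dictionary for the operators listed in the note and drops None-valued metadata fields. Both helper names are hypothetical, and the filter shape (including the $and and $in keys in the usage example) is only illustrative; it is not necessarily the format that astra-haystack sends to Astra.

```python
# Operators listed in the note as unsupported by the Astra JSON API.
UNSUPPORTED_OPERATORS = {"$lt", "$lte", "$gt", "$gte", "$nin", "$not", "$neq"}


def find_unsupported_operators(filters):
    """Recursively collect unsupported operator keys from a nested filter.

    Hypothetical helper: walks dicts and lists and returns the set of keys
    that appear in UNSUPPORTED_OPERATORS.
    """
    found = set()
    if isinstance(filters, dict):
        for key, value in filters.items():
            if key in UNSUPPORTED_OPERATORS:
                found.add(key)
            found |= find_unsupported_operators(value)
    elif isinstance(filters, (list, tuple)):
        for item in filters:
            found |= find_unsupported_operators(item)
    return found


def drop_none_values(meta):
    """Return a copy of `meta` without None-valued fields (hypothetical helper)."""
    return {key: value for key, value in meta.items() if value is not None}


# Illustrative filter shape; the keys other than $gt are assumptions.
example_filter = {"$and": [{"year": {"$gt": 2020}}, {"lang": {"$in": ["en"]}}]}
print(find_unsupported_operators(example_filter))
print(drop_none_values({"lang": "en", "author": None}))
```

Dropping None values up front makes the stored metadata match what you can later filter on, since such fields would not be persisted anyway.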

License

astra-haystack is distributed under the terms of the Apache-2.0 license.