Document Stores

You can think of the Document Store as a "database" that:

  • stores your texts and metadata
  • provides them to the retriever at query time

There are different DocumentStores in Haystack to fit different use cases and tech stacks.

Initialisation

Initialising a new Document Store is straightforward.

# Pick the one that fits your setup; each line creates a different type of store.
document_store = ElasticsearchDocumentStore()
document_store = FAISSDocumentStore()
document_store = InMemoryDocumentStore()
document_store = SQLDocumentStore()

Each DocumentStore constructor accepts arguments that specify how to connect to an existing database and which index names to use. See the API documentation for more information.

Preparing Documents

DocumentStores expect Documents in dictionary form, as shown below. They are loaded using the DocumentStore.write_documents() method.

document_store = ElasticsearchDocumentStore()
dicts = [
    {
        'text': DOCUMENT_TEXT_HERE,
        'meta': {'name': DOCUMENT_NAME, ...}
    }, ...
]
document_store.write_documents(dicts)
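As a minimal sketch of the dictionary form above, the snippet below builds the list from in-memory (name, text) pairs. The helper to_document_dicts and the sample texts are hypothetical and not part of Haystack; only the 'text' and 'meta' keys come from the format shown above.

```python
from typing import Dict, Iterable, List, Tuple


def to_document_dicts(pairs: Iterable[Tuple[str, str]]) -> List[Dict]:
    """Turn (name, text) pairs into the dictionary form shown above.

    Hypothetical helper for illustration; not part of Haystack.
    """
    return [{"text": text, "meta": {"name": name}} for name, text in pairs]


dicts = to_document_dicts([
    ("intro.txt", "Haystack is a framework for question answering."),
    ("stores.txt", "A Document Store holds texts and their metadata."),
])
# The resulting list could then be passed to document_store.write_documents(dicts).
```

Keeping the document name under 'meta' rather than at the top level means any further metadata fields can be added next to it without changing the overall shape.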

File Conversion

Haystack includes a range of file converters that can help get your data into the right format. It supports the txt, pdf and docx formats, and there is even a converter that leverages Apache Tika. See the File Converters section in the API docs for more information.

Haystack also has a convert_files_to_dicts() utility function that will convert all txt or pdf files in a given folder into this dictionary format.

document_store = ElasticsearchDocumentStore()
dicts = convert_files_to_dicts(dir_path=doc_dir)
document_store.write_documents(dicts)
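To make the target format concrete, here is a rough sketch of what such a conversion might look like for plain txt files only. The function txt_files_to_dicts is a hypothetical stand-in written for this example; it is not Haystack's convert_files_to_dicts(), which also handles pdf files and may differ in detail.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def txt_files_to_dicts(dir_path: str) -> List[Dict]:
    """Read every .txt file in dir_path into the dictionary form.

    Hypothetical illustration only; not Haystack's implementation.
    """
    dicts = []
    for path in sorted(Path(dir_path).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Keep the file name as metadata so the source can be traced later.
        dicts.append({"text": text, "meta": {"name": path.name}})
    return dicts


# Small demonstration on a temporary folder with made-up files.
with tempfile.TemporaryDirectory() as doc_dir:
    Path(doc_dir, "a.txt").write_text("First document.", encoding="utf-8")
    Path(doc_dir, "b.txt").write_text("Second document.", encoding="utf-8")
    Path(doc_dir, "notes.md").write_text("Skipped: not a txt file.", encoding="utf-8")
    dicts = txt_files_to_dicts(doc_dir)
```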

Writing Documents

Haystack lets you write documents to the store in an optimised fashion so that query times stay low.

For Sparse Retrievers

For sparse, keyword-based retrievers such as BM25 and TF-IDF, you simply have to call DocumentStore.write_documents(). The inverted index, which speeds up querying, is created automatically.

document_store.write_documents(dicts)
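To illustrate why an inverted index speeds up keyword queries, here is a toy sketch in plain Python. It is not how Elasticsearch or Haystack implement it: tokenize, build_inverted_index and search are hypothetical names, and ranking is by simple term overlap rather than BM25 or TF-IDF.

```python
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    # Naive lowercase whitespace tokenizer that strips surrounding punctuation.
    tokens = (tok.strip(".,!?;:") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def build_inverted_index(dicts: List[dict]) -> Dict[str, Set[int]]:
    # Map each term to the positions of the documents that contain it.
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, doc in enumerate(dicts):
        for term in tokenize(doc["text"]):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    # Only the postings of the query terms are touched, never the full texts.
    hits: Dict[int, int] = defaultdict(int)
    for term in tokenize(query):
        for doc_id in index.get(term, set()):
            hits[doc_id] += 1
    # Rank by number of matched query terms, ties broken by document position.
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


dicts = [
    {"text": "Berlin is the capital of Germany.", "meta": {"name": "berlin"}},
    {"text": "Paris is the capital of France.", "meta": {"name": "paris"}},
]
index = build_inverted_index(dicts)
result = search(index, "capital of Germany")
```

The index is built once at write time; each query then looks up a handful of terms instead of scanning every document.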

For Dense Retrievers

For dense, neural-network-based retrievers such as Dense Passage Retrieval or Embedding Retrieval, indexing involves computing the Document embeddings, which are later compared against the Query embedding.

The storing of the text is handled by DocumentStore.write_documents() and the computation of the embeddings is started by DocumentStore.update_embeddings().

document_store.write_documents(dicts)
document_store.update_embeddings(retriever)

This step is computationally intensive because it runs the transformer-based encoders over every document. GPU acceleration speeds it up significantly.
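The comparison between Document embeddings and the Query embedding can be sketched without any model. In the example below, toy_embed is a hypothetical stand-in for a transformer encoder (it merely counts words from a fixed vocabulary), and the dot product is assumed as the similarity measure; real retrievers produce learned vectors. The point is the division of labour: document vectors are computed once at indexing time, and only the query is embedded at query time.

```python
from typing import List

VOCAB = ["capital", "germany", "france", "river"]  # toy vocabulary (assumption)


def toy_embed(text: str) -> List[float]:
    # Hypothetical stand-in for a neural encoder: one count per vocabulary word.
    tokens = [tok.strip(".,!?") for tok in text.lower().split()]
    return [float(tokens.count(word)) for word in VOCAB]


def dot(a: List[float], b: List[float]) -> float:
    # Dot-product similarity between two vectors of equal length.
    return sum(x * y for x, y in zip(a, b))


texts = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
    "The Rhine is a river in Germany.",
]

# Indexing time: compute and keep one embedding per document (the costly step).
doc_embeddings = [toy_embed(text) for text in texts]

# Query time: embed only the query and compare it against the stored vectors.
query_embedding = toy_embed("What is the capital of France?")
scores = [dot(query_embedding, emb) for emb in doc_embeddings]
best = max(range(len(texts)), key=lambda i: scores[i])
```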

Choosing the right document store

The DocumentStores have different characteristics. Choose one based on the maturity of your project, your use case and your technical environment:

Elasticsearch

Pros:

  • Fast & accurate sparse retrieval
  • Basic support for dense retrieval
  • Production-ready
  • Many options to tune sparse retrieval

Cons:

  • Slow for dense retrieval with more than ~1 million documents

FAISS

Pros:

  • Fast & accurate dense retrieval
  • Highly scalable due to approximate nearest neighbour algorithms (ANN)
  • Many options to tune dense retrieval via different index types

Cons:

  • No efficient sparse retrieval

SQL

Pros:

  • Simple
  • Exists already in many environments

Cons:

  • Only compatible with minimal TF-IDF Retriever
  • Bad retrieval performance
  • Not recommended for production

In Memory

Pros:

  • Simple & fast to test
  • No database requirements

Cons:

  • Not scalable
  • Does not persist your data on disk

Our recommendations

Restricted environment: Use the InMemoryDocumentStore if you are just giving Haystack a quick try on a small sample and are working in a restricted environment that complicates running Elasticsearch or other databases.

Allrounder: Use the ElasticsearchDocumentStore if you want to evaluate the performance of different retrieval options (dense vs. sparse) and are aiming for a smooth transition from PoC to production.

Vector Specialist: Use the FAISSDocumentStore if you want to focus on dense retrieval and may need to handle larger datasets.

© 2020 - 2021 deepset. All rights reserved.