DocumentWriter

Use this component to write Documents into a Document Store of your choice.

Name: DocumentWriter
Folder Path: /writers/
Position in a Pipeline: In indexing pipelines as the last component
Inputs: "documents" (list of Documents)
Outputs: "documents_written" (number of Documents written, integer)

Overview

DocumentWriter writes a list of Documents into a Document Store of your choice. It's typically used in an indexing pipeline as the final step after preprocessing Documents and creating their embeddings.

To use this component with a specific file type, make sure you use the correct Converter before it. For example, to use DocumentWriter with Markdown files, use the MarkdownToDocument component before DocumentWriter in your indexing pipeline.

DuplicatePolicy

DuplicatePolicy is a class that defines the options for handling Documents that share an ID with a Document already in the DocumentStore. It has the following values:

  • NONE: The default. Leaves duplicate handling to the DocumentStore's own settings.
  • OVERWRITE: If a Document with the same ID already exists in the DocumentStore, it is overwritten with the new Document.
  • SKIP: If a Document with the same ID already exists, the new Document is skipped and not added to the DocumentStore.
  • FAIL: Raises an error if a Document with the same ID already exists in the DocumentStore, which prevents duplicates from being added.
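To make the OVERWRITE, SKIP, and FAIL behaviours described above concrete, here is a minimal plain-Python sketch of their semantics. It is not Haystack's implementation: `Policy`, `write_with_policy`, and the dictionary used as a store are hypothetical names introduced only for illustration, and real Document Stores may differ in details such as the error type they raise.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical stand-in for the duplicate-handling options (illustration only)."""
    OVERWRITE = "overwrite"
    SKIP = "skip"
    FAIL = "fail"


def write_with_policy(store: dict, documents: list, policy: Policy) -> int:
    """Write documents into `store` (a dict keyed by ID); return how many were written."""
    written = 0
    for doc in documents:
        exists = doc["id"] in store
        if exists and policy is Policy.FAIL:
            # FAIL: refuse to add a document whose ID is already stored
            raise ValueError(f"Document with ID {doc['id']!r} already exists")
        if exists and policy is Policy.SKIP:
            # SKIP: keep the stored document and ignore the incoming one
            continue
        # Either a new ID, or OVERWRITE replacing the stored document
        store[doc["id"]] = doc
        written += 1
    return written


existing = {"1": {"id": "1", "content": "old"}}
incoming = [{"id": "1", "content": "new"}, {"id": "2", "content": "other"}]

# Each call works on its own copy so the policies can be compared side by side
skipped_count = write_with_policy(dict(existing), incoming, Policy.SKIP)
overwritten_count = write_with_policy(dict(existing), incoming, Policy.OVERWRITE)
```

In this sketch the returned count plays the role of the "documents_written" output: under SKIP only the Document with the new ID is counted, while under OVERWRITE both are.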

Usage

On its own

Below is an example of how to write two Documents into an InMemoryDocumentStore:

from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter

documents = [
    Document(content="This is document 1"),
    Document(content="This is document 2"),
]

document_store = InMemoryDocumentStore()
document_writer = DocumentWriter(document_store=document_store)
# run() returns a dictionary with the "documents_written" count
document_writer.run(documents=documents)

In a pipeline

Below is an example of an indexing pipeline that first uses the SentenceTransformersDocumentEmbedder to create embeddings of Documents and then uses the DocumentWriter to write the Documents to an InMemoryDocumentStore:

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

documents = [
    Document(content="This is document 1"),
    Document(content="This is document 2"),
]

document_store = InMemoryDocumentStore()
embedder = SentenceTransformersDocumentEmbedder()
# DuplicatePolicy.NONE leaves duplicate handling to the Document Store
document_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.NONE)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=embedder, name="embedder")
indexing_pipeline.add_component(instance=document_writer, name="writer")

# The embedder's "documents" output feeds the writer's "documents" input
indexing_pipeline.connect("embedder", "writer")
indexing_pipeline.run({"embedder": {"documents": documents}})

Related Links

See the parameter details in our API reference: