Maintained by deepset

Integration: Presidio

PII detection and anonymization for Haystack Documents and text strings, powered by Microsoft Presidio.

Authors
deepset
Shahmeer Ali

Overview

Microsoft Presidio is an open-source library for PII detection and anonymization using NLP-based entity recognition.

presidio-haystack provides three Haystack components:

| Component | Input | Purpose |
|---|---|---|
| PresidioDocumentCleaner | list[Document] | Replace PII in document text with entity type placeholders |
| PresidioTextCleaner | list[str] | Replace PII in plain strings; useful for sanitizing user queries |
| PresidioEntityExtractor | list[Document] | Detect PII and store entities as structured document metadata |

All components run locally; no external API is required. Presidio uses spaCy NLP models under the hood.

Installation

pip install presidio-haystack

en_core_web_lg is the recommended English model for best accuracy. For a lighter footprint, en_core_web_sm works too; see the full list of spaCy models for options.
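
Before constructing a component, it can help to confirm that the chosen spaCy model is actually installed. The following is a minimal sketch using only the standard library. It assumes the model was installed as a regular Python package, which is how `python -m spacy download` installs models. `spacy_model_installed` is a hypothetical helper, not part of presidio-haystack.

```python
import importlib.util


def spacy_model_installed(name: str) -> bool:
    """Return True if a top-level package with this name can be found.

    Hypothetical helper: spaCy models are installed as Python packages,
    so looking up the package spec is enough for a quick check.
    """
    return importlib.util.find_spec(name) is not None


# Report which of the two models mentioned above are available locally.
for model in ("en_core_web_lg", "en_core_web_sm"):
    status = "installed" if spacy_model_installed(model) else "missing"
    print(f"{model}: {status}")
```

If a model is reported as missing, `python -m spacy download en_core_web_lg` (or the smaller model name) installs it.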

Each component accepts a language parameter (default "en"). To process a language other than English, pass its language code and provide a model mapping for it, unless you want to use the large model, in which case no mapping is needed.

Usage

Document Cleaning

Replace PII in document content before indexing:

from haystack import Document
from haystack_integrations.components.preprocessors.presidio import PresidioDocumentCleaner

cleaner = PresidioDocumentCleaner()
result = cleaner.run(documents=[
    Document(content="Contact Alice Smith at alice@example.com or 212-555-1234.")
])
print(result["documents"][0].content)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.

Original documents are not mutated. Documents with no text content pass through unchanged.
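
The placeholder substitution shown above can be pictured as span replacement over detected character offsets. The sketch below is plain Python and is not the component's actual implementation. `replace_spans`, `span`, and the hand-built spans are hypothetical, standing in for what a detector would report.

```python
def replace_spans(text: str, spans: list[tuple[str, int, int]]) -> str:
    """Replace each (entity_type, start, end) span with <ENTITY_TYPE>.

    Spans are applied right to left so that earlier offsets stay valid
    after each substitution. Assumes the spans do not overlap.
    """
    redacted = text
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        redacted = redacted[:start] + f"<{entity_type}>" + redacted[end:]
    return redacted


original = "Contact Alice Smith at alice@example.com or 212-555-1234."


def span(entity_type: str, needle: str) -> tuple[str, int, int]:
    """Build a span by locating a known substring (illustration only)."""
    start = original.index(needle)
    return (entity_type, start, start + len(needle))


spans = [
    span("PERSON", "Alice Smith"),
    span("EMAIL_ADDRESS", "alice@example.com"),
    span("PHONE_NUMBER", "212-555-1234"),
]

cleaned = replace_spans(original, spans)
print(cleaned)
# Contact <PERSON> at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
print(original)  # the input string is left untouched
```

Building a new string rather than editing in place mirrors the non-mutating behavior described above.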

Text Cleaning

Sanitize user queries before they reach your LLM:

from haystack_integrations.components.preprocessors.presidio import PresidioTextCleaner

cleaner = PresidioTextCleaner()
result = cleaner.run(texts=["My name is John Doe, my SSN is 123-45-6789"])
print(result["texts"][0])
# My name is <PERSON>, my SSN is <US_SSN>

Entity Extraction

Detect PII and attach it as structured metadata without modifying the document text:

from haystack import Document
from haystack_integrations.components.extractors.presidio import PresidioEntityExtractor

extractor = PresidioEntityExtractor()
result = extractor.run(documents=[
    Document(content="Contact Alice at alice@example.com")
])
print(result["documents"][0].meta["entities"])
# [{"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
#  {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0}]
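
The stored metadata can be mapped back to the text it refers to. The sketch below uses the example output above as literal data, so no Presidio call is made. It assumes start and end are character offsets into the content with an exclusive end, which is consistent with the numbers shown. `matched_text` is a hypothetical helper.

```python
content = "Contact Alice at alice@example.com"

# Entity metadata copied from the example output above.
entities = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]


def matched_text(text: str, entity: dict) -> str:
    """Slice out the substring an entity refers to (end is exclusive)."""
    return text[entity["start"]:entity["end"]]


found = {e["entity_type"]: matched_text(content, e) for e in entities}
print(found)
# {'PERSON': 'Alice', 'EMAIL_ADDRESS': 'alice@example.com'}
```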

All three components accept language, entities, and score_threshold parameters at init time. See Presidio supported entities for the full list of detectable PII types.
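
To make the effect of entities and score_threshold concrete, the sketch below applies the same kind of filtering to a list of already-extracted entity dictionaries. It is an illustration in plain Python, not the components' code. `filter_entities` is hypothetical, and treating the threshold as inclusive is an assumption.

```python
from typing import Optional


def filter_entities(
    found: list[dict],
    entities: Optional[list[str]] = None,
    score_threshold: float = 0.0,
) -> list[dict]:
    """Keep entities of the requested types whose score meets the threshold.

    entities=None means "all types". The comparison is inclusive (>=),
    which is an assumption made for this sketch.
    """
    allowed = set(entities) if entities is not None else None
    return [
        e
        for e in found
        if e["score"] >= score_threshold
        and (allowed is None or e["entity_type"] in allowed)
    ]


detected = [
    {"entity_type": "PERSON", "start": 8, "end": 13, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 17, "end": 34, "score": 1.0},
]

high_confidence = filter_entities(detected, score_threshold=0.9)
people_only = filter_entities(detected, entities=["PERSON"])

print([e["entity_type"] for e in high_confidence])
# ['EMAIL_ADDRESS']
print([e["entity_type"] for e in people_only])
# ['PERSON']
```

A higher threshold trades recall for precision: fewer false positives are replaced or recorded, at the cost of possibly missing low-confidence PII.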

License

presidio-haystack is distributed under the terms of the Apache-2.0 license.