The regex + spaCy pipeline on the
PII redaction page
is fast and transparent, but extending it to new entity types (bar numbers, docket
numbers, FDA submission IDs, billing codes) means writing glue code for every case.
Microsoft Presidio is an open-source framework that normalizes this
work: pluggable recognizers, confidence scoring, language-aware NLP engines, and
reversible anonymizers all share a common RecognizerResult shape.
In production it sits as one layer of a composite classifier — not a replacement for the hand-written regex/NER stack, but a way to plug in domain-specific detectors without growing the core pipeline.
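A minimal sketch of that layering (the `Span` dataclass and the `regex_layer`/`presidio_layer`/`classify` names are illustrative, not Presidio API or the house redactor's actual code): every detector, Presidio included, emits spans in one shared shape, so the core pipeline never needs to know which layer produced a hit.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Mirrors the fields of Presidio's RecognizerResult.
    entity_type: str
    start: int
    end: int
    score: float


def regex_layer(text: str) -> list[Span]:
    # One hand-written detector from the existing stack (illustrative pattern).
    return [
        Span("US_SSN", m.start(), m.end(), 0.95)
        for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", text)
    ]


def presidio_layer(text: str) -> list[Span]:
    # In production this would call AnalyzerEngine.analyze() and convert each
    # RecognizerResult into a Span; stubbed here to keep the sketch self-contained.
    return []


def classify(text: str) -> list[Span]:
    # Concatenate spans from every layer; overlap resolution happens downstream.
    return sorted(regex_layer(text) + presidio_layer(text), key=lambda s: s.start)
```

Because each layer is just a function from text to spans, adding a new domain-specific detector is a one-line change to `classify` rather than a change to the core pipeline.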
Under the hood it wraps spaCy (the configuration below uses en_core_web_sm; a transformer model such as en_core_web_trf can be swapped in) as the backing NER.
The pipeline has two stages: the analyzer produces
RecognizerResult spans with entity type, start/end offsets, and
confidence. The anonymizer consumes those spans and applies a
chosen operator per entity type. They are independent services, which matches the
redactor/tokenizer separation on the PII page.
from presidio_analyzer import (
    AnalyzerEngine,
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
)
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# --- Custom recognizers for legal identifiers ---
bar_pattern = Pattern(
    name="CA bar number",
    regex=r"\b(?:CA\s*Bar\s*(?:No\.?|#)?\s*)(\d{5,7})\b",
    score=0.85,
)
bar_recognizer = PatternRecognizer(
    supported_entity="BAR_NUMBER",
    patterns=[bar_pattern],
    context=["bar", "attorney", "counsel"],
)

docket_pattern = Pattern(
    name="Federal docket",
    regex=r"\b\d{1,2}:\d{2}-[a-z]{2}-\d{5}\b",  # e.g. 3:24-cv-01234
    score=0.9,
)
docket_recognizer = PatternRecognizer(
    supported_entity="DOCKET_NUMBER",
    patterns=[docket_pattern],
    context=["docket", "case", "civil action"],
)


def build_analyzer() -> AnalyzerEngine:
    nlp_engine = NlpEngineProvider(nlp_configuration={
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }).create_engine()
    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()
    registry.add_recognizer(bar_recognizer)
    registry.add_recognizer(docket_recognizer)
    return AnalyzerEngine(nlp_engine=nlp_engine, registry=registry,
                          supported_languages=["en"])


if __name__ == "__main__":
    analyzer = build_analyzer()
    anonymizer = AnonymizerEngine()
    text = (
        "Per CA Bar No. 234567, counsel appeared in 3:24-cv-01234 on behalf "
        "of Jane Doe (jane@example.com, SSN 123-45-6789)."
    )
    results = analyzer.analyze(text=text, language="en")
    for r in results:
        print(r)  # RecognizerResult(entity_type=..., start=..., end=..., score=...)
    anon = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "DEFAULT": OperatorConfig("replace", {"new_value": "[REDACTED]"}),
            "BAR_NUMBER": OperatorConfig("mask",
                                         {"chars_to_mask": 10,
                                          "masking_char": "*",
                                          "from_end": True}),
            "DOCKET_NUMBER": OperatorConfig("hash", {"hash_type": "sha256"}),
            "US_SSN": OperatorConfig("replace", {"new_value": "[SSN]"}),
            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[EMAIL]"}),
        },
    )
    print(anon.text)
Presidio results are merged with the hand-written regex/spaCy spans from the PII redaction page. The merge step resolves overlaps by confidence and category priority (the same ordering used inside the house redactor):
Structured identifiers like US_SSN can be matched with high confidence, while PERSON spans from NER score lower. Set context=[...] on custom recognizers to boost precision when trigger words appear; this matters for short numeric patterns that would otherwise match noise.
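The overlap resolution can be sketched as follows (the priority table and `resolve_overlaps` are illustrative, not the house redactor's actual code): candidates are ranked by category priority, then by confidence, and any span that overlaps an already-accepted winner is dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float


# Illustrative priority: structured identifiers outrank NER guesses.
PRIORITY = {
    "US_SSN": 3, "BAR_NUMBER": 3, "DOCKET_NUMBER": 3,
    "EMAIL_ADDRESS": 2,
    "PERSON": 1,
}


def resolve_overlaps(spans: list[Span]) -> list[Span]:
    # Rank best-first: higher category priority wins, then higher confidence.
    ranked = sorted(
        spans,
        key=lambda s: (PRIORITY.get(s.entity_type, 0), s.score),
        reverse=True,
    )
    kept: list[Span] = []
    for cand in ranked:
        # Accept only spans disjoint from everything already accepted.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)
```

With this ordering, a PERSON span from NER that overlaps a US_SSN hit is discarded even if its raw confidence is higher, matching the category-priority rule described above.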