The regex + spaCy pipeline on the
PII redaction page
is fast and transparent, but extending it to new entity types (bar numbers, docket
numbers, FDA submission IDs, billing codes) means writing glue code for every case.
Microsoft Presidio is an open-source framework that normalizes this
work: pluggable recognizers, confidence scoring, language-aware NLP engines, and
reversible anonymizers all share a common RecognizerResult shape.
In production it sits as one layer of a composite classifier — not a replacement for the hand-written regex/NER stack, but a way to plug in domain-specific detectors without growing the core pipeline.
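A minimal sketch of that layering (the `Span` dataclass and the `regex_layer`/`presidio_layer`/`classify` names are illustrative, not Presidio API or the house redactor's actual code): every detector, Presidio included, emits spans in one shared shape, so the core pipeline never needs to know which layer produced a hit.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    # Mirrors the fields of Presidio's RecognizerResult.
    entity_type: str
    start: int
    end: int
    score: float


def regex_layer(text: str) -> list[Span]:
    # One hand-written detector from the existing stack (illustrative pattern).
    return [
        Span("US_SSN", m.start(), m.end(), 0.95)
        for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", text)
    ]


def presidio_layer(text: str) -> list[Span]:
    # In production this would call AnalyzerEngine.analyze() and convert each
    # RecognizerResult into a Span; stubbed here to keep the sketch self-contained.
    return []


def classify(text: str) -> list[Span]:
    # Concatenate spans from every layer; overlap resolution happens downstream.
    return sorted(regex_layer(text) + presidio_layer(text), key=lambda s: s.start)
```

Because each layer is just a function from text to spans, adding a new domain-specific detector is a one-line change to `classify` rather than a change to the core pipeline.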
Under the hood it wraps spaCy (the configuration below uses en_core_web_sm; a transformer model such as en_core_web_trf can be swapped in) as the backing NER.
The pipeline has two stages: the analyzer produces
RecognizerResult spans with entity type, start/end offsets, and
confidence. The anonymizer consumes those spans and applies a
chosen operator per entity type. They are independent services, which matches the
redactor/tokenizer separation on the PII page.
from presidio_analyzer import (
    AnalyzerEngine,
    Pattern,
    PatternRecognizer,
    RecognizerRegistry,
)
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# --- Custom recognizers for legal identifiers ---
bar_pattern = Pattern(
    name="CA bar number",
    regex=r"\b(?:CA\s*Bar\s*(?:No\.?|#)?\s*)(\d{5,7})\b",
    score=0.85,
)
bar_recognizer = PatternRecognizer(
    supported_entity="BAR_NUMBER",
    patterns=[bar_pattern],
    context=["bar", "attorney", "counsel"],
)

docket_pattern = Pattern(
    name="Federal docket",
    regex=r"\b\d{1,2}:\d{2}-[a-z]{2}-\d{5}\b",  # e.g. 3:24-cv-01234
    score=0.9,
)
docket_recognizer = PatternRecognizer(
    supported_entity="DOCKET_NUMBER",
    patterns=[docket_pattern],
    context=["docket", "case", "civil action"],
)


def build_analyzer() -> AnalyzerEngine:
    nlp_engine = NlpEngineProvider(nlp_configuration={
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
    }).create_engine()
    registry = RecognizerRegistry()
    registry.load_predefined_recognizers()
    registry.add_recognizer(bar_recognizer)
    registry.add_recognizer(docket_recognizer)
    return AnalyzerEngine(nlp_engine=nlp_engine, registry=registry,
                          supported_languages=["en"])


if __name__ == "__main__":
    analyzer = build_analyzer()
    anonymizer = AnonymizerEngine()
    text = (
        "Per CA Bar No. 234567, counsel appeared in 3:24-cv-01234 on behalf "
        "of Jane Doe (jane@example.com, SSN 123-45-6789)."
    )
    results = analyzer.analyze(text=text, language="en")
    for r in results:
        print(r)  # RecognizerResult(entity_type=..., start=..., end=..., score=...)
    anon = anonymizer.anonymize(
        text=text,
        analyzer_results=results,
        operators={
            "DEFAULT": OperatorConfig("replace", {"new_value": "[REDACTED]"}),
            "BAR_NUMBER": OperatorConfig("mask",
                                         {"chars_to_mask": 10,
                                          "masking_char": "*",
                                          "from_end": True}),
            "DOCKET_NUMBER": OperatorConfig("hash", {"hash_type": "sha256"}),
            "US_SSN": OperatorConfig("replace", {"new_value": "[SSN]"}),
            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "[EMAIL]"}),
        },
    )
    print(anon.text)
Presidio results are merged with the hand-written regex/spaCy spans from the PII redaction page. The merge step resolves overlaps by confidence and category priority (the same ordering used inside the house redactor):
Structured identifiers like US_SSN can be matched with high confidence, while PERSON spans from NER score lower. Set context=[...] on custom recognizers to boost precision when trigger words appear; this matters for short numeric patterns that would otherwise match noise.
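The overlap resolution can be sketched as follows (the priority table and `resolve_overlaps` are illustrative, not the house redactor's actual code): candidates are ranked by category priority, then by confidence, and any span that overlaps an already-accepted winner is dropped.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int
    score: float


# Illustrative priority: structured identifiers outrank NER guesses.
PRIORITY = {
    "US_SSN": 3, "BAR_NUMBER": 3, "DOCKET_NUMBER": 3,
    "EMAIL_ADDRESS": 2,
    "PERSON": 1,
}


def resolve_overlaps(spans: list[Span]) -> list[Span]:
    # Rank best-first: higher category priority wins, then higher confidence.
    ranked = sorted(
        spans,
        key=lambda s: (PRIORITY.get(s.entity_type, 0), s.score),
        reverse=True,
    )
    kept: list[Span] = []
    for cand in ranked:
        # Accept only spans disjoint from everything already accepted.
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)
```

With this ordering, a PERSON span from NER that overlaps a US_SSN hit is discarded even if its raw confidence is higher, matching the category-priority rule described above.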