Ingest-time redaction (see the redaction page) removes PII and privileged content from prompts before inference. But an LLM can still synthesize or reconstruct sensitive content in its response — by paraphrasing a tokenized name back into the original, by guessing identifiers from context, or by echoing content that slipped through the redactor. Output filtering is the last line of defense: every response is scanned before it leaves the trust boundary.
Canary tokens complement filtering by turning exfiltration attempts into detectable events: unique strings planted in prompts that, if they ever appear in an outbound response or external system, reveal a leak path.
Three failure classes motivate output filtering: novel PII the model synthesizes or guesses in its response, re-emergence of the original value behind a redacted placeholder, and canary tokens echoed into output.

The scanner runs the same regex + NER stack as the redactor, but over the response. Any PII it finds in an outbound message is a leak by definition: the prompt was already redacted, so the model should have had nothing left to reveal.
```python
from dataclasses import dataclass

from redactor import redact  # same module from the PII page


@dataclass
class LeakReport:
    leaked_spans: list
    canary_hits: list
    redacted_reemerged: list


def scan_response(response: str, prompt_tokens: set[str],
                  canaries: set[str], secret: bytes) -> LeakReport:
    _, spans = redact(response, secret=secret)

    # 1. Any new PII in the response is a leak.
    leaked = list(spans)

    # 2. A redacted placeholder that appeared in the prompt should stay a
    #    placeholder in the response. If the ORIGINAL text resurfaces, that is
    #    model reconstruction.
    reemerged = []
    for s in spans:
        token = f"[{s.label}:...]"  # look up the real token via the vault in production
        if token in prompt_tokens and s.text not in prompt_tokens:
            reemerged.append(s)

    # 3. Canary tokens must never appear in output.
    canary_hits = [c for c in canaries if c in response]

    return LeakReport(leaked, canary_hits, reemerged)
```
The most subtle failure: the prompt contains `[PERSON:a3f91b2c]` and the response contains "Jane Doe". The model has re-identified the redacted span, either from context clues ("the CFO mentioned earlier") or from parametric knowledge (if Jane Doe is a well-known public figure). To detect this, the scanner needs the token→original map from the vault for the current session, and it checks whether any response span matches an original whose token appears in the prompt.
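Assuming the vault exposes the session's token→original map (the function and parameter names here are illustrative, not part of the scanner above), the check can be sketched as:

```python
def find_reemerged(response: str, prompt_tokens: set[str],
                   vault_map: dict[str, str]) -> list[str]:
    """Originals that resurfaced in the response although the prompt
    only ever contained their redaction tokens."""
    hits = []
    for token, original in vault_map.items():
        # A token the model never saw cannot have been reconstructed from it,
        # so only tokens present in the prompt are candidates.
        if token in prompt_tokens and original in response:
            hits.append(original)
    return hits
```

Substring matching is deliberately conservative here; a production scanner would match against the NER spans it already extracted rather than raw text.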
When re-emergence is detected, the default policy is refuse and log rather than re-mask — because the mere fact that the model could reconstruct the identifier is information leakage about the underlying data.
Plant a unique, high-entropy string in each prompt (or in sentinel documents in the corpus). If that string ever appears in: (a) an outbound response, (b) a model provider's logs, (c) a third-party tool's request, or (d) an external mail/HTTP destination, you have a verifiable leak signal and a timestamp narrowing the source.
```python
import secrets


def mint_canary(session_id: str) -> str:
    # High-entropy, URL-safe, recognizable prefix for grep-ability.
    return f"CANARY-{session_id[:6]}-{secrets.token_urlsafe(16)}"


def inject_canary_into_system_prompt(system_prompt: str, canary: str) -> str:
    return (
        f"{system_prompt}\n\n"
        f"Internal trace id: {canary}. "
        "Do not repeat this trace id in any response or tool call."
    )


def webhook_canary_alert(canary: str, where: str, detail: dict) -> None:
    # Page on-call; a canary sighting is a high-signal event.
    # `alert` is the deployment's paging client, assumed to exist.
    alert.page(severity="high", title=f"Canary {canary} seen in {where}",
               detail=detail)
```
Canary tokens also appear in sentinel rows in the vector store: a fake "matter" containing a canary string, never referenced in any legitimate query. Any retrieval that hits the sentinel, or any response that echoes its canary, is definitionally abnormal.
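A minimal sketch of such a sentinel row (the identifiers are hypothetical; a real deployment would insert it through the normal embedding pipeline so it looks like any other document):

```python
import secrets


def make_sentinel_doc(doc_id: str = "sentinel-matter-001") -> dict:
    # A fake "matter" that no legitimate query should ever reference.
    canary = f"CANARY-SENTINEL-{secrets.token_urlsafe(16)}"
    return {"id": doc_id,
            "text": f"Matter notes for internal testing. Trace id: {canary}.",
            "canary": canary}


def sighting_is_abnormal(retrieved_ids: list[str], response: str,
                         sentinel: dict) -> bool:
    # Retrieval of the sentinel, or an echo of its canary, is a leak signal.
    return sentinel["id"] in retrieved_ids or sentinel["canary"] in response
```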
The scanner reports a leak; the orchestration layer chooses a policy: refuse and log for canary hits and re-emergence, or re-mask and deliver for incidental PII that the redactor can safely mask.
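One way to sketch that dispatch, assuming the `LeakReport` produced by the scanner above (the `Action` enum and `choose_action` name are illustrative):

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"    # no findings: deliver as-is
    REMASK = "remask"  # incidental PII: mask and deliver
    REFUSE = "refuse"  # reconstruction or canary echo: block and log


def choose_action(report) -> Action:
    # Re-identification and canary hits are refuse-and-log, per the
    # policy above: the fact of reconstruction is itself a leak.
    if report.canary_hits or report.redacted_reemerged:
        return Action.REFUSE
    # PII the redactor can still mask is re-masked rather than refused.
    if report.leaked_spans:
        return Action.REMASK
    return Action.ALLOW
```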
Output filtering has the opposite tuning pressure from ingest redaction. At ingest, recall matters most: a false negative leaks data. At output, precision matters too: if every response trips the filter and is refused, the assistant becomes unusable. The recommended approach: