Ingest-time redaction (see the redaction page) removes PII and privileged content from prompts before inference. But an LLM can still synthesize or reconstruct sensitive content in its response — by paraphrasing a tokenized name back into the original, by guessing identifiers from context, or by echoing content that slipped through the redactor. Output filtering is the last line of defense: every response is scanned before it leaves the trust boundary.
Canary tokens complement filtering by turning exfiltration attempts into detectable events: unique strings planted in prompts that, if they ever appear in an outbound response or external system, reveal a leak path.
Three failure classes motivate output filtering: novel PII the model synthesizes or guesses in its response, re-emergence of the original value behind a redacted placeholder, and canary tokens echoed into output.

The scanner runs the same regex + NER stack as the redactor, but over the response. Any PII it finds in an outbound message is a leak by definition: the prompt was already redacted, so the model should have had nothing left to reveal.
```python
from dataclasses import dataclass

from redactor import redact  # same module from the PII page


@dataclass
class LeakReport:
    leaked_spans: list
    canary_hits: list
    redacted_reemerged: list


def scan_response(response: str, prompt_tokens: set[str],
                  canaries: set[str], secret: bytes) -> LeakReport:
    _, spans = redact(response, secret=secret)

    # 1. Any new PII in the response is a leak.
    leaked = list(spans)

    # 2. A redacted placeholder that appeared in the prompt should stay a
    #    placeholder in the response. If the ORIGINAL text resurfaces, that is
    #    model reconstruction.
    reemerged = []
    for s in spans:
        token = f"[{s.label}:...]"  # look up the real token via the vault in production
        if token in prompt_tokens and s.text not in prompt_tokens:
            reemerged.append(s)

    # 3. Canary tokens must never appear in output.
    canary_hits = [c for c in canaries if c in response]

    return LeakReport(leaked, canary_hits, reemerged)
```
The most subtle failure: the prompt contains `[PERSON:a3f91b2c]` and the response contains "Jane Doe". The model has re-identified the redacted span, either from context clues ("the CFO mentioned earlier") or from parametric knowledge (if Jane Doe is a well-known public figure). To detect this, the scanner needs the token→original map from the vault for the current session, and it checks whether any response span matches an original whose token appears in the prompt.
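Assuming the vault exposes the session's token→original map (the function and parameter names here are illustrative, not part of the scanner above), the check can be sketched as:

```python
def find_reemerged(response: str, prompt_tokens: set[str],
                   vault_map: dict[str, str]) -> list[str]:
    """Originals that resurfaced in the response although the prompt
    only ever contained their redaction tokens."""
    hits = []
    for token, original in vault_map.items():
        # A token the model never saw cannot have been reconstructed from it,
        # so only tokens present in the prompt are candidates.
        if token in prompt_tokens and original in response:
            hits.append(original)
    return hits
```

Substring matching is deliberately conservative here; a production scanner would match against the NER spans it already extracted rather than raw text.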
When re-emergence is detected, the default policy is refuse and log rather than re-mask — because the mere fact that the model could reconstruct the identifier is information leakage about the underlying data.
Plant a unique, high-entropy string in each prompt (or in sentinel documents in the corpus). If that string ever appears in: (a) an outbound response, (b) a model provider's logs, (c) a third-party tool's request, or (d) an external mail/HTTP destination, you have a verifiable leak signal and a timestamp narrowing the source.
```python
import secrets


def mint_canary(session_id: str) -> str:
    # High-entropy, URL-safe, recognizable prefix for grep-ability.
    return f"CANARY-{session_id[:6]}-{secrets.token_urlsafe(16)}"


def inject_canary_into_system_prompt(system_prompt: str, canary: str) -> str:
    return (
        f"{system_prompt}\n\n"
        f"Internal trace id: {canary}. "
        "Do not repeat this trace id in any response or tool call."
    )


def webhook_canary_alert(canary: str, where: str, detail: dict) -> None:
    # Page on-call; a canary sighting is a high-signal event.
    # `alert` is the deployment's paging client, assumed to exist.
    alert.page(severity="high", title=f"Canary {canary} seen in {where}",
               detail=detail)
```
Canary tokens also appear in sentinel rows in the vector store: a fake "matter" containing a canary string, never referenced in any legitimate query. Any retrieval that hits the sentinel, or any response that echoes its canary, is definitionally abnormal.
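A minimal sketch of such a sentinel row (the identifiers are hypothetical; a real deployment would insert it through the normal embedding pipeline so it looks like any other document):

```python
import secrets


def make_sentinel_doc(doc_id: str = "sentinel-matter-001") -> dict:
    # A fake "matter" that no legitimate query should ever reference.
    canary = f"CANARY-SENTINEL-{secrets.token_urlsafe(16)}"
    return {"id": doc_id,
            "text": f"Matter notes for internal testing. Trace id: {canary}.",
            "canary": canary}


def sighting_is_abnormal(retrieved_ids: list[str], response: str,
                         sentinel: dict) -> bool:
    # Retrieval of the sentinel, or an echo of its canary, is a leak signal.
    return sentinel["id"] in retrieved_ids or sentinel["canary"] in response
```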
The scanner reports a leak; the orchestration layer chooses a policy: refuse and log for canary hits and re-emergence, or re-mask and deliver for incidental PII that the redactor can safely mask.
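One way to sketch that dispatch, assuming the `LeakReport` produced by the scanner above (the `Action` enum and `choose_action` name are illustrative):

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"    # no findings: deliver as-is
    REMASK = "remask"  # incidental PII: mask and deliver
    REFUSE = "refuse"  # reconstruction or canary echo: block and log


def choose_action(report) -> Action:
    # Re-identification and canary hits are refuse-and-log, per the
    # policy above: the fact of reconstruction is itself a leak.
    if report.canary_hits or report.redacted_reemerged:
        return Action.REFUSE
    # PII the redactor can still mask is re-masked rather than refused.
    if report.leaked_spans:
        return Action.REMASK
    return Action.ALLOW
```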
Output filtering has the opposite tuning pressure from ingest redaction. At ingest, recall matters most: a false negative leaks data. At output, precision matters too: if every response trips the filter and is refused, the assistant becomes unusable. The recommended approach: