GDPR Article 17 — the right to erasure, colloquially "right to be forgotten" —
requires that, when a data subject withdraws consent or objects to processing, the
controller deletes their personal data "without undue delay." For a relational
database the implementation is familiar: DELETE FROM users WHERE user_id = ?. For
a RAG pipeline with embeddings, summaries, fine-tunes, and retrieval caches, the
data in scope has proliferated into many derivatives, and a vector embedding
is still personal data when it was computed from personal data. An erasure
request must therefore reach into every derivative of the subject's content:
Hard-deleting embeddings from a live index is expensive and often impossible in approximate-nearest-neighbor structures (HNSW, IVF) without a full rebuild. Tombstoning is the standard interim: mark the vector as deleted, filter it from search results, and compact on a schedule.
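The tombstone-then-compact pattern can be sketched with a toy in-memory index. Every name here (`TombstoningIndex`, `tombstone`, `compact`) is illustrative, not a real vector-store API; a production store would apply the same filter inside its ANN traversal rather than after it.

```python
from dataclasses import dataclass, field

@dataclass
class TombstoningIndex:
    """Toy sketch of tombstone-then-compact; names are illustrative only."""
    vectors: dict = field(default_factory=dict)   # doc_id -> vector
    tombstoned: set = field(default_factory=set)  # doc_ids marked deleted

    def tombstone(self, doc_id: str) -> None:
        # Mark as deleted; the vector physically stays in the structure for now.
        self.tombstoned.add(doc_id)

    def search(self, query, k: int = 5):
        # Filter tombstoned entries out of results before returning them.
        live = {d: v for d, v in self.vectors.items() if d not in self.tombstoned}
        by_dist = sorted(
            live,
            key=lambda d: sum((a - b) ** 2 for a, b in zip(live[d], query)),
        )
        return by_dist[:k]

    def compact(self) -> int:
        # Scheduled rebuild: physically drop tombstoned vectors.
        removed = len(self.tombstoned & self.vectors.keys())
        for d in self.tombstoned:
            self.vectors.pop(d, None)
        self.tombstoned.clear()
        return removed
```

The point of the split is that `tombstone` is cheap and immediate (the user-visible guarantee), while `compact` is the expensive rebuild that can run on a schedule.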
Keep indexes scoped to the narrowest safe unit — per tenant, and inside a tenant per matter. When a whole matter is erased, the cheapest implementation is to drop its index outright. This is dramatically cheaper than erasing individual vectors from a shared index and removes ambiguity about cross-matter residue.
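A minimal sketch of the naming discipline, assuming a registry that maps index names to index handles (the names and the `--` separator are assumptions, not a convention of any particular store):

```python
def index_name(tenant_id: str, matter_id: str) -> str:
    # One physical index per (tenant, matter) pair.
    return f"{tenant_id}--{matter_id}"

def erase_matter(indexes: dict, tenant_id: str, matter_id: str) -> bool:
    # Erasing a whole matter is a single drop, not a per-vector scan,
    # so nothing from the matter can linger in a shared structure.
    return indexes.pop(index_name(tenant_id, matter_id), None) is not None
```

Because the unit of deletion matches the unit of scoping, the erasure proof is trivial: the index either exists or it does not.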
Embedding models change. When you re-embed the corpus (new model version, new chunking strategy), the erasure work must survive the rebuild: new vectors may be produced only from documents that still exist in the raw store. Wire erasure into the re-embed pipeline so a tombstoned raw document cannot accidentally produce fresh embeddings on the next rebuild.
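The re-embed guard amounts to one filter at the top of the rebuild loop. A minimal sketch, where `embed` and `is_tombstoned` are hypothetical callables standing in for the embedding model and the raw store's tombstone check:

```python
def reembed_corpus(raw_docs: dict, embed, is_tombstoned) -> dict:
    """Rebuild embeddings for a new model version, skipping tombstoned docs.

    raw_docs: doc_id -> raw text; embed: text -> vector;
    is_tombstoned: doc_id -> bool (consults the raw store's tombstones).
    """
    # The filter runs BEFORE embedding, so erased text never reaches
    # the model and cannot resurface as a fresh vector.
    return {doc_id: embed(text)
            for doc_id, text in raw_docs.items()
            if not is_tombstoned(doc_id)}
```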
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ErasureRequest:
    tenant_id: str
    subject_id: str       # the data subject
    reason: str           # "gdpr-art17" | "ccpa-deletion" | "contract-termination"
    received_at: datetime

def handle_erasure(req: ErasureRequest, stores, audit) -> None:
    audit.log("erasure.start", tenant=req.tenant_id, subject=req.subject_id,
              reason=req.reason)

    # 1. Identify every document tied to the subject.
    doc_ids = stores.metadata.docs_for_subject(req.tenant_id, req.subject_id)

    # 2. Tombstone the raw documents so re-embed jobs skip them.
    for d in doc_ids:
        stores.raw.mark_tombstoned(d, ts=req.received_at)

    # 3. Tombstone vectors and purge the retrieval cache immediately (user-visible).
    stores.vectors.tombstone_by_doc(doc_ids)
    stores.retrieval_cache.purge_by_doc(doc_ids)

    # 4. Purge prompt/response logs EXCEPT the audit trail (legal-hold exemption).
    stores.prompt_logs.purge_by_subject(req.tenant_id, req.subject_id,
                                        keep_audit=True)

    # 5. Schedule index compaction within the internal SLA.
    stores.jobs.enqueue("compact_index", tenant=req.tenant_id,
                        run_after=req.received_at, priority="gdpr")

    # 6. Record the tombstoning; a follow-up entry lands after compaction verifies.
    audit.log("erasure.tombstoned", tenant=req.tenant_id,
              subject=req.subject_id, doc_count=len(doc_ids))

def verify_erasure(tenant_id: str, subject_id: str, stores) -> bool:
    """Post-compaction check: no live records remain."""
    return (
        stores.vectors.count_live_by_subject(tenant_id, subject_id) == 0 and
        stores.retrieval_cache.count_by_subject(tenant_id, subject_id) == 0
    )