Releasing a dataset — a discovery set, a training corpus, a research extract — is different from running a query against one. Once the rows leave your trust boundary, you can no longer add noise or enforce budgets; every record must stand on its own. k-Anonymity and its refinements (l-diversity, t-closeness) are the classical techniques for preparing a dataset for release by ensuring every record is indistinguishable from at least k−1 others on its quasi-identifiers.
A quasi-identifier is an attribute that is not a direct identifier but, combined with others, identifies an individual. Sweeney's classic result, from her 2000 analysis of 1990 census data, estimated that 87% of US residents are uniquely identifiable by the triple (5-digit ZIP, full DOB, gender). In a legal discovery set, quasi-identifiers include employment dates, job titles, project codes, and office locations: attributes that survive simple PII redaction but re-identify individuals in combination.
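To make the re-identification risk concrete, the following minimal sketch measures what fraction of rows are unique on a chosen set of quasi-identifiers. The function name `uniqueness_rate`, the dictionary row layout, and the toy rows are all assumptions made for illustration; they do not come from any particular dataset or library.

```python
from collections import Counter


def uniqueness_rate(rows, quasi_identifiers):
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)


# Hypothetical toy rows, invented for this example.
rows = [
    {"zip5": "02139", "dob": "1985-03-12", "gender": "F"},
    {"zip5": "02139", "dob": "1985-03-12", "gender": "F"},
    {"zip5": "02139", "dob": "1990-07-01", "gender": "M"},
    {"zip5": "02141", "dob": "1972-11-30", "gender": "F"},
]

# Two of the four rows are unique on the full triple.
rate_full = uniqueness_rate(rows, ["zip5", "dob", "gender"])
# Only the single "M" row is unique on gender alone.
rate_gender = uniqueness_rate(rows, ["gender"])
```

Running the same measurement on progressively larger attribute sets is a quick way to see which combinations act as quasi-identifiers in a given extract.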
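A table is k-anonymous with respect to a set of quasi-identifiers if every combination of quasi-identifier values appears in at least k rows. The rows that share one combination form an equivalence class. k-Anonymity is typically achieved by generalization (coarsening values, for example truncating a ZIP code or reducing a date of birth to a year or decade) and by suppression (dropping rows whose class still has fewer than k members).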
k-Anonymity prevents identity disclosure but not attribute disclosure. If all k records in an equivalence class share the same sensitive value (all HIV+), membership in the group reveals the attribute.
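The gap between identity disclosure and attribute disclosure can be illustrated with a small sketch. It assumes rows are plain dictionaries that have already been generalized; the helper names (`equivalence_classes`, `is_k_anonymous`, `homogeneous_classes`) and the sample values are hypothetical.

```python
from collections import defaultdict


def equivalence_classes(rows, quasi_identifiers, sensitive):
    """Map each quasi-identifier combination to its list of sensitive values."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row[sensitive])
    return dict(classes)


def is_k_anonymous(classes, k):
    """True if every equivalence class contains at least k rows."""
    return all(len(values) >= k for values in classes.values())


def homogeneous_classes(classes):
    """Classes whose members all share one sensitive value."""
    return [key for key, values in classes.items() if len(set(values)) == 1]


# Hypothetical, already-generalized rows.
rows = [
    {"zip": "021**", "birth_year": "1985", "gender": "F", "diagnosis": "HIV+"},
    {"zip": "021**", "birth_year": "1985", "gender": "F", "diagnosis": "HIV+"},
    {"zip": "021**", "birth_year": "1985", "gender": "F", "diagnosis": "HIV+"},
    {"zip": "021**", "birth_year": "1990", "gender": "M", "diagnosis": "flu"},
    {"zip": "021**", "birth_year": "1990", "gender": "M", "diagnosis": "asthma"},
    {"zip": "021**", "birth_year": "1990", "gender": "M", "diagnosis": "flu"},
]

classes = equivalence_classes(rows, ["zip", "birth_year", "gender"], "diagnosis")
k_ok = is_k_anonymous(classes, 3)     # the table is 3-anonymous
leaky = homogeneous_classes(classes)  # yet one class reveals its diagnosis
```

Here the table passes the k = 3 check, but anyone known to be in the (021**, 1985, F) class is known to be HIV+.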
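Two refinements target attribute disclosure. l-Diversity requires every equivalence class to contain at least l well-represented values of the sensitive attribute. t-Closeness requires the distribution of the sensitive attribute within each equivalence class to lie within distance t of the global distribution (stronger, harder to achieve).

from collections import Counter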
from dataclasses import dataclass


@dataclass
class Record:
    zip5: str
    dob: str  # "YYYY-MM-DD"
    gender: str
    diagnosis: str  # sensitive attribute


def generalize(rec: Record, zip_prefix: int, dob_granularity: str) -> tuple:
    zip_g = rec.zip5[:zip_prefix] + "*" * (5 - zip_prefix)
    if dob_granularity == "year":
        dob_g = rec.dob[:4]
    elif dob_granularity == "decade":
        dob_g = f"{rec.dob[:3]}0s"
    else:
        dob_g = rec.dob
    return (zip_g, dob_g, rec.gender)


def k_anonymize(records: list[Record], k: int) -> list[Record]:
    # Try progressively coarser generalizations until every class has >=k rows.
    strategies = [
        (5, "full"), (5, "year"), (3, "year"),
        (3, "decade"), (0, "decade"),
    ]
    for zp, gran in strategies:
        classes = Counter(generalize(r, zp, gran) for r in records)
        small = {cls for cls, n in classes.items() if n < k}
        if not small:
            return [
                Record(zip5=g[0], dob=g[1], gender=g[2], diagnosis=r.diagnosis)
                for r, g in ((r, generalize(r, zp, gran)) for r in records)
            ]
    # Suppress rows that cannot reach k even at the coarsest level.
    classes = Counter(generalize(r, 0, "decade") for r in records)
    return [
        Record(zip5="*****", dob=generalize(r, 0, "decade")[1],
               gender=r.gender, diagnosis=r.diagnosis)
        for r in records if classes[generalize(r, 0, "decade")] >= k
    ]


def check_l_diversity(anon: list[Record], l: int) -> bool:
    groups: dict = {}
    for r in anon:
        groups.setdefault((r.zip5, r.dob, r.gender), []).append(r.diagnosis)
    return all(len(set(diags)) >= l for diags in groups.values())
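
The listing above checks l-diversity but not t-closeness. A minimal sketch of a t-closeness check follows, under a stated assumption: the sensitive attribute is categorical and all values are treated as equally distant, in which case the distance between two distributions reduces to total variation distance. The function names and the (class key, sensitive value) input layout are hypothetical and independent of the `Record` class.

```python
from collections import Counter


def distribution(values):
    """Empirical distribution of a non-empty list of categorical values."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def check_t_closeness(rows, t):
    """rows: list of (equivalence_class_key, sensitive_value) pairs.

    Assumption: equal ground distance between sensitive values, so
    total variation stands in for the distance measure.
    """
    overall = distribution([sensitive for _, sensitive in rows])
    groups = {}
    for key, sensitive in rows:
        groups.setdefault(key, []).append(sensitive)
    return all(
        total_variation(distribution(values), overall) <= t
        for values in groups.values()
    )


# Hypothetical data: class "A" is mixed, class "B" is all "flu".
rows = [("A", "flu"), ("A", "cold"), ("B", "flu"), ("B", "flu")]
loose = check_t_closeness(rows, 0.3)   # both classes are 0.25 from the global mix
strict = check_t_closeness(rows, 0.2)  # 0.25 exceeds the tighter threshold
```

For numeric or hierarchical sensitive attributes a different ground distance would be needed, so this sketch should not be read as a general implementation.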