k-Anonymity & l-Diversity

Releasing a dataset — a discovery set, a training corpus, a research extract — is different from running a query against one. Once the rows leave your trust boundary, you can no longer add noise or enforce budgets; every record must stand on its own. k-Anonymity and its refinements (l-diversity, t-closeness) are the classical techniques for preparing a dataset for release by ensuring every record is indistinguishable from at least k−1 others on its quasi-identifiers.

1. Quasi-Identifiers: ZIP + DOB + Gender

A quasi-identifier is an attribute that is not a direct identifier but, combined with others, identifies an individual. Sweeney's classic result, published in 2000 from 1990 US census data: about 87% of the US population is uniquely identifiable by the triple (5-digit ZIP, full DOB, gender). In a legal discovery set, quasi-identifiers include employment dates, job titles, project codes, and office locations. These attributes survive simple PII redaction but re-identify individuals in combination.
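
To see how quickly combinations become identifying, the sketch below measures the fraction of rows that are unique on a chosen set of columns. The function name `unique_fraction`, the column names, and the rows are all made up for illustration; this is not a re-creation of Sweeney's analysis.

```python
from collections import Counter


def unique_fraction(rows: list[dict], quasi_ids: list[str]) -> float:
    """Fraction of rows whose quasi-identifier combination appears exactly once."""
    if not rows:
        return 0.0
    keys = [tuple(row[q] for q in quasi_ids) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)


# Hypothetical rows, for illustration only.
rows = [
    {"zip": "02139", "dob": "1985-03-14", "gender": "F"},
    {"zip": "02139", "dob": "1985-03-14", "gender": "M"},
    {"zip": "02139", "dob": "1990-07-01", "gender": "F"},
    {"zip": "02140", "dob": "1990-07-01", "gender": "F"},
]

print(unique_fraction(rows, ["gender"]))                # one column alone
print(unique_fraction(rows, ["zip", "dob", "gender"]))  # the full triple
```

On these four rows, gender alone singles out one row in four (0.25), while the full triple makes every row unique (1.0).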

2. k-Anonymity

A table is k-anonymous with respect to a set of quasi-identifiers if every combination of quasi-identifier values that occurs in the table occurs in at least k rows. Two operations achieve this:

Generalization — replace specific values with broader ones (5-digit ZIP → 3-digit ZIP; DOB → birth year; exact salary → salary bucket).
Suppression — drop cells or entire rows that would otherwise form equivalence classes smaller than k.
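
The definition translates directly into a check on equivalence-class sizes. The following is a minimal sketch, assuming rows are plain dicts; `equivalence_class_sizes`, `is_k_anonymous`, and the sample rows are hypothetical names and data chosen for illustration.

```python
from collections import Counter


def equivalence_class_sizes(rows: list[dict], quasi_ids: list[str]) -> Counter:
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_ids) for row in rows)


def is_k_anonymous(rows: list[dict], quasi_ids: list[str], k: int) -> bool:
    """True if every combination that occurs is shared by at least k rows."""
    return all(n >= k for n in equivalence_class_sizes(rows, quasi_ids).values())


# Hypothetical rows, for illustration only.
raw = [
    {"zip": "02139", "birth_year": "1985"},
    {"zip": "02141", "birth_year": "1985"},
    {"zip": "02139", "birth_year": "1990"},
    {"zip": "02142", "birth_year": "1990"},
]

# Generalization: keep the first three ZIP digits and mask the rest.
generalized = [{**row, "zip": row["zip"][:3] + "**"} for row in raw]

print(is_k_anonymous(raw, ["zip", "birth_year"], 2))
print(is_k_anonymous(generalized, ["zip", "birth_year"], 2))
```

The raw rows are all distinct and fail the check for k=2; after truncating the ZIP, each (ZIP, birth year) pair covers two rows and the check passes.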

3. l-Diversity and t-Closeness

k-Anonymity prevents identity disclosure but not attribute disclosure. If all k records in an equivalence class share the same sensitive value (for example, all are HIV+), knowing that someone belongs to the class reveals the attribute without identifying their row.

l-Diversity — every equivalence class contains at least l well-represented values of the sensitive attribute.
t-Closeness — the distribution of the sensitive attribute in each equivalence class is within distance t of the global distribution (stronger, harder to achieve).
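
A t-closeness check can be sketched as follows. The original t-closeness proposal measures distance with Earth Mover's Distance; this sketch assumes a categorical sensitive attribute whose values are all treated as equally far apart, in which case that distance reduces to half the sum of absolute probability differences (total variation distance). Function names and sample values are hypothetical, and every class is assumed to be non-empty.

```python
from collections import Counter


def distribution(values: list[str]) -> dict[str, float]:
    """Relative frequency of each value in a non-empty list."""
    total = len(values)
    return {v: n / total for v, n in Counter(values).items()}


def variational_distance(p: dict[str, float], q: dict[str, float]) -> float:
    """Half the L1 distance between two categorical distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)


def max_class_distance(classes: dict[str, list[str]]) -> float:
    """Largest distance between any class distribution and the global one."""
    if not classes:
        return 0.0
    all_values = [v for vals in classes.values() for v in vals]
    global_dist = distribution(all_values)
    return max(
        variational_distance(distribution(vals), global_dist)
        for vals in classes.values()
    )


def satisfies_t_closeness(classes: dict[str, list[str]], t: float) -> bool:
    return max_class_distance(classes) <= t


# Hypothetical equivalence classes mapped to their sensitive values.
classes = {
    "A": ["flu", "flu", "asthma", "asthma"],
    "B": ["flu", "flu", "flu", "flu"],
}
print(max_class_distance(classes))
print(satisfies_t_closeness(classes, 0.2))
```

Here the global distribution is 75% flu and 25% asthma, and both classes sit at distance 0.25 from it, so the table passes for t=0.3 but not for t=0.2.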

4. Example: Generalization & Suppression

from collections import Counter
from dataclasses import dataclass


@dataclass
class Record:
    zip5: str
    dob: str        # "YYYY-MM-DD"
    gender: str
    diagnosis: str  # sensitive attribute


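def generalize(rec: Record, zip_prefix: int, dob_granularity: str) -> tuple:
    """Return the generalized quasi-identifier tuple (zip, dob, gender) for rec."""
    # Keep the first zip_prefix digits and mask the rest, e.g. "02139" -> "021**".
    zip_g = rec.zip5[:zip_prefix] + "*" * (5 - zip_prefix)
    if dob_granularity == "year":
        dob_g = rec.dob[:4]             # "1985-03-14" -> "1985"
    elif dob_granularity == "decade":
        dob_g = f"{rec.dob[:3]}0s"      # "1985-03-14" -> "1980s"
    else:
        dob_g = rec.dob                 # any other value keeps the full date
    return (zip_g, dob_g, rec.gender)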


def k_anonymize(records: list[Record], k: int) -> list[Record]:
    """Generalize globally until every class has >= k rows; suppress as a last resort."""
    # (ZIP digits kept, DOB granularity), ordered from most to least specific.
    strategies = [
        (5, "full"), (5, "year"), (3, "year"),
        (3, "decade"), (0, "decade"),
    ]
    for zp, gran in strategies:
        keys = [generalize(r, zp, gran) for r in records]
        classes = Counter(keys)
        # Accept the first level at which no equivalence class is smaller than k.
        if all(n >= k for n in classes.values()):
            return [
                Record(zip5=g[0], dob=g[1], gender=g[2], diagnosis=r.diagnosis)
                for r, g in zip(records, keys)
            ]
    # Even the coarsest level left some classes below k: keep that level
    # (keys and classes still hold it) and suppress the rows in those classes.
    return [
        Record(zip5=g[0], dob=g[1], gender=g[2], diagnosis=r.diagnosis)
        for r, g in zip(records, keys)
        if classes[g] >= k
    ]


def check_l_diversity(anon: list[Record], l: int) -> bool:
    groups: dict = {}
    for r in anon:
        groups.setdefault((r.zip5, r.dob, r.gender), []).append(r.diagnosis)
    return all(len(set(diags)) >= l for diags in groups.values())
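
check_l_diversity counts distinct values, which is the weakest reading of "well-represented": a class with nine identical diagnoses and one different one still passes for l=2. A stricter variant, entropy l-diversity, requires the entropy of each class's sensitive values to be at least log(l). The sketch below is a standalone illustration; `check_entropy_l_diversity`, its input shape (class key mapped to a list of sensitive values), and the small floating-point tolerance are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(values: list[str]) -> float:
    """Shannon entropy (natural log) of the values in a non-empty list."""
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in Counter(values).values())


def check_entropy_l_diversity(groups: dict[str, list[str]], l: int) -> bool:
    """True if every class has sensitive-value entropy of at least log(l)."""
    threshold = math.log(l) - 1e-12  # tolerance for floating-point rounding
    return all(entropy(vals) >= threshold for vals in groups.values())


# Hypothetical classes: both contain two distinct values.
skewed = {"A": ["flu"] * 9 + ["asthma"]}
balanced = {"A": ["flu"] * 5 + ["asthma"] * 5}

print(check_entropy_l_diversity(skewed, 2))
print(check_entropy_l_diversity(balanced, 2))
```

The skewed class fails entropy 2-diversity because one value dominates, while the balanced class passes.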

5. Release Workflow

Classify attributes into direct identifiers (drop), quasi-identifiers (generalize), sensitive (protect with l-diversity), and non-sensitive (keep).
Choose k, l, and t based on the release context. k=5 is common for internal releases; k=10 or higher is typical for external or published datasets. l is capped by the number of distinct values the sensitive attribute can take; a binary attribute, for instance, supports at most l=2.
Apply generalization + suppression; record the schema of what was transformed and how.
Verify with an automated check before release; refuse to publish if any equivalence class fails k or l.
Log the release — recipient, purpose, k/l parameters, row count — for audit.
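
The verify and log steps above can be sketched as a gate that raises instead of publishing, followed by an audit record. This is one possible shape, not a prescribed implementation: `ReleaseRejected`, `verify_release`, `audit_entry`, the JSON layout, and the added timestamp field are all assumptions, and the l check uses the simple distinct-values reading.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


class ReleaseRejected(Exception):
    """Raised when a candidate release fails the k or l check."""


def verify_release(rows, quasi_ids, sensitive, k, l):
    """Raise ReleaseRejected if any equivalence class fails k or distinct-l."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_ids)].append(row[sensitive])
    for key, values in classes.items():
        if len(values) < k:
            raise ReleaseRejected(f"class {key}: {len(values)} rows, need k={k}")
        if len(set(values)) < l:
            raise ReleaseRejected(
                f"class {key}: {len(set(values))} distinct sensitive values, need l={l}"
            )


def audit_entry(rows, recipient, purpose, k, l):
    """Serialize the release metadata; the timestamp is an added assumption."""
    return json.dumps({
        "released_at": datetime.now(timezone.utc).isoformat(),
        "recipient": recipient,
        "purpose": purpose,
        "k": k,
        "l": l,
        "row_count": len(rows),
    })


# Hypothetical, already-generalized rows.
rows = [
    {"zip": "021**", "birth_year": "1985", "diagnosis": "flu"},
    {"zip": "021**", "birth_year": "1985", "diagnosis": "asthma"},
    {"zip": "021**", "birth_year": "1990", "diagnosis": "flu"},
    {"zip": "021**", "birth_year": "1990", "diagnosis": "asthma"},
]

verify_release(rows, ["zip", "birth_year"], "diagnosis", k=2, l=2)
print(audit_entry(rows, "research-team", "example study", k=2, l=2))
```

Raising an exception rather than returning a flag makes it harder for a caller to skip the check by accident; the same rows would be rejected for k=3.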

6. Limits: Linkage and High-Dim Data

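Linkage attacks — k-anonymity only protects against joins on the attributes you classified as quasi-identifiers. An attacker with external data (voter rolls, LinkedIn, prior leaks) can still link on an attribute you left ungeneralized, apply background knowledge to narrow an equivalence class, or combine several releases of the same data. For high-stakes releases, differential privacy is the stronger choice.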
High-dimensional data — document embeddings, communication graphs, and timestamped event logs have so many quasi-identifier dimensions that nearly every row is unique, and k-anonymity degenerates into suppressing almost everything. Do not use k-anonymity for these; use differential privacy or synthetic data.
Free-text fields — cannot be generalized mechanically; must be redacted with the PII pipeline before the k-anonymity step runs on structured columns.
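
The high-dimensional collapse can be illustrated with a deliberately artificial construction: 64 rows whose quasi-identifiers are independent binary attributes, so each added dimension halves the size of every equivalence class. The helper names and the data are hypothetical; real attributes are correlated, so the drop is usually less abrupt, but it moves in the same direction.

```python
from collections import Counter


def surviving_fraction(rows: list[tuple], k: int) -> float:
    """Fraction of rows left after suppressing every class smaller than k."""
    sizes = Counter(rows)
    return sum(n for n in sizes.values() if n >= k) / len(rows)


def toy_rows(n_rows: int, n_dims: int) -> list[tuple]:
    """Row i uses the low n_dims bits of i as its quasi-identifier tuple."""
    return [tuple((i >> d) & 1 for d in range(n_dims)) for i in range(n_rows)]


for dims in range(7):
    print(dims, surviving_fraction(toy_rows(64, dims), k=5))
```

With k=5, every row survives through three dimensions (classes of 8 rows), and none survive from four dimensions onward (classes of 4 rows or fewer).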