Tokenization vs. Pseudonymization vs. Anonymization
These three terms are used almost interchangeably in marketing copy, yet they mean
meaningfully different things under GDPR, HIPAA, and CCPA. Getting the distinction wrong is a
governance problem: calling a dataset "anonymized" when it is in fact
"pseudonymized" triggers a different regulatory regime, different breach obligations,
and a different audit surface.
1. The Taxonomy at a Glance
Technique                 | Reversible?       | Trust locus    | Still "personal data"?
Tokenization (vault)      | Yes, via vault    | Vault operator | Yes
Pseudonymization (keyed)  | Yes, via key      | Key holder     | Yes
Anonymization             | No (irreversible) | —              | No, if truly irreversible
2. Tokenization
Replace a sensitive value with an opaque random token; store the
token→plaintext mapping in a secured vault. The token carries no information
about the plaintext by itself; reversal requires vault access.
Strengths: best-in-class cryptographic leakage properties (no
equality leak if tokens are random per-insertion); clean audit surface (every
reversal is a vault query).
Weaknesses: requires running a vault; latency on every
re-identification; availability of the vault becomes a hard dependency.
3. Pseudonymization
Replace the sensitive value with a deterministic derivative — a keyed HMAC,
an FPE ciphertext, or an encrypted value — where reversal or linkage requires
the key.
Deterministic HMAC: HMAC(key, plaintext). Same
input → same output, so records can be joined. Reversal requires brute force
over the plaintext domain, which is feasible for small domains (SSNs, email
prefixes).
FPE: reversible with the key; preserves format. See the
FPE page.
Encrypted value: AES-GCM with a tenant key; reversible only
with KMS access.
import hmac, hashlib
def pseudonymize(plaintext: str, key: bytes, field: str) -> str:
# Include the field name so "123" as SSN != "123" as account number.
msg = f"{field}:{plaintext}".encode()
return hmac.new(key, msg, hashlib.sha256).hexdigest()[:16]
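The small-domain caveat above can be made concrete. This sketch reuses the same HMAC construction (with an illustrative key) to show that anyone holding the key can invert a pseudonym over a 4-digit PIN space by exhaustion:

```python
import hmac, hashlib

def pseudonymize(plaintext: str, key: bytes, field: str) -> str:
    # Same construction as above: field name bound into the message.
    msg = f"{field}:{plaintext}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()[:16]

key = b"demo-key"  # assumption: the attacker or insider holds the key
target = pseudonymize("4821", key, "pin")

# Exhaust the 10,000-value PIN domain -- sub-second work.
recovered = next(p for p in (f"{i:04d}" for i in range(10000))
                 if pseudonymize(p, key, "pin") == target)
assert recovered == "4821"
```

This is why deterministic pseudonyms over SSNs, phone numbers, or PINs must be treated as reversible by the key holder, regardless of the hash function's strength.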
4. Anonymization
Transform the data so that re-identification is infeasible for any party,
including the data controller, under any reasonable set of resources. This is
a much stronger claim than "we threw away the key." Under GDPR Recital 26, the test
is whether identification is possible "by means reasonably likely to be used" by
anyone — accounting for technical progress, available auxiliary data, and
motivated adversaries.
Candidate techniques:
k-Anonymity with l-diversity / t-closeness
(k-anonymity page) —
but this is brittle against linkage attacks.
Synthetic data generated from a DP-trained model.
If you hold the key, the dataset is not anonymized. Pseudonymization and key
destruction can approach anonymization only if the key is truly destroyed and no
quasi-identifier linkage remains.
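For aggregate releases, differential privacy gives a quantifiable version of the "infeasible for any party" claim. A minimal sketch of the Laplace mechanism for a count query (epsilon value and function names are illustrative):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    # A count query has sensitivity 1, so Laplace noise with scale
    # 1/epsilon yields epsilon-differential privacy for this one release.
    return true_count + laplace_noise(1.0 / epsilon)

noisy = dp_count(1000, epsilon=1.0)  # typically within a few units of 1000
```

The privacy guarantee holds for one release at the stated epsilon; repeated queries against the same data consume budget and must be accounted for.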
5. GDPR Treatment
Pseudonymized data is still personal data (GDPR Art. 4(5)). All
GDPR obligations apply — lawful basis, data-subject rights, breach
notification — but pseudonymization is recognized as a mitigating technical
measure (Art. 32).
Truly anonymized data falls outside GDPR (Recital 26), but the
bar is high. "Anonymized" datasets that can be re-identified by combining with
other public data are not anonymized for GDPR purposes.
CCPA / CPRA uses similar logic: "deidentified" data must be
supported by technical and organizational measures, contractual commitments not
to re-identify, and published policies. HIPAA Safe Harbor and Expert
Determination are the two recognized deidentification paths.
6. Choosing the Right One
Need reversal for authorized users? → Tokenization
(preferred) or pseudonymization with KMS-wrapped keys.
Need joins across systems but no re-identification? →
Pseudonymization with a deterministic HMAC and a carefully managed key.
Releasing aggregates to third parties? → Anonymization
via DP.
Releasing row-level data to third parties? → Be
skeptical. k-anonymity is often insufficient; consider synthetic data instead.
Unsure? Default to pseudonymization + strict access control.
It preserves optionality without overclaiming on anonymization.