Tokenization vs. Pseudonymization vs. Anonymization
These three terms are used almost interchangeably in marketing copy, yet they mean
meaningfully different things under GDPR, HIPAA, and CCPA. Getting the distinction wrong is a
governance problem: calling a dataset "anonymized" when it is in fact
"pseudonymized" triggers a different regulatory regime, different breach obligations,
and a different audit surface.
1. The Taxonomy at a Glance
Technique                 | Reversible?       | Trust locus    | Still "personal data"?
Tokenization (vault)      | Yes, via vault    | Vault operator | Yes
Pseudonymization (keyed)  | Yes, via key      | Key holder     | Yes
Anonymization             | No (irreversible) | —              | No, if truly irreversible
2. Tokenization
Replace a sensitive value with an opaque random token; store the
token→plaintext mapping in a secured vault. The token carries no information
about the plaintext by itself; reversal requires vault access.
Strengths: best-in-class cryptographic leakage properties (no
equality leak if tokens are random per-insertion); clean audit surface (every
reversal is a vault query).
Weaknesses: requires running a vault; latency on every
re-identification; availability of the vault becomes a hard dependency.
3. Pseudonymization
Replace the sensitive value with a deterministic derivative — a keyed HMAC,
an FPE ciphertext, or an encrypted value — where reversal or linkage requires
the key.
Deterministic HMAC: HMAC(key, plaintext). Same
input → same output, so records can be joined. Reversal requires brute force
over the plaintext domain, which is feasible for small domains (SSNs, email
prefixes).
FPE: reversible with the key; preserves format. See the
FPE page.
Encrypted value: AES-GCM with a tenant key; reversible only
with KMS access.
import hmac, hashlib
def pseudonymize(plaintext: str, key: bytes, field: str) -> str:
# Include the field name so "123" as SSN != "123" as account number.
msg = f"{field}:{plaintext}".encode()
return hmac.new(key, msg, hashlib.sha256).hexdigest()[:16]
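The small-domain caveat above can be made concrete. This sketch reuses the same HMAC construction (with an illustrative key) to show that anyone holding the key can invert a pseudonym over a 4-digit PIN space by exhaustion:

```python
import hmac, hashlib

def pseudonymize(plaintext: str, key: bytes, field: str) -> str:
    # Same construction as above: field name bound into the message.
    msg = f"{field}:{plaintext}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()[:16]

key = b"demo-key"  # assumption: the attacker or insider holds the key
target = pseudonymize("4821", key, "pin")

# Exhaust the 10,000-value PIN domain -- sub-second work.
recovered = next(p for p in (f"{i:04d}" for i in range(10000))
                 if pseudonymize(p, key, "pin") == target)
assert recovered == "4821"
```

This is why deterministic pseudonyms over SSNs, phone numbers, or PINs must be treated as reversible by the key holder, regardless of the hash function's strength.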
4. Anonymization
Transform the data so that re-identification is infeasible for any party,
including the data controller, under any reasonable set of resources. This is
a much stronger claim than "we threw away the key." Under GDPR Recital 26, the test
is whether identification is possible "by means reasonably likely to be used" by
anyone — accounting for technical progress, available auxiliary data, and
motivated adversaries.
Candidate techniques:
k-Anonymity with l-diversity / t-closeness
(k-anonymity page) —
but this is brittle against linkage attacks.
Synthetic data generated from a DP-trained model.
If you hold the key, the dataset is not anonymized. Pseudonymization and key
destruction can approach anonymization only if the key is truly destroyed and no
quasi-identifier linkage remains.
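For aggregate releases, differential privacy gives a quantifiable version of the "infeasible for any party" claim. A minimal sketch of the Laplace mechanism for a count query (epsilon value and function names are illustrative):

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling of the Laplace distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count: int, epsilon: float) -> float:
    # A count query has sensitivity 1, so Laplace noise with scale
    # 1/epsilon yields epsilon-differential privacy for this one release.
    return true_count + laplace_noise(1.0 / epsilon)

noisy = dp_count(1000, epsilon=1.0)  # typically within a few units of 1000
```

The privacy guarantee holds for one release at the stated epsilon; repeated queries against the same data consume budget and must be accounted for.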
5. GDPR Treatment
Pseudonymized data is still personal data (GDPR Art. 4(5)). All
GDPR obligations apply — lawful basis, data-subject rights, breach
notification — but pseudonymization is recognized as a mitigating technical
measure (Art. 32).
Truly anonymized data falls outside GDPR (Recital 26), but the
bar is high. "Anonymized" datasets that can be re-identified by combining with
other public data are not anonymized for GDPR purposes.
CCPA / CPRA uses similar logic: "deidentified" data must be
supported by technical and organizational measures, contractual commitments not
to re-identify, and published policies. HIPAA Safe Harbor and Expert
Determination are the two recognized deidentification paths.
6. Choosing the Right One
Need reversal for authorized users? → Tokenization
(preferred) or pseudonymization with KMS-wrapped keys.
Need joins across systems but no re-identification? →
Pseudonymization with a deterministic HMAC and a carefully managed key.
Releasing aggregates to third parties? → Anonymization
via DP.
Releasing row-level data to third parties? → Be
skeptical. k-anonymity is often insufficient; consider synthetic data instead.
Unsure? Default to pseudonymization + strict access control.
It preserves optionality without overclaiming on anonymization.