A firm running document intelligence across hundreds of matters will eventually want cross-matter analytics: "how many contracts reference arbitration clauses?", "average redline turnaround by practice group", "which opposing firms appear most often in our settlements?". Every one of these aggregates can leak information about individual documents or individuals, especially at the tails where a single distinctive record dominates the result.
Differential privacy (DP) adds calibrated noise so that the presence or absence of any single record cannot meaningfully change the published aggregate. The guarantee is mathematical and composable across queries under a privacy budget (ε, δ).
A randomized mechanism M is (ε, δ)-differentially private if, for any two datasets D and D' differing in a single record, and any output set S:

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D') ∈ S] + δ
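To make the guarantee concrete, here is a quick numerical check (an illustrative sketch, not part of the formal definition): for a count query with Laplace(1/ε) noise, the ratio of output densities between neighboring datasets never exceeds exp(ε), since a single record shifts the true count by at most 1.

```python
import numpy as np

def laplace_pdf(x, scale):
    # Density of the Laplace distribution centered at 0.
    return np.exp(-np.abs(x) / scale) / (2 * scale)

epsilon = 0.5
scale = 1.0 / epsilon  # b = Δf/ε with Δf = 1 for a count
xs = np.linspace(-20, 20, 10001)

# Neighboring datasets shift the true count by at most 1,
# so compare the noise density at x against x - 1.
ratio = laplace_pdf(xs, scale) / laplace_pdf(xs - 1, scale)
assert ratio.max() <= np.exp(epsilon) + 1e-9
```

The maximum ratio is attained exactly at exp(ε), which is the inequality in the definition with δ = 0.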
Smaller ε means stronger privacy but noisier answers. Typical production values are ε between 0.1 and 3 per release, with δ on the order of 10⁻⁶, chosen well below 1/n for a dataset of n records.
Two standard mechanisms cover most queries:

Laplace mechanism: adds noise drawn from Lap(Δf/ε) to the true answer, where Δf is the query's L1 sensitivity (how much one record can change the output). Gives pure ε-DP.

Gaussian mechanism: adds noise drawn from N(0, σ²), with σ derived from the L2 sensitivity, ε, and δ. Required for tight composition via RDP accounting.
import numpy as np
def laplace_noise(sensitivity: float, epsilon: float) -> float:
    # Scale b = Δf / ε
    return np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

def dp_count(records: list, predicate, epsilon: float) -> float:
    true_count = sum(1 for r in records if predicate(r))
    # A single record changes any count by at most 1.
    return true_count + laplace_noise(sensitivity=1.0, epsilon=epsilon)

def dp_mean(values: list[float], lo: float, hi: float, epsilon: float) -> float:
    # Clip to a known range so sensitivity is bounded.
    clipped = np.clip(values, lo, hi)
    n = len(clipped)
    # Split budget between numerator and denominator.
    eps_sum, eps_cnt = epsilon / 2, epsilon / 2
    noisy_sum = clipped.sum() + laplace_noise(sensitivity=(hi - lo), epsilon=eps_sum)
    noisy_cnt = n + laplace_noise(sensitivity=1.0, epsilon=eps_cnt)
    return noisy_sum / max(noisy_cnt, 1.0)
# Example: "fraction of contracts with an arbitration clause"
records = [{"has_arb": True}, {"has_arb": False}, {"has_arb": True}]
print(dp_count(records, lambda r: r["has_arb"], epsilon=0.5))
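The Gaussian mechanism described above can be sketched the same way. The σ calibration below is the classical formula from Dwork and Roth's textbook treatment, valid for ε < 1; the function names are illustrative, not from any particular library.

```python
import math
import numpy as np

def gaussian_noise(l2_sensitivity: float, epsilon: float, delta: float) -> float:
    # Classical calibration: σ = Δ2 · sqrt(2 ln(1.25/δ)) / ε, valid for ε < 1.
    sigma = l2_sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return np.random.normal(loc=0.0, scale=sigma)

def gaussian_dp_count(records: list, predicate, epsilon: float, delta: float) -> float:
    # One record changes a count by at most 1, in L2 as well as L1.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + gaussian_noise(1.0, epsilon, delta)

records = [{"has_arb": True}, {"has_arb": False}, {"has_arb": True}]
noisy = gaussian_dp_count(records, lambda r: r["has_arb"], epsilon=0.5, delta=1e-6)
print(noisy)
```

Note the trade-off: for a single query the Gaussian noise here is larger than Laplace at the same ε, but its tails compose more tightly across many queries under RDP accounting.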
Every query consumes budget. A dashboard that runs 20 DP queries at ε=0.1 each has a cumulative privacy cost of ε=2.0 under basic (sequential) composition — tighter under advanced composition or RDP. Track the budget per-dataset and refuse queries that would exceed the cap:
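A minimal per-dataset ledger under basic composition might look like the following sketch; the class and method names (PrivacyBudget, spend) are illustrative, not from any particular library.

```python
class PrivacyBudget:
    """Tracks cumulative ε spent on one dataset under basic (sequential) composition."""

    def __init__(self, cap: float):
        self.cap = cap
        self.spent = 0.0

    def spend(self, epsilon: float) -> bool:
        # Refuse the query rather than exceed the cap.
        if self.spent + epsilon > self.cap:
            return False
        self.spent += epsilon
        return True

budget = PrivacyBudget(cap=2.0)
for _ in range(25):
    if not budget.spend(0.1):
        print("budget exhausted; query refused")
        break
```

Advanced composition or RDP accounting would let the same cap cover more queries; the refusal logic stays the same, only the bookkeeping inside spend changes.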