A firm running document intelligence across hundreds of matters will eventually want cross-matter analytics: "how many contracts reference arbitration clauses?", "average redline turnaround by practice group", "which opposing firms appear most often in our settlements?". Every one of these aggregates can leak information about individual documents or individuals, especially at the tails where a single distinctive record dominates the result.
Differential privacy (DP) adds calibrated noise so that the presence or absence of any single record cannot meaningfully change the published aggregate. The guarantee is mathematical and composable across queries under a privacy budget (ε, δ).
A randomized mechanism M is (ε, δ)-differentially private if, for any two datasets D and D' differing in a single record, and any output set S:

Pr[M(D) ∈ S] ≤ exp(ε) · Pr[M(D') ∈ S] + δ
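To make the guarantee concrete, here is a quick numerical check (an illustrative sketch, not part of the formal definition): for a count query with Laplace(1/ε) noise, the ratio of output densities between neighboring datasets never exceeds exp(ε), since a single record shifts the true count by at most 1.

```python
import numpy as np

def laplace_pdf(x, scale):
    # Density of the Laplace distribution centered at 0.
    return np.exp(-np.abs(x) / scale) / (2 * scale)

epsilon = 0.5
scale = 1.0 / epsilon  # b = Δf/ε with Δf = 1 for a count
xs = np.linspace(-20, 20, 10001)

# Neighboring datasets shift the true count by at most 1,
# so compare the noise density at x against x - 1.
ratio = laplace_pdf(xs, scale) / laplace_pdf(xs - 1, scale)
assert ratio.max() <= np.exp(epsilon) + 1e-9
```

The maximum ratio is attained exactly at exp(ε), which is the inequality in the definition with δ = 0.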
Smaller ε means stronger privacy but noisier answers. Typical production values are ε between 0.1 and 3 per release, with δ on the order of 10⁻⁶, chosen well below 1/n for a dataset of n records.
Two standard mechanisms cover most queries:

Laplace mechanism: adds noise drawn from Lap(Δf/ε) to the true answer, where Δf is the query's L1 sensitivity (how much one record can change the output). Gives pure ε-DP.

Gaussian mechanism: adds noise drawn from N(0, σ²), with σ derived from the L2 sensitivity, ε, and δ. Required for tight composition via RDP accounting.
import numpy as np
def laplace_noise(sensitivity: float, epsilon: float) -> float:
    # Scale b = Δf / ε
    return np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

def dp_count(records: list, predicate, epsilon: float) -> float:
    true_count = sum(1 for r in records if predicate(r))
    # A single record changes any count by at most 1.
    return true_count + laplace_noise(sensitivity=1.0, epsilon=epsilon)

def dp_mean(values: list[float], lo: float, hi: float, epsilon: float) -> float:
    # Clip to a known range so sensitivity is bounded.
    clipped = np.clip(values, lo, hi)
    n = len(clipped)
    # Split budget between numerator and denominator.
    eps_sum, eps_cnt = epsilon / 2, epsilon / 2
    noisy_sum = clipped.sum() + laplace_noise(sensitivity=(hi - lo), epsilon=eps_sum)
    noisy_cnt = n + laplace_noise(sensitivity=1.0, epsilon=eps_cnt)
    return noisy_sum / max(noisy_cnt, 1.0)
# Example: "fraction of contracts with an arbitration clause"
records = [{"has_arb": True}, {"has_arb": False}, {"has_arb": True}]
print(dp_count(records, lambda r: r["has_arb"], epsilon=0.5))
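The Gaussian mechanism described above can be sketched the same way. The σ calibration below is the classical formula from Dwork and Roth's textbook treatment, valid for ε < 1; the function names are illustrative, not from any particular library.

```python
import math
import numpy as np

def gaussian_noise(l2_sensitivity: float, epsilon: float, delta: float) -> float:
    # Classical calibration: σ = Δ2 · sqrt(2 ln(1.25/δ)) / ε, valid for ε < 1.
    sigma = l2_sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return np.random.normal(loc=0.0, scale=sigma)

def gaussian_dp_count(records: list, predicate, epsilon: float, delta: float) -> float:
    # One record changes a count by at most 1, in L2 as well as L1.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + gaussian_noise(1.0, epsilon, delta)

records = [{"has_arb": True}, {"has_arb": False}, {"has_arb": True}]
noisy = gaussian_dp_count(records, lambda r: r["has_arb"], epsilon=0.5, delta=1e-6)
print(noisy)
```

Note the trade-off: for a single query the Gaussian noise here is larger than Laplace at the same ε, but its tails compose more tightly across many queries under RDP accounting.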
Every query consumes budget. A dashboard that runs 20 DP queries at ε=0.1 each has a cumulative privacy cost of ε=2.0 under basic (sequential) composition — tighter under advanced composition or RDP. Track the budget per-dataset and refuse queries that would exceed the cap:
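A minimal per-dataset ledger under basic composition might look like the following sketch; the class and method names (PrivacyBudget, spend) are illustrative, not from any particular library.

```python
class PrivacyBudget:
    """Tracks cumulative ε spent on one dataset under basic (sequential) composition."""

    def __init__(self, cap: float):
        self.cap = cap
        self.spent = 0.0

    def spend(self, epsilon: float) -> bool:
        # Refuse the query rather than exceed the cap.
        if self.spent + epsilon > self.cap:
            return False
        self.spent += epsilon
        return True

budget = PrivacyBudget(cap=2.0)
for _ in range(25):
    if not budget.spend(0.1):
        print("budget exhausted; query refused")
        break
```

Advanced composition or RDP accounting would let the same cap cover more queries; the refusal logic stays the same, only the bookkeeping inside spend changes.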