When privileged matters are pinned to an on-prem model, the trust boundary shrinks to the physical server running the weights. But on-prem is not the same as private: the operating system, hypervisor, and anyone with root on the host can, in principle, read the model's prompt and response memory. For attorney–client privileged content, that residual surface is not acceptable — a privileged document opened inside a model process is still privileged.
Confidential computing uses CPU-level memory encryption and attestation (Intel TDX, AMD SEV-SNP, Arm CCA, NVIDIA H100/H200 confidential compute mode) to create a trusted execution environment (TEE) whose memory is inaccessible even to the host OS and hypervisor. The model runs inside the TEE; infra operators cannot read what it processes.
A confidential VM defends against attackers with privileged host access: a root operator who can, for example, gcore the inference process and dump its prompts. What a TEE gives you: every page of guest memory is encrypted with a key the CPU generates and never releases. Even mmap-ing the guest's physical memory from the host returns ciphertext.
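That property can be illustrated with a toy software model. This is purely illustrative: real memory encryption happens in hardware (AES in the CPU's memory controller), not in a stream cipher like the sketch below. The only point being modeled is the trust property: the key lives inside the "CPU" object and is never exported, so host-side reads see ciphertext.

```python
import hashlib
import os

class ToyMemoryController:
    """Toy model of TEE memory encryption (not real cryptographic hardware).

    The key is generated 'on-die' and never leaves this object; the guest
    path decrypts transparently, the host path sees raw DRAM contents.
    """
    def __init__(self):
        self._key = os.urandom(32)  # generated internally, never released

    def _keystream(self, n: int) -> bytes:
        # Derive n keystream bytes from the private key (toy stream cipher).
        out = b""
        counter = 0
        while len(out) < n:
            out += hashlib.sha256(self._key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n]

    def write_guest_page(self, plaintext: bytes) -> bytes:
        # What actually lands in DRAM is ciphertext.
        return bytes(a ^ b for a, b in zip(plaintext, self._keystream(len(plaintext))))

    def read_as_guest(self, dram: bytes) -> bytes:
        # Inside the TEE, the CPU transparently decrypts.
        return bytes(a ^ b for a, b in zip(dram, self._keystream(len(dram))))

    def read_as_host(self, dram: bytes) -> bytes:
        # The hypervisor sees raw DRAM: ciphertext only.
        return dram

mc = ToyMemoryController()
dram = mc.write_guest_page(b"privileged prompt")
assert mc.read_as_guest(dram) == b"privileged prompt"   # guest sees plaintext
assert mc.read_as_host(dram) != b"privileged prompt"    # host sees ciphertext
```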
TEE memory encryption is worthless if you cannot verify that the thing you are sending prompts to is actually a TEE running the expected image. Remote attestation solves this: before the orchestrator releases the data-encryption key that unwraps the prompt, it requests a signed quote from the TEE describing the measured boot state and the workload hash. The quote is signed by the CPU vendor's root of trust; the orchestrator verifies it against expected values.
import os
from dataclasses import dataclass

@dataclass
class AttestationQuote:
    cpu_vendor: str        # "intel-tdx" | "amd-sev-snp"
    measurement: bytes     # hash of TEE initial state (firmware + kernel)
    workload_hash: bytes   # hash of the inference container image
    nonce: bytes           # client-supplied freshness value
    signature: bytes       # signed by CPU vendor root key

EXPECTED_MEASUREMENTS = {
    "intel-tdx": {b"\x8a\x7f..."},  # pinned after provisioning
}
EXPECTED_WORKLOADS = {b"\xde\xad..."}  # SHA-256 of the signed model server image

def verify_quote(q: AttestationQuote, expected_nonce: bytes) -> bool:
    if q.nonce != expected_nonce:
        return False
    if q.measurement not in EXPECTED_MEASUREMENTS.get(q.cpu_vendor, set()):
        return False
    if q.workload_hash not in EXPECTED_WORKLOADS:
        return False
    return verify_vendor_signature(q)  # PCCS for Intel, KDS for AMD

def release_key_if_attested(tee_endpoint, data_key_material) -> bool:
    nonce = os.urandom(32)
    quote = tee_endpoint.get_quote(nonce=nonce)
    if not verify_quote(quote, expected_nonce=nonce):
        audit.log("attestation.failed", endpoint=tee_endpoint.url)
        return False
    # Only now do we transfer the key that lets the TEE decrypt the prompt.
    tee_endpoint.wrap_and_send(data_key_material)
    return True
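The client-supplied nonce is what makes a quote fresh rather than replayable: a quote captured from an earlier session cannot satisfy a verifier that just generated a new nonce. The following self-contained sketch demonstrates that behavior; `FakeTee`, `Quote`, and the pinned values here are hypothetical stand-ins that mirror the verifier's logic, not real attestation APIs.

```python
import os
from dataclasses import dataclass

# Hypothetical pinned values, standing in for real measurements.
GOOD_MEASUREMENT = b"\x8a\x7f"
GOOD_WORKLOAD = b"\xde\xad"

@dataclass
class Quote:
    measurement: bytes
    workload_hash: bytes
    nonce: bytes

class FakeTee:
    """Simulated TEE endpoint: quotes over whatever nonce it is handed."""
    def get_quote(self, nonce: bytes) -> Quote:
        return Quote(GOOD_MEASUREMENT, GOOD_WORKLOAD, nonce)

def verify(q: Quote, expected_nonce: bytes) -> bool:
    # Simplified verifier (signature check omitted for the sketch).
    return (q.nonce == expected_nonce
            and q.measurement == GOOD_MEASUREMENT
            and q.workload_hash == GOOD_WORKLOAD)

tee = FakeTee()
nonce = os.urandom(32)
fresh = tee.get_quote(nonce)
assert verify(fresh, nonce)               # fresh quote passes
assert not verify(fresh, os.urandom(32))  # same quote fails a later nonce: replay rejected
```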
Attestation turns "trust the server" into "trust the CPU vendor + our image signing" — a much smaller and more auditable trust base.
Modern inference is GPU-bound. NVIDIA H100 and H200 support confidential compute mode: the GPU attests its firmware state alongside the CPU TEE, and the PCIe transport between CPU and GPU is encrypted. Without this, a CPU TEE alone is insufficient — the prompt would be re-exposed the moment it crossed the bus to the GPU.
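In practice this means the key-release gate must check GPU evidence alongside the CPU quote. A minimal sketch of that combined check, assuming a hypothetical report shape (real GPU evidence comes from NVIDIA's attestation tooling, with its own verification service):

```python
from dataclasses import dataclass

@dataclass
class GpuAttestationReport:
    """Hypothetical shape of GPU attestation evidence for this sketch."""
    firmware_hash: bytes    # measured GPU firmware state
    cc_mode_enabled: bool   # confidential compute mode active

EXPECTED_GPU_FIRMWARE = {b"\xaa\xbb..."}  # pinned after provisioning

def verify_platform(cpu_quote_ok: bool, gpu: GpuAttestationReport) -> bool:
    """Release the data key only if BOTH sides attest.

    A CPU-only check is insufficient: the prompt would be re-exposed
    on the PCIe bus and in GPU memory.
    """
    if not cpu_quote_ok:
        return False
    if not gpu.cc_mode_enabled:
        return False
    return gpu.firmware_hash in EXPECTED_GPU_FIRMWARE
```

The design point is that the gate is conjunctive: a valid CPU quote with the GPU out of confidential compute mode, or with unrecognized GPU firmware, still withholds the key.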