When privileged matters are pinned to an on-prem model, the trust boundary shrinks to the physical server running the weights. But on-prem is not the same as private: the operating system, hypervisor, and anyone with root on the host can, in principle, read the model's prompt and response memory. For attorney–client privileged content, that residual surface is not acceptable — a privileged document opened inside a model process is still privileged.
Confidential computing uses CPU-level memory encryption and attestation (Intel TDX, AMD SEV-SNP, Arm CCA, NVIDIA H100/H200 confidential compute mode) to create a trusted execution environment (TEE) whose memory is inaccessible even to the host OS and hypervisor. The model runs inside the TEE; infra operators cannot read what it processes.
A confidential VM defends against attackers with privileged host access: a root operator who can, for example, gcore the inference process and dump its prompts. What a TEE gives you: every page of guest memory is encrypted with a key the CPU generates and never releases. Even mmap-ing the guest's physical memory from the host returns ciphertext.
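That property can be illustrated with a toy software model. This is purely illustrative: real memory encryption happens in hardware (AES in the CPU's memory controller), not in a stream cipher like the sketch below. The only point being modeled is the trust property: the key lives inside the "CPU" object and is never exported, so host-side reads see ciphertext.

```python
import hashlib
import os

class ToyMemoryController:
    """Toy model of TEE memory encryption (not real cryptographic hardware).

    The key is generated 'on-die' and never leaves this object; the guest
    path decrypts transparently, the host path sees raw DRAM contents.
    """
    def __init__(self):
        self._key = os.urandom(32)  # generated internally, never released

    def _keystream(self, n: int) -> bytes:
        # Derive n keystream bytes from the private key (toy stream cipher).
        out = b""
        counter = 0
        while len(out) < n:
            out += hashlib.sha256(self._key + counter.to_bytes(8, "big")).digest()
            counter += 1
        return out[:n]

    def write_guest_page(self, plaintext: bytes) -> bytes:
        # What actually lands in DRAM is ciphertext.
        return bytes(a ^ b for a, b in zip(plaintext, self._keystream(len(plaintext))))

    def read_as_guest(self, dram: bytes) -> bytes:
        # Inside the TEE, the CPU transparently decrypts.
        return bytes(a ^ b for a, b in zip(dram, self._keystream(len(dram))))

    def read_as_host(self, dram: bytes) -> bytes:
        # The hypervisor sees raw DRAM: ciphertext only.
        return dram

mc = ToyMemoryController()
dram = mc.write_guest_page(b"privileged prompt")
assert mc.read_as_guest(dram) == b"privileged prompt"   # guest sees plaintext
assert mc.read_as_host(dram) != b"privileged prompt"    # host sees ciphertext
```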
TEE memory encryption is worthless if you cannot verify that the thing you are sending prompts to is actually a TEE running the expected image. Remote attestation solves this: before the orchestrator releases the data-encryption key that unwraps the prompt, it requests a signed quote from the TEE describing the measured boot state and the workload hash. The quote is signed by the CPU vendor's root of trust; the orchestrator verifies it against expected values.
import os
from dataclasses import dataclass

@dataclass
class AttestationQuote:
    cpu_vendor: str        # "intel-tdx" | "amd-sev-snp"
    measurement: bytes     # hash of TEE initial state (firmware + kernel)
    workload_hash: bytes   # hash of the inference container image
    nonce: bytes           # client-supplied freshness value
    signature: bytes       # signed by CPU vendor root key

EXPECTED_MEASUREMENTS = {
    "intel-tdx": {b"\x8a\x7f..."},  # pinned after provisioning
}
EXPECTED_WORKLOADS = {b"\xde\xad..."}  # SHA-256 of the signed model server image

def verify_quote(q: AttestationQuote, expected_nonce: bytes) -> bool:
    if q.nonce != expected_nonce:
        return False
    if q.measurement not in EXPECTED_MEASUREMENTS.get(q.cpu_vendor, set()):
        return False
    if q.workload_hash not in EXPECTED_WORKLOADS:
        return False
    return verify_vendor_signature(q)  # PCCS for Intel, KDS for AMD

def release_key_if_attested(tee_endpoint, data_key_material) -> bool:
    nonce = os.urandom(32)
    quote = tee_endpoint.get_quote(nonce=nonce)
    if not verify_quote(quote, expected_nonce=nonce):
        audit.log("attestation.failed", endpoint=tee_endpoint.url)
        return False
    # Only now do we transfer the key that lets the TEE decrypt the prompt.
    tee_endpoint.wrap_and_send(data_key_material)
    return True
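The client-supplied nonce is what makes a quote fresh rather than replayable: a quote captured from an earlier session cannot satisfy a verifier that just generated a new nonce. The following self-contained sketch demonstrates that behavior; `FakeTee`, `Quote`, and the pinned values here are hypothetical stand-ins that mirror the verifier's logic, not real attestation APIs.

```python
import os
from dataclasses import dataclass

# Hypothetical pinned values, standing in for real measurements.
GOOD_MEASUREMENT = b"\x8a\x7f"
GOOD_WORKLOAD = b"\xde\xad"

@dataclass
class Quote:
    measurement: bytes
    workload_hash: bytes
    nonce: bytes

class FakeTee:
    """Simulated TEE endpoint: quotes over whatever nonce it is handed."""
    def get_quote(self, nonce: bytes) -> Quote:
        return Quote(GOOD_MEASUREMENT, GOOD_WORKLOAD, nonce)

def verify(q: Quote, expected_nonce: bytes) -> bool:
    # Simplified verifier (signature check omitted for the sketch).
    return (q.nonce == expected_nonce
            and q.measurement == GOOD_MEASUREMENT
            and q.workload_hash == GOOD_WORKLOAD)

tee = FakeTee()
nonce = os.urandom(32)
fresh = tee.get_quote(nonce)
assert verify(fresh, nonce)               # fresh quote passes
assert not verify(fresh, os.urandom(32))  # same quote fails a later nonce: replay rejected
```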
Attestation turns "trust the server" into "trust the CPU vendor + our image signing" — a much smaller and more auditable trust base.
Modern inference is GPU-bound. NVIDIA H100 and H200 support confidential compute mode: the GPU attests its firmware state alongside the CPU TEE, and the PCIe transport between CPU and GPU is encrypted. Without this, a CPU TEE alone is insufficient — the prompt would be re-exposed the moment it crossed the bus to the GPU.
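In practice this means the key-release gate must check GPU evidence alongside the CPU quote. A minimal sketch of that combined check, assuming a hypothetical report shape (real GPU evidence comes from NVIDIA's attestation tooling, with its own verification service):

```python
from dataclasses import dataclass

@dataclass
class GpuAttestationReport:
    """Hypothetical shape of GPU attestation evidence for this sketch."""
    firmware_hash: bytes    # measured GPU firmware state
    cc_mode_enabled: bool   # confidential compute mode active

EXPECTED_GPU_FIRMWARE = {b"\xaa\xbb..."}  # pinned after provisioning

def verify_platform(cpu_quote_ok: bool, gpu: GpuAttestationReport) -> bool:
    """Release the data key only if BOTH sides attest.

    A CPU-only check is insufficient: the prompt would be re-exposed
    on the PCIe bus and in GPU memory.
    """
    if not cpu_quote_ok:
        return False
    if not gpu.cc_mode_enabled:
        return False
    return gpu.firmware_hash in EXPECTED_GPU_FIRMWARE
```

The design point is that the gate is conjunctive: a valid CPU quote with the GPU out of confidential compute mode, or with unrecognized GPU firmware, still withholds the key.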