In a conventional RAG pipeline, the user's query is embedded on the application server and then sent in plaintext (as a vector of floats) to the vector database host, which computes similarity against indexed embeddings. For most workloads this is fine; for the most sensitive matters it is not — the query vector itself can reveal what an attorney is researching, and the vector-DB operator sits in the trust boundary.
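For contrast, here is a minimal sketch of that conventional flow in plain NumPy (tiny 4-dimensional embeddings for illustration): the vector-DB host receives the raw query embedding and is free to inspect, log, or match it.

```python
import numpy as np

def cosine_scores(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Plaintext cosine similarity: L2-normalize both sides, then dot."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

# The application server ships `query` to the vector DB as-is;
# the operator sees the embedding, which can reveal the topic
# of the search even without the original text.
query = np.array([0.3, -1.2, 0.7, 0.1])
docs = np.array([[0.25, -1.0, 0.8, 0.0],
                 [-0.9, 0.4, -0.2, 1.1]])
scores = cosine_scores(query, docs)
best = int(np.argmax(scores))  # doc 0 is the closer match
```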
Homomorphic encryption (HE) allows the vector database to perform similarity search on encrypted query vectors without ever decrypting them. The host returns encrypted scores; only the client can decrypt. Libraries such as Microsoft SEAL, OpenFHE, and TenSEAL (SEAL with a PyTorch-friendly wrapper) implement the CKKS scheme that supports approximate arithmetic on vectors of real numbers — exactly what cosine similarity needs.
If the vector DB is on-prem inside the same TEE that runs inference (see confidential computing), you may not need HE at all — the cheaper defense is to keep plaintext vectors inside the attested boundary.
CKKS (Cheon–Kim–Kim–Song) encodes vectors of real numbers into polynomial ciphertexts that support addition, multiplication, and rotation. The scheme is leveled (somewhat homomorphic): each multiplication consumes part of a fixed budget — one level of the ciphertext's modulus chain — and once the budget is exhausted the ciphertext must be bootstrapped (expensive, and not supported by every library) or the circuit must be shallow enough to stay within budget. A cosine-similarity dot product is shallow (one multiplication, then a rotate-and-sum), so CKKS handles it well.
import tenseal as ts
import numpy as np

def make_context() -> ts.Context:
    ctx = ts.context(
        scheme=ts.SCHEME_TYPE.CKKS,
        poly_modulus_degree=8192,
        coeff_mod_bit_sizes=[60, 40, 40, 60],
    )
    ctx.global_scale = 2 ** 40
    ctx.generate_galois_keys()  # enables rotations for summation
    return ctx

def l2_normalize(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# --- Client side: encrypt query vector ---
client_ctx = make_context()
query = l2_normalize(np.random.randn(768).astype(np.float64))
enc_query = ts.ckks_vector(client_ctx, query)

# Serialize a public-only context for the server (no secret key).
public_ctx_bytes = client_ctx.serialize(save_secret_key=False)
enc_query_bytes = enc_query.serialize()

# --- Server side: score against indexed (pre-normalized) document vectors ---
server_ctx = ts.context_from(public_ctx_bytes)
enc_q = ts.ckks_vector_from(server_ctx, enc_query_bytes)
doc_vectors = [l2_normalize(np.random.randn(768)) for _ in range(1000)]
enc_scores = [(i, enc_q.dot(d)) for i, d in enumerate(doc_vectors)]  # plaintext doc, encrypted query

# The server cannot rank ciphertexts, so it returns ALL encrypted scores;
# the client decrypts and selects the top-K itself.
enc_score_bytes = [(i, s.serialize()) for i, s in enc_scores]

# --- Client side: decrypt and rank ---
scores = [(i, ts.ckks_vector_from(client_ctx, b).decrypt()[0])
          for i, b in enc_score_bytes]
top_k = sorted(scores, key=lambda x: -x[1])[:10]
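A back-of-envelope check on the parameters above (a sketch of the usual rule of thumb, not an exact SEAL-level accounting): the outer primes in coeff_mod_bit_sizes anchor encryption and decryption precision, and each rescale after a multiplication drops one of the middle primes — so [60, 40, 40, 60] leaves room for roughly two sequential multiplications at scale 2**40, comfortably more than the single multiply the dot product needs.

```python
def estimated_mult_depth(coeff_mod_bit_sizes: list[int]) -> int:
    """Rough leveled-depth estimate for a CKKS modulus chain:
    the middle primes are what rescaling consumes, one per multiply."""
    return max(len(coeff_mod_bit_sizes) - 2, 0)

depth = estimated_mult_depth([60, 40, 40, 60])  # -> 2
# The encrypted dot product uses depth 1: one multiply, then
# rotate-and-sum additions, which cost no multiplicative depth.
assert depth >= 1
```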
Note the asymmetry: document vectors stay plaintext on the server; only the query is encrypted. This is the common configuration — documents are bulk-loaded under a different trust model (often via a secure pipeline into the index), while queries are the high-sensitivity signal. Encrypting both sides is possible but multiplies cost.