In a conventional RAG pipeline, the user's query is embedded on the application server and then sent in plaintext (as a vector of floats) to the vector database host, which computes similarity against indexed embeddings. For most workloads this is fine; for the most sensitive matters it is not — the query vector itself can reveal what an attorney is researching, and the vector-DB operator sits in the trust boundary.
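For contrast, here is a minimal sketch of that conventional flow in plain NumPy (tiny 4-dimensional embeddings for illustration): the vector-DB host receives the raw query embedding and is free to inspect, log, or match it.

```python
import numpy as np

def cosine_scores(query: np.ndarray, docs: np.ndarray) -> np.ndarray:
    """Plaintext cosine similarity: L2-normalize both sides, then dot."""
    q = query / np.linalg.norm(query)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return d @ q

# The application server ships `query` to the vector DB as-is;
# the operator sees the embedding, which can reveal the topic
# of the search even without the original text.
query = np.array([0.3, -1.2, 0.7, 0.1])
docs = np.array([[0.25, -1.0, 0.8, 0.0],
                 [-0.9, 0.4, -0.2, 1.1]])
scores = cosine_scores(query, docs)
best = int(np.argmax(scores))  # doc 0 is the closer match
```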
Homomorphic encryption (HE) allows the vector database to perform similarity search on encrypted query vectors without ever decrypting them. The host returns encrypted scores; only the client can decrypt. Libraries such as Microsoft SEAL, OpenFHE, and TenSEAL (SEAL with a PyTorch-friendly wrapper) implement the CKKS scheme that supports approximate arithmetic on vectors of real numbers — exactly what cosine similarity needs.
If the vector DB is on-prem inside the same TEE that runs inference (see confidential computing), you may not need HE at all — the cheaper defense is to keep plaintext vectors inside the attested boundary.
CKKS (Cheon–Kim–Kim–Song) encodes vectors of real numbers into polynomial ciphertexts that support addition, multiplication, and rotation. The scheme is leveled (somewhat homomorphic): each multiplication consumes part of a fixed budget — one level of the ciphertext's modulus chain — and once the budget is exhausted the ciphertext must be bootstrapped (expensive, and not supported by every library) or the circuit must be shallow enough to stay within budget. A cosine-similarity dot product is shallow (one multiplication, then a rotate-and-sum), so CKKS handles it well.
import tenseal as ts
import numpy as np

def make_context() -> ts.Context:
    ctx = ts.context(
        scheme=ts.SCHEME_TYPE.CKKS,
        poly_modulus_degree=8192,
        coeff_mod_bit_sizes=[60, 40, 40, 60],
    )
    ctx.global_scale = 2 ** 40
    ctx.generate_galois_keys()  # enables rotations for summation
    return ctx

def l2_normalize(v: np.ndarray) -> np.ndarray:
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# --- Client side: encrypt query vector ---
client_ctx = make_context()
query = l2_normalize(np.random.randn(768).astype(np.float64))
enc_query = ts.ckks_vector(client_ctx, query)

# Serialize a public-only context for the server (no secret key).
public_ctx_bytes = client_ctx.serialize(save_secret_key=False)
enc_query_bytes = enc_query.serialize()

# --- Server side: score against indexed (pre-normalized) document vectors ---
server_ctx = ts.context_from(public_ctx_bytes)
enc_q = ts.ckks_vector_from(server_ctx, enc_query_bytes)
doc_vectors = [l2_normalize(np.random.randn(768)) for _ in range(1000)]
enc_scores = [(i, enc_q.dot(d)) for i, d in enumerate(doc_vectors)]  # plaintext doc, encrypted query

# The server cannot rank ciphertexts, so it returns ALL encrypted scores;
# the client decrypts and selects the top-K itself.
enc_score_bytes = [(i, s.serialize()) for i, s in enc_scores]

# --- Client side: decrypt and rank ---
scores = [(i, ts.ckks_vector_from(client_ctx, b).decrypt()[0])
          for i, b in enc_score_bytes]
top_k = sorted(scores, key=lambda x: -x[1])[:10]
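A back-of-envelope check on the parameters above (a sketch of the usual rule of thumb, not an exact SEAL-level accounting): the outer primes in coeff_mod_bit_sizes anchor encryption and decryption precision, and each rescale after a multiplication drops one of the middle primes — so [60, 40, 40, 60] leaves room for roughly two sequential multiplications at scale 2**40, comfortably more than the single multiply the dot product needs.

```python
def estimated_mult_depth(coeff_mod_bit_sizes: list[int]) -> int:
    """Rough leveled-depth estimate for a CKKS modulus chain:
    the middle primes are what rescaling consumes, one per multiply."""
    return max(len(coeff_mod_bit_sizes) - 2, 0)

depth = estimated_mult_depth([60, 40, 40, 60])  # -> 2
# The encrypted dot product uses depth 1: one multiply, then
# rotate-and-sum additions, which cost no multiplicative depth.
assert depth >= 1
```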
Note the asymmetry: document vectors stay plaintext on the server; only the query is encrypted. This is the common configuration — documents are bulk-loaded under a different trust model (often via a secure pipeline into the index), while queries are the high-sensitivity signal. Encrypting both sides is possible but multiplies cost.