How Vector Embeddings Work in RAG

1 Text → Vector (Fixed Dimensions)
The multi-qa-mpnet model produces 768 dimensions for every input text:

"cat"                  → [0.23, -0.45, 0.12, -0.08, 0.67, ...]   (768 numbers total)
"dog"                  → [0.18, -0.32, 0.08, 0.15, 0.59, ...]    (768 numbers total)
"car"                  → [-0.41, 0.22, -0.33, 0.54, -0.12, ...]  (768 numbers total)
"vitamin C deficiency" → [0.35, 0.61, -0.18, 0.42, 0.09, ...]    (768 numbers total)
Key Point: Every text (word, sentence, paragraph) becomes exactly 768 numbers. The dimension count is fixed by the model, not by your data. 1 word or 1 million documents → same 768 dimensions each.
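A minimal sketch of the fixed-dimension idea in plain Python. The `toy_embed` function below is a made-up stand-in, not a real model (in practice you would load multi-qa-mpnet through a library such as sentence-transformers); it only demonstrates that every input, short or long, comes out as exactly 768 numbers:

```python
import hashlib
import random

DIM = 768  # fixed by the model, not by the input


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: maps any text to a
    deterministic 768-dim vector. A real model places similar texts
    nearby; this toy only demonstrates the fixed output shape."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


for text in ["cat", "dog", "a much longer sentence about vitamin C deficiency"]:
    vec = toy_embed(text)
    print(f"{text!r:55} -> {len(vec)} dimensions")
```

Note that the same input always yields the same vector, and the dimension never changes with input length.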
2 Cosine Similarity (Comparing Vectors)
cat vs dog        → 0.66  Similar (both animals); the vectors point in similar directions in 768-dimensional space
cat vs smartphone → 0.24  Different (unrelated concepts); the vectors point in different directions in 768-dimensional space
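The similarity computation itself is a few lines of Python. The 3-dim vectors below are made-up toys standing in for real 768-dim embeddings; the formula is the same at any dimension:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy 3-dim vectors standing in for 768-dim embeddings
cat = [0.8, 0.5, 0.1]
dog = [0.7, 0.6, 0.2]
smartphone = [0.1, -0.4, 0.9]

print(cosine_similarity(cat, dog))         # high: similar direction
print(cosine_similarity(cat, smartphone))  # low: different direction
```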
3 RAG System Flow
📄 Index Documents: chunk your medical documents and convert each chunk to a 768-dim vector (1000 chunks → 1000 vectors stored)

User Query: convert the user's question to a vector using the same model ("scurvy symptoms" → [0.35, ...])

🔍 Vector Search: find the chunks with the highest cosine similarity to the query vector (top 5 chunks: 0.89, 0.84, 0.81...)

🤖 Generate Answer: pass the retrieved chunks as context to the LLM for answer generation (LLM + context → accurate answer)
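The four steps can be sketched end to end in a few lines. The chunk texts and 3-dim vectors below are invented toys (a real pipeline would embed with the model and call an LLM at the last step, which is left as a prompt string here):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))


# Step 1: index — each chunk mapped to a vector (toy 3-dim values here)
index = {
    "Scurvy is caused by vitamin C deficiency.":   [0.9, 0.1, 0.0],
    "Symptoms include bleeding gums and fatigue.": [0.8, 0.3, 0.1],
    "The hospital cafeteria opens at 7am.":        [0.0, 0.2, 0.9],
}

# Step 2: embed the query with the same model (toy vector here)
query_vec = [0.85, 0.2, 0.05]  # "scurvy symptoms"

# Step 3: vector search — rank chunks by cosine similarity, keep top-k
ranked = sorted(index, key=lambda chunk: cosine(query_vec, index[chunk]), reverse=True)
top_2 = ranked[:2]

# Step 4: generation — retrieved chunks become the LLM's context
context = "\n".join(top_2)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: scurvy symptoms"
print(prompt)
```

The off-topic cafeteria chunk scores low and never reaches the LLM, which is the whole point of the retrieval step.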

💡 Why This Works

The embedding model learned from billions of text examples to place semantically similar text at nearby positions in 768-dimensional space. So "scurvy" and "vitamin C deficiency" end up close together, even though they share no words. That's why semantic search beats keyword matching.


Common Interview Questions:

Cosine similarity vs dot product — which should I use?

If your vectors are L2-normalized, cosine similarity and dot product are equivalent (cosine equals dot when both vectors have norm 1). Most modern embedding models (OpenAI's, bge, e5) ship normalized vectors, so dot product is faster and gives identical rankings. Use cosine when you can't guarantee normalization or when comparing across embedding sources with different norms. Use raw dot product only when norm itself carries meaning — e.g., some recommender embeddings encode confidence in the magnitude.
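The equivalence is easy to verify numerically. A quick sketch with random vectors (any dimension works; 768 chosen to match the examples above):

```python
import math
import random


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


rng = random.Random(42)
a = [rng.uniform(-1.0, 1.0) for _ in range(768)]
b = [rng.uniform(-1.0, 1.0) for _ in range(768)]

# Once both vectors are L2-normalized, dot product equals cosine similarity
print(cosine(a, b))
print(dot(normalize(a), normalize(b)))  # identical up to float rounding
```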

Why does normalization matter so much?

Without normalization, longer documents tend to have larger embedding norms, so dot product favors them regardless of relevance. Cosine cancels this by dividing out the norms, but at the cost of an extra sqrt-and-divide per comparison. The standard pipeline is: embed, L2-normalize at index time, then use dot product (or "inner product" mode in your vector DB) at query time — same accuracy as cosine, half the math. The bug to watch for: forgetting to normalize the query vector at search time. The per-query ranking actually survives (the query's norm scales every score by the same factor), but absolute similarity thresholds and any score fusion across retrievers silently go wrong.
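The norm bias is visible even in a two-dimensional toy. The vectors below are invented: one is well aligned with the query but has a small norm, the other is poorly aligned but large, as a long document's raw embedding might be:

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


query = [1.0, 0.0]
relevant = [0.9, 0.1]   # nearly the query's direction, small norm
long_doc = [3.0, 3.0]   # poorly aligned, but a large norm (toy "long document")

# Raw dot product favors the big-norm vector despite worse alignment
print(dot(query, relevant), dot(query, long_doc))   # 0.9 vs 3.0

# L2-normalizing at index time removes the norm bias: ranking flips correctly
print(dot(query, normalize(relevant)), dot(query, normalize(long_doc)))
```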

What's the tradeoff with embedding dimensionality?

Higher dimensions (1536, 3072) capture more nuance and tend to score higher on benchmarks but cost linearly more in storage, RAM, and query time — a billion 3072-dim float32 vectors take ~12.3 TB before any index overhead. Lower dimensions (384, 768) are 4–8x cheaper and the quality gap narrows on domain-specific corpora. Matryoshka embeddings (MRL) give you both: train at high dimension but truncate to lower dimension at serve time with graceful quality degradation. Practical rule: start at 768, only move to 1536+ if eval shows it actually helps your task.
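The storage arithmetic is worth having as a one-liner when sizing an index (raw vectors only; ANN index structures add more on top):

```python
def index_size_bytes(n_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw vector storage: count * dimensions * bytes per float32."""
    return n_vectors * dim * bytes_per_float


billion = 1_000_000_000
print(index_size_bytes(billion, 3072) / 1e12, "TB")  # ~12.3 TB at 3072 dims
print(index_size_bytes(billion, 768) / 1e12, "TB")   # ~3.1 TB at 768 dims
```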

Why does retrieval fail even when the answer is in the corpus?

Most common reasons in order: (1) chunking split the answer across boundaries so no single chunk contains the full context; (2) the query's vocabulary doesn't overlap semantically with the document's (a question phrased in user-speak vs a chunk in technical jargon — embeddings help here but aren't magic); (3) the right chunk is in the index but ranked just below your top-k cutoff; (4) embedding model mismatch (you re-embedded the corpus with a new model but kept the old query encoder); (5) for rare exact tokens like product SKUs, dense retrieval just doesn't work — needs BM25 hybrid.
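For failure mode (5), the standard fix is hybrid retrieval, and reciprocal rank fusion (RRF) is a simple way to merge a dense ranking with a BM25 ranking without tuning score scales. A sketch with invented doc IDs (k=60 is the conventional RRF constant):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_top = ["doc_scurvy", "doc_vitamins", "doc_diet"]    # semantic ranking
bm25_top = ["doc_sku_123", "doc_scurvy", "doc_vitamins"]  # exact-token ranking

print(rrf([dense_top, bm25_top]))
```

Documents found by both retrievers rise to the top, while the SKU match that dense retrieval missed entirely still survives into the fused list.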

How do you decide between an open-source and a proprietary embedding model?

Check the MTEB leaderboard for your task type (retrieval, clustering, classification) at your size budget. Open-source bge-large, e5-mistral, and gte-large are competitive with the best proprietary models on most retrieval tasks. Proprietary models (Voyage, Cohere, OpenAI text-embedding-3-large) often pull ahead on specialized tasks (code, multilingual) and are operationally simpler — one API call, no GPU. Self-hosting wins on cost at high volume (roughly 100M+ embeddings/month), data residency, and the ability to fine-tune. The decision is rarely about peak quality; it's about ops fit.

How do you measure embedding quality on your own data?

Build a small (200–500 example) labeled set of (query, relevant_doc_id) pairs and compute recall@k and MRR for each candidate model. The query side should mirror real user queries, not paraphrases of doc text — that's the bias trap that makes every embedding model look great. For unlabeled data, bootstrap with an LLM: prompt it with each chunk and ask it to generate a plausible question, then check that the question retrieves its source chunk in the top-5. The MTEB leaderboard tells you the average; your gold set tells you the truth for your domain.