The brief is concrete: build a retrieval-augmented generation service that indexes 10 million documents, serves 100,000 monthly active users, and returns grounded answers with a p95 latency under 2 seconds end-to-end. The corpus is a mix of PDFs, internal wiki pages, and Slack exports averaging 4 KB of cleaned text per document. Users ask natural-language questions and expect citations to source spans.
Most "RAG demo to production" failures come from skipping the boring work: hybrid retrieval, a real reranker, query-result caching, and a freshness strategy that does not melt the embedding bill. This walk-through is opinionated about all four.
The system must:

- index the full 10M-document corpus and keep new uploads searchable within five minutes,
- answer natural-language questions with grounded answers and citations to source spans,
- enforce per-tenant isolation on every retrieval path,
- hit the latency, availability, and cost targets below.
Out of scope for this design: agentic tool use, multi-turn conversational memory beyond the current question, and fine-tuning. Those are separate problems.
| Metric | Target |
|---|---|
| End-to-end p50 latency | 800 ms |
| End-to-end p95 latency | 2,000 ms |
| End-to-end p99 latency | 4,000 ms |
| Sustained QPS | 50 (peak ~200) |
| Availability | 99.9% monthly (43 min downtime budget) |
| Document freshness | ≤ 5 min from upload to searchable |
| Cost ceiling | < $0.02 per query all-in |
The 50 QPS figure comes from 100k MAU × 30 queries/user/month ÷ (30 days × 86,400 s) ≈ 1.2 QPS average; a ~4× daily peak and ~10× burst headroom on top of that gives the 50 QPS sustained target, and capacity should be provisioned for bursts to ~200 QPS.
Chunking and embedding count. 10M docs × 4 KB cleaned text ÷ 1 KB per chunk (about 250 tokens) ≈ 40M chunks.
Embedding storage. Using bge-large-en-v1.5 at 1024 dimensions, float16 = 2 bytes/dim: 40M chunks × 1024 dims × 2 bytes ≈ 80 GB of raw vectors. HNSW graph overhead at M=32 adds ~50% → ~120 GB total.

Embedding throughput at ingest. Initial backfill of 40M chunks on a single A10G running vLLM-served bge-large batched at 256 sequences ≈ 3,000 chunks/sec → 40M / 3,000 ≈ 3.7 hours. With 4× parallelism, < 1 hour. After backfill, steady-state < 1 chunk/sec.
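The same napkin math as a runnable sketch, handy to rerun when the corpus or model assumptions change; the constants are the estimates above, not measurements:

```python
# Back-of-envelope capacity check; constants mirror the estimates above.
DOCS = 10_000_000
CHUNKS_PER_DOC = 4                  # 4 KB cleaned text / ~1 KB per chunk
DIMS, BYTES_PER_DIM = 1024, 2       # bge-large-en-v1.5, float16
HNSW_OVERHEAD = 1.5                 # m=32 graph links: ~50% on top of raw vectors
EMBED_CHUNKS_PER_SEC = 3_000        # single A10G, vLLM, batch 256

chunks = DOCS * CHUNKS_PER_DOC                           # 40M
raw_gb = chunks * DIMS * BYTES_PER_DIM / 1e9             # ~82 GB
total_gb = raw_gb * HNSW_OVERHEAD                        # ~123 GB
backfill_hours = chunks / EMBED_CHUNKS_PER_SEC / 3600    # ~3.7 h
qps_avg = 100_000 * 30 / (30 * 86_400)                   # ~1.2 QPS

print(f"{chunks / 1e6:.0f}M chunks, {total_gb:.0f} GB index, "
      f"{backfill_hours:.1f} h backfill, {qps_avg:.1f} QPS average")
```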
Query-time inference. Embedding the user query is one forward
pass ≈ 15 ms on an L4. Cross-encoder rerank of top-50 candidates with
bge-reranker-large ≈ 60 ms batched. LLM answer generation
(Claude Haiku or Llama 3.1 8B) at ~4,000 input tokens (system prompt + 8 retrieved chunks) + 200 output tokens
≈ 700 ms. Vector search ANN k=50 ≈ 30 ms.
Latency budget breakdown (p95 target 2000 ms):
query_embedding: 50 ms # cold tokenizer + forward pass
hybrid_retrieval: 120 ms # BM25 + dense ANN in parallel
rerank_top50: 150 ms # cross-encoder, batched
context_assembly: 30 ms # fetch chunk text + dedupe
llm_generation: 1,400 ms # 700 ms typical + 700 ms tail
network/serialization: 250 ms
---
total p95: 2,000 ms
                +------------------+
 user query --->|   API Gateway    |  (auth, tenant_id injection, rate limit)
                +--------+---------+
                         |
                         v
                +------------------+        +-------------------+
                |  Query Planner   |<------>|    Redis Cache    |  (query_hash -> answer)
                +--------+---------+        +-------------------+
                         |
          +--------------+--------------+
          |                             |
          v                             v
+-------------------+         +-------------------+
|   Embed Service   |         |  BM25 (Postgres   |
| (vLLM, bge-large) |         | GIN / OpenSearch) |
+---------+---------+         +---------+---------+
          |                             |
          v                             |
+-------------------+                   |
|   Vector Store    |                   |
|  (pgvector HNSW)  |                   |
+---------+---------+                   |
          |                             |
          +--------------+--------------+
                         |
                         v
                +------------------+
                |     Reranker     |  (cross-encoder, top-50 -> top-8)
                +--------+---------+
                         |
                         v
                +------------------+
                |  LLM Generator   |  (Claude Haiku via Bedrock OR vLLM Llama 8B)
                +--------+---------+
                         |
                         v
                +------------------+
                |     Response     |  (answer + citations + confidence)
                +------------------+
Component responsibilities, one line each:
- API Gateway: authenticates the caller, rate-limits, and injects tenant_id into a downstream header that the query planner cannot override.
- Query Planner: runs the hot path (cache lookup, parallel BM25 + dense retrieval, fusion, rerank, generation).
- Embed Service: serves bge-large-en-v1.5 on L4 GPUs; returns 1024-dim float16 vectors.
- BM25 index: Postgres tsvector for the same chunks; provides recall on rare terms and exact identifiers (case numbers, SKUs) where embeddings underperform.
- Vector Store: pgvector HNSW over the chunk embeddings, in the same Postgres as the chunk text and metadata.
- Reranker: cross-encoder (bge-reranker-large) on a single L4; batched at 50 query-doc pairs per call.
- LLM Generator: Claude Haiku via Bedrock, or vLLM-served Llama 3.1 8B on the self-hosted path.
- Redis Cache: keyed by sha256(tenant_id + normalized_query + index_version); 30 min TTL with manual invalidation on document change.

Postgres is the source of truth for both chunk metadata and vectors. One database, one transaction boundary, fewer integration bugs.
CREATE TABLE documents (
doc_id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
source_uri TEXT NOT NULL,
title TEXT,
sha256 BYTEA NOT NULL, -- content hash for dedupe + idempotent re-ingest
created_at TIMESTAMPTZ DEFAULT now(),
updated_at TIMESTAMPTZ DEFAULT now(),
deleted_at TIMESTAMPTZ -- soft-delete; physically purged by GC job
);
CREATE INDEX ON documents (tenant_id, updated_at);
CREATE TABLE chunks (
chunk_id UUID PRIMARY KEY,
doc_id UUID NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
tenant_id UUID NOT NULL, -- denormalized for RLS predicate efficiency
ordinal INT NOT NULL, -- chunk index within document
text TEXT NOT NULL,
text_tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', text)) STORED,
embedding VECTOR(1024), -- pgvector stores this as float32; use halfvec(1024) (pgvector >= 0.7) to get the float16 sizing assumed above
embed_version SMALLINT NOT NULL, -- bumped on model upgrade
page INT,
offset_start INT,
offset_end INT
);
CREATE INDEX chunks_hnsw ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);
CREATE INDEX chunks_bm25 ON chunks USING gin (text_tsv);
CREATE INDEX chunks_tenant ON chunks (tenant_id);
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
USING (tenant_id = current_setting('app.tenant_id')::UUID);
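For the RLS policy to apply, every request has to set app.tenant_id on its connection before querying chunks. A minimal asyncpg sketch of the dense-retrieval helper the hot path calls as pg.fetch_ann; the text-literal cast avoids registering a pgvector codec, and connection handling is omitted and illustrative:

```python
import asyncpg

async def fetch_ann(conn: asyncpg.Connection, q_vec: list[float],
                    tenant_id: str, k: int = 50):
    """Dense ANN search with the RLS tenant setting scoped to one transaction."""
    # pgvector accepts a '[x,y,...]' text literal cast to ::vector.
    vec_literal = "[" + ",".join(f"{x:.6f}" for x in q_vec) + "]"
    async with conn.transaction():
        # is_local=true: the setting lives only for this transaction, so the
        # policy predicate tenant_id = current_setting('app.tenant_id') applies.
        await conn.execute(
            "SELECT set_config('app.tenant_id', $1, true)", str(tenant_id))
        return await conn.fetch(
            """
            SELECT chunk_id, doc_id, text, embedding <=> $1::vector AS distance
            FROM chunks
            ORDER BY embedding <=> $1::vector
            LIMIT $2
            """,
            vec_literal, k)
```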
Sharding. At 40M chunks a single Postgres instance (db.r6i.4xlarge,
128 GB RAM) holds the HNSW index in memory. We do not shard until the largest
tenant's index exceeds a single instance's RAM budget. When that happens, shard
by tenant_id using Citus or a routing layer; never by chunk_id
hash, because cross-shard ANN search loses recall.
Retention. Soft delete via deleted_at; nightly GC
job hard-deletes after 30 days and removes vectors from the HNSW index. For
tenants under GDPR, the GC runs on demand within 72 hours of an erasure request.
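A sketch of that GC pass under the assumptions above (30-day window, hard delete cascades to chunks, vacuum to reclaim dead HNSW entries); table and column names come from the schema, the job wiring is illustrative:

```python
import asyncpg

async def gc_pass(conn: asyncpg.Connection, retention_days: int = 30) -> int:
    """Hard-delete soft-deleted documents past retention; chunks go with them via ON DELETE CASCADE."""
    status = await conn.execute(
        """
        DELETE FROM documents
        WHERE deleted_at IS NOT NULL
          AND deleted_at < now() - make_interval(days => $1)
        """,
        retention_days)
    # Dead HNSW entries linger until vacuum reclaims them.
    await conn.execute("VACUUM (ANALYZE) chunks")
    return int(status.split()[-1])   # asyncpg status strings look like 'DELETE 1234'
```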
The hot path for a single query, with timings:
import asyncio
from uuid import UUID

# redis, pg, embed_service, reranker, llm are module-level async clients;
# hash_key, normalize, rrf, argsort, format_context are local helpers.

async def answer(query: str, tenant_id: UUID) -> Answer:
    # Step 1: cache lookup (5 ms p99)
    cache_key = hash_key(tenant_id, normalize(query), INDEX_VERSION)
    if cached := await redis.get(cache_key):
        return Answer.parse_raw(cached)
    # Step 2: parallel embed + BM25 (120 ms)
    embed_task = embed_service.embed(query)
    bm25_task = pg.fetch_bm25(query, tenant_id, k=50)
    q_vec, bm25_hits = await asyncio.gather(embed_task, bm25_task)
    # Step 3: dense ANN search (30 ms)
    dense_hits = await pg.fetch_ann(q_vec, tenant_id, k=50)
    # Step 4: reciprocal rank fusion (1 ms)
    fused = rrf(dense_hits, bm25_hits, k=60)[:50]
    # Step 5: cross-encoder rerank (150 ms)
    reranked = await reranker.score(query, [c.text for c in fused])
    top_k = [fused[i] for i in argsort(reranked)[-8:]]
    # Step 6: LLM generation with citations (1,400 ms p95)
    context = format_context(top_k)
    answer = await llm.generate(SYSTEM_PROMPT, query, context)
    # Step 7: cache + emit telemetry
    await redis.setex(cache_key, 1800, answer.json())
    return answer
Reciprocal Rank Fusion (RRF) is the boring-and-correct way to
combine BM25 and dense scores without learning per-tenant weights. The score is
sum(1 / (k + rank_i)) across retrievers; k=60 is the
canonical default and works well in practice.
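A minimal implementation of that formula, matching the rrf(dense_hits, bm25_hits, k=60) call in the hot path; it assumes each hit exposes a chunk_id as in the schema:

```python
from collections import defaultdict

def rrf(*ranked_lists, k: int = 60):
    """Reciprocal Rank Fusion: score(d) = sum over retrievers of 1 / (k + rank_d)."""
    scores = defaultdict(float)
    by_id = {}
    for hits in ranked_lists:                        # each list is already rank-ordered
        for rank, chunk in enumerate(hits, start=1):
            scores[chunk.chunk_id] += 1.0 / (k + rank)
            by_id[chunk.chunk_id] = chunk
    return [by_id[cid] for cid in sorted(scores, key=scores.get, reverse=True)]
```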
What breaks first as load or corpus grows:
- ANN recall under tenant_id filter selectivity. pgvector's HNSW does the filter after graph traversal, so a tenant holding 0.1% of the corpus may need ef_search raised to 200–400 to recover recall. At > 100 tenants, switch to per-tenant partial indexes or move that tenant to a dedicated index.
- Postgres connection exhaustion. Size the application pool at (cores × 4) + spare, not at max_connections.
- Embed-service cold starts. Warm the model by hitting /health with a synthetic embed call after deploy.
- Cache stampedes on popular queries. Guard recomputation with Redis SETNX + a short lock or the cachetools single-flight pattern; serve a stale answer for ≤ 60 s while the fresh one computes (a sketch follows the idempotency note below).

Each external dependency (Bedrock, Redis, the reranker, the embed service) also needs a defined degradation path rather than a hard failure.
Idempotency. Every request carries a client-generated
request_id; the planner checks Redis for a 5-min in-flight marker
before starting work. Duplicate requests return the same answer or wait on the
marker; they never trigger duplicate LLM bills.
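A sketch combining the request_id idempotency marker with the single-flight lock from the failure-mode list above; key names, TTLs, and the compute callable are illustrative, not from the codebase:

```python
import asyncio
import redis.asyncio as aioredis

r = aioredis.Redis()   # shared async client; connection settings omitted

async def run_once(request_id: str, cache_key: str, compute):
    """Idempotency marker + single-flight lock: duplicates never pay for a second LLM call."""
    # First writer wins the 5-minute in-flight marker for this request_id.
    first = await r.set(f"inflight:{request_id}", cache_key, nx=True, ex=300)
    # SET NX EX is SETNX with a TTL, so a crashed worker cannot hold the lock forever.
    if first and await r.set(f"lock:{cache_key}", "1", nx=True, ex=30):
        try:
            answer = await compute()                  # embed + retrieve + rerank + LLM
            await r.setex(cache_key, 1800, answer)    # 30 min answer cache
            return answer
        finally:
            await r.delete(f"lock:{cache_key}")
    # Duplicate or concurrent request: wait briefly for the winner's answer.
    for _ in range(600):                              # up to 60 s, matching the stale window
        if cached := await r.get(cache_key):
            return cached
        await asyncio.sleep(0.1)
    return await compute()                            # the winner died; compute as a last resort
```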
Per 1,000 queries, assuming 40% cache hit rate (so 600 LLM calls):
embed_service:
amortized_l4_gpu: $0.05 # 1 L4 / 1000 QPS capacity
bedrock_haiku:
input_tokens: $0.60 # 600 calls x 4000 tok x $0.25 / 1M
output_tokens: $0.15 # 600 calls x 200 tok x $1.25 / 1M
reranker_gpu: $0.05 # batched
postgres: $0.20 # db.r6i.4xlarge amortized
redis_cache: $0.05
network_egress: $0.10
api_gateway: $0.05
---
total_per_1000: $1.25 # = $0.00125 per query
total_with_overhead: $0.0025 # double for monitoring, logs, support; well under $0.02 ceiling
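A quick check of the token math behind the Bedrock lines, using the per-million-token prices and per-call token counts stated in the comments above:

```python
CALLS = 600                          # per 1,000 queries at a 40% cache hit rate
IN_TOK, OUT_TOK = 4_000, 200         # tokens per call
IN_PRICE, OUT_PRICE = 0.25, 1.25     # $ per 1M tokens (Claude 3 Haiku)

bedrock = CALLS * (IN_TOK * IN_PRICE + OUT_TOK * OUT_PRICE) / 1e6
infra = 0.05 + 0.05 + 0.20 + 0.05 + 0.10 + 0.05   # GPU, reranker, PG, Redis, egress, gateway
print(f"bedrock ${bedrock:.2f} + infra ${infra:.2f} = ${(bedrock + infra) / 1000:.5f}/query")
# -> bedrock $0.75 + infra $0.50 = $0.00125/query; x2 for overhead is still ~$0.0025
```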
The dominant line item is Bedrock token spend. If usage 10×'s, switch hot tenants to provisioned throughput (~30% cheaper at scale) or to on-prem Llama 3.1 8B (fixed GPU cost, marginal token cost ~zero).
pgvector vs Pinecone vs Qdrant. I would pick pgvector at this scale because it keeps the chunk text, metadata, and vector in one transactional store; one less system to operate, one less consistency boundary, and Postgres RLS gives tenant isolation for free. Pinecone wins above ~500M vectors or when you need geo-replication out of the box. Qdrant wins when you need rich payload filtering with high cardinality, and its scalar quantization saves real money at > 100M vectors. The test is operational complexity, not benchmarks: choose Postgres until it hurts.
Bedrock Knowledge Bases vs custom. Bedrock KB ships the same ingest → chunk → embed → OpenSearch pipeline I just described, and removes one team's worth of glue code. The reasons to build custom anyway: (1) control over the chunking algorithm (KB's defaults are weak for legal/medical docs); (2) hybrid retrieval with rerank, which KB does not expose well; (3) multi-region or on-prem deployment; (4) cost transparency at scale.
Embedding refresh strategy. When a new embedding model ships
(say bge-large-v2), naively re-embedding 40M chunks costs ~$2k and takes hours
on a single GPU pool. The cheap approach: dual-write embeddings under
embed_version=2 alongside v1; route 5% of traffic to v2; promote
when eval metrics improve; lazy-migrate the long tail by re-embedding chunks
on next read. Avoids any downtime.
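A sketch of that rollout, assuming v2 vectors land as sibling chunk rows tagged embed_version = 2 and the 5% canary split is by user; the helper names and SQL are illustrative:

```python
import hashlib

ACTIVE_VERSION, CANARY_VERSION, CANARY_PCT = 1, 2, 5

def embed_version_for(user_id: str) -> int:
    """Stable 5% canary: the same user always lands on the same embedding version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return CANARY_VERSION if bucket < CANARY_PCT else ACTIVE_VERSION

# At query time the chosen version selects both the query-embedding model and the
# index lane; the query vector must come from the same model it searches against.
ANN_SQL = """
    SELECT chunk_id, text
    FROM chunks
    WHERE embed_version = $3
    ORDER BY embedding <=> $1::vector
    LIMIT $2
"""
```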
Tenant isolation: shared index + RLS vs per-tenant indexes. Shared index with RLS scales operationally (one Postgres, one HNSW). Per-tenant indexes scale on the recall axis (no filter penalty) and simplify deletion. Hybrid policy: shared index up to 100 tenants or until any tenant's chunks exceed 5% of total; promote large or regulated tenants to dedicated indexes.
Cache the embeddings of common queries. Below the answer cache, keep a smaller cache of normalized-query → embedding vector. Saves 50 ms and one GPU forward pass on common queries; the embedding is small (2 KB) so a 100k entry cache fits in 200 MB of Redis.
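A sketch of that second-level cache: 1024 float16 values are 2 KB, so the raw bytes go straight into Redis (the key prefix and 24 h TTL are assumptions):

```python
import hashlib
import numpy as np

async def cached_query_embedding(r, embed_service, normalized_query: str) -> np.ndarray:
    """Second-level cache: normalized query -> raw float16 vector bytes (2 KB per entry)."""
    key = "qemb:" + hashlib.sha256(normalized_query.encode()).hexdigest()
    if raw := await r.get(key):
        return np.frombuffer(raw, dtype=np.float16)
    vec = np.asarray(await embed_service.embed(normalized_query), dtype=np.float16)
    await r.setex(key, 86_400, vec.tobytes())   # 1024 dims x 2 bytes = 2 KB, 24 h TTL
    return vec
```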
Dense embeddings are great at semantic similarity but underperform on rare tokens, exact identifiers (case numbers, SKUs, error codes), and acronyms the model has never seen. BM25 catches those. On internal evals across legal, code, and support corpora, hybrid + RRF beats dense-only by 8–15% on top-5 recall, and the implementation cost is one Postgres GIN index. There is no defensible reason to ship dense-only in production unless you literally have no exact identifiers in your corpus.
Start at 512 tokens with 64-token overlap; this is the sweet spot for most corpora and matches what the embedding model was trained on. Tune by measuring recall@k on a held-out eval set: if recall is low and chunks are long, the relevant span is being diluted by surrounding text — shrink to 256. If recall is high but the LLM gets confused, chunks are too short and missing context — grow to 1024. Never just pick a number from a blog post.
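Measuring recall@k needs only a held-out set of (query, relevant chunk IDs) pairs and a retriever to score; a minimal harness, with the eval-set format assumed:

```python
def recall_at_k(eval_set, retrieve, k: int = 5) -> float:
    """eval_set: [(query, set_of_relevant_chunk_ids)]; retrieve: query -> ranked chunk IDs."""
    total = 0.0
    for query, relevant in eval_set:
        got = set(retrieve(query)[:k]) & relevant
        total += len(got) / len(relevant)
    return total / len(eval_set)

# Re-chunk at 256 / 512 / 1024 tokens, rebuild the index, and compare the curves
# (make_retriever is a hypothetical factory for a retriever at a given chunk size):
# for size in (256, 512, 1024):
#     print(size, recall_at_k(eval_set, make_retriever(chunk_tokens=size)))
```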
Document deletion. Soft-delete the document row in Postgres; the GC job later hard-deletes it (nightly in the normal case, on demand for GDPR erasure), which removes its chunk rows via ON DELETE CASCADE. The HNSW index does not physically remove points the moment rows are deleted; the dead entries linger until the GC job vacuums and reindexes, which happens within 72 h of an erasure request to satisfy GDPR Art. 17 timing. Until then, the soft-delete flag and RLS keep the chunks out of query results. Audit logs of the original ingest remain (legitimate-interest exception) but never contain the chunk text itself.
Three triggers for moving off pgvector to a dedicated vector store. (1) The HNSW index no longer fits in a single Postgres instance's RAM and partitioning by tenant is no longer enough, typically above 200–500M vectors. (2) Filter selectivity becomes pathological because RLS forces post-filter on a large shared index, and the pgvector roadmap for pre-filtered HNSW has not landed yet. (3) The team needs operational features Postgres does not have: snapshot replication of just the vector tier, real-time multi-region, or scalar quantization for cost. Until then, the integration cost of a separate system is not worth the marginal performance.
Hallucination control comes in three layers: (1) the system prompt requires every claim to cite a chunk ID from the provided context, and the planner post-validates that every cited chunk was actually in the retrieval; (2) when the top reranker score falls below a tuned threshold (e.g. 0.3), the planner returns "I don't have enough information to answer" rather than calling the LLM at all; (3) an output classifier flags answers whose entities do not appear in the cited chunks for human review. None of this guarantees zero hallucination, but it makes the common ones cheap to detect and fix.
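Layer (1) is the cheapest to implement. A sketch of the post-validation, assuming the prompt asks the model to cite chunks as [chunk:<uuid>] markers (that marker format is an assumption, not shown in the prompt here):

```python
import re

CITATION = re.compile(r"\[chunk:([0-9a-f\-]{36})\]")

def validate_citations(answer_text: str, retrieved_chunk_ids: set[str]) -> tuple[bool, set[str]]:
    """Every cited chunk must have been in this query's retrieval; uncited answers also fail."""
    cited = set(CITATION.findall(answer_text))
    phantom = cited - retrieved_chunk_ids
    return bool(cited) and not phantom, phantom
```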
S3 PUT triggers an SQS message to the ingest worker. Worker fetches the PDF,
runs OCR-or-text-extract via unstructured.io or Textract, normalizes to clean
text. Splits at semantic boundaries (headings, paragraphs) targeting 512 tokens
with 64 overlap, producing roughly 400 chunks for a 200-page legal PDF. Inserts
the document row and chunk rows in one transaction (chunks get
embedding=NULL). An async embed job picks up null-embedding chunks
in batches of 256, calls the vLLM-served bge-large endpoint, and updates the
rows. Postgres' HNSW index incrementally absorbs the new vectors. Total time
from PUT to "fully searchable": about 90 seconds for a 200-page PDF, well under
the 5-minute SLO. The user sees a "ready" notification via WebSocket.
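A sketch of that splitter: greedy packing of paragraphs into ~512-token chunks with a ~64-token overlap carried between neighbours. The token counter is pluggable; the 4-characters-per-token default is a rough stand-in for the bge tokenizer, not its real behaviour:

```python
def split_into_chunks(paragraphs: list[str],
                      target_tokens: int = 512,
                      overlap_tokens: int = 64,
                      count_tokens=lambda s: len(s) // 4) -> list[str]:
    """Greedy packing at paragraph boundaries; trailing paragraphs seed the next chunk as overlap."""
    chunks: list[str] = []
    current: list[str] = []
    current_tok = 0
    for para in paragraphs:
        tok = count_tokens(para)
        if current and current_tok + tok > target_tokens:
            chunks.append("\n\n".join(current))
            # Carry trailing paragraphs forward until ~overlap_tokens of context is kept.
            carry, carried = [], 0
            for p in reversed(current):
                carry.insert(0, p)
                carried += count_tokens(p)
                if carried >= overlap_tokens:
                    break
            current, current_tok = carry, carried
        current.append(para)
        current_tok += tok
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```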