Pinecone, Qdrant, Weaviate, Milvus, Vespa — the dedicated vector database market is crowded and well-funded. pgvector is the boring answer that wins more production RAG bake-offs than the marketing in that space would suggest, because most teams already run Postgres and the operational tax of adding a second stateful system rarely pays back the recall-per-millisecond gains.
This page walks through the extension end-to-end: install, types, index choice (IVFFlat vs HNSW), distance operators, quantization, hybrid search with tsvector, multi-tenant isolation via row-level security, performance tuning, and a head-to-head against the dedicated vector DBs.
Three operational reasons, in order of how often they decide the bake-off:

- No second stateful system. The team already runs, backs up, and monitors Postgres; the vectors ride along instead of adding another service to operate.
- Transactional consistency. Chunk metadata, embeddings, and permissions commit in one transaction; there is no dual-write problem between the application database and a separate vector store.
- Metadata filters (WHERE tenant_id = ? AND published_at > ? AND access_level = 'public') are first-class SQL, not a bolt-on filter DSL. Joins to users, documents, permissions tables are free.

The case against pgvector is sharper than people admit. It's the right call when the application already runs on Postgres and the corpus fits comfortably in a single instance (typically < 100M vectors at modern dimensions).
It's the wrong call when you need 1B+ vectors with sub-10ms p99, GPU-accelerated indexing, or distributed multi-region replication of a sharded vector index. At that scale a dedicated system (Vespa, Milvus, Mosaic) earns its operational cost.
pgvector ships as a standard PostgreSQL extension. Compile from source, install via package manager, or use the prebuilt image.
# Debian/Ubuntu
sudo apt install postgresql-16-pgvector
# macOS via Homebrew Postgres
brew install pgvector
# Docker (handy for local dev)
docker run -d --name pg16 -e POSTGRES_PASSWORD=dev \
-p 5432:5432 pgvector/pgvector:pg16
Once the binary is on disk, enable the extension per database:
CREATE EXTENSION IF NOT EXISTS vector;
SELECT extversion FROM pg_extension WHERE extname = 'vector';
-- halfvec, sparsevec, and bit indexing need 0.7+; iterative index scans need 0.8+.
Four vector types are supported as of 0.8:
| Type | Storage | Use case |
|---|---|---|
| `vector(N)` | 4 bytes/dim (float32) | Default. bge-large at 1024 dims = 4 KB/row. |
| `halfvec(N)` | 2 bytes/dim (float16) | 50% storage cut with negligible recall loss for most embedding models. Drop-in replacement. |
| `bit(N)` | N/8 bytes | Binary quantization. Hamming distance via `<~>`. 32x storage cut, useful for first-stage retrieval. |
| `sparsevec(N)` | variable (index, value pairs) | Learned sparse retrieval (SPLADE, ELSER). Stores only non-zero dims. |
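Pulled together, a minimal sketch of declaring the four types side by side (the table name and dimensions are illustrative; 30522 is a BERT-sized vocabulary for the sparse column):

CREATE TABLE embedding_demo (
    id      bigint PRIMARY KEY,
    dense   vector(1024),      -- float32: 4 KB per row at 1024 dims
    dense_h halfvec(1024),     -- float16: 2 KB per row
    dense_b bit(1024),         -- binary-quantized: 128 bytes per row
    sparse  sparsevec(30522)   -- only non-zero (index, value) pairs are stored
);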
Without an index, pgvector does a brute-force sequential scan — correct but O(N) per query. Two ANN index types are available, and the choice is not subtle.
IVFFlat partitions vectors into `lists` clusters via k-means at build time. At query time it searches the `probes` nearest clusters. Faster to build, smaller on disk, but you must rebuild periodically as data shifts the cluster centroids and recall degrades.
-- Rule of thumb: lists = rows/1000 for up to 1M rows, sqrt(rows) above that.
CREATE INDEX chunks_ivfflat ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
-- At query time:
SET ivfflat.probes = 10; -- higher = more recall, slower. Tune to recall target.
HNSW builds a multi-layer navigable small-world graph. Higher recall at the same latency, slower to build, larger on disk, but zero-maintenance — no rebuild required as data evolves. This is the right default for almost every workload above a few hundred thousand rows.
CREATE INDEX chunks_hnsw ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- At query time:
SET hnsw.ef_search = 100; -- candidate list size. Default 40, raise for recall.
Parameter cheat-sheet:
- `m` (HNSW): max connections per node per layer. 16 is the default and right for most. Bump to 32 for very high recall corpora; cuts insert speed and roughly doubles index size.
- `ef_construction` (HNSW): how thoroughly to search during build. 64 is fine for prototypes; 200 for production. Higher means slower builds, better graph quality.
- `ef_search` (HNSW): runtime candidate list size. Tune this per query, not at index build (see the sketch below). 100 for general use; raise to 400 if you need recall above 99%.
- `lists` (IVFFlat): number of inverted-list clusters. rows/1000 up to 1M rows, sqrt(rows) above that.
- `probes` (IVFFlat): clusters to scan per query. Start with sqrt(lists).

When IVFFlat actually wins. Two cases: index build time matters more than recall (frequent re-embed of the whole corpus), or you're memory-bound and HNSW won't fit. Otherwise pick HNSW.
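A minimal sketch of the per-query ef_search tuning mentioned above: SET LOCAL scopes the setting to the current transaction, so one high-recall query doesn't change behavior for anything else sharing the connection pool (400 is an illustrative high-recall target).

BEGIN;
SET LOCAL hnsw.ef_search = 400;  -- applies to this transaction only
SELECT chunk_id
FROM chunks
ORDER BY embedding <=> $1
LIMIT 10;
COMMIT;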
pgvector exposes distance as binary operators so the planner can use the index. The operator you pick at query time must match the operator class declared at index build, or the planner falls back to a sequential scan.
| Operator | Distance | Index opclass | When to use |
|---|---|---|---|
| `<->` | L2 (Euclidean) | `vector_l2_ops` | Image embeddings; CLIP. Magnitude matters. |
| `<=>` | Cosine | `vector_cosine_ops` | Default for text embeddings. Magnitude-invariant. |
| `<#>` | Negative inner product | `vector_ip_ops` | Equivalent to cosine on normalized vectors, slightly faster. |
| `<+>` | L1 (Manhattan) | `vector_l1_ops` | Rare; some recommendation systems. |
| `<~>` | Hamming (bit) | `bit_hamming_ops` | Binary-quantized first-stage retrieval. |
| `<%>` | Jaccard (bit) | `bit_jaccard_ops` | Set-similarity over bitmaps. |
-- Cosine is the right default for text embeddings.
SELECT chunk_id, text, 1 - (embedding <=> $1) AS similarity
FROM chunks
ORDER BY embedding <=> $1
LIMIT 10;
-- The "1 - distance" trick converts cosine distance back to similarity for the
-- application. Index uses the distance form because that's what's monotonic.
Trap: cosine distance = 1 - cosine_similarity. Smaller is better for ranking. The ORDER BY ... LIMIT shape is what triggers the HNSW index scan; WHERE distance < threshold is harder to push into the index and often sequential-scans.
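If the application needs a similarity cutoff anyway, one pattern is to keep the index-friendly ORDER BY ... LIMIT shape and apply the threshold to the candidate set afterwards. A sketch; the over-fetch size and the 0.35 cutoff are illustrative:

WITH candidates AS (
    SELECT chunk_id, text, embedding <=> $1 AS dist
    FROM chunks
    ORDER BY embedding <=> $1
    LIMIT 50                      -- over-fetch via the HNSW index
)
SELECT chunk_id, text, 1 - dist AS similarity
FROM candidates
WHERE dist < 0.35                 -- cutoff applied after the index scan
ORDER BY dist;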
At 40M chunks × 1024 dim × 4 bytes you're looking at 164 GB of raw vectors plus index overhead. Two cheap quantization tricks cut that.
Half-precision. Cast vector(1536) to halfvec(1536) for 50% storage and roughly identical recall on modern embedding models trained in fp16/bf16 anyway.
ALTER TABLE chunks
ALTER COLUMN embedding TYPE halfvec(1024)
USING embedding::halfvec(1024);
CREATE INDEX chunks_hnsw_h ON chunks
USING hnsw (embedding halfvec_cosine_ops)
WITH (m = 16, ef_construction = 200);
OpenAI's text-embedding-3-large reduced to 1536 dims drops from 6 KB/row to 3 KB/row. Recall drop measured in BEIR is consistently < 0.5 points. There is essentially no reason to use full vector for new builds.
Binary quantization compresses each dim to one bit (sign of the float). 32x storage cut, lossy but useful as a coarse first-stage filter. Two-stage retrieval: search bit index for top-1000, re-rank by full-precision cosine.
ALTER TABLE chunks ADD COLUMN embedding_b bit(1024);
UPDATE chunks SET embedding_b = binary_quantize(embedding::vector)::bit(1024);
-- binary_quantize() (pgvector 0.7+) maps each dimension to 1 when >= 0, else 0;
-- the ::vector cast also covers a halfvec column.
CREATE INDEX chunks_hnsw_b ON chunks
USING hnsw (embedding_b bit_hamming_ops);
-- Two-stage query
WITH coarse AS (
SELECT chunk_id FROM chunks
ORDER BY embedding_b <~> $1::bit(1024)
LIMIT 1000
)
SELECT c.chunk_id, c.text, c.embedding <=> $2 AS dist
FROM chunks c JOIN coarse USING (chunk_id)
ORDER BY dist LIMIT 10;
Sparse vectors store learned-sparse-retrieval output (SPLADE, ELSER) without paying for the zeros. Useful if you're running SPLADE alongside or instead of dense.
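A minimal sketch of what that looks like, assuming a BERT-sized vocabulary of 30522 dims; pgvector's sparsevec literal is '{index:value,...}/dims', with 1-based indices and only the non-zero entries listed:

ALTER TABLE chunks ADD COLUMN splade sparsevec(30522);

CREATE INDEX chunks_splade ON chunks
USING hnsw (splade sparsevec_ip_ops);                   -- inner product suits SPLADE term weights

SELECT chunk_id
FROM chunks
ORDER BY splade <#> '{1:0.4,271:1.2,9023:0.8}/30522'    -- <#> = negative inner product
LIMIT 10;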
Postgres has had full-text search since 2008. tsvector + GIN index gives you BM25-style lexical retrieval in the same query, the same transaction, the same row-level-security policy as your dense vectors.
ALTER TABLE chunks
ADD COLUMN text_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', text)) STORED;
CREATE INDEX chunks_text_tsv ON chunks USING gin (text_tsv);
Run dense and lexical in parallel as CTEs, fuse with Reciprocal Rank Fusion, all in a single round-trip:
WITH dense AS (
SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rnk
FROM chunks
WHERE tenant_id = current_setting('app.tenant_id')::uuid
ORDER BY embedding <=> $1
LIMIT 50
),
lexical AS (
SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY ts_rank(text_tsv, q) DESC) AS rnk
FROM chunks, plainto_tsquery('english', $2) AS q
WHERE tenant_id = current_setting('app.tenant_id')::uuid
AND text_tsv @@ q
ORDER BY ts_rank(text_tsv, q) DESC
LIMIT 50
),
fused AS (
SELECT chunk_id, SUM(1.0 / (60 + rnk)) AS rrf_score
FROM (SELECT * FROM dense UNION ALL SELECT * FROM lexical) r
GROUP BY chunk_id
ORDER BY rrf_score DESC
LIMIT 20
)
SELECT c.chunk_id, c.text, f.rrf_score
FROM fused f JOIN chunks c USING (chunk_id)
ORDER BY f.rrf_score DESC;
The k = 60 is the standard RRF constant. Both branches share the RLS predicate. Both indexes are used. One round-trip from the application.
Note on Postgres FTS vs real BM25: ts_rank is not BM25, it's a simpler tf-idf variant. For most RAG it's good enough; the dense branch carries the semantic signal and the lexical branch only needs to catch identifiers and rare tokens. If you need true BM25, the pg_search extension (ParadeDB) wraps Tantivy.
RLS is the single feature that pushes pgvector ahead of dedicated vector DBs for SaaS RAG. The database enforces the tenant filter; the application cannot forget it.
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
USING (tenant_id = current_setting('app.tenant_id', true)::uuid);
-- Per-connection (or per-transaction) tenant pin from the application layer:
SET app.tenant_id = '550e8400-e29b-41d4-a716-446655440000';
-- Now every query against chunks is automatically scoped:
SELECT chunk_id FROM chunks ORDER BY embedding <=> $1 LIMIT 10;
-- ^ The planner adds the policy as an implicit WHERE; even if the application
-- forgets to filter, no other tenant's chunks can leak.
Index implication. The HNSW index does not know about RLS — it returns nearest neighbors, then Postgres filters. For very selective tenants (one tenant holds 0.1% of rows) you may get back fewer than 10 results after filtering. Mitigations:
- Raise `hnsw.ef_search` for that query (200–400 typical). On pgvector 0.8+, iterative index scans (`hnsw.iterative_scan`) keep scanning until enough rows survive the filter.
- Partial index for a very large tenant: `CREATE INDEX ... WHERE tenant_id = '...'`.
- Partition the table by `tenant_id` hash; each partition gets its own HNSW.

HNSW build is memory-bound and embarrassingly parallel. Get both right or pay for it.
-- Per-session, before CREATE INDEX:
SET maintenance_work_mem = '8GB'; -- HNSW build buffer; must fit graph during construction
SET max_parallel_maintenance_workers = 7; -- parallel index-build workers; leave headroom for query traffic
-- Per-session, runtime:
SET work_mem = '64MB'; -- per-sort, per-hash; not per-query
SET hnsw.ef_search = 100; -- raise for recall, lower for latency
Server-level (postgresql.conf) on a db.r6i.4xlarge (16 vCPU, 128 GB RAM) holding ~50M halfvec(1024):
shared_buffers = 32GB                    # 25% of RAM
effective_cache_size = 96GB              # 75% of RAM
maintenance_work_mem = 8GB
max_parallel_workers = 16
max_parallel_workers_per_gather = 4
max_parallel_maintenance_workers = 7
random_page_cost = 1.1                   # NVMe: random I/O ~ sequential
checkpoint_timeout = 15min
max_wal_size = 16GB
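A quick sanity check that the graph actually fits the cache budget above: compare the index size against shared_buffers / effective_cache_size (index and table names as created earlier).

SELECT pg_size_pretty(pg_relation_size('chunks_hnsw'))  AS hnsw_size,
       pg_size_pretty(pg_total_relation_size('chunks')) AS table_with_indexes;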
Partitioning by tenant. Above ~50M chunks, hash-partition chunks by tenant_id with 32 partitions. Each partition has its own HNSW; query planner prunes to the right partition based on the RLS predicate. Result: 32 smaller, faster indexes instead of one giant graph.
CREATE TABLE chunks (
chunk_id UUID,
tenant_id UUID NOT NULL,
text TEXT,
embedding halfvec(1024),
text_tsv TSVECTOR,
PRIMARY KEY (tenant_id, chunk_id)
) PARTITION BY HASH (tenant_id);
DO $$
BEGIN
FOR i IN 0..31 LOOP
EXECUTE format(
'CREATE TABLE chunks_p%s PARTITION OF chunks FOR VALUES WITH (modulus 32, remainder %s)',
i, i);
END LOOP;
END $$;
Build the HNSW after the bulk load, never before. Inserting into an existing HNSW is O(log N) per row but the constant factor is enormous. Bulk-load with the index dropped, then CREATE INDEX at the end with parallel workers.
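A sketch of that flow against the partitioned table above; the CSV path and worker count are illustrative, and server-side COPY needs pg_read_server_files (use \copy from psql otherwise):

DROP INDEX IF EXISTS chunks_hnsw;

COPY chunks (chunk_id, tenant_id, text, embedding)
FROM '/tmp/chunks.csv' WITH (FORMAT csv);

SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 7;

CREATE INDEX chunks_hnsw ON chunks
USING hnsw (embedding halfvec_cosine_ops)
WITH (m = 16, ef_construction = 200);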
| System | Best at | Worst at |
|---|---|---|
| pgvector | Operational simplicity, transactional consistency with metadata, RLS, hybrid in one query, < 100M vectors. | 1B+ vector scale; GPU-accelerated indexing; sharding without Citus glue. |
| pgvecto.rs (now VectorChord) | Rust rewrite of the index layer in Postgres; significantly faster build and search than pgvector at > 10M vectors. Drop-in replacement. | Smaller community; fewer managed offerings; some Postgres extensions don't compose with it. |
| Pinecone | Fully managed, serverless billing, pod auto-scaling, multi-region replication. Zero ops. | Per-vector pricing; metadata filter DSL is limited vs SQL; data lock-in; egress fees. |
| Qdrant | Open-source, fast HNSW with payload filtering at graph traversal (not after), gRPC API, on-disk indexes for cost-tight large corpora. | Yet another stateful service to operate; metadata-vector consistency is your job; weaker SQL story. |
| Weaviate | Built-in modules for embedding, hybrid (BM25 + dense), GraphQL API, multi-tenancy as a first-class concept. | Heavier resource footprint; opinionated schema; less flexible than raw Postgres for joins. |
| Milvus / Zilliz | Billion-vector scale; GPU index build; rich index types (IVF_PQ, DiskANN, ScaNN). Cloud-native architecture. | Operational complexity (etcd, Pulsar, MinIO, multiple stateful components); overkill below 100M vectors. |
| Vespa | Yahoo-scale combined ranking (vector + BM25 + ML rank features) in one query. Production at billion-doc scale for two decades. | Steep learning curve; YAML config sprawl; small community vs the others. |
| Mosaic AI Vector Search (Databricks) | Auto-syncs from Delta tables; managed within the Databricks platform; Unity Catalog governance. | Locked to Databricks; cost scales with the platform; less control over index parameters. |
Decision rule: start with pgvector. Move only when you've measured a real bottleneck (HNSW won't fit RAM, build time exceeds maintenance windows, RLS-filtered recall is unacceptable even with partitioning) and you've costed the operational tax of a second stateful system.
Full minimal stack: schema, RLS, indexes, hybrid query, Python client embedding with sentence-transformers.
-- schema.sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
doc_id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
title TEXT,
uri TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE chunks (
chunk_id UUID PRIMARY KEY,
doc_id UUID NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
tenant_id UUID NOT NULL,
ordinal INT NOT NULL,
text TEXT NOT NULL,
text_tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', text)) STORED,
embedding halfvec(1024)
);
CREATE INDEX chunks_hnsw ON chunks USING hnsw (embedding halfvec_cosine_ops)
WITH (m = 16, ef_construction = 200);
CREATE INDEX chunks_text_tsv ON chunks USING gin (text_tsv);
CREATE INDEX chunks_tenant ON chunks (tenant_id);
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
USING (tenant_id = current_setting('app.tenant_id', true)::uuid);
pip install psycopg[binary] sentence-transformers numpy
"""
End-to-end hybrid search against pgvector.
Embed with bge-large-en-v1.5, store as halfvec(1024), query via RRF in SQL.
"""
import uuid
import psycopg
import numpy as np
from sentence_transformers import SentenceTransformer
DSN = "postgresql://app:secret@localhost:5432/rag"
TENANT = "550e8400-e29b-41d4-a716-446655440000"
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
def to_halfvec_literal(v: np.ndarray) -> str:
    """Format a float vector as the pgvector text literal: '[0.12,-0.04,...]'."""
    return "[" + ",".join(f"{x:.6f}" for x in v.astype(np.float32)) + "]"
def upsert_chunk(conn, doc_id, ordinal, text):
    vec = embedder.encode(text, normalize_embeddings=True)
    conn.execute(
        """
        INSERT INTO chunks (chunk_id, doc_id, tenant_id, ordinal, text, embedding)
        VALUES (%s, %s, %s, %s, %s, %s::halfvec)
        """,
        (uuid.uuid4(), doc_id, TENANT, ordinal, text, to_halfvec_literal(vec)),
    )
def hybrid_search(conn, query: str, k: int = 10):
    qvec = embedder.encode(query, normalize_embeddings=True)
    sql = """
WITH dense AS (
SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY embedding <=> %(qvec)s::halfvec) rnk
FROM chunks
ORDER BY embedding <=> %(qvec)s::halfvec
LIMIT 50
),
lexical AS (
SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY ts_rank(text_tsv, q) DESC) rnk
FROM chunks, plainto_tsquery('english', %(query)s) q
WHERE text_tsv @@ q
ORDER BY ts_rank(text_tsv, q) DESC
LIMIT 50
),
fused AS (
SELECT chunk_id, SUM(1.0 / (60 + rnk)) AS score
FROM (SELECT * FROM dense UNION ALL SELECT * FROM lexical) r
GROUP BY chunk_id
ORDER BY score DESC
LIMIT %(k)s
)
SELECT c.chunk_id, c.text, f.score
FROM fused f JOIN chunks c USING (chunk_id)
ORDER BY f.score DESC;
"""
    with conn.cursor() as cur:
        # Pin the tenant for this transaction only. set_config(..., true) behaves
        # like SET LOCAL but accepts a bind parameter, and psycopg's extended
        # protocol can't run SET plus the query in a single execute() anyway.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (TENANT,))
        cur.execute(sql, {
            "qvec": to_halfvec_literal(qvec),
            "query": query,
            "k": k,
        })
        return cur.fetchall()
with psycopg.connect(DSN, autocommit=False) as conn:
    for cid, text, score in hybrid_search(conn, "how do I cancel my subscription?"):
        print(f"{score:.4f} {text[:80]}")
Production notes:
- Set app.tenant_id with SET LOCAL (or set_config(..., true), as in the client above) per transaction so the setting doesn't leak across pooled connections.
- Batch the embedding calls with encode(..., batch_size=64); one row at a time wastes the GPU.
- For bulk loads, drop the HNSW index, COPY the data, then recreate the index with maintenance_work_mem = 8GB and parallel workers.
- Watch pg_stat_user_indexes to confirm the HNSW is actually used (idx_scan > 0). If it's zero, the planner chose a sequential scan — usually a type mismatch or a non-monotonic predicate.

pgvector is the right call when the application already runs on Postgres and the corpus fits comfortably in a single instance (typically < 100M vectors at modern dims). It eliminates the dual-write consistency problem — chunk metadata, embeddings, permissions, and RLS all live in one transaction. You also keep SQL joins to users, documents, permissions. Pinecone wins when you want zero ops and serverless billing; Qdrant wins when you need filter-during-traversal at very high selectivity. For most enterprise SaaS RAG, pgvector wins on operational tax.
HNSW is the default. Higher recall at equivalent latency, no rebuild required as data drifts, more predictable behavior. IVFFlat wins only when index build time matters more than recall (frequent full re-embed), or when memory is tight enough that the HNSW graph won't fit. Tradeoffs: HNSW takes 2–5x longer to build and roughly 1.5x the disk space, but you tune ef_search per query at runtime instead of locking probes at build time. The m parameter controls graph degree (16 default), ef_construction the build quality (200 for production).
Two CTEs — one ordering by embedding <=> query_vec for dense, one by ts_rank(text_tsv, plainto_tsquery(...)) for lexical — combined via Reciprocal Rank Fusion. RRF assigns each candidate 1 / (60 + rank) in each list and sums, so you don't need to calibrate cosine scores against tf-idf magnitudes. The whole thing runs in one round-trip, both branches share the same row-level-security predicate, both indexes are used (HNSW for dense, GIN on tsvector for lexical). For true BM25 instead of Postgres tf-idf, the pg_search extension wraps Tantivy.
The database enforces tenant isolation; the application cannot forget it. You declare a policy USING (tenant_id = current_setting('app.tenant_id')::uuid) on the chunks table, set the session variable per request, and every SELECT/UPDATE/DELETE is automatically scoped. Even a query with no WHERE tenant_id clause cannot leak across tenants. Tradeoff: HNSW does not know about RLS, so it returns nearest neighbors then Postgres filters. For very selective tenants you may need to raise hnsw.ef_search or use partial indexes / hash partitioning per tenant to recover recall.
Full-precision vector is almost never worth it for new builds. Modern embedding models (bge, e5, OpenAI text-embedding-3) are trained in fp16/bf16 anyway; the float32 storage is mostly carrying noise. Empirical recall drop on BEIR moving from vector(1536) to halfvec(1536) is consistently < 0.5 points, while storage halves and HNSW build/search both speed up. Use full vector only if your embedding model genuinely produces high-magnitude floats outside fp16's range (rare) or you're doing exact reproducibility against an existing fp32 reference index.
When queries slow down at scale, there are three levers, in order. First, halfvec quantization — halves storage, halves cache pressure, recall stays. Second, hash-partition by tenant_id so each HNSW is smaller and the planner prunes; on a 16-vCPU box, 32 partitions is a good starting point. Third, drop and rebuild the HNSW after large bulk loads (insert-into-HNSW is O(log N) but the constant is brutal); set maintenance_work_mem = 8GB and parallel workers to soak the CPUs. Past ~100M halfvecs you've outgrown a single instance: either move to VectorChord (pgvecto.rs) which has a faster index implementation, or shard with Citus by tenant_id (sketched below). Don't shard by random hash; cross-shard ANN loses recall.
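A sketch of the Citus option, assuming the extension is installed; distributing by tenant_id keeps each tenant's rows (and its HNSW entries) on a single shard, so ANN queries never fan out:

CREATE EXTENSION IF NOT EXISTS citus;
SELECT create_distributed_table('chunks', 'tenant_id');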