Pinecone, Qdrant, Weaviate, Milvus, Vespa — the dedicated vector database market is crowded and well-funded. pgvector is the boring answer that wins more production RAG bake-offs than the marketing in that space would suggest, because most teams already run Postgres and the operational tax of adding a second stateful system rarely pays back the recall-per-millisecond gains.
This page walks through the extension end-to-end: install, types, index choice (IVFFlat vs HNSW), distance operators, quantization, hybrid search with tsvector, multi-tenant isolation via row-level security, performance tuning, and a head-to-head against the dedicated vector DBs.
Three operational reasons, in order of how often they decide the bake-off:

- No second stateful system. The team already runs, backs up, and monitors Postgres; the vectors ride along instead of adding another service to operate.
- Transactional consistency. Chunk metadata, embeddings, and permissions commit in one transaction; there is no dual-write problem between the application database and a separate vector store.
- Metadata filters (WHERE tenant_id = ? AND published_at > ? AND access_level = 'public') are first-class SQL, not a bolt-on filter DSL. Joins to users, documents, permissions tables are free.

The case against pgvector is sharper than people admit. It's the right call when the application already runs on Postgres and the corpus fits comfortably in a single instance (typically < 100M vectors at modern dimensions).
It's the wrong call when you need 1B+ vectors with sub-10ms p99, GPU-accelerated indexing, or distributed multi-region replication of a sharded vector index. At that scale a dedicated system (Vespa, Milvus, Mosaic) earns its operational cost.
pgvector ships as a standard PostgreSQL extension. Compile from source, install via package manager, or use the prebuilt image.
# Debian/Ubuntu
sudo apt install postgresql-16-pgvector
# macOS via Homebrew Postgres
brew install pgvector
# Docker (handy for local dev)
docker run -d --name pg16 -e POSTGRES_PASSWORD=dev \
-p 5432:5432 pgvector/pgvector:pg16
Once the binary is on disk, enable the extension per database:
CREATE EXTENSION IF NOT EXISTS vector;
SELECT extversion FROM pg_extension WHERE extname = 'vector';
-- halfvec, sparsevec, and bit indexing need 0.7+; iterative index scans need 0.8+.
Four vector types are supported as of 0.8:
| Type | Storage | Use case |
|---|---|---|
| `vector(N)` | 4 bytes/dim (float32) | Default. bge-large at 1024 dims = 4 KB/row. |
| `halfvec(N)` | 2 bytes/dim (float16) | 50% storage cut with negligible recall loss for most embedding models. Drop-in replacement. |
| `bit(N)` | N/8 bytes | Binary quantization. Hamming distance via `<~>`. 32x storage cut, useful for first-stage retrieval. |
| `sparsevec(N)` | variable (index, value pairs) | Learned sparse retrieval (SPLADE, ELSER). Stores only non-zero dims. |
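Pulled together, a minimal sketch of declaring the four types side by side (the table name and dimensions are illustrative; 30522 is a BERT-sized vocabulary for the sparse column):

CREATE TABLE embedding_demo (
    id      bigint PRIMARY KEY,
    dense   vector(1024),      -- float32: 4 KB per row at 1024 dims
    dense_h halfvec(1024),     -- float16: 2 KB per row
    dense_b bit(1024),         -- binary-quantized: 128 bytes per row
    sparse  sparsevec(30522)   -- only non-zero (index, value) pairs are stored
);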
Without an index, pgvector does a brute-force sequential scan — correct but O(N) per query. Two ANN index types are available, and the choice is not subtle.
IVFFlat partitions vectors into `lists` clusters via k-means at build time. At query time it searches the `probes` nearest clusters. Faster to build, smaller on disk, but you must rebuild periodically as data shifts the cluster centroids and recall degrades.
-- Rule of thumb: lists = rows/1000 for up to 1M rows, sqrt(rows) above that.
CREATE INDEX chunks_ivfflat ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);
-- At query time:
SET ivfflat.probes = 10; -- higher = more recall, slower. Tune to recall target.
HNSW builds a multi-layer navigable small-world graph. Higher recall at the same latency, slower to build, larger on disk, but zero-maintenance — no rebuild required as data evolves. This is the right default for almost every workload above a few hundred thousand rows.
CREATE INDEX chunks_hnsw ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- At query time:
SET hnsw.ef_search = 100; -- candidate list size. Default 40, raise for recall.
Parameter cheat-sheet:
- `m` (HNSW): max connections per node per layer. 16 is the default and right for most. Bump to 32 for very high recall corpora; cuts insert speed and roughly doubles index size.
- `ef_construction` (HNSW): how thoroughly to search during build. 64 is fine for prototypes; 200 for production. Higher means slower builds, better graph quality.
- `ef_search` (HNSW): runtime candidate list size. Tune this per query, not at index build (see the sketch below). 100 for general use; raise to 400 if you need recall above 99%.
- `lists` (IVFFlat): number of inverted-list clusters. rows/1000 up to 1M rows, sqrt(rows) above that.
- `probes` (IVFFlat): clusters to scan per query. Start with sqrt(lists).

When IVFFlat actually wins. Two cases: index build time matters more than recall (frequent re-embed of the whole corpus), or you're memory-bound and HNSW won't fit. Otherwise pick HNSW.
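A minimal sketch of the per-query ef_search tuning mentioned above: SET LOCAL scopes the setting to the current transaction, so one high-recall query doesn't change behavior for anything else sharing the connection pool (400 is an illustrative high-recall target).

BEGIN;
SET LOCAL hnsw.ef_search = 400;  -- applies to this transaction only
SELECT chunk_id
FROM chunks
ORDER BY embedding <=> $1
LIMIT 10;
COMMIT;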
pgvector exposes distance as binary operators so the planner can use the index. The operator you pick at query time must match the operator class declared at index build, or the planner falls back to a sequential scan.
| Operator | Distance | Index opclass | When to use |
|---|---|---|---|
| `<->` | L2 (Euclidean) | `vector_l2_ops` | Image embeddings; CLIP. Magnitude matters. |
| `<=>` | Cosine | `vector_cosine_ops` | Default for text embeddings. Magnitude-invariant. |
| `<#>` | Negative inner product | `vector_ip_ops` | Equivalent to cosine on normalized vectors, slightly faster. |
| `<+>` | L1 (Manhattan) | `vector_l1_ops` | Rare; some recommendation systems. |
| `<~>` | Hamming (bit) | `bit_hamming_ops` | Binary-quantized first-stage retrieval. |
| `<%>` | Jaccard (bit) | `bit_jaccard_ops` | Set-similarity over bitmaps. |
-- Cosine is the right default for text embeddings.
SELECT chunk_id, text, 1 - (embedding <=> $1) AS similarity
FROM chunks
ORDER BY embedding <=> $1
LIMIT 10;
-- The "1 - distance" trick converts cosine distance back to similarity for the
-- application. Index uses the distance form because that's what's monotonic.
Trap: cosine distance = 1 - cosine_similarity. Smaller is better for ranking. The ORDER BY ... LIMIT shape is what triggers the HNSW index scan; WHERE distance < threshold is harder to push into the index and often sequential-scans.
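If the application needs a similarity cutoff anyway, one pattern is to keep the index-friendly ORDER BY ... LIMIT shape and apply the threshold to the candidate set afterwards. A sketch; the over-fetch size and the 0.35 cutoff are illustrative:

WITH candidates AS (
    SELECT chunk_id, text, embedding <=> $1 AS dist
    FROM chunks
    ORDER BY embedding <=> $1
    LIMIT 50                      -- over-fetch via the HNSW index
)
SELECT chunk_id, text, 1 - dist AS similarity
FROM candidates
WHERE dist < 0.35                 -- cutoff applied after the index scan
ORDER BY dist;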
At 40M chunks × 1024 dim × 4 bytes you're looking at 164 GB of raw vectors plus index overhead. Two cheap quantization tricks cut that.
Half-precision. Cast vector(1536) to halfvec(1536) for 50% storage and roughly identical recall on modern embedding models trained in fp16/bf16 anyway.
ALTER TABLE chunks
ALTER COLUMN embedding TYPE halfvec(1024)
USING embedding::halfvec(1024);
CREATE INDEX chunks_hnsw_h ON chunks
USING hnsw (embedding halfvec_cosine_ops)
WITH (m = 16, ef_construction = 200);
OpenAI's text-embedding-3-large reduced to 1536 dims drops from 6 KB/row to 3 KB/row. Recall drop measured in BEIR is consistently < 0.5 points. There is essentially no reason to use full vector for new builds.
Binary quantization compresses each dim to one bit (sign of the float). 32x storage cut, lossy but useful as a coarse first-stage filter. Two-stage retrieval: search bit index for top-1000, re-rank by full-precision cosine.
ALTER TABLE chunks ADD COLUMN embedding_b bit(1024);
UPDATE chunks SET embedding_b = binary_quantize(embedding::vector)::bit(1024);
-- binary_quantize() (pgvector 0.7+) maps each dimension to 1 when >= 0, else 0;
-- the ::vector cast also covers a halfvec column.
CREATE INDEX chunks_hnsw_b ON chunks
USING hnsw (embedding_b bit_hamming_ops);
-- Two-stage query
WITH coarse AS (
SELECT chunk_id FROM chunks
ORDER BY embedding_b <~> $1::bit(1024)
LIMIT 1000
)
SELECT c.chunk_id, c.text, c.embedding <=> $2 AS dist
FROM chunks c JOIN coarse USING (chunk_id)
ORDER BY dist LIMIT 10;
Sparse vectors store learned-sparse-retrieval output (SPLADE, ELSER) without paying for the zeros. Useful if you're running SPLADE alongside or instead of dense.
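A minimal sketch of what that looks like, assuming a BERT-sized vocabulary of 30522 dims; pgvector's sparsevec literal is '{index:value,...}/dims', with 1-based indices and only the non-zero entries listed:

ALTER TABLE chunks ADD COLUMN splade sparsevec(30522);

CREATE INDEX chunks_splade ON chunks
USING hnsw (splade sparsevec_ip_ops);                   -- inner product suits SPLADE term weights

SELECT chunk_id
FROM chunks
ORDER BY splade <#> '{1:0.4,271:1.2,9023:0.8}/30522'    -- <#> = negative inner product
LIMIT 10;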
Postgres has had full-text search since 2008. tsvector + GIN index gives you BM25-style lexical retrieval in the same query, the same transaction, the same row-level-security policy as your dense vectors.
ALTER TABLE chunks
ADD COLUMN text_tsv tsvector
GENERATED ALWAYS AS (to_tsvector('english', text)) STORED;
CREATE INDEX chunks_text_tsv ON chunks USING gin (text_tsv);
Run dense and lexical in parallel as CTEs, fuse with Reciprocal Rank Fusion, all in a single round-trip:
WITH dense AS (
SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rnk
FROM chunks
WHERE tenant_id = current_setting('app.tenant_id')::uuid
ORDER BY embedding <=> $1
LIMIT 50
),
lexical AS (
SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY ts_rank(text_tsv, q) DESC) AS rnk
FROM chunks, plainto_tsquery('english', $2) AS q
WHERE tenant_id = current_setting('app.tenant_id')::uuid
AND text_tsv @@ q
ORDER BY ts_rank(text_tsv, q) DESC
LIMIT 50
),
fused AS (
SELECT chunk_id, SUM(1.0 / (60 + rnk)) AS rrf_score
FROM (SELECT * FROM dense UNION ALL SELECT * FROM lexical) r
GROUP BY chunk_id
ORDER BY rrf_score DESC
LIMIT 20
)
SELECT c.chunk_id, c.text, f.rrf_score
FROM fused f JOIN chunks c USING (chunk_id)
ORDER BY f.rrf_score DESC;
The k = 60 is the standard RRF constant. Both branches share the RLS predicate. Both indexes are used. One round-trip from the application.
Note on Postgres FTS vs real BM25: ts_rank is not BM25, it's a simpler tf-idf variant. For most RAG it's good enough; the dense branch carries the semantic signal and the lexical branch only needs to catch identifiers and rare tokens. If you need true BM25, the pg_search extension (ParadeDB) wraps Tantivy.
RLS is the single feature that pushes pgvector ahead of dedicated vector DBs for SaaS RAG. The database enforces the tenant filter; the application cannot forget it.
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
USING (tenant_id = current_setting('app.tenant_id', true)::uuid);
-- Per-connection (or per-transaction) tenant pin from the application layer:
SET app.tenant_id = '550e8400-e29b-41d4-a716-446655440000';
-- Now every query against chunks is automatically scoped:
SELECT chunk_id FROM chunks ORDER BY embedding <=> $1 LIMIT 10;
-- ^ The planner adds the policy as an implicit WHERE; even if the application
-- forgets to filter, no other tenant's chunks can leak.
Index implication. The HNSW index does not know about RLS — it returns nearest neighbors, then Postgres filters. For very selective tenants (one tenant holds 0.1% of rows) you may get back fewer than 10 results after filtering. Mitigations:
- Raise `hnsw.ef_search` for that query (200–400 typical). On pgvector 0.8+, iterative index scans (`hnsw.iterative_scan`) keep scanning until enough rows survive the filter.
- Partial index for a very large tenant: `CREATE INDEX ... WHERE tenant_id = '...'`.
- Partition the table by `tenant_id` hash; each partition gets its own HNSW.

HNSW build is memory-bound and embarrassingly parallel. Get both right or pay for it.
-- Per-session, before CREATE INDEX:
SET maintenance_work_mem = '8GB'; -- HNSW build buffer; must fit graph during construction
SET max_parallel_maintenance_workers = 7; -- parallel index-build workers; leave headroom for query traffic
-- Per-session, runtime:
SET work_mem = '64MB'; -- per-sort, per-hash; not per-query
SET hnsw.ef_search = 100; -- raise for recall, lower for latency
Server-level (postgresql.conf) on a db.r6i.4xlarge (16 vCPU, 128 GB RAM) holding ~50M halfvec(1024):
shared_buffers = 32GB                    # 25% of RAM
effective_cache_size = 96GB              # 75% of RAM
maintenance_work_mem = 8GB
max_parallel_workers = 16
max_parallel_workers_per_gather = 4
max_parallel_maintenance_workers = 7
random_page_cost = 1.1                   # NVMe: random I/O ~ sequential
checkpoint_timeout = 15min
max_wal_size = 16GB
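A quick sanity check that the graph actually fits the cache budget above: compare the index size against shared_buffers / effective_cache_size (index and table names as created earlier).

SELECT pg_size_pretty(pg_relation_size('chunks_hnsw'))  AS hnsw_size,
       pg_size_pretty(pg_total_relation_size('chunks')) AS table_with_indexes;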
Partitioning by tenant. Above ~50M chunks, hash-partition chunks by tenant_id with 32 partitions. Each partition has its own HNSW; query planner prunes to the right partition based on the RLS predicate. Result: 32 smaller, faster indexes instead of one giant graph.
CREATE TABLE chunks (
chunk_id UUID,
tenant_id UUID NOT NULL,
text TEXT,
embedding halfvec(1024),
text_tsv TSVECTOR,
PRIMARY KEY (tenant_id, chunk_id)
) PARTITION BY HASH (tenant_id);
DO $$
BEGIN
FOR i IN 0..31 LOOP
EXECUTE format(
'CREATE TABLE chunks_p%s PARTITION OF chunks FOR VALUES WITH (modulus 32, remainder %s)',
i, i);
END LOOP;
END $$;
Build the HNSW after the bulk load, never before. Inserting into an existing HNSW is O(log N) per row but the constant factor is enormous. Bulk-load with the index dropped, then CREATE INDEX at the end with parallel workers.
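A sketch of that flow against the partitioned table above; the CSV path and worker count are illustrative, and server-side COPY needs pg_read_server_files (use \copy from psql otherwise):

DROP INDEX IF EXISTS chunks_hnsw;

COPY chunks (chunk_id, tenant_id, text, embedding)
FROM '/tmp/chunks.csv' WITH (FORMAT csv);

SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 7;

CREATE INDEX chunks_hnsw ON chunks
USING hnsw (embedding halfvec_cosine_ops)
WITH (m = 16, ef_construction = 200);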
| System | Best at | Worst at |
|---|---|---|
| pgvector | Operational simplicity, transactional consistency with metadata, RLS, hybrid in one query, < 100M vectors. | 1B+ vector scale; GPU-accelerated indexing; sharding without Citus glue. |
| pgvecto.rs (now VectorChord) | Rust rewrite of the index layer in Postgres; significantly faster build and search than pgvector at > 10M vectors. Drop-in replacement. | Smaller community; fewer managed offerings; some Postgres extensions don't compose with it. |
| Pinecone | Fully managed, serverless billing, pod auto-scaling, multi-region replication. Zero ops. | Per-vector pricing; metadata filter DSL is limited vs SQL; data lock-in; egress fees. |
| Qdrant | Open-source, fast HNSW with payload filtering at graph traversal (not after), gRPC API, on-disk indexes for cost-tight large corpora. | Yet another stateful service to operate; metadata-vector consistency is your job; weaker SQL story. |
| Weaviate | Built-in modules for embedding, hybrid (BM25 + dense), GraphQL API, multi-tenancy as a first-class concept. | Heavier resource footprint; opinionated schema; less flexible than raw Postgres for joins. |
| Milvus / Zilliz | Billion-vector scale; GPU index build; rich index types (IVF_PQ, DiskANN, ScaNN). Cloud-native architecture. | Operational complexity (etcd, Pulsar, MinIO, multiple stateful components); overkill below 100M vectors. |
| Vespa | Yahoo-scale combined ranking (vector + BM25 + ML rank features) in one query. Production at billion-doc scale for two decades. | Steep learning curve; YAML config sprawl; small community vs the others. |
| Mosaic AI Vector Search (Databricks) | Auto-syncs from Delta tables; managed within the Databricks platform; Unity Catalog governance. | Locked to Databricks; cost scales with the platform; less control over index parameters. |
Decision rule: start with pgvector. Move only when you've measured a real bottleneck (HNSW won't fit RAM, build time exceeds maintenance windows, RLS-filtered recall is unacceptable even with partitioning) and you've costed the operational tax of a second stateful system.
Full minimal stack: schema, RLS, indexes, hybrid query, Python client embedding with sentence-transformers.
-- schema.sql
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
doc_id UUID PRIMARY KEY,
tenant_id UUID NOT NULL,
title TEXT,
uri TEXT,
created_at TIMESTAMPTZ DEFAULT now()
);
CREATE TABLE chunks (
chunk_id UUID PRIMARY KEY,
doc_id UUID NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
tenant_id UUID NOT NULL,
ordinal INT NOT NULL,
text TEXT NOT NULL,
text_tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', text)) STORED,
embedding halfvec(1024)
);
CREATE INDEX chunks_hnsw ON chunks USING hnsw (embedding halfvec_cosine_ops)
WITH (m = 16, ef_construction = 200);
CREATE INDEX chunks_text_tsv ON chunks USING gin (text_tsv);
CREATE INDEX chunks_tenant ON chunks (tenant_id);
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON chunks
USING (tenant_id = current_setting('app.tenant_id', true)::uuid);
pip install psycopg[binary] sentence-transformers numpy
"""
End-to-end hybrid search against pgvector.
Embed with bge-large-en-v1.5, store as halfvec(1024), query via RRF in SQL.
"""
import uuid
import psycopg
import numpy as np
from sentence_transformers import SentenceTransformer
DSN = "postgresql://app:secret@localhost:5432/rag"
TENANT = "550e8400-e29b-41d4-a716-446655440000"
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
def to_halfvec_literal(v: np.ndarray) -> str:
    """Format a float vector as the pgvector text literal: '[0.12,-0.04,...]'."""
    return "[" + ",".join(f"{x:.6f}" for x in v.astype(np.float32)) + "]"
def upsert_chunk(conn, doc_id, ordinal, text):
    vec = embedder.encode(text, normalize_embeddings=True)
    conn.execute(
        """
        INSERT INTO chunks (chunk_id, doc_id, tenant_id, ordinal, text, embedding)
        VALUES (%s, %s, %s, %s, %s, %s::halfvec)
        """,
        (uuid.uuid4(), doc_id, TENANT, ordinal, text, to_halfvec_literal(vec)),
    )
def hybrid_search(conn, query: str, k: int = 10):
    qvec = embedder.encode(query, normalize_embeddings=True)
    sql = """
WITH dense AS (
SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY embedding <=> %(qvec)s::halfvec) rnk
FROM chunks
ORDER BY embedding <=> %(qvec)s::halfvec
LIMIT 50
),
lexical AS (
SELECT chunk_id, ROW_NUMBER() OVER (ORDER BY ts_rank(text_tsv, q) DESC) rnk
FROM chunks, plainto_tsquery('english', %(query)s) q
WHERE text_tsv @@ q
ORDER BY ts_rank(text_tsv, q) DESC
LIMIT 50
),
fused AS (
SELECT chunk_id, SUM(1.0 / (60 + rnk)) AS score
FROM (SELECT * FROM dense UNION ALL SELECT * FROM lexical) r
GROUP BY chunk_id
ORDER BY score DESC
LIMIT %(k)s
)
SELECT c.chunk_id, c.text, f.score
FROM fused f JOIN chunks c USING (chunk_id)
ORDER BY f.score DESC;
"""
    with conn.cursor() as cur:
        # Pin the tenant for this transaction only. set_config(..., true) behaves
        # like SET LOCAL but accepts a bind parameter, and psycopg's extended
        # protocol can't run SET plus the query in a single execute() anyway.
        cur.execute("SELECT set_config('app.tenant_id', %s, true)", (TENANT,))
        cur.execute(sql, {
            "qvec": to_halfvec_literal(qvec),
            "query": query,
            "k": k,
        })
        return cur.fetchall()
with psycopg.connect(DSN, autocommit=False) as conn:
    for cid, text, score in hybrid_search(conn, "how do I cancel my subscription?"):
        print(f"{score:.4f} {text[:80]}")
Production notes:
- Set app.tenant_id with SET LOCAL (or set_config(..., true), as in the client above) per transaction so the setting doesn't leak across pooled connections.
- Batch the embedding calls with encode(..., batch_size=64); one row at a time wastes the GPU.
- For bulk loads, drop the HNSW index, COPY the data, then recreate the index with maintenance_work_mem = 8GB and parallel workers.
- Watch pg_stat_user_indexes to confirm the HNSW is actually used (idx_scan > 0). If it's zero, the planner chose a sequential scan — usually a type mismatch or a non-monotonic predicate.

pgvector is the right call when the application already runs on Postgres and the corpus fits comfortably in a single instance (typically < 100M vectors at modern dims). It eliminates the dual-write consistency problem — chunk metadata, embeddings, permissions, and RLS all live in one transaction. You also keep SQL joins to users, documents, permissions. Pinecone wins when you want zero ops and serverless billing; Qdrant wins when you need filter-during-traversal at very high selectivity. For most enterprise SaaS RAG, pgvector wins on operational tax.
HNSW is the default. Higher recall at equivalent latency, no rebuild required as data drifts, more predictable behavior. IVFFlat wins only when index build time matters more than recall (frequent full re-embed), or when memory is tight enough that the HNSW graph won't fit. Tradeoffs: HNSW takes 2–5x longer to build and roughly 1.5x the disk space, but you tune ef_search per query at runtime instead of locking probes at build time. The m parameter controls graph degree (16 default), ef_construction the build quality (200 for production).
Two CTEs — one ordering by embedding <=> query_vec for dense, one by ts_rank(text_tsv, plainto_tsquery(...)) for lexical — combined via Reciprocal Rank Fusion. RRF assigns each candidate 1 / (60 + rank) in each list and sums, so you don't need to calibrate cosine scores against tf-idf magnitudes. The whole thing runs in one round-trip, both branches share the same row-level-security predicate, both indexes are used (HNSW for dense, GIN on tsvector for lexical). For true BM25 instead of Postgres tf-idf, the pg_search extension wraps Tantivy.
The database enforces tenant isolation; the application cannot forget it. You declare a policy USING (tenant_id = current_setting('app.tenant_id')::uuid) on the chunks table, set the session variable per request, and every SELECT/UPDATE/DELETE is automatically scoped. Even a query with no WHERE tenant_id clause cannot leak across tenants. Tradeoff: HNSW does not know about RLS, so it returns nearest neighbors then Postgres filters. For very selective tenants you may need to raise hnsw.ef_search or use partial indexes / hash partitioning per tenant to recover recall.
Full-precision vector is almost never worth it for new builds. Modern embedding models (bge, e5, OpenAI text-embedding-3) are trained in fp16/bf16 anyway; the float32 storage is mostly carrying noise. Empirical recall drop on BEIR moving from vector(1536) to halfvec(1536) is consistently < 0.5 points, while storage halves and HNSW build/search both speed up. Use full vector only if your embedding model genuinely produces high-magnitude floats outside fp16's range (rare) or you're doing exact reproducibility against an existing fp32 reference index.
When queries slow down at scale, there are three levers, in order. First, halfvec quantization — halves storage, halves cache pressure, recall stays. Second, hash-partition by tenant_id so each HNSW is smaller and the planner prunes; on a 16-vCPU box, 32 partitions is a good starting point. Third, drop and rebuild the HNSW after large bulk loads (insert-into-HNSW is O(log N) but the constant is brutal); set maintenance_work_mem = 8GB and parallel workers to soak the CPUs. Past ~100M halfvecs you've outgrown a single instance: either move to VectorChord (pgvecto.rs) which has a faster index implementation, or shard with Citus by tenant_id (sketched below). Don't shard by random hash; cross-shard ANN loses recall.
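A sketch of the Citus option, assuming the extension is installed; distributing by tenant_id keeps each tenant's rows (and its HNSW entries) on a single shard, so ANN queries never fan out:

CREATE EXTENSION IF NOT EXISTS citus;
SELECT create_distributed_table('chunks', 'tenant_id');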