pgvector on Kubernetes: Production Postgres Vector Search Guide (2026)
Run pgvector on Kubernetes in production: CloudNativePG cluster setup, HNSW vs IVFFlat indexing, query tuning, pgvectorscale for 100M+ vectors, multi-tenant patterns, and when to graduate to a dedicated vector DB.
Most teams adopting RAG in 2026 already have Postgres running in production. For them, the question isn’t which vector database to pick - it’s whether they need one at all. pgvector on Kubernetes answers that: if your corpus fits under pgvector’s practical ceiling, you can run production RAG retrieval inside your existing Postgres cluster without adding a new system to operate.
This guide covers when pgvector is enough, how to deploy it on Kubernetes with CloudNativePG, how to tune HNSW for production workloads, and when to upgrade to pgvectorscale or migrate to a dedicated vector database.
When pgvector is enough
pgvector is a genuinely good production choice for a specific envelope:
| You have | pgvector fits? |
|---|---|
| < 10M vectors, Postgres already in prod | Yes, just enable the extension |
| 10-50M vectors | Yes, with careful tuning |
| 50-100M vectors | Yes, but enable pgvectorscale |
| 100M-500M vectors | Yes with pgvectorscale and capable hardware, but start comparing to Qdrant |
| 500M+ vectors | Move to Qdrant or Milvus |
| p95 latency budget < 20ms | Probably no, purpose-built DBs win |
| Transactional consistency between business data and embeddings matters | Yes - this is pgvector’s killer feature |
The transactional story is underrated. With pgvector, inserting a document row and its embedding is a single commit. With a separate vector DB, you have two writes and an eventual consistency window where retrieval returns stale data or references deleted rows. For tightly-integrated RAG over your primary data, that matters.
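The sizing rows of the table above can be read as a simple lookup. The sketch below is only an illustration of that reading: the function name `pgvector_fit` is hypothetical, the thresholds are copied from the table, and real sizing also depends on dimensions, latency budget, and hardware.

```python
def pgvector_fit(n_vectors: int) -> str:
    """Return the table's recommendation for a corpus of n_vectors embeddings."""
    if n_vectors < 0:
        raise ValueError("n_vectors must be non-negative")
    # (exclusive upper bound, advice) pairs, taken from the sizing table
    tiers = [
        (10_000_000, "pgvector: just enable the extension"),
        (50_000_000, "pgvector with careful tuning"),
        (100_000_000, "pgvector plus pgvectorscale"),
        (500_000_000, "pgvectorscale on capable hardware; start comparing to Qdrant"),
    ]
    for upper_bound, advice in tiers:
        if n_vectors < upper_bound:
            return advice
    return "move to Qdrant or Milvus"
```

For example, `pgvector_fit(30_000_000)` lands in the "careful tuning" tier, while anything at or above 500M falls through to the dedicated-database advice.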
Architecture
┌───────────────────────────────────────────────┐
│        CloudNativePG Postgres Cluster         │
│                                               │
│  ┌────────┐      ┌────────┐      ┌────────┐   │
│  │primary │◄────►│replica │◄────►│replica │   │
│  │   rw   │      │   ro   │      │   ro   │   │
│  └────┬───┘      └────┬───┘      └────┬───┘   │
│       │               │               │       │
│  ┌────┴──────────┬────┴───────────────┴───┐   │
│  │ pg_data PVC   │                        │   │
│  │ WAL PVC       │ pgvector extension     │   │
│  │               │ StreamingDiskANN       │   │
│  └───────────────┴────────────────────────┘   │
└────────────────────┬──────────────────────────┘
                     │
                     ▼
            ┌──────────────────┐
            │    S3 / Blob     │  Barman WAL + backups
            └──────────────────┘
- Primary handles writes and reads; replicas handle read-heavy vector queries
- HNSW index is built per-partition or per-table, stored alongside table data
- WAL archiving to S3 for point-in-time recovery
- Streaming replication with synchronous or asynchronous commit depending on RPO
Deploying CNPG + pgvector on Kubernetes
Install the operator:
kubectl apply -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.24/releases/cnpg-1.24.0.yaml
kubectl -n cnpg-system wait --for=condition=Available deployment/cnpg-controller-manager --timeout=5m
Production Cluster CR with pgvector:
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: rag-pg
  namespace: data
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.3-pgvector
  primaryUpdateStrategy: unsupervised
  postgresql:
    # pgvector needs no shared_preload_libraries entry; CREATE EXTENSION is enough.
    # (CNPG also rejects shared_preload_libraries inside "parameters".)
    parameters:
      # Memory for maintenance operations such as index builds
      maintenance_work_mem: "2GB"
      # Parallel vector index build
      max_parallel_maintenance_workers: "4"
      max_parallel_workers_per_gather: "4"
      max_parallel_workers: "8"
      # Critical for HNSW query performance
      effective_cache_size: "24GB"
      shared_buffers: "8GB"
      work_mem: "64MB"
      # Vector-specific: cluster-wide value (pgvector's built-in default is 40); tune per query
      hnsw.ef_search: "64"
    pg_hba:
      - "hostssl all all 10.0.0.0/8 scram-sha-256"
  storage:
    size: 500Gi
    storageClass: gp3 # NVMe on EBS for HNSW
  walStorage:
    size: 50Gi
    storageClass: gp3
  resources:
    requests:
      cpu: "4"
      memory: "32Gi"
    limits:
      memory: "32Gi"
  bootstrap:
    initdb:
      database: rag
      owner: rag
      postInitApplicationSQL:
        - CREATE EXTENSION IF NOT EXISTS vector;
  backup:
    # CNPG takes one retention policy for base backups and the WAL needed to restore them
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://nomadx-pg-backups-me-central-1/rag-pg
      s3Credentials:
        accessKeyId:
          name: pg-backup-creds
          key: access-key-id
        secretAccessKey:
          name: pg-backup-creds
          key: secret-access-key
  monitoring:
    enablePodMonitor: true
  affinity:
    # CNPG's own anti-affinity settings: at most one instance of this cluster per node
    enablePodAntiAffinity: true
    podAntiAffinityType: required
    topologyKey: kubernetes.io/hostname
Apply and wait:
kubectl apply -f rag-pg.yaml
kubectl -n data wait --for=condition=Ready cluster/rag-pg --timeout=10m
Schema and index design
The production embeddings table:
CREATE TABLE embeddings (
    id          uuid        NOT NULL DEFAULT gen_random_uuid(),
    tenant_id   text        NOT NULL,
    source_id   text        NOT NULL,
    source_type text        NOT NULL,  -- doc, faq, chat, ...
    chunk_ix    int         NOT NULL,
    content     text        NOT NULL,
    embedding   vector(768) NOT NULL,  -- match embedding model dim
    metadata    jsonb       NOT NULL DEFAULT '{}'::jsonb,
    created_at  timestamptz NOT NULL DEFAULT now(),
    -- Primary and unique keys on a partitioned table must include the partition column
    PRIMARY KEY (tenant_id, id),
    UNIQUE (tenant_id, source_id, chunk_ix)
) PARTITION BY HASH (tenant_id);
-- Create hash partitions (16 is a good default)
CREATE TABLE embeddings_p0 PARTITION OF embeddings FOR VALUES WITH (modulus 16, remainder 0);
CREATE TABLE embeddings_p1 PARTITION OF embeddings FOR VALUES WITH (modulus 16, remainder 1);
-- ... through p15
-- HNSW vector index on each partition
CREATE INDEX ON embeddings_p0 USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- ... repeat for each partition, ideally via a DO block
-- Scalar index for filter performance
CREATE INDEX ON embeddings (tenant_id, source_id);
CREATE INDEX ON embeddings USING GIN (metadata);
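Hash partitioning by tenant_id does three things: keeps per-partition HNSW indexes small (so they fit in memory better), lets the planner prune tenant-filtered queries down to a single partition, and gives you a clean path to per-partition REINDEX without blocking the whole table. Each hash partition holds several tenants, so maintenance is per partition rather than strictly per tenant.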
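Writing sixteen partition and index statements by hand is error-prone, so one option is to generate them. The sketch below is a minimal illustration, not the only way to do it (a SQL DO block works too); the function name `partition_ddl` and the index naming scheme `<partition>_embedding_hnsw` are assumptions made for this example.

```python
def partition_ddl(parent: str = "embeddings", modulus: int = 16,
                  m: int = 16, ef_construction: int = 64) -> list[str]:
    """Build CREATE TABLE / CREATE INDEX statements for every hash partition."""
    if modulus < 1:
        raise ValueError("modulus must be at least 1")
    statements = []
    for remainder in range(modulus):
        part = f"{parent}_p{remainder}"
        # One partition per remainder, matching the manual DDL in the schema
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {part} PARTITION OF {parent} "
            f"FOR VALUES WITH (modulus {modulus}, remainder {remainder});"
        )
        # One HNSW index per partition; the index name is a hypothetical convention
        statements.append(
            f"CREATE INDEX IF NOT EXISTS {part}_embedding_hnsw ON {part} "
            f"USING hnsw (embedding vector_cosine_ops) "
            f"WITH (m = {m}, ef_construction = {ef_construction});"
        )
    return statements
```

Printing `"\n".join(partition_ddl())` yields a script you can review and pipe into psql. The identifiers are interpolated directly, so only pass trusted table names.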
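For 10M vectors at 768 dimensions across 16 partitions with m=16, expect the HNSW graph structure to use ~8-10 GB of memory. The vectors themselves add about 31 GB of raw float data (10M × 768 dimensions × 4 bytes), and pgvector keeps a copy of each vector inside the index, so the fully cached working set is well above the graph figure. Tune shared_buffers and effective_cache_size to give Postgres room.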
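As a sanity check on sizing figures like these, the raw float payload is easy to compute. The sketch below gives a lower bound only: it ignores tuple headers, page fill, and the HNSW neighbor lists, and the per-partition figure assumes tenants spread evenly across partitions. The function names are hypothetical.

```python
BYTES_PER_FLOAT4 = 4  # pgvector's vector type stores single-precision floats


def raw_vector_bytes(n_vectors: int, dims: int) -> int:
    """Bytes of float data only, ignoring tuple headers and page overhead."""
    return n_vectors * dims * BYTES_PER_FLOAT4


def per_partition_gib(n_vectors: int, dims: int, partitions: int) -> float:
    """Average raw vector payload per partition in GiB, assuming an even spread."""
    return raw_vector_bytes(n_vectors, dims) / partitions / 2**30
```

For the 10M × 768 case, `raw_vector_bytes(10_000_000, 768)` is 30,720,000,000 bytes, or roughly 1.8 GiB per partition across 16 partitions. Compare that with shared_buffers and pod memory before assuming the index stays cached.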
Query patterns for production RAG
The naive query:
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM embeddings
WHERE tenant_id = $2
ORDER BY embedding <=> $1
LIMIT 20;
Works, but leaves performance on the table. Production patterns:
1. Pre-filter via CTE for selectivity:
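WITH candidates AS MATERIALIZED (
    SELECT id, content, embedding
    FROM embeddings
    WHERE tenant_id = $2
      AND source_type IN ('doc', 'faq')
      AND created_at > now() - interval '90 days'
)
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM candidates
ORDER BY embedding <=> $1
LIMIT 20;
HNSW in pgvector does not honor arbitrary WHERE clauses well - the index scan returns its ef_search nearest candidates and the filter is applied afterwards (which under-returns when few candidates match), or the planner falls back to a sequential scan. A highly selective pre-filter narrows first, then ranks the narrowed set exactly. The CTE has to be marked MATERIALIZED: since Postgres 12 a plain CTE is inlined into the outer query, which makes it equivalent to the naive form. pgvector 0.8+ also offers iterative index scans (hnsw.iterative_scan), which keep scanning until enough rows pass the filter.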
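The under-return problem is easy to see in a toy model. The sketch below is not pgvector code: it stands in for an index scan with "take the ef_search nearest rows overall, then filter", and compares that with filtering first. The data layout (every tenth row belongs to tenant "a") and all names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    id: int
    tenant: str
    distance: float  # distance from the query vector, precomputed for the toy


def make_rows(n: int = 100) -> list[Row]:
    """Every 10th row belongs to tenant 'a'; distance grows with id."""
    return [Row(i, "a" if i % 10 == 0 else "b", float(i)) for i in range(n)]


def post_filter(rows: list[Row], tenant: str, k: int, ef_search: int) -> list[Row]:
    """Index-style scan: take the ef_search nearest rows overall, then filter."""
    nearest = sorted(rows, key=lambda r: r.distance)[:ef_search]
    return [r for r in nearest if r.tenant == tenant][:k]


def pre_filter(rows: list[Row], tenant: str, k: int) -> list[Row]:
    """Narrow to the tenant first, then rank the survivors exactly."""
    candidates = [r for r in rows if r.tenant == tenant]
    return sorted(candidates, key=lambda r: r.distance)[:k]
```

With `k=5` and `ef_search=40`, `post_filter` finds only four rows for tenant "a" because just four of the 40 nearest candidates belong to it, while `pre_filter` returns all five requested rows.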
2. Dynamic ef_search per query:
BEGIN;
SET LOCAL hnsw.ef_search = 128; -- higher recall; SET LOCAL only lasts until the transaction ends
SELECT ... FROM embeddings ORDER BY embedding <=> $1 LIMIT 50;
COMMIT;
Start with ef_search = 64, raise to 128-256 for high-value queries that need better recall.
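From application code, two details tend to trip people up: SET LOCAL only takes effect inside a transaction, and SET does not accept bind parameters, so the value has to be validated and inlined. The sketch below builds the statement sequence without assuming any particular database driver. The function names are hypothetical, the SELECT is a simplified stand-in for your real query, and the 1..1000 bound reflects pgvector's documented range for hnsw.ef_search.

```python
def to_vector_literal(values) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def search_statements(ef_search: int, limit: int) -> list[str]:
    """Statements for one search transaction with a transaction-scoped ef_search."""
    # Validate before inlining: SET cannot take a bind parameter
    if not isinstance(ef_search, int) or not 1 <= ef_search <= 1000:
        raise ValueError("ef_search must be an integer in 1..1000")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return [
        "BEGIN",
        f"SET LOCAL hnsw.ef_search = {ef_search}",
        "SELECT id, content FROM embeddings "
        f"WHERE tenant_id = $2 ORDER BY embedding <=> $1 LIMIT {limit}",
        "COMMIT",
    ]
```

Pass `to_vector_literal(query_embedding)` as the `$1` parameter and the tenant as `$2`; placeholder syntax varies by driver, so adapt `$1`/`$2` accordingly.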
3. Route vector queries to replicas:
Use CNPG’s -ro service (rag-pg-ro) for search queries. Primary handles writes only. This is usually a 2-3x throughput win on RAG-heavy workloads.
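In application code, this routing usually comes down to picking a service hostname per operation. The sketch below assumes the cluster name, namespace, database, and user from the manifest above, plus CNPG's `-rw` and `-ro` service naming. The function names and the two operation labels are made up for this example, and credentials are deliberately left out.

```python
def service_host(cluster: str, namespace: str, read_only: bool) -> str:
    """In-cluster DNS name of the CNPG read-write or read-only service."""
    suffix = "ro" if read_only else "rw"
    return f"{cluster}-{suffix}.{namespace}.svc"


def dsn_for(operation: str, cluster: str = "rag-pg", namespace: str = "data",
            database: str = "rag", user: str = "rag") -> str:
    """Connection URI for an operation: 'search' goes to replicas, 'write' to the primary."""
    if operation not in ("search", "write"):
        raise ValueError("operation must be 'search' or 'write'")
    host = service_host(cluster, namespace, read_only=(operation == "search"))
    # sslmode=require matches the hostssl rule in pg_hba
    return f"postgresql://{user}@{host}:5432/{database}?sslmode=require"
```

Replicas can lag behind the primary, so a search issued right after an ingest may not see the new rows. If you need read-your-writes, send that one query to the `-rw` service instead.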
pgvectorscale: the 100M-vector upgrade
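For workloads above 20M vectors, install pgvectorscale, which adds StreamingDiskANN. The extension’s binaries must be present in the Postgres image, which usually means building a custom image on top of the one you already run. If you preload the library, note that CNPG takes shared_preload_libraries as a list under postgresql, not as a string inside parameters (pgvector itself does not need preloading):
# In the CNPG cluster spec
postgresql:
  shared_preload_libraries:
    - vectorscale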
And in SQL:
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
CREATE INDEX embeddings_diskann_idx
ON embeddings
USING diskann (embedding vector_cosine_ops);
StreamingDiskANN keeps only graph metadata in memory and streams vectors from disk, which is why it scales beyond HNSW’s memory ceiling. On NVMe-class storage, throughput is comparable to HNSW for queries and faster for bulk ingestion.
Requirements:
- NVMe or io2-class storage (latency matters here)
- Postgres 14+
- pgvector 0.7+
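Timescale published benchmarks on 50M Cohere embeddings (768 dimensions) reporting 28x lower p95 latency and 16x higher query throughput than Pinecone’s storage-optimized index at 99% recall; note that the comparison is against Pinecone, not against pgvector’s HNSW. We’ve seen large gains on client workloads as well.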
Observability
CNPG ships Prometheus metrics via pod monitor. Useful metrics and alerts:
- pg_stat_database_tup_fetched / pg_stat_database_tup_returned - index effectiveness ratio
- pg_stat_user_indexes_idx_scan{indexname=~"embeddings.*"} - vector-index usage
- pg_replication_lag_bytes - alert > 100MB
- pg_stat_bgwriter_checkpoints_timed / _req - high _req means checkpoints are falling behind, often from HNSW build pressure
- pg_stat_activity_count{state="active"} - saturation
Exact metric names depend on your exporter: CNPG’s built-in exporter prefixes its metrics with cnpg_, and per-index statistics may need a custom monitoring query.
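Two of these signals reduce to simple arithmetic, which is worth pinning down before writing alert rules. The sketch below shows the requested-checkpoint ratio and the 100 MB replication-lag threshold as plain functions; the function names are hypothetical, and the inputs are whatever counter values your exporter reports.

```python
def requested_checkpoint_ratio(timed: int, requested: int) -> float:
    """Share of checkpoints that were forced (requested) rather than timed."""
    total = timed + requested
    return requested / total if total else 0.0


def replication_lag_alert(lag_bytes: int,
                          threshold_bytes: int = 100 * 1024**2) -> bool:
    """True when replica lag exceeds the threshold (default 100 MiB)."""
    return lag_bytes > threshold_bytes
```

A ratio that climbs during bulk loads suggests max_wal_size is too small for the WAL volume that index builds generate.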
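Query-level tracing via auto_explain for queries > 500ms. CNPG manages Postgres configuration declaratively and disables ALTER SYSTEM by default, so set these in the Cluster spec; the operator adds auto_explain to shared_preload_libraries when it sees auto_explain.* parameters:
postgresql:
  parameters:
    auto_explain.log_min_duration: "500ms"
    auto_explain.log_analyze: "on"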
Backup strategy
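Barman archives WAL continuously, so you can recover to any point in time inside the retention window. CNPG restores by bootstrapping a new cluster from the backup: create a Cluster named rag-pg-dr whose bootstrap.recovery section points at the rag-pg object store (declared under externalClusters) and sets the target time:
bootstrap:
  recovery:
    source: rag-pg
    recoveryTarget:
      targetTime: "2026-04-24 10:30:00+04"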
Test restore quarterly. A 500 GB cluster restores in roughly 10-15 minutes on gp3.
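A restore-time estimate like this is mostly a statement about sustained throughput, so it helps to check what the volume and object store must deliver. The sketch below is back-of-the-envelope arithmetic only: it uses decimal units (1 GB = 1000 MB), ignores WAL replay time, and the function names are hypothetical.

```python
def required_throughput_mb_s(size_gb: float, minutes: float) -> float:
    """Sustained MB/s needed to move size_gb within the given number of minutes."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return size_gb * 1000 / (minutes * 60)


def restore_minutes(size_gb: float, throughput_mb_s: float) -> float:
    """Minutes to move size_gb at a sustained throughput, ignoring WAL replay."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput_mb_s must be positive")
    return size_gb * 1000 / throughput_mb_s / 60
```

Restoring 500 GB in 15 minutes needs about 556 MB/s sustained, and 10 minutes needs about 833 MB/s. At roughly 125 MB/s, which is our understanding of gp3's unprovisioned baseline (worth verifying against current AWS documentation), the same restore takes about 67 minutes, so the quoted window assumes provisioned throughput.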
When to graduate to a dedicated vector DB
pgvector hits walls at predictable points:
- Query latency p99 exceeds SLO - typically around 50-100M vectors without pgvectorscale, 500M with it
- Index build time dominates maintenance windows - a full HNSW build is roughly O(n log n), and REINDEX rebuilds the graph from scratch rather than patching it; at 100M vectors, a REINDEX takes hours
- Multi-tenant partition count explodes - beyond ~256 hash partitions, planning overhead creeps in
- Write throughput contends with read throughput - at high ingest rates the same cluster can’t also serve low-latency search
At that point, migrate search to Qdrant or Milvus while keeping the source-of-truth documents in Postgres. Our Qdrant guide covers the migration target.
Common failure modes
- HNSW index build OOMs during large initial load - raise maintenance_work_mem to 4-8 GB temporarily, or build in chunks.
- Queries slow after large delete/update - dead tuples in HNSW inflate the graph. VACUUM doesn’t reclaim HNSW space efficiently; periodic REINDEX CONCURRENTLY on affected partitions is the fix.
- Sequential scans instead of HNSW - query is returning too few rows (< hnsw.ef_search) or using a filter that HNSW can’t honor. Use the CTE pre-filter pattern.
- Random latency spikes at checkpoint time - HNSW write-heavy workloads push checkpointer hard. Tune max_wal_size to 8-16 GB and spread checkpoints.
- Replication falls behind - vector index builds generate a lot of WAL. Use logical replication for non-critical replicas, or accept lag during bulk loads.
What this connects to
pgvector is the Postgres-native retrieval tier in a RAG stack. See the Production RAG Stack on Kubernetes reference architecture for how retrieval connects to LiteLLM, Langfuse, and the orchestration layer.
If you’ve outgrown pgvector: Qdrant for 100M-1B vectors, Milvus beyond that.
Getting help
We deploy production Postgres platforms for GCC enterprise teams - including CloudNativePG clusters that serve both transactional and RAG retrieval workloads. If you want a sizing review, a pgvector-to-Qdrant migration plan, or help enabling pgvectorscale under an existing workload, AI/ML Infrastructure on Kubernetes covers this. Typical engagement: 2-4 weeks.
Frequently Asked Questions
When does pgvector make sense for production RAG?
pgvector is the right choice when you already operate Postgres in production, your corpus is under 10-50M vectors, and you value transactional guarantees (a single write commits the business row and its embedding together). It stops being the right choice above ~100M vectors without pgvectorscale, or when query latency SLOs fall below 20-30ms at high concurrency. For teams with no Postgres investment, a purpose-built vector database like Qdrant is usually simpler.
HNSW or IVFFlat - which pgvector index should I use?
Use HNSW for almost all production workloads. HNSW gives better recall/latency trade-offs and doesn't require re-training as data grows. IVFFlat is faster to build and uses less memory but degrades as the cluster centroids become stale, which means you have to periodically REINDEX. HNSW's m=16, ef_construction=64 is a good starting point for 768-dim embeddings; raise ef_search at query time if you need higher recall.
What is pgvectorscale and when should I enable it?
pgvectorscale (from Timescale) adds the StreamingDiskANN index on top of pgvector. It pushes pgvector's practical ceiling from ~10-50M to 100M+ vectors by keeping only index metadata in memory and streaming vector comparisons from disk. Enable it when you're past 20M vectors, have NVMe storage, and want to stay on Postgres rather than migrating to a dedicated vector DB.
How do I run HA Postgres with pgvector on Kubernetes?
Use CloudNativePG (CNPG) as the operator. CNPG handles streaming replication, automatic failover, backup via Barman to S3, point-in-time recovery, and rolling upgrades. pgvector needs no entry in shared_preload_libraries: use a Postgres image that bundles the extension and install it per database with CREATE EXTENSION vector. CNPG ships with a -pgvector variant of its Postgres image that includes the extension pre-built.
How should I partition vector data for multi-tenant RAG?
Two patterns work: (1) a single embeddings table with a tenant_id column, a B-tree index on tenant_id, and one HNSW index on the embedding column (HNSW indexes are single-column, so there is no composite (tenant_id, vector) index) - works up to ~100 tenants; (2) Postgres native partitioning by tenant_id list or hash - works for 1000s of tenants and keeps HNSW indexes per-partition small. Don't create a schema-per-tenant for embeddings - schema operations become the bottleneck long before vector search does.
Can pgvector be used for GCC data-sovereign RAG?
Yes. pgvector runs inside your Postgres cluster, which can be deployed on Azure UAE North, AWS RDS Middle East, or self-hosted CNPG on sovereign Kubernetes. No external services required - this is actually its main sovereignty advantage over managed vector databases. Ensure backups and replicas stay in-region.