April 23, 2026 · 8 min read · Aizhan Azhybaeva

pgvector on Kubernetes: Production Postgres Vector Search Guide (2026)

Run pgvector on Kubernetes in production: CloudNativePG cluster setup, HNSW vs IVFFlat indexing, query tuning, pgvectorscale for 100M+ vectors, multi-tenant patterns, and when to graduate to a dedicated vector DB.

Most teams adopting RAG in 2026 already have Postgres running in production. For them, the question isn’t which vector database to pick - it’s whether they need one at all. pgvector on Kubernetes answers that: if your corpus fits the ceiling, you can run production RAG retrieval inside your existing Postgres cluster with zero new operational surface.

This guide covers when pgvector is enough, how to deploy it on Kubernetes with CloudNativePG, how to tune HNSW for production workloads, and when to upgrade to pgvectorscale or migrate to a dedicated vector database.

When pgvector is enough

pgvector is a genuinely good production choice for a specific envelope:

You have	pgvector fits?
< 10M vectors, Postgres already in prod	Yes, just enable the extension
10-50M vectors	Yes, with careful tuning
50-100M vectors	Yes, but enable pgvectorscale
100M-500M vectors	Yes with pgvectorscale and capable hardware, but start comparing to Qdrant
500M+ vectors	Move to Qdrant or Milvus
p95 latency budget < 20ms	Probably no, purpose-built DBs win
Transactional consistency between business data and embeddings matters	Yes - this is pgvector’s killer feature

The transactional story is underrated. With pgvector, inserting a document row and its embedding is a single commit. With a separate vector DB, you have two writes and an eventual consistency window where retrieval returns stale data or references deleted rows. For tightly-integrated RAG over your primary data, that matters.

Architecture

     ┌────────────────────────────────────────────────┐
     │         CloudNativePG Postgres Cluster         │
     │                                                │
     │  ┌────────┐      ┌────────┐      ┌────────┐    │
     │  │primary │◄────►│replica │◄────►│replica │    │
     │  │  rw    │      │  ro    │      │  ro    │    │
     │  └────┬───┘      └────┬───┘      └────┬───┘    │
     │       │               │               │        │
     │  ┌────┴───────────────┴───────────────┴───┐    │
     │  │  pg_data PVC       pgvector extension  │    │
     │  │  WAL PVC           StreamingDiskANN    │    │
     │  └────────────────────────────────────────┘    │
     └────────────────────┬───────────────────────────┘
                          │
                          ▼
                ┌──────────────────┐
                │     S3 / Blob    │  Barman WAL + backups
                └──────────────────┘

Primary handles writes and reads; replicas handle read-heavy vector queries
HNSW index is built per-partition or per-table, stored alongside table data
WAL archiving to S3 for point-in-time recovery
Streaming replication with synchronous or asynchronous commit depending on RPO

Deploying CNPG + pgvector on Kubernetes

Install the operator:

kubectl apply -f \
  https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.24/releases/cnpg-1.24.0.yaml

kubectl -n cnpg-system wait --for=condition=Available deployment/cnpg-controller-manager --timeout=5m

Production Cluster CR with pgvector:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: rag-pg
  namespace: data
spec:
  instances: 3
  imageName: ghcr.io/cloudnative-pg/postgresql:16.3   # stock CNPG operand images bundle pgvector; confirm the tag in the registry
  primaryUpdateStrategy: unsupervised

  postgresql:
    parameters:
      # pgvector needs no shared_preload_libraries entry; CREATE EXTENSION is enough
      # Memory for maintenance operations such as index builds
      maintenance_work_mem: "2GB"
      # Parallel vector index build
      max_parallel_maintenance_workers: "4"
      max_parallel_workers_per_gather: "4"
      max_parallel_workers: "8"
      # Critical for HNSW query performance
      effective_cache_size: "24GB"
      shared_buffers: "8GB"
      work_mem: "64MB"
      # Vector-specific
      hnsw.ef_search: "64"        # pgvector's default is 40 - tune per query
    pg_hba:
      - "hostssl all all 10.0.0.0/8 scram-sha-256"

  storage:
    size: 500Gi
    storageClass: gp3             # EBS gp3 is network SSD, not local NVMe; provision extra IOPS/throughput for HNSW
  walStorage:
    size: 50Gi
    storageClass: gp3

  resources:
    requests:
      cpu: "4"
      memory: "32Gi"
    limits:
      memory: "32Gi"

  bootstrap:
    initdb:
      database: rag
      owner: rag
      postInitApplicationSQL:
        - CREATE EXTENSION IF NOT EXISTS vector;

  backup:
    # CNPG takes one recovery-window policy here, not separate WAL/data retention
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: s3://nomadx-pg-backups-me-central-1/rag-pg
      s3Credentials:
        accessKeyId:
          name: pg-backup-creds
          key: access-key-id
        secretAccessKey:
          name: pg-backup-creds
          key: secret-access-key
      wal:
        compression: gzip
      data:
        compression: gzip

  monitoring:
    enablePodMonitor: true

  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: cnpg.io/cluster
                operator: In
                values: [rag-pg]
          topologyKey: kubernetes.io/hostname

Apply and wait:

kubectl apply -f rag-pg.yaml
kubectl -n data wait --for=condition=Ready cluster/rag-pg --timeout=10m

Schema and index design

The production embeddings table:

CREATE TABLE embeddings (
    id          uuid NOT NULL DEFAULT gen_random_uuid(),
    tenant_id   text NOT NULL,
    source_id   text NOT NULL,
    source_type text NOT NULL,                    -- doc, faq, chat, ...
    chunk_ix    int NOT NULL,
    content     text NOT NULL,
    embedding   vector(768) NOT NULL,             -- match embedding model dim
    metadata    jsonb NOT NULL DEFAULT '{}'::jsonb,
    created_at  timestamptz NOT NULL DEFAULT now(),
    -- Keys on a partitioned table must include the partition column
    PRIMARY KEY (tenant_id, id),
    UNIQUE (tenant_id, source_id, chunk_ix)
) PARTITION BY HASH (tenant_id);

-- Create hash partitions (16 is a good default)
CREATE TABLE embeddings_p0  PARTITION OF embeddings FOR VALUES WITH (modulus 16, remainder 0);
CREATE TABLE embeddings_p1  PARTITION OF embeddings FOR VALUES WITH (modulus 16, remainder 1);
-- ... through p15

-- HNSW vector index on each partition
CREATE INDEX ON embeddings_p0 USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
-- ... repeat for each partition, ideally via a DO block

-- Scalar index for filter performance
CREATE INDEX ON embeddings (tenant_id, source_id);
CREATE INDEX ON embeddings USING GIN (metadata);

Hash partitioning by tenant_id does three things: it keeps per-partition HNSW indexes small (so they fit in memory better), lets the planner prune a tenant-filtered query down to a single partition, and gives you a clean path to REINDEX one partition at a time without blocking the whole table.
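
The "through p15" and "repeat for each partition" placeholders in the schema can be generated rather than typed out. A minimal sketch in Python that prints the partition and per-partition HNSW statements; the function name partition_ddl is hypothetical, while the table name, modulus, and index parameters are the ones used in the schema above:

```python
def partition_ddl(table: str = "embeddings", modulus: int = 16,
                  m: int = 16, ef_construction: int = 64) -> list[str]:
    """Return CREATE TABLE / CREATE INDEX statements for every hash partition."""
    statements = []
    for remainder in range(modulus):
        part = f"{table}_p{remainder}"
        # One child table per remainder of the hash of the partition key
        statements.append(
            f"CREATE TABLE {part} PARTITION OF {table} "
            f"FOR VALUES WITH (modulus {modulus}, remainder {remainder});"
        )
        # One HNSW index per child, same parameters as the hand-written example
        statements.append(
            f"CREATE INDEX ON {part} USING hnsw (embedding vector_cosine_ops) "
            f"WITH (m = {m}, ef_construction = {ef_construction});"
        )
    return statements


if __name__ == "__main__":
    print("\n".join(partition_ddl()))
```

The output can be piped into psql. For a large initial load it is usually faster to emit only the CREATE TABLE statements first and build the indexes once the data is in place.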

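For 10M vectors at 768 dimensions across 16 partitions with m=16, budget for more than the graph links alone: pgvector's HNSW index stores a copy of every vector, and the raw float4 vectors come to 10M × (4 × 768 + 8) bytes ≈ 31 GB before neighbor lists are added. Check the real footprint with pg_relation_size on each index, then tune shared_buffers and effective_cache_size so the hot part of the index stays cached.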
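
A quick way to sanity-check memory numbers like these is to compute the floor set by the vectors themselves. The sketch below assumes pgvector's documented storage cost of 4 × dimensions + 8 bytes per vector value; the function names are hypothetical, and the result is a lower bound because it leaves out HNSW neighbor lists and page overhead:

```python
GIB = 2 ** 30


def vector_storage_bytes(n_vectors: int, dims: int) -> int:
    """Bytes for the vector values alone: 4 bytes per dimension plus an 8-byte header."""
    return n_vectors * (4 * dims + 8)


def per_partition_gib(n_vectors: int, dims: int, partitions: int) -> float:
    """Average vector bytes per partition in GiB, assuming an even hash spread."""
    return vector_storage_bytes(n_vectors, dims) / partitions / GIB


if __name__ == "__main__":
    total = vector_storage_bytes(10_000_000, 768)
    print(f"total: {total / GIB:.1f} GiB")
    print(f"per partition (16): {per_partition_gib(10_000_000, 768, 16):.1f} GiB")
```

For the 10M × 768 case this comes to about 28.7 GiB in total, or roughly 1.8 GiB per partition across 16 partitions. With shared_buffers at 8GB as in the cluster spec above, most of that has to be served from the OS page cache, so the pod memory limit matters as much as the Postgres setting.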

Query patterns for production RAG

The naive query:

SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM embeddings
WHERE tenant_id = $2
ORDER BY embedding <=> $1
LIMIT 20;

Works, but leaves performance on the table. Production patterns:

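1. Pre-filter via CTE for selectivity:

WITH candidates AS MATERIALIZED (
    SELECT id, content, embedding
    FROM embeddings
    WHERE tenant_id = $2
      AND source_type IN ('doc', 'faq')
      AND created_at > now() - interval '90 days'
)
SELECT id, content, 1 - (embedding <=> $1) AS similarity
FROM candidates
ORDER BY embedding <=> $1
LIMIT 20;

pgvector applies WHERE clauses after the HNSW index scan, so a selective filter can leave fewer than LIMIT rows (the scan yields at most hnsw.ef_search candidates before filtering), or the planner skips the index entirely. The MATERIALIZED keyword matters here: since Postgres 12 a plain CTE is inlined and planned exactly like the naive query. Materialized, the filter runs first using the scalar indexes and the distance sort becomes an exact scan over the narrowed set, which is the right trade when the filter is highly selective. On pgvector 0.8.0 and later, hnsw.iterative_scan is an alternative that keeps scanning the index until enough rows pass the filter.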

2. Dynamic ef_search per query:

BEGIN;
SET LOCAL hnsw.ef_search = 128;    -- higher recall; SET LOCAL only lasts for this transaction
SELECT ... FROM embeddings ORDER BY embedding <=> $1 LIMIT 50;
COMMIT;

Start with ef_search = 64, raise to 128-256 for high-value queries that need better recall.
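
Choosing between 64, 128, and 256 is easier with a measured recall figure than by feel. A minimal sketch, assuming you have collected, for a sample of queries, the ids returned by the HNSW query and the ids from an exact search (for example the same query run with SET LOCAL enable_indexscan = off). The function names and the sample numbers in the usage block are hypothetical:

```python
from typing import Optional, Sequence


def recall_at_k(approx_ids: Sequence[str], exact_ids: Sequence[str], k: int) -> float:
    """Fraction of the exact top-k that the approximate search also returned."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(set(approx_ids[:k]) & exact_top) / len(exact_top)


def pick_ef_search(recall_by_ef: dict, target: float) -> Optional[int]:
    """Smallest ef_search whose measured recall meets the target, else None."""
    for ef in sorted(recall_by_ef):
        if recall_by_ef[ef] >= target:
            return ef
    return None


if __name__ == "__main__":
    # Illustrative measurements only, not benchmark results
    measured = {40: 0.91, 64: 0.95, 128: 0.98}
    print(pick_ef_search(measured, target=0.95))
```

Averaging recall_at_k over a few hundred representative queries per ef_search value gives the table that pick_ef_search expects. Re-measure after large ingests, since recall at a fixed ef_search can drift as the graph grows.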

3. Route vector queries to replicas:

Use CNPG’s -ro service (rag-pg-ro) for search queries. Primary handles writes only. This is usually a 2-3x throughput win on RAG-heavy workloads.
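
In application code the routing usually comes down to picking a host per connection. A minimal sketch, assuming the cluster name rag-pg, namespace data, and database and owner rag from the manifest above; the function names are hypothetical, and the password is expected to come from the environment rather than the URI:

```python
def cnpg_service_host(cluster: str, namespace: str, read_only: bool) -> str:
    """In-cluster DNS name of the CNPG service: <cluster>-ro or <cluster>-rw."""
    suffix = "ro" if read_only else "rw"
    return f"{cluster}-{suffix}.{namespace}.svc"


def dsn(read_only: bool, cluster: str = "rag-pg", namespace: str = "data",
        user: str = "rag", database: str = "rag") -> str:
    """Build a libpq-style URI; TLS is required to match the hostssl pg_hba rule."""
    host = cnpg_service_host(cluster, namespace, read_only)
    return f"postgresql://{user}@{host}:5432/{database}?sslmode=require"


if __name__ == "__main__":
    print(dsn(read_only=True))   # vector search
    print(dsn(read_only=False))  # ingestion and writes
```

With asynchronous replication the -ro endpoint can lag behind the primary, so a query that must see an embedding written a moment ago is safer on the -rw endpoint.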

pgvectorscale: the 100M-vector upgrade

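For workloads above 20M vectors, install pgvectorscale, which adds StreamingDiskANN. The stock CNPG images do not bundle it, so the cluster needs a custom image that includes the extension; neither pgvector nor pgvectorscale has to be listed in shared_preload_libraries:

# In the CNPG cluster spec (the image name is a placeholder for your own build)
imageName: registry.example.com/postgresql-vectorscale:16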

And in SQL:

CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

CREATE INDEX embeddings_diskann_idx
ON embeddings
USING diskann (embedding vector_cosine_ops);

StreamingDiskANN keeps only graph metadata in memory and streams vectors from disk, which is why it scales beyond HNSW’s memory ceiling. On NVMe-class storage, throughput is comparable to HNSW for queries and faster for bulk ingestion.

Requirements:

NVMe or io2-class storage (latency matters here)
Postgres 14+
pgvector 0.7+

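Timescale's published benchmark on 50M Cohere embeddings (768 dimensions) reported 28x lower p95 latency and 16x higher query throughput at 99% recall, but the comparison was against Pinecone's storage-optimized index, not against pgvector's own HNSW. On client workloads we’ve also seen large gains over plain HNSW at this scale.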

Observability

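CNPG ships Prometheus metrics via pod monitor. The names below follow postgres_exporter conventions; CNPG's built-in exporter prefixes its metrics with cnpg_ (for example cnpg_pg_replication_lag), and per-index statistics may need a custom monitoring query, so map each one to what your exporter actually exposes. Useful metrics and alerts: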

pg_stat_database_tup_fetched / pg_stat_database_tup_returned - index effectiveness ratio
pg_stat_user_indexes_idx_scan{indexname ~ "embeddings.*"} - vector-index usage
pg_replication_lag_bytes - alert > 100MB
pg_stat_bgwriter_checkpoints_timed / _req - high _req means checkpoints are falling behind, often from HNSW build pressure
pg_stat_activity_count{state="active"} - saturation

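Query-level tracing via auto_explain for queries > 500ms. CNPG disables ALTER SYSTEM by default in recent releases and loads auto_explain on its own once an auto_explain.* parameter appears in the cluster spec, so set it there:

# In the CNPG cluster spec
postgresql:
  parameters:
    auto_explain.log_min_duration: "500ms"
    auto_explain.log_analyze: "on"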

Backup strategy

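Barman archives WAL continuously, which allows point-in-time recovery to any second in the retention window. CNPG restores into a new cluster rather than in place: create a second Cluster (for example rag-pg-dr) whose externalClusters entry named rag-pg points at the same barmanObjectStore, and bootstrap it from that source with a recovery target:

bootstrap:
  recovery:
    source: rag-pg
    recoveryTarget:
      targetTime: "2026-04-24 10:30:00+04"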

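Test restore quarterly. A 500 GB cluster restores in roughly 10-15 minutes on gp3 only when the volume and the object store can sustain several hundred MB/s, so record the actual time in each drill.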
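
The restore estimate is worth checking against the throughput the volume is actually provisioned for. A rough sketch of the arithmetic; it only models the base-backup copy (WAL replay adds time on top), the function names are hypothetical, and the 125 MB/s figure used in the usage block is an assumption standing in for an unmodified gp3 volume:

```python
def restore_minutes(size_gb: float, throughput_mb_s: float) -> float:
    """Minutes to copy size_gb at a sustained throughput (1 GB = 1000 MB)."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb * 1000 / throughput_mb_s / 60


def required_throughput_mb_s(size_gb: float, minutes: float) -> float:
    """Sustained MB/s needed to copy size_gb within the given number of minutes."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return size_gb * 1000 / (minutes * 60)


if __name__ == "__main__":
    print(f"{restore_minutes(500, 125):.1f} min at 125 MB/s")
    print(f"{required_throughput_mb_s(500, 15):.0f} MB/s needed for 15 min")
    print(f"{required_throughput_mb_s(500, 10):.0f} MB/s needed for 10 min")
```

Under these assumptions, 500 GB at 125 MB/s takes over an hour, while a 10-15 minute restore needs roughly 556-833 MB/s sustained. If the drill comes in slower than planned, provisioned volume throughput is the first thing to check.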

When to graduate to a dedicated vector DB

pgvector hits walls at predictable points:

Query latency p99 exceeds SLO - typically around 50-100M vectors without pgvectorscale, 500M with it
Index build time dominates maintenance windows - HNSW build is roughly O(n log n), and a REINDEX rebuilds the graph from scratch rather than incrementally; at 100M vectors it takes hours
Multi-tenant partition count explodes - beyond ~256 hash partitions, planning overhead creeps in
Write throughput contends with read throughput - at high ingest rates the same cluster can’t also serve low-latency search

At that point, migrate search to Qdrant or Milvus while keeping the source-of-truth documents in Postgres. Our Qdrant guide covers the migration target.

Common failure modes

HNSW index build OOMs during large initial load - raise maintenance_work_mem to 4-8 GB temporarily, or build in chunks.
Queries slow after large delete/update - dead tuples in HNSW inflate the graph. VACUUM doesn’t reclaim HNSW space efficiently; periodic REINDEX CONCURRENTLY on affected partitions is the fix.
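Sequential scans instead of HNSW - the planner only uses the index when the query has a LIMIT and orders by the bare distance operator in ascending order (ORDER BY embedding <=> $1); ordering by an expression such as 1 - (embedding <=> $1), or a filter the planner judges cheaper to scan directly, pushes it to a sequential scan. Getting fewer rows than LIMIT is a separate symptom: results are capped by hnsw.ef_search and filters run after the index scan. Confirm with EXPLAIN, then use the CTE pre-filter pattern or raise ef_search.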
Random latency spikes at checkpoint time - HNSW write-heavy workloads push checkpointer hard. Tune max_wal_size to 8-16 GB and spread checkpoints.
Replication falls behind - vector index builds generate a lot of WAL. Use logical replication for non-critical replicas, or accept lag during bulk loads.

What this connects to

pgvector is the Postgres-native retrieval tier in a RAG stack. See the Production RAG Stack on Kubernetes reference architecture for how retrieval connects to LiteLLM, Langfuse, and the orchestration layer.

If you’ve outgrown pgvector: Qdrant for 100M-1B vectors, Milvus beyond that.

Getting help

We deploy production Postgres platforms for GCC enterprise teams - including CloudNativePG clusters that serve both transactional and RAG retrieval workloads. If you want a sizing review, a pgvector-to-Qdrant migration plan, or help enabling pgvectorscale under an existing workload, AI/ML Infrastructure on Kubernetes covers this. Typical engagement: 2-4 weeks.

Frequently Asked Questions

When does pgvector make sense for production RAG?

pgvector is the right choice when you already operate Postgres in production, your corpus is under 10-50M vectors, and you value transactional guarantees (a single write commits the business row and its embedding together). It stops being the right choice above ~100M vectors without pgvectorscale, or when query latency SLOs fall below 20-30ms at high concurrency. For teams with no Postgres investment, a purpose-built vector database like Qdrant is usually simpler.

HNSW or IVFFlat - which pgvector index should I use?

Use HNSW for almost all production workloads. HNSW gives better recall/latency trade-offs and doesn't require re-training as data grows. IVFFlat is faster to build and uses less memory but degrades as the cluster centroids become stale, which means you have to periodically REINDEX. HNSW's m=16, ef_construction=64 is a good starting point for 768-dim embeddings; raise ef_search at query time if you need higher recall.

What is pgvectorscale and when should I enable it?

pgvectorscale (from Timescale) adds the StreamingDiskANN index on top of pgvector. It pushes pgvector's practical ceiling from ~10-50M to 100M+ vectors by keeping only index metadata in memory and streaming vector comparisons from disk. Enable it when you're past 20M vectors, have NVMe storage, and want to stay on Postgres rather than migrating to a dedicated vector DB.

How do I run HA Postgres with pgvector on Kubernetes?

Use CloudNativePG (CNPG) as the operator. CNPG handles streaming replication, automatic failover, backup via Barman to S3, point-in-time recovery, and rolling upgrades. pgvector does not need an entry in shared_preload_libraries; install it per database with CREATE EXTENSION vector. CNPG's standard Postgres operand images ship with the extension pre-built.

How should I partition vector data for multi-tenant RAG?

Two patterns work: (1) a single embeddings table with a tenant_id column, a B-tree index on tenant_id, and one shared HNSW index (pgvector's HNSW indexes are single-column, so tenant_id cannot be part of the vector index) - works up to ~100 tenants; (2) Postgres native partitioning by tenant_id list or hash - works for thousands of tenants and keeps HNSW indexes per-partition small. Don't create a schema-per-tenant for embeddings - schema operations become the bottleneck long before vector search does.

Can pgvector be used for GCC data-sovereign RAG?