Deploy Milvus on Kubernetes: Production HA Guide for Billion-Scale Vector Search (2026)
Run Milvus 2.4+ in production on Kubernetes: distributed architecture with etcd, Pulsar, and MinIO/S3, Milvus Operator deployment, collection sharding, billion-vector sizing, observability, and GCC data-sovereign deployment.
If you’ve scaled a RAG application past a few hundred million vectors, or need multi-tenant isolation at a scale where payload filtering no longer cuts it, you’ve probably outgrown Qdrant and are looking at Milvus on Kubernetes. Milvus is built for billion-scale vector search and is used by companies like Roblox, Shopee, and eBay. The price is an operationally heavier distributed system.
This guide covers the production topology we deploy for clients running Milvus 2.4+ at scale, including the GCC data-sovereignty patterns.
When Milvus is the right choice
| You’re here | Pick |
|---|---|
| Under 100M vectors, single-tenant or tenant-per-filter | Qdrant (simpler ops) |
| 100M-1B vectors, clear growth trajectory | Milvus or Qdrant - evaluate both |
| 1B+ vectors, multi-tenant, multi-collection | Milvus |
| Need Postgres as primary + small vector feature | pgvector |
| Need managed and US/EU residency is fine | Zilliz Cloud or Pinecone |
If you’re still below 100M vectors, start with our Qdrant on Kubernetes guide. This post assumes you’ve decided on Milvus.
Architecture refresher
Milvus 2.x is a cloud-native fully distributed system with 8 component types across 3 layers:
                   ┌──────────────────────┐
Client SDK ───────▶│ ProxyNode × 3        │  Auth, request routing
(gRPC)             └──────────┬───────────┘
                              │
     ┌────────────────────────┼────────────────────────┐
     ▼                        ▼                        ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│ QueryCoord +     │  │ DataCoord +      │  │ RootCoord        │
│ QueryNode × N    │  │ DataNode × M     │  │ (cluster mgmt)   │
│ (in-memory       │  │ (segment         │  └────────┬─────────┘
│ vector search)   │  │ assembly, idx)   │           │
└─────────┬────────┘  └──────────┬───────┘           │
          │                      │                   │
          ▼                      ▼                   ▼
                ┌──────────────────────────────┐
                │ IndexCoord + IndexNode × K   │  Background index builds
                └──────────────────────────────┘
                                    │
      ┌─────────────────────────────┼─────────────────────────────┐
      ▼                             ▼                             ▼
 ┌──────────┐                  ┌──────────┐                  ┌──────────┐
 │ etcd     │                  │ Pulsar   │                  │ S3 /     │
 │ (metadata│                  │ (WAL /   │                  │ MinIO    │
 │  cluster)│                  │ message  │                  │ (segment │
 │          │                  │ bus)     │                  │  storage)│
 └──────────┘                  └──────────┘                  └──────────┘
Invariants:
- ProxyNode is stateless. Scale horizontally for client RPS.
- QueryNode holds memory-mapped indices. Scale for query RPS and collection size.
- DataNode ingests streaming data. Scale with write throughput.
- IndexNode builds indices in the background. Scale with index-build backlog.
- etcd holds cluster metadata. Must be HA (3 or 5 peers).
- Pulsar is the write-ahead log and internal message bus. The largest operational surface.
- Object storage holds segment data. Do not skimp on durability - this is your source of truth.
Prerequisites
kubectl version --client # 1.28+
helm version # 3.14+
Cluster add-ons:
- cert-manager, ingress-nginx (gRPC-capable)
- external-secrets-operator for S3 + etcd + Pulsar credentials
- prometheus-operator (Milvus exposes rich Prometheus metrics)
- fast SSD StorageClass (gp3, pd-ssd, Premium_LRS) for etcd and Pulsar bookies
- A dedicated node pool for QueryNodes (memory-optimized instances)
External provisioned:
- S3-compatible bucket dedicated to Milvus (e.g., milvus-prod-me-central-1)
- etcd: either a CloudNativePG-style etcd operator, an external etcd cluster, or the in-chart etcd subchart for small deployments
- Pulsar: Apache Pulsar cluster via the Pulsar Operator, or the in-chart Pulsar subchart
Install the Milvus Operator
helm repo add milvus-operator https://zilliztech.github.io/milvus-operator/
helm repo update
kubectl create namespace milvus-operator
helm upgrade --install milvus-operator milvus-operator/milvus-operator \
--namespace milvus-operator \
--version 1.0.0 \
--wait --timeout 5m
Verify CRDs installed:
kubectl get crd | grep milvus
# milvuses.milvus.io
# milvusclusters.milvus.io
# milvusupgrades.milvus.io
Production Milvus CR
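Separate namespaces: milvus for the application, milvus-data for stateful dependencies (etcd, Pulsar, MinIO if in-cluster) when you run them through their own operators or charts. The CR below uses inCluster dependencies, which the operator deploys alongside the CR in the milvus namespace.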
apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: prod
  namespace: milvus
spec:
  mode: cluster
  dependencies:
    etcd:
      inCluster:
        deletionPolicy: Retain
        pvcDeletion: false
        values:
          replicaCount: 3
          persistence:
            size: 50Gi
            storageClass: gp3
          resources:
            requests: {cpu: "500m", memory: "2Gi"}
            limits: {memory: "2Gi"}
    pulsar:
      inCluster:
        deletionPolicy: Retain
        values:
          components:
            autorecovery: true
            broker: true
            bookkeeper: true
            functions: false
            proxy: true
            toolset: false
            zookeeper: true
          broker:
            replicaCount: 3
            resources:
              requests: {cpu: "1", memory: "4Gi"}
              limits: {memory: "4Gi"}
          bookkeeper:
            replicaCount: 3
            volumes:
              journal:
                size: 50Gi
                storageClassName: gp3
              ledgers:
                size: 500Gi
                storageClassName: gp3
            resources:
              requests: {cpu: "2", memory: "8Gi"}
              limits: {memory: "8Gi"}
          zookeeper:
            replicaCount: 3
            volumes:
              data:
                size: 20Gi
                storageClassName: gp3
    storage:
      # Use external S3, not in-cluster MinIO
      external: true
      type: "S3"
      endpoint: "s3.me-central-1.amazonaws.com"
      secretRef: "milvus-s3-creds"
  config:
    common:
      storageType: remote
    minio:
      bucketName: "milvus-prod-me-central-1"
      rootPath: "milvus-prod"
      useSSL: true
      region: "me-central-1"
    log:
      level: "info"
    quotaAndLimits:
      enabled: true
      limits:
        maxCollectionNum: 100
        maxCollectionNumPerDB: 100
      dml:
        enabled: true
        insertRate:
          max: 30 # MB/s - tune per tier
        deleteRate:
          max: 0.1 # MB/s
  components:
    proxy:
      replicas: 3
      resources:
        requests: {cpu: "1", memory: "4Gi"}
        limits: {memory: "4Gi"}
    rootCoord:
      replicas: 1
      resources:
        requests: {cpu: "1", memory: "4Gi"}
        limits: {memory: "4Gi"}
    queryCoord:
      replicas: 1
      resources:
        requests: {cpu: "1", memory: "4Gi"}
    dataCoord:
      replicas: 1
      resources:
        requests: {cpu: "1", memory: "4Gi"}
    indexCoord:
      replicas: 1
      resources:
        requests: {cpu: "1", memory: "4Gi"}
    queryNode:
      replicas: 4 # scale to collection size
      resources:
        requests: {cpu: "4", memory: "32Gi"}
        limits: {memory: "32Gi"}
      nodeSelector:
        nomadx.io/pool: memory-optimized
    dataNode:
      replicas: 3
      resources:
        requests: {cpu: "2", memory: "8Gi"}
        limits: {memory: "8Gi"}
    indexNode:
      replicas: 2
      resources:
        requests: {cpu: "4", memory: "8Gi"}
        limits: {memory: "8Gi"}
Apply:
kubectl apply -f milvus-prod.yaml
# The operator reports overall readiness through the MilvusReady condition on the CR
kubectl -n milvus wait --for=condition=MilvusReady milvus/prod --timeout=30m
First bring-up takes 15-20 minutes while Pulsar’s ZooKeeper ensemble forms and Milvus components handshake. Do not panic at intermediate NotReady states - wait for the operator.
Collection design for production
Milvus collection schema matters more than for Qdrant because its partition model and index choices are richer. A production chat-RAG collection:
import os

from pymilvus import (
    connections, FieldSchema, CollectionSchema, DataType,
    Collection, utility,
)

connections.connect(
    alias="default",
    # Assumes the operator's default Service name (<CR name>-milvus); confirm with `kubectl -n milvus get svc`.
    host="prod-milvus.milvus.svc.cluster.local",
    port="19530",
    secure=True,
    user="root",
    password=os.environ["MILVUS_PASSWORD"],
)
fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=64, is_primary=True),
    FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64),
    FieldSchema(name="source_id", dtype=DataType.VARCHAR, max_length=128),
    FieldSchema(name="chunk_ix", dtype=DataType.INT32),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=4096),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="created_at", dtype=DataType.INT64),
]
schema = CollectionSchema(fields, enable_dynamic_field=True)
col = Collection(
    name="documents",
    schema=schema,
    shards_num=6,                 # write throughput parallelism
    consistency_level="Bounded",  # balance latency vs. consistency
)
# Partition-by-tenant for physical isolation
col.create_partition("tenant_default")
# Vector index: plain HNSW over float32 vectors (this definition applies no quantization)
col.create_index(
    field_name="vector",
    index_params={
        "index_type": "HNSW",
        "metric_type": "COSINE",
        "params": {"M": 16, "efConstruction": 200},
    },
)
# Scalar indexes for filter performance
col.create_index(field_name="tenant_id", index_params={"index_type": "Trie"})
col.create_index(field_name="source_id", index_params={"index_type": "Trie"})
col.load()
Key design decisions:
- Partitions for tenancy, not collections. One collection per logical corpus with partitions per tenant keeps schema management sane. Up to ~4096 partitions per collection is comfortable.
- shards_num = 2 × QueryNode count as a starting heuristic. Can be raised to 16 for very high write throughput.
- Scalar indexes on filter fields. Milvus filtering degrades fast without them. Trie for strings, STL_SORT or INVERTED for numerics.
- consistency_level: Bounded gives ~2-second bounded staleness in exchange for 3-5x query throughput versus Strong. RAG tolerates this.
- Enable enable_dynamic_field so you can add metadata later without schema migrations. Pay for it in slightly larger payloads.
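To make the partition-per-tenant decision concrete, here is a minimal sketch of a tenant-scoped search. It assumes the `documents` schema above and a collection object exposing pymilvus's `Collection.search`; the `partition_for` naming scheme and both helper names are hypothetical, not part of Milvus.

```python
import re
from typing import Any, Sequence


def partition_for(tenant_id: str) -> str:
    """Map a tenant id to a partition name (hypothetical naming scheme).

    Anything outside [A-Za-z0-9_] becomes "_" so the result is safe to pass
    to create_partition(). Distinct ids can collide after sanitizing.
    """
    return "tenant_" + re.sub(r"[^0-9A-Za-z_]", "_", tenant_id)


def tenant_search(col: Any, tenant_id: str, query_vector: Sequence[float],
                  top_k: int = 5, ef: int = 64):
    """Search one tenant's partition and also filter on tenant_id."""
    if '"' in tenant_id or "\\" in tenant_id:
        raise ValueError("tenant_id must not contain quotes or backslashes")
    return col.search(
        data=[list(query_vector)],
        anns_field="vector",
        # HNSW requires ef >= top_k, so clamp it.
        param={"metric_type": "COSINE", "params": {"ef": max(ef, top_k)}},
        limit=top_k,
        expr=f'tenant_id == "{tenant_id}"',
        partition_names=[partition_for(tenant_id)],
        output_fields=["source_id", "chunk_ix", "text"],
    )
```

The partition restricts which segments are scanned, while the `tenant_id` filter is a second guard in case a row lands in the wrong partition. The partition must already exist, for example via `col.create_partition(partition_for("acme-1"))`.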
Backup and disaster recovery
Milvus ships an official milvus-backup tool. Deploy as a CronJob:
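apiVersion: batch/v1
kind: CronJob
metadata:
  name: milvus-backup
  namespace: milvus
spec:
  schedule: "0 2 * * *" # 02:00 daily
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: milvusdb/milvus-backup:v0.4.20
              # Kubernetes does not execute $(date ...) in plain args, so run the CLI through a shell.
              # Assumes the image ships /bin/sh and has milvus-backup on PATH.
              command: ["/bin/sh", "-c"]
              args:
                # Underscore, not hyphen: backup names allow only letters, digits, and underscores.
                # --config selects the config file; -c is the collection-names flag.
                - >-
                  milvus-backup create
                  -n "daily_$(date +%Y%m%d)"
                  --config /config/backup.yaml
              volumeMounts:
                - name: config
                  mountPath: /config
          volumes:
            - name: config
              secret:
                secretName: milvus-backup-config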
The tool copies etcd metadata + segment data into a separate S3 prefix (milvus-prod-backups/). Lifecycle to Glacier after 30 days.
Restore drill cadence: quarterly, on a scratch cluster. Full-restore time for a 100M-vector collection on our reference hardware is ~20 minutes.
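The 30-day Glacier transition mentioned above can be expressed as an S3 lifecycle configuration. The sketch below only builds the configuration document; the rule ID and the helper name are illustrative, and the commented-out boto3 call assumes boto3 is installed with credentials for a bucket you name yourself.

```python
import json


def glacier_lifecycle(prefix: str = "milvus-prod-backups/", days: int = 30) -> dict:
    """Build an S3 lifecycle configuration that archives a backup prefix."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",  # illustrative rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(glacier_lifecycle(), indent=2))
    # To apply it (bucket name is a placeholder you must supply):
    #   import boto3
    #   boto3.client("s3").put_bucket_lifecycle_configuration(
    #       Bucket="<backup-bucket>",
    #       LifecycleConfiguration=glacier_lifecycle(),
    #   )
```

Keep restore drills in mind when picking the storage class: objects in Glacier need a retrieval step before milvus-backup can read them.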
Observability
Milvus exposes rich Prometheus metrics. Build these dashboards on day one:
- Segment count (milvus_querynode_num_segments) - unbalanced growth indicates rebalance lag
- Query latency p99 per collection (milvus_proxy_req_latency_bucket) - target < 100ms
- QueryNode RAM utilization - alert at 80%, because HNSW falling out of memory crashes throughput
- DataNode flush lag (milvus_datanode_flowgraph_flow_count) - ingestion backlog indicator
- Pulsar topic backlog per shard - upstream of Milvus but downstream of your ingest
- etcd leader changes - > 1/hour means etcd is unhappy
Ship OpenTelemetry traces to your existing collector - Milvus v2.4 supports OTEL.
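For the p99 latency target, it helps to see how a quantile is derived from cumulative histogram buckets. The sketch below mirrors the linear interpolation that Prometheus's histogram_quantile performs; the bucket bounds and counts are made-up sample data, not real Milvus output.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) buckets.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    ordered = sorted(buckets)
    if not ordered or ordered[-1][1] == 0:
        return math.nan
    rank = q * ordered[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            # No interpolation is possible inside the +Inf bucket or an empty one.
            if math.isinf(bound) or count == prev_count:
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Sample data: upper bounds in milliseconds with cumulative request counts.
sample = [(50, 900), (100, 980), (250, 995), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample)
print(f"p99 ~ {p99:.0f} ms, within 100 ms target: {p99 < 100}")
```

The result is only as precise as the bucket boundaries, so a p99 that lands in a wide bucket is a coarse estimate.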
Sizing tiers
| Tier | Vectors | Config | Est. monthly cost (AED, EKS me-central-1) |
|---|---|---|---|
| Small | <100M | 2 QueryNode × r6i.xlarge, 3 DataNode × m6i.large, 3-node etcd, 3-node Pulsar | ~35,000 |
| Medium | 100M-500M | 6 QueryNode × r6i.2xlarge + int8 quant, 3 DataNode × m6i.xlarge | ~95,000 |
| Large | 500M-2B | 16 QueryNode × r6i.4xlarge + product quant, 6 DataNode × m6i.xlarge, 5-node etcd | ~280,000 |
| XL | 2B+ | 32+ QueryNode × r6i.8xlarge, 8+ DataNode, 5+ IndexNode, hot/cold tiering | negotiated |
Add 20-30% for S3 storage, egress, and backup infrastructure.
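As a quick budgeting aid, the table and the 20-30% uplift can be combined in a few lines. This is a sketch only: the figures are copied from the table above, the tier boundaries are treated as lower-bound inclusive (an assumption), and the function name is hypothetical.

```python
from typing import Optional, Tuple

# (tier, exclusive upper bound on vectors, base monthly cost in AED)
TIERS = [
    ("Small", 100_000_000, 35_000),
    ("Medium", 500_000_000, 95_000),
    ("Large", 2_000_000_000, 280_000),
    ("XL", None, None),  # negotiated, no list price
]


def estimate_monthly_aed(vectors: int) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Return the tier name and a (low, high) AED range including the 20-30% uplift."""
    for name, upper, base in TIERS:
        if upper is None or vectors < upper:
            if base is None:
                return name, None
            # Integer arithmetic keeps the totals exact.
            return name, (base * 120 // 100, base * 130 // 100)
    raise AssertionError("unreachable: the XL tier has no upper bound")


for n in (50_000_000, 300_000_000, 1_000_000_000, 3_000_000_000):
    print(n, estimate_monthly_aed(n))
```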
GCC data sovereignty checklist
- Full stack in one region: compute, etcd, Pulsar, object storage
- External S3 with SSE-KMS CMK; no cross-region replication out of the region
- Milvus auth enabled; password via external-secrets
- TLS on gRPC (19530) and HTTP ingress (9091)
- NetworkPolicy restricting inbound to client namespaces, internal-only for etcd/Pulsar
- Audit log via Pulsar topic or filebeat to your SIEM
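The NetworkPolicy item can be sketched as a manifest generated in Python and piped to `kubectl apply -f -` as JSON. The pod labels used in the selector are assumptions about how the operator labels proxy pods (check with `kubectl -n milvus get pods --show-labels`), and the client namespace name is a placeholder.

```python
import json
from typing import Sequence


def proxy_ingress_policy(client_namespaces: Sequence[str],
                         namespace: str = "milvus",
                         instance: str = "prod") -> dict:
    """Allow gRPC (19530) to the Milvus proxy only from the listed namespaces."""
    if not client_namespaces:
        # An empty "from" list would match every source, the opposite of the intent.
        raise ValueError("at least one client namespace is required")
    clients = [
        {"namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": ns}}}
        for ns in client_namespaces
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{instance}-proxy-clients", "namespace": namespace},
        "spec": {
            # Assumed labels; adjust to what your proxy pods actually carry.
            "podSelector": {"matchLabels": {
                "app.kubernetes.io/instance": instance,
                "app.kubernetes.io/component": "proxy",
            }},
            "policyTypes": ["Ingress"],
            "ingress": [
                {"from": clients, "ports": [{"protocol": "TCP", "port": 19530}]},
                # Pods in the Milvus namespace itself may still reach the proxy.
                {"from": [{"podSelector": {}}]},
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(proxy_ingress_policy(["rag-api"]), indent=2))
```

A policy like this also blocks Prometheus scrapes from another namespace, so add the monitoring namespace and the metrics port as a separate rule if you scrape the proxy directly.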
Common failure modes
- Queries returning empty results after reload - col.load() didn’t complete before query traffic started. Add a readiness check that polls col.num_entities and utility.load_state().
- Pulsar BookKeeper OOM under write bursts - default journalMaxSizeMB is too small. Raise journalMaxSizeMB: 2048 and size the journal PVC accordingly.
- etcd write latency spikes - etcd WAL and snapshot on slow SSDs. Move etcd to io2 or Premium_LRS.
- QueryNode segment bloat - compaction falling behind. Tune common.retentionDuration and run manual compaction during low traffic.
- Collection drops after cluster restart - etcd data wasn’t persisted (wrong StorageClass). Confirm the StorageClass behind the etcd PVCs uses reclaimPolicy Retain, not Delete.
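The readiness check from the first failure mode can be sketched as a small polling loop. The loop itself has no Milvus dependency; `pymilvus_state` is a hypothetical adapter that assumes `utility.load_state()` returns an enum whose loaded member is named `Loaded`, which is worth verifying against your pymilvus version.

```python
import time
from typing import Callable


def wait_until_loaded(get_state: Callable[[], str],
                      timeout_s: float = 600.0,
                      interval_s: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll get_state() until it returns "Loaded" or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if get_state() == "Loaded":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def pymilvus_state(collection_name: str) -> Callable[[], str]:
    """Adapter around pymilvus, imported lazily so the loop stays dependency-free."""
    from pymilvus import utility  # requires an open connection
    return lambda: utility.load_state(collection_name).name


# Usage inside a readiness probe script (after connections.connect(...)):
#   ok = wait_until_loaded(pymilvus_state("documents"), timeout_s=300)
#   raise SystemExit(0 if ok else 1)
```

Injecting `sleep` and `clock` keeps the loop easy to unit test without waiting in real time.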
What this connects to
Milvus is the vector-retrieval layer. See the Production RAG Stack on Kubernetes reference architecture for how it fits with LiteLLM gateway, Langfuse observability, and the orchestration layer.
For teams under 100M vectors, Qdrant remains the simpler choice.
Getting help
We deploy Milvus for GCC enterprise teams running customer-facing RAG at 500M-2B vector scale, including regulated fintech and government workloads on sovereign cloud. AI/ML Infrastructure on Kubernetes is the engagement entry - typical Milvus bring-up runs 3-5 weeks including a DR drill before production cutover.
Frequently Asked Questions
When should I choose Milvus over Qdrant?
Pick Milvus when you expect to cross 1 billion vectors, need multi-collection hot/cold tiering, or require multi-tenant partitioning beyond what Qdrant payload filters support efficiently. Qdrant is simpler operationally and wins up to ~500M vectors. Milvus runs as a full distributed system (etcd, Pulsar/Kafka, MinIO/S3) - that complexity is the entry cost you pay for the scale ceiling.
Should I use the Milvus Operator or the Helm chart?
Use the Milvus Operator (zilliztech/milvus-operator) for production. It handles rolling upgrades with component-aware ordering, manages etcd and Pulsar subcharts, and lets you declare cluster topology as a Milvus CR. The Helm chart is fine for development, but operator-driven deployments are what every large-scale production deployment actually runs.
What storage backend should I use for Milvus?
Use S3-compatible object storage (AWS S3, Azure Blob with HNS, GCS, MinIO) for persistent segment data. Do not use in-cluster MinIO in production unless you have a dedicated ops team - replicated MinIO clusters are another distributed system to operate. For GCC, Azure Blob in UAE North, AWS S3 in Middle East (Bahrain or UAE), or an in-region MinIO tenant managed by your MSP are the three common choices.
How do I size Milvus for 1 billion vectors?
For 1B vectors at 768 dimensions with HNSW index, plan for ~600 GB working set across QueryNodes (memory-mapped index files; this assumes quantization, because raw float32 vectors alone are 1B × 768 × 4 bytes ≈ 3 TB), 10-20 TB S3 segment storage, 16 QueryNodes minimum (r6i.4xlarge-class as in the Large tier above; 16 nodes at 4 vCPU / 32 GB would total only 512 GB), 6 DataNodes for ingestion, 3 IndexNodes for background index build, 3 ProxyNodes for client traffic, 3-node etcd, and a Pulsar cluster of 3 brokers + 3 bookies. Enable scalar int8 or product quantization above 500M vectors to cut memory 4x-32x.
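The memory side of this sizing is simple arithmetic, sketched below. The graph-overhead estimate assumes an hnswlib-style layout with up to 2 × M neighbours per vector at the base layer stored as 4-byte ids, and ignores upper layers and allocator overhead, so treat the numbers as rough lower bounds.

```python
def vector_bytes(n_vectors: int, dim: int, bytes_per_dim: float = 4.0) -> float:
    """Raw vector payload: float32 is 4 bytes per dimension, int8 is 1."""
    return n_vectors * dim * bytes_per_dim


def hnsw_link_bytes(n_vectors: int, m: int = 16) -> float:
    """Approximate base-layer graph cost: 2*M neighbour ids of 4 bytes each."""
    return float(n_vectors * 2 * m * 4)


GB = 1e9
n, dim = 1_000_000_000, 768
for label, bytes_per_dim in (("float32", 4.0), ("int8", 1.0)):
    total = vector_bytes(n, dim, bytes_per_dim) + hnsw_link_bytes(n)
    print(f"{label}: {total / GB:,.0f} GB before replicas and headroom")
```

Dividing the total by the usable memory per QueryNode (leaving headroom below the 80% alert threshold) gives a first-pass node count to compare with the sizing tiers.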
How do I back up Milvus on Kubernetes?
Milvus stores metadata in etcd and segment data in S3. Back up each: (1) etcd snapshots via a scheduled etcdctl snapshot save (or your etcd operator's backup feature if you run etcd outside the Milvus chart); (2) S3 bucket versioning or replication to a second bucket for segment data, kept in-region where residency rules apply; (3) the milvus-backup tool for consistent collection-level backups that combine both. Test restore quarterly against a DR cluster.
Can Milvus run in a UAE-sovereign deployment?
Yes. Deploy the full stack (Milvus, etcd, Pulsar, S3) into an in-region Kubernetes cluster and bucket. Zilliz Cloud (managed Milvus) hosts data in US/EU/APAC regions and is unsuitable for workloads covered by NESA, CBUAE, or ADGM residency rules. Self-hosted on Azure UAE North, AWS Middle East Bahrain, or Core42 sovereign cloud is the standard pattern we deploy.