April 23, 2026 · 8 min read · Aizhan Azhybaeva

Deploy Milvus on Kubernetes: Production HA Guide for Billion-Scale Vector Search (2026)

Run Milvus 2.4+ in production on Kubernetes: distributed architecture with etcd, Pulsar, and MinIO/S3, Milvus Operator deployment, collection sharding, billion-vector sizing, observability, and GCC data-sovereign deployment.

If you’ve scaled a RAG application past a few hundred million vectors, or need multi-tenant isolation at a scale where payload filtering no longer cuts it, you’ve probably outgrown Qdrant and are looking at Milvus on Kubernetes. Milvus is built for billion-scale vector search and is used by companies like Roblox, Shopee, and eBay. The price is an operationally heavier distributed system.

This guide covers the production topology we deploy for clients running Milvus 2.4+ at scale, including the GCC data-sovereignty patterns.

When Milvus is the right choice

You’re here                                            | Pick
-------------------------------------------------------+---------------------------------
Under 100M vectors, single-tenant or tenant-per-filter | Qdrant (simpler ops)
100M-1B vectors, clear growth trajectory               | Milvus or Qdrant - evaluate both
1B+ vectors, multi-tenant, multi-collection            | Milvus
Need Postgres as primary + small vector feature        | pgvector
Need managed and US/EU residency is fine               | Zilliz Cloud or Pinecone

If you’re still below 100M vectors, start with our Qdrant on Kubernetes guide. This post assumes you’ve decided on Milvus.

Architecture refresher

Milvus 2.x is a cloud-native fully distributed system with 8 component types across 3 layers:

                           ┌──────────────────────┐
   Client SDK      ───────▶│   ProxyNode × 3      │   Auth, request routing
   (gRPC)                   └──────────┬───────────┘
                                       │
              ┌────────────────────────┼────────────────────────┐
              ▼                        ▼                        ▼
    ┌──────────────────┐    ┌──────────────────┐     ┌──────────────────┐
    │  QueryCoord +    │    │  DataCoord +     │     │  RootCoord       │
    │  QueryNode × N   │    │  DataNode × M    │     │  (cluster mgmt)  │
    │  (in-memory      │    │  (segment        │     └────────┬─────────┘
    │  vector search)  │    │  assembly, idx)  │              │
    └─────────┬────────┘    └──────────┬───────┘              │
              │                        │                       │
              ▼                        ▼                       ▼
                     ┌──────────────────────────────┐
                     │  IndexCoord + IndexNode × K  │  Background index builds
                     └──────────────────────────────┘
                                       │
         ┌─────────────────────────────┼─────────────────────────────┐
         ▼                             ▼                             ▼
    ┌──────────┐                  ┌──────────┐                  ┌──────────┐
    │   etcd   │                  │  Pulsar  │                  │   S3 /   │
    │ (metadata│                  │ (WAL /   │                  │   MinIO  │
    │  cluster)│                  │ message  │                  │ (segment │
    │          │                  │  bus)    │                  │  storage)│
    └──────────┘                  └──────────┘                  └──────────┘

Invariants:

  • ProxyNode is stateless. Scale horizontally for client RPS.
  • QueryNode holds memory-mapped indices. Scale for query RPS and collection size.
  • DataNode ingests streaming data. Scale with write throughput.
  • IndexNode builds indices in the background. Scale with index-build backlog.
  • etcd holds cluster metadata. Must be HA (3 or 5 peers).
  • Pulsar is the write-ahead log and internal message bus. The largest operational surface.
  • Object storage holds segment data. Do not skimp on durability - this is your source of truth.

Prerequisites

kubectl version --client    # 1.28+
helm version                # 3.14+

Cluster add-ons:

  • cert-manager, ingress-nginx (gRPC-capable)
  • external-secrets-operator for S3 + etcd + Pulsar credentials
  • prometheus-operator (Milvus exposes rich Prometheus metrics)
  • fast SSD StorageClass (gp3, pd-ssd, Premium_LRS) for etcd and Pulsar bookies
  • A dedicated node pool for QueryNodes (memory-optimized instances)

External provisioned:

  • S3-compatible bucket dedicated to Milvus (e.g., milvus-prod-me-central-1)
  • etcd: an etcd operator (in the style of CloudNativePG for Postgres), an external etcd cluster, or the in-chart etcd subchart for small deployments
  • Pulsar: Apache Pulsar cluster via the Pulsar Operator, or the in-chart Pulsar subchart

Install the Milvus Operator

helm repo add milvus-operator https://zilliztech.github.io/milvus-operator/
helm repo update

kubectl create namespace milvus-operator
helm upgrade --install milvus-operator milvus-operator/milvus-operator \
  --namespace milvus-operator \
  --version 1.0.0 \
  --wait --timeout 5m

Verify CRDs installed:

kubectl get crd | grep milvus
# milvuses.milvus.io
# milvusclusters.milvus.io
# milvusupgrades.milvus.io

Production Milvus CR

Separate namespaces: milvus for the application, milvus-data for stateful dependencies (etcd, Pulsar, MinIO if in-cluster).

apiVersion: milvus.io/v1beta1
kind: Milvus
metadata:
  name: prod
  namespace: milvus
spec:
  mode: cluster
  dependencies:
    etcd:
      inCluster:
        deletionPolicy: Retain
        pvcDeletion: false
        values:
          replicaCount: 3
          persistence:
            size: 50Gi
            storageClass: gp3
          resources:
            requests: {cpu: "500m", memory: "2Gi"}
            limits:   {memory: "2Gi"}
    pulsar:
      inCluster:
        deletionPolicy: Retain
        values:
          components:
            autorecovery: true
            broker: true
            bookkeeper: true
            functions: false
            proxy: true
            toolset: false
            zookeeper: true
          broker:
            replicaCount: 3
            resources:
              requests: {cpu: "1", memory: "4Gi"}
              limits:   {memory: "4Gi"}
          bookkeeper:
            replicaCount: 3
            volumes:
              journal:
                size: 50Gi
                storageClassName: gp3
              ledgers:
                size: 500Gi
                storageClassName: gp3
            resources:
              requests: {cpu: "2", memory: "8Gi"}
              limits:   {memory: "8Gi"}
          zookeeper:
            replicaCount: 3
            volumes:
              data:
                size: 20Gi
                storageClassName: gp3
    storage:
      # Use external S3, not in-cluster MinIO
      external: true
      type: "S3"
      endpoint: "s3.me-central-1.amazonaws.com"
      secretRef: "milvus-s3-creds"
  config:
    common:
      storageType: remote
    minio:
      bucketName: "milvus-prod-me-central-1"
      rootPath: "milvus-prod"
      useSSL: true
      region: "me-central-1"
    log:
      level: "info"
    quotaAndLimits:
      enabled: true
      limits:
        maxCollectionNum: 100
        maxCollectionNumPerDB: 100
      dml:
        enabled: true
        insertRate:
          max: 30                 # MB/s - tune per tier
        deleteRate:
          max: 0.1                # MB/s
  components:
    proxy:
      replicas: 3
      resources:
        requests: {cpu: "1", memory: "4Gi"}
        limits:   {memory: "4Gi"}
    rootCoord:
      replicas: 1
      resources:
        requests: {cpu: "1", memory: "4Gi"}
        limits:   {memory: "4Gi"}
    queryCoord:
      replicas: 1
      resources:
        requests: {cpu: "1", memory: "4Gi"}
    dataCoord:
      replicas: 1
      resources:
        requests: {cpu: "1", memory: "4Gi"}
    indexCoord:
      replicas: 1
      resources:
        requests: {cpu: "1", memory: "4Gi"}
    queryNode:
      replicas: 4                   # scale to collection size
      resources:
        requests: {cpu: "4", memory: "32Gi"}
        limits:   {memory: "32Gi"}
      nodeSelector:
        nomadx.io/pool: memory-optimized
    dataNode:
      replicas: 3
      resources:
        requests: {cpu: "2", memory: "8Gi"}
        limits:   {memory: "8Gi"}
    indexNode:
      replicas: 2
      resources:
        requests: {cpu: "4", memory: "8Gi"}
        limits:   {memory: "8Gi"}

Apply:

kubectl apply -f milvus-prod.yaml
# The operator reports readiness through the MilvusReady condition on the CR
kubectl -n milvus wait --for=condition=MilvusReady milvus/prod --timeout=30m

First bring-up takes 15-20 minutes while Pulsar’s ZooKeeper ensemble forms and Milvus components handshake. Do not panic at intermediate NotReady states - wait for the operator.
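
For node-pool planning, it can help to total what the CR above asks the scheduler for. The sketch below sums the CPU and memory requests of the Milvus components only; the etcd and Pulsar requests are left out, and the dictionary layout is purely illustrative, not something the operator consumes.

```python
# Per-component (replicas, cpu_request_cores, memory_request_GiB),
# copied by hand from the Milvus CR above. etcd and Pulsar are excluded.
COMPONENTS = {
    "proxy":      (3, 1, 4),
    "rootCoord":  (1, 1, 4),
    "queryCoord": (1, 1, 4),
    "dataCoord":  (1, 1, 4),
    "indexCoord": (1, 1, 4),
    "queryNode":  (4, 4, 32),
    "dataNode":   (3, 2, 8),
    "indexNode":  (2, 4, 8),
}


def total_requests(components):
    """Sum CPU cores and memory GiB requested across all replicas."""
    cpu = sum(replicas * cores for replicas, cores, _ in components.values())
    mem = sum(replicas * gib for replicas, _, gib in components.values())
    return cpu, mem


cpu_total, mem_total = total_requests(COMPONENTS)
print(f"Milvus components request {cpu_total} vCPU and {mem_total} GiB")
# prints: Milvus components request 37 vCPU and 196 GiB
```

Of that total, 128 GiB belongs to the four QueryNodes, which is why they get their own memory-optimized pool.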

Collection design for production

Milvus collection schema matters more than for Qdrant because its partition model and index choices are richer. A production chat-RAG collection:

import os

from pymilvus import (
    connections, FieldSchema, CollectionSchema, DataType,
    Collection, utility,
)

connections.connect(
    alias="default",
    host="milvus-prod.milvus.svc.cluster.local",
    port="19530",
    secure=True,
    user="root",
    password=os.environ["MILVUS_PASSWORD"],
)

fields = [
    FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=64, is_primary=True),
    FieldSchema(name="tenant_id", dtype=DataType.VARCHAR, max_length=64),
    FieldSchema(name="source_id", dtype=DataType.VARCHAR, max_length=128),
    FieldSchema(name="chunk_ix", dtype=DataType.INT32),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=4096),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="created_at", dtype=DataType.INT64),
]
schema = CollectionSchema(fields, enable_dynamic_field=True)
col = Collection(
    name="documents",
    schema=schema,
    shards_num=6,                    # write throughput parallelism
    consistency_level="Bounded",     # balance latency vs. consistency
)

# Partition-by-tenant for physical isolation
col.create_partition("tenant_default")

# Vector index: HNSW on full-precision vectors (this index_type applies no quantization)
col.create_index(
    field_name="vector",
    index_params={
        "index_type": "HNSW",
        "metric_type": "COSINE",
        "params": {"M": 16, "efConstruction": 200},
    },
)
# Scalar indexes for filter performance
col.create_index(field_name="tenant_id", index_params={"index_type": "Trie"})
col.create_index(field_name="source_id", index_params={"index_type": "Trie"})

col.load()
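
To show why the scalar indexes on tenant_id and source_id matter, here is a minimal sketch of building the filter expression that a search would carry. The helper name build_filter and the sample tenant and source values are hypothetical; the search call in the trailing comment assumes the col object and field names from the snippet above and a query_vector you supply.

```python
import json


def build_filter(tenant_id, source_ids=None):
    """Build a Milvus boolean expression over the indexed scalar fields."""
    # json.dumps yields a double-quoted literal with quotes/backslashes escaped.
    clauses = [f"tenant_id == {json.dumps(tenant_id, ensure_ascii=False)}"]
    if source_ids:
        quoted = ", ".join(json.dumps(s, ensure_ascii=False) for s in source_ids)
        clauses.append(f"source_id in [{quoted}]")
    return " and ".join(clauses)


expr = build_filter("acme", ["doc-1", "doc-2"])
print(expr)
# prints: tenant_id == "acme" and source_id in ["doc-1", "doc-2"]

# Against a live cluster the expression would be passed to search, e.g.:
# hits = col.search(
#     data=[query_vector], anns_field="vector",
#     param={"metric_type": "COSINE", "params": {"ef": 64}},
#     limit=5, expr=expr, output_fields=["text", "source_id"],
# )
```

Building the expression in one place keeps tenant scoping from being forgotten on individual query paths.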

Key design decisions:

  • Partitions for tenancy, not collections. One collection per logical corpus with partitions per tenant keeps schema management sane. Up to ~4096 partitions per collection is comfortable.
  • shards_num = 2 × DataNode count as a starting heuristic (6 for the three DataNodes in the CR above, matching the example). Can be raised to 16 for very high write throughput.
  • Scalar indexes on filter fields. Milvus filtering degrades fast without them. Trie for strings, STL_SORT or INVERTED for numerics.
  • consistency_level: Bounded gives ~2-second bounded staleness in exchange for 3-5x query throughput versus Strong. RAG tolerates this.
  • Set enable_dynamic_field=True so you can add metadata fields later without schema migrations. The cost is slightly larger payloads.
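
The shards_num heuristic from the list above can be written down as a small helper. This is only a sketch: the function name is made up, node_count is whichever node count you apply the heuristic to, and the cap of 16 is the upper figure quoted in the list.

```python
def suggested_shards_num(node_count, factor=2, cap=16):
    """Starting point for shards_num: factor x node count, capped at `cap`."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return min(node_count * factor, cap)


for nodes in (3, 4, 12):
    print(nodes, "->", suggested_shards_num(nodes))
# prints:
# 3 -> 6
# 4 -> 8
# 12 -> 16
```

Treat the result as a starting value to validate under real write load, not as a fixed rule.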

Backup and disaster recovery

Milvus ships an official milvus-backup tool. Deploy as a CronJob:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: milvus-backup
  namespace: milvus
spec:
  schedule: "0 2 * * *"            # 02:00 daily
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
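          containers:
            - name: backup
              image: milvusdb/milvus-backup:v0.4.20
              # Kubernetes does not shell-expand args, so run through a shell
              # to evaluate $(date ...). Assumes the image ships /bin/sh and has
              # milvus-backup on PATH; use the full binary path if it does not.
              command: ["/bin/sh", "-c"]
              args:
                # Underscore in the name: backup names are restricted to
                # letters, digits, and underscores. -c selects collections,
                # so the config file is passed with --config.
                - >-
                  milvus-backup create
                  -n "daily_$(date +%Y%m%d)"
                  --config /config/backup.yaml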
              volumeMounts:
                - name: config
                  mountPath: /config
          volumes:
            - name: config
              secret:
                secretName: milvus-backup-config

The tool copies etcd metadata and segment data into a separate S3 prefix (milvus-prod-backups/). Add an S3 lifecycle rule that transitions backups to Glacier after 30 days.

Restore drill cadence: quarterly, on a scratch cluster. Full-restore time for a 100M-vector collection on our reference hardware is ~20 minutes.

Observability

Milvus exposes rich Prometheus metrics. Build these dashboards on day one:

  • Segment count (milvus_querynode_num_segments) - unbalanced growth indicates rebalance lag
  • Query latency p99 per collection (milvus_proxy_req_latency_bucket) - target < 100ms
  • QueryNode RAM utilization - alert at 80%, because HNSW falling out of memory crashes throughput
  • DataNode flush lag (milvus_datanode_flowgraph_flow_count) - ingestion backlog indicator
  • Pulsar topic backlog per shard - upstream of Milvus but downstream of your ingest
  • etcd leader changes - > 1/hour means etcd is unhappy

Ship OpenTelemetry traces to your existing collector - Milvus v2.4 supports OTEL.
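
For the p99 latency target, it helps to understand how a quantile is derived from histogram buckets such as milvus_proxy_req_latency_bucket. The sketch below reimplements the linear interpolation that PromQL's histogram_quantile performs; the bucket values are invented sample data, and the bounds are assumed to be in milliseconds.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Finds the first bucket whose cumulative count reaches the target rank,
    then interpolates linearly between that bucket's lower and upper bound.
    """
    if not buckets:
        return math.nan
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            span = count - lower_count
            if span == 0:
                return upper_bound
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented sample: cumulative request counts per latency bound (ms).
SAMPLE = [(10, 500), (50, 900), (100, 985), (500, 998), (math.inf, 1000)]
p99 = histogram_quantile(0.99, SAMPLE)
print(f"p99 ~ {p99:.1f} ms, target met: {p99 < 100}")
```

With this sample the p99 lands in the 100-500 ms bucket, so the 100 ms target is missed even though 98.5% of requests finish under 100 ms. Coarse buckets near the target make the estimate imprecise.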

Sizing tiers

Tier   | Vectors   | Config                                                                           | Est. monthly cost (AED, EKS me-central-1)
-------+-----------+----------------------------------------------------------------------------------+------------------------------------------
Small  | <100M     | 2 QueryNode × r6i.xlarge, 3 DataNode × m6i.large, 3-node etcd, 3-node Pulsar     | ~35,000
Medium | 100M-500M | 6 QueryNode × r6i.2xlarge + int8 quant, 3 DataNode × m6i.xlarge                  | ~95,000
Large  | 500M-2B   | 16 QueryNode × r6i.4xlarge + product quant, 6 DataNode × m6i.xlarge, 5-node etcd | ~280,000
XL     | 2B+       | 32+ QueryNode × r6i.8xlarge, 8+ DataNode, 5+ IndexNode, hot/cold tiering         | negotiated

Add 20-30% for S3 storage, egress, and backup infrastructure.
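
As a worked example of the 20-30% uplift, the sketch below applies it to the base figures from the table. The function and dictionary names are illustrative, and integer arithmetic is used so the AED figures stay exact.

```python
# Base monthly estimates in AED, taken from the sizing table above.
TIER_BASE_AED = {"Small": 35_000, "Medium": 95_000, "Large": 280_000}


def budget_range(tier, uplift_pct=(20, 30)):
    """Return (low, high) monthly budget after adding the storage/egress uplift."""
    base = TIER_BASE_AED[tier]
    low, high = uplift_pct
    return base * (100 + low) // 100, base * (100 + high) // 100


for tier in TIER_BASE_AED:
    low, high = budget_range(tier)
    print(f"{tier}: {low:,} - {high:,} AED/month")
# prints:
# Small: 42,000 - 45,500 AED/month
# Medium: 114,000 - 123,500 AED/month
# Large: 336,000 - 364,000 AED/month
```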

GCC data sovereignty checklist

  • Full stack in one region: compute, etcd, Pulsar, object storage
  • External S3 with SSE-KMS CMK; no cross-region replication out of the region
  • Milvus auth enabled; password via external-secrets
  • TLS on gRPC (19530) and HTTP ingress (9091)
  • NetworkPolicy restricting inbound to client namespaces, internal-only for etcd/Pulsar
  • Audit log via Pulsar topic or filebeat to your SIEM

Common failure modes

  • Queries returning empty results after reload - col.load() didn’t complete before query traffic started. Add a readiness check that polls col.num_entities and utility.load_state().
  • Pulsar BookKeeper OOM under write bursts - default journalMaxSizeMB is too small. Raise journalMaxSizeMB to 2048 and size the journal PVC accordingly.
  • etcd write latency spikes - etcd WAL and snapshot on slow SSDs. Move etcd to io2 or Premium_LRS.
  • QueryNode segment bloat - compaction falling behind. Tune common.retentionDuration and run manual compaction during low traffic.
  • Collections missing after a cluster restart - etcd data wasn’t persisted (wrong StorageClass). Confirm the etcd PVCs are bound to persistent volumes whose reclaim policy is Retain, not Delete.
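
The readiness check mentioned in the first failure mode could look roughly like this. It is a sketch under assumptions: wait_until and collection_is_loaded are hypothetical helper names, the timeout and interval defaults are arbitrary, and the load-state comparison assumes utility.load_state() returns an enum whose loaded member is named Loaded.

```python
import time


def wait_until(check, timeout_s=300.0, interval_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def collection_is_loaded(name):
    """True once Milvus reports the collection as fully loaded."""
    # Imported lazily so the polling helper works without pymilvus installed.
    from pymilvus import utility
    # Assumption: the returned enum exposes a member named "Loaded".
    return utility.load_state(name).name == "Loaded"


# Usage against a live cluster, after connections.connect(...):
# if not wait_until(lambda: collection_is_loaded("documents")):
#     raise SystemExit("collection did not finish loading; not ready for traffic")
```

Wiring this into the application's readiness probe keeps query traffic away until the load has finished.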

What this connects to

Milvus is the vector-retrieval layer. See the Production RAG Stack on Kubernetes reference architecture for how it fits with LiteLLM gateway, Langfuse observability, and the orchestration layer.

For teams under 100M vectors, Qdrant remains the simpler choice.

Getting help

We deploy Milvus for GCC enterprise teams running customer-facing RAG at 500M-2B vector scale, including regulated fintech and government workloads on sovereign cloud. AI/ML Infrastructure on Kubernetes is the engagement entry - typical Milvus bring-up runs 3-5 weeks including a DR drill before production cutover.

Frequently Asked Questions

When should I choose Milvus over Qdrant?

Pick Milvus when you expect to cross 1 billion vectors, need multi-collection hot/cold tiering, or require multi-tenant partitioning beyond what Qdrant payload filters support efficiently. Qdrant is simpler operationally and wins up to ~500M vectors. Milvus runs as a full distributed system (etcd, Pulsar/Kafka, MinIO/S3) - that complexity is the entry cost you pay for the scale ceiling.

Should I use the Milvus Operator or the Helm chart?

Use the Milvus Operator (zilliztech/milvus-operator) for production. It handles rolling upgrades with component-aware ordering, manages the etcd and Pulsar subcharts, and lets you declare cluster topology as a Milvus CR. The Helm chart is fine for development, but operator-driven deployments are the norm for large-scale production.

What storage backend should I use for Milvus?

Use S3-compatible object storage (AWS S3, Azure Blob with HNS, GCS, MinIO) for persistent segment data. Do not use in-cluster MinIO in production unless you have a dedicated ops team - replicated MinIO clusters are another distributed system to operate. For GCC, Azure Blob in UAE North, AWS S3 in Middle East (Bahrain or UAE), or an in-region MinIO tenant managed by your MSP are the three common choices.

How do I size Milvus for 1 billion vectors?

For 1B vectors at 768 dimensions with an HNSW index, plan for a ~600 GB working set across QueryNodes (memory-mapped index files, with quantization enabled; the raw float32 vectors alone are about 3 TB), 10-20 TB of S3 segment storage, 16 QueryNodes minimum (r6i.4xlarge, 16 vCPU / 128 GiB each, as in the Large tier above), 6 DataNodes for ingestion, 3 IndexNodes for background index builds, 3 ProxyNodes for client traffic, a 3-node etcd, and a Pulsar cluster of 3 brokers + 3 bookies. Enable scalar int8 or product quantization above 500M vectors to cut memory 4x-32x.
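
To see where the memory goes, a back-of-the-envelope estimate helps. The sketch below is only the arithmetic, under loud assumptions: M=16 as in the index example earlier, roughly 2×M four-byte neighbour ids per vector on the bottom HNSW layer, and no allowance for upper layers, allocator overhead, or memory mapping. It is not meant to reproduce the working-set figure quoted above.

```python
def hnsw_memory_gb(num_vectors, dim, bytes_per_component=4, m=16):
    """Rough HNSW footprint in GB (10**9 bytes): vectors plus layer-0 links."""
    vector_bytes = num_vectors * dim * bytes_per_component
    # Assumption: ~2*M neighbour ids of 4 bytes each per vector on layer 0;
    # upper layers and allocator overhead are ignored.
    graph_bytes = num_vectors * 2 * m * 4
    return (vector_bytes + graph_bytes) / 1e9


for label, width in (("float32", 4), ("int8", 1)):
    print(label, round(hnsw_memory_gb(1_000_000_000, 768, width)), "GB")
# prints:
# float32 3200 GB
# int8 896 GB
```

The vector payload dominates, which is why quantization rather than graph tuning is the main lever at this scale.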

How do I back up Milvus on Kubernetes?

Milvus stores metadata in etcd and segment data in S3. Back up both: (1) take etcd snapshots with a scheduled etcdctl snapshot save job (or through your etcd operator if you run etcd outside the Milvus chart); (2) enable S3 bucket versioning or cross-region replication for segment data; (3) use the milvus-backup tool to create consistent collection-level backups that combine both. Test restores quarterly against a DR cluster.

Can Milvus run in a UAE-sovereign deployment?

Yes. Deploy the full stack (Milvus, etcd, Pulsar, S3) into an in-region Kubernetes cluster and bucket. Zilliz Cloud (managed Milvus) hosts data in US/EU/APAC regions and is unsuitable for workloads covered by NESA, CBUAE, or ADGM residency rules. Self-hosted on Azure UAE North, AWS Middle East Bahrain, or Core42 sovereign cloud is the standard pattern we deploy.
