April 23, 2026 · 7 min read · Aizhan Azhybaeva

Deploy BentoML + Yatai on Kubernetes: Production ML Model Serving Guide (2026)

Serve classical ML models in production on Kubernetes with BentoML and Yatai: containerized bentos, auto-scaling deployments, traffic splitting for A/B tests, GPU and CPU pool routing, and integration with MLflow and Prometheus.

Production ML programs rarely consist of just LLMs. Behind every customer-facing AI feature is a pile of classical models - fraud classifiers, churn predictors, ranking models, recommendation engines, OCR pipelines - that need the same rigour of deployment, monitoring, and traffic management. BentoML + Yatai on Kubernetes is the most ergonomic way we’ve found to serve that portfolio at scale.

This guide covers the production pattern we use with clients: Yatai as model registry and deployment control plane, BentoML runners split across GPU and CPU pools, MLflow integration for lineage, and adaptive batching for cost efficiency.

Where BentoML fits in the ML stack

Workload	Best fit
LLM generation (GPT-class)	vLLM / TGI / Triton
Sentence / text embeddings	BentoML or TEI, either works
Tabular classifier / regressor	BentoML
Computer vision inference	BentoML
Multi-model ensemble / pipeline	BentoML
Streaming ML feature transforms	BentoML or dedicated streaming framework

BentoML’s developer surface is the differentiator: a decorator-based service API, automatic containerization, and a Kubernetes-native deployment story via Yatai CRDs. For a team that just wants to run bentoml build and ship, there is less operational overhead than with KServe.

Architecture

      ┌──────────────────┐
      │  ML Engineer     │   bentoml build
      │  CI/CD           │
      └────────┬─────────┘
               │  push bento
               ▼
      ┌──────────────────┐
      │   Yatai          │───▶ S3 / Blob (bento artifacts, models)
      │   (registry +    │───▶ Postgres (metadata)
      │    control plane)│
      └────────┬─────────┘
               │  BentoDeployment CR
               ▼
      ┌──────────────────────────────────────────────┐
      │           Yatai Deployment Operator          │
      │  (reconciles CRs to Deployments + Services)  │
      └────────┬─────────────────────────┬───────────┘
               │                         │
               ▼                         ▼
      ┌──────────────────┐      ┌──────────────────┐
      │  API Server pool │      │  Runner pool(s)  │
      │  (ASGI app,      │◄────▶│  (model compute, │
      │  CPU, stateless) │ HTTP │  GPU or CPU)     │
      └──────────────────┘      └──────────────────┘
               ▲                         ▲
               │ HPA on RPS              │ HPA on queue depth /
               │                         │ GPU utilization
               │                         ▼
    Client (LiteLLM/app)        ┌──────────────────┐
                                │   GPU node pool  │
                                │   (L4 / A10 /    │
                                │    H100 /MI300)  │
                                └──────────────────┘

Key invariants:

API Server is stateless, scales on RPS.
Runners hold model state in memory, scale on queue depth or GPU utilization.
Adaptive batching happens inside runners.
Yatai holds the registry and control plane; it is not in the serving hot path.
Bento artifacts (models + code + dependencies) live in S3-compatible storage.
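
Both scaling invariants above rest on the standard Kubernetes HPA rule, desiredReplicas = ceil(currentReplicas x currentMetric / targetMetric). A minimal sketch of that arithmetic follows; the function name and the sample queue-depth numbers are illustrative assumptions, and the HPA's tolerance band and stabilization windows are deliberately left out.

```python
import math

def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Core HPA rule, clamped to the configured replica bounds."""
    if target <= 0:
        raise ValueError("target must be positive")
    raw = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, raw))

# Hypothetical runner pool: 2 replicas, average queue depth 30, target 10.
print(desired_replicas(2, 30.0, 10.0, min_replicas=2, max_replicas=10))
```

The same function describes the API server pool if you feed it requests per second instead of queue depth.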

Prerequisites

Kubernetes 1.28+
Helm 3.14+
cert-manager + ingress-nginx
A Postgres instance for Yatai metadata (or use CNPG as in the pgvector guide)
S3-compatible bucket for bento artifacts
A dedicated GPU node pool if serving GPU models

Install Yatai

helm repo add bentoml https://charts.bentoml.org
helm repo update
kubectl create namespace yatai-system

Production values.yaml:

# yatai-values.yaml
image:
  registry: quay.io
  repository: bentoml/yatai
  tag: "1.1.0"

replicas: 2

postgresql:
  deploy: false
  host: yatai-pg-rw.data.svc.cluster.local
  port: 5432
  database: yatai
  user: yatai
  existingSecret:
    name: yatai-pg-creds
    passwordKey: password

s3:
  bucketName: nomadx-bentos-me-central-1
  region: me-central-1
  endpoint: https://s3.me-central-1.amazonaws.com
  existingSecret:
    name: yatai-s3-creds

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: yatai.example.ae
      paths: [{path: /, pathType: Prefix}]
  tls:
    - secretName: yatai-tls
      hosts: [yatai.example.ae]

resources:
  requests: {cpu: 500m, memory: 1Gi}
  limits: {memory: 2Gi}

serviceMonitor:
  enabled: true

Install:

helm upgrade --install yatai bentoml/yatai \
  --namespace yatai-system \
  --values yatai-values.yaml \
  --wait --timeout 10m

# Install the deployment operator
helm upgrade --install yatai-deployment bentoml/yatai-deployment \
  --namespace yatai-deployment \
  --create-namespace \
  --wait --timeout 10m

Packaging a production bento

A minimal fraud-scoring service:

# service.py
from __future__ import annotations
import bentoml
import numpy as np
from bentoml.io import NumpyNdarray, JSON

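# Load the model from the local BentoML store and wrap it in a runner,
# which Yatai later schedules as its own pool.
fraud_model_runner = bentoml.xgboost.get("fraud_model:latest").to_runner()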

svc = bentoml.Service(
    name="fraud-scoring",
    runners=[fraud_model_runner],
)

@svc.api(
    input=NumpyNdarray(dtype="float32", shape=(-1, 42)),
    output=JSON(),
    route="/predict",
)
async def predict(features: np.ndarray) -> dict:
    scores = await fraud_model_runner.predict.async_run(features)
    return {
        "scores": scores.tolist(),
        "n": len(scores),
    }
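
To show what a caller sends to this endpoint, here is a small client-side sketch using only the standard library. It assumes the service is reachable at the fraud.api.example.ae host used later in this guide and that the NumpyNdarray input is posted as a JSON nested list; build_predict_request is a hypothetical helper, not part of BentoML.

```python
import json
import urllib.request

N_FEATURES = 42  # matches shape=(-1, 42) in service.py

def build_predict_request(rows, base_url="https://fraud.api.example.ae"):
    """Validate a batch of feature rows and build the POST for /predict."""
    if not rows:
        raise ValueError("need at least one row")
    for i, row in enumerate(rows):
        if len(row) != N_FEATURES:
            raise ValueError(f"row {i} has {len(row)} features, expected {N_FEATURES}")
    body = json.dumps([[float(x) for x in row] for row in rows]).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_predict_request([[0.0] * N_FEATURES])
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=5), then json-decode the body.
```

Checking the feature count on the client keeps malformed batches from reaching the runner, where a shape error would fail every request batched alongside it.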

The bentofile:

# bentofile.yaml
service: "service:svc"
labels:
  owner: fraud-team
  stage: production
include:
  - "*.py"
python:
  packages:
    - xgboost==2.0.3
    - numpy==1.26.4
  lock_packages: true
models:
  - "fraud_model:latest"
docker:
  distro: debian
  python_version: "3.11"
  cuda_version: null                    # CPU-only model

Build and push:

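bentoml yatai login --api-token "$YATAI_API_TOKEN" --endpoint https://yatai.example.ae   # once per machine; the variable name is a placeholder
bentoml build
bentoml push fraud-scoring:latest          # to Yatai registry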

For GPU models, set cuda_version: "12.1" and add the CUDA-specific dependencies the model needs.

Deploy via BentoDeployment CR

apiVersion: serving.yatai.ai/v2alpha1
kind: BentoDeployment
metadata:
  name: fraud-scoring-prod
  namespace: ml-serving
spec:
  bento: fraud-scoring:prod-2026-04-24
  autoscaling:
    minReplicas: 3
    maxReplicas: 20
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
  resources:
    requests: {cpu: "1", memory: "2Gi"}
    limits:   {memory: "2Gi"}
  runners:
    - name: fraud_model_runner
      autoscaling:
        minReplicas: 2
        maxReplicas: 10
      resources:
        requests: {cpu: "2", memory: "4Gi"}
        limits:   {memory: "4Gi"}
      nodeSelector:
        nomadx.io/pool: cpu-optimized
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    host: fraud.api.example.ae

For a GPU model deployment, swap the runner section:

runners:
  - name: embedding_runner
    resources:
      requests: {cpu: "4", memory: "16Gi", "nvidia.com/gpu": "1"}
      limits:   {memory: "16Gi", "nvidia.com/gpu": "1"}
    nodeSelector:
      node.kubernetes.io/gpu-family: "L4"
    tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists

Apply:

kubectl apply -f fraud-deployment.yaml

The operator reconciles this into API-server Deployments, runner Deployments, Services, HPAs, and an Ingress rule.
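
In CI, the manifest is usually a template whose bento tag gets pinned per release. A hedged sketch of that step follows: pin_bento is a hypothetical helper, the template is trimmed for brevity (a real one also carries apiVersion and the full spec), and rejecting :latest is our own policy choice, not something Yatai enforces.

```python
import copy
import json

def pin_bento(manifest: dict, bento_tag: str) -> dict:
    """Return a copy of a BentoDeployment manifest pointing at bento_tag."""
    if manifest.get("kind") != "BentoDeployment":
        raise ValueError("not a BentoDeployment manifest")
    if ":" not in bento_tag or bento_tag.endswith(":latest"):
        raise ValueError("use an immutable name:version tag")
    pinned = copy.deepcopy(manifest)  # leave the template untouched
    pinned["spec"]["bento"] = bento_tag
    return pinned

template = {
    "kind": "BentoDeployment",
    "metadata": {"name": "fraud-scoring-prod", "namespace": "ml-serving"},
    "spec": {"bento": "fraud-scoring:placeholder",
             "runners": [{"name": "fraud_model_runner"}]},
}
manifest = pin_bento(template, "fraud-scoring:prod-2026-04-24")
print(json.dumps(manifest, indent=2))  # kubectl apply -f accepts JSON as well as YAML
```

Pinning an explicit version makes each rollout reproducible and gives you a concrete tag to roll back to.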

Adaptive batching

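Adaptive batching is BentoML’s main throughput lever. Built-in framework runners pick it up from the model’s signatures; for a custom runner, mark the method as batchable:

# Goes inside a custom bentoml.Runnable subclass that sets self.model in __init__
@bentoml.Runnable.method(batchable=True, batch_dim=0)
def predict(self, input_data: np.ndarray) -> np.ndarray:
    return self.model.predict(input_data)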

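Runtime config goes in the BentoML configuration file (for example bentoml_configuration.yaml, loaded through the BENTOML_CONFIG environment variable), not in bentofile.yaml: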

runners:
  fraud_model_runner:
    batching:
      enabled: true
      max_batch_size: 256
      max_latency_ms: 20
    workers_per_resource: 1
    resources:
      cpu: 2
      memory: "4Gi"

The runner accumulates incoming requests until it reaches max_batch_size or max_latency_ms elapses, whichever comes first, then processes them as a single batch. For XGBoost and similar models, this can give a 5-10x throughput improvement over single-request inference with negligible latency impact.
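
To make the size-or-latency rule concrete, here is a deliberately simplified model of it. This is not BentoML's dispatcher, which adapts its wait time to observed latency; it is a fixed-window sketch, and group_into_batches and the arrival times are made up for illustration.

```python
def group_into_batches(arrivals_ms, max_batch_size, max_latency_ms):
    """Group sorted arrival times (ms) into batches.

    A batch is closed when it is full, or when the next request arrives
    after the oldest queued request has already waited max_latency_ms.
    """
    batches, current = [], []
    for t in arrivals_ms:
        if current and (len(current) >= max_batch_size
                        or t - current[0] > max_latency_ms):
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

arrivals = [0, 1, 2, 3, 30, 31, 70]
print(group_into_batches(arrivals, max_batch_size=3, max_latency_ms=20))
```

Bursty traffic fills batches by size, while sparse traffic is bounded by the latency window, which is why the observed batch size is the number to watch when tuning max_batch_size.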

MLflow integration

# build_bento.py - runs in CI
import mlflow
import bentoml

mlflow.set_tracking_uri("https://mlflow.example.ae")
model_uri = "models:/fraud_model/Production"

bento_model = bentoml.mlflow.import_model(
    name="fraud_model",
    model_uri=model_uri,
    signatures={"predict": {"batchable": True, "batch_dim": 0}},
)
print(f"Imported {bento_model.tag}")

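Then run bentoml build and bentoml push as usual. A model imported this way is stored under the MLflow flavour, so service.py should load it with bentoml.mlflow.get("fraud_model:latest").to_runner() rather than bentoml.xgboost.get(...). This keeps MLflow as the lineage system of record while BentoML handles packaging and serving.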
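
The promotion job itself is a short, ordered list of commands. The sketch below only assembles that list; ci_plan is a hypothetical helper, the file names are the ones used in this guide, and the --version flag on bentoml build is worth confirming against your installed BentoML release.

```python
import shlex

def ci_plan(bento_name: str, version: str, manifest_path: str):
    """Ordered argv lists for the promote-to-production CI job."""
    tag = f"{bento_name}:{version}"
    return [
        ["python", "build_bento.py"],                # import the model from MLflow
        ["bentoml", "build", "--version", version],  # package the bento
        ["bentoml", "push", tag],                    # upload to Yatai
        ["kubectl", "apply", "-f", manifest_path],   # roll out the BentoDeployment
    ]

for argv in ci_plan("fraud-scoring", "prod-2026-04-24", "fraud-deployment.yaml"):
    print(shlex.join(argv))
    # In CI: subprocess.run(argv, check=True), so a failed step stops the job.
```

Keeping the steps as argv lists rather than one shell string avoids quoting bugs and lets the job stop at the first failure.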

Traffic splitting for A/B tests

Yatai supports canary releases and traffic splitting via multiple BentoDeployments behind an Ingress with weighted backends. A simpler pattern uses ingress-nginx canary annotations:

# Baseline deployment
apiVersion: serving.yatai.ai/v2alpha1
kind: BentoDeployment
metadata: {name: fraud-v1, namespace: ml-serving}
spec:
  bento: fraud-scoring:v1
  ingress:
    enabled: true
    host: fraud.api.example.ae
---
# Canary deployment
apiVersion: serving.yatai.ai/v2alpha1
kind: BentoDeployment
metadata: {name: fraud-v2-canary, namespace: ml-serving}
spec:
  bento: fraud-scoring:v2
  ingress:
    enabled: true
    host: fraud.api.example.ae
    annotations:
      nginx.ingress.kubernetes.io/canary: "true"
      nginx.ingress.kubernetes.io/canary-weight: "10"    # 10% canary

Raise canary-weight over hours as metrics look good; flip back to v1 instantly if not.
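
The ramp-or-rollback decision can be automated with a few lines. This is a sketch under stated assumptions: the weight steps, the error-rate tolerance, and next_canary_weight are all hypothetical, and the error rates would come from your own Prometheus queries.

```python
RAMP = [10, 25, 50, 100]  # hypothetical canary-weight steps, in percent

def next_canary_weight(current: int, baseline_err: float, canary_err: float,
                       tolerance: float = 0.005) -> int:
    """Return the next canary weight, or 0 to send all traffic back to v1."""
    if canary_err > baseline_err + tolerance:
        return 0
    later = [w for w in RAMP if w > current]
    return later[0] if later else current

print(next_canary_weight(10, baseline_err=0.010, canary_err=0.011))  # healthy canary
print(next_canary_weight(25, baseline_err=0.010, canary_err=0.030))  # regression
```

The returned value is what you would write into the nginx.ingress.kubernetes.io/canary-weight annotation; a weight of 0 routes no requests to the canary.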

Observability

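BentoML exposes Prometheus metrics from API servers and runners on the /metrics endpoint. Metric names have changed between BentoML releases, so confirm them against a running service. Key metrics: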

bentoml_service_request_duration_seconds_bucket - latency per API and runner
bentoml_service_request_total - request volume by status code
bentoml_runner_adaptive_batch_size - observed batch size, useful to tune max_batch_size
bentoml_runner_busy_ratio - runner utilization; HPA signal

Structured request logs via BentoML’s tracing middleware ship to your OTEL collector.
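
The _bucket series above is a cumulative histogram, and latency percentiles are estimated from it by linear interpolation. In practice you would use PromQL's histogram_quantile(); the sketch below reproduces the same estimate offline, with made-up bucket counts, to show what the number means.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the +Inf bucket
            share = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * share
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative counts for a request-duration histogram (seconds).
sample = [(0.005, 100), (0.01, 400), (0.025, 900), (0.05, 990), (math.inf, 1000)]
print(histogram_quantile(0.95, sample))
```

Because the estimate is interpolated inside a bucket, its accuracy depends on bucket boundaries sitting close to your latency SLO.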

Sizing tiers

Tier	Models	Deployments / replicas	Yatai / Postgres	Est. monthly cost (AED, EKS me-central-1)
Small	1-5	3 API + 2 runner per model	Small Postgres	~8,000
Medium	5-25	3-10 API + 2-10 runners	db.r6g.large	~30,000
Large	25-100	Mixed GPU + CPU pools, autoscaled	db.r6g.xlarge	~90,000

GPU models dominate the bill above 10 concurrent model deployments.

Common failure modes

Cold-start latency on new replicas - model loading into memory is slow. Set minReplicas such that scale-up doesn’t trigger during latency-critical windows.
OOM on GPU runners under adaptive batch - max_batch_size is too high for available VRAM. Lower until VRAM usage fits with headroom for variable input shapes.
Yatai deployment operator stuck in Pending - usually a PVC storage class mismatch or missing S3 creds. kubectl describe bentodeployment ... shows events.
Requests timing out after bento update - new bento version has different API signature and clients haven’t updated. Use explicit versioning on endpoints (/v1/predict, /v2/predict) and traffic-split rather than hot-swap.
Runner HPA thrashing - CPU-based scaling is noisy for ML workloads. Scale on queue depth (via KEDA + Prometheus scaler) or bentoml_runner_busy_ratio instead.
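
For the GPU OOM case, a back-of-envelope bound on max_batch_size helps before you start trial and error. The sketch assumes memory grows roughly linearly with batch size; max_batch_for_vram, the 20% headroom default, and the sample figures are hypothetical, and the per-sample footprint has to be measured for your model.

```python
def max_batch_for_vram(vram_gib, model_gib, per_sample_mib, headroom=0.2):
    """Largest batch that fits in VRAM after reserving a headroom fraction."""
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    if per_sample_mib <= 0:
        raise ValueError("per_sample_mib must be positive")
    usable_mib = vram_gib * 1024 * (1 - headroom) - model_gib * 1024
    return max(0, int(usable_mib // per_sample_mib))

# Hypothetical numbers: 24 GiB GPU, 6 GiB of weights, ~48 MiB per sample.
print(max_batch_for_vram(24, 6, 48))
```

Treat the result as a ceiling, then confirm it under load against the observed batch size and VRAM usage.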

What this connects to

BentoML is the classical-ML serving layer in a broader AI platform. It complements, rather than competes with:

vLLM / TGI / Triton for LLM generation
LiteLLM as the front-door API gateway - BentoML endpoints can register as OpenAI-compatible providers
Langfuse for observability if you wrap prediction logging
Production RAG Stack - BentoML commonly serves the reranker or embedder in a RAG pipeline

Getting help

We deploy BentoML + Yatai for GCC enterprise teams running diverse model portfolios - fraud, recommendation, vision, ML feature pipelines - alongside their LLM workloads. AI/ML Infrastructure on Kubernetes is the engagement entry. Typical rollout: 3-5 weeks including MLflow integration and first 3-5 models in production.

Frequently Asked Questions

When should I use BentoML instead of vLLM or KServe?

Use BentoML when you're serving classical ML models (XGBoost, LightGBM, scikit-learn, custom PyTorch/TensorFlow networks) rather than foundation-model LLMs. BentoML's strengths are the developer ergonomics - you wrap a model in a decorator-based Python service and get a containerized 'bento' with batch inference, adaptive batching, and multi-model composition out of the box. vLLM is optimized for transformer generation; KServe is a more operator-heavy abstraction. For a fraud model, churn classifier, or ranker, BentoML is usually the fastest path to production.

What is Yatai and do I need it?

Yatai is the model deployment and management control plane for BentoML on Kubernetes. It provides a model registry (complementing or integrating with MLflow), deployment CRDs for declarative model rollouts, traffic splitting, auto-scaling policies, and a UI. For a handful of models, raw BentoML + vanilla Deployments + HPA is enough. Once you have 10+ models across multiple teams, Yatai's registry and traffic-management CRDs pay for themselves.

How does BentoML handle GPU vs CPU routing?

BentoML's BentoDeployment CR lets you specify resource requests and node selectors per deployment and per runner. A common pattern is to run one runner pool on GPU nodes for models that need them (embeddings, vision, ASR) and another on CPU-only nodes for tabular models. The API server is a lightweight, stateless ASGI process that forwards requests to runners over HTTP, so you can mix pools freely. Adaptive batching happens inside the runner, which is where GPU efficiency comes from.

How do I integrate BentoML with MLflow model registry?

Use bentoml.mlflow.import_model() to pull an MLflow-registered model into BentoML's local store, then build the bento. The CI pipeline: MLflow tracks experiments and registers models; when a model is promoted to 'Production' in MLflow, a CI job imports it to BentoML, builds the bento, pushes to Yatai's registry, and creates a BentoDeployment. This keeps MLflow as the source of truth for lineage and BentoML as the serving runtime.

Can BentoML serve LLMs?

Yes, but usually it's not the best fit for large transformer models. BentoML can wrap vLLM or Transformers pipelines as a bento, and that works for small-to-medium LLMs. For dedicated LLM serving at scale, vLLM, TGI, or Triton directly are better - see our LLM serving benchmark. Use BentoML where it shines: classical ML models, custom ensembles, and pipelines that combine classical + small LLM components.

How does BentoML on Kubernetes handle data sovereignty?

BentoML runs entirely inside your cluster. It does send anonymous usage analytics by default, so set BENTOML_DO_NOT_TRACK=True wherever you build and serve bentos if no external telemetry is allowed. Model artifacts live in your Yatai-managed S3-compatible storage. For GCC deployments, point Yatai at in-region Azure Blob or AWS S3 Middle East, deploy runners on in-region GPU/CPU pools, and route traffic through an in-region ingress. No cross-region data flow occurs unless you explicitly configure it.
