Deploy BentoML + Yatai on Kubernetes: Production ML Model Serving Guide (2026)
Serve classical ML models in production on Kubernetes with BentoML and Yatai: containerized bentos, auto-scaling deployments, traffic splitting for A/B tests, GPU and CPU pool routing, and integration with MLflow and Prometheus.
Production ML programs rarely consist of just LLMs. Behind every customer-facing AI feature is a pile of classical models - fraud classifiers, churn predictors, ranking models, recommendation engines, OCR pipelines - that need the same rigour of deployment, monitoring, and traffic management. BentoML + Yatai on Kubernetes is the most ergonomic way we’ve found to serve that portfolio at scale.
This guide covers the production pattern we deploy with clients: Yatai as model registry and deployment control plane, BentoML runners split across GPU and CPU pools, MLflow integration for lineage, and adaptive batching for cost efficiency.
Where BentoML fits in the ML stack
| Workload | Best fit |
|---|---|
| LLM generation (GPT-class) | vLLM / TGI / Triton |
| Sentence / text embeddings | BentoML or TEI, either works |
| Tabular classifier / regressor | BentoML |
| Computer vision inference | BentoML |
| Multi-model ensemble / pipeline | BentoML |
| Streaming ML feature transforms | BentoML or dedicated streaming framework |
BentoML’s developer surface is the differentiator: a decorator-based service API, automatic containerization, and a Kubernetes-native deployment story via Yatai CRDs. For a team that just wants to bentoml build and ship, there’s less operator tax than KServe.
Architecture
┌──────────────────┐
│   ML Engineer    │  bentoml build
│      CI/CD       │
└────────┬─────────┘
         │ push bento
         ▼
┌──────────────────┐
│      Yatai       │───▶ S3 / Blob (bento artifacts, models)
│   (registry +    │───▶ Postgres (metadata)
│   control plane) │
└────────┬─────────┘
         │ BentoDeployment CR
         ▼
┌──────────────────────────────────────────────┐
│          Yatai Deployment Operator           │
│  (reconciles CRs to Deployments + Services)  │
└────────┬─────────────────────────┬───────────┘
         │                         │
         ▼                         ▼
┌──────────────────┐      ┌──────────────────┐
│ API Server pool  │      │ Runner pool(s)   │
│ (FastAPI stub,   │◄────▶│ (model compute,  │
│ CPU, stateless)  │ gRPC │ GPU or CPU)      │
└──────────────────┘      └──────────────────┘
         ▲                         ▲
         │ HPA on RPS              │ HPA on queue depth /
         │                         │ GPU utilization
         │                         ▼
Client (LiteLLM/app)      ┌──────────────────┐
                          │ GPU node pool    │
                          │ (L4 / A10 /      │
                          │ H100 / MI300)    │
                          └──────────────────┘
Key invariants:
- API Server is stateless, scales on RPS.
- Runners hold model state in memory, scale on queue depth or GPU utilization.
- Adaptive batching happens inside runners.
- Yatai holds the registry and control plane; it is not in the serving hot path.
- Bento artifacts (models + code + dependencies) live in S3-compatible storage.
Prerequisites
- Kubernetes 1.28+
- Helm 3.14+
- cert-manager + ingress-nginx
- A Postgres instance for Yatai metadata (or use CNPG as in the pgvector guide)
- S3-compatible bucket for bento artifacts
- A dedicated GPU node pool if serving GPU models
Install Yatai
helm repo add bentoml https://charts.bentoml.org
helm repo update
kubectl create namespace yatai-system
Production values.yaml:
# yatai-values.yaml
image:
  registry: quay.io
  repository: bentoml/yatai
  tag: "1.1.0"
replicas: 2
postgresql:
  deploy: false
  host: yatai-pg-rw.data.svc.cluster.local
  port: 5432
  database: yatai
  user: yatai
  existingSecret:
    name: yatai-pg-creds
    passwordKey: password
s3:
  bucketName: nomadx-bentos-me-central-1
  region: me-central-1
  endpoint: https://s3.me-central-1.amazonaws.com
  existingSecret:
    name: yatai-s3-creds
ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: yatai.example.ae
      paths: [{path: /, pathType: Prefix}]
  tls:
    - secretName: yatai-tls
      hosts: [yatai.example.ae]
resources:
  requests: {cpu: 500m, memory: 1Gi}
  limits: {memory: 2Gi}
serviceMonitor:
  enabled: true
Install:
helm upgrade --install yatai bentoml/yatai \
  --namespace yatai-system \
  --values yatai-values.yaml \
  --wait --timeout 10m

# Install the deployment operator
helm upgrade --install yatai-deployment bentoml/yatai-deployment \
  --namespace yatai-deployment \
  --create-namespace \
  --wait --timeout 10m
Packaging a production bento
A minimal fraud-scoring service:
# service.py
from __future__ import annotations

import bentoml
import numpy as np
from bentoml.io import NumpyNdarray, JSON

# Load the model from the local BentoML store and wrap it in a runner
fraud_model_runner = bentoml.xgboost.get("fraud_model:latest").to_runner()

svc = bentoml.Service(
    name="fraud-scoring",
    runners=[fraud_model_runner],
)


@svc.api(
    input=NumpyNdarray(dtype="float32", shape=(-1, 42)),
    output=JSON(),
    route="/predict",
)
async def predict(features: np.ndarray) -> dict:
    # async_run hands the call to the runner, where batching happens
    scores = await fraud_model_runner.predict.async_run(features)
    return {
        "scores": scores.tolist(),
        "n": len(scores),
    }
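On the client side, the request body for /predict is a JSON 2-D array with 42 floats per row, matching the input descriptor above. The following is a minimal sketch of building and validating that body with the standard library only; the helper name build_predict_payload is hypothetical, and the HTTP call itself is left to whichever client you already use.

```python
import json

N_FEATURES = 42  # matches shape=(-1, 42) in the service's input descriptor


def build_predict_payload(rows):
    """Validate a batch of feature rows and serialise it as a JSON 2-D array."""
    if not rows:
        raise ValueError("at least one feature row is required")
    for i, row in enumerate(rows):
        if len(row) != N_FEATURES:
            raise ValueError(
                f"row {i} has {len(row)} features, expected {N_FEATURES}"
            )
    # Cast to float so integers are sent as numbers the float32 parser accepts
    return json.dumps([[float(x) for x in row] for row in rows])


# Two rows of dummy features; POST this body with Content-Type: application/json
payload = build_predict_payload([[0.0] * N_FEATURES, [1.0] * N_FEATURES])
```

Rejecting malformed rows before the request leaves the client keeps shape errors out of the runner's batch queue.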
The bentofile:
# bentofile.yaml
service: "service:svc"
labels:
  owner: fraud-team
  stage: production
include:
  - "*.py"
python:
  packages:
    - xgboost==2.0.3
    - numpy==1.26.4
  lock_packages: true
models:
  - "fraud_model:latest"
docker:
  distro: debian
  python_version: "3.11"
  cuda_version: null  # CPU-only model
Build and push:
bentoml build
bentoml push fraud-scoring:latest # to Yatai registry
For GPU models, set cuda_version: "12.1" and add CUDA-specific deps.
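The deployment manifest in the next section pins a dated tag (fraud-scoring:prod-2026-04-24) rather than latest, which makes rollbacks and audits easier. A minimal sketch of generating such a tag in CI follows; the helper names and the stage-date naming scheme are assumptions for illustration, not a BentoML convention.

```python
from datetime import date


def bento_version(stage: str, build_date: date) -> str:
    """Return a version label such as 'prod-2026-04-24'."""
    return f"{stage}-{build_date.isoformat()}"


def bento_tag(name: str, stage: str, build_date: date) -> str:
    """Return the full bento tag in name:version form."""
    return f"{name}:{bento_version(stage, build_date)}"


tag = bento_tag("fraud-scoring", "prod", date(2026, 4, 24))
print(tag)  # fraud-scoring:prod-2026-04-24
```

The version label can then be supplied at build time (bentoml build accepts a --version option; confirm with bentoml build --help for your release) so the pushed bento and the manifest refer to the same tag.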
Deploy via BentoDeployment CR
apiVersion: resources.yatai.ai/v1alpha1
kind: BentoDeployment
metadata:
  name: fraud-scoring-prod
  namespace: ml-serving
spec:
  bento: fraud-scoring:prod-2026-04-24
  autoscaling:
    minReplicas: 3
    maxReplicas: 20
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
  resources:
    requests: {cpu: "1", memory: "2Gi"}
    limits: {memory: "2Gi"}
  runners:
    - name: fraud_model_runner
      autoscaling:
        minReplicas: 2
        maxReplicas: 10
      resources:
        requests: {cpu: "2", memory: "4Gi"}
        limits: {memory: "4Gi"}
      nodeSelector:
        nomadx.io/pool: cpu-optimized
  ingress:
    enabled: true
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    host: fraud.api.example.ae
For a GPU model deployment, swap the runner section:
runners:
  - name: embedding_runner
    resources:
      requests: {cpu: "4", memory: "16Gi", "nvidia.com/gpu": "1"}
      limits: {memory: "16Gi", "nvidia.com/gpu": "1"}
    nodeSelector:
      node.kubernetes.io/gpu-family: "L4"
    tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
Apply:
kubectl apply -f fraud-deployment.yaml
The operator reconciles this into API-server Deployments, runner Deployments, Services, HPAs, and an Ingress rule.
Adaptive batching
Adaptive batching is BentoML’s main throughput lever. Mark the runnable method as batchable, per runner:
# Inside a custom bentoml.Runnable subclass
@bentoml.Runnable.method(batchable=True, batch_dim=0)
def predict(self, input_data: np.ndarray) -> np.ndarray:
    return self.model.predict(input_data)
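Runtime config goes in the BentoML configuration file (the YAML that the BENTOML_CONFIG environment variable points to), not in bentofile.yaml: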
runners:
  fraud_model_runner:
    batching:
      enabled: true
      max_batch_size: 256
      max_latency_ms: 20
    workers_per_resource: 1
    resources:
      cpu: 2
      memory: "4Gi"
The runner accumulates incoming requests up to max_batch_size or max_latency_ms (whichever comes first) and processes them as a single batch. For XGBoost and similar models, this can give 5-10x throughput improvement over single-request inference with negligible latency impact.
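To make the "whichever comes first" rule concrete, here is a deliberately simplified model of the batching window. It is a sketch of the idea only, not BentoML's dispatcher (which also adapts to observed latency); the function name form_batches and the arrival times are made up for illustration.

```python
def form_batches(arrival_ms, max_batch_size=256, max_latency_ms=20):
    """Group sorted arrival times (ms) into batches.

    A batch is dispatched when it reaches max_batch_size, or when the oldest
    request in it has waited max_latency_ms, whichever comes first.
    """
    batches, current = [], []
    for t in arrival_ms:
        # The window for the open batch closes max_latency_ms after its
        # first request arrived.
        if current and t - current[0] >= max_latency_ms:
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Ten requests arriving 5 ms apart: 0, 5, ..., 45
arrivals = [i * 5 for i in range(10)]

# Size-bound: the cap of 3 is hit before the 20 ms window closes
size_bound = [len(b) for b in form_batches(arrivals, max_batch_size=3)]
print(size_bound)  # [3, 3, 3, 1]

# Latency-bound: the 20 ms window closes long before 256 requests arrive
latency_bound = [len(b) for b in form_batches(arrivals, max_batch_size=256)]
print(latency_bound)  # [4, 4, 2]
```

At low traffic the window, not the size cap, determines batch size, which is why raising max_batch_size alone does little for a lightly loaded runner.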
MLflow integration
Register models in MLflow, import into BentoML at build time:
# build_bento.py - runs in CI
import mlflow
import bentoml

mlflow.set_tracking_uri("https://mlflow.example.ae")

# Resolve whichever model version is currently in the Production stage
model_uri = "models:/fraud_model/Production"

bento_model = bentoml.mlflow.import_model(
    name="fraud_model",
    model_uri=model_uri,
    signatures={"predict": {"batchable": True, "batch_dim": 0}},
)
print(f"Imported {bento_model.tag}")
Then bentoml build + bentoml push as usual. This keeps MLflow as the lineage system of record while BentoML handles packaging and serving.
Traffic splitting for A/B tests
Yatai supports canary releases and traffic splitting via multiple BentoDeployments behind an Ingress with weighted backends. A simpler pattern uses ingress-nginx canary annotations:
# Baseline deployment
apiVersion: resources.yatai.ai/v1alpha1
kind: BentoDeployment
metadata: {name: fraud-v1, namespace: ml-serving}
spec:
  bento: fraud-scoring:v1
  ingress:
    enabled: true
    host: fraud.api.example.ae
---
# Canary deployment
apiVersion: resources.yatai.ai/v1alpha1
kind: BentoDeployment
metadata: {name: fraud-v2-canary, namespace: ml-serving}
spec:
  bento: fraud-scoring:v2
  ingress:
    enabled: true
    host: fraud.api.example.ae
    annotations:
      nginx.ingress.kubernetes.io/canary: "true"
      nginx.ingress.kubernetes.io/canary-weight: "10"  # 10% canary
Raise canary-weight over hours as metrics look good; flip back to v1 instantly if not.
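The ramp-or-rollback decision can be written down as a small gate that a rollout job evaluates before each weight change. This is a sketch under stated assumptions: the function name, the 10-point step, and the 0.5 percentage-point error tolerance are hypothetical choices, and the error rates would come from your own metrics.

```python
def next_canary_weight(current_weight, canary_error_rate, baseline_error_rate,
                       step=10, tolerance=0.005):
    """Return the next canary-weight value (0-100).

    Roll back to 0 if the canary's error rate exceeds the baseline by more
    than `tolerance`; otherwise advance by `step`, capped at 100.
    """
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0  # weight 0 stops sending traffic to the canary
    return min(100, current_weight + step)


# Healthy canary: ramp from 10% to 20%
print(next_canary_weight(10, canary_error_rate=0.010, baseline_error_rate=0.010))  # 20

# Degraded canary: flip back to v1
print(next_canary_weight(50, canary_error_rate=0.050, baseline_error_rate=0.010))  # 0
```

In practice the gate would also look at latency and score distribution, since a fraud model can be wrong without returning errors.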
Observability
BentoML exposes Prometheus metrics from API servers and runners. Key metrics:
- bentoml_service_request_duration_seconds_bucket - latency per API and runner
- bentoml_service_request_total - request volume by status code
- bentoml_runner_adaptive_batch_size - observed batch size, useful to tune max_batch_size
- bentoml_runner_busy_ratio - runner utilization; HPA signal
Structured request logs via BentoML’s tracing middleware ship to your OTEL collector.
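The _bucket series listed above are cumulative histogram counters, so percentiles such as p95 are estimated by interpolating within a bucket, which is what PromQL's histogram_quantile does. The sketch below reproduces that linear interpolation in plain Python; the bucket bounds and counts are invented sample data, not measurements.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            # No upper bound to interpolate towards in the +Inf bucket,
            # so fall back to the highest finite bound.
            if math.isinf(le) or in_bucket == 0:
                return prev_le
            return prev_le + (le - prev_le) * (rank - prev_count) / in_bucket
        prev_le, prev_count = le, count
    return prev_le


# Hypothetical latency buckets in seconds: (le, cumulative request count)
sample = [(0.005, 10), (0.01, 40), (0.025, 90), (0.05, 99), (math.inf, 100)]
p50 = histogram_quantile(0.50, sample)  # about 0.013 s
p95 = histogram_quantile(0.95, sample)  # about 0.0389 s
```

Because the estimate is only as fine as the bucket boundaries, it helps to have bucket bounds close to your latency SLO.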
Sizing tiers
| Tier | Models | Deployments / replicas | Yatai / Postgres | Est. monthly cost (AED, EKS me-central-1) |
|---|---|---|---|---|
| Small | 1-5 | 3 API + 2 runner per model | Small Postgres | ~8,000 |
| Medium | 5-25 | 3-10 API + 2-10 runners | db.r6g.large | ~30,000 |
| Large | 25-100 | Mixed GPU + CPU pools, autoscaled | db.r6g.xlarge | ~90,000 |
GPU models dominate the bill above 10 concurrent model deployments.
Common failure modes
- Cold-start latency on new replicas - model loading into memory is slow. Set minReplicas such that scale-up doesn’t trigger during latency-critical windows.
- OOM on GPU runners under adaptive batch - max_batch_size is too high for available VRAM. Lower until VRAM usage fits with headroom for variable input shapes.
- Yatai deployment operator stuck in Pending - usually a PVC storage class mismatch or missing S3 creds. kubectl describe bentodeployment ... shows events.
- Requests timing out after bento update - new bento version has different API signature and clients haven’t updated. Use explicit versioning on endpoints (/v1/predict, /v2/predict) and traffic-split rather than hot-swap.
- Runner HPA thrashing - CPU-based scaling is noisy for ML workloads. Scale on queue depth (via KEDA + Prometheus scaler) or bentoml_runner_busy_ratio instead.
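For the HPA thrashing case, it helps to see how the autoscaler turns a metric into a replica count. The sketch below follows the formula in the Kubernetes HPA documentation, desired = ceil(current * currentMetric / targetMetric), with its default 10% tolerance band; the queue-depth numbers and replica bounds are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=2, max_replicas=10, tolerance=0.1):
    """Replica count the HPA would request for a given metric reading."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count unchanged
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Target: 20 queued requests per runner replica
print(hpa_desired_replicas(4, current_metric=30, target_metric=20))  # 6
print(hpa_desired_replicas(4, current_metric=21, target_metric=20))  # 4 (within tolerance)
print(hpa_desired_replicas(8, current_metric=50, target_metric=20))  # 10 (capped at max)
```

A metric that swings across the tolerance band on every scrape produces a new replica count each time, which is the thrashing described above; queue depth tends to move more smoothly than CPU for batch-oriented runners.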
What this connects to
BentoML is the classical-ML serving layer in a broader AI platform. It complements, rather than competes with:
- vLLM / TGI / Triton for LLM generation
- LiteLLM as the front-door API gateway - BentoML endpoints can register as OpenAI-compatible providers
- Langfuse for observability if you wrap prediction logging
- Production RAG Stack - BentoML commonly serves the reranker or embedder in a RAG pipeline
Getting help
We deploy BentoML + Yatai for GCC enterprise teams running diverse model portfolios - fraud, recommendation, vision, ML feature pipelines - alongside their LLM workloads. AI/ML Infrastructure on Kubernetes is the engagement entry. Typical rollout: 3-5 weeks including MLflow integration and first 3-5 models in production.
Frequently Asked Questions
When should I use BentoML instead of vLLM or KServe?
Use BentoML when you're serving classical ML models (XGBoost, LightGBM, scikit-learn, custom PyTorch/TensorFlow networks) rather than foundation-model LLMs. BentoML's strengths are the developer ergonomics - you wrap a model in a decorator-based Python service and get a containerized 'bento' with batch inference, adaptive batching, and multi-model composition out of the box. vLLM is optimized for transformer generation; KServe is a more operator-heavy abstraction. For a fraud model, churn classifier, or ranker, BentoML is usually the fastest path to production.
What is Yatai and do I need it?
Yatai is the model deployment and management control plane for BentoML on Kubernetes. It provides a model registry (complementing or integrating with MLflow), deployment CRDs for declarative model rollouts, traffic splitting, auto-scaling policies, and a UI. For a handful of models, raw BentoML + vanilla Deployments + HPA is enough. Once you have 10+ models across multiple teams, Yatai's registry and traffic-management CRDs pay for themselves.
How does BentoML handle GPU vs CPU routing?
BentoML's BentoDeployment CR lets you specify resource requests and node selectors per deployment. Common pattern: run one BentoML runner pool on GPU nodes for models that need them (embeddings, vision, ASR), another on CPU-only nodes for tabular models. The API server is a lightweight FastAPI stub that routes to runners over gRPC, so you can mix pools freely. Adaptive batching happens inside the runner, which is where GPU efficiency comes from.
How do I integrate BentoML with MLflow model registry?
Use bentoml.mlflow.import_model() to pull an MLflow-registered model into BentoML's local store, then build the bento. The CI pipeline: MLflow tracks experiments and registers models; when a model is promoted to 'Production' in MLflow, a CI job imports it to BentoML, builds the bento, pushes to Yatai's registry, and creates a BentoDeployment. This keeps MLflow as the source of truth for lineage and BentoML as the serving runtime.
Can BentoML serve LLMs?
Yes, but usually it's not the best fit for large transformer models. BentoML can wrap vLLM or Transformers pipelines as a bento, and that works for small-to-medium LLMs. For dedicated LLM serving at scale, vLLM, TGI, or Triton directly are better - see our LLM serving benchmark. Use BentoML where it shines: classical ML models, custom ensembles, and pipelines that combine classical + small LLM components.
How does BentoML on Kubernetes handle data sovereignty?
BentoML runs entirely inside your cluster with no external telemetry. Model artifacts live in your Yatai-managed S3-compatible storage. For GCC deployments, point Yatai at in-region Azure Blob or AWS S3 Middle East, deploy runners on in-region GPU/CPU pools, and route traffic through an in-region ingress. No cross-region data flow unless you explicitly configure it.