Ragas on Kubernetes: Continuous RAG Evaluation in Production (2026)
Run Ragas evaluations as a production Kubernetes workload: offline eval suites, online LLM-as-judge sampling from Langfuse traces, drift detection, and closing the RAG quality loop with automated alerts and SLO-based rollback.
RAG quality is not a one-time measurement - it’s a continuous SLO. Your retrieval degrades when the corpus evolves. Your generation drifts when you change models. Your context precision drops when you add new document types that look superficially relevant. Without continuous evaluation, the drift stays invisible until a user complains. Ragas on Kubernetes is the pattern we deploy to catch it.
This is the final post in our AI engineering stack series - it closes the loop back to Langfuse, which captures the traces that evaluation scores.
What “continuous evaluation” actually means
Three levels of evaluation rigor, by maturity:
| Level | Cadence | What you catch |
|---|---|---|
| 1. Offline eval | Pre-deploy, on a fixed suite | Regressions before they hit users |
| 2. Scheduled online eval | Hourly or daily on sampled production traces | Drift, quality trends, model changes |
| 3. Continuous LLM-as-judge | Per-trace on a sampling rate (1-5%) | Real-time quality degradation, alerts |
Most production RAG runs at level 1 (eval before release) plus level 2 (nightly drift check). Level 3 is worth it for products with strict quality SLOs or high prompt-injection risk.
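For level 3, the per-trace sampling decision can be made deterministic by hashing the trace ID, so retries and parallel workers agree on which traces get judged. The following is a minimal sketch under that assumption; `should_sample` is a hypothetical helper, not part of Ragas or Langfuse.

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 6 bytes to a float in [0, 1); 48 bits fit exactly in a double
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return bucket < sample_rate


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(100_000)]
    picked = sum(should_sample(t, 0.02) for t in ids)
    print(f"sampled {picked} of {len(ids)} traces")
```

Unlike a random draw, the same trace always gets the same decision, which makes re-runs of a failed job score the same set.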
Architecture
          ┌────────────────────┐
          │   Production RAG   │  Generates traces into Langfuse
          │   (orchestrator)   │  with question, context, answer
          └──────────┬─────────┘
                     │
                     ▼
          ┌────────────────────┐
          │      Langfuse      │
          │   (trace store)    │
          └──────────┬─────────┘
                     │
      ┌──────────────┼──────────────┐
      │              │              │
      ▼              ▼              ▼
Offline eval  Scheduled eval  Online sampler
(manual,      (CronJob,       (KEDA-scaled
pre-deploy)   hourly/daily)   worker)
      │              │              │
      │              │              │
      └──────────────┼──────────────┘
                     ▼
          ┌────────────────────┐
          │       Ragas        │  LLM-as-judge via
          │    (evaluation     │  LiteLLM gateway
          │      library)      │
          └──────────┬─────────┘
                     │
                     ▼
          ┌────────────────────┐
          │       Scores       │
          │  written back to   │───▶ Grafana dashboard
          │      Langfuse      │───▶ Slack alert on SLO breach
          │  as custom scores  │
          └────────────────────┘
Prerequisites
- Langfuse deployed and capturing RAG traces
- LiteLLM gateway with an evaluator LLM virtual key
- Prometheus + Grafana for dashboarding (already in the NomadX baseline stack)
- Python 3.10+ image for running evaluation jobs
The evaluation worker image
# Dockerfile
FROM python:3.11-slim
RUN pip install --no-cache-dir \
    ragas==0.2.8 \
    langfuse==2.54.0 \
    langchain-openai==0.2.0 \
    langchain-community==0.3.0 \
    datasets==3.0.0 \
    prometheus-client==0.21.0
# Copy into /app/eval so that `python -m eval.runner` can resolve the package
COPY eval/ /app/eval/
WORKDIR /app
ENTRYPOINT ["python", "-m", "eval.runner"]
The evaluation runner:
# eval/runner.py
import os
import random
from datetime import datetime, timedelta, timezone

from datasets import Dataset
from langchain_openai import ChatOpenAI
from langfuse import Langfuse
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

# Read sampling settings from the env vars the CronJob sets
SAMPLE_RATE = float(os.environ.get("SAMPLE_RATE", "0.02"))
WINDOW_HOURS = int(os.environ.get("WINDOW_HOURS", "1"))

EVALUATOR_LLM = ChatOpenAI(
    model="gpt-4o-uae-primary",
    openai_api_base=os.environ["LITELLM_URL"],
    openai_api_key=os.environ["LITELLM_EVAL_KEY"],
    temperature=0.0,
    max_tokens=1024,
    # Tag evaluations in Langfuse
    default_headers={"langfuse-tags": "ragas-eval"},
)

lf = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ["LANGFUSE_HOST"],
)


def pull_recent_traces(hours: int, sample_rate: float):
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    # If your Langfuse deployment rejects large pages, lower `limit` and paginate with `page=`
    traces = lf.fetch_traces(
        from_timestamp=since,
        tags=["rag", "prod"],
        limit=5000,
    ).data
    # Sample rather than evaluate all
    sampled = [t for t in traces if random.random() < sample_rate]
    # The list endpoint returns observation IDs only; fetch each sampled trace
    # in full so extract_eval_row can read the "retrieve" span
    return [lf.fetch_trace(t.id).data for t in sampled]


def extract_eval_row(trace):
    """Pull question, retrieved context, answer from a Langfuse trace."""
    inp = trace.input if isinstance(trace.input, dict) else {}
    q = inp.get("message") or inp.get("question")
    # Context retrieved is typically in a child span named "retrieve"
    ctx_span = next((s for s in trace.observations if s.name == "retrieve"), None)
    contexts = (ctx_span.output or []) if ctx_span else []
    if isinstance(contexts, list):
        contexts = [c.get("text", "") for c in contexts if isinstance(c, dict)]
    else:
        contexts = []
    answer = trace.output.get("answer") if isinstance(trace.output, dict) else str(trace.output)
    return {
        "trace_id": trace.id,
        "question": q,
        "contexts": contexts,
        "answer": answer,
    }


def write_scores_back(trace_id: str, scores: dict):
    for metric, value in scores.items():
        if value is None or value != value:  # skip NaN
            continue
        lf.score(
            trace_id=trace_id,
            name=f"ragas.{metric}",
            value=float(value),
            data_type="NUMERIC",
        )


def main():
    traces = pull_recent_traces(hours=WINDOW_HOURS, sample_rate=SAMPLE_RATE)
    rows = [extract_eval_row(t) for t in traces if t.output]
    rows = [r for r in rows if r["question"] and r["answer"] and r["contexts"]]
    if not rows:
        print("No eval rows after filtering")
        return
    # Keep trace_id out of the dataset; Ragas only needs the eval columns
    ds = Dataset.from_list(
        [{k: r[k] for k in ("question", "contexts", "answer")} for r in rows]
    )
    # Note: answer_relevancy also needs an embedding model. Unless you pass
    # embeddings=..., Ragas uses its default OpenAI embeddings, which read
    # OPENAI_API_KEY rather than the LiteLLM settings above.
    result = evaluate(
        dataset=ds,
        metrics=[faithfulness, answer_relevancy],  # no ground truth needed
        llm=EVALUATOR_LLM,
        raise_exceptions=False,
    )
    # One row of scores per evaluated sample, in input order
    scores = result.to_pandas()
    for i, row in enumerate(rows):
        write_scores_back(
            row["trace_id"],
            {
                "faithfulness": scores["faithfulness"].iloc[i],
                "answer_relevancy": scores["answer_relevancy"].iloc[i],
            },
        )
    lf.flush()
    print(f"Evaluated {len(rows)} traces. Mean faithfulness: {scores['faithfulness'].mean():.3f}")


if __name__ == "__main__":
    main()
Schedule as a Kubernetes CronJob
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ragas-online-eval
  namespace: evaluation
spec:
  schedule: "0 * * * *" # hourly
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800 # fail hard if taking > 30 min
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: eval
              image: registry.example.ae/eval/ragas-runner:0.2.8
              env:
                - name: LITELLM_URL
                  value: "http://litellm.llm-gateway.svc.cluster.local:4000"
                - name: LANGFUSE_HOST
                  value: "http://langfuse-web.langfuse.svc.cluster.local:3000"
                - name: SAMPLE_RATE
                  value: "0.02"
                - name: WINDOW_HOURS
                  value: "1"
              envFrom:
                - secretRef:
                    name: ragas-secrets
              resources:
                requests: {cpu: "1", memory: "2Gi"}
                limits: {memory: "4Gi"}
The virtual key in LITELLM_EVAL_KEY has a separate budget cap in LiteLLM so eval cost is tracked and capped independently from production traffic.
Offline eval against a labeled suite
For pre-deploy regression testing, maintain a YAML eval suite in Git:
# eval-suite.yaml
- id: onboarding-multi-hop
  question: "Which APIs do I need to call to onboard a new enterprise customer?"
  reference_answer: |
    The onboarding flow requires calls to (1) /v1/accounts to create the account,
    (2) /v1/entitlements to grant plan-tier access, and (3) /v1/provision to
    initialize the customer's tenant environment. Optionally, /v1/sso-setup for
    SAML integration.
  ground_truth_contexts:
    - "POST /v1/accounts creates a new enterprise account..."
    - "POST /v1/entitlements grants plan-tier access..."
    - "POST /v1/provision initializes the customer's tenant..."
- id: policy-lookup
  question: "What is our data retention policy for customer audit logs?"
  reference_answer: "Audit logs are retained for 7 years, with the last 12 months in hot storage..."
  ground_truth_contexts:
    - "Section 4.2 Audit Log Retention: ..."
Offline runner script:
# eval/offline.py
import yaml
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall,
)

# Reuse the judge from the online runner; importing it requires the
# runner's environment variables to be set in CI as well
from eval.runner import EVALUATOR_LLM

# rag_client is your own client for the production RAG endpoint (not shown);
# ask() must return the answer and the retrieved contexts for a question.

with open("eval-suite.yaml") as f:
    suite = yaml.safe_load(f)

# Run each question through the production RAG endpoint
rows = []
for item in suite:
    resp = rag_client.ask(item["question"])
    rows.append({
        "question": item["question"],
        "answer": resp["answer"],
        "contexts": resp["retrieved_contexts"],
        "ground_truth": item["reference_answer"],
        "reference_contexts": item["ground_truth_contexts"],
    })

result = evaluate(
    dataset=Dataset.from_list(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=EVALUATOR_LLM,
)
# Per-sample scores as a DataFrame, one column per metric
scores = result.to_pandas()

# Fail the CI build if any metric's mean falls below its absolute floor
thresholds = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}
for metric, threshold in thresholds.items():
    score = scores[metric].mean()
    print(f"{metric}: {score:.3f} (threshold {threshold})")
    if score < threshold:
        raise SystemExit(f"REGRESSION: {metric} below threshold")
Run this in the CI pipeline before any RAG-related deploy. It’s the equivalent of unit tests for retrieval quality.
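Absolute floors catch large failures; a relative check against the last accepted run can also catch slow erosion. The sketch below assumes baseline scores are stored as JSON alongside the suite. `find_regressions`, the baseline layout, and the 5% tolerance are illustrative assumptions, not part of Ragas.

```python
import json


def find_regressions(baseline: dict, current: dict, tolerance: float = 0.05) -> list:
    """Return a message for each metric that fell more than `tolerance` (relative) below baseline."""
    failures = []
    for metric, base in baseline.items():
        score = current.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from current run")
            continue
        floor = base * (1.0 - tolerance)
        if score < floor:
            failures.append(f"{metric}: {score:.3f} < {floor:.3f} (baseline {base:.3f})")
    return failures


if __name__ == "__main__":
    # In CI the baseline would be read from a committed file; inlined here
    baseline = json.loads('{"faithfulness": 0.90, "context_recall": 0.80}')
    current = {"faithfulness": 0.84, "context_recall": 0.79}
    for line in find_regressions(baseline, current):
        print("REGRESSION", line)
```

Update the baseline only when a release is accepted, so a series of small drops cannot ratchet the bar downward unnoticed.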
Dashboards and alerts
Langfuse surfaces the scores in its UI. For the operations team, wire them into Grafana via Langfuse’s data export or a Prometheus bridge.
Prometheus alert rules:
groups:
  - name: rag-quality
    rules:
      - alert: RAGFaithfulnessDropped
        expr: avg_over_time(ragas_faithfulness_hourly[24h]) < 0.80
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "RAG faithfulness score dropped below 0.80 over 24h"
          runbook: "https://runbooks.example.ae/rag-faithfulness-drop"
      - alert: RAGAnswerRelevancyDropped
        expr: avg_over_time(ragas_answer_relevancy_hourly[24h]) < 0.75
        for: 2h
        labels:
          severity: critical
        annotations:
          summary: "RAG answer relevancy critical - investigate model or prompt changes"
The bridge job exposes ragas_*_hourly gauges by querying Langfuse’s aggregate scores API and exporting them as Prometheus metrics every 5 minutes.
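The aggregation step of such a bridge might look like the sketch below. It assumes the scores have already been fetched as `(name, value)` pairs; the fetch itself and the HTTP endpoint are omitted. `hourly_gauges` and `render_exposition` are hypothetical helpers, and a real bridge would more likely use `prometheus_client` gauges than render the text format by hand.

```python
from collections import defaultdict
from statistics import fmean


def hourly_gauges(scores):
    """Average (name, value) score pairs into ragas_*_hourly gauge values."""
    buckets = defaultdict(list)
    for name, value in scores:
        if value != value:  # skip NaN from failed judgments
            continue
        buckets[name].append(value)
    # "ragas.faithfulness" -> "ragas_faithfulness_hourly"
    return {
        f"{name.replace('.', '_')}_hourly": fmean(values)
        for name, values in buckets.items()
    }


def render_exposition(gauges):
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for metric, value in sorted(gauges.items()):
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value:.4f}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        ("ragas.faithfulness", 0.9),
        ("ragas.faithfulness", 0.7),
        ("ragas.answer_relevancy", 0.85),
    ]
    print(render_exposition(hourly_gauges(sample)), end="")
```

The gauge names produced here match the ones the alert rules query, which is the only contract between the bridge and Prometheus.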
Detecting specific failure patterns
Beyond aggregate metrics, Ragas supports custom metrics. Two worth deploying:
1. Hallucination detector - complements faithfulness with a direct yes/no check for unsupported factual claims:
from ragas.metrics import AspectCritic

hallucination_critic = AspectCritic(
    name="hallucination",
    definition="Does the answer contain factual claims not supported by the retrieved context?",
    llm=EVALUATOR_LLM,
)
2. Refusal tracker - detects over-refusal (model saying “I don’t have enough info” when context is sufficient):
refusal_critic = AspectCritic(
    name="unnecessary_refusal",
    definition="Did the answer refuse to answer despite the context containing the required information?",
    llm=EVALUATOR_LLM,
)
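Track these alongside faithfulness - over-refusal is a common failure mode after a prompt-template change. Both critics return a binary verdict in which 1 means the failure is present, so unlike faithfulness, lower is better.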
Sizing and cost
Rough per-trace eval cost on GPT-4o-class evaluator:
- faithfulness + answer_relevancy: ~$0.005-$0.02 per trace
- Adding context_precision + context_recall: ~$0.01-$0.04
- Adding custom aspect critics: ~$0.002 each
At 1M RAG requests/month with 1% sampling, that’s 10,000 evals/month:
- Faithfulness + relevancy only: $50-200/month
- Full four metrics: $100-400/month
- With 2-3 custom aspect critics (about $40-60 more at ~$0.002 each): roughly $140-460/month
Cap eval spend in LiteLLM via the eval virtual key’s budget so overruns fail fast.
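The arithmetic behind these estimates is simple enough to keep as a helper when you plan a budget cap. This is a sketch; `monthly_eval_cost` is a hypothetical function and the per-trace costs are the rough figures quoted above.

```python
def monthly_eval_cost(requests_per_month: int, sample_rate: float,
                      cost_low: float, cost_high: float) -> tuple:
    """Return (evals, low_usd, high_usd) for one month of sampled evaluation."""
    evals = round(requests_per_month * sample_rate)
    return evals, evals * cost_low, evals * cost_high


if __name__ == "__main__":
    # 1M requests/month, 1% sampling, faithfulness + answer_relevancy only
    evals, low, high = monthly_eval_cost(1_000_000, 0.01, 0.005, 0.02)
    print(f"{evals} evals/month, ${low:.0f}-${high:.0f}")
```

Setting the virtual key's budget slightly above the high estimate leaves headroom for retries while still failing fast on a runaway job.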
Common failure modes
- Eval jobs silently fail to score - Langfuse trace structure changed; extract_eval_row returns empty. Add fixture tests against sample traces in CI.
- Scores drop after an LLM provider change - evaluator LLM is inconsistent across providers. Pin the evaluator to a single provider and version.
- Ground truth drift - your labeled eval suite goes stale as the corpus changes. Schedule quarterly suite reviews with the product team.
- Metric thrash on small samples - 1% sampling at low RAG volume gives noisy scores. Either raise sample rate or use 7-day rolling averages in alerts.
- Evaluator quota exhaustion - eval jobs queue and stall behind production traffic on the LiteLLM proxy. Give evaluator a dedicated virtual key with isolated rate limits.
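A fixture test for the first failure mode can be sketched as follows. It mirrors the extraction logic from eval/runner.py so that it runs standalone; in a real repository you would import the function instead. The fixture shape (`make_trace`) is an assumption about what your traces look like.

```python
from types import SimpleNamespace


def extract_eval_row(trace):
    """Mirror of the runner's extraction logic, restated so this sketch runs standalone."""
    inp = trace.input if isinstance(trace.input, dict) else {}
    question = inp.get("message") or inp.get("question")
    span = next((s for s in trace.observations if s.name == "retrieve"), None)
    chunks = (span.output or []) if span else []
    contexts = [c.get("text", "") for c in chunks] if isinstance(chunks, list) else []
    answer = trace.output.get("answer") if isinstance(trace.output, dict) else str(trace.output)
    return {"trace_id": trace.id, "question": question, "contexts": contexts, "answer": answer}


def make_trace(span_name="retrieve"):
    """Build a fake trace shaped like the ones the runner expects (hypothetical fixture)."""
    span = SimpleNamespace(name=span_name, output=[{"text": "Audit logs are kept for 7 years."}])
    return SimpleNamespace(
        id="t-1",
        input={"question": "How long are audit logs kept?"},
        output={"answer": "7 years."},
        observations=[span],
    )


def test_extracts_all_fields():
    row = extract_eval_row(make_trace())
    assert row["question"] and row["answer"] and row["contexts"]


def test_renamed_span_yields_no_contexts():
    # A renamed retrieval span is the silent failure: the row is later filtered out
    row = extract_eval_row(make_trace(span_name="retrieval"))
    assert row["contexts"] == []
```

Refreshing the fixture from a real production trace whenever the orchestrator's instrumentation changes keeps the test honest.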
What this closes
Ragas closes the quality loop that the rest of the AI engineering stack opens:
- Langfuse captures traces - Ragas scores them
- LiteLLM provides the evaluator LLM with isolated budgets
- Production RAG Stack generates the content Ragas evaluates
- Any component upgrade (vLLM version, new Qdrant config, pgvector index change) runs the offline eval suite before production cutover
With these 12 pieces in place - observability, gateway, retrieval (vector + graph), serving, platform, evaluation - you have a production AI engineering stack that’s operationally sound and continuously measurable.
Getting help
We deploy Ragas evaluation pipelines for GCC enterprise AI teams running production RAG. The biggest value is usually in the first three weeks: establishing baselines, tuning alert thresholds to your actual quality bar, and getting the first set of real regressions caught before they reach users. AI/ML Infrastructure on Kubernetes covers the engagement.
Frequently Asked Questions
What does Ragas evaluate in a RAG pipeline?
Ragas computes four primary metrics: (1) faithfulness - are answer claims supported by retrieved context, (2) answer_relevancy - does the answer address the question, (3) context_precision - is retrieved context actually useful for the question, (4) context_recall - did retrieval find all the info needed. Faithfulness and answer_relevancy need only the question, the retrieved context, and the answer; context_precision/recall need reference answers. In production we typically run faithfulness + answer_relevancy online (no ground truth needed) and context_precision/recall offline against a labeled eval set.
Is Ragas production-ready for continuous evaluation?
Yes - we run it as a scheduled Kubernetes Job against both a static eval set (nightly) and live Langfuse trace samples (hourly). The library is stable at v0.2+, uses LLM-as-judge for most metrics (requires an evaluator LLM), and has native integrations with Langfuse, LangChain, and LlamaIndex. The operational considerations are mostly cost - LLM-judge evaluations burn tokens, so sample rather than evaluate every trace.
How do I choose an evaluator LLM?
Use a model at least as strong as your generation model, and ideally from a different model family. For a production stack generating with GPT-4o, evaluate with Claude 3.5 Sonnet. For a stack generating with a 70B self-hosted model, evaluate with GPT-4o via Azure OpenAI UAE. The evaluator needs to be capable enough that its judgments are trustworthy as a quality signal. Do not evaluate with the same model that generates - you'll measure self-consistency, not quality.
How do I prevent evaluator cost runaway?
Three guardrails: (1) sample rather than evaluate every trace - 1-5% sampling is usually enough to detect quality trends, (2) cap the eval job's LiteLLM virtual-key budget (see our LiteLLM guide) - if evaluation blows past budget the job fails fast rather than silently continuing, (3) cache evaluation results so identical question/answer pairs don't re-evaluate. A 1% sample of 1M daily traces is 10,000 evals/day, or roughly $50-200/day for faithfulness + answer_relevancy at the per-trace costs quoted in the sizing section.
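The third guardrail can be sketched as a content-addressed cache. This is an illustration only: `eval_cache_key` and `ScoreCache` are hypothetical names, the in-memory dict stands in for whatever store you use, and including the contexts and metric name in the key is our assumption, since faithfulness depends on the retrieved context as well as the question/answer pair.

```python
import hashlib
import json


def eval_cache_key(question: str, answer: str, contexts: list, metric: str) -> str:
    """Stable key for one (question, answer, contexts, metric) evaluation."""
    payload = json.dumps(
        {"q": question, "a": answer, "c": contexts, "m": metric},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ScoreCache:
    """In-memory cache; swap the dict for Redis or a table in a real job."""

    def __init__(self):
        self._scores = {}

    def get_or_compute(self, key: str, compute):
        # Only call the (expensive) judge when the key has not been seen
        if key not in self._scores:
            self._scores[key] = compute()
        return self._scores[key]


if __name__ == "__main__":
    cache = ScoreCache()
    key = eval_cache_key("What is the retention period?", "7 years.", ["Section 4.2 ..."], "faithfulness")
    first = cache.get_or_compute(key, lambda: 0.92)   # stands in for a judge call
    second = cache.get_or_compute(key, lambda: 0.10)  # not invoked: cache hit
    print(first == second)
```

Caching pays off mainly for products with many repeated questions; if the evaluator model or prompt changes, the key should change too, for example by adding a version string to the payload.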
How does Ragas integrate with Langfuse?
Langfuse stores RAG traces (question, retrieved context, answer, metadata). A periodic Kubernetes Job queries Langfuse for recent traces, runs Ragas metrics, and writes scores back to Langfuse as custom scores on each trace. The Langfuse UI then shows quality scores alongside traces, and you can filter/sort/alert on them. This is the closed-loop pattern we deploy - observability captures behavior, evaluation scores it, alerts fire on degradation.
Can Ragas evaluations run on-premises for GCC compliance?
Yes. Ragas itself is a Python library with no external calls - it just needs an LLM to judge with. Point it at your in-region LLM (Azure OpenAI UAE North, self-hosted via LiteLLM, or Bedrock Middle East) and evaluation stays in-region. The Kubernetes Jobs, the Langfuse integration, and the storage of scores all remain inside your cluster.