April 23, 2026 · 8 min read · Aizhan Azhybaeva

Ragas on Kubernetes: Continuous RAG Evaluation in Production (2026)

Run Ragas evaluations as a production Kubernetes workload: offline eval suites, online LLM-as-judge sampling from Langfuse traces, drift detection, and closing the RAG quality loop with automated alerts and SLO-based rollback.

RAG quality is not a one-time measurement - it’s a continuous SLO. Your retrieval degrades when the corpus evolves. Your generation drifts when you change models. Your context precision drops when you add new document types that look superficially relevant. Without continuous evaluation, the drift stays invisible until a user complains. Ragas on Kubernetes is the pattern we deploy to catch it.

This is the final post in our AI engineering stack series - it closes the loop back to Langfuse, which captures the traces that evaluation scores.

What “continuous evaluation” actually means

Three levels of evaluation rigor, by maturity:

Level                       | Cadence                                       | What you catch
1. Offline eval             | Pre-deploy, on a fixed suite                  | Regressions before they hit users
2. Scheduled online eval    | Hourly or daily on sampled production traces  | Drift, quality trends, model changes
3. Continuous LLM-as-judge  | Per-trace on a sampling rate (1-5%)           | Real-time quality degradation, alerts

Most production RAG systems run at level 1 (eval before release) plus level 2 (nightly drift check). Level 3 is worth it for products with strict quality SLOs or high prompt-injection risk.
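For level 3, it helps to make the sampling decision deterministic, so that a re-run over the same window scores the same traces instead of a fresh random subset. A minimal sketch, assuming trace IDs are strings; should_sample is a hypothetical helper, not part of Ragas or Langfuse:

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Return True for roughly `sample_rate` of trace IDs, deterministically."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


if __name__ == "__main__":
    ids = [f"trace-{i}" for i in range(100_000)]
    picked = sum(should_sample(t, 0.02) for t in ids)
    print(f"sampled {picked} of {len(ids)} traces")
```

Because the decision depends only on the trace ID, a retried job neither double-scores nor skips traces it had already selected.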

Architecture

            ┌────────────────────┐
            │  Production RAG    │   Generates traces into Langfuse
            │  (orchestrator)    │   with question, context, answer
            └──────────┬─────────┘
                       │
                       ▼
            ┌────────────────────┐
            │     Langfuse       │
            │   (trace store)    │
            └──────────┬─────────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
        ▼              ▼              ▼
   Offline eval   Scheduled eval  Online sampler
   (manual,       (CronJob,       (KEDA-scaled
   pre-deploy)    hourly/daily)   worker)
        │              │              │
        │              │              │
        └──────────────┼──────────────┘
                       ▼
            ┌────────────────────┐
            │      Ragas         │   LLM-as-judge via
            │    (evaluation     │   LiteLLM gateway
            │     library)       │
            └──────────┬─────────┘
                       │
                       ▼
            ┌────────────────────┐
            │     Scores         │
            │  written back to   │───▶ Grafana dashboard
            │    Langfuse        │───▶ Slack alert on SLO breach
            │  as custom scores  │
            └────────────────────┘

Prerequisites

  • Langfuse deployed and capturing RAG traces
  • LiteLLM gateway with an evaluator LLM virtual key
  • Prometheus + Grafana for dashboarding (already in the NomadX baseline stack)
  • Python 3.10+ image for running evaluation jobs

The evaluation worker image

# Dockerfile
FROM python:3.11-slim

RUN pip install --no-cache-dir \
    ragas==0.2.8 \
    langfuse==2.54.0 \
    langchain-openai==0.2.0 \
    langchain-community==0.3.0 \
    datasets==3.0.0 \
    prometheus-client==0.21.0

# Copy into /app/eval so that `python -m eval.runner` can resolve the package
COPY eval/ /app/eval/
WORKDIR /app
ENTRYPOINT ["python", "-m", "eval.runner"]

The evaluation runner:

# eval/runner.py
import asyncio
import os
from datetime import datetime, timedelta, timezone

from datasets import Dataset
from langchain_openai import ChatOpenAI
from langfuse import Langfuse
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_precision,
    context_recall,
)

EVALUATOR_LLM = ChatOpenAI(
    model="gpt-4o-uae-primary",
    openai_api_base=os.environ["LITELLM_URL"],
    openai_api_key=os.environ["LITELLM_EVAL_KEY"],
    temperature=0.0,
    max_tokens=1024,
    # Tag evaluations in Langfuse
    default_headers={"langfuse-tags": "ragas-eval"},
)

lf = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ["LANGFUSE_HOST"],
)

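async def pull_recent_traces(hours: int = 1, sample_rate: float = 0.02):
    import random

    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    # The list endpoint is paginated; if the server rejects or truncates a limit
    # this large, page through results with the `page` argument instead.
    traces = lf.fetch_traces(
        from_timestamp=since,
        tags=["rag", "prod"],
        limit=5000,
    ).data

    # Sample rather than evaluate all
    sampled = [t for t in traces if random.random() < sample_rate]
    # The list response carries observation IDs only; fetch each sampled trace
    # individually to load the full observations extract_eval_row reads.
    return [lf.fetch_trace(t.id).data for t in sampled]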

def extract_eval_row(trace):
    """Pull question, retrieved context, answer from a Langfuse trace."""
    q = trace.input.get("message") or trace.input.get("question")
    # Context retrieved is typically in a child span named "retrieve"
    ctx_span = next((s for s in trace.observations if s.name == "retrieve"), None)
    contexts = (ctx_span.output or []) if ctx_span else []
    contexts = [c.get("text", "") for c in contexts] if isinstance(contexts, list) else []
    answer = trace.output.get("answer") if isinstance(trace.output, dict) else str(trace.output)
    return {
        "trace_id": trace.id,
        "question": q,
        "contexts": contexts,
        "answer": answer,
    }

def write_scores_back(trace_id: str, scores: dict):
    for metric, value in scores.items():
        if value is None or value != value:        # skip NaN
            continue
        lf.score(
            trace_id=trace_id,
            name=f"ragas.{metric}",
            value=float(value),
            data_type="NUMERIC",
        )

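async def main():
    # Honor the WINDOW_HOURS and SAMPLE_RATE values set on the CronJob
    traces = await pull_recent_traces(
        hours=int(os.environ.get("WINDOW_HOURS", "1")),
        sample_rate=float(os.environ.get("SAMPLE_RATE", "0.02")),
    )
    rows = [extract_eval_row(t) for t in traces if t.output]
    rows = [r for r in rows if r["question"] and r["answer"] and r["contexts"]]

    if not rows:
        print("No eval rows after filtering")
        return

    # trace_id is bookkeeping for the write-back, not an evaluation column
    ds = Dataset.from_list([{k: v for k, v in r.items() if k != "trace_id"} for r in rows])
    result = evaluate(
        dataset=ds,
        metrics=[faithfulness, answer_relevancy],     # no ground truth needed
        llm=EVALUATOR_LLM,
        # answer_relevancy also uses embeddings; without an `embeddings=` argument
        # Ragas falls back to its default OpenAI embeddings, bypassing the gateway.
        raise_exceptions=False,
    )

    # Per-sample scores, one row per evaluated trace, in input order
    scores = result.to_pandas()
    for i, row in enumerate(rows):
        write_scores_back(
            row["trace_id"],
            {
                "faithfulness": scores["faithfulness"].iloc[i],
                "answer_relevancy": scores["answer_relevancy"].iloc[i],
            },
        )
    lf.flush()
    print(f"Evaluated {len(rows)} traces. Mean faithfulness: {scores['faithfulness'].mean():.3f}")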

if __name__ == "__main__":
    asyncio.run(main())
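Because extract_eval_row depends on the exact shape of the traces, it is worth pinning that shape with a fixture test. The sketch below inlines a copy of the helper so it runs on its own, and uses a hand-built stand-in for a Langfuse trace; make_fixture_trace and the field values are illustrative assumptions, not real production data:

```python
from types import SimpleNamespace


def extract_eval_row(trace):
    """Inlined copy of the runner's helper so this snippet runs on its own."""
    q = trace.input.get("message") or trace.input.get("question")
    ctx_span = next((s for s in trace.observations if s.name == "retrieve"), None)
    contexts = (ctx_span.output or []) if ctx_span else []
    contexts = [c.get("text", "") for c in contexts] if isinstance(contexts, list) else []
    answer = trace.output.get("answer") if isinstance(trace.output, dict) else str(trace.output)
    return {"trace_id": trace.id, "question": q, "contexts": contexts, "answer": answer}


def make_fixture_trace(span_name="retrieve"):
    """Hand-built stand-in for a trace; field names mirror what the runner reads."""
    return SimpleNamespace(
        id="trace-123",
        input={"question": "How long are audit logs retained?"},
        output={"answer": "Audit logs are retained for 7 years."},
        observations=[
            SimpleNamespace(name=span_name, output=[{"text": "Section 4.2 Audit Log Retention: ..."}]),
            SimpleNamespace(name="generate", output="Audit logs are retained for 7 years."),
        ],
    )


def test_extracts_all_fields():
    row = extract_eval_row(make_fixture_trace())
    assert row["trace_id"] == "trace-123"
    assert row["question"] == "How long are audit logs retained?"
    assert row["contexts"] == ["Section 4.2 Audit Log Retention: ..."]
    assert row["answer"] == "Audit logs are retained for 7 years."


def test_renamed_span_yields_empty_contexts():
    # If the retrieval span is renamed upstream, contexts come back empty and
    # the runner silently drops the row; this test makes that visible in CI.
    row = extract_eval_row(make_fixture_trace(span_name="retriever"))
    assert row["contexts"] == []


if __name__ == "__main__":
    test_extracts_all_fields()
    test_renamed_span_yields_empty_contexts()
    print("fixture tests passed")
```

In a real repository the fixture would more usefully be a JSON export of an actual trace, refreshed whenever the orchestrator's instrumentation changes.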

Schedule as a Kubernetes CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ragas-online-eval
  namespace: evaluation
spec:
  schedule: "0 * * * *"                # hourly
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 1800      # fail hard if taking > 30 min
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: eval
              image: registry.example.ae/eval/ragas-runner:0.2.8
              env:
                - name: LITELLM_URL
                  value: "http://litellm.llm-gateway.svc.cluster.local:4000"
                - name: LANGFUSE_HOST
                  value: "http://langfuse-web.langfuse.svc.cluster.local:3000"
                - name: SAMPLE_RATE
                  value: "0.02"
                - name: WINDOW_HOURS
                  value: "1"
              envFrom:
                - secretRef:
                    name: ragas-secrets
              resources:
                requests: {cpu: "1", memory: "2Gi"}
                limits: {memory: "4Gi"}

The virtual key in LITELLM_EVAL_KEY has a separate budget cap in LiteLLM so eval cost is tracked and capped independently from production traffic.

Offline eval against a labeled suite

For pre-deploy regression testing, maintain a YAML eval suite in Git:

# eval-suite.yaml
- id: onboarding-multi-hop
  question: "Which APIs do I need to call to onboard a new enterprise customer?"
  reference_answer: |
    The onboarding flow requires calls to (1) /v1/accounts to create the account,
    (2) /v1/entitlements to grant plan-tier access, and (3) /v1/provision to
    initialize the customer's tenant environment. Optionally, /v1/sso-setup for
    SAML integration.
  ground_truth_contexts:
    - "POST /v1/accounts creates a new enterprise account..."
    - "POST /v1/entitlements grants plan-tier access..."
    - "POST /v1/provision initializes the customer's tenant..."

- id: policy-lookup
  question: "What is our data retention policy for customer audit logs?"
  reference_answer: "Audit logs are retained for 7 years, with the last 12 months in hot storage..."
  ground_truth_contexts:
    - "Section 4.2 Audit Log Retention: ..."

Offline runner script:

# eval/offline.py
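import yaml
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall,
)

from eval.runner import EVALUATOR_LLM   # reuse the judge configured in eval/runner.py

with open("eval-suite.yaml") as f:
    suite = yaml.safe_load(f)

# Run each question through the production RAG endpoint.
# rag_client is a placeholder for your own client (not defined in this post);
# ask() is assumed to return a dict with "answer" and "retrieved_contexts".
rows = []
for item in suite:
    resp = rag_client.ask(item["question"])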
    rows.append({
        "question": item["question"],
        "answer": resp["answer"],
        "contexts": resp["retrieved_contexts"],
        "ground_truth": item["reference_answer"],
        "reference_contexts": item["ground_truth_contexts"],
    })

result = evaluate(
    dataset=Dataset.from_list(rows),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    llm=EVALUATOR_LLM,
)

# Fail the CI build if any metric's mean falls below its absolute threshold
thresholds = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.70,
}
scores = result.to_pandas()              # per-sample scores, one column per metric
for metric, threshold in thresholds.items():
    score = scores[metric].mean()
    print(f"{metric}: {score:.3f} (threshold {threshold})")
    if score < threshold:
        raise SystemExit(f"REGRESSION: {metric} below threshold")

Run this in the CI pipeline before any RAG-related deploy. It’s the equivalent of unit tests for retrieval quality.
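Absolute thresholds catch a metric falling through the floor, but not a release that quietly loses several points while staying above it. One option is to also compare against the scores of the last accepted release. A minimal sketch, assuming the baseline means are stored somewhere CI can read them; find_regressions and the example numbers are hypothetical:

```python
def find_regressions(baseline, current, max_relative_drop=0.05):
    """Return {metric: relative_drop} for metrics that fell by more than the allowed fraction."""
    regressions = {}
    for metric, base_score in baseline.items():
        if base_score <= 0:
            continue  # no meaningful relative change against a zero baseline
        drop = (base_score - current[metric]) / base_score
        if drop > max_relative_drop:
            regressions[metric] = drop
    return regressions


if __name__ == "__main__":
    # Illustrative numbers only: means from the last accepted release vs this build
    baseline = {"faithfulness": 0.90, "answer_relevancy": 0.82}
    current = {"faithfulness": 0.84, "answer_relevancy": 0.80}
    failed = find_regressions(baseline, current)
    for metric, drop in failed.items():
        print(f"REGRESSION: {metric} dropped {drop:.1%} against baseline")
    if failed:
        raise SystemExit(1)
```

Here faithfulness is still close to the 0.85 floor but has dropped more than 5% relative to the baseline, so the build fails on it; answer_relevancy moved by less than the tolerance and passes.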

Dashboards and alerts

Langfuse surfaces the scores in its UI. For the operations team, wire them into Grafana via Langfuse’s data export or a Prometheus bridge.

Prometheus alert rules:

groups:
  - name: rag-quality
    rules:
      - alert: RAGFaithfulnessDropped
        expr: avg_over_time(ragas_faithfulness_hourly[24h]) < 0.80
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: "RAG faithfulness score dropped below 0.80 over 24h"
          runbook: "https://runbooks.example.ae/rag-faithfulness-drop"

      - alert: RAGAnswerRelevancyDropped
        expr: avg_over_time(ragas_answer_relevancy_hourly[24h]) < 0.75
        for: 2h
        labels:
          severity: critical
        annotations:
          summary: "RAG answer relevancy critical - investigate model or prompt changes"

The bridge job exposes ragas_*_hourly gauges by querying Langfuse’s aggregate scores API and exporting them as Prometheus metrics every 5 minutes.
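The core of such a bridge is small: average the last hour of scores per metric and publish each average under the ragas_<metric>_hourly name the alert rules expect. The sketch below covers only that aggregation and renders the Prometheus text exposition format by hand so it runs with the standard library; fetching scores from Langfuse is left out, and in a real bridge the prometheus-client package already pinned in the image would serve the metrics. The function names are hypothetical:

```python
from statistics import fmean


def hourly_gauges(scores: dict[str, list[float]]) -> dict[str, float]:
    """Map each metric to the mean of its scores, named ragas_<metric>_hourly."""
    gauges = {}
    for metric, values in scores.items():
        clean = [v for v in values if v == v]  # drop NaN, as the runner does
        if clean:
            gauges[f"ragas_{metric}_hourly"] = fmean(clean)
    return gauges


def render_exposition(gauges: dict[str, float]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name in sorted(gauges):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {gauges[name]:.4f}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative scores for one hour; a real bridge would read them from Langfuse
    last_hour = {
        "faithfulness": [0.9, 0.8, float("nan")],
        "answer_relevancy": [0.7, 0.9],
    }
    print(render_exposition(hourly_gauges(last_hour)), end="")
```

Metrics with no valid scores in the window are omitted rather than exported as zero, so an empty hour shows up as a gap instead of a false SLO breach.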

Detecting specific failure patterns

Beyond aggregate metrics, Ragas supports custom metrics. Two worth deploying:

1. Hallucination detector - extends faithfulness with entity-level checks:

from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic

# Metrics constructed directly expect a Ragas LLM wrapper, not the raw LangChain client
JUDGE_LLM = LangchainLLMWrapper(EVALUATOR_LLM)

hallucination_critic = AspectCritic(
    name="hallucination",
    definition="Does the answer contain factual claims not supported by the retrieved context?",
    llm=JUDGE_LLM,
)

2. Refusal tracker - detects over-refusal (model saying “I don’t have enough info” when context is sufficient):

refusal_critic = AspectCritic(
    name="unnecessary_refusal",
    definition="Did the answer refuse to answer despite the context containing the required information?",
    llm=JUDGE_LLM,
)

Track these alongside faithfulness - over-refusal is a common failure mode after a prompt-template change.
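Since these critics answer a yes/no question, the useful number is the share of sampled traces they flag, compared before and after a change. A minimal sketch, assuming each critic emits a binary verdict (1 = flagged) per trace; verdict_rate, rate_jump, and the 5-point tolerance are hypothetical choices:

```python
def verdict_rate(verdicts):
    """Fraction of valid binary verdicts that are 1, or None if there are none."""
    valid = [v for v in verdicts if v in (0, 1)]  # skips None and NaN entries
    if not valid:
        return None
    return sum(valid) / len(valid)


def rate_jump(before, after, min_increase=0.05):
    """True if the flagged rate rose by more than `min_increase` between two windows."""
    rate_before, rate_after = verdict_rate(before), verdict_rate(after)
    if rate_before is None or rate_after is None:
        return False  # not enough data to compare
    return (rate_after - rate_before) > min_increase


if __name__ == "__main__":
    # Illustrative verdicts: refusals before and after a prompt-template change
    before = [0] * 95 + [1] * 5
    after = [0] * 80 + [1] * 20
    print(f"refusal rate before: {verdict_rate(before):.0%}, after: {verdict_rate(after):.0%}")
    print("over-refusal spike" if rate_jump(before, after) else "no significant change")
```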

Sizing and cost

Rough per-trace eval cost on GPT-4o-class evaluator:

  • faithfulness + answer_relevancy: ~$0.005-$0.02 per trace
  • All four metrics (adding context_precision + context_recall): ~$0.01-$0.04 per trace in total
  • Custom aspect critics: ~$0.002 extra per critic per trace

At 1M RAG requests/month with 1% sampling, that’s 10,000 evals/month:

  • Faithfulness + relevancy only: $50-200/month
  • Full four metrics: $100-400/month
  • With 2-3 custom aspect critics on top of the four metrics: roughly $140-460/month (the critics add about $40-60)

Cap eval spend in LiteLLM via the eval virtual key’s budget so overruns fail fast.
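The arithmetic behind these ranges is simple enough to keep as a small helper, which is useful when deciding what budget to set on the eval virtual key. A sketch using the rough per-trace costs quoted above; monthly_eval_cost is a hypothetical name:

```python
def monthly_eval_cost(requests_per_month, sample_rate, cost_low, cost_high):
    """Estimate eval volume and the low/high monthly spend in USD."""
    evals = round(requests_per_month * sample_rate)
    return {
        "evals": evals,
        "low_usd": evals * cost_low,
        "high_usd": evals * cost_high,
    }


if __name__ == "__main__":
    # 1M requests/month, 1% sampling, faithfulness + answer_relevancy only
    estimate = monthly_eval_cost(1_000_000, 0.01, 0.005, 0.02)
    print(
        f"{estimate['evals']} evals/month, "
        f"${estimate['low_usd']:.0f}-${estimate['high_usd']:.0f}/month"
    )
```

Setting the virtual key's budget a little above the high end of the estimate leaves room for retries while still failing fast on a runaway job.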

Common failure modes

  • Eval jobs silently fail to score - Langfuse trace structure changed; extract_eval_row returns empty. Add fixture tests against sample traces in CI.
  • Scores drop after an LLM provider change - evaluator LLM is inconsistent across providers. Pin the evaluator to a single provider and version.
  • Ground truth drift - your labeled eval suite goes stale as the corpus changes. Schedule quarterly suite reviews with the product team.
  • Metric thrash on small samples - 1% sampling at low RAG volume gives noisy scores. Either raise sample rate or use 7-day rolling averages in alerts.
  • Evaluator quota exhaustion - eval jobs queue and stall behind production traffic on the LiteLLM proxy. Give evaluator a dedicated virtual key with isolated rate limits.
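To judge whether a sample is large enough to alert on, it helps to put an error bar on the mean. The sketch below uses a normal approximation for a 95% interval, which is only a rough guide for bounded, skewed scores; mean_with_interval and the example scores are hypothetical:

```python
from math import sqrt
from statistics import fmean, stdev


def mean_with_interval(scores, z=1.96):
    """Return (mean, half_width) of an approximate 95% interval for the mean."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    half_width = z * stdev(scores) / sqrt(len(scores))
    return fmean(scores), half_width


if __name__ == "__main__":
    # Illustrative scores: the same spread, sampled 20 times vs 2000 times
    small = [0.8, 0.9] * 10
    large = [0.8, 0.9] * 1000
    for label, scores in (("n=20", small), ("n=2000", large)):
        mean, half = mean_with_interval(scores)
        print(f"{label}: {mean:.3f} +/- {half:.3f}")
```

If the interval is wider than the gap between the current mean and the alert threshold, a breach in that window may be sampling noise; that is the case for a higher sample rate or a longer averaging window.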

What this closes

Ragas closes the quality loop that the rest of the AI engineering stack opens.

With these 12 pieces in place - observability, gateway, retrieval (vector + graph), serving, platform, evaluation - you have a production AI engineering stack that’s operationally sound and continuously measurable.

Getting help

We deploy Ragas evaluation pipelines for GCC enterprise AI teams running production RAG. The biggest value is usually in the first three weeks: establishing baselines, tuning alert thresholds to your actual quality bar, and getting the first set of real regressions caught before they reach users. Our AI/ML Infrastructure on Kubernetes service covers the engagement.

Frequently Asked Questions

What does Ragas evaluate in a RAG pipeline?

Ragas computes four primary metrics: (1) faithfulness - are answer claims supported by retrieved context, (2) answer_relevancy - does the answer address the question, (3) context_precision - is retrieved context actually useful for the question, (4) context_recall - did retrieval find all the info needed. Faithfulness and answer_relevancy need only the question, the retrieved context, and the answer; context_precision and context_recall also need reference answers. In production we typically run faithfulness + answer_relevancy online (no ground truth needed) and context_precision/recall offline against a labeled eval set.

Is Ragas production-ready for continuous evaluation?

Yes - we run it as a Kubernetes CronJob against both a static eval set (nightly) and live Langfuse trace samples (hourly). The library is stable at v0.2+, uses LLM-as-judge for most metrics (requires an evaluator LLM), and has native integrations with Langfuse, LangChain, and LlamaIndex. The operational considerations are mostly cost - LLM-judge evaluations burn tokens, so sample rather than evaluate every trace.

How do I choose an evaluator LLM?

Use a model at least as strong as your generation model, and preferably a different one. For a production stack generating with GPT-4o, evaluate with a comparable model from another family, such as Claude 3.5 Sonnet. For a stack generating with a 70B self-hosted model, evaluate with GPT-4o via Azure OpenAI UAE. The evaluator needs to be capable enough that its judgments are trustworthy as a quality signal. Do not evaluate with the same model that generates - you'll measure self-consistency, not quality.

How do I prevent evaluator cost runaway?

Three guardrails: (1) sample rather than evaluate every trace - 1-5% sampling is usually enough to detect quality trends, (2) cap the eval job's LiteLLM virtual-key budget (see our LiteLLM guide) - if evaluation blows past budget the job fails fast rather than silently continuing, (3) cache evaluation results so identical question/answer pairs don't re-evaluate. A 1% sample of 1M daily traces is 10,000 evals/day, or roughly $50-200/day for faithfulness + answer_relevancy at the per-trace costs quoted in the sizing section.
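For the caching guardrail, the main design question is what goes into the cache key. The sketch below hashes the question, answer, retrieved contexts, metric names, and evaluator model; including contexts and the evaluator model is our assumption (faithfulness depends on the contexts, and a new judge should not reuse old verdicts). EvalCache is an in-memory stand-in for whatever store you use, and both names are hypothetical:

```python
import hashlib
import json


def eval_cache_key(question, answer, contexts, metrics, evaluator_model):
    """Stable key for one evaluation: same inputs always give the same hex digest."""
    payload = json.dumps(
        {
            "question": question,
            "answer": answer,
            "contexts": list(contexts),
            "metrics": sorted(metrics),  # metric order should not change the key
            "evaluator_model": evaluator_model,
        },
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class EvalCache:
    """In-memory stand-in for a shared cache such as Redis or a database table."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def get_or_compute(self, key, compute):
        if key in self._store:
            self.hits += 1
            return self._store[key]
        value = compute()  # only called on a cache miss
        self._store[key] = value
        return value


if __name__ == "__main__":
    cache = EvalCache()
    key = eval_cache_key(
        "What is the retention policy?",
        "Seven years.",
        ["Section 4.2 Audit Log Retention: ..."],
        ["faithfulness", "answer_relevancy"],
        "gpt-4o-uae-primary",
    )
    # The lambda stands in for the real (expensive) Ragas evaluation call
    for _ in range(3):
        cache.get_or_compute(key, lambda: {"faithfulness": 1.0})
    print(f"cache hits: {cache.hits}")
```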

How does Ragas integrate with Langfuse?

Langfuse stores RAG traces (question, retrieved context, answer, metadata). A periodic Kubernetes Job queries Langfuse for recent traces, runs Ragas metrics, and writes scores back to Langfuse as custom scores on each trace. The Langfuse UI then shows quality scores alongside traces, and you can filter/sort/alert on them. This is the closed-loop pattern we deploy - observability captures behavior, evaluation scores it, alerts fire on degradation.

Can Ragas evaluations run on-premises for GCC compliance?

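Yes. Ragas itself is a Python library; the only calls it needs to make are to the judge LLM (and an embedding model for answer_relevancy). Point both at your in-region endpoint (Azure OpenAI UAE North, self-hosted via LiteLLM, or Bedrock Middle East) and evaluation stays in-region. Ragas also collects anonymous usage telemetry by default; set RAGAS_DO_NOT_TRACK=true in the Job environment to disable it. The Kubernetes Jobs, the Langfuse integration, and the storage of scores all remain inside your cluster.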
