Use NVIDIA’s RAG metrics (Answer Accuracy, Context Relevance, Response Groundedness) via Ragas to grade Arize AX traces and as evaluators in Arize AX experiments.
NVIDIA’s RAG evaluation metrics ship inside Ragas as a dedicated collection: AnswerAccuracy (does the response match a reference), ContextRelevance (does the retrieved context cover the question), and ResponseGroundedness (is the response actually supported by the context). They’re tuned to match NVIDIA’s published RAG quality benchmarks while reusing Ragas’s prompting and execution machinery.

This guide shows both ways to wire them into Arize AX: Flow 1 grades existing Arize AX traces with ResponseGroundedness and writes the scores back via client.spans.update_evaluations(...); Flow 2 uploads a small dataset, runs an Arize AX experiment with the same evaluator wrapped as an experiment evaluator, and surfaces the scores in Datasets + Experiments. For the sibling Ragas integration (the standard metrics such as Faithfulness and AnswerRelevancy), see the Ragas evaluation guide.

Both flows share the same setup. Run the code blocks below in order inside a single Python session — each block builds on imports and variables from earlier ones.
The shared setup: NVIDIA’s v2 ResponseGroundedness metric (from ragas.metrics.collections) backed by gpt-5-mini via Ragas’s llm_factory, the canonical 2-row hallucination dataset both flows score, and an Arize SDK client. temperature=1.0 is passed explicitly because gpt-5 only supports the default temperature.
# combined.py
import os
import time
from datetime import datetime, timedelta, timezone

import pandas as pd
from arize import ArizeClient
from openai import AsyncOpenAI, OpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import ResponseGroundedness

SPACE_ID = os.environ["ARIZE_SPACE_ID"]
API_KEY = os.environ["ARIZE_API_KEY"]
TIMESTAMP = int(time.time())

# NVIDIA ResponseGroundedness scores 0.0–1.0 — 1.0 means the response is
# fully supported by the retrieved context, 0.0 means it's not. The v2
# metric uses Ragas's modern `llm_factory` API and a dual-judge prompt
# pair under the hood; it accepts an `InstructorBaseRagasLLM`.
ragas_llm = llm_factory(
    "gpt-5-mini",
    client=AsyncOpenAI(),
    temperature=1.0,  # gpt-5 only supports the default temperature
)
response_groundedness = ResponseGroundedness(llm=ragas_llm)

# Canonical 2-row dataset — row 0 is factual (answer matches the reference),
# row 1 is hallucinated. Both flows grade these same rows.
ROWS = [
    {
        "input": "What is the capital of France?",
        "output": "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input": "What is the capital of France?",
        "output": "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
]

arize = ArizeClient(api_key=API_KEY)
Instrument OpenAI with OpenInference, make two calls (each forced to echo a known answer so the trace contains predictable text), then pull the resulting spans back from Arize AX.
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

PROJECT_NAME = f"nv-ragas-tracing-example-{TIMESTAMP}"

resource = Resource.create(
    {
        "service.name": PROJECT_NAME,
        "openinference.project.name": PROJECT_NAME,
        "model_id": PROJECT_NAME,
    }
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://otlp.arize.com:443",
            headers={
                "authorization": API_KEY,
                "arize-space-id": SPACE_ID,
                "arize-interface": "python",
            },
        )
    )
)
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument(tracer_provider=provider)

sync_oai = OpenAI()
for row in ROWS:
    sync_oai.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a fact-recall assistant. The user states the "
                    "exact answer to use; reply with that verbatim."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {row['input']}\n"
                    f"Answer (reply verbatim): {row['output']}"
                ),
            },
        ],
    )
provider.force_flush(timeout_millis=10_000)
print(f"Project: {PROJECT_NAME}")

# Poll defensively: Arize's OTLP ingest and Flight export use different
# catalogs and the new project can briefly appear "unauthorized" to the
# export endpoint while still accepting span writes via OTLP, so swallow
# transient errors and retry.
start = datetime.now(timezone.utc) - timedelta(minutes=5)
end = datetime.now(timezone.utc) + timedelta(minutes=1)
spans_df = None
last_err: Exception | None = None
for _ in range(12):
    time.sleep(5)
    try:
        spans_df = arize.spans.export_to_df(
            space_id=SPACE_ID,
            project_name=PROJECT_NAME,
            start_time=start,
            end_time=end,
        )
    except Exception as e:
        last_err = e
        continue
    if spans_df is not None and len(spans_df) >= len(ROWS):
        break
else:
    raise RuntimeError(
        f"Spans never appeared after 60s (last error: {last_err})"
    )

spans_df = spans_df.sort_values("start_time").reset_index(drop=True)
The v2 metric exposes ascore(response=..., retrieved_contexts=[...]) directly — no SingleTurnSample wrapper. It’s async, so wrap each call in asyncio.run(...) from sync code. Flow 2 below uses the same metric from inside an async def evaluator wrapper.
import asyncio

scores = []
labels = []
for i, row in spans_df.iterrows():
    result = asyncio.run(
        response_groundedness.ascore(
            response=row["output"],
            retrieved_contexts=[ROWS[i]["reference"]],
        )
    )
    score = float(result.value)
    scores.append(score)
    labels.append("grounded" if score >= 0.5 else "ungrounded")
Flow 1 results:

                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0
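The overview above mentions writing the scores back via client.spans.update_evaluations(...). The sketch below is a minimal, hedged version of that step: the eval.<name>.<field> column layout matches the column named in the troubleshooting section later in this guide, but the exact keyword arguments of update_evaluations are assumptions, so confirm them against the reference for your installed arize SDK version.

# Hedged sketch: attach the scores to the traced spans as evaluations.
# The evaluations dataframe is keyed by span id; the exact argument
# names of update_evaluations are assumptions, not the confirmed API.
evals_df = pd.DataFrame(
    {
        "context.span_id": spans_df["context.span_id"],
        "eval.nv_response_groundedness.score": scores,
        "eval.nv_response_groundedness.label": labels,
        "eval.nv_response_groundedness.explanation": [
            "NVIDIA ResponseGroundedness (Ragas)"
        ] * len(scores),
    }
)
arize.spans.update_evaluations(
    space_id=SPACE_ID,
    project_name=PROJECT_NAME,
    evaluations_dataframe=evals_df,
)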
Open the project named nv-ragas-tracing-example-<timestamp> (the value printed above) in your Arize AX space. Each ChatCompletion span now carries nv_response_groundedness evaluation columns showing the 0.0–1.0 score and the grounded / ungrounded label.
Experiment evaluators run inside an asyncio loop, so the wrapper is async def and awaits response_groundedness.ascore(...) directly. Return an EvaluationResult with score, label, and explanation populated — leaving any of those as None triggers unsupported cast from null to <type>: reserved column cannot be coerced to canonical type at upload time.
from arize.experiments.evaluators.types import EvaluationResult

async def nv_response_groundedness_eval(
    input, output, dataset_row
) -> EvaluationResult:
    result = await response_groundedness.ascore(
        response=output if isinstance(output, str) else str(output),
        retrieved_contexts=[dataset_row["reference"]],
    )
    score = float(result.value)
    return EvaluationResult(
        score=score,
        label="grounded" if score >= 0.5 else "ungrounded",
        explanation="NVIDIA ResponseGroundedness (Ragas)",
    )
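Flow 2 then uploads the 2-row dataset and runs the experiment whose names the next paragraph refers to. What follows is a minimal sketch under assumptions: the method names and keyword arguments (arize.datasets.create(...), arize.experiments.run(...), the task signature) are illustrative rather than the SDK's confirmed API, so adapt them to the Datasets & Experiments client in your installed arize version.

DATASET_NAME = f"nv-ragas-experiment-example-ds-{TIMESTAMP}"
EXPERIMENT_NAME = f"nv-ragas-experiment-example-{TIMESTAMP}"

# Hypothetical dataset upload: one row per entry in ROWS, keeping the
# input / output / reference columns the evaluator expects.
dataset = arize.datasets.create(
    space_id=SPACE_ID,
    name=DATASET_NAME,
    data=pd.DataFrame(ROWS),
)

def task(dataset_row) -> str:
    # Echo the dataset's own output so the evaluator grades predictable
    # text, mirroring the forced-echo prompt used in Flow 1.
    return dataset_row["output"]

# Hypothetical experiment run: the async evaluator defined above is
# passed in directly and awaited inside Arize AX's own event loop.
experiment = arize.experiments.run(
    space_id=SPACE_ID,
    dataset=dataset,
    task=task,
    evaluators=[nv_response_groundedness_eval],
    name=EXPERIMENT_NAME,
)
print(f"Dataset: {DATASET_NAME}")
print(f"Experiment: {EXPERIMENT_NAME}")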
Open the Datasets + Experiments tab in Arize AX. The dataset nv-ragas-experiment-example-ds-<timestamp> and the experiment nv-ragas-experiment-example-<timestamp> (names printed above) appear with one run per dataset row, each carrying the nv_response_groundedness score and label columns.
Skipping a sample by assigning it nan score. The judge call failed (rate limit, model error, etc.) and Ragas swallowed the exception. Check the warning lines just above this message in stderr for the actual error.
column "eval.nv_response_groundedness.label": unsupported cast from null to string: reserved column cannot be coerced to canonical type. Your experiment evaluator returned a bare float instead of a fully-populated EvaluationResult(score=..., label=..., explanation=...). Arize AX’s Flight server rejects null reserved columns.
Spans never appear after 60s. Span flush + ingest typically takes 5–15s. If the loop times out, check that ARIZE_SPACE_ID + ARIZE_API_KEY are right and that you’re connecting to the correct region’s OTLP endpoint (otlp.arize.com for US, otlp.eu.arize.com for EU).
Using AnswerAccuracy or ContextRelevance instead. Swap ResponseGroundedness for AnswerAccuracy (requires reference field on the sample) or ContextRelevance (requires user_input + retrieved_contexts). The wiring is otherwise identical. See Ragas NVIDIA metrics docs for the per-metric required fields.
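A rough sketch of that swap is below; the per-metric ascore keyword arguments are assumptions inferred from the required fields listed above, so confirm them against the Ragas NVIDIA metrics docs before relying on them.

import asyncio

from ragas.metrics.collections import AnswerAccuracy, ContextRelevance

answer_accuracy = AnswerAccuracy(llm=ragas_llm)
context_relevance = ContextRelevance(llm=ragas_llm)

# AnswerAccuracy compares the response against a ground-truth reference.
acc = asyncio.run(
    answer_accuracy.ascore(
        user_input=ROWS[0]["input"],
        response=ROWS[0]["output"],
        reference=ROWS[0]["reference"],
    )
)

# ContextRelevance judges whether the retrieved context covers the
# question; it never looks at the response.
rel = asyncio.run(
    context_relevance.ascore(
        user_input=ROWS[0]["input"],
        retrieved_contexts=[ROWS[0]["reference"]],
    )
)
print(acc.value, rel.value)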
Using the official NVIDIA RAG-Eval suite directly (not via Ragas). The standalone nvidia-rag-eval package exists but is paywalled behind NVIDIA AI Enterprise. The Ragas wrappers used here are the open, community-supported path to the same metric definitions.
Experiment re-runs collide. Both names embed TIMESTAMP = int(time.time()), so each fresh run of combined.py gets unique names. If you re-run the dataset and experiment steps inside the same Python session (where TIMESTAMP has already been computed), regenerate TIMESTAMP first or call arize.experiments.delete(...) / arize.datasets.delete(...) on the prior run's names.