

The Ragas library ships LLM-as-judge evaluators (faithfulness, answer relevancy, context recall, and many more) designed for RAG and agent workloads. This guide shows both ways to wire Ragas into Arize AX. Flow 1 grades existing Arize AX traces with a Ragas evaluator and writes the scores back via client.spans.update_evaluations(...). Flow 2 uploads a small dataset, runs an Arize AX experiment with a Ragas-backed evaluator function, and surfaces the scores in Datasets + Experiments. Both flows share the same setup. Run the code blocks below in order inside a single Python session; each block builds on imports and variables from earlier ones.

Prerequisites

  • Python 3.11+
  • An ARIZE_SPACE_ID and ARIZE_API_KEY from your Arize AX space settings
  • An OPENAI_API_KEY from OpenAI Platform (used as both the model under trace and Ragas’s judge LLM)

Launch Arize AX

If you don’t already have an Arize AX account, sign up at arize.com and grab your ARIZE_SPACE_ID and ARIZE_API_KEY from Settings → Space Settings.

Install

pip install ragas 'arize>=8.0.0' openai openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc pandas
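The dataset and experiment APIs below assume the v8 SDK (arize>=8.0.0). A quick way to confirm which version landed, assuming the package exposes __version__ as most packages do:

import arize
print(arize.__version__)  # expect 8.x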

Configure credentials

export ARIZE_SPACE_ID="<your-space-id>"
export ARIZE_API_KEY="<your-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
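Once inside your Python session, you can fail fast on missing credentials; this minimal check simply mirrors the three exports above:

import os

# Abort early if any required credential is missing from the environment.
for var in ("ARIZE_SPACE_ID", "ARIZE_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set")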

Define evaluators

The shared setup has three pieces: a Ragas Faithfulness evaluator backed by GPT-5 (via an AsyncOpenAI client, since Ragas's new collections API requires async), the canonical 2-row hallucination dataset that both flows score, and an Arize SDK client.
# combined.py
import os
import time
from datetime import datetime, timedelta, timezone

import pandas as pd
from arize import ArizeClient
from openai import AsyncOpenAI, OpenAI
from ragas.llms import llm_factory
from ragas.metrics.collections import Faithfulness

SPACE_ID = os.environ["ARIZE_SPACE_ID"]
API_KEY = os.environ["ARIZE_API_KEY"]
TIMESTAMP = int(time.time())

# Ragas Faithfulness measures how well a response is grounded in the
# retrieved context. The new collections API requires an async client.
async_oai = AsyncOpenAI()
ragas_llm = llm_factory("gpt-5", client=async_oai)
faithfulness = Faithfulness(llm=ragas_llm)

# Canonical 2-row dataset — row 0 is factual (answer matches the reference),
# row 1 is hallucinated. Both flows grade these same rows.
ROWS = [
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
]

arize = ArizeClient(api_key=API_KEY)
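Optionally, smoke-test the judge before wiring it into either flow. This sketch scores the hallucinated row directly with the same score(...) call Flow 1 uses below, so a failure here points at the OpenAI key or the Ragas setup rather than at Arize AX:

# Optional smoke test: the Berlin row should score low on faithfulness.
result = faithfulness.score(
    user_input=ROWS[1]["input"],
    response=ROWS[1]["output"],
    retrieved_contexts=[ROWS[1]["reference"]],
)
print(f"Smoke-test faithfulness: {result.value}")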

Flow 1 — Evaluate existing traces

Source the spans

Instrument OpenAI with OpenInference, make two calls (each forced to echo a known answer so the trace contains predictable text), then pull the resulting spans back from Arize AX.
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

PROJECT_NAME = f"ragas-tracing-example-{TIMESTAMP}"

resource = Resource.create(
    {
        "service.name":                PROJECT_NAME,
        "openinference.project.name":  PROJECT_NAME,
        "model_id":                    PROJECT_NAME,
    }
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://otlp.arize.com:443",
            headers={
                "authorization":     API_KEY,
                "arize-space-id":    SPACE_ID,
                "arize-interface":   "python",
            },
        )
    )
)
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument(tracer_provider=provider)

sync_oai = OpenAI()
for row in ROWS:
    sync_oai.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a fact-recall assistant. The user states the "
                    "exact answer to use; reply with that verbatim."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {row['input']}\n"
                    f"Answer (reply verbatim): {row['output']}"
                ),
            },
        ],
    )

provider.force_flush(timeout_millis=10_000)
print(f"Project: {PROJECT_NAME}")

# Spans take ~5–15s to be queryable after flush. Poll defensively: Arize's
# OTLP ingest and Flight export use different catalogs and the new project
# can briefly appear "unauthorized" to the export endpoint while still
# accepting span writes via OTLP, so swallow transient errors and retry.
start = datetime.now(timezone.utc) - timedelta(minutes=5)
end = datetime.now(timezone.utc) + timedelta(minutes=1)
spans_df = None
last_err: Exception | None = None
for _ in range(12):
    time.sleep(5)
    try:
        spans_df = arize.spans.export_to_df(
            space_id=SPACE_ID,
            project_name=PROJECT_NAME,
            start_time=start,
            end_time=end,
        )
    except Exception as e:
        last_err = e
        continue
    if spans_df is not None and len(spans_df) >= len(ROWS):
        break
else:
    raise RuntimeError(
        f"Spans never appeared after 60s (last error: {last_err})"
    )

spans_df = spans_df.sort_values("start_time").reset_index(drop=True)
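Optionally, eyeball what came back before scoring. The exact attribute columns vary with instrumentation versions, but context.span_id and output are the two the next steps rely on:

# Sanity check: two spans, with the columns the evaluator loop reads.
print(len(spans_df), "spans exported")
print(spans_df[["context.span_id", "output"]].head())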

Run the evaluators

Faithfulness.score(...) is the sync entry point; use it when you're not already inside an asyncio loop (Flow 2 below switches to ascore(...) because experiments evaluate inside one). Faithfulness returns a continuous score in [0.0, 1.0] that can wobble between runs: Berlin might score 0.0 one run and 0.25 the next, depending on how the judge counts partially supported statements. This guide binarizes via a 0.5 threshold so the printed score column stays stable across runs. If you want the raw fractional value, drop the 1.0 if … else 0.0 and assign result.value directly, as in the sketch after this code block.
scores = []
labels = []
for i, row in spans_df.iterrows():
    result = faithfulness.score(
        user_input=ROWS[i]["input"],
        response=row["output"],
        retrieved_contexts=[ROWS[i]["reference"]],
    )
    is_faithful = float(result.value) >= 0.5
    scores.append(1.0 if is_faithful else 0.0)
    labels.append("factual" if is_faithful else "hallucinated")
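As noted above, here is the raw-score variant, a sketch that skips the 0.5 binarization and keeps the fractional value Ragas returns (expect run-to-run wobble on the hallucinated row):

# Variant: keep the continuous faithfulness value instead of binarizing.
raw_scores = []
for i, row in spans_df.iterrows():
    result = faithfulness.score(
        user_input=ROWS[i]["input"],
        response=row["output"],
        retrieved_contexts=[ROWS[i]["reference"]],
    )
    raw_scores.append(float(result.value))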

Log evaluations to Arize AX

update_evaluations(...) requires a context.span_id column (which export_to_df already provides) plus the reserved eval.<name>.score and eval.<name>.label columns; an eval.<name>.explanation column can be attached the same way. Each Ragas score becomes one row in this DataFrame.
eval_df = pd.DataFrame(
    {
        "context.span_id":           spans_df["context.span_id"],
        "eval.faithfulness.score":   scores,
        "eval.faithfulness.label":   labels,
    }
)
arize.spans.update_evaluations(
    space_id=SPACE_ID,
    project_name=PROJECT_NAME,
    dataframe=eval_df,
)

# Print the scores so they appear in stdout for verification.
flow1_display = pd.DataFrame(
    {
        "input":  [r["input"]  for r in ROWS],
        "output": [r["output"] for r in ROWS],
        "score":  scores,
    }
)
print("Flow 1 results:")
print(flow1_display.to_string())

Expected output

Flow 1 results:
                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0

Verify in Arize AX

Open the project named ragas-tracing-example-<timestamp> (the value printed above) in your Arize AX space. Each ChatCompletion span now carries a faithfulness annotation column showing the score and label written by update_evaluations(...).

Flow 2 — Run an experiment

Create a dataset

The dataset is the same two rows. The space= / examples= kwarg names match the v8 SDK exactly (note: not space_id= and not dataframe=).
DATASET_NAME = f"ragas-experiment-example-ds-{TIMESTAMP}"
dataset_df = pd.DataFrame(ROWS)
arize.datasets.create(
    name=DATASET_NAME,
    space=SPACE_ID,
    examples=dataset_df,
)
print(f"Dataset: {DATASET_NAME}")

Define the task

The task function receives the dataset row and returns whatever the experiment should grade. The recognized parameter names are input, output, metadata, and dataset_row; a single-arg task with an unrecognized name is bound to dataset_row by default. A real workflow would call an LLM here (see the sketch after this block); this passthrough keeps the example deterministic.
def task(dataset_row):
    return dataset_row["output"]
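For contrast, a sketch of a real task that calls an LLM instead of echoing the row. It reuses sync_oai from Flow 1 (instantiate OpenAI() here if you skipped that flow), and its outputs, unlike the passthrough's, will vary between runs:

# Sketch: a non-deterministic task that actually queries an LLM.
def llm_task(dataset_row):
    completion = sync_oai.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": dataset_row["input"]}],
    )
    return completion.choices[0].message.content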

Wrap the evaluators

Experiment evaluators run inside an asyncio loop, so use async def and Ragas's ascore(...); the sync score(...) fails with Cannot call sync score() from an async context. Return an EvaluationResult with score, label, and explanation all populated: leaving any of those reserved fields as None triggers unsupported cast from null to <type>: reserved column cannot be coerced to canonical type at upload time.
from arize.experiments.evaluators.types import EvaluationResult


async def faithfulness_eval(input, output, dataset_row) -> EvaluationResult:
    result = await faithfulness.ascore(
        user_input=dataset_row["input"],
        response=output if isinstance(output, str) else str(output),
        retrieved_contexts=[dataset_row["reference"]],
    )
    is_faithful = float(result.value) >= 0.5
    return EvaluationResult(
        score=1.0 if is_faithful else 0.0,
        label="factual" if is_faithful else "hallucinated",
        explanation=result.reason or "no explanation",
    )

Run the experiment

EXPERIMENT_NAME = f"ragas-experiment-example-{TIMESTAMP}"
experiment, runs_df = arize.experiments.run(
    space=SPACE_ID,
    name=EXPERIMENT_NAME,
    dataset=DATASET_NAME,
    task=task,
    evaluators={"faithfulness": faithfulness_eval},
)
print(f"Experiment: {EXPERIMENT_NAME}")
print("Flow 2 results:")
flow2_display = runs_df[
    ["output", "eval.faithfulness.score", "eval.faithfulness.label"]
].rename(columns={"eval.faithfulness.score": "score", "eval.faithfulness.label": "label"})
print(flow2_display.to_string())

Expected output

Flow 2 results:
                             output  score         label
0   Paris is the capital of France.    1.0       factual
1  Berlin is the capital of France.    0.0  hallucinated

Verify in Arize AX

Open the Datasets + Experiments tab in Arize AX. The dataset ragas-experiment-example-ds-<timestamp> and the experiment ragas-experiment-example-<timestamp> (names printed above) appear with one run per dataset row, each carrying the faithfulness score and label columns.

Troubleshooting

  • Cannot call sync score() from an async context. Your evaluator function in Flow 2 is calling faithfulness.score(...) instead of faithfulness.ascore(...). Experiment evaluators run inside asyncio; use the async API. Flow 1 calls score(...) because it runs outside any loop.
  • column "eval.<name>.label": unsupported cast from null to string: reserved column cannot be coerced to canonical type. Your evaluator returned a bare number or string instead of a fully-populated EvaluationResult(score=..., label=..., explanation=...). Arize AX’s Flight server rejects null values in reserved eval columns — populate all three fields.
  • llm_factory() requires a client instance. The new Ragas collections API removed text-only LLMs. Pass a configured client: llm_factory("gpt-5", client=AsyncOpenAI()).
  • Spans never appear after 60s. Span flush + ingest typically takes 5–15s. If the loop times out, check that ARIZE_SPACE_ID + ARIZE_API_KEY are right and that you’re connecting to the correct region’s OTLP endpoint (otlp.arize.com for US, otlp.eu.arize.com for EU).
  • task failed for example id .... Your task function’s parameter name isn’t one of the recognized names (input, output, metadata, dataset_row). Rename it to dataset_row if you want the whole row, or pick the field you actually need.
  • Experiment runs duplicate or the dataset already exists. Both names embed TIMESTAMP = int(time.time()), so each fresh execution of combined.py gets unique names. If you re-run within the same second or reuse an existing session, regenerate TIMESTAMP first or call arize.experiments.delete(...) / arize.datasets.delete(...) on the prior run’s names.
  • Using a different Ragas metric. Swap Faithfulness for any class in ragas.metrics.collections (AnswerRelevancy, ContextRecall, FactualCorrectness, etc.). Each metric has slightly different required fields on SingleTurnSample; see the Ragas metrics docs and the sketch after this list.
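As a sketch of that last point, here is FactualCorrectness swapped in for Faithfulness. The llm= kwarg and the response= / reference= fields are assumptions extrapolated from the Faithfulness pattern above; check the Ragas metrics docs for each metric's exact required fields:

from ragas.metrics.collections import FactualCorrectness

# Assumption: FactualCorrectness accepts llm= and grades a response against
# a reference text rather than retrieved contexts.
factual_correctness = FactualCorrectness(llm=ragas_llm)
result = factual_correctness.score(
    response=ROWS[0]["output"],
    reference=ROWS[0]["reference"],
)
print(result.value)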

Resources

Ragas Documentation

Ragas on GitHub

Logging evaluations to Arize AX

NVIDIA RAG metrics via Ragas