The Ragas library ships LLM-as-judge evaluators — faithfulness, answer relevancy, context recall, and many more — designed for RAG and agent workloads. This guide shows both ways to wire Ragas into Arize AX: Flow 1 grades existing Arize AX traces with a Ragas evaluator and writes the scores back via `client.spans.update_evaluations(...)`; Flow 2 uploads a small dataset, runs an Arize AX experiment with a Ragas-backed evaluator function, and surfaces the scores in Datasets + Experiments.
Both flows share the same setup. Run the code blocks below in order inside a single Python session — each block builds on imports and variables from earlier ones.
Prerequisites
- Python 3.11+
- An `ARIZE_SPACE_ID` and `ARIZE_API_KEY` from your Arize AX space settings
- An `OPENAI_API_KEY` from OpenAI Platform (used as both the model under trace and Ragas’s judge LLM)
Launch Arize AX
If you don’t already have an Arize AX account, sign up at arize.com and grab your `ARIZE_SPACE_ID` and `ARIZE_API_KEY` from Settings → Space Settings.
Install
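A minimal install sketch. The package set below is inferred from the imports this guide uses; treat it as an assumption and verify exact names and versions against the Ragas and Arize AX docs.

```bash
# Package set inferred from the imports used in this guide; verify
# names/versions against the Ragas and Arize AX documentation.
pip install ragas openai arize arize-otel openinference-instrumentation-openai pandas
```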
Configure credentials
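One way to wire up the three credentials from the prerequisites, assuming you export them as environment variables before starting the session:

```python
import os

# Fail fast if any required credential is missing.
for var in ("ARIZE_SPACE_ID", "ARIZE_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"Set {var} before running this guide")

SPACE_ID = os.environ["ARIZE_SPACE_ID"]
API_KEY = os.environ["ARIZE_API_KEY"]
# The OpenAI clients below read OPENAI_API_KEY from the environment directly.
```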
Define evaluators
The shared setup: a Ragas Faithfulness evaluator backed by GPT-5 (via an `AsyncOpenAI` client — Ragas’s new collections API requires async), the canonical 2-row hallucination dataset that both flows score, and an Arize SDK client.
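A sketch of that setup. The `llm_factory("gpt-5", client=AsyncOpenAI())` call is confirmed by the troubleshooting notes below; the import paths, the `Faithfulness` constructor kwarg, the Arize client class, and the exact dataset wording are assumptions to adjust for your SDK version.

```python
import time

from openai import AsyncOpenAI
from ragas.llms import llm_factory                  # import path assumed
from ragas.metrics.collections import Faithfulness

from arize import ArizeClient                       # client class name assumed

# Judge LLM: the new collections API requires a configured client instance.
judge_llm = llm_factory("gpt-5", client=AsyncOpenAI())
faithfulness = Faithfulness(llm=judge_llm)          # constructor kwarg assumed

# Embedded in every project/dataset/experiment name so re-runs stay unique.
TIMESTAMP = int(time.time())

# Arize SDK client; it must expose the spans/datasets/experiments namespaces
# used below (constructor kwarg names assumed).
arize_client = ArizeClient(space_id=SPACE_ID, api_key=API_KEY)

# The canonical 2-row hallucination dataset: one faithful answer, one
# hallucinated one. The exact wording here is illustrative.
ROWS = [
    {
        "question": "What is the capital of France?",
        "context": "Paris is the capital and largest city of France.",
        "answer": "Paris is the capital of France.",   # faithful
    },
    {
        "question": "What is the capital of France?",
        "context": "Paris is the capital and largest city of France.",
        "answer": "Berlin is the capital of France.",  # hallucinated
    },
]
```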
Flow 1 — Evaluate existing traces
Source the spans
Instrument OpenAI with OpenInference, make two calls (each forced to echo a known answer so the trace contains predictable text), then pull the resulting spans back from Arize AX.
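A sketch of the tracing and export step, assuming the `register(...)` helper from `arize-otel` and the `ArizeExportClient` exporter; the exporter kwargs follow the pre-v8 SDK shape and may need adjusting.

```python
import time
from datetime import datetime, timedelta, timezone

from arize.otel import register
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor

PROJECT_NAME = f"ragas-tracing-example-{TIMESTAMP}"

# Route OpenAI spans to Arize AX via OpenInference auto-instrumentation.
tracer_provider = register(
    space_id=SPACE_ID, api_key=API_KEY, project_name=PROJECT_NAME
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

openai_client = OpenAI()
for row in ROWS:
    # Force the model to echo a known answer so the trace text is predictable.
    openai_client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": f"Reply with exactly: {row['answer']}"},
            {"role": "user", "content": row["question"]},
        ],
    )

time.sleep(15)  # span flush + ingest typically takes 5–15s (see Troubleshooting)

# Pull the spans back down (exporter shown here follows the pre-v8 SDK).
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

export_client = ArizeExportClient(api_key=API_KEY)
spans_df = export_client.export_to_df(
    space_id=SPACE_ID,
    model_id=PROJECT_NAME,
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(minutes=5),
    end_time=datetime.now(timezone.utc),
)
print(PROJECT_NAME, len(spans_df))
```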
Run the evaluators
`Faithfulness.score(...)` is the sync entry point; use it when you’re not already inside an `asyncio` loop (Flow 2 below switches to `ascore(...)` because experiments evaluate inside one).
Faithfulness returns a continuous score in [0.0, 1.0] that can wobble between runs (Berlin might score 0.0 one run and 0.25 the next, depending on how the judge counts partially-supported statements). This guide binarizes via a 0.5 threshold so the printed score column stays stable across runs. If you want the raw fractional value, drop the `1.0 if … else 0.0` and assign `result.value` directly.
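A sketch of the scoring loop. The keyword names passed to `score(...)` follow Ragas’s usual single-turn field names and are an assumption for the collections API:

```python
results = []
for row in ROWS:
    # Sync entry point: we are not inside an asyncio loop here.
    result = faithfulness.score(
        user_input=row["question"],              # field names assumed
        response=row["answer"],
        retrieved_contexts=[row["context"]],
    )
    score = 1.0 if result.value >= 0.5 else 0.0  # binarize at 0.5 for stability
    results.append(score)
    print(f"{row['answer']!r}: raw={result.value} binary={score}")
```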
Log evaluations to Arize AX
`update_evaluations(...)` requires a `context.span_id` column (which `export_to_df` already provides) plus the reserved `eval.<name>.{score,label,explanation}` columns. Each Ragas score becomes one row in this DataFrame.
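A sketch of that DataFrame and the write-back call. Only the reserved column names come from the SDK contract above; pairing spans with dataset rows by order and the `update_evaluations(...)` kwarg names are assumptions.

```python
import pandas as pd

# One row per Ragas score; spans are paired with dataset rows by order
# (an assumption that holds for this two-call example).
evals_df = pd.DataFrame(
    {
        "context.span_id": spans_df["context.span_id"].head(len(ROWS)).tolist(),
        "eval.faithfulness.score": results,
        "eval.faithfulness.label": [
            "faithful" if s else "unfaithful" for s in results
        ],
        "eval.faithfulness.explanation": [
            "Scored by Ragas Faithfulness (binarized at 0.5)" for _ in results
        ],
    }
)

# Write the evaluations back onto the traced spans (kwarg names assumed).
arize_client.spans.update_evaluations(
    project_name=PROJECT_NAME,
    dataframe=evals_df,
)
```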
Expected output
Verify in Arize AX
Open the project named `ragas-tracing-example-<timestamp>` (the value printed above) in your Arize AX space. Each ChatCompletion span now carries a faithfulness annotation column showing the score and label written by `update_evaluations(...)`.
Flow 2 — Run an experiment
Create a dataset
The dataset is the same two rows. The `space=` / `examples=` kwarg names match the v8 SDK exactly (note: not `space_id=` and not `dataframe=`).
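A sketch reusing `ROWS` from the setup. The `space=` and `examples=` kwargs come from the note above; the `create(...)` method name and the example shape are assumptions.

```python
DATASET_NAME = f"ragas-experiment-example-ds-{TIMESTAMP}"

dataset = arize_client.datasets.create(  # method name assumed
    space=SPACE_ID,      # note: space=, not space_id=
    name=DATASET_NAME,
    examples=ROWS,       # note: examples=, not dataframe=
)
```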
Define the task
The task function receives the dataset row and returns whatever the experiment should grade. The parameter name must be one of `input`, `output`, `metadata`, or `dataset_row` — a single-arg task with an unrecognized name is bound to `dataset_row` by default. A real workflow would call an LLM here; this passthrough keeps the example deterministic.
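A minimal passthrough sketch under those naming rules (the row’s field names follow the `ROWS` dicts defined in the setup):

```python
def task(dataset_row):
    # Passthrough: return the stored answer instead of calling an LLM,
    # so the experiment output stays deterministic.
    return dataset_row["answer"]
```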
Wrap the evaluators
Experiment evaluators run inside an `asyncio` loop, so use `async def` and Ragas’s `ascore(...)` — the sync `score(...)` fails with `Cannot call sync score() from an async context`. Return an `EvaluationResult` with `score`, `label`, and `explanation` populated: leaving any of those reserved fields as `None` triggers `unsupported cast from null to <type>: reserved column cannot be coerced to canonical type` at upload time.
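A sketch of the wrapper. The `EvaluationResult` import path, the evaluator’s parameter names, and the `ascore(...)` field names are assumptions to check against your SDK version:

```python
# Import path for EvaluationResult is assumed; check your SDK version.
from arize.experimental.datasets.experiments.types import EvaluationResult

async def faithfulness_eval(output, dataset_row):   # parameter names assumed
    # Async API: experiment evaluators run inside an asyncio loop.
    result = await faithfulness.ascore(
        user_input=dataset_row["question"],          # field names assumed
        response=output,
        retrieved_contexts=[dataset_row["context"]],
    )
    score = 1.0 if result.value >= 0.5 else 0.0
    # Populate all three reserved fields; None in any of them fails at upload.
    return EvaluationResult(
        score=score,
        label="faithful" if score else "unfaithful",
        explanation=f"Ragas faithfulness raw value: {result.value}",
    )
```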
Run the experiment
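A hedged sketch of the run step; the `experiments.run(...)` method name and every kwarg below are assumptions patterned on the names this guide uses elsewhere.

```python
EXPERIMENT_NAME = f"ragas-experiment-example-{TIMESTAMP}"

# Method name and all kwargs assumed; align them with the experiments API
# in your installed SDK version.
experiment = arize_client.experiments.run(
    space=SPACE_ID,
    dataset=DATASET_NAME,
    task=task,
    evaluators=[faithfulness_eval],
    name=EXPERIMENT_NAME,
)
print(DATASET_NAME, EXPERIMENT_NAME)
```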
Expected output
Verify in Arize AX
Open the Datasets + Experiments tab in Arize AX. The dataset `ragas-experiment-example-ds-<timestamp>` and the experiment `ragas-experiment-example-<timestamp>` (names printed above) appear with one run per dataset row, each carrying the faithfulness score and label columns.
Troubleshooting
- `Cannot call sync score() from an async context`. Your evaluator function in Flow 2 is calling `faithfulness.score(...)` instead of `faithfulness.ascore(...)`. Experiment evaluators run inside `asyncio`; use the async API. Flow 1 calls `score(...)` because it runs outside any loop.
- `column "eval.<name>.label": unsupported cast from null to string: reserved column cannot be coerced to canonical type`. Your evaluator returned a bare number or string instead of a fully-populated `EvaluationResult(score=..., label=..., explanation=...)`. Arize AX’s Flight server rejects null values in reserved eval columns — populate all three fields.
- `llm_factory() requires a client instance`. The new Ragas collections API removed text-only LLMs. Pass a configured client: `llm_factory("gpt-5", client=AsyncOpenAI())`.
- Spans never appear after 60s. Span flush + ingest typically takes 5–15s. If the loop times out, check that `ARIZE_SPACE_ID` + `ARIZE_API_KEY` are right and that you’re connecting to the correct region’s OTLP endpoint (`otlp.arize.com` for US, `otlp.eu.arize.com` for EU).
- `task failed for example id ...`. Your task function’s parameter name isn’t one of the recognized names (`input`, `output`, `metadata`, `dataset_row`). Rename it to `dataset_row` if you want the whole row, or pick the field you actually need.
- Experiment runs duplicate or the dataset already exists. Both names embed `TIMESTAMP = int(time.time())`, so a single re-run produces unique names. If you re-execute the same `combined.py` quickly, regenerate `TIMESTAMP` first or call `arize.experiments.delete(...)` / `arize.datasets.delete(...)` on the prior run’s names.
- Using a different Ragas metric. Swap `Faithfulness` for any class in `ragas.metrics.collections` (`AnswerRelevancy`, `ContextRecall`, `FactualCorrectness`, etc.). Each metric has slightly different required fields on `SingleTurnSample` — see the Ragas metrics docs.