You have validated evaluators and a golden dataset. Now make them part of your development process. Every time you change a prompt, swap a model, or update your agent logic, score your experiment results against your golden dataset before deploying. Each evaluation runs in a controlled, repeatable environment so you can measure how new versions behave before exposing them to real users.

The same evaluators you use for production monitoring work here. Define the criteria once, apply them everywhere.

This page covers running evaluators on experiments. To create experiments and datasets, see Datasets and experiments.
Once you have an experiment, attach evaluators and score the results.
You can do this with Arize Skills, with Alyx, in the UI, or in code.

By Arize Skills
Use the arize-evaluator skill to create an eval task against an experiment via the ax CLI. Install the Arize skills plugin in your coding agent if you have not already. Then ask your agent:
“Create an eval task for my v2-prompt-test experiment using my correctness evaluator”
“Trigger an eval run on my latest experiment and wait for results”
“Score my v2-prompt-test experiment with my hallucination evaluator”
The skill resolves dataset and experiment IDs, configures column mappings, and triggers the run using ax tasks create and ax tasks trigger-run.
By Code

LLM evaluators use LLMs as judges to assess the success of your experiment. These evaluators can either use a prebuilt LLM evaluation template or be customized to suit your specific needs. Arize AX supports a large number of LLM evaluators out of the box with LLM Classify: Arize Templates. You can also define custom LLM evaluators. Here's an example of an LLM evaluator that checks for correctness in the model output:
CORRECTNESS_PROMPT_TEMPLATE = """You are given an invention (input) and an inventor (output). Determine whether the inventor correctly corresponds to the invention.

[BEGIN DATA]
[Invention]: {invention}
[Output]: {output}
[END DATA]

Explain your reasoning step by step, then provide a single-word LABEL at the end: either "correct" or "incorrect".

Format:
EXPLANATION: Your reasoning about why the output is correct or incorrect
LABEL: "correct" or "incorrect"
"""
from phoenix.evals import llm_classify, OpenAIModel
from arize.experiments import EvaluationResult
import pandas as pd

def correctness_eval(output, dataset_row):
    # Get the invention from the dataset row
    invention = dataset_row.get("attributes.output.value")
    eval_df = llm_classify(
        dataframe=pd.DataFrame([{"invention": invention, "output": output}]),
        template=CORRECTNESS_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o-mini", api_key="your-openai-api-key"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    # Map the eval results to an EvaluationResult
    label = eval_df["label"][0]
    score = 1 if label == "correct" else 0
    explanation = eval_df["explanation"][0]
    return EvaluationResult(label=label, score=score, explanation=explanation)
In this example, correctness_eval evaluates whether the output of an experiment is correct. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation. Once you define your evaluator, you can attach it to your experiment run.
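For example, a minimal sketch with run_experiment might look like the following. Here client is your Arize datasets client and my_task is the task function you defined when creating the experiment; the IDs and experiment name are placeholders, and the exact run_experiment parameters can vary by SDK version.

# client and my_task come from your experiment setup; the IDs below are placeholders.
experiment = client.run_experiment(
    space_id="your-space-id",
    dataset_id="your-dataset-id",
    task=my_task,
    experiment_name="correctness-eval-run",
    evaluators=[correctness_eval],  # attach the evaluator defined above
)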
Code evaluators are functions designed to assess the outputs of your experiments. They allow you to define specific criteria for success, which can be as simple or complex as your application requires. Code evaluators are especially useful when you need to apply tailored logic or rules to validate the output of your model.
Creating a custom code evaluator is as simple as writing a Python function. By default, this function will take the output of an experiment run as its single argument. Your custom evaluator can return either a boolean or a numeric value, which will then be recorded as the evaluation score.

For example, let's say our experiment is testing a task that should output a numeric value between 1 and 100. We can create a simple evaluator function to check if the output falls within this range:
def in_bounds(output):
    return 1 <= output <= 100
When you pass the in_bounds function to run_experiment, an evaluation is generated automatically for each experiment run, indicating whether the output is within the allowed range. This lets you quickly assess the validity of your experiment's outputs against custom criteria.
You can also leverage our open-source Phoenix pre-built code evaluators. Pre-built evaluators can be passed directly to the evaluators parameter when running experiments.
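As a rough sketch, custom functions and pre-built evaluators can share a single evaluators list. The ContainsAnyKeyword import below comes from the Phoenix experiments package; double-check the import path and available evaluators for your Phoenix version, and treat the keyword values as placeholders.

from phoenix.experiments.evaluators import ContainsAnyKeyword

# Mix the custom in_bounds function with a pre-built keyword check, then pass
# the list as the evaluators argument of run_experiment, as in the earlier sketch.
evaluators = [
    in_bounds,
    ContainsAnyKeyword(keywords=["confirmed", "approved"]),
]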
Use dry_run=True to test without logging results. Use concurrency=10 to speed up large runs. Start with synchronous evaluators when debugging, then switch to async for speed.
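These options are passed as keyword arguments when you run the experiment. A compact sketch, reusing the placeholder client and my_task from the earlier examples (exact parameter names and support can vary by SDK version):

experiment = client.run_experiment(
    space_id="your-space-id",
    dataset_id="your-dataset-id",
    task=my_task,
    evaluators=[correctness_eval],
    experiment_name="debug-run",
    dry_run=True,     # score the run without logging results
    concurrency=10,   # evaluate more examples in parallel on large datasets
)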
Results appear in the experiment table alongside each row. Compare multiple experiment runs side by side, filter by eval label, and drill into individual rows to inspect the task output and evaluator explanation. Open View Eval Trace from an experiment row when you want to see the evaluated trace. To see per-example outputs and eval explanations together, use Compare Experiments and the Table tab.
To catch regressions before they ship, repeat this loop for every change:

Make your change (prompt update, model swap, logic change).
Run your experiment against the golden dataset.
Score the experiment results with your evaluators.
Compare eval scores against the previous experiment run.
If scores regress, fix before merging.
You can integrate this into GitHub Actions or GitLab CI/CD so it runs automatically on every push or pull request. See GitHub Action basics and GitLab CI/CD basics for full guides.
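As a rough sketch of the comparison step in such a pipeline, the helper below is illustrative rather than an Arize API: fetch the mean eval score for the new run and the last accepted run however your SDK exposes them, then exit nonzero on a regression so the CI job fails.

import sys

def check_regression(current_score: float, baseline_score: float, tolerance: float = 0.02) -> None:
    """Fail the CI job if the new experiment's eval score regresses past the tolerance."""
    if current_score < baseline_score - tolerance:
        print(f"Eval regression: {current_score:.3f} vs. baseline {baseline_score:.3f}")
        sys.exit(1)  # a nonzero exit fails the GitHub Actions / GitLab CI job
    print(f"Eval scores OK: {current_score:.3f} vs. baseline {baseline_score:.3f}")

# Example: mean correctness score of the new run vs. the last accepted run.
check_regression(current_score=0.94, baseline_score=0.92)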