Evaluate experiment via UI
To run your first evaluation on experiments:
- Navigate to Evaluators on your experiment page and find Add Evaluator
- Define your Evaluator
- Choose the experiments you want to evaluate from the dropdown menu.
- View your experiment results
Need help writing a custom evaluator template? Use ✨Alyx to write one for you ✨
Evaluate experiment via Code
The simplest version of an evaluation function takes one or more of the arguments below and returns a score; a minimal sketch is shown after the parameter table.
Evaluation Inputs
The evaluator function can take the following optional arguments:

| Parameter name | Description | Example |
|---|---|---|
| dataset_row | the entire row of the data, including every column as dictionary key | def eval(dataset_row): … |
| input | experiment run input, which is mapped to attributes.input.value | def eval(input): … |
| output | experiment run output | def eval(output): … |
| dataset_output | the expected output if available, mapped to attributes.output.value | def eval(dataset_output): … |
| metadata | dataset_row metadata, which is mapped to attributes.metadata | def eval(metadata): … |
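For example, here is a minimal sketch of evaluator functions using these arguments; the function names and the expected_answer column are illustrative, not part of the SDK.

```python
# A minimal sketch of evaluator functions. The parameter names below are the
# optional arguments from the table above and are matched by name.
def non_empty(output):
    # Simplest form: score the experiment run output on its own.
    return 1.0 if output else 0.0

def contains_expected(dataset_row, output):
    # dataset_row exposes every dataset column as a dictionary key;
    # "expected_answer" is a hypothetical column name used for illustration.
    expected = str(dataset_row.get("expected_answer", ""))
    return expected.lower() in str(output).lower()
```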
Evaluation Outputs
We support several types of evaluation outputs: the label must be a string, the score must range from 0.0 to 1.0, and the explanation must be a string.

| Evaluator Output Type | Example | How it appears in Arize |
|---|---|---|
| boolean | True | label = 'True', score = 1.0 |
| float | 1.0 | score = 1.0 |
| string | "reasonable" | label = 'reasonable' |
| tuple | (1.0, "my explanation notes") | score = 1.0, explanation = 'my explanation notes' |
| tuple | ("True", 1.0, "my explanation") | label = 'True', score = 1.0, explanation = 'my explanation' |
| EvaluationResult | EvaluationResult(score=1.0, label="reasonable") | score = 1.0, label = 'reasonable' |
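For example, a tuple return is a lightweight way to attach an explanation without constructing an EvaluationResult; the function below is an illustrative sketch.

```python
def match_with_notes(output, dataset_output):
    # (score, explanation) tuple -> recorded as score = 1.0/0.0 plus explanation text
    score = 1.0 if output == dataset_output else 0.0
    return (score, f"compared {output!r} to expected {dataset_output!r}")
```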
To return an EvaluationResult object, use the following import:
from arize.experimental.datasets.experiments.types import EvaluationResult
- One of label or score must be supplied (you can't have an evaluation with no result).
- To use other columns of your dataset in your evaluation, accept the dataset_row argument in your evaluator function.
- Pass your evaluator(s) to run_experiment, as in the sketch below.
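Here is a hedged sketch of wiring an evaluator into run_experiment. The client setup, identifiers, and task are placeholders, and the exact ArizeDatasetsClient and run_experiment signatures may differ by SDK version; only the evaluators parameter and the EvaluationResult import are taken from this page.

```python
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.experiments.types import EvaluationResult

# Hypothetical client; the auth parameter name may differ across SDK versions.
client = ArizeDatasetsClient(developer_key="YOUR_DEVELOPER_KEY")

def task(dataset_row):
    # Placeholder for the application logic under test (e.g. an LLM call);
    # "input" is an assumed dataset column used for illustration.
    return str(dataset_row["input"]).upper()

def exact_match(output, dataset_output):
    # Evaluator returning an EvaluationResult; label or score must be supplied.
    matched = output == dataset_output
    return EvaluationResult(
        score=1.0 if matched else 0.0,
        label=str(matched),
        explanation="output compared against the expected dataset output",
    )

client.run_experiment(
    space_id="YOUR_SPACE_ID",        # placeholder identifiers
    dataset_id="YOUR_DATASET_ID",
    task=task,
    evaluators=[exact_match],        # one or more evaluator functions
    experiment_name="my-first-evaluation",
)
```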
Create an LLM Evaluator
LLM evaluators utilize LLMs as judges to assess the success of your experiment. These evaluators can either use a prebuilt LLM evaluation template or be customized to suit your specific needs. Arize supports a large number of LLM evaluators out of the box with LLM Classify: Arize Templates. You can also define custom LLM evaluators. Here's an example of an LLM evaluator that checks for correctness in the model output; a sketch follows below.
Run Evaluation
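Below is a hedged sketch of such an evaluator. The prompt template, judge model, and the plain class with an evaluate(output, dataset_row) method are assumptions about the shape of a class-based evaluator; your SDK version may instead expect you to subclass its Evaluator base class, so check the Arize reference for the exact signature.

```python
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify
from arize.experimental.datasets.experiments.types import EvaluationResult

# Hypothetical correctness template; the rails below must match the labels it asks for.
CORRECTNESS_TEMPLATE = """You are comparing a model answer to a question.
[Question]: {question}
[Answer]: {answer}
Is the answer correct? Respond with a single word: "correct" or "incorrect".
"""

class CorrectnessEvaluator:
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        # Build a one-row dataframe so llm_classify can grade this single run;
        # "question" is an assumed dataset column used for illustration.
        df = pd.DataFrame(
            {"question": [dataset_row.get("question", "")], "answer": [str(output)]}
        )
        eval_df = llm_classify(
            dataframe=df,
            model=OpenAIModel(model="gpt-4o-mini"),  # any supported judge model
            template=CORRECTNESS_TEMPLATE,
            rails=["correct", "incorrect"],
            provide_explanation=True,
        )
        label = eval_df["label"].iloc[0]
        return EvaluationResult(
            score=1.0 if label == "correct" else 0.0,
            label=label,
            explanation=eval_df["explanation"].iloc[0],
        )

correctness_eval = CorrectnessEvaluator()
```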
correctness_eval evaluates whether the output of an experiment is correct. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.
Once you define your evaluator class, you can use it in your experiment run like this:
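For instance, continuing the sketches above (client, task, and identifiers remain placeholders):

```python
# Reuses the client and task placeholders from the earlier sketch, plus the
# correctness_eval instance defined above.
client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id="YOUR_DATASET_ID",
    task=task,
    evaluators=[correctness_eval],   # class-based LLM evaluator instance
    experiment_name="correctness-evaluation",
)
```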
Create a Code Evaluator
Code evaluators are functions designed to assess the outputs of your experiments. They allow you to define specific criteria for success, which can be as simple or complex as your application requires. Code evaluators are especially useful when you need to apply tailored logic or rules to validate the output of your model.
Custom Code Evaluators
Creating a custom code evaluator is as simple as writing a Python function. By default, this function takes the output of an experiment run as its single argument. Your custom evaluator can return either a boolean or a numeric value, which is then recorded as the evaluation score. For example, let's say our experiment is testing a task that should output a numeric value between 1 and 100. We can create a simple evaluator function to check whether the output falls within this range (see the sketch below). Once you pass the in_bounds function to run_experiment, evaluations will automatically be generated for each experiment run, indicating whether the output is within the allowed range. This allows you to quickly assess the validity of your experiment's outputs based on custom criteria.
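A minimal sketch of that evaluator, matching the 1 to 100 range described above:

```python
def in_bounds(output):
    # True when the task output falls inside the expected 1-100 range; the
    # boolean is recorded as the evaluation score (True -> 1.0, False -> 0.0).
    return 1 <= output <= 100
```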
Prebuilt Phoenix Code Evaluators
You can also leverage our open-source Phoenix pre-built code evaluators. Pre-built evaluators in phoenix.experiments.evaluators can be passed directly to the evaluators parameter when running experiments.
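As an illustrative sketch, the snippet below passes a pre-built Phoenix evaluator to run_experiment; the ContainsAnyKeyword class and its keywords argument are assumptions, so consult the Phoenix documentation for the evaluators available in your installed version.

```python
# Assumption: ContainsAnyKeyword is one of the pre-built evaluators in
# phoenix.experiments.evaluators; swap in any evaluator your version provides.
from phoenix.experiments.evaluators import ContainsAnyKeyword

keyword_check = ContainsAnyKeyword(keywords=["refund", "return policy"])

# Reuses the client and task placeholders from the earlier sketches.
client.run_experiment(
    space_id="YOUR_SPACE_ID",
    dataset_id="YOUR_DATASET_ID",
    task=task,
    evaluators=[keyword_check],      # pre-built evaluator passed like any other
    experiment_name="prebuilt-evaluator",
)
```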