You can also run an experiment by creating an evaluator that inherits from the Evaluator(ABC) base class in the Arize Python SDK. The evaluator takes a single dataset row as input and returns an EvaluationResult dataclass. This is an alternative for teams that prefer object-oriented programming over functional programming.
Here’s an example of an LLM evaluator that checks for hallucinations in the model output. It uses the Phoenix Evals package, which is designed for running evaluations in code:
```python
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    llm_classify,
    OpenAIModel,
)
from arize.experiments import EvaluationResult, Evaluator
import pandas as pd


class HallucinationEvaluator(Evaluator):
    def evaluate(self, output, dataset_row, **kwargs) -> EvaluationResult:
        print("Evaluating outputs")
        expected_output = dataset_row["attributes.llm.output_messages"]

        # Create a DataFrame with the actual and expected outputs
        df_in = pd.DataFrame(
            {"selected_output": output, "expected_output": expected_output}, index=[0]
        )
        # Run the LLM classification
        # OPENAI_API_KEY is assumed to be defined earlier (e.g., loaded from your environment)
        expect_df = llm_classify(
            dataframe=df_in,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=OpenAIModel(model="gpt-4o-mini", api_key=OPENAI_API_KEY),
            rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
            provide_explanation=True,
        )
        label = expect_df["label"][0]
        score = 1 if label == "factual" else 0
        explanation = expect_df["explanation"][0]
        return EvaluationResult(score=score, label=label, explanation=explanation)
```
In this example, the HallucinationEvaluator class evaluates whether the output of an experiment contains hallucinations by comparing it to the expected output using an LLM. The llm_classify function runs the eval, and the evaluator returns an EvaluationResult that includes a score, label, and explanation.
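To use this evaluator in an experiment, pass an instance of it in the experiment's evaluators list. The sketch below assumes you already have a configured Arize datasets client (`client`), a `SPACE_ID`, a `dataset_id`, and an application function (`my_llm_app` is a hypothetical stand-in); exact parameter names can vary by SDK version, so check the SDK reference for your version:

```python
# Minimal sketch: `client`, `SPACE_ID`, and `dataset_id` are assumed to exist already,
# and my_llm_app is a hypothetical function representing your LLM application.
def task(dataset_row) -> str:
    # Produce the output that the evaluator will score.
    return my_llm_app(dataset_row["attributes.input.value"])

experiment = client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=task,
    evaluators=[HallucinationEvaluator()],  # class-based evaluator defined above
    experiment_name="hallucination-eval",
)
```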
Arize AX supports running multiple evals on a single experiment, allowing you to assess your model's performance from several angles at once. When you provide multiple evaluators, Arize creates evaluation runs for every combination of experiment runs and evaluators.
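For example, you could score every run with both the hallucination evaluator and a second, hypothetical QACorrectnessEvaluator defined the same way:

```python
# Illustrative only: QACorrectnessEvaluator is a hypothetical second evaluator
# built like HallucinationEvaluator above; both are applied to every experiment run.
experiment = client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=task,
    evaluators=[HallucinationEvaluator(), QACorrectnessEvaluator()],
    experiment_name="multi-eval-experiment",
)
```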