You may have evaluators that run on large datasets or rely on additional external data sources. To help manage resources and control costs, Arize gives you the flexibility to decide when and how your evals are run and tracked. With these self-managed evals, you stay in control of execution, data, and evaluator configuration.
First, export your traces from Arize. Visit the LLM Tracing tab to view your traces and export them in code: click the export button and choose Export to Notebook to get boilerplate code you can copy and paste into your evaluator.
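The exported boilerplate uses the `ArizeExportClient` to pull your traces into a pandas DataFrame. A minimal sketch is below; the space ID, model ID, and time range are placeholders you would replace with the values from your own Export to Notebook snippet:

```python
import os
from datetime import datetime, timedelta

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

# Placeholder credentials and IDs -- replace these with the values
# from your own "Export to Notebook" snippet.
client = ArizeExportClient(api_key=os.environ["ARIZE_API_KEY"])

primary_df = client.export_model_to_df(
    space_id="YOUR_SPACE_ID",
    model_id="YOUR_MODEL_ID",
    environment=Environments.TRACING,
    start_time=datetime.now() - timedelta(days=7),
    end_time=datetime.now(),
)
```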
We will run through a sample LLM-as-a-judge eval. First, define an evaluation template:
```python
MY_SAMPLE_TEMPLATE = '''
You are evaluating the positivity or negativity of the responses to questions.

[BEGIN DATA]
************
[Question]: {input}
************
[Response]: {output}
[END DATA]

Please focus on the tone of the response.
Your answer must be a single word, either "positive" or "negative".
'''
```
Check which attributes are present in your traces DataFrame:
```python
primary_df.columns
```
If you’re using OpenAI traces, set the input/output variables like this:
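The exact column names depend on your instrumentation; for OpenInference-instrumented OpenAI traces, the input and output typically live in `attributes.input.value` and `attributes.output.value`. A sketch of the mapping, assuming those column names (verify them against the `primary_df.columns` output above):

```python
# Map the span attribute columns onto the {input} and {output}
# variables referenced in MY_SAMPLE_TEMPLATE. The column names here
# are assumptions -- confirm them against primary_df.columns.
primary_df["input"] = primary_df["attributes.input.value"]
primary_df["output"] = primary_df["attributes.output.value"]
```

With the variables in place, you can run the eval in your own environment. One option is the Phoenix evals library's `llm_classify`, which applies the template to each row of the DataFrame; this sketch assumes an OpenAI judge model, and the model choice and rails are illustrative rather than prescribed:

```python
from phoenix.evals import OpenAIModel, llm_classify

# Run the LLM-as-a-judge eval over the exported traces, restricting
# answers to the two labels the template allows.
eval_results = llm_classify(
    dataframe=primary_df,
    template=MY_SAMPLE_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=["positive", "negative"],
    provide_explanation=True,
)
```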