Prerequisite: Before starting, run the companion notebook to generate traces from the travel agent. You’ll need traces in your Arize AX project to evaluate.
Step 1: Create an Evaluation Task
Evaluation tasks define what data to evaluate and which evaluators to run. To create one:
- Navigate to Eval Tasks in the upper right-hand corner and select Add Eval Task
- Choose Code Evaluator
- Give your task a name (ex: “Tool Input Validation”)
- Set the Cadence to Run on historical data so we can evaluate our existing traces
- In this tutorial, we are using a code eval to validate tool call inputs, so we only want to run this evaluator against tool spans. Add a Task Filter to scope the evaluation by setting `attributes.openinference.span.kind = TOOL` (the sketch below shows how tool spans pick up this attribute).
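If you are curious where that attribute comes from, here is a minimal sketch of a manually instrumented tool call. It assumes OpenTelemetry plus OpenInference-style attribute names like those used by the companion notebook's instrumentation; auto-instrumentation normally sets these for you, and `search_flights` is a hypothetical tool.

```python
# Minimal sketch (assumed attribute names): how a tool span ends up tagged so
# the filter attributes.openinference.span.kind = TOOL matches it.
from opentelemetry import trace

tracer = trace.get_tracer("travel-agent")

def search_flights(tool_args_json: str) -> str:
    """Hypothetical tool; auto-instrumentation would normally create this span."""
    with tracer.start_as_current_span("search_flights") as span:
        # Arize's filter syntax prefixes span attributes with "attributes."
        span.set_attribute("openinference.span.kind", "TOOL")
        span.set_attribute("input.value", tool_args_json)
        # ... call the real flight-search logic here ...
        return "[]"
```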
Step 2: Choose a Code Evaluator
To add an evaluator to your task, select Add Evaluator → Create New and browse the available pre-built code evaluators. Arize AX offers several managed code evaluators for common checks:

| Evaluator | What It Checks | Parameters |
|---|---|---|
| Matches Regex | Whether text matches a specified regex pattern | span attribute, pattern |
| JSON Parseable | Whether the output is valid JSON | span attribute |
| Contains Any Keyword | Whether any specified keywords appear | span attribute, keywords |
| Contains All Keywords | Whether all specified keywords appear | span attribute, keywords |
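To build intuition for what these checks do, here is a rough local equivalent of each one in Python. This is illustrative only, not Arize's implementation; the managed evaluators run these kinds of checks against the span attribute you configure.

```python
# Rough local equivalents of the managed code evaluators (illustrative only).
# Each takes the text of one span attribute and returns pass/fail.
import json
import re

def matches_regex(value: str, pattern: str) -> bool:
    return re.search(pattern, value) is not None

def json_parseable(value: str) -> bool:
    try:
        json.loads(value)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def contains_any_keyword(value: str, keywords: list[str]) -> bool:
    return any(k.lower() in value.lower() for k in keywords)

def contains_all_keywords(value: str, keywords: list[str]) -> bool:
    return all(k.lower() in value.lower() for k in keywords)
```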
Configure the Evaluator
- For this tutorial, select JSON Parseable — this evaluator checks whether the input to each tool call is valid JSON, ensuring that the agent is passing properly formatted arguments to its tools.
- Give the evaluator an Eval Column Name (e.g. `tool_input_json_valid`).
- Set the scope to Span because we are evaluating specific tool spans.
- Set the span attribute to `attributes.input.value` (the input passed to the tool call).
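For a concrete sense of what the evaluator sees, here are hypothetical `attributes.input.value` payloads and the verdict a JSON-parseability check would give each one (the label names are illustrative, not Arize's exact output).

```python
# Hypothetical tool-call inputs and the verdict a JSON check would assign.
import json

sample_tool_inputs = [
    '{"destination": "Lisbon", "budget": 1500}',  # valid JSON
    "destination=Lisbon&budget=1500",             # query string, not JSON
    '{"destination": "Lisbon", "budget": }',      # truncated / malformed
]

for value in sample_tool_inputs:
    try:
        json.loads(value)
        label = "pass"
    except json.JSONDecodeError:
        label = "fail"
    print(f"{label}: {value}")
```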
Step 3: Run and View Results
With your code evaluator configured, save it and run the task. Code evals execute quickly, even on large datasets. Once the results are in:
- Filter by label to find tool spans that failed the JSON check — which tool calls received malformed inputs?
- Combine with LLM eval results for a complete quality picture. Code evals catch structural issues while LLM evals assess content quality.
- Use aggregate metrics to track compliance rates over time.
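If you also export eval results for offline analysis, an aggregate compliance rate is a short calculation over the labels. A minimal sketch, assuming a DataFrame with a `start_time` column and a `tool_input_json_valid.label` column; the exact exported column names may differ.

```python
# Sketch: daily compliance rate from exported eval labels.
# Column names are assumptions about the exported schema.
import pandas as pd

def daily_compliance_rate(df: pd.DataFrame) -> pd.Series:
    passed = df["tool_input_json_valid.label"].eq("pass")
    dates = pd.to_datetime(df["start_time"]).dt.date
    return passed.groupby(dates).mean().rename("json_valid_rate")
```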
Combining Code and LLM Evaluators
The most effective evaluation setups use both code and LLM evaluators together. In a single project, you can attach multiple eval tasks of different types. For the travel agent, a practical setup might include:
- Code eval: “Does the response mention budget-related terms?” (fast, deterministic)
- Code eval: “Does the response cover all three required sections?” (structural check)
- LLM eval: “Is the response actionable and helpful?” (subjective quality)
- LLM eval: “Is the information factually correct?” (content accuracy)
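The two code checks above map directly onto the Contains Any Keyword and Contains All Keywords evaluators. Here is a minimal sketch of that deterministic half; the keyword list and section names are assumptions about the travel agent's output format, not values from this tutorial.

```python
# Sketch of the deterministic checks. BUDGET_KEYWORDS and REQUIRED_SECTIONS
# are assumed values, not part of the tutorial.
BUDGET_KEYWORDS = ["budget", "cost", "price", "$"]
REQUIRED_SECTIONS = ["flights", "hotels", "itinerary"]

def mentions_budget(response: str) -> bool:
    # Contains Any Keyword: passes if any budget-related term appears.
    return any(k in response.lower() for k in BUDGET_KEYWORDS)

def covers_required_sections(response: str) -> bool:
    # Contains All Keywords: passes only if every section is present.
    return all(s in response.lower() for s in REQUIRED_SECTIONS)
```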