Prerequisite: Before starting, run the companion notebook to generate traces from the travel agent. You’ll need traces in your Arize AX project to evaluate.
Step 1: Create an Evaluation Task
Evaluation tasks define what data to evaluate and which evaluators to run. To create one:
- Navigate to Eval Tasks in the upper right-hand corner and select Add Eval Task
- Choose Code Evaluator
- Give your task a name (ex: “Tool Input Validation”)
- Set the Cadence to Run on historical data so we can evaluate our existing traces
- In this tutorial, we are using a code eval to validate tool call inputs, so we only want to run this evaluator against tool spans. Add a Task Filter to scope the evaluation by setting `attributes.openinference.span.kind = TOOL` (the sketch below shows how tool spans pick up this attribute).
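If you are curious where that attribute comes from, here is a minimal sketch of a manually instrumented tool call. It assumes OpenTelemetry plus OpenInference-style attribute names like those used by the companion notebook's instrumentation; auto-instrumentation normally sets these for you, and `search_flights` is a hypothetical tool.

```python
# Minimal sketch (assumed attribute names): how a tool span ends up tagged so
# the filter attributes.openinference.span.kind = TOOL matches it.
from opentelemetry import trace

tracer = trace.get_tracer("travel-agent")

def search_flights(tool_args_json: str) -> str:
    """Hypothetical tool; auto-instrumentation would normally create this span."""
    with tracer.start_as_current_span("search_flights") as span:
        # Arize's filter syntax prefixes span attributes with "attributes."
        span.set_attribute("openinference.span.kind", "TOOL")
        span.set_attribute("input.value", tool_args_json)
        # ... call the real flight-search logic here ...
        return "[]"
```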
Step 2: Choose a Code Evaluator
To add an evaluator to your task, select Add Evaluator → Create New and browse the available pre-built code evaluators. Arize AX offers several managed code evaluators for common checks:

| Evaluator | What It Checks | Parameters |
|---|---|---|
| Matches Regex | Whether text matches a specified regex pattern | span attribute, pattern |
| JSON Parseable | Whether the output is valid JSON | span attribute |
| Contains Any Keyword | Whether any specified keywords appear | span attribute, keywords |
| Contains All Keywords | Whether all specified keywords appear | span attribute, keywords |
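To build intuition for what these checks do, here is a rough local equivalent of each one in Python. This is illustrative only, not Arize's implementation; the managed evaluators run these kinds of checks against the span attribute you configure.

```python
# Rough local equivalents of the managed code evaluators (illustrative only).
# Each takes the text of one span attribute and returns pass/fail.
import json
import re

def matches_regex(value: str, pattern: str) -> bool:
    return re.search(pattern, value) is not None

def json_parseable(value: str) -> bool:
    try:
        json.loads(value)
        return True
    except (json.JSONDecodeError, TypeError):
        return False

def contains_any_keyword(value: str, keywords: list[str]) -> bool:
    return any(k.lower() in value.lower() for k in keywords)

def contains_all_keywords(value: str, keywords: list[str]) -> bool:
    return all(k.lower() in value.lower() for k in keywords)
```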
Configure the Evaluator
- For this tutorial, select JSON Parseable — this evaluator checks whether the input to each tool call is valid JSON, ensuring that the agent is passing properly formatted arguments to its tools.
- Give the evaluator an Eval Column Name (e.g. `tool_input_json_valid`).
- Set the scope to Span because we are evaluating specific tool spans.
- Set the span attribute to `attributes.input.value` (the input passed to the tool call).
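For a concrete sense of what the evaluator sees, here are hypothetical `attributes.input.value` payloads and the verdict a JSON-parseability check would give each one (the label names are illustrative, not Arize's exact output).

```python
# Hypothetical tool-call inputs and the verdict a JSON check would assign.
import json

sample_tool_inputs = [
    '{"destination": "Lisbon", "budget": 1500}',  # valid JSON
    "destination=Lisbon&budget=1500",             # query string, not JSON
    '{"destination": "Lisbon", "budget": }',      # truncated / malformed
]

for value in sample_tool_inputs:
    try:
        json.loads(value)
        label = "pass"
    except json.JSONDecodeError:
        label = "fail"
    print(f"{label}: {value}")
```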
Step 3: Run and View Results
With your code evaluator configured, save it and run the task. Code evals execute quickly, even on large datasets. Once the results are in:
- Filter by label to find tool spans that failed the JSON check — which tool calls received malformed inputs?
- Combine with LLM eval results for a complete quality picture. Code evals catch structural issues while LLM evals assess content quality.
- Use aggregate metrics to track compliance rates over time.
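If you also export eval results for offline analysis, an aggregate compliance rate is a short calculation over the labels. A minimal sketch, assuming a DataFrame with a `start_time` column and a `tool_input_json_valid.label` column; the exact exported column names may differ.

```python
# Sketch: daily compliance rate from exported eval labels.
# Column names are assumptions about the exported schema.
import pandas as pd

def daily_compliance_rate(df: pd.DataFrame) -> pd.Series:
    passed = df["tool_input_json_valid.label"].eq("pass")
    dates = pd.to_datetime(df["start_time"]).dt.date
    return passed.groupby(dates).mean().rename("json_valid_rate")
```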
Combining Code and LLM Evaluators
The most effective evaluation setups use both code and LLM evaluators together. In a single project, you can attach multiple eval tasks of different types. For the travel agent, a practical setup might include:
- Code eval: “Does the response mention budget-related terms?” (fast, deterministic)
- Code eval: “Does the response cover all three required sections?” (structural check)
- LLM eval: “Is the response actionable and helpful?” (subjective quality)
- LLM eval: “Is the information factually correct?” (content accuracy)
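The two code checks above map directly onto the Contains Any Keyword and Contains All Keywords evaluators. Here is a minimal sketch of that deterministic half; the keyword list and section names are assumptions about the travel agent's output format, not values from this tutorial.

```python
# Sketch of the deterministic checks. BUDGET_KEYWORDS and REQUIRED_SECTIONS
# are assumed values, not part of the tutorial.
BUDGET_KEYWORDS = ["budget", "cost", "price", "$"]
REQUIRED_SECTIONS = ["flights", "hotels", "itinerary"]

def mentions_budget(response: str) -> bool:
    # Contains Any Keyword: passes if any budget-related term appears.
    return any(k in response.lower() for k in BUDGET_KEYWORDS)

def covers_required_sections(response: str) -> bool:
    # Contains All Keywords: passes only if every section is present.
    return all(s in response.lower() for s in REQUIRED_SECTIONS)
```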