Prerequisite: Before starting, run the companion notebook to generate traces from the travel agent. You’ll need traces in your Arize AX project to evaluate.
Prefer to use code? See the SDK guide
What Are Pre-Built Eval Templates?
An LLM-as-a-Judge evaluation uses one LLM to assess the output of another. It combines three things:
- A judge model — the LLM that produces the judgment
- A prompt template — the criteria the judge applies
- Your data — the traces being evaluated
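If you prefer to see those three pieces as code, here is a minimal sketch using the open-source Phoenix evals library as one possible SDK-level analogue. The judge model name and the sample row are placeholders, and the snippet assumes an OPENAI_API_KEY in your environment.

```python
import pandas as pd
from phoenix.evals import OpenAIModel, TOXICITY_PROMPT_TEMPLATE

judge_model = OpenAIModel(model="gpt-4o-mini")  # the judge model that produces the judgment
rubric = TOXICITY_PROMPT_TEMPLATE               # the prompt template: criteria the judge applies
data = pd.DataFrame(                            # your data: the traces being evaluated
    {"input": ["Day 1: arrive in Lisbon, check in, evening food tour..."]}
)
```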
Step 1: Navigate to Your Project
Open Arize AX and navigate to the LLM Tracing project that contains your travel agent traces. You should see spans from the agent’s execution, including tool calls and LLM completions.
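If you also want those traces in a notebook (for example, to sanity-check what the task will evaluate), a rough sketch with the Arize export client looks like the following. The space ID and project name are placeholders, and the client assumes an ARIZE_API_KEY in your environment.

```python
from datetime import datetime, timedelta

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

client = ArizeExportClient()  # picks up ARIZE_API_KEY from the environment
spans_df = client.export_model_to_df(
    space_id="YOUR_SPACE_ID",     # placeholder
    model_id="travel-agent",      # placeholder: your LLM Tracing project name
    environment=Environments.TRACING,
    start_time=datetime.now() - timedelta(days=7),
    end_time=datetime.now(),
)
print(spans_df.columns)  # flattened span attributes, e.g. attributes.output.value
```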
Step 2: Create an Evaluation Task
Evaluation tasks define what data to evaluate and which evaluators to run. To create one:
- Navigate to New Eval Task in the upper right-hand corner
- Click LLM-as-a-Judge
- Give your task a name (ex: “Travel Agent Performance”)
- Set the Cadence to Run on historical data so we can evaluate our existing traces
Step 3: Select a Pre-Built Evaluator
Next, click Add Evaluator. When adding an evaluator to your task, select Create New and browse the available pre-built templates. For this tutorial, choose Toxicity — this evaluator examines the agent’s output and judges whether it contains toxic or harmful content. Each pre-built template comes with:
- A tested prompt rubric that defines the evaluation criteria
- Predefined output labels (ex: “correct” / “incorrect”, “toxic”/“non-toxic”, etc.)
- Default column mappings that you can customize
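To get a feel for what such a template contains, you can inspect the roughly equivalent pre-built assets that ship with the Phoenix evals library (illustrative only; Arize AX manages its own copies of these templates in the UI):

```python
from phoenix.evals import TOXICITY_PROMPT_RAILS_MAP, TOXICITY_PROMPT_TEMPLATE

print(TOXICITY_PROMPT_TEMPLATE)                  # the tested prompt rubric
print(list(TOXICITY_PROMPT_RAILS_MAP.values()))  # the predefined labels: ["toxic", "non-toxic"]
```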
Step 4: Configure the LLM-as-a-Judge Model
Next, configure the LLM that will act as the judge. This is the model that reads each trace’s input and output and applies the evaluation rubric.
- Select an AI Provider (ex: OpenAI, Azure OpenAI, Bedrock, etc.) and enter your credentials for configuration
- Once an AI Provider is configured, choose a model (be sure the model you choose is different from the one used for the agent)
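In SDK terms, the equivalent judge configuration is a single wrapper object; this sketch assumes an OpenAI key in your environment, and the model name is only an example:

```python
from phoenix.evals import OpenAIModel

# Assumes OPENAI_API_KEY is set; Azure OpenAI, Bedrock, and other providers
# have analogous wrappers. Pick a judge model different from the agent's own LLM.
judge_model = OpenAIModel(model="gpt-4o")
```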
Step 5: Task Configuration
- First, define the scope at which the evaluator will operate. Because we are assessing the agent’s overall performance for a single call, select “Trace” as the scope.
- Next, map your trace attributes to the template’s variables. These mappings tell the evaluator which trace fields to pass into the judge model. For the Toxicity evaluator, map the agent’s output to the template variable that receives the content to be evaluated (the sketch after this list shows the same mapping in DataFrame terms):
  - Template variable for agent output ← attributes.output.value (the agent’s trip plan)
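In code, that mapping is just a column copy: put the exported span attribute into the column name the template expects. Here, "input" is what the Phoenix toxicity template reads, and spans_df comes from the export sketch in Step 1.

```python
eval_df = spans_df.copy()
eval_df["input"] = eval_df["attributes.output.value"]  # the agent's trip plan becomes the judged content
```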
Step 6: Run the Evaluation
With everything configured, you are ready to save the evaluator and run the task!
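The code-level analogue of running the task is a single llm_classify call over the mapped data, reusing the judge and template from the sketches above (the AX task runner handles this for you when you save the task in the UI):

```python
from phoenix.evals import TOXICITY_PROMPT_RAILS_MAP, TOXICITY_PROMPT_TEMPLATE, llm_classify

results = llm_classify(
    dataframe=eval_df,                               # the mapped data from Step 5
    model=judge_model,                               # the judge configured in Step 4
    template=TOXICITY_PROMPT_TEMPLATE,               # the pre-built rubric from Step 3
    rails=list(TOXICITY_PROMPT_RAILS_MAP.values()),  # constrain output to the predefined labels
    provide_explanation=True,                        # have the judge explain each label
)
```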
Step 7: Inspect Results
Once the evaluation completes, results appear alongside your traces. Each evaluated span now has an evaluation score and label attached to it. From the results view, you can:
- Filter spans by evaluation label to focus on failures — for example, show only spans labeled “toxic” to see exactly where the travel agent produced problematic output
- Read the judge’s explanation for each evaluation to understand why a specific output was scored the way it was
- View aggregate metrics to see overall performance at a glance — for example, what percentage of responses were labeled “non-toxic”?
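If you ran the SDK sketch instead, the equivalent inspection is plain pandas over the results frame from the previous snippet:

```python
failures = results[results["label"] == "toxic"]   # filter to failures
print(failures[["label", "explanation"]].head())  # read the judge's reasoning for each one
print(f'{(results["label"] == "non-toxic").mean():.0%} of responses were non-toxic')  # aggregate view
```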
Reusing Evaluators in the Evaluator Hub
The evaluators you create are saved to the Evaluator Hub, making them reusable across tasks and projects. You can find all your evaluators — both pre-built and custom — by navigating to the Evaluators section from the left sidebar. From the Evaluator Hub you can:
- Attach existing evaluators to new evaluation tasks without recreating them
- Version your evaluators — track changes over time with commit messages so you know what changed and why
- Share across projects — apply the same quality criteria to different applications