Skip to main content
You’ve instrumented your application and traces are flowing into Arize AX. Now you need to answer the question: is my application actually working well? Pre-built eval templates give you a fast way to measure quality without writing any evaluation logic. Arize AX includes templates for common quality dimensions like correctness, relevance, toxicity, and more. You select a template, point it at your traces, and get structured quality signals in minutes. This tutorial walks through running a pre-built eval on traces from a travel planning agent. By the end, you’ll have evaluation scores attached to your traces that you can inspect, filter, and use to identify areas for improvement.
Prerequisite: Before starting, run the companion notebook to generate traces from the travel agent. You’ll need traces in your Arize AX project to evaluate.

Prefer to use code? See the SDK guide


What Are Pre-Built Eval Templates?

An LLM-as-a-Judge evaluation uses one LLM to assess the output of another. It combines three things:
  1. A judge model — the LLM that produces the judgment
  2. A prompt template — the criteria the judge applies
  3. Your data — the traces being evaluated
Pre-built templates handle the prompt template for you. They encode a well-tested rubric for a specific quality dimension—like whether an answer is factually correct or whether a response is relevant to the question asked. You choose the template, configure the judge model, and Arize AX handles the rest. These templates are a good starting point when you want reliable signal quickly, especially early in development or when establishing a performance baseline.

Step 1: Navigate to Your Project

Open Arize AX and navigate to the LLM Tracing project that contains your travel agent traces. You should see spans from the agent’s execution, including tool calls and LLM completions.

Step 2: Create an Evaluation Task

Evaluation tasks define what data to evaluate and which evaluators to run. To create one:
  1. Navigate to New Eval Task in the upper right-hand corner
  2. Click LLM-as-a-Judge
  3. Give your task a name (ex: “Travel Agent Performance”)
  4. Set the Cadence to Run on historical data so we can evaluate our existing traces

Step 3: Select a Pre-Built Evaluator

Next, we are ready to press Add Evaluator. When adding an evaluator to your task, select Create New and browse the available pre-built templates. For this tutorial, choose Toxicity — this evaluator looks at the agent’s output and judges it for any toxic or harmful content. Each pre-built template comes with:
  • A tested prompt rubric that defines the evaluation criteria
  • Predefined output labels (ex: “correct” / “incorrect”, “toxic”/“non-toxic”, etc.)
  • Default column mappings that you can customize

Step 4: Configure the LLM-as-a-Judge Model

Next, configure the LLM that will act as the judge. This is the model that reads each trace’s input and output and applies the evaluation rubric.
  1. Select an AI Provider (ex: OpenAI, Azure OpenAI, Bedrock, etc) & enter your credentials for configuration.
  2. Once an AI Provider is configured, choose a model (be sure that the model chosen is different than what we used for the agent).

Step 5: Task Configuration

  1. First, we need to define the scope at which the evaluator will operate. Because we are looking at the agent’s overall performance for one call, we will select “Trace” for our scope.
  2. Next, map your trace attributes to the template’s variables. These mappings tell the evaluator which trace fields to pass into the judge model. For the Toxicity evaluator, map the agent’s output to the template variable that receives the content to be evaluated: Template variable for agent outputattributes.output.value (the agent’s trip plan)

Step 6: Run the Evaluation

With everything configured, you are ready to save the evaluator and run the task!

Step 7: Inspect Results

Once the evaluation completes, results appear alongside your traces. Each evaluated span now has an evaluation score and label attached to it. From the results view, you can:
  • Filter spans by evaluation label to focus on failures — for example, show only spans labeled “incorrect” to understand where the travel agent went wrong
  • Read the judge’s explanation for each evaluation to understand why a specific output was scored the way it was
  • View aggregate metrics to see overall performance at a glance — what percentage of responses were correct?
With pre-built evals running on your traces, you now have structured quality signals that go beyond manual inspection. You can track correctness over time, identify failure patterns, and use these signals to guide improvements.

Reusing Evaluators in the Evaluator Hub

The evaluators you create are saved to the Evaluator Hub, making them reusable across tasks and projects. You can find all your evaluators — both pre-built and custom — by navigating to the Evaluators section from the left sidebar. From the Evaluator Hub you can:
  • Attach existing evaluators to new evaluation tasks without recreating them
  • Version your evaluators — track changes over time with commit messages so you know what changed and why
  • Share across projects — apply the same quality criteria to different applications

What’s Next

Pre-built templates are a strong starting point, but your application likely has specific quality criteria that generic templates can’t capture. In the next guide, you’ll learn how to create a custom LLM-as-a-Judge evaluator with your own prompt and criteria tailored to your use case.

Create a Custom LLM-as-a-Judge Eval