Pre-built eval templates cover common quality dimensions, but they can’t capture every application-specific requirement. A generic check doesn’t know that your travel agent should always include a budget breakdown, or that responses without local recommendations are incomplete. Custom LLM-as-a-Judge evaluators let you encode your domain knowledge directly into the evaluation. You write a prompt that describes exactly what “good” looks like for your application, choose the output labels, and configure the judge model. The result is an evaluator that measures what actually matters to your users. This tutorial walks through creating a custom evaluator in the Arize AX UI, from writing the prompt template to running it on your travel agent traces.
Prerequisite: Before starting, run the companion notebook to generate traces from the travel agent. You’ll need traces in your Arize AX project to evaluate.

Prefer to use code? See the SDK guide


Step 1: Create an Evaluation Task

Start by creating a task to define what data you want to evaluate.
  1. Navigate to New Eval Task in the upper right-hand corner
  2. Click LLM-as-a-Judge
  3. Give your task a name (ex: “Travel Plan Completeness”)
  4. Set the Cadence to Run on historical data so we can evaluate our existing traces

Step 2: Create a New Evaluator from Blank

Press Add Evaluator, then select Create New. Instead of choosing a pre-built template, select Create From Blank. This gives you full control over the evaluation prompt, labels, and judge model.

Step 3: Define the Evaluator

Building a custom evaluator involves five configuration steps:
  1. Name the evaluator. Give it a descriptive name. This name appears in the Evaluator Hub and in your eval results, so choose something that clearly communicates what the evaluator measures.
  2. Pick a model. Select the LLM that will serve as the judge: the model that reads each trace and applies your evaluation criteria.
    • Select an AI Provider (ex: OpenAI, Azure OpenAI, Bedrock, etc.) and enter your credentials to configure it.
    • Once the provider is configured, choose a model. Make sure the judge model is different from the model the agent itself uses.
  3. Define a template. The prompt template is the core of your evaluator. It should clearly describe the judge’s role, the criteria for each label (when to use “correct” vs. “incorrect”), and the data it will see, marked with template variables like {{input}} and {{output}}. Here’s the custom template for our travel agent. It encodes specific expectations about essential info, budget, and local experiences that a generic template wouldn’t capture (an illustrative SDK sketch that runs the same template in code follows this list):
    You are an expert evaluator judging whether a travel planner agent's
    response is correct. The agent is a friendly travel planner that must
    combine multiple tools to create a trip plan with: (1) essential info,
    (2) budget breakdown, and (3) local flavor/experiences.
    
    CORRECT - The response:
    - Accurately addresses the user's destination, duration, and stated interests
    - Includes essential travel info (e.g., weather, best time to visit,
      key attractions, etiquette) for the destination
    - Includes a budget or cost breakdown appropriate to the destination
      and trip duration
    - Includes local experiences, cultural highlights, or authentic
      recommendations matching the user's interests
    - Is factually accurate, logically consistent, and helpful for
      planning the trip
    
    INCORRECT - The response contains any of:
    - Factual errors about the destination, costs, or local info
    - Missing essential info when the user asked for a full trip plan
    - Missing or irrelevant budget information for the given destination/duration
    - Missing or generic local experiences that do not match the user's interests
    - Wrong destination, duration, or interests addressed
    - Contradictions, misleading statements, or unhelpful/off-topic content
    
    [BEGIN DATA]
    ************
    [User Input]:
    {{input}}
    
    ************
    [Travel Plan]:
    {{output}}
    ************
    [END DATA]
    
    Focus on factual accuracy and completeness of the trip plan
    (essentials, budget, local flavor). Is the output correct or incorrect?
    
  4. Define labels. Output labels constrain what the judge can return. For this evaluator, define two labels:
    • correct (score: 1) — the response meets all criteria
    • incorrect (score: 0) — the response fails one or more criteria
    Categorical labels are more reliable than numeric scores for most evaluation tasks. They’re easier for the judge to apply consistently and simpler to aggregate across runs.
  5. Enable explanations. Toggle Explanations to “On” to have the judge provide a brief rationale for each label, which helps with debugging and understanding why an example was scored correct or incorrect.
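
If you prefer to define the same judge in code, the sketch below shows one way to run an equivalent evaluation with the Phoenix evals library. It is illustrative rather than a copy of the SDK guide: exact function names and signatures can differ by version, the SDK convention uses single-brace placeholders ({input}, {output}) instead of the UI’s {{...}}, and traces_df stands in for a dataframe of traces you have already exported from your project.

    # Illustrative sketch: running the same custom judge via the SDK.
    # Assumes traces_df is a pandas DataFrame of exported traces (e.g. exported
    # with the Arize export client or downloaded from the UI) that contains the
    # OpenInference columns attributes.input.value / attributes.output.value.
    from phoenix.evals import OpenAIModel, llm_classify

    # Same criteria as the UI template above, abbreviated here; note the
    # SDK convention of single-brace placeholders.
    TRAVEL_PLAN_TEMPLATE = """
    You are an expert evaluator judging whether a travel planner agent's
    response is correct. ... (same CORRECT / INCORRECT criteria as above) ...

    [User Input]:
    {input}

    [Travel Plan]:
    {output}

    Is the output correct or incorrect?
    """

    eval_df = llm_classify(
        dataframe=traces_df.rename(columns={
            "attributes.input.value": "input",
            "attributes.output.value": "output",
        }),
        template=TRAVEL_PLAN_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),   # judge model, distinct from the agent's model
        rails=["correct", "incorrect"],      # the two labels defined above
        provide_explanation=True,            # same effect as toggling Explanations on
    )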

Step 4: Task Configuration

With the evaluator defined, configure how it connects to your trace data.
  1. Set the scope to Trace. Because we are evaluating the agent’s end-to-end response for each request rather than individual spans, trace-level is the right granularity.
  2. Map your trace attributes to the template’s variables. These mappings tell the evaluator which trace fields to pass into the judge model:
    • {{input}} → attributes.input.value (the user’s travel planning query)
    • {{output}} → attributes.output.value (the agent’s trip plan)
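
To make the mapping concrete, here is a rough sketch of what it does: the two trace attributes are substituted into the evaluator template before the prompt is sent to the judge. The field names follow the mapping above; the template is abbreviated and the row values are invented for illustration.

    # Rough sketch of the variable mapping: substitute two trace attributes
    # into the evaluator template. The trace values below are invented.
    TEMPLATE = (
        "...criteria omitted...\n"
        "[User Input]:\n{{input}}\n\n"
        "[Travel Plan]:\n{{output}}\n"
    )

    trace_row = {
        "attributes.input.value": "Plan a 5-day Tokyo trip for a foodie on a mid-range budget.",
        "attributes.output.value": "Day 1: arrive in Tokyo... Budget: ~$1,800... Local picks: a Tsukiji food tour...",
    }

    variable_mapping = {
        "{{input}}": "attributes.input.value",
        "{{output}}": "attributes.output.value",
    }

    judge_prompt = TEMPLATE
    for variable, attribute in variable_mapping.items():
        judge_prompt = judge_prompt.replace(variable, trace_row[attribute])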

Step 5: Run the Evaluation

With everything configured, save the evaluator and run the task. Custom evals use the same execution flow as pre-built evals — results appear alongside your traces once the judge model finishes processing.

Step 6: Inspect Results

Review the evaluation results to understand where your agent succeeds and where it falls short:
  • Filter by label to focus on “incorrect” responses and identify patterns — are most failures missing budgets? Giving generic recommendations?
  • Read explanations to understand the judge’s reasoning for each score
  • Compare with pre-built eval results if you ran a pre-built template earlier — custom evals often surface issues that generic templates miss
These insights directly inform what to improve. If most failures are about missing budget information, you know exactly which part of your agent’s prompt or tool usage needs attention.
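
If you export the eval results, or continue from the SDK sketch in Step 3, the same inspection takes a few lines of pandas. The column names here follow the llm_classify output convention (label, explanation) and may differ depending on how you export, so treat this as a sketch.

    # Continuing the SDK sketch: look at the failures and the judge's reasoning.
    failures = eval_df[eval_df["label"] == "incorrect"]
    print(f"{len(failures)} of {len(eval_df)} travel plans judged incorrect")

    # Scan a few explanations to spot patterns (missing budgets, generic tips, ...)
    for _, row in failures.head(5).iterrows():
        print("-", row["explanation"])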

Reusing Evaluators in the Evaluator Hub

The evaluators you create are saved to the Evaluator Hub, making them reusable across tasks and projects. You can find all your evaluators — both pre-built and custom — by navigating to the Evaluators section from the left sidebar. From the Evaluator Hub you can:
  • Attach existing evaluators to new evaluation tasks without recreating them
  • Version your evaluators — track changes over time with commit messages so you know what changed and why
  • Share across projects — apply the same quality criteria to different applications

What’s Next

You’ve now created evaluators that measure application-specific quality. For criteria that are deterministic and don’t require LLM judgment — like checking for keywords, validating JSON, or matching regex patterns — code evaluators are faster and more consistent. The next guide covers setting up code evals in the Arize AX UI.
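
As a preview, a code eval can be as small as a single deterministic function. The sketch below is illustrative only (the function name and the regex rule are ours, not from the guide): it labels a trip plan “correct” if it mentions any cost figures, with no LLM judge involved.

    # Illustrative deterministic check: does the trip plan mention a budget?
    import re

    def has_budget_breakdown(output: str) -> str:
        """Return 'correct' if the plan contains cost figures or the word 'budget'."""
        return "correct" if re.search(r"[$€£¥]\s?\d|budget", output, re.IGNORECASE) else "incorrect"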

Run Code Evals on Your Traces