
# Harness as a Judge

> Run evaluations with a Claude Code agent in a sandbox that reads your traces at run time. No column mapping is required, and it is better suited to nuanced or multi-step criteria.

<Info>
  Harness as a Judge is available in closed Enterprise beta. Contact your Arize account team for access.
</Info>

**Harness as a Judge** runs evaluations inside a **Claude Code sandbox**, not as a single LLM API call. You describe what to score in plain language; the agent pulls span and trace data from your project at run time, applies your criteria, and writes results back as eval columns, just as [LLM-as-a-judge](/ax/evaluate/create-evaluators#llm-as-a-judge) and [code evaluators](/ax/evaluate/evaluators/code-evaluations) do.

Use it when a fixed prompt and column mapping are too rigid for the judgment you need.

## Why use it

|                   | **LLM-as-a-judge**                                           | **Harness as a Judge**                                                                             |
| ----------------- | ------------------------------------------------------------ | -------------------------------------------------------------------------------------------------- |
| **How it scores** | One judge prompt per span/trace; variables mapped to columns | Agent in a sandbox explores exported trace data and scores from your instructions                  |
| **Setup**         | Template + column mappings                                   | Scoring instructions only—no column mapping required                                               |
| **Best for**      | High-volume, repeatable checks with stable inputs            | Nuanced criteria, multi-field reasoning, or evals that benefit from reading context across a trace |

**Harness as a Judge** is for subjective or complex quality checks where you want an **agent** to interpret production data—not just fill a template. Examples:

* Relevance or helpfulness when the right answer depends on full trace context
* Agent trajectory quality (tool choice, recovery, multi-step reasoning)
* Custom rubrics that are easier to describe in prose than to wire into `{variable}` mappings

For deterministic rules (JSON shape, regex, keyword checks), use a [code evaluator](/ax/evaluate/evaluators/code-evaluations). For simple, high-throughput evals, use [LLM-as-a-judge](/ax/evaluate/create-evaluators#llm-as-a-judge). Many teams use all three on the same project.

## How it works

**Configure the evaluator** in the [Evaluator Hub](/ax/evaluate/create-evaluators#evaluator-hub) under **Evaluators → Create → Harness Evaluator**. Select the harness (**Claude Code** today), pick an Anthropic model (or **Auto**), then write scoring instructions in plain language. Optional placeholders such as `{attributes.output.value}` are filled from span data. You can either define fixed labels or let the agent decide the labels on each run.

**Attach to an online eval task** on an LLM project and set the date range, query filter, and sampling rate. The flow is the same as [Run online evals on traces](/ax/evaluate/run-evals-on-traces).

**On each run**, the platform starts a sandbox for that harness. The agent reads exported spans for the task window, scores them from your instructions, and publishes `eval.<name>.*` columns on the spans.

**View results** on traces, in dashboards, and in task run history. See [View eval results](/ax/evaluate/evals-overview).

The agent gets read access to traces on the bound project automatically. Add the optional [Arize skill](/ax/agents/skills-and-permissions) only if you need broader API access in the sandbox.
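The placeholder step described above can be pictured as a lookup into span data. The sketch below is an illustration only, not the platform's implementation: `fill_placeholders` and the flat `span` dictionary are hypothetical, and it assumes a placeholder's dotted path matches a flattened span column name. Placeholders with no matching column are left untouched so the gap stays visible.

```python
import re

# Hypothetical helper: illustrates how a placeholder such as
# {attributes.output.value} could be resolved against span data.
# Assumption: the span is a flat dict keyed by dotted column names.
PLACEHOLDER = re.compile(r"\{([A-Za-z0-9_.]+)\}")


def fill_placeholders(instructions: str, span: dict) -> str:
    """Replace each {dotted.path} with the matching span value, if present."""

    def resolve(match: re.Match) -> str:
        key = match.group(1)
        # Leave unknown placeholders as-is rather than guessing a value.
        return str(span[key]) if key in span else match.group(0)

    return PLACEHOLDER.sub(resolve, instructions)


# Example span data; the values are made up for illustration.
span = {
    "attributes.input.value": "How do I reset my password?",
    "attributes.output.value": "Go to Settings, then Security.",
}
instructions = (
    "Score whether this response is relevant to the user input: "
    "{attributes.output.value}"
)
filled = fill_placeholders(instructions, span)
print(filled)
```

Because the agent also reads the full trace at run time, placeholders are a convenience for pointing at specific fields rather than a required mapping step.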

## Create a harness evaluator

<Steps>
  <Step title="Open Evaluator Hub">
    Go to **Evaluators** in the space sidebar, then **Create** and choose **Harness Evaluator**.
  </Step>

  <Step title="Select harness">
    Choose **Claude Code** as the evaluation harness (additional harnesses are coming soon).
  </Step>

  <Step title="Select model">
    Pick an [Anthropic AI integration](/ax/security-and-settings/integrations-playground/overview) and model, or **Auto**.
  </Step>

  <Step title="Write scoring instructions">
    Describe what good and bad look like—for example, whether the assistant's response is relevant to the user input given the full trace.

    The agent reads traces at run time; you do not map template variables to columns upfront.
  </Step>

  <Step title="Configure labels (optional)">
    Leave **Let agent decide labels** on for open-ended rubrics, or turn it off to define fixed labels and scores (for example `relevant` / `irrelevant`).
  </Step>

  <Step title="Save to Evaluator Hub">
    Save the evaluator. It is versioned like LLM and code evaluators, so you can reuse it across tasks.
  </Step>
</Steps>
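When **Let agent decide labels** is off, each fixed label is paired with a score. The sketch below shows one way to think about such a rubric; it is a hypothetical illustration, not the product's configuration format. The label names come from the example in the steps above, while the score values and the `score_for` helper are assumptions.

```python
# Hypothetical fixed-label rubric. The scores (1.0 / 0.0) are assumed
# for illustration; choose whatever scale suits your dashboards.
FIXED_LABELS = {"relevant": 1.0, "irrelevant": 0.0}


def score_for(label: str, rubric: dict = FIXED_LABELS) -> float:
    """Return the score for a label, rejecting anything outside the rubric."""
    normalized = label.strip().lower()
    if normalized not in rubric:
        raise ValueError(f"Label {label!r} is not one of {sorted(rubric)}")
    return rubric[normalized]


print(score_for("relevant"))
print(score_for(" Irrelevant"))
```

A closed label set keeps results easy to aggregate across runs; open-ended labels trade that consistency for flexibility.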

## Run on production traces

After saving the evaluator, create or edit an **online eval task** on your LLM project and add the harness evaluator. Set project, time range, and sampling—the same controls as template eval tasks.

Each task run provisions a sandbox, exports the spans in that window (up to the task's span limit), and runs Claude Code with your scoring instructions. To stop a run and tear down its sandbox, cancel it from task history.

Results appear as eval attributes on spans. Filter and monitor them in [Tracing](/ax/observe/tracing/view-and-manage-traces) and [Dashboards](/ax/observe/dashboards).
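If you export scored spans for offline analysis, the `eval.<name>.*` columns can be filtered like any other attribute. The sketch below is a minimal illustration under stated assumptions: the `label` and `score` field names, the evaluator name `relevance`, and the span dictionaries are all hypothetical, since this page only specifies the `eval.<name>.*` pattern. Check the actual column names on your project before relying on them.

```python
# Assumption: eval results land in columns named eval.<name>.<field>.
# The field names used here ("label", "score") are illustrative only.
def eval_column(name: str, field: str) -> str:
    """Build a column name following the eval.<name>.* pattern."""
    return f"eval.{name}.{field}"


def spans_with_label(spans: list, name: str, label: str) -> list:
    """Return spans whose eval label column equals the given label."""
    column = eval_column(name, "label")
    return [s for s in spans if s.get(column) == label]


# Hypothetical exported spans; "relevance" is an example evaluator name.
spans = [
    {"span_id": "a1", "eval.relevance.label": "relevant", "eval.relevance.score": 1.0},
    {"span_id": "b2", "eval.relevance.label": "irrelevant", "eval.relevance.score": 0.0},
    {"span_id": "c3"},  # not sampled by the task, so no eval columns
]
flagged = spans_with_label(spans, "relevance", "irrelevant")
print([s["span_id"] for s in flagged])
```

Spans outside the task's sampling or filter have no eval columns, so the lookup uses `dict.get` and skips them instead of raising.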
