
Your evals are only as good as your criteria

Automated evals are only as good as what they measure. Before writing criteria, review real interactions and understand how a human would judge them, then build evals that reflect that standard.

[Screenshot: Arize AX Playgrounds view of an align-eval task, with a prompt editor using GPT-3.5 to classify clarity and tone, and an experiment table comparing human annotation labels to Human vs AI align badges showing aligned or not aligned per row, with an average agreement score.]

Start from human labels

Use Human review to define annotation configs, review traces or dataset rows, and build a ground truth set that reflects your rubric. These human labels become the reference you compare eval scores against.
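The shape of that calibration set matters more than the tooling. As a minimal sketch (the columns and rows below are hypothetical, not an Arize AX export format), ground truth is just a table of reviewed examples and human labels that you can sanity-check before using it as a reference:

```python
import pandas as pd

# Hypothetical calibration set built from human review: one row per reviewed
# example, with the annotator's label for the criterion you care about.
ground_truth = pd.DataFrame({
    "example_id":  ["t1", "t2", "t3"],
    "input":       ["How do I reset my password?", "Cancel my order", "Refund status?"],
    "output":      ["Go to Settings > Security.", "Your order was cancelled.", "3-5 business days."],
    "human_label": ["correct", "correct", "incorrect"],
})

# Quick checks before treating these labels as ground truth:
print(ground_truth["human_label"].value_counts())     # label balance
print(ground_truth["example_id"].duplicated().sum())  # duplicate reviews to reconcile
```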

Measure agreement

On a fixed sample of examples (typically 50 to a few hundred, covering edge cases), run your evaluator and compare its labels to your human annotations. Check accuracy, systematic bias, and per-label precision and recall. Follow the workflow below to run this loop and iterate until you hit a target threshold.
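As a minimal sketch of the comparison step, assuming you have parallel lists of human and evaluator labels for the same calibration examples, scikit-learn covers the basic agreement metrics (the labels below are placeholders):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Parallel labels for the same calibration examples (placeholder values).
human_labels = ["correct", "incorrect", "correct", "correct", "incorrect"]
eval_labels  = ["correct", "correct",   "correct", "incorrect", "incorrect"]

# Overall agreement.
print("Accuracy:", accuracy_score(human_labels, eval_labels))

# Confusion matrix surfaces systematic bias, e.g. a judge that over-predicts "correct".
print(confusion_matrix(human_labels, eval_labels, labels=["correct", "incorrect"]))

# Per-label precision and recall.
print(classification_report(human_labels, eval_labels, labels=["correct", "incorrect"]))
```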

Workflow

Use the Arize skills plugin with the arize-align-evaluator skill in your coding agent. It walks you through aligning LLM-as-a-judge evaluators to human ground truth by composing ax CLI steps into a loop: run the evaluator, compare its labels to human judgments, measure agreement (accuracy, confusion matrix, per-label precision and recall), diagnose systematic bias, revise the evaluator template, and repeat until you hit a target threshold.

Get started with a prompt like:
  • “Use the arize-align-evaluator skill to align my correctness evaluator against human annotations on my customer-support project.”
[Screenshot: Claude Code terminal after asking to align evals; the skill loads successfully and the assistant lists numbered questions about the evaluator, ground-truth labels, project or dataset, and space.]
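For orientation, the loop the skill automates looks roughly like the sketch below. This is illustrative Python, not the skill's implementation: run_evaluator, the example data, and the threshold are placeholders for your own judge, calibration set, and target.

```python
from sklearn.metrics import accuracy_score

TARGET_AGREEMENT = 0.90  # example threshold; pick one that fits your use case

def run_evaluator(template: str, examples: list[dict]) -> list[str]:
    # Hypothetical stand-in for the LLM judge: returns one label per example.
    return ["correct" for _ in examples]

examples = [{"input": "Refund status?", "output": "3-5 business days."}]  # calibration sample
human_labels = ["correct"]                                                # matching annotations
template = "You are a correctness judge. Label the response correct or incorrect."

for attempt in range(1, 6):  # cap the number of revisions rather than looping forever
    eval_labels = run_evaluator(template, examples)
    agreement = accuracy_score(human_labels, eval_labels)
    print(f"Attempt {attempt}: agreement = {agreement:.2f}")
    if agreement >= TARGET_AGREEMENT:
        break
    # Otherwise: inspect disagreements, diagnose systematic bias,
    # revise `template`, and run the loop again.
```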

Common issues

  • High human disagreement: if annotators disagree with each other, evals cannot align to a single standard until the rubric is clarified
  • Small calibration sets: a handful of rows can miss long-tail failures. Aim for at least 50 to 100 labeled examples before trusting metrics or changing production monitors
  • Criteria mismatch: your evals may be scoring a different dimension than your annotations (e.g. fluency vs factual accuracy)

Troubleshooting

If agreement is low but humans are consistent, iterate the judge prompt and confirm your variable mappings match the fields humans reviewed. If scores look good on average but fail on a specific slice, stratify your sample by product area, language, or tool-use path and recheck alignment per slice.
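A per-slice check can be as simple as a pandas groupby; the columns and values below are hypothetical:

```python
import pandas as pd

# One row per calibration example, with the slice you want to stratify by.
df = pd.DataFrame({
    "product_area": ["billing", "billing", "search", "search"],
    "human_label":  ["correct", "incorrect", "correct", "correct"],
    "eval_label":   ["correct", "correct",   "correct", "incorrect"],
})

# Average agreement can hide a slice that is badly misaligned.
per_slice = (
    df.assign(agree=df["human_label"] == df["eval_label"])
      .groupby("product_area")["agree"]
      .mean()
)
print(per_slice)
```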

Further reading