
Measuring Quality at Scale

Your SupportBot is now instrumented with complete tracing. You can see every LLM call, tool invocation, and retrieval operation. But there’s a problem: users are still complaining. Some responses are helpful; others are completely wrong. Visibility alone isn’t enough. You need to measure quality. Which traces represent good responses? Which represent failures? And most importantly, how do you identify patterns across thousands of interactions? This chapter teaches you two approaches to measuring quality:
  1. Human Annotations - Review traces and add labels directly in the Arize AX UI
  2. Automated Evaluators - Scale measurement using code-based heuristics or LLM-as-Judge

Follow along with the complete Python notebook

The Challenge

Manually reviewing traces doesn’t scale. If you process 10,000 queries per day, you can’t review them all. But you can:
  • Review a sample to create ground truth
  • Run automated evaluators to identify patterns at scale
Let’s see how to implement each approach.

Annotate in the Arize AX UI

The easiest way to start is annotating traces directly in the UI. Navigate to any trace, open the annotation panel, and add labels, scores, or freeform notes — no code required. Arize AX supports three annotation types:
  • Categorical - for yes/no or multi-class labels. Examples: "correct" / "incorrect", "helpful" / "unhelpful" / "harmful"
  • Continuous - for numeric scores. Examples: 1-5 stars, 0-100 quality scores, 0.0-1.0 confidence
  • Freeform - for open-ended notes. Examples: "Retrieved irrelevant documents", "Hallucinated product name"

Automated Evaluations

Manual review gives you ground truth, but it doesn’t scale to every trace. To evaluate thousands of traces automatically, export your spans and run heuristic or LLM-as-Judge evaluators against them.

Tool Result Evaluation

A simple code-based evaluator that checks whether tool calls succeeded:
from openinference.instrumentation import suppress_instrumentation
from arize import ArizeClient
from datetime import datetime, timedelta
import os

arize_client = ArizeClient(api_key=os.getenv("ARIZE_API_KEY"))

# Export spans from the last hour
end_time = datetime.now()
start_time = end_time - timedelta(hours=1)

trace_df = arize_client.spans.export_to_df(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    project_name="supportbot-tutorial",
    start_time=start_time,
    end_time=end_time,
)

def evaluate_tool_results(trace_df):
    """Evaluate if tool calls succeeded or failed."""

    tool_spans = trace_df[trace_df["attributes.openinference.span.kind"] == "TOOL"]
    evaluations = []

    for _, span in tool_spans.iterrows():
        span_id = span["context.span_id"]
        # Tool results may be missing (NaN) in exported spans; coerce to a string
        tool_result = str(span.get("attributes.tool.result", "") or "")

        # Simple heuristic: check for error keywords
        has_error = any(
            keyword in tool_result.lower()
            for keyword in ["error", "failed", "invalid", "not found"]
        )

        evaluations.append({
            "span_id": span_id,
            "label": "FAILED" if has_error else "SUCCESS",
            "score": 0.0 if has_error else 1.0,
            "annotator_kind": "CODE",
        })

    return evaluations

tool_evals = evaluate_tool_results(trace_df)
print(f"Evaluated {len(tool_evals)} tool calls")

# Guard against an empty export (no tool spans in the time window)
if tool_evals:
    success_rate = sum(e["score"] for e in tool_evals) / len(tool_evals)
    print(f"Tool success rate: {success_rate:.1%}")
When writing LLM-as-Judge evaluators, wrap the evaluator’s LLM calls with suppress_instrumentation() to prevent them from appearing as application traces.
with suppress_instrumentation():
    response = openai_client.chat.completions.create(...)
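
Putting that together, a minimal LLM-as-Judge evaluator for answer relevance might look like the sketch below. The prompt, the gpt-4o-mini model choice, and the judge_relevance helper are illustrative assumptions rather than a prescribed Arize pattern; the key detail is that the judge’s own call runs inside suppress_instrumentation():

from openai import OpenAI
from openinference.instrumentation import suppress_instrumentation

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_relevance(question: str, answer: str) -> dict:
    """Ask an LLM judge whether the answer actually addresses the question."""
    prompt = (
        "You are grading a support bot response.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly one word: RELEVANT or IRRELEVANT."
    )

    # Keep the judge's own LLM call out of the application's traces
    with suppress_instrumentation():
        response = openai_client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable judge model works here
            messages=[{"role": "user", "content": prompt}],
        )

    label = response.choices[0].message.content.strip().upper()
    return {
        "label": label,
        "score": 1.0 if label == "RELEVANT" else 0.0,
        "annotator_kind": "LLM",
    }

You can then apply judge_relevance to each LLM or retrieval span in trace_df, exactly as evaluate_tool_results loops over the tool spans above.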

Key Takeaways

You now have two layers of quality measurement:
  • Manual annotations via the UI for ground truth
  • Automated evaluators for systematic pattern identification at scale

What’s Next?

You can now measure quality for individual queries. But what about conversations? Multi-turn interactions need session-level analysis. In the next chapter, Sessions, you’ll learn how to:
  • Group related traces into conversations
  • Track context across multiple turns
  • Identify where multi-turn interactions break down
Let’s continue! →