
A Colab notebook that walks through the complete workflow is available in the Agent Trajectory Evaluation Notebook.
How It Works
- Group tool-calling spans per trace – each tool call (function call) is captured as a span when you instrument with OpenInference.
- Send the ordered list of tool calls to an LLM judge – Phoenix Evals classifies the trajectory as correct or incorrect (and can produce an explanation).
- Log the evaluation back to Arize – the result is attached to the root span of the trace so you can filter and pivot in the UI.
Prerequisites
- Instrumented traces of your agent with the OpenInference schema
- Python 3.10+ and the following packages:
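The exact package list depends on your setup; a typical environment for this workflow might be installed like the following (the specific package names here are assumptions, not a canonical list):

```bash
pip install arize arize-phoenix-evals openai pandas
```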
Implementation
1. Pull trace data from Arize
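One common way to do this is with the Arize Python exporter, pulling recent tracing data into a pandas DataFrame. Treat the snippet below as a sketch: the space and model identifiers are placeholders, and the time window is arbitrary.

```python
from datetime import datetime, timedelta, timezone

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

# Authenticate with your Arize developer API key (placeholder value below).
client = ArizeExportClient(api_key="YOUR_ARIZE_API_KEY")

# Export the last 7 days of tracing data for your agent project into a DataFrame.
end_time = datetime.now(timezone.utc)
primary_df = client.export_model_to_df(
    space_id="YOUR_SPACE_ID",
    model_id="YOUR_AGENT_PROJECT_NAME",
    environment=Environments.TRACING,
    start_time=end_time - timedelta(days=7),
    end_time=end_time,
)
```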
2. Filter to the spans you want to score
Most agents emit many spans (retrieval, LLM calls, DB writes, …). For trajectory scoring we usually care about the LLM spans that contain tool calls; a combined sketch for this step and the next is shown after step 3.
3. Extract ordered tool calls for each trace
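A minimal sketch of steps 2 and 3, assuming the exported DataFrame uses flattened OpenInference column names such as `attributes.openinference.span.kind` and `attributes.llm.output_messages`, and that tool calls are nested under keys like `message.tool_calls`; your exact column and key names may differ, so check your export before reusing this.

```python
import json

# Keep only LLM spans — in OpenInference, tool calls appear on the LLM span's output messages.
llm_spans = primary_df[
    primary_df["attributes.openinference.span.kind"] == "LLM"
].copy()


def extract_tool_calls(output_messages) -> list[dict]:
    """Pull (tool name, arguments) pairs out of an output-messages payload.

    Key names below reflect one common flattening of the OpenInference schema
    and are assumptions — adjust them to match your exported data.
    """
    if isinstance(output_messages, str):
        output_messages = json.loads(output_messages)
    calls = []
    for message in output_messages or []:
        for tool_call in message.get("message.tool_calls", []) or []:
            calls.append(
                {
                    "tool": tool_call.get("tool_call.function.name"),
                    "arguments": tool_call.get("tool_call.function.arguments"),
                }
            )
    return calls


llm_spans["tool_calls"] = llm_spans["attributes.llm.output_messages"].apply(
    extract_tool_calls
)

# Order spans within each trace, then concatenate their tool calls into one
# trajectory per trace (one row per trace_id, with an ordered list of calls).
trajectories = (
    llm_spans.sort_values("start_time")
    .groupby("context.trace_id")["tool_calls"]
    .agg(lambda series: [call for calls in series for call in calls])
    .reset_index()
)
```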
4. Define the evaluation prompt
The LLM judge receives:
- `{tool_calls}` – the actual trajectory (step → tool → arguments)
- `{attributes.input.value}` – the user input that kicked off the trace
- `{attributes.llm.tools}` – the JSON schema of available tools
- (Optional) `{reference_outputs}` – a golden trajectory you expect
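A sketch of what such a judge prompt might look like, using the template variables above; the wording is illustrative rather than the exact template from the notebook.

```python
TRAJECTORY_EVAL_TEMPLATE = """
You are evaluating whether an AI agent took a reasonable sequence of actions
(a "trajectory") to address the user's request.

[User Input]
{attributes.input.value}

[Available Tools (JSON schema)]
{attributes.llm.tools}

[Actual Tool Call Trajectory]
{tool_calls}

[Reference Trajectory (may be empty)]
{reference_outputs}

Compare the actual trajectory against the user input, the available tools, and
the reference trajectory if one is provided. Respond with a single word,
"correct" or "incorrect", followed by a brief explanation of your reasoning.
"""
```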