Follow an end-to-end example of tracing and evaluating an agent
High-Level Concepts
In this example, we are building a customer support agent that takes an input question from a customer and decides what to do. Here, the agent decides between searching for order information and answering the question directly.

Trace your agent
We have auto-instrumentation for function calling and structured outputs across almost every LLM provider. We also support tracing for common frameworks using auto-instrumentation, such as LangGraph, LlamaIndex Workflows, CrewAI, and AutoGen. To trace a simple agent with function calling, you can use arize-otel, our convenience package for setting up OTEL tracing along with openinference auto-instrumentation, which maps LLM metadata to a standardized set of trace and span attributes. Here's some sample code for logging all OpenAI calls to Arize AX.
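A minimal sketch of that setup, assuming placeholder credentials (YOUR_SPACE_ID, YOUR_API_KEY) and a hypothetical project name:

```python
# pip install arize-otel openinference-instrumentation-openai openai

from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Point an OTEL tracer provider at Arize AX (credentials are placeholders).
tracer_provider = register(
    space_id="YOUR_SPACE_ID",
    api_key="YOUR_API_KEY",
    project_name="customer-support-agent",
)

# Auto-instrument the OpenAI client so every call is logged as a span
# with standardized openinference attributes.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```

Once instrumented, any OpenAI call made in the same process is traced automatically with no further code changes.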
In this example, the router chooses between functions such as product_search and track_package. We set tool_choice to required, so a function will always be returned.
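Here's a sketch of the router call itself; the tool schemas below are hypothetical stand-ins for the colab's real definitions:

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical tool schemas for the router; the real agent's definitions
# may include more parameters.
tools = [
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search the product catalog by query and category.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "category": {"type": "string"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Look up shipping status for an order.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Where is my order #112?"}],
    tools=tools,
    tool_choice="required",  # force the model to return a function call
)

print(response.choices[0].message.tool_calls)
```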

Evaluate your agent
Once we have generated a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing. Here, we define our evaluation template to judge whether the router selected a function correctly. In our colab, we also add two more evaluators: one for whether it selected the right function, and one for whether it filled in the arguments correctly. The evaluators run over response_df, which would be generated using the datasets code above.
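Here's a sketch of the router-correctness evaluator, assuming a simplified template and that response_df carries question and tool_call columns (hypothetical names) matching the template placeholders:

```python
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# Hypothetical evaluation template; the colab's templates may differ.
ROUTER_EVAL_TEMPLATE = """You are evaluating an AI agent's router.
Given the user's question and the function call the router produced,
decide whether the function call is an appropriate way to handle the
question. Respond with a single word: "correct" or "incorrect".

[Question]: {question}
[Function Call]: {tool_call}
"""

# Stand-in for the response_df generated by the datasets code above.
response_df = pd.DataFrame(
    {
        "question": ["Do you sell eco-friendly kitchen gadgets?"],
        "tool_call": [
            "product_search(query='eco-friendly gadgets', category='kitchen')"
        ],
    }
)

router_eval_df = llm_classify(
    dataframe=response_df,
    template=ROUTER_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=["correct", "incorrect"],
    provide_explanation=True,  # adds an explanation column for debugging
)
```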
After running llm_classify, you'll get a table of evaluation results. Below is a formatted example of router_eval_df merged with the question and response.
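One way to produce that merged view is a column-wise concat (a sketch, assuming llm_classify preserved the input row order):

```python
import pandas as pd

# Each question/response lines up with its label and explanation.
merged_df = pd.concat(
    [response_df.reset_index(drop=True), router_eval_df.reset_index(drop=True)],
    axis=1,
)
print(merged_df[["question", "tool_call", "label", "explanation"]])
```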
| Question | Response | Label | Explanation |
|---|---|---|---|
| Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen setup? | ChatCompletionMessageToolCall(name='product_search', type='function', arguments={'query': 'eco-friendly gadgets', 'category': 'kitchen'}) | correct | The user's question is about finding eco-friendly gadgets suitable for a modern kitchen setup. The function call made is 'product_search' with the query 'eco-friendly gadgets' and category 'kitchen', which is appropriate for searching products based on the user's criteria. Therefore, the function call is correct. |
Next steps
We covered very simple examples of tracing and evaluating an agent that uses function calling to route user requests and take actions in your application. As you build more capabilities into your agent, you'll need more advanced tooling to measure and improve performance. You can do all of the following in Arize AX:
- Manually create tool spans which log your function calling inputs, latency, and outputs (colab example).
- Evaluate your agent across multiple levels, not just the router prompt (more info on evaluations).
- Create experiments to track changes across models, prompts, and parameters (more info on experiments).
- Follow an example of manually instrumenting an agent if you need additional logging.
