In this example, we are building a customer support agent that takes an input question from a customer and decides what to do: search for order information or answer the question directly.
Let’s break this down further: We are taking the user input and passing it to a router / planner prompt template. This template then decides which function call or agent skill to use. Often after calling a function, it goes back to the router template to decide on the next step, which could involve calling another agent skill.
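The router loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in (the `route` function, the `SKILLS` table, and the stopping condition), not the actual agent we build below, but it shows the control flow: route, call a skill, route again until done.

```python
def route(user_input, history):
    # Stand-in for the router / planner prompt template: returns the
    # name of the next skill to invoke, or None when finished.
    if not history:
        return "track_package"
    return None  # a real router would inspect history and decide

# Hypothetical skill implementations keyed by name.
SKILLS = {
    "track_package": lambda: "Package 123 is in transit.",
}

def run_agent(user_input):
    history = []
    while True:
        next_skill = route(user_input, history)
        if next_skill is None:
            break
        # Call the chosen skill, then loop back to the router.
        history.append(SKILLS[next_skill]())
    return history

print(run_agent("Where is my order?"))  # → ['Package 123 is in transit.']
```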
We have auto-instrumentation for function calling and structured outputs across almost every LLM provider here. We also support tracing for common frameworks such as LangGraph, LlamaIndex Workflows, CrewAI, and AutoGen using auto-instrumentation. To trace a simple agent with function calling, you can use arize-otel, our convenience package for setting up OTEL tracing along with OpenInference auto-instrumentation, which maps LLM metadata to a standardized set of trace and span attributes. Here’s some sample code for logging all OpenAI calls to Arize AX.
```python
# Import open-telemetry dependencies
from getpass import getpass

from arize.otel import register

# Setup OTEL via our convenience function
tracer_provider = register(
    space_id=getpass("Enter your space ID"),
    api_key=getpass("Enter your API Key"),
    project_name="agents-cookbook",
)

# Import the automatic instrumentor from OpenInference
from openinference.instrumentation.openai import OpenAIInstrumentor

# Finish automatic instrumentation
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```
Let’s create the foundation for our customer support agent. We define two functions below: product_search and track_package.
```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "product_search",
            "description": "Search for products based on criteria.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query string.",
                    },
                    "category": {
                        "type": "string",
                        "description": "The category to filter the search.",
                    },
                    "min_price": {
                        "type": "number",
                        "description": "The minimum price of the products to search.",
                        "default": 0,
                    },
                    "max_price": {
                        "type": "number",
                        "description": "The maximum price of the products to search.",
                    },
                    "page": {
                        "type": "integer",
                        "description": "The page number for pagination.",
                        "default": 1,
                    },
                    "page_size": {
                        "type": "integer",
                        "description": "The number of results per page.",
                        "default": 20,
                    },
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "track_package",
            "description": "Track the status of a package based on the tracking number.",
            "parameters": {
                "type": "object",
                "properties": {
                    "tracking_number": {
                        "type": "integer",
                        "description": "The tracking number of the package.",
                    }
                },
                "required": ["tracking_number"],
            },
        },
    },
]
```
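When the model selects a tool, it returns its arguments as a JSON string that should conform to the schema above. A minimal sketch of parsing those arguments and applying the schema defaults for omitted optional fields (the `raw_arguments` value is illustrative, not real model output):

```python
import json

# Illustrative arguments string, as the model might return for product_search.
raw_arguments = '{"query": "energy-efficient appliances", "max_price": 500}'

# Defaults taken from the tool schema above; the parsed values win on conflict.
defaults = {"min_price": 0, "page": 1, "page_size": 20}
arguments = {**defaults, **json.loads(raw_arguments)}

print(arguments["page_size"])  # 20, filled in from the schema default
```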
We define a function below called run_prompt, which calls OpenAI’s chat completions endpoint with our tools and returns any tool calls. Note that we set tool_choice to "required", so a function call will always be returned.
```python
import os

import openai

client = openai.Client()


def run_prompt(input):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        tools=tools,
        tool_choice="required",
        messages=[
            {
                "role": "system",
                "content": " ",
            },
            {
                "role": "user",
                "content": input,
            },
        ],
    )
    if (
        hasattr(response.choices[0].message, "tool_calls")
        and response.choices[0].message.tool_calls is not None
        and len(response.choices[0].message.tool_calls) > 0
    ):
        return response.choices[0].message.tool_calls
    else:
        return []
```
Let’s test it and see if it returns the right function! If we ask a question about specific products, we’ll get a response that will call the product search function.
```python
run_prompt("I'm interested in energy-efficient appliances, but I'm not sure which ones are best for a small home office. Can you help?")
```
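Each returned tool call carries a function name and a JSON string of arguments, which a simple dispatcher can map to local implementations. A sketch, using plain dicts for the tool calls (the OpenAI SDK objects expose the same data as `call.function.name` and `call.function.arguments`); the implementations and sample call below are hypothetical stand-ins:

```python
import json

# Hypothetical local implementations of the two tools defined above.
def product_search(query, **kwargs):
    return f"Results for {query!r}"

def track_package(tracking_number):
    return f"Status for package {tracking_number}"

DISPATCH = {"product_search": product_search, "track_package": track_package}

def execute_tool_calls(tool_calls):
    # Look up each named function and invoke it with the parsed arguments.
    results = []
    for call in tool_calls:
        fn = DISPATCH[call["name"]]
        results.append(fn(**json.loads(call["arguments"])))
    return results

# Sample tool call shaped like the model's output for a tracking question.
print(execute_tool_calls(
    [{"name": "track_package", "arguments": '{"tracking_number": 1234}'}]
))  # → ['Status for package 1234']
```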
Once we have generated a set of test cases, we can create evaluators to measure performance. This way, we don’t have to manually inspect every single trace to see if the LLM is doing the right thing. Here, we define an evaluation template to judge whether the router called a function correctly. In our Colab, we add two more evaluators: whether it selected the right function, and whether it filled in the arguments correctly.
```python
ROUTER_EVAL_TEMPLATE = """You are comparing a response to a question, and verifying
whether that response should have made a function call instead of responding directly.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [LLM Response]: {response}
    ************
    [END DATA]

Compare the Question above to the response. You must determine whether the response
decided to call the correct function.

Your response must be a single word, either "correct" or "incorrect", and should not
contain any text or characters aside from that word.

"incorrect" means that the agent should have made a function call instead of
responding directly and did not, or the function call chosen was the incorrect one.

"correct" means the selected function would correctly and fully answer the user's
question.

Here is more information on each function:
- product_search: Search for products based on criteria.
- track_package: Track the status of a package based on the tracking number."""
```
We can evaluate our outputs using Phoenix’s llm_classify function. I have a placeholder response_df which would be generated using the dataset code above.
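A sketch of running the router eval with `llm_classify` from `phoenix.evals`. The `response_df` here is a hand-built placeholder standing in for the dataframe from the dataset code above, and the actual call is guarded behind a hypothetical `RUN_EVALS` environment flag because it hits the OpenAI API; it also assumes the `ROUTER_EVAL_TEMPLATE` defined above is in scope.

```python
import os

import pandas as pd

# Placeholder response_df; column names must match the template's
# {question} and {response} variables.
response_df = pd.DataFrame(
    {
        "question": ["Where is my package with tracking number 1234?"],
        "response": ["track_package(tracking_number=1234)"],
    }
)

# The allowed output labels from the eval template.
rails = ["correct", "incorrect"]

if os.environ.get("RUN_EVALS"):  # hypothetical flag: llm_classify calls the OpenAI API
    from phoenix.evals import OpenAIModel, llm_classify

    router_eval_df = llm_classify(
        dataframe=response_df,
        template=ROUTER_EVAL_TEMPLATE,  # defined above
        model=OpenAIModel(model="gpt-4o-mini"),
        rails=rails,
        provide_explanation=True,  # include the judge's reasoning per row
    )
```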
If you inspect each dataframe generated by llm_classify, you’ll get a table of evaluation results. Below is a formatted example for router_eval_df merged with the question and response.
| Question | Response | Label | Explanation |
| --- | --- | --- | --- |
| Could you tell me about your selection of eco-friendly gadgets that might be suitable for a modern kitchen setup? | product_search(query='eco-friendly gadgets', category='kitchen') | correct | The user’s question is about finding eco-friendly gadgets suitable for a modern kitchen setup. The function call made is ‘product_search’ with the query ‘eco-friendly gadgets’ and category ‘kitchen’, which is appropriate for searching products based on the user’s criteria. Therefore, the function call is correct. |
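The merged view above can be produced by concatenating the eval output with the input dataframe: `llm_classify` returns one row per input in the same order, so a column-wise concat lines up. A sketch with mocked stand-in dataframes:

```python
import pandas as pd

# Mock stand-ins: response_df holds the inputs, router_eval_df mimics
# the shape of an llm_classify result (one row per input, same order).
response_df = pd.DataFrame(
    {
        "question": ["Eco-friendly kitchen gadgets?"],
        "response": ["product_search(...)"],
    }
)
router_eval_df = pd.DataFrame(
    {
        "label": ["correct"],
        "explanation": ["The product_search call matches the question."],
    }
)

# Column-wise concat aligns rows on the shared (reset) index.
merged = pd.concat(
    [response_df.reset_index(drop=True), router_eval_df.reset_index(drop=True)],
    axis=1,
)
print(merged.columns.tolist())  # ['question', 'response', 'label', 'explanation']
```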
We covered very simple examples of tracing and evaluating an agent that uses function calling to route user requests and take actions in your application. As you build more capabilities into your agent, you’ll need more advanced tooling to measure and improve performance. You can do all of the following in Arize AX: