Copy the Arize AX API_KEY and SPACE_ID from your Space Settings page (shown below) to the variables in the cell below.
import os
import nest_asyncio
from getpass import getpass

nest_asyncio.apply()

SPACE_ID = globals().get("SPACE_ID") or getpass("🔑 Enter your Arize Space ID: ")
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize API Key: ")
OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "🔑 Enter your OpenAI API key: "
)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
Here we’ve set up a basic agent that can solve math problems. We have a function tool that can evaluate math equations, and an agent that can use this tool. We’ll use the Runner class to run the agent and get the final output.
from agents import function_tool, Runner

@function_tool
def solve_equation(equation: str) -> str:
    """Use python to evaluate the math equation, instead of thinking about it yourself.

    Args:
        equation: string which to pass into eval() in python
    """
    # NOTE: eval() is fine for a demo, but never pass untrusted input to it in production.
    return str(eval(equation))
from agents import Agent

agent = Agent(
    name="Math Solver",
    instructions="You solve math problems by evaluating them with python and returning the result",
    tools=[solve_equation],
)
result = await Runner.run(agent, "what is 15 + 28?")

# Run Result object
print(result)

# Get the final output
print(result.final_output)

# Get the entire list of messages recorded to generate the final output
print(result.to_input_list())
Now that we have a basic agent, let’s evaluate whether it responded correctly! There are a few dimensions worth checking:
Tool call accuracy - did our agent choose the right tool with the right arguments? (See the sketch after this list.)
Tool call results - did the tool respond with the right results?
Agent goal accuracy - did our agent accomplish the stated goal and get to the right outcome?
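Tool call accuracy, for instance, can be checked directly from the messages the Runner records. Here is a minimal sketch; it assumes tool calls appear in `result.to_input_list()` as `function_call` items with `name` and `arguments` fields, which may vary across SDK versions:

import json

def check_tool_call(messages: list, expected_tool: str, expected_args: dict) -> bool:
    """Return True if a recorded tool call matches the expected tool and arguments."""
    for item in messages:
        if item.get("type") == "function_call" and item.get("name") == expected_tool:
            if json.loads(item.get("arguments", "{}")) == expected_args:
                return True
    return False

# Did the agent pass "15 + 28" to solve_equation?
print(check_tool_call(result.to_input_list(), "solve_equation", {"equation": "15 + 28"}))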
We’ll set up a simple evaluator that checks whether the agent’s response is correct; you can read about different types of agent evals here. Let’s set up our evaluation by defining our task function, our evaluator, and our dataset.
import asyncio
from agents import Runner

# This is our task function. It takes a question and returns the final output
# and the messages recorded to generate the final output.
async def solve_math_problem(dataset_row: dict):
    result = await Runner.run(agent, dataset_row.get("question"))
    # OPTIONAL: You don't need to return the messages unless you want to use them in your eval
    return {
        "final_output": result.final_output,
        "messages": result.to_input_list(),
    }

dataset_row = {"question": "What is 15 + 28?"}

result = asyncio.run(solve_math_problem(dataset_row))
print(result)
Let’s create our evaluator.
# Note: This example uses Python SDK v7
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify
from arize.experimental.datasets.experiments.types import EvaluationResult

def correctness_eval(dataset_row: dict, output: dict) -> EvaluationResult:
    # Create a dataframe with the question and answer
    df_in = pd.DataFrame(
        {"question": [dataset_row.get("question")], "response": [output]}
    )

    # Template for evaluating math problem solutions
    MATH_EVAL_TEMPLATE = """
    You are evaluating whether a math problem was solved correctly.

    [BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {response}
    [END DATA]

    Assess if the answer to the math problem is correct. First work out the
    correct answer yourself, then compare with the provided response. Consider
    that there may be different ways to express the same answer
    (e.g., "43" vs "The answer is 43" or "5.0" vs "5").

    Your answer must be a single word, either "correct" or "incorrect"
    """

    # Run the evaluation
    rails = ["correct", "incorrect"]
    eval_df = llm_classify(
        data=df_in,
        template=MATH_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=rails,
        provide_explanation=True,
    )

    # Extract results
    label = eval_df["label"][0]
    score = 1 if label == "correct" else 0
    explanation = eval_df["explanation"][0]

    # Return the evaluation result
    return EvaluationResult(score=score, label=label, explanation=explanation)
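Before wiring the evaluator into an experiment, it’s worth a quick sanity check against the task function’s output from earlier. A short usage sketch (here we pass the full task output dict; the template renders it as the response, final output included):

dataset_row = {"question": "What is 15 + 28?"}
output = asyncio.run(solve_math_problem(dataset_row))

evaluation = correctness_eval(dataset_row, output)
print(evaluation.label, evaluation.score)
print(evaluation.explanation)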
Using the template below, we’re going to generate a dataframe of 25 questions we can use to test our math-solving agent.
MATH_GEN_TEMPLATE = """
You are an assistant that generates diverse math problems for testing a math solver agent.
The problems should include:

Basic Operations: Simple addition, subtraction, multiplication, division problems.
Complex Arithmetic: Problems with multiple operations and parentheses following order of operations.
Exponents and Roots: Problems involving powers, square roots, and other nth roots.
Percentages: Problems involving calculating percentages of numbers or finding percentage changes.
Fractions: Problems with addition, subtraction, multiplication, or division of fractions.
Algebra: Simple algebraic expressions that can be evaluated with specific values.
Sequences: Finding sums, products, or averages of number sequences.
Word Problems: Converting word problems into mathematical equations.

Do not include any solutions in your generated problems.
Respond with a list, one math problem per line. Do not include any numbering at the beginning of each line.
Generate 25 diverse math problems. Ensure there are no duplicate problems.
"""
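The upload step below expects a `math_problems_df` dataframe holding the generated questions. One way to build it is to send the template to the model and split the reply into one question per line; this is a sketch using the `openai` client, and the model choice is an assumption:

from openai import OpenAI
import pandas as pd

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the model for 25 problems, one per line, then collect them into a dataframe.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": MATH_GEN_TEMPLATE}],
)
questions = [
    line.strip()
    for line in response.choices[0].message.content.splitlines()
    if line.strip()
]
math_problems_df = pd.DataFrame({"question": questions})
print(math_problems_df.head())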
With the dataset of questions we generated above, we can use the experiments feature to track changes across models, prompts, and parameters for our agent. Let’s create this dataset and upload it into the platform.
# Note: This example uses Python SDK v7
from uuid import uuid1

from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# Set up the arize client
arize_client = ArizeDatasetsClient(api_key=API_KEY)

dataset_name = "math-questions-" + str(uuid1())[:5]

dataset_id = arize_client.create_dataset(
    space_id=SPACE_ID,
    dataset_name=dataset_name,
    dataset_type=GENERATIVE,
    data=math_problems_df,
)
dataset = arize_client.get_dataset(space_id=SPACE_ID, dataset_id=dataset_id)
print(dataset)
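With the dataset uploaded, the task function and evaluator can be combined into an experiment run. The sketch below assumes the datasets client exposes a `run_experiment` method taking the dataset created above; parameter names may differ across SDK versions, so treat this as a template rather than a verified call:

# Run the task over every dataset row and score each output with our evaluator.
# (Parameter names are assumptions; depending on your SDK version this may take
# dataset_name=dataset_name instead of dataset_id.)
experiment = arize_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=solve_math_problem,
    evaluators=[correctness_eval],
    experiment_name="math-solver-experiment",
)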