Follow with Complete Python Notebook
Why Create a Dataset?
In AI application development, quick iteration can mask regressions or blind spots in quality. Prompt tweaks, model swaps, or architectural changes may seem better in isolation, but without systematic evaluation it’s just guesswork. That’s where datasets come in: they act as structured collections of representative examples that you care about and want to systematically test your application against. A dataset is your definition of the test cases that matter as your system evolves. Each example can capture the input that your application will receive, an expected output, and any metadata such as tags, error types, or model parameters. Datasets provide a reliable foundation for evaluating, tracking, and improving your AI workflows.
What Should Your Dataset Contain?
The ideal dataset reflects the core behaviors you want your application to get right. Consider including:
- Normal examples that represent typical user interactions.
- Edge cases where your application historically struggled.
- Flagged or failed runs pulled from logs, user feedback, or tracing. These illustrate concrete failure modes you want to fix.
- Golden datasets: Curated examples with human-verified or “ideal” outputs that serve as a reliable benchmark.
- Regression datasets: Cases that previously failed or revealed a weakness you want to prevent from recurring.
- Real user logs: Production or staging logs captured via traces.
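As a rough illustration of the categories above, one way to keep them distinguishable is to tag each example in its metadata and then check how many examples of each kind the dataset holds. The sketch below uses plain Python only; the `Example` class, its field names, and the `kind` tag values are hypothetical and are not part of any Arize API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Example:
    """One dataset example: an input, an expected output, and free-form metadata."""
    input: str
    expected_output: str
    metadata: dict = field(default_factory=dict)


# Hypothetical examples; the queries, labels, and tags are made up for illustration.
examples = [
    Example("How do I reset my password?", "account", {"kind": "normal"}),
    Example("asdf refund??? order", "billing", {"kind": "edge_case"}),
    Example("I was charged twice last month", "billing", {"kind": "regression"}),
]


def count_by_kind(items):
    """Count examples per metadata 'kind' tag to check coverage of each category."""
    return Counter(item.metadata.get("kind", "untagged") for item in items)


coverage = count_by_kind(examples)
print(dict(coverage))
```

A count like this makes it easy to notice when a dataset leans entirely on typical interactions and contains no edge cases or regression cases.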
Define an Agent
To run experiments, you’ll need an application or agent to evaluate. In this tutorial, we use a customer support agent built with the Agno framework. You can find the complete agent implementation in the reference notebook below. Arize integrates with many frameworks and LLM providers for easy tracing and evaluation. See the full list below:
View All Integrations
Create a Golden Dataset
In this tutorial, we’ll create a golden dataset: a dataset that includes reference outputs (also called ground truth) for each example. A golden dataset serves as a benchmark for performance in your experiments, providing a reliable standard against which you can measure and compare your agent’s outputs across iterations.
To run experiments in Arize AX, you need a dataset. A dataset provides the structured examples that your experiments use to run and evaluate your agent. Without a dataset, you can’t systematically measure performance, compare different agent versions, or track improvements over time.
In our example, each dataset entry contains:
- Query: The user input that will be sent to the agent
- Expected Category (Reference Output): The category the agent should classify the query into
The reference output is stored in the expected_category field. When you upload a dataset to Arize AX, its columns are detected automatically.
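A minimal sketch of what such a golden dataset might look like as tabular data, assuming one column per field. The `query` column name, the example rows, and the category labels are illustrative assumptions; only `expected_category` comes from the text above. The sketch uses only the Python standard library and does not call any Arize API.

```python
import csv
import io

# Illustrative rows: each entry pairs a user query with its reference category.
golden_rows = [
    {"query": "I can't log in to my account", "expected_category": "account"},
    {"query": "Why was I charged twice?", "expected_category": "billing"},
    {"query": "The app crashes on startup", "expected_category": "technical"},
]

FIELDNAMES = ["query", "expected_category"]


def validate_rows(rows):
    """Return indices of rows with a missing or empty required field."""
    return [
        i for i, row in enumerate(rows)
        if any(not str(row.get(name) or "").strip() for name in FIELDNAMES)
    ]


def to_csv(rows):
    """Serialize rows to CSV text with a header row, one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


invalid = validate_rows(golden_rows)  # indices of incomplete rows, if any
csv_text = to_csv(golden_rows)
print(csv_text)
```

Checking for empty reference outputs before uploading is a cheap safeguard, since an example without a reference output cannot serve as a benchmark.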
Upload Dataset