Datasets are the backbone of effective LLM experimentation, providing structured collections of examples for evaluation and iteration. They let you test models consistently across real-world scenarios and edge cases, quickly identify regressions, and track measurable improvements. In Arize, datasets are fully integrated, so you can run experiments in the UI or programmatically via the SDK.

Common Types of Datasets

Golden Datasets: Compare Against Ideal Outputs

Curating a golden dataset establishes a reliable benchmark: a consistent and trusted “ground truth” for LLM outputs. By meticulously hand-labeling ideal responses, you can objectively measure and compare the performance of different models and prompt versions over time.
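As a rough illustration of comparing prompt versions against a golden dataset, the sketch below scores two stand-in generators by normalized exact match. It does not use the Arize SDK; `GoldenExample`, `exact_match_rate`, `prompt_v1`, `prompt_v2`, and the second example row are hypothetical names and data introduced only for this example, and exact match is just one possible scoring choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenExample:
    """One hand-labeled example: an input and its ideal output."""
    input: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_rate(golden, generate) -> float:
    """Fraction of golden examples whose generated output matches the label."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = sum(
        normalize(generate(ex.input)) == normalize(ex.expected) for ex in golden
    )
    return hits / len(golden)


# Hypothetical golden set; the second row is invented for illustration.
golden_set = [
    GoldenExample("do you have to have two license plates in ontario", "True"),
    GoldenExample("is the sky green", "False"),
]


def prompt_v1(question: str) -> str:
    """Stand-in for a model call: always answers True."""
    return "True"


def prompt_v2(question: str) -> str:
    """Stand-in for a second prompt version with a trivial rule."""
    return "true" if "ontario" in question else "false"


score_v1 = exact_match_rate(golden_set, prompt_v1)
score_v2 = exact_match_rate(golden_set, prompt_v2)
print(f"v1: {score_v1:.2f}  v2: {score_v2:.2f}")
```

Because the golden labels stay fixed, the two scores are directly comparable across runs; in practice the stand-in functions would be replaced by real model calls and, often, a softer metric than exact match.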

Regression Datasets: Focus on Areas for Improvement

A regression dataset captures examples where your application previously failed or performed poorly. These datasets help ensure that fixes and improvements persist over time and that later changes don’t reintroduce the same failures. Examples are often pulled from user feedback or from logs that show problematic behavior.
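One way to picture this is a small filter over logged interactions that keeps only the failures. The sketch below is a minimal version under stated assumptions: the record fields `thumbs_down` and `eval_score`, the 0.5 threshold, and the sample logs are all hypothetical and do not reflect an Arize schema.

```python
def build_regression_dataset(logs, score_threshold=0.5):
    """Keep logged interactions that users flagged or that scored poorly.

    Assumes each record is a dict with "input" and "output" keys plus
    optional "thumbs_down" (bool) and "eval_score" (float) fields.
    """
    seen_inputs = set()
    dataset = []
    for record in logs:
        flagged = record.get("thumbs_down", False)
        low_score = record.get("eval_score", 1.0) < score_threshold
        if not (flagged or low_score):
            continue  # the interaction looked fine, so skip it
        if record["input"] in seen_inputs:
            continue  # keep one example per distinct input
        seen_inputs.add(record["input"])
        dataset.append({"input": record["input"], "bad_output": record["output"]})
    return dataset


# Hypothetical log records, invented for illustration.
sample_logs = [
    {"input": "reset my password", "output": "I cannot help.", "thumbs_down": True},
    {"input": "reset my password", "output": "Try again later.", "eval_score": 0.2},
    {"input": "what are your hours", "output": "9am to 5pm.", "eval_score": 0.9},
    {"input": "cancel my order", "output": "Order shipped!", "eval_score": 0.1},
]

regression_set = build_regression_dataset(sample_logs)
for row in regression_set:
    print(row["input"])
```

Storing the bad output alongside the input makes it easy to confirm later that a fix actually changed the behavior on that example.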

Flexible Dataset Format

Arize supports flexible dataset formats so you can structure data in the way that best fits your LLM application:

1. Key-Value Pairs: Flexible for multi-input/multi-output tasks such as function calls, agents, or classification, ensuring complex workflows can be tested consistently.

Input	Context	Output
`"What is Paul Graham known for?"`	`"Paul Graham is an investor, entrepreneur, and computer scientist known for…"`	`"Paul Graham is known for co-founding Y Combinator…"`

2. Prompt-Completion (String Pairs): Simple format for validating single-turn completions, making it easy to measure correctness against expected outputs.

Input	Output
`"do you have to have two license plates in ontario"`	`"True"`

3. Messages or Chat Format: Purpose-built for conversational agents, allowing you to evaluate multi-turn interactions in context.

Input:
{"messages": [{"role": "system", "content": "You are an expert SQL assistant"}]}
Output:
{"messages": [{"role": "assistant", "content": "SELECT * FROM users;"}]}
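The three formats above can be sketched as plain Python dictionaries. This is only an illustration of the shapes, not the Arize SDK: the lowercase key names, the `detect_format` helper and its labels, and the JSON Lines serialization are assumptions made for this example. The example values are copied from the tables above, including their elisions.

```python
import json

# 1. Key-value pairs: any number of named input and output columns.
key_value_example = {
    "input": "What is Paul Graham known for?",
    "context": "Paul Graham is an investor, entrepreneur, and computer scientist known for…",
    "output": "Paul Graham is known for co-founding Y Combinator…",
}

# 2. Prompt-completion: a single input string and a single output string.
prompt_completion_example = {
    "input": "do you have to have two license plates in ontario",
    "output": "True",
}

# 3. Messages or chat format: lists of role/content messages.
chat_example = {
    "input": {"messages": [{"role": "system", "content": "You are an expert SQL assistant"}]},
    "output": {"messages": [{"role": "assistant", "content": "SELECT * FROM users;"}]},
}


def detect_format(example: dict) -> str:
    """Classify an example by its shape; the labels are illustrative only."""
    values = list(example.values())
    if values and all(isinstance(v, dict) and "messages" in v for v in values):
        return "chat"
    if set(example) == {"input", "output"} and all(isinstance(v, str) for v in values):
        return "prompt_completion"
    return "key_value"


def to_jsonl(examples) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


all_examples = [key_value_example, prompt_completion_example, chat_example]
for ex in all_examples:
    print(detect_format(ex))
```

Keeping every example as a dictionary means the same loading and serialization code can handle all three formats, with only the evaluation step needing to know which shape it received.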

Learn More

Quickstart

Start experimenting with datasets in the UI or in code

Dive into a cookbook

Explore end-to-end walkthroughs on building robust datasets for experimentation

Learn more about evals

Read our evaluation concepts page
