Key Capabilities
- Automatic tracing of all LLM calls during experiments
- Concurrent execution for faster evaluation
- Dry-run mode for testing without logging
- Built-in evaluator support
- Compare experiments side-by-side in the UI
Run an Experiment
Execute a task function across your dataset examples with automatic evaluation, then log the results to Arize. High-level flow:
- Resolve the dataset and download examples (cached if enabled)
- Execute the task and evaluators with configurable concurrency
- Upload results to Arize (unless in dry-run mode)
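The three-step flow above can be sketched in plain Python. Everything here is a conceptual stand-in, not the actual Arize SDK surface: `run_experiment`, its parameters, and the task/evaluator functions are hypothetical names used only to illustrate how the task, evaluators, concurrency, error handling, and dry-run mode fit together.

```python
from concurrent.futures import ThreadPoolExecutor

def run_experiment(examples, task, evaluators,
                   concurrency=4, dry_run=False, exit_on_error=False):
    """Conceptual sketch: run the task and evaluators over dataset
    examples with bounded parallelism, then (optionally) upload."""
    def process(example):
        try:
            output = task(example)  # step 2a: execute the task
        except Exception:
            if exit_on_error:
                raise               # stop on the first error
            return {"example": example, "output": None, "evals": {}}
        # step 2b: score the output with each evaluator
        evals = {fn.__name__: fn(output, example) for fn in evaluators}
        return {"example": example, "output": output, "evals": evals}

    # step 2: configurable concurrency via a thread pool
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(process, examples))

    if not dry_run:
        pass  # step 3: upload results to Arize (omitted in this sketch)
    return results

# Hypothetical task and evaluator, for illustration only
def uppercase_task(example):
    return example["input"].upper()

def exact_match(output, example):
    return float(output == example["expected"])

examples = [{"input": "hi", "expected": "HI"},
            {"input": "no", "expected": "NO"}]
results = run_experiment(examples, uppercase_task, [exact_match], dry_run=True)
```

With `dry_run=True` the upload step is skipped, so the same call can be used to validate the task and evaluators locally before a full run.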
Dry Run Mode
Execute your experiment locally without logging results to Arize. Use this to test your task and evaluators before committing to a full run.
Concurrency Control
Control parallelism for faster execution.
Error Handling
Stop execution on the first error encountered.
OpenTelemetry Tracing
Set the global OpenTelemetry tracer provider for the experiment run.
List Experiments
List all experiments, optionally filtered by dataset.
Create an Experiment
Log pre-computed experiment results to Arize. Use this when you’ve already executed your experiment elsewhere and want to record the results. Unlike run(), this does not execute the task; it only logs existing results.
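A minimal sketch of what "pre-computed results" might look like before logging. The record fields and the `validate_results` helper are illustrative assumptions, not the schema the Arize API actually expects; the point is only that each record already carries its output and scores, since no task is executed.

```python
# Results already produced elsewhere; field names are hypothetical.
precomputed = [
    {"example_id": "ex-1", "output": "HI", "scores": {"exact_match": 1.0}},
    {"example_id": "ex-2", "output": "no", "scores": {"exact_match": 0.0}},
]

def validate_results(results):
    """Illustrative client-side sanity check before logging: every record
    needs an example id and an output, and scores must be numeric."""
    for r in results:
        if "example_id" not in r or "output" not in r:
            raise ValueError("each record needs example_id and output")
        if not all(isinstance(v, (int, float)) for v in r["scores"].values()):
            raise ValueError("scores must be numeric")
    return len(results)

count = validate_results(precomputed)
# `precomputed` would then be passed to the create call to record the results
```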
Get an Experiment
Retrieve experiment details and metadata by name or ID. When using a name, provide dataset and optionally space to disambiguate.
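The name-versus-ID lookup can be sketched as follows. The in-memory `experiments` store and `resolve_experiment` are hypothetical; the sketch only illustrates why a name lookup needs `dataset` (and optionally `space`): IDs are unique on their own, while names are only unique within a dataset.

```python
def resolve_experiment(store, *, experiment_id=None, name=None,
                       dataset=None, space=None):
    """Illustrative lookup: prefer the globally unique ID; otherwise
    narrow a name match by dataset and, if given, space."""
    if experiment_id is not None:
        return store[experiment_id]
    if name is None or dataset is None:
        raise ValueError("name lookup requires dataset (and optionally space)")
    matches = [e for e in store.values()
               if e["name"] == name and e["dataset"] == dataset
               and (space is None or e["space"] == space)]
    if len(matches) != 1:
        raise LookupError(f"{len(matches)} experiments match name={name!r}")
    return matches[0]

# Hypothetical store: the same name exists in two datasets
experiments = {
    "exp-1": {"name": "baseline", "dataset": "ds-1", "space": "sp-A"},
    "exp-2": {"name": "baseline", "dataset": "ds-2", "space": "sp-A"},
}

by_id = resolve_experiment(experiments, experiment_id="exp-1")
by_name = resolve_experiment(experiments, name="baseline", dataset="ds-2")
```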
Delete an Experiment
Delete an experiment by name or ID. This operation is irreversible. There is no response from this call.
List Experiment Runs
Retrieve individual runs from an experiment with pagination support. Pass all=True to fetch all runs via Flight (ignores limit).
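The pagination behaviour can be sketched with a paging loop. `fetch_page` and `fetch_all_runs` are hypothetical stand-ins for the API calls: paging accumulates results one limited page at a time, whereas `all=True` (as described above) replaces the loop with a single bulk transfer over Flight that ignores the limit.

```python
def fetch_page(all_runs, limit, offset):
    """Stand-in for one paginated API call (names are illustrative)."""
    return all_runs[offset:offset + limit]

def fetch_all_runs(all_runs, page_size=2):
    """Page until an empty page signals the end -- conceptually the
    result you get in one shot with all=True."""
    runs, offset = [], 0
    while True:
        page = fetch_page(all_runs, page_size, offset)
        if not page:
            return runs
        runs.extend(page)
        offset += page_size

runs = [{"id": f"run-{i}"} for i in range(5)]
everything = fetch_all_runs(runs)
```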