The Microsoft Azure AI Evaluation library ships LLM-as-judge evaluators — groundedness, relevance, coherence, fluency, content safety — along with deterministic NLP scorers (BLEU, F1, ROUGE, METEOR). This guide shows two ways to wire them into Arize AX: Flow 1 grades existing Arize AX traces with GroundednessEvaluator and writes the scores back via spans.update_evaluations(...); Flow 2 uploads a small dataset, runs an Arize AX experiment with the same evaluator wrapped as an experiment evaluator, and surfaces the scores in Datasets + Experiments. Both flows share the same setup. Run the code blocks below in order inside a single Python session — each block builds on imports and variables from earlier ones.

Prerequisites

  • Python 3.11+
  • An ARIZE_SPACE_ID and ARIZE_API_KEY from your Arize AX space settings
  • An OPENAI_API_KEY from OpenAI Platform (used for both the traced model and the GroundednessEvaluator judge model)

Launch Arize AX

If you don’t already have an Arize AX account, sign up at arize.com and grab your ARIZE_SPACE_ID and ARIZE_API_KEY from Settings → Space Settings.

Install

pip install azure-ai-evaluation 'arize>=8.0.0' openai openinference-instrumentation-openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc pandas

Configure credentials

export ARIZE_SPACE_ID="<your-space-id>"
export ARIZE_API_KEY="<your-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"

Define evaluators

The shared setup: a Microsoft GroundednessEvaluator backed by gpt-4.1-mini, the canonical 2-row hallucination dataset that both flows score, and an Arize SDK client. The judge model is pinned to gpt-4.1-mini because azure-ai-evaluation still sends the legacy max_tokens parameter, which the GPT-5 and o-series families reject. gpt-4.1-mini accepts max_tokens natively and is more deterministic at temperature 0 than gpt-4o-mini.
# combined.py
import os
import time
from datetime import datetime, timedelta, timezone

import pandas as pd
from arize import ArizeClient
from azure.ai.evaluation import GroundednessEvaluator
from openai import OpenAI

SPACE_ID = os.environ["ARIZE_SPACE_ID"]
API_KEY = os.environ["ARIZE_API_KEY"]
TIMESTAMP = int(time.time())

# Microsoft GroundednessEvaluator scores 1–5 — higher = better grounded
# in the supplied context. 5 means fully supported; 1 means the response
# directly contradicts the context.
groundedness = GroundednessEvaluator(
    model_config={
        "type":     "openai",
        "api_key":  os.environ["OPENAI_API_KEY"],
        "model":    "gpt-4.1-mini",
        "base_url": "https://api.openai.com/v1",
    }
)
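
# For reference, a standalone call returns a dict of graded fields. With the
# current azure-ai-evaluation output shape it looks roughly like this (exact
# keys and values may vary by library version):
#   groundedness(query="...", response="...", context="...")
#   -> {"groundedness": 5.0, "groundedness_result": "pass",
#       "groundedness_reason": "...", ...}
# Both flows below read groundedness_result and groundedness_reason.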

# Canonical 2-row dataset — row 0 is factual (answer matches the reference),
# row 1 is hallucinated. Both flows grade these same rows.
ROWS = [
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
]

arize = ArizeClient(api_key=API_KEY)

Flow 1 — Evaluate existing traces

Source the spans

Instrument OpenAI with OpenInference, make two calls (each forced to echo a known answer so the trace contains predictable text), then pull the resulting spans back from Arize AX.
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

PROJECT_NAME = f"microsoft-tracing-example-{TIMESTAMP}"

resource = Resource.create(
    {
        "service.name":                PROJECT_NAME,
        "openinference.project.name":  PROJECT_NAME,
        "model_id":                    PROJECT_NAME,
    }
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="https://otlp.arize.com:443",
            headers={
                "authorization":   API_KEY,
                "arize-space-id":  SPACE_ID,
                "arize-interface": "python",
            },
        )
    )
)
trace.set_tracer_provider(provider)
OpenAIInstrumentor().instrument(tracer_provider=provider)

sync_oai = OpenAI()
for row in ROWS:
    sync_oai.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a fact-recall assistant. The user states the "
                    "exact answer to use; reply with that verbatim."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Question: {row['input']}\n"
                    f"Answer (reply verbatim): {row['output']}"
                ),
            },
        ],
    )

provider.force_flush(timeout_millis=10_000)
print(f"Project: {PROJECT_NAME}")

# Poll defensively: Arize's OTLP ingest and Flight export use different
# catalogs and the new project can briefly appear "unauthorized" to the
# export endpoint while still accepting span writes via OTLP, so swallow
# transient errors and retry.
start = datetime.now(timezone.utc) - timedelta(minutes=5)
end = datetime.now(timezone.utc) + timedelta(minutes=1)
spans_df = None
last_err: Exception | None = None
for _ in range(12):
    time.sleep(5)
    try:
        spans_df = arize.spans.export_to_df(
            space_id=SPACE_ID,
            project_name=PROJECT_NAME,
            start_time=start,
            end_time=end,
        )
    except Exception as e:
        last_err = e
        continue
    if spans_df is not None and len(spans_df) >= len(ROWS):
        break
else:
    raise RuntimeError(
        f"Spans never appeared after 60s (last error: {last_err})"
    )

spans_df = spans_df.sort_values("start_time").reset_index(drop=True)

Run the evaluators

GroundednessEvaluator.__call__ is sync — call it once per span, pulling the question / answer / reference triple. The raw groundedness value is a 1–5 score from the judge LLM, which can drift by one point between runs at the extremes. This guide grades on the deterministic groundedness_result (pass / fail against the configured threshold of 3) and normalizes to 1.0 / 0.0 so the score column is stable across runs. If you want the raw 1–5 number, swap in float(result["groundedness"]) (a variant is sketched after the next block).
scores = []
labels = []
for i, row in spans_df.iterrows():
    result = groundedness(
        query=ROWS[i]["input"],
        response=row["output"],
        context=ROWS[i]["reference"],
    )
    passed = result["groundedness_result"] == "pass"
    scores.append(1.0 if passed else 0.0)
    labels.append("grounded" if passed else "ungrounded")
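
If you prefer the raw 1–5 judge score over the pass / fail flag, here is a minimal variant of the same loop (a sketch only; it reuses groundedness, ROWS, and spans_df from the blocks above and applies the 0–1 normalization described under Troubleshooting):
raw_scores = []
for i, row in spans_df.iterrows():
    result = groundedness(
        query=ROWS[i]["input"],
        response=row["output"],
        context=ROWS[i]["reference"],
    )
    raw = float(result["groundedness"])  # raw 1–5 judge score; can drift by a point between runs
    raw_scores.append((raw - 1) / 4)     # project onto a 0–1 range for downstream tooling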

Log evaluations to Arize AX

eval_df = pd.DataFrame(
    {
        "context.span_id":          spans_df["context.span_id"],
        "eval.groundedness.score":  scores,
        "eval.groundedness.label":  labels,
    }
)
arize.spans.update_evaluations(
    space_id=SPACE_ID,
    project_name=PROJECT_NAME,
    dataframe=eval_df,
)

flow1_display = pd.DataFrame(
    {
        "input":  [r["input"]  for r in ROWS],
        "output": [r["output"] for r in ROWS],
        "score":  scores,
    }
)
print("Flow 1 results:")
print(flow1_display.to_string())

Expected output

Flow 1 results:
                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0

Verify in Arize AX

Open the project named microsoft-tracing-example-<timestamp> (the value printed above) in your Arize AX space. Each ChatCompletion span now carries a groundedness annotation column showing the normalized 0/1 score and the grounded / ungrounded label.

Flow 2 — Run an experiment

Create a dataset

DATASET_NAME = f"microsoft-experiment-example-ds-{TIMESTAMP}"
dataset_df = pd.DataFrame(ROWS)
arize.datasets.create(
    name=DATASET_NAME,
    space=SPACE_ID,
    examples=dataset_df,
)
print(f"Dataset: {DATASET_NAME}")

Define the task

def task(dataset_row):
    # Passthrough task: the dataset rows already contain the candidate answers,
    # so each run replays the stored output for the evaluator to grade.
    return dataset_row["output"]

Wrap the evaluators

GroundednessEvaluator.__call__ is already safe to invoke from inside an asyncio loop (the library wraps its async core with async_run_allowing_running_loop), so the experiment evaluator is a plain def, not async def. Return an EvaluationResult with score, label, and explanation populated — leaving any of those as None triggers unsupported cast from null to <type>: reserved column cannot be coerced to canonical type at upload time.
from arize.experiments.evaluators.types import EvaluationResult


def groundedness_eval(input, output, dataset_row) -> EvaluationResult:
    result = groundedness(
        query=dataset_row["input"],
        response=output if isinstance(output, str) else str(output),
        context=dataset_row["reference"],
    )
    passed = result["groundedness_result"] == "pass"
    return EvaluationResult(
        score=1.0 if passed else 0.0,
        label="grounded" if passed else "ungrounded",
        explanation=result.get("groundedness_reason") or "no explanation",
    )

Run the experiment

EXPERIMENT_NAME = f"microsoft-experiment-example-{TIMESTAMP}"
experiment, runs_df = arize.experiments.run(
    space=SPACE_ID,
    name=EXPERIMENT_NAME,
    dataset=DATASET_NAME,
    task=task,
    evaluators={"groundedness": groundedness_eval},
)
print(f"Experiment: {EXPERIMENT_NAME}")
print("Flow 2 results:")
flow2_display = runs_df[
    ["output", "eval.groundedness.score", "eval.groundedness.label"]
].rename(
    columns={
        "eval.groundedness.score": "score",
        "eval.groundedness.label": "label",
    }
)
print(flow2_display.to_string())

Expected output

Flow 2 results:
                             output  score       label
0   Paris is the capital of France.    1.0    grounded
1  Berlin is the capital of France.    0.0  ungrounded

Verify in Arize AX

Open the Datasets + Experiments tab in Arize AX. The dataset microsoft-experiment-example-ds-<timestamp> and the experiment microsoft-experiment-example-<timestamp> (names printed above) appear with one run per dataset row, each carrying the groundedness score and label columns.

Troubleshooting

  • OpenAIConnection.__init__() missing 1 required positional argument: 'base_url'. The Azure AI Evaluation library requires an explicit base_url in the model_config even for plain OpenAI. Set it to https://api.openai.com/v1 as shown in the Define evaluators block.
  • Unsupported parameter: 'max_tokens' is not supported with this model. azure-ai-evaluation sends OpenAI requests with the legacy max_tokens parameter that GPT-5 and o-series models reject. Pin the judge to a model that still accepts max_tokens (gpt-4.1-mini, gpt-4o-mini, gpt-4o).
  • column "eval.groundedness.label": unsupported cast from null to string: reserved column cannot be coerced to canonical type. Your experiment evaluator returned a bare float or a dict that didn’t fill all three of score / label / explanation. Return a fully-populated EvaluationResult(...).
  • Spans never appear after 60s. Span flush + ingest typically takes 5–15s. If the loop times out, check that ARIZE_SPACE_ID + ARIZE_API_KEY are right and that you’re connecting to the correct region’s OTLP endpoint (otlp.arize.com for US, otlp.eu.arize.com for EU).
  • Using Azure OpenAI for the judge model. Swap the model_config for {"type": "azure", "api_key": "...", "azure_endpoint": "https://<resource>.openai.azure.com", "azure_deployment": "<deployment-name>", "api_version": "2024-10-21"} (see the sketch after this list). The rest of the guide is unchanged.
  • Using safety evaluators (HateUnfairness, Violence, etc.) instead. Those require an Azure AI Foundry project and use AzureAIProject(subscription_id=..., resource_group_name=..., project_name=...) as the evaluator’s second arg instead of model_config. See Microsoft’s safety eval docs.
  • Using a different score scale. Microsoft’s LLM-judged evaluators (Groundedness, Relevance, Coherence, Fluency, Similarity, Retrieval) all return scores on a 1–5 scale. To project onto a 0–1 range for downstream tooling, normalize before assigning: score = (raw - 1) / 4.
  • Experiment re-runs collide. Both names embed TIMESTAMP = int(time.time()), so a fresh run of combined.py gets unique names. If you re-run the dataset or experiment blocks inside the same Python session (where TIMESTAMP is unchanged), regenerate TIMESTAMP first or call arize.experiments.delete(...) / arize.datasets.delete(...) on the prior run’s names.
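
For the Azure OpenAI judge swap above, here is a sketch of the Define evaluators block rewired to the azure-style model_config. The AZURE_OPENAI_API_KEY environment variable, resource name, deployment, and API version are placeholders; substitute your own values.
import os

from azure.ai.evaluation import GroundednessEvaluator

# Same evaluator, judged by an Azure OpenAI deployment instead of api.openai.com.
# Placeholder values below; swap in your resource name, deployment, and API version.
groundedness = GroundednessEvaluator(
    model_config={
        "type":             "azure",
        "api_key":          os.environ["AZURE_OPENAI_API_KEY"],
        "azure_endpoint":   "https://<resource>.openai.azure.com",
        "azure_deployment": "<deployment-name>",
        "api_version":      "2024-10-21",
    }
)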

Resources

Azure AI Evaluation Documentation

azure-ai-evaluation on PyPI

Logging evaluations to Arize AX

Ragas evaluators in Arize AX