

The arize-phoenix-evals library uses an LLM as a judge to grade model output: hallucinations, factuality, helpfulness, toxicity, custom rubrics. LiteLLM is the universal proxy provider in Phoenix Evals: pass provider="litellm" and a model="<provider>/<id>" string to the LLM(...) wrapper to route the judge through any of the 100+ backends LiteLLM supports. This is useful when no native Phoenix adapter exists (Mistral, Bedrock, Together, Groq, Ollama, etc.), or when you want one piece of eval code to switch backends with a single string change.

Prerequisites

  • Python 3.11+
  • An API key for whichever upstream provider you want LiteLLM to route to. The example below uses OpenAI (OPENAI_API_KEY).

Install

pip install arize-phoenix-evals litellm pandas

Configure credentials

Set the env var for whichever upstream provider you’re targeting. LiteLLM reads the matching env var based on the <provider>/ prefix on the model id:
export OPENAI_API_KEY="<your-openai-api-key>"
# Or, for other providers:
# export ANTHROPIC_API_KEY="..."
# export MISTRAL_API_KEY="..."
# export AWS_ACCESS_KEY_ID="..."; export AWS_SECRET_ACCESS_KEY="..."  # bedrock/...
See LiteLLM’s provider list for the full env var map.
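To fail fast before any eval rows are spent, a small pre-flight check can map the model string's <provider>/ prefix to the env var LiteLLM will read. The mapping below is an illustrative subset (not LiteLLM's full table), and the helper names are our own:

```python
import os

# Illustrative subset of the provider-prefix -> env-var mapping;
# see LiteLLM's provider docs for the authoritative, complete table.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}

def required_env_var(model: str) -> str:
    """Env var LiteLLM reads for a '<provider>/<id>' model string."""
    provider = model.split("/", 1)[0]
    try:
        return PROVIDER_ENV_VARS[provider]
    except KeyError:
        raise ValueError(f"unknown provider prefix in {model!r}") from None

def assert_credentials(model: str) -> None:
    """Raise early if the key the judge needs is missing from the env."""
    var = required_env_var(model)
    if not os.environ.get(var):
        raise RuntimeError(f"{var} must be set to use model {model!r}")
```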

Set up the eval LLM

# eval_setup.py
from phoenix.evals import LLM

# The `<provider>/` prefix tells LiteLLM which backend to dispatch to
# and which env var to read.
llm = LLM(provider="litellm", model="openai/gpt-5")
Swap openai/gpt-5 for anthropic/claude-sonnet-4-6-20250929, mistral/mistral-large-latest, bedrock/us.anthropic.claude-sonnet-4-6, ollama/llama3, etc. — same evaluator code, different backend.
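Because the backend is just a string, you can lift it into configuration so the switch never touches eval code. A minimal sketch; the EVAL_MODEL env var name is our own convention, not something Phoenix reads:

```python
import os

# EVAL_MODEL is an illustrative env var of our choosing; one
# deployment-level change reroutes the judge to a different backend.
DEFAULT_MODEL = "openai/gpt-5"

def eval_model() -> str:
    return os.environ.get("EVAL_MODEL", DEFAULT_MODEL)

# Then, exactly as in eval_setup.py:
# llm = LLM(provider="litellm", model=eval_model())
```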

Run an evaluation

This example builds a hallucination classifier and grades two sample question/answer pairs against a reference. The pattern generalizes: replace the prompt template, choices, and DataFrame columns with whatever metric you want to evaluate.
# example.py
import pandas as pd

from phoenix.evals import LLM, create_classifier, evaluate_dataframe

llm = LLM(provider="litellm", model="openai/gpt-5")

HALLUCINATION_PROMPT = """\
Determine whether the answer below is factually supported by the
reference. Reply with exactly one of: factual, hallucinated.

Question: {input}
Answer: {output}
Reference: {reference}
"""

evaluator = create_classifier(
    name="hallucination",
    prompt_template=HALLUCINATION_PROMPT,
    llm=llm,
    # `choices` maps each label the LLM may emit to a numeric score.
    # `direction="maximize"` (the default) means higher score is better.
    choices={"factual": 1.0, "hallucinated": 0.0},
)

df = pd.DataFrame([
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
])

results = evaluate_dataframe(dataframe=df, evaluators=[evaluator])

# `hallucination_score` is a Score row (a dict-like with `score`, `label`,
# `explanation`, …) — pull the numeric out for a flat display column.
results["score"] = results["hallucination_score"].apply(lambda r: r["score"])
print(results[["input", "output", "score"]].to_string())

Expected output

                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0
The full returned DataFrame also includes hallucination_execution_details (status, exceptions, timing) and the original hallucination_score column holding each evaluator result's full dict (name, score, label, explanation, metadata, kind, direction). Those fields are useful for surfacing the LLM's reasoning, persisting eval rows back to Arize AX, or filtering retries.
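To put the judge's reasoning alongside the score, flatten the other fields the same way the script above flattens score. A sketch, assuming each *_score cell exposes the label and explanation keys described above:

```python
def flatten_scores(results, name: str = "hallucination"):
    """Add flat <name>_label and <name>_explanation columns next to the
    dict-valued <name>_score column (keys as described in the docs above)."""
    col = f"{name}_score"
    results[f"{name}_label"] = results[col].apply(lambda r: r["label"])
    results[f"{name}_explanation"] = results[col].apply(lambda r: r["explanation"])
    return results
```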

Troubleshooting

  • 401 / 403 from the upstream provider. Verify the relevant env var is set (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.) and matches the <provider>/ prefix on your model id.
  • BadRequestError: LLM Provider NOT provided. The model id is missing its provider prefix — LiteLLM needs openai/gpt-5, not gpt-5. Check the LiteLLM provider docs for the exact prefix for your backend.
  • All rows return the same label. Your prompt template isn’t differentiating cases. Make sure each row’s {input}/{output}/{reference} columns expose enough context for the judge to discriminate, and that choices lists every label your prompt asks the LLM to emit.
  • Some rows fail with timeout / rate-limit. Pass max_retries= to evaluate_dataframe(...) (defaults to 3). For large batches, also pass initial_per_second_request_rate=... to LLM(...) to throttle.
  • Logging results back to Arize AX. This guide stops at producing the eval DataFrame. To attach those evals to existing spans in an Arize AX project, use log_evaluations_sync on arize.Client.
  • Routing through a self-hosted LiteLLM Proxy. Pass sync_client_kwargs={"api_base": "https://your-proxy.example.com", "api_key": "<proxy-key>"} to LLM(...) to point at a hosted LiteLLM gateway instead of letting the SDK call providers directly.
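The proxy routing in the last bullet can be sketched as below. The gateway URL is a placeholder and the LITELLM_PROXY_KEY env var name is our own choice, not a library convention:

```python
import os

# Placeholder gateway URL; substitute your LiteLLM Proxy deployment.
# LITELLM_PROXY_KEY is an illustrative env var name of our choosing.
proxy_kwargs = {
    "api_base": "https://litellm-proxy.example.com",
    "api_key": os.environ.get("LITELLM_PROXY_KEY", ""),
}

# Then, as described above:
# llm = LLM(provider="litellm", model="openai/gpt-5",
#           sync_client_kwargs=proxy_kwargs)
```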

Resources

Phoenix Evals Documentation

arize-phoenix-evals on PyPI

Phoenix Evals Source

LiteLLM Tracing (instrument app calls)