The arize-phoenix-evals library uses an LLM-as-judge to grade model output for hallucinations, factuality, helpfulness, toxicity, and custom rubrics. LiteLLM is the universal proxy provider in Phoenix Evals: pass `provider="litellm"` and a `model="<provider>/<id>"` string to the `LLM(...)` wrapper to route the judge through any of the 100+ LiteLLM-supported backends. This is useful when no native Phoenix adapter exists (Mistral, Bedrock, Together, Groq, Ollama, etc.), or when you want one piece of eval code to switch backends with a single string change.
## Prerequisites
- Python 3.11+
- An API key for whichever upstream provider you want LiteLLM to route to. The example below uses OpenAI (`OPENAI_API_KEY`).
## Install
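A minimal install sketch, assuming the package is published as `arize-phoenix-evals` and that `litellm` must be installed alongside it (pandas is used by the DataFrame example below):

```bash
pip install arize-phoenix-evals litellm pandas
```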
## Configure credentials
Set the env var for whichever upstream provider you're targeting. LiteLLM reads the matching env var based on the `<provider>/` prefix on the model id:
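For example, targeting OpenAI:

```bash
# LiteLLM picks the credential that matches the model id's <provider>/ prefix.
export OPENAI_API_KEY="sk-..."

# Other backends read their own env vars, e.g.:
# export ANTHROPIC_API_KEY="..."
# export MISTRAL_API_KEY="..."
```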
## Set up the eval LLM
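A minimal sketch, assuming the `phoenix.evals.llm` import path from the arize-phoenix-evals 2.x package; the essential part is the `provider="litellm"` string plus the prefixed model id:

```python
from phoenix.evals.llm import LLM  # import path assumed from arize-phoenix-evals 2.x

# provider="litellm" routes every judge call through LiteLLM;
# the model id carries its own "<provider>/" prefix.
llm = LLM(provider="litellm", model="openai/gpt-5")
```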
Swap `openai/gpt-5` for `anthropic/claude-sonnet-4-6-20250929`, `mistral/mistral-large-latest`, `bedrock/us.anthropic.claude-sonnet-4-6`, `ollama/llama3`, etc.: same evaluator code, different backend.
## Run an evaluation
This example builds a hallucination classifier and grades two sample question/answer pairs against a reference. The pattern generalizes: replace the prompt template, choices, and DataFrame columns with whatever metric you want to evaluate.
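A sketch of the full flow, assuming the `create_classifier` and `evaluate_dataframe` helpers and their import paths from arize-phoenix-evals 2.x; the template placeholders, `choices`, and evaluator name line up with the columns described under Expected output:

```python
import pandas as pd

from phoenix.evals import create_classifier, evaluate_dataframe  # assumed 2.x imports
from phoenix.evals.llm import LLM

llm = LLM(provider="litellm", model="openai/gpt-5")

# Template placeholders must match the DataFrame column names.
prompt = """You are grading whether an answer is faithful to the reference text.

Question: {input}
Reference: {reference}
Answer: {output}

Is the answer supported by the reference? Respond with exactly one label:
"factual" or "hallucinated"."""

hallucination = create_classifier(
    name="hallucination",  # yields the hallucination_score /
    llm=llm,               # hallucination_execution_details columns
    prompt_template=prompt,
    choices={"factual": 1.0, "hallucinated": 0.0},  # every label the prompt can emit
)

df = pd.DataFrame(
    {
        "input": ["Who wrote Hamlet?", "Who wrote Hamlet?"],
        "output": ["William Shakespeare.", "Charles Dickens."],
        "reference": ["Hamlet is a tragedy by William Shakespeare."] * 2,
    }
)

results = evaluate_dataframe(dataframe=df, evaluators=[hallucination])
print(results[["hallucination_score", "hallucination_execution_details"]])
```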
## Expected output

`evaluate_dataframe(...)` returns your DataFrame with two added columns: `hallucination_execution_details` (status, exceptions, timing) and `hallucination_score`, which holds each evaluator result's full dict (`name`, `score`, `label`, `explanation`, `metadata`, `kind`, `direction`). That dict is useful for surfacing the LLM's reasoning, persisting eval rows back to Arize AX, or filtering retries.
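If each `hallucination_score` cell is indeed a plain dict, a small (hypothetical) post-processing step can lift the label and explanation into their own columns for filtering:

```python
# Assumes each hallucination_score cell is a plain dict as described above.
results["label"] = results["hallucination_score"].map(
    lambda r: r.get("label") if isinstance(r, dict) else None
)
results["explanation"] = results["hallucination_score"].map(
    lambda r: r.get("explanation") if isinstance(r, dict) else None
)
```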
## Troubleshooting
- 401/403 from the upstream provider. Verify the relevant env var is set (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) and matches the `<provider>/` prefix on your model id.
- `BadRequestError: LLM Provider NOT provided`. The model id is missing its provider prefix; LiteLLM needs `openai/gpt-5`, not `gpt-5`. Check the LiteLLM provider docs for the exact prefix for your backend.
- All rows return the same label. Your prompt template isn't differentiating cases. Make sure each row's `{input}`/`{output}`/`{reference}` columns expose enough context for the judge to discriminate, and that `choices` lists every label your prompt asks the LLM to emit.
- Some rows fail with timeouts or rate limits. Pass `max_retries=` to `evaluate_dataframe(...)` (defaults to 3). For large batches, also pass `initial_per_second_request_rate=...` to `LLM(...)` to throttle.
- Logging results back to Arize AX. This guide stops at producing the eval DataFrame. To attach those evals to existing spans in an Arize AX project, use `log_evaluations_sync` on `arize.Client`.
- Routing through a self-hosted LiteLLM Proxy. Pass `sync_client_kwargs={"api_base": "https://your-proxy.example.com", "api_key": "<proxy-key>"}` to `LLM(...)` to point at a hosted LiteLLM gateway instead of letting the SDK call providers directly (see the sketch after this list).