The Documentation Index
Fetch the complete documentation index at: https://arize-ax.mintlify.dev/docs/llms.txt
Use this file to discover all available pages before exploring further.
The arize-phoenix-evals library uses an LLM-as-judge to grade model output for hallucinations, factuality, helpfulness, toxicity, or custom rubrics. Phoenix Evals does not ship a native Mistral adapter, so you plug Mistral in via the LiteLLM proxy: pass provider="litellm" and model="mistral/<model-id>" to the LLM(...) wrapper, then build a create_classifier(...) evaluator and run it over a DataFrame with evaluate_dataframe(...).
Prerequisites
- Python 3.11+
- A MISTRAL_API_KEY from the Mistral AI Console
Install
litellm is the proxy that routes the eval calls to Mistral. You don’t need the mistralai SDK installed separately.
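A minimal install, assuming the evals library ships as the arize-phoenix-evals package on PyPI and that pandas is used for the eval DataFrame:

```bash
pip install arize-phoenix-evals litellm pandas
```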
Configure credentials
LiteLLM reads MISTRAL_API_KEY from the environment when it sees a mistral/... model id.
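For example, set the key in-process before creating the eval LLM (a minimal sketch; you can equally export the variable in your shell):

```python
import os
from getpass import getpass

# LiteLLM looks up MISTRAL_API_KEY in the environment for mistral/... model ids.
if not os.environ.get("MISTRAL_API_KEY"):
    os.environ["MISTRAL_API_KEY"] = getpass("Mistral API key: ")
```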
Set up the eval LLM
mistral/mistral-large-latest is a strong default judge; for cheaper batch evals swap in mistral/mistral-small-latest. The judge’s job is classification, not generation, so a smaller model is often sufficient.
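A sketch of the judge setup, assuming the Phoenix Evals 2.x import path phoenix.evals.llm.LLM:

```python
from phoenix.evals.llm import LLM

# Route Phoenix Evals through the LiteLLM proxy to Mistral.
# Swap in mistral/mistral-small-latest to cut cost on large batches.
llm = LLM(provider="litellm", model="mistral/mistral-large-latest")
```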
Run an evaluation
This example builds a hallucination classifier and grades two sample question/answer pairs against a reference. The pattern generalizes: replace the prompt template, choices, and DataFrame columns with whatever metric you want to evaluate.
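A sketch of the full flow, assuming the Phoenix Evals 2.x entry points create_classifier and evaluate_dataframe; the prompt wording, sample rows, and column names are illustrative placeholders:

```python
import pandas as pd
from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM

llm = LLM(provider="litellm", model="mistral/mistral-large-latest")

# Binary hallucination rubric: the judge labels each answer against the reference.
hallucination_prompt = """
Given the question, the reference text, and the answer, decide whether the
answer contains information that is not supported by the reference.

Question: {input}
Reference: {reference}
Answer: {output}

Respond with exactly one label: "hallucinated" or "factual".
"""

hallucination_evaluator = create_classifier(
    name="hallucination",
    prompt_template=hallucination_prompt,
    llm=llm,
    choices={"factual": 1.0, "hallucinated": 0.0},
)

# Two sample rows: one grounded answer, one fabricated one.
df = pd.DataFrame(
    {
        "input": [
            "Who wrote 'Pride and Prejudice'?",
            "What is the boiling point of water at sea level?",
        ],
        "reference": [
            "Pride and Prejudice is an 1813 novel by Jane Austen.",
            "At sea level, water boils at 100 degrees Celsius (212 degrees Fahrenheit).",
        ],
        "output": [
            "Jane Austen wrote Pride and Prejudice.",
            "Water boils at 150 degrees Celsius at sea level.",
        ],
    }
)

results_df = evaluate_dataframe(dataframe=df, evaluators=[hallucination_evaluator])
print(results_df.head())
```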
Expected output
The returned DataFrame keeps the original columns and adds hallucination_execution_details (status, exceptions, timing) and hallucination_score, which holds each evaluator result’s full dict (name, score, label, explanation, metadata, kind, direction). That dict is useful for surfacing the LLM’s reasoning, persisting eval rows back to Arize AX, or filtering retries.
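For instance, to pull the label and the judge's explanation out of each result dict (column names as described above; treat as a sketch):

```python
# Each hallucination_score cell is a dict; surface the label and the judge's reasoning.
for _, row in results_df.iterrows():
    result = row["hallucination_score"]
    print(row["input"])
    print(f"  label: {result['label']}  score: {result['score']}")
    print(f"  explanation: {result['explanation']}")
```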
Troubleshooting
- 401/403 from Mistral. Verify MISTRAL_API_KEY is set and has access to the model. Generate a new key at console.mistral.ai.
- model_not_found or 404. Confirm the model id is correct: LiteLLM expects mistral/<id> (e.g. mistral/mistral-large-latest, mistral/mistral-small-latest). See the LiteLLM Mistral provider docs for the current list.
- All rows return the same label. Your prompt template isn’t differentiating cases. Make sure each row’s {input}/{output}/{reference} columns expose enough context for the judge to discriminate, and that choices lists every label your prompt asks the LLM to emit.
- Some rows fail with timeout / rate limit. Pass max_retries= to evaluate_dataframe(...) (defaults to 3). For large batches, also pass initial_per_second_request_rate=... to LLM(...) to throttle; see the sketch after this list.
- Logging results back to Arize AX. This guide stops at producing the eval DataFrame. To attach those evals to existing spans in an Arize AX project, use log_evaluations_sync on arize.Client.
- Using a different LiteLLM-supported provider. Swap the mistral/... prefix for any LiteLLM-supported model; provider="litellm" is the generic escape hatch when no native Phoenix adapter exists.
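Following the two rate-limit knobs noted above (parameter names taken from those notes, so treat this as a sketch), a throttled rerun that reuses the DataFrame and prompt from the earlier example:

```python
from phoenix.evals import create_classifier, evaluate_dataframe
from phoenix.evals.llm import LLM

# Lower the client-side request rate, then allow more retries per row.
throttled_llm = LLM(
    provider="litellm",
    model="mistral/mistral-small-latest",
    initial_per_second_request_rate=2.0,
)
evaluator = create_classifier(
    name="hallucination",
    prompt_template=hallucination_prompt,  # defined in the example above
    llm=throttled_llm,
    choices={"factual": 1.0, "hallucinated": 0.0},
)
results_df = evaluate_dataframe(dataframe=df, evaluators=[evaluator], max_retries=5)
```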