
The arize-phoenix-evals library uses an LLM as a judge to grade model output for hallucinations, factuality, helpfulness, toxicity, or custom rubrics. Plug Vertex AI in as the judge by passing provider="vertex" to the LLM(...) wrapper, then build a create_classifier(...) evaluator and run it over a DataFrame with evaluate_dataframe(...).

Prerequisites

  • Python 3.11+
  • A Google Cloud project with the Vertex AI API enabled
  • A service account or user with the roles/aiplatform.user IAM role
  • Authenticated Application Default Credentials (gcloud auth application-default login) or a service account JSON file referenced by GOOGLE_APPLICATION_CREDENTIALS

Install

pip install arize-phoenix-evals litellm google-auth pandas

The vertex provider dispatches via the LiteLLM backend to the regional aiplatform.googleapis.com endpoint. google-auth is required so LiteLLM can resolve Application Default Credentials; without it, the first eval call fails with ModuleNotFoundError: No module named 'google'.
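
To confirm the install before wiring up credentials, a quick import check catches the missing-google-auth case early. This is a minimal sketch; the module names simply mirror the packages in the install line above.
# check_install.py — minimal import check (sketch); if any of these
# imports fail, revisit the pip install line above.
import google.auth   # provided by google-auth; LiteLLM needs it to resolve ADC
import litellm
import pandas
from phoenix.evals import LLM, create_classifier, evaluate_dataframe

print("all eval dependencies import cleanly")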

Configure credentials

Vertex AI uses Google Cloud auth, not an API key. Authenticate locally and tell the SDK which project/region to target:
# Recommended for local dev — uses your gcloud user credentials.
gcloud auth application-default login

# Or, with a service account JSON file:
# export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

export VERTEXAI_PROJECT="<your-gcp-project-id>"
export VERTEXAI_LOCATION="us-central1"  # optional; LiteLLM defaults to us-central1

VERTEXAI_PROJECT is mandatory — the SDK exits with Could not resolve project_id if it isn’t set. VERTEXAI_LOCATION is optional and defaults to us-central1; set it explicitly when you need a different region (e.g. europe-west1 for EU residency, or to match where the target model is enabled).
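
Before running any evals, you can confirm that ADC and the environment variables resolve as expected. The snippet below is an optional sanity-check sketch built on google.auth.default(), which returns the active credentials and the project they are bound to.
# check_auth.py — optional sanity check (sketch) that credentials and
# project/region settings resolve before the first eval call.
import os

import google.auth

# google.auth.default() raises DefaultCredentialsError if no ADC is found.
credentials, adc_project = google.auth.default()

print("ADC resolved for project:", adc_project)
print("VERTEXAI_PROJECT:", os.environ.get("VERTEXAI_PROJECT", "<not set>"))
print("VERTEXAI_LOCATION:", os.environ.get("VERTEXAI_LOCATION", "us-central1 (default)"))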

Set up the eval LLM

# eval_setup.py
from phoenix.evals import LLM

# `provider="vertex"` dispatches via the LiteLLM backend to the
# Vertex AI endpoint, picking up VERTEXAI_PROJECT, VERTEXAI_LOCATION,
# and Application Default Credentials from the environment.
llm = LLM(provider="vertex", model="gemini-2.5-flash")

gemini-2.5-flash is a strong default judge — fast and cheap relative to gemini-2.5-pro. The judge’s job is classification, not generation, so a smaller model is often sufficient.
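
If a rubric needs more reasoning from the judge, only the model name changes. A one-line sketch (check that the model is enabled in your VERTEXAI_LOCATION):
# Heavier judge for nuanced rubrics; everything downstream stays the same.
llm = LLM(provider="vertex", model="gemini-2.5-pro")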

Run an evaluation

This example builds a hallucination classifier and grades two sample question/answer pairs against a reference. The pattern generalizes: replace the prompt template, choices, and DataFrame columns with whatever metric you want to evaluate.
# example.py
import pandas as pd

from phoenix.evals import LLM, create_classifier, evaluate_dataframe

llm = LLM(provider="vertex", model="gemini-2.5-flash")

HALLUCINATION_PROMPT = """\
Determine whether the answer below is factually supported by the
reference. Reply with exactly one of: factual, hallucinated.

Question: {input}
Answer: {output}
Reference: {reference}
"""

evaluator = create_classifier(
    name="hallucination",
    prompt_template=HALLUCINATION_PROMPT,
    llm=llm,
    # `choices` maps each label the LLM may emit to a numeric score.
    # `direction="maximize"` (the default) means higher score is better.
    choices={"factual": 1.0, "hallucinated": 0.0},
)

df = pd.DataFrame([
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
])

results = evaluate_dataframe(dataframe=df, evaluators=[evaluator])

# `hallucination_score` is a Score row (a dict-like with `score`, `label`,
# `explanation`, …) — pull the numeric out for a flat display column.
results["score"] = results["hallucination_score"].apply(lambda r: r["score"])
print(results[["input", "output", "score"]].to_string())

Expected output

                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0

The full returned DataFrame also includes hallucination_execution_details (status + exceptions + timing) and the original hallucination_score column, which holds each evaluator result’s full dict (name, score, label, explanation, metadata, kind, direction). Those fields are useful for surfacing the LLM’s reasoning, persisting eval rows back to Arize AX, or filtering retries.
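
For example, the same dict-style access used for score above works for the other fields. A short sketch that surfaces the judge’s label and explanation per row:
# Pull label and explanation out of the per-row result dicts (sketch,
# using the field names listed above).
results["label"] = results["hallucination_score"].apply(lambda r: r["label"])
results["explanation"] = results["hallucination_score"].apply(lambda r: r["explanation"])
print(results[["input", "label", "explanation"]].to_string())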

Troubleshooting

  • ModuleNotFoundError: No module named 'google'. The google-auth package isn’t installed. Add it to your install line (pip install ... google-auth ...), or install litellm[google], which pulls in the full google-cloud-aiplatform SDK along with its auth dependencies.
  • Permission denied on resource project ... / PERMISSION_DENIED. The principal in your ADC doesn’t have roles/aiplatform.user (or finer-grained Vertex permissions) on the project, or you authenticated with end-user credentials that have no quota project. Grant the role in the IAM console, then run gcloud auth application-default set-quota-project <PROJECT_ID>.
  • Reauthentication needed / expired credentials. Run gcloud auth application-default login again, or rotate the service account key referenced by GOOGLE_APPLICATION_CREDENTIALS.
  • Could not resolve project_id. VERTEXAI_PROJECT isn’t set and ADC didn’t surface a default project. Either export VERTEXAI_PROJECT explicitly or run gcloud config set project <PROJECT_ID> before gcloud auth application-default login.
  • 404 NOT_FOUND for the model. The model isn’t available in the region you set for VERTEXAI_LOCATION (or in the default us-central1 if you didn’t set one). Check the Vertex AI generative model availability matrix and swap regions accordingly.
  • All rows return the same label. Your prompt template isn’t differentiating cases. Make sure each row’s {input}/{output}/{reference} columns expose enough context for the judge to discriminate, and that choices lists every label your prompt asks the LLM to emit.
  • Some rows fail with timeout / rate-limit. Pass max_retries= to evaluate_dataframe(...) (defaults to 3). For large batches, also pass initial_per_second_request_rate=... to LLM(...) to throttle; see the sketch after this list.
  • Logging results back to Arize AX. This guide stops at producing the eval DataFrame. To attach those evals to existing spans in an Arize AX project, use log_evaluations_sync on arize.Client.
  • Using the Gemini API instead of Vertex. Set GEMINI_API_KEY and switch to provider="google" — see the Gemini evals doc for the full pattern.
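
A minimal sketch of the retry and throttle knobs together, reusing the evaluator and DataFrame from example.py (parameter names as described in the troubleshooting notes above):
# throttled_eval.py — sketch only; build `evaluator` and `df` as in example.py.
from phoenix.evals import LLM, evaluate_dataframe

llm = LLM(
    provider="vertex",
    model="gemini-2.5-flash",
    initial_per_second_request_rate=5,  # start at 5 requests/second
)

# Reuse `evaluator` and `df` from example.py, then allow extra retries:
results = evaluate_dataframe(
    dataframe=df,
    evaluators=[evaluator],
    max_retries=5,  # default is 3
)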

Resources

Phoenix Evals Documentation

arize-phoenix-evals on PyPI

Phoenix Evals Source

Vertex AI Tracing (instrument app calls)