
The arize-phoenix-evals library uses an LLM as a judge to grade model output for hallucinations, factuality, helpfulness, toxicity, or custom rubrics. Plug Vertex AI in as the judge by passing provider="vertex" to the LLM(...) wrapper, then build a create_classifier(...) evaluator and run it over a DataFrame with evaluate_dataframe(...).

Prerequisites

  • Python 3.11+
  • A Google Cloud project with the Vertex AI API enabled
  • A service account or user with the roles/aiplatform.user IAM role
  • Authenticated Application Default Credentials (gcloud auth application-default login) or a service account JSON file referenced by GOOGLE_APPLICATION_CREDENTIALS

Install

pip install arize-phoenix-evals litellm google-auth pandas

The vertex provider dispatches via the LiteLLM backend to the regional aiplatform.googleapis.com endpoint. google-auth is required so LiteLLM can resolve Application Default Credentials; without it, the first eval call fails with ModuleNotFoundError: No module named 'google'.
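
To confirm the install before wiring up credentials, a quick import check catches the missing-google-auth case early. This is a minimal sketch; the module names simply mirror the packages in the install line above.
# check_install.py — minimal import check (sketch); if any of these
# imports fail, revisit the pip install line above.
import google.auth   # provided by google-auth; LiteLLM needs it to resolve ADC
import litellm
import pandas
from phoenix.evals import LLM, create_classifier, evaluate_dataframe

print("all eval dependencies import cleanly")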

Configure credentials

Vertex AI uses Google Cloud auth, not an API key. Authenticate locally and tell the SDK which project/region to target:
# Recommended for local dev — uses your gcloud user credentials.
gcloud auth application-default login

# Or, with a service account JSON file:
# export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"

export VERTEXAI_PROJECT="<your-gcp-project-id>"
export VERTEXAI_LOCATION="us-central1"  # optional; LiteLLM defaults to us-central1

VERTEXAI_PROJECT is mandatory — the SDK exits with Could not resolve project_id if it isn’t set. VERTEXAI_LOCATION is optional and defaults to us-central1; set it explicitly when you need a different region (e.g. europe-west1 for EU residency, or to match where the target model is enabled).
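
Before running any evals, you can confirm that ADC and the environment variables resolve as expected. The snippet below is an optional sanity-check sketch built on google.auth.default(), which returns the active credentials and the project they are bound to.
# check_auth.py — optional sanity check (sketch) that credentials and
# project/region settings resolve before the first eval call.
import os

import google.auth

# google.auth.default() raises DefaultCredentialsError if no ADC is found.
credentials, adc_project = google.auth.default()

print("ADC resolved for project:", adc_project)
print("VERTEXAI_PROJECT:", os.environ.get("VERTEXAI_PROJECT", "<not set>"))
print("VERTEXAI_LOCATION:", os.environ.get("VERTEXAI_LOCATION", "us-central1 (default)"))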

Set up the eval LLM

# eval_setup.py
from phoenix.evals import LLM

# `provider="vertex"` dispatches via the LiteLLM backend to the
# Vertex AI endpoint, picking up VERTEXAI_PROJECT, VERTEXAI_LOCATION,
# and Application Default Credentials from the environment.
llm = LLM(provider="vertex", model="gemini-2.5-flash")

gemini-2.5-flash is a strong default judge — fast and cheap relative to gemini-2.5-pro. The judge’s job is classification, not generation, so a smaller model is often sufficient.
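
If a rubric needs more reasoning from the judge, only the model name changes. A one-line sketch (check that the model is enabled in your VERTEXAI_LOCATION):
# Heavier judge for nuanced rubrics; everything downstream stays the same.
llm = LLM(provider="vertex", model="gemini-2.5-pro")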

Run an evaluation

This example builds a hallucination classifier and grades two sample question/answer pairs against a reference. The pattern generalizes: replace the prompt template, choices, and DataFrame columns with whatever metric you want to evaluate.
# example.py
import pandas as pd

from phoenix.evals import LLM, create_classifier, evaluate_dataframe

llm = LLM(provider="vertex", model="gemini-2.5-flash")

HALLUCINATION_PROMPT = """\
Determine whether the answer below is factually supported by the
reference. Reply with exactly one of: factual, hallucinated.

Question: {input}
Answer: {output}
Reference: {reference}
"""

evaluator = create_classifier(
    name="hallucination",
    prompt_template=HALLUCINATION_PROMPT,
    llm=llm,
    # `choices` maps each label the LLM may emit to a numeric score.
    # `direction="maximize"` (the default) means higher score is better.
    choices={"factual": 1.0, "hallucinated": 0.0},
)

df = pd.DataFrame([
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
])

results = evaluate_dataframe(dataframe=df, evaluators=[evaluator])

# `hallucination_score` is a Score row (a dict-like with `score`, `label`,
# `explanation`, …) — pull the numeric out for a flat display column.
results["score"] = results["hallucination_score"].apply(lambda r: r["score"])
print(results[["input", "output", "score"]].to_string())

Expected output

                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0

The full returned DataFrame also includes hallucination_execution_details (status + exceptions + timing) and the original hallucination_score column, which holds each evaluator result’s full dict (name, score, label, explanation, metadata, kind, direction). Those fields are useful for surfacing the LLM’s reasoning, persisting eval rows back to Arize AX, or filtering retries.
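
For example, the same dict-style access used for score above works for the other fields. A short sketch that surfaces the judge’s label and explanation per row:
# Pull label and explanation out of the per-row result dicts (sketch,
# using the field names listed above).
results["label"] = results["hallucination_score"].apply(lambda r: r["label"])
results["explanation"] = results["hallucination_score"].apply(lambda r: r["explanation"])
print(results[["input", "label", "explanation"]].to_string())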

Troubleshooting

  • ModuleNotFoundError: No module named 'google'. The google-auth package isn’t installed. Add it to your install line (pip install ... google-auth ...), or install litellm[google], which pulls in the full google-cloud-aiplatform SDK along with its auth dependencies.
  • Permission denied on resource project ... / PERMISSION_DENIED. The principal in your ADC doesn’t have roles/aiplatform.user (or finer-grained Vertex permissions) on the project, or you authenticated with end-user credentials that have no quota project. Grant the role in the IAM console, then run gcloud auth application-default set-quota-project <PROJECT_ID>.
  • Reauthentication needed / expired credentials. Run gcloud auth application-default login again, or rotate the service account key referenced by GOOGLE_APPLICATION_CREDENTIALS.
  • Could not resolve project_id. VERTEXAI_PROJECT isn’t set and ADC didn’t surface a default project. Either export VERTEXAI_PROJECT explicitly or run gcloud config set project <PROJECT_ID> before gcloud auth application-default login.
  • 404 NOT_FOUND for the model. The model isn’t available in the region you set for VERTEXAI_LOCATION (or in the default us-central1 if you didn’t set one). Check the Vertex AI generative model availability matrix and swap regions accordingly.
  • All rows return the same label. Your prompt template isn’t differentiating cases. Make sure each row’s {input}/{output}/{reference} columns expose enough context for the judge to discriminate, and that choices lists every label your prompt asks the LLM to emit.
  • Some rows fail with timeout / rate-limit. Pass max_retries= to evaluate_dataframe(...) (defaults to 3). For large batches, also pass initial_per_second_request_rate=... to LLM(...) to throttle; see the sketch after this list.
  • Logging results back to Arize AX. This guide stops at producing the eval DataFrame. To attach those evals to existing spans in an Arize AX project, use log_evaluations_sync on arize.Client.
  • Using the Gemini API instead of Vertex. Set GEMINI_API_KEY and switch to provider="google" — see the Gemini evals doc for the full pattern.
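
A minimal sketch of the retry and throttle knobs together, reusing the evaluator and DataFrame from example.py (parameter names as described in the troubleshooting notes above):
# throttled_eval.py — sketch only; build `evaluator` and `df` as in example.py.
from phoenix.evals import LLM, evaluate_dataframe

llm = LLM(
    provider="vertex",
    model="gemini-2.5-flash",
    initial_per_second_request_rate=5,  # start at 5 requests/second
)

# Reuse `evaluator` and `df` from example.py, then allow extra retries:
results = evaluate_dataframe(
    dataframe=df,
    evaluators=[evaluator],
    max_retries=5,  # default is 3
)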

Resources

Phoenix Evals Documentation

arize-phoenix-evals on PyPI

Phoenix Evals Source

Vertex AI Tracing (instrument app calls)