Amazon Bedrock

The arize-phoenix-evals library uses an LLM-as-judge to grade model output — hallucinations, factuality, helpfulness, toxicity, custom rubrics. Plug Bedrock-hosted models in as the judge by passing provider="bedrock" to the LLM(...) wrapper, then build a create_classifier(...) evaluator and run it over a DataFrame with evaluate_dataframe(...).

Prerequisites

Python 3.11+
AWS credentials with bedrock:InvokeModel permission on the model you want to judge with
The target foundation model enabled in your AWS region’s Bedrock model access page

Install

pip install arize-phoenix-evals litellm boto3 pandas

The bedrock provider uses the LiteLLM backend under the hood; boto3 provides the AWS SDK and SigV4 signing.

Configure credentials

The bedrock provider picks up the standard AWS credential chain — env vars, shared credentials file (~/.aws/credentials), or an attached IAM role. Set the env vars directly if you don’t already have AWS credentials configured:

export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_REGION="us-east-1"          # region where the model is enabled
# Optional, only if you're using STS / SSO short-term credentials:
# export AWS_SESSION_TOKEN="..."
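
Before constructing the judge, it can help to confirm that the variables above are visible to the Python process. The following is a minimal sketch using only the standard library. The helper name missing_aws_env is hypothetical, and it only covers the env-var path: it reports variables as missing even when a shared credentials file or an IAM role would supply credentials.

```python
import os

# Variables from the export block above; AWS_SESSION_TOKEN is optional.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_env(env=None):
    """Return the required AWS variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_env()
    if missing:
        print("Not set in this shell:", ", ".join(missing))
    else:
        print("All expected AWS variables are set.")
```

An empty result only means the variables exist; whether the credentials are valid is still best checked with aws sts get-caller-identity.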

Set up the eval LLM

# eval_setup.py
from phoenix.evals import LLM

# `provider="bedrock"` dispatches via the LiteLLM backend to the Bedrock
# Runtime API, picking up the ambient AWS credentials.
llm = LLM(provider="bedrock", model="us.anthropic.claude-sonnet-4-6")

Bedrock model ids vary by region and provider — many Anthropic models on Bedrock require the cross-region us. / eu. inference profile prefix shown above. See the Bedrock model catalog for the id to use in your region.
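
As an illustration of the prefix convention, the sketch below builds a prefixed id from a base model id and a region name. The helper with_inference_profile and the rule "take the first segment of the region name" are assumptions made for this example, not Bedrock behavior. Only the us. and eu. prefixes mentioned above are covered, and the model catalog remains the authority on which ids exist.

```python
# Prefixes named in this guide; other geographies are deliberately not guessed.
PROFILE_PREFIXES = {"us": "us.", "eu": "eu."}


def with_inference_profile(base_model_id, region):
    """Prefix a base model id with the inference-profile prefix for `region`.

    Assumes the prefix matches the first segment of the region name
    (us-east-1 -> "us."). This is an illustration, not a Bedrock rule.
    """
    # GovCloud regions start with "us-" but are not covered by this sketch.
    geo = "us-gov" if region.startswith("us-gov-") else region.split("-", 1)[0]
    if geo not in PROFILE_PREFIXES:
        raise ValueError(
            f"No prefix known for region {region!r}; check the Bedrock model catalog"
        )
    prefix = PROFILE_PREFIXES[geo]
    # Leave ids that already carry the prefix untouched.
    if base_model_id.startswith(prefix):
        return base_model_id
    return prefix + base_model_id


print(with_inference_profile("anthropic.claude-sonnet-4-6", "us-east-1"))
# us.anthropic.claude-sonnet-4-6
```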

Run an evaluation

This example builds a hallucination classifier and grades two sample question/answer pairs against a reference. The pattern generalizes: replace the prompt template, choices, and DataFrame columns with whatever metric you want to evaluate.

# example.py
import pandas as pd

from phoenix.evals import LLM, create_classifier, evaluate_dataframe

llm = LLM(provider="bedrock", model="us.anthropic.claude-sonnet-4-6")

HALLUCINATION_PROMPT = """\
Determine whether the answer below is factually supported by the
reference. Reply with exactly one of: factual, hallucinated.

Question: {input}
Answer: {output}
Reference: {reference}
"""

evaluator = create_classifier(
    name="hallucination",
    prompt_template=HALLUCINATION_PROMPT,
    llm=llm,
    # `choices` maps each label the LLM may emit to a numeric score.
    # `direction="maximize"` (the default) means higher score is better.
    choices={"factual": 1.0, "hallucinated": 0.0},
)

df = pd.DataFrame([
    {
        "input":     "What is the capital of France?",
        "output":    "Paris is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
    {
        "input":     "What is the capital of France?",
        "output":    "Berlin is the capital of France.",
        "reference": "Paris is the capital and most populous city of France.",
    },
])

results = evaluate_dataframe(dataframe=df, evaluators=[evaluator])

# `hallucination_score` is a Score row (a dict-like with `score`, `label`,
# `explanation`, …) — pull the numeric out for a flat display column.
results["score"] = results["hallucination_score"].apply(lambda r: r["score"])
print(results[["input", "output", "score"]].to_string())
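
When adapting the example to another metric, a common slip is a placeholder in the prompt template that has no matching DataFrame column. A pre-flight check along these lines may catch that before any Bedrock call is made. It is a standard-library sketch that assumes format-style {name} placeholders as in the template above; template_fields and missing_columns are hypothetical helper names, not part of phoenix.evals.

```python
from string import Formatter


def template_fields(template):
    """Return the set of {placeholder} names used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


def missing_columns(template, columns):
    """Return placeholders that have no matching column, sorted by name."""
    return sorted(template_fields(template) - set(columns))


PROMPT = "Question: {input}\nAnswer: {output}\nReference: {reference}\n"

print(missing_columns(PROMPT, ["input", "output", "reference"]))  # []
print(missing_columns(PROMPT, ["input", "output"]))               # ['reference']
```

With the example above, the call would be missing_columns(HALLUCINATION_PROMPT, df.columns).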

Expected output

                            input                            output  score
0  What is the capital of France?   Paris is the capital of France.    1.0
1  What is the capital of France?  Berlin is the capital of France.    0.0

The full returned DataFrame also includes hallucination_execution_details (status, exceptions, and timing) and the original hallucination_score column, which holds each evaluator result’s full dict (name, score, label, explanation, metadata, kind, direction). These columns are useful for surfacing the LLM’s reasoning, persisting eval rows back to Arize AX, or filtering rows to retry.
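
To show how the score dict might be unpacked beyond the numeric score, here is a small sketch that works on plain row dicts (for example, the output of results.to_dict("records")). The helper flatten_scores and the dotted key names are hypothetical. Treating string cells as JSON is an assumption added for robustness, and the sample row is hand-written, not real model output.

```python
import json


def flatten_scores(records, score_column, fields=("score", "label", "explanation")):
    """Copy selected fields of each score cell into top-level keys.

    `records` is a list of row dicts. A cell may be a dict, a JSON string
    (assumption), or a missing value for a row whose evaluation failed.
    """
    flat = []
    for row in records:
        cell = row.get(score_column)
        if isinstance(cell, str):
            cell = json.loads(cell)  # assumption: string cells hold JSON
        if not isinstance(cell, dict):
            cell = {}  # failed or missing rows yield None for every field
        extra = {f"{score_column}.{field}": cell.get(field) for field in fields}
        flat.append({**row, **extra})
    return flat


# Hand-written sample row that mimics the shape described above.
sample = [{"input": "q", "hallucination_score": {"score": 0.0, "label": "hallucinated"}}]
print(flatten_scores(sample, "hallucination_score")[0]["hallucination_score.label"])
# hallucinated
```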

Troubleshooting

AccessDeniedException / UnrecognizedClientException. Your AWS credentials don’t have bedrock:InvokeModel on the target model, or the credentials aren’t being picked up. Verify with aws sts get-caller-identity and confirm the role has Bedrock permissions.
ValidationException: ... on-demand throughput isn't supported. The base Anthropic model id (e.g. anthropic.claude-sonnet-4-6) requires a cross-region inference profile. Switch to the id with a regional prefix: us.anthropic.claude-sonnet-4-6 for US, or the corresponding eu. id for EU.
AccessDeniedException: You don't have access to the model. The model isn’t enabled in your region. Enable it on the Bedrock model access page.
All rows return the same label. Your prompt template isn’t differentiating cases. Make sure each row’s {input}/{output}/{reference} columns expose enough context for the judge to discriminate, and that choices lists every label your prompt asks the LLM to emit.
Some rows fail with timeout / rate-limit. Pass max_retries= to evaluate_dataframe(...) (defaults to 3). For large batches, also pass initial_per_second_request_rate=... to LLM(...) to throttle.
Logging results back to Arize AX. This guide stops at producing the eval DataFrame. To attach those evals to existing spans in an Arize AX project, use log_evaluations_sync on arize.Client.
Assuming a role from a different account. Use boto3.client("sts").assume_role(...), export the temporary credentials as env vars, then call LLM(...) — the provider will pick them up on the next request.
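
The cross-account flow in the last item could be sketched as follows. The helper names and the default session name are hypothetical. The response keys (Credentials, AccessKeyId, SecretAccessKey, SessionToken) follow the STS AssumeRole response shape, and boto3 is imported lazily so the mapping helper can be used without it. The temporary credentials expire, so a long-running job would need to repeat the call.

```python
import os


def credentials_to_env(assume_role_response):
    """Map an STS AssumeRole response to the AWS credential env var names."""
    creds = assume_role_response["Credentials"]
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }


def assume_role_into_env(role_arn, session_name="phoenix-evals"):
    """Assume `role_arn` and export the temporary credentials to os.environ."""
    import boto3  # lazy import: only needed when actually calling STS

    response = boto3.client("sts").assume_role(
        RoleArn=role_arn, RoleSessionName=session_name
    )
    os.environ.update(credentials_to_env(response))
```

Calling assume_role_into_env("<role-arn>") before LLM(provider="bedrock", ...) sets the same variables as the export block in Configure credentials, for the current process only.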

Resources

Phoenix Evals Documentation

arize-phoenix-evals on PyPI

Phoenix Evals Source

Amazon Bedrock Tracing (instrument app calls)
