
In this tutorial we will:
  1. Build a RAG application using Llama-Index
  2. Set up Phoenix as a trace collector for the Llama-Index application
  3. Use Phoenix’s evals library to compute LLM generated evaluations of our RAG app responses
  4. Use the Arize SDK to export the traces and evaluations to Arize AX
You can read more about LLM tracing in Arize AX here.

Install Dependencies

Let’s set up the notebook with its dependencies.
# Dependencies needed to build the Llama Index RAG application
!pip install -qq gcsfs llama-index-llms-openai llama-index-embeddings-openai llama-index-core

# Dependencies needed to export spans and send them to our collector: Phoenix
!pip install -qq llama-index-callbacks-arize-phoenix

# Install Phoenix to generate evaluations
!pip install -qq "arize-phoenix[evals]>7.0.0"

# Install the Arize AX SDK with the `Tracing` extra to export Phoenix data to Arize AX
!pip install -qq "arize[Tracing]>7.29.0"

Set up Phoenix as a Trace Collector in our LLM app

To get started, launch the Phoenix app. Make sure to open the app in your browser using the link below.
import phoenix as px

session = px.launch_app()
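If the link does not appear in your environment, the session object returned by px.launch_app() also exposes the URL directly (a minimal sketch):
# Print the URL of the running Phoenix UI
print(session.url)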
Once you have started a Phoenix server, you can start your LlamaIndex application and configure it to send traces to Phoenix. To do this, you will have to configure Phoenix as the global handler.
from llama_index.core import set_global_handler

set_global_handler("arize_phoenix")
That’s it! The Llama-Index application we build next will send traces to Phoenix.

Build Your Llama Index RAG Application

Start by setting your OpenAI API key if it is not already set as an environment variable.
import os
from getpass import getpass

OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "πŸ”‘ Enter your OpenAI API key: "
)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
This example uses a RetrieverQueryEngine over a pre-built index of the Arize AX documentation, but you can use whatever LlamaIndex application you like. Download the pre-built index of the Arize AX docs from cloud storage and instantiate your storage context.
from gcsfs import GCSFileSystem
from llama_index.core import StorageContext

file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
)
We are now ready to instantiate the query engine that will perform retrieval-augmented generation (RAG). A query engine is a generic interface in LlamaIndex that allows you to ask questions over your data: it takes in a natural-language query and returns a rich response. Query engines are built on top of retrievers, and you can compose multiple query engines to achieve more advanced capabilities.
from llama_index.llms.openai import OpenAI
from llama_index.core import (
    Settings,
    load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding


Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
index = load_index_from_storage(
    storage_context,
)
query_engine = index.as_query_engine()
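If you want more control over retrieval, as_query_engine accepts optional keyword arguments; for example, similarity_top_k controls how many chunks are retrieved per query (the value 5 below is purely illustrative):
# Optional: retrieve the top 5 most similar chunks instead of the default
# query_engine = index.as_query_engine(similarity_top_k=5)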
Let’s test our app by asking a question about the Arize AX documentation:
response = query_engine.query(
    "What is Arize AX and how can it help me as an AI Engineer?"
)
print(response)
Great! Our application works!

Use the instrumented Query Engine

We will download a dataset of questions for our RAG application to answer.
from urllib.request import urlopen
import json

queries_url = "http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/arize_docs_queries.jsonl"
queries = []
with urlopen(queries_url) as response:
    for line in response:
        line = line.decode("utf-8").strip()
        data = json.loads(line)
        queries.append(data["query"])

queries[:5]
We use the instrumented query engine and get responses from our RAG app.
from tqdm.notebook import tqdm

N = 10  # Sample size
qa_pairs = []
for query in tqdm(queries[:N]):
    resp = query_engine.query(query)
    qa_pairs.append((query, resp))
To see the questions and answers in Phoenix, use the link printed when we launched the Phoenix server.
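If you no longer have the link handy, you can recover it from the active session (a minimal sketch using Phoenix’s active_session helper):
# Print the URL of the running Phoenix UI
print(px.active_session().url)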

Run Evaluations on the data in Phoenix

We will use the Phoenix client to extract data in the format each evaluation expects, and Phoenix’s evaluators to run the evaluations on our RAG application.
from phoenix.session.evaluation import get_qa_with_reference

px_client = px.Client()  # Define phoenix client
queries_df = get_qa_with_reference(
    px_client
)  # Get question, answer and reference data from phoenix
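A quick preview of the extracted data helps confirm the traces were captured; each row holds a question, the generated answer, and the retrieved reference context.
# Preview the question, answer, and reference data pulled from the traces
queries_df.head()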
Next, we enable concurrent evaluations for better performance.
import nest_asyncio

nest_asyncio.apply()  # needed for concurrent evals in notebook environments
Then, we define our evaluators and run the evaluations.
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

eval_model = OpenAIModel(
    model="gpt-4o",
)
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)

hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
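Before logging the results, it is worth skimming them; a quick sanity check, assuming the eval dataframes carry Phoenix’s standard label and explanation columns:
# Count how many responses were judged factual vs. hallucinated
print(hallucination_eval_df["label"].value_counts())
# Inspect the QA correctness judgments alongside their explanations
qa_correctness_eval_df.head()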
Finally, we log the evaluations to Phoenix.
from phoenix.trace import SpanEvaluations

px_client.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(
        eval_name="QA_Correctness", dataframe=qa_correctness_eval_df
    ),
)

Export data to Arize AX

Get data into dataframes

We extract the spans and evals dataframes from the Phoenix client.
tds = px_client.get_trace_dataset()
spans_df = tds.get_spans_dataframe(include_evaluations=False)
spans_df.head()
evals_df = tds.get_evals_dataframe()
evals_df.head()

Initialize Arize Client

from arize.pandas.logger import Client
Sign up/log in to your Arize AX account here. Find your space ID and API key. Copy/paste into the cell below.
SPACE_ID = globals().get("SPACE_ID") or getpass(
    "πŸ”‘ Enter your Arize AX Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("πŸ”‘ Enter your Arize AX API Key: ")

arize_client = Client(
    space_id=SPACE_ID,
    api_key=API_KEY,
)
model_id = "tutorial-tracing-llama-index-rag-export-from-phoenix"
model_version = "1.0"
Lastly, we use log_spans from the Arize client to log our spans data; if we have evaluations, we can also pass the optional evals_dataframe.
response = arize_client.log_spans(
    dataframe=spans_df,
    evals_dataframe=evals_df,
    model_id=model_id,
    model_version=model_version,
)

# If successful, the server will return a status_code of 200
if response.status_code != 200:
    print(
        f"❌ logging failed with response code {response.status_code}, {response.text}"
    )
else:
    print("βœ… You have successfully logged traces set to Arize AX")