Retrieval Evaluation Colab Tutorial

Vector stores enable teams to connect their own data to LLMs. A common application is a chatbot that searches a company’s knowledge base for context to answer specific questions.

How Search and Retrieval Works

Here’s an example of what retrieval looks like for a chatbot application. A user asks a specific question, an embedding is generated from the query, and the relevant context in the knowledge base is retrieved and added to the prompt sent to the LLM.
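The flow described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular vector store implements retrieval: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `retrieve_top_k` and `build_prompt` are hypothetical helper names introduced only for this example.

```python
import numpy as np

def cosine_similarity(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return doc_units @ query_unit

def retrieve_top_k(query_vec, doc_matrix, k=2):
    """Return the indices of the k most similar documents, best first."""
    scores = cosine_similarity(query_vec, doc_matrix)
    return np.argsort(-scores)[:k].tolist()

def build_prompt(question, contexts):
    """Add the retrieved context to the prompt that is sent to the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}"
    )

# Toy knowledge base: placeholder texts with hand-written 3-d "embeddings".
documents = [
    "Toy document about theaters.",
    "Toy document about billing.",
    "Toy document about passwords.",
]
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.0, 1.0],
])

# Stand-in for the embedding of the user's question.
query_vector = np.array([1.0, 0.2, 0.0])

top_ids = retrieve_top_k(query_vector, doc_vectors, k=2)
prompt = build_prompt("What is the theater?", [documents[i] for i in top_ids])
print(prompt)
```

In a real application the vectors would come from an embedding model and the nearest-neighbor search would be handled by the vector store. The ranking and prompt-assembly steps follow the same pattern.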

Common Problems with Search and Retrieval Systems

When an application that uses retrieval-augmented generation (RAG) gives a poor response, there can be several causes. The most common issues we see are:

There weren’t enough documents to answer the question
The document retrieved wasn’t good enough to answer

Arize helps evaluate how good retrieval is and identify where it went wrong.

Logging data to Arize for Search and Retrieval Tracing

Arize logs both a sample of the knowledge base and the production prompt/response pairs of the deployed application. Here’s a high-level view of what is logged:

Step 1: Logging a Corpus Dataset (Knowledge Base)

First, collect the documents from your vector store so you can compare against them later. This lets you see whether some sections are never retrieved, or whether some sections receive heavy traffic and may need more context or documents in that area.

Example of Corpus Dataframe (Knowledge Base)

document_id	text	text_vector
`123`	`The Variety Theater in Cleveland, once a ...`	`[-0.0051908623, -0.05508642, -0.28958365, -0.2...`
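Before logging, it can help to sanity-check that the corpus dataframe has the shape shown above. The sketch below is an assumption-laden illustration: `validate_corpus_df` is a hypothetical helper, not part of the Arize SDK, and the checks for unique IDs and a single embedding dimension are reasonable guesses rather than documented requirements.

```python
import pandas as pd

def validate_corpus_df(df, id_col="document_id", text_col="text", vector_col="text_vector"):
    """Check required columns, unique IDs, and a single embedding dimension.

    Returns the embedding dimension if all checks pass.
    """
    missing = [c for c in (id_col, text_col, vector_col) if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df[id_col].duplicated().any():
        raise ValueError("document IDs are not unique")
    dims = df[vector_col].map(len).unique()
    if len(dims) != 1:
        raise ValueError(f"embedding vectors have mixed lengths: {sorted(dims)}")
    return int(dims[0])

# Toy corpus with the same columns as the example row above.
toy_corpus_df = pd.DataFrame({
    "document_id": ["123", "124"],
    "text": ["First toy document.", "Second toy document."],
    "text_vector": [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]],
})

embedding_dim = validate_corpus_df(toy_corpus_df)
print(embedding_dim)
```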

# Logging the Sample of the Corpus
from arize.pandas.logger import Client, Schema
from arize.utils.types import (
    CorpusSchema,
    EmbeddingColumnNames,
    Environments,
    ModelTypes,
)

API_KEY = 'ARIZE_API_KEY'
SPACE_ID = 'YOUR SPACE ID'
arize_client = Client(space_id=SPACE_ID, api_key=API_KEY)

response = arize_client.log(
    dataframe=corpus_knowledge_base_df, # Refers to the above dataframe with the example row
    path="/tmp/arrow-inferences.bin",
    model_id="search-and-retrieval-with-corpus-dataset",
    model_type=ModelTypes.GENERATIVE_LLM,
    environment=Environments.CORPUS,
    schema=CorpusSchema(
        document_id_column_name='document_id',
        document_text_embedding_column_names=EmbeddingColumnNames(
            vector_column_name='text_vector',
            data_column_name='text'
        ),
        # Assumes the dataframe also contains a 'document_version' column
        document_version_column_name='document_version'
    ),
)

Step 2: Logging Production Prompt/Responses to Arize

We also log the prompt/response pairs from the deployed application. Example dataframe (prompts_response_df):

prediction-ID	user-query	query-vector	document	document-vector	response	response vector	user feedback
`dd824bd3-2097…`	`What is the Variety Theater in Cleveland?`	`[-0.5686951, -0.7092256, -0.34603243, -0.4858…`	`The Variety Theater in Cleveland, once a …`	`[-0.1869151, -0.2092136, -0.1660343, -0.3258…`	`The Variety Theater is …`	`[-0.18691051, -0.2092136, -0.16603243, -0.3258…`	`thumbs-down`

# Logging the production prompt and response pairs

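# Additional schema types used below
from arize.utils.types import LLMConfigColumnNames, PromptTemplateColumnNames

# Declare embedding feature columns
# (no trailing comma after the closing parenthesis, which would create a tuple)
prompt_columns = EmbeddingColumnNames(
    vector_column_name="query-vector",
    data_column_name="user-query"
)
response_columns = EmbeddingColumnNames(
    vector_column_name="response vector",
    data_column_name="response"
)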

# Define the Schema, including embedding information
schema = Schema(
            prediction_id_column_name="prediction-ID",
            timestamp_column_name="prediction_ts",
            tag_column_names=[
                "cost_per_call",
                "euclidean_distance_0",
                "euclidean_distance_1",
                "instruction",
                "openai_precision_1",
                "openai_precision_2",
                "openai_relevance_0",
                "openai_relevance_1",
                "prompt_template",
                "prompt_template_name",
                "retrieval_text_0",
                "retrieval_text_1",
                "text_similarity_0",
                "text_similarity_1",
                "tokens_used",
                "user_feedback",
                "user_query",
            ],
            prediction_label_column_name="pred_label",
            actual_label_column_name="user_feedback",
            retrieved_document_ids_column_name="retrieved_doc_ids",
            prompt_column_names=prompt_columns,
            response_column_names=response_columns,
            llm_config_column_names=LLMConfigColumnNames(
                model_column_name="llm_config_model_name",
                params_column_name="llm_params",
            ),
            prompt_template_column_names=PromptTemplateColumnNames(
                template_column_name="prompt_template",
                template_version_column_name="prompt_template_name",
            )
    )

# Log the dataframe with the schema mapping 
response = arize_client.log(  
        dataframe=prompts_response_df, # Refers to the above dataframe with the example row
        path="/tmp/arrow-inferences.bin",
        model_id="search-and-retrieval-with-corpus-dataset",
        model_version="1.0",
        model_type=ModelTypes.GENERATIVE_LLM,
        environment=Environments.PRODUCTION,
        schema=schema,
    )

Tracing Search and Retrieval Systems with Arize

Issue #1: Bad Response

The first issue we see, and often the easiest to uncover, is bad responses. Navigate to the Embeddings projector tab to debug your search and retrieval. If you have logged performance metrics (such as user feedback or eval scores on the LLM response), Arize automatically surfaces any clusters that received poor feedback.

Note: you may need to create a custom metric based on an eval tag to surface these clusters.

Create a custom metric in Arize to capture user feedback

Arize will automatically surface clusters with the worst user feedback
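To make the idea of a feedback-based metric concrete, the sketch below computes the share of thumbs-down responses per cluster in plain Python. It only illustrates the underlying arithmetic: the cluster IDs and records are made up, `thumbs_down_rate_by_cluster` is a hypothetical helper, and this is not Arize's custom metric syntax.

```python
from collections import defaultdict

def thumbs_down_rate_by_cluster(records):
    """Return {cluster_id: fraction of responses marked 'thumbs-down'}."""
    totals = defaultdict(int)
    downs = defaultdict(int)
    for cluster_id, feedback in records:
        totals[cluster_id] += 1
        if feedback == "thumbs-down":
            downs[cluster_id] += 1
    return {cluster: downs[cluster] / totals[cluster] for cluster in totals}

# Made-up (cluster_id, user feedback) pairs for illustration only.
records = [
    ("cluster_a", "thumbs-up"),
    ("cluster_a", "thumbs-down"),
    ("cluster_b", "thumbs-down"),
    ("cluster_b", "thumbs-down"),
]

rates = thumbs_down_rate_by_cluster(records)
worst_cluster = max(rates, key=rates.get)
print(rates, worst_cluster)
```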

Bad responses are usually a symptom of an upstream problem; your LLM is unlikely to give a poor response for no reason. Next, we will show you how to trace a bad response back to its root cause.

Issue #2: Don’t Have Any Documents Close Enough

The retriever may not have found any documents close enough to the query embedding. This means users are asking questions about context that is missing from your corpus. Arize can help you identify that missing context: by visualizing query density, you can see which topics need additional documentation in order to improve your chatbot's responses.

Visualize Query Density (Euclidean or Cosine Distance)

By setting the “production” dataset to the user queries and the “baseline” dataset to the context in your vector store, you can see whether there are clusters of user query embeddings with no nearby context embeddings, as seen in the example above.
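The same check can be approximated numerically. The sketch below, written under stated assumptions, computes each query's Euclidean distance to its nearest corpus embedding and flags queries whose nearest context is farther than a threshold. The 2-d vectors, the `0.5` threshold, and the helper names are all hypothetical; a sensible threshold depends on your embedding model.

```python
import numpy as np

def nearest_context_distance(query_vectors, corpus_vectors):
    """Euclidean distance from each query to its closest corpus vector."""
    # Broadcast to shape (n_queries, n_docs, dim), then reduce over dim.
    diffs = query_vectors[:, None, :] - corpus_vectors[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    return distances.min(axis=1)

def flag_uncovered_queries(query_vectors, corpus_vectors, threshold):
    """True for queries with no corpus vector within `threshold`."""
    return nearest_context_distance(query_vectors, corpus_vectors) > threshold

# Toy 2-d embeddings: two corpus documents and three user queries.
corpus_vectors = np.array([[0.0, 0.0], [1.0, 0.0]])
query_vectors = np.array([[0.1, 0.0], [0.9, 0.1], [5.0, 5.0]])

uncovered = flag_uncovered_queries(query_vectors, corpus_vectors, threshold=0.5)
print(uncovered)
```

Queries flagged this way correspond to the clusters of query embeddings with no nearby context, and they point to topics the corpus may not cover.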

Issue #3: Most Similar != Most Relevant Document

It is also possible that the retrieved document was the most similar, with the embedding closest to the query, but was not the most relevant document for answering the user’s question. Arize can help uncover when irrelevant context is retrieved by using LLM-assisted ranking metrics. Ranking the relevance of the retrieved context helps you identify where to dig in to improve retrieval.

To catch instances where the most relevant context is not the most “similar,” Arize sends the user query and the retrieved context to GPT-4, or another LLM, and asks it to rank or score the relevance of the retrieved context. In the example above, both pieces of retrieved context received an “irrelevant” score, or a precision@k ranking of 0. We can also see that this coincides with negative user feedback on the response.
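Once an LLM has labeled each retrieved document, precision@k is simple arithmetic. The sketch below assumes the labels are the strings "relevant" and "irrelevant", ordered by retrieval rank. The label values and the `precision_at_k` helper are assumptions for illustration, and the LLM call that produces the labels is not shown.

```python
def precision_at_k(relevance_labels, k):
    """Fraction of the top-k retrieved documents labeled 'relevant'.

    relevance_labels: labels in retrieval order (rank 1 first).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = relevance_labels[:k]
    relevant_count = sum(1 for label in top_k if label == "relevant")
    return relevant_count / k

# Both retrieved pieces of context judged irrelevant, as in the example above.
both_irrelevant = precision_at_k(["irrelevant", "irrelevant"], k=2)

# A contrasting case: only the top-ranked document is relevant.
one_relevant = precision_at_k(["relevant", "irrelevant"], k=2)

print(both_irrelevant, one_relevant)
```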

Troubleshooting Tip: Found a problematic cluster you want to dig into, but don’t want to manually sift through all of the prompts and responses? Use our OpenAI Cluster Summarization tool to get a quick summary of the selected cluster.
