Example Custom Metrics

Custom metrics are a powerful tool to assess various aspects of your LLM application. From analyzing cost and usage to monitoring application performance, custom metrics offer flexibility to tailor evaluations to your specific needs. Use this page as a guide on how to create custom metrics, though this is not an exhaustive list.

Percent of Correct

This example calculates the percentage of evaluated predictions whose QA_Correctness label is incorrect. Both counts use a FILTER (WHERE ...) clause: the numerator counts rows labeled 'incorrect', and the denominator counts only rows that have a label, so unevaluated rows do not dilute the result. To get the percentage of correct predictions instead, change 'incorrect' to 'correct' in the numerator:

SELECT
    100.0 * COUNT(*) FILTER (WHERE "eval.QA_Correctness.label" = 'incorrect') /
    COUNT(*) FILTER (WHERE "eval.QA_Correctness.label" IS NOT NULL)
FROM model

Learn more about FILTER (WHERE) clauses here.
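
To make the numerator and denominator concrete, here is a minimal Python sketch of the same calculation on a handful of made-up labels. The function name percent_with_label and the sample data are hypothetical, and None stands in for rows that have no evaluation label. It illustrates the arithmetic only, not how the platform executes the query.

```python
from typing import Optional, Sequence


def percent_with_label(labels: Sequence[Optional[str]], target: str) -> Optional[float]:
    """Percent of labeled rows whose label equals `target`; None if nothing is labeled."""
    # Denominator: only rows that actually carry a label (the non-null filter).
    labeled = [label for label in labels if label is not None]
    if not labeled:
        return None  # avoid dividing by zero when no row has been evaluated
    # Numerator: labeled rows matching the target label.
    matching = sum(1 for label in labeled if label == target)
    return 100.0 * matching / len(labeled)


# Hypothetical sample: five evaluated rows and one unevaluated row (None).
sample_labels = ["correct", "incorrect", None, "correct", "incorrect", "incorrect"]
percent_incorrect = percent_with_label(sample_labels, "incorrect")
print(percent_incorrect)  # 60.0 (3 of the 5 labeled rows)
```

The unevaluated row is excluded from both counts, which is why the result is 60 rather than 50.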

Performance Metrics

Natively supported performance metrics are available as functions that take multiple arguments. You can also build entirely new metrics with conditionals and other logic, as the examples below show. Explore the documentation for performance metrics here.

Precision

You can calculate the precision of your evaluations using annotations as the ground truth:

SELECT
    PRECISION(
        predicted = "eval.QA_Correctness_Eval.label", 
        actual = "annotation.Judge Correctness.label", 
        pos_class = 'correct'
    )
FROM model

Learn more about PRECISION and related functions here. Any of the built-in functions can be used the same way to assess your evaluations, and any dimension you have traced can serve as the actual values.
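
As a reminder of what precision measures, the following Python sketch computes it by hand: of the rows the evaluation labeled as the positive class, the fraction that the annotation also labeled positive. The function name and sample labels are hypothetical, and returning None when nothing is predicted positive is an assumption of this sketch, not documented behavior of the built-in PRECISION function.

```python
from typing import Optional, Sequence


def precision(predicted: Sequence[str], actual: Sequence[str], pos_class: str) -> Optional[float]:
    """True positives / (true positives + false positives) for `pos_class`."""
    true_positives = 0
    false_positives = 0
    for pred, act in zip(predicted, actual):
        if pred != pos_class:
            continue  # precision only looks at rows predicted as the positive class
        if act == pos_class:
            true_positives += 1
        else:
            false_positives += 1
    predicted_positive = true_positives + false_positives
    if predicted_positive == 0:
        return None  # assumption: undefined when nothing is predicted positive
    return true_positives / predicted_positive


# Hypothetical data: evaluation labels versus human annotation labels.
eval_labels = ["correct", "correct", "incorrect", "correct"]
annotation_labels = ["correct", "incorrect", "incorrect", "correct"]
eval_precision = precision(eval_labels, annotation_labels, pos_class="correct")
print(eval_precision)  # 2 true positives out of 3 predicted positives
```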

Total Costs

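Calculate the total cost by multiplying the completion and prompt token counts by their per-token rates and summing the result across all rows. The rates below are examples; substitute your model's pricing: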

SELECT 
    SUM("attributes.llm.token_count.completion" * 0.0001 + 
        "attributes.llm.token_count.prompt" * 0.0000025)
FROM model
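
The same calculation can be checked by hand on a couple of rows. This is a minimal Python sketch using the example rates from the query above; the token counts and the names total_cost and spans are made up for illustration.

```python
import math

# Example per-token rates copied from the query above; replace with your own pricing.
COMPLETION_RATE = 0.0001
PROMPT_RATE = 0.0000025

# Hypothetical rows as (prompt_tokens, completion_tokens).
spans = [(1000, 200), (400, 100)]


def total_cost(rows):
    """Sum of per-row cost: completion tokens and prompt tokens, each times its rate."""
    return sum(
        completion * COMPLETION_RATE + prompt * PROMPT_RATE
        for prompt, completion in rows
    )


total = total_cost(spans)
# Row 1: 200 * 0.0001 + 1000 * 0.0000025 = 0.0225
# Row 2: 100 * 0.0001 +  400 * 0.0000025 = 0.0110
print(f"{total:.4f}")  # 0.0335
```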

Average Cost per User Query

This example averages the prompt (input) cost per query, with the cost per input token set to 0.0000025. Adjust the rate to match your per-token cost:

SELECT 
    AVG("attributes.llm.token_count.prompt" * 0.0000025)
FROM model

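Alternatively, you can express rates per million tokens. The example below sums the total cost at 10 per million completion tokens and 2.5 per million prompt tokens. Note that 10 per million is 0.00001 per token, which is lower than the 0.0001 completion rate used under Total Costs: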

SELECT 
    SUM("attributes.llm.token_count.completion" * 10 / 1000000 + 
        "attributes.llm.token_count.prompt" * 2.5 / 1000000)
FROM model
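
Converting between the two ways of writing a rate is a single division. Here is a short Python sketch, assuming the example prices above; the helper per_token and the prompt token counts are hypothetical.

```python
import math

TOKENS_PER_MILLION = 1_000_000


def per_token(rate_per_million: float) -> float:
    """Convert a price quoted per million tokens into a price per single token."""
    return rate_per_million / TOKENS_PER_MILLION


prompt_rate = per_token(2.5)     # 0.0000025 per prompt token
completion_rate = per_token(10)  # 0.00001 per completion token

# Hypothetical prompt token counts for three user queries.
prompt_token_counts = [1000, 400, 600]

# Average prompt cost per query, mirroring AVG(prompt_tokens * rate).
average_prompt_cost = sum(n * prompt_rate for n in prompt_token_counts) / len(prompt_token_counts)
print(average_prompt_cost)
```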

Evaluation Cost Estimate

To estimate evaluation costs, export your traces and measure the token length of your evaluation template and the average length of the evaluator's output. In the example below, the template takes input and output as variables, so each evaluation's input tokens are the span's prompt and completion token counts plus the template's own 229 tokens. If your template includes additional variables, add their token counts as well:

SELECT 
    (SUM(
        "attributes.llm.token_count.prompt" + 
        "attributes.llm.token_count.completion" + 
        229 -- prompt token length 
    ) * 0.00000025) + 
    (COUNT(*) * (104 * 0.00001)) -- 104 is the estimated output token length
FROM model
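
The estimate has two parts: an input cost that depends on each row's token counts plus the fixed template length, and an output cost that is the same for every row. The Python sketch below reproduces that arithmetic with the constants from the query above; the function name, parameter names, and sample token counts are hypothetical.

```python
import math


def estimate_eval_cost(
    rows,
    template_tokens=229,       # token length of the evaluation prompt template
    output_tokens=104,         # estimated evaluator output length per row
    input_rate=0.00000025,     # cost per evaluator input token (from the query above)
    output_rate=0.00001,       # cost per evaluator output token (from the query above)
):
    """Estimated cost of evaluating every row once.

    `rows` is an iterable of (prompt_tokens, completion_tokens) pairs.
    """
    rows = list(rows)
    # The evaluator reads the original prompt, the original completion, and the template.
    input_tokens = sum(prompt + completion + template_tokens for prompt, completion in rows)
    input_cost = input_tokens * input_rate
    # Every evaluation is assumed to produce about `output_tokens` tokens.
    output_cost = len(rows) * output_tokens * output_rate
    return input_cost + output_cost


# Hypothetical rows as (prompt_tokens, completion_tokens).
eval_cost = estimate_eval_cost([(1000, 200), (400, 100)])
# Input:  (1000 + 200 + 229) + (400 + 100 + 229) = 2158 tokens
# Output: 2 rows * 104 tokens = 208 tokens
print(eval_cost)
```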

Number of Errors

To determine the total number of errors, use the following query:

SELECT 
    SUM(
        CASE
            WHEN "status_code" = 'ERROR' THEN 1
            ELSE 0
        END
    )
FROM model
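
The CASE expression turns each row into 1 or 0, so the SUM is simply a count of the rows whose status code is ERROR. A minimal Python equivalent on made-up status codes (the variable names and sample values are hypothetical):

```python
# Hypothetical status codes; None stands in for a row with no status recorded.
status_codes = ["OK", "ERROR", "UNSET", "ERROR", None]

# Mirror of CASE WHEN status_code = 'ERROR' THEN 1 ELSE 0 END, summed over all rows.
error_count = sum(1 if code == "ERROR" else 0 for code in status_codes)
print(error_count)  # 2
```

The FILTER (WHERE ...) pattern shown under Percent of Correct should express the same count, assuming your metric supports it in this position.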

Number of Sessions

Estimate the total number of distinct sessions:

SELECT 
    approx_count_distinct("attributes.session.id")
FROM model
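
For reference, the quantity being approximated is the exact number of unique session IDs. The Python sketch below computes that exact count on made-up data; skipping rows without a session ID is an assumption of this sketch, not documented behavior of approx_count_distinct.

```python
# Hypothetical session IDs; None stands in for a row with no session attached.
session_ids = ["s1", "s2", "s1", None, "s3", "s2"]

# Exact distinct count, ignoring rows without a session ID (assumption).
distinct_sessions = len({sid for sid in session_ids if sid is not None})
print(distinct_sessions)  # 3
```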
