Best Practices
The Prompt Playground is a powerful tool for testing LLM Evaluators before deploying them as an online task. Within this environment, users can easily iterate on and improve their evaluator configurations:

- The Prompt Template: Experiment with different prompt structures to see which works best. For example, if you're iterating on a hallucination evaluator template, you might experiment with adding few-shot examples (see the first sketch after this list).
- The LLM Model: Compare how different LLM models, such as GPT-4o-mini or o1-mini, affect evaluation results. You can also explore performance across various providers (e.g., Anthropic models via Bedrock or Gemini via Vertex AI) and adjust LLM parameters as needed (see the second sketch below).
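As a rough illustration of the first point, the sketch below shows one way to iterate on a hallucination evaluator template by adding few-shot examples. The template text, variable names, and labels are illustrative assumptions, not a built-in template.

```python
# Baseline hallucination evaluator template (illustrative, not the product's
# built-in template). The {context} and {answer} variables are placeholders.
HALLUCINATION_TEMPLATE = """You are checking whether an answer is supported by the provided context.

Context: {context}
Answer: {answer}

Respond with exactly one label: "factual" or "hallucinated".
"""

# Few-shot variant: prepend worked examples so the judge model sees the
# expected label format before evaluating the real input.
FEW_SHOT_EXAMPLES = """Example 1
Context: The Eiffel Tower is in Paris.
Answer: The Eiffel Tower is located in Paris, France.
Label: factual

Example 2
Context: The Eiffel Tower is in Paris.
Answer: The Eiffel Tower was built in 1920 in Lyon.
Label: hallucinated

"""

FEW_SHOT_TEMPLATE = FEW_SHOT_EXAMPLES + HALLUCINATION_TEMPLATE

# Fill the template with a sample record to preview the final prompt.
prompt = FEW_SHOT_TEMPLATE.format(
    context="The report covers Q3 revenue of $2.1M.",
    answer="Q3 revenue was $5M.",
)
print(prompt)
```

Comparing the baseline and few-shot variants side by side on the same examples makes it easy to see whether the added examples actually change the evaluator's labels.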
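For the second point, here is a minimal sketch of running the same evaluator prompt against different judge models, using the OpenAI Python client as an example provider. The model names, the sample prompt, and the `run_evaluator` helper are assumptions for illustration; an API key is expected in the `OPENAI_API_KEY` environment variable.

```python
from openai import OpenAI

# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

# Illustrative evaluator prompt; in practice this would come from your template.
prompt = (
    "Context: The report covers Q3 revenue of $2.1M.\n"
    "Answer: Q3 revenue was $5M.\n"
    'Respond with exactly one label: "factual" or "hallucinated".'
)

def run_evaluator(model: str, prompt: str, **params) -> str:
    """Run the evaluator prompt against one judge model and return its label."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **params,  # optional LLM parameters, e.g. temperature
    )
    return response.choices[0].message.content.strip()

# Compare how candidate judge models label the same input.
for model in ["gpt-4o-mini", "o1-mini"]:
    print(f"{model}: {run_evaluator(model, prompt)}")

# Adjusting LLM parameters for a model that supports them:
print(run_evaluator("gpt-4o-mini", prompt, temperature=0.0))
```

The same pattern applies to other providers (Bedrock, Vertex AI, etc.): keep the prompt fixed and vary only the model and its parameters so differences in evaluation results can be attributed to the judge model itself.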