In the previous guide, the groundedness evaluator revealed a pattern: the chatbot makes claims that aren't in the policy documents, because the system prompt says “be helpful” but doesn’t enforce grounding. Rather than guessing at a fix and redeploying, start from a real failure, fix it in Playground using the exact inputs that went wrong, then validate across a full dataset before shipping.
Screenshot: the Arize AX skyserve-chatbot Traces view with the Alyx assistant open, a user request about this week’s groundedness-check failures, and Alyx’s task plan and progress.
This is Part 3 of the Arize AX Get Started series. You should have completed the Evaluations guide first, with evaluation scores visible on your traces.

Choose how you want to work

Use Arize Skills to have your coding agent run improvement workflows from your editor, Alyx for a conversational approach inside the Arize platform, the UI for a hands-on step-by-step experience, or Code to run programmatically. In each path, you’ll build a dataset from failing traces, iterate on your prompt, and compare experiments before shipping.
Use Arize Skills with your coding agent to run the same workflow from your editor. The example prompts below are what you type to your agent — the skill loads automatically and handles the rest. Install the skills plugin and follow Set up Arize with AI coding agents for authentication and CLI setup.

Step 1: See evaluation results on your traces

This step uses the arize-trace skill. Spans include labels once an eval task has run; see Viewing results in the tracing UI. For example, you might say:
Export spans from skyserve-chatbot where groundedness-check failed this week
Screenshot: terminal showing the ax spans export command, an export success message, a summary of low-groundedness spans flagged as hallucinated, and a table of span and trace IDs with evaluator columns.
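If you’re following the Code path instead, the same export can be scripted with the Arize Python SDK’s export client. A minimal sketch, assuming ARIZE_API_KEY is set in the environment and that the eval label column is named after the groundedness-check evaluator (the exact column name depends on how your eval task was configured):

```python
from datetime import datetime, timedelta

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

# Assumes ARIZE_API_KEY is set in the environment.
client = ArizeExportClient()

# Pull the last week of tracing spans for the project.
spans_df = client.export_model_to_df(
    space_id="YOUR_SPACE_ID",  # placeholder
    model_id="skyserve-chatbot",
    environment=Environments.TRACING,
    start_time=datetime.now() - timedelta(days=7),
    end_time=datetime.now(),
)

# Keep only spans the groundedness evaluator flagged as hallucinated.
# The column name below is an assumption; inspect spans_df.columns for yours.
failing = spans_df[spans_df["eval.groundedness-check.label"] == "hallucinated"]
print(f"{len(failing)} failing spans this week")
```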

Step 2: Create a dataset

This step uses the arize-dataset skill. For example, you might say:
Create skyserve-test-cases from those failing traces
Screenshot: terminal confirming the skyserve-test-cases dataset was created from failing traces, with schema fields question, reference_text, and original_output, trace and span IDs, and status counts.
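The equivalent in code is to build a dataset from the failing spans DataFrame. A minimal sketch using the SDK’s experimental datasets client, continuing from the export above; the source column names here are assumptions about your span attributes:

```python
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

# The auth parameter name (api_key vs. developer_key) varies by SDK version.
datasets_client = ArizeDatasetsClient(api_key="YOUR_API_KEY")

# Shape the failing spans (from the export above) into the dataset schema.
dataset_df = failing.rename(
    columns={
        "attributes.input.value": "question",
        "attributes.retrieval.documents": "reference_text",
        "attributes.output.value": "original_output",
    }
)[["question", "reference_text", "original_output"]]

dataset_id = datasets_client.create_dataset(
    space_id="YOUR_SPACE_ID",
    dataset_name="skyserve-test-cases",
    dataset_type=GENERATIVE,
    data=dataset_df,
)
print(f"Created dataset {dataset_id}")
```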

Step 3: Improve the system prompt

This step uses the arize-prompt-optimization skill. For example, you might say:
Extract the system prompt from the failing skyserve-chatbot spans and generate an improved version. Use the groundedness-check eval labels and explanations as signal for what to fix.
Screenshot: the improved SkyServe system prompt with grounding rules, annotated with notes referencing groundedness-check eval labels and span-level failures.
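The shape of the fix matters more than its exact wording. A purely illustrative sketch of what a grounded version of the system prompt might look like; your generated prompt will differ, since it’s driven by your own eval labels and explanations:

```python
IMPROVED_SYSTEM_PROMPT = """You are SkyServe's customer support assistant.

Grounding rules:
- Answer ONLY from the policy documents provided in the context.
- If the context does not contain the answer, say so and offer to
  escalate; never guess or invent policy details.
- Quote or closely paraphrase the relevant policy text where possible.
- Do not rely on general knowledge about policies not in the context.
"""
```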

Step 4: Run both prompts as experiments

This step uses the arize-experiment skill. Reuse the same evaluators you trust in production; see Run evals on experiments. For example, you might say:
Run both prompt versions (original and the updated one) against the dataset and compare groundedness scores.
Screenshot: experiment comparison table showing skyserve-original-prompt at 4/5 groundedness versus skyserve-improved-prompt at 5/5 (100%).
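In code, the comparison is two experiment runs over the same dataset with the same evaluator. A minimal sketch continuing from the dataset step; call_llm, ORIGINAL_SYSTEM_PROMPT, and groundedness_check are placeholders you’d supply, and run_experiment’s exact signature may differ across SDK versions:

```python
# Hypothetical helper that calls your model provider with a system prompt
# and a user question; substitute your own LLM client here.
def call_llm(system_prompt: str, question: str) -> str:
    ...

def task_original(dataset_row) -> str:
    return call_llm(ORIGINAL_SYSTEM_PROMPT, dataset_row["question"])

def task_improved(dataset_row) -> str:
    return call_llm(IMPROVED_SYSTEM_PROMPT, dataset_row["question"])

# Run one experiment per prompt version against the same dataset, scoring
# each with the same groundedness evaluator used in production
# (groundedness_check is a placeholder for your evaluator object).
for name, task in [
    ("skyserve-original-prompt", task_original),
    ("skyserve-improved-prompt", task_improved),
]:
    datasets_client.run_experiment(
        space_id="YOUR_SPACE_ID",
        dataset_id=dataset_id,
        task=task,
        evaluators=[groundedness_check],
        experiment_name=name,
    )
```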

Congratulations!

You’ve completed the full improvement loop:
  1. Traced your app to see what’s happening inside it.
  2. Evaluated responses automatically to measure quality.
  3. Improved your prompt using real failure data in the Playground.
  4. Proved the improvement works across a representative dataset with experiments.
You now have a repeatable, data-driven process for improving your LLM application. No more guessing and no more hoping; you can measure quality and demonstrate improvement. Next up: deepen your tracing foundation so your improvement loop stays grounded in complete, high-quality telemetry.

Next: Tracing concepts

Learn more about Experiments