Setting up CI/CD pipelines for LLMs helps you maintain control as your applications evolve. Just like in traditional software, automated testing is crucial to catch issues early. With Arize, you can create experiments that automatically validate changes—whether it’s a tweak to a prompt, model, or function—using a curated dataset and your preferred evaluation method. These tests can be integrated with GitHub Actions or GitLab CI/CD so they run automatically when you push a change, giving you confidence that your updates are solid without the need for manual testing.
This guide will walk you through setting up an automated experiment using our platform. It includes preparing your experiment file, defining the task and evaluator, and running the experiment.
To test locally, be sure to install the dependencies: pip install -q arize==7.19.0 arize-phoenix==4.21.0 nest_asyncio packaging openai 'gql[all]'
The experiment file organizes all components necessary for conducting your experiment. It includes sections for the dataset, task, and evaluator.

### Dataset

The first step is to set up and retrieve your dataset:
```python
from arize import ArizeClient

client = ArizeClient(api_key="your-arize-api-key")

# Get the current dataset metadata
dataset = client.datasets.get(dataset_id=dataset_id)

# Get the dataset examples
examples_response = client.datasets.list_examples(
    dataset_id=dataset_id,
    dataset_version_id=dataset_version_id,  # Optional, defaults to latest
)
dataset_df = examples_response.to_df()
```
### Task

Define the task that your model needs to perform. Typically, the task replicates the LLM functionality you’re aiming to test. In this example, the focus is on whether the router selected the correct function, so the task returns the tool call:
```python
import json


def task(example) -> str:
    ## You can import directly from your repo to automatically grab the latest version
    from prompt_func.search.search_router import ROUTER_TEMPLATE

    print("running task")
    prompt_vars = json.loads(
        example.dataset_row["attributes.llm.prompt_template.variables"]
    )
    # `client` here is an OpenAI client; TASK_MODEL and avail_tools are defined
    # elsewhere in the experiment file.
    response = client.chat.completions.create(
        model=TASK_MODEL,
        temperature=0,
        messages=[
            {"role": "system", "content": ROUTER_TEMPLATE},
        ],
        tools=avail_tools,
    )
    tool_response = response.choices[0].message.tool_calls
    return tool_response


def run_task(example) -> str:
    return task(example)
```
### Evaluator

An evaluator serves as the measure of success for your experiment. You can define multiple evaluators, ranging from LLM-based judges to code-based evaluations. The evaluator is central to testing and validating the outcomes of your experiment:
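Below is a minimal sketch of a code-based evaluator that checks whether the router picked the expected function. The evaluator signature and the `expected_tool_call` column name are assumptions—adapt them to your dataset schema and to however you invoke the experiment.

```python
def correct_tool_evaluator(output, dataset_row) -> float:
    """Return 1.0 if the router called the expected function, else 0.0."""
    # "expected_tool_call" is a hypothetical column in the dataset
    expected_tool = dataset_row.get("expected_tool_call")
    if not output:
        return 0.0
    # output is the list of tool calls returned by the task above
    called_tool = output[0].function.name
    return 1.0 if called_tool == expected_tool else 0.0
```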
You can retrieve information about existing experiments using a GraphQL query. This is useful for tracking experiment history and performance.
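Before running the query you need a gql client. The sketch below is one way to construct it; the endpoint URL and the API-key header name are assumptions, so verify them against Arize's GraphQL API docs.

```python
from gql import Client
from gql.transport.requests import RequestsHTTPTransport

# Assumed endpoint and auth header -- check Arize's GraphQL API docs.
transport = RequestsHTTPTransport(
    url="https://app.arize.com/graphql",
    headers={"x-api-key": ARIZE_API_KEY},
)
gql_client = Client(transport=transport, fetch_schema_from_transport=True)
```

With the client in place, the following function fetches the most recent experiment on the dataset along with its evaluation score metrics: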
```python
from gql import Client, gql
from gql.transport.requests import RequestsHTTPTransport


def fetch_experiment_details(gql_client, dataset_id):
    experiments_query = gql(
        """
        query getExperimentDetails($DatasetId: ID!) {
          node(id: $DatasetId) {
            ... on Dataset {
              name
              experiments(first: 1) {
                edges {
                  node {
                    name
                    createdAt
                    evaluationScoreMetrics {
                      name
                      meanScore
                    }
                  }
                }
              }
            }
          }
        }
        """
    )
    params = {"DatasetId": dataset_id}
    response = gql_client.execute(experiments_query, params)
    experiments = response["node"]["experiments"]["edges"]

    experiments_list = []
    for experiment in experiments:
        node = experiment["node"]
        experiment_name = node["name"]
        for metric in node["evaluationScoreMetrics"]:
            experiments_list.append([
                experiment_name,
                metric["name"],
                metric["meanScore"],
            ])
    return experiments_list
```
This function returns a list of experiments with their names, metric names, and mean scores.

### Determine Experiment Success

You can use the mean score from an experiment to automatically determine if it passed or failed:
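Here is a minimal sketch of such a check, assuming the list layout returned by fetch_experiment_details above and the 0.7 threshold described next:

```python
import sys


def evaluate_experiment(experiments_list, threshold=0.7):
    if not experiments_list:
        print("No experiment results found")
        sys.exit(1)
    experiment_name, metric_name, mean_score = experiments_list[0]
    print(f"{experiment_name} - {metric_name}: {mean_score}")
    if mean_score > threshold:
        print("Experiment passed")
        sys.exit(0)
    print("Experiment failed")
    sys.exit(1)
```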
This function exits with code 0 if the experiment is successful (score > 0.7) or code 1 if it fails.

### Auto-increment Experiment Names

To ensure unique experiment names, you can automatically increment the version number:
```python
import re


def increment_experiment_name(experiment_name):
    ## example name: AI Search V1.1
    match = re.search(r"V(\d+)\.(\d+)", experiment_name)
    if not match:
        return experiment_name
    major, minor = map(int, match.groups())
    new_version = f"V{major}.{minor + 1}"
    return re.sub(r"V\d+\.\d+", new_version, experiment_name)
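```

For example, feeding it the name of the most recent experiment (using the list layout returned by fetch_experiment_details above; DATASET_ID is assumed to hold your dataset's ID) yields the next version:

```python
latest_name = fetch_experiment_details(gql_client, DATASET_ID)[0][0]  # e.g. "AI Search V1.1"
next_name = increment_experiment_name(latest_name)
print(next_name)  # -> "AI Search V1.2"
```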
GitLab CI/CD pipelines are defined in a .gitlab-ci.yml file stored in the root of your repository. You can use YAML syntax to define your pipeline.

Example .gitlab-ci.yml file:
```yaml
stages:
  - test

variables:
  # These variables need to be defined in GitLab CI/CD settings
  # The $ syntax is how GitLab references variables
  OPENAI_API_KEY: $OPENAI_API_KEY
  ARIZE_API_KEY: $ARIZE_API_KEY
  SPACE_ID: $SPACE_ID
  DATASET_ID: $DATASET_ID

llm-experiment-job:
  stage: test
  image: python:3.10
  # The 'only' directive specifies when this job should run
  # This will run for merge requests that change files in copilot/search
  only:
    refs:
      - merge_requests
    changes:
      - copilot/search/**/*
  script:
    - pip install -q arize==7.36.0 arize-phoenix==4.29.0 nest_asyncio packaging openai 'gql[all]'
    - python ./copilot/experiments/ai_search_test.py
  artifacts:
    paths:
      - experiment_results.json
    expire_in: 1 week
```
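The artifacts step expects the experiment script to write experiment_results.json. A sketch of how ai_search_test.py might persist its results, reusing the helpers defined earlier (the dictionary contents are illustrative):

```python
import json

# Persist results so the GitLab artifacts step can pick them up.
results = {
    "experiment_name": next_name,
    "metrics": fetch_experiment_details(gql_client, DATASET_ID),
}
with open("experiment_results.json", "w") as f:
    json.dump(results, f, indent=2)
```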