Experiments
Dataset runs, comparisons, and experiment analytics — score a dataset against a model and compare runs.
API reference: Hanzo Evals API. Every endpoint, generated from the OpenAPI spec.
An experiment runs a dataset against a model, optionally with an LLM judge, and records a score for every item. Run the same dataset against different models or prompts and compare the results side by side. Experiments are part of the Hanzo Cloud evaluation surface and are tenant-scoped by org.
Run an experiment
POST /v1/evals/runs does the orchestration server-side: the cloud evaluation service runs each dataset item through the model, grades the output with the judge, and returns a run summary.
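To make that loop concrete, here is a small local sketch of the same shape: run each item through a model, optionally grade it, and summarize. It is a toy stand-in, not the service's implementation. The item fields (id, input, expected), the two callables, and the assumption that avgScore is the mean over scored items are all hypothetical; only the summary keys (items, scored, avgScore, results) come from the response shown on this page.

```python
from statistics import mean
from typing import Callable, Optional


def run_experiment(
    items: list,
    model: Callable[[str], str],
    judge: Optional[Callable[[str, str], float]] = None,
) -> dict:
    """Toy model of an experiment run (hypothetical, runs locally)."""
    results = []
    for item in items:
        # Step 1: run the dataset item through the model.
        output = model(item["input"])
        # Step 2: grade the output if a judge was supplied.
        score = judge(output, item["expected"]) if judge else None
        results.append({"itemId": item["id"], "score": score, "output": output})

    # Step 3: summarize. Assumption: avgScore is the mean over scored items.
    scored = [r["score"] for r in results if r["score"] is not None]
    return {
        "items": len(results),
        "scored": len(scored),
        "avgScore": round(mean(scored), 2) if scored else None,
        "results": results,
    }


# Illustrative data and stand-in callables, not real dataset content.
items = [
    {"id": "it_1", "input": "2+2?", "expected": "4"},
    {"id": "it_2", "input": "capital of France?", "expected": "Paris"},
]


def toy_model(prompt: str) -> str:
    return {"2+2?": "4"}.get(prompt, "unknown")


def exact_match_judge(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


summary = run_experiment(items, toy_model, exact_match_judge)
print(summary["items"], summary["scored"], summary["avgScore"])
```

In the hosted service the model and judge are named in the request body rather than passed as functions; the sketch only mirrors the order of the steps.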
curl -X POST https://api.hanzo.ai/v1/evals/runs \
  -H "Authorization: Bearer hk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "support-qa",
    "model": "qwen3-4b",
    "runName": "qwen3-baseline",
    "judge": {
      "model": "claude-sonnet-4-5-20250929",
      "criteria": "Is the answer correct and grounded in the expected output?"
    }
  }'

{
  "dataset": "support-qa",
  "model": "qwen3-4b",
  "judgeModel": "claude-sonnet-4-5-20250929",
  "runName": "qwen3-baseline",
  "items": 40,
  "scored": 40,
  "avgScore": 0.82,
  "results": [
    { "itemId": "it_1", "traceId": "tr_abc123", "score": 1, "output": "..." }
  ]
}

List and compare runs
Each run (also called a dataset run) is retrievable so you can compare avgScore across models and prompt versions:
curl https://api.hanzo.ai/v1/evals/dataset-runs \
  -H "Authorization: Bearer hk-..."

Every run also emits per-item traces, so a regression in avgScore links straight to the traces and scores that explain it.