Dataset runs, comparisons, and experiment analytics — score a dataset against a model and compare runs.

Experiments

API reference: Hanzo Evals API, with every endpoint generated from the OpenAPI spec.

An experiment runs a dataset against a model, optionally with an LLM judge, and records a score for every item. Run the same dataset against different models or prompts and compare the results side by side. Experiments are part of the Hanzo Cloud evaluation surface and are tenant-scoped by org.

Run an experiment

POST /v1/evals/runs orchestrates the full run: the cloud evaluation service runs each dataset item through the model, grades the output with the judge, and returns a run summary.

curl -X POST https://api.hanzo.ai/v1/evals/runs \
  -H "Authorization: Bearer hk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "support-qa",
    "model": "qwen3-4b",
    "runName": "qwen3-baseline",
    "judge": {
      "model": "claude-sonnet-4-5-20250929",
      "criteria": "Is the answer correct and grounded in the expected output?"
    }
  }'

{
  "dataset": "support-qa",
  "model": "qwen3-4b",
  "judgeModel": "claude-sonnet-4-5-20250929",
  "runName": "qwen3-baseline",
  "items": 40,
  "scored": 40,
  "avgScore": 0.82,
  "results": [
    { "itemId": "it_1", "traceId": "tr_abc123", "score": 1, "output": "..." }
  ]
}
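The request and response above can also be driven from a script. The following is a minimal Python sketch, not an official client: the function names (build_run_request, start_run, summarize_run) and the pass_threshold cutoff are hypothetical, and the code assumes only the fields shown in the example response, with score treated as a number.

```python
import json
import urllib.request

API_BASE = "https://api.hanzo.ai"  # host taken from the curl example


def build_run_request(dataset, model, run_name, judge_model=None, criteria=None):
    """Build the JSON body for POST /v1/evals/runs."""
    body = {"dataset": dataset, "model": model, "runName": run_name}
    # The judge is optional, so include it only when a judge model is given.
    if judge_model is not None:
        judge = {"model": judge_model}
        if criteria is not None:
            judge["criteria"] = criteria
        body["judge"] = judge
    return body


def start_run(api_key, body):
    """POST the body and return the parsed run summary (makes a network call)."""
    req = urllib.request.Request(
        f"{API_BASE}/v1/evals/runs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def summarize_run(summary, pass_threshold=1.0):
    """Reduce a run summary to coverage and the items below a cutoff.

    pass_threshold is an illustrative cutoff chosen here, not an API concept.
    """
    items = summary["items"]
    coverage = summary["scored"] / items if items else 0.0
    failing = [r["itemId"] for r in summary["results"] if r["score"] < pass_threshold]
    return {"avgScore": summary["avgScore"], "coverage": coverage, "failing": failing}
```

Calling summarize_run(start_run(api_key, build_run_request("support-qa", "qwen3-4b", "qwen3-baseline"))) would start a run without a judge and report how many items were scored and which ones fell below the cutoff.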

List and compare runs

Each run (also called a dataset run) is retrievable so you can compare avgScore across models and prompt versions:

curl https://api.hanzo.ai/v1/evals/dataset-runs \
  -H "Authorization: Bearer hk-..."

Every run also emits per-item traces, so a regression in avgScore links straight to the traces and scores that explain it.
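As an illustration of tracing an avgScore regression back to individual items, here is a short Python sketch that compares two run summaries in the shape returned by POST /v1/evals/runs. The compare_runs helper is hypothetical, and it assumes that itemId values are stable across runs of the same dataset and that score is numeric; the response shape of the dataset-runs listing is not shown on this page, so the sketch does not rely on it.

```python
def compare_runs(baseline, candidate):
    """Compare two run summaries of the same dataset.

    Returns the change in avgScore and the items whose score dropped,
    each with the candidate run's traceId for follow-up.
    """
    # Index baseline scores by item so each candidate result can be matched.
    base_scores = {r["itemId"]: r["score"] for r in baseline["results"]}
    regressions = []
    for r in candidate["results"]:
        before = base_scores.get(r["itemId"])
        # Items missing from the baseline are skipped rather than guessed at.
        if before is not None and r["score"] < before:
            regressions.append(
                {
                    "itemId": r["itemId"],
                    "traceId": r["traceId"],
                    "before": before,
                    "after": r["score"],
                }
            )
    delta = round(candidate["avgScore"] - baseline["avgScore"], 4)
    return {"delta": delta, "regressions": regressions}
```

A negative delta together with a non-empty regressions list gives the trace IDs to inspect first.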

Related

Datasets: the items an experiment runs against
Scores: the graded results
Traces: the per-item runs behind each score
