Curate evaluation datasets and items — the input/expected-output pairs your experiments run against.

Datasets

A dataset is a curated collection of items, each an input paired with an expected output, used to evaluate models and agents. Datasets are the fixed yardstick an experiment measures against. They are part of the Hanzo Cloud evaluation surface and are tenant-scoped by org.

Create a dataset

curl -X POST https://api.hanzo.ai/v1/evals/datasets \
  -H "Authorization: Bearer hk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support-qa",
    "description": "Golden answers for common support questions"
  }'

Add items

Each item carries an input and an expectedOutput; metadata is free-form and travels with the item into every run.

curl -X POST https://api.hanzo.ai/v1/evals/dataset-items \
  -H "Authorization: Bearer hk-..." \
  -H "Content-Type: application/json" \
  -d '{
    "datasetName": "support-qa",
    "input": "How do I rotate an API key?",
    "expectedOutput": "Create a new key, deploy it, then revoke the old one.",
    "metadata": { "topic": "api-keys" }
  }'
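
Golden sets usually hold more than one item, so it can help to script the call above. The following is a minimal Python sketch, not an official client: it issues one POST per item against the same route and uses the same body fields as the curl example. The helper names (`build_item_request`, `build_requests`, `send`) and the second sample row are hypothetical, and no batch endpoint is assumed.

```python
import json
import urllib.request

API_BASE = "https://api.hanzo.ai/v1/evals"


def build_item_request(api_key, dataset_name, input_text, expected_output, metadata=None):
    """Build (but do not send) a POST /dataset-items request for one item."""
    payload = {
        "datasetName": dataset_name,
        "input": input_text,
        "expectedOutput": expected_output,
        "metadata": metadata or {},  # free-form; travels with the item
    }
    return urllib.request.Request(
        f"{API_BASE}/dataset-items",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Sample rows: (input, expectedOutput, metadata).
# The first row is the item from the curl example; the second is a placeholder.
ROWS = [
    (
        "How do I rotate an API key?",
        "Create a new key, deploy it, then revoke the old one.",
        {"topic": "api-keys"},
    ),
    ("<another question>", "<its golden answer>", {"topic": "placeholder"}),
]


def build_requests(api_key, dataset_name, rows):
    """Return one prepared request per row, in order."""
    return [
        build_item_request(api_key, dataset_name, question, answer, meta)
        for question, answer, meta in rows
    ]


def send(request):
    """Send a prepared request and return the decoded JSON body.

    The response shape is not documented here, so it is returned as-is.
    """
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Nothing is sent until you call `send(request)` on each prepared request, for example `for r in build_requests(key, "support-qa", ROWS): send(r)`. Keeping the build and send steps separate lets you inspect the payloads before they reach the API.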

Items are versioned by dataset, so a run always evaluates against a known snapshot. List datasets and items with the matching GET routes:

curl https://api.hanzo.ai/v1/evals/datasets \
  -H "Authorization: Bearer hk-..."
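
The same listing can be scripted. The sketch below builds GET requests for both routes in Python. It assumes that the items route is `GET /v1/evals/dataset-items`, mirroring the POST route used earlier, and that items can be filtered with a `datasetName` query parameter. That parameter is a guess, not a confirmed part of the API, so check the API reference before relying on it. The helper name `build_list_request` is hypothetical.

```python
import urllib.parse
import urllib.request

API_BASE = "https://api.hanzo.ai/v1/evals"
LISTABLE = ("datasets", "dataset-items")


def build_list_request(api_key, resource, **filters):
    """Build (but do not send) a GET request for a listable resource.

    `filters` become query parameters; which ones the API accepts
    is an assumption on the caller's side.
    """
    if resource not in LISTABLE:
        raise ValueError(f"resource must be one of {LISTABLE}, got {resource!r}")
    url = f"{API_BASE}/{resource}"
    if filters:
        url += "?" + urllib.parse.urlencode(filters)
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# List all datasets (same call as the curl example).
datasets_request = build_list_request("hk-...", "datasets")

# List items, filtered by an assumed `datasetName` query parameter.
items_request = build_list_request("hk-...", "dataset-items", datasetName="support-qa")
```

Pass either request to `urllib.request.urlopen` to execute it.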

Run against a model

A dataset becomes useful when you run it. Pass the dataset name to an experiment — POST /v1/evals/runs — to score every item against a model, optionally with an LLM judge.
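
As a rough illustration, the request might be prepared like this in Python. Only the route (`POST /v1/evals/runs`) and the idea of passing the dataset name come from this page. The body field names `datasetName`, `model`, and `judge` are assumptions made for the sketch, as are the placeholder model identifiers; see the Experiments page for the actual schema.

```python
import json
import urllib.request

RUNS_URL = "https://api.hanzo.ai/v1/evals/runs"


def build_run_request(api_key, dataset_name, model, judge_model=None):
    """Build (but do not send) a POST /v1/evals/runs request.

    Field names in the payload are assumed, not taken from the API reference.
    """
    payload = {
        "datasetName": dataset_name,  # assumed to match the dataset-items field
        "model": model,               # assumed field: the model under evaluation
    }
    if judge_model is not None:
        # Assumed shape for the optional LLM judge.
        payload["judge"] = {"model": judge_model}
    return urllib.request.Request(
        RUNS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder identifiers; substitute models available to your org.
run_request = build_run_request(
    "hk-...", "support-qa", "<model-id>", judge_model="<judge-model-id>"
)
```

Send the prepared request with `urllib.request.urlopen(run_request)` once the field names have been checked against the Experiments documentation.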

Related

Experiments: dataset runs, comparisons, and analytics
Scores: the per-item results a run produces
Prompts: version the prompt an experiment evaluates
