LLM Gateway

One API for 200+ language models from every major provider. OpenAI-compatible interface with load balancing, caching, rate limiting, fallbacks, and full observability.

curl https://llm.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer $HANZO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5-20250929",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Why Hanzo LLM Gateway?

  • 200+ Models — Claude, GPT, Gemini, Llama, Mistral, Qwen, and more
  • OpenAI Compatible — Drop-in replacement, use any OpenAI SDK
  • Smart Routing — Automatic fallbacks, load balancing, model selection
  • Cost Control — Per-key budgets, rate limits, usage analytics
  • Caching — Semantic and exact caching to reduce costs by up to 90%
  • Observability — Full request logging, latency tracking, token analytics

Supported Providers

Provider      Models                      Features
Anthropic     Claude 4.x, Claude 3.x      Vision, tool use, extended context
OpenAI        GPT-4o, o3, o4-mini         Function calling, vision, DALL-E
Google        Gemini 2.x, PaLM            Multimodal, grounding
Meta          Llama 3.x, Llama 4          Open source, self-hosted
Mistral       Mistral Large, Codestral    European, code generation
Together AI   50+ open models             Fast inference, fine-tuning
Groq          Llama, Mixtral              Fastest inference
Zen LM        Qwen 3+ (600M-480B)         Co-developed frontier models

SDK Usage

Python

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["HANZO_API_KEY"],  # same env var as the curl example
    base_url="https://llm.hanzo.ai/v1"
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)
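
Streaming works through the same client, since /v1/chat/completions supports it (see the endpoint table below). A minimal sketch:

stream = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,  # deliver the response incrementally as chunks
)
for chunk in stream:
    # each chunk carries a delta with the next slice of text (may be empty)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)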

TypeScript

import OpenAI from 'openai'

const client = new OpenAI({
  apiKey: process.env.HANZO_API_KEY,
  baseURL: 'https://llm.hanzo.ai/v1',
})

const completion = await client.chat.completions.create({
  model: 'claude-sonnet-4-5-20250929',
  messages: [{ role: 'user', content: 'Explain quantum computing' }],
})
console.log(completion.choices[0].message.content)

Key Features

Smart Routing

Automatic fallbacks between providers when one is down
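
Fallbacks are defined server-side (see Configuration below). LiteLLM-based proxies, which the configuration syntax here suggests this is, also accept a per-request fallbacks list; treat that field as an assumption and confirm it is enabled on your deployment. A sketch using the Python client from above:

response = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Hello"}],
    # assumed LiteLLM-style passthrough: models to try if the primary fails
    extra_body={"fallbacks": ["gpt-4o"]},
)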

Cost Management

Per-key budgets, rate limits, and usage analytics
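
The key-management endpoints listed under API Endpoints cover this. Below is a sketch of creating a budgeted key and checking its usage; the field names (max_budget, duration, models) follow the LiteLLM-style key schema this gateway appears to use, so verify them against the /key/generate reference:

import requests

BASE = "https://llm.hanzo.ai"
ADMIN_KEY = "sk-..."  # an admin-scoped key; placeholder value

# create a key capped at $25 over 30 days, limited to two model groups
resp = requests.post(
    f"{BASE}/key/generate",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    json={"max_budget": 25.0, "duration": "30d", "models": ["default", "fast"]},
)
new_key = resp.json()["key"]

# inspect spend and remaining budget for that key
info = requests.get(
    f"{BASE}/key/info",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    params={"key": new_key},
)
print(info.json())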

Semantic Caching

Cache semantically similar requests to reduce costs by up to 90%

Guardrails

Content filtering, PII detection, and safety controls

Observability

Full request logging with latency and token analytics

Fine-tuning

Custom model training via Together AI and Zen LM

Configuration

model_list:
  - model_name: "default"
    litellm_params:
      model: "anthropic/claude-sonnet-4-5-20250929"
      api_key: "os.environ/ANTHROPIC_API_KEY"

  - model_name: "default"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"

  - model_name: "fast"
    litellm_params:
      model: "groq/llama-3.1-70b"
      api_key: "os.environ/GROQ_API_KEY"

router_settings:
  routing_strategy: "latency-based-routing"
  fallbacks:
    - default: ["fast"]
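
Two deployments share the model_name "default", so the router load-balances between them (here by measured latency); if a "default" call fails, the fallback rule retries against "fast". Clients select a group by its alias, continuing with the Python client from above:

# "default" resolves to the Claude or GPT-4o deployment per request;
# on failure the router retries against the "fast" Groq deployment
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Summarize this in one line."}],
)
print(response.model)  # typically reports which underlying model served it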

API Endpoints

Endpoint                       Description
POST /v1/chat/completions      Chat completions (streaming supported)
POST /v1/completions           Text completions
POST /v1/embeddings            Text embeddings
POST /v1/images/generations    Image generation
GET /v1/models                 List available models
POST /key/generate             Create API keys with budgets
GET /key/info                  Key usage and budget info
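
All /v1 routes work through the same OpenAI SDK. For example, embeddings; the model name below is illustrative, so list what is actually deployed with GET /v1/models:

# model name is a placeholder - query GET /v1/models for available ones
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What is an LLM gateway?"],
)
print(len(emb.data[0].embedding))  # dimensionality of the returned vector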
