LLM Gateway

One API for 200+ language models from every major provider. OpenAI-compatible interface with load balancing, caching, rate limiting, fallbacks, and full observability.

curl https://llm.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer $HANZO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-sonnet-4-5-20250929",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Why Hanzo LLM Gateway?

  • 200+ Models — Claude, GPT, Gemini, Llama, Mistral, Qwen, and more
  • OpenAI Compatible — Drop-in replacement, use any OpenAI SDK
  • Smart Routing — Automatic fallbacks, load balancing, model selection
  • Cost Control — Per-key budgets, rate limits, usage analytics
  • Caching — Semantic and exact caching to reduce costs by up to 90%
  • Observability — Full request logging, latency tracking, token analytics

Supported Providers

Provider      Models                      Features
Anthropic     Claude 4.x, Claude 3.x      Vision, tool use, extended context
OpenAI        GPT-4o, o3, o4-mini         Function calling, vision, DALL-E
Google        Gemini 2.x, PaLM            Multimodal, grounding
Meta          Llama 3.x, Llama 4          Open source, self-hosted
Mistral       Mistral Large, Codestral    European, code generation
Together AI   50+ open models             Fast inference, fine-tuning
Groq          Llama, Mixtral              Fastest inference
Zen LM        Qwen 3+ (600M-480B)         Co-developed frontier models

SDK Usage

Python

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["HANZO_API_KEY"],  # same env var as the curl example
    base_url="https://llm.hanzo.ai/v1"
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)
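
Streaming works through the same client, since /v1/chat/completions supports it (see the endpoint table below). A minimal sketch:

stream = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,  # deliver the response incrementally as chunks
)
for chunk in stream:
    # each chunk carries a delta with the next slice of text (may be empty)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)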

TypeScript

import OpenAI from 'openai'

const client = new OpenAI({
  apiKey: process.env.HANZO_API_KEY,
  baseURL: 'https://llm.hanzo.ai/v1',
})

const completion = await client.chat.completions.create({
  model: 'claude-sonnet-4-5-20250929',
  messages: [{ role: 'user', content: 'Explain quantum computing' }],
})
console.log(completion.choices[0].message.content)

Key Features

Smart Routing

Automatic fallbacks between providers when one is down
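
Fallbacks are defined server-side (see Configuration below). LiteLLM-based proxies, which the configuration syntax here suggests this is, also accept a per-request fallbacks list; treat that field as an assumption and confirm it is enabled on your deployment. A sketch using the Python client from above:

response = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Hello"}],
    # assumed LiteLLM-style passthrough: models to try if the primary fails
    extra_body={"fallbacks": ["gpt-4o"]},
)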

Cost Management

Per-key budgets, rate limits, and usage analytics
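
The key-management endpoints listed under API Endpoints cover this. Below is a sketch of creating a budgeted key and checking its usage; the field names (max_budget, duration, models) follow the LiteLLM-style key schema this gateway appears to use, so verify them against the /key/generate reference:

import requests

BASE = "https://llm.hanzo.ai"
ADMIN_KEY = "sk-..."  # an admin-scoped key; placeholder value

# create a key capped at $25 over 30 days, limited to two model groups
resp = requests.post(
    f"{BASE}/key/generate",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    json={"max_budget": 25.0, "duration": "30d", "models": ["default", "fast"]},
)
new_key = resp.json()["key"]

# inspect spend and remaining budget for that key
info = requests.get(
    f"{BASE}/key/info",
    headers={"Authorization": f"Bearer {ADMIN_KEY}"},
    params={"key": new_key},
)
print(info.json())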

Semantic Caching

Cache semantically similar requests to reduce costs by up to 90%

Guardrails

Content filtering, PII detection, and safety controls

Observability

Full request logging with latency and token analytics

Fine-tuning

Custom model training via Together AI and Zen LM

Configuration

model_list:
  - model_name: "default"
    litellm_params:
      model: "anthropic/claude-sonnet-4-5-20250929"
      api_key: "os.environ/ANTHROPIC_API_KEY"

  - model_name: "default"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"

  - model_name: "fast"
    litellm_params:
      model: "groq/llama-3.1-70b"
      api_key: "os.environ/GROQ_API_KEY"

router_settings:
  routing_strategy: "latency-based-routing"
  fallbacks:
    - default: ["fast"]
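
Two deployments share the model_name "default", so the router load-balances between them (here by measured latency); if a "default" call fails, the fallback rule retries against "fast". Clients select a group by its alias, continuing with the Python client from above:

# "default" resolves to the Claude or GPT-4o deployment per request;
# on failure the router retries against the "fast" Groq deployment
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Summarize this in one line."}],
)
print(response.model)  # typically reports which underlying model served it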

API Endpoints

Endpoint                       Description
POST /v1/chat/completions      Chat completions (streaming supported)
POST /v1/completions           Text completions
POST /v1/embeddings            Text embeddings
POST /v1/images/generations    Image generation
GET /v1/models                 List available models
POST /key/generate             Create API keys with budgets
GET /key/info                  Key usage and budget info
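
All /v1 routes work through the same OpenAI SDK. For example, embeddings; the model name below is illustrative, so list what is actually deployed with GET /v1/models:

# model name is a placeholder - query GET /v1/models for available ones
emb = client.embeddings.create(
    model="text-embedding-3-small",
    input=["What is an LLM gateway?"],
)
print(len(emb.data[0].embedding))  # dimensionality of the returned vector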
