# LLM Gateway

One API for 200+ language models from every major provider. OpenAI-compatible interface with load balancing, caching, rate limiting, fallbacks, and full observability.
```bash
curl https://llm.hanzo.ai/v1/chat/completions \
-H "Authorization: Bearer $HANZO_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "claude-sonnet-4-5-20250929",
"messages": [{"role": "user", "content": "Hello"}]
}'
```

## Why Hanzo LLM Gateway?
- 200+ Models — Claude, GPT, Gemini, Llama, Mistral, Qwen, and more
- OpenAI Compatible — Drop-in replacement, use any OpenAI SDK
- Smart Routing — Automatic fallbacks, load balancing, model selection
- Cost Control — Per-key budgets, rate limits, usage analytics
- Caching — Semantic and exact caching to reduce costs up to 90%
- Observability — Full request logging, latency tracking, token analytics
## Supported Providers
| Provider | Models | Features |
|---|---|---|
| Anthropic | Claude 4.x, Claude 3.x | Vision, tool use, extended context |
| OpenAI | GPT-4o, o3, o4-mini | Function calling, vision, DALL-E |
| Google | Gemini 2.x, PaLM | Multimodal, grounding |
| Meta | Llama 3.x, Llama 4 | Open source, self-hosted |
| Mistral | Mistral Large, Codestral | European, code generation |
| Together AI | 50+ open models | Fast inference, fine-tuning |
| Groq | Llama, Mixtral | Fastest inference |
| Zen LM | Qwen 3+ (600M-480B) | Co-developed frontier models |
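
Every provider above sits behind the same OpenAI-compatible endpoint, so switching providers is a one-line change to the `model` field. A minimal sketch (the model identifiers here are illustrative; list the exact names your gateway exposes with `GET /v1/models`):

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-hanzo-api-key",
    base_url="https://llm.hanzo.ai/v1",
)

# Same client, different providers: only the model string changes.
for model in ["claude-sonnet-4-5-20250929", "gpt-4o", "groq/llama-3.1-70b"]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hi in one word."}],
    )
    print(model, "->", response.choices[0].message.content)
```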
## SDK Usage

### Python

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-hanzo-api-key",
    base_url="https://llm.hanzo.ai/v1"
)

response = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

print(response.choices[0].message.content)
```
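
For token-by-token output, pass `stream=True`; the gateway relays the upstream provider's server-sent events through the standard OpenAI streaming interface. This sketch reuses the `client` from the snippet above:

```python
# Stream the response instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```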
### TypeScript

```ts
import OpenAI from 'openai'

const client = new OpenAI({
  apiKey: process.env.HANZO_API_KEY,
  baseURL: 'https://llm.hanzo.ai/v1',
})

const completion = await client.chat.completions.create({
  model: 'claude-sonnet-4-5-20250929',
  messages: [{ role: 'user', content: 'Explain quantum computing' }],
})

console.log(completion.choices[0].message.content)
```

## Key Features
- Smart Routing — Automatic fallbacks between providers when one is down
- Cost Management — Per-key budgets, rate limits, and usage analytics
- Semantic Caching — Cache similar requests to reduce costs up to 90% (per-request controls sketched after this list)
- Guardrails — Content filtering, PII detection, and safety controls
- Observability — Full request logging with latency and token analytics
- Fine-tuning — Custom model training via Together AI and Zen LM
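
Caching is configured on the gateway side, but individual requests can opt out. A minimal sketch, assuming the gateway honors LiteLLM-style per-request `cache` controls in the request body (treat the field name as an assumption and confirm against your gateway version):

```python
from openai import OpenAI

client = OpenAI(api_key="your-hanzo-api-key", base_url="https://llm.hanzo.ai/v1")

# The "cache" body field follows LiteLLM's per-request cache controls;
# this is an assumption, so verify it against your deployment.
response = client.chat.completions.create(
    model="claude-sonnet-4-5-20250929",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    extra_body={"cache": {"no-cache": True}},  # bypass the cache for this call
)
```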
## Configuration

```yaml
model_list:
  - model_name: "default"
    litellm_params:
      model: "anthropic/claude-sonnet-4-5-20250929"
      api_key: "os.environ/ANTHROPIC_API_KEY"
  - model_name: "default"
    litellm_params:
      model: "openai/gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"
  - model_name: "fast"
    litellm_params:
      model: "groq/llama-3.1-70b"
      api_key: "os.environ/GROQ_API_KEY"

router_settings:
  routing_strategy: "latency-based-routing"
  fallbacks:
    - default: ["fast"]
```
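
Both entries named `default` form one model group: the router load-balances between the Claude and GPT-4o deployments by latency, and falls back to the `fast` alias if both fail. Clients address aliases, never providers directly; a sketch:

```python
from openai import OpenAI

client = OpenAI(api_key="your-hanzo-api-key", base_url="https://llm.hanzo.ai/v1")

# "default" resolves to two deployments; the router picks one by latency
# and, per the fallbacks config above, retries on "fast" if both error.
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
)
print(response.choices[0].message.content)
```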
## API Endpoints

| Endpoint | Description |
|---|---|
| POST /v1/chat/completions | Chat completions (streaming supported) |
| POST /v1/completions | Text completions |
| POST /v1/embeddings | Text embeddings |
| POST /v1/images/generations | Image generation |
| GET /v1/models | List available models |
| POST /key/generate | Create API keys with budgets |
| GET /key/info | Key usage and budget info |
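
Key management goes through the `/key/*` endpoints. A hedged sketch of creating a budgeted key, assuming LiteLLM-style parameters (`max_budget`, `duration`) and a hypothetical `HANZO_ADMIN_KEY` variable; confirm the field names against your deployment:

```python
import os
import requests

# Create a scoped key with a spend budget. Parameter names follow
# LiteLLM's key-management API (max_budget in USD, duration like "30d");
# treat them as assumptions and verify against your gateway version.
resp = requests.post(
    "https://llm.hanzo.ai/key/generate",
    headers={"Authorization": f"Bearer {os.environ['HANZO_ADMIN_KEY']}"},
    json={"models": ["default", "fast"], "max_budget": 25.0, "duration": "30d"},
)
print(resp.json()["key"])
```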