# Hanzo Engine

High-performance LLM inference engine: blazing-fast Rust-based serving with Metal/CUDA acceleration, quantization, vision, audio, and MCP tools.
Hanzo Engine is a high-performance, multimodal inference engine written in Rust. It serves LLMs, vision models, speech models, image generation models, and embedding models through an OpenAI-compatible HTTP API, with optional MCP (Model Context Protocol) server support.
Key design goals: zero-config model loading, hardware-aware quantization, and production-grade throughput via PagedAttention and continuous batching.
## Features
- Hardware Acceleration: Native Metal (Apple Silicon) and CUDA (NVIDIA) backends, FlashAttention V2/V3, multi-GPU tensor parallelism
- Quantization: GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB, AFQ (Metal-optimized), and In-Situ Quantization (ISQ) of any HuggingFace model
- PagedAttention: High-throughput continuous batching on CUDA and Metal with prefix caching and KV cache quantization
- Multimodal: Text, vision, audio/speech, image generation, and embeddings in a single binary
- MCP Integration: Built-in MCP server exposing model tools via JSON-RPC, plus MCP client for connecting to external tool servers
- OpenAI-Compatible API: Drop-in replacement for OpenAI endpoints: `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`
- Agentic: Integrated tool calling with Python/Rust callbacks, web search, and structured output (JSON schema, regex, grammar)
- LoRA / X-LoRA: Adapter model support with runtime weight merging
- AnyMoE: Create mixture-of-experts on any base model
- Per-Layer Topology: Fine-tune quantization bit depth per layer for optimal quality/speed tradeoffs
- Web UI: Built-in chat interface at `/ui` when serving
## Architecture
```text
┌───────────────────────────────────────────────────────────────────┐
│                           Hanzo Engine                            │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  Clients                                                          │
│  ───────                                                          │
│  OpenAI SDK  ─┐                                                   │
│  curl / HTTP ─┤──▶ ┌──────────────────────────────────┐           │
│  Python SDK  ─┤    │        HTTP Server (axum)        │           │
│  Rust SDK    ─┤    │              :36900              │           │
│  MCP Client  ─┘──▶ │  /v1/chat/completions            │           │
│                    │  /v1/completions                 │           │
│                    │  /v1/embeddings                  │           │
│                    │  /v1/models                      │           │
│                    │  /health                         │           │
│                    │  /docs (SwaggerUI)               │           │
│                    └─────────┬────────────────────────┘           │
│                              │                                    │
│                              ▼                                    │
│                    ┌──────────────────────────────────┐           │
│                    │         Pipeline Engine          │           │
│                    │        ──────────────────        │           │
│                    │  • Continuous batching           │           │
│                    │  • PagedAttention + prefix cache │           │
│                    │  • ISQ / GGUF / GPTQ / AWQ       │           │
│                    │  • Device mapping (multi-GPU)    │           │
│                    │  • Chat template detection       │           │
│                    └─────────┬────────────────────────┘           │
│                              │                                    │
│         ┌────────────────────┼────────────────┐                   │
│         ▼                    ▼                ▼                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐         │
│  │ Metal (GPU)  │  │ CUDA (GPU)   │  │ CPU (fallback)   │         │
│  │ Apple Si     │  │ FlashAttn    │  │ AVX2 / MKL       │         │
│  │ AFQ quants   │  │ Marlin kern  │  │ GGUF / HQQ       │         │
│  └──────────────┘  └──────────────┘  └──────────────────┘         │
│                                                                   │
│  ┌──────────────────────────────────────────────────────┐         │
│  │                  MCP Server (:4321)                  │         │
│  │         Exposes model tools via JSON-RPC 2.0         │         │
│  └──────────────────────────────────────────────────────┘         │
│                                                                   │
└───────────────────────────────────────────────────────────────────┘
```

## Supported Models
### Text Models
| Model | Architectures | Notes |
|---|---|---|
| Llama | Llama 2, 3, 3.1, 3.2 | GGUF supported |
| Mistral | Mistral 7B, Mixtral 8x7B/8x22B | MoE support |
| Qwen | Qwen 2, 3, 3 Next, 3 MoE | Thinking mode |
| Gemma | Gemma, Gemma 2 | |
| Phi | Phi 2, 3, 3.5 MoE | |
| DeepSeek | V2, V3 / R1 | MoE with MoQE |
| GLM | GLM 4, 4.7 Flash, 4.7 MoE | |
| Starcoder | Starcoder 2 | Code models |
| SmolLM | SmolLM 3 | Small models |
| Granite | Granite 4.0 | |
| GPT-OSS | GPT-OSS (Harmony format) | reasoning_effort support |
### Vision Models
| Model | Input Types |
|---|---|
| Qwen 3-VL, 3-VL MoE | Image + Text |
| Qwen 2-VL, 2.5-VL | Image + Text |
| Gemma 3, 3n | Image + Text |
| Llama 4, 3.2 Vision | Image + Text |
| Mistral 3 | Image + Text |
| Phi 4 Multimodal, Phi 3V | Image + Text |
| MiniCPM-O | Image + Text |
| Idefics 2, 3 | Image + Text |
| LLaVA, LLaVA Next | Image + Text |
### Speech, Image Generation, and Embedding Models
| Category | Models |
|---|---|
| Speech (ASR) | Voxtral, Dia |
| Image Generation | FLUX |
| Embeddings | Embedding Gemma, Qwen 3 Embedding |
## Quick Start
### Install
Build from source with the appropriate backend:
```bash
# macOS (Metal)
cargo build --package hanzo-engine --release --no-default-features --features metal

# Linux (CUDA)
cargo build --package hanzo-engine --release --features cuda

# Linux (CUDA + FlashAttention)
cargo build --package hanzo-engine --release --features "cuda flash-attn cudnn"

# Install binary
cargo install --path hanzo-engine --no-default-features --features metal
```

### Run a Model
```bash
# Interactive chat with auto-detected architecture
hanzo-engine run -m Qwen/Qwen3-4B

# Start HTTP server with web UI
hanzo-engine serve --ui -m google/gemma-3-4b-it

# Start server with ISQ 4-bit quantization
hanzo-engine serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4

# Run a GGUF quantized model
hanzo-engine run --format gguf -f ./model.Q4_K_M.gguf

# Auto-tune for your hardware
hanzo-engine tune -m Qwen/Qwen3-4B --emit-config config.toml
hanzo-engine from-config -f config.toml

# System diagnostics
hanzo-engine doctor
```

The server listens on port 36900 by default. Visit http://localhost:36900/ui for the built-in chat interface, or http://localhost:36900/docs for interactive Swagger API docs.
## CLI Commands
| Command | Purpose |
|---|---|
| run | Interactive chat mode |
| serve | Start HTTP + optional MCP server |
| from-config | Run from a TOML config file |
| quantize | Generate UQFF quantized model |
| tune | Auto-benchmark and recommend settings |
| doctor | System diagnostics (CUDA, Metal, HF connectivity) |
| login | Authenticate with HuggingFace Hub |
| cache | Manage downloaded model cache |
| bench | Run performance benchmarks |
## API Reference
Hanzo Engine exposes an OpenAI-compatible REST API. Any client that speaks the OpenAI API works out of the box.
### Chat Completions
```bash
curl http://localhost:36900/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer EMPTY" \
  -d '{
    "model": "default",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain PagedAttention in one paragraph."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
```

### Using the Python OpenAI SDK
```python
import openai

client = openai.OpenAI(
    base_url="http://localhost:36900/v1",
    api_key="EMPTY",
)

completion = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a haiku about Rust."},
    ],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```

### Streaming
Set `"stream": true` in the request body. The server returns Server-Sent Events (SSE) compatible with the OpenAI streaming format.
```python
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Count to 10 slowly."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### Text Completions
```bash
curl http://localhost:36900/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "prompt": "The Rust programming language is",
    "max_tokens": 128
  }'
```

### Embeddings
Serve an embedding model to enable this endpoint:

```bash
hanzo-engine serve -m google/embeddinggemma-300m
```

```bash
curl http://localhost:36900/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "input": "Hello, world!"
  }'
```

### Endpoints Summary
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | Chat completions (streaming supported) |
| POST | /v1/completions | Text completions |
| POST | /v1/embeddings | Vector embeddings |
| GET | /v1/models | List loaded models |
| GET | /health | Server health check |
| GET | /docs | Interactive Swagger UI |
| GET | /ui | Built-in chat web UI |
### Extended Parameters
Beyond the standard OpenAI API, Hanzo Engine supports additional request parameters:
| Parameter | Type | Description |
|---|---|---|
| top_k | int | Top-K sampling |
| min_p | float | Min-P sampling threshold |
| grammar | object | Constrained generation (regex, JSON schema, Lark grammar, llguidance) |
| enable_thinking | bool | Enable chain-of-thought for supported models |
| truncate_sequence | bool | Truncate over-length prompts instead of rejecting |
| repetition_penalty | float | Multiplicative penalty for repeated tokens |
| web_search_options | object | Enable web search integration |
| reasoning_effort | string | Reasoning depth for Harmony-format models: low, medium, high |
| dry_multiplier | float | DRY anti-repetition penalty strength |
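With the Python OpenAI SDK, these non-standard fields can be sent alongside a standard request via `extra_body`. Below is a sketch of such a payload; the nested layout of the `grammar` object (`"type"`/`"value"`) is an illustrative assumption, not a confirmed schema:

```python
import json

# Extra Hanzo Engine parameters ride alongside the standard OpenAI fields.
# The grammar object's exact shape is illustrative only.
schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}
extra_body = {
    "top_k": 40,
    "min_p": 0.05,
    "repetition_penalty": 1.1,
    "grammar": {"type": "json_schema", "value": json.dumps(schema)},
}

# With a running server, pass it through the SDK:
# client.chat.completions.create(model="default", messages=[...], extra_body=extra_body)
print(sorted(extra_body))  # → ['grammar', 'min_p', 'repetition_penalty', 'top_k']
```

The SDK serializes `extra_body` keys into the request JSON unchanged, so no client-side changes are needed beyond this dictionary.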
## Quantization Guide
Hanzo Engine supports a wide range of quantization methods for reducing model size and increasing throughput.
### In-Situ Quantization (ISQ)
ISQ quantizes model weights during loading, so the full unquantized model never needs to fit in memory. Just pass `--isq <bits>`:
```bash
# 4-bit quantization (auto-selects best method for your hardware)
hanzo-engine serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4

# 8-bit quantization
hanzo-engine serve -m Qwen/Qwen3-4B --isq 8
```

On Metal, ISQ uses AFQ (affine quantization) optimized for Apple Silicon. On CUDA, it uses Q/K quantization for best throughput.
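The memory savings are straightforward to estimate, since weight storage scales roughly linearly with bit width. A back-of-envelope sketch (ignoring per-block quantization scales, activations, and KV cache):

```python
def approx_weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory, ignoring per-block scales and zero points."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# A 3B-parameter model, fp16 vs. 4-bit ISQ:
print(f"{approx_weight_gib(3, 16):.1f} GiB")  # fp16  → 5.6 GiB
print(f"{approx_weight_gib(3, 4):.1f} GiB")   # 4-bit → 1.4 GiB
```

Real on-disk and in-memory sizes run slightly higher because quantized formats store scale metadata per block.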
### Quantization Methods
| Method | Bit Widths | Devices | Notes |
|---|---|---|---|
| ISQ (auto) | 2, 3, 4, 5, 6, 8 | All | Hardware-aware auto-selection |
| GGUF | 2-8 bit (Q/K types) | CPU, CUDA, Metal | Most portable format |
| GPTQ | 2, 3, 4, 8 | CUDA only | Marlin kernel for 4/8-bit |
| AWQ | 4, 8 | CUDA only | Marlin kernel for 4/8-bit |
| HQQ | 4, 8 | All | Half-quadratic quantization |
| FP8 | 8 | All | E4M3 floating point |
| BNB | 4, 8 | CUDA | bitsandbytes int8, fp4, nf4 |
| AFQ | 2, 3, 4, 6, 8 | Metal only | Fastest on Apple Silicon |
| MLX | Pre-quantized | Metal | Apple MLX format |
### Running GGUF Models

```bash
# Direct GGUF file
hanzo-engine run --format gguf -f ./Qwen3-4B-Q4_K_M.gguf

# Auto-detect GPTQ model from HuggingFace
hanzo-engine run -m kaitchup/Phi-3-mini-4k-instruct-gptq-4bit
```

### Per-Layer Topology
Fine-tune quantization per layer using a topology file. This lets you keep attention layers at higher precision while aggressively quantizing feed-forward layers:
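As a sketch, a topology file might look like the following; the section and key names here are illustrative assumptions, not the engine's confirmed schema, so check the project's topology documentation for the real format:

```toml
# Hypothetical per-layer topology -- key names are illustrative only.
[[layer]]
range = "0-7"    # early (attention-sensitive) layers kept at higher precision
isq = "Q6K"

[[layer]]
range = "8-31"   # remaining layers quantized more aggressively
isq = "Q3K"
```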
```bash
hanzo-engine serve -m Qwen/Qwen3-4B --topology topology.toml
```

## PagedAttention
PagedAttention accelerates inference and enables high-throughput continuous batching:
- CUDA: Enabled by default; disable with `--no-paged-attn`
- Metal: Opt-in with `--paged-attn`
- Prefix caching: Reuses KV cache blocks across requests sharing common prefixes (system prompts)
- KV cache quantization: FP8 (E4M3) reduces KV cache memory by ~50%
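The ~50% figure follows directly from the cache layout: one key and one value vector per token per layer, so bytes-per-element is the only factor FP8 changes. A sketch with illustrative Llama-style dimensions (32 layers, 8 grouped-query KV heads, head dim 128; these numbers are assumptions for the example, not a specific model's config):

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 tokens: int, bytes_per_elem: int) -> float:
    """KV cache bytes: 2 (K and V) x layers x kv_heads x head_dim x tokens x width."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem / 2**30

# 32k-token context, fp16 vs. fp8 (E4M3) cache entries:
print(kv_cache_gib(32, 8, 128, 32_768, 2))  # fp16 → 4.0 GiB
print(kv_cache_gib(32, 8, 128, 32_768, 1))  # fp8  → 2.0 GiB
```

Halving the cache either frees GPU memory for batching or doubles the context length the same memory budget can hold.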
```bash
# Custom KV cache memory allocation
hanzo-engine serve -m Qwen/Qwen3-4B --paged-attn --paged-attn-gpu-memory 8GiB

# KV cache quantization for longer contexts
hanzo-engine serve -m Qwen/Qwen3-4B --paged-attn --paged-cache-type f8e4m3
```

## MCP Integration
Hanzo Engine can act as both an MCP server (exposing model tools) and an MCP client (connecting to external tool servers).
### MCP Server
Expose model capabilities as MCP tools alongside the HTTP API:
```bash
hanzo-engine serve \
  -m Qwen/Qwen3-4B \
  --port 36900 \
  --mcp-port 4321
```

The MCP server exposes tools based on loaded model modalities (text, vision, etc.) via JSON-RPC 2.0.
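MCP traffic uses standard JSON-RPC 2.0 framing, so a client discovering the available tools sends a `tools/list` request like the one below. The envelope is standard MCP; which tools actually come back depends on the loaded model's modalities:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/list",
  "params": {}
}
```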
### MCP Client
Connect to external MCP tool servers so the model can call external tools during inference:
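The config file lists the external servers to connect to. A hypothetical `mcp-tools.json` is sketched below; the field names are illustrative assumptions, so consult the engine's MCP client documentation for the real schema:

```json
{
  "servers": [
    {
      "name": "filesystem",
      "transport": "stdio",
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]
    }
  ]
}
```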
```bash
hanzo-engine serve \
  -m Qwen/Qwen3-4B \
  --mcp-client-config mcp-tools.json
```

## Docker
```bash
# CUDA
docker run --gpus all -p 36900:36900 ghcr.io/hanzoai/engine:latest \
  serve -m Qwen/Qwen3-4B --port 36900

# With ISQ quantization
docker run --gpus all -p 36900:36900 ghcr.io/hanzoai/engine:latest \
  serve -m meta-llama/Llama-3.2-3B-Instruct --isq 4 --port 36900
```

## Configuration
### Environment Variables
| Variable | Description |
|---|---|
| HANZO_ENGINE_PORT | Server port (default: 36900) |
| KEEP_ALIVE_INTERVAL | SSE keep-alive interval in ms |
| OTEL_EXPORTER_OTLP_ENDPOINT | OpenTelemetry export endpoint |
| OTEL_SERVICE_NAME | Service name for traces |
| HF_TOKEN | HuggingFace Hub authentication token |
### TOML Configuration
For reproducible deployments, use a TOML config:
```bash
# Generate optimal config for your hardware
hanzo-engine tune -m Qwen/Qwen3-4B --emit-config config.toml

# Run from config
hanzo-engine from-config -f config.toml
```

## Performance Tips
- Use ISQ: `--isq 4` gives the best balance of quality and speed on most hardware
- Enable PagedAttention: Required for high-throughput batched serving
- Run `tune`: `hanzo-engine tune` benchmarks your hardware and emits optimal settings
- KV cache quantization: `--paged-cache-type f8e4m3` halves KV cache memory for longer contexts
- Multi-GPU: Use device mapping for models that exceed single-GPU memory
- FlashAttention: Add `--features flash-attn` at build time for CUDA (significant speedup)
## Related Services
- **Hanzo Visor**: VM and container runtime for AI workloads with GPU passthrough, live migration, snapshot/restore, and Kubernetes CRI integration.
- **Hanzo O11y**: Full-stack observability platform with Prometheus metrics, Grafana dashboards, OpenTelemetry distributed tracing, log aggregation, alerting, and SLO management for Hanzo infrastructure and applications.